Refactor warpspeed scan 2/2 #9169

Open
bernhardmgruber wants to merge 8 commits into NVIDIA:main from bernhardmgruber:ref_scan_part

@bernhardmgruber (Contributor) commented May 28, 2026

This is the second part of a few cleanup commits that should improve the readability of the warpspeed scan implementation.

This PR causes no SASS changes for the warpspeed scan benchmark on SM120.

@copy-pr-bot (Bot, Contributor) commented May 28, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL May 28, 2026
No SASS changes (×7)
@bernhardmgruber bernhardmgruber marked this pull request as ready for review May 29, 2026 10:40
@bernhardmgruber bernhardmgruber requested a review from a team as a code owner May 29, 2026 10:40
@bernhardmgruber bernhardmgruber requested a review from fbusato May 29, 2026 10:40
@cccl-authenticator-app cccl-authenticator-app Bot moved this from In Progress to In Review in CCCL May 29, 2026
@coderabbitai (Bot, Contributor) commented May 29, 2026

📝 Walkthrough

Summary by CodeRabbit

  • Refactor
    • Refactored warpspeed scan kernel logic from a lookback to a lookahead formulation to enhance performance and code maintainability.
    • Updated shared-memory resource management and aggregate handling within the scan kernel.
    • Improved kernel parameter qualifications for better optimization.
    • Updated public API function signatures for squad dispatch operations.

Walkthrough

This PR refactors the CUB warpspeed scan kernel from a lookback to a lookahead staging formulation. Core tile-loading primitives and incremental scan logic are renamed and updated, squad dispatch gains std::array support, the tuning policy renames staging parameters and squad helpers, and the main kernel resources and closure are substantially refactored with aggregate-centric naming throughout shared memory, registers, and helper functions.

Changes

Warpspeed Scan Lookahead Refactoring

| Layer / File(s) | Summary |
| --- | --- |
| Lookahead tile-loading primitives (cub/cub/detail/warpspeed/look_ahead.cuh) | storeTileAggregate parameter renamed from sum to aggr; warpLoadLookback renamed to warpLoadLookahead with tile index renamed idxTileLookahead; warpIncrementalLookback replaced by warpIncrementalLookahead with accumulators renamed aggrExclusiveCtaPrev/CtaCur and local reduction variable renamed local_aggr. |
| Squad dispatch interface modernization (cub/cub/detail/warpspeed/squad/squad.cuh) | Forward declaration added for ::cuda::std::array primary template; new squadDispatch overload accepts std::array<SquadDesc, numSquads> and forwards to underlying squadDispatch<numSquads> via squads.__elems_. |
| Warpspeed tuning and squad scheduling (cub/cub/device/dispatch/tuning/tuning_scan.cuh) | scan_warpspeed_policy field renamed lookback_stages → lookahead_stages (default 2); squad_lookback helper renamed to squad_lookahead; equality operator and stream output updated; resource setup phases replaced squad_lookback calls with squad_lookahead. |
| Kernel parameter qualification (cub/cub/device/dispatch/kernels/kernel_scan.cuh) | DeviceScanKernel scan_op parameter qualified with _CCCL_GRID_CONSTANT const. |
| Kernel resources and scan closure refactoring (cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh) | Removed get_warpspeed_policy() helper; ScanResources refactored with InOutT → in_out_t, SumThreadAndWarpT → thread_and_warp_aggr_t, and SMEM members smemSumExclusiveCta → smemAggrExclusiveCta, smemSumThreadAndWarp → smemThreadAndWarpAggr; allocResources updated with new stage counting and member initialization; added fillWithIdentity helper and rewrote threadScanPartial for conditional identity handling; warpspeed_scan_closure refactored with aggregate-centric type aliases and SMEM phase bindings; reduce_tile rewritten to use regThreadAggr, regWarpAggr, regSquadAggr registers and updated SMEM load/reduce flow; scan_and_store_tile refactored to use fillWithIdentity prefilling, renamed aggregate exclusive computation, and regAggrInclusive for output stores; dispatch_squad updated with renamed aggregate state variables and phase bindings; device_scan_warpspeed_body refactored to use closure's scanSquads std::array and updated dispatch semantics. |

Possibly related PRs

  • NVIDIA/cccl#9128: Both PRs modify warpspeed scan tuning in cub/cub/device/dispatch/tuning/tuning_scan.cuh with overlapping changes to stage-count policy fields and squad phase scheduling.

Suggested reviewers

  • NaderAlAwar
  • shwina
  • miscco

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai (Bot, Contributor) left a comment

Actionable comments posted: 3


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0c53ebc2-a325-4e9a-9636-b8b113c476ca

📥 Commits

Reviewing files that changed from the base of the PR and between 64a42f1 and d61b895.

📒 Files selected for processing (5)
  • cub/cub/detail/warpspeed/look_ahead.cuh
  • cub/cub/detail/warpspeed/squad/squad.cuh
  • cub/cub/device/dispatch/kernels/kernel_scan.cuh
  • cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh
  • cub/cub/device/dispatch/tuning/tuning_scan.cuh

Comment on lines +20 to +23
_CCCL_BEGIN_NAMESPACE_CUDA_STD
template <class _Tp, size_t _Size>
struct _CCCL_TYPE_VISIBILITY_DEFAULT array;
_CCCL_END_NAMESPACE_CUDA_STD

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="cub/cub/detail/warpspeed/squad/squad.cuh"
echo "== File =="
ls -la "$FILE"

echo "== Includes =="
sed -n '1,120p' "$FILE" | nl -ba

echo "== Lines 15-70 =="
sed -n '15,70p' "$FILE" | nl -ba

echo "== Lines 130-220 =="
sed -n '130,220p' "$FILE" | nl -ba

echo "== Search for __elems_ usage =="
rg -n "__elems_" "$FILE" || true

echo "== Search for cuda/std/array include =="
rg -n "<cuda/std/array>|cuda/std/array" "$FILE" || true

echo "== Search for forward decl of array =="
rg -n "struct _CCCL_TYPE_VISIBILITY_DEFAULT array" "$FILE" || true

Repository: NVIDIA/cccl

Length of output: 232


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="cub/cub/detail/warpspeed/squad/squad.cuh"

echo "== Includes (first 120 lines) =="
sed -n '1,120p' "$FILE"

echo "== Lines 15-80 =="
sed -n '15,80p' "$FILE"

echo "== Lines 130-220 =="
sed -n '130,220p' "$FILE"

echo "== Search for __elems_ usage =="
rg -n "__elems_" "$FILE" || true

echo "== Search for cuda/std/array include =="
rg -n "<cuda/std/array>|cuda/std/array" "$FILE" || true

echo "== Search for forward decl of array =="
rg -n "struct _CCCL_TYPE_VISIBILITY_DEFAULT array" "$FILE" || true

Repository: NVIDIA/cccl

Length of output: 6833


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate the squad_desc header and inspect its includes.
FILE="cub/cub/detail/warpspeed/squad/squad_desc.cuh"
ls -la "$FILE"
echo "== squad_desc.cuh includes (first 120 lines) =="
sed -n '1,120p' "$FILE"

echo "== Search for cuda/std/array in squad_desc.cuh and dependencies =="
rg -n "cuda/std/array" "$FILE" || true

# Also search for __elems_ usage in the squad directory to see layout assumptions.
rg -n "__elems_" cub/cub/detail/warpspeed/squad || true

Repository: NVIDIA/cccl

Length of output: 2618


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "<cuda/std/array>|cuda/std/array" cub/cub/detail/warpspeed/squad cub/cub/config.cuh cub/cub/detail -S || true

# Inspect cub/config.cuh top includes to see whether it already pulls in cuda/std/array.
CONFIG="cub/cub/config.cuh"
ls -la "$CONFIG"
sed -n '1,200p' "$CONFIG"

Repository: NVIDIA/cccl

Length of output: 1221


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the headers included by squad.cuh
FILE="cub/cub/detail/warpspeed/special_registers.cuh"
ls -la "$FILE"
sed -n '1,200p' "$FILE"

echo "== Search for cuda/std/array in transitive headers =="
rg -n "<cuda/std/array>|cuda/std/array" \
  cub/cub/detail/warpspeed/special_registers.cuh \
  cub/cub/detail/warpspeed/squad/squad_desc.cuh \
  cub/cub/config.cuh || true

echo "== Search for __elems_ across cub/ =="
rg -n "__elems_" cub | head -n 50 || true

echo "== Search for the forward declaration of array =="
rg -n "struct _CCCL_TYPE_VISIBILITY_DEFAULT array" cub | head -n 50 || true

Repository: NVIDIA/cccl

Length of output: 1879


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Count occurrences of __elems_ =="
rg -n "__elems_" cub | wc -l

echo "== Occurrences of __elems_ (top 50) =="
rg -n "__elems_" cub | head -n 50 || true

echo "== Occurrences of forward decl: struct _CCCL_TYPE_VISIBILITY_DEFAULT array =="
rg -n "struct _CCCL_TYPE_VISIBILITY_DEFAULT array" cub | head -n 50 || true

echo "== Files that include <cuda/std/array> =="
rg -n "<cuda/std/array>" cub | cut -d: -f1 | sort -u | head -n 50 || true

Repository: NVIDIA/cccl

Length of output: 1643


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="cub/cub/detail/warpspeed/squad/squad.cuh"

echo "== squad.cuh forward decl lines (18-30) =="
sed -n '18,35p' "$FILE"

echo "== squad.cuh array overload lines (140-175) =="
sed -n '140,175p' "$FILE"

Repository: NVIDIA/cccl

Length of output: 1312


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Search for any cuda/std/array include within warpspeed subtree (likely transitive deps for squad.cuh)
rg -n "<cuda/std/array>|cuda/std/array" cub/cub/detail/warpspeed || true

# Inspect make_warp_uniform.cuh (pulled in by special_registers.cuh)
FILE="cub/cub/detail/warpspeed/make_warp_uniform.cuh"
ls -la "$FILE"
sed -n '1,200p' "$FILE" | head -n 200

Repository: NVIDIA/cccl

Length of output: 1407


critical: cub/cub/detail/warpspeed/squad/squad.cuh forward-declares ::cuda::std::array but doesn’t include <cuda/std/array>, while the squadDispatch overload takes ::cuda::std::array<SquadDesc, numSquads> by value and accesses squads.__elems_. This couples the header to libcudacxx’s private __elems_ layout and makes compilation dependent on include order. Include <cuda/std/array> and use a public accessor/interface instead of __elems_.
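
The suggestion above can be sketched on the host, with `std::array` standing in for `::cuda::std::array`. The names `SquadDesc`, `dispatchSpan`, and `squadDispatch` below are simplified, hypothetical stand-ins and not the actual CUB signatures. The sketch only shows an overload that includes the array header and forwards through the public `data()` accessor rather than a private member such as `__elems_`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the real SquadDesc; only a thread count is modeled.
struct SquadDesc
{
  int threadCount;
};

// Hypothetical underlying dispatch that works on a pointer plus a
// compile-time count. It sums the thread counts so forwarding is observable.
template <std::size_t numSquads>
int dispatchSpan(const SquadDesc* squads)
{
  assert(squads != nullptr);
  int total = 0;
  for (std::size_t i = 0; i < numSquads; ++i)
  {
    total += squads[i].threadCount;
  }
  return total;
}

// Overload taking the array type. It relies only on the public data()
// accessor, so it does not depend on the array's private member layout.
template <std::size_t numSquads>
int squadDispatch(const std::array<SquadDesc, numSquads>& squads)
{
  return dispatchSpan<numSquads>(squads.data());
}
```

With three squads of 32, 64, and 128 threads, `squadDispatch` in this sketch returns their sum, which confirms that all elements reach the underlying function through the public interface.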

Comment on lines 464 to +469
  // Fill the registers with the scan identity, if there is one, before acquiring/waiting on any resources
- AccumT regSumInclusive[elemPerThread];
- fillWithIdentity<IsLastTile, ScanOpT>(regSumInclusive, valid_items_this_thread);
+ AccumT regAggrInclusive[elemPerThread];
+ if constexpr (IsLastTile)
+ {
+   fillWithIdentity<ScanOpT>(regAggrInclusive, valid_items_this_thread);
+ }

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

critical: the last-tile identity prefill is overwritten before it is used. fillWithIdentity runs before squadLoadSmem, but the full-tile load then replaces the invalid tail slots with overcopied input bytes. For operators with an identity, threadScanPartial takes the full-scan path here, so the tail of the last tile is scanned with garbage instead of identity values.

Also applies to: 623-632
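
The ordering hazard described in this comment can be illustrated with a small host-side model. Everything in it is an assumption made for illustration: `kItems`, `loadFullTile`, `fillTailWithIdentity`, and the two scan functions are hypothetical and do not mirror the kernel's real helpers. The operator is integer addition, with identity 0. The sketch does not claim that reordering is the fix the PR should adopt; it only shows why a prefill that precedes an overcopying load has no effect.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kItems = 4; // hypothetical number of items per thread

// Models a full-tile load that overcopies: every slot is overwritten,
// including the slots past the number of valid items.
inline void loadFullTile(std::array<int, kItems>& regs, const std::array<int, kItems>& smem)
{
  regs = smem;
}

// Writes the identity (0 for addition) into the invalid tail slots only.
inline void fillTailWithIdentity(std::array<int, kItems>& regs, std::size_t validItems)
{
  assert(validItems <= kItems);
  for (std::size_t i = validItems; i < kItems; ++i)
  {
    regs[i] = 0;
  }
}

// Unconditional inclusive sum over all slots; returns the tile aggregate.
inline int inclusiveScanFull(std::array<int, kItems>& regs)
{
  int running = 0;
  for (int& r : regs)
  {
    running += r;
    r = running;
  }
  return running;
}

// Prefill first, then load: the load clobbers the identity values.
inline int scanPrefillThenLoad(const std::array<int, kItems>& smem, std::size_t validItems)
{
  std::array<int, kItems> regs{};
  fillTailWithIdentity(regs, validItems);
  loadFullTile(regs, smem);
  return inclusiveScanFull(regs);
}

// Load first, then restore the identity in the tail before scanning.
inline int scanLoadThenFill(const std::array<int, kItems>& smem, std::size_t validItems)
{
  std::array<int, kItems> regs{};
  loadFullTile(regs, smem);
  fillTailWithIdentity(regs, validItems);
  return inclusiveScanFull(regs);
}
```

For a tile holding `{1, 2, 99, 77}` with two valid items, the prefill-then-load order folds the stale tail into the aggregate, while the load-then-fill order yields the sum of the valid items only.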

Comment thread cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh
@github-actions (Contributor) commented

😬 CI Workflow Results

🟥 Finished in 3h 02m: Pass: 97%/285 | Total: 11d 10h | Max: 2h 33m | Hits: 16%/983793

See results here.


#include <cuda/__ptx/instructions/elect_sync.h>

_CCCL_BEGIN_NAMESPACE_CUDA_STD
Contributor Author


I am also getting an error here:

error: identifier "_CCCL_PROLOGUE_INCLUDED" is undefined
