Refactor warpspeed scan 2/2 #9169
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
No SASS changes
No SASS changes
No SASS changes
No SASS changes
No SASS changes
No SASS changes
No SASS changes
No SASS changes
8fedbdf to
d61b895
Compare
📝 WalkthroughSummary by CodeRabbit
WalkthroughThis PR refactors the CUB warpspeed scan kernel from a lookback to a lookahead staging formulation. Core tile-loading primitives and incremental scan logic are renamed and updated, squad dispatch gains std::array support, the tuning policy renames staging parameters and squad helpers, and the main kernel resources and closure are substantially refactored with aggregate-centric naming throughout shared memory, registers, and helper functions. ChangesWarpspeed Scan Lookahead Refactoring
Possibly related PRs
Suggested reviewers
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0c53ebc2-a325-4e9a-9636-b8b113c476ca
📒 Files selected for processing (5)
cub/cub/detail/warpspeed/look_ahead.cuhcub/cub/detail/warpspeed/squad/squad.cuhcub/cub/device/dispatch/kernels/kernel_scan.cuhcub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuhcub/cub/device/dispatch/tuning/tuning_scan.cuh
| _CCCL_BEGIN_NAMESPACE_CUDA_STD | ||
| template <class _Tp, size_t _Size> | ||
| struct _CCCL_TYPE_VISIBILITY_DEFAULT array; | ||
| _CCCL_END_NAMESPACE_CUDA_STD |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="cub/cub/detail/warpspeed/squad/squad.cuh"
echo "== File =="
ls -la "$FILE"
echo "== Includes =="
sed -n '1,120p' "$FILE" | nl -ba
echo "== Lines 15-70 =="
sed -n '15,70p' "$FILE" | nl -ba
echo "== Lines 130-220 =="
sed -n '130,220p' "$FILE" | nl -ba
echo "== Search for __elems_ usage =="
rg -n "__elems_" "$FILE" || true
echo "== Search for cuda/std/array include =="
rg -n "<cuda/std/array>|cuda/std/array" "$FILE" || true
echo "== Search for forward decl of array =="
rg -n "struct _CCCL_TYPE_VISIBILITY_DEFAULT array" "$FILE" || trueRepository: NVIDIA/cccl
Length of output: 232
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="cub/cub/detail/warpspeed/squad/squad.cuh"
echo "== Includes (first 120 lines) =="
sed -n '1,120p' "$FILE"
echo "== Lines 15-80 =="
sed -n '15,80p' "$FILE"
echo "== Lines 130-220 =="
sed -n '130,220p' "$FILE"
echo "== Search for __elems_ usage =="
rg -n "__elems_" "$FILE" || true
echo "== Search for cuda/std/array include =="
rg -n "<cuda/std/array>|cuda/std/array" "$FILE" || true
echo "== Search for forward decl of array =="
rg -n "struct _CCCL_TYPE_VISIBILITY_DEFAULT array" "$FILE" || trueRepository: NVIDIA/cccl
Length of output: 6833
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate the squad_desc header and inspect its includes.
FILE="cub/cub/detail/warpspeed/squad/squad_desc.cuh"
ls -la "$FILE"
echo "== squad_desc.cuh includes (first 120 lines) =="
sed -n '1,120p' "$FILE"
echo "== Search for cuda/std/array in squad_desc.cuh and dependencies =="
rg -n "cuda/std/array" "$FILE" || true
# Also search for __elems_ usage in the squad directory to see layout assumptions.
rg -n "__elems_" cub/cub/detail/warpspeed/squad || trueRepository: NVIDIA/cccl
Length of output: 2618
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "<cuda/std/array>|cuda/std/array" cub/cub/detail/warpspeed/squad cub/cub/config.cuh cub/cub/detail -S || true
# Inspect cub/config.cuh top includes to see whether it already pulls in cuda/std/array.
CONFIG="cub/cub/config.cuh"
ls -la "$CONFIG"
sed -n '1,200p' "$CONFIG"Repository: NVIDIA/cccl
Length of output: 1221
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the headers included by squad.cuh
FILE="cub/cub/detail/warpspeed/special_registers.cuh"
ls -la "$FILE"
sed -n '1,200p' "$FILE"
echo "== Search for cuda/std/array in transitive headers =="
rg -n "<cuda/std/array>|cuda/std/array" \
cub/cub/detail/warpspeed/special_registers.cuh \
cub/cub/detail/warpspeed/squad/squad_desc.cuh \
cub/cub/config.cuh || true
echo "== Search for __elems_ across cub/ =="
rg -n "__elems_" cub | head -n 50 || true
echo "== Search for the forward declaration of array =="
rg -n "struct _CCCL_TYPE_VISIBILITY_DEFAULT array" cub | head -n 50 || trueRepository: NVIDIA/cccl
Length of output: 1879
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Count occurrences of __elems_ =="
rg -n "__elems_" cub | wc -l
echo "== Occurrences of __elems_ (top 50) =="
rg -n "__elems_" cub | head -n 50 || true
echo "== Occurrences of forward decl: struct _CCCL_TYPE_VISIBILITY_DEFAULT array =="
rg -n "struct _CCCL_TYPE_VISIBILITY_DEFAULT array" cub | head -n 50 || true
echo "== Files that include <cuda/std/array> =="
rg -n "<cuda/std/array>" cub | cut -d: -f1 | sort -u | head -n 50 || trueRepository: NVIDIA/cccl
Length of output: 1643
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="cub/cub/detail/warpspeed/squad/squad.cuh"
echo "== squad.cuh forward decl lines (18-30) =="
sed -n '18,35p' "$FILE"
echo "== squad.cuh array overload lines (140-175) =="
sed -n '140,175p' "$FILE"Repository: NVIDIA/cccl
Length of output: 1312
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Search for any cuda/std/array include within warpspeed subtree (likely transitive deps for squad.cuh)
rg -n "<cuda/std/array>|cuda/std/array" cub/cub/detail/warpspeed || true
# Inspect make_warp_uniform.cuh (pulled in by special_registers.cuh)
FILE="cub/cub/detail/warpspeed/make_warp_uniform.cuh"
ls -la "$FILE"
sed -n '1,200p' "$FILE" | head -n 200Repository: NVIDIA/cccl
Length of output: 1407
critical: cub/cub/detail/warpspeed/squad/squad.cuh forward-declares ::cuda::std::array but doesn’t include <cuda/std/array>, while the squadDispatch overload takes ::cuda::std::array<SquadDesc, numSquads> by value and accesses squads.__elems_. This couples the header to libcudacxx’s private __elems_ layout and makes compilation dependent on include order; include <cuda/std/array> and use a public accessor/interface instead of __elems_.
| // Fill the registers with the scan identity, if there is one, before acquiring/waiting on any resources | ||
| AccumT regSumInclusive[elemPerThread]; | ||
| fillWithIdentity<IsLastTile, ScanOpT>(regSumInclusive, valid_items_this_thread); | ||
| AccumT regAggrInclusive[elemPerThread]; | ||
| if constexpr (IsLastTile) | ||
| { | ||
| fillWithIdentity<ScanOpT>(regAggrInclusive, valid_items_this_thread); | ||
| } |
There was a problem hiding this comment.
critical: the last-tile identity prefill is overwritten before it is used. fillWithIdentity runs before squadLoadSmem, but the full-tile load then replaces the invalid tail slots with overcopied input bytes. For operators with an identity, threadScanPartial takes the full-scan path here, so the tail of the last tile is scanned with garbage instead of identity values.
Also applies to: 623-632
😬 CI Workflow Results🟥 Finished in 3h 02m: Pass: 97%/285 | Total: 11d 10h | Max: 2h 33m | Hits: 16%/983793See results here. |
|
|
||
| #include <cuda/__ptx/instructions/elect_sync.h> | ||
|
|
||
| _CCCL_BEGIN_NAMESPACE_CUDA_STD |
There was a problem hiding this comment.
I am also getting an error here:
error: identifier "_CCCL_PROLOGUE_INCLUDED" is undefined
This is the second part of a few cleanup commits that should improve the readability of the warpspeed scan implementation.
This PR causes no SASS changes for the warpspeed scan benchmark on SM120.