[cub] Harden histogram benchmarks and add bin-by-bin verification#9144
[cub] Harden histogram benchmarks and add bin-by-bin verification#9144robobryce wants to merge 9 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 Walkthroughimportant: ## Walkthrough Adds GPU-side reference histogram functors, host snapshot/compare utilities, untimed warmups with bin-by-bin verification, per-iteration persisting‑L2 cache resets for timed runs, host-side jittered level generation, updated benchmark axes, and tests for perturbation and thread-local cache scenarios. important: ## Changes Histogram benchmark verification and L2 cache reset flow
important: suggestion: Suggested labels
important: suggestion: Suggested reviewers
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (9)
cub/benchmarks/bench/histogram/histogram_common.cuh (5)
115-156: 💤 Low valuesuggestion: Data members in
bench_ref_even_oplack the postfix underscore required by guidelines (e.g.,d_input_,num_bins_, etc.). Also, line 119'sd_histshould be plural (d_hists_) since it's an array. Line 136 usesstd::size_t—should be::cuda::std::size_tper type qualification rules.These are benchmark internals, so lower priority, but worth noting for consistency.
158-191: ⚡ Quick winsuggestion: Same member naming issues as the EVEN functor. Additionally, line 183 uses
cub::UpperBound—free function calls should be fully qualified as::cub::UpperBound. This matches the guideline that "all calls to free functions must be fully qualified starting from the global namespace."- const int idx = cub::UpperBound(d_levels[c], num_levels, s); + const int idx = ::cub::UpperBound(d_levels[c], num_levels, s);
193-240: 💤 Low valuesuggestion: Line 213 uses
std::size_t; prefer::cuda::std::size_tfor consistency with the codebase type qualification rules.- for (std::size_t b = 0; b < opt.size(); ++b) + for (::cuda::std::size_t b = 0; b < opt.size(); ++b)
242-253: ⚡ Quick winsuggestion:
bench_snapshot_histogramsreturns a non-void value; add[[nodiscard]]per guidelines. Also consider addinginlinesince it's not constexpr, though templates get implicit inline linkage.template <typename CounterT> +[[nodiscard]] std::vector<thrust::host_vector<CounterT>> bench_snapshot_histograms(const std::vector<thrust::device_vector<CounterT>>& d_hists)
297-334: 💤 Low valuesuggestion: Line 311 assumes all channels have identical level counts by reading only
d_levels_per_channel[0].size(). If channels could differ, this would silently produce incorrect validation. If same-size is an invariant, a debug assert or comment documenting this assumption would help future maintainers.cub/test/catch2_test_device_histogram.cu (1)
209-209: ⚡ Quick winsuggestion: Mark
perturbation_stepconst since it's computed once and never modified.- const auto perturbation_step = min_bin_width / 4; + const auto perturbation_step = min_bin_width / 4;As per coding guidelines, all variables that are not modified must use const qualifier.
cub/test/catch2_test_device_histogram_thread_local_cache.cu (2)
33-34: ⚡ Quick winsuggestion: Mark
upperconstexpr andstepconst for const correctness.- constexpr double upper = 1024.0; - const double step = upper / static_cast<double>(num_bins); + constexpr double upper = 1024.0; + const double step = upper / static_cast<double>(num_bins);As per coding guidelines, variables that are not modified must use const qualifier, and variables that can be evaluated at compile-time should use constexpr.
79-79: ⚡ Quick winsuggestion: Mark
num_levelsconst since it's not modified after initialization.- const int num_levels = static_cast<int>(d_levels.size()); + const int num_levels = static_cast<int>(d_levels.size());cub/benchmarks/bench/histogram/even.cu (1)
97-128: 💤 Low valuesuggestion:
cudaDeviceSetLimitandcudaCtxResetPersistingL2Cachereturn error codes that are silently ignored. Consider wrapping them inbench_check_cudafor diagnostic visibility when L2 reset fails unexpectedly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 54ee710d-d89a-4b79-a192-ae1b59574dfb
📒 Files selected for processing (7)
cub/benchmarks/bench/histogram/even.cucub/benchmarks/bench/histogram/histogram_common.cuhcub/benchmarks/bench/histogram/multi/even.cucub/benchmarks/bench/histogram/multi/range.cucub/benchmarks/bench/histogram/range.cucub/test/catch2_test_device_histogram.cucub/test/catch2_test_device_histogram_thread_local_cache.cu
Three changes to cub/benchmarks/bench/histogram/{even,range}.cu so the
benchmarks exercise the code paths real users hit:
- range.cu: build levels[] with quadratic spacing (still strictly
monotonic across [lower_level, upper_level]) so DispatchRange stays on
the SearchTransform path. The previous thrust::sequence boundaries
were perfectly uniform, letting any uniform-detection fast path
collapse the bench to DispatchEven performance.
- both: replace the power-of-two Elements{io} axis with non-power-of-two
sizes so tunings that hard-code on round counts (exact tile multiples,
pow2 shortcuts) get measured at sizes where those shortcuts cannot
fire. The total axis cardinality is unchanged.
- both: switch to the manual-timer exec_tag and call
cudaCtxResetPersistingL2Cache() outside the timed window. nvbench's
cold measurement already evicts cached lines between iterations, but
it does not demote persistence-marked addresses set via
cudaStreamSetAttribute / cudaAccessPolicyWindow.
…cache coverage
Benchmarks
- multi/even.cu, multi/range.cu: same hardening as the prior commit's
even.cu/range.cu changes — quadratic-spaced range levels (still
strictly monotonic across [lower_level, upper_level]) so DispatchRange
stays on the SearchTransform path; the manual-timer exec_tag with
cudaCtxResetPersistingL2Cache() outside the timed window; non-power-
of-two Elements{io} so tunings that hard-code on round counts (exact
tile multiples, pow2 shortcuts) are exercised at sizes where those
shortcuts cannot fire. Axis cardinality is unchanged.
- even.cu, range.cu, multi/{even,range}.cu: replace two of the four Bins
values (128 -> 100, 2048 -> 2000) so tunings that hard-code on power-
of-two bin counts cannot use those shortcuts. Cardinality unchanged.
Tests
- catch2_test_device_histogram.cu: setup_bin_levels_for_range now
perturbs interior levels by +/- min_bin_width/4 (alternating sign),
falling back to uniform when the type is too tight (e.g. byte-sample
with 256 levels). The std::upper_bound reference already handled
arbitrary spacings; this just exercises the SearchTransform code path
in addition to the uniform-detection fast path.
- catch2_test_device_histogram_thread_local_cache.cu (new): three
Catch2 cases targeting the thread_local detection_stream / detection_
buf cache in dispatch_range. Sequential calls across multiple user
streams, four-thread concurrent calls on the same device, and a
single-thread cross-device case that skips when fewer than two GPUs
are present.
…eration The default cudaLimitPersistingL2CacheSize is 0, so hardcoding 0 (rather than relying on the default) defends against prior benchmarks in the same nvbench process having bumped the reservation. The cudaCtxResetPersistingL2Cache call already demoted persistence-marked addresses; this extends the defense to the reservation itself.
Quadratic spacing produced bin widths spanning ~2n× (last bin vs first), which is not representative of typical workloads. Jittered uniform spacing (±25% of step, fixed mt19937 seed) keeps consecutive widths within ~3× while still defeating uniform-spacing detection so DispatchRange stays on the SearchTransform path.
Extends the Bins axis to cover the 10k-65k range, which sits between the existing 2000 and 2097152 entries. Applied to range/even and the multi-channel variants so all four histogram benches share the same axis.
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
cub/benchmarks/bench/histogram/even.cu (1)
54-57: ⚡ Quick winsuggestion: the new free-function calls are not globally qualified. Please switch the newly added calls here to
::thrust::fill,::bench_check_cuda,::bench_verify_histogram_even,::cudaDeviceSetLimit, and::cudaCtxResetPersistingL2Cacheto match the repo rule. As per coding guidelines, "All calls to free functions must be fully qualified starting from the global namespace, e.g., ::cuda::ceil_div, including calls to functions in the same namespace."Also applies to: 69-70, 81-86, 95-95, 106-107
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5ea8179f-497f-4fd8-8df1-af60b9401dc1
📒 Files selected for processing (7)
cub/benchmarks/bench/histogram/even.cucub/benchmarks/bench/histogram/histogram_common.cuhcub/benchmarks/bench/histogram/multi/even.cucub/benchmarks/bench/histogram/multi/range.cucub/benchmarks/bench/histogram/range.cucub/test/catch2_test_device_histogram.cucub/test/catch2_test_device_histogram_thread_local_cache.cu
🚧 Files skipped from review as they are similar to previous changes (5)
- cub/benchmarks/bench/histogram/multi/range.cu
- cub/benchmarks/bench/histogram/histogram_common.cuh
- cub/test/catch2_test_device_histogram.cu
- cub/benchmarks/bench/histogram/range.cu
- cub/benchmarks/bench/histogram/multi/even.cu
| bench_check_cuda( | ||
| cub::DeviceHistogram::HistogramEven( | ||
| d_temp_storage, | ||
| temp_storage_bytes, | ||
| d_input, | ||
| d_histogram, | ||
| num_levels, | ||
| lower_level, | ||
| upper_level, | ||
| static_cast<OffsetT>(elements)), | ||
| "warmup HistogramEven temp-size"); | ||
| thrust::device_vector<unsigned char> warmup_tmp(temp_storage_bytes); | ||
| d_temp_storage = thrust::raw_pointer_cast(warmup_tmp.data()); | ||
| bench_check_cuda( | ||
| cub::DeviceHistogram::HistogramEven( | ||
| d_temp_storage, | ||
| temp_storage_bytes, | ||
| d_input, | ||
| d_histogram, | ||
| num_levels, | ||
| lower_level, | ||
| upper_level, | ||
| static_cast<OffsetT>(elements)), | ||
| "warmup HistogramEven"); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="cub/benchmarks/bench/histogram/even.cu"
echo "== File = $FILE =="
wc -l "$FILE"
echo
echo "== Context around warmup verifier (lines 40-110) =="
nl -ba "$FILE" | sed -n '40,110p'
echo
echo "== Context around benchmark timing path (lines 100-150) =="
nl -ba "$FILE" | sed -n '100,170p'
echo
echo "== Search for relevant calls =="
rg -n "HistogramEven\(|cudaDeviceSetLimit|cudaCtxResetPersistingL2Cache|raw_pointer_cast|thrust::fill|bench_check_cuda|bench_verify_histogram_even" "$FILE"Repository: NVIDIA/cccl
Length of output: 265
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="cub/benchmarks/bench/histogram/even.cu"
echo "== File = $FILE =="
wc -l "$FILE"
echo
echo "== Lines 40-110 =="
sed -n '40,110p' "$FILE" | cat -n
echo
echo "== Lines 100-150 =="
sed -n '100,150p' "$FILE" | cat -n
echo
echo "== Search for relevant calls =="
rg -n "HistogramEven\(|cudaDeviceSetLimit|cudaCtxResetPersistingL2Cache|raw_pointer_cast|thrust::fill|bench_check_cuda|bench_verify_histogram_even" "$FILE"Repository: NVIDIA/cccl
Length of output: 6422
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="cub/benchmarks/bench/histogram/even.cu"
echo "== File head =="
sed -n '1,120p' "$FILE" | cat -n
echo
echo "== Search for bench_check_cuda / bench_verify_histogram_even declarations/usages =="
rg -n "bench_check_cuda|bench_verify_histogram_even|namespace|using " "$FILE"
echo
echo "== Show any local includes/aliases near the top =="
rg -n "`#include`|cub::|thrust::|cuda::|_CCCL_TRY_CUDA_API|CUB_BENCH|bench_" "$FILE" | head -n 80Repository: NVIDIA/cccl
Length of output: 7373
important: Warmup validation and cache-state setup must follow the same execution/control path as the timed benchmark.
- Warmup correctness uses the non-
envtemp-storage overload ofcub::DeviceHistogram::HistogramEven(...)(nocub_bench_env/ nocuda::execution::tune(...)), while the timed loop uses theenvoverload and applies tuning when!TUNE_BASE; warmup can therefore “verify” a different dispatch path than what’s benchmarked. cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 0)andcudaCtxResetPersistingL2Cache()are called without checking theircudaError_treturn values, so persisting-L2 demotion failures can silently invalidate the intended cache-state control.- Apply the free-function global-qualification guideline to newly added calls (e.g., use
::thrust::fill,::thrust::raw_pointer_cast,::cudaDeviceSynchronize,::cudaDeviceSetLimit,::cudaCtxResetPersistingL2Cache).
| cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 0); | ||
| cudaCtxResetPersistingL2Cache(); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the target file around the reported lines
FILE="cub/benchmarks/bench/histogram/even.cu"
if [ ! -f "$FILE" ]; then
echo "Missing file: $FILE" >&2
exit 1
fi
echo "=== File: $FILE ==="
echo "--- Lines 80-140 ---"
nl -ba "$FILE" | sed -n '80,140p'
# Check whether bench_check_cuda is available/used near those calls
echo "--- Search for bench_check_cuda usage in file ---"
rg -n "bench_check_cuda" "$FILE" || true
# Check for any macro/wrapper around cudaDeviceSetLimit / cudaCtxResetPersistingL2Cache
echo "--- Search for cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize and cudaCtxResetPersistingL2Cache in file ---"
rg -n "cudaDeviceSetLimit\\(cudaLimitPersistingL2CacheSize,|cudaCtxResetPersistingL2Cache\\(" "$FILE" || trueRepository: NVIDIA/cccl
Length of output: 191
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="cub/benchmarks/bench/histogram/even.cu"
if [ ! -f "$FILE" ]; then
echo "Missing file: $FILE" >&2
exit 1
fi
echo "=== File: $FILE ==="
echo "--- Lines 80-140 (with line numbers) ---"
awk 'NR>=80 && NR<=140 {printf "%d:%s\n", NR, $0}' "$FILE" | sed -n '1,200p'
echo "--- Search for bench_check_cuda usage in file ---"
rg -n "bench_check_cuda" "$FILE" || true
echo "--- Search for cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize and cudaCtxResetPersistingL2Cache in file ---"
rg -n "cudaDeviceSetLimit\\(cudaLimitPersistingL2CacheSize,|cudaCtxResetPersistingL2Cache\\(" "$FILE" || trueRepository: NVIDIA/cccl
Length of output: 3035
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="cub/benchmarks/bench/histogram/even.cu"
awk 'NR>=40 && NR<=90 {printf "%d:%s\n", NR, $0}' "$FILE" | sed -n '1,200p'Repository: NVIDIA/cccl
Length of output: 2036
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "bench_check_cuda\\b" cub/benchmarks/bench/histogram/even.cu
# Find a definition of bench_check_cuda in the benchmarks harness
rg -n "bench_check_cuda\\s*\\(" cub/benchmarks -S
rg -n "auto\\s+bench_check_cuda|void\\s+bench_check_cuda|`#define`\\s+bench_check_cuda|bench_check_cuda\\s*=" cub/benchmarks -S
rg -n "bench_check_cuda" cub/benchmarks -S --files-with-matches | head -n 50Repository: NVIDIA/cccl
Length of output: 2068
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect bench_check_cuda definition
FILE1="cub/benchmarks/bench/histogram/histogram_common.cuh"
echo "=== $FILE1 lines 80-140 ==="
awk 'NR>=80 && NR<=140 {printf "%d:%s\n", NR, $0}' "$FILE1" | sed -n '1,200p'
# Look for persisting L2 reset pattern in other histogram benchmarks
echo "=== Search for persisting-L2 reset calls in cub/benchmarks/bench/histogram ==="
rg -n "cudaDeviceSetLimit\\(cudaLimitPersistingL2CacheSize,\\s*0\\)|cudaCtxResetPersistingL2Cache\\(" cub/benchmarks/bench/histogramRepository: NVIDIA/cccl
Length of output: 3277
🏁 Script executed:
#!/bin/bash
set -euo pipefail
for FILE in \
cub/benchmarks/bench/histogram/range.cu \
cub/benchmarks/bench/histogram/multi/even.cu \
cub/benchmarks/bench/histogram/multi/range.cu; do
echo "=== $FILE lines around persisting-L2 reset ==="
# print a small window around the matching lines (122/etc varies)
awk '
/cudaDeviceSetLimit\\(cudaLimitPersistingL2CacheSize, 0\\)/ {start=NR-3; end=NR+6}
NR>=start && NR<=end {print NR ":" $0}
NR>end {start=0}
' "$FILE" | sed -n '1,120p'
doneRepository: NVIDIA/cccl
Length of output: 309
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILES=(
cub/benchmarks/bench/histogram/range.cu
cub/benchmarks/bench/histogram/multi/even.cu
cub/benchmarks/bench/histogram/multi/range.cu
)
for FILE in "${FILES[@]}"; do
echo "=== $FILE ==="
# Get exact line numbers for both calls
mapfile -t LNS < <(rg -n "cudaDeviceSetLimit\\(cudaLimitPersistingL2CacheSize,\\s*0\\);|cudaCtxResetPersistingL2Cache\\(\\);" "$FILE" | cut -d: -f1 | sort -n)
if [ ${`#LNS`[@]} -eq 0 ]; then
echo " (no matches found)"
continue
fi
for ln in "${LNS[@]}"; do
start=$((ln-6))
end=$((ln+6))
awk -v s="$start" -v e="$end" 'NR>=s && NR<=e {printf "%d:%s\n", NR, $0}' "$FILE" | sed -n '1,40p'
echo " ---"
done
doneRepository: NVIDIA/cccl
Length of output: 179
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n -C 8 "cudaDeviceSetLimit\\(cudaLimitPersistingL2CacheSize,\\s*0\\);|cudaCtxResetPersistingL2Cache\\(\\);" \
cub/benchmarks/bench/histogram/range.cu \
cub/benchmarks/bench/histogram/multi/even.cu \
cub/benchmarks/bench/histogram/multi/range.cuRepository: NVIDIA/cccl
Length of output: 5392
important: Check CUDA error returns for persisting-L2 resets before timer.start() in cub/benchmarks/bench/histogram/even.cu (lines 106-107): cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 0); and cudaCtxResetPersistingL2Cache(); ignore their cudaError_t results—wrap both with bench_check_cuda(...) so a reset failure aborts the benchmark instead of running with stale persisting-L2 state.
| for (int t = 0; t < num_threads; ++t) | ||
| { | ||
| threads.emplace_back([&]() { | ||
| cudaStream_t stream; | ||
| if (cudaStreamCreate(&stream) != cudaSuccess) | ||
| { | ||
| failures.fetch_add(1); | ||
| return; | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="cub/test/catch2_test_device_histogram_thread_local_cache.cu"
echo "== File: $FILE =="
echo "-- lines 130-240 --"
nl -ba "$FILE" | sed -n '130,240p'
echo "-- lines 240-340 --"
nl -ba "$FILE" | sed -n '240,340p'
echo "-- grep for cudaSetDevice in this file --"
rg -n "cudaSetDevice\(" "$FILE" || true
echo "-- grep for cudaGetDevice in this file --"
rg -n "cudaGetDevice\(" "$FILE" || trueRepository: NVIDIA/cccl
Length of output: 208
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="cub/test/catch2_test_device_histogram_thread_local_cache.cu"
echo "== File: $FILE =="
echo "-- lines 140-240 (numbered) --"
python3 - <<'PY'
import itertools
path="cub/test/catch2_test_device_histogram_thread_local_cache.cu"
start=140; end=240
with open(path,'r',encoding='utf-8') as f:
for i,line in enumerate(f, start=1):
if i<start: continue
if i>end: break
print(f"{i:5d}: {line.rstrip()}")
PY
echo "-- lines 240-320 (numbered) --"
python3 - <<'PY'
path="cub/test/catch2_test_device_histogram_thread_local_cache.cu"
start=240; end=320
with open(path,'r',encoding='utf-8') as f:
for i,line in enumerate(f, start=1):
if i<start: continue
if i>end: break
print(f"{i:5d}: {line.rstrip()}")
PY
echo "-- grep for cudaSetDevice in this file --"
rg -n "cudaSetDevice\(" "$FILE" || true
echo "-- grep for cudaGetDevice in this file --"
rg -n "cudaGetDevice\(" "$FILE" || trueRepository: NVIDIA/cccl
Length of output: 5082
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="cub/test/catch2_test_device_histogram_thread_local_cache.cu"
python3 - <<'PY'
path="cub/test/catch2_test_device_histogram_thread_local_cache.cu"
for (a,b) in [(1,140),(300,500)]:
with open(path,'r',encoding='utf-8') as f:
for i,line in enumerate(f, start=1):
if i<a: continue
if i>b: break
print(f"{i:5d}: {line.rstrip()}")
PY
echo "-- grep for cudaSetDevice/cudaGetDevice --"
rg -n "cuda(SetDevice|GetDevice)\(" "$FILE" || true
echo "-- show definitions of run_histogram_range / helpers (if present) --"
rg -n "run_histogram_range|make_levels|reference_histogram" "$FILE"Repository: NVIDIA/cccl
Length of output: 6568
🌐 Web query:
Does CUDA Runtime maintain current device as thread-local state (cudaSetDevice affects only calling thread) when using std::thread?
💡 Result:
Yes. CUDA Runtime maintains the “current device / active CUcontext” as per-calling-host-thread state, so cudaSetDevice() affects only the calling thread (e.g., when you spawn work with std::thread, each thread has its own current-device setting unless you call cudaSetDevice() in that thread). Evidence: - NVIDIA’s CUDA Runtime API documentation states that CUDA Runtime API calls operate on the CUDA Driver CUcontext “current to the calling host thread”, and that cudaSetDevice() “makes [the primary context] current to the calling thread” by calling cuCtxSetCurrent().[1] - NVIDIA’s docs also explain that if no CUcontext is current to the calling thread, the runtime selects/makes the primary context current to that thread when a runtime call that requires an active context is made.[1] - NVIDIA’s guidance for multithreading explicitly says the CUDA runtime API maintains per-thread state about the current device, and recommends calling cudaSetDevice() in new host threads because otherwise they may use the default device.[2] Important nuance: - The primary context for a device is “device execution context” at runtime level, but the “currentness” is per host thread. Also, cudaDeviceReset() deinitializes the primary context for the calling thread’s current device “immediately”, indicating that the thread’s current device/context matters for behavior.[1] So, with std::thread: calling cudaSetDevice(d) in one thread does not implicitly set the current device for other threads; each thread should call cudaSetDevice(d) (or otherwise establish the correct context/device) before making CUDA runtime calls that rely on the active context/device.[1][2]
Citations:
- 1: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DRIVER.html
- 2: https://developer.nvidia.com/blog/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
important: Each std::thread needs to set the CUDA device before calling cub::DeviceHistogram::HistogramRange with thrust::device_vector pointers allocated on the spawning thread’s current device (no cudaSetDevice inside the worker lambda at ~165-218).
- important: Restore the original active device at the end of the cross-device test instead of hard-coding
cudaSetDevice(0)(ends at ~260) to avoid order-dependent device-state leakage.
| int num_devices = 0; | ||
| REQUIRE(cudaSuccess == cudaGetDeviceCount(&num_devices)); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file="cub/test/catch2_test_device_histogram_thread_local_cache.cu"
echo "== file =="
ls -l "$file" || true
echo "== snippet around requested lines =="
sed -n '220,290p' "$file" | nl -ba | sed -n '1,120p'
echo "== find cudaSetDevice occurrences =="
rg -n "cudaSetDevice\\(" "$file" || trueRepository: NVIDIA/cccl
Length of output: 273
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file="cub/test/catch2_test_device_histogram_thread_local_cache.cu"
echo "== find cudaSetDevice occurrences =="
rg -n "cudaSetDevice\\(" "$file" || true
echo "== find cudaGetDevice occurrences =="
rg -n "cudaGetDevice\\(" "$file" || true
echo "== snippet around 240-280 =="
sed -n '235,275p' "$file" | awk '{printf "%6d | %s\n", NR+234, $0}'
echo "== snippet around 250-265 =="
sed -n '245,270p' "$file" | awk '{printf "%6d | %s\n", NR+244, $0}'Repository: NVIDIA/cccl
Length of output: 2195
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file="cub/test/catch2_test_device_histogram_thread_local_cache.cu"
echo "== find cudaGetDevice occurrences =="
rg -n "cudaGetDevice\\(" "$file" || true
echo "== find cudaGetDeviceCount occurrences =="
rg -n "cudaGetDeviceCount\\(" "$file" || true
echo "== snippet around first set device block (235-265) for context =="
sed -n '200,280p' "$file" | cat -n | sed -n '1,120p'
echo "== snippet around device selection loop (240-270) =="
sed -n '235,270p' "$file" | cat -nRepository: NVIDIA/cccl
Length of output: 3557
important: restore the previously active CUDA device at teardown instead of forcing device 0
The test unconditionally calls cudaSetDevice(0) after switching devices; save the original device via cudaGetDevice before the {0, 1, 0} loop and restore that exact value afterward (currently cudaSetDevice(dev) and cudaSetDevice(0) around lines 248-261).
| cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 0); | ||
| cudaCtxResetPersistingL2Cache(); |
There was a problem hiding this comment.
This shouldn't be repeated in every benchmark. This is the kind of thing that should be in nvbench.
There was a problem hiding this comment.
Sure, but it's not valid for every NVBench test, some may not want the reset. I'd be happy to add it upstream.
Without this agents will cheat with L2 persistence on the benchmark.
| // Exercises the thread_local detection-stream / detection-buffer cache used by | ||
| // dispatch_range's uniform-levels detection path: sequential calls on multiple | ||
| // user streams, concurrent calls from multiple threads, and the cross-device | ||
| // hazard where the cache is bound to the device current at first call. |
There was a problem hiding this comment.
These tests feel really ad hoc and not clear these are needed.
There was a problem hiding this comment.
Without these checks, an AI agent optimizing this code will attempt to cache the CUDA streams and other state in a thread local.
| state.add_global_memory_reads<SampleT>(elements); | ||
| state.add_global_memory_writes<CounterT>(num_bins); | ||
|
|
||
| // Warmup + correctness check: run HistogramEven once outside `state.exec`, |
There was a problem hiding this comment.
nvbench already does warmups. Manual warmups should not be added.
There was a problem hiding this comment.
This is not a warmup. It's a correctness check to validate that the histogram is actually correctly executed.
The unit tests only test with a problem size of up to 1024, and doesn't provide sufficient code coverage.
Each cell of the four `cub.bench.histogram.{even,range,multi.even,
multi.range}.base` benchmarks now runs the dispatch once before
NVBench's timed window and compares the produced per-channel histogram
bin-by-bin against an independent reference computed on-device with
`thrust::for_each` + global `atomicAdd`. The warmup also checks the
dispatch return code so a non-`cudaSuccess` return is reported instead
of being silently discarded.
The verifier runs entirely outside `state.exec`, so timed-region
bandwidth is unchanged within measurement noise. Wall-clock per
benchmark cell increases proportionally to the input size of that cell
(the reference loops over every sample once on device).
The verifier is on by default and can be disabled at run time by
setting the environment variable `CUB_BENCH_HISTOGRAM_VERIFY` to one
of: `0`, `false`, `no`, `off` (case-insensitive). Disabling it skips
the warmup dispatch, the reference build, and the bin-by-bin compare.
The verifier catches two bug classes that the existing CTest histogram
suite does not:
- dispatch-time errors (e.g. `cudaErrorInvalidValue` from a temp-
storage size mismatch in the chunked-staging path) that are not
reported by NVBench because the dispatch return code is dropped on
the floor.
- per-bin count corruption that still produces a non-empty histogram
with the right shape but the wrong values (e.g. a partition mask
that drops samples that should have landed in another partition's
write set). These pass any sum-of-counts sanity check.
| // `state.exec`, checking the dispatch return code, then verify each | ||
| // channel's histogram bin-by-bin against an independent reference. | ||
| // Skipped when CUB_BENCH_HISTOGRAM_VERIFY=0|false|no|off. | ||
| if (bench_correctness_checks_enabled()) |
There was a problem hiding this comment.
Question: This is also costly to compile, should this be an if constexpr?
| cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 0); | ||
| cudaCtxResetPersistingL2Cache(); | ||
| timer.start(); |
There was a problem hiding this comment.
We should have a helper function for that to use it in other / all benchmarks
The jittered-uniform level construction in the range benches sets upper_level via get_upper_level, which previously returned num_bins for integer SampleT. That produced step = 1.0, so the ±0.25*step jitter sat in [-0.25, 0.25] and was annihilated by the integer cast in the level loop. The subsequent dedup-by-1 step then forced every collision back onto the next consecutive integer, leaving the level array bit-identical to a perfect uniform stride-1 sequence. A DispatchRange uniform-spacing detection then has nothing to detect against: it sees a perfectly uniform level array on every integer axis row and routes straight to the EVEN classify path - exactly the fast path the range bench is supposed to avoid measuring. Widen upper_level to ~4 * num_bins for integer SampleT so step is at least ~4 and ±step/4 jitter survives integer truncation as ±1, which is enough to break uniformity. Clamp to the type max when 4 * bins overflows SampleT; those axes (e.g. int8_t with bins >= 64) already have step < 1 and the level array is degenerate regardless of jitter.
|
Actionable comments posted: 0 |
…velT Two latent bugs in cub::DeviceHistogram surfaced when widening the bench to use the full SampleT range (lower_level = numeric_limits<SampleT>::min() for signed integers). 1. ScaleTransform stored `m_max`, `m_min`, `m_scale.fraction.range`, and `m_scale.fraction.bins` in `CommonT = common_type<LevelT, SampleT>`, then ComputeBin promoted through the wider `IntArithmeticT` only at the multiply/divide step. For narrow integer CommonT (int8_t, int16_t) the precomputed `range = max - min` overflowed CommonT before the promotion: int8_t with [-128, 127] gave `range = 255` truncated back to int8_t = -1, sign-extended in IntArithmeticT to 0xFFFFFFFF, and ComputeBin's division by that gigantic divisor returned 0 for every sample. The histogram was non-empty but every count landed in bin 0. Fix: introduce FractionStorageT = IntArithmeticT for integer CommonT (CommonT for non-integer types) and store both `range` and `bins` in it. Compute `max - min` through ULevelT = make_unsigned_t<T>: the intermediate cast is required because C++ integer promotion lifts `(uint8_t) - (uint8_t)` to int(127 - 128) = -1, and going directly to FractionStorageT sign-extends that to a huge garbage value. Truncating through ULevelT first lets unsigned modular wrap-around recover the correct difference. 2. The MayOverflow precondition check at the byte-sample EVEN dispatch sites in DispatchEven cast `num_levels - 1` to CommonT before passing it to MayOverflow: `static_cast<int8_t>(128) = -128` for int8_t, sign-extended in IntArithmeticT to 0xFFFFFF80, and the subsequent division `numeric_limits<IntArithmeticT>::max() / 0xFFFFFF80 = 1` reported overflow for any non-trivial range. Fix: pass `num_levels - 1` directly (it's already an `int`) and apply the same unsigned-promotion- safe subtraction in MayOverflow's `(upper - lower)` computation. 3. PassThruTransform::BinSelect computed `bin = static_cast<int>(sample)` for the byte-sample privatized histogram. For signed int8_t samples this preserved the sign, producing negative bin indices in [-128, -1] for half the input range; the kernel's `if (bin >= 0)` check then silently dropped them. Fix: cast through make_unsigned_t<_SampleT> first so int8_t(-128..127) reinterprets as uint8_t(128..255, 0..127). The existing "DeviceHistogram::HistogramEven num_bins exceeds LevelT range" test was asserting `cudaErrorInvalidValue` for inputs that are now correctly handled. Updated to assert success — the bin width can be fractional (smaller than one distinct LevelT value), and the integer ComputeBin path handles that without overflow once the storage-type and cast bugs are fixed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For signed integer SampleT the bench now picks `lower = numeric_limits<SampleT>::min()` instead of `0`. This doubles the testable range — int8_t goes from 128 distinct values [0, 127] to 256 distinct [-128, 127], int16_t from 32768 to 65536 — letting the benchmarks exercise more bin counts before hitting the distinct-level-values cap. With this change and the matching DeviceHistogram fix, int8_t now runs the dense matrix at bins=128 and bins=255 (previously skipped or producing zero-filled histograms). Helpers added in histogram_common.cuh: - get_lower_level<SampleT>() returns numeric_limits::min() for signed integer SampleT and 0 otherwise. - max_representable_bins<SampleT>() returns the count of distinct SampleT values minus 1 (the upper bound on bins + 1 strictly-monotonic levels). For 64-bit and floating-point SampleT it's int64_t::max(), effectively unbounded for the bench's bin axes. The four bench files (`even`, `range`, `multi/even`, `multi/range`) swap their hardcoded `lower_level = 0` for `get_lower_level<SampleT>()` and gate on `num_bins > max_representable_bins<SampleT>()`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Hardens the four
cub.bench.histogram.{even,range,multi.even,multi.range}.baseNVBench benchmarks (undercub/benchmarks/bench/histogram/) against pattern-matching tunings and silent dispatch failures, and adds an opt-out bin-by-bin correctness verifier that runs outside the timed region. Also extends and adds CTest coverage incub/test/.The four benchmarks each call into
cub::DeviceHistogram::HistogramEven/HistogramRange(single-channel) orMultiHistogramEven/MultiHistogramRange(3-active-channel of 4) over a sweep of input sizes, bin counts, sample types, and entropy values, and report a bandwidth in bytes/sec via NVBench'sGlobalMem BWcolumn.Why
When you tune CUB's histogram code paths, NVBench measures elapsed time inside
state.execand reports bandwidth asbytes_processed / elapsed. Two failure modes can quietly inflate the reported bandwidth:cudaSuccesserror code, but the benchmark drops the return value on the floor. The kernel never launched, so the elapsed time is small, so the reported bandwidth is large.cudaSuccessbut writes wrong counts into the output histogram. For example, a partition mask that drops samples that should have landed in another partition's write set produces a non-empty histogram with the right shape but the wrong values. CTest cases that aggregate to a sum-of-counts check pass; the bench reports inflated bandwidth because effectively only a fraction of the input was processed.There is also a third class — uniform-spacing detection fast paths and tunings that key off power-of-two element/bin counts — where the bench measures a code path real users do not exercise, so the reported bandwidth overstates production behavior.
What changed
Six commits, each scoped narrowly
[cub] Harden histogram benchmarks against pattern-matching tunings— single-channeleven.cu/range.cu. Quadratic-spaced range levels (still strictly monotonic across[lower_level, upper_level]) soDispatchRangestays on theSearchTransformpath instead of collapsing to the uniform-detection fast path. Replaces the power-of-twoElements{io}axis with non-power-of-two sizes. Switches to NVBench's manual-timerexec_tagand callscudaCtxResetPersistingL2Cache()outside the timed window, since cold-cache eviction does not demote persistence-marked addresses set viacudaStreamSetAttribute/cudaAccessPolicyWindow.[cub] Extend histogram bench/test hardening to multi-channel and add cache coverage— same three changes formulti/even.cuandmulti/range.cu. Replaces 128/2048 with 100/2000 in theBinsaxis on all four benches so tunings that hard-code on power-of-two bin counts cannot use those shortcuts. Incub/test/catch2_test_device_histogram.cu,setup_bin_levels_for_rangenow perturbs interior levels by ±min_bin_width/4(alternating sign), falling back to uniform when the type is too tight (e.g. byte-sample with 256 levels). Adds a newcub/test/catch2_test_device_histogram_thread_local_cache.cuwith three Catch2 cases targeting thethread_localdetection-stream / detection-buffer cache indispatch_range: sequential calls across multiple user streams, four-thread concurrent calls on the same device, and a single-thread cross-device case that skips when fewer than two GPUs are present.[cub] Force histogram bench persisting-L2 reservation to zero each iteration— addscudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 0)immediately before every timed iteration, so a prior bench in the same NVBench process can't leave the reservation bumped. The default is 0; this hardcodes 0 instead of relying on the default.[cub] Use jittered-uniform histogram bench levels instead of quadratic— quadratic spacing produced bin widths spanning ~2n× (last bin vs first), which is not representative of typical workloads. Jittered uniform spacing (±25% of step, fixedmt19937seed for reproducibility) keeps consecutive widths within ~3× while still defeating the uniform-spacing detection fast path.[cub] Add 16384 and 60000 bin sizes to histogram benchmarks— covers the 10k–65k bin tier between 2000 and 2097152 across all four benches.[cub] Verify histogram benchmark output bin-by-bin against a reference— the verifier described below.How the verifier works
Each benchmark cell builds the strided pixel-major sample buffer and the output histograms exactly as before. Then, before
state.exec:The dispatch (
HistogramEven/HistogramRange/MultiHistogramEven/MultiHistogramRange) is called once with explicit return-code checking. A non-cudaSuccessreturn throwsstd::runtime_errorand the cell is reported as failed; NVBench prints the message in the per-cell log and emits no bandwidth row for that cell.The produced histogram(s) are compared bin-by-bin against an independent reference. The reference is computed on the same device, but with
thrust::for_each+ a globalatomicAddper pixel per active channel — the EVEN reference uses the closed-form bin index(sample - lo) * num_bins / (hi - lo)(in double); the RANGE reference usescub::UpperBoundon the per-channel level array. Neither reference shares any kernel withcub::DeviceHistogram, so a bug that produces a wrong-but-plausible histogram from the optimized path is caught.The compare reports both the total count (
opt_totalvsref_total) and the first mismatched bin index with both values, so a failure log identifies the channel and bin to investigate.The warmup, the reference build, and the comparison all run outside
state.exec, so the timed region only contains the dispatch under test.Disabling the verifier at run time
The verifier is on by default. To skip it (e.g. during a tuning sweep where only relative throughput matters and the verifier's wall-clock cost is undesirable), set
CUB_BENCH_HISTOGRAM_VERIFYto0,false,no, oroff(case-insensitive) in the environment when invoking the benchmark binary. The flag is read once per process via a function-local static, so toggling has no per-cell overhead.Cost
The verifier's cost on the timed region (NVBench's reported
histogrammetric — geometric mean across all four binaries'GlobalMem BW (bytes/sec)rows, inGiB/s) is zero within measurement noise. Wall-clock benchmark runtime grows by ~14% (median ~7 s on a ~52 s baseline run) because the on-device reference loops over every sample once per cell.Hardware: NVIDIA B200, CUDA driver 580.126.09, CUDA toolkit 12.8.93, host Linux 6.8 with 30 CPUs / 180 GiB RAM. Five interleaved runs of
cub/benchmarks/bench/histogram/run_histogram_benchmark.pyeach (interleaved to factor out NVBench-iteration-count run-to-run variance):histogram(GiB/s)CUB_BENCH_HISTOGRAM_VERIFY=0)The timed-region metric is unchanged across all three modes within ±0.05% (well inside the bench's natural noise of ±0.05–0.10%). The wall-clock overhead from running the verifier is ~7 s; with the verifier disabled, wall-clock is statistically indistinguishable from baseline.
Validation
Built and ran the full sweep on the B200 host. With the verifier on the unmodified branch, all 160 benchmark cells (40 cells × 4 binaries) pass; the reported
histogrammetric is unchanged within noise.To confirm the verifier triggers when the optimized output is actually wrong, I rebased a known-bad change onto this branch — a CUB optimization attempt that introduced a partition-mask bug that silently dropped most input samples on multi-channel
Bins=60000configurations. With the verifier in place, that change is rejected: 44 cells across the four binaries fail, with messages likeThe message reports both the total-count mismatch (
1.2Mvs268M— only ~0.45% of samples landed in the optimized histogram) and the first mismatched bin's per-bin counts. Without the verifier (CUB_BENCH_HISTOGRAM_VERIFY=0), the same change reportshistogram=893.25GiB/s — a ~4.2× bandwidth speedup that is entirely an artifact of the dropped samples — exactly the failure mode this PR is intended to prevent.Test plan
ci/util/build_and_test_targets.sh --preset cub-cpp20 --build-targets "cub.bench.histogram.even.base cub.bench.histogram.range.base cub.bench.histogram.multi.even.base cub.bench.histogram.multi.range.base"builds clean.ctest -R 'cub\.test\.device\.histogram(\.lid_0|_api\.lid_0|_env\.lid_0|_thread_local_cache\.lid_0)$'passes (including the newcub.test.device.histogram_thread_local_cache.lid_0).histogrammetric is unchanged within noise.CUB_BENCH_HISTOGRAM_VERIFY=0skips the verifier; reportedhistogrammetric is unchanged within noise; wall-clock is statistically indistinguishable from baseline.histogramvalue when the verifier is off (demonstrating the failure mode the verifier exists to prevent).