Skip to content

[cub] Harden histogram benchmarks and add bin-by-bin verification#9144

Open
robobryce wants to merge 9 commits into
NVIDIA:mainfrom
robobryce:main
Open

[cub] Harden histogram benchmarks and add bin-by-bin verification#9144
robobryce wants to merge 9 commits into
NVIDIA:mainfrom
robobryce:main

Conversation

@robobryce
Copy link
Copy Markdown

@robobryce robobryce commented May 27, 2026

Summary

Hardens the four cub.bench.histogram.{even,range,multi.even,multi.range}.base NVBench benchmarks (under cub/benchmarks/bench/histogram/) against pattern-matching tunings and silent dispatch failures, and adds an opt-out bin-by-bin correctness verifier that runs outside the timed region. Also extends and adds CTest coverage in cub/test/.

The four benchmarks each call into cub::DeviceHistogram::HistogramEven / HistogramRange (single-channel) or MultiHistogramEven / MultiHistogramRange (3-active-channel of 4) over a sweep of input sizes, bin counts, sample types, and entropy values, and report a bandwidth in bytes/sec via NVBench's GlobalMem BW column.

Why

When you tune CUB's histogram code paths, NVBench measures elapsed time inside state.exec and reports bandwidth as bytes_processed / elapsed. Two failure modes can quietly inflate the reported bandwidth:

  1. The dispatch returns a non-cudaSuccess error code, but the benchmark drops the return value on the floor. The kernel never launched, so the elapsed time is small, so the reported bandwidth is large.
  2. The kernel completes with cudaSuccess but writes wrong counts into the output histogram. For example, a partition mask that drops samples that should have landed in another partition's write set produces a non-empty histogram with the right shape but the wrong values. CTest cases that aggregate to a sum-of-counts check pass; the bench reports inflated bandwidth because effectively only a fraction of the input was processed.

There is also a third class — uniform-spacing detection fast paths and tunings that key off power-of-two element/bin counts — where the bench measures a code path real users do not exercise, so the reported bandwidth overstates production behavior.

What changed

Six commits, each scoped narrowly

  1. [cub] Harden histogram benchmarks against pattern-matching tunings — single-channel even.cu / range.cu. Quadratic-spaced range levels (still strictly monotonic across [lower_level, upper_level]) so DispatchRange stays on the SearchTransform path instead of collapsing to the uniform-detection fast path. Replaces the power-of-two Elements{io} axis with non-power-of-two sizes. Switches to NVBench's manual-timer exec_tag and calls cudaCtxResetPersistingL2Cache() outside the timed window, since cold-cache eviction does not demote persistence-marked addresses set via cudaStreamSetAttribute / cudaAccessPolicyWindow.

  2. [cub] Extend histogram bench/test hardening to multi-channel and add cache coverage — same three changes for multi/even.cu and multi/range.cu. Replaces 128/2048 with 100/2000 in the Bins axis on all four benches so tunings that hard-code on power-of-two bin counts cannot use those shortcuts. In cub/test/catch2_test_device_histogram.cu, setup_bin_levels_for_range now perturbs interior levels by ±min_bin_width/4 (alternating sign), falling back to uniform when the type is too tight (e.g. byte-sample with 256 levels). Adds a new cub/test/catch2_test_device_histogram_thread_local_cache.cu with three Catch2 cases targeting the thread_local detection-stream / detection-buffer cache in dispatch_range: sequential calls across multiple user streams, four-thread concurrent calls on the same device, and a single-thread cross-device case that skips when fewer than two GPUs are present.

  3. [cub] Force histogram bench persisting-L2 reservation to zero each iteration — adds cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 0) immediately before every timed iteration, so a prior bench in the same NVBench process can't leave the reservation bumped. The default is 0; this hardcodes 0 instead of relying on the default.

  4. [cub] Use jittered-uniform histogram bench levels instead of quadratic — quadratic spacing produced bin widths spanning ~2n× (last bin vs first), which is not representative of typical workloads. Jittered uniform spacing (±25% of step, fixed mt19937 seed for reproducibility) keeps consecutive widths within ~3× while still defeating the uniform-spacing detection fast path.

  5. [cub] Add 16384 and 60000 bin sizes to histogram benchmarks — covers the 10k–65k bin tier between 2000 and 2097152 across all four benches.

  6. [cub] Verify histogram benchmark output bin-by-bin against a reference — the verifier described below.

How the verifier works

Each benchmark cell builds the strided pixel-major sample buffer and the output histograms exactly as before. Then, before state.exec:

  1. The dispatch (HistogramEven / HistogramRange / MultiHistogramEven / MultiHistogramRange) is called once with explicit return-code checking. A non-cudaSuccess return throws std::runtime_error and the cell is reported as failed; NVBench prints the message in the per-cell log and emits no bandwidth row for that cell.

  2. The produced histogram(s) are compared bin-by-bin against an independent reference. The reference is computed on the same device, but with thrust::for_each + a global atomicAdd per pixel per active channel — the EVEN reference uses the closed-form bin index (sample - lo) * num_bins / (hi - lo) (in double); the RANGE reference uses cub::UpperBound on the per-channel level array. Neither reference shares any kernel with cub::DeviceHistogram, so a bug that produces a wrong-but-plausible histogram from the optimized path is caught.

  3. The compare reports both the total count (opt_total vs ref_total) and the first mismatched bin index with both values, so a failure log identifies the channel and bin to investigate.

The warmup, the reference build, and the comparison all run outside state.exec, so the timed region only contains the dispatch under test.

Disabling the verifier at run time

The verifier is on by default. To skip it (e.g. during a tuning sweep where only relative throughput matters and the verifier's wall-clock cost is undesirable), set CUB_BENCH_HISTOGRAM_VERIFY to 0, false, no, or off (case-insensitive) in the environment when invoking the benchmark binary. The flag is read once per process via a function-local static, so toggling has no per-cell overhead.

Cost

The verifier's cost on the timed region (NVBench's reported histogram metric — geometric mean across all four binaries' GlobalMem BW (bytes/sec) rows, in GiB/s) is zero within measurement noise. Wall-clock benchmark runtime grows by ~14% (median ~7 s on a ~52 s baseline run) because the on-device reference loops over every sample once per cell.

Hardware: NVIDIA B200, CUDA driver 580.126.09, CUDA toolkit 12.8.93, host Linux 6.8 with 30 CPUs / 180 GiB RAM. Five interleaved runs of cub/benchmarks/bench/histogram/run_histogram_benchmark.py each (interleaved to factor out NVBench-iteration-count run-to-run variance):

Mean histogram (GiB/s) Median wall-clock (s)
baseline (no verifier) 210.87 52.0
verifier on (default) 210.88 59.4
verifier off (CUB_BENCH_HISTOGRAM_VERIFY=0) 210.78 52.8

The timed-region metric is unchanged across all three modes within ±0.05% (well inside the bench's natural noise of ±0.05–0.10%). The wall-clock overhead from running the verifier is ~7 s; with the verifier disabled, wall-clock is statistically indistinguishable from baseline.

Validation

Built and ran the full sweep on the B200 host. With the verifier on the unmodified branch, all 160 benchmark cells (40 cells × 4 binaries) pass; the reported histogram metric is unchanged within noise.

To confirm the verifier triggers when the optimized output is actually wrong, I rebased a known-bad change onto this branch — a CUB optimization attempt that introduced a partition-mask bug that silently dropped most input samples on multi-channel Bins=60000 configurations. With the verifier in place, that change is rejected: 44 cells across the four binaries fail, with messages like

Run:  [1/1] base [Device=0 SampleT{ct}=I32 CounterT{ct}=I32 OffsetT{ct}=I32 Elements{io}=268435456 Bins=60000 Entropy=0.201]
Fail: Unexpected error: bench correctness check [multi.range]: channel=0 total opt=1209247 ref=268435456;
      first mismatched bin=0 (opt=763903 ref=169668305)

The message reports both the total-count mismatch (1.2M vs 268M — only ~0.45% of samples landed in the optimized histogram) and the first mismatched bin's per-bin counts. Without the verifier (CUB_BENCH_HISTOGRAM_VERIFY=0), the same change reports histogram=893.25 GiB/s — a ~4.2× bandwidth speedup that is entirely an artifact of the dropped samples — exactly the failure mode this PR is intended to prevent.

Test plan

  • ci/util/build_and_test_targets.sh --preset cub-cpp20 --build-targets "cub.bench.histogram.even.base cub.bench.histogram.range.base cub.bench.histogram.multi.even.base cub.bench.histogram.multi.range.base" builds clean.
  • ctest -R 'cub\.test\.device\.histogram(\.lid_0|_api\.lid_0|_env\.lid_0|_thread_local_cache\.lid_0)$' passes (including the new cub.test.device.histogram_thread_local_cache.lid_0).
  • All 160 benchmark cells (40 cells × 4 binaries) pass the bin-by-bin verifier on the unmodified branch; reported histogram metric is unchanged within noise.
  • CUB_BENCH_HISTOGRAM_VERIFY=0 skips the verifier; reported histogram metric is unchanged within noise; wall-clock is statistically indistinguishable from baseline.
  • Cherry-picking a known-bad change onto this branch causes 44 cells to fail with explicit bin-by-bin mismatch messages when the verifier is on, and "passes" with an inflated histogram value when the verifier is off (demonstrating the failure mode the verifier exists to prevent).

@robobryce robobryce requested review from a team as code owners May 27, 2026 21:04
@robobryce robobryce requested a review from shwina May 27, 2026 21:04
@robobryce robobryce requested a review from elstehle May 27, 2026 21:04
@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 27, 2026
@copy-pr-bot
Copy link
Copy Markdown
Contributor

copy-pr-bot Bot commented May 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 27, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 27, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

important: ## Walkthrough

Adds GPU-side reference histogram functors, host snapshot/compare utilities, untimed warmups with bin-by-bin verification, per-iteration persisting‑L2 cache resets for timed runs, host-side jittered level generation, updated benchmark axes, and tests for perturbation and thread-local cache scenarios.

important: ## Changes

Histogram benchmark verification and L2 cache reset flow

Layer / File(s) Summary
Histogram verification library
cub/benchmarks/bench/histogram/histogram_common.cuh
bench_correctness_checks_enabled, bench_check_cuda, GPU reference functors (bench_ref_even_op, bench_ref_range_op), snapshot/compare helpers, and verifier entry points bench_verify_histogram_even / bench_verify_histogram_range.
Single-channel even benchmark
cub/benchmarks/bench/histogram/even.cu
Adds untimed warmup that sizes/executes HistogramEven and validates bins against reference; timed loop now resets persisting‑L2 cache per iteration and runs HistogramEven under nvbench::exec_tag::timer. Axes switched to explicit element/bin lists.
Single-channel range benchmark
cub/benchmarks/bench/histogram/range.cu
Builds jittered monotonic bin levels on host (fixed seed), copies to device; untimed warmup + validation via bench_verify_histogram_range; timed loop clears persisting‑L2 cache per iteration and times HistogramRange. Axes updated to explicit lists.
Multi-channel even benchmark
cub/benchmarks/bench/histogram/multi/even.cu
Zeros per-channel histograms, runs untimed warmup (temp-storage sizing + exec), validates per-channel outputs, then times MultiHistogramEven under nvbench after per-iteration persisting‑L2 reset. Axes changed to explicit lists.
Multi-channel range benchmark
cub/benchmarks/bench/histogram/multi/range.cu
Host-generated jittered bin edges (fixed seed) copied to device; untimed warmup with validation and rebind of verified pointers; timed loop resets persisting‑L2 cache per iteration and runs MultiHistogramRange under nvbench timer. Axes switched to explicit lists.
Histogram test enhancements
cub/test/catch2_test_device_histogram.cu, cub/test/catch2_test_device_histogram_thread_local_cache.cu
Perturbs interior range-level edges by bounded offsets and adds a thread-local cache test exercising sequential stream reuse, concurrent per-thread streams, and same-thread device switching for HistogramRange, verifying GPU results against CPU reference.

important: suggestion: Suggested labels

backport branch/3.4.x

important: suggestion: Suggested reviewers

  • shwina
  • bernhardmgruber
  • srinivasyadav18

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (9)
cub/benchmarks/bench/histogram/histogram_common.cuh (5)

115-156: 💤 Low value

suggestion: Data members in bench_ref_even_op lack the postfix underscore required by guidelines (e.g., d_input_, num_bins_, etc.). Also, line 119's d_hist should be plural (d_hists_) since it's an array. Line 136 uses std::size_t—should be ::cuda::std::size_t per type qualification rules.

These are benchmark internals, so lower priority, but worth noting for consistency.


158-191: ⚡ Quick win

suggestion: Same member naming issues as the EVEN functor. Additionally, line 183 uses cub::UpperBound—free function calls should be fully qualified as ::cub::UpperBound. This matches the guideline that "all calls to free functions must be fully qualified starting from the global namespace."

-      const int idx   = cub::UpperBound(d_levels[c], num_levels, s);
+      const int idx   = ::cub::UpperBound(d_levels[c], num_levels, s);

193-240: 💤 Low value

suggestion: Line 213 uses std::size_t; prefer ::cuda::std::size_t for consistency with the codebase type qualification rules.

-    for (std::size_t b = 0; b < opt.size(); ++b)
+    for (::cuda::std::size_t b = 0; b < opt.size(); ++b)

242-253: ⚡ Quick win

suggestion: bench_snapshot_histograms returns a non-void value; add [[nodiscard]] per guidelines. Also consider adding inline since it's not constexpr, though templates get implicit inline linkage.

 template <typename CounterT>
+[[nodiscard]]
 std::vector<thrust::host_vector<CounterT>>
 bench_snapshot_histograms(const std::vector<thrust::device_vector<CounterT>>& d_hists)

297-334: 💤 Low value

suggestion: Line 311 assumes all channels have identical level counts by reading only d_levels_per_channel[0].size(). If channels could differ, this would silently produce incorrect validation. If same-size is an invariant, a debug assert or comment documenting this assumption would help future maintainers.

cub/test/catch2_test_device_histogram.cu (1)

209-209: ⚡ Quick win

suggestion: Mark perturbation_step const since it's computed once and never modified.

-  const auto perturbation_step = min_bin_width / 4;
+  const auto perturbation_step = min_bin_width / 4;

As per coding guidelines, all variables that are not modified must use const qualifier.

cub/test/catch2_test_device_histogram_thread_local_cache.cu (2)

33-34: ⚡ Quick win

suggestion: Mark upper constexpr and step const for const correctness.

-  constexpr double upper = 1024.0;
-  const double step      = upper / static_cast<double>(num_bins);
+  constexpr double upper = 1024.0;
+  const double step      = upper / static_cast<double>(num_bins);

As per coding guidelines, variables that are not modified must use const qualifier, and variables that can be evaluated at compile-time should use constexpr.


79-79: ⚡ Quick win

suggestion: Mark num_levels const since it's not modified after initialization.

-  const int num_levels = static_cast<int>(d_levels.size());
+  const int num_levels = static_cast<int>(d_levels.size());
cub/benchmarks/bench/histogram/even.cu (1)

97-128: 💤 Low value

suggestion: cudaDeviceSetLimit and cudaCtxResetPersistingL2Cache return error codes that are silently ignored. Consider wrapping them in bench_check_cuda for diagnostic visibility when L2 reset fails unexpectedly.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 54ee710d-d89a-4b79-a192-ae1b59574dfb

📥 Commits

Reviewing files that changed from the base of the PR and between ba6b4ea and f97562f.

📒 Files selected for processing (7)
  • cub/benchmarks/bench/histogram/even.cu
  • cub/benchmarks/bench/histogram/histogram_common.cuh
  • cub/benchmarks/bench/histogram/multi/even.cu
  • cub/benchmarks/bench/histogram/multi/range.cu
  • cub/benchmarks/bench/histogram/range.cu
  • cub/test/catch2_test_device_histogram.cu
  • cub/test/catch2_test_device_histogram_thread_local_cache.cu

Comment thread cub/test/catch2_test_device_histogram_thread_local_cache.cu Outdated
Three changes to cub/benchmarks/bench/histogram/{even,range}.cu so the
benchmarks exercise the code paths real users hit:

- range.cu: build levels[] with quadratic spacing (still strictly
  monotonic across [lower_level, upper_level]) so DispatchRange stays on
  the SearchTransform path. The previous thrust::sequence boundaries
  were perfectly uniform, letting any uniform-detection fast path
  collapse the bench to DispatchEven performance.

- both: replace the power-of-two Elements{io} axis with non-power-of-two
  sizes so tunings that hard-code on round counts (exact tile multiples,
  pow2 shortcuts) get measured at sizes where those shortcuts cannot
  fire. The total axis cardinality is unchanged.

- both: switch to the manual-timer exec_tag and call
  cudaCtxResetPersistingL2Cache() outside the timed window. nvbench's
  cold measurement already evicts cached lines between iterations, but
  it does not demote persistence-marked addresses set via
  cudaStreamSetAttribute / cudaAccessPolicyWindow.
…cache coverage

Benchmarks
- multi/even.cu, multi/range.cu: same hardening as the prior commit's
  even.cu/range.cu changes — quadratic-spaced range levels (still
  strictly monotonic across [lower_level, upper_level]) so DispatchRange
  stays on the SearchTransform path; the manual-timer exec_tag with
  cudaCtxResetPersistingL2Cache() outside the timed window; non-power-
  of-two Elements{io} so tunings that hard-code on round counts (exact
  tile multiples, pow2 shortcuts) are exercised at sizes where those
  shortcuts cannot fire. Axis cardinality is unchanged.
- even.cu, range.cu, multi/{even,range}.cu: replace two of the four Bins
  values (128 -> 100, 2048 -> 2000) so tunings that hard-code on power-
  of-two bin counts cannot use those shortcuts. Cardinality unchanged.

Tests
- catch2_test_device_histogram.cu: setup_bin_levels_for_range now
  perturbs interior levels by +/- min_bin_width/4 (alternating sign),
  falling back to uniform when the type is too tight (e.g. byte-sample
  with 256 levels). The std::upper_bound reference already handled
  arbitrary spacings; this just exercises the SearchTransform code path
  in addition to the uniform-detection fast path.
- catch2_test_device_histogram_thread_local_cache.cu (new): three
  Catch2 cases targeting the thread_local detection_stream / detection_
  buf cache in dispatch_range. Sequential calls across multiple user
  streams, four-thread concurrent calls on the same device, and a
  single-thread cross-device case that skips when fewer than two GPUs
  are present.
…eration

The default cudaLimitPersistingL2CacheSize is 0, so hardcoding 0 (rather
than relying on the default) defends against prior benchmarks in the
same nvbench process having bumped the reservation. The
cudaCtxResetPersistingL2Cache call already demoted persistence-marked
addresses; this extends the defense to the reservation itself.
Quadratic spacing produced bin widths spanning ~2n× (last bin vs first),
which is not representative of typical workloads. Jittered uniform
spacing (±25% of step, fixed mt19937 seed) keeps consecutive widths
within ~3× while still defeating uniform-spacing detection so
DispatchRange stays on the SearchTransform path.
Extends the Bins axis to cover the 10k-65k range, which sits between
the existing 2000 and 2097152 entries. Applied to range/even and the
multi-channel variants so all four histogram benches share the same
axis.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
cub/benchmarks/bench/histogram/even.cu (1)

54-57: ⚡ Quick win

suggestion: the new free-function calls are not globally qualified. Please switch the newly added calls here to ::thrust::fill, ::bench_check_cuda, ::bench_verify_histogram_even, ::cudaDeviceSetLimit, and ::cudaCtxResetPersistingL2Cache to match the repo rule. As per coding guidelines, "All calls to free functions must be fully qualified starting from the global namespace, e.g., ::cuda::ceil_div, including calls to functions in the same namespace."

Also applies to: 69-70, 81-86, 95-95, 106-107


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5ea8179f-497f-4fd8-8df1-af60b9401dc1

📥 Commits

Reviewing files that changed from the base of the PR and between f97562f and 1a26924.

📒 Files selected for processing (7)
  • cub/benchmarks/bench/histogram/even.cu
  • cub/benchmarks/bench/histogram/histogram_common.cuh
  • cub/benchmarks/bench/histogram/multi/even.cu
  • cub/benchmarks/bench/histogram/multi/range.cu
  • cub/benchmarks/bench/histogram/range.cu
  • cub/test/catch2_test_device_histogram.cu
  • cub/test/catch2_test_device_histogram_thread_local_cache.cu
🚧 Files skipped from review as they are similar to previous changes (5)
  • cub/benchmarks/bench/histogram/multi/range.cu
  • cub/benchmarks/bench/histogram/histogram_common.cuh
  • cub/test/catch2_test_device_histogram.cu
  • cub/benchmarks/bench/histogram/range.cu
  • cub/benchmarks/bench/histogram/multi/even.cu

Comment on lines +57 to +80
bench_check_cuda(
cub::DeviceHistogram::HistogramEven(
d_temp_storage,
temp_storage_bytes,
d_input,
d_histogram,
num_levels,
lower_level,
upper_level,
static_cast<OffsetT>(elements)),
"warmup HistogramEven temp-size");
thrust::device_vector<unsigned char> warmup_tmp(temp_storage_bytes);
d_temp_storage = thrust::raw_pointer_cast(warmup_tmp.data());
bench_check_cuda(
cub::DeviceHistogram::HistogramEven(
d_temp_storage,
temp_storage_bytes,
d_input,
d_histogram,
num_levels,
lower_level,
upper_level,
static_cast<OffsetT>(elements)),
"warmup HistogramEven");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="cub/benchmarks/bench/histogram/even.cu"

echo "== File = $FILE =="
wc -l "$FILE"

echo
echo "== Context around warmup verifier (lines 40-110) =="
nl -ba "$FILE" | sed -n '40,110p'

echo
echo "== Context around benchmark timing path (lines 100-150) =="
nl -ba "$FILE" | sed -n '100,170p'

echo
echo "== Search for relevant calls =="
rg -n "HistogramEven\(|cudaDeviceSetLimit|cudaCtxResetPersistingL2Cache|raw_pointer_cast|thrust::fill|bench_check_cuda|bench_verify_histogram_even" "$FILE"

Repository: NVIDIA/cccl

Length of output: 265


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="cub/benchmarks/bench/histogram/even.cu"

echo "== File = $FILE =="
wc -l "$FILE"

echo
echo "== Lines 40-110 =="
sed -n '40,110p' "$FILE" | cat -n

echo
echo "== Lines 100-150 =="
sed -n '100,150p' "$FILE" | cat -n

echo
echo "== Search for relevant calls =="
rg -n "HistogramEven\(|cudaDeviceSetLimit|cudaCtxResetPersistingL2Cache|raw_pointer_cast|thrust::fill|bench_check_cuda|bench_verify_histogram_even" "$FILE"

Repository: NVIDIA/cccl

Length of output: 6422


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="cub/benchmarks/bench/histogram/even.cu"

echo "== File head =="
sed -n '1,120p' "$FILE" | cat -n

echo
echo "== Search for bench_check_cuda / bench_verify_histogram_even declarations/usages =="
rg -n "bench_check_cuda|bench_verify_histogram_even|namespace|using " "$FILE"

echo
echo "== Show any local includes/aliases near the top =="
rg -n "`#include`|cub::|thrust::|cuda::|_CCCL_TRY_CUDA_API|CUB_BENCH|bench_" "$FILE" | head -n 80

Repository: NVIDIA/cccl

Length of output: 7373


important: Warmup validation and cache-state setup must follow the same execution/control path as the timed benchmark.

  • Warmup correctness uses the non-env temp-storage overload of cub::DeviceHistogram::HistogramEven(...) (no cub_bench_env / no cuda::execution::tune(...)), while the timed loop uses the env overload and applies tuning when !TUNE_BASE; warmup can therefore “verify” a different dispatch path than what’s benchmarked.
  • cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 0) and cudaCtxResetPersistingL2Cache() are called without checking their cudaError_t return values, so persisting-L2 demotion failures can silently invalidate the intended cache-state control.
  • Apply the free-function global-qualification guideline to newly added calls (e.g., use ::thrust::fill, ::thrust::raw_pointer_cast, ::cudaDeviceSynchronize, ::cudaDeviceSetLimit, ::cudaCtxResetPersistingL2Cache).

Comment on lines +106 to +107
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 0);
cudaCtxResetPersistingL2Cache();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the target file around the reported lines
FILE="cub/benchmarks/bench/histogram/even.cu"
if [ ! -f "$FILE" ]; then
  echo "Missing file: $FILE" >&2
  exit 1
fi

echo "=== File: $FILE ==="
echo "--- Lines 80-140 ---"
nl -ba "$FILE" | sed -n '80,140p'

# Check whether bench_check_cuda is available/used near those calls
echo "--- Search for bench_check_cuda usage in file ---"
rg -n "bench_check_cuda" "$FILE" || true

# Check for any macro/wrapper around cudaDeviceSetLimit / cudaCtxResetPersistingL2Cache
echo "--- Search for cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize and cudaCtxResetPersistingL2Cache in file ---"
rg -n "cudaDeviceSetLimit\\(cudaLimitPersistingL2CacheSize,|cudaCtxResetPersistingL2Cache\\(" "$FILE" || true

Repository: NVIDIA/cccl

Length of output: 191


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="cub/benchmarks/bench/histogram/even.cu"
if [ ! -f "$FILE" ]; then
  echo "Missing file: $FILE" >&2
  exit 1
fi

echo "=== File: $FILE ==="
echo "--- Lines 80-140 (with line numbers) ---"
awk 'NR>=80 && NR<=140 {printf "%d:%s\n", NR, $0}' "$FILE" | sed -n '1,200p'

echo "--- Search for bench_check_cuda usage in file ---"
rg -n "bench_check_cuda" "$FILE" || true

echo "--- Search for cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize and cudaCtxResetPersistingL2Cache in file ---"
rg -n "cudaDeviceSetLimit\\(cudaLimitPersistingL2CacheSize,|cudaCtxResetPersistingL2Cache\\(" "$FILE" || true

Repository: NVIDIA/cccl

Length of output: 3035


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="cub/benchmarks/bench/histogram/even.cu"

awk 'NR>=40 && NR<=90 {printf "%d:%s\n", NR, $0}' "$FILE" | sed -n '1,200p'

Repository: NVIDIA/cccl

Length of output: 2036


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "bench_check_cuda\\b" cub/benchmarks/bench/histogram/even.cu

# Find a definition of bench_check_cuda in the benchmarks harness
rg -n "bench_check_cuda\\s*\\(" cub/benchmarks -S
rg -n "auto\\s+bench_check_cuda|void\\s+bench_check_cuda|`#define`\\s+bench_check_cuda|bench_check_cuda\\s*=" cub/benchmarks -S
rg -n "bench_check_cuda" cub/benchmarks -S --files-with-matches | head -n 50

Repository: NVIDIA/cccl

Length of output: 2068


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect bench_check_cuda definition
FILE1="cub/benchmarks/bench/histogram/histogram_common.cuh"
echo "=== $FILE1 lines 80-140 ==="
awk 'NR>=80 && NR<=140 {printf "%d:%s\n", NR, $0}' "$FILE1" | sed -n '1,200p'

# Look for persisting L2 reset pattern in other histogram benchmarks
echo "=== Search for persisting-L2 reset calls in cub/benchmarks/bench/histogram ==="
rg -n "cudaDeviceSetLimit\\(cudaLimitPersistingL2CacheSize,\\s*0\\)|cudaCtxResetPersistingL2Cache\\(" cub/benchmarks/bench/histogram

Repository: NVIDIA/cccl

Length of output: 3277


🏁 Script executed:

#!/bin/bash
set -euo pipefail

for FILE in \
  cub/benchmarks/bench/histogram/range.cu \
  cub/benchmarks/bench/histogram/multi/even.cu \
  cub/benchmarks/bench/histogram/multi/range.cu; do
  echo "=== $FILE lines around persisting-L2 reset ==="
  # print a small window around the matching lines (122/etc varies)
  awk '
    /cudaDeviceSetLimit\\(cudaLimitPersistingL2CacheSize, 0\\)/ {start=NR-3; end=NR+6}
    NR>=start && NR<=end {print NR ":" $0}
    NR>end {start=0}
  ' "$FILE" | sed -n '1,120p'
done

Repository: NVIDIA/cccl

Length of output: 309


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILES=(
  cub/benchmarks/bench/histogram/range.cu
  cub/benchmarks/bench/histogram/multi/even.cu
  cub/benchmarks/bench/histogram/multi/range.cu
)

for FILE in "${FILES[@]}"; do
  echo "=== $FILE ==="
  # Get exact line numbers for both calls
  mapfile -t LNS < <(rg -n "cudaDeviceSetLimit\\(cudaLimitPersistingL2CacheSize,\\s*0\\);|cudaCtxResetPersistingL2Cache\\(\\);" "$FILE" | cut -d: -f1 | sort -n)
  if [ ${`#LNS`[@]} -eq 0 ]; then
    echo "  (no matches found)"
    continue
  fi

  for ln in "${LNS[@]}"; do
    start=$((ln-6))
    end=$((ln+6))
    awk -v s="$start" -v e="$end" 'NR>=s && NR<=e {printf "%d:%s\n", NR, $0}' "$FILE" | sed -n '1,40p'
    echo "  ---"
  done
done

Repository: NVIDIA/cccl

Length of output: 179


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n -C 8 "cudaDeviceSetLimit\\(cudaLimitPersistingL2CacheSize,\\s*0\\);|cudaCtxResetPersistingL2Cache\\(\\);" \
  cub/benchmarks/bench/histogram/range.cu \
  cub/benchmarks/bench/histogram/multi/even.cu \
  cub/benchmarks/bench/histogram/multi/range.cu

Repository: NVIDIA/cccl

Length of output: 5392


important: Check CUDA error returns for persisting-L2 resets before timer.start() in cub/benchmarks/bench/histogram/even.cu (lines 106-107): cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 0); and cudaCtxResetPersistingL2Cache(); ignore their cudaError_t results—wrap both with bench_check_cuda(...) so a reset failure aborts the benchmark instead of running with stale persisting-L2 state.

Comment on lines +163 to +171
for (int t = 0; t < num_threads; ++t)
{
threads.emplace_back([&]() {
cudaStream_t stream;
if (cudaStreamCreate(&stream) != cudaSuccess)
{
failures.fetch_add(1);
return;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="cub/test/catch2_test_device_histogram_thread_local_cache.cu"
echo "== File: $FILE =="
echo "-- lines 130-240 --"
nl -ba "$FILE" | sed -n '130,240p'
echo "-- lines 240-340 --"
nl -ba "$FILE" | sed -n '240,340p'
echo "-- grep for cudaSetDevice in this file --"
rg -n "cudaSetDevice\(" "$FILE" || true
echo "-- grep for cudaGetDevice in this file --"
rg -n "cudaGetDevice\(" "$FILE" || true

Repository: NVIDIA/cccl

Length of output: 208


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="cub/test/catch2_test_device_histogram_thread_local_cache.cu"
echo "== File: $FILE =="

echo "-- lines 140-240 (numbered) --"
python3 - <<'PY'
import itertools
path="cub/test/catch2_test_device_histogram_thread_local_cache.cu"
start=140; end=240
with open(path,'r',encoding='utf-8') as f:
    for i,line in enumerate(f, start=1):
        if i<start: continue
        if i>end: break
        print(f"{i:5d}: {line.rstrip()}")
PY

echo "-- lines 240-320 (numbered) --"
python3 - <<'PY'
path="cub/test/catch2_test_device_histogram_thread_local_cache.cu"
start=240; end=320
with open(path,'r',encoding='utf-8') as f:
    for i,line in enumerate(f, start=1):
        if i<start: continue
        if i>end: break
        print(f"{i:5d}: {line.rstrip()}")
PY

echo "-- grep for cudaSetDevice in this file --"
rg -n "cudaSetDevice\(" "$FILE" || true

echo "-- grep for cudaGetDevice in this file --"
rg -n "cudaGetDevice\(" "$FILE" || true

Repository: NVIDIA/cccl

Length of output: 5082


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="cub/test/catch2_test_device_histogram_thread_local_cache.cu"
python3 - <<'PY'
path="cub/test/catch2_test_device_histogram_thread_local_cache.cu"
for (a,b) in [(1,140),(300,500)]:
    with open(path,'r',encoding='utf-8') as f:
        for i,line in enumerate(f, start=1):
            if i<a: continue
            if i>b: break
            print(f"{i:5d}: {line.rstrip()}")
PY

echo "-- grep for cudaSetDevice/cudaGetDevice --"
rg -n "cuda(SetDevice|GetDevice)\(" "$FILE" || true

echo "-- show definitions of run_histogram_range / helpers (if present) --"
rg -n "run_histogram_range|make_levels|reference_histogram" "$FILE"

Repository: NVIDIA/cccl

Length of output: 6568


🌐 Web query:

Does CUDA Runtime maintain current device as thread-local state (cudaSetDevice affects only calling thread) when using std::thread?

💡 Result:

Yes. CUDA Runtime maintains the “current device / active CUcontext” as per-calling-host-thread state, so cudaSetDevice() affects only the calling thread (e.g., when you spawn work with std::thread, each thread has its own current-device setting unless you call cudaSetDevice() in that thread). Evidence: - NVIDIA’s CUDA Runtime API documentation states that CUDA Runtime API calls operate on the CUDA Driver CUcontext “current to the calling host thread”, and that cudaSetDevice() “makes [the primary context] current to the calling thread” by calling cuCtxSetCurrent().[1] - NVIDIA’s docs also explain that if no CUcontext is current to the calling thread, the runtime selects/makes the primary context current to that thread when a runtime call that requires an active context is made.[1] - NVIDIA’s guidance for multithreading explicitly says the CUDA runtime API maintains per-thread state about the current device, and recommends calling cudaSetDevice() in new host threads because otherwise they may use the default device.[2] Important nuance: - The primary context for a device is “device execution context” at runtime level, but the “currentness” is per host thread. Also, cudaDeviceReset() deinitializes the primary context for the calling thread’s current device “immediately”, indicating that the thread’s current device/context matters for behavior.[1] So, with std::thread: calling cudaSetDevice(d) in one thread does not implicitly set the current device for other threads; each thread should call cudaSetDevice(d) (or otherwise establish the correct context/device) before making CUDA runtime calls that rely on the active context/device.[1][2]

Citations:


important: Each std::thread needs to set the CUDA device before calling cub::DeviceHistogram::HistogramRange with thrust::device_vector pointers allocated on the spawning thread’s current device (no cudaSetDevice inside the worker lambda at ~165-218).

  • important: Restore the original active device at the end of the cross-device test instead of hard-coding cudaSetDevice(0) (ends at ~260) to avoid order-dependent device-state leakage.

Comment on lines +230 to +231
int num_devices = 0;
REQUIRE(cudaSuccess == cudaGetDeviceCount(&num_devices));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

file="cub/test/catch2_test_device_histogram_thread_local_cache.cu"
echo "== file =="
ls -l "$file" || true

echo "== snippet around requested lines =="
sed -n '220,290p' "$file" | nl -ba | sed -n '1,120p'

echo "== find cudaSetDevice occurrences =="
rg -n "cudaSetDevice\\(" "$file" || true

Repository: NVIDIA/cccl

Length of output: 273


🏁 Script executed:

#!/bin/bash
set -euo pipefail

file="cub/test/catch2_test_device_histogram_thread_local_cache.cu"

echo "== find cudaSetDevice occurrences =="
rg -n "cudaSetDevice\\(" "$file" || true

echo "== find cudaGetDevice occurrences =="
rg -n "cudaGetDevice\\(" "$file" || true

echo "== snippet around 240-280 =="
sed -n '235,275p' "$file" | awk '{printf "%6d | %s\n", NR+234, $0}'

echo "== snippet around 250-265 =="
sed -n '245,270p' "$file" | awk '{printf "%6d | %s\n", NR+244, $0}'

Repository: NVIDIA/cccl

Length of output: 2195


🏁 Script executed:

#!/bin/bash
set -euo pipefail
file="cub/test/catch2_test_device_histogram_thread_local_cache.cu"

echo "== find cudaGetDevice occurrences =="
rg -n "cudaGetDevice\\(" "$file" || true

echo "== find cudaGetDeviceCount occurrences =="
rg -n "cudaGetDeviceCount\\(" "$file" || true

echo "== snippet around first set device block (235-265) for context =="
sed -n '200,280p' "$file" | cat -n | sed -n '1,120p'

echo "== snippet around device selection loop (240-270) =="
sed -n '235,270p' "$file" | cat -n

Repository: NVIDIA/cccl

Length of output: 3557


important: restore the previously active CUDA device at teardown instead of forcing device 0
The test unconditionally calls cudaSetDevice(0) after switching devices; save the original device via cudaGetDevice before the {0, 1, 0} loop and restore that exact value afterward (currently cudaSetDevice(dev) and cudaSetDevice(0) around lines 248-261).

Comment on lines +121 to +122
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 0);
cudaCtxResetPersistingL2Cache();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be repeated in every benchmark. This is the kind of thing that should be in nvbench.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but it's not valid for every NVBench test, some may not want the reset. I'd be happy to add it upstream.

Without this agents will cheat with L2 persistence on the benchmark.

Comment on lines +4 to +7
// Exercises the thread_local detection-stream / detection-buffer cache used by
// dispatch_range's uniform-levels detection path: sequential calls on multiple
// user streams, concurrent calls from multiple threads, and the cross-device
// hazard where the cache is bound to the device current at first call.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests feel really ad hoc and not clear these are needed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without these checks, an AI agent optimizing this code will attempt to cache the CUDA streams and other state in a thread local.

state.add_global_memory_reads<SampleT>(elements);
state.add_global_memory_writes<CounterT>(num_bins);

// Warmup + correctness check: run HistogramEven once outside `state.exec`,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvbench already does warmups. Manual warmups should not be added.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a warmup. It's a correctness check to validate that the histogram is actually correctly executed.

The unit tests only test with a problem size of up to 1024, and doesn't provide sufficient code coverage.

Each cell of the four `cub.bench.histogram.{even,range,multi.even,
multi.range}.base` benchmarks now runs the dispatch once before
NVBench's timed window and compares the produced per-channel histogram
bin-by-bin against an independent reference computed on-device with
`thrust::for_each` + global `atomicAdd`. The warmup also checks the
dispatch return code so a non-`cudaSuccess` return is reported instead
of being silently discarded.

The verifier runs entirely outside `state.exec`, so timed-region
bandwidth is unchanged within measurement noise. Wall-clock per
benchmark cell increases proportionally to the input size of that cell
(the reference loops over every sample once on device).

The verifier is on by default and can be disabled at run time by
setting the environment variable `CUB_BENCH_HISTOGRAM_VERIFY` to one
of: `0`, `false`, `no`, `off` (case-insensitive). Disabling it skips
the warmup dispatch, the reference build, and the bin-by-bin compare.

The verifier catches two bug classes that the existing CTest histogram
suite does not:
  - dispatch-time errors (e.g. `cudaErrorInvalidValue` from a temp-
    storage size mismatch in the chunked-staging path) that are not
    reported by NVBench because the dispatch return code is dropped on
    the floor.
  - per-bin count corruption that still produces a non-empty histogram
    with the right shape but the wrong values (e.g. a partition mask
    that drops samples that should have landed in another partition's
    write set). These pass any sum-of-counts sanity check.
// `state.exec`, checking the dispatch return code, then verify each
// channel's histogram bin-by-bin against an independent reference.
// Skipped when CUB_BENCH_HISTOGRAM_VERIFY=0|false|no|off.
if (bench_correctness_checks_enabled())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: This is also costly to compile, should this be an if constexpr?

Comment on lines +126 to +128
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 0);
cudaCtxResetPersistingL2Cache();
timer.start();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have a helper function for that to use it in other / all benchmarks

The jittered-uniform level construction in the range benches sets
upper_level via get_upper_level, which previously returned num_bins
for integer SampleT. That produced step = 1.0, so the ±0.25*step
jitter sat in [-0.25, 0.25] and was annihilated by the integer cast
in the level loop. The subsequent dedup-by-1 step then forced every
collision back onto the next consecutive integer, leaving the level
array bit-identical to a perfect uniform stride-1 sequence.

A DispatchRange uniform-spacing detection then has nothing to detect
against: it sees a perfectly uniform level array on every integer
axis row and routes straight to the EVEN classify path - exactly the
fast path the range bench is supposed to avoid measuring.

Widen upper_level to ~4 * num_bins for integer SampleT so step is at
least ~4 and ±step/4 jitter survives integer truncation as ±1, which
is enough to break uniformity. Clamp to the type max when 4 * bins
overflows SampleT; those axes (e.g. int8_t with bins >= 64) already
have step < 1 and the level array is degenerate regardless of jitter.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

brycelelbach and others added 2 commits May 28, 2026 20:02
…velT

Two latent bugs in cub::DeviceHistogram surfaced when widening the bench
to use the full SampleT range (lower_level = numeric_limits<SampleT>::min()
for signed integers).

1. ScaleTransform stored `m_max`, `m_min`, `m_scale.fraction.range`, and
   `m_scale.fraction.bins` in `CommonT = common_type<LevelT, SampleT>`,
   then ComputeBin promoted through the wider `IntArithmeticT` only at
   the multiply/divide step. For narrow integer CommonT (int8_t, int16_t)
   the precomputed `range = max - min` overflowed CommonT before the
   promotion: int8_t with [-128, 127] gave `range = 255` truncated back
   to int8_t = -1, sign-extended in IntArithmeticT to 0xFFFFFFFF, and
   ComputeBin's division by that gigantic divisor returned 0 for every
   sample. The histogram was non-empty but every count landed in bin 0.

   Fix: introduce FractionStorageT = IntArithmeticT for integer CommonT
   (CommonT for non-integer types) and store both `range` and `bins` in
   it. Compute `max - min` through ULevelT = make_unsigned_t<T>: the
   intermediate cast is required because C++ integer promotion lifts
   `(uint8_t) - (uint8_t)` to int(127 - 128) = -1, and going directly to
   FractionStorageT sign-extends that to a huge garbage value. Truncating
   through ULevelT first lets unsigned modular wrap-around recover the
   correct difference.

2. The MayOverflow precondition check at the byte-sample EVEN dispatch
   sites in DispatchEven cast `num_levels - 1` to CommonT before passing
   it to MayOverflow: `static_cast<int8_t>(128) = -128` for int8_t,
   sign-extended in IntArithmeticT to 0xFFFFFF80, and the subsequent
   division `numeric_limits<IntArithmeticT>::max() / 0xFFFFFF80 = 1`
   reported overflow for any non-trivial range. Fix: pass `num_levels - 1`
   directly (it's already an `int`) and apply the same unsigned-promotion-
   safe subtraction in MayOverflow's `(upper - lower)` computation.

3. PassThruTransform::BinSelect computed `bin = static_cast<int>(sample)`
   for the byte-sample privatized histogram. For signed int8_t samples
   this preserved the sign, producing negative bin indices in [-128, -1]
   for half the input range; the kernel's `if (bin >= 0)` check then
   silently dropped them. Fix: cast through make_unsigned_t<_SampleT>
   first so int8_t(-128..127) reinterprets as uint8_t(128..255, 0..127).

The existing
"DeviceHistogram::HistogramEven num_bins exceeds LevelT range" test was
asserting `cudaErrorInvalidValue` for inputs that are now correctly
handled. Updated to assert success — the bin width can be fractional
(smaller than one distinct LevelT value), and the integer ComputeBin
path handles that without overflow once the storage-type and cast bugs
are fixed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For signed integer SampleT the bench now picks
`lower = numeric_limits<SampleT>::min()` instead of `0`. This doubles
the testable range — int8_t goes from 128 distinct values [0, 127] to
256 distinct [-128, 127], int16_t from 32768 to 65536 — letting the
benchmarks exercise more bin counts before hitting the
distinct-level-values cap. With this change and the matching
DeviceHistogram fix, int8_t now runs the dense matrix at bins=128 and
bins=255 (previously skipped or producing zero-filled histograms).

Helpers added in histogram_common.cuh:
- get_lower_level<SampleT>() returns numeric_limits::min() for signed
  integer SampleT and 0 otherwise.
- max_representable_bins<SampleT>() returns the count of distinct
  SampleT values minus 1 (the upper bound on bins + 1 strictly-monotonic
  levels). For 64-bit and floating-point SampleT it's int64_t::max(),
  effectively unbounded for the bench's bin axes.

The four bench files (`even`, `range`, `multi/even`, `multi/range`)
swap their hardcoded `lower_level = 0` for `get_lower_level<SampleT>()`
and gate on `num_bins > max_representable_bins<SampleT>()`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

4 participants