[cub] Harden histogram benchmarks and add bin-by-bin verification by robobryce · Pull Request #9144 · NVIDIA/cccl

robobryce · 2026-05-27T21:04:26Z

Summary

Hardens the four cub.bench.histogram.{even,range,multi.even,multi.range}.base NVBench benchmarks (under cub/benchmarks/bench/histogram/) against pattern-matching tunings and silent dispatch failures, and adds an opt-out bin-by-bin correctness verifier that runs outside the timed region. Also extends and adds CTest coverage in cub/test/.

The four benchmarks each call into cub::DeviceHistogram::HistogramEven / HistogramRange (single-channel) or MultiHistogramEven / MultiHistogramRange (3-active-channel of 4) over a sweep of input sizes, bin counts, sample types, and entropy values, and report a bandwidth in bytes/sec via NVBench's GlobalMem BW column.

Why

When you tune CUB's histogram code paths, NVBench measures elapsed time inside state.exec and reports bandwidth as bytes_processed / elapsed. Two failure modes can quietly inflate the reported bandwidth:

The dispatch returns a non-cudaSuccess error code, but the benchmark drops the return value on the floor. The kernel never launched, so the elapsed time is small, so the reported bandwidth is large.
The kernel completes with cudaSuccess but writes wrong counts into the output histogram. For example, a partition mask that drops samples that should have landed in another partition's write set produces a non-empty histogram with the right shape but the wrong values. CTest cases that aggregate to a sum-of-counts check pass; the bench reports inflated bandwidth because effectively only a fraction of the input was processed.

There is also a third class — uniform-spacing detection fast paths and tunings that key off power-of-two element/bin counts — where the bench measures a code path real users do not exercise, so the reported bandwidth overstates production behavior.

What changed

Six commits, each scoped narrowly

[cub] Harden histogram benchmarks against pattern-matching tunings — single-channel even.cu / range.cu. Quadratic-spaced range levels (still strictly monotonic across [lower_level, upper_level]) so DispatchRange stays on the SearchTransform path instead of collapsing to the uniform-detection fast path. Replaces the power-of-two Elements{io} axis with non-power-of-two sizes. Switches to NVBench's manual-timer exec_tag and calls cudaCtxResetPersistingL2Cache() outside the timed window, since cold-cache eviction does not demote persistence-marked addresses set via cudaStreamSetAttribute / cudaAccessPolicyWindow.
[cub] Extend histogram bench/test hardening to multi-channel and add cache coverage — same three changes for multi/even.cu and multi/range.cu. Replaces 128/2048 with 100/2000 in the Bins axis on all four benches so tunings that hard-code on power-of-two bin counts cannot use those shortcuts. In cub/test/catch2_test_device_histogram.cu, setup_bin_levels_for_range now perturbs interior levels by ±min_bin_width/4 (alternating sign), falling back to uniform when the type is too tight (e.g. byte-sample with 256 levels). Adds a new cub/test/catch2_test_device_histogram_thread_local_cache.cu with three Catch2 cases targeting the thread_local detection-stream / detection-buffer cache in dispatch_range: sequential calls across multiple user streams, four-thread concurrent calls on the same device, and a single-thread cross-device case that skips when fewer than two GPUs are present.
[cub] Force histogram bench persisting-L2 reservation to zero each iteration — adds cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 0) immediately before every timed iteration, so a prior bench in the same NVBench process can't leave the reservation bumped. The default is 0; this hardcodes 0 instead of relying on the default.
[cub] Use jittered-uniform histogram bench levels instead of quadratic — quadratic spacing produced bin widths spanning ~2n× (last bin vs first), which is not representative of typical workloads. Jittered uniform spacing (±25% of step, fixed mt19937 seed for reproducibility) keeps consecutive widths within ~3× while still defeating the uniform-spacing detection fast path.
[cub] Add 16384 and 60000 bin sizes to histogram benchmarks — covers the 10k–65k bin tier between 2000 and 2097152 across all four benches.
[cub] Verify histogram benchmark output bin-by-bin against a reference — the verifier described below.

How the verifier works

Each benchmark cell builds the strided pixel-major sample buffer and the output histograms exactly as before. Then, before state.exec:

The dispatch (HistogramEven / HistogramRange / MultiHistogramEven / MultiHistogramRange) is called once with explicit return-code checking. A non-cudaSuccess return throws std::runtime_error and the cell is reported as failed; NVBench prints the message in the per-cell log and emits no bandwidth row for that cell.
The produced histogram(s) are compared bin-by-bin against an independent reference. The reference is computed on the same device, but with thrust::for_each + a global atomicAdd per pixel per active channel — the EVEN reference uses the closed-form bin index (sample - lo) * num_bins / (hi - lo) (in double); the RANGE reference uses cub::UpperBound on the per-channel level array. Neither reference shares any kernel with cub::DeviceHistogram, so a bug that produces a wrong-but-plausible histogram from the optimized path is caught.
The compare reports both the total count (opt_total vs ref_total) and the first mismatched bin index with both values, so a failure log identifies the channel and bin to investigate.

The warmup, the reference build, and the comparison all run outside state.exec, so the timed region only contains the dispatch under test.

Disabling the verifier at run time

The verifier is on by default. To skip it (e.g. during a tuning sweep where only relative throughput matters and the verifier's wall-clock cost is undesirable), set CUB_BENCH_HISTOGRAM_VERIFY to 0, false, no, or off (case-insensitive) in the environment when invoking the benchmark binary. The flag is read once per process via a function-local static, so toggling has no per-cell overhead.

Cost

The verifier's cost on the timed region (NVBench's reported histogram metric — geometric mean across all four binaries' GlobalMem BW (bytes/sec) rows, in GiB/s) is zero within measurement noise. Wall-clock benchmark runtime grows by ~14% (median ~7 s on a ~52 s baseline run) because the on-device reference loops over every sample once per cell.

Hardware: NVIDIA B200, CUDA driver 580.126.09, CUDA toolkit 12.8.93, host Linux 6.8 with 30 CPUs / 180 GiB RAM. Five interleaved runs of cub/benchmarks/bench/histogram/run_histogram_benchmark.py each (interleaved to factor out NVBench-iteration-count run-to-run variance):

	Mean `histogram` (GiB/s)	Median wall-clock (s)
baseline (no verifier)	210.87	52.0
verifier on (default)	210.88	59.4
verifier off (`CUB_BENCH_HISTOGRAM_VERIFY=0`)	210.78	52.8

The timed-region metric is unchanged across all three modes within ±0.05% (well inside the bench's natural noise of ±0.05–0.10%). The wall-clock overhead from running the verifier is ~7 s; with the verifier disabled, wall-clock is statistically indistinguishable from baseline.

Validation

Built and ran the full sweep on the B200 host. With the verifier on the unmodified branch, all 160 benchmark cells (40 cells × 4 binaries) pass; the reported histogram metric is unchanged within noise.

To confirm the verifier triggers when the optimized output is actually wrong, I rebased a known-bad change onto this branch — a CUB optimization attempt that introduced a partition-mask bug that silently dropped most input samples on multi-channel Bins=60000 configurations. With the verifier in place, that change is rejected: 44 cells across the four binaries fail, with messages like

Run:  [1/1] base [Device=0 SampleT{ct}=I32 CounterT{ct}=I32 OffsetT{ct}=I32 Elements{io}=268435456 Bins=60000 Entropy=0.201]
Fail: Unexpected error: bench correctness check [multi.range]: channel=0 total opt=1209247 ref=268435456;
      first mismatched bin=0 (opt=763903 ref=169668305)

The message reports both the total-count mismatch (1.2M vs 268M — only ~0.45% of samples landed in the optimized histogram) and the first mismatched bin's per-bin counts. Without the verifier (CUB_BENCH_HISTOGRAM_VERIFY=0), the same change reports histogram=893.25 GiB/s — a ~4.2× bandwidth speedup that is entirely an artifact of the dropped samples — exactly the failure mode this PR is intended to prevent.

Test plan

ci/util/build_and_test_targets.sh --preset cub-cpp20 --build-targets "cub.bench.histogram.even.base cub.bench.histogram.range.base cub.bench.histogram.multi.even.base cub.bench.histogram.multi.range.base" builds clean.
ctest -R 'cub\.test\.device\.histogram(\.lid_0|_api\.lid_0|_env\.lid_0|_thread_local_cache\.lid_0)$' passes (including the new cub.test.device.histogram_thread_local_cache.lid_0).
All 160 benchmark cells (40 cells × 4 binaries) pass the bin-by-bin verifier on the unmodified branch; reported histogram metric is unchanged within noise.
CUB_BENCH_HISTOGRAM_VERIFY=0 skips the verifier; reported histogram metric is unchanged within noise; wall-clock is statistically indistinguishable from baseline.
Cherry-picking a known-bad change onto this branch causes 44 cells to fail with explicit bin-by-bin mismatch messages when the verifier is on, and "passes" with an inflated histogram value when the verifier is off (demonstrating the failure mode the verifier exists to prevent).

copy-pr-bot · 2026-05-27T21:04:30Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-27T21:10:05Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

important: ## Walkthrough

Adds GPU-side reference histogram functors, host snapshot/compare utilities, untimed warmups with bin-by-bin verification, per-iteration persisting‑L2 cache resets for timed runs, host-side jittered level generation, updated benchmark axes, and tests for perturbation and thread-local cache scenarios.

important: ## Changes

Histogram benchmark verification and L2 cache reset flow

Layer / File(s)	Summary
Histogram verification library `cub/benchmarks/bench/histogram/histogram_common.cuh`	`bench_correctness_checks_enabled`, `bench_check_cuda`, GPU reference functors (`bench_ref_even_op`, `bench_ref_range_op`), snapshot/compare helpers, and verifier entry points `bench_verify_histogram_even` / `bench_verify_histogram_range`.
Single-channel even benchmark `cub/benchmarks/bench/histogram/even.cu`	Adds untimed warmup that sizes/executes `HistogramEven` and validates bins against reference; timed loop now resets persisting‑L2 cache per iteration and runs `HistogramEven` under `nvbench::exec_tag::timer`. Axes switched to explicit element/bin lists.
Single-channel range benchmark `cub/benchmarks/bench/histogram/range.cu`	Builds jittered monotonic bin levels on host (fixed seed), copies to device; untimed warmup + validation via `bench_verify_histogram_range`; timed loop clears persisting‑L2 cache per iteration and times `HistogramRange`. Axes updated to explicit lists.
Multi-channel even benchmark `cub/benchmarks/bench/histogram/multi/even.cu`	Zeros per-channel histograms, runs untimed warmup (temp-storage sizing + exec), validates per-channel outputs, then times `MultiHistogramEven` under nvbench after per-iteration persisting‑L2 reset. Axes changed to explicit lists.
Multi-channel range benchmark `cub/benchmarks/bench/histogram/multi/range.cu`	Host-generated jittered bin edges (fixed seed) copied to device; untimed warmup with validation and rebind of verified pointers; timed loop resets persisting‑L2 cache per iteration and runs `MultiHistogramRange` under nvbench timer. Axes switched to explicit lists.
Histogram test enhancements `cub/test/catch2_test_device_histogram.cu`, `cub/test/catch2_test_device_histogram_thread_local_cache.cu`	Perturbs interior range-level edges by bounded offsets and adds a thread-local cache test exercising sequential stream reuse, concurrent per-thread streams, and same-thread device switching for `HistogramRange`, verifying GPU results against CPU reference.

important: suggestion: Suggested labels

backport branch/3.4.x

important: suggestion: Suggested reviewers

shwina
bernhardmgruber
srinivasyadav18

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (9)

cub/benchmarks/bench/histogram/histogram_common.cuh (5)
115-156: 💤 Low value

suggestion: Data members in bench_ref_even_op lack the postfix underscore required by guidelines (e.g., d_input_, num_bins_, etc.). Also, line 119's d_hist should be plural (d_hists_) since it's an array. Line 136 uses std::size_t—should be ::cuda::std::size_t per type qualification rules.

These are benchmark internals, so lower priority, but worth noting for consistency.

158-191: ⚡ Quick win

suggestion: Same member naming issues as the EVEN functor. Additionally, line 183 uses cub::UpperBound—free function calls should be fully qualified as ::cub::UpperBound. This matches the guideline that "all calls to free functions must be fully qualified starting from the global namespace."
-      const int idx   = cub::UpperBound(d_levels[c], num_levels, s);
+      const int idx   = ::cub::UpperBound(d_levels[c], num_levels, s);
193-240: 💤 Low value

suggestion: Line 213 uses std::size_t; prefer ::cuda::std::size_t for consistency with the codebase type qualification rules.
-    for (std::size_t b = 0; b < opt.size(); ++b)
+    for (::cuda::std::size_t b = 0; b < opt.size(); ++b)
242-253: ⚡ Quick win

suggestion: bench_snapshot_histograms returns a non-void value; add [[nodiscard]] per guidelines. Also consider adding inline since it's not constexpr, though templates get implicit inline linkage.
 template <typename CounterT>
+[[nodiscard]]
 std::vector<thrust::host_vector<CounterT>>
 bench_snapshot_histograms(const std::vector<thrust::device_vector<CounterT>>& d_hists)
297-334: 💤 Low value

suggestion: Line 311 assumes all channels have identical level counts by reading only d_levels_per_channel[0].size(). If channels could differ, this would silently produce incorrect validation. If same-size is an invariant, a debug assert or comment documenting this assumption would help future maintainers.
cub/test/catch2_test_device_histogram.cu (1)
209-209: ⚡ Quick win

suggestion: Mark perturbation_step const since it's computed once and never modified.
-  const auto perturbation_step = min_bin_width / 4;
+  const auto perturbation_step = min_bin_width / 4;
As per coding guidelines, all variables that are not modified must use const qualifier.
cub/test/catch2_test_device_histogram_thread_local_cache.cu (2)
33-34: ⚡ Quick win

suggestion: Mark upper constexpr and step const for const correctness.
-  constexpr double upper = 1024.0;
-  const double step      = upper / static_cast<double>(num_bins);
+  constexpr double upper = 1024.0;
+  const double step      = upper / static_cast<double>(num_bins);
As per coding guidelines, variables that are not modified must use const qualifier, and variables that can be evaluated at compile-time should use constexpr.

79-79: ⚡ Quick win

suggestion: Mark num_levels const since it's not modified after initialization.
-  const int num_levels = static_cast<int>(d_levels.size());
+  const int num_levels = static_cast<int>(d_levels.size());
cub/benchmarks/bench/histogram/even.cu (1)

97-128: 💤 Low value

suggestion: cudaDeviceSetLimit and cudaCtxResetPersistingL2Cache return error codes that are silently ignored. Consider wrapping them in bench_check_cuda for diagnostic visibility when L2 reset fails unexpectedly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 54ee710d-d89a-4b79-a192-ae1b59574dfb

📥 Commits

Reviewing files that changed from the base of the PR and between ba6b4ea and f97562f.

📒 Files selected for processing (7)

cub/benchmarks/bench/histogram/even.cu
cub/benchmarks/bench/histogram/histogram_common.cuh
cub/benchmarks/bench/histogram/multi/even.cu
cub/benchmarks/bench/histogram/multi/range.cu
cub/benchmarks/bench/histogram/range.cu
cub/test/catch2_test_device_histogram.cu
cub/test/catch2_test_device_histogram_thread_local_cache.cu

Three changes to cub/benchmarks/bench/histogram/{even,range}.cu so the benchmarks exercise the code paths real users hit: - range.cu: build levels[] with quadratic spacing (still strictly monotonic across [lower_level, upper_level]) so DispatchRange stays on the SearchTransform path. The previous thrust::sequence boundaries were perfectly uniform, letting any uniform-detection fast path collapse the bench to DispatchEven performance. - both: replace the power-of-two Elements{io} axis with non-power-of-two sizes so tunings that hard-code on round counts (exact tile multiples, pow2 shortcuts) get measured at sizes where those shortcuts cannot fire. The total axis cardinality is unchanged. - both: switch to the manual-timer exec_tag and call cudaCtxResetPersistingL2Cache() outside the timed window. nvbench's cold measurement already evicts cached lines between iterations, but it does not demote persistence-marked addresses set via cudaStreamSetAttribute / cudaAccessPolicyWindow.

…cache coverage Benchmarks - multi/even.cu, multi/range.cu: same hardening as the prior commit's even.cu/range.cu changes — quadratic-spaced range levels (still strictly monotonic across [lower_level, upper_level]) so DispatchRange stays on the SearchTransform path; the manual-timer exec_tag with cudaCtxResetPersistingL2Cache() outside the timed window; non-power- of-two Elements{io} so tunings that hard-code on round counts (exact tile multiples, pow2 shortcuts) are exercised at sizes where those shortcuts cannot fire. Axis cardinality is unchanged. - even.cu, range.cu, multi/{even,range}.cu: replace two of the four Bins values (128 -> 100, 2048 -> 2000) so tunings that hard-code on power- of-two bin counts cannot use those shortcuts. Cardinality unchanged. Tests - catch2_test_device_histogram.cu: setup_bin_levels_for_range now perturbs interior levels by +/- min_bin_width/4 (alternating sign), falling back to uniform when the type is too tight (e.g. byte-sample with 256 levels). The std::upper_bound reference already handled arbitrary spacings; this just exercises the SearchTransform code path in addition to the uniform-detection fast path. - catch2_test_device_histogram_thread_local_cache.cu (new): three Catch2 cases targeting the thread_local detection_stream / detection_ buf cache in dispatch_range. Sequential calls across multiple user streams, four-thread concurrent calls on the same device, and a single-thread cross-device case that skips when fewer than two GPUs are present.

…eration The default cudaLimitPersistingL2CacheSize is 0, so hardcoding 0 (rather than relying on the default) defends against prior benchmarks in the same nvbench process having bumped the reservation. The cudaCtxResetPersistingL2Cache call already demoted persistence-marked addresses; this extends the defense to the reservation itself.

Quadratic spacing produced bin widths spanning ~2n× (last bin vs first), which is not representative of typical workloads. Jittered uniform spacing (±25% of step, fixed mt19937 seed) keeps consecutive widths within ~3× while still defeating uniform-spacing detection so DispatchRange stays on the SearchTransform path.

Extends the Bins axis to cover the 10k-65k range, which sits between the existing 2000 and 2097152 entries. Applied to range/even and the multi-channel variants so all four histogram benches share the same axis.

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (1)

cub/benchmarks/bench/histogram/even.cu (1)

54-57: ⚡ Quick win

suggestion: the new free-function calls are not globally qualified. Please switch the newly added calls here to ::thrust::fill, ::bench_check_cuda, ::bench_verify_histogram_even, ::cudaDeviceSetLimit, and ::cudaCtxResetPersistingL2Cache to match the repo rule. As per coding guidelines, "All calls to free functions must be fully qualified starting from the global namespace, e.g., ::cuda::ceil_div, including calls to functions in the same namespace."

Also applies to: 69-70, 81-86, 95-95, 106-107

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5ea8179f-497f-4fd8-8df1-af60b9401dc1

📥 Commits

Reviewing files that changed from the base of the PR and between f97562f and 1a26924.

📒 Files selected for processing (7)

cub/benchmarks/bench/histogram/even.cu
cub/benchmarks/bench/histogram/histogram_common.cuh
cub/benchmarks/bench/histogram/multi/even.cu
cub/benchmarks/bench/histogram/multi/range.cu
cub/benchmarks/bench/histogram/range.cu
cub/test/catch2_test_device_histogram.cu
cub/test/catch2_test_device_histogram_thread_local_cache.cu

🚧 Files skipped from review as they are similar to previous changes (5)

cub/benchmarks/bench/histogram/multi/range.cu
cub/benchmarks/bench/histogram/histogram_common.cuh
cub/test/catch2_test_device_histogram.cu
cub/benchmarks/bench/histogram/range.cu
cub/benchmarks/bench/histogram/multi/even.cu

coderabbitai · 2026-05-27T21:49:41Z

+    bench_check_cuda(
+      cub::DeviceHistogram::HistogramEven(
+        d_temp_storage,
+        temp_storage_bytes,
+        d_input,
+        d_histogram,
+        num_levels,
+        lower_level,
+        upper_level,
+        static_cast<OffsetT>(elements)),
+      "warmup HistogramEven temp-size");
+    thrust::device_vector<unsigned char> warmup_tmp(temp_storage_bytes);
+    d_temp_storage = thrust::raw_pointer_cast(warmup_tmp.data());
+    bench_check_cuda(
+      cub::DeviceHistogram::HistogramEven(
+        d_temp_storage,
+        temp_storage_bytes,
+        d_input,
+        d_histogram,
+        num_levels,
+        lower_level,
+        upper_level,
+        static_cast<OffsetT>(elements)),
+      "warmup HistogramEven");


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="cub/benchmarks/bench/histogram/even.cu" echo "== File = $FILE ==" wc -l "$FILE" echo echo "== Context around warmup verifier (lines 40-110) ==" nl -ba "$FILE" | sed -n '40,110p' echo echo "== Context around benchmark timing path (lines 100-150) ==" nl -ba "$FILE" | sed -n '100,170p' echo echo "== Search for relevant calls ==" rg -n "HistogramEven\(|cudaDeviceSetLimit|cudaCtxResetPersistingL2Cache|raw_pointer_cast|thrust::fill|bench_check_cuda|bench_verify_histogram_even" "$FILE"

Repository: NVIDIA/cccl

Length of output: 265

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="cub/benchmarks/bench/histogram/even.cu" echo "== File = $FILE ==" wc -l "$FILE" echo echo "== Lines 40-110 ==" sed -n '40,110p' "$FILE" | cat -n echo echo "== Lines 100-150 ==" sed -n '100,150p' "$FILE" | cat -n echo echo "== Search for relevant calls ==" rg -n "HistogramEven\(|cudaDeviceSetLimit|cudaCtxResetPersistingL2Cache|raw_pointer_cast|thrust::fill|bench_check_cuda|bench_verify_histogram_even" "$FILE"

Repository: NVIDIA/cccl

Length of output: 6422

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="cub/benchmarks/bench/histogram/even.cu" echo "== File head ==" sed -n '1,120p' "$FILE" | cat -n echo echo "== Search for bench_check_cuda / bench_verify_histogram_even declarations/usages ==" rg -n "bench_check_cuda|bench_verify_histogram_even|namespace|using " "$FILE" echo echo "== Show any local includes/aliases near the top ==" rg -n "`#include`|cub::|thrust::|cuda::|_CCCL_TRY_CUDA_API|CUB_BENCH|bench_" "$FILE" | head -n 80

Repository: NVIDIA/cccl

Length of output: 7373

important: Warmup validation and cache-state setup must follow the same execution/control path as the timed benchmark.

Warmup correctness uses the non-env temp-storage overload of cub::DeviceHistogram::HistogramEven(...) (no cub_bench_env / no cuda::execution::tune(...)), while the timed loop uses the env overload and applies tuning when !TUNE_BASE; warmup can therefore “verify” a different dispatch path than what’s benchmarked.

cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 0) and cudaCtxResetPersistingL2Cache() are called without checking their cudaError_t return values, so persisting-L2 demotion failures can silently invalidate the intended cache-state control.

Apply the free-function global-qualification guideline to newly added calls (e.g., use ::thrust::fill, ::thrust::raw_pointer_cast, ::cudaDeviceSynchronize, ::cudaDeviceSetLimit, ::cudaCtxResetPersistingL2Cache).

coderabbitai · 2026-05-27T21:49:41Z

+               cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 0);
+               cudaCtxResetPersistingL2Cache();


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect the target file around the reported lines FILE="cub/benchmarks/bench/histogram/even.cu" if [ ! -f "$FILE" ]; then echo "Missing file: $FILE" >&2 exit 1 fi echo "=== File: $FILE ===" echo "--- Lines 80-140 ---" nl -ba "$FILE" | sed -n '80,140p' # Check whether bench_check_cuda is available/used near those calls echo "--- Search for bench_check_cuda usage in file ---" rg -n "bench_check_cuda" "$FILE" || true # Check for any macro/wrapper around cudaDeviceSetLimit / cudaCtxResetPersistingL2Cache echo "--- Search for cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize and cudaCtxResetPersistingL2Cache in file ---" rg -n "cudaDeviceSetLimit\\(cudaLimitPersistingL2CacheSize,|cudaCtxResetPersistingL2Cache\\(" "$FILE" || true

Repository: NVIDIA/cccl

Length of output: 191

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="cub/benchmarks/bench/histogram/even.cu" if [ ! -f "$FILE" ]; then echo "Missing file: $FILE" >&2 exit 1 fi echo "=== File: $FILE ===" echo "--- Lines 80-140 (with line numbers) ---" awk 'NR>=80 && NR<=140 {printf "%d:%s\n", NR, $0}' "$FILE" | sed -n '1,200p' echo "--- Search for bench_check_cuda usage in file ---" rg -n "bench_check_cuda" "$FILE" || true echo "--- Search for cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize and cudaCtxResetPersistingL2Cache in file ---" rg -n "cudaDeviceSetLimit\\(cudaLimitPersistingL2CacheSize,|cudaCtxResetPersistingL2Cache\\(" "$FILE" || true

Repository: NVIDIA/cccl

Length of output: 3035

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="cub/benchmarks/bench/histogram/even.cu" awk 'NR>=40 && NR<=90 {printf "%d:%s\n", NR, $0}' "$FILE" | sed -n '1,200p'

Repository: NVIDIA/cccl

Length of output: 2036

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n "bench_check_cuda\\b" cub/benchmarks/bench/histogram/even.cu # Find a definition of bench_check_cuda in the benchmarks harness rg -n "bench_check_cuda\\s*\$" cub/benchmarks -S rg -n "auto\\s+bench_check_cuda|void\\s+bench_check_cuda|`#define`\\s+bench_check_cuda|bench_check_cuda\\s*=" cub/benchmarks -S rg -n "bench_check_cuda" cub/benchmarks -S --files-with-matches | head -n 50

Repository: NVIDIA/cccl

Length of output: 2068

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect bench_check_cuda definition FILE1="cub/benchmarks/bench/histogram/histogram_common.cuh" echo "=== $FILE1 lines 80-140 ===" awk 'NR>=80 && NR<=140 {printf "%d:%s\n", NR, $0}' "$FILE1" | sed -n '1,200p' # Look for persisting L2 reset pattern in other histogram benchmarks echo "=== Search for persisting-L2 reset calls in cub/benchmarks/bench/histogram ===" rg -n "cudaDeviceSetLimit\\(cudaLimitPersistingL2CacheSize,\\s*0\$|cudaCtxResetPersistingL2Cache\$" cub/benchmarks/bench/histogram

Repository: NVIDIA/cccl

Length of output: 3277

🏁 Script executed:

#!/bin/bash set -euo pipefail for FILE in \ cub/benchmarks/bench/histogram/range.cu \ cub/benchmarks/bench/histogram/multi/even.cu \ cub/benchmarks/bench/histogram/multi/range.cu; do echo "=== $FILE lines around persisting-L2 reset ===" # print a small window around the matching lines (122/etc varies) awk ' /cudaDeviceSetLimit\\(cudaLimitPersistingL2CacheSize, 0\$/ {start=NR-3; end=NR+6} NR>=start && NR<=end {print NR ":" $0} NR>end {start=0} ' "$FILE" | sed -n '1,120p' done

Repository: NVIDIA/cccl

Length of output: 309

🏁 Script executed:

#!/bin/bash set -euo pipefail FILES=( cub/benchmarks/bench/histogram/range.cu cub/benchmarks/bench/histogram/multi/even.cu cub/benchmarks/bench/histogram/multi/range.cu ) for FILE in "${FILES[@]}"; do echo "=== $FILE ===" # Get exact line numbers for both calls mapfile -t LNS < <(rg -n "cudaDeviceSetLimit\$cudaLimitPersistingL2CacheSize,\\s*0\$;|cudaCtxResetPersistingL2Cache\$\$;" "$FILE" | cut -d: -f1 | sort -n) if [ ${`#LNS`[@]} -eq 0 ]; then echo " (no matches found)" continue fi for ln in "${LNS[@]}"; do start=$((ln-6)) end=$((ln+6)) awk -v s="$start" -v e="$end" 'NR>=s && NR<=e {printf "%d:%s\n", NR, $0}' "$FILE" | sed -n '1,40p' echo " ---" done done

Repository: NVIDIA/cccl

Length of output: 179

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n -C 8 "cudaDeviceSetLimit\$cudaLimitPersistingL2CacheSize,\\s*0\$;|cudaCtxResetPersistingL2Cache\$\$;" \ cub/benchmarks/bench/histogram/range.cu \ cub/benchmarks/bench/histogram/multi/even.cu \ cub/benchmarks/bench/histogram/multi/range.cu

Repository: NVIDIA/cccl

Length of output: 5392

important: Check CUDA error returns for persisting-L2 resets before timer.start() in cub/benchmarks/bench/histogram/even.cu (lines 106-107): cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 0); and cudaCtxResetPersistingL2Cache(); ignore their cudaError_t results—wrap both with bench_check_cuda(...) so a reset failure aborts the benchmark instead of running with stale persisting-L2 state.

coderabbitai · 2026-05-27T21:49:41Z

+  for (int t = 0; t < num_threads; ++t)
+  {
+    threads.emplace_back([&]() {
+      cudaStream_t stream;
+      if (cudaStreamCreate(&stream) != cudaSuccess)
+      {
+        failures.fetch_add(1);
+        return;
+      }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="cub/test/catch2_test_device_histogram_thread_local_cache.cu" echo "== File: $FILE ==" echo "-- lines 130-240 --" nl -ba "$FILE" | sed -n '130,240p' echo "-- lines 240-340 --" nl -ba "$FILE" | sed -n '240,340p' echo "-- grep for cudaSetDevice in this file --" rg -n "cudaSetDevice\(" "$FILE" || true echo "-- grep for cudaGetDevice in this file --" rg -n "cudaGetDevice\(" "$FILE" || true

Repository: NVIDIA/cccl

Length of output: 208

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="cub/test/catch2_test_device_histogram_thread_local_cache.cu" echo "== File: $FILE ==" echo "-- lines 140-240 (numbered) --" python3 - <<'PY' import itertools path="cub/test/catch2_test_device_histogram_thread_local_cache.cu" start=140; end=240 with open(path,'r',encoding='utf-8') as f: for i,line in enumerate(f, start=1): if i<start: continue if i>end: break print(f"{i:5d}: {line.rstrip()}") PY echo "-- lines 240-320 (numbered) --" python3 - <<'PY' path="cub/test/catch2_test_device_histogram_thread_local_cache.cu" start=240; end=320 with open(path,'r',encoding='utf-8') as f: for i,line in enumerate(f, start=1): if i<start: continue if i>end: break print(f"{i:5d}: {line.rstrip()}") PY echo "-- grep for cudaSetDevice in this file --" rg -n "cudaSetDevice\(" "$FILE" || true echo "-- grep for cudaGetDevice in this file --" rg -n "cudaGetDevice\(" "$FILE" || true

Repository: NVIDIA/cccl

Length of output: 5082

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="cub/test/catch2_test_device_histogram_thread_local_cache.cu" python3 - <<'PY' path="cub/test/catch2_test_device_histogram_thread_local_cache.cu" for (a,b) in [(1,140),(300,500)]: with open(path,'r',encoding='utf-8') as f: for i,line in enumerate(f, start=1): if i<a: continue if i>b: break print(f"{i:5d}: {line.rstrip()}") PY echo "-- grep for cudaSetDevice/cudaGetDevice --" rg -n "cuda(SetDevice|GetDevice)\(" "$FILE" || true echo "-- show definitions of run_histogram_range / helpers (if present) --" rg -n "run_histogram_range|make_levels|reference_histogram" "$FILE"

Repository: NVIDIA/cccl

Length of output: 6568

🌐 Web query:

Does CUDA Runtime maintain current device as thread-local state (cudaSetDevice affects only calling thread) when using std::thread?

💡 Result:

Yes. CUDA Runtime maintains the “current device / active CUcontext” as per-calling-host-thread state, so cudaSetDevice() affects only the calling thread (e.g., when you spawn work with std::thread, each thread has its own current-device setting unless you call cudaSetDevice() in that thread). Evidence: - NVIDIA’s CUDA Runtime API documentation states that CUDA Runtime API calls operate on the CUDA Driver CUcontext “current to the calling host thread”, and that cudaSetDevice() “makes [the primary context] current to the calling thread” by calling cuCtxSetCurrent().[1] - NVIDIA’s docs also explain that if no CUcontext is current to the calling thread, the runtime selects/makes the primary context current to that thread when a runtime call that requires an active context is made.[1] - NVIDIA’s guidance for multithreading explicitly says the CUDA runtime API maintains per-thread state about the current device, and recommends calling cudaSetDevice() in new host threads because otherwise they may use the default device.[2] Important nuance: - The primary context for a device is “device execution context” at runtime level, but the “currentness” is per host thread. Also, cudaDeviceReset() deinitializes the primary context for the calling thread’s current device “immediately”, indicating that the thread’s current device/context matters for behavior.[1] So, with std::thread: calling cudaSetDevice(d) in one thread does not implicitly set the current device for other threads; each thread should call cudaSetDevice(d) (or otherwise establish the correct context/device) before making CUDA runtime calls that rely on the active context/device.[1][2]

Citations:

1: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DRIVER.html

2: https://developer.nvidia.com/blog/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/

important: Each std::thread needs to set the CUDA device before calling cub::DeviceHistogram::HistogramRange with thrust::device_vector pointers allocated on the spawning thread’s current device (no cudaSetDevice inside the worker lambda at ~165-218).

important: Restore the original active device at the end of the cross-device test instead of hard-coding cudaSetDevice(0) (ends at ~260) to avoid order-dependent device-state leakage.

coderabbitai · 2026-05-27T21:49:41Z

+  int num_devices = 0;
+  REQUIRE(cudaSuccess == cudaGetDeviceCount(&num_devices));


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail file="cub/test/catch2_test_device_histogram_thread_local_cache.cu" echo "== file ==" ls -l "$file" || true echo "== snippet around requested lines ==" sed -n '220,290p' "$file" | nl -ba | sed -n '1,120p' echo "== find cudaSetDevice occurrences ==" rg -n "cudaSetDevice\\(" "$file" || true

Repository: NVIDIA/cccl

Length of output: 273

🏁 Script executed:

#!/bin/bash set -euo pipefail file="cub/test/catch2_test_device_histogram_thread_local_cache.cu" echo "== find cudaSetDevice occurrences ==" rg -n "cudaSetDevice\\(" "$file" || true echo "== find cudaGetDevice occurrences ==" rg -n "cudaGetDevice\\(" "$file" || true echo "== snippet around 240-280 ==" sed -n '235,275p' "$file" | awk '{printf "%6d | %s\n", NR+234, $0}' echo "== snippet around 250-265 ==" sed -n '245,270p' "$file" | awk '{printf "%6d | %s\n", NR+244, $0}'

Repository: NVIDIA/cccl

Length of output: 2195

🏁 Script executed:

#!/bin/bash set -euo pipefail file="cub/test/catch2_test_device_histogram_thread_local_cache.cu" echo "== find cudaGetDevice occurrences ==" rg -n "cudaGetDevice\\(" "$file" || true echo "== find cudaGetDeviceCount occurrences ==" rg -n "cudaGetDeviceCount\\(" "$file" || true echo "== snippet around first set device block (235-265) for context ==" sed -n '200,280p' "$file" | cat -n | sed -n '1,120p' echo "== snippet around device selection loop (240-270) ==" sed -n '235,270p' "$file" | cat -n

Repository: NVIDIA/cccl

Length of output: 3557

important: restore the previously active CUDA device at teardown instead of forcing device 0
The test unconditionally calls cudaSetDevice(0) after switching devices; save the original device via cudaGetDevice before the {0, 1, 0} loop and restore that exact value afterward (currently cudaSetDevice(dev) and cudaSetDevice(0) around lines 248-261).

jrhemstad · 2026-05-27T22:01:59Z

+               cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 0);
+               cudaCtxResetPersistingL2Cache();


This shouldn't be repeated in every benchmark. This is the kind of thing that should be in nvbench.

Sure, but it's not valid for every NVBench test, some may not want the reset. I'd be happy to add it upstream.

Without this agents will cheat with L2 persistence on the benchmark.

jrhemstad · 2026-05-27T22:04:11Z

+// Exercises the thread_local detection-stream / detection-buffer cache used by
+// dispatch_range's uniform-levels detection path: sequential calls on multiple
+// user streams, concurrent calls from multiple threads, and the cross-device
+// hazard where the cache is bound to the device current at first call.


These tests feel really ad hoc and not clear these are needed.

Without these checks, an AI agent optimizing this code will attempt to cache the CUDA streams and other state in a thread local.

jrhemstad · 2026-05-27T22:04:49Z

  state.add_global_memory_reads<SampleT>(elements);
  state.add_global_memory_writes<CounterT>(num_bins);

+  // Warmup + correctness check: run HistogramEven once outside `state.exec`,


nvbench already does warmups. Manual warmups should not be added.

This is not a warmup. It's a correctness check to validate that the histogram is actually correctly executed.

The unit tests only test with a problem size of up to 1024, and doesn't provide sufficient code coverage.

Each cell of the four `cub.bench.histogram.{even,range,multi.even, multi.range}.base` benchmarks now runs the dispatch once before NVBench's timed window and compares the produced per-channel histogram bin-by-bin against an independent reference computed on-device with `thrust::for_each` + global `atomicAdd`. The warmup also checks the dispatch return code so a non-`cudaSuccess` return is reported instead of being silently discarded. The verifier runs entirely outside `state.exec`, so timed-region bandwidth is unchanged within measurement noise. Wall-clock per benchmark cell increases proportionally to the input size of that cell (the reference loops over every sample once on device). The verifier is on by default and can be disabled at run time by setting the environment variable `CUB_BENCH_HISTOGRAM_VERIFY` to one of: `0`, `false`, `no`, `off` (case-insensitive). Disabling it skips the warmup dispatch, the reference build, and the bin-by-bin compare. The verifier catches two bug classes that the existing CTest histogram suite does not: - dispatch-time errors (e.g. `cudaErrorInvalidValue` from a temp- storage size mismatch in the chunked-staging path) that are not reported by NVBench because the dispatch return code is dropped on the floor. - per-bin count corruption that still produces a non-empty histogram with the right shape but the wrong values (e.g. a partition mask that drops samples that should have landed in another partition's write set). These pass any sum-of-counts sanity check.

miscco · 2026-05-28T06:30:41Z

+  // `state.exec`, checking the dispatch return code, then verify each
+  // channel's histogram bin-by-bin against an independent reference.
+  // Skipped when CUB_BENCH_HISTOGRAM_VERIFY=0|false|no|off.
+  if (bench_correctness_checks_enabled())


Question: This is also costly to compile, should this be an if constexpr?

miscco · 2026-05-28T06:31:34Z

+               cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 0);
+               cudaCtxResetPersistingL2Cache();
+               timer.start();


We should have a helper function for that to use it in other / all benchmarks

The jittered-uniform level construction in the range benches sets upper_level via get_upper_level, which previously returned num_bins for integer SampleT. That produced step = 1.0, so the ±0.25*step jitter sat in [-0.25, 0.25] and was annihilated by the integer cast in the level loop. The subsequent dedup-by-1 step then forced every collision back onto the next consecutive integer, leaving the level array bit-identical to a perfect uniform stride-1 sequence. A DispatchRange uniform-spacing detection then has nothing to detect against: it sees a perfectly uniform level array on every integer axis row and routes straight to the EVEN classify path - exactly the fast path the range bench is supposed to avoid measuring. Widen upper_level to ~4 * num_bins for integer SampleT so step is at least ~4 and ±step/4 jitter survives integer truncation as ±1, which is enough to break uniformity. Clamp to the type max when 4 * bins overflows SampleT; those axes (e.g. int8_t with bins >= 64) already have step < 1 and the level array is degenerate regardless of jitter.

coderabbitai · 2026-05-28T19:16:44Z

Actionable comments posted: 0

…velT Two latent bugs in cub::DeviceHistogram surfaced when widening the bench to use the full SampleT range (lower_level = numeric_limits<SampleT>::min() for signed integers). 1. ScaleTransform stored `m_max`, `m_min`, `m_scale.fraction.range`, and `m_scale.fraction.bins` in `CommonT = common_type<LevelT, SampleT>`, then ComputeBin promoted through the wider `IntArithmeticT` only at the multiply/divide step. For narrow integer CommonT (int8_t, int16_t) the precomputed `range = max - min` overflowed CommonT before the promotion: int8_t with [-128, 127] gave `range = 255` truncated back to int8_t = -1, sign-extended in IntArithmeticT to 0xFFFFFFFF, and ComputeBin's division by that gigantic divisor returned 0 for every sample. The histogram was non-empty but every count landed in bin 0. Fix: introduce FractionStorageT = IntArithmeticT for integer CommonT (CommonT for non-integer types) and store both `range` and `bins` in it. Compute `max - min` through ULevelT = make_unsigned_t<T>: the intermediate cast is required because C++ integer promotion lifts `(uint8_t) - (uint8_t)` to int(127 - 128) = -1, and going directly to FractionStorageT sign-extends that to a huge garbage value. Truncating through ULevelT first lets unsigned modular wrap-around recover the correct difference. 2. The MayOverflow precondition check at the byte-sample EVEN dispatch sites in DispatchEven cast `num_levels - 1` to CommonT before passing it to MayOverflow: `static_cast<int8_t>(128) = -128` for int8_t, sign-extended in IntArithmeticT to 0xFFFFFF80, and the subsequent division `numeric_limits<IntArithmeticT>::max() / 0xFFFFFF80 = 1` reported overflow for any non-trivial range. Fix: pass `num_levels - 1` directly (it's already an `int`) and apply the same unsigned-promotion- safe subtraction in MayOverflow's `(upper - lower)` computation. 3. PassThruTransform::BinSelect computed `bin = static_cast<int>(sample)` for the byte-sample privatized histogram. For signed int8_t samples this preserved the sign, producing negative bin indices in [-128, -1] for half the input range; the kernel's `if (bin >= 0)` check then silently dropped them. Fix: cast through make_unsigned_t<_SampleT> first so int8_t(-128..127) reinterprets as uint8_t(128..255, 0..127). The existing "DeviceHistogram::HistogramEven num_bins exceeds LevelT range" test was asserting `cudaErrorInvalidValue` for inputs that are now correctly handled. Updated to assert success — the bin width can be fractional (smaller than one distinct LevelT value), and the integer ComputeBin path handles that without overflow once the storage-type and cast bugs are fixed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

For signed integer SampleT the bench now picks `lower = numeric_limits<SampleT>::min()` instead of `0`. This doubles the testable range — int8_t goes from 128 distinct values [0, 127] to 256 distinct [-128, 127], int16_t from 32768 to 65536 — letting the benchmarks exercise more bin counts before hitting the distinct-level-values cap. With this change and the matching DeviceHistogram fix, int8_t now runs the dense matrix at bins=128 and bins=255 (previously skipped or producing zero-filled histograms). Helpers added in histogram_common.cuh: - get_lower_level<SampleT>() returns numeric_limits::min() for signed integer SampleT and 0 otherwise. - max_representable_bins<SampleT>() returns the count of distinct SampleT values minus 1 (the upper bound on bins + 1 strictly-monotonic levels). For 64-bit and floating-point SampleT it's int64_t::max(), effectively unbounded for the bench's bin axes. The four bench files (`even`, `range`, `multi/even`, `multi/range`) swap their hardcoded `lower_level = 0` for `get_lower_level<SampleT>()` and gate on `num_bins > max_representable_bins<SampleT>()`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

robobryce requested review from a team as code owners May 27, 2026 21:04

robobryce requested a review from shwina May 27, 2026 21:04

github-project-automation Bot added this to CCCL May 27, 2026

robobryce requested a review from elstehle May 27, 2026 21:04

github-project-automation Bot moved this to Todo in CCCL May 27, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 27, 2026

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

Comment thread cub/test/catch2_test_device_histogram_thread_local_cache.cu Outdated

robobryce force-pushed the main branch from f97562f to 1a26924 Compare May 27, 2026 21:43

brycelelbach added 5 commits May 27, 2026 21:48

[cub] Add 16384 and 60000 bin sizes to histogram benchmarks

24954d5

Extends the Bins axis to cover the 10k-65k range, which sits between the existing 2000 and 2097152 entries. Applied to range/even and the multi-channel variants so all four histogram benches share the same axis.

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

robobryce force-pushed the main branch from 1a26924 to a1d55af Compare May 27, 2026 21:52

jrhemstad reviewed May 27, 2026

View reviewed changes

robobryce force-pushed the main branch from a1d55af to 043d5b9 Compare May 27, 2026 23:22

miscco reviewed May 28, 2026

View reviewed changes

brycelelbach and others added 2 commits May 28, 2026 20:02

		cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 0);
		cudaCtxResetPersistingL2Cache();

		int num_devices = 0;
		REQUIRE(cudaSuccess == cudaGetDeviceCount(&num_devices));

Conversation

robobryce commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

What changed

Six commits, each scoped narrowly

How the verifier works

Disabling the verifier at run time

Cost

Validation

Test plan

Uh oh!

copy-pr-bot Bot commented May 27, 2026

Uh oh!

coderabbitai Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Review ran into problems

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

robobryce commented May 27, 2026 •

edited

Loading

coderabbitai Bot commented May 27, 2026 •

edited

Loading