fix: CUDA bitpacked sliced output allocation #8622
report: A sliced bit-packed array carries a non-zero Likely fix: allocate Reproduction Add to the Output: ( Impact Dictionary columns slice their bit-packed
Decode sliced bit-packed arrays in padded coordinates by sizing and launching for offset + len. This keeps the returned offset..offset+len device slice in bounds and ensures the final touched 1024-value chunk is decoded.

Signed-off-by: "Alexander Droste" <alexander.droste@protonmail.com>
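The sizing rule in the commit message can be illustrated with a small host-side sketch. This is not the PR's code: `padded_plan`, `len_only_plan`, and `output_range` are hypothetical names, and the only facts taken from the description are the 1024-value chunk size and the `offset + len` sizing. It compares sizing by `offset + len` with sizing by `len` alone.

```rust
/// Values per bit-packed chunk, as stated in the PR description.
const CHUNK_LEN: usize = 1024;

/// Hypothetical helper (an assumption, not the PR's code): number of chunks to
/// launch and number of output elements to allocate in padded coordinates.
fn padded_plan(offset: usize, len: usize) -> (usize, usize) {
    // Size for offset + len so the end of the returned slice is covered.
    let chunks = (offset + len).div_ceil(CHUNK_LEN);
    (chunks, chunks * CHUNK_LEN)
}

/// For comparison only: sizing from len alone, ignoring the slice offset.
fn len_only_plan(len: usize) -> (usize, usize) {
    let chunks = len.div_ceil(CHUNK_LEN);
    (chunks, chunks * CHUNK_LEN)
}

/// Range of the decoded buffer that is handed back to the caller.
fn output_range(offset: usize, len: usize) -> std::ops::Range<usize> {
    offset..offset + len
}
```

With `offset = 1000` and `len = 100`, the padded plan launches 2 chunks and allocates 2048 elements, so the returned range `1000..1100` is in bounds and the second chunk is decoded. Sizing from `len` alone would give 1 chunk and 1024 elements, which is too small for that range.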
6927825 to f8a35cc
Thanks for the heads up on this one, @gargiulofrancesco!
Merging this PR will improve performance by 16.39%

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | slice_empty_vortex | 339.4 ns | 397.8 ns | -14.66% |
| ⚡ | Simulation | chunked_bool_canonical_into[(1000, 10)] | 26.3 µs | 15.9 µs | +65.8% |
| ⚡ | Simulation | encode_varbin[(1000, 32)] | 163.7 µs | 146.9 µs | +11.45% |

Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing ad/fix-cuda-bitpacked-slice-offset (f8a35cc) with develop (a9f77d1)
Footnotes
- 4 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.