Batch host->device upload in GPU TIFF decode for nvCOMP path #1528

Merged
brendancol merged 2 commits into xarray-contrib:main from brendancol:perf/nvcomp-batch-h2d-upload
May 9, 2026

@brendancol
Contributor

Summary

Replaces per-tile cupy.asarray(np.frombuffer(t, np.uint8)) in
_try_nvcomp_batch_decompress with a single concatenated host buffer
and one H2D transfer. Per-tile device pointers are derived as
base_ptr + offsets. Both call sites are touched: the
kvikio.nvcomp.DeflateManager fallback (L976-988 of the original
file) and the direct ctypes nvCOMP entry (L1027 of the original file).
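
A minimal sketch of the concat-then-upload idea described above, not the code from `_gpu_decode.py`. The helper names `pack_tiles` and `upload_once` are hypothetical, and the CuPy import is deferred so the host-side packing can be run without a GPU:

```python
import numpy as np


def pack_tiles(tiles):
    """Concatenate compressed tiles (bytes) into one contiguous host buffer.

    Returns (host, offsets, sizes); offsets[i] is the byte position of
    tile i inside host.
    """
    sizes = np.array([len(t) for t in tiles], dtype=np.int64)
    offsets = np.zeros(len(tiles), dtype=np.int64)
    if len(tiles) > 1:
        # Exclusive prefix sum: each tile starts where the previous one ends.
        offsets[1:] = np.cumsum(sizes[:-1])
    host = np.frombuffer(b"".join(tiles), dtype=np.uint8)
    return host, offsets, sizes


def upload_once(tiles):
    """Upload every tile with a single host->device copy (needs CuPy and a GPU)."""
    import cupy  # deferred import: only required when a GPU is present

    host, offsets, sizes = pack_tiles(tiles)
    d_buf = cupy.asarray(host)  # one H2D transfer for all tiles
    base_ptr = np.uint64(d_buf.data.ptr)  # raw device address of the buffer
    tile_ptrs = base_ptr + offsets.astype(np.uint64)  # base_ptr + offsets
    # d_buf must stay alive for as long as tile_ptrs are in use.
    return d_buf, tile_ptrs, sizes


# Host-side packing only, so this part runs without CUDA.
tiles = [b"abc", b"de", b"fghi"]
host, offsets, sizes = pack_tiles(tiles)
```

Keeping the device buffer in the return value is deliberate: the derived pointers do not own memory, so dropping `d_buf` would leave them dangling.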

Reference pattern

The fix mirrors the LZW/Deflate concat-then-upload pattern already
in the same file at xrspatial/geotiff/_gpu_decode.py L1714-1722:
build a single contiguous host buffer, do one cupy.asarray, then
build per-tile views or pointers from (base_ptr + offsets).

Measurement

From the perf audit:

  • Before (per-tile cupy.asarray), 256 tiles x 64 KB: 6.07 ms
  • After (batched), same workload: 3.65 ms
  • ~1.66x speedup; the gap widens with more tiles because the per-tile
    path issues one cupy.asarray call per tile, so its CUDA driver
    dispatch count grows as O(n).
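
The reported figures can be checked with a few lines of arithmetic. The per-tile saving below assumes the whole difference is dispatch overhead spread evenly over the tiles, which the audit does not state explicitly:

```python
per_tile_ms = 6.07  # reported: per-tile cupy.asarray
batched_ms = 3.65   # reported: single batched upload
n_tiles = 256
tile_kb = 64

speedup = per_tile_ms / batched_ms
# Assumption: the entire saving is per-tile dispatch overhead.
saved_us_per_tile = (per_tile_ms - batched_ms) * 1000.0 / n_tiles
payload_mib = n_tiles * tile_kb / 1024

print(f"speedup: {speedup:.2f}x")
print(f"saved per tile: {saved_us_per_tile:.2f} us")
print(f"payload: {payload_mib:.0f} MiB")
```

Under that assumption the saving is roughly 9.5 microseconds per tile on a 16 MiB payload.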

Tests

  • test_nvcomp_batch_upload_p3.py:
    • Bit-exact correctness across 256x256 / 1024x1024 / 2048x2048
      deflate-tiled TIFFs (CPU read_to_array vs read_geotiff_gpu).
    • Performance regression guard: 2048x2048 deflate-tiled GPU
      decode under 200 ms.
  • Existing GPU decode tests pass: test_gpu_byteswap_1508,
    test_predictor2_big_endian, test_predictor3_big_endian,
    test_predictor_multisample (51 tests, plus 54 in compression /
    GPU strict tests).
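
A timing-based regression guard of the kind listed above could be structured as in the sketch below. This is an illustration under assumptions, not the contents of test_nvcomp_batch_upload_p3.py: `best_of` and `assert_under_budget` are hypothetical helpers, and the workload is a CPU stand-in for the GPU read.

```python
import time


def best_of(fn, repeats=5):
    """Return the fastest wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def assert_under_budget(fn, budget_s, repeats=5):
    """Fail if the best observed run of `fn` is not under `budget_s` seconds."""
    elapsed = best_of(fn, repeats)
    assert elapsed < budget_s, (
        f"{elapsed * 1e3:.1f} ms exceeds the {budget_s * 1e3:.0f} ms budget"
    )
    return elapsed


# Stand-in workload; a real guard would call the GPU reader on a
# 2048x2048 deflate-tiled TIFF here.
elapsed = assert_under_budget(lambda: sum(range(1000)), budget_s=0.200)
```

Taking the best of several runs absorbs one-off costs such as kernel compilation on the first call. For GPU work, `fn` would also need to synchronize the device before returning; otherwise only the launch time is measured.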

Test plan

  • Existing GPU decode regression suite passes
  • New test_nvcomp_batch_upload_p3.py correctness tests pass
    on 256 / 1024 / 2048 grid sizes
  • New perf regression guard passes on 2048x2048 deflate-tiled
    TIFF
  • CI green on a CUDA runner

Replaces per-tile `cupy.asarray(np.frombuffer(t, np.uint8))` in
`_try_nvcomp_batch_decompress` (both the kvikio.nvcomp DeflateManager
fallback and the direct ctypes nvCOMP entry) with a single concatenated
host buffer and one H2D transfer. Per-tile device pointers are derived
as `base_ptr + offsets`, mirroring the pattern already used in the
LZW/Deflate path at `_gpu_decode.py` L1714-1722.

Measured for 256 tiles x 64 KB: 6.07 ms (per-tile) -> 3.65 ms (batched),
~1.66x speedup. The win scales with tile count because per-tile
`cupy.asarray` costs O(n) CUDA driver dispatches.

Adds `test_nvcomp_batch_upload_p3.py`: bit-exact correctness across
256/1024/2048 sizes and a 200 ms regression guard on a 2048x2048
deflate-tiled TIFF. Existing GPU decode tests
(test_gpu_byteswap_1508, test_predictor2/3_big_endian,
test_predictor_multisample) all pass.

@github-actions bot added the performance label ("PR touches performance-sensitive code") on May 8, 2026
@brendancol requested a review from Copilot on May 9, 2026 01:14

Copilot AI left a comment

Pull request overview

Optimizes the GeoTIFF GPU nvCOMP decompression path by batching host→device uploads of compressed tiles to reduce per-tile CUDA dispatch overhead, and adds regression tests intended to validate correctness and guard against performance regressions.

Changes:

  • Refactors _try_nvcomp_batch_decompress to concatenate all compressed tiles into a single host buffer and perform a single cupy.asarray transfer, deriving per-tile device views/pointers from offsets (both kvikio fallback and direct ctypes nvCOMP paths).
  • Avoids per-tile output allocations in the direct nvCOMP ctypes path by allocating one contiguous decompressed output buffer.
  • Adds a new GPU test module with correctness coverage and a timing-based regression guard for the nvCOMP batch-upload optimization.
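
The contiguous-output strategy mentioned in the second bullet can be sketched as follows. This assumes every decompressed tile occupies the same number of bytes. The function names are hypothetical, and NumPy stands in for CuPy, which exposes the same `empty` and `reshape` calls, so the example runs on a host without a GPU:

```python
import numpy as np


def allocate_outputs(n_tiles, tile_bytes, xp=np):
    """Allocate one flat output buffer and expose it as per-tile rows.

    Pass xp=cupy to allocate on the device instead of the host.
    """
    buf = xp.empty(n_tiles * tile_bytes, dtype=xp.uint8)
    views = buf.reshape(n_tiles, tile_bytes)  # no copy: rows alias buf
    return buf, views


def fixed_stride_offsets(n_tiles, tile_bytes):
    """Byte offset of each tile's output slot inside the flat buffer."""
    return np.arange(n_tiles, dtype=np.int64) * tile_bytes


buf, views = allocate_outputs(4, 8)
views[2, :] = 7  # writing through a row lands in the flat buffer
out_offsets = fixed_stride_offsets(4, 8)
```

One allocation replaces n small ones, and the per-tile output pointers follow from the base address plus `out_offsets`.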

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| xrspatial/geotiff/_gpu_decode.py | Implements batched compressed-tile upload and contiguous output buffer strategy for nvCOMP batch decompress paths. |
| xrspatial/geotiff/tests/test_nvcomp_batch_upload_p3.py | Adds GPU-only correctness tests and a performance regression guard for the nvCOMP batched upload change. |

Comment thread xrspatial/geotiff/tests/test_nvcomp_batch_upload_p3.py Outdated
Comment thread xrspatial/geotiff/tests/test_nvcomp_batch_upload_p3.py Outdated
Comment on lines 974 to +982
import cupy
try:
    raw_tiles = []
    for tile in compressed_tiles:
        raw_tiles.append(tile[2:-4] if len(tile) > 6 else tile)
    manager = nvcomp.DeflateManager(chunk_size=tile_bytes)
    d_compressed = [cupy.asarray(np.frombuffer(t, dtype=np.uint8))
                    for t in raw_tiles]
    # Batch host->device upload: concatenate all tiles into one host
    # buffer, then a single cupy.asarray transfer. Mirrors the
    # LZW/Deflate concat-then-upload pattern below (~L1714-1722).

Three findings addressed:

- Source: gate the kvikio.nvcomp fallback to Deflate-only. The block
  unconditionally stripped a 2-byte zlib header + 4-byte adler32 from
  every tile and ran them through DeflateManager, even when compression
  was 50000 (ZSTD). For ZSTD that corrupts the frame; at best the
  DeflateManager decode fails, the branch returns None, and the time is
  wasted. The kvikio branch now returns None up front for ZSTD so the
  caller picks another decoder.
- Tests: tighten the skip condition so the suite only runs when
  libnvcomp or kvikio.nvcomp is actually importable on the host.
  Without one of those backends the optimised path returns None and
  the changed code is never exercised, so a passing test on a
  cupy-only host would be misleading.
- Tests: wrap _try_nvcomp_batch_decompress with a call recorder and
  assert it returned non-None at least once during the timed and
  correctness calls. A silent fall-through to the slow numba kernel
  is now a test failure.
- Tests: add test_nvcomp_kvikio_fallback_skips_zstd which exercises
  the new ZSTD gate by monkeypatching _get_nvcomp -> None and
  asserting the kvikio branch returns None for compression=50000.
  Skips when kvikio.nvcomp is not actually importable on the host.
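
The Deflate-only gate from the first finding can be illustrated with the standard library alone. This is a sketch under assumptions: `prepare_for_deflate_manager` is a hypothetical name, and treating TIFF Compression values 8 and 32946 as the Deflate codes is an assumption about what the gate accepts (the commit message only names 50000 as ZSTD).

```python
import zlib

# Assumed Deflate codes for the TIFF Compression tag; 50000 is ZSTD.
DEFLATE_CODES = {8, 32946}
ZSTD_CODE = 50000


def strip_zlib_wrapper(tile):
    """Drop the 2-byte zlib header and 4-byte Adler-32 trailer."""
    return tile[2:-4] if len(tile) > 6 else tile


def prepare_for_deflate_manager(compressed_tiles, compression):
    """Return raw Deflate streams, or None when the codec is not Deflate."""
    if compression not in DEFLATE_CODES:
        # ZSTD frames have no zlib wrapper; stripping would corrupt them.
        return None
    return [strip_zlib_wrapper(t) for t in compressed_tiles]


payload = b"geotiff tile " * 100
tile = zlib.compress(payload)  # zlib-wrapped Deflate, as stored in the TIFF
raw = prepare_for_deflate_manager([tile], compression=8)
skipped = prepare_for_deflate_manager([tile], compression=ZSTD_CODE)
```

Decoding `raw[0]` with `zlib.decompress(raw[0], -15)` (raw Deflate, no wrapper) recovers the payload, which confirms the strip is only valid for zlib-wrapped tiles.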
@brendancol brendancol merged commit f11e1e4 into xarray-contrib:main May 9, 2026
11 checks passed