Batch host->device upload in GPU TIFF decode for nvCOMP path by brendancol · Pull Request #1528 · xarray-contrib/xarray-spatial

brendancol · 2026-05-08T20:42:09Z

Summary

Replaces per-tile cupy.asarray(np.frombuffer(t, np.uint8)) in
_try_nvcomp_batch_decompress with a single concatenated host buffer
and one H2D transfer. Per-tile device pointers are derived as
base_ptr + offsets. Both call sites are touched: the
kvikio.nvcomp.DeflateManager fallback (L976-988 of the original
file) and the direct ctypes nvCOMP entry (L1027 of the original file).
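
A minimal sketch of the concatenate-then-upload idea described above, not the code from this PR. The helper names `concat_tiles` and `upload_batched` are hypothetical; only `cupy.asarray`, `np.frombuffer`, and the `base_ptr + offsets` derivation come from the description. CuPy is treated as optional so the host-side packing can be checked without a GPU.

```python
import numpy as np

try:  # CuPy is optional so the host-side packing can be run without a GPU.
    import cupy
except ImportError:
    cupy = None


def concat_tiles(raw_tiles):
    """Pack compressed tiles into one contiguous uint8 host buffer.

    Returns (host_buf, offsets, sizes), where offsets[i] is the byte
    offset of tile i inside host_buf. Hypothetical helper.
    """
    sizes = np.fromiter((len(t) for t in raw_tiles), dtype=np.int64,
                        count=len(raw_tiles))
    offsets = np.zeros(len(raw_tiles), dtype=np.int64)
    np.cumsum(sizes[:-1], out=offsets[1:])  # exclusive prefix sum of sizes
    host_buf = np.frombuffer(b"".join(raw_tiles), dtype=np.uint8)
    return host_buf, offsets, sizes


def upload_batched(raw_tiles):
    """One H2D copy for all tiles; returns None when CuPy is unavailable."""
    host_buf, offsets, sizes = concat_tiles(raw_tiles)
    if cupy is None:
        return None
    d_buf = cupy.asarray(host_buf)        # the single host->device transfer
    base_ptr = np.uint64(d_buf.data.ptr)  # raw device address of byte 0
    d_ptrs = base_ptr + offsets.astype(np.uint64)
    # d_buf must stay alive for as long as d_ptrs are in use.
    return d_buf, d_ptrs, sizes
```

The sketch packs tiles back to back with no padding. Whether the decompressor needs aligned per-tile pointers is not covered here.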

Reference pattern

The fix mirrors the LZW/Deflate concat-then-upload pattern already
in the same file at xrspatial/geotiff/_gpu_decode.py L1714-1722:
build a single contiguous host buffer, do one cupy.asarray, then
build per-tile views or pointers from (base_ptr + offsets).

Measurement

From the perf audit:

Workload: 256 tiles x 64 KB.
Before (per-tile cupy.asarray): 6.07 ms
After (batched): 3.65 ms
~1.66x speedup. The per-tile path scales worse as the tile count
grows because it issues one cupy.asarray call per tile, so its CUDA
driver dispatch count is O(n) in the number of tiles.
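
The figures above can be turned into a rough per-tile overhead estimate. This is arithmetic on the reported numbers only. It assumes "64 KB" means 64 KiB and that the whole difference is attributable to per-tile upload overhead, which the audit does not state.

```python
# Reported figures from the perf audit.
n_tiles = 256
tile_kib = 64          # assumption: "64 KB" read as 64 KiB
before_ms = 6.07       # per-tile cupy.asarray
after_ms = 3.65        # batched upload

speedup = before_ms / after_ms
# Assumption: all of the saved time is per-tile upload overhead.
saved_per_tile_us = (before_ms - after_ms) / n_tiles * 1000.0
payload_mib = n_tiles * tile_kib / 1024

print(f"speedup: {speedup:.2f}x")
print(f"saved per tile: {saved_per_tile_us:.2f} us")
print(f"payload: {payload_mib:.0f} MiB")
```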

Tests

test_nvcomp_batch_upload_p3.py:
- Bit-exact correctness across 256x256 / 1024x1024 / 2048x2048
  deflate-tiled TIFFs (CPU read_to_array vs read_geotiff_gpu).
- Performance regression guard: 2048x2048 deflate-tiled GPU
  decode under 200 ms.
Existing GPU decode tests pass: test_gpu_byteswap_1508,
test_predictor2_big_endian, test_predictor3_big_endian,
test_predictor_multisample (51 tests, plus 54 in compression /
GPU strict tests).
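
The two kinds of check listed above (bit-exact comparison and a wall-clock guard) could look roughly like the helpers below. Both `assert_bit_exact` and `best_elapsed_ms` are hypothetical names and are not taken from test_nvcomp_batch_upload_p3.py. The decoders themselves are left out, so the sketch runs on NumPy alone.

```python
import time

import numpy as np


def assert_bit_exact(expected, actual):
    """Fail unless both arrays match in shape, dtype, and raw bytes.

    Comparing raw bytes (rather than values) also treats identical NaN
    bit patterns as equal and distinguishes 0.0 from -0.0.
    """
    expected = np.asarray(expected)
    actual = np.asarray(actual)
    assert expected.shape == actual.shape, "shape mismatch"
    assert expected.dtype == actual.dtype, "dtype mismatch"
    assert expected.tobytes() == actual.tobytes(), "byte mismatch"


def best_elapsed_ms(fn, repeats=3):
    """Best-of-N wall-clock time of fn() in milliseconds.

    For GPU work, fn should block until the device has finished
    (for example by copying the result back to the host); otherwise
    only the launch time is measured.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best
```

A guard in the spirit of the one described would then assert `best_elapsed_ms(decode) < 200`, where `decode` is whatever callable runs the GPU read. Taking the best of several runs is a design choice in this sketch to reduce noise from first-call warm-up.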

Test plan

- Existing GPU decode regression suite passes
- New test_nvcomp_batch_upload_p3.py correctness tests pass
  on 256 / 1024 / 2048 grid sizes
- New perf regression guard passes on 2048x2048 deflate-tiled
  TIFF
- CI green on a CUDA runner

Replaces per-tile `cupy.asarray(np.frombuffer(t, np.uint8))` in `_try_nvcomp_batch_decompress` (both the kvikio.nvcomp DeflateManager fallback and the direct ctypes nvCOMP entry) with a single concatenated host buffer and one H2D transfer. Per-tile device pointers are derived as `base_ptr + offsets`, mirroring the pattern already used in the LZW/Deflate path at `_gpu_decode.py` L1714-1722. Measured for 256 tiles x 64 KB: 6.07 ms (per-tile) -> 3.65 ms (batched), ~1.66x speedup. The win scales with tile count because per-tile `cupy.asarray` costs O(n) CUDA driver dispatches. Adds `test_nvcomp_batch_upload_p3.py`: bit-exact correctness across 256/1024/2048 sizes and a 200 ms regression guard on a 2048x2048 deflate-tiled TIFF. Existing GPU decode tests (test_gpu_byteswap_1508, test_predictor2/3_big_endian, test_predictor_multisample) all pass.

Copilot

Pull request overview

Optimizes the GeoTIFF GPU nvCOMP decompression path by batching host→device uploads of compressed tiles to reduce per-tile CUDA dispatch overhead, and adds regression tests intended to validate correctness and guard against performance regressions.

Changes:

- Refactors _try_nvcomp_batch_decompress to concatenate all compressed tiles into a single host buffer and perform a single cupy.asarray transfer, deriving per-tile device views/pointers from offsets (both kvikio fallback and direct ctypes nvCOMP paths).
- Avoids per-tile output allocations in the direct nvCOMP ctypes path by allocating one contiguous decompressed output buffer.
- Adds a new GPU test module with correctness coverage and a timing-based regression guard for the nvCOMP batch-upload optimization.
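
The contiguous output buffer mentioned in the second item can be illustrated with a fixed-stride layout. This is a sketch under the assumption that every tile decompresses to at most `tile_bytes`. A NumPy array stands in for the device buffer so the example runs without a GPU, and `output_tile_ptrs` is a hypothetical helper.

```python
import numpy as np


def output_tile_ptrs(base_ptr, n_tiles, tile_bytes):
    """Per-tile output addresses inside one contiguous buffer.

    Tile i starts at base_ptr + i * tile_bytes (fixed stride).
    """
    offsets = np.arange(n_tiles, dtype=np.uint64) * np.uint64(tile_bytes)
    return np.uint64(base_ptr) + offsets


n_tiles, tile_bytes = 4, 16
# One allocation for all decompressed tiles (host stand-in for device memory).
out = np.zeros(n_tiles * tile_bytes, dtype=np.uint8)
ptrs = output_tile_ptrs(out.ctypes.data, n_tiles, tile_bytes)

# Reshaping gives per-tile views that share memory with the single buffer.
tile_views = out.reshape(n_tiles, tile_bytes)
tile_views[2, :] = 7  # writing through a view lands in `out`
```

With CuPy, the address would come from the array's `data.ptr` attribute instead of `ctypes.data`.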

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
`xrspatial/geotiff/_gpu_decode.py`	Implements batched compressed-tile upload and contiguous output buffer strategy for nvCOMP batch decompress paths.
`xrspatial/geotiff/tests/test_nvcomp_batch_upload_p3.py`	Adds GPU-only correctness tests and a performance regression guard for the nvCOMP batched upload change.


        import cupy
        try:
            raw_tiles = []
            for tile in compressed_tiles:
                raw_tiles.append(tile[2:-4] if len(tile) > 6 else tile)
            manager = nvcomp.DeflateManager(chunk_size=tile_bytes)
-            d_compressed = [cupy.asarray(np.frombuffer(t, dtype=np.uint8))
-                            for t in raw_tiles]
+            # Batch host->device upload: concatenate all tiles into one host
+            # buffer, then a single cupy.asarray transfer. Mirrors the
+            # LZW/Deflate concat-then-upload pattern below (~L1714-1722).


Three findings addressed:
- Source: gate the kvikio.nvcomp fallback to Deflate-only. The block
  unconditionally stripped a 2-byte zlib header + 4-byte adler32 from
  every tile and ran them through DeflateManager, even when compression
  was 50000 (ZSTD). For ZSTD that corrupts the frame and at best wastes
  time inside a DeflateManager that returns None on the resulting decode
  failure. ZSTD now returns None from the kvikio branch so the caller
  picks another decoder.
- Tests: tighten the skip condition so the suite only runs when
  libnvcomp or kvikio.nvcomp is actually importable on the host. Without
  one of those backends the optimised path returns None and the changed
  code is never exercised, so a passing test on a cupy-only host would
  be misleading.
- Tests: wrap _try_nvcomp_batch_decompress with a call recorder and
  assert it returned non-None at least once during the timed and
  correctness calls. A silent fall-through to the slow numba kernel is
  now a test failure.
- Tests: add test_nvcomp_kvikio_fallback_skips_zstd which exercises the
  new ZSTD gate by monkeypatching _get_nvcomp -> None and asserting the
  kvikio branch returns None for compression=50000. Skips when
  kvikio.nvcomp is not actually importable on the host.
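
The Deflate-only gate and the zlib-wrapper strip can be sketched with the standard library alone. This is an illustration, not the PR's code: `strip_zlib_wrapper` and `raw_deflate_tiles` are hypothetical names, and accepting only TIFF compression code 8 is an assumption about which codes the real gate allows. The slice expression and the ZSTD code 50000 come from the diff and the comment above.

```python
import zlib

COMPRESSION_DEFLATE = 8     # TIFF Adobe Deflate; assumed to be the gated code
COMPRESSION_ZSTD = 50000    # ZSTD, as stated in the comment above


def strip_zlib_wrapper(tile):
    """Drop the 2-byte zlib header and 4-byte Adler-32 trailer."""
    return tile[2:-4] if len(tile) > 6 else tile


def raw_deflate_tiles(compressed_tiles, compression):
    """Return raw Deflate streams, or None for any non-Deflate codec.

    Returning None lets the caller fall through to another decoder
    instead of corrupting, for example, ZSTD frames.
    """
    if compression != COMPRESSION_DEFLATE:
        return None
    return [strip_zlib_wrapper(t) for t in compressed_tiles]


payload = b"tile-bytes " * 100
tile = zlib.compress(payload)  # zlib-wrapped Deflate stream

raw = raw_deflate_tiles([tile], COMPRESSION_DEFLATE)[0]
# wbits=-15 tells zlib to expect a raw Deflate stream with no wrapper.
roundtrip = zlib.decompress(raw, wbits=-15)
skipped = raw_deflate_tiles([tile], COMPRESSION_ZSTD)
```

Here `roundtrip` equals `payload`, which confirms that the stripped bytes form a valid raw Deflate stream, and `skipped` is None because the codec is not Deflate.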

github-actions bot added the performance label (PR touches performance-sensitive code) May 8, 2026

brendancol requested a review from Copilot May 9, 2026 01:14

Copilot started reviewing on behalf of brendancol May 9, 2026 01:14

Copilot AI reviewed May 9, 2026


brendancol merged commit f11e1e4 into xarray-contrib:main May 9, 2026
11 checks passed

brendancol merged 2 commits into xarray-contrib:main from brendancol:perf/nvcomp-batch-h2d-upload
