Batch host->device upload in GPU TIFF decode for nvCOMP path #1528
Merged
brendancol merged 2 commits into xarray-contrib:main on May 9, 2026
Replaces per-tile `cupy.asarray(np.frombuffer(t, np.uint8))` in `_try_nvcomp_batch_decompress` (both the kvikio.nvcomp DeflateManager fallback and the direct ctypes nvCOMP entry) with a single concatenated host buffer and one H2D transfer. Per-tile device pointers are derived as `base_ptr + offsets`, mirroring the pattern already used in the LZW/Deflate path at `_gpu_decode.py` L1714-1722.

Measured for 256 tiles x 64 KB: 6.07 ms (per-tile) -> 3.65 ms (batched), ~1.66x speedup. The win scales with tile count because the per-tile path issues one `cupy.asarray` call per tile, so its cost is O(n) in CUDA driver dispatches.

Adds `test_nvcomp_batch_upload_p3.py`: bit-exact correctness across 256/1024/2048 sizes and a 200 ms regression guard on a 2048x2048 deflate-tiled TIFF. Existing GPU decode tests (test_gpu_byteswap_1508, test_predictor2/3_big_endian, test_predictor_multisample) all pass.
Pull request overview
Optimizes the GeoTIFF GPU nvCOMP decompression path by batching host→device uploads of compressed tiles to reduce per-tile CUDA dispatch overhead, and adds regression tests intended to validate correctness and guard against performance regressions.
Changes:
- Refactors `_try_nvcomp_batch_decompress` to concatenate all compressed tiles into a single host buffer and perform a single `cupy.asarray` transfer, deriving per-tile device views/pointers from offsets (both kvikio fallback and direct ctypes nvCOMP paths).
- Avoids per-tile output allocations in the direct nvCOMP ctypes path by allocating one contiguous decompressed output buffer.
- Adds a new GPU test module with correctness coverage and a timing-based regression guard for the nvCOMP batch-upload optimization.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `xrspatial/geotiff/_gpu_decode.py` | Implements batched compressed-tile upload and contiguous output buffer strategy for nvCOMP batch decompress paths. |
| `xrspatial/geotiff/tests/test_nvcomp_batch_upload_p3.py` | Adds GPU-only correctness tests and a performance regression guard for the nvCOMP batched upload change. |
Comment on lines 974 to +982

    import cupy
    try:
        raw_tiles = []
        for tile in compressed_tiles:
            raw_tiles.append(tile[2:-4] if len(tile) > 6 else tile)
        manager = nvcomp.DeflateManager(chunk_size=tile_bytes)
        d_compressed = [cupy.asarray(np.frombuffer(t, dtype=np.uint8))
                        for t in raw_tiles]
        # Batch host->device upload: concatenate all tiles into one host
        # buffer, then a single cupy.asarray transfer. Mirrors the
        # LZW/Deflate concat-then-upload pattern below (~L1714-1722).
Three findings addressed:

- Source: gate the kvikio.nvcomp fallback to Deflate-only. The block unconditionally stripped a 2-byte zlib header + 4-byte adler32 from every tile and ran them through DeflateManager, even when compression was 50000 (ZSTD). For ZSTD that corrupts the frame and at best wastes time inside a DeflateManager that returns None on the resulting decode failure. ZSTD now returns None from the kvikio branch so the caller picks another decoder.
- Tests: tighten the skip condition so the suite only runs when libnvcomp or kvikio.nvcomp is actually importable on the host. Without one of those backends the optimised path returns None and the changed code is never exercised, so a passing test on a cupy-only host would be misleading.
- Tests: wrap `_try_nvcomp_batch_decompress` with a call recorder and assert it returned non-None at least once during the timed and correctness calls. A silent fall-through to the slow numba kernel is now a test failure.
- Tests: add `test_nvcomp_kvikio_fallback_skips_zstd`, which exercises the new ZSTD gate by monkeypatching `_get_nvcomp` -> None and asserting the kvikio branch returns None for compression=50000. Skips when kvikio.nvcomp is not actually importable on the host.
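The Deflate-only gate can be illustrated with a small host-side sketch. This is not the code from `_gpu_decode.py`: the helper name `raw_deflate_or_none` and the `DEFLATE_CODES` set are hypothetical, and treating both TIFF Compression values 8 (Adobe Deflate) and 32946 (legacy Deflate) as Deflate is an assumption; only the `tile[2:-4] if len(tile) > 6 else tile` slice and the ZSTD code 50000 come from the PR text.

```python
import zlib

# Assumption: TIFF Compression tag values treated as Deflate in this sketch.
DEFLATE_CODES = {8, 32946}
ZSTD_CODE = 50000  # value quoted in the review follow-up


def raw_deflate_or_none(tile: bytes, compression: int):
    """Hypothetical helper: strip the zlib wrapper only for Deflate tiles.

    Returns None for any other codec so a caller can pick another decoder
    instead of feeding a mangled frame to a Deflate-only manager.
    """
    if compression not in DEFLATE_CODES:
        return None
    # 2-byte zlib header in front, 4-byte adler32 checksum at the end.
    return tile[2:-4] if len(tile) > 6 else tile


payload = bytes(range(256)) * 4
tile = zlib.compress(payload)

raw = raw_deflate_or_none(tile, 8)
# A negative wbits value tells zlib to expect a raw Deflate stream.
roundtrip = zlib.decompress(raw, wbits=-15)

skipped = raw_deflate_or_none(tile, ZSTD_CODE)
```

Here `roundtrip` equals `payload`, which shows the stripped bytes are still a valid raw Deflate stream, while `skipped` is `None` because the slice is never applied to a non-Deflate tile.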
Summary
Replaces per-tile `cupy.asarray(np.frombuffer(t, np.uint8))` in `_try_nvcomp_batch_decompress` with a single concatenated host buffer and one H2D transfer. Per-tile device pointers are derived as `base_ptr + offsets`. Both call sites are touched: the `kvikio.nvcomp.DeflateManager` fallback (L976-988 of the original file) and the direct ctypes nvCOMP entry (L1027 of the original file).
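A minimal sketch of the concat-then-upload idea, under stated assumptions: `pack_tiles` and `upload_batched` are hypothetical names rather than functions from `_gpu_decode.py`, and the device half needs CuPy and a CUDA device, so it is defined here but not called.

```python
import numpy as np


def pack_tiles(tiles):
    """Concatenate compressed tiles into one host buffer.

    Returns the uint8 buffer plus the start offset and size of each tile.
    """
    sizes = np.array([len(t) for t in tiles], dtype=np.int64)
    offsets = np.zeros(len(tiles), dtype=np.int64)
    # Exclusive prefix sum: tile i starts where tiles 0..i-1 end.
    np.cumsum(sizes[:-1], out=offsets[1:])
    host_buf = np.frombuffer(b"".join(tiles), dtype=np.uint8)
    return host_buf, offsets, sizes


def upload_batched(tiles):
    """One H2D transfer, then per-tile device pointers as base_ptr + offsets."""
    import cupy  # imported lazily; requires a CUDA-capable host

    host_buf, offsets, sizes = pack_tiles(tiles)
    d_buf = cupy.asarray(host_buf)  # the single host->device copy
    base_ptr = d_buf.data.ptr  # integer address of the device allocation
    ptrs = np.uint64(base_ptr) + offsets.astype(np.uint64)
    # d_buf must stay alive for as long as ptrs are in use.
    return d_buf, ptrs, sizes


tiles = [b"abc", b"", b"defgh"]
host_buf, offsets, sizes = pack_tiles(tiles)
third = bytes(host_buf[offsets[2]:offsets[2] + sizes[2]])
```

On the host side, `offsets` comes out as `[0, 3, 3]` and `third` as `b"defgh"`, so each tile is recoverable from the shared buffer by offset and size alone.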
Reference pattern
The fix mirrors the LZW/Deflate concat-then-upload pattern already in the same file at `xrspatial/geotiff/_gpu_decode.py` L1714-1722: build a single contiguous host buffer, do one `cupy.asarray`, then build per-tile views or pointers from `(base_ptr + offsets)`.
Measurement
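From the perf audit:
- Per-tile (`cupy.asarray`): 256 tiles x 64 KB = 6.07 ms
- Batched (one concatenated buffer, one transfer): 3.65 ms, ~1.66x speedup
- The win scales with tile count because per-tile `cupy.asarray` is O(n) in CUDA driver dispatches.
Tests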
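- New `test_nvcomp_batch_upload_p3.py`:
  - Bit-exact correctness on 256 / 1024 / 2048 deflate-tiled TIFFs (CPU `read_to_array` vs `read_geotiff_gpu`).
  - Regression guard: 2048x2048 deflate-tiled TIFF decode under 200 ms.
- Existing GPU decode tests pass: `test_gpu_byteswap_1508`, `test_predictor2_big_endian`, `test_predictor3_big_endian`, `test_predictor_multisample` (51 tests, plus 54 in compression / GPU strict tests).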
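A timing guard of this kind only means something if the fast path actually ran. The sketch below pairs a best-of-N timer with a call recorder, in plain Python with no GPU dependency. `record_calls`, `best_of`, and `fake_batch_decompress` are hypothetical stand-ins; in a real test the recorder would presumably wrap `_try_nvcomp_batch_decompress` via monkeypatching, which is an assumption about the test module rather than its actual code.

```python
import functools
import time


def record_calls(fn):
    """Wrap fn and keep every return value in wrapper.results."""
    results = []

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        out = fn(*args, **kwargs)
        results.append(out)
        return out

    wrapper.results = results
    return wrapper


def best_of(fn, repeats=3):
    """Return the fastest wall-clock time of several runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def fake_batch_decompress(tiles):
    # Hypothetical stand-in: returns None when it declines the work,
    # the way a decoder without a usable backend would.
    return None if not tiles else [bytes(t) for t in tiles]


recorded = record_calls(fake_batch_decompress)
elapsed = best_of(lambda: recorded([b"a", b"b"]))

BUDGET_S = 0.200  # the 200 ms budget from the PR
fast_path_ran = any(r is not None for r in recorded.results)
```

A test would then assert both `elapsed < BUDGET_S` and `fast_path_ran`, so a silent fall-through to a slower decoder fails the test instead of passing on timing alone.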
Test plan
- `test_nvcomp_batch_upload_p3.py` correctness tests pass on 256 / 1024 / 2048 grid sizes
- Regression guard passes (decode under 200 ms) on the 2048x2048 deflate-tiled TIFF