Batch GDS-fallback per-tile GPU to host transfer (#1552)#1555
Merged
Conversation
When GDS reads compressed tiles to GPU but nvCOMP cannot decompress them on-device (LZW or any codec other than ZSTD/deflate), the tiles have to come back to host memory for the CPU decoder. Previously this was done with a per-tile `.get().tobytes()` loop, where each call is its own D2H copy on the default stream; the transfers serialise and per-DMA setup overhead dominates wall time when there are many tiles.

This PR replaces the loop with a single `cupy.concatenate` plus one `.get()`, then slices the host bytes by per-tile offsets. This mirrors the batched D2H pattern already used on the deflate path at `_gpu_decode.py` lines ~2317-2330. Measured 3.2x speedup on a 256-tile / ~10 MiB workload locally, exceeding the ~1.66x quoted in the H2D batched-upload comment. The copy logic lives in a new `_batched_d2h_to_bytes` helper so it is unit-testable without kvikio installed. New tests cover empty input, a single tile, a zero-size tile in a list, and the many-small-tiles regime that motivates the batching.
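The batch-then-slice pattern described above can be sketched as follows. This is an illustrative CPU-side sketch, not the PR's actual code: numpy arrays stand in for cupy device buffers so it runs without a GPU, and the function name only mirrors the real `_batched_d2h_to_bytes` helper.

```python
import numpy as np

def batched_d2h_to_bytes(dev_bufs):
    """Sketch: one concatenate + one transfer, then slice per-tile bytes.

    numpy stands in for cupy here; with cupy, np.concatenate becomes
    cupy.concatenate and the single device-to-host copy is the one
    .get() call on the staged buffer.
    """
    if not dev_bufs:
        return []
    sizes = [int(b.size) for b in dev_bufs]
    # One staging buffer instead of N separate copies.
    staged = np.concatenate([b.ravel() for b in dev_bufs])
    # With cupy this would be staged.get().tobytes() -- the single D2H copy.
    host = staged.tobytes()
    out, offset = [], 0
    for n in sizes:  # slice the host bytes back into per-tile payloads
        out.append(host[offset:offset + n])
        offset += n
    return out
```

Zero-size tiles fall out naturally: an empty buffer contributes nothing to the staged array and its slice is `b""`.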
Contributor
Pull request overview
This PR optimizes the GeoTIFF GPU decode pipeline’s GDS (kvikIO) fallback path by batching device-to-host transfers of compressed tiles, reducing per-tile .get() overhead when nvCOMP can’t decompress on-device (e.g., LZW and other non-ZSTD codecs).
Changes:
- Added a `_batched_d2h_to_bytes` helper to concatenate device buffers and perform a single D2H transfer, returning per-tile `bytes`.
- Switched the GDS fallback branch in `gpu_decode_tiles_from_file` to use the new batched helper instead of a per-tile `.get().tobytes()` loop.
- Added a new regression test module covering correctness across empty, single, mixed-size, zero-size, and many-tile scenarios.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `xrspatial/geotiff/_gpu_decode.py` | Introduces a batched D2H helper and wires it into the GDS fallback path to reduce transfer overhead. |
| `xrspatial/geotiff/tests/test_gds_fallback_batched_d2h_1552.py` | Adds GPU-skipped regression tests validating that the batched D2H behavior matches the previous per-tile baseline. |
- Docstring no longer references the fictional `_try_nvcomp_compress_from_device_bufs`; it now points at the symmetric H2D batched-upload pattern in `_try_nvcomp_decompress`.
- Added `_check_gpu_memory(total_bytes, "batched D2H staging buffer")` before the `cupy.concatenate`. The concat allocates `sum(sizes)` bytes, which the prior per-tile `.get()` loop did not, so the new peak-VRAM bump now fails early with a clear OOM message instead of inside cupy.
- New `test_batched_d2h_checks_gpu_memory_before_concat` pins the contract: it monkeypatches `_check_gpu_memory` to raise and asserts the function propagates the error before any cupy allocation, with the right total byte count and a meaningful "what" label.
- The test module docstring also no longer references the stale lines 2317-2330 anchor.
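The check-before-allocate contract described in this comment can be sketched like so. Everything here is a hedged stand-in: the real `_check_gpu_memory` presumably queries the driver for free VRAM itself (its signature per the comment is `(total_bytes, what)`), whereas this sketch takes `free_bytes` explicitly and uses numpy so it runs without a GPU.

```python
import numpy as np

class GpuMemoryError(RuntimeError):
    """Stand-in for the OOM error the real _check_gpu_memory raises."""

def check_gpu_memory(total_bytes, what, free_bytes):
    # Assumed shape of the helper: fail fast with a labelled message if
    # the staging allocation would not fit in free device memory.
    if total_bytes > free_bytes:
        raise GpuMemoryError(
            f"{what}: need {total_bytes} bytes but only {free_bytes} free")

def batched_transfer(bufs, free_bytes):
    total = sum(int(b.size) for b in bufs)
    # Contract pinned by the test: the check runs BEFORE any concatenation,
    # so an OOM surfaces with a clear message rather than deep inside cupy.
    check_gpu_memory(total, "batched D2H staging buffer", free_bytes)
    return np.concatenate([b.ravel() for b in bufs]).tobytes()
```

The point of ordering the check first is that the concatenate is the *new* allocation this PR introduces; failing before it keeps the error attributable to the staging buffer rather than to an arbitrary later cupy call.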
Summary
- Replaces the per-tile `.get().tobytes()` loop in the GDS fallback branch of `gpu_decode_tiles_from_file` with one `cupy.concatenate` + one D2H transfer.
- Extracts the copy into `_batched_d2h_to_bytes` so it can be tested without kvikio installed.

Why
When GDS reads compressed tiles to GPU but nvCOMP can't decompress them on-device (LZW and any codec other than ZSTD/deflate), the tiles have to come back to host memory for the CPU decoder. Each `.get()` is its own copy on the default stream, so they serialise and per-DMA setup overhead dominates with many tiles.

Measured speedup
256 tiles, ~10 MiB total (representative LZW COG fallback workload):
The 3.21x exceeds the ~1.66x quoted in the H2D batched-upload comment because there's no decompress kernel between transfers here, so DMA setup is a larger fraction of the total.
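A back-of-envelope latency model makes the argument concrete: a fixed per-DMA setup cost is paid N times in the per-tile loop but once in the batched copy, so the smaller the tiles, the larger the win. All constants below are illustrative assumptions, not measurements from this PR.

```python
def transfer_time(n_tiles, tile_bytes, setup_s, bandwidth_bps, batched):
    """Toy model: each DMA pays a fixed setup cost plus size/bandwidth."""
    if batched:
        # One setup, one large copy of all tiles at once.
        return setup_s + (n_tiles * tile_bytes) / bandwidth_bps
    # Setup cost paid once per tile; copies serialise on one stream.
    return n_tiles * (setup_s + tile_bytes / bandwidth_bps)

# Illustrative constants (assumed): 20 us DMA setup, 10 GB/s effective D2H,
# matching the workload shape above (256 tiles, ~10 MiB total).
per_tile = transfer_time(256, 40 * 1024, 20e-6, 10e9, batched=False)
batched = transfer_time(256, 40 * 1024, 20e-6, 10e9, batched=True)
speedup = per_tile / batched
```

With these numbers the model predicts a speedup of several x; the measured 3.21x is lower because real transfers have costs the model ignores, but the trend (batching helps most when per-transfer setup dominates) is the same.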
Test plan
- `xrspatial/geotiff/tests/test_gds_fallback_batched_d2h_1552.py`: 5 tests covering empty input, equality with the old per-tile baseline, single tile, zero-size tile in a list, and the many-small-tiles regime.
- `pytest xrspatial/geotiff/tests/` is green for everything touched by this change. The 3 pre-existing `TestPalette` failures in `test_features.py` reproduce on `main` and are unrelated (deepcopy recursion in palette plotting).