
Batch GDS-fallback per-tile GPU to host transfer (#1552) #1555

Merged
brendancol merged 2 commits into main from issue-1552 on May 11, 2026

Conversation

@brendancol
Contributor

Summary

  • Closes Batch GDS-fallback per-tile GPU to host transfer #1552. Replaces the per-tile .get().tobytes() loop in the GDS fallback branch of gpu_decode_tiles_from_file with one cupy.concatenate + one D2H transfer.
  • The helper lives as _batched_d2h_to_bytes so it can be tested without kvikio installed.
  • Mirrors the batched D2H pattern already in the same file on the deflate path (lines ~2317-2330).

Why

When GDS reads compressed tiles to GPU but nvCOMP can't decompress them on-device (LZW and any codec other than ZSTD/deflate), the tiles have to come back to host memory for the CPU decoder. Each .get() is its own copy on the default stream, so they serialise and per-DMA setup overhead dominates with many tiles.

Measured speedup

256 tiles, ~10 MiB total (representative LZW COG fallback workload):

per-tile .get() loop : 10.240 ms
batched concat + get :  3.191 ms
speedup              : 3.21x

The 3.21x exceeds the ~1.66x quoted in the H2D batched-upload comment because there's no decompress kernel between transfers here, so DMA setup is a larger fraction of the total.

Test plan

  • New xrspatial/geotiff/tests/test_gds_fallback_batched_d2h_1552.py: 5 tests covering empty input, equality with the old per-tile baseline, single tile, zero-size tile in a list, and the many-small-tiles regime.
  • Full pytest xrspatial/geotiff/tests/ is green for everything touched by this change. The 3 pre-existing TestPalette failures in test_features.py reproduce on main and are unrelated (deepcopy recursion in palette plotting).

When GDS reads compressed tiles to GPU but nvCOMP cannot decompress
them on-device (LZW or non-ZSTD/non-deflate codecs), the tiles have
to come back to host memory for the CPU decoder. Previously this was
done with a per-tile ``.get().tobytes()`` loop, where each call is
its own D2H copy on the default stream. The transfers serialise and
per-DMA setup overhead dominates wall time when there are many tiles.

Replace the loop with a single ``cupy.concatenate`` plus one ``.get()``,
then slice the host bytes by per-tile offsets. This mirrors the batched
D2H pattern already used on the deflate path at ``_gpu_decode.py``
lines ~2317-2330.

Measured 3.2x speedup on a 256-tile / ~10 MiB workload locally,
exceeding the ~1.66x quoted in the H2D batched-upload comment.

Adds ``_batched_d2h_to_bytes`` so the helper is unit-testable without
needing kvikio installed. New tests cover empty input, single tile,
zero-size tile in a list, and the many-small-tiles regime that
motivates the batching.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 10, 2026
@brendancol brendancol requested a review from Copilot May 11, 2026 03:21
Contributor

Copilot AI left a comment


Pull request overview

This PR optimizes the GeoTIFF GPU decode pipeline’s GDS (kvikIO) fallback path by batching device-to-host transfers of compressed tiles, reducing per-tile .get() overhead when nvCOMP can’t decompress on-device (e.g., LZW and other non-ZSTD codecs).

Changes:

  • Added _batched_d2h_to_bytes helper to concatenate device buffers and perform a single D2H transfer, returning per-tile bytes.
  • Switched the GDS fallback branch in gpu_decode_tiles_from_file to use the new batched helper instead of a per-tile .get().tobytes() loop.
  • Added a new regression test module covering correctness across empty, single, mixed-size, zero-size, and many-tile scenarios.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

  • xrspatial/geotiff/_gpu_decode.py: Introduces a batched D2H helper and wires it into the GDS fallback path to reduce transfer overhead.
  • xrspatial/geotiff/tests/test_gds_fallback_batched_d2h_1552.py: Adds GPU-skipped regression tests validating the batched D2H behavior matches the previous per-tile baseline.


Comment thread xrspatial/geotiff/_gpu_decode.py Outdated
Comment thread xrspatial/geotiff/_gpu_decode.py
- Docstring no longer references the fictional
  _try_nvcomp_compress_from_device_bufs. Now points at the symmetric
  H2D batched-upload pattern in _try_nvcomp_decompress.
- Added _check_gpu_memory(total_bytes, "batched D2H staging buffer")
  before the cupy.concatenate. The concat allocates sum(sizes) bytes,
  which the prior per-tile .get() loop did not, so the new peak VRAM
  bump now fails early with a clear OOM message instead of inside
  cupy.
- New test_batched_d2h_checks_gpu_memory_before_concat pins the
  contract: it monkeypatches _check_gpu_memory to raise and asserts
  the function propagates before any cupy allocation, with the right
  total-byte count and a meaningful 'what' label.
- Test module docstring also no longer references the stale
  lines 2317-2330 anchor.
@brendancol brendancol merged commit f23ec8f into main May 11, 2026
11 checks passed

Labels

performance PR touches performance-sensitive code

Development

Successfully merging this pull request may close these issues.

Batch GDS-fallback per-tile GPU to host transfer

2 participants