
Batch GDS-fallback per-tile GPU to host transfer (#1552) #1555

Merged
brendancol merged 2 commits into main from issue-1552 on May 11, 2026

Conversation

@brendancol
Contributor

Summary

  • Closes Batch GDS-fallback per-tile GPU to host transfer #1552. Replaces the per-tile .get().tobytes() loop in the GDS fallback branch of gpu_decode_tiles_from_file with one cupy.concatenate + one D2H transfer.
  • The helper lives as _batched_d2h_to_bytes so it can be tested without kvikio installed.
  • Mirrors the batched D2H pattern already in the same file on the deflate path (lines ~2317-2330).

Why

When GDS reads compressed tiles to GPU but nvCOMP can't decompress them on-device (LZW and any codec other than ZSTD/deflate), the tiles have to come back to host memory for the CPU decoder. Each .get() is its own copy on the default stream, so they serialise and per-DMA setup overhead dominates with many tiles.

Measured speedup

256 tiles, ~10 MiB total (representative LZW COG fallback workload):

per-tile .get() loop : 10.240 ms
batched concat + get :  3.191 ms
speedup              : 3.21x

The 3.21x exceeds the ~1.66x quoted in the H2D batched-upload comment because there's no decompress kernel between transfers here, so DMA setup is a larger fraction of the total.

Test plan

  • New xrspatial/geotiff/tests/test_gds_fallback_batched_d2h_1552.py: 5 tests covering empty input, equality with the old per-tile baseline, single tile, zero-size tile in a list, and the many-small-tiles regime.
  • Full pytest xrspatial/geotiff/tests/ is green for everything touched by this change. The 3 pre-existing TestPalette failures in test_features.py reproduce on main and are unrelated (deepcopy recursion in palette plotting).

When GDS reads compressed tiles to GPU but nvCOMP cannot decompress
them on-device (LZW or non-ZSTD/non-deflate codecs), the tiles have
to come back to host memory for the CPU decoder. Previously this was
done with a per-tile ``.get().tobytes()`` loop, where each call is
its own D2H copy on the default stream. The transfers serialise and
per-DMA setup overhead dominates wall time when there are many tiles.

Replace the loop with a single ``cupy.concatenate`` plus one ``.get()``,
then slice the host bytes by per-tile offsets. This mirrors the batched
D2H pattern already used on the deflate path at ``_gpu_decode.py``
lines ~2317-2330.

Measured 3.2x speedup on a 256-tile / ~10 MiB workload locally,
exceeding the ~1.66x quoted in the H2D batched-upload comment.

Adds ``_batched_d2h_to_bytes`` so the helper is unit-testable without
needing kvikio installed. New tests cover empty input, single tile,
zero-size tile in a list, and the many-small-tiles regime that
motivates the batching.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 10, 2026
@brendancol brendancol requested a review from Copilot May 11, 2026 03:21
Contributor

Copilot AI left a comment


Pull request overview

This PR optimizes the GeoTIFF GPU decode pipeline’s GDS (kvikIO) fallback path by batching device-to-host transfers of compressed tiles, reducing per-tile .get() overhead when nvCOMP can’t decompress on-device (e.g., LZW and other non-ZSTD codecs).

Changes:

  • Added _batched_d2h_to_bytes helper to concatenate device buffers and perform a single D2H transfer, returning per-tile bytes.
  • Switched the GDS fallback branch in gpu_decode_tiles_from_file to use the new batched helper instead of a per-tile .get().tobytes() loop.
  • Added a new regression test module covering correctness across empty, single, mixed-size, zero-size, and many-tile scenarios.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

  • xrspatial/geotiff/_gpu_decode.py: Introduces a batched D2H helper and wires it into the GDS fallback path to reduce transfer overhead.
  • xrspatial/geotiff/tests/test_gds_fallback_batched_d2h_1552.py: Adds GPU-skipped regression tests validating the batched D2H behavior matches the previous per-tile baseline.


Comment thread xrspatial/geotiff/_gpu_decode.py Outdated
Comment thread xrspatial/geotiff/_gpu_decode.py
- Docstring no longer references the fictional
  _try_nvcomp_compress_from_device_bufs. Now points at the symmetric
  H2D batched-upload pattern in _try_nvcomp_decompress.
- Added _check_gpu_memory(total_bytes, "batched D2H staging buffer")
  before the cupy.concatenate. The concat allocates sum(sizes) bytes,
  which the prior per-tile .get() loop did not, so the new peak VRAM
  bump now fails early with a clear OOM message instead of inside
  cupy.
- New test_batched_d2h_checks_gpu_memory_before_concat pins the
  contract: it monkeypatches _check_gpu_memory to raise and asserts
  the function propagates before any cupy allocation, with the right
  total-byte count and a meaningful 'what' label.
- Test module docstring also no longer references the stale
  lines 2317-2330 anchor.
@brendancol brendancol merged commit f23ec8f into main May 11, 2026
11 checks passed

Labels

performance PR touches performance-sensitive code

Development

Successfully merging this pull request may close these issues.

Batch GDS-fallback per-tile GPU to host transfer

2 participants