Fix GPU predictor=2 decode on big-endian TIFFs (#1517) by brendancol · Pull Request #1524 · xarray-contrib/xarray-spatial

brendancol · 2026-05-08T15:41:18Z

Summary

The per-dtype GPU predictor=2 kernels (_predictor_decode_kernel_u16/u32/u64) view the raw decompressed byte buffer as native unsigned integers. On big-endian files the bytes are stored MSB-first, so the kernel reads the wrong integer interpretation and the prefix-sum differencing produces garbage. After PR #1515 stopped the BE byteswap from crashing, this remained as a silent correctness bug: the reproducer in #1517 showed ~93% pixel mismatch against the CPU baseline.

Fix

Option A from the issue: byteswap the decompressed buffer to native order BEFORE running the predictor=2 kernel, then skip the post-assembly byteswap. Predictor=1 and predictor=3 keep the existing post-decode swap (predictor=3 still drives byte-lane reordering through its own big_endian kernel flag).

A small helper, _swap_bytes_inplace, reverses bytes per sample on the flat uint8 buffer for both numpy and cupy arrays. The change is gated on byte_order == '>' and dtype.itemsize > 1, so the LE and uint8 paths are untouched.

Files

xrspatial/geotiff/_gpu_decode.py: pre-swap in _apply_predictor_and_assemble and gpu_decode_tiles; skip the post-swap when predictor == 2. Adds the _swap_bytes_inplace helper.
xrspatial/geotiff/tests/test_predictor2_big_endian_gpu_1517.py: regression tests covering the int32 tiled deflate reproducer, uint16/int16/uint32/int32 tiled, stripped-layout fallback, LE predictor=2, BE predictor=3, and two pure-numpy unit tests for the byte-swap helper.

Test plan

pytest xrspatial/geotiff/tests/test_predictor2_big_endian_gpu_1517.py (10 passed on a CUDA host)
pytest xrspatial/geotiff/tests/test_predictor2_big_endian.py (CPU regression, 6 passed)
pytest xrspatial/geotiff/tests/test_gpu_byteswap_1508.py xrspatial/geotiff/tests/test_predictor3_big_endian.py xrspatial/geotiff/tests/test_predictor_multisample.py (51 passed)
Full geotiff suite minus an unrelated pre-existing test_features deepcopy failure: 652 passed.

Closes #1517. Builds on #1515.

The per-dtype predictor=2 kernels view the raw decompressed byte buffer as native uint16/uint32/uint64. On big-endian files the bytes are MSB-first, so the prefix-sum runs on the wrong integer interpretation and the differencing produces garbage (~93% pixel mismatch in the reproducer from xarray-contrib#1517). Swap the buffer to native byte order before running the predictor=2 kernel, then skip the post-assembly byteswap that previously tried to fix things up. Predictor=1 and predictor=3 paths keep the existing post-decode swap. Closes xarray-contrib#1517. Builds on the byteswap helper added in xarray-contrib#1515.

Copilot

Pull request overview

Fixes a silent correctness bug in the GeoTIFF GPU decode pipeline for big-endian TIFFs with predictor=2, where the predictor kernels were interpreting decompressed bytes in native order and producing incorrect values. This aligns GPU results with the established CPU baseline behavior for these files and adds targeted regression coverage.

Changes:

Add _swap_bytes_inplace to byteswap the decompressed byte stream to native order before running predictor=2 kernels on big-endian, multi-byte dtypes.
Skip the post-assembly byteswap for big-endian predictor=2 since the buffer is now pre-normalized.
Add a GPU regression test suite for issue #1517 plus small unit tests for the new helper.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
`xrspatial/geotiff/_gpu_decode.py`	Pre-byteswaps decompressed buffers for BE predictor=2 and conditionally skips the final output swap; adds `_swap_bytes_inplace`.
`xrspatial/geotiff/tests/test_predictor2_big_endian_gpu_1517.py`	Adds regression tests for BE predictor=2 GPU correctness across dtypes/layouts plus helper unit tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    if bps <= 1:
+        return
+    n = buf.size
+    if n % bps != 0:
+        raise ValueError(
+            f"buffer size {n} is not a multiple of bps={bps}")
+    view = buf.reshape(n // bps, bps)
+    view[:, :] = view[:, ::-1].copy()


+    gpu_da = read_geotiff_gpu(str(path))
+    assert isinstance(gpu_da.data, cupy.ndarray), (
+        "expected cupy-backed DataArray; GPU path may have fallen back"
+    )
+    assert gpu_da.data.dtype == np.dtype(np.int32)
+    assert gpu_da.data.dtype.isnative
+    np.testing.assert_array_equal(gpu_da.data.get(), cpu)


+    gpu_da = read_geotiff_gpu(str(path))
+    assert isinstance(gpu_da.data, cupy.ndarray)
+    assert gpu_da.data.dtype == np.dtype(dtype)
+    assert gpu_da.data.dtype.isnative
+    np.testing.assert_array_equal(gpu_da.data.get(), cpu)


- Rename _swap_bytes_inplace -> _swap_byte_lanes. The previous name implied a zero-copy operation but view[:, ::-1].copy() allocates a same-sized temporary. The docstring now states the temp explicitly so callers can size their working set, and a CUDA kernel is left as a follow-up if peak GPU memory becomes an issue. - Add _block_cpu_fallback helper that monkeypatches the read_to_array binding inside xrspatial.geotiff. When the GPU tests call read_geotiff_gpu, any silent CPU fallback now raises an AssertionError instead of producing a passing-via-fallback result. - Apply the guard to the four tests where the GPU decode path is what is being exercised (BE int32 reproducer, BE dtype matrix, LE pred2, BE pred3). The stripped-uint16 test still relies on the CPU fallback by design and stays unguarded.

Replaces the view[:, ::-1].copy() form with in-place byte reversal: - numpy: dispatch to ndarray.byteswap(inplace=True) via a uint16/32/64 view of the original buffer, so the original allocation is mutated with no temp. - cupy: launch _byte_swap_lanes_kernel, one thread per sample, swapping bytes through register-resident temporaries. Peak GPU memory for the swap is now O(1) instead of O(buffer size). Adds explicit validation: bps must be 2, 4, or 8 (3/5/6/7 are rejected rather than silently corrupting), and size must be a multiple of bps. Tests cover bps=2/4/8 on numpy and cupy, the bps=1 no-op, the validation errors, and address-equality assertions confirming both paths really are in-place. The existing integration tests (test_gpu_predictor2_big_endian_*, predictor3_big_endian) still pass unchanged.

github-actions Bot added the performance PR touches performance-sensitive code label May 8, 2026

brendancol requested a review from Copilot May 8, 2026 16:17

Copilot started reviewing on behalf of brendancol May 8, 2026 16:17 View session

Copilot AI reviewed May 8, 2026

View reviewed changes

brendancol added 2 commits May 8, 2026 10:11

brendancol merged commit 44dfbbc into xarray-contrib:main May 8, 2026
11 checks passed

brendancol mentioned this pull request May 8, 2026

Fix read_geotiff_gpu dim mismatch on stripped multi-band TIFFs #1525

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix GPU predictor=2 decode on big-endian TIFFs (#1517)#1524

Fix GPU predictor=2 decode on big-endian TIFFs (#1517)#1524
brendancol merged 3 commits intoxarray-contrib:mainfrom
brendancol:fix/1517-gpu-predictor2-be

brendancol commented May 8, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

brendancol commented May 8, 2026

Summary

Fix

Files

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants