Fix GPU predictor=2 decode on big-endian TIFFs (#1517)#1524
Merged
brendancol merged 3 commits intoxarray-contrib:mainfrom May 8, 2026
Merged
Fix GPU predictor=2 decode on big-endian TIFFs (#1517)#1524brendancol merged 3 commits intoxarray-contrib:mainfrom
brendancol merged 3 commits intoxarray-contrib:mainfrom
Conversation
The per-dtype predictor=2 kernels view the raw decompressed byte buffer as native uint16/uint32/uint64. On big-endian files the bytes are MSB-first, so the prefix-sum runs on the wrong integer interpretation and the differencing produces garbage (~93% pixel mismatch in the reproducer from xarray-contrib#1517). Swap the buffer to native byte order before running the predictor=2 kernel, then skip the post-assembly byteswap that previously tried to fix things up. Predictor=1 and predictor=3 paths keep the existing post-decode swap. Closes xarray-contrib#1517. Builds on the byteswap helper added in xarray-contrib#1515.
There was a problem hiding this comment.
Pull request overview
Fixes a silent correctness bug in the GeoTIFF GPU decode pipeline for big-endian TIFFs with predictor=2, where the predictor kernels were interpreting decompressed bytes in native order and producing incorrect values. This aligns GPU results with the established CPU baseline behavior for these files and adds targeted regression coverage.
Changes:
- Add
_swap_bytes_inplaceto byteswap the decompressed byte stream to native order before running predictor=2 kernels on big-endian, multi-byte dtypes. - Skip the post-assembly byteswap for big-endian predictor=2 since the buffer is now pre-normalized.
- Add a GPU regression test suite for issue #1517 plus small unit tests for the new helper.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
xrspatial/geotiff/_gpu_decode.py |
Pre-byteswaps decompressed buffers for BE predictor=2 and conditionally skips the final output swap; adds _swap_bytes_inplace. |
xrspatial/geotiff/tests/test_predictor2_big_endian_gpu_1517.py |
Adds regression tests for BE predictor=2 GPU correctness across dtypes/layouts plus helper unit tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+94
to
+101
| if bps <= 1: | ||
| return | ||
| n = buf.size | ||
| if n % bps != 0: | ||
| raise ValueError( | ||
| f"buffer size {n} is not a multiple of bps={bps}") | ||
| view = buf.reshape(n // bps, bps) | ||
| view[:, :] = view[:, ::-1].copy() |
Comment on lines
+66
to
+72
| gpu_da = read_geotiff_gpu(str(path)) | ||
| assert isinstance(gpu_da.data, cupy.ndarray), ( | ||
| "expected cupy-backed DataArray; GPU path may have fallen back" | ||
| ) | ||
| assert gpu_da.data.dtype == np.dtype(np.int32) | ||
| assert gpu_da.data.dtype.isnative | ||
| np.testing.assert_array_equal(gpu_da.data.get(), cpu) |
Comment on lines
+106
to
+110
| gpu_da = read_geotiff_gpu(str(path)) | ||
| assert isinstance(gpu_da.data, cupy.ndarray) | ||
| assert gpu_da.data.dtype == np.dtype(dtype) | ||
| assert gpu_da.data.dtype.isnative | ||
| np.testing.assert_array_equal(gpu_da.data.get(), cpu) |
- Rename _swap_bytes_inplace -> _swap_byte_lanes. The previous name implied a zero-copy operation but view[:, ::-1].copy() allocates a same-sized temporary. The docstring now states the temp explicitly so callers can size their working set, and a CUDA kernel is left as a follow-up if peak GPU memory becomes an issue. - Add _block_cpu_fallback helper that monkeypatches the read_to_array binding inside xrspatial.geotiff. When the GPU tests call read_geotiff_gpu, any silent CPU fallback now raises an AssertionError instead of producing a passing-via-fallback result. - Apply the guard to the four tests where the GPU decode path is what is being exercised (BE int32 reproducer, BE dtype matrix, LE pred2, BE pred3). The stripped-uint16 test still relies on the CPU fallback by design and stays unguarded.
Replaces the view[:, ::-1].copy() form with in-place byte reversal: - numpy: dispatch to ndarray.byteswap(inplace=True) via a uint16/32/64 view of the original buffer, so the original allocation is mutated with no temp. - cupy: launch _byte_swap_lanes_kernel, one thread per sample, swapping bytes through register-resident temporaries. Peak GPU memory for the swap is now O(1) instead of O(buffer size). Adds explicit validation: bps must be 2, 4, or 8 (3/5/6/7 are rejected rather than silently corrupting), and size must be a multiple of bps. Tests cover bps=2/4/8 on numpy and cupy, the bps=1 no-op, the validation errors, and address-equality assertions confirming both paths really are in-place. The existing integration tests (test_gpu_predictor2_big_endian_*, predictor3_big_endian) still pass unchanged.
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The per-dtype GPU predictor=2 kernels (
_predictor_decode_kernel_u16/u32/u64) view the raw decompressed byte buffer as native unsigned integers. On big-endian files the bytes are stored MSB-first, so the kernel reads the wrong integer interpretation and the prefix-sum differencing produces garbage. After PR #1515 stopped the BE byteswap from crashing, this remained as a silent correctness bug: the reproducer in #1517 showed ~93% pixel mismatch against the CPU baseline.Fix
Option A from the issue: byteswap the decompressed buffer to native order BEFORE running the predictor=2 kernel, then skip the post-assembly byteswap. Predictor=1 and predictor=3 keep the existing post-decode swap (predictor=3 still drives byte-lane reordering through its own
big_endiankernel flag).A small helper,
_swap_bytes_inplace, reverses bytes per sample on the flat uint8 buffer for both numpy and cupy arrays. The change is gated onbyte_order == '>'anddtype.itemsize > 1, so the LE and uint8 paths are untouched.Files
xrspatial/geotiff/_gpu_decode.py: pre-swap in_apply_predictor_and_assembleandgpu_decode_tiles; skip the post-swap whenpredictor == 2. Adds the_swap_bytes_inplacehelper.xrspatial/geotiff/tests/test_predictor2_big_endian_gpu_1517.py: regression tests covering the int32 tiled deflate reproducer, uint16/int16/uint32/int32 tiled, stripped-layout fallback, LE predictor=2, BE predictor=3, and two pure-numpy unit tests for the byte-swap helper.Test plan
pytest xrspatial/geotiff/tests/test_predictor2_big_endian_gpu_1517.py(10 passed on a CUDA host)pytest xrspatial/geotiff/tests/test_predictor2_big_endian.py(CPU regression, 6 passed)pytest xrspatial/geotiff/tests/test_gpu_byteswap_1508.py xrspatial/geotiff/tests/test_predictor3_big_endian.py xrspatial/geotiff/tests/test_predictor_multisample.py(51 passed)Closes #1517. Builds on #1515.