Fix nvjpegOutputFormat_t constants for multi-band JPEG GPU decode (#1549)#1550
Conversation
…G GPU decode The nvJPEG output format constants in `_gpu_decode.py` were off by two from the SDK's `nvjpegOutputFormat_t` enum. The wrapper sent `3` thinking it meant `NVJPEG_OUTPUT_RGBI` (interleaved RGB), but `3` is `NVJPEG_OUTPUT_RGB` (planar) in the real SDK. nvJPEG dereferenced the G/B plane pointers in `nvjpegImage.channel[1..2]`, which the wrapper sets to NULL for interleaved output, producing an out-of-bounds GPU write inside `ycbcr_to_format_kernel_roi`. The resulting `cudaErrorIllegalAddress` was sticky and broke the CUDA context for every later GPU call in the same process. The same off-by-two affected the single-band path: it sent `5` thinking it meant `NVJPEG_OUTPUT_UNCHANGED` but `5` is `NVJPEG_OUTPUT_RGBI`, so nvJPEG produced 3 bytes per pixel into a 1-byte-per-pixel buffer and returned visibly wrong pixels rather than crashing. Two changes: - Match the constants to `nvjpeg.h` from the CUDA toolkit. Sanitizer showed the kernel was compiled with `(nvjpegOutputFormat_t)3`, the literal value the code was sending. The fix sets `_NVJPEG_OUTPUT_RGBI` to 5 and `_NVJPEG_OUTPUT_UNCHANGED` to 0. - Add `cupy.cuda.Device().synchronize()` after each `nvjpegDecode`. The shared `jpeg_state` cannot be reused for the next tile until the previous async decode finishes; without the sync the multi-tile output was non-deterministic. The constant bug hid this because no decode ever succeeded. Regression test builds the reproducer from the issue, decodes it with `gpu='strict'`, checks pixels match the CPU read within JPEG rounding tolerance, and checks the CUDA context survives a follow-up GPU read. compute-sanitizer reports 0 errors with the fix; 20/20 stress runs pass. Closes #1549
There was a problem hiding this comment.
Pull request overview
Fixes a GPU nvJPEG decode crash when reading multi-band (RGB) tiled JPEG GeoTIFFs via open_geotiff(..., gpu=True) by correcting the nvjpegOutputFormat_t constant values and preventing reuse of a shared nvjpegJpegState_t before an async decode completes. Adds a targeted GPU regression test for issue #1549 that validates decode correctness and CUDA-context survivability.
Changes:
- Correct
nvjpegOutputFormat_tconstant values used by the nvJPEG ctypes wrapper. - Add a per-tile CUDA synchronization after
nvjpegDecodeto avoid reuse of shared decode state mid-flight. - Add GPU regression tests covering RGB/grayscale JPEG tiled reads and post-read CUDA context health.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
xrspatial/geotiff/_gpu_decode.py |
Fixes nvJPEG output-format constants and serializes per-tile decode to avoid async state reuse. |
xrspatial/geotiff/tests/test_jpeg_gpu_1549.py |
Adds regression coverage for GPU JPEG tiled reads (RGB + grayscale) and CUDA context survivability. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| _HAS_GPU = _gpu_available() | ||
| _HAS_TIFFFILE = importlib.util.find_spec("tifffile") is not None | ||
| _HAS_PIL = importlib.util.find_spec("PIL") is not None | ||
|
|
||
| _gpu_only = pytest.mark.skipif( | ||
| not (_HAS_GPU and _HAS_TIFFFILE and _HAS_PIL), | ||
| reason="cupy + CUDA + tifffile + Pillow required", | ||
| ) |
There was a problem hiding this comment.
Good catch. Added _HAS_NVJPEG to the skip gate so the test skips when libnvjpeg.so doesn't load, and test_rgb_jpeg_gpu_no_crash now spies on _try_nvjpeg_batch_decode and asserts both that the nvJPEG branch was called and that it returned non-None. A silent Pillow fallback would now fail the test loudly. 80e2078.
| import tifffile | ||
| if noise: | ||
| rng = np.random.default_rng(seed) | ||
| arr = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8) | ||
| else: | ||
| # Smooth gradient: per-channel ramp + cross terms. | ||
| ys, xs = np.mgrid[0:256, 0:256].astype(np.int32) | ||
| r = (ys + xs) // 2 | ||
| g = ys | ||
| b = xs | ||
| arr = np.stack([r, g, b], axis=2).clip(0, 255).astype(np.uint8) | ||
| tifffile.imwrite(path, arr, photometric='rgb', tile=(128, 128), | ||
| compression='jpeg') | ||
| return arr |
There was a problem hiding this comment.
Added _HAS_IMAGECODECS to the skip gate. The tests now skip cleanly when tifffile's JPEG codec delegate isn't installed instead of erroring on the first tifffile.imwrite(compression='jpeg'). 80e2078.
| # iteration overwrites jpeg_state mid-decode and the | ||
| # output is non-deterministic (tile contents drift | ||
| # between runs even when tile_data is identical). | ||
| cupy.cuda.Device().synchronize() |
There was a problem hiding this comment.
Switched to cupy.cuda.Stream.null.synchronize() (the default stream we passed to nvjpegDecode as stream=0). The data dependency is on the shared jpeg_state, so syncing only that stream is sufficient and lets concurrent work on other streams continue. 80e2078.
- Sync only the default CUDA stream after each nvjpegDecode call. The decode runs on stream=0 and the data dependency is on the shared jpeg_state, not on the whole device, so a full deviceSynchronize was blocking concurrent work on unrelated streams. - Test gate now also requires imagecodecs (tifffile's JPEG codec delegate) and a loadable libnvjpeg. Without nvJPEG the GPU pipeline silently falls back to CPU Pillow decode, and the test would pass while never touching the code path the bug lived on. - test_rgb_jpeg_gpu_no_crash now spies on _try_nvjpeg_batch_decode and asserts the nvJPEG branch was called and returned non-None, so a silent CPU fallback would fail the test loudly.
Summary
cudaErrorIllegalAddresscrash inopen_geotiff(..., gpu=True)on 3-band tiled JPEG GeoTIFFs (GPU JPEG-tiled 3-band read crashes with cudaErrorIllegalAddress and poisons CUDA context #1549). Constants in
_gpu_decode.pywere two off from the SDK's
nvjpegOutputFormat_tenum, so nvJPEG wroteplanar G/B planes through NULL channel pointers.
cupy.cuda.Device().synchronize()after eachnvjpegDecodecall so the sharednvjpegJpegState_tis not reusedbefore the previous async decode finishes (the constant bug hid this
because no decode ever succeeded).
on GPU with
gpu='strict', compares pixels to the CPU read, and checksthe CUDA context survives.
compute-sanitizer with the fix:
Without the fix:
The kernel template parameter
(nvjpegOutputFormat_t)3is the value thecode was sending. The SDK header defines
3 = NVJPEG_OUTPUT_RGB(planar).Test plan
python -m pytest xrspatial/geotiff/tests/test_jpeg_gpu_1549.py -vpasses 4/4python -m pytest xrspatial/geotiff/tests/passes 940/940 (3unrelated matplotlib palette failures are pre-existing on main)
compute-sanitizer --tool memcheckreports 0 errors on theregression test
cupy.sum(...)and a follow-upGPU TIFF read both succeed after the GPU JPEG read