Skip to content

Fix nvjpegOutputFormat_t constants for multi-band JPEG GPU decode (#1549)#1550

Merged
brendancol merged 2 commits into
mainfrom
worktree-agent-a15b52aa87f24c1ac
May 11, 2026
Merged

Fix nvjpegOutputFormat_t constants for multi-band JPEG GPU decode (#1549)#1550
brendancol merged 2 commits into
mainfrom
worktree-agent-a15b52aa87f24c1ac

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

  • Fixes the cudaErrorIllegalAddress crash in open_geotiff(..., gpu=True)
    on 3-band tiled JPEG GeoTIFFs (GPU JPEG-tiled 3-band read crashes with cudaErrorIllegalAddress and poisons CUDA context #1549). Constants in _gpu_decode.py
    were two off from the SDK's nvjpegOutputFormat_t enum, so nvJPEG wrote
    planar G/B planes through NULL channel pointers.
  • Adds a per-tile cupy.cuda.Device().synchronize() after each
    nvjpegDecode call so the shared nvjpegJpegState_t is not reused
    before the previous async decode finishes (the constant bug hid this
    because no decode ever succeeded).
  • Adds a regression test that builds the issue's reproducer, decodes it
    on GPU with gpu='strict', compares pixels to the CPU read, and checks
    the CUDA context survives.

compute-sanitizer with the fix:

========= ERROR SUMMARY: 0 errors

Without the fix:

Invalid __global__ write of size 1 bytes
    at void nvjpeg::ycbcr_to_format_kernel_roi<(nvjpegChromaSubsampling_t)2,
        (bool)0, (nvjpegOutputFormat_t)3, ...>(...)
    Access to 0x0 is out of bounds

The kernel template parameter (nvjpegOutputFormat_t)3 is the value the
code was sending. The SDK header defines 3 = NVJPEG_OUTPUT_RGB (planar).

Test plan

  • python -m pytest xrspatial/geotiff/tests/test_jpeg_gpu_1549.py -v passes 4/4
  • 20 back-to-back runs of the regression test, all pass
  • python -m pytest xrspatial/geotiff/tests/ passes 940/940 (3
    unrelated matplotlib palette failures are pre-existing on main)
  • compute-sanitizer --tool memcheck reports 0 errors on the
    regression test
  • CUDA context survival: a follow-up cupy.sum(...) and a follow-up
    GPU TIFF read both succeed after the GPU JPEG read

…G GPU decode

The nvJPEG output format constants in `_gpu_decode.py` were off by two from
the SDK's `nvjpegOutputFormat_t` enum. The wrapper sent `3` thinking it
meant `NVJPEG_OUTPUT_RGBI` (interleaved RGB), but `3` is
`NVJPEG_OUTPUT_RGB` (planar) in the real SDK. nvJPEG dereferenced the G/B
plane pointers in `nvjpegImage.channel[1..2]`, which the wrapper sets to
NULL for interleaved output, producing an out-of-bounds GPU write inside
`ycbcr_to_format_kernel_roi`. The resulting `cudaErrorIllegalAddress` was
sticky and broke the CUDA context for every later GPU call in the same
process.

The same off-by-two affected the single-band path: it sent `5` thinking it
meant `NVJPEG_OUTPUT_UNCHANGED` but `5` is `NVJPEG_OUTPUT_RGBI`, so nvJPEG
produced 3 bytes per pixel into a 1-byte-per-pixel buffer and returned
visibly wrong pixels rather than crashing.

Two changes:

- Match the constants to `nvjpeg.h` from the CUDA toolkit. Sanitizer
  showed the kernel was compiled with `(nvjpegOutputFormat_t)3`, the
  literal value the code was sending. The fix sets `_NVJPEG_OUTPUT_RGBI`
  to 5 and `_NVJPEG_OUTPUT_UNCHANGED` to 0.
- Add `cupy.cuda.Device().synchronize()` after each `nvjpegDecode`. The
  shared `jpeg_state` cannot be reused for the next tile until the
  previous async decode finishes; without the sync the multi-tile output
  was non-deterministic. The constant bug hid this because no decode
  ever succeeded.

Regression test builds the reproducer from the issue, decodes it with
`gpu='strict'`, checks pixels match the CPU read within JPEG rounding
tolerance, and checks the CUDA context survives a follow-up GPU read.

compute-sanitizer reports 0 errors with the fix; 20/20 stress runs pass.

Closes #1549
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 10, 2026
@brendancol brendancol requested a review from Copilot May 11, 2026 03:21
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a GPU nvJPEG decode crash when reading multi-band (RGB) tiled JPEG GeoTIFFs via open_geotiff(..., gpu=True) by correcting the nvjpegOutputFormat_t constant values and preventing reuse of a shared nvjpegJpegState_t before an async decode completes. Adds a targeted GPU regression test for issue #1549 that validates decode correctness and CUDA-context survivability.

Changes:

  • Correct nvjpegOutputFormat_t constant values used by the nvJPEG ctypes wrapper.
  • Add a per-tile CUDA synchronization after nvjpegDecode to avoid reuse of shared decode state mid-flight.
  • Add GPU regression tests covering RGB/grayscale JPEG tiled reads and post-read CUDA context health.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
xrspatial/geotiff/_gpu_decode.py Fixes nvJPEG output-format constants and serializes per-tile decode to avoid async state reuse.
xrspatial/geotiff/tests/test_jpeg_gpu_1549.py Adds regression coverage for GPU JPEG tiled reads (RGB + grayscale) and CUDA context survivability.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +50 to +57
_HAS_GPU = _gpu_available()
_HAS_TIFFFILE = importlib.util.find_spec("tifffile") is not None
_HAS_PIL = importlib.util.find_spec("PIL") is not None

_gpu_only = pytest.mark.skipif(
not (_HAS_GPU and _HAS_TIFFFILE and _HAS_PIL),
reason="cupy + CUDA + tifffile + Pillow required",
)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Added _HAS_NVJPEG to the skip gate so the test skips when libnvjpeg.so doesn't load, and test_rgb_jpeg_gpu_no_crash now spies on _try_nvjpeg_batch_decode and asserts both that the nvJPEG branch was called and that it returned non-None. A silent Pillow fallback would now fail the test loudly. 80e2078.

Comment on lines +75 to +88
import tifffile
if noise:
rng = np.random.default_rng(seed)
arr = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
else:
# Smooth gradient: per-channel ramp + cross terms.
ys, xs = np.mgrid[0:256, 0:256].astype(np.int32)
r = (ys + xs) // 2
g = ys
b = xs
arr = np.stack([r, g, b], axis=2).clip(0, 255).astype(np.uint8)
tifffile.imwrite(path, arr, photometric='rgb', tile=(128, 128),
compression='jpeg')
return arr
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added _HAS_IMAGECODECS to the skip gate. The tests now skip cleanly when tifffile's JPEG codec delegate isn't installed instead of erroring on the first tifffile.imwrite(compression='jpeg'). 80e2078.

Comment thread xrspatial/geotiff/_gpu_decode.py Outdated
# iteration overwrites jpeg_state mid-decode and the
# output is non-deterministic (tile contents drift
# between runs even when tile_data is identical).
cupy.cuda.Device().synchronize()
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to cupy.cuda.Stream.null.synchronize() (the default stream we passed to nvjpegDecode as stream=0). The data dependency is on the shared jpeg_state, so syncing only that stream is sufficient and lets concurrent work on other streams continue. 80e2078.

- Sync only the default CUDA stream after each nvjpegDecode call. The
  decode runs on stream=0 and the data dependency is on the shared
  jpeg_state, not on the whole device, so a full deviceSynchronize was
  blocking concurrent work on unrelated streams.
- Test gate now also requires imagecodecs (tifffile's JPEG codec
  delegate) and a loadable libnvjpeg. Without nvJPEG the GPU pipeline
  silently falls back to CPU Pillow decode, and the test would pass
  while never touching the code path the bug lived on.
- test_rgb_jpeg_gpu_no_crash now spies on _try_nvjpeg_batch_decode and
  asserts the nvJPEG branch was called and returned non-None, so a
  silent CPU fallback would fail the test loudly.
@brendancol brendancol merged commit 593aee5 into main May 11, 2026
10 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants