
Add realtime AI decoder / predecoder infrastructure (GPU + Host) w/ host dispatcher#457

Merged
bmhowe23 merged 80 commits into NVIDIA:main from wsttiger:add_realtime_ai_predecoder_host_side_gb200
Apr 3, 2026

Conversation

Collaborator

@wsttiger wsttiger commented Mar 4, 2026

Host-side AI predecoder pipeline with RealtimePipeline abstraction

Summary

Adds a complete host-side realtime decoding pipeline that pairs a GPU-based AI predecoder (TensorRT) with CPU-based PyMatching MWPM decoding, orchestrated by a new RealtimePipeline C++ abstraction that hides all low-level atomics and thread management from application code.

  • AI predecoder service (AIPreDecoderService / AIDecoderService): TensorRT inference wrapped in CUDA graphs with gateway kernels for ring buffer I/O, supporting ONNX model loading, dynamic batch dims, FP16, and engine caching
  • Host-side dispatcher: Spin-polling dispatcher that replaces the device-side persistent kernel, with dynamic worker pool, per-worker CUDA graph launch, idle bitmask scheduling, pre/post-launch callbacks (DMA input copy), and RPC-based function dispatch
  • RealtimePipeline abstraction: Encapsulates dispatcher thread, worker threads, consumer thread, and ring buffer management behind a clean factory/callback API (GpuStageFactory, CpuStageCallback, CompletionCallback), eliminating direct atomic access from application code
  • RingBufferInjector: Dedicated software injection class for test/replay workloads, with thread-safe compare_exchange_weak slot claiming and backpressure tracking
  • GPU-only mode: Pipeline can operate without CPU worker threads when no post-processing decoder is needed
  • Performance optimizations: Vectorized GPU copy kernels (uint4 loads), DMA-based output copy, out-of-order consumer harvesting, ARM memory ordering fixes (std::atomic acquire loads, __sync_synchronize fences)
  • Comprehensive defect fixes: All critical (C1–C4) and major (M1–M12) defects from code review addressed, including race conditions, premature flag writes, dynamic-shape TRT volume overflow, and vector<bool> UB
  • Test coverage: GTest suite with 21 tests (identity passthrough, correctness, host dispatcher, sustained throughput) plus full benchmark with configurable distance/rounds/injection rate
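
The RingBufferInjector's compare_exchange_weak slot claiming can be sketched in plain C++. The `Slot` struct and `try_claim_slot` name below are illustrative stand-ins, not the PR's actual API, and the real injector also records backpressure statistics:

```cpp
#include <array>
#include <atomic>
#include <cstdint>
#include <optional>

constexpr int kNumSlots = 16;

struct Slot {
  std::atomic<uint32_t> state{0}; // 0 = FREE, 1 = CLAIMED
};

std::array<Slot, kNumSlots> g_slots;

// Scan for a free slot and claim it atomically; returns the slot index,
// or std::nullopt if every slot is busy (backpressure).
std::optional<int> try_claim_slot() {
  for (int i = 0; i < kNumSlots; ++i) {
    uint32_t expected = 0; // expect FREE
    // compare_exchange_weak may fail spuriously, which is fine here:
    // we simply move on and retry on the next scan.
    if (g_slots[i].state.compare_exchange_weak(expected, 1,
                                               std::memory_order_acquire,
                                               std::memory_order_relaxed))
      return i;
  }
  return std::nullopt; // caller records a stall and retries
}
```

The weak form is appropriate because the claim sits inside a retry loop anyway; a spurious failure just costs one extra scan.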

Key files

  • Pipeline abstraction: realtime/include/cudaq/realtime/pipeline.h, realtime/lib/pipeline/realtime_pipeline.cu
  • Host dispatcher: realtime/include/.../host_dispatcher.h, realtime/lib/.../host_dispatcher.cu, host_dispatcher_capi.cu
  • Realtime C API: realtime/include/.../cudaq_realtime.h, realtime/lib/.../cudaq_realtime_api.cpp
  • AI decoder services: libs/qec/include/.../ai_decoder_service.h, ai_predecoder_service.h, libs/qec/lib/realtime/ai_decoder_service.cu, ai_predecoder_service.cu
  • Benchmark: libs/qec/lib/realtime/test_realtime_predecoder_w_pymatching.cpp
  • Unit tests: libs/qec/unittests/test_realtime_pipeline.cu, realtime/unittests/test_host_dispatcher.cu
  • Design docs: docs/host_side_dispatcher_design_gemini.md, docs/hybrid_ai_predecoder_pipeline.md

Test plan

  • test_realtime_pipeline — 21 GTest cases pass (identity passthrough, multi-request correctness, shutdown, slot wraparound, sustained throughput)
  • test_realtime_predecoder_w_pymatching — end-to-end benchmark runs with d7, d13_r13, d13_r104 configs at various injection rates
  • test_host_dispatcher — host dispatcher unit tests pass
  • test_dispatch_kernel — dispatch kernel unit tests pass
  • Verify ARM (Grace) memory ordering correctness under sustained load (no stuck slots, no phantom completions)
  • Verify SKIP_TRT=1 passthrough mode works for CI environments without GPU/TRT

wsttiger and others added 29 commits February 18, 2026 20:51
…atch

Introduce the cudaq-realtime library under realtime/, providing
infrastructure for low-latency GPU-accelerated realtime coprocessing
between FPGA/CPU and GPU systems in the NVQLink architecture.

Key components:
- C-compatible host API (cudaq_realtime.h) with dispatch manager/dispatcher
  lifecycle management (create, configure ring buffers, start/stop)
- Persistent GPU dispatch kernel that polls a ring buffer for incoming
  RPC requests and dispatches to registered handlers via function table
  lookup using FNV-1a hashed function IDs
- Two dispatch modes: DeviceCallMode (direct __device__ function calls)
  and GraphLaunchMode (device-side cudaGraphLaunch with backpressure
  and single-launch guards, requires sm_80+)
- Two kernel synchronization strategies: RegularKernel (__syncthreads)
  and CooperativeKernel (grid-wide cooperative_groups sync)
- Schema-driven type system for RPC argument/result descriptors
- Shared library (libcudaq-realtime.so) for the host API and static
  library (libcudaq-realtime-dispatch.a) for GPU kernel device code
- GTest-based unit tests covering device-call dispatch, host API
  integration, and device-side graph launch

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
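
As a reference point for the FNV-1a function-ID hashing mentioned above, the standard 64-bit FNV-1a looks like the following. The exact variant the dispatch kernel uses (bit width, any folding) is an assumption here; the constants are the canonical 64-bit FNV parameters:

```cpp
#include <cstdint>
#include <string_view>

// Standard 64-bit FNV-1a: xor each byte into the hash, then multiply by
// the FNV prime. constexpr so function IDs can be computed at compile time.
constexpr uint64_t fnv1a_64(std::string_view s) {
  uint64_t h = 0xcbf29ce484222325ull; // FNV offset basis
  for (unsigned char c : s) {
    h ^= c;                // fold the byte in first...
    h *= 0x100000001b3ull; // ...then multiply by the FNV prime
  }
  return h;
}

static_assert(fnv1a_64("") == 0xcbf29ce484222325ull,
              "empty string hashes to the offset basis");
```

Computing the ID at compile time lets the registration side and the dispatch side agree on a key without shipping strings across the ring buffer.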
Introduce AIDecoderService and AIPreDecoderService in the QEC library,
enabling a hybrid realtime pipeline where GPU-side TensorRT inference
(predecoding) hands off results to CPU-side classical decoders like
PyMatching.

Key components:
- AIDecoderService: wraps TensorRT inference in a CUDA graph using a
  gateway kernel pattern (mailbox pointer indirection) to bridge the
  dispatch kernel's dynamic ring buffer addresses to TRT's fixed I/O
  buffers. Supports SKIP_TRT env var for testing without TensorRT.
- AIPreDecoderService: extends AIDecoderService with an N-deep pinned
  memory circular queue for GPU-to-CPU handoff, slot claim/release
  protocol (d_claimed_slot, d_inflight_flag), backpressure signaling
  via d_ready_flags/d_queue_idx, and poll_next_job/release_job API
  with proper acquire/release memory ordering
- ThreadPool utility with optional Linux CPU core pinning for
  low-latency PyMatching worker threads
- End-to-end integration test demonstrating the full hybrid pipeline:
  dispatcher -> 4x AIPreDecoderService GPU inference -> polling thread
  -> 4-worker PyMatching thread pool -> TX flag acknowledgment
- CMake integration to find TensorRT and build the test with CUDA
  separable compilation

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
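
The poll_next_job/release_job handoff with acquire/release ordering can be sketched as a host-only model. In the real service the ready flags live in pinned memory and the producer is a GPU kernel; the names and the `int` payload below are illustrative:

```cpp
#include <atomic>
#include <cstdint>
#include <optional>

constexpr int kQueueDepth = 4;

struct HandoffQueue {
  std::atomic<uint32_t> ready[kQueueDepth] = {};
  int payload[kQueueDepth] = {};
  int head = 0; // consumer cursor; single consumer assumed

  // Producer side (GPU in the real pipeline): write the payload, then
  // release-store the flag so the data is visible before the flag flips.
  void publish(int s, int value) {
    payload[s] = value;
    ready[s].store(1, std::memory_order_release);
  }

  // poll_next_job: acquire-load the flag; only read the payload after
  // observing the flag set, pairing with the producer's release store.
  std::optional<int> poll_next_job() {
    if (ready[head].load(std::memory_order_acquire) == 0)
      return std::nullopt;
    return payload[head];
  }

  // release_job: hand the slot back to the producer and advance.
  void release_job() {
    ready[head].store(0, std::memory_order_release);
    head = (head + 1) % kQueueDepth;
  }
};
```

The acquire/release pairing is what the commit message refers to: without it, a weakly ordered CPU (e.g. Grace/ARM) could observe the flag before the payload.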
Upgrade the AI predecoder test from a dummy identity TRT engine to a
real d=7 r=7 surface code Z-type ONNX model. The service classes now
support ONNX→TRT engine compilation, multi-output tensor bindings, and
type-agnostic (INT32) I/O. The test fires 8 realistic syndrome payloads
through 4 GPU pre-decoders and verifies end-to-end residual detector
output handed off to simulated PyMatching workers.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Replace the simulated PyMatching worker with a real MWPM decoder
using the d=7 surface code's static Z parity check matrix via the
cudaq-qec decoder plugin system. The 336 residual detectors from
the AI predecoder are sliced into 14 spatial rounds of 24 Z-stabilizer
syndromes and decoded independently. A mutex protects the decoder
for thread safety across the 4-worker thread pool.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Extract hard-coded d=7 parameters into a PipelineConfig struct with
static factory methods for d=7, d=13, d=21, and d=31 surface codes.
Runtime config selection via command-line argument (d7|d13|d21|d31)
preserves existing d=7 functionality while enabling larger-distance
experiments. ONNX_MODEL_PATH replaced with ONNX_MODEL_DIR to support
per-config model filenames.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Introduce a reusable header-only latency and throughput tracker for
realtime decoding pipelines. Provides per-request submit/complete
timestamping, percentile statistics (p50/p90/p95/p99), and a formatted
report including wall time, throughput, and per-request breakdown.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
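
A minimal sketch of the percentile computation such a tracker typically uses is the nearest-rank method on a sorted copy; the PR's header-only tracker may use a different interpolation rule:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Nearest-rank percentile: sort a copy, then index at ceil(p/100 * N) - 1.
// Takes samples by value so the caller's buffer stays unsorted.
double percentile(std::vector<double> samples, double p) {
  assert(!samples.empty() && p >= 0.0 && p <= 100.0);
  std::sort(samples.begin(), samples.end());
  size_t rank =
      static_cast<size_t>(std::ceil(p / 100.0 * samples.size()));
  if (rank > 0) --rank;                       // convert to 0-based index
  if (rank >= samples.size()) rank = samples.size() - 1;
  return samples[rank];
}
```

With per-request submit/complete timestamps collected during the run, p50/p90/p95/p99 are then four calls over the latency vector.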
… requests

Enhance PipelineBenchmark to distinguish submitted vs completed requests,
report timeouts, and cap per-request output to 50 entries. Integrate it
into the predecoder pipeline test with per-request submit/complete
markers and spin-wait polling for accurate latency measurement. Increase
default total_requests from 20 to 100 across all distance configs.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Instrument the PyMatching worker with high-resolution timestamps to
measure decode time vs worker overhead. Report a breakdown showing
PyMatching decode, worker overhead, and GPU+dispatch+poll latency as
percentages of the total end-to-end pipeline, plus per-round latency.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Add engine caching: prefer a pre-built .engine file when available,
otherwise build from ONNX and save the engine for subsequent runs.
Replace the single mutex-protected PyMatching decoder with a pool of
per-worker decoder instances using thread-local index assignment,
eliminating lock contention in the decode path.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Introduce a streaming test mode alongside the existing batch mode,
activated via CLI (`stream [rate_us] [duration_s]`). The streaming mode
uses dedicated producer/consumer threads to simulate continuous FPGA
syndrome arrival with configurable inter-arrival rate, in-flight
throttling (capped to num_predecoders), backpressure tracking, and
warmup period exclusion from latency stats. Reports steady-state
throughput, percentile latencies, and per-round timing breakdown.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Signed-off-by: Scott Thornton <sthornton@nvidia.com>
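
The in-flight throttling loop the streaming mode describes can be sketched as below: a producer paces submissions at a fixed inter-arrival rate but never lets more than `max_in_flight` requests be outstanding. Names are illustrative; the real test wires `submit` to the GPU pipeline and the completion path decrements `in_flight`:

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Open-loop producer with an in-flight cap (the PR caps it to
// num_predecoders) and a backpressure stall counter.
void produce(std::atomic<int> &in_flight, std::atomic<long> &stalls,
             int max_in_flight, int total, int rate_us,
             const std::function<void()> &submit) {
  for (int i = 0; i < total; ++i) {
    // Backpressure: spin until a slot frees up, counting stalls.
    while (in_flight.load(std::memory_order_acquire) >= max_in_flight) {
      stalls.fetch_add(1, std::memory_order_relaxed);
      std::this_thread::yield();
    }
    in_flight.fetch_add(1, std::memory_order_acq_rel);
    submit(); // completion path decrements in_flight
    std::this_thread::sleep_for(std::chrono::microseconds(rate_us));
  }
}
```

The stall counter is what distinguishes "arrival rate exceeds capacity" from healthy steady state in the report.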
…nt kernel

The CUDA device runtime has a hardcoded 128 fire-and-forget graph launch
slot limit that is never reclaimed while a persistent parent kernel runs,
making the device-side dispatcher unsuitable for sustained operation.

This adds a host-side CPU dispatcher thread that polls rx_flags and calls
cudaGraphLaunch from host code on per-predecoder CUDA streams, bypassing
the device runtime limit entirely. Streaming mode uses the host dispatcher;
batch mode retains the device-side dispatcher for backward compatibility.

Key changes:
- New host_dispatcher.h/.cpp with host_dispatcher_loop()
- AIPreDecoderService::capture_graph() gains device_launch flag for
  conditional cudaGraphInstantiateFlagDeviceLaunch vs standard instantiation
- d_queue_idx_ changed from cudaMalloc to cudaHostAllocMapped so the host
  dispatcher can read backpressure state without cudaMemcpy
- Mailbox bank changed to mapped pinned memory for zero-copy host writes
- Streaming test uses host dispatcher with per-predecoder streams

Verified: d7 streaming 16,824 requests (219 us mean, 31 us/round),
d13 streaming 6,227 requests (455 us mean, 35 us/round), zero errors.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
…actor

- Add host dispatcher with dynamic worker pool (idle_mask, inflight_slot_tags)
  to avoid head-of-line blocking; use libcu++ system-scope atomics for
  rx/tx/ready flags and mapped pinned memory.
- Extend AIPreDecoderService and PreDecoderJob with origin_slot for
  out-of-order completion; default queue_depth 1 for host dispatch.
- Add design doc (host_side_dispatcher_design_gemini.md) with
  spin-polling dispatcher and worker pseudocode/constraints.
- Refactor test_realtime_predecoder_w_pymatching for dynamic pool and
  update CMakeLists; adjust nvqlink daemon and dispatch_kernel for
  host-side dispatch.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
This commit fundamentally redesigns the host-side execution model to achieve
microsecond-level latency, shifting from a general-purpose thread pool to
a strict, pinned, and lock-free architecture.

Key architectural changes in `test_realtime_predecoder_w_pymatching.cpp`:

1. Dedicated Polling Threads (Removed Thread Pool)
   - Replaced `cudaq::qec::utils::ThreadPool` and the single `incoming_thread`
     with a vector of dedicated `std::thread` worker loops.
   - Eliminates queueing latency, mutex locking, and context switching
     overhead. Each worker thread now spins continuously checking for its own
     GPU completions.

2. Strict CPU Thread Pinning
   - Introduced `pin_thread_to_core` and `pin_current_thread_to_core` using
     the Linux `pthread_setaffinity_np` API.
   - Pinned the Dispatcher (Core 2), Producer (Core 3), Consumer (Core 4),
     and all Worker threads (Cores 10+) to ensure they never migrate, keeping
     their CPU caches perfectly warm.

3. High-Resolution Sub-Component Timing
   - Added tracking arrays (`dispatch_ts`, `poll_ts`, `debug_dispatch_ts_arr`)
     piped through `WorkerPoolContext` and `PreDecoderJob`.
   - Updated end-of-run reporting to calculate differences between timestamps,
     proving that Host Dispatch overhead is negligible (~1-3µs) and the
     bottleneck is the GPU inference itself.

4. PyMatching Data Conversion Optimization
   - Inside `pymatching_worker_task`, replaced the conversion of `int32_t`
     syndrome data into a `std::vector<double>`.
   - Now populates a pre-allocated `cudaqx::tensor<uint8_t>` to avoid slow
     double-precision conversions inside the latency-critical worker loop.

5. NVTX Profiling Markers
   - Included `<nvtx3/nvToolsExt.h>` and wrapped key blocks in
     `nvtxRangePushA` and `nvtxRangePop`.
   - Enables generation of `nsys` profiles to visually align CPU thread
     activity with GPU TensorRT execution.

Other changes:
- Enable TensorRT FP16 builder flag (`kFP16`) in `ai_decoder_service.cu`
  for supported platforms to accelerate GPU inference.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
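
The pinning helper described in point 2 is a thin wrapper over the Linux affinity API; the PR's actual helper may differ in signature, but the core is:

```cpp
#include <pthread.h>
#include <sched.h>

// Linux-only: restrict the calling thread to a single CPU core so it
// never migrates and its caches stay warm.
bool pin_current_thread_to_core(int core) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```

Pinning the dispatcher, producer, consumer, and workers to distinct cores also keeps the spin-polling threads from preempting each other.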
…e directory. Refactored the pymatching demo code to use the updated functions in realtime.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…nstrumentation

Remove the duplicate QEC-local host_dispatcher.{h,cpp} in favor of
the canonical realtime library versions, eliminating link ambiguity.

Fix three correctness/performance bugs in the streaming pipeline:
- Consumer was strict in-order, causing 327 µs head-of-line blocking
  when parallel workers complete out of order. Changed to scan all
  active slots and harvest whichever are ready.
- Dispatcher set tx_flags=READY immediately on graph launch (when
  tx_data_host was non-null), causing phantom completions. Set
  tx_data_host/dev to nullptr so dispatcher uses IN_FLIGHT sentinel.
- Race between consumer clearing tx_flags and resetting slot_request:
  producer could see slot available and write slot_request before the
  consumer's slot_request=-1, permanently orphaning the slot. Fixed
  by resetting slot_request before clearing tx_flags with a store
  fence (__sync_synchronize) for ARM memory ordering.

Replace broken timing breakdown (dispatch_ts was always 0, making the
entire report show "Other/Misc Wait") with a 3-stage per-request
breakdown: [A] submit→worker poll, [B] worker task, [C] consumer
poll lag, with p50/p99 percentiles.

Also: reduce NUM_SLOTS 64→16 to cut queuing delay, remove unused
queue_depth from PipelineConfig, add DISABLE_PYMATCHING conditional
compilation, add stuck-request diagnostics, and remove batch mode /
watchdog / dead code.

Results (d7, 8 workers, open-loop):
  62.5K req/s, 230 µs mean latency, 500K/500K completed, 0 drops.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
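
The third fix above (scrub the slot before clearing tx_flags, with a fence) maps onto C++ release/acquire ordering as sketched below. Names are illustrative; the release store is the `std::atomic` equivalent of the `__sync_synchronize()` fence the commit message uses for ARM:

```cpp
#include <atomic>
#include <cstdint>

struct SlotState {
  std::atomic<uint64_t> slot_request{0};
  std::atomic<uint8_t> occupied{0};
  std::atomic<uint32_t> tx_flag{0}; // 0 = free, nonzero = in use/ready
};

// Consumer: scrub the slot first, THEN clear tx_flag with release
// semantics, so a producer that sees the slot free can never observe
// the stale request -- the orphaned-slot race described above.
void consumer_retire(SlotState &s) {
  s.slot_request.store(0, std::memory_order_relaxed);
  s.occupied.store(0, std::memory_order_relaxed);
  s.tx_flag.store(0, std::memory_order_release);
}

// Producer: the acquire load pairs with the consumer's release store.
bool producer_try_acquire(SlotState &s, uint64_t req_id) {
  if (s.tx_flag.load(std::memory_order_acquire) != 0)
    return false; // slot still in flight
  s.slot_request.store(req_id, std::memory_order_relaxed);
  s.occupied.store(1, std::memory_order_relaxed);
  return true;
}
```

On x86 the buggy ordering often works by accident; on Grace/ARM the weaker memory model makes the release/acquire pairing mandatory.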
Increase num_predecoders and num_workers from 8 to 16 across all
config presets, and set NUM_SLOTS to 32. With 8 workers the pipeline
capacity (~24K req/s) was below the 33K req/s arrival rate at 30 µs
spacing, causing unbounded queuing and p99 latency spikes to 4.9 ms.

With 16 workers and 32 slots, d13 at 30 µs arrival sustains 25K req/s
with 299 µs mean latency (23 µs/round), p99 = 334 µs, and near-zero
backpressure (9K stalls vs 38M previously).

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Add optimization profile in build_engine_from_onnx for ONNX models
with dynamic dimensions (batch dim = 0). When detected, pin all
dynamic dims to 1 via min/opt/max profile so TensorRT can build
the engine. Previously these models failed with "Failed to build
TRT engine from ONNX".

Switch d13 config to predecoder_memory_d13_T13_X.onnx, which takes
detectors as input rather than raw measurements. End-to-end latency
drops from 299 µs to 226 µs, mainly from PyMatching (69 µs → 12 µs).

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Replace byte-by-byte memory copies with wider load/store operations
to reduce memory transactions in the CUDA graph. The input kernel now
uses uint32_t (4-byte) copies, the passthrough kernel uses uint4
(16-byte) copies, and the output kernel is replaced entirely with
cudaMemcpyAsync (DMA copy engine) followed by a minimal 1-thread
signal kernel. Thread counts bumped from 128 to 256.

Reduces d13 mean end-to-end latency from 226 µs to 141 µs (a ~85 µs
improvement) and per-round latency from 17.4 µs to 10.8 µs.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Move the input copy from an SM-based kernel inside the CUDA graph to a
host-issued cudaMemcpyAsync via a new pre_launch_fn callback on
HostDispatchWorker. This frees GPU compute units for TRT inference and
reduces Stage A latency by ~19 µs. Add get_trt_input_ptr() and
get_host_ring_ptrs() accessors to support the callback wiring.

Separate the T104 ONNX model into its own d13_r104 pipeline config
(104 rounds, 32K slots) and restore d13_r13 to use the T13 model.

Update design document to reflect DMA data movement, pre-launch
callback, out-of-order consumer, and ARM memory ordering constraints.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Introduce a RealtimePipeline class (pipeline.h, realtime_pipeline.cu)
that encapsulates all ring buffer allocation, atomic synchronization,
dispatcher wiring, worker thread management, and consumer slot lifecycle
behind a callback-driven API. Application code provides a GPU stage
factory, a CPU stage callback, and a completion handler -- zero direct
atomic access required.

Refactor test_realtime_predecoder_w_pymatching.cpp from 1083 lines to
~470 lines by replacing inline atomics, thread management, and slot
tracking with pipeline.submit() / pipeline.stop() / pipeline.stats().
Add d13_r104 config (T=104 model, 131K slot size).

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
21 tests covering AIDecoderService, AIPreDecoderService, and the
host-side dispatcher. Correctness tests push 5,000 random 6.4 KB
payloads through the full CUDA graph pipeline and verify bitwise
identity. Integration tests exercise multi-predecoder concurrency
and sustained throughput (200 requests, regression for the 128-launch
limit fix). SKIP_TRT buffer size increased to 1600 floats to match
realistic syndrome payload sizes.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
The legacy predecoder_input_kernel and its cudaq::nvqlink includes
are no longer used since input data arrives via the pre_launch DMA
callback. Design doc updated to reflect current code: removed kernel
deletion, RealtimePipeline scaffolding, test suite, and SKIP_TRT
buffer size (1600 floats).

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
…provements

Add GPU-only pipeline mode that skips CPU worker threads when no
cpu_stage callback is registered, using cudaLaunchHostFunc for
completion signaling instead. Add post_launch_fn/post_launch_data
callback to HostDispatchWorker and GpuWorkerResources, called after
successful cudaGraphLaunch. Rename CpuStageContext fields to
gpu_output/gpu_output_size and AIPreDecoderService buffers to
h_predecoder_outputs_/d_predecoder_outputs_ for clarity.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…om:wsttiger/cudaqx into add_realtime_ai_predecoder_host_side_gb200
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Address all critical (C1-C4) and major (M1-M12) defects identified
during code review:

Critical fixes:
- C1: Fix race condition in try_submit via compare_exchange_weak
- C2: Use uint64_t + separate occupancy flag for slot_request to
  support full request_id range (was int64_t with -1 sentinel)
- C3: Add __syncthreads() before response header write in
  gateway_output_kernel to prevent partially-written result reads
- C4: Always write IN_FLIGHT sentinel to tx_flags after graph launch

Major fixes:
- M1: Remove cudaSetDeviceFlags from RingBufferManager (caller's duty)
- M2: Use std::atomic load with memory_order_acquire for tx/rx flag
  reads instead of plain volatile (ARM correctness)
- M3: Validate num_workers <= 64 (idle_mask capacity)
- M4: Validate gpu_factory is set before start()
- M5: Check producer_stop in RingBufferInjector::submit to prevent
  infinite spin after shutdown
- M6: Make started flag std::atomic<bool>
- M7: Add CUDA error checks in AIDecoderService::capture_graph
- M8: Check enqueueV3 return value in both service files
- M9: Fix tensor_volume for dynamic-shape dims (was wrapping to
  SIZE_MAX on dim=-1)
- M10: Assert num_workers == num_predecoders in benchmark
- M11: Add aarch64 paths to predecoder test's TRT CMake search
- M12: Replace vector<bool> with vector<uint8_t> to avoid concurrent
  write UB

Also extracts submit logic into RingBufferInjector class to separate
test infrastructure from pipeline core.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
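
The M9 overflow is easy to reproduce: multiplying TensorRT's dynamic-dim sentinel (-1) into an unsigned `size_t` product wraps toward SIZE_MAX. A sketch of the fixed volume computation, with a minimal `Dims` struct mirroring `nvinfer1::Dims` (the PR's actual function name and handling are assumptions here):

```cpp
#include <cstddef>
#include <cstdint>

// Minimal stand-in for nvinfer1::Dims.
struct Dims {
  int nbDims;
  int64_t d[8];
};

// Treat dynamic dims (-1) as 1 instead of letting the unsigned multiply
// wrap; before the fix, a single -1 sent the volume to ~SIZE_MAX.
size_t tensor_volume(const Dims &dims) {
  size_t v = 1;
  for (int i = 0; i < dims.nbDims; ++i) {
    int64_t e = dims.d[i];
    v *= (e < 0) ? 1 : static_cast<size_t>(e);
  }
  return v;
}
```

Treating the dynamic dim as 1 matches the optimization profile that pins batch size to 1 when building the engine.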
@wsttiger wsttiger requested a review from bmhowe23 March 4, 2026 02:26
wsttiger added 3 commits April 2, 2026 03:53
Rename all public classes, structs, and type aliases in the realtime
QEC headers to snake_case, matching the cudaqx project convention
per PR review feedback.

Key renames: AIDecoderService → ai_decoder_service,
AIPreDecoderService → ai_predecoder_service,
PreDecoderJob → pre_decoder_job, RealtimePipeline → realtime_pipeline,
RingBufferInjector → ring_buffer_injector, PipelineStageConfig →
pipeline_stage_config, GpuWorkerResources → gpu_worker_resources,
CpuStageContext → cpu_stage_context, Completion → completion, and
all associated callback type aliases.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
The gateway_output_kernel was leaving request_id and ptp_timestamp
unset in the RPCResponse. Read both fields from the incoming
RPCHeader before overwriting with the response, then echo them
into the corresponding RPCResponse fields.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Collaborator Author

wsttiger commented Apr 2, 2026

> Does anything in this PR use our cudaq::qec::decoder interface? When I Ctrl-F for trt_decoder in this PR, I don't see any hits, but it's unclear if that's because we're not actually using the standard decoder interface or if GitHub is just having trouble with the large diffs.

Yes, we use cudaq::qec::decoder for the CPU-side PyMatching stage — the benchmark creates a pool via decoder::get("pymatching", H, params) and each worker thread grabs one to decode residual syndromes.

The ai_decoder_service / ai_predecoder_service classes don't subclass decoder, though — they're TRT wrappers that manage CUDA graphs, ring buffer I/O, and ready-flag signaling, a fundamentally different abstraction from the synchronous decode(syndrome) → result interface. The trt_decoder plugin isn't used here.

So it's a two-tier hybrid: GPU tier = ai_predecoder_service (TRT), CPU tier = standard decoder plugin (PyMatching).

bmhowe23 and others added 24 commits April 2, 2026 07:26
….github.com>

I, Ben Howe <141149032+bmhowe23@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 8cd20a5

Signed-off-by: Ben Howe <141149032+bmhowe23@users.noreply.github.com>
I, Ben Howe <bhowe@nvidia.com>, hereby add my Signed-off-by to this commit: 74221d6

Signed-off-by: Ben Howe <bhowe@nvidia.com>
I, Ben Howe <bhowe@nvidia.com>, hereby add my Signed-off-by to this commit: 30107d0

Signed-off-by: Ben Howe <bhowe@nvidia.com>
I, Ben Howe <bhowe@nvidia.com>, hereby add my Signed-off-by to this commit: 008a734

Signed-off-by: Ben Howe <bhowe@nvidia.com>
Replace CMAKE_SOURCE_DIR with CMAKE_CURRENT_SOURCE_DIR-relative paths
for test source files and ONNX_MODEL_DIR. In standalone QEC builds
(CI), CMAKE_SOURCE_DIR is libs/qec rather than the repo root, causing
doubled paths like libs/qec/libs/qec/lib/realtime/... that fail to
resolve.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…om:wsttiger/cudaqx into add_realtime_ai_predecoder_host_side_gb200
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Ben Howe <bhowe@nvidia.com>
The ai_decoder_service constructor checked std::getenv("SKIP_TRT") to
decide between TRT model loading and a passthrough identity kernel.
This was fragile and caused CI GPU test failures on 3 of 4 platforms
where the env var wasn't visible at construction time.

Add create_passthrough() static factories on ai_decoder_service and
ai_predecoder_service that construct test-only instances without
touching TRT. Remove the getenv check from the production constructor
and replace the SKIP_TRT check in capture_graph() with a context_
null check. Update all tests to use the factory instead of setenv.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…om:wsttiger/cudaqx into add_realtime_ai_predecoder_host_side_gb200

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Ben Howe <bhowe@nvidia.com>
Collaborator

@bmhowe23 bmhowe23 left a comment


Thanks, @wsttiger !

@bmhowe23 bmhowe23 merged commit 61afa38 into NVIDIA:main Apr 3, 2026
26 checks passed
