
[DO NOT MERGE] Reproduce CI setup locally and Catch stalling#568

Draft
xyao-nv wants to merge 1 commit into `main` from `xyao/exp/repro_ci_stalling`

Conversation


xyao-nv (Collaborator) commented Apr 10, 2026

Setup

  1. Constrain CPUs
  2. Optionally wipe Kit caches (shader/USD caches)
  3. Run Phase 1 (in-process camera tests)
  4. Run Phase 2a (subprocess bulk tests)
  5. Run Phase 2b (subprocess isolated tests)

Steps to reproduce

  1. With 4 CPU cores and no cache cleanup, the tests pass.
    CPUS=0-3 bash scripts/repro_ci_subprocess_stall.sh
    repro_20260409_222116.log

  2. With only 2 CPU cores (our runner setup), the CI stall reproduces at phase 2a.
    CPUS=0-1 SKIP_CACHE_WIPE=1 bash scripts/repro_ci_subprocess_stall.sh
    repro_20260409_233745.log

Explanations from Claude

Phase 1 (in-process): 95 s total; left behind an `omni.telemetry.transmitter` orphan (PID 6774).

Phase 2 (subprocess) — each test spawns Kit, and you can see the degradation:

| Test | Kit startup | Scene creation | Total |
| --- | --- | --- | --- |
| test_action_chunking_client | 23:39:30 → 23:40:19 = 49 s | 0.66 s | ~66 s |
| test_external_environment_franka_table | 23:40:36 → 23:40:59 = 23 s | 0.46 s | ~30 s |
| test_external_environment_franka_table_with_task | 23:41:06 → never completes | -- | STALLED |

The pattern is clear: on 2 CPUs, each subprocess Kit instance starts slower, and by the 3rd one it deadlocks entirely. The scene creation itself is fast, but Kit startup is the bottleneck. Each test also leaves behind additional child processes (telemetry transmitters), so by test #3, you have the pytest host + orphans all fighting for 2 CPU cores.

The root cause is confirmed: Kit startup contends for CPU during extension loading/shader compilation, and with only 2 cores + accumulated orphans, it eventually deadlocks.


@kellyguo11 kellyguo11 left a comment


Review: Reproduction Script for CI Subprocess Stalling

Overall: Solid diagnostic script — well-structured, good logging, and the phased approach (bulk → isolated) is exactly right for root-causing CI stalls. Since this is [DO NOT MERGE], treating it as a debugging artifact rather than production code.

What Works Well

  • Clean phased design: cache wipe → in-process → bulk subprocess → isolated fallback
  • taskset -c $CPUS to faithfully simulate CI core constraints
  • Tee-to-logfile with timestamps — good for offline analysis
  • Orphan process detection between tests (show_orphans) is the key diagnostic
  • --subprocess-only and --loop modes are thoughtful additions

Suggestions

  1. set -euo pipefail + || true pattern — The script uses set -e but then appends || true on most pytest calls. This is fine for the in-process section, but note that in run_subprocess_bulk, there is no || true — if the bulk run fails, it correctly falls through to isolation. Consistent and intentional.

  2. `kill_orphans` is aggressive: `pkill -9 -f "isaac-sim/kit/python"` could match unrelated processes if someone runs this outside a container. Since this is a repro script meant for containers, it is acceptable, but worth a comment.

  3. Missing --loop cache wipe between iterations — In loop mode, wipe_kit_caches runs each iteration, which is correct. But kill_orphans is only called on bulk failure. If loop mode is meant to stress-test accumulation, consider adding an option to skip orphan cleanup between loops to truly simulate process buildup.

  4. Hardcoded `$PYTHON` path: `/isaac-sim/python.sh` assumes the NVIDIA container layout. A fallback like `PYTHON="${PYTHON:-/isaac-sim/python.sh}"` would make this usable outside the standard container (already done for other env vars).

  5. Log rotation — In --loop mode, each iteration gets the same logfile (since TIMESTAMP is set once). This means all loop iterations append to one file, which is probably fine but worth noting. If separate logs per iteration are desired, move TIMESTAMP inside run_once.

  6. Phase 1 has commented-out sections — The Newton and PhysX-no-cameras sections are commented out. Makes sense for focused repro, but a --full-ci flag that enables all sections could be useful for complete CI simulation.

Root Cause Analysis (from the PR description)

The diagnosis is convincing: Kit startup CPU contention + orphaned telemetry processes = deadlock on 2 cores by the 3rd subprocess test. The script proves this clearly. The fix should likely be:

  • Kill orphan processes between subprocess tests in CI
  • Or increase CI runner core count
  • Or add subprocess timeouts with proper cleanup
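The subprocess-timeout-with-cleanup option could look like the following sketch, assuming a POSIX host; `run_subprocess` here is illustrative, not the project's actual helper:

```python
import os
import signal
import subprocess

def run_subprocess(cmd, timeout_s):
    """Run `cmd` in its own process group; on timeout, SIGKILL the
    whole group so no orphaned children (e.g. telemetry transmitters)
    survive to contend for CPU in later tests."""
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # start_new_session=True made proc the group leader, so this
        # kills the spawned process and all of its children at once.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()
        raise

print(run_subprocess(["true"], timeout_s=5))  # -> 0
```

Killing the process group rather than the single PID is what prevents the orphan accumulation observed above.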

Nice debugging work. 👍

xyao-nv added a commit that referenced this pull request Apr 10, 2026
## Summary
Subprocess-spawning tests hang indefinitely on CI.

## Causes & Fixes

### Problems

From Lab:

1. Lab reports "AppLauncher doesn't quit properly after `app.close()`; `app.quit()` doesn't help either."
2. Cold startup times for tests using IS can be upwards of 10 min on Lab
CI machines.

The issues above apply to us as well: tests hang during the sub-process test section, between the end of one test and the start of the next. See detailed logs and analysis from the local reproduction [here](#568).

### Fixes

1. `SimulationApp` Force Exit: Skips `app.close()` (which can hang
indefinitely in Kit's shutdown path) when the env var
`ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE=1` is set.
Calls a new `_kill_child_processes()` helper that walks `/proc` to
`SIGKILL` all direct children before doing `os._exit(0)`, preventing
orphaned Kit processes from holding GPU resources.

2. `run_subprocess` gets a configurable wall-clock timeout and process
isolation, so that when needed it can trigger the force-exit path
above.

3. Add wall-clock timing and logging inside the `SimulationApp` start
method, to track how long startup takes on CI.
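The child-killing helper from fix 1 can be sketched as follows; this is an illustration of the described approach (walk `/proc`, SIGKILL direct children, then `os._exit(0)`), and the real helper's exact behavior may differ:

```python
import os
import signal
import subprocess

def kill_child_processes():
    """SIGKILL every direct child of this process by walking /proc.

    Sketch of the approach described above, for Linux hosts; not the
    codebase's actual implementation.
    """
    me = os.getpid()
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                stat = f.read()
            # /proc/<pid>/stat is "pid (comm) state ppid ..."; comm may
            # contain spaces, so split after the closing parenthesis.
            ppid = int(stat.rsplit(")", 1)[1].split()[1])
        except (OSError, IndexError, ValueError):
            continue  # process vanished or stat was unreadable
        if ppid == me:
            try:
                os.kill(int(entry), signal.SIGKILL)
            except ProcessLookupError:
                pass  # child already exited

# Demo: spawn a long-running child, kill it, confirm death by SIGKILL.
child = subprocess.Popen(["sleep", "60"])
kill_child_processes()
print(child.wait())  # -> -9 (terminated by SIGKILL)
```

After this cleanup, `os._exit(0)` skips Python and Kit shutdown hooks entirely, which is what avoids the hang in `app.close()`.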

## Minor fixes

1. Add timing stats to the pytest commands so that each section reports
its slowest test functions at the end.

2. Parametrize multi-config tests: convert the nested for-loops in
`test_zero_action_policy_kitchen_pick_and_place` (6 configs) and
`test_zero_action_policy_gr1_open_microwave` (3 configs) into
`@pytest.mark.parametrize`. Each config gets its own timeout, pass/fail
status, and timing.

3. Reduce `num_envs` in the gr00t `eval_runner` test to speed it up.
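The parametrize conversion in item 2 follows the standard pytest pattern; a sketch with hypothetical placeholder config names (the real tests iterate over the six kitchen pick-and-place configs and three GR1 microwave configs):

```python
import pytest

# Hypothetical placeholder configs, standing in for the real ones.
CONFIGS = ["config_a", "config_b", "config_c"]

@pytest.mark.parametrize("config_name", CONFIGS)
def test_zero_action_policy(config_name):
    # Each config now runs as an independent test case, so it gets its
    # own timeout, pass/fail status, and entry in the timing report.
    assert config_name in CONFIGS
```

Unlike a nested for-loop inside one test body, a stall in one config no longer masks the results of the remaining configs.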

### Local validation
With the repro script from #568, I do not see local stalling. See the
attached log for details.

[repro_20260410_041313.log](https://github.com/user-attachments/files/26620524/repro_20260410_041313.log)


### CI Before -- timeout
<img width="1219" height="170" alt="image"
src="https://github.com/user-attachments/assets/2f9eabb2-403d-4257-bd84-4da508de7d00"
/>

### CI After
<img width="1219" height="170" alt="image"
src="https://github.com/user-attachments/assets/dbaf2a7d-e3a4-4ad2-85a4-389eae962c1d"
/>
<img width="1198" height="472" alt="image"
src="https://github.com/user-attachments/assets/8a24f1aa-4bcb-4030-b075-09f3885673c2"
/>

## TODOs
- test_camera_observations takes 10 min to start the app due to Kit cold
start. Experimenting with a warm start before the test process in
#565
- Kit itself intermittently deadlocks during startup — not because of
orphans, but because Kit's internal thread synchronization fails on
low-CPU runners. Experimenting with retry here
#570
