[DO NOT MERGE] Reproduce CI setup locally and Catch stalling#568
Conversation
kellyguo11 left a comment
Review: Reproduction Script for CI Subprocess Stalling
Overall: Solid diagnostic script — well-structured, good logging, and the phased approach (bulk → isolated) is exactly right for root-causing CI stalls. Since this is [DO NOT MERGE], treating it as a debugging artifact rather than production code.
What Works Well
- Clean phased design: cache wipe → in-process → bulk subprocess → isolated fallback
- `taskset -c $CPUS` to faithfully simulate CI core constraints
- Tee-to-logfile with timestamps, good for offline analysis
- Orphan process detection between tests (`show_orphans`) is the key diagnostic
- `--subprocess-only` and `--loop` modes are thoughtful additions
Suggestions
- `set -euo pipefail` + `|| true` pattern: the script uses `set -e` but then appends `|| true` on most pytest calls. This is fine for the in-process section, but note that in `run_subprocess_bulk` there is no `|| true`: if the bulk run fails, it correctly falls through to isolation. Consistent and intentional.
- `kill_orphans` is aggressive: `pkill -9 -f "isaac-sim/kit/python"` could match unrelated processes if someone runs this outside a container. Since this is a repro script meant for containers, it is acceptable, but worth a comment.
- Missing `--loop` cache wipe between iterations: in loop mode, `wipe_kit_caches` runs each iteration, which is correct, but `kill_orphans` is only called on bulk failure. If loop mode is meant to stress-test accumulation, consider adding an option to skip orphan cleanup between loops to truly simulate process buildup.
- Hardcoded `$PYTHON` path: `/isaac-sim/python.sh` assumes the NVIDIA container layout. A fallback like `PYTHON="${PYTHON:-/isaac-sim/python.sh}"` would make this usable outside the standard container (already done for other env vars).
- Log rotation: in `--loop` mode, each iteration gets the same logfile (since `TIMESTAMP` is set once), so all loop iterations append to one file. Probably fine, but if separate logs per iteration are desired, move `TIMESTAMP` inside `run_once`.
- Phase 1 has commented-out sections: the Newton and PhysX-no-cameras sections are commented out. Makes sense for a focused repro, but a `--full-ci` flag that enables all sections could be useful for complete CI simulation.
Root Cause Analysis (from the PR description)
The diagnosis is convincing: Kit startup CPU contention + orphaned telemetry processes = deadlock on 2 cores by the 3rd subprocess test. The script proves this clearly. The fix should likely be:
- Kill orphan processes between subprocess tests in CI
- Or increase CI runner core count
- Or add subprocess timeouts with proper cleanup
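The third option (subprocess timeouts with proper cleanup) could look roughly like the following sketch. This is a hypothetical helper, not code from the PR: `run_with_timeout`, its signature, and the return convention are all assumptions made for illustration.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run a subprocess test command with a wall-clock timeout.

    Launches the command in a new session so it leads its own process
    group; on timeout, SIGKILLs the whole group, taking down the Kit
    process together with any telemetry children it spawned.
    """
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Kill the entire process group, not just the direct child,
        # so no orphans survive to contend for CPU in the next test.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the child to avoid a zombie
        return None  # caller treats None as a stall / timeout
```

Killing the process group (rather than just the direct child) is the part that matters here: a plain `proc.kill()` would leave the telemetry transmitters alive, which is exactly the accumulation this review describes.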
Nice debugging work. 👍
## Summary

Subprocess-spawning tests hang indefinitely on CI.

## Causes & Fixes

### Problems From Lab
1. Lab reports "AppLauncher doesn't quit properly after app.close(), app.quit() doesn't help either."
2. Cold startup times for tests using IS can be upwards of 10 min on Lab CI machines.

The above issues apply to us because tests hang during the sub-process test section, between the end of the last test and the beginning of the next one. See detailed logs and analysis from reproducing locally [here](#568).

### Fixes
1. `SimulationApp` force exit: skips `app.close()` (which can hang indefinitely in Kit's shutdown path) when the env var `ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE=1` is set. Calls a new `_kill_child_processes()` helper that walks `/proc` to `SIGKILL` all direct children before doing `os._exit(0)`, preventing orphaned Kit processes from holding GPU resources.
2. `run_subprocess` gains a configurable wall-clock timeout and process isolation, so that when needed it can trigger the force-exit path above.
3. Add wall-clock timing and logging inside the `SimulationApp` start method, to track how long startup takes on CI.

## Minor fixes
1. Add timing stats to the pytest commands so each section reports its slowest test functions at the end.
2. Parametrize multi-config tests: convert the nested for-loops in `test_zero_action_policy_kitchen_pick_and_place` (6 configs) and `test_zero_action_policy_gr1_open_microwave` (3 configs) into `@pytest.mark.parametrize`. Each config gets its own timeout, pass/fail, and timing.
3. Reduce `num_envs` in the gr00t `eval_runner` test to speed it up.

### Local validation

With the repro script from #568, I see no local stalling. See the log for details.
[repro_20260410_041313.log](https://github.com/user-attachments/files/26620524/repro_20260410_041313.log)

### CI Before -- timeout
<img width="1219" height="170" alt="image" src="https://github.com/user-attachments/assets/2f9eabb2-403d-4257-bd84-4da508de7d00" />

### CI After
<img width="1219" height="170" alt="image" src="https://github.com/user-attachments/assets/dbaf2a7d-e3a4-4ad2-85a4-389eae962c1d" />
<img width="1198" height="472" alt="image" src="https://github.com/user-attachments/assets/8a24f1aa-4bcb-4030-b075-09f3885673c2" />

## TODOs
- test_camera_observations takes 10 min to start the app due to Kit cold start. Experimenting with a warm start before the test process in #565.
- Kit itself intermittently deadlocks during startup, not because of orphans, but because Kit's internal thread synchronization fails on low-CPU runners. Experimenting with a retry in #570.
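Fix (1)'s `_kill_child_processes()` helper (the PR does not show its body) might be sketched roughly as follows, assuming a Linux `/proc` filesystem. The function name mirrors the description; everything else is an illustrative assumption.

```python
import os
import signal


def kill_child_processes(parent_pid=None):
    """SIGKILL all direct children of parent_pid by walking /proc.

    Returns the list of PIDs that were killed. Intended to run just
    before os._exit(0), so orphaned Kit children (e.g. telemetry
    transmitters) cannot outlive the test and hold GPU resources.
    """
    parent_pid = parent_pid or os.getpid()
    killed = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                stat = f.read()
            # /proc/<pid>/stat is "pid (comm) state ppid ..."; the comm
            # field may contain spaces or ')', so split after the last ')'.
            ppid = int(stat.rpartition(")")[2].split()[1])
        except (OSError, ValueError, IndexError):
            continue  # process vanished or stat was unreadable
        if ppid == parent_pid:
            try:
                os.kill(int(entry), signal.SIGKILL)
                killed.append(int(entry))
            except ProcessLookupError:
                pass  # already gone
    return killed
```

Parsing after the last `)` matters because a process's comm field can itself contain parentheses; naively splitting the whole line on whitespace would mis-read the ppid for such processes.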
Setup
Steps to reproduce
With 4 CPU cores and no cache cleanup, the tests succeed:
`CPUS=0-3 bash scripts/repro_ci_subprocess_stall.sh` → repro_20260409_222116.log
With only 2 CPU cores (our runner setup), it reproduces the CI stalling at phase 2a:
`CPUS=0-1 SKIP_CACHE_WIPE=1 bash scripts/repro_ci_subprocess_stall.sh` → repro_20260409_233745.log
Explanations from Claude
Phase 1 (in-process): 95s total, left behind omni.telemetry.transmitter orphan (PID 6774).
Phase 2 (subprocess) — each test spawns Kit, and you can see the degradation:
| Test | Kit startup | Scene creation | Total |
|---|---|---|---|
| test_action_chunking_client | 23:39:30 → 23:40:19 = 49s | 0.66s | ~66s |
| test_external_environment_franka_table | 23:40:36 → 23:40:59 = 23s | 0.46s | ~30s |
| test_external_environment_franka_table_with_task | 23:41:06 → never completes | -- | STALLED |
The pattern is clear: on 2 CPUs, each subprocess Kit instance starts slower, and by the 3rd one it deadlocks entirely. The scene creation itself is fast, but Kit startup is the bottleneck. Each test also leaves behind additional child processes (telemetry transmitters), so by test #3, you have the pytest host + orphans all fighting for 2 CPU cores.
The root cause is confirmed: Kit startup contends for CPU during extension loading/shader compilation, and with only 2 cores + accumulated orphans, it eventually deadlocks.
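The orphan accumulation described above can be checked with a minimal `/proc` scan, a Python analogue of the repro script's `show_orphans` step. This is hypothetical diagnostic code, not taken from the script; the pattern string is whatever substring identifies the processes of interest (e.g. `"omni.telemetry.transmitter"`).

```python
import os


def find_orphans(pattern):
    """Return (pid, cmdline) for every process whose command line
    contains `pattern`, excluding the calling process itself."""
    matches = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            # cmdline is NUL-separated; join args with spaces for matching.
            with open(f"/proc/{entry}/cmdline", "rb") as f:
                cmdline = (
                    f.read().replace(b"\0", b" ").decode(errors="replace").strip()
                )
        except OSError:
            continue  # process exited while we were scanning
        if pattern in cmdline and int(entry) != os.getpid():
            matches.append((int(entry), cmdline))
    return matches
```

Running this between subprocess tests and asserting the list is empty would turn the manual log inspection above (orphan PID 6774 after Phase 1) into an automatic CI check.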