[DO NOT MERGE] Reproduce CI setup locally and Catch stalling#568
Conversation
kellyguo11 left a comment
Review: Reproduction Script for CI Subprocess Stalling
Overall: Solid diagnostic script — well-structured, good logging, and the phased approach (bulk → isolated) is exactly right for root-causing CI stalls. Since this is [DO NOT MERGE], treating it as a debugging artifact rather than production code.
What Works Well
- Clean phased design: cache wipe → in-process → bulk subprocess → isolated fallback
- `taskset -c $CPUS` to faithfully simulate CI core constraints
- Tee-to-logfile with timestamps, good for offline analysis
- Orphan process detection between tests (`show_orphans`) is the key diagnostic
- `--subprocess-only` and `--loop` modes are thoughtful additions
Suggestions
- `set -euo pipefail` + `|| true` pattern: the script uses `set -e` but then appends `|| true` on most pytest calls. This is fine for the in-process section, but note that in `run_subprocess_bulk` there is no `|| true`: if the bulk run fails, it correctly falls through to isolation. Consistent and intentional.
- `kill_orphans` is aggressive: `pkill -9 -f "isaac-sim/kit/python"` could match unrelated processes if someone runs this outside a container. Since this is a repro script meant for containers, it is acceptable, but worth a comment.
- Missing `--loop` cache wipe between iterations: in loop mode, `wipe_kit_caches` runs each iteration, which is correct, but `kill_orphans` is only called on bulk failure. If loop mode is meant to stress-test accumulation, consider adding an option to skip orphan cleanup between loops to truly simulate process buildup.
- Hardcoded `$PYTHON` path: `/isaac-sim/python.sh` assumes the NVIDIA container layout. A fallback like `PYTHON="${PYTHON:-/isaac-sim/python.sh}"` would make this usable outside the standard container (already done for other env vars).
- Log rotation: in `--loop` mode, each iteration gets the same logfile (since `TIMESTAMP` is set once), so all loop iterations append to one file. Probably fine, but if separate logs per iteration are desired, move `TIMESTAMP` inside `run_once`.
- Phase 1 has commented-out sections: the Newton and PhysX-no-cameras sections are commented out. Makes sense for a focused repro, but a `--full-ci` flag that enables all sections could be useful for complete CI simulation.
Root Cause Analysis (from the PR description)
The diagnosis is convincing: Kit startup CPU contention + orphaned telemetry processes = deadlock on 2 cores by the 3rd subprocess test. The script proves this clearly. The fix should likely be:
- Kill orphan processes between subprocess tests in CI
- Or increase CI runner core count
- Or add subprocess timeouts with proper cleanup
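The third option (subprocess timeouts with proper cleanup) could look roughly like the following sketch. This is a hypothetical helper, not code from the PR: `run_with_timeout`, its signature, and the return convention are all assumptions made for illustration.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run a subprocess test command with a wall-clock timeout.

    Launches the command in a new session so it leads its own process
    group; on timeout, SIGKILLs the whole group, taking down the Kit
    process together with any telemetry children it spawned.
    """
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Kill the entire process group, not just the direct child,
        # so no orphans survive to contend for CPU in the next test.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the child to avoid a zombie
        return None  # caller treats None as a stall / timeout
```

Killing the process group (rather than just the direct child) is the part that matters here: a plain `proc.kill()` would leave the telemetry transmitters alive, which is exactly the accumulation this review describes.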
Nice debugging work. 👍
## Summary

Subprocess-spawning tests hang indefinitely on CI.

## Causes & Fixes

### Problems From Lab
1. Lab reports "AppLauncher doesn't quit properly after app.close(), app.quit() doesn't help either."
2. Cold startup times for tests using IS can be upwards of 10 min on Lab CI machines.

The above issues apply to us because tests hang during the sub-process test section, between the end of the last test and the beginning of the next one. See detailed logs and analysis from reproducing locally [here](#568).

### Fixes
1. `SimulationApp` force exit: skips `app.close()` (which can hang indefinitely in Kit's shutdown path) when the env var `ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE=1` is set. Calls a new `_kill_child_processes()` helper that walks `/proc` to `SIGKILL` all direct children before doing `os._exit(0)`, preventing orphaned Kit processes from holding GPU resources.
2. `run_subprocess` gains a configurable wall-clock timeout and process isolation, so that when needed it can trigger the force-exit path above.
3. Add wall-clock timing and logging inside the `SimulationApp` start method, to track how long startup takes on CI.

## Minor fixes
1. Add timing stats to the pytest commands so each section reports its slowest test functions at the end.
2. Parametrize multi-config tests: convert the nested for-loops in `test_zero_action_policy_kitchen_pick_and_place` (6 configs) and `test_zero_action_policy_gr1_open_microwave` (3 configs) into `@pytest.mark.parametrize`. Each config gets its own timeout, pass/fail, and timing.
3. Reduce `num_envs` in the gr00t `eval_runner` test to speed it up.

### Local validation

With the repro script from #568, I see no local stalling. See the log for details.
[repro_20260410_041313.log](https://github.com/user-attachments/files/26620524/repro_20260410_041313.log)

### CI Before -- timeout
<img width="1219" height="170" alt="image" src="https://github.com/user-attachments/assets/2f9eabb2-403d-4257-bd84-4da508de7d00" />

### CI After
<img width="1219" height="170" alt="image" src="https://github.com/user-attachments/assets/dbaf2a7d-e3a4-4ad2-85a4-389eae962c1d" />
<img width="1198" height="472" alt="image" src="https://github.com/user-attachments/assets/8a24f1aa-4bcb-4030-b075-09f3885673c2" />

## TODOs
- test_camera_observations takes 10 min to start the app due to Kit cold start. Experimenting with a warm start before the test process in #565.
- Kit itself intermittently deadlocks during startup, not because of orphans, but because Kit's internal thread synchronization fails on low-CPU runners. Experimenting with a retry in #570.
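Fix (1)'s `_kill_child_processes()` helper (the PR does not show its body) might be sketched roughly as follows, assuming a Linux `/proc` filesystem. The function name mirrors the description; everything else is an illustrative assumption.

```python
import os
import signal


def kill_child_processes(parent_pid=None):
    """SIGKILL all direct children of parent_pid by walking /proc.

    Returns the list of PIDs that were killed. Intended to run just
    before os._exit(0), so orphaned Kit children (e.g. telemetry
    transmitters) cannot outlive the test and hold GPU resources.
    """
    parent_pid = parent_pid or os.getpid()
    killed = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                stat = f.read()
            # /proc/<pid>/stat is "pid (comm) state ppid ..."; the comm
            # field may contain spaces or ')', so split after the last ')'.
            ppid = int(stat.rpartition(")")[2].split()[1])
        except (OSError, ValueError, IndexError):
            continue  # process vanished or stat was unreadable
        if ppid == parent_pid:
            try:
                os.kill(int(entry), signal.SIGKILL)
                killed.append(int(entry))
            except ProcessLookupError:
                pass  # already gone
    return killed
```

Parsing after the last `)` matters because a process's comm field can itself contain parentheses; naively splitting the whole line on whitespace would mis-read the ppid for such processes.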
Setup
Steps to reproduce
With 4 CPU cores and no cache cleanup, the tests succeed:
`CPUS=0-3 bash scripts/repro_ci_subprocess_stall.sh` → repro_20260409_222116.log
With only 2 CPU cores (our runner setup), it reproduces the CI stalling at phase 2a:
`CPUS=0-1 SKIP_CACHE_WIPE=1 bash scripts/repro_ci_subprocess_stall.sh` → repro_20260409_233745.log
Explanations from Claude
Phase 1 (in-process): 95s total, left behind omni.telemetry.transmitter orphan (PID 6774).
Phase 2 (subprocess) — each test spawns Kit, and you can see the degradation:
| Test | Kit startup | Scene creation | Total |
|---|---|---|---|
| test_action_chunking_client | 23:39:30 → 23:40:19 = 49s | 0.66s | ~66s |
| test_external_environment_franka_table | 23:40:36 → 23:40:59 = 23s | 0.46s | ~30s |
| test_external_environment_franka_table_with_task | 23:41:06 → never completes | -- | STALLED |
The pattern is clear: on 2 CPUs, each subprocess Kit instance starts slower, and by the 3rd one it deadlocks entirely. The scene creation itself is fast, but Kit startup is the bottleneck. Each test also leaves behind additional child processes (telemetry transmitters), so by test #3, you have the pytest host + orphans all fighting for 2 CPU cores.
The root cause is confirmed: Kit startup contends for CPU during extension loading/shader compilation, and with only 2 cores + accumulated orphans, it eventually deadlocks.
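The orphan accumulation described above can be checked with a minimal `/proc` scan, a Python analogue of the repro script's `show_orphans` step. This is hypothetical diagnostic code, not taken from the script; the pattern string is whatever substring identifies the processes of interest (e.g. `"omni.telemetry.transmitter"`).

```python
import os


def find_orphans(pattern):
    """Return (pid, cmdline) for every process whose command line
    contains `pattern`, excluding the calling process itself."""
    matches = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            # cmdline is NUL-separated; join args with spaces for matching.
            with open(f"/proc/{entry}/cmdline", "rb") as f:
                cmdline = (
                    f.read().replace(b"\0", b" ").decode(errors="replace").strip()
                )
        except OSError:
            continue  # process exited while we were scanning
        if pattern in cmdline and int(entry) != os.getpid():
            matches.append((int(entry), cmdline))
    return matches
```

Running this between subprocess tests and asserting the list is empty would turn the manual log inspection above (orphan PID 6774 after Phase 1) into an automatic CI check.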