docs: add NCCL troubleshooting notes for multi-GPU training (#5212)
Conversation
docs: add NCCL troubleshooting notes for multi-GPU training (isaac-sim#5195)

## Description

Adds a troubleshooting note to the multi-GPU training docs for Linux systems where distributed training may fail with `CUDA error: an illegal memory access was encountered` reported by `ProcessGroupNCCL`.

This PR is documentation-only. It does not change the default distributed training behavior in IsaacLab or `rsl_rl`. The note documents NCCL environment-variable workarounds that were observed to restore stability on some affected systems:

- `NCCL_SHM_DISABLE=1`
- `NCCL_IB_DISABLE=1`
- `NCCL_ALGO=Ring`

The motivation for this change is to provide an official troubleshooting path for users who hit NCCL transport/algo issues on specific Linux multi-GPU setups. In our local reproduction, the failure was not caused by IsaacLab task logic itself, but occurred in the distributed training stack when using NCCL with humanoid locomotion workloads.

Dependencies: none.

Refs isaac-sim#4011
Refs isaac-sim#2756

## Type of change

- Documentation update

## Screenshots

N/A

## Checklist

- [x] I have read and understood the [contribution guidelines](https://isaac-sim.github.io/IsaacLab/main/source/refs/contributing.html)
- [ ] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- [ ] I have added my name to the `CONTRIBUTORS.md` or my name already exists there

Local reproduction environment:

- Ubuntu 22.04.5
- RTX 5090 x2
- Isaac Sim / IsaacLab multi-GPU training
- official distributed minimal reproduction with `Isaac-Velocity-Flat-G1-v0`

Observed behavior:

- the default distributed launch failed with NCCL illegal memory access
- `NCCL_SHM_DISABLE=1` was sufficient to make the official dual-GPU minimal reproduction pass
- `NCCL_SHM_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_ALGO=Ring` also restored stability in a longer validation run

This PR documents those workarounds without changing defaults, since the NCCL transport/algo selection is handled below the IsaacLab task layer.

---------

Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
Signed-off-by: bixiong wang <wangbx02@126.com>
Co-authored-by: bxwang <bixiong.wang@x-humanoid.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Antoine RICHARD <antoiner@nvidia.com>

(cherry picked from commit 4df6560)
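The workarounds above are applied by exporting the variables before launching the distributed run. A minimal sketch, assuming a two-GPU launch via `torch.distributed.run`; the training script path and task flags are illustrative and depend on your IsaacLab checkout, so the launch line is left commented:

```shell
# NCCL workarounds observed to restore stability on some affected systems.
export NCCL_SHM_DISABLE=1   # disable the shared-memory transport
export NCCL_IB_DISABLE=1    # disable the InfiniBand transport
export NCCL_ALGO=Ring       # pin the collective algorithm to Ring

echo "NCCL_SHM_DISABLE=$NCCL_SHM_DISABLE NCCL_IB_DISABLE=$NCCL_IB_DISABLE NCCL_ALGO=$NCCL_ALGO"

# Example launch (commented out; adapt the script path and task to your setup):
# python -m torch.distributed.run --nproc_per_node=2 \
#     scripts/reinforcement_learning/rsl_rl/train.py \
#     --task Isaac-Velocity-Flat-G1-v0 --headless --distributed
```

Since these are process-level environment variables, they can also be set inline for a single run (`NCCL_SHM_DISABLE=1 python -m torch.distributed.run ...`) without affecting other jobs.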
**Greptile Summary**

This PR adds an NCCL troubleshooting subsection to the multi-GPU training docs and links it from the general troubleshooting page.

**Confidence Score: 5/5**

Documentation-only PR with no code changes; safe to merge. All changes are RST documentation. The Sphinx anchor label and cross-reference are correctly paired, heading underlines meet RST length requirements, the NCCL environment variables are valid, and the new sections are stylistically consistent with the rest of both files. No P0/P1 findings. No files require special attention.
| Filename | Overview |
|---|---|
| docs/source/features/multi_gpu.rst | Adds a properly anchored "Troubleshooting NCCL Errors" subsection (`^`-level) with three env-var workarounds and a performance-impact note; RST syntax and label placement are correct. |
| docs/source/refs/troubleshooting.rst | Adds a top-level (`-`-level) section that briefly describes the NCCL error and defers to the multi_gpu.rst anchor via ``:ref:`multi-gpu-nccl-troubleshooting```; heading underline length matches exactly (48 chars each), cross-reference resolves correctly. |
Flowchart:

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User hits NCCL error\nduring multi-GPU training] --> B{Error type}
    B -->|CUDA illegal memory access\nfrom ProcessGroupNCCL| C[Set NCCL_SHM_DISABLE=1]
    C --> D{Resolved?}
    D -->|No| E[Set NCCL_IB_DISABLE=1\nand NCCL_ALGO=Ring]
    E --> F[Relaunch distributed training]
    D -->|Yes| F
    B -->|Other| G[See general troubleshooting.rst]
```
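The staged escalation in the flowchart can be sketched as a small shell wrapper. This is a hypothetical helper, not part of IsaacLab; `LAUNCH_CMD` stands in for your real distributed training command and is set to a no-op placeholder so the sketch runs:

```shell
# Staged NCCL workaround escalation (hypothetical sketch).
# Replace the placeholder with your actual launch command, e.g.
# "python -m torch.distributed.run --nproc_per_node=2 train.py ..."
LAUNCH_CMD="true"

# Stage 1: disable only the shared-memory transport.
if NCCL_SHM_DISABLE=1 $LAUNCH_CMD; then
    echo "resolved with NCCL_SHM_DISABLE=1"
else
    # Stage 2: additionally disable InfiniBand and pin the Ring algorithm.
    NCCL_SHM_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_ALGO=Ring $LAUNCH_CMD \
        && echo "resolved with full workaround set"
fi
```

Setting the variables inline, as above, keeps each stage independent: nothing leaks into the shell environment between attempts.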
Reviews (1): Last reviewed commit: "docs: add NCCL troubleshooting notes for..."
Pull request overview
Adds documentation guidance for diagnosing and mitigating NCCL-related failures seen during Linux multi-GPU distributed training, and links this guidance from the general troubleshooting page.
Changes:
- Add a new troubleshooting entry in the general troubleshooting guide that points users to NCCL-specific guidance.
- Add an NCCL troubleshooting section to the multi-GPU training docs documenting environment-variable workarounds (`NCCL_SHM_DISABLE`, `NCCL_IB_DISABLE`, `NCCL_ALGO`).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| docs/source/refs/troubleshooting.rst | Adds a troubleshooting section and links to the multi-GPU NCCL troubleshooting anchor. |
| docs/source/features/multi_gpu.rst | Introduces a new NCCL troubleshooting subsection with suggested environment-variable mitigations. |
Docs-only PR — LGTM. The NCCL env vars are correct, the escalation order (shared memory → IB → ring algorithm) is sensible, RST formatting is clean, and the cross-reference from troubleshooting.rst resolves properly.
A couple of optional suggestions for future improvement (not blocking):

- `NCCL_P2P_DISABLE=1` — worth mentioning as another fallback. Peer-to-peer transport failures can produce the same `illegal memory access` symptom, especially on multi-GPU desktops where P2P isn't well supported across PCIe bridges.
- `NCCL_DEBUG=INFO` — a brief mention that users can set this to identify which transport is failing would help them narrow down which `NCCL_*_DISABLE` variable they actually need, rather than blindly stacking all of them.
- The `.. note::` block could be slightly more specific about the trade-off — e.g., "may reduce inter-GPU communication throughput by falling back to slower transports" rather than the current generic "may change communication behavior or performance".
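As the reviewer suggests, turning on NCCL's own logging is the quickest way to see which transport is failing before reaching for the disable flags. A minimal sketch; the relaunch command is left commented since it depends on your setup, and restricting `NCCL_DEBUG_SUBSYS` to the `INIT` and `NET` subsystems is an optional noise-reduction choice:

```shell
# Enable NCCL diagnostics; INIT and NET cover transport selection and setup.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

echo "NCCL_DEBUG=$NCCL_DEBUG NCCL_DEBUG_SUBSYS=$NCCL_DEBUG_SUBSYS"

# Relaunch the failing job (commented out; adapt to your setup):
# python -m torch.distributed.run --nproc_per_node=2 train.py ...
#
# Then inspect the rank logs for lines like "NCCL INFO ... via SHM" or
# "via P2P" to identify which transport is active when the error occurs.
```

Once the failing transport is known, only the corresponding `NCCL_*_DISABLE` variable needs to be set, avoiding the performance cost of disabling everything at once.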
None of these are blockers. Good addition — the referenced issues (#4011, #2756) show this trips up real users.
Thanks @binzily!
docs: add NCCL troubleshooting notes for multi-GPU training (isaac-sim#5212)

Mirrors the documentation changes from isaac-sim#5195 onto `develop`.

## Description

Adds NCCL troubleshooting notes to the multi-GPU docs and links them from the general troubleshooting page.

Refs isaac-sim#4011
Refs isaac-sim#2756

## Type of change

- Documentation update

Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
Signed-off-by: bixiong wang <wangbx02@126.com>
Co-authored-by: bxwang <bixiong.wang@x-humanoid.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Antoine RICHARD <antoiner@nvidia.com>