docs: add NCCL troubleshooting notes for multi-GPU training #5195
Conversation
Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
Greptile Summary

This documentation-only PR adds a "Troubleshooting NCCL Errors" subsection to the multi-GPU training page, documenting three environment-variable workarounds (`NCCL_SHM_DISABLE`, `NCCL_IB_DISABLE`, `NCCL_ALGO`).

Confidence Score: 5/5. This PR is safe to merge: it is a documentation-only change with no code modifications. Single .rst file addition with correct RST formatting, accurate NCCL environment variable names and valid values, and an appropriate performance caveat. No code is altered and no behavior changes. All observations are P2 or lower. No files require special attention.

Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Launch multi-GPU training] --> B{NCCL error?}
    B -- No --> C[Training runs successfully]
    B -- Yes: illegal memory access --> D[Set NCCL_SHM_DISABLE=1 and relaunch]
    D --> E{Error persists?}
    E -- No --> C
    E -- Yes --> F[Also set NCCL_IB_DISABLE=1 and NCCL_ALGO=Ring]
    F --> G[Relaunch training]
    G --> C
```
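The escalation path in the flowchart can be sketched as a shell session. The relaunch command itself is a placeholder here; substitute whatever distributed launch you normally run:

```shell
# Step 1: on an NCCL "illegal memory access" failure, disable the
# shared-memory transport only, then relaunch training.
export NCCL_SHM_DISABLE=1
# ./your_distributed_launch.sh   # placeholder for the real command

# Step 2: if the error persists, also disable the InfiniBand transport
# and pin the collective algorithm to Ring before relaunching.
export NCCL_IB_DISABLE=1
export NCCL_ALGO=Ring
echo "NCCL_SHM_DISABLE=$NCCL_SHM_DISABLE NCCL_IB_DISABLE=$NCCL_IB_DISABLE NCCL_ALGO=$NCCL_ALGO"
```

Exported variables apply to every launch from that shell session, so remember to unset them when testing whether the underlying issue still reproduces.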
Reviews (1): Last reviewed commit: "docs: reflow NCCL troubleshooting notes"
Pull request overview
Adds documentation guidance for troubleshooting NCCL-related failures during Linux multi-GPU distributed training (specifically ProcessGroupNCCL “illegal memory access” errors), providing environment-variable workarounds users can try.
Changes:
- Add a new “Troubleshooting NCCL Errors” subsection under the multi-GPU training docs.
- Document NCCL environment variables (`NCCL_SHM_DISABLE`, `NCCL_IB_DISABLE`, `NCCL_ALGO`) as potential stability workarounds with a cautionary note.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: bixiong wang <wangbx02@126.com>
@binzily thanks a lot for the PR! Could you add a link to these multi-GPU issues to the general troubleshooting? :)
Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
I checked both failing CI jobs. The broken-link failures are in pre-existing docs files outside this PR, and the
Yes, I wouldn't be concerned about these failures. @myurasov-nv do you know why the CI is running the test general for these doc changes? I don't think these should be triggered.
Oh nevermind this is targeting main... |
@binzily Would you mind mirroring these changes to develop? |
…m#5195)

Adds a troubleshooting note to the multi-GPU training docs for Linux systems where distributed training may fail with `CUDA error: an illegal memory access was encountered` reported by `ProcessGroupNCCL`.

This PR is documentation-only. It does not change the default distributed training behavior in IsaacLab or `rsl_rl`. The note documents NCCL environment-variable workarounds that were observed to restore stability on some affected systems:

- `NCCL_SHM_DISABLE=1`
- `NCCL_IB_DISABLE=1`
- `NCCL_ALGO=Ring`

The motivation for this change is to provide an official troubleshooting path for users who hit NCCL transport/algo issues on specific Linux multi-GPU setups. In our local reproduction, the failure was not caused by IsaacLab task logic itself, but occurred in the distributed training stack when using NCCL with humanoid locomotion workloads.

Dependencies: none.

Refs isaac-sim#4011
Refs isaac-sim#2756

Type of change: Documentation update

Screenshots: N/A

Checklist:
- [x] I have read and understood the [contribution guidelines](https://isaac-sim.github.io/IsaacLab/main/source/refs/contributing.html)
- [ ] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- [ ] I have added my name to the `CONTRIBUTORS.md` or my name already exists there

Local reproduction environment:
- Ubuntu 22.04.5
- RTX 5090 x2
- Isaac Sim / IsaacLab multi-GPU training
- official distributed minimal reproduction with `Isaac-Velocity-Flat-G1-v0`

Observed behavior:
- the default distributed launch failed with NCCL illegal memory access
- `NCCL_SHM_DISABLE=1` was sufficient to make the official dual-GPU minimal reproduction pass
- `NCCL_SHM_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_ALGO=Ring` also restored stability in a longer validation run

This PR documents those workarounds without changing defaults, since the NCCL transport/algo selection is handled below the IsaacLab task layer.

Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
Signed-off-by: bixiong wang <wangbx02@126.com>
Co-authored-by: bxwang <bixiong.wang@x-humanoid.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Antoine RICHARD <antoiner@nvidia.com>
(cherry picked from commit 4df6560)
Mirrored to develop in #5212. |
Mirrors the documentation changes from #5195 onto `develop`.

## Description

Adds NCCL troubleshooting notes to the multi-GPU docs and links them from the general troubleshooting page.

Refs #4011
Refs #2756

## Type of change

- Documentation update

Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
Signed-off-by: bixiong wang <wangbx02@126.com>
Co-authored-by: bxwang <bixiong.wang@x-humanoid.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Antoine RICHARD <antoiner@nvidia.com>
…m#5212)

Mirrors the documentation changes from isaac-sim#5195 onto `develop`.

## Description

Adds NCCL troubleshooting notes to the multi-GPU docs and links them from the general troubleshooting page.

Refs isaac-sim#4011
Refs isaac-sim#2756

## Type of change

- Documentation update

Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
Signed-off-by: bixiong wang <wangbx02@126.com>
Co-authored-by: bxwang <bixiong.wang@x-humanoid.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Antoine RICHARD <antoiner@nvidia.com>
Description
Adds a troubleshooting note to the multi-GPU training docs for Linux systems where distributed training may fail with `CUDA error: an illegal memory access was encountered` reported by `ProcessGroupNCCL`.

This PR is documentation-only. It does not change the default distributed training behavior in IsaacLab or `rsl_rl`. The note documents NCCL environment-variable workarounds that were observed to restore stability on some affected systems:

- `NCCL_SHM_DISABLE=1`
- `NCCL_IB_DISABLE=1`
- `NCCL_ALGO=Ring`

The motivation for this change is to provide an official troubleshooting path for users who hit NCCL transport/algo issues on specific Linux multi-GPU setups. In our local reproduction, the failure was not caused by IsaacLab task logic itself, but occurred in the distributed training stack when using NCCL with humanoid locomotion workloads.
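These variables can also be scoped to a single launch by prefixing the command, rather than exporting them into the whole shell session. In this sketch, `env` stands in for the actual distributed training command:

```shell
# Prefixing scopes the workarounds to one process tree only; the parent
# shell environment is left untouched. `env` is a stand-in for the real
# distributed training launch command.
NCCL_SHM_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_ALGO=Ring env | grep '^NCCL_' | sort
```

The prefix form makes it easy to A/B test a run with and without the workarounds, since nothing persists between launches.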
Dependencies: none.
Refs #4011
Refs #2756
Type of change

- Documentation update

Screenshots
N/A
Checklist
- [x] I have read and understood the [contribution guidelines](https://isaac-sim.github.io/IsaacLab/main/source/refs/contributing.html)
- [ ] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- [ ] I have added my name to the `CONTRIBUTORS.md` or my name already exists there

Context
Local reproduction environment:

- Ubuntu 22.04.5
- RTX 5090 x2
- Isaac Sim / IsaacLab multi-GPU training
- official distributed minimal reproduction with `Isaac-Velocity-Flat-G1-v0`

Observed behavior:

- the default distributed launch failed with NCCL illegal memory access
- `NCCL_SHM_DISABLE=1` was sufficient to make the official dual-GPU minimal reproduction pass
- `NCCL_SHM_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_ALGO=Ring` also restored stability in a longer validation run

This PR documents those workarounds without changing defaults, since the NCCL transport/algo selection is handled below the IsaacLab task layer.
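When checking which transport and algorithm NCCL actually selected after applying a workaround, NCCL's own debug logging can help. `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL variables and are not part of this PR's docs change; this is a diagnostic sketch, not an IsaacLab setting:

```shell
# Enable NCCL's initialization and network-selection logging for one
# diagnostic run; INFO logging is verbose and slows startup, so unset
# these variables again once the transport choice has been confirmed.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
echo "NCCL debug logging: $NCCL_DEBUG ($NCCL_DEBUG_SUBSYS)"
```

With these set, the NCCL lines in the training log show whether the shared-memory or InfiniBand paths were in fact disabled.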