docs: add NCCL troubleshooting notes for multi-GPU training #5195
Conversation
Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
Greptile Summary

This documentation-only PR adds a "Troubleshooting NCCL Errors" subsection to the multi-GPU training page, documenting three environment-variable workarounds (`NCCL_SHM_DISABLE`, `NCCL_IB_DISABLE`, `NCCL_ALGO`).

Confidence Score: 5/5. This PR is safe to merge: it is a documentation-only change with no code modifications. Single .rst file addition with correct RST formatting, accurate NCCL environment variable names and valid values, and an appropriate performance caveat. No code is altered and no behavior changes. All observations are P2 or lower. No files require special attention.

Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Launch multi-GPU training] --> B{NCCL error?}
    B -- No --> C[Training runs successfully]
    B -- Yes: illegal memory access --> D[Set NCCL_SHM_DISABLE=1 and relaunch]
    D --> E{Error persists?}
    E -- No --> C
    E -- Yes --> F[Also set NCCL_IB_DISABLE=1 and NCCL_ALGO=Ring]
    F --> G[Relaunch training]
    G --> C
```
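The escalation path in the flowchart can be sketched as a shell session. The relaunch command itself is a placeholder here; substitute whatever distributed launch you normally run:

```shell
# Step 1: on an NCCL "illegal memory access" failure, disable the
# shared-memory transport only, then relaunch training.
export NCCL_SHM_DISABLE=1
# ./your_distributed_launch.sh   # placeholder for the real command

# Step 2: if the error persists, also disable the InfiniBand transport
# and pin the collective algorithm to Ring before relaunching.
export NCCL_IB_DISABLE=1
export NCCL_ALGO=Ring
echo "NCCL_SHM_DISABLE=$NCCL_SHM_DISABLE NCCL_IB_DISABLE=$NCCL_IB_DISABLE NCCL_ALGO=$NCCL_ALGO"
```

Exported variables apply to every launch from that shell session, so remember to unset them when testing whether the underlying issue still reproduces.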
Reviews (1): Last reviewed commit: "docs: reflow NCCL troubleshooting notes"
Pull request overview
Adds documentation guidance for troubleshooting NCCL-related failures during Linux multi-GPU distributed training (specifically ProcessGroupNCCL “illegal memory access” errors), providing environment-variable workarounds users can try.
Changes:
- Add a new “Troubleshooting NCCL Errors” subsection under the multi-GPU training docs.
- Document NCCL environment variables (`NCCL_SHM_DISABLE`, `NCCL_IB_DISABLE`, `NCCL_ALGO`) as potential stability workarounds with a cautionary note.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: bixiong wang <wangbx02@126.com>
@binzily thanks a lot for the PR! Could you add a link to these multi-GPU issues to the general troubleshooting? :)
Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
I checked both failing CI jobs. The broken-link failures are in pre-existing docs files outside this PR, and the
Yes, I wouldn't be concerned about these failures. @myurasov-nv do you know why the CI is running the test general for these doc changes? I don't think these should be triggered.
Oh nevermind this is targeting main... |
@binzily Would you mind mirroring these changes to develop? |
…m#5195)

Adds a troubleshooting note to the multi-GPU training docs for Linux systems where distributed training may fail with `CUDA error: an illegal memory access was encountered` reported by `ProcessGroupNCCL`.

This PR is documentation-only. It does not change the default distributed training behavior in IsaacLab or `rsl_rl`. The note documents NCCL environment-variable workarounds that were observed to restore stability on some affected systems:

- `NCCL_SHM_DISABLE=1`
- `NCCL_IB_DISABLE=1`
- `NCCL_ALGO=Ring`

The motivation for this change is to provide an official troubleshooting path for users who hit NCCL transport/algo issues on specific Linux multi-GPU setups. In our local reproduction, the failure was not caused by IsaacLab task logic itself, but occurred in the distributed training stack when using NCCL with humanoid locomotion workloads.

Dependencies: none.

Refs isaac-sim#4011
Refs isaac-sim#2756

Type of change: Documentation update

Screenshots: N/A

Checklist:
- [x] I have read and understood the [contribution guidelines](https://isaac-sim.github.io/IsaacLab/main/source/refs/contributing.html)
- [ ] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- [ ] I have added my name to the `CONTRIBUTORS.md` or my name already exists there

Local reproduction environment:
- Ubuntu 22.04.5
- RTX 5090 x2
- Isaac Sim / IsaacLab multi-GPU training
- official distributed minimal reproduction with `Isaac-Velocity-Flat-G1-v0`

Observed behavior:
- the default distributed launch failed with NCCL illegal memory access
- `NCCL_SHM_DISABLE=1` was sufficient to make the official dual-GPU minimal reproduction pass
- `NCCL_SHM_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_ALGO=Ring` also restored stability in a longer validation run

This PR documents those workarounds without changing defaults, since the NCCL transport/algo selection is handled below the IsaacLab task layer.

Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
Signed-off-by: bixiong wang <wangbx02@126.com>
Co-authored-by: bxwang <bixiong.wang@x-humanoid.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Antoine RICHARD <antoiner@nvidia.com>
(cherry picked from commit 4df6560)
Mirrored to develop in #5212. |
Mirrors the documentation changes from #5195 onto `develop`.

## Description

Adds NCCL troubleshooting notes to the multi-GPU docs and links them from the general troubleshooting page.

Refs #4011
Refs #2756

## Type of change

- Documentation update

Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
Signed-off-by: bixiong wang <wangbx02@126.com>
Co-authored-by: bxwang <bixiong.wang@x-humanoid.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Antoine RICHARD <antoiner@nvidia.com>
…m#5212)

Mirrors the documentation changes from isaac-sim#5195 onto `develop`.

## Description

Adds NCCL troubleshooting notes to the multi-GPU docs and links them from the general troubleshooting page.

Refs isaac-sim#4011
Refs isaac-sim#2756

## Type of change

- Documentation update

Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
Signed-off-by: bixiong wang <wangbx02@126.com>
Co-authored-by: bxwang <bixiong.wang@x-humanoid.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Antoine RICHARD <antoiner@nvidia.com>
Description
Adds a troubleshooting note to the multi-GPU training docs for Linux systems where distributed training may fail with `CUDA error: an illegal memory access was encountered` reported by `ProcessGroupNCCL`.

This PR is documentation-only. It does not change the default distributed training behavior in IsaacLab or `rsl_rl`. The note documents NCCL environment-variable workarounds that were observed to restore stability on some affected systems:

- `NCCL_SHM_DISABLE=1`
- `NCCL_IB_DISABLE=1`
- `NCCL_ALGO=Ring`

The motivation for this change is to provide an official troubleshooting path for users who hit NCCL transport/algo issues on specific Linux multi-GPU setups. In our local reproduction, the failure was not caused by IsaacLab task logic itself, but occurred in the distributed training stack when using NCCL with humanoid locomotion workloads.
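These variables can also be scoped to a single launch by prefixing the command, rather than exporting them into the whole shell session. In this sketch, `env` stands in for the actual distributed training command:

```shell
# Prefixing scopes the workarounds to one process tree only; the parent
# shell environment is left untouched. `env` is a stand-in for the real
# distributed training launch command.
NCCL_SHM_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_ALGO=Ring env | grep '^NCCL_' | sort
```

The prefix form makes it easy to A/B test a run with and without the workarounds, since nothing persists between launches.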
Dependencies: none.
Refs #4011
Refs #2756
Type of change

- Documentation update

Screenshots
N/A
Checklist
- [x] I have read and understood the [contribution guidelines](https://isaac-sim.github.io/IsaacLab/main/source/refs/contributing.html)
- [ ] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- [ ] I have added my name to the `CONTRIBUTORS.md` or my name already exists there

Context
Local reproduction environment:

- Ubuntu 22.04.5
- RTX 5090 x2
- Isaac Sim / IsaacLab multi-GPU training
- official distributed minimal reproduction with `Isaac-Velocity-Flat-G1-v0`

Observed behavior:

- the default distributed launch failed with NCCL illegal memory access
- `NCCL_SHM_DISABLE=1` was sufficient to make the official dual-GPU minimal reproduction pass
- `NCCL_SHM_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_ALGO=Ring` also restored stability in a longer validation run

This PR documents those workarounds without changing defaults, since the NCCL transport/algo selection is handled below the IsaacLab task layer.
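When checking which transport and algorithm NCCL actually selected after applying a workaround, NCCL's own debug logging can help. `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL variables and are not part of this PR's docs change; this is a diagnostic sketch, not an IsaacLab setting:

```shell
# Enable NCCL's initialization and network-selection logging for one
# diagnostic run; INFO logging is verbose and slows startup, so unset
# these variables again once the transport choice has been confirmed.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
echo "NCCL debug logging: $NCCL_DEBUG ($NCCL_DEBUG_SUBSYS)"
```

With these set, the NCCL lines in the training log show whether the shared-memory or InfiniBand paths were in fact disabled.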