
docs: add NCCL troubleshooting notes for multi-GPU training#5195

Merged
AntoineRichard merged 6 commits into isaac-sim:main from binzily:docs/multi-gpu-nccl-troubleshooting
Apr 8, 2026

Conversation

@binzily
Contributor

@binzily binzily commented Apr 7, 2026

Description

Adds a troubleshooting note to the multi-GPU training docs for Linux systems where
distributed training may fail with `CUDA error: an illegal memory access was encountered`
reported by `ProcessGroupNCCL`.

This PR is documentation-only. It does not change the default distributed training behavior
in IsaacLab or rsl_rl. The note documents NCCL environment-variable workarounds that were
observed to restore stability on some affected systems:

  • NCCL_SHM_DISABLE=1
  • NCCL_IB_DISABLE=1
  • NCCL_ALGO=Ring
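Applied to a launch, the workarounds might look like the following minimal sketch; the training entry point is a placeholder, not a command from the docs:

```shell
# Minimal sketch: export the NCCL workarounds before launching
# distributed training. The launch command below is a placeholder
# comment; substitute your own entry point.
export NCCL_SHM_DISABLE=1   # bypass the shared-memory transport
export NCCL_IB_DISABLE=1    # bypass the InfiniBand transport
export NCCL_ALGO=Ring       # pin collectives to the Ring algorithm

# e.g. python -m torch.distributed.run --nproc_per_node=2 <train_script> ...
echo "NCCL_SHM_DISABLE=$NCCL_SHM_DISABLE NCCL_IB_DISABLE=$NCCL_IB_DISABLE NCCL_ALGO=$NCCL_ALGO"
```

Exporting the variables in the shell (rather than editing any code) keeps the workaround outside the IsaacLab task layer, which matches the documentation-only scope of this PR.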

The motivation for this change is to provide an official troubleshooting path for users who
hit NCCL transport/algo issues on specific Linux multi-GPU setups. In our local reproduction,
the failure was not caused by IsaacLab task logic itself, but occurred in the distributed
training stack when using NCCL with humanoid locomotion workloads.

Dependencies: none.

Refs #4011
Refs #2756

Type of change

  • Documentation update

Screenshots

N/A

Checklist

  • [x] I have read and understood the contribution guidelines
  • [ ] I have run the pre-commit checks with ./isaaclab.sh --format
  • [x] I have made corresponding changes to the documentation
  • [x] My changes generate no new warnings
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [ ] I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • [ ] I have added my name to the CONTRIBUTORS.md or my name already exists there

Context

Local reproduction environment:

  • Ubuntu 22.04.5
  • RTX 5090 x2
  • Isaac Sim / IsaacLab multi-GPU training
  • official distributed minimal reproduction with Isaac-Velocity-Flat-G1-v0

Observed behavior:

  • the default distributed launch failed with NCCL illegal memory access
  • NCCL_SHM_DISABLE=1 was sufficient to make the official dual-GPU minimal reproduction pass
  • NCCL_SHM_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_ALGO=Ring also restored stability in a longer validation run

This PR documents those workarounds without changing defaults, since the NCCL transport/algo
selection is handled below the IsaacLab task layer.
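The escalation observed above (try the minimal workaround first, then the full set) can be sketched as a hypothetical wrapper; `launch_training` is a placeholder that stands in for the real distributed launch command:

```shell
# Hypothetical escalation sketch for the two observed configurations.
# launch_training is a placeholder for the real distributed launch;
# here it only reports which NCCL settings it would run under.
launch_training() {
  echo "launching with: NCCL_SHM_DISABLE=${NCCL_SHM_DISABLE:-unset}" \
       "NCCL_IB_DISABLE=${NCCL_IB_DISABLE:-unset}" \
       "NCCL_ALGO=${NCCL_ALGO:-unset}"
}

# Step 1: the minimal workaround, sufficient for the dual-GPU reproduction
NCCL_SHM_DISABLE=1 launch_training

# Step 2: the full workaround set, used in the longer validation run
NCCL_SHM_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_ALGO=Ring launch_training
```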

bxwang added 2 commits April 7, 2026 15:56
Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
Copilot AI review requested due to automatic review settings April 7, 2026 08:20
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Apr 7, 2026
@greptile-apps
Contributor

greptile-apps bot commented Apr 7, 2026

Greptile Summary

This documentation-only PR adds a "Troubleshooting NCCL Errors" subsection to the multi-GPU training page, documenting three environment variable workarounds (NCCL_SHM_DISABLE=1, NCCL_IB_DISABLE=1, NCCL_ALGO=Ring) for systems that encounter CUDA error: an illegal memory access was encountered from ProcessGroupNCCL during distributed training. The RST heading hierarchy is correct, the .. code-block:: bash and .. note:: directives are properly formed, and the added caveat about potential performance impact is appropriate.

Confidence Score: 5/5

This PR is safe to merge — it is a documentation-only change with no code modifications.

Single .rst file addition with correct RST formatting, accurate NCCL environment variable names and valid values, and an appropriate performance caveat. No code is altered and no behavior changes. All observations are P2 or lower.

No files require special attention.

Important Files Changed

  • docs/source/features/multi_gpu.rst — Adds a 'Troubleshooting NCCL Errors' subsection with correct RST syntax and accurate NCCL environment variable workarounds; no issues found.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Launch multi-GPU training] --> B{NCCL error?}
    B -- No --> C[Training runs successfully]
    B -- Yes: illegal memory access --> D[Set NCCL_SHM_DISABLE=1 and relaunch]
    D --> E{Error persists?}
    E -- No --> C
    E -- Yes --> F[Also set NCCL_IB_DISABLE=1 and NCCL_ALGO=Ring]
    F --> G[Relaunch training]
    G --> C

Last reviewed commit: "docs: reflow NCCL troubleshooting notes"

Contributor

Copilot AI left a comment


Pull request overview

Adds documentation guidance for troubleshooting NCCL-related failures during Linux multi-GPU distributed training (specifically ProcessGroupNCCL “illegal memory access” errors), providing environment-variable workarounds users can try.

Changes:

  • Add a new “Troubleshooting NCCL Errors” subsection under the multi-GPU training docs.
  • Document NCCL environment variables (NCCL_SHM_DISABLE, NCCL_IB_DISABLE, NCCL_ALGO) as potential stability workarounds with a cautionary note.


Comment thread docs/source/features/multi_gpu.rst Outdated
Comment thread docs/source/features/multi_gpu.rst Outdated
binzily and others added 2 commits April 7, 2026 17:20
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: bixiong wang <wangbx02@126.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: bixiong wang <wangbx02@126.com>
Comment thread docs/source/features/multi_gpu.rst
@AntoineRichard
Collaborator

@binzily thanks a lot for the PR! Could you add a link to these multi-GPU issues to the general troubleshooting? :)

Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
@binzily
Contributor Author

binzily commented Apr 8, 2026

I checked both failing CI jobs. The broken-link failures are in pre-existing docs files outside this PR, and the test-isaaclab-tasks failure comes from the existing Anymal-C determinism tests in source/isaaclab_tasks/test/test_environment_determinism.py. Both appear unrelated to the documentation-only changes here.

@AntoineRichard
Collaborator

I checked both failing CI jobs. The broken-link failures are in pre-existing docs files outside this PR, and the test-isaaclab-tasks failure comes from the existing Anymal-C determinism tests in source/isaaclab_tasks/test/test_environment_determinism.py. Both appear unrelated to the documentation-only changes here.

Yes, I wouldn't be concerned about these failures. @myurasov-nv do you know why the CI is running the general test suite for these doc changes? I don't think these should be triggered.

@AntoineRichard
Collaborator

Oh nevermind this is targeting main...

@AntoineRichard
Collaborator

@binzily Would you mind mirroring these changes to develop?

@AntoineRichard AntoineRichard merged commit 4df6560 into isaac-sim:main Apr 8, 2026
6 of 9 checks passed
binzily added a commit to binzily/IsaacLab that referenced this pull request Apr 9, 2026
…m#5195)

Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
Signed-off-by: bixiong wang <wangbx02@126.com>
Co-authored-by: bxwang <bixiong.wang@x-humanoid.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Antoine RICHARD <antoiner@nvidia.com>
(cherry picked from commit 4df6560)
@binzily
Contributor Author

binzily commented Apr 9, 2026

Mirrored to develop in #5212.

AntoineRichard added a commit that referenced this pull request Apr 9, 2026
Mirrors the documentation changes from #5195 onto `develop`.

## Description

Adds NCCL troubleshooting notes to the multi-GPU docs and links them from the
general troubleshooting page.

Refs #4011
Refs #2756

## Type of change

- Documentation update

Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
Signed-off-by: bixiong wang <wangbx02@126.com>
Co-authored-by: bxwang <bixiong.wang@x-humanoid.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Antoine RICHARD <antoiner@nvidia.com>
mmichelis pushed a commit to mmichelis/IsaacLab that referenced this pull request Apr 10, 2026
…m#5212)


Labels

documentation Improvements or additions to documentation

3 participants