docs: add NCCL troubleshooting notes for multi-GPU training #5212

Merged
AntoineRichard merged 1 commit into isaac-sim:develop from
binzily:docs/multi-gpu-nccl-troubleshooting-develop
Apr 9, 2026

Contributor

@binzily binzily commented Apr 9, 2026

Mirrors the documentation changes from #5195 onto develop.

Description

Adds NCCL troubleshooting notes to the multi-GPU docs and links them from the
general troubleshooting page.

Refs #4011
Refs #2756

Type of change

  • Documentation update

…m#5195)

Adds a troubleshooting note to the multi-GPU training docs for Linux
systems where distributed training may fail with
`CUDA error: an illegal memory access was encountered` reported by
`ProcessGroupNCCL`.

This PR is documentation-only. It does not change the default
distributed training behavior in IsaacLab or `rsl_rl`. The note
documents NCCL environment-variable workarounds that were observed to
restore stability on some affected systems:

- `NCCL_SHM_DISABLE=1`
- `NCCL_IB_DISABLE=1`
- `NCCL_ALGO=Ring`
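
As a minimal sketch of how these variables might be applied, the Python snippet below builds a child-process environment and a single-node launch command. `nccl_workaround_env` and `build_launch_command` are hypothetical helper names, and the training-script path and the `--headless` / `--distributed` flags are assumptions about a typical IsaacLab checkout that may not match yours; only the three NCCL variables come from this PR.

```python
import os
import sys

# Workarounds named in this PR, in the order they are listed.
NCCL_WORKAROUNDS = {
    "NCCL_SHM_DISABLE": "1",  # disable the shared-memory transport
    "NCCL_IB_DISABLE": "1",   # disable the InfiniBand transport
    "NCCL_ALGO": "Ring",      # restrict collectives to the ring algorithm
}


def nccl_workaround_env(names, base=None):
    """Return a copy of the environment with the selected workarounds set."""
    env = dict(os.environ if base is None else base)
    for name in names:
        env[name] = NCCL_WORKAROUNDS[name]
    return env


def build_launch_command(script, task, num_gpus=2):
    """Build a single-node torch.distributed.run command (script flags are assumptions)."""
    return [
        sys.executable, "-m", "torch.distributed.run",
        "--nnodes=1", f"--nproc_per_node={num_gpus}",
        script, f"--task={task}", "--headless", "--distributed",
    ]


if __name__ == "__main__":
    env = nccl_workaround_env(["NCCL_SHM_DISABLE"])
    cmd = build_launch_command(
        "scripts/reinforcement_learning/rsl_rl/train.py",  # assumed path
        "Isaac-Velocity-Flat-G1-v0",
    )
    print(" ".join(cmd))
    # To actually launch: subprocess.run(cmd, env=env, check=True)
```

Building the environment as a copy keeps the overrides scoped to the training process, so the workaround does not leak into other jobs started from the same shell.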

The motivation for this change is to provide an official troubleshooting
path for users who hit NCCL transport or algorithm issues on specific
Linux multi-GPU setups. In our local reproduction, the failure was not
caused by IsaacLab task logic itself, but occurred in the distributed
training stack when using NCCL with humanoid locomotion workloads.

Dependencies: none.

Refs isaac-sim#4011
Refs isaac-sim#2756

- Documentation update

N/A

- [x] I have read and understood the [contribution guidelines](https://isaac-sim.github.io/IsaacLab/main/source/refs/contributing.html)
- [ ] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- [ ] I have added my name to the `CONTRIBUTORS.md` or my name already exists there

Local reproduction environment:
- Ubuntu 22.04.5
- RTX 5090 x2
- Isaac Sim / IsaacLab multi-GPU training
- official distributed minimal reproduction with `Isaac-Velocity-Flat-G1-v0`

Observed behavior:
- the default distributed launch failed with NCCL illegal memory access
- `NCCL_SHM_DISABLE=1` was sufficient to make the official dual-GPU minimal reproduction pass
- `NCCL_SHM_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_ALGO=Ring` also restored stability in a longer validation run
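
The observations above suggest trying the smallest change first. A minimal sketch of that escalation, assuming a short reproduction command that exits non-zero on failure (a crashed NCCL job may also hang, which this sketch does not handle), might look like the following. `ESCALATION_STEPS` and `first_working_step` are hypothetical names, not part of IsaacLab.

```python
import subprocess

# Escalation steps taken from the observed behavior above: the default
# launch, then the single variable that fixed the minimal reproduction,
# then the full set used in the longer validation run.
ESCALATION_STEPS = [
    {},
    {"NCCL_SHM_DISABLE": "1"},
    {"NCCL_SHM_DISABLE": "1", "NCCL_IB_DISABLE": "1", "NCCL_ALGO": "Ring"},
]


def first_working_step(cmd, base_env, run=None):
    """Return the first override set for which `cmd` exits with code 0, else None."""
    if run is None:
        # Default runner: launch the command and report its exit code.
        run = lambda c, e: subprocess.run(c, env=e).returncode
    for overrides in ESCALATION_STEPS:
        env = {**base_env, **overrides}
        if run(cmd, env) == 0:
            return overrides
    return None
```

The `run` parameter exists so the ladder can be exercised with a stub runner before pointing it at a real multi-GPU launch.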

This PR documents those workarounds without changing defaults, since
NCCL transport and algorithm selection is handled below the IsaacLab
task layer.

---------

Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
Signed-off-by: bixiong wang <wangbx02@126.com>
Co-authored-by: bxwang <bixiong.wang@x-humanoid.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Antoine RICHARD <antoiner@nvidia.com>
(cherry picked from commit 4df6560)
Copilot AI review requested due to automatic review settings April 9, 2026 02:01
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Apr 9, 2026
Contributor

greptile-apps bot commented Apr 9, 2026

Greptile Summary

This PR adds an NCCL troubleshooting subsection to docs/source/features/multi_gpu.rst (with a Sphinx anchor label multi-gpu-nccl-troubleshooting) and a short forwarding entry in docs/source/refs/troubleshooting.rst that cross-references it. The documentation-only change is well-structured and the RST formatting is consistent with the rest of both files.

Confidence Score: 5/5

Documentation-only PR with no code changes; safe to merge.

All changes are RST documentation. The Sphinx anchor label and cross-reference are correctly paired, heading underlines meet RST length requirements, the NCCL environment variables are valid, and the new sections are stylistically consistent with the rest of both files. No P0/P1 findings.

No files require special attention.

Vulnerabilities

No security concerns identified.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/source/features/multi_gpu.rst | Adds a properly anchored "Troubleshooting NCCL Errors" subsection (`^`-level) with three env-var workarounds and a performance-impact note; RST syntax and label placement are correct. |
| docs/source/refs/troubleshooting.rst | Adds a top-level (`-`-level) section that briefly describes the NCCL error and defers to the multi_gpu.rst anchor via :ref:`multi-gpu-nccl-troubleshooting`; heading underline length matches exactly (48 chars each), cross-reference resolves correctly. |
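
The underline-length point in the table can be checked mechanically. The sketch below is a simplified, hypothetical helper (`underline_ok`), not the actual docutils check; it assumes the usual reStructuredText rule that a section underline is a single repeated punctuation character at least as long as the title text.

```python
def underline_ok(title, underline):
    """True if `underline` is one repeated punctuation character at least as long as `title`."""
    return (
        len(underline) >= len(title)
        and len(set(underline)) == 1       # a single repeated character
        and not underline[0].isalnum()     # punctuation, not letters or digits
        and not underline[0].isspace()
    )
```

For example, `underline_ok("Troubleshooting NCCL Errors", "^" * 27)` holds because the title is 27 characters long.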

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User hits NCCL error\nduring multi-GPU training] --> B{Error type}
    B -->|CUDA illegal memory access\nfrom ProcessGroupNCCL| C[Set NCCL_SHM_DISABLE=1]
    C --> D{Resolved?}
    D -->|No| E[Set NCCL_IB_DISABLE=1\nand NCCL_ALGO=Ring]
    E --> F[Relaunch distributed training]
    D -->|Yes| F
    B -->|Other| G[See general troubleshooting.rst]

Reviews (1): Last reviewed commit: "docs: add NCCL troubleshooting notes for..." | Re-trigger Greptile

Contributor

Copilot AI left a comment


Pull request overview

Adds documentation guidance for diagnosing and mitigating NCCL-related failures seen during Linux multi-GPU distributed training, and links this guidance from the general troubleshooting page.

Changes:

  • Add a new troubleshooting entry in the general troubleshooting guide that points users to NCCL-specific guidance.
  • Add an NCCL troubleshooting section to the multi-GPU training docs documenting environment-variable workarounds (NCCL_SHM_DISABLE, NCCL_IB_DISABLE, NCCL_ALGO).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| docs/source/refs/troubleshooting.rst | Adds a troubleshooting section and links to the multi-GPU NCCL troubleshooting anchor. |
| docs/source/features/multi_gpu.rst | Introduces a new NCCL troubleshooting subsection with suggested environment-variable mitigations. |



@isaaclab-review-bot isaaclab-review-bot bot left a comment


Docs-only PR — LGTM. The NCCL env vars are correct, the escalation order (shared memory → IB → ring algorithm) is sensible, RST formatting is clean, and the cross-reference from troubleshooting.rst resolves properly.

A couple of optional suggestions for future improvement (not blocking):

  1. NCCL_P2P_DISABLE=1 — worth mentioning as another fallback. Peer-to-peer transport failures can produce the same illegal memory access symptom, especially on multi-GPU desktops where P2P isn't well supported across PCIe bridges.

  2. NCCL_DEBUG=INFO — a brief mention that users can set this to identify which transport is failing would help them narrow down which NCCL_*_DISABLE variable they actually need, rather than blindly stacking all of them.

  3. The .. note:: block could be slightly more specific about the trade-off — e.g., "may reduce inter-GPU communication throughput by falling back to slower transports" rather than the current generic "may change communication behavior or performance".
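
Suggestions 1 and 2 could be combined in a small helper like the sketch below. `enable_nccl_diagnostics` is a hypothetical name; `NCCL_DEBUG` and `NCCL_P2P_DISABLE` are the variables named above. The sketch assumes it is called before the process group is initialized, since NCCL reads its environment when the communicator is created.

```python
import os


def enable_nccl_diagnostics(environ=os.environ, disable_p2p=False):
    """Turn on NCCL logging (and optionally disable P2P) without overriding existing values."""
    # setdefault leaves any value the user already exported untouched.
    environ.setdefault("NCCL_DEBUG", "INFO")
    if disable_p2p:
        environ.setdefault("NCCL_P2P_DISABLE", "1")
    return environ
```

Using `setdefault` rather than plain assignment means a value exported in the shell still wins, which keeps the helper safe to leave in a training entry point.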

None of these are blockers. Good addition — the referenced issues (#4011, #2756) show this trips up real users.

@AntoineRichard AntoineRichard merged commit a62f5a1 into isaac-sim:develop Apr 9, 2026
11 of 12 checks passed
@AntoineRichard
Collaborator

Thanks @binzily !

mmichelis pushed a commit to mmichelis/IsaacLab that referenced this pull request Apr 10, 2026
…m#5212)
