docs: add NCCL troubleshooting notes for multi-GPU training (#5212)
Conversation
docs: add NCCL troubleshooting notes for multi-GPU training (isaac-sim#5195)

## Description

Adds a troubleshooting note to the multi-GPU training docs for Linux systems where distributed training may fail with `CUDA error: an illegal memory access was encountered` reported by `ProcessGroupNCCL`.

This PR is documentation-only. It does not change the default distributed training behavior in IsaacLab or `rsl_rl`. The note documents NCCL environment-variable workarounds that were observed to restore stability on some affected systems:

- `NCCL_SHM_DISABLE=1`
- `NCCL_IB_DISABLE=1`
- `NCCL_ALGO=Ring`

The motivation for this change is to provide an official troubleshooting path for users who hit NCCL transport/algo issues on specific Linux multi-GPU setups. In our local reproduction, the failure was not caused by IsaacLab task logic itself, but occurred in the distributed training stack when using NCCL with humanoid locomotion workloads.

Dependencies: none.

Refs isaac-sim#4011
Refs isaac-sim#2756

## Type of change

- Documentation update

## Screenshots

N/A

## Checklist

- [x] I have read and understood the [contribution guidelines](https://isaac-sim.github.io/IsaacLab/main/source/refs/contributing.html)
- [ ] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- [ ] I have added my name to the `CONTRIBUTORS.md` or my name already exists there

Local reproduction environment:

- Ubuntu 22.04.5
- RTX 5090 x2
- Isaac Sim / IsaacLab multi-GPU training
- official distributed minimal reproduction with `Isaac-Velocity-Flat-G1-v0`

Observed behavior:

- the default distributed launch failed with NCCL illegal memory access
- `NCCL_SHM_DISABLE=1` was sufficient to make the official dual-GPU minimal reproduction pass
- `NCCL_SHM_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_ALGO=Ring` also restored stability in a longer validation run

This PR documents those workarounds without changing defaults, since the NCCL transport/algo selection is handled below the IsaacLab task layer.

---------

Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
Signed-off-by: bixiong wang <wangbx02@126.com>
Co-authored-by: bxwang <bixiong.wang@x-humanoid.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Antoine RICHARD <antoiner@nvidia.com>

(cherry picked from commit 4df6560)
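The workarounds above are applied by exporting the variables before launching the distributed run. A minimal sketch, assuming a two-GPU launch via `torch.distributed.run`; the training script path and task flags are illustrative and depend on your IsaacLab checkout, so the launch line is left commented:

```shell
# NCCL workarounds observed to restore stability on some affected systems.
export NCCL_SHM_DISABLE=1   # disable the shared-memory transport
export NCCL_IB_DISABLE=1    # disable the InfiniBand transport
export NCCL_ALGO=Ring       # pin the collective algorithm to Ring

echo "NCCL_SHM_DISABLE=$NCCL_SHM_DISABLE NCCL_IB_DISABLE=$NCCL_IB_DISABLE NCCL_ALGO=$NCCL_ALGO"

# Example launch (commented out; adapt the script path and task to your setup):
# python -m torch.distributed.run --nproc_per_node=2 \
#     scripts/reinforcement_learning/rsl_rl/train.py \
#     --task Isaac-Velocity-Flat-G1-v0 --headless --distributed
```

Since these are process-level environment variables, they can also be set inline for a single run (`NCCL_SHM_DISABLE=1 python -m torch.distributed.run ...`) without affecting other jobs.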
**Greptile Summary**

This PR adds an NCCL troubleshooting subsection to the multi-GPU training docs and links it from the general troubleshooting page.

**Confidence Score: 5/5**

Documentation-only PR with no code changes; safe to merge. All changes are RST documentation. The Sphinx anchor label and cross-reference are correctly paired, heading underlines meet RST length requirements, the NCCL environment variables are valid, and the new sections are stylistically consistent with the rest of both files. No P0/P1 findings. No files require special attention.
| Filename | Overview |
|---|---|
| docs/source/features/multi_gpu.rst | Adds a properly anchored "Troubleshooting NCCL Errors" subsection (`^`-level) with three env-var workarounds and a performance-impact note; RST syntax and label placement are correct. |
| docs/source/refs/troubleshooting.rst | Adds a top-level (`-`-level) section that briefly describes the NCCL error and defers to the multi_gpu.rst anchor via ``:ref:`multi-gpu-nccl-troubleshooting```; heading underline length matches exactly (48 chars each), cross-reference resolves correctly. |
Flowchart:

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User hits NCCL error\nduring multi-GPU training] --> B{Error type}
    B -->|CUDA illegal memory access\nfrom ProcessGroupNCCL| C[Set NCCL_SHM_DISABLE=1]
    C --> D{Resolved?}
    D -->|No| E[Set NCCL_IB_DISABLE=1\nand NCCL_ALGO=Ring]
    E --> F[Relaunch distributed training]
    D -->|Yes| F
    B -->|Other| G[See general troubleshooting.rst]
```
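The staged escalation in the flowchart can be sketched as a small shell wrapper. This is a hypothetical helper, not part of IsaacLab; `LAUNCH_CMD` stands in for your real distributed training command and is set to a no-op placeholder so the sketch runs:

```shell
# Staged NCCL workaround escalation (hypothetical sketch).
# Replace the placeholder with your actual launch command, e.g.
# "python -m torch.distributed.run --nproc_per_node=2 train.py ..."
LAUNCH_CMD="true"

# Stage 1: disable only the shared-memory transport.
if NCCL_SHM_DISABLE=1 $LAUNCH_CMD; then
    echo "resolved with NCCL_SHM_DISABLE=1"
else
    # Stage 2: additionally disable InfiniBand and pin the Ring algorithm.
    NCCL_SHM_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_ALGO=Ring $LAUNCH_CMD \
        && echo "resolved with full workaround set"
fi
```

Setting the variables inline, as above, keeps each stage independent: nothing leaks into the shell environment between attempts.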
Reviews (1): Last reviewed commit: "docs: add NCCL troubleshooting notes for..."
Pull request overview
Adds documentation guidance for diagnosing and mitigating NCCL-related failures seen during Linux multi-GPU distributed training, and links this guidance from the general troubleshooting page.
Changes:
- Add a new troubleshooting entry in the general troubleshooting guide that points users to NCCL-specific guidance.
- Add an NCCL troubleshooting section to the multi-GPU training docs documenting environment-variable workarounds (`NCCL_SHM_DISABLE`, `NCCL_IB_DISABLE`, `NCCL_ALGO`).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| docs/source/refs/troubleshooting.rst | Adds a troubleshooting section and links to the multi-GPU NCCL troubleshooting anchor. |
| docs/source/features/multi_gpu.rst | Introduces a new NCCL troubleshooting subsection with suggested environment-variable mitigations. |
Docs-only PR — LGTM. The NCCL env vars are correct, the escalation order (shared memory → IB → ring algorithm) is sensible, RST formatting is clean, and the cross-reference from troubleshooting.rst resolves properly.
A couple of optional suggestions for future improvement (not blocking):

- `NCCL_P2P_DISABLE=1` — worth mentioning as another fallback. Peer-to-peer transport failures can produce the same `illegal memory access` symptom, especially on multi-GPU desktops where P2P isn't well supported across PCIe bridges.
- `NCCL_DEBUG=INFO` — a brief mention that users can set this to identify which transport is failing would help them narrow down which `NCCL_*_DISABLE` variable they actually need, rather than blindly stacking all of them.
- The `.. note::` block could be slightly more specific about the trade-off — e.g., "may reduce inter-GPU communication throughput by falling back to slower transports" rather than the current generic "may change communication behavior or performance".
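As the reviewer suggests, turning on NCCL's own logging is the quickest way to see which transport is failing before reaching for the disable flags. A minimal sketch; the relaunch command is left commented since it depends on your setup, and restricting `NCCL_DEBUG_SUBSYS` to the `INIT` and `NET` subsystems is an optional noise-reduction choice:

```shell
# Enable NCCL diagnostics; INIT and NET cover transport selection and setup.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

echo "NCCL_DEBUG=$NCCL_DEBUG NCCL_DEBUG_SUBSYS=$NCCL_DEBUG_SUBSYS"

# Relaunch the failing job (commented out; adapt to your setup):
# python -m torch.distributed.run --nproc_per_node=2 train.py ...
#
# Then inspect the rank logs for lines like "NCCL INFO ... via SHM" or
# "via P2P" to identify which transport is active when the error occurs.
```

Once the failing transport is known, only the corresponding `NCCL_*_DISABLE` variable needs to be set, avoiding the performance cost of disabling everything at once.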
None of these are blockers. Good addition — the referenced issues (#4011, #2756) show this trips up real users.
Thanks @binzily!
docs: add NCCL troubleshooting notes for multi-GPU training (isaac-sim#5212)

Mirrors the documentation changes from isaac-sim#5195 onto `develop`.

## Description

Adds NCCL troubleshooting notes to the multi-GPU docs and links them from the general troubleshooting page.

Refs isaac-sim#4011
Refs isaac-sim#2756

## Type of change

- Documentation update

Signed-off-by: bxwang <bixiong.wang@x-humanoid.com>
Signed-off-by: bixiong wang <wangbx02@126.com>
Co-authored-by: bxwang <bixiong.wang@x-humanoid.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Antoine RICHARD <antoiner@nvidia.com>