30 changes: 30 additions & 0 deletions docs/source/features/multi_gpu.rst
@@ -124,6 +124,36 @@ To train with multiple GPUs, use the following command, where ``--nproc_per_node``
python -m skrl.utils.distributed.jax --nnodes=1 --nproc_per_node=2 scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed --ml_framework jax
.. _multi-gpu-nccl-troubleshooting:

Troubleshooting NCCL Errors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On some Linux multi-GPU systems, distributed training may fail with
``CUDA error: an illegal memory access was encountered`` reported by ``ProcessGroupNCCL``
during or shortly after communicator initialization.

If this occurs, try disabling the NCCL shared-memory transport before launching training:

.. code-block:: shell

   export NCCL_SHM_DISABLE=1

If the issue persists, the following additional NCCL fallbacks may help:

.. code-block:: shell

   export NCCL_IB_DISABLE=1
   export NCCL_ALGO=Ring

Then relaunch the distributed training command as usual.
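The workarounds can also be scoped to a single run instead of exported into the shell session, by prefixing the launch command from the example above (this sketch reuses that exact command; adjust the script path and task for your setup):

```shell
# Apply the NCCL workarounds only for this launch, without modifying the
# shell environment for later runs:
NCCL_SHM_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_ALGO=Ring \
    python -m skrl.utils.distributed.jax --nnodes=1 --nproc_per_node=2 \
    scripts/reinforcement_learning/skrl/train.py --task=Isaac-Cartpole-v0 --headless --distributed --ml_framework jax
```

Per-launch prefixing makes it easy to compare behavior with and without the workarounds, since nothing persists between runs.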

.. note::

   These variables are NCCL-level workarounds intended for affected systems. They are not
   required on all machines, and may change communication behavior or performance depending
   on the hardware topology.

Multi-Node Training
-------------------

8 changes: 8 additions & 0 deletions docs/source/refs/troubleshooting.rst
@@ -9,6 +9,14 @@ Tricks and Troubleshooting
assistance.


Troubleshooting distributed training NCCL errors
------------------------------------------------

On some Linux multi-GPU systems, distributed training may fail with
``CUDA error: an illegal memory access was encountered`` reported by ``ProcessGroupNCCL``.
For documented NCCL workarounds, see :ref:`multi-gpu-nccl-troubleshooting`.
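When diagnosing such failures, NCCL's standard debug switches can show which transport or collective triggers the illegal memory access. A minimal sketch (the ``INIT,NET`` subsystem filter is one reasonable choice, not a requirement):

```shell
# Enable verbose NCCL logging before relaunching distributed training:
export NCCL_DEBUG=INFO
# Optionally limit output to initialization and networking subsystems:
export NCCL_DEBUG_SUBSYS=INIT,NET
```

The resulting log lines identify the transport (shared memory, InfiniBand, sockets) each communicator selects, which helps decide which of the workarounds above applies.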


Debugging physics simulation stability issues
---------------------------------------------
