AsyncGRPO: make NCCL weight sync robust on PCIe-only GPUs (auto fallback from P2P/SHM) #5865

@rycerzes

Feature request

Please add a topology-aware NCCL fallback for AsyncGRPO weight sync:

  • Detect when multi-GPU machines have no usable peer access / NVLink.
  • Automatically set:
    • NCCL_P2P_DISABLE=1
    • NCCL_SHM_DISABLE=1
  • Respect explicit user overrides if env vars are already set.
  • Log a clear warning with override instructions.
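
The requested behavior could be sketched roughly as follows. This is a minimal illustration, not the TRL implementation: the names `apply_nccl_fallback`, `all_gpus_have_peer_access`, and `apply_with_torch` are hypothetical, and using "no peer access between some GPU pair" as the trigger is an assumed heuristic. The peer check is passed in as a callable so the logic can be exercised without GPUs.

```python
import logging
import os
from typing import Callable, Dict, MutableMapping, Optional

logger = logging.getLogger(__name__)

# The two variables named in this request.
FALLBACK_VARS = ("NCCL_P2P_DISABLE", "NCCL_SHM_DISABLE")


def all_gpus_have_peer_access(
    num_gpus: int, can_access_peer: Callable[[int, int], bool]
) -> bool:
    """Return True only if every ordered pair of distinct GPUs reports peer access."""
    return all(
        can_access_peer(i, j)
        for i in range(num_gpus)
        for j in range(num_gpus)
        if i != j
    )


def apply_nccl_fallback(
    num_gpus: int,
    can_access_peer: Callable[[int, int], bool],
    env: Optional[MutableMapping[str, str]] = None,
) -> Dict[str, str]:
    """Set the fallback variables when peer access is missing.

    Variables already present in `env` are left untouched (explicit user
    override). Returns only the variables this call actually set.
    """
    env = os.environ if env is None else env
    applied: Dict[str, str] = {}
    # Single-GPU machines and fully peer-connected machines need no fallback.
    if num_gpus < 2 or all_gpus_have_peer_access(num_gpus, can_access_peer):
        return applied
    for name in FALLBACK_VARS:
        if name in env:
            continue  # respect whatever the user set, even "0"
        env[name] = "1"
        applied[name] = "1"
    if applied:
        logger.warning(
            "No usable GPU peer access detected; setting %s for NCCL weight sync. "
            "To override, set these variables explicitly (e.g. NCCL_P2P_DISABLE=0) "
            "before launching.",
            ", ".join(f"{k}={v}" for k, v in applied.items()),
        )
    return applied


def apply_with_torch() -> Dict[str, str]:
    """Assumed wiring to PyTorch; torch is imported lazily so the sketch loads without it."""
    import torch

    return apply_nccl_fallback(
        torch.cuda.device_count(), torch.cuda.can_device_access_peer
    )
```

For the variables to take effect, a call like `apply_with_torch()` would presumably need to run before the NCCL communicator used for weight sync is created, since NCCL reads its environment at initialization.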

Additionally, document that for vllm_mode="server" / remote vLLM, these env vars may need to be set on the vLLM server process too.
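
For the documentation part, one way to illustrate the server-side setting is a small helper that prepares the environment for the vLLM server process. This is a sketch under assumptions: `build_vllm_server_env` is a hypothetical name, and the launch command in the trailing comment is only an example of how such an environment might be passed to a separately started server.

```python
import os
from typing import Dict, Mapping, Optional


def build_vllm_server_env(base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the NCCL fallback applied.

    setdefault keeps any value the user already exported, so explicit
    overrides on the server side are respected as well.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.setdefault("NCCL_P2P_DISABLE", "1")
    env.setdefault("NCCL_SHM_DISABLE", "1")
    return env


# Hypothetical launch of the server with that environment:
#
#   import subprocess
#   subprocess.Popen(
#       ["trl", "vllm-serve", "--model", "<model-name>"],
#       env=build_vllm_server_env(),
#   )
```

The trainer and the server are separate processes, possibly on different machines, so variables set in the trainer's environment do not reach the server on their own.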

Motivation

On PCIe-only or constrained topologies (e.g. many cloud A10/L4/T4 setups), NCCL P2P/SHM can hang during large broadcast/allreduce operations used in trainer→inference weight sync.

Other RL frameworks already treat this as an operational reliability issue:

Frameworks that proactively handle this

  1. Prime-RL

  2. SkyRL

Frameworks that mostly rely on other NCCL knobs

  • NeMo-RL / VERL / AReaL commonly apply NCCL_CUMEM_ENABLE / NCCL_NVLS_ENABLE workarounds, but do not generally auto-apply P2P+SHM disable fallback for this case.

Given AsyncGRPO’s frequent NCCL weight sync operations, this fallback would significantly improve out-of-the-box stability on non-NVLink hardware.

Your contribution

I can open a PR for this, with a PoC.
