🐛 Describe the bug
fairseq2 relies on clusterscope to set up the distributed environment variables when running on Slurm.
This is convenient, but setting CUDA_VISIBLE_DEVICES to SLURM_LOCALID sometimes interferes with how fairseq2 manages devices (see an example in https://github.com/fairinternal/omnilingual/issues/173 if you have access).
Specifically, when multi-GPU training is launched by invoking a Python script through a Slurm command, fairseq2 expects every device to be visible to every rank. Because this is no longer the case, it fails with the NCCL error "Duplicate GPU detected".
A more detailed stack trace:
```
2026-04-27 17:33:29 INFO fairseq2 - Creating the root gang.
2026-04-27 17:33:33 ERROR fairseq2 - Recipe failed due to an operational error. See logged stack trace for details.
Traceback (most recent call last):
  File "/storage/home/daviddale/workspace/omnilingual/.venv/lib/python3.12/site-packages/fairseq2/gang.py", line 492, in create_default_process_group
    dist.init_process_group(
  File "/storage/home/daviddale/workspace/omnilingual/.venv/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/storage/home/daviddale/workspace/omnilingual/.venv/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/storage/home/daviddale/workspace/omnilingual/.venv/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 1769, in init_process_group
    default_pg, _ = _new_process_group_helper(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/storage/home/daviddale/workspace/omnilingual/.venv/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 2134, in _new_process_group_helper
    eager_backend.eager_connect_single_device(device_id)
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:94, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.27.5
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 53000
```
My suggestion would be to stop setting CUDA_VISIBLE_DEVICES in JobInfo.set_torch_distributed_env_from_slurm, or at least to make this step optional.
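To illustrate the "optional" variant, here is a minimal sketch of what an opt-in flag could look like. This is not clusterscope's actual code: the function name `set_distributed_env_from_slurm` and the `set_cuda_visible_devices` parameter are hypothetical, and the sketch leaves out MASTER_ADDR/MASTER_PORT handling. Only the Slurm variables (SLURM_PROCID, SLURM_LOCALID, SLURM_NTASKS) and the torch.distributed variables (RANK, LOCAL_RANK, WORLD_SIZE) are real names.

```python
from typing import Mapping, MutableMapping


def set_distributed_env_from_slurm(
    slurm_env: Mapping[str, str],
    target_env: MutableMapping[str, str],
    set_cuda_visible_devices: bool = False,
) -> None:
    """Hypothetical helper: copy Slurm task info into torch.distributed variables.

    CUDA_VISIBLE_DEVICES is only touched when the caller opts in, so frameworks
    that expect every device to be visible to every rank keep working by default.
    """
    target_env["RANK"] = slurm_env["SLURM_PROCID"]
    target_env["LOCAL_RANK"] = slurm_env["SLURM_LOCALID"]
    target_env["WORLD_SIZE"] = slurm_env["SLURM_NTASKS"]

    if set_cuda_visible_devices:
        # Opt-in behaviour: restrict this rank to the GPU matching its local ID.
        target_env["CUDA_VISIBLE_DEVICES"] = slurm_env["SLURM_LOCALID"]


# Example with a fake Slurm environment (rank 3 overall, local rank 1, 8 tasks).
fake_slurm = {"SLURM_PROCID": "3", "SLURM_LOCALID": "1", "SLURM_NTASKS": "8"}

default_env: dict[str, str] = {}
set_distributed_env_from_slurm(fake_slurm, default_env)

opt_in_env: dict[str, str] = {}
set_distributed_env_from_slurm(fake_slurm, opt_in_env, set_cuda_visible_devices=True)
```

In a real job the caller would pass `os.environ` for both mappings. Whether the default should be opt-in or opt-out is a design choice for the maintainers; either way, fairseq2 could then request the behaviour it needs.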
System information
- clusterscope version: 0.0.32
- Operating system: Ubuntu
- GPU models and configuration: 8 H100s
- Using torch 2.9.1+cu128 (although the torch version does not seem to play a role)