🐛[Bug Report]: modification of CUDA_VISIBLE_DEVICES may interfere with multi-GPU fairseq2 training on Slurm

### 🐛 Describe the bug

[fairseq2](https://github.com/facebookresearch/fairseq2) relies on clusterscope to [set up the distributed environment variables when running on Slurm](https://github.com/facebookresearch/clusterscope/blob/main/clusterscope/job_info.py#L134-L144).
This is convenient; however, setting `CUDA_VISIBLE_DEVICES` to `SLURM_LOCALID` sometimes interferes with how fairseq2 manages devices (see https://github.com/fairinternal/omnilingual/issues/173 for an example, if you have access).

Specifically, when multi-GPU training is launched by invoking a Python script through a Slurm command, fairseq2 expects every device to be visible to every rank. Because this is no longer the case, it fails with the error "Duplicate GPU detected".
<details><summary>A more detailed error stack example</summary>
<p>

```
2026-04-27 17:33:29 INFO     fairseq2 - Creating the root gang.
2026-04-27 17:33:33 ERROR    fairseq2 - Recipe failed due to an operational error. See logged stack trace for details.
                             Traceback (most recent call last):
                               File "/storage/home/daviddale/workspace/omnilingual/.venv/lib/python3.12/site-packages/fairseq2/gang.py", line 492, in create_default_process_group
                                 dist.init_process_group(
                               File "/storage/home/daviddale/workspace/omnilingual/.venv/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
                                 return func(*args, **kwargs)
                                        ^^^^^^^^^^^^^^^^^^^^^
                               File "/storage/home/daviddale/workspace/omnilingual/.venv/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
                                 func_return = func(*args, **kwargs)
                                               ^^^^^^^^^^^^^^^^^^^^^
                               File "/storage/home/daviddale/workspace/omnilingual/.venv/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 1769, in init_process_group
                                 default_pg, _ = _new_process_group_helper(
                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
                               File "/storage/home/daviddale/workspace/omnilingual/.venv/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 2134, in _new_process_group_helper
                                 eager_backend.eager_connect_single_device(device_id)
                             torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:94, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.27.5
                             ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
                             Last error:
                             Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 53000
```
</p>
</details> 
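
To check whether this is what happens on a given cluster, it may help to log what each rank sees before the process group is created. The snippet below is a minimal sketch, assuming the standard Slurm variables `SLURM_PROCID` and `SLURM_LOCALID`; the helper name `describe_rank_env` is hypothetical and is not part of fairseq2 or clusterscope.

```python
import os
from typing import Mapping, Optional


def describe_rank_env(env: Optional[Mapping[str, str]] = None) -> str:
    """Summarize the device-related environment of the current rank.

    Hypothetical helper for debugging; reads os.environ unless a mapping
    is passed explicitly.
    """
    env = os.environ if env is None else env
    keys = ("SLURM_PROCID", "SLURM_LOCALID", "CUDA_VISIBLE_DEVICES")
    return " ".join(f"{key}={env.get(key, '<unset>')}" for key in keys)


if __name__ == "__main__":
    # Launch through srun so that every task prints its own view.
    print(describe_rank_env())
```

Calling it once before and once after clusterscope sets up the environment should show whether `CUDA_VISIBLE_DEVICES` changed for that rank.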

My suggestion would be to stop setting `CUDA_VISIBLE_DEVICES` in `JobInfo.set_torch_distributed_env_from_slurm`, or at least to make this step optional.
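
For illustration, an opt-in variant could look roughly like the sketch below. This is not clusterscope's actual code: the standalone function, its `set_cuda_visible_devices` parameter, and the exact set of variables it exports are assumptions made for this example.

```python
import os
from typing import MutableMapping, Optional


def set_torch_distributed_env_from_slurm(
    env: Optional[MutableMapping[str, str]] = None,
    set_cuda_visible_devices: bool = False,  # hypothetical opt-in flag
) -> None:
    """Hypothetical sketch: map Slurm task variables to torch.distributed ones.

    MASTER_ADDR and MASTER_PORT are left out to keep the example short.
    """
    env = os.environ if env is None else env
    env["RANK"] = env["SLURM_PROCID"]
    env["WORLD_SIZE"] = env["SLURM_NTASKS"]
    env["LOCAL_RANK"] = env["SLURM_LOCALID"]
    if set_cuda_visible_devices:
        # Restricting device visibility only happens when explicitly requested.
        env["CUDA_VISIBLE_DEVICES"] = env["SLURM_LOCALID"]
```

With the flag off by default, callers such as fairseq2 that manage devices themselves would keep every GPU visible to every rank, while callers that depend on the restriction could still request it.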

### System information

* clusterscope version: 0.0.32
* Operating system: Ubuntu
* GPU models and configuration: 8 H100s
* torch version: 2.9.1+cu128 (although the torch version does not seem to matter)
