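The error RuntimeError: Invalid device string: 'cuda:[0, 1, 2, 3]' occurs because PyTorch's torch.device accepts only a string naming a single device (e.g., 'cuda:0'). In the traceback, _init_single_gpu passes config.devices[0] to _create_gpu_context, which formats it with f"cuda:{device_id}". The resulting string shows that device_id was the whole list [0, 1, 2, 3] rather than a single integer index.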
17:25:18 [INFO] Layer group 0: 19 layers, 384.1 MB each (layers: [0, 1, 2, 3, 4]...)
Traceback (most recent call last):
  File "/root/MegaTrain/examples/sft/run.py", line 318, in <module>
    main()
  File "/root/MegaTrain/examples/sft/run.py", line 204, in main
    model = CPUMasterModel(hf_model, config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/MegaTrain/infinity/model/cpu_master.py", line 466, in __init__
    self._init_single_gpu(config)
  File "/root/MegaTrain/infinity/model/cpu_master.py", line 480, in _init_single_gpu
    ctx = self._create_gpu_context(0, config.devices[0])
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/MegaTrain/infinity/model/cpu_master.py", line 525, in _create_gpu_context
    device = torch.device(f"cuda:{device_id}")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Invalid device string: 'cuda:[0, 1, 2, 3]'
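One possible way to guard against this is to flatten the device setting into plain integer indices before building any device string. The sketch below is only an illustration: normalize_device_ids and device_string are hypothetical helpers, not part of MegaTrain, and the nested value [[0, 1, 2, 3]] is an assumption about how config.devices ended up, inferred from the error message. It uses only the standard library; each resulting string would then be passed to torch.device.

```python
from collections.abc import Iterable


def normalize_device_ids(devices):
    """Flatten a device spec into a flat list of integer GPU indices.

    Accepts an int (0), a flat list ([0, 1]), or a nested list ([[0, 1, 2, 3]]).
    Hypothetical helper for illustration; not a MegaTrain function.
    """
    if isinstance(devices, int):
        return [devices]
    flat = []
    for item in devices:
        if isinstance(item, Iterable) and not isinstance(item, (str, bytes)):
            # Nested list or tuple: recurse and splice its indices in.
            flat.extend(normalize_device_ids(item))
        else:
            flat.append(int(item))
    return flat


def device_string(device_id):
    """Build a single-device string such as 'cuda:0', rejecting non-integers."""
    if isinstance(device_id, bool) or not isinstance(device_id, int):
        raise TypeError(f"expected a single integer GPU index, got {device_id!r}")
    return f"cuda:{device_id}"


# Assumed config value: the first entry is itself a list of indices.
devices = [[0, 1, 2, 3]]

# Formatting the list directly reproduces the malformed string from the traceback.
print(f"cuda:{devices[0]}")

# Flattening first yields one valid string per GPU.
print([device_string(i) for i in normalize_device_ids(devices)])
```

If the run is meant to use a single GPU, passing only the first normalized index (for example, normalize_device_ids(devices)[0]) to the context-creation call would keep the string valid. Whether the project expects one index or several at that point is something to confirm against its own configuration documentation.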