[E ProcessGroupNCCL.cpp:737] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=56693, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804371 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:737] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=56693, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804371 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:737] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=56693, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804376 milliseconds before timing out.
Traceback (most recent call last):
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/lc/Desktop/wza/yrh/GCD/train/train.py", line 91, in <module>
    main()
  File "/home/lc/Desktop/wza/yrh/GCD/train/train.py", line 85, in main
    trainer.train_loop(max_epochs)
  File "/home/lc/Desktop/wza/yrh/GCD/train/trainer.py", line 289, in train_loop
    training_step_outputs = self.training_step(batch, batch_idx)
  File "/home/lc/Desktop/wza/yrh/GCD/train/trainer.py", line 214, in training_step
    losses = self.diffusion.training_losses(
  File "/home/lc/Desktop/wza/yrh/GCD/diffusion/respace.py", line 97, in training_losses
    return super().training_losses(self._wrap_model(model), *args, **kwargs)
  File "/home/lc/Desktop/wza/yrh/GCD/diffusion/gaussian_diffusion.py", line 1356, in training_losses
    model_output, style_embed = model(x_t, self._scale_timesteps(t), return_style=True, **model_kwargs)
  File "/home/lc/Desktop/wza/yrh/GCD/diffusion/respace.py", line 132, in __call__
    return self.model(x, new_ts, **kwargs)
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/accelerate/utils/operations.py", line 495, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 12, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1002, in forward
    self._sync_buffers()
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1585, in _sync_buffers
    self._sync_module_buffers(authoritative_rank)
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1589, in _sync_module_buffers
    self._default_broadcast_coalesced(authoritative_rank=authoritative_rank)
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1610, in _default_broadcast_coalesced
    self._distributed_broadcast_coalesced(
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1526, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: NCCL communicator was aborted on rank 3. Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=56693, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804371 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 31504) of binary: /home/lc/anaconda3/envs/gcd/bin/python
Traceback (most recent call last):
  File "/home/lc/anaconda3/envs/gcd/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/accelerate/commands/launch.py", line 914, in launch_command
    multi_gpu_launcher(args)
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/accelerate/commands/launch.py", line 603, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
train.train FAILED
------------------------------------------------------
Failures:
[1]:
  time      : 2024-11-21_21:26:57
  host      : lc-NF5468M5
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 31505)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 31505
[2]:
  time      : 2024-11-21_21:26:57
  host      : lc-NF5468M5
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 31506)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 31506
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-21_21:26:57
  host      : lc-NF5468M5
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 31504)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 31504
======================================================
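The broadcast that timed out comes from DDP's per-step buffer sync (_sync_buffers in the first traceback), so at least one rank apparently never reached that collective within the 30-minute window (Timeout(ms)=1800000). One common mitigation while the real desync is tracked down is to widen the process-group timeout and, if the model's buffers do not need to stay synchronized, turn off buffer broadcasting. The sketch below is against accelerate and assumes the trainer constructs its own Accelerator; where and how that happens is not visible in this log, so the names and the two-hour value are illustrative only.

# Sketch only: widen the NCCL collective timeout and (optionally) skip DDP
# buffer broadcasts. The Accelerator construction site is an assumption here.
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import DistributedDataParallelKwargs, InitProcessGroupKwargs

# The watchdog fired at the default 30-minute timeout; allow a longer window
# so a temporarily slow rank (data loading, checkpointing) does not abort the job.
pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))

# The timed-out broadcast is DDP's buffer sync; if the model's buffers
# (e.g. BatchNorm running stats) do not need to be kept identical across
# ranks, disabling the broadcast removes that collective entirely.
ddp_kwargs = DistributedDataParallelKwargs(broadcast_buffers=False)

accelerator = Accelerator(kwargs_handlers=[pg_kwargs, ddp_kwargs])

Note that a larger timeout only buys time: if one rank is genuinely stuck or the ranks take different numbers of steps, the hang will recur and the underlying divergence still has to be found.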