
ERROR NCCL communicator was aborted on rank 3. Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=56693, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804371 milliseconds before timing out. #6

Description

@Erwin2233

[E ProcessGroupNCCL.cpp:737] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=56693, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804371 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:737] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=56693, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804371 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:737] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=56693, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804376 milliseconds before timing out.
Traceback (most recent call last):
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/lc/Desktop/wza/yrh/GCD/train/train.py", line 91, in <module>
    main()
  File "/home/lc/Desktop/wza/yrh/GCD/train/train.py", line 85, in main
    trainer.train_loop(max_epochs)
  File "/home/lc/Desktop/wza/yrh/GCD/train/trainer.py", line 289, in train_loop
    training_step_outputs = self.training_step(batch, batch_idx)
  File "/home/lc/Desktop/wza/yrh/GCD/train/trainer.py", line 214, in training_step
    losses = self.diffusion.training_losses(
  File "/home/lc/Desktop/wza/yrh/GCD/diffusion/respace.py", line 97, in training_losses
    return super().training_losses(self._wrap_model(model), *args, **kwargs)
  File "/home/lc/Desktop/wza/yrh/GCD/diffusion/gaussian_diffusion.py", line 1356, in training_losses
    model_output, style_embed = model(x_t, self._scale_timesteps(t), return_style=True, **model_kwargs)
  File "/home/lc/Desktop/wza/yrh/GCD/diffusion/respace.py", line 132, in __call__
    return self.model(x, new_ts, **kwargs)
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/accelerate/utils/operations.py", line 495, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 12, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1002, in forward
    self._sync_buffers()
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1585, in _sync_buffers
    self._sync_module_buffers(authoritative_rank)
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1589, in _sync_module_buffers
    self._default_broadcast_coalesced(authoritative_rank=authoritative_rank)
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1610, in _default_broadcast_coalesced
    self._distributed_broadcast_coalesced(
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1526, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: NCCL communicator was aborted on rank 3. Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=56693, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804371 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 31504) of binary: /home/lc/anaconda3/envs/gcd/bin/python
Traceback (most recent call last):
  File "/home/lc/anaconda3/envs/gcd/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/accelerate/commands/launch.py", line 914, in launch_command
    multi_gpu_launcher(args)
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/accelerate/commands/launch.py", line 603, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/lc/anaconda3/envs/gcd/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
train.train FAILED
------------------------------------------------------
Failures:
[1]:
  time      : 2024-11-21_21:26:57
  host      : lc-NF5468M5
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 31505)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 31505
[2]:
  time      : 2024-11-21_21:26:57
  host      : lc-NF5468M5
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 31506)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 31506
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-21_21:26:57
  host      : lc-NF5468M5
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 31504)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 31504
======================================================
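
For context, the trace shows the hang occurring inside `DistributedDataParallel._sync_buffers`, i.e. the per-iteration buffer broadcast, and the NCCL watchdog fires after the default 30-minute timeout (`Timeout(ms)=1800000`). If the training itself is otherwise healthy and one rank is simply slow to reach the collective (uneven data sharding, a long validation or checkpoint step on one rank, etc.), one possible mitigation is to raise the process-group timeout and, if no buffers need to stay synchronized, disable the buffer broadcast. Since the log shows the run is launched through `accelerate`, here is a minimal sketch; where exactly the `Accelerator` is constructed in this repository is an assumption, and the variable names below are placeholders, not the project's actual code:

```python
# Sketch only: assumes the training script constructs its own Accelerator.
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import DistributedDataParallelKwargs, InitProcessGroupKwargs

# Raise the NCCL collective timeout above the default 30 minutes so a
# temporarily slow rank does not trip the watchdog.
pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))

# Optional: skip the buffer broadcast that the traceback shows timing out,
# but only if the model has no buffers that must be kept identical across ranks.
ddp_kwargs = DistributedDataParallelKwargs(broadcast_buffers=False)

accelerator = Accelerator(kwargs_handlers=[pg_kwargs, ddp_kwargs])
```

Note that this only keeps the watchdog from aborting the run; it does not explain why rank 3 stalled on the broadcast in the first place, which is worth investigating separately.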
