Description
Environment:
DeepSpeed version: 0.18.6
PyTorch version: 2.4.0+cu124
Transformers version: 4.51.0
GPU: 8x NVIDIA H100
CUDA version: 12.4
Training Configuration:
Model: Dexmal/Dexbotic-PI05
Dataset: Dexmal/libero
Training script: libero_pi05.py
I'm trying to finetune Dexbotic-PI05 on LIBERO, but I'm getting this error:
Traceback (most recent call last):
[rank7]: Traceback (most recent call last):
[rank7]:   File "/inspire/hdd/project/emotionalcomputing/lihaoyang-253308110306/tutorial/dexbotic/playground/benchmarks/libero/libero_pi05.py", line 260, in <module>
[rank7]:     exp.train()
[rank7]:   File "/inspire/hdd/project/emotionalcomputing/lihaoyang-253308110306/tutorial/dexbotic/dexbotic/exp/base_exp.py", line 872, in train
[rank7]:     self.trainer.train()
[rank7]:   File "/inspire/hdd/project/emotionalcomputing/lihaoyang-253308110306/miniconda3/envs/dexbotic/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
[rank7]:     return inner_training_loop(
[rank7]:   File "/inspire/hdd/project/emotionalcomputing/lihaoyang-253308110306/miniconda3/envs/dexbotic/lib/python3.10/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
[rank7]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank7]:   File "/inspire/hdd/project/emotionalcomputing/lihaoyang-253308110306/miniconda3/envs/dexbotic/lib/python3.10/site-packages/transformers/trainer.py", line 3782, in training_step
[rank7]:     self.accelerator.backward(loss, **kwargs)
[rank7]:   File "/inspire/hdd/project/emotionalcomputing/lihaoyang-253308110306/miniconda3/envs/dexbotic/lib/python3.10/site-packages/accelerate/accelerator.py", line 2844, in backward
[rank7]:     self.deepspeed_engine_wrapped.backward(loss, sync_gradients=self.sync_gradients, **kwargs)
[rank7]:   File "/inspire/hdd/project/emotionalcomputing/lihaoyang-253308110306/miniconda3/envs/dexbotic/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 270, in backward
[rank7]:     self.engine.backward(loss, **kwargs)
[rank7]:   File "/inspire/hdd/project/emotionalcomputing/lihaoyang-253308110306/miniconda3/envs/dexbotic/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank7]:     ret_val = func(*args, **kwargs)
[rank7]:   File "/inspire/hdd/project/emotionalcomputing/lihaoyang-253308110306/miniconda3/envs/dexbotic/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2583, in backward
[rank7]:     loss.backward(**backward_kwargs)
[rank7]:   File "/inspire/hdd/project/emotionalcomputing/lihaoyang-253308110306/miniconda3/envs/dexbotic/lib/python3.10/site-packages/torch/_tensor.py", line 521, in backward
[rank7]:     torch.autograd.backward(
[rank7]:   File "/inspire/hdd/project/emotionalcomputing/lihaoyang-253308110306/miniconda3/envs/dexbotic/lib/python3.10/site-packages/torch/autograd/__init__.py", line 289, in backward
[rank7]:     _engine_run_backward(
[rank7]:   File "/inspire/hdd/project/emotionalcomputing/lihaoyang-253308110306/miniconda3/envs/dexbotic/lib/python3.10/site-packages/torch/autograd/graph.py", line 768, in _engine_run_backward
[rank7]:     return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank7]:   File "/inspire/hdd/project/emotionalcomputing/lihaoyang-253308110306/miniconda3/envs/dexbotic/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank7]:     ret_val = func(*args, **kwargs)
[rank7]:   File "/inspire/hdd/project/emotionalcomputing/lihaoyang-253308110306/miniconda3/envs/dexbotic/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1288, in reduce_partition_and_remove_grads
[rank7]:     current_expected = count_used_parameters_in_backward(
[rank7]:   File "/inspire/hdd/project/emotionalcomputing/lihaoyang-253308110306/miniconda3/envs/dexbotic/lib/python3.10/site-packages/deepspeed/runtime/utils.py", line 1458, in count_used_parameters_in_backward
[rank7]:     grad_fn = _get_grad_fn_or_grad_acc(param)
[rank7]:   File "/inspire/hdd/project/emotionalcomputing/lihaoyang-253308110306/miniconda3/envs/dexbotic/lib/python3.10/site-packages/torch/autograd/graph.py", line 161, in _get_grad_fn_or_grad_acc
[rank7]:     return t.view_as(t).grad_fn.next_functions[0][0]
[rank7]: AttributeError: 'NoneType' object has no attribute 'next_functions'
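The last frame fails because t.view_as(t).grad_fn evaluates to None. The sketch below uses plain PyTorch only (no DeepSpeed, no Dexbotic) to show two situations in which that expression yields None: the parameter has requires_grad=False, or grad mode is disabled at the moment the view is created. Whether either of these is what actually happens inside the ZeRO-3 backward hook here is an assumption, not something the traceback confirms. The helper name view_grad_fn is hypothetical.

```python
import torch


def view_grad_fn(t):
    """Return the grad_fn of a fresh view of t (the first step of the failing expression)."""
    return t.view_as(t).grad_fn


trainable = torch.nn.Parameter(torch.ones(2))
frozen = torch.nn.Parameter(torch.ones(2), requires_grad=False)

# Grad mode on: a view of a trainable parameter is tracked by autograd,
# a view of a frozen parameter is not.
with torch.enable_grad():
    fn_trainable = view_grad_fn(trainable)
    fn_frozen = view_grad_fn(frozen)

# Grad mode off: even a trainable parameter yields an untracked view.
with torch.no_grad():
    fn_no_grad = view_grad_fn(trainable)

print("trainable, grad enabled :", fn_trainable)
print("frozen, grad enabled    :", fn_frozen)
print("trainable, grad disabled:", fn_no_grad)
```

In the two None cases, accessing .next_functions raises the same AttributeError as in the traceback, so checking for frozen parameters and for the grad mode in effect during backward may help narrow this down.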
Claude told me to add ignore_unused_parameters: true in zero3.json, but it didn't help.
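For clarity, this is roughly where the key was placed. It is a minimal sketch that assumes the usual DeepSpeed layout, with the option nested under zero_optimization; the real zero3.json is not reproduced here, so every other value is a placeholder. If I read the DeepSpeed configuration docs correctly, ignore_unused_parameters is described for the lower ZeRO stages, which might explain why it has no effect at stage 3, but I have not verified that.

```python
import json

# Hypothetical stand-in for zero3.json: only ignore_unused_parameters comes
# from the report above; the remaining values are placeholders.
zero3_config = {
    "zero_optimization": {
        "stage": 3,
        "ignore_unused_parameters": True,
    },
    "train_micro_batch_size_per_gpu": "auto",
}

# Serialize the dict the way it would appear in the JSON file.
config_text = json.dumps(zero3_config, indent=2)
print(config_text)
```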
A similar error occurs when fine-tuning Dexmal/Dexbotic-Base on LIBERO with torchrun --nproc_per_node=8 playground/benchmarks/libero/libero_cogact.py (using 8×4090 and zero3_offload.json): AttributeError: 'NoneType' object has no attribute 'next_functions'
Could you please advise on how to fix this issue?