Fine-Tuning with Large Dataset - Out of Memory (OOM) Error

Hi,

Thank you for sharing this great repository!

I'm trying to fine-tune on a relatively large dataset (~60k records), but PyTorch fails with a CUDA out-of-memory (OOM) error.

Could you please advise on how to modify the fine-tuning code so that it processes the data in smaller batches without breaking other parts of the codebase?

Here is the error message I'm getting:
#####
torch.Size([2, 2048])
torch.Size([2, 2048])
FCNNetwork build torch.Size([2, 1])
0.001
Inner Loop parameters
names_learning_rates_dict.layer_dict-linear-weights torch.Size([6]) True
Outer Loop parameters
regressor.layer_dict.linear0.linear.weights torch.Size([2048, 2048]) True
regressor.layer_dict.linear0.linear.bias torch.Size([2048]) True
regressor.layer_dict.linear0.norm_layer.bias torch.Size([5, 2048]) True
regressor.layer_dict.linear0.norm_layer.weight torch.Size([5, 2048]) True
regressor.layer_dict.linear1.linear.weights torch.Size([2048, 2048]) True
regressor.layer_dict.linear1.linear.bias torch.Size([2048]) True
regressor.layer_dict.linear1.norm_layer.bias torch.Size([5, 2048]) True
regressor.layer_dict.linear1.norm_layer.weight torch.Size([5, 2048]) True
regressor.layer_dict.linear.weights torch.Size([1, 2048]) True
inner_loop_optimizer.names_learning_rates_dict.layer_dict-linear-weights torch.Size([6]) True
run_actfound_V1.py:30: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:274.)
  x_task = torch.tensor(xs)
Traceback (most recent call last):
  File "run_actfound_V1.py", line 249, in <module>
    main()
  File "run_actfound_V1.py", line 246, in main
    run_prediction(args, model)
  File "run_actfound_V1.py", line 169, in run_prediction
    y_pred, _ = model.run_predict(x_task, y_task_input, split)
  File "Actfound_demo/system_actfound_original.py", line 23, in run_predict
    names_weights_copy, support_loss_each_step, _ = self.inner_loop(x_task, y_task, 0, split,False, -1, num_steps)
  File "Actfound_demo/system_base.py", line 152, in inner_loop
    support_loss, support_preds = self.net_forward(x=x_task,
  File "Actfound_demo/system_actfound_original.py", line 68, in net_forward
    ddg_pred = support_value.unsqueeze(-1) - support_value.unsqueeze(0)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.40 GiB. GPU 

########
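
My reading of the traceback, which may be wrong: the failing line in net_forward subtracts support_value.unsqueeze(0) from support_value.unsqueeze(-1). If support_value is a 1-D float32 tensor with one entry per record (an assumption on my side), that builds an N x N matrix, so memory grows quadratically with the number of records. For N of about 60k this comes to roughly 13.4 GiB, which matches the allocation in the error. Below is a minimal sketch of the arithmetic and of the kind of index batching I have in mind. The function names are my own and not part of the repository, and restricting pairs to records within one batch would change which pairs the loss sees.

```python
import random
from typing import Iterator, List, Optional


def pairwise_matrix_gib(n: int, bytes_per_element: int = 4) -> float:
    """Size in GiB of a dense n x n matrix (4 bytes per element = float32)."""
    return n * n * bytes_per_element / 2**30


def iter_index_batches(n: int, batch_size: int,
                       seed: Optional[int] = None) -> Iterator[List[int]]:
    """Yield shuffled, non-overlapping index batches that cover range(n) once.

    Hypothetical helper: each yielded list could be used to slice the
    support tensors so that the pairwise matrix is at most
    batch_size x batch_size.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    for start in range(0, n, batch_size):
        yield indices[start:start + batch_size]


if __name__ == "__main__":
    # ~60k records, as in this issue: the full pairwise matrix is ~13.4 GiB.
    print(f"{pairwise_matrix_gib(60_000):.2f} GiB")
    # With 2,048 records per batch, each pairwise matrix is 16 MiB.
    print(f"{pairwise_matrix_gib(2_048) * 1024:.0f} MiB")
```

Each index list could then be used to slice x_task and y_task_input before they are passed on, but I have not checked whether inner_loop can be called on such slices without affecting the rest of the code.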


Thanks
