OverflowError: out of range integral type conversion attempted #6

@sepulm01

Hi, I'm replicating your training script as the README describes, using the sh train_bart_model.sh command. This error appears at the end, during evaluation.

{'loss': 0.0193, 'grad_norm': 0.06763239204883575, 'learning_rate': 2.983362019506598e-06, 'epoch': 2.98}
{'loss': 0.0197, 'grad_norm': 0.07400441914796829, 'learning_rate': 1.835915088927137e-06, 'epoch': 2.99}
{'loss': 0.0201, 'grad_norm': 0.07796286791563034, 'learning_rate': 6.884681583476765e-07, 'epoch': 2.99}
{'train_runtime': 8481.4381, 'train_samples_per_second': 105.243, 'train_steps_per_second': 0.411, 'train_loss': 0.0540917304825899, 'epoch': 3.0}
100%|██████████████████████████████████████████████████████████████████████████████| 3486/3486 [2:21:21<00:00, 2.43s/it]
[WARNING|configuration_utils.py:447] 2024-03-30 18:17:42,674 >> Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model) instead. This warning will be raised to an exception in v4.41.
Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}
***** train metrics *****
epoch = 3.0
train_loss = 0.0541
train_runtime = 2:21:21.43
train_samples = 297536
train_samples_per_second = 105.243
train_steps_per_second = 0.411
100%|████████████████████████████████████████████████████████████████████████████████████| 63/63 [26:08<00:00, 21.08s/it]
Traceback (most recent call last):
File "/var/www/nlp/spelling/run_summarization.py", line 708, in <module>
main()
File "/var/www/nlp/spelling/run_summarization.py", line 650, in main
metrics = trainer.evaluate(max_length=max_length, num_beams=num_beams, metric_key_prefix="eval")
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 180, in evaluate
return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/trainer.py", line 3365, in evaluate
output = eval_loop(
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/trainer.py", line 3656, in evaluation_loop
metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
File "/var/www/nlp/spelling/run_summarization.py", line 590, in compute_metrics
decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3785, in batch_decode
return [
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3786, in <listcomp>
self.decode(
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3825, in decode
return self._decode(
File "/var/www/nlp/spelling/venv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 625, in _decode
text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: out of range integral type conversion attempted
100%|████████████████████████████████████████████████████████████████████████████████████| 63/63 [26:09<00:00, 24.91s/it]

My environment is Ubuntu 20.04, 32 GB RAM, 48 cores, RTX 4080.
Sat Mar 30 16:50:57 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:81:00.0  On |                  N/A |
| 53%   68C    P2   196W / 320W |  10981MiB / 16376MiB |     79%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3901      G   /usr/lib/xorg/Xorg                144MiB |
|    0   N/A  N/A      4070      G   /usr/bin/gnome-shell               66MiB |
|    0   N/A  N/A      6775      G   ...3/usr/lib/firefox/firefox       11MiB |
|    0   N/A  N/A     15644      G   ...on=20240329-134507.235000       58MiB |
|    0   N/A  N/A     32210      C   python                          10694MiB |
+-----------------------------------------------------------------------------+

My last checkpoint was 3000.
I don't know whether all 3 epochs finished.
Thanks in advance,
Martín
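
A possible cause, stated here as an assumption rather than a confirmed diagnosis: the traceback ends inside the fast tokenizer's decode call, and this error is commonly seen when the predictions handed to compute_metrics still contain the negative padding sentinel -100, which cannot be converted to an unsigned token id. The sketch below shows the usual workaround on plain Python lists, replacing the sentinel with a pad id before decoding. The helper name replace_ignore_index, the sample ids, and pad_token_id=1 are hypothetical and only for illustration.

```python
IGNORE_INDEX = -100  # assumption: sentinel value used to pad predictions/labels


def replace_ignore_index(batch, pad_token_id, ignore_index=IGNORE_INDEX):
    """Return a copy of `batch` (a list of lists of ints) in which every
    `ignore_index` entry is replaced by `pad_token_id`."""
    return [
        [pad_token_id if tok == ignore_index else tok for tok in seq]
        for seq in batch
    ]


# Hypothetical batch of generated ids, right-padded with -100.
preds = [[0, 713, 16, 2, -100, -100], [0, 42, 2, -100, -100, -100]]
clean = replace_ignore_index(preds, pad_token_id=1)
print(clean)
```

If this is indeed the cause, the same replacement could be applied in compute_metrics just before tokenizer.batch_decode(preds, skip_special_tokens=True), for example with np.where(preds != -100, preds, tokenizer.pad_token_id) when preds is a NumPy array. Because skip_special_tokens=True drops pad tokens, the decoded text should be unaffected by the substitution.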
