Loss becomes infinite while training quant models #5

@RaidenE1

Hi, when I try to train a quant model using the config detectron2/configs/COCO-Detection/retinanet_R_18_FPN_1x-Full-SyncBN-lsq-2bit.yaml, the loss becomes NaN at iteration 390:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/zhangjinhe/anaconda3/envs/torch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/launch.py", line 125, in _distributed_worker
    main_func(*args)
  File "/home/zhangjinhe/QTools/git/detectron2/tools/train_net.py", line 154, in main
    return trainer.train()
  File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/defaults.py", line 489, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/defaults.py", line 499, in run_step
    self._trainer.run_step()
  File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/train_loop.py", line 289, in run_step
    self._write_metrics(loss_dict, data_time)
  File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/train_loop.py", line 332, in _write_metrics
    f"Loss became infinite or NaN at iteration={self.iter}!\n"
FloatingPointError: Loss became infinite or NaN at iteration=390!
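
For context, the error comes from a sanity check on the summed loss. A rough sketch of what such a check does, in plain Python, is below. The helper name `check_losses_finite` and the example loss keys are hypothetical and for illustration only; this is not the actual detectron2 code.

```python
import math


def check_losses_finite(loss_dict, iteration):
    """Return the summed loss, or raise if it is NaN or infinite.

    Hypothetical helper for illustration; not detectron2's implementation.
    """
    total = sum(float(v) for v in loss_dict.values())
    if not math.isfinite(total):
        # Include the per-term losses so the offending term is visible.
        raise FloatingPointError(
            f"Loss became infinite or NaN at iteration={iteration}!\n"
            f"loss_dict = {loss_dict}"
        )
    return total


# Example with made-up values: a finite loss passes, a NaN term raises.
print(check_losses_finite({"loss_cls": 0.5, "loss_box_reg": 0.25}, 389))
try:
    check_losses_finite({"loss_cls": float("nan"), "loss_box_reg": 0.25}, 390)
except FloatingPointError as err:
    print(err)
```

Printing the individual terms this way would show which loss component goes non-finite first.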

The command I use is: python tools/train_net.py --config-file configs/COCO-Detection/retinanet_R_18_FPN_1x-Full-SyncBN-lsq-2bit.yaml --num-gpus 4 MODEL.WEIGHTS output/coco-detection/retinanet_R_18_FPN_1x-Full_BN/model_final.pth

I changed the input_size from (640, 672, 704, 736, 768, 800) to (800,). The checkpoint file is the result of another experiment that used the config retinanet_R_18_FPN_1x-Full-BN.yaml.

Any ideas why?
