Hi, when I try to train a quantized model using the config detectron2/configs/COCO-Detection/retinanet_R_18_FPN_1x-Full-SyncBN-lsq-2bit.yaml, the loss becomes NaN at iteration 390:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/zhangjinhe/anaconda3/envs/torch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/launch.py", line 125, in _distributed_worker
main_func(*args)
File "/home/zhangjinhe/QTools/git/detectron2/tools/train_net.py", line 154, in main
return trainer.train()
File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/defaults.py", line 489, in train
super().train(self.start_iter, self.max_iter)
File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/train_loop.py", line 149, in train
self.run_step()
File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/defaults.py", line 499, in run_step
self._trainer.run_step()
File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/train_loop.py", line 289, in run_step
self._write_metrics(loss_dict, data_time)
File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/train_loop.py", line 332, in _write_metrics
f"Loss became infinite or NaN at iteration={self.iter}!\n"
FloatingPointError: Loss became infinite or NaN at iteration=390!
The command I use is:
python tools/train_net.py --config-file configs/COCO-Detection/retinanet_R_18_FPN_1x-Full-SyncBN-lsq-2bit.yaml --num-gpus 4 MODEL.WEIGHTS output/coco-detection/retinanet_R_18_FPN_1x-Full_BN/model_final.pth
I changed the input_size from (640, 672, 704, 736, 768, 800) to (800,), and the checkpoint file is the result of another experiment that used the config retinanet_R_18_FPN_1x-Full-BN.yaml.
Any ideas why?