Original issue: open-mmlab/mmdetection#7148
Created: 2022-02-13
Last updated: 2022-03-13
I am training with YOLACT; when an epoch ends, it runs out of memory (OOM) in the validation phase.
My environment is:
- Python 3.7.11
- CUDA 11.3
- CUDNN 8200
- numpy 1.21.2
- pycocotools 2.0.4
- pytorch 1.10.1
The GPU is an RTX 3090 (24 GB VRAM).
I only modified num_classes in yolact_r50_1x8_coco.py. The data section of the config is:
data = dict(
    samples_per_gpu=8,
    workers_per_gpu=4,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/train2017.json',
        img_prefix=data_root + 'train2017/',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/val2017.json',
        img_prefix=data_root + 'val2017/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/val2017.json',
        img_prefix=data_root + 'val2017/',
        pipeline=test_pipeline))
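For reference, a minimal sketch of what the num_classes change might look like (the head names follow the upstream yolact_r50_1x8_coco.py layout; N is a hypothetical placeholder for my class count):

# Sketch of where num_classes lives in yolact_r50_1x8_coco.py (N is a hypothetical
# placeholder for the custom class count; head names follow the upstream config).
N = 1
model = dict(
    bbox_head=dict(num_classes=N),   # box/classification branch
    mask_head=dict(num_classes=N),   # protonet mask branch
    segm_head=dict(num_classes=N))   # auxiliary semantic segmentation branch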
Then I started training on my dataset with python tools/train.py configs/yolact/yolact_r50_1x8_coco.py.
My dataset:
- https://resources.mpi-inf.mpg.de/d2/orekondy/redactions/ (the annotations have been converted to COCO format; a minimal sketch of the expected layout follows below).
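For clarity, this is the COCO-style instance annotation layout the converted files follow (all values below are hypothetical placeholders, not taken from the actual dataset):

# Minimal COCO instance-segmentation skeleton (hypothetical placeholder values).
coco_ann = {
    "images": [
        {"id": 1, "file_name": "000001.jpg", "width": 3550, "height": 3550}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [100, 200, 300, 400],                                 # [x, y, w, h]
         "segmentation": [[100, 200, 400, 200, 400, 600, 100, 600]],  # polygon
         "area": 120000, "iscrowd": 0}],
    "categories": [
        {"id": 1, "name": "redaction"}]}  # hypothetical category name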
But when an epoch ends, it runs out of memory during the validation phase:
2022-02-13 14:38:01,355 - mmdet - INFO - Saving checkpoint at 1 epochs
[ ] 26/1611, 0.6 task/s, elapsed: 43s, ETA: 2636s
Traceback (most recent call last):
File "tools/train.py", line 195, in <module>
main()
File "tools/train.py", line 191, in main
meta=meta)
File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/apis/train.py", line 209, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
self.call_hook('after_train_epoch')
File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
getattr(hook, fn_name)(self)
File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmcv/runner/hooks/evaluation.py", line 267, in after_train_epoch
self._do_evaluate(runner)
File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/core/evaluation/eval_hooks.py", line 56, in _do_evaluate
results = single_gpu_test(runner.model, self.dataloader, show=False)
File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/apis/test.py", line 28, in single_gpu_test
result = model(return_loss=False, rescale=True, **data)
File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 50, in forward
return super().forward(*inputs, **kwargs)
File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
return old_func(*args, **kwargs)
File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 174, in forward
return self.forward_test(img, img_metas, **kwargs)
File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 147, in forward_test
return self.simple_test(imgs[0], img_metas[0], **kwargs)
File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/models/detectors/yolact.py", line 113, in simple_test
rescale=rescale)
File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/models/dense_heads/yolact_head.py", line 999, in simple_test
img_metas[i], rescale)
File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/models/dense_heads/yolact_head.py", line 869, in get_seg_masks
align_corners=False).squeeze(0) > 0.5
RuntimeError: CUDA out of memory. Tried to allocate 4.69 GiB (GPU 0; 23.70 GiB total capacity; 19.18 GiB already allocated; 276.56 MiB free; 21.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
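For context, the 4.69 GiB allocation that fails is the upsampling of all kept instance masks to the original image resolution in one go. A rough back-of-the-envelope with assumed numbers (not mmdet code; the detection count and image size below are guesses that merely reproduce the magnitude):

# Back-of-the-envelope for the failed allocation (assumed numbers only).
# Assumptions: ~100 masks kept per image (a typical max_per_img for YOLACT)
# and a large original image of roughly 3550 x 3550 pixels, upsampled as float32.
num_masks = 100          # hypothetical number of detections kept after NMS
height = width = 3550    # hypothetical original image size in pixels
bytes_per_float32 = 4
required_gib = num_masks * height * width * bytes_per_float32 / 1024 ** 3
print(f"~{required_gib:.2f} GiB for the upsampled masks")  # ~4.69 GiB, matching the report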
After I swap the training and validation sets, the error disappears:
data = dict(
    samples_per_gpu=8,
    workers_per_gpu=4,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/val2017.json',
        img_prefix=data_root + 'val2017/',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/train2017.json',
        img_prefix=data_root + 'train2017/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/val2017.json',
        img_prefix=data_root + 'val2017/',
        pipeline=test_pipeline))
Is there a way to solve this issue? Thanks.
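For reference, the allocator tweak the error message itself suggests (setting max_split_size_mb to reduce fragmentation) might look roughly like this; the 512 value is an arbitrary example, and it only helps when reserved memory is much larger than allocated memory:

# Sketch: set the allocator option before CUDA is initialised (arbitrary example value).
# Equivalent to `export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512` in the shell.
import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:512'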