
yolact Out of memory during training #993

@GeorgePearse

Description

Original issue: open-mmlab/mmdetection#7148
Created: 2022-02-13
Last updated: 2022-03-13


I am training YOLACT, and at the end of each epoch it runs out of GPU memory (OOM) during the validation phase.

My environment is:

  • Python 3.7.11
  • CUDA 11.3
  • cuDNN 8.2.0 (reported as 8200)
  • NumPy 1.21.2
  • pycocotools 2.0.4
  • PyTorch 1.10.1

The GPU is an RTX 3090 (24 GB VRAM).

I only modified num_classes in the config file yolact_r50_1x8_coco.py, plus the data paths:

data = dict(
    samples_per_gpu=8, 
    workers_per_gpu=4,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/train2017.json',
        img_prefix=data_root + 'train2017/',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/val2017.json',
        img_prefix=data_root + 'val2017/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/val2017.json',
        img_prefix=data_root + 'val2017/',
        pipeline=test_pipeline))

Then I started training on my dataset with: python tools/train.py configs/yolact/yolact_r50_1x8_coco.py

My dataset:

But when the epoch ends, it runs out of memory during the validation phase:

2022-02-13 14:38:01,355 - mmdet - INFO - Saving checkpoint at 1 epochs
[                                                  ] 26/1611, 0.6 task/s, elapsed: 43s, ETA:  2636sTraceback (most recent call last):
  File "tools/train.py", line 195, in <module>
    main()
  File "tools/train.py", line 191, in main
    meta=meta)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/apis/train.py", line 209, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
    self.call_hook('after_train_epoch')
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmcv/runner/hooks/evaluation.py", line 267, in after_train_epoch
    self._do_evaluate(runner)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/core/evaluation/eval_hooks.py", line 56, in _do_evaluate
    results = single_gpu_test(runner.model, self.dataloader, show=False)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/apis/test.py", line 28, in single_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 50, in forward
    return super().forward(*inputs, **kwargs)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
    return old_func(*args, **kwargs)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 174, in forward
    return self.forward_test(img, img_metas, **kwargs)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 147, in forward_test
    return self.simple_test(imgs[0], img_metas[0], **kwargs)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/models/detectors/yolact.py", line 113, in simple_test
    rescale=rescale)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/models/dense_heads/yolact_head.py", line 999, in simple_test
    img_metas[i], rescale)
  File "/home/zjj/anaconda3/envs/image/lib/python3.7/site-packages/mmdet/models/dense_heads/yolact_head.py", line 869, in get_seg_masks
    align_corners=False).squeeze(0) > 0.5
RuntimeError: CUDA out of memory. Tried to allocate 4.69 GiB (GPU 0; 23.70 GiB total capacity; 19.18 GiB already allocated; 276.56 MiB free; 21.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
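
For reference, the last line of the traceback itself suggests setting max_split_size_mb, since reserved memory (21.43 GiB) is much larger than allocated memory (19.18 GiB), which points at allocator fragmentation. Below is a minimal sketch of that hint; the 128 MB split size is only an assumed starting value (not from the original report), and the variable has to be set before CUDA is initialized. It may not be enough on its own.

import os

# Must be set before the first CUDA call; the split size (in MB) is a guess to tune.
os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'max_split_size_mb:128')

# Equivalent from the shell when launching training:
#   PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 \
#       python tools/train.py configs/yolact/yolact_r50_1x8_coco.py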

After I swap the training and validation sets, the error no longer appears:

data = dict(
    samples_per_gpu=8, 
    workers_per_gpu=4,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/val2017.json',
        img_prefix=data_root + 'val2017/',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/train2017.json',
        img_prefix=data_root + 'train2017/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/val2017.json',
        img_prefix=data_root + 'val2017/',
        pipeline=test_pipeline))

Is there a way to solve this issue? Thanks.
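
One workaround I am considering (untested, and only if the installed mmcv version provides it) is clearing the CUDA cache around each epoch with mmcv's EmptyCacheHook, enabled from the config via custom_hooks. The hook name and arguments below are based on my reading of mmcv and may differ across versions:

# Appended to yolact_r50_1x8_coco.py (or a config that inherits from it).
# EmptyCacheHook calls torch.cuda.empty_cache() at the chosen points, which can
# release cached blocks before the validation pass starts.
custom_hooks = [
    dict(type='EmptyCacheHook', before_epoch=True, after_epoch=True)
]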
