CUDA out of memory #26

@dingjietao

When I run training, I get "RuntimeError: CUDA out of memory." and I don't know what to modify.
GPU 0: 12 GB; GPU 1: 12 GB
My command is "bash experiments/scripts/train_faster_rcnn.sh 0 pascal_voc vgg16" or
"bash experiments/scripts/train_faster_rcnn.sh 0,1 pascal_voc vgg16". Both commands fail with "RuntimeError: CUDA out of memory."
I also modified vgg16.yml: TRAIN.BATCH_SIZE: 256 --> 2
Running log:

  + set -e
  + export PYTHONUNBUFFERED=True
  + PYTHONUNBUFFERED=True
  + GPU_ID=0
  + DATASET=pascal_voc
  + NET=vgg16
  + array=($@)
  + len=3
  + EXTRA_ARGS=
  + EXTRA_ARGS_SLUG=
  + case ${DATASET} in
  + TRAIN_IMDB=voc_2007_trainval
  + TEST_IMDB=voc_2007_test
  + STEPSIZE='[50000]'
  + ITERS=100000
  + ANCHORS='[8,16,32]'
  + RATIOS='[0.5,1,2]'
    ++ date +%Y-%m-%d_%H-%M-%S
  + LOG=experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2021-09-11_15-11-21
  + exec
    ++ tee -a experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2021-09-11_15-11-21
    tee: experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2021-09-11_15-11-21: No such file or directory
  + echo Logging output to experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2021-09-11_15-11-21
    Logging output to experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2021-09-11_15-11-21
  + set +x
  + '[' '!' -f output/vgg16/voc_2007_trainval/default/vgg16_MELM_iter_100000.pth.index ']'
  + [[ ! -z '' ]]
  + CUDA_VISIBLE_DEVICES=0
  + python ./tools/trainval_net.py --weight data/imagenet_weights/vgg16.pth --imdb voc_2007_trainval --imdbval voc_2007_test --iters 100000 --cfg experiments/cfgs/vgg16.yml --net vgg16 --set ANCHOR_SCALES '[8,16,32]' ANCHOR_RATIOS '[0.5,1,2]' TRAIN.STEPSIZE '[50000]'
    Called with args:
    Namespace(cfg_file='experiments/cfgs/vgg16.yml', imdb_name='voc_2007_trainval', imdbval_name='voc_2007_test', max_iters=100000, net='vgg16', set_cfgs=['ANCHOR_SCALES', '[8,16,32]', 'ANCHOR_RATIOS', '[0.5,1,2]', 'TRAIN.STEPSIZE', '[50000]'], tag=None, weight='data/imagenet_weights/vgg16.pth')
    /media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/tools/../lib/model/config.py:369: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
    yaml_cfg = edict(yaml.load(f))
    Loaded dataset voc_2007_trainval for training
    Set proposal method: selective_search
    Appending horizontally-flipped training examples...
    voc_2007_trainval ss roidb loaded from /media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/data/cache/voc_2007_trainval_selective_search_roidb.pkl
    done
    Preparing training data...
    done
    10022 roidb entries
    Output will be saved to /media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/output/vgg16_MELM/voc_2007_trainval/default
    TensorFlow summaries will be saved to /media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/tensorboard/vgg16_MELM/voc_2007_trainval/default
    Loaded dataset voc_2007_test for training
    Set proposal method: selective_search
    Preparing training data...
    voc_2007_test ss roidb loaded from /media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/data/cache/voc_2007_test_selective_search_roidb.pkl
    done
    4952 validation roidb entries
    Filtered 0 roidb entries: 10022 -> 10022
    Filtered 0 roidb entries: 4952 -> 4952
    Solving...
    Loading initial model weights from data/imagenet_weights/vgg16.pth
    Loaded.
    Traceback (most recent call last):
    File "./tools/trainval_net.py", line 135, in <module>
    max_iters=args.max_iters)
    File "/media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/tools/../lib/model/train_val.py", line 377, in train_net
    sw.train_model(max_iters)
    File "/media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/tools/../lib/model/train_val.py", line 291, in train_model
    cls_det_loss, refine_loss_1, refine_loss_2, consistency_loss, total_loss = self.net.train_step(blobs,self.optimizer,iter)
    File "/media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/tools/../lib/nets/network.py", line 634, in train_step
    self.forward(blobs['data'], blobs['image_level_labels'], blobs['im_info'], blobs['gt_boxes'], blobs['ss_boxes'], step)
    File "/media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/tools/../lib/nets/network.py", line 562, in forward
    roi_labels_1, keep_inds_1, roi_labels_2, keep_inds_2, bbox_pred, rois = self._predict_train(ss_boxes_all, step)
    File "/media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/tools/../lib/nets/network.py", line 508, in _predict_train
    roi_labels_2, keep_inds_2, bbox_pred = self._region_classification_train(pool5_roi, fc7_roi,fc7_context, fc7_frame, step)
    File "/media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/tools/../lib/nets/network.py", line 398, in _region_classification_train
    mask_1 = self._inverted_attention(bbox_feats_new, gt, keep_inds_1_new, 1, step, fg_num_1_new, bg_num_1_new)
    File "/media/omnisky/28db8425-dc36-4700-92ef-0dd7e98ccd67/djt/CASD/tools/../lib/nets/network.py", line 147, in _inverted_attention
    pooled_feat_before_after = torch.cat((bbox_feats_new, bbox_feats_new * mask_all), dim=0)
    RuntimeError: CUDA out of memory. Tried to allocate 766.00 MiB (GPU 0; 11.91 GiB total capacity; 9.59 GiB already allocated; 99.19 MiB free; 1.49 GiB cached)
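
To make the numbers in that last line easier to compare, here is a minimal sketch that parses the message and converts every figure to MiB. It uses only the Python standard library; `parse_oom` and `shortfall_mib` are hypothetical helper names made up for this illustration, not part of PyTorch or of this repository, and the regular expressions assume the message format shown in the log above.

```python
import re

# The final line of the traceback above, copied verbatim.
OOM_LINE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 766.00 MiB "
    "(GPU 0; 11.91 GiB total capacity; 9.59 GiB already allocated; "
    "99.19 MiB free; 1.49 GiB cached)"
)

# Conversion factors to MiB for the two units that appear in the message.
_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom(line):
    """Return the memory figures of the OOM message as a dict, all in MiB.

    Hypothetical helper: assumes the message layout shown in the log above.
    """
    requested = re.search(r"Tried to allocate ([\d.]+) ([MG]iB)", line)
    if requested is None:
        raise ValueError("not a recognised CUDA out-of-memory message")
    figures = {"requested": float(requested.group(1)) * _TO_MIB[requested.group(2)]}
    # Each remaining figure is followed by the label that names it.
    labelled = re.findall(
        r"([\d.]+) ([MG]iB) (total capacity|already allocated|free|cached)", line
    )
    for value, unit, label in labelled:
        figures[label] = float(value) * _TO_MIB[unit]
    return figures


def shortfall_mib(figures):
    """How far the requested allocation exceeds the reported free memory."""
    return figures["requested"] - figures["free"]


if __name__ == "__main__":
    figures = parse_oom(OOM_LINE)
    for label, mib in figures.items():
        print(f"{label:>18}: {mib:10.2f} MiB")
    print(f"{'shortfall':>18}: {shortfall_mib(figures):10.2f} MiB")
```

Read this way, the single allocation requested by the `torch.cat` call in `_inverted_attention` (766.00 MiB) exceeds the reported free memory (99.19 MiB) by 666.81 MiB, while 9.59 GiB of the 11.91 GiB card is already allocated by the time that line runs.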

I would appreciate it if you could help me.
