The train loss becomes NaN #18

@baiyu858

@mlzxy Hello, your work has been very helpful to me.
Image

python train.py \
  config=./configs/arp_plus.yaml \
  hydra.job.name=arp_plus \
  train.num_gpus=2 \
  train.bs=192

lr: 1.25e-5
warmup_steps: 2000

[step:00015300 time:35910.5s] collision.ce_loss:0.0925 grip.ce_loss:0.0089 lr:0.0037 rot-x.ce_loss:0.0819 rot-y.ce_loss:0.0197 rot-z.ce_loss:1.1917 stage1-screen-pts.2d_ce_loss:3.7933 stage2-screen-pts.2d_ce_loss:4.4807 v1_norm:11860.1357 v2_norm:207099.8750
[step:00015400 time:36153.5s] collision.ce_loss:0.0933 grip.ce_loss:0.0092 lr:0.0037 rot-x.ce_loss:0.0822 rot-y.ce_loss:0.0199 rot-z.ce_loss:1.1998 stage1-screen-pts.2d_ce_loss:3.7927 stage2-screen-pts.2d_ce_loss:4.4872 v1_norm:12486.6768 v2_norm:269669.1562
[step:00015500 time:36393.0s] collision.ce_loss:0.0941 grip.ce_loss:0.0094 lr:0.0037 rot-x.ce_loss:0.0824 rot-y.ce_loss:0.0201 rot-z.ce_loss:1.2047 stage1-screen-pts.2d_ce_loss:3.7919 stage2-screen-pts.2d_ce_loss:4.4921 v1_norm:12509.2148 v2_norm:251419.4219
[step:00015600 time:36631.4s] collision.ce_loss:0.0945 grip.ce_loss:0.0095 lr:0.0037 rot-x.ce_loss:0.0825 rot-y.ce_loss:0.0203 rot-z.ce_loss:1.2086 stage1-screen-pts.2d_ce_loss:3.7913 stage2-screen-pts.2d_ce_loss:4.4963 v1_norm:12924.7100 v2_norm:293097.1250
[step:00015700 time:36869.6s] collision.ce_loss:0.0947 grip.ce_loss:0.0095 lr:0.0036 rot-x.ce_loss:0.0825 rot-y.ce_loss:0.0204 rot-z.ce_loss:1.2089 stage1-screen-pts.2d_ce_loss:3.7906 stage2-screen-pts.2d_ce_loss:4.4983 v1_norm:12613.3252 v2_norm:402923.5000
[step:00015800 time:37108.3s] collision.ce_loss:0.0947 grip.ce_loss:0.0095 lr:0.0036 rot-x.ce_loss:0.0825 rot-y.ce_loss:0.0204 rot-z.ce_loss:1.2095 stage1-screen-pts.2d_ce_loss:3.7901 stage2-screen-pts.2d_ce_loss:nan v1_norm:12359.4805 v2_norm:597317.8125
[step:00015900 time:37343.7s] collision.ce_loss:0.0947 grip.ce_loss:0.0094 lr:0.0036 rot-x.ce_loss:0.0826 rot-y.ce_loss:0.0204 rot-z.ce_loss:1.2100 stage1-screen-pts.2d_ce_loss:3.7901 stage2-screen-pts.2d_ce_loss:nan v1_norm:12621.3320 v2_norm:646560.8125
[step:00016000 time:37579.8s] collision.ce_loss:0.0946 grip.ce_loss:0.0094 lr:0.0036 rot-x.ce_loss:0.0826 rot-y.ce_loss:0.0204 rot-z.ce_loss:1.2103 stage1-screen-pts.2d_ce_loss:3.7903 stage2-screen-pts.2d_ce_loss:nan v1_norm:12569.4219 v2_norm:731155.6250
[step:00016100 time:37816.2s] collision.ce_loss:nan grip.ce_loss:nan lr:0.0036 rot-x.ce_loss:nan rot-y.ce_loss:nan rot-z.ce_loss:nan stage1-screen-pts.2d_ce_loss:nan stage2-screen-pts.2d_ce_loss:nan v1_norm:nan v2_norm:nan
[step:00016200 time:38047.3s] collision.ce_loss:nan grip.ce_loss:nan lr:0.0036 rot-x.ce_loss:nan rot-y.ce_loss:nan rot-z.ce_loss:nan stage1-screen-pts.2d_ce_loss:nan stage2-screen-pts.2d_ce_loss:nan v1_norm:nan v2_norm:nan
[step:00016300 time:38278.7s] collision.ce_loss:nan grip.ce_loss:nan lr:0.0035 rot-x.ce_loss:nan rot-y.ce_loss:nan rot-z.ce_loss:nan stage1-screen-pts.2d_ce_loss:nan stage2-screen-pts.2d_ce_loss:nan v1_norm:nan v2_norm:nan
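
To pinpoint where the divergence starts, log lines in the format above can be scanned for the first step at which each metric stops being finite. The following is only a minimal sketch: the function name `first_nan_steps` and the `sample` list are hypothetical and not part of the repository, and the sample lines are abbreviated copies of the log above with only three metrics kept.

```python
import math
import re

# Matches "name:value" pairs such as "grip.ce_loss:0.0089" or "v2_norm:nan".
PAIR = re.compile(r"(\S+?):(\S+)")
# Matches the step counter at the start of a log line, e.g. "[step:00015800".
STEP = re.compile(r"\[step:(\d+)")


def first_nan_steps(log_lines):
    """Return {metric name: first logged step at which it is NaN or inf}."""
    first = {}
    for line in log_lines:
        match = STEP.search(line)
        if match is None or "]" not in line:
            continue  # not a training log line
        step = int(match.group(1))
        # Only look at the metrics after the "[step:... time:...]" prefix.
        body = line[line.index("]") + 1:]
        for name, raw in PAIR.findall(body):
            try:
                value = float(raw)  # float("nan") parses successfully
            except ValueError:
                continue  # skip anything that is not numeric
            if not math.isfinite(value) and name not in first:
                first[name] = step
    return first


# Abbreviated lines taken from the log above (hypothetical variable name).
sample = [
    "[step:00015700 time:36869.6s] grip.ce_loss:0.0095 "
    "stage2-screen-pts.2d_ce_loss:4.4983 v2_norm:402923.5000",
    "[step:00015800 time:37108.3s] grip.ce_loss:0.0095 "
    "stage2-screen-pts.2d_ce_loss:nan v2_norm:597317.8125",
    "[step:00016100 time:37816.2s] grip.ce_loss:nan "
    "stage2-screen-pts.2d_ce_loss:nan v2_norm:nan",
]

result = first_nan_steps(sample)
for name, step in sorted(result.items(), key=lambda kv: kv[1]):
    print(f"{name}: first non-finite value at step {step}")
```

Since values are only logged every 100 steps, this gives the first logged step with a non-finite value, not necessarily the exact step at which the divergence began.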

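During training, the losses become NaN, as shown in the figure and log above: stage2-screen-pts.2d_ce_loss turns NaN at step 15800 while v2_norm grows from about 2e5 to over 7e5, and every loss and norm is NaN from step 16100 onward. What causes this, and how can it be solved? 🤔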
Thank you very much.
