Hi, I'm trying to reproduce the classification training results.
I tried on two different machines: machine A with a single RTX 3090 and machine B with four A100 GPUs.
Training on machine A with the single GPU is fine; see the green line (default parameters).
But on machine B with 4 GPUs it does not train properly and is very erratic; see the gray, yellow, and teal lines (with default and custom parameters).
The purple line is DeiT training on the same machine B (default parameters).
All experiments were run with --batch-size=128 (128 samples per GPU).
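One detail I noticed while debugging: the effective global batch size differs between the two setups (128 on machine A vs 4 × 128 = 512 on machine B). If main.py follows the DeiT-style recipe and scales the learning rate linearly with the global batch size, the two runs would end up with different effective learning rates. A rough sketch of what I assume that scaling looks like (variable names and the base LR are mine, not necessarily the repo's):

base_lr = 5e-4                                  # assumed --lr default, as in DeiT-style recipes
batch_per_gpu = 128
world_size = 4                                  # 1 on machine A, 4 on machine B
scaled_lr = base_lr * batch_per_gpu * world_size / 512.0
# machine A: 5e-4 * 128 * 1 / 512 = 1.25e-4
# machine B: 5e-4 * 128 * 4 / 512 = 5.0e-4

So if that scaling is in place, the 4-GPU run is effectively training with a 4× larger learning rate than the single-GPU run.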
The plot shows validation loss; the other metrics tell the same story, and some look even worse.

An example of the command I used:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py \
--model xcit_small_12_p16 --batch-size 128 --drop-path 0.05 --epochs 400
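(Side note, unrelated to the problem itself: torch.distributed.launch is deprecated in recent PyTorch releases. If it helps anyone reproduce, I believe the torchrun equivalent would be the following; torchrun passes the rank via environment variables itself, so --use_env is dropped:)

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 main.py \
--model xcit_small_12_p16 --batch-size 128 --drop-path 0.05 --epochs 400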
Has anyone seen this or know how to fix it? Many thanks.