Distributed training lowers performance #10

Hello, first of all thanks for sharing your codebase!
We've been testing it for a while and it's working well for us.
Unfortunately, we've noticed that turning on distributed training degrades performance significantly on our setup.
Running fully supervised on the S3DIS dataset with spvcnn as the model, we get ~62% validation mIoU.
With the same hyperparameters and distributed_training on 4 GPUs, training is much faster, but we only get ~50%.
After tweaking some hyperparameters and increasing the number of training epochs, the best we got was ~56% (with batch size 2 and lr 0.005).

Now we're wondering whether you used distributed training and noticed similar performance drops.
Or are there maybe some other parameters that need to be adjusted when using distributed training?
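
One thing that may matter here: in data-parallel training, the effective batch size usually grows with the number of GPUs, and the learning rate is often rescaled to match. The sketch below shows only that arithmetic. It assumes the configured batch size is per GPU (we have not verified how this codebase handles it) and uses the common linear scaling heuristic, which is not guaranteed to help. The function names and example values are hypothetical and do not come from the repository.

```python
def effective_batch_size(per_gpu_batch_size: int, num_gpus: int) -> int:
    """Samples per optimizer step, ASSUMING each GPU processes its own
    batch of `per_gpu_batch_size` and gradients are averaged across GPUs."""
    return per_gpu_batch_size * num_gpus


def linearly_scaled_lr(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Linear scaling heuristic: change the learning rate by the same
    factor as the effective batch size. A rule of thumb, not a guarantee."""
    return base_lr * new_batch_size / base_batch_size


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    single_gpu_batch = 2
    single_gpu_lr = 0.01
    num_gpus = 4

    batch = effective_batch_size(single_gpu_batch, num_gpus)
    lr = linearly_scaled_lr(single_gpu_lr, single_gpu_batch, batch)
    print(f"effective batch size: {batch}, scaled lr: {lr}")
```

If the batch size is instead split across GPUs, the effective batch size stays the same and this rescaling would not apply.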

Thanks in advance!
