Open
Description
I'm using SLS to train my own model, but I found that it behaves differently with plain SGD than with SGD + weight decay + momentum.
With plain SGD, the step size increases at first, following an exponential trend, which is consistent with your published work.
However, with SGD + weight decay + momentum, the step size stays very stable (0.02~0.03) most of the time.
Can you explain why? Is SPS incompatible with optimizer momentum and weight decay?