Description
Thank you for your great work! I tried to reproduce the first two stages of training and conducted three experiments on A800 GPUs. Below are the details:
Experiment 1: 75% mask training for all 227K steps, with a batch size of 2048.
Experiment 2 (recommended on the GitHub page): 75% mask training for 179K steps, followed by 0% mask training for 48K steps, with a batch size of 2048.
Experiment 3: Used 0% mask training for 138K steps, with a batch size of 2048.
I found that Experiment 3 achieved the highest training efficiency. The computational cost across all three experiments was roughly the same, yet I did not observe any efficiency improvement from mask training. Judging from the final visualization results, the samples from Experiment 3 also appear to be the best.
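For reference, here is a minimal sketch that just restates the step counts above as total images seen (steps × batch size); it is not a measurement. The per-step cost under 75% masking is lower, which I assume is why the overall compute came out roughly equal despite the different step counts.

```python
# Back-of-the-envelope comparison of the three runs, using the numbers reported above.
batch_size = 2048

experiments = {
    "Exp 1 (75% mask, 227K steps)": 227_000,
    "Exp 2 (75% mask 179K + 0% mask 48K steps)": 179_000 + 48_000,
    "Exp 3 (0% mask, 138K steps)": 138_000,
}

for name, steps in experiments.items():
    images_seen = steps * batch_size
    print(f"{name}: {images_seen / 1e6:.1f}M images seen")
```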
If I missed any important details, please let me know!
Note on Resources: Due to instability in my resource pool, these experiments were sometimes run on 3 A800 GPUs and other times on a single GPU. However, the total computational cost remained consistent.
Attached below are the detailed visualization results:
