
Observations on Training Efficiency with Different Masking Strategies #10

@LuciferZap

Description


Thank you for your great work! I tried to reproduce the first two stages of training and conducted three experiments on A800 GPUs. Below are the details:

Experiment 1: used 75% mask training for the full 227K steps, with a batch size of 2048.

Experiment 2 (the schedule recommended on the GitHub page): used 75% mask training for 179K steps, followed by 0% mask training for 48K steps, with a batch size of 2048 (see the sketch after this list).

Experiment 3: used 0% mask training for 138K steps, with a batch size of 2048.
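For clarity, here is a minimal sketch of the two-stage mask-ratio schedule I used for Experiment 2. The function name and the step counts passed as defaults are just my own illustration, not the repo's actual training script, which may expose this differently:

```python
def mask_ratio_at(step: int,
                  high_mask_steps: int = 179_000,
                  high_ratio: float = 0.75,
                  finetune_ratio: float = 0.0) -> float:
    """Return the token mask ratio for a given training step:
    75% masking for the first 179K steps, then 0% (no masking)
    for the remaining steps (48K in Experiment 2)."""
    return high_ratio if step < high_mask_steps else finetune_ratio


if __name__ == "__main__":
    # Quick check of the schedule around the switch point.
    for step in (0, 100_000, 178_999, 179_000, 226_999):
        print(step, mask_ratio_at(step))
```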

I found that Experiment 3 achieved the highest training efficiency. The computational cost across all three experiments was roughly the same. However, I did not observe any efficiency improvement from using mask training. Based on the final visualization results, the samples from Experiment 3 also appear to be the best.

If I missed any important details, please let me know!

Note on Resources: Due to instability in my resource pool, these experiments were sometimes run on 3 A800 GPUs and other times on a single GPU. However, the total computational cost remained consistent.

Attached below are the detailed visualization results:

[Visualization images attached in the original issue.]
