
Observations on Training Efficiency with Different Masking Strategies #10

@LuciferZap

Description


Thank you for your great work! I tried to reproduce the first two stages of training and conducted three experiments on A800 GPUs. Below are the details:

Experiment 1: used 75% mask training for the full 227K steps, with a batch size of 2048.

Experiment 2 (the schedule recommended on the GitHub page): used 75% mask training for 179K steps, followed by 0% mask training for 48K steps, with a batch size of 2048 (see the sketch after this list).

Experiment 3: used 0% mask training for 138K steps, with a batch size of 2048.
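For clarity, here is a minimal sketch of the two-stage mask-ratio schedule I used for Experiment 2. The function name and the step counts passed as defaults are just my own illustration, not the repo's actual training script, which may expose this differently:

```python
def mask_ratio_at(step: int,
                  high_mask_steps: int = 179_000,
                  high_ratio: float = 0.75,
                  finetune_ratio: float = 0.0) -> float:
    """Return the token mask ratio for a given training step:
    75% masking for the first 179K steps, then 0% (no masking)
    for the remaining steps (48K in Experiment 2)."""
    return high_ratio if step < high_mask_steps else finetune_ratio


if __name__ == "__main__":
    # Quick check of the schedule around the switch point.
    for step in (0, 100_000, 178_999, 179_000, 226_999):
        print(step, mask_ratio_at(step))
```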

I found that Experiment 3 achieved the highest training efficiency. The computational cost across all three experiments was roughly the same. However, I did not observe any efficiency improvement from using mask training. Based on the final visualization results, the samples from Experiment 3 also appear to be the best.

If I missed any important details, please let me know!

Note on Resources: Due to instability in my resource pool, these experiments were sometimes run on 3 A800 GPUs and other times on a single GPU. However, the total computational cost remained consistent.

Attached below are the detailed visualization results:

[Visualization images attached in the original issue.]
