Large-scale 4D-parallelism pre-training of 🤗 transformers Mixture-of-Experts models *(still a work in progress)*
Updated Dec 14, 2023 - Python
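As a rough orientation for the repository above, "4D parallelism" in the Mixture-of-Experts setting usually refers to combining data, tensor, pipeline, and expert parallelism. The sketch below is a minimal, hypothetical illustration of how a fixed GPU count might be factored into those four process-group dimensions; the function name and sizes are assumptions, not taken from the repository.

```python
# Hypothetical sketch: factor a pool of GPUs into the four parallelism
# dimensions commonly combined in MoE pre-training (data, tensor,
# pipeline, expert). Names and numbers are illustrative only.
from itertools import product


def build_4d_grid(world_size: int, dp: int, tp: int, pp: int, ep: int):
    """Map each global rank to a (data, tensor, pipeline, expert) coordinate."""
    assert dp * tp * pp * ep == world_size, "dimensions must multiply to world size"
    grid = {}
    for rank, (d, t, p, e) in enumerate(product(range(dp), range(tp), range(pp), range(ep))):
        grid[rank] = {"dp": d, "tp": t, "pp": p, "ep": e}
    return grid


if __name__ == "__main__":
    # e.g. 16 GPUs split as 2-way data x 2-way tensor x 2-way pipeline x 2-way expert
    for rank, coord in build_4d_grid(16, dp=2, tp=2, pp=2, ep=2).items():
        print(rank, coord)
```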
GPT-2 (124M) fixed-work multi-GPU training benchmark on a Slurm cluster (V100 GPUs) using DeepSpeed ZeRO-1 + AMP. Measures 1→4 GPU scaling (3.42× throughput speedup) with reproducible run artifacts (configs, metrics JSON, and commit IDs).
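For readers unfamiliar with the setup named above, the following is a minimal sketch of a DeepSpeed configuration that enables ZeRO stage 1 with fp16 mixed precision (AMP-style). The batch sizes, learning rate, and stand-in model are placeholders, not the values used in the benchmark.

```python
# Hypothetical sketch of a DeepSpeed ZeRO stage-1 + fp16 (AMP) setup.
# All numbers below are placeholders, not the benchmark's actual values.
import deepspeed
import torch

ds_config = {
    "train_micro_batch_size_per_gpu": 8,       # placeholder
    "gradient_accumulation_steps": 1,          # placeholder
    "zero_optimization": {"stage": 1},         # ZeRO-1: shard optimizer states only
    "fp16": {"enabled": True},                 # mixed-precision training
    "optimizer": {"type": "AdamW", "params": {"lr": 6e-4}},  # placeholder LR
}

model = torch.nn.Linear(768, 768)  # stand-in for a GPT-2 (124M) model

# deepspeed.initialize wraps the model and optimizer according to ds_config;
# launch with the `deepspeed` CLI (or via srun on Slurm) so ranks are set up.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```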