Large Scale MoE Support
Starting from DeepSeek-V3, state-of-the-art open-source models are all MoE models, with parameter counts ranging from 100B to 1T. We want to refine the existing implementation to automatically and efficiently generate high-performance distributed plans for these models.
Tracer & Parser
- trace a large model in less than 10 minutes
Schedule
- integrate the zero-bubble pipeline schedule
- integrate the dual-pipeline schedule
- implement and test computation and communication overlap
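The overlap item above can be sketched as follows. This is a minimal illustration of the pattern, not nnScaler's implementation: a collective runs concurrently with compute, and the schedule only blocks when the communication result is actually needed. In a real pipeline schedule the collective would be issued on a separate CUDA stream rather than a Python thread; the thread here is just a stand-in for that asynchrony.

```python
import threading

def overlap_compute_and_comm(compute_fn, comm_fn):
    """Run a (simulated) communication call on a background thread
    while compute proceeds, then join before the result is consumed.

    Sketch only: real schedules overlap a collective on a dedicated
    CUDA stream with kernels on the compute stream.
    """
    result = {}

    def comm_worker():
        result["comm"] = comm_fn()

    t = threading.Thread(target=comm_worker)
    t.start()                         # communication starts asynchronously
    result["compute"] = compute_fn()  # compute overlaps with communication
    t.join()                          # block only when the result is needed
    return result["compute"], result["comm"]
```

The key property to test is that neither side observes a partially finished peer: the join point is the only synchronization.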
AutoDist
- support partitioning along multiple dimensions
- support profiling operators that include communication, such as ring attention
- add interleaved pipeline parallelism to the search space
- refine the partition constraint interface: allow constraints keyed by a module's fully qualified torch name
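The last item could look like the sketch below. All names here are hypothetical (the `PartitionConstraints` class and its rule format are not nnScaler API); it only illustrates matching constraints against fully qualified module names, as produced by `torch.nn.Module.named_modules()`, with glob-style patterns.

```python
import fnmatch

class PartitionConstraints:
    """Hypothetical sketch: map fully qualified module names
    (e.g. 'model.layers.3.mlp.experts.0') to partition constraints."""

    def __init__(self):
        self._rules = []  # ordered (fqn_pattern, constraint) pairs

    def add(self, fqn_pattern, constraint):
        # Patterns use fnmatch globs, so '*' spans dotted segments too.
        self._rules.append((fqn_pattern, constraint))

    def lookup(self, fqn):
        # Later, more specific rules override earlier ones;
        # None means the module is unconstrained.
        matched = None
        for pattern, constraint in self._rules:
            if fnmatch.fnmatch(fqn, pattern):
                matched = constraint
        return matched
```

Keying rules by FQN lets users pin down one layer (e.g. a router gate) without touching the rest of the model.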
Codegen
- reduce code generation time when the scale unit is large, e.g., 128 devices
Runtime
- improve checkpoint saving
- support parameters stored in bf16 with gradients accumulated in fp32 in the reducer
- support multiple parameter groups, e.g., for the Muon optimizer
- support dynamic sequence lengths and forbid certain dims from being partitioned
- deduplicate checkpoints for modules that are not parallelized
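The bf16/fp32 reducer item is about numerics: summing many bf16 gradient shards in bf16 silently drops contributions that are small relative to the running sum, while accumulating in a wider type and casting once at the end keeps them. A minimal sketch, simulating bf16 by truncating the float32 mantissa (Python's 64-bit float stands in for the fp32 accumulator):

```python
import struct

def to_bf16(x):
    """Round a float to bfloat16 precision by keeping only the top
    16 bits of its float32 representation (7 mantissa bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def reduce_grads_fp32(bf16_grads):
    """Reduce bf16 gradient shards in a high-precision accumulator.

    Each addend stays representable relative to the accumulator, so
    small gradients are not rounded away; the cast back to bf16
    happens once, after the reduction.
    """
    acc = 0.0
    for g in bf16_grads:
        acc += g            # high-precision accumulation
    return to_bf16(acc)     # single cast at the end
```

For example, adding 256 gradients of ~1e-3 to a running sum of 1.0 leaves a pure-bf16 reduction stuck at 1.0 (each addend falls below bf16's ~2^-8 relative precision), while the fp32-accumulated result correctly lands near 1.25.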
User experience
- add examples for hooks, e.g., logging router logits in MoE layers
- integrate nnScaler into RL training frameworks such as veRL
- bump the transformers version in the `example` folder
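The hook example in mind is the forward-hook pattern: an observer registered on the router module records its logits without modifying the forward pass. In PyTorch this would be `module.register_forward_hook`; the sketch below mimics that pattern in plain Python with a toy router (both the `MoERouter` class and its top-1 routing are hypothetical, for illustration only).

```python
class MoERouter:
    """Toy stand-in for an MoE router module (hypothetical)."""

    def __init__(self):
        self._hooks = []

    def register_forward_hook(self, fn):
        # Mirrors torch's forward-hook pattern: fn(module, output).
        self._hooks.append(fn)

    def forward(self, scores):
        logits = scores            # the "router logits" to be observed
        for hook in self._hooks:
            hook(self, logits)     # hooks observe; they do not modify
        # Top-1 routing: return the index of the highest-scoring expert.
        return max(range(len(logits)), key=logits.__getitem__)

# Usage: collect router logits for later analysis of expert load balance.
logged = []
router = MoERouter()
router.register_forward_hook(lambda module, logits: logged.append(logits))
chosen_expert = router.forward([0.1, 0.7, 0.2])
```

Keeping the logging in a hook, rather than inside the router itself, means the example generalizes to any module nnScaler has parallelized.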