[flex_shard] Add DeepSeek V3 eager training entry point#3603

Draft
weifengpy wants to merge 1 commit into gh/weifengpy/32/base from gh/weifengpy/32/head

weifengpy commented Jun 10, 2026

Contributor

Stack from ghstack (oldest at bottom):

Add an experimental flex_shard.deepseek_v3 module that reuses the DeepSeek V3 model config with a FlexShard eager parallelizer. The path slices local experts for EP, shards dense params on the FSDP mesh, shards local expert params on the expert-FSDP mesh, and rejects unsupported PP/TP/CP/HSDP/model-compile modes for now.
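
The mode rejection and expert slicing described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ParallelDims`, its field names, `validate`, and `local_expert_range` are hypothetical, and the contiguous even split of experts across EP ranks is an assumed layout.

```python
from dataclasses import dataclass


@dataclass
class ParallelDims:
    """Hypothetical stand-in for the parallelism config (not TorchTitan's class)."""
    dp_shard: int = 1
    dp_replicate: int = 1  # a value > 1 is treated as HSDP in this sketch
    ep: int = 1
    pp: int = 1
    tp: int = 1
    cp: int = 1
    compile_model: bool = False


def validate(dims: ParallelDims) -> None:
    """Raise for the modes the description lists as unsupported."""
    unsupported = {
        "PP": dims.pp > 1,
        "TP": dims.tp > 1,
        "CP": dims.cp > 1,
        "HSDP": dims.dp_replicate > 1,
        "model compile": dims.compile_model,
    }
    enabled = [name for name, on in unsupported.items() if on]
    if enabled:
        raise NotImplementedError(f"unsupported on this path: {', '.join(enabled)}")


def local_expert_range(num_experts: int, ep_degree: int, ep_rank: int) -> range:
    """Indices of the experts kept on one EP rank (assumed contiguous, even split)."""
    if num_experts % ep_degree != 0:
        raise ValueError("num_experts must be divisible by ep_degree")
    per_rank = num_experts // ep_degree
    return range(ep_rank * per_rank, (ep_rank + 1) * per_rank)
```

With 8 experts and an EP degree of 4, rank 1 in this sketch would keep experts 2 and 3; the remaining expert parameters on that rank are what the expert-FSDP mesh would then shard.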

Make FlexShard work with TorchTitan's meta-module to_empty flow by deferring eager hook installation until bucket storage is materialized, then reinstalling sharded parameter views. Also add FlexShard handling for EP grad norm clipping and DCP state-dict introspection.
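
For context on the EP grad norm clipping mentioned above: dense and expert parameters are sharded on different meshes, so a global L2 norm has to combine sums of squares reduced over each mesh separately. The sketch below shows only that arithmetic, under stated assumptions: the function names are hypothetical, the cross-rank all-reduce is simulated with plain Python sums, and the numbers are made up for illustration. It is not the handling added in this PR.

```python
import math


def combined_l2_norm(group_sq_sums):
    """Combine per-group sums of squared gradient entries into one global L2 norm."""
    return math.sqrt(sum(group_sq_sums))


def clip_coefficient(total_norm, max_norm, eps=1e-6):
    """Scale factor applied to every gradient; capped at 1 so it never scales up."""
    return min(1.0, max_norm / (total_norm + eps))


# Each list simulates the per-rank local sums of squares that a real run
# would all-reduce over the corresponding mesh.
dense_sq = sum([9.0, 0.0])    # dense params, reduced over the FSDP mesh
expert_sq = sum([8.0, 8.0])   # expert params, reduced over the expert meshes
total = combined_l2_norm([dense_sq, expert_sq])
coef = clip_coefficient(total, max_norm=1.0)
```

Every rank then multiplies its local gradient shards by the same `coef`, which keeps the clipped gradients consistent across both meshes.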

Include runtime/state-dict coverage plus local design notes and repro/profiling helpers for the ongoing FlexShard training and compile investigations.

Test Plan:

  • CUDA_VISIBLE_DEVICES=6 NGPU=1 MODULE=deepseek_v3 CONFIG=deepseek_v3_debugmodel_ep ./run_train.sh --parallelism.data_parallel_shard_degree=1 --parallelism.expert_parallel_degree=1 --training.steps=1 --activation_checkpoint.mode=none --dump_folder=outputs/retry_fully_shard_norm_dp1

  • CUDA_VISIBLE_DEVICES=0,1,6,7 NGPU=4 MODULE=deepseek_v3 CONFIG=deepseek_v3_debugmodel_ep ./run_train.sh --parallelism.data_parallel_shard_degree=4 --parallelism.expert_parallel_degree=4 --training.steps=1 --activation_checkpoint.mode=none --dump_folder=outputs/retry_fully_shard_norm_dp4_ep4

[ghstack-poisoned]

Labels

ciflow/8gpu, CLA Signed