[flex_shard] Add DeepSeek V3 eager training entry point by weifengpy · Pull Request #3603 · pytorch/torchtitan

weifengpy · 2026-06-10T05:46:57Z

Stack from ghstack (oldest at bottom):

Add an experimental flex_shard.deepseek_v3 module that reuses the DeepSeek V3 model config with a FlexShard eager parallelizer. The path slices local experts for EP, shards dense params on the FSDP mesh, shards local expert params on the expert-FSDP mesh, and rejects unsupported PP/TP/CP/HSDP/model-compile modes for now.
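
As a rough illustration of "slicing local experts for EP", the sketch below computes which global expert indices a given EP rank would keep. It assumes a contiguous, even split of experts across EP ranks; the helper name `local_expert_range` and that layout are hypothetical and not taken from this PR.

```python
def local_expert_range(num_experts: int, ep_degree: int, ep_rank: int) -> range:
    """Global expert indices kept by one EP rank (hypothetical helper).

    Assumes experts are split contiguously and evenly across EP ranks.
    """
    if ep_degree <= 0:
        raise ValueError("ep_degree must be positive")
    if not 0 <= ep_rank < ep_degree:
        raise ValueError("ep_rank must be in [0, ep_degree)")
    if num_experts % ep_degree != 0:
        raise ValueError("num_experts must be divisible by ep_degree")
    per_rank = num_experts // ep_degree
    start = ep_rank * per_rank
    return range(start, start + per_rank)


# With 8 experts and expert_parallel_degree=4, each rank keeps 2 experts.
for rank in range(4):
    print(rank, list(local_expert_range(8, 4, rank)))
```

With `expert_parallel_degree=1`, as in the first test-plan command, the single rank keeps every expert and no slicing takes place.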

Make FlexShard work with TorchTitan's meta-module to_empty flow by deferring eager hook installation until bucket storage is materialized, then reinstalling sharded parameter views. Also add FlexShard handling for EP grad norm clipping and DCP state-dict introspection.
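
One way to picture the EP grad norm clipping case: dense and expert parameters live on different meshes, so their gradient norms may be computed per group and then combined into one global norm before the clip coefficient is derived. The sketch below shows only that arithmetic on plain floats. The function names are hypothetical, and whether the PR groups parameters exactly this way is an assumption.

```python
import math
from typing import Iterable


def combined_grad_norm(group_norms: Iterable[float], norm_type: float = 2.0) -> float:
    """Combine per-group p-norms (e.g. dense and expert groups) into one norm."""
    norms = list(group_norms)
    if math.isinf(norm_type):
        # The max-norm of the union is the max of the group max-norms.
        return max(norms, default=0.0)
    return sum(n ** norm_type for n in norms) ** (1.0 / norm_type)


def clip_coefficient(total_norm: float, max_norm: float, eps: float = 1e-6) -> float:
    """Scale factor applied to every gradient; never scales gradients up."""
    return min(1.0, max_norm / (total_norm + eps))


dense_norm, expert_norm = 3.0, 4.0  # illustrative values only
total = combined_grad_norm([dense_norm, expert_norm])
print(total, clip_coefficient(total, max_norm=1.0))
```

The same coefficient is applied to both groups, so the relative scale of dense and expert gradients is preserved.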

Include runtime and state-dict test coverage, plus local design notes and repro/profiling helpers for the ongoing FlexShard training and compile investigations.

Test Plan:

CUDA_VISIBLE_DEVICES=6 NGPU=1 MODULE=deepseek_v3 CONFIG=deepseek_v3_debugmodel_ep ./run_train.sh --parallelism.data_parallel_shard_degree=1 --parallelism.expert_parallel_degree=1 --training.steps=1 --activation_checkpoint.mode=none --dump_folder=outputs/retry_fully_shard_norm_dp1
CUDA_VISIBLE_DEVICES=0,1,6,7 NGPU=4 MODULE=deepseek_v3 CONFIG=deepseek_v3_debugmodel_ep ./run_train.sh --parallelism.data_parallel_shard_degree=4 --parallelism.expert_parallel_degree=4 --training.steps=1 --activation_checkpoint.mode=none --dump_folder=outputs/retry_fully_shard_norm_dp4_ep4

[ghstack-poisoned]

Update

d3ca92b

[ghstack-poisoned]

pytorch-bot Bot added the ciflow/8gpu label Jun 10, 2026

meta-cla Bot added the CLA Signed label Jun 10, 2026

weifengpy wants to merge 1 commit into gh/weifengpy/32/base from gh/weifengpy/32/head
