
[model, ci] feat: migrate deepseek_v3 to transformers v5 #661

Draft
TimYangst wants to merge 6 commits into main from tingyang/chrone/modeling_v5_qwen

Conversation

Collaborator

@TimYangst commented Apr 15, 2026

What does this PR do?

Migrates deepseek_v3 from the v4 runtime monkey-patch path to the
transformers v5 patchgen + self-contained generated modeling path
(Pattern B: v4↔v5 coexist, MoE, GPU + NPU). The v5 gate is transformers 5.2.0.
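
For reference, a minimal sketch of how such a Pattern B gate can be wired in
__init__.py. The NPU check, the function name, and the upstream fallback import
are illustrative assumptions, not the exact code in this PR:

    # Sketch only: module layout and device detection are assumptions.
    import importlib.util
    from importlib.metadata import version as dist_version

    from packaging.version import Version

    V5_GATE = Version("5.2.0")


    def load_deepseek_v3_model_cls():
        """Pick the modeling path based on the installed transformers version."""
        if Version(dist_version("transformers")) >= V5_GATE:
            # v5 path: self-contained generated modeling file, with GPU/NPU branching.
            if importlib.util.find_spec("torch_npu") is not None:
                from .patched_modeling_deepseek_v3_npu import DeepseekV3ForCausalLM
            else:
                from .patched_modeling_deepseek_v3_gpu import DeepseekV3ForCausalLM
            return DeepseekV3ForCausalLM
        # Below the gate, keep the v4 behavior: import the upstream class and
        # re-apply the existing runtime monkey-patches (omitted here).
        from transformers.models.deepseek_v3.modeling_deepseek_v3 import DeepseekV3ForCausalLM
        return DeepseekV3ForCausalLM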

Checklist Before Starting

  • PR title follows [{modules}] {type}: {description} format

Test

Ran against transformers==5.2.0 on an 8×GPU box.

  • python -m veomni.patchgen.check_patchgen — all generated files up to date.

  • make quality — clean.

  • pytest tests/models/test_models_patch.py -k deepseek_v3: PASSED (1 passed in 20.58s).
    Validates HF↔VeOmni fwd/bwd parity for the patched v5 model (see the sketch after this list).

  • pytest tests/e2e/test_e2e_parallel.py -k deepseek_v3_v5: PASSED
    (1 passed in 168.89s). Compares 4 parallel configs under SP/EP.

    grad_norm
      run 1: 1.37599146, 1.25248539
      run 2: 1.37604284, 1.25065529
      run 3: 1.37589073, 1.25190842
      run 4: 1.37582600, 1.25116014
    loss
      run 1: 12.17874718, 11.68118858
      run 2: 12.17911124, 11.68350935
      run 3: 12.17874718, 11.68191600
      run 4: 12.17911100, 11.68174267
    
  • pytest tests/distributed/test_fsdp_equivalence.py -k deepseek_v3_v5
    PASSED (1 passed in 84.21s). Single-GPU (no FSDP) vs 2-GPU FSDP2.

    grad_norm
      single_gpu: 1.34375000, 1.25781250
      fsdp2_2gpu: 1.34375000, 1.25781250        # exact match
    loss
      single_gpu: 12.16680050, 11.83021641
      fsdp2_2gpu: 12.18102837, 11.84898996      # diff expected, see test docstring
    
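
A minimal sketch of the kind of forward/backward parity check the patch test
performs; the model setup, the compared tensors, and the tolerances here are
illustrative assumptions rather than the test's exact values:

    import torch


    def assert_fwd_bwd_parity(hf_model, veomni_model, input_ids, atol=1e-5, rtol=1e-5):
        # Forward parity: both models should produce the same loss on the same batch.
        hf_loss = hf_model(input_ids=input_ids, labels=input_ids).loss
        ve_loss = veomni_model(input_ids=input_ids, labels=input_ids).loss
        torch.testing.assert_close(ve_loss, hf_loss, atol=atol, rtol=rtol)

        # Backward parity: compare gradients on a parameter shared by both
        # layouts, e.g. the token embedding table.
        hf_loss.backward()
        ve_loss.backward()
        torch.testing.assert_close(
            veomni_model.get_input_embeddings().weight.grad,
            hf_model.get_input_embeddings().weight.grad,
            atol=atol,
            rtol=rtol,
        )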

API and Usage Example

N/A — internal model registration; public API unchanged.

Design & Code Changes

  • Added deepseek_v3_gpu_patch_gen_config.py and
    deepseek_v3_npu_patch_gen_config.py with patches for
    DeepseekV3NaiveMoe (fused gate_up_proj/down_proj layout, drops
    upstream @use_experts_implementation, eager + fused branches),
    DeepseekV3TopkRouter.forward (autocast-disabled fp32 router for
    actor/rollout parity), DeepseekV3ForCausalLM.forward (fused-CE path)
    and get_parallel_plan.
  • Liger kernels and deterministic Triton RoPE + batch-invariant RMSNorm
    are kept out of the generated file and re-applied at runtime in
    __init__.py based on VEOMNI_USE_LIGER_KERNEL, mirroring v4 behavior.
    LigerSwiGLUMLP is intentionally skipped — shared_experts passes an
    intermediate_size kwarg that Liger's MLP does not accept.
  • Added checkpoint_tensor_converter.py for HF per-expert →
    v5 fused runtime conversion (a simplified sketch follows this list);
    registered as a staticmethod on all v5 model classes.
  • Wired __init__.py with Pattern B: v5 gate at 5.2.0 with GPU/NPU
    branching; falls back to the v4 monkey-patch path below that.
  • Generated patched_modeling_deepseek_v3_{gpu,npu}.py committed.
  • Added deepseek_v3 entry to _TEST_CASES_TRANSFORMERS_V5 and a
    deepseek_v3_v5 param to test_e2e_parallel.py.
  • Updated sync_weight_deepseek_v3 to skip per-expert stacking when HF
    state dict already uses the v5 fused layout.
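
A hedged sketch of the per-expert → fused conversion that
checkpoint_tensor_converter.py performs; the state-dict key pattern, stacking
order, and concatenation axis are assumptions made for illustration, not the
converter's exact implementation:

    import re
    from collections import defaultdict

    import torch


    def fuse_expert_weights(state_dict):
        """Stack HF per-expert gate/up/down projections into fused tensors."""
        expert_re = re.compile(
            r"^(?P<prefix>.*\.mlp\.experts)\.(?P<idx>\d+)"
            r"\.(?P<proj>gate_proj|up_proj|down_proj)\.weight$"
        )
        grouped = defaultdict(dict)  # (prefix, proj) -> {expert index: weight}
        converted = {}

        for key, tensor in state_dict.items():
            m = expert_re.match(key)
            if m is None:
                converted[key] = tensor  # non-expert tensors pass through unchanged
                continue
            grouped[(m["prefix"], m["proj"])][int(m["idx"])] = tensor

        # Stack each expert's [out, in] weight into one [num_experts, out, in] tensor.
        stacked = {
            group: torch.stack([tensors[i] for i in sorted(tensors)], dim=0)
            for group, tensors in grouped.items()
        }
        for (prefix, proj), tensor in stacked.items():
            if proj == "down_proj":
                converted[f"{prefix}.down_proj"] = tensor
            elif proj == "gate_proj":
                # Concatenate gate and up along the output dim to form gate_up_proj.
                up = stacked[(prefix, "up_proj")]
                converted[f"{prefix}.gate_up_proj"] = torch.cat([tensor, up], dim=1)
        return converted

Registering the conversion as a staticmethod on the v5 model classes (as this
PR does) lets it run during checkpoint loading, so no offline merge step is
needed.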

Checklist Before Submitting

  • Read the Contribute Guide
  • Applied pre-commit checks
  • Added/updated documentation
  • Added tests to CI workflow (or explained why not feasible)

@github-actions bot added the ci and hf_v5 (Related for transformers v5) labels Apr 15, 2026
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces comprehensive support for DeepseekV3 models, specifically targeting transformers library versions 5.2.0 and above. Key changes include a new runtime checkpoint tensor converter that automatically handles the conversion of HuggingFace's per-expert checkpoint format to a fused v5 format, thereby streamlining the loading process by removing the need for offline merging. The DeepseekV3 modeling code has been refactored to dynamically apply GPU or NPU-specific patches and optimized kernels (such as Liger kernels or batch-invariant alternatives) based on the execution environment and transformers version. These patches enhance the fused Mixture-of-Experts (MoE) implementation, ensure numerical parity for router autocast behavior, and enable a fused cross-entropy path in the causal language model. Corresponding e2e and model patching tests have been added to validate the DeepseekV3 integration. No specific feedback was provided in the review comments.

@TimYangst marked this pull request as draft April 17, 2026 04:27
