
[model, ci] feat: migrate deepseek_v3 to transformers v5 #661

Draft
TimYangst wants to merge 6 commits into main from tingyang/chrone/modeling_v5_qwen

Conversation

Collaborator

@TimYangst commented Apr 15, 2026

What does this PR do?

Migrates deepseek_v3 from the v4 runtime monkey-patch path to the
transformers v5 patchgen + self-contained generated modeling path
(Pattern B: v4↔v5 coexist, MoE, GPU + NPU). The v5 gate is transformers 5.2.0.
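
For reference, a minimal sketch of how such a Pattern B gate can be wired in
__init__.py. The NPU check, the function name, and the upstream fallback import
are illustrative assumptions, not the exact code in this PR:

    # Sketch only: module layout and device detection are assumptions.
    import importlib.util
    from importlib.metadata import version as dist_version

    from packaging.version import Version

    V5_GATE = Version("5.2.0")


    def load_deepseek_v3_model_cls():
        """Pick the modeling path based on the installed transformers version."""
        if Version(dist_version("transformers")) >= V5_GATE:
            # v5 path: self-contained generated modeling file, with GPU/NPU branching.
            if importlib.util.find_spec("torch_npu") is not None:
                from .patched_modeling_deepseek_v3_npu import DeepseekV3ForCausalLM
            else:
                from .patched_modeling_deepseek_v3_gpu import DeepseekV3ForCausalLM
            return DeepseekV3ForCausalLM
        # Below the gate, keep the v4 behavior: import the upstream class and
        # re-apply the existing runtime monkey-patches (omitted here).
        from transformers.models.deepseek_v3.modeling_deepseek_v3 import DeepseekV3ForCausalLM
        return DeepseekV3ForCausalLM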

Checklist Before Starting

  • PR title follows [{modules}] {type}: {description} format

Test

Ran against transformers==5.2.0 on an 8×GPU box.

  • python -m veomni.patchgen.check_patchgen — all generated files up to date.

  • make quality — clean.

  • pytest tests/models/test_models_patch.py -k deepseek_v3: PASSED (1 passed in 20.58s).
    Validates HF↔VeOmni fwd/bwd parity for the patched v5 model (see the sketch after this list).

  • pytest tests/e2e/test_e2e_parallel.py -k deepseek_v3_v5: PASSED
    (1 passed in 168.89s). Compares 4 parallel configs under SP/EP.

    grad_norm
      run 1: 1.37599146, 1.25248539
      run 2: 1.37604284, 1.25065529
      run 3: 1.37589073, 1.25190842
      run 4: 1.37582600, 1.25116014
    loss
      run 1: 12.17874718, 11.68118858
      run 2: 12.17911124, 11.68350935
      run 3: 12.17874718, 11.68191600
      run 4: 12.17911100, 11.68174267
    
  • pytest tests/distributed/test_fsdp_equivalence.py -k deepseek_v3_v5
    PASSED (1 passed in 84.21s). Single-GPU (no FSDP) vs 2-GPU FSDP2.

    grad_norm
      single_gpu: 1.34375000, 1.25781250
      fsdp2_2gpu: 1.34375000, 1.25781250        # exact match
    loss
      single_gpu: 12.16680050, 11.83021641
      fsdp2_2gpu: 12.18102837, 11.84898996      # diff expected, see test docstring
    
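
A minimal sketch of the kind of forward/backward parity check the patch test
performs; the model setup, the compared tensors, and the tolerances here are
illustrative assumptions rather than the test's exact values:

    import torch


    def assert_fwd_bwd_parity(hf_model, veomni_model, input_ids, atol=1e-5, rtol=1e-5):
        # Forward parity: both models should produce the same loss on the same batch.
        hf_loss = hf_model(input_ids=input_ids, labels=input_ids).loss
        ve_loss = veomni_model(input_ids=input_ids, labels=input_ids).loss
        torch.testing.assert_close(ve_loss, hf_loss, atol=atol, rtol=rtol)

        # Backward parity: compare gradients on a parameter shared by both
        # layouts, e.g. the token embedding table.
        hf_loss.backward()
        ve_loss.backward()
        torch.testing.assert_close(
            veomni_model.get_input_embeddings().weight.grad,
            hf_model.get_input_embeddings().weight.grad,
            atol=atol,
            rtol=rtol,
        )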

API and Usage Example

N/A — internal model registration; public API unchanged.

Design & Code Changes

  • Added deepseek_v3_gpu_patch_gen_config.py and
    deepseek_v3_npu_patch_gen_config.py with patches for
    DeepseekV3NaiveMoe (fused gate_up_proj/down_proj layout, drops
    upstream @use_experts_implementation, eager + fused branches),
    DeepseekV3TopkRouter.forward (autocast-disabled fp32 router for
    actor/rollout parity), DeepseekV3ForCausalLM.forward (fused-CE path)
    and get_parallel_plan.
  • Liger kernels and deterministic Triton RoPE + batch-invariant RMSNorm
    are kept out of the generated file and re-applied at runtime in
    __init__.py based on VEOMNI_USE_LIGER_KERNEL, mirroring v4 behavior.
    LigerSwiGLUMLP is intentionally skipped — shared_experts passes an
    intermediate_size kwarg that Liger's MLP does not accept.
  • Added checkpoint_tensor_converter.py for HF per-expert →
    v5 fused runtime conversion (a simplified sketch follows this list);
    registered as a staticmethod on all v5 model classes.
  • Wired __init__.py with Pattern B: v5 gate at 5.2.0 with GPU/NPU
    branching; falls back to the v4 monkey-patch path below that.
  • Generated patched_modeling_deepseek_v3_{gpu,npu}.py committed.
  • Added deepseek_v3 entry to _TEST_CASES_TRANSFORMERS_V5 and a
    deepseek_v3_v5 param to test_e2e_parallel.py.
  • Updated sync_weight_deepseek_v3 to skip per-expert stacking when HF
    state dict already uses the v5 fused layout.
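
A hedged sketch of the per-expert → fused conversion that
checkpoint_tensor_converter.py performs; the state-dict key pattern, stacking
order, and concatenation axis are assumptions made for illustration, not the
converter's exact implementation:

    import re
    from collections import defaultdict

    import torch


    def fuse_expert_weights(state_dict):
        """Stack HF per-expert gate/up/down projections into fused tensors."""
        expert_re = re.compile(
            r"^(?P<prefix>.*\.mlp\.experts)\.(?P<idx>\d+)"
            r"\.(?P<proj>gate_proj|up_proj|down_proj)\.weight$"
        )
        grouped = defaultdict(dict)  # (prefix, proj) -> {expert index: weight}
        converted = {}

        for key, tensor in state_dict.items():
            m = expert_re.match(key)
            if m is None:
                converted[key] = tensor  # non-expert tensors pass through unchanged
                continue
            grouped[(m["prefix"], m["proj"])][int(m["idx"])] = tensor

        # Stack each expert's [out, in] weight into one [num_experts, out, in] tensor.
        stacked = {
            group: torch.stack([tensors[i] for i in sorted(tensors)], dim=0)
            for group, tensors in grouped.items()
        }
        for (prefix, proj), tensor in stacked.items():
            if proj == "down_proj":
                converted[f"{prefix}.down_proj"] = tensor
            elif proj == "gate_proj":
                # Concatenate gate and up along the output dim to form gate_up_proj.
                up = stacked[(prefix, "up_proj")]
                converted[f"{prefix}.gate_up_proj"] = torch.cat([tensor, up], dim=1)
        return converted

Registering the conversion as a staticmethod on the v5 model classes (as this
PR does) lets it run during checkpoint loading, so no offline merge step is
needed.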

Checklist Before Submitting

  • Read the Contribute Guide
  • Applied pre-commit checks
  • Added/updated documentation
  • Added tests to CI workflow (or explained why not feasible)

@github-actions bot added the ci and hf_v5 (Related for transformers v5) labels Apr 15, 2026
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces comprehensive support for DeepseekV3 models, specifically targeting transformers library versions 5.2.0 and above. Key changes include a new runtime checkpoint tensor converter that automatically handles the conversion of HuggingFace's per-expert checkpoint format to a fused v5 format, thereby streamlining the loading process by removing the need for offline merging. The DeepseekV3 modeling code has been refactored to dynamically apply GPU or NPU-specific patches and optimized kernels (such as Liger kernels or batch-invariant alternatives) based on the execution environment and transformers version. These patches enhance the fused Mixture-of-Experts (MoE) implementation, ensure numerical parity for router autocast behavior, and enable a fused cross-entropy path in the causal language model. Corresponding e2e and model patching tests have been added to validate the DeepseekV3 integration. No specific feedback was provided in the review comments.

@TimYangst marked this pull request as draft April 17, 2026 04:27
