[model, ci] feat: migrate deepseek_v3 to transformers v5 #661
Code Review
This pull request adds comprehensive support for DeepseekV3 models on transformers library versions 5.2.0 and above. Key changes include a new runtime checkpoint tensor converter that automatically converts HuggingFace's per-expert checkpoint format to the fused v5 format, removing the need for offline merging and streamlining loading. The DeepseekV3 modeling code has been refactored to dynamically apply GPU- or NPU-specific patches and optimized kernels (such as Liger kernels or batch-invariant alternatives) based on the execution environment and transformers version. These patches enhance the fused Mixture-of-Experts (MoE) implementation, ensure numerical parity for router autocast behavior, and enable a fused cross-entropy path in the causal language model. Corresponding e2e and model-patching tests validate the DeepseekV3 integration. No specific feedback was provided in the review comments.
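For context, the core idea behind a runtime checkpoint converter of this kind is to rewrite per-expert tensors into batched tensors at load time rather than requiring an offline merge step. The following is a minimal, hypothetical sketch of such a rewrite; the key pattern, the fused layout (for example, whether gate and up projections are additionally concatenated), and the function name are illustrative assumptions, not the converter shipped in this PR.

import re
import torch

def fuse_expert_weights(state_dict, num_experts):
    """Stack per-expert projection weights into one batched tensor per layer.

    Assumes HF-style keys like
    'model.layers.{i}.mlp.experts.{e}.gate_proj.weight' (hypothetical).
    """
    fused, buckets = {}, {}
    expert_re = re.compile(
        r"(.*\.mlp\.experts)\.(\d+)\.(gate_proj|up_proj|down_proj)\.weight$"
    )
    for key, tensor in state_dict.items():
        match = expert_re.match(key)
        if match is None:
            fused[key] = tensor  # non-expert tensors pass through unchanged
            continue
        prefix, expert_idx, proj = match.group(1), int(match.group(2)), match.group(3)
        buckets.setdefault((prefix, proj), {})[expert_idx] = tensor
    for (prefix, proj), per_expert in buckets.items():
        # Stack experts 0..N-1 along a new leading dim:
        # (num_experts, out_features, in_features)
        stacked = torch.stack([per_expert[e] for e in range(num_experts)], dim=0)
        fused[f"{prefix}.{proj}.weight"] = stacked
    return fused

Doing this at load time keeps published HF checkpoints usable as-is while the in-memory model only ever sees the fused layout.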
What does this PR do?
Checklist Before Starting
PR title follows the [{modules}] {type}: {description} format.
Test
Ran against transformers==5.2.0 on an 8×GPU box.
- python -m veomni.patchgen.check_patchgen — all generated files up to date.
- make quality — clean.
- pytest tests/models/test_models_patch.py -k deepseek_v3 — PASSED (1 passed in 20.58s). Validates HF↔VeOmni fwd/bwd parity for the patched v5 model (sketched below).
- pytest tests/e2e/test_e2e_parallel.py -k deepseek_v3_v5 — PASSED (1 passed in 168.89s). Compares 4 parallel configs under SP/EP.
- pytest tests/distributed/test_fsdp_equivalence.py -k deepseek_v3_v5 — PASSED (1 passed in 84.21s). Single-GPU (no FSDP) vs 2-GPU FSDP2.
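To illustrate what the parity test checks, here is a simplified sketch of a forward/backward equivalence assertion between a reference model and a patched one. The function name, model-calling convention, and tolerances are illustrative assumptions, not the actual code in tests/models/test_models_patch.py.

import torch

def assert_fwd_bwd_parity(model_ref, model_patched, input_ids, atol=1e-4, rtol=1e-4):
    """Compare logits and parameter gradients between two implementations."""
    for model in (model_ref, model_patched):
        model.zero_grad(set_to_none=True)

    # Forward parity: both causal LMs compute loss and logits on the same batch.
    out_ref = model_ref(input_ids=input_ids, labels=input_ids)
    out_patched = model_patched(input_ids=input_ids, labels=input_ids)
    torch.testing.assert_close(out_patched.logits, out_ref.logits, atol=atol, rtol=rtol)

    # Backward parity: gradients of matching parameters should agree within tolerance.
    out_ref.loss.backward()
    out_patched.loss.backward()
    for (name, p_ref), (_, p_patched) in zip(
        model_ref.named_parameters(), model_patched.named_parameters()
    ):
        if p_ref.grad is not None and p_patched.grad is not None:
            torch.testing.assert_close(
                p_patched.grad, p_ref.grad, atol=atol, rtol=rtol,
                msg=f"gradient mismatch on {name}",
            )

The parallel and FSDP tests listed above apply the same kind of comparison across parallelism configurations rather than across implementations.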
API and Usage Example
Design & Code Changes
Checklist Before Submitting