Add Qwen3.5 MoE hybrid decoder model (back-port from upstream PR #2545) #1

Open
akeshet wants to merge 2 commits into v0.2.0-with-positions from add-qwen3.5-moe

akeshet commented Apr 1, 2026

Owner

Back-ports the Qwen3.5 MoE model (GatedDeltaNet + full attention + MoE) from pytorch#2545 to v0.2.0 APIs. Core model math preserved; registration/parallelization adapted to TrainSpec/BaseModelArgs/ModelProtocol.

Includes: model definition, parallelization (TP/EP/FSDP with DTensor-safe wrappers for GatedDeltaNet), HF state dict adapter, 5 model flavors (debugmodel through 397B), and debug training config.

akeshet and others added 2 commits April 1, 2026 14:49
…rch#2545)

Back-ports the Qwen3.5 MoE model (GatedDeltaNet + full attention + MoE)
from pytorch#2545 to v0.2.0 APIs. Core model math preserved;
registration/parallelization adapted to TrainSpec/BaseModelArgs/ModelProtocol.

Includes: model definition, parallelization (TP/EP/FSDP with DTensor-safe
wrappers for GatedDeltaNet), HF state dict adapter, 5 model flavors
(debugmodel through 397B), and debug training config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
v0.2.0 exports the class as `Compile`, not `CompileConfig`. Remove the
direct import since we only pass `job_config.compile` through to
llama4's `apply_compile` which handles the alias internally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
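
The reasoning in the second commit can be illustrated with a minimal sketch, assuming the downstream helper only reads attributes from the compile section and never needs the caller to name its type. Everything here is a hypothetical stand-in rather than the actual v0.2.0 or llama4 code: the `JobConfig` dataclass, the `enable` and `backend` fields, and the `apply_compile_stub` and `parallelize_stub` functions are invented for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Compile:
    """Hypothetical stand-in for the compile section of a job config."""
    enable: bool = False
    backend: str = "inductor"


@dataclass
class JobConfig:
    """Hypothetical stand-in for a job config that holds a compile section."""
    compile: Compile = field(default_factory=Compile)


def apply_compile_stub(model_name: str, compile_config) -> str:
    """Stand-in for a downstream apply_compile helper.

    It reads attributes by name only (duck typing), so callers do not
    need to import the config class themselves.
    """
    if not compile_config.enable:
        return f"{model_name}: eager"
    return f"{model_name}: compiled with {compile_config.backend}"


def parallelize_stub(model_name: str, job_config: JobConfig) -> str:
    # No `Compile` / `CompileConfig` import is required at this call site:
    # the compile section is forwarded untouched to the helper.
    return apply_compile_stub(model_name, job_config.compile)


if __name__ == "__main__":
    print(parallelize_stub("debugmodel", JobConfig()))
    print(parallelize_stub("debugmodel", JobConfig(compile=Compile(enable=True))))
```

Because the call site never references the class by name, a rename between versions (such as `CompileConfig` versus `Compile`) cannot break it with an import error.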