Qwen3/Qwen3.5 vLLM Components — Performance Optimization Study
Source: vllm-project/vllm/tree/main/vllm/model_executor/models
Our models: Qwen3.5-35B-A3B (dense) and potentially Qwen3.5 MoE variants
Hardware: 3× RTX 4090 (72GB, SM89 — NOT SM90), TP=2 or TP=3
1. All Qwen3-related model implementations in vLLM
| File | Model | Architecture | Relevance to us |
|---|---|---|---|
| qwen3.py | Qwen3ForCausalLM | Dense, GQA + SwiGLU | ⭐ Our current dense model family |
| qwen3_moe.py | Qwen3MoeForCausalLM | MoE (sparse + dense MLP) | Potential future MoE deployment |
| qwen3_dflash.py | DFlashQwen3ForCausalLM | Speculative decoding drafter | ⭐ Already tracked in #3 |
| qwen3_next.py | Qwen3NextForCausalLM | Hybrid: GQA + GatedDeltaNet | ⭐ Future Qwen3-Next architecture |
| qwen3_next_mtp.py | Qwen3NextMTP | Multi-token prediction (Qwen3-Next) | Future speculative decoding |
| qwen3_5.py | Qwen3_5ForCausalLM | Hybrid: GQA + GatedDeltaNet + MoE | ⭐ Qwen3.5 architecture (our target!) |
| qwen3_5_mtp.py | Qwen3_5MTP | Multi-token prediction (Qwen3.5) | ⭐ Future speculative decoding for 3.5 |
| qwen3_vl.py | Qwen3VLForConditionalGeneration | Vision-Language | Not relevant |
| qwen3_vl_moe.py | Qwen3VLMoEForConditionalGeneration | VL + MoE | Not relevant |
| colqwen3.py / colqwen3_5.py | ColQwen3 | ColBERT late interaction | Not relevant |
| qwen3_asr*.py (3 files) | ASR models | Audio | Not relevant |
| qwen3_omni_moe_thinker.py | Omni MoE Thinker | Multimodal MoE | Not relevant |
2. Key Architectural Differences
Qwen3 (current — qwen3.py)
- Standard GQA attention (Qwen3Attention → Attention)
- SwiGLU MLP (Qwen2MLP)
- QK-norm (RMSNorm on Q and K)
- RoPE positional encoding
- @support_torch_compile enabled
Qwen3-Next (qwen3_next.py)
- Hybrid attention: standard GQA + GatedDeltaNet (linear attention)
- Uses GatedDeltaNetAttention from vllm/model_executor/layers/mamba/gdn_linear_attn.py
- Gated Delta Networks parameters: g (gate), beta (beta logits), A_log (log-gate), dt_bias
- GemmaRMSNorm (instead of RMSNorm)
- GDNAttentionBackend (dedicated attention backend, not standard FlashAttention)
- Dual chunk processing: chunk_gated_delta_rule (prefill) + recurrent decode
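To make the recurrent decode path concrete, here is a minimal pure-Python reference of the gated delta rule for a single head. It is a sketch for intuition only, not vLLM's kernel: the function names are hypothetical, and it assumes a per-token decay `alpha` in (0, 1] and write strength `beta` in [0, 1] are already available (how they are derived from g, A_log and dt_bias is not shown).

```python
# Reference (non-optimized) gated delta rule for one head, in pure Python.
# Assumption: alpha (decay) and beta (write strength) are given per token;
# the mapping from g / A_log / dt_bias to these values is not modelled here.

def gated_delta_step(S, q, k, v, alpha, beta):
    """One step: S <- alpha * S (I - beta k k^T) + beta v k^T, then o = S q."""
    d_v, d_k = len(S), len(S[0])
    # Value currently associated with key k in the state: S k
    Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    new_S = [
        [alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]
    # Read out with the query
    o = [sum(new_S[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return new_S, o


def gated_delta_recurrent(qs, ks, vs, alphas, betas):
    """Run the recurrence token by token, starting from a zero state."""
    d_v, d_k = len(vs[0]), len(ks[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q, k, v, a, b in zip(qs, ks, vs, alphas, betas):
        S, o = gated_delta_step(S, q, k, v, a, b)
        outputs.append(o)
    return outputs, S
```

The state has a fixed size (d_v × d_k) regardless of sequence length, which is why decode is a cheap recurrent update while prefill benefits from a chunked kernel that processes many tokens at once.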
Qwen3.5 (qwen3_5.py)
- Also hybrid: same GQA + GatedDeltaNet pattern as Qwen3-Next
- MoE support (sparse experts + shared expert)
- GatedDeltaNetAttention for linear attention layers
- Qwen3_5MTP for multi-token prediction speculative decoding
- Supports LoRA on GDN layers (separate qkv/z projections)
💡 Note on the FlashKDA question (#5): Qwen3 base does NOT use Gated Delta Networks. But Qwen3-Next and Qwen3.5 DO. The user was correct — they were referring to the Qwen3.5 architecture, not the base Qwen3. This makes FlashKDA/KDA kernels potentially relevant for future Qwen3.5 deployments.
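One quick way to see how much of a hybrid checkpoint runs through GDN is to count layer kinds in its HuggingFace config. The sketch below assumes the config exposes a per-layer `layer_types` list; that key and the values `linear_attention` / `full_attention` are assumptions to verify against the actual config.json, and `summarize_layer_types` is a hypothetical helper.

```python
from collections import Counter


def summarize_layer_types(config: dict) -> dict:
    """Count decoder layers per attention kind.

    Assumption: hybrid checkpoints list one entry per decoder layer under
    "layer_types". Returns {} when the key is absent (e.g. base Qwen3).
    """
    layer_types = config.get("layer_types")
    if layer_types is None:
        return {}
    return dict(Counter(layer_types))


# Illustrative config only; the 3:1 ratio is made up for the example.
example_config = {
    "layer_types": ["linear_attention"] * 3 + ["full_attention"],
}
print(summarize_layer_types(example_config))
```

An empty result suggests a pure-GQA model, in which case GDN or KDA kernels do not apply.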
3. Optimization Components to Study
3.1 ⭐ DFlash Speculative Decoding (already tracked in #3)
- File: qwen3_dflash.py
- Drafter checkpoint: z-lab/Qwen3.5-35B-A3B-DFlash
3.2 ⭐ GatedDeltaNet (GDN) Attention — NEW RELEVANCE
- File: vllm/model_executor/layers/mamba/gdn_linear_attn.py (1211 lines)
- What: Linear attention variant with gated delta recurrence for select layers
- Used by: Qwen3-Next, Qwen3.5 (NOT base Qwen3)
- Hardware requirement: SM90+ (Hopper/H100) for FlashInfer backend; SM89 (RTX 4090) falls back to Triton/FLA
- Key config: --gdn-prefill-backend flashinfer|triton|auto
- Kernels:
  - Prefill: chunk_gated_delta_rule (FlashInfer on SM90, Triton/FLA on SM89)
  - Decode: fused_recurrent_gated_delta_rule_packed_decode + fused_sigmoid_gating_delta_rule_update
  - Conv: causal_conv1d_fn / causal_conv1d_update
- Parameters: g (gate), beta (beta logits), A_log (log-gate), dt_bias, conv1d, in_proj_qkvz, in_proj_ba
- Impact on us:
  - If we upgrade to Qwen3.5 (which uses GDN layers), this is the critical performance path
  - RTX 4090 (SM89) can only use the Triton/FLA backend (no FlashInfer)
  - The Triton backend may have different performance characteristics vs FlashInfer
  - Need to benchmark: GDN layer throughput on SM89 vs SM90
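The backend rule described above can be written down as a small check, useful in a benchmark harness for asserting which path a GPU is expected to take. This is a sketch of the rule as stated in this study, not vLLM's actual selection code; `pick_gdn_prefill_backend` is a hypothetical helper.

```python
def pick_gdn_prefill_backend(compute_capability, requested="auto"):
    """Return the GDN prefill backend expected for a GPU.

    compute_capability is a (major, minor) tuple, e.g. what
    torch.cuda.get_device_capability() returns: (8, 9) for an RTX 4090.
    Assumption: FlashInfer needs SM90+, everything else uses Triton/FLA.
    """
    if requested not in ("auto", "flashinfer", "triton"):
        raise ValueError(f"unknown backend: {requested!r}")
    flashinfer_ok = tuple(compute_capability) >= (9, 0)
    if requested == "auto":
        return "flashinfer" if flashinfer_ok else "triton"
    if requested == "flashinfer" and not flashinfer_ok:
        raise ValueError("flashinfer GDN prefill assumed to require SM90+")
    return requested


print(pick_gdn_prefill_backend((8, 9)))  # RTX 4090
print(pick_gdn_prefill_backend((9, 0)))  # H100
```

Pinning the expected backend per GPU keeps SM89 and SM90 benchmark numbers from being compared across different kernels by accident.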
3.3 Multi-Token Prediction (MTP) — Speculative Decoding
- Files: qwen3_5_mtp.py, qwen3_next_mtp.py
- What: Draft model that predicts multiple tokens per forward pass
- Used by: Qwen3.5 (dedicated MTP model), Qwen3-Next
- Relation to DFlash: MTP is a different speculative decoding approach than DFlash. Both could be evaluated.
- Questions:
  - Does Qwen3.5-35B-A3B have an MTP companion model?
  - MTP + AWQ 4-bit compatibility?
  - Acceptance rate comparison: MTP vs DFlash?
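For the acceptance-rate question, a rough way to compare drafters is the expected number of tokens emitted per target-model step. The sketch below uses the standard simplified model in which each draft token is accepted independently with the same probability; real acceptance is position- and prompt-dependent, so treat the output as a back-of-envelope figure. The function name is hypothetical.

```python
def expected_tokens_per_step(acceptance_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens produced per verification step.

    Assumes i.i.d. acceptance with probability `acceptance_rate` for each of
    `num_draft_tokens` drafted tokens, plus the one token the target model
    always contributes: (1 - a^(k+1)) / (1 - a).
    """
    a, k = acceptance_rate, num_draft_tokens
    if not 0.0 <= a <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if k < 0:
        raise ValueError("num_draft_tokens must be non-negative")
    if a == 1.0:
        return float(k + 1)
    return (1.0 - a ** (k + 1)) / (1.0 - a)


# Compare two hypothetical drafters at the same draft length.
for name, rate in [("drafter A", 0.6), ("drafter B", 0.8)]:
    print(name, round(expected_tokens_per_step(rate, 4), 2))
```

This ignores drafter cost: a drafter with higher acceptance can still lose on wall-clock time if its forward pass is much slower.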
3.4 MoE Support (Expert Parallelism)
- Files: qwen3_moe.py, qwen3_5.py (MoE variant)
- What: FusedMoE with expert parallelism (EP), EPLB (Expert Parallelism Load Balancing)
- Features:
  - --enable-expert-parallel for multi-GPU expert distribution
  - enable_eplb for redundant experts to balance load
  - Sequence parallel for MoE (use_sequence_parallel_moe)
  - Shared expert + routed experts (Qwen3.5 MoE)
- Impact on us:
  - If we deploy a Qwen3.5 MoE model, EP across 3× 4090 would be the primary parallelism strategy
  - Current Qwen3.5-35B-A3B is dense (no MoE), but Qwen3.5-A3B-MoE variants exist
  - EPLB with redundant experts could help balance load across GPUs
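With three GPUs, the expert count rarely divides evenly, so it helps to see the resulting imbalance before benchmarking. The sketch below splits experts into contiguous blocks per EP rank; it illustrates the arithmetic only and is not vLLM's placement logic. `partition_experts` is a hypothetical helper, and the expert count of 128 is an illustrative value, not taken from a specific checkpoint.

```python
def partition_experts(num_experts: int, ep_size: int) -> list[range]:
    """Split expert ids into ep_size contiguous blocks, as evenly as possible.

    The first (num_experts % ep_size) ranks get one extra expert.
    """
    if ep_size <= 0:
        raise ValueError("ep_size must be positive")
    base, extra = divmod(num_experts, ep_size)
    parts, start = [], 0
    for rank in range(ep_size):
        count = base + (1 if rank < extra else 0)
        parts.append(range(start, start + count))
        start += count
    return parts


# Illustrative: 128 routed experts over 3 GPUs.
for rank, experts in enumerate(partition_experts(128, 3)):
    print(f"rank {rank}: {len(experts)} experts ({experts.start}..{experts.stop - 1})")
```

A static split like this balances expert counts, not token load; uneven routing is the case EPLB's redundant experts are meant to address.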
3.5 dual_chunk_attention_config
- What: Dual-chunk attention for long-context processing
- Available in: qwen3.py (base Qwen3), qwen3_moe.py
- Purpose: Split long sequences into chunks for memory-efficient attention
- Impact: Useful for long-context inference (>32K tokens)
- Config: dual_chunk_attention_config in HuggingFace config
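Whether a given checkpoint enables this can be checked directly from its config.json. A minimal sketch, assuming the setting sits at the top level of the config under the key named above; the `chunk_size` field in the example is hypothetical and only there to make the output visible.

```python
import json


def dual_chunk_settings(config_text: str):
    """Return the dual_chunk_attention_config block, or None if it is not set.

    Assumption: the key lives at the top level of the HuggingFace config.
    """
    config = json.loads(config_text)
    return config.get("dual_chunk_attention_config")


# Illustrative config snippets, not taken from a real checkpoint.
without = '{"hidden_size": 2048}'
with_dca = '{"hidden_size": 2048, "dual_chunk_attention_config": {"chunk_size": 1024}}'
print(dual_chunk_settings(without))
print(dual_chunk_settings(with_dca))
```

In practice, pass the contents of the model directory's config.json, for example via `Path(model_dir, "config.json").read_text()`.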
3.6 @support_torch_compile
- Available in: qwen3.py, qwen3_moe.py, qwen3_dflash.py, qwen3_next.py, qwen3_5.py, etc.
- What: Enables torch.compile for the model forward pass
- Impact: Can improve throughput by fusing CUDA kernels
- Config: --enforce-eager disables it; enabled by default
- Caveat: First invocation has compilation overhead; may not help with small batch sizes
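Since compilation is on by default, the main risk is a launch script that silently turns it off. The sketch below scans a launch command for the flag named above. It only detects the literal CLI flag; settings passed through environment variables or config files are out of scope. The helper name and the example commands are hypothetical.

```python
import shlex


def torch_compile_disabled(command: str) -> bool:
    """True if the launch command passes --enforce-eager.

    Only the bare CLI flag is checked; other ways of disabling
    compilation are not detected by this sketch.
    """
    return "--enforce-eager" in shlex.split(command)


# Illustrative launch commands; "my-model" is a placeholder.
print(torch_compile_disabled("vllm serve my-model --tensor-parallel-size 2"))
print(torch_compile_disabled("vllm serve my-model --tensor-parallel-size 2 --enforce-eager"))
```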
3.7 KV Cache Optimization
- --kv-cache-dtype fp8 (tracked in #1)
- --kv-transfer-config for cross-node KV cache transfer (not relevant for single-node)
3.8 QK-Norm
- What: RMSNorm applied to Q and K before attention
- Available in: qwen3.py (base Qwen3), qwen3_moe.py
- Impact: Already active in our current model, so no action is needed; noted only as an architectural feature that affects numerical precision
4. Priority Matrix for Our Setup
| Component | Priority | Effort | Expected Gain | Hardware Fit |
|---|---|---|---|---|
| DFlash speculative decoding | 🔴 HIGH | Medium | 1.9-2.8x decode | ✅ SM89 OK |
| GDN Triton backend perf | 🟡 MEDIUM | Low | Critical for Qwen3.5 upgrade | ⚠️ SM89 only (no FlashInfer) |
| MTP speculative decoding | 🟡 MEDIUM | Medium | Unknown (needs benchmark) | ✅ SM89 OK |
| MoE + Expert Parallelism | 🟢 LOW | High | Future MoE deployment | ✅ 3× 4090 ideal |
| dual_chunk_attention | 🟢 LOW | Low | Long-context only | ✅ SM89 OK |
| torch_compile | 🟢 LOW | None (default) | Marginal | ✅ SM89 OK |
5. Action Items
Immediate (current model — Qwen3.5-35B-A3B dense)
- Verify that torch.compile is not being disabled by our current flags
- Check whether dual_chunk_attention_config is set in our model's config.json
For Qwen3.5 upgrade path
For future MoE deployment
- Test --enable-expert-parallel with 3× 4090
6. Cross-reference with existing issues
| Issue | Topic | Relation |
|---|---|---|
| #1 | TurboQuant / RotorQuant KV cache compression | Complementary to this study |
| #3 | DFlash + PARO evaluation | Subsumed: DFlash is component 3.1 |
| #4 | Qwen3.5 → Qwen3.6 migration | GDN backend perf is input to migration decision |
| #5 | FlashKDA compatibility | UPDATE NEEDED: GDN in Qwen3.5 makes KDA kernels potentially relevant for SM89 Triton fallback |
Update to #5
The FlashKDA issue concluded "NOT compatible" based on base Qwen3 (which uses standard GQA). However, Qwen3.5 and Qwen3-Next both use GatedDeltaNet layers with the same mathematical primitive (gated delta rule). FlashKDA's chunk_kda operation is architecturally similar to vLLM's chunk_gated_delta_rule. While FlashKDA targets SM90+ (FlashInfer), our RTX 4090 currently uses the Triton/FLA backend for GDN — FlashKDA's CUTLASS kernels could potentially provide a faster SM89 path if ported.
Study based on vLLM main branch as of 2026-04-23. Components are actively evolving.