Study: vLLM optimization components for Qwen3 / Qwen3.5 architectures #7

@jsboige

Qwen3/Qwen3.5 vLLM Components — Performance Optimization Study

Source: vllm-project/vllm/tree/main/vllm/model_executor/models
Our models: Qwen3.5-35B-A3B (dense) and potentially Qwen3.5 MoE variants
Hardware: 3× RTX 4090 (72GB, SM89 — NOT SM90), TP=2 or TP=3


1. All Qwen3-related model implementations in vLLM

| File | Model | Architecture | Relevance to us |
|---|---|---|---|
| qwen3.py | Qwen3ForCausalLM | Dense, GQA + SwiGLU | ⭐ Our current dense model family |
| qwen3_moe.py | Qwen3MoeForCausalLM | MoE (sparse + dense MLP) | Potential future MoE deployment |
| qwen3_dflash.py | DFlashQwen3ForCausalLM | Speculative decoding drafter | ⭐ Already tracked in #3 |
| qwen3_next.py | Qwen3NextForCausalLM | Hybrid: GQA + GatedDeltaNet | ⭐ Future Qwen3-Next architecture |
| qwen3_next_mtp.py | Qwen3NextMTP | Multi-token prediction (Qwen3-Next) | Future speculative decoding |
| qwen3_5.py | Qwen3_5ForCausalLM | Hybrid: GQA + GatedDeltaNet + MoE | ⭐ Qwen3.5 architecture (our target!) |
| qwen3_5_mtp.py | Qwen3_5MTP | Multi-token prediction (Qwen3.5) | ⭐ Future speculative decoding for 3.5 |
| qwen3_vl.py | Qwen3VLForConditionalGeneration | Vision-Language | Not relevant |
| qwen3_vl_moe.py | Qwen3VLMoEForConditionalGeneration | VL + MoE | Not relevant |
| colqwen3.py / colqwen3_5.py | ColQwen3 | ColBERT late interaction | Not relevant |
| qwen3_asr*.py (3 files) | ASR models | Audio | Not relevant |
| qwen3_omni_moe_thinker.py | Omni MoE Thinker | Multimodal MoE | Not relevant |

2. Key Architectural Differences

Qwen3 (current — qwen3.py)

  • Standard GQA attention (Qwen3Attention)
  • SwiGLU MLP (Qwen2MLP)
  • QK-norm (RMSNorm on Q and K)
  • RoPE positional encoding
  • @support_torch_compile enabled

Qwen3-Next (qwen3_next.py)

  • Hybrid attention: standard GQA + GatedDeltaNet (linear attention)
  • Uses GatedDeltaNetAttention from vllm/model_executor/layers/mamba/gdn_linear_attn.py
  • Gated Delta Networks parameters: g (gate), beta (beta logits), A_log (log-gate), dt_bias
  • GemmaRMSNorm (instead of RMSNorm)
  • GDNAttentionBackend (dedicated attention backend, not standard FlashAttention)
  • Dual chunk processing: chunk_gated_delta_rule (prefill) + recurrent decode
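
For reference, the recurrent decode path can be sketched in plain Python. This is a minimal single-head illustration of the gated delta rule as published for Gated DeltaNet, not vLLM's fused kernel. The names `gated_delta_step`, `alpha` (scalar decay gate in (0, 1]) and `beta` (scalar write strength in [0, 1]) are illustrative; how vLLM derives them from g, beta, A_log and dt_bias is not shown here.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step of the gated delta rule for a single head.

    S is the d_v x d_k state matrix (list of rows); q and k have length d_k,
    v has length d_v. Update:
        S_t = alpha * (S_{t-1} - beta * (S_{t-1} k) k^T) + beta * v k^T
        o_t = S_t q
    """
    d_v, d_k = len(v), len(k)
    # What the current state would return for key k (the "old value").
    Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Erase a beta-fraction of the old value, decay by alpha, write the new value.
    S_new = [
        [alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]
    # Read out with the query.
    o = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, o


# Writing twice under the same unit-norm key with beta = 1 overwrites the value.
state = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
state, _ = gated_delta_step(state, key, key, [2.0, 3.0], alpha=1.0, beta=1.0)
state, out = gated_delta_step(state, key, key, [5.0, 7.0], alpha=1.0, beta=1.0)
print(out)  # [5.0, 7.0]
```

The state has a fixed size per sequence regardless of context length, which is why the decode path is recurrent while prefill processes tokens in chunks.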

Qwen3.5 (qwen3_5.py)

  • Also hybrid: same GQA + GatedDeltaNet pattern as Qwen3-Next
  • MoE support (sparse experts + shared expert)
  • GatedDeltaNetAttention for linear attention layers
  • Qwen3_5MTP for multi-token prediction speculative decoding
  • Supports LoRA on GDN layers (separate qkv/z projections)
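
To see which layers of a hybrid checkpoint are GDN and which are standard GQA, the `layer_types` list in the model config can be tallied. The sketch below is a rough helper, not part of vLLM: `summarize_layer_types` is a hypothetical name, the string values `"linear_attention"` and `"full_attention"` are assumed labels, and the 8-layer config is invented for illustration.

```python
from collections import Counter


def summarize_layer_types(config: dict) -> dict:
    """Count each layer kind and list the layer indices per kind."""
    layer_types = config.get("layer_types")
    if not layer_types:
        raise KeyError("config has no 'layer_types' list")
    indices = {}
    for i, kind in enumerate(layer_types):
        indices.setdefault(kind, []).append(i)
    return {"counts": dict(Counter(layer_types)), "indices": indices}


# Hypothetical config: three linear-attention layers, then one full-attention
# layer, repeated twice. Real checkpoints will have their own pattern.
example_config = {
    "layer_types": (["linear_attention"] * 3 + ["full_attention"]) * 2
}
summary = summarize_layer_types(example_config)
print(summary["counts"])
print(summary["indices"]["full_attention"])
```

For a real model, pass the parsed config.json of the checkpoint instead of `example_config`. Whether the field sits at the top level or in a nested sub-config has not been verified here.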

💡 Note on the FlashKDA question (#5): Qwen3 base does NOT use Gated Delta Networks. But Qwen3-Next and Qwen3.5 DO. The user was correct — they were referring to the Qwen3.5 architecture, not the base Qwen3. This makes FlashKDA/KDA kernels potentially relevant for future Qwen3.5 deployments.


3. Optimization Components to Study

3.1 ⭐ DFlash Speculative Decoding (already tracked in #3)

3.2 ⭐ GatedDeltaNet (GDN) Attention — NEW RELEVANCE

  • File: vllm/model_executor/layers/mamba/gdn_linear_attn.py (1211 lines)
  • What: Linear attention variant with gated delta recurrence for select layers
  • Used by: Qwen3-Next, Qwen3.5 (NOT base Qwen3)
  • Hardware requirement: SM90+ (Hopper/H100) for FlashInfer backend; SM89 (RTX 4090) falls back to Triton/FLA
  • Key config: --gdn-prefill-backend flashinfer|triton|auto
  • Kernels:
    • Prefill: chunk_gated_delta_rule (FlashInfer on SM90, Triton/FLA on SM89)
    • Decode: fused_recurrent_gated_delta_rule_packed_decode + fused_sigmoid_gating_delta_rule_update
    • Conv: causal_conv1d_fn / causal_conv1d_update
  • Parameters: g (gate), beta (beta logits), A_log (log-gate), dt_bias, conv1d, in_proj_qkvz, in_proj_ba
  • Impact on us:
    • If we upgrade to Qwen3.5 (which uses GDN layers), this is the critical performance path
    • RTX 4090 (SM89) can only use Triton/FLA backend — no FlashInfer
    • The Triton backend may have different performance characteristics vs FlashInfer
    • Need to benchmark: GDN layer throughput on SM89 vs SM90
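
The backend constraint above can be written down as a small decision function. This is only a sketch of the rule as stated in this study (FlashInfer needs SM90+, SM89 falls back to Triton/FLA); it is not vLLM's actual selection code. `pick_gdn_prefill_backend` is a hypothetical helper, and raising an error when `flashinfer` is requested on older hardware is an assumption, since vLLM might fall back silently instead.

```python
def pick_gdn_prefill_backend(capability, requested="auto"):
    """Return 'flashinfer' or 'triton' for a (major, minor) compute capability."""
    major, minor = capability
    sm = major * 10 + minor          # e.g. (8, 9) -> 89 for RTX 4090
    flashinfer_ok = sm >= 90         # Hopper and newer, per the note above

    if requested == "triton":
        return "triton"
    if requested == "flashinfer":
        if not flashinfer_ok:
            # Assumption: treat an unsupported explicit request as an error.
            raise ValueError(f"flashinfer GDN prefill needs SM90+, got SM{sm}")
        return "flashinfer"
    if requested == "auto":
        return "flashinfer" if flashinfer_ok else "triton"
    raise ValueError(f"unknown backend request: {requested!r}")


print(pick_gdn_prefill_backend((8, 9)))  # triton
print(pick_gdn_prefill_backend((9, 0)))  # flashinfer
```

On a live machine the capability tuple would typically come from `torch.cuda.get_device_capability()`.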

3.3 Multi-Token Prediction (MTP) — Speculative Decoding

  • Files: qwen3_5_mtp.py, qwen3_next_mtp.py
  • What: Draft model that predicts multiple tokens per forward pass
  • Used by: Qwen3.5 (dedicated MTP model), Qwen3-Next
  • Relation to DFlash: MTP is a different speculative decoding approach than DFlash. Both could be evaluated.
  • Questions:
    • Does Qwen3.5-35B-A3B have an MTP companion model?
    • MTP + AWQ 4-bit compatibility?
    • Acceptance rate comparison: MTP vs DFlash?
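
When comparing MTP and DFlash, a measured acceptance rate can be converted into expected tokens per verification step. The sketch below uses the standard speculative decoding estimate and assumes each draft token is accepted independently with the same probability, which real drafters only approximate. `expected_tokens_per_step` is a hypothetical helper and the example rates are invented, not measurements.

```python
def expected_tokens_per_step(acceptance_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    With per-token acceptance probability a and k draft tokens, the expected
    count is 1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a).
    """
    a, k = acceptance_rate, num_draft_tokens
    if not 0.0 <= a <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if k < 0:
        raise ValueError("num_draft_tokens must be non-negative")
    if a == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - a ** (k + 1)) / (1.0 - a)


# Invented acceptance rates, for illustration only.
for name, rate, k in [("drafter A", 0.6, 3), ("drafter B", 0.8, 3)]:
    print(name, round(expected_tokens_per_step(rate, k), 3))
```

This is an upper bound on decode speedup, since it ignores the cost of running the drafter itself.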

3.4 MoE Support (Expert Parallelism)

  • Files: qwen3_moe.py, qwen3_5.py (MoE variant)
  • What: FusedMoE with expert parallelism (EP) and EPLB (Expert Parallel Load Balancer)
  • Features:
    • --enable-expert-parallel for multi-GPU expert distribution
    • enable_eplb for redundant experts to balance load
    • Sequence parallel for MoE (use_sequence_parallel_moe)
    • Shared expert + routed experts (Qwen3.5 MoE)
  • Impact on us:
    • If we deploy a Qwen3.5 MoE model, EP across 3× 4090 would be the primary parallelism strategy
    • Current Qwen3.5-35B-A3B is dense (no MoE), but Qwen3.5-A3B-MoE variants exist
    • EPLB with redundant experts could help balance load across GPUs
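
A quick way to reason about EP on three GPUs is to lay out the experts and sum the routed tokens per rank. The sketch below uses a simple contiguous split, which may not match vLLM's actual placement. `partition_experts` and `rank_loads` are hypothetical helpers, and the expert count of 128 is an assumption for illustration.

```python
def partition_experts(num_experts: int, ep_size: int) -> list:
    """Split expert ids into ep_size contiguous groups, sizes differing by at most 1."""
    base, extra = divmod(num_experts, ep_size)
    placement, start = [], 0
    for rank in range(ep_size):
        count = base + (1 if rank < extra else 0)
        placement.append(list(range(start, start + count)))
        start += count
    return placement


def rank_loads(placement: list, tokens_per_expert: list) -> list:
    """Total routed tokens handled by each rank."""
    return [sum(tokens_per_expert[e] for e in experts) for experts in placement]


placement = partition_experts(128, 3)      # assumed expert count
print([len(p) for p in placement])         # [43, 43, 42]

# Skewed routing: the first 8 experts are "hot". Numbers are invented.
tokens = [100] * 8 + [10] * 120
print(rank_loads(placement, tokens))
```

An uneven result from `rank_loads` is the situation EPLB with redundant experts is meant to smooth out.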

3.5 dual_chunk_attention_config

  • What: Dual-chunk attention for long-context processing
  • Available in: qwen3.py (base Qwen3), qwen3_moe.py
  • Purpose: Split long sequences into chunks for memory-efficient attention
  • Impact: Useful for long-context inference (>32K tokens)
  • Config: dual_chunk_attention_config in HuggingFace config

3.6 @support_torch_compile

  • Available in: qwen3.py, qwen3_moe.py, qwen3_dflash.py, qwen3_next.py, qwen3_5.py, etc.
  • What: Enables torch.compile for the model forward pass
  • Impact: Can improve throughput by fusing CUDA kernels
  • Config: --enforce-eager disables it; enabled by default
  • Caveat: First invocation has compilation overhead; may not help with small batch sizes
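
To separate compilation overhead from steady-state speed, time the first call apart from the following ones. The harness below is generic Python and does not invoke torch.compile or vLLM. `time_first_and_steady` and the fake workload are hypothetical stand-ins, with a sleep imitating a one-time compile cost.

```python
import time


def time_first_and_steady(fn, steady_calls: int = 5):
    """Return (first_call_seconds, mean_seconds_of_later_calls)."""
    start = time.perf_counter()
    fn()
    first = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(steady_calls):
        fn()
    steady = (time.perf_counter() - start) / steady_calls
    return first, steady


def make_fake_compiled_fn(compile_seconds: float = 0.05):
    """Stand-in workload: slow on the first call only, like a lazy compile."""
    state = {"compiled": False}

    def fn():
        if not state["compiled"]:
            time.sleep(compile_seconds)   # simulated one-time compile cost
            state["compiled"] = True
        return sum(range(1000))           # simulated steady-state work

    return fn


first, steady = time_first_and_steady(make_fake_compiled_fn())
print(f"first call: {first:.4f}s, steady: {steady:.6f}s")
```

For a real comparison, the same split applies to a server started with and without --enforce-eager: discard the warm-up requests before measuring throughput.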

3.7 KV Cache Optimization

3.8 QK-Norm

  • What: RMSNorm applied to Q and K before attention
  • Available in: qwen3.py (base Qwen3), qwen3_moe.py
  • Impact: Already active in our current model — no action needed, just noting it exists as an architectural feature that affects numerical precision

4. Priority Matrix for Our Setup

| Component | Priority | Effort | Expected Gain | Hardware Fit |
|---|---|---|---|---|
| DFlash speculative decoding | 🔴 HIGH | Medium | 1.9-2.8x decode | ✅ SM89 OK |
| GDN Triton backend perf | 🟡 MEDIUM | Low | Critical for Qwen3.5 upgrade | ⚠️ SM89 only (no FlashInfer) |
| MTP speculative decoding | 🟡 MEDIUM | Medium | Unknown (needs benchmark) | ✅ SM89 OK |
| MoE + Expert Parallelism | 🟢 LOW | High | Future MoE deployment | ✅ 3× 4090 ideal |
| dual_chunk_attention | 🟢 LOW | Low | Long-context only | ✅ SM89 OK |
| torch_compile | 🟢 LOW | None (default) | Marginal | ✅ SM89 OK |

5. Action Items

Immediate (current model — Qwen3.5-35B-A3B dense)

For Qwen3.5 upgrade path

  • Benchmark GDN Triton backend throughput on SM89 (RTX 4090) vs documented SM90 numbers
  • Identify which layers in Qwen3.5-35B-A3B use GDN vs standard GQA (check config layer_types)
  • Evaluate if FlashKDA (Study: FlashKDA compatibility with Qwen models #5) could replace the Triton GDN backend on SM89 (both implement gated delta rule)
  • Test MTP speculative decoding with Qwen3.5 if an MTP companion model exists
  • Profile memory usage: GDN recurrent state vs KV cache for our typical batch sizes
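
Before profiling on hardware, the two memory terms can be estimated on paper. The sketch below is a back-of-the-envelope calculation under stated assumptions: it ignores the conv1d state, paging granularity, and TP sharding. Both function names are hypothetical and every dimension in the example is invented, so substitute values from the real config.

```python
def kv_cache_bytes(num_full_attn_layers, num_kv_heads, head_dim,
                   seq_len, batch, dtype_bytes=2):
    """KV cache size: K and V per token, per full-attention layer."""
    return (2 * num_full_attn_layers * num_kv_heads * head_dim
            * seq_len * batch * dtype_bytes)


def gdn_state_bytes(num_gdn_layers, num_v_heads, head_k_dim, head_v_dim,
                    batch, dtype_bytes=4):
    """GDN recurrent state: one head_k_dim x head_v_dim matrix per head,
    per GDN layer, per sequence. Independent of sequence length."""
    return (num_gdn_layers * num_v_heads * head_k_dim * head_v_dim
            * batch * dtype_bytes)


MIB = 1024 ** 2
# Invented dimensions, for illustration only.
kv = kv_cache_bytes(num_full_attn_layers=12, num_kv_heads=4, head_dim=128,
                    seq_len=32768, batch=1)
gdn = gdn_state_bytes(num_gdn_layers=36, num_v_heads=32, head_k_dim=128,
                      head_v_dim=128, batch=1)
print(f"KV cache:  {kv / MIB:.0f} MiB")   # 768 MiB
print(f"GDN state: {gdn / MIB:.0f} MiB")  # 72 MiB
```

The KV term grows with context length while the GDN term grows only with batch size, so the balance shifts toward GDN state at short contexts and large batches.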

For future MoE deployment

  • Evaluate --enable-expert-parallel with 3× 4090
  • Test EPLB load balancing with redundant experts
  • Compare dense Qwen3.5-35B-A3B vs MoE variant latency/quality

6. Cross-reference with existing issues

| Issue | Topic | Relation |
|---|---|---|
| #1 | TurboQuant / RotorQuant KV cache compression | Complementary to this study |
| #3 | DFlash + PARO evaluation | Subsumed: DFlash is component 3.1 |
| #4 | Qwen3.5 → Qwen3.6 migration | GDN backend perf is input to migration decision |
| #5 | FlashKDA compatibility | UPDATE NEEDED: GDN in Qwen3.5 makes KDA kernels potentially relevant for SM89 Triton fallback |

Update to #5

The FlashKDA issue concluded "NOT compatible" based on base Qwen3, which uses standard GQA. However, Qwen3.5 and Qwen3-Next both use GatedDeltaNet layers built on the same mathematical primitive (the gated delta rule), and FlashKDA's chunk_kda operation is architecturally similar to vLLM's chunk_gated_delta_rule. FlashKDA targets SM90+ (FlashInfer), whereas our RTX 4090 currently uses the Triton/FLA backend for GDN, so FlashKDA's CUTLASS kernels could provide a faster SM89 path only if they were ported.


Study based on vLLM main branch as of 2026-04-23. Components are actively evolving.
