Study: vLLM optimization components for Qwen3 / Qwen3.5 architectures #7

## Qwen3/Qwen3.5 vLLM Components — Performance Optimization Study

**Source:** [vllm-project/vllm/tree/main/vllm/model_executor/models](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models)
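**Our models:** `Qwen3.5-35B-A3B` (treated as dense in this study; the `A3B` suffix in Qwen model names conventionally denotes an MoE model with about 3B activated parameters, so verify against the model's `config.json`) and potentially other Qwen3.5 MoE variants
**Hardware:** 3× RTX 4090 (24 GB each, 72 GB total; SM89, not SM90), TP=2 or TP=3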

---

## 1. All Qwen3-related model implementations in vLLM

| File | Model | Architecture | Relevance to us |
|------|-------|-------------|-----------------|
| `qwen3.py` | Qwen3ForCausalLM | Dense, GQA + SwiGLU | ⭐ Our current dense model family |
| `qwen3_moe.py` | Qwen3MoeForCausalLM | MoE (sparse + dense MLP) | Potential future MoE deployment |
| `qwen3_dflash.py` | DFlashQwen3ForCausalLM | Speculative decoding drafter | ⭐ Already tracked in #3 |
| `qwen3_next.py` | Qwen3NextForCausalLM | Hybrid: GQA + GatedDeltaNet | ⭐ Future Qwen3-Next architecture |
| `qwen3_next_mtp.py` | Qwen3NextMTP | Multi-token prediction (Qwen3-Next) | Future speculative decoding |
| `qwen3_5.py` | Qwen3_5ForCausalLM | Hybrid: GQA + GatedDeltaNet + MoE | ⭐ Qwen3.5 architecture (our target!) |
| `qwen3_5_mtp.py` | Qwen3_5MTP | Multi-token prediction (Qwen3.5) | ⭐ Future speculative decoding for 3.5 |
| `qwen3_vl.py` | Qwen3VLForConditionalGeneration | Vision-Language | Not relevant |
| `qwen3_vl_moe.py` | Qwen3VLMoEForConditionalGeneration | VL + MoE | Not relevant |
| `colqwen3.py` / `colqwen3_5.py` | ColQwen3 | ColBERT late interaction | Not relevant |
| `qwen3_asr*.py` (3 files) | ASR models | Audio | Not relevant |
| `qwen3_omni_moe_thinker.py` | Omni MoE Thinker | Multimodal MoE | Not relevant |

---

## 2. Key Architectural Differences

### Qwen3 (current — `qwen3.py`)
- Standard GQA attention (`Qwen3Attention` → `Attention`)
- SwiGLU MLP (`Qwen2MLP`)
- QK-norm (RMSNorm on Q and K)
- RoPE positional encoding
- `@support_torch_compile` enabled

### Qwen3-Next (`qwen3_next.py`)
- **Hybrid attention**: standard GQA + **GatedDeltaNet** (linear attention)
- Uses `GatedDeltaNetAttention` from `vllm/model_executor/layers/mamba/gdn_linear_attn.py`
- Gated Delta Networks parameters: `g` (gate), `beta` (beta logits), `A_log` (log-gate), `dt_bias`
- GemmaRMSNorm (instead of RMSNorm)
- `GDNAttentionBackend` (dedicated attention backend, not standard FlashAttention)
- Dual chunk processing: chunk_gated_delta_rule (prefill) + recurrent decode
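The recurrent (decode-time) form of the gated delta rule named above can be sketched in a few lines. This is a minimal pure-Python reference for a single head, written from the published Gated DeltaNet update rule, not from vLLM's kernels. The function name `gated_delta_step` and the scalar inputs `alpha` (decay gate in [0, 1]) and `beta` (write strength in [0, 1]) are illustrative assumptions; how the model derives them from `A_log`, `dt_bias`, and the projections is not shown.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def gated_delta_step(
    state: Matrix,
    q: Sequence[float],
    k: Sequence[float],
    v: Sequence[float],
    alpha: float,
    beta: float,
) -> Tuple[Matrix, List[float]]:
    """One recurrent step of the gated delta rule for a single head.

    state is the d_v x d_k matrix S_{t-1}. The update is
        S_t = alpha * (S_{t-1} - beta * (S_{t-1} k) k^T) + beta * v k^T
        o_t = S_t q
    """
    d_v, d_k = len(state), len(k)
    # Value currently stored under key k: S_{t-1} k
    pred = [sum(state[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Erase a beta-fraction of the old value, decay by alpha, write the new value.
    new_state = [
        [
            alpha * (state[i][j] - beta * pred[i] * k[j]) + beta * v[i] * k[j]
            for j in range(d_k)
        ]
        for i in range(d_v)
    ]
    # Read out with the query: o_t = S_t q
    out = [sum(new_state[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return new_state, out
```

With `alpha = beta = 1` and a unit-norm key, writing a second value under the same key fully overwrites the first, which is the "delta" behaviour. The state has a fixed size regardless of sequence length, which is why decode can run as a recurrence while prefill uses a chunked formulation.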

### Qwen3.5 (`qwen3_5.py`)
- **Also hybrid**: same GQA + GatedDeltaNet pattern as Qwen3-Next
- MoE support (sparse experts + shared expert)
- `GatedDeltaNetAttention` for linear attention layers
- `Qwen3_5MTP` for multi-token prediction speculative decoding
- Supports LoRA on GDN layers (separate qkv/z projections)

> **💡 Note on the FlashKDA question (#5):** Qwen3 *base* does NOT use Gated Delta Networks. But Qwen3-Next and Qwen3.5 DO. The user was correct — they were referring to the Qwen3.5 architecture, not the base Qwen3. This makes FlashKDA/KDA kernels *potentially relevant* for future Qwen3.5 deployments.

---

## 3. Optimization Components to Study

### 3.1 ⭐ DFlash Speculative Decoding (already tracked in #3)
- **File:** `qwen3_dflash.py`
- **What:** Block diffusion drafter (0.5B) for 1.9-2.8x decode speedup
- **Model:** `z-lab/Qwen3.5-35B-A3B-DFlash`
- **Status:** Supported in vLLM nightly, needs AWQ compatibility testing
- **Action:** See #3 for full evaluation plan

### 3.2 ⭐ GatedDeltaNet (GDN) Attention — NEW RELEVANCE
- **File:** `vllm/model_executor/layers/mamba/gdn_linear_attn.py` (1211 lines)
- **What:** Linear attention variant with gated delta recurrence for select layers
- **Used by:** Qwen3-Next, Qwen3.5 (NOT base Qwen3)
- **Hardware requirement:** SM90+ (Hopper/H100) for FlashInfer backend; SM89 (RTX 4090) falls back to Triton/FLA
- **Key config:** `--gdn-prefill-backend flashinfer|triton|auto`
- **Kernels:**
  - Prefill: `chunk_gated_delta_rule` (FlashInfer on SM90, Triton/FLA on SM89)
  - Decode: `fused_recurrent_gated_delta_rule_packed_decode` + `fused_sigmoid_gating_delta_rule_update`
  - Conv: `causal_conv1d_fn` / `causal_conv1d_update`
- **Parameters:** `g` (gate), `beta` (beta logits), `A_log` (log-gate), `dt_bias`, `conv1d`, `in_proj_qkvz`, `in_proj_ba`
- **Impact on us:**
  - If we upgrade to Qwen3.5 (which uses GDN layers), this is the critical performance path
  - RTX 4090 (SM89) can only use Triton/FLA backend — no FlashInfer
  - The Triton backend may have different performance characteristics vs FlashInfer
  - **Need to benchmark**: GDN layer throughput on SM89 vs SM90
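The hardware rule described above (FlashInfer on SM90+, Triton/FLA otherwise) can be written down as a small decision helper, which is handy when scripting benchmarks across machines. This is only a sketch of the rule as stated in this study; `resolve_gdn_prefill_backend` is a hypothetical helper, not a vLLM function, and vLLM's own selection logic may differ.

```python
from typing import Tuple

VALID_BACKENDS = {"auto", "flashinfer", "triton"}


def resolve_gdn_prefill_backend(
    capability: Tuple[int, int], requested: str = "auto"
) -> str:
    """Return the GDN prefill backend this study expects for a given GPU.

    capability is the CUDA compute capability as (major, minor),
    e.g. (8, 9) for RTX 4090 and (9, 0) for H100.
    """
    if requested not in VALID_BACKENDS:
        raise ValueError(f"unknown backend: {requested!r}")
    # Assumption taken from this study: FlashInfer GDN kernels need SM90+.
    supports_flashinfer = capability >= (9, 0)
    if requested == "auto":
        return "flashinfer" if supports_flashinfer else "triton"
    if requested == "flashinfer" and not supports_flashinfer:
        # Fall back rather than fail, mirroring the behaviour described above.
        return "triton"
    return requested
```

On a real machine the tuple can be obtained from `torch.cuda.get_device_capability()`. For our RTX 4090s, `resolve_gdn_prefill_backend((8, 9))` resolves to the Triton path.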

### 3.3 Multi-Token Prediction (MTP) — Speculative Decoding
- **Files:** `qwen3_5_mtp.py`, `qwen3_next_mtp.py`
- **What:** Draft model that predicts multiple tokens per forward pass
- **Used by:** Qwen3.5 (dedicated MTP model), Qwen3-Next
- **Relation to DFlash:** MTP is a different speculative decoding approach than DFlash. Both could be evaluated.
- **Questions:**
  - Does Qwen3.5-35B-A3B have an MTP companion model?
  - MTP + AWQ 4-bit compatibility?
  - Acceptance rate comparison: MTP vs DFlash?
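To compare MTP and DFlash on acceptance rate, it helps to translate an acceptance rate into expected tokens per verification step. The sketch below uses the standard speculative-decoding estimate, assuming each drafted token is accepted independently with the same probability; real acceptance is position-dependent, and drafter cost is ignored. The function name is illustrative.

```python
def expected_tokens_per_step(acceptance_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    Assumes i.i.d. per-token acceptance with probability acceptance_rate.
    The target model always contributes one token, so the result is
        (1 - a^(k+1)) / (1 - a)   for a < 1, and k + 1 for a == 1.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if num_draft_tokens < 0:
        raise ValueError("num_draft_tokens must be non-negative")
    if acceptance_rate == 1.0:
        return float(num_draft_tokens + 1)
    return (1.0 - acceptance_rate ** (num_draft_tokens + 1)) / (1.0 - acceptance_rate)
```

For example, an acceptance rate of 0.8 with 4 drafted tokens gives about 3.36 tokens per step under these assumptions. This is an upper bound on decode speedup, since the drafter's own latency still has to be paid.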

### 3.4 MoE Support (Expert Parallelism)
- **Files:** `qwen3_moe.py`, `qwen3_5.py` (MoE variant)
- **What:** FusedMoE with expert parallelism (EP), EPLB (Expert Parallelism Load Balancing)
- **Features:**
  - `--enable-expert-parallel` for multi-GPU expert distribution
  - `enable_eplb` for redundant experts to balance load
  - Sequence parallel for MoE (`use_sequence_parallel_moe`)
  - Shared expert + routed experts (Qwen3.5 MoE)
- **Impact on us:**
  - If we deploy a Qwen3.5 MoE model, EP across 3× 4090 would be the primary parallelism strategy
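  - This study treats the current Qwen3.5-35B-A3B as dense (no MoE), but the `A3B` suffix in Qwen model names conventionally denotes about 3B activated parameters, which indicates an MoE model; confirm against the model's `config.json` before ruling out the MoE path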
  - EPLB with redundant experts could help balance load across GPUs
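One practical wrinkle with EP on three GPUs is that common expert counts are not divisible by 3, so ranks end up holding unequal numbers of experts. The sketch below shows a simple even split with the remainder assigned to the lowest ranks. The helper name and the expert counts used in the example are illustrative assumptions; vLLM's actual placement, especially with EPLB redundant experts, may differ.

```python
from typing import List


def experts_per_rank(num_experts: int, ep_size: int) -> List[int]:
    """Split num_experts across ep_size ranks as evenly as possible.

    The first (num_experts % ep_size) ranks receive one extra expert.
    """
    if num_experts <= 0 or ep_size <= 0:
        raise ValueError("num_experts and ep_size must be positive")
    base, extra = divmod(num_experts, ep_size)
    return [base + (1 if rank < extra else 0) for rank in range(ep_size)]
```

With a hypothetical 128 routed experts on 3 GPUs, the split is 43/43/42, so static placement is already slightly unbalanced before routing skew is considered. That is the kind of imbalance redundant experts are meant to absorb.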

### 3.5 `dual_chunk_attention_config`
- **What:** Dual-chunk attention for long-context processing
- **Available in:** `qwen3.py` (base Qwen3), `qwen3_moe.py`
- **Purpose:** Split long sequences into chunks for memory-efficient attention
- **Impact:** Useful for long-context inference (>32K tokens)
- **Config:** `dual_chunk_attention_config` in HuggingFace config

### 3.6 `@support_torch_compile`
- **Available in:** `qwen3.py`, `qwen3_moe.py`, `qwen3_dflash.py`, `qwen3_next.py`, `qwen3_5.py`, etc.
- **What:** Enables `torch.compile` for the model forward pass
- **Impact:** Can improve throughput by fusing operations into fewer GPU kernels and reducing launch overhead
- **Config:** `--enforce-eager` disables it; enabled by default
- **Caveat:** First invocation has compilation overhead; may not help with small batch sizes

### 3.7 KV Cache Optimization
- **Already in use:** `--kv-cache-dtype fp8` (tracked in #1)
- **Also relevant:** `--kv-transfer-config` for cross-node KV cache transfer (not relevant for single-node)
- **GDN-specific:** GDN layers maintain recurrent state instead of KV cache — different memory profile
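The different memory profile can be estimated with simple arithmetic: KV cache grows linearly with the number of cached tokens, while a GDN layer's recurrent state is a fixed-size matrix per head and per sequence. The sketch below is a back-of-the-envelope calculator; all dimensions passed to it are placeholders to be read from the model's `config.json`, and the estimate ignores the conv1d state, paging granularity, and tensor-parallel sharding.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_attn_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int,
) -> int:
    """KV cache size for one sequence; the factor 2 covers K and V."""
    return 2 * num_tokens * num_attn_layers * num_kv_heads * head_dim * bytes_per_elem


def recurrent_state_bytes(
    num_linear_layers: int,
    num_heads: int,
    head_k_dim: int,
    head_v_dim: int,
    bytes_per_elem: int,
) -> int:
    """GDN recurrent state for one sequence; independent of sequence length."""
    return num_linear_layers * num_heads * head_k_dim * head_v_dim * bytes_per_elem
```

Setting `bytes_per_elem=1` models `--kv-cache-dtype fp8` and `bytes_per_elem=2` models fp16/bf16. Because the recurrent term does not depend on `num_tokens`, hybrid models shift the memory bottleneck from context length toward the number of concurrent sequences.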

### 3.8 QK-Norm
- **What:** RMSNorm applied to Q and K before attention
- **Available in:** `qwen3.py` (base Qwen3), `qwen3_moe.py`
- **Impact:** Already active in our current model — no action needed, just noting it exists as an architectural feature that affects numerical precision
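For reference, QK-norm amounts to an RMSNorm over `head_dim`, applied to each query head and each key head before attention. The following is a minimal pure-Python sketch of that operation, not vLLM's fused implementation; the function names and the `eps` default are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def rms_norm(x: Sequence[float], weight: Sequence[float], eps: float = 1e-6) -> List[float]:
    """RMSNorm: scale x to unit root-mean-square, then apply a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def qk_norm(
    q_heads: Sequence[Sequence[float]],
    k_heads: Sequence[Sequence[float]],
    q_weight: Sequence[float],
    k_weight: Sequence[float],
    eps: float = 1e-6,
) -> Tuple[List[List[float]], List[List[float]]]:
    """Normalise every query head and key head separately over head_dim."""
    q_out = [rms_norm(h, q_weight, eps) for h in q_heads]
    k_out = [rms_norm(h, k_weight, eps) for h in k_heads]
    return q_out, k_out
```

Because each head is rescaled to unit RMS, the magnitude of the attention logits is governed by the learned weights rather than by activation scale, which is the numerical-precision effect noted above.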

---

## 4. Priority Matrix for Our Setup

| Component | Priority | Effort | Expected Gain | Hardware Fit |
|-----------|----------|--------|---------------|-------------|
| DFlash speculative decoding | 🔴 HIGH | Medium | 1.9-2.8x decode | ✅ SM89 OK |
| GDN Triton backend perf | 🟡 MEDIUM | Low | Critical for Qwen3.5 upgrade | ⚠️ Triton/FLA only on SM89 (no FlashInfer) |
| MTP speculative decoding | 🟡 MEDIUM | Medium | Unknown (needs benchmark) | ✅ SM89 OK |
| MoE + Expert Parallelism | 🟢 LOW | High | Future MoE deployment | ✅ 3× 4090 ideal |
| dual_chunk_attention | 🟢 LOW | Low | Long-context only | ✅ SM89 OK |
| torch_compile | 🟢 LOW | None (default) | Marginal | ✅ SM89 OK |

---

## 5. Action Items

### Immediate (current model — Qwen3.5-35B-A3B dense)
- [ ] Complete DFlash evaluation per #3 (AWQ compatibility, acceptance rate, VRAM)
- [ ] Verify `torch.compile` is not being disabled by our current flags
- [ ] Check if `dual_chunk_attention_config` is set in our model's config.json

### For Qwen3.5 upgrade path
- [ ] Benchmark GDN Triton backend throughput on SM89 (RTX 4090) vs documented SM90 numbers
- [ ] Identify which layers in Qwen3.5-35B-A3B use GDN vs standard GQA (check config `layer_types`)
- [ ] Evaluate if FlashKDA (#5) could replace the Triton GDN backend on SM89 (both implement gated delta rule)
- [ ] Test MTP speculative decoding with Qwen3.5 if an MTP companion model exists
- [ ] Profile memory usage: GDN recurrent state vs KV cache for our typical batch sizes
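The two config checks in these action items (the `layer_types` split and the presence of `dual_chunk_attention_config`) can be scripted against a loaded `config.json`. This is a small sketch operating on an already-parsed dict; the helper name is hypothetical, and the layer-type strings in the usage note are placeholders, since the actual values should be read from the model's config.

```python
from collections import Counter
from typing import Any, Dict


def summarize_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Report how many layers of each type a model config declares.

    Assumes layer_types, if present, is a top-level list of strings;
    some configs may nest it elsewhere.
    """
    layer_counts = Counter(config.get("layer_types", []))
    return {
        "layer_type_counts": dict(layer_counts),
        "has_dual_chunk_attention": config.get("dual_chunk_attention_config") is not None,
    }
```

Typical use is `summarize_config(json.load(open("config.json")))`. For a config whose `layer_types` repeats three linear-attention entries followed by one full-attention entry, the counts come out 3 to 1, which gives the ratio of GDN to GQA layers needed for the memory and throughput estimates.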

### For future MoE deployment
- [ ] Evaluate `--enable-expert-parallel` with 3× 4090
- [ ] Test EPLB load balancing with redundant experts
- [ ] Compare dense Qwen3.5-35B-A3B vs MoE variant latency/quality

---

## 6. Cross-reference with existing issues

| Issue | Topic | Relation |
|-------|-------|----------|
| #1 | TurboQuant / RotorQuant KV cache compression | Complementary to this study |
| #3 | DFlash + PARO evaluation | Subsumed — DFlash is component 3.1 |
| #4 | Qwen3.5 → Qwen3.6 migration | GDN backend perf is input to migration decision |
| #5 | FlashKDA compatibility | **UPDATE NEEDED** — GDN in Qwen3.5 makes KDA kernels potentially relevant for SM89 Triton fallback |

### Update to #5
The FlashKDA issue concluded "NOT compatible" based on base Qwen3, which uses standard GQA. However, Qwen3.5 and Qwen3-Next both use GatedDeltaNet layers built on the same mathematical primitive, the gated delta rule. FlashKDA's `chunk_kda` operation is architecturally similar to vLLM's `chunk_gated_delta_rule`. FlashKDA targets SM90+ (as does the FlashInfer GDN backend), whereas our RTX 4090 (SM89) currently uses the Triton/FLA backend for GDN, so FlashKDA's CUTLASS kernels could provide a faster SM89 path only if they were ported to that architecture.

---

*Study based on vLLM main branch as of 2026-04-23. Components are actively evolving.*
