Migration roadmap: Qwen3.5-35B-A3B → Qwen3.6-35B-A3B

## Context

Qwen team released **Qwen3.6-35B-A3B** (April 2026). Same architecture as our current Qwen3.5 but with significant benchmark improvements:

| Benchmark | 3.5 | 3.6 | Gain |
|-----------|-----|-----|------|
| SWE-bench Verified | 70.0 | **73.4** | +3.4 |
| Terminal-Bench 2.0 | 40.5 | **51.5** | **+11** |
| NL2Repo | 20.5 | **29.4** | **+9** |
| QwenWebBench | 978 | **1397** | **+43%** |

**New feature**: `preserve_thinking` for multi-turn agentic workflows (retains reasoning context).

## Quantization Assessment

**No AWQ 4-bit published yet** on HuggingFace. Available alternatives:

| Provider | Format | Status | Verdict |
|----------|--------|--------|---------|
| Qwen/Qwen3.6-35B-A3B-FP8 | FP8 (official) | ~35 GB → ~100K KV tokens (vs our 335K) | Acceptable fallback |
| unsloth/Qwen3.6-35B-A3B-GGUF | GGUF | llama.cpp only | Not for vLLM |
| mlx-community/* | MLX 4/5/6/8-bit | Apple Silicon only | Not applicable |
| caiovicentino1/HLWQ-CT-INT4 | compressed-tensors INT4 | Requires custom fork + `--enforce-eager` (3-4x slower) | **Rejected** |
| caiovicentino1/HLWQ-Q5 | PolarQuant 5-bit | Research only (per model card) | **Rejected** |
| mmangkad/Qwen3.6-NVFP4 | NVIDIA ModelOpt FP4 | Untested on 4090 | Risky |
| SocialLocalMobile/HQQ-INT4 | ExecuTorch HQQ | Not vLLM-native | Risky |

**Recommendation**: **Create our own AWQ 4-bit** using llmcompressor (same approach as `quantize_zwz_8b.py`). Qwen3.6 architecture is identical to Qwen3.5 (same `Qwen3_5MoeForConditionalGeneration` class per model card), so cyankiwi's existing AWQ approach should transfer.

## Migration Roadmap

### Phase 1: Quantization (~1 day)
- [ ] Adapt `quantize_zwz_8b.py` → `quantize_qwen36_35b_a3b.py`
  - Change class to `Qwen3_5MoeForConditionalGeneration` or `Qwen3_6...` (check model card)
  - Keep vision encoder and merger in BF16 (`re:.*visual.*`, `re:.*merger.*`)
  - Exclude `lm_head`
  - 512 calibration samples Open-Platypus (or add vision samples)
- [ ] Run quantization on BF16 base (~35B × 2 bytes = 70 GB download)
- [ ] Expected output: ~19-23 GB AWQ 4-bit model
- [ ] Estimated time: 3-8 hours on RTX 4090 (MoE quantization is slower than dense)
- [ ] **Alternative**: wait 1-2 weeks for cyankiwi/QuantTrio to publish

### Phase 2: Validation (~2 hours)
- [ ] Create `medium-qwen36-moe.yml` profile (copy from qwen35)
- [ ] Deploy in parallel or swap
- [ ] Run `benchmark_current_models.py` (decode speed, concurrent, tool calling, vision)
- [ ] Compare vs Qwen3.5 baseline (117 tok/s decode, 311 concurrent, 910ms tool)
- [ ] Quality tests: GSM8K, IFEval, MME (our existing benchmarks)
- [ ] Test `preserve_thinking` feature for Roo multi-turn

### Phase 3: Production cutover
- [ ] Update CLAUDE.md with new version info
- [ ] Archive Qwen3.5 profile to `myia_vllm/archives/`
- [ ] Clear old compile cache: `docker volume rm profiles_vllm-compile-cache-qwen35`
- [ ] Deploy Qwen3.6 as primary on port 5002
- [ ] Update watchdog script (container name references)

### Phase 4: Downstream client updates
Clients that reference model name `qwen3.5-35b-a3b`:
- [ ] **roo-extensions** (separate repo): Roo "simple" profiles, sk_agent model config, roo-state-manager condensation .env
- [ ] **OWUI model wrappers** (8 wrappers): Qwen_think, Qwen_think-code, Qwen_think-reason, Qwen_instruct, Local.qwen3.5-35b-a3b (+ variants)
- [ ] **SK Agent config**: `myia_vllm/mcp/sk_agent_config.json` + `d:\roo-extensions\mcps\internal\servers\sk-agent\sk_agent_config.json`
- [ ] **nanoClaw** (if applicable): check if it references the model
- [ ] **Dashboard clients** (OWUI third-party models, APIs)
- [ ] Update environment variable `VLLM_MODEL_QWEN35_MOE` → `VLLM_MODEL_QWEN36_MOE`
- [ ] Update `served-model-name qwen3.5-35b-a3b` → `qwen3.6-35b-a3b`

### Phase 5: Sampling re-calibration
- [ ] Benchmark repetition with existing sampling params (pp=1.5, rp=1.1)
- [ ] Qwen3.6 recommends same params (temp 0.7, top_p 0.95, top_k 20, pp 1.5) for Q4
- [ ] Adjust OWUI wrappers if needed based on new benchmarks

## Risk Assessment

**Low risk**:
- Same architecture → existing flags should work (`--enable-expert-parallel`, `--kv-cache-dtype fp8`, `--tool-call-parser qwen3_coder`, `--reasoning-parser qwen3`)
- Marlin MoE kernels should work identically

**Medium risk**:
- `preserve_thinking` is a new feature — test multi-turn behavior carefully
- Our own AWQ quantization may be lower quality than cyankiwi's (different calibration dataset)
- MTP speculative decoding: Qwen3.6 supports `qwen3_next_mtp` natively — worth testing

**High risk**:
- None identified for architecture compatibility

## Open Questions

1. Should we wait for cyankiwi's AWQ (historical lag: 2-3 months after model release) or quantize now?
2. Should we try Qwen's official FP8 as intermediate while AWQ is being prepared?
3. Should we enable MTP speculative decoding (`qwen3_next_mtp`) on Qwen3.6? Test acceptance rate first.
4. Does `preserve_thinking` impact our OWUI wrapper designs (especially `-fast` variants that disable thinking)?

## References

- [Qwen3.6-35B-A3B model card](https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
- [Qwen3.6 blog](https://qwen.ai/blog?id=qwen3.6-35b-a3b)
- Related: jsboige/vllm#3 (z-lab DFlash/PARO evaluation)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Migration roadmap: Qwen3.5-35B-A3B → Qwen3.6-35B-A3B #4

Context

Quantization Assessment