Context
Qwen team released Qwen3.6-35B-A3B (April 2026). Same architecture as our current Qwen3.5 but with significant benchmark improvements:
| Benchmark | 3.5 | 3.6 | Gain |
|---|---|---|---|
| SWE-bench Verified | 70.0 | 73.4 | +3.4 |
| Terminal-Bench 2.0 | 40.5 | 51.5 | +11 |
| NL2Repo | 20.5 | 29.4 | +9 |
| QwenWebBench | 978 | 1397 | +43% |
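The Gain column mixes absolute point differences (first three rows) with a relative change (QwenWebBench). As a quick sanity check, a minimal sketch that recomputes both forms from the scores in the table; the helper name gains is illustrative only:

```python
# Scores copied from the benchmark table: (Qwen3.5, Qwen3.6).
SCORES = {
    "SWE-bench Verified": (70.0, 73.4),
    "Terminal-Bench 2.0": (40.5, 51.5),
    "NL2Repo": (20.5, 29.4),
    "QwenWebBench": (978, 1397),
}


def gains(old: float, new: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain in percent)."""
    return new - old, (new - old) / old * 100.0


for name, (old, new) in SCORES.items():
    absolute, relative = gains(old, new)
    print(f"{name:<20} {absolute:+8.1f} {relative:+7.1f}%")
```

Printing both columns for every row makes it easier to compare the point-based benchmarks with QwenWebBench, whose gain is only reported as a percentage.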
New feature: preserve_thinking for multi-turn agentic workflows (retains reasoning context).
Quantization Assessment
No AWQ 4-bit published yet on HuggingFace. Available alternatives:
| Provider | Format | Status | Verdict |
|---|---|---|---|
| Qwen/Qwen3.6-35B-A3B-FP8 | FP8 (official) | ~35 GB → ~100K KV tokens (vs our 335K) | Acceptable fallback |
| unsloth/Qwen3.6-35B-A3B-GGUF | GGUF | llama.cpp only | Not for vLLM |
| mlx-community/* | MLX 4/5/6/8-bit | Apple Silicon only | Not applicable |
| caiovicentino1/HLWQ-CT-INT4 | compressed-tensors INT4 | Requires custom fork + --enforce-eager (3-4x slower) | Rejected |
| caiovicentino1/HLWQ-Q5 | PolarQuant 5-bit | Research only (per model card) | Rejected |
| mmangkad/Qwen3.6-NVFP4 | NVIDIA ModelOpt FP4 | Untested on 4090 | Risky |
| SocialLocalMobile/HQQ-INT4 | ExecuTorch HQQ | Not vLLM-native | Risky |
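The FP8 row trades weight size against KV-cache capacity: larger weights leave less VRAM for cached tokens. A rough sketch of that relationship follows. Every number in it is a placeholder, not a measurement from our deployment, and it ignores activation memory and allocator overhead, so it will not reproduce the ~100K / 335K figures:

```python
def kv_bytes_per_token(num_attention_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache per token: one K and one V vector per attention layer."""
    return 2 * num_attention_layers * num_kv_heads * head_dim * dtype_bytes


def kv_token_budget(usable_vram_gb: float, weights_gb: float,
                    bytes_per_token: int) -> int:
    """Tokens that fit in the VRAM left over once the weights are loaded."""
    free_bytes = max(usable_vram_gb - weights_gb, 0.0) * 1024**3
    return int(free_bytes // bytes_per_token)


# HYPOTHETICAL architecture and VRAM values, for illustration only.
per_token = kv_bytes_per_token(num_attention_layers=10, num_kv_heads=2,
                               head_dim=256, dtype_bytes=1)  # 1 byte = fp8 KV cache
for label, weights_gb in [("4-bit weights (assumed 20 GB)", 20.0),
                          ("FP8 weights (~35 GB)", 35.0)]:
    budget = kv_token_budget(usable_vram_gb=44.0, weights_gb=weights_gb,
                             bytes_per_token=per_token)
    print(f"{label}: {budget} tokens")
```

Substituting the real layer count, KV head count, and usable VRAM should show how much context the FP8 fallback costs relative to a 4-bit checkpoint.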
Recommendation: Create our own AWQ 4-bit using llmcompressor (same approach as quantize_zwz_8b.py). Qwen3.6 architecture is identical to Qwen3.5 (same Qwen3_5MoeForConditionalGeneration class per model card), so cyankiwi's existing AWQ approach should transfer.
Migration Roadmap
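Phase 1: Quantization (~1 day)
- Script: quantize_zwz_8b.py → quantize_qwen36_35b_a3b.py
- Model class: Qwen3_5MoeForConditionalGeneration or Qwen3_6... (check model card)
- Module patterns: re:.*visual.*, re:.*merger.*
- lm_head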
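The sketch below assumes that re:.*visual.*, re:.*merger.* and lm_head form an llmcompressor-style ignore list, where a re: prefix marks a regular expression and any other entry is an exact module name. It previews which modules such a list would skip before a long quantization run starts. The module names are hypothetical, and is_ignored is an illustrative helper, not an llmcompressor API:

```python
import re

# Ignore entries named in the plan. "re:" marks a regular expression;
# anything else is treated here as an exact module name.
IGNORE = ["re:.*visual.*", "re:.*merger.*", "lm_head"]


def is_ignored(module_name: str, ignore: list[str] = IGNORE) -> bool:
    """Return True if module_name matches any ignore entry."""
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# HYPOTHETICAL module names; list the real ones with model.named_modules().
for name in [
    "model.visual.blocks.0.attn.qkv",
    "model.visual.merger.mlp.0",
    "model.language_model.layers.0.mlp.experts.0.gate_proj",
    "lm_head",
]:
    print(f"{name}: {'skip' if is_ignored(name) else 'quantize'}")
```

Running the same check against the real module list is a cheap way to confirm that the vision tower and output head stay in full precision while the MoE experts are quantized.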
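Phase 2: Validation (~2 hours)
- medium-qwen36-moe.yml profile (copy from qwen35)
- benchmark_current_models.py (decode speed, concurrent, tool calling, vision)
- preserve_thinking feature for Roo multi-turn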
Phase 3: Production cutover
- myia_vllm/archives/
- docker volume rm profiles_vllm-compile-cache-qwen35
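Phase 4: Downstream client updates
Clients that reference model name qwen3.5-35b-a3b:
- myia_vllm/mcp/sk_agent_config.json + d:\roo-extensions\mcps\internal\servers\sk-agent\sk_agent_config.json
- VLLM_MODEL_QWEN35_MOE → VLLM_MODEL_QWEN36_MOE
- served-model-name qwen3.5-35b-a3b → qwen3.6-35b-a3b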
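For JSON client configs such as sk_agent_config.json, the rename can be applied mechanically. This is a minimal sketch that assumes the model name appears as a plain string somewhere in the JSON tree; the sample config shape is hypothetical and the real layout may differ:

```python
import json

OLD_NAME = "qwen3.5-35b-a3b"
NEW_NAME = "qwen3.6-35b-a3b"


def rename_model(node, old: str = OLD_NAME, new: str = NEW_NAME):
    """Return a copy of a parsed JSON value with every old model name replaced."""
    if isinstance(node, dict):
        return {key: rename_model(value, old, new) for key, value in node.items()}
    if isinstance(node, list):
        return [rename_model(item, old, new) for item in node]
    if isinstance(node, str):
        return node.replace(old, new)
    return node  # numbers, booleans, and null pass through unchanged


# HYPOTHETICAL config shape, for illustration only.
config = json.loads('{"models": [{"id": "qwen3.5-35b-a3b", "max_tokens": 4096}]}')
print(json.dumps(rename_model(config), indent=2))
```

Because the function returns a new structure instead of mutating the input, the original and updated configs can be diffed before anything is written back to disk.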
Phase 5: Sampling re-calibration
Risk Assessment
Low risk:
- Same architecture → existing flags should work (--enable-expert-parallel, --kv-cache-dtype fp8, --tool-call-parser qwen3_coder, --reasoning-parser qwen3)
- Marlin MoE kernels should work identically
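To make the carried-over flags concrete, here is a sketch that assembles a vllm serve command line from them. The checkpoint path is a hypothetical placeholder, and our real profile may pass additional options that are not listed in this document:

```python
import shlex


def build_vllm_command(model_path: str, served_name: str) -> list[str]:
    """Assemble a `vllm serve` argv with the flags carried over from Qwen3.5."""
    return [
        "vllm", "serve", model_path,
        "--served-model-name", served_name,
        "--enable-expert-parallel",
        "--kv-cache-dtype", "fp8",
        "--tool-call-parser", "qwen3_coder",
        "--reasoning-parser", "qwen3",
    ]


# HYPOTHETICAL local path to the quantized checkpoint.
command = build_vllm_command("/models/qwen3.6-35b-a3b-awq", "qwen3.6-35b-a3b")
print(shlex.join(command))
```

Keeping the flags in one list makes it straightforward to diff the Qwen3.5 and Qwen3.6 launch commands during validation.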
Medium risk:
- preserve_thinking is a new feature; test multi-turn behavior carefully
- Our own AWQ quantization may be lower quality than cyankiwi's (different calibration dataset)
- MTP speculative decoding: Qwen3.6 supports qwen3_next_mtp natively; worth testing
High risk:
- None identified for architecture compatibility
Open Questions
- Should we wait for cyankiwi's AWQ (historical lag: 2-3 months after model release) or quantize now?
- Should we try Qwen's official FP8 as intermediate while AWQ is being prepared?
- Should we enable MTP speculative decoding (qwen3_next_mtp) on Qwen3.6? Test acceptance rate first.
- Does preserve_thinking impact our OWUI wrapper designs (especially -fast variants that disable thinking)?
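For the MTP question, the acceptance rate can be derived from three counters. This sketch assumes we can obtain the number of draft tokens proposed, the number accepted, and the number of verification steps from server metrics or logs; the counter values and the function name are hypothetical:

```python
def acceptance_stats(drafted: int, accepted: int, steps: int) -> dict[str, float]:
    """Summarize speculative decoding counters.

    drafted:  draft tokens proposed by the MTP head
    accepted: draft tokens the target model kept
    steps:    verification passes of the target model
    """
    if drafted <= 0 or steps <= 0:
        raise ValueError("drafted and steps must be positive")
    return {
        "acceptance_rate": accepted / drafted,
        # Each step emits the accepted drafts plus one token from the target model.
        "tokens_per_step": (accepted + steps) / steps,
    }


# HYPOTHETICAL counters, for illustration only.
stats = acceptance_stats(drafted=2000, accepted=1400, steps=1000)
print(f"acceptance rate: {stats['acceptance_rate']:.2f}")
print(f"tokens per step: {stats['tokens_per_step']:.2f}")
```

A tokens-per-step value close to 1 would suggest the draft overhead is not paying for itself; the threshold for enabling MTP remains a judgment call for the team.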
Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com