Migration roadmap: Qwen3.5-35B-A3B → Qwen3.6-35B-A3B #4

@jsboige

Context

The Qwen team released Qwen3.6-35B-A3B (April 2026). It has the same architecture as our current Qwen3.5 but significantly better benchmark results:

| Benchmark | Qwen3.5 | Qwen3.6 | Gain |
|---|---|---|---|
| SWE-bench Verified | 70.0 | 73.4 | +3.4 |
| Terminal-Bench 2.0 | 40.5 | 51.5 | +11.0 |
| NL2Repo | 20.5 | 29.4 | +8.9 |
| QwenWebBench | 978 | 1397 | +43% |

New feature: preserve_thinking for multi-turn agentic workflows (retains reasoning context).

Quantization Assessment

No AWQ 4-bit published yet on HuggingFace. Available alternatives:

| Provider | Format | Status | Verdict |
|---|---|---|---|
| Qwen/Qwen3.6-35B-A3B-FP8 | FP8 (official) | ~35 GB → ~100K KV tokens (vs our 335K) | Acceptable fallback |
| unsloth/Qwen3.6-35B-A3B-GGUF | GGUF | llama.cpp only | Not for vLLM |
| mlx-community/* | MLX 4/5/6/8-bit | Apple Silicon only | Not applicable |
| caiovicentino1/HLWQ-CT-INT4 | compressed-tensors INT4 | Requires custom fork + --enforce-eager (3-4x slower) | Rejected |
| caiovicentino1/HLWQ-Q5 | PolarQuant 5-bit | Research only (per model card) | Rejected |
| mmangkad/Qwen3.6-NVFP4 | NVIDIA ModelOpt FP4 | Untested on 4090 | Risky |
| SocialLocalMobile/HQQ-INT4 | ExecuTorch HQQ | Not vLLM-native | Risky |

Recommendation: Create our own AWQ 4-bit using llmcompressor (same approach as quantize_zwz_8b.py). Qwen3.6 architecture is identical to Qwen3.5 (same Qwen3_5MoeForConditionalGeneration class per model card), so cyankiwi's existing AWQ approach should transfer.

Migration Roadmap

Phase 1: Quantization (~1 day)

  • Adapt quantize_zwz_8b.py → quantize_qwen36_35b_a3b.py
    • Change class to Qwen3_5MoeForConditionalGeneration or Qwen3_6... (check model card)
    • Keep vision encoder and merger in BF16 (re:.*visual.*, re:.*merger.*)
    • Exclude lm_head
    • 512 calibration samples from Open-Platypus (optionally add vision samples)
  • Run quantization on BF16 base (~35B × 2 bytes = 70 GB download)
  • Expected output: ~19-23 GB AWQ 4-bit model
  • Estimated time: 3-8 hours on RTX 4090 (MoE quantization is slower than dense)
  • Alternative: wait 1-2 weeks for cyankiwi/QuantTrio to publish
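The ignore rules and the size estimate in this phase can be sanity-checked without loading the model. The sketch below uses only the standard library: it mimics the `re:` prefix convention from the bullets with Python's `re` module (it is not llmcompressor's own matcher), and the module names, the group size of 128, and the ~1B parameters left in BF16 are illustrative assumptions rather than values read from the checkpoint.

```python
import re

# Ignore rules from the Phase 1 checklist: "re:" marks a regex, anything else
# is treated as an exact module name. This mimics the convention only.
IGNORE = ["re:.*visual.*", "re:.*merger.*", "lm_head"]


def is_ignored(module_name: str, patterns=IGNORE) -> bool:
    """Return True if the module should stay in BF16 (not quantized)."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif module_name == pattern:
            return True
    return False


def estimate_size_gb(total_params: float, bf16_params: float,
                     bits: int = 4, group_size: int = 128) -> float:
    """Rough checkpoint size: quantized weights plus per-group metadata.

    Assumes one 16-bit scale and one `bits`-wide zero point per group.
    """
    quant_params = total_params - bf16_params
    bits_per_weight = bits + (16 + bits) / group_size
    quant_bytes = quant_params * bits_per_weight / 8
    bf16_bytes = bf16_params * 2
    return (quant_bytes + bf16_bytes) / 1e9


# Hypothetical module names, for illustration only.
for name in ["model.visual.blocks.0.attn.qkv",
             "model.visual.merger.mlp.0",
             "lm_head",
             "model.language_model.layers.0.mlp.experts.0.down_proj"]:
    print(f"{name}: {'keep BF16' if is_ignored(name) else 'quantize'}")

# 35B parameters total, assuming ~1B stay in BF16 (vision tower, lm_head).
print(f"estimated size: {estimate_size_gb(35e9, 1e9):.1f} GB")
```

Under these assumptions the estimate lands a little under 20 GB, at the low end of the ~19-23 GB range above; a larger BF16 remainder pushes it upward.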

Phase 2: Validation (~2 hours)

  • Create medium-qwen36-moe.yml profile (copy from qwen35)
  • Deploy in parallel or swap
  • Run benchmark_current_models.py (decode speed, concurrent, tool calling, vision)
  • Compare against the Qwen3.5 baseline (117 tok/s decode, 311 concurrent, 910 ms tool calling)
  • Quality tests: GSM8K, IFEval, MME (our existing benchmarks)
  • Test preserve_thinking feature for Roo multi-turn
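The baseline comparison in this phase could be made mechanical with a small helper like the one below. It is a sketch, not part of benchmark_current_models.py: the metric key names, the 5% tolerance, and the assumption that the "311 concurrent" figure is in tok/s are all illustrative.

```python
# Baseline figures quoted in the Phase 2 checklist for Qwen3.5.
BASELINE = {
    "decode_tok_s": 117.0,      # single-stream decode speed
    "concurrent_tok_s": 311.0,  # concurrent figure (unit assumed to be tok/s)
    "tool_call_ms": 910.0,      # tool-calling latency
}
# Metrics where a larger value is better; the rest are lower-is-better.
HIGHER_IS_BETTER = {"decode_tok_s", "concurrent_tok_s"}


def compare(candidate: dict, baseline: dict = BASELINE, tolerance: float = 0.05) -> dict:
    """Classify each metric as 'better', 'ok' (within tolerance) or 'regression'."""
    verdicts = {}
    for key, base in baseline.items():
        change = (candidate[key] - base) / base
        if key not in HIGHER_IS_BETTER:
            change = -change  # flip sign so positive always means improvement
        if change > tolerance:
            verdicts[key] = "better"
        elif change >= -tolerance:
            verdicts[key] = "ok"
        else:
            verdicts[key] = "regression"
    return verdicts


# Made-up candidate numbers, only to exercise the function.
example = {"decode_tok_s": 115.0, "concurrent_tok_s": 280.0, "tool_call_ms": 700.0}
print(compare(example))
```

Flipping the sign for latency keeps a single threshold for every metric, so adding the quality scores (GSM8K, IFEval, MME) later only requires listing them as higher-is-better.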

Phase 3: Production cutover

  • Update CLAUDE.md with new version info
  • Archive Qwen3.5 profile to myia_vllm/archives/
  • Clear old compile cache: docker volume rm profiles_vllm-compile-cache-qwen35
  • Deploy Qwen3.6 as primary on port 5002
  • Update watchdog script (container name references)

Phase 4: Downstream client updates

Clients that reference model name qwen3.5-35b-a3b:

  • roo-extensions (separate repo): Roo "simple" profiles, sk_agent model config, roo-state-manager condensation .env
  • OWUI model wrappers (8 wrappers): Qwen_think, Qwen_think-code, Qwen_think-reason, Qwen_instruct, Local.qwen3.5-35b-a3b (+ variants)
  • SK Agent config: myia_vllm/mcp/sk_agent_config.json + d:\roo-extensions\mcps\internal\servers\sk-agent\sk_agent_config.json
  • nanoClaw (if applicable): check if it references the model
  • Dashboard clients (OWUI third-party models, APIs)
  • Update environment variable VLLM_MODEL_QWEN35_MOE → VLLM_MODEL_QWEN36_MOE
  • Update served-model-name qwen3.5-35b-a3b → qwen3.6-35b-a3b
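Because the old name is spread across several repositories and config files, a quick scan can confirm that nothing was missed after the rename. The following is a minimal sketch: the list of file suffixes is an assumption about where references live, and the function name is hypothetical.

```python
from pathlib import Path

# Old identifiers named in the Phase 4 checklist.
OLD_NAMES = ("qwen3.5-35b-a3b", "VLLM_MODEL_QWEN35_MOE")
# File types to scan; an assumption about where the references live.
SUFFIXES = {".json", ".yml", ".yaml", ".env", ".md", ".py", ".ps1", ".sh"}


def find_references(root: str, names=OLD_NAMES, suffixes=SUFFIXES):
    """Yield (path, line_number, line) for every line mentioning an old name."""
    for path in sorted(Path(root).rglob("*")):
        # Path(".env").suffix is "", so dotfiles are matched by name as well.
        if not path.is_file():
            continue
        if path.suffix not in suffixes and path.name not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: skip rather than abort the scan
        for number, line in enumerate(text.splitlines(), start=1):
            if any(name in line for name in names):
                yield path, number, line.strip()
```

Running `list(find_references("."))` from the root of each repository (this one and roo-extensions) before and after the cutover gives a checklist of remaining references. The match is case-sensitive, which still catches wrapper names such as Local.qwen3.5-35b-a3b.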

Phase 5: Sampling re-calibration

  • Benchmark repetition with existing sampling params (pp=1.5, rp=1.1)
  • Qwen3.6 recommends same params (temp 0.7, top_p 0.95, top_k 20, pp 1.5) for Q4
  • Adjust OWUI wrappers if needed based on new benchmarks
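One way to put a number on "repetition" when re-benchmarking is the share of repeated word n-grams in a completion. The sketch below pairs that metric with the sampling parameters from the list, written as OpenAI-style request fields; reading "pp" as presence_penalty and "rp" as repetition_penalty is an assumption, as is the choice of 4-grams.

```python
from collections import Counter

# Sampling parameters from the checklist, written as OpenAI-style request fields.
# Reading "pp" as presence_penalty and "rp" as repetition_penalty is an assumption.
SAMPLING = {
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 20,
    "presence_penalty": 1.5,
    "repetition_penalty": 1.1,
}


def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that repeat an n-gram seen earlier in the text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


looping = "the model said " * 10
varied = "each word in this short sentence appears exactly once here"
print(repeated_ngram_ratio(looping), repeated_ngram_ratio(varied))
```

A degenerate loop scores close to 1 and ordinary prose close to 0, so the same prompts can be run against Qwen3.5 and Qwen3.6 with identical parameters and the averages compared before touching the OWUI wrappers.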

Risk Assessment

Low risk:

  • Same architecture → existing flags should work (--enable-expert-parallel, --kv-cache-dtype fp8, --tool-call-parser qwen3_coder, --reasoning-parser qwen3)
  • Marlin MoE kernels should work identically
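To make the carried-over flag set explicit when copying the profile, the launch command can be assembled from a single list and diffed against the Qwen3.5 one. This is a sketch only: the model path is a placeholder, and the real profile likely passes additional flags that are not listed in this issue.

```python
import shlex

# Flags the issue expects to carry over unchanged from the Qwen3.5 profile.
CARRIED_OVER_FLAGS = [
    "--enable-expert-parallel",
    "--kv-cache-dtype", "fp8",
    "--tool-call-parser", "qwen3_coder",
    "--reasoning-parser", "qwen3",
]


def build_command(model_path: str, served_name: str, port: int) -> list[str]:
    """Return the argv list for `vllm serve` with the carried-over flags."""
    return [
        "vllm", "serve", model_path,
        "--served-model-name", served_name,
        "--port", str(port),
        *CARRIED_OVER_FLAGS,
    ]


# "/models/qwen36-awq" is a placeholder path, not the real location.
cmd = build_command("/models/qwen36-awq", "qwen3.6-35b-a3b", 5002)
print(shlex.join(cmd))
```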

Medium risk:

  • preserve_thinking is a new feature — test multi-turn behavior carefully
  • Our own AWQ quantization may be lower quality than cyankiwi's (different calibration dataset)
  • MTP speculative decoding: Qwen3.6 supports qwen3_next_mtp natively — worth testing

High risk:

  • None identified for architecture compatibility

Open Questions

  1. Should we wait for cyankiwi's AWQ (historical lag: 2-3 months after model release) or quantize now?
  2. Should we try Qwen's official FP8 as intermediate while AWQ is being prepared?
  3. Should we enable MTP speculative decoding (qwen3_next_mtp) on Qwen3.6? Test acceptance rate first.
  4. Does preserve_thinking impact our OWUI wrapper designs (especially -fast variants that disable thinking)?
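For question 3, "acceptance rate" can be pinned down with simple counters. The sketch below shows the usual definitions; the three inputs are hypothetical counters that would have to be mapped to whatever the serving logs or metrics actually report.

```python
def mtp_stats(steps: int, drafted_per_step: int, accepted_total: int) -> dict:
    """Summarize speculative decoding counters.

    steps: number of target-model verification passes
    drafted_per_step: draft tokens proposed per pass
    accepted_total: draft tokens accepted across all passes
    """
    drafted_total = steps * drafted_per_step
    return {
        "acceptance_rate": accepted_total / drafted_total,
        # Each pass emits the accepted drafts plus one token from the target model.
        "tokens_per_step": 1 + accepted_total / steps,
    }


# Made-up counters, only to show the calculation.
print(mtp_stats(steps=1000, drafted_per_step=2, accepted_total=1400))
```

Tokens per step is an upper bound on the decode speedup, since it ignores the cost of producing the drafts; the measured tok/s against the 117 tok/s baseline remains the deciding number.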

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com
