Problem
Decode still pays too much overhead for generic primitive execution around attention and KV cache maintenance. Queue and scheduler fixes alone will not close the decode gap if the backend keeps using generic hot-path kernels and generic KV update flows.
This is not a duplicate of #6. Issue #6 focuses on matmul tuning tables and vendor-specific kernel selection. This issue focuses on the decode hot path itself: attention plus KV cache update/append behavior.
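For concreteness, the per-token work the decode hot path must cover is a single-token GQA attention step plus a KV cache append. The sketch below is illustrative only: the shapes, names, and NumPy implementation are assumptions for exposition, not MLX or backend code.

```python
import numpy as np

def decode_step(q, k_new, v_new, k_cache, v_cache, pos):
    """One decode step: append the new K/V row in place, then attend
    over the cached prefix. Illustrative shapes:
      q:               (n_q_heads, head_dim)  query for the new token
      k_new / v_new:   (n_kv_heads, head_dim)
      k_cache/v_cache: (n_kv_heads, max_len, head_dim) preallocated
      pos:             tokens already in the cache
    """
    n_q_heads, head_dim = q.shape
    n_kv_heads = k_cache.shape[0]
    group = n_q_heads // n_kv_heads  # GQA: query heads per KV head

    # KV append: an in-place row write, not a copy/concat of the cache.
    k_cache[:, pos] = k_new
    v_cache[:, pos] = v_new

    out = np.empty_like(q)
    scale = 1.0 / np.sqrt(head_dim)
    for h in range(n_q_heads):
        kv = h // group
        keys = k_cache[kv, : pos + 1]      # (pos+1, head_dim)
        scores = keys @ q[h] * scale       # (pos+1,)
        w = np.exp(scores - scores.max())  # causal: only cached prefix
        w /= w.sum()
        out[h] = w @ v_cache[kv, : pos + 1]
    return out
```

A fused decode kernel does this whole step in one dispatch; the generic path instead pays separate primitives (and their launch/sync costs) for the append, the score matmul, the softmax, and the value matmul.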
Why This Matters
Both reference runtimes invest heavily in inference-shaped hot paths:
- ggml-vulkan has substantial attention specialization and decode-oriented KV behavior
- Zinc hand-codes the token loop around attention, KV write, and immediate consumption
MLX needs a native decode hot path rather than paying repeated generic primitive overhead around these operations.
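The core difference between a generic KV update flow and a decode-oriented one is concat-and-copy versus in-place append. A minimal sketch, under assumed shapes (not the backend's actual code):

```python
import numpy as np

n_kv_heads, head_dim, max_len, pos = 2, 8, 64, 5
k_new = np.ones((n_kv_heads, head_dim))

# Generic flow: rebuild the cache with a concat. This copies all `pos`
# existing rows every token and allocates a fresh buffer each step,
# which is exactly the kind of copy/sync boundary decode traces show.
k_cache = np.zeros((n_kv_heads, pos, head_dim))
k_cache = np.concatenate([k_cache, k_new[:, None, :]], axis=1)  # O(pos) copy

# Decode-oriented flow: preallocate max_len once, then write one row
# in place per token. O(1) per step, no reallocation.
k_cache_prealloc = np.zeros((n_kv_heads, max_len, head_dim))
k_cache_prealloc[:, pos] = k_new
```

Both ggml-vulkan and Zinc follow the second shape: the cache is a long-lived buffer and decode writes one row per token.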
Tasks
Acceptance Criteria
- Qwen3 decode throughput improves materially
- Decode traces show fewer copy/sync boundaries around attention + KV work
- No correctness regressions on causal/GQA decode shapes
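One way to check the throughput criterion is a simple tokens-per-second harness around whatever decode step is under test. This is a hedged sketch: `step_fn` is a hypothetical callable standing in for one decode iteration of the model under test, not an MLX API.

```python
import time

def measure_decode_tps(step_fn, n_tokens=128, warmup=16):
    """Rough decode throughput in tokens/sec.

    step_fn: callable running one decode step (hypothetical hook;
             wire it to the model/backend being measured).
    """
    for _ in range(warmup):        # exclude compilation/cache warmup
        step_fn()
    t0 = time.perf_counter()
    for _ in range(n_tokens):
        step_fn()
    return n_tokens / (time.perf_counter() - t0)
```

Run the same harness before and after the hot-path work on the Qwen3 decode shapes to quantify "improves materially", and pair it with a trace diff for the copy/sync criterion.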
References
mlx-vulkan-reference-conclusions.md
references/ggml-vulkan-findings.md
references/zinc-findings.md
references/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp
references/llama.cpp/ggml/src/ggml-vulkan/vulkan-shaders/flash_attn*.comp
references/zinc/src/compute/attention.zig
references/zinc/src/compute/forward.zig (attention + KV write sequencing)
Related