This issue tracks the current open performance work for the Vulkan backend and orders it by preferred implementation sequence.
The ordering below prioritizes kernel-level optimizations with proven ROI (based on llama.cpp benchmarks) first, then structural/scheduling improvements.
NEW: Priority 0: Host-Side Orchestration Bottlenecks (Updated April 2025)
Based on detailed code review, the following issues represent the most likely root causes of the current ~8.5 tok/s throughput wall. These are host-side/backend-side issues, not shader problems.
0.1 #25 [Vulkan] Force device-local memory allocation on discrete GPUs
Effort: Low | Impact: Very High
Why first: The allocator allows fallback to host-visible memory on discrete GPUs, causing PCIe bandwidth bottlenecks for GEMM weights and KV-cache. This is a P0 issue that likely contributes significantly to the throughput wall.
Key fix: Change memory type preference chain to force eDeviceLocal only.
0.2 #26 [Vulkan] raw_ptr() unconditional synchronize_all() kills decode throughput
Effort: Low | Impact: Very High
Why here: Any CPU buffer access triggers a full GPU queue drain via synchronize_all(). This should only wait on the specific buffer's timeline semaphore.
Key fix: Replace global sync with targeted wait on buffer's last_semaphore.
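The cost difference can be modeled with timeline-semaphore counter values. A toy stall model (function names are illustrative, not the backend's API):

```cpp
#include <cassert>
#include <cstdint>

// How many in-flight submissions a CPU reader must stall on.

// Global drain: what an unconditional synchronize_all() costs -- the reader
// waits for every submission the queue has ever issued.
uint64_t stall_global(uint64_t submitted, uint64_t completed) {
  return submitted - completed;
}

// Targeted wait on the buffer's last_semaphore value: stall only until the
// last submission that actually wrote this buffer has retired; if it already
// has, the wait is free.
uint64_t stall_targeted(uint64_t last_semaphore, uint64_t completed) {
  return last_semaphore > completed ? last_semaphore - completed : 0;
}
```

With many submissions in flight during decode, the targeted wait is usually zero or near-zero, while the global drain serializes the whole pipeline.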
0.3 #27 [Vulkan] Matmul success path pays unnecessary transpose copy overhead
Effort: Medium | Impact: Medium-High
Even the successful matmul path writes to a transposed scratch buffer and then copies the result back with a transpose, adding an extra dispatch and extra memory bandwidth per GEMM.
Key fix: Add shader variant that writes directly to non-transposed output.
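The two paths are mathematically identical; the difference is purely in output indexing. A CPU-side sketch of both (illustrating the issue, not the shader code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mat = std::vector<float>;  // row-major, dims passed separately

// Current path (as described): the kernel writes C^T into scratch, then a
// second pass copies it back with a transpose -- one extra dispatch plus a
// full extra pass over C per GEMM.
Mat gemm_via_transposed_scratch(const Mat& A, const Mat& B,
                                size_t M, size_t K, size_t N) {
  Mat scratch(N * M, 0.f);  // holds C^T (N x M)
  for (size_t i = 0; i < M; ++i)
    for (size_t j = 0; j < N; ++j) {
      float acc = 0.f;
      for (size_t k = 0; k < K; ++k) acc += A[i * K + k] * B[k * N + j];
      scratch[j * M + i] = acc;  // transposed write
    }
  Mat C(M * N);
  for (size_t i = 0; i < M; ++i)  // the extra transpose copy
    for (size_t j = 0; j < N; ++j) C[i * N + j] = scratch[j * M + i];
  return C;
}

// Proposed variant: same math, but the kernel indexes the output row-major
// and writes C directly, skipping the scratch buffer and the copy entirely.
Mat gemm_direct(const Mat& A, const Mat& B, size_t M, size_t K, size_t N) {
  Mat C(M * N, 0.f);
  for (size_t i = 0; i < M; ++i)
    for (size_t j = 0; j < N; ++j) {
      float acc = 0.f;
      for (size_t k = 0; k < K; ++k) acc += A[i * K + k] * B[k * N + j];
      C[i * N + j] = acc;
    }
  return C;
}
```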
0.4 #28 [Vulkan] Copy fallback path triggers GPU readback + sync + CPU conversion
Effort: Low-Medium | Impact: High (if triggered)
Missing copy shaders cause fallback to GPU→CPU readback + sync + CPU conversion. If hit during decode, throughput collapses.
Key fix: Add missing conversion shaders; add tracing to detect fallback.
0.5 #29 [Vulkan] SDPA fast path does excessive layout fixup and cast churn
Effort: Medium | Impact: Medium
SDPA performs 2-5 copies per call for layout fixup and dtype casting; layouts and dtypes should instead be validated once, before the decode loop.
Key fix: Early-exit in cast functions; pre-decode layout validation.
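A sketch of the pre-decode validation and a per-call copy-cost model (the `TensorDesc` shape and function names are illustrative, not the backend's types):

```cpp
#include <cassert>
#include <string>
#include <vector>

struct TensorDesc {
  std::string dtype;
  bool contiguous;
};

// Hypothetical pre-decode check: validate SDPA inputs once, before the decode
// loop, so the per-token fast path can skip layout fixup and casting entirely.
bool sdpa_inputs_ready(const std::vector<TensorDesc>& qkv,
                       const std::string& want_dtype) {
  for (const auto& t : qkv)
    if (t.dtype != want_dtype || !t.contiguous) return false;
  return true;
}

// Per-call cost model: with validated inputs the fast path does 0 fixup
// copies; otherwise it pays one copy per offending tensor (the 2-5 copies
// per call observed in the issue).
int fixup_copies(const std::vector<TensorDesc>& qkv,
                 const std::string& want_dtype) {
  int copies = 0;
  for (const auto& t : qkv) {
    if (t.dtype != want_dtype) ++copies;  // cast churn
    if (!t.contiguous) ++copies;          // layout fixup
  }
  return copies;
}
```

The early-exit in the cast functions is the degenerate case of the same idea: when `dtype` already matches, return the input untouched instead of dispatching a conversion kernel.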
NEW: Priority 0.5: Gaps in Closed Issues (April 2025)
The following issues were previously closed but code review reveals gaps where the implementation doesn't cover all paths or the work wasn't fully completed.
0.5.1 #30 [Vulkan] Copy fallback path bypasses persistent staging allocator and triggers sync
Original: #3 (Persistent staging allocator)
Gap: try_host_vector_cast_copy() does fresh allocation + sync instead of using the staging arena
Impact: Copy fallback path destroys throughput if triggered
0.5.2 #31 [Vulkan] Resource-aware decode classification not used by CPU access paths
Original: #15 (Resource-aware decode synchronization)
Gap: DecodeResourceClass exists but raw_ptr() doesn't use it for sync decisions
Impact: Missed opportunity to skip sync for ReadOnlyWeight buffers
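The missing wiring is small. A sketch of the sync decision raw_ptr() could make once it consults the classification (enum values mirror the issue's description; the function is illustrative):

```cpp
#include <cassert>

// Mirrors the described DecodeResourceClass idea. Names are illustrative.
enum class DecodeResourceClass { ReadOnlyWeight, KVCache, Activation, Unknown };

// Weights are written once at load time, so a CPU read of a ReadOnlyWeight
// buffer cannot race with in-flight decode work and needs no sync at all.
// Everything else keeps the conservative wait.
bool raw_ptr_needs_sync(DecodeResourceClass cls, bool gpu_work_in_flight) {
  if (!gpu_work_in_flight) return false;
  return cls != DecodeResourceClass::ReadOnlyWeight;
}
```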
0.5.3 #32 [Vulkan] Token-level decode batching still inserts excessive barriers
Original: #14 (Token-level execution regions)
Gap: Decode batching infrastructure exists, but the hazard model is still conservative
Impact: Barriers per token may still exceed the #14 acceptance criteria
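The over-counting in a conservative model comes mostly from treating back-to-back reads as hazards. A minimal hazard tracker showing where the precise model saves barriers (the `Access` shape is illustrative):

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// A barrier is required only for real hazards on the same buffer
// (read-after-write, write-after-read, write-after-write); consecutive
// reads (RAR) are safe and need no barrier. A model that barriers on
// every buffer reuse overcounts exactly on the RAR case.
struct Access {
  int buffer;
  bool write;
};

int barriers_needed(const std::vector<Access>& ops) {
  std::unordered_map<int, bool> last_write;  // buffer -> was last access a write?
  int barriers = 0;
  for (const auto& op : ops) {
    auto it = last_write.find(op.buffer);
    if (it != last_write.end() && (op.write || it->second))
      ++barriers;  // any hazard except read-after-read
    last_write[op.buffer] = op.write;
  }
  return barriers;
}
```

During decode, weight buffers are read by many dispatches per token, so eliding RAR barriers alone can remove most of the per-token barrier count.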
0.5.4 #33 [Vulkan] Per-dispatch overhead in dispatch_with_spec still high despite descriptor reuse
Original: #4 (Descriptor set churn)
Gap: Descriptor allocation is solved, but per-dispatch driver calls remain
Impact: vkUpdateDescriptorSets and vkCmdBind* calls per dispatch are still expensive
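One standard mitigation is eliding redundant binds: track what is currently bound on the command buffer and skip the vkCmdBindPipeline / descriptor update when it hasn't changed. A toy call-count model of that idea (mock handles, not Vulkan types):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Mock dispatch stream: count "driver calls" when rebinding pipeline and
// descriptor state on every dispatch vs. only when the state actually changes.
struct Dispatch {
  uint64_t pipeline;
  uint64_t descriptors;
};

int driver_calls(const std::vector<Dispatch>& ds, bool elide_redundant) {
  uint64_t bound_pipe = 0, bound_desc = 0;  // 0 = nothing bound yet
  int calls = 0;
  for (const auto& d : ds) {
    if (!elide_redundant || d.pipeline != bound_pipe) {
      ++calls;  // vkCmdBindPipeline
      bound_pipe = d.pipeline;
    }
    if (!elide_redundant || d.descriptors != bound_desc) {
      ++calls;  // vkUpdateDescriptorSets + vkCmdBindDescriptorSets
      bound_desc = d.descriptors;
    }
  }
  return calls;
}
```

Decode workloads dispatch the same few pipelines in a fixed order every token, so the elided version approaches O(distinct pipelines) instead of O(dispatches).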
Priority 1: Kernel Optimizations (High ROI from llama.cpp)
These have proven 2-5x speedups in llama.cpp Vulkan backend with clear implementation paths.
1.1 #23 [Vulkan] Implement Flash Attention Scalar Refactor for AMD/Intel optimization
Effort: Medium-High | Impact: Very High (2-5x on AMD/Intel)
Why here: llama.cpp PR #19625 shows 122-193% improvement on AMD Pro VII for large context lengths. The refactor adds vendor-specific tuning:
AMD RDNA: Use wave32 subgroup size when N=1
Intel: Disable subgroup operations in favor of shared memory
Row splitting within workgroups for better utilization
FP16 shader variants for broader hardware support
Reference: llama.cpp PR #19625 (see vulkan-performance-prs.md for detailed analysis)
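The vendor-specific tuning above amounts to a small dispatch-time decision table. A sketch in that spirit (the constants and the table itself are assumptions for illustration, not llama.cpp's exact logic):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Per-dispatch tuning knobs the C++ side would feed the shader via
// specialization constants.
struct FaTuning {
  uint32_t subgroup_size;
  bool use_subgroup_ops;
};

// Illustrative decision table: AMD RDNA prefers wave32 for single-token (N=1)
// decode, Intel falls back to shared-memory reductions, everything else keeps
// a warp/wave of 32 with subgroup ops enabled.
FaTuning tune_flash_attention(const std::string& vendor, uint32_t batch_n) {
  if (vendor == "AMD") return {batch_n == 1 ? 32u : 64u, true};
  if (vendor == "Intel") return {32u, false};
  return {32u, true};
}
```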
1.2 #24 [Vulkan] Optimize Softmax kernel with subgroup operations
Effort: Medium | Impact: High (decode hot path)
Why here: Softmax is on the critical path for every attention layer during decode. llama.cpp PR #10301 optimizes with:
Subgroup operations for reduction (replacing barriers)
Vectorized loads/stores
Vendor-specific workgroup sizing
Reference: llama.cpp PR #10301
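The structural change can be modeled on the CPU: the max and sum reductions inside softmax become pairwise tree reductions (the shape subgroup operations give you in-register) instead of serial loops separated by barriers. A reference sketch, not the shader itself:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Pairwise tree reduction in log2(n) rounds -- the access pattern that
// subgroupMax/subgroupAdd implement in hardware. Works for any length.
static float tree_reduce(std::vector<float> v, bool take_max) {
  for (size_t stride = 1; stride < v.size(); stride *= 2)
    for (size_t i = 0; i + stride < v.size(); i += 2 * stride)
      v[i] = take_max ? std::max(v[i], v[i + stride]) : v[i] + v[i + stride];
  return v[0];
}

// Numerically stable softmax built on the two tree reductions.
std::vector<float> softmax(const std::vector<float>& x) {
  float m = tree_reduce(x, /*take_max=*/true);
  std::vector<float> e(x.size());
  for (size_t i = 0; i < x.size(); ++i) e[i] = std::exp(x[i] - m);
  float s = tree_reduce(e, /*take_max=*/false);
  for (float& v : e) v /= s;
  return e;
}
```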
1.3 #22 [Vulkan] Support direct conv2d and coopmat2 optimizations in C++ dispatch
Effort: Medium-High | Impact: High for vision models (2-5x speedup)
The shader already has COOPMAT2 and UNROLL flags, but C++ dispatch needs updates to:
Create specialization-constant pipelines properly
Route vision workloads to direct convolution (avoiding im2col overhead)
Enable VK_NV_cooperative_matrix2 for NVIDIA tensor cores
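To see what routing around im2col buys, compare the two equivalent formulations. A toy single-channel, stride-1, "valid"-padding example (CPU reference code, not the shaders):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using V = std::vector<float>;

// Direct convolution: read the input in place, no intermediate buffer.
V conv2d_direct(const V& img, size_t H, size_t W,
                const V& ker, size_t Kh, size_t Kw) {
  size_t OH = H - Kh + 1, OW = W - Kw + 1;
  V out(OH * OW, 0.f);
  for (size_t oy = 0; oy < OH; ++oy)
    for (size_t ox = 0; ox < OW; ++ox)
      for (size_t ky = 0; ky < Kh; ++ky)
        for (size_t kx = 0; kx < Kw; ++kx)
          out[oy * OW + ox] += img[(oy + ky) * W + ox + kx] * ker[ky * Kw + kx];
  return out;
}

// im2col path: materialize every patch (a Kh*Kw-fold blow-up of the input
// that must be written out and streamed back in), then run a plain GEMV.
V conv2d_im2col(const V& img, size_t H, size_t W,
                const V& ker, size_t Kh, size_t Kw) {
  size_t OH = H - Kh + 1, OW = W - Kw + 1, P = Kh * Kw;
  V cols(OH * OW * P);  // the overhead the direct path avoids
  for (size_t oy = 0; oy < OH; ++oy)
    for (size_t ox = 0; ox < OW; ++ox)
      for (size_t ky = 0; ky < Kh; ++ky)
        for (size_t kx = 0; kx < Kw; ++kx)
          cols[(oy * OW + ox) * P + ky * Kw + kx] = img[(oy + ky) * W + ox + kx];
  V out(OH * OW, 0.f);
  for (size_t r = 0; r < OH * OW; ++r)
    for (size_t p = 0; p < P; ++p) out[r] += cols[r * P + p] * ker[p];
  return out;
}
```

Both produce identical results; the dispatch-side work in #22 is routing vision workloads to the first formulation (and, where available, the coopmat2-backed GEMM) instead of always paying the im2col materialization.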
References: llama.cpp PRs #14933, #14982, #16978, #14316
Priority 2: Structural/Scheduling Optimizations
These reduce overhead in the command submission and synchronization path.
2.1 #17 [Vulkan] Lower common decode primitive sequences into fused execution regions
Effort: High | Impact: Medium-High
Why here: After kernel optimizations are in place, fusion can compound the benefits by reducing dispatch overhead.
2.2 #16 [Vulkan] Add native decode hot path for attention and KV cache update
Effort: High | Impact: Medium-High
Why here: Specialized decode paths work best when the underlying kernels (Flash Attention, Softmax) are already optimized, and when memory is in device-local VRAM (#25).
2.3 #5 [Vulkan] Add VkPipelineCache persistence and kernel warmup
Effort: Medium | Impact: Medium (startup/first-run latency)
Why here: Startup optimization is less critical than steady-state decode performance, but important for production use.
Priority 3: Future/Model-Specific Optimizations
3.1 GATED_DELTA_NET Operation (Qwen3.5 models)
Effort: High | Impact: Model-specific (17-22% speedup for Qwen3.5)
New operation for Qwen3.5/Qwen3-Next models. Only pursue if Qwen3.5 model support is prioritized.
Reference: llama.cpp PR #20334 (see vulkan-performance-prs.md)
Completed / Closed Issues
Adjacent / Situational
#8 [Vulkan Bug] Wrong-device selection on multi-GPU systems
This can have major real-world performance impact on mixed-GPU systems, but it is better treated as device-selection correctness with performance consequences rather than part of the core Vulkan perf-program ordering above.
llama.cpp Vulkan Performance PR Reference
The following llama.cpp Vulkan PRs have been analyzed and cross-referenced with MLX issues:
See vulkan-performance-prs.md in the repo root for detailed diffs and analysis of each PR.
Performance Investigation Notes
April 2025 Code Review Findings:
After detailed analysis of allocator.cpp, matmul.cpp, copy.cpp, fast.cpp, device.cpp, and kernels.cpp:
The ~8.5 tok/s wall is NOT primarily shader-bound; the shaders appear fine.
Root causes are host-side:
raw_ptr() - Issue #26: [Vulkan] raw_ptr() unconditional synchronize_all() kills decode throughput
Gaps in closed issues also contribute:
Recommended investigation order:
Notes