
Performance tracking issue #19


Summary

This issue tracks the current open performance work for the Vulkan backend and orders it by preferred implementation sequence.

The ordering below prioritizes kernel-level optimizations with proven ROI (based on llama.cpp benchmarks) first, then structural/scheduling improvements.


NEW: Priority 0: Host-Side Orchestration Bottlenecks (Updated April 2025)

Based on detailed code review, the following issues represent the most likely root causes of the current ~8.5 tok/s throughput wall. These are host-side/backend-side issues, not shader problems.

0.1 #25 [Vulkan] Force device-local memory allocation on discrete GPUs

Effort: Low | Impact: Very High

Why first: The allocator allows fallback to host-visible memory on discrete GPUs, causing PCIe bandwidth bottlenecks for GEMM weights and KV-cache. This is a P0 issue that likely contributes significantly to the throughput wall.

Key fix: Change memory type preference chain to force eDeviceLocal only.
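A minimal sketch of the selection logic (the function name and signature are hypothetical; the flag values mirror `VkMemoryPropertyFlagBits` from the Vulkan spec): on a discrete GPU, accept only memory types with the device-local bit set and fail loudly rather than fall back to host-visible memory, since that silent fallback is what puts GEMM and KV-cache traffic on the PCIe bus.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Flag bits mirroring VkMemoryPropertyFlagBits (values per the Vulkan spec).
constexpr uint32_t DEVICE_LOCAL_BIT = 0x1; // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr uint32_t HOST_VISIBLE_BIT = 0x2; // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT

// Pick a memory type for weights/KV-cache on a discrete GPU: only accept
// device-local types. Returning nullopt (instead of silently taking a
// host-visible type) is the point of the fix.
std::optional<uint32_t> pick_device_local_type(
    const std::vector<uint32_t>& type_flags, // propertyFlags per memory type
    uint32_t memory_type_bits)               // from VkMemoryRequirements
{
    for (uint32_t i = 0; i < type_flags.size(); ++i) {
        bool allowed = (memory_type_bits >> i) & 1u;
        if (allowed && (type_flags[i] & DEVICE_LOCAL_BIT)) return i;
    }
    return std::nullopt;
}
```

A caller that gets `nullopt` should report allocation failure (or an explicit, logged fallback) instead of quietly degrading throughput.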

0.2 #26 [Vulkan] raw_ptr() unconditional synchronize_all() kills decode throughput

Effort: Low | Impact: Very High

Why here: Any CPU buffer access triggers a full GPU queue drain via synchronize_all(). This should only wait on the specific buffer's timeline semaphore.

Key fix: Replace global sync with targeted wait on buffer's last_semaphore.
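A host-side model of the proposed change (the struct names are hypothetical; `last_semaphore` follows the issue text): instead of draining the whole queue, compute the single timeline value the buffer actually depends on, and only wait (e.g. via `vkWaitSemaphores`) when the GPU has not yet reached it.

```cpp
#include <cassert>
#include <cstdint>

// Each buffer records the timeline value signaled by its last GPU write;
// raw_ptr() waits only until the queue has completed that value, instead
// of calling synchronize_all() and draining every pending submission.
struct Buffer {
    uint64_t last_semaphore = 0; // timeline value of the last write
};

struct Queue {
    uint64_t submitted = 0; // highest timeline value submitted
    uint64_t completed = 0; // highest timeline value the GPU has signaled
};

// Returns the timeline value the host must wait for before touching `b`;
// 0 means the buffer's last write already completed and no wait is needed.
uint64_t required_wait_value(const Buffer& b, const Queue& q) {
    return (b.last_semaphore > q.completed) ? b.last_semaphore : 0;
}
```

With this model, a CPU read of a weight buffer written long ago waits on nothing, even while later decode submissions are still in flight.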

0.3 #27 [Vulkan] Matmul success path pays unnecessary transpose copy overhead

Effort: Medium | Impact: Medium-High

Even the successful matmul path writes to a transposed scratch buffer and then copies it back with a transpose, costing an extra dispatch plus a full pass of memory bandwidth per GEMM.

Key fix: Add shader variant that writes directly to non-transposed output.
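The change is ultimately an indexing one; the following is a host-side model of the two output layouts, not the actual shader code. A direct-write variant replaces the `transposed_idx` addressing (plus the follow-up copy dispatch) with `direct_idx` addressing in the GEMM shader itself.

```cpp
#include <cassert>
#include <cstddef>

// Current path: GEMM writes C^T into scratch (column-major relative to C),
// then a second dispatch copies with transpose into the real output.
inline std::size_t transposed_idx(std::size_t row, std::size_t col,
                                  std::size_t rows) {
    return col * rows + row; // scratch layout: C^T
}

// Proposed variant: write the desired row-major layout directly,
// eliminating one dispatch and one full read+write pass over C.
inline std::size_t direct_idx(std::size_t row, std::size_t col,
                              std::size_t cols) {
    return row * cols + col; // output layout: C
}
```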

0.4 #28 [Vulkan] Copy fallback path triggers GPU readback + sync + CPU conversion

Effort: Low-Medium | Impact: High (if triggered)

Missing copy shaders force a fallback to GPU→CPU readback, a full sync, and CPU-side conversion. If this path is hit during decode, throughput collapses.

Key fix: Add missing conversion shaders; add tracing to detect fallback.
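For the tracing half of the fix, a hypothetical instrumentation sketch (names are made up): count every time the slow readback-and-convert fallback is taken, so a profiling run can prove whether decode ever hits it.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstdio>

// Global hit counter for the slow copy fallback path.
std::atomic<uint64_t> copy_fallback_hits{0};

// Call at the top of the fallback path; warns once, then counts silently.
void note_copy_fallback(const char* src_dtype, const char* dst_dtype) {
    if (++copy_fallback_hits == 1) // the counter is the real signal
        std::fprintf(stderr, "[vulkan] copy fallback hit: %s -> %s\n",
                     src_dtype, dst_dtype);
}
```

Dumping the counter after a benchmark run makes "is the fallback ever triggered?" a yes/no question instead of a guess.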

0.5 #29 [Vulkan] SDPA fast path does excessive layout fixup and cast churn

Effort: Medium | Impact: Medium

SDPA performs 2-5 copies per call for layout fixup and dtype casting. Should validate layouts once before decode loop.

Key fix: Early-exit in cast functions; pre-decode layout validation.
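A sketch of the early-exit predicate (the struct and field names are hypothetical): when the tensor is already in the required dtype and a contiguous layout, the cast helpers should return immediately, and the layout check itself should run once before the decode loop rather than on every SDPA call.

```cpp
#include <cassert>

// Minimal tensor description for the check; real code would use the
// backend's own dtype enum and stride information.
struct TensorDesc {
    int dtype;       // e.g. an enum value for f16/f32
    bool contiguous; // row-major with no transposed strides
};

// True only when a cast or layout-fixup copy is actually required.
bool needs_cast_copy(const TensorDesc& src, int required_dtype) {
    return src.dtype != required_dtype || !src.contiguous;
}
```

In the common decode case (all inputs already f16 and contiguous) this turns 2-5 copies per SDPA call into zero.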


NEW: Priority 0.5: Gaps in Closed Issues (April 2025)

The following issues were previously closed but code review reveals gaps where the implementation doesn't cover all paths or the work wasn't fully completed.

0.5.1 #30 [Vulkan] Copy fallback path bypasses persistent staging allocator and triggers sync

Original: #3 (Persistent staging allocator)
Gap: try_host_vector_cast_copy() does fresh allocation + sync instead of using the staging arena
Impact: Copy fallback path destroys throughput if triggered

0.5.2 #31 [Vulkan] Resource-aware decode classification not used by CPU access paths

Original: #15 (Resource-aware decode synchronization)
Gap: DecodeResourceClass exists but raw_ptr() doesn't use it for sync decisions
Impact: Missed opportunity to skip sync for ReadOnlyWeight buffers

0.5.3 #32 [Vulkan] Token-level decode batching still inserts excessive barriers

Original: #14 (Token-level execution regions)
Gap: Decode batching infrastructure exists but hazard model still conservative
Impact: Barriers per token may still be higher than #14 acceptance criteria

0.5.4 #33 [Vulkan] Per-dispatch overhead in dispatch_with_spec still high despite descriptor reuse

Original: #4 (Descriptor set churn)
Gap: Descriptor allocation is solved, but per-dispatch driver calls remain
Impact: vkUpdateDescriptorSets, vkCmdBind* per dispatch still expensive


Priority 1: Kernel Optimizations (High ROI from llama.cpp)

These have proven 2-5x speedups in llama.cpp Vulkan backend with clear implementation paths.

1.1 #23 [Vulkan] Implement Flash Attention Scalar Refactor for AMD/Intel optimization

Effort: Medium-High | Impact: Very High (2-5x on AMD/Intel)

Why here: llama.cpp PR #19625 shows 122-193% improvement on AMD Pro VII for large context lengths. The refactor adds vendor-specific tuning:

  • AMD RDNA: Use wave32 subgroup size when N=1
  • Intel: Disable subgroup operations in favor of shared memory
  • Row splitting within workgroups for better utilization
  • FP16 shader variants for broader hardware support

Reference: llama.cpp PR #19625 (see vulkan-performance-prs.md for detailed analysis)
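The vendor-specific tuning above is a host-side dispatch decision; a sketch of the variant selection follows (the variant names are made up for illustration; the vendor IDs are the standard PCI IDs for AMD and Intel).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

constexpr uint32_t VENDOR_AMD   = 0x1002; // PCI vendor ID
constexpr uint32_t VENDOR_INTEL = 0x8086; // PCI vendor ID

// Choose a flash-attention shader variant per the tuning points above:
// wave32 for AMD single-token decode, shared-memory (no subgroup ops) on
// Intel, and an FP16 variant elsewhere when the hardware supports it.
std::string pick_fa_variant(uint32_t vendor_id, uint32_t batch_n,
                            bool fp16_ok) {
    if (vendor_id == VENDOR_AMD && batch_n == 1) return "fa_scalar_wave32";
    if (vendor_id == VENDOR_INTEL) return "fa_scalar_sharedmem";
    return fp16_ok ? "fa_scalar_fp16" : "fa_scalar_fp32";
}
```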

1.2 #24 [Vulkan] Optimize Softmax kernel with subgroup operations

Effort: Medium | Impact: High (decode hot path)

Why here: Softmax is on the critical path for every attention layer during decode. llama.cpp PR #10301 optimizes with:

  • Subgroup operations for reduction (replacing barriers)
  • Vectorized loads/stores
  • Vendor-specific workgroup sizing

Reference: llama.cpp PR #10301
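For reference, the per-row computation the shader performs is standard numerically-stable softmax; in the optimized shader the two reductions below become `subgroupMax`/`subgroupAdd` instead of shared-memory loops separated by `barrier()`. This C++ version is a correctness reference, not the kernel itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax over one row of attention scores.
std::vector<float> softmax_row(const std::vector<float>& x) {
    // Reduction 1 (shader: subgroupMax): subtract the max for stability.
    float m = *std::max_element(x.begin(), x.end());
    // Reduction 2 (shader: subgroupAdd): sum of exponentials.
    float s = 0.f;
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = std::exp(x[i] - m);
        s += y[i];
    }
    for (auto& v : y) v /= s;
    return y;
}
```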

1.3 #22 [Vulkan] Support direct conv2d and coopmat2 optimizations in C++ dispatch

Effort: Medium-High | Impact: High for vision models (2-5x speedup)

The shader already has COOPMAT2 and UNROLL flags, but C++ dispatch needs updates to:

  • Create Spec Constant pipelines properly
  • Route vision workloads to direct convolution (avoiding im2col overhead)
  • Enable VK_NV_cooperative_matrix2 for NVIDIA tensor cores

References: llama.cpp PRs #14933, #14982, #16978, #14316


Priority 2: Structural/Scheduling Optimizations

These reduce overhead in the command submission and synchronization path.

2.1 #17 [Vulkan] Lower common decode primitive sequences into fused execution regions

Effort: High | Impact: Medium-High

Why here: After kernel optimizations are in place, fusion can compound the benefits by reducing dispatch overhead.

2.2 #16 [Vulkan] Add native decode hot path for attention and KV cache update

Effort: High | Impact: Medium-High

Why here: Specialized decode paths work best when the underlying kernels (Flash Attention, Softmax) are already optimized, and when memory is in device-local VRAM (#25).

2.3 #5 [Vulkan] Add VkPipelineCache persistence and kernel warmup

Effort: Medium | Impact: Medium (startup/first-run latency)

Why here: Startup optimization is less critical than steady-state decode performance, but important for production use.
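A sketch of the persistence half, under the usual `VkPipelineCache` pattern: read a previously saved blob and pass it as `pInitialData` at creation; on shutdown, fetch the blob with `vkGetPipelineCacheData` and write it back. Only the file I/O is shown runnable here; the Vulkan calls (in the trailing comment) need a live device.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Load a previously saved pipeline-cache blob; empty if the file is absent
// (first run), which a driver treats the same as an empty cache.
std::vector<char> load_cache_blob(const std::string& path) {
    std::ifstream f(path, std::ios::binary);
    return {std::istreambuf_iterator<char>(f),
            std::istreambuf_iterator<char>()};
}

// Persist the blob returned by vkGetPipelineCacheData for the next run.
void save_cache_blob(const std::string& path, const std::vector<char>& blob) {
    std::ofstream f(path, std::ios::binary);
    f.write(blob.data(), static_cast<std::streamsize>(blob.size()));
}

// With a device, roughly:
//   VkPipelineCacheCreateInfo ci{VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO};
//   ci.initialDataSize = blob.size();
//   ci.pInitialData    = blob.data();
//   vkCreatePipelineCache(device, &ci, nullptr, &cache);
```

Drivers validate the blob's header themselves, so a stale or corrupt file degrades to a cold cache rather than an error.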


Priority 3: Future/Model-Specific Optimizations

3.1 GATED_DELTA_NET Operation (Qwen3.5 models)

Effort: High | Impact: Model-specific (17-22% speedup for Qwen3.5)

New operation for Qwen3.5/Qwen3-Next models. Only pursue if Qwen3.5 model support is prioritized.

Reference: llama.cpp PR #20334 (see vulkan-performance-prs.md)


Completed / Closed Issues

| Issue | Description | Result |
| --- | --- | --- |
| #14 | Batch transformer decode into token-level execution regions | ✅ Completed (with gaps - see #32) |
| #15 | Make decode synchronization resource-aware | ✅ Completed (with gaps - see #31) |
| #2 | Remove over-synchronization | ✅ Completed (deferred op barriers fixed; CPU access sync remaining - see #26) |
| #3 | Persistent staging allocator | ✅ Completed (with gaps - see #30) |
| #4 | Reduce descriptor set churn | ✅ Completed (allocation fixed; per-dispatch overhead remaining - see #33) |
| #1 | Multi-queue backend with timeline semaphore | ✅ Completed |

Adjacent / Situational

  • #8 [Vulkan Bug] Wrong-device selection on multi-GPU systems
    This can have major real-world performance impact on mixed-GPU systems, but it is better treated as device-selection correctness with performance consequences rather than part of the core Vulkan perf-program ordering above.

llama.cpp Vulkan Performance PR Reference

The following llama.cpp Vulkan PRs have been analyzed and cross-referenced with MLX issues:

| llama.cpp PR | Description | MLX Status | Issue |
| --- | --- | --- | --- |
| #19625 | Scalar Flash Attention Refactor | 🆕 New issue | #23 |
| #10301 | Optimize soft_max | 🆕 New issue | #24 |
| #10206 | VK_NV_cooperative_matrix2 support | 🔄 In Progress | #22 |
| #14933 | Direct convolution optimizations | 🔄 In Progress | #22 |
| #14982 | Coopmat2 for conv2d | 🔄 In Progress | #22 |
| #12505 | Optimize mul_mat_vec p021/nc | ✅ Complete | #21 |
| #10296 | Optimize mat-vec mul quant shaders | ✅ Complete | #20 |
| #10991 | Optimize mul_mat for small N | ✅ Complete | #20 |
| #20334 | GATED_DELTA_NET op (Qwen3.5) | 📋 Not started | (none) |
| #11595 | MMV kernels for IQ2/IQ3 quants | 📋 Not started | (none) |
| #14903 | Integer Dot Product for legacy quants | 📋 Not started | (none) |
| #17582 | Improve topk perf for large k | 📋 Not started | (none) |

See vulkan-performance-prs.md in repo root for detailed diffs and analysis of each PR.


Performance Investigation Notes

April 2025 Code Review Findings:

After detailed analysis of allocator.cpp, matmul.cpp, copy.cpp, fast.cpp, device.cpp, and kernels.cpp:

  1. The ~8.5 tok/s wall is NOT primarily shader-bound; the shaders appear fine.

  2. Root causes are host-side: see the Priority 0 issues above (#25-#29).

  3. Gaps in closed issues also contribute: see the Priority 0.5 issues (#30-#33).

  4. Recommended investigation order: follow the Priority 0 numbering above, starting with #25 and #26.


Notes

  • This list includes both open issues and recently created issues that are directly performance-related
  • The ordering prioritizes proven kernel optimizations over structural changes
  • NEW (April 2025): Added Priority 0 for host-side orchestration issues identified in code review
  • NEW (April 2025): Added Priority 0.5 for gaps in previously closed issues
  • If new profiling data shows different bottlenecks, this order should be updated
