
Performance tracking issue #19


Summary

This issue tracks the current open performance work for the Vulkan backend and orders it by preferred implementation sequence.

The ordering below prioritizes kernel-level optimizations with proven ROI (based on llama.cpp benchmarks) first, then structural/scheduling improvements.


NEW: Priority 0: Host-Side Orchestration Bottlenecks (Updated April 2025)

Based on detailed code review, the following issues represent the most likely root causes of the current ~8.5 tok/s throughput wall. These are host-side/backend-side issues, not shader problems.

0.1 #25 [Vulkan] Force device-local memory allocation on discrete GPUs

Effort: Low | Impact: Very High

Why first: The allocator allows fallback to host-visible memory on discrete GPUs, causing PCIe bandwidth bottlenecks for GEMM weights and KV-cache. This is a P0 issue that likely contributes significantly to the throughput wall.

Key fix: Change memory type preference chain to force eDeviceLocal only.
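A minimal sketch of the selection logic (the function name and signature are hypothetical; the flag values mirror `VkMemoryPropertyFlagBits` from the Vulkan spec): on a discrete GPU, accept only memory types with the device-local bit set and fail loudly rather than fall back to host-visible memory, since that silent fallback is what puts GEMM and KV-cache traffic on the PCIe bus.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Flag bits mirroring VkMemoryPropertyFlagBits (values per the Vulkan spec).
constexpr uint32_t DEVICE_LOCAL_BIT = 0x1; // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr uint32_t HOST_VISIBLE_BIT = 0x2; // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT

// Pick a memory type for weights/KV-cache on a discrete GPU: only accept
// device-local types. Returning nullopt (instead of silently taking a
// host-visible type) is the point of the fix.
std::optional<uint32_t> pick_device_local_type(
    const std::vector<uint32_t>& type_flags, // propertyFlags per memory type
    uint32_t memory_type_bits)               // from VkMemoryRequirements
{
    for (uint32_t i = 0; i < type_flags.size(); ++i) {
        bool allowed = (memory_type_bits >> i) & 1u;
        if (allowed && (type_flags[i] & DEVICE_LOCAL_BIT)) return i;
    }
    return std::nullopt;
}
```

A caller that gets `nullopt` should report allocation failure (or an explicit, logged fallback) instead of quietly degrading throughput.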

0.2 #26 [Vulkan] raw_ptr() unconditional synchronize_all() kills decode throughput

Effort: Low | Impact: Very High

Why here: Any CPU buffer access triggers a full GPU queue drain via synchronize_all(). This should only wait on the specific buffer's timeline semaphore.

Key fix: Replace global sync with targeted wait on buffer's last_semaphore.
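A host-side model of the proposed change (the struct names are hypothetical; `last_semaphore` follows the issue text): instead of draining the whole queue, compute the single timeline value the buffer actually depends on, and only wait (e.g. via `vkWaitSemaphores`) when the GPU has not yet reached it.

```cpp
#include <cassert>
#include <cstdint>

// Each buffer records the timeline value signaled by its last GPU write;
// raw_ptr() waits only until the queue has completed that value, instead
// of calling synchronize_all() and draining every pending submission.
struct Buffer {
    uint64_t last_semaphore = 0; // timeline value of the last write
};

struct Queue {
    uint64_t submitted = 0; // highest timeline value submitted
    uint64_t completed = 0; // highest timeline value the GPU has signaled
};

// Returns the timeline value the host must wait for before touching `b`;
// 0 means the buffer's last write already completed and no wait is needed.
uint64_t required_wait_value(const Buffer& b, const Queue& q) {
    return (b.last_semaphore > q.completed) ? b.last_semaphore : 0;
}
```

With this model, a CPU read of a weight buffer written long ago waits on nothing, even while later decode submissions are still in flight.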

0.3 #27 [Vulkan] Matmul success path pays unnecessary transpose copy overhead

Effort: Medium | Impact: Medium-High

Even the successful matmul path writes to a transposed scratch buffer and then copies it back with a transpose, costing an extra dispatch plus a full pass of memory bandwidth per GEMM.

Key fix: Add shader variant that writes directly to non-transposed output.
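The change is ultimately an indexing one; the following is a host-side model of the two output layouts, not the actual shader code. A direct-write variant replaces the `transposed_idx` addressing (plus the follow-up copy dispatch) with `direct_idx` addressing in the GEMM shader itself.

```cpp
#include <cassert>
#include <cstddef>

// Current path: GEMM writes C^T into scratch (column-major relative to C),
// then a second dispatch copies with transpose into the real output.
inline std::size_t transposed_idx(std::size_t row, std::size_t col,
                                  std::size_t rows) {
    return col * rows + row; // scratch layout: C^T
}

// Proposed variant: write the desired row-major layout directly,
// eliminating one dispatch and one full read+write pass over C.
inline std::size_t direct_idx(std::size_t row, std::size_t col,
                              std::size_t cols) {
    return row * cols + col; // output layout: C
}
```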

0.4 #28 [Vulkan] Copy fallback path triggers GPU readback + sync + CPU conversion

Effort: Low-Medium | Impact: High (if triggered)

Missing copy shaders force a fallback to GPU→CPU readback, a full sync, and CPU-side conversion. If this path is hit during decode, throughput collapses.

Key fix: Add missing conversion shaders; add tracing to detect fallback.
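For the tracing half of the fix, a hypothetical instrumentation sketch (names are made up): count every time the slow readback-and-convert fallback is taken, so a profiling run can prove whether decode ever hits it.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstdio>

// Global hit counter for the slow copy fallback path.
std::atomic<uint64_t> copy_fallback_hits{0};

// Call at the top of the fallback path; warns once, then counts silently.
void note_copy_fallback(const char* src_dtype, const char* dst_dtype) {
    if (++copy_fallback_hits == 1) // the counter is the real signal
        std::fprintf(stderr, "[vulkan] copy fallback hit: %s -> %s\n",
                     src_dtype, dst_dtype);
}
```

Dumping the counter after a benchmark run makes "is the fallback ever triggered?" a yes/no question instead of a guess.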

0.5 #29 [Vulkan] SDPA fast path does excessive layout fixup and cast churn

Effort: Medium | Impact: Medium

SDPA performs 2-5 copies per call for layout fixup and dtype casting. Should validate layouts once before decode loop.

Key fix: Early-exit in cast functions; pre-decode layout validation.
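A sketch of the early-exit predicate (the struct and field names are hypothetical): when the tensor is already in the required dtype and a contiguous layout, the cast helpers should return immediately, and the layout check itself should run once before the decode loop rather than on every SDPA call.

```cpp
#include <cassert>

// Minimal tensor description for the check; real code would use the
// backend's own dtype enum and stride information.
struct TensorDesc {
    int dtype;       // e.g. an enum value for f16/f32
    bool contiguous; // row-major with no transposed strides
};

// True only when a cast or layout-fixup copy is actually required.
bool needs_cast_copy(const TensorDesc& src, int required_dtype) {
    return src.dtype != required_dtype || !src.contiguous;
}
```

In the common decode case (all inputs already f16 and contiguous) this turns 2-5 copies per SDPA call into zero.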


NEW: Priority 0.5: Gaps in Closed Issues (April 2025)

The following issues were previously closed but code review reveals gaps where the implementation doesn't cover all paths or the work wasn't fully completed.

0.5.1 #30 [Vulkan] Copy fallback path bypasses persistent staging allocator and triggers sync

Original: #3 (Persistent staging allocator)
Gap: try_host_vector_cast_copy() does fresh allocation + sync instead of using the staging arena
Impact: Copy fallback path destroys throughput if triggered

0.5.2 #31 [Vulkan] Resource-aware decode classification not used by CPU access paths

Original: #15 (Resource-aware decode synchronization)
Gap: DecodeResourceClass exists but raw_ptr() doesn't use it for sync decisions
Impact: Missed opportunity to skip sync for ReadOnlyWeight buffers

0.5.3 #32 [Vulkan] Token-level decode batching still inserts excessive barriers

Original: #14 (Token-level execution regions)
Gap: Decode batching infrastructure exists but hazard model still conservative
Impact: Barriers per token may still be higher than #14 acceptance criteria

0.5.4 #33 [Vulkan] Per-dispatch overhead in dispatch_with_spec still high despite descriptor reuse

Original: #4 (Descriptor set churn)
Gap: Descriptor allocation is solved, but per-dispatch driver calls remain
Impact: vkUpdateDescriptorSets, vkCmdBind* per dispatch still expensive


Priority 1: Kernel Optimizations (High ROI from llama.cpp)

These have proven 2-5x speedups in llama.cpp Vulkan backend with clear implementation paths.

1.1 #23 [Vulkan] Implement Flash Attention Scalar Refactor for AMD/Intel optimization

Effort: Medium-High | Impact: Very High (2-5x on AMD/Intel)

Why here: llama.cpp PR #19625 shows 122-193% improvement on AMD Pro VII for large context lengths. The refactor adds vendor-specific tuning:

  • AMD RDNA: Use wave32 subgroup size when N=1
  • Intel: Disable subgroup operations in favor of shared memory
  • Row splitting within workgroups for better utilization
  • FP16 shader variants for broader hardware support

Reference: llama.cpp PR #19625 (see vulkan-performance-prs.md for detailed analysis)
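The vendor-specific tuning above is a host-side dispatch decision; a sketch of the variant selection follows (the variant names are made up for illustration; the vendor IDs are the standard PCI IDs for AMD and Intel).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

constexpr uint32_t VENDOR_AMD   = 0x1002; // PCI vendor ID
constexpr uint32_t VENDOR_INTEL = 0x8086; // PCI vendor ID

// Choose a flash-attention shader variant per the tuning points above:
// wave32 for AMD single-token decode, shared-memory (no subgroup ops) on
// Intel, and an FP16 variant elsewhere when the hardware supports it.
std::string pick_fa_variant(uint32_t vendor_id, uint32_t batch_n,
                            bool fp16_ok) {
    if (vendor_id == VENDOR_AMD && batch_n == 1) return "fa_scalar_wave32";
    if (vendor_id == VENDOR_INTEL) return "fa_scalar_sharedmem";
    return fp16_ok ? "fa_scalar_fp16" : "fa_scalar_fp32";
}
```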

1.2 #24 [Vulkan] Optimize Softmax kernel with subgroup operations

Effort: Medium | Impact: High (decode hot path)

Why here: Softmax is on the critical path for every attention layer during decode. llama.cpp PR #10301 optimizes with:

  • Subgroup operations for reduction (replacing barriers)
  • Vectorized loads/stores
  • Vendor-specific workgroup sizing

Reference: llama.cpp PR #10301
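For reference, the per-row computation the shader performs is standard numerically-stable softmax; in the optimized shader the two reductions below become `subgroupMax`/`subgroupAdd` instead of shared-memory loops separated by `barrier()`. This C++ version is a correctness reference, not the kernel itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax over one row of attention scores.
std::vector<float> softmax_row(const std::vector<float>& x) {
    // Reduction 1 (shader: subgroupMax): subtract the max for stability.
    float m = *std::max_element(x.begin(), x.end());
    // Reduction 2 (shader: subgroupAdd): sum of exponentials.
    float s = 0.f;
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = std::exp(x[i] - m);
        s += y[i];
    }
    for (auto& v : y) v /= s;
    return y;
}
```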

1.3 #22 [Vulkan] Support direct conv2d and coopmat2 optimizations in C++ dispatch

Effort: Medium-High | Impact: High for vision models (2-5x speedup)

The shader already has COOPMAT2 and UNROLL flags, but C++ dispatch needs updates to:

  • Create Spec Constant pipelines properly
  • Route vision workloads to direct convolution (avoiding im2col overhead)
  • Enable VK_NV_cooperative_matrix2 for NVIDIA tensor cores

References: llama.cpp PRs #14933, #14982, #16978, #14316


Priority 2: Structural/Scheduling Optimizations

These reduce overhead in the command submission and synchronization path.

2.1 #17 [Vulkan] Lower common decode primitive sequences into fused execution regions

Effort: High | Impact: Medium-High

Why here: After kernel optimizations are in place, fusion can compound the benefits by reducing dispatch overhead.

2.2 #16 [Vulkan] Add native decode hot path for attention and KV cache update

Effort: High | Impact: Medium-High

Why here: Specialized decode paths work best when the underlying kernels (Flash Attention, Softmax) are already optimized, and when memory is in device-local VRAM (#25).

2.3 #5 [Vulkan] Add VkPipelineCache persistence and kernel warmup

Effort: Medium | Impact: Medium (startup/first-run latency)

Why here: Startup optimization is less critical than steady-state decode performance, but important for production use.
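A sketch of the persistence half, under the usual `VkPipelineCache` pattern: read a previously saved blob and pass it as `pInitialData` at creation; on shutdown, fetch the blob with `vkGetPipelineCacheData` and write it back. Only the file I/O is shown runnable here; the Vulkan calls (in the trailing comment) need a live device.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Load a previously saved pipeline-cache blob; empty if the file is absent
// (first run), which a driver treats the same as an empty cache.
std::vector<char> load_cache_blob(const std::string& path) {
    std::ifstream f(path, std::ios::binary);
    return {std::istreambuf_iterator<char>(f),
            std::istreambuf_iterator<char>()};
}

// Persist the blob returned by vkGetPipelineCacheData for the next run.
void save_cache_blob(const std::string& path, const std::vector<char>& blob) {
    std::ofstream f(path, std::ios::binary);
    f.write(blob.data(), static_cast<std::streamsize>(blob.size()));
}

// With a device, roughly:
//   VkPipelineCacheCreateInfo ci{VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO};
//   ci.initialDataSize = blob.size();
//   ci.pInitialData    = blob.data();
//   vkCreatePipelineCache(device, &ci, nullptr, &cache);
```

Drivers validate the blob's header themselves, so a stale or corrupt file degrades to a cold cache rather than an error.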


Priority 3: Future/Model-Specific Optimizations

3.1 GATED_DELTA_NET Operation (Qwen3.5 models)

Effort: High | Impact: Model-specific (17-22% speedup for Qwen3.5)

New operation for Qwen3.5/Qwen3-Next models. Only pursue if Qwen3.5 model support is prioritized.

Reference: llama.cpp PR #20334 (see vulkan-performance-prs.md)


Completed / Closed Issues

| Issue | Description | Result |
| --- | --- | --- |
| #14 | Batch transformer decode into token-level execution regions | ✅ Completed (with gaps - see #32) |
| #15 | Make decode synchronization resource-aware | ✅ Completed (with gaps - see #31) |
| #2 | Remove over-synchronization | ✅ Completed (deferred op barriers fixed; CPU access sync remaining - see #26) |
| #3 | Persistent staging allocator | ✅ Completed (with gaps - see #30) |
| #4 | Reduce descriptor set churn | ✅ Completed (allocation fixed; per-dispatch overhead remaining - see #33) |
| #1 | Multi-queue backend with timeline semaphore | ✅ Completed |

Adjacent / Situational

  • #8 [Vulkan Bug] Wrong-device selection on multi-GPU systems
    This can have major real-world performance impact on mixed-GPU systems, but it is better treated as device-selection correctness with performance consequences rather than part of the core Vulkan perf-program ordering above.

llama.cpp Vulkan Performance PR Reference

The following llama.cpp Vulkan PRs have been analyzed and cross-referenced with MLX issues:

| llama.cpp PR | Description | MLX Status | Issue |
| --- | --- | --- | --- |
| #19625 | Scalar Flash Attention Refactor | 🆕 New issue | #23 |
| #10301 | Optimize soft_max | 🆕 New issue | #24 |
| #10206 | VK_NV_cooperative_matrix2 support | 🔄 In Progress | #22 |
| #14933 | Direct convolution optimizations | 🔄 In Progress | #22 |
| #14982 | Coopmat2 for conv2d | 🔄 In Progress | #22 |
| #12505 | Optimize mul_mat_vec p021/nc | ✅ Complete | #21 |
| #10296 | Optimize mat-vec mul quant shaders | ✅ Complete | #20 |
| #10991 | Optimize mul_mat for small N | ✅ Complete | #20 |
| #20334 | GATED_DELTA_NET op (Qwen3.5) | 📋 Not started | (none) |
| #11595 | MMV kernels for IQ2/IQ3 quants | 📋 Not started | (none) |
| #14903 | Integer Dot Product for legacy quants | 📋 Not started | (none) |
| #17582 | Improve topk perf for large k | 📋 Not started | (none) |

See vulkan-performance-prs.md in repo root for detailed diffs and analysis of each PR.


Performance Investigation Notes

April 2025 Code Review Findings:

After detailed analysis of allocator.cpp, matmul.cpp, copy.cpp, fast.cpp, device.cpp, and kernels.cpp:

  1. The ~8.5 tok/s wall is NOT primarily shader-bound; the shaders appear fine.

  2. Root causes are host-side: see the Priority 0 issues above (#25-#29).

  3. Gaps in closed issues also contribute: see the Priority 0.5 issues (#30-#33).

  4. Recommended investigation order: follow the Priority 0 numbering above, starting with #25 and #26.


Notes

  • This list includes both open issues and recently created issues that are directly performance-related
  • The ordering prioritizes proven kernel optimizations over structural changes
  • NEW (April 2025): Added Priority 0 for host-side orchestration issues identified in code review
  • NEW (April 2025): Added Priority 0.5 for gaps in previously closed issues
  • If new profiling data shows different bottlenecks, this order should be updated
