GPU-accelerated vector similarity search for Apple Silicon, built on Metal Compute Shaders and the Accelerate framework (AMX). Implements brute-force exact search and CAGRA graph-based approximate nearest neighbor (ANN) search.
- Brute-force exact search: 2.87× faster than metalfaiss (MLX) at N=1M, D=1024
- CAGRA ANN: beats hnswlib build time at N=100K with competitive recall
- 2-pass GPU pipeline: MPS float16 matmul + GPU top-K kernel — only Q×K results leave the GPU
- Index persistence: save/load built CAGRA index to skip rebuild
| Implementation | QPS | p50 latency | recall@10 |
|---|---|---|---|
| cuvs-silicon | 1012 | 99 ms | 1.0000 |
| metalfaiss (MLX) | 351 | 275 ms | 1.0000 |
| AMX cblas_sgemm (CPU) | 14 | — | 1.0000 |
CAGRA was implemented and benchmarked but the results were not satisfying. See docs/apple_silicon_gpu_vector_search_report.md for detailed results and analysis.
Requirements: macOS 14+, Xcode 15+, Apple Silicon (M1/M2/M3), CMake 3.20+
git clone https://github.com/ysys143/cuvs-silicon.git
cd cuvs-silicon
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallelPass graph_degree=0 to skip graph build and use exact GPU search:
./build/cohere_wiki_validation \
data/1m/base.bin data/1m/queries.bin data/1m/gt.bin \
10 1 3 0 # K warmup measure graph_degree=0# Build index and search
./build/cohere_wiki_validation \
data/100k/base.bin data/100k/queries.bin data/100k/gt.bin \
10 1 3 # K warmup measure [graph_degree=64]
# Save index after build (argv[8] = save path)
./build/cohere_wiki_validation ... 10 1 3 64 /path/to/index.idx
# Load saved index and search (argv[9] = load path)
./build/cohere_wiki_validation ... 10 1 3 64 "" /path/to/index.idxpython3 scripts/download_cohere_wiki.py # downloads 10K / 100K / 1M splitsPass 1 MPS MPSMatrixMultiplication (float16 inputs)
queries[Q×D] @ dataset[N×D]^T → cross[Q×N] (GPU-private buffer)
Pass 2 Metal kernel: l2_topk_from_cross
cross + q_norms + d_norms → top-K indices/distances
CPU receives only Q×K bytes (e.g. 100 queries × 10 results × 8 B = 8 KB)
Dataset and d_norms are cached across calls as float16 Metal buffers.
Phase 1a IVF K-means seeding
N ≤ 200K : exact cblas_sgemm per cluster (AMX, L3-resident)
N > 200K : IVF_PQ — LUT distance lookup, CPU PQ training
Phase 1b GPU random bucketing (Metal kernel, cross-cluster diversity)
Phase 2 nn-descent refinement (Metal kernel, float4 SIMD distances)
Entry-point selection at search time uses AMX-accelerated cblas_sgemm over cached navigation vectors.
CAGRA Q=1 latency is 54 ms on cuvs-silicon vs. 2.4 ms on NVIDIA CUVS (A100).
Root cause: Metal command buffer fixed overhead (~2 ms) + single threadgroup dispatched for Q=1 → 2.6% GPU utilization. Multi-CTA beam search was implemented and tested but worsened latency due to cache thrashing on memory-bound traversal.
| Stage | Time | Root cause |
|---|---|---|
| K-means (chunk=2048) | 341 s | M-step scatter-add is DRAM-bound on CPU |
| PQ training | 764 s | 8 subspace × N=1M → 229 B memory ops |
| CPU PQ seeding | 1548 s | 229 B pair lookups × ~3 ns each |
| GPU bucketing + misc | ~60 s | — |
| Total | 2714 s | — |
NVIDIA CUVS completes the same workload in 22 s (A100, HBM2e 2 TB/s vs. CPU ~50 GB/s effective).
A Metal GPU kernel for the PQ seeding step would reduce the bottleneck to ~100 s.
See docs/multi-cta-ivf-pq-lessons.md for full analysis.
Contributions are welcome. The highest-impact areas:
| Area | Expected impact | Status |
|---|---|---|
| Metal GPU K-means M-step (scatter-add kernel) | 1M build: 341 s → ~50 s | open |
| Metal GPU IVF_PQ seeding kernel | 1M build: 1548 s → ~100 s | open — draft kernel exists, bug in PQ distance scale |
| nn-descent optimization for N=1M | enables recall improvement at 1M | open |
| Multi-CTA beam search (Q=1 latency) | 54 ms → ~10 ms | open — kernel implemented, disabled |
Please open an issue before starting large changes.
- NVIDIA CUVS — CUDA reference implementation
- MetalFaiss / Faiss-mlx — MLX-based FAISS (comparison baseline)
- hnswlib — CPU HNSW baseline
Apache 2.0 — Copyright 2026 JAESOL SHIN