cuvs-silicon

GPU-accelerated vector similarity search for Apple Silicon, built on Metal Compute Shaders and the Accelerate framework (AMX). Implements brute-force exact search and CAGRA graph-based approximate nearest neighbor (ANN) search.

Highlights

Brute-force exact search: 2.87× faster than metalfaiss (MLX) at N=1M, D=1024
CAGRA ANN: beats hnswlib build time at N=100K with competitive recall
2-pass GPU pipeline: MPS float16 matmul + GPU top-K kernel — only Q×K results leave the GPU
Index persistence: save/load built CAGRA index to skip rebuild

Benchmarks

Brute-Force Search — N=1M, D=1024, K=10 (M3 Max, Q=100)

Implementation	QPS	p50 latency	recall@10
cuvs-silicon	1012	99 ms	1.0000
metalfaiss (MLX)	351	275 ms	1.0000
AMX cblas_sgemm (CPU)	14	—	1.0000

CAGRA ANN — experimental

CAGRA was implemented and benchmarked but the results were not satisfying. See docs/apple_silicon_gpu_vector_search_report.md for detailed results and analysis.

Installation

Requirements: macOS 14+, Xcode 15+, Apple Silicon (M1/M2/M3), CMake 3.20+

git clone https://github.com/ysys143/cuvs-silicon.git
cd cuvs-silicon
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel

Usage

Brute-Force Exact Search

Pass graph_degree=0 to skip graph build and use exact GPU search:

./build/cohere_wiki_validation \
    data/1m/base.bin data/1m/queries.bin data/1m/gt.bin \
    10 1 3 0   # K warmup measure graph_degree=0

CAGRA ANN Search

# Build index and search
./build/cohere_wiki_validation \
    data/100k/base.bin data/100k/queries.bin data/100k/gt.bin \
    10 1 3     # K warmup measure [graph_degree=64]

# Save index after build  (argv[8] = save path)
./build/cohere_wiki_validation ... 10 1 3 64 /path/to/index.idx

# Load saved index and search  (argv[9] = load path)
./build/cohere_wiki_validation ... 10 1 3 64 "" /path/to/index.idx

Download Benchmark Data

python3 scripts/download_cohere_wiki.py   # downloads 10K / 100K / 1M splits

Architecture

Brute-Force (2-pass GPU)

Pass 1  MPS MPSMatrixMultiplication (float16 inputs)
        queries[Q×D] @ dataset[N×D]^T  →  cross[Q×N]   (GPU-private buffer)

Pass 2  Metal kernel: l2_topk_from_cross
        cross + q_norms + d_norms  →  top-K indices/distances
        CPU receives only Q×K bytes (e.g. 100 queries × 10 results × 8 B = 8 KB)

Dataset and d_norms are cached across calls as float16 Metal buffers.

CAGRA Build Pipeline

Phase 1a  IVF K-means seeding
          N ≤ 200K : exact cblas_sgemm per cluster (AMX, L3-resident)
          N > 200K : IVF_PQ — LUT distance lookup, CPU PQ training

Phase 1b  GPU random bucketing  (Metal kernel, cross-cluster diversity)

Phase 2   nn-descent refinement (Metal kernel, float4 SIMD distances)

Entry-point selection at search time uses AMX-accelerated cblas_sgemm over cached navigation vectors.

CAGRA Limitations

Single-query latency

CAGRA Q=1 latency is 54 ms on cuvs-silicon vs. 2.4 ms on NVIDIA CUVS (A100).
Root cause: Metal command buffer fixed overhead (~2 ms) + single threadgroup dispatched for Q=1 → 2.6% GPU utilization. Multi-CTA beam search was implemented and tested but worsened latency due to cache thrashing on memory-bound traversal.

1M-scale build

Stage	Time	Root cause
K-means (chunk=2048)	341 s	M-step scatter-add is DRAM-bound on CPU
PQ training	764 s	8 subspace × N=1M → 229 B memory ops
CPU PQ seeding	1548 s	229 B pair lookups × ~3 ns each
GPU bucketing + misc	~60 s	—
Total	2714 s	—

NVIDIA CUVS completes the same workload in 22 s (A100, HBM2e 2 TB/s vs. CPU ~50 GB/s effective).
A Metal GPU kernel for the PQ seeding step would reduce the bottleneck to ~100 s.
See docs/multi-cta-ivf-pq-lessons.md for full analysis.

Contributing

Contributions are welcome. The highest-impact areas:

Area	Expected impact	Status
Metal GPU K-means M-step (scatter-add kernel)	1M build: 341 s → ~50 s	open
Metal GPU IVF_PQ seeding kernel	1M build: 1548 s → ~100 s	open — draft kernel exists, bug in PQ distance scale
nn-descent optimization for N=1M	enables recall improvement at 1M	open
Multi-CTA beam search (Q=1 latency)	54 ms → ~10 ms	open — kernel implemented, disabled

Please open an issue before starting large changes.

Related Projects

NVIDIA CUVS — CUDA reference implementation
MetalFaiss / Faiss-mlx — MLX-based FAISS (comparison baseline)
hnswlib — CPU HNSW baseline

Name		Name	Last commit message	Last commit date
Latest commit History 65 Commits
.metal-cagra		.metal-cagra
.ouroboros		.ouroboros
Testing/Temporary		Testing/Temporary
benchmarks		benchmarks
cuvs_silicon		cuvs_silicon
docs		docs
include		include
scripts		scripts
shaders		shaders
src		src
tests		tests
.gitignore		.gitignore
.ouroboros_eval_artifact.md		.ouroboros_eval_artifact.md
CLAUDE.md		CLAUDE.md
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
PLAN.md		PLAN.md
README.md		README.md
migrate-project-sessions.skill		migrate-project-sessions.skill
seed.yaml		seed.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

cuvs-silicon

Highlights

Benchmarks

Brute-Force Search — N=1M, D=1024, K=10 (M3 Max, Q=100)

CAGRA ANN — experimental

Installation

Usage

Brute-Force Exact Search

CAGRA ANN Search

Download Benchmark Data

Architecture

Brute-Force (2-pass GPU)

CAGRA Build Pipeline

CAGRA Limitations

Single-query latency

1M-scale build

Contributing

Related Projects

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

cuvs-silicon

Highlights

Benchmarks

Brute-Force Search — N=1M, D=1024, K=10 (M3 Max, Q=100)

CAGRA ANN — experimental

Installation

Usage

Brute-Force Exact Search

CAGRA ANN Search

Download Benchmark Data

Architecture

Brute-Force (2-pass GPU)

CAGRA Build Pipeline

CAGRA Limitations

Single-query latency

1M-scale build

Contributing

Related Projects

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages