A Docker-based lab to benchmark GPU-accelerated vector index building in PostgreSQL using PGPU.
We wanted to answer a simple question: how much faster can you build a vector index if you throw a GPU at it?
Short answer: at 1M vectors, the GPU path finishes in ~6 minutes. CPU takes ~27 minutes. The clustering step alone goes from 21 minutes on CPU to 10 seconds on GPU.
The whole thing runs from a single Docker container with:
- PostgreSQL 17 as the database
- pgvector for the `vector` data type
- VectorChord for the IVF index (`vchordrq`)
- PGPU by EnterpriseDB, which offloads the k-means clustering to the GPU using NVIDIA cuVS
- NVIDIA cuVS 25.12, the actual GPU k-means library under the hood
The benchmark generates 1M random vectors (768 dimensions, same as BERT embeddings), then builds the same VectorChord index twice: once with GPU-accelerated clustering (PGPU) and once with CPU-only clustering. You compare the times and that's it.
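The repo builds its dataset with `generate_vectors.sql`, whose contents aren't reproduced here. For intuition only, here is a Python sketch that produces data of the same shape, assuming components drawn uniformly from [0, 1) (the SQL script may use a different distribution). It formats each vector in pgvector's bracketed text input form (`'[1,2,3]'`), which can be inserted into a `vector(768)` column. The helper names are our own, not part of the repo.

```python
import random
from typing import List, Optional

DIM = 768  # same dimensionality as BERT-base embeddings

def random_vector(dim: int = DIM, rng: Optional[random.Random] = None) -> List[float]:
    """Return one vector whose components are drawn uniformly from [0, 1)."""
    rng = rng or random.Random()
    return [rng.random() for _ in range(dim)]

def to_pgvector_literal(vec: List[float]) -> str:
    """Format a vector as pgvector text input: [0.5, 1.25] -> '[0.500000,1.250000]'."""
    return "[" + ",".join(f"{x:.6f}" for x in vec) + "]"

if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the sample is reproducible
    for _ in range(3):
        literal = to_pgvector_literal(random_vector(rng=rng))
        print(literal[:48] + "...")  # truncated: the full literal has 768 values
```

Uniform random data has no real cluster structure, so it is a fair stress test for build time but says little about search recall.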
To run it, you need:
- Docker with the NVIDIA Container Toolkit installed
- An NVIDIA GPU (tested on Blackwell / GB10; it should work on any CUDA-capable GPU)
- ~16 GB RAM, plus ~20 GB of disk for the Docker image
Heads up: the Docker image is big (~10 GB). It has to install conda for cuVS, the Rust toolchain to build the extensions from source, GCC 14 for VectorChord's ARM SIMD code, and a bunch of other things. First build takes a while. Go grab a coffee.
```bash
# clone the repo
git clone https://github.com/rhossi/pgpu-lab.git
cd pgpu-lab

# build the image (this takes 15-30 min the first time)
docker compose build

# start the database
docker compose up -d

# check it's running
docker compose logs --tail 5
```

You should see something like:
```text
pgpu-1 | CREATE EXTENSION (vector)
pgpu-1 | CREATE EXTENSION (vchord)
pgpu-1 | CREATE EXTENSION (pgpu)
pgpu-1 | ==> Starting PostgreSQL with PGPU …
pgpu-1 | database system is ready to accept connections
```
All three extensions loaded? Good. Now run the benchmark:
```bash
# generate 1M vectors (takes ~5 min)
docker compose exec pgpu psql -U postgres -f /datasets/generate_vectors.sql

# run GPU vs CPU index build benchmark
docker compose exec pgpu psql -U postgres -f /benchmarks/01_vector_index_bench.sql
```

Something like this (times will vary depending on your GPU):
```text
--- B28: PGPU GPU index build (1M vectors, dim=768, lists=4000) ---
INFO: running GPU accelerated index build for public.vectors_1m.embedding
INFO: processing batch (1/10)
INFO: Clustering vectors on GPU
...
INFO: Training complete (9.73s). Building VectorChord Index...
Time: 351562.725 ms (05:51.563)
--- B29: CPU index build (1M vectors, dim=768, lists=4000) ---
INFO: clustering: using 4 threads
INFO: clustering: iteration 1
...
INFO: clustering: iteration 10
INFO: clustering: finished
Time: 1602497.725 ms (26:42.498)
```
The GPU path took 5 min 52 s. The CPU path took 26 min 42 s.
That's a 4.6x overall speedup. The interesting part, though, is the clustering phase: the GPU finished it in 9.7 seconds, while the CPU took about 21 minutes. That's roughly 130x faster on clustering alone.
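The speedup figures above are simple ratios of the reported times. As a quick check, here they are recomputed from the `Time:` lines, the `Training complete (9.73s)` log line, and the approximate 21-minute CPU clustering figure quoted in the text (the `speedup` helper is our own):

```python
# Wall-clock totals from psql's "Time:" lines in the run above (milliseconds).
GPU_TOTAL_MS = 351_562.725
CPU_TOTAL_MS = 1_602_497.725

# Clustering-phase times: GPU from "Training complete (9.73s)",
# CPU from the approximate ~21 minutes quoted in the text.
GPU_CLUSTER_S = 9.73
CPU_CLUSTER_S = 21 * 60

def speedup(baseline: float, accelerated: float) -> float:
    """How many times faster `accelerated` is than `baseline` (same units)."""
    return baseline / accelerated

overall = speedup(CPU_TOTAL_MS, GPU_TOTAL_MS)
clustering = speedup(CPU_CLUSTER_S, GPU_CLUSTER_S)

print(f"overall:    {overall:.1f}x")     # prints 4.6x
print(f"clustering: {clustering:.0f}x")  # prints 129x, i.e. roughly 130x
```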
We ran this at three different scales:
| Dataset | GPU total | CPU total | Speedup |
|---|---|---|---|
| 100K vectors (dim=768) | 7 s | 16 s | 2.2x |
| 500K vectors (dim=768) | 55 s | 3 min 46 s | 4.1x |
| 1M vectors (dim=768) | 5 min 52 s | 26 min 42 s | 4.6x |
The speedup grows with data size. This makes sense — the GPU finishes clustering almost instantly at any scale, so the bigger the dataset, the more CPU time you're saving.
At 1M vectors, almost all of the GPU path's time (~342 of ~352 seconds) is spent on the VectorChord index construction step, which runs on the CPU either way. The GPU is basically waiting for the index builder to finish.
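This is Amdahl's law at work: once clustering is nearly free, the CPU-bound construction step sets a ceiling on the overall speedup. A small sketch using this run's numbers, assuming the construction step takes the same ~342 s on both paths (we derive it as GPU total minus the 9.73 s training step; the helper name is our own):

```python
GPU_TOTAL_S = 351.563   # GPU run total
GPU_TRAIN_S = 9.73      # "Training complete (9.73s)"
CPU_TOTAL_S = 1602.498  # CPU run total

def amdahl_total(cluster_s: float, build_s: float, cluster_speedup: float) -> float:
    """Total time when only the clustering phase is accelerated by `cluster_speedup`."""
    return cluster_s / cluster_speedup + build_s

# Assumption: index construction costs the same on both paths.
build_s = GPU_TOTAL_S - GPU_TRAIN_S       # ~342 s
cpu_cluster_s = CPU_TOTAL_S - build_s     # ~1261 s, consistent with the ~21 min quoted

# Even an infinitely fast clustering step leaves build_s on the clock.
max_speedup = CPU_TOTAL_S / build_s

observed = cpu_cluster_s / GPU_TRAIN_S    # clustering speedup implied by these numbers
print(f"construction ~{build_s:.0f} s, CPU clustering ~{cpu_cluster_s / 60:.0f} min")
print(f"model GPU total: {amdahl_total(cpu_cluster_s, build_s, observed):.1f} s")
print(f"best possible overall speedup: {max_speedup:.2f}x")  # ~4.69x
```

Under these assumptions the measured 4.6x is already close to the ~4.7x ceiling, so further gains at this scale would have to come from the construction step.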
Building a VectorChord index has two phases:
- Clustering: run k-means to split all vectors into groups (the `lists` in the benchmark), each represented by a centroid. This is math-heavy: millions of distance computations across hundreds of dimensions, repeated over multiple iterations. Perfect for GPUs.
- Index construction: assign each vector to its nearest centroid and write the index to disk. This is I/O-heavy and runs on the CPU no matter what.
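To make the two phases concrete, here is a minimal pure-Python sketch of plain Lloyd's k-means (phase 1) followed by nearest-centroid assignment (phase 2). It illustrates the technique only; it is not how cuVS or VectorChord implement it, and both may use sampling or other optimizations. The assignment loop shows why clustering is expensive: each iteration costs n × k distance computations of `dim` terms, which at 1M × 4000 × 768 is about 3 × 10^12 multiply-adds per iteration if every vector takes part.

```python
import random
from typing import List, Sequence

Vector = List[float]

def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(v: Sequence[float], centroids: Sequence[Vector]) -> int:
    """Index of the centroid closest to v (also the phase-2 assignment rule)."""
    return min(range(len(centroids)), key=lambda i: sq_dist(v, centroids[i]))

def kmeans(data: Sequence[Vector], k: int, iterations: int = 10, seed: int = 0) -> List[Vector]:
    """Plain Lloyd's k-means: returns k centroids after a fixed number of iterations."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(list(data), k)]  # init from random points
    dim = len(data[0])
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        # Assignment step: n * k distance computations, each over `dim` components.
        for v in data:
            c = nearest(v, centroids)
            counts[c] += 1
            for j, x in enumerate(v):
                sums[c][j] += x
        # Update step: move each non-empty centroid to the mean of its members.
        for c in range(k):
            if counts[c]:
                centroids[c] = [s / counts[c] for s in sums[c]]
    return centroids

if __name__ == "__main__":
    rng = random.Random(1)
    # Two well-separated 2-D blobs around (0, 0) and (10, 10).
    data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
    data += [[rng.gauss(10, 1), rng.gauss(10, 1)] for _ in range(50)]
    cents = kmeans(data, k=2)                    # phase 1: clustering
    lists = [nearest(v, cents) for v in data]    # phase 2: assign vectors to lists
    print(sorted(round(c[0], 1) for c in cents))
```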
PGPU accelerates phase 1. It reads the vectors from PostgreSQL, sends them to the GPU in batches, runs cuVS k-means, stores the centroids back in a PG table, and then tells VectorChord to build the index using those pre-computed centroids.
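The GPU log shows PGPU working through the table in ten batches (`processing batch (1/10)`). How PGPU actually sizes and combines its batches isn't covered here; as a hypothetical illustration only, splitting n rows into a fixed number of contiguous batches could look like this (`batch_ranges` is our own helper, not a PGPU API):

```python
from typing import Iterator, Tuple

def batch_ranges(n_rows: int, n_batches: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open [start, end) row ranges covering n_rows in n_batches pieces.

    Hypothetical helper: earlier batches get one extra row when the split is uneven.
    """
    if n_rows < 0 or n_batches <= 0:
        raise ValueError("n_rows must be >= 0 and n_batches > 0")
    base, extra = divmod(n_rows, n_batches)
    start = 0
    for i in range(n_batches):
        size = base + (1 if i < extra else 0)
        yield start, start + size
        start += size

if __name__ == "__main__":
    for i, (lo, hi) in enumerate(batch_ranges(1_000_000, 10), start=1):
        print(f"batch ({i}/10): rows {lo}..{hi - 1}")
```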
Test hardware:

| Component | Details |
|---|---|
| Platform | NVIDIA Project DIGITS (GB10) |
| CPU | NVIDIA Grace (ARM, aarch64) |
| GPU | NVIDIA Blackwell (unified memory with CPU) |
| CUDA | 13.0 (driver 580.126.09) |
The unified memory is nice because there's no PCIe bottleneck copying data between CPU RAM and GPU VRAM. The GPU reads directly from the same memory PostgreSQL uses.
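For a sense of scale, this estimates how much raw vector data a discrete GPU would have to receive over PCIe at the 1M × 768 scale. It counts only the float32 payload (pgvector's `vector` type stores 4-byte floats) and ignores per-row headers and page overhead; the 10-way split mirrors the batch count in the GPU log. The helper is our own.

```python
FLOAT32_BYTES = 4  # pgvector's `vector` type stores single-precision floats

def payload_bytes(n_vectors: int, dim: int, n_batches: int = 1) -> int:
    """Approximate raw float32 payload per batch (no row headers or page overhead)."""
    return n_vectors * dim * FLOAT32_BYTES // n_batches

total = payload_bytes(1_000_000, 768)
per_batch = payload_bytes(1_000_000, 768, n_batches=10)
print(f"total: {total / 1e9:.2f} GB, per batch: {per_batch / 1e6:.0f} MB")
# prints: total: 3.07 GB, per batch: 307 MB
```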
```text
pgpu-lab/
├── docker-compose.yml              # just one service: pgpu
├── pgpu/
│   ├── Dockerfile                  # CUDA + conda (cuVS) + Rust (pgrx) + PG17 + extensions
│   └── entrypoint.sh               # inits the DB, creates extensions, starts PG
├── datasets/
│   └── generate_vectors.sql        # generates 1M random vectors (dim=768)
├── benchmarks/
│   └── 01_vector_index_bench.sql   # GPU vs CPU index build comparison
├── draft.md                        # blog post draft with full analysis
└── README.md
```
If you want to poke around:

```bash
docker compose exec pgpu psql -U postgres
```

Or from outside the container:

```bash
PGPASSWORD=benchmark psql -h localhost -p 5432 -U postgres
```

To tear everything down:

```bash
docker compose down -v  # stops everything and removes the data volume
```

Build fails with GCC/SIMD errors on ARM
The Dockerfile installs GCC 14 specifically for VectorChord's aarch64 fp16 SIMD code. If you're on x86_64, you probably won't hit this — but if you do, check that gcc-14 is available in your base image.
`vchord must be loaded via shared_preload_libraries`
The entrypoint handles this. If you see this error, the data volume probably has an old config. Run `docker compose down -v` and start fresh.
`libcuvs_c.so: cannot open shared object file`
The Dockerfile runs `ldconfig` to register conda's lib path. If this breaks, make sure a file in `/etc/ld.so.conf.d/` lists `/opt/conda/lib`, then rerun `ldconfig`.
Build takes forever
Yeah, it's a big image. The Rust compilations (cargo-pgrx, VectorChord, PGPU) take the most time. Subsequent rebuilds are faster thanks to Docker layer caching, as long as you don't change the early layers.
- PGPU by EnterpriseDB
- VectorChord by TensorChord
- pgvector by Andrew Kane
- cuVS by NVIDIA RAPIDS