pgpu-lab

A Docker-based lab to benchmark GPU-accelerated vector index building in PostgreSQL using PGPU.

We wanted to answer a simple question: how much faster can you build a vector index if you throw a GPU at it?

Short answer: at 1M vectors, the GPU path finishes in ~6 minutes; the CPU path takes ~27 minutes. The clustering step alone drops from ~21 minutes on CPU to ~10 seconds on GPU.

What's inside

The whole thing runs from a single Docker container with:

PostgreSQL 17 as the database
pgvector for the vector data type
VectorChord for the IVF index (vchordrq)
PGPU by EnterpriseDB — offloads the k-means clustering to the GPU using NVIDIA cuVS
NVIDIA cuVS 25.12 — the actual GPU k-means library under the hood

The benchmark generates 1M random vectors (768 dimensions, same as BERT embeddings), then builds the same VectorChord index twice: once with GPU-accelerated clustering (PGPU) and once with CPU-only clustering. You compare the times and that's it.
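
To get a feel for the size of this workload, here is a quick back-of-the-envelope calculation. It is only a sketch: lists=4000 is taken from the benchmark output shown later in this README, the byte count covers the raw float payload only (PostgreSQL storage overhead comes on top), and the distance count assumes a brute-force assignment step, which real implementations may batch or prune.

```python
num_vectors = 1_000_000
dim = 768
lists = 4000             # from the benchmark output (lists=4000)
bytes_per_float = 4      # pgvector's vector type stores single-precision floats

# Raw embedding payload, ignoring any per-row or per-page storage overhead.
raw_bytes = num_vectors * dim * bytes_per_float

# Average number of vectors that end up in each IVF list.
avg_vectors_per_list = num_vectors / lists

# Assumption: brute-force k-means compares every vector with every centroid.
distances_per_iteration = num_vectors * lists

print(f"raw embedding payload: {raw_bytes / 1e9:.2f} GB")
print(f"average vectors per list: {avg_vectors_per_list:.0f}")
print(f"distance computations per iteration: {distances_per_iteration:.1e}")
```

Roughly 3 GB of raw floats and billions of 768-dimensional distance computations per iteration is why the clustering step is the interesting one to offload.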

Prerequisites

Docker with NVIDIA Container Toolkit installed
An NVIDIA GPU (tested on Blackwell / GB10; other CUDA-capable GPUs supported by cuVS should work, but are untested)
~16 GB RAM and ~20 GB disk for the Docker image

Heads up: the Docker image is big (~10 GB). It has to install conda for cuVS, the Rust toolchain to build the extensions from source, GCC 14 for VectorChord's ARM SIMD code, and a bunch of other things. First build takes a while. Go grab a coffee.

Quick start

# clone the repo
git clone https://github.com/rhossi/pgpu-lab.git
cd pgpu-lab

# build the image (this takes 15-30 min the first time)
docker compose build

# start the database
docker compose up -d

# check it's running
docker compose logs --tail 5

You should see something like:

pgpu-1  | CREATE EXTENSION    (vector)
pgpu-1  | CREATE EXTENSION    (vchord)
pgpu-1  | CREATE EXTENSION    (pgpu)
pgpu-1  | ==> Starting PostgreSQL with PGPU …
pgpu-1  | database system is ready to accept connections

All three extensions loaded? Good. Now run the benchmark:

# generate 1M vectors (takes ~5 min)
docker compose exec pgpu psql -U postgres -f /datasets/generate_vectors.sql

# run GPU vs CPU index build benchmark
docker compose exec pgpu psql -U postgres -f /benchmarks/01_vector_index_bench.sql

What you should see

Something like this (times will vary depending on your GPU):

--- B28: PGPU GPU index build (1M vectors, dim=768, lists=4000) ---
INFO:  running GPU accelerated index build for public.vectors_1m.embedding
INFO:  processing batch (1/10)
INFO:  Clustering vectors on GPU
...
INFO:  Training complete (9.73s). Building VectorChord Index...
Time: 351562.725 ms (05:51.563)

--- B29: CPU index build (1M vectors, dim=768, lists=4000) ---
INFO:  clustering: using 4 threads
INFO:  clustering: iteration 1
...
INFO:  clustering: iteration 10
INFO:  clustering: finished
Time: 1602497.725 ms (26:42.498)

The GPU path: 5 min 52 seconds. The CPU path: 26 min 42 seconds.

That's a 4.6x overall speedup. But the interesting part is the clustering phase — GPU did it in 9.7 seconds, CPU took about 21 minutes. That's roughly 130x faster on the clustering alone.
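
If you want to reproduce these ratios from your own run, the psql Time: lines are easy to parse. The snippet below is a small sketch using the numbers from the sample output; the helper name elapsed_seconds is ours and not part of the lab, and the 21-minute CPU clustering figure is the approximate value quoted above rather than a logged measurement.

```python
import re

def elapsed_seconds(time_line):
    """Extract seconds from a psql timing line such as 'Time: 351562.725 ms (05:51.563)'."""
    match = re.search(r"Time:\s*([\d.]+)\s*ms", time_line)
    if match is None:
        raise ValueError(f"not a psql timing line: {time_line!r}")
    return float(match.group(1)) / 1000.0

gpu_total = elapsed_seconds("Time: 351562.725 ms (05:51.563)")
cpu_total = elapsed_seconds("Time: 1602497.725 ms (26:42.498)")
overall_speedup = cpu_total / gpu_total

gpu_clustering = 9.73      # seconds, from "Training complete (9.73s)"
cpu_clustering = 21 * 60   # seconds, approximate figure quoted in the text
clustering_speedup = cpu_clustering / gpu_clustering

print(f"overall: {overall_speedup:.1f}x, clustering: {clustering_speedup:.0f}x")
```

With the sample numbers this gives about 4.6x overall and about 129x for clustering, which we round to 130x.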

Our results

We ran this at three different scales:

Dataset	GPU total	CPU total	Speedup
100K vectors (dim=768)	7 s	16 s	2.2x
500K vectors (dim=768)	55 s	3 min 46 s	4.1x
1M vectors (dim=768)	5 min 52 s	26 min 42 s	4.6x

The speedup grows with data size. This makes sense: the GPU finishes clustering in seconds at every scale we tested, so the bigger the dataset, the more CPU clustering time you save.

At 1M vectors, most of the GPU path's time (~342 seconds) is spent on the VectorChord index construction step, which runs on CPU either way. Once clustering is done, the GPU sits idle while the index builder finishes.
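
A back-of-the-envelope breakdown of the 1M-vector run shows where the time goes. This is a sketch under two assumptions of ours that the logs do not state directly: everything in the GPU run that is not GPU training counts as index construction, and the CPU run pays the same construction cost.

```python
gpu_total = 351.563     # s, GPU path wall time at 1M vectors
gpu_training = 9.73     # s, from "Training complete (9.73s)"
cpu_total = 1602.498    # s, CPU path wall time at 1M vectors

# Assumption: whatever is not GPU training is CPU-bound index construction.
construction = gpu_total - gpu_training

# Assumption: the CPU run pays the same construction cost; the rest is k-means.
cpu_clustering = cpu_total - construction

# Upper bound on the overall speedup if clustering cost nothing at all.
ceiling = cpu_total / construction

print(f"construction ~{construction:.0f} s")
print(f"CPU clustering ~{cpu_clustering / 60:.0f} min")
print(f"speedup ceiling ~{ceiling:.1f}x")
```

Under these assumptions the measured 4.6x is already close to the ~4.7x ceiling, so any further gains would have to come from the construction phase.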

How it works

Building a VectorChord index has two phases:

1. Clustering: run k-means to split all vectors into clusters and compute a centroid for each. This is math-heavy: millions of distance computations across hundreds of dimensions, repeated over multiple iterations. Perfect for GPUs.
2. Index construction: assign each vector to its nearest centroid and write the index to disk. This is I/O-heavy and runs on CPU no matter what.

PGPU accelerates phase 1. It reads the vectors from PostgreSQL, sends them to the GPU in batches, runs cuVS k-means, stores the centroids back in a PostgreSQL table, and then tells VectorChord to build the index using those pre-computed centroids.
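
To make the two phases concrete, here is a toy pure-Python sketch. It is not PGPU's or VectorChord's code: plain Lloyd's k-means stands in for whatever cuVS runs on the GPU, the on-disk index is reduced to a dict of lists, and the function names (kmeans, build_lists) are made up for illustration.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vector, centroids):
    """Index of the centroid closest to vector."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(vector, centroids[i]))

def kmeans(vectors, k, iterations=10, seed=0):
    """Phase 1 (toy version): Lloyd's k-means, returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        # Assignment step: compare every vector against every centroid.
        members = [[] for _ in range(k)]
        for v in vectors:
            members[nearest(v, centroids)].append(v)
        # Update step: move each centroid to the mean of its members.
        for i, group in enumerate(members):
            if group:  # keep the old centroid if a cluster went empty
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids

def build_lists(vectors, centroids):
    """Phase 2 (simplified): bucket each vector id under its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vec_id, v in enumerate(vectors):
        lists[nearest(v, centroids)].append(vec_id)
    return lists

if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11)]
    centroids = kmeans(data, k=2)
    print(sorted(centroids))
    print(build_lists(data, centroids))
```

The nested loop in the assignment step is the part that dominates at 1M vectors, and it is the part that maps well onto a GPU. Phase 2 reuses the same nearest-centroid lookup but only once per vector.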

Hardware we tested on


Platform	NVIDIA Project DIGITS (GB10)
CPU	NVIDIA Grace (ARM, aarch64)
GPU	NVIDIA Blackwell (unified memory with CPU)
CUDA	13.0 (driver 580.126.09)

The unified memory is nice because there's no PCIe bottleneck copying data between CPU RAM and GPU VRAM. The GPU reads directly from the same memory PostgreSQL uses.

Project structure

pgpu-lab/
├── docker-compose.yml       # just one service: pgpu
├── pgpu/
│   ├── Dockerfile           # CUDA + conda (cuVS) + Rust (pgrx) + PG17 + extensions
│   └── entrypoint.sh        # inits the DB, creates extensions, starts PG
├── datasets/
│   └── generate_vectors.sql  # generates 1M random vectors (dim=768)
├── benchmarks/
│   └── 01_vector_index_bench.sql  # GPU vs CPU index build comparison
├── draft.md                 # blog post draft with full analysis
└── README.md

Connecting directly

If you want to poke around:

docker compose exec pgpu psql -U postgres

Or from outside the container:

PGPASSWORD=benchmark psql -h localhost -p 5432 -U postgres

Tearing down

docker compose down -v    # stops everything and removes the data volume

Troubleshooting

Build fails with GCC/SIMD errors on ARM
The Dockerfile installs GCC 14 specifically for VectorChord's aarch64 fp16 SIMD code. If you're on x86_64, you probably won't hit this. If you do, check that gcc-14 is available in your base image.

vchord must be loaded via shared_preload_libraries
The entrypoint handles this. If you see this error, the data volume probably has an old config. Run docker compose down -v and start fresh.

libcuvs_c.so: cannot open shared object file
The Dockerfile runs ldconfig to register conda's lib path. If this breaks, make sure /opt/conda/lib is listed in a file under /etc/ld.so.conf.d/.

Build takes forever
Yeah, it's a big image. The Rust compilations (cargo-pgrx, VectorChord, PGPU) take the most time. Subsequent rebuilds are faster thanks to Docker layer caching, as long as you don't change the early layers.

Credits

PGPU by EnterpriseDB
VectorChord by TensorChord
pgvector by Andrew Kane
cuVS by NVIDIA RAPIDS
