pgpu-lab

A Docker-based lab to benchmark GPU-accelerated vector index building in PostgreSQL using PGPU.

We wanted to answer a simple question: how much faster can you build a vector index if you throw a GPU at it?

Short answer: at 1M vectors, the GPU path finishes in ~6 minutes. CPU takes ~27 minutes. The clustering step alone goes from 21 minutes on CPU to 10 seconds on GPU.

What's inside

The whole thing runs from a single Docker container with:

  • PostgreSQL 17 as the database
  • pgvector for the vector data type
  • VectorChord for the IVF index (vchordrq)
  • PGPU by EnterpriseDB — offloads the k-means clustering to the GPU using NVIDIA cuVS
  • NVIDIA cuVS 25.12 — the actual GPU k-means library under the hood

The benchmark generates 1M random vectors (768 dimensions, same as BERT-base embeddings), then builds the same VectorChord index twice: once with GPU-accelerated clustering (PGPU) and once with CPU-only clustering. You compare the times and that's it.
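For a sense of scale, here is a rough estimate of how much raw vector data the benchmark generates. It relies on pgvector's documented storage cost of 4 bytes per dimension plus 8 bytes per vector, and it ignores PostgreSQL row, page, and TOAST overhead, so treat the result as a lower bound rather than the actual table size.

```python
# Rough payload size of the benchmark table (vector data only).
NUM_VECTORS = 1_000_000   # from the benchmark description
DIMENSIONS = 768          # from the benchmark description
BYTES_PER_FLOAT = 4       # pgvector stores single-precision floats
HEADER_BYTES = 8          # per-vector overhead documented by pgvector

bytes_per_vector = BYTES_PER_FLOAT * DIMENSIONS + HEADER_BYTES
total_bytes = NUM_VECTORS * bytes_per_vector
total_gib = total_bytes / 2**30

print(f"{bytes_per_vector} bytes per vector, {total_gib:.2f} GiB in total")
```

The vectors alone come to a little under 3 GiB before any index is built.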

Prerequisites

  • Docker with NVIDIA Container Toolkit installed
  • An NVIDIA GPU (tested on Blackwell / GB10, should work on any CUDA-capable GPU)
  • ~16 GB RAM and ~20 GB disk for the Docker image

Heads up: the Docker image is big (~10 GB). It has to install conda for cuVS, the Rust toolchain to build the extensions from source, GCC 14 for VectorChord's ARM SIMD code, and a bunch of other things. First build takes a while. Go grab a coffee.

Quick start

# clone the repo
git clone https://github.com/rhossi/pgpu-lab.git
cd pgpu-lab

# build the image (this takes 15-30 min the first time)
docker compose build

# start the database
docker compose up -d

# check it's running
docker compose logs --tail 5

You should see something like:

pgpu-1  | CREATE EXTENSION    (vector)
pgpu-1  | CREATE EXTENSION    (vchord)
pgpu-1  | CREATE EXTENSION    (pgpu)
pgpu-1  | ==> Starting PostgreSQL with PGPU …
pgpu-1  | database system is ready to accept connections

All three extensions loaded? Good. Now run the benchmark:

# generate 1M vectors (takes ~5 min)
docker compose exec pgpu psql -U postgres -f /datasets/generate_vectors.sql

# run GPU vs CPU index build benchmark
docker compose exec pgpu psql -U postgres -f /benchmarks/01_vector_index_bench.sql

What you should see

Something like this (times will vary depending on your GPU):

--- B28: PGPU GPU index build (1M vectors, dim=768, lists=4000) ---
INFO:  running GPU accelerated index build for public.vectors_1m.embedding
INFO:  processing batch (1/10)
INFO:  Clustering vectors on GPU
...
INFO:  Training complete (9.73s). Building VectorChord Index...
Time: 351562.725 ms (05:51.563)

--- B29: CPU index build (1M vectors, dim=768, lists=4000) ---
INFO:  clustering: using 4 threads
INFO:  clustering: iteration 1
...
INFO:  clustering: iteration 10
INFO:  clustering: finished
Time: 1602497.725 ms (26:42.498)

The GPU path: 5 min 52 seconds. The CPU path: 26 min 42 seconds.

That's a 4.6x overall speedup. But the interesting part is the clustering phase — GPU did it in 9.7 seconds, CPU took about 21 minutes. That's roughly 130x faster on the clustering alone.
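If you want to pull these numbers out of your own run instead of reading them off the screen, a small script along these lines should do. It is only a sketch: it assumes the log contains psql `Time: … ms` lines and PGPU's `Training complete (…s)` message exactly as shown above, and the function name `elapsed_seconds` is made up for this example.

```python
import re

def elapsed_seconds(log_text):
    """Return every psql `Time:` value found in the log, in seconds."""
    matches = re.findall(r"^Time: ([0-9.]+) ms", log_text, flags=re.MULTILINE)
    return [float(ms) / 1000.0 for ms in matches]

# Sample lines copied from the benchmark output above.
log = """\
INFO:  Training complete (9.73s). Building VectorChord Index...
Time: 351562.725 ms (05:51.563)
INFO:  clustering: finished
Time: 1602497.725 ms (26:42.498)
"""

# The benchmark script runs the GPU build first, then the CPU build.
gpu_total, cpu_total = elapsed_seconds(log)
overall_speedup = cpu_total / gpu_total

# GPU clustering time as reported by the "Training complete" message.
training = float(re.search(r"Training complete \(([0-9.]+)s\)", log).group(1))
gpu_non_clustering = gpu_total - training

print(f"overall speedup: {overall_speedup:.1f}x")
print(f"GPU path outside clustering: {gpu_non_clustering:.0f} s")
```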

Our results

We ran this at three different scales:

Dataset                   GPU total     CPU total      Speedup
100K vectors (dim=768)    7 s           16 s           2.2x
500K vectors (dim=768)    55 s          3 min 46 s     4.1x
1M vectors (dim=768)      5 min 52 s    26 min 42 s    4.6x

The speedup grows with data size. This makes sense — the GPU finishes clustering almost instantly at any scale, so the bigger the dataset, the more CPU time you're saving.

At 1M vectors, most of the GPU path's time (~342 of ~352 seconds) is spent on the VectorChord index construction step, which runs on CPU either way. The GPU sits idle while the index builder finishes.
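The same breakdown can be written as a small Amdahl-style calculation. It assumes, as a simplification, that index construction takes the same time in both runs (it runs on CPU either way). The symbols are introduced here for illustration, and every number comes from the 1M-vector run above.

```latex
\begin{align*}
T_{\text{build}} &\approx T^{\text{GPU}}_{\text{total}} - T^{\text{GPU}}_{\text{cluster}}
  = 351.56 - 9.73 = 341.83\ \text{s} \\
T^{\text{CPU}}_{\text{cluster}} &\approx T^{\text{CPU}}_{\text{total}} - T_{\text{build}}
  = 1602.50 - 341.83 = 1260.67\ \text{s} \approx 21\ \text{min} \\
S &= \frac{T^{\text{CPU}}_{\text{total}}}{T^{\text{GPU}}_{\text{total}}}
  = \frac{1602.50}{351.56} \approx 4.56 \\
S_{\max} &= \lim_{T^{\text{GPU}}_{\text{cluster}} \to 0}
  \frac{T^{\text{CPU}}_{\text{total}}}{T^{\text{GPU}}_{\text{cluster}} + T_{\text{build}}}
  = \frac{1602.50}{341.83} \approx 4.69
\end{align*}
```

Under that assumption, even an infinitely fast clustering step would cap the 1M-vector speedup at about 4.7x, and the measured 4.6x is already close to it. Any further gain would have to come from the index construction step.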

How it works

Building a VectorChord index has two phases:

  1. Clustering: run k-means to split all vectors into groups (clusters), each represented by a centroid. This is math-heavy: millions of distance computations across hundreds of dimensions, repeated over multiple iterations. Perfect for GPUs.

  2. Index construction: assign each vector to its nearest centroid and write the index to disk. This is I/O-heavy and runs on CPU no matter what.

PGPU accelerates phase 1. It reads the vectors from PostgreSQL, sends them to the GPU in batches, runs cuVS k-means, stores the centroids back in a PG table, and then tells VectorChord to build the index using those pre-computed centroids.
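To make the two phases concrete, here is a toy, CPU-only sketch in pure Python. It is plain Lloyd's k-means on a tiny synthetic dataset, not the cuVS or VectorChord implementation, and all function names are hypothetical. It only shows where the work goes: phase 1 repeats the distance computations for every iteration, while phase 2 is a single assignment pass.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to `vector`."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(vector, centroids[i]))

def kmeans(vectors, k, iterations=10, seed=0):
    """Phase 1: Lloyd's k-means. Returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    dim = len(vectors[0])
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        # Every vector is compared with every centroid in each iteration.
        for v in vectors:
            i = nearest_centroid(v, centroids)
            counts[i] += 1
            for d in range(dim):
                sums[i][d] += v[d]
        for i in range(k):
            if counts[i]:  # keep the old centroid if a cluster went empty
                centroids[i] = [s / counts[i] for s in sums[i]]
    return centroids

def build_lists(vectors, centroids):
    """Phase 2: put each vector id into the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for vec_id, v in enumerate(vectors):
        lists[nearest_centroid(v, centroids)].append(vec_id)
    return lists

# Toy data: two well-separated blobs of 50 eight-dimensional vectors each.
rng = random.Random(42)
vectors = [[rng.gauss(center, 0.1) for _ in range(8)]
           for center in (0.0, 5.0) for _ in range(50)]

centroids = kmeans(vectors, k=2)       # the part PGPU moves to the GPU
lists = build_lists(vectors, centroids)  # stays on the CPU
print([len(ids) for ids in lists])
```

In the benchmark, lists=4000 over 1M vectors works out to 250 vectors per list on average.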

Hardware we tested on

Platform    NVIDIA Project DIGITS (GB10)
CPU         NVIDIA Grace (ARM, aarch64)
GPU         NVIDIA Blackwell (unified memory with CPU)
CUDA        13.0 (driver 580.126.09)

The unified memory is nice because there's no PCIe bottleneck copying data between CPU RAM and GPU VRAM. The GPU reads directly from the same memory PostgreSQL uses.

Project structure

pgpu-lab/
├── docker-compose.yml       # just one service: pgpu
├── pgpu/
│   ├── Dockerfile           # CUDA + conda (cuVS) + Rust (pgrx) + PG17 + extensions
│   └── entrypoint.sh        # inits the DB, creates extensions, starts PG
├── datasets/
│   └── generate_vectors.sql  # generates 1M random vectors (dim=768)
├── benchmarks/
│   └── 01_vector_index_bench.sql  # GPU vs CPU index build comparison
├── draft.md                 # blog post draft with full analysis
└── README.md

Connecting directly

If you want to poke around:

docker compose exec pgpu psql -U postgres

Or from outside the container:

PGPASSWORD=benchmark psql -h localhost -p 5432 -U postgres

Tearing down

docker compose down -v    # stops everything and removes the data volume

Troubleshooting

Build fails with GCC/SIMD errors on ARM
The Dockerfile installs GCC 14 specifically for VectorChord's aarch64 fp16 SIMD code. If you're on x86_64, you probably won't hit this, but if you do, check that gcc-14 is available in your base image.

"vchord must be loaded via shared_preload_libraries"
The entrypoint handles this. If you see this error, the data volume probably has an old config. Run docker compose down -v and start fresh.

"libcuvs_c.so: cannot open shared object file"
The Dockerfile runs ldconfig to register conda's lib path. If this breaks, make sure /opt/conda/lib is listed in a file under /etc/ld.so.conf.d/.

Build takes forever
Yeah, it's a big image. The Rust compilations (cargo-pgrx, VectorChord, PGPU) take the most time. Subsequent rebuilds are faster thanks to Docker layer caching, as long as you don't change the early layers.
