GitHub - j9smith/fast-vllm: Experimentation to reduce vLLM container startup times for efficient serving.

Model: Llama-3.2-3B-Instruct

Current progress

Naive cold start: ~81s
criu (patched to use mmap instead of preadv) + overlapped vmtouch on container start + vLLM sleep level 2 + reload_weights_fast: ~6s

Experiment 1 - cache mounting

Simply remount the caches (~/.cache/vllm/* ~/.triton/* ~/.cache/flashinfer/* ~/.nv/ComputeCache) generated during vllm init.
Cold start, weights on disk, all caches cleared: 81 seconds
Warm start, weights in page cache (RAM), caches filled: 38 seconds

Experiment 2 - criu, vllm sleep level 1

Capture vLLM at sleep level 1. See Sleep Mode.
Necessary criu flags:

--enable-external-masters
--tcp-established
--shell-job
--link-remap

Checkpointing fails on io_uring -- run sudo sysctl kernel.io_uring_disabled=2

Checkpoint + restore successful. However, takes 15-30s to complete (first response). Disk constrained transfer.

Separately, criu loads via preadv - we should be able to speed this up with mmap (see doubleword criu fork). preadv iteratively reads through batches of pages and stages the result in kernel space -> user space (copy required), adding significant overhead per call. mmap instead allows us to map the pages directly into the process's virtual address space (zero copy).

criu restore time is bounded by how fast we can read the pages-img file from disk into the restored process memory. To force slow case, clear page cache (criu restore takes 20.4s):

sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
./run.sh 2

Force prefetch image into page cache to get fast case (10.7s):

vmtouch -t experiments/[DUMP DIR]/pages-*.img
./run.sh 2

Switching to doubleword criu fork zero-copy branch (restorer.c rewritten to use mmap instead of preadv) reduces criu restore to 8.1s when warm (image pages already in page cache):

docker build \
  --build-arg CRIU_REPO=https://github.com/doublewordai/criu \
  --build-arg CRIU_REF=warmstart/zero-copy-restore .

We can overlap vmtouch (to move *.img into page cache) with container load, so the cache is warmed when criu loads the pages.

Experiment 3 - criu, vllm sleep level 2

All else applies as above, but capture vLLM at sleep level 2.

Using sleep level 2 reduced the checkpoint size from 11.4gb to 3.4gb. criu restore now takes ~2.5-3s (with vmtouch overlapped), and reloading weights takes ~6-8s for a total of ~10-11s.

Experiment 4 - as above, but custom weight reloader

See exp/fast-weights vLLM fork (particularly ad13a45).

Firstly serialise the loaded weights onto disk (via _dump_weights) before sleep + checkpoint, then on restore reload them via a parallelised double-buffered weight loader (reload_weights_fast).

Reloading the checkpoint via vLLM's path (reload_weights) takes ~8s - reloading via reload_weights_fast takes ~2.5-3s. With criu restore taking ~2.5-3s, we now have a new reload time of ~6s.

Progress

~~Exploring how to release pinned anonymous memory (~8.5gb in /dev/zero); current hypothesis is that it is PyTorch allocator/CUDA -- possibly maintaining weight buffer alloc even after weights are released from device memory. criu dumps this memory, inflating checkpoint sizes. If we could decrease checkpoint size, restore should be faster.~~ -- vLLM was in sleep mode 1 (not 2), and so the weights were being stored in the checkpoint. Rectifying this reduced from 16s to 11s.
~~criu restore dropped to ~3s. Weight loading taking up to 8s. Need to address this bottleneck.~~ Implemented new weight loader, reload_weights_fast; reduced weight load from ~6-8s -> ~2.5-3s.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
experiments		experiments
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
run.sh		run.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Current progress

Experiment 1 - cache mounting

Experiment 2 - criu, vllm sleep level 1

Experiment 3 - criu, vllm sleep level 2

Experiment 4 - as above, but custom weight reloader

Progress

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Current progress

Experiment 1 - cache mounting

Experiment 2 - criu, vllm sleep level 1

Experiment 3 - criu, vllm sleep level 2

Experiment 4 - as above, but custom weight reloader

Progress

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages