A minimal LLM inference engine implementing PagedAttention-style KV cache management on NanoGPT. Based on the "Efficient Memory Management for Large Language Model Serving with PagedAttention" paper.
In transformers, each step of token generation requires computing query, key, and value tensors for self-attention. The key and value tensors must be computed for all previous tokens in the sequence, so recomputing them from scratch at every step scales quadratically with sequence length.
To save on computation, KV caches store computed KV tensors, allowing new iterations in the autoregressive stage to only compute a single key and value tensor for the new token generated in the preceding step. However, these KV caches for requests are typically stored in contiguous memory space. Because they grow over their lifetime, this requires that enough memory is initially reserved for the maximum possible generated sequence length. This leads to significant amounts of internal and external fragmentation that waste memory for an already memory-limited process.
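As a rough illustration (not the project's actual code, and with placeholder projection weights and shapes), the difference between recomputing key/value tensors from scratch and reusing a cache looks like this in PyTorch:

```python
import torch

d_model = 64
W_k = torch.randn(d_model, d_model)   # placeholder key projection
W_v = torch.randn(d_model, d_model)   # placeholder value projection

# Without a cache: every step re-projects the entire sequence.
def kv_no_cache(hidden_states):                      # (seq_len, d_model)
    return hidden_states @ W_k, hidden_states @ W_v

# With a cache: only the newest token is projected, then appended.
def kv_with_cache(new_hidden, k_cache, v_cache):     # new_hidden: (1, d_model)
    k_cache = torch.cat([k_cache, new_hidden @ W_k], dim=0)
    v_cache = torch.cat([v_cache, new_hidden @ W_v], dim=0)
    return k_cache, v_cache

k_cache = torch.empty(0, d_model)
v_cache = torch.empty(0, d_model)
for _ in range(4):                                   # pretend decode loop
    new_token_hidden = torch.randn(1, d_model)
    k_cache, v_cache = kv_with_cache(new_token_hidden, k_cache, v_cache)
```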
Implementing a virtual memory and paging system, similar to how an OS manages memory, allows KV tensors to be stored in blocks that are allocated dynamically, do not need to be contiguous, and can be shared between requests. This reduces memory waste significantly; the original vLLM paper reported 2-4x throughput improvements over systems that use contiguous allocation.
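In code, the idea can be sketched with a hypothetical per-request block table that maps logical block indices to physical block indices in the pre-allocated cache (the values below are purely illustrative):

```python
BLOCK_SIZE = 16  # tokens per block (illustrative value)

# Hypothetical block table for one request: logical block -> physical block.
block_table = {0: 7, 1: 2, 2: 11}  # physical blocks need not be contiguous

def locate(token_pos):
    """Map a token position to (physical_block, offset_within_block)."""
    logical_block, offset = divmod(token_pos, BLOCK_SIZE)
    return block_table[logical_block], offset

print(locate(37))  # -> (11, 5): token 37 lives in physical block 11, slot 5
```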
The engine is composed of three core components:
BlockAllocator: Maintains the state of all blocks, allocates them during the prefill and decode stages, and frees them once a request completes (sketched below).
KVCache: Uses the block table managed by the BlockAllocator to determine where in the pre-allocated GPU tensor to read and write KV tensors.
InferenceEngine: The orchestrator; it receives a prompt, runs prefill and the decode loop, frees the request's blocks, and returns the output.
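A minimal sketch of how the first two pieces might fit together; the class names match the components above, but the method names, shapes, and buffer layout are assumptions rather than the repository's exact API:

```python
import torch

class BlockAllocator:
    """Tracks free physical blocks and hands them out per request."""
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                     # request_id -> [physical blocks]

    def allocate(self, request_id, num_needed):
        blocks = [self.free_blocks.pop() for _ in range(num_needed)]
        self.block_tables.setdefault(request_id, []).extend(blocks)
        return self.block_tables[request_id]

    def free(self, request_id):
        self.free_blocks.extend(self.block_tables.pop(request_id, []))

class KVCache:
    """Reads/writes KV tensors in a pre-allocated tensor via the block table."""
    def __init__(self, num_blocks, block_size, d_model):
        # One big pre-allocated buffer: (num_blocks, block_size, 2, d_model)
        self.buffer = torch.zeros(num_blocks, block_size, 2, d_model)
        self.block_size = block_size

    def write(self, block_table, token_pos, k, v):
        logical, offset = divmod(token_pos, self.block_size)
        block = block_table[logical]
        self.buffer[block, offset, 0] = k
        self.buffer[block, offset, 1] = v

allocator = BlockAllocator(num_blocks=64)
cache = KVCache(num_blocks=64, block_size=16, d_model=128)
table = allocator.allocate("req-0", num_needed=2)      # enough blocks for the prompt
cache.write(table, token_pos=0, k=torch.randn(128), v=torch.randn(128))
allocator.free("req-0")                                # blocks return to the pool
```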
The benchmark.py script evaluates the performance impact of KV caching by comparing three generation methods:
- KV Cache (Paged): Uses InferenceEngine with a small block size so KV tensors are stored across multiple blocks managed by the allocator.
- KV Cache (No Paging): Uses InferenceEngine with a single large contiguous block, so KV caching is still used but without page-like block splitting.
- No KV Cache (Baseline): Uses the standard NanoGPT implementation to generate tokens. In this mode, the model performs a full forward pass over the entire growing sequence (prompt + all generated tokens) for every single new token, which is computationally expensive as the sequence length increases.
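To make that cost gap concrete, a rough back-of-the-envelope count of attention score computations (ignoring model width and constant factors) shows why the baseline slows down as generation proceeds:

```python
prompt_len, new_tokens = 32, 200

# Without a cache, every step runs full self-attention over the whole
# sequence: roughly (seq_len)^2 score computations per step.
no_cache_scores = sum((prompt_len + t) ** 2 for t in range(1, new_tokens + 1))

# With a cache, each step scores only the newest query against the cached
# keys: roughly seq_len computations per step.
cache_scores = sum(prompt_len + t for t in range(1, new_tokens + 1))

print(no_cache_scores / cache_scores)   # the ratio grows with generation length
```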
Key Features
- Metric Tracking: Measures average generation time, tokens per second (TPS), and calculates the speedup ratio provided by the KV cache (a sketch of this calculation follows the list).
- Configurable Parameters: Allows testing across different prompt lengths, token counts, and memory configurations (block size and total blocks).
- Automated Reporting: Appends timing data to a CSV file (benchmark_results.csv) for tracking and analysis.
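For example, the per-trial timing, speedup ratio, and CSV row could be produced roughly as follows; the callables and column order here are placeholders rather than the script's actual ones:

```python
import csv
import time

def timed_generate(generate_fn, num_tokens):
    """Time a generation callable and return (runtime_seconds, tokens_per_second)."""
    start = time.perf_counter()
    generate_fn()
    elapsed = time.perf_counter() - start
    return elapsed, num_tokens / elapsed

# Placeholder callables standing in for the real generation modes.
num_tokens = 200
baseline_time, baseline_tps = timed_generate(lambda: time.sleep(0.2), num_tokens)
paged_time, paged_tps = timed_generate(lambda: time.sleep(0.05), num_tokens)

speedup = baseline_time / paged_time               # >1 means the KV cache helped

# Append one trial's results; the column order is illustrative.
with open("benchmark_results.csv", "a", newline="") as f:
    csv.writer(f).writerow([num_tokens, paged_time, paged_tps, speedup])
```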
The benchmark_results_viz.ipynb notebook visualizes the data from the output CSV. It graphs Average Runtime vs Generated Tokens, KV Cache Speedup by Trial, and Throughput by Trial (tokens/sec).
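A notebook cell along these lines could reproduce the runtime plot; the column names are assumptions about the CSV layout rather than guaranteed matches:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed columns: generated_tokens, avg_runtime_s (actual names may differ).
df = pd.read_csv("benchmark_results.csv")

plt.plot(df["generated_tokens"], df["avg_runtime_s"], marker="o")
plt.xlabel("Generated tokens")
plt.ylabel("Average runtime (s)")
plt.title("Average Runtime vs Generated Tokens")
plt.show()
```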
On CPU, the paged KV cache implementation typically provides a 2x to 10x speedup compared to no KV cache for moderate generation lengths (200-500 tokens), significantly reducing latency as the sequence grows. Comparing the paged and non-paged KV caches, there is currently no significant difference in speedup, since only sequential generation is supported; once concurrent requests are implemented, a difference is expected.
The project has been fully implemented in its most minimal form, supporting sequential generation. Support for concurrent requests, which would maximize the benefit of a paged KV cache, is currently in progress.



