MICROGPT.CPP

A C++ implementation of a minimal GPT model inspired by Andrej Karpathy’s microgpt.py, using only the C++ standard library and a simple memory arena allocator.

The focus is on readability (and optimization) rather than minimizing line count.
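As a rough illustration of what "a simple memory arena allocator" can mean, here is a minimal bump-arena sketch. The `Arena` type and its `alloc` and `reset` members are hypothetical names chosen for this example and are not taken from microgpt.cpp. It assumes a single up-front allocation of floats that is handed out linearly and released all at once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bump arena (illustrative only, not the project's actual type):
// one up-front allocation, O(1) alloc, and a bulk reset instead of per-object frees.
struct Arena {
    std::vector<float> storage;  // backing memory, sized once in the constructor
    std::size_t used = 0;        // number of floats handed out so far

    explicit Arena(std::size_t capacity) : storage(capacity) {}

    // Hand out `n` contiguous floats; returns nullptr when the arena is full.
    float* alloc(std::size_t n) {
        if (n > storage.size() - used) return nullptr;
        float* p = storage.data() + used;
        used += n;
        return p;
    }

    // Release everything at once, for example at the end of a training step.
    void reset() { used = 0; }
};
```

Because the backing vector never grows after construction, pointers returned by `alloc` stay valid until the next `reset`.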

Build & Run

g++ -std=c++17 -DDEBUG -Ofast -march=native microgpt.cpp -o microgpt
./microgpt

You can remove -DDEBUG for slightly faster execution. For maximum performance:

Performance

Andrej Karpathy's microgpt.py was never meant to be fast; its goal is extreme readability, making it easy to learn how a GPT works from scratch. This C++ version focuses on optimization while remaining reasonably readable (barely, at this point), because optimization is fun.

All benchmarks were run on an Intel Core Ultra 7 165H, using PyPy JIT as the Python performance baseline.

Note: "16x16" means N_EMBD=16 and BLOCK_SIZE=16; "64x64" follows the same convention.

16x16 network, 10000 steps

| Implementation | Time | vs PyPy JIT | Main changes |
| --- | --- | --- | --- |
| Python (CPython) | 22m 4s | ~6.7x slower | |
| Python (PyPy JIT) | 3m 16s | 1x | JIT compilation |
| C++ (original) | 3.3s | ~60x | Wengert tape, AoS arena, `f64` |
| C++ (enhanced) | 2.2s | ~88x | + SoA arena, flat KV cache, stack arrays |
| Rust | 2.0s | ~97x | + Op enum backward, `f32`, unsafe ptrs |
| C++ (current) / Rust (current) | 1.3s | ~152x | + true FMA, stack KV cache, `-Ofast` |

64x64 network, 1000 steps

| Implementation | Time | vs PyPy JIT | Main changes |
| --- | --- | --- | --- |
| Python (CPython) | 1h 14m | ~10.9x slower | |
| Python (PyPy JIT) | 6m 47s | 1x | JIT compilation |
| C++ (original) | 8.2s | ~50x | Wengert tape, AoS arena, `f64` |
| C++ (enhanced) | 3.5s | ~115x | + SoA arena, flat KV cache, stack arrays |
| Rust | 2.6s | ~157x | + Op enum backward, `f32`, unsafe ptrs |
| C++ (current) / Rust (current) | 1.6s | ~249x | + true FMA, stack KV cache, `-Ofast` |

C++ (original/enhanced) compiled with g++ -std=c++17 -O3. C++ (current) compiled with g++ -std=c++17 -Ofast -march=native. PyPy benchmarked with PyPy 7.3.17. Rust compilation profile:

[profile.release]
opt-level = 3
lto = true
codegen-units = 1
panic = "abort"
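The "flat KV cache" and "stack arrays" entries in the tables can be pictured with the sketch below. It is an assumption-laden illustration, not the project's code: the `KVCache` struct and its members are hypothetical, only a single attention layer is modelled, and the only names taken from this README are `N_EMBD` and `BLOCK_SIZE`. The idea shown is that fixed compile-time sizes let the cache live in plain `std::array` storage with no heap allocation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t N_EMBD = 16;      // embedding width (the 16x16 configuration)
constexpr std::size_t BLOCK_SIZE = 16;  // maximum context length

// Hypothetical flat cache for one attention layer: position t occupies
// floats [t * N_EMBD, (t + 1) * N_EMBD) of each array.
struct KVCache {
    std::array<float, BLOCK_SIZE * N_EMBD> keys{};
    std::array<float, BLOCK_SIZE * N_EMBD> values{};
    std::size_t len = 0;  // number of positions cached so far

    // Append one position's key and value vectors (each N_EMBD floats long).
    void append(const float* k, const float* v) {
        assert(len < BLOCK_SIZE);
        for (std::size_t i = 0; i < N_EMBD; ++i) {
            keys[len * N_EMBD + i] = k[i];
            values[len * N_EMBD + i] = v[i];
        }
        ++len;
    }

    // Read-only views of the cached vectors for position t.
    const float* key(std::size_t t) const { return keys.data() + t * N_EMBD; }
    const float* value(std::size_t t) const { return values.data() + t * N_EMBD; }
};
```

Declaring a `KVCache` as a local variable places both buffers on the stack, which is feasible here only because the sizes are small compile-time constants.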

Rust

The Rust implementation is faster than C++ (original) out of the box. With autograd optimizations (recording only operations during the forward pass, calculating derivatives during the backward pass, and using unsafe pointers), it became faster than C++ (enhanced). C++ (current) builds on these optimizations: it adds true fused multiply-add (FMA), computes gradients only during the backward pass, uses f32, and is compiled with -Ofast, which brings it level with the latest Rust code.
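The techniques named above (a Wengert tape, an op enum dispatched in the backward pass, struct-of-arrays storage, f32, and a fused multiply-add node) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in microgpt.cpp: the `Tape` and `Op` types, the three-operation set, and the member names are all hypothetical. The forward pass records only the operation and operand indices; local derivatives are computed during the reverse sweep.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical op set; the real project may record different operations.
enum class Op : unsigned char { Leaf, Add, Mul, MulAdd };

// Struct-of-arrays tape: node i's fields live at index i of each vector.
struct Tape {
    std::vector<Op> op;
    std::vector<int> a, b, c;      // operand indices (-1 when unused)
    std::vector<float> val, grad;  // f32 values and gradients

    int push(Op o, int x, int y, int z, float v) {
        op.push_back(o); a.push_back(x); b.push_back(y); c.push_back(z);
        val.push_back(v); grad.push_back(0.0f);
        return static_cast<int>(val.size()) - 1;
    }
    int leaf(float v) { return push(Op::Leaf, -1, -1, -1, v); }
    int add(int x, int y) { return push(Op::Add, x, y, -1, val[x] + val[y]); }
    int mul(int x, int y) { return push(Op::Mul, x, y, -1, val[x] * val[y]); }
    // x * y + z as a single node, evaluated with one rounding via std::fma.
    int muladd(int x, int y, int z) {
        return push(Op::MulAdd, x, y, z, std::fma(val[x], val[y], val[z]));
    }

    // Reverse sweep: local derivatives are computed here from stored values,
    // so the forward pass never has to store them.
    void backward(int root) {
        std::fill(grad.begin(), grad.end(), 0.0f);
        grad[root] = 1.0f;
        for (int i = root; i >= 0; --i) {
            const float g = grad[i];
            switch (op[i]) {
                case Op::Leaf: break;
                case Op::Add: grad[a[i]] += g; grad[b[i]] += g; break;
                case Op::Mul:
                    grad[a[i]] += g * val[b[i]];
                    grad[b[i]] += g * val[a[i]];
                    break;
                case Op::MulAdd:
                    grad[a[i]] += g * val[b[i]];
                    grad[b[i]] += g * val[a[i]];
                    grad[c[i]] += g;
                    break;
            }
        }
    }
};
```

For example, with leaves x, w, and b, calling `muladd(x, w, b)` records one node instead of a separate multiply and add, which shortens the tape. Whether `std::fma` compiles to a hardware FMA instruction depends on the target and flags such as `-march=native`.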

For a version that replicates Python's random numbers for an exact match of the original Python output, see this fork.
