A ground-up implementation of a GPT-style transformer language model, progressing from a simple bigram model to a full transformer with custom CUDA kernels.
Understanding transformer internals by building every component from scratch — no nn.TransformerEncoder, no shortcuts. This includes eventually writing custom CUDA kernels for core operations.
Character-level language model trained on War and Peace (3.2M characters, 111 unique tokens).
| Metric | Value |
|---|---|
| Vocabulary | 111 characters |
| Training data | 2.9M characters |
| Validation data | 323K characters |
| Initial loss | 5.33 (random) |
| Final loss | 2.39 (after 100K steps) |
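For reference, the character-level encoding can be as simple as a lookup table built from the corpus. Below is a minimal sketch, assuming the corpus has been read from `war_and_peace.txt`; variable names are illustrative, not necessarily the notebook's:

```python
# Read the raw corpus
with open('war_and_peace.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# The vocabulary is simply every distinct character in the corpus
chars = sorted(set(text))
vocab_size = len(chars)   # 111 for this corpus

# Character <-> integer lookup tables
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s]             # string -> list of token ids
decode = lambda ids: ''.join(itos[i] for i in ids)  # list of token ids -> string
```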
Sample output (bigram only):

```
We bothf
mpof uprerust tllowhilyan m teleld ollkn.
fe verng he wad. ame stheve The se wim Pin...
```
Word-like patterns emerge, but no coherent meaning — the model only sees one character of context.
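A bigram model in this setting is just a `vocab_size × vocab_size` embedding table: the current character's row is read off directly as the logits for the next character. A minimal sketch (class and variable names are illustrative, not the notebook's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Row i holds the logits for the character that follows character i
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)      # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        # Sample one character at a time from the last position's logits
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
```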
- Scaled dot-product attention (see the sketch after this list)
- Causal masking (decoder-style)
- Single attention head
- Multiple parallel attention heads
- Head concatenation + projection
- LayerNorm
- Feedforward network (MLP)
- Residual connections
- Dropout
- Positional embeddings
- Stacked transformer blocks
- Scaled initialization
- Custom attention kernel
- Fused softmax
- Flash attention implementation
- Memory-efficient backprop
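To make the attention items above concrete, here is a minimal sketch of a single causal self-attention head and the multi-head wrapper that concatenates and projects the head outputs. The names `n_embd`, `head_size`, and `block_size` are illustrative hyperparameters, not necessarily the notebook's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of causal (masked) self-attention."""
    def __init__(self, n_embd, head_size, block_size, dropout=0.1):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask so position t can only attend to positions <= t
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)                                        # (B, T, head_size)
        q = self.query(x)                                      # (B, T, head_size)
        # Scaled dot-product attention scores
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5    # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # causal mask
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        v = self.value(x)
        return wei @ v                                         # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    """Several heads in parallel, concatenated and projected back to n_embd."""
    def __init__(self, n_embd, n_head, block_size, dropout=0.1):
        super().__init__()
        head_size = n_embd // n_head
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size, dropout) for _ in range(n_head)]
        )
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)   # (B, T, n_embd)
        return self.dropout(self.proj(out))
```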
```
Input tokens
         │
         ▼
┌─────────────────┐
│ Token Embedding │
│ + Position Emb  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Transformer   │  ×N blocks
│      Block      │
│ ┌─────────────┐ │
│ │ Multi-Head  │ │
│ │  Attention  │ │
│ └──────┬──────┘ │
│        │        │
│ ┌──────▼──────┐ │
│ │ Feedforward │ │
│ └─────────────┘ │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    LayerNorm    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Linear → Logits │
└─────────────────┘
```
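Translated into code, the diagram above might look like the following sketch (pre-norm residual variant; the class names, the 4× MLP expansion, and the default dropout are illustrative choices, and `MultiHeadAttention` refers to the earlier sketch rather than to code in the notebook):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForward(nn.Module):
    """Position-wise MLP with the usual 4x expansion."""
    def __init__(self, n_embd, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """One transformer block: multi-head attention then an MLP,
    each inside a residual connection with pre-LayerNorm."""
    def __init__(self, n_embd, n_head, block_size, dropout=0.1):
        super().__init__()
        self.sa = MultiHeadAttention(n_embd, n_head, block_size, dropout)  # from the sketch above
        self.ffwd = FeedForward(n_embd, dropout)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))      # residual around attention
        x = x + self.ffwd(self.ln2(x))    # residual around the MLP
        return x

class TransformerLanguageModel(nn.Module):
    """Token + position embeddings, N stacked blocks, final LayerNorm, logits."""
    def __init__(self, vocab_size, n_embd, n_head, n_layer, block_size, dropout=0.1):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, n_embd)
        self.position_embedding = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(
            *[Block(n_embd, n_head, block_size, dropout) for _ in range(n_layer)]
        )
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok = self.token_embedding(idx)                                     # (B, T, n_embd)
        pos = self.position_embedding(torch.arange(T, device=idx.device))   # (T, n_embd)
        x = self.blocks(tok + pos)
        logits = self.lm_head(self.ln_f(x))                                 # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(B * T, -1), targets.view(B * T))
        return logits, loss
```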
```
generatively_pretrained_transformer/
├── src/
│   └── training_lab.ipynb    # Main development notebook
├── cuda/                     # Custom CUDA kernels (coming)
│   ├── attention.cu
│   └── softmax.cu
├── war_and_peace.txt         # Training corpus
├── AGENTS.md                 # Multi-agent coordination
└── README.md
```
```bash
git clone https://github.com/LewallenAE/generatively_pretrained_transformer.git
cd generatively_pretrained_transformer
python -m venv venv
source venv/bin/activate
pip install torch
```

```python
# In training_lab.ipynb
# Hyperparameters (current)
batch_size = 32
block_size = 8
learning_rate = 1e-3
max_iters = 100000

# Training loop (get_batch, model, and optimizer come from earlier notebook cells)
for step in range(max_iters):
    xb, yb = get_batch('train')             # sample a batch of input/target character indices
    logits, loss = model(xb, yb)            # forward pass returns logits and cross-entropy loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
```

- Attention Is All You Need (Vaswani et al., 2017)
- Let's Build GPT (Andrej Karpathy)
- FlashAttention (Dao et al., 2022)
MIT