Build your own LLM from scratch: tokenizer, pretraining, and instruction tuning.

A complete LLM implementation in PyTorch. The 1.4B model, trained on 20B tokens, outperforms GPT-2 Medium on all four benchmarks below; the 0.3B model, trained on 12B tokens, outperforms it on ARC-Easy.

| Model | ARC-Easy | HellaSwag | PIQA | WinoGrande |
|---|---|---|---|---|
| nanolm-1.4B | 63.9% | 50.4% | 70.5% | 54.9% |
| nanolm-0.3B | 57.5% | 36.3% | 65.9% | 51.8% |
| GPT-2 Medium | 48.5% | 39.4% | 66.4% | 52.5% |
| Random | 25% | 25% | 50% | 50% |

Available models:

| Model | Params | Tokens | HuggingFace |
|---|---|---|---|
| nanolm-1.4B | 1.4B | 20B | thibautbar/nanolm-1.4B |
| nanolm-1.4B-instruct | 1.4B | — | thibautbar/nanolm-1.4B-instruct |
| nanolm-0.3B | 336M | 12B | thibautbar/nanolm-0.3B |
| nanolm-0.3B-instruct | 336M | — | thibautbar/nanolm-0.3B-instruct |

See docs/SMALLER_MODELS.md for the 54M and 14M variants.

Base model (pretrained, 1.4B):

    Prompt: "The French Revolution began in"

    The French Revolution began in 1789. The revolution was a reaction to the monarchy,
    which ruled France for nearly 200 years. In 1789, the people of France were unhappy
    with their government and wanted change. They called for a new constitution that
    would give them more power...

Instruction-tuned model (SFT on Alpaca, 1.4B):

    Prompt: "Explain what machine learning is in simple terms."

    Machine learning is a type of artificial intelligence that allows computers to
    learn from data without being explicitly programmed. It uses algorithms to
    analyze large sets of data and identify patterns, which can then be used to make
    predictions or decisions.

    Prompt: "Write a Python function that checks if a number is prime."

    def is_prime(num):
        for i in range(2, num):
            if (num % i) == 0:
                return False
        else:
            return True

The 1.4B model uses a Qwen2-style transformer:

| Component | Choice | Details |
|---|---|---|
| Attention | Grouped Query Attention (GQA) | 16 heads, 8 KV heads |
| FFN | SwiGLU | Intermediate size 6144 |
| Norm | RMSNorm | Pre-norm |
| Tokenizer | BPE (trained from scratch) | Mixed text/code/math corpus |
| Context | 1024 tokens | — |
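
To make the GQA row concrete: with 16 query heads and 8 KV heads, each KV head is shared by two query heads. The sketch below shows one common way to map query heads to KV heads and estimates the resulting KV-cache size. The function names, the consecutive-grouping convention, and the 2-byte (bf16/fp16) cache values are assumptions for illustration, not taken from this repo; `n_layer=28` and `n_embd=2048` come from the pretraining command further down.

```python
def kv_head_for(query_head: int, n_head: int = 16, n_kv_head: int = 8) -> int:
    """Return the KV head index used by a given query head.

    Assumes consecutive query heads share one KV head (a common GQA
    convention; the repo's actual layout may differ).
    """
    if n_head % n_kv_head != 0:
        raise ValueError("n_head must be a multiple of n_kv_head")
    if not 0 <= query_head < n_head:
        raise ValueError("query_head out of range")
    group_size = n_head // n_kv_head  # query heads per KV head
    return query_head // group_size


def kv_cache_bytes(n_layer: int, n_kv_head: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache K and V for one sequence.

    bytes_per_value=2 assumes a 16-bit cache dtype (an assumption).
    """
    return 2 * n_layer * n_kv_head * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    head_dim = 2048 // 16  # n_embd / n_head
    print([kv_head_for(q) for q in range(16)])
    gqa = kv_cache_bytes(28, 8, head_dim, 1024)
    mha = kv_cache_bytes(28, 16, head_dim, 1024)
    print(f"GQA cache: {gqa / 2**20:.0f} MiB, full MHA: {mha / 2**20:.0f} MiB")
```

Under these assumptions, using 8 KV heads instead of 16 halves the cache for a full 1024-token context.
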

Training data mix (20B tokens, curated):
- 75% FineWeb-Edu (high-quality educational web text)
- 10% FineWeb (general web text)
- 10% StarCoder (Python + JavaScript)
- 5% OpenWebMath
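
The mix above corresponds to the `path:weight` list passed to `--sources` in the pretraining command (StarCoder's 10% is split evenly between Python and JavaScript). As a rough illustration, the sketch below parses such a string and converts the weights into per-source token budgets. `parse_sources` and `token_budget` are hypothetical helpers, and the requirement that weights sum to 1 is an assumption; the repo's own parser may behave differently.

```python
# The --sources value from the pretraining command in this README.
SPEC = (
    "data/sources/fineweb-edu-10BT:0.75,"
    "data/sources/fineweb-10BT:0.1,"
    "data/sources/starcoderdata/python:0.05,"
    "data/sources/starcoderdata/javascript:0.05,"
    "data/sources/openwebmath:0.05"
)


def parse_sources(spec: str) -> dict[str, float]:
    """Parse 'path:weight,path:weight' into {path: weight}."""
    weights = {}
    for item in spec.split(","):
        path, _, weight = item.strip().rpartition(":")
        if not path:
            raise ValueError(f"expected 'path:weight', got {item!r}")
        weights[path] = float(weight)
    total = sum(weights.values())
    # Assumption: weights are fractions that should sum to 1.
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return weights


def token_budget(weights: dict[str, float], total_tokens: int) -> dict[str, int]:
    """Number of tokens to draw from each source for a given total."""
    return {path: round(total_tokens * w) for path, w in weights.items()}


if __name__ == "__main__":
    for path, tokens in token_budget(parse_sources(SPEC), 20_000_000_000).items():
        print(f"{path}: {tokens / 1e9:.0f}B tokens")
```
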

[Figure: training curves]


This repo implements the full LLM pipeline from scratch:
- BPE Tokenizer: trained on a mixed corpus for better code/math coverage
- Pretraining: Qwen2-style GPT with GQA, trained with HuggingFace `accelerate`
- Supervised Fine-Tuning: instruction tuning on Alpaca/Dolly
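
For readers new to BPE, the core training loop can be sketched in a few lines: count adjacent symbol pairs, merge the most frequent pair, and repeat. This is a toy, character-level version with whitespace pre-tokenization, written only to illustrate the idea; it is not the trainer in `src/tokenizer`, and the function names and tie-breaking rule are assumptions.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(text: str, num_merges: int) -> list:
    """Learn up to `num_merges` merge rules from whitespace-split text."""
    # Each distinct word starts as a tuple of single characters.
    words = Counter(tuple(word) for word in text.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        # Most frequent pair; ties broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


if __name__ == "__main__":
    print(train_bpe("low low low lower lowest", num_merges=2))
```

Training on a corpus that includes code and math means frequent code and math fragments earn their own merges, which is the coverage benefit the list above refers to.
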

Supervised fine-tuning results:

| Model | Loss (start → end) | Token Accuracy |
|---|---|---|
| nanolm-1.4B-instruct | 2.4 → 1.1 | 54% → 70% |
| nanolm-0.3B-instruct | 3.0 → 1.4 | 50% → 64% |
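
The README does not say how token accuracy is computed. A common definition is the fraction of next-token predictions that match the label, skipping positions that are masked out of the loss. The sketch below follows that definition, assuming the `-100` ignore label used by PyTorch and HuggingFace and assuming the reported loss is mean cross-entropy in nats; both are assumptions, and the function names are hypothetical.

```python
import math

IGNORE_INDEX = -100  # assumed masking convention (PyTorch/HF default)


def token_accuracy(predicted_ids, label_ids, ignore_index=IGNORE_INDEX):
    """Fraction of unmasked positions where the prediction equals the label."""
    correct = 0
    total = 0
    for pred, label in zip(predicted_ids, label_ids):
        if label == ignore_index:
            continue  # e.g. prompt tokens excluded from the loss
        total += 1
        correct += int(pred == label)
    return correct / total if total else 0.0


def perplexity(mean_loss: float) -> float:
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(mean_loss)


if __name__ == "__main__":
    # First position is masked, so 2 of the 3 scored tokens are correct.
    print(token_accuracy([5, 7, 9, 2], [-100, 7, 9, 3]))
    print(perplexity(2.4), perplexity(1.1))
```
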

Install:

    git clone <repo-url> && cd nanolm
    uv venv && source .venv/bin/activate
    uv pip install -e .

Generate text:

    uv run python -m src.scripts.autocomplete \
      --checkpoint data/checkpoints/nanolm-1.4B \
      --prompt "The future of AI" \
      --max-tokens 100 \
      --temperature 0.8

Train from scratch:

    # 1. Download data
    python -m src.dataset.download --dataset HuggingFaceFW/fineweb-edu --config sample-10BT --output data/sources/fineweb-edu-10BT --num-proc 8

    # 2. Train tokenizer
    python -m src.tokenizer.train \
      --sources "data/sources/fineweb-edu-10BT:0.7,data/sources/code/python:0.1,data/sources/code/javascript:0.1,data/sources/openwebmath:0.1" \
      --num-samples 1000000 \
      --output data/tokenizer-v2

    # 3. Pretrain
    uv run accelerate launch -m src.pretraining.train_hf \
      --sources "data/sources/fineweb-edu-10BT:0.75,data/sources/fineweb-10BT:0.1,data/sources/starcoderdata/python:0.05,data/sources/starcoderdata/javascript:0.05,data/sources/openwebmath:0.05" \
      --tokenizer data/tokenizer-v2/tokenizer.json \
      --seq-len 1024 --batch-size 8 --grad-accum 4 \
      --n-layer 28 --n-head 16 --n-kv-head 8 --n-embd 2048 --intermediate-size 6144 \
      --max-lr 4e-4 --max-tokens 20B \
      --wandb --wandb-name "nanolm-1.4B"

    # 4. Fine-tune (optional)
    uv run python -m src.posttraining.sft \
      --checkpoint data/checkpoints/nanolm-1.4B \
      --output data/checkpoints/nanolm-1.4B-instruct \
      --dataset alpaca --epochs 3

Project structure:

    nanolm/
    ├── src/
    │   ├── tokenizer/      # BPE tokenizer training & eval
    │   ├── dataset/        # Data download & pipeline
    │   ├── model/          # GPT architecture
    │   ├── pretraining/    # Pretraining with HF Trainer
    │   ├── posttraining/   # Supervised fine-tuning (SFT)
    │   └── scripts/        # Text generation
    └── data/
        ├── corpus/         # Training data
        ├── tokenizer/      # Trained tokenizer
        └── checkpoints/    # Model checkpoints


Model architecture:

| Parameter | Default | Description |
|---|---|---|
| `--n-layer` | 6 | Transformer layers |
| `--n-head` | 8 | Attention heads |
| `--n-kv-head` | — | KV heads for GQA |
| `--n-embd` | 512 | Embedding dimension |
| `--intermediate-size` | — | FFN intermediate size |
| `--seq-len` | 512 | Context window |
| `--dropout` | 0.1 | Dropout rate |
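
As a sanity check on how these flags determine model size, the sketch below estimates the parameter count from the values used in the 1.4B pretraining command (`--n-layer 28 --n-head 16 --n-kv-head 8 --n-embd 2048 --intermediate-size 6144`). It counts attention, SwiGLU, and RMSNorm weights only. Bias terms are ignored, and the embedding table is excluded by default because the README does not state the vocabulary size; `vocab_size` and `tie_embeddings` are hypothetical arguments added for illustration.

```python
def count_params(n_layer: int, n_embd: int, n_head: int, n_kv_head: int,
                 intermediate_size: int, vocab_size: int = 0,
                 tie_embeddings: bool = True) -> int:
    """Approximate weight count for a GQA + SwiGLU + RMSNorm transformer."""
    head_dim = n_embd // n_head
    kv_dim = n_kv_head * head_dim           # K/V width shrinks under GQA
    attn = 2 * n_embd * n_embd              # Q and output projections
    attn += 2 * n_embd * kv_dim             # K and V projections
    ffn = 3 * n_embd * intermediate_size    # gate, up, and down projections
    norms = 2 * n_embd                      # two RMSNorm weights per layer
    total = n_layer * (attn + ffn + norms)
    total += n_embd                         # final RMSNorm
    embedding = vocab_size * n_embd         # 0 unless a vocab size is given
    total += embedding if tie_embeddings else 2 * embedding
    return total


if __name__ == "__main__":
    n = count_params(n_layer=28, n_embd=2048, n_head=16, n_kv_head=8,
                     intermediate_size=6144)
    print(f"{n:,} non-embedding parameters (~{n / 1e9:.2f}B)")
```

Under these assumptions the non-embedding weights alone come to roughly 1.41B, in line with the model's name.
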

Training:

| Parameter | Default | Description |
|---|---|---|
| `--batch-size` | 8 | Batch size |
| `--grad-accum` | 1 | Gradient accumulation steps |
| `--max-lr` | 3e-4 | Peak learning rate |
| `--warmup-steps` | 1000 | LR warmup steps |
| `--max-steps` | 100000 | Total training steps |
| `--grad-clip` | 1.0 | Gradient clipping |
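
Batch size, gradient accumulation, and sequence length together set how many tokens each optimizer step consumes, which in turn determines how many steps a token budget such as `--max-tokens 20B` requires. The sketch below works through that arithmetic for the 1.4B command (`--seq-len 1024 --batch-size 8 --grad-accum 4`). It assumes `--batch-size` is per device; the README does not state the number of devices, so `num_devices` is a hypothetical parameter.

```python
import math


def tokens_per_step(seq_len: int, batch_size: int, grad_accum: int,
                    num_devices: int = 1) -> int:
    """Tokens consumed by one optimizer step."""
    return seq_len * batch_size * grad_accum * num_devices


def steps_for_budget(max_tokens: int, step_tokens: int) -> int:
    """Optimizer steps needed to see at least `max_tokens` tokens."""
    return math.ceil(max_tokens / step_tokens)


if __name__ == "__main__":
    for devices in (1, 8):  # illustrative device counts, not from the README
        per_step = tokens_per_step(1024, 8, 4, num_devices=devices)
        steps = steps_for_budget(20_000_000_000, per_step)
        print(f"{devices} device(s): {per_step:,} tokens/step, {steps:,} steps")
```
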

Generation:

| Parameter | Default | Description |
|---|---|---|
| `--max-tokens` | 50 | Tokens to generate |
| `--temperature` | 0.8 | Sampling temperature |
| `--top-k` | 50 | Top-k sampling |
| `--device` | auto | Device (auto/cpu/cuda/mps) |
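
To show how `--temperature` and `--top-k` interact, here is a minimal pure-Python sketch of one sampling step: keep the `top_k` highest logits, divide them by the temperature, apply a softmax, and draw one token. It illustrates the general technique with the defaults from the table; `sample_top_k` is a hypothetical name, and the repo's generation script may order or implement these steps differently.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=50, rng=None):
    """Sample one token index from `logits` using temperature and top-k."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    rng = rng if rng is not None else random.Random()
    k = min(top_k, len(logits))
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Lower temperature sharpens the distribution, higher flattens it.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    # Unnormalized softmax weights; subtracting the peak avoids overflow.
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    logits = [0.1, 2.0, -1.0, 1.5]
    rng = random.Random(0)
    print([sample_top_k(logits, top_k=2, rng=rng) for _ in range(10)])
```

With `top_k=1` the function always returns the highest-scoring token, which is equivalent to greedy decoding.
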

