🧠 NanoLM

Build your own LLM from scratch — tokenizer, pretraining, and instruction tuning.

A complete LLM implementation in PyTorch. The 1.4B model is trained on 20B tokens and outperforms GPT-2 Medium on all four 0-shot benchmarks below.

📊 Benchmarks (0-shot)

Model	ARC-Easy	HellaSwag	PIQA	WinoGrande
nanolm-1.4B	63.9%	50.4%	70.5%	54.9%
nanolm-0.3B	57.5%	36.3%	65.9%	51.8%
GPT-2 Medium	48.5%	39.4%	66.4%	52.5%
Random	25%	25%	50%	50%
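
Zero-shot multiple-choice benchmarks like these are commonly scored by comparing the log-probability the model assigns to each candidate answer and picking the highest. That is why the random baselines sit at 25% (mostly four-way questions) and 50% (two-way questions). The sketch below illustrates the idea on made-up numbers. It is an assumption about the scoring scheme, not the evaluation code behind the table, and `pick_answer` is a hypothetical helper.

```python
def pick_answer(choice_logprobs, length_normalize=True):
    """Return the index of the best-scoring answer choice.

    choice_logprobs: one list per candidate answer, holding the model's
    log-probability for each token of that answer given the question.
    """
    scores = []
    for token_logprobs in choice_logprobs:
        total = sum(token_logprobs)
        # Length normalization avoids penalizing longer answers.
        scores.append(total / len(token_logprobs) if length_normalize else total)
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up log-probs for a 4-way question (not real model output).
example = [
    [-2.1, -0.9, -1.7],        # choice 0
    [-0.4, -0.6],              # choice 1
    [-1.2, -1.1, -0.8, -0.9],  # choice 2
    [-3.0],                    # choice 3
]
print(pick_answer(example))
```

Whether to normalize by answer length is a design choice: summed log-probs favor short answers, while per-token averages treat answers of different lengths more evenly.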

🤗 Pretrained Models

Model	Params	Tokens	HuggingFace
nanolm-1.4B	1.4B	20B	thibautbar/nanolm-1.4B
nanolm-1.4B-instruct	1.4B	—	thibautbar/nanolm-1.4B-instruct
nanolm-0.3B	336M	12B	thibautbar/nanolm-0.3B
nanolm-0.3B-instruct	336M	—	thibautbar/nanolm-0.3B-instruct

See docs/SMALLER_MODELS.md for 54M and 14M variants.
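
If the Hub checkpoints are published in a format the stock `transformers` Auto classes can read, loading one might look like the sketch below. That compatibility is an assumption, not something this README states. `REPO_IDS` and `generate` are hypothetical names, and the sampling settings mirror the generation defaults listed in the Configuration Reference.

```python
# Repo IDs copied from the table above.
REPO_IDS = {
    "1.4B": "thibautbar/nanolm-1.4B",
    "1.4B-instruct": "thibautbar/nanolm-1.4B-instruct",
    "0.3B": "thibautbar/nanolm-0.3B",
    "0.3B-instruct": "thibautbar/nanolm-0.3B-instruct",
}


def generate(model_key, prompt, max_new_tokens=100):
    """Download a checkpoint from the Hub and sample a completion.

    ASSUMPTION: the repos load with AutoTokenizer / AutoModelForCausalLM.
    """
    # Imported lazily so this module can be read without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id = REPO_IDS[model_key]
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.8,
        top_k=50,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("1.4B", "The future of AI")` would download the weights on first use. If loading fails, the repo's own `src.scripts.autocomplete` entry point shown in Quick Start is the documented path.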

✨ Highlights

Base model (pretrained, 1.4B):

Prompt: "The French Revolution began in"

The French Revolution began in 1789. The revolution was a reaction to the monarchy,
which ruled France for nearly 200 years. In 1789, the people of France were unhappy
with their government and wanted change. They called for a new constitution that
would give them more power...

Instruction-tuned model (SFT on Alpaca, 1.4B):

Prompt: "Explain what machine learning is in simple terms."

Machine learning is a type of artificial intelligence that allows computers to
learn from data without being explicitly programmed. It uses algorithms to
analyze large sets of data and identify patterns, which can then be used to make
predictions or decisions.

Prompt: "Write a Python function that checks if a number is prime."

def is_prime(num):
    for i in range(2, num):
        if (num % i) == 0:
            return False
    else:
        return True

🏗 Architecture

The 1.4B model uses a Qwen2-style transformer:

Component	Choice	Details
Attention	Grouped Query Attention (GQA)	16 heads, 8 KV heads
FFN	SwiGLU	Intermediate size 6144
Norm	RMSNorm	Pre-norm
Tokenizer	BPE (trained from scratch)	Mixed text/code/math corpus
Context	1024 tokens	—
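
To make the GQA row concrete, the sketch below shows how 16 query heads would share 8 KV heads and what that does to the key/value cache at the 1024-token context. The layer count (28) and embedding size (2048) are taken from the pretraining command in Quick Start. The head dimension (`n_embd / n_head`), the 16-bit cache values, and the helper names are assumptions made for illustration.

```python
def gqa_layout(n_head, n_kv_head):
    """Map each query head to the KV head it shares under GQA."""
    if n_head % n_kv_head != 0:
        raise ValueError("n_head must be a multiple of n_kv_head")
    group_size = n_head // n_kv_head
    return [q // group_size for q in range(n_head)]


def kv_cache_bytes(n_layer, n_kv_head, head_dim, seq_len, bytes_per_value=2):
    """Key/value cache size for one sequence (keys plus values)."""
    return 2 * n_layer * n_kv_head * head_dim * seq_len * bytes_per_value


head_dim = 2048 // 16  # ASSUMPTION: head_dim = n_embd / n_head
print(gqa_layout(16, 8))

gqa = kv_cache_bytes(28, 8, head_dim, 1024)   # 8 KV heads, as in the table
mha = kv_cache_bytes(28, 16, head_dim, 1024)  # hypothetical: one KV head per query head
print(f"GQA: {gqa / 2**20:.0f} MiB, full multi-head: {mha / 2**20:.0f} MiB")
```

With two query heads per KV head, the cache is half the size it would be with one KV head per query head.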

Training data mix (20B tokens, curated):

75% FineWeb-Edu (high-quality educational web text)
10% FineWeb (general web text)
10% StarCoder (Python + JavaScript)
5% OpenWebMath
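
As a quick check on what the mix means in absolute terms, the sketch below splits the 20B-token budget by the percentages above. It assumes the percentages apply to token counts (rather than documents or bytes). `token_budgets` is a hypothetical helper.

```python
TOTAL_TOKENS = 20_000_000_000

# Percentages from the list above.
MIX = {
    "FineWeb-Edu": 75,
    "FineWeb": 10,
    "StarCoder": 10,
    "OpenWebMath": 5,
}


def token_budgets(total_tokens, mix_percent):
    """Split a token budget across sources according to integer percentages."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    return {name: total_tokens * pct // 100 for name, pct in mix_percent.items()}


budgets = token_budgets(TOTAL_TOKENS, MIX)
for name, tokens in budgets.items():
    print(f"{name:12s} {tokens / 1e9:5.1f}B tokens")
```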

Training curves:

🔄 Pipeline

This repo implements the full LLM pipeline from scratch:

BPE Tokenizer — trained on a mixed corpus for better code/math coverage
Pretraining — Qwen2-style GPT with GQA, trained with HuggingFace accelerate
Supervised Fine-Tuning — instruction tuning on Alpaca/Dolly
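
For readers new to the first stage, here is a toy byte-pair-encoding trainer that shows the core loop: count adjacent symbol pairs, merge the most frequent one, repeat. It is a character-level sketch on a made-up corpus, not the code in `src/tokenizer`. Production tokenizers typically work on bytes and add pre-tokenization rules. All function names are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each occurrence of `pair` in `symbols` with the fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from whitespace-split words."""
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(word) for word in text.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def encode(word, merges):
    """Apply the learned merges, in training order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe("low low low lower lowest newer newer wider", num_merges=4)
print(merges)
print(encode("lowest", merges))
```

Training on a corpus that mixes text, code, and math matters because the merge table only contains pairs that were frequent in the training data. A tokenizer trained on prose alone would split code and formulas into many more tokens.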

SFT Results

Model	Loss (start → end)	Token accuracy (start → end)
nanolm-1.4B-instruct	2.4 → 1.1	54% → 70%
nanolm-0.3B-instruct	3.0 → 1.4	50% → 64%
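
The two columns can be read as follows, assuming the loss is mean per-token cross-entropy in nats and token accuracy is the share of target tokens where the top-1 prediction matches. The sketch converts the table's losses to perplexity and shows a masked accuracy computation on made-up token IDs. Masking prompt tokens with `-100` follows the default `ignore_index` of PyTorch's cross-entropy loss. Whether this repo masks the prompt is an assumption.

```python
import math


def perplexity(mean_cross_entropy):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


def token_accuracy(predicted_ids, target_ids, ignore_index=-100):
    """Fraction of non-ignored target tokens whose top-1 prediction is correct."""
    scored = [(p, t) for p, t in zip(predicted_ids, target_ids) if t != ignore_index]
    if not scored:
        return 0.0
    return sum(p == t for p, t in scored) / len(scored)


# Losses from the table above.
for name, start, end in [("1.4B", 2.4, 1.1), ("0.3B", 3.0, 1.4)]:
    print(f"{name}: perplexity {perplexity(start):.1f} -> {perplexity(end):.1f}")

# Made-up IDs: prompt positions are masked, so only the response is scored.
targets = [-100, -100, 11, 12, 13, 14]
predictions = [5, 6, 11, 12, 99, 14]
print(token_accuracy(predictions, targets))
```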

🚀 Quick Start

git clone <repo-url> && cd nanolm
uv venv && source .venv/bin/activate
uv pip install -e .

Generate text

uv run python -m src.scripts.autocomplete \
  --checkpoint data/checkpoints/nanolm-1.4B \
  --prompt "The future of AI" \
  --max-tokens 100 \
  --temperature 0.8

Train from scratch

# 1. Download data (repeat for each source passed to --sources below)
python -m src.dataset.download --dataset HuggingFaceFW/fineweb-edu --config sample-10BT --output data/sources/fineweb-edu-10BT --num-proc 8

# 2. Train tokenizer
python -m src.tokenizer.train \
  --sources "data/sources/fineweb-edu-10BT:0.7,data/sources/code/python:0.1,data/sources/code/javascript:0.1,data/sources/openwebmath:0.1" \
  --num-samples 1000000 \
  --output data/tokenizer-v2

# 3. Pretrain
uv run accelerate launch -m src.pretraining.train_hf \
  --sources "data/sources/fineweb-edu-10BT:0.75,data/sources/fineweb-10BT:0.1,data/sources/starcoderdata/python:0.05,data/sources/starcoderdata/javascript:0.05,data/sources/openwebmath:0.05" \
  --tokenizer data/tokenizer-v2/tokenizer.json \
  --seq-len 1024 --batch-size 8 --grad-accum 4 \
  --n-layer 28 --n-head 16 --n-kv-head 8 --n-embd 2048 --intermediate-size 6144 \
  --max-lr 4e-4 --max-tokens 20B \
  --wandb --wandb-name "nanolm-1.4B"

# 4. Fine-tune (optional)
uv run python -m src.posttraining.sft \
  --checkpoint data/checkpoints/nanolm-1.4B \
  --output data/checkpoints/nanolm-1.4B-instruct \
  --dataset alpaca --epochs 3
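
To get a feel for the scale of step 3, the sketch below works out how many tokens one optimizer step consumes and how many steps 20B tokens take. It assumes `--batch-size` is a per-device value and that gradient accumulation multiplies it. The device counts are illustrative only, since the hardware is not stated here.

```python
def tokens_per_step(seq_len, batch_size, grad_accum, num_devices=1):
    """Tokens consumed by one optimizer step."""
    return seq_len * batch_size * grad_accum * num_devices


def optimizer_steps(max_tokens, per_step):
    """Optimizer steps needed to consume `max_tokens`, rounded up."""
    return (max_tokens + per_step - 1) // per_step


# From the command above: --seq-len 1024 --batch-size 8 --grad-accum 4 --max-tokens 20B
per_step = tokens_per_step(seq_len=1024, batch_size=8, grad_accum=4)
print(f"{per_step:,} tokens per step on one device")

# ASSUMPTION: 1 and 8 devices are example values, not the actual setup.
for num_devices in (1, 8):
    n = tokens_per_step(1024, 8, 4, num_devices)
    print(f"{num_devices} device(s): {optimizer_steps(20_000_000_000, n):,} steps")
```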

📁 Project Structure

nanolm/
├── src/
│   ├── tokenizer/        # BPE tokenizer training & eval
│   ├── dataset/          # Data download & pipeline
│   ├── model/            # GPT architecture
│   ├── pretraining/      # Pretraining with HF Trainer
│   ├── posttraining/     # Supervised fine-tuning (SFT)
│   └── scripts/          # Text generation
└── data/
    ├── corpus/           # Training data
    ├── tokenizer/        # Trained tokenizer
    └── checkpoints/      # Model checkpoints

⚙️ Configuration Reference

Model

Parameter	Default	Description
`--n-layer`	6	Transformer layers
`--n-head`	8	Attention heads
`--n-kv-head`	—	KV heads for GQA
`--n-embd`	512	Embedding dimension
`--intermediate-size`	—	FFN intermediate size
`--seq-len`	512	Context window
`--dropout`	0.1	Dropout rate
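
These flags determine the model size. The sketch below estimates the non-embedding parameter count for a Qwen2-style block (GQA attention, SwiGLU FFN, two RMSNorms) and applies it to the 1.4B flags from the pretraining command. Embeddings, the output head, and bias terms are left out because the vocabulary size is not listed here. `non_embedding_params` is a hypothetical helper, and `head_dim = n_embd / n_head` is an assumption.

```python
def non_embedding_params(n_layer, n_embd, n_head, n_kv_head, intermediate_size):
    """Weight count of the transformer blocks plus the final norm.

    Token embeddings, the output head, and bias terms are excluded.
    """
    head_dim = n_embd // n_head
    kv_dim = n_kv_head * head_dim
    attention = 2 * n_embd * n_embd + 2 * n_embd * kv_dim  # Q and O, then K and V
    ffn = 3 * n_embd * intermediate_size                   # SwiGLU: gate, up, down
    norms = 2 * n_embd                                     # two RMSNorm weights per block
    return n_layer * (attention + ffn + norms) + n_embd    # plus the final RMSNorm


# Flags from the 1.4B pretraining command in Quick Start.
total = non_embedding_params(
    n_layer=28, n_embd=2048, n_head=16, n_kv_head=8, intermediate_size=6144
)
print(f"{total:,} ({total / 1e9:.2f}B)")
```

The result lands at roughly 1.41B, which matches the model's name before embeddings are counted.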

Training

Parameter	Default	Description
`--batch-size`	8	Batch size
`--grad-accum`	1	Gradient accumulation steps
`--max-lr`	3e-4	Peak learning rate
`--warmup-steps`	1000	LR warmup steps
`--max-steps`	100000	Total training steps
`--grad-clip`	1.0	Gradient clipping
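
To show how `--max-lr`, `--warmup-steps`, and `--max-steps` interact, here is a sketch of one common schedule: linear warmup to the peak, then cosine decay. The defaults come from the table above. The cosine shape and the final learning rate of zero are assumptions, since the decay used by the training script is not described here.

```python
import math


def lr_at(step, max_lr=3e-4, warmup_steps=1000, max_steps=100_000, min_lr=0.0):
    """Learning rate at `step`: linear warmup, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 999, 1000, 50_500, 100_000):
    print(f"step {step:>7,}: lr = {lr_at(step):.2e}")
```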

Generation

Parameter	Default	Description
`--max-tokens`	50	Tokens to generate
`--temperature`	0.8	Sampling temperature
`--top-k`	50	Top-k sampling
`--device`	`auto`	Device (auto/cpu/cuda/mps)
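
The `--temperature` and `--top-k` flags control how the next token is drawn. A minimal pure-Python sketch of that sampling step follows: keep the `top_k` highest-scoring tokens, divide their logits by the temperature, and sample from the resulting softmax. It uses made-up logits and a hypothetical `sample_next_token` helper. The repo's generation script may differ in detail.

```python
import math
import random


def sample_next_token(logits, temperature=0.8, top_k=50, rng=random):
    """Sample a token index from raw logits using temperature and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the k highest-scoring token indices.
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scaled softmax over the survivors (shifted by the max for stability).
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
rng = random.Random(0)
samples = [sample_next_token(logits, temperature=0.8, top_k=2, rng=rng) for _ in range(1000)]
print({tok: samples.count(tok) for tok in sorted(set(samples))})
```

Lower temperatures sharpen the distribution toward the highest-scoring token, and a smaller `top_k` removes unlikely tokens from consideration entirely. With `top_k=1` the procedure reduces to greedy decoding.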
