thibautbar/nanolm

🧠 NanoLM

Build your own LLM from scratch — tokenizer, pretraining, and instruction tuning.

A complete LLM implementation in PyTorch. The 1.4B model is trained on 20B tokens and outperforms GPT-2 Medium on all four 0-shot benchmarks listed below.

Try the live demo


📊 Benchmarks (0-shot)

| Model | ARC-Easy | HellaSwag | PIQA | WinoGrande |
|---|---|---|---|---|
| nanolm-1.4B | 63.9% | 50.4% | 70.5% | 54.9% |
| nanolm-0.3B | 57.5% | 36.3% | 65.9% | 51.8% |
| GPT-2 Medium | 48.5% | 39.4% | 66.4% | 52.5% |
| Random | 25% | 25% | 50% | 50% |
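
The Random row matches uniform guessing over four answer options (25%) and two answer options (50%). The README does not say how the 0-shot scores were computed. One common approach for multiple-choice tasks is to score every candidate answer by its mean token log-probability and pick the highest. The sketch below illustrates that idea only; `token_logprobs` and `toy_logprobs` are hypothetical stand-ins for a real model call.

```python
from typing import Callable, Sequence


def pick_answer(
    context: str,
    options: Sequence[str],
    token_logprobs: Callable[[str, str], Sequence[float]],
) -> int:
    """Return the index of the option with the highest mean token log-probability."""
    best_index, best_score = 0, float("-inf")
    for index, option in enumerate(options):
        logprobs = token_logprobs(context, option)
        # Length-normalize so longer answers are not penalized for having more tokens.
        score = sum(logprobs) / max(len(logprobs), 1)
        if score > best_score:
            best_index, best_score = index, score
    return best_index


def random_baseline(num_options: int) -> float:
    """Expected accuracy of uniform guessing."""
    return 1.0 / num_options


def toy_logprobs(context: str, option: str) -> list:
    # Hypothetical scorer: pretends the model strongly prefers "1789".
    per_token = -0.1 if "1789" in option else -2.0
    return [per_token] * len(option.split())


choice = pick_answer("The French Revolution began in", ["1492", "1789", "1914", "2001"], toy_logprobs)
print(choice, random_baseline(4), random_baseline(2))
```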

🤗 Pretrained Models

| Model | Params | Tokens | HuggingFace |
|---|---|---|---|
| nanolm-1.4B | 1.4B | 20B | thibautbar/nanolm-1.4B |
| nanolm-1.4B-instruct | 1.4B | | thibautbar/nanolm-1.4B-instruct |
| nanolm-0.3B | 336M | 12B | thibautbar/nanolm-0.3B |
| nanolm-0.3B-instruct | 336M | | thibautbar/nanolm-0.3B-instruct |

See docs/SMALLER_MODELS.md for 54M and 14M variants.
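
To fetch one of these checkpoints locally, something like the following may work. It is a sketch, not a documented workflow: it uses `snapshot_download` from the `huggingface_hub` package, and it assumes (unverified) that the files in the Hugging Face repo have the layout the `--checkpoint` flag of this repo's scripts expects. The helper names are hypothetical.

```python
from pathlib import Path

HF_USER = "thibautbar"  # namespace shown in the Pretrained Models table


def checkpoint_dir(model_name: str, root: str = "data/checkpoints") -> Path:
    """Local directory for a model, mirroring the paths used in Quick Start."""
    return Path(root) / model_name


def download_checkpoint(model_name: str, root: str = "data/checkpoints") -> Path:
    """Download every file of the Hugging Face repo into the local checkpoint directory."""
    # Imported lazily so checkpoint_dir() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = checkpoint_dir(model_name, root)
    snapshot_download(repo_id=f"{HF_USER}/{model_name}", local_dir=str(target))
    return target
```

Calling `download_checkpoint("nanolm-0.3B")` would place the files under `data/checkpoints/nanolm-0.3B`, if the assumption about the repo layout holds.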


✨ Highlights

Base model (pretrained, 1.4B):

Prompt: "The French Revolution began in"

The French Revolution began in 1789. The revolution was a reaction to the monarchy,
which ruled France for nearly 200 years. In 1789, the people of France were unhappy
with their government and wanted change. They called for a new constitution that
would give them more power...

Instruction-tuned model (SFT on Alpaca, 1.4B):

Prompt: "Explain what machine learning is in simple terms."

Machine learning is a type of artificial intelligence that allows computers to
learn from data without being explicitly programmed. It uses algorithms to
analyze large sets of data and identify patterns, which can then be used to make
predictions or decisions.

Prompt: "Write a Python function that checks if a number is prime."

def is_prime(num):
    for i in range(2, num):
        if (num % i) == 0:
            return False
    else:
        return True

🏗 Architecture

The 1.4B model uses a Qwen2-style transformer:

| Component | Choice | Details |
|---|---|---|
| Attention | Grouped Query Attention (GQA) | 16 heads, 8 KV heads |
| FFN | SwiGLU | Intermediate size 6144 |
| Norm | RMSNorm | Pre-norm |
| Tokenizer | BPE (trained from scratch) | Mixed text/code/math corpus |
| Context | 1024 tokens | |
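
As a sanity check on the "1.4B" label, the weight matrices implied by this table can be counted directly. The sketch below is a back-of-the-envelope estimate, not the repo's own accounting: the layer count (28) and embedding size (2048) come from the pretraining command in Quick Start, norms and biases are ignored, and the vocabulary size is not stated in this README, so embeddings are excluded unless `vocab_size` is passed in.

```python
def transformer_block_params(n_embd: int, n_head: int, n_kv_head: int, intermediate_size: int) -> int:
    """Weight-matrix parameters in one GQA + SwiGLU block (norms and biases ignored)."""
    head_dim = n_embd // n_head
    kv_dim = n_kv_head * head_dim           # K and V are narrower than Q under GQA
    attention = (
        n_embd * n_embd                     # query projection
        + 2 * n_embd * kv_dim               # key and value projections
        + n_embd * n_embd                   # output projection
    )
    ffn = 3 * n_embd * intermediate_size    # SwiGLU: gate, up, and down projections
    return attention + ffn


def estimate_params(n_layer, n_embd, n_head, n_kv_head, intermediate_size, vocab_size=0) -> int:
    blocks = n_layer * transformer_block_params(n_embd, n_head, n_kv_head, intermediate_size)
    embeddings = vocab_size * n_embd        # 0 by default: vocabulary size is not given here
    return blocks + embeddings


total = estimate_params(n_layer=28, n_embd=2048, n_head=16, n_kv_head=8, intermediate_size=6144)
print(f"{total / 1e9:.2f}B")  # 1.41B, excluding embeddings
```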

Training data mix (20B tokens, curated):

  • 75% FineWeb-Edu (high-quality educational web text)
  • 10% FineWeb (general web text)
  • 10% StarCoder (Python + JavaScript)
  • 5% OpenWebMath
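
These percentages correspond to the `path:weight` pairs passed to `--sources` in the pretraining command under Quick Start. The sketch below parses that string format and converts the weights into per-source token budgets for a 20B-token run. The parser is illustrative only; how the repo actually parses and samples from the sources is not shown in this README.

```python
from fractions import Fraction

# Copied from the pretraining command in Quick Start.
SPEC = (
    "data/sources/fineweb-edu-10BT:0.75,"
    "data/sources/fineweb-10BT:0.1,"
    "data/sources/starcoderdata/python:0.05,"
    "data/sources/starcoderdata/javascript:0.05,"
    "data/sources/openwebmath:0.05"
)


def parse_sources(spec: str) -> dict:
    """Split 'path:weight,path:weight' into {path: exact weight}."""
    weights = {}
    for entry in spec.split(","):
        path, _, weight = entry.rpartition(":")
        weights[path] = Fraction(weight)  # exact arithmetic avoids float rounding
    return weights


def token_budget(weights: dict, total_tokens: int) -> dict:
    """Tokens to draw from each source, normalizing in case weights do not sum to 1."""
    total_weight = sum(weights.values())
    return {path: int(total_tokens * w / total_weight) for path, w in weights.items()}


weights = parse_sources(SPEC)
for path, tokens in token_budget(weights, 20_000_000_000).items():
    print(f"{path}: {tokens / 1e9:.0f}B tokens")
```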

Training curves:

[Figure: Training Loss - 1.4B]


🔄 Pipeline

This repo implements the full LLM pipeline from scratch:

  1. BPE Tokenizer — trained on a mixed corpus for better code/math coverage
  2. Pretraining — Qwen2-style GPT with GQA, trained with HuggingFace accelerate
  3. Supervised Fine-Tuning — instruction tuning on Alpaca/Dolly
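
For readers new to step 1, the core of BPE training is a loop that repeatedly merges the most frequent adjacent symbol pair. The sketch below is a toy, character-level version that splits on whitespace. It is not the code in `src/tokenizer`, which this README does not show, and production tokenizers typically work on bytes with a pre-tokenization step.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(text: str, num_merges: int) -> list:
    """Learn up to `num_merges` merge rules, most frequent pair first."""
    # Each distinct word starts as a tuple of single characters, weighted by frequency.
    words = Counter(tuple(word) for word in text.split())
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)  # ties go to the first pair seen
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def encode_word(word: str, merges: list) -> list:
    """Tokenize one word by replaying the learned merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe("low low low lower lowest", num_merges=2)
print(merges)                         # [('l', 'o'), ('lo', 'w')]
print(encode_word("lowest", merges))  # ['low', 'e', 's', 't']
```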

SFT Results

| Model | Loss (start → end) | Token Accuracy |
|---|---|---|
| nanolm-1.4B-instruct | 2.4 → 1.1 | 54% → 70% |
| nanolm-0.3B-instruct | 3.0 → 1.4 | 50% → 64% |
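
One way to read these losses is as perplexity, the exponential of the mean per-token cross-entropy. The conversion below assumes the reported loss is a mean cross-entropy in nats (the PyTorch default), which this README does not state explicitly.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity for a mean per-token cross-entropy measured in nats."""
    return math.exp(mean_cross_entropy)


# Start and end losses copied from the SFT Results table.
runs = [("nanolm-1.4B-instruct", 2.4, 1.1), ("nanolm-0.3B-instruct", 3.0, 1.4)]
for name, start, end in runs:
    print(f"{name}: perplexity {perplexity(start):.1f} -> {perplexity(end):.1f}")
# nanolm-1.4B-instruct: perplexity 11.0 -> 3.0
# nanolm-0.3B-instruct: perplexity 20.1 -> 4.1
```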

[Figure: SFT Training Loss - 1.4B]


🚀 Quick Start

git clone <repo-url> && cd nanolm
uv venv && source .venv/bin/activate
uv pip install -e .

Generate text

uv run python -m src.scripts.autocomplete \
  --checkpoint data/checkpoints/nanolm-1.4B \
  --prompt "The future of AI" \
  --max-tokens 100 \
  --temperature 0.8

Train from scratch

# 1. Download data
python -m src.dataset.download \
  --dataset HuggingFaceFW/fineweb-edu --config sample-10BT \
  --output data/sources/fineweb-edu-10BT --num-proc 8

# 2. Train tokenizer
python -m src.tokenizer.train \
  --sources "data/sources/fineweb-edu-10BT:0.7,data/sources/code/python:0.1,data/sources/code/javascript:0.1,data/sources/openwebmath:0.1" \
  --num-samples 1000000 \
  --output data/tokenizer-v2

# 3. Pretrain
uv run accelerate launch -m src.pretraining.train_hf \
  --sources "data/sources/fineweb-edu-10BT:0.75,data/sources/fineweb-10BT:0.1,data/sources/starcoderdata/python:0.05,data/sources/starcoderdata/javascript:0.05,data/sources/openwebmath:0.05" \
  --tokenizer data/tokenizer-v2/tokenizer.json \
  --seq-len 1024 --batch-size 8 --grad-accum 4 \
  --n-layer 28 --n-head 16 --n-kv-head 8 --n-embd 2048 --intermediate-size 6144 \
  --max-lr 4e-4 --max-tokens 20B \
  --wandb --wandb-name "nanolm-1.4B"

# 4. Fine-tune (optional)
uv run python -m src.posttraining.sft \
  --checkpoint data/checkpoints/nanolm-1.4B \
  --output data/checkpoints/nanolm-1.4B-instruct \
  --dataset alpaca --epochs 3
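
To get a feel for the scale of the pretraining run in step 3, the flags can be turned into tokens per optimizer step and a total step count. This is rough arithmetic under stated assumptions: `--batch-size` is taken to be per process, and `num_processes` (the number of GPUs launched by accelerate) is a hypothetical parameter, since the README does not say how many were used.

```python
def tokens_per_step(seq_len: int, batch_size: int, grad_accum: int, num_processes: int = 1) -> int:
    """Tokens consumed by one optimizer step."""
    return seq_len * batch_size * grad_accum * num_processes


def steps_for_budget(total_tokens: int, per_step: int) -> int:
    """Optimizer steps needed to consume the token budget, rounded up."""
    return -(-total_tokens // per_step)  # integer ceiling division


# Values from the pretraining command: --seq-len 1024 --batch-size 8 --grad-accum 4 --max-tokens 20B
per_step = tokens_per_step(seq_len=1024, batch_size=8, grad_accum=4)
print(per_step)                                    # 32768
print(steps_for_budget(20_000_000_000, per_step))  # 610352 on a single process
```

With more processes the step count shrinks proportionally, for example `tokens_per_step(1024, 8, 4, num_processes=8)` gives 262144 tokens per step.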

📁 Project Structure

nanolm/
├── src/
│   ├── tokenizer/        # BPE tokenizer training & eval
│   ├── dataset/          # Data download & pipeline
│   ├── model/            # GPT architecture
│   ├── pretraining/      # Pretraining with HF Trainer
│   ├── posttraining/     # Supervised fine-tuning (SFT)
│   └── scripts/          # Text generation
└── data/
    ├── corpus/           # Training data
    ├── tokenizer/        # Trained tokenizer
    └── checkpoints/      # Model checkpoints

⚙️ Configuration Reference

Model

| Parameter | Default | Description |
|---|---|---|
| --n-layer | 6 | Transformer layers |
| --n-head | 8 | Attention heads |
| --n-kv-head | | KV heads for GQA |
| --n-embd | 512 | Embedding dimension |
| --intermediate-size | | FFN intermediate size |
| --seq-len | 512 | Context window |
| --dropout | 0.1 | Dropout rate |
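
The head-related flags constrain each other: the embedding dimension has to divide evenly across attention heads, and under GQA the query heads have to divide evenly across KV heads. The dataclass below is a hypothetical illustration of those checks using the defaults from the table. It is not the repo's config class, and treating an unset `--n-kv-head` as "same as `--n-head`" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelConfig:
    n_layer: int = 6
    n_head: int = 8
    n_kv_head: Optional[int] = None          # assumption: None falls back to n_head (plain multi-head attention)
    n_embd: int = 512
    intermediate_size: Optional[int] = None  # no default listed in the table; left unset
    seq_len: int = 512
    dropout: float = 0.1

    def __post_init__(self) -> None:
        if self.n_kv_head is None:
            self.n_kv_head = self.n_head
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")
        if self.n_head % self.n_kv_head != 0:
            raise ValueError("n_head must be divisible by n_kv_head")

    @property
    def head_dim(self) -> int:
        return self.n_embd // self.n_head

    def kv_head_for(self, query_head: int) -> int:
        """KV head shared by a query head when consecutive query heads are grouped."""
        return query_head // (self.n_head // self.n_kv_head)


cfg = ModelConfig(n_layer=28, n_head=16, n_kv_head=8, n_embd=2048, intermediate_size=6144, seq_len=1024)
print(cfg.head_dim)                             # 128
print([cfg.kv_head_for(h) for h in range(4)])   # [0, 0, 1, 1]
```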

Training

| Parameter | Default | Description |
|---|---|---|
| --batch-size | 8 | Batch size |
| --grad-accum | 1 | Gradient accumulation steps |
| --max-lr | 3e-4 | Peak learning rate |
| --warmup-steps | 1000 | LR warmup steps |
| --max-steps | 100000 | Total training steps |
| --grad-clip | 1.0 | Gradient clipping |
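
The `--max-lr`, `--warmup-steps`, and `--max-steps` flags together describe a learning-rate schedule with a warmup phase followed by decay. The sketch below shows one common shape, linear warmup followed by cosine decay. The decay shape and the `min_lr` parameter are assumptions for illustration; the README only names the peak rate and the warmup length.

```python
import math


def learning_rate(
    step: int,
    max_lr: float = 3e-4,
    warmup_steps: int = 1000,
    max_steps: int = 100_000,
    min_lr: float = 0.0,  # hypothetical floor, not a documented flag
) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr at max_steps."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to 1.0 past max_steps.
    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 999, 1000, 50_500, 100_000):
    print(step, f"{learning_rate(step):.2e}")
```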

Generation

| Parameter | Default | Description |
|---|---|---|
| --max-tokens | 50 | Tokens to generate |
| --temperature | 0.8 | Sampling temperature |
| --top-k | 50 | Top-k sampling |
| --device | auto | Device (auto/cpu/cuda/mps) |
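
The `--temperature` and `--top-k` flags control how the next token is drawn from the model's output logits. The pure-Python sketch below shows the usual combination of the two: keep the `top_k` highest logits, divide them by the temperature, and sample from the resulting softmax. It illustrates the technique rather than the repo's generation code, and treating a temperature of 0 as greedy decoding is an assumption.

```python
import math
import random
from typing import Optional, Sequence


def sample_next_token(
    logits: Sequence[float],
    temperature: float = 0.8,
    top_k: int = 50,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample a token index using temperature scaling and top-k filtering."""
    if temperature <= 0:
        # Assumption: non-positive temperature means greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Keep only the top_k highest-scoring token indices.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = ranked[:top_k] if top_k > 0 else ranked
    # Lower temperature sharpens the distribution, higher temperature flattens it.
    scaled = [logits[i] / temperature for i in kept]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    chooser = random if rng is None else rng
    return chooser.choices(kept, weights=weights, k=1)[0]


rng = random.Random(0)
print(sample_next_token([1.0, 5.0, 2.0, 0.5], temperature=0.8, top_k=2, rng=rng))
```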
