Step-by-step instructions for taking either model from "no weights" to "useful checkpoint". The repo ships no checkpoints — every weight file is produced on your hardware by running the stages below.
Aurora (omni-modal):
- Tokenizer training
- Audio codec autoencoder pretrain
- Text pretrain (short context, 8K)
- Long-context extension (→ 256K)
- Omni-modal pretrain (text + image + audio interleaved)
- SFT
- Omni-SFT (voice-instruction tuning)
- DPO
Forge (coder + DevOps):
- Tokenizer training
- Audio codec autoencoder pretrain (shared with Aurora)
- Text pretrain (short context)
- Long-context extension (→ 1M)
- Code-screenshot pretrain (vision tower warm-up)
- SFT
- DPO
- Tool-use SFT
- DevOps SFT (logs / incidents)
- RLEF (reinforcement learning from execution feedback)
The same scripts under scripts/ drive each stage; the config file is
what changes between them.
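The stage order above can be pictured as a chain in which each stage resumes from the previous stage's latest checkpoint. The following is a minimal sketch covering only the Aurora stages whose commands appear in this guide. The `Stage` dataclass and both helpers are hypothetical, and the `outputs/<config>/latest` layout is inferred from the example commands rather than from the scripts themselves:

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Stage:
    script: str    # file under scripts/
    training: str  # value passed to --training (selects the config)


# Only stages with an explicit command in this guide are listed here.
AURORA_STAGES = [
    Stage("pretrain.py", "pretrain"),
    Stage("pretrain.py", "long_context"),
    Stage("sft.py", "sft"),
    Stage("dpo.py", "dpo"),
]


def build_command(stage: Stage, model: str, tokenizer: str,
                  previous: Optional[Stage]) -> List[str]:
    """Return the argv for one stage, resuming from the previous stage if any."""
    cmd = ["python", f"scripts/{stage.script}", "--model", model,
           "--training", stage.training, "--tokenizer", tokenizer]
    if previous is not None:
        checkpoint = f"outputs/{previous.training}/latest"
        if stage.script == "pretrain.py":
            # In this guide pretrain.py receives the checkpoint as a key=value override.
            cmd.append(f"resume_from={checkpoint}")
        else:
            cmd += ["--resume-from", checkpoint]
    return cmd


def build_pipeline(stages: Sequence[Stage], model: str,
                   tokenizer: str) -> List[List[str]]:
    """Chain the stages so that each one resumes from the one before it."""
    commands = []
    previous = None
    for stage in stages:
        commands.append(build_command(stage, model, tokenizer, previous))
        previous = stage
    return commands


if __name__ == "__main__":
    for argv in build_pipeline(AURORA_STAGES, "aurora-72b",
                               "checkpoints/aurora-tokenizer.json"):
        print(" ".join(argv))
```

Keeping the chain as data makes it easy to see that only the script name and the `--training` value change from stage to stage.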
# Windows / PowerShell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -e . # install chatbot in editable mode
pip install -r requirements-train.txt # heavy training deps (flash-attn skipped on Windows)

# Linux / macOS
python -m venv .venv
source .venv/bin/activate
pip install -e .
pip install -r requirements-train.txt

Verify everything imports:
python -c "import chatbot; print(chatbot.__version__)"
pytest tests/ -q

Then sanity-check that the model configs hit their target parameter counts:
python scripts/count_params.py --model tiny
python scripts/count_params.py --model aurora-72b
python scripts/count_params.py --model forge-460b

Building the Forge-460B graph on CPU takes a few minutes but only happens once.
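As a rough cross-check on what a parameter counter reports, a dense decoder-only transformer has about 12·L·d² weights in its blocks plus V·d in the embedding table. The sketch below uses those textbook assumptions (standard multi-head attention, a 4× MLP, tied embeddings, no biases or norm weights). The dimensions in the example are illustrative only and are not the repo's `tiny`, `aurora-72b`, or `forge-460b` configs; `scripts/count_params.py` may well count differently:

```python
def approx_params(n_layers: int, d_model: int, vocab_size: int) -> int:
    """Approximate parameter count of a dense decoder-only transformer."""
    attention = 4 * d_model * d_model  # Q, K, V and output projections
    mlp = 8 * d_model * d_model        # two matrices of shape d x 4d
    embedding = vocab_size * d_model   # tied input/output embedding
    return n_layers * (attention + mlp) + embedding


if __name__ == "__main__":
    # Illustrative dimensions, not taken from any config in this repo.
    total = approx_params(n_layers=8, d_model=512, vocab_size=32000)
    print(f"{total:,}")  # 41,549,824
```

If the script's number differs greatly from an estimate like this, the config (layer count, width, or vocabulary size) is the first place to look.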
Pretraining needs a tokenizer. We train one with the HF tokenizers
library on a representative text corpus.
python scripts/train_tokenizer.py `
--files data/raw/sample-corpus/*.txt `
--vocab-size 131072 `
--output checkpoints/aurora-tokenizer.json

For Forge, use --vocab-size 200032, pass code-heavy files to --files, and write the result to checkpoints/forge-tokenizer.json (the path the Forge commands below use).
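For readers unfamiliar with the HF tokenizers library, training a tokenizer from text files looks roughly like the sketch below. It assumes a byte-level BPE model, which may not be what `scripts/train_tokenizer.py` actually builds, and the special token is a hypothetical placeholder. The library is imported lazily so the file-reading helper can be used on its own:

```python
from pathlib import Path
from typing import Iterable, Iterator


def iter_lines(paths: Iterable[str]) -> Iterator[str]:
    """Yield non-empty lines from each corpus file, one at a time."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if line:
                    yield line


def train_bpe(paths: Iterable[str], vocab_size: int, output: str) -> None:
    """Train a byte-level BPE tokenizer and save it as a single JSON file."""
    # Lazy import: only needed when training is actually requested.
    from tokenizers import Tokenizer
    from tokenizers.decoders import ByteLevel as ByteLevelDecoder
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import ByteLevel
    from tokenizers.trainers import BpeTrainer

    tokenizer = Tokenizer(BPE())
    tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)
    tokenizer.decoder = ByteLevelDecoder()
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["<|endoftext|>"],       # hypothetical special token
        initial_alphabet=ByteLevel.alphabet(),  # start from all 256 byte symbols
    )
    tokenizer.train_from_iterator(iter_lines(paths), trainer=trainer)
    Path(output).parent.mkdir(parents=True, exist_ok=True)
    tokenizer.save(str(output))
```

A call such as `train_bpe(sorted(glob.glob("data/raw/sample-corpus/*.txt")), 131072, "checkpoints/aurora-tokenizer.json")` would mirror the command above. Expanding the glob in Python also sidesteps shells that pass `*.txt` through unexpanded.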
Single-GPU dry run on the tiny config (great for verifying everything works end-to-end before you rent H100s):
python scripts/pretrain.py `
--model tiny `
--training pretrain `
--tokenizer checkpoints/aurora-tokenizer.json `
max_steps=50 micro_batch_size=2

Real Aurora-72B pretraining run (multi-node):
torchrun --nproc-per-node=8 --nnodes=$NNODES --rdzv-backend=c10d \
--rdzv-endpoint=$MASTER_ADDR:29500 \
scripts/pretrain.py \
--model aurora-72b \
--training pretrain \
--tokenizer checkpoints/aurora-tokenizer.json

Hardware expectations:
| Model | Pretrain (full) | SFT (full) | LoRA | QLoRA |
|---|---|---|---|---|
| Tiny (50M) | 1 GPU | 1 GPU | 1 GPU | 1 GPU |
| Aurora-72B | 256–1024× H100 | 32–64× H100 | 8× H100 | 2–4× A100 / RTX 6000 |
| Forge-460B | 1024–4096× H100 | 64–128× H100 | 16× H100 | 4–8× H100 (NF4) |
(Wall-clock time ranges from hours to days depending on cluster size; that is a function of compute, not code, so we don't try to estimate it precisely.)
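A large part of these GPU counts comes from memory for model states. The back-of-the-envelope sketch below assumes bf16 weights and gradients plus fp32 Adam states for full training (about 16 bytes per parameter) and roughly 0.5 bytes per parameter for NF4-quantized base weights. Activations, adapter weights, and communication buffers are ignored, so the results are lower bounds, not a derivation of the table:

```python
# 2 (bf16 weights) + 2 (bf16 grads) + 12 (fp32 master copy, Adam m and v)
FULL_TRAINING_BYTES_PER_PARAM = 16


def min_gpus_full_training(n_params: int, gpu_gb: int = 80) -> int:
    """Fewest GPUs whose combined memory holds weights, grads and Adam states."""
    total_bytes = n_params * FULL_TRAINING_BYTES_PER_PARAM
    gpu_bytes = gpu_gb * 10**9
    return -(-total_bytes // gpu_bytes)  # integer ceiling division


def nf4_weight_gb(n_params: int) -> float:
    """Approximate size of 4-bit (NF4) base weights in GB, excluding overhead."""
    return n_params * 0.5 / 10**9


if __name__ == "__main__":
    for name, n_params in [("Aurora-72B", 72 * 10**9), ("Forge-460B", 460 * 10**9)]:
        print(name, min_gpus_full_training(n_params), nf4_weight_gb(n_params))
```

Under these assumptions Aurora-72B needs at least 15 80 GB GPUs just to hold its training state, and Forge-460B at least 92. The much larger counts in the table presumably also budget for activations and for finishing in a reasonable wall-clock time.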
After pretrain, raise the context window with YaRN and continue training on long packed documents:
torchrun --nproc-per-node=8 scripts/pretrain.py \
--model aurora-72b \
--training long_context \
--tokenizer checkpoints/aurora-tokenizer.json \
resume_from=outputs/pretrain/latest

Run this in stages: 8K → 32K → 128K → 256K (Aurora) or 8K → 32K → 256K → 1M (Forge). Each stage gets its own checkpoint.
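For orientation, the staged schedule can be written out together with the RoPE scale factor each stage implies relative to the 8K pretraining window. The sketch also evaluates the attention-scaling rule recommended in the YaRN paper, sqrt(1/t) = 0.1·ln(s) + 1. Whether this repo's `long_context` config uses that exact rule is an assumption, as is reading 8K as 8192 tokens and 1M as 1,048,576:

```python
import math
from typing import Tuple

BASE_CONTEXT = 8 * 1024  # assumed size of the short-context pretraining window

SCHEDULES = {
    "aurora": [32 * 1024, 128 * 1024, 256 * 1024],
    "forge": [32 * 1024, 256 * 1024, 1024 * 1024],
}


def yarn_stage(target_context: int,
               base_context: int = BASE_CONTEXT) -> Tuple[float, float]:
    """Return (scale factor s, attention factor) for one extension stage."""
    scale = target_context / base_context
    # YaRN's recommended attention scaling: sqrt(1/t) = 0.1 * ln(s) + 1
    attention_factor = 0.1 * math.log(scale) + 1.0
    return scale, attention_factor


if __name__ == "__main__":
    for model, targets in SCHEDULES.items():
        for target in targets:
            scale, attention = yarn_stage(target)
            print(f"{model}: {target:>8} tokens  s={scale:g}  attn={attention:.4f}")
```

The scale factors are always measured against the original 8K window, not against the previous stage, so the final Forge stage corresponds to s = 128.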
SFT, starting from the final long-context checkpoint:
torchrun --nproc-per-node=8 scripts/sft.py \
--model aurora-72b \
--training sft \
--tokenizer checkpoints/aurora-tokenizer.json \
--resume-from outputs/long_context/latest

DPO, starting from the SFT checkpoint:
torchrun --nproc-per-node=8 scripts/dpo.py \
--model aurora-72b \
--training dpo \
--tokenizer checkpoints/aurora-tokenizer.json \
--resume-from outputs/sft/latest

Tool-use SFT for Forge, starting from the DPO checkpoint:
torchrun --nproc-per-node=8 scripts/sft.py \
--model forge-460b \
--training tool-use-sft \
--tokenizer checkpoints/forge-tokenizer.json \
--resume-from outputs/dpo/latest

(Uses the SFT script with the tool-use config because the loss is the same; only the data and masking differ.)
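To illustrate the "same loss, different masking" point: a common convention is to copy the input ids into the labels and overwrite every position that should not be trained on with -100, the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below assumes each token carries a role tag. The role names and the choice of which roles are trainable are hypothetical and not taken from this repo's data pipeline:

```python
from typing import List, Sequence

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(token_ids: Sequence[int], roles: Sequence[str],
                 trainable_roles: Sequence[str] = ("assistant",)) -> List[int]:
    """Keep the token id where the loss applies; mask everything else."""
    if len(token_ids) != len(roles):
        raise ValueError("token_ids and roles must have the same length")
    return [
        token if role in trainable_roles else IGNORE_INDEX
        for token, role in zip(token_ids, roles)
    ]


if __name__ == "__main__":
    tokens = [11, 12, 21, 22, 31, 41]
    roles = ["user", "user", "assistant", "assistant", "tool", "assistant"]
    # Tool output is context the model reads, not text it should learn to emit.
    print(build_labels(tokens, roles))  # [-100, -100, 21, 22, -100, 41]
```

With labels built this way, the loss function is identical between plain SFT and tool-use SFT; only the set of unmasked positions changes.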
LoRA is the realistic option for end-users with modest hardware:
python scripts/lora_finetune.py `
--model aurora-72b `
--training lora `
--tokenizer checkpoints/aurora-tokenizer.json `
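--base-checkpoint outputs/sft/latest

For QLoRA, add qlora.enabled=true and make sure bitsandbytes is installed. As a rough guide, a 50B-parameter QLoRA fine-tune typically fits on a single 48 GB GPU, and a 250B-parameter one needs 4–8× 80 GB GPUs; for Aurora-72B and Forge-460B, use the QLoRA column of the hardware table above.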
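Trailing `key=value` arguments such as `max_steps=50` or `qlora.enabled=true` read like overrides layered on top of the YAML config. How the scripts actually parse them is not shown in this guide, so the following is only a sketch of one plausible scheme: dotted keys address nested sections, and values are parsed as JSON where possible. `parse_value` and `apply_overrides` are hypothetical helpers:

```python
import copy
import json
from typing import Any, Dict, Iterable


def parse_value(raw: str) -> Any:
    """Interpret true/false, numbers and null; otherwise keep the raw string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw  # e.g. a filesystem path


def apply_overrides(config: Dict[str, Any],
                    overrides: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of config with each key=value override applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        key, sep, raw = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = key.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})  # create missing sections
        node[leaf] = parse_value(raw)
    return result


if __name__ == "__main__":
    base = {"max_steps": 1000, "qlora": {"enabled": False, "bits": 4}}
    print(apply_overrides(base, ["max_steps=50", "qlora.enabled=true"]))
```

Returning a modified copy leaves the loaded YAML untouched, which makes it straightforward to log both the base config and the effective one for a run.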
A four-command flow to make sure every code path executes on the tiny config — no internet required:
python scripts/train_tokenizer.py --files tests/conftest.py --vocab-size 1024 --output checkpoints/tiny-tok.json
python scripts/pretrain.py --model tiny --tokenizer checkpoints/tiny-tok.json max_steps=10 micro_batch_size=2
python scripts/sft.py --model tiny --tokenizer checkpoints/tiny-tok.json max_steps=10 --resume-from outputs/pretrain/latest
python scripts/chat.py --model tiny --checkpoint outputs/sft/latest --tokenizer checkpoints/tiny-tok.json

If that sequence runs end-to-end on a laptop, the bigger configs should run end-to-end on a GPU cluster: same code, just bigger numbers in YAML.