LMPool

A personal collection of language model implementations, NLP experiments, and inference workflows ✨

Index

  1. Building a Transformer
  2. Training a GPT
  3. miniLLM
  4. miniMamba
  5. AevRL
  6. microAR

Others

  • Fine-Tuning LLMs (Guide)
  • Self-Optimizer Inference
  • Speculative Decoding
  • Quantization
    • AWQ
    • GPTQ
    • LLM.int8()
    • SmoothQuant
    • SpQR
    • FP8
    • NF4
    • QLoRA
    • PTQ
    • QAT
    • Others
      • TurboQuant

(1) Building a Transformer

A step-by-step guide to building a modern decoder-only transformer language model from scratch. It breaks down the fundamental components and architecture choices of modern large language models, explaining the shapes, tensor flows, and mathematics behind each part.

See the Transformers/ directory for the full collection of 18 lessons, ranging from a basic architecture overview to advanced topics like KV caching.

What you'll build:

  • Training batches and next-token prediction
  • Token embeddings and the language model head
  • Next-token cross-entropy loss (see the sketch after this list)
  • Autoregressive token generation
  • Core transformer components (RMSNorm, RoPE, SwiGLU)
  • Multi-head causal self-attention
  • A complete tiny GPT model and training loop
  • Checkpointing and validation pipelines
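
Here is a minimal PyTorch sketch of the next-token cross-entropy loss from the list above; the function name and shapes are illustrative rather than taken from the lessons:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: logits (B, T, V) from the model, tokens (B, T) input ids.
# For next-token prediction, position t is trained to predict token t+1, so we
# drop the last logit and the first token before computing cross-entropy.
def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    shift_logits = logits[:, :-1, :]   # predictions for positions 0..T-2
    shift_targets = tokens[:, 1:]      # ground-truth tokens 1..T-1
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),  # (B*(T-1), V)
        shift_targets.reshape(-1),                        # (B*(T-1),)
    )
```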

Architecture highlights:

  • Modern decoder-only design
  • Rotary Position Embeddings (RoPE)
  • RMSNorm for lightweight normalization (sketched with SwiGLU after this list)
  • SwiGLU feed-forward network
  • KV caching for efficient autoregressive generation
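
As a taste of two of the components above, here is a hedged PyTorch sketch of RMSNorm and SwiGLU; module names and widths are illustrative, and the lessons' own implementations may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescales by the RMS of the features,
    with a learned gain but no mean subtraction and no bias (unlike LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward: one projection passes through SiLU and gates the other."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```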

(2) Training a GPT

A comprehensive, from-scratch guide to building and training a modern decoder-only Transformer language model. Every component is explained with analogies, math, and heavily annotated code. No ML experience required, just basic Python.

Implements the same architecture as LLaMA 3, Mistral, and Qwen 2.5: RoPE positional encoding, RMSNorm, SwiGLU activation, pre-norm residuals, and weight tying. The guide walks through tokenization (BPE), embeddings, attention, transformer blocks, training, and inference with complete working code.

See GPT/README.md for the full chapter index and architecture overview.

What you'll build:

  • BPE tokenizer (same algorithm as GPT-4)
  • Multi-head attention with RoPE
  • Complete 124M parameter GPT model
  • Training pipeline with AdamW, cosine warmup, mixed precision
  • Inference engine with temperature, top-k/p sampling, KV cache

Architecture highlights:

  • RoPE (Rotary Position Embeddings): relative positions without learned parameters
  • RMSNorm: 15% faster than LayerNorm, equally effective
  • SwiGLU: gated activation for selective information flow
  • Pre-Norm: stable training at any depth
  • Weight tying: shares embedding and output projection weights
  • Causal masking β€” autoregressive next-token prediction

Training setup:

  • WikiText-103 dataset
  • AdamW optimizer with decoupled weight decay
  • Cosine LR schedule with linear warmup (sketched after this list)
  • Gradient clipping and accumulation
  • Mixed precision (bfloat16)
  • Automatic checkpointing
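
The optimizer and schedule above can be sketched as follows; the step counts, learning rates, and betas are placeholder values rather than the guide's actual hyperparameters (those live in GPT/main.py):

```python
import math
import torch

# Placeholder hyperparameters for illustration only.
max_steps, warmup_steps = 10_000, 500
peak_lr, min_lr = 3e-4, 3e-5

model = torch.nn.Linear(8, 8)  # stand-in for the GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                              betas=(0.9, 0.95), weight_decay=0.1)

def lr_at(step: int) -> float:
    if step < warmup_steps:  # linear warmup from 0 to the peak rate
        return peak_lr * step / warmup_steps
    # cosine decay from peak_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in range(max_steps):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    # ... forward pass, loss.backward(), gradient clipping,
    # optimizer.step(), optimizer.zero_grad() go here
```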

Entry points:

  • GPT/chapters/ β€” 12 sequential chapters from basics to full implementation
  • GPT/main.py β€” complete runnable training script
  • GPT/notebooks/ β€” Jupyter notebooks for interactive learning

(3) miniLLM

A minimal, readable decoder-only language model baseline incorporating standard practices from modern LLMs.

Designed as a quick baseline for testing architecture ideas and research experiments. The focus is on legibility and ease of modification rather than production throughput or distributed training.

See miniLLM/README.md for full details.

Architecture highlights:

  • RMSNorm
  • Rotary Position Embeddings (RoPE) with interpolation/extrapolation
  • Flash Attention via F.scaled_dot_product_attention (see the sketch after this list)
  • SwiGLU feed-forward
  • Pre-layer normalization
  • Grouped Query Attention (GQA)
  • KV caching
  • Multi-head Latent Attention (MLA) with compressed KV caching (DeepSeek V3 style)
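
A sketch of how GQA can ride on PyTorch's fused attention kernel, as referenced in the list above; head counts and shapes are illustrative, and miniLLM's actual code may differ:

```python
import torch
import torch.nn.functional as F

# Schematic GQA call with more query heads than KV heads.
# Illustrative shapes: batch 2, 16 tokens, 8 query heads sharing 2 KV heads.
B, T, n_q_heads, n_kv_heads, head_dim = 2, 16, 8, 2, 64
q = torch.randn(B, n_q_heads, T, head_dim)
k = torch.randn(B, n_kv_heads, T, head_dim)
v = torch.randn(B, n_kv_heads, T, head_dim)

# Expand each KV head across its group of query heads, then let PyTorch pick
# a fused (Flash) kernel via scaled_dot_product_attention with a causal mask.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)  # (B, n_q_heads, T, head_dim)
v = v.repeat_interleave(group, dim=1)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 16, 64])
```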

Training setup:

  • C4 dataset streamed from Hugging Face (loading sketched after this list)
  • GPT-NeoX-20B tokenizer
  • AdamW optimizer (betas 0.9, 0.95; weight decay 0.1)
  • Cosine LR schedule with linear warmup
  • Gradient clipping at 1.0
  • Automatic Mixed Precision (AMP)
  • torch.compile()
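
Streaming C4 with the GPT-NeoX-20B tokenizer might look roughly like the following; see miniLLM/main.py for the exact dataset and tokenizer configuration:

```python
from itertools import islice
from datasets import load_dataset
from transformers import AutoTokenizer

# Stream C4 so nothing is downloaded up front; the dataset path and tokenizer
# id follow the list above but are worth verifying against miniLLM/main.py.
ds = load_dataset("allenai/c4", "en", split="train", streaming=True)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

for example in islice(ds, 2):
    ids = tokenizer(example["text"]).input_ids
    print(len(ids), ids[:8])
```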

Entry points:

  • miniLLM/main.py - standalone training script
  • miniLLM/miniLLM.ipynb - Colab-compatible notebook

(4) miniMamba

A from-scratch implementation of the Mamba selective state space model, including the full parallel scan algorithm with a custom autograd function and support for autoregressive inference.

Built for legibility and experimentation. The model can be pretrained from scratch or used as a continued pretraining starting point, with a separate LoRA fine-tuning path.

Architecture highlights:

  • Selective SSM following the Mamba architecture (Gu & Dao, 2023)
  • Custom parallel scan (PScan) via torch.autograd.Function with up/down sweep, O(log T) steps
  • Depthwise causal 1D convolution as a short input filter
  • Dual-branch SiLU-gated projection (x and z branches)
  • S4D real initialization for the state matrix A; dt initialized via softplus inverse
  • RMSNorm with optional muP scaling
  • Autoregressive step-wise inference with O(1) per-step cost via an RNN-style hidden state and input buffer cache (see the sketch after this list)
  • Optional inner layer normalization (Jamba-style)
  • Optional fused CUDA selective scan via mamba_ssm
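
A heavily simplified sketch of that O(1) step, following the standard Mamba zero-order-hold discretization; it ignores the convolution buffer and SiLU gating, and every name and shape below is illustrative:

```python
import torch

# One recurrent step of a diagonal selective SSM.
# dt, B, C are input-dependent ("selective"); A is fixed, negative real.
def ssm_step(h, x_t, dt_t, A, B_t, C_t):
    """h: (d_inner, d_state) hidden state, x_t: (d_inner,) input,
    dt_t: (d_inner,) step sizes, A: (d_inner, d_state) state matrix,
    B_t, C_t: (d_state,) input-dependent projections."""
    dA = torch.exp(dt_t[:, None] * A)        # zero-order-hold discretization
    dB = dt_t[:, None] * B_t[None, :]
    h = dA * h + dB * x_t[:, None]           # state update
    y_t = (h * C_t[None, :]).sum(-1)         # (d_inner,) readout
    return h, y_t

d_inner, d_state = 4, 16
h = torch.zeros(d_inner, d_state)
A = -torch.arange(1, d_state + 1).float().expand(d_inner, d_state)  # S4D-real-like
h, y = ssm_step(h, torch.randn(d_inner), torch.rand(d_inner),
                A, torch.randn(d_state), torch.randn(d_state))
```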

Pretraining setup:

  • Wikitext-2-raw-v1 by default (dataset and config fully overridable via CLI)
  • Full parameter training, no PEFT
  • Causal language modeling objective; text concatenated and chunked into fixed-length blocks
  • HuggingFace Trainer with DataCollatorForLanguageModeling
  • Cosine LR schedule with warmup, AdamW optimizer
  • Automatic bf16 detection, gradient checkpointing support
  • Continued pretraining from existing weights or random init from architecture config

Fine-tuning setup:

  • LoRA via PEFT targeting Mamba SSM projection layers (x_proj, in_proj, out_proj, embeddings)
  • SFTTrainer from trl

Entry points:

  • miniMamba/mamba_mini.py - minimal model definition (Mamba, MambaBlock, PScan, RMSNorm)
  • miniMamba/mamba.py - model definition (Mamba, MambaBlock, PScan, RMSNorm)
  • miniMamba/pscan.py - parallel scan with custom forward and backward pass
  • miniMamba/pretraining/train.py - pretraining script with full CLI
  • miniMamba/pretraining/pretrain.py - minimal pretraining script
  • miniMamba/finetuning/train_lora.py - LoRA fine-tuning script

(5) AevRL

A lightweight RL stack for training language models with GRPO (Group Relative Policy Optimization). The main training loop is under 500 lines of code. Built to be hackable, modular, and straightforward to extend with new algorithms and environments.

The trainer runs async rollouts against a chat model served by vLLM, collects rewards from a pluggable environment, and trains a local LoRA adapter using clipped policy gradients with a KL penalty against the frozen base model. The vLLM server and PyTorch trainer time-share a single GPU via sleep/wake cycling.

See AevRL/README.md for setup instructions, configuration reference, and the full training loop walkthrough.

Algorithm

  • GRPO with group-normalized advantages (no value function needed)
  • Clipped importance-weighted policy loss (PPO-clip style) with per-token assistant masking
  • KL penalty against the frozen reference model using the unbiased exp(r) - r - 1 estimator (sketched after this list)
  • Pluggable algorithm interface via Algorithm ABC (compute_advantages, loss)
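
The first and third items above can be sketched in a few lines; the function names are illustrative, and the real implementations live in the AevRL trainer:

```python
import torch

# Group-normalized advantages: sample G completions per prompt, then center
# and scale each completion's reward within its own group (no value network).
def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, G), one scalar reward per completion."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Unbiased per-token KL estimate between policy and frozen reference:
# with r = log p_ref(token) - log p_policy(token), exp(r) - r - 1 >= 0.
def kl_estimate(logp_policy: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    r = logp_ref - logp_policy
    return torch.exp(r) - r - 1
```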

Included environments

  • SimpleMath (basic arithmetic with <think>/<answer> format rewards)
  • GSM8K (streamed from HuggingFace with thread-safe shuffled iteration and answer normalization)

Training setup

  • LoRA adapter via PEFT (rank 16, alpha 32) targeting all attention and MLP projections
  • Gradient checkpointing and CPU offloading to share GPU memory with vLLM
  • W&B logging enabled by default
  • Pydantic-validated YAML config for all hyperparameters

Entry points:


(6) microAR

Minimal, dependency-free implementations of Attention Residuals (MoonshotAI) applied to karpathy's microgpt. Both variants from the paper are implemented in the same pure-Python, scalar-autograd style as the original.

In a standard transformer, the residual stream is a running sum. Attention Residuals replace this with a learned selective mix: before each sublayer, a zero-initialized projection vector scores all previous outputs via softmax, and the sublayer receives a weighted combination instead of the undifferentiated cumulative sum.
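
One plausible reading of that mechanism as a PyTorch sketch; the scoring rule and shapes here are assumptions, so consult microAR/README.md and the paper for the exact formulation:

```python
import torch

# Hypothetical selective-mix step before a sublayer. `history` holds the
# embedding plus every previous sublayer output, each of shape (T, d).
# A zero-initialized vector scores each candidate per token; at init the
# softmax is uniform, so the mix starts as a plain average of the history.
def selective_mix(history: list[torch.Tensor], p: torch.Tensor) -> torch.Tensor:
    H = torch.stack(history)                   # (num_candidates, T, d)
    scores = H @ p                             # (num_candidates, T)
    weights = torch.softmax(scores, dim=0)     # normalize over candidates
    return (weights.unsqueeze(-1) * H).sum(0)  # (T, d) input to next sublayer

T, d = 8, 32
history = [torch.randn(T, d) for _ in range(3)]
p = torch.zeros(d)                             # zero init -> uniform weights
x_in = selective_mix(history, p)
```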

Follow microAR/README.md for the full walkthrough, reference pseudocode from the paper, and execution traces.

Variants

  • Full AttnRes (FAR) tracks every individual sublayer output as a candidate: O(2L) candidates for L layers, since each layer contributes an attention output and an MLP output.
  • Block AttnRes (BAR) groups layers into blocks with a partial accumulator, committing block summaries at boundaries: O(blocks) candidates.

Entry points:


Others

🔵 Fine-Tuning LLMs (Guide)

A practical guide to supervised fine-tuning of pre-trained language models using the Hugging Face transformers library.

Covers the four core fine-tuning methods (SFT, CPT, DPO, RLHF), walks through a complete end-to-end training pipeline on the GSM8K math dataset using Qwen 3 (0.6B), and documents best practices around data preparation, training strategy, and common failure modes.

See Others/Fine-Tuning/README.md for the full guide.

Topics covered:

  • Supervised Fine-Tuning (SFT)
  • Continued Pre-Training (CPT)
  • Direct Preference Optimization (DPO)
  • Reinforcement Learning from Human Feedback (RLHF)
  • Dataset loading and tokenization with loss masking (masking sketched after this list)
  • Training configuration and hyperparameter guidance
  • Evaluation using loss and perplexity
  • Parameter-efficient fine-tuning with LoRA and NEFTune
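
Loss masking, mentioned in the list above, boils down to setting prompt positions to the cross-entropy ignore index; the token ids below are made up for illustration:

```python
import torch

# Schematic loss masking for SFT: prompt tokens get label -100 so PyTorch's
# cross-entropy ignores them and loss is computed only on the response.
prompt_ids = [101, 2054, 2003, 1017, 1009, 1018]  # hypothetical prompt tokens
response_ids = [1021, 102]                        # hypothetical response + EOS

input_ids = torch.tensor(prompt_ids + response_ids)
labels = input_ids.clone()
labels[: len(prompt_ids)] = -100  # -100 is the default ignore_index
```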

Entry point:


🔵 Self-Optimizer Inference

An autonomous agent loop for optimizing LLM inference throughput on Apple Silicon using MLX. Inspired by karpathy/autoresearch.

The setup is simple: inference.py is the only file the agent can modify, while prepare.py is a locked evaluation harness that benchmarks every change and enforces quality gates (perplexity plus task-level sanity checks). The agent hill-climbs on generation tokens/sec, commits each experiment, and reverts anything that fails.

Tested with Claude Opus 4.6 on a MacBook Pro M4 (24GB RAM) against two models. Argmax sampling was the biggest consistent gain (+10.9% on Qwen2.5-0.5B-Instruct-4bit, +3.1% on Gemma-3-270m-it-4bit). KV cache quantization consistently hurt, and the sanity-check gate caught quality regressions that perplexity alone missed.

See Others/SelfOptimizer-Inference/README.md for full benchmark results and the agent protocol.

Entry points:


🔵 Speculative Decoding

An implementation of Speculative Decoding (Leviathan et al., 2023) with rejection sampling that provably preserves the target model's output distribution.

A small draft model (Qwen3-0.6B) proposes gamma tokens per step, the target model (Qwen3-4B) verifies them in a single forward pass, and a rejection sampling scheme guarantees the output matches sampling from the target alone. Includes greedy and KV-cached baselines for throughput comparison.
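
The per-token acceptance rule can be sketched as follows, assuming draft and target distributions q and p over the vocabulary at the position being verified (a schematic, not the repo's implementation):

```python
import torch

# Rejection rule from Leviathan et al. (2023) for one drafted token x:
# accept x with probability min(1, p[x]/q[x]); on rejection, resample from
# the residual max(p - q, 0), renormalized. This preserves p exactly.
def verify_token(x: int, p: torch.Tensor, q: torch.Tensor) -> int:
    if torch.rand(()) < torch.clamp(p[x] / q[x], max=1.0):
        return x                                # accept the draft token
    residual = torch.clamp(p - q, min=0.0)
    residual = residual / residual.sum()        # renormalize the residual
    return int(torch.multinomial(residual, 1))  # corrected resample
```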

See Others/Speculative-Decoding/README.md for the algorithm walkthrough, sampling modes, and benchmark harness.

Entry point:


🔵 Quantization

A collection of minimal, readable implementations and reference pipelines for modern neural network quantization techniques. The focus is on establishing robust baselines for post-training compression, mixed-precision inference, and parameter-efficient fine-tuning.

Note

Most of the baseline implementations require external dependencies (transformers, bitsandbytes, autoawq, auto-gptq). Ensure your environment is configured correctly before executing the API-driven examples.

See Others/Quantization/README.md for the full breakdown of techniques, comparisons, and additional references.

Topics covered:

  • PTQ & QAT: Baselines for dynamic post-training quantization and quantization-aware training (a minimal PTQ sketch follows this list).
  • GPTQ & AWQ: Industry-standard 4-bit weight quantization methods targeting salient parameters.
  • SmoothQuant & LLM.int8(): Mixed-precision and mathematical migration techniques for stable W8A8 compute.
  • NF4 & FP8: Information-theoretically optimal and hardware-native low-bit data types.
  • SpQR & QLoRA: Sparse outlier isolation and parameter-efficient adapter tuning over frozen 4-bit models.
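
As a minimal taste of the PTQ baseline flagged in the list above, here is a symmetric per-tensor int8 round trip; it sketches the general idea rather than any of the repo's pipelines:

```python
import torch

# Symmetric per-tensor int8 post-training quantization: pick a scale from the
# max absolute value, round to int8, and dequantize back to float for compute.
def quantize_int8(w: torch.Tensor):
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(256, 256)
q, scale = quantize_int8(w)
print((w - dequantize(q, scale)).abs().max())  # error is bounded by ~scale/2
```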

Tip

For an extended breakdown of quantization concepts, visit the official Hugging Face Quantization Guide. For a comprehensive list of modern quantization papers, refer to the Awesome-LLM-Quantization repository.

Entry points:


🟢 TurboQuant

An unofficial, end-to-end PyTorch implementation of TurboQuant (Online Vector Quantization with Near-optimal Distortion Rate) for KV cache compression during Hugging Face generation.

Compresses KV cache entries online using a two-stage quantizer: Lloyd-Max scalar quantization after random rotation (Qmse), plus a 1-bit QJL residual sketch for unbiased inner-product estimation (Qprod). Targets Qwen2.5-3B-Instruct (dense, non-MoE); MoE support may be added in a future edition.
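
A loose sketch of the rotate-then-scalar-quantize idea behind the Qmse stage; the QR-based rotation and uniform quantizer below are simple stand-ins for the repo's random rotation and Lloyd-Max quantizer:

```python
import torch

# A random orthogonal rotation (QR of a Gaussian matrix) spreads energy evenly
# across dimensions; a uniform scalar quantizer then stands in for Lloyd-Max.
d, bits = 64, 4
rot, _ = torch.linalg.qr(torch.randn(d, d))  # random orthogonal rotation

def rotate_and_quantize(v: torch.Tensor):
    z = rot @ v                               # rotate
    scale = z.abs().max() / (2 ** (bits - 1) - 1)
    q = torch.round(z / scale)                # uniform scalar quantization
    return q, scale

def reconstruct(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return rot.T @ (q * scale)                # dequantize, rotate back

v = torch.randn(d)
q, s = rotate_and_quantize(v)
err = (v - reconstruct(q, s)).norm() / v.norm()  # relative distortion
```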

See Others/Quantization/TurboQuant/README.md for the full architecture walkthrough and usage guide.

Entry point:


License

MIT
