tiny-ton

A Triton-inspired GPU kernel compiler: write GPU kernels in Python and compile them through MLIR to real hardware instructions.

import tiny_ton as tt
import numpy as np

@tt.jit
def vector_add(a_ptr, b_ptr, c_ptr, N):
    pid = tt.program_id(0)
    offsets = pid * 64 + tt.arange(0, 64)
    mask = offsets < N
    a = tt.load(a_ptr + offsets, mask=mask)
    b = tt.load(b_ptr + offsets, mask=mask)
    tt.store(c_ptr + offsets, a + b, mask=mask)

a = np.array([1, 2, 3, 4], dtype=np.int32)
b = np.array([10, 20, 30, 40], dtype=np.int32)
c = np.zeros(4, dtype=np.int32)

vector_add[(1,)](a, b, c, len(a))
print(c)  # [11, 22, 33, 44]
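To make the blocking and masking in the kernel above concrete, here is a minimal pure-Python sketch of the same computation. It assumes the launch grid follows Triton-style semantics (one program instance per block of 64 elements). The names `BLOCK`, `grid_size`, and `vector_add_reference` are illustrative and not part of tiny-ton.

```python
BLOCK = 64  # block size hard-coded in the kernel above


def grid_size(n, block=BLOCK):
    """Number of program instances needed to cover n elements (ceiling division)."""
    return (n + block - 1) // block


def vector_add_reference(a, b):
    """Pure-Python emulation of the blocked, masked vector add (illustrative only)."""
    n = len(a)
    c = [0] * n
    for pid in range(grid_size(n)):      # one iteration per program instance
        for lane in range(BLOCK):        # plays the role of tt.arange(0, 64)
            offset = pid * BLOCK + lane
            if offset < n:               # plays the role of mask = offsets < N
                c[offset] = a[offset] + b[offset]
    return c


print(vector_add_reference([1, 2, 3, 4], [10, 20, 30, 40]))  # [11, 22, 33, 44]
print(grid_size(4), grid_size(64), grid_size(65))            # 1 1 2
```

Under this assumption, a grid of `(1,)` covers inputs of up to 64 elements, and larger inputs would need `grid_size(N)` program instances. The mask keeps the last, partially filled block from touching memory past the end of the arrays.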

Architecture

Python (@jit) → AST capture → pybind11 → C++ IRBuilder → MLIR (TinyTon dialect)
    → Register Allocation → CodeGen → Runtime/Simulator → Execution
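The first stage of the pipeline, AST capture, can be illustrated with Python's standard `ast` module. This is only a sketch of the general technique, not tiny-ton's actual front end: it parses the kernel source and lists the `tt.*` operations a compiler front end would have to lower. The helper name `collect_tt_calls` is hypothetical.

```python
import ast

# Source text of the kernel from the example above.
KERNEL_SOURCE = '''
def vector_add(a_ptr, b_ptr, c_ptr, N):
    pid = tt.program_id(0)
    offsets = pid * 64 + tt.arange(0, 64)
    mask = offsets < N
    a = tt.load(a_ptr + offsets, mask=mask)
    b = tt.load(b_ptr + offsets, mask=mask)
    tt.store(c_ptr + offsets, a + b, mask=mask)
'''


def collect_tt_calls(source):
    """Return the names of all tt.<op>(...) calls, in source order."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "tt"
        ):
            # ast.walk is breadth-first, so record positions and sort afterwards.
            found.append((node.lineno, node.col_offset, node.func.attr))
    return [name for _, _, name in sorted(found)]


print(collect_tt_calls(KERNEL_SOURCE))
# ['program_id', 'arange', 'load', 'load', 'store']
```

A real front end would go further and translate each node into IR through the builder; the point here is only that the decorated function is treated as data to be compiled, not as Python to be executed.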

Building

Prerequisites

CMake 3.20+
C++17 compiler
LLVM/MLIR 18
pybind11
Python 3.10+

Build

# Docker (recommended)
docker build -t tiny-ton .
docker run tiny-ton ttc --emit asm examples/vector_add.tgc

# Native (macOS with Homebrew; the paths below assume the /opt/homebrew prefix)
brew install cmake ninja llvm@18
rm -rf build
cmake -G Ninja -S . -B build \
  -DCMAKE_BUILD_TYPE=Debug \
  -DMLIR_DIR=/opt/homebrew/opt/llvm@18/lib/cmake/mlir \
  -DLLVM_DIR=/opt/homebrew/opt/llvm@18/lib/cmake/llvm \
  -DTTN_ENABLE_PYTHON=OFF
cmake --build build
./build/bin/ttc --help

Python package

cd python
pip install -e .

Roadmap — microgpt on GPU

Goal: run Karpathy's microgpt forward pass on GPU via tiny-ton JIT kernels.

Done

Element-wise arithmetic: add, sub, mul, div (i32/f32/f16)
Math intrinsics: exp, log, sqrt, rsqrt, abs, max (f32/f16)
Masked load/store with program_id threading
NVIDIA GPU backend: MLIR → PTX via combined pass + libdevice
Google Colab CI: build + test on T4 GPU

Stage 1 — Standalone GPU kernels (one op at a time)

Each operation is a single kernel, tested independently against NumPy.

Stage 2 — Wire into microgpt

Replace microgpt's Python ops one by one with tiny-ton GPU kernels. Each op is still a separate launch — no fusion yet.

Replace softmax(), rmsnorm(), linear() with GPU kernels
Replace attention + MLP with composed GPU launches
Full forward pass end-to-end on GPU
Benchmark vs Python CPU baseline
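When swapping ops one at a time, a CPU reference for each op gives something to compare the GPU output against. The sketch below uses the standard textbook definitions of softmax, RMS normalization, and a bias-free linear layer in pure Python. It is not microgpt's code: the `eps` default, the absence of a learned gain in `rmsnorm`, and the `allclose` helper are assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def rmsnorm(xs, eps=1e-5):
    """Scale xs by the reciprocal root-mean-square (no learned gain; assumed)."""
    mean_sq = sum(x * x for x in xs) / len(xs)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [x * scale for x in xs]


def linear(xs, weights):
    """Bias-free linear layer: weights is a list of rows, one per output."""
    return [sum(w * x for w, x in zip(row, xs)) for row in weights]


def allclose(got, want, tol=1e-5):
    """Element-wise comparison helper for checking a kernel against a reference."""
    return len(got) == len(want) and all(
        math.isclose(g, w, rel_tol=tol, abs_tol=tol) for g, w in zip(got, want)
    )


probs = softmax([1.0, 2.0, 3.0])
print(allclose([sum(probs)], [1.0]))  # True
print(linear([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))  # [1.0, 2.0, 3.0]
```

These references rely only on `exp`, `max`, and a reciprocal square root, all of which appear in the list of intrinsics already marked as done.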

Stage 3 — Optimize for GPU

Reduce launch overhead, fuse kernels, and improve throughput.

Fused softmax (single kernel)
Fused rmsnorm (single kernel)
Tiled matmul with shared memory + barriers
Fused attention (Flash-attention style)
Fused MLP + transformer block
Automatic operator fusion pass in the compiler
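One way to see why softmax can be fused is the online-softmax recurrence, which computes the running maximum and the running sum of exponentials in a single sweep over the input. Flash-attention style kernels rely on the same idea. The sketch below is a pure-Python illustration of that recurrence, not the planned tiny-ton kernel; the function name `online_softmax` is hypothetical.

```python
import math


def online_softmax(xs):
    """Softmax whose max and normalizer are computed in one pass over xs."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the sum accumulated so far to the new maximum, then add
        # the contribution of the current element.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    # Second sweep only normalizes; no further reductions are needed.
    return [math.exp(x - running_max) / running_sum for x in xs]


probs = online_softmax([1.0, 2.0, 3.0])
print(abs(sum(probs) - 1.0) < 1e-9)  # True
```

A straightforward implementation needs one reduction for the max and another for the sum. Merging them halves the number of reduction passes over the data, which is the kind of saving a single fused kernel is after.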

License

MIT — see LICENSE.
