Fast LLM speculative inference server for consumer hardware.
Updated Jun 12, 2026 - C++
An agent harness that compiles a model into one provably-correct, self-retargeting CUDA megakernel and self-tunes it past cuBLAS at batch-1 LLM decode.
Air.rs: 70B+ LLM inference on consumer GPUs, written in Rust.
A light, transparent, and modular inference & quantization engine for studying LLMs.