CPU-optimized LLM inference runtime
11-14% faster than llama.cpp on the 8B models benchmarked below. Written from scratch in C.
All benchmarks were run on a 12-core/24-thread x86-64 CPU with dual-channel DDR4-3200, using models in GGUF Q4_K_M format.
| Model | Architecture | BinaryAI | llama.cpp | Advantage |
|---|---|---|---|---|
| Qwen3-8B | Qwen3, 36 layers | 12.2 tok/s | 10.7 tok/s | +14% |
| Sherkala-8B | Llama 3.1, 32 layers | 12.6 tok/s | 11.35 tok/s | +11% |
| KazLLM-8B | Llama 3.1, 32 layers | 11.9 tok/s | 10.73 tok/s | +11% |
- 4 ILP accumulators — Saturate AVX2 pipeline for Q4_K dot product
- Fused Residual+RMSNorm — 50% fewer memory passes per layer
- JND Perceptual Pruning — 20-40% FFN sparsity at zero quality loss (from EntropyX codec research)
- 99.7% Delta Sparsity — Path to 100× LM head acceleration
- 557 KB static binary — No Python, no CUDA, no dependencies
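
To illustrate the accumulator idea from the list above, here is a minimal sketch of a dot product that keeps four independent partial sums. It is plain scalar C over `float` arrays, not the AVX2 Q4_K kernel, and `dot_4acc` is a hypothetical name rather than BinaryAI's actual code. The point is only that the four sums do not depend on each other, so the CPU can overlap their multiply-add latencies instead of waiting on a single chain.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only (not BinaryAI's kernel).
 * Computes sum(a[i] * b[i]) using four independent accumulators so that
 * consecutive multiply-adds do not form one long dependency chain. */
static float dot_4acc(const float *a, const float *b, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;

    /* Main loop: four elements per iteration, one per accumulator. */
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i + 0] * b[i + 0];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }

    /* Tail: up to three leftover elements go into the first accumulator. */
    for (; i < n; ++i) {
        acc0 += a[i] * b[i];
    }

    /* Combine the partial sums once, at the end. */
    return (acc0 + acc1) + (acc2 + acc3);
}
```

In a SIMD version each accumulator would presumably be a vector register instead of a scalar, but the structure (independent sums, one reduction at the end) stays the same. Note that splitting the sum changes the floating-point rounding order, so results can differ slightly from a single-accumulator loop.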
- Qwen3 (ChatML)
- Qwen2 (ChatML)
- Llama 3.1 (Llama 3 template)
- Sherkala-8B (Kazakh/English/Russian)
- KazLLM-8B (Kazakh/English)
- Any GGUF Q4_K_M model
# Download
wget https://github.com/bauratynov/binaryai-releases/releases/latest/download/binaryai-windows-amd64.exe
# Run chat
binaryai-windows-amd64.exe -m model.gguf -p "Hello, world!"
# Run OpenAI-compatible server
binaryai-windows-amd64.exe -m model.gguf --server --port 8080
BinaryAI Engine — Built in Kazakhstan 🇰🇿
Combines research from BaiterekLLM, EntropyX, and BaiterekSkip