Skip to content

bauratynov/binaryai-releases

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 

Repository files navigation

BinaryAI Engine

CPU-optimized LLM inference runtime
Faster than llama.cpp. Written from scratch in C.

Download 12.6 tok/s 557 KB 0 deps


Benchmarks

All tests on 12-core/24-thread x86-64 CPU, DDR4-3200 dual-channel. GGUF Q4_K_M format.

Model Architecture BinaryAI llama.cpp Advantage
Qwen3-8B Qwen3, 36 layers 12.2 tok/s 10.7 tok/s +14%
Sherkala-8B Llama 3.1, 32 layers 12.6 tok/s 11.35 tok/s +11%
KazLLM-8B Llama 3.1, 32 layers 11.9 tok/s 10.73 tok/s +11%

Key Innovations

  • 4 ILP accumulators — Saturate AVX2 pipeline for Q4_K dot product
  • Fused Residual+RMSNorm — 50% fewer memory passes per layer
  • JND Perceptual Pruning — 20-40% FFN sparsity at zero quality loss (from EntropyX codec research)
  • 99.7% Delta Sparsity — Path to 100× LM head acceleration
  • 557 KB static binary — No Python, no CUDA, no dependencies

Supported Models

  • Qwen3 (ChatML)
  • Qwen2 (ChatML)
  • Llama 3.1 (Llama 3 template)
  • Sherkala-8B (Kazakh/English/Russian)
  • KazLLM-8B (Kazakh/English)
  • Any GGUF Q4_K_M model

Quick Start

# Download
wget https://github.com/bauratynov/binaryai-releases/releases/latest/download/binaryai-windows-amd64.exe

# Run chat
binaryai.exe -m model.gguf -p "Hello, world!"

# Run OpenAI-compatible server
binaryai.exe -m model.gguf --server --port 8080

Documentation

Download

Download Latest Release →


BinaryAI Engine — Built in Kazakhstan 🇰🇿
Combines research from BaiterekLLM, EntropyX, and BaiterekSkip

About

BinaryAI Engine — CPU-optimized LLM inference runtime. Faster than llama.cpp.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors