eLLM runs LLM inference on CPUs faster than on GPUs
Updated Jun 11, 2026 - Rust
Efficient LLM inference on Slurm clusters.
Layered prefill changes the scheduling axis from tokens to layers and removes redundant MoE weight reloads while keeping decode stall-free. The result is lower time to first token (TTFT), lower end-to-end latency, and lower energy per token, without hurting time-between-tokens (TBT) stability.
🤖🗜️⚡️ Local LLM server for Apple Silicon. 5.4× faster end-to-end on long contexts vs Ollama, 33% less RAM, INT3 support for Qwen3. OpenAI + Ollama drop-in. Built for repeated long-context workloads on memory-constrained Macs.