llm-infernece

Here are 5 public repositories matching this topic...

lucienhuangfu / eLLM

eLLM runs LLM inference on CPUs faster than on GPUs

inference transformer moe llama minimax cpu-inference qwen llm-infernece rust-llm

Updated Jun 11, 2026
Rust

VectorInstitute / vector-inference

Efficient LLM inference on Slurm clusters.

inference speech-to-text vlm text-embedding multimodal audio-transcription llm vllm reward-model llm-infernece sglang llm-infrastructure

Updated Jun 11, 2026
Python

pandada8 / llm-inference-benchmark

Performance benchmarks for LLM inference services

llm-infernece

Updated Dec 17, 2023
Jupyter Notebook

scale-snu / layered-prefill

Layered prefill changes the scheduling axis from tokens to layers and removes redundant MoE weight reloads while keeping decode stall-free. The result is lower TTFT, lower end-to-end latency, and lower energy per token without hurting TBT stability.

inference moe llm llm-serving vllm llm-infernece

Updated Mar 9, 2026
Python

konjoai / squish

🤖🗜️⚡️ Local LLM server for Apple Silicon. 5.4× faster end-to-end on long contexts vs Ollama, 33% less RAM, INT3 support for Qwen3. OpenAI + Ollama drop-in. Built for repeated long-context workloads on memory-constrained Macs.

Updated Jun 14, 2026
Python

