Python bindings for ik_llama.cpp — a high-performance fork of llama.cpp with faster CPU inference, novel quantization types (Trellis / IQK quants), and AVX-VNNI / AVX-512 optimizations.
Designed as a drop-in replacement for llama-cpp-python.
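
Because the package is intended as a drop-in replacement, a script that has to run with either binding can choose one at import time. The following is a minimal sketch: `resolve_class` and `PREFERRED` are hypothetical names, not part of ik-llama-cpp-python, and the sketch assumes only the import paths `ik_llama_cpp.IkLlama` and `llama_cpp.Llama`.

```python
import importlib


def resolve_class(candidates):
    """Return the first importable class from an ordered candidate list.

    `candidates` is a list of (module_name, attribute_name) tuples.
    Illustrative helper only; it is not provided by ik-llama-cpp-python.
    """
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # this binding is not installed; try the next one
        if hasattr(module, attr):
            return getattr(module, attr)
    raise ImportError(f"none of {candidates!r} could be imported")


# Prefer ik_llama_cpp and fall back to upstream llama-cpp-python.
PREFERRED = [("ik_llama_cpp", "IkLlama"), ("llama_cpp", "Llama")]

# Usage (requires one of the two packages and a model file):
#   Llama = resolve_class(PREFERRED)
#   llm = Llama("model.gguf", n_ctx=4096)
```

The ordered list keeps the preference explicit, so swapping the order is enough to compare both bindings on the same script.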
pip install ik-llama-cpp-python

CUDA wheels are distributed via GitHub Releases (too large for PyPI). Install by downloading from the latest release:
# Replace the URL with the wheel matching your Python version
pip install https://github.com/gongpx20069/ik-llama-cpp-python/releases/download/v0.1.3/ik_llama_cpp_python_cuda-0.1.3-cp312-cp312-manylinux_2_28_x86_64.whl

Available for Python 3.10–3.13, Linux x86_64, CUDA 12.4.
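
To pick the wheel that matches the running interpreter, the URL can be assembled from the Python version. This is a sketch under one loud assumption: that every CUDA wheel follows the same file-name pattern as the v0.1.3 / cp312 example above. Check the release page before relying on it. `cuda_wheel_url` is a hypothetical helper.

```python
import sys

RELEASE_BASE = "https://github.com/gongpx20069/ik-llama-cpp-python/releases/download"
SUPPORTED_MINORS = range(10, 14)  # Python 3.10 through 3.13


def cuda_wheel_url(version, python=None):
    """Build a CUDA wheel URL, assuming the naming pattern of the v0.1.3 example."""
    major, minor = python if python is not None else sys.version_info[:2]
    if major != 3 or minor not in SUPPORTED_MINORS:
        raise ValueError(f"no CUDA wheel listed for Python {major}.{minor}")
    tag = f"cp{major}{minor}"
    filename = (
        f"ik_llama_cpp_python_cuda-{version}-{tag}-{tag}-manylinux_2_28_x86_64.whl"
    )
    return f"{RELEASE_BASE}/v{version}/{filename}"


# Reproduces the URL shown above for Python 3.12.
print(cuda_wheel_url("0.1.3", (3, 12)))
```

Passing the printed URL to `pip install` is equivalent to the manual command above.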
From source (requires CMake ≥ 3.21 and a C++20 compiler)
git clone https://github.com/gongpx20069/ik-llama-cpp-python
cd ik-llama-cpp-python
git submodule update --init --recursive
pip install -e .

From source with CUDA
CMAKE_ARGS="-DGGML_CUDA=ON" pip install -e .

From source with native CPU optimizations
For maximum performance on your specific CPU (AVX-512, AVX-VNNI, etc.):
CMAKE_ARGS="-DGGML_NATIVE=ON" pip install -e .

from ik_llama_cpp import IkLlama
llm = IkLlama("model.gguf", n_ctx=4096)
# Simple chat
text = llm.chat("What is the theory of relativity?")
print(text)

create_chat_completion returns a dict matching the llama_cpp.Llama.create_chat_completion schema:

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=0.3,
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
print(response["usage"])

# Simple chat with a single prompt
text = llm.chat("Explain quantum mechanics in one sentence.")

# Low-level API: tokenize, generate, detokenize
tokens = llm.tokenize("Hello, world!")
output_ids = llm.generate(tokens, max_tokens=128, temperature=0.7)
text = llm.detokenize(output_ids)

# Change this:
# from llama_cpp import Llama
# To this:
from ik_llama_cpp import IkLlama as Llama
llm = Llama("model.gguf", n_ctx=4096, flash_attn=True)
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
)

ik_llama.cpp provides novel Trellis quantization types (IQ1_KT–IQ4_KT) that are not available in upstream llama.cpp. This package includes llama-quantize and a Python API to create these quants from standard GGUF files.
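
When quantization is driven from a build script rather than a shell, the CLI form documented in this README (`ik-llama-quantize quantize <input> <output> <type> [--imatrix <file>]`) can be assembled as an argument list. This is a minimal sketch: `quantize_argv` is a hypothetical helper that only builds the command and does not check that the quant type is supported.

```python
import shlex


def quantize_argv(src, dst, quant_type, imatrix=None):
    """Build the argument list for `ik-llama-quantize quantize`.

    Illustrative helper; it mirrors the CLI examples in this README
    and performs no validation of the quant type.
    """
    argv = ["ik-llama-quantize", "quantize", src, dst, quant_type]
    if imatrix is not None:
        argv += ["--imatrix", imatrix]
    return argv


argv = quantize_argv(
    "model-bf16.gguf", "model-IQ4_KT.gguf", "IQ4_KT",
    imatrix="model-imatrix.gguf",
)
print(shlex.join(argv))  # shell-quoted form, handy for logging

# To actually run it (requires the package's quantize extra):
#   import subprocess
#   subprocess.run(argv, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when model paths contain spaces.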
pip install "ik-llama-cpp-python[quantize]"

# Download bf16 source + imatrix, quantize to IQ4_KT in one step
ik-llama-quantize from-hf bartowski/google_gemma-4-E2B-it-GGUF
# Specify a different quant type
ik-llama-quantize from-hf bartowski/google_gemma-4-E2B-it-GGUF --type IQ3_KT
# Custom output directory
ik-llama-quantize from-hf bartowski/google_gemma-4-E2B-it-GGUF --output-dir models/

# With imatrix (recommended for IQ quants)
ik-llama-quantize quantize model-bf16.gguf model-IQ4_KT.gguf IQ4_KT \
    --imatrix model-imatrix.gguf
# Without imatrix
ik-llama-quantize quantize model-bf16.gguf model-IQ4_KT.gguf IQ4_KT
# Shorthand (without subcommand)
ik-llama-quantize model-bf16.gguf model-IQ4_KT.gguf IQ4_KT

from ik_llama_cpp import quantize, quantize_from_hf
# One-step: download from HuggingFace and quantize
path = quantize_from_hf("bartowski/google_gemma-4-E2B-it-GGUF", quant_type="IQ4_KT")
# Or quantize a local file
path = quantize("model-bf16.gguf", "model-IQ4_KT.gguf", "IQ4_KT",
                imatrix_path="model-imatrix.gguf")

ik-llama-quantize check

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_path | str | required | Path to GGUF model file |
| n_ctx | int | 4096 | Context window size |
| n_threads | int | 0 | CPU threads (0 = auto) |
| use_mmap | bool | True | Memory-map model file |
| use_mlock | bool | False | Lock model in RAM |
| flash_attn | bool | True | Enable flash attention |
| n_gpu_layers | int | 0 | Number of layers to offload to GPU |
| verbose | bool | True | Logging verbosity |
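
The defaults in the parameter table can be kept in one place and overridden per deployment. The sketch below uses a hypothetical `IkLlamaConfig` dataclass (not part of the package) and assumes that `IkLlama` accepts the listed parameters as keyword arguments, as the `n_ctx=` and `flash_attn=` examples in this README suggest.

```python
from dataclasses import asdict, dataclass


@dataclass
class IkLlamaConfig:
    """Hypothetical container mirroring the documented constructor defaults."""

    model_path: str
    n_ctx: int = 4096
    n_threads: int = 0        # 0 = auto
    use_mmap: bool = True
    use_mlock: bool = False
    flash_attn: bool = True
    n_gpu_layers: int = 0
    verbose: bool = True

    def to_kwargs(self):
        """Return every field except model_path as keyword arguments."""
        kwargs = asdict(self)
        kwargs.pop("model_path")
        return kwargs


cfg = IkLlamaConfig("model.gguf", n_gpu_layers=20, verbose=False)

# Usage (requires the package and a model file):
#   from ik_llama_cpp import IkLlama
#   llm = IkLlama(cfg.model_path, **cfg.to_kwargs())
```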
| Platform | Wheels | Notes |
|---|---|---|
| Linux x86_64 | CPU (AVX2), CUDA 12.4 | CUDA wheels via GitHub Releases |
| Linux aarch64 | CPU | Python 3.10–3.13 |
| macOS arm64 | CPU + Metal | Python 3.10–3.13 |
| Windows x86_64 | CPU (AVX2) | Python 3.10–3.13 |
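
A script can look up which prebuilt variant applies to the current machine before deciding whether a source build is needed. This sketch encodes the platform table as a dictionary; the keys are an assumption about how the table rows map to the values returned by `platform.system()` and `platform.machine()`, and `wheel_variant` is a hypothetical helper.

```python
import platform

# (platform.system(), platform.machine()) -> wheel variants from the table.
# The key spelling (e.g. "AMD64" for Windows x86_64) is an assumed mapping.
WHEELS = {
    ("Linux", "x86_64"): "CPU (AVX2), CUDA 12.4",
    ("Linux", "aarch64"): "CPU",
    ("Darwin", "arm64"): "CPU + Metal",
    ("Windows", "AMD64"): "CPU (AVX2)",
}


def wheel_variant(system=None, machine=None):
    """Return the listed wheel variants for a platform, or None if unlisted."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return WHEELS.get((system, machine))


variant = wheel_variant()
if variant is None:
    print("No prebuilt wheel listed for this platform; build from source.")
else:
    print(f"Prebuilt wheels listed: {variant}")
```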
| Variable | Description |
|---|---|
| IK_LLAMA_CPP_LIB_PATH | Override path to the compiled shared library |
| CMAKE_ARGS | Extra CMake flags for source builds |
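
The library override can also be set from Python. This sketch rests on an assumption the table does not confirm: that `IK_LLAMA_CPP_LIB_PATH` is read when `ik_llama_cpp` is imported, so it must be set before the import. `set_lib_path` is a hypothetical helper and the example path is a placeholder.

```python
import os
from pathlib import Path


def set_lib_path(path):
    """Point IK_LLAMA_CPP_LIB_PATH at an existing path and return it resolved.

    Assumes the variable is consulted at import time of ik_llama_cpp.
    """
    lib = Path(path).expanduser().resolve()
    if not lib.exists():
        raise FileNotFoundError(f"shared library path not found: {lib}")
    os.environ["IK_LLAMA_CPP_LIB_PATH"] = str(lib)
    return lib


# Usage (placeholder path; set the variable before importing the package):
#   set_lib_path("~/builds/ik_llama/libllama.so")
#   from ik_llama_cpp import IkLlama
```

Failing early on a missing path gives a clearer error than a loader failure during import.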
ik_llama.cpp is a llama.cpp fork focused on performance and quantization research. Key advantages:
- Faster CPU inference: improved prompt processing across all quantization types, better Flash Attention token generation
- Novel quantization types: Trellis quants (IQ1_KT–IQ4_KT), IQK quants (IQ2_K–IQ6_K), row-interleaved R4 variants, MXFP4
- Better KV cache: Q8_KV / Q4_0 KV-cache quantization with Hadamard transforms
- DeepSeek optimizations: FlashMLA (v1–v3), fused MoE operations, Smart Expert Reduction
- Hardware support: optimized kernels for AVX2, AVX-512, AVX-VNNI, ARM NEON, CUDA (Turing+)
- Broad model support: LLaMA-3/4, Qwen3, DeepSeek-V3, Gemma3/4, Mistral, and many more
ik_llama_cpp/
    __init__.py       # Public API: IkLlama, quantize, quantize_from_hf
    _lib_loader.py    # Finds and loads the shared library (.dll/.so/.dylib)
    _ctypes_api.py    # Low-level ctypes bindings to llama.h C API
    _internals.py     # RAII wrappers: IkModel, IkContext
    llama.py          # High-level IkLlama class
    quantize.py       # Quantization CLI and API (wraps llama-quantize)
    lib/              # Compiled shared libraries (installed by CMake)
    bin/              # llama-quantize binary (installed by CMake)
Contributions are welcome! Feel free to open an issue or submit a pull request on GitHub. Whether it's bug reports, feature requests, documentation improvements, or code contributions — all are appreciated.
MIT