ik-llama-cpp-python

Python bindings for ik_llama.cpp — a high-performance fork of llama.cpp with faster CPU inference, novel quantization types (Trellis / IQK quants), and AVX-VNNI / AVX-512 optimizations.

Designed as a drop-in replacement for llama-cpp-python.

Installation

Pre-built wheels (CPU, with AVX2)

pip install ik-llama-cpp-python

Pre-built wheels (CUDA)

CUDA wheels are too large for PyPI, so they are distributed via GitHub Releases. Install one by passing its URL from the latest release to pip:

# Replace the URL with the wheel matching your Python version
pip install https://github.com/gongpx20069/ik-llama-cpp-python/releases/download/v0.1.3/ik_llama_cpp_python_cuda-0.1.3-cp312-cp312-manylinux_2_28_x86_64.whl

Available for Python 3.10–3.13, Linux x86_64, CUDA 12.4.
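
The wheel filename encodes the CPython version twice (`cp312-cp312`). As a minimal sketch, assuming the other wheels in a release follow the same naming pattern as the cp312 example above (check the release page to confirm), the URL for a given interpreter could be built as follows. The helper name `cuda_wheel_url` is hypothetical and not part of the package.

```python
BASE = "https://github.com/gongpx20069/ik-llama-cpp-python/releases/download"


def cuda_wheel_url(py, version="0.1.3"):
    """Build the CUDA wheel URL for a (major, minor) Python version.

    Assumption: every wheel is named like the cp312 example in this README.
    """
    major, minor = py
    # The README lists CUDA wheels for Python 3.10 through 3.13 only.
    if not (3, 10) <= (major, minor) <= (3, 13):
        raise ValueError(f"no CUDA wheel listed for Python {major}.{minor}")
    tag = f"cp{major}{minor}"
    filename = (
        f"ik_llama_cpp_python_cuda-{version}-{tag}-{tag}"
        "-manylinux_2_28_x86_64.whl"
    )
    return f"{BASE}/v{version}/{filename}"


print(cuda_wheel_url((3, 12)))
```

Passing `sys.version_info[:2]` as `py` selects the wheel for the running interpreter.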

From source (requires CMake ≥ 3.21 and a C++20 compiler)

git clone https://github.com/gongpx20069/ik-llama-cpp-python
cd ik-llama-cpp-python
git submodule update --init --recursive
pip install -e .

From source with CUDA

CMAKE_ARGS="-DGGML_CUDA=ON" pip install -e .

From source with native CPU optimizations

For maximum performance on your specific CPU (AVX-512, AVX-VNNI, etc.):

CMAKE_ARGS="-DGGML_NATIVE=ON" pip install -e .

Quick Start

from ik_llama_cpp import IkLlama

llm = IkLlama("model.gguf", n_ctx=4096)

# Simple chat
text = llm.chat("What is the theory of relativity?")
print(text)

API

`create_chat_completion` — OpenAI-compatible

Returns a dict matching the llama_cpp.Llama.create_chat_completion schema.

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=0.3,
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
print(response["usage"])
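
Because the return value follows the OpenAI chat-completion shape, a multi-turn conversation only needs the assistant reply appended to `messages` before the next call. The following is a minimal sketch that relies only on the `choices[0]["message"]["content"]` path used above. `ask` and `FakeLlm` are hypothetical names; the stub stands in for a loaded `IkLlama` so the snippet runs without a model file.

```python
def ask(llm, history, user_text, **kwargs):
    """Send one user turn, record the reply in history, and return it."""
    history.append({"role": "user", "content": user_text})
    response = llm.create_chat_completion(messages=history, **kwargs)
    reply = response["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply


class FakeLlm:
    """Stand-in that echoes the last user message in the documented dict shape."""

    def create_chat_completion(self, messages, **kwargs):
        content = "echo: " + messages[-1]["content"]
        return {"choices": [{"message": {"role": "assistant", "content": content}}]}


history = [{"role": "system", "content": "You are a helpful assistant."}]
print(ask(FakeLlm(), history, "Hello!", temperature=0.3, max_tokens=256))
```

With a real model, pass the `IkLlama` instance in place of `FakeLlm()`; `history` then carries the full conversation into each call.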

`chat` — Convenience wrapper

text = llm.chat("Explain quantum mechanics in one sentence.")

`generate` — Low-level token generation

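# Encode the prompt into token IDs
tokens = llm.tokenize("Hello, world!")
# Generate new token IDs, capped by max_tokens
output_ids = llm.generate(tokens, max_tokens=128, temperature=0.7)
# Decode the generated IDs
text = llm.detokenize(output_ids)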

Drop-in replacement for llama-cpp-python

# Change this:
# from llama_cpp import Llama
# To this:
from ik_llama_cpp import IkLlama as Llama

llm = Llama("model.gguf", n_ctx=4096, flash_attn=True)
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
)
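
For code that has to run whether or not this package is installed, the import swap can be made at runtime. This is a sketch, not something the package provides: `load_llama_class` is a hypothetical helper that returns the first class it can import, trying `ik_llama_cpp.IkLlama` before `llama_cpp.Llama`.

```python
import importlib
import importlib.util


def load_llama_class(candidates=(("ik_llama_cpp", "IkLlama"), ("llama_cpp", "Llama"))):
    """Return the first importable class from (module, class) candidates."""
    for module_name, class_name in candidates:
        if importlib.util.find_spec(module_name) is None:
            continue  # package not installed, try the next candidate
        module = importlib.import_module(module_name)
        return getattr(module, class_name)
    tried = ", ".join(name for name, _ in candidates)
    raise ImportError(f"none of these packages is installed: {tried}")
```

Usage would look like `Llama = load_llama_class()` followed by the same `Llama("model.gguf", n_ctx=4096)` call as above. Keyword arguments that exist in only one of the two libraries still need to be handled by the caller.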

Quantization (IQ4_KT)

ik_llama.cpp provides novel Trellis quantization types (IQ1_KT–IQ4_KT) that are not available in upstream llama.cpp. This package includes llama-quantize and a Python API to create these quants from standard GGUF files.

Install with quantization support

pip install "ik-llama-cpp-python[quantize]"

CLI: Download from HuggingFace and quantize

# Download bf16 source + imatrix, quantize to IQ4_KT in one step
ik-llama-quantize from-hf bartowski/google_gemma-4-E2B-it-GGUF

# Specify a different quant type
ik-llama-quantize from-hf bartowski/google_gemma-4-E2B-it-GGUF --type IQ3_KT

# Custom output directory
ik-llama-quantize from-hf bartowski/google_gemma-4-E2B-it-GGUF --output-dir models/

CLI: Quantize a local file

# With imatrix (recommended for IQ quants)
ik-llama-quantize quantize model-bf16.gguf model-IQ4_KT.gguf IQ4_KT \
    --imatrix model-imatrix.gguf

# Without imatrix
ik-llama-quantize quantize model-bf16.gguf model-IQ4_KT.gguf IQ4_KT

# Shorthand (without subcommand)
ik-llama-quantize model-bf16.gguf model-IQ4_KT.gguf IQ4_KT

Python API

from ik_llama_cpp import quantize, quantize_from_hf

# One-step: download from HuggingFace and quantize
path = quantize_from_hf("bartowski/google_gemma-4-E2B-it-GGUF", quant_type="IQ4_KT")

# Or quantize a local file
path = quantize("model-bf16.gguf", "model-IQ4_KT.gguf", "IQ4_KT",
                imatrix_path="model-imatrix.gguf")
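
To produce several quant types from one source file, the `quantize` call can be wrapped in a loop. The sketch below assumes only the call shape shown above (three positional arguments plus an optional `imatrix_path` keyword). `output_path` and `quantize_many` are hypothetical helpers, and the `model-<TYPE>.gguf` naming is an assumption taken from the examples. A recorder function replaces `quantize` so the snippet runs without a model.

```python
from pathlib import Path


def output_path(src, quant_type):
    """Derive 'model-IQ4_KT.gguf' from 'model-bf16.gguf' (naming is an assumption)."""
    src = Path(src)
    stem = src.stem
    # Drop the trailing '-bf16' style suffix if the stem has one.
    base = stem.rsplit("-", 1)[0] if "-" in stem else stem
    return src.with_name(f"{base}-{quant_type}.gguf")


def quantize_many(quantize_fn, src, quant_types, imatrix_path=None):
    """Call quantize_fn once per type and return the output paths."""
    # Only forward imatrix_path when one was supplied.
    extra = {} if imatrix_path is None else {"imatrix_path": str(imatrix_path)}
    outputs = []
    for quant_type in quant_types:
        dst = output_path(src, quant_type)
        quantize_fn(str(src), str(dst), quant_type, **extra)
        outputs.append(dst)
    return outputs


# Dry run: record the calls instead of invoking ik_llama_cpp.quantize.
calls = []
quantize_many(lambda *args, **kwargs: calls.append((args, kwargs)),
              "model-bf16.gguf", ["IQ3_KT", "IQ4_KT"],
              imatrix_path="model-imatrix.gguf")
for args, kwargs in calls:
    print(args, kwargs)
```

Passing the imported `quantize` function as `quantize_fn` would run the real conversions.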

Check if llama-quantize is available

ik-llama-quantize check

Constructor Parameters

Parameter	Type	Default	Description
`model_path`	`str`	required	Path to GGUF model file
`n_ctx`	`int`	`4096`	Context window size
`n_threads`	`int`	`0`	CPU threads (0 = auto)
`use_mmap`	`bool`	`True`	Memory-map model file
`use_mlock`	`bool`	`False`	Lock model in RAM
`flash_attn`	`bool`	`True`	Enable flash attention
`n_gpu_layers`	`int`	`0`	Number of layers to offload to GPU
`verbose`	`bool`	`True`	Logging verbosity
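
Since loading a large model is slow, it can help to catch a misspelled parameter before the constructor runs. The sketch below mirrors the defaults from the table in a plain dict and rejects unknown names. `DEFAULTS` and `make_kwargs` are hypothetical and not part of the package.

```python
# Defaults copied from the table above.
DEFAULTS = {
    "n_ctx": 4096,
    "n_threads": 0,       # 0 = auto
    "use_mmap": True,
    "use_mlock": False,
    "flash_attn": True,
    "n_gpu_layers": 0,
    "verbose": True,
}


def make_kwargs(**overrides):
    """Merge overrides into the documented defaults, rejecting unknown names."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise TypeError(f"unknown constructor parameter(s): {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


kwargs = make_kwargs(n_threads=8, n_gpu_layers=20, verbose=False)
print(kwargs)
# With the package installed: llm = IkLlama("model.gguf", **kwargs)
```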

Supported Platforms

Platform	Wheels	Notes
Linux x86_64	CPU (AVX2), CUDA 12.4	CUDA wheels via GitHub Releases
Linux aarch64	CPU	Python 3.10–3.13
macOS arm64	CPU + Metal	Python 3.10–3.13
Windows x86_64	CPU (AVX2)	Python 3.10–3.13

Environment Variables

Variable	Description
`IK_LLAMA_CPP_LIB_PATH`	Override path to the compiled shared library
`CMAKE_ARGS`	Extra CMake flags for source builds
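
`IK_LLAMA_CPP_LIB_PATH` can also be set from Python. The sketch below assumes the variable is read when the package loads its shared library, so it must be set before `ik_llama_cpp` is imported. Whether the value should name the library file or its directory is not stated here, so the helper only checks that the path exists. `set_lib_override` is a hypothetical name.

```python
import os
from pathlib import Path


def set_lib_override(lib_path):
    """Point IK_LLAMA_CPP_LIB_PATH at an existing path and return it."""
    path = Path(lib_path).expanduser().resolve()
    if not path.exists():
        raise FileNotFoundError(f"no such library path: {path}")
    os.environ["IK_LLAMA_CPP_LIB_PATH"] = str(path)
    return str(path)


# Call this before importing the package, for example:
# set_lib_override("~/builds/ik_llama/libllama.so")  # hypothetical path
# from ik_llama_cpp import IkLlama
```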

Why ik_llama.cpp?

ik_llama.cpp is a llama.cpp fork focused on performance and quantization research. Key advantages:

Faster CPU inference — improved prompt processing across all quantization types, better Flash Attention token generation
Novel quantization types — Trellis quants (IQ1_KT–IQ4_KT), IQK quants (IQ2_K–IQ6_K), row-interleaved R4 variants, MXFP4
Better KV cache — Q8_KV / Q4_0 KV-cache quantization with Hadamard transforms
DeepSeek optimizations — FlashMLA (v1–v3), fused MoE operations, Smart Expert Reduction
Hardware support — optimized kernels for AVX2, AVX-512, AVX-VNNI, ARM NEON, CUDA (Turing+)
Broad model support — LLaMA-3/4, Qwen3, DeepSeek-V3, Gemma3/4, Mistral, and many more

Architecture

ik_llama_cpp/
  __init__.py        # Public API: IkLlama, quantize, quantize_from_hf
  _lib_loader.py     # Finds and loads the shared library (.dll/.so/.dylib)
  _ctypes_api.py     # Low-level ctypes bindings to llama.h C API
  _internals.py      # RAII wrappers: IkModel, IkContext
  llama.py           # High-level IkLlama class
  quantize.py        # Quantization CLI and API (wraps llama-quantize)
  lib/               # Compiled shared libraries (installed by CMake)
  bin/               # llama-quantize binary (installed by CMake)
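
As a rough illustration of what a loader like `_lib_loader.py` has to do, the sketch below picks a platform-specific file name and prefers the `IK_LLAMA_CPP_LIB_PATH` override over the bundled `lib/` directory. This is not the package's actual code: the library stem `llama`, the lookup order, and both function names are assumptions.

```python
import os
import sys
from pathlib import Path


def shared_lib_name(stem="llama", platform=None):
    """Return the platform-specific file name for a shared library stem."""
    platform = sys.platform if platform is None else platform
    if platform.startswith("win"):
        return f"{stem}.dll"
    if platform == "darwin":
        return f"lib{stem}.dylib"
    return f"lib{stem}.so"


def find_library(package_dir, stem="llama", env=None, platform=None):
    """Prefer the environment override, else look in <package_dir>/lib."""
    env = os.environ if env is None else env
    override = env.get("IK_LLAMA_CPP_LIB_PATH")
    if override:
        return Path(override)
    return Path(package_dir) / "lib" / shared_lib_name(stem, platform)


print(find_library("ik_llama_cpp", env={}, platform="linux"))
```

A real loader would then hand the resulting path to `ctypes.CDLL`.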

Contributing

Contributions are welcome: bug reports, feature requests, documentation improvements, and code. Open an issue or submit a pull request on GitHub.

License

MIT
