ik-llama-cpp-python

Python bindings for ik_llama.cpp — a high-performance fork of llama.cpp with faster CPU inference, novel quantization types (Trellis / IQK quants), and AVX-VNNI / AVX-512 optimizations.

Designed as a drop-in replacement for llama-cpp-python.

Installation

Pre-built wheels (CPU, with AVX2)

pip install ik-llama-cpp-python

Pre-built wheels (CUDA)

CUDA wheels are distributed via GitHub Releases (too large for PyPI). Install by downloading from the latest release:

# Replace the URL with the wheel matching your Python version
pip install https://github.com/gongpx20069/ik-llama-cpp-python/releases/download/v0.1.3/ik_llama_cpp_python_cuda-0.1.3-cp312-cp312-manylinux_2_28_x86_64.whl

Available for Python 3.10–3.13, Linux x86_64, CUDA 12.4.
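
As a rough aid for picking the right wheel, the sketch below builds a release URL from a Python version. The helper name `cuda_wheel_url` is hypothetical, and it assumes every CUDA wheel follows the same file-naming pattern as the v0.1.3 example above; check the Releases page for the files that actually exist.

```python
import sys

# Assumption: all CUDA wheels are named like the v0.1.3 example in this README.
BASE = "https://github.com/gongpx20069/ik-llama-cpp-python/releases/download"


def cuda_wheel_url(version="0.1.3", py=None):
    """Return the assumed release URL of the CUDA wheel for a Python version.

    `py` is a (major, minor) tuple; it defaults to the running interpreter.
    """
    major, minor = py if py is not None else sys.version_info[:2]
    # The README lists CUDA wheels for Python 3.10 through 3.13 only.
    if not (3, 10) <= (major, minor) <= (3, 13):
        raise ValueError(f"no CUDA wheel listed for Python {major}.{minor}")
    tag = f"cp{major}{minor}"
    name = f"ik_llama_cpp_python_cuda-{version}-{tag}-{tag}-manylinux_2_28_x86_64.whl"
    return f"{BASE}/v{version}/{name}"


print(cuda_wheel_url("0.1.3", (3, 12)))
```

The printed URL can be passed straight to `pip install`.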

From source (requires CMake ≥ 3.21 and a C++20 compiler)

git clone https://github.com/gongpx20069/ik-llama-cpp-python
cd ik-llama-cpp-python
git submodule update --init --recursive
pip install -e .

From source with CUDA

CMAKE_ARGS="-DGGML_CUDA=ON" pip install -e .

From source with native CPU optimizations

For maximum performance on your specific CPU (AVX-512, AVX-VNNI, etc.):

CMAKE_ARGS="-DGGML_NATIVE=ON" pip install -e .

Quick Start

from ik_llama_cpp import IkLlama

llm = IkLlama("model.gguf", n_ctx=4096)

# Simple chat
text = llm.chat("What is the theory of relativity?")
print(text)

API

create_chat_completion — OpenAI-compatible

Returns a dict matching the llama_cpp.Llama.create_chat_completion schema.

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=0.3,
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
print(response["usage"])
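
Because the return value is a plain dict, it can be post-processed without the library. The sketch below uses a hand-written sample response, not real model output; its field names follow the OpenAI chat-completion layout, and `summarize_completion` is a hypothetical helper.

```python
def summarize_completion(response):
    """Pull the reply text and token accounting out of a chat-completion dict."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "total_tokens": usage.get("total_tokens"),
    }


# Hand-written sample in the OpenAI-style shape (an assumption, not real output).
sample = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hi there!"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15},
}

print(summarize_completion(sample))
```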

chat — Convenience wrapper

text = llm.chat("Explain quantum mechanics in one sentence.")

generate — Low-level token generation

tokens = llm.tokenize("Hello, world!")
output_ids = llm.generate(tokens, max_tokens=128, temperature=0.7)
text = llm.detokenize(output_ids)

Drop-in replacement for llama-cpp-python

# Change this:
# from llama_cpp import Llama
# To this:
from ik_llama_cpp import IkLlama as Llama

llm = Llama("model.gguf", n_ctx=4096, flash_attn=True)
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
)
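
If a project has to run on machines where only one of the two packages is installed, the import swap can be made conditional. This is a minimal sketch; `first_available` and `BACKENDS` are hypothetical names, and only `ImportError` is handled, so a package that is installed but fails to load its shared library would still raise.

```python
import importlib


def first_available(candidates):
    """Return the first importable class from (module_name, attribute) pairs."""
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # package not installed, try the next candidate
        return getattr(module, attr)
    names = ", ".join(name for name, _ in candidates)
    raise ImportError(f"none of these packages is installed: {names}")


# Prefer ik_llama_cpp, fall back to upstream llama-cpp-python.
BACKENDS = [("ik_llama_cpp", "IkLlama"), ("llama_cpp", "Llama")]

# Uncomment once one of the packages is installed:
# Llama = first_available(BACKENDS)
# llm = Llama("model.gguf", n_ctx=4096)
```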

Quantization (IQ4_KT)

ik_llama.cpp provides novel Trellis quantization types (IQ1_KT–IQ4_KT) that are not available in upstream llama.cpp. This package includes llama-quantize and a Python API to create these quants from standard GGUF files.

Install with quantization support

# Quotes keep shells such as zsh from expanding the brackets
pip install "ik-llama-cpp-python[quantize]"

CLI: Download from HuggingFace and quantize

# Download bf16 source + imatrix, quantize to IQ4_KT in one step
ik-llama-quantize from-hf bartowski/google_gemma-4-E2B-it-GGUF

# Specify a different quant type
ik-llama-quantize from-hf bartowski/google_gemma-4-E2B-it-GGUF --type IQ3_KT

# Custom output directory
ik-llama-quantize from-hf bartowski/google_gemma-4-E2B-it-GGUF --output-dir models/

CLI: Quantize a local file

# With imatrix (recommended for IQ quants)
ik-llama-quantize quantize model-bf16.gguf model-IQ4_KT.gguf IQ4_KT \
    --imatrix model-imatrix.gguf

# Without imatrix
ik-llama-quantize quantize model-bf16.gguf model-IQ4_KT.gguf IQ4_KT

# Shorthand (without subcommand)
ik-llama-quantize model-bf16.gguf model-IQ4_KT.gguf IQ4_KT
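
When quantization is scripted, the same command lines can be assembled in Python and handed to `subprocess`. The sketch below only builds the argument list in the order shown above; `build_quantize_argv` is a hypothetical helper, and the actual `subprocess.run` call is left commented out because it needs the CLI and a model file.

```python
import shlex


def build_quantize_argv(src, dst, quant_type="IQ4_KT", imatrix=None):
    """Assemble an `ik-llama-quantize quantize` command as an argument list."""
    argv = ["ik-llama-quantize", "quantize", str(src), str(dst), quant_type]
    if imatrix is not None:
        # The imatrix is optional but recommended for IQ quants.
        argv += ["--imatrix", str(imatrix)]
    return argv


argv = build_quantize_argv(
    "model-bf16.gguf", "model-IQ4_KT.gguf", imatrix="model-imatrix.gguf"
)
print(shlex.join(argv))

# To actually run it (requires the CLI on PATH and the input files):
# import subprocess
# subprocess.run(argv, check=True)
```

Passing a list rather than a shell string avoids quoting problems with paths that contain spaces.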

Python API

from ik_llama_cpp import quantize, quantize_from_hf

# One-step: download from HuggingFace and quantize
path = quantize_from_hf("bartowski/google_gemma-4-E2B-it-GGUF", quant_type="IQ4_KT")

# Or quantize a local file
path = quantize("model-bf16.gguf", "model-IQ4_KT.gguf", "IQ4_KT",
                imatrix_path="model-imatrix.gguf")

Check if llama-quantize is available

ik-llama-quantize check

Constructor Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_path | str | required | Path to GGUF model file |
| n_ctx | int | 4096 | Context window size |
| n_threads | int | 0 | CPU threads (0 = auto) |
| use_mmap | bool | True | Memory-map model file |
| use_mlock | bool | False | Lock model in RAM |
| flash_attn | bool | True | Enable flash attention |
| n_gpu_layers | int | 0 | Number of layers to offload to GPU |
| verbose | bool | True | Logging verbosity |
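
One way to keep these settings in a single place is a small dataclass whose fields mirror the table. `IkLlamaSettings` is a hypothetical wrapper, not part of the package; the sketch assumes the constructor accepts every listed parameter, including `model_path`, as a keyword argument.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class IkLlamaSettings:
    """Constructor arguments with the defaults from the table above."""

    model_path: str
    n_ctx: int = 4096
    n_threads: int = 0  # 0 = auto
    use_mmap: bool = True
    use_mlock: bool = False
    flash_attn: bool = True
    n_gpu_layers: int = 0
    verbose: bool = True

    def __post_init__(self):
        # Basic sanity checks; the library may enforce different limits.
        if self.n_ctx <= 0:
            raise ValueError("n_ctx must be positive")
        if self.n_threads < 0:
            raise ValueError("n_threads must be 0 (auto) or positive")

    def to_kwargs(self):
        """Return the settings as a dict suitable for keyword expansion."""
        return asdict(self)


settings = IkLlamaSettings("model.gguf", n_ctx=8192, verbose=False)
print(settings.to_kwargs())

# With the package installed (assumes keyword arguments are accepted):
# from ik_llama_cpp import IkLlama
# llm = IkLlama(**settings.to_kwargs())
```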

Supported Platforms

| Platform | Wheels | Notes |
|---|---|---|
| Linux x86_64 | CPU (AVX2), CUDA 12.4 | CUDA wheels via GitHub Releases |
| Linux aarch64 | CPU | Python 3.10–3.13 |
| macOS arm64 | CPU + Metal | Python 3.10–3.13 |
| Windows x86_64 | CPU (AVX2) | Python 3.10–3.13 |
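
A quick way to tell whether a pre-built wheel is listed for the current machine is to look the platform up in a copy of the table. `wheel_for` is a hypothetical helper; the keys are the values that `platform.system()` and `platform.machine()` typically report, with Windows x86_64 appearing as `AMD64`.

```python
import platform

# Copy of the table above, keyed by (platform.system(), platform.machine()).
WHEELS = {
    ("Linux", "x86_64"): "CPU (AVX2), CUDA 12.4",
    ("Linux", "aarch64"): "CPU",
    ("Darwin", "arm64"): "CPU + Metal",
    ("Windows", "AMD64"): "CPU (AVX2)",
}


def wheel_for(system=None, machine=None):
    """Return the listed wheel flavor for a platform, or None if unlisted."""
    key = (system or platform.system(), machine or platform.machine())
    return WHEELS.get(key)


flavor = wheel_for()
print(flavor if flavor else "No pre-built wheel listed; build from source.")
```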

Environment Variables

| Variable | Description |
|---|---|
| IK_LLAMA_CPP_LIB_PATH | Override path to the compiled shared library |
| CMAKE_ARGS | Extra CMake flags for source builds |
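
The library override can also be set from Python. The sketch below assumes the variable is read when `ik_llama_cpp` is first imported, so it has to be set before that import; `use_custom_library` is a hypothetical helper, and whether the variable expects a file or a directory is not stated here, so the check only verifies that the path exists.

```python
import os
from pathlib import Path


def use_custom_library(lib_path):
    """Point IK_LLAMA_CPP_LIB_PATH at an existing path and return it resolved."""
    path = Path(lib_path).expanduser().resolve()
    if not path.exists():
        raise FileNotFoundError(f"shared library path not found: {path}")
    os.environ["IK_LLAMA_CPP_LIB_PATH"] = str(path)
    return path


# Call this before importing the package (placeholder path):
# use_custom_library("/opt/ik_llama/lib/libllama.so")
# from ik_llama_cpp import IkLlama
```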

Why ik_llama.cpp?

ik_llama.cpp is a llama.cpp fork focused on performance and quantization research. Key advantages:

  • Faster CPU inference: improved prompt processing across all quantization types, better Flash Attention token generation
  • Novel quantization types: Trellis quants (IQ1_KT–IQ4_KT), IQK quants (IQ2_K–IQ6_K), row-interleaved R4 variants, MXFP4
  • Better KV cache: Q8_KV / Q4_0 KV-cache quantization with Hadamard transforms
  • DeepSeek optimizations: FlashMLA (v1–v3), fused MoE operations, Smart Expert Reduction
  • Hardware support: optimized kernels for AVX2, AVX-512, AVX-VNNI, ARM NEON, CUDA (Turing+)
  • Broad model support: LLaMA-3/4, Qwen3, DeepSeek-V3, Gemma3/4, Mistral, and many more

Architecture

ik_llama_cpp/
  __init__.py        # Public API: IkLlama, quantize, quantize_from_hf
  _lib_loader.py     # Finds and loads the shared library (.dll/.so/.dylib)
  _ctypes_api.py     # Low-level ctypes bindings to llama.h C API
  _internals.py      # RAII wrappers: IkModel, IkContext
  llama.py           # High-level IkLlama class
  quantize.py        # Quantization CLI and API (wraps llama-quantize)
  lib/               # Compiled shared libraries (installed by CMake)
  bin/               # llama-quantize binary (installed by CMake)

Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request on GitHub. Whether it's bug reports, feature requests, documentation improvements, or code contributions — all are appreciated.

License

MIT
