chimera is a statically linked C++ inference multitool for local ggml-backed models.
It bundles llama.cpp, whisper.cpp, stable-diffusion.cpp, SQLite, and sqlite-vec into one native binary. The same process can run text generation, interactive chat with persistent history, speech-to-text, text-to-image, embeddings, a personal RAG / vector store, and an OpenAI-compatible HTTP server across modalities, all sharing one ggml backend set and one SQLite database.
The same build also produces libchimera.a, a redistributable static library for embedding chimera's engines and HTTP server inside another C++ process. See As a library and docs/dev/oop-layer.md for the embedder-facing API.
Download dependencies and build the executable and library.
make build # or just makeOptionally, install the built executable
make installUse it
chimera info
chimera gen -m models/model.gguf -p "Tell me one useful thing about local inference."Use --help on any subcommand to see its options. For copy-pasteable examples, see docs/cheatsheet.md.
chimera targets CLI-first users who run more than one ggml-backed modality (text + audio + image) and want them sharing one process, one ggml backend set, one SQLite database, and one OpenAI-compatible HTTP API, rather than running, configuring, and gluing together three separate servers. It is most useful when:
-
You want faithful upstream flag coverage (
gen,chat,embedexpose most llama.cpp sampler / RoPE / YaRN / multi-GPU / cache / adapter flags directly), not a curated subset. -
You distribute a single static binary across machines and don't want a Python or Node runtime on the target host.
-
You're building on top of the HTTP server and need text, audio, image, embeddings, RAG, and chat-history routes in one origin.
-
You're a C++ embedder who wants to drive llama.cpp / whisper.cpp / sd.cpp from your own process without reimplementing the load-and-run scaffolding. Linking
libchimera.a(and optionally#include "chimera.hpp"for the persistent-handle OOP layer) gives you the same model lifecycle, sampler wiring, and HTTP-server code paths thechimerabinary uses. -
You build against multiple ggml backends (CPU, CUDA, ROCm, SYCL, Vulkan, Metal) from the same source tree, and want to verify the linked backend with
chimera inforather than runtime probing.
chimera is not a GUI application. The optional embedded web UI (make build-with-webui) bakes in upstream llama.cpp's chat UI for the / endpoint, but there is no model browser, launcher, or settings panel. Users who want a packaged desktop experience built specifically around chimera should look at chimera-desktop; other point-and-click options that work against chimera's OpenAI-compatible API include Ollama, LM Studio, and Jan.
-
Three engines, one ggml. llama.cpp, whisper.cpp, and stable-diffusion.cpp all link against a single shared copy of ggml (
--sd-shared-ggmlin the deps build). One process can host an LLM, a Whisper model, and a Stable Diffusion pipeline without three separately-configured servers. -
Shared persistence. Chat history, embedding caches, and vector-store collections live in a single SQLite database (sqlite-vec for ANN search).
chat,serve,index, andsearchread and write the same file. -
OpenAI-compatible HTTP across modalities.
chimera serveexposes/v1/chat/completions,/v1/embeddings,/v1/audio/transcriptions,/v1/images/generations, plus chimera-specific RAG, KV-slot-snapshot, and chat-history routes. -
Upstream-pin discipline. Vendored llama.cpp / whisper.cpp / sd.cpp versions are pinned in
scripts/manage.py.make bump-checkdiffs the currently-vendored upstream-server headers against a target llama.cpp version, andchimera_pin_check.cppstatic-asserts on the struct fields chimera relies on — so a version bump that silently retypes or renames a depended-on field fails at compile time rather than at runtime. -
Backend matrix. CPU, CUDA, ROCm (HIP), SYCL, Vulkan, and Metal are first-class build targets in the Makefile.
chimera inforeports built / loaded / registered backends and enumerated devices. -
Flag-coverage audit.
docs/dev/cli-api-coverage.mdtracks which upstream flags chimera exposes and which are deliberately skipped, so the gap between chimera's CLI and upstream's is auditable rather than implicit. -
Library artifact. The same build produces
libchimera.a(chimera's own code),libchimera_thirdparty.a(the bundled C++ stack), andlibchimera_ggml.a(ggml core + per-backend archives, must be whole-archived). External C++ projects can use the procedural API or the optional header-only OOP layer. See As a library for details.
| Command | Purpose |
|---|---|
gen |
One-shot llama text generation (text + optional images via --mmproj) |
chat |
Interactive chat with persistent KV cache across turns; optional save-to-DB |
tokenize |
Print token ids (or id<TAB>piece) for a prompt |
embed |
Emit a single pooled embedding vector for a prompt |
whisper |
Transcribe a WAV file (streaming, segment-by-segment) |
sd |
Text-to-image / img2img / inpaint with stable-diffusion.cpp |
serve |
OpenAI-compatible HTTP server (text + audio + image + RAG) |
index |
Vector-store collections (create / ingest / list / stats / drop) |
search |
KNN search over a vector-store collection |
db |
Embedded SQLite management (status, backup, vacuum) |
info |
Print versions + ggml backends/devices + CPU features (useful for bug reports) |
A top-level -v,--verbose flag re-enables native backend logging (silenced by default).
See docs/cheatsheet.md for a one-page command reference, and docs/serve.md for the HTTP server.
make buildThis will:
-
Run
python scripts/manage.py build --all --deps-only --sd-shared-ggml, which clones and builds llama.cpp, whisper.cpp, and stable-diffusion.cpp intothirdparty/<project>/{include,lib}and vendors the SQLite + sqlite-vec amalgamations intothirdparty/{sqlite,sqlite-vec}/. The--sd-shared-ggmlflag is load-bearing: stable-diffusion.cpp normally vendors its own copy of ggml, which would collide with llama.cpp's at link time. Building all three projects against the single ggml set is what makes the static binary possible. -
Configure with
cmake -S . -B build -DSD_USE_VENDORED_GGML=OFF. -
Build the
chimeratarget.
Output: build/chimera.
Run make deps alone if you just want to (re)build the third-party libs, or make rebuild after touching only chimera source.
Experimental: make build-with-webui is identical to make build but flips -DCHIMERA_WEBUI_EMBED=ON, which embeds upstream llama.cpp's prebuilt web chat UI bundle (GET / + /bundle.{js,css}) into the chimera binary. Adds ~6 MB to the stripped binary (~7 MB unstripped). No Node toolchain required. The UI is pinned to whichever llama.cpp version chimera vendored. Disable at runtime with chimera serve --no-webui. See docs/dev/webui.md for the implementation notes.
OpenSSL is required at link time (cpp-httplib uses it for TLS support inside the bundled HTTP server). On macOS this also pulls in the system Security and CoreFoundation frameworks. Install OpenSSL via your package manager (brew install openssl@3 on macOS; apt install libssl-dev on Debian/Ubuntu) before running make build.
On Windows, use one of the following methods:
winget install -e --id ShiningLight.OpenSSL.Devvcpkg install openssl:x64-windowschoco install openssl -yscoop install openssl- Download directly from Shining Light OpenSSL.
make install # /usr/local/bin/chimera (may need sudo)
make install PREFIX=$HOME/.local # ~/.local/bin/chimera
make install DESTDIR=/tmp/stage PREFIX=/usr # for packagingmake uninstall removes the binary from the same $PREFIX/bin/.
make smoke # CLI plumbing only -- no model files needed
make test # smoke + end-to-end runs gated on models/ presencescripts/test.py skips end-to-end checks when the matching model file is absent (see the script for the lookup paths), so a fresh clone reports SKIP rather than FAIL. 62 tests total — typically 56 PASS + 6 SKIP-when-fixture-missing on a fresh checkout.
On macOS, Metal is enabled by default. For other backends, use the matching one-shot build target — each one rebuilds the third-party deps with the right GGML_<BACKEND>=1 and configures chimera with -DGGML_<BACKEND>=ON:
make build-cuda # NVIDIA CUDA
make build-rocm # AMD ROCm (HIP)
make build-sycl # Intel oneAPI / SYCL
make build-vulkan # Vulkan (cross-platform)The backend toolkit (CUDA Toolkit, ROCm, oneAPI, or Vulkan SDK) must already be installed on the host. Override architectures with CMAKE_CUDA_ARCHITECTURES (e.g. 89 for Ada/RTX 40xx) or CMAKE_HIP_ARCHITECTURES (e.g. gfx1100 for RDNA3) to avoid the slow default fat-build. CUDA perf knobs (GGML_CUDA_FORCE_MMQ, GGML_CUDA_FORCE_CUBLAS, GGML_CUDA_FA_ALL_QUANTS) and the ROCm GGML_HIP_ROCWMMA_FATTN flash-attention switch are picked up from the env. Verify with chimera info, which prints the linked backends.
On Windows (or any host without GNU make), drive the same build directly from manage.py — it runs the deps build and the cmake configure + build in one step:
python scripts/manage.py build_chimera --cuda # NVIDIA CUDA
python scripts/manage.py build_chimera --rocm # AMD ROCm (HIP)
python scripts/manage.py build_chimera --sycl # Intel oneAPI / SYCL
python scripts/manage.py build_chimera --vulkan # Vulkan (cross-platform)
python scripts/manage.py build_chimera # CPU-only (Metal on macOS)Pass --webui to embed the prebuilt web UI, --build-dir <dir> to retarget the CMake build dir, or --build-type Debug for an unoptimized build. The same env-var overrides (CMAKE_CUDA_ARCHITECTURES, CMAKE_HIP_ARCHITECTURES, CUDA perf knobs, GGML_HIP_ROCWMMA_FATTN) apply.
If you'd rather drive the two stages by hand (mixing backends, or staging a deps build separately), set GGML_<BACKEND>=1 on make deps and pass -DGGML_<BACKEND>=ON to cmake yourself.
chimera gen -m models/Qwen3-4B-Q8_0.gguf -p "Why did..."
chimera chat -m models/Qwen3-4B-Q8_0.gguf
chimera tokenize -m models/Qwen3-4B-Q8_0.gguf -p "hello world" --pieces
chimera embed -m models/bge-small-en-v1.5-q8_0.gguf -p "a quick brown fox"
chimera whisper -m models/ggml-base.en.bin -i audio.wav
chimera sd -m models/sd-v1-5.gguf -p "a cat" -o out.png
chimera serve -m models/Qwen3-4B-Q8_0.gguf # OpenAI-compatible HTTP server
chimera index create -n notes -e models/bge-small-en-v1.5-q8_0.gguf
chimera search -n notes -q "how does X work" -k 5
chimera db statusgen and tokenize/embed accept -f <file> instead of -p (use -f - for stdin); chat accepts --system-prompt-file <file>.
Use --help on any subcommand to see its options.
chimera serve exposes an OpenAI-compatible HTTP API. With no extra flags it serves the text-LLM endpoints; opt-in flags enable additional surfaces that load the corresponding model alongside the LLM in the same process:
chimera serve -m model.gguf # text-only
chimera serve -m model.gguf --embeddings # +/v1/embeddings (single-model embed mode)
chimera serve -m model.gguf --enable-embeddings embed.gguf # +/v1/embeddings (dedicated model)
chimera serve -m model.gguf --reranking rerank.gguf # +/v1/rerank
chimera serve -m model.gguf --enable-audio whisper.gguf # +/v1/audio/{transcriptions,translations}
chimera serve -m model.gguf --enable-image sd.gguf # +/v1/images/*
chimera serve -m model.gguf --enable-rag embed.gguf # +/v1/vector_stores/*
chimera serve -m model.gguf --persist-chats # save every chat to DBPoint any OpenAI client at it:
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-used")Supported endpoints, by default: /v1/chat/completions, /v1/completions, /v1/messages + /v1/messages/count_tokens (Anthropic compat), /v1/responses, /v1/models, /v1/embeddings, /infill, /tokenize, /detokenize, /apply-template, /health, /metrics, /props. Opt-in endpoints add /v1/audio/{transcriptions,translations}, /v1/images/{generations,edits,variations}, /v1/rerank, and /v1/vector_stores/*. See docs/serve.md for the full API and docs/dev/server.md for the implementation notes (what's bound, what's deliberately not, why).
A personal RAG index lives entirely in the local SQLite file (vendored SQLite + sqlite-vec). No external service required.
chimera index create -n notes -e models/bge-small-en-v1.5-q8_0.gguf
chimera index ingest -n notes -f path/to/doc.md
chimera index ingest -n notes -g 'docs/**/*.md'
chimera search -n notes -q "how does X work?" -k 5The same index is queryable over HTTP when chimera serve --enable-rag is running (POST /v1/vector_stores/:name/search). See docs/dev/sqlite.md for the schema, migration model, and the phased plan.
chimera chat --persist saves every turn to the SQLite DB; later you can list, search, or resume past chats from any chimera invocation:
chimera chat -m model.gguf --persist # save turns as they happen
chimera chat --resume 42 # resume saved chat #42
chimera chat --resume last # most recent
chimera chat --list # list saved chats (no model load)
chimera chat --search "secret password" # FTS5 over messages (no model load)The DB location defaults to $CHIMERA_DB then to the platform XDG path (~/Library/Application Support/chimera/chimera.db on macOS, ~/.local/share/chimera/chimera.db on Linux, %LOCALAPPDATA%\chimera\ on Windows). Override with --db <path> on any chat / index / search / db subcommand.
gen accepts one or more --image paths when paired with a vision projector (--mmproj). The image is encoded by the projector, threaded into the prompt at the default media marker, and the resulting chunks are evaluated into the llama context via mtmd_helper_eval_chunks.
chimera gen \
-m models/gemma-4-E4B-it-Q4_K_M.gguf \
--mmproj models/mmproj-gemma-4-E4B-it-BF16.gguf \
--image photo.png \
-p "Describe this image in one sentence." -n 64Notes:
-
The prompt is auto-wrapped in the model's chat template (VL models are almost always instruct-tuned and otherwise stall on turn 0).
-
If your prompt does not already contain the media marker (
<__media__>by default), one is prepended per--imageso images appear before the text. To interleave, place the marker yourself. -
--imagemay be repeated; each gets its own marker. -
The vision encoder runs on the default backend (Metal on macOS), independent of
--gpu-layers(which only controls LLM offload). -
Vision input is supported on
genonly; multi-turn vision inchatis not yet wired up (seeTODO.md).
sd accepts an --init-image for img2img and an optional --mask-image for inpainting. Both must match -W,-H (the SD pipeline does not resize internally).
# img2img: re-render an input image guided by a new prompt
chimera sd -m models/sd-v1-5.gguf \
--init-image input.png --strength 0.6 \
-p "the same scene but at night" -W 512 -H 512 -s 20 -o out.png
# inpaint: only repaint regions where the mask is non-zero
chimera sd -m models/sd-v1-5.gguf \
--init-image input.png --mask-image mask.png \
-p "a hat on the person's head" -W 512 -H 512 -s 20 -o out.png--strength ranges 0..1 (0 preserves the init image, 1 = full noise = text-to-image). The SD context is automatically built with vae_decode_only=false whenever --init-image is supplied.
llama.cpp is required (it is the LLM engine + slot scheduler + HTTP server context). whisper.cpp and stable-diffusion.cpp are modular: drop either to get a smaller binary, faster link, and zero dependency on that project's churn.
# default: link both modalities if the static libs are present, skip otherwise
cmake -S . -B build -DCHIMERA_WITH_WHISPER=AUTO -DCHIMERA_WITH_SD=AUTO
# require whisper.cpp (configure fails if libwhisper.a is missing)
cmake -S . -B build -DCHIMERA_WITH_WHISPER=ON
# text + RAG only — drop both audio and image (Apple-silicon Metal: ~34 MB -> ~12 MB)
cmake -S . -B build -DCHIMERA_WITH_WHISPER=OFF -DCHIMERA_WITH_SD=OFFWhat disappears when a modality is OFF:
| Off | Removed |
|---|---|
WHISPER=OFF |
chimera whisper subcommand; chimera serve --enable-audio; POST /v1/audio/{transcriptions,translations}; whisper.cpp link API |
SD=OFF |
chimera sd subcommand; chimera serve --enable-image; POST /v1/images/{generations,edits,variations}; stable-diffusion.cpp + stb_image_write link API |
gen --mmproj --image (LLM vision pipeline) is unaffected by either flag — it routes through libmtmd (llama.cpp's vision pipeline), not chimera_sd.
If linenoise is present under thirdparty/, interactive chat sessions get readline-style line editing, history (↑/↓, Ctrl-R), and basic editing keys. History persists at $CHIMERA_HISTORY (override) or $HOME/.chimera_chat_history. The integration is opt-out:
# probe automatically (default; links if liblinenoise.a is present)
cmake -S . -B build -DCHIMERA_LINENOISE=AUTO
# require linenoise (configure fails if missing)
cmake -S . -B build -DCHIMERA_LINENOISE=ON
# skip linenoise entirely (chat falls back to plain getline)
cmake -S . -B build -DCHIMERA_LINENOISE=OFFBuild the lib with python scripts/manage.py build -L. Piped / redirected stdin always falls back to getline, so scripts and the test suite are unaffected by this option.
| Code | Meaning |
|---|---|
| 0 | success |
| 1 | generic runtime error |
| 2 | bad input (missing / invalid file or argument) |
| 3 | model-load failure (model not found, mmproj incompatible, etc.) |
| 4 | generation / inference failure |
| >= 100 | CLI11 parse error (forwarded from CLI11's own exit codes) |
make build produces both the chimera executable and three static archives that an external C++ project can link as a library consumer:
| Archive | Role | Link mode |
|---|---|---|
build/libchimera.a |
chimera's own code (model lifecycle, sampler, generation, HTTP server, RAG, SQLite glue) | Normal link |
build/libchimera_thirdparty.a |
Bundled C++ stack (llama, mtmd, server-context, cpp-httplib, whisper, sd, vendored libwebp / linenoise) | Normal link |
build/libchimera_ggml.a |
ggml core + per-backend archives | Whole-archive required (otherwise GPU backends silently fail to register) |
Two consumption surfaces:
Procedural (chimera/*.h). Build a LlamaCommonOptions / WhisperOptions / SdOptions / ServeOptions and call command_prompt, command_embed, command_tokenize, command_whisper, command_sd, or command_serve. The lower-level helpers (load_llama_model, new_llama_context, run_generation, make_sampler, chimera_whisper::transcribe, chimera_sd::generate, ...) are exposed too for callers that want to drive the engines step-by-step.
OOP (header-only, #include "chimera.hpp"). Persistent-handle classes load the model in the constructor and reuse it across calls: chimera::Llama, chimera::Embedder, chimera::Tokenizer, chimera::Whisper, chimera::SD, plus the chimera::Server wrapper. Llama::generate accepts a streaming callback (std::function<void(std::string_view)>) for library consumers that do not want stdout streaming. The header is not compiled into the archives; consumers include it at their call site and pay no overhead if they do not use it.
Example:
#include "chimera.hpp"
int main() {
chimera::Llama llm("Qwen3-1.7B-Q4_0.gguf");
llm.options().n_predict = 64;
auto reply = llm.generate("What is the capital of France?");
// Or stream tokens through a callback:
llm.generate("And Spain?", [](std::string_view piece) {
std::cout << piece << std::flush;
});
}Python (bindings/). A nanobind wrapper over the OOP layer exposes the same classes as a chimera Python module. Build with make bindings (auto-provisions the toolchain via uv) or uv pip install ./bindings. See docs/bindings.md. This is chimera's own in-repo binding; the sibling cyllama and inferna projects bind upstream llama.cpp / whisper.cpp / sd.cpp directly via Cython and nanobind respectively.
import chimera
llm = chimera.Llama("Qwen3-1.7B-Q4_0.gguf")
print(llm.generate("What is the capital of France?"))tests/external/ is a standalone CMake project that links the three archives the way a non-CMake consumer would and exercises both C++ surfaces end to end. See it for the exact link recipe (-Wl,-force_load on macOS, --whole-archive group on Linux, /WHOLEARCHIVE: on Windows). Run with make test-external-smoke (uses CTest under the hood; make test-external-oop filters to the OOP-layer lane).
Reading order for embedders:
docs/dev/combine_archives.md— the three-archive link contract, why whole-archiving ggml is non-optional, validation plan.docs/dev/oop-layer.md—chimera.hppdesign (persistent-handle semantics, dirty-options policy, streaming hook, upstream-drift guards).docs/bindings.md— the Python bindings over the OOP layer: build/install, usage, and the GIL / exceptions / string-options notes.
If you want the same capabilities from Python instead of a native binary, see cyllama or inferna, chimera's sibling projects that expose llama.cpp, whisper.cpp, and stable-diffusion.cpp as Cython and nanobind bindings respectively.
chimera-desktop is a showcase desktop app that uses chimera from a Tauri shell with chimera-specific features such as a persisted-chat browser and sidecar-status bar.
chimera was originally developed inside the cyllama project, sharing its scripts/manage.py and thirdparty/ build infrastructure. It was extracted into this repository so that it could be developed independently with its own release cadence.
MIT. The vendored third-party libraries carry their own licenses (all MIT or permissive equivalents) — see their respective thirdparty/<project>/LICENSE files after running make deps.