chimera serve is chimera's OpenAI-compatible HTTP server. It is not a
greenfield server; it is a thin chimera-side translation layer over
llama.cpp's server-context STATIC library (the engine that powers
llama-server), plus opt-in handlers that bind chimera's own whisper.cpp
and stable-diffusion.cpp wrappers to the OpenAI audio and image routes.
This document is the maintainer's view: what is wired up, what was deliberately left out, the seams that bit during implementation, and the work that is still on the table.

                                      ┌───────────────────────────────────────┐
       HTTP request                   │ command_serve() in chimera_serve.cpp  │
    ─────────────────────────────────▶│                                       │
                                      │ server_http_context (cpp-httplib)     │◀── /health, /v1/models
                                      │         │                             │    /v1/chat/completions
                                      │         │                             │    /v1/completions
                                      │         │                             │    /v1/embeddings
                                      │         ▼                             │
                                      │ server_routes (pre-built lambdas)     │
                                      │         │                             │
                                      │         ▼                             │
                                      │ server_context (the engine)           │
                                      │   ├─ slot scheduler                   │
                                      │   ├─ chat-template handling           │
                                      │   ├─ mtmd integration (vision/audio)  │
                                      │   ├─ streaming SSE                    │
                                      │   ├─ tool-call parsing                │
                                      │   └─ KV cache + sampler               │
                                      │                                       │
                                      │ chimera-owned handlers (optional)     │
                                      │   ├─ /v1/audio/transcriptions ───▶ chimera_whisper::transcribe
                                      │   └─ /v1/images/* ───▶ chimera_sd::generate
                                      └───────────────────────────────────────┘

Three concrete consequences of this shape:

- The LLM endpoints are not chimera code. We bind handlers that already exist on server_routes. Streaming, slot scheduling, tool calls, chat-template plumbing, KV reuse across requests: all come from libserver-context.a.
- Audio and image endpoints are chimera code, running in the same process. They are registered on the same server_http_context via ctx_http.post(path, handler), the documented extension point that llama-server itself uses for its CORS proxy, GCP/Vertex compat, and built-in tools.
- One process, three model backends. The LLM, whisper, and SD models coexist if all three flags are passed. cpp-httplib serves requests on its own thread pool; the LLM engine runs on the main thread via ctx_server.start_loop(); audio and image work runs synchronously inside whichever httplib worker thread caught the request.

The LlamaCppBuilder was already cloning llama.cpp and producing
libllama.a, libllama-common.a, libmtmd.a plus headers. Phase 1
added the server pieces:

    # extra_libs reported as available + cmake_build_targets entries
    extra_libs = ["llama", "llama-common", "mtmd",
                  "server-context", "cpp-httplib"]

    # CMake configure step
    LLAMA_BUILD_SERVER = True    # required so tools/server/CMakeLists.txt
                                 # is included and the server-context target
                                 # is defined
    LLAMA_BUILD_WEBUI = False    # skip baking the ~11 MB SvelteKit bundle
                                 # into the static lib

    # Targets built (note: llama-server *executable* is intentionally absent;
    # we only need the static lib and cpp-httplib)
    cmake_build_targets(targets=["llama", "llama-common", "mtmd",
                                 "server-context", "cpp-httplib"], ...)

    # Artifacts copied to thirdparty/llama.cpp/
    copy_lib("tools/server", "server-context", lib_dir)
    copy_lib("vendor/cpp-httplib", "cpp-httplib", lib_dir)
    glob_copy("tools/server", include_dir, patterns=["server-*.h"])
    glob_copy("vendor/cpp-httplib", include_dir/"cpp-httplib",
              patterns=["httplib.h"])   # subdir matters; server-http.cpp
                                        # includes it as <cpp-httplib/httplib.h>

    # src-aux/: auxiliary sources we ship as code rather than as a library
    src_aux = prefix/"src-aux"
    glob_copy("tools/server", src_aux, patterns=["server-http.cpp"])

The src-aux/ directory is a new concept and exists because server-http
is not a separate library upstream — its .cpp is on llama-server's
own source list, not libserver-context.a's. Rather than build our own
library out of one file, chimera's CMakeLists.txt pulls
thirdparty/llama.cpp/src-aux/server-http.cpp into the chimera
executable's target_sources. The compilation cost is the same; the
build graph is one library short.

    static_lib(LIB_SERVER_CONTEXT "${LLAMACPP_LIB}" "server-context")
    static_lib(LIB_CPP_HTTPLIB "${LLAMACPP_LIB}" "cpp-httplib")
    set(LLAMACPP_SRC_AUX "${LLAMACPP_DIR}/src-aux")

    # OpenSSL is now required: cpp-httplib was built with LLAMA_OPENSSL=ON,
    # so libcpp-httplib.a references X509_*, SSL_*, EVP_*, etc.
    find_package(OpenSSL REQUIRED)
    list(APPEND SYSTEM_LIBS OpenSSL::SSL OpenSSL::Crypto)

    # macOS-only: cpp-httplib reads trust anchors from the system keychain
    # via SecCertificateCopyData / SecTrustCopyAnchorCertificates.
    if(APPLE)
        list(APPEND SYSTEM_LIBS "-framework Security" "-framework CoreFoundation")
    endif()

src/chimera/CMakeLists.txt adds chimera_serve.cpp to the target's
source list, pulls in server-http.cpp from src-aux/, and links
LIB_SERVER_CONTEXT + LIB_CPP_HTTPLIB on every platform.
libserver-context.a (specifically server-task.cpp's
to_json_oaicompat_chat_stream) and libllama-common.a (download.cpp,
hf-cache.cpp) both reference llama_build_info(). Upstream generates
that string from common/build-info.cpp.in at configure time. chimera's
shim now returns the literal "chimera"; that string surfaces as the
system_fingerprint field in OpenAI responses.
| File | Responsibility |
|---|---|
| src/chimera/chimera_serve.cpp | command_serve(), route registration, audio + image handlers, JSON-field coercion, base64, signal handling. |
| src/chimera/chimera_whisper.h | Public API: load_model, load_wav_bytes, load_wav_file, resample_linear, transcribe, format_timestamp_10ms. |
| src/chimera/chimera_whisper.cpp | Implementation. command_whisper (CLI) and the HTTP handler both call into this. |
| src/chimera/chimera_sd.h | Public API: load_model, decode_image_bytes, decode_image_file, encode_png, save_png_file, generate. PixelImage / GenerateRequest types. |
| src/chimera/chimera_sd.cpp | Implementation. command_sd (CLI) and the three image HTTP handlers both call into this. |
| thirdparty/llama.cpp/src-aux/server-http.cpp | Vendored from upstream by manage.py; compiled directly into the chimera target. Implements server_http_context over cpp-httplib. |

The whisper and SD modules grew dedicated headers as part of phases 2
and 3 specifically so the CLI subcommands and the HTTP handlers could
share one implementation. Before the refactor, the relevant helpers
lived in anonymous namespaces inside the .cpp files.
| Route | Source | Notes |
|---|---|---|
| GET /health, GET /v1/health | routes.get_health | Liveness probe; pre-built lambda from server_routes. |
| GET /v1/models | routes.get_models | Returns { "data": [...], "object": "list" } plus ollama-compat fields. |
| GET /metrics | routes.get_metrics | Prometheus-style telemetry. params.endpoint_metrics is forced to true in build_common_params, so this works regardless of CLI flags. |
| GET /props | routes.get_props | Read-only introspection: current chat template, mmproj capabilities, generation defaults. |
| POST /chat/completions, POST /v1/chat/completions | routes.post_chat_completions | Streaming + non-streaming SSE. Tool calls, mtmd inputs, reasoning_content all work. The unprefixed legacy path is bound for older OpenAI clients. |
| POST /v1/completions | routes.post_completions_oai | Legacy text completion. |
| POST /v1/embeddings | routes.post_embeddings_oai | Only returns success when the model was loaded with --embeddings. Without it, upstream's handler returns HTTP 501 with the right message. |
| POST /v1/messages, POST /v1/messages/count_tokens | routes.post_anthropic_messages + post_anthropic_count_tokens | Anthropic Messages API compat: lets Anthropic SDK / claude-code-shaped clients point at chimera serve. |
| POST /infill | routes.post_infill | Fill-in-the-middle for code models. Returns 501 on models without FIM tokens, which is the right behavior. |
| POST /tokenize, POST /detokenize | routes.post_tokenize + post_detokenize | Vocab helpers; useful for clients that don't bundle a tokenizer (e.g. token counting before sending). |
| POST /apply-template | routes.post_apply_template | Renders the chat template against a messages[] array without generating. Pure debugging value. |
| POST /v1/responses | routes.post_responses_oai | OpenAI Responses API. Stateful within a single chimera serve invocation: server-context holds the conversation thread state in-process; lost on restart. With --persist-chats the underlying chat-completions traffic is still saved to the chats table. |
| GET /slots, POST /slots/:id_slot | routes.get_slots + routes.post_slots | Per-slot status (always works); save/restore/erase actions on POST. Save/restore additionally require --slot-save-path; without it the upstream handler returns HTTP 501. |
| GET /lora-adapters, POST /lora-adapters | routes.get_lora_adapters + routes.post_lora_adapters | Lists adapters loaded via --lora and lets the client re-weight by index. Empty list when no --lora was supplied. |
| GET /v1/chimera/info | make_chimera_info_handler (chimera-owned, in chimera_serve_meta.cpp) | JSON form of chimera info. Versions, built/loaded backends, devices, GPU/mmap/mlock/RPC flags, whisper/sd linkage + CPU features, SQLite versions, build flags. Devices via ggml_backend_dev_get(i); CPU-feature strings parsed inline (independent of the CLI's parse_sys_info to keep this TU CLI-free). |
| GET /v1/chimera/db | make_chimera_db_handler (chimera-owned, in chimera_serve_meta.cpp) | JSON form of chimera db status. Path, size, schema version, table list, per-table COUNT(*). Table names validated against [A-Za-z0-9_] before splicing into SQL (defense in depth: list_tables already returns names from sqlite_master). DB path precedence: --chat-db → --rag-db → default. |
| POST /v1/chimera/shutdown | make_chimera_shutdown_handler (chimera-owned, in chimera_serve_meta.cpp) | Graceful exit. Returns 202 immediately; detached thread sleeps 150 ms then invokes the same teardown as the SIGINT handler (emb_ctx->terminate(), rrk_ctx->terminate(), ctx_server.terminate()). 150 ms is the race window between response queue and listener close. The closure is passed in by command_serve, so the handler doesn't reach into globals. |

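The table-name check noted for GET /v1/chimera/db can be illustrated with a small sketch. The helper names below (is_safe_table_name, count_query_for) are hypothetical and chosen for this example only; the actual code in chimera_serve_meta.cpp may be structured differently.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (name is an assumption, not chimera's actual API):
// accept only names made of [A-Za-z0-9_], so the name can be spliced into
// a COUNT(*) query without quoting concerns.
bool is_safe_table_name(const std::string &name) {
    if (name.empty()) return false;
    for (unsigned char c : name) {
        const bool ok = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                        (c >= '0' && c <= '9') || c == '_';
        if (!ok) return false;
    }
    return true;
}

// Builds the per-table count query, or returns an empty string when the
// name fails validation (a caller would skip that table).
std::string count_query_for(const std::string &table) {
    if (!is_safe_table_name(table)) return "";
    return "SELECT COUNT(*) FROM " + table;
}
```

Rejecting the name outright, rather than trying to quote it, keeps the check trivially auditable, which suits a defense-in-depth measure.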
| Route | Source | Notes |
|---|---|---|
| POST /v1/rerank | routes.post_rerank from a second server_routes instance attached to the rerank model | Cross-encoder document reranking. Body: {"query": "...", "documents": [...], "top_n": N}. The rerank model loads alongside the main LLM; both stay in process. |

| Route | Handler | Notes |
|---|---|---|
| POST /v1/audio/transcriptions | make_audio_transcribe_handler | chimera-owned. Routes through chimera_whisper::transcribe. |
| POST /v1/audio/translations | make_audio_transcribe_handler (same factory, translate=true) | Whisper's built-in translate mode: source language → English. |
| POST /v1/audio/detect-language | make_audio_detect_language_handler | Chimera-specific exit-after-detect probe. Sets treq.detect_language=true; whisper short-circuits before any decode pass. Response shape: {"language": "<code>", "duration": <seconds>} (no text, no segments). Separate endpoint (not a query parameter on /transcriptions) so the response contract stays unambiguous. |

Supported response_format values: json (default → {"text":"..."}),
text (raw text/plain), verbose_json (full structure with
task / language / duration / segments[]), srt, vtt.
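
The srt and vtt formats differ mainly in how segment timestamps are written. The following is a minimal sketch, assuming timestamps arrive in whisper.cpp's 10 ms ticks; the function name and signature are illustrative and are not necessarily those of format_timestamp_10ms in chimera_whisper.h.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Illustrative formatter (assumed signature). t is in 10 ms ticks, the unit
// whisper.cpp uses for segment start/end times.
// SRT writes a comma before the milliseconds; WebVTT writes a period.
std::string format_ts(int64_t t, bool srt) {
    if (t < 0) t = 0;
    int64_t ms = t * 10;
    const int64_t hh = ms / 3600000; ms -= hh * 3600000;
    const int64_t mm = ms / 60000;   ms -= mm * 60000;
    const int64_t ss = ms / 1000;    ms -= ss * 1000;
    char buf[40];
    std::snprintf(buf, sizeof(buf), "%02lld:%02lld:%02lld%c%03lld",
                  (long long) hh, (long long) mm, (long long) ss,
                  srt ? ',' : '.', (long long) ms);
    return buf;
}
```

For example, a segment starting at tick 150 would be written as 00:00:01,500 in srt output and 00:00:01.500 in vtt output.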
Ignored request fields (accepted but no effect): model, temperature,
timestamp_granularities[].
Why we do NOT bind routes.post_transcriptions_oai: that upstream
handler feeds audio through mtmd's audio mmproj (LLM-with-audio-tokens),
which is a fundamentally different pipeline from dedicated ASR. Both
pipelines have a place, and the loaded model determines which one applies.
We expose the ASR pipeline; the mtmd path is reachable via
/v1/chat/completions with an audio mmproj-aware model.
| Route | Handler | Notes |
|---|---|---|
| POST /v1/images/generations | make_image_generations_handler | application/json body. txt2img. |
| POST /v1/images/edits | make_image_edits_handler | multipart (image + optional mask + form fields). img2img / inpaint. |
| POST /v1/images/variations | make_image_variations_handler | multipart (image). img2img with no prompt. |
| GET /v1/images/lora-adapters | inline lambda in command_serve | Lists registered SD LoRA aliases (names only). Returns [] when no --sd-lora is set. |

The three POST handlers share a per-request opt-in surface for
ControlNet / PhotoMaker / LoRA, gated on server-init flags. Each
opt-in feature follows the same shape: a maybe_attach_*() helper
returns either nullptr (success / feature not requested) or an
HTTP error response naming the missing flag.
| Server-init flag(s) | Per-request field(s) | Helper | Notes |
|---|---|---|---|
| --sd-control-net <path> | control_image (multipart), control_strength (JSON) | maybe_attach_control() | 400 + named-flag hint when the field is supplied but the server has no ControlNet loaded. |
| --sd-photo-maker <path> (+ optional --sd-pm-id-dir, --sd-pm-id-embed-path) | pm_id_images (JSON base64 array), pm_id_image_set (named subdir), pm_style_strength | maybe_attach_pm() | pm_id_images wins if both shapes supplied. --sd-pm-id-dir is scanned eagerly at startup; empty subdirs are a BadInput config error, not a deferred 400. |
| --sd-lora <name>=<path> (repeatable) | loras: [{"name", "scale"}, ...] | maybe_attach_loras() | Closed-set by design: request bodies cannot supply raw paths. SD reloads adapter tensors per-generate, so the alias map is pure metadata; no files open at server start. Unknown alias name returns 400 listing the known names. |

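The shared maybe_attach_*() shape (nullptr on success or when the feature is not requested, an error response otherwise) can be sketched for the LoRA case. Everything below is a stand-in written for this illustration: the types, the function name maybe_attach_loras_sketch, and the error wording are assumptions, not chimera's real definitions.

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>
#include <vector>

// Stand-in types for illustration; not chimera's real request/response types.
struct HttpError {
    int status;
    std::string message;
};

struct LoraRef {
    std::string name;
    float scale;
};

struct ImageRequest {
    std::vector<LoraRef> loras;  // filled only when the request is accepted
};

// Returns nullptr on success or when the feature was not requested;
// otherwise returns an error naming the missing flag or the unknown alias.
std::unique_ptr<HttpError> maybe_attach_loras_sketch(
        const std::vector<LoraRef> &requested,
        const std::set<std::string> &known_aliases,  // names from --sd-lora
        ImageRequest &out) {
    if (requested.empty()) return nullptr;  // feature not requested
    if (known_aliases.empty()) {
        return std::make_unique<HttpError>(HttpError{
            400, "loras supplied but the server was started without --sd-lora"});
    }
    for (const auto &l : requested) {
        if (known_aliases.count(l.name) == 0) {
            std::string known;
            for (const auto &k : known_aliases) {
                if (!known.empty()) known += ", ";
                known += k;
            }
            return std::make_unique<HttpError>(HttpError{
                400, "unknown LoRA alias '" + l.name + "'; known: " + known});
        }
    }
    out.loras = requested;
    return nullptr;
}
```

Because the helper only ever looks names up in the alias set, a request body has no way to smuggle in a filesystem path, which is the closed-set property described in the table.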
--persist-chats doesn't add new routes; it wraps the upstream
chat-completions handler. make_persisting_chat_handler decorates the
routes.post_chat_completions lambda so each successful exchange is
saved to the SQLite chats + messages tables.
Mechanics:
- The wrapper captures a ChatPersistContext * (per-server state with the DB path + a write mutex) and a copy of the inner handler.
- On each request it parses the request body for the messages array + system prompt + model alias.
- For non-streaming responses, after the inner handler returns it parses choices[0].message.{content,reasoning_content} + usage and saves.
- For streaming responses, it replaces res->next with a wrapper that mirrors every chunk into a shared_ptr<std::string> buffer while still returning it to the client. When next returns false (stream finished) the buffered SSE is parsed event-by-event, delta.content is concatenated, and the result is saved.
- Persistence errors are caught + logged to stderr; they never affect the user's HTTP response.

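The streaming case above can be sketched with standard-library types only. The callback shape (NextFn) is an assumption made for this example and may not match the real res->next signature in server-http.h; JSON parsing of delta.content is left out, so the sketch stops at extracting the raw data: payloads.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Assumed callback shape: fills `chunk` and returns true while more data
// follows, false once the stream is finished.
using NextFn = std::function<bool(std::string &chunk)>;

// Wraps `inner` so each chunk is appended to a shared buffer before being
// handed back to the client. `on_done` fires once, when the stream ends.
NextFn make_mirroring_next(NextFn inner,
                           std::function<void(const std::string &)> on_done) {
    auto buf = std::make_shared<std::string>();
    auto fired = std::make_shared<bool>(false);
    return [inner, on_done, buf, fired](std::string &chunk) {
        const bool more = inner(chunk);
        buf->append(chunk);
        if (!more && !*fired) {
            *fired = true;
            try {
                on_done(*buf);  // persistence must never break the response
            } catch (...) {
                // a real handler would log to stderr here
            }
        }
        return more;
    };
}

// Splits buffered SSE text into the payloads of its "data: " lines,
// skipping the terminal "[DONE]" marker.
std::vector<std::string> sse_data_payloads(const std::string &sse) {
    std::vector<std::string> out;
    size_t pos = 0;
    while (pos < sse.size()) {
        size_t eol = sse.find('\n', pos);
        if (eol == std::string::npos) eol = sse.size();
        std::string line = sse.substr(pos, eol - pos);
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.rfind("data: ", 0) == 0) {
            std::string payload = line.substr(6);
            if (payload != "[DONE]") out.push_back(payload);
        }
        pos = eol + 1;
    }
    return out;
}
```

Holding the buffer in a shared_ptr lets the wrapper outlive the handler invocation that created it, since the HTTP layer keeps calling next after the handler has returned.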
One chat row per request — the OpenAI API has no chat-id concept, so
multi-turn clients (which resend the full conversation each request)
produce N rows with overlapping content. The X-Chimera-Chat-Id
request/response header (shipped) consolidates a multi-turn exchange
into a single row when the client threads it through; see
chimera_serve.cpp around the X-Chimera-Chat-Id block.
| Route | Handler | Notes |
|---|---|---|
| GET /v1/vector_stores | make_vs_list_handler | List collections. |
| POST /v1/vector_stores | make_vs_create_handler | Create. Body: {"name": "..."}. |
| GET /v1/vector_stores/:name | make_vs_get_handler | Stats. |
| POST /v1/vector_stores/:name/delete | make_vs_delete_handler | Drop. POST because server_http_context only wraps GET/POST. |
| POST /v1/vector_stores/:name/files | make_vs_ingest_handler | Chunk + embed + insert. multipart file upload or JSON {"text": ..., "source_uri": ...}. |
| POST /v1/vector_stores/:name/search | make_vs_search_handler | KNN. JSON {"query": ..., "k": ...}. |

Shared per-server state is a RagContext struct (the loaded
Embedder, a std::mutex serializing embed calls, the db path, and
the embedding model name). All six handlers capture a pointer to it.
SQLite connections are opened per request (cheap in WAL mode); no
pool. Errors that would normally be 404 ("no such collection")
return 400 instead so the upstream set_error_handler 404-body
override (server-http.cpp:140) doesn't swallow the message.
Supported response_format: b64_json only (the default). url returns
HTTP 400 with an explicit "no static-file backend" message.
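
Since b64_json is the only image response format, the encoded image bytes travel inside the JSON body as base64. The following is a minimal standard base64 (RFC 4648) encoder sketch; the function names and the simplified response wrapper are illustrative and are not taken from chimera_serve.cpp.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Standard base64 (RFC 4648) with '=' padding. Name is illustrative.
std::string base64_encode_sketch(const std::vector<uint8_t> &in) {
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve(((in.size() + 2) / 3) * 4);
    size_t i = 0;
    // Full 3-byte groups map to 4 output characters.
    for (; i + 2 < in.size(); i += 3) {
        const uint32_t v = (uint32_t(in[i]) << 16) |
                           (uint32_t(in[i + 1]) << 8) | uint32_t(in[i + 2]);
        out += tbl[(v >> 18) & 63];
        out += tbl[(v >> 12) & 63];
        out += tbl[(v >> 6) & 63];
        out += tbl[v & 63];
    }
    // A trailing group of 1 or 2 bytes is padded with '='.
    const size_t rem = in.size() - i;
    if (rem == 1) {
        const uint32_t v = uint32_t(in[i]) << 16;
        out += tbl[(v >> 18) & 63];
        out += tbl[(v >> 12) & 63];
        out += "==";
    } else if (rem == 2) {
        const uint32_t v = (uint32_t(in[i]) << 16) | (uint32_t(in[i + 1]) << 8);
        out += tbl[(v >> 18) & 63];
        out += tbl[(v >> 12) & 63];
        out += tbl[(v >> 6) & 63];
        out += '=';
    }
    return out;
}

// Simplified response body (the real one may carry more fields).
std::string b64_json_body(const std::vector<uint8_t> &png) {
    return "{\"data\":[{\"b64_json\":\"" + base64_encode_sketch(png) + "\"}]}";
}
```

Base64 output uses only characters that need no JSON escaping, so the encoded string can be placed between quotes directly.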
Supported SD-specific JSON fields (alongside OpenAI's prompt, n,
size, response_format): negative_prompt, steps, cfg_scale,
seed, sample_method, scheduler, strength.
Ignored: model, user, quality, style.
There is no server-context handler for image generation upstream; this pipeline is entirely chimera's.
Every one of these is a one-line ctx_http.post(...) away. The omission
is a scope choice, not a capability gap:

- POST /completion, POST /completions: legacy (non-/v1) llama.cpp completion shape, different from /v1/completions. Practically nobody calls it in 2025.
- POST /embedding, POST /embeddings: non-/v1 embeddings variants; redundant with /v1/embeddings.
- POST /props: mutating server props at runtime conflicts with chimera serve's "CLI is the config" stance. Read (GET /props) is bound; write is not.

(GET /slots + POST /slots/:id_slot for KV-cache snapshots,
GET /lora-adapters + POST /lora-adapters for LoRA hot-swap, and
POST /v1/rerank for cross-encoder reranking all shipped — see § 4.1
"Always exposed" and the --reranking flag in docs/serve.md.)
Server-mode features not wired up:
- Router / multi-model mode (is_router_server branch in llama-server's server.cpp). chimera serve loads one LLM. See docs/dev/server-router-mode.md for the decision record.
- Built-in tool plugins (--server-tools). EXPERIMENTAL upstream.
- MCP CORS proxy (--webui-mcp-proxy). EXPERIMENTAL upstream.
- GCP / Vertex AI compat (ctx_http.register_gcp_compat()).
- Embedded Web chat UI (Variant B, chimera-specific). The upstream-style embed (Variant A, CHIMERA_WEBUI_EMBED=ON) did ship (see § 7 of docs/dev/webui.md); the entry here refers only to the abandoned Variant B prototype.
- Child-server / parent-process sleeping notifications.
- SSL / TLS direct serving. Run behind a reverse proxy (nginx, caddy) for HTTPS.


    main thread ────────────────────── ctx_server.start_loop() [blocks until shutdown]

    cpp-httplib worker thread #1 ──── handler(req) ──┐
    cpp-httplib worker thread #2 ──── handler(req) ──┤
    cpp-httplib worker thread #N ──── handler(req) ──┘
                                                     │
                             ┌───────────────────────┴───────────────────┐
                             │                                           │
                             ▼                                           ▼
            LLM routes: enqueue server_task,            chimera-owned routes:
            wait on response_reader.                    hold per-modality mutex,
            server_context's slot scheduler             call into chimera_whisper
            drives concurrency on the LLM side.         / chimera_sd synchronously.

Why one mutex per modality:

- whisper_full mutates the context's internal state; concurrent calls on the same context corrupt that state.
- generate_image is not thread-safe on a shared sd_ctx_t (both the diffusion graph and the VAE allocator are owned by the context).

This serializes audio and image requests across the server. For the LLM
side, server_context's slot scheduler handles real parallelism via
KV-cache slots; we don't add a mutex there.
If audio or image throughput becomes a bottleneck the right answer is not a finer-grained lock — it's holding multiple contexts (one per worker thread). Both whisper.cpp and stable-diffusion.cpp can be loaded multiple times; the cost is GPU memory.
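
The one-mutex-per-modality arrangement can be sketched as follows. ModalityContext, run_serialized, and simulate_workers are illustrative names, and the counters exist only to make the serialization observable; in chimera the guarded object would be the whisper or SD context.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in for a loaded model context plus the lock that guards it.
struct ModalityContext {
    std::mutex mu;
    int in_flight = 0;      // calls currently inside the critical section
    int max_in_flight = 0;  // high-water mark, for demonstration only
    int completed = 0;
};

// Runs `work` while holding the modality's mutex, so at most one request
// touches the shared context at a time.
template <typename Fn>
void run_serialized(ModalityContext &ctx, Fn &&work) {
    std::lock_guard<std::mutex> lock(ctx.mu);
    ++ctx.in_flight;
    if (ctx.in_flight > ctx.max_in_flight) ctx.max_in_flight = ctx.in_flight;
    work();
    --ctx.in_flight;
    ++ctx.completed;
}

// Simulates several HTTP worker threads hitting the same modality at once.
void simulate_workers(ModalityContext &ctx, int n_workers) {
    std::vector<std::thread> workers;
    for (int i = 0; i < n_workers; ++i) {
        workers.emplace_back([&ctx] {
            run_serialized(ctx, [] { std::this_thread::yield(); });
        });
    }
    for (auto &t : workers) t.join();
}
```

With one ModalityContext per loaded model, audio and image requests do not block each other; only requests for the same modality queue up behind the lock.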
server-http.cpp:485 collects multipart form text fields into a JSON
object where every value is field.content — always a string. For
application/json bodies, nlohmann/json preserves the original numeric
types. So the same field-reading code can't use fields.value<int>("n", 1)
across both — it works for /v1/images/generations (JSON body) and
throws type_error.302 for /v1/images/edits (multipart).
Fix: coerce_int, coerce_int64, coerce_float, coerce_string in
chimera_serve.cpp. Each accepts the JSON-native type or a string
that parses to that type. All numeric reads in the image handlers go
through these. Worth re-using when adding new routes that may be reached
by either body type.
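
The coercion idea can be sketched without a JSON library by modeling a field as a variant. This is a simplified stand-in: the real helpers operate on nlohmann/json values, and the name coerce_int_sketch, the Field type, and the choice to signal failure with std::nullopt are assumptions made for this example.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <map>
#include <optional>
#include <string>
#include <variant>

// Stand-in for a JSON field value. JSON bodies yield native numbers;
// multipart form fields always arrive as strings.
using Field = std::variant<long long, double, std::string>;
using Fields = std::map<std::string, Field>;

// Returns the field as an int when it is a JSON integer or a string that
// parses fully as one. Returns `fallback` when the key is absent, and
// std::nullopt when the value is present but not an integer.
std::optional<int> coerce_int_sketch(const Fields &f, const std::string &key,
                                     int fallback) {
    const auto it = f.find(key);
    if (it == f.end()) return fallback;
    if (const auto *n = std::get_if<long long>(&it->second)) {
        return static_cast<int>(*n);
    }
    if (const auto *s = std::get_if<std::string>(&it->second)) {
        if (s->empty()) return std::nullopt;
        char *end = nullptr;
        errno = 0;
        const long v = std::strtol(s->c_str(), &end, 10);
        if (errno != 0 || *end != '\0') return std::nullopt;
        return static_cast<int>(v);
    }
    return std::nullopt;  // e.g. a double such as 1.5
}
```

With this shape, a handler reads n the same way whether the request came in as {"n": 2} or as a multipart field whose content is the string "2".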
Embedded SGR escapes inside the prompt string passed to
linenoise_read() confuse utf8_str_width and corrupt the cursor on
multi-line edits. Fix (in chimera_chat, but the lesson stands): emit
the SGR escape to stdout before calling linenoise_read, pass a plain
prompt to linenoise, and emit a reset after. SGR state persists across
linenoise's own cursor moves.
This is not a server-mode bug but landed during the same arc of work. If the server ever grows an interactive console it inherits the same constraint.
Setting detect_language = true runs language identification and returns without transcribing. To
auto-detect language and transcribe, set params.language = "auto"
(or nullptr/"") and leave detect_language = false. The bug was
unreachable from the CLI (whose default is language="en"), so existing
tests never caught it; the HTTP handler exercised it by default because
the OpenAI spec's default is autodetect. Documented and fixed in
chimera_whisper::transcribe.
This is a stable-diffusion.cpp issue: certain (cfg_scale, sample_method)
combinations trigger GGML_ASSERT(buft) failed inside VAE::encode.
The crash reproduces from the CLI with the same params; it is not
HTTP-specific.
We currently surface this as a generic image generation failed HTTP
500. Two follow-ups worth considering:
- A pre-flight sanity check in chimera_sd::generate that rejects known-bad combinations with a 400, although the combination space is model-specific and hard to enumerate.
- Capturing the GGML_ASSERT log line and bubbling it back into the HTTP error body. Would require installing a custom ggml_log handler in the SD code path and a small ring buffer to retrieve the last few lines on failure.

For now: document in the changelog, leave the broad error message in place.
cpp-httplib was already a transitive dep, but LLAMA_OPENSSL=ON in
manage.py means we now pull in libssl/libcrypto symbols. On macOS that
also drags in the system Security and CoreFoundation frameworks
(SecCertificateCopyData / SecTrustCopyAnchorCertificates) because
cpp-httplib reads the trust store from the keychain.
For Linux distros this is essentially zero friction. For Windows the
OpenSSL search path may need OPENSSL_ROOT_DIR set explicitly.
If we ever want a no-OpenSSL build, the move is to flip
LLAMA_OPENSSL=OFF in manage.py and accept that HTTPS-fetch features
(LoRA URLs, HF download via common) stop working.
command_serve initially called llama_backend_init() itself, but
main() already does. The second call is harmless on most builds but
felt fragile; we removed it from command_serve and kept only
llama_numa_init(params.numa) (which depends on per-subcommand
common_params). The cleanup at end of command_serve similarly does
not call llama_backend_free — main() does that after
command_serve returns.
Server-context API stability. chimera tracks a pinned llama.cpp
commit via LLAMACPP_VERSION in scripts/manage.py. The
server_context, server_routes, and server_http_context types are
not part of upstream's public API; they shift with llama.cpp's
internal refactors.
The make bump-check target automates the diff. It fetches
tools/server/server-context.h and tools/server/server-http.h from
the upstream tag (defaults to the currently pinned version; override
with make bump-check LLAMA_VERSION=bXXXX) and compares them against
the headers currently vendored under
thirdparty/llama.cpp/include/. The output lists added/removed
top-level symbols (struct/class/enum names, handler_t fields,
function signatures) and a unified diff capped at 120 lines. Exit
code is 0 when clean, 2 when any header changed.
Recommended bump workflow:

    # 1. See what's changing before touching anything.
    make bump-check LLAMA_VERSION=<new_ref>

    # 2. If anything came back, audit the call sites: chimera_serve.cpp
    #    route bindings (especially `handler_t` fields on `server_routes`),
    #    src/chimera/CMakeLists.txt link order / archive groups, and the
    #    server-http.cpp source copy under thirdparty/llama.cpp/src-aux/.

    # 3. Edit LLAMACPP_VERSION in scripts/manage.py, rebuild deps + chimera:
    python scripts/manage.py build --llama-cpp --llama-version <new_ref>
    make build
    make test

Once LLAMACPP_VERSION is bumped and make build re-vendors the
headers, make bump-check will report "clean" again — the check is
a pre-bump audit step, not a CI guard.
Sleeping / hibernation. server_context::on_sleeping_changed exists
upstream for the router-server protocol; we don't wire it up. If we
ever add chimera serve to a router topology, this is the integration
point.
Memory growth across many turns. server-context manages its own KV-cache lifetime via slots; we trust its behavior. If a deployment shows unbounded memory growth, it is likely a server-context issue rather than a chimera one — file upstream first.
Web UI assets size. LLAMA_BUILD_WEBUI=OFF keeps the chimera
binary ~11 MB smaller than llama-server by default. Both Web UI
delivery paths now ship as opt-in: the bundle can be baked into the
binary via -DCHIMERA_WEBUI_EMBED=ON (upstream's xxd.cmake
machinery; binary size up ~6 MB stripped) or served from any static
directory at runtime via --public-path <dir>. See
docs/dev/webui.md for the full picture.
Single SD context per process. Loading multiple SD models at once would multiply VRAM. For now the design is "one image model per process"; if a deployment needs to switch models, restart the server.
Single whisper context per process. Same as SD. The model name in OpenAI's transcription request is currently ignored; if we ever respect it, we'd need a small registry of preloaded whisper contexts.
Ordered roughly by ROI per implementation effort. None of these is blocking.
Non-WAV audio in /v1/audio/transcriptions. Currently only
RIFF/WAVE is accepted; mp3, m4a, mp4, mpeg, webm all return 415 with a
message telling the caller to transcode. Every viable path adds
dependencies that don't pull their weight: single-header decoders
(dr_mp3.h + dr_flac.h) only cover two of the formats users actually
send, and a complete solution drags in libavcodec / FFmpeg, which is
multi-megabyte and brings its own license + CVE surface. Clients
already have ffmpeg available far more often than they have an
in-process audio codec library, so we punt the decode to them. Do
not revisit without a concrete user request.

- Web chat UI: chimera-specific UX. The upstream-style embed (Variant A, CHIMERA_WEBUI_EMBED=ON) is shipped; see § 7 below and docs/dev/webui.md. What's still open is Variant B: a chimera-aware UI that exposes the routes upstream's UI doesn't know about (audio, images, vector store, persisted chats). The --public-path flag is the entry point; the actual UI is the missing piece.
- SSE for image generation progress. OpenAI's spec doesn't define SSE for /v1/images/*; some clients invent extensions. SD reports progress step-by-step via its callback (which we currently route to stderr). Adding an event: progress SSE stream alongside the final data: payload is technically straightforward; the question is which client convention to follow.
- Multi-tenancy. Multiple LLMs loadable simultaneously, route by request's model field. Today we set model_alias to the loaded name and ignore model. Doing this properly is the is_router_server path in upstream's server.cpp; effectively a rewrite of command_serve. See docs/dev/server-router-mode.md for the wontfix decision record.
- HTTPS direct serving. --ssl-cert-file / --ssl-key-file. The machinery is in cpp-httplib; would mean wiring it through server_http_context::init. Most deployments will use a reverse proxy instead.
- Authentication beyond --api-key. The current single-key bearer-token check is enough for "behind a VPN"; multi-tenant deployments will want JWT or per-key rate limiting.

(Item formerly numbered 14 — "reorganize chimera_serve.cpp if it keeps
growing past ~600 LOC" — shipped in 0.1.5; the file was split from
2249 LOC into chimera_serve.cpp plus five per-modality TUs
(chimera_serve_audio.cpp, chimera_serve_images.cpp,
chimera_serve_rag.cpp, chimera_serve_chat_persist.cpp,
chimera_serve_chats_read.cpp). The current chimera_serve.cpp is
the route-registration + lifecycle backbone.)

- build/llama.cpp/tools/server/server.cpp: upstream's 363-line llama-server main. Our command_serve is structurally a stripped-down copy of this.
- build/llama.cpp/tools/server/server-context.h: server_context, server_routes, server_context_meta declarations.
- build/llama.cpp/tools/server/server-http.h: server_http_context, server_http_req, server_http_res, uploaded_file.
- OpenAI API docs: https://platform.openai.com/docs/api-reference.
- llama.cpp server README: https://github.com/ggml-org/llama.cpp/tree/master/tools/server.