
serve example: replace tiny_http with axum #517

Merged
janimo merged 2 commits into hellas-ai:master from georgewhewell:grw/feat/serve-axum on May 13, 2026

Conversation

@georgewhewell
Contributor

anyone downstream is far more likely to be using axum
also includes some fixes to wire up thinking

Centralize the wire-format -> chat-template `enable_thinking`
mapping on each surface's typed request struct, and wire it into the
respective request-prep paths so both surfaces actually flow the
flag through.

OpenAI:
- Add `ReasoningEffort::enables_thinking()`: true for Low/Medium/High,
  false for None. Replaces the open-coded
  `effort != ReasoningEffort::None` check at the single call site in
  `prepare_openai_chat_request`.
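
A minimal sketch of that helper, assuming only the variant names mentioned
above (derives and any serde attributes are illustrative, not the crate's
actual definition):

```rust
// Sketch only: variant names taken from the description above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReasoningEffort {
    None,
    Low,
    Medium,
    High,
}

impl ReasoningEffort {
    /// Wire-format effort level -> chat-template `enable_thinking` flag:
    /// any non-`None` effort enables thinking.
    pub fn enables_thinking(self) -> bool {
        !matches!(self, ReasoningEffort::None)
    }
}
```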

Anthropic:
- Add `ThinkingConfig` enum: `Enabled { budget_tokens: u32 }` |
  `Disabled`, serialized with `{"type": "enabled"|"disabled", ...}`
  matching Anthropic's API. `enables_thinking()` mirrors the OpenAI
  helper.
- Add `thinking: Option<ThinkingConfig>` field on `MessageRequest`.
  Previously `thinking` was an unknown field that serde silently
  dropped on deserialization; the existing
  `message_request_ignores_unsupported_fields` test asserts it now
  round-trips into the typed shape.
- Add `ModelEngine::prepare_anthropic_message_request` — the
  Anthropic counterpart to `prepare_openai_chat_request`. Feeds
  `enable_thinking` from `request.thinking.is_some_and(|t| t.enables_thinking())`,
  delegates to the shared chat-template prep path. No `tools`
  binding (typed Anthropic tools land in a separate change).
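
A sketch of the shapes described above; the serde attributes are one way to
get the `{"type": "enabled"|"disabled"}` tagging, not necessarily how the
crate spells it, and the derives are guesses:

```rust
use serde::{Deserialize, Serialize};

// Sketch only: names come from the description above.
#[derive(Debug, Clone, Copy, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "lowercase")]
pub enum ThinkingConfig {
    Enabled { budget_tokens: u32 },
    Disabled,
}

impl ThinkingConfig {
    /// Mirrors `ReasoningEffort::enables_thinking()` on the OpenAI side.
    pub fn enables_thinking(&self) -> bool {
        matches!(self, ThinkingConfig::Enabled { .. })
    }
}

// `MessageRequest` then carries `thinking: Option<ThinkingConfig>`, and
// `prepare_anthropic_message_request` derives the flag roughly as:
//     let enable_thinking = request
//         .thinking
//         .is_some_and(|t| t.enables_thinking());
```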

Replace tiny_http with axum for the demo server. Axum gives us the
async router + listener loop we need to add features like a
gateway-wrap mode (next commit), and a smaller surface to integrate
against than tiny_http's blocking request iterator.

Architecture:
- `ModelEngine` is `Rc`-backed and therefore `!Send`. The engine
  stays on a dedicated `inference` std::thread; HTTP handlers talk
  to it through a `tokio::sync::mpsc` job channel.
- Non-streaming jobs carry a `oneshot::Sender` for the typed JSON
  response. Streaming jobs carry an `mpsc::UnboundedSender<Result<Event,
  Infallible>>` so the inference loop can feed `axum::response::sse::Sse`
  with no mapping step.
- Single-threaded tokio runtime (`new_current_thread`) since axum
  doesn't need multiple worker threads for this demo.
- Worker thread loads the model and signals readiness via a
  std::sync::mpsc oneshot before main starts axum — keeps the
  "Loading model..." UX without sending the engine across threads.

Deps (all dev-only, default features off):
- `axum` with `http1`, `json`, `tokio` — no `form`, `query`,
  `matched-path`, `original-uri`, `tower-log`, `tracing`.
- `tokio` with `rt`, `macros`, `net`, `sync` — no fs / time /
  process / signal.
- `tokio-stream` (default features off) — only for
  `UnboundedReceiverStream` so the SSE body has the right `Stream`
  shape.

Removed `tiny_http`.
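
In `Cargo.toml` terms that is roughly the following (versions elided, feature
lists as stated above):

```toml
[dev-dependencies]
axum = { version = "...", default-features = false, features = ["http1", "json", "tokio"] }
tokio = { version = "...", default-features = false, features = ["rt", "macros", "net", "sync"] }
tokio-stream = { version = "...", default-features = false }
```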

Routes unchanged: `POST /v1/{chat/completions,messages,completions}`.
Wire formats and SSE event shapes unchanged.
georgewhewell force-pushed the grw/feat/serve-axum branch from 89a208a to 90bd539 on May 13, 2026 at 14:50
janimo merged commit b51d366 into hellas-ai:master on May 13, 2026
3 checks passed