
serve example: replace tiny_http with axum #517

Merged
janimo merged 2 commits into hellas-ai:master from georgewhewell:grw/feat/serve-axum on May 13, 2026

Conversation

@georgewhewell
Contributor

anyone downstream is far more likely to be using axum
also includes some fixes to wire up thinking

Centralize the wire-format -> chat-template `enable_thinking`
mapping on each surface's typed request struct, and wire it into the
respective request-prep paths so both surfaces actually flow the
flag through.

OpenAI:
- Add `ReasoningEffort::enables_thinking()`: true for Low/Medium/High,
  false for None. Replaces the open-coded
  `effort != ReasoningEffort::None` check at the single call site in
  `prepare_openai_chat_request`.
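
A minimal sketch of that helper, assuming only the variant names mentioned
above (derives and any serde attributes are illustrative, not the crate's
actual definition):

```rust
// Sketch only: variant names taken from the description above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReasoningEffort {
    None,
    Low,
    Medium,
    High,
}

impl ReasoningEffort {
    /// Wire-format effort level -> chat-template `enable_thinking` flag:
    /// any non-`None` effort enables thinking.
    pub fn enables_thinking(self) -> bool {
        !matches!(self, ReasoningEffort::None)
    }
}
```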

Anthropic:
- Add `ThinkingConfig` enum: `Enabled { budget_tokens: u32 }` |
  `Disabled`, serialized with `{"type": "enabled"|"disabled", ...}`
  matching Anthropic's API. `enables_thinking()` mirrors the OpenAI
  helper.
- Add `thinking: Option<ThinkingConfig>` field on `MessageRequest`.
  Previously `thinking` was an unknown field that serde silently
  dropped on deserialization; the existing
  `message_request_ignores_unsupported_fields` test asserts it now
  round-trips into the typed shape.
- Add `ModelEngine::prepare_anthropic_message_request` — the
  Anthropic counterpart to `prepare_openai_chat_request`. Feeds
  `enable_thinking` from `request.thinking.is_some_and(|t| t.enables_thinking())`,
  delegates to the shared chat-template prep path. No `tools`
  binding (typed Anthropic tools land in a separate change).
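
A sketch of the shapes described above; the serde attributes are one way to
get the `{"type": "enabled"|"disabled"}` tagging, not necessarily how the
crate spells it, and the derives are guesses:

```rust
use serde::{Deserialize, Serialize};

// Sketch only: names come from the description above.
#[derive(Debug, Clone, Copy, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "lowercase")]
pub enum ThinkingConfig {
    Enabled { budget_tokens: u32 },
    Disabled,
}

impl ThinkingConfig {
    /// Mirrors `ReasoningEffort::enables_thinking()` on the OpenAI side.
    pub fn enables_thinking(&self) -> bool {
        matches!(self, ThinkingConfig::Enabled { .. })
    }
}

// `MessageRequest` then carries `thinking: Option<ThinkingConfig>`, and
// `prepare_anthropic_message_request` derives the flag roughly as:
//     let enable_thinking = request
//         .thinking
//         .is_some_and(|t| t.enables_thinking());
```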

Replace tiny_http with axum for the demo server. Axum gives us the
async router + listener loop we need to add features like a
gateway-wrap mode (next commit), and a smaller surface to integrate
against than tiny_http's blocking request iterator.

Architecture:
- `ModelEngine` is `Rc`-backed and therefore `!Send`. The engine
  stays on a dedicated `inference` std::thread; HTTP handlers talk
  to it through a `tokio::sync::mpsc` job channel.
- Non-streaming jobs carry a `oneshot::Sender` for the typed JSON
  response. Streaming jobs carry an `mpsc::UnboundedSender<Result<Event,
  Infallible>>` so the inference loop can feed `axum::response::sse::Sse`
  with no mapping step.
- Single-threaded tokio runtime (`new_current_thread`) since axum
  doesn't need multiple worker threads for this demo.
- Worker thread loads the model and signals readiness via a
  std::sync::mpsc oneshot before main starts axum — keeps the
  "Loading model..." UX without sending the engine across threads.

Deps (all dev-only, default features off):
- `axum` with `http1`, `json`, `tokio` — no `form`, `query`,
  `matched-path`, `original-uri`, `tower-log`, `tracing`.
- `tokio` with `rt`, `macros`, `net`, `sync` — no fs / time /
  process / signal.
- `tokio-stream` (default features off) — only for
  `UnboundedReceiverStream` so the SSE body has the right `Stream`
  shape.

Removed `tiny_http`.
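
In `Cargo.toml` terms that is roughly the following (versions elided, feature
lists as stated above):

```toml
[dev-dependencies]
axum = { version = "...", default-features = false, features = ["http1", "json", "tokio"] }
tokio = { version = "...", default-features = false, features = ["rt", "macros", "net", "sync"] }
tokio-stream = { version = "...", default-features = false }
```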

Routes unchanged: `POST /v1/{chat/completions,messages,completions}`.
Wire formats and SSE event shapes unchanged.
georgewhewell force-pushed the grw/feat/serve-axum branch from 89a208a to 90bd539 on May 13, 2026 at 14:50
janimo merged commit b51d366 into hellas-ai:master on May 13, 2026
3 checks passed