serve example: replace tiny_http with axum #517
Merged
Centralize the wire-format -> chat-template `enable_thinking`
mapping on each surface's typed request struct, and wire it into the
respective request-prep paths so both surfaces actually flow the
flag through.
OpenAI:
- Add `ReasoningEffort::enables_thinking()`: `true` for `Low`/`Medium`/`High`,
`false` for `None`. Replaces the open-coded
`effort != ReasoningEffort::None` match at the single call site in
`prepare_openai_chat_request`.
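The OpenAI-side helper can be sketched as follows. The variant names come from the description above; the derive list and everything else are illustrative assumptions, not the PR's actual code:

```rust
// Sketch of the OpenAI-surface helper described above (illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReasoningEffort {
    None,
    Low,
    Medium,
    High,
}

impl ReasoningEffort {
    /// `true` for Low/Medium/High, `false` for None -- replaces the
    /// open-coded `effort != ReasoningEffort::None` comparison.
    pub fn enables_thinking(self) -> bool {
        !matches!(self, ReasoningEffort::None)
    }
}

fn main() {
    assert!(ReasoningEffort::Low.enables_thinking());
    assert!(ReasoningEffort::High.enables_thinking());
    assert!(!ReasoningEffort::None.enables_thinking());
}
```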
Anthropic:
- Add `ThinkingConfig` enum: `Enabled { budget_tokens: u32 }` |
`Disabled`, serialized with `{"type": "enabled"|"disabled", ...}`
matching Anthropic's API. `enables_thinking()` mirrors the OpenAI
helper.
- Add `thinking: Option<ThinkingConfig>` field on `MessageRequest`.
Previously `thinking` was an unknown field that serde silently
dropped on deserialization; the existing
`message_request_ignores_unsupported_fields` test asserts it now
round-trips into the typed shape.
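The Anthropic-side shape can be sketched like this. The real code reportedly uses serde (an internally tagged representation would produce the `{"type": ...}` wire shape); `wire_json` below is a hypothetical hand-rolled stand-in so the example stays dependency-free, and the `Copy` derive is an assumption:

```rust
// Sketch of the Anthropic-surface `ThinkingConfig` described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ThinkingConfig {
    Enabled { budget_tokens: u32 },
    Disabled,
}

impl ThinkingConfig {
    /// Mirrors `ReasoningEffort::enables_thinking()` on the OpenAI side.
    pub fn enables_thinking(self) -> bool {
        matches!(self, ThinkingConfig::Enabled { .. })
    }

    /// Illustrative only: the tagged wire shape the serde derive emits.
    pub fn wire_json(self) -> String {
        match self {
            ThinkingConfig::Enabled { budget_tokens } => {
                format!(r#"{{"type":"enabled","budget_tokens":{budget_tokens}}}"#)
            }
            ThinkingConfig::Disabled => r#"{"type":"disabled"}"#.to_string(),
        }
    }
}

fn main() {
    // The mapping used when preparing an Anthropic message request:
    let thinking = Some(ThinkingConfig::Enabled { budget_tokens: 1024 });
    let enable_thinking = thinking.is_some_and(|t| t.enables_thinking());
    assert!(enable_thinking);
    assert!(!ThinkingConfig::Disabled.enables_thinking());
    assert_eq!(
        ThinkingConfig::Enabled { budget_tokens: 1024 }.wire_json(),
        r#"{"type":"enabled","budget_tokens":1024}"#
    );
}
```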
- Add `ModelEngine::prepare_anthropic_message_request` — the
Anthropic counterpart to `prepare_openai_chat_request`. Feeds
`enable_thinking` from `request.thinking.is_some_and(|t| t.enables_thinking())`,
delegates to the shared chat-template prep path. No `tools`
binding (typed Anthropic tools land in a separate change).
Replace tiny_http with axum for the demo server. Axum gives us the
async router + listener loop we need to add features like a
gateway-wrap mode (next commit), and a smaller surface to integrate
against than tiny_http's blocking request iterator.
Architecture:
- `ModelEngine` is `Rc`-backed and therefore `!Send`. The engine
stays on a dedicated `inference` std::thread; HTTP handlers talk
to it through a `tokio::sync::mpsc` job channel.
- Non-streaming jobs carry a `oneshot::Sender` for the typed JSON
response. Streaming jobs carry an `mpsc::UnboundedSender<Result<Event,
Infallible>>` so the inference loop can feed `axum::response::sse::Sse`
with no mapping step.
- Single-threaded tokio runtime (`new_current_thread`) since axum
doesn't need multiple worker threads for this demo.
- Worker thread loads the model and signals readiness over a
std::sync::mpsc channel (used as a oneshot) before main starts axum —
keeps the "Loading model..." UX without sending the engine across
threads.
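The threading layout above can be sketched with std-only channels (the real example uses tokio channels feeding axum handlers; `Job`, `Engine`, and the string payloads here are illustrative stand-ins):

```rust
use std::rc::Rc;
use std::sync::mpsc;
use std::thread;

// Simplified, std-only sketch of the architecture described above.
enum Job {
    // Non-streaming: one reply channel used as a oneshot.
    Complete { prompt: String, reply: mpsc::Sender<String> },
}

// Rc-backed stand-in: !Send, so it must stay on its spawn thread.
struct Engine(Rc<String>);

impl Engine {
    fn run(&self, prompt: &str) -> String {
        format!("{}: {}", self.0, prompt)
    }
}

fn main() {
    let (job_tx, job_rx) = mpsc::channel::<Job>();
    let (ready_tx, ready_rx) = mpsc::channel::<()>(); // readiness "oneshot"

    thread::spawn(move || {
        let engine = Engine(Rc::new("model".to_string())); // "load" the model
        ready_tx.send(()).unwrap(); // signal readiness before serving
        while let Ok(Job::Complete { prompt, reply }) = job_rx.recv() {
            let _ = reply.send(engine.run(&prompt));
        }
    });

    ready_rx.recv().unwrap(); // main blocks until the model is "loaded"
    let (reply_tx, reply_rx) = mpsc::channel();
    job_tx
        .send(Job::Complete { prompt: "hi".into(), reply: reply_tx })
        .unwrap();
    assert_eq!(reply_rx.recv().unwrap(), "model: hi");
}
```

The key property the sketch preserves: the `!Send` engine is created and used entirely on one thread, while other threads only ever touch `Sender` halves of channels.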
Deps (all dev-only, default features off):
- `axum` with `http1`, `json`, `tokio` — no `form`, `query`,
`matched-path`, `original-uri`, `tower-log`, `tracing`.
- `tokio` with `rt`, `macros`, `net`, `sync` — no fs / time /
process / signal.
- `tokio-stream` (default features off) — only for
`UnboundedReceiverStream` so the SSE body has the right `Stream`
shape.
Removed `tiny_http`.
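The dependency stanza described above would look roughly like this in `Cargo.toml` (version numbers are placeholders, not taken from the PR):

```toml
[dev-dependencies]
axum = { version = "*", default-features = false, features = ["http1", "json", "tokio"] }
tokio = { version = "*", default-features = false, features = ["rt", "macros", "net", "sync"] }
tokio-stream = { version = "*", default-features = false }
```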
Routes unchanged: `POST /v1/{chat/completions,messages,completions}`.
Wire formats and SSE event shapes unchanged.
Anyone downstream is far more likely to be using axum than tiny_http.
Also includes some fixes to wire up thinking.