feat: agentic-server stateful conversation/responses endpoints. #48
Draft
maralbahari wants to merge 24 commits into
Signed-off-by: maral <maralbahari.98@gmail.com>
Add executor module: rehydration, LLM inference, SSE accumulation, and persistence for both conversation and response stateful flows. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: maral <maralbahari.98@gmail.com>
…ame entry Signed-off-by: maral <maralbahari.98@gmail.com>
Summary
To be reviewed after #46
Wires `agentic-core`'s executor into `agentic-server`, giving the gateway stateful conversation and response management on top of the existing vLLM proxy. Implements the Layer 2 (HTTP API) component of ADR-03.
New HTTP endpoints:

- `/v1/conversations`: `create_conversation()` creates a DB-backed conversation record; the response body includes `id` (the `conversation_id`) along with `created_at`, `object`, and `metadata`.
- `/v1/responses`: dispatches on the `store` field. The executor path returns a `ResponsePayload` JSON object (`id`, `object`, `created_at`, `model`, `status`, `output`, `usage`, `previous_response_id`, `conversation_id`, …) for non-streaming requests, or an SSE stream of the same shape for streaming; the proxy path forwards the vLLM response verbatim.

Routing logic (`store` field):

- `store=true` (default per spec) → executor path: `execute()` handles rehydration, LLM inference, and persistence
- `store=false` → proxy path: request forwarded directly to vLLM, no DB involvement

Both paths share a single `/v1/responses` handler that reads the body once, peeks at `store`, and dispatches without re-buffering.

Rehydration scope (`conversation_id` field):

Within the executor path, `/v1/responses` selects its rehydration strategy based on the presence of `conversation_id` in the request body:

- `conversation_id` + `previous_response_id` present → `rehydrate_from_conversation()`: loads the full conversation turn history from the DB and reconstructs context for the next turn
- `conversation_id` absent, `previous_response_id` present → `rehydrate_from_response()`: loads only the chained response history, with no conversation record involved

State design:

`AppState` holds both states simultaneously, with no feature flags or optional fields. `Config` gains an optional `db_url` field. The server defaults to `sqlite://./agentic_api.db` when unset.

Per-request auth override:

If the incoming `Authorization: Bearer <token>` differs from the configured key, a lightweight `ExecutionContext` clone is created for that request so auth is not shared across concurrent requests.
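
The routing, rehydration, and auth-override rules above can be read as three small pure functions. The sketch below is only an illustration of those rules, not the handler code from this PR: `Route`, `Rehydration`, `route`, `rehydration`, `context_for_request`, and the `api_key` field are hypothetical names, and the real `ExecutionContext` almost certainly carries more than one field. Cases the description does not cover (for example `conversation_id` without `previous_response_id`) are deliberately left as `Unspecified`.

```rust
use std::borrow::Cow;

/// Which path a `/v1/responses` request takes (hypothetical type).
#[derive(Debug, PartialEq)]
enum Route {
    Executor,
    Proxy,
}

/// How the executor path rebuilds prior context (hypothetical type).
#[derive(Debug, PartialEq)]
enum Rehydration {
    FromConversation,
    FromResponse,
    /// Combinations the PR description does not spell out.
    Unspecified,
}

/// `store` defaults to true when the field is absent from the body.
fn route(store: Option<bool>) -> Route {
    if store.unwrap_or(true) {
        Route::Executor
    } else {
        Route::Proxy
    }
}

/// Picks a rehydration strategy from the two optional request fields.
fn rehydration(conversation_id: Option<&str>, previous_response_id: Option<&str>) -> Rehydration {
    match (conversation_id, previous_response_id) {
        (Some(_), Some(_)) => Rehydration::FromConversation,
        (None, Some(_)) => Rehydration::FromResponse,
        _ => Rehydration::Unspecified,
    }
}

/// Stand-in for the real context; the single field is an assumption.
#[derive(Clone, Debug, PartialEq)]
struct ExecutionContext {
    api_key: Option<String>,
}

/// Borrows the shared context unless the request carries a different
/// bearer token, in which case it returns a per-request clone.
fn context_for_request<'a>(
    shared: &'a ExecutionContext,
    authorization: Option<&str>,
) -> Cow<'a, ExecutionContext> {
    let token = authorization.and_then(|header| header.strip_prefix("Bearer "));
    match token {
        Some(t) if shared.api_key.as_deref() != Some(t) => {
            let mut ctx = shared.clone();
            ctx.api_key = Some(t.to_string());
            Cow::Owned(ctx)
        }
        _ => Cow::Borrowed(shared),
    }
}
```

Returning a `Cow` keeps the common case (token matches, or no header) allocation-free, and the shared context is never mutated, which is one way to get the "not shared across concurrent requests" property described above.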
Test Plan
Unit / integration tests (13 new, all passing):

- `tests/responses_test.rs` (4 tests):
  - `store=false` proxies JSON response from mock vLLM
  - `store=false` proxies SSE stream from mock vLLM with correct content-type
  - `store=true` reaches executor path, not the proxy's 200
- `tests/conversations_test.rs` (3 tests):
  - `store=false` in body returns 400
  - `store=true`, reaches executor path
  - `store=true`, reaches executor path
- `tests/health_test.rs` and `tests/cors_test.rs` refactored to share helpers via `tests/common/mod.rs`.

All 135 tests pass across the workspace.

    cargo test

Running Tests

    cargo test -p agentic-server

Running Benchmarks
Benchmark Results
    BENCH_TURNS=10 LLM_BASE_URL=http://localhost:9090 \
      BENCH_MODEL=Qwen/Qwen3-30B-A3B-FP8 \
      cargo bench -p agentic-server --bench benches -- \
      conversation_rehydration response_rehydration --sample-size 10

Benchmark groups:

- `conversation_rehydration/non_streaming/turns N`: `conversation_id` and `previous_response_id` → `rehydrate_from_conversation`
- `conversation_rehydration/streaming/turns N`
- `response_rehydration/non_streaming/turns N`: `previous_response_id` → `rehydrate_from_response`
- `response_rehydration/streaming/turns N`
- `proxy/non_stream`
- `proxy/stream`

Per-turn timing only; prior turns are seeded before Criterion starts, matching the methodology of `executor_throughput`.

conversation_rehydration

response_rehydration

Analysis
Both paths stay well under 3 s across all turn counts. Gateway overhead (DB reads, rehydration, prompt reconstruction) is not the bottleneck; LLM inference time dominates, which is why confidence intervals are wide (often spanning ~0.5–1 s) even with 10 samples.
`conversation_rehydration` shows a mild upward trend with turn count. Non-streaming median grows from ~2.1 s at turn 1 (no prior context) to ~2.8 s at turn 10 (9 prior turns rehydrated), an incremental cost of roughly 0.08 s per prior turn. This is expected: each additional turn adds to the reconstructed prompt sent to the LLM. The per-turn cost is small relative to the ~2 s base latency, so it does not dominate over the turn counts measured here.

`response_rehydration` shows no clear trend with chain depth. Medians are flat and noisy across turns (2.03–2.85 s non-streaming), which suggests the DB fetch overhead per chained response is negligible relative to inference variance.

Streaming and non-streaming medians are comparable per turn. Because benchmarks seed prior turns before Criterion starts timing, each measurement isolates a single turn's round-trip cost. The similarity suggests that SSE framing adds no meaningful overhead on the gateway side.

`conversation_rehydration` is modestly faster than `response_rehydration` at low turn counts (2.06 s vs 2.85 s at turn 1). At higher turn counts they converge. The turn-1 gap reflects the fact that response rehydration always issues a DB lookup for the prior response even on the first measured turn, while conversation rehydration with no prior context skips that step.
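
As a quick check on the per-turn figure quoted for `conversation_rehydration`, the incremental cost follows from the two rounded non-streaming medians given in the analysis (turn 1 and turn 10), so the result is itself approximate:

```latex
\frac{2.8\,\mathrm{s} - 2.1\,\mathrm{s}}{10 - 1}
  = \frac{0.7\,\mathrm{s}}{9}
  \approx 0.078\,\mathrm{s}\ \text{per prior turn}
```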