
feat: agentic-server stateful conversation/responses endpoints. #48

Draft
maralbahari wants to merge 24 commits into vllm-project:main from EmbeddedLLM:agentic-server-stateful


@maralbahari

Collaborator

Summary

To be reviewed after #46

Wires agentic-core's executor into agentic-server, giving the gateway
stateful conversation and response management on top of the existing vLLM
proxy. Implements the Layer 2 (HTTP API) component of
ADR-03.

New HTTP endpoints:

| Endpoint | Method | Routes to |
| --- | --- | --- |
| /v1/conversations | POST | create_conversation(): creates a DB-backed conversation record; response body includes id (the conversation_id) along with created_at, object, and metadata |
| /v1/responses | POST | executor or proxy, depending on the store field; the executor path returns a ResponsePayload JSON object (id, object, created_at, model, status, output, usage, previous_response_id, conversation_id, …) for non-streaming requests, or an SSE stream of the same shape for streaming; the proxy path forwards the vLLM response verbatim |

Routing logic (store field):

  • store=true (default per spec) → executor path: execute() handles rehydration, LLM inference, and persistence
  • store=false → proxy path: request forwarded directly to vLLM, no DB involvement

Both paths share a single /v1/responses handler that reads the body once, peeks at store, and dispatches without re-buffering.
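The dispatch rule above can be sketched as a small pure function. This is a minimal illustration, not the handler in this PR: the `Route` enum and `route_for` are hypothetical names, and the sketch assumes the `store` field has already been extracted from the request body as an `Option<bool>`.

```rust
/// Which path a /v1/responses request takes (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Route {
    /// store=true: rehydration, LLM inference, and persistence.
    Executor,
    /// store=false: forwarded directly to vLLM, no DB involvement.
    Proxy,
}

/// Picks the route from the optional `store` field.
/// A missing field is treated as `true`, the default described above.
pub fn route_for(store: Option<bool>) -> Route {
    match store {
        Some(false) => Route::Proxy,
        Some(true) | None => Route::Executor,
    }
}
```

Keeping the decision in a pure function like this would let the default-to-executor behavior be unit-tested without a running server.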

Rehydration scope (conversation_id field):

Within the executor path, /v1/responses selects its rehydration strategy based on the presence of conversation_id in the request body:

  • conversation_id + previous_response_id present → rehydrate_from_conversation(): loads the full conversation turn history from the DB and reconstructs context for the next turn
  • conversation_id absent, previous_response_id present → rehydrate_from_response(): loads only the chained response history, with no conversation record involved
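The two cases above could be expressed as a match on which IDs are present. This is a sketch under assumptions: `Rehydration` and `select_rehydration` are hypothetical names, and the combinations the description does not cover (for example, no previous_response_id) are returned as `None` here rather than guessed.

```rust
/// Rehydration strategy on the executor path (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Rehydration {
    /// Corresponds to rehydrate_from_conversation(): full conversation turn history.
    FromConversation,
    /// Corresponds to rehydrate_from_response(): chained response history only.
    FromResponse,
}

/// Selects a strategy from the IDs present in the request body.
pub fn select_rehydration(
    conversation_id: Option<&str>,
    previous_response_id: Option<&str>,
) -> Option<Rehydration> {
    match (conversation_id, previous_response_id) {
        (Some(_), Some(_)) => Some(Rehydration::FromConversation),
        (None, Some(_)) => Some(Rehydration::FromResponse),
        // Assumption: cases not described in the PR text are left undecided.
        _ => None,
    }
}
```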

State design:

AppState holds both states simultaneously, with no feature flags or optional fields:

struct AppState {
    proxy_state: ProxyState,         // always present, handles store=false
    exec_ctx: Arc<ExecutionContext>, // always present, handles store=true
    llm_api_base: String,
}

Config gains an optional db_url field. The server defaults to sqlite://./agentic_api.db when unset.
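A minimal sketch of the fallback, assuming `db_url` is stored as an `Option<String>`. The `Config` shape, the `DEFAULT_DB_URL` constant, and the `resolved_db_url` method are illustrative and may not match the actual code.

```rust
/// Default database URL used when `db_url` is unset (value taken from the description above).
pub const DEFAULT_DB_URL: &str = "sqlite://./agentic_api.db";

/// Hypothetical subset of the server config, for illustration only.
pub struct Config {
    pub db_url: Option<String>,
}

impl Config {
    /// Returns the configured URL, or the SQLite default when none is set.
    pub fn resolved_db_url(&self) -> &str {
        self.db_url.as_deref().unwrap_or(DEFAULT_DB_URL)
    }
}
```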

Per-request auth override:

If the incoming Authorization: Bearer <token> differs from the configured key,
a lightweight ExecutionContext clone is created for that request so auth is
not shared across concurrent requests.
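The override could look roughly like the following. This is a sketch, not the PR's implementation: the `api_key` field on `ExecutionContext` and the helpers `bearer_token` and `context_for_request` are assumptions made for illustration.

```rust
use std::sync::Arc;

/// Hypothetical, reduced ExecutionContext holding only the configured key.
#[derive(Clone, Debug)]
pub struct ExecutionContext {
    pub api_key: Option<String>,
}

/// Extracts the token from an `Authorization: Bearer <token>` header value.
pub fn bearer_token(header: &str) -> Option<&str> {
    header
        .strip_prefix("Bearer ")
        .map(str::trim)
        .filter(|token| !token.is_empty())
}

/// Returns the shared context when the token matches (or is absent),
/// and a per-request copy carrying the caller's token otherwise.
pub fn context_for_request(
    shared: &Arc<ExecutionContext>,
    authorization: Option<&str>,
) -> Arc<ExecutionContext> {
    match authorization.and_then(bearer_token) {
        Some(token) if shared.api_key.as_deref() != Some(token) => {
            Arc::new(ExecutionContext {
                api_key: Some(token.to_string()),
            })
        }
        _ => Arc::clone(shared),
    }
}
```

Because the shared context is never mutated, a concurrent request with a different token cannot observe another caller's credentials.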

Test Plan

Unit / integration tests (13 new, all passing):

tests/responses_test.rs (4 tests):

  • store=false proxies JSON response from mock vLLM
  • store=false proxies SSE stream from mock vLLM with correct content-type
  • store=true reaches executor path, not the proxy's 200
  • Oversized body returns 413

tests/conversations_test.rs (3 tests):

  • store=false in body returns 400
  • Empty body defaults store=true, reaches executor path
  • No body defaults store=true, reaches executor path

tests/health_test.rs and tests/cors_test.rs refactored to share helpers via tests/common/mod.rs.

All 135 tests pass across the workspace.

cargo test

Running Tests

cargo test -p agentic-server

Running Benchmarks

# All benchmarks (proxy + conversation rehydration)
cargo bench --bench benches

# Conversation rehydration only
BENCH_TURNS=3 cargo bench --bench benches -- conversation_rehydration

# Against a real vLLM (model auto-detected from /v1/models)
BENCH_TURNS=3 LLM_BASE_URL=http://localhost:9090 \
    cargo bench --bench benches -- conversation_rehydration

# Explicit model name
BENCH_TURNS=3 LLM_BASE_URL=http://localhost:9090 \
    BENCH_MODEL=Qwen/Qwen3-30B-A3B-FP8 \
    cargo bench --bench benches -- conversation_rehydration

Benchmark Results

Model: Qwen/Qwen3-30B-A3B-FP8 · GPU: A100 · vLLM http://localhost:5050 · sample-size=10 · turns 1–10.

BENCH_TURNS=10 LLM_BASE_URL=http://localhost:9090 \
    BENCH_MODEL=Qwen/Qwen3-30B-A3B-FP8 \
    cargo bench -p agentic-server --bench benches -- \
        conversation_rehydration response_rehydration --sample-size 10

Benchmark groups:

| Group | Path | Measures |
| --- | --- | --- |
| conversation_rehydration/non_streaming/turns N | conversation_id and previous_response_id → rehydrate_from_conversation | Full HTTP round-trip, N-1 prior turns in conversation |
| conversation_rehydration/streaming/turns N | same, SSE | Same, streaming response |
| response_rehydration/non_streaming/turns N | previous_response_id → rehydrate_from_response | Full HTTP round-trip, N-1 chained responses |
| response_rehydration/streaming/turns N | same, SSE | Same, streaming response |
| proxy/non_stream | store=false | Direct vLLM proxy overhead, no DB |
| proxy/stream | store=false, SSE | Same, streaming |

Per-turn timing only: prior turns are seeded before Criterion starts measuring, matching the methodology of executor_throughput.


conversation_rehydration

| Turn | Non-streaming (median) | CI (low–high) | Streaming (median) | CI (low–high) |
| --- | --- | --- | --- | --- |
| 1 | 2.06 s | 1.61–2.50 s | 1.66 s | 1.37–1.97 s |
| 2 | 2.24 s | 1.91–2.57 s | 2.08 s | 1.78–2.33 s |
| 3 | 2.59 s | 2.23–2.98 s | 2.19 s | 1.80–2.63 s |
| 4 | 2.46 s | 2.08–2.85 s | 2.83 s | 2.33–3.31 s |
| 5 | 2.17 s | 1.68–2.62 s | 2.60 s | 2.14–3.15 s |
| 6 | 2.50 s | 2.03–2.98 s | 1.99 s | 1.72–2.31 s |
| 7 | 2.34 s | 1.93–2.81 s | 2.72 s | 2.24–3.25 s |
| 8 | 2.44 s | 2.03–2.85 s | 2.54 s | 2.16–2.94 s |
| 9 | 2.45 s | 1.88–3.09 s | 2.71 s | 2.30–3.16 s |
| 10 | 2.84 s | 2.52–3.21 s | 2.80 s | 2.43–3.17 s |

response_rehydration

| Turn | Non-streaming (median) | CI (low–high) | Streaming (median) | CI (low–high) |
| --- | --- | --- | --- | --- |
| 1 | 2.85 s | 2.52–3.13 s | 2.98 s | 2.53–3.47 s |
| 2 | 2.32 s | 2.06–2.59 s | 2.67 s | 2.37–2.96 s |
| 3 | 2.77 s | 2.56–3.04 s | 2.33 s | 1.84–2.87 s |
| 4 | 2.79 s | 2.46–3.10 s | 2.43 s | 2.16–2.70 s |
| 5 | 2.23 s | 1.78–2.76 s | 2.40 s | 2.12–2.67 s |
| 6 | 2.13 s | 1.78–2.48 s | 2.94 s | 2.34–3.58 s |
| 7 | 2.03 s | 1.75–2.34 s | 2.90 s | 2.45–3.54 s |
| 8 | 2.66 s | 2.26–3.10 s | 1.81 s | 1.38–2.32 s |
| 9 | 2.69 s | 2.43–2.96 s | 1.75 s | 1.44–2.03 s |
| 10 | 2.42 s | 2.12–2.73 s | 2.44 s | 2.18–2.74 s |

Analysis

Median latency on both paths stays under 3 s across all turn counts (the highest median is 2.98 s). Gateway overhead (DB reads, rehydration, prompt reconstruction) is not the bottleneck; LLM inference time dominates, and its run-to-run variance is why confidence intervals are wide (often spanning ~0.5–1 s) at a sample size of 10.

conversation_rehydration shows a mild upward trend with turn count. The non-streaming median grows from 2.06 s at turn 1 (no prior context) to 2.84 s at turn 10 (9 prior turns rehydrated), an incremental cost of roughly 0.08–0.09 s per prior turn, although the intermediate medians are not monotonic (turn 5 is 2.17 s). This is expected: each additional turn lengthens the reconstructed prompt sent to the LLM. At this rate the added latency stays small relative to total inference time over the turn counts measured.
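The per-turn figure is a simple endpoint slope: the difference between the turn-10 and turn-1 medians divided by the 9 prior turns. The sketch below reproduces that arithmetic; `per_turn_cost` is an illustrative helper, not part of the benchmark code. With the unrounded medians from the table (2.06 s and 2.84 s) it gives about 0.087 s, and with the rounded endpoints (~2.1 s and ~2.8 s) about 0.078 s.

```rust
/// Average added latency per prior turn, computed from the first and last medians.
/// `prior_turns` is the number of rehydrated turns at the last measurement (9 at turn 10).
pub fn per_turn_cost(first_median_s: f64, last_median_s: f64, prior_turns: u32) -> f64 {
    (last_median_s - first_median_s) / f64::from(prior_turns)
}
```

An endpoint slope ignores the intermediate turns, so it is only a rough summary of a noisy series.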

response_rehydration shows no clear trend with chain depth. Medians are flat and noisy across turns (2.03–2.85 s non-streaming), which suggests the DB fetch overhead per chained response is negligible relative to inference variance.

Streaming and non-streaming medians are comparable per turn: they differ by less than 1 s at every turn, and neither mode is consistently faster. Because benchmarks seed prior turns before Criterion starts timing, each measurement isolates a single turn's round-trip cost. The similarity suggests that SSE framing adds no meaningful overhead on the gateway side.

conversation_rehydration is faster than response_rehydration at turn 1 (2.06 s vs 2.85 s non-streaming); from turn 2 onward the two are comparable. The turn-1 gap may reflect the fact that response rehydration always issues a DB lookup for the prior response even on the first measured turn, while conversation rehydration with no prior context skips that step. With 10 samples per point, inference variance may also contribute.

maralbahari and others added 24 commits May 21, 2026 12:02
Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Add executor module: rehydration, LLM inference, SSE accumulation,
and persistence for both conversation and response stateful flows.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
…ame entry

Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
@maralbahari maralbahari changed the title [FEAT] agentic-server stateful conversation/responses endpoints. feat: agentic-server stateful conversation/responses endpoints. Jun 10, 2026