feat: `agentic-server` stateful conversation/responses endpoints. by maralbahari · Pull Request #48 · vllm-project/agentic-api

maralbahari · 2026-06-08T09:14:23Z

Summary

To be reviewed after #46

Wires agentic-core's executor into agentic-server, giving the gateway
stateful conversation and response management on top of the existing vLLM
proxy. Implements the Layer 2 (HTTP API) component of
ADR-03.

New HTTP endpoints:

| Endpoint | Method | Routes to |
|---|---|---|
| `/v1/conversations` | POST | `create_conversation()`: creates a DB-backed conversation record; response body includes `id` (the `conversation_id`) along with `created_at`, `object`, and `metadata` |
| `/v1/responses` | POST | executor or proxy, depending on the `store` field; the executor path returns a `ResponsePayload` JSON object (`id`, `object`, `created_at`, `model`, `status`, `output`, `usage`, `previous_response_id`, `conversation_id`, …) for non-streaming requests, or an SSE stream of the same shape for streaming; the proxy path forwards the vLLM response verbatim |

Routing logic (store field):

store=true (default per spec) → executor path: execute() handles rehydration, LLM inference, and persistence
store=false → proxy path: request forwarded directly to vLLM, no DB involvement

Both paths share a single /v1/responses handler that reads the body once, peeks at store, and dispatches without re-buffering.
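The dispatch rule can be sketched as a small pure function. This is a minimal sketch rather than the PR's handler: `Route` and `route_for` are hypothetical names, and `store` is assumed to have already been peeked out of the request body as an optional boolean.

```rust
/// Destination for a `/v1/responses` request (hypothetical name).
#[derive(Debug, PartialEq, Eq)]
enum Route {
    /// store=true: rehydration, LLM inference, and persistence.
    Executor,
    /// store=false: forward to vLLM, no DB involvement.
    Proxy,
}

/// Pick the path from the optional `store` field.
/// A missing field falls back to `true`, the default per the spec.
fn route_for(store: Option<bool>) -> Route {
    if store.unwrap_or(true) {
        Route::Executor
    } else {
        Route::Proxy
    }
}
```

Keeping the decision separate from the I/O means the same buffered body can be handed to whichever path is chosen.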

Rehydration scope (conversation_id field):

Within the executor path, /v1/responses selects its rehydration strategy based on the presence of conversation_id in the request body:

conversation_id + previous_response_id present → rehydrate_from_conversation(): loads the full conversation turn history from the DB and reconstructs context for the next turn
conversation_id absent, previous_response_id present → rehydrate_from_response(): loads only the chained response history, with no conversation record involved
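The two cases listed above map naturally onto a match over the two optional IDs. The sketch below is illustrative only: `Rehydration` and `select_rehydration` are hypothetical names, and the `Unspecified` variant stands in for combinations this description does not cover (for example, a request with neither ID).

```rust
/// Rehydration strategy on the executor path (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Rehydration {
    /// Corresponds to `rehydrate_from_conversation()`.
    FromConversation,
    /// Corresponds to `rehydrate_from_response()`.
    FromResponse,
    /// Combinations the description does not spell out.
    Unspecified,
}

/// Choose a strategy from the IDs present in the request body.
fn select_rehydration(
    conversation_id: Option<&str>,
    previous_response_id: Option<&str>,
) -> Rehydration {
    match (conversation_id, previous_response_id) {
        (Some(_), Some(_)) => Rehydration::FromConversation,
        (None, Some(_)) => Rehydration::FromResponse,
        _ => Rehydration::Unspecified,
    }
}
```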

State design:

AppState holds both states simultaneously, with no feature flags or optional fields:

struct AppState {
    proxy_state: ProxyState,          // always present, handles store=false
    exec_ctx: Arc<ExecutionContext>,  // always present, handles store=true
    llm_api_base: String,
}

Config gains an optional db_url field. The server defaults to sqlite://./agentic_api.db when unset.
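A minimal sketch of the fallback, assuming the field is an `Option<String>`. The `Config` shown here is a trimmed-down stand-in with only `db_url`, and `effective_db_url` and `DEFAULT_DB_URL` are hypothetical names.

```rust
/// Default database URL used when `db_url` is unset.
const DEFAULT_DB_URL: &str = "sqlite://./agentic_api.db";

/// Stand-in for the server config; only `db_url` is shown.
#[derive(Debug, Default)]
struct Config {
    db_url: Option<String>,
}

impl Config {
    /// Returns the configured URL, or the SQLite default when none is set.
    fn effective_db_url(&self) -> &str {
        self.db_url.as_deref().unwrap_or(DEFAULT_DB_URL)
    }
}
```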

Per-request auth override:

If the incoming Authorization: Bearer <token> differs from the configured key,
a lightweight ExecutionContext clone is created for that request so auth is
not shared across concurrent requests.
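One way to picture the override is shown below. This is a sketch under assumptions, not the PR's code: the `ExecutionContext` here has two made-up fields (`api_key`, `llm_api_base`), `context_for_request` is a hypothetical helper, and a request without an `Authorization` header is assumed to reuse the shared context.

```rust
use std::sync::Arc;

/// Stand-in for the real execution context (fields are assumptions).
#[derive(Clone, Debug)]
struct ExecutionContext {
    api_key: String,
    llm_api_base: String,
}

/// Return the shared context when the bearer token matches the configured
/// key (or no header is sent); otherwise build a per-request copy that
/// carries the caller's token and leaves the shared context untouched.
fn context_for_request(
    shared: &Arc<ExecutionContext>,
    authorization: Option<&str>,
) -> Arc<ExecutionContext> {
    let token = authorization.and_then(|h| h.strip_prefix("Bearer "));
    match token {
        Some(t) if t != shared.api_key.as_str() => {
            let mut per_request = (**shared).clone();
            per_request.api_key = t.to_string();
            Arc::new(per_request)
        }
        _ => Arc::clone(shared),
    }
}
```

Because the shared `Arc` is never mutated, concurrent requests with different tokens cannot observe each other's credentials.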

Test Plan

Unit / integration tests (13 new, all passing):

tests/responses_test.rs (4 tests):

store=false proxies JSON response from mock vLLM
store=false proxies SSE stream from mock vLLM with correct content-type
store=true reaches executor path, not the proxy's 200
Oversized body returns 413

tests/conversations_test.rs (3 tests):

store=false in body returns 400
Empty body defaults store=true, reaches executor path
No body defaults store=true, reaches executor path

tests/health_test.rs and tests/cors_test.rs refactored to share helpers via tests/common/mod.rs.

All 135 tests pass across the workspace.

cargo test

Running Tests

cargo test -p agentic-server

Running Benchmarks

# All benchmarks (proxy + conversation rehydration)
cargo bench --bench benches

# Conversation rehydration only
BENCH_TURNS=3 cargo bench --bench benches -- conversation_rehydration

# Against a real vLLM (model auto-detected from /v1/models)
BENCH_TURNS=3 LLM_BASE_URL=http://localhost:9090 \
    cargo bench --bench benches -- conversation_rehydration

# Explicit model name
BENCH_TURNS=3 LLM_BASE_URL=http://localhost:9090 \
    BENCH_MODEL=Qwen/Qwen3-30B-A3B-FP8 \
    cargo bench --bench benches -- conversation_rehydration

Benchmark Results

Model: Qwen/Qwen3-30B-A3B-FP8 · GPU: A100 · vLLM http://localhost:5050 · sample-size=10 · turns 1–10.

BENCH_TURNS=10 LLM_BASE_URL=http://localhost:9090 \
    BENCH_MODEL=Qwen/Qwen3-30B-A3B-FP8 \
    cargo bench -p agentic-server --bench benches -- \
        conversation_rehydration response_rehydration --sample-size 10

Benchmark groups:

| Group | Path | Measures |
|---|---|---|
| `conversation_rehydration/non_streaming/turns N` | `conversation_id` and `previous_response_id` → `rehydrate_from_conversation` | Full HTTP round-trip, N-1 prior turns in conversation |
| `conversation_rehydration/streaming/turns N` | same, SSE | Same, streaming response |
| `response_rehydration/non_streaming/turns N` | `previous_response_id` → `rehydrate_from_response` | Full HTTP round-trip, N-1 chained responses |
| `response_rehydration/streaming/turns N` | same, SSE | Same, streaming response |
| `proxy/non_stream` | store=false | Direct vLLM proxy overhead, no DB |
| `proxy/stream` | store=false, SSE | Same, streaming |

Per-turn timing only; prior turns are seeded before criterion starts, matching the methodology of executor_throughput.

`conversation_rehydration`

| Turn | Non-streaming (median) | Non-streaming CI (low–high) | Streaming (median) | Streaming CI (low–high) |
|---|---|---|---|---|
| 1 | 2.06 s | 1.61–2.50 s | 1.66 s | 1.37–1.97 s |
| 2 | 2.24 s | 1.91–2.57 s | 2.08 s | 1.78–2.33 s |
| 3 | 2.59 s | 2.23–2.98 s | 2.19 s | 1.80–2.63 s |
| 4 | 2.46 s | 2.08–2.85 s | 2.83 s | 2.33–3.31 s |
| 5 | 2.17 s | 1.68–2.62 s | 2.60 s | 2.14–3.15 s |
| 6 | 2.50 s | 2.03–2.98 s | 1.99 s | 1.72–2.31 s |
| 7 | 2.34 s | 1.93–2.81 s | 2.72 s | 2.24–3.25 s |
| 8 | 2.44 s | 2.03–2.85 s | 2.54 s | 2.16–2.94 s |
| 9 | 2.45 s | 1.88–3.09 s | 2.71 s | 2.30–3.16 s |
| 10 | 2.84 s | 2.52–3.21 s | 2.80 s | 2.43–3.17 s |

`response_rehydration`

| Turn | Non-streaming (median) | Non-streaming CI (low–high) | Streaming (median) | Streaming CI (low–high) |
|---|---|---|---|---|
| 1 | 2.85 s | 2.52–3.13 s | 2.98 s | 2.53–3.47 s |
| 2 | 2.32 s | 2.06–2.59 s | 2.67 s | 2.37–2.96 s |
| 3 | 2.77 s | 2.56–3.04 s | 2.33 s | 1.84–2.87 s |
| 4 | 2.79 s | 2.46–3.10 s | 2.43 s | 2.16–2.70 s |
| 5 | 2.23 s | 1.78–2.76 s | 2.40 s | 2.12–2.67 s |
| 6 | 2.13 s | 1.78–2.48 s | 2.94 s | 2.34–3.58 s |
| 7 | 2.03 s | 1.75–2.34 s | 2.90 s | 2.45–3.54 s |
| 8 | 2.66 s | 2.26–3.10 s | 1.81 s | 1.38–2.32 s |
| 9 | 2.69 s | 2.43–2.96 s | 1.75 s | 1.44–2.03 s |
| 10 | 2.42 s | 2.12–2.73 s | 2.44 s | 2.18–2.74 s |

Analysis

Median latency on both paths stays under 3 s at every turn count (the highest median is 2.98 s), although several upper CI bounds exceed 3 s. Gateway overhead (DB reads, rehydration, prompt reconstruction) is not the bottleneck; LLM inference time dominates, which is why confidence intervals are wide (often spanning ~0.5–1 s) even with 10 samples.

conversation_rehydration shows a mild upward trend with turn count. Non-streaming median grows from 2.06 s at turn 1 (no prior context) to 2.84 s at turn 10 (9 prior turns rehydrated), an incremental cost of roughly 0.09 s per prior turn, although the intermediate medians are not monotonic (turn 5 is 2.17 s). The trend is expected: each additional turn adds to the reconstructed prompt sent to the LLM. At this rate the added latency stays small relative to inference time over the conversation lengths measured here.
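The per-turn figure can be reproduced from the table with a simple endpoint estimate, that is, the difference between the last and first medians divided by the number of prior turns added. The helper below is a hypothetical sketch for checking that arithmetic and is not part of the benchmark harness.

```rust
/// Non-streaming `conversation_rehydration` medians for turns 1..=10, in seconds.
const CONVERSATION_NON_STREAMING: [f64; 10] =
    [2.06, 2.24, 2.59, 2.46, 2.17, 2.50, 2.34, 2.44, 2.45, 2.84];

/// Average change in median latency per additional prior turn,
/// estimated from the first and last entries only.
/// Returns `None` when fewer than two medians are given.
fn per_turn_cost(medians_s: &[f64]) -> Option<f64> {
    let first = medians_s.first()?;
    let last = medians_s.last()?;
    let prior_turns = medians_s.len().checked_sub(1).filter(|n| *n > 0)?;
    Some((last - first) / prior_turns as f64)
}
```

For the medians above this is (2.84 − 2.06) / 9 ≈ 0.087 s per prior turn. An endpoint estimate ignores the noise in the intermediate turns, so it should be read as a rough figure only.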

response_rehydration shows no clear trend with chain depth. Medians are flat and noisy across turns (2.03–2.85 s non-streaming), which suggests the DB fetch overhead per chained response is negligible relative to inference variance.

Streaming and non-streaming medians are comparable per turn. Because benchmarks seed prior turns before Criterion starts timing, each measurement isolates a single turn's round-trip cost. Given the width of the confidence intervals, the similarity suggests that SSE framing adds no meaningful overhead on the gateway side.

conversation_rehydration is modestly faster than response_rehydration at low turn counts (2.06 s vs 2.85 s at turn 1). At higher turn counts they converge. The turn-1 gap reflects the fact that response rehydration always issues a DB lookup for the prior response even on the first measured turn, while conversation rehydration with no prior context skips that step.

Signed-off-by: maral <maralbahari.98@gmail.com>

Add executor module: rehydration, LLM inference, SSE accumulation, and persistence for both conversation and response stateful flows. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: maral <maralbahari.98@gmail.com>

maralbahari and others added 24 commits May 21, 2026 12:02

feat: implement storage CRUD layer with SQLx and benchmarks

b0272b4

Signed-off-by: maral <maralbahari.98@gmail.com>

use rust criterion for benchmarking and clean code

473485d

Signed-off-by: maral <maralbahari.98@gmail.com>

clean code

0bb4eb8

Signed-off-by: maral <maralbahari.98@gmail.com>

cover more unit tests and add integration tests

9ad6c66

Signed-off-by: maral <maralbahari.98@gmail.com>

Merge remote-tracking branch 'origin/main' into impl-database-crud

c6b9968

Signed-off-by: maral <maralbahari.98@gmail.com>

avoid unnecessary clone

b681865

Signed-off-by: maral <maralbahari.98@gmail.com>

move integration test in agentic-core

4a29e6f

Signed-off-by: maral <maralbahari.98@gmail.com>

fix multi-thread unit test and clean the main cargo.toml

0f9d2c3

Signed-off-by: maral <maralbahari.98@gmail.com>

fix cargo clippy

3ec60a2

Signed-off-by: maral <maralbahari.98@gmail.com>

fix clippy errors in benchmark

06be1e1

Signed-off-by: maral <maralbahari.98@gmail.com>

Merge remote-tracking branch 'origin/main' into agentic-core-executor

fdf5696

Signed-off-by: maral <maralbahari.98@gmail.com>

add integration test based on pre-recorded cassets from openai

89f8f8c

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: maral <maralbahari.98@gmail.com>

clean code and fix cargo clippy

b024ef2

Signed-off-by: maral <maralbahari.98@gmail.com>

improve error handling

da38e55

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: maral <maralbahari.98@gmail.com>

improve apis

8d2d843

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: maral <maralbahari.98@gmail.com>

simplify call_inference

080aabe

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: maral <maralbahari.98@gmail.com>

fix benchmarking to record per turn pref and merge all benches into s…

04a0271

…ame entry Signed-off-by: maral <maralbahari.98@gmail.com>

Merge remote-tracking branch 'origin/main' into agentic-core-executor

2d17980

fix cargo clippy

fc25c96

Signed-off-by: maral <maralbahari.98@gmail.com>

clean code

734cae1

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: maral <maralbahari.98@gmail.com>

feat: wire stateful executor into agentic-server gateway

95cfe77

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: maral <maralbahari.98@gmail.com>

write example for stateful gateway

6ee0e01

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: maral <maralbahari.98@gmail.com>

fix benchmark

351759a

Signed-off-by: maral <maralbahari.98@gmail.com>

maralbahari changed the title ~~[FEAT] agentic-server stateful conversation/responses endpoints.~~ feat: agentic-server stateful conversation/responses endpoints. Jun 10, 2026

maralbahari wants to merge 24 commits into vllm-project:main from EmbeddedLLM:agentic-server-stateful
