Context
myia_vllm/middleware/error_source_capture.py (added 2026-05-16) captures one JSONL line per /v1/* request with client IP, User-Agent, forwarding headers, model, and 1.5 KB body_head + 1.5 KB body_tail as a privacy/cost trade-off. Wired opt-in in both medium-qwen36-moe.yml and medium-qwen36-genesis-tq.yml.
This is fine for short-term forensic capture (identifying the source of stale qwen3.5-35b-a3b 404s, characterizing legitimate traffic). It is suboptimal for long-term forensic retention because consecutive chat-completion requests from the same session share ~80-95% of their body (long system prompt + chat history) and the truncated head/tail re-logs that prefix every turn.
Proposal
Make the middleware diff-aware along the same axis vLLM itself uses internally: the radix-tree token prefix.
Per-request output:
- session_key: hash of (client_ip, model, User-Agent, x-anthropic-session-id?), a stable identifier across consecutive turns of the same conversation.
- prefix_len: number of bytes (or, ideally, tokens) shared with the last body under the same session_key.
- delta_head: first N bytes starting at the point where the new body diverges from the previous one (the new user turn or appended tool result).
- body_tail: last N bytes (always useful: sampling params at end, latest message).
- Bonus: prefix_cache_hit_rate for that request, if vLLM exposes it via the engine logger (currently logged only as an aggregate in loggers.py:271).
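A minimal sketch of how one such record could be built, assuming the previous body for the session is already at hand. The function names, the 16-character key length, the field separator, and the 1536-byte window are illustrative assumptions, not part of the existing middleware.

```python
import hashlib

SCHEMA_VERSION = 2
N = 1536  # assumed capture window in bytes, mirroring the current 1.5 KB head/tail


def session_key(client_ip: str, model: str, user_agent: str, session_id: str = "") -> str:
    """Hash the identifying fields into a short, stable key (hypothetical format)."""
    raw = "\x1f".join([client_ip, model, user_agent, session_id])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the longest common byte prefix of a and b.

    Invariant: a[:lo] == b[:lo]. Each step compares one slice (done in C)
    and either advances lo or shrinks hi.
    """
    lo, hi = 0, min(len(a), len(b))
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if a[lo:mid] == b[lo:mid]:
            lo = mid
        else:
            hi = mid - 1
    return lo


def build_record(key: str, prev_body, body: bytes, n: int = N) -> dict:
    """Build one schema-v2 record; prev_body is None on the first turn of a session."""
    prefix_len = common_prefix_len(prev_body, body) if prev_body is not None else 0
    return {
        "schema_version": SCHEMA_VERSION,
        "session_key": key,
        "prefix_len": prefix_len,
        "delta_head": body[prefix_len:prefix_len + n].decode("utf-8", "replace"),
        "body_tail": body[-n:].decode("utf-8", "replace"),
    }
```

On a typical follow-up turn the previous body and the new one diverge where the messages array used to close, so delta_head starts at the appended message.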
Storage:
- Local JSONL stays small: a steady-state turn appends ~500 bytes of delta rather than ~3 KB of redundant head.
- Compress + ship to GDrive (5 TB subscription is half-empty per user note 2026-05-16) on rotation.
- logrotate style: gzip on file roll, sync to gdrive:/myia_vllm/access_logs/YYYY-MM/.
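The rotation step could look roughly like the sketch below. It assumes the rolled file is compressed in place and then handed to rclone; the issue only gives the gdrive:/ destination, so the choice of rclone and both function names are assumptions. The command is built but not executed here.

```python
import gzip
import shutil
from datetime import date
from pathlib import Path

REMOTE_ROOT = "gdrive:/myia_vllm/access_logs"  # destination taken from the issue text


def gzip_rotated(path: Path) -> Path:
    """Compress a rolled JSONL file to <name>.gz and remove the original."""
    gz_path = path.with_name(path.name + ".gz")
    with path.open("rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    path.unlink()
    return gz_path


def sync_command(gz_path: Path, day: date) -> list[str]:
    """Build (not run) an rclone invocation targeting the YYYY-MM folder.

    Assumption: the sync tool is rclone with a remote named "gdrive".
    """
    return ["rclone", "copy", str(gz_path), f"{REMOTE_ROOT}/{day:%Y-%m}/"]
```

The returned list can be passed to subprocess.run from whatever hook performs the file roll.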
Why "radix tree"
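vLLM's V1 engine prefix caching already tracks hashed token prefixes: each KV-cache block is hashed together with the blocks that precede it (xxhash blocks, --prefix-caching-hash-algo xxhash in our profile), so the cache behaves like a radix tree over token prefixes even though it is stored as a chain of block hashes. The hit rate logged in loggers.py:271 ("Prefix cache hit rate: 81.0%" in our prod load) indicates that about 81% of queried prompt tokens were served from a cached prefix, which is consistent with consecutive turns sharing most of their token prefix. The middleware would mirror that structure at the byte level (per-session radix), one level above what the engine tracks. No engine modification needed.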
Why this matters
- Long-term traffic forensics without log bloat: steady-state ~500 B/req instead of ~3 KB/req, roughly a 6x reduction.
- Full body is reconstructable from the (previous body) ⊕ (delta_head) chain rooted in the first body of the session, so no information is lost as long as each delta fits within the captured delta_head and the first body of the session is logged in full.
- Aligns with how Anthropic / z.ai presumably do this internally (per user observation).
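The reconstruction chain can be sketched in a few lines. This assumes every record's delta_head holds the complete delta (nothing truncated), that prefix_len counts bytes, and that delta_head round-trips through UTF-8; the function name and its arguments are hypothetical.

```python
import json


def reconstruct(jsonl_lines, session_key: str, n: int) -> bytes:
    """Return the n-th (0-based) full body of a session by replaying its deltas.

    Each step keeps the first prefix_len bytes of the previous body and
    appends the logged delta.
    """
    body = b""
    seen = 0
    for line in jsonl_lines:
        rec = json.loads(line)
        if rec.get("session_key") != session_key:
            continue  # records from other sessions are interleaved in the log
        body = body[:rec["prefix_len"]] + rec["delta_head"].encode("utf-8")
        if seen == n:
            return body
        seen += 1
    raise IndexError(f"session {session_key!r} has only {seen} records")
```

Replaying from the session root is linear in the number of turns, which should be acceptable for offline forensics.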
What this requires
- In-memory cache {session_key → (last_body_hash, last_body, last_seen_ts)} with TTL eviction.
- Definition of session_key. First pass: (client_ip, model, User-Agent). Future: parse x-anthropic-session-id, x-conversation-id, OWUI chat IDs, etc.
- Longest common prefix computation (byte-level is enough; token-level requires running the tokenizer, which adds CPU cost and a tokenizer dependency in the middleware).
- Output schema versioning (schema_version: 2 to distinguish from the current truncated-head/tail format).
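One possible shape for the in-memory cache is sketched below, assuming a single-process middleware and an insertion-ordered dict. The class name, the default TTL, and the max_sessions cap are illustrative assumptions; the cap is added here only to bound memory, since full bodies are retained.

```python
import hashlib
import time
from collections import OrderedDict
from typing import Callable, NamedTuple, Optional


class Entry(NamedTuple):
    last_body_hash: str
    last_body: bytes
    last_seen_ts: float


class SessionCache:
    """session_key -> Entry, with TTL eviction and a hard cap on sessions."""

    def __init__(self, ttl_s: float = 1800.0, max_sessions: int = 1024,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._max = max_sessions
        self._clock = clock
        self._entries: "OrderedDict[str, Entry]" = OrderedDict()

    def _evict(self, now: float) -> None:
        # put() re-inserts keys at the end, so the front is always the stalest.
        while self._entries:
            key, entry = next(iter(self._entries.items()))
            expired = now - entry.last_seen_ts > self._ttl_s
            if not expired and len(self._entries) <= self._max:
                break
            del self._entries[key]

    def get(self, key: str) -> Optional[bytes]:
        """Return the last body seen for key, or None if absent or expired."""
        self._evict(self._clock())
        entry = self._entries.get(key)
        return entry.last_body if entry else None

    def put(self, key: str, body: bytes) -> None:
        """Record body as the latest one for key and refresh its timestamp."""
        now = self._clock()
        self._entries.pop(key, None)
        self._entries[key] = Entry(hashlib.sha256(body).hexdigest(), body, now)
        self._evict(now)
```

Injecting the clock keeps eviction testable without sleeping; a concurrent deployment would additionally need a lock around get and put.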
Out of scope (for this issue)
- Token-level diff (vs byte-level). Add later if byte-level proves insufficient.
- Encryption at rest for the JSONL files. Privacy currently relies on the file being host-only.
- Streaming response capture — only request body is in scope.
Files
Acceptance criteria
- New middleware module (e.g. myia_vllm/middleware/session_diff_capture.py) co-existing with the current truncating one; pick at deploy time via --middleware ....
- A jq recipe in the docstring that reconstructs the Nth full body from the chain.
- Bench: middleware overhead under sustained 10-concurrency load adds ≤ 1% latency on the request path (don't reuse the old logging_middleware.RequestResponseLogger, which cost 40-65% of throughput).