[myia_vllm] Diff-aware session logging for error_source_capture middleware #8

@jsboige

Context

myia_vllm/middleware/error_source_capture.py (added 2026-05-16) captures one JSONL line per /v1/* request with client IP, User-Agent, forwarding headers, and model. The body is truncated to a 1.5 KB body_head plus a 1.5 KB body_tail as a privacy/cost trade-off. The middleware is wired as opt-in in both medium-qwen36-moe.yml and medium-qwen36-genesis-tq.yml.

This is fine for short-term forensic capture (identifying the source of stale qwen3.5-35b-a3b 404s, characterizing legitimate traffic). It is suboptimal for long-term forensic retention because consecutive chat-completion requests from the same session share ~80-95% of their body (long system prompt + chat history) and the truncated head/tail re-logs that prefix every turn.

Proposal

Make the middleware diff-aware along the same axis vLLM itself uses internally: the radix-tree token prefix.

Per-request output:

  • session_key: hash of (client_ip, model, User-Agent, x-anthropic-session-id?) — a stable identifier across consecutive turns of the same conversation.
  • prefix_len: number of bytes (or, ideally, tokens) shared with the last body under the same session_key.
  • delta_head: first N bytes starting at the offset where the new body diverges from the previous one (the new user turn or appended tool result).
  • body_tail: last N bytes (always useful — sampling params at end, latest message).
  • Bonus: prefix_cache_hit_rate for that request, if vLLM exposes it via the engine logger (it is currently logged only as an aggregate, in loggers.py:271).
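
As a rough illustration of the fields above, here is a minimal sketch of how one record could be assembled. The function names, the 16-character key length, the 1536-byte window and the extra body_len field are assumptions made for this example, not part of the existing middleware.

```python
import hashlib

N = 1536  # assumed capture window, matching the current 1.5 KB head/tail


def session_key(client_ip, model, user_agent, session_id=""):
    """Hash the identifying fields into a short, stable session identifier."""
    raw = "\x1f".join([client_ip, model, user_agent, session_id])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b have in common."""
    limit = min(len(a), len(b))
    i = 0
    while i < limit and a[i] == b[i]:
        i += 1
    return i


def build_record(key, prev_body, body, n=N):
    """Build one schema_version 2 record; prev_body is None on the first turn."""
    prefix_len = common_prefix_len(prev_body, body) if prev_body else 0
    return {
        "schema_version": 2,
        "session_key": key,
        "prefix_len": prefix_len,
        "body_len": len(body),  # hypothetical extra field, handy for reconstruction
        # Offsets are in bytes, so a cut may land inside a multi-byte character;
        # errors="replace" keeps the JSONL line valid in that case.
        "delta_head": body[prefix_len:prefix_len + n].decode("utf-8", "replace"),
        "body_tail": body[-n:].decode("utf-8", "replace"),
    }
```

The plain Python loop in common_prefix_len is the simplest correct form; whether it is fast enough for large bodies is something the benchmark in the acceptance criteria would have to show.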

Storage:

  • Local JSONL stays small: a steady-state turn appends ~500 bytes of delta rather than the current ~3 KB of mostly redundant head + tail.
  • Compress + ship to GDrive (5 TB subscription is half-empty per user note 2026-05-16) on rotation. logrotate style: gzip on file roll, sync to gdrive:/myia_vllm/access_logs/YYYY-MM/.
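
The gzip-on-roll step could look roughly like the sketch below. The helper names are hypothetical, and the upload itself is left out because the issue does not name the sync tool; only the compression and the YYYY-MM destination path are shown.

```python
import gzip
import shutil
from pathlib import Path


def compress_rotated(path: Path) -> Path:
    """Gzip a rolled JSONL file next to the original, then delete the original."""
    gz_path = path.with_name(path.name + ".gz")
    with path.open("rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    path.unlink()
    return gz_path


def remote_dir(base: str, year: int, month: int) -> str:
    """Destination folder for one month of logs, e.g. <base>/2026-05/."""
    return f"{base}/{year:04d}-{month:02d}/"
```

For example, remote_dir("gdrive:/myia_vllm/access_logs", 2026, 5) yields gdrive:/myia_vllm/access_logs/2026-05/.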

Why "radix tree"

vLLM's V1 prefix caching already indexes hashed token-block prefixes (xxhash blocks, --prefix-caching-hash-algo xxhash in our profile). Strictly speaking it keeps a hash table of chained block hashes rather than a literal radix tree, but it encodes the same shared-prefix structure. The hit rate logged in loggers.py:271 ("Prefix cache hit rate: 81.0%" under our prod load) indicates that about 81% of queried prompt tokens were already cached, which is consistent with consecutive turns sharing most of their token prefix. The middleware would mirror that structure at the byte level (per-session radix), one level above what the engine tracks. No engine modification is needed.

Why this matters

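  • Long-term traffic forensics without log bloat. Steady-state ~500 B/req instead of ~3 KB/req, roughly a 6× reduction.
  • Reconstructable full body from the (previous body) ⊕ (delta_head) chain rooted in the first body of the session, so we don't actually lose information. This holds as long as the first body of each session is logged in full and each turn's new bytes fit within delta_head plus body_tail.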
  • Aligns with how Anthropic / z.ai presumably do this internally (per user observation).
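
A minimal sketch of the reconstruction chain, under stated assumptions: the record fields are treated as raw bytes here (a real JSONL reader would have to decode them first), and body_len is a hypothetical field holding the full body length so that delta_head and body_tail can be stitched together. Neither function exists in the current code.

```python
def reconstruct(prev_body: bytes, rec: dict) -> bytes:
    """Rebuild one body from the previous body and a diff record."""
    prefix = prev_body[:rec["prefix_len"]]
    head = rec["delta_head"]
    tail = rec["body_tail"]
    # Bytes of the new body not covered by the shared prefix or delta_head.
    missing = rec["body_len"] - len(prefix) - len(head)
    if missing == 0:
        return prefix + head
    if missing <= len(tail):
        return prefix + head + tail[-missing:]
    raise ValueError("delta exceeds delta_head + body_tail; body not recoverable")


def replay(first_body: bytes, records: list) -> list:
    """Walk the chain from the first full body; returns every body in order."""
    bodies = [first_body]
    for rec in records:
        bodies.append(reconstruct(bodies[-1], rec))
    return bodies
```

The ValueError branch makes the lossy case explicit: a turn whose new content is larger than the two capture windows combined cannot be rebuilt, and neither can any later turn in that chain.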

What this requires

  • In-memory cache {session_key → (last_body_hash, last_body, last_seen_ts)} with TTL eviction.
  • Definition of session_key. First pass: (client_ip, model, User-Agent). Future: parse x-anthropic-session-id, x-conversation-id, OWUI chat IDs, etc.
  • Longest common prefix computation (byte-level is enough; token-level requires running the tokenizer, which adds CPU cost and a tokenizer dependency in the middleware).
  • Output schema versioning (schema_version: 2 to distinguish from current truncated-head/tail format).
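
The in-memory cache from the first bullet could be sketched as follows. The class name, the default TTL, the max_sessions cap and the injectable clock are assumptions chosen to keep the example small and testable, not decisions made in this issue.

```python
import hashlib
import time
from collections import OrderedDict


class SessionCache:
    """Maps session_key -> (last_body_hash, last_body, last_seen_ts)."""

    def __init__(self, ttl_seconds=900.0, max_sessions=1024, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._max = max_sessions
        self._clock = clock
        # Entries are re-inserted on every update, so iteration order is
        # oldest-seen first and eviction only ever looks at the front.
        self._entries = OrderedDict()

    def swap(self, key, body: bytes):
        """Store body as the latest for key; return the previous body or None."""
        now = self._clock()
        self._evict_expired(now)
        prev = self._entries.pop(key, None)
        self._entries[key] = (hashlib.sha256(body).hexdigest(), body, now)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # drop the least recently seen
        return prev[1] if prev else None

    def _evict_expired(self, now):
        while self._entries:
            _, (_, _, seen) = next(iter(self._entries.items()))
            if now - seen < self._ttl:
                break
            self._entries.popitem(last=False)

    def __len__(self):
        return len(self._entries)
```

Holding the full last_body per session bounds memory at roughly max_sessions times the largest body, so the cap matters as much as the TTL when prompts are long.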

Out of scope (for this issue)

  • Token-level diff (vs byte-level). Add later if byte-level proves insufficient.
  • Encryption at rest for the JSONL files. Privacy currently relies on the file being host-only.
  • Streaming response capture — only request body is in scope.

Files

Acceptance criteria

  • New middleware module (e.g. myia_vllm/middleware/session_diff_capture.py) co-existing with the current truncating one — pick at deploy time via --middleware ....
  • A jq recipe in the docstring that reconstructs the Nth full body from the chain.
  • Bench: under sustained load at 10 concurrent requests, the middleware adds ≤ 1% latency on the request path (don't reuse the old logging_middleware.RequestResponseLogger, which cost 40-65% of throughput).

Labels: enhancement