[myia_vllm] Diff-aware session logging for error_source_capture middleware #8

## Context

`myia_vllm/middleware/error_source_capture.py` (added 2026-05-16) captures one JSONL line per `/v1/*` request with `client IP`, `User-Agent`, forwarding headers, `model`, and **1.5 KB body_head + 1.5 KB body_tail** as a privacy/cost trade-off. Wired opt-in in both `medium-qwen36-moe.yml` and `medium-qwen36-genesis-tq.yml`.

This is fine for short-term forensic capture (identifying the source of stale `qwen3.5-35b-a3b` 404s, characterizing legitimate traffic). It is **suboptimal for long-term forensic retention** because consecutive chat-completion requests from the same session share ~80-95% of their body (long system prompt + chat history) and the truncated head/tail re-logs that prefix every turn.

## Proposal

Make the middleware **diff-aware** along the same axis vLLM itself uses internally: the **radix-tree token prefix**.

Per-request output:
- `session_key`: hash of `(client_ip, model, User-Agent, x-anthropic-session-id?)` — a stable identifier across consecutive turns of the same conversation.
- `prefix_len`: number of bytes (or, ideally, tokens) shared with the last body under the same `session_key`.
- `delta_head`: first N bytes where the new body diverges from the previous one (the new user turn or appended tool result).
- `body_tail`: last N bytes (always useful — sampling params at end, latest message).
- Bonus: `prefix_cache_hit_rate` for that request if vLLM exposes it via the engine logger (currently logged aggregate only in `loggers.py:271`).
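
A minimal sketch of how one such record could be assembled, assuming the previous body for the session is already at hand. The helper names (`session_key`, `build_record`), the 16-hex-digit key length, the `body_len` field, and the `N = 1536` default are illustrative assumptions, not part of the existing middleware.

```python
import hashlib
import os.path
from typing import Optional


def session_key(client_ip: str, model: str, user_agent: str,
                session_id: Optional[str] = None) -> str:
    """Hash the identifying tuple into a short, stable key (hypothetical helper)."""
    # The unit separator keeps ("a", "bc") distinct from ("ab", "c").
    joined = "\x1f".join([client_ip, model, user_agent, session_id or ""])
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]


def build_record(key: str, prev_body: Optional[bytes], body: bytes,
                 n: int = 1536) -> dict:
    """Build one diff-aware log record; n caps delta_head and body_tail."""
    if prev_body is None:
        prefix_len = 0  # first body seen for this session
    else:
        # os.path.commonprefix compares element-wise and accepts bytes.
        prefix_len = len(os.path.commonprefix([prev_body, body]))
    return {
        "schema_version": 2,
        "session_key": key,
        "body_len": len(body),  # assumed extra field, useful for sanity checks
        "prefix_len": prefix_len,
        "delta_head": body[prefix_len:prefix_len + n].decode("utf-8", "replace"),
        "body_tail": body[-n:].decode("utf-8", "replace"),
    }
```

Because `prefix_len` is a byte offset, the cut can land inside a multi-byte UTF-8 character; the sketch decodes with `errors="replace"` rather than failing, which is fine for inspection but not for byte-exact reconstruction.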

Storage:
- Local JSONL stays small: a steady-state turn appends ~500 bytes of delta rather than the current ~3 KB head/tail record, most of which repeats the previous turn.
- Compress + ship to GDrive (5 TB subscription is half-empty per user note 2026-05-16) on rotation. `logrotate` style: gzip on file roll, sync to `gdrive:/myia_vllm/access_logs/YYYY-MM/`.
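
The rotation step could look roughly like the sketch below. It covers only the gzip-on-roll part and the month-based destination path; the upload itself is left out because the issue does not name the sync tool. Both function names are hypothetical.

```python
import gzip
import shutil
from datetime import date
from pathlib import Path


def gzip_rotated(path: Path) -> Path:
    """Compress a rolled JSONL file to <name>.gz and remove the original."""
    dest = path.with_name(path.name + ".gz")
    with path.open("rb") as src, gzip.open(dest, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks, no full read into RAM
    path.unlink()
    return dest


def remote_dir(when: date, base: str = "gdrive:/myia_vllm/access_logs") -> str:
    """Return the YYYY-MM destination folder for a file rolled on `when`."""
    return f"{base}/{when:%Y-%m}/"
```

Deleting the original only after the gzip stream is closed means a crash mid-compression leaves the uncompressed file in place.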

## Why "radix tree"

vLLM's V1 engine prefix caching already indexes KV-cache blocks by a chained hash of their token prefix (xxhash blocks, `--prefix-caching-hash-algo xxhash` in our profile), which behaves like a radix tree over block-sized token chunks. The aggregate hit rate logged in `loggers.py:271` ("Prefix cache hit rate: 81.0%" under our prod load) indicates that roughly 81% of queried prompt tokens were already cached, consistent with consecutive turns sharing most of their token prefix. The middleware would mirror that structure at the **byte** level (per-session radix), one level above what the engine tracks. No engine modification needed.

## Why this matters

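- Long-term traffic forensics without log bloat. Steady-state ~500 B/req instead of ~3 KB/req, roughly a 6× reduction.
- Full bodies stay reconstructable by replaying the `(previous body) ⊕ (delta_head)` chain from the first body of the session, provided the first body and every divergent suffix are stored in full. If `delta_head` and `body_tail` are both capped at N bytes, a turn that adds more than they cover is only partially recoverable.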
- Aligns with how Anthropic / z.ai presumably do this internally (per user observation).
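
To make the reconstruction claim concrete, here is a hedged sketch of replaying a chain. It assumes each record carries the complete divergent suffix in a hypothetical `delta` field (as bytes, in memory) plus a hypothetical `body_len` field for verification; how `delta` would be encoded in JSONL is left open.

```python
import os.path
from typing import Optional


def make_record(prev: Optional[bytes], body: bytes) -> dict:
    """Lossless record: the full suffix after the shared prefix (assumed schema)."""
    prefix_len = 0 if prev is None else len(os.path.commonprefix([prev, body]))
    return {"prefix_len": prefix_len, "delta": body[prefix_len:],
            "body_len": len(body)}


def reconstruct(records: list, n: int) -> bytes:
    """Rebuild the body of the n-th record (0-based) of a single session."""
    if not 0 <= n < len(records):
        raise IndexError("record index out of range")
    body = b""
    for rec in records[: n + 1]:
        if rec["prefix_len"] > len(body):
            raise ValueError("chain broken: prefix exceeds previous body")
        # Keep the shared prefix, then append what diverged on this turn.
        body = body[: rec["prefix_len"]] + rec["delta"]
        if len(body) != rec["body_len"]:
            raise ValueError("delta truncated: body cannot be rebuilt exactly")
    return body
```

The `body_len` check is what turns a silently wrong reconstruction into an explicit error when a delta was capped.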

## What this requires

- In-memory cache `{session_key → (last_body_hash, last_body, last_seen_ts)}` with TTL eviction.
- Definition of `session_key`. First pass: `(client_ip, model, User-Agent)`. Future: parse `x-anthropic-session-id`, `x-conversation-id`, OWUI chat IDs, etc.
- Longest common prefix computation (byte-level is enough; token-level requires running the tokenizer, which adds CPU cost and a tokenizer dependency in the middleware).
- Output schema versioning (`schema_version: 2` to distinguish from current truncated-head/tail format).
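
The first and third requirements can be sketched together as follows. This is an illustration under assumptions, not the planned implementation: the class and function names, the 30-minute TTL, the entry cap, and the 4 KiB chunk size are all made up here, and the `last_body_hash` slot is omitted for brevity.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class SessionCache:
    """Last body per session_key, with TTL eviction and a size cap."""

    def __init__(self, ttl_s: float = 1800.0, max_entries: int = 1024,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._max_entries = max_entries
        self._clock = clock
        self._entries: "OrderedDict[str, tuple]" = OrderedDict()

    def swap(self, key: str, body: bytes) -> Optional[bytes]:
        """Store body for key; return the previous live body, or None."""
        now = self._clock()
        # Every write re-appends its key, so the front is always the stalest.
        while self._entries:
            oldest_key, (_, seen) = next(iter(self._entries.items()))
            if now - seen <= self._ttl_s:
                break
            del self._entries[oldest_key]
        prev = self._entries.pop(key, None)
        self._entries[key] = (body, now)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)
        return prev[0] if prev else None


def lcp_len(a: bytes, b: bytes, chunk: int = 4096) -> int:
    """Byte-level longest common prefix length."""
    n = min(len(a), len(b))
    i = 0
    # Whole-chunk slice comparisons run in C and skip most of a long prefix.
    while i + chunk <= n and a[i:i + chunk] == b[i:i + chunk]:
        i += chunk
    # Finish byte by byte inside the first chunk that differs.
    while i < n and a[i] == b[i]:
        i += 1
    return i
```

Holding full bodies in memory costs up to `max_entries` times the largest body, so the cap matters as much as the TTL on a host that serves long-context requests.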

## Out of scope (for this issue)

- Token-level diff (vs byte-level). Add later if byte-level proves insufficient.
- Encryption at rest for the JSONL files. Privacy currently relies on the file being host-only.
- Streaming response capture — only request body is in scope.

## Files

- [myia_vllm/middleware/error_source_capture.py](https://github.com/jsboige/vllm/blob/main/myia_vllm/middleware/error_source_capture.py)
- [myia_vllm/configs/docker/profiles/medium-qwen36-moe.yml](https://github.com/jsboige/vllm/blob/main/myia_vllm/configs/docker/profiles/medium-qwen36-moe.yml)
- [myia_vllm/configs/docker/profiles/medium-qwen36-genesis-tq.yml](https://github.com/jsboige/vllm/blob/main/myia_vllm/configs/docker/profiles/medium-qwen36-genesis-tq.yml)

## Acceptance criteria

- New middleware module (e.g. `myia_vllm/middleware/session_diff_capture.py`) co-existing with the current truncating one — pick at deploy time via `--middleware ...`.
- A `jq` recipe in the docstring that reconstructs the Nth full body from the chain.
- Bench: middleware overhead under sustained 10-conc load ≤ 1% latency added on the request path (don't reuse the old `logging_middleware.RequestResponseLogger`, which cost 40-65% of throughput).
