feat: chat KV cache hardening — multi-session + overflow safety + metrics#49
Merged
feat: chat KV cache hardening — multi-session + overflow safety + metrics#49
Conversation
Follow-up to PR #48 (chat KV cache reuse). Audited the implementation and addressed 4 P0/P1 fragility points found in production-like use: 1. **Multi-session safety (P0)** — quant-server held a single global KV state. Two concurrent chat clients would corrupt each other's cache. Now there's a per-session table (MAX_SESSIONS=16) keyed by the OpenAI-compatible "user" field in the request body. Sessions are LRU-evicted when full. Each session has its own kv_state, cached_tokens, last_used. Default session ("default") preserves the original single-client behavior. 2. **Heap-allocate prompt buffer (P0)** — tq_generate_continue used `int new_tokens[4096]` on the stack, which silently truncated prompts longer than 4096 tokens. Replaced with malloc up to model->config.max_seq_len. realloc failure paths now free the heap buffer before returning -1. 3. **Sliding window on overflow (P1)** — when n_new + max_tokens would exceed max_seq_len, drop the oldest prompt tokens, keep the most recent (max_seq_len - max_tokens - 32) tokens, and force a full reprefill since the prefix shifted. Prevents silent failure / generation truncation. 4. **Cache hit metrics (P1)** — TQ_CHAT_DEBUG=1 env var prints per-call metrics: prefix_hit (LCP length), prefill (new tokens processed), generated, cached. Useful for diagnosing chat clients with poor cache reuse. Verified end-to-end with 2 concurrent sessions: alice cold: 334 ms bob cold: 78 ms (separate session, no cache pollution) alice 2nd: 78 ms (alice's cache survived bob's calls) bob 2nd: 76 ms ... (all subsequent calls ~75-82 ms across both sessions) Known limitation: assistant response tokens generated by sample_topp do not always match the BPE re-tokenization of the same response text in subsequent prompts. This caps the per-turn LCP at the prompt boundary. Real fix is server-side text-prefix matching (cache the last prompt text and tokenize only the suffix), tracked for the next round. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2e93b36 to
ee0b3e6
Compare
unamedkr
added a commit
that referenced
this pull request
Apr 12, 2026
Follow-up to PR #49. The token-level LCP path in tq_generate_continue has a fundamental limitation: model-generated tokens (sample_topp) and text-encoded tokens (tq_encode of the response in the next turn) can diverge due to BPE merge non-roundtripping. This caps per-turn LCP at the prompt boundary (~10 tokens), so longer histories still incur mostly-full reprefill. Fix: tq_generate_chat_text() — text-level prefix matching. How it works: 1. Each session stores the entire prompt+response text from the previous call (cached_text). 2. On a new request, check if the new prompt starts with cached_text byte-for-byte. If yes, the cached state is byte-equivalent valid. 3. Tokenize ONLY the suffix (new_prompt[strlen(cached_text):]) and prefill those tokens at positions [n_cached..n_cached + n_suffix). 4. Run generation. The accumulated output text gets appended to cached_text via a tee callback for the next call. 5. If text prefix doesn't match, fall back to tq_generate_continue (token LCP path). Bug fix bundled: json_find_key("user") was matching the value in {"role":"user"} instead of the top-level "user" key. Result: every request used the "default" session, so multi-session was effectively broken (cross-pollution). The fix scans for "key": (with colon) to disambiguate from value matches. Measured (SmolLM2-135M, single thread, real chat replay): Single user, 10-turn accumulation: PR #48 (token LCP only): turn 10 → 3700 ms PR #49 (above + multi-session): turn 10 → 3700 ms (LCP still capped) This PR (text-prefix path): turn 10 → 739 ms (5x) alice + bob interleaved, 5 turns each (real assistant replay): PR #49: alice 5 = 2412 ms, bob 5 = 2357 ms Now: alice 5 = 498 ms, bob 5 = 462 ms (5x) The growth that remains (~50ms/turn) is the unavoidable O(n) cost of the attention computation over the full context — KV prefill is now truly O(new tokens per turn), not O(full history per turn). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr
added a commit
that referenced
this pull request
Apr 12, 2026
…#50) Follow-up to PR #49. The token-level LCP path in tq_generate_continue has a fundamental limitation: model-generated tokens (sample_topp) and text-encoded tokens (tq_encode of the response in the next turn) can diverge due to BPE merge non-roundtripping. This caps per-turn LCP at the prompt boundary (~10 tokens), so longer histories still incur mostly-full reprefill. Fix: tq_generate_chat_text() — text-level prefix matching. How it works: 1. Each session stores the entire prompt+response text from the previous call (cached_text). 2. On a new request, check if the new prompt starts with cached_text byte-for-byte. If yes, the cached state is byte-equivalent valid. 3. Tokenize ONLY the suffix (new_prompt[strlen(cached_text):]) and prefill those tokens at positions [n_cached..n_cached + n_suffix). 4. Run generation. The accumulated output text gets appended to cached_text via a tee callback for the next call. 5. If text prefix doesn't match, fall back to tq_generate_continue (token LCP path). Bug fix bundled: json_find_key("user") was matching the value in {"role":"user"} instead of the top-level "user" key. Result: every request used the "default" session, so multi-session was effectively broken (cross-pollution). The fix scans for "key": (with colon) to disambiguate from value matches. Measured (SmolLM2-135M, single thread, real chat replay): Single user, 10-turn accumulation: PR #48 (token LCP only): turn 10 → 3700 ms PR #49 (above + multi-session): turn 10 → 3700 ms (LCP still capped) This PR (text-prefix path): turn 10 → 739 ms (5x) alice + bob interleaved, 5 turns each (real assistant replay): PR #49: alice 5 = 2412 ms, bob 5 = 2357 ms Now: alice 5 = 498 ms, bob 5 = 462 ms (5x) The growth that remains (~50ms/turn) is the unavoidable O(n) cost of the attention computation over the full context — KV prefill is now truly O(new tokens per turn), not O(full history per turn). Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Follow-up to PR #48. Audited the chat KV cache reuse implementation and addressed 4 fragility points.
Changes
Multi-session test
```
alice cold: 334 ms (first call, prefill)
bob cold: 78 ms (separate session, no cross-pollution)
alice 2nd: 78 ms (alice's cache survived bob's calls)
bob 2nd: 76 ms
... (all subsequent calls 75-82 ms across both sessions)
default: 77 ms (third independent session)
```
Known limitation
Assistant response tokens generated by sample_topp don't always match BPE re-tokenization of the same text in subsequent prompts. This caps per-turn LCP at the prompt boundary. Server-side text-prefix matching (cache last prompt text, tokenize only the suffix) is the right fix and tracked for the next round.
🤖 Generated with Claude Code