
feat: chat KV cache hardening — multi-session + overflow safety + metrics#49

Merged

unamedkr merged 1 commit into main from feat/chat-cache-hardening on Apr 11, 2026
Conversation

@unamedkr
Collaborator

Follow-up to PR #48. Audited the chat KV cache reuse implementation and addressed 4 fragility points.

Changes

| # | Issue | Severity | Fix |
|---|-------|----------|-----|
| 1 | Server held a single global KV state; concurrent clients corrupted each other's cache | 🔴 P0 | Per-session table (`MAX_SESSIONS=16`, LRU eviction, keyed by the OpenAI `user` field) |
| 2 | `int new_tokens[4096]` stack array silently truncated long prompts | 🔴 P0 | Heap-allocate up to `max_seq_len`; free on all paths |
| 3 | No handling when `prompt + max_tokens > max_seq_len` | 🟡 P1 | Sliding window: drop oldest tokens, keep recent, force reprefill |
| 4 | No way to diagnose poor cache hit rates | 🟡 P1 | `TQ_CHAT_DEBUG=1` env var prints prefix_hit / prefill / generated / cached per call |
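The per-session table in fix 1 can be sketched as below. This is an illustrative reconstruction, not the actual quant-server code: the `chat_session` struct, `session_lookup` name, and the monotonic-counter LRU are assumptions based on the description (16 slots, keyed by the `user` field, LRU-evicted when full).

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_SESSIONS 16
#define MAX_KEY_LEN  64

typedef struct {
    char     key[MAX_KEY_LEN];  /* OpenAI "user" field, or "default" */
    int      in_use;
    uint64_t last_used;         /* monotonic counter for LRU ordering */
    /* per-session cache state would live here:
     * void *kv_state; int *cached_tokens; int n_cached; ... */
} chat_session;

static chat_session sessions[MAX_SESSIONS];
static uint64_t     clock_tick = 0;

/* Return the session for `key`: reuse an existing slot, fill a free
 * one, or evict the least-recently-used slot when the table is full. */
chat_session *session_lookup(const char *key) {
    int free_slot = -1, lru_slot = 0;
    for (int i = 0; i < MAX_SESSIONS; i++) {
        if (sessions[i].in_use && strcmp(sessions[i].key, key) == 0) {
            sessions[i].last_used = ++clock_tick;
            return &sessions[i];                    /* hit */
        }
        if (!sessions[i].in_use && free_slot < 0) free_slot = i;
        if (sessions[i].last_used < sessions[lru_slot].last_used) lru_slot = i;
    }
    int slot = (free_slot >= 0) ? free_slot : lru_slot;  /* LRU eviction */
    memset(&sessions[slot], 0, sizeof sessions[slot]);
    snprintf(sessions[slot].key, MAX_KEY_LEN, "%s", key);
    sessions[slot].in_use = 1;
    sessions[slot].last_used = ++clock_tick;
    return &sessions[slot];
}
```

With this shape, alice and bob land in distinct slots, so neither call touches the other's KV state, and a repeat lookup for alice returns her original slot.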

Multi-session test

```
alice cold: 334 ms (first call, prefill)
bob cold: 78 ms (separate session, no cross-pollution)
alice 2nd: 78 ms (alice's cache survived bob's calls)
bob 2nd: 76 ms
... (all subsequent calls 75-82 ms across both sessions)
default: 77 ms (third independent session)
```

Known limitation

Assistant response tokens generated by sample_topp don't always match BPE re-tokenization of the same text in subsequent prompts. This caps per-turn LCP at the prompt boundary. Server-side text-prefix matching (cache last prompt text, tokenize only the suffix) is the right fix and tracked for the next round.

🤖 Generated with Claude Code

Follow-up to PR #48 (chat KV cache reuse). Audited the implementation
and addressed 4 P0/P1 fragility points found in production-like use:

1. **Multi-session safety (P0)** — quant-server held a single global
   KV state. Two concurrent chat clients would corrupt each other's
   cache. Now there's a per-session table (MAX_SESSIONS=16) keyed by
   the OpenAI-compatible "user" field in the request body. Sessions
   are LRU-evicted when full. Each session has its own kv_state,
   cached_tokens, last_used. Default session ("default") preserves
   the original single-client behavior.

2. **Heap-allocate prompt buffer (P0)** — tq_generate_continue used
   `int new_tokens[4096]` on the stack, which silently truncated
   prompts longer than 4096 tokens. Replaced with malloc up to
   model->config.max_seq_len. realloc failure paths now free the
   heap buffer before returning -1.

3. **Sliding window on overflow (P1)** — when n_new + max_tokens
   would exceed max_seq_len, drop the oldest prompt tokens, keep
   the most recent (max_seq_len - max_tokens - 32) tokens, and
   force a full reprefill since the prefix shifted. Prevents
   silent failure / generation truncation.

4. **Cache hit metrics (P1)** — TQ_CHAT_DEBUG=1 env var prints
   per-call metrics: prefix_hit (LCP length), prefill (new tokens
   processed), generated, cached. Useful for diagnosing chat
   clients with poor cache reuse.
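The sliding-window policy in point 3 can be sketched as follows. Function name and the exact headroom handling are assumptions; the real code lives inside tq_generate_continue, and only the policy (keep the most recent `max_seq_len - max_tokens - 32` tokens, drop the oldest, signal reprefill) is taken from the description above.

```c
#include <string.h>

/* When prompt + max_tokens would overflow the context, keep only the
 * most recent (max_seq_len - max_tokens - 32) tokens in place and
 * return the new count. The caller must force a full reprefill when
 * the count shrank, since the cached prefix no longer lines up. */
int apply_sliding_window(int *tokens, int n_tokens,
                         int max_seq_len, int max_tokens) {
    if (n_tokens + max_tokens <= max_seq_len)
        return n_tokens;                        /* fits: no change */
    int keep = max_seq_len - max_tokens - 32;   /* 32 tokens of headroom */
    if (keep < 1) keep = 1;
    if (keep > n_tokens) keep = n_tokens;
    memmove(tokens, tokens + (n_tokens - keep), (size_t)keep * sizeof *tokens);
    return keep;
}
```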

Verified end-to-end with 2 concurrent sessions:
  alice cold:  334 ms
  bob   cold:   78 ms  (separate session, no cache pollution)
  alice 2nd:    78 ms  (alice's cache survived bob's calls)
  bob   2nd:    76 ms
  ... (all subsequent calls ~75-82 ms across both sessions)

Known limitation: assistant response tokens generated by sample_topp
do not always match the BPE re-tokenization of the same response
text in subsequent prompts. This caps the per-turn LCP at the prompt
boundary. Real fix is server-side text-prefix matching (cache the
last prompt text and tokenize only the suffix), tracked for the
next round.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@unamedkr unamedkr force-pushed the feat/chat-cache-hardening branch from 2e93b36 to ee0b3e6 on April 11, 2026 16:38
@unamedkr unamedkr merged commit 7646711 into main Apr 11, 2026
3 checks passed
@unamedkr unamedkr deleted the feat/chat-cache-hardening branch April 11, 2026 16:38
unamedkr added a commit that referenced this pull request Apr 12, 2026
Follow-up to PR #49. The token-level LCP path in tq_generate_continue
has a fundamental limitation: model-generated tokens (sample_topp) and
text-encoded tokens (tq_encode of the response in the next turn) can
diverge due to BPE merge non-roundtripping. This caps per-turn LCP at
the prompt boundary (~10 tokens), so longer histories still incur
mostly-full reprefill.

Fix: tq_generate_chat_text() — text-level prefix matching.

How it works:
1. Each session stores the entire prompt+response text from the
   previous call (cached_text).
2. On a new request, check if the new prompt starts with cached_text
   byte-for-byte. If yes, the cached state is byte-equivalent valid.
3. Tokenize ONLY the suffix (new_prompt[strlen(cached_text):]) and
   prefill those tokens at positions [n_cached..n_cached + n_suffix).
4. Run generation. The accumulated output text gets appended to
   cached_text via a tee callback for the next call.
5. If text prefix doesn't match, fall back to tq_generate_continue
   (token LCP path).
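Steps 2-3 above reduce to a byte-wise prefix check. A minimal sketch, assuming a `cached_text` string per session (the helper name and signature here are illustrative, not the actual tq_generate_chat_text internals):

```c
#include <stdbool.h>
#include <string.h>

/* If the new prompt extends the previous prompt+response text
 * byte-for-byte, the cached KV state is valid as-is; set *suffix to
 * the untokenized remainder (the ONLY part that needs tq_encode and
 * prefill). Otherwise return false so the caller can fall back to
 * the token-LCP path. */
bool text_prefix_match(const char *cached_text, const char *new_prompt,
                       const char **suffix) {
    size_t n = strlen(cached_text);
    if (n == 0 || strncmp(new_prompt, cached_text, n) != 0)
        return false;
    *suffix = new_prompt + n;   /* tokenize only this suffix */
    return true;
}
```

Because the comparison is on text rather than token IDs, BPE non-roundtripping of sampled tokens no longer matters: the cached state is reused whenever the bytes match.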

Bug fix bundled: json_find_key("user") was matching the value in
{"role":"user"} instead of the top-level "user" key. Result: every
request used the "default" session, so multi-session was effectively
broken (cross-pollution). The fix scans for "key": (with colon) to
disambiguate from value matches.
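The disambiguation can be sketched like this: search for the quoted key followed by a colon, so that `"user"` appearing as a *value* (as in `{"role":"user"}`) is skipped. This `json_find_key` is an illustrative stand-in for the server's helper, not its actual code, and like any scanner-style parser it ignores JSON edge cases (escaping, nesting).

```c
#include <stdio.h>
#include <string.h>

/* Return a pointer just past the colon of the top-level `"key":`
 * occurrence, or NULL. Matching `"key"` alone would also hit value
 * strings; requiring the trailing colon disambiguates. */
const char *json_find_key(const char *json, const char *key) {
    char pattern[64];
    snprintf(pattern, sizeof pattern, "\"%s\"", key);
    size_t plen = strlen(pattern);
    for (const char *p = strstr(json, pattern); p; p = strstr(p + 1, pattern)) {
        const char *q = p + plen;
        while (*q == ' ' || *q == '\t') q++;  /* allow whitespace before ':' */
        if (*q == ':')
            return q + 1;                     /* value starts here */
    }
    return NULL;
}
```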

Measured (SmolLM2-135M, single thread, real chat replay):

  Single user, 10-turn accumulation:
    PR #48 (token LCP only):         turn 10 → 3700 ms
    PR #49 (above + multi-session):  turn 10 → 3700 ms (LCP still capped)
    This PR (text-prefix path):      turn 10 →  739 ms (5x)

  alice + bob interleaved, 5 turns each (real assistant replay):
    PR #49:  alice 5 = 2412 ms, bob 5 = 2357 ms
    Now:     alice 5 =  498 ms, bob 5 =  462 ms (5x)

The growth that remains (~50ms/turn) is the unavoidable O(n) cost of
the attention computation over the full context — KV prefill is now
truly O(new tokens per turn), not O(full history per turn).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr added a commit that referenced this pull request Apr 12, 2026
…#50)
