feat: chat-mode KV cache reuse — O(N²) → O(new tokens) per turn#48
Merged
feat: chat-mode KV cache reuse — O(N²) → O(new tokens) per turn#48
Conversation
User-reported issue: chat mode gets exponentially slower as history
accumulates. Each turn re-prefills the entire conversation through
all transformer layers because both quant_generate (single-header)
and the HTTP server's tq_generate were freeing the KV state on every
call. Result: turn N's prefill cost was O(N * total_history_tokens),
which is O(N²) cumulative.
Fix: introduce tq_generate_continue / quant_chat that:
1. Keeps the KV state alive across calls (caller-managed)
2. Tracks the token IDs currently committed to the KV cache
3. On each call, computes the longest common prefix (LCP) between
the cached tokens and the new prompt, and only prefills the
diverging suffix [LCP, n_new)
4. Updates the cache record with the prompt + generated tokens
Three layers wired up:
1. quant.h (single-header / Python wheel)
- quant_ctx now stores cached_tokens / n_cached / cached_capacity
- new public quant_chat(ctx, prompt, cb, ud) — pass NULL prompt
to reset the session
- existing quant_generate unchanged for backwards compat
2. src/engine/tq_generate.c (library build)
- new tq_generate_continue(model, tok, state, prompt, config,
**cached, *n_cached, *cap, output, size)
- same prefix-match logic, mirrors the single-header impl
3. src/server/tq_server.c (HTTP server)
- tq_server now holds a persistent kv_state + cached_tokens
- both /v1/chat/completions paths (streaming + non-streaming)
call tq_generate_continue instead of tq_generate
- state freed on tq_server_free
4. bindings/python/quantcpp
- _binding.py: optional binding for quant_chat (gracefully
missing on older single-header builds)
- Model.chat(prompt) — generator with KV reuse, falls back to
generate() if symbol unavailable
- Model.reset_chat() — wipes the session
- cli.py: `quantcpp run` interactive loop now accumulates ChatML
history and uses Model.chat() for cheap re-sends
Measured (SmolLM2-135M, M1 Pro, single thread, 10 turns of accumulating
synthetic chat history, max_tokens=8/turn):
quant_generate (no reuse): 295 → 681 → 1105 → 1581 → 2105 → 2660
→ 3245 → 3926 → 4679 → 5386 ms
quant_chat (with reuse): 294 → 430 → 451 → 509 → 545 → 608
→ 693 → 750 → 796 → 902 ms
Turn 10 speedup: 5386 → 902 ms (5.97x)
Identical-prompt repeat (perfect LCP): 366 → 91/91/91/91 ms (4x)
Caveat: when assistant responses contain text that re-tokenizes
differently in the larger context (BPE merge non-roundtripping),
LCP truncates and the suffix re-prefills. Real-world chat clients
that replay the exact assistant response see >90% of the speedup.
Worst-case is still better than the no-reuse baseline.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr
added a commit
that referenced
this pull request
Apr 11, 2026
Follow-up to PR #48 (chat KV cache reuse). Audited the implementation and addressed 4 P0/P1 fragility points found in production-like use: 1. **Multi-session safety (P0)** — quant-server held a single global KV state. Two concurrent chat clients would corrupt each other's cache. Now there's a per-session table (MAX_SESSIONS=16) keyed by the OpenAI-compatible "user" field in the request body. Sessions are LRU-evicted when full. Each session has its own kv_state, cached_tokens, last_used. Default session ("default") preserves the original single-client behavior. 2. **Heap-allocate prompt buffer (P0)** — tq_generate_continue used `int new_tokens[4096]` on the stack, which silently truncated prompts longer than 4096 tokens. Replaced with malloc up to model->config.max_seq_len. realloc failure paths now free the heap buffer before returning -1. 3. **Sliding window on overflow (P1)** — when n_new + max_tokens would exceed max_seq_len, drop the oldest prompt tokens, keep the most recent (max_seq_len - max_tokens - 32) tokens, and force a full reprefill since the prefix shifted. Prevents silent failure / generation truncation. 4. **Cache hit metrics (P1)** — TQ_CHAT_DEBUG=1 env var prints per-call metrics: prefix_hit (LCP length), prefill (new tokens processed), generated, cached. Useful for diagnosing chat clients with poor cache reuse. Verified end-to-end with 2 concurrent sessions: alice cold: 334 ms bob cold: 78 ms (separate session, no cache pollution) alice 2nd: 78 ms (alice's cache survived bob's calls) bob 2nd: 76 ms ... (all subsequent calls ~75-82 ms across both sessions) Known limitation: assistant response tokens generated by sample_topp do not always match the BPE re-tokenization of the same response text in subsequent prompts. This caps the per-turn LCP at the prompt boundary. Real fix is server-side text-prefix matching (cache the last prompt text and tokenize only the suffix), tracked for the next round. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr
added a commit
that referenced
this pull request
Apr 11, 2026
Follow-up to PR #48 (chat KV cache reuse). Audited the implementation and addressed 4 P0/P1 fragility points found in production-like use: 1. **Multi-session safety (P0)** — quant-server held a single global KV state. Two concurrent chat clients would corrupt each other's cache. Now there's a per-session table (MAX_SESSIONS=16) keyed by the OpenAI-compatible "user" field in the request body. Sessions are LRU-evicted when full. Each session has its own kv_state, cached_tokens, last_used. Default session ("default") preserves the original single-client behavior. 2. **Heap-allocate prompt buffer (P0)** — tq_generate_continue used `int new_tokens[4096]` on the stack, which silently truncated prompts longer than 4096 tokens. Replaced with malloc up to model->config.max_seq_len. realloc failure paths now free the heap buffer before returning -1. 3. **Sliding window on overflow (P1)** — when n_new + max_tokens would exceed max_seq_len, drop the oldest prompt tokens, keep the most recent (max_seq_len - max_tokens - 32) tokens, and force a full reprefill since the prefix shifted. Prevents silent failure / generation truncation. 4. **Cache hit metrics (P1)** — TQ_CHAT_DEBUG=1 env var prints per-call metrics: prefix_hit (LCP length), prefill (new tokens processed), generated, cached. Useful for diagnosing chat clients with poor cache reuse. Verified end-to-end with 2 concurrent sessions: alice cold: 334 ms bob cold: 78 ms (separate session, no cache pollution) alice 2nd: 78 ms (alice's cache survived bob's calls) bob 2nd: 76 ms ... (all subsequent calls ~75-82 ms across both sessions) Known limitation: assistant response tokens generated by sample_topp do not always match the BPE re-tokenization of the same response text in subsequent prompts. This caps the per-turn LCP at the prompt boundary. Real fix is server-side text-prefix matching (cache the last prompt text and tokenize only the suffix), tracked for the next round. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr
added a commit
that referenced
this pull request
Apr 11, 2026
…ble (#49) Follow-up to PR #48 (chat KV cache reuse). Audited the implementation and addressed 4 P0/P1 fragility points found in production-like use: 1. **Multi-session safety (P0)** — quant-server held a single global KV state. Two concurrent chat clients would corrupt each other's cache. Now there's a per-session table (MAX_SESSIONS=16) keyed by the OpenAI-compatible "user" field in the request body. Sessions are LRU-evicted when full. Each session has its own kv_state, cached_tokens, last_used. Default session ("default") preserves the original single-client behavior. 2. **Heap-allocate prompt buffer (P0)** — tq_generate_continue used `int new_tokens[4096]` on the stack, which silently truncated prompts longer than 4096 tokens. Replaced with malloc up to model->config.max_seq_len. realloc failure paths now free the heap buffer before returning -1. 3. **Sliding window on overflow (P1)** — when n_new + max_tokens would exceed max_seq_len, drop the oldest prompt tokens, keep the most recent (max_seq_len - max_tokens - 32) tokens, and force a full reprefill since the prefix shifted. Prevents silent failure / generation truncation. 4. **Cache hit metrics (P1)** — TQ_CHAT_DEBUG=1 env var prints per-call metrics: prefix_hit (LCP length), prefill (new tokens processed), generated, cached. Useful for diagnosing chat clients with poor cache reuse. Verified end-to-end with 2 concurrent sessions: alice cold: 334 ms bob cold: 78 ms (separate session, no cache pollution) alice 2nd: 78 ms (alice's cache survived bob's calls) bob 2nd: 76 ms ... (all subsequent calls ~75-82 ms across both sessions) Known limitation: assistant response tokens generated by sample_topp do not always match the BPE re-tokenization of the same response text in subsequent prompts. This caps the per-turn LCP at the prompt boundary. Real fix is server-side text-prefix matching (cache the last prompt text and tokenize only the suffix), tracked for the next round. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr
added a commit
that referenced
this pull request
Apr 12, 2026
Follow-up to PR #49. The token-level LCP path in tq_generate_continue has a fundamental limitation: model-generated tokens (sample_topp) and text-encoded tokens (tq_encode of the response in the next turn) can diverge due to BPE merge non-roundtripping. This caps per-turn LCP at the prompt boundary (~10 tokens), so longer histories still incur mostly-full reprefill. Fix: tq_generate_chat_text() — text-level prefix matching. How it works: 1. Each session stores the entire prompt+response text from the previous call (cached_text). 2. On a new request, check if the new prompt starts with cached_text byte-for-byte. If yes, the cached state is byte-equivalent valid. 3. Tokenize ONLY the suffix (new_prompt[strlen(cached_text):]) and prefill those tokens at positions [n_cached..n_cached + n_suffix). 4. Run generation. The accumulated output text gets appended to cached_text via a tee callback for the next call. 5. If text prefix doesn't match, fall back to tq_generate_continue (token LCP path). Bug fix bundled: json_find_key("user") was matching the value in {"role":"user"} instead of the top-level "user" key. Result: every request used the "default" session, so multi-session was effectively broken (cross-pollution). The fix scans for "key": (with colon) to disambiguate from value matches. Measured (SmolLM2-135M, single thread, real chat replay): Single user, 10-turn accumulation: PR #48 (token LCP only): turn 10 → 3700 ms PR #49 (above + multi-session): turn 10 → 3700 ms (LCP still capped) This PR (text-prefix path): turn 10 → 739 ms (5x) alice + bob interleaved, 5 turns each (real assistant replay): PR #49: alice 5 = 2412 ms, bob 5 = 2357 ms Now: alice 5 = 498 ms, bob 5 = 462 ms (5x) The growth that remains (~50ms/turn) is the unavoidable O(n) cost of the attention computation over the full context — KV prefill is now truly O(new tokens per turn), not O(full history per turn). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr
added a commit
that referenced
this pull request
Apr 12, 2026
…#50) Follow-up to PR #49. The token-level LCP path in tq_generate_continue has a fundamental limitation: model-generated tokens (sample_topp) and text-encoded tokens (tq_encode of the response in the next turn) can diverge due to BPE merge non-roundtripping. This caps per-turn LCP at the prompt boundary (~10 tokens), so longer histories still incur mostly-full reprefill. Fix: tq_generate_chat_text() — text-level prefix matching. How it works: 1. Each session stores the entire prompt+response text from the previous call (cached_text). 2. On a new request, check if the new prompt starts with cached_text byte-for-byte. If yes, the cached state is byte-equivalent valid. 3. Tokenize ONLY the suffix (new_prompt[strlen(cached_text):]) and prefill those tokens at positions [n_cached..n_cached + n_suffix). 4. Run generation. The accumulated output text gets appended to cached_text via a tee callback for the next call. 5. If text prefix doesn't match, fall back to tq_generate_continue (token LCP path). Bug fix bundled: json_find_key("user") was matching the value in {"role":"user"} instead of the top-level "user" key. Result: every request used the "default" session, so multi-session was effectively broken (cross-pollution). The fix scans for "key": (with colon) to disambiguate from value matches. Measured (SmolLM2-135M, single thread, real chat replay): Single user, 10-turn accumulation: PR #48 (token LCP only): turn 10 → 3700 ms PR #49 (above + multi-session): turn 10 → 3700 ms (LCP still capped) This PR (text-prefix path): turn 10 → 739 ms (5x) alice + bob interleaved, 5 turns each (real assistant replay): PR #49: alice 5 = 2412 ms, bob 5 = 2357 ms Now: alice 5 = 498 ms, bob 5 = 462 ms (5x) The growth that remains (~50ms/turn) is the unavoidable O(n) cost of the attention computation over the full context — KV prefill is now truly O(new tokens per turn), not O(full history per turn). Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr
added a commit
that referenced
this pull request
Apr 12, 2026
After PRs #48-#51 the chat KV cache reuse path was a complex multi-layer system. Audited every code path for hidden bugs and fixed all of them. ## Bugs found and fixed 1. **Slow-path fallback corrupted KV state** [P0] tq_generate_chat_text's overflow fallback called tq_generate_continue on the SAME state that already had old KV at positions [0..prefix_pos). New prefill would write [0..n_new) leaving stale [n_new..prefix_pos) that subsequent generation might read. Replaced with -2 return code: the caller decides (server returns HTTP 413, WASM auto-resets the chat and shows a status message). 2. **WASM reset_chat partial cleanup** [P1] wasm_reset_chat called quant_chat(NULL) but did not reset g_output_pos / g_output[0] / g_stream_count, so the next generation would append to stale text from the previous chat. Now resets all. 3. **wasm_generate (sync path) missed g_stream_count reset** [P1] The async path zeroed it, the sync path did not. Aligned both. 4. **Wheel header _quant.h stale** [P0] bindings/python/quantcpp/_quant.h is .gitignore'd and the next pip build would have used quant.h from before PR #51 (no tq_generate_chat_text). Synced to current quant.h. 5. **Overflow surface — WASM** [P1] Added n == -2 detection in wasm_generate / wasm_generate_async. Auto-reset chat and call js_on_status with a clear error message so the JS side can show "Context full — chat reset". 6. **Overflow surface — server** [P1] Added gen_rc == -2 detection in both streaming and non-streaming handlers. Server resets the session's KV state + cached_text + tokens and returns HTTP 413 with an OpenAI-compatible error JSON. 7. **tq_generate_continue cached_text drift documentation** [P2] Added a header comment explaining that tq_generate_continue is the lower-level API and doesn't track cached_text. Higher-level callers must use tq_generate_chat_text for cached_text safety. ## Audited but safe - Server session concurrency: get_or_create_session is called inside inference_mutex, so LRU bookkeeping is serialized. - json_extract_string buffer safety: respects buf_size - 1 bound. - WASM g_output overflow: tokens dropped from local buffer but js_on_token still fires, so JS side gets all output. Acceptable. ## Verified end-to-end alice/bob interleaved 5 turns each (real assistant replay): alice: 339 → 514 ms (~50 ms/turn growth from O(n) attention) bob: 310 → 518 ms (similar) No regressions; all turns hit the FAST text-prefix path after turn 1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr
added a commit
that referenced
this pull request
Apr 12, 2026
After PRs #48-#51 the chat KV cache reuse path was a complex multi-layer system. Audited every code path for hidden bugs and fixed all of them. ## Bugs found and fixed 1. **Slow-path fallback corrupted KV state** [P0] tq_generate_chat_text's overflow fallback called tq_generate_continue on the SAME state that already had old KV at positions [0..prefix_pos). New prefill would write [0..n_new) leaving stale [n_new..prefix_pos) that subsequent generation might read. Replaced with -2 return code: the caller decides (server returns HTTP 413, WASM auto-resets the chat and shows a status message). 2. **WASM reset_chat partial cleanup** [P1] wasm_reset_chat called quant_chat(NULL) but did not reset g_output_pos / g_output[0] / g_stream_count, so the next generation would append to stale text from the previous chat. Now resets all. 3. **wasm_generate (sync path) missed g_stream_count reset** [P1] The async path zeroed it, the sync path did not. Aligned both. 4. **Wheel header _quant.h stale** [P0] bindings/python/quantcpp/_quant.h is .gitignore'd and the next pip build would have used quant.h from before PR #51 (no tq_generate_chat_text). Synced to current quant.h. 5. **Overflow surface — WASM** [P1] Added n == -2 detection in wasm_generate / wasm_generate_async. Auto-reset chat and call js_on_status with a clear error message so the JS side can show "Context full — chat reset". 6. **Overflow surface — server** [P1] Added gen_rc == -2 detection in both streaming and non-streaming handlers. Server resets the session's KV state + cached_text + tokens and returns HTTP 413 with an OpenAI-compatible error JSON. 7. **tq_generate_continue cached_text drift documentation** [P2] Added a header comment explaining that tq_generate_continue is the lower-level API and doesn't track cached_text. Higher-level callers must use tq_generate_chat_text for cached_text safety. ## Audited but safe - Server session concurrency: get_or_create_session is called inside inference_mutex, so LRU bookkeeping is serialized. - json_extract_string buffer safety: respects buf_size - 1 bound. - WASM g_output overflow: tokens dropped from local buffer but js_on_token still fires, so JS side gets all output. Acceptable. ## Verified end-to-end alice/bob interleaved 5 turns each (real assistant replay): alice: 339 → 514 ms (~50 ms/turn growth from O(n) attention) bob: 310 → 518 ms (similar) No regressions; all turns hit the FAST text-prefix path after turn 1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr
added a commit
that referenced
this pull request
Apr 12, 2026
After PRs #48-#51 the chat KV cache reuse path was a complex multi-layer system. Audited every code path for hidden bugs and fixed all of them. ## Bugs found and fixed 1. **Slow-path fallback corrupted KV state** [P0] tq_generate_chat_text's overflow fallback called tq_generate_continue on the SAME state that already had old KV at positions [0..prefix_pos). New prefill would write [0..n_new) leaving stale [n_new..prefix_pos) that subsequent generation might read. Replaced with -2 return code: the caller decides (server returns HTTP 413, WASM auto-resets the chat and shows a status message). 2. **WASM reset_chat partial cleanup** [P1] wasm_reset_chat called quant_chat(NULL) but did not reset g_output_pos / g_output[0] / g_stream_count, so the next generation would append to stale text from the previous chat. Now resets all. 3. **wasm_generate (sync path) missed g_stream_count reset** [P1] The async path zeroed it, the sync path did not. Aligned both. 4. **Wheel header _quant.h stale** [P0] bindings/python/quantcpp/_quant.h is .gitignore'd and the next pip build would have used quant.h from before PR #51 (no tq_generate_chat_text). Synced to current quant.h. 5. **Overflow surface — WASM** [P1] Added n == -2 detection in wasm_generate / wasm_generate_async. Auto-reset chat and call js_on_status with a clear error message so the JS side can show "Context full — chat reset". 6. **Overflow surface — server** [P1] Added gen_rc == -2 detection in both streaming and non-streaming handlers. Server resets the session's KV state + cached_text + tokens and returns HTTP 413 with an OpenAI-compatible error JSON. 7. **tq_generate_continue cached_text drift documentation** [P2] Added a header comment explaining that tq_generate_continue is the lower-level API and doesn't track cached_text. Higher-level callers must use tq_generate_chat_text for cached_text safety. ## Audited but safe - Server session concurrency: get_or_create_session is called inside inference_mutex, so LRU bookkeeping is serialized. - json_extract_string buffer safety: respects buf_size - 1 bound. - WASM g_output overflow: tokens dropped from local buffer but js_on_token still fires, so JS side gets all output. Acceptable. ## Verified end-to-end alice/bob interleaved 5 turns each (real assistant replay): alice: 339 → 514 ms (~50 ms/turn growth from O(n) attention) bob: 310 → 518 ms (similar) No regressions; all turns hit the FAST text-prefix path after turn 1. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
User-reported: chat mode gets exponentially slower as history accumulates. The reason — every turn (both `quant_generate` in the single-header and `tq_generate` in the HTTP server) was freeing the KV state and re-prefilling the entire conversation through every transformer layer. Result: O(N²) cumulative cost.
Fix
Added `tq_generate_continue` / `quant_chat` that keeps KV state alive across calls and uses longest-common-prefix matching between cached tokens and the new prompt to skip the matched prefix.
Wired into 4 layers:
Measured (SmolLM2-135M, M1 Pro, 1 thread, 10 turns of accumulating chat)
6x speedup at turn 10. Identical-prompt repeat (perfect LCP): 366 → 91 ms (4x).
Caveat
When the model's response contains text that re-tokenizes differently in the larger context (BPE merge non-roundtripping), LCP truncates and that part re-prefills. Real-world OpenAI clients that replay the exact assistant response see >90% of the speedup. Worst case is still strictly better than the no-reuse baseline.
🤖 Generated with Claude Code