feat: text-prefix chat cache + json_find_key bugfix #50
Merged
Follow-up to PR #49. The token-level LCP path in `tq_generate_continue` has a fundamental limitation: model-generated tokens (`sample_topp`) and text-encoded tokens (`tq_encode` of the response in the next turn) can diverge because BPE merges do not round-trip. This caps per-turn LCP at the prompt boundary (~10 tokens), so longer histories still incur a mostly-full reprefill.

Fix: `tq_generate_chat_text()` — text-level prefix matching.

How it works:
1. Each session stores the entire prompt+response text from the previous call (`cached_text`).
2. On a new request, check whether the new prompt starts with `cached_text` byte-for-byte. If yes, the cached state is byte-equivalent valid.
3. Tokenize ONLY the suffix (`new_prompt[strlen(cached_text):]`) and prefill those tokens at positions `[n_cached..n_cached + n_suffix)`.
4. Run generation. The accumulated output text gets appended to `cached_text` via a tee callback for the next call.
5. If the text prefix doesn't match, fall back to `tq_generate_continue` (the token-LCP path).

Bug fix bundled: `json_find_key("user")` was matching the value in `{"role":"user"}` instead of the top-level `"user"` key. Result: every request used the "default" session, so multi-session was effectively broken (cross-pollution). The fix scans for `"key":` (with the colon) to disambiguate keys from value matches.

Measured (SmolLM2-135M, single thread, real chat replay):

Single user, 10-turn accumulation:
- PR #48 (token LCP only): turn 10 → 3700 ms
- PR #49 (above + multi-session): turn 10 → 3700 ms (LCP still capped)
- This PR (text-prefix path): turn 10 → 739 ms (5x)

alice + bob interleaved, 5 turns each (real assistant replay):
- PR #49: alice 5 = 2412 ms, bob 5 = 2357 ms
- Now: alice 5 = 498 ms, bob 5 = 462 ms (5x)

The growth that remains (~50 ms/turn) is the unavoidable O(n) cost of the attention computation over the full context — KV prefill is now truly O(new tokens per turn), not O(full history per turn).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
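The decision in steps 2-3 can be sketched in a few lines of C. Names here (`session_t`, `prefix_reuse_offset`) are illustrative, not the real API; the actual logic lives inside `tq_generate_chat_text()` alongside the KV bookkeeping:

```c
#include <string.h>

/* Minimal sketch of the text-prefix decision, assuming a session
 * holds the full prompt+response text of the previous call and the
 * number of tokens already resident in the KV cache. */
typedef struct {
    char *cached_text;  /* prompt+response text from the last call */
    int   n_cached;     /* tokens already prefetched into the KV cache */
} session_t;

/* Returns the byte offset into new_prompt where the suffix to
 * tokenize begins, or -1 if the cached text cannot be reused
 * (caller then falls back to the token-LCP path). */
long prefix_reuse_offset(const session_t *s, const char *new_prompt) {
    if (!s->cached_text)
        return -1;                                  /* cold session */
    size_t n = strlen(s->cached_text);
    if (strncmp(new_prompt, s->cached_text, n) != 0)
        return -1;                                  /* text diverged */
    /* Tokenize only new_prompt + n; prefill at positions n_cached.. */
    return (long)n;
}
```

On a hit, only the suffix bytes ever reach the tokenizer, which is why BPE merge non-roundtripping no longer matters: the comparison happens on raw bytes, before tokenization.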
unamedkr added a commit that referenced this pull request on Apr 12, 2026
PR #50 added text-prefix matching to `src/engine/tq_generate.c` (used by the HTTP server). This PR ports it to `quant.h` (single-header) so the WASM browser demo and Python wheel get the same speedup.

Four layers:

1. **quant.h**: ported `tq_generate_chat_text` from `src/engine`. Added a `cached_text` field to the `quant_ctx` struct. `quant_chat()` now uses the text-prefix path instead of the token-LCP path. `quant_free_ctx()` frees `cached_text`. Pass a NULL prompt to reset the session (frees `cached_text` too).
2. **wasm/quant_wasm.c**:
   - `wasm_generate_async` / `wasm_generate` now call `quant_chat()` instead of `quant_generate()` (which destroyed the cache via free+recreate of `g_ctx` on every call — the biggest reason WASM was slow on multi-turn).
   - Reuse the existing `g_ctx` across calls; only update the temperature/top_p/max_tokens fields (`kv_compress` is immutable post-creation).
   - New `wasm_reset_chat()` for starting a new chat session.
3. **wasm/index.html**:
   - Accumulates ChatML history client-side (a `chatHistory` string). Each turn appends `<|im_start|>user\n${text}<|im_end|>\n<|im_start|>assistant\n` and sends the FULL history to WASM.
   - The C side's text-prefix matching reuses everything before the new turn — turn N's prefill is O(new user message), not O(full history).
   - After the response, appends the model output plus `<|im_end|>\n` so the next turn matches the `cached_text` byte-for-byte.
   - The loading message differentiates the first turn ("Processing prompt — may take a few seconds") from subsequent ones ("Generating...").
4. **wasm/build.sh**: exports `_wasm_reset_chat`.

Validated end-to-end with the C test (real response replay):

- turn 1: 206 ms (cold, SLOW path)
- turn 2: 315 ms (FAST text_match=64)
- turn 5: 437 ms (FAST text_match=321)
- turn 10: 637 ms (FAST text_match=750)

Every turn after the first hits the FAST text-prefix path. The remaining ~50 ms/turn growth is the unavoidable O(n) attention cost. For the WASM browser demo this means: instead of every turn taking full prefill time (5-10 s for a 0.8B model), only turn 1 is slow. Turns 2+ feel instantaneous to the user.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
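The client-side accumulation in layer 3 is plain string concatenation in ChatML form. A minimal sketch of the same bookkeeping in C (the demo does this in JS; `append_fmt` and the two turn helpers are hypothetical names, not part of the real API):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Grow a heap string by one formatted piece (one %s placeholder). */
void append_fmt(char **buf, const char *fmt, const char *s) {
    size_t old = *buf ? strlen(*buf) : 0;
    int add = snprintf(NULL, 0, fmt, s);     /* measure needed bytes */
    *buf = realloc(*buf, old + (size_t)add + 1);
    snprintf(*buf + old, (size_t)add + 1, fmt, s);
}

/* Each turn appends the user message and an open assistant header,
 * then the FULL history is sent to the engine. */
void add_user_turn(char **hist, const char *user_msg) {
    append_fmt(hist,
        "<|im_start|>user\n%s<|im_end|>\n<|im_start|>assistant\n",
        user_msg);
}

/* After generation, record the reply so the next turn's prompt is a
 * byte-for-byte extension of what the engine already cached. */
void add_assistant_reply(char **hist, const char *reply) {
    append_fmt(hist, "%s<|im_end|>\n", reply);
}
```

The key invariant is that turn N+1's prompt starts with turn N's entire prompt+response text, which is exactly what the engine's prefix check requires.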
Follow-up to PR #49. Token-level LCP had a fundamental cap from BPE re-tokenization (model-sampled tokens vs text-encoded tokens diverge → LCP truncates at ~10 tokens). This PR adds text-level prefix matching that bypasses the issue entirely.
Approach
`tq_generate_chat_text()` — each session stores the full prompt+response text from the last call. On a new request:
1. If the new prompt starts with `cached_text` byte-for-byte, tokenize only the suffix and prefill it after the cached positions.
2. Otherwise, fall back to `tq_generate_continue` (the token-LCP path).
Generated tokens are decoded to text via a tee callback during generation, so the next call's cached_text includes both the prompt AND the model's response.
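A minimal sketch of that tee idea, assuming a simple grow-and-append buffer (`tee_state_t` and `tee_append` are illustrative names, not the engine's actual callback signature):

```c
#include <stdlib.h>
#include <string.h>

/* Accumulates decoded output text as tokens stream out, so the next
 * call's byte-for-byte prefix check covers the model's response too. */
typedef struct {
    char  *cached_text;  /* prompt text, then response appended piecewise */
    size_t len;
} tee_state_t;

/* Called once per decoded piece: forward it to the user's callback
 * (omitted here) AND append it to the session's cached text. */
void tee_append(tee_state_t *t, const char *piece) {
    size_t n = strlen(piece);
    t->cached_text = realloc(t->cached_text, t->len + n + 1);
    memcpy(t->cached_text + t->len, piece, n);
    t->len += n;
    t->cached_text[t->len] = '\0';
}
```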
Bonus: json_find_key bug
Found while debugging multi-session. `json_find_key("user")` was matching the value in `{"role":"user"}` (first occurrence), then failing the colon check, so every request fell back to the "default" session — multi-session was effectively broken since PR #49. Fix: scan for `"key":` (with colon) to disambiguate from values.
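A minimal sketch of the fixed lookup, assuming a flat substring scan (the name `json_find_key_sketch` is illustrative; the real `json_find_key` may differ, and a flat scan can still be fooled by a nested key with the same name):

```c
#include <stdio.h>
#include <string.h>

/* Search for the quoted key FOLLOWED BY a colon, so the string value
 * "user" in {"role":"user"} does not match a lookup for key "user".
 * Returns a pointer just past the colon, or NULL if not found. */
const char *json_find_key_sketch(const char *json, const char *key) {
    char pat[64];
    snprintf(pat, sizeof pat, "\"%s\":", key);   /* e.g. "user": */
    const char *p = strstr(json, pat);
    return p ? p + strlen(pat) : NULL;
}
```

With the old pattern (`"user"` without the colon), the first match in `{"role":"user","user":"alice"}` lands on the role's value; with the colon appended it skips straight to the real key.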
Measured
The remaining ~50ms/turn growth is the unavoidable O(n) attention cost over full context. KV prefill is now truly O(new tokens per turn).
Combined progression (chat KV cache work)
🤖 Generated with Claude Code