feat: text-prefix chat cache + json_find_key bugfix#50

Merged
unamedkr merged 1 commit into main from feat/text-prefix-cache on Apr 12, 2026
Conversation

@unamedkr
Collaborator

Follow-up to PR #49. Token-level LCP had a fundamental cap from BPE re-tokenization (model-sampled tokens vs text-encoded tokens diverge → LCP truncates at ~10 tokens). This PR adds text-level prefix matching that bypasses the issue entirely.

Approach

`tq_generate_chat_text()` — each session stores the full prompt+response text from the last call. On a new request:

  1. Check if new prompt starts with cached_text byte-for-byte
  2. If yes: tokenize ONLY the suffix and prefill at [n_cached..]
  3. If no: fall back to tq_generate_continue (token LCP path)

Generated tokens are decoded to text via a tee callback during generation, so the next call's cached_text includes both the prompt AND the model's response.
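Steps 1–3 amount to a byte-for-byte prefix check followed by suffix-only tokenization. A minimal sketch (illustrative struct and names, not the actual engine API):

```c
#include <string.h>

/* Sketch of the text-prefix decision. A session keeps the full
 * prompt+response text of its previous call; field names here are
 * illustrative, not the real tq_generate_chat_text internals. */
typedef struct {
    char *cached_text;  /* prompt + response from the last call */
    int   n_cached;     /* tokens already in the KV cache */
} session_t;

/* Returns a pointer to the suffix that still needs tokenizing, or NULL
 * when cached_text is not a byte-for-byte prefix of the new prompt
 * (the caller then falls back to the token-LCP path). */
static const char *text_prefix_suffix(const session_t *s, const char *prompt) {
    if (!s->cached_text) return NULL;               /* cold session */
    size_t n = strlen(s->cached_text);
    if (strncmp(prompt, s->cached_text, n) != 0) return NULL;
    return prompt + n;  /* tokenize ONLY this; prefill at [n_cached..] */
}
```

Because the comparison is on raw bytes, it never touches the tokenizer, which is what sidesteps the BPE re-tokenization divergence entirely.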

Bonus: json_find_key bug

Found while debugging multi-session. `json_find_key("user")` was matching the value in `{"role":"user"}` (first occurrence), then failing the colon check, so every request fell back to the "default" session — multi-session was effectively broken since PR #49. Fix: scan for `"key":` (with colon) to disambiguate from values.
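A minimal sketch of the fix, assuming a simple `strstr`-based scanner (the real `json_find_key` in the server may differ in details such as whitespace handling):

```c
#include <stdio.h>
#include <string.h>

/* Sketch of the fix: search for the quoted key FOLLOWED by a colon, so
 * "user" appearing as a value in {"role":"user"} no longer matches.
 * Whitespace between the closing quote and the colon is not handled
 * in this illustration. */
static const char *json_find_key(const char *json, const char *key) {
    char pat[64];
    snprintf(pat, sizeof pat, "\"%s\":", key);  /* match "key": with colon */
    return strstr(json, pat);
}
```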

Measured

| Test | PR #49 (token LCP) | This PR (text-prefix) |
| --- | --- | --- |
| 10-turn single user | turn 10 → 3700 ms | turn 10 → 739 ms |
| alice+bob interleaved, 5 turns | alice/bob ~2400 ms | alice/bob ~480 ms |
| Improvement at turn 10 | 1.5x vs no-reuse | 7.3x vs no-reuse |

The remaining ~50ms/turn growth is the unavoidable O(n) attention cost over full context. KV prefill is now truly O(new tokens per turn).

Combined progression (chat KV cache work)

| Round | Approach | Turn 10 | vs Original |
| --- | --- | --- | --- |
| Original | No reuse | 5400 ms | 1x |
| PR #48 | Token LCP | 3700 ms | 1.5x |
| PR #49 | + multi-session + overflow safety | 3700 ms | 1.5x |
| This PR | + text-prefix matching | 739 ms | 7.3x |

🤖 Generated with Claude Code

Follow-up to PR #49. The token-level LCP path in tq_generate_continue
has a fundamental limitation: model-generated tokens (sample_topp) and
text-encoded tokens (tq_encode of the response in the next turn) can
diverge due to BPE merge non-roundtripping. This caps per-turn LCP at
the prompt boundary (~10 tokens), so longer histories still incur
mostly-full reprefill.

Fix: tq_generate_chat_text() — text-level prefix matching.

How it works:
1. Each session stores the entire prompt+response text from the
   previous call (cached_text).
2. On a new request, check if the new prompt starts with cached_text
   byte-for-byte. If yes, the cached state is byte-equivalent valid.
3. Tokenize ONLY the suffix (new_prompt[strlen(cached_text):]) and
   prefill those tokens at positions [n_cached..n_cached + n_suffix).
4. Run generation. The accumulated output text gets appended to
   cached_text via a tee callback for the next call.
5. If text prefix doesn't match, fall back to tq_generate_continue
   (token LCP path).
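
The tee callback in step 4 can be sketched as follows (hypothetical struct and names; the real callback is wired up inside tq_generate_chat_text):

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of the tee: forward each decoded token piece to the user's
 * streaming callback AND append it to cached_text, so the next call
 * can prefix-match against prompt + response. Illustrative only. */
typedef struct {
    char  *cached_text;
    size_t len;
    void (*user_cb)(const char *piece);
} tee_t;

static void tee_on_piece(tee_t *t, const char *piece) {
    size_t n = strlen(piece);
    t->cached_text = realloc(t->cached_text, t->len + n + 1);
    memcpy(t->cached_text + t->len, piece, n + 1);  /* keep NUL */
    t->len += n;
    if (t->user_cb) t->user_cb(piece);  /* stream to caller as before */
}
```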

Bug fix bundled: json_find_key("user") was matching the value in
{"role":"user"} instead of the top-level "user" key. Result: every
request used the "default" session, so multi-session was effectively
broken (cross-pollution). The fix scans for "key": (with colon) to
disambiguate from value matches.

Measured (SmolLM2-135M, single thread, real chat replay):

  Single user, 10-turn accumulation:
    PR #48 (token LCP only):         turn 10 → 3700 ms
    PR #49 (above + multi-session):  turn 10 → 3700 ms (LCP still capped)
    This PR (text-prefix path):      turn 10 →  739 ms (5x)

  alice + bob interleaved, 5 turns each (real assistant replay):
    PR #49:  alice 5 = 2412 ms, bob 5 = 2357 ms
    Now:     alice 5 =  498 ms, bob 5 =  462 ms (5x)

The growth that remains (~50ms/turn) is the unavoidable O(n) cost of
the attention computation over the full context — KV prefill is now
truly O(new tokens per turn), not O(full history per turn).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr merged commit 471a5f4 into main on Apr 12, 2026
unamedkr deleted the feat/text-prefix-cache branch on April 12, 2026 00:09
unamedkr added a commit that referenced this pull request Apr 12, 2026
PR #50 added text-prefix matching to src/engine/tq_generate.c (used by
the HTTP server). This PR ports it to quant.h (single-header) so the
WASM browser demo and Python wheel get the same speedup.

Three layers:

1. **quant.h**: ported tq_generate_chat_text from src/engine. Added
   cached_text field to quant_ctx struct. quant_chat() now uses the
   text-prefix path instead of the token-LCP path. quant_free_ctx()
   frees cached_text. Pass NULL prompt to reset session (frees
   cached_text too).

2. **wasm/quant_wasm.c**:
   - wasm_generate_async / wasm_generate now call quant_chat() instead
     of quant_generate() (which destroyed the cache via free+recreate
     of g_ctx every call — biggest reason WASM was slow on multi-turn).
   - Reuse the existing g_ctx across calls; only update temperature/
     top_p/max_tokens fields (kv_compress is immutable post-creation).
   - New wasm_reset_chat() for starting a new chat session.

3. **wasm/index.html**:
   - Accumulates ChatML history client-side (chatHistory string).
     Each turn appends `<|im_start|>user\n${text}<|im_end|>\n
     <|im_start|>assistant\n` and sends the FULL history to WASM.
   - The C side's text-prefix matching reuses everything before the
     new turn — turn N's prefill is O(new user message), not
     O(full history).
   - After response, appends model output + <|im_end|>\n so the next
     turn matches the cached_text byte-for-byte.
   - Loading message differentiates first turn ("Processing prompt
     — may take a few seconds") vs subsequent ("Generating...").

4. **wasm/build.sh**: exports _wasm_reset_chat.
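
The reset semantics from layer 1 (a NULL prompt frees cached_text and clears the session) can be sketched as (hypothetical struct and function names mirroring quant.h, not the actual header):

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for the cached_text field added to quant_ctx. */
typedef struct {
    char *cached_text;
    int   n_cached;
} quant_ctx_sketch;

/* Sketch: NULL prompt resets the session; otherwise the text-prefix
 * match + generation path (elided here) would run. */
static int quant_chat_sketch(quant_ctx_sketch *ctx, const char *prompt) {
    if (prompt == NULL) {           /* reset session */
        free(ctx->cached_text);
        ctx->cached_text = NULL;
        ctx->n_cached = 0;
        return 0;
    }
    /* ... prefix match + prefill + generate would go here ... */
    return 1;
}
```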

Validated end-to-end with the C test (real response replay):

  turn  1:  206 ms (cold, SLOW path)
  turn  2:  315 ms (FAST text_match=64)
  turn  5:  437 ms (FAST text_match=321)
  turn 10:  637 ms (FAST text_match=750)

Every turn after the first hits the FAST text-prefix path. The
remaining ~50ms/turn growth is the unavoidable O(n) attention cost.

For the WASM browser demo, this means: instead of every turn taking
full prefill time (5-10s for a 0.8B model), only turn 1 is slow.
Turns 2+ feel instantaneous to the user.
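
The client-side ChatML framing from layer 3 can be sketched in C for clarity (the demo actually does this in JavaScript in wasm/index.html; names are illustrative):

```c
#include <stdio.h>
#include <string.h>

/* Sketch: append one user turn plus the assistant header to the running
 * history. After generation, the model output and "<|im_end|>\n" are
 * appended too, so the next turn's full prompt starts byte-for-byte
 * with the previous cached_text. */
static void append_user_turn(char *hist, size_t cap, const char *user_msg) {
    size_t n = strlen(hist);
    snprintf(hist + n, cap - n,
             "<|im_start|>user\n%s<|im_end|>\n<|im_start|>assistant\n",
             user_msg);
}
```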

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr added a commit that referenced this pull request Apr 12, 2026
…#51)

(Same commit message as above.)