feat(wasm): chat KV cache reuse — instant turn N+1 in browser#51
Merged
PR #50 added text-prefix matching to src/engine/tq_generate.c (used by the HTTP server). This PR ports it to quant.h (single-header) so the WASM browser demo and Python wheel get the same speedup.

Four layers:

1. **quant.h**: ported tq_generate_chat_text from src/engine. Added a cached_text field to the quant_ctx struct. quant_chat() now uses the text-prefix path instead of the token-LCP path. quant_free_ctx() frees cached_text. Passing a NULL prompt resets the session (frees cached_text too).
2. **wasm/quant_wasm.c**:
   - wasm_generate_async / wasm_generate now call quant_chat() instead of quant_generate(), which destroyed the cache via free+recreate of g_ctx on every call — the biggest reason WASM was slow on multi-turn chat.
   - Reuse the existing g_ctx across calls; only the temperature / top_p / max_tokens fields are updated (kv_compress is immutable post-creation).
   - New wasm_reset_chat() for starting a new chat session.
3. **wasm/index.html**:
   - Accumulates ChatML history client-side (a chatHistory string). Each turn appends `<|im_start|>user\n${text}<|im_end|>\n<|im_start|>assistant\n` and sends the FULL history to WASM.
   - The C side's text-prefix matching reuses everything before the new turn — turn N's prefill is O(new user message), not O(full history).
   - After the response, appends the model output + `<|im_end|>\n` so the next turn matches cached_text byte-for-byte.
   - The loading message differentiates the first turn ("Processing prompt — may take a few seconds") from subsequent ones ("Generating...").
4. **wasm/build.sh**: exports _wasm_reset_chat.

Validated end-to-end with the C test (real response replay):

- turn 1: 206 ms (cold, SLOW path)
- turn 2: 315 ms (FAST text_match=64)
- turn 5: 437 ms (FAST text_match=321)
- turn 10: 637 ms (FAST text_match=750)

Every turn after the first hits the FAST text-prefix path. The remaining ~50 ms/turn growth is the unavoidable O(n) attention cost. For the WASM browser demo this means that instead of every turn taking full prefill time (5-10 s for a 0.8B model), only turn 1 is slow. Turns 2+ feel instantaneous to the user.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr added a commit that referenced this pull request on Apr 12, 2026:
After PRs #48-#51 the chat KV cache reuse path was a complex multi-layer system. Audited every code path for hidden bugs and fixed all of them.

## Bugs found and fixed

1. **Slow-path fallback corrupted KV state** [P0] — tq_generate_chat_text's overflow fallback called tq_generate_continue on the SAME state that already had old KV at positions [0..prefix_pos). The new prefill would write [0..n_new), leaving stale entries in [n_new..prefix_pos) that subsequent generation might read. Replaced with a -2 return code: the caller decides (the server returns HTTP 413; WASM auto-resets the chat and shows a status message).
2. **WASM reset_chat partial cleanup** [P1] — wasm_reset_chat called quant_chat(NULL) but did not reset g_output_pos / g_output[0] / g_stream_count, so the next generation would append to stale text from the previous chat. Now resets all.
3. **wasm_generate (sync path) missed g_stream_count reset** [P1] — the async path zeroed it, the sync path did not. Aligned both.
4. **Wheel header _quant.h stale** [P0] — bindings/python/quantcpp/_quant.h is .gitignore'd, and the next pip build would have used a quant.h from before PR #51 (no tq_generate_chat_text). Synced to the current quant.h.
5. **Overflow surface — WASM** [P1] — added n == -2 detection in wasm_generate / wasm_generate_async. Auto-resets the chat and calls js_on_status with a clear error message so the JS side can show "Context full — chat reset".
6. **Overflow surface — server** [P1] — added gen_rc == -2 detection in both the streaming and non-streaming handlers. The server resets the session's KV state + cached_text + tokens and returns HTTP 413 with an OpenAI-compatible error JSON.
7. **tq_generate_continue cached_text drift documentation** [P2] — added a header comment explaining that tq_generate_continue is the lower-level API and doesn't track cached_text. Higher-level callers must use tq_generate_chat_text for cached_text safety.

## Audited but safe

- Server session concurrency: get_or_create_session is called inside inference_mutex, so LRU bookkeeping is serialized.
- json_extract_string buffer safety: respects the buf_size - 1 bound.
- WASM g_output overflow: tokens are dropped from the local buffer, but js_on_token still fires, so the JS side gets all output. Acceptable.

## Verified end-to-end

alice/bob interleaved, 5 turns each (real assistant replay):

- alice: 339 → 514 ms (~50 ms/turn growth from O(n) attention)
- bob: 310 → 518 ms (similar)

No regressions; all turns hit the FAST text-prefix path after turn 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR #50 added text-prefix matching to the HTTP server. This PR ports it to the single-header (`quant.h`) so the WASM browser demo and Python wheel get the same speedup.
## What changed

## Measured (single-header path, real response replay)
```
turn 1: 206 ms (cold, SLOW path)
turn 2: 315 ms (FAST text_match=64)
turn 5: 437 ms (FAST text_match=321)
turn 10: 637 ms (FAST text_match=750)
```
## What users see
Before: every turn in the WASM demo took full prefill time (5-10s for a 0.8B model). Multi-turn chat was effectively unusable.
After: turn 1 is slow (cold prefill). Turns 2+ feel instantaneous.
## Key bug fix
The WASM `wasm_generate_async` was calling `quant_free_ctx(g_ctx) + quant_new(...)` on every request, which freed the entire context — including the chat KV cache — and destroyed any cache-reuse benefit before it could materialize. Switched to reusing `g_ctx` and only updating the per-call params (temperature, top_p, max_tokens).
🤖 Generated with Claude Code