Skip to content

feat(wasm): chat KV cache reuse — instant turn N+1 in browser#51

Merged
unamedkr merged 1 commit intomainfrom
feat/wasm-chat-caching
Apr 12, 2026
Merged

feat(wasm): chat KV cache reuse — instant turn N+1 in browser#51
unamedkr merged 1 commit intomainfrom
feat/wasm-chat-caching

Conversation

@unamedkr
Copy link
Copy Markdown
Collaborator

PR #50 added text-prefix matching to the HTTP server. This PR ports it to the single-header (`quant.h`) so the WASM browser demo and Python wheel get the same speedup.

What changed

Layer Change
`quant.h` Ported `tq_generate_chat_text` from src/engine. `quant_chat` now uses text-prefix path. `quant_ctx.cached_text` field added.
`wasm/quant_wasm.c` wasm_generate_async / wasm_generate now call quant_chat instead of quant_generate (which was destroying the cache by freeing+recreating g_ctx every call!). New wasm_reset_chat.
`wasm/index.html` Accumulates ChatML history client-side; sends full history each turn. After response, appends model output to history so next turn matches the cached_text byte-for-byte.
`wasm/build.sh` Exports _wasm_reset_chat.

Measured (single-header path, real response replay)

```
turn 1: 206 ms (cold, SLOW path)
turn 2: 315 ms (FAST text_match=64)
turn 5: 437 ms (FAST text_match=321)
turn 10: 637 ms (FAST text_match=750)
```

What users see

Before: every turn in the WASM demo took full prefill time (5-10s for a 0.8B model). Multi-turn chat was effectively unusable.

After: turn 1 is slow (cold prefill). Turns 2+ feel instantaneous.

Key bug fix

The WASM `wasm_generate_async` was calling `quant_free_ctx(g_ctx) + quant_new(...)` on every request, which freed the entire context including the chat KV cache. This was actually destroying any cache reuse benefits before they could materialize. Switched to reusing g_ctx and only updating the per-call params (temperature, top_p, max_tokens).

🤖 Generated with Claude Code

PR #50 added text-prefix matching to src/engine/tq_generate.c (used by
the HTTP server). This PR ports it to quant.h (single-header) so the
WASM browser demo and Python wheel get the same speedup.

Three layers:

1. **quant.h**: ported tq_generate_chat_text from src/engine. Added
   cached_text field to quant_ctx struct. quant_chat() now uses the
   text-prefix path instead of the token-LCP path. quant_free_ctx()
   frees cached_text. Pass NULL prompt to reset session (frees
   cached_text too).

2. **wasm/quant_wasm.c**:
   - wasm_generate_async / wasm_generate now call quant_chat() instead
     of quant_generate() (which destroyed the cache via free+recreate
     of g_ctx every call — biggest reason WASM was slow on multi-turn).
   - Reuse the existing g_ctx across calls; only update temperature/
     top_p/max_tokens fields (kv_compress is immutable post-creation).
   - New wasm_reset_chat() for starting a new chat session.

3. **wasm/index.html**:
   - Accumulates ChatML history client-side (chatHistory string).
     Each turn appends `<|im_start|>user\n${text}<|im_end|>\n
     <|im_start|>assistant\n` and sends the FULL history to WASM.
   - The C side's text-prefix matching reuses everything before the
     new turn — turn N's prefill is O(new user message), not
     O(full history).
   - After response, appends model output + <|im_end|>\n so the next
     turn matches the cached_text byte-for-byte.
   - Loading message differentiates first turn ("Processing prompt
     — may take a few seconds") vs subsequent ("Generating...").

4. **wasm/build.sh**: exports _wasm_reset_chat.

Validated end-to-end with the C test (real response replay):

  turn  1:  206 ms (cold, SLOW path)
  turn  2:  315 ms (FAST text_match=64)
  turn  5:  437 ms (FAST text_match=321)
  turn 10:  637 ms (FAST text_match=750)

Every turn after the first hits the FAST text-prefix path. The
remaining ~50ms/turn growth is the unavoidable O(n) attention cost.

For the WASM browser demo, this means: instead of every turn taking
full prefill time (5-10s for a 0.8B model), only turn 1 is slow.
Turns 2+ feel instantaneous to the user.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@unamedkr unamedkr merged commit 49c6605 into main Apr 12, 2026
@unamedkr unamedkr deleted the feat/wasm-chat-caching branch April 12, 2026 00:28
unamedkr added a commit that referenced this pull request Apr 12, 2026
After PRs #48-#51 the chat KV cache reuse path was a complex multi-layer
system. Audited every code path for hidden bugs and fixed all of them.

## Bugs found and fixed

1. **Slow-path fallback corrupted KV state** [P0]
   tq_generate_chat_text's overflow fallback called tq_generate_continue
   on the SAME state that already had old KV at positions [0..prefix_pos).
   New prefill would write [0..n_new) leaving stale [n_new..prefix_pos)
   that subsequent generation might read. Replaced with -2 return code:
   the caller decides (server returns HTTP 413, WASM auto-resets the
   chat and shows a status message).

2. **WASM reset_chat partial cleanup** [P1]
   wasm_reset_chat called quant_chat(NULL) but did not reset
   g_output_pos / g_output[0] / g_stream_count, so the next generation
   would append to stale text from the previous chat. Now resets all.

3. **wasm_generate (sync path) missed g_stream_count reset** [P1]
   The async path zeroed it, the sync path did not. Aligned both.

4. **Wheel header _quant.h stale** [P0]
   bindings/python/quantcpp/_quant.h is .gitignore'd and the next pip
   build would have used quant.h from before PR #51 (no
   tq_generate_chat_text). Synced to current quant.h.

5. **Overflow surface — WASM** [P1]
   Added n == -2 detection in wasm_generate / wasm_generate_async.
   Auto-reset chat and call js_on_status with a clear error message
   so the JS side can show "Context full — chat reset".

6. **Overflow surface — server** [P1]
   Added gen_rc == -2 detection in both streaming and non-streaming
   handlers. Server resets the session's KV state + cached_text + tokens
   and returns HTTP 413 with an OpenAI-compatible error JSON.

7. **tq_generate_continue cached_text drift documentation** [P2]
   Added a header comment explaining that tq_generate_continue is the
   lower-level API and doesn't track cached_text. Higher-level callers
   must use tq_generate_chat_text for cached_text safety.

## Audited but safe

- Server session concurrency: get_or_create_session is called inside
  inference_mutex, so LRU bookkeeping is serialized.
- json_extract_string buffer safety: respects buf_size - 1 bound.
- WASM g_output overflow: tokens dropped from local buffer but
  js_on_token still fires, so JS side gets all output. Acceptable.

## Verified end-to-end

  alice/bob interleaved 5 turns each (real assistant replay):
    alice: 339 → 514 ms (~50 ms/turn growth from O(n) attention)
    bob:   310 → 518 ms (similar)

No regressions; all turns hit the FAST text-prefix path after turn 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr added a commit that referenced this pull request Apr 12, 2026
After PRs #48-#51 the chat KV cache reuse path was a complex multi-layer
system. Audited every code path for hidden bugs and fixed all of them.

## Bugs found and fixed

1. **Slow-path fallback corrupted KV state** [P0]
   tq_generate_chat_text's overflow fallback called tq_generate_continue
   on the SAME state that already had old KV at positions [0..prefix_pos).
   New prefill would write [0..n_new) leaving stale [n_new..prefix_pos)
   that subsequent generation might read. Replaced with -2 return code:
   the caller decides (server returns HTTP 413, WASM auto-resets the
   chat and shows a status message).

2. **WASM reset_chat partial cleanup** [P1]
   wasm_reset_chat called quant_chat(NULL) but did not reset
   g_output_pos / g_output[0] / g_stream_count, so the next generation
   would append to stale text from the previous chat. Now resets all.

3. **wasm_generate (sync path) missed g_stream_count reset** [P1]
   The async path zeroed it, the sync path did not. Aligned both.

4. **Wheel header _quant.h stale** [P0]
   bindings/python/quantcpp/_quant.h is .gitignore'd and the next pip
   build would have used quant.h from before PR #51 (no
   tq_generate_chat_text). Synced to current quant.h.

5. **Overflow surface — WASM** [P1]
   Added n == -2 detection in wasm_generate / wasm_generate_async.
   Auto-reset chat and call js_on_status with a clear error message
   so the JS side can show "Context full — chat reset".

6. **Overflow surface — server** [P1]
   Added gen_rc == -2 detection in both streaming and non-streaming
   handlers. Server resets the session's KV state + cached_text + tokens
   and returns HTTP 413 with an OpenAI-compatible error JSON.

7. **tq_generate_continue cached_text drift documentation** [P2]
   Added a header comment explaining that tq_generate_continue is the
   lower-level API and doesn't track cached_text. Higher-level callers
   must use tq_generate_chat_text for cached_text safety.

## Audited but safe

- Server session concurrency: get_or_create_session is called inside
  inference_mutex, so LRU bookkeeping is serialized.
- json_extract_string buffer safety: respects buf_size - 1 bound.
- WASM g_output overflow: tokens dropped from local buffer but
  js_on_token still fires, so JS side gets all output. Acceptable.

## Verified end-to-end

  alice/bob interleaved 5 turns each (real assistant replay):
    alice: 339 → 514 ms (~50 ms/turn growth from O(n) attention)
    bob:   310 → 518 ms (similar)

No regressions; all turns hit the FAST text-prefix path after turn 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr added a commit that referenced this pull request Apr 12, 2026
After PRs #48-#51 the chat KV cache reuse path was a complex multi-layer
system. Audited every code path for hidden bugs and fixed all of them.

## Bugs found and fixed

1. **Slow-path fallback corrupted KV state** [P0]
   tq_generate_chat_text's overflow fallback called tq_generate_continue
   on the SAME state that already had old KV at positions [0..prefix_pos).
   New prefill would write [0..n_new) leaving stale [n_new..prefix_pos)
   that subsequent generation might read. Replaced with -2 return code:
   the caller decides (server returns HTTP 413, WASM auto-resets the
   chat and shows a status message).

2. **WASM reset_chat partial cleanup** [P1]
   wasm_reset_chat called quant_chat(NULL) but did not reset
   g_output_pos / g_output[0] / g_stream_count, so the next generation
   would append to stale text from the previous chat. Now resets all.

3. **wasm_generate (sync path) missed g_stream_count reset** [P1]
   The async path zeroed it, the sync path did not. Aligned both.

4. **Wheel header _quant.h stale** [P0]
   bindings/python/quantcpp/_quant.h is .gitignore'd and the next pip
   build would have used quant.h from before PR #51 (no
   tq_generate_chat_text). Synced to current quant.h.

5. **Overflow surface — WASM** [P1]
   Added n == -2 detection in wasm_generate / wasm_generate_async.
   Auto-reset chat and call js_on_status with a clear error message
   so the JS side can show "Context full — chat reset".

6. **Overflow surface — server** [P1]
   Added gen_rc == -2 detection in both streaming and non-streaming
   handlers. Server resets the session's KV state + cached_text + tokens
   and returns HTTP 413 with an OpenAI-compatible error JSON.

7. **tq_generate_continue cached_text drift documentation** [P2]
   Added a header comment explaining that tq_generate_continue is the
   lower-level API and doesn't track cached_text. Higher-level callers
   must use tq_generate_chat_text for cached_text safety.

## Audited but safe

- Server session concurrency: get_or_create_session is called inside
  inference_mutex, so LRU bookkeeping is serialized.
- json_extract_string buffer safety: respects buf_size - 1 bound.
- WASM g_output overflow: tokens dropped from local buffer but
  js_on_token still fires, so JS side gets all output. Acceptable.

## Verified end-to-end

  alice/bob interleaved 5 turns each (real assistant replay):
    alice: 339 → 514 ms (~50 ms/turn growth from O(n) attention)
    bob:   310 → 518 ms (similar)

No regressions; all turns hit the FAST text-prefix path after turn 1.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant