feat: text-prefix chat cache + json_find_key bugfix#50

Merged
unamedkr merged 1 commit into main from feat/text-prefix-cache on Apr 12, 2026
Conversation

@unamedkr
Collaborator

Follow-up to PR #49. Token-level LCP had a fundamental cap from BPE re-tokenization (model-sampled tokens vs text-encoded tokens diverge → LCP truncates at ~10 tokens). This PR adds text-level prefix matching that bypasses the issue entirely.

Approach

`tq_generate_chat_text()` — each session stores the full prompt+response text from the last call. On a new request:

  1. Check if new prompt starts with cached_text byte-for-byte
  2. If yes: tokenize ONLY the suffix and prefill at [n_cached..]
  3. If no: fall back to tq_generate_continue (token LCP path)

Generated tokens are decoded to text via a tee callback during generation, so the next call's cached_text includes both the prompt AND the model's response.
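Steps 1–3 amount to a byte-for-byte prefix check followed by suffix-only tokenization. A minimal sketch (illustrative struct and names, not the actual engine API):

```c
#include <string.h>

/* Sketch of the text-prefix decision. A session keeps the full
 * prompt+response text of its previous call; field names here are
 * illustrative, not the real tq_generate_chat_text internals. */
typedef struct {
    char *cached_text;  /* prompt + response from the last call */
    int   n_cached;     /* tokens already in the KV cache */
} session_t;

/* Returns a pointer to the suffix that still needs tokenizing, or NULL
 * when cached_text is not a byte-for-byte prefix of the new prompt
 * (the caller then falls back to the token-LCP path). */
static const char *text_prefix_suffix(const session_t *s, const char *prompt) {
    if (!s->cached_text) return NULL;               /* cold session */
    size_t n = strlen(s->cached_text);
    if (strncmp(prompt, s->cached_text, n) != 0) return NULL;
    return prompt + n;  /* tokenize ONLY this; prefill at [n_cached..] */
}
```

Because the comparison is on raw bytes, it never touches the tokenizer, which is what sidesteps the BPE re-tokenization divergence entirely.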

Bonus: json_find_key bug

Found while debugging multi-session. `json_find_key("user")` was matching the value in `{"role":"user"}` (first occurrence), then failing the colon check, so every request fell back to the "default" session — multi-session was effectively broken since PR #49. Fix: scan for `"key":` (with colon) to disambiguate from values.
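A minimal sketch of the fix, assuming a simple `strstr`-based scanner (the real `json_find_key` in the server may differ in details such as whitespace handling):

```c
#include <stdio.h>
#include <string.h>

/* Sketch of the fix: search for the quoted key FOLLOWED by a colon, so
 * "user" appearing as a value in {"role":"user"} no longer matches.
 * Whitespace between the closing quote and the colon is not handled
 * in this illustration. */
static const char *json_find_key(const char *json, const char *key) {
    char pat[64];
    snprintf(pat, sizeof pat, "\"%s\":", key);  /* match "key": with colon */
    return strstr(json, pat);
}
```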

Measured

| Test | PR #49 (token LCP) | This PR (text-prefix) |
| --- | --- | --- |
| 10-turn single user | turn 10 → 3700 ms | turn 10 → 739 ms |
| alice+bob interleaved, 5 turns | alice/bob ~2400 ms | alice/bob ~480 ms |
| Improvement at turn 10 | 1.5x vs no-reuse | 7.3x vs no-reuse |

The remaining ~50ms/turn growth is the unavoidable O(n) attention cost over full context. KV prefill is now truly O(new tokens per turn).

Combined progression (chat KV cache work)

| Round | Approach | Turn 10 | vs Original |
| --- | --- | --- | --- |
| Original | No reuse | 5400 ms | 1x |
| PR #48 | Token LCP | 3700 ms | 1.5x |
| PR #49 | + multi-session + overflow safety | 3700 ms | 1.5x |
| This PR | + text-prefix matching | 739 ms | 7.3x |

🤖 Generated with Claude Code

Follow-up to PR #49. The token-level LCP path in tq_generate_continue
has a fundamental limitation: model-generated tokens (sample_topp) and
text-encoded tokens (tq_encode of the response in the next turn) can
diverge due to BPE merge non-roundtripping. This caps per-turn LCP at
the prompt boundary (~10 tokens), so longer histories still incur
mostly-full reprefill.

Fix: tq_generate_chat_text() — text-level prefix matching.

How it works:
1. Each session stores the entire prompt+response text from the
   previous call (cached_text).
2. On a new request, check if the new prompt starts with cached_text
   byte-for-byte. If yes, the cached state is byte-equivalent valid.
3. Tokenize ONLY the suffix (new_prompt[strlen(cached_text):]) and
   prefill those tokens at positions [n_cached..n_cached + n_suffix).
4. Run generation. The accumulated output text gets appended to
   cached_text via a tee callback for the next call.
5. If text prefix doesn't match, fall back to tq_generate_continue
   (token LCP path).
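
The tee callback in step 4 can be sketched as follows (hypothetical struct and names; the real callback is wired up inside tq_generate_chat_text):

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of the tee: forward each decoded token piece to the user's
 * streaming callback AND append it to cached_text, so the next call
 * can prefix-match against prompt + response. Illustrative only. */
typedef struct {
    char  *cached_text;
    size_t len;
    void (*user_cb)(const char *piece);
} tee_t;

static void tee_on_piece(tee_t *t, const char *piece) {
    size_t n = strlen(piece);
    t->cached_text = realloc(t->cached_text, t->len + n + 1);
    memcpy(t->cached_text + t->len, piece, n + 1);  /* keep NUL */
    t->len += n;
    if (t->user_cb) t->user_cb(piece);  /* stream to caller as before */
}
```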

Bug fix bundled: json_find_key("user") was matching the value in
{"role":"user"} instead of the top-level "user" key. Result: every
request used the "default" session, so multi-session was effectively
broken (cross-pollution). The fix scans for "key": (with colon) to
disambiguate from value matches.

Measured (SmolLM2-135M, single thread, real chat replay):

  Single user, 10-turn accumulation:
    PR #48 (token LCP only):         turn 10 → 3700 ms
    PR #49 (above + multi-session):  turn 10 → 3700 ms (LCP still capped)
    This PR (text-prefix path):      turn 10 →  739 ms (5x)

  alice + bob interleaved, 5 turns each (real assistant replay):
    PR #49:  alice 5 = 2412 ms, bob 5 = 2357 ms
    Now:     alice 5 =  498 ms, bob 5 =  462 ms (5x)

The growth that remains (~50ms/turn) is the unavoidable O(n) cost of
the attention computation over the full context — KV prefill is now
truly O(new tokens per turn), not O(full history per turn).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr merged commit 471a5f4 into main on Apr 12, 2026
unamedkr deleted the feat/text-prefix-cache branch on April 12, 2026 00:09
unamedkr added a commit that referenced this pull request Apr 12, 2026
PR #50 added text-prefix matching to src/engine/tq_generate.c (used by
the HTTP server). This PR ports it to quant.h (single-header) so the
WASM browser demo and Python wheel get the same speedup.

Three layers:

1. **quant.h**: ported tq_generate_chat_text from src/engine. Added
   cached_text field to quant_ctx struct. quant_chat() now uses the
   text-prefix path instead of the token-LCP path. quant_free_ctx()
   frees cached_text. Pass NULL prompt to reset session (frees
   cached_text too).

2. **wasm/quant_wasm.c**:
   - wasm_generate_async / wasm_generate now call quant_chat() instead
     of quant_generate() (which destroyed the cache via free+recreate
     of g_ctx every call — biggest reason WASM was slow on multi-turn).
   - Reuse the existing g_ctx across calls; only update temperature/
     top_p/max_tokens fields (kv_compress is immutable post-creation).
   - New wasm_reset_chat() for starting a new chat session.

3. **wasm/index.html**:
   - Accumulates ChatML history client-side (chatHistory string).
     Each turn appends `<|im_start|>user\n${text}<|im_end|>\n
     <|im_start|>assistant\n` and sends the FULL history to WASM.
   - The C side's text-prefix matching reuses everything before the
     new turn — turn N's prefill is O(new user message), not
     O(full history).
   - After response, appends model output + <|im_end|>\n so the next
     turn matches the cached_text byte-for-byte.
   - Loading message differentiates first turn ("Processing prompt
     — may take a few seconds") vs subsequent ("Generating...").

4. **wasm/build.sh**: exports _wasm_reset_chat.
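
The reset semantics from layer 1 (a NULL prompt frees cached_text and clears the session) can be sketched as (hypothetical struct and function names mirroring quant.h, not the actual header):

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for the cached_text field added to quant_ctx. */
typedef struct {
    char *cached_text;
    int   n_cached;
} quant_ctx_sketch;

/* Sketch: NULL prompt resets the session; otherwise the text-prefix
 * match + generation path (elided here) would run. */
static int quant_chat_sketch(quant_ctx_sketch *ctx, const char *prompt) {
    if (prompt == NULL) {           /* reset session */
        free(ctx->cached_text);
        ctx->cached_text = NULL;
        ctx->n_cached = 0;
        return 0;
    }
    /* ... prefix match + prefill + generate would go here ... */
    return 1;
}
```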

Validated end-to-end with the C test (real response replay):

  turn  1:  206 ms (cold, SLOW path)
  turn  2:  315 ms (FAST text_match=64)
  turn  5:  437 ms (FAST text_match=321)
  turn 10:  637 ms (FAST text_match=750)

Every turn after the first hits the FAST text-prefix path. The
remaining ~50ms/turn growth is the unavoidable O(n) attention cost.

For the WASM browser demo, this means: instead of every turn taking
full prefill time (5-10s for a 0.8B model), only turn 1 is slow.
Turns 2+ feel instantaneous to the user.
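
The client-side ChatML framing from layer 3 can be sketched in C for clarity (the demo actually does this in JavaScript in wasm/index.html; names are illustrative):

```c
#include <stdio.h>
#include <string.h>

/* Sketch: append one user turn plus the assistant header to the running
 * history. After generation, the model output and "<|im_end|>\n" are
 * appended too, so the next turn's full prompt starts byte-for-byte
 * with the previous cached_text. */
static void append_user_turn(char *hist, size_t cap, const char *user_msg) {
    size_t n = strlen(hist);
    snprintf(hist + n, cap - n,
             "<|im_start|>user\n%s<|im_end|>\n<|im_start|>assistant\n",
             user_msg);
}
```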

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr added a commit that referenced this pull request Apr 12, 2026
…#51)

(Same commit message as above.)