
Fix: Seed TokenBuffer with cached prompt tokens on cache hits #960

Open

Yukon wants to merge 1 commit into jundot:main from Yukon:fix/tokenbuffer-seed

Conversation

@Yukon commented Apr 25, 2026

Issue

While investigating the root cause of repetition issues with models, we discovered that we are not restoring the TokenBuffer in mlx_lm on cache hits. When a prior turn’s KV cache was restored, the penalty state was not, so penalty processors started with an empty context and failed to penalize tokens from the previous generation.

Fix

Passing all_tokens to BatchGenerator.insert() fixes this: the TokenBuffer is initialized with the full cached prompt prefix, giving penalty processors visibility into tokens from the restored cache. This matches what the sample server code in mlx_lm does.
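For reference, a minimal sketch of the shape of the change. The cache helper, request shape, and scheduling function here are hypothetical stand-ins; only BatchGenerator.insert() and the all_tokens argument come from this PR:

```python
# Illustrative sketch only: schedule_request, request.prompt_token_ids, and
# cache.common_prefix_len() are hypothetical; the insert() call mirrors the
# commit message.
def schedule_request(generator, request, cache):
    prompt = request.prompt_token_ids
    cached_len = cache.common_prefix_len(prompt)  # hypothetical helper

    # Only the uncached suffix needs a prefill pass...
    suffix = prompt[cached_len:]

    # ...but the TokenBuffer must be seeded with the full cached prefix so
    # that repetition/presence/frequency processors see prior-turn tokens.
    # Before this fix, all_tokens was omitted and the buffer started empty.
    generator.insert(
        [suffix],
        all_tokens=[[prompt[:-1]]],  # full prompt minus the final token,
                                     # matching the commit message below
    )
```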

Known Callouts

When using SpecPrefill, tokens that are not selected during KV cache generation will still appear in request.prompt_token_ids and be penalized. This is deliberate: penalization is user-centric, based on the full conversation, rather than limited to only the tokens the model has actually seen and generated.
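To make the user-centric choice concrete, a hedged sketch of how the penalty context might be assembled; the function name and token lists are illustrative, not the project's actual code:

```python
from typing import Sequence


def build_penalty_context(
    prompt_token_ids: Sequence[int],     # full conversation, including any
                                         # tokens SpecPrefill skipped
    generated_token_ids: Sequence[int],  # tokens produced this turn
) -> set[int]:
    """Token ids eligible for repetition/presence/frequency penalties.

    User-centric: every token in the conversation counts, even prompt
    tokens that were not selected during KV cache generation.
    """
    return set(prompt_token_ids) | set(generated_token_ids)
```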

Next Steps

As part of continuing to address n-gram repetition in oMLX, a follow-up PR will address the small penalty window, which currently uses the default 20-token context size. That window is far too small to effectively keep CoT models like Qwen 3.6 from repeating themselves.
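To illustrate why the window matters, a sketch of a windowed repetition penalty using the common CTRL-style divide/multiply rule; the function and its defaults are illustrative, with context_size mirroring the 20-token default described above:

```python
def apply_repetition_penalty(
    logits: list[float],      # one logit per token id
    tokens: list[int],        # full generation history
    penalty: float = 1.3,
    context_size: int = 20,   # the default called out above
) -> list[float]:
    """Penalize only tokens seen in the last `context_size` positions.

    With a 20-token window, a CoT model that repeats an n-gram every few
    hundred tokens is never penalized: the earlier occurrence has already
    slid out of the window by the time the repeat begins.
    """
    out = list(logits)
    for tok in set(tokens[-context_size:]):
        # CTRL-style rule: shrink positive logits, grow negative ones.
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out
```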

🤖 AI was used to generate code; work was reviewed and validated by a human.

Pass all_tokens=[[prompt[:-1]]] to BatchGenerator.insert() so that
penalty processors (repetition/presence/frequency) can see and penalize
tokens from the cached KV prefix. Previously, cache hits on multi-turn
conversations would seed the TokenBuffer with an empty context, so the
assistant could freely repeat tokens from its previous output.

Test: TestTokenBufferSeedOnCacheHit verifies that insert() is called
with the correct all_tokens on both cache-hit and single-token cases.
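The test described above might look roughly like the following; the inline scheduling stand-in is hypothetical, and only the mock-based assertion pattern is the point:

```python
import unittest
from unittest import mock


def schedule_on_cache_hit(generator, prompt):
    # Hypothetical stand-in for the real scheduling path: on a cache hit,
    # seed the TokenBuffer with everything but the final prompt token.
    generator.insert([prompt[-1:]], all_tokens=[[prompt[:-1]]])


class TestTokenBufferSeedOnCacheHit(unittest.TestCase):
    def test_insert_receives_cached_prefix(self):
        prompt = [101, 7, 42, 9, 2048]
        generator = mock.Mock()

        schedule_on_cache_hit(generator, prompt)

        _, kwargs = generator.insert.call_args
        # The TokenBuffer must be seeded with the full prompt prefix,
        # not an empty context.
        self.assertEqual(kwargs["all_tokens"], [[prompt[:-1]]])


if __name__ == "__main__":
    unittest.main()
```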
