Fix: Seed TokenBuffer with cached prompt tokens on cache hits #960
Open
Yukon wants to merge 1 commit into jundot:main from
Pass all_tokens=[[prompt[:-1]]] to BatchGenerator.insert() so that penalty processors (repetition/presence/frequency) can see and penalize tokens from the cached KV prefix. Previously, cache hits on multi-turn conversations would seed the TokenBuffer with an empty context, so the assistant could freely repeat tokens from its previous output.

Test: TestTokenBufferSeedOnCacheHit verifies that insert() is called with the correct all_tokens in both the cache-hit and single-token cases.
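A minimal sketch of the call shape described above. The argument names are taken from this PR's wording and are not verified against the actual oMLX/mlx_lm source; the helper name is hypothetical:

```python
# Hypothetical sketch only: the insert() signature below follows the PR text,
# not a verified API.

def insert_with_cached_prefix(generator, prompt, prompt_cache):
    """Seed the penalty context when a prompt-cache hit is detected."""
    if prompt_cache is not None and len(prompt) > 1:
        # Cache hit: pass the cached prefix so the TokenBuffer (and therefore
        # the repetition/presence/frequency processors) can see prior-turn tokens.
        generator.insert(prompt, cache=prompt_cache, all_tokens=[[prompt[:-1]]])
    else:
        # Single-token / cold-start case: nothing cached to seed the buffer with.
        generator.insert(prompt, cache=prompt_cache)
```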
Issue
While investigating the root cause of repetition issues with models, it was discovered that we are not restoring the TokenBuffer in mlx_lm during cache hits. When a prior turn's KV cache was restored, the penalty state was not, so penalty processors started with an empty context and failed to penalize tokens from the previous generation.
Fix

Passing all_tokens to BatchGenerator.insert() fixes this: the TokenBuffer is initialized with the full cached prompt prefix, giving penalty processors visibility into tokens from the restored cache. This mirrors what the sample server code in mlx_lm does.
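To make the failure mode concrete, here is an illustrative repetition-penalty processor (a sketch, not mlx_lm's implementation): with an empty context nothing can be penalized, which is exactly what happened on cache hits before this change.

```python
# Illustrative-only repetition penalty, showing why the penalty context must
# include the cached prefix on a cache hit.

def apply_repetition_penalty(logits, context_tokens, penalty=1.3):
    # logits: list[float] indexed by token id
    # context_tokens: token ids the processor is allowed to penalize
    out = list(logits)
    for tok in set(context_tokens):
        # Standard repetition penalty: shrink positive logits, amplify negative ones.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

# Before this fix, a cache hit seeded the buffer with an empty context:
#   apply_repetition_penalty(logits, [])           -> no token is penalized
# After the fix, the cached prompt prefix is passed through:
#   apply_repetition_penalty(logits, prompt[:-1])  -> prior-turn tokens are penalized
```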
Known Call Outs

When using SpecPrefill, tokens that are not selected during KV cache generation will still be present in request.prompt_token_ids and will be penalized. This was a deliberate, user-centric choice: penalization is based on the conversation as a whole rather than only on the tokens the model actually saw and generated.
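A small sketch of that design choice, with placeholder names: the penalty context is derived from request.prompt_token_ids (the whole conversation), not from the subset of tokens SpecPrefill kept.

```python
# Illustrative only; `request` and `selected_token_ids` are placeholder names.

def build_penalty_context(request, selected_token_ids):
    # Conversation-centric choice taken here: penalize against everything in the
    # prompt, including tokens SpecPrefill dropped during cache generation.
    return list(request.prompt_token_ids)
    # Model-centric alternative (not taken): penalize only the tokens the model
    # actually processed, i.e. `selected_token_ids`.
```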
Next Steps

As part of continuing to address n-gram repetition in oMLX, a follow-up PR will address the small penalty window, which currently uses the default 20-token context size. That window is far too small to effectively keep CoT models like Qwen 3.6 from repeating themselves.
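A rough sketch of what that follow-up would change. The 20-token default comes from the text above; the helper below is hypothetical:

```python
# Hypothetical sketch: only the last `context_size` tokens are visible to
# penalty processors.

def penalty_window(context_tokens, context_size=20):
    # With context_size=20, a long chain-of-thought can drift outside the window
    # and start repeating n-grams that are no longer penalized; a larger window
    # (or the full context) keeps earlier output penalizable.
    return context_tokens[-context_size:]
```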
🤖 AI was used to generate code; the work was reviewed and validated by a human.