Fix: Seed TokenBuffer with cached prompt tokens on cache hits #960
Open
Yukon wants to merge 1 commit into jundot:main from
Pass all_tokens=[[prompt[:-1]]] to BatchGenerator.insert() so that penalty processors (repetition/presence/frequency) can see and penalize tokens from the cached KV prefix. Previously, cache hits on multi-turn conversations would seed the TokenBuffer with an empty context, so the assistant could freely repeat tokens from its previous output.

Test: TestTokenBufferSeedOnCacheHit verifies that insert() is called with the correct all_tokens in both the cache-hit and single-token cases.
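A minimal sketch of the call shape described above. The argument names are taken from this PR's wording and are not verified against the actual oMLX/mlx_lm source; the helper name is hypothetical:

```python
# Hypothetical sketch only: the insert() signature below follows the PR text,
# not a verified API.

def insert_with_cached_prefix(generator, prompt, prompt_cache):
    """Seed the penalty context when a prompt-cache hit is detected."""
    if prompt_cache is not None and len(prompt) > 1:
        # Cache hit: pass the cached prefix so the TokenBuffer (and therefore
        # the repetition/presence/frequency processors) can see prior-turn tokens.
        generator.insert(prompt, cache=prompt_cache, all_tokens=[[prompt[:-1]]])
    else:
        # Single-token / cold-start case: nothing cached to seed the buffer with.
        generator.insert(prompt, cache=prompt_cache)
```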
Issue
While investigating the root cause of repetition issues with models, it was discovered that we are not restoring the TokenBuffer in mlx_lm during cache hits. When a prior turn's KV cache was restored, the penalty state was not, so penalty processors started with an empty context and failed to penalize tokens from the previous generation.
Fix

Passing all_tokens to BatchGenerator.insert() fixes this: the TokenBuffer is initialized with the full cached prompt prefix, giving penalty processors visibility into tokens from the restored cache. This mirrors what the sample server code in mlx_lm does.
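To make the failure mode concrete, here is an illustrative repetition-penalty processor (a sketch, not mlx_lm's implementation): with an empty context nothing can be penalized, which is exactly what happened on cache hits before this change.

```python
# Illustrative-only repetition penalty, showing why the penalty context must
# include the cached prefix on a cache hit.

def apply_repetition_penalty(logits, context_tokens, penalty=1.3):
    # logits: list[float] indexed by token id
    # context_tokens: token ids the processor is allowed to penalize
    out = list(logits)
    for tok in set(context_tokens):
        # Standard repetition penalty: shrink positive logits, amplify negative ones.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

# Before this fix, a cache hit seeded the buffer with an empty context:
#   apply_repetition_penalty(logits, [])           -> no token is penalized
# After the fix, the cached prompt prefix is passed through:
#   apply_repetition_penalty(logits, prompt[:-1])  -> prior-turn tokens are penalized
```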
Known Call Outs

When using SpecPrefill, tokens that are not selected during KV cache generation will still be present in request.prompt_token_ids and will be penalized. This was a deliberate, user-centric choice: penalization is based on the conversation as a whole rather than only on the tokens the model actually saw and generated.
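A small sketch of that design choice, with placeholder names: the penalty context is derived from request.prompt_token_ids (the whole conversation), not from the subset of tokens SpecPrefill kept.

```python
# Illustrative only; `request` and `selected_token_ids` are placeholder names.

def build_penalty_context(request, selected_token_ids):
    # Conversation-centric choice taken here: penalize against everything in the
    # prompt, including tokens SpecPrefill dropped during cache generation.
    return list(request.prompt_token_ids)
    # Model-centric alternative (not taken): penalize only the tokens the model
    # actually processed, i.e. `selected_token_ids`.
```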
Next Steps

As part of continuing to address n-gram repetition in oMLX, a follow-up PR will address the small penalty window, which currently uses the default 20-token context size. That window is far too small to effectively keep CoT models like Qwen 3.6 from repeating themselves.
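A rough sketch of what that follow-up would change. The 20-token default comes from the text above; the helper below is hypothetical:

```python
# Hypothetical sketch: only the last `context_size` tokens are visible to
# penalty processors.

def penalty_window(context_tokens, context_size=20):
    # With context_size=20, a long chain-of-thought can drift outside the window
    # and start repeating n-grams that are no longer penalized; a larger window
    # (or the full context) keeps earlier output penalizable.
    return context_tokens[-context_size:]
```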
🤖 AI was used to generate code; the work was reviewed and validated by a human.