feat(pld): hybrid partial-accept replay for SSM models (#134) #149
Open
st-adam wants to merge 1 commit into
Ari4ka approved these changes May 11, 2026
…i#134)

On hybrid SSM/ATT models with 0 < num_accept < K, the PLD path previously discarded accepted drafts and emitted only a correction token. This PR adds _replay_ssm_forward() which restores caches to N, replays the accepted tokens through the model, and emits num_accept+1 tokens instead of 1.

- New Scheduler._replay_ssm_forward() staticmethod (scheduler.py)
- Modify case (b): try replay first, fall back to correction-only on failure
- Add _pld_replay_{enabled,attempts,emitted,failures} counters
- Add pld_ssm_replay telemetry to /health endpoint (server.py)
- Document the fix in notes/prompt-lookup-decoding.md
- 6 unit tests in tests/test_pld_ssm_replay.py
- New partial_accept_stress benchmark in tests/benchmark/test_pld_acceptance.py

Expected gain: +5-10% on top of PR jjang-ai#26's +4-7% on hybrid models.

Disable: VMLX_DISABLE_PLD_REPLAY=1

Closes jjang-ai#134

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Rebased onto current Verified post-rebase:
PR is |
Summary
- Handles `0 < num_accept < K` in the PLD partial-reject path
- Adds `Scheduler._replay_ssm_forward()` to restore caches to N, replay accepted tokens, advance to N+K' (K' = num_accept), and emit K'+1 tokens instead of 1
- Disable with `VMLX_DISABLE_PLD_REPLAY=1`
- `/health` field `pld_ssm_replay.{enabled,attempts,emitted,failures}`
- 6 unit tests in `tests/test_pld_ssm_replay.py`

Problem
With PR #26's PLD on hybrid models (48 GatedDeltaNet + 16 full-attention), a partial accept (`num_accept=1` with K=2) still emits only 1 correction token: SSM state cannot be trimmed, so both caches must rewind to N.

Solution
After a partial rejection, restore both caches to N, then replay `drafts[:num_accept]` forward through the full model. Both caches reach N+num_accept. Emit `drafts[:num_accept] + [bonus_token]`, the same as the full-accept path minus the extra K-K' tokens.

Expected gain
+5-10% on top of PR #26's +4-7% on hybrid models. Full PLD target for hybrid moves from +4-7% toward the +15-25% cited in #134.
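
For readers who want the restore-and-replay step from the Solution spelled out, here is a minimal sketch on a toy model. It is not the code in `scheduler.py`: `ToyHybridCache`, `toy_forward`, and `partial_accept_replay` are hypothetical names, the "model" is a hash over token ids, and the snapshot/restore mechanism is assumed purely for illustration. It only shows the control flow: rewind both caches to N, replay the accepted drafts, emit them plus one bonus token, and fall back to correction-only on failure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyHybridCache:
    """Hypothetical stand-in for the paired attention (KV) and SSM caches."""
    kv: List[int] = field(default_factory=list)  # one entry per processed token
    ssm: int = 0  # folded recurrent state; cannot be trimmed token by token

    def snapshot(self) -> Tuple[List[int], int]:
        return list(self.kv), self.ssm

    def restore(self, snap: Tuple[List[int], int]) -> None:
        self.kv, self.ssm = list(snap[0]), snap[1]


def toy_forward(cache: ToyHybridCache, tokens: List[int]) -> int:
    """Advance both caches over `tokens` and return a toy greedy next token."""
    for tok in tokens:
        cache.kv.append(tok)
        cache.ssm = (cache.ssm * 31 + tok) % 1_000_003
    return (cache.ssm + len(cache.kv)) % 50


def partial_accept_replay(cache, snap_at_n, drafts, num_accept, correction):
    """Return the tokens to emit for a partial accept (0 < num_accept < K)."""
    if not 0 < num_accept < len(drafts):
        raise ValueError("replay only applies when 0 < num_accept < K")
    try:
        cache.restore(snap_at_n)              # both caches back to N
        accepted = drafts[:num_accept]
        bonus = toy_forward(cache, accepted)  # both caches now at N + num_accept
        return accepted + [bonus]
    except Exception:
        cache.restore(snap_at_n)              # assumed fallback: correction-only
        return [correction]


# Usage: prompt of 3 tokens (N = 3), K = 2 drafts, only the first accepted.
cache = ToyHybridCache()
toy_forward(cache, [1, 2, 3])
snap = cache.snapshot()
toy_forward(cache, [7, 8])  # verify pass leaves both caches at N + K
emitted = partial_accept_replay(cache, snap, [7, 8], num_accept=1, correction=0)
print(emitted, len(cache.kv))
```

In this toy the final cache state matches what a fresh run over the prompt plus the accepted draft would produce, which is the property the byte-equal check in the test plan relies on.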
Test plan
- `pytest tests/test_pld_ssm_replay.py -v`: 6 unit tests pass
- `pytest tests/test_ssm_companion_cache.py -v`: existing tests unaffected
- `VMLX_DISABLE_PLD_REPLAY=1` vs unset: byte-equal output at T=0, higher tok/s with the variable unset

Fixes #134
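
For the A/B step in the test plan, a rough sketch of how the kill switch and the `/health` counters could fit together. Only the variable name `VMLX_DISABLE_PLD_REPLAY` and the field names `enabled`, `attempts`, `emitted`, `failures` come from this PR. The parsing rule (only the value `1` disables), the meaning assigned to each counter, and the names `replay_enabled`, `ReplayCounters`, and `health_fragment` are assumptions made for illustration.

```python
import os
from dataclasses import asdict, dataclass
from typing import Mapping, Optional


def replay_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Assumed rule: replay is on unless VMLX_DISABLE_PLD_REPLAY is exactly '1'."""
    env = os.environ if env is None else env
    return env.get("VMLX_DISABLE_PLD_REPLAY") != "1"


@dataclass
class ReplayCounters:
    """Hypothetical container mirroring the pld_ssm_replay field names."""
    enabled: bool
    attempts: int = 0
    emitted: int = 0
    failures: int = 0

    def record(self, tokens_emitted: int, ok: bool) -> None:
        # Assumed semantics: every replay counts as an attempt; a success adds
        # its emitted tokens, a failure increments the failure count.
        self.attempts += 1
        if ok:
            self.emitted += tokens_emitted
        else:
            self.failures += 1


def health_fragment(counters: ReplayCounters) -> dict:
    """Shape a counters object like the pld_ssm_replay block of /health."""
    return {"pld_ssm_replay": asdict(counters)}


# Usage: one successful replay emitting 2 tokens, then one failed replay.
counters = ReplayCounters(enabled=replay_enabled({}))
counters.record(tokens_emitted=2, ok=True)
counters.record(tokens_emitted=0, ok=False)
print(health_fragment(counters))
```

With the variable set to `1`, `enabled` would read false and `attempts` would be expected to stay at 0 for the whole run, which gives a quick sanity check before comparing tok/s.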
🤖 Generated with Claude Code