Add Prompt Lookup Decoding (ngram-simple) and Rolling-Hash Speculative Memory (ngram-mod)#1297
mayank2130 wants to merge 6 commits into
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hey @angeloskath, can this PLD/n-gram decoding PR be reviewed? If you're not the right person to reach out to for mlx-lm PRs, could you point me to someone else? Thanks.
Tested this branch on Apple Silicon (M5, 32 GB, macOS 25.2.0, Python 3.13) against Qwen3.5/3.6-family models and hit three issues worth flagging — two bugs and one UX trap that together made the feature look like it was working when it wasn't. 1. CLI flags are parsed but never forwarded to
Repro scripts and prompts for the findings above, as offered: https://gist.github.com/ashalliants/91819d410f6822e406a314740c8b7d0e
Closes #851
Summary
Adds Prompt Lookup Decoding (PLD) and rolling-hash speculative decoding to `mlx_lm` via a generalized `DraftStrategy` abstraction.
Instead of generating speculative drafts with a smaller neural model, the new strategies reuse previously observed token trajectories:
- `ngram-simple` performs exact prompt-history lookup
- `ngram-mod` implements a rolling-hash associative memory ported from `llama.cpp` PR #19164
Both strategies preserve output correctness because speculative tokens are only accepted if verified by the target model under the same sampling configuration.
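To make the lookup-then-verify idea concrete, here is a minimal sketch. It is not the code in this PR: `propose_draft` and `accept_prefix` are hypothetical names, and the verification shown assumes greedy decoding only (verification under sampling needs a different acceptance rule).

```python
from typing import List, Sequence


def propose_draft(history: Sequence[int], ngram_size: int, num_draft: int) -> List[int]:
    """Return up to num_draft tokens that followed the most recent earlier
    occurrence of the last ngram_size tokens, or [] if nothing matches."""
    n = len(history)
    if ngram_size <= 0 or n <= ngram_size:
        return []
    suffix = list(history[n - ngram_size:])
    # Scan backward so the most recent match wins; the loop starts one
    # position before the suffix itself so the suffix never matches itself.
    for start in range(n - ngram_size - 1, -1, -1):
        if list(history[start:start + ngram_size]) == suffix:
            follow = start + ngram_size
            return list(history[follow:follow + num_draft])
    return []


def accept_prefix(draft: Sequence[int], target: Sequence[int]) -> List[int]:
    """Greedy verification (assumption: argmax decoding).

    target[i] is the token the target model picks at draft position i, and
    target has one extra entry for the position after the last draft token.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must hold len(draft) + 1 tokens")
    accepted: List[int] = []
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted.append(d)
    # The target's own token at the first unverified position is always
    # emitted, so the output equals what the target alone would produce.
    accepted.append(target[len(accepted)])
    return accepted


history = [1, 2, 3, 4, 5, 1, 2, 3]
draft = propose_draft(history, ngram_size=3, num_draft=2)  # [4, 5]
tokens = accept_prefix(draft, target=[4, 9, 0])            # [4, 9]
```

In the usage lines, the first draft token matches the target and is kept, while the second is rejected and replaced by the target's own choice.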
This PR adds:
- `DraftStrategy` interface for pluggable speculative drafters
- `ModelDraftStrategy` for existing neural drafting
- `NgramSimpleStrategy` for prompt lookup decoding
- `NgramModStrategy` + `NgramModTable` for rolling-hash speculative memory
Usage
For `ngram-mod`, reuse a table across related generations to preserve learned n-gram memory:
CLI: multi-turn ngram-simple
CLI: multi-turn ngram-mod
The chat command keeps the conversation history and prompt cache alive across turns, so turns 2 and 3 (T2/T3) can reuse the structure generated in turn 1 (T1).
Server
Per-request JSON overrides:
- `draft_type`, `ngram_size`
- `disable_adaptive_gate`
Architecture
Speculative drafting is abstracted behind:
`NgramSimpleStrategy` scans backward for matching n-grams and proposes the following continuation tokens directly from prior history.
`NgramModStrategy` ports llama.cpp's rolling-hash speculative memory.
Architecture mirrors llama.cpp's split between:
The shared table stores:
`hash(ngram) -> next_token`
allowing speculative reuse across requests handled by the same running server process.
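As an illustration of a `hash(ngram) -> next_token` memory, the sketch below uses a fixed-size slot array in which collisions simply overwrite. It is a toy under stated assumptions, not `NgramModTable` itself: the class name `ToyNgramTable`, the polynomial hash, the slot count and the overwrite policy are all hypothetical, and the hash is recomputed for each window for clarity rather than rolled incrementally.

```python
from typing import List, Optional, Sequence


class ToyNgramTable:
    """Fixed-size table mapping hash(ngram) -> next token (hypothetical sketch)."""

    def __init__(self, ngram_size: int, num_slots: int = 1 << 16, base: int = 1_000_003):
        self.n = ngram_size
        self.num_slots = num_slots
        self.base = base
        self.slots: List[Optional[int]] = [None] * num_slots

    def _hash(self, ngram: Sequence[int]) -> int:
        # Polynomial hash reduced modulo the table size.
        h = 0
        for tok in ngram:
            h = (h * self.base + tok + 1) % self.num_slots
        return h

    def update(self, tokens: Sequence[int]) -> None:
        """Record, for every n-gram in tokens, the token that followed it."""
        for i in range(len(tokens) - self.n):
            key = self._hash(tokens[i:i + self.n])
            self.slots[key] = tokens[i + self.n]  # a collision overwrites

    def draft(self, context: Sequence[int], num_draft: int) -> List[int]:
        """Chain lookups from the last n tokens of context until a miss."""
        window = list(context[-self.n:])
        if len(window) < self.n:
            return []
        out: List[int] = []
        for _ in range(num_draft):
            nxt = self.slots[self._hash(window)]
            if nxt is None:
                break
            out.append(nxt)
            window = window[1:] + [nxt]
        return out


# One table object shared by two "requests": the second reuses what the
# first one stored.
table = ToyNgramTable(ngram_size=2)
table.update([1, 2, 3, 4, 1, 2])       # request 1 fills the memory
proposal = table.draft([9, 1, 2], 3)   # request 2 drafts from it
```

Because a lookup only checks the hash slot, a collision can yield a wrong proposal; that costs throughput but not correctness, since every drafted token is still verified by the target model.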
Implementation behavior intentionally matches llama.cpp:
Adaptive Gate
An optional adaptive gate computes a 3-gram repetition score over the prompt. If repetition falls below:
`NGRAM_GATE_THRESHOLD = 0.02`
speculation is skipped automatically.
This is particularly important for ngram-mod, whose cold-start behavior can regress below baseline throughput on low-repetition prompts.
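A gate of this kind could look roughly like the sketch below. The exact score used in the PR is not spelled out here, so the definition is an assumption: the fraction of 3-gram positions whose 3-gram already occurred earlier in the prompt. The function names `repetition_score` and `should_speculate` are hypothetical; only the threshold value comes from the description above.

```python
from typing import Sequence

NGRAM_GATE_THRESHOLD = 0.02  # value quoted in the PR description


def repetition_score(tokens: Sequence[int], n: int = 3) -> float:
    """Fraction of n-gram positions whose n-gram already appeared earlier.

    Assumed definition for illustration; the PR may compute it differently.
    """
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0
    seen = set()
    repeats = 0
    for i in range(total):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            repeats += 1
        else:
            seen.add(gram)
    return repeats / total


def should_speculate(tokens: Sequence[int],
                     threshold: float = NGRAM_GATE_THRESHOLD) -> bool:
    """Skip speculation when the prompt shows too little repetition."""
    return repetition_score(tokens) >= threshold


novel_prompt = list(range(100))     # no repeated 3-grams
repetitive_prompt = [1, 2, 3] * 10  # heavily repeated 3-grams
gate_novel = should_speculate(novel_prompt)
gate_repetitive = should_speculate(repetitive_prompt)
```

With this definition a prompt of all-distinct tokens scores 0.0 and is gated off, while a looping prompt scores close to 1.0 and keeps speculation enabled.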
Benchmarks
All benchmarks used:
LONG MULTI-TURN EDITING (~280 TOK/TURN) — OVERALL
ngram-mod nd=6 per-turn behavior