qwen35: sampled-verify — speculative decoding with an active sampler#374
Open
Rhonstin wants to merge 1 commit into
2 issues found across 4 files
The chain verify path commits the target's argmax at every position, so spec decode only matches the target distribution at temperature 0. Any sampling request had to fall back to AR decode (16 tok/s on a 27B Q4_K_M / RTX 3090), while greedy spec ran at 60-70 tok/s.

This adds a sampled-verify mode (DFLASH_SAMPLED_VERIFY=1): at each chain position the verifier draws a token from the target's full sampler chain (temperature / top-k / top-p / penalties) over that position's verify logits, and accepts the draft token iff the sampled token matches it. Every committed token is therefore an exact sample from the target distribution, so the output distribution is identical to AR sampling (sample-and-match style). On mismatch the sampled token itself becomes the bonus token, so a rejected position still commits one valid sample.

Three places must sample rather than take argmax, and getting any of them wrong corrupts generation in distinctive ways:

- First token after prefill: it used to be the prefill argmax. With sampling enabled this injects one greedy token per request; at temperature 1 every generation opened identically. Now sampled from the prefill logits.
- Per-step seed: the next draft step is seeded with the replay's last token, which the next verify commits as-is. Seeding with the replay argmax injects one greedy token per step (~1/16 of output), which biases the distribution and locks long generations into repetition loops. Now sampled from the replay's last verify logits.
- The acceptance walk itself, over per-position verify logits exposed via a new DFlashTarget::read_verify_logits() hook (default-off; the qwen35 target reads them from the step graph's logits tensor).

Sampled-verify requires full-attention verify (fa_window == 0). With a finite fa-window the windowed verify pass starves the logit tail at long context: argmax stays intact, but sampling from the poisoned tail degrades quality. On a 24K-token agent prompt, tool-call success went 0/12 with a finite window vs 12/12 with full attention. The mode is gated off unless fa_window == 0; prefill and AR decode are unaffected (they always run full attention).

Measured on RTX 3090 (Qwen3.6-27B Q4_K_M + DFlash draft, temp 0.7-1.0): 16 -> 60-70 tok/s on short prompts, 16 -> ~31 tok/s at 24K context. Greedy path is bit-exact unchanged. Token histograms over 150 sampled generations match AR within noise.

DFLASH_SV_DEBUG=1 traces the acceptance walk per position.

Implements the "active sampler during verification" future addition from the README.
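The gating rule described above (opt-in via the environment variable, and only with full-attention verify) could be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: the function names are hypothetical, and treating only the exact string "1" as enabled is a guess at how the variable is parsed.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: treat the flag as enabled only when it is exactly "1".
// (Assumption: the real parsing in the PR may accept other values.)
bool env_flag_enabled(const char* value) {
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// Hypothetical gate: sampled-verify is active only when the user opted in
// AND the verify pass runs full attention (fa_window == 0).
bool sampled_verify_active(const char* env_value, int fa_window) {
    return env_flag_enabled(env_value) && fa_window == 0;
}

// Convenience wrapper that reads the process environment.
bool sampled_verify_active_from_env(int fa_window) {
    return sampled_verify_active(std::getenv("DFLASH_SAMPLED_VERIFY"), fa_window);
}

// Small self-check of the intended truth table.
void gate_self_check() {
    assert(sampled_verify_active("1", 0));      // opted in, full attention
    assert(!sampled_verify_active("1", 2048));  // opted in, finite window
    assert(!sampled_verify_active(nullptr, 0)); // unset: default off
}
```

Keeping the decision in one pure function makes the default-off behavior easy to check without touching the environment.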
Amended after self-review: (1) the DFLASH_SAMPLED_VERIFY gate is now correctly opt-in (=1) as documented; the earlier revision had it opt-out. (2) The acceptance walk now seeds its penalty history with the step's seed token (draft_tok[0]), which is committed by the replay but not yet in out_tokens at walk time. Without it the sampled distribution drifted from AR by one token whenever repetition penalties are active.
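To illustrate why the seed token matters for penalties, here is a minimal sketch, assuming a common repetition-penalty formulation (positive logits divided by the penalty, non-positive logits multiplied). The helper names are hypothetical and the project's actual penalty code may differ; only out_tokens and the seed token come from the comment above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: the history the walk should penalize against is the
// already-emitted tokens plus the step's seed token, which the replay has
// committed but which is not yet present in out_tokens.
std::vector<int> build_penalty_history(const std::vector<int>& out_tokens, int seed_token) {
    std::vector<int> history(out_tokens);
    history.push_back(seed_token);
    return history;
}

// Assumed penalty form: each token seen in the history is penalized once.
void apply_repetition_penalty(std::vector<float>& logits,
                              const std::vector<int>& history,
                              float penalty) {
    assert(penalty > 0.0f);
    std::vector<bool> seen(logits.size(), false);
    for (int tok : history) {
        if (tok < 0) continue;
        const std::size_t idx = static_cast<std::size_t>(tok);
        if (idx >= logits.size() || seen[idx]) continue;
        seen[idx] = true;
        logits[idx] = logits[idx] > 0.0f ? logits[idx] / penalty
                                         : logits[idx] * penalty;
    }
}
```

If the seed token were left out of the history, its logit would stay unpenalized at the first walk position, which is the one-token drift from AR that the comment describes.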
cf8ea3f to 9d63002
What
Adds a sampled-verify mode to the qwen35 chain spec-decode path, enabled with DFLASH_SAMPLED_VERIFY=1. It implements the "active sampler during verification" item from the README's future additions: speculative decoding that works at any temperature while preserving the exact target sampling distribution.

Today the chain verify commits the target's argmax at each position, so spec decode is only distribution-correct at temperature 0, and sampling requests fall back to AR decode. On a 27B Q4_K_M / RTX 3090 that means 16 tok/s for any agent traffic that samples.
How it works
Sample-and-match: at each chain position the verifier draws a token from the target's full sampler chain (temperature / top-k / top-p / penalties) over that position's verify logits, and accepts the draft token iff the sample matches it. On mismatch, the sampled token itself becomes the bonus token. Every committed token is therefore an exact sample from the target distribution — output distribution identical to AR sampling, no acceptance-rate correction terms needed.
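The sample-and-match walk described above can be sketched as below. This is a simplified illustration, not the PR's implementation: the sampler here applies temperature only (top-k, top-p and penalties are omitted), the names sample_token, WalkResult and acceptance_walk are hypothetical, and it assumes verify_logits[pos] is the target's distribution for the token that draft_tokens[pos] proposes.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Stand-in sampler: temperature softmax only (assumption; the real sampler
// chain also applies top-k / top-p / penalties).
int sample_token(const std::vector<float>& logits, float temperature, std::mt19937& rng) {
    assert(!logits.empty() && temperature > 0.0f);
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<double> weights(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        // Subtract the max before exponentiating for numerical stability.
        weights[i] = std::exp((static_cast<double>(logits[i]) - max_logit) / temperature);
    }
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return dist(rng);
}

struct WalkResult {
    std::vector<int> committed;  // accepted draft tokens, plus the sampled token on a mismatch
    std::size_t accepted = 0;    // number of draft tokens that matched the sample
};

// Walk the chain: sample from the target at each position and accept the
// draft token iff the sample equals it. The first mismatch ends the walk,
// and the sampled token is committed as the bonus token.
WalkResult acceptance_walk(const std::vector<std::vector<float>>& verify_logits,
                           const std::vector<int>& draft_tokens,
                           float temperature, std::mt19937& rng) {
    assert(verify_logits.size() == draft_tokens.size());
    WalkResult result;
    for (std::size_t pos = 0; pos < draft_tokens.size(); ++pos) {
        const int sampled = sample_token(verify_logits[pos], temperature, rng);
        result.committed.push_back(sampled);      // always a target sample
        if (sampled != draft_tokens[pos]) break;  // mismatch: bonus token, stop
        ++result.accepted;
    }
    return result;
}
```

Because every committed token is drawn from the target's own logits, the draft only decides how many positions are committed per step, never which tokens are possible.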
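Three spots must sample rather than take argmax. Each one, when wrong, corrupts generation in a distinctive way that we hit and fixed during validation:

- First token after prefill: it used to be the prefill argmax, which injects one greedy token per request (at temperature 1 every generation opened identically). Now sampled from the prefill logits.
- Per-step seed: the next draft step is seeded with the replay's last token, which the next verify commits as-is. Seeding with the replay argmax injects one greedy token per step (~1/16 of output), which biases the distribution and locks long generations into repetition loops. Now sampled from the replay's last verify logits.
- The acceptance walk itself, over per-position verify logits exposed via a new DFlashTarget::read_verify_logits() hook (virtual, default false; the qwen35 target reads the step graph's logits tensor, so targets that don't implement it are unaffected).

fa-window interaction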
Sampled-verify is gated to fa_window == 0. With a finite --fa-window, the windowed verify pass starves the logit tail at long context: argmax survives, but sampling from the poisoned tail degrades quality. Decisive experiment on a 24K-token agent prompt: tool-call success 0/12 with a finite window vs 12/12 with full-attention verify. Prefill and AR decode always run full attention, so they're unaffected; this only constrains the verify pass.

Results (RTX 3090, Qwen3.6-27B Q4_K_M + DFlash draft)
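- Sampling requests (temp 0.7-1.0): 16 -> 60-70 tok/s on short prompts, 16 -> ~31 tok/s at 24K context.
- Token histograms over 150 sampled generations match AR within noise.
- Greedy path (DFLASH_SAMPLED_VERIFY unset) is bit-exact unchanged.
- DFLASH_SV_DEBUG=1 traces the acceptance walk per position.

Default is off, so existing deployments see zero behavior change.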