qwen35: sampled-verify — speculative decoding with an active sampler#374

Open
Rhonstin wants to merge 1 commit into Luce-Org:main from Rhonstin:feat/sampled-verify

Rhonstin commented Jun 12, 2026

What

Adds a sampled-verify mode to the qwen35 chain spec-decode path, enabled with DFLASH_SAMPLED_VERIFY=1. It implements the "active sampler during verification" item from the README's future additions: speculative decoding that works at any temperature while preserving the exact target sampling distribution.

Today the chain verify commits the target's argmax at each position, so spec decode is only distribution-correct at temperature 0 — sampling requests fall back to AR decode. On a 27B Q4_K_M / RTX 3090 that means 16 tok/s for any agent traffic that samples.

How it works

Sample-and-match: at each chain position the verifier draws a token from the target's full sampler chain (temperature / top-k / top-p / penalties) over that position's verify logits, and accepts the draft token iff the sample matches it. On mismatch, the sampled token itself becomes the bonus token. Every committed token is therefore an exact sample from the target distribution — output distribution identical to AR sampling, no acceptance-rate correction terms needed.
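
The sample-and-match rule above can be sketched as a short loop. This is a minimal illustration rather than the PR's code: `Token`, `SampleFn`, `WalkResult`, and `sample_and_match` are hypothetical names, the sampler is passed in as a callback standing in for the target's sampler chain, and `verify_logits[i]` is assumed to be the logits that judge `draft[i]`. The extra token drawn after a fully accepted chain is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical types: a token id, and a callback that draws one token from
// the target's sampler chain over the logits of a single position.
using Token = int;
using SampleFn = std::function<Token(const std::vector<float>& logits)>;

struct WalkResult {
    std::vector<Token> committed;  // tokens to commit this step, in order
    std::size_t accepted = 0;      // number of draft tokens accepted
};

// Walk left to right: sample from the target, accept the draft token only if
// the sample equals it; on the first mismatch commit the sample and stop.
WalkResult sample_and_match(const std::vector<Token>& draft,
                            const std::vector<std::vector<float>>& verify_logits,
                            const SampleFn& sample) {
    assert(draft.size() == verify_logits.size());
    WalkResult r;
    for (std::size_t i = 0; i < draft.size(); ++i) {
        const Token t = sample(verify_logits[i]);
        r.committed.push_back(t);  // t is a target sample in both cases
        if (t != draft[i]) break;  // mismatch: t plays the bonus-token role
        ++r.accepted;
    }
    return r;
}
```

Because every committed token comes out of `sample`, the draft only decides how many positions are committed per step, not which tokens are committed.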

Three spots must sample rather than take argmax — each one, when wrong, corrupts generation in a distinctive way that we hit and fixed during validation:

  1. First token after prefill — was the prefill argmax; injects one greedy token per request (at temp 1, 150/150 generations opened with the same word). Now sampled from the prefill logits.
  2. Per-step seed — the next draft step is seeded with the replay's last token, which the next verify commits as-is. Argmax here injects one greedy token per step (~1/16 of output) and locks long generations into repetition loops. Now sampled from the replay's last verify logits.
  3. The acceptance walk — over per-position verify logits, exposed via a new DFlashTarget::read_verify_logits() hook (virtual, default false; the qwen35 target reads the step graph's logits tensor, so targets that don't implement it are unaffected).
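
A rough sketch of how a "virtual, default false" hook keeps other targets unaffected. The PR does not show the real `DFlashTarget::read_verify_logits()` signature, so the parameter list here is an assumption, and `TargetSketch`, `LogitsTargetSketch`, and `can_use_sampled_verify` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the target interface; the real
// DFlashTarget::read_verify_logits() signature may differ.
struct TargetSketch {
    virtual ~TargetSketch() = default;
    // Copy the verify logits for chain position `pos` into `out`.
    // The default returns false: this target does not expose them.
    virtual bool read_verify_logits(std::size_t pos, std::vector<float>& out) {
        (void)pos;
        (void)out;
        return false;
    }
};

// A target that keeps per-position logits from its last verify pass.
struct LogitsTargetSketch : TargetSketch {
    std::vector<std::vector<float>> step_logits;
    bool read_verify_logits(std::size_t pos, std::vector<float>& out) override {
        if (pos >= step_logits.size()) return false;
        out = step_logits[pos];
        return true;
    }
};

// Callers probe the hook and stay on the greedy path when it returns false.
inline bool can_use_sampled_verify(TargetSketch& t) {
    std::vector<float> probe;
    return t.read_verify_logits(0, probe);
}
```

With this shape, a target that never overrides the hook simply reports `false`, and the caller never enters the sampled path for it.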

fa-window interaction

Sampled-verify is gated to fa_window == 0. With a finite --fa-window, the windowed verify pass starves the logit tail at long context: argmax survives, but sampling from the poisoned tail degrades quality. Decisive experiment on a 24K-token agent prompt: tool-call success 0/12 with a finite window vs 12/12 with full-attention verify. Prefill and AR decode always run full attention, so they're unaffected — this only constrains the verify pass.
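
The gate described above combines two conditions: the opt-in environment variable and full-attention verify. A minimal sketch, assuming the flag is enabled only by the exact value `1` (the PR's actual parsing is not shown); the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Opt-in check: only the exact string "1" enables the mode. This is an
// assumption; the real parser may accept other spellings.
inline bool env_flag_is_one(const char* value) {
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// Sampled-verify runs only when requested AND verify uses full attention.
inline bool use_sampled_verify(const char* env_value, int fa_window) {
    return env_flag_is_one(env_value) && fa_window == 0;
}

// Convenience wrapper that reads the process environment.
inline bool use_sampled_verify_from_env(int fa_window) {
    return use_sampled_verify(std::getenv("DFLASH_SAMPLED_VERIFY"), fa_window);
}
```

Keeping the decision in a pure function of the two inputs makes the four combinations easy to check without touching the environment.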

Results (RTX 3090, Qwen3.6-27B Q4_K_M + DFlash draft)

| Workload | AR sampling | sampled-verify |
|---|---|---|
| short prompts, temp 0.7–1.0 | 16 tok/s | 60–70 tok/s |
| 24K agent prompt w/ tools, temp 1.0 | 16 tok/s | ~31 tok/s |
| tool-call success (12 runs, temp 1.0) | 12/12 | 12/12 |
  • Greedy path (DFLASH_SAMPLED_VERIFY unset) is bit-exact unchanged.
  • Token histograms over 150 sampled generations match AR within noise.
  • No degeneration over 1200-token continuous generations.
  • DFLASH_SV_DEBUG=1 traces the acceptance walk per position.

Default is off, so existing deployments see zero behavior change.
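
The PR does not say how the token histograms were compared against AR. One common way to put a number on "match within noise" is the total variation distance between the two empirical distributions; the sketch below assumes that metric and a hypothetical helper name, and is not taken from the PR.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Total variation distance between two count histograms over the same token
// ids: 0 means identical empirical distributions, 1 means disjoint support.
inline double total_variation(const std::vector<double>& a_counts,
                              const std::vector<double>& b_counts) {
    assert(a_counts.size() == b_counts.size());
    double a_sum = 0.0, b_sum = 0.0;
    for (std::size_t i = 0; i < a_counts.size(); ++i) {
        a_sum += a_counts[i];
        b_sum += b_counts[i];
    }
    assert(a_sum > 0.0 && b_sum > 0.0);
    double tv = 0.0;
    for (std::size_t i = 0; i < a_counts.size(); ++i) {
        // Normalize counts to probabilities before comparing.
        tv += std::fabs(a_counts[i] / a_sum - b_counts[i] / b_sum);
    }
    return 0.5 * tv;
}
```

A useful baseline for "noise" is the distance between two independent AR runs of the same size; a sampled-verify run should land in the same range.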

cubic-dev-ai (bot) left a comment

2 issues found across 4 files

Comment thread server/src/qwen35/qwen35_backend.cpp
Comment thread server/src/qwen35/qwen35_backend.cpp Outdated
The chain verify path commits the target's argmax at every position, so
spec decode only matches the target distribution at temperature 0. Any
sampling request had to fall back to AR decode (16 tok/s on a 27B
Q4_K_M / RTX 3090), while greedy spec ran at 60-70 tok/s.

This adds a sampled-verify mode (DFLASH_SAMPLED_VERIFY=1): at each chain
position the verifier draws a token from the target's full sampler chain
(temperature / top-k / top-p / penalties) over that position's verify
logits, and accepts the draft token iff the sampled token matches it.
Every committed token is therefore an exact sample from the target
distribution — the output distribution is identical to AR sampling,
sample-and-match style. On mismatch the sampled token itself becomes the
bonus token, so a rejected position still commits one valid sample.

Three places must sample rather than take argmax, and getting any of
them wrong corrupts generation in distinctive ways:

- First token after prefill: it used to be the prefill argmax. With
  sampling enabled this injects one greedy token per request; at
  temperature 1 every generation opened identically. Now sampled from
  the prefill logits.

- Per-step seed: the next draft step is seeded with the replay's last
  token, which the next verify commits as-is. Seeding with the replay
  argmax injects one greedy token per step (~1/16 of output), which
  biases the distribution and locks long generations into repetition
  loops. Now sampled from the replay's last verify logits.

- The acceptance walk itself, over per-position verify logits exposed
  via a new DFlashTarget::read_verify_logits() hook (default-off; the
  qwen35 target reads them from the step graph's logits tensor).

Sampled-verify requires full-attention verify (fa_window == 0). With a
finite fa-window the windowed verify pass starves the logit tail at
long context: argmax stays intact, but sampling from the poisoned tail
degrades quality. On a 24K-token agent prompt, tool-call success was
0/12 with a finite window vs 12/12 with full attention. The mode is
gated off unless fa_window == 0; prefill and AR decode are unaffected
(they always run full attention).

Measured on RTX 3090 (Qwen3.6-27B Q4_K_M + DFlash draft, temp 0.7-1.0):
16 -> 60-70 tok/s on short prompts, 16 -> ~31 tok/s at 24K context.
Greedy path is bit-exact unchanged. Token histograms over 150 sampled
generations match AR within noise. DFLASH_SV_DEBUG=1 traces the
acceptance walk per position.

Implements the "active sampler during verification" future addition
from the README.

Rhonstin (Contributor, Author) commented

Amended after self-review: (1) the DFLASH_SAMPLED_VERIFY gate is now correctly opt-in (=1) as documented — the earlier revision had it opt-out; (2) the acceptance walk now seeds its penalty history with the step's seed token (draft_tok[0]), which is committed by the replay but not yet in out_tokens at walk time — without it the sampled distribution drifted from AR by one token whenever repetition penalties are active.
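
The second fix can be illustrated with a small sketch: the penalty history used during the walk is the emitted tokens plus the step's seed token. The penalty rule below is the conventional one (divide positive logits, multiply negative ones), which is an assumption about the target's sampler chain; `apply_repetition_penalty` and `walk_history` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Token = int;

// Conventional repetition penalty (assumption: the target's sampler chain may
// use a different rule). Each token seen in `history` is penalized once:
// positive logits are divided by `penalty`, negative ones multiplied.
inline void apply_repetition_penalty(std::vector<float>& logits,
                                     const std::vector<Token>& history,
                                     float penalty) {
    assert(penalty > 0.0f);
    std::vector<bool> seen(logits.size(), false);
    for (Token t : history) {
        if (t >= 0 && static_cast<std::size_t>(t) < logits.size()) {
            seen[static_cast<std::size_t>(t)] = true;
        }
    }
    for (std::size_t i = 0; i < logits.size(); ++i) {
        if (!seen[i]) continue;
        logits[i] = logits[i] > 0.0f ? logits[i] / penalty : logits[i] * penalty;
    }
}

// History the walk should see: everything already emitted plus the step's
// seed token, which the replay commits but out_tokens does not contain yet.
inline std::vector<Token> walk_history(const std::vector<Token>& out_tokens,
                                       Token seed_token) {
    std::vector<Token> h = out_tokens;
    h.push_back(seed_token);
    return h;
}
```

Without the `push_back`, the seed token would escape the penalty for one position, which is the one-token drift from AR described in the comment.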

Rhonstin force-pushed the feat/sampled-verify branch from cf8ea3f to 9d63002 on June 12, 2026 15:54