qwen35: sampled-verify — speculative decoding with an active sampler by Rhonstin · Pull Request #374 · Luce-Org/lucebox-hub

Rhonstin · 2026-06-12T11:05:19Z

What

Adds a sampled-verify mode to the qwen35 chain spec-decode path, enabled with DFLASH_SAMPLED_VERIFY=1. It implements the "active sampler during verification" item from the README's future additions: speculative decoding that works at any temperature while preserving the exact target sampling distribution.

Today the chain verify commits the target's argmax at each position, so spec decode is only distribution-correct at temperature 0 — sampling requests fall back to AR decode. On a 27B Q4_K_M / RTX 3090 that means 16 tok/s for any agent traffic that samples.

How it works

Sample-and-match: at each chain position the verifier draws a token from the target's full sampler chain (temperature / top-k / top-p / penalties) over that position's verify logits, and accepts the draft token iff the sample matches it. On mismatch, the sampled token itself becomes the bonus token. Every committed token is therefore an exact sample from the target distribution — output distribution identical to AR sampling, no acceptance-rate correction terms needed.
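The sample-and-match rule above can be sketched in a few lines. This is a minimal Python illustration, not the PR's C++ code: `softmax`, `sample_token`, and `acceptance_walk` are hypothetical names, and the sampler is reduced to temperature only, whereas the real sampler chain also applies top-k, top-p, and penalties.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at the given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng):
    """Stand-in for the target's sampler chain (temperature only, by assumption)."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def acceptance_walk(draft_tokens, verify_logits, temperature, rng):
    """Return the tokens committed by one verify step.

    draft_tokens[i] is the draft's proposal at chain position i and
    verify_logits[i] holds the target's logits for that position.
    """
    committed = []
    for draft_tok, logits in zip(draft_tokens, verify_logits):
        sampled = sample_token(logits, temperature, rng)
        committed.append(sampled)  # every committed token is a target sample
        if sampled != draft_tok:
            break  # mismatch: the sampled token serves as the bonus token
    return committed
```

Because the committed token is always the target's own sample, the draft only affects how many positions a step can commit, not which tokens are produced.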

Three spots must sample rather than take argmax — each one, when wrong, corrupts generation in a distinctive way that we hit and fixed during validation:

- First token after prefill: previously the prefill argmax, which injects one greedy token per request (at temp 1, 150/150 generations opened with the same word). Now sampled from the prefill logits.
- Per-step seed: the next draft step is seeded with the replay's last token, which the next verify commits as-is. Taking the argmax here injects one greedy token per step (~1/16 of output) and locks long generations into repetition loops. Now sampled from the replay's last verify logits.
- The acceptance walk: runs over per-position verify logits, exposed via a new DFlashTarget::read_verify_logits() hook (virtual, returns false by default, so targets that don't implement it are unaffected; the qwen35 target reads the step graph's logits tensor).
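The default-false hook pattern from the last item can be illustrated as follows. This is a hypothetical Python mirror of the idea only: the real `DFlashTarget::read_verify_logits()` is a C++ virtual whose signature is not shown in this PR text, so the `(position, out)` parameters, the `Qwen35Target` class name, and `supports_sampled_verify` are all assumptions.

```python
class DFlashTarget:
    """Hypothetical stand-in for the target interface."""

    def read_verify_logits(self, position, out):
        """Default: this target does not expose per-position verify logits."""
        return False

class Qwen35Target(DFlashTarget):
    """Hypothetical target that exposes its verify logits."""

    def __init__(self, step_logits):
        # Stand-in for the step graph's logits tensor: one row per position.
        self._step_logits = step_logits

    def read_verify_logits(self, position, out):
        if not 0 <= position < len(self._step_logits):
            return False
        out[:] = self._step_logits[position]  # copy the row into the caller's buffer
        return True

def supports_sampled_verify(target, vocab_size):
    """Probe position 0; targets that keep the default simply report False."""
    buf = [0.0] * vocab_size
    return target.read_verify_logits(0, buf)
```

A caller can probe the hook once and keep the existing argmax path for any target that returns false.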

fa-window interaction

Sampled-verify is gated to fa_window == 0. With a finite --fa-window, the windowed verify pass starves the logit tail at long context: argmax survives, but sampling from the poisoned tail degrades quality. Decisive experiment on a 24K-token agent prompt: tool-call success 0/12 with a finite window vs 12/12 with full-attention verify. Prefill and AR decode always run full attention, so they're unaffected — this only constrains the verify pass.
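The gate described above combines two conditions: the opt-in environment variable and a full-attention verify pass. A minimal Python sketch, assuming a hypothetical helper `sampled_verify_enabled` (the actual check lives in the C++ backend and may be structured differently):

```python
import os

def sampled_verify_enabled(fa_window, env=None):
    """True only when the mode is opted in AND verify runs full attention."""
    env = os.environ if env is None else env
    opted_in = env.get("DFLASH_SAMPLED_VERIFY") == "1"  # opt-in, off by default
    return opted_in and fa_window == 0
```

With the variable unset, or with any finite window, the function returns False and the existing behavior applies.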

Results (RTX 3090, Qwen3.6-27B Q4_K_M + DFlash draft)

Workload	AR sampling	sampled-verify
short prompts, temp 0.7–1.0	16 tok/s	60–70 tok/s
24K agent prompt w/ tools, temp 1.0	16 tok/s	~31 tok/s
tool-call success (12 runs, temp 1.0)	12/12	12/12
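For reference, the throughput figures in the table translate into the following speedups over AR sampling. The numbers come straight from the table; the `speedup` helper is only illustrative.

```python
def speedup(baseline_tok_s, new_tok_s):
    """Throughput ratio of the new path over the baseline."""
    return new_tok_s / baseline_tok_s

AR_TOK_S = 16.0  # AR sampling throughput from the table

short_low = speedup(AR_TOK_S, 60.0)   # 3.75x on short prompts (low end)
short_high = speedup(AR_TOK_S, 70.0)  # 4.375x on short prompts (high end)
long_ctx = speedup(AR_TOK_S, 31.0)    # 1.9375x on the 24K agent prompt
```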

Greedy path (DFLASH_SAMPLED_VERIFY unset) is bit-exact unchanged.
Token histograms over 150 sampled generations match AR within noise.
No degeneration over 1200-token continuous generations.
DFLASH_SV_DEBUG=1 traces the acceptance walk per position.
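The histogram check can be reproduced in miniature. The sketch below is a toy, not the PR's validation harness: it assumes a three-token target distribution and a draft that always proposes token 0, then compares committed-token histograms using total variation distance. All names (`ar_sample`, `sampled_verify_commit`, `greedy_commit`, `total_variation`) are hypothetical.

```python
import random
from collections import Counter

TARGET_PROBS = [0.5, 0.3, 0.2]  # toy target distribution (assumption)
DRAFT_TOKEN = 0                 # toy draft that always proposes token 0

def ar_sample(rng):
    """Plain AR sampling from the target distribution."""
    return rng.choices(range(len(TARGET_PROBS)), weights=TARGET_PROBS, k=1)[0]

def sampled_verify_commit(rng):
    """Sample-and-match: commit the target's sample at this position."""
    sampled = ar_sample(rng)
    # Whether sampled == DRAFT_TOKEN only decides if the walk continues;
    # the committed token is the sample in both cases.
    return sampled

def greedy_commit(rng):
    """The pre-PR behavior at temperature > 0: always commit the argmax."""
    return max(range(len(TARGET_PROBS)), key=TARGET_PROBS.__getitem__)

def histogram(commit_fn, n, seed):
    rng = random.Random(seed)
    return Counter(commit_fn(rng) for _ in range(n))

def total_variation(counts_a, counts_b):
    """Half the L1 distance between two empirical distributions."""
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    keys = set(counts_a) | set(counts_b)
    return 0.5 * sum(abs(counts_a[k] / n_a - counts_b[k] / n_b) for k in keys)
```

In this toy, the sample-and-match histogram stays close to the AR one, while the argmax-commit histogram collapses onto a single token, which is the kind of skew the first-token and per-step-seed fixes remove.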

Default is off, so existing deployments see zero behavior change.

cubic-dev-ai

2 issues found across 4 files


The chain verify path commits the target's argmax at every position, so spec decode only matches the target distribution at temperature 0. Any sampling request had to fall back to AR decode (16 tok/s on a 27B Q4_K_M / RTX 3090), while greedy spec ran at 60-70 tok/s.

This adds a sampled-verify mode (DFLASH_SAMPLED_VERIFY=1): at each chain position the verifier draws a token from the target's full sampler chain (temperature / top-k / top-p / penalties) over that position's verify logits, and accepts the draft token iff the sampled token matches it. Every committed token is therefore an exact sample from the target distribution, so the output distribution is identical to AR sampling, sample-and-match style. On mismatch the sampled token itself becomes the bonus token, so a rejected position still commits one valid sample.

Three places must sample rather than take argmax, and getting any of them wrong corrupts generation in distinctive ways:

- First token after prefill: it used to be the prefill argmax. With sampling enabled this injects one greedy token per request; at temperature 1 every generation opened identically. Now sampled from the prefill logits.
- Per-step seed: the next draft step is seeded with the replay's last token, which the next verify commits as-is. Seeding with the replay argmax injects one greedy token per step (~1/16 of output), which biases the distribution and locks long generations into repetition loops. Now sampled from the replay's last verify logits.
- The acceptance walk itself, over per-position verify logits exposed via a new DFlashTarget::read_verify_logits() hook (default-off; the qwen35 target reads them from the step graph's logits tensor).

Sampled-verify requires full-attention verify (fa_window == 0). With a finite fa-window the windowed verify pass starves the logit tail at long context: argmax stays intact, but sampling from the poisoned tail degrades quality. On a 24K-token agent prompt, tool-call success went 0/12 with a finite window vs 12/12 with full attention. The mode is gated off unless fa_window == 0; prefill and AR decode are unaffected (they always run full attention).

Measured on RTX 3090 (Qwen3.6-27B Q4_K_M + DFlash draft, temp 0.7-1.0): 16 -> 60-70 tok/s on short prompts, 16 -> ~31 tok/s at 24K context. Greedy path is bit-exact unchanged. Token histograms over 150 sampled generations match AR within noise. DFLASH_SV_DEBUG=1 traces the acceptance walk per position.

Implements the "active sampler during verification" future addition from the README.

Rhonstin · 2026-06-12T15:54:28Z

Amended after self-review: (1) the DFLASH_SAMPLED_VERIFY gate is now correctly opt-in (=1) as documented — the earlier revision had it opt-out; (2) the acceptance walk now seeds its penalty history with the step's seed token (draft_tok[0]), which is committed by the replay but not yet in out_tokens at walk time — without it the sampled distribution drifted from AR by one token whenever repetition penalties are active.
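The penalty-history fix in (2) can be sketched as follows. This is an illustrative Python version under stated assumptions: `walk_penalty_history` is a hypothetical helper, and `apply_repetition_penalty` uses the common convention of dividing positive logits and multiplying negative ones by the penalty, which may differ from the sampler chain the backend actually uses.

```python
def walk_penalty_history(out_tokens, draft_tok):
    """History the walk starts from: committed output plus the step's seed token.

    draft_tok[0] is already committed by the replay but is not yet in
    out_tokens at walk time, so it has to be appended explicitly.
    """
    return list(out_tokens) + [draft_tok[0]]

def apply_repetition_penalty(logits, history, penalty):
    """Penalize every token id that appears in history (assumed convention)."""
    out = list(logits)
    for tok in set(history):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out
```

Leaving out the seed token means the first walk position is penalized as if that token had never been generated, which is the one-token drift from AR described above.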

cubic-dev-ai Bot reviewed Jun 12, 2026

Comment thread server/src/qwen35/qwen35_backend.cpp

Comment thread server/src/qwen35/qwen35_backend.cpp Outdated

Rhonstin force-pushed the feat/sampled-verify branch from cf8ea3f to 9d63002 on June 12, 2026 15:54

Rhonstin wants to merge 1 commit into Luce-Org:main from Rhonstin:feat/sampled-verify
