feat(mtp): Phase 2.2.Attn+KV + mrope + prompt-prefill + K-chain; defer 3.5 batched-verify#175

Merged
kekzl merged 10 commits into main from feat/mtp-phase-2.2 on May 14, 2026
Conversation

@kekzl (Owner) commented May 14, 2026

Summary

Follow-on to PR #174 (which shipped Phase 2.2.MoE + Attn MVP + Phase 3.5 telemetry + Phase 5.5 harness + the two root-cause bug fixes). This PR completes the diagnostic investigation of the 30% accept-rate plateau and ships definitive findings.

What's in

| Sub-phase                                             | Commit  |
|-------------------------------------------------------|---------|
| Per-head qk-norm + opt-in standard RoPE               | 1edf183 |
| mrope-aware Q/K rotation kernel                       | 7c09e22 |
| Whole-prompt MTP prefill (Engine::mtp_prefill_prompt) | cf4dcd9 |
| Perf baseline refresh (Docker variance)               | 1ee1af7 |
| Diagnostic env toggles (pattern log, prenorm-h)       | 4503d43 |
| K-chain accept measurement + cache rollback           | c1b3e57 |

Phase 5.5 final findings (Qwen3.6-NVFP4)

K=1 accept rate (per prompt class): 20-37% across factual/verbose-think/code/instruction, avg ~22-30% with run-to-run variance.

K=2 chained drafts (4-prompt sample):

| lookahead            | accept rate |
|----------------------|-------------|
| k=0 (next token)     | 22.4%       |
| k=1 (next-next token)| 1.6%        |

The MTP head collapses at chained lookahead 1. When MTP feeds its own h_final back as h_prev for the second step, it diverges from the main-model trajectory the head was trained against.

ROI math for Phase 3.5 batched-verify

  • K=1: cost ≈ 2× decode forward, tokens/cycle = 1.22, throughput = 0.61× baseline → 40% SLOWER.
  • K=2: tokens/cycle = 1.23, same slowdown.
  • Break-even: need ≥ 50% accept on k=0. Current 22% is far short.

Phase 3.5 batched-verify implementation is DEFERRED. Shipping it now would be a measurable throughput regression.

Refuted hypotheses (all bench-confirmed)

| Hypothesis                                       | Result |
|--------------------------------------------------|--------|
| Missing RoPE                                     | mrope ≡ standard partial-rope for text-only positions; same accept |
| Empty prompt context (asymmetry with main model) | Prompt-prefill marginal; not the bottleneck |
| Pre-norm h_prev via output_norm                  | Worse; raw pre-norm h_prev is correct |
| qk-norm offset double-application                | Fixed; doesn't move the needle alone |
| K>1 multi-step chaining                          | Collapses to noise at lookahead ≥ 1 |

Likely real bottleneck (not addressed)

DeepSeek-V3-style MTP architecture conditions each chained step on the main model's intermediate-layer activations, not on MTP's own projected hidden state. Implementing this would require running the main model up to layer L for each chain step → no spec-decode speedup. Future architectural change required.

Diagnostic infrastructure shipped (re-runnable)

scripts/mtp_accuracy_bench.sh          # 4 prompt class harness
--mtp-spec-decode K                    # K=1, 2, 3 chain measurement
IMP_MTP_PATTERN_LOG=1                  # log (predicted, actual) pairs
IMP_MTP_PRENORM_H=1                    # A/B knob: pre-norm h_prev
IMP_MTP_NO_ROPE=1                      # disable mrope entirely

CLI prints per-lookahead accept rates:

mtp     X / Y drafts matched (Z% accept rate)
mtp[k=1] X / Y drafts matched (Z% accept @ +1 lookahead)

Useful side-effects

  1. mrope-aware kernel exists — reduces to standard partial-rope for text-only positions, but is the right architecture for any future multimodal MTP work.
  2. Whole-prompt MTP prefill infrastructure — Engine::mtp_prefill_prompt populates MTP KV cache during prefill. Doesn't break the plateau but is the right thing structurally.
  3. K-chain accept measurement — sliding-window per-lookahead instrumentation with cache rollback. Makes any future MTP improvement empirically falsifiable.
  4. Telemetry hook stays non-behavioral: generated tokens identical with/without MTP.

Validation

  • make verify-fast green (after baseline refresh for post-Docker-prune cuBLAS variance).
  • Phase 5.5 bench re-runnable; numbers match across runs within variance.
  • Manual: imp-cli --mtp-spec-decode 1/2/3 --prompt "..." shows per-lookahead accept rates inline.

Memory

mtp_phase5_validation_2026_05_14 — final state, re-eval triggers, what to try next.

🤖 Generated with Claude Code

kekzl and others added 10 commits May 14, 2026 23:20
…=True

Replaces the Step 5.A passthrough with a real attention block.

Architectural discovery (from vllm Qwen3NextAttention upstream): Qwen3.6
MTP uses attn_output_gate=True, which doubles q_proj output to
[num_heads, 2*head_dim] per token. The output is split per-head into
(q, gate) — q drives Q@K attention, gate is silu'd and elementwise
multiplied with the attention output before o_proj. This is why the
shapes are q_proj[8192,2048] / o_proj[2048,4096] on a model with main
config num_heads=16 head_dim=256: 2*16*256=8192 for Q+gate, 16*256=4096
for the post-gate attention output.

MVP scope: M=1 first-draft step with NO MTP KV history.
  - softmax over a single token reduces to identity
  - attn_out_per_head[h] = V[h / GQA_group] (broadcast via GQA)
  - silu(gate) elementwise multiplied on top
  - o_proj reduces 4096 → 2048
  - Result added to fc_out as the attention residual

This gives the FIRST draft step a real attention contribution (gate × V).
Subsequent draft steps (K>=1) would need an actual MTP KV cache to
attend over prior drafts — that's Phase 2.2.Attn+KV future work.

New kernels in mtp_forward.cu:
  - mtp_gated_v_broadcast_kernel: computes silu(gate[h]) * V_broadcast[h/gqa]
  - mtp_add_kernel: fc_out += attn_residual

Reused: imp::rmsnorm (input_layernorm), imp::gemm (q/v/o projections).

Workspace gains 6 attention scratch buffers (input_norm, q_full, k_proj,
v_proj, attn_out, attn_residual) plus 3 dim fields (num_heads, num_kv_heads,
head_dim). mtp_workspace_allocate gains the 3-tuple of attention dims as
default-0 params (back-compat preserved).

Engine derives attention dims from MTP head's q_proj/v_proj shapes:
  num_heads    = q_proj.shape[0] / (2 * head_dim)
  num_kv_heads = v_proj.shape[0] / head_dim
  head_dim     = main model's head_dim (256 for Qwen3.6)
Log line gains "num_heads=N/NKV, head_dim=HD" suffix.

Validation:
  - MtpForwardTest.DraftStepProducesValidToken now also asserts mtp_head_dim=256,
    mtp_num_heads=16, mtp_num_kv=2 from the shipped Qwen3.6-NVFP4 weights, and
    allocates the workspace with attention engaged. Test PASS in 14.9s.
  - make verify-fast green: decode +2.77%, prefill +3.22%, graphs 1.58×.

The K (k_proj) tensor and q_norm/k_norm are computed-for-shape but unused
in the no-history MVP. They become live when the MTP KV cache + softmax
attention kernel land.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hooks mtp_draft_one into step_decode_forward to record per-step prediction
accuracy WITHOUT changing generation. Single-sequence only (batch=1).

After each main forward emits the next token, the hook:
  1. If a prediction was made on the prior step, compare to the actual
     next_token → increment mtp_accuracy_.{matches,total}.
  2. Capture executor_->view_hidden(1) (the just-produced hidden state for
     the new token) and call mtp_draft_one to predict next-step's token.

Engine gains:
  - struct MtpAccuracy { int matches, total; float rate(); }
  - mtp_accuracy() const accessor
  - mtp_accuracy_reset() — clears for fresh measurement
  - private mtp_pending_prediction_ + mtp_accuracy_ state

CLI prints accuracy at end of generation when MTP enabled:
  $ imp-cli --model Qwen3.6-NVFP4 --mtp-spec-decode 1 --prompt "..." --max-tokens 16
  ...
  mtp     0 / 14 drafts matched (0.0% accept rate)

Validation:
  - WITH and WITHOUT --mtp-spec-decode: generated tokens are IDENTICAL
    on Qwen3.6-NVFP4 ("The user is asking for the capital of France.").
    Confirms telemetry is non-behavioral.
  - 0% accept rate at K=1 confirms what the architectural analysis
    predicted: the current Phase 2.2.Attn MVP (M=1, no MTP KV history)
    can't keep up with the main model's predictions. Phase 2.2.Attn+KV
    (real attention with MTP-side cache) is required before Phase 3.5
    batched-verify yields any speedup.
  - make verify-fast green (decode +2.77% vs baseline, prefill within
    threshold, graphs 1.48× speedup, smoke 'Paris' check passes).

This satisfies Phase 5.5's measurement infrastructure: we can now run
arbitrary prompts and read off MTP accuracy to drive the decision on
whether to implement batched verify (Phase 3.5 proper) and KV-aware
attention (Phase 2.2.Attn+KV). The session-end log line is the
acceptance-rate signal the design spec calls for.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds scripts/mtp_accuracy_bench.sh that runs imp-cli with --mtp-spec-decode 1
across 4 prompt classes (factual / verbose-think / code / instruction) on
Qwen3.6-NVFP4 and parses the per-step accept-rate telemetry line.

Also gates the async-graph-loop initiation on !mtp_spec_decode_enabled()
so the Phase 3.5 telemetry hook (which lives in step_decode_forward) sees
EVERY decode step. Before this gate, ~3 of 4 prompts had their telemetry
silently dropped because long generations transition to the async path
mid-stream.

Decision thresholds (per design spec):
  ≥ 60% on ≥ 3/4 classes → Phase 3.5 batched-verify is ROI-justified
  < 30% across the board → Phase 2.2.Attn+KV is the load-bearing blocker

Result on Qwen3.6-NVFP4 with the current Phase 2.2 implementation
(MoE complete + gated-attention MVP without MTP KV history):

  | class          | matches | total | rate  |
  |----------------|---------|-------|-------|
  | factual        | 0       | 67    | 0.0%  |
  | verbose-think  | 0       | 15    | 0.0%  |
  | code           | 0       | 126   | 0.0%  |
  | instruction    | 0       | 126   | 0.0%  |

Definitive: 0/4 at 0%. The draft signal without MTP-side KV attention is
indistinguishable from noise — DO NOT implement Phase 3.5 batched-verify
yet; it would be a measurable throughput regression. The next prerequisite is
Phase 2.2.Attn+KV (proper attention with MTP-side cache, ~16 MiB extra
VRAM at 16K context).

Memory entry mtp_phase5_validation_2026_05_14 captures the finding +
recipe for re-running once Phase 2.2.Attn+KV lands.

make verify-fast green: decode within 3% threshold, graphs 1.63×, smoke
'Paris' check passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 5.5 re-bench on Qwen3.6-NVFP4 with --mtp-spec-decode 1:

  | class          | matches | total | rate  |
  |----------------|---------|-------|-------|
  | factual        | 22      | 67    | 32.8% |
  | verbose-think  | 23      | 126   | 18.3% |
  | code           | 47      | 126   | 37.3% |
  | instruction    | 37      | 126   | 29.4% |
                                  avg 29.5%

Up from 0/4 at 0% (previous session). The MTP head was fundamentally
producing zero-information output. TWO compounding bugs:

(1) RMSNorm shape misuse in mtp_forward.cu (the bulk of the 0% → 30% recovery)
    imp::rmsnorm reads x.shape[0] as rows and x.shape[1] as d_model.
    My Phase 2.1+2.2 code passed 1D tensors [hidden_dim] → kernel saw
    rows=hidden_dim, d_model=0 and EARLY-RETURNED. Output buffers kept
    their uninitialized contents, locking the LM-head argmax to a
    deterministic noise token (6178 = 'awn' for Qwen3.6). Fixed all
    four mtp_forward.cu call sites to use shape [1, hidden_dim] (2D).

(2) Missing arch_norm_offset for Qwen3.5/3.6 MTP norms
    Qwen3.5/3.6 SafeTensors stores RMSNorm gammas as deltas W where
    actual gamma = 1 + W. Main-model code adds the +1 during BF16→FP16
    via ctx.arch_norm_offset. upload_mtp_weights() was using
    upload_unquantized_weight() which doesn't expose the offset param.
    With W ≈ 0.0 stored on disk, MTP norms ran with scale ≈ 0 instead of
    ≈ 1. Fixed: dispatch norm tensors (pre_fc_norm_{embedding,hidden},
    input_layernorm, post_attention_layernorm, q_norm, k_norm,
    final_norm) through upload_weight() with weight_offset=arch_norm_offset.

Also includes Phase 2.2.Attn+KV scaffolding from this session:
  - Per-session MTP K/V cache (max 16K context → 32 MiB), reset on
    imp_context_reset via Engine::mtp_accuracy_reset()
  - mtp_attn_kv_scan_kernel: one CTA per Q-head, softmax over cached
    positions, scaled by 1/sqrt(head_dim), GQA broadcast from kv_h
  - mtp_kv_append_kernel + mtp_gate_attn_out_kernel for the composed pipeline
  - Engine sizes the cache from min(model.max_seq_len, 16384)

Bench script now reports avg rate + 3-way verdict (default-on /
batched-verify-worth-trying / still-blocked).

RoPE on Q/K is still missing — would close some of the ~30% → ~85%
gap to DeepSeek-V3 paper expectations (Qwen3.6 uses partial-rope 0.25
+ mrope sections [11, 11, 10] for multimodal). With RoPE, accept rate
should clear the 60% default-on threshold on factual/instruction; at
that point batched-verify (Phase 3.5 proper) becomes a real perf win.

verify-fast green: decode within threshold, graphs 1.89× speedup,
smoke 'Paris' check passes. Generated tokens still identical with/
without MTP — telemetry remains non-behavioral.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndard RoPE

Three changes to the MTP attention forward:

1. Q-extraction from q_full into d_q_attn (cudaMemcpy2DAsync, per-head
   pitch: first head_dim of each head's 2*head_dim slice). Gate stays in
   q_full for the post-attention silu(gate)*out multiplication.

2. Per-head QK-norm via imp::rmsnorm (the non-fused path the main model
   uses when qknorm_rope_fused is disabled). Reshape Q to [num_heads,
   head_dim] and K to [num_kv_heads, head_dim], apply rmsnorm with
   weight=mtp.q_norm/k_norm, offset=norm_weight_offset (0.0 for
   Qwen3.5/3.6 — the +1 was baked in at upload).

3. Opt-in standard partial-RoPE via imp::rope_forward, gated on env
   IMP_MTP_USE_ROPE=1. Default OFF because Qwen3.6 uses mrope with
   section split [11, 11, 10] on the 64 rope-dims; standard partial-rope
   is a frequency-pattern mismatch versus training and empirically
   REDUCES accept rate (~30% → ~20%). Proper mrope support is a
   future enhancement.

CRITICAL bug fix on the way: the runtime weight_offset for q_norm/k_norm
must match cfg.norm_weight_offset (0.0 for Qwen3.5/3.6) NOT a duplicate
arch-derived +1. The upload path already adds +1 during BF16→FP16; an
additional +1 at runtime would compute gamma = W + 2 instead of W + 1.

Phase 5.5 bench on Qwen3.6-NVFP4 (with these changes, RoPE off):

  | class          | matches | total | rate  |
  |----------------|---------|-------|-------|
  | factual        | 21      | 67    | 31.3% |
  | verbose-think  | 31      | 126   | 24.6% |
  | code           | 43      | 126   | 34.1% |
  | instruction    | 36      | 126   | 28.6% |
                                  avg 29.6%

Within noise of the pre-qk-norm 29.5% baseline — qk-norm doesn't move
the needle alone. The next step that should is mrope-aware Q/K rotation.
Both qk-norm and mrope are present in the main model's forward; the MTP
head was trained with both, so applying one without the other
under-utilizes the trained weights.

Workspace gains:
  - d_q_attn:       [num_heads * head_dim] FP16 — Q extracted post-gate
  - d_mtp_position: [1] int — for RoPE position
  - rope_theta / rope_dim / rope_neox / rms_norm_eps / arch_norm_offset hyperparams

Engine threads these from model->config. The opt-in env knob lets us
revisit RoPE quickly once mrope lands.

verify-fast green: decode within threshold, graphs 1.69× speedup,
smoke 'Paris' check passes. Generated tokens still identical with/without
MTP enabled — telemetry remains non-behavioral.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the opt-in standard partial-rope (via imp::rope_forward) with a
dedicated mrope kernel that supports Qwen3-VL's section split. For
Qwen3.6 the mrope_section is [11, 11, 10] (half-counts; full rope_dim
= 64 = 2·(11+11+10)).

For text-only generation all 3 position components (T, H, W) equal
mtp_pos, so mrope mathematically reduces to standard partial-rope —
matching the bench evidence (avg 27.3% with mrope on vs 29.5% without,
within run-to-run variance). The kernel infrastructure exists so that
multimodal token handling (different T/H/W per token) drops in cleanly
when needed.

`mtp_mrope_kernel<IsKv>` design:
  - One CTA per head (Q: nh CTAs, K: nkv CTAs)
  - Threads handle the rope_dim/2 frequency pairs in stride
  - Per pair k: cumulative-section lookup → position[s] from {T,H,W}
  - Frequency: inv_freq[k] = theta^(-2k/rope_dim) (standard partial-rope)
  - NeoX rotation: (x[k], x[k+rope_dim/2]) → rotated by `pos * inv_freq[k]`

Workspace gains:
  - mrope_sec0/sec1/sec2 (validated to sum to rope_dim/2)

Engine sets the section split from hardcoded Qwen3.6 values when rope_dim
== 64 (since imp doesn't load mrope_section from config yet). Falls back
to all-temporal split for other rope_dim values.

RoPE is now ON by default (rope_dim from config); IMP_MTP_NO_ROPE=1
disables. The standard-rope IMP_MTP_USE_ROPE knob is removed.

Phase 5.5 bench unchanged at avg ~27% (variance), confirming the 30%
plateau is NOT a rotation-correctness issue. The next quality levers:
  - Multi-step MTP chaining (K>1) — feed previous draft back to next forward
  - Investigate whether main-model absolute positions vs MTP-local 0..N
    matters (theory says no — relative attention is preserved — but
    worth empirical confirmation)
  - Per-class analysis of which drafts match vs miss to understand what
    the MTP head learned to predict well

verify-fast: decode noise from container rebuild (cuBLAS algo selection
variance ~3-5% on Qwen3-8B Q8_0; affects baseline-vs-current numbers
but not MTP accept-rate metric). Pre-stash test confirmed regression
is independent of these changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds Engine::mtp_prefill_prompt() that runs MTP forward at each prompt
position to populate the MTP-side KV cache. Called automatically from
step_prefill after the main forward completes (when MTP is enabled +
single-chunk prefill).

Without this, MTP starts decode with an empty KV cache while the main
model has the entire prompt context — a fundamental asymmetry that
caps achievable accept rate.

Hook conditions:
  - MTP spec-decode enabled
  - offset == 0 AND req->prefill_offset == 0  (single-chunk path only)
  - Uses executor's hidden_ buffer (holds all chunk_len positions post-forward)

The LAST prompt position's MTP prediction is stored in mtp_pending_prediction_
so the first decode-step's accuracy comparison is meaningful (instead of
being skipped due to the prior -1 sentinel).

Cost: O(prompt_len) extra MTP forwards, ~2-3 ms each on Qwen3.6-NVFP4.
50-token prompt → ~125 ms one-time. Acceptable.

Phase 5.5 bench result (with prefill enabled):

  | class          | matches | total | rate  |
  |----------------|---------|-------|-------|
  | factual        | 19      | 94    | 20.2% |
  | verbose-think  | 41      | 127   | 32.3% |
  | code           | 38      | 127   | 29.9% |
  | instruction    | 30      | 127   | 23.6% |
                                  avg 26.5%

Marginal change from no-prefill bench (avg 27.3%). The 30% plateau is
NOT a context-asymmetry issue either — prompt context is now in cache
and accept rate is still bounded. Two contributing factors:
  - First decode-step's accept comparison is now counted (was skipped
    when pending=-1 at session start), and predicting from prompt-tail
    is harder than from mid-decode.
  - Class verbose-think did improve (25→32%), suggesting prompt context
    DOES help on some prompt types — but only modestly.

The fundamental gap to DeepSeek-V3's ~85% paper number likely needs:
  - Multi-step (K>1) MTP forward chaining
  - Possibly different scaling / hyperparameter tuning
  - Or per-element validation against HF reference forward

Chunked prefill (long prompts that hit max_seq_len chunk boundary) is
NOT supported by this hook yet — only the single-chunk case. The MTP
KV cache would need per-chunk capture for chunked. Documented as a
TODO; not a blocker since typical prompts are 50-200 tokens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cuBLAS heuristic re-selection after the previous imp:test image was
pruned: tg128 151.16 → 148.15 (-2.0%), pp512 14547 → 14260 (-2.0%).
This is the documented "GEMM algo selection variance across container
restarts" issue, unrelated to any MTP changes. Pre-stash test on the
merged state confirmed identical regression independent of recent
commits.

verify-fast green after refresh: decode -1.86% within 3% threshold,
graphs 1.69× speedup, smoke 'Paris' check passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two diagnostic toggles added for analyzing the 30% accept-rate plateau:

  IMP_MTP_PATTERN_LOG=1
    Logs every (predicted, actual) pair from the telemetry hook with
    decoded token strings, so accept patterns can be analyzed offline.
    Format: "MTP-PAT [+/-] pred=N 'str' actual=N 'str'"

  IMP_MTP_PRENORM_H=1
    Applies the main model's output_norm RMSNorm to h_prev before
    passing to MTP. Tests whether MTP expects post-final-norm hidden
    state (some upstream MTP variants do).

Findings from accept-pattern analysis on Qwen3.6-NVFP4:
  - Accepts skew heavily toward common English tokens: "the", "is", "a",
    "asking", "provide", "answer", "factual", "straightforward", "\n".
  - Wrong predictions skew toward HIGH-vocab-ID tokens: Thai script
    (~233k), Cyrillic (~163k), special tokens (<|im_start|>, </think>,
    <|box_end|>), emojis. Almost no garbled Latin-script gibberish.
  - Suggests MTP's h_final lands in the lm_head input space well enough
    that COMMON-WORD signal survives, but for content-specific positions
    the distribution spreads into the rare-token tail of the vocab.

IMP_MTP_PRENORM_H=1 A/B result: WORSE (factual 20.6% → 10.3%). Refutes
the "MTP expects post-out-norm h_prev" hypothesis. The default (raw
pre-out-norm hidden state from executor's hidden_ buffer) is correct.

These knobs stay as opt-in diagnostics — they don't affect production
default behavior. The pattern log is useful for any future MTP debug
session ("why did accept rate drop on prompt X?").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds K>1 chained-draft instrumentation: at each main decode step, draft
K predictions in a chain (each feeding back the previous draft + MTP's
own h_final as the next h_prev). Track per-lookahead accept rate via
mtp_chain_accept_[k]. Cache roll-back keeps only the first chain step's
KV-cache write so the real MTP context doesn't drift.

CLI prints "mtp[k=N]" lines for each lookahead distance ≥ 1.

Bench finding on Qwen3.6-NVFP4 (K=2 across 4 prompt classes):

  | prompt          | k=0 (next) | k=1 (next-next) |
  |-----------------|------------|-----------------|
  | factual         | 29.9%      | 0.8%            |
  | verbose-think   | 17.3%      | 0.0%            |
  | code            | 19.7%      | 2.4%            |
  | instruction     | 22.8%      | 3.2%            |
  | avg             | 22.4%      | 1.6%            |

DEFINITIVE: The MTP head COLLAPSES at chained lookahead 1. K=2 drafts
have ~1.6% accept on the second token — essentially noise. Chained MTP
forward (feeding MTP's own h_final back as h_prev) diverges from the
main-model trajectory the model was trained against.

Implication for Phase 3.5 batched-verify ROI:
  - K=1 batched verify: cost ≈ 2× decode, accept ≈ 22%, tokens/cycle ≈ 1.22.
    Effective throughput: 1.22 / 2 = 0.61× baseline → 40% SLOWER.
  - K=2 batched verify: tokens/cycle ≈ 1 + 0.22 + 0.22*0.016 = 1.23.
    Same 40% slowdown.

Break-even requires accept rate ≥ 50% on k=0, which we cannot reach
without architectural changes (multi-token-predictor needs the MAIN
model's intermediate hidden states for each chained step, not MTP's
projected hidden state — which is exactly what spec-decode tries to
avoid recomputing).

Conclusion: Phase 3.5 batched-verify implementation should be DEFERRED
until MTP single-step accept rate clears 50%. That likely requires
either:
  - Different MTP head architecture (multi-head with main-model
    intermediate-layer projection, like DeepSeek-V3's actual design)
  - Or accepting MTP as content-tail "easy token" predictor only and
    using a different draft model for spec-decode.

K-chain measurement infrastructure stays as a useful diagnostic for
future MTP iterations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kekzl kekzl force-pushed the feat/mtp-phase-2.2 branch from c1b3e57 to caad3f8 Compare May 14, 2026 21:27
@kekzl kekzl enabled auto-merge (squash) May 14, 2026 21:28
@kekzl kekzl merged commit 43014d2 into main May 14, 2026
2 checks passed
@kekzl kekzl deleted the feat/mtp-phase-2.2 branch May 14, 2026 21:39