Skip to content

KVFlash: bounded KV residency (lookahead sparse attention) for dflash#373

Open
davide221 wants to merge 22 commits into
mainfrom
proto/kv-pager
Open

KVFlash: bounded KV residency (lookahead sparse attention) for dflash#373
davide221 wants to merge 22 commits into
mainfrom
proto/kv-pager

Conversation

@davide221

@davide221 davide221 commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

KVFlash: bounded KV residency (lookahead sparse attention) for dflash

FlashMemory-style (arXiv 2606.09079) decode-time KV paging behind a new --kvflash <tokens> flag. The full-attention KV cache lives in a fixed pool of slots; cold 64-token chunks page to host RAM bit-exact and recallable. GPU KV footprint becomes a hard O(pool) constant at any logical context length.

Full docs in optimizations/kvflash/ (README, RESULTS, DESIGN).

Headline numbers (lucebox RTX 3090, Qwen3.6-27B Q4_K_M, Q8_0 KV)

context mode prefill decode tok/s needle /16 KV in VRAM
64K full cache 130.6 s 27.8 16 1152 MiB
64K KVFlash 4K 87.5 s 38.6 14 72 MiB
128K full cache 335.9 s 19.6 16 2304 MiB
128K KVFlash 4K 177.8 s 38.6 14 72 MiB
256K full cache 999.0 s 13.1 16 4608 MiB
256K KVFlash 4K 354.9 s 38.6 15 72 MiB

Decode is flat at 38.6 tok/s from 64K to the model's native 256K maximum (1.4x / 2.0x / 2.9x over the full cache), prefill is up to 2.8x faster, and attn-KV memory drops 99.2% (2304 to 18 MiB at 128K with a 1K pool).

How it works

  • Attention KV tensors are allocated at pool size (create_target_cache gains ctx_alloc); cache.max_ctx stays the logical bound. The allocation delta IS the saving.
  • A pager (common/kvflash_pager.h) maps logical positions to pool slots at 64-token chunk granularity, riding the existing step-invariant set_rows KV append. RoPE is baked into K rows at write time, so relocation is legal; page-out/page-in moves raw quantized bytes and is bit-exact.
  • Decode attends over the pool with an exact slot-validity mask, re-uploaded before every compute (gallocr reuses input regions during graph execution). The mask is free: 25.10 vs 25.52 ms/step maskless.
  • Every tau decoded tokens (default 64, self-throttling) the scorer re-ranks all chunks and reselect() repages the pool: the paper's lookahead loop, with a hard capacity cap their sigmoid threshold lacks.

Policy is pluggable, pflash is optional

KvFlashScorer (common/) is the policy seam. With no scorer the pool runs pure LRU (zero pflash dependency, recency-only memory). When pflash loads its drafter, KvFlashDrafterScorer attaches automatically and reselect becomes relevance-driven: needle recall holds at 88-100% down to 6-9% residency from 8K to 256K, where LRU scores 0 outside its tail window.

Spec decode runs on the pool

Chain-mode verify_batch slot-maps the draft block (per-token kv_write_rows, which is [n_tokens, n_head_kv] ne0-major) and builds a slot-space mask. Rejected drafts need no rollback: the pos < base_pos validity rule excludes their slots until rewritten. Acceptance parity measured on the daemon: 15.4-15.6% pooled vs 15.3% full cache. DDTree tree-verify is not pool-aware yet and falls back to AR with a one-time warning.

Quality

Harness ground truth with the pool sized per the heuristic: HumanEval 10/10, GSM 10/10, MATH 10/10, agent 6/6, identical to the full-cache baseline (base-vs-base control: 16/16 byte-identical, so the stack is deterministic; text drift under KVFlash is the masked kernel's different deterministic rounding lineage, not a correctness effect).

Verification

  • test_kvflash suite A-F: full-cache baseline, shuffled-relocation equivalence (0.83% argmax flips, gate 2%), live paging with bit-exact roundtrip and >=90% KV-bytes cut, score-driven reselect recall, decode profile, full LSA loop with the drafter as Memory Indexer.
  • Daemon smokes: agnostic LRU (1441 logical tokens through a 1024-slot pool, live eviction mid-request, coherent at 36.9 tok/s), pflash + drafter scorer auto-attach, spec decode with mid-generation pool wrap, two-request pager reset.
  • Rebased onto current main (PRs 364/370/371); end-of-prefill snapshot block and kvflash prefill sync coexist, rebuilt and re-smoked on the 3090.

Known limits (documented in RESULTS.md)

  • DDTree falls back to AR while KVFlash is active.
  • Post-generation snapshots are skipped once cur_pos exceeds the pool (pooled snapshots need page-table serialization); prefill-time snapshots work.
  • Paging is synchronous; copy-stream overlap is a follow-up.
  • Memory-dense tasks that need the whole context at once (MRCR-style) are a paradigm limit shared with FlashMemory; size the pool up for those.

Usage

dflash_server model.gguf --max-ctx 262144 --kvflash 4096            # LRU policy
dflash_server model.gguf --max-ctx 262144 --kvflash 4096 \
    --prefill-compression always --prefill-drafter qwen3-0.6b.gguf  # drafter policy

🧙 Built with WOZCODE

Review in cubic

davide221 and others added 4 commits June 12, 2026 10:15
KvFlashPager: bounded resident pool for the full-attention KV cache
(FlashMemory-style lookahead sparse attention, arXiv 2606.09079).
Logical positions map to physical pool slots at 64-token chunk
granularity; cold chunks page to a host backing store bit-exact and
recallable. GPU footprint is a hard O(pool) bound at any context length.

KvFlashScorer: dependency-free chunk-relevance policy interface. With no
scorer the pager runs pure LRU; KvFlashDrafterScorer adapts the pflash
Qwen3-0.6B drafter (tail-attention chunk scores, z-normalized, bisecting
on allocation pressure) so reselect becomes relevance-driven.

Co-Authored-By: WOZCODE <contact@withwoz.com>
- create_target_cache gains ctx_alloc: attention KV tensors allocate at
  pool capacity while cache.max_ctx stays the logical bound.
- build_target_step gains kvflash_mask: pooled decode keeps the
  step-invariant set_rows KV append active alongside an exact
  slot-validity mask (uploaded before every compute; gallocr reuses
  input regions during graph execution, so a stale mask is garbage).
- do_ar_decode routes kv_write_rows through the pager slot, pushes
  history, and reselects every tau decoded tokens (effective interval
  max(tau, history/45) caps rescore overhead near 15%).
- Spec decode (chain) verifies ON the pool: verify_batch slot-maps the
  draft block (kv_write_rows is [n_tokens, n_head_kv] ne0-major) and
  builds a slot-space mask; rejected drafts need no rollback since the
  pos < base_pos validity rule excludes their slots until rewritten.
  DDTree tree-verify is not pool-aware and falls back to AR.
- pflash synergy: when the prefill drafter loads, KvFlashDrafterScorer
  attaches automatically; without it the pool runs LRU (fully agnostic).
- Post-generation snapshots are skipped once cur_pos exceeds the pool;
  prompts must fit the pool (clear error otherwise); pool size clamps
  to --max-ctx with a warning.

Co-Authored-By: WOZCODE <contact@withwoz.com>
Gated suite A-F: full-cache baseline, shuffled-relocation equivalence
(<=2% argmax flips), live paging with bit-exact page_out/page_in
roundtrip and >=90% KV-bytes cut, score-driven reselect recall, decode
profile, and the full LSA loop with the drafter as Memory Indexer.
Modes: --niah / --niah256 (needle recall vs residency), --longab
(end-to-end long-prompt A/B, per-process configs for clean VRAM), --no-mask.

Co-Authored-By: WOZCODE <contact@withwoz.com>
Measured on lucebox RTX 3090, Qwen3.6-27B Q4_K_M, Q8_0 KV: decode flat
at 38.6 tok/s from 64K to native-max 256K (2.9x over full cache at
256K), 72 MiB resident KV vs 4608 MiB, prefill up to 2.8x faster,
needle recall 88-100% at 6-9% residency with the drafter policy,
harness ground truth 32/32 vs 32/32, spec acceptance at parity.

Co-Authored-By: WOZCODE <contact@withwoz.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 issues found across 18 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/server/server_main.cpp">

<violation number="1" location="server/src/server/server_main.cpp:411">
P2: Missing input validation for --kvflash token count. The value is stored raw via setenv without any validation that it is a positive integer. Every other numeric flag in this block (--spark-slots, --ddtree-budget, --fa-window, --chunk, etc.) parses with std::atoi and validates. Passing non-numeric, zero, or negative input will silently set DFLASH_KVFLASH to garbage, deferring the failure to an opaque downstream atoi call rather than failing early with a clear error message.</violation>
</file>

<file name="server/src/qwen3/qwen3_kvflash_scorer.cpp">

<violation number="1" location="server/src/qwen3/qwen3_kvflash_scorer.cpp:110">
P2: `score_chunks` divides by `chunk_tokens` without validating it, which can crash on invalid input.</violation>
</file>

<file name="server/src/qwen35/graph_builders.h">

<violation number="1" location="server/src/qwen35/graph_builders.h:71">
P3: Header comment for `kvflash_mask` incorrectly states it is "Only meaningful with n_tokens == 1", but the parameter is actively used with `n_tokens > 1` in the verify_batch/spec-decode path (qwen35_dflash_target.cpp:63), and the implementation in graph_builders.cpp:291-296 explicitly describes support for "multi-token ... forwards (decode AND spec verify)". The header constraint is misleading and contradicts both the implementation comment and actual usage.</violation>
</file>

<file name="server/src/common/kvflash_pager.h">

<violation number="1" location="server/src/common/kvflash_pager.h:70">
P1: `attach()` does not validate that pool capacity leaves at least one evictable chunk, so small pools can deadlock eviction and make `slot_for()` fail with `-1`.</violation>
</file>

<file name="server/src/qwen3/qwen3_kvflash_scorer.h">

<violation number="1" location="server/src/qwen3/qwen3_kvflash_scorer.h:7">
P3: Stale documentation reference: the header comment says 'see common/kv_scorer.h' but no such file exists. The correct base-class header is `common/kvflash_scorer.h` (confirmed at `server/src/common/kvflash_scorer.h`). This will mislead developers looking for the dependency-free interface description.</violation>
</file>

<file name="server/src/qwen35/qwen35_backend.cpp">

<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:1304">
P1: `slot_for()` failure is unchecked, so kvflash can write to KV row `-1` when the pool has no evictable block.</violation>
</file>

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread server/src/common/kvflash_pager.h
Comment thread server/src/qwen35/qwen35_backend.cpp
Comment thread server/src/server/server_main.cpp
Comment thread server/src/qwen3/qwen3_kvflash_scorer.cpp
Comment thread server/src/qwen35/graph_builders.h
Comment thread server/src/qwen3/qwen3_kvflash_scorer.h Outdated
The pager core is architecture-blind; this routes each backend's KV
writes and masks through it so --kvflash works on every model family
the daemon serves.

- qwen35moe (Qwen3.6-35B-A3B): the non-hybrid path inherits qwen35.
  The Spark pipelined hybrid decode gains a kv_slot parameter; the
  cached per-layer FA span clamps to the pool, so the cached graph
  stops rebuilding once the window reaches pool size. The pool span
  stays maskless like the rest of that path: the pager zeroes freed
  blocks (page-out + zero_free_blocks on request reset), the same
  zero-row approximation production padding already relies on. Hybrid
  spec decode (literal-offset KV writes) falls back to pipelined AR.
- laguna: all 40 layers pooled. laguna_step/_hybrid take a const
  pager; full + SWA masks are built in SLOT space via fill_slot_pos.
  SWA exactness from a protected tail >= sliding_window. Legacy
  per-layer hybrid decode and NO_KVPAD/PAD_CPY/no_mask ablations are
  refused under kvflash.
- gemma4: pools FULL-attention layers only (SWA layers already
  ring-buffer; KV-reuse layers share their source tensors). Slot-space
  full mask; FA span and mask width clamp to tensor capacity.
  Mutually exclusive with --fa-window; spec verify falls back to AR.
- pager: new const helpers slot_of / fill_slot_pos (slot-space mask
  construction) and zero_free_blocks (request-reset hygiene for
  maskless consumers); kvflash state in Qwen35Backend moved to
  protected for the MoE subclass.
- guards everywhere: prompt-fits-pool on every prefill/restore path,
  snapshots refused after the first relocation on laguna/gemma4.

Smoked on the 3090, pool 1024 / max-ctx 8192 with live LRU eviction
mid-request: A3B Spark hybrid 101.6 tok/s, laguna 137.1, gemma4 119.0,
all coherent; gemma4 no-flag control unchanged (120.2).

Co-Authored-By: WOZCODE <contact@withwoz.com>
@davide221

Copy link
Copy Markdown
Contributor Author

Update: KVFlash now covers every architecture the daemon serves

--kvflash was qwen35-only at PR open; this push ports it to the other three backends. The pager core (common/kvflash_pager.h) was already architecture-blind; each backend now routes its KV writes and masks through it:

arch model smoked integration decode
qwen35moe Qwen3.6-35B-A3B (Spark hybrid, 9403 hot / 837 cold experts) pipelined_decode_one_token gains kv_slot; cached per-layer FA span clamps to the pool (graph stops rebuilding at pool size); maskless pool span backed by pager-zeroed free blocks; hybrid spec falls back to pipelined AR 101.6 tok/s coherent
laguna Laguna-XS.2 (Spark hybrid, single-graph decode) laguna_step(_hybrid) take a const pager; full + SWA masks built in SLOT space via the new fill_slot_pos; protected tail >= sliding_window keeps SWA exact; all 40 layers pooled 137.1 tok/s coherent
gemma4 Gemma4 26B-A4B pools FULL-attention layers only (5 of them; SWA layers already ring-buffer, KV-reuse layers share source tensors); slot-space full mask; mutually exclusive with --fa-window; spec falls back to AR 119.0 tok/s coherent

All smokes: pool 1024 / max-ctx 8192, ~1.2K logical tokens so live LRU eviction engages mid-request, RTX 3090. A no-flag gemma4 control on the same build confirms the default path is unchanged. The qwen35 numbers in the PR body are unaffected.

Policy note: qwen35/qwen35moe attach the pflash drafter scorer automatically; laguna and gemma4 run LRU-only for now (the drafter is Qwen-tokenizer bound) with the KvFlashScorer seam open for their own indexers.

New pager helpers: slot_of / fill_slot_pos (const lookups for slot-space masks) and zero_free_blocks (request-reset hygiene for maskless consumers).

🧙 Built with WOZCODE

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 18 files (changes from recent commits).

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread server/src/gemma4/gemma4_backend.cpp Outdated
Comment thread server/src/laguna/laguna_backend.cpp Outdated
- Pool-deadlock guard (P1): KvFlashPager::min_pool_tokens() + attach()
  refusal when sinks + tail window leave no evictable block; every
  backend floors the requested pool at config read (512 for qwen-family
  and gemma4; laguna derives its floor from the resident SWA window)
  with a warning instead of a runtime eviction failure.
- Unchecked slot_for() in do_ar_decode (P1): a -1 slot now fails the
  request with a clear error instead of becoming a set_rows row index.
- --kvflash / --kvflash-tau (P2): validate as positive integers at the
  CLI and exit early instead of deferring garbage env values downstream.
- score_chunks (P2): guard chunk_tokens <= 0.
- Stale docs (P3 x2): kvflash_mask comment no longer claims n_tokens==1
  only (it serves multi-token spec verify); kv_scorer.h rename leftover
  now points at common/kvflash_scorer.h.

Verified on the 3090: bad flag values rejected with clear messages;
--kvflash 256 raises to the 512 floor and decodes coherently through
live eviction in the tightest legal pool (8 blocks, 5 protected).

Co-Authored-By: WOZCODE <contact@withwoz.com>
@davide221

Copy link
Copy Markdown
Contributor Author

All 6 cubic findings were valid and are fixed in the latest push:

  • P1 pool-deadlock (attach() no evictable chunk): real — --kvflash 256 gave 4 chunks while sinks (1) + tail (4) protect 5, so eviction had no victim once the pool filled. Fix is two-layered: KvFlashPager::min_pool_tokens() + an attach() refusal with a clear message, and every backend's config read now floors the pool (512 for qwen-family/gemma4; laguna computes its floor from the SWA window it must keep resident) with a "raising" warning instead of a runtime failure. Verified live: --kvflash 256 logs requested pool 256 < minimum 512; raising and decodes correctly through eviction.
  • P1 unchecked slot_for() in do_ar_decode: real — a -1 would have become a set_rows row index. Now checked, logs, sets last_error, and fails the request. (The spec-verify path already checked; this was the one unchecked site.)
  • P2 --kvflash raw setenv: both --kvflash and --kvflash-tau now validate as positive integers and exit with a clear message, matching the other numeric flags.
  • P2 score_chunks division by chunk_tokens: guarded (chunk_tokens <= 0 returns false) alongside the existing entry validation.
  • P3 stale "Only meaningful with n_tokens == 1" comment: rewritten — the param serves both single-token decode and multi-token spec verify since the spec-on-pool phase landed.
  • P3 stale common/kv_scorer.h reference: rename leftover, now common/kvflash_scorer.h.

Rebuilt + re-smoked on the 3090 after the fixes (27B, pool floor path, coherent output through live eviction).

🧙 Built with WOZCODE

…lpers

The multi-arch port left three copies of the same plumbing; this pulls
them into the kvflash layer so each backend integration reduces to wiring
(net -32 lines):

- kvflash_pool_from_env(): the env read + 256-rounding + eviction floor +
  max_ctx clamp lived in three slightly diverging copies (qwen35 inline,
  laguna, gemma4). One reader, parameterized by the arch's KvFlashConfig;
  laguna passes its SWA-tail config via a new kvflash_config() so the
  floor and attach can never disagree.
- KvFlashPager::alloc_span(): the slot_for loop + exhaustion diagnostic
  existed in laguna, gemma4, and the qwen35moe restore replay; the backend
  helpers are now one-line delegates and the error message is
  single-sourced.
- kvflash_fill_rows_and_masks(): laguna's step-input filler and gemma4's
  inline rows + slot-space mask fill were the same algorithm; the shared
  helper builds append rows plus causal (and optional sliding-window)
  masks from the pager's slot map, so graph code no longer reimplements
  the slot-to-position conversion.

No behavior change: rebuilt on the 3090 and re-smoked the three affected
archs through live eviction (laguna 138.0 tok/s, gemma4 119.4, qwen35 37.0,
all coherent, banners unchanged).

Co-Authored-By: WOZCODE <contact@withwoz.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 8 files (changes from recent commits).

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

Comment thread server/src/common/kvflash_pager.h
davide221 and others added 5 commits June 12, 2026 18:53
- assets/cards/kvflash_card.png registered in the README cards grid
  (DECODE 2.9x at 256K, CONTEXT 256K, KV VRAM -99%), linking to
  optimizations/kvflash/.
- optimizations/kvflash/README.md gains the hero image (pflash layout).
- README/RESULTS now state explicitly that the 256K full-cache baseline
  rows are measured, not extrapolated, and fit the 24 GB card only
  because the KV is Q8_0 (F16 KV would be 9.2 GiB and not fit); KVFlash
  holds 72 MiB resident either way.

Co-Authored-By: WOZCODE <contact@withwoz.com>
The measured tables now carry the cache parameter on the column itself
(KV in VRAM (Q8_0)) instead of relying on the prose footnote alone; the
footnote keeps the why (F16 KV would not fit 256K on 24 GB at all).

Co-Authored-By: WOZCODE <contact@withwoz.com>
New 'Bounded KV residency (KVFlash)' subsection after the KV cache
block, mirroring the Spark pattern: one-paragraph intro + flag table
(--kvflash / --kvflash-tau and their env equivalents) linking to
optimizations/kvflash/.

Co-Authored-By: WOZCODE <contact@withwoz.com>
The 38.6 tok/s / 72 MiB figures are Qwen3.6-27B at one pool size; the
four model families land at different speeds. The flags reference now
states the property (decode independent of context length, pool-sized
resident KV) and points at optimizations/kvflash/ for per-model numbers.

Co-Authored-By: WOZCODE <contact@withwoz.com>
… without compression

Three UX/capability gaps closed, all verified on the 3090:

- Pooled chunked prefill in the daemon (DESIGN follow-up #2): a prompt
  larger than the pool no longer refuses — do_prefill switches to
  pager-chunk-sized batches with slot-mapped set_rows writes, a
  slot-space mask per chunk (verify_batch recipe), and live eviction as
  the pool fills. Constant VRAM, linear time. Smoked: 6843-token prompt
  through a 2048 pool, coherent output, 35.1 tok/s decode. Restore
  offsets and boundary snapshots are refused in the pooled path.
- --kvflash auto: sizes the pool from --max-ctx (25% with a drafter
  configured, 50% LRU-only), same floor/clamp rails, all model families
  via the shared config reader. Smoked both sizings.
- Drafter scorer without compression: --prefill-drafter alone now arms
  the residency scorer. The server hands the path to the backend
  (DFLASH_KVFLASH_DRAFTER); kvflash_ensure_scorer() lazy-loads the
  drafter on the first reselect that needs it (never on the first
  tokens) and re-attaches after a draft-residency release. Previously
  the scorer only attached inside the pflash compress path, so this
  flag combination silently ran recency-only LRU. Smoked: attach fires
  mid-generation, banner announces the pending policy.
- Snapshot guards now use pager.is_identity() instead of cumulative
  page_outs stats: one eviction-heavy request no longer disables
  snapshots for the rest of the process (laguna/gemma4), and qwen35
  refuses identity-copy snapshots of relocated pools.

Co-Authored-By: WOZCODE <contact@withwoz.com>
@davide221

Copy link
Copy Markdown
Contributor Author

Update: pooled chunked prefill + --kvflash auto + drafter scoring without compression

Three follow-ups landed in 9db8472, all smoked on the 3090:

  • Prompts larger than the pool now work through the daemon. do_prefill switches to pooled chunked prefill (64-token slot-mapped batches, slot-space mask per chunk, live eviction) instead of refusing — the harness recipe, now in the server. Smoked: a 6,843-token prompt through a 2,048-token pool, coherent output, 35.1 tok/s decode.
  • --kvflash auto: pool sized from --max-ctx — 25% when a drafter is configured, 50% LRU-only. Works on all four model families.
  • --prefill-drafter alone now arms the residency scorer (lazy-loaded at the first reselect). Previously the scorer only attached via the pflash compression path, so --kvflash + drafter with compression off silently ran recency-only LRU.
  • Bugfix: snapshot guards use is_identity() instead of cumulative page_outs, so one long request no longer disables snapshots for the rest of the process.

The intended one-liner UX is now real:

dflash_server model.gguf --max-ctx 262144 --kvflash auto --prefill-drafter qwen3-0.6b.gguf

🧙 Built with WOZCODE

davide221 and others added 2 commits June 12, 2026 19:58
High accuracy by default: when --kvflash is on and no --prefill-drafter
was given, the qwen-family backend probes the well-known locations for
the Qwen3-0.6B drafter (target's dir, drafter/, draft/, then
/opt/lucebox/models/drafter/ — Spark's load-what-sits-next-to-the-model
pattern) and arms the residency scorer from it. LRU is now the explicit
FALLBACK when no drafter exists, and the banner says so
('lru (recency-only: no Qwen3-0.6B drafter found ...)') instead of
presenting recency-only paging as a normal mode.

Nothing turns kvflash itself on by default; this only picks the policy
once the user asks for the pool.

Smoked on the 3090 with ONLY '--kvflash auto': probe found the
appliance drafter, auto sized 25% (drafter expected), scorer attached
at the first reselect, coherent output.

Co-Authored-By: WOZCODE <contact@withwoz.com>
…kvflash-policy

Relevance is a property of the text, not the tokenizer, so non-qwen
targets no longer have to run recency-only residency:

- KvFlashCrossTokScorer: detokenize the target's history with its own
  tokenizer (loaded from the target GGUF), re-tokenize for the Qwen3-0.6B
  drafter (its GGUF), run the same tail-attention scoring, and map
  per-drafter-token scores back to the target's 64-token chunk
  boundaries by character spans. Tokenizers are host-only, lazy-loaded.
- laguna + gemma4 gain the full reselect loop (history, adaptive tau,
  lazy drafter load at the first reselect boundary, score_hook + repage).
  Drafter-scored residency is now the default on ALL four families; the
  probe + sizing live in the shared helpers.
- --kvflash-policy {drafter,lru}: the explicit opt-out the default was
  missing (no probe, no drafter load, recency-only paging).
- Shared kvflash_find_drafter() / kvflash_policy_is_lru() replace the
  per-backend probe; banners state the armed policy and how to change it.

Verified on the 3090 (gemma4 26B-A4B, pool 1024): cross-tok scorer
attaches mid-generation, 18 drafter-driven reselects with page events,
coherent 1.9K-token output. Stress needle A/B vs LRU: LRU degenerates
and never recites; cross-tok stays coherent and recalls the correct
prefix but not the exact code. Documented in RESULTS.md as functional
but untuned (qwen-native scoring keeps its measured 14-16/16; the
teacher-forced NIAH harness for non-qwen archs is the follow-up).

Co-Authored-By: WOZCODE <contact@withwoz.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 12 files (changes from recent commits).

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

Comment thread server/src/qwen3/qwen3_kvflash_scorer.h
Comment thread server/src/gemma4/gemma4_backend.cpp
'auto' now sizes from the GPU instead of a fixed fraction of max_ctx:
half of (device-free minus reserve) after the weights are resident,
converted at the model's pooled-KV density, capped at the decode speed
knee (16384 tokens default, DFLASH_KVFLASH_MAX_POOL to override) and at
max_ctx. Rationale: a bigger pool means more resident chunks and fewer
forced evictions of useful context (the relevance-crowding seen in the
gemma4 needle stress), while the cap keeps the per-step KV read near
the flat-decode optimum; on tight cards the VRAM term shrinks the pool
automatically.

Backends supply the budget (ggml_backend_dev_memory + per-arch density:
qwen35 full-attn layers at resolve_kv_types' quant, laguna all layers at
args.kv_type, gemma4 full-attn layers at F16 with per-layer dims); the
reserve covers compute buffers plus the drafter when one is expected.
The fraction heuristic survives only as the no-budget fallback.

Smoked on the 3090 at max-ctx 131072: 27B picks 16384 (free 8.3 GiB,
14.0 KiB/token, speed-capped), gemma4 picks 16384 (7.5 GiB, 20.0
KiB/token), both banners report the full math, both decode coherently.

Co-Authored-By: WOZCODE <contact@withwoz.com>
@cubic-dev-ai

cubic-dev-ai Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment @cubic-dev-ai review.

Four valid findings from cubic's later passes, all fixed:
- KvFlashCrossTokScorer: raw owning pimpl now has deleted copy
  ctor/assignment (double-free guard; held in unique_ptr everywhere,
  but the class shouldn't rely on that).
- KvFlashPager::slot_for: a failed allocation rolls cur_chunk_ back so
  the next eviction's tail window isn't computed from a chunk that
  never materialized.
- laguna unpark: kvflash_attach failure now frees the just-loaded
  weights + cache before returning (was leaking them while still
  reporting parked).
- kvflash_drafter_failed_ latch clears on unpark in all three backends:
  a transient drafter-load failure no longer downgrades residency to
  LRU for the process lifetime (still no per-tau retry spam).

Stale finding skipped: the cumulative page_outs snapshot guard was
already replaced by is_identity() two rounds ago.

Docs brought up to shipped reality: DESIGN.md per-arch policy section
(cross-tok default, --kvflash-policy, VRAM auto), do_prefill bullet
(pooled chunked prefill), and the follow-ups list now separates done
(pooled prefill, spec-on-pool, VRAM auto, cross-tok) from open
(drafter KV persistence, laguna/gemma4 pooled prefill, pooled
snapshots, async paging, non-qwen NIAH harness).

Full test_kvflash regression suite on this exact tree: ALL PASS
(relocation 2% gate, bit-exact roundtrip, eviction decode, reselect
recall, LSA loop, >=90% KV cut), exit 0.

Co-Authored-By: WOZCODE <contact@withwoz.com>
@davide221

Copy link
Copy Markdown
Contributor Author

Pre-ship audit complete (321695c)

Final sweep before merge:

  • Cubic round 2: 4 of 5 later findings valid, all fixed — copy guards on the cross-tok scorer's owning pointer (P1), cur_chunk_ rollback on failed allocation, laguna unpark resource cleanup on attach failure, and the drafter-failure latch now clears on unpark instead of downgrading to LRU for the process lifetime. The fifth (cumulative page_outs snapshot guard) was already fixed by is_identity() in an earlier round.
  • Docs reconciled with shipped reality: DESIGN.md policy section, pooled-prefill bullet, and a done-vs-open follow-ups split.
  • Full test_kvflash regression suite on the final tree: ALL PASS, exit 0 — relocation equivalence (2% gate), bit-exact page roundtrip, eviction decode, reselect recall, LSA loop, >=90% KV-memory cut.

Open follow-ups are documented in DESIGN.md (drafter KV persistence, pooled prefill on laguna/gemma4, pooled snapshots, async paging, non-qwen NIAH harness + cross-tok tuning). From our side this is ready to merge once the GPU checks land.

🧙 Built with WOZCODE

davide221 and others added 4 commits June 13, 2026 00:18
Both GPU jobs shared group lucebox3-gpu-runner, but a concurrency group
holds only ONE waiting job: the CUDA job took the running slot, the
Radeon job sat in the waiting slot, and every new job entering the
group from any branch displaced it ('Canceling since a higher priority
waiting request exists') — the Radeon leg was cancelled chronically
while the 3090 leg passed. The combo box has two distinct GPUs, so the
jobs never contended for a device; per-GPU groups keep cross-PR
serialization where it matters and stop the cross-displacement.

Co-Authored-By: WOZCODE <contact@withwoz.com>
rocminfo on a wedged KFD blocks in uninterruptible sleep until the
20-minute job timeout kills the run with zero evidence. Probe it under
a 15 s timeout first; on hang, dump /dev/kfd holders, D-state processes,
and recent amdgpu/kfd dmesg, then fail in seconds with the diagnosis on
the job page. The smoke step reuses the healthy probe's output.

Co-Authored-By: WOZCODE <contact@withwoz.com>
The 'DDTree falls back to AR under KVFlash' limitation guarded against
a tree verify that does not exist in the daemon: the complete tree
machinery (build_ddtree, build_target_step_tree, follow_verified_tree)
is only called from test_dflash, the benchmark harness. In the server,
--ddtree sizes the verify intermediates for budget+1 tokens and enables
fast_rollback, then generation runs the same chain spec loop either way
— and both pieces are already pool-compatible: chain verify_batch is
slot-mapped (measured at acceptance parity), and fast_rollback's
snapshot_kv/restore_kv only snapshot DeltaNet/conv recurrent state,
which KVFlash never pages.

Gate removed; docs corrected (the known-limit now names the harness-only
tree graphs, not the daemon).

A/B on the 3090 (27B + DFlash draft, --ddtree, 600 tokens): pooled
14.6% accept / avg_commit 3.33 / 33.5 tok/s vs full-cache 13.9% / 3.23
/ 33.3 — parity, both coherent.

Co-Authored-By: WOZCODE <contact@withwoz.com>
timeout(1) cannot kill a process in uninterruptible sleep, so the
previous diagnostic step itself blocked for the full job timeout when
KFD was wedged (observed live: 20 minutes of silence, no evidence
printed). Probe rocminfo in the background with output to a file (no
held pipe), enforce the 15 s deadline in the shell, and on hang print
the probe's own D-state, /dev/kfd holders, and amdgpu dmesg before
failing fast — without ever wait()ing on the corpse.

Co-Authored-By: WOZCODE <contact@withwoz.com>
@davide221

Copy link
Copy Markdown
Contributor Author

Update: --ddtree runs on the pool (9a17281)

Investigating the "DDTree falls back to AR" limitation dissolved it: the daemon never had tree verify — the tree machinery (build_ddtree/build_target_step_tree/follow_verified_tree) is only called from test_dflash, the benchmark harness. In the server, --ddtree sizes verify intermediates and enables fast_rollback, then runs the same chain spec loop — and both are pool-compatible (chain verify is slot-mapped; fast_rollback only snapshots DeltaNet state, which is never paged). Gate removed, docs corrected.

A/B on the 3090 (27B + DFlash draft, --ddtree, 600 tokens): pooled 14.6% accept / avg_commit 3.33 / 33.5 tok/s vs full-cache 13.9% / 3.23 / 33.3 — parity, both coherent.

Also in: the ROCm CI job now self-diagnoses a wedged KFD in ~15 s (D-state-proof background probe) instead of eating its 20-minute timeout in silence; the current Radeon failures are a driver wedge on the runner box, not this branch (zero PR code runs in that job).

🧙 Built with WOZCODE

…gression fix

Spec decode now runs on the pool everywhere it exists. gemma4 was the
last gap:

- gemma4_verify_batch gains the kvflash path: set_rows kv-index inputs
  (full layers -> pool slots, SWA -> ring rows), slot-space causal mask
  via the shared helper, FA span + mask width clamped to the pool.
  Gemma4DFlashTarget allocates the verify block's slots up front; the
  spec loop's KV-truncation rejection maps directly onto the pool's
  validity rule (rejected slots hold future positions, masked until the
  next verify rewrites them). Both backend spec gates removed.
- Pre-existing regression fixed (blocks gemma spec on MAIN, not just
  here): PR #359's strict assert reads dflash.n_target_layers, which
  the published gemma draft fills with the TARGET layer count (30)
  while its fc tensor is sized for the 6 CAPTURE layers — the draft
  refused to load at all. Per that PR's own weights-are-ground-truth
  rule, derive the capture count from fc when it divides n_embd and
  warn on the metadata mismatch; genuinely inconsistent shapes still
  fail.
- gemma4 accept_rate now reaches the HTTP usage block (was silently
  0.0 while the loop logged the real rate — same reporting-only class
  as the PR #321 layer-split gap).

A/B on the 3090 (26B-A4B + published q8_0 draft, 600 tokens): pooled
and full cache produce IDENTICAL acceptance (407/3104 = 13.1%,
avg_commit 3.09) and identical text; usage reports 0.131 on both.

Co-Authored-By: WOZCODE <contact@withwoz.com>
@davide221

Copy link
Copy Markdown
Contributor Author

Update: gemma4 spec decode on the pool — spec now works everywhere it exists (abb4cf4)

The last spec-on-pool gap is closed. gemma4_verify_batch gains the slot-mapped path (set_rows kv indices, slot-space causal mask, pool-clamped span); gemma4's KV-truncation rejection semantics map directly onto the pool's validity rule. A/B on the 3090 with the published q8_0 draft: pooled and full cache produce identical acceptance (407/3104 = 13.1%, avg_commit 3.09) and identical text.

Two pre-existing bugs found and fixed along the way (both affect main, not just this branch):

Spec-on-pool coverage is now: qwen35 chain ✓, qwen35 --ddtree config ✓, gemma4 chain ✓; the only exception remains MoE-hybrid spec (literal-offset writes, falls back to pipelined AR), and laguna has no spec decode to begin with.

🧙 Built with WOZCODE

…7B-hardcoded)

The converter stamped the qwen35-27B draft's scalars (n_head_kv=8,
hidden=5120, n_layer=5, ff=17408, ...) onto every draft regardless of
source, so any non-27B DFlash draft (A3B, gemma) converted to a GGUF
with correct weights but wrong metadata — which the strict draft loader
then rejected (blk.0 attn_k dim != n_head_kv*head_dim). Every MoE/A3B
spec-decode attempt on main fails at draft load for this reason.

load_arch() now resolves the architecture from the source config.json
(authoritative for transformer hparams) cross-checked against the
tensor shapes (authoritative for the rest: head_dim from k_proj,
intermediate from gate_proj, n_target_layers from fc, n_layer from the
block count), falling back to the 27B constants only when config.json
is absent. Verified: A3B draft converts to n_head_kv=4 n_layer=8
ff=6144 and loads clean.

This unblocks MoE speculative decode. Validated on the 3090: A3B MoE
all-GPU with --ddtree + --kvflash 2048 runs spec decode on the pool
(10.4% accept, avg_commit 2.66, coherent) vs full cache (11.5%, 2.84,
coherent) — so dflash + ddtree + kvflash compose on MoE. The qwen35moe
--spark hybrid spec path has a separate pre-existing CUDA crash (see
RESULTS Known limits); it was never reachable until drafts could load.

Co-Authored-By: WOZCODE <contact@withwoz.com>
@davide221

Copy link
Copy Markdown
Contributor Author

Update: MoE has dflash + ddtree on the pool (feef3fd)

A3B MoE, all-GPU (experts resident), --ddtree --kvflash 2048: spec decode runs on the pool — 10.4% accept, avg_commit 2.66, 59.5 tok/s, coherent vs full cache 11.5% / 2.84 / 64.6 (gap within the documented masked-kernel rounding variance). dflash + ddtree + kvflash compose on MoE. This path needs no new code — qwen35moe inherits the qwen35 spec loop already pool-validated.

Getting there required fixing the draft converter, which was broken for every non-27B draft on main: convert_dflash_to_gguf.py hardcoded the 27B scalars (n_head_kv=8, hidden=5120, ...), so A3B/gemma drafts converted with correct weights but 27B metadata and the strict loader rejected them. Now config-driven (cross-checked against tensor shapes); the A3B draft converts to n_head_kv=4 n_layer=8 ff=6144 and loads clean. This unblocks MoE/A3B spec decode on main, not just here.

One genuine pre-existing bug surfaced, filed in RESULTS Known limits: the qwen35moe --spark hybrid spec path crashes with a CUDA illegal-memory-access — independent of kvflash (crashes on the full cache too), never reachable before because no A3B draft could load. --spark spec falls back to pipelined AR under kvflash; that crash is its own follow-up.

I wrote pool-aware code for the hybrid path too but did not ship it — it can't be exercised until that crash is fixed, and I'm keeping the branch to validated code only.

🧙 Built with WOZCODE

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant