KVFlash: bounded KV residency (lookahead sparse attention) for dflash by davide221 · Pull Request #373 · Luce-Org/lucebox-hub

davide221 · 2026-06-12T08:20:32Z

KVFlash: bounded KV residency (lookahead sparse attention) for dflash

FlashMemory-style (arXiv 2606.09079) decode-time KV paging behind a new --kvflash <tokens> flag. The full-attention KV cache lives in a fixed pool of slots; cold 64-token chunks page to host RAM bit-exact and recallable. GPU KV footprint becomes a hard O(pool) constant at any logical context length.

Full docs in optimizations/kvflash/ (README, RESULTS, DESIGN).

Headline numbers (lucebox RTX 3090, Qwen3.6-27B Q4_K_M, Q8_0 KV)

| context | mode | prefill | decode tok/s | needle /16 | KV in VRAM |
|---|---|---|---|---|---|
| 64K | full cache | 130.6 s | 27.8 | 16 | 1152 MiB |
| 64K | KVFlash 4K | 87.5 s | 38.6 | 14 | 72 MiB |
| 128K | full cache | 335.9 s | 19.6 | 16 | 2304 MiB |
| 128K | KVFlash 4K | 177.8 s | 38.6 | 14 | 72 MiB |
| 256K | full cache | 999.0 s | 13.1 | 16 | 4608 MiB |
| 256K | KVFlash 4K | 354.9 s | 38.6 | 15 | 72 MiB |

Decode is flat at 38.6 tok/s from 64K to the model's native 256K maximum (1.4x / 2.0x / 2.9x over the full cache), prefill is up to 2.8x faster, and attn-KV memory drops 99.2% (2304 to 18 MiB at 128K with a 1K pool).
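The memory column follows from a single density: 1152 MiB for a 64K cache is 18 KiB of Q8_0 attention KV per token, and the same figure gives 72 MiB for a 4K pool and 18 MiB for a 1K pool. A minimal sketch of that arithmetic, assuming "64K" means 65,536 tokens; the function names are illustrative and do not come from the PR:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative only: names are not from the PR. The per-token density is
// derived from the table (1152 MiB for 65,536 tokens = 18 KiB/token).
constexpr std::int64_t kKvBytesPerToken = 18 * 1024;

// Resident attention-KV bytes for a cache that holds `tokens` positions.
constexpr std::int64_t kv_bytes(std::int64_t tokens) {
    return tokens * kKvBytesPerToken;
}

// Convert bytes to whole MiB (exact for the sizes used here).
constexpr std::int64_t to_mib(std::int64_t bytes) {
    return bytes / (1024 * 1024);
}

// Percentage of KV memory saved by a pool versus a full cache.
inline double saving_percent(std::int64_t ctx_tokens, std::int64_t pool_tokens) {
    assert(ctx_tokens > 0 && pool_tokens > 0 && pool_tokens <= ctx_tokens);
    return 100.0 * (1.0 - static_cast<double>(pool_tokens) /
                              static_cast<double>(ctx_tokens));
}
```

Because the pool size is fixed, `kv_bytes(pool)` stays constant however large the logical context grows, which is the O(pool) bound the description claims.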

How it works

Attention KV tensors are allocated at pool size (create_target_cache gains ctx_alloc); cache.max_ctx stays the logical bound. The allocation delta IS the saving.
A pager (common/kvflash_pager.h) maps logical positions to pool slots at 64-token chunk granularity, riding the existing step-invariant set_rows KV append. RoPE is baked into K rows at write time, so relocation is legal; page-out/page-in moves raw quantized bytes and is bit-exact.
Decode attends over the pool with an exact slot-validity mask, re-uploaded before every compute because gallocr reuses input regions during graph execution. The mask costs nothing measurable: 25.10 ms/step masked vs 25.52 ms/step maskless.
Every tau decoded tokens (default 64, self-throttling) the scorer re-ranks all chunks and reselect() repages the pool: the paper's lookahead loop, with a hard capacity cap their sigmoid threshold lacks.
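The position-to-slot mapping can be pictured with a toy pager. This is a sketch under simplifying assumptions, not the code in common/kvflash_pager.h: `ToyPager` and its members are hypothetical, eviction is plain LRU, and it leaves out the protected sink/tail chunks, the host backing store, and the scorer-driven reselect.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of chunk-granular paging; not the PR's KvFlashPager.
class ToyPager {
public:
    static constexpr int kChunkTokens = 64;  // chunk granularity from the PR

    explicit ToyPager(int pool_tokens)
        : slot_owner_(static_cast<std::size_t>(pool_tokens / kChunkTokens), -1),
          last_use_(static_cast<std::size_t>(pool_tokens / kChunkTokens), 0) {
        assert(pool_tokens >= kChunkTokens && pool_tokens % kChunkTokens == 0);
    }

    // Physical pool row for logical position `pos`. If the owning chunk is not
    // resident, it takes a free block or evicts the least recently used one.
    int slot_for(std::int64_t pos) {
        const std::int64_t chunk = pos / kChunkTokens;
        int block;
        auto it = resident_.find(chunk);
        if (it != resident_.end()) {
            block = it->second;
        } else {
            block = pick_victim();
            if (slot_owner_[block] >= 0) {
                // A real pager would copy the block's raw bytes to host RAM here.
                resident_.erase(slot_owner_[block]);
                ++page_outs_;
            }
            slot_owner_[block] = chunk;
            resident_[chunk] = block;
        }
        last_use_[block] = ++clock_;
        return block * kChunkTokens + static_cast<int>(pos % kChunkTokens);
    }

    bool is_resident(std::int64_t pos) const {
        return resident_.count(pos / kChunkTokens) != 0;
    }
    int page_outs() const { return page_outs_; }

private:
    // Prefer a free block; otherwise return the least recently used block.
    int pick_victim() const {
        int best = 0;
        for (int b = 0; b < static_cast<int>(slot_owner_.size()); ++b) {
            if (slot_owner_[b] < 0) return b;
            if (last_use_[b] < last_use_[best]) best = b;
        }
        return best;
    }

    std::vector<std::int64_t> slot_owner_;            // block -> chunk, -1 if free
    std::vector<std::uint64_t> last_use_;             // block -> LRU timestamp
    std::unordered_map<std::int64_t, int> resident_;  // chunk -> block
    std::uint64_t clock_ = 0;
    int page_outs_ = 0;
};
```

With a 128-token pool (two blocks), positions 0 and 70 land in rows 0 and 70; touching position 130 then evicts chunk 0 and reuses its block, so the row returned is 2. Relocation like this is only safe because, as noted above, RoPE is already baked into the K rows when they are written.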

Policy is pluggable, pflash is optional

KvFlashScorer (common/) is the policy seam. With no scorer the pool runs pure LRU (zero pflash dependency, recency-only memory). When pflash loads its drafter, KvFlashDrafterScorer attaches automatically and reselect becomes relevance-driven: needle recall holds at 88-100% down to 6-9% residency from 8K to 256K, where LRU scores 0 outside its tail window.

Spec decode runs on the pool

Chain-mode verify_batch slot-maps the draft block (per-token kv_write_rows, which is [n_tokens, n_head_kv] ne0-major) and builds a slot-space mask. Rejected drafts need no rollback: the pos < base_pos validity rule excludes their slots until rewritten. Acceptance parity measured on the daemon: 15.4-15.6% pooled vs 15.3% full cache. DDTree tree-verify is not pool-aware yet and falls back to AR with a one-time warning.
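The no-rollback property is easiest to see in the mask itself. A minimal sketch, assuming a per-row table `slot_pos` that records which logical position each pool row currently holds (-1 for a free row); `build_slot_mask` and `slot_pos` are hypothetical names, and the real mask layout in the graph may differ:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper, not the PR's code. slot_pos[s] is the logical position
// stored in pool row s, or -1 if the row is free. Returns a row-major
// [n_tokens x n_slots] additive mask: 0 = attend, -inf = hidden.
inline std::vector<float> build_slot_mask(const std::vector<std::int64_t>& slot_pos,
                                          std::int64_t base_pos, int n_tokens) {
    assert(base_pos >= 0 && n_tokens >= 1);
    const float kHide = -std::numeric_limits<float>::infinity();
    const std::size_t n_slots = slot_pos.size();
    std::vector<float> mask(static_cast<std::size_t>(n_tokens) * n_slots, kHide);
    for (int i = 0; i < n_tokens; ++i) {
        const std::int64_t q_pos = base_pos + i;  // position of query token i
        for (std::size_t s = 0; s < n_slots; ++s) {
            const std::int64_t p = slot_pos[s];
            // Committed history (p < base_pos) is always visible; rows written
            // by the current block are visible causally (p <= q_pos). Rows that
            // hold later positions, such as rejected drafts, stay hidden.
            if (p >= 0 && p <= q_pos) {
                mask[static_cast<std::size_t>(i) * n_slots + s] = 0.0f;
            }
        }
    }
    return mask;
}
```

A rejected draft leaves a row tagged with a position at or beyond the new `base_pos`; that row is simply never selected until a later write replaces it, so nothing has to be undone.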

Quality

Harness ground truth with the pool sized per the heuristic: HumanEval 10/10, GSM 10/10, MATH 10/10, agent 6/6, identical to the full-cache baseline (base-vs-base control: 16/16 byte-identical, so the stack is deterministic; text drift under KVFlash is the masked kernel's different deterministic rounding lineage, not a correctness effect).

Verification

test_kvflash suite A-F: full-cache baseline, shuffled-relocation equivalence (0.83% argmax flips, gate 2%), live paging with bit-exact roundtrip and >=90% KV-bytes cut, score-driven reselect recall, decode profile, full LSA loop with the drafter as Memory Indexer.
Daemon smokes: agnostic LRU (1441 logical tokens through a 1024-slot pool, live eviction mid-request, coherent at 36.9 tok/s), pflash + drafter scorer auto-attach, spec decode with mid-generation pool wrap, two-request pager reset.
Rebased onto current main (PRs 364/370/371); end-of-prefill snapshot block and kvflash prefill sync coexist, rebuilt and re-smoked on the 3090.

Known limits (documented in RESULTS.md)

DDTree falls back to AR while KVFlash is active.
Post-generation snapshots are skipped once cur_pos exceeds the pool (pooled snapshots need page-table serialization); prefill-time snapshots work.
Paging is synchronous; copy-stream overlap is a follow-up.
Memory-dense tasks that need the whole context at once (MRCR-style) are a paradigm limit shared with FlashMemory; size the pool up for those.

Usage

dflash_server model.gguf --max-ctx 262144 --kvflash 4096            # LRU policy
dflash_server model.gguf --max-ctx 262144 --kvflash 4096 \
    --prefill-compression always --prefill-drafter qwen3-0.6b.gguf  # drafter policy

🧙 Built with WOZCODE

KvFlashPager: bounded resident pool for the full-attention KV cache (FlashMemory-style lookahead sparse attention, arXiv 2606.09079). Logical positions map to physical pool slots at 64-token chunk granularity; cold chunks page to a host backing store bit-exact and recallable. GPU footprint is a hard O(pool) bound at any context length. KvFlashScorer: dependency-free chunk-relevance policy interface. With no scorer the pager runs pure LRU; KvFlashDrafterScorer adapts the pflash Qwen3-0.6B drafter (tail-attention chunk scores, z-normalized, bisecting on allocation pressure) so reselect becomes relevance-driven. Co-Authored-By: WOZCODE <contact@withwoz.com>

- create_target_cache gains ctx_alloc: attention KV tensors allocate at pool capacity while cache.max_ctx stays the logical bound.
- build_target_step gains kvflash_mask: pooled decode keeps the step-invariant set_rows KV append active alongside an exact slot-validity mask (uploaded before every compute; gallocr reuses input regions during graph execution, so a stale mask is garbage).
- do_ar_decode routes kv_write_rows through the pager slot, pushes history, and reselects every tau decoded tokens (effective interval max(tau, history/45) caps rescore overhead near 15%).
- Spec decode (chain) verifies ON the pool: verify_batch slot-maps the draft block (kv_write_rows is [n_tokens, n_head_kv] ne0-major) and builds a slot-space mask; rejected drafts need no rollback since the pos < base_pos validity rule excludes their slots until rewritten. DDTree tree-verify is not pool-aware and falls back to AR.
- pflash synergy: when the prefill drafter loads, KvFlashDrafterScorer attaches automatically; without it the pool runs LRU (fully agnostic).
- Post-generation snapshots are skipped once cur_pos exceeds the pool; prompts must fit the pool (clear error otherwise); pool size clamps to --max-ctx with a warning.

Co-Authored-By: WOZCODE <contact@withwoz.com>
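The self-throttling interval quoted in this commit message, max(tau, history/45), can be written out directly. A minimal sketch; `effective_tau` is a hypothetical name and integer division is an assumption:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Effective reselect interval as stated in the commit message:
// max(tau, history / 45). Integer division is assumed here.
inline std::int64_t effective_tau(std::int64_t tau, std::int64_t history_tokens) {
    assert(tau > 0 && history_tokens >= 0);
    return std::max(tau, history_tokens / 45);
}
```

With the default tau of 64 the interval stays at 64 until the history passes roughly 2,900 tokens (64 x 45 = 2,880), after which it grows linearly with the history, so rescoring becomes rarer as each rescore gets more expensive.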

Gated suite A-F: full-cache baseline, shuffled-relocation equivalence (<=2% argmax flips), live paging with bit-exact page_out/page_in roundtrip and >=90% KV-bytes cut, score-driven reselect recall, decode profile, and the full LSA loop with the drafter as Memory Indexer. Modes: --niah / --niah256 (needle recall vs residency), --longab (end-to-end long-prompt A/B, per-process configs for clean VRAM), --no-mask. Co-Authored-By: WOZCODE <contact@withwoz.com>

Measured on lucebox RTX 3090, Qwen3.6-27B Q4_K_M, Q8_0 KV: decode flat at 38.6 tok/s from 64K to native-max 256K (2.9x over full cache at 256K), 72 MiB resident KV vs 4608 MiB, prefill up to 2.8x faster, needle recall 88-100% at 6-9% residency with the drafter policy, harness ground truth 32/32 vs 32/32, spec acceptance at parity. Co-Authored-By: WOZCODE <contact@withwoz.com>

cubic-dev-ai

6 issues found across 18 files

<file name="server/src/server/server_main.cpp">

<violation number="1" location="server/src/server/server_main.cpp:411">
P2: Missing input validation for --kvflash token count. The value is stored raw via setenv without any validation that it is a positive integer. Every other numeric flag in this block (--spark-slots, --ddtree-budget, --fa-window, --chunk, etc.) parses with std::atoi and validates. Passing non-numeric, zero, or negative input will silently set DFLASH_KVFLASH to garbage, deferring the failure to an opaque downstream atoi call rather than failing early with a clear error message.</violation>
</file>

<file name="server/src/qwen3/qwen3_kvflash_scorer.cpp">

<violation number="1" location="server/src/qwen3/qwen3_kvflash_scorer.cpp:110">
P2: `score_chunks` divides by `chunk_tokens` without validating it, which can crash on invalid input.</violation>
</file>

<file name="server/src/qwen35/graph_builders.h">

<violation number="1" location="server/src/qwen35/graph_builders.h:71">
P3: Header comment for `kvflash_mask` incorrectly states it is "Only meaningful with n_tokens == 1", but the parameter is actively used with `n_tokens > 1` in the verify_batch/spec-decode path (qwen35_dflash_target.cpp:63), and the implementation in graph_builders.cpp:291-296 explicitly describes support for "multi-token ... forwards (decode AND spec verify)". The header constraint is misleading and contradicts both the implementation comment and actual usage.</violation>
</file>

<file name="server/src/common/kvflash_pager.h">

<violation number="1" location="server/src/common/kvflash_pager.h:70">
P1: `attach()` does not validate that pool capacity leaves at least one evictable chunk, so small pools can deadlock eviction and make `slot_for()` fail with `-1`.</violation>
</file>

<file name="server/src/qwen3/qwen3_kvflash_scorer.h">

<violation number="1" location="server/src/qwen3/qwen3_kvflash_scorer.h:7">
P3: Stale documentation reference: the header comment says 'see common/kv_scorer.h' but no such file exists. The correct base-class header is `common/kvflash_scorer.h` (confirmed at `server/src/common/kvflash_scorer.h`). This will mislead developers looking for the dependency-free interface description.</violation>
</file>

<file name="server/src/qwen35/qwen35_backend.cpp">

<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:1304">
P1: `slot_for()` failure is unchecked, so kvflash can write to KV row `-1` when the pool has no evictable block.</violation>
</file>

The pager core is architecture-blind; this routes each backend's KV writes and masks through it so --kvflash works on every model family the daemon serves.

- qwen35moe (Qwen3.6-35B-A3B): the non-hybrid path inherits qwen35. The Spark pipelined hybrid decode gains a kv_slot parameter; the cached per-layer FA span clamps to the pool, so the cached graph stops rebuilding once the window reaches pool size. The pool span stays maskless like the rest of that path: the pager zeroes freed blocks (page-out + zero_free_blocks on request reset), the same zero-row approximation production padding already relies on. Hybrid spec decode (literal-offset KV writes) falls back to pipelined AR.
- laguna: all 40 layers pooled. laguna_step/_hybrid take a const pager; full + SWA masks are built in SLOT space via fill_slot_pos. SWA exactness comes from a protected tail >= sliding_window. Legacy per-layer hybrid decode and NO_KVPAD/PAD_CPY/no_mask ablations are refused under kvflash.
- gemma4: pools FULL-attention layers only (SWA layers already ring-buffer; KV-reuse layers share their source tensors). Slot-space full mask; FA span and mask width clamp to tensor capacity. Mutually exclusive with --fa-window; spec verify falls back to AR.
- pager: new const helpers slot_of / fill_slot_pos (slot-space mask construction) and zero_free_blocks (request-reset hygiene for maskless consumers); kvflash state in Qwen35Backend moved to protected for the MoE subclass.
- guards everywhere: prompt-fits-pool on every prefill/restore path, snapshots refused after the first relocation on laguna/gemma4.

Smoked on the 3090, pool 1024 / max-ctx 8192 with live LRU eviction mid-request: A3B Spark hybrid 101.6 tok/s, laguna 137.1, gemma4 119.0, all coherent; gemma4 no-flag control unchanged (120.2).

Co-Authored-By: WOZCODE <contact@withwoz.com>

davide221 · 2026-06-12T10:21:44Z

Update: KVFlash now covers every architecture the daemon serves

--kvflash was qwen35-only at PR open; this push ports it to the other three backends. The pager core (common/kvflash_pager.h) was already architecture-blind; each backend now routes its KV writes and masks through it:

| arch | model smoked | integration | decode |
|---|---|---|---|
| qwen35moe | Qwen3.6-35B-A3B (Spark hybrid, 9403 hot / 837 cold experts) | `pipelined_decode_one_token` gains `kv_slot`; cached per-layer FA span clamps to the pool (graph stops rebuilding at pool size); maskless pool span backed by pager-zeroed free blocks; hybrid spec falls back to pipelined AR | 101.6 tok/s coherent |
| laguna | Laguna-XS.2 (Spark hybrid, single-graph decode) | `laguna_step(_hybrid)` take a const pager; full + SWA masks built in SLOT space via the new `fill_slot_pos`; protected tail >= sliding_window keeps SWA exact; all 40 layers pooled | 137.1 tok/s coherent |
| gemma4 | Gemma4 26B-A4B | pools FULL-attention layers only (5 of them; SWA layers already ring-buffer, KV-reuse layers share source tensors); slot-space full mask; mutually exclusive with `--fa-window`; spec falls back to AR | 119.0 tok/s coherent |

All smokes: pool 1024 / max-ctx 8192, ~1.2K logical tokens so live LRU eviction engages mid-request, RTX 3090. A no-flag gemma4 control on the same build confirms the default path is unchanged. The qwen35 numbers in the PR body are unaffected.

Policy note: qwen35/qwen35moe attach the pflash drafter scorer automatically; laguna and gemma4 run LRU-only for now (the drafter is Qwen-tokenizer bound) with the KvFlashScorer seam open for their own indexers.

New pager helpers: slot_of / fill_slot_pos (const lookups for slot-space masks) and zero_free_blocks (request-reset hygiene for maskless consumers).

cubic-dev-ai

2 issues found across 18 files (changes from recent commits).

- Pool-deadlock guard (P1): KvFlashPager::min_pool_tokens() + attach() refusal when sinks + tail window leave no evictable block; every backend floors the requested pool at config read (512 for qwen-family and gemma4; laguna derives its floor from the resident SWA window) with a warning instead of a runtime eviction failure.
- Unchecked slot_for() in do_ar_decode (P1): a -1 slot now fails the request with a clear error instead of becoming a set_rows row index.
- --kvflash / --kvflash-tau (P2): validate as positive integers at the CLI and exit early instead of deferring garbage env values downstream.
- score_chunks (P2): guard chunk_tokens <= 0.
- Stale docs (P3 x2): kvflash_mask comment no longer claims n_tokens==1 only (it serves multi-token spec verify); kv_scorer.h rename leftover now points at common/kvflash_scorer.h.

Verified on the 3090: bad flag values rejected with clear messages; --kvflash 256 raises to the 512 floor and decodes coherently through live eviction in the tightest legal pool (8 blocks, 5 protected).

Co-Authored-By: WOZCODE <contact@withwoz.com>

davide221 · 2026-06-12T10:36:34Z

All 6 cubic findings were valid and are fixed in the latest push:

P1 pool-deadlock (attach() no evictable chunk): real — --kvflash 256 gave 4 chunks while sinks (1) + tail (4) protect 5, so eviction had no victim once the pool filled. Fix is two-layered: KvFlashPager::min_pool_tokens() + an attach() refusal with a clear message, and every backend's config read now floors the pool (512 for qwen-family/gemma4; laguna computes its floor from the SWA window it must keep resident) with a "raising" warning instead of a runtime failure. Verified live: --kvflash 256 logs requested pool 256 < minimum 512; raising and decodes correctly through eviction.
P1 unchecked slot_for() in do_ar_decode: real — a -1 would have become a set_rows row index. Now checked, logs, sets last_error, and fails the request. (The spec-verify path already checked; this was the one unchecked site.)
P2 --kvflash raw setenv: both --kvflash and --kvflash-tau now validate as positive integers and exit with a clear message, matching the other numeric flags.
P2 score_chunks division by chunk_tokens: guarded (chunk_tokens <= 0 returns false) alongside the existing entry validation.
P3 stale "Only meaningful with n_tokens == 1" comment: rewritten — the param serves both single-token decode and multi-token spec verify since the spec-on-pool phase landed.
P3 stale common/kv_scorer.h reference: rename leftover, now common/kvflash_scorer.h.
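The 512 floor in the first item is consistent with simple arithmetic: 1 sink chunk plus 4 tail chunks are protected, one more chunk must be evictable, and 6 x 64 = 384 tokens rounds up to 512 under the 256-token pool rounding that a later refactor commit mentions. The sketch below is one plausible reconstruction of that reasoning, not the PR's `KvFlashPager::min_pool_tokens()`; the struct, its fields, and the function names are hypothetical.

```cpp
#include <cassert>

// Hypothetical reconstruction; the PR's own floor computation may differ.
struct ToyPoolConfig {
    int chunk_tokens = 64;  // chunk granularity from the PR
    int sink_chunks = 1;    // protected leading chunks (from the author's reply)
    int tail_chunks = 4;    // protected trailing chunks (from the author's reply)
    int round_to = 256;     // pool rounding mentioned in a later commit
};

// Smallest pool that still leaves one evictable chunk, rounded up.
inline int toy_min_pool_tokens(const ToyPoolConfig& c) {
    assert(c.chunk_tokens > 0 && c.round_to > 0);
    const int min_chunks = c.sink_chunks + c.tail_chunks + 1;  // +1 evictable
    const int raw = min_chunks * c.chunk_tokens;
    return (raw + c.round_to - 1) / c.round_to * c.round_to;   // round up
}

// Raise a too-small request to the floor instead of failing at runtime.
inline int toy_floor_pool(int requested, const ToyPoolConfig& c) {
    const int min_tokens = toy_min_pool_tokens(c);
    return requested < min_tokens ? min_tokens : requested;
}
```

Under these assumptions a request of 256 is raised to 512, which gives 8 blocks with 5 protected, matching the "tightest legal pool" the fix commit reports.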

Rebuilt + re-smoked on the 3090 after the fixes (27B, pool floor path, coherent output through live eviction).

…lpers

The multi-arch port left three copies of the same plumbing; this pulls them into the kvflash layer so each backend integration reduces to wiring (net -32 lines):

- kvflash_pool_from_env(): the env read + 256-rounding + eviction floor + max_ctx clamp lived in three slightly diverging copies (qwen35 inline, laguna, gemma4). One reader, parameterized by the arch's KvFlashConfig; laguna passes its SWA-tail config via a new kvflash_config() so the floor and attach can never disagree.
- KvFlashPager::alloc_span(): the slot_for loop + exhaustion diagnostic existed in laguna, gemma4, and the qwen35moe restore replay; the backend helpers are now one-line delegates and the error message is single-sourced.
- kvflash_fill_rows_and_masks(): laguna's step-input filler and gemma4's inline rows + slot-space mask fill were the same algorithm; the shared helper builds append rows plus causal (and optional sliding-window) masks from the pager's slot map, so graph code no longer reimplements the slot-to-position conversion.

No behavior change: rebuilt on the 3090 and re-smoked the three affected archs through live eviction (laguna 138.0 tok/s, gemma4 119.4, qwen35 37.0, all coherent, banners unchanged).

Co-Authored-By: WOZCODE <contact@withwoz.com>

cubic-dev-ai

1 issue found across 8 files (changes from recent commits).

- assets/cards/kvflash_card.png registered in the README cards grid (DECODE 2.9x at 256K, CONTEXT 256K, KV VRAM -99%), linking to optimizations/kvflash/. - optimizations/kvflash/README.md gains the hero image (pflash layout). - README/RESULTS now state explicitly that the 256K full-cache baseline rows are measured, not extrapolated, and fit the 24 GB card only because the KV is Q8_0 (F16 KV would be 9.2 GiB and not fit); KVFlash holds 72 MiB resident either way. Co-Authored-By: WOZCODE <contact@withwoz.com>

The measured tables now carry the cache parameter on the column itself (KV in VRAM (Q8_0)) instead of relying on the prose footnote alone; the footnote keeps the why (F16 KV would not fit 256K on 24 GB at all). Co-Authored-By: WOZCODE <contact@withwoz.com>

New 'Bounded KV residency (KVFlash)' subsection after the KV cache block, mirroring the Spark pattern: one-paragraph intro + flag table (--kvflash / --kvflash-tau and their env equivalents) linking to optimizations/kvflash/. Co-Authored-By: WOZCODE <contact@withwoz.com>

The 38.6 tok/s / 72 MiB figures are Qwen3.6-27B at one pool size; the four model families land at different speeds. The flags reference now states the property (decode independent of context length, pool-sized resident KV) and points at optimizations/kvflash/ for per-model numbers. Co-Authored-By: WOZCODE <contact@withwoz.com>

… without compression

Three UX/capability gaps closed, all verified on the 3090:

- Pooled chunked prefill in the daemon (DESIGN follow-up #2): a prompt larger than the pool no longer refuses: do_prefill switches to pager-chunk-sized batches with slot-mapped set_rows writes, a slot-space mask per chunk (verify_batch recipe), and live eviction as the pool fills. Constant VRAM, linear time. Smoked: 6843-token prompt through a 2048 pool, coherent output, 35.1 tok/s decode. Restore offsets and boundary snapshots are refused in the pooled path.
- --kvflash auto: sizes the pool from --max-ctx (25% with a drafter configured, 50% LRU-only), same floor/clamp rails, all model families via the shared config reader. Smoked both sizings.
- Drafter scorer without compression: --prefill-drafter alone now arms the residency scorer. The server hands the path to the backend (DFLASH_KVFLASH_DRAFTER); kvflash_ensure_scorer() lazy-loads the drafter on the first reselect that needs it (never on the first tokens) and re-attaches after a draft-residency release. Previously the scorer only attached inside the pflash compress path, so this flag combination silently ran recency-only LRU. Smoked: attach fires mid-generation, banner announces the pending policy.
- Snapshot guards now use pager.is_identity() instead of cumulative page_outs stats: one eviction-heavy request no longer disables snapshots for the rest of the process (laguna/gemma4), and qwen35 refuses identity-copy snapshots of relocated pools.

Co-Authored-By: WOZCODE <contact@withwoz.com>

davide221 · 2026-06-12T17:34:42Z

Update: pooled chunked prefill + `--kvflash auto` + drafter scoring without compression

Three follow-ups landed in 9db8472, all smoked on the 3090:

Prompts larger than the pool now work through the daemon. do_prefill switches to pooled chunked prefill (64-token slot-mapped batches, slot-space mask per chunk, live eviction) instead of refusing — the harness recipe, now in the server. Smoked: a 6,843-token prompt through a 2,048-token pool, coherent output, 35.1 tok/s decode.
--kvflash auto: pool sized from --max-ctx — 25% when a drafter is configured, 50% LRU-only. Works on all four model families.
--prefill-drafter alone now arms the residency scorer (lazy-loaded at the first reselect). Previously the scorer only attached via the pflash compression path, so --kvflash + drafter with compression off silently ran recency-only LRU.
Bugfix: snapshot guards use is_identity() instead of cumulative page_outs, so one long request no longer disables snapshots for the rest of the process.

The intended one-liner UX is now real:

dflash_server model.gguf --max-ctx 262144 --kvflash auto --prefill-drafter qwen3-0.6b.gguf

High accuracy by default: when --kvflash is on and no --prefill-drafter was given, the qwen-family backend probes the well-known locations for the Qwen3-0.6B drafter (target's dir, drafter/, draft/, then /opt/lucebox/models/drafter/ — Spark's load-what-sits-next-to-the-model pattern) and arms the residency scorer from it. LRU is now the explicit FALLBACK when no drafter exists, and the banner says so ('lru (recency-only: no Qwen3-0.6B drafter found ...)') instead of presenting recency-only paging as a normal mode. Nothing turns kvflash itself on by default; this only picks the policy once the user asks for the pool. Smoked on the 3090 with ONLY '--kvflash auto': probe found the appliance drafter, auto sized 25% (drafter expected), scorer attached at the first reselect, coherent output. Co-Authored-By: WOZCODE <contact@withwoz.com>

…kvflash-policy

Relevance is a property of the text, not the tokenizer, so non-qwen targets no longer have to run recency-only residency:

- KvFlashCrossTokScorer: detokenize the target's history with its own tokenizer (loaded from the target GGUF), re-tokenize for the Qwen3-0.6B drafter (its GGUF), run the same tail-attention scoring, and map per-drafter-token scores back to the target's 64-token chunk boundaries by character spans. Tokenizers are host-only, lazy-loaded.
- laguna + gemma4 gain the full reselect loop (history, adaptive tau, lazy drafter load at the first reselect boundary, score_hook + repage). Drafter-scored residency is now the default on ALL four families; the probe + sizing live in the shared helpers.
- --kvflash-policy {drafter,lru}: the explicit opt-out the default was missing (no probe, no drafter load, recency-only paging).
- Shared kvflash_find_drafter() / kvflash_policy_is_lru() replace the per-backend probe; banners state the armed policy and how to change it.

Verified on the 3090 (gemma4 26B-A4B, pool 1024): cross-tok scorer attaches mid-generation, 18 drafter-driven reselects with page events, coherent 1.9K-token output. Stress needle A/B vs LRU: LRU degenerates and never recites; cross-tok stays coherent and recalls the correct prefix but not the exact code. Documented in RESULTS.md as functional but untuned (qwen-native scoring keeps its measured 14-16/16; the teacher-forced NIAH harness for non-qwen archs is the follow-up).

Co-Authored-By: WOZCODE <contact@withwoz.com>
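The character-span mapping in the cross-tokenizer scorer can be illustrated as follows. This is a sketch under stated assumptions: `Span` and `scores_to_chunks` are hypothetical names, and taking the maximum over overlapping drafter tokens is an assumed reduction, since the commit message does not say how scores are combined.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open character range [begin, end) in the detokenized history.
struct Span {
    std::size_t begin;
    std::size_t end;
};

// Hypothetical aggregation: each target chunk receives the maximum score of
// any drafter token whose character span overlaps the chunk's span. The PR
// does not state which reduction it uses; max is an assumption.
inline std::vector<float> scores_to_chunks(const std::vector<Span>& drafter_tok_spans,
                                           const std::vector<float>& drafter_tok_scores,
                                           const std::vector<Span>& chunk_spans) {
    assert(drafter_tok_spans.size() == drafter_tok_scores.size());
    std::vector<float> out(chunk_spans.size(), 0.0f);
    for (std::size_t c = 0; c < chunk_spans.size(); ++c) {
        bool seen = false;
        for (std::size_t t = 0; t < drafter_tok_spans.size(); ++t) {
            const bool overlap = drafter_tok_spans[t].begin < chunk_spans[c].end &&
                                 chunk_spans[c].begin < drafter_tok_spans[t].end;
            if (!overlap) continue;
            if (!seen || drafter_tok_scores[t] > out[c]) out[c] = drafter_tok_scores[t];
            seen = true;
        }
    }
    return out;
}
```

A drafter token that straddles a chunk boundary contributes to both neighbouring chunks, which is the conservative choice when the two tokenizers split the text differently.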

cubic-dev-ai

2 issues found across 12 files (changes from recent commits).

'auto' now sizes from the GPU instead of a fixed fraction of max_ctx: half of (device-free minus reserve) after the weights are resident, converted at the model's pooled-KV density, capped at the decode speed knee (16384 tokens default, DFLASH_KVFLASH_MAX_POOL to override) and at max_ctx. Rationale: a bigger pool means more resident chunks and fewer forced evictions of useful context (the relevance-crowding seen in the gemma4 needle stress), while the cap keeps the per-step KV read near the flat-decode optimum; on tight cards the VRAM term shrinks the pool automatically. Backends supply the budget (ggml_backend_dev_memory + per-arch density: qwen35 full-attn layers at resolve_kv_types' quant, laguna all layers at args.kv_type, gemma4 full-attn layers at F16 with per-layer dims); the reserve covers compute buffers plus the drafter when one is expected. The fraction heuristic survives only as the no-budget fallback. Smoked on the 3090 at max-ctx 131072: 27B picks 16384 (free 8.3 GiB, 14.0 KiB/token, speed-capped), gemma4 picks 16384 (7.5 GiB, 20.0 KiB/token), both banners report the full math, both decode coherently. Co-Authored-By: WOZCODE <contact@withwoz.com>
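The sizing rule described in this commit message reduces to a few lines. A minimal sketch, assuming the reserve is passed in by the caller (the message does not give its value); `auto_pool_tokens` and its parameter names are hypothetical, and the floor and rounding rails applied elsewhere are left out:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the described 'auto' sizing; names are illustrative.
// Pool = half of (free VRAM minus reserve), converted at the per-token KV
// density, then capped at the decode speed knee and at max_ctx.
inline std::int64_t auto_pool_tokens(std::int64_t free_bytes, std::int64_t reserve_bytes,
                                     std::int64_t kv_bytes_per_token, std::int64_t max_ctx,
                                     std::int64_t speed_cap = 16384) {
    assert(kv_bytes_per_token > 0 && max_ctx > 0 && speed_cap > 0);
    const std::int64_t budget =
        std::max<std::int64_t>(0, free_bytes - reserve_bytes) / 2;
    const std::int64_t by_vram = budget / kv_bytes_per_token;
    return std::min({by_vram, speed_cap, max_ctx});
}
```

At the reported 14.0 KiB/token, several GiB of free memory yields far more than 16,384 tokens, so the speed cap decides the result, which agrees with the "speed-capped" banner in the smoke test. The VRAM term only takes over on tight cards.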

cubic-dev-ai · 2026-06-12T21:47:14Z

You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment @cubic-dev-ai review.

Four valid findings from cubic's later passes, all fixed:

- KvFlashCrossTokScorer: raw owning pimpl now has deleted copy ctor/assignment (double-free guard; held in unique_ptr everywhere, but the class shouldn't rely on that).
- KvFlashPager::slot_for: a failed allocation rolls cur_chunk_ back so the next eviction's tail window isn't computed from a chunk that never materialized.
- laguna unpark: kvflash_attach failure now frees the just-loaded weights + cache before returning (was leaking them while still reporting parked).
- kvflash_drafter_failed_ latch clears on unpark in all three backends: a transient drafter-load failure no longer downgrades residency to LRU for the process lifetime (still no per-tau retry spam).

Stale finding skipped: the cumulative page_outs snapshot guard was already replaced by is_identity() two rounds ago.

Docs brought up to shipped reality: DESIGN.md per-arch policy section (cross-tok default, --kvflash-policy, VRAM auto), do_prefill bullet (pooled chunked prefill), and the follow-ups list now separates done (pooled prefill, spec-on-pool, VRAM auto, cross-tok) from open (drafter KV persistence, laguna/gemma4 pooled prefill, pooled snapshots, async paging, non-qwen NIAH harness).

Full test_kvflash regression suite on this exact tree: ALL PASS (relocation 2% gate, bit-exact roundtrip, eviction decode, reselect recall, LSA loop, >=90% KV cut), exit 0.

Co-Authored-By: WOZCODE <contact@withwoz.com>

davide221 · 2026-06-12T22:02:03Z

Pre-ship audit complete (`321695c`)

Final sweep before merge:

Cubic round 2: 4 of 5 later findings valid, all fixed — copy guards on the cross-tok scorer's owning pointer (P1), cur_chunk_ rollback on failed allocation, laguna unpark resource cleanup on attach failure, and the drafter-failure latch now clears on unpark instead of downgrading to LRU for the process lifetime. The fifth (cumulative page_outs snapshot guard) was already fixed by is_identity() in an earlier round.
Docs reconciled with shipped reality: DESIGN.md policy section, pooled-prefill bullet, and a done-vs-open follow-ups split.
Full test_kvflash regression suite on the final tree: ALL PASS, exit 0 — relocation equivalence (2% gate), bit-exact page roundtrip, eviction decode, reselect recall, LSA loop, >=90% KV-memory cut.

Open follow-ups are documented in DESIGN.md (drafter KV persistence, pooled prefill on laguna/gemma4, pooled snapshots, async paging, non-qwen NIAH harness + cross-tok tuning). From our side this is ready to merge once the GPU checks land.

Both GPU jobs shared group lucebox3-gpu-runner, but a concurrency group holds only ONE waiting job: the CUDA job took the running slot, the Radeon job sat in the waiting slot, and every new job entering the group from any branch displaced it ('Canceling since a higher priority waiting request exists') — the Radeon leg was cancelled chronically while the 3090 leg passed. The combo box has two distinct GPUs, so the jobs never contended for a device; per-GPU groups keep cross-PR serialization where it matters and stop the cross-displacement. Co-Authored-By: WOZCODE <contact@withwoz.com>

rocminfo on a wedged KFD blocks in uninterruptible sleep until the 20-minute job timeout kills the run with zero evidence. Probe it under a 15 s timeout first; on hang, dump /dev/kfd holders, D-state processes, and recent amdgpu/kfd dmesg, then fail in seconds with the diagnosis on the job page. The smoke step reuses the healthy probe's output. Co-Authored-By: WOZCODE <contact@withwoz.com>

The 'DDTree falls back to AR under KVFlash' limitation guarded against a tree verify that does not exist in the daemon: the complete tree machinery (build_ddtree, build_target_step_tree, follow_verified_tree) is only called from test_dflash, the benchmark harness. In the server, --ddtree sizes the verify intermediates for budget+1 tokens and enables fast_rollback, then generation runs the same chain spec loop either way — and both pieces are already pool-compatible: chain verify_batch is slot-mapped (measured at acceptance parity), and fast_rollback's snapshot_kv/restore_kv only snapshot DeltaNet/conv recurrent state, which KVFlash never pages. Gate removed; docs corrected (the known-limit now names the harness-only tree graphs, not the daemon). A/B on the 3090 (27B + DFlash draft, --ddtree, 600 tokens): pooled 14.6% accept / avg_commit 3.33 / 33.5 tok/s vs full-cache 13.9% / 3.23 / 33.3 — parity, both coherent. Co-Authored-By: WOZCODE <contact@withwoz.com>

timeout(1) cannot kill a process in uninterruptible sleep, so the previous diagnostic step itself blocked for the full job timeout when KFD was wedged (observed live: 20 minutes of silence, no evidence printed). Probe rocminfo in the background with output to a file (no held pipe), enforce the 15 s deadline in the shell, and on hang print the probe's own D-state, /dev/kfd holders, and amdgpu dmesg before failing fast — without ever wait()ing on the corpse. Co-Authored-By: WOZCODE <contact@withwoz.com>

davide221 · 2026-06-12T23:22:51Z

Update: `--ddtree` runs on the pool (`9a17281`)

Investigating the "DDTree falls back to AR" limitation dissolved it: the daemon never had tree verify — the tree machinery (build_ddtree/build_target_step_tree/follow_verified_tree) is only called from test_dflash, the benchmark harness. In the server, --ddtree sizes verify intermediates and enables fast_rollback, then runs the same chain spec loop — and both are pool-compatible (chain verify is slot-mapped; fast_rollback only snapshots DeltaNet state, which is never paged). Gate removed, docs corrected.

A/B on the 3090 (27B + DFlash draft, --ddtree, 600 tokens): pooled 14.6% accept / avg_commit 3.33 / 33.5 tok/s vs full-cache 13.9% / 3.23 / 33.3 — parity, both coherent.

Also in: the ROCm CI job now self-diagnoses a wedged KFD in ~15 s (D-state-proof background probe) instead of eating its 20-minute timeout in silence; the current Radeon failures are a driver wedge on the runner box, not this branch (zero PR code runs in that job).

🧙 Built with WOZCODE

…gression fix

Spec decode now runs on the pool everywhere it exists. gemma4 was the last gap:

- gemma4_verify_batch gains the kvflash path: set_rows kv-index inputs (full layers -> pool slots, SWA -> ring rows), slot-space causal mask via the shared helper, FA span + mask width clamped to the pool. Gemma4DFlashTarget allocates the verify block's slots up front; the spec loop's KV-truncation rejection maps directly onto the pool's validity rule (rejected slots hold future positions, masked until the next verify rewrites them). Both backend spec gates removed.
- Pre-existing regression fixed (blocks gemma spec on MAIN, not just here): PR #359's strict assert reads dflash.n_target_layers, which the published gemma draft fills with the TARGET layer count (30) while its fc tensor is sized for the 6 CAPTURE layers, so the draft refused to load at all. Per that PR's own weights-are-ground-truth rule, derive the capture count from fc when it divides n_embd and warn on the metadata mismatch; genuinely inconsistent shapes still fail.
- gemma4 accept_rate now reaches the HTTP usage block (was silently 0.0 while the loop logged the real rate; same reporting-only class as the PR #321 layer-split gap).

A/B on the 3090 (26B-A4B + published q8_0 draft, 600 tokens): pooled and full cache produce IDENTICAL acceptance (407/3104 = 13.1%, avg_commit 3.09) and identical text; usage reports 0.131 on both.

Co-Authored-By: WOZCODE <contact@withwoz.com>

davide221 · 2026-06-12T23:50:53Z

Update: gemma4 spec decode on the pool — spec now works everywhere it exists (`abb4cf4`)

The last spec-on-pool gap is closed. gemma4_verify_batch gains the slot-mapped path (set_rows kv indices, slot-space causal mask, pool-clamped span); gemma4's KV-truncation rejection semantics map directly onto the pool's validity rule. A/B on the 3090 with the published q8_0 draft: pooled and full cache produce identical acceptance (407/3104 = 13.1%, avg_commit 3.09) and identical text.
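The validity rule mentioned above can be illustrated with a small sketch. It assumes a simplified model in which each pool slot records the logical position it currently holds, or -1 when empty; `slot_pos` and `slot_space_causal_mask` are hypothetical names, not the PR's shared helper. A query at logical position q may attend to a slot only if the slot holds a position at or before q, so a slot still holding a rejected draft token (a future position) stays masked until a later verify overwrites it.

```python
NEG_INF = float("-inf")


def slot_space_causal_mask(slot_pos, query_pos):
    """Return mask[q][s]: 0.0 where query q may attend slot s, -inf elsewhere."""
    mask = []
    for q in query_pos:
        row = []
        for p in slot_pos:
            # Visible only if the slot is occupied and not in the query's future.
            visible = 0 <= p <= q
            row.append(0.0 if visible else NEG_INF)
        mask.append(row)
    return mask


# Seven pool slots. Slots are not in position order because chunks relocate.
# Slot 5 still holds position 103 from a rejected draft token; slot 6 is empty.
slot_pos = [100, 0, 101, 1, 102, 103, -1]
mask = slot_space_causal_mask(slot_pos, query_pos=[101, 102])
for row in mask:
    print(row)
```

The mask is indexed by slot, not by logical position, which is why relocation does not disturb causality in this model: only the position recorded for each slot matters.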

Two pre-existing bugs found and fixed along the way (both affect main, not just this branch):

- The published gemma draft cannot load since PR #359 (feat(qwen35): derive scalars from weights, assert vs GGUF metadata): its dflash.n_target_layers metadata holds the target's 30 layers while the fc tensor is sized for the 6 capture layers, so the strict assert rejects it and gemma spec decode is silently AR-only. Fixed per that PR's own weights-are-ground-truth rule: derive the capture count from the tensor, warn on the metadata mismatch.
- gemma4 accept_rate never reached the HTTP usage block (reported 0.0 while the loop logged the real rate), the same reporting-only class as the layer-split gap in PR #321 (feat(server): support DFlash with mixed-backend target layer split). Wired through.
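The first fix can be sketched as a small derivation. This is an illustration under stated assumptions, not the loader's code: it assumes the fc tensor's input width equals the capture-layer count times n_embd, and the function name and the n_embd value of 2048 are hypothetical. Only the 30 (metadata) and 6 (capture) layer counts come from the description above.

```python
import warnings


def derive_capture_layers(fc_in_dim, n_embd, meta_n_target_layers):
    """Derive the capture-layer count from the fc tensor; weights are ground truth."""
    if n_embd <= 0 or fc_in_dim <= 0 or fc_in_dim % n_embd != 0:
        # Genuinely inconsistent shapes still fail.
        raise ValueError(
            f"fc input dim {fc_in_dim} is not a positive multiple of n_embd {n_embd}")
    n_capture = fc_in_dim // n_embd
    if n_capture != meta_n_target_layers:
        # Metadata disagrees with the weights: warn, then trust the weights.
        warnings.warn(
            f"dflash.n_target_layers={meta_n_target_layers} but fc implies "
            f"{n_capture} capture layers; using {n_capture}")
    return n_capture


N_EMBD = 2048  # hypothetical width, for illustration only
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    n_capture = derive_capture_layers(6 * N_EMBD, N_EMBD, meta_n_target_layers=30)
print(n_capture, len(caught))
```

Warning instead of rejecting keeps already-published drafts loadable while still surfacing the bad metadata.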

Spec-on-pool coverage is now: qwen35 chain ✓, qwen35 --ddtree config ✓, gemma4 chain ✓; the only exception remains MoE-hybrid spec (literal-offset writes, falls back to pipelined AR), and laguna has no spec decode to begin with.

…7B-hardcoded)

The converter stamped the qwen35-27B draft's scalars (n_head_kv=8, hidden=5120, n_layer=5, ff=17408, ...) onto every draft regardless of source, so any non-27B DFlash draft (A3B, gemma) converted to a GGUF with correct weights but wrong metadata, which the strict draft loader then rejected (blk.0 attn_k dim != n_head_kv*head_dim). Every MoE/A3B spec-decode attempt on main fails at draft load for this reason.

load_arch() now resolves the architecture from the source config.json (authoritative for transformer hparams) cross-checked against the tensor shapes (authoritative for the rest: head_dim from k_proj, intermediate from gate_proj, n_target_layers from fc, n_layer from the block count), falling back to the 27B constants only when config.json is absent. Verified: A3B draft converts to n_head_kv=4 n_layer=8 ff=6144 and loads clean.

This unblocks MoE speculative decode. Validated on the 3090: A3B MoE all-GPU with --ddtree + --kvflash 2048 runs spec decode on the pool (10.4% accept, avg_commit 2.66, coherent) vs full cache (11.5%, 2.84, coherent), so dflash + ddtree + kvflash compose on MoE. The qwen35moe --spark hybrid spec path has a separate pre-existing CUDA crash (see RESULTS Known limits); it was never reachable until drafts could load.

Co-Authored-By: WOZCODE <contact@withwoz.com>

davide221 · 2026-06-13T11:31:41Z

Update: MoE has dflash + ddtree on the pool (`feef3fd`)

A3B MoE, all-GPU (experts resident), --ddtree --kvflash 2048: spec decode runs on the pool — 10.4% accept, avg_commit 2.66, 59.5 tok/s, coherent vs full cache 11.5% / 2.84 / 64.6 (gap within the documented masked-kernel rounding variance). dflash + ddtree + kvflash compose on MoE. This path needs no new code — qwen35moe inherits the qwen35 spec loop already pool-validated.

Getting there required fixing the draft converter, which was broken for every non-27B draft on main: convert_dflash_to_gguf.py hardcoded the 27B scalars (n_head_kv=8, hidden=5120, ...), so A3B/gemma drafts converted with correct weights but 27B metadata and the strict loader rejected them. Now config-driven (cross-checked against tensor shapes); the A3B draft converts to n_head_kv=4 n_layer=8 ff=6144 and loads clean. This unblocks MoE/A3B spec decode on main, not just here.
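A config-driven resolution with a shape cross-check might look like the sketch below. It is not the converter's `load_arch()`: the function name `resolve_arch`, the Hugging Face style tensor names, and the `(out_features, in_features)` shape convention are assumptions, and the head_dim of 128 and hidden size of 2048 in the example are hypothetical. The 27B fallback values and the A3B results (n_head_kv=4, n_layer=8, ff=6144) are the ones quoted above.

```python
import re

# Values quoted for the qwen35-27B draft; used only when config.json is absent.
FALLBACK_27B = {"n_head_kv": 8, "hidden": 5120, "n_layer": 5, "ff": 17408}

_BLOCK_RE = re.compile(r"layers\.(\d+)\.")


def resolve_arch(config, shapes, fallback=FALLBACK_27B):
    """config: parsed config.json or None; shapes: tensor name -> (rows, cols)."""
    if config is None:
        return dict(fallback)

    n_head_kv = config["num_key_value_heads"]
    k_rows, k_cols = shapes["layers.0.self_attn.k_proj.weight"]
    # Cross-check the config against the weights before trusting either.
    if k_rows % n_head_kv != 0:
        raise ValueError(f"k_proj rows {k_rows} not divisible by n_head_kv {n_head_kv}")
    if "hidden_size" in config and config["hidden_size"] != k_cols:
        raise ValueError(f"hidden_size {config['hidden_size']} != k_proj cols {k_cols}")

    # Count distinct block indices that appear in the tensor names.
    blocks = set()
    for name in shapes:
        m = _BLOCK_RE.search(name)
        if m:
            blocks.add(int(m.group(1)))

    return {
        "n_head_kv": n_head_kv,
        "hidden": k_cols,
        "head_dim": k_rows // n_head_kv,  # derived from k_proj
        "n_layer": len(blocks),           # derived from the block count
        "ff": shapes["layers.0.mlp.gate_proj.weight"][0],  # from gate_proj
    }


# An A3B-like draft: 8 blocks, 4 KV heads, ff=6144 (head_dim/hidden hypothetical).
shapes = {}
for i in range(8):
    shapes[f"layers.{i}.self_attn.k_proj.weight"] = (4 * 128, 2048)
    shapes[f"layers.{i}.mlp.gate_proj.weight"] = (6144, 2048)
arch = resolve_arch({"num_key_value_heads": 4, "hidden_size": 2048}, shapes)
print(arch)
```

Failing loudly on a config/shape disagreement moves the error from draft load time to conversion time, where the source checkpoint is still at hand.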

One genuine pre-existing bug surfaced, filed in RESULTS Known limits: the qwen35moe --spark hybrid spec path crashes with a CUDA illegal-memory-access — independent of kvflash (crashes on the full cache too), never reachable before because no A3B draft could load. --spark spec falls back to pipelined AR under kvflash; that crash is its own follow-up.

I wrote pool-aware code for the hybrid path too but did not ship it — it can't be exercised until that crash is fixed, and I'm keeping the branch to validated code only.

davide221 and others added 4 commits June 12, 2026 10:15

cubic-dev-ai Bot reviewed Jun 12, 2026

Comment thread server/src/gemma4/gemma4_backend.cpp Outdated

Comment thread server/src/laguna/laguna_backend.cpp Outdated

cubic-dev-ai Bot reviewed Jun 12, 2026

Comment thread server/src/common/kvflash_pager.h

davide221 and others added 5 commits June 12, 2026 18:53

davide221 and others added 2 commits June 12, 2026 19:58

cubic-dev-ai Bot reviewed Jun 12, 2026

Comment thread server/src/qwen3/qwen3_kvflash_scorer.h

Comment thread server/src/gemma4/gemma4_backend.cpp

davide221 and others added 4 commits June 13, 2026 00:18
