KVFlash: bounded KV residency (lookahead sparse attention) for dflash #373
davide221 wants to merge 22 commits into
KvFlashPager: bounded resident pool for the full-attention KV cache (FlashMemory-style lookahead sparse attention, arXiv 2606.09079). Logical positions map to physical pool slots at 64-token chunk granularity; cold chunks page to a host backing store bit-exact and recallable. GPU footprint is a hard O(pool) bound at any context length.
KvFlashScorer: dependency-free chunk-relevance policy interface. With no scorer the pager runs pure LRU; KvFlashDrafterScorer adapts the pflash Qwen3-0.6B drafter (tail-attention chunk scores, z-normalized, bisecting on allocation pressure) so reselect becomes relevance-driven.
Co-Authored-By: WOZCODE <contact@withwoz.com>
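The position-to-slot mapping described above can be sketched as follows. This is a minimal illustration, not the code in common/kvflash_pager.h: ToyPager and its members are hypothetical names, the host backing store is omitted, and eviction is plain LRU with no protected sinks or tail window.

```cpp
#include <cassert>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

// Hypothetical minimal pager: maps logical token positions to physical pool
// slots at fixed chunk granularity and evicts the least recently used chunk.
class ToyPager {
public:
    static constexpr int kChunkTokens = 64;  // chunk size stated in the PR

    explicit ToyPager(int pool_chunks) {
        assert(pool_chunks > 0);
        for (int b = pool_chunks - 1; b >= 0; --b) free_blocks_.push_back(b);
    }

    // Returns the physical slot (row in the pooled KV tensor) for a logical
    // position, allocating or evicting a block when the chunk is not resident.
    int slot_for(int64_t pos) {
        const int64_t chunk = pos / kChunkTokens;
        auto it = resident_.find(chunk);
        if (it == resident_.end()) {
            int block;
            if (!free_blocks_.empty()) {
                block = free_blocks_.back();
                free_blocks_.pop_back();
            } else {
                // Evict the least recently used chunk (back of the list).
                // A real pager would copy its KV bytes to host memory first.
                const int64_t victim = lru_.back();
                lru_.pop_back();
                block = resident_[victim].block;
                resident_.erase(victim);
                ++page_outs_;
            }
            lru_.push_front(chunk);
            it = resident_.emplace(chunk, Entry{block, lru_.begin()}).first;
        } else {
            // Touch: move the chunk to the front of the recency list.
            lru_.splice(lru_.begin(), lru_, it->second.where);
        }
        return it->second.block * kChunkTokens + static_cast<int>(pos % kChunkTokens);
    }

    bool is_resident(int64_t pos) const { return resident_.count(pos / kChunkTokens) != 0; }
    int page_outs() const { return page_outs_; }

private:
    struct Entry {
        int block;                          // physical block index in the pool
        std::list<int64_t>::iterator where; // position in the recency list
    };
    std::vector<int> free_blocks_;
    std::list<int64_t> lru_;                       // front = most recently used
    std::unordered_map<int64_t, Entry> resident_;  // logical chunk -> block
    int page_outs_ = 0;
};
```

With a two-block pool, touching positions 0, 65, and 130 in that order evicts chunk 0 and reuses its block, so position 130 lands in slot 2 while the GPU-side tensor never grows past 128 rows.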
- create_target_cache gains ctx_alloc: attention KV tensors allocate at pool capacity while cache.max_ctx stays the logical bound.
- build_target_step gains kvflash_mask: pooled decode keeps the step-invariant set_rows KV append active alongside an exact slot-validity mask (uploaded before every compute; gallocr reuses input regions during graph execution, so a stale mask is garbage).
- do_ar_decode routes kv_write_rows through the pager slot, pushes history, and reselects every tau decoded tokens (effective interval max(tau, history/45) caps rescore overhead near 15%).
- Spec decode (chain) verifies ON the pool: verify_batch slot-maps the draft block (kv_write_rows is [n_tokens, n_head_kv] ne0-major) and builds a slot-space mask; rejected drafts need no rollback since the pos < base_pos validity rule excludes their slots until rewritten. DDTree tree-verify is not pool-aware and falls back to AR.
- pflash synergy: when the prefill drafter loads, KvFlashDrafterScorer attaches automatically; without it the pool runs LRU (fully agnostic).
- Post-generation snapshots are skipped once cur_pos exceeds the pool; prompts must fit the pool (clear error otherwise); pool size clamps to --max-ctx with a warning.
Co-Authored-By: WOZCODE <contact@withwoz.com>
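A rough sketch of two pieces from this commit: the effective reselect interval and the slot-validity rule for a single decode step. The function names and the slot_pos layout (one logical position per pool slot, -1 for an empty slot) are assumptions made for illustration; the real mask is a ggml tensor uploaded before every compute.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Effective reselect interval as described in the commit: max(tau, history / 45).
inline int64_t effective_reselect_interval(int64_t tau, int64_t history_tokens) {
    assert(tau > 0);
    return std::max(tau, history_tokens / 45);
}

// slot_pos[s] is the logical position stored in pool slot s, or -1 if empty.
// Returns an additive attention mask for one decode step at base_pos:
// 0 for slots the new token may attend to, -inf for everything else.
inline std::vector<float> build_decode_slot_mask(const std::vector<int64_t>& slot_pos,
                                                 int64_t base_pos, size_t write_slot) {
    assert(write_slot < slot_pos.size());
    const float neg_inf = -std::numeric_limits<float>::infinity();
    std::vector<float> mask(slot_pos.size(), neg_inf);
    for (size_t s = 0; s < slot_pos.size(); ++s) {
        // Validity rule: only slots holding a position before base_pos count.
        // A slot left over from a rejected draft holds pos >= base_pos and
        // therefore stays masked until it is rewritten.
        const bool holds_past = slot_pos[s] >= 0 && slot_pos[s] < base_pos;
        if (holds_past || s == write_slot) mask[s] = 0.0f;
    }
    return mask;
}
```

For example, with tau = 64 the interval stays at 64 until the history passes 2880 tokens, after which it grows linearly with history length.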
Gated suite A-F: full-cache baseline, shuffled-relocation equivalence (<=2% argmax flips), live paging with bit-exact page_out/page_in roundtrip and >=90% KV-bytes cut, score-driven reselect recall, decode profile, and the full LSA loop with the drafter as Memory Indexer. Modes: --niah / --niah256 (needle recall vs residency), --longab (end-to-end long-prompt A/B, per-process configs for clean VRAM), --no-mask. Co-Authored-By: WOZCODE <contact@withwoz.com>
Measured on lucebox RTX 3090, Qwen3.6-27B Q4_K_M, Q8_0 KV: decode flat at 38.6 tok/s from 64K to native-max 256K (2.9x over full cache at 256K), 72 MiB resident KV vs 4608 MiB, prefill up to 2.8x faster, needle recall 88-100% at 6-9% residency with the drafter policy, harness ground truth 32/32 vs 32/32, spec acceptance at parity. Co-Authored-By: WOZCODE <contact@withwoz.com>
6 issues found across 18 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/server/server_main.cpp">
<violation number="1" location="server/src/server/server_main.cpp:411">
P2: Missing input validation for --kvflash token count. The value is stored raw via setenv without any validation that it is a positive integer. Every other numeric flag in this block (--spark-slots, --ddtree-budget, --fa-window, --chunk, etc.) parses with std::atoi and validates. Passing non-numeric, zero, or negative input will silently set DFLASH_KVFLASH to garbage, deferring the failure to an opaque downstream atoi call rather than failing early with a clear error message.</violation>
</file>
<file name="server/src/qwen3/qwen3_kvflash_scorer.cpp">
<violation number="1" location="server/src/qwen3/qwen3_kvflash_scorer.cpp:110">
P2: `score_chunks` divides by `chunk_tokens` without validating it, which can crash on invalid input.</violation>
</file>
<file name="server/src/qwen35/graph_builders.h">
<violation number="1" location="server/src/qwen35/graph_builders.h:71">
P3: Header comment for `kvflash_mask` incorrectly states it is "Only meaningful with n_tokens == 1", but the parameter is actively used with `n_tokens > 1` in the verify_batch/spec-decode path (qwen35_dflash_target.cpp:63), and the implementation in graph_builders.cpp:291-296 explicitly describes support for "multi-token ... forwards (decode AND spec verify)". The header constraint is misleading and contradicts both the implementation comment and actual usage.</violation>
</file>
<file name="server/src/common/kvflash_pager.h">
<violation number="1" location="server/src/common/kvflash_pager.h:70">
P1: `attach()` does not validate that pool capacity leaves at least one evictable chunk, so small pools can deadlock eviction and make `slot_for()` fail with `-1`.</violation>
</file>
<file name="server/src/qwen3/qwen3_kvflash_scorer.h">
<violation number="1" location="server/src/qwen3/qwen3_kvflash_scorer.h:7">
P3: Stale documentation reference: the header comment says 'see common/kv_scorer.h' but no such file exists. The correct base-class header is `common/kvflash_scorer.h` (confirmed at `server/src/common/kvflash_scorer.h`). This will mislead developers looking for the dependency-free interface description.</violation>
</file>
<file name="server/src/qwen35/qwen35_backend.cpp">
<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:1304">
P1: `slot_for()` failure is unchecked, so kvflash can write to KV row `-1` when the pool has no evictable block.</violation>
</file>
The pager core is architecture-blind; this routes each backend's KV writes and masks through it so --kvflash works on every model family the daemon serves.
- qwen35moe (Qwen3.6-35B-A3B): the non-hybrid path inherits qwen35. The Spark pipelined hybrid decode gains a kv_slot parameter; the cached per-layer FA span clamps to the pool, so the cached graph stops rebuilding once the window reaches pool size. The pool span stays maskless like the rest of that path: the pager zeroes freed blocks (page-out + zero_free_blocks on request reset), the same zero-row approximation production padding already relies on. Hybrid spec decode (literal-offset KV writes) falls back to pipelined AR.
- laguna: all 40 layers pooled. laguna_step/_hybrid take a const pager; full + SWA masks are built in SLOT space via fill_slot_pos. SWA exactness comes from a protected tail >= sliding_window. Legacy per-layer hybrid decode and NO_KVPAD/PAD_CPY/no_mask ablations are refused under kvflash.
- gemma4: pools FULL-attention layers only (SWA layers already ring-buffer; KV-reuse layers share their source tensors). Slot-space full mask; FA span and mask width clamp to tensor capacity. Mutually exclusive with --fa-window; spec verify falls back to AR.
- pager: new const helpers slot_of / fill_slot_pos (slot-space mask construction) and zero_free_blocks (request-reset hygiene for maskless consumers); kvflash state in Qwen35Backend moved to protected for the MoE subclass.
- guards everywhere: prompt-fits-pool on every prefill/restore path, snapshots refused after the first relocation on laguna/gemma4.
Smoked on the 3090, pool 1024 / max-ctx 8192 with live LRU eviction mid-request: A3B Spark hybrid 101.6 tok/s, laguna 137.1, gemma4 119.0, all coherent; gemma4 no-flag control unchanged (120.2).
Co-Authored-By: WOZCODE <contact@withwoz.com>
Update: KVFlash now covers every architecture the daemon serves
All smokes: pool 1024 / max-ctx 8192, ~1.2K logical tokens so live LRU eviction engages mid-request, RTX 3090. A no-flag gemma4 control on the same build confirms the default path is unchanged. The qwen35 numbers in the PR body are unaffected.
Policy note: qwen35/qwen35moe attach the pflash drafter scorer automatically; laguna and gemma4 run LRU-only for now (the drafter is Qwen-tokenizer bound) with the
New pager helpers:
🧙 Built with WOZCODE
2 issues found across 18 files (changes from recent commits).
- Pool-deadlock guard (P1): KvFlashPager::min_pool_tokens() + attach() refusal when sinks + tail window leave no evictable block; every backend floors the requested pool at config read (512 for qwen-family and gemma4; laguna derives its floor from the resident SWA window) with a warning instead of a runtime eviction failure.
- Unchecked slot_for() in do_ar_decode (P1): a -1 slot now fails the request with a clear error instead of becoming a set_rows row index.
- --kvflash / --kvflash-tau (P2): validate as positive integers at the CLI and exit early instead of deferring garbage env values downstream.
- score_chunks (P2): guard chunk_tokens <= 0.
- Stale docs (P3 x2): kvflash_mask comment no longer claims n_tokens==1 only (it serves multi-token spec verify); kv_scorer.h rename leftover now points at common/kvflash_scorer.h.
Verified on the 3090: bad flag values rejected with clear messages; --kvflash 256 raises to the 512 floor and decodes coherently through live eviction in the tightest legal pool (8 blocks, 5 protected).
Co-Authored-By: WOZCODE <contact@withwoz.com>
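The CLI validation and pool floor from this fix round could look roughly like the sketch below. parse_positive_int and floor_pool are hypothetical helper names, not the functions in server_main.cpp, and the sketch covers only numeric values.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <optional>

// Parses a strictly positive decimal integer. Rejects null or empty input,
// trailing garbage, zero, negatives, and values that do not fit in int.
// Note: like strtol itself, this accepts leading whitespace and a '+' sign.
inline std::optional<int> parse_positive_int(const char* text) {
    if (text == nullptr || *text == '\0') return std::nullopt;
    errno = 0;
    char* end = nullptr;
    const long value = std::strtol(text, &end, 10);
    if (errno == ERANGE || end == text || *end != '\0') return std::nullopt;
    if (value <= 0 || value > INT_MAX) return std::nullopt;
    return static_cast<int>(value);
}

// Raises a requested pool size to the eviction floor so that at least one
// block stays evictable (the commit uses 512 for qwen-family and gemma4).
inline int floor_pool(int requested, int min_pool_tokens) {
    return std::max(requested, min_pool_tokens);
}
```

A caller would print a clear error and exit when parse_positive_int returns std::nullopt, which is the fail-early behavior the review asked for in place of a raw setenv.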
All 6 cubic findings were valid and are fixed in the latest push:
Rebuilt + re-smoked on the 3090 after the fixes (27B, pool floor path, coherent output through live eviction). 🧙 Built with WOZCODE
…lpers
The multi-arch port left three copies of the same plumbing; this pulls them into the kvflash layer so each backend integration reduces to wiring (net -32 lines):
- kvflash_pool_from_env(): the env read + 256-rounding + eviction floor + max_ctx clamp lived in three slightly diverging copies (qwen35 inline, laguna, gemma4). One reader, parameterized by the arch's KvFlashConfig; laguna passes its SWA-tail config via a new kvflash_config() so the floor and attach can never disagree.
- KvFlashPager::alloc_span(): the slot_for loop + exhaustion diagnostic existed in laguna, gemma4, and the qwen35moe restore replay; the backend helpers are now one-line delegates and the error message is single-sourced.
- kvflash_fill_rows_and_masks(): laguna's step-input filler and gemma4's inline rows + slot-space mask fill were the same algorithm; the shared helper builds append rows plus causal (and optional sliding-window) masks from the pager's slot map, so graph code no longer reimplements the slot-to-position conversion.
No behavior change: rebuilt on the 3090 and re-smoked the three affected archs through live eviction (laguna 138.0 tok/s, gemma4 119.4, qwen35 37.0, all coherent, banners unchanged).
Co-Authored-By: WOZCODE <contact@withwoz.com>
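To illustrate the slot-to-position conversion the shared helper centralizes, here is a minimal sketch of a slot-space causal mask with an optional sliding window. The function name, argument layout, and row-major mask shape are assumptions; the actual kvflash_fill_rows_and_masks() also produces the append rows and writes into graph inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical mask filler: slot_pos[s] is the logical position held by pool
// slot s (-1 = free), query_pos[q] is the logical position of query token q.
// Returns a row-major [n_query x n_slots] additive mask (0 = visible).
// window <= 0 means plain causal; window > 0 adds a sliding-window limit.
inline std::vector<float> fill_slot_space_mask(const std::vector<int64_t>& slot_pos,
                                               const std::vector<int64_t>& query_pos,
                                               int64_t window) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    const size_t n_slots = slot_pos.size();
    std::vector<float> mask(query_pos.size() * n_slots, neg_inf);
    for (size_t q = 0; q < query_pos.size(); ++q) {
        for (size_t s = 0; s < n_slots; ++s) {
            const int64_t key_pos = slot_pos[s];
            if (key_pos < 0) continue;             // free slot
            if (key_pos > query_pos[q]) continue;  // causal: no future keys
            if (window > 0 && query_pos[q] - key_pos >= window) continue;  // outside SWA
            mask[q * n_slots + s] = 0.0f;
        }
    }
    return mask;
}
```

Because the mask is indexed by slot and compares logical positions, the physical order of chunks in the pool does not matter, which is why relocated chunks remain attendable.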
1 issue found across 8 files (changes from recent commits).
- assets/cards/kvflash_card.png registered in the README cards grid (DECODE 2.9x at 256K, CONTEXT 256K, KV VRAM -99%), linking to optimizations/kvflash/.
- optimizations/kvflash/README.md gains the hero image (pflash layout).
- README/RESULTS now state explicitly that the 256K full-cache baseline rows are measured, not extrapolated, and fit the 24 GB card only because the KV is Q8_0 (F16 KV would be 9.2 GiB and not fit); KVFlash holds 72 MiB resident either way.
Co-Authored-By: WOZCODE <contact@withwoz.com>
The measured tables now carry the cache parameter on the column itself (KV in VRAM (Q8_0)) instead of relying on the prose footnote alone; the footnote keeps the why (F16 KV would not fit 256K on 24 GB at all). Co-Authored-By: WOZCODE <contact@withwoz.com>
New 'Bounded KV residency (KVFlash)' subsection after the KV cache block, mirroring the Spark pattern: one-paragraph intro + flag table (--kvflash / --kvflash-tau and their env equivalents) linking to optimizations/kvflash/. Co-Authored-By: WOZCODE <contact@withwoz.com>
The 38.6 tok/s / 72 MiB figures are Qwen3.6-27B at one pool size; the four model families land at different speeds. The flags reference now states the property (decode independent of context length, pool-sized resident KV) and points at optimizations/kvflash/ for per-model numbers. Co-Authored-By: WOZCODE <contact@withwoz.com>
… without compression
Three UX/capability gaps closed, all verified on the 3090:
- Pooled chunked prefill in the daemon (DESIGN follow-up #2): a prompt larger than the pool no longer refuses; do_prefill switches to pager-chunk-sized batches with slot-mapped set_rows writes, a slot-space mask per chunk (verify_batch recipe), and live eviction as the pool fills. Constant VRAM, linear time. Smoked: 6843-token prompt through a 2048 pool, coherent output, 35.1 tok/s decode. Restore offsets and boundary snapshots are refused in the pooled path.
- --kvflash auto: sizes the pool from --max-ctx (25% with a drafter configured, 50% LRU-only), same floor/clamp rails, all model families via the shared config reader. Smoked both sizings.
- Drafter scorer without compression: --prefill-drafter alone now arms the residency scorer. The server hands the path to the backend (DFLASH_KVFLASH_DRAFTER); kvflash_ensure_scorer() lazy-loads the drafter on the first reselect that needs it (never on the first tokens) and re-attaches after a draft-residency release. Previously the scorer only attached inside the pflash compress path, so this flag combination silently ran recency-only LRU. Smoked: attach fires mid-generation, banner announces the pending policy.
- Snapshot guards now use pager.is_identity() instead of cumulative page_outs stats: one eviction-heavy request no longer disables snapshots for the rest of the process (laguna/gemma4), and qwen35 refuses identity-copy snapshots of relocated pools.
Co-Authored-By: WOZCODE <contact@withwoz.com>
Update: pooled chunked prefill +
High accuracy by default: when --kvflash is on and no --prefill-drafter
was given, the qwen-family backend probes the well-known locations for
the Qwen3-0.6B drafter (target's dir, drafter/, draft/, then
/opt/lucebox/models/drafter/ — Spark's load-what-sits-next-to-the-model
pattern) and arms the residency scorer from it. LRU is now the explicit
FALLBACK when no drafter exists, and the banner says so
('lru (recency-only: no Qwen3-0.6B drafter found ...)') instead of
presenting recency-only paging as a normal mode.
Nothing turns kvflash itself on by default; this only picks the policy
once the user asks for the pool.
Smoked on the 3090 with ONLY '--kvflash auto': probe found the
appliance drafter, auto sized 25% (drafter expected), scorer attached
at the first reselect, coherent output.
Co-Authored-By: WOZCODE <contact@withwoz.com>
…kvflash-policy
Relevance is a property of the text, not the tokenizer, so non-qwen
targets no longer have to run recency-only residency:
- KvFlashCrossTokScorer: detokenize the target's history with its own
tokenizer (loaded from the target GGUF), re-tokenize for the Qwen3-0.6B
drafter (its GGUF), run the same tail-attention scoring, and map
per-drafter-token scores back to the target's 64-token chunk
boundaries by character spans. Tokenizers are host-only, lazy-loaded.
- laguna + gemma4 gain the full reselect loop (history, adaptive tau,
lazy drafter load at the first reselect boundary, score_hook + repage).
Drafter-scored residency is now the default on ALL four families; the
probe + sizing live in the shared helpers.
- --kvflash-policy {drafter,lru}: the explicit opt-out the default was
missing (no probe, no drafter load, recency-only paging).
- Shared kvflash_find_drafter() / kvflash_policy_is_lru() replace the
per-backend probe; banners state the armed policy and how to change it.
Verified on the 3090 (gemma4 26B-A4B, pool 1024): cross-tok scorer
attaches mid-generation, 18 drafter-driven reselects with page events,
coherent 1.9K-token output. Stress needle A/B vs LRU: LRU degenerates
and never recites; cross-tok stays coherent and recalls the correct
prefix but not the exact code. Documented in RESULTS.md as functional
but untuned (qwen-native scoring keeps its measured 14-16/16; the
teacher-forced NIAH harness for non-qwen archs is the follow-up).
Co-Authored-By: WOZCODE <contact@withwoz.com>
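The character-span mapping in KvFlashCrossTokScorer can be pictured with the sketch below. It assumes both tokenizations expose a half-open character span per token over the same detokenized text, and it aggregates by overlap-weighted mean; the commit does not state which aggregation the real scorer uses, so that choice and all names here are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Span {
    size_t begin;  // first character covered by the token
    size_t end;    // one past the last character (half-open)
};

// Maps per-drafter-token scores onto target-token chunks by character overlap.
// Each chunk covers chunk_tokens consecutive target tokens (64 in the PR).
inline std::vector<double> chunk_scores_by_char_span(
        const std::vector<Span>& target_tok_spans,
        const std::vector<Span>& drafter_tok_spans,
        const std::vector<double>& drafter_scores,
        size_t chunk_tokens) {
    assert(chunk_tokens > 0);
    assert(drafter_tok_spans.size() == drafter_scores.size());
    const size_t n_chunks = (target_tok_spans.size() + chunk_tokens - 1) / chunk_tokens;
    std::vector<double> out(n_chunks, 0.0);
    for (size_t c = 0; c < n_chunks; ++c) {
        const size_t first = c * chunk_tokens;
        const size_t last = std::min(first + chunk_tokens, target_tok_spans.size()) - 1;
        const Span chunk{target_tok_spans[first].begin, target_tok_spans[last].end};
        double weighted = 0.0;
        double total = 0.0;
        for (size_t d = 0; d < drafter_tok_spans.size(); ++d) {
            const size_t lo = std::max(chunk.begin, drafter_tok_spans[d].begin);
            const size_t hi = std::min(chunk.end, drafter_tok_spans[d].end);
            if (hi <= lo) continue;  // no character overlap with this chunk
            const double overlap = static_cast<double>(hi - lo);
            weighted += overlap * drafter_scores[d];
            total += overlap;
        }
        if (total > 0.0) out[c] = weighted / total;
    }
    return out;
}
```

Working in character space is what lets a Qwen-tokenized drafter score chunks for a target with a different tokenizer: the two token streams never need to align, only their spans over the shared text.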
2 issues found across 12 files (changes from recent commits).
'auto' now sizes from the GPU instead of a fixed fraction of max_ctx: half of (device-free minus reserve) after the weights are resident, converted at the model's pooled-KV density, capped at the decode speed knee (16384 tokens default, DFLASH_KVFLASH_MAX_POOL to override) and at max_ctx.
Rationale: a bigger pool means more resident chunks and fewer forced evictions of useful context (the relevance-crowding seen in the gemma4 needle stress), while the cap keeps the per-step KV read near the flat-decode optimum; on tight cards the VRAM term shrinks the pool automatically.
Backends supply the budget (ggml_backend_dev_memory + per-arch density: qwen35 full-attn layers at resolve_kv_types' quant, laguna all layers at args.kv_type, gemma4 full-attn layers at F16 with per-layer dims); the reserve covers compute buffers plus the drafter when one is expected. The fraction heuristic survives only as the no-budget fallback.
Smoked on the 3090 at max-ctx 131072: 27B picks 16384 (free 8.3 GiB, 14.0 KiB/token, speed-capped), gemma4 picks 16384 (7.5 GiB, 20.0 KiB/token), both banners report the full math, both decode coherently.
Co-Authored-By: WOZCODE <contact@withwoz.com>
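The sizing rule above works out roughly as in this sketch. The struct and function names are hypothetical, the reserve value in any example is an assumption (the commit does not give one), and the eviction floor and 256-rounding applied by the shared config reader are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct AutoPoolInputs {
    uint64_t free_bytes;          // device-free memory after weights are resident
    uint64_t reserve_bytes;       // compute buffers, plus the drafter when expected
    uint64_t kv_bytes_per_token;  // pooled-KV density for this architecture
    int64_t max_ctx;              // logical context bound
    int64_t speed_cap = 16384;    // default decode-speed knee stated in the commit
};

// Pool size in tokens: half of (free - reserve) converted at the KV density,
// then capped at the speed knee and at max_ctx.
inline int64_t auto_pool_tokens(const AutoPoolInputs& in) {
    assert(in.kv_bytes_per_token > 0);
    const uint64_t usable =
        in.free_bytes > in.reserve_bytes ? in.free_bytes - in.reserve_bytes : 0;
    int64_t tokens = static_cast<int64_t>((usable / 2) / in.kv_bytes_per_token);
    tokens = std::min(tokens, in.speed_cap);
    tokens = std::min(tokens, in.max_ctx);
    return tokens;
}
```

With the 27B figures from the smoke test (about 8.3 GiB free, 14.0 KiB per token) the VRAM term alone allows several hundred thousand tokens for any plausible reserve, so the 16384 speed cap is the binding limit, which matches the "speed-capped" note in the banner.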
Four valid findings from cubic's later passes, all fixed:
- KvFlashCrossTokScorer: raw owning pimpl now has deleted copy ctor/assignment (double-free guard; held in unique_ptr everywhere, but the class shouldn't rely on that).
- KvFlashPager::slot_for: a failed allocation rolls cur_chunk_ back so the next eviction's tail window isn't computed from a chunk that never materialized.
- laguna unpark: kvflash_attach failure now frees the just-loaded weights + cache before returning (was leaking them while still reporting parked).
- kvflash_drafter_failed_ latch clears on unpark in all three backends: a transient drafter-load failure no longer downgrades residency to LRU for the process lifetime (still no per-tau retry spam).
Stale finding skipped: the cumulative page_outs snapshot guard was already replaced by is_identity() two rounds ago.
Docs brought up to shipped reality: DESIGN.md per-arch policy section (cross-tok default, --kvflash-policy, VRAM auto), do_prefill bullet (pooled chunked prefill), and the follow-ups list now separates done (pooled prefill, spec-on-pool, VRAM auto, cross-tok) from open (drafter KV persistence, laguna/gemma4 pooled prefill, pooled snapshots, async paging, non-qwen NIAH harness).
Full test_kvflash regression suite on this exact tree: ALL PASS (relocation 2% gate, bit-exact roundtrip, eviction decode, reselect recall, LSA loop, >=90% KV cut), exit 0.
Co-Authored-By: WOZCODE <contact@withwoz.com>
Pre-ship audit complete (
Both GPU jobs shared group lucebox3-gpu-runner, but a concurrency group
holds only ONE waiting job: the CUDA job took the running slot, the
Radeon job sat in the waiting slot, and every new job entering the
group from any branch displaced it ('Canceling since a higher priority
waiting request exists') — the Radeon leg was cancelled chronically
while the 3090 leg passed. The combo box has two distinct GPUs, so the
jobs never contended for a device; per-GPU groups keep cross-PR
serialization where it matters and stop the cross-displacement.
Co-Authored-By: WOZCODE <contact@withwoz.com>
rocminfo on a wedged KFD blocks in uninterruptible sleep until the 20-minute job timeout kills the run with zero evidence. Probe it under a 15 s timeout first; on hang, dump /dev/kfd holders, D-state processes, and recent amdgpu/kfd dmesg, then fail in seconds with the diagnosis on the job page. The smoke step reuses the healthy probe's output. Co-Authored-By: WOZCODE <contact@withwoz.com>
The 'DDTree falls back to AR under KVFlash' limitation guarded against a tree verify that does not exist in the daemon: the complete tree machinery (build_ddtree, build_target_step_tree, follow_verified_tree) is only called from test_dflash, the benchmark harness. In the server, --ddtree sizes the verify intermediates for budget+1 tokens and enables fast_rollback, then generation runs the same chain spec loop either way — and both pieces are already pool-compatible: chain verify_batch is slot-mapped (measured at acceptance parity), and fast_rollback's snapshot_kv/restore_kv only snapshot DeltaNet/conv recurrent state, which KVFlash never pages. Gate removed; docs corrected (the known-limit now names the harness-only tree graphs, not the daemon). A/B on the 3090 (27B + DFlash draft, --ddtree, 600 tokens): pooled 14.6% accept / avg_commit 3.33 / 33.5 tok/s vs full-cache 13.9% / 3.23 / 33.3 — parity, both coherent. Co-Authored-By: WOZCODE <contact@withwoz.com>
timeout(1) cannot kill a process in uninterruptible sleep, so the previous diagnostic step itself blocked for the full job timeout when KFD was wedged (observed live: 20 minutes of silence, no evidence printed). Probe rocminfo in the background with output to a file (no held pipe), enforce the 15 s deadline in the shell, and on hang print the probe's own D-state, /dev/kfd holders, and amdgpu dmesg before failing fast — without ever wait()ing on the corpse. Co-Authored-By: WOZCODE <contact@withwoz.com>
Update:
…gression fix
Spec decode now runs on the pool everywhere it exists. gemma4 was the last gap:
- gemma4_verify_batch gains the kvflash path: set_rows kv-index inputs (full layers -> pool slots, SWA -> ring rows), slot-space causal mask via the shared helper, FA span + mask width clamped to the pool. Gemma4DFlashTarget allocates the verify block's slots up front; the spec loop's KV-truncation rejection maps directly onto the pool's validity rule (rejected slots hold future positions, masked until the next verify rewrites them). Both backend spec gates removed.
- Pre-existing regression fixed (blocks gemma spec on MAIN, not just here): PR #359's strict assert reads dflash.n_target_layers, which the published gemma draft fills with the TARGET layer count (30) while its fc tensor is sized for the 6 CAPTURE layers, so the draft refused to load at all. Per that PR's own weights-are-ground-truth rule, derive the capture count from fc when it divides n_embd and warn on the metadata mismatch; genuinely inconsistent shapes still fail.
- gemma4 accept_rate now reaches the HTTP usage block (was silently 0.0 while the loop logged the real rate; same reporting-only class as the PR #321 layer-split gap).
A/B on the 3090 (26B-A4B + published q8_0 draft, 600 tokens): pooled and full cache produce IDENTICAL acceptance (407/3104 = 13.1%, avg_commit 3.09) and identical text; usage reports 0.131 on both.
Co-Authored-By: WOZCODE <contact@withwoz.com>
Update: gemma4 spec decode on the pool — spec now works everywhere it exists (
…7B-hardcoded)
The converter stamped the qwen35-27B draft's scalars (n_head_kv=8, hidden=5120, n_layer=5, ff=17408, ...) onto every draft regardless of source, so any non-27B DFlash draft (A3B, gemma) converted to a GGUF with correct weights but wrong metadata, which the strict draft loader then rejected (blk.0 attn_k dim != n_head_kv*head_dim). Every MoE/A3B spec-decode attempt on main fails at draft load for this reason.
load_arch() now resolves the architecture from the source config.json (authoritative for transformer hparams) cross-checked against the tensor shapes (authoritative for the rest: head_dim from k_proj, intermediate from gate_proj, n_target_layers from fc, n_layer from the block count), falling back to the 27B constants only when config.json is absent. Verified: A3B draft converts to n_head_kv=4 n_layer=8 ff=6144 and loads clean.
This unblocks MoE speculative decode. Validated on the 3090: A3B MoE all-GPU with --ddtree + --kvflash 2048 runs spec decode on the pool (10.4% accept, avg_commit 2.66, coherent) vs full cache (11.5%, 2.84, coherent), so dflash + ddtree + kvflash compose on MoE. The qwen35moe --spark hybrid spec path has a separate pre-existing CUDA crash (see RESULTS Known limits); it was never reachable until drafts could load.
Co-Authored-By: WOZCODE <contact@withwoz.com>
Update: MoE has dflash + ddtree on the pool (
KVFlash: bounded KV residency (lookahead sparse attention) for dflash
FlashMemory-style (arXiv 2606.09079) decode-time KV paging behind a new --kvflash <tokens> flag. The full-attention KV cache lives in a fixed pool of slots; cold 64-token chunks page to host RAM bit-exact and recallable. GPU KV footprint becomes a hard O(pool) constant at any logical context length.
Full docs in optimizations/kvflash/ (README, RESULTS, DESIGN).
Headline numbers (lucebox RTX 3090, Qwen3.6-27B Q4_K_M, Q8_0 KV)
Decode is flat at 38.6 tok/s from 64K to the model's native 256K maximum (1.4x / 2.0x / 2.9x over the full cache), prefill is up to 2.8x faster, and attn-KV memory drops 99.2% (2304 to 18 MiB at 128K with a 1K pool).
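The memory figures above follow from simple proportionality between resident KV and the number of tokens held in the pool. The sketch below reproduces that arithmetic from the stated 2304 MiB at 128K; the helper names are illustrative, and the 4096-token pool behind the 72 MiB figure is an inference from the same density, not a number stated in the PR.

```cpp
#include <cassert>

// KiB of attention KV per token, derived from a full-cache measurement.
constexpr double kib_per_token(double full_cache_mib, double ctx_tokens) {
    return full_cache_mib * 1024.0 / ctx_tokens;
}

// Resident KV in MiB for a pool of pool_tokens at the given density.
constexpr double resident_mib(double kib_per_tok, double pool_tokens) {
    return kib_per_tok * pool_tokens / 1024.0;
}

// Percentage of KV memory saved by the pool relative to the full cache.
constexpr double saving_percent(double full_mib, double pooled_mib) {
    return 100.0 * (1.0 - pooled_mib / full_mib);
}

// 2304 MiB over 128K tokens is 18 KiB per token; a 1K pool holds 18 MiB.
static_assert(kib_per_token(2304.0, 131072.0) == 18.0, "density at 128K");
static_assert(resident_mib(18.0, 1024.0) == 18.0, "1K pool residency");
```

At that density saving_percent(2304, 18) evaluates to about 99.22, matching the quoted 99.2%, and the 4608 MiB full cache at 256K gives the same 18 KiB per token.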
How it works
- create_target_cache gains ctx_alloc); cache.max_ctx stays the logical bound. The allocation delta IS the saving.
- common/kvflash_pager.h) maps logical positions to pool slots at 64-token chunk granularity, riding the existing step-invariant set_rows KV append. RoPE is baked into K rows at write time, so relocation is legal; page-out/page-in moves raw quantized bytes and is bit-exact.
- reselect() repages the pool: the paper's lookahead loop, with a hard capacity cap their sigmoid threshold lacks.
Policy is pluggable, pflash is optional
KvFlashScorer (common/) is the policy seam. With no scorer the pool runs pure LRU (zero pflash dependency, recency-only memory). When pflash loads its drafter, KvFlashDrafterScorer attaches automatically and reselect becomes relevance-driven: needle recall holds at 88-100% down to 6-9% residency from 8K to 256K, where LRU scores 0 outside its tail window.
Spec decode runs on the pool
Chain-mode verify_batch slot-maps the draft block (per-token kv_write_rows, which is [n_tokens, n_head_kv] ne0-major) and builds a slot-space mask. Rejected drafts need no rollback: the pos < base_pos validity rule excludes their slots until rewritten. Acceptance parity measured on the daemon: 15.4-15.6% pooled vs 15.3% full cache. DDTree tree-verify is not pool-aware yet and falls back to AR with a one-time warning.
Quality
Harness ground truth with the pool sized per the heuristic: HumanEval 10/10, GSM 10/10, MATH 10/10, agent 6/6, identical to the full-cache baseline (base-vs-base control: 16/16 byte-identical, so the stack is deterministic; text drift under KVFlash is the masked kernel's different deterministic rounding lineage, not a correctness effect).
Verification
test_kvflash suite A-F: full-cache baseline, shuffled-relocation equivalence (0.83% argmax flips, gate 2%), live paging with bit-exact roundtrip and >=90% KV-bytes cut, score-driven reselect recall, decode profile, full LSA loop with the drafter as Memory Indexer.
Known limits (documented in RESULTS.md)
- cur_pos exceeds the pool (pooled snapshots need page-table serialization); prefill-time snapshots work.
Usage
🧙 Built with WOZCODE