compose: FlowKV aged-history compression + drafter residency fix — 1.72x vs disk-cache baseline at <=64K by dusterbloom · Pull Request #372 · Luce-Org/lucebox-hub

dusterbloom · 2026-06-11T18:09:41Z

TL;DR

On current main (which includes the PR #364 scoped disk prefix cache), measured on an RTX 3090 24GB, enabling FlowKV aged-history compression on top of the disk cache now delivers:

	main (disk cache alone)	this PR (compose)	delta
7-turn agentic session wall (N=3 mean)	527.5s	306.7s	1.72x
worst-turn fresh prefill (26K tok)	370 tok/s (73-77s)	396 tok/s (66-73s)	parity+
decode @63k context	8.1-9.1 tok/s	10.6-14.5 tok/s	+16-30%
tool-call validity	16/21	18/21	held

Benchmark: goldgate_fix trace (real multi-turn agentic session, 34K-64K prompt tokens per turn), N=3 interleaved A/B on the same binary, same thermal window.

Summary

PR #364 made warm agentic turns cheap by restoring a stable token prefix from disk. The remaining cost on long sessions is the aged conversation history that still has to be prefilled fresh whenever the prefix diverges, and the per-turn growth beyond the cached boundary. This PR composes FlowKV aged-history compression with that cache: messages older than a hot window are compressed (drafter-scored, anchor-preserving) while the system prompt stays verbatim as the cache anchor, so the disk-cache key remains stable across turns. A unified gate keeps the three paths exclusive — turn-1 verbatim, FlowKV on continuations, whole-prompt pFlash otherwise — and with compression disabled the request path is byte-identical to main.

Two fixes found during benchmarking turned the compose from a wash into the 1.72x above:

Drafter residency. Keeping the pflash scoring drafter (~2GB BF16) resident through the target's large prefill collapses prefill throughput from 370 to 121 tok/s on 24GB cards. The cause is allocator pressure at ~61K KV rather than a capacity limit, verified by A/B: the same prefill runs at full rate with the drafter released. The drafter is now released after its scoring pass and lazily reloaded (~2s). This is the auto residency default; --draft-residency persistent keeps the old behavior for >=32GB cards.
Admission ordering. The ingress context-length check rejected prompt+max_tokens > max_ctx before compression could run. Oversized requests are now admitted when compression will run, and the hard limit is enforced on the post-compress effective size instead.

Changes

FlowKV aged-history compression composed with the #364 scoped disk cache behind a unified gate (http_server.cpp); with compression off, main's behavior stays byte-identical.
auto draft residency releases the pflash drafter after compress scoring (placement/draft_residency.h).
Pure admission helper should_reject_oversized() + post-compress effective-size gate (server/admission.h).
Skip-park guard: --prefill-skip-park downgraded on <32GB GPUs at max_ctx>65536 (VMM VA-fragmentation crash class) (placement/skip_park_guard.h).
ee7 early-exit drafter, anchor-transitive cascade with expansion throttles, tail-capture guard for the chunk-boundary assert.
Tests: 1926-assertion unit suite green; standalone suites for admission (7), skip-park guard (6), anchor cascade, early-exit score range, warm-path regression. ~55% of the diff is tests.

Limitations

Cold >64K prompts at max_ctx=131072 still fail during verbatim-turn prefill (VMM pool growth with target+decode-draft resident; root-caused, follow-up scoped: decode-draft release during large prefills / lazy-KV).
Compression keeps ~93% on dense code (anchor-dominated) — known, separate lever.
GGML_CUDA_NO_VMM=1 as an environment variable is a no-op (compile-time option in this fork); scripts relying on it were never protected.

History

731561d1 compose FlowKV with the #364 scoped cache; 0efdc33c gate compression as fallback so compose can't regress main; 6a848058 unified gate (FlowKV reachable + scoped save preserved).
cefa3caf ee7 early-exit drafter + anchor-transitive cascade + tail-capture guard.
3fc6882f drafter auto-release after compress scoring (the 1.72x).
2ae98c0f compress-aware admission.
637fbdaf comment trim (-133 LOC, no logic changes).
1c562eb4 skip-park footprint guard.

…rg#364 scoped cache
- Port 354e7b6 message-count freeze (aged[1..n-hot) compressed once, cached)
- Remove mutual-exclusion: FlowKV active → disk clamps to system_end (verbatim system anchor, stable cross-session key); Luce-Org#364 unchanged when compress=false
- WS1: non-continuation turns skip compression (cold-poison fix preserved)
- Inert-guard: aged band < 512 tokens → FlowKV-OFF
- Config: DiskPrefixCachePolicy::compress + --disk-prefix-cache-compress CLI
- Tests T1-T7: 1908 assertions, 0 failures

… vs Luce-Org#364
FlowKV ran whenever disk_cache_policy.compress was set, with no size gate, so every multi-turn agentic turn paid the full pFlash drafter-forward (~400s/session at 59K) and re-expanded the prompt, making COMPOSE ~1.9x slower than the plain Luce-Org#364 scoped disk cache it should improve on.
- Gate FlowKV on the original prompt size (same threshold as the pFlash gate), and skip it once pFlash has already compressed.
- Below threshold COMPOSE is byte-identical to Luce-Org#364 (full prefix-cache hits, no drafter tax); compression fires only when the conversation can't fit the KV.
- Keep the scoped-disk-re-prefill skip under compression (avoids turn-2 hang).
Validated on abc_cache_harness COMPOSE arm (auto, threshold=65000): goldgate_fix total wall 846s -> 480s (~Luce-Org#364's 443s), zero compression on sub-threshold turns. Activate via --prefill-compression auto --prefill-threshold ~max_ctx.

…g-42 tail-capture guard ee7 truncates drafter forward at layer 7 of 28, scoring only those layers. 9.3× drafter wall at 128K (RTX 3090, Qwen3.6-27B-Q4_K_M target + Qwen2.5-0.5B-BF16 drafter). Anchor-transitive cascade rescues multi-hop on bimodal-density prompts (gated, default OFF). Bug Luce-Org#42 fix: tail-capture view-bounds guard at S%4096 in {1..7}. 5 unit tests included. Bench scripts split to follow-up PR.

…g#364 scoped save 47081e67 demoted FlowKV to a downstream else-if after whole-prompt pFlash, gated on the same threshold — making FlowKV structurally unreachable (any threshold that let it run made pFlash fire first; PFLASH_FREEZE_HISTORY went dead). Replace with the unified gate (compute should_compress once; route continuations to FlowKV-freeze with should_compress=false; whole-prompt pFlash only for cold non-continuations), mirroring the working flowkv-standalone structure. Re-enable Luce-Org#364's scoped disk save under compression (drop the band-aid guard; the disk-clamp already pins the save to the stable system_end prefix). Paired A/B, same binary (cb458145), full 7-turn goldgate_fix, single-session: COMPOSE_FLOWKV 615.9s vs pure-Luce-Org#364 713.7s (1.16x), decode 13.6 vs 6.7 tps, tool-valid 85.7% vs 71.4%. FlowKV engages on continuations; ee7 keeps the drafter forward cheap. Turn-4 transition cost (park/unpark + uncached compressed-prefill) is the remaining lever, not the gate.
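The routing described in this commit message can be pictured as one pure decision function that is evaluated once per request. The following is a minimal sketch under stated assumptions, not the code in http_server.cpp: the names `PromptPath`, `GateInput`, and `select_prompt_path` are hypothetical, and the ordering of the checks is one simplified reading of the notes in this PR (should_compress computed once from the original prompt size, continuations routed to FlowKV, whole-prompt pFlash only for cold non-continuations, 512-token inert guard).

```cpp
#include <cassert>

// Hypothetical names throughout; an illustration of the "unified gate" idea only.
enum class PromptPath { Verbatim, FlowKV, PFlash };

struct GateInput {
    bool compress_enabled;  // server-level compress flag
    bool is_continuation;   // request continues a known session
    int  prompt_tokens;     // original (uncompressed) prompt size
    int  aged_band_tokens;  // tokens older than the hot window
    int  threshold;         // prefill threshold (assumed shared by both compressors)
};

// From the commit notes: an aged band below 512 tokens leaves FlowKV off.
constexpr int kInertAgedBandTokens = 512;

inline PromptPath select_prompt_path(const GateInput & in) {
    // Compression disabled: the request path is the same as on main.
    if (!in.compress_enabled) return PromptPath::Verbatim;

    // Computed once, from the original prompt size (assumption: >= threshold).
    const bool should_compress = in.prompt_tokens >= in.threshold;
    if (!should_compress) return PromptPath::Verbatim;

    if (in.is_continuation) {
        // Continuations never fall through to whole-prompt pFlash.
        return in.aged_band_tokens >= kInertAgedBandTokens ? PromptPath::FlowKV
                                                           : PromptPath::Verbatim;
    }
    // Cold, non-continuation prompt over the threshold.
    return PromptPath::PFlash;
}
```

Because every branch returns exactly one enumerator, the three paths cannot overlap, which is the property the earlier else-if chain lost.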

Resident drafter (~2GB) starves the target's large prefill on 24GB cards (370 -> 121 tok/s on the freeze transition turn). Release after scoring, lazy reload next turn (~2s). N=3 interleaved: 527.5s -> 306.7s (1.72x), turn-4 prefill 217-269s -> 66-73s, quality held. persistent remains the big-card opt-out.
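The release-then-lazy-reload behavior can be sketched as a small wrapper around the drafter. This is an illustration only: `DraftResidency`, `DrafterSlot`, and the load/unload callbacks are hypothetical names standing in for whatever placement/draft_residency.h actually defines, and the callbacks replace real model loading.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical residency modes mirroring the "auto" and "persistent" options.
enum class DraftResidency { Auto, Persistent };

class DrafterSlot {
public:
    DrafterSlot(DraftResidency mode,
                std::function<void()> load,
                std::function<void()> unload)
        : mode_(mode), load_(std::move(load)), unload_(std::move(unload)) {}

    // Runs one scoring pass. The drafter is loaded on demand and, in Auto
    // mode, released again before the caller starts the target's prefill.
    template <typename ScoreFn>
    void score(ScoreFn && fn) {
        if (!resident_) {          // lazy (re)load
            load_();
            resident_ = true;
        }
        fn();
        if (mode_ == DraftResidency::Auto) {
            unload_();             // free the drafter's memory
            resident_ = false;
        }
    }

    bool resident() const { return resident_; }

private:
    DraftResidency        mode_;
    std::function<void()> load_;
    std::function<void()> unload_;
    bool                  resident_ = false;
};
```

In Auto mode each scoring pass pays one reload, which the PR reports as roughly 2s; Persistent loads once and never releases.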

…them Ingress gate rejected prompt+max_tokens > max_ctx before compression ran, making >max_ctx sessions unreachable even when FlowKV/pFlash could shrink them. Extract pure should_reject_oversized() (admission.h): pass oversized requests through when compression will run; enforce the hard limit on the post-compress effective size in worker_loop. Oversized requests now get compressed first and reject cleanly only if still over budget.
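A minimal sketch of the two-stage admission described above. The name `should_reject_oversized` comes from this PR, but its signature here is an assumption, and `exceeds_ctx_after_compress` is a hypothetical name for the post-compress check performed in worker_loop.

```cpp
#include <cassert>

// Ingress check. Assumed signature; the real helper lives in server/admission.h.
// An oversized request is rejected up front only when no compression will run.
inline bool should_reject_oversized(int prompt_tokens, int max_tokens,
                                    int max_ctx, bool compression_will_run) {
    const long long requested =
        static_cast<long long>(prompt_tokens) + max_tokens;
    if (requested <= max_ctx) return false;  // fits as-is
    return !compression_will_run;            // oversized: admit only if it can shrink
}

// Hypothetical post-compress check: the hard limit applies to the effective
// (compressed) prompt size rather than the raw one.
inline bool exceeds_ctx_after_compress(int effective_prompt_tokens,
                                       int max_tokens, int max_ctx) {
    return static_cast<long long>(effective_prompt_tokens) + max_tokens > max_ctx;
}
```

Keeping both checks pure (no server state) is what makes them easy to cover with the standalone admission tests mentioned in the Changes list.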

-133 net LOC, comments only — zero logic/string/assertion changes. All suites re-verified green (1926 asserts + 4 standalone tests).

Dual-resident target+draft fragments VMM virtual address space; at max_ctx=131072 the compute pool's cuMemSetAccess fails (device not ready). Safe cell (<=65536, 10+ clean runs) keeps the fast no-park path; dangerous cell parks. Note: GGML_CUDA_NO_VMM=1 env is compile-time-only in this fork and never mitigated this.
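The safe and dangerous cells can be expressed as a pure predicate. This is a sketch under assumptions: `allow_skip_park` and both constants are hypothetical names, and the 32 GiB byte threshold is a guess at how "<32GB" is measured. Cards sold as 32GB may report slightly less than 32 GiB of total memory, so the guard in placement/skip_park_guard.h may well use a different cutoff.

```cpp
#include <cassert>
#include <cstdint>

// Assumed thresholds, taken from the numbers quoted in this PR.
constexpr std::uint64_t kBigCardBytes = 32ull << 30;  // 32 GiB (assumption)
constexpr int           kSafeMaxCtx   = 65536;        // largest max_ctx in the safe cell

// Returns whether a requested --prefill-skip-park may stay enabled.
// It is downgraded only in the dangerous cell: small card AND large context.
inline bool allow_skip_park(bool requested, std::uint64_t total_vram_bytes,
                            int max_ctx) {
    if (!requested) return false;
    const bool small_card = total_vram_bytes < kBigCardBytes;
    const bool large_ctx  = max_ctx > kSafeMaxCtx;
    return !(small_card && large_ctx);
}
```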

cubic-dev-ai

8 issues found across 24 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/server/freeze_history.h">

<violation number="1" location="server/src/server/freeze_history.h:10">
P3: Unused include: `<vector>` is not used by any declaration in this header. Remove it to keep dependencies minimal.</violation>
</file>

<file name="server/src/qwen35/qwen35_backend.cpp">

<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:541">
P2: Skip-park safety guard reads VRAM from current CUDA device instead of the configured placement GPU, making guard decisions incorrect on multi-GPU setups.</violation>
</file>

<file name="server/src/server/http_server.cpp">

<violation number="1" location="server/src/server/http_server.cpp:1904">
P1: FlowKV message-part extraction reads unvalidated JSON as strings; non-string `text` values can throw uncaught exceptions in the worker loop.</violation>
</file>

<file name="server/src/qwen3/anchor_scan.cpp">

<violation number="1" location="server/src/qwen3/anchor_scan.cpp:27">
P1: `search_end` clamping to 0 causes one invalid n-gram comparison when `body_end < ngram`, risking out-of-bounds reads and boundary violations.</violation>

<violation number="2" location="server/src/qwen3/anchor_scan.cpp:103">
P1: Transitive cascade loop exits early due to comparing `forced` against an immediately copied snapshot, so subsequent expansion/rescan iterations are skipped.</violation>
</file>

<file name="server/src/qwen3/anchor_scan.h">

<violation number="1" location="server/src/qwen3/anchor_scan.h:18">
P2: max_forced_count hard cap is checked only inside the cascade loop, but not after pass-1. If pass-1 alone already pushes forced chunks above max_forced_count, the cap is never enforced — the result can exceed the limit.</violation>
</file>


cubic-dev-ai · 2026-06-11T18:20:25Z

+{
+    const int n_chunks = (int)forced.size();
+    const int ngram    = cfg.ngram;
+    const int search_end = std::max(0, body_end - ngram);


P1: search_end clamping to 0 causes one invalid n-gram comparison when body_end < ngram, risking out-of-bounds reads and boundary violations.
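One way to close this class of bug is to return before the scan whenever the body is shorter than one n-gram, so that the last valid window start is never clamped. The sketch below is a self-contained toy, not the scan in anchor_scan.cpp: `has_ngram_match` and its parameters are hypothetical, and it assumes the inner loop visits window starts up to and including `search_end`.

```cpp
#include <cassert>
#include <vector>

// Returns true if any ngram-length window of `query` occurs in body[0, body_end).
inline bool has_ngram_match(const std::vector<int> & body, int body_end,
                            const std::vector<int> & query, int ngram) {
    // Guard first: with fewer than `ngram` body tokens there is no valid window,
    // so no comparison may run at all.
    if (ngram <= 0 || body_end < ngram ||
        body_end > static_cast<int>(body.size())) {
        return false;
    }
    const int search_end = body_end - ngram;  // last valid window start, >= 0 here

    for (int qi = 0; qi + ngram <= static_cast<int>(query.size()); ++qi) {
        for (int bi = 0; bi <= search_end; ++bi) {
            bool equal = true;
            for (int k = 0; k < ngram && equal; ++k) {
                equal = body[bi + k] == query[qi + k];
            }
            if (equal) return true;
        }
    }
    return false;
}
```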


cubic-dev-ai · 2026-06-11T18:20:25Z

+    // Cascade loop: expand pool with tokens from newly-forced chunks and re-scan.
+    std::vector<uint8_t> prev_forced;
+    for (int it = 0; it < max_iters; ++it) {
+        prev_forced = forced;


P1: Transitive cascade loop exits early due to comparing forced against an immediately copied snapshot, so subsequent expansion/rescan iterations are skipped.
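The pattern the reviewer is pointing at is a fixed-point loop: take the snapshot before the iteration's scan and compare only after it. The toy below shows that shape under simplified assumptions; `cascade_force` is a hypothetical function, a chunk is "forced" when it shares any token with the pool, and the real cascade in anchor_scan.cpp scores n-grams and rare tokens instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

inline std::vector<std::uint8_t> cascade_force(
        const std::vector<std::vector<int>> & chunks,
        std::unordered_set<int> pool, int max_iters) {
    std::vector<std::uint8_t> forced(chunks.size(), 0);
    for (int it = 0; it < max_iters; ++it) {
        // Snapshot BEFORE this iteration's scan.
        const std::vector<std::uint8_t> prev = forced;

        // Scan: force every chunk that shares a token with the current pool.
        for (std::size_t c = 0; c < chunks.size(); ++c) {
            if (forced[c]) continue;
            for (int tok : chunks[c]) {
                if (pool.count(tok)) { forced[c] = 1; break; }
            }
        }

        // Compare AFTER the scan; stop only when nothing new was forced.
        if (forced == prev) break;

        // Expand the pool with tokens from newly forced chunks, then re-scan.
        for (std::size_t c = 0; c < chunks.size(); ++c) {
            if (forced[c] && !prev[c]) {
                pool.insert(chunks[c].begin(), chunks[c].end());
            }
        }
    }
    return forced;
}
```

With chunks {1,2}, {2,3}, {3,4}, {9} and an initial pool of {1}, the first three chunks are reached one hop per iteration, and the fourth is never forced.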


cubic-dev-ai · 2026-06-11T18:20:25Z

+                                        const std::string ptype = part.value("type", "");
+                                        if (ptype == "text" || ptype == "input_text" ||
+                                            ptype == "output_text")
+                                            msg_content += part.value("text", "");


P1: FlowKV message-part extraction reads unvalidated JSON as strings; non-string text values can throw uncaught exceptions in the worker loop.


Suggested change

- msg_content += part.value("text", "");
+ if (part.contains("text") && part["text"].is_string()) msg_content += part["text"].get<std::string>();

cubic-dev-ai · 2026-06-11T18:20:25Z

+    {
+        size_t total_vram = 0;
+        int dev = 0;
+        cudaGetDevice(&dev);


P2: Skip-park safety guard reads VRAM from current CUDA device instead of the configured placement GPU, making guard decisions incorrect on multi-GPU setups.


cubic-dev-ai · 2026-06-11T18:20:25Z

+    int ngram = 4;
+    int rare_token_max_freq = 8;        // tokens appearing <= this many times in body count as rare
+    int cascade_min_anchor_count = 0;   // skip cascade if pass-1 forced >= this many chunks (0 = always cascade)
+    int max_forced_count = INT_MAX;     // hard cap on total forced chunks


P2: max_forced_count hard cap is checked only inside the cascade loop, but not after pass-1. If pass-1 alone already pushes forced chunks above max_forced_count, the cap is never enforced — the result can exceed the limit.
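A cap that must hold for the final result is easiest to guarantee by applying it once more after pass-1, independent of the cascade loop. The sketch below assumes a hypothetical helper, `enforce_forced_cap`, and an arbitrary drop policy (keep the lowest-index forced chunks); which chunks the real code should drop is not stated in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Clears forced flags beyond the cap and returns how many remain forced.
// Policy (assumption): chunks are kept in index order until the cap is hit.
inline int enforce_forced_cap(std::vector<std::uint8_t> & forced,
                              int max_forced_count) {
    int kept = 0;
    for (auto & flag : forced) {
        if (!flag) continue;
        if (kept < max_forced_count) {
            ++kept;
        } else {
            flag = 0;  // over the cap: un-force this chunk
        }
    }
    return kept;
}
```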


…ze in post-compress gate
Two confirmed PR-review findings:
- request-level prefix_cache.scope override replaced the whole policy, silently dropping the server-level compress flag (FlowKV disabled for any client sending an explicit scope)
- post-compress context gate used the raw prompt size on pflash full-cache hits, falsely 400ing oversized repeats served from cached compressed state
Both extracted to pure helpers (apply_request_scope_override, effective_prompt_overflows) with failing-test-first coverage.
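The first finding is a whole-struct replacement where a field-level override was intended. A minimal sketch, assuming a simplified policy struct: `DiskPrefixCachePolicy::compress` and the helper name `apply_request_scope_override` appear in this PR, while the `scope` field type, its default value, and the helper's signature are assumptions made for illustration.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Simplified stand-in for the real policy type.
struct DiskPrefixCachePolicy {
    std::string scope    = "default";  // hypothetical representation of the scope
    bool        compress = false;      // server-level FlowKV switch
};

// Overrides only the scope; every other server-level field (notably
// `compress`) is carried over from the server policy unchanged.
inline DiskPrefixCachePolicy apply_request_scope_override(
        DiskPrefixCachePolicy server_policy,
        const std::optional<std::string> & request_scope) {
    if (request_scope) server_policy.scope = *request_scope;
    return server_policy;
}
```

Taking the server policy by value and mutating the copy means a newly added field is preserved by default rather than silently reset.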

dusterbloom · 2026-06-11T19:35:45Z

Review disposition (all 8 cubic findings verified against code + tests before fixing):

finding	verdict	action
http_server.cpp:2175 raw size on full-cache hits	confirmed	fixed in `e542e90` (`effective_prompt_overflows` helper, failing-test-first)
disk_prefix_cache.h:50 scope override drops compress	confirmed	fixed in `e542e90` (`apply_request_scope_override`, failing-test-first)
http_server.cpp:1904 JSON throw	refuted	`json::value(key, default)` is type-safe (returns default on type mismatch, no throw); the `.get<std::string>()` at :1897 is guarded by `is_string()`
anchor_scan.cpp:27 / :103 / anchor_scan.h:18	confirmed in the utility library	not reachable from production (the shipping drafter uses its own inline scan; these functions are exercised by tests only). Fixes queued as a follow-up batch rather than blocking this PR
qwen35_backend.cpp:541 current-device VRAM query	partial	latent multi-GPU-only issue; single-GPU is the only shipped config. 1-line fix queued with the follow-up batch
freeze_history.h:10 unused include	confirmed	queued with follow-up batch

Suite after fixes: 1939 assertions green; admission standalone 12/12.

cubic-dev-ai

1 issue found across 6 files (changes from recent commits).


0 conflated 'no hit' with a zero-length hit; sentinel is now -1 and the gate treats any >=0 value as served-from-cache.

dusterbloom · 2026-06-11T20:16:01Z

cubic P2 (http_server.cpp:1798 sentinel conflation): a zero-length full-cache entry is not constructible today (admission floor + kept anchors guarantee non-empty compressed prompts), but the conflation class is now removed outright — sentinel is -1, gate treats >=0 as a hit. Failing-test-first (test_effective_overflows_zero_length_hit_is_a_hit), suite 13/13 + 1939 asserts green. Fixed in 26e0ee3.
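The sentinel change can be sketched as follows. The helper name `effective_prompt_overflows` comes from this PR; the signature, the constant `kNoFullCacheHit`, and the exact overflow formula are assumptions made for illustration.

```cpp
#include <cassert>

// -1 means "no full-cache hit"; any value >= 0, including 0, is a hit.
constexpr int kNoFullCacheHit = -1;

// Assumed signature. On a full-cache hit the size that counts against the
// context limit is the served (compressed) size, not the raw prompt size.
inline bool effective_prompt_overflows(int raw_prompt_tokens,
                                       int full_cache_served_tokens,
                                       int max_tokens, int max_ctx) {
    const bool hit = full_cache_served_tokens >= 0;
    const long long effective = hit ? full_cache_served_tokens : raw_prompt_tokens;
    return effective + max_tokens > max_ctx;
}
```

With 0 as the sentinel, a zero-length hit would have fallen back to the raw prompt size; with -1 the two cases stay distinct.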

…ter residency fix Keep the current stack's qwen3 helper/test implementations where the PR overlapped, while taking the PR's server-side admission, skip-park, HTTP/server wiring, and test additions.

Record the PR Luce-Org#372 integration, current head, and updated open-PR accounting.

dusterbloom added 8 commits June 11, 2026 17:24

chore: trim comment blocks across branch additions to one-liners

637fbda


cubic-dev-ai Bot reviewed Jun 11, 2026


dusterbloom added 2 commits June 11, 2026 22:05

ci: retrigger after NVIDIA apt mirror sync flake

de774d2

fix(review): -1 sentinel for full-cache served tokens

26e0ee3


easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 12, 2026

docs: refresh auto-integration manifest after PR Luce-Org#372 merge

929fd93


dusterbloom wants to merge 11 commits into Luce-Org:main from dusterbloom:split/11-flowkv-compose
