
compose: FlowKV aged-history compression + drafter residency fix — 1.72x vs disk-cache baseline at <=64K#372

Open
dusterbloom wants to merge 11 commits into
Luce-Org:main from
dusterbloom:split/11-flowkv-compose

@dusterbloom dusterbloom commented Jun 11, 2026

Collaborator

TL;DR

On current main, which includes the PR #364 scoped disk prefix cache, enabling FlowKV aged-history compression on top of the disk cache now delivers the following on an RTX 3090 24GB:

| | main (disk cache alone) | this PR (compose) | delta |
|---|---|---|---|
| 7-turn agentic session wall (N=3 mean) | 527.5s | 306.7s | 1.72x |
| worst-turn fresh prefill (26K tok) | 370 tok/s (73-77s) | 396 tok/s (66-73s) | parity+ |
| decode @63k context | 8.1-9.1 tok/s | 10.6-14.5 tok/s | +16-30% |
| tool-call validity | 16/21 | 18/21 | held |

Benchmark: goldgate_fix trace (real multi-turn agentic session, 34K-64K prompt tokens per turn), N=3 interleaved A/B on the same binary, same thermal window.

Summary

PR #364 made warm agentic turns cheap by restoring a stable token prefix from disk. The remaining cost on long sessions is the aged conversation history that still has to be prefilled fresh whenever the prefix diverges, and the per-turn growth beyond the cached boundary. This PR composes FlowKV aged-history compression with that cache: messages older than a hot window are compressed (drafter-scored, anchor-preserving) while the system prompt stays verbatim as the cache anchor, so the disk-cache key remains stable across turns. A unified gate keeps the three paths exclusive — turn-1 verbatim, FlowKV on continuations, whole-prompt pFlash otherwise — and with compression disabled the request path is byte-identical to main.
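
The exclusivity of the three paths can be pictured as one routing decision per request. This is only a sketch of the sentence above: `Route`, `GateInput`, and `select_route` are hypothetical names, not identifiers from http_server.cpp, and the real gate also weighs thresholds and cache state that are left out here.

```cpp
#include <cassert>

// Hypothetical names for illustration; not taken from http_server.cpp.
enum class Route { Verbatim, FlowKV, WholePromptPFlash };

struct GateInput {
    bool compress_enabled;  // server-level compression switch
    int  turn_index;        // 1-based turn number within the session
    bool is_continuation;   // prompt extends a previously seen conversation
};

// Exactly one route is returned per request, so the paths cannot overlap.
inline Route select_route(const GateInput & in) {
    assert(in.turn_index >= 1);
    if (!in.compress_enabled) return Route::Verbatim;  // same path as main
    if (in.turn_index == 1)   return Route::Verbatim;  // turn 1 stays verbatim
    if (in.is_continuation)   return Route::FlowKV;    // compress aged history
    return Route::WholePromptPFlash;                   // everything else
}
```

Because the function returns early on `!compress_enabled`, the compression-off case never touches the other branches, which is the property the byte-identical claim relies on.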

Two fixes found during benchmarking turned the compose from a wash into the 1.72x above:

  • Drafter residency. When the pflash scoring drafter (~2GB BF16) stays resident through the target's large prefill, prefill throughput collapses from 370 to 121 tok/s on 24GB cards. The cause is allocator pressure at ~61K KV rather than capacity: in an A/B test, the same prefill runs at full rate with the drafter released. The drafter is now released after its scoring pass and lazily reloaded (~2s). This is the auto residency default; --draft-residency persistent keeps the old behavior for >=32GB cards.
  • Admission ordering. The ingress context-length check rejected prompt+max_tokens > max_ctx before compression could run. Oversized requests are now admitted when compression will run, and the hard limit is enforced on the post-compress effective size instead.
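
The admission reordering amounts to two checks instead of one. A minimal sketch, assuming token counts are plain integers: `should_reject_oversized` is the helper name used in this PR, but its parameter list here is a guess, and `exceeds_ctx_after_compress` is a hypothetical name for the post-compress gate.

```cpp
#include <cassert>

// Ingress check: reject up front only when no compression pass will run.
// The function name comes from the PR; the parameter list is an assumption.
inline bool should_reject_oversized(int prompt_tokens, int max_tokens,
                                    int max_ctx, bool compression_will_run) {
    assert(prompt_tokens >= 0 && max_tokens >= 0 && max_ctx > 0);
    const bool oversized = prompt_tokens + max_tokens > max_ctx;
    return oversized && !compression_will_run;
}

// Hypothetical post-compress gate: the hard limit applies to the effective
// (post-compression) prompt size rather than the raw one.
inline bool exceeds_ctx_after_compress(int effective_prompt_tokens,
                                       int max_tokens, int max_ctx) {
    return effective_prompt_tokens + max_tokens > max_ctx;
}
```

With max_ctx=65536, a 70000-token prompt passes ingress when compression will run, and is rejected later only if its compressed size plus max_tokens still exceeds the limit.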

Changes

  • FlowKV aged-history compression composed with the scoped disk cache from #364 (feat(server): add scoped disk prefix cache policy) behind a unified gate (http_server.cpp); with compression off, main's behavior stays byte-identical.
  • auto draft residency releases the pflash drafter after compress scoring (placement/draft_residency.h).
  • Pure admission helper should_reject_oversized() + post-compress effective-size gate (server/admission.h).
  • Skip-park guard: --prefill-skip-park downgraded on <32GB GPUs at max_ctx>65536 (VMM VA-fragmentation crash class) (placement/skip_park_guard.h).
  • ee7 early-exit drafter, anchor-transitive cascade with expansion throttles, tail-capture guard for the chunk-boundary assert.
  • Tests: 1926-assertion unit suite green; standalone suites for admission (7), skip-park guard (6), anchor cascade, early-exit score range, warm-path regression. ~55% of the diff is tests.

Limitations

  • Cold >64K prompts at max_ctx=131072 still fail during verbatim-turn prefill (VMM pool growth with target+decode-draft resident; root-caused, follow-up scoped: decode-draft release during large prefills / lazy-KV).
  • Compression keeps ~93% on dense code (anchor-dominated) — known, separate lever.
  • GGML_CUDA_NO_VMM=1 as an environment variable is a no-op (compile-time option in this fork); scripts relying on it were never protected.

History

  1. 731561d1 compose FlowKV with the #364 scoped cache; 0efdc33c gate compression as fallback so compose can't regress main; 6a848058 unified gate (FlowKV reachable + scoped save preserved).
  2. cefa3caf ee7 early-exit drafter + anchor-transitive cascade + tail-capture guard.
  3. 3fc6882f drafter auto-release after compress scoring (the 1.72x).
  4. 2ae98c0f compress-aware admission.
  5. 637fbdaf comment trim (-133 LOC, no logic changes).
  6. 1c562eb4 skip-park footprint guard.

…rg#364 scoped cache

- Port 354e7b6 message-count freeze (aged[1..n-hot) compressed once, cached)
- Remove mutual-exclusion: FlowKV active → disk clamps to system_end (verbatim system anchor, stable cross-session key); Luce-Org#364 unchanged when compress=false
- WS1: non-continuation turns skip compression (cold-poison fix preserved)
- Inert-guard: aged band < 512 tokens → FlowKV-OFF
- Config: DiskPrefixCachePolicy::compress + --disk-prefix-cache-compress CLI
- Tests T1-T7: 1908 assertions, 0 failures
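
The message-count freeze and the inert guard in this commit message can be sketched as a partition of the conversation. `Bands` and `split_history` are hypothetical names, and per-message token counts are assumed to be known up front; the 512-token floor is the one quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: split a conversation into the verbatim system anchor
// (message 0), the aged band [1, n - hot), and the hot window [n - hot, n).
struct Bands {
    std::size_t aged_begin;  // first aged message (always 1)
    std::size_t aged_end;    // one past the last aged message
    bool flowkv_active;      // false when the inert guard trips
};

inline Bands split_history(const std::vector<int> & msg_tokens,
                           std::size_t hot, int min_aged_tokens = 512) {
    assert(!msg_tokens.empty());
    const std::size_t n = msg_tokens.size();
    // The aged band never starts before message 1 (the system prompt is 0).
    const std::size_t aged_end = (n > hot) ? n - hot : 1;
    int aged_tokens = 0;
    for (std::size_t i = 1; i < aged_end; ++i) aged_tokens += msg_tokens[i];
    // Inert guard from the commit message: a band under 512 tokens
    // turns FlowKV off.
    return Bands{1, aged_end, aged_tokens >= min_aged_tokens};
}
```

Message 0 is never part of the aged band, which matches the requirement that the system prompt stays verbatim as the cache anchor.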
… vs Luce-Org#364

FlowKV ran whenever disk_cache_policy.compress was set, with no size gate, so
every multi-turn agentic turn paid the full pFlash drafter-forward (~400s/session
at 59K) and re-expanded the prompt — making COMPOSE ~1.9x slower than the plain
Luce-Org#364 scoped disk cache it should improve on.

- Gate FlowKV on the original prompt size (same threshold as the pFlash gate),
  and skip it once pFlash has already compressed.
- Below threshold COMPOSE is byte-identical to Luce-Org#364 (full prefix-cache hits, no
  drafter tax); compression fires only when the conversation can't fit the KV.
- Keep the scoped-disk-re-prefill skip under compression (avoids turn-2 hang).

Validated on abc_cache_harness COMPOSE arm (auto, threshold=65000): goldgate_fix
total wall 846s -> 480s (~Luce-Org#364's 443s), zero compression on sub-threshold turns.
Activate via --prefill-compression auto --prefill-threshold ~max_ctx.
…g-42 tail-capture guard

ee7 truncates drafter forward at layer 7 of 28, scoring only those layers.
9.3× drafter wall-time speedup at 128K (RTX 3090, Qwen3.6-27B-Q4_K_M target + Qwen2.5-0.5B-BF16 drafter).
Anchor-transitive cascade rescues multi-hop on bimodal-density prompts (gated, default OFF).
Bug Luce-Org#42 fix: tail-capture view-bounds guard at S%4096 in {1..7}.

5 unit tests included. Bench scripts split to follow-up PR.
…g#364 scoped save

47081e67 demoted FlowKV to a downstream else-if after whole-prompt pFlash,
gated on the same threshold — making FlowKV structurally unreachable (any
threshold that let it run made pFlash fire first; PFLASH_FREEZE_HISTORY went
dead). Replace with the unified gate (compute should_compress once; route
continuations to FlowKV-freeze with should_compress=false; whole-prompt pFlash
only for cold non-continuations), mirroring the working flowkv-standalone
structure. Re-enable Luce-Org#364's scoped disk save under compression (drop the
band-aid guard; the disk-clamp already pins the save to the stable system_end
prefix).

Paired A/B, same binary (cb458145), full 7-turn goldgate_fix, single-session:
COMPOSE_FLOWKV 615.9s vs pure-Luce-Org#364 713.7s (1.16x), decode 13.6 vs 6.7 tps,
tool-valid 85.7% vs 71.4%. FlowKV engages on continuations; ee7 keeps the
drafter forward cheap. Turn-4 transition cost (park/unpark + uncached
compressed-prefill) is the remaining lever, not the gate.

Resident drafter (~2GB) starves the target's large prefill on 24GB cards
(370 -> 121 tok/s on the freeze transition turn). Release after scoring,
lazy reload next turn (~2s). N=3 interleaved: 527.5s -> 306.7s (1.72x),
turn-4 prefill 217-269s -> 66-73s, quality held. persistent remains the
big-card opt-out.
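
The release-and-lazy-reload behavior can be modelled as a small state machine. This is a sketch only: `DrafterSlot` and its methods are hypothetical names, and the flag stands in for the actual load and free of the drafter weights.

```cpp
#include <cassert>

// Hypothetical model of the two residency modes named in the PR.
enum class DraftResidency { Auto, Persistent };

class DrafterSlot {
public:
    explicit DrafterSlot(DraftResidency mode) : mode_(mode) {}

    // Lazily (re)load the drafter before a scoring pass.
    void ensure_loaded() {
        if (!resident_) { resident_ = true; ++loads_; }
    }

    // Called once scoring is done, before the target's large prefill.
    void after_scoring() {
        assert(resident_);  // scoring requires a loaded drafter
        if (mode_ == DraftResidency::Auto) resident_ = false;  // free VRAM
    }

    bool resident() const { return resident_; }
    int  loads() const { return loads_; }

private:
    DraftResidency mode_;
    bool resident_ = false;
    int  loads_ = 0;
};
```

In `Auto` mode each turn pays one reload (the ~2s quoted above); in `Persistent` mode the drafter is loaded once and stays resident.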
…them

Ingress gate rejected prompt+max_tokens > max_ctx before compression ran,
making >max_ctx sessions unreachable even when FlowKV/pFlash could shrink
them. Extract pure should_reject_oversized() (admission.h): pass oversized
requests through when compression will run; enforce the hard limit on the
post-compress effective size in worker_loop. Oversized requests now get
compressed first and reject cleanly only if still over budget.

-133 net LOC, comments only; zero logic/string/assertion changes.
All suites re-verified green (1926 asserts + 4 standalone tests).

Dual-resident target+draft fragments VMM virtual address space; at
max_ctx=131072 the compute pool's cuMemSetAccess fails (device not
ready). Safe cell (<=65536, 10+ clean runs) keeps the fast no-park
path; dangerous cell parks. Note: GGML_CUDA_NO_VMM=1 env is compile-
time-only in this fork and never mitigated this.
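
The safe-cell rule can be written as a pure predicate. A sketch under assumptions: `allow_skip_park` and the constants are hypothetical names, and how "32GB" maps to the byte count a driver reports is a guess, not something this PR specifies.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical pure predicate mirroring the rule stated in the PR:
// skip-park is downgraded on <32GB GPUs when max_ctx > 65536.
// The byte threshold for "32GB" is an assumption made for this sketch.
constexpr std::uint64_t kBigCardBytes = 32ull * 1000 * 1000 * 1000;
constexpr int kSafeMaxCtx = 65536;

inline bool allow_skip_park(bool requested, std::uint64_t total_vram_bytes,
                            int max_ctx) {
    assert(max_ctx > 0);
    if (!requested) return false;
    const bool small_card = total_vram_bytes < kBigCardBytes;
    const bool large_ctx  = max_ctx > kSafeMaxCtx;
    return !(small_card && large_ctx);  // dangerous cell falls back to parking
}
```

Keeping the decision free of CUDA calls lets the VRAM total be injected in tests, so the device query stays at the call site.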

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor


8 issues found across 24 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/server/freeze_history.h">

<violation number="1" location="server/src/server/freeze_history.h:10">
P3: Unused include: `<vector>` is not used by any declaration in this header. Remove it to keep dependencies minimal.</violation>
</file>

<file name="server/src/qwen35/qwen35_backend.cpp">

<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:541">
P2: Skip-park safety guard reads VRAM from current CUDA device instead of the configured placement GPU, making guard decisions incorrect on multi-GPU setups.</violation>
</file>

<file name="server/src/server/http_server.cpp">

<violation number="1" location="server/src/server/http_server.cpp:1904">
P1: FlowKV message-part extraction reads unvalidated JSON as strings; non-string `text` values can throw uncaught exceptions in the worker loop.</violation>
</file>

<file name="server/src/qwen3/anchor_scan.cpp">

<violation number="1" location="server/src/qwen3/anchor_scan.cpp:27">
P1: `search_end` clamping to 0 causes one invalid n-gram comparison when `body_end < ngram`, risking out-of-bounds reads and boundary violations.</violation>

<violation number="2" location="server/src/qwen3/anchor_scan.cpp:103">
P1: Transitive cascade loop exits early due to comparing `forced` against an immediately copied snapshot, so subsequent expansion/rescan iterations are skipped.</violation>
</file>

<file name="server/src/qwen3/anchor_scan.h">

<violation number="1" location="server/src/qwen3/anchor_scan.h:18">
P2: max_forced_count hard cap is checked only inside the cascade loop, but not after pass-1. If pass-1 alone already pushes forced chunks above max_forced_count, the cap is never enforced — the result can exceed the limit.</violation>
</file>


{
const int n_chunks = (int)forced.size();
const int ngram = cfg.ngram;
const int search_end = std::max(0, body_end - ngram);


P1: search_end clamping to 0 causes one invalid n-gram comparison when body_end < ngram, risking out-of-bounds reads and boundary violations.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen3/anchor_scan.cpp, line 27:

<comment>`search_end` clamping to 0 causes one invalid n-gram comparison when `body_end < ngram`, risking out-of-bounds reads and boundary violations.</comment>

<file context>
@@ -0,0 +1,164 @@
+{
+    const int n_chunks = (int)forced.size();
+    const int ngram    = cfg.ngram;
+    const int search_end = std::max(0, body_end - ngram);
+
+    for (int qi = 0; qi + ngram <= (int)query_pool.size(); ++qi) {
</file context>
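
One way to rule out the short-body case is to return before the scan instead of clamping the bound. The sketch below is illustrative and is not the code in anchor_scan.cpp: `count_ngram_hits` and its signature are hypothetical, and it counts matches rather than marking chunks.

```cpp
#include <cassert>
#include <vector>

// Illustrative guarded scan; function name and signature are hypothetical.
// Counts start positions p in body[0, body_end) where the n-gram
// query[qi, qi + ngram) occurs in full. When the body is shorter than one
// n-gram there is no valid start position, so the function returns before
// any comparison is made.
inline int count_ngram_hits(const std::vector<int> & body, int body_end,
                            const std::vector<int> & query, int qi, int ngram) {
    assert(ngram > 0 && qi >= 0 && body_end >= 0);
    assert(body_end <= (int)body.size());
    if (body_end < ngram || qi + ngram > (int)query.size()) return 0;
    const int last_start = body_end - ngram;  // inclusive, known to be >= 0
    int hits = 0;
    for (int p = 0; p <= last_start; ++p) {
        bool match = true;
        for (int k = 0; k < ngram && match; ++k)
            match = body[p + k] == query[qi + k];
        hits += match ? 1 : 0;
    }
    return hits;
}
```

With the early return, `last_start` is never produced by clamping, so position 0 is only compared when a full n-gram fits inside the body.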

// Cascade loop: expand pool with tokens from newly-forced chunks and re-scan.
std::vector<uint8_t> prev_forced;
for (int it = 0; it < max_iters; ++it) {
prev_forced = forced;


P1: Transitive cascade loop exits early due to comparing forced against an immediately copied snapshot, so subsequent expansion/rescan iterations are skipped.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen3/anchor_scan.cpp, line 103:

<comment>Transitive cascade loop exits early due to comparing `forced` against an immediately copied snapshot, so subsequent expansion/rescan iterations are skipped.</comment>

<file context>
@@ -0,0 +1,164 @@
+    // Cascade loop: expand pool with tokens from newly-forced chunks and re-scan.
+    std::vector<uint8_t> prev_forced;
+    for (int it = 0; it < max_iters; ++it) {
+        prev_forced = forced;
+
+        // Rare-token worklist: catches multi-hop cascades within a single outer iteration.
</file context>
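
A fixed-point loop of this kind usually takes the snapshot before the expansion step and compares after it. The following is a simplified sketch, not the cascade in anchor_scan.cpp: `run_cascade` and the `links` structure are hypothetical stand-ins for the pool expansion and re-scan.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative fixed-point loop; names and the link structure are hypothetical.
// links[i] lists the chunks that become forced once chunk i is forced.
// Returns the number of iterations that changed the forced set.
inline int run_cascade(std::vector<std::uint8_t> & forced,
                       const std::vector<std::vector<int>> & links,
                       int max_iters) {
    assert(forced.size() == links.size());
    int productive = 0;
    for (int it = 0; it < max_iters; ++it) {
        const std::vector<std::uint8_t> prev = forced;  // snapshot before expanding
        for (std::size_t i = 0; i < prev.size(); ++i) {
            if (!prev[i]) continue;
            for (int j : links[i]) forced[(std::size_t)j] = 1;  // one hop per pass
        }
        if (forced == prev) break;  // compare after expanding: true fixed point
        ++productive;
    }
    return productive;
}
```

On a chain 0 -> 1 -> 2 -> 3 with only chunk 0 forced, the loop runs three productive iterations and stops on the fourth, when the snapshot and the result agree.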

Comment thread server/src/server/http_server.cpp Outdated
const std::string ptype = part.value("type", "");
if (ptype == "text" || ptype == "input_text" ||
    ptype == "output_text")
    msg_content += part.value("text", "");


P1: FlowKV message-part extraction reads unvalidated JSON as strings; non-string text values can throw uncaught exceptions in the worker loop.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/server/http_server.cpp, line 1904:

<comment>FlowKV message-part extraction reads unvalidated JSON as strings; non-string `text` values can throw uncaught exceptions in the worker loop.</comment>

<file context>
@@ -1798,6 +1808,233 @@ void HttpServer::worker_loop() {
+                                        const std::string ptype = part.value("type", "");
+                                        if (ptype == "text" || ptype == "input_text" ||
+                                            ptype == "output_text")
+                                            msg_content += part.value("text", "");
+                                    }
+                                }
</file context>
Suggested change
- msg_content += part.value("text", "");
+ if (part.contains("text") && part["text"].is_string()) msg_content += part["text"].get<std::string>();

{
size_t total_vram = 0;
int dev = 0;
cudaGetDevice(&dev);


P2: Skip-park safety guard reads VRAM from current CUDA device instead of the configured placement GPU, making guard decisions incorrect on multi-GPU setups.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_backend.cpp, line 541:

<comment>Skip-park safety guard reads VRAM from current CUDA device instead of the configured placement GPU, making guard decisions incorrect on multi-GPU setups.</comment>

<file context>
@@ -534,7 +535,22 @@ bool Qwen35Backend::handle_compress(const std::string & line, const DaemonIO & i
+    {
+        size_t total_vram = 0;
+        int dev = 0;
+        cudaGetDevice(&dev);
+        cudaDeviceProp prop{};
+        if (cudaGetDeviceProperties(&prop, dev) == cudaSuccess)
</file context>

Comment thread server/src/server/disk_prefix_cache.h
int ngram = 4;
int rare_token_max_freq = 8; // tokens appearing <= this many times in body count as rare
int cascade_min_anchor_count = 0; // skip cascade if pass-1 forced >= this many chunks (0 = always cascade)
int max_forced_count = INT_MAX; // hard cap on total forced chunks


P2: max_forced_count hard cap is checked only inside the cascade loop, but not after pass-1. If pass-1 alone already pushes forced chunks above max_forced_count, the cap is never enforced — the result can exceed the limit.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen3/anchor_scan.h, line 18:

<comment>max_forced_count hard cap is checked only inside the cascade loop, but not after pass-1. If pass-1 alone already pushes forced chunks above max_forced_count, the cap is never enforced — the result can exceed the limit.</comment>

<file context>
@@ -0,0 +1,42 @@
+    int ngram = 4;
+    int rare_token_max_freq = 8;        // tokens appearing <= this many times in body count as rare
+    int cascade_min_anchor_count = 0;   // skip cascade if pass-1 forced >= this many chunks (0 = always cascade)
+    int max_forced_count = INT_MAX;     // hard cap on total forced chunks
+};
+
</file context>
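
If the cap is meant to hold for the final result, it would need to be applied once more after pass-1. The sketch below shows one possible shape for that step. `enforce_forced_cap` is a hypothetical name, and dropping the highest-index chunks first is an arbitrary policy chosen for illustration; the review does not say which chunks should be released.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the number of forced chunks after enforcing the cap.
// Assumption: excess chunks are released from the highest index down;
// the PR does not state which chunks the real code would drop.
inline int enforce_forced_cap(std::vector<std::uint8_t> & forced,
                              int max_forced_count) {
    assert(max_forced_count >= 0);
    int count = 0;
    for (std::uint8_t f : forced) count += f ? 1 : 0;
    for (std::size_t i = forced.size(); i-- > 0 && count > max_forced_count; ) {
        if (forced[i]) { forced[i] = 0; --count; }
    }
    return count;
}
```

Calling the same helper after pass-1 and after each cascade iteration would keep the limit a property of the result rather than of one loop.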

Comment thread server/src/server/freeze_history.h
…ze in post-compress gate

Two confirmed PR-review findings:
- request-level prefix_cache.scope override replaced the whole policy,
  silently dropping the server-level compress flag (FlowKV disabled for
  any client sending an explicit scope)
- post-compress context gate used the raw prompt size on pflash
  full-cache hits, falsely 400ing oversized repeats served from cached
  compressed state

Both extracted to pure helpers (apply_request_scope_override,
effective_prompt_overflows) with failing-test-first coverage.
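
The first finding comes down to overriding one field instead of the whole policy. A minimal sketch: `apply_request_scope_override` and `DiskPrefixCachePolicy::compress` are names from this PR, while the `scope` field, its string type, and the pointer-based signature are assumptions made for illustration.

```cpp
#include <cassert>
#include <string>

// Field set is an assumption; only `compress` is named in the PR.
struct DiskPrefixCachePolicy {
    std::string scope = "default";  // hypothetical scope representation
    bool compress = false;          // server-level FlowKV switch
};

// Apply a per-request scope without discarding server-level settings.
// A null pointer means the request carried no explicit scope.
inline DiskPrefixCachePolicy apply_request_scope_override(
        DiskPrefixCachePolicy server, const std::string * request_scope) {
    // Assumed precondition for this sketch: an explicit scope is non-empty.
    assert(request_scope == nullptr || !request_scope->empty());
    if (request_scope != nullptr) server.scope = *request_scope;  // scope only
    return server;  // compress and any other server fields pass through
}
```

Taking the server policy by value and returning the modified copy keeps the helper pure, which is what makes a failing-test-first regression test cheap to write.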
@dusterbloom

Collaborator Author

Review disposition (all 8 cubic findings verified against code + tests before fixing):

| finding | verdict | action |
|---|---|---|
| http_server.cpp:2175 raw size on full-cache hits | confirmed | fixed in e542e90 (effective_prompt_overflows helper, failing-test-first) |
| disk_prefix_cache.h:50 scope override drops compress | confirmed | fixed in e542e90 (apply_request_scope_override, failing-test-first) |
| http_server.cpp:1904 JSON throw | refuted | json::value(key, default) is type-safe (returns default on type mismatch, no throw); the .get<std::string>() at :1897 is guarded by is_string() |
| anchor_scan.cpp:27 / :103 / anchor_scan.h:18 | confirmed | in the utility library, not reachable from production (the shipping drafter uses its own inline scan; these functions are exercised by tests only). Fixes queued as a follow-up batch rather than blocking this PR |
| qwen35_backend.cpp:541 current-device VRAM query | partial | latent multi-GPU-only issue; single-GPU is the only shipped config. 1-line fix queued with the follow-up batch |
| freeze_history.h:10 unused include | confirmed | queued with follow-up batch |

Suite after fixes: 1939 assertions green; admission standalone 12/12.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor


1 issue found across 6 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/qwen35/qwen35_backend.cpp">

<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:541">
P2: Skip-park safety guard reads VRAM from current CUDA device instead of the configured placement GPU, making guard decisions incorrect on multi-GPU setups.</violation>
</file>

<file name="server/src/server/http_server.cpp">

<violation number="1" location="server/src/server/http_server.cpp:1904">
P1: FlowKV message-part extraction reads unvalidated JSON as strings; non-string `text` values can throw uncaught exceptions in the worker loop.</violation>
</file>

<file name="server/src/qwen3/anchor_scan.cpp">

<violation number="1" location="server/src/qwen3/anchor_scan.cpp:27">
P1: `search_end` clamping to 0 causes one invalid n-gram comparison when `body_end < ngram`, risking out-of-bounds reads and boundary violations.</violation>

<violation number="2" location="server/src/qwen3/anchor_scan.cpp:103">
P1: Transitive cascade loop exits early due to comparing `forced` against an immediately copied snapshot, so subsequent expansion/rescan iterations are skipped.</violation>
</file>

<file name="server/src/qwen3/anchor_scan.h">

<violation number="1" location="server/src/qwen3/anchor_scan.h:18">
P2: max_forced_count hard cap is checked only inside the cascade loop, but not after pass-1. If pass-1 alone already pushes forced chunks above max_forced_count, the cap is never enforced — the result can exceed the limit.</violation>
</file>


Comment thread server/src/server/http_server.cpp Outdated
0 conflated 'no hit' with a zero-length hit; sentinel is now -1 and the
gate treats any >=0 value as served-from-cache.
@dusterbloom

Collaborator Author

cubic P2 (http_server.cpp:1798 sentinel conflation): a zero-length full-cache entry is not constructible today (admission floor + kept anchors guarantee non-empty compressed prompts), but the conflation class is now removed outright — sentinel is -1, gate treats >=0 as a hit. Failing-test-first (test_effective_overflows_zero_length_hit_is_a_hit), suite 13/13 + 1939 asserts green. Fixed in 26e0ee3.
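
The sentinel change can be sketched as follows. `effective_prompt_overflows` is the helper name from this PR; the signature, the `kNoCacheHit` constant, and the parameter names are assumptions.

```cpp
#include <cassert>

// Sentinel for "no full-cache hit"; any value >= 0 is a hit, including 0.
constexpr int kNoCacheHit = -1;

// Function name from the PR; the signature is an assumption.
// prompt_tokens is the size measured on this request. On a full-cache hit
// the gate uses the cached compressed size instead.
inline bool effective_prompt_overflows(int prompt_tokens,
                                       int cached_compressed_tokens,
                                       int max_tokens, int max_ctx) {
    assert(cached_compressed_tokens >= kNoCacheHit);
    const bool hit = cached_compressed_tokens >= 0;
    const int effective = hit ? cached_compressed_tokens : prompt_tokens;
    return effective + max_tokens > max_ctx;
}
```

With -1 as the only miss value, a zero-length cached entry is treated as a hit, so an oversized repeat served from cache is no longer rejected on its raw size.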

easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 12, 2026
…ter residency fix

Keep the current stack's qwen3 helper/test implementations where the PR overlapped, while taking the PR's server-side admission, skip-park, HTTP/server wiring, and test additions.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 12, 2026
Record the PR Luce-Org#372 integration, current head, and updated open-PR accounting.