fix(kv-cache): guard never-hit entries from immediate self-eviction#159
Closed
audreyt wants to merge 1 commit into
Closed
fix(kv-cache): guard never-hit entries from immediate self-eviction#159audreyt wants to merge 1 commit into
audreyt wants to merge 1 commit into
Conversation
A fresh cold-save written at the start of a new conversation is, by
construction, smaller and less token-dense per byte than a mature
multi-thousand-token snapshot from a long-running session. The existing
`(hits + 1) * tokens / file_size` eviction score correctly captures the
long-run "value per byte stored" ranking, but it has one pathology
under budget pressure: on a saturated cache the new cold-save is the
lowest-scoring entry the moment it is written, so the eviction loop
reaps it on the same wall-clock second before its session's next turn
can issue its first lookup. The cold-save infrastructure (cold-prefix
boundary alignment, BPE boundary trim, SHA-keyed text lookup on
subsequent requests) is then effectively dead under any real workload
that uses the default 4 GB budget.
Observed before this change on a multi-turn workload at the default
budget pressure (2500-token initial prompt + 3 small turns):
turn 1 cold: 7.1s turn 2: 6.5s turn 3: 6.5s
Log on every turn:
kv cache stored tokens=2048 trimmed=… reason=cold size=49.78 MiB
kv cache evicted reason=disk-cache-full tokens=2048 hits=0 size=49.78 MiB
i.e. each cold-save is created and reaped in the same second. Turn 2
falls back to a full cold prefill.
This commit adds a narrow guard. A never-hit entry whose last_used is
within KV_EVICTION_FRESH_GUARD_SECONDS (60s) gets a +INF eviction
score, so the eviction loop cannot pick it until either:
1. it ages past the guard window without ever being hit, in which
case it returns to normal scoring and is freely evictable; or
2. it is hit at least once, in which case the existing (hits + 1)
factor already protects it.
The eviction loop additionally short-circuits when victim_score ==
DBL_MAX: if every entry in the cache is guard-protected, no entry is
evicted and the cache overruns budget transiently until the guard
window expires. That excursion is bounded by the guard duration and
made explicit rather than silently looping forever or evicting a
guarded entry.
After this change on the same workload at the same 8 GB budget that
was already saturated by old snapshots:
turn 1 cold: 7.1s turn 2: 2.1s turn 3: 2.0s (3.2x speedup)
Log on every turn:
kv cache hit text tokens=2048 text=… load=5-6 ms file=…
Turns 2+ now load the just-written cold-save's 2048-token prefix from
disk in ~6 ms and only prefill the new tokens beyond it, instead of
re-running the full prefill from token zero.
The new test test_kv_cache_eviction_guards_just_written_never_hit_entry
reproduces the failure mode: an old high-density entry and a fresh
low-density entry compete for a one-entry budget. Under the old score
the fresh entry would be evicted; under the guard the fresh entry
survives and the old entry is evicted, matching the empirical fix.
The two existing eviction tests (values_fresh_snapshots,
keeps_aligned_continued_frontiers) continue to pass: they use ancient
last_used timestamps (100, 200, 300) so the guard window never
applies and the score reverts to the original formula.
Honest scope statement: this fixes single-session multi-turn under
budget pressure where fresh cold-saves would otherwise self-evict
before their first lookup. Concurrent-session pathologies (many
guard-protected entries piling up at once, briefly exceeding budget
until their windows expire) are bounded by the guard duration and
accepted as a transient over-budget excursion, not an unbounded leak.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
audreyt
added a commit
to audreyt/ds4
that referenced
this pull request
May 15, 2026
PR against antirez/ds4: antirez#159 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
audreyt
added a commit
to audreyt/pi-ds4
that referenced
this pull request
May 15, 2026
ensureSupportCheckout previously cloned audreyt/ds4 main with --depth 1
on first install and never touched the checkout again. ensureBuilt
likewise built ds4-server once and never rebuilt. That meant once a user
had a working pi-ds4 install, audreyt/ds4 commits we shipped afterward
(e.g. the KV-cache fresh-cold-save guard at 484f37a) could not reach
them without a manual `rm -rf ~/.pi/ds4/support`.
Add SUPPORT_PIN — the exact audreyt/ds4 commit each pi-ds4 release was
tested against. On every launch, syncToSupportPin:
- reads HEAD with `git rev-parse`,
- if it doesn't match SUPPORT_PIN, fetches the pin and hard-resets to
it,
- deletes the cached ds4-server binary so ensureBuilt rebuilds against
the new source.
Three guard rails so this can't surprise users:
1. Symlinked SUPPORT_DIR (dev workflow pointing at a working tree like
/Users/au/w/ds4) is detected via lstat and skipped entirely.
2. A working tree with uncommitted changes (git status --porcelain
non-empty) refuses the reset and surfaces a clear error pointing at
DS4_SUPPORT_PIN= as the disable escape hatch.
3. DS4_SUPPORT_REPO or DS4_SUPPORT_BRANCH set in env disables the pin
— the user has explicitly opted into a different upstream.
Bump cadence: every pi-ds4 release that wants existing users to pick up
a new ds4-server fix bumps SUPPORT_PIN to the audreyt/ds4 commit being
shipped. The pin lives next to SUPPORT_REPO/BRANCH so the relationship
is visible at the top of the file.
This commit ships SUPPORT_PIN=484f37a66dd4b13dc2c968674eb94821109a7c4f,
the audreyt/ds4 main that includes the KV-cache fresh-cold-save guard
(antirez/ds4#159, merged locally as 484f37a). Existing pi-ds4 installs
will pay one git fetch + make ds4-server on their next launch and then
their multi-turn turn-2 latency drops from ~6.5s to ~2s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Owner
|
Already fixed when I saw the issue. Sorry! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fresh cold-saves at the start of a new conversation are smaller and less
token-dense per byte than mature multi-thousand-token snapshots. The
existing
(hits + 1) * tokens / file_sizeeviction score correctlycaptures the long-run "value per byte stored" ranking but has one
pathology under budget pressure: on a saturated cache the new cold-save
is the lowest-scoring entry the moment it is written, so the eviction
loop reaps it on the same wall-clock second — before its session's next
turn can issue its first lookup. The cold-save infrastructure
(cold-prefix boundary alignment, BPE boundary trim via
--kv-cache-boundary-trim-tokens, SHA-keyed text lookup on subsequentrequests) is then effectively dead under any real workload that uses
the default 4 GB budget.
This PR adds a narrow guard: a never-hit entry whose
last_usediswithin
KV_EVICTION_FRESH_GUARD_SECONDS(60s) gets a+INFevictionscore, so the eviction loop cannot pick it until either it ages out
without being hit, or it is hit at least once (after which the existing
(hits + 1)factor protects it).Reproduction
Three-turn chat workload, ~2500-token initial prompt, default 8 GB
--kv-disk-space-mbsaturated by old snapshots:Server log before, every turn:
i.e. each cold-save is created and reaped in the same second.
Server log after:
Turn 2+ now load the just-written 2048-token cold prefix from disk in
~6 ms and only prefill the tokens beyond it.
Why not a different fix
Three earlier candidates were considered and rejected on this code path:
(hits + K) * tokens / file_sizewith K=4: doesn't help. A2048-token / 50 MB cold-save has ~39 t/B; a mature 90000-token / 1.3 GB
snapshot has ~71 t/B. The size-density gap is intrinsic (fixed-cost
header overhead amortizes better at larger sizes), and no bias on the
(hits + …)factor flips the ranking.Capping
hitsat a small constant: same problem. Withhitscappedat 4 and bias 4, the cold-save scores 156 while the mature entry scores
285. The dominant signal is
tokens / file_size, nothits.Time-decayed score everywhere: introduces
exp()calls per entryper eviction pass and a tuning constant for the decay timescale.
Strictly more code than the narrow guard, and changes the long-run
ranking among aged entries, which is the part the current formula gets
right.
Eviction loop change
When every entry is guard-protected (i.e. all
victim_score == DBL_MAX),the loop short-circuits with a
breakrather than evicting a guardedentry. The cache then overruns budget transiently until guard windows
expire. That excursion is bounded by
KV_EVICTION_FRESH_GUARD_SECONDSand made explicit, rather than silently looping forever or evicting
through the guard.
Tests
test_kv_cache_eviction_guards_just_written_never_hit_entry—reproduces the failure mode. An old high-density entry and a fresh
low-density entry compete for a one-entry budget. Under the old score
the fresh entry would be evicted; under the guard the fresh entry
survives and the old entry is evicted.
test_kv_cache_eviction_values_fresh_snapshotsandtest_kv_cache_eviction_keeps_aligned_continued_frontierscontinue topass — they use ancient
last_usedvalues (100, 200, 300) so theguard window never applies and scoring reverts to the original formula.
Test plan
make ds4_test && ./ds4_testpasses locally (the three evictiontests in particular).
cache shows turn-2 disk-text hits in the server log instead of
kv cache evicted reason=disk-cache-fullof the just-writtencold-save.
--kv-disk-space-mb 4096workloads benefit too (thepathology is worst at smaller budgets).
🤖 Generated with Claude Code