When the disk KV cache is at its budget, a single chat request writes a 291 MiB snapshot to disk and unlinks it on the very next operation, twice in a row. Across one request the same SHA is evicted three times and stored twice:
23:33:08 kv cache evicted reason=disk-cache-full tokens=20480 hits=0 size=291.80 MiB file=.../7400e69e....kv
23:33:08 chat ctx=0..23029:23029 TOOLS prompt start
...
23:34:06 chunk 20480/23029 (88.9%)
23:34:06 kv cache stored tokens=20480 trimmed=0 reason=continued size=291.80 MiB save=42.6 ms
23:34:06 kv cache evicted reason=disk-cache-full tokens=20480 hits=0 size=291.80 MiB file=.../7400e69e....kv
23:34:06 kv cache stored tokens=20480 trimmed=2549 reason=cold size=291.80 MiB save=55.2 ms
23:34:06 kv cache evicted reason=disk-cache-full tokens=20480 hits=0 size=291.80 MiB file=.../7400e69e....kv
The 7400e69e file is stored, evicted, stored again, evicted again, all within the same second. Net useful work: zero. Wasted I/O: ~98 ms of save time (42.6 ms + 55.2 ms) and ~584 MiB written (2 × 291.80 MiB).
Why it happens
Two independent issues compound. Either alone is mild; together they thrash.
1. Two stores at the same prefix length. During the prefill at ds4_server.c:10354-10372 the cold path syncs to cold_store_len. That sync triggers the continued-interval callback at ds4_server.c:9892, which stores the snapshot once with reason continued; ds4_server.c:10368 then stores the same prefix again with reason cold. Same tokens, same rendered text, same SHA, two writes. This fires whenever cold_store_len lands on a multiple of the continued step (default 10240 after alignment), which is most prompts above ~10 k tokens.
2. Eviction picks the file that was just written. Every successful store calls kv_cache_evict at ds4_server.c:9010. The score at ds4_server.c:8627-8639 is (hits + 1) * tokens / file_size. For hits == 0 entries this reduces to roughly 1 / bytes_per_token, so a just-written entry has no advantage over older hits == 0 entries. The last_used tiebreaker only fires on exact float equality. In practice the just-written file is the largest low-score entry, so evicting it alone satisfies the budget in one iteration, and the loop picks it.
The comment at ds4_server.c:8636 says hits + 1 exists so a fresh checkpoint is not deleted because its hit counter is 0. The + 1 only defends against multiply-by-zero; it does not rank a fresh entry above other fresh entries.
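To make the scoring argument concrete, here is a minimal sketch of the victim selection as I understand it. The struct, its field names, and the function names are hypothetical and not copied from ds4_server.c; only the formula and the exact-equality tiebreak come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one on-disk cache entry.
 * Field names are illustrative, not taken from ds4_server.c. */
typedef struct {
    uint64_t hits;       /* how often the snapshot was reused */
    uint64_t tokens;     /* prefix length covered by the snapshot */
    uint64_t file_size;  /* bytes on disk */
    uint64_t last_used;  /* larger means more recently used */
} kv_entry_sketch;

/* Score as described in the report: (hits + 1) * tokens / file_size. */
static double kv_score_sketch(const kv_entry_sketch *e) {
    assert(e->file_size > 0);
    return (double)(e->hits + 1) * (double)e->tokens / (double)e->file_size;
}

/* Return the index of the lowest-scoring entry.
 * last_used only breaks ties when the two scores are exactly equal. */
static size_t kv_pick_victim_sketch(const kv_entry_sketch *e, size_t n) {
    assert(n > 0);
    size_t victim = 0;
    for (size_t i = 1; i < n; i++) {
        double si = kv_score_sketch(&e[i]);
        double sv = kv_score_sketch(&e[victim]);
        if (si < sv || (si == sv && e[i].last_used < e[victim].last_used))
            victim = i;
    }
    return victim;
}
```

With two hits == 0 entries, the one with more bytes per token scores lower and is chosen, even if it has the newest last_used. Recency never enters the comparison unless the two scores are bit-for-bit equal.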
How to reproduce
- Fill the kv-cache directory close to budget (a handful of chats, or set --kv-cache-budget-mb low for testing).
- Send any chat whose prompt token count is a multiple of the continued step (10240 with defaults), or falls just above one.
- The stored ... reason=continued / evicted ... <same hash> / stored ... reason=cold / evicted ... <same hash> pattern appears in the log.
The command line I was using:
./ds4-server --ctx 512000 --kv-disk-dir ./kv-cache --kv-cache-boundary-align-tokens 2048 --kv-cache-min-tokens 512 --kv-disk-space-mb 8192 --mtp gguf/DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf --mtp-draft 2 --host 0.0.0.0 --port 8085 --kv-cache-boundary-trim-tokens 1000
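For checking whether a given prompt length will land on the boundary, a rough sketch follows. The rounding rule (drop the trim tokens, then round down to the alignment) is my inference from the logged numbers, not something I read out of ds4_server.c, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed rule, inferred from the log: drop trim_tokens from the tail,
 * then round down to a multiple of align_tokens. */
static uint32_t cold_store_len_sketch(uint32_t prompt_tokens,
                                      uint32_t trim_tokens,
                                      uint32_t align_tokens) {
    assert(align_tokens > 0);
    if (prompt_tokens <= trim_tokens)
        return 0;
    return (prompt_tokens - trim_tokens) / align_tokens * align_tokens;
}

/* True when the cold store length coincides with a continued-store boundary,
 * which is the case where the same prefix gets written twice. */
static bool hits_continued_boundary_sketch(uint32_t cold_len,
                                           uint32_t continued_step) {
    assert(continued_step > 0);
    return cold_len > 0 && cold_len % continued_step == 0;
}
```

With the flags above (trim 1000, align 2048) and the 23029-token prompt from the log, this gives 20480, which matches tokens=20480 trimmed=2549 in the cold store line. 20480 is 2 × 10240, so the boundary check is true for that request.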
Smallest fix I can see
Either of these on its own removes most of the waste:
- In kv_cache_evict, exclude the file the current store just wrote from the candidate set. If the budget cannot be satisfied without it, return that fact and let the store path skip the write.
- Before the cold sync at ds4_server.c:10354-10372, set kc->continued_last_store_tokens = cold_store_len so the in-prefill callback does not also fire at that boundary.
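A minimal sketch of the first option, to show the shape I have in mind. This is not a patch against ds4_server.c: the struct, the evicted flag, and kv_evict_excluding_sketch are all hypothetical, and the real code would unlink files rather than set a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache entry; names are illustrative only. */
typedef struct {
    uint64_t hits;
    uint64_t tokens;
    uint64_t file_size;  /* bytes on disk */
    bool evicted;        /* stands in for unlinking the file */
} kv_file_sketch;

/* Same score as in the report: (hits + 1) * tokens / file_size. */
static double kv_file_score(const kv_file_sketch *f) {
    assert(f->file_size > 0);
    return (double)(f->hits + 1) * (double)f->tokens / (double)f->file_size;
}

/* Evict lowest-scoring entries until the total fits the budget, never
 * touching f[keep]. Returns false, evicting nothing, when f[keep] alone
 * exceeds the budget, so the caller can skip the write instead. */
static bool kv_evict_excluding_sketch(kv_file_sketch *f, size_t n,
                                      size_t keep, uint64_t budget) {
    assert(keep < n);
    if (f[keep].file_size > budget)
        return false;

    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        if (!f[i].evicted)
            total += f[i].file_size;

    while (total > budget) {
        size_t victim = n;  /* n means "no candidate yet" */
        for (size_t i = 0; i < n; i++) {
            if (i == keep || f[i].evicted)
                continue;
            if (victim == n || kv_file_score(&f[i]) < kv_file_score(&f[victim]))
                victim = i;
        }
        assert(victim != n);  /* guaranteed by the early return above */
        f[victim].evicted = true;
        total -= f[victim].file_size;
    }
    return true;
}
```

The early return is the "return that fact" part: if the new snapshot cannot fit even with everything else gone, nothing is evicted and the store path can decline to write it.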
Happy to send a patch if useful.