
Disk KV cache: stores then immediately evicts the same file when budget is full #157

@unsaltedbutter-ai

Description


When the disk KV cache is at its budget, a single chat request writes a 291 MiB snapshot to disk and then unlinks it on the very next operation, twice in a row. Across one request, the same SHA is stored and evicted three times.

23:33:08 kv cache evicted reason=disk-cache-full tokens=20480 hits=0 size=291.80 MiB file=.../7400e69e....kv
23:33:08 chat ctx=0..23029:23029 TOOLS prompt start
...
23:34:06 chunk 20480/23029 (88.9%)
23:34:06 kv cache stored tokens=20480 trimmed=0 reason=continued size=291.80 MiB save=42.6 ms
23:34:06 kv cache evicted reason=disk-cache-full tokens=20480 hits=0 size=291.80 MiB file=.../7400e69e....kv
23:34:06 kv cache stored tokens=20480 trimmed=2549 reason=cold size=291.80 MiB save=55.2 ms
23:34:06 kv cache evicted reason=disk-cache-full tokens=20480 hits=0 size=291.80 MiB file=.../7400e69e....kv

The 7400e69e file is stored, evicted, stored again, and evicted again, all within the same second. Net useful work: zero. Wasted I/O: ~98 ms of save time (42.6 + 55.2 ms) and ~584 MiB written.

Why it happens

Two independent issues compound. Either alone is mild; together they thrash.

1. Two stores at the same prefix length. During prefill (ds4_server.c:10354-10372) the cold path syncs to cold_store_len. That sync triggers the continued-interval callback at ds4_server.c:9892, which stores the snapshot once with reason continued; ds4_server.c:10368 then stores the same prefix again with reason cold. Same tokens, same rendered text, same SHA, two writes. This fires whenever cold_store_len lands on a multiple of the continued step (10240 by default, after alignment), which covers most prompts above ~10 k tokens.

2. Eviction picks the file that was just written. Every successful store calls kv_cache_evict at ds4_server.c:9010. The score at ds4_server.c:8627-8639 is (hits + 1) * tokens / file_size. For hits == 0 entries this reduces to roughly 1 / bytes_per_token, so a just-written entry has no advantage over older hits == 0 entries, and the last_used tiebreaker only applies on exact float equality. In practice the just-written file is the largest low-score entry, so evicting it alone satisfies the budget in a single pass, and the loop picks it.

The comment at ds4_server.c:8636 says hits + 1 exists so a fresh checkpoint is not deleted because its hit counter is 0. The + 1 only defends against multiply-by-zero; it does not rank a fresh entry above other fresh entries.

How to reproduce

  1. Fill the kv-cache directory close to budget (a handful of chats, or set --kv-cache-budget-mb low for testing).
  2. Send any chat whose prompt tokens are a multiple of the continued step (10240 with defaults), or fall just above one.
  3. The stored ... reason=continued / evicted ... <same hash> / stored ... reason=cold / evicted ... <same hash> pattern appears in the log.

The command line I was using:
./ds4-server --ctx 512000 --kv-disk-dir ./kv-cache --kv-cache-boundary-align-tokens 2048 --kv-cache-min-tokens 512 --kv-disk-space-mb 8192 --mtp gguf/DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf --mtp-draft 2 --host 0.0.0.0 --port 8085 --kv-cache-boundary-trim-tokens 1000

Smallest fix I can see

Either of these on its own removes most of the waste:

  • In kv_cache_evict, exclude the file the current store just wrote from the candidate set. If the budget cannot be satisfied without it, return that fact and let the store path skip the write.
  • Before the cold sync at ds4_server.c:10354-10372, set kc->continued_last_store_tokens = cold_store_len so the in-prefill callback does not also fire at that boundary.

Happy to send a patch if useful.
