
kvcache: Fix double copy in NVMe read#287

Open
LouisDDN wants to merge 1 commit intomlcommons:mainfrom
LouisDDN:ld/fix-kvcache-double-copy

Conversation

@LouisDDN
Contributor

The NVMeBackend.read() method was performing two full copies of each array:

  1. np.load() copies file → memory
  2. np.array() copies memory → memory (unnecessary)

This was based on a misunderstanding that np.load() only does I/O without copying data. In reality, np.load() already returns a usable numpy array in memory.

The second copy caused:

  • Distorted device read latency measurement (52.34 ms P95 with the copy vs 62.41 ms without)
  • 66% lower storage I/O throughput (1834 vs 3047 tokens/sec)
  • 99.7% higher host overhead (43.99 vs 0.15 ms P95)

Changes:

  • Remove np.array() copy from default read path
  • Set mmap=None to ensure data is loaded
  • Add optional --use-mmap flag for memory-mapped mode
  • Fix timing logic to consistently measure host_time as cache drop overhead

Performance improvement (tmpfs, 20 users, 20 seconds, decode-only):

  • Device latency P95: 52.34 ms → 62.41 ms (+19.2%)
  • Host overhead P95: 43.99 ms → 0.15 ms (-99.7%)
  • Total latency P95: 96.20 ms → 62.55 ms (-35.0%)
  • Storage I/O throughput: 1834 → 3047 tokens/sec (+66.1%)
  • Per-request storage latency P95: 407.55 ms → 268.46 ms (-34.1%)

Tested-by: Louis Douriez <ldouriez@ddn.com>
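The change can be sketched as follows (a minimal illustration, not the PR's actual code; the real NVMeBackend.read() signature, fadvise handling, and flag plumbing are omitted, and NumPy's keyword for what the text calls "mmap" is mmap_mode):

```python
import numpy as np

def read_old(path):
    # Pre-PR path: np.load() already returns a normal in-memory array,
    # so the np.array() call below is a second, redundant copy.
    arr = np.load(path, mmap_mode=None)  # copy 1: file -> memory
    return np.array(arr)                 # copy 2: memory -> memory (removed by this PR)

def read_new(path):
    # Post-PR default path: a single copy from file into memory.
    # mmap_mode=None ensures the data is loaded, not memory-mapped.
    return np.load(path, mmap_mode=None)
```

Both functions return equal arrays; only the number of in-memory copies differs.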

@LouisDDN LouisDDN requested a review from a team on March 23, 2026 at 17:13
@github-actions

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@hazemawadalla
Contributor

The fix is valid, but the numbers are exaggerated.

@hazemawadalla hazemawadalla left a comment

The fix removes the np.array() copy, which gives a 5-8% read latency improvement, but it destroys the L4 host/device latency breakdown for reads.

The original code was intentional:

NVMeBackend.read() measures:

- device_time: time for np.load() (read from disk)
- host_time: time for fadvise + np.array() (cache drop + memory copy)

np.load() isolates disk I/O. np.array() isolates host-side memory copy. Together they give the L4 breakdown that the benchmark reports. Without np.array(), everything
collapses into a single np.load() call with no way to separate host from device.
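The instrumentation described above would look roughly like this (a hypothetical sketch assuming a Linux host; the real method's signature, file layout, and fadvise arguments are not shown anywhere in this thread):

```python
import os
import time
import numpy as np

def read_with_breakdown(path):
    """Sketch of the pre-PR L4 host/device split for one read."""
    t0 = time.perf_counter()
    arr = np.load(path)                  # device_time: file -> memory
    t1 = time.perf_counter()
    fd = os.open(path, os.O_RDONLY)
    try:
        # Ask the kernel to drop this file's page cache so the next
        # read hits the device again (Linux-specific call).
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    out = np.array(arr)                  # host-side memory -> memory copy
    t2 = time.perf_counter()
    return out, t1 - t0, t2 - t1        # array, device_time, host_time
```

Dropping the np.array() call leaves only the fadvise inside host_time, which is why the reported host overhead collapses to near zero.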

The claimed performance numbers (host -99.7%, throughput +66%) do not reproduce. Tested on Kingston DC3000ME NVMe with llama3.1-8b:

| Metric | tmpfs (his conditions) | Real NVMe |
|---|---|---|
| Read Host P95 | -5.4% | -23.5% |
| Read Total P95 | -8.4% | -11.8% |
| Throughput | +5.4% | +17.6% |

The tradeoff is 5-8% faster reads vs losing per-operation host/device visibility. For a storage benchmark, the instrumentation may be more valuable than the speedup.


@russfellows russfellows left a comment


This all makes sense and I agree with removing all the extraneous memory copies that occur throughout DLIO code. However, my only concern is around the "use_mmap()" code. It seems to also be doing a data copy, correct? So unclear on the rationale for its use. But if everyone else has reviewed and tested and it works as desired, then I am completely fine with these changes.

@dslik
Contributor

dslik commented Mar 24, 2026

We will discuss the tradeoffs in Tuesday's KVCache TF meeting.

@LouisDDN
Copy link
Contributor Author

@hazemawadalla The numbers are not exaggerated; they were measured.
Here is the command line used against tmpfs:

python3 kv-cache.py --model llama3.1-8b --num-users 20 --duration 20 --gpu-mem-gb 0 --cpu-mem-gb 8 --cache-dir /tmp/ramfs/test --seed 42 --decode-only

The -99% corresponds to host overhead. Since the new code only measures the fadvise call for this host metric (because the numpy array copy is gone), it is nearly instantaneous. It simply drops the cache, if any, instead of dropping the cache and copying the data into a new buffer.

What’s important to consider is that the memory bandwidth constraint is greatly reduced when we remove the extra copy. Since this np.array operation is performed alongside many other threads doing the same thing, you’re not measuring the host‑memory copy in a vacuum when using np.array. Performance is limited by other threads performing the same operation or memory copy in parallel, which adds significant pressure to the client’s memory bandwidth.
For that reason, I somewhat disagree with the notion of "losing per‑operation host/device visibility," because the metric is not simply a timer around an existing operation; it introduces an additional operation altogether.

Your results show a +17.6% throughput improvement for NVMe, which I think clearly demonstrates that this change is quite beneficial.

@russfellows
"However, my only concern is around the "use_mmap()" code. It seems to also be doing a data copy, correct? So unclear on the rationale for its use. But if everyone else has reviewed and tested and it works as desired, then I am completely fine with these changes."

So the mmap code does this:

  • np.load with mmap (which doesn't load the data! Measured >300 GB/s.)
  • actually copy the data into a Python buffer

vs when mmap is disabled:

  • np.load (loads the data into a Python buffer)

vs the previous version of the code:

  • np.load (loads the data into a Python buffer)
  • copies the data again into a new Python buffer (this is what this PR removes)
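The three variants can be put side by side (illustrative only; the function names are mine, and the --use-mmap flag plumbing is omitted):

```python
import numpy as np

def read_mmap(path):
    # --use-mmap path: np.load() only maps the file, it does not read it
    # (hence the >300 GB/s "load" figure); the np.array() copy below is
    # what actually pulls the bytes into an in-memory buffer.
    mapped = np.load(path, mmap_mode="r")
    return np.array(mapped)

def read_default(path):
    # Default path after this PR: one copy, file -> memory.
    return np.load(path, mmap_mode=None)

def read_previous(path):
    # Pre-PR path: load into memory, then copy again
    # (the second copy is what this PR removes).
    loaded = np.load(path, mmap_mode=None)
    return np.array(loaded)
```

All three return equal arrays; they differ only in when, and how many times, the data is copied.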
