
kvcache: Fix double copy in NVMe read#287

Open
LouisDDN wants to merge 1 commit intomlcommons:mainfrom
LouisDDN:ld/fix-kvcache-double-copy

Conversation

@LouisDDN
Contributor

The NVMeBackend.read() method was performing two full copies of each array:

  1. np.load() copies file → memory
  2. np.array() copies memory → memory (unnecessary)

This was based on a misunderstanding that np.load() only does I/O without copying data. In reality, np.load() already returns a usable numpy array in memory.

The second copy caused:

  • Distorted device read latency measurement (52.34 ms P95 with the copy vs 62.41 ms without)
  • 66% lower storage I/O throughput (1834 vs 3047 tokens/sec)
  • 99.7% higher host overhead (43.99 vs 0.15 ms P95)

Changes:

  • Remove np.array() copy from default read path
  • Set mmap=None to ensure data is loaded
  • Add optional --use-mmap flag for memory-mapped mode
  • Fix timing logic to consistently measure host_time as cache drop overhead

Performance improvement (tmpfs, 20 users, 20 seconds, decode-only):

  • Device latency P95: 52.34 ms → 62.41 ms (+19.2%)
  • Host overhead P95: 43.99 ms → 0.15 ms (-99.7%)
  • Total latency P95: 96.20 ms → 62.55 ms (-35.0%)
  • Storage I/O throughput: 1834 → 3047 tokens/sec (+66.1%)
  • Per-request storage latency P95: 407.55 ms → 268.46 ms (-34.1%)

Tested-by: Louis Douriez <ldouriez@ddn.com>
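The change can be sketched as follows (a minimal illustration, not the PR's actual code; the real NVMeBackend.read() signature, fadvise handling, and flag plumbing are omitted, and NumPy's keyword for what the text calls "mmap" is mmap_mode):

```python
import numpy as np

def read_old(path):
    # Pre-PR path: np.load() already returns a normal in-memory array,
    # so the np.array() call below is a second, redundant copy.
    arr = np.load(path, mmap_mode=None)  # copy 1: file -> memory
    return np.array(arr)                 # copy 2: memory -> memory (removed by this PR)

def read_new(path):
    # Post-PR default path: a single copy from file into memory.
    # mmap_mode=None ensures the data is loaded, not memory-mapped.
    return np.load(path, mmap_mode=None)
```

Both functions return equal arrays; only the number of in-memory copies differs.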

@LouisDDN LouisDDN requested a review from a team on March 23, 2026 at 17:13
@github-actions

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@hazemawadalla
Contributor

The fix is valid, but the numbers are exaggerated.

@hazemawadalla hazemawadalla left a comment

The fix removes the np.array() copy, which gives a 5-8% read latency improvement, but it destroys the L4 host/device latency breakdown for reads.

The original code was intentional:

NVMeBackend.read() measures:

- device_time: time for np.load() (read from disk)
- host_time: time for fadvise + np.array() (cache drop + memory copy)

np.load() isolates disk I/O. np.array() isolates host-side memory copy. Together they give the L4 breakdown that the benchmark reports. Without np.array(), everything
collapses into a single np.load() call with no way to separate host from device.
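The instrumentation described above would look roughly like this (a hypothetical sketch assuming a Linux host; the real method's signature, file layout, and fadvise arguments are not shown anywhere in this thread):

```python
import os
import time
import numpy as np

def read_with_breakdown(path):
    """Sketch of the pre-PR L4 host/device split for one read."""
    t0 = time.perf_counter()
    arr = np.load(path)                  # device_time: file -> memory
    t1 = time.perf_counter()
    fd = os.open(path, os.O_RDONLY)
    try:
        # Ask the kernel to drop this file's page cache so the next
        # read hits the device again (Linux-specific call).
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    out = np.array(arr)                  # host-side memory -> memory copy
    t2 = time.perf_counter()
    return out, t1 - t0, t2 - t1        # array, device_time, host_time
```

Dropping the np.array() call leaves only the fadvise inside host_time, which is why the reported host overhead collapses to near zero.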

The claimed performance numbers (host -99.7%, throughput +66%) do not reproduce. Tested on Kingston DC3000ME NVMe with llama3.1-8b:

| Metric | tmpfs (his conditions) | Real NVMe |
|---|---|---|
| Read Host P95 | -5.4% | -23.5% |
| Read Total P95 | -8.4% | -11.8% |
| Throughput | +5.4% | +17.6% |

The tradeoff is 5-8% faster reads vs losing per-operation host/device visibility. For a storage benchmark, the instrumentation may be more valuable than the speedup.


@russfellows russfellows left a comment


This all makes sense and I agree with removing all the extraneous memory copies that occur throughout DLIO code. However, my only concern is around the "use_mmap()" code. It seems to also be doing a data copy, correct? So unclear on the rationale for its use. But if everyone else has reviewed and tested and it works as desired, then I am completely fine with these changes.

@dslik
Contributor

dslik commented Mar 24, 2026

We will discuss the tradeoffs in Tuesday's KVCache TF meeting.

@LouisDDN
Copy link
Contributor Author

@hazemawadalla The numbers are not exaggerated; they were measured.
Here is the command line used against tmpfs:

python3 kv-cache.py --model llama3.1-8b --num-users 20 --duration 20 --gpu-mem-gb 0 --cpu-mem-gb 8 --cache-dir /tmp/ramfs/test --seed 42 --decode-only

The -99% corresponds to host overhead. Since the new code only measures the fadvise call for this host metric (because the numpy array copy is gone), it is nearly instantaneous. It simply drops the cache, if any, instead of dropping the cache and copying the data into a new buffer.

What’s important to consider is that the memory bandwidth constraint is greatly reduced when we remove the extra copy. Since this np.array operation is performed alongside many other threads doing the same thing, you’re not measuring the host‑memory copy in a vacuum when using np.array. Performance is limited by other threads performing the same operation or memory copy in parallel, which adds significant pressure to the client’s memory bandwidth.
For that reason, I somewhat disagree with the notion of "losing per‑operation host/device visibility," because the metric is not simply a timer around an existing operation; it introduces an additional operation altogether.

Your results show a +17.6% throughput improvement for NVMe, which I think clearly demonstrates that this change is quite beneficial.

@russfellows
"However, my only concern is around the "use_mmap()" code. It seems to also be doing a data copy, correct? So unclear on the rationale for its use. But if everyone else has reviewed and tested and it works as desired, then I am completely fine with these changes."

So the mmap code does this:

  • np.load with mmap (which doesn't load the data! Measured >300 GB/s.)
  • actually copy the data into a Python buffer

vs when mmap is disabled:

  • np.load (loads the data into a Python buffer)

vs the previous version of the code:

  • np.load (loads the data into a Python buffer)
  • copies the data again into a new Python buffer (this is what this PR removes)
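The three variants can be put side by side (illustrative only; the function names are mine, and the --use-mmap flag plumbing is omitted):

```python
import numpy as np

def read_mmap(path):
    # --use-mmap path: np.load() only maps the file, it does not read it
    # (hence the >300 GB/s "load" figure); the np.array() copy below is
    # what actually pulls the bytes into an in-memory buffer.
    mapped = np.load(path, mmap_mode="r")
    return np.array(mapped)

def read_default(path):
    # Default path after this PR: one copy, file -> memory.
    return np.load(path, mmap_mode=None)

def read_previous(path):
    # Pre-PR path: load into memory, then copy again
    # (the second copy is what this PR removes).
    loaded = np.load(path, mmap_mode=None)
    return np.array(loaded)
```

All three return equal arrays; they differ only in when, and how many times, the data is copied.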
