kvcache: Fix double copy in NVMe read #287
Conversation
The NVMeBackend.read() method was performing two full copies of each array:

1. np.load() copies file → memory
2. np.array() copies memory → memory (unnecessary)

This was based on a misunderstanding that np.load() only does I/O without copying data. In reality, np.load() already returns a usable numpy array in memory.

The second copy caused:

- 19% slower device read latency (52.34 → 62.41 ms P95)
- 66% lower storage I/O throughput (1834 → 3047 tokens/sec)
- 99.7% higher host overhead (43.99 → 0.15 ms P95)

Changes:

- Remove np.array() copy from default read path
- Set mmap=None to ensure data is loaded
- Add optional --use-mmap flag for memory-mapped mode
- Fix timing logic to consistently measure host_time as cache drop overhead

Performance improvement (tmpfs, 20 users, 20 seconds, decode-only):

- Device latency P95: 52.34 ms → 62.41 ms (+19.2%)
- Host overhead P95: 43.99 ms → 0.15 ms (-99.7%)
- Total latency P95: 96.20 ms → 62.55 ms (-35.0%)
- Storage I/O throughput: 1834 → 3047 tokens/sec (+66.1%)
- Per-request storage latency P95: 407.55 ms → 268.46 ms (-34.1%)

Tested-by: Louis Douriez <ldouriez@ddn.com>
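The change described above can be sketched roughly as follows. This is a simplified, hypothetical reconstruction, not the actual benchmark code: the function names read_before/read_after and the bare file-path argument are stand-ins for the real NVMeBackend.read() method.

```python
import numpy as np

# Hypothetical sketch of the fix; read_before/read_after are stand-ins
# for the old and new NVMeBackend.read() paths.

def read_before(path):
    # Old path: np.load() already copies file -> memory, then
    # np.array() made a second, redundant in-memory copy.
    arr = np.load(path)
    return np.array(arr)  # unnecessary second copy

def read_after(path, use_mmap=False):
    # New path: a single eager copy by default (mmap_mode=None),
    # or a lazy memory map when the --use-mmap flag is set.
    mmap_mode = "r" if use_mmap else None
    return np.load(path, mmap_mode=mmap_mode)
```

Both paths return the same data; the difference is purely how many times the bytes cross host memory.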
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
The fix is valid, but the numbers are exaggerated.
hazemawadalla left a comment
The fix removes the np.array() copy which gives a 5-8% read latency improvement, but it destroys the L4 host/device latency breakdown for reads.
The original code was intentional:
NVMeBackend.read() measures:

- device_time: time for np.load() (read from disk)
- host_time: time for fadvise + np.array() (cache drop + memory copy)

np.load() isolates disk I/O. np.array() isolates the host-side memory copy. Together they give the L4 breakdown that the benchmark reports. Without np.array(), everything collapses into a single np.load() call with no way to separate host from device.
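The instrumentation being defended here presumably looks something like the sketch below. This is a hedged reconstruction from the comment's description, not the actual benchmark source; drop_cache() and instrumented_read() are hypothetical names, with os.posix_fadvise standing in for the fadvise call mentioned above.

```python
import os
import time
import numpy as np

def drop_cache(path):
    # Advise the kernel to evict this file's pages so the next read
    # hits the device rather than the page cache (Linux-only API).
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

def instrumented_read(path):
    t0 = time.perf_counter()
    arr = np.load(path)       # device_time: file -> memory (disk I/O)
    t1 = time.perf_counter()
    drop_cache(path)          # host_time: cache drop...
    copy = np.array(arr)      # ...plus the host-side memory copy
    t2 = time.perf_counter()
    return copy, t1 - t0, t2 - t1  # (data, device_time, host_time)
```

Removing the np.array() line merges the copy cost into whatever np.load() reports next time around, which is the visibility loss the reviewer is describing.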
The claimed performance numbers (host -99.7%, throughput +66%) do not reproduce. Tested on Kingston DC3000ME NVMe with llama3.1-8b:
┌────────────────┬────────────────────────┬───────────┐
│ Metric │ tmpfs (his conditions) │ Real NVMe │
├────────────────┼────────────────────────┼───────────┤
│ Read Host P95 │ -5.4% │ -23.5% │
├────────────────┼────────────────────────┼───────────┤
│ Read Total P95 │ -8.4% │ -11.8% │
├────────────────┼────────────────────────┼───────────┤
│ Throughput │ +5.4% │ +17.6% │
└────────────────┴────────────────────────┴───────────┘
The tradeoff is 5-8% faster reads vs losing per-operation host/device visibility. For a storage benchmark, the instrumentation may be more valuable than the speedup.
russfellows left a comment
This all makes sense, and I agree with removing all the extraneous memory copies that occur throughout the DLIO code. My only concern is around the "use_mmap()" code: it seems to also be doing a data copy, correct? So I'm unclear on the rationale for its use. But if everyone else has reviewed and tested it and it works as desired, then I am completely fine with these changes.
We will discuss the tradeoffs in Tuesday's KVCache TF meeting.
@hazemawadalla The numbers are not exaggerated; they were measured:

python3 kv-cache.py --model llama3.1-8b --num-users 20 --duration 20 --gpu-mem-gb 0 --cpu-mem-gb 8 --cache-dir /tmp/ramfs/test --seed 42 --decode-only

The -99% corresponds to host overhead. Since the new code only measures the fadvise call for this host metric (the numpy array copy is gone), it is nearly instantaneous: it simply drops the cache, if any, instead of dropping the cache and copying the data into a new buffer.

What's important to consider is that the memory bandwidth constraint is greatly reduced when we remove the extra copy. Since this np.array operation is performed alongside many other threads doing the same thing, you're not measuring the host-memory copy in a vacuum when using np.array: performance is limited by other threads performing the same operation in parallel, which adds significant pressure to the client's memory bandwidth. Your results show a +17.6% throughput improvement for NVMe, which I think clearly demonstrates that this change is quite beneficial.

@russfellows So the mmap code does this:
vs when mmap is disabled:
vs previous version of the code:
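The code snippets being compared here were not captured in this extract of the thread. A minimal illustration of the three read paths under discussion, assuming they reduce to plain np.load calls (function names are hypothetical):

```python
import numpy as np

# Hypothetical sketch of the three paths compared above; the actual
# snippets from the thread are not reproduced in this extract.

def read_mmap(path):
    # --use-mmap enabled: np.load returns a lazy memory map; pages are
    # faulted in on access, and materializing a plain array still costs
    # one copy from the mapped pages into an anonymous buffer.
    mm = np.load(path, mmap_mode="r")
    return np.array(mm)

def read_default(path):
    # mmap disabled (new default): one eager copy, file -> memory.
    return np.load(path, mmap_mode=None)

def read_old(path):
    # Previous version: eager load plus a second redundant copy.
    return np.array(np.load(path))
```

This would explain russfellows's observation that the mmap path still involves a data copy: the copy happens whenever the mapped array is materialized, rather than eagerly at load time.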