[Performance] Improve RAM usage #895
Open
RepublicOfKorokke wants to merge 8 commits into jundot:main from …
b91eced to
f8e1479
[Background] When hot_cache_only=True, entries evicted from the hot cache should be discarded rather than written to SSD. Previously, evicted blocks were still enqueued for SSD writes, which defeated the purpose of the in-memory-only mode and could cause unnecessary I/O.
[Approach] Added a conditional check in _evict_from_hot_cache() that skips SSD write enqueueing when hot_cache_only=True. Evicted entries are now discarded with a debug log message. Also fixed a redundant check in _enqueue_ssd_write(), which already returns early in hot_cache_only mode.
[Side Effect] None. This is the intended behavior for hot_cache_only mode. Evicted entries were already not persisted in earlier code paths, so this change aligns the eviction behavior with the configuration intent.
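The eviction rule described in this commit can be illustrated with a minimal sketch. Everything below is hypothetical (the class name, the list standing in for the SSD write queue, the LRU policy); it is not the omlx implementation, only a toy model of "discard on eviction when hot_cache_only is set, otherwise hand the block to the SSD writer".

```python
import logging
from collections import OrderedDict

logger = logging.getLogger("hot_cache_sketch")


class HotCacheSketch:
    """Toy LRU hot cache. All names are hypothetical, not the omlx API."""

    def __init__(self, max_entries: int, hot_cache_only: bool):
        self.max_entries = max_entries
        self.hot_cache_only = hot_cache_only
        self.entries = OrderedDict()  # block_hash -> payload, oldest first
        self.ssd_write_queue = []     # stand-in for a background SSD writer

    def put(self, block_hash, payload):
        # Insert (or refresh) the entry, then evict until within capacity.
        self.entries[block_hash] = payload
        self.entries.move_to_end(block_hash)
        while len(self.entries) > self.max_entries:
            self._evict_oldest()

    def _evict_oldest(self):
        block_hash, payload = self.entries.popitem(last=False)
        if self.hot_cache_only:
            # In-memory-only mode: drop the block, never touch the SSD queue.
            logger.debug("discarding evicted block %s", block_hash)
            return
        # Default mode: evicted blocks are queued for persistence.
        self.ssd_write_queue.append((block_hash, payload))
```

With a capacity of two, inserting three blocks evicts the oldest one; in the sketch it ends up in ssd_write_queue only when hot_cache_only is False.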
[Background] The hot_cache_only mode was not functioning properly: it returned False and blocked cache storage. In addition, stored entries were converted to tensors_raw (raw bytes), which doubled memory on cache hits because new mx.array objects had to be created from those bytes.
[Approach] In hot_cache_only mode, store mx.array objects directly in the hot cache instead of converting them to tensors_raw. Cache hits then reuse the same GPU memory rather than allocating new memory. Also fixed the logic flow so that hot_cache_only is handled as a primary mode, not a fallback case.
[Side Effect] Entries stored in hot_cache_only mode now use direct array storage, while entries promoted from SSD still use tensors_raw. load_block therefore checks for arrays first (fast path) before falling back to tensors_raw. No breaking changes for existing cached data, since both paths are supported.
[Remaining problem] The cache did not hit, with the log below:
omlx.scheduler - DEBUG - Request 07c4a473-0682-4b51-a9b0-6db7b8471bad: paged cache reconstruction failed, released shared blocks
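The two storage formats and the load order described in this commit can be sketched as follows. This is an illustration under stated assumptions: the standard-library array type stands in for mx.array so the example runs anywhere, and HotEntrySketch, store_block, and this load_block are hypothetical names rather than the PagedSSDCacheManager code.

```python
from array import array
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class HotEntrySketch:
    """Hypothetical hot-cache entry holding exactly one of the two formats."""
    arrays: Optional[Dict[str, array]] = None       # direct storage (fast path)
    tensors_raw: Optional[Dict[str, bytes]] = None  # serialized storage


def store_block(tensors: Dict[str, array], hot_cache_only: bool) -> HotEntrySketch:
    if hot_cache_only:
        # Keep references to the live arrays; nothing is copied.
        return HotEntrySketch(arrays=dict(tensors))
    # Otherwise serialize to raw bytes, as an SSD-backed entry would need.
    return HotEntrySketch(tensors_raw={k: v.tobytes() for k, v in tensors.items()})


def load_block(entry: HotEntrySketch) -> Dict[str, array]:
    if entry.arrays is not None:
        # Fast path: return the stored objects themselves.
        return entry.arrays
    # Fallback: rebuild new arrays from the raw bytes (a second allocation).
    restored = {}
    for name, raw in entry.tensors_raw.items():
        buf = array("f")
        buf.frombytes(raw)
        restored[name] = buf
    return restored
```

In this sketch a fast-path hit returns the very object that was stored, whereas the fallback path allocates a fresh array with equal contents, which is the duplication the commit aims to avoid.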
…ory leak
[Background] Boundary snapshots were never cleaned up after storing, which leaked memory.
[Approach] Added cleanup logic that deletes boundary snapshots after successful cache storage. Also applied consistent formatting to conditional expressions.
[Side Effect] None. Boundary snapshots are lightweight metadata needed only during cache storage; the cleanup releases them immediately after use.
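A minimal sketch of the cleanup pattern, assuming snapshots are kept in a dict keyed by request ID. The _boundary_cache_snapshots name is taken from the PR description; the class, the method names, and the store_fn callback are hypothetical and only illustrate "release snapshots once storage succeeds".

```python
class SchedulerSketch:
    """Toy scheduler; only the snapshot bookkeeping is modeled."""

    def __init__(self):
        # request_id -> {token_count: snapshot}
        self._boundary_cache_snapshots = {}

    def record_snapshot(self, request_id, token_count, snapshot):
        per_request = self._boundary_cache_snapshots.setdefault(request_id, {})
        per_request[token_count] = snapshot

    def store_paged_cache(self, request_id, store_fn):
        """Store the cache via store_fn, then drop snapshots on success.

        store_fn is a hypothetical callback that receives the snapshots
        and returns True when the paged cache was stored.
        """
        snapshots = self._boundary_cache_snapshots.get(request_id, {})
        stored = store_fn(snapshots)
        if stored:
            # Without this pop, the snapshots would outlive the request.
            self._boundary_cache_snapshots.pop(request_id, None)
        return stored
```

In this sketch, snapshots are kept when storage fails so that a later attempt could still use them; whether the real scheduler retries is not stated in the commit.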
…ly mode
[Background] When hot_cache_only=True, boundary snapshots were disabled, but the code still used the last-block-only strategy for cache reconstruction. This caused restore failures because walk-back reconstruction requires intermediate block states that were not stored.
[Approach]
- Always set has_valid_state=True for RotatingKVCache and GDN recurrent caches, so that actual data is stored for all blocks (not just the last block).
- Add a hot_cache_only check in the scheduler to disable boundary snapshots in hot_cache_only mode (no cold cache writes needed).
[Side Effect] None. Memory usage increases slightly, but cache restoration works reliably without boundary snapshots in hot_cache_only mode.
…nstruction
[Background] The previous implementation dequantized TurboQuantKVCache back to an FP16 KVCache during cache reconstruction, which roughly doubled memory usage (2- to 8-bit -> 16-bit). With hot_cache_only mode keeping many active caches in GPU memory, this caused unnecessary memory pressure and potential Metal allocation failures.
[Approach] Modified reconstruct_cache() to keep TurboQuantKVCache in its quantized form rather than dequantizing it. The lazy quantization approach re-applies quantization at decode start, preserving the memory savings throughout the cache lifetime.
[Side Effect] None. TurboQuantKVCache was already designed for lazy quantization, and the reconstruction path now matches that intended behavior. Cache hit rate and token savings remain unchanged. May improve memory headroom for hot_cache_only scenarios with quantized models.
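As a back-of-envelope check on the sizes involved, the sketch below computes the raw key/value payload at a given bit width. The function name and the shape parameters are hypothetical, and the per-group scale or offset metadata that a real quantized cache carries is ignored, so measured ratios will be somewhat smaller than 16 / bits.

```python
def kv_payload_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                     head_dim: int, bits: int) -> int:
    """Raw K+V payload size in bytes, ignoring quantization metadata."""
    # Factor 2 accounts for storing both keys and values.
    elements = 2 * num_tokens * num_layers * num_kv_heads * head_dim
    return elements * bits // 8


# Illustrative shape only; not taken from any specific model.
fp16_bytes = kv_payload_bytes(1024, 1, 1, 128, bits=16)
q4_bytes = kv_payload_bytes(1024, 1, 1, 128, bits=4)
ratio = fp16_bytes / q4_bytes  # payload growth if a 4-bit cache is dequantized to FP16
```

Under these assumptions the payload grows by a factor of 16 / bits on dequantization, which is why keeping the cache quantized matters most when many caches stay resident at once.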
[Background] The hot_cache_only feature was introduced in prior commits to allow in-memory-only caching without SSD writes. These tests verify the expected behavior: (1) hot_cache_only=True discards evicted blocks, whereas False writes them to SSD; (2) True stores entries as arrays (fast path), whereas False stores them as tensors_raw; (3) load uses the fast path for arrays and the fallback for tensors_raw. They also verify that TurboQuantKVCache stays quantized after reconstruction.
[Approach] Added a TestHotCacheOnlyMode class with eviction/discard tests, storage format tests in test_paged_ssd_cache, and a reconstruction test in test_turboquant.
[Side Effect] None. These are new test cases that verify existing functionality with no impact on production code.
f8e1479 to
d2d64fb
About
This PR optimizes memory usage and improves the reliability of the KV cache system, focusing on reducing memory duplication in hot_cache_only mode and optimizing TurboQuantKVCache reconstruction.
Features
- mx.array storage in hot cache for hot_cache_only mode.
- … TurboQuantKVCache during reconstruction.
Changes
- … PagedSSDCacheManager to support direct array storage and updated stats collection.
- … BlockAwarePrefixCache to avoid dequantization of TQ cache and ensure all blocks are stored.
- … Scheduler to manage boundary snapshots and prevent leaks.
Memory Optimization in Hot Cache
Implemented a fast path in PagedSSDCacheManager that stores and retrieves mx.array objects directly when hot_cache_only is enabled. This avoids converting arrays to raw bytes and back, which previously caused memory doubling during cache hits.
TurboQuant and Prefix Cache Improvements
Modified BlockAwarePrefixCache to keep TurboQuantKVCache in its quantized state during reconstruction instead of converting to FP16. Additionally, changed the storage strategy for RotatingKVCache and other non-sliceable caches to always store actual data, enabling reliable walk-back restoration without relying on boundary snapshots.
Leak Fix and Reliability
Fixed a memory leak in the Scheduler by explicitly clearing boundary snapshots from _boundary_cache_snapshots once the paged cache has been stored. Also updated boundary snapshot logic to be disabled when hot_cache_only is active.
Testing
Before fix
After fix
Memory usage sequence with gemma-4-26B-A4B-it-oQ3-fp16
hot_cache_only: true
hot_cache_only: false