
[Performance] Improve RAM usage #895

Open

RepublicOfKorokke wants to merge 8 commits into jundot:main from RepublicOfKorokke:perf/improve-ram-usage

Conversation


RepublicOfKorokke commented Apr 22, 2026

About

This PR reduces memory usage and improves the reliability of the KV cache system, focusing on eliminating memory duplication in hot_cache_only mode and on keeping TurboQuantKVCache quantized during reconstruction.

Features

  • Direct mx.array storage in hot cache for hot_cache_only mode.
  • Quantized-form retention for TurboQuantKVCache during reconstruction.
  • Walk-back restore support for all blocks in the prefix cache.
  • Memory leak prevention for boundary snapshots.

Changes

  • Updated PagedSSDCacheManager to support direct array storage and updated stats collection.
  • Modified BlockAwarePrefixCache to avoid dequantization of TQ cache and ensure all blocks are stored.
  • Adjusted Scheduler to manage boundary snapshots and prevent leaks.
  • Added integration tests for memory leak detection.

Memory Optimization in Hot Cache

Implemented a fast path in PagedSSDCacheManager that stores and retrieves mx.array objects directly when hot_cache_only is enabled. This avoids converting arrays to raw bytes and back, which previously caused memory doubling during cache hits.
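
Below is a minimal sketch of the fast path, with hypothetical attribute and method names standing in for the real PagedSSDCacheManager internals:

```python
import numpy as np
import mlx.core as mx


class HotCacheSketch:
    """Illustrative stand-in for the PagedSSDCacheManager store path."""

    def __init__(self, hot_cache_only: bool):
        self.hot_cache_only = hot_cache_only
        self._hot_cache: dict[str, dict] = {}

    def store_block(self, key: str, tensors: list[mx.array]) -> None:
        if self.hot_cache_only:
            # Fast path: keep the mx.array objects themselves. A later
            # cache hit returns the same buffers, so nothing is copied.
            self._hot_cache[key] = {"arrays": tensors}
            return
        # Slow path: serialize to raw bytes for SSD persistence.
        # Rebuilding mx.arrays from these bytes on a hit allocates
        # fresh memory, which is what doubled RAM usage before.
        np_tensors = [np.array(t) for t in tensors]
        self._hot_cache[key] = {
            "tensors_raw": [t.tobytes() for t in np_tensors],
            "dtypes": [t.dtype for t in np_tensors],
            "shapes": [t.shape for t in np_tensors],
        }
```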

TurboQuant and Prefix Cache Improvements

Modified BlockAwarePrefixCache to keep TurboQuantKVCache in its quantized state during reconstruction instead of converting to FP16. Additionally, changed the storage strategy for RotatingKVCache and other non-sliceable caches to always store actual data, enabling reliable walk-back restoration without relying on boundary snapshots.
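
For intuition on why staying quantized matters, here is a back-of-the-envelope size comparison; the layer and head dimensions are illustrative, not the actual model config:

```python
# Rough KV-cache size for a 65k-token context (illustrative dims).
tokens, layers, kv_heads, head_dim = 65_000, 48, 8, 128

elems = tokens * layers * kv_heads * head_dim * 2  # keys + values

fp16_gb = elems * 2 / 1e9    # 16 bits = 2 bytes per element
q3_gb = elems * 3 / 8 / 1e9  # ~3 bits per element, ignoring scales

print(f"fp16: {fp16_gb:.1f} GB, 3-bit: {q3_gb:.1f} GB")
# -> fp16: 12.8 GB, 3-bit: 2.4 GB. Dequantizing on reconstruction
#    transiently pays the fp16 cost again, which matches the multi-GB
#    spikes in the measurements below.
```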

Leak Fix and Reliability

Fixed a memory leak in the Scheduler by explicitly clearing boundary snapshots from _boundary_cache_snapshots once the paged cache has been stored. Also updated boundary snapshot logic to be disabled when hot_cache_only is active.

Testing

  • Unit tests pass
    • Added several tests covering cache management.
    • The 28 failures below are identical before and after the change and are unrelated to this PR (missing xgrammar module, unavailable GPU stream in the test environment, and pre-existing assertion mismatches).
Before fix
FAILED tests/test_accuracy_benchmark.py::TestAccuracyBenchmarkRequest::test_all_valid_benchmarks - AssertionError: assert 16 == 12
FAILED tests/test_admin_api_key.py::TestListModelsSettings::test_list_models_includes_all_model_settings_fields - AssertionError: Missing fields: {'turboquant_skip_last', 'preserve_thinking'}, Extra fields: set()
FAILED tests/test_admin_profiles_api.py::test_all_model_settings_fields_classified - AssertionError: New ModelSettings field(s) {'preserve_thinking'} must be classified in UNIVERSAL_PROFILE_FIELDS, MODEL_SPECIFIC_PROFILE_FIELDS, or EXCLUDED_FROM_PROFILES. I...
FAILED tests/test_engine_core.py::TestEngineCoreAddRequest::test_add_request_returns_id - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAddRequest::test_add_request_with_custom_id - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAddRequest::test_add_request_creates_collector - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAddRequest::test_add_request_with_default_sampling_params - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAbortRequest::test_abort_request_wakes_blocked_stream_outputs - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreGenerateCancellation::test_generate_cancel_aborts_request - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreErrorPropagation::test_error_output_propagates_to_collector - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreErrorPropagation::test_stream_outputs_raises_on_error - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreErrorPropagation::test_generate_raises_on_error - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAbortAllRequests::test_abort_all_requests - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAbortAllRequests::test_abort_all_requests_engine_keeps_running - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_calls_get_builtin_structural_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_reasoning_false_when_thinking_disabled - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_patches_user_grammar_into_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileGrammarForRequest::test_reasoning_parser_uses_structural_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileGrammarForRequest::test_reasoning_parser_with_thinking_disabled - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_scheduler.py::TestSchedulerAbortRequest::test_abort_running_request_removes_from_batch - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerAbortRequest::test_abort_running_request_always_calls_remove - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerAbortRequest::test_abort_cleans_all_scheduler_state - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerBoundarySnapshots::test_capture_boundary_snapshot_at_block_boundary - KeyError: 'req-boundary'
FAILED tests/test_scheduler.py::TestSchedulerBoundarySnapshots::test_cleanup_finished_skips_output_tokens_for_reasoning_model - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerBoundarySnapshots::test_cleanup_finished_stores_output_tokens_for_non_reasoning_model - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerBoundarySnapshots::test_cleanup_finished_uses_boundary_snapshot_for_partial_trailing_tokens - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerRotatingBlockAlignment::test_cleanup_finished_always_calls_remove_for_mapped_uid - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerRotatingBlockAlignment::test_cleanup_finished_removes_uid_from_active_batch - RuntimeError: There is no Stream(gpu, 1) in current thread.
========  28 failed, 3797 passed, 30 skipped, 54 deselected in 232.83s (0:03:52) ======== 
After fix
FAILED tests/test_accuracy_benchmark.py::TestAccuracyBenchmarkRequest::test_all_valid_benchmarks - AssertionError: assert 16 == 12
FAILED tests/test_admin_api_key.py::TestListModelsSettings::test_list_models_includes_all_model_settings_fields - AssertionError: Missing fields: {'turboquant_skip_last', 'preserve_thinking'}, Extr...
FAILED tests/test_admin_profiles_api.py::test_all_model_settings_fields_classified - AssertionError: New ModelSettings field(s) {'preserve_thinking'} must be classified...
FAILED tests/test_engine_core.py::TestEngineCoreAddRequest::test_add_request_returns_id - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAddRequest::test_add_request_with_custom_id - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAddRequest::test_add_request_creates_collector - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAddRequest::test_add_request_with_default_sampling_params - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAbortRequest::test_abort_request_wakes_blocked_stream_outputs - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreGenerateCancellation::test_generate_cancel_aborts_request - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreErrorPropagation::test_error_output_propagates_to_collector - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreErrorPropagation::test_stream_outputs_raises_on_error - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreErrorPropagation::test_generate_raises_on_error - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAbortAllRequests::test_abort_all_requests - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAbortAllRequests::test_abort_all_requests_engine_keeps_running - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_calls_get_builtin_structural_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_reasoning_false_when_thinking_disabled - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_patches_user_grammar_into_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileGrammarForRequest::test_reasoning_parser_uses_structural_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileGrammarForRequest::test_reasoning_parser_with_thinking_disabled - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_scheduler.py::TestSchedulerAbortRequest::test_abort_running_request_removes_from_batch - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerAbortRequest::test_abort_running_request_always_calls_remove - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerAbortRequest::test_abort_cleans_all_scheduler_state - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerBoundarySnapshots::test_capture_boundary_snapshot_at_block_boundary - KeyError: 'req-boundary'
FAILED tests/test_scheduler.py::TestSchedulerBoundarySnapshots::test_cleanup_finished_skips_output_tokens_for_reasoning_model - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerBoundarySnapshots::test_cleanup_finished_stores_output_tokens_for_non_reasoning_model - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerBoundarySnapshots::test_cleanup_finished_uses_boundary_snapshot_for_partial_trailing_tokens - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerRotatingBlockAlignment::test_cleanup_finished_always_calls_remove_for_mapped_uid - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerRotatingBlockAlignment::test_cleanup_finished_removes_uid_from_active_batch - RuntimeError: There is no Stream(gpu, 1) in current thread.
======== 28 failed, 3802 passed, 30 skipped, 55 deselected in 231.61s (0:03:51) ========

Memory usage sequence with gemma-4-26B-A4B-it-oQ3-fp16

| Step | Activity/Phase | Before Fix (GB) | After Fix (GB) | Notes |
| --- | --- | --- | --- | --- |
| 1 | Model Loading | 13 | 13 | Initial model initialization and allocation. |
| 2 | Prefill (Initial) | 14-20 | 14-20 | Initial processing of 65k tokens. |
| 3 | Decode (Initial) | 14-17 | 14-17 | Initial decoding phase. |
| 4 | Post-Decode (1st Run) | 30 | 14 | After the fix, memory returns to the sustained minimum footprint once the cache is stored. |
| 5 | Prefill (2nd Run) | 44 | 17 | Same 65k tokens resent; cache hit. Processing was effectively instantaneous, so a full-load measurement was not possible. |
| 6 | Decode (2nd Run) | 30 | 17-20 | Second decoding phase. |
| 7 | Post-Decode (2nd Run) | 30 | 14 | The extra ~15 GB that previously persisted after each run is eliminated. |
  • Hot cache is working
  • SSD cache is working

hot_cache_only: true

  "cache": {
    "enabled": true,
    "hot_cache_only": true,
    "ssd_cache_dir": "/<home>/.omlx/cache",
    "ssd_cache_max_size": "46GB",
    "hot_cache_max_size": "16GB",
    "initial_cache_blocks": 256
  },
  • Cache works with:
    • gemma-4-26B-A4B-it-oQ3-fp16
      • TurboQuant KV Cache 3bit
    • Qwen3.6-35B-A3B-MLX-oQ4-FP16
      • TurboQuant KV Cache 3bit

hot_cache_only: false

  "cache": {
    "enabled": true,
    "hot_cache_only": false,
    "ssd_cache_dir": "/<home>/.omlx/cache",
    "ssd_cache_max_size": "46GB",
    "hot_cache_max_size": "16GB",
    "initial_cache_blocks": 256
  },
  • Cache works with the following models (even after reloading the model):
    • gemma-4-26B-A4B-it-oQ3-fp16
      • TurboQuant KV Cache 3bit
    • Qwen3.6-35B-A3B-MLX-oQ4-FP16
      • TurboQuant KV Cache 3bit

[Background]
When hot_cache_only=True, evicted entries from the hot cache should be discarded rather than written to SSD. Previously, evicted blocks were still being enqueued for SSD writes, defeating the purpose of the in-memory-only mode and potentially causing unnecessary I/O overhead.

[Approach]
Added a conditional check in _evict_from_hot_cache() to skip SSD write enqueueing when hot_cache_only=True. Evicted entries are now simply discarded with a debug log message. Also cleaned up a redundant check, since _enqueue_ssd_write() already returns early in hot_cache_only mode.

[Side Effect]
None - this is the intended behavior for hot_cache_only mode. Evicted entries were already not being persisted in earlier code paths, so this aligns the eviction behavior with the configuration intent.
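
A sketch of the guarded eviction path; class, attribute, and log names are hypothetical stand-ins for the real manager internals:

```python
import logging

logger = logging.getLogger("omlx.sketch")


class EvictionSketch:
    """Hypothetical mirror of the eviction guard."""

    def __init__(self, hot_cache_only: bool):
        self.hot_cache_only = hot_cache_only
        self._hot_cache: dict[str, object] = {}
        self.ssd_queue: list[tuple[str, object]] = []

    def _evict_from_hot_cache(self, key: str) -> None:
        entry = self._hot_cache.pop(key, None)
        if entry is None:
            return
        if self.hot_cache_only:
            # In-memory-only mode: discard the block instead of queueing
            # an SSD write; a debug log records the discard.
            logger.debug("hot_cache_only: discarded evicted block %s", key)
            return
        # SSD tier enabled: hand the entry to the write queue as before.
        self.ssd_queue.append((key, entry))
```
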
[Background]
The hot_cache_only mode was not functioning properly: the store path returned False and blocked cache storage. Additionally, when entries were stored, they were converted to tensors_raw (raw bytes), doubling memory on cache hits because new mx.array objects had to be created from scratch.

[Approach]
In hot_cache_only mode, store mx.array objects directly in the hot cache instead of converting to tensors_raw. This reuses the same GPU memory on cache hits rather than allocating new memory. Also fixed the logic flow to properly handle hot_cache_only as a primary mode, not a fallback case.

[Side Effect]
Entries stored in hot_cache_only mode now use direct array storage, while entries from SSD promotion still use tensors_raw. This means load_block checks for arrays first (fast path) before falling back to tensors_raw. No breaking changes for existing cached data as both paths are supported.
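
A sketch of the dual-path load described above; the entry layout and function name are hypothetical:

```python
import numpy as np
import mlx.core as mx


def load_block(entry: dict) -> list[mx.array]:
    """Check for direct arrays first, then fall back to raw bytes."""
    if "arrays" in entry:
        # Fast path (hot_cache_only storage): hand back the stored arrays;
        # the same buffers are reused, so a hit allocates nothing.
        return entry["arrays"]
    # Fallback (entries promoted from SSD): rebuild mx.arrays from bytes.
    return [
        mx.array(np.frombuffer(raw, dtype=dtype).reshape(shape))
        for raw, dtype, shape in zip(
            entry["tensors_raw"], entry["dtypes"], entry["shapes"]
        )
    ]
```
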
[Remaining problem]

The cache still fails to hit in one scenario, with the log below:
omlx.scheduler - DEBUG - Request 07c4a473-0682-4b51-a9b0-6db7b8471bad: paged cache reconstruction failed, released shared blocks
…ory leak

[Background]
Boundary snapshots were never cleaned up after storing, leading to memory leaks.

[Approach]
Added cleanup logic to delete boundary snapshots after successful cache storage.
Also applied consistent formatting to conditional expressions.

[Side Effect]
None - boundary snapshots are lightweight metadata needed only during cache
storage. The cleanup ensures they are released immediately after use.
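
A condensed sketch of the cleanup; only _boundary_cache_snapshots is taken from the PR, the method name is hypothetical:

```python
class SchedulerSketch:
    """Illustrative fragment of the Scheduler's snapshot lifecycle."""

    def __init__(self):
        self._boundary_cache_snapshots: dict[str, object] = {}

    def _on_paged_cache_stored(self, request_id: str) -> None:
        # Once the paged cache holds this request's blocks, the boundary
        # snapshot is dead weight; dropping it here plugs the leak.
        self._boundary_cache_snapshots.pop(request_id, None)
```
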
…ly mode

[Background]
When hot_cache_only=true, boundary snapshots were previously disabled but the code still
used last-block-only strategy for cache reconstruction. This caused restore failures because
walk-back reconstruction requires intermediate block states that weren't stored.

[Approach]
- Always set has_valid_state=True for RotatingKVCache and GDN recurrent caches,
  ensuring actual data is stored for all blocks (not just last block)
- Add hot_cache_only check in scheduler to disable boundary snapshots when in
  hot_cache_only mode (no cold cache writes needed)

[Side Effect]
Minimal: storing actual data for every block increases memory usage slightly, but it enables reliable cache restoration without boundary snapshots in hot_cache_only mode.
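
A sketch of why per-block state enables walk-back; the helper below is hypothetical and only illustrates the search:

```python
def longest_restorable_prefix(stored_blocks: set[int], wanted: int) -> int:
    """Walk back from the wanted block count to the nearest block whose
    state was actually stored, then restore that prefix.

    Under the old last-block-only strategy, stored_blocks held at most
    one index, so any mismatch meant total restore failure. Storing real
    state for every block lets the walk-back always find something.
    """
    for count in range(wanted, 0, -1):
        if count - 1 in stored_blocks:
            return count  # blocks [0, count) are restorable
    return 0  # no usable prefix; fall back to a full prefill


# Example: blocks 0-5 have stored state, the request wants 8 blocks.
assert longest_restorable_prefix({0, 1, 2, 3, 4, 5}, 8) == 6
```
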
…nstruction

[Background]
The previous implementation dequantized TurboQuantKVCache back to an FP16 KVCache
during cache reconstruction, inflating memory usage 2-8x (2~8-bit -> 16-bit).
With hot_cache_only mode storing many active caches in GPU memory, this
caused unnecessary memory pressure and potential Metal allocation failures.

[Approach]
Modified reconstruct_cache() to keep TurboQuantKVCache in its quantized form
rather than dequantizing. The lazy quantization approach will re-apply
quantization at decode start, maintaining the memory savings throughout
the cache lifetime.

[Side Effect]
None - TurboQuantKVCache was already designed for lazy quantization.
The reconstruction path now matches the intended lazy behavior. Cache hit
rate and token savings remain unchanged. May improve memory headroom for
hot_cache_only scenarios with quantized models.
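
A runnable illustration of the size difference, using mx.quantize at 4-bit since the PR's 3-bit TurboQuant packing is its own scheme; shapes are arbitrary:

```python
import mlx.core as mx

# Compare resident bytes for a quantized block vs the fp16 tensor the
# old reconstruct_cache() path would have materialized.
keys = mx.random.normal((8 * 4096, 128)).astype(mx.float16)

# Quantized form: packed weights plus per-group scales and biases.
w_q, scales, biases = mx.quantize(keys, bits=4)

# Old path: dequantize during reconstruction -> full fp16 footprint.
keys_fp16 = mx.dequantize(w_q, scales, biases, bits=4)

quant_bytes = w_q.nbytes + scales.nbytes + biases.nbytes
print(f"quantized: {quant_bytes} B, fp16: {keys_fp16.nbytes} B")
# Keeping (w_q, scales, biases) and re-engaging lazy quantization at
# decode start means the fp16 number is never paid on restore.
```
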
[Background]
The hot_cache_only feature was introduced in prior commits to allow
in-memory-only caching without SSD writes. These tests verify the expected
behavior: (1) hot_cache_only=True discards evicted blocks while False writes them
to SSD, (2) True stores entries as direct arrays (fast path) while False stores
tensors_raw, and (3) load uses the fast path for arrays and falls back for tensors_raw.
Additionally verifies TurboQuantKVCache stays quantized after reconstruction.

[Approach]
Added TestHotCacheOnlyMode class with eviction/discard tests, storage format
tests in test_paged_ssd_cache, and reconstruction test in test_turboquant.

[Side Effect]
None - these are new test cases that verify existing functionality with
no impact on production code.
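
A self-contained sketch of what the eviction tests assert, using a minimal stand-in rather than the real manager (names hypothetical):

```python
class _FakeManager:
    """Minimal stand-in for the cache manager's eviction behavior."""

    def __init__(self, hot_cache_only: bool):
        self.hot_cache_only = hot_cache_only
        self._hot_cache: dict[str, object] = {}
        self.ssd_queue: list[str] = []

    def evict(self, key: str) -> None:
        self._hot_cache.pop(key, None)
        if not self.hot_cache_only:
            self.ssd_queue.append(key)  # persisted only when SSD tier is on


class TestHotCacheOnlyMode:
    def test_true_discards_evicted_blocks(self):
        mgr = _FakeManager(hot_cache_only=True)
        mgr._hot_cache["blk"] = object()
        mgr.evict("blk")
        assert mgr.ssd_queue == []  # discarded, never written to SSD

    def test_false_writes_evicted_blocks_to_ssd(self):
        mgr = _FakeManager(hot_cache_only=False)
        mgr._hot_cache["blk"] = object()
        mgr.evict("blk")
        assert mgr.ssd_queue == ["blk"]  # evicted block queued for SSD
```
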
RepublicOfKorokke marked this pull request as ready for review April 23, 2026 02:48