
[Performance] Improve RAM usage #895

Open

RepublicOfKorokke wants to merge 8 commits into jundot:main from RepublicOfKorokke:perf/improve-ram-usage

Conversation


RepublicOfKorokke commented Apr 22, 2026

About

This PR reduces memory usage and improves the reliability of the KV cache system, focusing on eliminating memory duplication in hot_cache_only mode and on keeping TurboQuantKVCache quantized during reconstruction.

Features

  • Direct mx.array storage in hot cache for hot_cache_only mode.
  • Quantized-form retention for TurboQuantKVCache during reconstruction.
  • Walk-back restore support for all blocks in the prefix cache.
  • Memory leak prevention for boundary snapshots.

Changes

  • Updated PagedSSDCacheManager to support direct array storage and updated stats collection.
  • Modified BlockAwarePrefixCache to avoid dequantization of TQ cache and ensure all blocks are stored.
  • Adjusted Scheduler to manage boundary snapshots and prevent leaks.
  • Added integration tests for memory leak detection.

Memory Optimization in Hot Cache

Implemented a fast path in PagedSSDCacheManager that stores and retrieves mx.array objects directly when hot_cache_only is enabled. This avoids converting arrays to raw bytes and back, which previously caused memory doubling during cache hits.
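
Below is a minimal sketch of the fast path, with hypothetical attribute and method names standing in for the real PagedSSDCacheManager internals:

```python
import numpy as np
import mlx.core as mx


class HotCacheSketch:
    """Illustrative stand-in for the PagedSSDCacheManager store path."""

    def __init__(self, hot_cache_only: bool):
        self.hot_cache_only = hot_cache_only
        self._hot_cache: dict[str, dict] = {}

    def store_block(self, key: str, tensors: list[mx.array]) -> None:
        if self.hot_cache_only:
            # Fast path: keep the mx.array objects themselves. A later
            # cache hit returns the same buffers, so nothing is copied.
            self._hot_cache[key] = {"arrays": tensors}
            return
        # Slow path: serialize to raw bytes for SSD persistence.
        # Rebuilding mx.arrays from these bytes on a hit allocates
        # fresh memory, which is what doubled RAM usage before.
        np_tensors = [np.array(t) for t in tensors]
        self._hot_cache[key] = {
            "tensors_raw": [t.tobytes() for t in np_tensors],
            "dtypes": [t.dtype for t in np_tensors],
            "shapes": [t.shape for t in np_tensors],
        }
```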

TurboQuant and Prefix Cache Improvements

Modified BlockAwarePrefixCache to keep TurboQuantKVCache in its quantized state during reconstruction instead of converting to FP16. Additionally, changed the storage strategy for RotatingKVCache and other non-sliceable caches to always store actual data, enabling reliable walk-back restoration without relying on boundary snapshots.
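
For intuition on why staying quantized matters, here is a back-of-the-envelope size comparison; the layer and head dimensions are illustrative, not the actual model config:

```python
# Rough KV-cache size for a 65k-token context (illustrative dims).
tokens, layers, kv_heads, head_dim = 65_000, 48, 8, 128

elems = tokens * layers * kv_heads * head_dim * 2  # keys + values

fp16_gb = elems * 2 / 1e9    # 16 bits = 2 bytes per element
q3_gb = elems * 3 / 8 / 1e9  # ~3 bits per element, ignoring scales

print(f"fp16: {fp16_gb:.1f} GB, 3-bit: {q3_gb:.1f} GB")
# -> fp16: 12.8 GB, 3-bit: 2.4 GB. Dequantizing on reconstruction
#    transiently pays the fp16 cost again, which matches the multi-GB
#    spikes in the measurements below.
```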

Leak Fix and Reliability

Fixed a memory leak in the Scheduler by explicitly clearing boundary snapshots from _boundary_cache_snapshots once the paged cache has been stored. Also updated boundary snapshot logic to be disabled when hot_cache_only is active.

Testing

  • Unit tests pass
    • Added several tests covering cache management.
    • The 28 failures below are identical before and after the change and are unrelated to this PR (missing xgrammar module, unavailable GPU stream in the test environment, and pre-existing assertion mismatches).
Before fix
FAILED tests/test_accuracy_benchmark.py::TestAccuracyBenchmarkRequest::test_all_valid_benchmarks - AssertionError: assert 16 == 12
FAILED tests/test_admin_api_key.py::TestListModelsSettings::test_list_models_includes_all_model_settings_fields - AssertionError: Missing fields: {'turboquant_skip_last', 'preserve_thinking'}, Extra fields: set()
FAILED tests/test_admin_profiles_api.py::test_all_model_settings_fields_classified - AssertionError: New ModelSettings field(s) {'preserve_thinking'} must be classified in UNIVERSAL_PROFILE_FIELDS, MODEL_SPECIFIC_PROFILE_FIELDS, or EXCLUDED_FROM_PROFILES. I...
FAILED tests/test_engine_core.py::TestEngineCoreAddRequest::test_add_request_returns_id - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAddRequest::test_add_request_with_custom_id - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAddRequest::test_add_request_creates_collector - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAddRequest::test_add_request_with_default_sampling_params - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAbortRequest::test_abort_request_wakes_blocked_stream_outputs - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreGenerateCancellation::test_generate_cancel_aborts_request - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreErrorPropagation::test_error_output_propagates_to_collector - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreErrorPropagation::test_stream_outputs_raises_on_error - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreErrorPropagation::test_generate_raises_on_error - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAbortAllRequests::test_abort_all_requests - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAbortAllRequests::test_abort_all_requests_engine_keeps_running - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_calls_get_builtin_structural_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_reasoning_false_when_thinking_disabled - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_patches_user_grammar_into_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileGrammarForRequest::test_reasoning_parser_uses_structural_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileGrammarForRequest::test_reasoning_parser_with_thinking_disabled - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_scheduler.py::TestSchedulerAbortRequest::test_abort_running_request_removes_from_batch - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerAbortRequest::test_abort_running_request_always_calls_remove - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerAbortRequest::test_abort_cleans_all_scheduler_state - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerBoundarySnapshots::test_capture_boundary_snapshot_at_block_boundary - KeyError: 'req-boundary'
FAILED tests/test_scheduler.py::TestSchedulerBoundarySnapshots::test_cleanup_finished_skips_output_tokens_for_reasoning_model - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerBoundarySnapshots::test_cleanup_finished_stores_output_tokens_for_non_reasoning_model - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerBoundarySnapshots::test_cleanup_finished_uses_boundary_snapshot_for_partial_trailing_tokens - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerRotatingBlockAlignment::test_cleanup_finished_always_calls_remove_for_mapped_uid - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerRotatingBlockAlignment::test_cleanup_finished_removes_uid_from_active_batch - RuntimeError: There is no Stream(gpu, 1) in current thread.
========  28 failed, 3797 passed, 30 skipped, 54 deselected in 232.83s (0:03:52) ======== 
After fix
FAILED tests/test_accuracy_benchmark.py::TestAccuracyBenchmarkRequest::test_all_valid_benchmarks - AssertionError: assert 16 == 12
FAILED tests/test_admin_api_key.py::TestListModelsSettings::test_list_models_includes_all_model_settings_fields - AssertionError: Missing fields: {'turboquant_skip_last', 'preserve_thinking'}, Extr...
FAILED tests/test_admin_profiles_api.py::test_all_model_settings_fields_classified - AssertionError: New ModelSettings field(s) {'preserve_thinking'} must be classified...
FAILED tests/test_engine_core.py::TestEngineCoreAddRequest::test_add_request_returns_id - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAddRequest::test_add_request_with_custom_id - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAddRequest::test_add_request_creates_collector - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAddRequest::test_add_request_with_default_sampling_params - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAbortRequest::test_abort_request_wakes_blocked_stream_outputs - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreGenerateCancellation::test_generate_cancel_aborts_request - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreErrorPropagation::test_error_output_propagates_to_collector - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreErrorPropagation::test_stream_outputs_raises_on_error - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreErrorPropagation::test_generate_raises_on_error - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAbortAllRequests::test_abort_all_requests - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_engine_core.py::TestEngineCoreAbortAllRequests::test_abort_all_requests_engine_keeps_running - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_calls_get_builtin_structural_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_reasoning_false_when_thinking_disabled - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_patches_user_grammar_into_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileGrammarForRequest::test_reasoning_parser_uses_structural_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileGrammarForRequest::test_reasoning_parser_with_thinking_disabled - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_scheduler.py::TestSchedulerAbortRequest::test_abort_running_request_removes_from_batch - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerAbortRequest::test_abort_running_request_always_calls_remove - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerAbortRequest::test_abort_cleans_all_scheduler_state - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerBoundarySnapshots::test_capture_boundary_snapshot_at_block_boundary - KeyError: 'req-boundary'
FAILED tests/test_scheduler.py::TestSchedulerBoundarySnapshots::test_cleanup_finished_skips_output_tokens_for_reasoning_model - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerBoundarySnapshots::test_cleanup_finished_stores_output_tokens_for_non_reasoning_model - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerBoundarySnapshots::test_cleanup_finished_uses_boundary_snapshot_for_partial_trailing_tokens - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerRotatingBlockAlignment::test_cleanup_finished_always_calls_remove_for_mapped_uid - RuntimeError: There is no Stream(gpu, 1) in current thread.
FAILED tests/test_scheduler.py::TestSchedulerRotatingBlockAlignment::test_cleanup_finished_removes_uid_from_active_batch - RuntimeError: There is no Stream(gpu, 1) in current thread.
======== 28 failed, 3802 passed, 30 skipped, 55 deselected in 231.61s (0:03:51) ========

Memory usage sequence with gemma-4-26B-A4B-it-oQ3-fp16

| Step | Activity/Phase | Before Fix (GB) | After Fix (GB) | Notes |
| --- | --- | --- | --- | --- |
| 1 | Model Loading | 13 | 13 | Initial model initialization and allocation. |
| 2 | Prefill (Initial) | 14-20 | 14-20 | Initial processing of 65k tokens. |
| 3 | Decode (Initial) | 14-17 | 14-17 | Initial decoding phase. |
| 4 | Post-Decode (1st Run) | 30 | 14 | After the fix, memory returns to the sustained minimum footprint once the cache is stored. |
| 5 | Prefill (2nd Run) | 44 | 17 | Same 65k tokens resent; cache hit. Processing was effectively instantaneous, so a full-load measurement was not possible. |
| 6 | Decode (2nd Run) | 30 | 17-20 | Second decoding phase. |
| 7 | Post-Decode (2nd Run) | 30 | 14 | The extra ~15 GB that previously persisted after each run is eliminated. |
  • Hot cache is working
  • SSD cache is working

hot_cache_only: true

  "cache": {
    "enabled": true,
    "hot_cache_only": true,
    "ssd_cache_dir": "/<home>/.omlx/cache",
    "ssd_cache_max_size": "46GB",
    "hot_cache_max_size": "16GB",
    "initial_cache_blocks": 256
  },
  • Cache works with:
    • gemma-4-26B-A4B-it-oQ3-fp16
      • TurboQuant KV Cache 3bit
    • Qwen3.6-35B-A3B-MLX-oQ4-FP16
      • TurboQuant KV Cache 3bit

hot_cache_only: false

  "cache": {
    "enabled": true,
    "hot_cache_only": false,
    "ssd_cache_dir": "/<home>/.omlx/cache",
    "ssd_cache_max_size": "46GB",
    "hot_cache_max_size": "16GB",
    "initial_cache_blocks": 256
  },
  • Cache works with the following models (even after reloading the model):
    • gemma-4-26B-A4B-it-oQ3-fp16
      • TurboQuant KV Cache 3bit
    • Qwen3.6-35B-A3B-MLX-oQ4-FP16
      • TurboQuant KV Cache 3bit

[Background]
When hot_cache_only=True, evicted entries from the hot cache should be discarded rather than written to SSD. Previously, evicted blocks were still being enqueued for SSD writes, defeating the purpose of the in-memory-only mode and potentially causing unnecessary I/O overhead.

[Approach]
Added a conditional check in _evict_from_hot_cache() to skip SSD write enqueueing when hot_cache_only=True. Evicted entries are now simply discarded with a debug log message. Also cleaned up a redundant check, since _enqueue_ssd_write() already returns early in hot_cache_only mode.

[Side Effect]
None - this is the intended behavior for hot_cache_only mode. Evicted entries were already not being persisted in earlier code paths, so this aligns the eviction behavior with the configuration intent.
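
A sketch of the guarded eviction path; class, attribute, and log names are hypothetical stand-ins for the real manager internals:

```python
import logging

logger = logging.getLogger("omlx.sketch")


class EvictionSketch:
    """Hypothetical mirror of the eviction guard."""

    def __init__(self, hot_cache_only: bool):
        self.hot_cache_only = hot_cache_only
        self._hot_cache: dict[str, object] = {}
        self.ssd_queue: list[tuple[str, object]] = []

    def _evict_from_hot_cache(self, key: str) -> None:
        entry = self._hot_cache.pop(key, None)
        if entry is None:
            return
        if self.hot_cache_only:
            # In-memory-only mode: discard the block instead of queueing
            # an SSD write; a debug log records the discard.
            logger.debug("hot_cache_only: discarded evicted block %s", key)
            return
        # SSD tier enabled: hand the entry to the write queue as before.
        self.ssd_queue.append((key, entry))
```
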
[Background]
The hot_cache_only mode was not functioning properly: the store path returned False and blocked cache storage. Additionally, when entries were stored, they were converted to tensors_raw (raw bytes), doubling memory on cache hits because new mx.array objects had to be created from scratch.

[Approach]
In hot_cache_only mode, store mx.array objects directly in the hot cache instead of converting to tensors_raw. This reuses the same GPU memory on cache hits rather than allocating new memory. Also fixed the logic flow to properly handle hot_cache_only as a primary mode, not a fallback case.

[Side Effect]
Entries stored in hot_cache_only mode now use direct array storage, while entries from SSD promotion still use tensors_raw. This means load_block checks for arrays first (fast path) before falling back to tensors_raw. No breaking changes for existing cached data as both paths are supported.
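
A sketch of the dual-path load described above; the entry layout and function name are hypothetical:

```python
import numpy as np
import mlx.core as mx


def load_block(entry: dict) -> list[mx.array]:
    """Check for direct arrays first, then fall back to raw bytes."""
    if "arrays" in entry:
        # Fast path (hot_cache_only storage): hand back the stored arrays;
        # the same buffers are reused, so a hit allocates nothing.
        return entry["arrays"]
    # Fallback (entries promoted from SSD): rebuild mx.arrays from bytes.
    return [
        mx.array(np.frombuffer(raw, dtype=dtype).reshape(shape))
        for raw, dtype, shape in zip(
            entry["tensors_raw"], entry["dtypes"], entry["shapes"]
        )
    ]
```
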
[Remaining problem]

The cache still fails to hit in one scenario, with the log below:
omlx.scheduler - DEBUG - Request 07c4a473-0682-4b51-a9b0-6db7b8471bad: paged cache reconstruction failed, released shared blocks
…ory leak

[Background]
Boundary snapshots were never cleaned up after storing, leading to memory leaks.

[Approach]
Added cleanup logic to delete boundary snapshots after successful cache storage.
Also applied consistent formatting to conditional expressions.

[Side Effect]
None - boundary snapshots are lightweight metadata needed only during cache
storage. The cleanup ensures they are released immediately after use.
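
A condensed sketch of the cleanup; only _boundary_cache_snapshots is taken from the PR, the method name is hypothetical:

```python
class SchedulerSketch:
    """Illustrative fragment of the Scheduler's snapshot lifecycle."""

    def __init__(self):
        self._boundary_cache_snapshots: dict[str, object] = {}

    def _on_paged_cache_stored(self, request_id: str) -> None:
        # Once the paged cache holds this request's blocks, the boundary
        # snapshot is dead weight; dropping it here plugs the leak.
        self._boundary_cache_snapshots.pop(request_id, None)
```
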
…ly mode

[Background]
When hot_cache_only=true, boundary snapshots were previously disabled but the code still
used last-block-only strategy for cache reconstruction. This caused restore failures because
walk-back reconstruction requires intermediate block states that weren't stored.

[Approach]
- Always set has_valid_state=True for RotatingKVCache and GDN recurrent caches,
  ensuring actual data is stored for all blocks (not just last block)
- Add hot_cache_only check in scheduler to disable boundary snapshots when in
  hot_cache_only mode (no cold cache writes needed)

[Side Effect]
Minimal: storing actual data for every block increases memory usage slightly, but it enables reliable cache restoration without boundary snapshots in hot_cache_only mode.
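
A sketch of why per-block state enables walk-back; the helper below is hypothetical and only illustrates the search:

```python
def longest_restorable_prefix(stored_blocks: set[int], wanted: int) -> int:
    """Walk back from the wanted block count to the nearest block whose
    state was actually stored, then restore that prefix.

    Under the old last-block-only strategy, stored_blocks held at most
    one index, so any mismatch meant total restore failure. Storing real
    state for every block lets the walk-back always find something.
    """
    for count in range(wanted, 0, -1):
        if count - 1 in stored_blocks:
            return count  # blocks [0, count) are restorable
    return 0  # no usable prefix; fall back to a full prefill


# Example: blocks 0-5 have stored state, the request wants 8 blocks.
assert longest_restorable_prefix({0, 1, 2, 3, 4, 5}, 8) == 6
```
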
…nstruction

[Background]
The previous implementation dequantized TurboQuantKVCache back to an FP16 KVCache
during cache reconstruction, inflating memory usage 2-8x (2~8-bit -> 16-bit).
With hot_cache_only mode storing many active caches in GPU memory, this
caused unnecessary memory pressure and potential Metal allocation failures.

[Approach]
Modified reconstruct_cache() to keep TurboQuantKVCache in its quantized form
rather than dequantizing. The lazy quantization approach will re-apply
quantization at decode start, maintaining the memory savings throughout
the cache lifetime.

[Side Effect]
None - TurboQuantKVCache was already designed for lazy quantization.
The reconstruction path now matches the intended lazy behavior. Cache hit
rate and token savings remain unchanged. May improve memory headroom for
hot_cache_only scenarios with quantized models.
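
A runnable illustration of the size difference, using mx.quantize at 4-bit since the PR's 3-bit TurboQuant packing is its own scheme; shapes are arbitrary:

```python
import mlx.core as mx

# Compare resident bytes for a quantized block vs the fp16 tensor the
# old reconstruct_cache() path would have materialized.
keys = mx.random.normal((8 * 4096, 128)).astype(mx.float16)

# Quantized form: packed weights plus per-group scales and biases.
w_q, scales, biases = mx.quantize(keys, bits=4)

# Old path: dequantize during reconstruction -> full fp16 footprint.
keys_fp16 = mx.dequantize(w_q, scales, biases, bits=4)

quant_bytes = w_q.nbytes + scales.nbytes + biases.nbytes
print(f"quantized: {quant_bytes} B, fp16: {keys_fp16.nbytes} B")
# Keeping (w_q, scales, biases) and re-engaging lazy quantization at
# decode start means the fp16 number is never paid on restore.
```
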
[Background]
The hot_cache_only feature was introduced in prior commits to allow
in-memory-only caching without SSD writes. These tests verify the expected
behavior: (1) hot_cache_only=True discards evicted blocks while False writes them
to SSD, (2) True stores entries as direct arrays (fast path) while False stores
tensors_raw, and (3) load uses the fast path for arrays and falls back for tensors_raw.
Additionally verifies TurboQuantKVCache stays quantized after reconstruction.

[Approach]
Added TestHotCacheOnlyMode class with eviction/discard tests, storage format
tests in test_paged_ssd_cache, and reconstruction test in test_turboquant.

[Side Effect]
None - these are new test cases that verify existing functionality with
no impact on production code.
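
A self-contained sketch of what the eviction tests assert, using a minimal stand-in rather than the real manager (names hypothetical):

```python
class _FakeManager:
    """Minimal stand-in for the cache manager's eviction behavior."""

    def __init__(self, hot_cache_only: bool):
        self.hot_cache_only = hot_cache_only
        self._hot_cache: dict[str, object] = {}
        self.ssd_queue: list[str] = []

    def evict(self, key: str) -> None:
        self._hot_cache.pop(key, None)
        if not self.hot_cache_only:
            self.ssd_queue.append(key)  # persisted only when SSD tier is on


class TestHotCacheOnlyMode:
    def test_true_discards_evicted_blocks(self):
        mgr = _FakeManager(hot_cache_only=True)
        mgr._hot_cache["blk"] = object()
        mgr.evict("blk")
        assert mgr.ssd_queue == []  # discarded, never written to SSD

    def test_false_writes_evicted_blocks_to_ssd(self):
        mgr = _FakeManager(hot_cache_only=False)
        mgr._hot_cache["blk"] = object()
        mgr.evict("blk")
        assert mgr.ssd_queue == ["blk"]  # evicted block queued for SSD
```
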
RepublicOfKorokke marked this pull request as ready for review April 23, 2026 02:48