feat: Auto-clear SSD cache on model unload#885

Open
afanty2021 wants to merge 57 commits into jundot:main from afanty2021:feature/auto-clear-ssd-cache-on-unload

Conversation

@afanty2021

Summary

Add automatic SSD cache clearing when models are unloaded, useful for benchmarking scenarios where cache persistence across model loads is not needed.

Problem

When running multiple models in sequence for benchmarking:

  • Each model leaves SSD cache behind after unloading
  • Subsequent models write new cache, causing "SSD write queue full" warnings
  • Cache interference can affect benchmark results

2026-04-21 06:29:19,224 - omlx.cache.paged_ssd_cache - WARNING - SSD write queue full, dropping evicted block c127f0c289fba682

Solution

Add a clear_ssd_cache_on_unload configuration option that:

  • Clears SSD cache before engine stop during model unload
  • Is disabled by default (backward compatible)
  • Can be enabled via CLI flag, environment variable, or config file

Changes

Configuration

  • CLI flag: --paged-ssd-cache-clear-on-unload
  • Environment variable: OMLX_PAGED_SSD_CACHE_CLEAR_ON_UNLOAD
  • Config file: clear_on_unload in paged_ssd_cache section

Implementation

  • Modified _unload_engine() in omlx/engine_pool.py
  • Clears SSD cache before stopping the engine
  • Safe attribute access with exception handling
  • Comprehensive test coverage (3 new tests)
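
The cache-clear-before-stop logic described above can be sketched roughly as follows. This is a minimal illustration, not the actual oMLX code: the attribute name `ssd_cache` and its `clear()` method are assumptions for the sake of the example.

```python
def clear_ssd_cache_on_unload(engine, enabled: bool) -> bool:
    """Best-effort SSD cache clear, intended to run before engine stop.

    Hypothetical sketch of the described behavior: `engine.ssd_cache`
    and `clear()` are assumed names, not the real oMLX internals.
    Returns True if a cache was cleared.
    """
    if not enabled:  # disabled by default -> backward compatible
        return False
    try:
        # Safe attribute access: engines without an SSD cache are skipped.
        cache = getattr(engine, "ssd_cache", None)
        if cache is not None:
            cache.clear()
            return True
    except Exception:
        # Cache cleanup must never block the model unload itself.
        pass
    return False
```

Because the call is wrapped in a broad try/except, a failing or missing cache never prevents the engine from stopping, which matches the "safe attribute access with exception handling" point above.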

Usage

# Enable auto-clear (recommended for benchmarking)
omlx serve --model-dir ~/models --paged-ssd-cache-clear-on-unload

# Environment variable
export OMLX_PAGED_SSD_CACHE_CLEAR_ON_UNLOAD=true
omlx serve --model-dir ~/models

# Config file (~/.omlx/settings.json)
{
  "paged_ssd_cache": {
    "clear_on_unload": true
  }
}

Test plan

  • Unit tests pass (3 new tests in test_engine_pool.py)
  • Configuration parsing works correctly
  • CLI argument parsing works correctly
  • Manual testing with multiple model loads/unloads
  • Documentation added (docs/CLEAR_SSD_CACHE_ON_UNLOAD.md)

Breaking changes

None. This feature is opt-in and disabled by default.

Checklist

  • Code follows project style guidelines
  • Tests added/updated
  • Documentation updated
  • No breaking changes

🤖 Generated with Claude Code

jundot and others added 30 commits March 29, 2026 19:15
- add oq_manager/hf_uploader fields to ServerState dataclass
- update KVCache reconstruct test to expect tensor shape offset
- update oQ predicate bits test for affine-only mode
- rewrite metal limit tests to match no-op behavior (jundot#429)
- fix memory fallback test mock to patch HAS_MLX
# Conflicts:
#	omlx/_version.py
#	packaging/venvstacks.toml
- Add model alias resolution for Claude model names
- Include capabilities field in models API response
- Add Ollama network setup documentation
- Fix tokenizer detection for Qwen3.5-Claude models

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add MEMORY/ directory (omlx runtime memory data)
- Add benchmark_qwen35.py (personal performance testing script)
- Add config/ directory (personal omlx configuration files)
- Add docs/CLAUDE_CODE_*.md and docs/NETWORK_DEPLOYMENT.md (personal documentation)
- Add specific scripts in scripts/ directory (personal utility scripts)

These files are user-specific and should not be committed to the repository.
GenerationBatch is not available in mlx-lm 0.31.2.
Disabled the import and patch code temporarily.
TODO: Re-enable when mlx-lm adds GenerationBatch back or alternative is found.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add complete CLAUDE.md documentation for the oMLX project to help
AI assistants understand the codebase architecture and development
workflow.

- Document project overview and core features
- Explain system architecture and components
- Provide development guide and testing instructions
- Record upstream changes (v0.3.5.dev1)
- Include technical stack and dependency information

This documentation enables better AI assistance for future development
tasks and maintains consistency with the project's AI workflow standards.
- Add graphify-out/ directory (Graphify tool output)
- Add user-specific configuration files (.graphify_python, GITHUB_ISSUE_TEMPLATE.md, hfd.sh)

These files are project-specific or generated artifacts that should not be
tracked in version control.
- Bump version to 0.3.5.dev2
- Sync the 20 latest upstream commits
- Record VLM performance improvement (2x speed)
- Add audio model extensions and voice cloning
- Update Metal cache optimization and IME input method fixes
- Sync timeout fixes and SSE stability improvements

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve scheduler.py conflict, adopting the upstream version:
- Restore the GenerationBatch monkey-patch
- Add mRoPE (multi-rotary position embedding) support
- Add VLM mRoPE integration tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Update documentation date: 2026-04-10 → 2026-04-14
- Core features: add DFlash speculative decoding engine (3-4x speedup)
- System architecture: add DFlashEngine node
- Key dependencies: add dflash-mlx
- Directory structure: add engine/dflash.py
- Core components: add DFlash speculative decoding engine description
- Recent changes: add commit record edb7244
- Important changes: add detailed DFlash engine feature notes
- Related resources: add dflash-mlx link

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add TurboQuantKVCache.merge monkey-patch support
- Improve mRoPE implementation using PromptProcessingBatch.prompt
- Fix burst-completion bug (jundot#557)
- Improve TurboQuantKV implementation, add _apply_turboquant_kv_empty
- Add .worktrees/ to .gitignore
- Keep local user configuration
- Bump version to v0.3.5-rc1
- Add the latest 20 upstream commit records
- Update important-changes summary
- Record TurboQuantKV optimizations and mRoPE improvements
- Update project version to v0.3.5
- Update dflash-mlx dependency to v0.1.3 (814c4a1)
- Move change records from CLAUDE.md to CHANGELOG.md
- Add a changelog link at the top of CLAUDE.md
- Streamline CLAUDE.md to focus on project structure and the development guide

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…se CLI

NSStatusItem.isVisible() only reflects the app's own setVisible: state,
so on Tahoe (26.x) it stays True even when ControlCenter hides the icon
or the user toggles it off in System Settings > Menu Bar. The 3s NSAlert
check in v0.3.6 never fired for affected users.

- Replace isVisible() with a frame check on the status item's button
  window (width/height and x-origin) which actually catches Tahoe hiding
- Retain the one-shot timer reference to prevent early PyObjC dealloc,
  bump delay 1s -> 3s for ControlCenter to settle
- NSAlert gains an "Open Menu Bar Settings" button that deep-links to
  System Settings via x-apple.systempreferences URL
- Re-check in health_timer so runtime toggle-off also triggers a warning
  (gated by _warned_hidden for once-per-session behavior)
- Add "omlx diagnose menubar" CLI that reports macOS version, app install,
  menubar process status, recent visibility log entries, and manual
  recovery steps, useful when the icon is gone and the menu is the only
  control surface

Apple's sandbox policy blocks programmatic re-enable on Tahoe: the real
visibility prefs live in group.com.apple.controlcenter's Group Container,
which third-party apps can't read or write. Writes to the legacy
com.apple.controlcenter.plist are ignored by Tahoe's ControlCenter
(verified on-device). So 0.3.7 focuses on detection plus guiding the
user to the right Settings pane.

Refs jundot#725 jundot#806
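
The frame check this commit describes can be sketched with duck-typed stand-ins for the PyObjC chain (NSStatusItem → button → window → frame). The predicate below is illustrative only; the accessor names and exact thresholds in omlx may differ.

```python
from collections import namedtuple

# Stand-ins for the NSRect structure returned by the button window's frame.
Size = namedtuple("Size", "width height")
Origin = namedtuple("Origin", "x y")
Frame = namedtuple("Frame", "origin size")

def status_item_visible(frame) -> bool:
    """Heuristic from the commit message: isVisible() is unreliable on
    Tahoe, but a zero-sized or off-screen button window reliably signals
    that ControlCenter has hidden the menu bar icon."""
    if frame is None:  # no button window at all
        return False
    return (frame.size.width > 0
            and frame.size.height > 0
            and frame.origin.x >= 0)
```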
afanty2021 and others added 27 commits April 17, 2026 15:49
Merge 30 upstream commits, main updates:
- Upgrade to v0.3.6
- Upgrade dflash-mlx to v0.1.3
- Fix Tahoe menu bar hidden-state detection
- Fix VLM tool message formatting
- Fix Jina reranker scoring
- oQ quantization float16 option (M1/M2 speedup)
- Remove LSUIElement to prevent ControlCenter blocking

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	omlx/cli.py
#	packaging/omlx_app/app.py
- Add model alias resolution for Claude model names
- Include capabilities field in models API response
- Add Ollama network setup documentation
- Fix tokenizer detection for Qwen3.5-Claude models

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add MEMORY/ directory (omlx runtime memory data)
- Add benchmark_qwen35.py (personal performance testing script)
- Add config/ directory (personal omlx configuration files)
- Add docs/CLAUDE_CODE_*.md and docs/NETWORK_DEPLOYMENT.md (personal documentation)
- Add specific scripts in scripts/ directory (personal utility scripts)

These files are user-specific and should not be committed to the repository.
GenerationBatch is not available in mlx-lm 0.31.2.
Disabled the import and patch code temporarily.
TODO: Re-enable when mlx-lm adds GenerationBatch back or alternative is found.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add complete CLAUDE.md documentation for the oMLX project to help
AI assistants understand the codebase architecture and development
workflow.

- Document project overview and core features
- Explain system architecture and components
- Provide development guide and testing instructions
- Record upstream changes (v0.3.5.dev1)
- Include technical stack and dependency information

This documentation enables better AI assistance for future development
tasks and maintains consistency with the project's AI workflow standards.
- Add graphify-out/ directory (Graphify tool output)
- Add user-specific configuration files (.graphify_python, GITHUB_ISSUE_TEMPLATE.md, hfd.sh)

These files are project-specific or generated artifacts that should not be
tracked in version control.
- Bump version to 0.3.5.dev2
- Sync the 20 latest upstream commits
- Record VLM performance improvement (2x speed)
- Add audio model extensions and voice cloning
- Update Metal cache optimization and IME input method fixes
- Sync timeout fixes and SSE stability improvements

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Update documentation date: 2026-04-10 → 2026-04-14
- Core features: add DFlash speculative decoding engine (3-4x speedup)
- System architecture: add DFlashEngine node
- Key dependencies: add dflash-mlx
- Directory structure: add engine/dflash.py
- Core components: add DFlash speculative decoding engine description
- Recent changes: add commit record edb7244
- Important changes: add detailed DFlash engine feature notes
- Related resources: add dflash-mlx link

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Bump version to v0.3.5-rc1
- Add the latest 20 upstream commit records
- Update important-changes summary
- Record TurboQuantKV optimizations and mRoPE improvements
- Update project version to v0.3.5
- Update dflash-mlx dependency to v0.1.3 (814c4a1)
- Move change records from CLAUDE.md to CHANGELOG.md
- Add a changelog link at the top of CLAUDE.md
- Streamline CLAUDE.md to focus on project structure and the development guide

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove Git merge conflict markers
- Keep mRoPE (multi-dimensional RoPE) support code for vision-language models such as Qwen3-VL/3.5
- mlx-lm commit dcbf6e3 has reintroduced GenerationBatch, so it is safe to re-enable
- Add Gemma 4 tokenizer fix (clear extra_special_tokens)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Set extra_special_tokens to empty dict instead of None to avoid
AttributeError when transformers tries to call .keys() on the value.

Gemma 4 configs have extra_special_tokens as a list, but transformers
expects a dict. Setting it to an empty dict overrides the config value
and prevents: "AttributeError: 'list' object has no attribute 'keys'"

Also add test cases to verify the fix works correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
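
The fix this commit describes can be sketched as a small normalization step applied to the tokenizer kwargs before loading. This is an illustrative sketch, not the actual omlx code; the function name and kwargs shape are assumptions.

```python
def sanitize_extra_special_tokens(tokenizer_kwargs: dict) -> dict:
    """Hypothetical sketch of the described fix: transformers expects
    extra_special_tokens to be a dict (it calls .keys() on it), but some
    Gemma 4 configs ship it as a list. Forcing a dict avoids
    "AttributeError: 'list' object has no attribute 'keys'"."""
    if not isinstance(tokenizer_kwargs.get("extra_special_tokens"), dict):
        # Override the bad config value with an empty dict, not None:
        # None would also break the .keys() call.
        tokenizer_kwargs["extra_special_tokens"] = {}
    return tokenizer_kwargs
```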
- Update mlx-lm to a401730 (MiniMax M2 parallel tool fix + BatchKVCache extend fix)
- Add regex dependency for Gemma 4 tool parser
- Sync venvstacks.toml with pyproject.toml

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…VLM detection

- /v1/models now returns all available models instead of just the first one
- Removed hardcoded Claude aliases, use model_type for capability detection
- Added chat template fallback for models with tokenizer.chat_template = None
- Fixed quantized Gemma 4 detection (4bit/8bit models are text-only)
- Fixed image upload not cleared when switching to non-VLM model
- Fixed Qwen3.5/3.6-A3B tokenizer_class override (TokenizersBackend → Qwen2Tokenizer)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…bility

Gemma 4 models require transformers >=5.5.0, but omlx needs <5.4.0
due to VLM breaking changes in 5.4.0. Override tokenizer_class from
GemmaTokenizer (new) to GemmaTokenizer (Gemma 2/3) which is compatible.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch mlx-lm, mlx-vlm, and mlx-embeddings to use local file:// paths
instead of git+https URLs for faster development and testing.

Local paths:
- /Users/berton/Github/mlx-lm
- /Users/berton/Github/mlx-vlm
- /Users/berton/Github/mlx-embeddings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch dflash-mlx to use local file:// path instead of git URL
for faster development and testing.

Local path: /Users/berton/Github/dflash-mlx

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Reduce max_tokens from 16384 to 8192
- Lower execution timeout from 30s to 5s
- Decrease test cases from 3 to 1

These changes improve benchmark speed while maintaining
basic validation capability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	omlx/server.py
Add automatic SSD cache clearing when models are unloaded,
useful for benchmarking scenarios where cache persistence
across model loads is not needed.

Features:
- New CLI flag: --paged-ssd-cache-clear-on-unload
- Environment variable: OMLX_PAGED_SSD_CACHE_CLEAR_ON_UNLOAD
- Config file option: clear_on_unload in paged_ssd_cache section

Implementation:
- Clears SSD cache before engine stop in _unload_engine()
- Safe attribute access with exception handling
- Comprehensive test coverage (3 new tests)

Fixes issue where SSD write queue becomes full during
multi-model benchmarking due to accumulated cache.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jundot force-pushed the main branch 2 times, most recently from 7844f15 to b078330 (April 28, 2026 02:11)
