feat: Auto-clear SSD cache on model unload #885
Open
afanty2021 wants to merge 57 commits into jundot:main
- add oq_manager/hf_uploader fields to ServerState dataclass - update KVCache reconstruct test to expect tensor shape offset - update oQ predicate bits test for affine-only mode - rewrite metal limit tests to match no-op behavior (jundot#429) - fix memory fallback test mock to patch HAS_MLX
# Conflicts: # omlx/_version.py # packaging/venvstacks.toml
- Add model alias resolution for Claude model names - Include capabilities field in models API response - Add Ollama network setup documentation - Fix tokenizer detection for Qwen3.5-Claude models Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add MEMORY/ directory (omlx runtime memory data) - Add benchmark_qwen35.py (personal performance testing script) - Add config/ directory (personal omlx configuration files) - Add docs/CLAUDE_CODE_*.md and docs/NETWORK_DEPLOYMENT.md (personal documentation) - Add specific scripts in scripts/ directory (personal utility scripts) These files are user-specific and should not be committed to the repository.
GenerationBatch is not available in mlx-lm 0.31.2. Disabled the import and patch code temporarily. TODO: Re-enable when mlx-lm adds GenerationBatch back or alternative is found. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add complete CLAUDE.md documentation for the oMLX project to help AI assistants understand the codebase architecture and development workflow. - Document project overview and core features - Explain system architecture and components - Provide development guide and testing instructions - Record upstream changes (v0.3.5.dev1) - Include technical stack and dependency information This documentation enables better AI assistance for future development tasks and maintains consistency with the project's AI workflow standards.
- Add graphify-out/ directory (Graphify tool output) - Add user-specific configuration files (.graphify_python, GITHUB_ISSUE_TEMPLATE.md, hfd.sh) These files are project-specific or generated artifacts that should not be tracked in version control.
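- Bump version to 0.3.5.dev2 - Sync the 20 latest upstream commits - Record the VLM performance improvement (2x speed) - Add audio model extensions and voice cloning - Update Metal cache optimizations and the IME input fix - Sync timeout fixes and SSE stability improvements Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve the scheduler.py conflict by taking the upstream version: - Restore the GenerationBatch monkey-patch - Add mRoPE (multi-rotary position embedding) support - Add VLM mRoPE integration tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Update document date: 2026-04-10 → 2026-04-14 - Core features: add the DFlash speculative decoding engine (3-4x speedup) - System architecture: add the DFlashEngine node - Key dependencies: add dflash-mlx - Directory structure: add engine/dflash.py - Core components: add a description of the DFlash speculative decoding engine - Recent changes: add the commit record for edb7244 - Important changes: add a detailed description of the DFlash engine features - Related resources: add the dflash-mlx link Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add TurboQuantKVCache.merge monkey-patch support - Improve the mRoPE implementation to use PromptProcessingBatch.prompt - Fix burst-completion bug (jundot#557) - Improve the TurboQuantKV implementation, add _apply_turboquant_kv_empty - Add .worktrees/ to .gitignore - Keep local user configuration
- Bump version to v0.3.5-rc1 - Add the latest 20 upstream commit records - Update the summary of important changes - Record TurboQuantKV optimizations and mRoPE improvements
- Bump project version to v0.3.5 - Update the dflash-mlx dependency to v0.1.3 (814c4a1) - Move change records from CLAUDE.md to CHANGELOG.md - Add a changelog link at the top of CLAUDE.md - Trim CLAUDE.md to focus on project structure and the development guide Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>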
…se CLI
NSStatusItem.isVisible() only reflects the app's own setVisible: state, so on Tahoe (26.x) it stays True even when ControlCenter hides the icon or the user toggles it off in System Settings > Menu Bar. The 3s NSAlert check in v0.3.6 never fired for affected users.
- Replace isVisible() with a frame check on the status item's button window (width/height and x-origin), which actually catches Tahoe hiding
- Retain the one-shot timer reference to prevent early PyObjC dealloc, bump delay 1s -> 3s for ControlCenter to settle
- NSAlert gains an "Open Menu Bar Settings" button that deep-links to System Settings via x-apple.systempreferences URL
- Re-check in health_timer so runtime toggle-off also triggers a warning (gated by _warned_hidden for once-per-session behavior)
- Add "omlx diagnose menubar" CLI that reports macOS version, app install, menubar process status, recent visibility log entries, and manual recovery steps, useful when the icon is gone and the menu is the only control surface
Apple's sandbox policy blocks programmatic re-enable on Tahoe: the real visibility prefs live in group.com.apple.controlcenter's Group Container, which third-party apps can't read or write. Writes to the legacy com.apple.controlcenter.plist are ignored by Tahoe's ControlCenter (verified on-device). So 0.3.7 focuses on detection plus guiding the user to the right Settings pane.
Refs jundot#725 jundot#806
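Merge 30 upstream commits. Main updates: - Upgrade to v0.3.6 - Upgrade dflash-mlx to v0.1.3 - Fix Tahoe menu bar hidden-state detection - Fix VLM tool message formatting - Fix Jina reranker scoring - oQ quantization float16 option (faster on M1/M2) - Remove LSUIElement to prevent ControlCenter blocking Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>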
# Conflicts: # omlx/cli.py # packaging/omlx_app/app.py
- Add model alias resolution for Claude model names - Include capabilities field in models API response - Add Ollama network setup documentation - Fix tokenizer detection for Qwen3.5-Claude models Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add MEMORY/ directory (omlx runtime memory data) - Add benchmark_qwen35.py (personal performance testing script) - Add config/ directory (personal omlx configuration files) - Add docs/CLAUDE_CODE_*.md and docs/NETWORK_DEPLOYMENT.md (personal documentation) - Add specific scripts in scripts/ directory (personal utility scripts) These files are user-specific and should not be committed to the repository.
GenerationBatch is not available in mlx-lm 0.31.2. Disabled the import and patch code temporarily. TODO: Re-enable when mlx-lm adds GenerationBatch back or alternative is found. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add complete CLAUDE.md documentation for the oMLX project to help AI assistants understand the codebase architecture and development workflow. - Document project overview and core features - Explain system architecture and components - Provide development guide and testing instructions - Record upstream changes (v0.3.5.dev1) - Include technical stack and dependency information This documentation enables better AI assistance for future development tasks and maintains consistency with the project's AI workflow standards.
- Add graphify-out/ directory (Graphify tool output) - Add user-specific configuration files (.graphify_python, GITHUB_ISSUE_TEMPLATE.md, hfd.sh) These files are project-specific or generated artifacts that should not be tracked in version control.
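- Bump version to 0.3.5.dev2 - Sync the 20 latest upstream commits - Record the VLM performance improvement (2x speed) - Add audio model extensions and voice cloning - Update Metal cache optimizations and the IME input fix - Sync timeout fixes and SSE stability improvements Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Update document date: 2026-04-10 → 2026-04-14 - Core features: add the DFlash speculative decoding engine (3-4x speedup) - System architecture: add the DFlashEngine node - Key dependencies: add dflash-mlx - Directory structure: add engine/dflash.py - Core components: add a description of the DFlash speculative decoding engine - Recent changes: add the commit record for edb7244 - Important changes: add a detailed description of the DFlash engine features - Related resources: add the dflash-mlx link Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Bump version to v0.3.5-rc1 - Add the latest 20 upstream commit records - Update the summary of important changes - Record TurboQuantKV optimizations and mRoPE improvements
- Bump project version to v0.3.5 - Update the dflash-mlx dependency to v0.1.3 (814c4a1) - Move change records from CLAUDE.md to CHANGELOG.md - Add a changelog link at the top of CLAUDE.md - Trim CLAUDE.md to focus on project structure and the development guide Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove Git merge conflict markers - Keep the mRoPE (multi-dimensional RoPE) support code used by vision-language models such as Qwen3-VL/3.5 - mlx-lm commit dcbf6e3 reintroduced GenerationBatch, so it can be safely enabled - Add Gemma 4 tokenizer fix (clear extra_special_tokens) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>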
Set extra_special_tokens to empty dict instead of None to avoid AttributeError when transformers tries to call .keys() on the value. Gemma 4 configs have extra_special_tokens as a list, but transformers expects a dict. Setting it to an empty dict overrides the config value and prevents: "AttributeError: 'list' object has no attribute 'keys'" Also add test cases to verify the fix works correctly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Update mlx-lm to a401730 (MiniMax M2 parallel tool fix + BatchKVCache extend fix) - Add regex dependency for Gemma 4 tool parser - Sync venvstacks.toml with pyproject.toml Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…VLM detection - /v1/models now returns all available models instead of just the first one - Removed hardcoded Claude aliases, use model_type for capability detection - Added chat template fallback for models with tokenizer.chat_template = None - Fixed quantized Gemma 4 detection (4bit/8bit models are text-only) - Fixed image upload not cleared when switching to non-VLM model - Fixed Qwen3.5/3.6-A3B tokenizer_class override (TokenizersBackend → Qwen2Tokenizer) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…bility Gemma 4 models require transformers >=5.5.0, but omlx needs <5.4.0 due to VLM breaking changes in 5.4.0. Override tokenizer_class from GemmaTokenizer (new) to GemmaTokenizer (Gemma 2/3) which is compatible. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch mlx-lm, mlx-vlm, and mlx-embeddings to use local file:// paths instead of git+https URLs for faster development and testing. Local paths: - /Users/berton/Github/mlx-lm - /Users/berton/Github/mlx-vlm - /Users/berton/Github/mlx-embeddings Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch dflash-mlx to use local file:// path instead of git URL for faster development and testing. Local path: /Users/berton/Github/dflash-mlx Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Reduce max_tokens from 16384 to 8192 - Lower execution timeout from 30s to 5s - Decrease test cases from 3 to 1 These changes improve benchmark speed while maintaining basic validation capability. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts: # omlx/server.py
Add automatic SSD cache clearing when models are unloaded, useful for benchmarking scenarios where cache persistence across model loads is not needed.
Features:
- New CLI flag: --paged-ssd-cache-clear-on-unload
- Environment variable: OMLX_PAGED_SSD_CACHE_CLEAR_ON_UNLOAD
- Config file option: clear_on_unload in paged_ssd_cache section
Implementation:
- Clears SSD cache before engine stop in _unload_engine()
- Safe attribute access with exception handling
- Comprehensive test coverage (3 new tests)
Fixes issue where SSD write queue becomes full during multi-model benchmarking due to accumulated cache.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
7844f15 to b078330
Summary
Add automatic SSD cache clearing when models are unloaded, useful for benchmarking scenarios where cache persistence across model loads is not needed.
Problem
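When running multiple models in sequence for benchmarking:
- The SSD cache accumulates across model loads
- The SSD write queue becomes full
Solution
Add a clear_ssd_cache_on_unload configuration option that clears the SSD cache before the engine is stopped whenever a model is unloaded.
Changes
Configuration
- CLI flag: --paged-ssd-cache-clear-on-unload
- Environment variable: OMLX_PAGED_SSD_CACHE_CLEAR_ON_UNLOAD
- Config file option: clear_on_unload in the paged_ssd_cache section
Implementation
- Clear the SSD cache before engine stop in _unload_engine() in omlx/engine_pool.py
- Safe attribute access with exception handling
Usage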
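As a rough illustration of how the three configuration sources named above might be combined, the sketch below resolves a single boolean from the CLI flag, the environment variable, and the config file. The function name `resolve_clear_on_unload`, the precedence order (CLI, then environment, then config file), and the set of accepted truthy strings are all assumptions made for this example; the PR does not show how oMLX actually parses these settings.

```python
import argparse

ENV_VAR = "OMLX_PAGED_SSD_CACHE_CLEAR_ON_UNLOAD"
# Assumed set of strings treated as "enabled"; not confirmed by the PR.
TRUTHY = {"1", "true", "yes", "on"}


def resolve_clear_on_unload(argv, environ, file_config):
    """Return the effective setting.

    Assumed precedence: CLI flag > environment variable > config file.
    Falls back to False, matching the opt-in default described in the PR.
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument(
        "--paged-ssd-cache-clear-on-unload",
        action="store_true",
        default=None,  # None means "flag not given"
    )
    args, _unknown = parser.parse_known_args(argv)
    if args.paged_ssd_cache_clear_on_unload:
        return True
    if ENV_VAR in environ:
        return environ[ENV_VAR].strip().lower() in TRUTHY
    section = file_config.get("paged_ssd_cache", {})
    return bool(section.get("clear_on_unload", False))
```

A caller would typically pass `sys.argv[1:]`, `os.environ`, and the parsed config file as a dict. Taking all three as arguments keeps the function easy to test without touching the real process environment.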
Test plan
- 3 new tests (test_engine_pool.py)
- Documentation (docs/CLEAR_SSD_CACHE_ON_UNLOAD.md)
Breaking changes
None. This feature is opt-in and disabled by default.
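To make the opt-in behavior concrete, here is a minimal sketch of an unload path that clears the cache before stopping the engine, using safe attribute access and exception handling as the commit message describes. It is a synchronous simplification: `unload_engine`, the `ssd_cache` attribute, and the `clear()` and `stop()` methods are hypothetical names, not the actual signatures in omlx/engine_pool.py.

```python
import logging

logger = logging.getLogger(__name__)


def unload_engine(engine, clear_on_unload=False):
    """Stop `engine`, optionally clearing its SSD cache first.

    `engine.ssd_cache`, `.clear()` and `.stop()` are hypothetical names
    used only for this illustration.
    """
    if clear_on_unload:
        # Safe attribute access: engines without an SSD cache are skipped.
        cache = getattr(engine, "ssd_cache", None)
        clear = getattr(cache, "clear", None)
        if callable(clear):
            try:
                clear()
            except Exception:
                # A failed clear must not prevent the unload itself.
                logger.exception("SSD cache clear failed; continuing unload")
    # With the default clear_on_unload=False, only this line runs.
    engine.stop()
```

With the flag left at its default, the function does nothing beyond stopping the engine, which is why enabling the feature is the only way to change existing behavior.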
Checklist
🤖 Generated with Claude Code