feat(atomic-chat): B2/M4 — DFlash speculative decoding + Kakeya target KV#59
Draft
FluffyAIcode wants to merge 3 commits into AgentMemory/atomic-chat-b2-mlx-dflash-kakeya-04ae
DFlash's MLX driver manages the target KV cache internally, and the 2026
release track has three different hook surfaces across revisions. Instead of
pinning to one dflash version, feature-detect at engine startup and pick
whichever is available:

- Strategy A - KWARG passthrough (cleanest): stream_generate exposes a
  target_cache / caches / target_caches / prompt_cache kwarg.
- Strategy B - model.make_cache monkey-patch: mlx-lm models expose a
  make_cache() that dflash delegates to.
- Strategy C - module-level mlx_lm.models.cache.make_prompt_cache patch:
  some dflash revisions call this function directly.
- Strategy FALLBACK - single-track MLX decode + KakeyaLatticeMLXCache.
  Loses the speculative speedup but keeps the KV compression benefit.
  Invoked loudly on unknown dflash APIs.

All three non-fallback strategies are context-managed via activate(), so
state mutations are cleanly rolled back on exit. Detection runs exactly once
per model load (not per request).

Tests (8 passed, no MLX / no dflash required):
- strategy detection across four synthetic signatures
- activate() state roundtrip for each strategy
- unknown-kwarg downgrade to fallback
- monkey-patched make_cache restores the original on __exit__
- build() passes variant/q_range/boundary through to the factory
- unknown strategy raises RuntimeError (no silent fall-through)

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
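The startup-time feature detection described above could be sketched roughly
as follows. All names here (InjectionStrategy, detect_strategy, the parameter
names) are illustrative assumptions, not the PR's actual API:

```python
import inspect
from enum import Enum, auto

class InjectionStrategy(Enum):
    KWARG = auto()
    MODEL_MAKE_CACHE = auto()
    MODULE_PROMPT_CACHE = auto()
    FALLBACK_NATIVE_MLX = auto()

# Cache kwarg names seen across the hypothetical dflash revisions.
_CACHE_KWARGS = ("target_cache", "caches", "target_caches", "prompt_cache")

def detect_strategy(stream_generate, model, cache_module) -> InjectionStrategy:
    """Pick the first injection surface the installed dflash exposes."""
    # Strategy A: the generate entry point accepts a cache kwarg directly.
    params = inspect.signature(stream_generate).parameters
    if any(k in params for k in _CACHE_KWARGS):
        return InjectionStrategy.KWARG
    # Strategy B: the model exposes make_cache() we can monkey-patch.
    if callable(getattr(model, "make_cache", None)):
        return InjectionStrategy.MODEL_MAKE_CACHE
    # Strategy C: dflash calls the module-level prompt-cache factory.
    if hasattr(cache_module, "make_prompt_cache"):
        return InjectionStrategy.MODULE_PROMPT_CACHE
    # Unknown API surface: downgrade loudly to single-track MLX decode.
    return InjectionStrategy.FALLBACK_NATIVE_MLX
```

Running this once per model load (rather than per request) keeps the
signature inspection off the hot path.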
M4 wires cache_injection into engine_mlx and opens the
/v1/chat/completions route (non-stream JSON + SSE stream).
engine_mlx.py:
- _LoadedMLXModel pre-computes InjectionDecision at load time
(not per request); logs which strategy was chosen.
- chat() + chat_stream() delegate to the same _run_stream
generator, which branches between DFlash+Kakeya and native-MLX
fallback on _injection_decision.strategy.
- _dflash_iter_factory abstraction lets tests substitute a fake
dflash output stream without touching the real import.
- An override dict lets callers mutate variant / q_range / boundary
  per request; the engine rebuilds an MLXChannel dataclass for the
  request scope only.
- Chat-template rendering uses tokenizer.apply_chat_template when
available, with a flat prompt fallback.
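The chat-template rendering with flat-prompt fallback can be sketched like
this. The helper name render_prompt is hypothetical; apply_chat_template
follows the Hugging Face tokenizer convention that mlx-lm tokenizers wrap:

```python
def render_prompt(tokenizer, messages):
    """Render chat messages via the tokenizer's template when present."""
    if hasattr(tokenizer, "apply_chat_template") and getattr(
        tokenizer, "chat_template", None
    ):
        return tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
    # Flat fallback: "role: content" lines plus a trailing assistant cue.
    lines = [f"{m['role']}: {m['content']}" for m in messages]
    lines.append("assistant:")
    return "\n".join(lines)
```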
server.py:
- /v1/chat/completions opens (no more 503).
- Pydantic request model with extra=allow so future OpenAI
additions flow through untouched.
- stream=true returns an SSE StreamingResponse mirroring B1's
chunk schema.
- 404 on KeyError (unknown model); 501 on NotImplementedError
(future unsupported features).
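The error mapping above can be sketched framework-neutrally; the function
name handle_chat_request and the payload shapes are illustrative, not the
server's actual route code:

```python
def handle_chat_request(engine, body: dict):
    """Return (status, payload), mapping engine errors to HTTP codes."""
    model = body.get("model", "")
    try:
        reply = engine.chat(model, body.get("messages", []))
        return 200, {"choices": [{"message": reply}]}
    except KeyError:
        # Engine raises KeyError for a model id missing from the registry.
        return 404, {"error": f"unknown model: {model}"}
    except NotImplementedError as exc:
        # Reserved for features the engine does not support yet.
        return 501, {"error": str(exc)}
```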
Fix: the earlier skeleton checked 'stop' before yielding, which
dropped the piece containing the stop substring. Swap the order so
a stop substring ends the loop AFTER emitting the carrier piece.
Applied to both DFlash and native-MLX paths.
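The corrected yield-first-then-check ordering can be shown in a minimal
sketch (emit_until_stop is a hypothetical helper, not the engine's name):

```python
def emit_until_stop(pieces, stop=None):
    """Yield each generated piece; end AFTER the piece carrying the stop."""
    text = ""
    for piece in pieces:
        yield piece            # emit first, so the carrier piece survives
        text += piece
        if stop and stop in text:
            break              # then end the stream
```

Checking the accumulated text (not just the latest piece) also catches a
stop substring that straddles a piece boundary.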
Tests (24 passed, stubs-only):
- test_server_skeleton.py updated: drop 503 assertion; add
test_chat_unknown_model_returns_404 via monkeypatched engine.
- test_engine_routing.py (4 tests, all mock-based):
* DFlash path aggregates pieces + averages acceptance lengths
* DFlash stream honours stop substring
* Mistral (dflash_available=False) falls back to native-MLX
* per-request override rebuilds channel with new q_range
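The _dflash_iter_factory seam makes these mock-based tests cheap: a fake
stream can stand in for dflash output. The shapes below (piece,
acceptance_length tuples, the aggregate helper) are assumptions for
illustration, not the test suite's actual fixtures:

```python
class FakeDFlashStream:
    """Stands in for a dflash output stream of (piece, acceptance_length)."""
    def __init__(self, steps):
        self.steps = steps
    def __iter__(self):
        return iter(self.steps)

def aggregate(stream):
    """Join pieces and average acceptance lengths, as the DFlash path does."""
    pieces, lengths = [], []
    for piece, acc_len in stream:
        pieces.append(piece)
        lengths.append(acc_len)
    mean = sum(lengths) / len(lengths) if lengths else 0.0
    return "".join(pieces), mean
```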
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…uct pivot
ROADMAP updates:
- Status line advances M1-M3 -> M1-M4.
- M4 section expanded with the three cache_injection strategies
and the 32-test green status.
- M6 re-scoped from 'Atomic-Chat extension backend option' to
'independent product form evaluation'. The standalone-product
proposal lands alongside M5 benchmark in the next PR.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Summary

M4 delivery: wires DFlash's (z-lab) block-diffusion speculative decoding
into the B2 sidecar, routes the target model's KV cache through
KakeyaLatticeMLXCache E8 compression automatically, and officially opens
/v1/chat/completions (stream + non-stream).

Dependencies

AgentMemory/atomic-chat-b2-mlx-dflash-kakeya-04ae (B2 M1-M3 skeleton, PR #58)

Core deliverables
cache_injection.py: three-strategy DFlash KV injection

The DFlash 2026 release series has three different "target KV cache
injection points", each tied to a different dflash API surface. This PR
does not pin to any one version; it feature-detects the most suitable one:

- KWARG: the stream_generate signature has a target_cache / caches /
  target_caches / prompt_cache kwarg
- MODEL_MAKE_CACHE: the model exposes a .make_cache() method
- MODULE_PROMPT_CACHE: mlx_lm.models.cache.make_prompt_cache exists
- FALLBACK_NATIVE_MLX: single-track native MLX decode

All non-fallback strategies run inside the activate() context manager;
state is rolled back cleanly on exit and never leaks into later requests.

engine_mlx.py: two paths

_LoadedMLXModel precomputes the InjectionDecision at load time. Requests go
through KakeyaCacheInjector.activate() + dflash.model_mlx.stream_generate(...),
recording acceptance_length at each step; the fallback path uses
mlx_lm.generate.stream_generate(prompt_cache=caches), which still gets KV
compression but no speculative speedup. Both paths converge on the
_run_stream(...) generator, so chat() and chat_stream() reuse the same logic.

server.py: /v1/chat/completions opens

ChatCompletionRequest (extra=allow, unknown fields pass through untouched).
stream=true returns an SSE StreamingResponse; stream=false returns
OpenAI-style JSON. The x_kakeya response field carries dflash_used,
injection_strategy, and acceptance_length_mean.

Bug fix (from M3)

The M3 skeleton had a small bug: the stop-substring check ran before the
yield, swallowing the entire piece containing the stop. M4 switches to
yield-first-then-check; both paths are fixed.
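Stepping back to cache_injection.py, the context-managed rollback can be
sketched like this. The class shape and attribute names are assumptions for
illustration, not the PR's exact API:

```python
import contextlib

class CacheInjectorSketch:
    """Swap model.make_cache for a Kakeya-backed factory, then restore it."""

    def __init__(self, model, kakeya_make_cache):
        self.model = model
        self.kakeya_make_cache = kakeya_make_cache

    @contextlib.contextmanager
    def activate(self):
        original = self.model.make_cache
        self.model.make_cache = self.kakeya_make_cache
        try:
            yield self
        finally:
            # Rollback happens even if generation raises, so no state
            # leaks into the next request.
            self.model.make_cache = original
```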
Tests

104/104 green (also runs on Linux CI without MLX/dflash):

- kakeyalattice_mlx/tests/ (M1-M2)
- kakeya_sidecar_mlx/tests/test_model_registry_mlx.py
- kakeya_sidecar_mlx/tests/test_server_skeleton.py
- kakeya_sidecar_mlx/tests/test_cache_injection.py (new)
- kakeya_sidecar_mlx/tests/test_engine_routing.py (new)

Key engineering decisions
- cache_injection.py is ~100 lines longer, in exchange for running on any
  dflash minor version.
- RotatingKVCache is kept as-is (draft KV compression is deferred until
  after Phase 2 M5+).
- On a stop substring, the piece containing the stop is still emitted, then
  the stream ends immediately, matching B1's HF streamer default behavior.

Non-goals (not in this PR)
Follow-ups

- Benchmark (Qwen3-8B × {bf16, Kakeya Q=38/10/4} × DFlash-b16), measured
  on gsm8k + humaneval
- docs/B2_STANDALONE_PRODUCT_PROPOSAL.md: reposition B2 from "Atomic-Chat
  backend" to a standalone local-inference product for the Mac