feat(atomic-chat): B2/M4 — DFlash speculative decoding + Kakeya target KV #59

Draft

FluffyAIcode wants to merge 3 commits into AgentMemory/atomic-chat-b2-mlx-dflash-kakeya-04ae from AgentMemory/atomic-chat-b2-m4-dflash-integration-04ae

Conversation

@FluffyAIcode

Summary

M4 delivers: DFlash (z-lab) block-diffusion speculative decoding wired into the B2 sidecar, with the target model's KV cache automatically routed through KakeyaLatticeMLXCache E8 compression; /v1/chat/completions officially opens (stream + non-stream).

Dependencies

  • Base: AgentMemory/atomic-chat-b2-mlx-dflash-kakeya-04ae (B2 M1-M3 skeleton, PR #58)
  • Blocker: merge the B2 branch before this one; alternatively, rebase onto main and merge independently. Either works.

Core deliverables

cache_injection.py — three-strategy DFlash KV injection

The DFlash 2026 release series exposes three different "target KV cache injection points", each tied to a different dflash API surface. This PR does not pin to any one version; instead, it feature-detects and picks the most suitable strategy:

| Strategy | Trigger | Intrusiveness |
| --- | --- | --- |
| KWARG | stream_generate signature accepts a target_cache / caches / target_caches / prompt_cache kwarg | lightest |
| MODEL_MAKE_CACHE | target model exposes a .make_cache() method | one method monkey-patch (context-managed) |
| MODULE_PROMPT_CACHE | mlx_lm.models.cache.make_prompt_cache exists | module-level patch |
| FALLBACK_NATIVE_MLX | none of the above | degrades to single-track MLX decode + Kakeya KV: loses the speedup, keeps the compression |
All non-fallback strategies run inside an activate() context manager; on exit, state is cleanly rolled back so later requests are never polluted. A sketch of the shape follows.
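A minimal sketch of the detect-then-activate shape, assuming nothing beyond the four strategies in the table above (detect_strategy, kakeya_cache_factory, and the patching details are illustrative, not the PR's exact API):

```python
import contextlib
import enum
import inspect


class InjectionStrategy(enum.Enum):
    KWARG = "kwarg"
    MODEL_MAKE_CACHE = "model_make_cache"
    MODULE_PROMPT_CACHE = "module_prompt_cache"
    FALLBACK_NATIVE_MLX = "fallback_native_mlx"


_CACHE_KWARGS = ("target_cache", "caches", "target_caches", "prompt_cache")


def detect_strategy(stream_generate, target_model, cache_module):
    """Run once per model load, checked in cheapest-to-most-invasive order."""
    params = inspect.signature(stream_generate).parameters
    if any(k in params for k in _CACHE_KWARGS):
        return InjectionStrategy.KWARG                # pass the cache as a kwarg
    if callable(getattr(target_model, "make_cache", None)):
        return InjectionStrategy.MODEL_MAKE_CACHE     # patch one method
    if hasattr(cache_module, "make_prompt_cache"):
        return InjectionStrategy.MODULE_PROMPT_CACHE  # patch at module level
    return InjectionStrategy.FALLBACK_NATIVE_MLX      # no speculative path


@contextlib.contextmanager
def activate(strategy, target_model, kakeya_cache_factory):
    """Install the Kakeya cache factory; always roll back on exit."""
    if strategy is InjectionStrategy.MODEL_MAKE_CACHE:
        original = target_model.make_cache
        target_model.make_cache = kakeya_cache_factory
        try:
            yield
        finally:
            target_model.make_cache = original  # later requests see a clean model
    else:
        # KWARG passes the cache explicitly; MODULE_PROMPT_CACHE is the
        # analogous patch on the module object (elided); fallback never patches.
        yield
```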

engine_mlx.py — two paths

  • DFlash path: _LoadedMLXModel precomputes an InjectionDecision at load time; requests run KakeyaCacheInjector.activate() + dflash.model_mlx.stream_generate(...), recording acceptance_length at every step
  • Native MLX fallback: with no DFlash draft, or when injection fails, requests run mlx_lm.generate.stream_generate(prompt_cache=caches), still getting KV compression but no speculative speedup.

Both paths converge on a single _run_stream(...) generator; chat() and chat_stream() reuse the same logic.
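Roughly how that shared generator might branch (a sketch reusing the InjectionStrategy enum from above; _native_stream and the step fields are stand-ins, not the PR's real names):

```python
def _run_stream(self, prompt: str, stop: str | None = None):
    """One generator behind chat() and chat_stream() (sketch only)."""
    decision = self._injection_decision  # computed once, at model load time
    if decision.strategy is not InjectionStrategy.FALLBACK_NATIVE_MLX:
        with self._injector.activate():
            for step in self._dflash_iter_factory(prompt):
                self._acceptance_lengths.append(step.acceptance_length)
                yield step.text
                if stop and stop in step.text:
                    return  # yield first, then check (see the M3 bug fix below)
    else:
        for piece in self._native_stream(prompt):  # mlx_lm stream_generate path
            yield piece
            if stop and stop in piece:
                return
```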

server.py — /v1/chat/completions opens

  • Pydantic ChatCompletionRequest (extra=allow, unknown fields pass through)
  • stream=true returns an SSE StreamingResponse; stream=false returns OpenAI-style JSON
  • the x_kakeya response field carries dflash_used, injection_strategy, and acceptance_length_mean
  • unknown model → 404; unimplemented feature → 501; any other exception → 500
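A sketch of the route's shape, assuming a FastAPI app (which the Pydantic model and SSE StreamingResponse suggest); get_engine and sse_chunks are hypothetical helpers, not names from this PR:

```python
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, ConfigDict

app = FastAPI()


class ChatCompletionRequest(BaseModel):
    model_config = ConfigDict(extra="allow")  # unknown OpenAI fields pass through
    model: str
    messages: list[dict]
    stream: bool = False


@app.post("/v1/chat/completions")
def chat_completions(req: ChatCompletionRequest):
    try:
        engine = get_engine(req.model)  # hypothetical registry lookup
    except KeyError:
        raise HTTPException(404, f"unknown model: {req.model}")
    try:
        if req.stream:
            return StreamingResponse(
                sse_chunks(engine, req),  # hypothetical SSE chunk generator
                media_type="text/event-stream",
            )
        return engine.chat(req)  # OpenAI-style JSON plus the x_kakeya extras
    except NotImplementedError as exc:
        raise HTTPException(501, str(exc))
    # anything else propagates and surfaces as a 500
```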

Bug fix (from M3)

The M3 skeleton had a small bug: the stop-substring check ran before the yield, so the chunk containing the stop string was swallowed whole. M4 switches to yield-first-then-check; both paths get the fix, as sketched below.
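The fix in miniature (illustrative names; only the ordering is the point):

```python
def emit_with_stop(pieces, stop=None):
    """Yield-first-then-check: the piece containing the stop substring
    is still emitted, then iteration ends (B1's HF streamer behavior)."""
    for piece in pieces:
        # M3 bug: checking `stop in piece` *before* yielding dropped this piece.
        yield piece
        if stop and stop in piece:
            break


# e.g. list(emit_with_stop(["Hel", "lo STOP", "never"], stop="STOP"))
# => ["Hel", "lo STOP"]
```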

Tests

104/104 green (the suite also runs on Linux CI without MLX/dflash):

| Suite | Count | Coverage |
| --- | --- | --- |
| kakeyalattice_mlx/tests/ (M1-M2) | 72 + 1 skip | codec, Hadamard, closest-point, KV cache delegation, MLX parity (activated on Apple Silicon) |
| kakeya_sidecar_mlx/tests/test_model_registry_mlx.py | 12 | DFlash draft consistency, E8-only guard, Qwen3-8B default profile |
| kakeya_sidecar_mlx/tests/test_server_skeleton.py | 5 | route shapes, dflash metadata exposure, 404 path |
| kakeya_sidecar_mlx/tests/test_cache_injection.py (new) | 8 | detection across four strategies + activate() state lifecycle + unknown strategy raises |
| kakeya_sidecar_mlx/tests/test_engine_routing.py (new) | 4 | DFlash aggregation, stop substring, fallback path, per-request override |
```
$ cd integrations/atomic-chat-b2/kakeya_sidecar_mlx
$ PYTHONPATH=. python3 -m pytest tests/ -v
# => 32 passed

$ cd ../kakeyalattice_mlx
$ PYTHONPATH=. python3 -m pytest tests/ -v
# => 72 passed, 1 skipped
```

Key engineering decisions

  1. Feature-detect instead of pinning the dflash version. dflash shipped three API shapes between 2026-02 and 2026-04; pinning a version would hold B2's stability hostage to upstream point releases. Three strategies + a fallback cost about 100 extra lines in cache_injection.py and buy compatibility with any dflash point release.
  2. Aggressive target compression, no draft compression. DFlash's verify step stays lossless because the target distribution backstops it; the target runs the Kakeya Q=38 near-lossless profile, while the draft keeps its stock RotatingKVCache (draft KV compression is deferred to Phase 2 M5+ or later).
  3. NotImplementedError → 501, not 500. The server layer distinguishes "not built yet (501)" from "genuinely broken (500)", letting the Atomic-Chat frontend message users accordingly.
  4. Stop semantics aligned with B1. When the user supplies a stop substring, the piece containing it is still emitted, and generation stops immediately afterwards, matching B1's HF streamer default behavior.

Non-goals (out of scope for this PR)

  • ❌ Routing the draft model's KV through Kakeya compression too (deferred to M5+, pending acceptance-rate validation)
  • ❌ Measured acceptance length / tok/s (deferred to the M5 benchmark PR)
  • ❌ A backend option in the Atomic-Chat extension (M6 has been re-scoped to a standalone product form)
  • ❌ DeepSeek / GLM / Gemma (DFlash has not released matching drafts)

Next steps

  • M5 PR — acceptance-rate benchmark (Qwen3-8B × {bf16, Kakeya Q=38/10/4} × DFlash-b16), measured on gsm8k + humaneval
  • M6 — ships docs/B2_STANDALONE_PRODUCT_PROPOSAL.md alongside the M5 PR, repositioning B2 from "Atomic-Chat backend" to a standalone local Mac inference product

cursoragent and others added 3 commits April 30, 2026 04:59
DFlash's MLX driver manages target KV cache internally; the 2026
release track has three different hook surfaces across revisions.
Instead of pinning to one dflash version, feature-detect at engine
startup and pick whichever is available:

Strategy A - KWARG passthrough (cleanest): stream_generate exposes
  target_cache / caches / target_caches / prompt_cache as a kwarg.

Strategy B - model.make_cache monkey-patch: mlx-lm models expose
  make_cache() that dflash delegates to.

Strategy C - module-level mlx_lm.models.cache.make_prompt_cache
  patch: some dflash revisions call this function directly.

Strategy FALLBACK - single-track MLX decode + KakeyaLatticeMLXCache.
  Loses the speculative speedup but keeps the KV compression benefit.
  Invoked loudly on unknown dflash APIs.

All three non-fallback strategies are context-managed via
activate() so state mutations get cleanly rolled back on exit.
Detection runs exactly once per model load (not per request).

Tests (8 passed, no MLX / no dflash required):
  - strategy detection across four synthetic signatures
  - activate() state roundtrip for each strategy
  - unknown-kwarg downgrade to fallback
  - monkey-patched make_cache restores original on __exit__
  - build() passes variant/q_range/boundary through to factory
  - unknown strategy raises RuntimeError (no silent fall-through)

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
M4 wires cache_injection into engine_mlx and opens the
/v1/chat/completions route (non-stream JSON + SSE stream).

engine_mlx.py:
  - _LoadedMLXModel pre-computes InjectionDecision at load time
    (not per request); logs which strategy was chosen.
  - chat() + chat_stream() delegate to the same _run_stream
    generator, which branches between DFlash+Kakeya and native-MLX
    fallback on _injection_decision.strategy.
  - _dflash_iter_factory abstraction lets tests substitute a fake
    dflash output stream without touching the real import.
  - Override dict lets callers mutate variant / q_range / boundary
    per request; engine rebuilds an MLXChannel dataclass for the
    request scope only.
  - Chat-template rendering uses tokenizer.apply_chat_template when
    available, with a flat prompt fallback.
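(A sketch of that rendering order, assuming the HF-style apply_chat_template signature that mlx_lm tokenizers mirror; the flat-prompt format below is invented, not the commit's actual fallback:)

```python
def render_prompt(tokenizer, messages):
    # Prefer the tokenizer's chat template when it exposes one...
    if hasattr(tokenizer, "apply_chat_template"):
        return tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
    # ...otherwise fall back to a flat role-prefixed prompt.
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages) + "\nassistant:"
```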

server.py:
  - /v1/chat/completions opens (no more 503).
  - Pydantic request model with extra=allow so future OpenAI
    additions flow through untouched.
  - stream=true returns an SSE StreamingResponse mirroring B1's
    chunk schema.
  - 404 on KeyError (unknown model); 501 on NotImplementedError
    (future unsupported features).

Fix: the earlier skeleton checked 'stop' before yielding, which
dropped the piece containing the stop substring. Swap the order so
a stop substring ends the loop AFTER emitting the carrier piece.
Applied to both DFlash and native-MLX paths.

Tests (24 passed, stubs-only):
  - test_server_skeleton.py updated: drop 503 assertion; add
    test_chat_unknown_model_returns_404 via monkeypatched engine.
  - test_engine_routing.py (4 tests, all mock-based):
      * DFlash path aggregates pieces + averages acceptance lengths
      * DFlash stream honours stop substring
      * Mistral (dflash_available=False) falls back to native-MLX
      * per-request override rebuilds channel with new q_range

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…uct pivot

ROADMAP updates:
  - Status line advances M1-M3 -> M1-M4.
  - M4 section expanded with the three cache_injection strategies
    and the 32-test green status.
  - M6 re-scoped from 'Atomic-Chat extension backend option' to
    'independent product form evaluation'. The standalone-product
    proposal lands alongside M5 benchmark in the next PR.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
