feat(atomic-chat): B2/M4 — DFlash speculative decoding + Kakeya target KV #59

Draft

FluffyAIcode wants to merge 3 commits into AgentMemory/atomic-chat-b2-mlx-dflash-kakeya-04ae from AgentMemory/atomic-chat-b2-m4-dflash-integration-04ae

Conversation

@FluffyAIcode

Summary

M4 delivers: DFlash (z-lab) block-diffusion speculative decoding wired into the B2 sidecar, with the target model's KV cache automatically routed through KakeyaLatticeMLXCache E8 compression; /v1/chat/completions officially opens (stream + non-stream).

Dependencies

  • Base: AgentMemory/atomic-chat-b2-mlx-dflash-kakeya-04ae (B2 M1-M3 skeleton, PR #58)
  • Blocker: merge the B2 branch before this one; alternatively, rebase onto main and merge independently. Either works.

Core deliverables

cache_injection.py — three-strategy DFlash KV injection

The DFlash 2026 release series exposes three different "target KV cache injection points", each tied to a different dflash API surface. This PR does not pin to any one version; instead, it feature-detects and picks the most suitable strategy:

| Strategy | Trigger | Intrusiveness |
| --- | --- | --- |
| KWARG | stream_generate signature accepts a target_cache / caches / target_caches / prompt_cache kwarg | lightest |
| MODEL_MAKE_CACHE | target model exposes a .make_cache() method | one method monkey-patch (context-managed) |
| MODULE_PROMPT_CACHE | mlx_lm.models.cache.make_prompt_cache exists | module-level patch |
| FALLBACK_NATIVE_MLX | none of the above | degrades to single-track MLX decode + Kakeya KV: loses the speedup, keeps the compression |
All non-fallback strategies run inside an activate() context manager; on exit, state is cleanly rolled back so later requests are never polluted. A sketch of the shape follows.
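A minimal sketch of the detect-then-activate shape, assuming nothing beyond the four strategies in the table above (detect_strategy, kakeya_cache_factory, and the patching details are illustrative, not the PR's exact API):

```python
import contextlib
import enum
import inspect


class InjectionStrategy(enum.Enum):
    KWARG = "kwarg"
    MODEL_MAKE_CACHE = "model_make_cache"
    MODULE_PROMPT_CACHE = "module_prompt_cache"
    FALLBACK_NATIVE_MLX = "fallback_native_mlx"


_CACHE_KWARGS = ("target_cache", "caches", "target_caches", "prompt_cache")


def detect_strategy(stream_generate, target_model, cache_module):
    """Run once per model load, checked in cheapest-to-most-invasive order."""
    params = inspect.signature(stream_generate).parameters
    if any(k in params for k in _CACHE_KWARGS):
        return InjectionStrategy.KWARG                # pass the cache as a kwarg
    if callable(getattr(target_model, "make_cache", None)):
        return InjectionStrategy.MODEL_MAKE_CACHE     # patch one method
    if hasattr(cache_module, "make_prompt_cache"):
        return InjectionStrategy.MODULE_PROMPT_CACHE  # patch at module level
    return InjectionStrategy.FALLBACK_NATIVE_MLX      # no speculative path


@contextlib.contextmanager
def activate(strategy, target_model, kakeya_cache_factory):
    """Install the Kakeya cache factory; always roll back on exit."""
    if strategy is InjectionStrategy.MODEL_MAKE_CACHE:
        original = target_model.make_cache
        target_model.make_cache = kakeya_cache_factory
        try:
            yield
        finally:
            target_model.make_cache = original  # later requests see a clean model
    else:
        # KWARG passes the cache explicitly; MODULE_PROMPT_CACHE is the
        # analogous patch on the module object (elided); fallback never patches.
        yield
```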

engine_mlx.py — two paths

  • DFlash path: _LoadedMLXModel precomputes an InjectionDecision at load time; requests run KakeyaCacheInjector.activate() + dflash.model_mlx.stream_generate(...), recording acceptance_length at every step
  • Native MLX fallback: with no DFlash draft, or when injection fails, requests run mlx_lm.generate.stream_generate(prompt_cache=caches), still getting KV compression but no speculative speedup.

Both paths converge on a single _run_stream(...) generator; chat() and chat_stream() reuse the same logic.
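Roughly how that shared generator might branch (a sketch reusing the InjectionStrategy enum from above; _native_stream and the step fields are stand-ins, not the PR's real names):

```python
def _run_stream(self, prompt: str, stop: str | None = None):
    """One generator behind chat() and chat_stream() (sketch only)."""
    decision = self._injection_decision  # computed once, at model load time
    if decision.strategy is not InjectionStrategy.FALLBACK_NATIVE_MLX:
        with self._injector.activate():
            for step in self._dflash_iter_factory(prompt):
                self._acceptance_lengths.append(step.acceptance_length)
                yield step.text
                if stop and stop in step.text:
                    return  # yield first, then check (see the M3 bug fix below)
    else:
        for piece in self._native_stream(prompt):  # mlx_lm stream_generate path
            yield piece
            if stop and stop in piece:
                return
```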

server.py — /v1/chat/completions opens

  • Pydantic ChatCompletionRequest (extra=allow, unknown fields pass through)
  • stream=true returns an SSE StreamingResponse; stream=false returns OpenAI-style JSON
  • the x_kakeya response field carries dflash_used, injection_strategy, and acceptance_length_mean
  • unknown model → 404; unimplemented feature → 501; any other exception → 500
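A sketch of the route's shape, assuming a FastAPI app (which the Pydantic model and SSE StreamingResponse suggest); get_engine and sse_chunks are hypothetical helpers, not names from this PR:

```python
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, ConfigDict

app = FastAPI()


class ChatCompletionRequest(BaseModel):
    model_config = ConfigDict(extra="allow")  # unknown OpenAI fields pass through
    model: str
    messages: list[dict]
    stream: bool = False


@app.post("/v1/chat/completions")
def chat_completions(req: ChatCompletionRequest):
    try:
        engine = get_engine(req.model)  # hypothetical registry lookup
    except KeyError:
        raise HTTPException(404, f"unknown model: {req.model}")
    try:
        if req.stream:
            return StreamingResponse(
                sse_chunks(engine, req),  # hypothetical SSE chunk generator
                media_type="text/event-stream",
            )
        return engine.chat(req)  # OpenAI-style JSON plus the x_kakeya extras
    except NotImplementedError as exc:
        raise HTTPException(501, str(exc))
    # anything else propagates and surfaces as a 500
```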

Bug fix (from M3)

The M3 skeleton had a small bug: the stop-substring check ran before the yield, so the chunk containing the stop string was swallowed whole. M4 switches to yield-first-then-check; both paths get the fix, as sketched below.
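The fix in miniature (illustrative names; only the ordering is the point):

```python
def emit_with_stop(pieces, stop=None):
    """Yield-first-then-check: the piece containing the stop substring
    is still emitted, then iteration ends (B1's HF streamer behavior)."""
    for piece in pieces:
        # M3 bug: checking `stop in piece` *before* yielding dropped this piece.
        yield piece
        if stop and stop in piece:
            break


# e.g. list(emit_with_stop(["Hel", "lo STOP", "never"], stop="STOP"))
# => ["Hel", "lo STOP"]
```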

Tests

104/104 green (the suite also runs on Linux CI without MLX/dflash):

| Suite | Count | Coverage |
| --- | --- | --- |
| kakeyalattice_mlx/tests/ (M1-M2) | 72 + 1 skip | codec, Hadamard, closest-point, KV cache delegation, MLX parity (activated on Apple Silicon) |
| kakeya_sidecar_mlx/tests/test_model_registry_mlx.py | 12 | DFlash draft consistency, E8-only guard, Qwen3-8B default profile |
| kakeya_sidecar_mlx/tests/test_server_skeleton.py | 5 | route shapes, dflash metadata exposure, 404 path |
| kakeya_sidecar_mlx/tests/test_cache_injection.py (new) | 8 | detection across four strategies + activate() state lifecycle + unknown strategy raises |
| kakeya_sidecar_mlx/tests/test_engine_routing.py (new) | 4 | DFlash aggregation, stop substring, fallback path, per-request override |
```
$ cd integrations/atomic-chat-b2/kakeya_sidecar_mlx
$ PYTHONPATH=. python3 -m pytest tests/ -v
# => 32 passed

$ cd ../kakeyalattice_mlx
$ PYTHONPATH=. python3 -m pytest tests/ -v
# => 72 passed, 1 skipped
```

Key engineering decisions

  1. Feature-detect instead of pinning the dflash version. dflash shipped three API shapes between 2026-02 and 2026-04; pinning a version would hold B2's stability hostage to upstream point releases. Three strategies + a fallback cost about 100 extra lines in cache_injection.py and buy compatibility with any dflash point release.
  2. Aggressive target compression, no draft compression. DFlash's verify step stays lossless because the target distribution backstops it; the target runs the Kakeya Q=38 near-lossless profile, while the draft keeps its stock RotatingKVCache (draft KV compression is deferred to Phase 2 M5+ or later).
  3. NotImplementedError → 501, not 500. The server layer distinguishes "not built yet (501)" from "genuinely broken (500)", letting the Atomic-Chat frontend message users accordingly.
  4. Stop semantics aligned with B1. When the user supplies a stop substring, the piece containing it is still emitted, and generation stops immediately afterwards, matching B1's HF streamer default behavior.

Non-goals (out of scope for this PR)

  • ❌ Routing the draft model's KV through Kakeya compression too (deferred to M5+, pending acceptance-rate validation)
  • ❌ Measured acceptance length / tok/s (deferred to the M5 benchmark PR)
  • ❌ A backend option in the Atomic-Chat extension (M6 has been re-scoped to a standalone product form)
  • ❌ DeepSeek / GLM / Gemma (DFlash has not released matching drafts)

Next steps

  • M5 PR — acceptance-rate benchmark (Qwen3-8B × {bf16, Kakeya Q=38/10/4} × DFlash-b16), measured on gsm8k + humaneval
  • M6 — ships docs/B2_STANDALONE_PRODUCT_PROPOSAL.md alongside the M5 PR, repositioning B2 from "Atomic-Chat backend" to a standalone local Mac inference product

cursoragent and others added 3 commits April 30, 2026 04:59
DFlash's MLX driver manages target KV cache internally; the 2026
release track has three different hook surfaces across revisions.
Instead of pinning to one dflash version, feature-detect at engine
startup and pick whichever is available:

Strategy A - KWARG passthrough (cleanest): stream_generate exposes
  target_cache / caches / target_caches / prompt_cache as a kwarg.

Strategy B - model.make_cache monkey-patch: mlx-lm models expose
  make_cache() that dflash delegates to.

Strategy C - module-level mlx_lm.models.cache.make_prompt_cache
  patch: some dflash revisions call this function directly.

Strategy FALLBACK - single-track MLX decode + KakeyaLatticeMLXCache.
  Loses the speculative speedup but keeps the KV compression benefit.
  Invoked loudly on unknown dflash APIs.

All three non-fallback strategies are context-managed via
activate() so state mutations get cleanly rolled back on exit.
Detection runs exactly once per model load (not per request).

Tests (8 passed, no MLX / no dflash required):
  - strategy detection across four synthetic signatures
  - activate() state roundtrip for each strategy
  - unknown-kwarg downgrade to fallback
  - monkey-patched make_cache restores original on __exit__
  - build() passes variant/q_range/boundary through to factory
  - unknown strategy raises RuntimeError (no silent fall-through)

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
M4 wires cache_injection into engine_mlx and opens the
/v1/chat/completions route (non-stream JSON + SSE stream).

engine_mlx.py:
  - _LoadedMLXModel pre-computes InjectionDecision at load time
    (not per request); logs which strategy was chosen.
  - chat() + chat_stream() delegate to the same _run_stream
    generator, which branches between DFlash+Kakeya and native-MLX
    fallback on _injection_decision.strategy.
  - _dflash_iter_factory abstraction lets tests substitute a fake
    dflash output stream without touching the real import.
  - Override dict lets callers mutate variant / q_range / boundary
    per request; engine rebuilds an MLXChannel dataclass for the
    request scope only.
  - Chat-template rendering uses tokenizer.apply_chat_template when
    available, with a flat prompt fallback.
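(A sketch of that rendering order, assuming the HF-style apply_chat_template signature that mlx_lm tokenizers mirror; the flat-prompt format below is invented, not the commit's actual fallback:)

```python
def render_prompt(tokenizer, messages):
    # Prefer the tokenizer's chat template when it exposes one...
    if hasattr(tokenizer, "apply_chat_template"):
        return tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
    # ...otherwise fall back to a flat role-prefixed prompt.
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages) + "\nassistant:"
```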

server.py:
  - /v1/chat/completions opens (no more 503).
  - Pydantic request model with extra=allow so future OpenAI
    additions flow through untouched.
  - stream=true returns an SSE StreamingResponse mirroring B1's
    chunk schema.
  - 404 on KeyError (unknown model); 501 on NotImplementedError
    (future unsupported features).

Fix: the earlier skeleton checked 'stop' before yielding, which
dropped the piece containing the stop substring. Swap the order so
a stop substring ends the loop AFTER emitting the carrier piece.
Applied to both DFlash and native-MLX paths.

Tests (24 passed, stubs-only):
  - test_server_skeleton.py updated: drop 503 assertion; add
    test_chat_unknown_model_returns_404 via monkeypatched engine.
  - test_engine_routing.py (4 tests, all mock-based):
      * DFlash path aggregates pieces + averages acceptance lengths
      * DFlash stream honours stop substring
      * Mistral (dflash_available=False) falls back to native-MLX
      * per-request override rebuilds channel with new q_range

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…uct pivot

ROADMAP updates:
  - Status line advances M1-M3 -> M1-M4.
  - M4 section expanded with the three cache_injection strategies
    and the 32-test green status.
  - M6 re-scoped from 'Atomic-Chat extension backend option' to
    'independent product form evaluation'. The standalone-product
    proposal lands alongside M5 benchmark in the next PR.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
