feat(atomic-chat): B2 — MLX + DFlash + KakeyaLattice (M1-M3 skeleton) #58

Draft

FluffyAIcode wants to merge 3 commits into main from
AgentMemory/atomic-chat-b2-mlx-dflash-kakeya-04ae

Conversation

@FluffyAIcode (Owner)

Summary

B2 is Atomic-Chat's Mac Pro Mode inference stack. On top of B1's (PR #57) HF + MPS sidecar, it upgrades to:

  • MLX (Apple's official framework, with Apple Silicon-native unified-memory acceleration)
  • the DFlash block-diffusion speculative decoding drafter (z-lab/dflash, arXiv:2602.06036, 2026-02, which reports a 6× lossless speedup on Qwen3-8B)
  • an MLX port of the KakeyaLattice v1.5 E8 codec, with bit-level parity against the PyTorch reference

Design rationale: docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md §12 (DFlash integration path) + §14 (A0 vs B2 decision table), both introduced in PR #57.

Relationship to B1

B1 and B2 are not mutually exclusive; they coexist. Each sidecar listens on its own port, and Atomic-Chat's Tauri plugin routes to whichever backend the user selects:

| | B1 (PR #57) | B2 (this PR) |
|---|---|---|
| Inference engine | HF transformers + torch MPS | MLX |
| Speculative decoding | — | DFlash 3-6× |
| KV compression | KakeyaLatticeCache (PyTorch) | KakeyaLatticeMLXCache |
| Default port | 1338 | 1339 |
| Platform | Mac / Windows / Linux | Mac only |
| Decode speed (Qwen3-8B @ M3 Pro, expected) | ~50 tok/s | ~200-280 tok/s effective |
| Long context (Mac 16GB, expected) | ~32k | ~48-64k |

Scope of this PR (M1-M3 skeleton)

ROADMAP breakdown:

| M | Content | Status in this PR |
|---|---|---|
| M1 | kakeyalattice_mlx/ — MLX port of the E8 codec + PyTorch parity tests | ✅ |
| M2 | KakeyaLatticeMLXCache — mlx-lm KV cache wrapper | ✅ |
| M3 | kakeya_sidecar_mlx/ — OpenAI-compatible MLX sidecar skeleton | ✅ (chat returns 503 until M4) |
| M4 | DFlash integration: dflash.model_mlx.stream_generate + target KV compression | ⏳ follow-up PR |
| M5 | acceptance-rate benchmark (Qwen3-8B × DFlash × KakeyaLattice) | ⏳ follow-up PR |
| M6 | Atomic-Chat extension adds a "KakeyaLattice (MLX+DFlash) Pro" backend option | ⏳ follow-up PR |

Deliverables

integrations/atomic-chat-b2/kakeyalattice_mlx/ — M1+M2

MLX-native implementation of the KakeyaLattice v1.5 E8 codec, bit-identical to the PyTorch reference in float32.

| File | Purpose |
|---|---|
| hadamard.py | Sylvester–Hadamard matrix builder (MLX + NumPy ref) |
| closest_point.py | Conway–Sloane Alg. 5 closest-D8 / -E8 (MLX) |
| codec.py | E8LatticeCodebookMLX with .roundtrip(x) |
| kv_cache.py | KakeyaLatticeMLXCache — mlx-lm KVCache wrapper + make_kakeya_caches() factory |
| _reference_numpy.py | NumPy shadow reference — for Linux CI + three-way parity |
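As a rough illustration of what hadamard.py's NumPy reference builds (an assumed sketch, not the shipped code), the Sylvester construction doubles a seed matrix until the target size is reached:

```python
import numpy as np

def sylvester_hadamard(n: int) -> np.ndarray:
    """Sylvester construction: H_{2n} = [[H, H], [H, -H]].

    Illustrative sketch only; the actual hadamard.py API may differ.
    """
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

# H is self-inverse up to scale: H @ H.T == n * I
H8 = sylvester_hadamard(8)
assert np.allclose(H8 @ H8.T, 8 * np.eye(8))
```

The self-inverse and power-of-2 properties asserted here are exactly what the test suite below guards.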

Key design decisions:

  • float32 internal compute (required for parity); fp16/bf16 paths exist but are not part of the parity gate
  • Wrapper-not-subclass for the KV cache — attributes added by future mlx-lm releases pass through automatically
  • Three-way parity (PyTorch ↔ NumPy ↔ MLX) — the NumPy shadow lets Linux CI verify algorithmic equivalence too
  • Zero-input NaN fix: eps = finfo(float32).eps is pinned to match the PyTorch ref, avoiding division by an fp16-collapsed eps of 0
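The wrapper-not-subclass decision can be sketched like this (hypothetical class and method names; the real KakeyaLatticeMLXCache wraps mlx-lm's KVCache):

```python
class CacheWrapper:
    """Hypothetical sketch of the wrapper-not-subclass pattern."""

    def __init__(self, inner):
        self._inner = inner  # the wrapped mlx-lm cache

    def update_and_fetch(self, keys, values):
        # a compression hook would go here, then delegate to the inner cache
        return self._inner.update_and_fetch(keys, values)

    def __getattr__(self, name):
        # __getattr__ fires only for attributes NOT found on the wrapper,
        # so any attribute a future mlx-lm release adds to the inner cache
        # flows through without patching the wrapper
        return getattr(self._inner, name)
```

A subclass would instead freeze the parent's attribute surface at the mlx-lm version it was written against; the wrapper stays forward-compatible for free.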

integrations/atomic-chat-b2/kakeya_sidecar_mlx/ — M3

The B2 OpenAI-compatible sidecar. Its route shape is identical to B1's; the x_kakeya block in /v1/models carries two extra fields: dflash_draft_repo + dflash_available.

| File | Purpose |
|---|---|
| cli.py | kakeya-sidecar-mlx --host/--port/--device/--enable-dflash/--prewarm |
| model_registry_mlx.py | MLX deployment profiles, including the 2026-04-30 z-lab DFlash draft catalogue |
| engine_mlx.py | MLXEngine — warmup + LRU implemented; chat() raises NotImplementedError pointing at M4 |
| server.py | FastAPI routes; /v1/chat/completions returns 503 + a ROADMAP reference until M4 |
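The explicit-503 behavior is, in spirit (a framework-agnostic sketch with made-up names; the real server.py uses FastAPI routes):

```python
# Hypothetical sketch of the M3 stub behavior, not the shipped server.py.
ROADMAP_503 = {
    "error": {
        "code": 503,
        "message": "chat completions land in ROADMAP M4 (DFlash integration); "
                   "this M3 skeleton intentionally refuses rather than mocking",
    }
}

def chat_completions_stub():
    # a clean 503 beats a half-working chat that silently diverges from B1
    return 503, ROADMAP_503
```

The design intent (stated in the commit message below) is that refusing cleanly is safer than shipping a mock that could drift from B1's output.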

Registry coverage (M3):

| Target | DFlash draft |
|---|---|
| Qwen/Qwen3-4B | z-lab/Qwen3-4B-DFlash-b16 |
| Qwen/Qwen3-8B | z-lab/Qwen3-8B-DFlash-b16 (B2 primary benchmark target) |
| meta-llama/Llama-3.1-8B-Instruct | z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat |
| Qwen/Qwen3.5-4B | z-lab/Qwen3.5-4B-DFlash |
| mistralai/Mistral-7B-Instruct-v0.3 | (no DFlash draft) — kept for MLX-only channels |
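The dflash_available field on these profiles is derived rather than trusted from the caller; a minimal sketch of that invariant (hypothetical class name, simplified fields):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MLXProfileSketch:
    """Hypothetical sketch: dflash_available is recomputed, never stored."""
    model_repo: str
    dflash_draft_repo: Optional[str] = None
    dflash_available: bool = False

    def __post_init__(self):
        # silently downgrade inconsistent availability so the UI can
        # never show "DFlash: on" without an actual draft repo
        self.dflash_available = self.dflash_draft_repo is not None
```

With this, a caller passing dflash_available=True for Mistral (which has no draft) still ends up with False.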

Tests

89/89 green (run on Linux CI; the MLX-parity gate activates 12 additional cases on Apple Silicon):

# M1+M2 codec tests
$ cd integrations/atomic-chat-b2/kakeyalattice_mlx
$ PYTHONPATH=. python3 -m pytest tests/ -v
# => 72 passed, 1 skipped (MLX-gated)

# M3 sidecar tests
$ cd integrations/atomic-chat-b2/kakeya_sidecar_mlx
$ PYTHONPATH=. python3 -m pytest tests/ -v
# => 17 passed

Test coverage:

  • Bit accounting — the E8 bit-count formula matches all canonical values from the v1.5 report (D=128: Q=38→848, Q=10→608, Q=4→448, Q=152→1104)
  • Hadamard structure — shape, self-inverse, power-of-2 guards
  • Closest-D8/E8 — parity constraint, E8 coset structure, E8 ≤ D8 distance monotonicity, in-lattice fixed point
  • Roundtrip — shape/dtype preservation, bounded rel-err, error monotone in Q, zero-input→zero, determinism
  • KV cache — delegation, fire/skip counters, broken-codec fallback, attribute forwarding
  • MLX parity — MLX ↔ NumPy max_abs_diff ≤ 1e-5 across D∈{64,128,256} × Q∈{4,10,38,152}
  • Registry — DFlash draft coherence (dflash_available never true without dflash_draft_repo), E8-only guard, Qwen3-8B default tier pinned to Q=38 near-lossless (safe for acceptance rate)
  • Sidecar routing — /health flags the B2 variant, /v1/models exposes dflash metadata, /v1/chat/completions returns 503 until M4
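For context on the closest-point tests above, here is a NumPy sketch of the Conway–Sloane idea (an assumed single-vector version; the shipped closest_point.py is the batched MLX implementation): closest-D8 rounds coordinatewise and, if the coordinate sum comes out odd, re-rounds the worst coordinate the other way; closest-E8 takes the nearer of D8 and the shifted coset D8 + ½.

```python
import numpy as np

def closest_d8(x: np.ndarray) -> np.ndarray:
    """Closest point of D8 (integer vectors with even coordinate sum)."""
    f = np.rint(x)
    if int(f.sum()) % 2 != 0:
        # parity violated: flip the coordinate with the worst rounding
        # error to its second-nearest integer
        k = int(np.argmax(np.abs(x - f)))
        step = np.sign(x[k] - f[k])
        f[k] += step if step != 0 else 1.0
    return f

def closest_e8(x: np.ndarray) -> np.ndarray:
    """E8 = D8 union (D8 + 1/2): pick the nearer of the two cosets."""
    a = closest_d8(x)
    b = closest_d8(x - 0.5) + 0.5
    return a if np.sum((x - a) ** 2) <= np.sum((x - b) ** 2) else b
```

This makes the listed invariants concrete: the D8 output always has even coordinate sum, the E8 point is never farther than the D8 one, and a point already in the lattice maps to itself.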

Non-goals (explicitly out of scope for this PR)

  • ❌ No changes to any vLLM path (the C/C2 track is an entirely separate PR)
  • ❌ No port of the v1.4 D4 codec (B2 only needs E8; D4 stays in B1)
  • ❌ No Metal Performance Shaders-level fused E8 kernel (MLX built-in ops suffice; a fused kernel is a post-merge optimization)
  • ❌ No Windows/Linux support (B2 is deliberately Mac-only; non-Mac users use B1 or llama.cpp)

Dependencies

Follow-up PRs

Per ROADMAP M4-M6, each on its own branch:

  1. M4 — DFlash integration: wire up dflash.model_mlx.stream_generate, with target KV going through KakeyaLatticeMLXCache
  2. M5 — acceptance-rate benchmark: Qwen3-8B × {bf16, Kakeya Q=38/10/4} × DFlash-b16 on gsm8k + humaneval, measuring the acceptance-length distribution + effective tok/s
  3. M6 — Atomic-Chat extension: add a "KakeyaLattice (MLX+DFlash) ★ Pro" backend option on top of the B1 extension; the Tauri plugin hosts both sidecars

cursoragent and others added 3 commits April 30, 2026 04:17
B2 is the performance successor to B1 (PR #57):
  - B1 = HF transformers + torch MPS + KakeyaLatticeCache (Python)
  - B2 = MLX-native + DFlash block-diffusion speculative decoding +
    KakeyaLattice E8 codec ported to MLX

The two sidecars coexist: B1 serves Mac/Win/Linux on :1338,
B2 serves Mac-only on :1339. Atomic-Chat's frontend picks one via
a UI toggle (default B2 on Apple Silicon).

This commit ships:
  - integrations/atomic-chat-b2/README.md  — directory overview + B1/B2 compare
  - integrations/atomic-chat-b2/ROADMAP.md — M1-M6 milestones (M1-M3 in this PR,
    M4-M6 as follow-up PRs). Explicit non-goals noted: no vLLM path, no
    D4 port, no fused Metal kernel, Mac-only by design.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
MLX-native implementation of the v1.5 E8 nested-lattice KV-cache
codec, bit-identical to the PyTorch reference at
kakeyalattice.V15KakeyaZamirE8GPU in float32.

Package layout (kakeyalattice_mlx/):
  hadamard.py          Sylvester-Hadamard matrix builder (MLX + NumPy ref)
  closest_point.py     Conway-Sloane Alg 5 closest-D8 / -E8 (MLX)
  codec.py             E8LatticeCodebookMLX with .roundtrip(x)
  kv_cache.py          KakeyaLatticeMLXCache: mlx-lm KVCache wrapper
                       + make_kakeya_caches(model, variant, q_range,
                       boundary) factory for per-layer caches
  _reference_numpy.py  NumPy shadow of the codec (CI on Linux)

Tests (72 passed + 1 skipped on Linux; MLX-parity gate activates on
Apple Silicon):
  test_bits_accounting.py        E8 bit-formula matches v1.5 report
                                 canonical values (D=128 Q=38 -> 848,
                                 Q=10 -> 608, Q=4 -> 448, Q=152 -> 1104)
  test_hadamard_numpy.py         shape / self-inverse / power-of-2 guards
  test_closest_point_numpy.py    D8 parity constraint, E8 coset structure,
                                 E8 distance <= D8 (monotonicity), shape
                                 preservation, already-in-lattice fixpoint
  test_roundtrip_numpy.py        shape/dtype, bounded rel-err, Q monotone
                                 in error, zero-input round-trips to zero,
                                 determinism
  test_kv_cache.py               delegation, fire/skip counters, broken-codec
                                 fallback to inner, attribute forwarding
  test_codec_mlx_parity.py       [MLX-gated] MLX <-> NumPy max_abs_diff <=
                                 1e-5 across D in {64,128,256} x Q in
                                 {4,10,38,152}; optional PyTorch 3-way parity

The zero-input NaN path (fp16-collapsed eps) is explicitly tested
and fixed by pinning eps = finfo(float32).eps, matching the PyTorch
reference.

Design decisions:
  - float32 internal compute by default (required for bit parity);
    fp16/bf16 paths exist but are not parity-gated.
  - Wrapper-not-subclass for the KV cache: mlx-lm adds attributes
    across releases; __getattr__ forwards to the inner cache so new
    attributes flow through without patches.
  - NumPy shadow reference is its own module so three-way parity
    (PyTorch <-> NumPy <-> MLX) is always verifiable without
    pulling torch into Linux CI.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
B2 sidecar (:1339) matching B1's OpenAI-compatible HTTP shape, with
MLX-specific metadata surfaced in /v1/models.

Package layout (kakeya_sidecar_mlx/):
  __init__.py           lazy re-export (importable without mlx/fastapi)
  cli.py                kakeya-sidecar-mlx --host/--port/--device/
                        --enable-dflash/--prewarm entry point
  model_registry_mlx.py MLX deployment profiles with dflash_draft_repo
                        field; populated from z-lab's 2026-04-30
                        DFlash draft catalogue
  engine_mlx.py         MLXEngineConfig + MLXEngine skeleton: warmup +
                        LRU implemented; chat/chat_stream raise
                        NotImplementedError pointing at ROADMAP M4
  server.py             FastAPI routes; /v1/chat/completions returns
                        503 with ROADMAP reference until M4 lands

Registry coverage (M3):
  qwen3-4b            -> z-lab/Qwen3-4B-DFlash-b16
  qwen3-8b            -> z-lab/Qwen3-8B-DFlash-b16 (B2 primary target)
  llama-3.1-8b-instruct -> z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat
  qwen3.5-4b          -> z-lab/Qwen3.5-4B-DFlash
  mistral-7b-instruct-v0.3 -> no DFlash draft; MLX-only channels

Tests (17 passed):
  test_model_registry_mlx.py  registry covers all DFlash hero models;
                              dflash_available <-> dflash_draft_repo
                              invariant (inconsistent availability
                              silently downgraded in __post_init__);
                              E8-only guard; Qwen3-8B default pinned to
                              Q=38 near-lossless for acceptance-rate
                              safety
  test_server_skeleton.py     /health flags B2 variant, /v1/models
                              exposes dflash metadata, Mistral has no
                              DFlash draft, /v1/chat/completions returns
                              503 until M4

Design decisions:
  - Explicit 503 on /chat/completions (not a fallback mock): better a
    clean skeleton than a half-working chat that silently diverges from
    B1's output.
  - dflash_available is a DERIVED flag recomputed in __post_init__ to
    stop the UI from ever showing 'DFlash: on' without an actual draft.
  - qwen3-8b default = Q=38 (not Q=10) — under DFlash spec-decode a
    target-side Kakeya |Delta-ppl| cuts acceptance length, so we pay
    for quality first.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
