feat(atomic-chat): B2 — MLX + DFlash + KakeyaLattice (M1-M3 skeleton) #58

Draft

FluffyAIcode wants to merge 3 commits into main from
AgentMemory/atomic-chat-b2-mlx-dflash-kakeya-04ae

Conversation

@FluffyAIcode (Owner)

Summary

B2 is Atomic-Chat's Mac Pro Mode inference stack. On top of B1's (PR #57) HF + MPS sidecar, it upgrades to:

  • MLX (Apple's official framework, with Apple Silicon-native unified-memory acceleration)
  • the DFlash block-diffusion speculative decoding drafter (z-lab/dflash, arXiv:2602.06036, 2026-02, which reports a 6× lossless speedup on Qwen3-8B)
  • an MLX port of the KakeyaLattice v1.5 E8 codec, with bit-level parity against the PyTorch reference

Design rationale: docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md §12 (DFlash integration path) + §14 (A0 vs B2 decision table), both introduced in PR #57.

Relationship to B1

B1 and B2 are not mutually exclusive; they coexist. Each sidecar listens on its own port, and Atomic-Chat's Tauri plugin routes to whichever backend the user selects:

| | B1 (PR #57) | B2 (this PR) |
|---|---|---|
| Inference engine | HF transformers + torch MPS | MLX |
| Speculative decoding | — | DFlash 3-6× |
| KV compression | KakeyaLatticeCache (PyTorch) | KakeyaLatticeMLXCache |
| Default port | 1338 | 1339 |
| Platform | Mac / Windows / Linux | Mac only |
| Decode speed (Qwen3-8B @ M3 Pro, expected) | ~50 tok/s | ~200-280 tok/s effective |
| Long context (Mac 16GB, expected) | ~32k | ~48-64k |

Scope of this PR (M1-M3 skeleton)

ROADMAP breakdown:

| M | Content | Status in this PR |
|---|---|---|
| M1 | kakeyalattice_mlx/ — MLX port of the E8 codec + PyTorch parity tests | ✅ |
| M2 | KakeyaLatticeMLXCache — mlx-lm KV cache wrapper | ✅ |
| M3 | kakeya_sidecar_mlx/ — OpenAI-compatible MLX sidecar skeleton | ✅ (chat returns 503 until M4) |
| M4 | DFlash integration: dflash.model_mlx.stream_generate + target KV compression | ⏳ follow-up PR |
| M5 | acceptance-rate benchmark (Qwen3-8B × DFlash × KakeyaLattice) | ⏳ follow-up PR |
| M6 | Atomic-Chat extension adds a "KakeyaLattice (MLX+DFlash) Pro" backend option | ⏳ follow-up PR |

Deliverables

integrations/atomic-chat-b2/kakeyalattice_mlx/ — M1+M2

MLX-native implementation of the KakeyaLattice v1.5 E8 codec, bit-identical to the PyTorch reference in float32.

| File | Purpose |
|---|---|
| hadamard.py | Sylvester–Hadamard matrix builder (MLX + NumPy ref) |
| closest_point.py | Conway–Sloane Alg. 5 closest-D8 / -E8 (MLX) |
| codec.py | E8LatticeCodebookMLX with .roundtrip(x) |
| kv_cache.py | KakeyaLatticeMLXCache — mlx-lm KVCache wrapper + make_kakeya_caches() factory |
| _reference_numpy.py | NumPy shadow reference — for Linux CI + three-way parity |
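As a rough illustration of what hadamard.py's NumPy reference builds (an assumed sketch, not the shipped code), the Sylvester construction doubles a seed matrix until the target size is reached:

```python
import numpy as np

def sylvester_hadamard(n: int) -> np.ndarray:
    """Sylvester construction: H_{2n} = [[H, H], [H, -H]].

    Illustrative sketch only; the actual hadamard.py API may differ.
    """
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

# H is self-inverse up to scale: H @ H.T == n * I
H8 = sylvester_hadamard(8)
assert np.allclose(H8 @ H8.T, 8 * np.eye(8))
```

The self-inverse and power-of-2 properties asserted here are exactly what the test suite below guards.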

Key design decisions:

  • float32 internal compute (required for parity); fp16/bf16 paths exist but are not part of the parity gate
  • Wrapper-not-subclass for the KV cache — attributes added by future mlx-lm releases pass through automatically
  • Three-way parity (PyTorch ↔ NumPy ↔ MLX) — the NumPy shadow lets Linux CI verify algorithmic equivalence too
  • Zero-input NaN fix: eps = finfo(float32).eps is pinned to match the PyTorch ref, avoiding division by an fp16-collapsed eps of 0
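The wrapper-not-subclass decision can be sketched like this (hypothetical class and method names; the real KakeyaLatticeMLXCache wraps mlx-lm's KVCache):

```python
class CacheWrapper:
    """Hypothetical sketch of the wrapper-not-subclass pattern."""

    def __init__(self, inner):
        self._inner = inner  # the wrapped mlx-lm cache

    def update_and_fetch(self, keys, values):
        # a compression hook would go here, then delegate to the inner cache
        return self._inner.update_and_fetch(keys, values)

    def __getattr__(self, name):
        # __getattr__ fires only for attributes NOT found on the wrapper,
        # so any attribute a future mlx-lm release adds to the inner cache
        # flows through without patching the wrapper
        return getattr(self._inner, name)
```

A subclass would instead freeze the parent's attribute surface at the mlx-lm version it was written against; the wrapper stays forward-compatible for free.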

integrations/atomic-chat-b2/kakeya_sidecar_mlx/ — M3

The B2 OpenAI-compatible sidecar. Its route shape is identical to B1's; the x_kakeya block in /v1/models carries two extra fields: dflash_draft_repo + dflash_available.

| File | Purpose |
|---|---|
| cli.py | kakeya-sidecar-mlx --host/--port/--device/--enable-dflash/--prewarm |
| model_registry_mlx.py | MLX deployment profiles, including the 2026-04-30 z-lab DFlash draft catalogue |
| engine_mlx.py | MLXEngine — warmup + LRU implemented; chat() raises NotImplementedError pointing at M4 |
| server.py | FastAPI routes; /v1/chat/completions returns 503 + a ROADMAP reference until M4 |
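The explicit-503 behavior is, in spirit (a framework-agnostic sketch with made-up names; the real server.py uses FastAPI routes):

```python
# Hypothetical sketch of the M3 stub behavior, not the shipped server.py.
ROADMAP_503 = {
    "error": {
        "code": 503,
        "message": "chat completions land in ROADMAP M4 (DFlash integration); "
                   "this M3 skeleton intentionally refuses rather than mocking",
    }
}

def chat_completions_stub():
    # a clean 503 beats a half-working chat that silently diverges from B1
    return 503, ROADMAP_503
```

The design intent (stated in the commit message below) is that refusing cleanly is safer than shipping a mock that could drift from B1's output.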

Registry coverage (M3):

| Target | DFlash draft |
|---|---|
| Qwen/Qwen3-4B | z-lab/Qwen3-4B-DFlash-b16 |
| Qwen/Qwen3-8B | z-lab/Qwen3-8B-DFlash-b16 (B2 primary benchmark target) |
| meta-llama/Llama-3.1-8B-Instruct | z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat |
| Qwen/Qwen3.5-4B | z-lab/Qwen3.5-4B-DFlash |
| mistralai/Mistral-7B-Instruct-v0.3 | (no DFlash draft) — kept for MLX-only channels |
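The dflash_available field on these profiles is derived rather than trusted from the caller; a minimal sketch of that invariant (hypothetical class name, simplified fields):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MLXProfileSketch:
    """Hypothetical sketch: dflash_available is recomputed, never stored."""
    model_repo: str
    dflash_draft_repo: Optional[str] = None
    dflash_available: bool = False

    def __post_init__(self):
        # silently downgrade inconsistent availability so the UI can
        # never show "DFlash: on" without an actual draft repo
        self.dflash_available = self.dflash_draft_repo is not None
```

With this, a caller passing dflash_available=True for Mistral (which has no draft) still ends up with False.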

Tests

89/89 green (run on Linux CI; the MLX-parity gate activates 12 additional cases on Apple Silicon):

# M1+M2 codec tests
$ cd integrations/atomic-chat-b2/kakeyalattice_mlx
$ PYTHONPATH=. python3 -m pytest tests/ -v
# => 72 passed, 1 skipped (MLX-gated)

# M3 sidecar tests
$ cd integrations/atomic-chat-b2/kakeya_sidecar_mlx
$ PYTHONPATH=. python3 -m pytest tests/ -v
# => 17 passed

Test coverage:

  • Bit accounting — the E8 bit-count formula matches all canonical values from the v1.5 report (D=128: Q=38→848, Q=10→608, Q=4→448, Q=152→1104)
  • Hadamard structure — shape, self-inverse, power-of-2 guards
  • Closest-D8/E8 — parity constraint, E8 coset structure, E8 ≤ D8 distance monotonicity, in-lattice fixed point
  • Roundtrip — shape/dtype preservation, bounded rel-err, error monotone in Q, zero-input→zero, determinism
  • KV cache — delegation, fire/skip counters, broken-codec fallback, attribute forwarding
  • MLX parity — MLX ↔ NumPy max_abs_diff ≤ 1e-5 across D∈{64,128,256} × Q∈{4,10,38,152}
  • Registry — DFlash draft coherence (dflash_available never true without dflash_draft_repo), E8-only guard, Qwen3-8B default tier pinned to Q=38 near-lossless (safe for acceptance rate)
  • Sidecar routing — /health flags the B2 variant, /v1/models exposes dflash metadata, /v1/chat/completions returns 503 until M4
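For context on the closest-point tests above, here is a NumPy sketch of the Conway–Sloane idea (an assumed single-vector version; the shipped closest_point.py is the batched MLX implementation): closest-D8 rounds coordinatewise and, if the coordinate sum comes out odd, re-rounds the worst coordinate the other way; closest-E8 takes the nearer of D8 and the shifted coset D8 + ½.

```python
import numpy as np

def closest_d8(x: np.ndarray) -> np.ndarray:
    """Closest point of D8 (integer vectors with even coordinate sum)."""
    f = np.rint(x)
    if int(f.sum()) % 2 != 0:
        # parity violated: flip the coordinate with the worst rounding
        # error to its second-nearest integer
        k = int(np.argmax(np.abs(x - f)))
        step = np.sign(x[k] - f[k])
        f[k] += step if step != 0 else 1.0
    return f

def closest_e8(x: np.ndarray) -> np.ndarray:
    """E8 = D8 union (D8 + 1/2): pick the nearer of the two cosets."""
    a = closest_d8(x)
    b = closest_d8(x - 0.5) + 0.5
    return a if np.sum((x - a) ** 2) <= np.sum((x - b) ** 2) else b
```

This makes the listed invariants concrete: the D8 output always has even coordinate sum, the E8 point is never farther than the D8 one, and a point already in the lattice maps to itself.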

Non-goals (explicitly out of scope for this PR)

  • ❌ No changes to any vLLM path (the C/C2 track is an entirely separate PR)
  • ❌ No port of the v1.4 D4 codec (B2 only needs E8; D4 stays in B1)
  • ❌ No Metal Performance Shaders-level fused E8 kernel (MLX built-in ops suffice; a fused kernel is a post-merge optimization)
  • ❌ No Windows/Linux support (B2 is deliberately Mac-only; non-Mac users use B1 or llama.cpp)

Dependencies

Follow-up PRs

Per ROADMAP M4-M6, each on its own branch:

  1. M4 — DFlash integration: wire up dflash.model_mlx.stream_generate, with target KV going through KakeyaLatticeMLXCache
  2. M5 — acceptance-rate benchmark: Qwen3-8B × {bf16, Kakeya Q=38/10/4} × DFlash-b16 on gsm8k + humaneval, measuring the acceptance-length distribution + effective tok/s
  3. M6 — Atomic-Chat extension: add a "KakeyaLattice (MLX+DFlash) ★ Pro" backend option on top of the B1 extension; the Tauri plugin hosts both sidecars

cursoragent and others added 3 commits April 30, 2026 04:17
B2 is the performance successor to B1 (PR #57):
  - B1 = HF transformers + torch MPS + KakeyaLatticeCache (Python)
  - B2 = MLX-native + DFlash block-diffusion speculative decoding +
    KakeyaLattice E8 codec ported to MLX

The two sidecars coexist: B1 serves Mac/Win/Linux on :1338,
B2 serves Mac-only on :1339. Atomic-Chat's frontend picks one via
a UI toggle (default B2 on Apple Silicon).

This commit ships:
  - integrations/atomic-chat-b2/README.md  — directory overview + B1/B2 compare
  - integrations/atomic-chat-b2/ROADMAP.md — M1-M6 milestones (M1-M3 in this PR,
    M4-M6 as follow-up PRs). Explicit non-goals noted: no vLLM path, no
    D4 port, no fused Metal kernel, Mac-only by design.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
MLX-native implementation of the v1.5 E8 nested-lattice KV-cache
codec, bit-identical to the PyTorch reference at
kakeyalattice.V15KakeyaZamirE8GPU in float32.

Package layout (kakeyalattice_mlx/):
  hadamard.py          Sylvester-Hadamard matrix builder (MLX + NumPy ref)
  closest_point.py     Conway-Sloane Alg 5 closest-D8 / -E8 (MLX)
  codec.py             E8LatticeCodebookMLX with .roundtrip(x)
  kv_cache.py          KakeyaLatticeMLXCache: mlx-lm KVCache wrapper
                       + make_kakeya_caches(model, variant, q_range,
                       boundary) factory for per-layer caches
  _reference_numpy.py  NumPy shadow of the codec (CI on Linux)

Tests (72 passed + 1 skipped on Linux; MLX-parity gate activates on
Apple Silicon):
  test_bits_accounting.py        E8 bit-formula matches v1.5 report
                                 canonical values (D=128 Q=38 -> 848,
                                 Q=10 -> 608, Q=4 -> 448, Q=152 -> 1104)
  test_hadamard_numpy.py         shape / self-inverse / power-of-2 guards
  test_closest_point_numpy.py    D8 parity constraint, E8 coset structure,
                                 E8 distance <= D8 (monotonicity), shape
                                 preservation, already-in-lattice fixpoint
  test_roundtrip_numpy.py        shape/dtype, bounded rel-err, Q monotone
                                 in error, zero-input round-trips to zero,
                                 determinism
  test_kv_cache.py               delegation, fire/skip counters, broken-codec
                                 fallback to inner, attribute forwarding
  test_codec_mlx_parity.py       [MLX-gated] MLX <-> NumPy max_abs_diff <=
                                 1e-5 across D in {64,128,256} x Q in
                                 {4,10,38,152}; optional PyTorch 3-way parity

The zero-input NaN path (fp16-collapsed eps) is explicitly tested
and fixed by pinning eps = finfo(float32).eps, matching the PyTorch
reference.

Design decisions:
  - float32 internal compute by default (required for bit parity);
    fp16/bf16 paths exist but are not parity-gated.
  - Wrapper-not-subclass for the KV cache: mlx-lm adds attributes
    across releases; __getattr__ forwards to the inner cache so new
    attributes flow through without patches.
  - NumPy shadow reference is its own module so three-way parity
    (PyTorch <-> NumPy <-> MLX) is always verifiable without
    pulling torch into Linux CI.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
B2 sidecar (:1339) matching B1's OpenAI-compatible HTTP shape, with
MLX-specific metadata surfaced in /v1/models.

Package layout (kakeya_sidecar_mlx/):
  __init__.py           lazy re-export (importable without mlx/fastapi)
  cli.py                kakeya-sidecar-mlx --host/--port/--device/
                        --enable-dflash/--prewarm entry point
  model_registry_mlx.py MLX deployment profiles with dflash_draft_repo
                        field; populated from z-lab's 2026-04-30
                        DFlash draft catalogue
  engine_mlx.py         MLXEngineConfig + MLXEngine skeleton: warmup +
                        LRU implemented; chat/chat_stream raise
                        NotImplementedError pointing at ROADMAP M4
  server.py             FastAPI routes; /v1/chat/completions returns
                        503 with ROADMAP reference until M4 lands

Registry coverage (M3):
  qwen3-4b            -> z-lab/Qwen3-4B-DFlash-b16
  qwen3-8b            -> z-lab/Qwen3-8B-DFlash-b16 (B2 primary target)
  llama-3.1-8b-instruct -> z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat
  qwen3.5-4b          -> z-lab/Qwen3.5-4B-DFlash
  mistral-7b-instruct-v0.3 -> no DFlash draft; MLX-only channels

Tests (17 passed):
  test_model_registry_mlx.py  registry covers all DFlash hero models;
                              dflash_available <-> dflash_draft_repo
                              invariant (inconsistent availability
                              silently downgraded in __post_init__);
                              E8-only guard; Qwen3-8B default pinned to
                              Q=38 near-lossless for acceptance-rate
                              safety
  test_server_skeleton.py     /health flags B2 variant, /v1/models
                              exposes dflash metadata, Mistral has no
                              DFlash draft, /v1/chat/completions returns
                              503 until M4

Design decisions:
  - Explicit 503 on /chat/completions (not a fallback mock): better a
    clean skeleton than a half-working chat that silently diverges from
    B1's output.
  - dflash_available is a DERIVED flag recomputed in __post_init__ to
    stop the UI from ever showing 'DFlash: on' without an actual draft.
  - qwen3-8b default = Q=38 (not Q=10) — under DFlash spec-decode a
    target-side Kakeya |Delta-ppl| cuts acceptance length, so we pay
    for quality first.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
