feat(atomic-chat): B2 — MLX + DFlash + KakeyaLattice (M1-M3 skeleton) #58
Draft
FluffyAIcode wants to merge 3 commits into main from
Conversation
B2 is the performance successor to B1 (PR #57):

- B1 = HF transformers + torch MPS + KakeyaLatticeCache (Python)
- B2 = MLX-native + DFlash block-diffusion speculative decoding + KakeyaLattice E8 codec ported to MLX

The two sidecars coexist: B1 serves Mac/Win/Linux on :1338; B2 serves Mac-only on :1339. Atomic-Chat's frontend picks one via a UI toggle (default B2 on Apple Silicon).

This commit ships:

- integrations/atomic-chat-b2/README.md — directory overview + B1/B2 comparison
- integrations/atomic-chat-b2/ROADMAP.md — M1-M6 milestones (M1-M3 in this PR, M4-M6 as follow-up PRs)

Explicit non-goals noted: no vLLM path, no D4 port, no fused Metal kernel, Mac-only by design.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
MLX-native implementation of the v1.5 E8 nested-lattice KV-cache
codec, bit-identical to the PyTorch reference at
kakeyalattice.V15KakeyaZamirE8GPU in float32.
Package layout (kakeyalattice_mlx/):
hadamard.py Sylvester-Hadamard matrix builder (MLX + NumPy ref)
closest_point.py Conway-Sloane Alg 5 closest-D8 / -E8 (MLX)
codec.py E8LatticeCodebookMLX with .roundtrip(x)
kv_cache.py KakeyaLatticeMLXCache: mlx-lm KVCache wrapper
+ make_kakeya_caches(model, variant, q_range,
boundary) factory for per-layer caches
_reference_numpy.py NumPy shadow of the codec (CI on Linux)
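The closest-point step above can be sketched in plain NumPy (a shadow of the MLX version in the spirit of _reference_numpy.py; the function names here are illustrative, not the module's actual API). It follows the Conway-Sloane construction: round to the nearest integer vector, repair the parity for D8, and pick the nearer of the two E8 cosets.

```python
import numpy as np

def closest_d8(x):
    """Closest point in D8 = integer vectors with even coordinate sum
    (Conway & Sloane's algorithm for D_n)."""
    f = np.round(x)                      # naive coordinate-wise rounding
    if f.sum() % 2 == 0:
        return f
    # Parity is odd: re-round the coordinate with the largest rounding
    # error in the "wrong" direction to flip the sum's parity.
    err = x - f
    k = int(np.argmax(np.abs(err)))
    f[k] += np.sign(err[k]) if err[k] != 0 else 1.0
    return f

def closest_e8(x):
    """E8 = D8 ∪ (D8 + 1/2): return the nearer of the two coset candidates."""
    c0 = closest_d8(x)
    c1 = closest_d8(x - 0.5) + 0.5
    return c0 if np.sum((x - c0) ** 2) <= np.sum((x - c1) ** 2) else c1
```

For example, `closest_e8(np.full(8, 0.6))` lands on the half-integer coset point `[0.5] * 8`, while an already-in-lattice input round-trips to itself (the fixpoint property the tests check).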
Tests (72 passed + 1 skipped on Linux; MLX-parity gate activates on
Apple Silicon):
test_bits_accounting.py E8 bit-formula matches v1.5 report
canonical values (D=128 Q=38 -> 848,
Q=10 -> 608, Q=4 -> 448, Q=152 -> 1104)
test_hadamard_numpy.py shape / self-inverse / power-of-2 guards
test_closest_point_numpy.py D8 parity constraint, E8 coset structure,
E8 distance <= D8 (monotonicity), shape
preservation, already-in-lattice fixpoint
test_roundtrip_numpy.py shape/dtype, bounded rel-err, Q monotone
in error, zero-input round-trips to zero,
determinism
test_kv_cache.py delegation, fire/skip counters, broken-codec
fallback to inner, attribute forwarding
test_codec_mlx_parity.py [MLX-gated] MLX <-> NumPy max_abs_diff <=
1e-5 across D in {64,128,256} x Q in
{4,10,38,152}; optional PyTorch 3-way parity
The zero-input NaN path (fp16-collapsed eps) is explicitly tested
and fixed by pinning eps = finfo(float32).eps, matching the PyTorch
reference.
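A minimal NumPy illustration of that failure mode (the `normalize` helper is a hypothetical stand-in for the codec's internal scale step, not its actual name):

```python
import numpy as np

def normalize(x, eps):
    # Stand-in for the codec's scale step: x / (||x|| + eps).
    return x / (np.linalg.norm(x) + eps)

zero = np.zeros(8, dtype=np.float32)

# Pinned float32 eps: zero input round-trips to zero, no NaN.
good = normalize(zero, np.finfo(np.float32).eps)

# If eps collapses to 0.0 somewhere along a half-precision path,
# the identical expression becomes 0/0 = NaN.
with np.errstate(invalid="ignore"):
    bad = normalize(zero, np.float32(0.0))

print(np.isnan(good).any(), np.isnan(bad).all())  # False True
```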
Design decisions:
- float32 internal compute by default (required for bit parity);
fp16/bf16 paths exist but are not parity-gated.
- Wrapper-not-subclass for the KV cache: mlx-lm adds attributes
across releases; __getattr__ forwards to the inner cache so new
attributes flow through without patches.
- NumPy shadow reference is its own module so three-way parity
(PyTorch <-> NumPy <-> MLX) is always verifiable without
pulling torch into Linux CI.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
B2 sidecar (:1339) matching B1's OpenAI-compatible HTTP shape, with
MLX-specific metadata surfaced in /v1/models.
Package layout (kakeya_sidecar_mlx/):
__init__.py lazy re-export (importable without mlx/fastapi)
cli.py kakeya-sidecar-mlx --host/--port/--device/
--enable-dflash/--prewarm entry point
model_registry_mlx.py MLX deployment profiles with dflash_draft_repo
field; populated from z-lab's 2026-04-30
DFlash draft catalogue
engine_mlx.py MLXEngineConfig + MLXEngine skeleton: warmup +
LRU implemented; chat/chat_stream raise
NotImplementedError pointing at ROADMAP M4
server.py FastAPI routes; /v1/chat/completions returns
503 with ROADMAP reference until M4 lands
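The deliberate 503 can be sketched framework-agnostically (the route path and ROADMAP reference come from this PR; the handler name and error-body fields are illustrative, not the server's actual schema):

```python
from http import HTTPStatus

ROADMAP = "integrations/atomic-chat-b2/ROADMAP.md"

def chat_completions_stub(_request: dict) -> tuple[int, dict]:
    # M3 behaviour: a clean, explicit 503 rather than a mock response,
    # so clients can distinguish "not implemented yet" from a crash.
    return int(HTTPStatus.SERVICE_UNAVAILABLE), {
        "error": {
            "message": f"/v1/chat/completions lands in M4; see {ROADMAP}",
            "type": "not_implemented",
        }
    }

status, body = chat_completions_stub({})
print(status)  # 503
```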
Registry coverage (M3):
qwen3-4b -> z-lab/Qwen3-4B-DFlash-b16
qwen3-8b -> z-lab/Qwen3-8B-DFlash-b16 (B2 primary target)
llama-3.1-8b-instruct -> z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat
qwen3.5-4b -> z-lab/Qwen3.5-4B-DFlash
mistral-7b-instruct-v0.3 -> no DFlash draft; MLX-only channels
Tests (17 passed):
test_model_registry_mlx.py registry covers all DFlash hero models;
dflash_available <-> dflash_draft_repo
invariant (inconsistent availability
silently downgraded in __post_init__);
E8-only guard; Qwen3-8B default pinned to
Q=38 near-lossless for acceptance-rate
safety
test_server_skeleton.py /health flags B2 variant, /v1/models
exposes dflash metadata, Mistral has no
DFlash draft, /v1/chat/completions returns
503 until M4
Design decisions:
- Explicit 503 on /chat/completions (not a fallback mock): better a
clean skeleton than a half-working chat that silently diverges from
B1's output.
- dflash_available is a DERIVED flag recomputed in __post_init__ to
stop the UI from ever showing 'DFlash: on' without an actual draft.
- qwen3-8b default = Q=38 (not Q=10): under DFlash speculative decoding, target-side Kakeya perplexity degradation (|Delta-ppl|) shortens the acceptance length, so we pay for quality first.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Summary

B2 is Atomic-Chat's Mac Pro Mode inference stack, an upgrade over B1's (PR #57) HF + MPS sidecar.

Design basis: docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md §12 (DFlash integration path) + §14 (A0 vs B2 decision table), both introduced by PR #57.

Relation to B1

B1 and B2 are not mutually exclusive; they coexist. The two sidecars listen on different ports, and Atomic-Chat's Tauri plugin routes according to the user-selected backend:

- B1: KakeyaLatticeCache (PyTorch)
- B2: KakeyaLatticeMLXCache

Scope of this PR (M1-M3 skeleton)

Per the ROADMAP:

- kakeyalattice_mlx/ — MLX port of the E8 codec + PyTorch parity tests
- KakeyaLatticeMLXCache — mlx-lm KV cache wrapper
- kakeya_sidecar_mlx/ — OpenAI-compatible MLX sidecar skeleton
- dflash.model_mlx.stream_generate + target-KV compression

Deliverables

integrations/atomic-chat-b2/kakeyalattice_mlx/ — M1+M2

MLX-native implementation of the KakeyaLattice v1.5 E8 codec, bit-identical to the PyTorch reference in float32.

- hadamard.py
- closest_point.py
- codec.py — E8LatticeCodebookMLX with .roundtrip(x)
- kv_cache.py — KakeyaLatticeMLXCache: mlx-lm KVCache wrapper + make_kakeya_caches() factory
- _reference_numpy.py

Key design decisions:

- eps = finfo(float32).eps pinned to match the PyTorch ref, avoiding division by zero through an fp16-collapsed eps

integrations/atomic-chat-b2/kakeya_sidecar_mlx/ — M3

B2's OpenAI-compatible sidecar. Route shape identical to B1's, except x_kakeya in /v1/models carries two extra fields: dflash_draft_repo + dflash_available.

- cli.py — kakeya-sidecar-mlx --host/--port/--device/--enable-dflash/--prewarm
- model_registry_mlx.py
- engine_mlx.py — MLXEngine: warmup + LRU implemented; chat() raises NotImplementedError pointing at M4
- server.py — /v1/chat/completions returns 503 + a ROADMAP pointer until M4

Registry coverage (M3):

- Qwen/Qwen3-4B -> z-lab/Qwen3-4B-DFlash-b16
- Qwen/Qwen3-8B -> z-lab/Qwen3-8B-DFlash-b16 (B2 primary benchmark target)
- meta-llama/Llama-3.1-8B-Instruct -> z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat
- Qwen/Qwen3.5-4B -> z-lab/Qwen3.5-4B-DFlash
- mistralai/Mistral-7B-Instruct-v0.3 -> no DFlash draft

Tests

89/89 green (run on Linux CI; the MLX-parity gate activates 12 additional cases on Apple Silicon).

Test coverage:

- max_abs_diff ≤ 1e-5 across D ∈ {64, 128, 256} × Q ∈ {4, 10, 38, 152}
- dflash_available is never set without dflash_draft_repo; E8-only guard; Qwen3-8B default = Q=38 near-lossless (acceptance-rate safety)

Non-goals (explicitly out of scope for this PR)

Dependency PRs

Upcoming PRs

Per ROADMAP M4-M6, each on its own branch:

- M4: dflash.model_mlx.stream_generate, with the target KV going through KakeyaLatticeMLXCache
- M5: Qwen3-8B × {bf16, Kakeya Q=38/10/4} × DFlash-b16 on gsm8k + humaneval, measuring the acceptance-length distribution + effective tok/s
- M6: a "KakeyaLattice (MLX+DFlash) ★ Pro" backend option, with the Tauri plugin hosting both sidecars