feat(atomic-chat): B2/M5 — acceptance-rate benchmark + M6 standalone product proposal #60
Draft
FluffyAIcode wants to merge 3 commits into AgentMemory/atomic-chat-b2-mlx-dflash-kakeya-04ae from …
New benchmarks package `benchmarks.b2_dflash_kakeya` quantifies the
impact of KakeyaLattice E8 KV-cache compression on DFlash block-
diffusion speculative decoding.
Experiment design (laid out in README):
target: Qwen/Qwen3-8B
draft: z-lab/Qwen3-8B-DFlash-b16
KV channels: bf16 / e8-q38 / e8-q10 / e8-q4 (4 levels)
datasets: gsm8k, humaneval (32 prompts each by default)
metrics: acceptance_length {mean, p50, p95}, tps, TTFT,
codec_fired, correctness_proxy
Package layout:
runner.py CLI + top-level orchestration; --dry-run exercises
the whole pipeline on Linux CI without MLX/dflash/HF.
datasets.py Three-tier dataset loader (local jsonl ->
            HF datasets -> synthetic). Synthetic is gated
            behind --allow-synthetic so nobody ships numbers
            from 3-prompt fixtures. (Loader sketched after
            this layout.)
engines.py  Engine Protocol with RealEngine (delegates to
            kakeya_sidecar_mlx.MLXEngine) and MockEngine
            (deterministic fake with the theoretical accept-
            length ordering bf16 > Q=38 > Q=10 > Q=4).
            (Sketched after this layout.)
metrics.py  Pure-stdlib percentile + mean + correctness
            proxies (gsm8k = numeric-substring, humaneval =
            def/return substring; a full execution harness is
            explicitly out of scope for now). (Proxies
            sketched after this layout.)
schema.py Pinned schema version b2-dflash-kakeya-v1 so
downstream tooling can detect breaking changes.
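
For illustration, a minimal sketch of the three-tier fallback that datasets.py describes; the function name load_dataset_tiered and its parameters are assumptions, not the package's actual API:

# Illustrative sketch of the local-jsonl -> HF datasets -> synthetic fallback.
# Names are hypothetical; the real datasets.py may differ.
import json
from pathlib import Path

def load_dataset_tiered(name: str, local_dir: Path, n_samples: int,
                        allow_synthetic: bool = False) -> list[dict]:
    # Tier 1: a local jsonl file wins if present.
    local = local_dir / f"{name}.jsonl"
    if local.exists():
        rows = [json.loads(ln) for ln in local.read_text().splitlines() if ln]
        return rows[:n_samples]
    # Tier 2: fall back to the HF hub if the datasets package is importable.
    try:
        from datasets import load_dataset
        ds = load_dataset(name, split="test")
        return [dict(row) for row in ds.select(range(min(n_samples, len(ds))))]
    except Exception:
        pass
    # Tier 3: tiny synthetic fixture, explicitly opt-in so fixture numbers
    # can never be mistaken for real benchmark results.
    if not allow_synthetic:
        raise RuntimeError(f"{name}: no local/HF data; pass --allow-synthetic for fixtures")
    return [{"prompt": f"{name} synthetic #{i}"} for i in range(n_samples)]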
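
Likewise, a sketch of the engine seam, assuming a single generate() method on the Protocol; the real interface in engines.py may expose more. The per-channel means 15/14/12/8 are the dry-run expectations quoted below:

# Sketch of the engine seam: both engines satisfy the same structural type,
# so the runner never needs to know whether MLX/dflash are importable.
from typing import Protocol
import random

class Engine(Protocol):
    def generate(self, prompt: str, max_tokens: int) -> dict: ...

class MockEngine:
    """Deterministic fake encoding the ordering bf16 > Q=38 > Q=10 > Q=4."""
    # Mean accept lengths per KV channel (the dry-run expectations 15/14/12/8).
    ACCEPT_MEAN = {"bf16": 15, "e8-q38": 14, "e8-q10": 12, "e8-q4": 8}

    def __init__(self, channel: str, seed: int = 0):
        self.channel = channel
        self.rng = random.Random(seed)

    def generate(self, prompt: str, max_tokens: int) -> dict:
        base = self.ACCEPT_MEAN[self.channel]
        # Small deterministic jitter keeps per-sample stats non-degenerate
        # while preserving the channel ordering after aggregation.
        accept = max(1, base + self.rng.randint(-1, 1))
        return {"text": f"mock:{prompt[:16]}", "accept_length": accept,
                "tokens": max_tokens}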
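
And a pure-stdlib sketch of the percentile plus the two substring proxies; the linear-interpolation convention and helper names are guesses at what metrics.py does, not its confirmed behaviour:

# Pure-stdlib percentile (linear interpolation) plus the two substring proxies.
# Helper names are illustrative, not necessarily those in metrics.py.
def percentile(xs: list[float], p: float) -> float:
    assert xs, "percentile of an empty list is undefined"
    s = sorted(xs)
    k = (len(s) - 1) * p / 100.0
    lo, hi = int(k), min(int(k) + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

def gsm8k_proxy(output: str, gold_answer: str) -> bool:
    # Numeric-substring check: did the gold number appear anywhere in the output?
    return gold_answer.strip() in output

def humaneval_proxy(output: str) -> bool:
    # Shape check only: the output looks like a function body.
    # A full execution harness is explicitly out of scope.
    return "def " in output and "return" in output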
Tests (24 passed, Linux CI only, no MLX/dflash/HF):
test_metrics.py percentile edge cases, mean, correctness
proxies, multi-record summarisation.
test_datasets.py synthetic fixture, local jsonl preference,
n_samples truncation, seed determinism.
test_runner_mock.py full end-to-end via MockEngine + synthetic
data; asserts accept-length ordering is
preserved through aggregation and that
JSON outputs conform to schema v1.
Dry-run smoke (executed locally):
python -m benchmarks.b2_dflash_kakeya.runner \
--dry-run --n-samples 3 --max-tokens 64 \
--out-dir /tmp/b2_dryrun
# => 8 JSON files (2 datasets x 4 channels), schema v1,
# accept means 15 / 14 / 12 / 8 as expected.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
The M5 runner writes one JSON per (dataset, channel) combination
into reports/b2_release/. The real run requires Apple Silicon +
MLX + dflash + gsm8k/humaneval data; until then this directory
only documents the expected layout.
Expected artefacts after a real run:
b2_dflash_kakeya_{gsm8k,humaneval}_{bf16,e8-q38,e8-q10,e8-q4}.json
FINDINGS.md (narrative + aggregate tables)
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
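
To make the expected artefact shape concrete, here is a hypothetical per-(dataset, channel) record; only schema_version (b2-dflash-kakeya-v1) and the metric names listed in the M5 commit message come from this PR, all other field names and values are illustrative:

# Hypothetical output record for one (dataset, channel) pair; field names
# other than schema_version are guesses, not the pinned schema itself.
example = {
    "schema_version": "b2-dflash-kakeya-v1",
    "dataset": "gsm8k",
    "channel": "bf16",
    "n_samples": 3,
    "acceptance_length": {"mean": 15.0, "p50": 15.0, "p95": 15.0},
    "tps": None,        # not meaningful under --dry-run
    "ttft_ms": None,
    "codec_fired": 0,
    "correctness_proxy": None,
}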
docs(b2/M6): standalone product proposal - pivot from Atomic-Chat backend
Evaluates repositioning B2 (MLX + DFlash + KakeyaLattice-MLX) from
'an Atomic-Chat second backend' (original M6 scope) to an
independent Mac local-inference product.
Three candidate forms:
A. CLI tool `kakeya-llm` (homebrew / pip)
B. Native macOS app `Kakeya Studio` (DMG, custom UI)
C. SDK + developer/enterprise library (PyPI + docs site)
Recommendation: ship A + C first, defer B.
Rationale:
1. atomic.chat's "Google TurboQuant built-in" headline is
under-delivered by the actual llama.cpp KV quantisation stack;
folding B2 into Atomic-Chat means fulfilling *their* marketing
through *our* engineering with no control over the message.
2. B2's three capabilities (MLX-native inference, DFlash 3-6x
speedup, E8 KV compression) are general-purpose infra — not
chat-UI-specific. Binding them to one chat app under-uses the
stack.
3. Existing engineering assets (kakeyalattice_mlx,
kakeya_sidecar_mlx, cache_injection) are ~90% reusable for A+C
with minimal new work; going the Atomic-Chat-backend route
blocks on third-party PR review + release cadence.
Phased execution:
Phase 1 (after M5 merges): PyPI release, landing page, CLI shim.
Phase 2 (4-8 weeks later): Cursor / Raycast / Obsidian integrations,
early enterprise POCs.
Phase 3 (6 months out, re-evaluate): whether to do the native app.
This evaluation does NOT invalidate PR #57 (B1), #58 (B2 skeleton),
or #59 (B2/M4): all that code is directly reusable. It re-scopes
M6 only, shifting from 'ship to Atomic-Chat' to 'ship independently
+ opportunistic Atomic-Chat integration if they take our PR'.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Summary
M5: delivers B2's acceptance-rate benchmark, turning the theoretical argument that stacking DFlash × KakeyaLattice does not hurt acceptance into a runnable experiment stack.
M6: delivers the standalone-product evaluation
docs/B2_STANDALONE_PRODUCT_PROPOSAL.md, repositioning B2's GTM path from "Atomic-Chat's second backend" to a standalone Mac local-inference SDK + CLI product.

Dependencies

This branch builds on AgentMemory/atomic-chat-b2-mlx-dflash-kakeya-04ae (B2 M1-M3, PR #58): MLXEngine is the foundation for RealEngine, so the recommended merge order is feat(atomic-chat): B2 — MLX + DFlash + KakeyaLattice (M1-M3 skeleton) #58 → feat(atomic-chat): B2/M4 — DFlash speculative decoding + Kakeya target KV #59 → this PR. M5's CI tests run on MockEngine, however, and do not depend on the M4 code.

M5 — Benchmark deliverables
benchmarks/b2_dflash_kakeya/ is a new package:
runner.py    --dry-run lets Linux CI exercise the whole pipeline
datasets.py  three-tier loader (local jsonl → HF datasets → synthetic fixture)
engines.py   Engine Protocol + RealEngine (MLX + DFlash) + MockEngine (CI)
metrics.py   percentiles, means, correctness proxies
schema.py    pinned schema version b2-dflash-kakeya-v1

Experiment design

target  Qwen/Qwen3-8B (non-thinking)
draft   z-lab/Qwen3-8B-DFlash-b16
KV      bf16 / e8-q38 / e8-q10 / e8-q4

Theoretical expectations (encoded in MockEngine)
If real measurements show Q=38 dropping acceptance by more than 2 pp, or Q=10 by more than 5 pp, the default tier and the marketing claims need to be revised accordingly; the sketch below makes that gate concrete.
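A minimal sketch of that revision gate, assuming acceptance is reported as a rate in percent; the thresholds come from the sentence above, while the constant table and function name are illustrative:

# Sketch of the tier-revision gate; measured values are acceptance rates in
# percent, and the per-channel thresholds (2 pp / 5 pp) come from this PR.
THRESHOLDS_PP = {"e8-q38": 2.0, "e8-q10": 5.0}

def tiers_needing_revision(measured: dict[str, float]) -> list[str]:
    baseline = measured["bf16"]
    return [ch for ch, max_drop in THRESHOLDS_PP.items()
            if baseline - measured[ch] > max_drop]

# e.g. tiers_needing_revision({"bf16": 92.0, "e8-q38": 89.5, "e8-q10": 88.0})
# -> ["e8-q38"]  (q38 dropped 2.5 pp > 2 pp; q10 dropped 4 pp <= 5 pp)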
Tests (24 passed, runnable on Linux CI)

Dry-run smoke test

$ python -m benchmarks.b2_dflash_kakeya.runner \
    --dry-run --n-samples 3 --max-tokens 64 \
    --out-dir /tmp/b2_dryrun
# => 8 JSON files (2 datasets × 4 channels)
# accept_mean: bf16 ≈ 15, q38 ≈ 14, q10 ≈ 12, q4 ≈ 8, matching the theoretical ordering

M6 — Standalone product evaluation
The full proposal is in docs/B2_STANDALONE_PRODUCT_PROPOSAL.md. In summary:

Three reasons for the re-evaluation (see the Rationale in the M6 commit message above).
Three candidate forms

A. CLI tool kakeya-llm (brew + pip)
B. Native macOS app Kakeya Studio (DMG)
C. SDK + developer/enterprise library

Recommended path: A + C first, defer B by six months. First step: publish kakeyalattice-mlx + kakeya-sidecar-mlx to PyPI, stand up a landing page, wrap them in a kakeya-llm CLI.

Impact on existing work: summarised in the M6 commit message above; the code from PRs #57/#58/#59 remains directly reusable.
Merge recommendation

This PR carries two independent pieces that can be split:
benchmarks/b2_dflash_kakeya/ (M5) — engineering code + tests, independently mergeable
docs/B2_STANDALONE_PRODUCT_PROPOSAL.md (M6) — documentation only; needs a team call on the GTM decision
Either can land on its own.
Commits
feat(b2/M5): acceptance-rate benchmark runner for DFlash x Kakeya
docs(b2/M5): reports/b2_release/ placeholder for real benchmark output
docs(b2/M6): standalone product proposal - pivot from Atomic-Chat backend

Next steps (assuming this PR merges and the M6 evaluation is adopted)

Add PyPI release CI for kakeyalattice_mlx and kakeya_sidecar_mlx; register the kakeyalattice.dev / kakeya-mlx.dev domain.