
feat(atomic-chat): B2/M5 — acceptance-rate benchmark + M6 standalone product proposal#60

Draft
FluffyAIcode wants to merge 3 commits into AgentMemory/atomic-chat-b2-mlx-dflash-kakeya-04ae from AgentMemory/atomic-chat-b2-m5-acceptance-benchmark-04ae

Conversation

@FluffyAIcode (Owner)

Summary

M5: delivers the B2 acceptance-rate benchmark, turning the theoretical argument that "stacking DFlash × KakeyaLattice does not hurt acceptance" into a runnable experiment stack.

M6: delivers the standalone-product evaluation docs/B2_STANDALONE_PRODUCT_PROPOSAL.md, repositioning B2's GTM path from "a second backend for Atomic-Chat" to a standalone Mac local-inference SDK + CLI product.

Dependencies

M5 — Benchmark deliverables

New package benchmarks/b2_dflash_kakeya/

| File | Purpose |
| --- | --- |
| `runner.py` | CLI + top-level orchestration; `--dry-run` lets Linux CI exercise the full pipeline |
| `datasets.py` | Three-tier loader (local jsonl → HF datasets → synthetic fixture) |
| `engines.py` | Engine Protocol + RealEngine (MLX + DFlash) + MockEngine (CI) |
| `metrics.py` | Pure-stdlib percentile / mean / correctness proxies (gsm8k / humaneval) |
| `schema.py` | Pinned schema version `b2-dflash-kakeya-v1` |
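
For orientation, here is a minimal sketch of how the three-tier fallback in `datasets.py` could work. The function and parameter names below are assumptions for illustration, not the module's actual API; per the commit message, the synthetic tier is gated behind `--allow-synthetic` so nobody ships numbers from 3-prompt fixtures.

```python
# Hypothetical sketch of the three-tier fallback; names are illustrative.
import json
from pathlib import Path

def load_prompts(dataset: str, n_samples: int, seed: int,
                 local_dir: Path = Path("data"),
                 allow_synthetic: bool = False) -> list[dict]:
    # Tier 1: a local jsonl file wins if it exists.
    local = local_dir / f"{dataset}.jsonl"
    if local.exists():
        lines = [ln for ln in local.read_text().splitlines() if ln.strip()]
        return [json.loads(ln) for ln in lines][:n_samples]

    # Tier 2: fall back to the HF hub when `datasets` is installed.
    try:
        import datasets  # optional dependency; absent on Linux CI
    except ImportError:
        datasets = None
    if datasets is not None:
        ds = datasets.load_dataset(dataset, split="test")
        return list(ds.shuffle(seed=seed).select(range(n_samples)))

    # Tier 3: tiny deterministic fixture, gated so nobody ships
    # numbers from a 3-prompt fixture by accident.
    if not allow_synthetic:
        raise RuntimeError(f"no data for {dataset!r}; pass --allow-synthetic")
    return [{"prompt": f"{dataset} synthetic #{i}"} for i in range(n_samples)]
```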

Experiment design

| Dimension | Value |
| --- | --- |
| Target | Qwen/Qwen3-8B (non-thinking) |
| Draft | z-lab/Qwen3-8B-DFlash-b16 |
| KV channels | bf16 / e8-q38 / e8-q10 / e8-q4 |
| Datasets | gsm8k + humaneval |
| n_samples | 32 (default); can be dropped to 8 for a quick smoke test |
| Metrics | acceptance_length {mean, p50, p95}, tps, TTFT, codec_fired, correctness_proxy |
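
To make the metrics row concrete, here is a hedged sketch of pure-stdlib helpers in the spirit of `metrics.py`. Names and the exact interpolation scheme are assumptions; the correctness proxies follow the commit message (numeric-substring for gsm8k, def/return substring for humaneval; a full execution harness is out of scope).

```python
# Illustrative pure-stdlib helpers in the spirit of metrics.py.
import math

def percentile(values: list[float], p: float) -> float:
    """Linearly interpolated percentile (p=50 -> p50, p=95 -> p95)."""
    if not values:
        raise ValueError("percentile of empty list")
    xs = sorted(values)
    rank = (p / 100.0) * (len(xs) - 1)
    lo, hi = math.floor(rank), math.ceil(rank)
    return xs[lo] + (xs[hi] - xs[lo]) * (rank - lo)

def correctness_proxy(dataset: str, completion: str, reference: str) -> bool:
    # gsm8k: does the reference numeric answer appear verbatim?
    if dataset == "gsm8k":
        return reference.strip() in completion
    # humaneval: crude structural check; a real execution harness is
    # explicitly out of scope for M5.
    return "def " in completion and "return" in completion
```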

Theoretical expectations (encoded in MockEngine)

| Channel | acceptance_length | tps vs. baseline | Verdict |
| --- | --- | --- | --- |
| bf16 | ~15 (DFlash's published Qwen3-8B figure) | 1.00× | baseline |
| Kakeya Q=38 | ~14 (drop < 1 pp) | ~0.95-1.00× | near-lossless, usable |
| Kakeya Q=10 | ~12 (drop of 1-3 pp) | ~0.80-0.90× | trades some speed for KV savings |
| Kakeya Q=4 | ~8 (significant drop) | ~0.50-0.70× | excluded from the default tiers |

If real measurements show Q=38 dropping by more than 2 pp, or Q=10 by more than 5 pp, the default tiers and the marketing claims must be revised accordingly.
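
A rough sketch of how a deterministic MockEngine can encode the ordering above. The expected means come straight from the table; the class shape and method names are illustrative, not the real `engines.py` API.

```python
# Hypothetical MockEngine; encodes the theoretical per-channel expectations.
import random
from dataclasses import dataclass

# Expected mean acceptance lengths, taken from the table above.
_EXPECTED_ACCEPT = {"bf16": 15.0, "e8-q38": 14.0, "e8-q10": 12.0, "e8-q4": 8.0}

@dataclass
class StepResult:
    accept_length: float  # tokens accepted in one speculative step

class MockEngine:
    """Deterministic fake preserving bf16 > Q=38 > Q=10 > Q=4."""

    def __init__(self, channel: str, seed: int = 0):
        self.mean = _EXPECTED_ACCEPT[channel]
        self.rng = random.Random(f"{channel}-{seed}")

    def generate(self, prompt: str, max_tokens: int) -> list[StepResult]:
        steps, produced = [], 0
        while produced < max_tokens:
            # Seeded jitter keeps samples distinct while the channel
            # ordering stays stable across runs.
            accept = max(1.0, self.rng.gauss(self.mean, 0.5))
            steps.append(StepResult(accept_length=accept))
            produced += int(accept)
        return steps
```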

Tests (24 passed, runnable on Linux CI)

  • test_metrics.py: percentile edge cases, mean, correctness proxies, multi-sample aggregation
  • test_datasets.py: synthetic fixture, local-jsonl preference, n_samples truncation, seed determinism
  • test_runner_mock.py: end-to-end run of MockEngine + synthetic data → JSON output; asserts schema v1 round-trips and that the accept-length ordering is preserved through aggregation (see the sketch after this list)
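
The ordering assertion could look roughly like this, reusing the MockEngine sketch above. This is a simplified stand-in; the real test drives the full runner and aggregation path.

```python
# Sketch of the ordering-preservation check; helper names are illustrative.
def test_accept_length_ordering_survives_aggregation():
    channels = ["bf16", "e8-q38", "e8-q10", "e8-q4"]
    means = []
    for channel in channels:
        steps = MockEngine(channel, seed=0).generate("2 + 2 = ?", max_tokens=64)
        means.append(sum(s.accept_length for s in steps) / len(steps))
    # The theoretical ordering bf16 > Q=38 > Q=10 > Q=4 must survive aggregation.
    assert means == sorted(means, reverse=True)
```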

Dry-run smoke test

$ python -m benchmarks.b2_dflash_kakeya.runner \
    --dry-run --n-samples 3 --max-tokens 64 \
    --out-dir /tmp/b2_dryrun

# => 8 JSON files (2 datasets × 4 channels)
# accept_mean: bf16≈15, q38≈14, q10≈12, q4≈8, matching the theoretical ordering
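
Since downstream tooling keys off the pinned schema version, a consumer-side check might look like this. Only the `b2-dflash-kakeya-v1` string is from the source; the `schema_version` field name and the loader shape are assumptions.

```python
# Consumer-side schema check; the version string is from schema.py,
# the `schema_version` field name and loader shape are assumptions.
import json
from pathlib import Path

SCHEMA_VERSION = "b2-dflash-kakeya-v1"

def load_result(path: Path) -> dict:
    record = json.loads(path.read_text())
    found = record.get("schema_version")
    # Fail loudly on mismatch so breaking changes surface early.
    if found != SCHEMA_VERSION:
        raise ValueError(f"expected {SCHEMA_VERSION}, got {found!r} in {path}")
    return record
```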

M6 — Standalone product evaluation

The full proposal is in docs/B2_STANDALONE_PRODUCT_PROPOSAL.md. Summary:

Three reasons for re-evaluating

  1. atomic.chat's marketing does not match what B2 promises to deliver: "Google TurboQuant built-in" is actually llama.cpp scalar quantization (2023-era technology), and the v1.5 report found TQ b=2 structurally unusable. If B2 fulfills that promise, we do the experimental validation for a third-party product and capture none of the brand premium.
  2. B2's value exceeds a chat-UI backend: the MLX-native / DFlash / E8 KV-compression capabilities can serve IDEs, agents, batch processing, and other GUI hosts; locking them into a chat app locks them into one UI.
  3. atomic.chat roadmap risk: inference-engine swaps, extension-API churn, and the chance that KakeyaLattice stays buried in "advanced settings" forever.

Three candidate forms

| Form | Deliverable | Pros | Cons |
| --- | --- | --- | --- |
| A | CLI `kakeya-llm` (brew + pip) | simple distribution, broad integration surface | high barrier for non-technical users |
| B | Native app `Kakeya Studio` (DMG) | full narrative control, brand building | high UI / signing / notarization / GTM cost; competes head-on with atomic.chat |
| C | SDK + enterprise services (PyPI + docs site) | minimal UI investment, maximum technical leverage, differentiated competition | long B2B sales cycles |

Recommended path: A + C first; defer B for 6 months

  • Phase 1: publish `kakeyalattice-mlx` + `kakeya-sidecar-mlx` to PyPI, build a landing page, ship the `kakeya-llm` CLI wrapper
  • Phase 2 (4-8 weeks later): Cursor / Continue / Raycast / Obsidian integrations, enterprise POCs
  • Phase 3 (re-evaluate after 6 months): whether to build the native app, and whether to position KakeyaLattice as a brand in head-on competition

Impact on existing work

This evaluation does not invalidate PR #57 (B1), #58 (B2 skeleton), or #59 (B2/M4): all of that code is directly reusable. Only M6 is re-scoped, from "ship to Atomic-Chat" to "ship independently, with opportunistic Atomic-Chat integration".

Merge recommendation

This PR contains two independent pieces that can be split:

  1. `benchmarks/b2_dflash_kakeya/` (M5): engineering code + tests, independently mergeable
  2. `docs/B2_STANDALONE_PRODUCT_PROPOSAL.md` (M6): documentation only; the GTM decision needs team review

Options:

  • Merge together: keeps M5 + M6 in the same commit history as the narrative close of B2 phase 2
  • Split into two PRs: if the M6 GTM evaluation needs team discussion before it is finalized, merge the M5 benchmark first and leave the M6 document in this PR pending review

Commits

  1. feat(b2/M5): acceptance-rate benchmark runner for DFlash x Kakeya
  2. docs(b2/M5): reports/b2_release/ placeholder for real benchmark output
  3. docs(b2/M6): standalone product proposal - pivot from Atomic-Chat backend

Next steps (assuming this PR merges and the M6 evaluation is adopted)

  1. Add PyPI release CI for `kakeyalattice_mlx` + `kakeya_sidecar_mlx`
  2. Register the kakeyalattice.dev / kakeya-mlx.dev domains
  3. Draft landing-page content (comparison against atomic.chat / llama.cpp / ollama / MLC-LLM)
  4. Ship a v0.1.0 PyPI release within 2 weeks, plus Twitter / HN / arXiv announcements
  5. Within 1 month: Cursor / Continue / Aider integration PRs + first enterprise POC outreach

cursoragent and others added 3 commits April 30, 2026 05:07
feat(b2/M5): acceptance-rate benchmark runner for DFlash x Kakeya

New benchmarks package `benchmarks.b2_dflash_kakeya` quantifies the
impact of KakeyaLattice E8 KV-cache compression on DFlash block-
diffusion speculative decoding.

Experiment design (laid out in README):
  target: Qwen/Qwen3-8B
  draft:  z-lab/Qwen3-8B-DFlash-b16
  KV channels: bf16 / e8-q38 / e8-q10 / e8-q4   (4 levels)
  datasets: gsm8k, humaneval   (32 prompts each by default)
  metrics: acceptance_length {mean, p50, p95}, tps, TTFT,
           codec_fired, correctness_proxy

Package layout:
  runner.py       CLI + top-level orchestration; --dry-run exercises
                  the whole pipeline on Linux CI without MLX/dflash/HF.
  datasets.py     Three-tier dataset loader (local jsonl ->
                  HF datasets -> synthetic). Synthetic is gated
                  behind --allow-synthetic so nobody ships numbers
                  from 3-prompt fixtures.
  engines.py      Engine Protocol with RealEngine (delegates to
                  kakeya_sidecar_mlx.MLXEngine) and MockEngine
                  (deterministic fake with the theoretical accept-
                  length ordering bf16 > Q=38 > Q=10 > Q=4).
  metrics.py      Pure-stdlib percentile + mean + correctness
                  proxies (gsm8k = numeric-substring, humaneval =
                  def/return substring; full execution harness
                  explicitly out of scope for now).
  schema.py       Pinned schema version b2-dflash-kakeya-v1 so
                  downstream tooling can detect breaking changes.

Tests (24 passed, Linux CI only, no MLX/dflash/HF):
  test_metrics.py        percentile edge cases, mean, correctness
                         proxies, multi-record summarisation.
  test_datasets.py       synthetic fixture, local jsonl preference,
                         n_samples truncation, seed determinism.
  test_runner_mock.py    full end-to-end via MockEngine + synthetic
                         data; asserts accept-length ordering is
                         preserved through aggregation and that
                         JSON outputs conform to schema v1.

Dry-run smoke (executed locally):
  python -m benchmarks.b2_dflash_kakeya.runner \
      --dry-run --n-samples 3 --max-tokens 64 \
      --out-dir /tmp/b2_dryrun
  # => 8 JSON files (2 datasets x 4 channels), schema v1,
  #    accept means 15 / 14 / 12 / 8 as expected.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
docs(b2/M5): reports/b2_release/ placeholder for real benchmark output

The M5 runner writes one JSON per (dataset, channel) combination
into reports/b2_release/. The real run requires Apple Silicon +
MLX + dflash + gsm8k/humaneval data; until then this directory
only documents the expected layout.

Expected artefacts after a real run:
  b2_dflash_kakeya_{gsm8k,humaneval}_{bf16,e8-q38,e8-q10,e8-q4}.json
  FINDINGS.md (narrative + aggregate tables)

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
docs(b2/M6): standalone product proposal - pivot from Atomic-Chat backend

Evaluates repositioning B2 (MLX + DFlash + KakeyaLattice-MLX) from
'an Atomic-Chat second backend' (original M6 scope) to an
independent Mac local-inference product.

Three candidate forms:
  A. CLI tool `kakeya-llm`           (homebrew / pip)
  B. Native macOS app `Kakeya Studio` (DMG, custom UI)
  C. SDK + developer/enterprise library (PyPI + docs site)

Recommendation: ship A + C first, defer B.

Rationale:
  1. atomic.chat's "Google TurboQuant built-in" headline is
     under-delivered by the actual llama.cpp KV quantisation stack;
     folding B2 into Atomic-Chat means fulfilling *their* marketing
     through *our* engineering with no control over the message.
  2. B2's three capabilities (MLX-native inference, DFlash 3-6x
     speedup, E8 KV compression) are general-purpose infra — not
     chat-UI-specific. Binding them to one chat app under-uses the
     stack.
  3. Existing engineering assets (kakeyalattice_mlx,
     kakeya_sidecar_mlx, cache_injection) are ~90% reusable for A+C
     with minimal new work; going the Atomic-Chat-backend route
     blocks on third-party PR review + release cadence.

Phased execution:
  Phase 1 (after M5 merges): PyPI release, landing page, CLI shim.
  Phase 2 (4-8 weeks later): Cursor / Raycast / Obsidian integrations,
    early enterprise POCs.
  Phase 3 (6 months out, re-evaluate): whether to do the native app.

This evaluation does NOT invalidate PR #57 (B1), #58 (B2 skeleton),
or #59 (B2/M4): all that code is directly reusable. It re-scopes
M6 only, shifting from 'ship to Atomic-Chat' to 'ship independently
+ opportunistic Atomic-Chat integration if they take our PR'.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
