
feat(atomic-chat): B2/M5 — acceptance-rate benchmark + M6 standalone product proposal#60

Draft
FluffyAIcode wants to merge 3 commits into AgentMemory/atomic-chat-b2-mlx-dflash-kakeya-04ae from AgentMemory/atomic-chat-b2-m5-acceptance-benchmark-04ae

Conversation

@FluffyAIcode (Owner)

Summary

M5: delivers the B2 acceptance-rate benchmark, turning the theoretical argument that "stacking DFlash × KakeyaLattice does not hurt acceptance" into a runnable experiment stack.

M6: delivers the standalone-product evaluation docs/B2_STANDALONE_PRODUCT_PROPOSAL.md, repositioning B2's GTM path from "a second backend for Atomic-Chat" to a standalone Mac local-inference SDK + CLI product.

Dependencies

M5 — Benchmark deliverables

New package benchmarks/b2_dflash_kakeya/

| File | Purpose |
| --- | --- |
| `runner.py` | CLI + top-level orchestration; `--dry-run` lets Linux CI exercise the full pipeline |
| `datasets.py` | Three-tier loader (local jsonl → HF datasets → synthetic fixture) |
| `engines.py` | Engine Protocol + RealEngine (MLX + DFlash) + MockEngine (CI) |
| `metrics.py` | Pure-stdlib percentile / mean / correctness proxies (gsm8k / humaneval) |
| `schema.py` | Pinned schema version `b2-dflash-kakeya-v1` |
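
For orientation, here is a minimal sketch of how the three-tier fallback in `datasets.py` could work. The function and parameter names below are assumptions for illustration, not the module's actual API; per the commit message, the synthetic tier is gated behind `--allow-synthetic` so nobody ships numbers from 3-prompt fixtures.

```python
# Hypothetical sketch of the three-tier fallback; names are illustrative.
import json
from pathlib import Path

def load_prompts(dataset: str, n_samples: int, seed: int,
                 local_dir: Path = Path("data"),
                 allow_synthetic: bool = False) -> list[dict]:
    # Tier 1: a local jsonl file wins if it exists.
    local = local_dir / f"{dataset}.jsonl"
    if local.exists():
        lines = [ln for ln in local.read_text().splitlines() if ln.strip()]
        return [json.loads(ln) for ln in lines][:n_samples]

    # Tier 2: fall back to the HF hub when `datasets` is installed.
    try:
        import datasets  # optional dependency; absent on Linux CI
    except ImportError:
        datasets = None
    if datasets is not None:
        ds = datasets.load_dataset(dataset, split="test")
        return list(ds.shuffle(seed=seed).select(range(n_samples)))

    # Tier 3: tiny deterministic fixture, gated so nobody ships
    # numbers from a 3-prompt fixture by accident.
    if not allow_synthetic:
        raise RuntimeError(f"no data for {dataset!r}; pass --allow-synthetic")
    return [{"prompt": f"{dataset} synthetic #{i}"} for i in range(n_samples)]
```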

Experiment design

| Dimension | Value |
| --- | --- |
| Target | Qwen/Qwen3-8B (non-thinking) |
| Draft | z-lab/Qwen3-8B-DFlash-b16 |
| KV channels | bf16 / e8-q38 / e8-q10 / e8-q4 |
| Datasets | gsm8k + humaneval |
| n_samples | 32 (default); can be dropped to 8 for a quick smoke test |
| Metrics | acceptance_length {mean, p50, p95}, tps, TTFT, codec_fired, correctness_proxy |
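
To make the metrics row concrete, here is a hedged sketch of pure-stdlib helpers in the spirit of `metrics.py`. Names and the exact interpolation scheme are assumptions; the correctness proxies follow the commit message (numeric-substring for gsm8k, def/return substring for humaneval; a full execution harness is out of scope).

```python
# Illustrative pure-stdlib helpers in the spirit of metrics.py.
import math

def percentile(values: list[float], p: float) -> float:
    """Linearly interpolated percentile (p=50 -> p50, p=95 -> p95)."""
    if not values:
        raise ValueError("percentile of empty list")
    xs = sorted(values)
    rank = (p / 100.0) * (len(xs) - 1)
    lo, hi = math.floor(rank), math.ceil(rank)
    return xs[lo] + (xs[hi] - xs[lo]) * (rank - lo)

def correctness_proxy(dataset: str, completion: str, reference: str) -> bool:
    # gsm8k: does the reference numeric answer appear verbatim?
    if dataset == "gsm8k":
        return reference.strip() in completion
    # humaneval: crude structural check; a real execution harness is
    # explicitly out of scope for M5.
    return "def " in completion and "return" in completion
```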

Theoretical expectations (encoded in MockEngine)

| Channel | acceptance_length | tps vs. baseline | Verdict |
| --- | --- | --- | --- |
| bf16 | ~15 (DFlash's published Qwen3-8B figure) | 1.00× | baseline |
| Kakeya Q=38 | ~14 (drop < 1 pp) | ~0.95-1.00× | near-lossless, usable |
| Kakeya Q=10 | ~12 (drop of 1-3 pp) | ~0.80-0.90× | trades some speed for KV savings |
| Kakeya Q=4 | ~8 (significant drop) | ~0.50-0.70× | excluded from the default tiers |

If real measurements show Q=38 dropping by more than 2 pp, or Q=10 by more than 5 pp, the default tiers and the marketing claims must be revised accordingly.
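
A rough sketch of how a deterministic MockEngine can encode the ordering above. The expected means come straight from the table; the class shape and method names are illustrative, not the real `engines.py` API.

```python
# Hypothetical MockEngine; encodes the theoretical per-channel expectations.
import random
from dataclasses import dataclass

# Expected mean acceptance lengths, taken from the table above.
_EXPECTED_ACCEPT = {"bf16": 15.0, "e8-q38": 14.0, "e8-q10": 12.0, "e8-q4": 8.0}

@dataclass
class StepResult:
    accept_length: float  # tokens accepted in one speculative step

class MockEngine:
    """Deterministic fake preserving bf16 > Q=38 > Q=10 > Q=4."""

    def __init__(self, channel: str, seed: int = 0):
        self.mean = _EXPECTED_ACCEPT[channel]
        self.rng = random.Random(f"{channel}-{seed}")

    def generate(self, prompt: str, max_tokens: int) -> list[StepResult]:
        steps, produced = [], 0
        while produced < max_tokens:
            # Seeded jitter keeps samples distinct while the channel
            # ordering stays stable across runs.
            accept = max(1.0, self.rng.gauss(self.mean, 0.5))
            steps.append(StepResult(accept_length=accept))
            produced += int(accept)
        return steps
```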

Tests (24 passed, runnable on Linux CI)

  • test_metrics.py: percentile edge cases, mean, correctness proxies, multi-sample aggregation
  • test_datasets.py: synthetic fixture, local-jsonl preference, n_samples truncation, seed determinism
  • test_runner_mock.py: end-to-end run of MockEngine + synthetic data → JSON output; asserts schema v1 round-trips and that the accept-length ordering is preserved through aggregation (see the sketch after this list)
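
The ordering assertion could look roughly like this, reusing the MockEngine sketch above. This is a simplified stand-in; the real test drives the full runner and aggregation path.

```python
# Sketch of the ordering-preservation check; helper names are illustrative.
def test_accept_length_ordering_survives_aggregation():
    channels = ["bf16", "e8-q38", "e8-q10", "e8-q4"]
    means = []
    for channel in channels:
        steps = MockEngine(channel, seed=0).generate("2 + 2 = ?", max_tokens=64)
        means.append(sum(s.accept_length for s in steps) / len(steps))
    # The theoretical ordering bf16 > Q=38 > Q=10 > Q=4 must survive aggregation.
    assert means == sorted(means, reverse=True)
```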

Dry-run smoke test

$ python -m benchmarks.b2_dflash_kakeya.runner \
    --dry-run --n-samples 3 --max-tokens 64 \
    --out-dir /tmp/b2_dryrun

# => 8 JSON files (2 datasets × 4 channels)
# accept_mean: bf16≈15, q38≈14, q10≈12, q4≈8, matching the theoretical ordering
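
Since downstream tooling keys off the pinned schema version, a consumer-side check might look like this. Only the `b2-dflash-kakeya-v1` string is from the source; the `schema_version` field name and the loader shape are assumptions.

```python
# Consumer-side schema check; the version string is from schema.py,
# the `schema_version` field name and loader shape are assumptions.
import json
from pathlib import Path

SCHEMA_VERSION = "b2-dflash-kakeya-v1"

def load_result(path: Path) -> dict:
    record = json.loads(path.read_text())
    found = record.get("schema_version")
    # Fail loudly on mismatch so breaking changes surface early.
    if found != SCHEMA_VERSION:
        raise ValueError(f"expected {SCHEMA_VERSION}, got {found!r} in {path}")
    return record
```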

M6 — Standalone product evaluation

The full proposal is in docs/B2_STANDALONE_PRODUCT_PROPOSAL.md. Summary:

Three reasons for re-evaluating

  1. atomic.chat's marketing does not match what B2 promises to deliver: "Google TurboQuant built-in" is actually llama.cpp scalar quantization (2023-era technology), and the v1.5 report found TQ b=2 structurally unusable. If B2 fulfills that promise, we do the experimental validation for a third-party product and capture none of the brand premium.
  2. B2's value exceeds a chat-UI backend: the MLX-native / DFlash / E8 KV-compression capabilities can serve IDEs, agents, batch processing, and other GUI hosts; locking them into a chat app locks them into one UI.
  3. atomic.chat roadmap risk: inference-engine swaps, extension-API churn, and the chance that KakeyaLattice stays buried in "advanced settings" forever.

Three candidate forms

| Form | Deliverable | Pros | Cons |
| --- | --- | --- | --- |
| A | CLI `kakeya-llm` (brew + pip) | simple distribution, broad integration surface | high barrier for non-technical users |
| B | Native app `Kakeya Studio` (DMG) | full narrative control, brand building | high UI / signing / notarization / GTM cost; competes head-on with atomic.chat |
| C | SDK + enterprise services (PyPI + docs site) | minimal UI investment, maximum technical leverage, differentiated competition | long B2B sales cycles |

Recommended path: A + C first; defer B for 6 months

  • Phase 1: publish `kakeyalattice-mlx` + `kakeya-sidecar-mlx` to PyPI, build a landing page, ship the `kakeya-llm` CLI wrapper
  • Phase 2 (4-8 weeks later): Cursor / Continue / Raycast / Obsidian integrations, enterprise POCs
  • Phase 3 (re-evaluate after 6 months): whether to build the native app, and whether to position KakeyaLattice as a brand in head-on competition

Impact on existing work

This evaluation does not invalidate PR #57 (B1), #58 (B2 skeleton), or #59 (B2/M4): all of that code is directly reusable. Only M6 is re-scoped, from "ship to Atomic-Chat" to "ship independently, with opportunistic Atomic-Chat integration".

Merge recommendation

This PR contains two independent pieces that can be split:

  1. `benchmarks/b2_dflash_kakeya/` (M5): engineering code + tests, independently mergeable
  2. `docs/B2_STANDALONE_PRODUCT_PROPOSAL.md` (M6): documentation only; the GTM decision needs team review

Options:

  • Merge together: keeps M5 + M6 in the same commit history as the narrative close of B2 phase 2
  • Split into two PRs: if the M6 GTM evaluation needs team discussion before it is finalized, merge the M5 benchmark first and leave the M6 document in this PR pending review

Commits

  1. feat(b2/M5): acceptance-rate benchmark runner for DFlash x Kakeya
  2. docs(b2/M5): reports/b2_release/ placeholder for real benchmark output
  3. docs(b2/M6): standalone product proposal - pivot from Atomic-Chat backend

Next steps (assuming this PR merges and the M6 evaluation is adopted)

  1. Add PyPI release CI for `kakeyalattice_mlx` + `kakeya_sidecar_mlx`
  2. Register the kakeyalattice.dev / kakeya-mlx.dev domains
  3. Draft landing-page content (comparison against atomic.chat / llama.cpp / ollama / MLC-LLM)
  4. Ship a v0.1.0 PyPI release within 2 weeks, plus Twitter / HN / arXiv announcements
  5. Within 1 month: Cursor / Continue / Aider integration PRs + first enterprise POC outreach

cursoragent and others added 3 commits April 30, 2026 05:07
feat(b2/M5): acceptance-rate benchmark runner for DFlash x Kakeya

New benchmarks package `benchmarks.b2_dflash_kakeya` quantifies the
impact of KakeyaLattice E8 KV-cache compression on DFlash block-
diffusion speculative decoding.

Experiment design (laid out in README):
  target: Qwen/Qwen3-8B
  draft:  z-lab/Qwen3-8B-DFlash-b16
  KV channels: bf16 / e8-q38 / e8-q10 / e8-q4   (4 levels)
  datasets: gsm8k, humaneval   (32 prompts each by default)
  metrics: acceptance_length {mean, p50, p95}, tps, TTFT,
           codec_fired, correctness_proxy

Package layout:
  runner.py       CLI + top-level orchestration; --dry-run exercises
                  the whole pipeline on Linux CI without MLX/dflash/HF.
  datasets.py     Three-tier dataset loader (local jsonl ->
                  HF datasets -> synthetic). Synthetic is gated
                  behind --allow-synthetic so nobody ships numbers
                  from 3-prompt fixtures.
  engines.py      Engine Protocol with RealEngine (delegates to
                  kakeya_sidecar_mlx.MLXEngine) and MockEngine
                  (deterministic fake with the theoretical accept-
                  length ordering bf16 > Q=38 > Q=10 > Q=4).
  metrics.py      Pure-stdlib percentile + mean + correctness
                  proxies (gsm8k = numeric-substring, humaneval =
                  def/return substring; full execution harness
                  explicitly out of scope for now).
  schema.py       Pinned schema version b2-dflash-kakeya-v1 so
                  downstream tooling can detect breaking changes.

Tests (24 passed, Linux CI only, no MLX/dflash/HF):
  test_metrics.py        percentile edge cases, mean, correctness
                         proxies, multi-record summarisation.
  test_datasets.py       synthetic fixture, local jsonl preference,
                         n_samples truncation, seed determinism.
  test_runner_mock.py    full end-to-end via MockEngine + synthetic
                         data; asserts accept-length ordering is
                         preserved through aggregation and that
                         JSON outputs conform to schema v1.

Dry-run smoke (executed locally):
  python -m benchmarks.b2_dflash_kakeya.runner \
      --dry-run --n-samples 3 --max-tokens 64 \
      --out-dir /tmp/b2_dryrun
  # => 8 JSON files (2 datasets x 4 channels), schema v1,
  #    accept means 15 / 14 / 12 / 8 as expected.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
docs(b2/M5): reports/b2_release/ placeholder for real benchmark output

The M5 runner writes one JSON per (dataset, channel) combination
into reports/b2_release/. The real run requires Apple Silicon +
MLX + dflash + gsm8k/humaneval data; until then this directory
only documents the expected layout.

Expected artefacts after a real run:
  b2_dflash_kakeya_{gsm8k,humaneval}_{bf16,e8-q38,e8-q10,e8-q4}.json
  FINDINGS.md (narrative + aggregate tables)

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
docs(b2/M6): standalone product proposal - pivot from Atomic-Chat backend

Evaluates repositioning B2 (MLX + DFlash + KakeyaLattice-MLX) from
'an Atomic-Chat second backend' (original M6 scope) to an
independent Mac local-inference product.

Three candidate forms:
  A. CLI tool `kakeya-llm`           (homebrew / pip)
  B. Native macOS app `Kakeya Studio` (DMG, custom UI)
  C. SDK + developer/enterprise library (PyPI + docs site)

Recommendation: ship A + C first, defer B.

Rationale:
  1. atomic.chat's "Google TurboQuant built-in" headline is
     under-delivered by the actual llama.cpp KV quantisation stack;
     folding B2 into Atomic-Chat means fulfilling *their* marketing
     through *our* engineering with no control over the message.
  2. B2's three capabilities (MLX-native inference, DFlash 3-6x
     speedup, E8 KV compression) are general-purpose infra — not
     chat-UI-specific. Binding them to one chat app under-uses the
     stack.
  3. Existing engineering assets (kakeyalattice_mlx,
     kakeya_sidecar_mlx, cache_injection) are ~90% reusable for A+C
     with minimal new work; going the Atomic-Chat-backend route
     blocks on third-party PR review + release cadence.

Phased execution:
  Phase 1 (after M5 merges): PyPI release, landing page, CLI shim.
  Phase 2 (4-8 weeks later): Cursor / Raycast / Obsidian integrations,
    early enterprise POCs.
  Phase 3 (6 months out, re-evaluate): whether to do the native app.

This evaluation does NOT invalidate PR #57 (B1), #58 (B2 skeleton),
or #59 (B2/M4): all that code is directly reusable. It re-scopes
M6 only, shifting from 'ship to Atomic-Chat' to 'ship independently
+ opportunistic Atomic-Chat integration if they take our PR'.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
