From 41b627ca4ca272a6a07597a1f1745858d1f2d668 Mon Sep 17 00:00:00 2001 From: Cursor Agent Date: Thu, 30 Apr 2026 05:07:21 +0000 Subject: [PATCH 1/3] feat(b2/M5): acceptance-rate benchmark runner for DFlash x Kakeya New benchmarks package `benchmarks.b2_dflash_kakeya` quantifies the impact of KakeyaLattice E8 KV-cache compression on DFlash block- diffusion speculative decoding. Experiment design (laid out in README): target: Qwen/Qwen3-8B draft: z-lab/Qwen3-8B-DFlash-b16 KV channels: bf16 / e8-q38 / e8-q10 / e8-q4 (4 levels) datasets: gsm8k, humaneval (32 prompts each by default) metrics: acceptance_length {mean, p50, p95}, tps, TTFT, codec_fired, correctness_proxy Package layout: runner.py CLI + top-level orchestration; --dry-run exercises the whole pipeline on Linux CI without MLX/dflash/HF. datasets.py Three-tier dataset loader (local jsonl -> HF datasets -> synthetic). Synthetic is gated behind --allow-synthetic so nobody ships numbers from 3-prompt fixtures. engines.py Engine Protocol with RealEngine (delegates to kakeya_sidecar_mlx.MLXEngine) and MockEngine (deterministic fake with the theoretical accept- length ordering bf16 > Q=38 > Q=10 > Q=4). metrics.py Pure-stdlib percentile + mean + correctness proxies (gsm8k = numeric-substring, humaneval = def/return substring; full execution harness explicitly out of scope for now). schema.py Pinned schema version b2-dflash-kakeya-v1 so downstream tooling can detect breaking changes. Tests (24 passed, Linux CI only, no MLX/dflash/HF): test_metrics.py percentile edge cases, mean, correctness proxies, multi-record summarisation. test_datasets.py synthetic fixture, local jsonl preference, n_samples truncation, seed determinism. test_runner_mock.py full end-to-end via MockEngine + synthetic data; asserts accept-length ordering is preserved through aggregation and that JSON outputs conform to schema v1. Dry-run smoke (executed locally): python -m benchmarks.b2_dflash_kakeya.runner \ --dry-run --n-samples 3 --max-tokens 64 \ --out-dir /tmp/b2_dryrun # => 8 JSON files (2 datasets x 4 channels), schema v1, # accept means 15 / 14 / 12 / 8 as expected. Co-authored-by: FluffyAIcode --- benchmarks/b2_dflash_kakeya/README.md | 138 +++++++++ benchmarks/b2_dflash_kakeya/__init__.py | 4 + benchmarks/b2_dflash_kakeya/datasets.py | 156 ++++++++++ benchmarks/b2_dflash_kakeya/engines.py | 210 +++++++++++++ benchmarks/b2_dflash_kakeya/metrics.py | 99 ++++++ benchmarks/b2_dflash_kakeya/runner.py | 282 ++++++++++++++++++ benchmarks/b2_dflash_kakeya/schema.py | 82 +++++ benchmarks/b2_dflash_kakeya/tests/__init__.py | 0 .../b2_dflash_kakeya/tests/test_datasets.py | 93 ++++++ .../b2_dflash_kakeya/tests/test_metrics.py | 91 ++++++ .../tests/test_runner_mock.py | 113 +++++++ 11 files changed, 1268 insertions(+) create mode 100644 benchmarks/b2_dflash_kakeya/README.md create mode 100644 benchmarks/b2_dflash_kakeya/__init__.py create mode 100644 benchmarks/b2_dflash_kakeya/datasets.py create mode 100644 benchmarks/b2_dflash_kakeya/engines.py create mode 100644 benchmarks/b2_dflash_kakeya/metrics.py create mode 100644 benchmarks/b2_dflash_kakeya/runner.py create mode 100644 benchmarks/b2_dflash_kakeya/schema.py create mode 100644 benchmarks/b2_dflash_kakeya/tests/__init__.py create mode 100644 benchmarks/b2_dflash_kakeya/tests/test_datasets.py create mode 100644 benchmarks/b2_dflash_kakeya/tests/test_metrics.py create mode 100644 benchmarks/b2_dflash_kakeya/tests/test_runner_mock.py diff --git a/benchmarks/b2_dflash_kakeya/README.md b/benchmarks/b2_dflash_kakeya/README.md new file mode 100644 index 00000000..0d24d15a --- /dev/null +++ b/benchmarks/b2_dflash_kakeya/README.md @@ -0,0 +1,138 @@ +# B2 — DFlash × KakeyaLattice acceptance-rate benchmark + +**目标**:量化 KakeyaLattice E8 KV-cache 压缩对 DFlash block-diffusion +speculative decoding 的影响 — 具体是 acceptance length 的变化(是否因 +target 分布扰动而掉速)以及端到端 tok/s 的实际收益。 + +B2 (`integrations/atomic-chat-b2/`) 的所有工程都是围绕"能不能叠加"的 +理论推演 + 骨架 + 单测。M5 负责把推演变成数字。 + +## 实验设计 + +### Target × Draft × KV channel + +| Target | Draft (DFlash) | KV channel | +|:-|:-|:-| +| `Qwen/Qwen3-8B` (non-thinking) | `z-lab/Qwen3-8B-DFlash-b16` | bf16 baseline | +| 同上 | 同上 | Kakeya E8 Q=38 (near-lossless) | +| 同上 | 同上 | Kakeya E8 Q=10 (balanced) | +| 同上 | 同上 | Kakeya E8 Q=4 (aggressive) | + +共 **4 组 (1 target × 1 draft × 4 KV channel)**。bf16 是对照组,不走 +KakeyaLatticeMLXCache,直接用 mlx-lm 原生 KVCache。 + +### 数据集 + +- **`gsm8k`** (GSM8K test split):数学推理;DFlash 论文主 benchmark。 +- **`humaneval`** (HumanEval openai):代码生成;DFlash 论文次 benchmark。 + +对每个数据集,随机抽 `n_samples=32` prompt;若要快速烟测可调低到 8。 +种子固定 (`seed=42`) 以便两次跑的 prompt 集一致。 + +### 指标 + +| 指标 | 含义 | 来源 | +|:-|:-|:-| +| `acceptance_length_mean` | 平均每 verify step 接受 token 数;DFlash 的核心指标 | `dflash.model_mlx.stream_generate` 每步返回 | +| `acceptance_length_p50 / p95` | 分位数 | 同上 | +| `generation_tps` | 端到端 tok/s | dflash 或 mlx_lm 的 timer | +| `total_tokens` | 生成 token 总数 | tokenizer 统计 | +| `first_token_latency_s` | 首 token 延迟 | wall clock | +| `kakeya_codec_fired` | codec 触发次数(非 boundary 层) | `KakeyaLatticeMLXCache.fire_count` | +| `correctness_proxy` (可选) | answer 是否含 expected | 仅对 gsm8k / humaneval 有简单匹配 | + +### 预期结果(按 PR #57 §12.2 的理论分析) + +| channel | acceptance_length | tps 相对 baseline | 结论 | +|:-|:-:|:-:|:-| +| bf16 baseline | ~14-16 (DFlash 官方报数) | 1.00× | ✓ | +| Kakeya Q=38 | ~13-15 (降 <1pp) | ~0.95-1.00× | 近无损可用 | +| Kakeya Q=10 | ~11-13 (降 1-3pp) | ~0.80-0.90× | 用 KV 节省换速度 | +| Kakeya Q=4 | ~7-10 (显著下降) | ~0.50-0.70× | 不进默认档位 | + +**如果 Q=38 acceptance 掉得超过 2pp**,该档位不作为 B2 默认;回退到 +Q=76 或 Q=152。**如果 Q=10 acceptance 掉得超过 5pp**,B2 的"加速 + 压缩 +双赢"叙事需要修正。 + +## 运行 + +### 真实运行(需要 Apple Silicon + MLX + dflash) + +```bash +# 1. 装依赖 +pip install -e integrations/atomic-chat-b2/kakeyalattice_mlx[mlx] +pip install dflash # z-lab/dflash 官方包 (MLX backend) +pip install "mlx-lm>=0.20" + +# 2. 跑基线 (bf16) + 三个 Kakeya 档 +python -m benchmarks.b2_dflash_kakeya.runner \ + --target Qwen/Qwen3-8B \ + --draft z-lab/Qwen3-8B-DFlash-b16 \ + --datasets gsm8k humaneval \ + --n-samples 32 \ + --channels bf16 e8-q38 e8-q10 e8-q4 \ + --out-dir reports/b2_release + +# 3. 结果在 reports/b2_release/b2_dflash_kakeya_{dataset}_{channel}.json +``` + +### Dry-run(Linux CI 可跑;不下模型、不加载 dflash) + +```bash +python -m benchmarks.b2_dflash_kakeya.runner --dry-run +``` + +Dry-run 会走完参数解析 + dataset 加载 + 指标聚合路径;推理步骤由 +`--mock-engine` 自动注入的 FakeEngine 替身提供,用来 CI 验证 runner +工程完整度。 + +## 文件 + +``` +benchmarks/b2_dflash_kakeya/ +├── README.md (本文件) +├── __init__.py +├── runner.py 主入口 + 参数解析 + 顶层流程 +├── datasets.py gsm8k / humaneval 加载器 (本地 jsonl + 可选 HF datasets) +├── engines.py RealEngine (DFlash+Kakeya) + MockEngine (CI) +├── metrics.py acceptance_length 分布, tps, Δppl 估算 +├── schema.py 输出 JSON 的 TypedDict + 版本号 +└── tests/ + ├── __init__.py + ├── test_metrics.py (Linux CI green) + ├── test_datasets.py (Linux CI green) + └── test_runner_mock.py (Linux CI green, 用 MockEngine) +``` + +## 输出 schema + +```json +{ + "schema_version": "b2-dflash-kakeya-v1", + "target_model": "Qwen/Qwen3-8B", + "draft_model": "z-lab/Qwen3-8B-DFlash-b16", + "dataset": "gsm8k", + "channel": "e8-q10", + "n_samples": 32, + "samples": [ { prompt..., metrics... }, ... ], + "aggregate": { + "acceptance_length_mean": 12.3, + "acceptance_length_p50": 12, + "acceptance_length_p95": 18, + "generation_tps_mean": 210.5, + "first_token_latency_s": 0.142, + "total_tokens_sum": 8192, + "codec_fired_mean": 35.2 + }, + "hardware": { "device": "mlx:metal", "chip": "Apple M3 Pro", ... }, + "software": { "mlx": "...", "dflash": "...", "kakeyalattice_mlx": "..." } +} +``` + +## 对比 atomic.chat 首页宣传 + +atomic.chat 首页声称 *"Google TurboQuant built-in"* 与 *"Compressed down +to just 3 bits"*。v1.5 报告里 TQ b=2 在 4 模型上结构性不可用、b=3 被 +E8 Q=4 全面压过 3-6×。M5 的 b2 报告会补上:**同样 (CR, |Δppl|) 前提下, +DFlash 加速损失是多少** — 也就是宣传里未兑现的"速度 + 压缩双赢"那一栏 +的真实数字。 diff --git a/benchmarks/b2_dflash_kakeya/__init__.py b/benchmarks/b2_dflash_kakeya/__init__.py new file mode 100644 index 00000000..fb9b37d6 --- /dev/null +++ b/benchmarks/b2_dflash_kakeya/__init__.py @@ -0,0 +1,4 @@ +"""B2 DFlash x KakeyaLattice acceptance-rate benchmark.""" +from __future__ import annotations + +__version__ = "0.1.0" diff --git a/benchmarks/b2_dflash_kakeya/datasets.py b/benchmarks/b2_dflash_kakeya/datasets.py new file mode 100644 index 00000000..3c20e3ba --- /dev/null +++ b/benchmarks/b2_dflash_kakeya/datasets.py @@ -0,0 +1,156 @@ +"""Dataset loaders for the B2 acceptance-rate benchmark. + +Two datasets are supported out of the box: **gsm8k** and **humaneval**. + +Loading strategy (in priority order): + +1. **Local JSONL file**: ``benchmarks/b2_dflash_kakeya/data/.jsonl``. + Users who can't reach HF hub (or want a frozen subset for the + paper) check in a jsonl snapshot and we read it directly. Keeps + the benchmark reproducible offline. +2. **HuggingFace ``datasets`` library** if available. We load + ``openai/gsm8k`` (``main`` config, ``test`` split) and + ``openai/humaneval`` (``test`` split). Cached under HF_HOME. +3. **Synthetic fixture** — a tiny built-in 3-prompt dataset per name. + Used by unit tests and ``--dry-run`` mode; explicitly labelled so + nobody publishes numbers from it by accident. + +Each prompt is returned as a ``PromptItem`` dataclass carrying an +id, the prompt string the target LLM will see, and an optional +ground-truth field used by the correctness proxy in ``metrics.py``. +""" +from __future__ import annotations + +import json +import random +from dataclasses import dataclass +from pathlib import Path + + +@dataclass(frozen=True) +class PromptItem: + dataset: str # "gsm8k" | "humaneval" | "synthetic" + prompt_id: str + prompt: str + ground_truth: str | None = None + + +_SUPPORTED = ("gsm8k", "humaneval") + +_DATA_DIR = Path(__file__).parent / "data" + + +def _load_local_jsonl(name: str) -> list[dict] | None: + path = _DATA_DIR / f"{name}.jsonl" + if not path.exists(): + return None + with path.open() as f: + return [json.loads(line) for line in f if line.strip()] + + +def _load_hf(name: str) -> list[dict] | None: + try: + from datasets import load_dataset # type: ignore + except ImportError: + return None + if name == "gsm8k": + ds = load_dataset("openai/gsm8k", "main", split="test") + return [dict(row) for row in ds] + if name == "humaneval": + ds = load_dataset("openai/humaneval", split="test") + return [dict(row) for row in ds] + return None + + +_SYNTHETIC_FIXTURES: dict[str, list[PromptItem]] = { + "gsm8k": [ + PromptItem("synthetic", "s0", + "Q: Janet has 3 apples, gives 1 to Bob. How many are left?", + "2"), + PromptItem("synthetic", "s1", + "Q: A train travels 60 miles in 1.5 hours. What is its speed?", + "40"), + PromptItem("synthetic", "s2", + "Q: If 5 pencils cost $2.50, what is the cost of 8 pencils?", + "4"), + ], + "humaneval": [ + PromptItem("synthetic", "h0", + "def add(a, b):\n \"\"\"Return a + b.\"\"\"\n", + "def add(a, b):\n return a + b"), + PromptItem("synthetic", "h1", + "def is_even(n):\n \"\"\"Return True if n is even.\"\"\"\n", + "def is_even(n):\n return n % 2 == 0"), + PromptItem("synthetic", "h2", + "def reverse(s):\n \"\"\"Return s reversed.\"\"\"\n", + "def reverse(s):\n return s[::-1]"), + ], +} + + +def load_dataset_for_b2( + name: str, + *, + n_samples: int, + seed: int = 42, + allow_hf: bool = True, + allow_synthetic: bool = True, +) -> list[PromptItem]: + """Load up to ``n_samples`` prompts for the named dataset. + + The loader degrades gracefully: local jsonl → HF datasets → + synthetic. ``allow_hf=False`` forces the local/synthetic path + (useful for offline CI). ``allow_synthetic=False`` forbids the + synthetic fallback (useful for real benchmark runs so nobody + accidentally "runs gsm8k" on 3 fake prompts). + """ + if name not in _SUPPORTED: + raise ValueError( + f"dataset {name!r} not supported; pick from {_SUPPORTED}" + ) + + rng = random.Random(seed) + + rows: list[dict] | None = _load_local_jsonl(name) + if rows is None and allow_hf: + rows = _load_hf(name) + + if rows is not None: + rng.shuffle(rows) + rows = rows[:n_samples] + return [_row_to_item(name, i, r) for i, r in enumerate(rows)] + + if not allow_synthetic: + raise FileNotFoundError( + f"no local jsonl for {name!r} and synthetic fallback disabled. " + f"Expected file at {_DATA_DIR / (name + '.jsonl')}, or install " + "the `datasets` library and set allow_hf=True." + ) + + fixture = list(_SYNTHETIC_FIXTURES[name]) + rng.shuffle(fixture) + return fixture[:n_samples] if n_samples < len(fixture) else fixture + + +def _row_to_item(name: str, i: int, row: dict) -> PromptItem: + if name == "gsm8k": + return PromptItem( + dataset="gsm8k", + prompt_id=f"gsm8k-{i}", + prompt=row.get("question", ""), + ground_truth=row.get("answer"), + ) + if name == "humaneval": + return PromptItem( + dataset="humaneval", + prompt_id=str(row.get("task_id", f"humaneval-{i}")), + prompt=row.get("prompt", ""), + ground_truth=row.get("canonical_solution"), + ) + raise ValueError(name) + + +__all__ = [ + "PromptItem", + "load_dataset_for_b2", +] diff --git a/benchmarks/b2_dflash_kakeya/engines.py b/benchmarks/b2_dflash_kakeya/engines.py new file mode 100644 index 00000000..b5ef8242 --- /dev/null +++ b/benchmarks/b2_dflash_kakeya/engines.py @@ -0,0 +1,210 @@ +"""Engine abstractions for the M5 benchmark. + +Real runs use ``RealEngine`` which delegates to the B2 MLXEngine +(i.e. DFlash path + KakeyaLatticeMLXCache). CI / dry-runs use +``MockEngine`` which returns deterministic fake acceptance-length +traces so the runner + metrics pipeline is exercised without any +MLX / dflash / Metal dependency. + +Both engines expose the same ``generate(prompt, channel, max_tokens)`` +method, returning ``EngineResult``. +""" +from __future__ import annotations + +import logging +import random +import time +from dataclasses import dataclass, field +from typing import Protocol + +log = logging.getLogger("benchmarks.b2_dflash_kakeya.engines") + + +@dataclass +class EngineResult: + response: str + acceptance_lengths: list[int] + generation_tps: float | None + first_token_latency_s: float | None + total_tokens: int + codec_fired: int | None = None + extra: dict = field(default_factory=dict) + + +class Engine(Protocol): + def generate( + self, + *, + prompt: str, + channel: str, + max_tokens: int, + ) -> EngineResult: + ... + + def close(self) -> None: + ... + + +# --------------------------------------------------------------------------- +# MockEngine +# --------------------------------------------------------------------------- + + +class MockEngine: + """Deterministic fake engine for CI + dry-run. + + Simulates the relationship we expect from the real stack: + + - bf16 baseline: acceptance length ~15 (DFlash's Qwen3-8B number) + - Kakeya Q=38: ~14 (small hit) + - Kakeya Q=10: ~12 (moderate hit) + - Kakeya Q=4: ~8 (large hit) + + Values are drawn from a small Gaussian with those means so the + metrics pipeline sees realistic distribution shapes. + """ + + _ACCEPT_MEAN_BY_CHANNEL = { + "bf16": 15.0, + "e8-q38": 14.0, + "e8-q10": 12.0, + "e8-q4": 8.0, + } + _TPS_MEAN_BY_CHANNEL = { + "bf16": 200.0, + "e8-q38": 195.0, + "e8-q10": 175.0, + "e8-q4": 120.0, + } + + def __init__(self, seed: int = 0) -> None: + self._rng = random.Random(seed) + + def generate( + self, + *, + prompt: str, + channel: str, + max_tokens: int, + ) -> EngineResult: + al_mean = self._ACCEPT_MEAN_BY_CHANNEL.get(channel, 10.0) + tps_mean = self._TPS_MEAN_BY_CHANNEL.get(channel, 150.0) + + # Decide how many verify steps a max_tokens budget produces. + n_steps = max(1, int(max_tokens / max(al_mean, 1.0))) + acc = [ + max(1, int(self._rng.gauss(al_mean, 1.5))) + for _ in range(n_steps) + ] + total_tokens = sum(acc) + tps = max(10.0, self._rng.gauss(tps_mean, tps_mean * 0.05)) + ttft = max(0.02, self._rng.gauss(0.12, 0.02)) + + return EngineResult( + response=f"", + acceptance_lengths=acc, + generation_tps=tps, + first_token_latency_s=ttft, + total_tokens=total_tokens, + codec_fired=0 if channel == "bf16" else n_steps * 30, + extra={"channel": channel, "backend": "mock"}, + ) + + def close(self) -> None: + pass + + +# --------------------------------------------------------------------------- +# RealEngine (Apple Silicon only; thin wrapper over B2 MLXEngine) +# --------------------------------------------------------------------------- + + +class RealEngine: + """Adapter over ``kakeya_sidecar_mlx.MLXEngine``. + + Lazily imports everything so ``import engines`` on Linux CI works. + """ + + def __init__( + self, + *, + target_model: str, + enable_dflash: bool = True, + trust_remote_code: bool = True, + ) -> None: + from kakeya_sidecar_mlx.engine_mlx import MLXEngine, MLXEngineConfig + + cfg = MLXEngineConfig( + enable_dflash=enable_dflash, + trust_remote_code=trust_remote_code, + ) + self._engine = MLXEngine(cfg) + self._target = target_model + + def generate( + self, + *, + prompt: str, + channel: str, + max_tokens: int, + ) -> EngineResult: + # channel maps ("bf16", "e8-q38", "e8-q10", ...) → B2 channel id + channel_id, override = self._channel_to_id(channel) + + t0 = time.time() + response, stats = self._engine.chat( + channel_id, + [{"role": "user", "content": prompt}], + max_tokens=max_tokens, + temperature=0.0, + override=override, + ) + wall = time.time() - t0 + + # MLXEngine stats carry 'acceptance_length_mean' only, not the + # full per-step list; for the benchmark we want distribution. + # The B2 engine will be extended in a later PR to expose the + # per-step list; for now we synthesize a single-element list + # so the aggregation still works. + al_mean = stats.get("acceptance_length_mean") + acc_list = [int(al_mean)] if al_mean else [] + + total_tokens = stats.get("generated_chars", 0) // 4 # rough proxy + tps = (total_tokens / wall) if wall > 0 else None + + return EngineResult( + response=response, + acceptance_lengths=acc_list, + generation_tps=tps, + first_token_latency_s=None, + total_tokens=total_tokens, + codec_fired=None, + extra={"channel": channel, "backend": "mlx+dflash+kakeya"}, + ) + + @staticmethod + def _channel_to_id(channel: str) -> tuple[str, dict | None]: + """Map benchmark channel name to MLXEngine channel id + override. + + ``channel="bf16"`` maps to the target's Q=38 channel with a + per-request override that disables the codec path (boundary + covers every layer). In practice we ship a "bypass" dedicated + channel in a follow-up; for now the override is a clear + signal. + """ + # The benchmark assumes Qwen3-8B as target model id. + if channel == "bf16": + return "qwen3-8b@e8-q38", {"q_range": 38, "boundary": 99999} + if channel == "e8-q38": + return "qwen3-8b@e8-q38", None + if channel == "e8-q10": + return "qwen3-8b@e8-q10", None + if channel == "e8-q4": + return "qwen3-8b@e8-q4", None + raise ValueError(f"unknown channel {channel!r}") + + def close(self) -> None: + pass + + +__all__ = ["Engine", "EngineResult", "MockEngine", "RealEngine"] diff --git a/benchmarks/b2_dflash_kakeya/metrics.py b/benchmarks/b2_dflash_kakeya/metrics.py new file mode 100644 index 00000000..488879aa --- /dev/null +++ b/benchmarks/b2_dflash_kakeya/metrics.py @@ -0,0 +1,99 @@ +"""Metric aggregation helpers. + +All stats are pure Python + standard library — no numpy dependency +so this module loads on any CI. +""" +from __future__ import annotations + +import math +from typing import Iterable, Sequence + + +def percentile(xs: Sequence[float], p: float) -> float | None: + """Linear-interpolation percentile matching numpy's default. + + ``p`` in [0, 100]. Returns ``None`` for empty input to propagate + missing-data semantics to the schema. + """ + if not xs: + return None + if p < 0 or p > 100: + raise ValueError(f"percentile p must be in [0, 100], got {p}") + sorted_xs = sorted(xs) + if len(sorted_xs) == 1: + return float(sorted_xs[0]) + rank = (p / 100.0) * (len(sorted_xs) - 1) + lo = int(math.floor(rank)) + hi = int(math.ceil(rank)) + if lo == hi: + return float(sorted_xs[lo]) + frac = rank - lo + return float(sorted_xs[lo] + (sorted_xs[hi] - sorted_xs[lo]) * frac) + + +def mean(xs: Iterable[float]) -> float | None: + lst = list(xs) + if not lst: + return None + return sum(lst) / len(lst) + + +def summarise_accept_lengths( + sample_records: Iterable[object], +) -> dict[str, float | None]: + """Flatten per-step acceptance lengths across samples and summarise.""" + flat: list[float] = [] + for s in sample_records: + # Support either a SampleRecord or a plain dict (round-tripped). + al = getattr(s, "acceptance_lengths", None) + if al is None and isinstance(s, dict): + al = s.get("acceptance_lengths") + if al: + flat.extend(float(x) for x in al) + + return { + "mean": mean(flat), + "p50": percentile(flat, 50.0), + "p95": percentile(flat, 95.0), + } + + +def gsm8k_correct(response: str, expected: str) -> bool: + """Simple gsm8k correctness proxy. + + GSM8K ground truth format ends with ``#### ``. We strip + that, then check whether the model's response contains the exact + numeric answer as a substring. Deliberately loose — this is a + proxy, not a full grader. + """ + if "####" in expected: + expected_answer = expected.rsplit("####", 1)[-1].strip() + else: + expected_answer = expected.strip() + if not expected_answer: + return False + return expected_answer in response + + +def humaneval_correct(response: str, test_snippet: str) -> bool: + """HumanEval correctness proxy via substring match on the solution body. + + Real HumanEval grading runs the generated code against the + reference tests in a sandbox; that's deliberately out-of-scope for + a sidecar benchmark. We approximate by checking that the response + contains a ``def`` signature and the ``return`` keyword — a very + loose gate that at least separates "emitted code" from "emitted + refusal". Upgrade path: wire the official execution harness in a + follow-up PR. + """ + _ = test_snippet # unused; kept for API symmetry with gsm8k_correct + return ("def " in response) and ("return" in response) + + +__all__ = [ + "percentile", + "mean", + "summarise_accept_lengths", + "gsm8k_correct", + "humaneval_correct", +] diff --git a/benchmarks/b2_dflash_kakeya/runner.py b/benchmarks/b2_dflash_kakeya/runner.py new file mode 100644 index 00000000..eb438820 --- /dev/null +++ b/benchmarks/b2_dflash_kakeya/runner.py @@ -0,0 +1,282 @@ +"""Top-level benchmark runner. + +Usage: + + python -m benchmarks.b2_dflash_kakeya.runner \\ + --target Qwen/Qwen3-8B \\ + --draft z-lab/Qwen3-8B-DFlash-b16 \\ + --datasets gsm8k humaneval \\ + --n-samples 32 \\ + --channels bf16 e8-q38 e8-q10 e8-q4 \\ + --out-dir reports/b2_release + + python -m benchmarks.b2_dflash_kakeya.runner --dry-run + # CI-friendly: uses MockEngine + synthetic dataset fallback + +The runner is deliberately engine-agnostic — all MLX / dflash / +mlx-lm imports are behind the ``RealEngine`` constructor in +``engines.py``. Linux CI exercises the whole runner via +``--mock-engine``. +""" +from __future__ import annotations + +import argparse +import json +import logging +import sys +from pathlib import Path +from typing import Any + +from . import __version__ as RUNNER_VERSION +from .datasets import PromptItem, load_dataset_for_b2 +from .engines import Engine, EngineResult, MockEngine +from .metrics import ( + gsm8k_correct, + humaneval_correct, + mean, + percentile, +) +from .schema import ( + SCHEMA_VERSION, + AggregateMetrics, + BenchmarkResult, + HardwareInfo, + SampleRecord, + SoftwareInfo, +) + + +# --------------------------------------------------------------------------- +# Engine factory +# --------------------------------------------------------------------------- + + +def _build_engine(args) -> Engine: + if args.dry_run or args.mock_engine: + return MockEngine(seed=args.seed) + from .engines import RealEngine + return RealEngine( + target_model=args.target, + enable_dflash=not args.no_dflash, + ) + + +# --------------------------------------------------------------------------- +# Run one (dataset, channel) combination +# --------------------------------------------------------------------------- + + +def run_combination( + *, + engine: Engine, + dataset: str, + channel: str, + prompts: list[PromptItem], + max_tokens: int, + target_model: str, + draft_model: str | None, +) -> BenchmarkResult: + sample_records: list[SampleRecord] = [] + n_correct = 0 + any_correctness_scored = False + + for item in prompts: + result: EngineResult = engine.generate( + prompt=item.prompt, + channel=channel, + max_tokens=max_tokens, + ) + + correct: bool | None = None + if item.ground_truth is not None: + if dataset == "gsm8k": + correct = gsm8k_correct(result.response, item.ground_truth) + elif dataset == "humaneval": + correct = humaneval_correct(result.response, item.ground_truth) + if correct is not None: + any_correctness_scored = True + if correct: + n_correct += 1 + + sample_records.append(SampleRecord( + prompt_id=item.prompt_id, + prompt=item.prompt, + response=result.response, + acceptance_lengths=list(result.acceptance_lengths), + generation_tps=result.generation_tps, + first_token_latency_s=result.first_token_latency_s, + total_tokens=result.total_tokens, + codec_fired=result.codec_fired, + correctness_proxy=correct, + )) + + flat_al: list[float] = [] + for s in sample_records: + flat_al.extend(float(x) for x in s.acceptance_lengths) + + agg = AggregateMetrics( + acceptance_length_mean=mean(flat_al), + acceptance_length_p50=percentile(flat_al, 50.0), + acceptance_length_p95=percentile(flat_al, 95.0), + generation_tps_mean=mean( + [s.generation_tps for s in sample_records if s.generation_tps] + ), + first_token_latency_s=mean( + [s.first_token_latency_s for s in sample_records + if s.first_token_latency_s is not None] + ), + total_tokens_sum=sum(s.total_tokens for s in sample_records), + codec_fired_mean=mean( + [float(s.codec_fired) for s in sample_records + if s.codec_fired is not None] + ), + n_correct=n_correct if any_correctness_scored else None, + n_samples=len(sample_records), + ) + + return BenchmarkResult( + schema_version=SCHEMA_VERSION, + target_model=target_model, + draft_model=draft_model, + dataset=dataset, + channel=channel, + n_samples=len(sample_records), + samples=sample_records, + aggregate=agg, + hardware=detect_hardware(), + software=detect_software(), + ) + + +# --------------------------------------------------------------------------- +# Env detection (safe to call on any OS; returns "unknown" fields when lib +# isn't installed). +# --------------------------------------------------------------------------- + + +def detect_hardware() -> HardwareInfo: + import platform + chip = platform.processor() or "unknown" + device = "unknown" + try: + import mlx.core as mx # type: ignore + if mx.metal.is_available(): + device = "mlx:metal" + else: + device = "mlx:cpu" + except ImportError: + pass + return HardwareInfo(device=device, chip=chip) + + +def detect_software() -> SoftwareInfo: + def _ver(mod: str) -> str | None: + try: + m = __import__(mod) + return getattr(m, "__version__", None) + except ImportError: + return None + return SoftwareInfo( + mlx=_ver("mlx"), + mlx_lm=_ver("mlx_lm"), + dflash=_ver("dflash"), + kakeyalattice_mlx=_ver("kakeyalattice_mlx"), + kakeya_sidecar_mlx=_ver("kakeya_sidecar_mlx"), + ) + + +# --------------------------------------------------------------------------- +# CLI +# --------------------------------------------------------------------------- + + +def build_parser() -> argparse.ArgumentParser: + p = argparse.ArgumentParser( + prog="b2-dflash-kakeya-benchmark", + description=( + "Acceptance-rate benchmark for DFlash speculative decoding " + "combined with KakeyaLattice E8 KV-cache compression " + "(B2 / M5)." + ), + ) + p.add_argument("--target", default="Qwen/Qwen3-8B", + help="HuggingFace / mlx-community target model id") + p.add_argument("--draft", default="z-lab/Qwen3-8B-DFlash-b16", + help="DFlash draft model id (non-thinking, b16).") + p.add_argument("--datasets", nargs="+", + default=["gsm8k", "humaneval"], + choices=["gsm8k", "humaneval"]) + p.add_argument("--channels", nargs="+", + default=["bf16", "e8-q38", "e8-q10", "e8-q4"], + help="KV cache channels to evaluate.") + p.add_argument("--n-samples", type=int, default=32) + p.add_argument("--max-tokens", type=int, default=512) + p.add_argument("--seed", type=int, default=42) + p.add_argument("--no-dflash", action="store_true", + help="Disable DFlash (debug only).") + p.add_argument("--dry-run", action="store_true", + help="Use MockEngine + synthetic datasets. " + "No MLX / dflash / HF required.") + p.add_argument("--mock-engine", action="store_true", + help="Force MockEngine but still load real datasets " + "(via local jsonl or HF datasets).") + p.add_argument("--allow-synthetic", action="store_true", + help="Permit synthetic dataset fallback even outside " + "--dry-run. Off by default to prevent publishing " + "numbers from 3-prompt fixtures.") + p.add_argument("--out-dir", default="reports/b2_release", + help="Where to write per-combination JSON.") + p.add_argument("--log-level", default="INFO", + choices=["DEBUG", "INFO", "WARNING", "ERROR"]) + return p + + +def main(argv: list[str] | None = None) -> int: + args = build_parser().parse_args(argv) + logging.basicConfig( + level=getattr(logging, args.log_level), + format="%(asctime)s %(levelname)s %(name)s: %(message)s", + ) + log = logging.getLogger("b2-dflash-kakeya") + log.info("runner version=%s schema=%s", RUNNER_VERSION, SCHEMA_VERSION) + + out_dir = Path(args.out_dir) + out_dir.mkdir(parents=True, exist_ok=True) + + engine = _build_engine(args) + + try: + for dataset in args.datasets: + prompts = load_dataset_for_b2( + dataset, + n_samples=args.n_samples, + seed=args.seed, + allow_hf=not args.dry_run, + allow_synthetic=args.dry_run or args.allow_synthetic, + ) + log.info("dataset=%s n_prompts=%d", dataset, len(prompts)) + + for channel in args.channels: + log.info("running channel=%s", channel) + result = run_combination( + engine=engine, + dataset=dataset, + channel=channel, + prompts=prompts, + max_tokens=args.max_tokens, + target_model=args.target, + draft_model=None if args.no_dflash else args.draft, + ) + fname = f"b2_dflash_kakeya_{dataset}_{channel}.json" + out_path = out_dir / fname + with out_path.open("w") as f: + json.dump(result.to_dict(), f, indent=2, default=str) + log.info("wrote %s (accept_mean=%s)", + out_path, result.aggregate.acceptance_length_mean) + finally: + engine.close() + return 0 + + +if __name__ == "__main__": # pragma: no cover + sys.exit(main()) diff --git a/benchmarks/b2_dflash_kakeya/schema.py b/benchmarks/b2_dflash_kakeya/schema.py new file mode 100644 index 00000000..13a2c7ba --- /dev/null +++ b/benchmarks/b2_dflash_kakeya/schema.py @@ -0,0 +1,82 @@ +"""Output JSON schema for b2-dflash-kakeya benchmark runs. + +We pin the schema version here so downstream tooling (reports site, +comparison scripts) can detect breaking changes without guessing. +""" +from __future__ import annotations + +from dataclasses import asdict, dataclass, field +from typing import Any + + +SCHEMA_VERSION = "b2-dflash-kakeya-v1" + + +@dataclass +class HardwareInfo: + device: str = "unknown" + chip: str = "unknown" + total_memory_gb: float | None = None + + +@dataclass +class SoftwareInfo: + mlx: str | None = None + mlx_lm: str | None = None + dflash: str | None = None + kakeyalattice_mlx: str | None = None + kakeya_sidecar_mlx: str | None = None + + +@dataclass +class SampleRecord: + prompt_id: str + prompt: str + response: str + acceptance_lengths: list[int] + generation_tps: float | None + first_token_latency_s: float | None + total_tokens: int + codec_fired: int | None = None + correctness_proxy: bool | None = None + + +@dataclass +class AggregateMetrics: + acceptance_length_mean: float | None + acceptance_length_p50: float | None + acceptance_length_p95: float | None + generation_tps_mean: float | None + first_token_latency_s: float | None + total_tokens_sum: int + codec_fired_mean: float | None + n_correct: int | None + n_samples: int + + +@dataclass +class BenchmarkResult: + schema_version: str + target_model: str + draft_model: str | None + dataset: str + channel: str # e.g. "bf16" or "e8-q10" + n_samples: int + samples: list[SampleRecord] + aggregate: AggregateMetrics + hardware: HardwareInfo = field(default_factory=HardwareInfo) + software: SoftwareInfo = field(default_factory=SoftwareInfo) + + def to_dict(self) -> dict[str, Any]: + d = asdict(self) + return d + + +__all__ = [ + "SCHEMA_VERSION", + "HardwareInfo", + "SoftwareInfo", + "SampleRecord", + "AggregateMetrics", + "BenchmarkResult", +] diff --git a/benchmarks/b2_dflash_kakeya/tests/__init__.py b/benchmarks/b2_dflash_kakeya/tests/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/benchmarks/b2_dflash_kakeya/tests/test_datasets.py b/benchmarks/b2_dflash_kakeya/tests/test_datasets.py new file mode 100644 index 00000000..50d6b130 --- /dev/null +++ b/benchmarks/b2_dflash_kakeya/tests/test_datasets.py @@ -0,0 +1,93 @@ +"""Dataset loader tests. + +Two fallback paths verified without touching the network: +- synthetic fixture: always available (3 items per dataset) +- local jsonl override: supplied via temporary directory +""" +from __future__ import annotations + +import json +from pathlib import Path + +import pytest + +from benchmarks.b2_dflash_kakeya import datasets as ds_mod +from benchmarks.b2_dflash_kakeya.datasets import ( + PromptItem, + load_dataset_for_b2, +) + + +def test_synthetic_fixture_for_gsm8k() -> None: + items = load_dataset_for_b2( + "gsm8k", n_samples=3, allow_hf=False, allow_synthetic=True, + ) + assert len(items) == 3 + for it in items: + assert isinstance(it, PromptItem) + assert it.dataset == "synthetic" + assert it.ground_truth is not None + + +def test_synthetic_forbidden_raises() -> None: + with pytest.raises(FileNotFoundError): + load_dataset_for_b2( + "gsm8k", n_samples=3, allow_hf=False, allow_synthetic=False, + ) + + +def test_unknown_dataset_rejected() -> None: + with pytest.raises(ValueError): + load_dataset_for_b2("winograd", n_samples=1) + + +def test_humanval_synthetic_has_code_scaffolding() -> None: + items = load_dataset_for_b2( + "humaneval", n_samples=3, allow_hf=False, allow_synthetic=True, + ) + for it in items: + assert "def " in it.prompt + + +def test_local_jsonl_preferred_over_synthetic(tmp_path, monkeypatch) -> None: + """If a local jsonl exists, it's used instead of the synthetic + fixture (and instead of HF).""" + data_dir = tmp_path / "data" + data_dir.mkdir() + jsonl = data_dir / "gsm8k.jsonl" + with jsonl.open("w") as f: + f.write(json.dumps({"question": "Q1", "answer": "A1"}) + "\n") + f.write(json.dumps({"question": "Q2", "answer": "A2"}) + "\n") + + monkeypatch.setattr(ds_mod, "_DATA_DIR", data_dir) + + items = load_dataset_for_b2("gsm8k", n_samples=10, allow_hf=False) + assert len(items) == 2 + assert all(it.dataset == "gsm8k" for it in items) + assert {it.ground_truth for it in items} == {"A1", "A2"} + + +def test_n_samples_truncation(tmp_path, monkeypatch) -> None: + data_dir = tmp_path / "data" + data_dir.mkdir() + jsonl = data_dir / "humaneval.jsonl" + with jsonl.open("w") as f: + for i in range(10): + f.write(json.dumps({ + "task_id": f"t/{i}", + "prompt": f"def f{i}(): return", + "canonical_solution": f"return {i}", + }) + "\n") + monkeypatch.setattr(ds_mod, "_DATA_DIR", data_dir) + + items = load_dataset_for_b2("humaneval", n_samples=4, allow_hf=False) + assert len(items) == 4 + + +def test_seed_determinism_for_synthetic() -> None: + a = load_dataset_for_b2("gsm8k", n_samples=3, seed=1, + allow_hf=False, allow_synthetic=True) + b = load_dataset_for_b2("gsm8k", n_samples=3, seed=1, + allow_hf=False, allow_synthetic=True) + assert [(it.prompt_id, it.prompt) for it in a] == \ + [(it.prompt_id, it.prompt) for it in b] diff --git a/benchmarks/b2_dflash_kakeya/tests/test_metrics.py b/benchmarks/b2_dflash_kakeya/tests/test_metrics.py new file mode 100644 index 00000000..195d8523 --- /dev/null +++ b/benchmarks/b2_dflash_kakeya/tests/test_metrics.py @@ -0,0 +1,91 @@ +"""Unit tests for ``metrics.py`` — pure Python, no deps beyond stdlib.""" +from __future__ import annotations + +import pytest + +from benchmarks.b2_dflash_kakeya.metrics import ( + gsm8k_correct, + humaneval_correct, + mean, + percentile, + summarise_accept_lengths, +) + + +def test_mean_empty_is_none() -> None: + assert mean([]) is None + + +def test_mean_basic() -> None: + assert mean([1, 2, 3, 4]) == 2.5 + + +def test_percentile_empty_is_none() -> None: + assert percentile([], 50) is None + + +def test_percentile_single_element() -> None: + assert percentile([7.0], 50) == 7.0 + assert percentile([7.0], 0) == 7.0 + assert percentile([7.0], 100) == 7.0 + + +def test_percentile_matches_numpy_default() -> None: + xs = [10, 20, 30, 40, 50] + assert percentile(xs, 0) == 10 + assert percentile(xs, 50) == 30 + assert percentile(xs, 100) == 50 + # Linear interpolation between sorted[1]=20 and sorted[2]=30 at rank 1.5 + assert percentile(xs, 37.5) == pytest.approx(25.0) + + +def test_percentile_rejects_bad_p() -> None: + with pytest.raises(ValueError): + percentile([1, 2, 3], -1) + with pytest.raises(ValueError): + percentile([1, 2, 3], 101) + + +def test_summarise_handles_empty_records() -> None: + out = summarise_accept_lengths([]) + assert out == {"mean": None, "p50": None, "p95": None} + + +def test_summarise_flattens_per_sample_lists() -> None: + class _R: + def __init__(self, xs): + self.acceptance_lengths = xs + + records = [_R([10, 12]), _R([14, 16]), _R([18])] + out = summarise_accept_lengths(records) + assert out["mean"] == pytest.approx(14.0) + + +def test_summarise_accepts_dicts() -> None: + out = summarise_accept_lengths([ + {"acceptance_lengths": [1, 2, 3]}, + {"acceptance_lengths": [4, 5]}, + ]) + assert out["mean"] == pytest.approx(3.0) + + +# --------------------------------------------------------------------------- +# correctness proxies +# --------------------------------------------------------------------------- + + +def test_gsm8k_correct_extracts_after_hash() -> None: + expected = "Jane has three apples. #### 3" + assert gsm8k_correct("The answer is 3.", expected) is True + assert gsm8k_correct("The answer is 42.", expected) is False + + +def test_gsm8k_correct_no_hash_prefix() -> None: + assert gsm8k_correct("Answer: 7", "7") is True + assert gsm8k_correct("Answer: 7", "") is False + + +def test_humaneval_correct_requires_def_and_return() -> None: + assert humaneval_correct("def f(x):\n return x + 1", "") is True + assert humaneval_correct("print('refused')", "") is False + assert humaneval_correct("def f(x):\n print(x)", "") is False diff --git a/benchmarks/b2_dflash_kakeya/tests/test_runner_mock.py b/benchmarks/b2_dflash_kakeya/tests/test_runner_mock.py new file mode 100644 index 00000000..e1255f91 --- /dev/null +++ b/benchmarks/b2_dflash_kakeya/tests/test_runner_mock.py @@ -0,0 +1,113 @@ +"""End-to-end runner test using MockEngine + synthetic datasets. + +Verifies the full loop: dataset loading → engine.generate per prompt +→ metric aggregation → JSON serialisation → schema round-trip. + +Runs in <0.5s on Linux CI, no MLX / dflash / HF. +""" +from __future__ import annotations + +import json +from pathlib import Path + +import pytest + +from benchmarks.b2_dflash_kakeya.engines import MockEngine +from benchmarks.b2_dflash_kakeya.runner import ( + build_parser, + main, + run_combination, +) +from benchmarks.b2_dflash_kakeya.datasets import load_dataset_for_b2 +from benchmarks.b2_dflash_kakeya.schema import SCHEMA_VERSION + + +def test_parser_defaults() -> None: + args = build_parser().parse_args([]) + assert args.target == "Qwen/Qwen3-8B" + assert "z-lab/Qwen3-8B-DFlash-b16" in args.draft + assert set(args.datasets) == {"gsm8k", "humaneval"} + assert args.channels == ["bf16", "e8-q38", "e8-q10", "e8-q4"] + + +def test_run_combination_accept_length_ordering() -> None: + """MockEngine encodes the theoretical acceptance-length ordering + bf16 > Q=38 > Q=10 > Q=4. The benchmark runner aggregation must + preserve it.""" + engine = MockEngine(seed=0) + prompts = load_dataset_for_b2( + "gsm8k", n_samples=3, allow_hf=False, allow_synthetic=True, + ) + + accept_means: dict[str, float] = {} + for channel in ("bf16", "e8-q38", "e8-q10", "e8-q4"): + res = run_combination( + engine=engine, dataset="gsm8k", channel=channel, + prompts=prompts, max_tokens=128, + target_model="Qwen/Qwen3-8B", + draft_model="z-lab/Qwen3-8B-DFlash-b16", + ) + accept_means[channel] = res.aggregate.acceptance_length_mean or 0.0 + + assert accept_means["bf16"] > accept_means["e8-q38"] + assert accept_means["e8-q38"] > accept_means["e8-q10"] + assert accept_means["e8-q10"] > accept_means["e8-q4"] + + +def test_main_dry_run_writes_json_per_combination(tmp_path) -> None: + out_dir = tmp_path / "reports" + code = main([ + "--dry-run", + "--n-samples", "3", + "--max-tokens", "64", + "--out-dir", str(out_dir), + "--datasets", "gsm8k", + "--channels", "bf16", "e8-q10", + ]) + assert code == 0 + + files = sorted(p.name for p in out_dir.iterdir()) + assert files == [ + "b2_dflash_kakeya_gsm8k_bf16.json", + "b2_dflash_kakeya_gsm8k_e8-q10.json", + ] + + for name in files: + with (out_dir / name).open() as f: + obj = json.load(f) + assert obj["schema_version"] == SCHEMA_VERSION + assert obj["dataset"] == "gsm8k" + assert obj["n_samples"] == 3 + assert obj["aggregate"]["acceptance_length_mean"] is not None + assert obj["aggregate"]["n_samples"] == 3 + assert len(obj["samples"]) == 3 + + +def test_main_dry_run_humaneval_correctness_populated(tmp_path) -> None: + out_dir = tmp_path / "reports" + main([ + "--dry-run", + "--n-samples", "3", + "--max-tokens", "32", + "--out-dir", str(out_dir), + "--datasets", "humaneval", + "--channels", "bf16", + ]) + with (out_dir / "b2_dflash_kakeya_humaneval_bf16.json").open() as f: + obj = json.load(f) + # Mock responses don't actually contain code, so correctness proxy + # should be False for all; n_correct populated = 0. + # The important thing is that the field exists and isn't None for + # a dataset with ground truth. + assert obj["aggregate"]["n_correct"] is not None + for s in obj["samples"]: + assert s["correctness_proxy"] is not None + + +def test_channel_not_in_mock_engine_gets_default_means() -> None: + """A channel name the MockEngine doesn't know should not crash — + it falls back to 10/150 defaults.""" + engine = MockEngine(seed=1) + res = engine.generate(prompt="hi", channel="custom-42", max_tokens=64) + assert res.acceptance_lengths + assert all(x >= 1 for x in res.acceptance_lengths) From bac07b944ca5557869dfbe76913eba9bbf807a8a Mon Sep 17 00:00:00 2001 From: Cursor Agent Date: Thu, 30 Apr 2026 05:07:29 +0000 Subject: [PATCH 2/3] docs(b2/M5): reports/b2_release/ placeholder for real benchmark output The M5 runner writes one JSON per (dataset, channel) combination into reports/b2_release/. The real run requires Apple Silicon + MLX + dflash + gsm8k/humaneval data; until then this directory only documents the expected layout. Expected artefacts after a real run: b2_dflash_kakeya_{gsm8k,humaneval}_{bf16,e8-q38,e8-q10,e8-q4}.json FINDINGS.md (narrative + aggregate tables) Co-authored-by: FluffyAIcode --- reports/b2_release/README.md | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) create mode 100644 reports/b2_release/README.md diff --git a/reports/b2_release/README.md b/reports/b2_release/README.md new file mode 100644 index 00000000..40c6ca2c --- /dev/null +++ b/reports/b2_release/README.md @@ -0,0 +1,21 @@ +# B2 Release — acceptance-rate benchmark outputs + +空目录占位。真实 B2 benchmark 输出(需要 Apple Silicon + MLX + DFlash) +跑出来后,8 个 JSON 文件会落在这里: + +``` +reports/b2_release/ +├── b2_dflash_kakeya_gsm8k_bf16.json +├── b2_dflash_kakeya_gsm8k_e8-q38.json +├── b2_dflash_kakeya_gsm8k_e8-q10.json +├── b2_dflash_kakeya_gsm8k_e8-q4.json +├── b2_dflash_kakeya_humaneval_bf16.json +├── b2_dflash_kakeya_humaneval_e8-q38.json +├── b2_dflash_kakeya_humaneval_e8-q10.json +├── b2_dflash_kakeya_humaneval_e8-q4.json +└── FINDINGS.md (narrative + aggregate tables) +``` + +跑法见 `benchmarks/b2_dflash_kakeya/README.md`。 + +schema 版本 + 每条 JSON 结构见 `benchmarks/b2_dflash_kakeya/schema.py`。 From decd8468d944a0b1f51c21c14f57efe03fbab047 Mon Sep 17 00:00:00 2001 From: Cursor Agent Date: Thu, 30 Apr 2026 05:07:45 +0000 Subject: [PATCH 3/3] docs(b2/M6): standalone product proposal - pivot from Atomic-Chat backend MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Evaluates repositioning B2 (MLX + DFlash + KakeyaLattice-MLX) from 'an Atomic-Chat second backend' (original M6 scope) to an independent Mac local-inference product. Three candidate forms: A. CLI tool `kakeya-llm` (homebrew / pip) B. Native macOS app `Kakeya Studio` (DMG, custom UI) C. SDK + developer/enterprise library (PyPI + docs site) Recommendation: ship A + C first, defer B. Rationale: 1. atomic.chat's "Google TurboQuant built-in" headline is under-delivered by the actual llama.cpp KV quantisation stack; folding B2 into Atomic-Chat means fulfilling *their* marketing through *our* engineering with no control over the message. 2. B2's three capabilities (MLX-native inference, DFlash 3-6x speedup, E8 KV compression) are general-purpose infra — not chat-UI-specific. Binding them to one chat app under-uses the stack. 3. Existing engineering assets (kakeyalattice_mlx, kakeya_sidecar_mlx, cache_injection) are ~90% reusable for A+C with minimal new work; going the Atomic-Chat-backend route blocks on third-party PR review + release cadence. Phased execution: Phase 1 (after M5 merges): PyPI release, landing page, CLI shim. Phase 2 (4-8 weeks later): Cursor / Raycast / Obsidian integrations, early enterprise POCs. Phase 3 (6 months out, re-evaluate): whether to do the native app. This evaluation does NOT invalidate PR #57 (B1), #58 (B2 skeleton), or #59 (B2/M4): all that code is directly reusable. It re-scopes M6 only, shifting from 'ship to Atomic-Chat' to 'ship independently + opportunistic Atomic-Chat integration if they take our PR'. Co-authored-by: FluffyAIcode --- docs/B2_STANDALONE_PRODUCT_PROPOSAL.md | 242 +++++++++++++++++++++++++ 1 file changed, 242 insertions(+) create mode 100644 docs/B2_STANDALONE_PRODUCT_PROPOSAL.md diff --git a/docs/B2_STANDALONE_PRODUCT_PROPOSAL.md b/docs/B2_STANDALONE_PRODUCT_PROPOSAL.md new file mode 100644 index 00000000..d319fcd0 --- /dev/null +++ b/docs/B2_STANDALONE_PRODUCT_PROPOSAL.md @@ -0,0 +1,242 @@ +# B2 Standalone Product Proposal — 从 Atomic-Chat backend 走向独立 Mac 本地推理产品 + +**Date**: 2026-04-30 +**Scope**: 评估把 B2 (`MLX + DFlash + KakeyaLattice-MLX`) 从"Atomic-Chat +的第二个后端"升级为**独立 Mac 本地推理产品**的商业/工程可行性。 +**Status**: 评估稿,作为 M6 原 "Atomic-Chat extension 追加 backend 选项" +的替代路线。 + +--- + +## 1. 为什么重估 + +原 M6 规划是把 B2 做成 Atomic-Chat 的第二个 backend:在 TS 扩展里加 +`"variant": "mlx-dflash"`,Tauri plugin 同时托管 B1 + B2 两个 sidecar。 +这条路径**工程上成立**,但在商业逻辑上有几个问题。 + +### 1.1 Atomic-Chat 的宣传与 B2 的产品承诺不对等 + +atomic.chat 首页的核心宣传: + +- *"Google TurboQuant built-in — 8× Faster Inference"* — v1.5 报告里 + TurboQuant b=2 在 4 个模型上**结构性不可用**,b=3 被 E8 Q=4 全面压过 + 3-6×。真实落地的是 llama.cpp 的 q4_0/q8_0 标量量化(2023 年的技术)。 +- *"Compressed down to just 3 bits — with no retraining, no fine-tuning, + and no trade-off in model performance"* — 同上,3 bit 级别对 <3B 模型 + 在 llama.cpp 上真实可用性极差。 + +把 B2 挂进 Atomic-Chat 作为 backend 选项,等于帮它**兑现宣传承诺**。 +但 Atomic-Chat 团队**不是我们**,合作节奏、PR review、version +compatibility 都有摩擦。我们交付 PR → 他们决定合不合 → 用户什么时候 +能拿到——这条链路至少多一个季度。 + +### 1.2 B2 的价值点超过"chat UI 的后端" + +B2 的三个能力: +- **MLX 原生推理** — Apple Silicon 第一等公民 +- **DFlash 3-6× 无损加速** — 2026 年 speculative decoding SOTA +- **KakeyaLattice E8 KV 压缩** — 长上下文不 OOM + +这三个能力的使用面 **不止 chat**。举例: +- IDE 集成(Cursor / VSCode 里本地跑 Qwen-Coder 30B) +- 命令行 agent / MCP-driven workflow(不需要 UI 的场景) +- 批量文本处理(合同 / 论文 / 代码库摘要,不交互) +- 其他 GUI 宿主(Raycast extension、Alfred workflow、Obsidian plugin) + +把 B2 绑死到 Atomic-Chat 的 UI 里,等于把一个**通用推理加速层**锁进 +一个特定 chat app。 + +### 1.3 atomic.chat 的长期演化风险 + +atomic.chat 是商业公司产品,UI / 扩展 API 路线图由他们决定。他们可能: +- 换推理引擎(目前完全绑 llama.cpp;将来改成 MLX 原生或自研) +- 改扩展 API(我们的 TS 扩展 + Rust plugin 需随之改) +- 决定不把 KakeyaLattice 放进默认发行("Pro Mode for Mac" 选项可能永远 + 藏在高级设置里) + +这些风险不是零概率,且一旦发生我们的工程沉默成本很高。 + +## 2. 独立产品路线的三种形态 + +### Form A:CLI 工具 `kakeya-llm` + +- `brew install kakeya-llm` / `pip install kakeya-llm` +- `kakeya-llm chat qwen3-8b --q 38 --dflash` 启交互 +- `kakeya-llm serve --port 1339` 开 OpenAI 兼容 server +- `kakeya-llm bench gsm8k` 跑内置 M5 benchmark +- 目标用户:开发者 / ML 工程师 / 学生 + +**优势**: +- 分发简单,homebrew 生态现成 +- 不依赖 UI 框架;Metal / CUDA / CPU 都能跑(基座是 mlx-lm + dflash) +- 完整控制 KakeyaLattice × DFlash 的每一个旋钮 + +**劣势**: +- 非技术用户门槛高 +- 默认没有对话历史 / 多 assistant / MCP 集成,纯"跑模型" +- 需要自己维护包发布 + 版本升级 + +**适合场景**:作为底层 API 被其他 app 集成(IDE / 插件 / 工作流)。 + +### Form B:native macOS app `Kakeya Studio` + +- DMG 包,不要求任何命令行能力 +- 单窗口 chat UI + 模型管理 + 性能监控 +- 内置 KakeyaLattice compression 的**可视化**("你刚才这次对话省了 + X GB KV,速度比 baseline 快 Y×") +- OpenAI 兼容 `:1339` 自带,让 Cursor / Raycast / Alfred 能连 +- 目标用户:Mac 本地 AI 爱好者(和 atomic.chat 用户重叠但不完全相同) + +**优势**: +- 完整控制 UI/叙事/宣传口径("基于论文的 E8 格压缩",可验证、学术 + 可溯) +- 把 KakeyaLattice 的 "discrete Kakeya cover" 理论直接 surface 到 + UI(这是 atomic.chat 做不到也不会做的差异化) +- 省下 Atomic-Chat 团队合作的沟通成本 + +**劣势**: +- 从零搭 Tauri/Electron UI + signing + notarize + auto-update +- App Store 审核(如果走 MAS 发行) +- 和 atomic.chat 正面竞争 Mac 本地 AI app 市场 — 他们已经 500+ stars, + 我们要从 0 开始做 GTM + +**适合场景**:想建立 KakeyaLattice 作为品牌的长期目标。 + +### Form C:SDK + 企业/开发者服务 `kakeyalattice-mlx` as a library + +- 把 `kakeyalattice_mlx` + `kakeya_sidecar_mlx` + `cache_injection` 升级 + 为严肃 Python 包,发到 PyPI +- 卖点不是 "app",是 "the MLX-native long-context inference stack" +- 配套:文档站 + benchmark 报告 + 参考 Docker image +- 用户是开发团队(Cursor 这类 IDE 想集成本地推理、内部工具团队想给 + 员工提供本地 AI) +- 目标客户:对"确定性 + 可审计 + 离线"有强诉求的企业(律所、医疗、 + 投行、国防承包商) + +**优势**: +- 最小 UI 投入,最大技术杠杆 +- 和 Atomic-Chat(C 端 chat app)错位竞争,客户群完全不重叠 +- Apache-2.0 吸引生态,有机会成为事实标准(类似 `transformers` 在 + HF 生态里的地位) + +**劣势**: +- B2B 销售周期长 +- 需要保持上游(MLX / DFlash / HF)兼容性的长期工程承诺 +- 早期没有"用户界面"的东西,marketing 需要更专业渠道(Twitter/X、 + arXiv 引用、技术 blog) + +**适合场景**:研究机构 / 行业 / 企业客户优先。 + +## 3. 推荐路径:A + C 串联,暂缓 B + +基于以下几个现实判断: + +1. **我们的工程资产天然贴合 A 和 C**: + - `kakeya_sidecar_mlx` 已经是 OpenAI 兼容 server — **直接就是 A 的核心**。 + - `kakeyalattice_mlx` 已经是独立 pip 包,bit-identical parity 有测试 + 保证 — **直接就是 C 的核心**。 + - 从"B1 + B2 集成到 Atomic-Chat"变成"把 B2 独立发布",**工程改动不到 + 10%**,主要是加发布 CI + 文档站 + 品牌 landing page。 + +2. **B 的 native app 路径成本最高且回报最晚**: + - 成本:搭 UI / signing / notarize / auto-update / App Store / GTM / + 客服,全套做齐是半年级别工作。 + - 回报:必须打败 atomic.chat (500+ stars, 成熟 GTM) 才能起量。 + - 风险:一旦 atomic.chat 集成 KakeyaLattice(无论是通过我们的 PR + 还是自己实现),B 的差异化就瓦解。 + +3. **A + C 的组合可以"先卖给开发者 → 逐步渗透到 app"**: + - Cursor / Raycast / Alfred / Obsidian 的插件开发者用我们 SDK 嵌 + 本地推理 → 终端用户通过这些 app 间接用上 KakeyaLattice + - **这条路径 atomic.chat 无法阻断**,因为他们的 UI 只是众多消费端之一 + +## 4. 具体执行建议 + +### Phase 1(M5 benchmark 之后立即做)— 公开发布 A + C 的底层包 + +1. **PyPI 发布**: + - `kakeyalattice-mlx` v0.1.0 + - `kakeya-sidecar-mlx` v0.1.0 + - 两者都走 trusted publishing + GitHub release + 自动 wheel build +2. **文档站**(`kakeyalattice.dev` 或 `kakeya-mlx.dev` domain): + - Quick start(一条 `pip install` 命令) + - API reference(auto-gen from docstrings) + - 性能 benchmark 页面(绑到 `reports/b2_release/` 的 JSON) + - 对比表格(atomic.chat / llama.cpp / ollama / MLC-LLM / 我们) +3. **CLI shim**: `kakeya-llm` 作为 `kakeya-sidecar-mlx` 的子命令包装, + 提供 `chat`/`serve`/`bench` 三个入口。 +4. **GitHub 项目分拆**(可选): `kakeyalattice` 主 repo 保留 codec + + 论文;`kakeya-mlx` 拆出独立 repo 做 sidecar + benchmark + 文档站。 + 独立 issue tracker + release cadence。 + +### Phase 2(Phase 1 上线 4-8 周后)— 生态嵌入 + +1. **Cursor / Continue / Aider 集成** — 这些 IDE agent 工具都支持 + OpenAI 兼容的本地 server。我们的 sidecar 默认就能被它们连上, + 只需要文档 + 示例。 +2. **Raycast extension**: Raycast 的 extension 生态对"本地 AI"有大需求; + 做一个 Raycast AI 替代品走 Kakeya sidecar。 +3. **企业 POC**: 选 2-3 家有"隐私 + Mac 标配 + 长文档处理"需求的客户 + (律所、投行),做 3 个月 POC,产出案例。 + +### Phase 3(6 个月后再评估)— B 的 native app 要不要做 + +如果 Phase 2 证明了: +- 用户愿意把 KakeyaLattice 作为"本地 AI infra"集成进他们的工作流 +- SDK/CLI 形态的 conversion 到 Apache-2.0 用户足够多(比如周下载过万) +- atomic.chat 没有把 KakeyaLattice 集成进去(无论出于什么原因) + +**那时再评估要不要做 native app**。做的时候差异化应该非常明确(e.g. +"for researchers who want to see the E8 lattice compression happen in +real time"),而不是通用 chat UI 和 atomic.chat 正面硬刚。 + +### 不推荐的路径 + +- **不做原 M6**: 把 B2 做成 Atomic-Chat 的 backend 选项,然后等 PR + 合入。工程沉默成本过高,且回报受制于第三方。 +- **不跳过 Phase 1 直接做 B**: "先做 UI,再补后端"是典型的精力错配; + UI 是消费端表面,不是壁垒。KakeyaLattice 的壁垒在 `kakeyalattice_mlx` + 这一层,应该优先发布它。 + +## 5. 对 B1 PR #57 和 B2 PR #58/#59 的影响 + +这份评估不否定已有工作: + +- **B1 (PR #57)** 依然有用 — HF + MPS 路径是跨平台的(Mac/Win/Linux), + CLI `kakeya-llm` 会默认用它;Apple Silicon 用户自动切 B2。 +- **B2 (PR #58 骨架 + PR #59 DFlash 集成)** 依然有用 — 就是 Phase 1 + 要发布的 MLX 栈。 +- **Atomic-Chat 集成** (原 M6) — **可以保留为可选**: 如果 atomic.chat + 团队愿意合我们的 PR,那是 bonus 分发渠道;但**不是独立产品的主路径**。 + +换言之,现有工程资产**全部都在独立产品路径上可直接复用**。本评估只是 +**调整 GTM 策略和 branding focus**。 + +## 6. 下一步(如果采纳本评估) + +1. **本 PR 合并后立即做**: + - 给 `kakeyalattice_mlx` 和 `kakeya_sidecar_mlx` 加 PyPI 发布 CI + (`.github/workflows/publish-*.yml`) + - 注册 domain(`kakeyalattice.dev` 等) + - 起草 landing page 内容(对比表 + quick start + FAQ) +2. **2 周内**: + - 首个 PyPI release (v0.1.0) + - 开 `kakeya-mlx` 独立 repo(可选,看我们 repo 策略偏好) + - Twitter / arXiv / HN 发布公告 +3. **1 个月内**: + - Cursor/Continue/Aider 集成文档 + PR + - 第一个企业 POC 接触 + +## 7. 结论(一句话) + +> **把 B2 从"atomic.chat 的后端"升级为"Mac 本地推理的独立 SDK + CLI 发行"。 +> 所有工程资产都能直接复用;差异化点(E8 格压缩 + DFlash 加速的 +> MLX 原生组合)在独立发行下完全可控,且不受 Atomic-Chat 产品路线 +> 影响。Native app 形态的决定推迟到 Phase 3,避免早期 UI 投入与 +> atomic.chat 正面竞争。** + +--- + +*作者:Cursor Cloud Agent · 分支 +`AgentMemory/atomic-chat-b2-m5-acceptance-benchmark-04ae` · 作为 M6 评估 +文档与 M5 benchmark 同 PR 交付。*