From 41b627ca4ca272a6a07597a1f1745858d1f2d668 Mon Sep 17 00:00:00 2001
From: Cursor Agent <cursoragent@cursor.com>
Date: Thu, 30 Apr 2026 05:07:21 +0000
Subject: [PATCH 1/3] feat(b2/M5): acceptance-rate benchmark runner for DFlash
 x Kakeya

New benchmarks package `benchmarks.b2_dflash_kakeya` quantifies the
impact of KakeyaLattice E8 KV-cache compression on DFlash block-
diffusion speculative decoding.

Experiment design (laid out in README):
  target: Qwen/Qwen3-8B
  draft:  z-lab/Qwen3-8B-DFlash-b16
  KV channels: bf16 / e8-q38 / e8-q10 / e8-q4   (4 levels)
  datasets: gsm8k, humaneval   (32 prompts each by default)
  metrics: acceptance_length {mean, p50, p95}, tps, TTFT,
           codec_fired, correctness_proxy

Package layout:
  runner.py       CLI + top-level orchestration; --dry-run exercises
                  the whole pipeline on Linux CI without MLX/dflash/HF.
  datasets.py     Three-tier dataset loader (local jsonl ->
                  HF datasets -> synthetic). Synthetic is gated
                  behind --allow-synthetic so nobody ships numbers
                  from 3-prompt fixtures.
  engines.py      Engine Protocol with RealEngine (delegates to
                  kakeya_sidecar_mlx.MLXEngine) and MockEngine
                  (deterministic fake with the theoretical accept-
                  length ordering bf16 > Q=38 > Q=10 > Q=4).
  metrics.py      Pure-stdlib percentile + mean + correctness
                  proxies (gsm8k = numeric-substring, humaneval =
                  def/return substring; full execution harness
                  explicitly out of scope for now).
  schema.py       Pinned schema version b2-dflash-kakeya-v1 so
                  downstream tooling can detect breaking changes.

Tests (24 passed, Linux CI only, no MLX/dflash/HF):
  test_metrics.py        percentile edge cases, mean, correctness
                         proxies, multi-record summarisation.
  test_datasets.py       synthetic fixture, local jsonl preference,
                         n_samples truncation, seed determinism.
  test_runner_mock.py    full end-to-end via MockEngine + synthetic
                         data; asserts accept-length ordering is
                         preserved through aggregation and that
                         JSON outputs conform to schema v1.

Dry-run smoke (executed locally):
  python -m benchmarks.b2_dflash_kakeya.runner \
      --dry-run --n-samples 3 --max-tokens 64 \
      --out-dir /tmp/b2_dryrun
  # => 8 JSON files (2 datasets x 4 channels), schema v1,
  #    accept means 15 / 14 / 12 / 8 as expected.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
---
 benchmarks/b2_dflash_kakeya/README.md         | 138 +++++++++
 benchmarks/b2_dflash_kakeya/__init__.py       |   4 +
 benchmarks/b2_dflash_kakeya/datasets.py       | 156 ++++++++++
 benchmarks/b2_dflash_kakeya/engines.py        | 210 +++++++++++++
 benchmarks/b2_dflash_kakeya/metrics.py        |  99 ++++++
 benchmarks/b2_dflash_kakeya/runner.py         | 282 ++++++++++++++++++
 benchmarks/b2_dflash_kakeya/schema.py         |  82 +++++
 benchmarks/b2_dflash_kakeya/tests/__init__.py |   0
 .../b2_dflash_kakeya/tests/test_datasets.py   |  93 ++++++
 .../b2_dflash_kakeya/tests/test_metrics.py    |  91 ++++++
 .../tests/test_runner_mock.py                 | 113 +++++++
 11 files changed, 1268 insertions(+)
 create mode 100644 benchmarks/b2_dflash_kakeya/README.md
 create mode 100644 benchmarks/b2_dflash_kakeya/__init__.py
 create mode 100644 benchmarks/b2_dflash_kakeya/datasets.py
 create mode 100644 benchmarks/b2_dflash_kakeya/engines.py
 create mode 100644 benchmarks/b2_dflash_kakeya/metrics.py
 create mode 100644 benchmarks/b2_dflash_kakeya/runner.py
 create mode 100644 benchmarks/b2_dflash_kakeya/schema.py
 create mode 100644 benchmarks/b2_dflash_kakeya/tests/__init__.py
 create mode 100644 benchmarks/b2_dflash_kakeya/tests/test_datasets.py
 create mode 100644 benchmarks/b2_dflash_kakeya/tests/test_metrics.py
 create mode 100644 benchmarks/b2_dflash_kakeya/tests/test_runner_mock.py

diff --git a/benchmarks/b2_dflash_kakeya/README.md b/benchmarks/b2_dflash_kakeya/README.md
new file mode 100644
index 00000000..0d24d15a
--- /dev/null
+++ b/benchmarks/b2_dflash_kakeya/README.md
@@ -0,0 +1,138 @@
+# B2 — DFlash × KakeyaLattice acceptance-rate benchmark
+
+**目标**：量化 KakeyaLattice E8 KV-cache 压缩对 DFlash block-diffusion
+speculative decoding 的影响 — 具体是 acceptance length 的变化（是否因
+target 分布扰动而掉速）以及端到端 tok/s 的实际收益。
+
+B2 (`integrations/atomic-chat-b2/`) 的所有工程都是围绕"能不能叠加"的
+理论推演 + 骨架 + 单测。M5 负责把推演变成数字。
+
+## 实验设计
+
+### Target × Draft × KV channel
+
+| Target | Draft (DFlash) | KV channel |
+|:-|:-|:-|
+| `Qwen/Qwen3-8B` (non-thinking) | `z-lab/Qwen3-8B-DFlash-b16` | bf16 baseline |
+| 同上 | 同上 | Kakeya E8 Q=38 (near-lossless) |
+| 同上 | 同上 | Kakeya E8 Q=10 (balanced) |
+| 同上 | 同上 | Kakeya E8 Q=4 (aggressive) |
+
+共 **4 组 (1 target × 1 draft × 4 KV channel)**。bf16 是对照组，不走
+KakeyaLatticeMLXCache，直接用 mlx-lm 原生 KVCache。
+
+### 数据集
+
+- **`gsm8k`** (GSM8K test split)：数学推理；DFlash 论文主 benchmark。
+- **`humaneval`** (HumanEval openai)：代码生成；DFlash 论文次 benchmark。
+
+对每个数据集，随机抽 `n_samples=32` prompt；若要快速烟测可调低到 8。
+种子固定 (`seed=42`) 以便两次跑的 prompt 集一致。
+
+### 指标
+
+| 指标 | 含义 | 来源 |
+|:-|:-|:-|
+| `acceptance_length_mean` | 平均每 verify step 接受 token 数；DFlash 的核心指标 | `dflash.model_mlx.stream_generate` 每步返回 |
+| `acceptance_length_p50 / p95` | 分位数 | 同上 |
+| `generation_tps` | 端到端 tok/s | dflash 或 mlx_lm 的 timer |
+| `total_tokens` | 生成 token 总数 | tokenizer 统计 |
+| `first_token_latency_s` | 首 token 延迟 | wall clock |
+| `kakeya_codec_fired` | codec 触发次数（非 boundary 层） | `KakeyaLatticeMLXCache.fire_count` |
+| `correctness_proxy` (可选) | answer 是否含 expected | 仅对 gsm8k / humaneval 有简单匹配 |
+
+### 预期结果（按 PR #57 §12.2 的理论分析）
+
+| channel | acceptance_length | tps 相对 baseline | 结论 |
+|:-|:-:|:-:|:-|
+| bf16 baseline | ~14-16 (DFlash 官方报数) | 1.00× | ✓ |
+| Kakeya Q=38 | ~13-15 (降 <1pp) | ~0.95-1.00× | 近无损可用 |
+| Kakeya Q=10 | ~11-13 (降 1-3pp) | ~0.80-0.90× | 用 KV 节省换速度 |
+| Kakeya Q=4 | ~7-10 (显著下降) | ~0.50-0.70× | 不进默认档位 |
+
+**如果 Q=38 acceptance 掉得超过 2pp**，该档位不作为 B2 默认；回退到
+Q=76 或 Q=152。**如果 Q=10 acceptance 掉得超过 5pp**，B2 的"加速 + 压缩
+双赢"叙事需要修正。
+
+## 运行
+
+### 真实运行（需要 Apple Silicon + MLX + dflash）
+
+```bash
+# 1. 装依赖
+pip install -e integrations/atomic-chat-b2/kakeyalattice_mlx[mlx]
+pip install dflash                # z-lab/dflash 官方包 (MLX backend)
+pip install "mlx-lm>=0.20"
+
+# 2. 跑基线 (bf16) + 三个 Kakeya 档
+python -m benchmarks.b2_dflash_kakeya.runner \
+    --target Qwen/Qwen3-8B \
+    --draft  z-lab/Qwen3-8B-DFlash-b16 \
+    --datasets gsm8k humaneval \
+    --n-samples 32 \
+    --channels bf16 e8-q38 e8-q10 e8-q4 \
+    --out-dir reports/b2_release
+
+# 3. 结果在 reports/b2_release/b2_dflash_kakeya_{dataset}_{channel}.json
+```
+
+### Dry-run（Linux CI 可跑；不下模型、不加载 dflash）
+
+```bash
+python -m benchmarks.b2_dflash_kakeya.runner --dry-run
+```
+
+Dry-run 会走完参数解析 + dataset 加载 + 指标聚合路径；推理步骤由
+`--mock-engine` 自动注入的 FakeEngine 替身提供，用来 CI 验证 runner
+工程完整度。
+
+## 文件
+
+```
+benchmarks/b2_dflash_kakeya/
+├── README.md         (本文件)
+├── __init__.py
+├── runner.py         主入口 + 参数解析 + 顶层流程
+├── datasets.py       gsm8k / humaneval 加载器 (本地 jsonl + 可选 HF datasets)
+├── engines.py        RealEngine (DFlash+Kakeya) + MockEngine (CI)
+├── metrics.py        acceptance_length 分布, tps, Δppl 估算
+├── schema.py         输出 JSON 的 TypedDict + 版本号
+└── tests/
+    ├── __init__.py
+    ├── test_metrics.py        (Linux CI green)
+    ├── test_datasets.py       (Linux CI green)
+    └── test_runner_mock.py    (Linux CI green, 用 MockEngine)
+```
+
+## 输出 schema
+
+```json
+{
+  "schema_version": "b2-dflash-kakeya-v1",
+  "target_model": "Qwen/Qwen3-8B",
+  "draft_model":  "z-lab/Qwen3-8B-DFlash-b16",
+  "dataset":      "gsm8k",
+  "channel":      "e8-q10",
+  "n_samples":    32,
+  "samples": [ { prompt..., metrics... }, ... ],
+  "aggregate": {
+    "acceptance_length_mean": 12.3,
+    "acceptance_length_p50":  12,
+    "acceptance_length_p95":  18,
+    "generation_tps_mean":    210.5,
+    "first_token_latency_s":  0.142,
+    "total_tokens_sum":       8192,
+    "codec_fired_mean":       35.2
+  },
+  "hardware": { "device": "mlx:metal", "chip": "Apple M3 Pro", ... },
+  "software": { "mlx": "...", "dflash": "...", "kakeyalattice_mlx": "..." }
+}
+```
+
+## 对比 atomic.chat 首页宣传
+
+atomic.chat 首页声称 *"Google TurboQuant built-in"* 与 *"Compressed down
+to just 3 bits"*。v1.5 报告里 TQ b=2 在 4 模型上结构性不可用、b=3 被
+E8 Q=4 全面压过 3-6×。M5 的 b2 报告会补上：**同样 (CR, |Δppl|) 前提下,
+DFlash 加速损失是多少** — 也就是宣传里未兑现的"速度 + 压缩双赢"那一栏
+的真实数字。
diff --git a/benchmarks/b2_dflash_kakeya/__init__.py b/benchmarks/b2_dflash_kakeya/__init__.py
new file mode 100644
index 00000000..fb9b37d6
--- /dev/null
+++ b/benchmarks/b2_dflash_kakeya/__init__.py
@@ -0,0 +1,4 @@
+"""B2 DFlash x KakeyaLattice acceptance-rate benchmark."""
+from __future__ import annotations
+
+__version__ = "0.1.0"
diff --git a/benchmarks/b2_dflash_kakeya/datasets.py b/benchmarks/b2_dflash_kakeya/datasets.py
new file mode 100644
index 00000000..3c20e3ba
--- /dev/null
+++ b/benchmarks/b2_dflash_kakeya/datasets.py
@@ -0,0 +1,156 @@
+"""Dataset loaders for the B2 acceptance-rate benchmark.
+
+Two datasets are supported out of the box: **gsm8k** and **humaneval**.
+
+Loading strategy (in priority order):
+
+1. **Local JSONL file**: ``benchmarks/b2_dflash_kakeya/data/<name>.jsonl``.
+   Users who can't reach HF hub (or want a frozen subset for the
+   paper) check in a jsonl snapshot and we read it directly. Keeps
+   the benchmark reproducible offline.
+2. **HuggingFace ``datasets`` library** if available. We load
+   ``openai/gsm8k`` (``main`` config, ``test`` split) and
+   ``openai/humaneval`` (``test`` split). Cached under HF_HOME.
+3. **Synthetic fixture** — a tiny built-in 3-prompt dataset per name.
+   Used by unit tests and ``--dry-run`` mode; explicitly labelled so
+   nobody publishes numbers from it by accident.
+
+Each prompt is returned as a ``PromptItem`` dataclass carrying an
+id, the prompt string the target LLM will see, and an optional
+ground-truth field used by the correctness proxy in ``metrics.py``.
+"""
+from __future__ import annotations
+
+import json
+import random
+from dataclasses import dataclass
+from pathlib import Path
+
+
+@dataclass(frozen=True)
+class PromptItem:
+    dataset: str              # "gsm8k" | "humaneval" | "synthetic"
+    prompt_id: str
+    prompt: str
+    ground_truth: str | None = None
+
+
+_SUPPORTED = ("gsm8k", "humaneval")
+
+_DATA_DIR = Path(__file__).parent / "data"
+
+
+def _load_local_jsonl(name: str) -> list[dict] | None:
+    path = _DATA_DIR / f"{name}.jsonl"
+    if not path.exists():
+        return None
+    with path.open() as f:
+        return [json.loads(line) for line in f if line.strip()]
+
+
+def _load_hf(name: str) -> list[dict] | None:
+    try:
+        from datasets import load_dataset  # type: ignore
+    except ImportError:
+        return None
+    if name == "gsm8k":
+        ds = load_dataset("openai/gsm8k", "main", split="test")
+        return [dict(row) for row in ds]
+    if name == "humaneval":
+        ds = load_dataset("openai/humaneval", split="test")
+        return [dict(row) for row in ds]
+    return None
+
+
+_SYNTHETIC_FIXTURES: dict[str, list[PromptItem]] = {
+    "gsm8k": [
+        PromptItem("synthetic", "s0",
+                   "Q: Janet has 3 apples, gives 1 to Bob. How many are left?",
+                   "2"),
+        PromptItem("synthetic", "s1",
+                   "Q: A train travels 60 miles in 1.5 hours. What is its speed?",
+                   "40"),
+        PromptItem("synthetic", "s2",
+                   "Q: If 5 pencils cost $2.50, what is the cost of 8 pencils?",
+                   "4"),
+    ],
+    "humaneval": [
+        PromptItem("synthetic", "h0",
+                   "def add(a, b):\n    \"\"\"Return a + b.\"\"\"\n",
+                   "def add(a, b):\n    return a + b"),
+        PromptItem("synthetic", "h1",
+                   "def is_even(n):\n    \"\"\"Return True if n is even.\"\"\"\n",
+                   "def is_even(n):\n    return n % 2 == 0"),
+        PromptItem("synthetic", "h2",
+                   "def reverse(s):\n    \"\"\"Return s reversed.\"\"\"\n",
+                   "def reverse(s):\n    return s[::-1]"),
+    ],
+}
+
+
+def load_dataset_for_b2(
+    name: str,
+    *,
+    n_samples: int,
+    seed: int = 42,
+    allow_hf: bool = True,
+    allow_synthetic: bool = True,
+) -> list[PromptItem]:
+    """Load up to ``n_samples`` prompts for the named dataset.
+
+    The loader degrades gracefully: local jsonl → HF datasets →
+    synthetic. ``allow_hf=False`` forces the local/synthetic path
+    (useful for offline CI). ``allow_synthetic=False`` forbids the
+    synthetic fallback (useful for real benchmark runs so nobody
+    accidentally "runs gsm8k" on 3 fake prompts).
+    """
+    if name not in _SUPPORTED:
+        raise ValueError(
+            f"dataset {name!r} not supported; pick from {_SUPPORTED}"
+        )
+
+    rng = random.Random(seed)
+
+    rows: list[dict] | None = _load_local_jsonl(name)
+    if rows is None and allow_hf:
+        rows = _load_hf(name)
+
+    if rows is not None:
+        rng.shuffle(rows)
+        rows = rows[:n_samples]
+        return [_row_to_item(name, i, r) for i, r in enumerate(rows)]
+
+    if not allow_synthetic:
+        raise FileNotFoundError(
+            f"no local jsonl for {name!r} and synthetic fallback disabled. "
+            f"Expected file at {_DATA_DIR / (name + '.jsonl')}, or install "
+            "the `datasets` library and set allow_hf=True."
+        )
+
+    fixture = list(_SYNTHETIC_FIXTURES[name])
+    rng.shuffle(fixture)
+    return fixture[:n_samples] if n_samples < len(fixture) else fixture
+
+
+def _row_to_item(name: str, i: int, row: dict) -> PromptItem:
+    if name == "gsm8k":
+        return PromptItem(
+            dataset="gsm8k",
+            prompt_id=f"gsm8k-{i}",
+            prompt=row.get("question", ""),
+            ground_truth=row.get("answer"),
+        )
+    if name == "humaneval":
+        return PromptItem(
+            dataset="humaneval",
+            prompt_id=str(row.get("task_id", f"humaneval-{i}")),
+            prompt=row.get("prompt", ""),
+            ground_truth=row.get("canonical_solution"),
+        )
+    raise ValueError(name)
+
+
+__all__ = [
+    "PromptItem",
+    "load_dataset_for_b2",
+]
diff --git a/benchmarks/b2_dflash_kakeya/engines.py b/benchmarks/b2_dflash_kakeya/engines.py
new file mode 100644
index 00000000..b5ef8242
--- /dev/null
+++ b/benchmarks/b2_dflash_kakeya/engines.py
@@ -0,0 +1,210 @@
+"""Engine abstractions for the M5 benchmark.
+
+Real runs use ``RealEngine`` which delegates to the B2 MLXEngine
+(i.e. DFlash path + KakeyaLatticeMLXCache). CI / dry-runs use
+``MockEngine`` which returns deterministic fake acceptance-length
+traces so the runner + metrics pipeline is exercised without any
+MLX / dflash / Metal dependency.
+
+Both engines expose the same ``generate(prompt, channel, max_tokens)``
+method, returning ``EngineResult``.
+"""
+from __future__ import annotations
+
+import logging
+import random
+import time
+from dataclasses import dataclass, field
+from typing import Protocol
+
+log = logging.getLogger("benchmarks.b2_dflash_kakeya.engines")
+
+
+@dataclass
+class EngineResult:
+    response: str
+    acceptance_lengths: list[int]
+    generation_tps: float | None
+    first_token_latency_s: float | None
+    total_tokens: int
+    codec_fired: int | None = None
+    extra: dict = field(default_factory=dict)
+
+
+class Engine(Protocol):
+    def generate(
+        self,
+        *,
+        prompt: str,
+        channel: str,
+        max_tokens: int,
+    ) -> EngineResult:
+        ...
+
+    def close(self) -> None:
+        ...
+
+
+# ---------------------------------------------------------------------------
+# MockEngine
+# ---------------------------------------------------------------------------
+
+
+class MockEngine:
+    """Deterministic fake engine for CI + dry-run.
+
+    Simulates the relationship we expect from the real stack:
+
+    - bf16 baseline: acceptance length ~15 (DFlash's Qwen3-8B number)
+    - Kakeya Q=38: ~14 (small hit)
+    - Kakeya Q=10: ~12 (moderate hit)
+    - Kakeya Q=4:  ~8  (large hit)
+
+    Values are drawn from a small Gaussian with those means so the
+    metrics pipeline sees realistic distribution shapes.
+    """
+
+    _ACCEPT_MEAN_BY_CHANNEL = {
+        "bf16":   15.0,
+        "e8-q38": 14.0,
+        "e8-q10": 12.0,
+        "e8-q4":   8.0,
+    }
+    _TPS_MEAN_BY_CHANNEL = {
+        "bf16":   200.0,
+        "e8-q38": 195.0,
+        "e8-q10": 175.0,
+        "e8-q4":  120.0,
+    }
+
+    def __init__(self, seed: int = 0) -> None:
+        self._rng = random.Random(seed)
+
+    def generate(
+        self,
+        *,
+        prompt: str,
+        channel: str,
+        max_tokens: int,
+    ) -> EngineResult:
+        al_mean = self._ACCEPT_MEAN_BY_CHANNEL.get(channel, 10.0)
+        tps_mean = self._TPS_MEAN_BY_CHANNEL.get(channel, 150.0)
+
+        # Decide how many verify steps a max_tokens budget produces.
+        n_steps = max(1, int(max_tokens / max(al_mean, 1.0)))
+        acc = [
+            max(1, int(self._rng.gauss(al_mean, 1.5)))
+            for _ in range(n_steps)
+        ]
+        total_tokens = sum(acc)
+        tps = max(10.0, self._rng.gauss(tps_mean, tps_mean * 0.05))
+        ttft = max(0.02, self._rng.gauss(0.12, 0.02))
+
+        return EngineResult(
+            response=f"<mock-response-for:{prompt[:30]!r}>",
+            acceptance_lengths=acc,
+            generation_tps=tps,
+            first_token_latency_s=ttft,
+            total_tokens=total_tokens,
+            codec_fired=0 if channel == "bf16" else n_steps * 30,
+            extra={"channel": channel, "backend": "mock"},
+        )
+
+    def close(self) -> None:
+        pass
+
+
+# ---------------------------------------------------------------------------
+# RealEngine (Apple Silicon only; thin wrapper over B2 MLXEngine)
+# ---------------------------------------------------------------------------
+
+
+class RealEngine:
+    """Adapter over ``kakeya_sidecar_mlx.MLXEngine``.
+
+    Lazily imports everything so ``import engines`` on Linux CI works.
+    """
+
+    def __init__(
+        self,
+        *,
+        target_model: str,
+        enable_dflash: bool = True,
+        trust_remote_code: bool = True,
+    ) -> None:
+        from kakeya_sidecar_mlx.engine_mlx import MLXEngine, MLXEngineConfig
+
+        cfg = MLXEngineConfig(
+            enable_dflash=enable_dflash,
+            trust_remote_code=trust_remote_code,
+        )
+        self._engine = MLXEngine(cfg)
+        self._target = target_model
+
+    def generate(
+        self,
+        *,
+        prompt: str,
+        channel: str,
+        max_tokens: int,
+    ) -> EngineResult:
+        # channel maps ("bf16", "e8-q38", "e8-q10", ...) → B2 channel id
+        channel_id, override = self._channel_to_id(channel)
+
+        t0 = time.time()
+        response, stats = self._engine.chat(
+            channel_id,
+            [{"role": "user", "content": prompt}],
+            max_tokens=max_tokens,
+            temperature=0.0,
+            override=override,
+        )
+        wall = time.time() - t0
+
+        # MLXEngine stats carry 'acceptance_length_mean' only, not the
+        # full per-step list; for the benchmark we want distribution.
+        # The B2 engine will be extended in a later PR to expose the
+        # per-step list; for now we synthesize a single-element list
+        # so the aggregation still works.
+        al_mean = stats.get("acceptance_length_mean")
+        acc_list = [int(al_mean)] if al_mean else []
+
+        total_tokens = stats.get("generated_chars", 0) // 4   # rough proxy
+        tps = (total_tokens / wall) if wall > 0 else None
+
+        return EngineResult(
+            response=response,
+            acceptance_lengths=acc_list,
+            generation_tps=tps,
+            first_token_latency_s=None,
+            total_tokens=total_tokens,
+            codec_fired=None,
+            extra={"channel": channel, "backend": "mlx+dflash+kakeya"},
+        )
+
+    @staticmethod
+    def _channel_to_id(channel: str) -> tuple[str, dict | None]:
+        """Map benchmark channel name to MLXEngine channel id + override.
+
+        ``channel="bf16"`` maps to the target's Q=38 channel with a
+        per-request override that disables the codec path (boundary
+        covers every layer). In practice we ship a "bypass" dedicated
+        channel in a follow-up; for now the override is a clear
+        signal.
+        """
+        # The benchmark assumes Qwen3-8B as target model id.
+        if channel == "bf16":
+            return "qwen3-8b@e8-q38", {"q_range": 38, "boundary": 99999}
+        if channel == "e8-q38":
+            return "qwen3-8b@e8-q38", None
+        if channel == "e8-q10":
+            return "qwen3-8b@e8-q10", None
+        if channel == "e8-q4":
+            return "qwen3-8b@e8-q4", None
+        raise ValueError(f"unknown channel {channel!r}")
+
+    def close(self) -> None:
+        pass
+
+
+__all__ = ["Engine", "EngineResult", "MockEngine", "RealEngine"]
diff --git a/benchmarks/b2_dflash_kakeya/metrics.py b/benchmarks/b2_dflash_kakeya/metrics.py
new file mode 100644
index 00000000..488879aa
--- /dev/null
+++ b/benchmarks/b2_dflash_kakeya/metrics.py
@@ -0,0 +1,99 @@
+"""Metric aggregation helpers.
+
+All stats are pure Python + standard library — no numpy dependency
+so this module loads on any CI.
+"""
+from __future__ import annotations
+
+import math
+from typing import Iterable, Sequence
+
+
+def percentile(xs: Sequence[float], p: float) -> float | None:
+    """Linear-interpolation percentile matching numpy's default.
+
+    ``p`` in [0, 100]. Returns ``None`` for empty input to propagate
+    missing-data semantics to the schema.
+    """
+    if not xs:
+        return None
+    if p < 0 or p > 100:
+        raise ValueError(f"percentile p must be in [0, 100], got {p}")
+    sorted_xs = sorted(xs)
+    if len(sorted_xs) == 1:
+        return float(sorted_xs[0])
+    rank = (p / 100.0) * (len(sorted_xs) - 1)
+    lo = int(math.floor(rank))
+    hi = int(math.ceil(rank))
+    if lo == hi:
+        return float(sorted_xs[lo])
+    frac = rank - lo
+    return float(sorted_xs[lo] + (sorted_xs[hi] - sorted_xs[lo]) * frac)
+
+
+def mean(xs: Iterable[float]) -> float | None:
+    lst = list(xs)
+    if not lst:
+        return None
+    return sum(lst) / len(lst)
+
+
+def summarise_accept_lengths(
+    sample_records: Iterable[object],
+) -> dict[str, float | None]:
+    """Flatten per-step acceptance lengths across samples and summarise."""
+    flat: list[float] = []
+    for s in sample_records:
+        # Support either a SampleRecord or a plain dict (round-tripped).
+        al = getattr(s, "acceptance_lengths", None)
+        if al is None and isinstance(s, dict):
+            al = s.get("acceptance_lengths")
+        if al:
+            flat.extend(float(x) for x in al)
+
+    return {
+        "mean": mean(flat),
+        "p50": percentile(flat, 50.0),
+        "p95": percentile(flat, 95.0),
+    }
+
+
+def gsm8k_correct(response: str, expected: str) -> bool:
+    """Simple gsm8k correctness proxy.
+
+    GSM8K ground truth format ends with ``#### <answer>``. We strip
+    that, then check whether the model's response contains the exact
+    numeric answer as a substring. Deliberately loose — this is a
+    proxy, not a full grader.
+    """
+    if "####" in expected:
+        expected_answer = expected.rsplit("####", 1)[-1].strip()
+    else:
+        expected_answer = expected.strip()
+    if not expected_answer:
+        return False
+    return expected_answer in response
+
+
+def humaneval_correct(response: str, test_snippet: str) -> bool:
+    """HumanEval correctness proxy via substring match on the solution body.
+
+    Real HumanEval grading runs the generated code against the
+    reference tests in a sandbox; that's deliberately out-of-scope for
+    a sidecar benchmark. We approximate by checking that the response
+    contains a ``def`` signature and the ``return`` keyword — a very
+    loose gate that at least separates "emitted code" from "emitted
+    refusal". Upgrade path: wire the official execution harness in a
+    follow-up PR.
+    """
+    _ = test_snippet  # unused; kept for API symmetry with gsm8k_correct
+    return ("def " in response) and ("return" in response)
+
+
+__all__ = [
+    "percentile",
+    "mean",
+    "summarise_accept_lengths",
+    "gsm8k_correct",
+    "humaneval_correct",
+]
diff --git a/benchmarks/b2_dflash_kakeya/runner.py b/benchmarks/b2_dflash_kakeya/runner.py
new file mode 100644
index 00000000..eb438820
--- /dev/null
+++ b/benchmarks/b2_dflash_kakeya/runner.py
@@ -0,0 +1,282 @@
+"""Top-level benchmark runner.
+
+Usage:
+
+    python -m benchmarks.b2_dflash_kakeya.runner \\
+        --target Qwen/Qwen3-8B \\
+        --draft  z-lab/Qwen3-8B-DFlash-b16 \\
+        --datasets gsm8k humaneval \\
+        --n-samples 32 \\
+        --channels bf16 e8-q38 e8-q10 e8-q4 \\
+        --out-dir reports/b2_release
+
+    python -m benchmarks.b2_dflash_kakeya.runner --dry-run
+        # CI-friendly: uses MockEngine + synthetic dataset fallback
+
+The runner is deliberately engine-agnostic — all MLX / dflash /
+mlx-lm imports are behind the ``RealEngine`` constructor in
+``engines.py``. Linux CI exercises the whole runner via
+``--mock-engine``.
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import logging
+import sys
+from pathlib import Path
+from typing import Any
+
+from . import __version__ as RUNNER_VERSION
+from .datasets import PromptItem, load_dataset_for_b2
+from .engines import Engine, EngineResult, MockEngine
+from .metrics import (
+    gsm8k_correct,
+    humaneval_correct,
+    mean,
+    percentile,
+)
+from .schema import (
+    SCHEMA_VERSION,
+    AggregateMetrics,
+    BenchmarkResult,
+    HardwareInfo,
+    SampleRecord,
+    SoftwareInfo,
+)
+
+
+# ---------------------------------------------------------------------------
+# Engine factory
+# ---------------------------------------------------------------------------
+
+
+def _build_engine(args) -> Engine:
+    if args.dry_run or args.mock_engine:
+        return MockEngine(seed=args.seed)
+    from .engines import RealEngine
+    return RealEngine(
+        target_model=args.target,
+        enable_dflash=not args.no_dflash,
+    )
+
+
+# ---------------------------------------------------------------------------
+# Run one (dataset, channel) combination
+# ---------------------------------------------------------------------------
+
+
+def run_combination(
+    *,
+    engine: Engine,
+    dataset: str,
+    channel: str,
+    prompts: list[PromptItem],
+    max_tokens: int,
+    target_model: str,
+    draft_model: str | None,
+) -> BenchmarkResult:
+    sample_records: list[SampleRecord] = []
+    n_correct = 0
+    any_correctness_scored = False
+
+    for item in prompts:
+        result: EngineResult = engine.generate(
+            prompt=item.prompt,
+            channel=channel,
+            max_tokens=max_tokens,
+        )
+
+        correct: bool | None = None
+        if item.ground_truth is not None:
+            if dataset == "gsm8k":
+                correct = gsm8k_correct(result.response, item.ground_truth)
+            elif dataset == "humaneval":
+                correct = humaneval_correct(result.response, item.ground_truth)
+            if correct is not None:
+                any_correctness_scored = True
+                if correct:
+                    n_correct += 1
+
+        sample_records.append(SampleRecord(
+            prompt_id=item.prompt_id,
+            prompt=item.prompt,
+            response=result.response,
+            acceptance_lengths=list(result.acceptance_lengths),
+            generation_tps=result.generation_tps,
+            first_token_latency_s=result.first_token_latency_s,
+            total_tokens=result.total_tokens,
+            codec_fired=result.codec_fired,
+            correctness_proxy=correct,
+        ))
+
+    flat_al: list[float] = []
+    for s in sample_records:
+        flat_al.extend(float(x) for x in s.acceptance_lengths)
+
+    agg = AggregateMetrics(
+        acceptance_length_mean=mean(flat_al),
+        acceptance_length_p50=percentile(flat_al, 50.0),
+        acceptance_length_p95=percentile(flat_al, 95.0),
+        generation_tps_mean=mean(
+            [s.generation_tps for s in sample_records if s.generation_tps]
+        ),
+        first_token_latency_s=mean(
+            [s.first_token_latency_s for s in sample_records
+             if s.first_token_latency_s is not None]
+        ),
+        total_tokens_sum=sum(s.total_tokens for s in sample_records),
+        codec_fired_mean=mean(
+            [float(s.codec_fired) for s in sample_records
+             if s.codec_fired is not None]
+        ),
+        n_correct=n_correct if any_correctness_scored else None,
+        n_samples=len(sample_records),
+    )
+
+    return BenchmarkResult(
+        schema_version=SCHEMA_VERSION,
+        target_model=target_model,
+        draft_model=draft_model,
+        dataset=dataset,
+        channel=channel,
+        n_samples=len(sample_records),
+        samples=sample_records,
+        aggregate=agg,
+        hardware=detect_hardware(),
+        software=detect_software(),
+    )
+
+
+# ---------------------------------------------------------------------------
+# Env detection (safe to call on any OS; returns "unknown" fields when lib
+# isn't installed).
+# ---------------------------------------------------------------------------
+
+
+def detect_hardware() -> HardwareInfo:
+    import platform
+    chip = platform.processor() or "unknown"
+    device = "unknown"
+    try:
+        import mlx.core as mx  # type: ignore
+        if mx.metal.is_available():
+            device = "mlx:metal"
+        else:
+            device = "mlx:cpu"
+    except ImportError:
+        pass
+    return HardwareInfo(device=device, chip=chip)
+
+
+def detect_software() -> SoftwareInfo:
+    def _ver(mod: str) -> str | None:
+        try:
+            m = __import__(mod)
+            return getattr(m, "__version__", None)
+        except ImportError:
+            return None
+    return SoftwareInfo(
+        mlx=_ver("mlx"),
+        mlx_lm=_ver("mlx_lm"),
+        dflash=_ver("dflash"),
+        kakeyalattice_mlx=_ver("kakeyalattice_mlx"),
+        kakeya_sidecar_mlx=_ver("kakeya_sidecar_mlx"),
+    )
+
+
+# ---------------------------------------------------------------------------
+# CLI
+# ---------------------------------------------------------------------------
+
+
+def build_parser() -> argparse.ArgumentParser:
+    p = argparse.ArgumentParser(
+        prog="b2-dflash-kakeya-benchmark",
+        description=(
+            "Acceptance-rate benchmark for DFlash speculative decoding "
+            "combined with KakeyaLattice E8 KV-cache compression "
+            "(B2 / M5)."
+        ),
+    )
+    p.add_argument("--target", default="Qwen/Qwen3-8B",
+                   help="HuggingFace / mlx-community target model id")
+    p.add_argument("--draft", default="z-lab/Qwen3-8B-DFlash-b16",
+                   help="DFlash draft model id (non-thinking, b16).")
+    p.add_argument("--datasets", nargs="+",
+                   default=["gsm8k", "humaneval"],
+                   choices=["gsm8k", "humaneval"])
+    p.add_argument("--channels", nargs="+",
+                   default=["bf16", "e8-q38", "e8-q10", "e8-q4"],
+                   help="KV cache channels to evaluate.")
+    p.add_argument("--n-samples", type=int, default=32)
+    p.add_argument("--max-tokens", type=int, default=512)
+    p.add_argument("--seed", type=int, default=42)
+    p.add_argument("--no-dflash", action="store_true",
+                   help="Disable DFlash (debug only).")
+    p.add_argument("--dry-run", action="store_true",
+                   help="Use MockEngine + synthetic datasets. "
+                        "No MLX / dflash / HF required.")
+    p.add_argument("--mock-engine", action="store_true",
+                   help="Force MockEngine but still load real datasets "
+                        "(via local jsonl or HF datasets).")
+    p.add_argument("--allow-synthetic", action="store_true",
+                   help="Permit synthetic dataset fallback even outside "
+                        "--dry-run. Off by default to prevent publishing "
+                        "numbers from 3-prompt fixtures.")
+    p.add_argument("--out-dir", default="reports/b2_release",
+                   help="Where to write per-combination JSON.")
+    p.add_argument("--log-level", default="INFO",
+                   choices=["DEBUG", "INFO", "WARNING", "ERROR"])
+    return p
+
+
+def main(argv: list[str] | None = None) -> int:
+    args = build_parser().parse_args(argv)
+    logging.basicConfig(
+        level=getattr(logging, args.log_level),
+        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
+    )
+    log = logging.getLogger("b2-dflash-kakeya")
+    log.info("runner version=%s schema=%s", RUNNER_VERSION, SCHEMA_VERSION)
+
+    out_dir = Path(args.out_dir)
+    out_dir.mkdir(parents=True, exist_ok=True)
+
+    engine = _build_engine(args)
+
+    try:
+        for dataset in args.datasets:
+            prompts = load_dataset_for_b2(
+                dataset,
+                n_samples=args.n_samples,
+                seed=args.seed,
+                allow_hf=not args.dry_run,
+                allow_synthetic=args.dry_run or args.allow_synthetic,
+            )
+            log.info("dataset=%s n_prompts=%d", dataset, len(prompts))
+
+            for channel in args.channels:
+                log.info("running channel=%s", channel)
+                result = run_combination(
+                    engine=engine,
+                    dataset=dataset,
+                    channel=channel,
+                    prompts=prompts,
+                    max_tokens=args.max_tokens,
+                    target_model=args.target,
+                    draft_model=None if args.no_dflash else args.draft,
+                )
+                fname = f"b2_dflash_kakeya_{dataset}_{channel}.json"
+                out_path = out_dir / fname
+                with out_path.open("w") as f:
+                    json.dump(result.to_dict(), f, indent=2, default=str)
+                log.info("wrote %s  (accept_mean=%s)",
+                         out_path, result.aggregate.acceptance_length_mean)
+    finally:
+        engine.close()
+    return 0
+
+
+if __name__ == "__main__":  # pragma: no cover
+    sys.exit(main())
diff --git a/benchmarks/b2_dflash_kakeya/schema.py b/benchmarks/b2_dflash_kakeya/schema.py
new file mode 100644
index 00000000..13a2c7ba
--- /dev/null
+++ b/benchmarks/b2_dflash_kakeya/schema.py
@@ -0,0 +1,82 @@
+"""Output JSON schema for b2-dflash-kakeya benchmark runs.
+
+We pin the schema version here so downstream tooling (reports site,
+comparison scripts) can detect breaking changes without guessing.
+"""
+from __future__ import annotations
+
+from dataclasses import asdict, dataclass, field
+from typing import Any
+
+
+SCHEMA_VERSION = "b2-dflash-kakeya-v1"
+
+
+@dataclass
+class HardwareInfo:
+    device: str = "unknown"
+    chip: str = "unknown"
+    total_memory_gb: float | None = None
+
+
+@dataclass
+class SoftwareInfo:
+    mlx: str | None = None
+    mlx_lm: str | None = None
+    dflash: str | None = None
+    kakeyalattice_mlx: str | None = None
+    kakeya_sidecar_mlx: str | None = None
+
+
+@dataclass
+class SampleRecord:
+    prompt_id: str
+    prompt: str
+    response: str
+    acceptance_lengths: list[int]
+    generation_tps: float | None
+    first_token_latency_s: float | None
+    total_tokens: int
+    codec_fired: int | None = None
+    correctness_proxy: bool | None = None
+
+
+@dataclass
+class AggregateMetrics:
+    acceptance_length_mean: float | None
+    acceptance_length_p50: float | None
+    acceptance_length_p95: float | None
+    generation_tps_mean: float | None
+    first_token_latency_s: float | None
+    total_tokens_sum: int
+    codec_fired_mean: float | None
+    n_correct: int | None
+    n_samples: int
+
+
+@dataclass
+class BenchmarkResult:
+    schema_version: str
+    target_model: str
+    draft_model: str | None
+    dataset: str
+    channel: str                 # e.g. "bf16" or "e8-q10"
+    n_samples: int
+    samples: list[SampleRecord]
+    aggregate: AggregateMetrics
+    hardware: HardwareInfo = field(default_factory=HardwareInfo)
+    software: SoftwareInfo = field(default_factory=SoftwareInfo)
+
+    def to_dict(self) -> dict[str, Any]:
+        d = asdict(self)
+        return d
+
+
+__all__ = [
+    "SCHEMA_VERSION",
+    "HardwareInfo",
+    "SoftwareInfo",
+    "SampleRecord",
+    "AggregateMetrics",
+    "BenchmarkResult",
+]
diff --git a/benchmarks/b2_dflash_kakeya/tests/__init__.py b/benchmarks/b2_dflash_kakeya/tests/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/benchmarks/b2_dflash_kakeya/tests/test_datasets.py b/benchmarks/b2_dflash_kakeya/tests/test_datasets.py
new file mode 100644
index 00000000..50d6b130
--- /dev/null
+++ b/benchmarks/b2_dflash_kakeya/tests/test_datasets.py
@@ -0,0 +1,93 @@
+"""Dataset loader tests.
+
+Two fallback paths verified without touching the network:
+- synthetic fixture: always available (3 items per dataset)
+- local jsonl override: supplied via temporary directory
+"""
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import pytest
+
+from benchmarks.b2_dflash_kakeya import datasets as ds_mod
+from benchmarks.b2_dflash_kakeya.datasets import (
+    PromptItem,
+    load_dataset_for_b2,
+)
+
+
+def test_synthetic_fixture_for_gsm8k() -> None:
+    items = load_dataset_for_b2(
+        "gsm8k", n_samples=3, allow_hf=False, allow_synthetic=True,
+    )
+    assert len(items) == 3
+    for it in items:
+        assert isinstance(it, PromptItem)
+        assert it.dataset == "synthetic"
+        assert it.ground_truth is not None
+
+
+def test_synthetic_forbidden_raises() -> None:
+    with pytest.raises(FileNotFoundError):
+        load_dataset_for_b2(
+            "gsm8k", n_samples=3, allow_hf=False, allow_synthetic=False,
+        )
+
+
+def test_unknown_dataset_rejected() -> None:
+    with pytest.raises(ValueError):
+        load_dataset_for_b2("winograd", n_samples=1)
+
+
+def test_humanval_synthetic_has_code_scaffolding() -> None:
+    items = load_dataset_for_b2(
+        "humaneval", n_samples=3, allow_hf=False, allow_synthetic=True,
+    )
+    for it in items:
+        assert "def " in it.prompt
+
+
+def test_local_jsonl_preferred_over_synthetic(tmp_path, monkeypatch) -> None:
+    """If a local jsonl exists, it's used instead of the synthetic
+    fixture (and instead of HF)."""
+    data_dir = tmp_path / "data"
+    data_dir.mkdir()
+    jsonl = data_dir / "gsm8k.jsonl"
+    with jsonl.open("w") as f:
+        f.write(json.dumps({"question": "Q1", "answer": "A1"}) + "\n")
+        f.write(json.dumps({"question": "Q2", "answer": "A2"}) + "\n")
+
+    monkeypatch.setattr(ds_mod, "_DATA_DIR", data_dir)
+
+    items = load_dataset_for_b2("gsm8k", n_samples=10, allow_hf=False)
+    assert len(items) == 2
+    assert all(it.dataset == "gsm8k" for it in items)
+    assert {it.ground_truth for it in items} == {"A1", "A2"}
+
+
+def test_n_samples_truncation(tmp_path, monkeypatch) -> None:
+    data_dir = tmp_path / "data"
+    data_dir.mkdir()
+    jsonl = data_dir / "humaneval.jsonl"
+    with jsonl.open("w") as f:
+        for i in range(10):
+            f.write(json.dumps({
+                "task_id": f"t/{i}",
+                "prompt": f"def f{i}(): return",
+                "canonical_solution": f"return {i}",
+            }) + "\n")
+    monkeypatch.setattr(ds_mod, "_DATA_DIR", data_dir)
+
+    items = load_dataset_for_b2("humaneval", n_samples=4, allow_hf=False)
+    assert len(items) == 4
+
+
+def test_seed_determinism_for_synthetic() -> None:
+    a = load_dataset_for_b2("gsm8k", n_samples=3, seed=1,
+                            allow_hf=False, allow_synthetic=True)
+    b = load_dataset_for_b2("gsm8k", n_samples=3, seed=1,
+                            allow_hf=False, allow_synthetic=True)
+    assert [(it.prompt_id, it.prompt) for it in a] == \
+           [(it.prompt_id, it.prompt) for it in b]
diff --git a/benchmarks/b2_dflash_kakeya/tests/test_metrics.py b/benchmarks/b2_dflash_kakeya/tests/test_metrics.py
new file mode 100644
index 00000000..195d8523
--- /dev/null
+++ b/benchmarks/b2_dflash_kakeya/tests/test_metrics.py
@@ -0,0 +1,91 @@
+"""Unit tests for ``metrics.py`` — pure Python, no deps beyond stdlib."""
+from __future__ import annotations
+
+import pytest
+
+from benchmarks.b2_dflash_kakeya.metrics import (
+    gsm8k_correct,
+    humaneval_correct,
+    mean,
+    percentile,
+    summarise_accept_lengths,
+)
+
+
+def test_mean_empty_is_none() -> None:
+    assert mean([]) is None
+
+
+def test_mean_basic() -> None:
+    assert mean([1, 2, 3, 4]) == 2.5
+
+
+def test_percentile_empty_is_none() -> None:
+    assert percentile([], 50) is None
+
+
+def test_percentile_single_element() -> None:
+    assert percentile([7.0], 50) == 7.0
+    assert percentile([7.0], 0) == 7.0
+    assert percentile([7.0], 100) == 7.0
+
+
+def test_percentile_matches_numpy_default() -> None:
+    xs = [10, 20, 30, 40, 50]
+    assert percentile(xs, 0) == 10
+    assert percentile(xs, 50) == 30
+    assert percentile(xs, 100) == 50
+    # Linear interpolation between sorted[1]=20 and sorted[2]=30 at rank 1.5
+    assert percentile(xs, 37.5) == pytest.approx(25.0)
+
+
+def test_percentile_rejects_bad_p() -> None:
+    with pytest.raises(ValueError):
+        percentile([1, 2, 3], -1)
+    with pytest.raises(ValueError):
+        percentile([1, 2, 3], 101)
+
+
+def test_summarise_handles_empty_records() -> None:
+    out = summarise_accept_lengths([])
+    assert out == {"mean": None, "p50": None, "p95": None}
+
+
+def test_summarise_flattens_per_sample_lists() -> None:
+    class _R:
+        def __init__(self, xs):
+            self.acceptance_lengths = xs
+
+    records = [_R([10, 12]), _R([14, 16]), _R([18])]
+    out = summarise_accept_lengths(records)
+    assert out["mean"] == pytest.approx(14.0)
+
+
+def test_summarise_accepts_dicts() -> None:
+    out = summarise_accept_lengths([
+        {"acceptance_lengths": [1, 2, 3]},
+        {"acceptance_lengths": [4, 5]},
+    ])
+    assert out["mean"] == pytest.approx(3.0)
+
+
+# ---------------------------------------------------------------------------
+# correctness proxies
+# ---------------------------------------------------------------------------
+
+
+def test_gsm8k_correct_extracts_after_hash() -> None:
+    expected = "Jane has three apples. #### 3"
+    assert gsm8k_correct("The answer is 3.", expected) is True
+    assert gsm8k_correct("The answer is 42.", expected) is False
+
+
+def test_gsm8k_correct_no_hash_prefix() -> None:
+    assert gsm8k_correct("Answer: 7", "7") is True
+    assert gsm8k_correct("Answer: 7", "") is False
+
+
+def test_humaneval_correct_requires_def_and_return() -> None:
+    assert humaneval_correct("def f(x):\n    return x + 1", "") is True
+    assert humaneval_correct("print('refused')", "") is False
+    assert humaneval_correct("def f(x):\n    print(x)", "") is False
diff --git a/benchmarks/b2_dflash_kakeya/tests/test_runner_mock.py b/benchmarks/b2_dflash_kakeya/tests/test_runner_mock.py
new file mode 100644
index 00000000..e1255f91
--- /dev/null
+++ b/benchmarks/b2_dflash_kakeya/tests/test_runner_mock.py
@@ -0,0 +1,113 @@
+"""End-to-end runner test using MockEngine + synthetic datasets.
+
+Verifies the full loop: dataset loading → engine.generate per prompt
+→ metric aggregation → JSON serialisation → schema round-trip.
+
+Runs in <0.5s on Linux CI, no MLX / dflash / HF.
+"""
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import pytest
+
+from benchmarks.b2_dflash_kakeya.engines import MockEngine
+from benchmarks.b2_dflash_kakeya.runner import (
+    build_parser,
+    main,
+    run_combination,
+)
+from benchmarks.b2_dflash_kakeya.datasets import load_dataset_for_b2
+from benchmarks.b2_dflash_kakeya.schema import SCHEMA_VERSION
+
+
+def test_parser_defaults() -> None:
+    args = build_parser().parse_args([])
+    assert args.target == "Qwen/Qwen3-8B"
+    assert "z-lab/Qwen3-8B-DFlash-b16" in args.draft
+    assert set(args.datasets) == {"gsm8k", "humaneval"}
+    assert args.channels == ["bf16", "e8-q38", "e8-q10", "e8-q4"]
+
+
+def test_run_combination_accept_length_ordering() -> None:
+    """MockEngine encodes the theoretical acceptance-length ordering
+    bf16 > Q=38 > Q=10 > Q=4. The benchmark runner aggregation must
+    preserve it."""
+    engine = MockEngine(seed=0)
+    prompts = load_dataset_for_b2(
+        "gsm8k", n_samples=3, allow_hf=False, allow_synthetic=True,
+    )
+
+    accept_means: dict[str, float] = {}
+    for channel in ("bf16", "e8-q38", "e8-q10", "e8-q4"):
+        res = run_combination(
+            engine=engine, dataset="gsm8k", channel=channel,
+            prompts=prompts, max_tokens=128,
+            target_model="Qwen/Qwen3-8B",
+            draft_model="z-lab/Qwen3-8B-DFlash-b16",
+        )
+        accept_means[channel] = res.aggregate.acceptance_length_mean or 0.0
+
+    assert accept_means["bf16"]   > accept_means["e8-q38"]
+    assert accept_means["e8-q38"] > accept_means["e8-q10"]
+    assert accept_means["e8-q10"] > accept_means["e8-q4"]
+
+
+def test_main_dry_run_writes_json_per_combination(tmp_path) -> None:
+    out_dir = tmp_path / "reports"
+    code = main([
+        "--dry-run",
+        "--n-samples", "3",
+        "--max-tokens", "64",
+        "--out-dir", str(out_dir),
+        "--datasets", "gsm8k",
+        "--channels", "bf16", "e8-q10",
+    ])
+    assert code == 0
+
+    files = sorted(p.name for p in out_dir.iterdir())
+    assert files == [
+        "b2_dflash_kakeya_gsm8k_bf16.json",
+        "b2_dflash_kakeya_gsm8k_e8-q10.json",
+    ]
+
+    for name in files:
+        with (out_dir / name).open() as f:
+            obj = json.load(f)
+        assert obj["schema_version"] == SCHEMA_VERSION
+        assert obj["dataset"] == "gsm8k"
+        assert obj["n_samples"] == 3
+        assert obj["aggregate"]["acceptance_length_mean"] is not None
+        assert obj["aggregate"]["n_samples"] == 3
+        assert len(obj["samples"]) == 3
+
+
+def test_main_dry_run_humaneval_correctness_populated(tmp_path) -> None:
+    out_dir = tmp_path / "reports"
+    main([
+        "--dry-run",
+        "--n-samples", "3",
+        "--max-tokens", "32",
+        "--out-dir", str(out_dir),
+        "--datasets", "humaneval",
+        "--channels", "bf16",
+    ])
+    with (out_dir / "b2_dflash_kakeya_humaneval_bf16.json").open() as f:
+        obj = json.load(f)
+    # Mock responses don't actually contain code, so correctness proxy
+    # should be False for all; n_correct populated = 0.
+    # The important thing is that the field exists and isn't None for
+    # a dataset with ground truth.
+    assert obj["aggregate"]["n_correct"] is not None
+    for s in obj["samples"]:
+        assert s["correctness_proxy"] is not None
+
+
+def test_channel_not_in_mock_engine_gets_default_means() -> None:
+    """A channel name the MockEngine doesn't know should not crash —
+    it falls back to 10/150 defaults."""
+    engine = MockEngine(seed=1)
+    res = engine.generate(prompt="hi", channel="custom-42", max_tokens=64)
+    assert res.acceptance_lengths
+    assert all(x >= 1 for x in res.acceptance_lengths)

From bac07b944ca5557869dfbe76913eba9bbf807a8a Mon Sep 17 00:00:00 2001
From: Cursor Agent <cursoragent@cursor.com>
Date: Thu, 30 Apr 2026 05:07:29 +0000
Subject: [PATCH 2/3] docs(b2/M5): reports/b2_release/ placeholder for real
 benchmark output

The M5 runner writes one JSON per (dataset, channel) combination
into reports/b2_release/. The real run requires Apple Silicon +
MLX + dflash + gsm8k/humaneval data; until then this directory
only documents the expected layout.

Expected artefacts after a real run:
  b2_dflash_kakeya_{gsm8k,humaneval}_{bf16,e8-q38,e8-q10,e8-q4}.json
  FINDINGS.md (narrative + aggregate tables)

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
---
 reports/b2_release/README.md | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)
 create mode 100644 reports/b2_release/README.md

diff --git a/reports/b2_release/README.md b/reports/b2_release/README.md
new file mode 100644
index 00000000..40c6ca2c
--- /dev/null
+++ b/reports/b2_release/README.md
@@ -0,0 +1,21 @@
+# B2 Release — acceptance-rate benchmark outputs
+
+空目录占位。真实 B2 benchmark 输出（需要 Apple Silicon + MLX + DFlash）
+跑出来后，8 个 JSON 文件会落在这里:
+
+```
+reports/b2_release/
+├── b2_dflash_kakeya_gsm8k_bf16.json
+├── b2_dflash_kakeya_gsm8k_e8-q38.json
+├── b2_dflash_kakeya_gsm8k_e8-q10.json
+├── b2_dflash_kakeya_gsm8k_e8-q4.json
+├── b2_dflash_kakeya_humaneval_bf16.json
+├── b2_dflash_kakeya_humaneval_e8-q38.json
+├── b2_dflash_kakeya_humaneval_e8-q10.json
+├── b2_dflash_kakeya_humaneval_e8-q4.json
+└── FINDINGS.md                              (narrative + aggregate tables)
+```
+
+跑法见 `benchmarks/b2_dflash_kakeya/README.md`。
+
+schema 版本 + 每条 JSON 结构见 `benchmarks/b2_dflash_kakeya/schema.py`。

From decd8468d944a0b1f51c21c14f57efe03fbab047 Mon Sep 17 00:00:00 2001
From: Cursor Agent <cursoragent@cursor.com>
Date: Thu, 30 Apr 2026 05:07:45 +0000
Subject: [PATCH 3/3] docs(b2/M6): standalone product proposal - pivot from
 Atomic-Chat backend
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Evaluates repositioning B2 (MLX + DFlash + KakeyaLattice-MLX) from
'an Atomic-Chat second backend' (original M6 scope) to an
independent Mac local-inference product.

Three candidate forms:
  A. CLI tool `kakeya-llm`           (homebrew / pip)
  B. Native macOS app `Kakeya Studio` (DMG, custom UI)
  C. SDK + developer/enterprise library (PyPI + docs site)

Recommendation: ship A + C first, defer B.

Rationale:
  1. atomic.chat's "Google TurboQuant built-in" headline is
     under-delivered by the actual llama.cpp KV quantisation stack;
     folding B2 into Atomic-Chat means fulfilling *their* marketing
     through *our* engineering with no control over the message.
  2. B2's three capabilities (MLX-native inference, DFlash 3-6x
     speedup, E8 KV compression) are general-purpose infra — not
     chat-UI-specific. Binding them to one chat app under-uses the
     stack.
  3. Existing engineering assets (kakeyalattice_mlx,
     kakeya_sidecar_mlx, cache_injection) are ~90% reusable for A+C
     with minimal new work; going the Atomic-Chat-backend route
     blocks on third-party PR review + release cadence.

Phased execution:
  Phase 1 (after M5 merges): PyPI release, landing page, CLI shim.
  Phase 2 (4-8 weeks later): Cursor / Raycast / Obsidian integrations,
    early enterprise POCs.
  Phase 3 (6 months out, re-evaluate): whether to do the native app.

This evaluation does NOT invalidate PR #57 (B1), #58 (B2 skeleton),
or #59 (B2/M4): all that code is directly reusable. It re-scopes
M6 only, shifting from 'ship to Atomic-Chat' to 'ship independently
+ opportunistic Atomic-Chat integration if they take our PR'.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
---
 docs/B2_STANDALONE_PRODUCT_PROPOSAL.md | 242 +++++++++++++++++++++++++
 1 file changed, 242 insertions(+)
 create mode 100644 docs/B2_STANDALONE_PRODUCT_PROPOSAL.md

diff --git a/docs/B2_STANDALONE_PRODUCT_PROPOSAL.md b/docs/B2_STANDALONE_PRODUCT_PROPOSAL.md
new file mode 100644
index 00000000..d319fcd0
--- /dev/null
+++ b/docs/B2_STANDALONE_PRODUCT_PROPOSAL.md
@@ -0,0 +1,242 @@
+# B2 Standalone Product Proposal — 从 Atomic-Chat backend 走向独立 Mac 本地推理产品
+
+**Date**: 2026-04-30
+**Scope**: 评估把 B2 (`MLX + DFlash + KakeyaLattice-MLX`) 从"Atomic-Chat
+的第二个后端"升级为**独立 Mac 本地推理产品**的商业/工程可行性。
+**Status**: 评估稿，作为 M6 原 "Atomic-Chat extension 追加 backend 选项"
+的替代路线。
+
+---
+
+## 1. 为什么重估
+
+原 M6 规划是把 B2 做成 Atomic-Chat 的第二个 backend：在 TS 扩展里加
+`"variant": "mlx-dflash"`，Tauri plugin 同时托管 B1 + B2 两个 sidecar。
+这条路径**工程上成立**，但在商业逻辑上有几个问题。
+
+### 1.1 Atomic-Chat 的宣传与 B2 的产品承诺不对等
+
+atomic.chat 首页的核心宣传：
+
+- *"Google TurboQuant built-in — 8× Faster Inference"* — v1.5 报告里
+  TurboQuant b=2 在 4 个模型上**结构性不可用**，b=3 被 E8 Q=4 全面压过
+  3-6×。真实落地的是 llama.cpp 的 q4_0/q8_0 标量量化（2023 年的技术）。
+- *"Compressed down to just 3 bits — with no retraining, no fine-tuning,
+  and no trade-off in model performance"* — 同上，3 bit 级别对 <3B 模型
+  在 llama.cpp 上真实可用性极差。
+
+把 B2 挂进 Atomic-Chat 作为 backend 选项，等于帮它**兑现宣传承诺**。
+但 Atomic-Chat 团队**不是我们**，合作节奏、PR review、version
+compatibility 都有摩擦。我们交付 PR → 他们决定合不合 → 用户什么时候
+能拿到——这条链路至少多一个季度。
+
+### 1.2 B2 的价值点超过"chat UI 的后端"
+
+B2 的三个能力：
+- **MLX 原生推理** — Apple Silicon 第一等公民
+- **DFlash 3-6× 无损加速** — 2026 年 speculative decoding SOTA
+- **KakeyaLattice E8 KV 压缩** — 长上下文不 OOM
+
+这三个能力的使用面 **不止 chat**。举例：
+- IDE 集成（Cursor / VSCode 里本地跑 Qwen-Coder 30B）
+- 命令行 agent / MCP-driven workflow（不需要 UI 的场景）
+- 批量文本处理（合同 / 论文 / 代码库摘要，不交互）
+- 其他 GUI 宿主（Raycast extension、Alfred workflow、Obsidian plugin）
+
+把 B2 绑死到 Atomic-Chat 的 UI 里，等于把一个**通用推理加速层**锁进
+一个特定 chat app。
+
+### 1.3 atomic.chat 的长期演化风险
+
+atomic.chat 是商业公司产品，UI / 扩展 API 路线图由他们决定。他们可能：
+- 换推理引擎（目前完全绑 llama.cpp；将来改成 MLX 原生或自研）
+- 改扩展 API（我们的 TS 扩展 + Rust plugin 需随之改）
+- 决定不把 KakeyaLattice 放进默认发行（"Pro Mode for Mac" 选项可能永远
+  藏在高级设置里）
+
+这些风险不是零概率，且一旦发生我们的工程沉默成本很高。
+
+## 2. 独立产品路线的三种形态
+
+### Form A：CLI 工具 `kakeya-llm`
+
+- `brew install kakeya-llm` / `pip install kakeya-llm`
+- `kakeya-llm chat qwen3-8b --q 38 --dflash` 启交互
+- `kakeya-llm serve --port 1339` 开 OpenAI 兼容 server
+- `kakeya-llm bench gsm8k` 跑内置 M5 benchmark
+- 目标用户：开发者 / ML 工程师 / 学生
+
+**优势**：
+- 分发简单，homebrew 生态现成
+- 不依赖 UI 框架；Metal / CUDA / CPU 都能跑（基座是 mlx-lm + dflash）
+- 完整控制 KakeyaLattice × DFlash 的每一个旋钮
+
+**劣势**：
+- 非技术用户门槛高
+- 默认没有对话历史 / 多 assistant / MCP 集成，纯"跑模型"
+- 需要自己维护包发布 + 版本升级
+
+**适合场景**：作为底层 API 被其他 app 集成（IDE / 插件 / 工作流）。
+
+### Form B：native macOS app `Kakeya Studio`
+
+- DMG 包，不要求任何命令行能力
+- 单窗口 chat UI + 模型管理 + 性能监控
+- 内置 KakeyaLattice compression 的**可视化**（"你刚才这次对话省了
+  X GB KV，速度比 baseline 快 Y×"）
+- OpenAI 兼容 `:1339` 自带，让 Cursor / Raycast / Alfred 能连
+- 目标用户：Mac 本地 AI 爱好者（和 atomic.chat 用户重叠但不完全相同）
+
+**优势**：
+- 完整控制 UI／叙事／宣传口径（"基于论文的 E8 格压缩"，可验证、学术
+  可溯）
+- 把 KakeyaLattice 的 "discrete Kakeya cover" 理论直接 surface 到
+  UI（这是 atomic.chat 做不到也不会做的差异化）
+- 省下 Atomic-Chat 团队合作的沟通成本
+
+**劣势**：
+- 从零搭 Tauri/Electron UI + signing + notarize + auto-update
+- App Store 审核（如果走 MAS 发行）
+- 和 atomic.chat 正面竞争 Mac 本地 AI app 市场 — 他们已经 500+ stars，
+  我们要从 0 开始做 GTM
+
+**适合场景**：想建立 KakeyaLattice 作为品牌的长期目标。
+
+### Form C：SDK + 企业/开发者服务 `kakeyalattice-mlx` as a library
+
+- 把 `kakeyalattice_mlx` + `kakeya_sidecar_mlx` + `cache_injection` 升级
+  为严肃 Python 包，发到 PyPI
+- 卖点不是 "app"，是 "the MLX-native long-context inference stack"
+- 配套：文档站 + benchmark 报告 + 参考 Docker image
+- 用户是开发团队（Cursor 这类 IDE 想集成本地推理、内部工具团队想给
+  员工提供本地 AI）
+- 目标客户：对"确定性 + 可审计 + 离线"有强诉求的企业（律所、医疗、
+  投行、国防承包商）
+
+**优势**：
+- 最小 UI 投入，最大技术杠杆
+- 和 Atomic-Chat（C 端 chat app）错位竞争，客户群完全不重叠
+- Apache-2.0 吸引生态，有机会成为事实标准（类似 `transformers` 在
+  HF 生态里的地位）
+
+**劣势**：
+- B2B 销售周期长
+- 需要保持上游（MLX / DFlash / HF）兼容性的长期工程承诺
+- 早期没有"用户界面"的东西，marketing 需要更专业渠道（Twitter/X、
+  arXiv 引用、技术 blog）
+
+**适合场景**：研究机构 / 行业 / 企业客户优先。
+
+## 3. 推荐路径：A + C 串联，暂缓 B
+
+基于以下几个现实判断：
+
+1. **我们的工程资产天然贴合 A 和 C**：
+   - `kakeya_sidecar_mlx` 已经是 OpenAI 兼容 server — **直接就是 A 的核心**。
+   - `kakeyalattice_mlx` 已经是独立 pip 包，bit-identical parity 有测试
+     保证 — **直接就是 C 的核心**。
+   - 从"B1 + B2 集成到 Atomic-Chat"变成"把 B2 独立发布"，**工程改动不到
+     10%**，主要是加发布 CI + 文档站 + 品牌 landing page。
+
+2. **B 的 native app 路径成本最高且回报最晚**：
+   - 成本：搭 UI / signing / notarize / auto-update / App Store / GTM /
+     客服，全套做齐是半年级别工作。
+   - 回报：必须打败 atomic.chat (500+ stars, 成熟 GTM) 才能起量。
+   - 风险：一旦 atomic.chat 集成 KakeyaLattice（无论是通过我们的 PR
+     还是自己实现），B 的差异化就瓦解。
+
+3. **A + C 的组合可以"先卖给开发者 → 逐步渗透到 app"**：
+   - Cursor / Raycast / Alfred / Obsidian 的插件开发者用我们 SDK 嵌
+     本地推理 → 终端用户通过这些 app 间接用上 KakeyaLattice
+   - **这条路径 atomic.chat 无法阻断**，因为他们的 UI 只是众多消费端之一
+
+## 4. 具体执行建议
+
+### Phase 1（M5 benchmark 之后立即做）— 公开发布 A + C 的底层包
+
+1. **PyPI 发布**:
+   - `kakeyalattice-mlx` v0.1.0
+   - `kakeya-sidecar-mlx` v0.1.0
+   - 两者都走 trusted publishing + GitHub release + 自动 wheel build
+2. **文档站**（`kakeyalattice.dev` 或 `kakeya-mlx.dev` domain）:
+   - Quick start（一条 `pip install` 命令）
+   - API reference（auto-gen from docstrings）
+   - 性能 benchmark 页面（绑到 `reports/b2_release/` 的 JSON）
+   - 对比表格（atomic.chat / llama.cpp / ollama / MLC-LLM / 我们）
+3. **CLI shim**: `kakeya-llm` 作为 `kakeya-sidecar-mlx` 的子命令包装，
+   提供 `chat`/`serve`/`bench` 三个入口。
+4. **GitHub 项目分拆**（可选）: `kakeyalattice` 主 repo 保留 codec +
+   论文；`kakeya-mlx` 拆出独立 repo 做 sidecar + benchmark + 文档站。
+   独立 issue tracker + release cadence。
+
+### Phase 2（Phase 1 上线 4-8 周后）— 生态嵌入
+
+1. **Cursor / Continue / Aider 集成** — 这些 IDE agent 工具都支持
+   OpenAI 兼容的本地 server。我们的 sidecar 默认就能被它们连上，
+   只需要文档 + 示例。
+2. **Raycast extension**: Raycast 的 extension 生态对"本地 AI"有大需求；
+   做一个 Raycast AI 替代品走 Kakeya sidecar。
+3. **企业 POC**: 选 2-3 家有"隐私 + Mac 标配 + 长文档处理"需求的客户
+  （律所、投行），做 3 个月 POC，产出案例。
+
+### Phase 3（6 个月后再评估）— B 的 native app 要不要做
+
+如果 Phase 2 证明了：
+- 用户愿意把 KakeyaLattice 作为"本地 AI infra"集成进他们的工作流
+- SDK/CLI 形态的 conversion 到 Apache-2.0 用户足够多（比如周下载过万）
+- atomic.chat 没有把 KakeyaLattice 集成进去（无论出于什么原因）
+
+**那时再评估要不要做 native app**。做的时候差异化应该非常明确（e.g.
+"for researchers who want to see the E8 lattice compression happen in
+real time"），而不是通用 chat UI 和 atomic.chat 正面硬刚。
+
+### 不推荐的路径
+
+- **不做原 M6**: 把 B2 做成 Atomic-Chat 的 backend 选项，然后等 PR
+  合入。工程沉默成本过高，且回报受制于第三方。
+- **不跳过 Phase 1 直接做 B**: "先做 UI，再补后端"是典型的精力错配；
+  UI 是消费端表面，不是壁垒。KakeyaLattice 的壁垒在 `kakeyalattice_mlx`
+  这一层，应该优先发布它。
+
+## 5. 对 B1 PR #57 和 B2 PR #58/#59 的影响
+
+这份评估不否定已有工作：
+
+- **B1 (PR #57)** 依然有用 — HF + MPS 路径是跨平台的（Mac/Win/Linux），
+  CLI `kakeya-llm` 会默认用它；Apple Silicon 用户自动切 B2。
+- **B2 (PR #58 骨架 + PR #59 DFlash 集成)** 依然有用 — 就是 Phase 1
+  要发布的 MLX 栈。
+- **Atomic-Chat 集成** (原 M6) — **可以保留为可选**: 如果 atomic.chat
+  团队愿意合我们的 PR，那是 bonus 分发渠道；但**不是独立产品的主路径**。
+
+换言之，现有工程资产**全部都在独立产品路径上可直接复用**。本评估只是
+**调整 GTM 策略和 branding focus**。
+
+## 6. 下一步（如果采纳本评估）
+
+1. **本 PR 合并后立即做**：
+   - 给 `kakeyalattice_mlx` 和 `kakeya_sidecar_mlx` 加 PyPI 发布 CI
+     (`.github/workflows/publish-*.yml`)
+   - 注册 domain（`kakeyalattice.dev` 等）
+   - 起草 landing page 内容（对比表 + quick start + FAQ）
+2. **2 周内**：
+   - 首个 PyPI release (v0.1.0)
+   - 开 `kakeya-mlx` 独立 repo（可选，看我们 repo 策略偏好）
+   - Twitter / arXiv / HN 发布公告
+3. **1 个月内**：
+   - Cursor/Continue/Aider 集成文档 + PR
+   - 第一个企业 POC 接触
+
+## 7. 结论（一句话）
+
+> **把 B2 从"atomic.chat 的后端"升级为"Mac 本地推理的独立 SDK + CLI 发行"。
+> 所有工程资产都能直接复用；差异化点（E8 格压缩 + DFlash 加速的
+> MLX 原生组合）在独立发行下完全可控，且不受 Atomic-Chat 产品路线
+> 影响。Native app 形态的决定推迟到 Phase 3，避免早期 UI 投入与
+> atomic.chat 正面竞争。**
+
+---
+
+*作者：Cursor Cloud Agent · 分支
+`AgentMemory/atomic-chat-b2-m5-acceptance-benchmark-04ae` · 作为 M6 评估
+文档与 M5 benchmark 同 PR 交付。*