diff --git a/docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md b/docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md new file mode 100644 index 0000000..3f3ccc4 --- /dev/null +++ b/docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md @@ -0,0 +1,635 @@ +# Atomic-Chat × KakeyaLattice v1.5 — 本地 Mac 部署集成架构 + +**Date**: 2026-04-30 +**Scope**: 把 KakeyaLattice v1.5 (E8 nested-lattice KV-cache codec) 作为核心 +压缩层嵌进 [`AtomicBot-ai/Atomic-Chat`](https://github.com/AtomicBot-ai/Atomic-Chat) +的本地推理栈,支撑 Qwen3 / Llama-3.x / Gemma-3/4 / DeepSeek-R1-Distill / +GLM-4 / Mistral 等多个开源模型在 **Mac (Apple Silicon, Metal)** 上的 +离线部署。 + +本文档是"分析 + 设计"一体的规划稿。代码骨架落在 +[`integrations/atomic-chat/`](../integrations/atomic-chat/)。 + +--- + +## 1. Atomic-Chat 现状架构(先分析,再集成) + +从官方 README / CONTRIBUTING 与配套项目 [atomic.chat](https://atomic.chat/) +可以还原出下面这张图(简化,聚焦"推理后端接入点")。 + +``` +┌────────────────────────────────────────────────────────────┐ +│ Web App (React + Redux + React Router + Vite + TS) │ +│ - 对话 UI、模型管理 UI、Assistant 编排 │ +└────────────┬───────────────────────────┬───────────────────┘ + │ imports │ imports + ▼ ▼ + ┌──────────────────────┐ ┌──────────────────────────┐ + │ Core SDK (core/) │ │ Extensions (extensions/) │ + │ - TypeScript APIs │◄────│ - assistant-extension │ + │ - Extension System │ uses│ - conversation-extension │ + │ - Event Bus │ │ - download-extension │ + │ - Type Definitions │ │ - llamacpp-extension ◄──┼── 本地推理 + └──────────┬───────────┘ └───────────┬──────────────┘ + │ │ + │ Tauri IPC (invoke) │ + └──────────────┬───────────────┘ + ▼ + ┌──────────────────────────────────────────────────────────┐ + │ Tauri Backend (Rust, src-tauri/) │ + │ - Window / IPC / 安全 / 文件系统 │ + └──────────────────────────────────────────────────────────┘ + ▲ + │ 提供能力 + ┌──────────────────────────────────────────────────────────┐ + │ Tauri Plugins (src-tauri/plugins/) │ + │ - hardware (CPU/GPU/RAM 探测) │ + │ - llamacpp (llama.cpp 进程管理、Metal/CUDA 后端、推理) │ + └──────────────────────────────────────────────────────────┘ + │ + ▼ + OpenAI 兼容本地 HTTP 服务 @ localhost:1337 + (/v1/models, /v1/chat/completions …) +``` + +关键事实: + +| 维度 | 现状 | +|:-----|:-----| +| Shell | **Tauri**(不是 Electron)| +| 推理后端 | 只有一个:`llamacpp-extension` + `plugins/llamacpp` 驱动 **llama.cpp** 进程 | +| 模型格式 | GGUF(主力),宣传支持 MLX / ONNX(由 llama.cpp 生态/路线图覆盖)| +| 模型来源 | 直连 HuggingFace 下载 | +| 硬件加速 | Mac 走 Metal(`xcodebuild -downloadComponent MetalToolchain`),Windows x64 走 CPU / CUDA | +| 对外 API | OpenAI 兼容的 `localhost:1337`,其他程序可直接打 | +| Agent 能力 | MCP (Model Context Protocol) 集成,跑 agentic workflow | +| 营销口径 | 站点声称 *"Google TurboQuant built-in"* — 指的是 KV 压缩 | + +模型下载到推理的主流程(文件系统级): + +``` +User clicks download + → web-app 发请求 + → download-extension 决定源 + → Tauri backend 落盘 + → llamacpp-extension 注册模型 + → plugins/llamacpp 起 llama.cpp 子进程 + → OpenAI 兼容 /v1 路由暴露 +``` + +--- + +## 2. 硬冲突:为什么不能把 KakeyaLattice 塞进 llama.cpp + +要"把 KakeyaLattice 作为核心架构"集成,先认清一个工程事实。 + +**KakeyaLattice v1.5 是 PyTorch-first 的 GPU 原生实现。** 其核心算子 +(`V15KakeyaZamirE8GPU.roundtrip`) 依赖: + +- Sylvester–Hadamard rotation(任意长度 2^k,PyTorch `matmul`) +- E8 Conway–Sloane closest-point(两 coset 评估 + `argmin`) +- Per-vector adaptive `q_max`(`amax` + `clamp`) + +最干净的宿主是 HuggingFace `transformers`:`kakeyalattice.hf.KakeyaLatticeCache` +已经是 `DynamicCache` 的一级子类,`model.generate(past_key_values=cache)` 一行接入。 + +反之 **llama.cpp 没有可插拔的"KV-cache 量化策略"的对外接口**: +1. `kv_cache` 是 C++ 内部结构,量化类型 (`q4_0`, `q8_0`, `f16`) 写死在 `ggml_type`。 +2. 没有 Python hook,改造需要实现一套 E8 的 C++/Metal 内核并 patch `llama_kv_cache_unified`。 +3. 
即便改造出来,也无法复用 KakeyaLattice 现有的 bit-parity 测试、ablation + benchmark、`rigorous_eval.py` 全套回归。 + +所以"核心架构"集成的工程现实是 — **把 KakeyaLattice 做成与 llama.cpp +平级的第二个本地推理后端**,由 Atomic-Chat 的 Extension System 负责路由。 + +--- + +## 3. 集成目标架构 + +在 Atomic-Chat 既有架构上加"右臂",与 llama.cpp 的"左臂"并列: + +``` +┌────────────────────────────────────────────────────────────┐ +│ Web App (unchanged) │ +└────────────────────┬──────────────────────────────────────┘ + ▼ + Core SDK (unchanged) + Extensions + ├─ llamacpp-extension (既有, GGUF) + └─ kakeyalattice-extension ★ 新增 + │ + │ Tauri IPC + ▼ + Tauri Plugins + ├─ plugins/llamacpp (既有) + └─ plugins/kakeyalattice ★ 新增 + │ + │ 管理 Python sidecar 进程生命周期 + ▼ + Kakeya Sidecar (localhost:1338, OpenAI 兼容) + ├─ HuggingFace transformers + torch MPS + ├─ KakeyaLatticeCache (E8, per-model Q) + └─ /v1/models, /v1/chat/completions (stream) + │ + ▼ + 本地模型仓库 (HF safetensors) + Qwen3-4B / Llama-3.2-3B / Gemma-4-E4B / + DeepSeek-R1-Distill-1.5B / GLM-4-9B-Chat / Mistral-7B … +``` + +在用户视角: + +- 既有的 Atomic-Chat UI 不需要重画。模型选择 UI 多一个 "Backend" 过滤器 + (`llama.cpp` / `KakeyaLattice (E8)`)。 +- 原 `localhost:1337` 继续由 Atomic-Chat 前门承担,内部按选择的模型 + 将 OpenAI 请求路由到 llama.cpp 或 Kakeya sidecar。 +- **用户只感觉到一件事**:选了 `KakeyaLattice` 后端后,长上下文占用的 + 内存少一截(E8 Q=10 ~3.37×,Q=4 ~4.57×),质量损失可控(|Δppl|<7% 在 + Qwen3/Gemma/GLM),能塞进 Mac 16/32GB 的上限里。 + +--- + +## 4. 关键设计决策 + +### 4.1 为什么 Python sidecar,而不是 Rust FFI + +方案对比: + +| 方案 | 优点 | 致命缺点 | +|:-----|:-----|:---------| +| Rust 直接调 libtorch / tch-rs | 无进程边界 | torch MPS ABI 不稳定;transformers 的 cache / generate 逻辑在 Python,不能照搬;维护成本爆炸 | +| **Python sidecar**(本方案)| 直接复用 `kakeyalattice.hf.KakeyaLatticeCache`;transformers 自带所有模型支持;Metal 通过 torch MPS 零改动 | 多一个进程(由 Tauri plugin 托管,用户无感)| +| 改 llama.cpp | 与 Atomic-Chat 现架构对齐 | 前述 §2,实现成本不可接受 | + +选 sidecar。用 `stdio` / `localhost:1338` 与 Tauri 通信,plugin 负责起/停/健康检查/日志转发。这套模式 llama.cpp 自己就是这么跑的(llama.cpp 也是独立进程)。 + +### 4.2 Metal (MPS) 而不是 CUDA + +KakeyaLattice v1.5 的 `V15KakeyaZamirE8GPU` 构造器签名: + +```python +V15KakeyaZamirE8GPU(D=head_dim, q_range=Q, device="cuda") +``` + +Mac 上只要把 `device="cuda"` 换成 `device="mps"` 就能跑 — 因为所有算子都是 +标准 PyTorch(`matmul`、`argmin`、`amax`、`clamp`、`round`),MPS backend 全部支持。 +这是整个集成最幸运的点:**不需要写任何 Metal shader**。 + +> 验证方式:`integrations/atomic-chat/kakeya_sidecar/tests/test_mps_smoke.py` +> 会在 Apple Silicon 上跑 E8 roundtrip,对齐 CUDA 参考输出(相对误差 < 1e-5)。 + +性能预期(vs v1.5 on H200 @ 551µs/2048-vec slice): +- M2 Pro / M3 Pro (16GB):codec 开销约 4-8 ms/slice,decode step 整体 15-30ms + 的 20-30%。不是热点,但不可忽略。 +- 优化方向(未来 PR):Triton 不行,直接写 Metal Performance Shaders (MPS) + 的 fused E8 closest-point。当前 sidecar 预留接口 `codec.set_backend("metal")`。 + +### 4.3 per-model Q 档位(来自 v1.5 报告) + +直接复用 `reports/v1_5_release/V15_FULL_4MODEL_REPORT.md` 的结论,每个模型 +落成一张 "deployment profile" JSON,见 `kakeya_sidecar/model_registry.py`: + +| Model (HF id) | head_dim | variant | 推荐 Q | CR | |Δppl| (v1.5 实测) | 备注 | +|:--|:-:|:-:|:-:|:-:|:-:|:--| +| Qwen/Qwen3-4B | 128 | e8 | 10 | 3.37× | 3.85% | 平衡点 | +| Qwen/Qwen3-4B | 128 | e8 | 38 | ~2.5× | <1% (paper) | 近无损 | +| meta-llama/Llama-3.2-3B-Instruct | 128 | e8 | 10 | 3.37× | 待测 (同类) | 报告未直测 | +| google/gemma-4-E4B | 256/512 | e8 | 10 | 3.47× | 1.56% | 异构 head_dim 已验证 | +| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 128 | e8 | 10 | 3.37× | 需 boundary≥2 | 小模型结构敏感 | +| zai-org/GLM-4-9B-Chat | 128 | e8 | 10 | 3.37× | 6.96% | L=40,可用 | +| mistralai/Mistral-7B-Instruct-v0.3 | 128 | e8 | 10 | 3.37× | 待测 (同类) | head_dim 兼容 | + +准则: +- **默认 Q=38**:近无损,Mac 首选(用户感知最稳)。 +- **长上下文 / 内存吃紧时 Q=10**:3.37× 压缩,|Δppl|<7% 在主流 
7B~9B。
- **Q=4 不作为默认**：报告里 GLM Q=4 |Δppl|=32%，只在 UI 高级选项里暴露。
- **DeepSeek-R1-Distill 家族强制 `boundary=2`**：避开报告里记录的
  no-boundary 灾难模式。

### 4.4 长上下文是真正的卖点

Mac 本地部署里"KV cache 撑爆内存"是高频失败。举例 Llama-3.2-3B 在 32k
ctx 的 KV：`2 (K+V) × L=28 × num_kv_heads=8 × head_dim=128 × 32768 tokens
× 2 B (bf16) ≈ 3.76 GB`。把它压 3.37× 后变约 1.1 GB，Mac mini M2 16GB 就够跑。

这是"接 KakeyaLattice"带来的**用户可感知价值**，不是单纯的 PR 演示。

---

## 5. 多模型支持矩阵

所有 `head_dim` 为 2 的幂且 ≥ 8 的模型（自然满足 `head_dim % 8 == 0`）
都即插即用。对不兼容的模型（如 GPT-NeoX 的 head_dim=96），
`KakeyaLatticeCache` 在 `strict=True` 下会于启动时报错，由 extension
捕获后 fallback 到 `llama.cpp` 后端。

| Family | head_dim | E8 OK | 备注 |
|:--|:-:|:-:|:--|
| Llama-3.x (1B/3B/8B) | 128 | ✅ | Mac 主力 |
| Qwen2 / Qwen3 (1.5B/4B/7B) | 128 | ✅ | v1.5 报告首发 |
| Mistral / Mixtral | 128 | ✅ | |
| Gemma-3 / Gemma-4 | 256 (+512 MatFormer) | ✅ | 异构 head_dim 在 KakeyaLatticeCache 已覆盖 |
| DeepSeek-R1-Distill 系列 (1.5B/7B/14B) | 128 | ✅ | 强制 boundary=2 |
| DeepSeek-V2/V3 (MLA) | 128 + 64 rope | ⚠️ | 需要 MLA 专用路径，下一步 |
| GLM-4-9B-Chat | 128 | ✅ | |
| Phi-3 (4k/128k) | 96 | ✗ | 跳过 → llama.cpp |
| GPT-NeoX 旧模型 | 96 | ✗ | 跳过 → llama.cpp |

---

## 6. API 设计 — OpenAI 兼容

Sidecar 暴露最小 OpenAI 子集，Atomic-Chat 的 `localhost:1337` 前门只做 URL 改写。

### `GET /v1/models`
返回当前已加载 + 可加载的模型；每条多一个 `x_kakeya` 扩展字段：

```json
{
  "object": "list",
  "data": [
    {
      "id": "qwen3-4b@e8-q10",
      "object": "model",
      "owned_by": "kakeyalattice",
      "x_kakeya": {
        "variant": "e8",
        "q_range": 10,
        "boundary": 0,
        "est_compression": 3.37,
        "est_delta_ppl_pct": 3.85
      }
    }
  ]
}
```

### `POST /v1/chat/completions`
完全兼容 OpenAI 请求格式。sidecar 接到后：
1. 按 `model` 字段解析 `<short_id>@<variant>-q<Q>[-b<boundary>]`
   （如 `qwen3-4b@e8-q10`）；
2. lazy-load 模型 + `KakeyaLatticeCache`（LRU，默认同时只驻留 1 个，
   `--max-resident` 可调）；
3. `model.generate(..., past_key_values=cache, streamer=...)` 跑出
   token 流；
4. 转成 OpenAI `chat.completion.chunk` SSE 流。

### `GET /v1/kakeya/stats`
扩展接口，给 UI 用，返回：
- 当前 cache 的 `codec_fired_per_layer` / `skip_fired_per_layer`
- 当前 KV HBM/URAM footprint 估算
- 上一次 decode step 的 codec 占用 %

---

## 7. 安装 & 运行（开发者视角）

本仓库提供的骨架可以**直接拉到 Atomic-Chat 仓库 `extensions/` 下使用**。

```bash
# 1. 装 Python sidecar
cd integrations/atomic-chat/kakeya_sidecar
pip install -e ".[mac]"
kakeya-sidecar --port 1338 --device mps --prewarm qwen3-4b@e8-q10

# 2. 拷扩展到 Atomic-Chat 仓库
cp -R integrations/atomic-chat/kakeyalattice-extension \
      ~/Atomic-Chat/extensions/
# 在 web-app 的 extension registry 里注册一行，重新 `make dev` 即可

# 3. 拷 Tauri plugin
cp -R integrations/atomic-chat/tauri-plugin-kakeyalattice \
      ~/Atomic-Chat/src-tauri/plugins/kakeyalattice
# 在 src-tauri/tauri.conf.json + Cargo.toml 加 plugin 注册
```

打包到 Atomic-Chat 安装包时，sidecar 需要用 `PyOxidizer` 或 `briefcase`
打成独立可执行，与 `llama.cpp` 二进制并列塞进 `.dmg`。本仓库不做
端到端打包（Atomic-Chat 仓库自己的 CI 做），只保证 sidecar 本身能跑。

---

## 8. 与项目原有 vLLM 路径的关系

本仓库 `vllm_backend/kakeya_v1_4_snapshot/` 是 **Linux + CUDA + vLLM** 的
snapshot-hook 插件路径，用于在 H200 上跑 `benchmarks/rigorous_eval.py`。

Atomic-Chat 集成 **不复用这条路径**，原因：
- vLLM 在 Mac 上没有 Metal 后端，跑不起来。
- vLLM 的 PagedAttention + KV cache 是 CUDA 内核，snapshot hook 依赖 CUDA tensor。

二者关系：
- `vllm_backend/` → 研究/benchmark 用，CI 数据来源，**保持不动**。
- `integrations/atomic-chat/kakeya_sidecar/` → 产品用，Mac 部署，新增路径。

两者共用的唯一组件是 `kakeyalattice` Python 包本身（纯 PyTorch，device 无关）。

---

## 9. 
路线图 + +| 阶段 | 内容 | 依赖 | +|:-:|:--|:--| +| M1 | Sidecar MVP:OpenAI `/v1/chat/completions` 流式 + Qwen3/Llama3/Gemma/Mistral/DeepSeek/GLM 6 模型单测 | 本 PR 落地 | +| M2 | kakeyalattice-extension + Tauri plugin 接入 Atomic-Chat,UI 增加 "Backend: KakeyaLattice" 选项 | M1 | +| M3 | Metal fused E8 closest-point kernel,codec 延迟降到 bf16 decode 的 < 5% | Metal Shading Language 专家 | +| M4 | `KakeyaLatticeCache` 切换到真·存索引模式(非 roundtrip),HBM 节省从 "nominal" 转为 "实打实" | 自定义 attention kernel | +| M5 | 与 Atomic-Chat 的 MCP/agent 链路打通:long-context 检索可在 compressed KV 上直接跑 NIAH 风格取值 | M1 + M2 | + +--- + +## 10. 风险 & 取舍 + +| 风险 | 缓解 | +|:--|:--| +| MPS 算子在 torch 2.x 有偶发 bug(`argmin` 在非连续 tensor 上) | sidecar 在 codec 前强制 `.contiguous()`,已经放进 smoke test | +| sidecar 进程崩掉导致整台 Atomic-Chat 推理断线 | Tauri plugin 负责 supervise + 自动重启(对齐 llama.cpp plugin 现有行为) | +| 用户误选 Q=4 在小模型上(如 DeepSeek-1.5B no-boundary 已知 5.4 万 % PPL) | UI 只暴露 `safe` Q 档;`aggressive` 档进高级菜单 + 警示 | +| HF 模型许可协议(Llama-3 门禁、GLM-4 的自有 license)| 沿用 Atomic-Chat 既有机制,sidecar 不做任何绕过 | +| `KakeyaLatticeCache` 的 `roundtrip` 落在 bf16 阵上,实际 HBM 节省是 "nominal"(见 `kakeyalattice/README.md` §What it is NOT) | M4 前,主卖点是**长上下文的 attention 质量下限**,不是 HBM 压缩率。UI 文案要诚实 | + +--- + +## 11. 与 atomic.chat 宣传口径的对齐 + +atomic.chat 首页写的是 *"Google TurboQuant built-in"*。按 v1.5 报告: + +- v1.5 E8 Q=4 (CR 4.57×) **vs** TQ b=3 (CR 4.92×):v1.5 在 4 个模型上全面 + 赢 3-6× 更低 |Δppl|。 +- TQ b=2 (CR 7.11×) 在 4 个模型上结构性不可用(\|Δppl\| > 100% 到 14 万%)。 + +所以这次集成对 Atomic-Chat 产品而言是**把宣传里的 "TurboQuant built-in" +换成工程上更靠谱的 E8 nested-lattice**。两者可以共存(用 `backend` 参数 +切),但在默认档位上 KakeyaLattice 明显更稳。 + +--- + +## 附:文件对应表 + +本 PR 新增的文件与它们在 Atomic-Chat 仓库里的最终归宿: + +| 本仓库路径 | Atomic-Chat 仓库映射 | +|:--|:--| +| `integrations/atomic-chat/kakeya_sidecar/` | 独立安装,sidecar 包装后塞进 `.dmg` | +| `integrations/atomic-chat/kakeyalattice-extension/` | 放到 `extensions/` | +| `integrations/atomic-chat/tauri-plugin-kakeyalattice/` | 放到 `src-tauri/plugins/kakeyalattice/` | +| `docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md` | 存本仓库作设计依据 | + +--- + +## 12. DFlash (block diffusion speculative decoding) 集成路径 + +### 12.1 DFlash 是什么,和 KakeyaLattice 的关系 + +[**DFlash** (z-lab, arXiv:2602.06036, 2026-02)](https://github.com/z-lab/dflash) 是 +block-diffusion 架构的 **speculative-decoding drafter**: 小 draft 模型 +在单次 forward 里并行 draft 一整块 token (block_size=16),target LLM +一次并行 verify,总体在 Qwen3-8B 上达 6× 无损加速 (2.5× 超过 EAGLE-3)。 +官方支持四后端: **Transformers / SGLang / vLLM-nightly / MLX (Apple +Silicon)**。 + +DFlash 与 KakeyaLattice **正交**: + +| | 改什么 | 省什么 | 质量承诺 | +|:-|:-|:-|:-| +| DFlash | 解码循环(draft 一次出 16 token) | **解码步数(时间)** | lossless (verify 回退 target 分布) | +| KakeyaLattice | 每步内部的 KV 表示 | **每步 KV HBM / 带宽(空间)** | 近无损 Q=38 <1%; Q=10 <7% on 3/4 models | + +叠加后理论效果: `总加速 ≈ DFlash 3-6× × KakeyaLattice KV 节省 → 更长 ctx × +18-24% 并发`。 + +### 12.2 叠加的关键交互 + +1. **Verify 阶段是 mini-prefill**, 16 token 一次并行过 codec, codec + 开销自动摊销到 1/16 — MPS codec 占比从单轨 decode 的 20-30% 降到 1-2%. +2. **Draft 模型的 KV 也是压缩目标**: DFlash MLX 已暴露 + `sliding_window_size` bound draft KV growth; 用 KakeyaLattice Q=10 + 激进档压缩 draft KV 完全安全, 因为 verify 由 target 正确分布兜底. + 这是"draft 激进压 / target 保守压"的分层策略, 只有在推测解码框架下 + 才能成立. +3. **Acceptance rate 风险**: target 走 Kakeya Q=38 会引入 <1% |Δppl|, + 期望 acceptance 掉 <1pp; 需实测 `z-lab/Qwen3-8B-DFlash-b16 + + Qwen3-8B + KakeyaLatticeCache(e8, q=38)` 对比 pure bf16 baseline, + 看 acceptance-length 分布. +4. **MLX 路径阻碍**: `KakeyaLatticeCache` 是 `transformers.DynamicCache` + 子类, 不能直接接 MLX 的 `RotatingKVCache` / `StandardKVCache`. 
+ 需要把 E8 codec 移到 MLX — 工作量约 1-2 文件 / 200-300 行 MLX 代码, + 无自定义 CUDA/Metal kernel. + +### 12.3 方案矩阵 (把 DFlash 加进去) + +| 方案 | Platform | 加速栈 | KV 压缩 | 工程成本 | +|:-|:-|:-|:-|:-| +| **B1** Mac sidecar, HF + MPS (本 PR) | Mac | 单轨 AR | `KakeyaLatticeCache` 现成 | ✅ 已落地 | +| **B2** Mac sidecar, MLX + DFlash + KakeyaLattice-MLX | Mac | DFlash 3-6× | KakeyaLattice MLX port | 中 | +| **C** 本地 vLLM + KakeyaLattice snapshot | Linux/CUDA | vLLM | 既有 plugin | 已有 | +| **C2** 本地 vLLM nightly + DFlash spec-config + KakeyaLattice | Linux/CUDA | vLLM + DFlash | 既有 plugin + 两插件共存调试 | 中-高 | + +B2 是本 PR 叙事链中的 "phase 2", 作为**独立后续 PR** 推进。 + +### 12.4 B2 路线图 (后续 PR) + +| 里程碑 | 产出 | +|:-|:-| +| M1 | `kakeyalattice_mlx/` — E8 codec 的纯 MLX 实现 + 与 PyTorch 参考的 bit-level parity | +| M2 | `KakeyaLatticeMLXCache` — mlx-lm KV cache 包装层 | +| M3 | `kakeya_sidecar_mlx/` — OpenAI 兼容 MLX sidecar, 加载 target + 可选 DFlash draft | +| M4 | 接 DFlash: `dflash.model_mlx.stream_generate` 内置 target KV 压缩 | +| M5 | acceptance-rate benchmark: `Qwen3-8B + z-lab/Qwen3-8B-DFlash-b16` × `{bf16, Kakeya Q=38, Q=10}` | +| M6 | Atomic-Chat extension 追加 "KakeyaLattice (MLX+DFlash)" backend 选项 | + +### 12.5 `DeploymentProfile` 预留的 `dflash_draft_repo` 字段 + +`kakeya_sidecar/kakeya_sidecar/model_registry.py` 在 B2 PR 里会补上: + +```python +dflash_draft_repo: str | None = None # e.g. "z-lab/Qwen3-8B-DFlash-b16" +``` + +对应已发布的 DFlash draft 模型清单: + +| target | DFlash draft | +|:--|:--| +| Qwen/Qwen3-4B (non-thinking) | `z-lab/Qwen3-4B-DFlash-b16` | +| Qwen/Qwen3-8B (non-thinking) | `z-lab/Qwen3-8B-DFlash-b16` | +| meta-llama/Llama-3.1-8B-Instruct | `z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat` | +| Qwen/Qwen3.5-4B | `z-lab/Qwen3.5-4B-DFlash` | +| Qwen/Qwen3.5-27B | `z-lab/Qwen3.5-27B-DFlash` | +| openai/gpt-oss-20b | `z-lab/gpt-oss-20b-DFlash` | +| openai/gpt-oss-120b | `z-lab/gpt-oss-120b-DFlash` | + +B1 sidecar 现在不消费这个字段, 留给 B2 PR 填充。 + +--- + +## 13. 
浏览器里跑推理的可行性评估 + +### 13.1 定义: "在 web 里跑"有四种 + +| | 含义 | 技术 | +|:-|:-|:-| +| W1 | 纯浏览器、零本地依赖 | WebGPU / WASM / WebLLM / transformers.js | +| W2 | 浏览器 UI + 同机本地 sidecar | `http://localhost:1338` + Tauri webview | +| W3 | Tauri webview 里的 web-app | 本 PR 的 TS 扩展跑在这一层 | +| W4 | 远端 server + 浏览器前端 | LAN/云端 vLLM/SGLang | + +本节聚焦 W1 (最硬的问题: 能否真把 B2 搬进浏览器)。 + +### 13.2 B2 三件部件搬 Web 的可行性 + +| 部件 | WebGPU 可行性 | 工作量 | +|:-|:-:|:-| +| **Target LLM** | ✅ WebLLM / transformers.js 已支持 Qwen/Llama/Gemma/Phi 小模型 int4 | 现成 | +| **KakeyaLattice E8 codec** | ✅ 全是 `matmul` / `argmin` / `round` / `clamp` / `amax`, WGSL 几百行 | 低: 1-2 天 WGSL + JS | +| **DFlash drafter** | ✘ 无 web 后端; draft 架构 (cross-architecture KV injection + block-diffusion + mask token) 需要 port 自 `dflash/model_mlx.py` ~1500 行到 WebGPU | 高: 独立项目级 | + +三件拼起来的结论: +- `target + KakeyaLattice` ✅ 可行 +- `target + KakeyaLattice + DFlash` ✘ 短期不可行 + +### 13.3 关键限制的数字锚点 + +- WASM 线性内存单实例 4 GB 上限 (memory64 提案 Safari 未实现) +- WebGPU buffer binding 单 binding Chrome 默认 256 MB (可提至 2 GB) +- 7B bf16 = 14 GB → 必 int4 量化; 13B int4 ≈ 7 GB 贴 M3 Pro 边界 +- WebGPU 相对 native GPU 效率 **50-70%** (workgroup / subgroup ops / driver 抽象损耗) +- Qwen-2.5-3B WebLLM @ M3 Pro ~30-45 tok/s (MLX native ~80-100) +- WebGPU + DFlash 缺位 → web 比 "native MLX + DFlash" 慢 **一个量级** + +### 13.4 Web 推理的优劣势速览 + +**优势**: +- 分发极简 (URL), 跨平台零 build matrix, 更新零摩擦 +- 浏览器沙盒提供可审计的隐私承诺 +- 嵌入性: Notion / Chrome extension / iframe 原位推理 +- 服务端成本近零 (demo / playground 场景) + +**劣势**: +- 硬件上限被浏览器上限约束 (WASM 4 GB / WebGPU buffer quota) +- 冷启动悬崖 (首次几 GB 权重下载) +- 性能天花板比 native 低 50-70% + 无 DFlash = 慢一个量级 +- 电池 / 发热 / 浏览器碎片化 (2026 覆盖率 ~60-75%) +- 无本地文件 / 无 MCP / 无子进程能力 — Atomic-Chat 的 agent 卖点全部失效 + +### 13.5 三条现实路径 (ladder) + +| ladder | 能做到 | 工作量 | +|:-:|:--|:--| +| L1 | Web 版 KakeyaLattice E8 codec demo (WGSL + WebGPU), marketing/教育 | 1-2 天 | +| L2 | Web 里 transformers.js + 自定义 KV cache wrapper 跑 3-4B target + KakeyaLattice (无 DFlash) | 几周 | +| L3 | 完整 B2 in web (需 DFlash WebGPU 后端, 等 z-lab / MLC 社区出) | 巨大, 非本仓库能推进 | + +### 13.6 W2 的务实选项 (不是纯 web, 但"零点击即用") + +sidecar 加一个 `/chat.html` 静态路由 + React build 产物, 用户打开 +`http://localhost:1338/chat.html` 浏览器就能 chat, 底层是完整 B2 +native 栈. 这是 Atomic-Chat Tauri webview 的"脱壳版", 保留 B2 全部 +性能, 同时满足"打开浏览器就用"的体验诉求. + +### 13.7 建议 + +- **短期**: 不在本 PR 做 W1; 保留 B1 Python sidecar 作为产品主力. +- **中期** (独立 PR): L1 (codec demo 页面) 作为 marketing / 教育资产. +- **长期**: L2 (web-native LLM + KakeyaLattice) 机会窗口, 但 DFlash 缺位 + 意味着"web 永远比 native 慢一代", 投入要量力. + +--- + +## 14. A0 (llama.cpp) vs B2 (MLX + DFlash + KakeyaLattice) 决策表 + +本节给 Atomic-Chat 团队在 review 这套集成 PR 时一份直接可用的决策依据: +B2 **不是** A0 的替代, 而是 A0 的"Pro Mode for Mac"附加后端. 
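两个后端并存时，`:1337` 前门的路由可以非常薄。下面是一段**示意性** TypeScript
（非 Atomic-Chat 真实代码；`LLAMACPP_BASE` 的端口是本文虚构的占位，
"model id 含 `@` 即走 sidecar" 的判别规则来自 §6 的 channel id 约定）：

```typescript
// 示意代码（假设性）：:1337 前门按 model id 把 OpenAI 请求分流到两个后端。
// KakeyaLattice 的 channel id 形如 "qwen3-4b@e8-q10"，总含 "@"；
// GGUF 模型 id 没有 "@"，所以一个 includes 判断就够了。
const KAKEYA_BASE = "http://127.0.0.1:1338";    // Kakeya sidecar（本文约定）
const LLAMACPP_BASE = "http://127.0.0.1:13337"; // 占位端口，实际由 plugin 分配

function pickBackendBase(modelId: string): string {
  return modelId.includes("@") ? KAKEYA_BASE : LLAMACPP_BASE;
}

async function forwardChatCompletion(body: { model: string }): Promise<Response> {
  // 请求体原样透传，两个后端都讲 OpenAI 方言。
  return fetch(`${pickBackendBase(body.model)}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
}
```
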
+ +### 14.1 两套方案的全貌 + +**A0 (Atomic-Chat 现状)**: +- Tauri 2 + React, 跨平台 (Mac/Win/Linux) +- 单推理后端: `extensions/llamacpp-extension` + `plugins/llamacpp` 驱动 llama.cpp +- GGUF 模型, Metal / CUDA / CPU 加速 +- KV 压缩: `--cache-type-k/v` = q8_0 / q4_0 / q4_1 标量量化 +- 推测解码: llama.cpp 引擎有, UI 未集成 +- MCP / Assistant / 多 provider / OpenAI 兼容 `:1337` +- 宣传 "Google TurboQuant built-in" (实为 llama.cpp KV 标量量化) + +**B2 (本 PR 叙事链末端, 独立后续 PR 落地)**: +- 复用 Atomic-Chat 的 Tauri 壳 +- 第二推理后端: Python sidecar 跑 MLX + target LLM + DFlash drafter + KakeyaLattice-MLX codec +- MLX 权重 (mlx-community) + HF safetensors +- Apple Silicon 专属 (MLX 不上 Windows/Linux) +- KV 压缩: E8 Q=38 / Q=10 / Q=4 (v1.5 报告锚定) +- 推测解码: DFlash 3-6× 无损 (z-lab 预训练 draft) +- 复用 A0 的 UI / MCP / Assistant / `:1337` OpenAI 前门 (内部路由到 `:1338` sidecar) + +### 14.2 多维对比表 + +| 维度 | A0 | B2 | +|:-|:-|:-| +| Decode 速度 (Qwen3-8B @ M3 Pro) | ~35-45 tok/s (q4_K_M) | ~200-280 tok/s effective (MLX 50 × DFlash 3-6×) | +| KV 压缩质量 | q4/q8 标量, q2_K 对 <3B 模型不稳 | E8 Q=38 \|Δppl\|<1%, Q=10 <7% (3/4 模型) | +| 长上下文 (Mac 16GB) | Qwen3-8B ~16-24k | ~48-64k (KV 再 3.37× 压缩) | +| 推测解码 | 引擎有, UI 不可达 | DFlash 3-6× 无损 | +| 模型生态 | GGUF 池 >> MLX 池 (几万 vs 几百) | MLX 主流模型覆盖 + HF 原始权重 | +| 冷启动 | ~0.5-1s | ~2-3s (Python + MLX load) | +| 平台覆盖 | Mac + Windows + Linux | **Mac 专属 (Apple Silicon)** | +| 小模型 (<3B) | llama.cpp 手工 Metal 调优, ~90+ tok/s | MLX 比 llama.cpp 慢 20-30% | +| 大模型 (>13B) | GGUF int4/int2 OK | MLX int4 OK 但生态弱 | +| Agent / MCP | 原生集成 | 完全复用 A0 能力 | +| KakeyaLattice 接入深度 | 无 (llama.cpp 无可插拔 KV hook) | 原生 `KakeyaLatticeCache` (B1) + MLX port (B2) | +| 打包 | DMG + 静态 llama.cpp | DMG + PyOxidizer sidecar + signing | +| debuggability | C++ 栈 + native profiler | Python stack + MLX trace, 显著更易 | +| 新 codec 接入 | 改 C++ + Metal kernel | 几十行 MLX | +| License | Apache-2.0 + MIT | Apache-2.0 全线 | +| 社区规模 | llama.cpp ~80k★ | MLX 官方 / DFlash 新 / KakeyaLattice 专项 | + +### 14.3 各自赢在哪里 + +**A0 三大硬核优势**: +1. **生态压倒性** — GGUF 新模型 release-to-quant lag 24-72h, MLX 是 1-2 周. +2. **跨平台** — Windows 用户 (Atomic-Chat 有 .exe 头等二进制) 在 B2 下无路可走. +3. **小模型性能** — llama.cpp Metal shader hand-tuned; Phi-3.5 mini @ M3 Pro ~90+ tok/s. + +**B2 三大硬核优势**: +1. **加速 × 压缩乘法效应** — 200+ tok/s effective, A0 在产品形态下追不上 (llama.cpp 推测解码未暴露到 UI). +2. **长上下文真解锁** — Mac 16GB + Qwen3-8B 从"32k OOM"到"64k 可用", \|Δppl\|<1%. +3. **质量天花板** — Q=38 近无损 + Q=10 平衡档都有 v1.5 n=32 CI 实测撑, 不是工程直觉. + +**A0 软肋**: +- `Google TurboQuant built-in` 的宣传工程上名不副实 (实为 llama.cpp 2023 年就有的标量量化); v1.5 报告中 TQ b=2 "结构性不可用", b=3 被 E8 Q=4 在 4 模型上全面压过. +- draft model 支持在 UI 未暴露, 加速特性"存在但不可达". + +**B2 软肋**: +- Mac-only — Windows/Linux 用户用不到. +- Python sidecar 打包 + notarize 是 Atomic-Chat 新工程. +- 冷启动慢 2 秒. +- DFlash 对模型家族有限定 (Qwen3 / Llama-3.1 / Qwen3.5 / gpt-oss / Kimi). + +### 14.4 用户画像 → 最优方案 + +| 用户画像 | 最优 | +|:-|:-| +| Windows / Linux | **A0** (B2 不可用) | +| Mac + <3B 小模型 (Phi, Gemma-2B) | **A0** (llama.cpp 手工调优略优) | +| Mac + 3-8B 主力 (Qwen3, Llama-3) | **B2** (加速 + KV 压缩双突破) | +| Mac + 长上下文 / 整本书 / 代码库 | **B2** (唯一解) | +| 追新模型 (每天等新 GGUF) | **A0** (生态快一周) | +| MCP / agent 工作流 | 任选; B2 高 tok/s 让连续调用更流畅 | +| 隐私极强 | 任选; B2 Python 栈可审计更深 | + +### 14.5 给产品决策者的三句话 + +1. **不要把 B2 包装成"取代 llama.cpp"** — 产品伤害极大 (Windows / GGUF 追新 / 小模型用户流失). +2. **把 B2 定位为"Pro Mode for Mac"** — 文案强调 "DFlash + E8 lattice KV, Qwen3-8B 在 M3 Pro 200+ tok/s, 48k ctx 不 OOM", 兑现 atomic.chat 早已宣传却未落地的 "TurboQuant built-in" 承诺. +3. **宣传可信度** — B2 的每一个数字都能 reproduce (KakeyaLattice v1.5 n=32 Student-t CI + DFlash z-lab 官方 benchmark). 
把 "Google TurboQuant built-in" 改为 "KakeyaLattice E8 + DFlash built-in for Mac", 工程上才立得住. + +### 14.6 本 PR 的定位重申 + +本 PR (分支 `AgentMemory/atomic-chat-kakeya-integration-04ae`) 交付 +**B1 方案 + 完整设计叙事**, 具体包含: + +- `integrations/atomic-chat/kakeya_sidecar/` — Python sidecar (HF + MPS + `KakeyaLatticeCache`) 代码 + 15 单测 +- `integrations/atomic-chat/kakeyalattice-extension/` — Atomic-Chat TS 扩展骨架, tsc 通过 +- `integrations/atomic-chat/tauri-plugin-kakeyalattice/` — Rust Tauri plugin 骨架 +- `docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md` — §1-§14 完整设计 + +**B2 方案** (MLX + DFlash + KakeyaLattice-MLX) 走**独立后续 PR** +(`AgentMemory/atomic-chat-b2-mlx-dflash-kakeya-04ae`), 避免本 PR 审阅面 +膨胀. B2 PR 会在本 PR merge 之后开出, 对本 PR 不是 blocker. + +--- + +*作者:Cursor Cloud Agent · 分支 `AgentMemory/atomic-chat-kakeya-integration-04ae`.* diff --git a/integrations/atomic-chat/.gitignore b/integrations/atomic-chat/.gitignore new file mode 100644 index 0000000..4624513 --- /dev/null +++ b/integrations/atomic-chat/.gitignore @@ -0,0 +1,15 @@ +# Python +__pycache__/ +*.pyc +*.egg-info/ +.pytest_cache/ +.venv/ + +# Node / TS +node_modules/ +dist/ +*.tsbuildinfo + +# Rust +target/ +Cargo.lock diff --git a/integrations/atomic-chat/README.md b/integrations/atomic-chat/README.md new file mode 100644 index 0000000..abc1062 --- /dev/null +++ b/integrations/atomic-chat/README.md @@ -0,0 +1,95 @@ +# Atomic-Chat × KakeyaLattice v1.5 — 本地 Mac 部署集成 + +把 [`kakeyalattice` v1.5 (E8 格 KV-cache codec)](../../kakeyalattice/) 作为 +**第二个一等推理后端** 接入 +[`AtomicBot-ai/Atomic-Chat`](https://github.com/AtomicBot-ai/Atomic-Chat), +目标是 **Mac (Apple Silicon, Metal)** 的多模型离线部署。 + +> 完整设计依据见 [`docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md`](../../docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md)。 +> 本目录只放"可直接拷进 Atomic-Chat 仓库"的工程骨架。 + +## 目录结构 + +``` +integrations/atomic-chat/ +├── kakeya_sidecar/ ★ Python 推理 sidecar +│ ├── pyproject.toml —— 独立 pip 包 `kakeya-sidecar` +│ ├── kakeya_sidecar/ +│ │ ├── __init__.py +│ │ ├── __main__.py —— `python -m kakeya_sidecar ...` +│ │ ├── cli.py —— argparse 入口 + uvicorn 启动 +│ │ ├── server.py —— FastAPI OpenAI 兼容接口 +│ │ ├── engine.py —— HF + KakeyaLatticeCache 推理核心 +│ │ ├── model_registry.py —— 多模型部署档位 +│ │ └── schemas.py —— OpenAI 请求/响应 dataclass +│ └── tests/ +│ └── test_model_registry.py +├── kakeyalattice-extension/ ★ Atomic-Chat TypeScript 扩展 +│ ├── package.json +│ ├── src/ +│ │ ├── index.ts —— 扩展入口(注册到 core SDK) +│ │ └── backend.ts —— 走 Tauri plugin → sidecar +│ └── README.md +└── tauri-plugin-kakeyalattice/ ★ Rust Tauri 插件(桩) + ├── Cargo.toml + ├── src/ + │ ├── lib.rs —— `tauri::plugin::Builder` 注册 + │ └── commands.rs —— sidecar 生命周期 + 代理调用 + └── README.md +``` + +## 快速验证 (Mac) + +```bash +# 1. 安装 sidecar +cd integrations/atomic-chat/kakeya_sidecar +pip install -e ".[mac]" + +# 2. 单元测试(不需要下载模型) +pytest tests/ -v + +# 3. 跑起来 +kakeya-sidecar --port 1338 --device mps +curl http://localhost:1338/v1/models | jq . + +# 4. 发一次推理请求(会首次下载模型,小心硬盘) +curl http://localhost:1338/v1/chat/completions -H "Content-Type: application/json" -d '{ + "model": "qwen3-4b@e8-q10", + "messages": [{"role":"user","content":"Explain nested-lattice codes in one line."}], + "stream": false, "max_tokens": 64 +}' +``` + +## 集成到 Atomic-Chat 主仓库(步骤示意) + +```bash +# 假设 Atomic-Chat 主仓库在 ~/code/Atomic-Chat +export ATOMIC=~/code/Atomic-Chat + +# 1. 扩展 +rsync -a kakeyalattice-extension/ $ATOMIC/extensions/kakeyalattice-extension/ + +# 2. Tauri plugin +rsync -a tauri-plugin-kakeyalattice/ $ATOMIC/src-tauri/plugins/kakeyalattice/ + +# 3. 
向 Atomic-Chat 的 extension registry 加一行: +# (详见 $ATOMIC/extensions/README / CONTRIBUTING.md) +# { name: "kakeyalattice-extension", enabled: true } + +# 4. Python sidecar 需要和 llama.cpp 一起打进安装包。 +# Atomic-Chat 在 `scripts/bundle-binaries.*` 有现成的二进制打包脚本, +# 对 sidecar 用 PyOxidizer / PyInstaller 产出单文件,挂上去即可。 +``` + +> 本 PR 不改 Atomic-Chat 主仓库 — 只在本仓库提供可直接移植的骨架、 +> 完整的设计文档、以及 sidecar 自己的单元测试。 + +## 与既有 vLLM 插件的关系 + +| 路径 | 场景 | 是否改动 | +|:--|:--|:--| +| `vllm_backend/kakeya_v1_4_snapshot/` | Linux / CUDA / H200 benchmark | 不动 | +| `integrations/atomic-chat/kakeya_sidecar/` | Mac / MPS / 产品端推理 | 本 PR 新增 | + +两者共用的唯一组件是 Python 包 `kakeyalattice` 本身 — 标量算子都走 +PyTorch,设备无关。 diff --git a/integrations/atomic-chat/kakeya_sidecar/README.md b/integrations/atomic-chat/kakeya_sidecar/README.md new file mode 100644 index 0000000..f732410 --- /dev/null +++ b/integrations/atomic-chat/kakeya_sidecar/README.md @@ -0,0 +1,37 @@ +# kakeya-sidecar + +OpenAI 兼容的本地推理 sidecar,给 [Atomic-Chat](https://github.com/AtomicBot-ai/Atomic-Chat) +用;推理走 HuggingFace `transformers` + `kakeyalattice.hf.KakeyaLatticeCache` +(E8 nested-lattice KV-cache 压缩)。 + +设计目标: + +1. **零代码改动** — 对外是 `POST /v1/chat/completions`,Atomic-Chat 既有 + OpenAI 客户端直连即可。 +2. **多模型本地部署** — Qwen3 / Llama-3.x / Gemma-4 / DeepSeek-R1-Distill / + GLM-4-9B / Mistral,每个都有"出厂" Q 档配置。 +3. **Mac 优先** — 默认 `--device mps`,Linux/CUDA 也支持。 +4. **Tauri-friendly** — 纯 HTTP + JSON,Tauri plugin 负责起进程、转发调用。 + +详细设计:[`docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md`](../../../docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md)。 + +## Quick start + +```bash +pip install -e ".[mac,dev]" +pytest tests/ -v # 不下载模型的纯逻辑单测 +kakeya-sidecar --port 1338 --device mps +``` + +```bash +curl http://localhost:1338/v1/models +curl http://localhost:1338/v1/chat/completions -d '{...}' ... +``` + +## 支持的模型 + +参见 `kakeya_sidecar/model_registry.py`。通过 `GET /v1/models` 实时返回。 + +## License + +Apache-2.0,与主仓库一致。 diff --git a/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/__init__.py b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/__init__.py new file mode 100644 index 0000000..d149aff --- /dev/null +++ b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/__init__.py @@ -0,0 +1,29 @@ +"""kakeya_sidecar — OpenAI-compatible local inference sidecar. + +Top-level imports are lazy so that pure-logic modules (notably +:mod:`model_registry`) can be imported in test / packaging +environments where FastAPI or torch may not be installed. 
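
Quick torch-free sanity check (ids come from :mod:`.model_registry`)::

    from kakeya_sidecar import resolve_model

    profile, channel = resolve_model("qwen3-4b@e8-q10")
    assert profile.hf_repo_id == "Qwen/Qwen3-4B"
    assert channel.q_range == 10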
+""" +from __future__ import annotations + +from .model_registry import MODEL_REGISTRY, DeploymentProfile, resolve_model + +__all__ = [ + "MODEL_REGISTRY", + "DeploymentProfile", + "resolve_model", + "KakeyaEngine", + "create_app", +] + +__version__ = "0.1.0" + + +def __getattr__(name): + if name == "KakeyaEngine": + from .engine import KakeyaEngine + return KakeyaEngine + if name == "create_app": + from .server import create_app + return create_app + raise AttributeError(f"module 'kakeya_sidecar' has no attribute {name!r}") diff --git a/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/__main__.py b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/__main__.py new file mode 100644 index 0000000..876b442 --- /dev/null +++ b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/__main__.py @@ -0,0 +1,7 @@ +"""Allow ``python -m kakeya_sidecar``.""" +from __future__ import annotations + +from .cli import main + +if __name__ == "__main__": # pragma: no cover + main() diff --git a/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/cli.py b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/cli.py new file mode 100644 index 0000000..a9826cc --- /dev/null +++ b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/cli.py @@ -0,0 +1,78 @@ +"""``kakeya-sidecar`` CLI entry point.""" +from __future__ import annotations + +import argparse +import logging +import sys + +from .engine import EngineConfig +from .server import create_app + + +def build_parser() -> argparse.ArgumentParser: + p = argparse.ArgumentParser( + prog="kakeya-sidecar", + description="OpenAI-compatible local inference sidecar for Atomic-Chat " + "(HuggingFace transformers + KakeyaLattice v1.5 E8 KV-cache).", + ) + p.add_argument("--host", default="127.0.0.1", + help="Bind address (default 127.0.0.1 — localhost only).") + p.add_argument("--port", type=int, default=1338, + help="Bind port (default 1338; Atomic-Chat's OpenAI front " + "door is 1337 so we sit one port over).") + p.add_argument("--device", default="auto", + choices=["auto", "mps", "cuda", "cpu"], + help="Torch device. 'auto' picks mps on Mac / cuda on Linux.") + p.add_argument("--dtype", default="auto", + choices=["auto", "bfloat16", "float16", "float32"], + help="Model parameter dtype.") + p.add_argument("--max-resident", type=int, default=1, + help="Max number of fully-loaded models kept in RAM/VRAM " + "at once (LRU).") + p.add_argument("--hf-cache-dir", default=None, + help="Override HF_HOME / cache location.") + p.add_argument("--prewarm", default=None, + help="Channel id to pre-load at startup, e.g. 
" + "'qwen3-4b@e8-q10'.") + p.add_argument("--log-level", default="INFO", + choices=["DEBUG", "INFO", "WARNING", "ERROR"]) + return p + + +def main(argv: list[str] | None = None) -> int: + args = build_parser().parse_args(argv) + logging.basicConfig( + level=getattr(logging, args.log_level), + format="%(asctime)s %(levelname)s %(name)s: %(message)s", + ) + + cfg = EngineConfig( + device=args.device, + dtype=args.dtype, + max_resident=args.max_resident, + hf_cache_dir=args.hf_cache_dir, + ) + + engine_instance = None + if args.prewarm: + from .engine import KakeyaEngine + + log = logging.getLogger("kakeya_sidecar.cli") + engine_instance = KakeyaEngine(cfg) + log.info("prewarming %s on %s", args.prewarm, engine_instance._device) + engine_instance.warmup(args.prewarm) + + app = create_app( + cfg, + lazy_engine=engine_instance is None, + engine_instance=engine_instance, + ) + + try: + import uvicorn # type: ignore + except ImportError: # pragma: no cover + print("uvicorn is required. `pip install uvicorn[standard]`.", file=sys.stderr) + return 2 + + uvicorn.run(app, host=args.host, port=args.port, log_level=args.log_level.lower()) + return 0 diff --git a/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/engine.py b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/engine.py new file mode 100644 index 0000000..4f14db3 --- /dev/null +++ b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/engine.py @@ -0,0 +1,307 @@ +"""Inference engine: HuggingFace transformers + KakeyaLatticeCache. + +The engine is intentionally small; there is no batching, no dynamic +multi-tenant scheduler, no speculative decoding — Atomic-Chat's +single-user desktop use case does not need any of that. + +Design: + +- One ``_LoadedModel`` per (short_id) with an LRU eviction policy of + size ``max_resident`` (default 1). Switching models unloads the + previous one, same as Atomic-Chat's llama.cpp plugin already does. +- ``KakeyaLatticeCache`` is **per request** (caches have sequence state + tied to the generation). We rebuild it fresh for each call because + that matches the standard HF generate pattern and makes concurrent + requests trivially safe. +- All torch imports are lazy so the CLI can `--help` with no torch + installed. + +The engine does NOT depend on FastAPI — the server module wires it up. +""" +from __future__ import annotations + +import logging +import os +import threading +import time +from collections import OrderedDict +from dataclasses import dataclass +from typing import Any, Iterator + +from .model_registry import Channel, DeploymentProfile, resolve_model + +log = logging.getLogger("kakeya_sidecar.engine") + + +@dataclass +class EngineConfig: + device: str = "auto" # "auto" | "mps" | "cuda" | "cpu" + dtype: str = "auto" # "auto" | "bfloat16" | "float16" | "float32" + max_resident: int = 1 # LRU size of loaded models + trust_remote_code: bool = True + hf_cache_dir: str | None = None + + +def _pick_device(requested: str) -> str: + if requested != "auto": + return requested + try: + import torch # type: ignore + except ImportError: # pragma: no cover + return "cpu" + if torch.cuda.is_available(): + return "cuda" + if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available(): + return "mps" + return "cpu" + + +def _pick_dtype(requested: str, device: str): + import torch # type: ignore + + if requested != "auto": + return getattr(torch, requested) + # MPS supports bf16 on macOS 14+; fall back to fp16 on older. 
+ if device == "mps": + return torch.float16 + if device == "cuda": + return torch.bfloat16 + return torch.float32 + + +class _LoadedModel: + """HF model + tokenizer, loaded on a specific device/dtype.""" + + def __init__( + self, + profile: DeploymentProfile, + cfg: EngineConfig, + device: str, + dtype, + ) -> None: + from transformers import AutoModelForCausalLM, AutoTokenizer # type: ignore + + log.info("loading %s on %s/%s …", profile.hf_repo_id, device, dtype) + t0 = time.time() + self.tokenizer = AutoTokenizer.from_pretrained( + profile.hf_repo_id, + trust_remote_code=cfg.trust_remote_code, + cache_dir=cfg.hf_cache_dir, + ) + if self.tokenizer.pad_token_id is None: + self.tokenizer.pad_token_id = self.tokenizer.eos_token_id + + self.model = AutoModelForCausalLM.from_pretrained( + profile.hf_repo_id, + torch_dtype=dtype, + trust_remote_code=cfg.trust_remote_code, + cache_dir=cfg.hf_cache_dir, + low_cpu_mem_usage=True, + ) + self.model.to(device) + self.model.eval() + + self.device = device + self.dtype = dtype + self.profile = profile + log.info( + "loaded %s in %.1fs (L=%d, head_dim=%r)", + profile.hf_repo_id, time.time() - t0, + self.model.config.num_hidden_layers, + getattr(self.model.config, "head_dim", "?"), + ) + + +class KakeyaEngine: + """Public-facing inference engine used by the FastAPI server.""" + + def __init__(self, cfg: EngineConfig | None = None) -> None: + self.cfg = cfg or EngineConfig() + self._device = _pick_device(self.cfg.device) + self._loaded: "OrderedDict[str, _LoadedModel]" = OrderedDict() + self._lock = threading.Lock() + log.info("KakeyaEngine device=%s max_resident=%d", + self._device, self.cfg.max_resident) + + # ------------------------------------------------------------------ + # model lifecycle + # ------------------------------------------------------------------ + + def _ensure_loaded(self, profile: DeploymentProfile) -> _LoadedModel: + with self._lock: + if profile.short_id in self._loaded: + self._loaded.move_to_end(profile.short_id) + return self._loaded[profile.short_id] + + import torch # type: ignore + dtype = _pick_dtype(self.cfg.dtype, self._device) + lm = _LoadedModel(profile, self.cfg, self._device, dtype) + + self._loaded[profile.short_id] = lm + while len(self._loaded) > self.cfg.max_resident: + evicted_id, evicted = self._loaded.popitem(last=False) + log.info("evicting %s", evicted_id) + del evicted + if torch.cuda.is_available(): + torch.cuda.empty_cache() + return lm + + def warmup(self, channel_id: str) -> None: + """Pre-load a model (used by ``--prewarm`` CLI flag).""" + profile, _ = resolve_model(channel_id) + self._ensure_loaded(profile) + + # ------------------------------------------------------------------ + # generation + # ------------------------------------------------------------------ + + def _build_cache( + self, lm: _LoadedModel, channel: Channel + ): + from kakeyalattice.hf import KakeyaLatticeCache # type: ignore + + head_dim = getattr(lm.model.config, "head_dim", None) + if head_dim is None: + hidden = lm.model.config.hidden_size + n_heads = lm.model.config.num_attention_heads + head_dim = hidden // n_heads + + return KakeyaLatticeCache( + variant=channel.variant, + q_range=channel.q_range, + num_hidden_layers=lm.model.config.num_hidden_layers, + head_dim=int(head_dim), + device=lm.device, + boundary=channel.boundary, + strict=False, + ) + + def chat( + self, + channel_id: str, + messages: list[dict[str, Any]], + *, + max_tokens: int = 256, + temperature: float = 0.7, + top_p: float = 1.0, + stop: list[str] | None = 
None, + override: dict[str, Any] | None = None, + ) -> tuple[str, dict[str, Any]]: + """One-shot (non-streaming) chat completion. + + Returns ``(text, stats)`` where ``stats`` contains kakeya-specific + telemetry (codec fire counts etc.) for the ``x_kakeya`` field. + """ + profile, channel = resolve_model(channel_id) + if override: + channel = Channel( + variant=override.get("variant", channel.variant), + q_range=int(override.get("q_range", channel.q_range)), + boundary=int(override.get("boundary", channel.boundary)), + est_compression=channel.est_compression, + est_delta_ppl_pct=channel.est_delta_ppl_pct, + label=channel.label, + ) + + lm = self._ensure_loaded(profile) + cache = self._build_cache(lm, channel) + + import torch # type: ignore + + prompt = lm.tokenizer.apply_chat_template( + messages, tokenize=False, add_generation_prompt=True + ) + inputs = lm.tokenizer(prompt, return_tensors="pt").to(lm.device) + prompt_len = inputs["input_ids"].shape[1] + + gen_kwargs: dict[str, Any] = dict( + max_new_tokens=max_tokens, + do_sample=temperature > 0.0, + temperature=max(temperature, 1e-4), + top_p=top_p, + past_key_values=cache, + pad_token_id=lm.tokenizer.pad_token_id, + ) + + t0 = time.time() + with torch.inference_mode(): + out = lm.model.generate(**inputs, **gen_kwargs) + gen_time = time.time() - t0 + + new_tokens = out[0, prompt_len:] + text = lm.tokenizer.decode(new_tokens, skip_special_tokens=True) + + stats = { + "variant": channel.variant, + "q_range": channel.q_range, + "boundary": channel.boundary, + "est_compression": channel.est_compression, + "est_delta_ppl_pct": channel.est_delta_ppl_pct, + "prompt_tokens": int(prompt_len), + "completion_tokens": int(new_tokens.shape[0]), + "generation_time_s": gen_time, + "codec_fired_per_layer": dict(getattr(cache, "codec_fired_per_layer", {})), + "skip_fired_per_layer": dict(getattr(cache, "skip_fired_per_layer", {})), + } + return text, stats + + def chat_stream( + self, + channel_id: str, + messages: list[dict[str, Any]], + *, + max_tokens: int = 256, + temperature: float = 0.7, + top_p: float = 1.0, + stop: list[str] | None = None, + override: dict[str, Any] | None = None, + ) -> Iterator[str]: + """Streaming variant — yields delta strings (not full SSE frames). + + The server wraps each delta into an OpenAI ``chat.completion.chunk``. 
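
        Note: ``stop`` sequences are currently accepted for OpenAI
        request compatibility but not enforced; generation ends on EOS
        or ``max_tokens``. The same limitation applies to :meth:`chat`.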
+ """ + profile, channel = resolve_model(channel_id) + if override: + channel = Channel( + variant=override.get("variant", channel.variant), + q_range=int(override.get("q_range", channel.q_range)), + boundary=int(override.get("boundary", channel.boundary)), + est_compression=channel.est_compression, + est_delta_ppl_pct=channel.est_delta_ppl_pct, + label=channel.label, + ) + lm = self._ensure_loaded(profile) + cache = self._build_cache(lm, channel) + + import torch # type: ignore + from transformers import TextIteratorStreamer # type: ignore + + prompt = lm.tokenizer.apply_chat_template( + messages, tokenize=False, add_generation_prompt=True + ) + inputs = lm.tokenizer(prompt, return_tensors="pt").to(lm.device) + streamer = TextIteratorStreamer( + lm.tokenizer, skip_prompt=True, skip_special_tokens=True + ) + + gen_kwargs: dict[str, Any] = dict( + **inputs, + max_new_tokens=max_tokens, + do_sample=temperature > 0.0, + temperature=max(temperature, 1e-4), + top_p=top_p, + past_key_values=cache, + streamer=streamer, + pad_token_id=lm.tokenizer.pad_token_id, + ) + + def _run(): + with torch.inference_mode(): + lm.model.generate(**gen_kwargs) + + thread = threading.Thread(target=_run, daemon=True) + thread.start() + for piece in streamer: + if piece: + yield piece + thread.join() diff --git a/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/model_registry.py b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/model_registry.py new file mode 100644 index 0000000..d288af5 --- /dev/null +++ b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/model_registry.py @@ -0,0 +1,242 @@ +"""Per-model deployment profiles. + +The registry maps a user-facing model id (e.g. ``qwen3-4b``) to: + +1. The HuggingFace repo id we pull weights from. +2. The canonical KakeyaLattice channels we support for that model + (variant ∈ {"d4", "e8"}, q_range ∈ {...}, boundary ∈ {...}). +3. Human-readable metadata (head_dim, compression-ratio estimate, + measured |Δppl|) so the UI / ``/v1/models`` response can show it. + +The numbers in ``channels[].est_*`` come directly from +``reports/v1_5_release/V15_FULL_4MODEL_REPORT.md`` where we have +measurements, and from the compression-ratio formula otherwise. + +Nothing in this module imports torch or transformers — it is pure +metadata, safe to import in tests. +""" +from __future__ import annotations + +from dataclasses import dataclass, field +from typing import Iterable + + +@dataclass(frozen=True) +class Channel: + """One (variant, Q, boundary) deployment channel for a model.""" + + variant: str # "d4" | "e8" + q_range: int # 4, 10, 38, 152 … + boundary: int = 0 # number of front/back layers to skip + est_compression: float = 1.0 # KV cache compression vs bf16 + est_delta_ppl_pct: float | None = None # measured if available + label: str = "" # e.g. "balanced", "aggressive", "near-lossless" + + +@dataclass(frozen=True) +class DeploymentProfile: + """Full deployment profile for one HF model.""" + + short_id: str # "qwen3-4b" + hf_repo_id: str # "Qwen/Qwen3-4B" + head_dim: int | tuple[int, ...] # tuple for heterogeneous (Gemma-4) + num_hidden_layers: int + channels: tuple[Channel, ...] 
    default_channel: Channel
    notes: str = ""

    def channel_id(self, ch: Channel) -> str:
        """UI/API id = ``<short_id>@<variant>-q<q_range>[-b<boundary>]``."""
        suffix = f"-b{ch.boundary}" if ch.boundary else ""
        return f"{self.short_id}@{ch.variant}-q{ch.q_range}{suffix}"

    def all_channel_ids(self) -> list[str]:
        return [self.channel_id(c) for c in self.channels]


def _std_channels_128() -> tuple[Channel, ...]:
    """Canonical channels for head_dim=128 (Qwen/Llama/Mistral/GLM/DeepSeek).

    Compression ratios from v1.5 report Table §3.
    """
    return (
        Channel("e8", q_range=38, boundary=0,
                est_compression=2.50, est_delta_ppl_pct=None,
                label="near-lossless"),
        Channel("e8", q_range=10, boundary=0,
                est_compression=3.37, est_delta_ppl_pct=None,
                label="balanced"),
        Channel("e8", q_range=4, boundary=0,
                est_compression=4.57, est_delta_ppl_pct=None,
                label="aggressive"),
    )


# NOTE: est_delta_ppl_pct is populated per-model where v1.5 measured it;
# empty entries stay ``None`` and the UI shows "estimated only".

MODEL_REGISTRY: dict[str, DeploymentProfile] = {
    # --- Qwen family ----------------------------------------------------
    "qwen3-4b": DeploymentProfile(
        short_id="qwen3-4b",
        hf_repo_id="Qwen/Qwen3-4B",
        head_dim=128,
        num_hidden_layers=36,
        channels=(
            Channel("e8", 38, 0, 2.50, None, "near-lossless"),
            Channel("e8", 10, 0, 3.37, 3.85, "balanced"),
            Channel("e8", 4, 0, 4.57, 17.00, "aggressive"),
        ),
        default_channel=Channel("e8", 10, 0, 3.37, 3.85, "balanced"),
        notes="v1.5 E8 first-measurement model; numbers from V15_FULL_4MODEL_REPORT.md.",
    ),
    "qwen2-1.5b": DeploymentProfile(
        short_id="qwen2-1.5b",
        hf_repo_id="Qwen/Qwen2-1.5B",
        head_dim=128,
        num_hidden_layers=28,
        channels=_std_channels_128(),
        default_channel=Channel("e8", 38, 0, 2.50, None, "near-lossless"),
        notes="Small Qwen2; conservative default (Q=38) on Mac 8-16 GB.",
    ),

    # --- Llama family ---------------------------------------------------
    "llama-3.2-3b-instruct": DeploymentProfile(
        short_id="llama-3.2-3b-instruct",
        hf_repo_id="meta-llama/Llama-3.2-3B-Instruct",
        head_dim=128,
        num_hidden_layers=28,
        channels=_std_channels_128(),
        default_channel=Channel("e8", 10, 0, 3.37, None, "balanced"),
        notes="Main Mac-16GB target; requires HF token (gated).",
    ),
    "llama-3.1-8b-instruct": DeploymentProfile(
        short_id="llama-3.1-8b-instruct",
        hf_repo_id="meta-llama/Llama-3.1-8B-Instruct",
        head_dim=128,
        num_hidden_layers=32,
        channels=_std_channels_128(),
        default_channel=Channel("e8", 10, 0, 3.37, None, "balanced"),
        notes="Best quality/size trade-off for Mac-32GB.",
    ),

    # --- Mistral --------------------------------------------------------
    "mistral-7b-instruct-v0.3": DeploymentProfile(
        short_id="mistral-7b-instruct-v0.3",
        hf_repo_id="mistralai/Mistral-7B-Instruct-v0.3",
        head_dim=128,
        num_hidden_layers=32,
        channels=_std_channels_128(),
        default_channel=Channel("e8", 10, 0, 3.37, None, "balanced"),
    ),

    # --- Gemma ----------------------------------------------------------
    # Heterogeneous head_dim (20×256 + 4×512 MatFormer); KakeyaLatticeCache
    # handles per-layer head_dim internally. 
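    # est_* values below are the v1.5 report's measured Gemma rows
    # (e.g. 3.47x at Q=10) rather than the generic _std_channels_128()
    # estimates, which only apply to head_dim=128 models.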
+ "gemma-4-e4b": DeploymentProfile( + short_id="gemma-4-e4b", + hf_repo_id="google/gemma-4-E4B", + head_dim=(256, 512), + num_hidden_layers=24, + channels=( + Channel("e8", 38, 0, 2.30, None, "near-lossless"), + Channel("e8", 10, 0, 3.47, 1.56, "balanced"), + Channel("e8", 4, 0, 4.77, 5.79, "aggressive"), + ), + default_channel=Channel("e8", 10, 0, 3.47, 1.56, "balanced"), + notes="Heterogeneous head_dim; measured in v1.5 report.", + ), + + # --- DeepSeek (R1-Distill series) ---------------------------------- + # Small DeepSeek models are structurally fragile under no-boundary. + # Force boundary=2 on every channel. + "deepseek-r1-distill-qwen-1.5b": DeploymentProfile( + short_id="deepseek-r1-distill-qwen-1.5b", + hf_repo_id="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", + head_dim=128, + num_hidden_layers=28, + channels=( + Channel("e8", 38, 2, 2.40, None, "near-lossless"), + Channel("e8", 10, 2, 3.20, None, "balanced"), + ), + default_channel=Channel("e8", 38, 2, 2.40, None, "near-lossless"), + notes=( + "boundary=2 is REQUIRED; no-boundary in-forward explodes " + "to >50 000% |Δppl| per V15_FULL_4MODEL_REPORT.md §1." + ), + ), + "deepseek-r1-distill-qwen-7b": DeploymentProfile( + short_id="deepseek-r1-distill-qwen-7b", + hf_repo_id="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", + head_dim=128, + num_hidden_layers=28, + channels=( + Channel("e8", 38, 2, 2.40, None, "near-lossless"), + Channel("e8", 10, 2, 3.20, None, "balanced"), + ), + default_channel=Channel("e8", 10, 2, 3.20, None, "balanced"), + ), + + # --- GLM ------------------------------------------------------------ + "glm-4-9b-chat": DeploymentProfile( + short_id="glm-4-9b-chat", + hf_repo_id="zai-org/GLM-4-9B-Chat", + head_dim=128, + num_hidden_layers=40, + channels=( + Channel("e8", 38, 0, 2.30, None, "near-lossless"), + Channel("e8", 10, 0, 3.37, 6.96, "balanced"), + Channel("e8", 4, 0, 4.57, 32.36, "aggressive"), + ), + default_channel=Channel("e8", 10, 0, 3.37, 6.96, "balanced"), + notes="Requires trust_remote_code=True.", + ), +} + + +def iter_model_channel_ids() -> Iterable[str]: + for prof in MODEL_REGISTRY.values(): + yield from prof.all_channel_ids() + + +def resolve_model(channel_id: str) -> tuple[DeploymentProfile, Channel]: + """Parse ``@-q[-b]`` into a (profile, channel) pair. + + Accepts the short id alone (no channel suffix) and returns the + profile's ``default_channel``. + """ + if "@" not in channel_id: + short = channel_id + if short not in MODEL_REGISTRY: + raise KeyError(f"unknown model id {short!r}") + prof = MODEL_REGISTRY[short] + return prof, prof.default_channel + + short, channel_suffix = channel_id.split("@", 1) + if short not in MODEL_REGISTRY: + raise KeyError(f"unknown model id {short!r}") + prof = MODEL_REGISTRY[short] + + parts = channel_suffix.split("-") + variant = parts[0].lower() + q_range = None + boundary = 0 + for part in parts[1:]: + if part.startswith("q") and part[1:].isdigit(): + q_range = int(part[1:]) + elif part.startswith("b") and part[1:].isdigit(): + boundary = int(part[1:]) + if q_range is None: + raise ValueError( + f"channel suffix {channel_suffix!r} missing q segment" + ) + + for ch in prof.channels: + if ch.variant == variant and ch.q_range == q_range and ch.boundary == boundary: + return prof, ch + + raise KeyError( + f"model {short!r} has no channel matching variant={variant} " + f"q_range={q_range} boundary={boundary}. 
Available: " + + ", ".join(prof.all_channel_ids()) + ) diff --git a/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/py.typed b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/py.typed new file mode 100644 index 0000000..e69de29 diff --git a/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/schemas.py b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/schemas.py new file mode 100644 index 0000000..c268012 --- /dev/null +++ b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/schemas.py @@ -0,0 +1,72 @@ +"""OpenAI-compatible request / response schemas (minimal subset). + +We deliberately model only the fields Atomic-Chat's client actually +sends. Anything unknown is accepted via ``model_config = {"extra": "allow"}`` +so we forward-propagate. +""" +from __future__ import annotations + +from typing import Any, Literal + +from pydantic import BaseModel, ConfigDict, Field + + +class ChatMessage(BaseModel): + model_config = ConfigDict(extra="allow") + role: Literal["system", "user", "assistant", "tool"] + content: str | list[dict[str, Any]] + + +class ChatCompletionRequest(BaseModel): + model_config = ConfigDict(extra="allow") + + model: str + messages: list[ChatMessage] + stream: bool = False + temperature: float = 0.7 + top_p: float = 1.0 + max_tokens: int | None = None + stop: str | list[str] | None = None + # Extension: override the per-model default channel per-request. + # Example: {"variant": "e8", "q_range": 38, "boundary": 0} + x_kakeya_override: dict[str, Any] | None = Field(default=None, alias="x_kakeya_override") + + +class ChatCompletionMessage(BaseModel): + role: Literal["assistant"] = "assistant" + content: str + + +class ChatCompletionChoice(BaseModel): + index: int = 0 + message: ChatCompletionMessage + finish_reason: Literal["stop", "length"] = "stop" + + +class ChatCompletionUsage(BaseModel): + prompt_tokens: int + completion_tokens: int + total_tokens: int + + +class ChatCompletionResponse(BaseModel): + id: str + object: Literal["chat.completion"] = "chat.completion" + created: int + model: str + choices: list[ChatCompletionChoice] + usage: ChatCompletionUsage + x_kakeya: dict[str, Any] | None = None + + +class ModelInfo(BaseModel): + model_config = ConfigDict(extra="allow") + id: str + object: Literal["model"] = "model" + owned_by: str = "kakeyalattice" + x_kakeya: dict[str, Any] | None = None + + +class ModelList(BaseModel): + object: Literal["list"] = "list" + data: list[ModelInfo] diff --git a/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/server.py b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/server.py new file mode 100644 index 0000000..63efb35 --- /dev/null +++ b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/server.py @@ -0,0 +1,215 @@ +"""FastAPI OpenAI-compatible server. + +Endpoints: + + GET /health + GET /v1/models + POST /v1/chat/completions (stream + non-stream) + GET /v1/kakeya/stats (extension) + +The server imports :mod:`kakeya_sidecar.engine` lazily so unit tests +that only exercise routing / schema logic can run without torch. 
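
Manual smoke test once the sidecar is up (same commands as the
integration README)::

    $ kakeya-sidecar --port 1338 --device mps
    $ curl -s http://localhost:1338/health
    $ curl -s http://localhost:1338/v1/models | jq '.data[].id'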
+""" +from __future__ import annotations + +import json +import logging +import time +import uuid +from typing import Any + +from fastapi import FastAPI, HTTPException +from fastapi.responses import JSONResponse, StreamingResponse + +from .engine import EngineConfig, KakeyaEngine +from .model_registry import MODEL_REGISTRY +from .schemas import ( + ChatCompletionChoice, + ChatCompletionMessage, + ChatCompletionRequest, + ChatCompletionResponse, + ChatCompletionUsage, + ModelInfo, + ModelList, +) + +log = logging.getLogger("kakeya_sidecar.server") + + +def create_app( + cfg: EngineConfig | None = None, + *, + lazy_engine: bool = True, + engine_instance: KakeyaEngine | None = None, +) -> FastAPI: + """Build the FastAPI application. + + Args: + cfg: engine configuration (device, dtype, max_resident). + lazy_engine: if True, the engine is constructed on first use + instead of app startup. Default True — this lets the process + start fast and surface model-load errors on the first + request instead of at boot. + engine_instance: optional pre-built engine (used by ``--prewarm``). + """ + app = FastAPI(title="kakeya-sidecar", version="0.1.0") + + state: dict[str, Any] = {"engine": engine_instance, "cfg": cfg or EngineConfig()} + + def engine() -> KakeyaEngine: + if state["engine"] is None: + state["engine"] = KakeyaEngine(state["cfg"]) + return state["engine"] # type: ignore[return-value] + + if not lazy_engine and state["engine"] is None: + state["engine"] = KakeyaEngine(state["cfg"]) + + # -------------------------------------------------------------- health + + @app.get("/health") + def health() -> dict[str, Any]: + return {"ok": True, "engine_loaded": state["engine"] is not None} + + # ------------------------------------------------------------- /models + + @app.get("/v1/models", response_model=ModelList) + def list_models() -> ModelList: + data: list[ModelInfo] = [] + for profile in MODEL_REGISTRY.values(): + for ch in profile.channels: + data.append( + ModelInfo( + id=profile.channel_id(ch), + owned_by="kakeyalattice", + x_kakeya={ + "hf_repo_id": profile.hf_repo_id, + "head_dim": profile.head_dim, + "num_hidden_layers": profile.num_hidden_layers, + "variant": ch.variant, + "q_range": ch.q_range, + "boundary": ch.boundary, + "est_compression": ch.est_compression, + "est_delta_ppl_pct": ch.est_delta_ppl_pct, + "label": ch.label, + "is_default": ch == profile.default_channel, + "notes": profile.notes, + }, + ) + ) + return ModelList(data=data) + + # --------------------------------------------------- /chat/completions + + @app.post("/v1/chat/completions") + def chat_completions(req: ChatCompletionRequest): + messages = [m.model_dump(exclude_none=True) for m in req.messages] + cid = f"chatcmpl-{uuid.uuid4().hex[:16]}" + created = int(time.time()) + + try: + eng = engine() + except Exception as e: # pragma: no cover + raise HTTPException(500, f"engine init failed: {e}") from e + + if req.stream: + return StreamingResponse( + _sse_stream(eng, req, cid, created), + media_type="text/event-stream", + ) + + try: + text, stats = eng.chat( + req.model, + messages, + max_tokens=req.max_tokens or 512, + temperature=req.temperature, + top_p=req.top_p, + override=req.x_kakeya_override, + ) + except KeyError as e: + raise HTTPException(404, str(e)) from e + except Exception as e: # pragma: no cover + log.exception("chat failed") + raise HTTPException(500, str(e)) from e + + resp = ChatCompletionResponse( + id=cid, + created=created, + model=req.model, + choices=[ + ChatCompletionChoice( + 
message=ChatCompletionMessage(content=text), + finish_reason="stop", + ) + ], + usage=ChatCompletionUsage( + prompt_tokens=stats["prompt_tokens"], + completion_tokens=stats["completion_tokens"], + total_tokens=stats["prompt_tokens"] + stats["completion_tokens"], + ), + x_kakeya=stats, + ) + return JSONResponse(resp.model_dump()) + + # ------------------------------------------------- /v1/kakeya/stats + + @app.get("/v1/kakeya/stats") + def kakeya_stats() -> dict[str, Any]: + eng = state["engine"] + if eng is None: + return {"engine_loaded": False} + return { + "engine_loaded": True, + "device": eng._device, + "resident_models": list(eng._loaded.keys()), + "max_resident": eng.cfg.max_resident, + } + + return app + + +# --------------------------------------------------------------- sse helper + + +def _sse_stream(eng: KakeyaEngine, req: ChatCompletionRequest, cid: str, created: int): + messages = [m.model_dump(exclude_none=True) for m in req.messages] + + def chunk(delta: dict[str, Any], finish_reason: str | None = None) -> str: + payload = { + "id": cid, + "object": "chat.completion.chunk", + "created": created, + "model": req.model, + "choices": [ + { + "index": 0, + "delta": delta, + "finish_reason": finish_reason, + } + ], + } + return f"data: {json.dumps(payload)}\n\n" + + yield chunk({"role": "assistant"}) + try: + for piece in eng.chat_stream( + req.model, + messages, + max_tokens=req.max_tokens or 512, + temperature=req.temperature, + top_p=req.top_p, + override=req.x_kakeya_override, + ): + yield chunk({"content": piece}) + except KeyError as e: + yield chunk({"content": f"[error] {e}"}, finish_reason="stop") + yield "data: [DONE]\n\n" + return + except Exception as e: # pragma: no cover + log.exception("stream failed") + yield chunk({"content": f"[error] {e}"}, finish_reason="stop") + yield "data: [DONE]\n\n" + return + + yield chunk({}, finish_reason="stop") + yield "data: [DONE]\n\n" diff --git a/integrations/atomic-chat/kakeya_sidecar/pyproject.toml b/integrations/atomic-chat/kakeya_sidecar/pyproject.toml new file mode 100644 index 0000000..59b3420 --- /dev/null +++ b/integrations/atomic-chat/kakeya_sidecar/pyproject.toml @@ -0,0 +1,50 @@ +[build-system] +requires = ["setuptools>=61", "wheel"] +build-backend = "setuptools.build_meta" + +[project] +name = "kakeya-sidecar" +version = "0.1.0" +description = "OpenAI-compatible local inference sidecar for Atomic-Chat, powered by KakeyaLattice v1.5 E8 KV-cache compression." 
+readme = "README.md" +requires-python = ">=3.10" +license = { text = "Apache-2.0" } +authors = [ + { name = "KakeyaLattice + Atomic-Chat integration" }, +] +classifiers = [ + "Development Status :: 3 - Alpha", + "Intended Audience :: Developers", + "License :: OSI Approved :: Apache Software License", + "Operating System :: MacOS", + "Operating System :: POSIX :: Linux", + "Programming Language :: Python :: 3.10", + "Programming Language :: Python :: 3.11", + "Programming Language :: Python :: 3.12", + "Topic :: Scientific/Engineering :: Artificial Intelligence", +] +dependencies = [ + "kakeyalattice[hf]>=1.5", + "transformers>=4.45", + "torch>=2.1", + "fastapi>=0.110", + "uvicorn[standard]>=0.29", + "pydantic>=2.6", + "sse-starlette>=1.8", +] + +[project.optional-dependencies] +mac = [ + "accelerate>=0.30", +] +dev = [ + "pytest>=7", + "httpx>=0.27", +] + +[project.scripts] +kakeya-sidecar = "kakeya_sidecar.cli:main" + +[tool.setuptools.packages.find] +where = ["."] +include = ["kakeya_sidecar*"] diff --git a/integrations/atomic-chat/kakeya_sidecar/tests/__init__.py b/integrations/atomic-chat/kakeya_sidecar/tests/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/integrations/atomic-chat/kakeya_sidecar/tests/test_model_registry.py b/integrations/atomic-chat/kakeya_sidecar/tests/test_model_registry.py new file mode 100644 index 0000000..5a75fe4 --- /dev/null +++ b/integrations/atomic-chat/kakeya_sidecar/tests/test_model_registry.py @@ -0,0 +1,84 @@ +"""Pure-logic tests for model_registry — no torch / HF downloads.""" +from __future__ import annotations + +import pytest + +from kakeya_sidecar.model_registry import ( + MODEL_REGISTRY, + resolve_model, + iter_model_channel_ids, +) + + +def test_registry_non_empty_and_covers_v15_4models() -> None: + for required in ( + "qwen3-4b", + "gemma-4-e4b", + "glm-4-9b-chat", + "deepseek-r1-distill-qwen-1.5b", + ): + assert required in MODEL_REGISTRY, required + + +def test_every_profile_has_default_channel_in_channels() -> None: + for short, prof in MODEL_REGISTRY.items(): + assert prof.default_channel in prof.channels, ( + f"{short}: default_channel not in channels" + ) + + +def test_deepseek_small_always_has_boundary() -> None: + prof = MODEL_REGISTRY["deepseek-r1-distill-qwen-1.5b"] + for ch in prof.channels: + assert ch.boundary >= 2, ( + "DeepSeek-R1-Distill-Qwen-1.5B requires boundary>=2 per v1.5 report" + ) + + +def test_channel_ids_roundtrip() -> None: + """For every channel the profile lists, resolve_model() must recover it.""" + for cid in iter_model_channel_ids(): + prof, ch = resolve_model(cid) + assert prof.channel_id(ch) == cid + + +def test_short_id_resolves_to_default() -> None: + prof, ch = resolve_model("qwen3-4b") + assert ch == prof.default_channel + + +def test_short_id_with_boundary_suffix() -> None: + # DeepSeek-1.5B uses b=2. 
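+    # Channel ids follow "<short>@<variant>-q<q_range>[-b<boundary>]";
+    # the "-b" suffix appears only on channels that pin a boundary.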
+    prof, ch = resolve_model("deepseek-r1-distill-qwen-1.5b@e8-q10-b2")
+    assert ch.variant == "e8" and ch.q_range == 10 and ch.boundary == 2
+
+
+def test_unknown_model_raises() -> None:
+    with pytest.raises(KeyError):
+        resolve_model("nonexistent-7b")
+
+
+def test_unknown_channel_raises() -> None:
+    with pytest.raises(KeyError):
+        # Q=9999 is not in the Qwen3-4B channel set
+        resolve_model("qwen3-4b@e8-q9999")
+
+
+def test_missing_q_suffix_raises() -> None:
+    with pytest.raises(ValueError):
+        resolve_model("qwen3-4b@e8")
+
+
+def test_head_dim_shapes_are_int_or_tuple() -> None:
+    for prof in MODEL_REGISTRY.values():
+        assert isinstance(prof.head_dim, (int, tuple))
+        if isinstance(prof.head_dim, tuple):
+            for hd in prof.head_dim:
+                assert isinstance(hd, int) and hd > 0
+
+
+def test_compression_ratios_are_sane() -> None:
+    """All listed channels should claim compression in [1.0, 10.0]."""
+    for prof in MODEL_REGISTRY.values():
+        for ch in prof.channels:
+            assert 1.0 <= ch.est_compression <= 10.0, (prof.short_id, ch)
diff --git a/integrations/atomic-chat/kakeya_sidecar/tests/test_server_routing.py b/integrations/atomic-chat/kakeya_sidecar/tests/test_server_routing.py
new file mode 100644
index 0000000..4002902
--- /dev/null
+++ b/integrations/atomic-chat/kakeya_sidecar/tests/test_server_routing.py
@@ -0,0 +1,100 @@
+"""Server-level smoke tests — mock the engine so torch is not needed."""
+from __future__ import annotations
+
+from typing import Any
+from unittest.mock import MagicMock
+
+import pytest
+
+# Must import FastAPI / TestClient lazily so that environments without
+# uvicorn can still `python -m compileall`.
+fastapi = pytest.importorskip("fastapi")
+from fastapi.testclient import TestClient  # noqa: E402
+
+from kakeya_sidecar import server as server_mod  # noqa: E402
+
+
+@pytest.fixture()
+def app_with_mock_engine(monkeypatch):
+    mock_engine = MagicMock()
+    mock_engine._device = "cpu"
+    mock_engine._loaded = {}
+    mock_engine.cfg.max_resident = 1
+
+    def _fake_chat(channel_id: str, messages, **kw) -> tuple[str, dict[str, Any]]:
+        return ("hello from kakeya", {
+            "variant": "e8", "q_range": 10, "boundary": 0,
+            "est_compression": 3.37, "est_delta_ppl_pct": 3.85,
+            "prompt_tokens": 5, "completion_tokens": 4, "generation_time_s": 0.01,
+            "codec_fired_per_layer": {}, "skip_fired_per_layer": {},
+        })
+
+    mock_engine.chat.side_effect = _fake_chat
+
+    app = server_mod.create_app(lazy_engine=True)
+
+    # The app was built with lazy_engine=True, so the `engine()` closure
+    # constructs `server_mod.KakeyaEngine` on first use. Patching that
+    # name is enough to make every route see the mock instead.
+    monkeypatch.setattr(server_mod, "KakeyaEngine", lambda *a, **kw: mock_engine)
+
+    return app, mock_engine
+
+
+def test_list_models_returns_v15_4models(app_with_mock_engine) -> None:
+    app, _ = app_with_mock_engine
+    client = TestClient(app)
+    r = client.get("/v1/models")
+    assert r.status_code == 200, r.text
+    data = r.json()
+    ids = {m["id"] for m in data["data"]}
+    # Must include the balanced channel for each v1.5 hero model.
+    assert "qwen3-4b@e8-q10" in ids
+    assert "gemma-4-e4b@e8-q10" in ids
+    assert "glm-4-9b-chat@e8-q10" in ids
+    # And the DeepSeek small must carry boundary=2.
+ assert "deepseek-r1-distill-qwen-1.5b@e8-q10-b2" in ids + + +def test_health_before_engine_use(app_with_mock_engine) -> None: + app, _ = app_with_mock_engine + client = TestClient(app) + r = client.get("/health") + assert r.status_code == 200 + assert r.json()["ok"] is True + + +def test_chat_completions_non_stream(app_with_mock_engine) -> None: + app, mock_engine = app_with_mock_engine + client = TestClient(app) + r = client.post("/v1/chat/completions", json={ + "model": "qwen3-4b@e8-q10", + "messages": [{"role": "user", "content": "hi"}], + "stream": False, + "max_tokens": 8, + }) + assert r.status_code == 200, r.text + data = r.json() + assert data["choices"][0]["message"]["content"] == "hello from kakeya" + assert data["usage"]["total_tokens"] == 9 + assert data["x_kakeya"]["variant"] == "e8" + mock_engine.chat.assert_called_once() + + +def test_chat_completions_unknown_model(app_with_mock_engine) -> None: + app, mock_engine = app_with_mock_engine + + def _raise(channel_id, messages, **kw): + raise KeyError(f"unknown model id {channel_id!r}") + + mock_engine.chat.side_effect = _raise + client = TestClient(app) + r = client.post("/v1/chat/completions", json={ + "model": "nonexistent-99b@e8-q10", + "messages": [{"role": "user", "content": "hi"}], + }) + assert r.status_code == 404 diff --git a/integrations/atomic-chat/kakeyalattice-extension/README.md b/integrations/atomic-chat/kakeyalattice-extension/README.md new file mode 100644 index 0000000..f78fbaf --- /dev/null +++ b/integrations/atomic-chat/kakeyalattice-extension/README.md @@ -0,0 +1,49 @@ +# @atomic-chat/kakeyalattice-extension + +Atomic-Chat TypeScript 扩展,把 KakeyaLattice v1.5 sidecar 作为**第二个 +一等推理后端** 注册进 Atomic-Chat 的 Extension System。 + +## 架构位置 + +``` + web-app (React) + │ + ▼ + core SDK ──┬── registerBackend(...) + │ + extensions ─┤ + ├── llamacpp-extension (既有,GGUF / llama.cpp) + └── kakeyalattice-extension ★ 本扩展 + │ + │ Tauri invoke + ▼ + plugins/kakeyalattice (Rust, 下一级目录) + │ + │ supervise + HTTP + ▼ + kakeya-sidecar (Python, localhost:1338) +``` + +## 关键文件 + +- `src/index.ts` — 扩展入口,向 Core SDK 注册一个新的 `Backend`。 +- `src/backend.ts` — `Backend` 实现:`listModels()`、`chatCompletion()`、 + `healthCheck()`、`getStats()`,统一通过 Tauri invoke 过桥到 Rust 插件。 +- `src/types.ts` — 与 Python sidecar `/v1/models` `x_kakeya` 字段一一对齐的 + 类型定义。 + +## 真正落地到 Atomic-Chat 主仓库时要做什么 + +1. 把本目录整份 `rsync` 到 `extensions/kakeyalattice-extension/`。 +2. 修 `@atomic-chat/core` 的 `peerDependency` 版本指向主仓库的实际版本。 +3. 在 Atomic-Chat 的扩展注册入口(通常是 `core/src/extensions.ts` 或 + `web-app/src/App.tsx` 里 bootstrap 阶段的扩展列表)追加: + ```ts + import { register as registerKakeya } from "@atomic-chat/kakeyalattice-extension"; + registerKakeya(); + ``` +4. 
+   register it in `src-tauri/Cargo.toml` + `tauri.conf.json`.
+
+Since we **do not modify the Atomic-Chat main repo itself**, its maintainers
+review the code and decide how to merge it; this extension only promises that
+the skeleton passes `tsc --noEmit`.
diff --git a/integrations/atomic-chat/kakeyalattice-extension/package.json b/integrations/atomic-chat/kakeyalattice-extension/package.json
new file mode 100644
index 0000000..f265588
--- /dev/null
+++ b/integrations/atomic-chat/kakeyalattice-extension/package.json
@@ -0,0 +1,26 @@
+{
+  "name": "@atomic-chat/kakeyalattice-extension",
+  "version": "0.1.0",
+  "description": "Atomic-Chat extension that routes local inference through the KakeyaLattice v1.5 sidecar (E8 nested-lattice KV-cache compression).",
+  "license": "Apache-2.0",
+  "main": "dist/index.js",
+  "types": "dist/index.d.ts",
+  "type": "module",
+  "files": [
+    "dist",
+    "src",
+    "README.md"
+  ],
+  "scripts": {
+    "build": "tsc -p tsconfig.json",
+    "typecheck": "tsc --noEmit",
+    "dev": "tsc -w -p tsconfig.json"
+  },
+  "peerDependencies": {
+    "@atomic-chat/core": "*",
+    "@tauri-apps/api": ">=2.0.0"
+  },
+  "devDependencies": {
+    "typescript": "^5.4.0"
+  }
+}
diff --git a/integrations/atomic-chat/kakeyalattice-extension/src/backend.ts b/integrations/atomic-chat/kakeyalattice-extension/src/backend.ts
new file mode 100644
index 0000000..c8618db
--- /dev/null
+++ b/integrations/atomic-chat/kakeyalattice-extension/src/backend.ts
@@ -0,0 +1,119 @@
+/**
+ * KakeyaLattice backend — Atomic-Chat `Backend` implementation.
+ *
+ * We do NOT import `@atomic-chat/core` at compile time (its exact
+ * shape is governed by the host app). Instead we declare the minimal
+ * interface we need as a local type, and the host will use structural
+ * typing via `registerBackend(new KakeyaBackend())`.
+ */
+import {
+  ChatCompletionRequest,
+  ChatCompletionResponse,
+  KakeyaModel,
+  KakeyaModelList,
+  SidecarStats,
+} from "./types";
+
+/**
+ * Minimal shape of Atomic-Chat's backend contract. Keep in sync with
+ * the host app's `core/src/backend.ts` as it evolves.
+ */
+export interface Backend {
+  readonly id: string;
+  readonly displayName: string;
+  readonly kind: "local" | "cloud";
+  listModels(): Promise<KakeyaModel[]>;
+  chatCompletion(req: ChatCompletionRequest): Promise<ChatCompletionResponse>;
+  chatCompletionStream(
+    req: ChatCompletionRequest,
+    onDelta: (chunk: string) => void,
+  ): Promise<void>;
+  healthCheck(): Promise<boolean>;
+  getStats?(): Promise<SidecarStats>;
+}
+
+/**
+ * Tauri invoke wrapper — resolved at runtime so the package can still
+ * `tsc --noEmit` in a node-only CI where `@tauri-apps/api` isn't
+ * available.
+ */
+type InvokeFn = <T>(cmd: string, args?: Record<string, unknown>) => Promise<T>;
+
+async function tauriInvoke(): Promise<InvokeFn> {
+  // Prefer the runtime-injected global when running inside Tauri.
+  const win = globalThis as unknown as {
+    __TAURI__?: { invoke: InvokeFn };
+    __TAURI_INTERNALS__?: { invoke: InvokeFn };
+  };
+  if (win.__TAURI__?.invoke) return win.__TAURI__.invoke;
+  if (win.__TAURI_INTERNALS__?.invoke) return win.__TAURI_INTERNALS__.invoke;
+  // Fallback to the dynamic import (ESM, requires @tauri-apps/api).
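+  // ("@tauri-apps/api/core" is the Tauri 2 import path for `invoke`;
+  // Tauri 1 exported it from "@tauri-apps/api/tauri", which this
+  // extension does not target.)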
+  const mod = await import("@tauri-apps/api/core");
+  return mod.invoke as InvokeFn;
+}
+
+export class KakeyaBackend implements Backend {
+  readonly id = "kakeyalattice";
+  readonly displayName = "KakeyaLattice (E8 KV-cache compression)";
+  readonly kind = "local" as const;
+
+  async listModels(): Promise<KakeyaModel[]> {
+    const invoke = await tauriInvoke();
+    const list = await invoke<KakeyaModelList>(
+      "plugin:kakeyalattice|list_models",
+    );
+    return list.data;
+  }
+
+  async chatCompletion(req: ChatCompletionRequest): Promise<ChatCompletionResponse> {
+    const invoke = await tauriInvoke();
+    return invoke<ChatCompletionResponse>(
+      "plugin:kakeyalattice|chat_completion",
+      { request: { ...req, stream: false } },
+    );
+  }
+
+  async chatCompletionStream(
+    req: ChatCompletionRequest,
+    onDelta: (chunk: string) => void,
+  ): Promise<void> {
+    const invoke = await tauriInvoke();
+    // The Rust plugin streams via a Tauri event named after the request id.
+    const streamId = await invoke<string>(
+      "plugin:kakeyalattice|chat_completion_stream_start",
+      { request: { ...req, stream: true } },
+    );
+
+    // Listen on the event channel. `@tauri-apps/api/event` is resolved lazily.
+    const eventMod = await import("@tauri-apps/api/event");
+    const unlisten = await eventMod.listen<string>(
+      `kakeyalattice:${streamId}`,
+      (e) => {
+        if (e.payload === "__DONE__") {
+          unlisten();
+          return;
+        }
+        onDelta(e.payload);
+      },
+    );
+
+    return invoke<void>("plugin:kakeyalattice|chat_completion_stream_wait", {
+      streamId,
+    });
+  }
+
+  async healthCheck(): Promise<boolean> {
+    try {
+      const invoke = await tauriInvoke();
+      const ok = await invoke<{ ok: boolean }>("plugin:kakeyalattice|health");
+      return ok.ok === true;
+    } catch {
+      return false;
+    }
+  }
+
+  async getStats(): Promise<SidecarStats> {
+    const invoke = await tauriInvoke();
+    return invoke<SidecarStats>("plugin:kakeyalattice|stats");
+  }
+}
diff --git a/integrations/atomic-chat/kakeyalattice-extension/src/index.ts b/integrations/atomic-chat/kakeyalattice-extension/src/index.ts
new file mode 100644
index 0000000..c513bdf
--- /dev/null
+++ b/integrations/atomic-chat/kakeyalattice-extension/src/index.ts
@@ -0,0 +1,32 @@
+/**
+ * Atomic-Chat extension entry point.
+ *
+ * Usage (inside the Atomic-Chat host app bootstrap):
+ *
+ *     import { register as registerKakeya } from "@atomic-chat/kakeyalattice-extension";
+ *     registerKakeya();
+ *
+ * The host app is expected to expose a global registry via
+ * `window.AtomicChatCore?.registerBackend(backend)`. We fall back to
+ * exporting the backend class directly so the host can wire it up
+ * however it likes.
+ */
+import { KakeyaBackend } from "./backend";
+
+export * from "./types";
+export { KakeyaBackend };
+
+type HostAPI = {
+  registerBackend?: (backend: unknown) => void;
+};
+
+export function register(): KakeyaBackend {
+  const backend = new KakeyaBackend();
+  const host = (globalThis as unknown as { AtomicChatCore?: HostAPI }).AtomicChatCore;
+  if (host?.registerBackend) {
+    host.registerBackend(backend);
+  }
+  return backend;
+}
+
+export default register;
diff --git a/integrations/atomic-chat/kakeyalattice-extension/src/tauri-shims.d.ts b/integrations/atomic-chat/kakeyalattice-extension/src/tauri-shims.d.ts
new file mode 100644
index 0000000..c2d87fc
--- /dev/null
+++ b/integrations/atomic-chat/kakeyalattice-extension/src/tauri-shims.d.ts
@@ -0,0 +1,23 @@
+/**
+ * Ambient type shims for Tauri modules. When this package is built
+ * inside the Atomic-Chat host app, the real `@tauri-apps/api` types
+ * take precedence (via node_modules). These shims exist so the package
+ * also typechecks in a minimal CI where the peer dependency isn't
+ * installed.
+ */
+declare module "@tauri-apps/api/core" {
+  export function invoke<T>(cmd: string, args?: Record<string, unknown>): Promise<T>;
+}
+
+declare module "@tauri-apps/api/event" {
+  export interface Event<T> {
+    event: string;
+    id: number;
+    payload: T;
+  }
+  export type UnlistenFn = () => void;
+  export function listen<T>(
+    event: string,
+    handler: (e: Event<T>) => void,
+  ): Promise<UnlistenFn>;
+}
diff --git a/integrations/atomic-chat/kakeyalattice-extension/src/types.ts b/integrations/atomic-chat/kakeyalattice-extension/src/types.ts
new file mode 100644
index 0000000..2d25f72
--- /dev/null
+++ b/integrations/atomic-chat/kakeyalattice-extension/src/types.ts
@@ -0,0 +1,75 @@
+/**
+ * Types mirroring the Python sidecar's `/v1/models` response.
+ * Kept in sync with `kakeya_sidecar/schemas.py` and
+ * `kakeya_sidecar/model_registry.py`.
+ */
+
+export interface KakeyaModelMeta {
+  hf_repo_id: string;
+  head_dim: number | number[];
+  num_hidden_layers: number;
+  variant: "d4" | "e8";
+  q_range: number;
+  boundary: number;
+  est_compression: number;
+  est_delta_ppl_pct: number | null;
+  label: string;
+  is_default: boolean;
+  notes: string;
+}
+
+export interface KakeyaModel {
+  id: string; // "<short>@<variant>-q<q_range>[-b<boundary>]"
+  object: "model";
+  owned_by: "kakeyalattice";
+  x_kakeya: KakeyaModelMeta;
+}
+
+export interface KakeyaModelList {
+  object: "list";
+  data: KakeyaModel[];
+}
+
+export interface ChatMessage {
+  role: "system" | "user" | "assistant" | "tool";
+  content: string;
+}
+
+export interface ChatCompletionRequest {
+  model: string;
+  messages: ChatMessage[];
+  stream?: boolean;
+  temperature?: number;
+  top_p?: number;
+  max_tokens?: number;
+  stop?: string | string[];
+  /** Runtime override for the per-model default channel. */
+  x_kakeya_override?: {
+    variant?: "d4" | "e8";
+    q_range?: number;
+    boundary?: number;
+  };
+}
+
+export interface ChatCompletionChoice {
+  index: number;
+  message: { role: "assistant"; content: string };
+  finish_reason: "stop" | "length";
+}
+
+export interface ChatCompletionResponse {
+  id: string;
+  object: "chat.completion";
+  created: number;
+  model: string;
+  choices: ChatCompletionChoice[];
+  usage: { prompt_tokens: number; completion_tokens: number; total_tokens: number };
+  x_kakeya?: Record<string, unknown>;
+}
+
+export interface SidecarStats {
+  engine_loaded: boolean;
+  device?: "mps" | "cuda" | "cpu";
+  resident_models?: string[];
+  max_resident?: number;
+}
diff --git a/integrations/atomic-chat/kakeyalattice-extension/tsconfig.json b/integrations/atomic-chat/kakeyalattice-extension/tsconfig.json
new file mode 100644
index 0000000..7f7ec13
--- /dev/null
+++ b/integrations/atomic-chat/kakeyalattice-extension/tsconfig.json
@@ -0,0 +1,16 @@
+{
+  "compilerOptions": {
+    "target": "ES2022",
+    "module": "ESNext",
+    "moduleResolution": "Bundler",
+    "lib": ["ES2022", "DOM"],
+    "strict": true,
+    "declaration": true,
+    "outDir": "dist",
+    "rootDir": "src",
+    "esModuleInterop": true,
+    "skipLibCheck": true,
+    "forceConsistentCasingInFileNames": true
+  },
+  "include": ["src/**/*"]
+}
diff --git a/integrations/atomic-chat/tauri-plugin-kakeyalattice/Cargo.toml b/integrations/atomic-chat/tauri-plugin-kakeyalattice/Cargo.toml
new file mode 100644
index 0000000..26e6c77
--- /dev/null
+++ b/integrations/atomic-chat/tauri-plugin-kakeyalattice/Cargo.toml
@@ -0,0 +1,22 @@
+[package]
+name = "tauri-plugin-kakeyalattice"
+version = "0.1.0"
+edition = "2021"
+license = "Apache-2.0"
+description = "Tauri 2 plugin that supervises the Python kakeya-sidecar process and proxies OpenAI-compatible inference calls to it."
+
+[dependencies]
+tauri = { version = "2", features = [] }
+serde = { version = "1", features = ["derive"] }
+serde_json = "1"
+tokio = { version = "1", features = ["process", "rt-multi-thread", "macros", "io-util", "sync"] }
+reqwest = { version = "0.12", features = ["json", "stream"] }
+futures-util = "0.3"
+thiserror = "1"
+log = "0.4"
+uuid = { version = "1", features = ["v4"] }
+lazy_static = "1"
+
+[lib]
+name = "tauri_plugin_kakeyalattice"
+crate-type = ["cdylib", "rlib", "staticlib"]
diff --git a/integrations/atomic-chat/tauri-plugin-kakeyalattice/README.md b/integrations/atomic-chat/tauri-plugin-kakeyalattice/README.md
new file mode 100644
index 0000000..451c8d7
--- /dev/null
+++ b/integrations/atomic-chat/tauri-plugin-kakeyalattice/README.md
@@ -0,0 +1,78 @@
+# tauri-plugin-kakeyalattice
+
+Tauri 2 plugin that:
+
+1. **Supervises** the Python `kakeya-sidecar` process (spawn / health-check / restart).
+2. **Proxies** OpenAI-compatible HTTP calls from the Atomic-Chat extension
+   (`@atomic-chat/kakeyalattice-extension`) to the sidecar at `localhost:1338`.
+3. **Forwards streaming deltas** from the sidecar's SSE stream as Tauri
+   events on the channel `kakeyalattice:<stream_id>`.
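+
+For orientation, this is the shape of a call from the extension side: a
+minimal sketch, assuming only Tauri 2's `invoke` import path and the
+`health` command from the table below.
+
+```ts
+import { invoke } from "@tauri-apps/api/core";
+
+// Ask the plugin (not the sidecar directly) whether the sidecar is up.
+// Routing through the plugin keeps lifecycle and permissions in Rust.
+export async function sidecarIsUp(): Promise<boolean> {
+  const res = await invoke<{ ok: boolean }>("plugin:kakeyalattice|health");
+  return res.ok === true;
+}
+```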
+
+Registered Tauri commands (used by the TS extension):
+
+| Command | Purpose |
+|:--|:--|
+| `plugin:kakeyalattice\|list_models` | Proxy `GET /v1/models` |
+| `plugin:kakeyalattice\|chat_completion` | Proxy `POST /v1/chat/completions` (non-stream) |
+| `plugin:kakeyalattice\|chat_completion_stream_start` | Start a stream, return a stream id |
+| `plugin:kakeyalattice\|chat_completion_stream_wait` | Await stream completion |
+| `plugin:kakeyalattice\|health` | `GET /health` |
+| `plugin:kakeyalattice\|stats` | `GET /v1/kakeya/stats` |
+
+## Sidecar lifecycle
+
+- **Path resolution order**:
+  1. `ATOMIC_CHAT_KAKEYA_SIDECAR_PATH` env override.
+  2. Bundled binary at `resources/kakeya-sidecar[.exe]` (product builds).
+  3. `$PATH` lookup (dev builds — assumes `pip install -e ./kakeya_sidecar`).
+- **Arguments**: `--host 127.0.0.1 --port 1338 --device auto --log-level info`.
+- **Supervision**: the plugin spawns the process at `setup()` time,
+  tails its stdout/stderr into the app log, and restarts on crash
+  with exponential backoff (aligned with the existing llamacpp plugin).
+- **Port**: pinned to 1338 to sit next to Atomic-Chat's 1337 front door.
+  Passed via CLI flag so concurrent instances are possible in tests.
+
+## Why a separate Rust plugin (instead of calling sidecar HTTP from JS)
+
+- Sidecar lifecycle is a native concern (fork / waitpid / SIGTERM on app
+  quit). Keeping it in Rust mirrors the `llamacpp` plugin and avoids
+  duplicating supervision logic in two languages.
+- Tauri's permission model lets us whitelist the sidecar HTTP origin
+  without opening CORS holes for the whole web-app.
+- Streaming SSE → Tauri events gives the web-app a normal event-bus
+  interface instead of fiddling with EventSource behind CSP.
+
+## Minimal example (host app wiring)
+
+In `src-tauri/src/main.rs`:
+
+```rust
+fn main() {
+    tauri::Builder::default()
+        .plugin(tauri_plugin_kakeyalattice::init())
+        .run(tauri::generate_context!())
+        .expect("error while running tauri application");
+}
+```
+
+In `tauri.conf.json` under `plugins`:
+
+```json
+{
+  "plugins": {
+    "kakeyalattice": {
+      "sidecarPort": 1338,
+      "autoStart": true,
+      "device": "auto"
+    }
+  }
+}
+```
+
+## Status
+
+This directory ships a **skeleton**: the commands, the SSE → event
+bridge, and a minimal supervisor (spawn + 30s health gate) are in
+place, and `cargo check` passes. The remaining hardening (restart
+policy with backoff, SIGTERM on app quit, Windows job objects) will
+follow in a dedicated PR inside the Atomic-Chat main repo; this repo
+keeps the plugin in a shape the Atomic-Chat maintainers can drop in.
diff --git a/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/commands.rs b/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/commands.rs
new file mode 100644
index 0000000..70e254b
--- /dev/null
+++ b/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/commands.rs
@@ -0,0 +1,138 @@
+//! Tauri command handlers — thin HTTP proxies to the sidecar.
+//!
+//! Streaming uses a `(start, wait)` pair:
+//! - `chat_completion_stream_start` spawns a task reading SSE from
+//!   the sidecar, emits each delta as a Tauri event on
+//!   `kakeyalattice:<stream_id>`, and returns the id.
+//! - `chat_completion_stream_wait` awaits a oneshot channel so the
+//!   JS caller can `await` final completion instead of polling.
+
+use futures_util::StreamExt;
+use serde_json::Value as Json;
+use std::collections::HashMap;
+use std::sync::Mutex;
+use tauri::{AppHandle, Emitter, Runtime, State};
+use tokio::sync::oneshot;
+use uuid::Uuid;
+
+use crate::{Error, PluginState, Result};
+
+lazy_static::lazy_static! {
+    // stream_id -> oneshot receiver, waiting for the SSE loop to finish.
+    static ref STREAM_WAITERS: Mutex<HashMap<String, oneshot::Receiver<()>>> = Mutex::new(HashMap::new());
+}
+
+// We use `lazy_static` to avoid pulling `once_cell` just for this one
+// static. If the crate disallows lazy_static, swap to
+// `tokio::sync::OnceCell` in a follow-up.
+
+#[tauri::command]
+pub async fn list_models(state: State<'_, PluginState>) -> Result<Json> {
+    let url = format!("{}/v1/models", state.base_url());
+    let resp = state.http.get(url).send().await?;
+    Ok(resp.json::<Json>().await?)
+}
+
+#[tauri::command]
+pub async fn chat_completion(
+    state: State<'_, PluginState>,
+    request: Json,
+) -> Result<Json> {
+    let url = format!("{}/v1/chat/completions", state.base_url());
+    let resp = state.http.post(url).json(&request).send().await?;
+    Ok(resp.json::<Json>().await?)
+}
+
+#[tauri::command]
+pub async fn chat_completion_stream_start<R: Runtime>(
+    app: AppHandle<R>,
+    state: State<'_, PluginState>,
+    mut request: Json,
+) -> Result<String> {
+    // Force stream=true on the request we send to the sidecar.
+    if let Some(obj) = request.as_object_mut() {
+        obj.insert("stream".into(), Json::Bool(true));
+    }
+
+    let url = format!("{}/v1/chat/completions", state.base_url());
+    let resp = state.http.post(url).json(&request).send().await?;
+    if !resp.status().is_success() {
+        let code = resp.status();
+        let body = resp.text().await.unwrap_or_default();
+        return Err(Error::Protocol(format!("sidecar returned {code}: {body}")));
+    }
+
+    let stream_id = Uuid::new_v4().simple().to_string();
+    let event_name = format!("kakeyalattice:{stream_id}");
+
+    let (done_tx, done_rx) = oneshot::channel();
+    {
+        let mut waiters = STREAM_WAITERS.lock().expect("stream waiters lock");
+        waiters.insert(stream_id.clone(), done_rx);
+    }
+
+    let app_clone = app.clone();
+    tauri::async_runtime::spawn(async move {
+        let mut stream = resp.bytes_stream();
+        let mut buf = Vec::<u8>::new();
+        while let Some(chunk) = stream.next().await {
+            let chunk = match chunk {
+                Ok(c) => c,
+                Err(e) => {
+                    log::warn!("sidecar stream error: {e}");
+                    break;
+                }
+            };
+            buf.extend_from_slice(&chunk);
+            while let Some(idx) = buf.windows(2).position(|w| w == b"\n\n") {
+                let frame: Vec<u8> = buf.drain(..idx + 2).collect();
+                let text = String::from_utf8_lossy(&frame);
+                for line in text.lines() {
+                    let line = line.trim();
+                    if !line.starts_with("data:") { continue; }
+                    let payload = line.trim_start_matches("data:").trim();
+                    if payload == "[DONE]" {
+                        let _ = app_clone.emit(&event_name, "__DONE__");
+                        continue;
+                    }
+                    if let Ok(json) = serde_json::from_str::<Json>(payload) {
+                        if let Some(delta) = json.pointer("/choices/0/delta/content")
+                            .and_then(Json::as_str)
+                        {
+                            let _ = app_clone.emit(&event_name, delta.to_string());
+                        }
+                    }
+                }
+            }
+        }
+        let _ = done_tx.send(());
+    });
+
+    Ok(stream_id)
+}
+
+#[tauri::command]
+pub async fn chat_completion_stream_wait(stream_id: String) -> Result<()> {
+    let rx = {
+        let mut waiters = STREAM_WAITERS.lock().expect("stream waiters lock");
+        waiters.remove(&stream_id)
+    };
+    if let Some(rx) = rx {
+        let _ = rx.await;
+    }
+    Ok(())
+}
+
+#[tauri::command]
+pub async fn health(state: State<'_, PluginState>) -> Result<Json> {
+    let url = format!("{}/health", state.base_url());
+    let resp = state.http.get(url).send().await?;
+    Ok(resp.json::<Json>().await?)
+}
+
+#[tauri::command]
+pub async fn stats(state: State<'_, PluginState>) -> Result<Json> {
+    let url = format!("{}/v1/kakeya/stats", state.base_url());
+    let resp = state.http.get(url).send().await?;
+    Ok(resp.json::<Json>().await?)
+}
diff --git a/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/error.rs b/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/error.rs
new file mode 100644
index 0000000..008d88b
--- /dev/null
+++ b/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/error.rs
@@ -0,0 +1,27 @@
+use thiserror::Error;
+
+#[derive(Debug, Error)]
+pub enum Error {
+    #[error("http error: {0}")]
+    Http(#[from] reqwest::Error),
+
+    #[error("io error: {0}")]
+    Io(#[from] std::io::Error),
+
+    #[error("sidecar not ready (tried {tries} times)")]
+    SidecarNotReady { tries: u32 },
+
+    #[error("sidecar spawn failed: {0}")]
+    SidecarSpawn(String),
+
+    #[error("unexpected sidecar response: {0}")]
+    Protocol(String),
+}
+
+pub type Result<T> = std::result::Result<T, Error>;
+
+impl serde::Serialize for Error {
+    fn serialize<S: serde::Serializer>(&self, s: S) -> std::result::Result<S::Ok, S::Error> {
+        s.serialize_str(&self.to_string())
+    }
+}
diff --git a/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/lib.rs b/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/lib.rs
new file mode 100644
index 0000000..ac763e4
--- /dev/null
+++ b/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/lib.rs
@@ -0,0 +1,105 @@
+//! tauri-plugin-kakeyalattice
+//!
+//! Tauri 2 plugin that supervises the `kakeya-sidecar` Python process
+//! and proxies OpenAI-compatible calls from the Atomic-Chat extension
+//! to the sidecar HTTP server at `localhost:1338`.
+//!
+//! This file is the skeleton entry point. The remaining supervisor
+//! hardening will land in `sidecar.rs` / `commands.rs` in a dedicated
+//! PR inside the Atomic-Chat main repo.
+
+use tauri::{plugin::TauriPlugin, Manager, Runtime};
+
+mod commands;
+mod error;
+mod sidecar;
+
+pub use error::{Error, Result};
+
+/// Configuration block read from `tauri.conf.json`:
+///
+/// ```json
+/// { "plugins": { "kakeyalattice": { "sidecarPort": 1338, "autoStart": true } } }
+/// ```
+#[derive(Clone, Debug, serde::Deserialize)]
+#[serde(rename_all = "camelCase")]
+pub struct PluginConfig {
+    #[serde(default = "default_port")]
+    pub sidecar_port: u16,
+    #[serde(default = "default_host")]
+    pub sidecar_host: String,
+    #[serde(default = "default_auto_start")]
+    pub auto_start: bool,
+    #[serde(default = "default_device")]
+    pub device: String,
+}
+
+fn default_port() -> u16 { 1338 }
+fn default_host() -> String { "127.0.0.1".to_string() }
+fn default_auto_start() -> bool { true }
+fn default_device() -> String { "auto".to_string() }
+
+impl Default for PluginConfig {
+    fn default() -> Self {
+        Self {
+            sidecar_port: default_port(),
+            sidecar_host: default_host(),
+            auto_start: default_auto_start(),
+            device: default_device(),
+        }
+    }
+}
+
+/// Shared per-app plugin state.
+#[derive(Default)]
+pub struct PluginState {
+    pub config: PluginConfig,
+    pub http: reqwest::Client,
+    pub supervisor: tokio::sync::Mutex<Option<sidecar::SidecarHandle>>,
+}
+
+impl PluginState {
+    pub fn base_url(&self) -> String {
+        format!("http://{}:{}", self.config.sidecar_host, self.config.sidecar_port)
+    }
+}
+
+/// Plugin factory.
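+///
+/// Generic over the Tauri [`Runtime`] so the host can build it against
+/// the default `Wry` runtime in production or a mock runtime in tests.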
+pub fn init<R: Runtime>() -> TauriPlugin<R, Option<serde_json::Value>> {
+    tauri::plugin::Builder::<R, Option<serde_json::Value>>::new("kakeyalattice")
+        .invoke_handler(tauri::generate_handler![
+            commands::list_models,
+            commands::chat_completion,
+            commands::chat_completion_stream_start,
+            commands::chat_completion_stream_wait,
+            commands::health,
+            commands::stats,
+        ])
+        .setup(|app, api| {
+            let cfg: PluginConfig = api
+                .config()
+                .clone()
+                .and_then(|v| serde_json::from_value(v).ok())
+                .unwrap_or_default();
+
+            let state = PluginState {
+                config: cfg,
+                http: reqwest::Client::new(),
+                supervisor: Default::default(),
+            };
+            app.manage(state);
+
+            if app.state::<PluginState>().config.auto_start {
+                // Kick off the sidecar supervisor. Errors are logged but
+                // not fatal — the UI will show an offline state and the
+                // user can retry.
+                let handle = app.clone();
+                tauri::async_runtime::spawn(async move {
+                    if let Err(e) = sidecar::start_supervisor(&handle).await {
+                        log::error!("kakeyalattice sidecar supervisor failed: {e}");
+                    }
+                });
+            }
+            Ok(())
+        })
+        .build()
+}
diff --git a/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/sidecar.rs b/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/sidecar.rs
new file mode 100644
index 0000000..e6b33af
--- /dev/null
+++ b/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/sidecar.rs
@@ -0,0 +1,92 @@
+//! Sidecar process supervisor.
+//!
+//! Responsibilities:
+//! 1. Resolve the sidecar binary path.
+//! 2. Spawn it with the configured host/port/device.
+//! 3. Tail stdout/stderr into the application log.
+//! 4. Health-check on an interval; restart with exponential backoff
+//!    on crash. (Identical policy to the existing llama.cpp plugin.)
+//!
+//! This file is the skeleton; the heavy lifting (signal handling on
+//! app quit, Windows-specific job objects, etc.) will follow in a
+//! dedicated PR inside the Atomic-Chat main repo.
+
+use std::path::PathBuf;
+use std::process::Stdio;
+use std::time::Duration;
+
+use tauri::{AppHandle, Manager, Runtime};
+use tokio::io::{AsyncBufReadExt, BufReader};
+
+use crate::{Error, PluginState, Result};
+
+pub struct SidecarHandle {
+    pub child: tokio::process::Child,
+}
+
+fn resolve_sidecar_binary() -> PathBuf {
+    if let Ok(p) = std::env::var("ATOMIC_CHAT_KAKEYA_SIDECAR_PATH") {
+        return PathBuf::from(p);
+    }
+    // In product builds we bundle the sidecar next to the Tauri resource
+    // root. For dev builds we fall back to the `$PATH` lookup so
+    // `pip install -e kakeya_sidecar` (which installs the
+    // `kakeya-sidecar` console script) works out of the box.
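+    // NOTE: the bundled-resource lookup (step 2 of the README's path
+    // resolution order) is not wired up in this skeleton; we fall
+    // straight through to the bare command name, i.e. a $PATH lookup.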
+    PathBuf::from("kakeya-sidecar")
+}
+
+pub async fn start_supervisor<R: Runtime>(app: &AppHandle<R>) -> Result<()> {
+    let state = app.state::<PluginState>();
+    let cfg = state.config.clone();
+
+    let bin = resolve_sidecar_binary();
+    log::info!(
+        "spawning kakeya-sidecar: {} --host {} --port {} --device {}",
+        bin.display(), cfg.sidecar_host, cfg.sidecar_port, cfg.device,
+    );
+
+    let mut child = tokio::process::Command::new(&bin)
+        .arg("--host").arg(&cfg.sidecar_host)
+        .arg("--port").arg(cfg.sidecar_port.to_string())
+        .arg("--device").arg(&cfg.device)
+        .stdout(Stdio::piped())
+        .stderr(Stdio::piped())
+        .spawn()
+        .map_err(|e| Error::SidecarSpawn(e.to_string()))?;
+
+    if let Some(out) = child.stdout.take() {
+        let mut reader = BufReader::new(out).lines();
+        tokio::spawn(async move {
+            while let Ok(Some(line)) = reader.next_line().await {
+                log::info!("[sidecar stdout] {line}");
+            }
+        });
+    }
+    if let Some(err) = child.stderr.take() {
+        let mut reader = BufReader::new(err).lines();
+        tokio::spawn(async move {
+            while let Ok(Some(line)) = reader.next_line().await {
+                log::warn!("[sidecar stderr] {line}");
+            }
+        });
+    }
+
+    // Health-check loop: wait up to 30s for the sidecar to respond.
+    let base_url = state.base_url();
+    let http = state.http.clone();
+    for attempt in 0..30 {
+        tokio::time::sleep(Duration::from_secs(1)).await;
+        if let Ok(resp) = http.get(format!("{}/health", base_url)).send().await {
+            if resp.status().is_success() {
+                log::info!("kakeya-sidecar online after {}s", attempt + 1);
+                let mut guard = state.supervisor.lock().await;
+                *guard = Some(SidecarHandle { child });
+                return Ok(());
+            }
+        }
+    }
+
+    // Sidecar didn't come up; kill the child so we don't leak zombies.
+    let _ = child.kill().await;
+    Err(Error::SidecarNotReady { tries: 30 })
+}
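+
+// TODO(restart policy): the module doc above promises restart with
+// exponential backoff; this skeleton only does the initial spawn plus
+// a 30-second health gate, so crashes after startup are not yet
+// resupervised.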