diff --git a/docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md b/docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md new file mode 100644 index 0000000..3f3ccc4 --- /dev/null +++ b/docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md @@ -0,0 +1,635 @@ +# Atomic-Chat × KakeyaLattice v1.5 — 本地 Mac 部署集成架构 + +**Date**: 2026-04-30 +**Scope**: 把 KakeyaLattice v1.5 (E8 nested-lattice KV-cache codec) 作为核心 +压缩层嵌进 [`AtomicBot-ai/Atomic-Chat`](https://github.com/AtomicBot-ai/Atomic-Chat) +的本地推理栈,支撑 Qwen3 / Llama-3.x / Gemma-3/4 / DeepSeek-R1-Distill / +GLM-4 / Mistral 等多个开源模型在 **Mac (Apple Silicon, Metal)** 上的 +离线部署。 + +本文档是"分析 + 设计"一体的规划稿。代码骨架落在 +[`integrations/atomic-chat/`](../integrations/atomic-chat/)。 + +--- + +## 1. Atomic-Chat 现状架构(先分析,再集成) + +从官方 README / CONTRIBUTING 与配套项目 [atomic.chat](https://atomic.chat/) +可以还原出下面这张图(简化,聚焦"推理后端接入点")。 + +``` +┌────────────────────────────────────────────────────────────┐ +│ Web App (React + Redux + React Router + Vite + TS) │ +│ - 对话 UI、模型管理 UI、Assistant 编排 │ +└────────────┬───────────────────────────┬───────────────────┘ + │ imports │ imports + ▼ ▼ + ┌──────────────────────┐ ┌──────────────────────────┐ + │ Core SDK (core/) │ │ Extensions (extensions/) │ + │ - TypeScript APIs │◄────│ - assistant-extension │ + │ - Extension System │ uses│ - conversation-extension │ + │ - Event Bus │ │ - download-extension │ + │ - Type Definitions │ │ - llamacpp-extension ◄──┼── 本地推理 + └──────────┬───────────┘ └───────────┬──────────────┘ + │ │ + │ Tauri IPC (invoke) │ + └──────────────┬───────────────┘ + ▼ + ┌──────────────────────────────────────────────────────────┐ + │ Tauri Backend (Rust, src-tauri/) │ + │ - Window / IPC / 安全 / 文件系统 │ + └──────────────────────────────────────────────────────────┘ + ▲ + │ 提供能力 + ┌──────────────────────────────────────────────────────────┐ + │ Tauri Plugins (src-tauri/plugins/) │ + │ - hardware (CPU/GPU/RAM 探测) │ + │ - llamacpp (llama.cpp 进程管理、Metal/CUDA 后端、推理) │ + └──────────────────────────────────────────────────────────┘ + │ + ▼ + OpenAI 兼容本地 HTTP 服务 @ localhost:1337 + (/v1/models, /v1/chat/completions …) +``` + +关键事实: + +| 维度 | 现状 | +|:-----|:-----| +| Shell | **Tauri**(不是 Electron)| +| 推理后端 | 只有一个:`llamacpp-extension` + `plugins/llamacpp` 驱动 **llama.cpp** 进程 | +| 模型格式 | GGUF(主力),宣传支持 MLX / ONNX(由 llama.cpp 生态/路线图覆盖)| +| 模型来源 | 直连 HuggingFace 下载 | +| 硬件加速 | Mac 走 Metal(`xcodebuild -downloadComponent MetalToolchain`),Windows x64 走 CPU / CUDA | +| 对外 API | OpenAI 兼容的 `localhost:1337`,其他程序可直接打 | +| Agent 能力 | MCP (Model Context Protocol) 集成,跑 agentic workflow | +| 营销口径 | 站点声称 *"Google TurboQuant built-in"* — 指的是 KV 压缩 | + +模型下载到推理的主流程(文件系统级): + +``` +User clicks download + → web-app 发请求 + → download-extension 决定源 + → Tauri backend 落盘 + → llamacpp-extension 注册模型 + → plugins/llamacpp 起 llama.cpp 子进程 + → OpenAI 兼容 /v1 路由暴露 +``` + +--- + +## 2. 硬冲突:为什么不能把 KakeyaLattice 塞进 llama.cpp + +要"把 KakeyaLattice 作为核心架构"集成,先认清一个工程事实。 + +**KakeyaLattice v1.5 是 PyTorch-first 的 GPU 原生实现。** 其核心算子 +(`V15KakeyaZamirE8GPU.roundtrip`) 依赖: + +- Sylvester–Hadamard rotation(任意长度 2^k,PyTorch `matmul`) +- E8 Conway–Sloane closest-point(两 coset 评估 + `argmin`) +- Per-vector adaptive `q_max`(`amax` + `clamp`) + +最干净的宿主是 HuggingFace `transformers`:`kakeyalattice.hf.KakeyaLatticeCache` +已经是 `DynamicCache` 的一级子类,`model.generate(past_key_values=cache)` 一行接入。 + +反之 **llama.cpp 没有可插拔的"KV-cache 量化策略"的对外接口**: +1. `kv_cache` 是 C++ 内部结构,量化类型 (`q4_0`, `q8_0`, `f16`) 写死在 `ggml_type`。 +2. 没有 Python hook,改造需要实现一套 E8 的 C++/Metal 内核并 patch `llama_kv_cache_unified`。 +3. 
即便改造出来,也无法复用 KakeyaLattice 现有的 bit-parity 测试、ablation + benchmark、`rigorous_eval.py` 全套回归。 + +所以"核心架构"集成的工程现实是 — **把 KakeyaLattice 做成与 llama.cpp +平级的第二个本地推理后端**,由 Atomic-Chat 的 Extension System 负责路由。 + +--- + +## 3. 集成目标架构 + +在 Atomic-Chat 既有架构上加"右臂",与 llama.cpp 的"左臂"并列: + +``` +┌────────────────────────────────────────────────────────────┐ +│ Web App (unchanged) │ +└────────────────────┬──────────────────────────────────────┘ + ▼ + Core SDK (unchanged) + Extensions + ├─ llamacpp-extension (既有, GGUF) + └─ kakeyalattice-extension ★ 新增 + │ + │ Tauri IPC + ▼ + Tauri Plugins + ├─ plugins/llamacpp (既有) + └─ plugins/kakeyalattice ★ 新增 + │ + │ 管理 Python sidecar 进程生命周期 + ▼ + Kakeya Sidecar (localhost:1338, OpenAI 兼容) + ├─ HuggingFace transformers + torch MPS + ├─ KakeyaLatticeCache (E8, per-model Q) + └─ /v1/models, /v1/chat/completions (stream) + │ + ▼ + 本地模型仓库 (HF safetensors) + Qwen3-4B / Llama-3.2-3B / Gemma-4-E4B / + DeepSeek-R1-Distill-1.5B / GLM-4-9B-Chat / Mistral-7B … +``` + +在用户视角: + +- 既有的 Atomic-Chat UI 不需要重画。模型选择 UI 多一个 "Backend" 过滤器 + (`llama.cpp` / `KakeyaLattice (E8)`)。 +- 原 `localhost:1337` 继续由 Atomic-Chat 前门承担,内部按选择的模型 + 将 OpenAI 请求路由到 llama.cpp 或 Kakeya sidecar。 +- **用户只感觉到一件事**:选了 `KakeyaLattice` 后端后,长上下文占用的 + 内存少一截(E8 Q=10 ~3.37×,Q=4 ~4.57×),质量损失可控(|Δppl|<7% 在 + Qwen3/Gemma/GLM),能塞进 Mac 16/32GB 的上限里。 + +--- + +## 4. 关键设计决策 + +### 4.1 为什么 Python sidecar,而不是 Rust FFI + +方案对比: + +| 方案 | 优点 | 致命缺点 | +|:-----|:-----|:---------| +| Rust 直接调 libtorch / tch-rs | 无进程边界 | torch MPS ABI 不稳定;transformers 的 cache / generate 逻辑在 Python,不能照搬;维护成本爆炸 | +| **Python sidecar**(本方案)| 直接复用 `kakeyalattice.hf.KakeyaLatticeCache`;transformers 自带所有模型支持;Metal 通过 torch MPS 零改动 | 多一个进程(由 Tauri plugin 托管,用户无感)| +| 改 llama.cpp | 与 Atomic-Chat 现架构对齐 | 前述 §2,实现成本不可接受 | + +选 sidecar。用 `stdio` / `localhost:1338` 与 Tauri 通信,plugin 负责起/停/健康检查/日志转发。这套模式 llama.cpp 自己就是这么跑的(llama.cpp 也是独立进程)。 + +### 4.2 Metal (MPS) 而不是 CUDA + +KakeyaLattice v1.5 的 `V15KakeyaZamirE8GPU` 构造器签名: + +```python +V15KakeyaZamirE8GPU(D=head_dim, q_range=Q, device="cuda") +``` + +Mac 上只要把 `device="cuda"` 换成 `device="mps"` 就能跑 — 因为所有算子都是 +标准 PyTorch(`matmul`、`argmin`、`amax`、`clamp`、`round`),MPS backend 全部支持。 +这是整个集成最幸运的点:**不需要写任何 Metal shader**。 + +> 验证方式:`integrations/atomic-chat/kakeya_sidecar/tests/test_mps_smoke.py` +> 会在 Apple Silicon 上跑 E8 roundtrip,对齐 CUDA 参考输出(相对误差 < 1e-5)。 + +性能预期(vs v1.5 on H200 @ 551µs/2048-vec slice): +- M2 Pro / M3 Pro (16GB):codec 开销约 4-8 ms/slice,decode step 整体 15-30ms + 的 20-30%。不是热点,但不可忽略。 +- 优化方向(未来 PR):Triton 不行,直接写 Metal Performance Shaders (MPS) + 的 fused E8 closest-point。当前 sidecar 预留接口 `codec.set_backend("metal")`。 + +### 4.3 per-model Q 档位(来自 v1.5 报告) + +直接复用 `reports/v1_5_release/V15_FULL_4MODEL_REPORT.md` 的结论,每个模型 +落成一张 "deployment profile" JSON,见 `kakeya_sidecar/model_registry.py`: + +| Model (HF id) | head_dim | variant | 推荐 Q | CR | |Δppl| (v1.5 实测) | 备注 | +|:--|:-:|:-:|:-:|:-:|:-:|:--| +| Qwen/Qwen3-4B | 128 | e8 | 10 | 3.37× | 3.85% | 平衡点 | +| Qwen/Qwen3-4B | 128 | e8 | 38 | ~2.5× | <1% (paper) | 近无损 | +| meta-llama/Llama-3.2-3B-Instruct | 128 | e8 | 10 | 3.37× | 待测 (同类) | 报告未直测 | +| google/gemma-4-E4B | 256/512 | e8 | 10 | 3.47× | 1.56% | 异构 head_dim 已验证 | +| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 128 | e8 | 10 | 3.37× | 需 boundary≥2 | 小模型结构敏感 | +| zai-org/GLM-4-9B-Chat | 128 | e8 | 10 | 3.37× | 6.96% | L=40,可用 | +| mistralai/Mistral-7B-Instruct-v0.3 | 128 | e8 | 10 | 3.37× | 待测 (同类) | head_dim 兼容 | + +准则: +- **默认 Q=38**:近无损,Mac 首选(用户感知最稳)。 +- **长上下文 / 内存吃紧时 Q=10**:3.37× 压缩,|Δppl|<7% 在主流 
7B~9B。
- **Q=4 不作为默认**：报告里 GLM Q=4 |Δppl|=32%，只在 UI 高级选项里暴露。
- **DeepSeek-R1-Distill 家族强制 `boundary=2`**：避开报告里记录的
  no-boundary 灾难模式。

### 4.4 长上下文是真正的卖点

Mac 本地部署里"KV cache 撑爆内存"是高频失败。举例 Llama-3.2-3B 在 32k
ctx 的 KV：`2 (K+V) × L=28 × num_kv_heads=8 × head_dim=128 × 32768 tokens
× 2 B (bf16) ≈ 3.76 GB`。把它压 3.37× 后变约 1.1 GB，Mac mini M2 16GB 就够跑。

这是"接 KakeyaLattice"带来的**用户可感知价值**，不是单纯的 PR 演示。

---

## 5. 多模型支持矩阵

所有 `head_dim` 为 2 的幂且 ≥ 8 的模型（自然满足 `head_dim % 8 == 0`）
都即插即用。对不兼容的模型（如 GPT-NeoX 的 head_dim=96），
`KakeyaLatticeCache` 在 `strict=True` 下会于启动时报错，由 extension
捕获后 fallback 到 `llama.cpp` 后端。

| Family | head_dim | E8 OK | 备注 |
|:--|:-:|:-:|:--|
| Llama-3.x (1B/3B/8B) | 128 | ✅ | Mac 主力 |
| Qwen2 / Qwen3 (1.5B/4B/7B) | 128 | ✅ | v1.5 报告首发 |
| Mistral / Mixtral | 128 | ✅ | |
| Gemma-3 / Gemma-4 | 256 (+512 MatFormer) | ✅ | 异构 head_dim 在 KakeyaLatticeCache 已覆盖 |
| DeepSeek-R1-Distill 系列 (1.5B/7B/14B) | 128 | ✅ | 强制 boundary=2 |
| DeepSeek-V2/V3 (MLA) | 128 + 64 rope | ⚠️ | 需要 MLA 专用路径，下一步 |
| GLM-4-9B-Chat | 128 | ✅ | |
| Phi-3 (4k/128k) | 96 | ✗ | 跳过 → llama.cpp |
| GPT-NeoX 旧模型 | 96 | ✗ | 跳过 → llama.cpp |

---

## 6. API 设计 — OpenAI 兼容

Sidecar 暴露最小 OpenAI 子集，Atomic-Chat 的 `localhost:1337` 前门只做 URL 改写。

### `GET /v1/models`
返回当前已加载 + 可加载的模型；每条多一个 `x_kakeya` 扩展字段：

```json
{
  "object": "list",
  "data": [
    {
      "id": "qwen3-4b@e8-q10",
      "object": "model",
      "owned_by": "kakeyalattice",
      "x_kakeya": {
        "variant": "e8",
        "q_range": 10,
        "boundary": 0,
        "est_compression": 3.37,
        "est_delta_ppl_pct": 3.85
      }
    }
  ]
}
```

### `POST /v1/chat/completions`
完全兼容 OpenAI 请求格式。sidecar 接到后：
1. 按 `model` 字段解析 `<short_id>@<variant>-q<Q>[-b<boundary>]`
   （如 `qwen3-4b@e8-q10`）；
2. lazy-load 模型 + `KakeyaLatticeCache`（LRU，默认同时只驻留 1 个，
   `--max-resident` 可调）；
3. `model.generate(..., past_key_values=cache, streamer=...)` 跑出
   token 流；
4. 转成 OpenAI `chat.completion.chunk` SSE 流。

### `GET /v1/kakeya/stats`
扩展接口，给 UI 用，返回：
- 当前 cache 的 `codec_fired_per_layer` / `skip_fired_per_layer`
- 当前 KV HBM/URAM footprint 估算
- 上一次 decode step 的 codec 占用 %

---

## 7. 安装 & 运行（开发者视角）

本仓库提供的骨架可以**直接拉到 Atomic-Chat 仓库 `extensions/` 下使用**。

```bash
# 1. 装 Python sidecar
cd integrations/atomic-chat/kakeya_sidecar
pip install -e ".[mac]"
kakeya-sidecar --port 1338 --device mps --prewarm qwen3-4b@e8-q10

# 2. 拷扩展到 Atomic-Chat 仓库
cp -R integrations/atomic-chat/kakeyalattice-extension \
      ~/Atomic-Chat/extensions/
# 在 web-app 的 extension registry 里注册一行，重新 `make dev` 即可

# 3. 拷 Tauri plugin
cp -R integrations/atomic-chat/tauri-plugin-kakeyalattice \
      ~/Atomic-Chat/src-tauri/plugins/kakeyalattice
# 在 src-tauri/tauri.conf.json + Cargo.toml 加 plugin 注册
```

打包到 Atomic-Chat 安装包时，sidecar 需要用 `PyOxidizer` 或 `briefcase`
打成独立可执行，与 `llama.cpp` 二进制并列塞进 `.dmg`。本仓库不做
端到端打包（Atomic-Chat 仓库自己的 CI 做），只保证 sidecar 本身能跑。

---

## 8. 与项目原有 vLLM 路径的关系

本仓库 `vllm_backend/kakeya_v1_4_snapshot/` 是 **Linux + CUDA + vLLM** 的
snapshot-hook 插件路径，用于在 H200 上跑 `benchmarks/rigorous_eval.py`。

Atomic-Chat 集成 **不复用这条路径**，原因：
- vLLM 在 Mac 上没有 Metal 后端，跑不起来。
- vLLM 的 PagedAttention + KV cache 是 CUDA 内核，snapshot hook 依赖 CUDA tensor。

二者关系：
- `vllm_backend/` → 研究/benchmark 用，CI 数据来源，**保持不动**。
- `integrations/atomic-chat/kakeya_sidecar/` → 产品用，Mac 部署，新增路径。

两者共用的唯一组件是 `kakeyalattice` Python 包本身（纯 PyTorch，device 无关）。

---

## 9. 
路线图 + +| 阶段 | 内容 | 依赖 | +|:-:|:--|:--| +| M1 | Sidecar MVP:OpenAI `/v1/chat/completions` 流式 + Qwen3/Llama3/Gemma/Mistral/DeepSeek/GLM 6 模型单测 | 本 PR 落地 | +| M2 | kakeyalattice-extension + Tauri plugin 接入 Atomic-Chat,UI 增加 "Backend: KakeyaLattice" 选项 | M1 | +| M3 | Metal fused E8 closest-point kernel,codec 延迟降到 bf16 decode 的 < 5% | Metal Shading Language 专家 | +| M4 | `KakeyaLatticeCache` 切换到真·存索引模式(非 roundtrip),HBM 节省从 "nominal" 转为 "实打实" | 自定义 attention kernel | +| M5 | 与 Atomic-Chat 的 MCP/agent 链路打通:long-context 检索可在 compressed KV 上直接跑 NIAH 风格取值 | M1 + M2 | + +--- + +## 10. 风险 & 取舍 + +| 风险 | 缓解 | +|:--|:--| +| MPS 算子在 torch 2.x 有偶发 bug(`argmin` 在非连续 tensor 上) | sidecar 在 codec 前强制 `.contiguous()`,已经放进 smoke test | +| sidecar 进程崩掉导致整台 Atomic-Chat 推理断线 | Tauri plugin 负责 supervise + 自动重启(对齐 llama.cpp plugin 现有行为) | +| 用户误选 Q=4 在小模型上(如 DeepSeek-1.5B no-boundary 已知 5.4 万 % PPL) | UI 只暴露 `safe` Q 档;`aggressive` 档进高级菜单 + 警示 | +| HF 模型许可协议(Llama-3 门禁、GLM-4 的自有 license)| 沿用 Atomic-Chat 既有机制,sidecar 不做任何绕过 | +| `KakeyaLatticeCache` 的 `roundtrip` 落在 bf16 阵上,实际 HBM 节省是 "nominal"(见 `kakeyalattice/README.md` §What it is NOT) | M4 前,主卖点是**长上下文的 attention 质量下限**,不是 HBM 压缩率。UI 文案要诚实 | + +--- + +## 11. 与 atomic.chat 宣传口径的对齐 + +atomic.chat 首页写的是 *"Google TurboQuant built-in"*。按 v1.5 报告: + +- v1.5 E8 Q=4 (CR 4.57×) **vs** TQ b=3 (CR 4.92×):v1.5 在 4 个模型上全面 + 赢 3-6× 更低 |Δppl|。 +- TQ b=2 (CR 7.11×) 在 4 个模型上结构性不可用(\|Δppl\| > 100% 到 14 万%)。 + +所以这次集成对 Atomic-Chat 产品而言是**把宣传里的 "TurboQuant built-in" +换成工程上更靠谱的 E8 nested-lattice**。两者可以共存(用 `backend` 参数 +切),但在默认档位上 KakeyaLattice 明显更稳。 + +--- + +## 附:文件对应表 + +本 PR 新增的文件与它们在 Atomic-Chat 仓库里的最终归宿: + +| 本仓库路径 | Atomic-Chat 仓库映射 | +|:--|:--| +| `integrations/atomic-chat/kakeya_sidecar/` | 独立安装,sidecar 包装后塞进 `.dmg` | +| `integrations/atomic-chat/kakeyalattice-extension/` | 放到 `extensions/` | +| `integrations/atomic-chat/tauri-plugin-kakeyalattice/` | 放到 `src-tauri/plugins/kakeyalattice/` | +| `docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md` | 存本仓库作设计依据 | + +--- + +## 12. DFlash (block diffusion speculative decoding) 集成路径 + +### 12.1 DFlash 是什么,和 KakeyaLattice 的关系 + +[**DFlash** (z-lab, arXiv:2602.06036, 2026-02)](https://github.com/z-lab/dflash) 是 +block-diffusion 架构的 **speculative-decoding drafter**: 小 draft 模型 +在单次 forward 里并行 draft 一整块 token (block_size=16),target LLM +一次并行 verify,总体在 Qwen3-8B 上达 6× 无损加速 (2.5× 超过 EAGLE-3)。 +官方支持四后端: **Transformers / SGLang / vLLM-nightly / MLX (Apple +Silicon)**。 + +DFlash 与 KakeyaLattice **正交**: + +| | 改什么 | 省什么 | 质量承诺 | +|:-|:-|:-|:-| +| DFlash | 解码循环(draft 一次出 16 token) | **解码步数(时间)** | lossless (verify 回退 target 分布) | +| KakeyaLattice | 每步内部的 KV 表示 | **每步 KV HBM / 带宽(空间)** | 近无损 Q=38 <1%; Q=10 <7% on 3/4 models | + +叠加后理论效果: `总加速 ≈ DFlash 3-6× × KakeyaLattice KV 节省 → 更长 ctx × +18-24% 并发`。 + +### 12.2 叠加的关键交互 + +1. **Verify 阶段是 mini-prefill**, 16 token 一次并行过 codec, codec + 开销自动摊销到 1/16 — MPS codec 占比从单轨 decode 的 20-30% 降到 1-2%. +2. **Draft 模型的 KV 也是压缩目标**: DFlash MLX 已暴露 + `sliding_window_size` bound draft KV growth; 用 KakeyaLattice Q=10 + 激进档压缩 draft KV 完全安全, 因为 verify 由 target 正确分布兜底. + 这是"draft 激进压 / target 保守压"的分层策略, 只有在推测解码框架下 + 才能成立. +3. **Acceptance rate 风险**: target 走 Kakeya Q=38 会引入 <1% |Δppl|, + 期望 acceptance 掉 <1pp; 需实测 `z-lab/Qwen3-8B-DFlash-b16 + + Qwen3-8B + KakeyaLatticeCache(e8, q=38)` 对比 pure bf16 baseline, + 看 acceptance-length 分布. +4. **MLX 路径阻碍**: `KakeyaLatticeCache` 是 `transformers.DynamicCache` + 子类, 不能直接接 MLX 的 `RotatingKVCache` / `StandardKVCache`. 
+ 需要把 E8 codec 移到 MLX — 工作量约 1-2 文件 / 200-300 行 MLX 代码, + 无自定义 CUDA/Metal kernel. + +### 12.3 方案矩阵 (把 DFlash 加进去) + +| 方案 | Platform | 加速栈 | KV 压缩 | 工程成本 | +|:-|:-|:-|:-|:-| +| **B1** Mac sidecar, HF + MPS (本 PR) | Mac | 单轨 AR | `KakeyaLatticeCache` 现成 | ✅ 已落地 | +| **B2** Mac sidecar, MLX + DFlash + KakeyaLattice-MLX | Mac | DFlash 3-6× | KakeyaLattice MLX port | 中 | +| **C** 本地 vLLM + KakeyaLattice snapshot | Linux/CUDA | vLLM | 既有 plugin | 已有 | +| **C2** 本地 vLLM nightly + DFlash spec-config + KakeyaLattice | Linux/CUDA | vLLM + DFlash | 既有 plugin + 两插件共存调试 | 中-高 | + +B2 是本 PR 叙事链中的 "phase 2", 作为**独立后续 PR** 推进。 + +### 12.4 B2 路线图 (后续 PR) + +| 里程碑 | 产出 | +|:-|:-| +| M1 | `kakeyalattice_mlx/` — E8 codec 的纯 MLX 实现 + 与 PyTorch 参考的 bit-level parity | +| M2 | `KakeyaLatticeMLXCache` — mlx-lm KV cache 包装层 | +| M3 | `kakeya_sidecar_mlx/` — OpenAI 兼容 MLX sidecar, 加载 target + 可选 DFlash draft | +| M4 | 接 DFlash: `dflash.model_mlx.stream_generate` 内置 target KV 压缩 | +| M5 | acceptance-rate benchmark: `Qwen3-8B + z-lab/Qwen3-8B-DFlash-b16` × `{bf16, Kakeya Q=38, Q=10}` | +| M6 | Atomic-Chat extension 追加 "KakeyaLattice (MLX+DFlash)" backend 选项 | + +### 12.5 `DeploymentProfile` 预留的 `dflash_draft_repo` 字段 + +`kakeya_sidecar/kakeya_sidecar/model_registry.py` 在 B2 PR 里会补上: + +```python +dflash_draft_repo: str | None = None # e.g. "z-lab/Qwen3-8B-DFlash-b16" +``` + +对应已发布的 DFlash draft 模型清单: + +| target | DFlash draft | +|:--|:--| +| Qwen/Qwen3-4B (non-thinking) | `z-lab/Qwen3-4B-DFlash-b16` | +| Qwen/Qwen3-8B (non-thinking) | `z-lab/Qwen3-8B-DFlash-b16` | +| meta-llama/Llama-3.1-8B-Instruct | `z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat` | +| Qwen/Qwen3.5-4B | `z-lab/Qwen3.5-4B-DFlash` | +| Qwen/Qwen3.5-27B | `z-lab/Qwen3.5-27B-DFlash` | +| openai/gpt-oss-20b | `z-lab/gpt-oss-20b-DFlash` | +| openai/gpt-oss-120b | `z-lab/gpt-oss-120b-DFlash` | + +B1 sidecar 现在不消费这个字段, 留给 B2 PR 填充。 + +--- + +## 13. 
浏览器里跑推理的可行性评估 + +### 13.1 定义: "在 web 里跑"有四种 + +| | 含义 | 技术 | +|:-|:-|:-| +| W1 | 纯浏览器、零本地依赖 | WebGPU / WASM / WebLLM / transformers.js | +| W2 | 浏览器 UI + 同机本地 sidecar | `http://localhost:1338` + Tauri webview | +| W3 | Tauri webview 里的 web-app | 本 PR 的 TS 扩展跑在这一层 | +| W4 | 远端 server + 浏览器前端 | LAN/云端 vLLM/SGLang | + +本节聚焦 W1 (最硬的问题: 能否真把 B2 搬进浏览器)。 + +### 13.2 B2 三件部件搬 Web 的可行性 + +| 部件 | WebGPU 可行性 | 工作量 | +|:-|:-:|:-| +| **Target LLM** | ✅ WebLLM / transformers.js 已支持 Qwen/Llama/Gemma/Phi 小模型 int4 | 现成 | +| **KakeyaLattice E8 codec** | ✅ 全是 `matmul` / `argmin` / `round` / `clamp` / `amax`, WGSL 几百行 | 低: 1-2 天 WGSL + JS | +| **DFlash drafter** | ✘ 无 web 后端; draft 架构 (cross-architecture KV injection + block-diffusion + mask token) 需要 port 自 `dflash/model_mlx.py` ~1500 行到 WebGPU | 高: 独立项目级 | + +三件拼起来的结论: +- `target + KakeyaLattice` ✅ 可行 +- `target + KakeyaLattice + DFlash` ✘ 短期不可行 + +### 13.3 关键限制的数字锚点 + +- WASM 线性内存单实例 4 GB 上限 (memory64 提案 Safari 未实现) +- WebGPU buffer binding 单 binding Chrome 默认 256 MB (可提至 2 GB) +- 7B bf16 = 14 GB → 必 int4 量化; 13B int4 ≈ 7 GB 贴 M3 Pro 边界 +- WebGPU 相对 native GPU 效率 **50-70%** (workgroup / subgroup ops / driver 抽象损耗) +- Qwen-2.5-3B WebLLM @ M3 Pro ~30-45 tok/s (MLX native ~80-100) +- WebGPU + DFlash 缺位 → web 比 "native MLX + DFlash" 慢 **一个量级** + +### 13.4 Web 推理的优劣势速览 + +**优势**: +- 分发极简 (URL), 跨平台零 build matrix, 更新零摩擦 +- 浏览器沙盒提供可审计的隐私承诺 +- 嵌入性: Notion / Chrome extension / iframe 原位推理 +- 服务端成本近零 (demo / playground 场景) + +**劣势**: +- 硬件上限被浏览器上限约束 (WASM 4 GB / WebGPU buffer quota) +- 冷启动悬崖 (首次几 GB 权重下载) +- 性能天花板比 native 低 50-70% + 无 DFlash = 慢一个量级 +- 电池 / 发热 / 浏览器碎片化 (2026 覆盖率 ~60-75%) +- 无本地文件 / 无 MCP / 无子进程能力 — Atomic-Chat 的 agent 卖点全部失效 + +### 13.5 三条现实路径 (ladder) + +| ladder | 能做到 | 工作量 | +|:-:|:--|:--| +| L1 | Web 版 KakeyaLattice E8 codec demo (WGSL + WebGPU), marketing/教育 | 1-2 天 | +| L2 | Web 里 transformers.js + 自定义 KV cache wrapper 跑 3-4B target + KakeyaLattice (无 DFlash) | 几周 | +| L3 | 完整 B2 in web (需 DFlash WebGPU 后端, 等 z-lab / MLC 社区出) | 巨大, 非本仓库能推进 | + +### 13.6 W2 的务实选项 (不是纯 web, 但"零点击即用") + +sidecar 加一个 `/chat.html` 静态路由 + React build 产物, 用户打开 +`http://localhost:1338/chat.html` 浏览器就能 chat, 底层是完整 B2 +native 栈. 这是 Atomic-Chat Tauri webview 的"脱壳版", 保留 B2 全部 +性能, 同时满足"打开浏览器就用"的体验诉求. + +### 13.7 建议 + +- **短期**: 不在本 PR 做 W1; 保留 B1 Python sidecar 作为产品主力. +- **中期** (独立 PR): L1 (codec demo 页面) 作为 marketing / 教育资产. +- **长期**: L2 (web-native LLM + KakeyaLattice) 机会窗口, 但 DFlash 缺位 + 意味着"web 永远比 native 慢一代", 投入要量力. + +--- + +## 14. A0 (llama.cpp) vs B2 (MLX + DFlash + KakeyaLattice) 决策表 + +本节给 Atomic-Chat 团队在 review 这套集成 PR 时一份直接可用的决策依据: +B2 **不是** A0 的替代, 而是 A0 的"Pro Mode for Mac"附加后端. 
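两个后端并存时，`:1337` 前门的路由可以非常薄。下面是一段**示意性** TypeScript
（非 Atomic-Chat 真实代码；`LLAMACPP_BASE` 的端口是本文虚构的占位，
"model id 含 `@` 即走 sidecar" 的判别规则来自 §6 的 channel id 约定）：

```typescript
// 示意代码（假设性）：:1337 前门按 model id 把 OpenAI 请求分流到两个后端。
// KakeyaLattice 的 channel id 形如 "qwen3-4b@e8-q10"，总含 "@"；
// GGUF 模型 id 没有 "@"，所以一个 includes 判断就够了。
const KAKEYA_BASE = "http://127.0.0.1:1338";    // Kakeya sidecar（本文约定）
const LLAMACPP_BASE = "http://127.0.0.1:13337"; // 占位端口，实际由 plugin 分配

function pickBackendBase(modelId: string): string {
  return modelId.includes("@") ? KAKEYA_BASE : LLAMACPP_BASE;
}

async function forwardChatCompletion(body: { model: string }): Promise<Response> {
  // 请求体原样透传，两个后端都讲 OpenAI 方言。
  return fetch(`${pickBackendBase(body.model)}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
}
```
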
+ +### 14.1 两套方案的全貌 + +**A0 (Atomic-Chat 现状)**: +- Tauri 2 + React, 跨平台 (Mac/Win/Linux) +- 单推理后端: `extensions/llamacpp-extension` + `plugins/llamacpp` 驱动 llama.cpp +- GGUF 模型, Metal / CUDA / CPU 加速 +- KV 压缩: `--cache-type-k/v` = q8_0 / q4_0 / q4_1 标量量化 +- 推测解码: llama.cpp 引擎有, UI 未集成 +- MCP / Assistant / 多 provider / OpenAI 兼容 `:1337` +- 宣传 "Google TurboQuant built-in" (实为 llama.cpp KV 标量量化) + +**B2 (本 PR 叙事链末端, 独立后续 PR 落地)**: +- 复用 Atomic-Chat 的 Tauri 壳 +- 第二推理后端: Python sidecar 跑 MLX + target LLM + DFlash drafter + KakeyaLattice-MLX codec +- MLX 权重 (mlx-community) + HF safetensors +- Apple Silicon 专属 (MLX 不上 Windows/Linux) +- KV 压缩: E8 Q=38 / Q=10 / Q=4 (v1.5 报告锚定) +- 推测解码: DFlash 3-6× 无损 (z-lab 预训练 draft) +- 复用 A0 的 UI / MCP / Assistant / `:1337` OpenAI 前门 (内部路由到 `:1338` sidecar) + +### 14.2 多维对比表 + +| 维度 | A0 | B2 | +|:-|:-|:-| +| Decode 速度 (Qwen3-8B @ M3 Pro) | ~35-45 tok/s (q4_K_M) | ~200-280 tok/s effective (MLX 50 × DFlash 3-6×) | +| KV 压缩质量 | q4/q8 标量, q2_K 对 <3B 模型不稳 | E8 Q=38 \|Δppl\|<1%, Q=10 <7% (3/4 模型) | +| 长上下文 (Mac 16GB) | Qwen3-8B ~16-24k | ~48-64k (KV 再 3.37× 压缩) | +| 推测解码 | 引擎有, UI 不可达 | DFlash 3-6× 无损 | +| 模型生态 | GGUF 池 >> MLX 池 (几万 vs 几百) | MLX 主流模型覆盖 + HF 原始权重 | +| 冷启动 | ~0.5-1s | ~2-3s (Python + MLX load) | +| 平台覆盖 | Mac + Windows + Linux | **Mac 专属 (Apple Silicon)** | +| 小模型 (<3B) | llama.cpp 手工 Metal 调优, ~90+ tok/s | MLX 比 llama.cpp 慢 20-30% | +| 大模型 (>13B) | GGUF int4/int2 OK | MLX int4 OK 但生态弱 | +| Agent / MCP | 原生集成 | 完全复用 A0 能力 | +| KakeyaLattice 接入深度 | 无 (llama.cpp 无可插拔 KV hook) | 原生 `KakeyaLatticeCache` (B1) + MLX port (B2) | +| 打包 | DMG + 静态 llama.cpp | DMG + PyOxidizer sidecar + signing | +| debuggability | C++ 栈 + native profiler | Python stack + MLX trace, 显著更易 | +| 新 codec 接入 | 改 C++ + Metal kernel | 几十行 MLX | +| License | Apache-2.0 + MIT | Apache-2.0 全线 | +| 社区规模 | llama.cpp ~80k★ | MLX 官方 / DFlash 新 / KakeyaLattice 专项 | + +### 14.3 各自赢在哪里 + +**A0 三大硬核优势**: +1. **生态压倒性** — GGUF 新模型 release-to-quant lag 24-72h, MLX 是 1-2 周. +2. **跨平台** — Windows 用户 (Atomic-Chat 有 .exe 头等二进制) 在 B2 下无路可走. +3. **小模型性能** — llama.cpp Metal shader hand-tuned; Phi-3.5 mini @ M3 Pro ~90+ tok/s. + +**B2 三大硬核优势**: +1. **加速 × 压缩乘法效应** — 200+ tok/s effective, A0 在产品形态下追不上 (llama.cpp 推测解码未暴露到 UI). +2. **长上下文真解锁** — Mac 16GB + Qwen3-8B 从"32k OOM"到"64k 可用", \|Δppl\|<1%. +3. **质量天花板** — Q=38 近无损 + Q=10 平衡档都有 v1.5 n=32 CI 实测撑, 不是工程直觉. + +**A0 软肋**: +- `Google TurboQuant built-in` 的宣传工程上名不副实 (实为 llama.cpp 2023 年就有的标量量化); v1.5 报告中 TQ b=2 "结构性不可用", b=3 被 E8 Q=4 在 4 模型上全面压过. +- draft model 支持在 UI 未暴露, 加速特性"存在但不可达". + +**B2 软肋**: +- Mac-only — Windows/Linux 用户用不到. +- Python sidecar 打包 + notarize 是 Atomic-Chat 新工程. +- 冷启动慢 2 秒. +- DFlash 对模型家族有限定 (Qwen3 / Llama-3.1 / Qwen3.5 / gpt-oss / Kimi). + +### 14.4 用户画像 → 最优方案 + +| 用户画像 | 最优 | +|:-|:-| +| Windows / Linux | **A0** (B2 不可用) | +| Mac + <3B 小模型 (Phi, Gemma-2B) | **A0** (llama.cpp 手工调优略优) | +| Mac + 3-8B 主力 (Qwen3, Llama-3) | **B2** (加速 + KV 压缩双突破) | +| Mac + 长上下文 / 整本书 / 代码库 | **B2** (唯一解) | +| 追新模型 (每天等新 GGUF) | **A0** (生态快一周) | +| MCP / agent 工作流 | 任选; B2 高 tok/s 让连续调用更流畅 | +| 隐私极强 | 任选; B2 Python 栈可审计更深 | + +### 14.5 给产品决策者的三句话 + +1. **不要把 B2 包装成"取代 llama.cpp"** — 产品伤害极大 (Windows / GGUF 追新 / 小模型用户流失). +2. **把 B2 定位为"Pro Mode for Mac"** — 文案强调 "DFlash + E8 lattice KV, Qwen3-8B 在 M3 Pro 200+ tok/s, 48k ctx 不 OOM", 兑现 atomic.chat 早已宣传却未落地的 "TurboQuant built-in" 承诺. +3. **宣传可信度** — B2 的每一个数字都能 reproduce (KakeyaLattice v1.5 n=32 Student-t CI + DFlash z-lab 官方 benchmark). 
把 "Google TurboQuant built-in" 改为 "KakeyaLattice E8 + DFlash built-in for Mac", 工程上才立得住. + +### 14.6 本 PR 的定位重申 + +本 PR (分支 `AgentMemory/atomic-chat-kakeya-integration-04ae`) 交付 +**B1 方案 + 完整设计叙事**, 具体包含: + +- `integrations/atomic-chat/kakeya_sidecar/` — Python sidecar (HF + MPS + `KakeyaLatticeCache`) 代码 + 15 单测 +- `integrations/atomic-chat/kakeyalattice-extension/` — Atomic-Chat TS 扩展骨架, tsc 通过 +- `integrations/atomic-chat/tauri-plugin-kakeyalattice/` — Rust Tauri plugin 骨架 +- `docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md` — §1-§14 完整设计 + +**B2 方案** (MLX + DFlash + KakeyaLattice-MLX) 走**独立后续 PR** +(`AgentMemory/atomic-chat-b2-mlx-dflash-kakeya-04ae`), 避免本 PR 审阅面 +膨胀. B2 PR 会在本 PR merge 之后开出, 对本 PR 不是 blocker. + +--- + +*作者:Cursor Cloud Agent · 分支 `AgentMemory/atomic-chat-kakeya-integration-04ae`.* diff --git a/integrations/atomic-chat/.gitignore b/integrations/atomic-chat/.gitignore new file mode 100644 index 0000000..4624513 --- /dev/null +++ b/integrations/atomic-chat/.gitignore @@ -0,0 +1,15 @@ +# Python +__pycache__/ +*.pyc +*.egg-info/ +.pytest_cache/ +.venv/ + +# Node / TS +node_modules/ +dist/ +*.tsbuildinfo + +# Rust +target/ +Cargo.lock diff --git a/integrations/atomic-chat/README.md b/integrations/atomic-chat/README.md new file mode 100644 index 0000000..abc1062 --- /dev/null +++ b/integrations/atomic-chat/README.md @@ -0,0 +1,95 @@ +# Atomic-Chat × KakeyaLattice v1.5 — 本地 Mac 部署集成 + +把 [`kakeyalattice` v1.5 (E8 格 KV-cache codec)](../../kakeyalattice/) 作为 +**第二个一等推理后端** 接入 +[`AtomicBot-ai/Atomic-Chat`](https://github.com/AtomicBot-ai/Atomic-Chat), +目标是 **Mac (Apple Silicon, Metal)** 的多模型离线部署。 + +> 完整设计依据见 [`docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md`](../../docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md)。 +> 本目录只放"可直接拷进 Atomic-Chat 仓库"的工程骨架。 + +## 目录结构 + +``` +integrations/atomic-chat/ +├── kakeya_sidecar/ ★ Python 推理 sidecar +│ ├── pyproject.toml —— 独立 pip 包 `kakeya-sidecar` +│ ├── kakeya_sidecar/ +│ │ ├── __init__.py +│ │ ├── __main__.py —— `python -m kakeya_sidecar ...` +│ │ ├── cli.py —— argparse 入口 + uvicorn 启动 +│ │ ├── server.py —— FastAPI OpenAI 兼容接口 +│ │ ├── engine.py —— HF + KakeyaLatticeCache 推理核心 +│ │ ├── model_registry.py —— 多模型部署档位 +│ │ └── schemas.py —— OpenAI 请求/响应 dataclass +│ └── tests/ +│ └── test_model_registry.py +├── kakeyalattice-extension/ ★ Atomic-Chat TypeScript 扩展 +│ ├── package.json +│ ├── src/ +│ │ ├── index.ts —— 扩展入口(注册到 core SDK) +│ │ └── backend.ts —— 走 Tauri plugin → sidecar +│ └── README.md +└── tauri-plugin-kakeyalattice/ ★ Rust Tauri 插件(桩) + ├── Cargo.toml + ├── src/ + │ ├── lib.rs —— `tauri::plugin::Builder` 注册 + │ └── commands.rs —— sidecar 生命周期 + 代理调用 + └── README.md +``` + +## 快速验证 (Mac) + +```bash +# 1. 安装 sidecar +cd integrations/atomic-chat/kakeya_sidecar +pip install -e ".[mac]" + +# 2. 单元测试(不需要下载模型) +pytest tests/ -v + +# 3. 跑起来 +kakeya-sidecar --port 1338 --device mps +curl http://localhost:1338/v1/models | jq . + +# 4. 发一次推理请求(会首次下载模型,小心硬盘) +curl http://localhost:1338/v1/chat/completions -H "Content-Type: application/json" -d '{ + "model": "qwen3-4b@e8-q10", + "messages": [{"role":"user","content":"Explain nested-lattice codes in one line."}], + "stream": false, "max_tokens": 64 +}' +``` + +## 集成到 Atomic-Chat 主仓库(步骤示意) + +```bash +# 假设 Atomic-Chat 主仓库在 ~/code/Atomic-Chat +export ATOMIC=~/code/Atomic-Chat + +# 1. 扩展 +rsync -a kakeyalattice-extension/ $ATOMIC/extensions/kakeyalattice-extension/ + +# 2. Tauri plugin +rsync -a tauri-plugin-kakeyalattice/ $ATOMIC/src-tauri/plugins/kakeyalattice/ + +# 3. 
向 Atomic-Chat 的 extension registry 加一行: +# (详见 $ATOMIC/extensions/README / CONTRIBUTING.md) +# { name: "kakeyalattice-extension", enabled: true } + +# 4. Python sidecar 需要和 llama.cpp 一起打进安装包。 +# Atomic-Chat 在 `scripts/bundle-binaries.*` 有现成的二进制打包脚本, +# 对 sidecar 用 PyOxidizer / PyInstaller 产出单文件,挂上去即可。 +``` + +> 本 PR 不改 Atomic-Chat 主仓库 — 只在本仓库提供可直接移植的骨架、 +> 完整的设计文档、以及 sidecar 自己的单元测试。 + +## 与既有 vLLM 插件的关系 + +| 路径 | 场景 | 是否改动 | +|:--|:--|:--| +| `vllm_backend/kakeya_v1_4_snapshot/` | Linux / CUDA / H200 benchmark | 不动 | +| `integrations/atomic-chat/kakeya_sidecar/` | Mac / MPS / 产品端推理 | 本 PR 新增 | + +两者共用的唯一组件是 Python 包 `kakeyalattice` 本身 — 标量算子都走 +PyTorch,设备无关。 diff --git a/integrations/atomic-chat/kakeya_sidecar/README.md b/integrations/atomic-chat/kakeya_sidecar/README.md new file mode 100644 index 0000000..f732410 --- /dev/null +++ b/integrations/atomic-chat/kakeya_sidecar/README.md @@ -0,0 +1,37 @@ +# kakeya-sidecar + +OpenAI 兼容的本地推理 sidecar,给 [Atomic-Chat](https://github.com/AtomicBot-ai/Atomic-Chat) +用;推理走 HuggingFace `transformers` + `kakeyalattice.hf.KakeyaLatticeCache` +(E8 nested-lattice KV-cache 压缩)。 + +设计目标: + +1. **零代码改动** — 对外是 `POST /v1/chat/completions`,Atomic-Chat 既有 + OpenAI 客户端直连即可。 +2. **多模型本地部署** — Qwen3 / Llama-3.x / Gemma-4 / DeepSeek-R1-Distill / + GLM-4-9B / Mistral,每个都有"出厂" Q 档配置。 +3. **Mac 优先** — 默认 `--device mps`,Linux/CUDA 也支持。 +4. **Tauri-friendly** — 纯 HTTP + JSON,Tauri plugin 负责起进程、转发调用。 + +详细设计:[`docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md`](../../../docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md)。 + +## Quick start + +```bash +pip install -e ".[mac,dev]" +pytest tests/ -v # 不下载模型的纯逻辑单测 +kakeya-sidecar --port 1338 --device mps +``` + +```bash +curl http://localhost:1338/v1/models +curl http://localhost:1338/v1/chat/completions -d '{...}' ... +``` + +## 支持的模型 + +参见 `kakeya_sidecar/model_registry.py`。通过 `GET /v1/models` 实时返回。 + +## License + +Apache-2.0,与主仓库一致。 diff --git a/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/__init__.py b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/__init__.py new file mode 100644 index 0000000..d149aff --- /dev/null +++ b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/__init__.py @@ -0,0 +1,29 @@ +"""kakeya_sidecar — OpenAI-compatible local inference sidecar. + +Top-level imports are lazy so that pure-logic modules (notably +:mod:`model_registry`) can be imported in test / packaging +environments where FastAPI or torch may not be installed. 
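
Quick torch-free sanity check (ids come from :mod:`.model_registry`)::

    from kakeya_sidecar import resolve_model

    profile, channel = resolve_model("qwen3-4b@e8-q10")
    assert profile.hf_repo_id == "Qwen/Qwen3-4B"
    assert channel.q_range == 10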
+""" +from __future__ import annotations + +from .model_registry import MODEL_REGISTRY, DeploymentProfile, resolve_model + +__all__ = [ + "MODEL_REGISTRY", + "DeploymentProfile", + "resolve_model", + "KakeyaEngine", + "create_app", +] + +__version__ = "0.1.0" + + +def __getattr__(name): + if name == "KakeyaEngine": + from .engine import KakeyaEngine + return KakeyaEngine + if name == "create_app": + from .server import create_app + return create_app + raise AttributeError(f"module 'kakeya_sidecar' has no attribute {name!r}") diff --git a/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/__main__.py b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/__main__.py new file mode 100644 index 0000000..876b442 --- /dev/null +++ b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/__main__.py @@ -0,0 +1,7 @@ +"""Allow ``python -m kakeya_sidecar``.""" +from __future__ import annotations + +from .cli import main + +if __name__ == "__main__": # pragma: no cover + main() diff --git a/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/cli.py b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/cli.py new file mode 100644 index 0000000..a9826cc --- /dev/null +++ b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/cli.py @@ -0,0 +1,78 @@ +"""``kakeya-sidecar`` CLI entry point.""" +from __future__ import annotations + +import argparse +import logging +import sys + +from .engine import EngineConfig +from .server import create_app + + +def build_parser() -> argparse.ArgumentParser: + p = argparse.ArgumentParser( + prog="kakeya-sidecar", + description="OpenAI-compatible local inference sidecar for Atomic-Chat " + "(HuggingFace transformers + KakeyaLattice v1.5 E8 KV-cache).", + ) + p.add_argument("--host", default="127.0.0.1", + help="Bind address (default 127.0.0.1 — localhost only).") + p.add_argument("--port", type=int, default=1338, + help="Bind port (default 1338; Atomic-Chat's OpenAI front " + "door is 1337 so we sit one port over).") + p.add_argument("--device", default="auto", + choices=["auto", "mps", "cuda", "cpu"], + help="Torch device. 'auto' picks mps on Mac / cuda on Linux.") + p.add_argument("--dtype", default="auto", + choices=["auto", "bfloat16", "float16", "float32"], + help="Model parameter dtype.") + p.add_argument("--max-resident", type=int, default=1, + help="Max number of fully-loaded models kept in RAM/VRAM " + "at once (LRU).") + p.add_argument("--hf-cache-dir", default=None, + help="Override HF_HOME / cache location.") + p.add_argument("--prewarm", default=None, + help="Channel id to pre-load at startup, e.g. 
" + "'qwen3-4b@e8-q10'.") + p.add_argument("--log-level", default="INFO", + choices=["DEBUG", "INFO", "WARNING", "ERROR"]) + return p + + +def main(argv: list[str] | None = None) -> int: + args = build_parser().parse_args(argv) + logging.basicConfig( + level=getattr(logging, args.log_level), + format="%(asctime)s %(levelname)s %(name)s: %(message)s", + ) + + cfg = EngineConfig( + device=args.device, + dtype=args.dtype, + max_resident=args.max_resident, + hf_cache_dir=args.hf_cache_dir, + ) + + engine_instance = None + if args.prewarm: + from .engine import KakeyaEngine + + log = logging.getLogger("kakeya_sidecar.cli") + engine_instance = KakeyaEngine(cfg) + log.info("prewarming %s on %s", args.prewarm, engine_instance._device) + engine_instance.warmup(args.prewarm) + + app = create_app( + cfg, + lazy_engine=engine_instance is None, + engine_instance=engine_instance, + ) + + try: + import uvicorn # type: ignore + except ImportError: # pragma: no cover + print("uvicorn is required. `pip install uvicorn[standard]`.", file=sys.stderr) + return 2 + + uvicorn.run(app, host=args.host, port=args.port, log_level=args.log_level.lower()) + return 0 diff --git a/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/engine.py b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/engine.py new file mode 100644 index 0000000..4f14db3 --- /dev/null +++ b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/engine.py @@ -0,0 +1,307 @@ +"""Inference engine: HuggingFace transformers + KakeyaLatticeCache. + +The engine is intentionally small; there is no batching, no dynamic +multi-tenant scheduler, no speculative decoding — Atomic-Chat's +single-user desktop use case does not need any of that. + +Design: + +- One ``_LoadedModel`` per (short_id) with an LRU eviction policy of + size ``max_resident`` (default 1). Switching models unloads the + previous one, same as Atomic-Chat's llama.cpp plugin already does. +- ``KakeyaLatticeCache`` is **per request** (caches have sequence state + tied to the generation). We rebuild it fresh for each call because + that matches the standard HF generate pattern and makes concurrent + requests trivially safe. +- All torch imports are lazy so the CLI can `--help` with no torch + installed. + +The engine does NOT depend on FastAPI — the server module wires it up. +""" +from __future__ import annotations + +import logging +import os +import threading +import time +from collections import OrderedDict +from dataclasses import dataclass +from typing import Any, Iterator + +from .model_registry import Channel, DeploymentProfile, resolve_model + +log = logging.getLogger("kakeya_sidecar.engine") + + +@dataclass +class EngineConfig: + device: str = "auto" # "auto" | "mps" | "cuda" | "cpu" + dtype: str = "auto" # "auto" | "bfloat16" | "float16" | "float32" + max_resident: int = 1 # LRU size of loaded models + trust_remote_code: bool = True + hf_cache_dir: str | None = None + + +def _pick_device(requested: str) -> str: + if requested != "auto": + return requested + try: + import torch # type: ignore + except ImportError: # pragma: no cover + return "cpu" + if torch.cuda.is_available(): + return "cuda" + if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available(): + return "mps" + return "cpu" + + +def _pick_dtype(requested: str, device: str): + import torch # type: ignore + + if requested != "auto": + return getattr(torch, requested) + # MPS supports bf16 on macOS 14+; fall back to fp16 on older. 
+ if device == "mps": + return torch.float16 + if device == "cuda": + return torch.bfloat16 + return torch.float32 + + +class _LoadedModel: + """HF model + tokenizer, loaded on a specific device/dtype.""" + + def __init__( + self, + profile: DeploymentProfile, + cfg: EngineConfig, + device: str, + dtype, + ) -> None: + from transformers import AutoModelForCausalLM, AutoTokenizer # type: ignore + + log.info("loading %s on %s/%s …", profile.hf_repo_id, device, dtype) + t0 = time.time() + self.tokenizer = AutoTokenizer.from_pretrained( + profile.hf_repo_id, + trust_remote_code=cfg.trust_remote_code, + cache_dir=cfg.hf_cache_dir, + ) + if self.tokenizer.pad_token_id is None: + self.tokenizer.pad_token_id = self.tokenizer.eos_token_id + + self.model = AutoModelForCausalLM.from_pretrained( + profile.hf_repo_id, + torch_dtype=dtype, + trust_remote_code=cfg.trust_remote_code, + cache_dir=cfg.hf_cache_dir, + low_cpu_mem_usage=True, + ) + self.model.to(device) + self.model.eval() + + self.device = device + self.dtype = dtype + self.profile = profile + log.info( + "loaded %s in %.1fs (L=%d, head_dim=%r)", + profile.hf_repo_id, time.time() - t0, + self.model.config.num_hidden_layers, + getattr(self.model.config, "head_dim", "?"), + ) + + +class KakeyaEngine: + """Public-facing inference engine used by the FastAPI server.""" + + def __init__(self, cfg: EngineConfig | None = None) -> None: + self.cfg = cfg or EngineConfig() + self._device = _pick_device(self.cfg.device) + self._loaded: "OrderedDict[str, _LoadedModel]" = OrderedDict() + self._lock = threading.Lock() + log.info("KakeyaEngine device=%s max_resident=%d", + self._device, self.cfg.max_resident) + + # ------------------------------------------------------------------ + # model lifecycle + # ------------------------------------------------------------------ + + def _ensure_loaded(self, profile: DeploymentProfile) -> _LoadedModel: + with self._lock: + if profile.short_id in self._loaded: + self._loaded.move_to_end(profile.short_id) + return self._loaded[profile.short_id] + + import torch # type: ignore + dtype = _pick_dtype(self.cfg.dtype, self._device) + lm = _LoadedModel(profile, self.cfg, self._device, dtype) + + self._loaded[profile.short_id] = lm + while len(self._loaded) > self.cfg.max_resident: + evicted_id, evicted = self._loaded.popitem(last=False) + log.info("evicting %s", evicted_id) + del evicted + if torch.cuda.is_available(): + torch.cuda.empty_cache() + return lm + + def warmup(self, channel_id: str) -> None: + """Pre-load a model (used by ``--prewarm`` CLI flag).""" + profile, _ = resolve_model(channel_id) + self._ensure_loaded(profile) + + # ------------------------------------------------------------------ + # generation + # ------------------------------------------------------------------ + + def _build_cache( + self, lm: _LoadedModel, channel: Channel + ): + from kakeyalattice.hf import KakeyaLatticeCache # type: ignore + + head_dim = getattr(lm.model.config, "head_dim", None) + if head_dim is None: + hidden = lm.model.config.hidden_size + n_heads = lm.model.config.num_attention_heads + head_dim = hidden // n_heads + + return KakeyaLatticeCache( + variant=channel.variant, + q_range=channel.q_range, + num_hidden_layers=lm.model.config.num_hidden_layers, + head_dim=int(head_dim), + device=lm.device, + boundary=channel.boundary, + strict=False, + ) + + def chat( + self, + channel_id: str, + messages: list[dict[str, Any]], + *, + max_tokens: int = 256, + temperature: float = 0.7, + top_p: float = 1.0, + stop: list[str] | None = 
None, + override: dict[str, Any] | None = None, + ) -> tuple[str, dict[str, Any]]: + """One-shot (non-streaming) chat completion. + + Returns ``(text, stats)`` where ``stats`` contains kakeya-specific + telemetry (codec fire counts etc.) for the ``x_kakeya`` field. + """ + profile, channel = resolve_model(channel_id) + if override: + channel = Channel( + variant=override.get("variant", channel.variant), + q_range=int(override.get("q_range", channel.q_range)), + boundary=int(override.get("boundary", channel.boundary)), + est_compression=channel.est_compression, + est_delta_ppl_pct=channel.est_delta_ppl_pct, + label=channel.label, + ) + + lm = self._ensure_loaded(profile) + cache = self._build_cache(lm, channel) + + import torch # type: ignore + + prompt = lm.tokenizer.apply_chat_template( + messages, tokenize=False, add_generation_prompt=True + ) + inputs = lm.tokenizer(prompt, return_tensors="pt").to(lm.device) + prompt_len = inputs["input_ids"].shape[1] + + gen_kwargs: dict[str, Any] = dict( + max_new_tokens=max_tokens, + do_sample=temperature > 0.0, + temperature=max(temperature, 1e-4), + top_p=top_p, + past_key_values=cache, + pad_token_id=lm.tokenizer.pad_token_id, + ) + + t0 = time.time() + with torch.inference_mode(): + out = lm.model.generate(**inputs, **gen_kwargs) + gen_time = time.time() - t0 + + new_tokens = out[0, prompt_len:] + text = lm.tokenizer.decode(new_tokens, skip_special_tokens=True) + + stats = { + "variant": channel.variant, + "q_range": channel.q_range, + "boundary": channel.boundary, + "est_compression": channel.est_compression, + "est_delta_ppl_pct": channel.est_delta_ppl_pct, + "prompt_tokens": int(prompt_len), + "completion_tokens": int(new_tokens.shape[0]), + "generation_time_s": gen_time, + "codec_fired_per_layer": dict(getattr(cache, "codec_fired_per_layer", {})), + "skip_fired_per_layer": dict(getattr(cache, "skip_fired_per_layer", {})), + } + return text, stats + + def chat_stream( + self, + channel_id: str, + messages: list[dict[str, Any]], + *, + max_tokens: int = 256, + temperature: float = 0.7, + top_p: float = 1.0, + stop: list[str] | None = None, + override: dict[str, Any] | None = None, + ) -> Iterator[str]: + """Streaming variant — yields delta strings (not full SSE frames). + + The server wraps each delta into an OpenAI ``chat.completion.chunk``. 
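
        Note: ``stop`` sequences are currently accepted for OpenAI
        request compatibility but not enforced; generation ends on EOS
        or ``max_tokens``. The same limitation applies to :meth:`chat`.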
+ """ + profile, channel = resolve_model(channel_id) + if override: + channel = Channel( + variant=override.get("variant", channel.variant), + q_range=int(override.get("q_range", channel.q_range)), + boundary=int(override.get("boundary", channel.boundary)), + est_compression=channel.est_compression, + est_delta_ppl_pct=channel.est_delta_ppl_pct, + label=channel.label, + ) + lm = self._ensure_loaded(profile) + cache = self._build_cache(lm, channel) + + import torch # type: ignore + from transformers import TextIteratorStreamer # type: ignore + + prompt = lm.tokenizer.apply_chat_template( + messages, tokenize=False, add_generation_prompt=True + ) + inputs = lm.tokenizer(prompt, return_tensors="pt").to(lm.device) + streamer = TextIteratorStreamer( + lm.tokenizer, skip_prompt=True, skip_special_tokens=True + ) + + gen_kwargs: dict[str, Any] = dict( + **inputs, + max_new_tokens=max_tokens, + do_sample=temperature > 0.0, + temperature=max(temperature, 1e-4), + top_p=top_p, + past_key_values=cache, + streamer=streamer, + pad_token_id=lm.tokenizer.pad_token_id, + ) + + def _run(): + with torch.inference_mode(): + lm.model.generate(**gen_kwargs) + + thread = threading.Thread(target=_run, daemon=True) + thread.start() + for piece in streamer: + if piece: + yield piece + thread.join() diff --git a/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/model_registry.py b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/model_registry.py new file mode 100644 index 0000000..d288af5 --- /dev/null +++ b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/model_registry.py @@ -0,0 +1,242 @@ +"""Per-model deployment profiles. + +The registry maps a user-facing model id (e.g. ``qwen3-4b``) to: + +1. The HuggingFace repo id we pull weights from. +2. The canonical KakeyaLattice channels we support for that model + (variant ∈ {"d4", "e8"}, q_range ∈ {...}, boundary ∈ {...}). +3. Human-readable metadata (head_dim, compression-ratio estimate, + measured |Δppl|) so the UI / ``/v1/models`` response can show it. + +The numbers in ``channels[].est_*`` come directly from +``reports/v1_5_release/V15_FULL_4MODEL_REPORT.md`` where we have +measurements, and from the compression-ratio formula otherwise. + +Nothing in this module imports torch or transformers — it is pure +metadata, safe to import in tests. +""" +from __future__ import annotations + +from dataclasses import dataclass, field +from typing import Iterable + + +@dataclass(frozen=True) +class Channel: + """One (variant, Q, boundary) deployment channel for a model.""" + + variant: str # "d4" | "e8" + q_range: int # 4, 10, 38, 152 … + boundary: int = 0 # number of front/back layers to skip + est_compression: float = 1.0 # KV cache compression vs bf16 + est_delta_ppl_pct: float | None = None # measured if available + label: str = "" # e.g. "balanced", "aggressive", "near-lossless" + + +@dataclass(frozen=True) +class DeploymentProfile: + """Full deployment profile for one HF model.""" + + short_id: str # "qwen3-4b" + hf_repo_id: str # "Qwen/Qwen3-4B" + head_dim: int | tuple[int, ...] # tuple for heterogeneous (Gemma-4) + num_hidden_layers: int + channels: tuple[Channel, ...] 
    default_channel: Channel
    notes: str = ""

    def channel_id(self, ch: Channel) -> str:
        """UI/API id = ``<short_id>@<variant>-q<q_range>[-b<boundary>]``."""
        suffix = f"-b{ch.boundary}" if ch.boundary else ""
        return f"{self.short_id}@{ch.variant}-q{ch.q_range}{suffix}"

    def all_channel_ids(self) -> list[str]:
        return [self.channel_id(c) for c in self.channels]


def _std_channels_128() -> tuple[Channel, ...]:
    """Canonical channels for head_dim=128 (Qwen/Llama/Mistral/GLM/DeepSeek).

    Compression ratios from v1.5 report Table §3.
    """
    return (
        Channel("e8", q_range=38, boundary=0,
                est_compression=2.50, est_delta_ppl_pct=None,
                label="near-lossless"),
        Channel("e8", q_range=10, boundary=0,
                est_compression=3.37, est_delta_ppl_pct=None,
                label="balanced"),
        Channel("e8", q_range=4, boundary=0,
                est_compression=4.57, est_delta_ppl_pct=None,
                label="aggressive"),
    )


# NOTE: est_delta_ppl_pct is populated per-model where v1.5 measured it;
# empty entries stay ``None`` and the UI shows "estimated only".

MODEL_REGISTRY: dict[str, DeploymentProfile] = {
    # --- Qwen family ----------------------------------------------------
    "qwen3-4b": DeploymentProfile(
        short_id="qwen3-4b",
        hf_repo_id="Qwen/Qwen3-4B",
        head_dim=128,
        num_hidden_layers=36,
        channels=(
            Channel("e8", 38, 0, 2.50, None, "near-lossless"),
            Channel("e8", 10, 0, 3.37, 3.85, "balanced"),
            Channel("e8", 4, 0, 4.57, 17.00, "aggressive"),
        ),
        default_channel=Channel("e8", 10, 0, 3.37, 3.85, "balanced"),
        notes="v1.5 E8 first-measurement model; numbers from V15_FULL_4MODEL_REPORT.md.",
    ),
    "qwen2-1.5b": DeploymentProfile(
        short_id="qwen2-1.5b",
        hf_repo_id="Qwen/Qwen2-1.5B",
        head_dim=128,
        num_hidden_layers=28,
        channels=_std_channels_128(),
        default_channel=Channel("e8", 38, 0, 2.50, None, "near-lossless"),
        notes="Small Qwen2; conservative default (Q=38) on Mac 8-16 GB.",
    ),

    # --- Llama family ---------------------------------------------------
    "llama-3.2-3b-instruct": DeploymentProfile(
        short_id="llama-3.2-3b-instruct",
        hf_repo_id="meta-llama/Llama-3.2-3B-Instruct",
        head_dim=128,
        num_hidden_layers=28,
        channels=_std_channels_128(),
        default_channel=Channel("e8", 10, 0, 3.37, None, "balanced"),
        notes="Main Mac-16GB target; requires HF token (gated).",
    ),
    "llama-3.1-8b-instruct": DeploymentProfile(
        short_id="llama-3.1-8b-instruct",
        hf_repo_id="meta-llama/Llama-3.1-8B-Instruct",
        head_dim=128,
        num_hidden_layers=32,
        channels=_std_channels_128(),
        default_channel=Channel("e8", 10, 0, 3.37, None, "balanced"),
        notes="Best quality/size trade-off for Mac-32GB.",
    ),

    # --- Mistral --------------------------------------------------------
    "mistral-7b-instruct-v0.3": DeploymentProfile(
        short_id="mistral-7b-instruct-v0.3",
        hf_repo_id="mistralai/Mistral-7B-Instruct-v0.3",
        head_dim=128,
        num_hidden_layers=32,
        channels=_std_channels_128(),
        default_channel=Channel("e8", 10, 0, 3.37, None, "balanced"),
    ),

    # --- Gemma ----------------------------------------------------------
    # Heterogeneous head_dim (20×256 + 4×512 MatFormer); KakeyaLatticeCache
    # handles per-layer head_dim internally. 
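    # est_* values below are the v1.5 report's measured Gemma rows
    # (e.g. 3.47x at Q=10) rather than the generic _std_channels_128()
    # estimates, which only apply to head_dim=128 models.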
+ "gemma-4-e4b": DeploymentProfile( + short_id="gemma-4-e4b", + hf_repo_id="google/gemma-4-E4B", + head_dim=(256, 512), + num_hidden_layers=24, + channels=( + Channel("e8", 38, 0, 2.30, None, "near-lossless"), + Channel("e8", 10, 0, 3.47, 1.56, "balanced"), + Channel("e8", 4, 0, 4.77, 5.79, "aggressive"), + ), + default_channel=Channel("e8", 10, 0, 3.47, 1.56, "balanced"), + notes="Heterogeneous head_dim; measured in v1.5 report.", + ), + + # --- DeepSeek (R1-Distill series) ---------------------------------- + # Small DeepSeek models are structurally fragile under no-boundary. + # Force boundary=2 on every channel. + "deepseek-r1-distill-qwen-1.5b": DeploymentProfile( + short_id="deepseek-r1-distill-qwen-1.5b", + hf_repo_id="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", + head_dim=128, + num_hidden_layers=28, + channels=( + Channel("e8", 38, 2, 2.40, None, "near-lossless"), + Channel("e8", 10, 2, 3.20, None, "balanced"), + ), + default_channel=Channel("e8", 38, 2, 2.40, None, "near-lossless"), + notes=( + "boundary=2 is REQUIRED; no-boundary in-forward explodes " + "to >50 000% |Δppl| per V15_FULL_4MODEL_REPORT.md §1." + ), + ), + "deepseek-r1-distill-qwen-7b": DeploymentProfile( + short_id="deepseek-r1-distill-qwen-7b", + hf_repo_id="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", + head_dim=128, + num_hidden_layers=28, + channels=( + Channel("e8", 38, 2, 2.40, None, "near-lossless"), + Channel("e8", 10, 2, 3.20, None, "balanced"), + ), + default_channel=Channel("e8", 10, 2, 3.20, None, "balanced"), + ), + + # --- GLM ------------------------------------------------------------ + "glm-4-9b-chat": DeploymentProfile( + short_id="glm-4-9b-chat", + hf_repo_id="zai-org/GLM-4-9B-Chat", + head_dim=128, + num_hidden_layers=40, + channels=( + Channel("e8", 38, 0, 2.30, None, "near-lossless"), + Channel("e8", 10, 0, 3.37, 6.96, "balanced"), + Channel("e8", 4, 0, 4.57, 32.36, "aggressive"), + ), + default_channel=Channel("e8", 10, 0, 3.37, 6.96, "balanced"), + notes="Requires trust_remote_code=True.", + ), +} + + +def iter_model_channel_ids() -> Iterable[str]: + for prof in MODEL_REGISTRY.values(): + yield from prof.all_channel_ids() + + +def resolve_model(channel_id: str) -> tuple[DeploymentProfile, Channel]: + """Parse ``@-q[-b]`` into a (profile, channel) pair. + + Accepts the short id alone (no channel suffix) and returns the + profile's ``default_channel``. + """ + if "@" not in channel_id: + short = channel_id + if short not in MODEL_REGISTRY: + raise KeyError(f"unknown model id {short!r}") + prof = MODEL_REGISTRY[short] + return prof, prof.default_channel + + short, channel_suffix = channel_id.split("@", 1) + if short not in MODEL_REGISTRY: + raise KeyError(f"unknown model id {short!r}") + prof = MODEL_REGISTRY[short] + + parts = channel_suffix.split("-") + variant = parts[0].lower() + q_range = None + boundary = 0 + for part in parts[1:]: + if part.startswith("q") and part[1:].isdigit(): + q_range = int(part[1:]) + elif part.startswith("b") and part[1:].isdigit(): + boundary = int(part[1:]) + if q_range is None: + raise ValueError( + f"channel suffix {channel_suffix!r} missing q segment" + ) + + for ch in prof.channels: + if ch.variant == variant and ch.q_range == q_range and ch.boundary == boundary: + return prof, ch + + raise KeyError( + f"model {short!r} has no channel matching variant={variant} " + f"q_range={q_range} boundary={boundary}. 
Available: " + + ", ".join(prof.all_channel_ids()) + ) diff --git a/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/py.typed b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/py.typed new file mode 100644 index 0000000..e69de29 diff --git a/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/schemas.py b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/schemas.py new file mode 100644 index 0000000..c268012 --- /dev/null +++ b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/schemas.py @@ -0,0 +1,72 @@ +"""OpenAI-compatible request / response schemas (minimal subset). + +We deliberately model only the fields Atomic-Chat's client actually +sends. Anything unknown is accepted via ``model_config = {"extra": "allow"}`` +so we forward-propagate. +""" +from __future__ import annotations + +from typing import Any, Literal + +from pydantic import BaseModel, ConfigDict, Field + + +class ChatMessage(BaseModel): + model_config = ConfigDict(extra="allow") + role: Literal["system", "user", "assistant", "tool"] + content: str | list[dict[str, Any]] + + +class ChatCompletionRequest(BaseModel): + model_config = ConfigDict(extra="allow") + + model: str + messages: list[ChatMessage] + stream: bool = False + temperature: float = 0.7 + top_p: float = 1.0 + max_tokens: int | None = None + stop: str | list[str] | None = None + # Extension: override the per-model default channel per-request. + # Example: {"variant": "e8", "q_range": 38, "boundary": 0} + x_kakeya_override: dict[str, Any] | None = Field(default=None, alias="x_kakeya_override") + + +class ChatCompletionMessage(BaseModel): + role: Literal["assistant"] = "assistant" + content: str + + +class ChatCompletionChoice(BaseModel): + index: int = 0 + message: ChatCompletionMessage + finish_reason: Literal["stop", "length"] = "stop" + + +class ChatCompletionUsage(BaseModel): + prompt_tokens: int + completion_tokens: int + total_tokens: int + + +class ChatCompletionResponse(BaseModel): + id: str + object: Literal["chat.completion"] = "chat.completion" + created: int + model: str + choices: list[ChatCompletionChoice] + usage: ChatCompletionUsage + x_kakeya: dict[str, Any] | None = None + + +class ModelInfo(BaseModel): + model_config = ConfigDict(extra="allow") + id: str + object: Literal["model"] = "model" + owned_by: str = "kakeyalattice" + x_kakeya: dict[str, Any] | None = None + + +class ModelList(BaseModel): + object: Literal["list"] = "list" + data: list[ModelInfo] diff --git a/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/server.py b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/server.py new file mode 100644 index 0000000..63efb35 --- /dev/null +++ b/integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/server.py @@ -0,0 +1,215 @@ +"""FastAPI OpenAI-compatible server. + +Endpoints: + + GET /health + GET /v1/models + POST /v1/chat/completions (stream + non-stream) + GET /v1/kakeya/stats (extension) + +The server imports :mod:`kakeya_sidecar.engine` lazily so unit tests +that only exercise routing / schema logic can run without torch. 
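
Manual smoke test once the sidecar is up (same commands as the
integration README)::

    $ kakeya-sidecar --port 1338 --device mps
    $ curl -s http://localhost:1338/health
    $ curl -s http://localhost:1338/v1/models | jq '.data[].id'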
+""" +from __future__ import annotations + +import json +import logging +import time +import uuid +from typing import Any + +from fastapi import FastAPI, HTTPException +from fastapi.responses import JSONResponse, StreamingResponse + +from .engine import EngineConfig, KakeyaEngine +from .model_registry import MODEL_REGISTRY +from .schemas import ( + ChatCompletionChoice, + ChatCompletionMessage, + ChatCompletionRequest, + ChatCompletionResponse, + ChatCompletionUsage, + ModelInfo, + ModelList, +) + +log = logging.getLogger("kakeya_sidecar.server") + + +def create_app( + cfg: EngineConfig | None = None, + *, + lazy_engine: bool = True, + engine_instance: KakeyaEngine | None = None, +) -> FastAPI: + """Build the FastAPI application. + + Args: + cfg: engine configuration (device, dtype, max_resident). + lazy_engine: if True, the engine is constructed on first use + instead of app startup. Default True — this lets the process + start fast and surface model-load errors on the first + request instead of at boot. + engine_instance: optional pre-built engine (used by ``--prewarm``). + """ + app = FastAPI(title="kakeya-sidecar", version="0.1.0") + + state: dict[str, Any] = {"engine": engine_instance, "cfg": cfg or EngineConfig()} + + def engine() -> KakeyaEngine: + if state["engine"] is None: + state["engine"] = KakeyaEngine(state["cfg"]) + return state["engine"] # type: ignore[return-value] + + if not lazy_engine and state["engine"] is None: + state["engine"] = KakeyaEngine(state["cfg"]) + + # -------------------------------------------------------------- health + + @app.get("/health") + def health() -> dict[str, Any]: + return {"ok": True, "engine_loaded": state["engine"] is not None} + + # ------------------------------------------------------------- /models + + @app.get("/v1/models", response_model=ModelList) + def list_models() -> ModelList: + data: list[ModelInfo] = [] + for profile in MODEL_REGISTRY.values(): + for ch in profile.channels: + data.append( + ModelInfo( + id=profile.channel_id(ch), + owned_by="kakeyalattice", + x_kakeya={ + "hf_repo_id": profile.hf_repo_id, + "head_dim": profile.head_dim, + "num_hidden_layers": profile.num_hidden_layers, + "variant": ch.variant, + "q_range": ch.q_range, + "boundary": ch.boundary, + "est_compression": ch.est_compression, + "est_delta_ppl_pct": ch.est_delta_ppl_pct, + "label": ch.label, + "is_default": ch == profile.default_channel, + "notes": profile.notes, + }, + ) + ) + return ModelList(data=data) + + # --------------------------------------------------- /chat/completions + + @app.post("/v1/chat/completions") + def chat_completions(req: ChatCompletionRequest): + messages = [m.model_dump(exclude_none=True) for m in req.messages] + cid = f"chatcmpl-{uuid.uuid4().hex[:16]}" + created = int(time.time()) + + try: + eng = engine() + except Exception as e: # pragma: no cover + raise HTTPException(500, f"engine init failed: {e}") from e + + if req.stream: + return StreamingResponse( + _sse_stream(eng, req, cid, created), + media_type="text/event-stream", + ) + + try: + text, stats = eng.chat( + req.model, + messages, + max_tokens=req.max_tokens or 512, + temperature=req.temperature, + top_p=req.top_p, + override=req.x_kakeya_override, + ) + except KeyError as e: + raise HTTPException(404, str(e)) from e + except Exception as e: # pragma: no cover + log.exception("chat failed") + raise HTTPException(500, str(e)) from e + + resp = ChatCompletionResponse( + id=cid, + created=created, + model=req.model, + choices=[ + ChatCompletionChoice( + 
message=ChatCompletionMessage(content=text), + finish_reason="stop", + ) + ], + usage=ChatCompletionUsage( + prompt_tokens=stats["prompt_tokens"], + completion_tokens=stats["completion_tokens"], + total_tokens=stats["prompt_tokens"] + stats["completion_tokens"], + ), + x_kakeya=stats, + ) + return JSONResponse(resp.model_dump()) + + # ------------------------------------------------- /v1/kakeya/stats + + @app.get("/v1/kakeya/stats") + def kakeya_stats() -> dict[str, Any]: + eng = state["engine"] + if eng is None: + return {"engine_loaded": False} + return { + "engine_loaded": True, + "device": eng._device, + "resident_models": list(eng._loaded.keys()), + "max_resident": eng.cfg.max_resident, + } + + return app + + +# --------------------------------------------------------------- sse helper + + +def _sse_stream(eng: KakeyaEngine, req: ChatCompletionRequest, cid: str, created: int): + messages = [m.model_dump(exclude_none=True) for m in req.messages] + + def chunk(delta: dict[str, Any], finish_reason: str | None = None) -> str: + payload = { + "id": cid, + "object": "chat.completion.chunk", + "created": created, + "model": req.model, + "choices": [ + { + "index": 0, + "delta": delta, + "finish_reason": finish_reason, + } + ], + } + return f"data: {json.dumps(payload)}\n\n" + + yield chunk({"role": "assistant"}) + try: + for piece in eng.chat_stream( + req.model, + messages, + max_tokens=req.max_tokens or 512, + temperature=req.temperature, + top_p=req.top_p, + override=req.x_kakeya_override, + ): + yield chunk({"content": piece}) + except KeyError as e: + yield chunk({"content": f"[error] {e}"}, finish_reason="stop") + yield "data: [DONE]\n\n" + return + except Exception as e: # pragma: no cover + log.exception("stream failed") + yield chunk({"content": f"[error] {e}"}, finish_reason="stop") + yield "data: [DONE]\n\n" + return + + yield chunk({}, finish_reason="stop") + yield "data: [DONE]\n\n" diff --git a/integrations/atomic-chat/kakeya_sidecar/pyproject.toml b/integrations/atomic-chat/kakeya_sidecar/pyproject.toml new file mode 100644 index 0000000..59b3420 --- /dev/null +++ b/integrations/atomic-chat/kakeya_sidecar/pyproject.toml @@ -0,0 +1,50 @@ +[build-system] +requires = ["setuptools>=61", "wheel"] +build-backend = "setuptools.build_meta" + +[project] +name = "kakeya-sidecar" +version = "0.1.0" +description = "OpenAI-compatible local inference sidecar for Atomic-Chat, powered by KakeyaLattice v1.5 E8 KV-cache compression." 
+readme = "README.md" +requires-python = ">=3.10" +license = { text = "Apache-2.0" } +authors = [ + { name = "KakeyaLattice + Atomic-Chat integration" }, +] +classifiers = [ + "Development Status :: 3 - Alpha", + "Intended Audience :: Developers", + "License :: OSI Approved :: Apache Software License", + "Operating System :: MacOS", + "Operating System :: POSIX :: Linux", + "Programming Language :: Python :: 3.10", + "Programming Language :: Python :: 3.11", + "Programming Language :: Python :: 3.12", + "Topic :: Scientific/Engineering :: Artificial Intelligence", +] +dependencies = [ + "kakeyalattice[hf]>=1.5", + "transformers>=4.45", + "torch>=2.1", + "fastapi>=0.110", + "uvicorn[standard]>=0.29", + "pydantic>=2.6", + "sse-starlette>=1.8", +] + +[project.optional-dependencies] +mac = [ + "accelerate>=0.30", +] +dev = [ + "pytest>=7", + "httpx>=0.27", +] + +[project.scripts] +kakeya-sidecar = "kakeya_sidecar.cli:main" + +[tool.setuptools.packages.find] +where = ["."] +include = ["kakeya_sidecar*"] diff --git a/integrations/atomic-chat/kakeya_sidecar/tests/__init__.py b/integrations/atomic-chat/kakeya_sidecar/tests/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/integrations/atomic-chat/kakeya_sidecar/tests/test_model_registry.py b/integrations/atomic-chat/kakeya_sidecar/tests/test_model_registry.py new file mode 100644 index 0000000..5a75fe4 --- /dev/null +++ b/integrations/atomic-chat/kakeya_sidecar/tests/test_model_registry.py @@ -0,0 +1,84 @@ +"""Pure-logic tests for model_registry — no torch / HF downloads.""" +from __future__ import annotations + +import pytest + +from kakeya_sidecar.model_registry import ( + MODEL_REGISTRY, + resolve_model, + iter_model_channel_ids, +) + + +def test_registry_non_empty_and_covers_v15_4models() -> None: + for required in ( + "qwen3-4b", + "gemma-4-e4b", + "glm-4-9b-chat", + "deepseek-r1-distill-qwen-1.5b", + ): + assert required in MODEL_REGISTRY, required + + +def test_every_profile_has_default_channel_in_channels() -> None: + for short, prof in MODEL_REGISTRY.items(): + assert prof.default_channel in prof.channels, ( + f"{short}: default_channel not in channels" + ) + + +def test_deepseek_small_always_has_boundary() -> None: + prof = MODEL_REGISTRY["deepseek-r1-distill-qwen-1.5b"] + for ch in prof.channels: + assert ch.boundary >= 2, ( + "DeepSeek-R1-Distill-Qwen-1.5B requires boundary>=2 per v1.5 report" + ) + + +def test_channel_ids_roundtrip() -> None: + """For every channel the profile lists, resolve_model() must recover it.""" + for cid in iter_model_channel_ids(): + prof, ch = resolve_model(cid) + assert prof.channel_id(ch) == cid + + +def test_short_id_resolves_to_default() -> None: + prof, ch = resolve_model("qwen3-4b") + assert ch == prof.default_channel + + +def test_short_id_with_boundary_suffix() -> None: + # DeepSeek-1.5B uses b=2. 
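+    # Channel ids follow "<short>@<variant>-q<q_range>[-b<boundary>]";
+    # the "-b" suffix appears only on channels that pin a boundary.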
+    prof, ch = resolve_model("deepseek-r1-distill-qwen-1.5b@e8-q10-b2")
+    assert ch.variant == "e8" and ch.q_range == 10 and ch.boundary == 2
+
+
+def test_unknown_model_raises() -> None:
+    with pytest.raises(KeyError):
+        resolve_model("nonexistent-7b")
+
+
+def test_unknown_channel_raises() -> None:
+    with pytest.raises(KeyError):
+        # Q=9999 is not in the Qwen3-4B channel set
+        resolve_model("qwen3-4b@e8-q9999")
+
+
+def test_missing_q_suffix_raises() -> None:
+    with pytest.raises(ValueError):
+        resolve_model("qwen3-4b@e8")
+
+
+def test_head_dim_shapes_are_int_or_tuple() -> None:
+    for prof in MODEL_REGISTRY.values():
+        assert isinstance(prof.head_dim, (int, tuple))
+        if isinstance(prof.head_dim, tuple):
+            for hd in prof.head_dim:
+                assert isinstance(hd, int) and hd > 0
+
+
+def test_compression_ratios_are_sane() -> None:
+    """All listed channels should claim compression in [1.0, 10.0]."""
+    for prof in MODEL_REGISTRY.values():
+        for ch in prof.channels:
+            assert 1.0 <= ch.est_compression <= 10.0, (prof.short_id, ch)
diff --git a/integrations/atomic-chat/kakeya_sidecar/tests/test_server_routing.py b/integrations/atomic-chat/kakeya_sidecar/tests/test_server_routing.py
new file mode 100644
index 0000000..4002902
--- /dev/null
+++ b/integrations/atomic-chat/kakeya_sidecar/tests/test_server_routing.py
@@ -0,0 +1,100 @@
+"""Server-level smoke tests — mock the engine so torch is not needed."""
+from __future__ import annotations
+
+from typing import Any
+from unittest.mock import MagicMock
+
+import pytest
+
+# Must import FastAPI / TestClient lazily so that environments without
+# uvicorn can still `python -m compileall`.
+fastapi = pytest.importorskip("fastapi")
+from fastapi.testclient import TestClient  # noqa: E402
+
+from kakeya_sidecar import server as server_mod  # noqa: E402
+
+
+@pytest.fixture()
+def app_with_mock_engine(monkeypatch):
+    mock_engine = MagicMock()
+    mock_engine._device = "cpu"
+    mock_engine._loaded = {}
+    mock_engine.cfg.max_resident = 1
+
+    def _fake_chat(channel_id: str, messages, **kw) -> tuple[str, dict[str, Any]]:
+        return ("hello from kakeya", {
+            "variant": "e8", "q_range": 10, "boundary": 0,
+            "est_compression": 3.37, "est_delta_ppl_pct": 3.85,
+            "prompt_tokens": 5, "completion_tokens": 4, "generation_time_s": 0.01,
+            "codec_fired_per_layer": {}, "skip_fired_per_layer": {},
+        })
+
+    mock_engine.chat.side_effect = _fake_chat
+
+    app = server_mod.create_app(lazy_engine=True)
+
+    # The app was built with lazy_engine=True, so the `engine()` closure
+    # constructs `server_mod.KakeyaEngine` on first use. Patching that
+    # name is enough to make every route see the mock instead.
+    monkeypatch.setattr(server_mod, "KakeyaEngine", lambda *a, **kw: mock_engine)
+
+    return app, mock_engine
+
+
+def test_list_models_returns_v15_4models(app_with_mock_engine) -> None:
+    app, _ = app_with_mock_engine
+    client = TestClient(app)
+    r = client.get("/v1/models")
+    assert r.status_code == 200, r.text
+    data = r.json()
+    ids = {m["id"] for m in data["data"]}
+    # Must include the balanced channel for each v1.5 hero model.
+    assert "qwen3-4b@e8-q10" in ids
+    assert "gemma-4-e4b@e8-q10" in ids
+    assert "glm-4-9b-chat@e8-q10" in ids
+    # And the DeepSeek small must carry boundary=2.
+ assert "deepseek-r1-distill-qwen-1.5b@e8-q10-b2" in ids + + +def test_health_before_engine_use(app_with_mock_engine) -> None: + app, _ = app_with_mock_engine + client = TestClient(app) + r = client.get("/health") + assert r.status_code == 200 + assert r.json()["ok"] is True + + +def test_chat_completions_non_stream(app_with_mock_engine) -> None: + app, mock_engine = app_with_mock_engine + client = TestClient(app) + r = client.post("/v1/chat/completions", json={ + "model": "qwen3-4b@e8-q10", + "messages": [{"role": "user", "content": "hi"}], + "stream": False, + "max_tokens": 8, + }) + assert r.status_code == 200, r.text + data = r.json() + assert data["choices"][0]["message"]["content"] == "hello from kakeya" + assert data["usage"]["total_tokens"] == 9 + assert data["x_kakeya"]["variant"] == "e8" + mock_engine.chat.assert_called_once() + + +def test_chat_completions_unknown_model(app_with_mock_engine) -> None: + app, mock_engine = app_with_mock_engine + + def _raise(channel_id, messages, **kw): + raise KeyError(f"unknown model id {channel_id!r}") + + mock_engine.chat.side_effect = _raise + client = TestClient(app) + r = client.post("/v1/chat/completions", json={ + "model": "nonexistent-99b@e8-q10", + "messages": [{"role": "user", "content": "hi"}], + }) + assert r.status_code == 404 diff --git a/integrations/atomic-chat/kakeyalattice-extension/README.md b/integrations/atomic-chat/kakeyalattice-extension/README.md new file mode 100644 index 0000000..f78fbaf --- /dev/null +++ b/integrations/atomic-chat/kakeyalattice-extension/README.md @@ -0,0 +1,49 @@ +# @atomic-chat/kakeyalattice-extension + +Atomic-Chat TypeScript 扩展,把 KakeyaLattice v1.5 sidecar 作为**第二个 +一等推理后端** 注册进 Atomic-Chat 的 Extension System。 + +## 架构位置 + +``` + web-app (React) + │ + ▼ + core SDK ──┬── registerBackend(...) + │ + extensions ─┤ + ├── llamacpp-extension (既有,GGUF / llama.cpp) + └── kakeyalattice-extension ★ 本扩展 + │ + │ Tauri invoke + ▼ + plugins/kakeyalattice (Rust, 下一级目录) + │ + │ supervise + HTTP + ▼ + kakeya-sidecar (Python, localhost:1338) +``` + +## 关键文件 + +- `src/index.ts` — 扩展入口,向 Core SDK 注册一个新的 `Backend`。 +- `src/backend.ts` — `Backend` 实现:`listModels()`、`chatCompletion()`、 + `healthCheck()`、`getStats()`,统一通过 Tauri invoke 过桥到 Rust 插件。 +- `src/types.ts` — 与 Python sidecar `/v1/models` `x_kakeya` 字段一一对齐的 + 类型定义。 + +## 真正落地到 Atomic-Chat 主仓库时要做什么 + +1. 把本目录整份 `rsync` 到 `extensions/kakeyalattice-extension/`。 +2. 修 `@atomic-chat/core` 的 `peerDependency` 版本指向主仓库的实际版本。 +3. 在 Atomic-Chat 的扩展注册入口(通常是 `core/src/extensions.ts` 或 + `web-app/src/App.tsx` 里 bootstrap 阶段的扩展列表)追加: + ```ts + import { register as registerKakeya } from "@atomic-chat/kakeyalattice-extension"; + registerKakeya(); + ``` +4. 
+   register it in `src-tauri/Cargo.toml` + `tauri.conf.json`.
+
+Since we **do not modify the Atomic-Chat main repo itself**, its maintainers
+review the code and decide how to merge it; this extension only promises that
+the skeleton passes `tsc --noEmit`.
diff --git a/integrations/atomic-chat/kakeyalattice-extension/package.json b/integrations/atomic-chat/kakeyalattice-extension/package.json
new file mode 100644
index 0000000..f265588
--- /dev/null
+++ b/integrations/atomic-chat/kakeyalattice-extension/package.json
@@ -0,0 +1,26 @@
+{
+  "name": "@atomic-chat/kakeyalattice-extension",
+  "version": "0.1.0",
+  "description": "Atomic-Chat extension that routes local inference through the KakeyaLattice v1.5 sidecar (E8 nested-lattice KV-cache compression).",
+  "license": "Apache-2.0",
+  "main": "dist/index.js",
+  "types": "dist/index.d.ts",
+  "type": "module",
+  "files": [
+    "dist",
+    "src",
+    "README.md"
+  ],
+  "scripts": {
+    "build": "tsc -p tsconfig.json",
+    "typecheck": "tsc --noEmit",
+    "dev": "tsc -w -p tsconfig.json"
+  },
+  "peerDependencies": {
+    "@atomic-chat/core": "*",
+    "@tauri-apps/api": ">=2.0.0"
+  },
+  "devDependencies": {
+    "typescript": "^5.4.0"
+  }
+}
diff --git a/integrations/atomic-chat/kakeyalattice-extension/src/backend.ts b/integrations/atomic-chat/kakeyalattice-extension/src/backend.ts
new file mode 100644
index 0000000..c8618db
--- /dev/null
+++ b/integrations/atomic-chat/kakeyalattice-extension/src/backend.ts
@@ -0,0 +1,119 @@
+/**
+ * KakeyaLattice backend — Atomic-Chat `Backend` implementation.
+ *
+ * We do NOT import `@atomic-chat/core` at compile time (its exact
+ * shape is governed by the host app). Instead we declare the minimal
+ * interface we need as a local type, and the host will use structural
+ * typing via `registerBackend(new KakeyaBackend())`.
+ */
+import {
+  ChatCompletionRequest,
+  ChatCompletionResponse,
+  KakeyaModel,
+  KakeyaModelList,
+  SidecarStats,
+} from "./types";
+
+/**
+ * Minimal shape of Atomic-Chat's backend contract. Keep in sync with
+ * the host app's `core/src/backend.ts` as it evolves.
+ */
+export interface Backend {
+  readonly id: string;
+  readonly displayName: string;
+  readonly kind: "local" | "cloud";
+  listModels(): Promise<KakeyaModel[]>;
+  chatCompletion(req: ChatCompletionRequest): Promise<ChatCompletionResponse>;
+  chatCompletionStream(
+    req: ChatCompletionRequest,
+    onDelta: (chunk: string) => void,
+  ): Promise<void>;
+  healthCheck(): Promise<boolean>;
+  getStats?(): Promise<SidecarStats>;
+}
+
+/**
+ * Tauri invoke wrapper — resolved at runtime so the package can still
+ * `tsc --noEmit` in a node-only CI where `@tauri-apps/api` isn't
+ * available.
+ */
+type InvokeFn = <T>(cmd: string, args?: Record<string, unknown>) => Promise<T>;
+
+async function tauriInvoke(): Promise<InvokeFn> {
+  // Prefer the runtime-injected global when running inside Tauri.
+  const win = globalThis as unknown as {
+    __TAURI__?: { invoke: InvokeFn };
+    __TAURI_INTERNALS__?: { invoke: InvokeFn };
+  };
+  if (win.__TAURI__?.invoke) return win.__TAURI__.invoke;
+  if (win.__TAURI_INTERNALS__?.invoke) return win.__TAURI_INTERNALS__.invoke;
+  // Fallback to the dynamic import (ESM, requires @tauri-apps/api).
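+  // ("@tauri-apps/api/core" is the Tauri 2 import path for `invoke`;
+  // Tauri 1 exported it from "@tauri-apps/api/tauri", which this
+  // extension does not target.)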
+  const mod = await import("@tauri-apps/api/core");
+  return mod.invoke as InvokeFn;
+}
+
+export class KakeyaBackend implements Backend {
+  readonly id = "kakeyalattice";
+  readonly displayName = "KakeyaLattice (E8 KV-cache compression)";
+  readonly kind = "local" as const;
+
+  async listModels(): Promise<KakeyaModel[]> {
+    const invoke = await tauriInvoke();
+    const list = await invoke<KakeyaModelList>(
+      "plugin:kakeyalattice|list_models",
+    );
+    return list.data;
+  }
+
+  async chatCompletion(req: ChatCompletionRequest): Promise<ChatCompletionResponse> {
+    const invoke = await tauriInvoke();
+    return invoke<ChatCompletionResponse>(
+      "plugin:kakeyalattice|chat_completion",
+      { request: { ...req, stream: false } },
+    );
+  }
+
+  async chatCompletionStream(
+    req: ChatCompletionRequest,
+    onDelta: (chunk: string) => void,
+  ): Promise<void> {
+    const invoke = await tauriInvoke();
+    // The Rust plugin streams via a Tauri event named after the request id.
+    const streamId = await invoke<string>(
+      "plugin:kakeyalattice|chat_completion_stream_start",
+      { request: { ...req, stream: true } },
+    );
+
+    // Listen on the event channel. `@tauri-apps/api/event` is resolved lazily.
+    const eventMod = await import("@tauri-apps/api/event");
+    const unlisten = await eventMod.listen<string>(
+      `kakeyalattice:${streamId}`,
+      (e) => {
+        if (e.payload === "__DONE__") {
+          unlisten();
+          return;
+        }
+        onDelta(e.payload);
+      },
+    );
+
+    return invoke<void>("plugin:kakeyalattice|chat_completion_stream_wait", {
+      streamId,
+    });
+  }
+
+  async healthCheck(): Promise<boolean> {
+    try {
+      const invoke = await tauriInvoke();
+      const ok = await invoke<{ ok: boolean }>("plugin:kakeyalattice|health");
+      return ok.ok === true;
+    } catch {
+      return false;
+    }
+  }
+
+  async getStats(): Promise<SidecarStats> {
+    const invoke = await tauriInvoke();
+    return invoke<SidecarStats>("plugin:kakeyalattice|stats");
+  }
+}
diff --git a/integrations/atomic-chat/kakeyalattice-extension/src/index.ts b/integrations/atomic-chat/kakeyalattice-extension/src/index.ts
new file mode 100644
index 0000000..c513bdf
--- /dev/null
+++ b/integrations/atomic-chat/kakeyalattice-extension/src/index.ts
@@ -0,0 +1,32 @@
+/**
+ * Atomic-Chat extension entry point.
+ *
+ * Usage (inside the Atomic-Chat host app bootstrap):
+ *
+ *     import { register as registerKakeya } from "@atomic-chat/kakeyalattice-extension";
+ *     registerKakeya();
+ *
+ * The host app is expected to expose a global registry via
+ * `window.AtomicChatCore?.registerBackend(backend)`. We fall back to
+ * exporting the backend class directly so the host can wire it up
+ * however it likes.
+ */
+import { KakeyaBackend } from "./backend";
+
+export * from "./types";
+export { KakeyaBackend };
+
+type HostAPI = {
+  registerBackend?: (backend: unknown) => void;
+};
+
+export function register(): KakeyaBackend {
+  const backend = new KakeyaBackend();
+  const host = (globalThis as unknown as { AtomicChatCore?: HostAPI }).AtomicChatCore;
+  if (host?.registerBackend) {
+    host.registerBackend(backend);
+  }
+  return backend;
+}
+
+export default register;
diff --git a/integrations/atomic-chat/kakeyalattice-extension/src/tauri-shims.d.ts b/integrations/atomic-chat/kakeyalattice-extension/src/tauri-shims.d.ts
new file mode 100644
index 0000000..c2d87fc
--- /dev/null
+++ b/integrations/atomic-chat/kakeyalattice-extension/src/tauri-shims.d.ts
@@ -0,0 +1,23 @@
+/**
+ * Ambient type shims for Tauri modules. When this package is built
+ * inside the Atomic-Chat host app, the real `@tauri-apps/api` types
+ * take precedence (via node_modules). These shims exist so the package
+ * also typechecks in a minimal CI where the peer dependency isn't
+ * installed.
+ */
+declare module "@tauri-apps/api/core" {
+  export function invoke<T>(cmd: string, args?: Record<string, unknown>): Promise<T>;
+}
+
+declare module "@tauri-apps/api/event" {
+  export interface Event<T> {
+    event: string;
+    id: number;
+    payload: T;
+  }
+  export type UnlistenFn = () => void;
+  export function listen<T>(
+    event: string,
+    handler: (e: Event<T>) => void,
+  ): Promise<UnlistenFn>;
+}
diff --git a/integrations/atomic-chat/kakeyalattice-extension/src/types.ts b/integrations/atomic-chat/kakeyalattice-extension/src/types.ts
new file mode 100644
index 0000000..2d25f72
--- /dev/null
+++ b/integrations/atomic-chat/kakeyalattice-extension/src/types.ts
@@ -0,0 +1,75 @@
+/**
+ * Types mirroring the Python sidecar's `/v1/models` response.
+ * Kept in sync with `kakeya_sidecar/schemas.py` and
+ * `kakeya_sidecar/model_registry.py`.
+ */
+
+export interface KakeyaModelMeta {
+  hf_repo_id: string;
+  head_dim: number | number[];
+  num_hidden_layers: number;
+  variant: "d4" | "e8";
+  q_range: number;
+  boundary: number;
+  est_compression: number;
+  est_delta_ppl_pct: number | null;
+  label: string;
+  is_default: boolean;
+  notes: string;
+}
+
+export interface KakeyaModel {
+  id: string; // "<short>@<variant>-q<q_range>[-b<boundary>]"
+  object: "model";
+  owned_by: "kakeyalattice";
+  x_kakeya: KakeyaModelMeta;
+}
+
+export interface KakeyaModelList {
+  object: "list";
+  data: KakeyaModel[];
+}
+
+export interface ChatMessage {
+  role: "system" | "user" | "assistant" | "tool";
+  content: string;
+}
+
+export interface ChatCompletionRequest {
+  model: string;
+  messages: ChatMessage[];
+  stream?: boolean;
+  temperature?: number;
+  top_p?: number;
+  max_tokens?: number;
+  stop?: string | string[];
+  /** Runtime override for the per-model default channel. */
+  x_kakeya_override?: {
+    variant?: "d4" | "e8";
+    q_range?: number;
+    boundary?: number;
+  };
+}
+
+export interface ChatCompletionChoice {
+  index: number;
+  message: { role: "assistant"; content: string };
+  finish_reason: "stop" | "length";
+}
+
+export interface ChatCompletionResponse {
+  id: string;
+  object: "chat.completion";
+  created: number;
+  model: string;
+  choices: ChatCompletionChoice[];
+  usage: { prompt_tokens: number; completion_tokens: number; total_tokens: number };
+  x_kakeya?: Record<string, unknown>;
+}
+
+export interface SidecarStats {
+  engine_loaded: boolean;
+  device?: "mps" | "cuda" | "cpu";
+  resident_models?: string[];
+  max_resident?: number;
+}
diff --git a/integrations/atomic-chat/kakeyalattice-extension/tsconfig.json b/integrations/atomic-chat/kakeyalattice-extension/tsconfig.json
new file mode 100644
index 0000000..7f7ec13
--- /dev/null
+++ b/integrations/atomic-chat/kakeyalattice-extension/tsconfig.json
@@ -0,0 +1,16 @@
+{
+  "compilerOptions": {
+    "target": "ES2022",
+    "module": "ESNext",
+    "moduleResolution": "Bundler",
+    "lib": ["ES2022", "DOM"],
+    "strict": true,
+    "declaration": true,
+    "outDir": "dist",
+    "rootDir": "src",
+    "esModuleInterop": true,
+    "skipLibCheck": true,
+    "forceConsistentCasingInFileNames": true
+  },
+  "include": ["src/**/*"]
+}
diff --git a/integrations/atomic-chat/tauri-plugin-kakeyalattice/Cargo.toml b/integrations/atomic-chat/tauri-plugin-kakeyalattice/Cargo.toml
new file mode 100644
index 0000000..26e6c77
--- /dev/null
+++ b/integrations/atomic-chat/tauri-plugin-kakeyalattice/Cargo.toml
@@ -0,0 +1,22 @@
+[package]
+name = "tauri-plugin-kakeyalattice"
+version = "0.1.0"
+edition = "2021"
+license = "Apache-2.0"
+description = "Tauri 2 plugin that supervises the Python kakeya-sidecar process and proxies OpenAI-compatible inference calls to it."
+
+[dependencies]
+tauri = { version = "2", features = [] }
+serde = { version = "1", features = ["derive"] }
+serde_json = "1"
+tokio = { version = "1", features = ["process", "rt-multi-thread", "macros", "io-util", "sync"] }
+reqwest = { version = "0.12", features = ["json", "stream"] }
+futures-util = "0.3"
+thiserror = "1"
+log = "0.4"
+uuid = { version = "1", features = ["v4"] }
+lazy_static = "1"
+
+[lib]
+name = "tauri_plugin_kakeyalattice"
+crate-type = ["cdylib", "rlib", "staticlib"]
diff --git a/integrations/atomic-chat/tauri-plugin-kakeyalattice/README.md b/integrations/atomic-chat/tauri-plugin-kakeyalattice/README.md
new file mode 100644
index 0000000..451c8d7
--- /dev/null
+++ b/integrations/atomic-chat/tauri-plugin-kakeyalattice/README.md
@@ -0,0 +1,78 @@
+# tauri-plugin-kakeyalattice
+
+Tauri 2 plugin that:
+
+1. **Supervises** the Python `kakeya-sidecar` process (spawn / health-check / restart).
+2. **Proxies** OpenAI-compatible HTTP calls from the Atomic-Chat extension
+   (`@atomic-chat/kakeyalattice-extension`) to the sidecar at `localhost:1338`.
+3. **Forwards streaming deltas** from the sidecar's SSE stream as Tauri
+   events on the channel `kakeyalattice:<stream_id>`.
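+
+For orientation, this is the shape of a call from the extension side: a
+minimal sketch, assuming only Tauri 2's `invoke` import path and the
+`health` command from the table below.
+
+```ts
+import { invoke } from "@tauri-apps/api/core";
+
+// Ask the plugin (not the sidecar directly) whether the sidecar is up.
+// Routing through the plugin keeps lifecycle and permissions in Rust.
+export async function sidecarIsUp(): Promise<boolean> {
+  const res = await invoke<{ ok: boolean }>("plugin:kakeyalattice|health");
+  return res.ok === true;
+}
+```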
+
+Registered Tauri commands (used by the TS extension):
+
+| Command | Purpose |
+|:--|:--|
+| `plugin:kakeyalattice\|list_models` | Proxy `GET /v1/models` |
+| `plugin:kakeyalattice\|chat_completion` | Proxy `POST /v1/chat/completions` (non-stream) |
+| `plugin:kakeyalattice\|chat_completion_stream_start` | Start a stream, return a stream id |
+| `plugin:kakeyalattice\|chat_completion_stream_wait` | Await stream completion |
+| `plugin:kakeyalattice\|health` | `GET /health` |
+| `plugin:kakeyalattice\|stats` | `GET /v1/kakeya/stats` |
+
+## Sidecar lifecycle
+
+- **Path resolution order**:
+  1. `ATOMIC_CHAT_KAKEYA_SIDECAR_PATH` env override.
+  2. Bundled binary at `resources/kakeya-sidecar[.exe]` (product builds).
+  3. `$PATH` lookup (dev builds — assumes `pip install -e ./kakeya_sidecar`).
+- **Arguments**: `--host 127.0.0.1 --port 1338 --device auto --log-level info`.
+- **Supervision**: the plugin spawns the process at `setup()` time,
+  tails its stdout/stderr into the app log, and restarts on crash
+  with exponential backoff (aligned with the existing llamacpp plugin).
+- **Port**: pinned to 1338 to sit next to Atomic-Chat's 1337 front door.
+  Passed via CLI flag so concurrent instances are possible in tests.
+
+## Why a separate Rust plugin (instead of calling sidecar HTTP from JS)
+
+- Sidecar lifecycle is a native concern (fork / waitpid / SIGTERM on app
+  quit). Keeping it in Rust mirrors the `llamacpp` plugin and avoids
+  duplicating supervision logic in two languages.
+- Tauri's permission model lets us whitelist the sidecar HTTP origin
+  without opening CORS holes for the whole web-app.
+- Streaming SSE → Tauri events gives the web-app a normal event-bus
+  interface instead of fiddling with EventSource behind CSP.
+
+## Minimal example (host app wiring)
+
+In `src-tauri/src/main.rs`:
+
+```rust
+fn main() {
+    tauri::Builder::default()
+        .plugin(tauri_plugin_kakeyalattice::init())
+        .run(tauri::generate_context!())
+        .expect("error while running tauri application");
+}
+```
+
+In `tauri.conf.json` under `plugins`:
+
+```json
+{
+  "plugins": {
+    "kakeyalattice": {
+      "sidecarPort": 1338,
+      "autoStart": true,
+      "device": "auto"
+    }
+  }
+}
+```
+
+## Status
+
+This directory ships a **skeleton**: the commands, the SSE → event
+bridge, and a minimal supervisor (spawn + 30s health gate) are in
+place, and `cargo check` passes. The remaining hardening (restart
+policy with backoff, SIGTERM on app quit, Windows job objects) will
+follow in a dedicated PR inside the Atomic-Chat main repo; this repo
+keeps the plugin in a shape the Atomic-Chat maintainers can drop in.
diff --git a/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/commands.rs b/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/commands.rs
new file mode 100644
index 0000000..70e254b
--- /dev/null
+++ b/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/commands.rs
@@ -0,0 +1,138 @@
+//! Tauri command handlers — thin HTTP proxies to the sidecar.
+//!
+//! Streaming uses a `(start, wait)` pair:
+//! - `chat_completion_stream_start` spawns a task reading SSE from
+//!   the sidecar, emits each delta as a Tauri event on
+//!   `kakeyalattice:<stream_id>`, and returns the id.
+//! - `chat_completion_stream_wait` awaits a oneshot channel so the
+//!   JS caller can `await` final completion instead of polling.
+
+use futures_util::StreamExt;
+use serde_json::Value as Json;
+use std::collections::HashMap;
+use std::sync::Mutex;
+use tauri::{AppHandle, Emitter, Runtime, State};
+use tokio::sync::oneshot;
+use uuid::Uuid;
+
+use crate::{Error, PluginState, Result};
+
+lazy_static::lazy_static! {
+    // stream_id -> oneshot receiver, waiting for the SSE loop to finish.
+    static ref STREAM_WAITERS: Mutex<HashMap<String, oneshot::Receiver<()>>> = Mutex::new(HashMap::new());
+}
+
+// We use `lazy_static` to avoid pulling `once_cell` just for this one
+// static. If the crate disallows lazy_static, swap to
+// `tokio::sync::OnceCell` in a follow-up.
+
+#[tauri::command]
+pub async fn list_models(state: State<'_, PluginState>) -> Result<Json> {
+    let url = format!("{}/v1/models", state.base_url());
+    let resp = state.http.get(url).send().await?;
+    Ok(resp.json::<Json>().await?)
+}
+
+#[tauri::command]
+pub async fn chat_completion(
+    state: State<'_, PluginState>,
+    request: Json,
+) -> Result<Json> {
+    let url = format!("{}/v1/chat/completions", state.base_url());
+    let resp = state.http.post(url).json(&request).send().await?;
+    Ok(resp.json::<Json>().await?)
+}
+
+#[tauri::command]
+pub async fn chat_completion_stream_start<R: Runtime>(
+    app: AppHandle<R>,
+    state: State<'_, PluginState>,
+    mut request: Json,
+) -> Result<String> {
+    // Force stream=true on the request we send to the sidecar.
+    if let Some(obj) = request.as_object_mut() {
+        obj.insert("stream".into(), Json::Bool(true));
+    }
+
+    let url = format!("{}/v1/chat/completions", state.base_url());
+    let resp = state.http.post(url).json(&request).send().await?;
+    if !resp.status().is_success() {
+        let code = resp.status();
+        let body = resp.text().await.unwrap_or_default();
+        return Err(Error::Protocol(format!("sidecar returned {code}: {body}")));
+    }
+
+    let stream_id = Uuid::new_v4().simple().to_string();
+    let event_name = format!("kakeyalattice:{stream_id}");
+
+    let (done_tx, done_rx) = oneshot::channel();
+    {
+        let mut waiters = STREAM_WAITERS.lock().expect("stream waiters lock");
+        waiters.insert(stream_id.clone(), done_rx);
+    }
+
+    let app_clone = app.clone();
+    tauri::async_runtime::spawn(async move {
+        let mut stream = resp.bytes_stream();
+        let mut buf = Vec::<u8>::new();
+        while let Some(chunk) = stream.next().await {
+            let chunk = match chunk {
+                Ok(c) => c,
+                Err(e) => {
+                    log::warn!("sidecar stream error: {e}");
+                    break;
+                }
+            };
+            buf.extend_from_slice(&chunk);
+            while let Some(idx) = buf.windows(2).position(|w| w == b"\n\n") {
+                let frame: Vec<u8> = buf.drain(..idx + 2).collect();
+                let text = String::from_utf8_lossy(&frame);
+                for line in text.lines() {
+                    let line = line.trim();
+                    if !line.starts_with("data:") { continue; }
+                    let payload = line.trim_start_matches("data:").trim();
+                    if payload == "[DONE]" {
+                        let _ = app_clone.emit(&event_name, "__DONE__");
+                        continue;
+                    }
+                    if let Ok(json) = serde_json::from_str::<Json>(payload) {
+                        if let Some(delta) = json.pointer("/choices/0/delta/content")
+                            .and_then(Json::as_str)
+                        {
+                            let _ = app_clone.emit(&event_name, delta.to_string());
+                        }
+                    }
+                }
+            }
+        }
+        let _ = done_tx.send(());
+    });
+
+    Ok(stream_id)
+}
+
+#[tauri::command]
+pub async fn chat_completion_stream_wait(stream_id: String) -> Result<()> {
+    let rx = {
+        let mut waiters = STREAM_WAITERS.lock().expect("stream waiters lock");
+        waiters.remove(&stream_id)
+    };
+    if let Some(rx) = rx {
+        let _ = rx.await;
+    }
+    Ok(())
+}
+
+#[tauri::command]
+pub async fn health(state: State<'_, PluginState>) -> Result<Json> {
+    let url = format!("{}/health", state.base_url());
+    let resp = state.http.get(url).send().await?;
+    Ok(resp.json::<Json>().await?)
+}
+
+#[tauri::command]
+pub async fn stats(state: State<'_, PluginState>) -> Result<Json> {
+    let url = format!("{}/v1/kakeya/stats", state.base_url());
+    let resp = state.http.get(url).send().await?;
+    Ok(resp.json::<Json>().await?)
+}
diff --git a/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/error.rs b/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/error.rs
new file mode 100644
index 0000000..008d88b
--- /dev/null
+++ b/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/error.rs
@@ -0,0 +1,27 @@
+use thiserror::Error;
+
+#[derive(Debug, Error)]
+pub enum Error {
+    #[error("http error: {0}")]
+    Http(#[from] reqwest::Error),
+
+    #[error("io error: {0}")]
+    Io(#[from] std::io::Error),
+
+    #[error("sidecar not ready (tried {tries} times)")]
+    SidecarNotReady { tries: u32 },
+
+    #[error("sidecar spawn failed: {0}")]
+    SidecarSpawn(String),
+
+    #[error("unexpected sidecar response: {0}")]
+    Protocol(String),
+}
+
+pub type Result<T> = std::result::Result<T, Error>;
+
+impl serde::Serialize for Error {
+    fn serialize<S: serde::Serializer>(&self, s: S) -> std::result::Result<S::Ok, S::Error> {
+        s.serialize_str(&self.to_string())
+    }
+}
diff --git a/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/lib.rs b/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/lib.rs
new file mode 100644
index 0000000..ac763e4
--- /dev/null
+++ b/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/lib.rs
@@ -0,0 +1,105 @@
+//! tauri-plugin-kakeyalattice
+//!
+//! Tauri 2 plugin that supervises the `kakeya-sidecar` Python process
+//! and proxies OpenAI-compatible calls from the Atomic-Chat extension
+//! to the sidecar HTTP server at `localhost:1338`.
+//!
+//! This file is the skeleton entry point. The remaining supervisor
+//! hardening will land in `sidecar.rs` / `commands.rs` in a dedicated
+//! PR inside the Atomic-Chat main repo.
+
+use tauri::{plugin::TauriPlugin, Manager, Runtime};
+
+mod commands;
+mod error;
+mod sidecar;
+
+pub use error::{Error, Result};
+
+/// Configuration block read from `tauri.conf.json`:
+///
+/// ```json
+/// { "plugins": { "kakeyalattice": { "sidecarPort": 1338, "autoStart": true } } }
+/// ```
+#[derive(Clone, Debug, serde::Deserialize)]
+#[serde(rename_all = "camelCase")]
+pub struct PluginConfig {
+    #[serde(default = "default_port")]
+    pub sidecar_port: u16,
+    #[serde(default = "default_host")]
+    pub sidecar_host: String,
+    #[serde(default = "default_auto_start")]
+    pub auto_start: bool,
+    #[serde(default = "default_device")]
+    pub device: String,
+}
+
+fn default_port() -> u16 { 1338 }
+fn default_host() -> String { "127.0.0.1".to_string() }
+fn default_auto_start() -> bool { true }
+fn default_device() -> String { "auto".to_string() }
+
+impl Default for PluginConfig {
+    fn default() -> Self {
+        Self {
+            sidecar_port: default_port(),
+            sidecar_host: default_host(),
+            auto_start: default_auto_start(),
+            device: default_device(),
+        }
+    }
+}
+
+/// Shared per-app plugin state.
+#[derive(Default)]
+pub struct PluginState {
+    pub config: PluginConfig,
+    pub http: reqwest::Client,
+    pub supervisor: tokio::sync::Mutex<Option<sidecar::SidecarHandle>>,
+}
+
+impl PluginState {
+    pub fn base_url(&self) -> String {
+        format!("http://{}:{}", self.config.sidecar_host, self.config.sidecar_port)
+    }
+}
+
+/// Plugin factory.
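+///
+/// Generic over the Tauri [`Runtime`] so the host can build it against
+/// the default `Wry` runtime in production or a mock runtime in tests.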
+pub fn init<R: Runtime>() -> TauriPlugin<R, Option<serde_json::Value>> {
+    tauri::plugin::Builder::<R, Option<serde_json::Value>>::new("kakeyalattice")
+        .invoke_handler(tauri::generate_handler![
+            commands::list_models,
+            commands::chat_completion,
+            commands::chat_completion_stream_start,
+            commands::chat_completion_stream_wait,
+            commands::health,
+            commands::stats,
+        ])
+        .setup(|app, api| {
+            let cfg: PluginConfig = api
+                .config()
+                .clone()
+                .and_then(|v| serde_json::from_value(v).ok())
+                .unwrap_or_default();
+
+            let state = PluginState {
+                config: cfg,
+                http: reqwest::Client::new(),
+                supervisor: Default::default(),
+            };
+            app.manage(state);
+
+            if app.state::<PluginState>().config.auto_start {
+                // Kick off the sidecar supervisor. Errors are logged but
+                // not fatal — the UI will show an offline state and the
+                // user can retry.
+                let handle = app.clone();
+                tauri::async_runtime::spawn(async move {
+                    if let Err(e) = sidecar::start_supervisor(&handle).await {
+                        log::error!("kakeyalattice sidecar supervisor failed: {e}");
+                    }
+                });
+            }
+            Ok(())
+        })
+        .build()
+}
diff --git a/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/sidecar.rs b/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/sidecar.rs
new file mode 100644
index 0000000..e6b33af
--- /dev/null
+++ b/integrations/atomic-chat/tauri-plugin-kakeyalattice/src/sidecar.rs
@@ -0,0 +1,92 @@
+//! Sidecar process supervisor.
+//!
+//! Responsibilities:
+//! 1. Resolve the sidecar binary path.
+//! 2. Spawn it with the configured host/port/device.
+//! 3. Tail stdout/stderr into the application log.
+//! 4. Health-check on an interval; restart with exponential backoff
+//!    on crash. (Identical policy to the existing llama.cpp plugin.)
+//!
+//! This file is the skeleton; the heavy lifting (signal handling on
+//! app quit, Windows-specific job objects, etc.) will follow in a
+//! dedicated PR inside the Atomic-Chat main repo.
+
+use std::path::PathBuf;
+use std::process::Stdio;
+use std::time::Duration;
+
+use tauri::{AppHandle, Manager, Runtime};
+use tokio::io::{AsyncBufReadExt, BufReader};
+
+use crate::{Error, PluginState, Result};
+
+pub struct SidecarHandle {
+    pub child: tokio::process::Child,
+}
+
+fn resolve_sidecar_binary() -> PathBuf {
+    if let Ok(p) = std::env::var("ATOMIC_CHAT_KAKEYA_SIDECAR_PATH") {
+        return PathBuf::from(p);
+    }
+    // In product builds we bundle the sidecar next to the Tauri resource
+    // root. For dev builds we fall back to the `$PATH` lookup so
+    // `pip install -e kakeya_sidecar` (which installs the
+    // `kakeya-sidecar` console script) works out of the box.
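+    // NOTE: the bundled-resource lookup (step 2 of the README's path
+    // resolution order) is not wired up in this skeleton; we fall
+    // straight through to the bare command name, i.e. a $PATH lookup.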
+    PathBuf::from("kakeya-sidecar")
+}
+
+pub async fn start_supervisor<R: Runtime>(app: &AppHandle<R>) -> Result<()> {
+    let state = app.state::<PluginState>();
+    let cfg = state.config.clone();
+
+    let bin = resolve_sidecar_binary();
+    log::info!(
+        "spawning kakeya-sidecar: {} --host {} --port {} --device {}",
+        bin.display(), cfg.sidecar_host, cfg.sidecar_port, cfg.device,
+    );
+
+    let mut child = tokio::process::Command::new(&bin)
+        .arg("--host").arg(&cfg.sidecar_host)
+        .arg("--port").arg(cfg.sidecar_port.to_string())
+        .arg("--device").arg(&cfg.device)
+        .stdout(Stdio::piped())
+        .stderr(Stdio::piped())
+        .spawn()
+        .map_err(|e| Error::SidecarSpawn(e.to_string()))?;
+
+    if let Some(out) = child.stdout.take() {
+        let mut reader = BufReader::new(out).lines();
+        tokio::spawn(async move {
+            while let Ok(Some(line)) = reader.next_line().await {
+                log::info!("[sidecar stdout] {line}");
+            }
+        });
+    }
+    if let Some(err) = child.stderr.take() {
+        let mut reader = BufReader::new(err).lines();
+        tokio::spawn(async move {
+            while let Ok(Some(line)) = reader.next_line().await {
+                log::warn!("[sidecar stderr] {line}");
+            }
+        });
+    }
+
+    // Health-check loop: wait up to 30s for the sidecar to respond.
+    let base_url = state.base_url();
+    let http = state.http.clone();
+    for attempt in 0..30 {
+        tokio::time::sleep(Duration::from_secs(1)).await;
+        if let Ok(resp) = http.get(format!("{}/health", base_url)).send().await {
+            if resp.status().is_success() {
+                log::info!("kakeya-sidecar online after {}s", attempt + 1);
+                let mut guard = state.supervisor.lock().await;
+                *guard = Some(SidecarHandle { child });
+                return Ok(());
+            }
+        }
+    }
+
+    // Sidecar didn't come up; kill the child so we don't leak zombies.
+    let _ = child.kill().await;
+    Err(Error::SidecarNotReady { tries: 30 })
+}
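+
+// TODO(restart policy): the module doc above promises restart with
+// exponential backoff; this skeleton only does the initial spawn plus
+// a 30-second health gate, so crashes after startup are not yet
+// resupervised.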