635 changes: 635 additions & 0 deletions docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md

Large diffs are not rendered by default.

15 changes: 15 additions & 0 deletions integrations/atomic-chat/.gitignore
@@ -0,0 +1,15 @@
# Python
__pycache__/
*.pyc
*.egg-info/
.pytest_cache/
.venv/

# Node / TS
node_modules/
dist/
*.tsbuildinfo

# Rust
target/
Cargo.lock
95 changes: 95 additions & 0 deletions integrations/atomic-chat/README.md
@@ -0,0 +1,95 @@
# Atomic-Chat × KakeyaLattice v1.5: Local Mac Deployment Integration

Wires [`kakeyalattice` v1.5 (E8 lattice KV-cache codec)](../../kakeyalattice/) into
[`AtomicBot-ai/Atomic-Chat`](https://github.com/AtomicBot-ai/Atomic-Chat) as a
**second first-class inference backend**, targeting multi-model offline deployment
on **Mac (Apple Silicon, Metal)**.

> The full design rationale lives in [`docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md`](../../docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md).
> This directory holds only the engineering skeleton that can be copied straight into the Atomic-Chat repo.

## Directory layout

```
integrations/atomic-chat/
├── kakeya_sidecar/              ★ Python inference sidecar
│   ├── pyproject.toml           —— standalone pip package `kakeya-sidecar`
│   ├── kakeya_sidecar/
│   │   ├── __init__.py
│   │   ├── __main__.py          —— `python -m kakeya_sidecar ...`
│   │   ├── cli.py               —— argparse entry point + uvicorn startup
│   │   ├── server.py            —— FastAPI OpenAI-compatible API
│   │   ├── engine.py            —— HF + KakeyaLatticeCache inference core
│   │   ├── model_registry.py    —— multi-model deployment profiles
│   │   └── schemas.py           —— OpenAI request/response dataclasses
│   └── tests/
│       └── test_model_registry.py
├── kakeyalattice-extension/     ★ Atomic-Chat TypeScript extension
│   ├── package.json
│   ├── src/
│   │   ├── index.ts             —— extension entry (registers with the core SDK)
│   │   └── backend.ts           —— routes via Tauri plugin → sidecar
│   └── README.md
└── tauri-plugin-kakeyalattice/  ★ Rust Tauri plugin (stub)
    ├── Cargo.toml
    ├── src/
    │   ├── lib.rs               —— `tauri::plugin::Builder` registration
    │   └── commands.rs          —— sidecar lifecycle + proxied calls
    └── README.md
```

## Quick verification (Mac)

```bash
# 1. Install the sidecar
cd integrations/atomic-chat/kakeya_sidecar
pip install -e ".[mac]"

# 2. Unit tests (no model download required)
pytest tests/ -v

# 3. Run it
kakeya-sidecar --port 1338 --device mps
curl http://localhost:1338/v1/models | jq .

# 4. Send one inference request (the first one downloads the model; mind your disk space)
curl http://localhost:1338/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "qwen3-4b@e8-q10",
"messages": [{"role":"user","content":"Explain nested-lattice codes in one line."}],
"stream": false, "max_tokens": 64
}'
```
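
If you'd rather drive the endpoint from Python than `curl`, the stock `openai` client works unchanged. A minimal sketch, assuming the sidecar is already running on port 1338; the `api_key` value is arbitrary, since the sidecar is a local, unauthenticated server:

```python
# Minimal client sketch: requires `pip install openai` and a running sidecar.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1338/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen3-4b@e8-q10",
    messages=[{"role": "user", "content": "Explain nested-lattice codes in one line."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```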

## Integrating into the Atomic-Chat main repo (step outline)

```bash
# Assumes the Atomic-Chat main repo lives at ~/code/Atomic-Chat
export ATOMIC=~/code/Atomic-Chat

# 1. Extension
rsync -a kakeyalattice-extension/ $ATOMIC/extensions/kakeyalattice-extension/

# 2. Tauri plugin
rsync -a tauri-plugin-kakeyalattice/ $ATOMIC/src-tauri/plugins/kakeyalattice/

# 3. Add one line to Atomic-Chat's extension registry:
#    (see $ATOMIC/extensions/README / CONTRIBUTING.md for details)
# { name: "kakeyalattice-extension", enabled: true }

# 4. The Python sidecar has to be bundled into the installer alongside llama.cpp.
#    Atomic-Chat already ships binary bundling scripts in `scripts/bundle-binaries.*`;
#    build the sidecar into a single file with PyOxidizer / PyInstaller and hook it in.
```
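
For step 4, a single-file build could start from a spec like the hypothetical, untested sketch below; the exact hidden imports depend on what `server.py` pulls in at runtime, and PyOxidizer would need its own config:

```python
# kakeya_sidecar.spec: hypothetical PyInstaller one-file spec (untested sketch).
# uvicorn resolves its loggers and event loops dynamically, so they are listed
# as hidden imports; adjust to whatever server.py actually needs.
a = Analysis(
    ["kakeya_sidecar/__main__.py"],
    hiddenimports=["uvicorn.logging", "uvicorn.loops.auto",
                   "uvicorn.protocols.http.auto"],
)
pyz = PYZ(a.pure)
exe = EXE(pyz, a.scripts, a.binaries, a.datas,
          name="kakeya-sidecar", console=True)
```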

> This PR does not touch the Atomic-Chat main repo; this repository only provides
> a directly portable skeleton, the full design doc, and the sidecar's own unit tests.

## Relationship to the existing vLLM plugin

| Path | Scenario | Changed? |
|:--|:--|:--|
| `vllm_backend/kakeya_v1_4_snapshot/` | Linux / CUDA / H200 benchmarks | untouched |
| `integrations/atomic-chat/kakeya_sidecar/` | Mac / MPS / product-side inference | added in this PR |

The only component the two share is the Python package `kakeyalattice` itself: the
scalar operators all run through PyTorch and are device-agnostic.
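
As a concrete illustration of that device-agnosticism, resolving `--device auto` needs nothing beyond standard PyTorch capability checks. A hedged sketch (`resolve_device` is a hypothetical helper; the real logic lives in `engine.py`, which this diff does not show):

```python
# Hypothetical sketch of "auto" device resolution; engine.py may differ.
import torch

def resolve_device(requested: str = "auto") -> str:
    """Map --device auto to mps (Apple Silicon), cuda, or cpu."""
    if requested != "auto":
        return requested
    if torch.backends.mps.is_available():  # Mac / Metal
        return "mps"
    if torch.cuda.is_available():          # Linux / NVIDIA
        return "cuda"
    return "cpu"
```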
37 changes: 37 additions & 0 deletions integrations/atomic-chat/kakeya_sidecar/README.md
@@ -0,0 +1,37 @@
# kakeya-sidecar

OpenAI-compatible local inference sidecar for [Atomic-Chat](https://github.com/AtomicBot-ai/Atomic-Chat);
inference runs on HuggingFace `transformers` + `kakeyalattice.hf.KakeyaLatticeCache`
(E8 nested-lattice KV-cache compression).

Design goals:

1. **Zero code changes**: the external surface is `POST /v1/chat/completions`, so
   Atomic-Chat's existing OpenAI client can connect directly.
2. **Multi-model local deployment**: Qwen3 / Llama-3.x / Gemma-4 / DeepSeek-R1-Distill /
   GLM-4-9B / Mistral, each with a factory-tuned Q-profile configuration.
3. **Mac first**: defaults to `--device mps`; Linux/CUDA also works.
4. **Tauri-friendly**: plain HTTP + JSON; the Tauri plugin handles spawning the
   process and forwarding calls.

Detailed design: [`docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md`](../../../docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md).

## Quick start

```bash
pip install -e ".[mac,dev]"
pytest tests/ -v               # pure-logic unit tests, no model download
kakeya-sidecar --port 1338 --device mps
```

```bash
curl http://localhost:1338/v1/models
curl http://localhost:1338/v1/chat/completions -d '{...}' ...
```

## Supported models

See `kakeya_sidecar/model_registry.py`; the live list is returned by `GET /v1/models`.
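
`model_registry.py` itself is not shown in this diff; judging from the names exported in `__init__.py` and the `model@profile` channel ids used above, its shape is plausibly along these lines (field names beyond the exported symbols are guesses):

```python
# Hypothetical sketch of model_registry.py; only MODEL_REGISTRY,
# DeploymentProfile, and resolve_model are confirmed by __init__.py.
from dataclasses import dataclass

@dataclass(frozen=True)
class DeploymentProfile:
    hf_repo: str        # HuggingFace repo id to load
    quant_profile: str  # e.g. "e8-q10": E8 lattice, ~10-bit KV profile

MODEL_REGISTRY: dict[str, DeploymentProfile] = {
    "qwen3-4b@e8-q10": DeploymentProfile("Qwen/Qwen3-4B", "e8-q10"),
    # ... one entry per factory-tuned channel id
}

def resolve_model(channel_id: str) -> DeploymentProfile:
    """Look up a 'model@profile' channel id; raises KeyError if unknown."""
    return MODEL_REGISTRY[channel_id]
```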

## License

Apache-2.0, same as the main repository.
29 changes: 29 additions & 0 deletions integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/__init__.py
@@ -0,0 +1,29 @@
"""kakeya_sidecar — OpenAI-compatible local inference sidecar.

Top-level imports are lazy so that pure-logic modules (notably
:mod:`model_registry`) can be imported in test / packaging
environments where FastAPI or torch may not be installed.
"""
from __future__ import annotations

from .model_registry import MODEL_REGISTRY, DeploymentProfile, resolve_model

__all__ = [
"MODEL_REGISTRY",
"DeploymentProfile",
"resolve_model",
"KakeyaEngine",
"create_app",
]

__version__ = "0.1.0"


def __getattr__(name):
if name == "KakeyaEngine":
from .engine import KakeyaEngine
return KakeyaEngine
if name == "create_app":
from .server import create_app
return create_app
raise AttributeError(f"module 'kakeya_sidecar' has no attribute {name!r}")
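
This lazy `__getattr__` (PEP 562) is what lets the registry tests run in environments without torch or FastAPI installed. A hedged sketch of such a test, consistent with the hypothetical registry sketch above (the real `tests/test_model_registry.py` is not shown in this excerpt):

```python
# Hypothetical test sketch: the import itself must not pull in torch/FastAPI.
import pytest

import kakeya_sidecar


def test_registry_importable_without_heavy_deps():
    assert "qwen3-4b@e8-q10" in kakeya_sidecar.MODEL_REGISTRY


def test_unknown_channel_rejected():
    with pytest.raises(KeyError):
        kakeya_sidecar.resolve_model("no-such-model@q0")
```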
7 changes: 7 additions & 0 deletions integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/__main__.py
@@ -0,0 +1,7 @@
"""Allow ``python -m kakeya_sidecar``."""
from __future__ import annotations

from .cli import main

if __name__ == "__main__": # pragma: no cover
main()
78 changes: 78 additions & 0 deletions integrations/atomic-chat/kakeya_sidecar/kakeya_sidecar/cli.py
@@ -0,0 +1,78 @@
"""``kakeya-sidecar`` CLI entry point."""
from __future__ import annotations

import argparse
import logging
import sys

from .engine import EngineConfig
from .server import create_app


def build_parser() -> argparse.ArgumentParser:
p = argparse.ArgumentParser(
prog="kakeya-sidecar",
description="OpenAI-compatible local inference sidecar for Atomic-Chat "
"(HuggingFace transformers + KakeyaLattice v1.5 E8 KV-cache).",
)
p.add_argument("--host", default="127.0.0.1",
help="Bind address (default 127.0.0.1 — localhost only).")
p.add_argument("--port", type=int, default=1338,
help="Bind port (default 1338; Atomic-Chat's OpenAI front "
"door is 1337 so we sit one port over).")
p.add_argument("--device", default="auto",
choices=["auto", "mps", "cuda", "cpu"],
help="Torch device. 'auto' picks mps on Mac / cuda on Linux.")
p.add_argument("--dtype", default="auto",
choices=["auto", "bfloat16", "float16", "float32"],
help="Model parameter dtype.")
p.add_argument("--max-resident", type=int, default=1,
help="Max number of fully-loaded models kept in RAM/VRAM "
"at once (LRU).")
p.add_argument("--hf-cache-dir", default=None,
help="Override HF_HOME / cache location.")
p.add_argument("--prewarm", default=None,
help="Channel id to pre-load at startup, e.g. "
"'qwen3-4b@e8-q10'.")
p.add_argument("--log-level", default="INFO",
choices=["DEBUG", "INFO", "WARNING", "ERROR"])
return p


def main(argv: list[str] | None = None) -> int:
args = build_parser().parse_args(argv)
logging.basicConfig(
level=getattr(logging, args.log_level),
format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

cfg = EngineConfig(
device=args.device,
dtype=args.dtype,
max_resident=args.max_resident,
hf_cache_dir=args.hf_cache_dir,
)

engine_instance = None
if args.prewarm:
from .engine import KakeyaEngine

log = logging.getLogger("kakeya_sidecar.cli")
engine_instance = KakeyaEngine(cfg)
log.info("prewarming %s on %s", args.prewarm, engine_instance._device)
engine_instance.warmup(args.prewarm)

app = create_app(
cfg,
lazy_engine=engine_instance is None,
engine_instance=engine_instance,
)

try:
import uvicorn # type: ignore
except ImportError: # pragma: no cover
print("uvicorn is required. `pip install uvicorn[standard]`.", file=sys.stderr)
return 2

uvicorn.run(app, host=args.host, port=args.port, log_level=args.log_level.lower())
return 0
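
A quick way to exercise this wiring in-process, without binding a port or loading weights, is FastAPI's test client. A minimal sketch, assuming `fastapi`, `httpx`, and `torch` are installed and that `lazy_engine=True` defers model loading as the code above implies:

```python
# Smoke-test sketch: drives create_app() in-process, no uvicorn, no model load.
from fastapi.testclient import TestClient

from kakeya_sidecar.engine import EngineConfig
from kakeya_sidecar.server import create_app

cfg = EngineConfig(device="cpu", dtype="float32", max_resident=1, hf_cache_dir=None)
app = create_app(cfg, lazy_engine=True, engine_instance=None)

with TestClient(app) as client:
    print(client.get("/v1/models").json())  # deployment profiles, no weights loaded
```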