
feat(atomic-chat): KakeyaLattice v1.5 integration for Atomic-Chat — B1 (HF + MPS sidecar)#57

Draft
FluffyAIcode wants to merge 4 commits into main from AgentMemory/atomic-chat-kakeya-integration-04ae

Conversation

@FluffyAIcode

Summary

Wires KakeyaLattice v1.5 (the E8 nested-lattice KV-cache codec) into
AtomicBot-ai/Atomic-Chat
as a second first-class local inference backend, sitting alongside the existing llama.cpp backend and routed by the Extension
System. This PR delivers plan B1 (HF transformers + MPS Python sidecar) plus
the complete design narrative;
plan B2 (MLX + DFlash + KakeyaLattice-MLX) goes into a separate follow-up PR.

Full design document: docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md (§1-§14).

Design narrative chain

| Section | Contents |
| --- | --- |
| §1-4 | Atomic-Chat architecture analysis + why not inside llama.cpp (no pluggable KV hook) and why a parallel backend instead |
| §5-11 | Plan B1: Python sidecar + HF transformers + MPS + KakeyaLatticeCache + OpenAI-compatible API + multi-model deployment profiles |
| §12 | DFlash (block diffusion speculative decoding) integration path → B2 follow-up PR |
| §13 | Feasibility assessment of in-browser inference (W1-W4 / L1-L3 ladder) |
| §14 | A0 (llama.cpp) vs B2 (MLX+DFlash+Kakeya) decision matrix — a review checklist for the Atomic-Chat team |

Deliverables (B1)

Code skeleton

  • integrations/atomic-chat/kakeya_sidecar/ — OpenAI-compatible Python sidecar
    • FastAPI endpoints: /health, /v1/models, /v1/chat/completions (stream + non-stream), /v1/kakeya/stats
    • KakeyaEngine — HF transformers + KakeyaLatticeCache, device auto-detection (mps / cuda / cpu), automatic dtype selection
    • model_registry.py — 9 model deployment profiles (Qwen3-4B, Qwen2-1.5B, Llama-3.2-3B/3.1-8B, Mistral-7B, Gemma-4-E4B, DeepSeek-R1-Distill 1.5B/7B, GLM-4-9B) × 2-3 Q channels each
    • Channel-id syntax: <short>@<variant>-q<Q>[-b<B>], with full round-trip parsing (see the sketch after this list)
    • All Q / boundary tiers come from reports/v1_5_release/V15_FULL_4MODEL_REPORT.md; DeepSeek-R1-Distill 1.5B forces boundary>=2 (in the report, no-boundary blows up to ~50,000% |Δppl|)
  • integrations/atomic-chat/kakeyalattice-extension/ — Atomic-Chat TypeScript extension
    • KakeyaBackend implements Atomic-Chat's Backend contract (listModels / chatCompletion / stream / health / stats)
    • Bridges to the Rust plugin via Tauri invoke
    • tsc --noEmit passes (with ambient shims for @tauri-apps/api as a minimal CI gate)
  • integrations/atomic-chat/tauri-plugin-kakeyalattice/ — Rust Tauri 2 plugin skeleton
    • Supervises the sidecar lifecycle (spawn / 30 s health check / stdout/stderr logging)
    • HTTP proxy + SSE-to-Tauri-event stream bridge
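The channel-id grammar above is simple enough to pin down in a few lines. A minimal round-trip sketch; the function and dataclass names (and the example short names) are illustrative, not the actual model_registry.py API:

```python
import re
from dataclasses import dataclass
from typing import Optional

# <short>@<variant>-q<Q>[-b<B>] — e.g. "qwen3-4b@e8-q38" (hypothetical ids).
_CHANNEL_RE = re.compile(
    r"^(?P<short>[^@]+)@(?P<variant>[a-z0-9]+)-q(?P<q>\d+)(?:-b(?P<b>\d+))?$"
)

@dataclass(frozen=True)
class ChannelId:
    short: str
    variant: str
    q: int
    boundary: Optional[int] = None

def parse_channel_id(raw: str) -> ChannelId:
    m = _CHANNEL_RE.match(raw)
    if m is None:
        raise ValueError(f"unknown channel id: {raw!r}")
    b = m.group("b")
    return ChannelId(m.group("short"), m.group("variant"),
                     int(m.group("q")), int(b) if b else None)

def format_channel_id(c: ChannelId) -> str:
    suffix = f"-b{c.boundary}" if c.boundary is not None else ""
    return f"{c.short}@{c.variant}-q{c.q}{suffix}"

# Round trip holds in both directions.
assert format_channel_id(parse_channel_id("deepseek-r1-1.5b@e8-q10-b2")) \
    == "deepseek-r1-1.5b@e8-q10-b2"
```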

Tests

  • 15/15 unit tests pass (pytest tests/ -v; no torch or HF downloads required)
    • test_model_registry.py (11 tests): registry structure, default-channel consistency, DeepSeek forced boundary, bidirectional channel-id parsing, error paths for unknown models/channels
    • test_server_routing.py (4 tests): /health, /v1/models, and /v1/chat/completions against a mocked engine, plus 404 for unknown models — shape sketched below
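For readers who want the shape without opening the test files, a compressed stub of what the routing tests assert. The real tests mock KakeyaEngine behind the actual FastAPI app, whereas this stub inlines the behaviour; all names here are illustrative:

```python
from fastapi import FastAPI, HTTPException
from fastapi.testclient import TestClient

KNOWN = {"qwen3-4b@e8-q38"}  # hypothetical channel id

app = FastAPI()

@app.get("/health")
def health():
    return {"status": "ok"}

@app.get("/v1/models")
def models():
    return {"object": "list",
            "data": [{"id": m, "object": "model"} for m in sorted(KNOWN)]}

@app.post("/v1/chat/completions")
def chat(body: dict):
    if body.get("model") not in KNOWN:
        raise HTTPException(status_code=404, detail="unknown model")
    return {"choices": [{"message": {"role": "assistant", "content": "stub"}}]}

client = TestClient(app)
assert client.get("/health").json() == {"status": "ok"}
assert client.post("/v1/chat/completions", json={"model": "nope"}).status_code == 404
```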

Key engineering decisions

| Decision | Rationale |
| --- | --- |
| Python sidecar instead of direct Rust FFI | torch's MPS ABI is unstable; transformers' generate / cache logic lives in Python; a tch-rs path carries an unacceptable maintenance cost |
| No reuse of vllm_backend/kakeya_v1_4_snapshot/ | vLLM has no Metal backend and cannot run on a Mac; the snapshot hook depends on CUDA tensors. The two share only the pure-PyTorch kakeyalattice package itself |
| Fresh KakeyaLatticeCache per request, never reused across requests | Matches the standard HF generate pattern; concurrency-safe by construction (sketched below) |
| Default variant="e8" + q_range=38 (near-lossless) | The preferred setting on Mac; only drop to Q=10 (3.37× CR, \|Δppl\| < 7%) for long contexts |
| Q=4 only under "advanced options", with a warning | The report shows GLM at Q=4 with \|Δppl\| = 32%; that cannot be a default tier |
| DeepSeek-R1-Distill small models force boundary=2 | Avoids the structural no-boundary in-forward blowup |
| No changes to the Atomic-Chat main repo | This PR ships only a portable skeleton plus the full design, leaving the merge strategy to the Atomic-Chat maintainers |
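The per-request cache decision is easiest to see in code. A sketch under stated assumptions: the kakeyalattice.hf.KakeyaLatticeCache constructor is assumed to take the variant / q_range knobs named in this PR (signature unverified here), and the cache is assumed to be an HF Cache subclass that generate accepts via past_key_values:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from kakeyalattice.hf import KakeyaLatticeCache

def pick_device_dtype():
    # Mirrors the engine's auto-detect: fp16 on MPS, bf16 on CUDA, fp32 on CPU.
    if torch.backends.mps.is_available():
        return "mps", torch.float16
    if torch.cuda.is_available():
        return "cuda", torch.bfloat16
    return "cpu", torch.float32

device, dtype = pick_device_dtype()
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-1.5B-Instruct", torch_dtype=dtype).to(device)

def complete(prompt: str) -> str:
    # Fresh cache per request: no state crosses request boundaries,
    # so concurrent requests cannot corrupt each other's KV entries.
    cache = KakeyaLatticeCache(variant="e8", q_range=38)  # assumed kwargs
    inputs = tok(prompt, return_tensors="pt").to(device)
    out = model.generate(**inputs, max_new_tokens=64, past_key_values=cache)
    return tok.decode(out[0], skip_special_tokens=True)
```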

File mapping (for porting into the Atomic-Chat main repo)

| Path in this repo | Target location |
| --- | --- |
| integrations/atomic-chat/kakeya_sidecar/ | Standalone pip package; the packaged sidecar is bundled into the .dmg |
| integrations/atomic-chat/kakeyalattice-extension/ | extensions/ |
| integrations/atomic-chat/tauri-plugin-kakeyalattice/ | src-tauri/plugins/kakeyalattice/ |

Follow-up PRs (not blocking this one)

  • B2: MLX + DFlash + KakeyaLattice-MLX port (branch AgentMemory/atomic-chat-b2-mlx-dflash-kakeya-04ae), delivering an MLX implementation of the E8 codec, an mlx-lm KV-cache wrapper, and a sidecar wired to the DFlash drafter.
  • Possible standalone PRs: an L1 in-browser codec demo, an L2 web-native LLM + Kakeya, and a remote vLLM backend extension option.

Verification commands

```sh
cd integrations/atomic-chat/kakeya_sidecar
pip install --quiet pytest fastapi httpx pydantic sse-starlette
PYTHONPATH=. python3 -m pytest tests/ -v
# => 15 passed
cd ../kakeyalattice-extension
npx --yes -p typescript@5.4 tsc --noEmit
# => clean
```
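Once the sidecar itself is running (default port :1338, per the commit notes), it behaves like any OpenAI-compatible server. A minimal streaming-client sketch; the channel id "qwen3-4b@e8-q38" is a hypothetical instance of the syntax above, not a guaranteed registry entry:

```python
# Minimal streaming client for the sidecar's OpenAI-compatible API.
# Assumes the sidecar is already running on localhost:1338.
import json
import httpx

payload = {
    "model": "qwen3-4b@e8-q38",  # hypothetical channel id
    "messages": [{"role": "user", "content": "Say hello."}],
    "stream": True,
}
with httpx.stream("POST", "http://127.0.0.1:1338/v1/chat/completions",
                  json=payload, timeout=60) as resp:
    for line in resp.iter_lines():
        # Each SSE event arrives as a "data: {...}" line.
        if not line.startswith("data: ") or line.strip() == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data: "):])
        delta = chunk["choices"][0].get("delta", {}).get("content", "")
        print(delta, end="", flush=True)
```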

Commits

  1. docs: add Atomic-Chat x KakeyaLattice v1.5 integration design
  2. feat(sidecar): OpenAI-compatible Python sidecar with KakeyaLatticeCache
  3. feat(extension): Atomic-Chat TS extension + Tauri Rust plugin skeleton
  4. docs: add §12 DFlash / §13 Web / §14 A0 vs B2 decision matrix

A note on the atomic.chat homepage's advertised "Google TurboQuant built-in": per the v1.5 report, TQ at b=2 is structurally unusable across all 4 models, and TQ at b=3 is beaten across the board by E8 Q=4 by 3-6×. This integration turns the advertised "KV compression" into an implementation that holds up under engineering scrutiny.


cursoragent and others added 4 commits April 30, 2026 03:08
Analyse AtomicBot-ai/Atomic-Chat local-deployment architecture and
propose a concrete plan that embeds KakeyaLattice v1.5 (E8 lattice
KV-cache codec) as a first-class second inference backend on macOS
(Apple Silicon / Metal).

- Atomic-Chat stack analysis: Tauri + React + extensions + llama.cpp
  plugin, OpenAI-compatible server at localhost:1337.
- Hard conflict: KakeyaLattice is PyTorch-first; llama.cpp has no
  pluggable KV-cache quantisation interface. Conclusion: integrate
  as a parallel backend rather than inside llama.cpp.
- Target architecture: Python sidecar (HF transformers + MPS +
  KakeyaLatticeCache) + TS extension + Rust Tauri plugin.
- Multi-model Mac deployment: Qwen3, Llama-3.x, Gemma-4, DeepSeek-R1-
  Distill, GLM-4, Mistral (all head_dim divisible by 8).
- Per-model Q-channel profiles grounded in V15_FULL_4MODEL_REPORT.md.
- Roadmap through Metal-fused E8 kernel and index-storage cache mode.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
New package `kakeya-sidecar` (installable via pip) that runs
HuggingFace transformers + kakeyalattice.hf.KakeyaLatticeCache and
exposes an OpenAI-compatible HTTP server (default :1338) for
Atomic-Chat to proxy into.

- FastAPI endpoints: /health, /v1/models, /v1/chat/completions
  (streaming + non-streaming), /v1/kakeya/stats.
- `KakeyaEngine` with LRU model cache and per-request
  KakeyaLatticeCache. Device auto-detect (mps / cuda / cpu), dtype
  auto-pick (fp16 on MPS, bf16 on CUDA).
- Model registry covering Qwen3-4B, Qwen2-1.5B, Llama-3.2-3B-Instruct,
  Llama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.3, Gemma-4-E4B,
  DeepSeek-R1-Distill 1.5B/7B, GLM-4-9B-Chat. Per-model Q/boundary
  channels grounded in V15_FULL_4MODEL_REPORT.md; DeepSeek small
  models force boundary>=2 per the report's no-boundary blowup.
- Channel id parsing (<short>@<variant>-q<Q>[-b<B>]) with full
  roundtrip coverage.
- 15 unit tests (model_registry + server routing with mocked engine).
  All green without torch or HF downloads.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Drop-in skeleton that the Atomic-Chat host app can place under
extensions/ and src-tauri/plugins/ to register KakeyaLattice as a
first-class local inference backend alongside llama.cpp.

TypeScript extension (@atomic-chat/kakeyalattice-extension):
- `KakeyaBackend` implementing the host Backend contract:
  listModels / chatCompletion / chatCompletionStream / health / stats.
- Streaming via Tauri event channel 'kakeyalattice:<stream_id>'.
- Typecheck passes under tsc 5.4 with ambient shims for @tauri-apps/api.

Rust Tauri 2 plugin (tauri-plugin-kakeyalattice):
- Supervises the kakeya-sidecar Python process (spawn, tail logs,
  wait-for-health up to 30s). Aligned with the existing llamacpp
  plugin's lifecycle model.
- Thin HTTP proxies to the sidecar: list_models, chat_completion,
  health, stats.
- SSE-to-Tauri-event bridge for streaming deltas.
- PluginConfig read from tauri.conf.json: sidecarHost/port,
  autoStart, device.

Implementation is skeleton-level: the commands are compile-ready but
expected to be finalised when dropped into the Atomic-Chat main
repo. This keeps the review surface small and avoids spawning
sidecar work inside a research repo.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
docs: add §12 DFlash / §13 Web / §14 A0 vs B2 decision matrix

Complete the integration design with three roadmap/evaluation sections:

- §12 DFlash (block diffusion speculative decoding) integration path:
  relationship to KakeyaLattice (orthogonal axes: time vs space),
  interaction surface (verify as mini-prefill amortising codec cost,
  draft-aggressive/target-conservative layered compression, acceptance-
  rate risk, MLX port blocker), B2 roadmap M1-M6, planned
  `dflash_draft_repo` field on DeploymentProfile with the z-lab
  pre-trained draft catalogue.

- §13 Browser inference feasibility assessment: W1-W4 definitions,
  per-component WebGPU feasibility (target LLM ✓, E8 codec ✓, DFlash
  ✗ short-term), number anchors (WASM 4 GB, WebGPU buffer quota,
  native efficiency 50-70%), advantage/disadvantage summary, three
  laddered paths L1-L3, W2 pragmatic option (sidecar-served /chat.html).

- §14 A0 (llama.cpp) vs B2 (MLX + DFlash + KakeyaLattice) decision
  table: full-stack side-by-side, multi-dimensional comparison,
  each side's three hardcore advantages/weaknesses, user-persona-to-
  best-choice mapping, three sentences for product decision makers,
  B1-vs-B2 PR scope reaffirmation.

These three sections close the design narrative from 'why not inside
llama.cpp' (§1-4) through B1 implementation (§5-11) to
DFlash/Web/A0-vs-B2 roadmap (§12-14), giving the Atomic-Chat team
a single document to review when considering this integration.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
