
feat(atomic-chat): KakeyaLattice v1.5 integration for Atomic-Chat — B1 (HF + MPS sidecar)#57

Draft
FluffyAIcode wants to merge 4 commits into main from AgentMemory/atomic-chat-kakeya-integration-04ae

Conversation

@FluffyAIcode

Summary

Wires KakeyaLattice v1.5 (the E8 nested-lattice KV-cache codec) into
AtomicBot-ai/Atomic-Chat
as a second first-class local inference backend, sitting alongside the existing llama.cpp backend and routed by the Extension
System. This PR delivers plan B1 (HF transformers + MPS Python sidecar) plus
the complete design narrative;
plan B2 (MLX + DFlash + KakeyaLattice-MLX) goes into a separate follow-up PR.

Full design document: docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md (§1-§14).

Design narrative chain

| Section | Contents |
| --- | --- |
| §1-4 | Atomic-Chat architecture analysis + why not inside llama.cpp (no pluggable KV hook) and why a parallel backend instead |
| §5-11 | Plan B1: Python sidecar + HF transformers + MPS + KakeyaLatticeCache + OpenAI-compatible API + multi-model deployment profiles |
| §12 | DFlash (block diffusion speculative decoding) integration path → B2 follow-up PR |
| §13 | Feasibility assessment of in-browser inference (W1-W4 / L1-L3 ladder) |
| §14 | A0 (llama.cpp) vs B2 (MLX+DFlash+Kakeya) decision matrix — a review checklist for the Atomic-Chat team |

Deliverables (B1)

Code skeleton

  • integrations/atomic-chat/kakeya_sidecar/ — OpenAI-compatible Python sidecar
    • FastAPI endpoints: /health, /v1/models, /v1/chat/completions (stream + non-stream), /v1/kakeya/stats
    • KakeyaEngine — HF transformers + KakeyaLatticeCache, device auto-detection (mps / cuda / cpu), automatic dtype selection
    • model_registry.py — 9 model deployment profiles (Qwen3-4B, Qwen2-1.5B, Llama-3.2-3B/3.1-8B, Mistral-7B, Gemma-4-E4B, DeepSeek-R1-Distill 1.5B/7B, GLM-4-9B) × 2-3 Q channels each
    • Channel-id syntax: <short>@<variant>-q<Q>[-b<B>], with full round-trip parsing (see the sketch after this list)
    • All Q / boundary tiers come from reports/v1_5_release/V15_FULL_4MODEL_REPORT.md; DeepSeek-R1-Distill 1.5B forces boundary>=2 (in the report, no-boundary blows up to ~50,000% |Δppl|)
  • integrations/atomic-chat/kakeyalattice-extension/ — Atomic-Chat TypeScript extension
    • KakeyaBackend implements Atomic-Chat's Backend contract (listModels / chatCompletion / stream / health / stats)
    • Bridges to the Rust plugin via Tauri invoke
    • tsc --noEmit passes (with ambient shims for @tauri-apps/api as a minimal CI gate)
  • integrations/atomic-chat/tauri-plugin-kakeyalattice/ — Rust Tauri 2 plugin skeleton
    • Supervises the sidecar lifecycle (spawn / 30 s health check / stdout/stderr logging)
    • HTTP proxy + SSE-to-Tauri-event stream bridge
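The channel-id grammar above is simple enough to pin down in a few lines. A minimal round-trip sketch; the function and dataclass names (and the example short names) are illustrative, not the actual model_registry.py API:

```python
import re
from dataclasses import dataclass
from typing import Optional

# <short>@<variant>-q<Q>[-b<B>] — e.g. "qwen3-4b@e8-q38" (hypothetical ids).
_CHANNEL_RE = re.compile(
    r"^(?P<short>[^@]+)@(?P<variant>[a-z0-9]+)-q(?P<q>\d+)(?:-b(?P<b>\d+))?$"
)

@dataclass(frozen=True)
class ChannelId:
    short: str
    variant: str
    q: int
    boundary: Optional[int] = None

def parse_channel_id(raw: str) -> ChannelId:
    m = _CHANNEL_RE.match(raw)
    if m is None:
        raise ValueError(f"unknown channel id: {raw!r}")
    b = m.group("b")
    return ChannelId(m.group("short"), m.group("variant"),
                     int(m.group("q")), int(b) if b else None)

def format_channel_id(c: ChannelId) -> str:
    suffix = f"-b{c.boundary}" if c.boundary is not None else ""
    return f"{c.short}@{c.variant}-q{c.q}{suffix}"

# Round trip holds in both directions.
assert format_channel_id(parse_channel_id("deepseek-r1-1.5b@e8-q10-b2")) \
    == "deepseek-r1-1.5b@e8-q10-b2"
```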

Tests

  • 15/15 unit tests pass (pytest tests/ -v; no torch or HF downloads required)
    • test_model_registry.py (11 tests): registry structure, default-channel consistency, DeepSeek forced boundary, bidirectional channel-id parsing, error paths for unknown models/channels
    • test_server_routing.py (4 tests): /health, /v1/models, and /v1/chat/completions against a mocked engine, plus 404 for unknown models — shape sketched below
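For readers who want the shape without opening the test files, a compressed stub of what the routing tests assert. The real tests mock KakeyaEngine behind the actual FastAPI app, whereas this stub inlines the behaviour; all names here are illustrative:

```python
from fastapi import FastAPI, HTTPException
from fastapi.testclient import TestClient

KNOWN = {"qwen3-4b@e8-q38"}  # hypothetical channel id

app = FastAPI()

@app.get("/health")
def health():
    return {"status": "ok"}

@app.get("/v1/models")
def models():
    return {"object": "list",
            "data": [{"id": m, "object": "model"} for m in sorted(KNOWN)]}

@app.post("/v1/chat/completions")
def chat(body: dict):
    if body.get("model") not in KNOWN:
        raise HTTPException(status_code=404, detail="unknown model")
    return {"choices": [{"message": {"role": "assistant", "content": "stub"}}]}

client = TestClient(app)
assert client.get("/health").json() == {"status": "ok"}
assert client.post("/v1/chat/completions", json={"model": "nope"}).status_code == 404
```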

Key engineering decisions

| Decision | Rationale |
| --- | --- |
| Python sidecar instead of direct Rust FFI | torch's MPS ABI is unstable; transformers' generate / cache logic lives in Python; a tch-rs path carries an unacceptable maintenance cost |
| No reuse of vllm_backend/kakeya_v1_4_snapshot/ | vLLM has no Metal backend and cannot run on a Mac; the snapshot hook depends on CUDA tensors. The two share only the pure-PyTorch kakeyalattice package itself |
| Fresh KakeyaLatticeCache per request, never reused across requests | Matches the standard HF generate pattern; concurrency-safe by construction (sketched below) |
| Default variant="e8" + q_range=38 (near-lossless) | The preferred setting on Mac; only drop to Q=10 (3.37× CR, \|Δppl\| < 7%) for long contexts |
| Q=4 only under "advanced options", with a warning | The report shows GLM at Q=4 with \|Δppl\| = 32%; that cannot be a default tier |
| DeepSeek-R1-Distill small models force boundary=2 | Avoids the structural no-boundary in-forward blowup |
| No changes to the Atomic-Chat main repo | This PR ships only a portable skeleton plus the full design, leaving the merge strategy to the Atomic-Chat maintainers |
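The per-request cache decision is easiest to see in code. A sketch under stated assumptions: the kakeyalattice.hf.KakeyaLatticeCache constructor is assumed to take the variant / q_range knobs named in this PR (signature unverified here), and the cache is assumed to be an HF Cache subclass that generate accepts via past_key_values:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from kakeyalattice.hf import KakeyaLatticeCache

def pick_device_dtype():
    # Mirrors the engine's auto-detect: fp16 on MPS, bf16 on CUDA, fp32 on CPU.
    if torch.backends.mps.is_available():
        return "mps", torch.float16
    if torch.cuda.is_available():
        return "cuda", torch.bfloat16
    return "cpu", torch.float32

device, dtype = pick_device_dtype()
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-1.5B-Instruct", torch_dtype=dtype).to(device)

def complete(prompt: str) -> str:
    # Fresh cache per request: no state crosses request boundaries,
    # so concurrent requests cannot corrupt each other's KV entries.
    cache = KakeyaLatticeCache(variant="e8", q_range=38)  # assumed kwargs
    inputs = tok(prompt, return_tensors="pt").to(device)
    out = model.generate(**inputs, max_new_tokens=64, past_key_values=cache)
    return tok.decode(out[0], skip_special_tokens=True)
```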

File mapping (for porting into the Atomic-Chat main repo)

| Path in this repo | Target location |
| --- | --- |
| integrations/atomic-chat/kakeya_sidecar/ | Standalone pip package; the packaged sidecar is bundled into the .dmg |
| integrations/atomic-chat/kakeyalattice-extension/ | extensions/ |
| integrations/atomic-chat/tauri-plugin-kakeyalattice/ | src-tauri/plugins/kakeyalattice/ |

Follow-up PRs (not blocking this one)

  • B2: MLX + DFlash + KakeyaLattice-MLX port (branch AgentMemory/atomic-chat-b2-mlx-dflash-kakeya-04ae), delivering an MLX implementation of the E8 codec, an mlx-lm KV-cache wrapper, and a sidecar wired to the DFlash drafter.
  • Possible standalone PRs: an L1 in-browser codec demo, an L2 web-native LLM + Kakeya, and a remote vLLM backend extension option.

Verification commands

```sh
cd integrations/atomic-chat/kakeya_sidecar
pip install --quiet pytest fastapi httpx pydantic sse-starlette
PYTHONPATH=. python3 -m pytest tests/ -v
# => 15 passed
cd ../kakeyalattice-extension
npx --yes -p typescript@5.4 tsc --noEmit
# => clean
```
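Once the sidecar itself is running (default port :1338, per the commit notes), it behaves like any OpenAI-compatible server. A minimal streaming-client sketch; the channel id "qwen3-4b@e8-q38" is a hypothetical instance of the syntax above, not a guaranteed registry entry:

```python
# Minimal streaming client for the sidecar's OpenAI-compatible API.
# Assumes the sidecar is already running on localhost:1338.
import json
import httpx

payload = {
    "model": "qwen3-4b@e8-q38",  # hypothetical channel id
    "messages": [{"role": "user", "content": "Say hello."}],
    "stream": True,
}
with httpx.stream("POST", "http://127.0.0.1:1338/v1/chat/completions",
                  json=payload, timeout=60) as resp:
    for line in resp.iter_lines():
        # Each SSE event arrives as a "data: {...}" line.
        if not line.startswith("data: ") or line.strip() == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data: "):])
        delta = chunk["choices"][0].get("delta", {}).get("content", "")
        print(delta, end="", flush=True)
```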

Commits

  1. docs: add Atomic-Chat x KakeyaLattice v1.5 integration design
  2. feat(sidecar): OpenAI-compatible Python sidecar with KakeyaLatticeCache
  3. feat(extension): Atomic-Chat TS extension + Tauri Rust plugin skeleton
  4. docs: add §12 DFlash / §13 Web / §14 A0 vs B2 decision matrix

A note on the atomic.chat homepage's advertised "Google TurboQuant built-in": per the v1.5 report, TQ at b=2 is structurally unusable across all 4 models, and TQ at b=3 is beaten across the board by E8 Q=4 by 3-6×. This integration turns the advertised "KV compression" into an implementation that holds up under engineering scrutiny.


cursoragent and others added 4 commits April 30, 2026 03:08
Analyse AtomicBot-ai/Atomic-Chat local-deployment architecture and
propose a concrete plan that embeds KakeyaLattice v1.5 (E8 lattice
KV-cache codec) as a first-class second inference backend on macOS
(Apple Silicon / Metal).

- Atomic-Chat stack analysis: Tauri + React + extensions + llama.cpp
  plugin, OpenAI-compatible server at localhost:1337.
- Hard conflict: KakeyaLattice is PyTorch-first; llama.cpp has no
  pluggable KV-cache quantisation interface. Conclusion: integrate
  as a parallel backend rather than inside llama.cpp.
- Target architecture: Python sidecar (HF transformers + MPS +
  KakeyaLatticeCache) + TS extension + Rust Tauri plugin.
- Multi-model Mac deployment: Qwen3, Llama-3.x, Gemma-4, DeepSeek-R1-
  Distill, GLM-4, Mistral (all head_dim divisible by 8).
- Per-model Q-channel profiles grounded in V15_FULL_4MODEL_REPORT.md.
- Roadmap through Metal-fused E8 kernel and index-storage cache mode.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
New package `kakeya-sidecar` (installable via pip) that runs
HuggingFace transformers + kakeyalattice.hf.KakeyaLatticeCache and
exposes an OpenAI-compatible HTTP server (default :1338) for
Atomic-Chat to proxy into.

- FastAPI endpoints: /health, /v1/models, /v1/chat/completions
  (streaming + non-streaming), /v1/kakeya/stats.
- `KakeyaEngine` with LRU model cache and per-request
  KakeyaLatticeCache. Device auto-detect (mps / cuda / cpu), dtype
  auto-pick (fp16 on MPS, bf16 on CUDA).
- Model registry covering Qwen3-4B, Qwen2-1.5B, Llama-3.2-3B-Instruct,
  Llama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.3, Gemma-4-E4B,
  DeepSeek-R1-Distill 1.5B/7B, GLM-4-9B-Chat. Per-model Q/boundary
  channels grounded in V15_FULL_4MODEL_REPORT.md; DeepSeek small
  models force boundary>=2 per the report's no-boundary blowup.
- Channel id parsing (<short>@<variant>-q<Q>[-b<B>]) with full
  roundtrip coverage.
- 15 unit tests (model_registry + server routing with mocked engine).
  All green without torch or HF downloads.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Drop-in skeleton that the Atomic-Chat host app can place under
extensions/ and src-tauri/plugins/ to register KakeyaLattice as a
first-class local inference backend alongside llama.cpp.

TypeScript extension (@atomic-chat/kakeyalattice-extension):
- `KakeyaBackend` implementing the host Backend contract:
  listModels / chatCompletion / chatCompletionStream / health / stats.
- Streaming via Tauri event channel 'kakeyalattice:<stream_id>'.
- Typecheck passes under tsc 5.4 with ambient shims for @tauri-apps/api.

Rust Tauri 2 plugin (tauri-plugin-kakeyalattice):
- Supervises the kakeya-sidecar Python process (spawn, tail logs,
  wait-for-health up to 30s). Aligned with the existing llamacpp
  plugin's lifecycle model.
- Thin HTTP proxies to the sidecar: list_models, chat_completion,
  health, stats.
- SSE-to-Tauri-event bridge for streaming deltas.
- PluginConfig read from tauri.conf.json: sidecarHost/port,
  autoStart, device.

Implementation is skeleton-level: the commands are compile-ready but
expected to be finalised when dropped into the Atomic-Chat main
repo. This keeps the review surface small and avoids spawning
sidecar work inside a research repo.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
docs: add §12 DFlash / §13 Web / §14 A0 vs B2 decision matrix

Complete the integration design with three roadmap/evaluation sections:

- §12 DFlash (block diffusion speculative decoding) integration path:
  relationship to KakeyaLattice (orthogonal axes: time vs space),
  interaction surface (verify as mini-prefill amortising codec cost,
  draft-aggressive/target-conservative layered compression, acceptance-
  rate risk, MLX port blocker), B2 roadmap M1-M6, planned
  `dflash_draft_repo` field on DeploymentProfile with the z-lab
  pre-trained draft catalogue.

- §13 Browser inference feasibility assessment: W1-W4 definitions,
  per-component WebGPU feasibility (target LLM ✓, E8 codec ✓, DFlash
  ✗ short-term), number anchors (WASM 4 GB, WebGPU buffer quota,
  native efficiency 50-70%), advantage/disadvantage summary, three
  laddered paths L1-L3, W2 pragmatic option (sidecar-served /chat.html).

- §14 A0 (llama.cpp) vs B2 (MLX + DFlash + KakeyaLattice) decision
  table: full-stack side-by-side, multi-dimensional comparison,
  each side's three hardcore advantages/weaknesses, user-persona-to-
  best-choice mapping, three sentences for product decision makers,
  B1-vs-B2 PR scope reaffirmation.

These three sections close the design narrative from 'why not inside
llama.cpp' (§1-4) through B1 implementation (§5-11) to
DFlash/Web/A0-vs-B2 roadmap (§12-14), giving the Atomic-Chat team
a single document to review when considering this integration.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
