feat(atomic-chat): KakeyaLattice v1.5 integration for Atomic-Chat — B1 (HF + MPS sidecar) #57
Draft
FluffyAIcode wants to merge 4 commits into main from
Conversation
Analyse the AtomicBot-ai/Atomic-Chat local-deployment architecture and propose a concrete plan that embeds KakeyaLattice v1.5 (E8 lattice KV-cache codec) as a first-class second inference backend on macOS (Apple Silicon / Metal).

- Atomic-Chat stack analysis: Tauri + React + extensions + llama.cpp plugin, OpenAI-compatible server at localhost:1337.
- Hard conflict: KakeyaLattice is PyTorch-first; llama.cpp has no pluggable KV-cache quantisation interface. Conclusion: integrate as a parallel backend rather than inside llama.cpp.
- Target architecture: Python sidecar (HF transformers + MPS + KakeyaLatticeCache) + TS extension + Rust Tauri plugin.
- Multi-model Mac deployment: Qwen3, Llama-3.x, Gemma-4, DeepSeek-R1-Distill, GLM-4, Mistral (all head_dim divisible by 8).
- Per-model Q-channel profiles grounded in V15_FULL_4MODEL_REPORT.md.
- Roadmap through Metal-fused E8 kernel and index-storage cache mode.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
New package `kakeya-sidecar` (installable via pip) that runs HuggingFace transformers + kakeyalattice.hf.KakeyaLatticeCache and exposes an OpenAI-compatible HTTP server (default :1338) for Atomic-Chat to proxy into.

- FastAPI endpoints: /health, /v1/models, /v1/chat/completions (streaming + non-streaming), /v1/kakeya/stats.
- `KakeyaEngine` with LRU model cache and per-request KakeyaLatticeCache. Device auto-detect (mps / cuda / cpu), dtype auto-pick (fp16 on MPS, bf16 on CUDA).
- Model registry covering Qwen3-4B, Qwen2-1.5B, Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.3, Gemma-4-E4B, DeepSeek-R1-Distill 1.5B/7B, GLM-4-9B-Chat. Per-model Q/boundary channels grounded in V15_FULL_4MODEL_REPORT.md; the DeepSeek small models force boundary>=2 per the report's no-boundary blowup.
- Channel id parsing (<short>@<variant>-q<Q>[-b<B>]) with full roundtrip coverage (grammar sketched below).
- 15 unit tests (model_registry + server routing with mocked engine), all green without torch or HF downloads.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
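The channel-id grammar is small enough to pin down in a sketch. The helper below is illustrative, not the package's actual parser: the field names (short, variant, q, boundary) and the dataclass shape are assumptions read off the `<short>@<variant>-q<Q>[-b<B>]` format string.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical parser for the <short>@<variant>-q<Q>[-b<B>] channel-id grammar.
# Names are illustrative; the real parser lives in kakeya_sidecar.
CHANNEL_RE = re.compile(
    r"^(?P<short>[^@]+)@(?P<variant>[a-z0-9]+)-q(?P<q>\d+)(?:-b(?P<b>\d+))?$"
)

@dataclass(frozen=True)
class ChannelId:
    short: str                       # model short name, e.g. "qwen3-4b"
    variant: str                     # codec variant, e.g. "e8"
    q: int                           # quantisation level Q
    boundary: Optional[int] = None   # boundary channels B, optional

    def __str__(self) -> str:
        suffix = f"-b{self.boundary}" if self.boundary is not None else ""
        return f"{self.short}@{self.variant}-q{self.q}{suffix}"

def parse_channel_id(raw: str) -> ChannelId:
    m = CHANNEL_RE.match(raw)
    if m is None:
        raise ValueError(f"unknown channel id: {raw!r}")
    b = m.group("b")
    return ChannelId(m.group("short"), m.group("variant"),
                     int(m.group("q")), int(b) if b is not None else None)

# Roundtrip: parse then re-serialise should be the identity.
assert str(parse_channel_id("qwen3-4b@e8-q4-b2")) == "qwen3-4b@e8-q4-b2"
assert parse_channel_id("llama3.2-3b@e8-q6").boundary is None
```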
Drop-in skeleton that the Atomic-Chat host app can place under extensions/ and src-tauri/plugins/ to register KakeyaLattice as a first-class local inference backend alongside llama.cpp.

TypeScript extension (@atomic-chat/kakeyalattice-extension):
- `KakeyaBackend` implementing the host Backend contract: listModels / chatCompletion / chatCompletionStream / health / stats.
- Streaming via Tauri event channel 'kakeyalattice:<stream_id>'.
- Typecheck passes under tsc 5.4 with ambient shims for @tauri-apps/api.

Rust Tauri 2 plugin (tauri-plugin-kakeyalattice):
- Supervises the kakeya-sidecar Python process (spawn, tail logs, wait for health for up to 30s). Aligned with the existing llamacpp plugin's lifecycle model.
- Thin HTTP proxies to the sidecar: list_models, chat_completion, health, stats (exercised directly in the smoke-test sketch below).
- SSE-to-Tauri-event bridge for streaming deltas.
- PluginConfig read from tauri.conf.json: sidecarHost/port, autoStart, device.

Implementation is skeleton-level: the commands are compile-ready but expected to be finalised when dropped into the Atomic-Chat main repo. This keeps the review surface small and avoids spawning sidecar work inside a research repo.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
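Everything the extension and plugin proxy ultimately lands on the sidecar's OpenAI-compatible surface, so the integration can be smoke-tested without the Tauri host at all. A minimal sketch, assuming the sidecar is already running on its default port :1338 and that the channel id used as the model field exists in the registry (the id here is hypothetical):

```python
import json
import urllib.request

SIDECAR = "http://127.0.0.1:1338"  # kakeya-sidecar default port per this PR

def post_json(path: str, payload: dict) -> dict:
    req = urllib.request.Request(
        SIDECAR + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Health and model listing mirror the plugin's health / list_models proxies.
with urllib.request.urlopen(SIDECAR + "/health") as resp:
    print(json.load(resp))

# Non-streaming chat completion; "qwen3-4b@e8-q6" is an assumed channel id.
out = post_json("/v1/chat/completions", {
    "model": "qwen3-4b@e8-q6",
    "messages": [{"role": "user", "content": "Say hi in five words."}],
    "stream": False,
})
print(out["choices"][0]["message"]["content"])
```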
docs: add §12 DFlash / §13 Web / §14 A0 vs B2 decision matrix

Complete the integration design with three roadmap/evaluation sections:

- §12 DFlash (block diffusion speculative decoding) integration path: relationship to KakeyaLattice (orthogonal axes: time vs space), interaction surface (verify as mini-prefill amortising codec cost, draft-aggressive/target-conservative layered compression, acceptance-rate risk, MLX port blocker), B2 roadmap M1-M6, planned `dflash_draft_repo` field on DeploymentProfile with the z-lab pre-trained draft catalogue.
- §13 Browser inference feasibility assessment: W1-W4 definitions, per-component WebGPU feasibility (target LLM ✓, E8 codec ✓, DFlash ✗ short-term), number anchors (WASM 4 GB, WebGPU buffer quota, native efficiency 50-70%), advantage/disadvantage summary, three laddered paths L1-L3, W2 pragmatic option (sidecar-served /chat.html).
- §14 A0 (llama.cpp) vs B2 (MLX + DFlash + KakeyaLattice) decision table: full-stack side-by-side, multi-dimensional comparison, each side's three hardcore advantages/weaknesses, user-persona-to-best-choice mapping, three sentences for product decision makers, B1-vs-B2 PR scope reaffirmation.

These three sections close the design narrative from 'why not inside llama.cpp' (§1-4) through B1 implementation (§5-11) to DFlash/Web/A0-vs-B2 roadmap (§12-14), giving the Atomic-Chat team a single document to review when considering this integration.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Summary

Wire KakeyaLattice v1.5 (E8 nested-lattice KV-cache codec) into AtomicBot-ai/Atomic-Chat as a second first-class local inference backend, peer to the existing llama.cpp backend and routed by the ExtensionSystem. This PR delivers the B1 plan (HF transformers + MPS Python sidecar) plus the full design narrative; the B2 plan (MLX + DFlash + KakeyaLattice-MLX) goes to a separate follow-up PR.

Full design document: docs/ATOMIC_CHAT_KAKEYA_INTEGRATION.md (§1-§14). Design narrative chain:
KakeyaLatticeCache + OpenAI-compatible interface + multi-model deployment profiles.

Deliverables (B1)

Code skeleton

- integrations/atomic-chat/kakeya_sidecar/ — OpenAI-compatible Python sidecar
  - FastAPI endpoints: /health, /v1/models, /v1/chat/completions (stream + non-stream), /v1/kakeya/stats
  - KakeyaEngine — HF transformers + KakeyaLatticeCache, device auto-detect (mps/cuda/cpu), dtype auto-pick (see the sketch after this list)
  - model_registry.py — 9 model deployment profiles (Qwen3-4B, Qwen2-1.5B, Llama-3.2-3B/3.1-8B, Mistral-7B, Gemma-4-E4B, DeepSeek-R1-Distill 1.5B/7B, GLM-4-9B) × 2-3 Q channels each
  - Channel ids <short>@<variant>-q<Q>[-b<B>] with full roundtrip parsing
  - Profiles grounded in reports/v1_5_release/V15_FULL_4MODEL_REPORT.md; DeepSeek-R1-Distill 1.5B forces boundary>=2 (in the report, no-boundary blows up to 50,000% |Δppl|)
- integrations/atomic-chat/kakeyalattice-extension/ — Atomic-Chat TypeScript extension
  - KakeyaBackend implements the Atomic-Chat Backend contract (listModels / chatCompletion / stream / health / stats)
  - invoke bridges across to the Rust plugin
  - tsc --noEmit passes (with ambient shims for @tauri-apps/api as minimal CI)
- integrations/atomic-chat/tauri-plugin-kakeyalattice/ — Rust Tauri 2 plugin skeleton
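The device/dtype auto-pick in KakeyaEngine is simple enough to sketch. A minimal illustration of the rule stated above (fp16 on MPS, bf16 on CUDA); the function name is hypothetical and the CPU fallback dtype is an assumption, since the PR text does not specify it:

```python
import torch

def pick_device_and_dtype() -> tuple[str, torch.dtype]:
    """Hypothetical stand-in for KakeyaEngine's auto-detect: prefer MPS on
    Apple Silicon, then CUDA, then CPU, pairing each device with the dtype
    named in the PR description."""
    if torch.backends.mps.is_available():
        return "mps", torch.float16    # fp16 on MPS
    if torch.cuda.is_available():
        return "cuda", torch.bfloat16  # bf16 on CUDA
    return "cpu", torch.float32       # assumed safe fallback on CPU

device, dtype = pick_device_and_dtype()
```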
Tests

15 unit tests, all green (pytest tests/ -v; no torch or HF downloads needed):

- test_model_registry.py (11 tests): registry structure, default-channel consistency, DeepSeek forced boundary, channel id bidirectional parsing, error paths for unknown model/channel
- test_server_routing.py (4 tests): /v1/models, /health, /v1/chat/completions under a mocked engine, 404 for an unknown model (the mocked-routing pattern is sketched below)
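For reference, the mocked-engine routing pattern can be reproduced in miniature. This self-contained toy app is illustrative only, not the sidecar's actual test code; MockEngine, make_app, and the channel id are all hypothetical names:

```python
from fastapi import FastAPI, HTTPException
from fastapi.testclient import TestClient

class MockEngine:
    """Stands in for KakeyaEngine: knows its models, never touches torch."""
    models = ["qwen2-1.5b@e8-q6"]  # hypothetical channel id

    def has(self, model_id: str) -> bool:
        return model_id in self.models

def make_app(engine: MockEngine) -> FastAPI:
    app = FastAPI()

    @app.get("/health")
    def health():
        return {"status": "ok"}

    @app.get("/v1/models")
    def models():
        return {"object": "list", "data": [{"id": m} for m in engine.models]}

    @app.post("/v1/chat/completions")
    def chat(body: dict):
        # Unknown models get a 404, mirroring the routing tests above.
        if not engine.has(body.get("model", "")):
            raise HTTPException(status_code=404, detail="unknown model")
        return {"choices": [{"message": {"role": "assistant", "content": "hi"}}]}

    return app

client = TestClient(make_app(MockEngine()))
assert client.get("/health").json() == {"status": "ok"}
assert client.get("/v1/models").json()["data"][0]["id"] == "qwen2-1.5b@e8-q6"
assert client.post("/v1/chat/completions", json={"model": "nope"}).status_code == 404
```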
Key engineering decisions

- The sidecar depends on the kakeyalattice package proper, not the vllm_backend/kakeya_v1_4_snapshot/ copy.
- One KakeyaLatticeCache per request, never reused across requests (see the sketch after this list).
- Default variant="e8" + q_range=3-8 (near-lossless).
- Q=4 is exposed only under "advanced options", with a warning.
- boundary=2 for the DeepSeek-R1-Distill models (the report's forced minimum).
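A sketch of the per-request wiring. The KakeyaLatticeCache constructor signature below is an assumption based on the knobs the registry exposes (variant / q / boundary); the only parts confirmed by this PR are the kakeyalattice.hf.KakeyaLatticeCache module path and the one-cache-per-request rule. Passing a Cache instance via past_key_values is standard HF transformers usage:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from kakeyalattice.hf import KakeyaLatticeCache  # module path per this PR

device = "mps" if torch.backends.mps.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-1.5B-Instruct", torch_dtype=torch.float16
).to(device)

def complete(prompt: str) -> str:
    # A fresh cache per request: nothing is reused across requests, so one
    # request's compressed KV state can never leak into the next.
    cache = KakeyaLatticeCache(variant="e8", q=6, boundary=2)  # assumed signature
    inputs = tok(prompt, return_tensors="pt").to(device)
    out = model.generate(**inputs, past_key_values=cache, max_new_tokens=64)
    return tok.decode(out[0], skip_special_tokens=True)
```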
File mapping (when porting into the Atomic-Chat main repo)

| This repo | Atomic-Chat main repo |
| --- | --- |
| integrations/atomic-chat/kakeya_sidecar/ | bundled into the .dmg |
| integrations/atomic-chat/kakeyalattice-extension/ | extensions/ |
| integrations/atomic-chat/tauri-plugin-kakeyalattice/ | src-tauri/plugins/kakeyalattice/ |

Follow-up PRs (not blocking this PR)
- B2 (MLX + DFlash + KakeyaLattice-MLX; tracked at AgentMemory/atomic-chat-b2-mlx-dflash-kakeya-04ae): delivers an MLX implementation of the E8 codec + an mlx-lm KV cache wrapper + a sidecar that hooks up the DFlash drafter.

Verification commands

- Sidecar unit tests: pytest tests/ -v (no torch or HF downloads needed)
- Extension typecheck: tsc --noEmit
Commits
- docs: add Atomic-Chat x KakeyaLattice v1.5 integration design
- feat(sidecar): OpenAI-compatible Python sidecar with KakeyaLatticeCache
- feat(extension): Atomic-Chat TS extension + Tauri Rust plugin skeleton
- docs: add §12 DFlash / §13 Web / §14 A0 vs B2 decision matrix

Re the "Google TurboQuant built-in" claim on the atomic.chat homepage: per the v1.5 report, TQ b=2 is structurally unusable across the 4 models, and TQ b=3 is outperformed across the board by E8 Q=4 by 3-6×. This integration cashes out the advertised "KV compression" into an implementation that stands up to engineering scrutiny.