From a0c877642f5af4a6b1dffbce446bb21c1f96f5b2 Mon Sep 17 00:00:00 2001 From: sagar-develop Date: Tue, 9 Jun 2026 01:47:22 +0530 Subject: [PATCH] docs: consolidate architecture into root ARCHITECTURE.md + refresh Merge the new Mermaid visual overview (engine/product split, engine internals, RAG pipeline, device tiering) into the canonical root ARCHITECTURE.md the README already links, and refresh the stale parts: the embedder is now task-aware EmbeddingGemma via ONNX (USE-Lite is the entry tier), plus the hybrid retrieval + dominance gate + title-match + reranker + stateful-KV grounding. Remove the duplicate docs/ARCHITECTURE.md created earlier; fix the gource.md cross-link to ../ARCHITECTURE.md. Co-Authored-By: Claude Opus 4.8 --- ARCHITECTURE.md | 374 ++++++++++++++++++++++++++++++++----------- docs/ARCHITECTURE.md | 206 ------------------------ docs/gource.md | 2 +- 3 files changed, 281 insertions(+), 301 deletions(-) delete mode 100644 docs/ARCHITECTURE.md diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index 7dbf93b..92fc38c 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -1,86 +1,280 @@ # Architecture -This document explains how the modules in `litertlm-kmp` fit together, the design decisions behind the separation, and the platform-specific gotchas the library handles for you. +NativeLM is an on-device document-chat app built on **litertlm-kmp**, a Kotlin +Multiplatform engine that wraps Google's LiteRT-LM. Everything — the language model, +the embedder, the vector index, OCR, speech-to-text — runs locally. No account, no +upload, no telemetry. This document explains how the pieces fit together, the design +decisions behind the engine/product separation, and the platform-specific gotchas the +library handles for you. + +Two Gradle modules: + +- **`:lib`** — the engine (`com.sagar.aicore`), published as `com.sagar:litertlm-kmp`. + Dual-licensed (AGPL-3.0 / commercial). 
Kotlin Multiplatform: `commonMain` holds + platform-neutral contracts and orchestration; `androidMain` holds the Android-backed + inference; `iosMain` carries the iOS roadmap surface; `commonTest` the unit tests. +- **`:sample-app`** — the NativeLM product (`com.nativelm.app`). Android + Compose. It + supplies the platform-backed stores (ObjectBox, DataStore, SAF, ML Kit OCR) and the + user experience, and depends on `:lib` — never the other way around. + +```mermaid +flowchart TB + subgraph product["sample-app · NativeLM (com.nativelm.app)"] + ui["Compose UI
chat · documents · models · settings · studio · sync · lock"] + vm["NativeLmViewModel"] + holders["EngineHolder · RagHolder
NativeLmModelCatalog · EmbedderRecommendation"] + platform["Android platform glue
ObjectBoxDocumentRepository (HNSW)
AndroidTextExtractor + MlKitOcrEngine
AppPreferences (DataStore) · SecureStore"] + end + + subgraph engine[":lib · litertlm-kmp engine (com.sagar.aicore)"] + contracts["Contracts (commonMain)
LocalAiEngine · EmbeddingEngine · Reranker
DocumentIngestor · DocumentRetriever · DocumentStore
ModelCatalog · ModelManager"] + impls["Android impls (androidMain)
LiteRtLmLocalAiEngine (Gemma)
OnnxEmbeddingEngine · OnnxReranker
GemmaBpeTokenizer · BertWordPieceTokenizer"] + end + + ui --> vm --> holders --> contracts + holders --> platform + platform -. implements .-> contracts + contracts --- impls + + classDef p fill:#eef6ee,stroke:#7FA980,color:#1C1B1A; + classDef e fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A; + class ui,vm,holders,platform p; + class contracts,impls e; +``` + +The key architectural rule: **the product talks to the engine only through contracts**. +The product *provides* storage implementations (e.g. `ObjectBoxDocumentRepository` +implements the engine's `DocumentStore`) but never reaches into engine internals. That +inversion is what lets the same engine power a second app (a kids' learning app, Curio) +through a Gradle composite build. ## Module layout ``` litertlm-kmp/ -├── lib/ ← the library; published artifact com.sagar:litertlm-kmp +├── lib/ ← the engine; published artifact com.sagar:litertlm-kmp │ └── src/ -│ ├── commonMain/ ← engine interfaces, ModelManager, ToolSchemaConverter -│ ├── androidMain/ ← LiteRT-LM JNI, MediaPipe Text Embedder, OEM-aware HardwareProvider -│ ├── iosMain/ ← iOS PlatformFolders stub (full engine actuals — v0.3) -│ └── commonTest/ ← unit tests for SHA-256 streaming + ToolSchemaConverter shape -└── sample-app/ ← Compose Android demo; depends on :lib via project() +│ ├── commonMain/ ← contracts, ModelManager, RAG orchestration, Studio, chart +│ ├── androidMain/ ← LiteRT-LM, ONNX embedder/reranker, tokenizers, backup, sync +│ ├── iosMain/ ← iOS surface (full engine actuals — v0.3 roadmap) +│ └── commonTest/ ← unit tests (retrieval, SHA-256, tool-schema, chart) +└── sample-app/ ← Compose Android product (NativeLM); depends on :lib +``` + +--- + +## Engine internals (`:lib`) + +The engine is organised around small, swappable contracts in `commonMain`, each with an +Android implementation in `androidMain`. 
Inference backends are deliberately +**telemetry-free**: the LLM runs on LiteRT-LM (CPU), and the embedder/reranker run on +**ONNX Runtime** (Microsoft, no Google/Play dependency) rather than MediaPipe — a +conscious choice to protect the zero-telemetry promise. + +```mermaid +flowchart LR + subgraph common["commonMain — contracts & orchestration"] + lae["LocalAiEngine
(chat, stateful KV session)"] + ee["EmbeddingEngine
(task-aware: QUERY / DOCUMENT)"] + rr["Reranker
(cross-encoder, optional)"] + ing["DocumentIngestor"] + ret["DocumentRetriever"] + store["DocumentStore"] + cat["ModelCatalog · ModelManager"] + rag["RAG support
TextChunker · KeywordSearch (BM25+RRF)
RagConfig · RagContextFormatter"] + end + + subgraph android["androidMain — inference backends"] + litert["LiteRtLmLocalAiEngine
Gemma via LiteRT-LM (CPU)"] + onnxE["OnnxEmbeddingEngine
EmbeddingGemma-300M (ONNX)"] + useE["MediaPipeEmbeddingEngine
USE-Lite 100-dim (entry tier)"] + onnxR["OnnxReranker
ms-marco MiniLM-L6 (ONNX)"] + tok["GemmaBpeTokenizer · BertWordPieceTokenizer
(pure-Kotlin, validated vs HF)"] + end + + lae -. impl .-> litert + ee -. impl .-> onnxE + ee -. impl .-> useE + rr -. impl .-> onnxR + onnxE --> tok + onnxR --> tok + ing --> ee + ing --> store + ret --> ee + ret --> rr + ret --> store + ret --> rag + + classDef c fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A; + classDef a fill:#eef2f6,stroke:#6a86a8,color:#1C1B1A; + class lae,ee,rr,ing,ret,store,cat,rag c; + class litert,onnxE,useE,onnxR,tok a; ``` -## The four core abstractions +Beyond core inference, the engine also hosts **Studio** (`studio/` — generating mind +maps, timelines, podcasts and other artifacts from documents), **Sync** (`sync/` — P2P +device-to-device transfer over NSD/mDNS + TCP, GMS-free), **Backup** (`backup/` — +passphrase-encrypted `.nlmbak` export, Argon2id + AES-256-GCM), and **Chart** +(`chart/`). -### 1. `LocalAiEngine` — generation interface +### `LocalAiEngine` — generation ```kotlin interface LocalAiEngine { val descriptor: EngineDescriptor suspend fun initializeEngine(modelPath: String): EngineState fun generateStream(request: AiEngineRequest): Flow> + fun openChatSession(history: List, systemInstruction: String?): ChatSession fun formatPrompt(userQuery: String, retrievedContext: String, systemInstruction: String?): String fun releaseResources() } ``` -The engine yields a hot `Flow` so callers can stream tokens, observe lifecycle, and react to faults without blocking. `EngineState` is a sealed hierarchy: `Idle`, `Generating`, `TokenGenerated`, `ToolCallEmitted`, `Error`. The structured-output path (`requireStructuredOutput = true`) emits `ToolCallEmitted` instead of streaming text; the free-text path emits one `TokenGenerated` per delta. 
- -The Android `LiteRtLmLocalAiEngine` implementation: -- Serializes all native calls behind a mutex (LiteRT-LM is not thread-safe) -- Lazily initializes the runtime on first generation request -- Holds the mutex across the LiteRT-LM async callback to prevent stream interleaving when multiple coroutines race - -### 2. `EmbeddingEngine` — vector encoder - -A thin wrapper over MediaPipe's `TextEmbedder` Tasks API. Returns `FloatArray` for each input string. Use it to compute query/document vectors for in-memory cosine similarity in a RAG pipeline. The dimension depends on the bundled embedder model (e.g. 512 for Universal Sentence Encoder, 768 for EmbeddingGemma). - -### 3. `EngineRegistry` — RAM-tier-aware selection - -Multiple `LocalAiEngine` implementations can coexist in the registry. At init time, the registry consults `HardwareProvider.effectiveRamMb()` and selects the right one: - -- 6–9 GB devices → Gemma 4 E2B (smaller, ~2.5 GB on disk, ~3 GB RAM headroom) -- 10+ GB devices → Gemma 4 E4B (larger, ~3.7 GB on disk, more capacity for long contexts) -- Under 6 GB → no engine returned; consumer should surface `DeviceNotSupported` to the user - -This avoids the failure mode where you try to load a model that physically won't fit and the OS kills your app. - -### 4. `ModelManager` — resumable download + integrity check +The engine yields a hot `Flow` so callers can stream tokens, observe +lifecycle, and react to faults without blocking. `EngineState` is a sealed hierarchy: +`Idle`, `Generating`, `TokenGenerated`, `ToolCallEmitted`, `Error`. For multi-turn chat, +`openChatSession` returns a `ChatSession` that keeps a **stateful KV cache** across turns +(flat time-to-first-token); the Android `LiteRtLmLocalAiEngine` serializes all native +calls behind a mutex (LiteRT-LM is not thread-safe) and lazily initializes on first use. 
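A minimal sketch of how a consumer might fold the emitted states into an answer, assuming a simplified mirror of the `EngineState` hierarchy. The payload field names `text` and `message` are assumptions, as is the use of a plain `Iterable` in place of the real `Flow`, chosen so the example runs on the standard library alone.

```kotlin
// Hypothetical mirror of the EngineState hierarchy named above.
// Only ToolCallEmitted(name, arguments) is described in the text;
// the other payload fields (text, message) are illustrative assumptions.
sealed interface EngineState {
    object Idle : EngineState
    object Generating : EngineState
    data class TokenGenerated(val text: String) : EngineState
    data class ToolCallEmitted(val name: String, val arguments: Map<String, Any?>) : EngineState
    data class Error(val message: String) : EngineState
}

/**
 * Concatenates streamed token deltas into one answer string.
 * Lifecycle and tool-call states are ignored; an Error state aborts with an exception.
 */
fun collectAnswer(states: Iterable<EngineState>): String {
    val answer = StringBuilder()
    for (state in states) {
        // Exhaustive `when`: the compiler flags any newly added state.
        when (state) {
            EngineState.Idle, EngineState.Generating -> Unit
            is EngineState.TokenGenerated -> answer.append(state.text)
            is EngineState.ToolCallEmitted -> Unit
            is EngineState.Error -> error(state.message)
        }
    }
    return answer.toString()
}

fun main() {
    val states = listOf(
        EngineState.Generating,
        EngineState.TokenGenerated("Hello"),
        EngineState.TokenGenerated(", world"),
        EngineState.Idle,
    )
    println(collectAnswer(states))
}
```

With the real engine the same `when` would sit inside a `collect { }` on the returned flow; the sealed hierarchy is what makes the branch list checkable at compile time.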
-Ktor-backed download with: -- Resume support (uses HTTP `Range` headers if the partial file exists) -- Optional SHA-256 validation post-download (lowercase hex, mismatch deletes the file and emits `DownloadState.Error`) -- Atomic temp → final move so a half-downloaded file is never visible to the engine -- `Flow` for progress UI +### `EmbeddingEngine` — vector encoder (task-aware) -## The OEM RAM-expansion gotcha (this is the load-bearing piece) - -Realme, Xiaomi, OPPO, vivo, and some Samsung variants ship a "virtual RAM" feature that swaps to flash storage: - -- **Realme Dynamic RAM Expansion** (RDRAM / DRE) -- **Xiaomi Memory Extension** -- **OPPO RAM Expansion** +```kotlin +enum class EmbeddingTask { QUERY, DOCUMENT } -When enabled, these features inflate `MemoryInfo.totalMem` as reported by `ActivityManager.getMemoryInfo()`. A device with 8 GB of physical RAM may report **14 GB** (8 physical + 6 swap). If you size your model tier off `totalMem`, you'll happily load the 4-GB Gemma 4 E4B variant on a device that physically can't hold it, get killed by the LMKD, and look broken. +interface EmbeddingEngine { + val dimensions: Int + suspend fun initialize(modelPath: String) + suspend fun embed(text: String, task: EmbeddingTask, title: String? = null): FloatArray +} +``` -`AndroidHardwareProvider` detects this by: +The default embedder is **EmbeddingGemma-300M via ONNX Runtime** (`OnnxEmbeddingEngine`), +with **USE-Lite 100-dim** (`MediaPipeEmbeddingEngine`) as the no-download entry tier. +EmbeddingGemma is *instruction-tuned*, so the contract is task-aware — a query and a +document chunk are embedded with different prompts ("task: search result | query: …" vs +"title: … | text: …"). One downloaded model serves every capable tier via **Matryoshka +truncation** (768 → 512 / 256 / 128) followed by re-normalisation. 
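The truncate-then-re-normalise step can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `truncateAndNormalize` is a hypothetical helper name, and the toy four-component vector stands in for a 768-dim EmbeddingGemma output.

```kotlin
import kotlin.math.sqrt

/**
 * Keeps the first [dim] components of a full-width embedding and rescales
 * them to unit L2 length, so cosine / dot-product scores stay comparable
 * across dimensions. Hypothetical helper, not part of the documented API.
 */
fun truncateAndNormalize(full: FloatArray, dim: Int): FloatArray {
    require(dim in 1..full.size) { "dim must be in 1..${full.size}, was $dim" }
    val head = full.copyOf(dim)
    // Accumulate in Double to limit rounding error on long vectors.
    var sumSquares = 0.0
    for (v in head) sumSquares += v.toDouble() * v.toDouble()
    val norm = sqrt(sumSquares)
    if (norm == 0.0) return head // all-zero vector: nothing to rescale
    for (i in head.indices) head[i] = (head[i] / norm).toFloat()
    return head
}

fun main() {
    val full = floatArrayOf(3f, 4f, 12f, 84f) // toy stand-in for a 768-dim vector
    val small = truncateAndNormalize(full, 2)
    println(small.joinToString()) // first two components, rescaled to unit length
}
```

Re-normalising matters because the truncated prefix of a unit vector is generally shorter than unit length; without it, similarity scores from different tiers would not be on the same scale.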
Tokenization is a +**pure-Kotlin** `GemmaBpeTokenizer` — onnxruntime-extensions' in-graph tokenizer doesn't +support `GemmaTokenizer`, so the BPE (and the reranker's WordPiece) were reimplemented in +Kotlin and validated bit-for-bit against Hugging Face `transformers`. + +### `Reranker` — optional cross-encoder second stage + +`OnnxReranker` runs **ms-marco MiniLM-L6** (Apache-2.0, ungated, ~90 MB) to re-score the +top fused candidates by query↔passage relevance. It's enabled on ≥8 GB tiers and runs +only on the ~24-candidate pool at query time. + +### `ModelManager` — resumable download + integrity check + +Ktor-backed: HTTP `Range` resume, optional SHA-256 validation (mismatch deletes + emits +`DownloadState.Error`), atomic temp→final move so a half-downloaded file is never visible +to the engine, and a `Flow` for progress UI. Models can carry +**companion files** (e.g. the ONNX external-data weights blob and the tokenizer), all +fetched together with aggregate progress; gated models reuse the Hugging Face token flow. + +--- + +## The RAG pipeline + +This is the heart of the product: grounding answers in the user's own documents with +citations. Two phases — **ingestion** (on import) and **retrieval** (per question). + +```mermaid +flowchart TB + subgraph ingest["Ingestion — on import"] + i1["PDF / image / text"] + i2["AndroidTextExtractor
(+ MlKitOcrEngine for scans)"] + i3["TextChunker
(≈500 chars, 50 overlap)"] + i4["EmbeddingEngine.embed(text, DOCUMENT)
EmbeddingGemma → Matryoshka dim"] + i5["ObjectBox HNSW
(per-dim entity: 100/128/256/512)"] + i1 --> i2 --> i3 --> i4 --> i5 + end + + subgraph retrieve["Retrieval — on each question"] + q0["User question"] + q1["EmbeddingEngine.embed(query, QUERY)"] + qV["Vector arm
HNSW k-NN, distance-gated"] + qK["Keyword arm
BM25 over term-matching chunks"] + gate["Document relevance gate
dominance (best doc + ties)
+ title-match override"] + fuse["Reciprocal Rank Fusion
+ per-document cap"] + rerankStep["Reranker (≥8 GB tiers)
cross-encoder re-score top pool"] + topk["Top-k chunks → grounding block
(RagContextFormatter, size-capped)"] + llm["LocalAiEngine
(stateful KV; grounding re-flushed per turn)"] + ans["Answer + citations"] + + q0 --> q1 --> qV + q0 --> qK + qV --> gate + qK --> gate + gate --> fuse --> rerankStep --> topk --> llm --> ans + end + + i5 -. queried by .-> qV + i5 -. queried by .-> qK + + classDef ing fill:#eef6ee,stroke:#7FA980,color:#1C1B1A; + classDef ret fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A; + class i1,i2,i3,i4,i5 ing; + class q0,q1,qV,qK,gate,fuse,rerankStep,topk,llm,ans ret; +``` -1. Reading `MemTotal` from `/proc/meminfo` (the actual physical RAM that the kernel sees) -2. Reading `SwapTotal` from the same file -3. If `SwapTotal > 1 GB`, treating the device as RAM-expansion-enabled and capping the effective RAM at **9 GB** — keeping it in the E2B tier even if `MemTotal` says otherwise +Three decisions worth calling out, each from a real failure mode (the full story: +[`_session/material/blog-embedding-enhancements.md`](_session/material/blog-embedding-enhancements.md)): + +- **Hybrid retrieval.** The vector arm finds semantic matches; the BM25 keyword arm + recovers exact strings (names, IDs, codenames) a small embedder ranks poorly. The two + rankings merge with Reciprocal Rank Fusion. +- **Document relevance gate.** With several similar documents (a car, a life, and a + health insurance policy in one project), lexical overlap on words like + "insurance"/"premium" used to let an answer ground on the *wrong* document — e.g. + "car insurance premium" answered from a life policy. The gate keeps only the + document(s) the vector arm clearly favours, and a **title-match override** lets a query + that names a document by its title ("car" → a *CarPolicy* source) ground on that + document over a higher-scoring but wrong one. +- **Stateful KV, flushed grounding.** The chat session keeps a warm KV cache for flat + TTFT, but injecting a fresh grounding block every turn would accumulate in that cache + and overflow the on-device context window (answers degraded to one or two tokens). 
+ Grounded turns re-prefill only the bounded visible transcript, flushing stale grounding. + +--- + +## Device-tiered model selection & the OEM RAM gotcha + +On-device inference must fit the phone. `EmbedderRecommendation.forDevice(ramMb)` mirrors +the LLM tiering and picks embedder, Matryoshka dimension, and reranker — keyed on +**effective** RAM: + +```mermaid +flowchart LR + ram{"effective RAM"} + ram -->|"≥ 10 GB"| t4["EmbeddingGemma @512 + reranker"] + ram -->|"8–10 GB"| t3["EmbeddingGemma @256 + reranker"] + ram -->|"6–8 GB"| t2["EmbeddingGemma @256"] + ram -->|"< 6 GB"| t1["USE-Lite @100 (no download)"] + + classDef n fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A; + class t1,t2,t3,t4 n; +``` -The 1 GB swap threshold filters out normal Linux swap (typically a few hundred MB) from OEM-induced swap (always 4 GB+). +**The load-bearing detail — "effective" is not `totalMem`.** Realme, Xiaomi, OPPO, vivo +and some Samsung variants ship a "virtual RAM" feature that swaps to flash (Realme +Dynamic RAM Expansion, Xiaomi Memory Extension, OPPO RAM Expansion). When enabled, these +inflate `MemoryInfo.totalMem`: a phone with 8 GB physical may report **14 GB**. Size your +tier off `totalMem` and you'll load a model that physically can't fit, get killed by the +LMKD, and look broken. `AndroidHardwareProvider` defends against this by reading +`MemTotal` + `SwapTotal` from `/proc/meminfo`; if `SwapTotal > 1 GB` it treats the device +as RAM-expansion-enabled and caps effective RAM at **9 GB**. The 1 GB threshold filters +normal Linux swap from OEM-induced swap (always 4 GB+). If you roll your own on-device LLM +stack and read nothing else, read +[`AndroidHardwareProvider`](lib/src/androidMain/kotlin/com/sagar/aicore/AndroidHardwareProvider.kt). -This is the most important practical lesson in this library. If you're rolling your own on-device LLM stack and read nothing else, read [`AndroidHardwareProvider`](lib/src/androidMain/kotlin/com/sagar/aicore/AndroidHardwareProvider.kt). 
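The swap check and the tier table above can be sketched roughly as below. This is not the code in `AndroidHardwareProvider`: the function names, the `EmbedderChoice` type, and the choice to apply the 9 GB cap as a `min` over `MemTotal` are assumptions made for illustration, and the real cut-offs may differ (a nominal 8 GB phone reports somewhat less than 8192 MB).

```kotlin
/** Reads the numeric kB value of one /proc/meminfo key; null if the key is absent. */
fun meminfoKb(meminfo: String, key: String): Long? =
    meminfo.lineSequence()
        .firstOrNull { it.startsWith("$key:") }
        ?.substringAfter(':')
        ?.trim()
        ?.substringBefore(' ')
        ?.toLongOrNull()

const val SWAP_THRESHOLD_MB = 1024L     // "SwapTotal > 1 GB" from the text
const val EXPANSION_CAP_MB = 9 * 1024L  // "caps effective RAM at 9 GB"

/** Effective RAM in MB: MemTotal, capped when OEM-sized swap is detected. */
fun effectiveRamMb(meminfo: String): Long {
    val totalMb = (meminfoKb(meminfo, "MemTotal") ?: 0L) / 1024
    val swapMb = (meminfoKb(meminfo, "SwapTotal") ?: 0L) / 1024
    // Assumption: the cap never raises the value, hence minOf.
    return if (swapMb > SWAP_THRESHOLD_MB) minOf(totalMb, EXPANSION_CAP_MB) else totalMb
}

// Hypothetical result type; the library's own recommendation type is not shown in the text.
data class EmbedderChoice(val model: String, val dim: Int, val reranker: Boolean)

/** Mirrors the tier diagram: >=10 GB, 8-10 GB, 6-8 GB, <6 GB. */
fun recommendEmbedder(effectiveMb: Long): EmbedderChoice = when {
    effectiveMb >= 10 * 1024 -> EmbedderChoice("EmbeddingGemma", 512, reranker = true)
    effectiveMb >= 8 * 1024 -> EmbedderChoice("EmbeddingGemma", 256, reranker = true)
    effectiveMb >= 6 * 1024 -> EmbedderChoice("EmbeddingGemma", 256, reranker = false)
    else -> EmbedderChoice("USE-Lite", 100, reranker = false)
}

fun main() {
    // Sample text standing in for File("/proc/meminfo").readText() on a device.
    val sample = """
        MemTotal:       11800000 kB
        MemFree:          812345 kB
        SwapTotal:       6291456 kB
    """.trimIndent()
    val ram = effectiveRamMb(sample) // 6 GB of swap trips the cap
    println("$ram MB -> ${recommendEmbedder(ram)}")
}
```

Keeping the parser a pure function of the file's text makes the cap testable off-device, which is useful because the failure it guards against only shows up on specific OEM builds.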
+--- -## Function-calling: `ToolSchemaConverter` +## Function calling — `ToolSchemaConverter` -LiteRT-LM consumes function-calling tool definitions as OpenAPI 3.0 JSON. Hand-writing that JSON is error-prone, so the library exposes an engine-agnostic `ToolSchema.Definition` and converts internally: +LiteRT-LM consumes tool definitions as OpenAPI 3.0 JSON. The library exposes an +engine-agnostic `ToolSchema.Definition` and converts internally: ```kotlin val def = ToolSchema.Definition( @@ -95,31 +289,25 @@ val def = ToolSchema.Definition( val json: String = def.toOpenApiJson() ``` -The structured-output path on `LocalAiEngine`: - -1. Converts your `Definition` to OpenAPI JSON via `toOpenApiJson()` -2. Wraps as an `OpenApiTool` with `automaticToolCalling = false` -3. Sends the prompt with `systemInstruction = "you MUST call the tool"` -4. Reflectively reads `message.toolCalls` from the LiteRT-LM response -5. Emits one `EngineState.ToolCallEmitted(name, arguments)` per call - -Tool arguments come back as `Map`. LiteRT-LM converts camelCase Kotlin parameter names to snake_case schema keys; integer parameters may surface as `Double` (JSON number ambiguity). Coerce accordingly: `arguments["duration_minutes"]?.let { (it as Number).toInt() }`. +The structured-output path converts your `Definition` to OpenAPI JSON, wraps it as an +`OpenApiTool` (`automaticToolCalling = false`), sends the prompt with a "you MUST call the +tool" system instruction, reads `message.toolCalls` from the response, and emits one +`EngineState.ToolCallEmitted(name, arguments)` per call. Arguments arrive as +`Map`; integer params may surface as `Double` (JSON number ambiguity) — +coerce with `(it as Number).toInt()`. 
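A small sketch of the coercion described above, assuming the arguments map is keyed by `String` with loosely typed values (the exact generic signature is not shown here). `intArg` is a hypothetical extension, and it uses a safe cast (`as? Number`) so a non-numeric value yields `null` instead of throwing as the inline `(it as Number)` form would.

```kotlin
/**
 * Reads an integer tool argument that may have been decoded as Int, Long, or Double.
 * Returns null when the key is missing or the value is not numeric.
 * Hypothetical helper, not part of the library.
 */
fun Map<String, Any?>.intArg(key: String): Int? = (this[key] as? Number)?.toInt()

fun main() {
    // Hypothetical payload shaped like a ToolCallEmitted arguments map;
    // the integer arrives as a Double because JSON has a single number type.
    val arguments: Map<String, Any?> = mapOf(
        "duration_minutes" to 45.0,
        "title" to "Standup",
    )
    println(arguments.intArg("duration_minutes")) // 45
    println(arguments.intArg("title"))            // null
}
```

Note that `toInt()` truncates toward zero, so a fractional value such as `45.9` would silently become `45`; validate the range or reject fractions if the tool's contract requires a true integer.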
## Coroutines + thread discipline -- All native LiteRT-LM calls are serialized behind a `Mutex` inside the engine -- The mutex is held across LiteRT-LM's async callback so concurrent `generateStream` calls don't interleave their tokens -- `ModelManager` uses `Dispatchers.IO` internally for file I/O -- All public suspend functions are safe to call from `Dispatchers.Main` +All native LiteRT-LM calls are serialized behind a `Mutex` inside the engine, held across +LiteRT-LM's async callback so concurrent `generateStream` calls don't interleave tokens. +`ModelManager` uses `Dispatchers.IO`; all public suspend functions are safe to call from +`Dispatchers.Main`. ### Production tip — re-throw `CancellationException` in your `collect` -When you wrap `engine.generateStream(...).collect { ... }` in a `try/catch`, a -broad `catch (e: Exception)` will also swallow the `CancellationException` that -coroutine cancellation throws (e.g. when the user taps "Stop", or the -`viewModelScope` is cleared). Swallowing it breaks structured-concurrency -cancellation semantics and surfaces a *cancelled* generation as if it were a -real engine fault. Always let cancellation propagate: +A broad `catch (e: Exception)` around `generateStream(...).collect { }` will also swallow +the `CancellationException` thrown on cancellation (user taps "Stop", scope cleared), +surfacing a *cancelled* generation as a real fault. Always let it propagate: ```kotlin try { @@ -131,34 +319,32 @@ try { } ``` -This is flow exception-transparency: catch only the exceptions you actually -mean to handle. - ## Multimodal vision -`LiteRtLmLocalAiEngine` reports `descriptor.supportsVision = true` and accepts -image input. 
The engine is initialized with `EngineConfig(visionBackend = -Backend.CPU(), maxNumImages = 1)`; `generateStream` filters -`request.attachments` for `Attachment.Image` and, when present, sends the -prompt as a `Contents` bundle of `Content.Text` + `Content.ImageBytes` instead -of the plain-string overload. Both the free-text and structured-output paths -route images through. The loaded `.litertlm` must carry vision-encoder weights -(standard Gemma 4 E2B / E4B do) or init fails. CPU is the deliberate default: -GPU vision delegates vary by device driver and aren't worth the support burden. -Audio attachments are tolerated by the request API but dropped before -inference. - -## DI +`LiteRtLmLocalAiEngine` reports `descriptor.supportsVision = true` and accepts image +input via `EngineConfig(visionBackend = Backend.CPU(), maxNumImages = 1)`; +`generateStream` filters `request.attachments` for `Attachment.Image` and, when present, +sends a `Contents` bundle of `Content.Text` + `Content.ImageBytes`. The loaded +`.litertlm` must carry vision-encoder weights (Gemma 4 E2B / E4B do). CPU is the +deliberate default — GPU vision delegates vary by device driver and aren't worth the +support burden. -The library exposes its surface through a kotlin-inject component (`AiEngineComponent`). Consumers using kotlin-inject can extend the component; consumers using Hilt / Koin / manual wiring can ignore the component and instantiate the implementations directly — every implementation has a no-arg or simple-arg constructor. +## DI & testing -`@AppScope` marks long-lived per-app singletons (the engine, embedding engine, hardware provider, etc.). If you use kotlin-inject, scope your provider graph at app start; if you use a different DI framework, treat these as application-scoped singletons. 
+The library exposes its surface through a kotlin-inject component (`AiEngineComponent`); +consumers on Hilt / Koin / manual wiring can ignore it and instantiate implementations +directly (every one has a simple constructor). `@AppScope` marks app-lived singletons. -## Testing +Unit tests cover the retrieval logic (`DefaultDocumentRetrieverTest` — the dominance +gate, title-match override, hybrid fusion, reranker reorder), streaming SHA-256, and the +OpenAPI tool-schema shape. Engine-level integration (download → load → generate) requires +a connected device + real weights and is verified on-device per release. -The shipped unit tests cover: +--- -- `Sha256Test` — streaming SHA-256 against canonical empty-input + Wikipedia "abc" reference vectors -- `ToolSchemaConverterTest` — OpenAPI 3.0 shape correctness, primitive type mapping, nested arrays, required-list filtering +## Visualising growth -Engine-level integration tests (download → load → generate → release) require a connected device + real model weights, and live in the consumer's own test suite. A future `sample-app` module will include a representative integration test. +This file is the intentional, reviewed view of the architecture, kept in version control +so it evolves with the code. For the *organic* view of how the codebase grew over time, +the git history can be rendered with [Gource](https://gource.io/) — see +[`docs/gource.md`](docs/gource.md) for the NativeLM-branded recipe. diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md deleted file mode 100644 index 5a1367c..0000000 --- a/docs/ARCHITECTURE.md +++ /dev/null @@ -1,206 +0,0 @@ -# Architecture - -NativeLM is an on-device document-chat app built on **litertlm-kmp**, a Kotlin -Multiplatform engine that wraps Google's LiteRT-LM. Everything — the language model, -the embedder, the vector index, OCR, speech-to-text — runs locally. No account, no -upload, no telemetry. 
This document explains how the pieces fit together and how the -codebase is organised so the boundary between the reusable **engine** and the -**product** stays clean as it grows. - -Two Gradle modules: - -- **`:lib`** — the engine (`com.sagar.aicore`). Dual-licensed (AGPL-3.0 / commercial). - Pure Kotlin Multiplatform: `commonMain` holds platform-neutral contracts and - orchestration; `androidMain` holds the Android-backed inference implementations; - `iosMain` carries the iOS roadmap surface. -- **`:sample-app`** — the NativeLM product (`com.nativelm.app`). Android + Compose. It - supplies the platform-backed stores (ObjectBox, DataStore, SAF, ML Kit OCR) and the - user-facing experience, and depends on `:lib` — never the other way around. - -```mermaid -flowchart TB - subgraph product["sample-app · NativeLM (com.nativelm.app)"] - ui["Compose UI
chat · documents · models · settings · studio · sync · lock"] - vm["NativeLmViewModel"] - holders["EngineHolder · RagHolder
NativeLmModelCatalog · EmbedderRecommendation"] - platform["Android platform glue
ObjectBoxDocumentRepository (HNSW)
AndroidTextExtractor + MlKitOcrEngine
AppPreferences (DataStore) · SecureStore"] - end - - subgraph engine[":lib · litertlm-kmp engine (com.sagar.aicore)"] - contracts["Contracts (commonMain)
LocalAiEngine · EmbeddingEngine · Reranker
DocumentIngestor · DocumentRetriever · DocumentStore
ModelCatalog · ModelManager"] - impls["Android impls (androidMain)
LiteRtLmLocalAiEngine (Gemma)
OnnxEmbeddingEngine · OnnxReranker
GemmaBpeTokenizer · BertWordPieceTokenizer"] - end - - ui --> vm --> holders --> contracts - holders --> platform - platform -. implements .-> contracts - contracts --- impls - - classDef p fill:#eef6ee,stroke:#7FA980,color:#1C1B1A; - classDef e fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A; - class ui,vm,holders,platform p; - class contracts,impls e; -``` - -The key architectural rule: **the product talks to the engine only through contracts** -(`LocalAiEngine`, `EmbeddingEngine`, `DocumentRetriever`, `DocumentStore`, …). The -product *provides* the storage implementations (e.g. `ObjectBoxDocumentRepository` -implements the engine's `DocumentStore`) but never reaches into engine internals. That -inversion is what lets the same engine power a second app (a kids' learning app, Curio) -through a Gradle composite build. - ---- - -## Engine internals (`:lib`) - -The engine is organised around small, swappable contracts in `commonMain`, each with an -Android implementation in `androidMain`. Inference backends are deliberately -**telemetry-free**: the LLM runs on LiteRT-LM (CPU), and the embedder/reranker run on -**ONNX Runtime** (Microsoft, no Google/Play dependency) rather than MediaPipe — a -conscious choice to protect the zero-telemetry promise. - -```mermaid -flowchart LR - subgraph common["commonMain — contracts & orchestration"] - lae["LocalAiEngine
(chat, stateful KV session)"] - ee["EmbeddingEngine
(task-aware: QUERY / DOCUMENT)"] - rr["Reranker
(cross-encoder, optional)"] - ing["DocumentIngestor"] - ret["DocumentRetriever"] - store["DocumentStore"] - cat["ModelCatalog · ModelManager
ModelDescriptor · CompanionFile"] - rag["RAG support
TextChunker · KeywordSearch (BM25+RRF)
RagConfig · RagContextFormatter"] - end - - subgraph android["androidMain — inference backends"] - litert["LiteRtLmLocalAiEngine
Gemma via LiteRT-LM (CPU)"] - onnxE["OnnxEmbeddingEngine
EmbeddingGemma-300M (ONNX)"] - useE["MediaPipeEmbeddingEngine
USE-Lite 100-dim (entry tier)"] - onnxR["OnnxReranker
ms-marco MiniLM-L6 (ONNX)"] - tok["GemmaBpeTokenizer · BertWordPieceTokenizer
(pure-Kotlin, validated vs HF)"] - end - - lae -. impl .-> litert - ee -. impl .-> onnxE - ee -. impl .-> useE - rr -. impl .-> onnxR - onnxE --> tok - onnxR --> tok - ing --> ee - ing --> store - ret --> ee - ret --> rr - ret --> store - ret --> rag - - classDef c fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A; - classDef a fill:#eef2f6,stroke:#6a86a8,color:#1C1B1A; - class lae,ee,rr,ing,ret,store,cat,rag c; - class litert,onnxE,useE,onnxR,tok a; -``` - -Beyond core inference, the engine also hosts: **Studio** (`studio/` — generating -artifacts like mind maps, timelines, podcasts from documents), **Sync** (`sync/` — P2P -device-to-device transfer over NSD/mDNS + TCP, GMS-free), **Backup** (`backup/` — -passphrase-encrypted `.nlmbak` export, Argon2id + AES-256-GCM), and **Chart** -(`chart/`). Speech-to-text (`SpeechToText`) is wired to on-device Whisper in the app. - ---- - -## The RAG pipeline - -This is the heart of the product: grounding answers in the user's own documents with -citations. There are two phases — **ingestion** (when a document is imported) and -**retrieval** (when a question is asked). - -```mermaid -flowchart TB - subgraph ingest["Ingestion — on import"] - i1["PDF / image / text"] - i2["AndroidTextExtractor
(+ MlKitOcrEngine for scans)"] - i3["TextChunker
(≈500 chars, 50 overlap)"] - i4["EmbeddingEngine.embed(text, DOCUMENT)
EmbeddingGemma → Matryoshka dim"] - i5["ObjectBox HNSW
(per-dim entity: 100/128/256/512)"] - i1 --> i2 --> i3 --> i4 --> i5 - end - - subgraph retrieve["Retrieval — on each question"] - q0["User question"] - q1["EmbeddingEngine.embed(query, QUERY)"] - qV["Vector arm
HNSW k-NN, distance-gated"] - qK["Keyword arm
BM25 over term-matching chunks"] - gate["Document relevance gate
dominance (best doc + ties)
+ title-match override"] - fuse["Reciprocal Rank Fusion
+ per-document cap"] - rerankStep["Reranker (≥8 GB tiers)
cross-encoder re-score top pool"] - topk["Top-k chunks → grounding block
(RagContextFormatter, size-capped)"] - llm["LocalAiEngine
(stateful KV; grounding re-flushed per turn)"] - ans["Answer + citations"] - - q0 --> q1 --> qV - q0 --> qK - qV --> gate - qK --> gate - gate --> fuse --> rerankStep --> topk --> llm --> ans - end - - i5 -. queried by .-> qV - i5 -. queried by .-> qK - - classDef ing fill:#eef6ee,stroke:#7FA980,color:#1C1B1A; - classDef ret fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A; - class i1,i2,i3,i4,i5 ing; - class q0,q1,qV,qK,gate,fuse,rerankStep,topk,llm,ans ret; -``` - -A few design decisions worth calling out, because they came from real failure modes -(see [`_session/material/blog-embedding-enhancements.md`](../_session/material/blog-embedding-enhancements.md)): - -- **Hybrid retrieval.** The vector arm finds semantic matches; the BM25 keyword arm - recovers exact strings (names, IDs, codenames) that a small embedder ranks poorly. The - two rankings merge with Reciprocal Rank Fusion. -- **Document relevance gate.** With several similar documents (e.g. a car, a life, and a - health insurance policy in one project), lexical overlap on words like - "insurance"/"premium" used to let an answer ground on the *wrong* document. The gate - keeps only the document(s) the vector arm clearly favours, and a **title-match - override** lets a query that names a document by its title ("car" → a *CarPolicy* - source) ground on that document over a higher-scoring but wrong one. -- **Stateful KV, flushed grounding.** The chat session keeps a warm KV cache for flat - time-to-first-token. But injecting a fresh grounding block every turn would accumulate - in that cache and eventually overflow the on-device context window — so grounded turns - re-prefill only the bounded visible transcript, flushing stale grounding. - ---- - -## Device-tiered model selection - -On-device inference must fit the phone. 
`EmbedderRecommendation.forDevice(ramMb)` mirrors -the LLM tiering and picks the embedder, the Matryoshka dimension, and whether to run the -reranker — keyed on effective RAM (after the OEM RAM-expansion cap). One downloaded -EmbeddingGemma model is truncated per tier; entry devices stay on the no-download, -ungated USE-Lite. - -```mermaid -flowchart LR - ram{"effective RAM"} - ram -->|"≥ 10 GB"| t4["EmbeddingGemma @512
+ reranker"] - ram -->|"8–10 GB"| t3["EmbeddingGemma @256
+ reranker"] - ram -->|"6–8 GB"| t2["EmbeddingGemma @256"] - ram -->|"< 6 GB"| t1["USE-Lite @100
(no download, ungated)"] - - classDef n fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A; - class t1,t2,t3,t4 n; -``` - -The same recommendation surfaces in the Models screen as a *Recommended* badge, and the -download flow pulls the model plus its companions (the ONNX external-data weights blob -and the tokenizer) on-device — gated models reuse the Hugging Face token flow. - ---- - -## Visualising growth - -This file is the intentional, reviewed view of the architecture — kept in `docs/` so it -evolves alongside the code (transparent-dev model). For the *organic* view of how the -codebase grew over time, the repository history can be rendered with -[Gource](https://gource.io/) (an animated, file-by-file visualisation of the git log). -See [`docs/gource.md`](gource.md) for the recipe used to produce the growth clip. diff --git a/docs/gource.md b/docs/gource.md index 87cf947..d20c1c4 100644 --- a/docs/gource.md +++ b/docs/gource.md @@ -2,7 +2,7 @@ [Gource](https://gource.io/) renders an animated, file-by-file visualisation of a git repository's history — a "watch the codebase grow" clip. It's a nice companion to -[`ARCHITECTURE.md`](ARCHITECTURE.md): that file is the *intentional* structure, this is +[`ARCHITECTURE.md`](../ARCHITECTURE.md): that file is the *intentional* structure, this is the *organic* growth over time. Handy for launch posts and talks. ## Install (Windows)