diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
index 7dbf93b..92fc38c 100644
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -1,86 +1,280 @@
# Architecture
-This document explains how the modules in `litertlm-kmp` fit together, the design decisions behind the separation, and the platform-specific gotchas the library handles for you.
+NativeLM is an on-device document-chat app built on **litertlm-kmp**, a Kotlin
+Multiplatform engine that wraps Google's LiteRT-LM. Everything — the language model,
+the embedder, the vector index, OCR, speech-to-text — runs locally. No account, no
+upload, no telemetry. This document explains how the pieces fit together, the design
+decisions behind the engine/product separation, and the platform-specific gotchas the
+library handles for you.
+
+Two Gradle modules:
+
+- **`:lib`** — the engine (`com.sagar.aicore`), published as `com.sagar:litertlm-kmp`.
+ Dual-licensed (AGPL-3.0 / commercial). Kotlin Multiplatform: `commonMain` holds
+ platform-neutral contracts and orchestration; `androidMain` holds the Android-backed
+ inference; `iosMain` carries the iOS roadmap surface; `commonTest` the unit tests.
+- **`:sample-app`** — the NativeLM product (`com.nativelm.app`). Android + Compose. It
+ supplies the platform-backed stores (ObjectBox, DataStore, SAF, ML Kit OCR) and the
+ user experience, and depends on `:lib` — never the other way around.
+
+```mermaid
+flowchart TB
+ subgraph product["sample-app · NativeLM (com.nativelm.app)"]
+ ui["Compose UI<br/>chat · documents · models · settings · studio · sync · lock"]
+ vm["NativeLmViewModel"]
+ holders["EngineHolder · RagHolder<br/>NativeLmModelCatalog · EmbedderRecommendation"]
+ platform["Android platform glue<br/>ObjectBoxDocumentRepository (HNSW)<br/>AndroidTextExtractor + MlKitOcrEngine<br/>AppPreferences (DataStore) · SecureStore"]
+ end
+
+ subgraph engine[":lib · litertlm-kmp engine (com.sagar.aicore)"]
+ contracts["Contracts (commonMain)<br/>LocalAiEngine · EmbeddingEngine · Reranker<br/>DocumentIngestor · DocumentRetriever · DocumentStore<br/>ModelCatalog · ModelManager"]
+ impls["Android impls (androidMain)<br/>LiteRtLmLocalAiEngine (Gemma)<br/>OnnxEmbeddingEngine · OnnxReranker<br/>GemmaBpeTokenizer · BertWordPieceTokenizer"]
+ end
+
+ ui --> vm --> holders --> contracts
+ holders --> platform
+ platform -. implements .-> contracts
+ contracts --- impls
+
+ classDef p fill:#eef6ee,stroke:#7FA980,color:#1C1B1A;
+ classDef e fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A;
+ class ui,vm,holders,platform p;
+ class contracts,impls e;
+```
+
+The key architectural rule: **the product talks to the engine only through contracts**.
+The product *provides* storage implementations (e.g. `ObjectBoxDocumentRepository`
+implements the engine's `DocumentStore`) but never reaches into engine internals. That
+inversion is what lets the same engine power a second app (a kids' learning app, Curio)
+through a Gradle composite build.
## Module layout
```
litertlm-kmp/
-├── lib/ ← the library; published artifact com.sagar:litertlm-kmp
+├── lib/ ← the engine; published artifact com.sagar:litertlm-kmp
│ └── src/
-│ ├── commonMain/ ← engine interfaces, ModelManager, ToolSchemaConverter
-│ ├── androidMain/ ← LiteRT-LM JNI, MediaPipe Text Embedder, OEM-aware HardwareProvider
-│ ├── iosMain/ ← iOS PlatformFolders stub (full engine actuals — v0.3)
-│ └── commonTest/ ← unit tests for SHA-256 streaming + ToolSchemaConverter shape
-└── sample-app/ ← Compose Android demo; depends on :lib via project()
+│ ├── commonMain/ ← contracts, ModelManager, RAG orchestration, Studio, chart
+│ ├── androidMain/ ← LiteRT-LM, ONNX embedder/reranker, tokenizers, backup, sync
+│ ├── iosMain/ ← iOS surface (full engine actuals — v0.3 roadmap)
+│ └── commonTest/ ← unit tests (retrieval, SHA-256, tool-schema, chart)
+└── sample-app/ ← Compose Android product (NativeLM); depends on :lib
+```
+
+---
+
+## Engine internals (`:lib`)
+
+The engine is organised around small, swappable contracts in `commonMain`, each with an
+Android implementation in `androidMain`. Inference backends are deliberately
+**telemetry-free**: the LLM runs on LiteRT-LM (CPU), and the default embedder and the
+reranker run on **ONNX Runtime** (Microsoft, no Google/Play dependency) rather than
+MediaPipe, a conscious choice to protect the zero-telemetry promise. MediaPipe remains
+only behind the USE-Lite entry tier (`MediaPipeEmbeddingEngine`).
+
+```mermaid
+flowchart LR
+ subgraph common["commonMain — contracts & orchestration"]
+ lae["LocalAiEngine<br/>(chat, stateful KV session)"]
+ ee["EmbeddingEngine<br/>(task-aware: QUERY / DOCUMENT)"]
+ rr["Reranker<br/>(cross-encoder, optional)"]
+ ing["DocumentIngestor"]
+ ret["DocumentRetriever"]
+ store["DocumentStore"]
+ cat["ModelCatalog · ModelManager"]
+ rag["RAG support<br/>TextChunker · KeywordSearch (BM25+RRF)<br/>RagConfig · RagContextFormatter"]
+ end
+
+ subgraph android["androidMain — inference backends"]
+ litert["LiteRtLmLocalAiEngine<br/>Gemma via LiteRT-LM (CPU)"]
+ onnxE["OnnxEmbeddingEngine<br/>EmbeddingGemma-300M (ONNX)"]
+ useE["MediaPipeEmbeddingEngine<br/>USE-Lite 100-dim (entry tier)"]
+ onnxR["OnnxReranker<br/>ms-marco MiniLM-L6 (ONNX)"]
+ tok["GemmaBpeTokenizer · BertWordPieceTokenizer<br/>(pure-Kotlin, validated vs HF)"]
+ end
+
+ lae -. impl .-> litert
+ ee -. impl .-> onnxE
+ ee -. impl .-> useE
+ rr -. impl .-> onnxR
+ onnxE --> tok
+ onnxR --> tok
+ ing --> ee
+ ing --> store
+ ret --> ee
+ ret --> rr
+ ret --> store
+ ret --> rag
+
+ classDef c fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A;
+ classDef a fill:#eef2f6,stroke:#6a86a8,color:#1C1B1A;
+ class lae,ee,rr,ing,ret,store,cat,rag c;
+ class litert,onnxE,useE,onnxR,tok a;
```
-## The four core abstractions
+Beyond core inference, the engine also hosts **Studio** (`studio/` — generating mind
+maps, timelines, podcasts and other artifacts from documents), **Sync** (`sync/` — P2P
+device-to-device transfer over NSD/mDNS + TCP, GMS-free), **Backup** (`backup/` —
+passphrase-encrypted `.nlmbak` export, Argon2id + AES-256-GCM), and **Chart**
+(`chart/`).
-### 1. `LocalAiEngine` — generation interface
+### `LocalAiEngine` — generation
```kotlin
interface LocalAiEngine {
val descriptor: EngineDescriptor
suspend fun initializeEngine(modelPath: String): EngineState
fun generateStream(request: AiEngineRequest): Flow<EngineState>
+ fun openChatSession(history: List, systemInstruction: String?): ChatSession
fun formatPrompt(userQuery: String, retrievedContext: String, systemInstruction: String?): String
fun releaseResources()
}
```
-The engine yields a hot `Flow` so callers can stream tokens, observe lifecycle, and react to faults without blocking. `EngineState` is a sealed hierarchy: `Idle`, `Generating`, `TokenGenerated`, `ToolCallEmitted`, `Error`. The structured-output path (`requireStructuredOutput = true`) emits `ToolCallEmitted` instead of streaming text; the free-text path emits one `TokenGenerated` per delta.
-
-The Android `LiteRtLmLocalAiEngine` implementation:
-- Serializes all native calls behind a mutex (LiteRT-LM is not thread-safe)
-- Lazily initializes the runtime on first generation request
-- Holds the mutex across the LiteRT-LM async callback to prevent stream interleaving when multiple coroutines race
-
-### 2. `EmbeddingEngine` — vector encoder
-
-A thin wrapper over MediaPipe's `TextEmbedder` Tasks API. Returns `FloatArray` for each input string. Use it to compute query/document vectors for in-memory cosine similarity in a RAG pipeline. The dimension depends on the bundled embedder model (e.g. 512 for Universal Sentence Encoder, 768 for EmbeddingGemma).
-
-### 3. `EngineRegistry` — RAM-tier-aware selection
-
-Multiple `LocalAiEngine` implementations can coexist in the registry. At init time, the registry consults `HardwareProvider.effectiveRamMb()` and selects the right one:
-
-- 6–9 GB devices → Gemma 4 E2B (smaller, ~2.5 GB on disk, ~3 GB RAM headroom)
-- 10+ GB devices → Gemma 4 E4B (larger, ~3.7 GB on disk, more capacity for long contexts)
-- Under 6 GB → no engine returned; consumer should surface `DeviceNotSupported` to the user
-
-This avoids the failure mode where you try to load a model that physically won't fit and the OS kills your app.
-
-### 4. `ModelManager` — resumable download + integrity check
+The engine yields a hot `Flow<EngineState>` so callers can stream tokens, observe
+lifecycle, and react to faults without blocking. `EngineState` is a sealed hierarchy:
+`Idle`, `Generating`, `TokenGenerated`, `ToolCallEmitted`, `Error`. For multi-turn chat,
+`openChatSession` returns a `ChatSession` that keeps a **stateful KV cache** across turns,
+so time-to-first-token stays flat as the conversation grows. The Android
+`LiteRtLmLocalAiEngine` serializes all native calls behind a mutex (LiteRT-LM is not
+thread-safe) and lazily initializes on first use.
-Ktor-backed download with:
-- Resume support (uses HTTP `Range` headers if the partial file exists)
-- Optional SHA-256 validation post-download (lowercase hex, mismatch deletes the file and emits `DownloadState.Error`)
-- Atomic temp → final move so a half-downloaded file is never visible to the engine
-- `Flow` for progress UI
+### `EmbeddingEngine` — vector encoder (task-aware)
-## The OEM RAM-expansion gotcha (this is the load-bearing piece)
-
-Realme, Xiaomi, OPPO, vivo, and some Samsung variants ship a "virtual RAM" feature that swaps to flash storage:
-
-- **Realme Dynamic RAM Expansion** (RDRAM / DRE)
-- **Xiaomi Memory Extension**
-- **OPPO RAM Expansion**
+```kotlin
+enum class EmbeddingTask { QUERY, DOCUMENT }
-When enabled, these features inflate `MemoryInfo.totalMem` as reported by `ActivityManager.getMemoryInfo()`. A device with 8 GB of physical RAM may report **14 GB** (8 physical + 6 swap). If you size your model tier off `totalMem`, you'll happily load the 4-GB Gemma 4 E4B variant on a device that physically can't hold it, get killed by the LMKD, and look broken.
+interface EmbeddingEngine {
+ val dimensions: Int
+ suspend fun initialize(modelPath: String)
+ suspend fun embed(text: String, task: EmbeddingTask, title: String? = null): FloatArray
+}
+```
-`AndroidHardwareProvider` detects this by:
+The default embedder is **EmbeddingGemma-300M via ONNX Runtime** (`OnnxEmbeddingEngine`),
+with **USE-Lite 100-dim** (`MediaPipeEmbeddingEngine`) as the no-download entry tier.
+EmbeddingGemma is *instruction-tuned*, so the contract is task-aware — a query and a
+document chunk are embedded with different prompts ("task: search result | query: …" vs
+"title: … | text: …"). One downloaded model serves every capable tier via **Matryoshka
+truncation** (768 → 512 / 256 / 128) followed by re-normalisation. Tokenization is a
+**pure-Kotlin** `GemmaBpeTokenizer` — onnxruntime-extensions' in-graph tokenizer doesn't
+support `GemmaTokenizer`, so the BPE (and the reranker's WordPiece) were reimplemented in
+Kotlin and validated bit-for-bit against Hugging Face `transformers`.
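
The Matryoshka step described above can be sketched in a few lines. This is a minimal illustration, not the engine's code: `truncateAndNormalize` and `dot` are hypothetical helper names, and the sketch assumes the model's full vector is already available as a `FloatArray`.

```kotlin
import kotlin.math.sqrt

// Hypothetical helper (not the library's API): keep the first `dim` components
// of a Matryoshka-trained embedding, then rescale the result to unit L2 norm.
fun truncateAndNormalize(full: FloatArray, dim: Int): FloatArray {
    require(dim in 1..full.size) { "dim must be in 1..${full.size}, got $dim" }
    val head = full.copyOf(dim)
    var sumSquares = 0.0
    for (v in head) sumSquares += v.toDouble() * v.toDouble()
    val norm = sqrt(sumSquares)
    if (norm == 0.0) return head // an all-zero vector cannot be normalised
    return FloatArray(dim) { i -> (head[i] / norm).toFloat() }
}

// Once both vectors are unit length, cosine similarity is just a dot product.
fun dot(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "dimension mismatch: ${a.size} vs ${b.size}" }
    var acc = 0f
    for (i in a.indices) acc += a[i] * b[i]
    return acc
}
```

Re-normalising after truncation matters because dropping components shrinks the vector's length; without it, scores from different tiers would not be comparable as cosine similarities.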
+
+### `Reranker` — optional cross-encoder second stage
+
+`OnnxReranker` runs **ms-marco MiniLM-L6** (Apache-2.0, ungated, ~90 MB) to re-score the
+top fused candidates by query↔passage relevance. It's enabled on ≥8 GB tiers and runs
+only on the ~24-candidate pool at query time.
+
+### `ModelManager` — resumable download + integrity check
+
+Ktor-backed: HTTP `Range` resume, optional SHA-256 validation (mismatch deletes + emits
+`DownloadState.Error`), atomic temp→final move so a half-downloaded file is never visible
+to the engine, and a `Flow<DownloadState>` for progress UI. Models can carry
+**companion files** (e.g. the ONNX external-data weights blob and the tokenizer), all
+fetched together with aggregate progress; gated models reuse the Hugging Face token flow.
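
The verify-then-publish step can be sketched as follows. This is an illustration under stated assumptions rather than `ModelManager`'s actual code: `sha256Hex` and `verifyAndPublish` are hypothetical names, the download is assumed to be complete in `temp`, and `temp` and `target` are assumed to live on the same filesystem (an atomic move cannot cross volumes). On Android, `java.nio.file` requires API 26+.

```kotlin
import java.io.File
import java.nio.file.Files
import java.nio.file.StandardCopyOption
import java.security.MessageDigest

// Stream the file through SHA-256 so a multi-GB model never sits in memory.
fun sha256Hex(file: File): String {
    val digest = MessageDigest.getInstance("SHA-256")
    file.inputStream().use { input ->
        val buffer = ByteArray(64 * 1024)
        while (true) {
            val read = input.read(buffer)
            if (read < 0) break
            digest.update(buffer, 0, read)
        }
    }
    // Lowercase hex, two characters per byte.
    return digest.digest().joinToString("") { "%02x".format(it) }
}

// Hypothetical helper: delete the temp file on a hash mismatch, otherwise
// move it into place atomically so readers never observe a partial file.
fun verifyAndPublish(temp: File, target: File, expectedSha256: String?): Boolean {
    if (expectedSha256 != null && !sha256Hex(temp).equals(expectedSha256, ignoreCase = true)) {
        temp.delete()
        return false
    }
    Files.move(temp.toPath(), target.toPath(), StandardCopyOption.ATOMIC_MOVE)
    return true
}
```

A caller would map a `false` result to its own error state (the text above names `DownloadState.Error`).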
+
+---
+
+## The RAG pipeline
+
+This is the heart of the product: grounding answers in the user's own documents with
+citations. Two phases — **ingestion** (on import) and **retrieval** (per question).
+
+```mermaid
+flowchart TB
+ subgraph ingest["Ingestion — on import"]
+ i1["PDF / image / text"]
+ i2["AndroidTextExtractor<br/>(+ MlKitOcrEngine for scans)"]
+ i3["TextChunker<br/>(≈500 chars, 50 overlap)"]
+ i4["EmbeddingEngine.embed(text, DOCUMENT)<br/>EmbeddingGemma → Matryoshka dim"]
+ i5["ObjectBox HNSW<br/>(per-dim entity: 100/128/256/512)"]
+ i1 --> i2 --> i3 --> i4 --> i5
+ end
+
+ subgraph retrieve["Retrieval — on each question"]
+ q0["User question"]
+ q1["EmbeddingEngine.embed(query, QUERY)"]
+ qV["Vector arm<br/>HNSW k-NN, distance-gated"]
+ qK["Keyword arm<br/>BM25 over term-matching chunks"]
+ gate["Document relevance gate<br/>dominance (best doc + ties)<br/>+ title-match override"]
+ fuse["Reciprocal Rank Fusion<br/>+ per-document cap"]
+ rerankStep["Reranker (≥8 GB tiers)<br/>cross-encoder re-score top pool"]
+ topk["Top-k chunks → grounding block<br/>(RagContextFormatter, size-capped)"]
+ llm["LocalAiEngine<br/>(stateful KV; grounding re-flushed per turn)"]
+ ans["Answer + citations"]
+
+ q0 --> q1 --> qV
+ q0 --> qK
+ qV --> gate
+ qK --> gate
+ gate --> fuse --> rerankStep --> topk --> llm --> ans
+ end
+
+ i5 -. queried by .-> qV
+ i5 -. queried by .-> qK
+
+ classDef ing fill:#eef6ee,stroke:#7FA980,color:#1C1B1A;
+ classDef ret fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A;
+ class i1,i2,i3,i4,i5 ing;
+ class q0,q1,qV,qK,gate,fuse,rerankStep,topk,llm,ans ret;
+```
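
The chunking step in the ingestion diagram (about 500 characters with a 50-character overlap) can be illustrated with a plain sliding window. This is a sketch only: `chunkText` is a hypothetical name, and the real `TextChunker` may well split on sentence or paragraph boundaries rather than fixed offsets.

```kotlin
// Hypothetical fixed-window chunker: each chunk is at most `chunkSize`
// characters and repeats the last `overlap` characters of the previous one.
fun chunkText(text: String, chunkSize: Int = 500, overlap: Int = 50): List<String> {
    require(chunkSize > 0) { "chunkSize must be positive" }
    require(overlap in 0 until chunkSize) { "overlap must be in 0 until chunkSize" }
    if (text.isEmpty()) return emptyList()
    val step = chunkSize - overlap
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < text.length) {
        val end = minOf(start + chunkSize, text.length)
        chunks += text.substring(start, end)
        if (end == text.length) break // the final chunk may be shorter
        start += step
    }
    return chunks
}
```

The overlap exists so that a sentence straddling a chunk boundary still appears whole in at least one chunk.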
-1. Reading `MemTotal` from `/proc/meminfo` (the actual physical RAM that the kernel sees)
-2. Reading `SwapTotal` from the same file
-3. If `SwapTotal > 1 GB`, treating the device as RAM-expansion-enabled and capping the effective RAM at **9 GB** — keeping it in the E2B tier even if `MemTotal` says otherwise
+Three decisions worth calling out, each from a real failure mode (the full story:
+[`_session/material/blog-embedding-enhancements.md`](_session/material/blog-embedding-enhancements.md)):
+
+- **Hybrid retrieval.** The vector arm finds semantic matches; the BM25 keyword arm
+ recovers exact strings (names, IDs, codenames) a small embedder ranks poorly. The two
+ rankings merge with Reciprocal Rank Fusion.
+- **Document relevance gate.** With several similar documents (a car, a life, and a
+ health insurance policy in one project), lexical overlap on words like
+ "insurance"/"premium" used to let an answer ground on the *wrong* document — e.g.
+ "car insurance premium" answered from a life policy. The gate keeps only the
+ document(s) the vector arm clearly favours, and a **title-match override** lets a query
+ that names a document by its title ("car" → a *CarPolicy* source) ground on that
+ document over a higher-scoring but wrong one.
+- **Stateful KV, flushed grounding.** The chat session keeps a warm KV cache for flat
+ time-to-first-token, but injecting a fresh grounding block every turn would accumulate
+ in that cache and overflow the on-device context window (answers degraded to one or two
+ tokens). Grounded turns therefore re-prefill only the bounded visible transcript,
+ flushing stale grounding.
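
The fusion step in the first bullet can be sketched as below. This is a minimal Reciprocal Rank Fusion over chunk IDs, not the library's `KeywordSearch` code: `reciprocalRankFusion` is a hypothetical name, and the constant `k = 60` is the value commonly used in the RRF literature, assumed here because the document does not state the engine's value.

```kotlin
// Each ranking is a list of chunk IDs, best first. An ID's fused score is the
// sum over rankings of 1 / (k + rank), with rank starting at 1.
fun reciprocalRankFusion(
    rankings: List<List<String>>,
    k: Int = 60, // assumption: conventional default, not taken from the library
): List<Pair<String, Double>> {
    val scores = LinkedHashMap<String, Double>()
    for (ranking in rankings) {
        ranking.forEachIndexed { index, id ->
            scores[id] = (scores[id] ?: 0.0) + 1.0 / (k + index + 1)
        }
    }
    // Highest fused score first; the sort is stable, so ties keep first-seen order.
    return scores.entries.sortedByDescending { it.value }.map { it.key to it.value }
}
```

Because only ranks are used, the vector arm's distances and the keyword arm's BM25 scores never need to be put on a common scale.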
+
+---
+
+## Device-tiered model selection & the OEM RAM gotcha
+
+On-device inference must fit the phone. `EmbedderRecommendation.forDevice(ramMb)` mirrors
+the LLM tiering and picks embedder, Matryoshka dimension, and reranker — keyed on
+**effective** RAM:
+
+```mermaid
+flowchart LR
+ ram{"effective RAM"}
+ ram -->|"≥ 10 GB"| t4["EmbeddingGemma @512 + reranker"]
+ ram -->|"8–10 GB"| t3["EmbeddingGemma @256 + reranker"]
+ ram -->|"6–8 GB"| t2["EmbeddingGemma @256"]
+ ram -->|"< 6 GB"| t1["USE-Lite @100 (no download)"]
+
+ classDef n fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A;
+ class t1,t2,t3,t4 n;
+```
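
The tier table in the diagram reads naturally as a `when` expression. The sketch below is illustrative: `EmbedderTier` and `recommendTier` are hypothetical names (the real entry point is `EmbedderRecommendation.forDevice(ramMb)`, whose return type is not shown here), and treating each lower bound as inclusive is an assumption, since the diagram leaves the exact boundaries open.

```kotlin
// Hypothetical result type; the library's own recommendation type may differ.
data class EmbedderTier(val model: String, val dimensions: Int, val useReranker: Boolean)

// Maps effective RAM (in MB) onto the four tiers from the diagram.
// Assumption: 1 GB = 1024 MB and lower bounds are inclusive.
fun recommendTier(effectiveRamMb: Long): EmbedderTier {
    val gb = 1024L
    return when {
        effectiveRamMb >= 10 * gb -> EmbedderTier("EmbeddingGemma", 512, useReranker = true)
        effectiveRamMb >= 8 * gb -> EmbedderTier("EmbeddingGemma", 256, useReranker = true)
        effectiveRamMb >= 6 * gb -> EmbedderTier("EmbeddingGemma", 256, useReranker = false)
        else -> EmbedderTier("USE-Lite", 100, useReranker = false)
    }
}
```

The thresholds are nominal: a phone sold as "8 GB" usually reports somewhat less than 8192 MB to the kernel, so a real implementation likely needs some slack that this sketch omits.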
-The 1 GB swap threshold filters out normal Linux swap (typically a few hundred MB) from OEM-induced swap (always 4 GB+).
+**The load-bearing detail — "effective" is not `totalMem`.** Realme, Xiaomi, OPPO, vivo
+and some Samsung variants ship a "virtual RAM" feature that swaps to flash (Realme
+Dynamic RAM Expansion, Xiaomi Memory Extension, OPPO RAM Expansion). When enabled, these
+inflate `MemoryInfo.totalMem`: a phone with 8 GB physical may report **14 GB**. Size your
+tier off `totalMem` and you'll load a model that physically can't fit, get killed by the
+LMKD, and look broken. `AndroidHardwareProvider` defends against this by reading
+`MemTotal` + `SwapTotal` from `/proc/meminfo`; if `SwapTotal > 1 GB` it treats the device
+as RAM-expansion-enabled and caps effective RAM at **9 GB**. The 1 GB threshold separates
+normal Linux swap (typically a few hundred MB) from OEM-induced swap (always 4 GB+). If
+you roll your own on-device LLM stack and read nothing else, read
+[`AndroidHardwareProvider`](lib/src/androidMain/kotlin/com/sagar/aicore/AndroidHardwareProvider.kt).
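
The detection rule above can be sketched as a pure function over the text of `/proc/meminfo`. This is an illustration of the described rule, not a copy of `AndroidHardwareProvider`: `parseMeminfoKb` and `effectiveRamMb` are hypothetical names, and the sketch assumes the usual `Key:   value kB` line format.

```kotlin
// Extracts a value in kB from /proc/meminfo text, e.g. "MemTotal:  7812345 kB".
fun parseMeminfoKb(meminfo: String, key: String): Long? =
    meminfo.lineSequence()
        .firstOrNull { it.startsWith("$key:") }
        ?.substringAfter(':')
        ?.trim()
        ?.substringBefore(' ')
        ?.toLongOrNull()

// Physical RAM in MB, capped at 9 GB when swap exceeds 1 GB (the signal the
// text above uses for OEM RAM expansion). Returns null if MemTotal is missing.
fun effectiveRamMb(meminfo: String): Long? {
    val memTotalKb = parseMeminfoKb(meminfo, "MemTotal") ?: return null
    val swapTotalKb = parseMeminfoKb(meminfo, "SwapTotal") ?: 0L
    val physicalMb = memTotalKb / 1024
    val swapThresholdKb = 1024L * 1024L // 1 GB expressed in kB
    val capMb = 9L * 1024L // 9 GB expressed in MB
    return if (swapTotalKb > swapThresholdKb) minOf(physicalMb, capMb) else physicalMb
}
```

On a device the input would come from something like `File("/proc/meminfo").readText()`; keeping the parser pure makes the rule unit-testable without hardware.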
-This is the most important practical lesson in this library. If you're rolling your own on-device LLM stack and read nothing else, read [`AndroidHardwareProvider`](lib/src/androidMain/kotlin/com/sagar/aicore/AndroidHardwareProvider.kt).
+---
-## Function-calling: `ToolSchemaConverter`
+## Function calling — `ToolSchemaConverter`
-LiteRT-LM consumes function-calling tool definitions as OpenAPI 3.0 JSON. Hand-writing that JSON is error-prone, so the library exposes an engine-agnostic `ToolSchema.Definition` and converts internally:
+LiteRT-LM consumes tool definitions as OpenAPI 3.0 JSON. The library exposes an
+engine-agnostic `ToolSchema.Definition` and converts internally:
```kotlin
val def = ToolSchema.Definition(
@@ -95,31 +289,25 @@ val def = ToolSchema.Definition(
val json: String = def.toOpenApiJson()
```
-The structured-output path on `LocalAiEngine`:
-
-1. Converts your `Definition` to OpenAPI JSON via `toOpenApiJson()`
-2. Wraps as an `OpenApiTool` with `automaticToolCalling = false`
-3. Sends the prompt with `systemInstruction = "you MUST call the tool"`
-4. Reflectively reads `message.toolCalls` from the LiteRT-LM response
-5. Emits one `EngineState.ToolCallEmitted(name, arguments)` per call
-
-Tool arguments come back as `Map`. LiteRT-LM converts camelCase Kotlin parameter names to snake_case schema keys; integer parameters may surface as `Double` (JSON number ambiguity). Coerce accordingly: `arguments["duration_minutes"]?.let { (it as Number).toInt() }`.
+The structured-output path converts your `Definition` to OpenAPI JSON, wraps it as an
+`OpenApiTool` (`automaticToolCalling = false`), sends the prompt with a "you MUST call the
+tool" system instruction, reads `message.toolCalls` from the response, and emits one
+`EngineState.ToolCallEmitted(name, arguments)` per call. Arguments arrive as
+`Map`; integer params may surface as `Double` (JSON number ambiguity) —
+coerce with `(it as Number).toInt()`.
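
The coercion can be wrapped in a small helper so call sites do not repeat the cast. This is a sketch under assumptions: `intArg` is a hypothetical extension, and the arguments map is assumed to be keyed by `String` with nullable `Any` values. It uses a safe cast and rejects fractional values instead of truncating them.

```kotlin
import kotlin.math.floor

// Hypothetical helper: reads an integer tool argument that may have been
// decoded as Int, Long, or Double. Returns null when the key is missing,
// the value is not numeric, or the number is not a whole Int-range value.
fun Map<String, Any?>.intArg(key: String): Int? {
    val number = this[key] as? Number ?: return null
    val asDouble = number.toDouble()
    val isWhole = asDouble == floor(asDouble) // false for NaN and for 2.5
    val inRange = asDouble >= Int.MIN_VALUE.toDouble() && asDouble <= Int.MAX_VALUE.toDouble()
    return if (isWhole && inRange) number.toInt() else null
}
```

A call site then reads `arguments.intArg("duration_minutes")`, using the snake_case key as it appears in the schema.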
## Coroutines + thread discipline
-- All native LiteRT-LM calls are serialized behind a `Mutex` inside the engine
-- The mutex is held across LiteRT-LM's async callback so concurrent `generateStream` calls don't interleave their tokens
-- `ModelManager` uses `Dispatchers.IO` internally for file I/O
-- All public suspend functions are safe to call from `Dispatchers.Main`
+All native LiteRT-LM calls are serialized behind a `Mutex` inside the engine, held across
+LiteRT-LM's async callback so concurrent `generateStream` calls don't interleave tokens.
+`ModelManager` uses `Dispatchers.IO`; all public suspend functions are safe to call from
+`Dispatchers.Main`.
### Production tip — re-throw `CancellationException` in your `collect`
-When you wrap `engine.generateStream(...).collect { ... }` in a `try/catch`, a
-broad `catch (e: Exception)` will also swallow the `CancellationException` that
-coroutine cancellation throws (e.g. when the user taps "Stop", or the
-`viewModelScope` is cleared). Swallowing it breaks structured-concurrency
-cancellation semantics and surfaces a *cancelled* generation as if it were a
-real engine fault. Always let cancellation propagate:
+A broad `catch (e: Exception)` around `generateStream(...).collect { }` will also swallow
+the `CancellationException` thrown on cancellation (user taps "Stop", scope cleared),
+surfacing a *cancelled* generation as a real fault. Always let it propagate:
```kotlin
try {
@@ -131,34 +319,32 @@ try {
}
```
-This is flow exception-transparency: catch only the exceptions you actually
-mean to handle.
-
## Multimodal vision
-`LiteRtLmLocalAiEngine` reports `descriptor.supportsVision = true` and accepts
-image input. The engine is initialized with `EngineConfig(visionBackend =
-Backend.CPU(), maxNumImages = 1)`; `generateStream` filters
-`request.attachments` for `Attachment.Image` and, when present, sends the
-prompt as a `Contents` bundle of `Content.Text` + `Content.ImageBytes` instead
-of the plain-string overload. Both the free-text and structured-output paths
-route images through. The loaded `.litertlm` must carry vision-encoder weights
-(standard Gemma 4 E2B / E4B do) or init fails. CPU is the deliberate default:
-GPU vision delegates vary by device driver and aren't worth the support burden.
-Audio attachments are tolerated by the request API but dropped before
-inference.
-
-## DI
+`LiteRtLmLocalAiEngine` reports `descriptor.supportsVision = true` and accepts image
+input via `EngineConfig(visionBackend = Backend.CPU(), maxNumImages = 1)`;
+`generateStream` filters `request.attachments` for `Attachment.Image` and, when present,
+sends a `Contents` bundle of `Content.Text` + `Content.ImageBytes`. The loaded
+`.litertlm` must carry vision-encoder weights (Gemma 4 E2B / E4B do). CPU is the
+deliberate default — GPU vision delegates vary by device driver and aren't worth the
+support burden.
-The library exposes its surface through a kotlin-inject component (`AiEngineComponent`). Consumers using kotlin-inject can extend the component; consumers using Hilt / Koin / manual wiring can ignore the component and instantiate the implementations directly — every implementation has a no-arg or simple-arg constructor.
+## DI & testing
-`@AppScope` marks long-lived per-app singletons (the engine, embedding engine, hardware provider, etc.). If you use kotlin-inject, scope your provider graph at app start; if you use a different DI framework, treat these as application-scoped singletons.
+The library exposes its surface through a kotlin-inject component (`AiEngineComponent`);
+consumers on Hilt / Koin / manual wiring can ignore it and instantiate implementations
+directly (every one has a simple constructor). `@AppScope` marks app-lived singletons.
-## Testing
+Unit tests cover the retrieval logic (`DefaultDocumentRetrieverTest` — the dominance
+gate, title-match override, hybrid fusion, reranker reorder), streaming SHA-256, and the
+OpenAPI tool-schema shape. Engine-level integration (download → load → generate) requires
+a connected device + real weights and is verified on-device per release.
-The shipped unit tests cover:
+---
-- `Sha256Test` — streaming SHA-256 against canonical empty-input + Wikipedia "abc" reference vectors
-- `ToolSchemaConverterTest` — OpenAPI 3.0 shape correctness, primitive type mapping, nested arrays, required-list filtering
+## Visualising growth
-Engine-level integration tests (download → load → generate → release) require a connected device + real model weights, and live in the consumer's own test suite. A future `sample-app` module will include a representative integration test.
+This file is the intentional, reviewed view of the architecture, kept in version control
+so it evolves with the code. For the *organic* view of how the codebase grew over time,
+the git history can be rendered with [Gource](https://gource.io/) — see
+[`docs/gource.md`](docs/gource.md) for the NativeLM-branded recipe.
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
deleted file mode 100644
index 5a1367c..0000000
--- a/docs/ARCHITECTURE.md
+++ /dev/null
@@ -1,206 +0,0 @@
-# Architecture
-
-NativeLM is an on-device document-chat app built on **litertlm-kmp**, a Kotlin
-Multiplatform engine that wraps Google's LiteRT-LM. Everything — the language model,
-the embedder, the vector index, OCR, speech-to-text — runs locally. No account, no
-upload, no telemetry. This document explains how the pieces fit together and how the
-codebase is organised so the boundary between the reusable **engine** and the
-**product** stays clean as it grows.
-
-Two Gradle modules:
-
-- **`:lib`** — the engine (`com.sagar.aicore`). Dual-licensed (AGPL-3.0 / commercial).
- Pure Kotlin Multiplatform: `commonMain` holds platform-neutral contracts and
- orchestration; `androidMain` holds the Android-backed inference implementations;
- `iosMain` carries the iOS roadmap surface.
-- **`:sample-app`** — the NativeLM product (`com.nativelm.app`). Android + Compose. It
- supplies the platform-backed stores (ObjectBox, DataStore, SAF, ML Kit OCR) and the
- user-facing experience, and depends on `:lib` — never the other way around.
-
-```mermaid
-flowchart TB
- subgraph product["sample-app · NativeLM (com.nativelm.app)"]
- ui["Compose UI<br/>chat · documents · models · settings · studio · sync · lock"]
- vm["NativeLmViewModel"]
- holders["EngineHolder · RagHolder<br/>NativeLmModelCatalog · EmbedderRecommendation"]
- platform["Android platform glue<br/>ObjectBoxDocumentRepository (HNSW)<br/>AndroidTextExtractor + MlKitOcrEngine<br/>AppPreferences (DataStore) · SecureStore"]
- end
-
- subgraph engine[":lib · litertlm-kmp engine (com.sagar.aicore)"]
- contracts["Contracts (commonMain)<br/>LocalAiEngine · EmbeddingEngine · Reranker<br/>DocumentIngestor · DocumentRetriever · DocumentStore<br/>ModelCatalog · ModelManager"]
- impls["Android impls (androidMain)<br/>LiteRtLmLocalAiEngine (Gemma)<br/>OnnxEmbeddingEngine · OnnxReranker<br/>GemmaBpeTokenizer · BertWordPieceTokenizer"]
- end
-
- ui --> vm --> holders --> contracts
- holders --> platform
- platform -. implements .-> contracts
- contracts --- impls
-
- classDef p fill:#eef6ee,stroke:#7FA980,color:#1C1B1A;
- classDef e fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A;
- class ui,vm,holders,platform p;
- class contracts,impls e;
-```
-
-The key architectural rule: **the product talks to the engine only through contracts**
-(`LocalAiEngine`, `EmbeddingEngine`, `DocumentRetriever`, `DocumentStore`, …). The
-product *provides* the storage implementations (e.g. `ObjectBoxDocumentRepository`
-implements the engine's `DocumentStore`) but never reaches into engine internals. That
-inversion is what lets the same engine power a second app (a kids' learning app, Curio)
-through a Gradle composite build.
-
----
-
-## Engine internals (`:lib`)
-
-The engine is organised around small, swappable contracts in `commonMain`, each with an
-Android implementation in `androidMain`. Inference backends are deliberately
-**telemetry-free**: the LLM runs on LiteRT-LM (CPU), and the embedder/reranker run on
-**ONNX Runtime** (Microsoft, no Google/Play dependency) rather than MediaPipe — a
-conscious choice to protect the zero-telemetry promise.
-
-```mermaid
-flowchart LR
- subgraph common["commonMain — contracts & orchestration"]
- lae["LocalAiEngine<br/>(chat, stateful KV session)"]
- ee["EmbeddingEngine<br/>(task-aware: QUERY / DOCUMENT)"]
- rr["Reranker<br/>(cross-encoder, optional)"]
- ing["DocumentIngestor"]
- ret["DocumentRetriever"]
- store["DocumentStore"]
- cat["ModelCatalog · ModelManager<br/>ModelDescriptor · CompanionFile"]
- rag["RAG support<br/>TextChunker · KeywordSearch (BM25+RRF)<br/>RagConfig · RagContextFormatter"]
- end
-
- subgraph android["androidMain — inference backends"]
- litert["LiteRtLmLocalAiEngine<br/>Gemma via LiteRT-LM (CPU)"]
- onnxE["OnnxEmbeddingEngine<br/>EmbeddingGemma-300M (ONNX)"]
- useE["MediaPipeEmbeddingEngine<br/>USE-Lite 100-dim (entry tier)"]
- onnxR["OnnxReranker<br/>ms-marco MiniLM-L6 (ONNX)"]
- tok["GemmaBpeTokenizer · BertWordPieceTokenizer<br/>(pure-Kotlin, validated vs HF)"]
- end
-
- lae -. impl .-> litert
- ee -. impl .-> onnxE
- ee -. impl .-> useE
- rr -. impl .-> onnxR
- onnxE --> tok
- onnxR --> tok
- ing --> ee
- ing --> store
- ret --> ee
- ret --> rr
- ret --> store
- ret --> rag
-
- classDef c fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A;
- classDef a fill:#eef2f6,stroke:#6a86a8,color:#1C1B1A;
- class lae,ee,rr,ing,ret,store,cat,rag c;
- class litert,onnxE,useE,onnxR,tok a;
-```
-
-Beyond core inference, the engine also hosts: **Studio** (`studio/` — generating
-artifacts like mind maps, timelines, podcasts from documents), **Sync** (`sync/` — P2P
-device-to-device transfer over NSD/mDNS + TCP, GMS-free), **Backup** (`backup/` —
-passphrase-encrypted `.nlmbak` export, Argon2id + AES-256-GCM), and **Chart**
-(`chart/`). Speech-to-text (`SpeechToText`) is wired to on-device Whisper in the app.
-
----
-
-## The RAG pipeline
-
-This is the heart of the product: grounding answers in the user's own documents with
-citations. There are two phases — **ingestion** (when a document is imported) and
-**retrieval** (when a question is asked).
-
-```mermaid
-flowchart TB
- subgraph ingest["Ingestion — on import"]
- i1["PDF / image / text"]
- i2["AndroidTextExtractor<br/>(+ MlKitOcrEngine for scans)"]
- i3["TextChunker<br/>(≈500 chars, 50 overlap)"]
- i4["EmbeddingEngine.embed(text, DOCUMENT)<br/>EmbeddingGemma → Matryoshka dim"]
- i5["ObjectBox HNSW<br/>(per-dim entity: 100/128/256/512)"]
- i1 --> i2 --> i3 --> i4 --> i5
- end
-
- subgraph retrieve["Retrieval — on each question"]
- q0["User question"]
- q1["EmbeddingEngine.embed(query, QUERY)"]
- qV["Vector arm<br/>HNSW k-NN, distance-gated"]
- qK["Keyword arm<br/>BM25 over term-matching chunks"]
- gate["Document relevance gate<br/>dominance (best doc + ties)<br/>+ title-match override"]
- fuse["Reciprocal Rank Fusion<br/>+ per-document cap"]
- rerankStep["Reranker (≥8 GB tiers)<br/>cross-encoder re-score top pool"]
- topk["Top-k chunks → grounding block<br/>(RagContextFormatter, size-capped)"]
- llm["LocalAiEngine<br/>(stateful KV; grounding re-flushed per turn)"]
- ans["Answer + citations"]
-
- q0 --> q1 --> qV
- q0 --> qK
- qV --> gate
- qK --> gate
- gate --> fuse --> rerankStep --> topk --> llm --> ans
- end
-
- i5 -. queried by .-> qV
- i5 -. queried by .-> qK
-
- classDef ing fill:#eef6ee,stroke:#7FA980,color:#1C1B1A;
- classDef ret fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A;
- class i1,i2,i3,i4,i5 ing;
- class q0,q1,qV,qK,gate,fuse,rerankStep,topk,llm,ans ret;
-```
-
-A few design decisions worth calling out, because they came from real failure modes
-(see [`_session/material/blog-embedding-enhancements.md`](../_session/material/blog-embedding-enhancements.md)):
-
-- **Hybrid retrieval.** The vector arm finds semantic matches; the BM25 keyword arm
- recovers exact strings (names, IDs, codenames) that a small embedder ranks poorly. The
- two rankings merge with Reciprocal Rank Fusion.
-- **Document relevance gate.** With several similar documents (e.g. a car, a life, and a
- health insurance policy in one project), lexical overlap on words like
- "insurance"/"premium" used to let an answer ground on the *wrong* document. The gate
- keeps only the document(s) the vector arm clearly favours, and a **title-match
- override** lets a query that names a document by its title ("car" → a *CarPolicy*
- source) ground on that document over a higher-scoring but wrong one.
-- **Stateful KV, flushed grounding.** The chat session keeps a warm KV cache for flat
- time-to-first-token. But injecting a fresh grounding block every turn would accumulate
- in that cache and eventually overflow the on-device context window — so grounded turns
- re-prefill only the bounded visible transcript, flushing stale grounding.
-
----
-
-## Device-tiered model selection
-
-On-device inference must fit the phone. `EmbedderRecommendation.forDevice(ramMb)` mirrors
-the LLM tiering and picks the embedder, the Matryoshka dimension, and whether to run the
-reranker — keyed on effective RAM (after the OEM RAM-expansion cap). One downloaded
-EmbeddingGemma model is truncated per tier; entry devices stay on the no-download,
-ungated USE-Lite.
-
-```mermaid
-flowchart LR
- ram{"effective RAM"}
- ram -->|"≥ 10 GB"| t4["EmbeddingGemma @512<br/>+ reranker"]
- ram -->|"8–10 GB"| t3["EmbeddingGemma @256<br/>+ reranker"]
- ram -->|"6–8 GB"| t2["EmbeddingGemma @256"]
- ram -->|"< 6 GB"| t1["USE-Lite @100<br/>(no download, ungated)"]
-
- classDef n fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A;
- class t1,t2,t3,t4 n;
-```
-
-The same recommendation surfaces in the Models screen as a *Recommended* badge, and the
-download flow pulls the model plus its companions (the ONNX external-data weights blob
-and the tokenizer) on-device — gated models reuse the Hugging Face token flow.
-
----
-
-## Visualising growth
-
-This file is the intentional, reviewed view of the architecture — kept in `docs/` so it
-evolves alongside the code (transparent-dev model). For the *organic* view of how the
-codebase grew over time, the repository history can be rendered with
-[Gource](https://gource.io/) (an animated, file-by-file visualisation of the git log).
-See [`docs/gource.md`](gource.md) for the recipe used to produce the growth clip.
diff --git a/docs/gource.md b/docs/gource.md
index 87cf947..d20c1c4 100644
--- a/docs/gource.md
+++ b/docs/gource.md
@@ -2,7 +2,7 @@
[Gource](https://gource.io/) renders an animated, file-by-file visualisation of a git
repository's history — a "watch the codebase grow" clip. It's a nice companion to
-[`ARCHITECTURE.md`](ARCHITECTURE.md): that file is the *intentional* structure, this is
+[`ARCHITECTURE.md`](../ARCHITECTURE.md): that file is the *intentional* structure, this is
the *organic* growth over time. Handy for launch posts and talks.
## Install (Windows)