sagar-develop · sagar-develop · Jun 8, 2026 · Jun 8, 2026
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
@@ -0,0 +1,206 @@
+# Architecture
+
+NativeLM is an on-device document-chat app built on **litertlm-kmp**, a Kotlin
+Multiplatform engine that wraps Google's LiteRT-LM. Everything — the language model,
+the embedder, the vector index, OCR, speech-to-text — runs locally. No account, no
+upload, no telemetry. This document explains how the pieces fit together and how the
+codebase is organised so the boundary between the reusable **engine** and the
+**product** stays clean as it grows.
+
+Two Gradle modules:
+
+- **`:lib`** — the engine (`com.sagar.aicore`). Dual-licensed (AGPL-3.0 / commercial).
+  Pure Kotlin Multiplatform: `commonMain` holds platform-neutral contracts and
+  orchestration; `androidMain` holds the Android-backed inference implementations;
+  `iosMain` carries the iOS roadmap surface.
+- **`:sample-app`** — the NativeLM product (`com.nativelm.app`). Android + Compose. It
+  supplies the platform-backed stores (ObjectBox, DataStore, SAF, ML Kit OCR) and the
+  user-facing experience, and depends on `:lib` — never the other way around.
+
+```mermaid
+flowchart TB
+    subgraph product["sample-app · NativeLM (com.nativelm.app)"]
+        ui["Compose UI<br/>chat · documents · models · settings · studio · sync · lock"]
+        vm["NativeLmViewModel"]
+        holders["EngineHolder · RagHolder<br/>NativeLmModelCatalog · EmbedderRecommendation"]
+        platform["Android platform glue<br/>ObjectBoxDocumentRepository (HNSW)<br/>AndroidTextExtractor + MlKitOcrEngine<br/>AppPreferences (DataStore) · SecureStore"]
+    end
+
+    subgraph engine[":lib · litertlm-kmp engine (com.sagar.aicore)"]
+        contracts["Contracts (commonMain)<br/>LocalAiEngine · EmbeddingEngine · Reranker<br/>DocumentIngestor · DocumentRetriever · DocumentStore<br/>ModelCatalog · ModelManager"]
+        impls["Android impls (androidMain)<br/>LiteRtLmLocalAiEngine (Gemma)<br/>OnnxEmbeddingEngine · OnnxReranker<br/>GemmaBpeTokenizer · BertWordPieceTokenizer"]
+    end
+
+    ui --> vm --> holders --> contracts
+    holders --> platform
+    platform -. implements .-> contracts
+    contracts --- impls
+
+    classDef p fill:#eef6ee,stroke:#7FA980,color:#1C1B1A;
+    classDef e fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A;
+    class ui,vm,holders,platform p;
+    class contracts,impls e;
+```
+
+The key architectural rule: **the product talks to the engine only through contracts**
+(`LocalAiEngine`, `EmbeddingEngine`, `DocumentRetriever`, `DocumentStore`, …). The
+product *provides* the storage implementations (e.g. `ObjectBoxDocumentRepository`
+implements the engine's `DocumentStore`) but never reaches into engine internals. That
+inversion is what lets the same engine power a second app (a kids' learning app, Curio)
+through a Gradle composite build.
+
+---
+
+## Engine internals (`:lib`)
+
+The engine is organised around small, swappable contracts in `commonMain`, each with an
+Android implementation in `androidMain`. Inference backends are deliberately
+**telemetry-free**: the LLM runs on LiteRT-LM (CPU), and the embedder/reranker run on
+**ONNX Runtime** (Microsoft, no Google/Play dependency) rather than MediaPipe — a
+conscious choice to protect the zero-telemetry promise.
+
+```mermaid
+flowchart LR
+    subgraph common["commonMain — contracts & orchestration"]
+        lae["LocalAiEngine<br/>(chat, stateful KV session)"]
+        ee["EmbeddingEngine<br/>(task-aware: QUERY / DOCUMENT)"]
+        rr["Reranker<br/>(cross-encoder, optional)"]
+        ing["DocumentIngestor"]
+        ret["DocumentRetriever"]
+        store["DocumentStore"]
+        cat["ModelCatalog · ModelManager<br/>ModelDescriptor · CompanionFile"]
+        rag["RAG support<br/>TextChunker · KeywordSearch (BM25+RRF)<br/>RagConfig · RagContextFormatter"]
+    end
+
+    subgraph android["androidMain — inference backends"]
+        litert["LiteRtLmLocalAiEngine<br/>Gemma via LiteRT-LM (CPU)"]
+        onnxE["OnnxEmbeddingEngine<br/>EmbeddingGemma-300M (ONNX)"]
+        useE["MediaPipeEmbeddingEngine<br/>USE-Lite 100-dim (entry tier)"]
+        onnxR["OnnxReranker<br/>ms-marco MiniLM-L6 (ONNX)"]
+        tok["GemmaBpeTokenizer · BertWordPieceTokenizer<br/>(pure-Kotlin, validated vs HF)"]
+    end
+
+    lae -. impl .-> litert
+    ee -. impl .-> onnxE
+    ee -. impl .-> useE
+    rr -. impl .-> onnxR
+    onnxE --> tok
+    onnxR --> tok
+    ing --> ee
+    ing --> store
+    ret --> ee
+    ret --> rr
+    ret --> store
+    ret --> rag
+
+    classDef c fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A;
+    classDef a fill:#eef2f6,stroke:#6a86a8,color:#1C1B1A;
+    class lae,ee,rr,ing,ret,store,cat,rag c;
+    class litert,onnxE,useE,onnxR,tok a;
+```
+
+Beyond core inference, the engine also hosts: **Studio** (`studio/` — generating
+artifacts like mind maps, timelines, podcasts from documents), **Sync** (`sync/` — P2P
+device-to-device transfer over NSD/mDNS + TCP, GMS-free), **Backup** (`backup/` —
+passphrase-encrypted `.nlmbak` export, Argon2id + AES-256-GCM), and **Chart**
+(`chart/`). Speech-to-text (`SpeechToText`) is wired to on-device Whisper in the app.
+
+---
+
+## The RAG pipeline
+
+This is the heart of the product: grounding answers in the user's own documents with
+citations. There are two phases — **ingestion** (when a document is imported) and
+**retrieval** (when a question is asked).
+
+```mermaid
+flowchart TB
+    subgraph ingest["Ingestion — on import"]
+        i1["PDF / image / text"]
+        i2["AndroidTextExtractor<br/>(+ MlKitOcrEngine for scans)"]
+        i3["TextChunker<br/>(≈500 chars, 50 overlap)"]
+        i4["EmbeddingEngine.embed(text, DOCUMENT)<br/>EmbeddingGemma → Matryoshka dim"]
+        i5["ObjectBox HNSW<br/>(per-dim entity: 100/128/256/512)"]
+        i1 --> i2 --> i3 --> i4 --> i5
+    end
+
+    subgraph retrieve["Retrieval — on each question"]
+        q0["User question"]
+        q1["EmbeddingEngine.embed(query, QUERY)"]
+        qV["Vector arm<br/>HNSW k-NN, distance-gated"]
+        qK["Keyword arm<br/>BM25 over term-matching chunks"]
+        gate["Document relevance gate<br/>dominance (best doc + ties)<br/>+ title-match override"]
+        fuse["Reciprocal Rank Fusion<br/>+ per-document cap"]
+        rerankStep["Reranker (≥8 GB tiers)<br/>cross-encoder re-score top pool"]
+        topk["Top-k chunks → grounding block<br/>(RagContextFormatter, size-capped)"]
+        llm["LocalAiEngine<br/>(stateful KV; grounding re-flushed per turn)"]
+        ans["Answer + citations"]
+
+        q0 --> q1 --> qV
+        q0 --> qK
+        qV --> gate
+        qK --> gate
+        gate --> fuse --> rerankStep --> topk --> llm --> ans
+    end
+
+    i5 -. queried by .-> qV
+    i5 -. queried by .-> qK
+
+    classDef ing fill:#eef6ee,stroke:#7FA980,color:#1C1B1A;
+    classDef ret fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A;
+    class i1,i2,i3,i4,i5 ing;
+    class q0,q1,qV,qK,gate,fuse,rerankStep,topk,llm,ans ret;
+```
+
+A few design decisions worth calling out, because they came from real failure modes
+(see [`_session/material/blog-embedding-enhancements.md`](../_session/material/blog-embedding-enhancements.md)):
+
+- **Hybrid retrieval.** The vector arm finds semantic matches; the BM25 keyword arm
+  recovers exact strings (names, IDs, codenames) that a small embedder ranks poorly. The
+  two rankings merge with Reciprocal Rank Fusion.
+- **Document relevance gate.** With several similar documents (e.g. a car, a life, and a
+  health insurance policy in one project), lexical overlap on words like
+  "insurance"/"premium" used to let an answer ground on the *wrong* document. The gate
+  keeps only the document(s) the vector arm clearly favours, and a **title-match
+  override** lets a query that names a document by its title ("car" → a *CarPolicy*
+  source) ground on that document over a higher-scoring but wrong one.
+- **Stateful KV, flushed grounding.** The chat session keeps a warm KV cache for flat
+  time-to-first-token. But injecting a fresh grounding block every turn would accumulate
+  in that cache and eventually overflow the on-device context window — so grounded turns
+  re-prefill only the bounded visible transcript, flushing stale grounding.
+
+---
+
+## Device-tiered model selection
+
+On-device inference must fit the phone. `EmbedderRecommendation.forDevice(ramMb)` mirrors
+the LLM tiering and picks the embedder, the Matryoshka dimension, and whether to run the
+reranker — keyed on effective RAM (after the OEM RAM-expansion cap). One downloaded
+EmbeddingGemma model is truncated per tier; entry devices stay on the no-download,
+ungated USE-Lite.
+
+```mermaid
+flowchart LR
+    ram{"effective RAM"}
+    ram -->|"≥ 10 GB"| t4["EmbeddingGemma @512<br/>+ reranker"]
+    ram -->|"8–10 GB"| t3["EmbeddingGemma @256<br/>+ reranker"]
+    ram -->|"6–8 GB"| t2["EmbeddingGemma @256"]
+    ram -->|"< 6 GB"| t1["USE-Lite @100<br/>(no download, ungated)"]
+
+    classDef n fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A;
+    class t1,t2,t3,t4 n;
+```
+
+The same recommendation surfaces in the Models screen as a *Recommended* badge, and the
+download flow pulls the model plus its companions (the ONNX external-data weights blob
+and the tokenizer) on-device — gated models reuse the Hugging Face token flow.
+
+---
+
+## Visualising growth
+
+This file is the intentional, reviewed view of the architecture — kept in `docs/` so it
+evolves alongside the code (transparent-dev model). For the *organic* view of how the
+codebase grew over time, the repository history can be rendered with
+[Gource](https://gource.io/) (an animated, file-by-file visualisation of the git log).
+See [`docs/gource.md`](gource.md) for the recipe used to produce the growth clip.
diff --git a/docs/gource.md b/docs/gource.md
@@ -0,0 +1,65 @@
+# Growth visualisation with Gource
+
+[Gource](https://gource.io/) renders an animated, file-by-file visualisation of a git
+repository's history — a "watch the codebase grow" clip. It's a nice companion to
+[`ARCHITECTURE.md`](ARCHITECTURE.md): that file is the *intentional* structure, this is
+the *organic* growth over time. Handy for launch posts and talks.
+
+## Install (Windows)
+
+```powershell
+winget install Acaceia.Gource
+```
+
+(ffmpeg is also required for video output — already present in this environment. On a
+clean machine: `winget install Gyan.FFmpeg`.)
+
+Gource needs an OpenGL context, so run it on a desktop session (not a headless shell).
+
+## Produce the clip (NativeLM-branded)
+
+Run from the repository root. The colours match the NativeLM palette — warm-dark canvas
+`#1C1B1A`, off-white text `#FAF9F6`, sage-green directories `#7FA980`.
+
+```powershell
+gource . `
+  --title "NativeLM — on-device document chat" `
+  --seconds-per-day 0.5 `
+  --auto-skip-seconds 1 `
+  --max-file-lag 0.1 `
+  --hide mouse,filenames,progress `
+  --highlight-users `
+  --background-colour 1C1B1A `
+  --font-colour FAF9F6 `
+  --dir-colour 7FA980 `
+  --highlight-colour 7FA980 `
+  --key `
+  --1280x720 `
+  --output-framerate 30 `
+  --output-ppm-stream - `
+  | ffmpeg -y -r 30 -f image2pipe -vcodec ppm -i - `
+      -vcodec libx264 -preset slow -pix_fmt yuv420p -crf 20 `
+      _session/material/nativelm-growth.mp4
+```
+
+A short, fast-paced clip (low `--seconds-per-day`) reads best on LinkedIn / X. For a
+longer narrated walkthrough, raise `--seconds-per-day` to ~3–5.
+
+## Focus on the source (optional)
+
+To exclude generated/vendor noise (build outputs, ObjectBox-generated files, session
+material) and visualise only hand-written source, drive Gource from a filtered log:
+
+```powershell
+git log --pretty=format:user:%aN%n%ct --reverse --raw --encoding=UTF-8 `
+  --no-renames -- lib/src sample-app/src docs `
+  > gource.log
+gource gource.log --title "NativeLM" ...  # same flags as above
+```
+
+## Notes
+
+- The output MP4 goes to `_session/material/` (content/marketing material), which is
+  git-ignored — the clip is an artifact, not part of the repo.
+- To put faces on contributors, drop avatar PNGs (named per git author) in a folder and
+  add `--user-image-dir <folder>`.