diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
new file mode 100644
index 0000000..5a1367c
--- /dev/null
+++ b/docs/ARCHITECTURE.md
@@ -0,0 +1,206 @@
+# Architecture
+
+NativeLM is an on-device document-chat app built on **litertlm-kmp**, a Kotlin
+Multiplatform engine that wraps Google's LiteRT-LM. Everything — the language model,
+the embedder, the vector index, OCR, speech-to-text — runs locally. No account, no
+upload, no telemetry. This document explains how the pieces fit together and how the
+codebase is organised so the boundary between the reusable **engine** and the
+**product** stays clean as it grows.
+
+Two Gradle modules:
+
+- **`:lib`** — the engine (`com.sagar.aicore`). Dual-licensed (AGPL-3.0 / commercial).
+ Pure Kotlin Multiplatform: `commonMain` holds platform-neutral contracts and
+ orchestration; `androidMain` holds the Android-backed inference implementations;
+ `iosMain` carries the iOS roadmap surface.
+- **`:sample-app`** — the NativeLM product (`com.nativelm.app`). Android + Compose. It
+ supplies the platform-backed stores (ObjectBox, DataStore, SAF, ML Kit OCR) and the
+ user-facing experience, and depends on `:lib` — never the other way around.
+
+```mermaid
+flowchart TB
+ subgraph product["sample-app · NativeLM (com.nativelm.app)"]
+ ui["Compose UI
chat · documents · models · settings · studio · sync · lock"]
+ vm["NativeLmViewModel"]
+ holders["EngineHolder · RagHolder
NativeLmModelCatalog · EmbedderRecommendation"]
+ platform["Android platform glue
ObjectBoxDocumentRepository (HNSW)
AndroidTextExtractor + MlKitOcrEngine
AppPreferences (DataStore) · SecureStore"]
+ end
+
+ subgraph engine[":lib · litertlm-kmp engine (com.sagar.aicore)"]
+ contracts["Contracts (commonMain)
LocalAiEngine · EmbeddingEngine · Reranker
DocumentIngestor · DocumentRetriever · DocumentStore
ModelCatalog · ModelManager"]
+ impls["Android impls (androidMain)
LiteRtLmLocalAiEngine (Gemma)
OnnxEmbeddingEngine · OnnxReranker
GemmaBpeTokenizer · BertWordPieceTokenizer"]
+ end
+
+ ui --> vm --> holders --> contracts
+ holders --> platform
+ platform -. implements .-> contracts
+ contracts --- impls
+
+ classDef p fill:#eef6ee,stroke:#7FA980,color:#1C1B1A;
+ classDef e fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A;
+ class ui,vm,holders,platform p;
+ class contracts,impls e;
+```
+
+The key architectural rule: **the product talks to the engine only through contracts**
+(`LocalAiEngine`, `EmbeddingEngine`, `DocumentRetriever`, `DocumentStore`, …). The
+product *provides* the storage implementations (e.g. `ObjectBoxDocumentRepository`
+implements the engine's `DocumentStore`) but never reaches into engine internals. That
+inversion is what lets the same engine power a second app (a kids' learning app, Curio)
+through a Gradle composite build.
+
+---
+
+## Engine internals (`:lib`)
+
+The engine is organised around small, swappable contracts in `commonMain`, each with an
+Android implementation in `androidMain`. Inference backends are deliberately
+**telemetry-free**: the LLM runs on LiteRT-LM (CPU), and the embedder/reranker run on
+**ONNX Runtime** (Microsoft, no Google/Play dependency) rather than MediaPipe — a
+conscious choice to protect the zero-telemetry promise.
+
+```mermaid
+flowchart LR
+ subgraph common["commonMain — contracts & orchestration"]
+ lae["LocalAiEngine
(chat, stateful KV session)"]
+ ee["EmbeddingEngine
(task-aware: QUERY / DOCUMENT)"]
+ rr["Reranker
(cross-encoder, optional)"]
+ ing["DocumentIngestor"]
+ ret["DocumentRetriever"]
+ store["DocumentStore"]
+ cat["ModelCatalog · ModelManager
ModelDescriptor · CompanionFile"]
+ rag["RAG support
TextChunker · KeywordSearch (BM25+RRF)
RagConfig · RagContextFormatter"]
+ end
+
+ subgraph android["androidMain — inference backends"]
+ litert["LiteRtLmLocalAiEngine
Gemma via LiteRT-LM (CPU)"]
+ onnxE["OnnxEmbeddingEngine
EmbeddingGemma-300M (ONNX)"]
+ useE["MediaPipeEmbeddingEngine
USE-Lite 100-dim (entry tier)"]
+ onnxR["OnnxReranker
ms-marco MiniLM-L6 (ONNX)"]
+ tok["GemmaBpeTokenizer · BertWordPieceTokenizer
(pure-Kotlin, validated vs HF)"]
+ end
+
+ lae -. impl .-> litert
+ ee -. impl .-> onnxE
+ ee -. impl .-> useE
+ rr -. impl .-> onnxR
+ onnxE --> tok
+ onnxR --> tok
+ ing --> ee
+ ing --> store
+ ret --> ee
+ ret --> rr
+ ret --> store
+ ret --> rag
+
+ classDef c fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A;
+ classDef a fill:#eef2f6,stroke:#6a86a8,color:#1C1B1A;
+ class lae,ee,rr,ing,ret,store,cat,rag c;
+ class litert,onnxE,useE,onnxR,tok a;
+```
+
+Beyond core inference, the engine also hosts: **Studio** (`studio/` — generating
+artifacts like mind maps, timelines, podcasts from documents), **Sync** (`sync/` — P2P
+device-to-device transfer over NSD/mDNS + TCP, GMS-free), **Backup** (`backup/` —
+passphrase-encrypted `.nlmbak` export, Argon2id + AES-256-GCM), and **Chart**
+(`chart/`). Speech-to-text (`SpeechToText`) is wired to on-device Whisper in the app.
+
+---
+
+## The RAG pipeline
+
+This is the heart of the product: grounding answers in the user's own documents with
+citations. There are two phases — **ingestion** (when a document is imported) and
+**retrieval** (when a question is asked).
+
+```mermaid
+flowchart TB
+ subgraph ingest["Ingestion — on import"]
+ i1["PDF / image / text"]
+ i2["AndroidTextExtractor
(+ MlKitOcrEngine for scans)"]
+ i3["TextChunker
(≈500 chars, 50 overlap)"]
+ i4["EmbeddingEngine.embed(text, DOCUMENT)
EmbeddingGemma → Matryoshka dim"]
+ i5["ObjectBox HNSW
(per-dim entity: 100/128/256/512)"]
+ i1 --> i2 --> i3 --> i4 --> i5
+ end
+
+ subgraph retrieve["Retrieval — on each question"]
+ q0["User question"]
+ q1["EmbeddingEngine.embed(query, QUERY)"]
+ qV["Vector arm
HNSW k-NN, distance-gated"]
+ qK["Keyword arm
BM25 over term-matching chunks"]
+ gate["Document relevance gate
dominance (best doc + ties)
+ title-match override"]
+ fuse["Reciprocal Rank Fusion
+ per-document cap"]
+ rerankStep["Reranker (≥8 GB tiers)
cross-encoder re-score top pool"]
+ topk["Top-k chunks → grounding block
(RagContextFormatter, size-capped)"]
+ llm["LocalAiEngine
(stateful KV; grounding re-flushed per turn)"]
+ ans["Answer + citations"]
+
+ q0 --> q1 --> qV
+ q0 --> qK
+ qV --> gate
+ qK --> gate
+ gate --> fuse --> rerankStep --> topk --> llm --> ans
+ end
+
+ i5 -. queried by .-> qV
+ i5 -. queried by .-> qK
+
+ classDef ing fill:#eef6ee,stroke:#7FA980,color:#1C1B1A;
+ classDef ret fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A;
+ class i1,i2,i3,i4,i5 ing;
+ class q0,q1,qV,qK,gate,fuse,rerankStep,topk,llm,ans ret;
+```
+
+A few design decisions worth calling out, because they came from real failure modes
+(see [`_session/material/blog-embedding-enhancements.md`](../_session/material/blog-embedding-enhancements.md)):
+
+- **Hybrid retrieval.** The vector arm finds semantic matches; the BM25 keyword arm
+ recovers exact strings (names, IDs, codenames) that a small embedder ranks poorly. The
+ two rankings merge with Reciprocal Rank Fusion.
+- **Document relevance gate.** With several similar documents (e.g. a car, a life, and a
+ health insurance policy in one project), lexical overlap on words like
+ "insurance"/"premium" used to let an answer ground on the *wrong* document. The gate
+ keeps only the document(s) the vector arm clearly favours, and a **title-match
+ override** lets a query that names a document by its title ("car" → a *CarPolicy*
+ source) ground on that document over a higher-scoring but wrong one.
+- **Stateful KV, flushed grounding.** The chat session keeps a warm KV cache for flat
+ time-to-first-token. But injecting a fresh grounding block every turn would accumulate
+ in that cache and eventually overflow the on-device context window — so grounded turns
+ re-prefill only the bounded visible transcript, flushing stale grounding.
+
+---
+
+## Device-tiered model selection
+
+On-device inference must fit the phone. `EmbedderRecommendation.forDevice(ramMb)` mirrors
+the LLM tiering and picks the embedder, the Matryoshka dimension, and whether to run the
+reranker — keyed on effective RAM (after the OEM RAM-expansion cap). One downloaded
+EmbeddingGemma model is truncated per tier; entry devices stay on the no-download,
+ungated USE-Lite.
+
+```mermaid
+flowchart LR
+ ram{"effective RAM"}
+ ram -->|"≥ 10 GB"| t4["EmbeddingGemma @512
+ reranker"]
+ ram -->|"8–10 GB"| t3["EmbeddingGemma @256
+ reranker"]
+ ram -->|"6–8 GB"| t2["EmbeddingGemma @256"]
+ ram -->|"< 6 GB"| t1["USE-Lite @100
(no download, ungated)"]
+
+ classDef n fill:#f5f3ef,stroke:#9a8f7a,color:#1C1B1A;
+ class t1,t2,t3,t4 n;
+```
+
+The same recommendation surfaces in the Models screen as a *Recommended* badge, and the
+download flow pulls the model plus its companions (the ONNX external-data weights blob
+and the tokenizer) on-device — gated models reuse the Hugging Face token flow.
+
+---
+
+## Visualising growth
+
+This file is the intentional, reviewed view of the architecture — kept in `docs/` so it
+evolves alongside the code (transparent-dev model). For the *organic* view of how the
+codebase grew over time, the repository history can be rendered with
+[Gource](https://gource.io/) (an animated, file-by-file visualisation of the git log).
+See [`docs/gource.md`](gource.md) for the recipe used to produce the growth clip.
diff --git a/docs/gource.md b/docs/gource.md
new file mode 100644
index 0000000..0bcbf28
--- /dev/null
+++ b/docs/gource.md
@@ -0,0 +1,65 @@
+# Growth visualisation with Gource
+
+[Gource](https://gource.io/) renders an animated, file-by-file visualisation of a git
+repository's history — a "watch the codebase grow" clip. It's a nice companion to
+[`ARCHITECTURE.md`](ARCHITECTURE.md): that file is the *intentional* structure, this is
+the *organic* growth over time. Handy for launch posts and talks.
+
+## Install (Windows)
+
+```powershell
+winget install Acaceia.Gource
+```
+
+(ffmpeg is also required for video output — already present in this environment. On a
+clean machine: `winget install Gyan.FFmpeg`.)
+
+Gource needs an OpenGL context, so run it on a desktop session (not a headless shell).
+
+## Produce the clip (NativeLM-branded)
+
+Run from the repository root. The colours match the NativeLM palette — warm-dark canvas
+`#1C1B1A`, off-white text `#FAF9F6`, sage-green directories `#7FA980`.
+
+```powershell
+gource . `
+ --title "NativeLM — on-device document chat" `
+ --seconds-per-day 0.5 `
+ --auto-skip-seconds 1 `
+ --max-file-lag 0.1 `
+ --hide mouse,filenames,progress `
+ --highlight-users `
+ --background-colour 1C1B1A `
+ --font-colour FAF9F6 `
+ --dir-colour 7FA980 `
+ --highlight-colour 7FA980 `
+ --key `
+ --1280x720 `
+ --output-framerate 30 `
+ --output-ppm-stream - `
+ | ffmpeg -y -r 30 -f image2pipe -vcodec ppm -i - `
+ -vcodec libx264 -preset slow -pix_fmt yuv420p -crf 20 `
+ _session/material/nativelm-growth.mp4
+```
+
+A short, fast-paced clip (low `--seconds-per-day`) reads best on LinkedIn / X. For a
+longer narrated walkthrough, raise `--seconds-per-day` to ~3–5.
+
+## Focus on the source (optional)
+
+To exclude generated/vendor noise (build outputs, ObjectBox-generated files, session
+material) and visualise only hand-written source, drive Gource from a filtered log:
+
+```powershell
+git log --pretty=format:user:%aN%n%ct --reverse --raw --encoding=UTF-8 `
+ --no-renames -- lib/src sample-app/src docs `
+ > gource.log
+gource gource.log --title "NativeLM" ... # same flags as above
+```
+
+## Notes
+
+- The output MP4 goes to `_session/material/` (content/marketing material), which is
+ git-ignored — the clip is an artifact, not part of the repo.
+- To put faces on contributors, drop avatar PNGs (named per git author) in a folder and
+ add `--user-image-dir `.