sagar-develop · Jun 8, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,9 +9,61 @@ and the project loosely follows [Semantic Versioning](https://semver.org/).
 > `com.sagar:litertlm-kmp`) and the **showcase app** (`sample-app/`, NativeLM)
 > live in one repo and share a single version line. Entries below note which
 > surface a change lands on. The engine library version in
-> `lib/build.gradle.kts` tracks the latest release (`0.9.0`).
+> `lib/build.gradle.kts` tracks the latest release (`0.10.0`).
 
-## [Unreleased]
+## [0.10.0] — 2026-06-09
+
+### Added
+- **EmbeddingGemma RAG embedder (device-tiered, telemetry-free)** — optional upgrade
+  from USE-Lite (100-dim) to **EmbeddingGemma-300M** for document retrieval, run on
+  **ONNX Runtime** (no Google/Play telemetry deps). One downloaded model serves every
+  tier via **Matryoshka** truncation; a **recommendation engine** picks the embedder +
+  dim by device RAM (USE-Lite <6 GB · Gemma@256 6–9 GB · Gemma@512 + reranker ≥10 GB),
+  surfaced as a "Recommended" badge in the in-app model catalogue. The model and its
+  companion files (weights blob + tokenizer) download on-device through the catalogue;
+  nothing is bundled in the APK. (engine + app)
+- **Cross-encoder reranker (flagship)** — optional second-stage `ms-marco-MiniLM-L6`
+  rerank over the top fused candidates, gated to high-RAM devices. (engine + app)
+- **Pure-Kotlin tokenizers** — BPE (EmbeddingGemma) and BERT WordPiece (reranker),
+  reading the HuggingFace `tokenizer.json`, validated bit-for-bit against the reference
+  `transformers` tokenizer. No native tokenizer lib, KMP-portable. (engine)
+- **Task-aware embeddings** — `EmbeddingEngine` now distinguishes query vs document
+  (instruction prefixes), required for EmbeddingGemma's asymmetric retrieval. (engine)
+
+### Changed
+- **Hybrid retrieval** — document retrieval now fuses dense vector search with **BM25**
+  lexical scoring via **Reciprocal Rank Fusion (RRF)**, plus a **per-document cap** so one
+  large source can't fill every top-k slot, wider candidate pools, and a larger grounding
+  budget. Ships independent of the embedder upgrade. (engine)
+- **Backup/sync** — backups now carry chunk embeddings from every embedder dim and tag
+  each chunk with its dim; cross-embedder restores re-index from the included text
+  instead of being rejected (backup schema v2). (app)
+
+### Fixed
+- **Wrong-document grounding** — a **document-level dominance gate** stops BM25 lexical
+  pollution from grounding an answer on the wrong source. A real failure: a "car
+  insurance premium" question answered with the **life policy's** figure (₹41,799) instead
+  of the actual car premium (₹8,504) because the life PDF's wording out-scored the car PDF
+  on shared tokens. The gate keeps grounding on the document that genuinely dominates the
+  candidate set. (engine)
+- **Title-match override** — when a distinctive query term names a document by its title,
+  retrieval grounds on that document. "Who is the insurer of my car policy" was answering
+  from a health policy (whose formal "…insurer" phrasing out-scored the car doc); it now
+  correctly resolves to the car policy (TATA AIG). (engine)
+- **Truncated grounded answers** — grounded replies collapsed to 1–2 tokens after a few
+  turns because the stateful LiteRT-LM KV cache accumulated each turn's grounding block. A
+  **per-grounded-turn session reset** re-prefills only bounded visible history
+  (`MAX_PREFILL_TURNS=16`), keeping answers full-length. (app)
+- **Stale embeddings** — a **document-level self-healing migration** re-indexes a source
+  into the active embedder's index on next open when the embedder/dim changes, with no
+  re-import or OCR. (app)
+- **Health-insurer recall miss** — enabling the cross-encoder reranker (ungated) on the
+  8 GB device tier recovers a relevant chunk the first-stage fusion ranked too low. (app)
+
+### Migration
+- Existing projects re-index from stored chunk text into the active embedder's index on
+  next open when the embedder changes — no re-import/OCR needed; the USE-Lite index is
+  kept as a fallback.
 
 ## [0.9.0] — 2026-06-05
 

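The hybrid retrieval entry above (dense + BM25 fused with RRF, then a per-document cap) can be sketched as follows. This is a minimal illustration, not the engine's retriever: `Hit`, `rrfFuse`, `capPerDocument`, and the constant `k = 60` (a commonly used RRF default) are assumptions introduced here, and the real cap and pool sizes are not stated in the changelog.

```kotlin
// Hypothetical types and names for illustration; not the engine's actual API.
data class Hit(val chunkId: String, val docId: String)

/** Reciprocal Rank Fusion: score(c) = sum over rankings of 1 / (k + rank), with rank starting at 1. */
fun rrfFuse(rankings: List<List<Hit>>, k: Int = 60): List<Pair<Hit, Double>> {
    val scores = LinkedHashMap<Hit, Double>()
    for (ranking in rankings) {
        ranking.forEachIndexed { index, hit ->
            scores[hit] = (scores[hit] ?: 0.0) + 1.0 / (k + index + 1)
        }
    }
    return scores.entries.map { it.key to it.value }.sortedByDescending { it.second }
}

/** Walk the fused list in order, keeping at most perDocCap chunks per document, up to topK. */
fun capPerDocument(fused: List<Pair<Hit, Double>>, perDocCap: Int, topK: Int): List<Hit> {
    val taken = HashMap<String, Int>()
    val out = ArrayList<Hit>()
    for ((hit, _) in fused) {
        if (out.size >= topK) break
        val n = taken[hit.docId] ?: 0
        if (n >= perDocCap) continue // this document already filled its share of slots
        taken[hit.docId] = n + 1
        out.add(hit)
    }
    return out
}

fun main() {
    val dense = listOf(Hit("life-1", "life"), Hit("life-2", "life"), Hit("car-1", "car"))
    val bm25 = listOf(Hit("life-2", "life"), Hit("life-3", "life"), Hit("car-1", "car"))
    val fused = rrfFuse(listOf(dense, bm25))
    // The cap of 2 drops "life-3", so one large source cannot take every slot.
    println(capPerDocument(fused, perDocCap = 2, topK = 4).map { it.chunkId }) // [life-2, car-1, life-1]
}
```

RRF only uses ranks, so the dense and BM25 scores never need to be put on a common scale; the cap is applied after fusion so it acts on the final ordering.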
diff --git a/docs/EMBEDDING_GEMMA_PLAN.md b/docs/EMBEDDING_GEMMA_PLAN.md
@@ -1,5 +1,20 @@
 # NativeLM — EmbeddingGemma on-device RAG embedder (implementation plan)
 
+> **Status (2026-06-08): IMPLEMENTED** on branch `feat/embedding-gemma` (issue #30).
+> Deviations from the plan below, decided during build:
+> - **Full 3-tier matrix + reranker** built now (not deferred): USE-Lite@100 (entry) /
+>   EmbeddingGemma@256 (mid) / @512 + `ms-marco-MiniLM-L6` cross-encoder reranker
+>   (flagship), chosen by a device-RAM **recommendation engine**. One Matryoshka model
+>   serves all Gemma tiers; per-dim ObjectBox HNSW entities (128/256/512) added.
+> - **Tokenizer: pure-Kotlin**, not onnxruntime-extensions (its `gen_processing_models`
+>   doesn't support GemmaTokenizer) nor a Rust/DJL native lib. BPE (embedder) + BERT
+>   WordPiece (reranker), both validated bit-for-bit vs HF `transformers`. No extra .so.
+> - **Companion download** through the catalogue (graph + `model.onnx_data` + tokenizer);
+>   the external-data blob keeps its original name so ORT resolves it.
+> - Quick-win **per-document cap** shipped in the retriever (the wrong-PDF fix).
+> See `CHANGELOG.md` [0.10.0] and `_session/material/blog-embeddinggemma-rag.md`.
+
+
 _Branch: `claude/analysis-KV4t2`. Goal: replace the 2018-era Universal Sentence
 Encoder (USE-Lite, 100-dim) with **EmbeddingGemma 300M** as the default RAG
 embedder, lifting retrieval quality for both chat answers and every Studio

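The tier matrix in the status note (USE-Lite@100 / EmbeddingGemma@256 / @512 + reranker, chosen by device RAM, with one Matryoshka model serving all Gemma tiers) could look roughly like the sketch below. The names `Embedder`, `Recommendation`, `recommend`, and `truncateMatryoshka` are hypothetical. The RAM cut-offs follow the changelog's "<6 GB · 6–9 GB · ≥10 GB" wording, read here as thresholds at 6 and 10 GB, and re-normalising after truncation is the usual Matryoshka practice rather than something the diff confirms.

```kotlin
import kotlin.math.sqrt

// Hypothetical names; the actual recommendation engine is not shown in this diff.
enum class Embedder { USE_LITE, EMBEDDING_GEMMA }

data class Recommendation(val embedder: Embedder, val dim: Int, val reranker: Boolean)

/** Pick embedder, dimension, and reranker by device RAM (thresholds assumed at 6 GB and 10 GB). */
fun recommend(deviceRamGb: Double): Recommendation = when {
    deviceRamGb < 6.0 -> Recommendation(Embedder.USE_LITE, 100, reranker = false)
    deviceRamGb < 10.0 -> Recommendation(Embedder.EMBEDDING_GEMMA, 256, reranker = false)
    else -> Recommendation(Embedder.EMBEDDING_GEMMA, 512, reranker = true)
}

/** Matryoshka truncation: keep the first `dim` components, then L2-normalise the result. */
fun truncateMatryoshka(full: FloatArray, dim: Int): FloatArray {
    require(dim in 1..full.size) { "dim must be in 1..${full.size}" }
    val head = full.copyOf(dim)
    var sumSq = 0.0
    for (v in head) sumSq += v.toDouble() * v
    val norm = sqrt(sumSq)
    if (norm == 0.0) return head // all-zero vector: nothing to normalise
    for (i in head.indices) head[i] = (head[i] / norm).toFloat()
    return head
}

fun main() {
    println(recommend(8.0))
    println(truncateMatryoshka(floatArrayOf(3f, 4f, 12f), 2).toList())
}
```

Because truncation is a prefix slice, the same downloaded model output can feed every Gemma tier; only the stored vector length, and therefore the per-dim index, differs.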
diff --git a/gradle/libs.versions.toml b/gradle/libs.versions.toml
@@ -13,6 +13,11 @@ napier = "2.7.1"
 kotlin-inject = "0.9.0"
 mediapipe = "0.10.35"
 litertlm = "0.11.0"
+# ONNX Runtime for the EmbeddingGemma RAG embedder (Microsoft, no Google/Play
+# telemetry deps). If onnxruntime-extensions is ever used, its version MUST match the
+# one used to generate tokenizer.onnx (gen_processing_models) for custom-op compatibility.
+onnxruntime = "1.26.0"
+onnxruntime-extensions = "0.15.0"
 mlkit-text-recognition = "16.0.1"
 androidx-core-ktx = "1.15.0"
 # sample-app only
@@ -43,6 +48,8 @@ kotlin-inject-runtime = { module = "me.tatarka.inject:kotlin-inject-runtime", ve
 kotlin-inject-compiler = { module = "me.tatarka.inject:kotlin-inject-compiler-ksp", version.ref = "kotlin-inject" }
 mediapipe-tasks-text = { module = "com.google.mediapipe:tasks-text", version.ref = "mediapipe" }
 litertlm-android = { module = "com.google.ai.edge.litertlm:litertlm-android", version.ref = "litertlm" }
+onnxruntime-android = { module = "com.microsoft.onnxruntime:onnxruntime-android", version.ref = "onnxruntime" }
+onnxruntime-extensions-android = { module = "com.microsoft.onnxruntime:onnxruntime-extensions-android", version.ref = "onnxruntime-extensions" }
 androidx-core-ktx = { group = "androidx.core", name = "core-ktx", version.ref = "androidx-core-ktx" }
 kotlinx-coroutines-android = { module = "org.jetbrains.kotlinx:kotlinx-coroutines-android", version.ref = "kotlinx-coroutines" }
 # sample-app only

diff --git a/lib/build.gradle.kts b/lib/build.gradle.kts
@@ -8,7 +8,7 @@ plugins {
 }
 
 group = "com.sagar"
-version = "0.9.0"
+version = "0.10.0"
 
 // Keep the published artifact id stable as "litertlm-kmp" even though the
 // Gradle module is now ":lib" (sample-app is a sibling subproject). Without
@@ -53,6 +53,10 @@ kotlin {
             implementation(libs.ktor.client.okhttp)
             // Argon2id (native, JNI) for passphrase-derived backup encryption keys.
             implementation(libs.signal.argon2)
+            // EmbeddingGemma RAG embedder: ONNX Runtime for inference (Microsoft,
+            // telemetry-free — no Google/Play deps). Tokenization is pure-Kotlin
+            // (GemmaBpeTokenizer), so no onnxruntime-extensions native lib is needed.
+            implementation(libs.onnxruntime.android)
         }
         iosMain.dependencies {
             implementation(libs.ktor.client.darwin)

diff --git a/lib/src/androidMain/kotlin/com/sagar/aicore/BertWordPieceTokenizer.kt b/lib/src/androidMain/kotlin/com/sagar/aicore/BertWordPieceTokenizer.kt
@@ -0,0 +1,176 @@
+/*
+ * Copyright (C) 2026 Sagar Gupta
+ * SPDX-License-Identifier: AGPL-3.0-or-later
+ */
+package com.sagar.aicore
+
+import android.util.JsonReader
+import java.io.File
+import java.io.InputStreamReader
+import java.text.Normalizer
+
+/**
+ * Pure-Kotlin BERT WordPiece tokenizer matching the ms-marco MiniLM cross-encoder's
+ * HuggingFace `tokenizer.json` (uncased: clean → handle-CJK → lowercase → strip
+ * accents → split whitespace/punctuation → greedy `##` WordPiece). Verified
+ * bit-for-bit against the reference `transformers` tokenizer (ASCII, punctuation,
+ * accents, CJK, emails).
+ *
+ * Produces cross-encoder pair input: `[CLS] query [SEP] passage [SEP]` with
+ * token_type_ids 0 for the first segment and 1 for the second.
+ */
+class BertWordPieceTokenizer private constructor(
+    private val vocab: HashMap<String, Int>,
+    private val clsId: Int,
+    private val sepId: Int,
+    private val unkId: Int,
+) {
+
+    class PairEncoding(val ids: LongArray, val typeIds: LongArray, val mask: LongArray)
+
+    fun encodePair(query: String, passage: String): PairEncoding {
+        val q = wordPieces(basicTokenize(query))
+        val p = wordPieces(basicTokenize(passage))
+        val ids = ArrayList<Long>(q.size + p.size + 3)
+        val types = ArrayList<Long>(q.size + p.size + 3)
+        ids.add(clsId.toLong()); types.add(0)
+        for (t in q) { ids.add((vocab[t] ?: unkId).toLong()); types.add(0) }
+        ids.add(sepId.toLong()); types.add(0)
+        for (t in p) { ids.add((vocab[t] ?: unkId).toLong()); types.add(1) }
+        ids.add(sepId.toLong()); types.add(1)
+        val idArr = LongArray(ids.size) { ids[it] }
+        val typeArr = LongArray(types.size) { types[it] }
+        return PairEncoding(idArr, typeArr, LongArray(idArr.size) { 1L })
+    }
+
+    /** BERT basic tokenizer: clean, CJK-pad, lowercase, strip accents, split ws + punct. */
+    private fun basicTokenize(text: String): List<String> {
+        val sb = StringBuilder(text.length + 16)
+        var i = 0
+        while (i < text.length) {
+            val cp = text.codePointAt(i)
+            i += Character.charCount(cp)
+            if (cp == 0 || cp == 0xFFFD || isControl(cp)) continue
+            when {
+                isWhitespace(cp) -> sb.append(' ')
+                isCjk(cp) -> { sb.append(' '); sb.appendCodePoint(cp); sb.append(' ') }
+                else -> sb.appendCodePoint(cp)
+            }
+        }
+        // lowercase + strip accents (NFD, drop non-spacing marks)
+        val lowered = sb.toString().lowercase()
+        val nfd = Normalizer.normalize(lowered, Normalizer.Form.NFD)
+        val stripped = StringBuilder(nfd.length)
+        var j = 0
+        while (j < nfd.length) {
+            val cp = nfd.codePointAt(j)
+            j += Character.charCount(cp)
+            if (Character.getType(cp) == Character.NON_SPACING_MARK.toInt()) continue
+            stripped.appendCodePoint(cp)
+        }
+        // split on whitespace, then isolate punctuation
+        val out = ArrayList<String>()
+        for (word in stripped.toString().split(' ')) {
+            if (word.isEmpty()) continue
+            val cur = StringBuilder()
+            var k = 0
+            while (k < word.length) {
+                val cp = word.codePointAt(k)
+                k += Character.charCount(cp)
+                if (isPunct(cp)) {
+                    if (cur.isNotEmpty()) { out.add(cur.toString()); cur.setLength(0) }
+                    out.add(String(Character.toChars(cp)))
+                } else {
+                    cur.appendCodePoint(cp)
+                }
+            }
+            if (cur.isNotEmpty()) out.add(cur.toString())
+        }
+        return out
+    }
+
+    /** Greedy longest-match WordPiece with `##` continuation; [UNK] on failure. */
+    private fun wordPieces(tokens: List<String>): List<String> {
+        val out = ArrayList<String>()
+        for (token in tokens) {
+            if (token.length > MAX_CHARS) { out.add(UNK); continue }
+            var start = 0
+            val n = token.length
+            var bad = false
+            val pieces = ArrayList<String>()
+            while (start < n) {
+                var end = n
+                var cur: String? = null
+                while (start < end) {
+                    val sub = if (start > 0) "##" + token.substring(start, end) else token.substring(start, end)
+                    if (vocab.containsKey(sub)) { cur = sub; break }
+                    end--
+                }
+                if (cur == null) { bad = true; break }
+                pieces.add(cur)
+                start = end
+            }
+            if (bad) out.add(UNK) else out.addAll(pieces)
+        }
+        return out
+    }
+
+    companion object {
+        private const val UNK = "[UNK]"
+        private const val MAX_CHARS = 100
+
+        private fun isWhitespace(cp: Int): Boolean =
+            cp == ' '.code || cp == '\t'.code || cp == '\n'.code || cp == '\r'.code ||
+                Character.getType(cp) == Character.SPACE_SEPARATOR.toInt()
+
+        private fun isControl(cp: Int): Boolean {
+            if (cp == '\t'.code || cp == '\n'.code || cp == '\r'.code) return false
+            return when (Character.getType(cp)) {
+                Character.CONTROL.toInt(), Character.FORMAT.toInt(),
+                Character.SURROGATE.toInt(), Character.PRIVATE_USE.toInt(),
+                Character.UNASSIGNED.toInt() -> true
+                else -> false
+            }
+        }
+
+        private fun isPunct(cp: Int): Boolean {
+            if ((cp in 33..47) || (cp in 58..64) || (cp in 91..96) || (cp in 123..126)) return true
+            return when (Character.getType(cp)) {
+                Character.CONNECTOR_PUNCTUATION.toInt(), Character.DASH_PUNCTUATION.toInt(),
+                Character.START_PUNCTUATION.toInt(), Character.END_PUNCTUATION.toInt(),
+                Character.OTHER_PUNCTUATION.toInt(), Character.INITIAL_QUOTE_PUNCTUATION.toInt(),
+                Character.FINAL_QUOTE_PUNCTUATION.toInt() -> true
+                else -> false
+            }
+        }
+
+        private fun isCjk(cp: Int): Boolean =
+            (cp in 0x4E00..0x9FFF) || (cp in 0x3400..0x4DBF) || (cp in 0x20000..0x2A6DF) ||
+                (cp in 0x2A700..0x2B73F) || (cp in 0x2B740..0x2B81F) || (cp in 0x2B820..0x2CEAF) ||
+                (cp in 0xF900..0xFAFF) || (cp in 0x2F800..0x2FA1F)
+
+        fun load(tokenizerJsonPath: String): BertWordPieceTokenizer {
+            val vocab = HashMap<String, Int>(35_000)
+            File(tokenizerJsonPath).inputStream().use { stream ->
+                JsonReader(InputStreamReader(stream, Charsets.UTF_8)).use { r ->
+                    r.beginObject()
+                    while (r.hasNext()) {
+                        if (r.nextName() == "model") {
+                            r.beginObject()
+                            while (r.hasNext()) {
+                                if (r.nextName() == "vocab") {
+                                    r.beginObject()
+                                    while (r.hasNext()) vocab[r.nextName()] = r.nextInt()
+                                    r.endObject()
+                                } else r.skipValue()
+                            }
+                            r.endObject()
+                        } else r.skipValue()
+                    }
+                    r.endObject()
+                }
+            }
+            return BertWordPieceTokenizer(vocab, vocab["[CLS]"] ?: 101, vocab["[SEP]"] ?: 102, vocab["[UNK]"] ?: 100)
+        }
+    }
+}
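To make the greedy longest-match step in `wordPieces` concrete, here is a stripped-down, standalone version run against a toy vocabulary. `greedyWordPiece` and the five-entry vocabulary are illustrative assumptions; the real tokenizer reads its vocabulary from `tokenizer.json` and also applies the basic-tokenizer pass and the `MAX_CHARS` guard first.

```kotlin
/**
 * Greedy longest-match WordPiece for a single pre-split token.
 * Non-initial pieces carry the "##" continuation prefix; if any position
 * has no matching piece, the whole token maps to [UNK].
 */
fun greedyWordPiece(token: String, vocab: Set<String>, unk: String = "[UNK]"): List<String> {
    val pieces = ArrayList<String>()
    var start = 0
    while (start < token.length) {
        var end = token.length
        var match: String? = null
        // Try the longest remaining substring first, shrinking from the right.
        while (start < end) {
            val sub = (if (start > 0) "##" else "") + token.substring(start, end)
            if (sub in vocab) { match = sub; break }
            end--
        }
        if (match == null) return listOf(unk)
        pieces.add(match)
        start = end
    }
    return pieces
}

fun main() {
    val vocab = setOf("un", "##aff", "##able", "##a", "car") // toy vocabulary
    println(greedyWordPiece("unaffable", vocab)) // [un, ##aff, ##able]
    println(greedyWordPiece("carx", vocab))      // [[UNK]]
}
```

Note that a single unmatched suffix ("x" in "carx") discards the pieces already found and yields one `[UNK]` for the whole token, which matches the `bad` flag handling in the class above.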
diff --git a/lib/src/androidMain/kotlin/com/sagar/aicore/GemmaBpeTokenizer.kt b/lib/src/androidMain/kotlin/com/sagar/aicore/GemmaBpeTokenizer.kt
diff --git a/lib/src/androidMain/kotlin/com/sagar/aicore/MediaPipeEmbeddingEngine.kt b/lib/src/androidMain/kotlin/com/sagar/aicore/MediaPipeEmbeddingEngine.kt
@@ -31,6 +31,9 @@ class MediaPipeEmbeddingEngine(
     private var textEmbedder: TextEmbedder? = null
     private val mutex = Mutex()
 
+    /** USE-Lite emits 100-dim vectors. */
+    override val dimensions: Int = 100
+
     /**
      * Initializes the embedder with a model path.
      */
@@ -56,7 +59,9 @@ class MediaPipeEmbeddingEngine(
         }
     }
 
-    override suspend fun embed(text: String): FloatArray = withContext(Dispatchers.IO) {
+    // USE-Lite is symmetric: query and document are embedded identically, so
+    // [task]/[title] are ignored.
+    override suspend fun embed(text: String, task: EmbeddingTask, title: String?): FloatArray = withContext(Dispatchers.IO) {
         Napier.d(tag = "EmbeddingEngine") { "embed START hash=${System.identityHashCode(this@MediaPipeEmbeddingEngine)} text_len=${text.length} first50=${text.take(50)}" }
         val embedder = mutex.withLock { textEmbedder } ?: run {
             Napier.e(tag = "EmbeddingEngine") { "Embedding model not loaded! hash=${System.identityHashCode(this@MediaPipeEmbeddingEngine)}" }