sagar-develop · sagar-develop · Jun 5, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -13,6 +13,9 @@ and the project loosely follows [Semantic Versioning](https://semver.org/).
 
 ## [Unreleased]
 
+### Added
+- **EmbeddingGemma RAG embedder** _(app + engine)_ — higher-quality on-device retrieval via EmbeddingGemma 300M (ONNX Runtime, Matryoshka-256, query/document task prompts), default on capable devices with the 6 MB USE-Lite embedder as the friction-free fallback. Existing sources are re-indexed from their stored text on upgrade; backups round-trip both embedders. Engine adds a task-aware `EmbeddingEngine` and `ModelFormat.ONNX_EMBEDDER` + companion-file (tokenizer) downloads. See `docs/EMBEDDING_GEMMA_PLAN.md`.
+
 ## [0.8.0] — 2026-06-03
 
 ### Added

diff --git a/docs/EMBEDDING_GEMMA_PLAN.md b/docs/EMBEDDING_GEMMA_PLAN.md
@@ -0,0 +1,220 @@
+# NativeLM — EmbeddingGemma on-device RAG embedder (implementation plan)
+
+_Branch: `claude/analysis-KV4t2`. Goal: replace the 2018-era Universal Sentence
+Encoder (USE-Lite, 100-dim) with **EmbeddingGemma 300M** as the default RAG
+embedder, lifting retrieval quality for both chat answers and every Studio
+artifact. USE-Lite stays as the low-end / no-download fallback._
+
+## Locked decisions
+
+| Decision | Choice | Why |
+|---|---|---|
+| **Runtime** | **ONNX Runtime (Android)** + a SentencePiece/HF tokenizer | Self-contained; full control of task-prompts, pooling, normalization, Matryoshka. **No Google telemetry deps** (protects the zero-telemetry stance — commit `d5b5fa9`). KMP/iOS-friendly. Avoids the MediaPipe `TextEmbedder` path that broke before. |
+| **Dimension** | **256** (Matryoshka truncation of the 768-native vector) | Best quality/size/speed balance on-device; the migration path already reserved in `DocumentChunkEntity`. ~2.5× storage vs the current 100-dim but far better retrieval; half the index cost of 512. |
+| **Rollout** | **Default on capable devices; USE-Lite stays as fallback** | Friction-free first run preserved. Low-end / no-download installs keep working on USE-Lite. Two HNSW indexes coexist; the active embedder selects which one. |
+
+---
+
+## Why the earlier attempt failed (recorded so we don't repeat it)
+
+EmbeddingGemma is a 300M transformer, **not** a TFLite *Task* model. Three
+independent landmines, any one of which sinks a naive swap:
+
+1. **Wrong loader.** `MediaPipeEmbeddingEngine` calls
+   `TextEmbedder.createFromFile()`, which only accepts TFLite Task models with
+   baked-in tokenizer metadata (USE/BERT-style). EmbeddingGemma won't load there
+   (or returns garbage). **→ This plan introduces a separate ONNX engine; it does
+   not touch the MediaPipe path.**
+2. **Dimension lock.** `DocumentChunkEntity.@HnswIndex(dimensions = 100L)` is an
+   annotation **literal**. EmbeddingGemma emits 768/512/256 → ObjectBox throws the
+   moment a longer vector is inserted/queried. **→ This plan adds a new 256-dim
+   entity rather than editing the 100-dim one.**
+3. **Missing task prompts.** EmbeddingGemma *requires* instruction prefixes;
+   `EmbeddingEngine.embed(text)` is symmetric and `DefaultDocumentRetriever` calls
+   `embed(query)` with no role. Even if it loaded, retrieval would look "broken."
+   **→ This plan makes the interface task-aware.**
+
+---
+
+## Architecture
+
+```
+                         ┌─ EmbeddingTask.QUERY    → "task: search result | query: {q}"
+query / chunk text  ──►  │                                                       │
+                         └─ EmbeddingTask.DOCUMENT → "title: {t|none} | text: {c}"│
+                                                                                  ▼
+                                            ┌──────────────────────────────────────┐
+                                            │ OnnxEmbeddingEngine (androidMain)      │
+                                            │  tokenize (SentencePiece/HF)           │
+                                            │  → ORT run → last_hidden_state         │
+                                            │  → mean-pool over attention mask       │
+                                            │  → truncate to 256 (Matryoshka)        │
+                                            │  → L2 normalize                        │
+                                            └──────────────────────────────────────┘
+                                                                  │ 256-dim FloatArray
+                                                                  ▼
+   active embedder selects index ──►  GemmaChunkEntity (256-dim HNSW)   [default]
+                                      DocumentChunkEntity (100-dim HNSW) [USE fallback / legacy]
+```
+
+The **active embedder** is an install-level property (which EMBEDDING model is
+downloaded + chosen). It determines (a) which engine `embed*` routes to and
+(b) which HNSW entity ingestion/retrieval use. Switching embedders triggers a
+**re-index from stored chunk text** (no re-extraction needed).
+
+---
+
+## Interface contract (lands first — Module 0)
+
+```kotlin
+// lib/commonMain — EmbeddingEngine.kt  (BREAKING: task-aware)
+enum class EmbeddingTask { QUERY, DOCUMENT }
+
+interface EmbeddingEngine {
+    /** Output dimension of this embedder (USE-Lite = 100, EmbeddingGemma = 256). */
+    val dimensions: Int
+    suspend fun initialize(modelPath: String)
+    /** [title] is only used for DOCUMENT task on prompt-instructed models; ignored otherwise. */
+    suspend fun embed(text: String, task: EmbeddingTask, title: String? = null): FloatArray
+}
+```
+
+```kotlin
+// lib/commonMain — ModelCatalog.kt
+enum class ModelFormat { LITERTLM, MEDIAPIPE_TEXT_EMBEDDER, WHISPER_GGML, ONNX_EMBEDDER }
+
+// ModelDescriptor gains companion-file support so the tokenizer ships with the model:
+data class ModelDescriptor(
+    /* …existing… */
+    val companions: List<CompanionFile> = emptyList(),   // NEW — e.g. tokenizer.json
+)
+data class CompanionFile(val url: String, val fileName: String, val sizeBytes: Long, val sha256: String? = null)
+```
+
+```kotlin
+// sample-app data.db — new 256-dim chunk entity (parallel to DocumentChunkEntity)
+@Entity
+class GemmaChunkEntity {
+    @Id var id: Long = 0
+    @Index var documentId: Long = 0
+    @Index var projectId: Long = 0
+    var text: String = ""; var pageNumber: Int = 0; var chunkIndex: Int = 0
+    @HnswIndex(dimensions = 256L, distanceType = VectorDistanceType.COSINE,
+               neighborsPerNode = 48, indexingSearchCount = 200)
+    var embedding: FloatArray? = null
+    companion object { const val EMBEDDING_DIM = 256 }
+}
+```
+
+The `DocumentRepository` ingestion/retrieval methods route to the entity matching
+the active embedder; `ScoredChunk` stays the common return shape so
+`DefaultDocumentRetriever` / `RagContextFormatter` are largely unchanged.
+
+---
+
+## Module breakdown
+
+| Mod | Scope | Key files | Depends on |
+|----|-------|-----------|-----------|
+| **0** | Contracts: task-aware `EmbeddingEngine`, `ONNX_EMBEDDER` format, `companions` on `ModelDescriptor`, `GemmaChunkEntity` (regen `objectbox-models/default.json`) | `EmbeddingEngine.kt`, `ModelCatalog.kt`, `Entities.kt`, version catalog | — |
+| **A** | ONNX engine: ORT session, tokenizer, mean-pool + Matryoshka-256 + L2-norm, task prompts | `OnnxEmbeddingEngine.kt` (androidMain), DI wiring in `AndroidAiEngineComponent.kt` | 0 |
+| **B** | Catalog + download: EmbeddingGemma descriptor (`requiresAuth = true`), tokenizer companion download, sha256 pins | `NativeLmModelCatalog.kt`, model-download path | 0 |
+| **C** | Repository routing: `GemmaChunkEntity` CRUD + HNSW search; active-embedder selector | `ObjectBoxDocumentRepository.kt`, `RagHolder.kt` | 0 |
+| **D** | Ingest/retrieve task wiring: `embed(text, DOCUMENT, title)` on ingest, `embed(query, QUERY)` on retrieve; re-tune distance gate | `DefaultDocumentIngestor.kt`, `DefaultDocumentRetriever.kt` | A,C |
+| **E** | Migration: background re-index USE→Gemma from stored text, with progress + resume | new `EmbeddingMigrator.kt`, `NativeLmViewModel.kt` | C,D |
+| **F** | UI + gating: embedder shown in Models screen (Recommended/Advanced, Gemma terms), device gating, re-index progress | `ModelManagementScreen.kt`, onboarding terms gate | B,E |
+| **G** | Backup/sync compatibility: carry embedder tag; re-index on mismatched import | `BackupManager.kt`, `BackupModels.kt`, sync transport | E |
+
+---
+
+## Migration / re-index plan
+
+Embeddings are **derived data**; `DocumentChunkEntity.text` is already persisted,
+so re-indexing never needs the original PDFs.
+
+1. On first run after EmbeddingGemma is downloaded + selected, kick a background
+   `EmbeddingMigrator` (resumable, idempotent — skip docs already in `GemmaChunkEntity`).
+2. Stream chunks per project → `embed(text, DOCUMENT, title)` → write to
+   `GemmaChunkEntity`. Reuse the `IngestState.Embedding(done,total)` progress UI.
+3. Until a project is migrated, retrieval **falls back to the 100-dim index** so
+   chat keeps working.
+4. After a project migrates, delete its old 100-dim chunks to reclaim storage
+   (tx-split delete — see gotchas).
+5. Low-end devices that never download EmbeddingGemma stay entirely on USE-Lite.
+
+---
+
+## Gotchas (bake these in)
+
+- **HNSW tx-split (carried over):** chunk deletes and parent-doc deletes go in
+  **separate transactions**, or HNSW commit deadlocks. Applies to the re-index
+  cleanup too.
+- **Distance gate is USE-tuned.** `DefaultDocumentRetriever.RELEVANCE_MAX_DISTANCE
+  = 0.75` was tuned for USE-Lite's distribution. EmbeddingGemma's cosine spread
+  differs, so **re-tune per active embedder** (likely a separate constant);
+  otherwise off-topic queries get grounded or on-topic ones get rejected.
+- **Task prompts are mandatory.** Query = `task: search result | query: …`;
+  Document = `title: {title or "none"} | text: …`. Wrong/missing prompts quietly
+  tank recall.
+- **Matryoshka order: truncate *then* re-normalize.** Take the first 256 dims of
+  the pooled vector, *then* L2-normalize — not the reverse.
+- **Tokenizer is the fiddly bit.** Ship `tokenizer.json` as a `companions` file
+  (or app asset) and run it via `onnxruntime-extensions` (in-graph) or the HF
+  `tokenizers` Android binding. Cap `max_seq_len` (~512) — chunks are ~500 chars
+  so this is safe and bounds latency/memory.
+- **Latency.** A 300M transformer per chunk is far slower than USE-Lite's 6 MB model;
+  a big PDF can go from seconds to minutes. Mitigate: quantized (INT8/QAT) ONNX,
+  XNNPACK threads, batch tokenization, and run ingestion/migration off the main
+  thread (ties into the deferred foreground-service download/ingest work in
+  `PLAY_STORE.md §9`).
+- **Memory coexistence.** Don't embed and generate simultaneously: the LLM is the
+  big RAM tenant. Serialize ingestion/migration against active chat generation.
+- **Gemma licensing.** EmbeddingGemma is Gemma-licensed → `requiresAuth = true`,
+  `Authorization: Bearer <hf-token>`, surfaced under the **Advanced — Hugging Face
+  account** section and the onboarding **terms gate** already built in PR #22.
+- **Backup/sync dimension mismatch.** A backup/synced DB may carry vectors from a
+  different embedder/dimension. Tag exports with the embedder id; on import with a
+  mismatch, **re-index from the included chunk text** rather than trusting vectors.
+- **APK/download budget.** ORT Android AAR (~10–20 MB, arm64-only to match the
+  existing `abiFilters`) + the quantized ONNX model (~100–200 MB, downloaded, not
+  bundled) + tokenizer. Confirm against the size budget.
+
+---
+
+## Dependencies to add
+
+- `com.microsoft.onnxruntime:onnxruntime-android` (full build for op coverage;
+  revisit ORT-format + `onnxruntime-mobile` later for size).
+- `com.microsoft.onnxruntime:onnxruntime-extensions-android` (in-graph tokenizer),
+  **or** the HF `tokenizers` Android binding as fallback.
+- arm64-v8a only, consistent with `libwhisper.so` and the LiteRT-LM footprint.
+
+---
+
+## Testing / verify (the ship bar)
+
+On-device on **CPH2723 (release build)**:
+1. Fresh ingest of a real PDF → confirm chunks land in `GemmaChunkEntity` (256-dim).
+2. **Retrieval quality A/B**: a fixed query set, USE-Lite vs EmbeddingGemma — the
+   win must be visible (recall on names/concepts, fewer off-topic citations).
+3. **Latency/memory**: per-chunk embed time + peak RAM during ingest; confirm a
+   multi-page PDF completes acceptably and coexists with chat.
+4. **Migration**: upgrade an install with existing USE-Lite docs → re-index runs,
+   shows progress, retrieval keeps working throughout, old vectors reclaimed after.
+5. **Low-end fallback**: a device that declines the download stays on USE-Lite and
+   functions unchanged.
+6. **Distance-gate tuning**: verify off-topic questions still return
+   `RetrievedContext.EMPTY` with the re-tuned threshold.
+
+**Done when:** EmbeddingGemma is the default embedder on a capable device, an
+existing install migrates cleanly, retrieval quality is visibly better, and the
+USE-Lite fallback path still works end-to-end.
+
+---
+
+## Out of scope (follow-ups)
+
+- **Reranker** second stage (cross-encoder) — separate plan; complements this.
+- **ORT-format / mobile build** size optimization — after correctness is proven.
+- **iOS embedder** — the ONNX engine is KMP-portable; wire `iosMain` later.
+- **Token-aware chunking** — still char-based (500/50); revisit independently.
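Note on the plan's pipeline (illustrative, not part of this patch): the Architecture and Gotchas sections describe task prompts, mean-pooling over the attention mask, Matryoshka truncation to 256, and a final L2 normalization. A minimal pure-Kotlin sketch of those steps might look like the following. The function names `buildPrompt`, `meanPool`, and `truncateAndNormalize` are hypothetical, and the hidden states are passed in as plain arrays rather than read from an ONNX Runtime result.

```kotlin
import kotlin.math.sqrt

enum class EmbeddingTask { QUERY, DOCUMENT }

/** Builds the instruction-prefixed input string; prompt formats are the ones quoted in the plan. */
fun buildPrompt(text: String, task: EmbeddingTask, title: String? = null): String = when (task) {
    EmbeddingTask.QUERY -> "task: search result | query: $text"
    EmbeddingTask.DOCUMENT -> "title: ${title ?: "none"} | text: $text"
}

/** Mean-pools per-token vectors, counting only positions whose attention mask is non-zero. */
fun meanPool(hidden: Array<FloatArray>, mask: LongArray): FloatArray {
    require(hidden.size == mask.size) { "one mask entry per token" }
    val width = hidden.firstOrNull()?.size ?: 0
    val sum = FloatArray(width)
    var kept = 0
    for (t in hidden.indices) {
        if (mask[t] == 0L) continue // skip padding positions
        for (d in 0 until width) sum[d] += hidden[t][d]
        kept++
    }
    if (kept > 0) for (d in 0 until width) sum[d] /= kept
    return sum
}

/** Matryoshka truncation first, then L2 normalization of the shortened vector. */
fun truncateAndNormalize(pooled: FloatArray, dims: Int): FloatArray {
    require(dims in 1..pooled.size) { "dims must fit inside the pooled vector" }
    val head = pooled.copyOf(dims)
    var squared = 0.0
    for (v in head) squared += v.toDouble() * v
    val norm = sqrt(squared)
    if (norm == 0.0) return head // leave an all-zero vector untouched
    for (i in head.indices) head[i] = (head[i] / norm).toFloat()
    return head
}

fun main() {
    // Two real tokens plus one padding row; the padding row must not affect the mean.
    val hidden = arrayOf(
        floatArrayOf(3f, 0f, 9f),
        floatArrayOf(3f, 8f, 9f),
        floatArrayOf(100f, 100f, 100f),
    )
    val mask = longArrayOf(1, 1, 0)
    val vector = truncateAndNormalize(meanPool(hidden, mask), dims = 2)
    println(buildPrompt("what is HNSW?", EmbeddingTask.QUERY))
    println(vector.joinToString())
}
```

Normalizing before truncation would generally leave the kept prefix shorter than unit length, which is why the sketch normalizes last.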
diff --git a/gradle/libs.versions.toml b/gradle/libs.versions.toml
@@ -13,6 +13,8 @@ napier = "2.7.1"
 kotlin-inject = "0.9.0"
 mediapipe = "0.10.35"
 litertlm = "0.11.0"
+onnxruntime = "1.20.0"
+djl-tokenizers = "0.30.0"
 mlkit-text-recognition = "16.0.1"
 androidx-core-ktx = "1.15.0"
 # sample-app only
@@ -43,6 +45,9 @@ kotlin-inject-runtime = { module = "me.tatarka.inject:kotlin-inject-runtime", ve
 kotlin-inject-compiler = { module = "me.tatarka.inject:kotlin-inject-compiler-ksp", version.ref = "kotlin-inject" }
 mediapipe-tasks-text = { module = "com.google.mediapipe:tasks-text", version.ref = "mediapipe" }
 litertlm-android = { module = "com.google.ai.edge.litertlm:litertlm-android", version.ref = "litertlm" }
+# EmbeddingGemma on-device: ONNX Runtime + a HuggingFace tokenizer (reads tokenizer.json).
+onnxruntime-android = { module = "com.microsoft.onnxruntime:onnxruntime-android", version.ref = "onnxruntime" }
+djl-tokenizers = { module = "ai.djl.huggingface:tokenizers", version.ref = "djl-tokenizers" }
 androidx-core-ktx = { group = "androidx.core", name = "core-ktx", version.ref = "androidx-core-ktx" }
 kotlinx-coroutines-android = { module = "org.jetbrains.kotlinx:kotlinx-coroutines-android", version.ref = "kotlinx-coroutines" }
 # sample-app only

diff --git a/lib/build.gradle.kts b/lib/build.gradle.kts
@@ -51,6 +51,9 @@ kotlin {
             implementation(libs.litertlm.android)
             implementation(libs.androidx.core.ktx)
             implementation(libs.ktor.client.okhttp)
+            // EmbeddingGemma on-device: ONNX Runtime + HuggingFace tokenizer.
+            implementation(libs.onnxruntime.android)
+            implementation(libs.djl.tokenizers)
         }
         iosMain.dependencies {
             implementation(libs.ktor.client.darwin)

diff --git a/lib/consumer-rules.pro b/lib/consumer-rules.pro
@@ -33,6 +33,17 @@
 -keep class com.google.common.flogger.** { *; }
 -dontwarn com.google.common.flogger.**
 
+# ---- ONNX Runtime (com.microsoft.onnxruntime) — EmbeddingGemma ----
+# JNI bridge; the native <methods> keep above covers the bindings. Keep the API
+# surface and silence optional references.
+-keep class ai.onnxruntime.** { *; }
+-dontwarn ai.onnxruntime.**
+
+# ---- HuggingFace tokenizers (ai.djl.huggingface) — EmbeddingGemma tokenizer ----
+# Loads a native lib + reflects over JNI types when reading tokenizer.json.
+-keep class ai.djl.** { *; }
+-dontwarn ai.djl.**
+
 # ---- kotlinx.serialization ----
 # Keep generated serializers + the synthetic serializer() accessor.
 -keepclassmembers class **$$serializer { *; }

diff --git a/lib/src/androidMain/kotlin/com/sagar/aicore/GemmaTokenizer.kt b/lib/src/androidMain/kotlin/com/sagar/aicore/GemmaTokenizer.kt
@@ -0,0 +1,49 @@
+/*
+ * Copyright (C) 2026 Sagar Gupta
+ * SPDX-License-Identifier: AGPL-3.0-or-later
+ */
+package com.sagar.aicore
+
+import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer
+import java.nio.file.Paths
+
+/** Tokenized input for the ONNX embedder: parallel token-id and attention-mask arrays (compared by reference, since data-class `equals` does not look inside arrays). */
+data class TokenizedInput(val ids: LongArray, val attentionMask: LongArray) {
+    val length: Int get() = ids.size
+}
+
+/**
+ * Turns text into model input ids + attention mask for [OnnxEmbeddingEngine].
+ * EmbeddingGemma uses the Gemma SentencePiece vocabulary; we load it from the
+ * `tokenizer.json` companion that ships next to the model.
+ */
+interface GemmaTokenizer {
+    fun encode(text: String): TokenizedInput
+}
+
+/**
+ * [GemmaTokenizer] backed by the HuggingFace tokenizers runtime (DJL binding),
+ * reading the model's `tokenizer.json`. Truncates to [maxLength] tokens — chunks
+ * are ~500 chars so this bounds latency/memory without losing content.
+ *
+ * Note: the exact padding/truncation knobs depend on the shipped `tokenizer.json`;
+ * verify ids/mask shapes against the chosen EmbeddingGemma ONNX export on-device.
+ */
+class HfGemmaTokenizer(
+    tokenizerJsonPath: String,
+    private val maxLength: Int = 512,
+) : GemmaTokenizer {
+
+    private val tokenizer: HuggingFaceTokenizer =
+        HuggingFaceTokenizer.builder()
+            .optTokenizerPath(Paths.get(tokenizerJsonPath))
+            .optAddSpecialTokens(true)
+            .optTruncation(true)
+            .optMaxLength(maxLength)
+            .build()
+
+    override fun encode(text: String): TokenizedInput {
+        val enc = tokenizer.encode(text)
+        return TokenizedInput(ids = enc.ids, attentionMask = enc.attentionMask)
+    }
+}
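Note on `GemmaTokenizer` (illustrative, not part of this patch): because the engine depends on the interface rather than on `HfGemmaTokenizer`, a JVM unit test could plausibly swap in a stand-in that needs no native library or `tokenizer.json`. The sketch below assumes that is the intent. `FakeGemmaTokenizer` is a hypothetical name, its ids are not real Gemma vocabulary ids, and the two types from the file above are repeated only so the block compiles on its own.

```kotlin
// Mirrors of the two types declared in GemmaTokenizer.kt, repeated for self-containment.
data class TokenizedInput(val ids: LongArray, val attentionMask: LongArray) {
    val length: Int get() = ids.size
}

interface GemmaTokenizer {
    fun encode(text: String): TokenizedInput
}

/**
 * Hypothetical test double: one pseudo-token per whitespace-separated word,
 * truncated to maxLength. It only reproduces the shape of the real output.
 */
class FakeGemmaTokenizer(private val maxLength: Int = 512) : GemmaTokenizer {
    override fun encode(text: String): TokenizedInput {
        val words = text.trim()
            .split(Regex("\\s+"))
            .filter { it.isNotEmpty() }
            .take(maxLength)
        // Map each word to a stable, non-negative pseudo-id.
        val ids = LongArray(words.size) { i -> words[i].hashCode().toLong() and 0x7FFFFFFFL }
        // No padding is produced, so every position counts as a real token.
        val mask = LongArray(words.size) { 1L }
        return TokenizedInput(ids, mask)
    }
}

fun main() {
    val input = FakeGemmaTokenizer(maxLength = 4)
        .encode("title: none | text: hello world again")
    println("length=${input.length} mask=${input.attentionMask.joinToString()}")
}
```

A double like this checks truncation and array shapes only; token ids and mask layout still need verifying against the real tokenizer on-device, as the KDoc above says.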
diff --git a/lib/src/androidMain/kotlin/com/sagar/aicore/MediaPipeEmbeddingEngine.kt b/lib/src/androidMain/kotlin/com/sagar/aicore/MediaPipeEmbeddingEngine.kt
@@ -31,6 +31,9 @@ class MediaPipeEmbeddingEngine(
     private var textEmbedder: TextEmbedder? = null
     private val mutex = Mutex()
 
+    /** USE-Lite is a fixed 100-dim, symmetric embedder, so [embed] ignores its `task` and `title` arguments. */
+    override val dimensions: Int = 100
+
     /**
      * Initializes the embedder with a model path.
      */
@@ -56,7 +59,11 @@ class MediaPipeEmbeddingEngine(
         }
     }
 
-    override suspend fun embed(text: String): FloatArray = withContext(Dispatchers.IO) {
+    override suspend fun embed(
+        text: String,
+        task: EmbeddingTask,
+        title: String?,
+    ): FloatArray = withContext(Dispatchers.IO) {
         Napier.d(tag = "EmbeddingEngine") { "embed START hash=${System.identityHashCode(this@MediaPipeEmbeddingEngine)} text_len=${text.length} first50=${text.take(50)}" }
         val embedder = mutex.withLock { textEmbedder } ?: run {
             Napier.e(tag = "EmbeddingEngine") { "Embedding model not loaded! hash=${System.identityHashCode(this@MediaPipeEmbeddingEngine)}" }
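Note on `dimensions` (illustrative, not part of this patch): with each engine now reporting its output size (100 here, 256 for EmbeddingGemma per the plan), per-embedder settings could be grouped and checked before a vector reaches a fixed-dimension HNSW index. The sketch below is one possible shape. `EmbedderProfile`, `requireDimension`, and `applyGate` are hypothetical names, `ScoredChunk` is a simplified stand-in for the app's type, and the 256-dim threshold is a placeholder rather than a tuned value. Only the 100-dim / 0.75 pairing comes from the plan.

```kotlin
/** Simplified stand-in for the app's ScoredChunk: chunk text plus cosine distance to the query. */
data class ScoredChunk(val text: String, val distance: Float)

/** Hypothetical per-embedder settings: vector size plus the relevance distance gate. */
data class EmbedderProfile(val dimensions: Int, val maxDistance: Float)

// Values taken from the plan: USE-Lite is 100-dim and its gate was tuned to 0.75.
val USE_LITE_PROFILE = EmbedderProfile(dimensions = 100, maxDistance = 0.75f)

/** Fails fast instead of letting a wrong-sized vector reach a fixed-dimension index. */
fun requireDimension(vector: FloatArray, profile: EmbedderProfile): FloatArray {
    require(vector.size == profile.dimensions) {
        "expected ${profile.dimensions} dims, got ${vector.size}"
    }
    return vector
}

/** Keeps hits at or below the gate, nearest first (inclusive comparison is an assumption). */
fun applyGate(hits: List<ScoredChunk>, profile: EmbedderProfile): List<ScoredChunk> =
    hits.filter { it.distance <= profile.maxDistance }.sortedBy { it.distance }

fun main() {
    // Placeholder gate for the 256-dim embedder; the real value needs on-device tuning.
    val gemmaProfile = EmbedderProfile(dimensions = 256, maxDistance = 0.5f)
    val hits = listOf(ScoredChunk("b", 0.6f), ScoredChunk("a", 0.2f))
    println(applyGate(hits, USE_LITE_PROFILE).map { it.text })
    println(applyGate(hits, gemmaProfile).map { it.text })
}
```

Keeping the gate next to the dimension makes it harder to apply the USE-tuned threshold to vectors from the other embedder by accident.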