56 changes: 54 additions & 2 deletions CHANGELOG.md
@@ -9,9 +9,61 @@ and the project loosely follows [Semantic Versioning](https://semver.org/).
> `com.sagar:litertlm-kmp`) and the **showcase app** (`sample-app/`, NativeLM)
> live in one repo and share a single version line. Entries below note which
> surface a change lands on. The engine library version in
> `lib/build.gradle.kts` tracks the latest release (`0.9.0`).
> `lib/build.gradle.kts` tracks the latest release (`0.10.0`).

## [Unreleased]
## [0.10.0] — 2026-06-09

### Added
- **EmbeddingGemma RAG embedder (device-tiered, telemetry-free)** — optional upgrade
from USE-Lite (100-dim) to **EmbeddingGemma-300M** for document retrieval, run on
**ONNX Runtime** (no Google/Play telemetry deps). One downloaded model serves every
tier via **Matryoshka** truncation; a **recommendation engine** picks the embedder +
dim by device RAM (USE-Lite <6 GB · Gemma@256 6–9 GB · Gemma@512 + reranker ≥10 GB),
surfaced as a "Recommended" badge in the in-app model catalogue. The model and its
companion files (weights blob + tokenizer) download on-device through the catalogue;
nothing is bundled in the APK. (engine + app)
- **Cross-encoder reranker (flagship)** — optional second-stage `ms-marco-MiniLM-L6`
rerank over the top fused candidates, gated to high-RAM devices. (engine + app)
- **Pure-Kotlin tokenizers** — BPE (EmbeddingGemma) and BERT WordPiece (reranker),
reading the HuggingFace `tokenizer.json`, validated bit-for-bit against the reference
`transformers` tokenizer. No native tokenizer lib, KMP-portable. (engine)
- **Task-aware embeddings** — `EmbeddingEngine` now distinguishes query vs document
(instruction prefixes), required for EmbeddingGemma's asymmetric retrieval. (engine)
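
Reviewer note: the tiering and Matryoshka truncation described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: `EmbedderTier`, `recommendTier`, and `truncateMatryoshka` are hypothetical names, and the handling of devices between 9 and 10 GB is an assumption, since the changelog leaves that gap unspecified.

```kotlin
import kotlin.math.sqrt

// Hypothetical tier model; names are illustrative and do not come from the PR.
enum class EmbedderTier(val dim: Int, val rerank: Boolean) {
    USE_LITE(100, false),
    GEMMA_256(256, false),
    GEMMA_512_RERANK(512, true),
}

// Thresholds follow the changelog text (<6 GB, 6-9 GB, >=10 GB).
// Assumption: anything from 6 GB up to (but not including) 10 GB is mid tier.
fun recommendTier(totalRamGb: Double): EmbedderTier = when {
    totalRamGb < 6.0 -> EmbedderTier.USE_LITE
    totalRamGb < 10.0 -> EmbedderTier.GEMMA_256
    else -> EmbedderTier.GEMMA_512_RERANK
}

// Matryoshka truncation: keep the first `dim` components of the full vector,
// then L2-normalize again so cosine/dot-product scores stay comparable.
fun truncateMatryoshka(full: FloatArray, dim: Int): FloatArray {
    require(dim in 1..full.size) { "dim must be in 1..${full.size}" }
    val head = full.copyOf(dim)
    var sumSq = 0.0
    for (v in head) sumSq += v.toDouble() * v
    val norm = sqrt(sumSq)
    if (norm == 0.0) return head // all-zero vector: nothing to normalize
    for (i in head.indices) head[i] = (head[i] / norm).toFloat()
    return head
}
```

Under these assumptions, one downloaded model can back both Gemma tiers: the engine would embed once at full width and call `truncateMatryoshka(vector, tier.dim)` before indexing or querying.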

### Changed
- **Hybrid retrieval** — document retrieval now fuses dense vector search with **BM25**
lexical scoring via **Reciprocal Rank Fusion (RRF)**, plus a **per-document cap** so one
large source can't fill every top-k slot, wider candidate pools, and a larger grounding
budget. Ships independent of the embedder upgrade. (engine)
- **Backup/sync** — backups now carry chunk embeddings from every embedder dim and tag
each chunk with its dim; cross-embedder restores re-index from the included text
instead of being rejected (backup schema v2). (app)
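
Reviewer note: a minimal sketch of Reciprocal Rank Fusion with a per-document cap, to make the hybrid-retrieval entry concrete. The `Chunk` type, the function name, and the constant `k = 60` (a value commonly used with RRF) are assumptions; the diff shown here does not include the retriever, so the real parameters may differ.

```kotlin
// Hypothetical chunk type; field names are illustrative only.
data class Chunk(val id: String, val docId: String)

// Fuses two best-first rankings with RRF, then fills top-k while allowing
// at most `perDocCap` chunks from any single document.
fun fuseRrf(
    dense: List<Chunk>,    // ranked by vector similarity, best first
    lexical: List<Chunk>,  // ranked by BM25, best first
    topK: Int,
    perDocCap: Int,
    k: Int = 60,           // assumed smoothing constant, not taken from the PR
): List<Chunk> {
    val scores = LinkedHashMap<Chunk, Double>()
    for (ranking in listOf(dense, lexical)) {
        ranking.forEachIndexed { index, chunk ->
            // RRF contribution: 1 / (k + rank), with rank starting at 1.
            scores[chunk] = (scores[chunk] ?: 0.0) + 1.0 / (k + index + 1)
        }
    }
    val usedPerDoc = HashMap<String, Int>()
    val out = ArrayList<Chunk>()
    for ((chunk, _) in scores.entries.sortedByDescending { it.value }) {
        if (out.size >= topK) break
        val used = usedPerDoc[chunk.docId] ?: 0
        if (used >= perDocCap) continue // this document already filled its quota
        usedPerDoc[chunk.docId] = used + 1
        out.add(chunk)
    }
    return out
}
```

The cap is applied after fusion rather than per ranking, so a large source can still contribute its best chunks but cannot occupy every slot.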

### Fixed
- **Wrong-document grounding** — a **document-level dominance gate** stops BM25 lexical
pollution from grounding an answer on the wrong source. A real failure: a "car
insurance premium" question answered with the **life policy's** figure (₹41,799) instead
of the actual car premium (₹8,504) because the life PDF's wording out-scored the car PDF
on shared tokens. The gate keeps grounding on the document that genuinely dominates the
candidate set. (engine)
- **Title-match override** — when a distinctive query term names a document by its title,
retrieval grounds on that document. "Who is the insurer of my car policy" was answering
from a health policy (whose formal "…insurer" phrasing out-scored the car doc); it now
correctly resolves to the car policy (TATA AIG). (engine)
- **Truncated grounded answers** — grounded replies collapsed to 1–2 tokens after a few
turns because the stateful LiteRT-LM KV cache accumulated each turn's grounding block. A
**per-grounded-turn session reset** re-prefills only bounded visible history
(`MAX_PREFILL_TURNS=16`), keeping answers full-length. (app)
- **Stale embeddings** — a **document-level self-healing migration** re-indexes a source
into the active embedder's index on next open when the embedder/dim changes, with no
re-import or OCR. (app)
- **Health-insurer recall miss** — enabling the cross-encoder reranker (ungated) on the
8 GB device tier recovers a relevant chunk the first-stage fusion ranked too low. (app)
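
Reviewer note: one plausible shape for the document-level dominance gate from the first entry above. The changelog does not state the gate's scoring rule or threshold, so everything here is an assumption: `Scored`, `applyDominanceGate`, summing fused scores per document, and the `minShare = 0.6` cutoff are all hypothetical.

```kotlin
// Hypothetical candidate type; the PR's real data model is not shown in this diff.
data class Scored(val docId: String, val chunkId: String, val score: Double)

// If one document holds at least `minShare` of the total candidate score,
// ground only on that document; otherwise leave the candidate set unchanged.
fun applyDominanceGate(candidates: List<Scored>, minShare: Double = 0.6): List<Scored> {
    if (candidates.isEmpty()) return candidates
    val perDoc = candidates
        .groupBy { it.docId }
        .mapValues { (_, chunks) -> chunks.sumOf { it.score } }
    val total = perDoc.values.sum()
    if (total <= 0.0) return candidates
    val (topDoc, topScore) = perDoc.maxByOrNull { it.value } ?: return candidates
    return if (topScore / total >= minShare) {
        candidates.filter { it.docId == topDoc }
    } else {
        candidates
    }
}
```

In the car-versus-life-policy failure described above, a gate of this kind would drop the stray life-policy chunk whenever the car document carries most of the fused score.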

### Migration
- Existing projects re-index from stored chunk text into the active embedder's index on
next open when the embedder changes — no re-import/OCR needed; the USE-Lite index is
kept as a fallback.

## [0.9.0] — 2026-06-05

15 changes: 15 additions & 0 deletions docs/EMBEDDING_GEMMA_PLAN.md
@@ -1,5 +1,20 @@
# NativeLM — EmbeddingGemma on-device RAG embedder (implementation plan)

> **Status (2026-06-08): IMPLEMENTED** on branch `feat/embedding-gemma` (issue #30).
> Deviations from the plan below, decided during build:
> - **Full 3-tier matrix + reranker** built now (not deferred): USE-Lite@100 (entry) /
> EmbeddingGemma@256 (mid) / @512 + `ms-marco-MiniLM-L6` cross-encoder reranker
> (flagship), chosen by a device-RAM **recommendation engine**. One Matryoshka model
> serves all Gemma tiers; per-dim ObjectBox HNSW entities (128/256/512) added.
> - **Tokenizer: pure-Kotlin**, not onnxruntime-extensions (its `gen_processing_models`
> doesn't support GemmaTokenizer) nor a Rust/DJL native lib. BPE (embedder) + BERT
> WordPiece (reranker), both validated bit-for-bit vs HF `transformers`. No extra .so.
> - **Companion download** through the catalogue (graph + `model.onnx_data` + tokenizer);
> the external-data blob keeps its original name so ORT resolves it.
> - Quick-win **per-document cap** shipped in the retriever (the wrong-PDF fix).
> See `CHANGELOG.md` [0.10.0] and `_session/material/blog-embeddinggemma-rag.md`.


_Branch: `claude/analysis-KV4t2`. Goal: replace the 2018-era Universal Sentence
Encoder (USE-Lite, 100-dim) with **EmbeddingGemma 300M** as the default RAG
embedder, lifting retrieval quality for both chat answers and every Studio
7 changes: 7 additions & 0 deletions gradle/libs.versions.toml
@@ -13,6 +13,11 @@ napier = "2.7.1"
kotlin-inject = "0.9.0"
mediapipe = "0.10.35"
litertlm = "0.11.0"
# ONNX Runtime + extensions for the EmbeddingGemma RAG embedder (Microsoft, no
# Google/Play telemetry deps). extensions version MUST match the version used to
# generate tokenizer.onnx (gen_processing_models) for custom-op compatibility.
onnxruntime = "1.26.0"
onnxruntime-extensions = "0.15.0"
mlkit-text-recognition = "16.0.1"
androidx-core-ktx = "1.15.0"
# sample-app only
@@ -43,6 +48,8 @@ kotlin-inject-runtime = { module = "me.tatarka.inject:kotlin-inject-runtime", ve
kotlin-inject-compiler = { module = "me.tatarka.inject:kotlin-inject-compiler-ksp", version.ref = "kotlin-inject" }
mediapipe-tasks-text = { module = "com.google.mediapipe:tasks-text", version.ref = "mediapipe" }
litertlm-android = { module = "com.google.ai.edge.litertlm:litertlm-android", version.ref = "litertlm" }
onnxruntime-android = { module = "com.microsoft.onnxruntime:onnxruntime-android", version.ref = "onnxruntime" }
onnxruntime-extensions-android = { module = "com.microsoft.onnxruntime:onnxruntime-extensions-android", version.ref = "onnxruntime-extensions" }
androidx-core-ktx = { group = "androidx.core", name = "core-ktx", version.ref = "androidx-core-ktx" }
kotlinx-coroutines-android = { module = "org.jetbrains.kotlinx:kotlinx-coroutines-android", version.ref = "kotlinx-coroutines" }
# sample-app only
6 changes: 5 additions & 1 deletion lib/build.gradle.kts
@@ -8,7 +8,7 @@ plugins {
}

group = "com.sagar"
version = "0.9.0"
version = "0.10.0"

// Keep the published artifact id stable as "litertlm-kmp" even though the
// Gradle module is now ":lib" (sample-app is a sibling subproject). Without
@@ -53,6 +53,10 @@ kotlin {
implementation(libs.ktor.client.okhttp)
// Argon2id (native, JNI) for passphrase-derived backup encryption keys.
implementation(libs.signal.argon2)
// EmbeddingGemma RAG embedder: ONNX Runtime for inference (Microsoft,
// telemetry-free — no Google/Play deps). Tokenization is pure-Kotlin
// (GemmaBpeTokenizer), so no onnxruntime-extensions native lib is needed.
implementation(libs.onnxruntime.android)
}
iosMain.dependencies {
implementation(libs.ktor.client.darwin)
176 changes: 176 additions & 0 deletions lib/src/androidMain/kotlin/com/sagar/aicore/BertWordPieceTokenizer.kt
@@ -0,0 +1,176 @@
/*
* Copyright (C) 2026 Sagar Gupta
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
package com.sagar.aicore

import android.util.JsonReader
import java.io.File
import java.io.InputStreamReader
import java.text.Normalizer

/**
* Pure-Kotlin BERT WordPiece tokenizer matching the ms-marco MiniLM cross-encoder's
* HuggingFace `tokenizer.json` (uncased: clean → handle-CJK → lowercase → strip
* accents → split whitespace/punctuation → greedy `##` WordPiece). Verified
* bit-for-bit against the reference `transformers` tokenizer (ASCII, punctuation,
* accents, CJK, emails).
*
* Produces cross-encoder pair input: `[CLS] query [SEP] passage [SEP]` with
* token_type_ids 0 for the first segment and 1 for the second.
*/
class BertWordPieceTokenizer private constructor(
private val vocab: HashMap<String, Int>,
private val clsId: Int,
private val sepId: Int,
private val unkId: Int,
) {

class PairEncoding(val ids: LongArray, val typeIds: LongArray, val mask: LongArray)

fun encodePair(query: String, passage: String): PairEncoding {
val q = wordPieces(basicTokenize(query))
val p = wordPieces(basicTokenize(passage))
val ids = ArrayList<Long>(q.size + p.size + 3)
val types = ArrayList<Long>(q.size + p.size + 3)
ids.add(clsId.toLong()); types.add(0)
for (t in q) { ids.add((vocab[t] ?: unkId).toLong()); types.add(0) }
ids.add(sepId.toLong()); types.add(0)
for (t in p) { ids.add((vocab[t] ?: unkId).toLong()); types.add(1) }
ids.add(sepId.toLong()); types.add(1)
val idArr = LongArray(ids.size) { ids[it] }
val typeArr = LongArray(types.size) { types[it] }
return PairEncoding(idArr, typeArr, LongArray(idArr.size) { 1L })
}

/** BERT basic tokenizer: clean, CJK-pad, lowercase, strip accents, split ws + punct. */
private fun basicTokenize(text: String): List<String> {
val sb = StringBuilder(text.length + 16)
var i = 0
while (i < text.length) {
val cp = text.codePointAt(i)
i += Character.charCount(cp)
if (cp == 0 || cp == 0xFFFD || isControl(cp)) continue
when {
isWhitespace(cp) -> sb.append(' ')
isCjk(cp) -> { sb.append(' '); sb.appendCodePoint(cp); sb.append(' ') }
else -> sb.appendCodePoint(cp)
}
}
// lowercase + strip accents (NFD, drop non-spacing marks)
val lowered = sb.toString().lowercase()
val nfd = Normalizer.normalize(lowered, Normalizer.Form.NFD)
val stripped = StringBuilder(nfd.length)
var j = 0
while (j < nfd.length) {
val cp = nfd.codePointAt(j)
j += Character.charCount(cp)
if (Character.getType(cp) == Character.NON_SPACING_MARK.toInt()) continue
stripped.appendCodePoint(cp)
}
// split on whitespace, then isolate punctuation
val out = ArrayList<String>()
for (word in stripped.toString().split(' ')) {
if (word.isEmpty()) continue
val cur = StringBuilder()
var k = 0
while (k < word.length) {
val cp = word.codePointAt(k)
k += Character.charCount(cp)
if (isPunct(cp)) {
if (cur.isNotEmpty()) { out.add(cur.toString()); cur.setLength(0) }
out.add(String(Character.toChars(cp)))
} else {
cur.appendCodePoint(cp)
}
}
if (cur.isNotEmpty()) out.add(cur.toString())
}
return out
}

/** Greedy longest-match WordPiece with `##` continuation; [UNK] on failure. */
private fun wordPieces(tokens: List<String>): List<String> {
val out = ArrayList<String>()
for (token in tokens) {
if (token.length > MAX_CHARS) { out.add(UNK); continue }
var start = 0
val n = token.length
var bad = false
val pieces = ArrayList<String>()
while (start < n) {
var end = n
var cur: String? = null
while (start < end) {
val sub = if (start > 0) "##" + token.substring(start, end) else token.substring(start, end)
if (vocab.containsKey(sub)) { cur = sub; break }
end--
}
if (cur == null) { bad = true; break }
pieces.add(cur)
start = end
}
if (bad) out.add(UNK) else out.addAll(pieces)
}
return out
}

companion object {
private const val UNK = "[UNK]"
private const val MAX_CHARS = 100

private fun isWhitespace(cp: Int): Boolean =
cp == ' '.code || cp == '\t'.code || cp == '\n'.code || cp == '\r'.code ||
Character.getType(cp) == Character.SPACE_SEPARATOR.toInt()

private fun isControl(cp: Int): Boolean {
if (cp == '\t'.code || cp == '\n'.code || cp == '\r'.code) return false
return when (Character.getType(cp)) {
Character.CONTROL.toInt(), Character.FORMAT.toInt(),
Character.SURROGATE.toInt(), Character.PRIVATE_USE.toInt(),
Character.UNASSIGNED.toInt() -> true
else -> false
}
}

private fun isPunct(cp: Int): Boolean {
if ((cp in 33..47) || (cp in 58..64) || (cp in 91..96) || (cp in 123..126)) return true
return when (Character.getType(cp)) {
Character.CONNECTOR_PUNCTUATION.toInt(), Character.DASH_PUNCTUATION.toInt(),
Character.START_PUNCTUATION.toInt(), Character.END_PUNCTUATION.toInt(),
Character.OTHER_PUNCTUATION.toInt(), Character.INITIAL_QUOTE_PUNCTUATION.toInt(),
Character.FINAL_QUOTE_PUNCTUATION.toInt() -> true
else -> false
}
}

private fun isCjk(cp: Int): Boolean =
(cp in 0x4E00..0x9FFF) || (cp in 0x3400..0x4DBF) || (cp in 0x20000..0x2A6DF) ||
(cp in 0x2A700..0x2B73F) || (cp in 0x2B740..0x2B81F) || (cp in 0x2B820..0x2CEAF) ||
(cp in 0xF900..0xFAFF) || (cp in 0x2F800..0x2FA1F)

fun load(tokenizerJsonPath: String): BertWordPieceTokenizer {
val vocab = HashMap<String, Int>(35_000)
File(tokenizerJsonPath).inputStream().use { stream ->
JsonReader(InputStreamReader(stream, Charsets.UTF_8)).use { r ->
r.beginObject()
while (r.hasNext()) {
if (r.nextName() == "model") {
r.beginObject()
while (r.hasNext()) {
if (r.nextName() == "vocab") {
r.beginObject()
while (r.hasNext()) vocab[r.nextName()] = r.nextInt()
r.endObject()
} else r.skipValue()
}
r.endObject()
} else r.skipValue()
}
r.endObject()
}
}
return BertWordPieceTokenizer(vocab, vocab["[CLS]"] ?: 101, vocab["[SEP]"] ?: 102, vocab["[UNK]"] ?: 100)
}
}
}
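
Reviewer note: the greedy longest-match step in `wordPieces` is easier to follow on a toy vocabulary. The sketch below mirrors the loop above in a standalone function; the `wordPiece` name and the five-entry vocabulary are made up for illustration and are not the model's real vocabulary.

```kotlin
// Standalone sketch of greedy longest-match WordPiece for a single token.
// Non-initial pieces carry the "##" continuation prefix; if any position
// has no matching piece, the whole token collapses to the unknown marker.
fun wordPiece(token: String, vocab: Set<String>, unk: String = "[UNK]"): List<String> {
    val pieces = ArrayList<String>()
    var start = 0
    while (start < token.length) {
        var end = token.length
        var match: String? = null
        // Try the longest remaining substring first, shrinking from the right.
        while (start < end) {
            val sub = (if (start > 0) "##" else "") + token.substring(start, end)
            if (sub in vocab) { match = sub; break }
            end--
        }
        if (match == null) return listOf(unk)
        pieces.add(match)
        start = end
    }
    return pieces
}

// Toy vocabulary for the usage note below (hypothetical entries).
val toyVocab = setOf("un", "##aff", "##able", "##a", "premium")
```

With `toyVocab`, `wordPiece("unaffable", toyVocab)` yields `[un, ##aff, ##able]`, while `wordPiece("premiums", toyVocab)` yields a single `[UNK]` because no `##s` piece exists, matching the all-or-nothing behaviour of the class above.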
Binary file not shown.
@@ -31,6 +31,9 @@ class MediaPipeEmbeddingEngine(
private var textEmbedder: TextEmbedder? = null
private val mutex = Mutex()

/** USE-Lite emits 100-dim vectors. */
override val dimensions: Int = 100

/**
* Initializes the embedder with a model path.
*/
@@ -56,7 +59,9 @@ class MediaPipeEmbeddingEngine(
}
}

override suspend fun embed(text: String): FloatArray = withContext(Dispatchers.IO) {
// USE-Lite is symmetric: query and document are embedded identically, so
// [task]/[title] are ignored.
override suspend fun embed(text: String, task: EmbeddingTask, title: String?): FloatArray = withContext(Dispatchers.IO) {
Napier.d(tag = "EmbeddingEngine") { "embed START hash=${System.identityHashCode(this@MediaPipeEmbeddingEngine)} text_len=${text.length} first50=${text.take(50)}" }
val embedder = mutex.withLock { textEmbedder } ?: run {
Napier.e(tag = "EmbeddingEngine") { "Embedding model not loaded! hash=${System.identityHashCode(this@MediaPipeEmbeddingEngine)}" }
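
Reviewer note: USE-Lite ignores `task` and `title`, but an asymmetric embedder needs them to build its instruction prefix. A minimal sketch of what that could look like for EmbeddingGemma follows. The `Task` enum and `withTaskPrefix` helper are hypothetical (the members of the PR's `EmbeddingTask` are not visible in this diff), and the prompt strings are the retrieval templates as I understand them from the published model card, so treat them as an assumption to verify.

```kotlin
// Hypothetical stand-in for the PR's EmbeddingTask, whose members are not shown here.
enum class Task { QUERY, DOCUMENT }

// Builds the text actually fed to the tokenizer. The templates are assumed
// from the EmbeddingGemma model card; a missing title falls back to "none".
fun withTaskPrefix(text: String, task: Task, title: String? = null): String = when (task) {
    Task.QUERY -> "task: search result | query: $text"
    Task.DOCUMENT -> "title: ${title ?: "none"} | text: $text"
}
```

A query such as "car insurance premium" and the chunk it should retrieve would then be embedded with different prefixes, which is the asymmetry the new `embed(text, task, title)` signature makes room for.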