feat(rag): EmbeddingGemma-300M via ONNX + hybrid retrieval, doc-relevance gate, reranker #32
Merged
Replace the USE-Lite (100-dim TFLite) embedder with EmbeddingGemma-300M
running on ONNX Runtime — telemetry-free, no MediaPipe/Play Services. The
embedder is device-tiered: USE-Lite (<6GB), Gemma@256 (6-8GB),
Gemma@256+reranker (8-10GB), Gemma@512+reranker (>=10GB), with a
recommendation engine and on-device gated download.
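The tiering above can be sketched as a simple mapping from device RAM to an embedder configuration. This is a minimal illustration, not the project's `EmbedderRecommendation` code: the `EmbedderTier` enum, the `recommendTier` function, and the 6000 MB and 10000 MB cut-offs are assumptions. Only the 8000 MB reranker threshold is stated in this commit message.

```kotlin
// Hypothetical tier model: names and the 6000/10000 MB cut-offs are assumptions.
enum class EmbedderTier(val label: String, val dim: Int, val reranker: Boolean) {
    USE_LITE("USE-Lite", 100, false),
    GEMMA_256("Gemma@256", 256, false),
    GEMMA_256_RERANK("Gemma@256+reranker", 256, true),
    GEMMA_512_RERANK("Gemma@512+reranker", 512, true),
}

// Map total device RAM (in MB) to the tier described in the commit message.
fun recommendTier(totalRamMb: Long): EmbedderTier = when {
    totalRamMb < 6_000 -> EmbedderTier.USE_LITE
    totalRamMb < 8_000 -> EmbedderTier.GEMMA_256
    totalRamMb < 10_000 -> EmbedderTier.GEMMA_256_RERANK // reranker from 8000 MB
    else -> EmbedderTier.GEMMA_512_RERANK
}
```

Keeping the mapping a pure function of the RAM figure makes the tier boundaries straightforward to unit-test without a device.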
Embedder upgrade:
- OnnxEmbeddingEngine + OnnxReranker (ms-marco MiniLM-L6 cross-encoder).
- Pure-Kotlin tokenizers (GemmaBpeTokenizer, BertWordPieceTokenizer)
because onnxruntime-extensions has no GemmaTokenizer; validated vs HF.
- Matryoshka truncation (128/256/512) + per-dim ObjectBox HNSW entities
(GemmaChunk128/256/512) with dim routing in the repository.
- Task-aware QUERY/DOCUMENT prompts in EmbeddingEngine.
- Device-tiered EmbedderRecommendation (reranker >=8000MB).
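As an illustration of the Matryoshka truncation item above, a common approach is to keep the leading components of the full embedding and L2-normalise the result again so cosine and dot-product scores stay comparable. The sketch below assumes that approach; `truncateMatryoshka` is a hypothetical helper, not a function from this PR.

```kotlin
import kotlin.math.sqrt

// Keep the first `dim` components of a Matryoshka-trained embedding,
// then L2-normalise the shortened vector (assumed post-processing).
fun truncateMatryoshka(full: FloatArray, dim: Int): FloatArray {
    require(dim in 1..full.size) { "dim must be in 1..${full.size}" }
    val head = full.copyOf(dim)
    var sumSq = 0.0
    for (v in head) sumSq += v.toDouble() * v
    val norm = sqrt(sumSq)
    if (norm == 0.0) return head // all-zero vector: nothing to normalise
    for (i in head.indices) head[i] = (head[i] / norm).toFloat()
    return head
}
```

Because each width produces vectors of a different length, storing them in separate per-dimension entities (as the GemmaChunk128/256/512 split does) avoids mixing incompatible vectors in one index.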
Retrieval-quality fixes (the headline story):
- Document-level DOMINANCE gate: car-insurance queries were grounding on
the wrong document (the life policy, which produced a wrong premium)
because of BM25 lexical pollution; the gate keeps grounding on the
dominant document.
- TITLE-MATCH override: a distinctive query term naming a document by
title now grounds on that document ("insurer of my car policy" went from
the health policy to the correct car policy).
- Per-grounded-turn KV reset (reopenSessionAndAwait, MAX_PREFILL_TURNS=16):
grounded answers were truncating to 1-2 tokens after a few turns because
the stateful LiteRT-LM KV cache accumulated each turn's grounding block.
- Document-level self-healing embedding migration in RagHolder.
- Cross-encoder reranker enabled (ungated) on the 8GB tier, fixing a
health-insurer recall miss.
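The bounded re-prefill behind the per-grounded-turn KV reset can be sketched as follows. This is only an assumption about its shape: the `Turn` type and `boundedPrefill` function are hypothetical, the LiteRT-LM session API (including `reopenSessionAndAwait`) is not modelled, and whether a "turn" counts a single message or a user/assistant pair is not stated in the source.

```kotlin
// Value taken from the commit message; everything else here is illustrative.
val MAX_PREFILL_TURNS = 16

// Hypothetical record of one visible chat message, without any grounding block.
data class Turn(val role: String, val text: String)

// Choose what to replay into a freshly reopened session: only the most
// recent visible turns, so earlier grounding blocks never re-enter the cache.
fun boundedPrefill(
    visibleHistory: List<Turn>,
    maxTurns: Int = MAX_PREFILL_TURNS,
): List<Turn> {
    require(maxTurns >= 0) { "maxTurns must be non-negative" }
    return visibleHistory.takeLast(maxTurns)
}
```

The point of the design is that the context cost per grounded turn stays bounded: the session is rebuilt from short visible messages plus the current grounding block, rather than accumulating every previous block.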
Hybrid retrieval = vector + BM25 + Reciprocal Rank Fusion. 11/11 retriever
unit tests pass; release build green; verified on-device (Realme CPH2723).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
App versionName 0.9.0 -> 0.10.0 (versionCode 7 -> 8); engine lib version
bumped in lockstep (0.9.0 -> 0.10.0). Promote the EmbeddingGemma RAG work
from [Unreleased] to a dated [0.10.0] section and document the
retrieval-quality fixes (hybrid retrieval/RRF, dominance gate, title-match
override, per-grounded-turn KV reset, self-healing migration, reranker).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Replaces the USE-Lite (100-dim TFLite) document embedder with EmbeddingGemma-300M on ONNX Runtime — telemetry-free, no MediaPipe/Play Services — and rebuilds document retrieval as a hybrid pipeline with a set of retrieval-quality fixes verified on-device.
Embedder upgrade
- EmbeddingGemma-300M on ONNX Runtime (OnnxEmbeddingEngine), with an optional ms-marco-MiniLM-L6 cross-encoder reranker (OnnxReranker).
- Device-tiered embedder choice, computed by EmbedderRecommendation and surfaced as a "Recommended" badge. Models download on-device through the catalogue; nothing is bundled.
- Pure-Kotlin tokenizers (GemmaBpeTokenizer, BertWordPieceTokenizer) reading the HF tokenizer.json, validated against the reference transformers tokenizer; onnxruntime-extensions has no GemmaTokenizer.
- Matryoshka truncation with per-dim ObjectBox HNSW entities (GemmaChunk128/256/512) + dim routing in the repository.
Hybrid retrieval
Dense vector search + BM25 lexical scoring fused via Reciprocal Rank Fusion, with a per-document cap, wider candidate pools, and a larger grounding budget.
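To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over a vector ranking and a BM25 ranking, followed by a per-document cap. It is not the PR's retriever: the `Chunk` type, the `fuseRrf` function, the smoothing constant k = 60 (a commonly used RRF default), and the cap and limit values are all assumptions for illustration.

```kotlin
// Hypothetical chunk handle: a chunk id plus the document it belongs to.
data class Chunk(val id: String, val docId: String)

// RRF: score(chunk) = sum over rankings of 1 / (k + rank), with rank starting at 1.
// After fusion, at most `perDocCap` chunks per document are kept.
fun fuseRrf(
    vectorRanked: List<Chunk>,
    bm25Ranked: List<Chunk>,
    k: Int = 60,        // assumed smoothing constant
    perDocCap: Int = 3, // assumed cap
    limit: Int = 8,     // assumed grounding budget, in chunks
): List<Chunk> {
    require(limit > 0) { "limit must be positive" }
    val scores = LinkedHashMap<Chunk, Double>()
    for (ranking in listOf(vectorRanked, bm25Ranked)) {
        ranking.forEachIndexed { index, chunk ->
            scores[chunk] = (scores[chunk] ?: 0.0) + 1.0 / (k + index + 1)
        }
    }
    val perDoc = HashMap<String, Int>()
    val fused = ArrayList<Chunk>()
    // sortedByDescending is stable, so ties keep first-seen order.
    for ((chunk, _) in scores.entries.sortedByDescending { it.value }) {
        val used = perDoc[chunk.docId] ?: 0
        if (used >= perDocCap) continue
        perDoc[chunk.docId] = used + 1
        fused.add(chunk)
        if (fused.size == limit) break
    }
    return fused
}
```

RRF uses only rank positions, so the cosine similarities and BM25 scores never need to be put on a common scale. The cap stops one long document from filling the whole grounding budget.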
Retrieval-quality fixes (the war story)
A real personal-finance project held three policies: a car policy (TATA AIG, ₹8,504), a life policy (Future Generali, ₹41,799), and a health policy (ICICI Lombard).
- Per-grounded-turn KV reset (reopenSessionAndAwait) that re-prefills only bounded visible history (MAX_PREFILL_TURNS=16).
Testing
- DefaultDocumentRetrieverTest retriever unit tests pass (11/11).
- :sample-app:assembleRelease (R8 + signed) builds green.
Known residual
The life policy's sum assured sits inside a garbled extracted table (PDF text-extraction artifact), so it can be missed at the first-stage recall step. This is a PDF-extraction / first-stage recall limitation, not a ranking bug — documented for follow-up.
🤖 Generated with Claude Code