
feat(rag): EmbeddingGemma-300M via ONNX + hybrid retrieval, doc-relevance gate, reranker#32

Merged
sagar-develop merged 2 commits into main from feat/embedding-gemma on Jun 8, 2026

@sagar-develop

Owner

Summary

Replaces the USE-Lite (100-dim TFLite) document embedder with EmbeddingGemma-300M on ONNX Runtime — telemetry-free, no MediaPipe/Play Services — and rebuilds document retrieval as a hybrid pipeline with a set of retrieval-quality fixes verified on-device.

Embedder upgrade

  • EmbeddingGemma-300M via ONNX Runtime (OnnxEmbeddingEngine), with an optional ms-marco-MiniLM-L6 cross-encoder reranker (OnnxReranker).
  • Device-tiered: USE-Lite (<6 GB) · Gemma@256 (6–8 GB) · Gemma@256 + reranker (8–10 GB) · Gemma@512 + reranker (≥10 GB), driven by EmbedderRecommendation and surfaced as a "Recommended" badge. Models download on-device through the catalogue; nothing is bundled.
  • Pure-Kotlin tokenizers (GemmaBpeTokenizer, BertWordPieceTokenizer) reading the HF tokenizer.json, validated against the reference transformers tokenizer — onnxruntime-extensions has no GemmaTokenizer.
  • Matryoshka truncation (128/256/512) with per-dim ObjectBox HNSW entities (GemmaChunk128/256/512) + dim routing in the repository.
  • Task-aware QUERY/DOCUMENT embedding prompts.
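
The tiering and Matryoshka truncation described above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's Kotlin code: `EmbedderTier`, `recommend_tier`, and `truncate_embedding` are hypothetical names, and the 6000 MB and 10000 MB cut-offs are assumed by analogy with the reranker threshold (>=8000MB) given in the commit message. The truncation step keeps the leading `dim` components of the full vector and L2-normalizes again, which is the usual way Matryoshka embeddings are shortened.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class EmbedderTier:
    """Hypothetical description of one device tier."""
    model: str       # "use-lite" or "embedding-gemma"
    dim: int         # embedding dimension stored in the index
    reranker: bool   # whether the cross-encoder reranker is enabled


def recommend_tier(ram_mb: int) -> EmbedderTier:
    """Map device RAM to a tier, following the tiers listed in the PR.

    Only the 8000 MB reranker threshold appears in the source; the
    6000 MB and 10000 MB values are assumptions for this sketch.
    """
    if ram_mb < 6000:
        return EmbedderTier("use-lite", 100, False)
    if ram_mb < 8000:
        return EmbedderTier("embedding-gemma", 256, False)
    if ram_mb < 10000:
        return EmbedderTier("embedding-gemma", 256, True)
    return EmbedderTier("embedding-gemma", 512, True)


def truncate_embedding(vec: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and L2-normalize the result."""
    if dim <= 0 or dim > len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero vector cannot be normalized
    return [x / norm for x in head]
```

Re-normalizing after truncation matters when the index compares vectors by cosine or dot product, because a truncated prefix of a unit vector is generally no longer unit length.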

Hybrid retrieval

Dense vector search + BM25 lexical scoring fused via Reciprocal Rank Fusion, with a per-document cap, wider candidate pools, and a larger grounding budget.
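
As a rough illustration of the fusion step, here is a small Python sketch of Reciprocal Rank Fusion followed by a per-document cap. It is not the project's Kotlin implementation: the function names, the `(document id, chunk index)` chunk representation, and the constant `k = 60` (the value conventionally used with RRF) are assumptions, since the PR does not state the constants it uses.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Chunk = Tuple[str, int]  # (document id, chunk index); hypothetical representation


def rrf_fuse(rankings: Sequence[Sequence[Chunk]], k: int = 60) -> List[Tuple[Chunk, float]]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) per chunk."""
    scores: Dict[Chunk, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk in enumerate(ranking, start=1):
            scores[chunk] += 1.0 / (k + rank)
    # Highest fused score first; the chunk id breaks ties deterministically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def cap_per_document(
    fused: Sequence[Tuple[Chunk, float]], max_per_doc: int, limit: int
) -> List[Tuple[Chunk, float]]:
    """Walk the fused list in order, keeping at most `max_per_doc` chunks per document."""
    if limit <= 0:
        return []
    kept: List[Tuple[Chunk, float]] = []
    per_doc: Dict[str, int] = defaultdict(int)
    for chunk, score in fused:
        doc_id = chunk[0]
        if per_doc[doc_id] >= max_per_doc:
            continue  # this document already has its share of slots
        per_doc[doc_id] += 1
        kept.append((chunk, score))
        if len(kept) == limit:
            break
    return kept
```

Because RRF uses only ranks, the dense similarity scores and the BM25 scores never need to be put on a common scale, and the cap stops a single long document from filling the whole grounding budget.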

Retrieval-quality fixes (the war story)

A real personal-finance project held three policies: a car policy (TATA AIG, ₹8,504), a life policy (Future Generali, ₹41,799), and a health policy (ICICI Lombard).

  1. Wrong-document grounding: "car insurance premium" answered ₹41,799 from the life policy instead of the correct car premium, ₹8,504, because BM25 lexical overlap let the life PDF out-score the car PDF. Fixed by a document-level dominance gate that keeps grounding on the document that genuinely dominates the candidate set.
  2. Title-match override: "who is the insurer of my car policy" grounded on the health policy, whose formal "…insurer" phrasing out-scored the car doc. A distinctive query term that names a doc by title now grounds on that doc, so the answer changed from ICICI Lombard to the correct TATA AIG.
  3. Truncated grounded answers: grounded replies collapsed to 1–2 tokens after a few turns because the stateful LiteRT-LM KV cache accumulated each turn's grounding block. Fixed by a per-grounded-turn session reset (reopenSessionAndAwait) that re-prefills only bounded visible history (MAX_PREFILL_TURNS=16).
  4. Self-healing migration: on next open, each document is re-indexed at document level into the active embedder's index, with no re-import or OCR required.
  5. Reranker recall win: enabling the ungated cross-encoder reranker on the 8 GB tier recovered a health-insurer chunk that the first-stage fusion had ranked too low.
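
The PR does not spell out how the dominance gate and the title-match override decide, so the following Python sketch shows only one plausible reading of fixes 1 and 2, not the project's Kotlin code. Everything in it is hypothetical: the function names, the rule that a document "dominates" when it holds at least `min_share` of the summed fused scores, the 0.6 default, and the definition of a "distinctive" term as one that appears in exactly one document title.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Set, Tuple

ScoredChunk = Tuple[str, float]  # (document id, fused score); hypothetical representation


def title_match(query_terms: Set[str], titles: Dict[str, Set[str]]) -> Optional[str]:
    """Return the one document whose title holds a query term no other title has."""
    hits: Set[str] = set()
    for doc_id, title_terms in titles.items():
        others: Set[str] = set()
        for other_id, other_terms in titles.items():
            if other_id != doc_id:
                others |= other_terms
        if (query_terms & title_terms) - others:
            hits.add(doc_id)
    return next(iter(hits)) if len(hits) == 1 else None


def dominant_document(chunks: Sequence[ScoredChunk], min_share: float = 0.6) -> Optional[str]:
    """Return the document holding at least `min_share` of the total score, if any."""
    totals: Dict[str, float] = defaultdict(float)
    for doc_id, score in chunks:
        totals[doc_id] += score
    total = sum(totals.values())
    if total <= 0.0:
        return None
    best = max(totals, key=lambda d: totals[d])
    return best if totals[best] / total >= min_share else None


def select_grounding(
    query_terms: Set[str],
    titles: Dict[str, Set[str]],
    chunks: Sequence[ScoredChunk],
    min_share: float = 0.6,
) -> List[ScoredChunk]:
    """Title match wins first; otherwise fall back to the dominance gate."""
    target = title_match(query_terms, titles) or dominant_document(chunks, min_share)
    if target is None:
        return list(chunks)  # no clear winner: keep the mixed candidate set
    return [c for c in chunks if c[0] == target]
```

Under these assumptions, a single high-scoring chunk from the wrong policy cannot take over grounding when most of the candidate mass belongs to another document, and a query that names a document by a distinctive title term bypasses the score comparison entirely.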

Testing

  • 11/11 DefaultDocumentRetrieverTest retriever unit tests pass.
  • :sample-app:assembleRelease (R8 + signed) builds green.
  • Verified on-device (Realme CPH2723): all four grounding scenarios above behave correctly post-fix.

Known residual

The life policy's sum assured sits inside a garbled extracted table (PDF text-extraction artifact), so it can be missed at the first-stage recall step. This is a PDF-extraction / first-stage recall limitation, not a ranking bug — documented for follow-up.

🤖 Generated with Claude Code

sagar-develop and others added 2 commits June 9, 2026 01:15
…ance gate, reranker

Replace the USE-Lite (100-dim TFLite) embedder with EmbeddingGemma-300M
running on ONNX Runtime — telemetry-free, no MediaPipe/Play Services. The
embedder is device-tiered: USE-Lite (<6GB), Gemma@256 (6-8GB),
Gemma@256+reranker (8-10GB), Gemma@512+reranker (>=10GB), with a
recommendation engine and on-device gated download.

Embedder upgrade:
- OnnxEmbeddingEngine + OnnxReranker (ms-marco MiniLM-L6 cross-encoder).
- Pure-Kotlin tokenizers (GemmaBpeTokenizer, BertWordPieceTokenizer)
  because onnxruntime-extensions has no GemmaTokenizer; validated vs HF.
- Matryoshka truncation (128/256/512) + per-dim ObjectBox HNSW entities
  (GemmaChunk128/256/512) with dim routing in the repository.
- Task-aware QUERY/DOCUMENT prompts in EmbeddingEngine.
- Device-tiered EmbedderRecommendation (reranker >=8000MB).

Retrieval-quality fixes (the headline story):
- Document-level DOMINANCE gate: car-insurance queries were grounding on
  the wrong document (a life policy answering a wrong premium) due to BM25
  lexical pollution; the gate keeps grounding on the dominant document.
- TITLE-MATCH override: a distinctive query term naming a document by
  title now grounds on that document ("insurer of my car policy" went from
  the health policy to the correct car policy).
- Per-grounded-turn KV reset (reopenSessionAndAwait, MAX_PREFILL_TURNS=16):
  grounded answers were truncating to 1-2 tokens after a few turns because
  the stateful LiteRT-LM KV cache accumulated each turn's grounding block.
- Document-level self-healing embedding migration in RagHolder.
- Cross-encoder reranker enabled (ungated) on the 8GB tier, fixing a
  health-insurer recall miss.

Hybrid retrieval = vector + BM25 + Reciprocal Rank Fusion. 11/11 retriever
unit tests pass; release build green; verified on-device (Realme CPH2723).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
App versionName 0.9.0 -> 0.10.0 (versionCode 7 -> 8); engine lib version
bumped in lockstep (0.9.0 -> 0.10.0). Promote the EmbeddingGemma RAG work
from [Unreleased] to a dated [0.10.0] section and document the
retrieval-quality fixes (hybrid retrieval/RRF, dominance gate, title-match
override, per-grounded-turn KV reset, self-healing migration, reranker).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
sagar-develop merged commit b779db3 into main on Jun 8, 2026
1 check passed