Summary
RAG retrieval sometimes grounds on the wrong document (observed: a query about one source is answered from an unrelated one). Root cause is primarily the embedding model, not the LLM: USE-Lite produces only 100-dimensional vectors, which are too coarse to separate semantically close documents (e.g. car vs health insurance, two similar PDFs). There is also no reranker and the chunker is character-based, so even when the right chunk is retrieved it can be split/diluted.
This issue proposes upgrading the embedding pipeline and making embedding scale with device capability, mirroring the LLM tiering the app already has.
Current state (from code analysis)
- Embedder: Universal Sentence Encoder (USE-Lite), universal_sentence_encoder.tflite (~6 MB) via MediaPipe Tasks TextEmbedder, in lib/src/androidMain/kotlin/com/sagar/aicore/MediaPipeEmbeddingEngine.kt.
- Dimension: 100-dim, hard-coded; not exposed by the interface. EmbeddingEngine (lib/src/commonMain/kotlin/com/sagar/aicore/EmbeddingEngine.kt) is symmetric/task-agnostic (embed(text) only, with no query vs document distinction).
- Vector index: ObjectBox @HnswIndex(dimensions = 100L, distanceType = COSINE, …) on DocumentChunkEntity (sample-app/.../data/db/Entities.kt). HNSW dimension is a compile-time literal, so changing dims requires a new entity.
- Retrieval: DefaultDocumentRetriever, a hybrid of vector + BM25 search fused via Reciprocal Rank Fusion. RagConfig defaults: chunkSize=500, chunkOverlap=50, relevanceMaxDistance=0.75, vectorPoolSize=30, keywordPoolSize=120, maxContextChars=4000. No per-document cap; no reranker.
- Chunker: TextChunker is a character sliding window (not token-aware).
- Device tiers already exist for LLMs in NativeLmModelCatalog.kt (minDeviceRamMb, ~3.5 GB → 12 GB+), but embedding does not tier: every device uses USE-Lite.
- A planning doc already exists: docs/EMBEDDING_GEMMA_PLAN.md (EmbeddingGemma 256-dim default, USE-Lite low-end fallback, task-aware interface, parallel GemmaChunkEntity, re-index from stored text). This issue extends/formalizes that.
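For readers unfamiliar with the fusion step mentioned above, here is a minimal sketch of Reciprocal Rank Fusion over ranked lists of chunk ids. It is not the code in DefaultDocumentRetriever; the function name and the constant k = 60 (the value commonly used for RRF) are assumptions for illustration.

```kotlin
/**
 * Reciprocal Rank Fusion over several ranked lists of chunk ids.
 * Each input list is ordered best-first. The default k = 60 is an
 * assumption, not a value taken from the codebase.
 */
fun reciprocalRankFusion(
    rankedLists: List<List<Long>>,
    k: Int = 60,
): List<Pair<Long, Double>> {
    val scores = LinkedHashMap<Long, Double>()
    for (list in rankedLists) {
        list.forEachIndexed { index, id ->
            // Ranks are 1-based, so the top hit of a list contributes 1 / (k + 1).
            scores[id] = (scores[id] ?: 0.0) + 1.0 / (k + index + 1)
        }
    }
    // Highest fused score first.
    return scores.entries
        .sortedByDescending { it.value }
        .map { it.key to it.value }
}
```

A chunk that appears in both the vector list and the BM25 list accumulates two contributions, which is why hybrid hits tend to rise to the top.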
Proposed direction
1. Adopt EmbeddingGemma with Matryoshka (the key lever)
EmbeddingGemma (308M, Gemma-3-based) is purpose-built for on-device RAG: #1 on MTEB under 500M params, <200 MB RAM with QAT, 2048-token context, 100+ languages, and a LiteRT/litert-community build — so it can run on the same LiteRT runtime we already use for the LLM (no new ONNX dependency strictly required; evaluate LiteRT vs ONNX). Crucially it uses Matryoshka Representation Learning (MRL): one model whose output can be truncated to 768 / 512 / 256 / 128 dims with minimal quality loss — perfect for device tiering.
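Matryoshka truncation can be sketched as: keep the first N components of the full embedding, then L2-normalize again so cosine distance stays meaningful. The function name below is hypothetical, and whether the runtime already returns a truncated vector should be checked against the model card.

```kotlin
import kotlin.math.sqrt

/**
 * Truncates a full-size embedding to its first [targetDim] components and
 * re-normalizes it to unit length. Hypothetical helper, not existing app code.
 */
fun truncateAndNormalize(embedding: FloatArray, targetDim: Int): FloatArray {
    require(targetDim in 1..embedding.size) {
        "targetDim must be between 1 and ${embedding.size}, was $targetDim"
    }
    val head = embedding.copyOf(targetDim)
    var sumSquares = 0.0
    for (v in head) sumSquares += v.toDouble() * v.toDouble()
    val norm = sqrt(sumSquares)
    if (norm == 0.0) return head // all-zero vector: nothing to normalize
    for (i in head.indices) head[i] = (head[i] / norm).toFloat()
    return head
}
```

Because every tier would derive its vector from the same model output, a device could in principle store the full vector and truncate at index time.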
2. Tier the embedding by device capability
Pick the embedding model + truncation dim at first run from minDeviceRamMb, reusing the existing tier logic:
| Device tier | Embedder | Dim | Reranker |
| --- | --- | --- | --- |
| Entry (~3–4 GB) | USE-Lite (fallback) or EmbeddingGemma @128 | 100 / 128 | none |
| Mid (~6–7 GB) | EmbeddingGemma | 256 | none |
| High / Flagship (~10 GB+) | EmbeddingGemma | 512–768 | cross-encoder rerank |
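The table above could translate into a selection function along these lines. The type names, the exact RAM cut-offs (6,000 and 10,000 MB), and the choice of USE-Lite for the entry tier and 512 dims for the high tier are all assumptions picked from within the ranges in the table.

```kotlin
enum class Embedder { USE_LITE, EMBEDDING_GEMMA }

/** Hypothetical per-device embedding configuration. */
data class EmbeddingTier(
    val embedder: Embedder,
    val dimension: Int,
    val useReranker: Boolean,
)

/**
 * Maps device RAM (in MB) to an embedding tier.
 * Thresholds are illustrative and should be aligned with the
 * minDeviceRamMb values used for LLM tiering.
 */
fun selectEmbeddingTier(deviceRamMb: Int): EmbeddingTier = when {
    deviceRamMb >= 10_000 -> EmbeddingTier(Embedder.EMBEDDING_GEMMA, 512, useReranker = true)
    deviceRamMb >= 6_000 -> EmbeddingTier(Embedder.EMBEDDING_GEMMA, 256, useReranker = false)
    else -> EmbeddingTier(Embedder.USE_LITE, 100, useReranker = false)
}
```

Persisting the chosen tier at first run matters, since a later change of dimension means switching to a different index entity.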
3. Add task-aware embedding (query vs document)
EmbeddingGemma is asymmetric — it expects task prefixes (verify exact strings against the model card, but roughly task: search result | query: {q} for queries and title: none | text: {chunk} for documents). Extend EmbeddingEngine with an EmbeddingTask { QUERY, DOCUMENT } parameter and apply the right prefix in the retriever vs ingestor. This alone typically lifts retrieval meaningfully.
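A minimal sketch of the task-aware wrapper, assuming the prefix strings quoted above (they still need to be verified against the model card). PrefixingEmbedder and applyTaskPrefix are hypothetical names, and rawEmbed stands in for whatever the engine exposes today.

```kotlin
enum class EmbeddingTask { QUERY, DOCUMENT }

/** Prepends the task prefix; the strings are assumptions to verify against the model card. */
fun applyTaskPrefix(text: String, task: EmbeddingTask): String = when (task) {
    EmbeddingTask.QUERY -> "task: search result | query: $text"
    EmbeddingTask.DOCUMENT -> "title: none | text: $text"
}

/**
 * Wraps a task-agnostic embed function so callers must state whether they
 * are embedding a query (retriever) or a document chunk (ingestor).
 */
class PrefixingEmbedder(private val rawEmbed: (String) -> FloatArray) {
    fun embed(text: String, task: EmbeddingTask): FloatArray =
        rawEmbed(applyTaskPrefix(text, task))
}
```

Making the task a required parameter, rather than a default, keeps the retriever and ingestor from silently using the same prefix.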
4. Add a reranker on capable tiers
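After fusion, run a cross-encoder reranker over the top ~30–100 candidates to produce a precision-tuned top-k. ms-marco-MiniLM-L-6-v2 is the lightweight option; bge-reranker-v2-m3 is stronger and multilingual but much larger (hundreds of millions of parameters), so it is realistic only on flagship devices. Gains of roughly +5 to +15 NDCG@10 are typical, but should be confirmed on our own corpus. Gate the reranker to High/Flagship tiers to keep low-end latency acceptable.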
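The rerank stage could be sketched as below. CrossEncoder is a hypothetical interface standing in for whichever model is chosen, and passing null models the tiers where the reranker is disabled; the pool size and top-k defaults are illustrative.

```kotlin
/** A fused candidate coming out of hybrid retrieval (hypothetical shape). */
data class Candidate(val chunkId: Long, val text: String, val fusedScore: Double)

/** Scores a (query, passage) pair; higher means more relevant. */
fun interface CrossEncoder {
    fun score(query: String, passage: String): Float
}

/**
 * Re-orders the top [poolSize] fused candidates by cross-encoder score and
 * returns the best [topK]. With no cross-encoder, the fused order is kept.
 */
fun rerank(
    query: String,
    fused: List<Candidate>,
    crossEncoder: CrossEncoder?,
    poolSize: Int = 30,
    topK: Int = 8,
): List<Candidate> {
    if (crossEncoder == null) return fused.take(topK)
    return fused.take(poolSize)
        .map { it to crossEncoder.score(query, it.text) }
        .sortedByDescending { it.second }
        .take(topK)
        .map { it.first }
}
```

Cost scales linearly with poolSize because each candidate needs its own forward pass, which is the main knob for the latency budget.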
5. Better, structure-aware chunking
Move from a pure character window toward token-aware chunking that respects sentence and paragraph boundaries, so tables and numeric rows aren't split mid-record. Consider larger chunks now that EmbeddingGemma supports 2048 tokens.
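As a first step in that direction, a sentence-boundary chunker might look like the sketch below. It still budgets by characters rather than tokens (a token-aware version would need the model's tokenizer), has no overlap, and uses a simple punctuation regex; all of these are simplifying assumptions.

```kotlin
/**
 * Packs whole sentences into chunks of at most [maxChars] characters.
 * A single sentence longer than [maxChars] is emitted as its own oversize
 * chunk rather than being cut in the middle.
 */
fun chunkBySentence(text: String, maxChars: Int = 500): List<String> {
    // Split on whitespace that follows sentence-ending punctuation.
    val sentences = text.split(Regex("(?<=[.!?])\\s+"))
        .map { it.trim() }
        .filter { it.isNotEmpty() }

    val chunks = mutableListOf<String>()
    val current = StringBuilder()
    for (sentence in sentences) {
        // Flush when adding this sentence (plus a space) would exceed the budget.
        if (current.isNotEmpty() && current.length + 1 + sentence.length > maxChars) {
            chunks += current.toString()
            current.clear()
        }
        if (current.isNotEmpty()) current.append(' ')
        current.append(sentence)
    }
    if (current.isNotEmpty()) chunks += current.toString()
    return chunks
}
```

Table rows would need separate handling (for example, splitting on newlines first), since they rarely end in sentence punctuation.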
Quick wins (no re-import, ship first)
These were prototyped and reverted; they help even before the embedder swap:
- Per-document cap in DefaultDocumentRetriever so one large source can't fill every top-k slot (maxChunksPerDocument in RagConfig).
- Bump retrieval k (5 → 8), widen vectorPoolSize/keywordPoolSize, raise maxContextChars.
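The per-document cap can be sketched as a single pass over the already-ranked list. RankedChunk and capPerDocument are hypothetical names; only maxChunksPerDocument and k come from the proposal above.

```kotlin
/** A ranked retrieval hit (hypothetical shape), ordered best-first by the caller. */
data class RankedChunk(val chunkId: Long, val documentId: Long, val score: Double)

/**
 * Walks the ranked list in order and keeps at most [maxChunksPerDocument]
 * chunks from any one document, stopping once [k] chunks are selected.
 */
fun capPerDocument(
    ranked: List<RankedChunk>,
    maxChunksPerDocument: Int,
    k: Int,
): List<RankedChunk> {
    val takenPerDocument = HashMap<Long, Int>()
    val result = ArrayList<RankedChunk>()
    for (chunk in ranked) {
        if (result.size >= k) break
        val count = takenPerDocument[chunk.documentId] ?: 0
        if (count < maxChunksPerDocument) {
            result += chunk
            takenPerDocument[chunk.documentId] = count + 1
        }
    }
    return result
}
```

This is one reason to widen the candidate pools at the same time: with a cap in place, filling k slots may require looking further down the ranking.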
Migration / infra notes
- HNSW dim is fixed per entity → add a new entity per dimension (e.g. GemmaChunkEntity @256) alongside the existing 100-dim one; migrate by re-embedding from stored chunk text (no re-extraction/re-OCR needed).
- Expose dimension and task on EmbeddingEngine so the index and embedder can't silently diverge.
- Add iOS embedding impl (currently none) — EmbeddingGemma LiteRT works cross-platform.
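One way to make the embedder/index mismatch fail loudly is a guard like the sketch below. DimensionedEmbeddingEngine is a hypothetical stand-in for an extended EmbeddingEngine, and indexDimension would come from whichever entity is in use.

```kotlin
/** Hypothetical extension of the engine interface that exposes its output size. */
interface DimensionedEmbeddingEngine {
    val dimension: Int
    fun embed(text: String): FloatArray
}

/** Fails fast at startup if the engine and the vector index disagree on dimension. */
fun requireMatchingDimension(engine: DimensionedEmbeddingEngine, indexDimension: Int) {
    require(engine.dimension == indexDimension) {
        "Embedder produces ${engine.dimension}-dim vectors but the index expects $indexDimension"
    }
}

/** Embeds text and verifies the actual vector length before it reaches the index. */
fun checkedEmbed(
    engine: DimensionedEmbeddingEngine,
    text: String,
    indexDimension: Int,
): FloatArray {
    val vector = engine.embed(text)
    check(vector.size == indexDimension) {
        "Got ${vector.size}-dim vector, index expects $indexDimension"
    }
    return vector
}
```

Running the first check once at startup and the second on every write covers both a misconfigured tier and a model file that does not match its declared dimension.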
Open questions / risks
- P2P/mesh sync across tiers: two devices with different embedding dims can't share a vector index. Standardize on an interop dimension (likely 256) for shared/synced packs; let flagship optionally use higher dims locally only. Needs a decision.
- Cold-start cost: EmbeddingGemma is heavier than USE-Lite — measure ingest latency on entry devices; that's why the low tier keeps USE-Lite/128-dim.
- Reranker model size/latency budget on-device — benchmark before enabling by default.
References
docs/EMBEDDING_GEMMA_PLAN.md