RAG embedding quality: replace USE-Lite (100-dim) with a device-tiered embedding + reranking stack #30

@sagar-develop

Summary

RAG retrieval sometimes grounds on the wrong document (observed: a query about one source is answered from an unrelated one). Root cause is primarily the embedding model, not the LLM: USE-Lite produces only 100-dimensional vectors, which are too coarse to separate semantically close documents (e.g. car vs health insurance, two similar PDFs). There is also no reranker and the chunker is character-based, so even when the right chunk is retrieved it can be split/diluted.

This issue proposes upgrading the embedding pipeline and making embedding scale with device capability, mirroring the LLM tiering the app already has.

Current state (from code analysis)

  • Embedder: Universal Sentence Encoder (USE-Lite), universal_sentence_encoder.tflite (~6 MB) via MediaPipe Tasks TextEmbedder (lib/src/androidMain/kotlin/com/sagar/aicore/MediaPipeEmbeddingEngine.kt).
  • Dimension: 100-dim, hard-coded; not exposed by the interface. EmbeddingEngine (lib/src/commonMain/kotlin/com/sagar/aicore/EmbeddingEngine.kt) is symmetric/task-agnostic (embed(text) only — no query vs document distinction).
  • Vector index: ObjectBox @HnswIndex(dimensions = 100L, distanceType = COSINE, …) on DocumentChunkEntity (sample-app/.../data/db/Entities.kt). HNSW dimension is a compile-time literal, so changing dims requires a new entity.
  • Retrieval: DefaultDocumentRetriever — hybrid vector + BM25 fused via Reciprocal Rank Fusion. RagConfig defaults: chunkSize=500, chunkOverlap=50, relevanceMaxDistance=0.75, vectorPoolSize=30, keywordPoolSize=120, maxContextChars=4000. No per-document cap; no reranker.
  • Chunker: TextChunker is a character sliding window (not token-aware).
  • Device tiers already exist for LLMs in NativeLmModelCatalog.kt (minDeviceRamMb, ~3.5 GB → 12 GB+), but embedding does not tier — every device uses USE-Lite.
  • A planning doc already exists: docs/EMBEDDING_GEMMA_PLAN.md (EmbeddingGemma 256-dim default, USE-Lite low-end fallback, task-aware interface, parallel GemmaChunkEntity, re-index from stored text). This issue extends/formalizes that.
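For readers unfamiliar with the fusion step named in the Retrieval bullet, here is a minimal sketch of Reciprocal Rank Fusion. It is not the code in DefaultDocumentRetriever: the function name, the use of Long chunk ids, and the constant k = 60 (a commonly used default) are assumptions for illustration.

```kotlin
// Hypothetical helper, not the project's actual retriever code.
// Each inner list holds chunk ids ordered best-first (e.g. one list from the
// vector search, one from BM25).
fun reciprocalRankFusion(
    rankings: List<List<Long>>,
    k: Int = 60, // assumed smoothing constant; tune per corpus
): List<Pair<Long, Double>> {
    val scores = HashMap<Long, Double>()
    for (ranking in rankings) {
        ranking.forEachIndexed { index, id ->
            // Ranks are 1-based, so the top hit contributes 1 / (k + 1).
            scores[id] = (scores[id] ?: 0.0) + 1.0 / (k + index + 1)
        }
    }
    // Highest fused score first.
    return scores.entries
        .sortedByDescending { it.value }
        .map { it.key to it.value }
}
```

Because only ranks are fused, a chunk that appears in both pools tends to outrank one that is strong in a single pool. That helps recall, but it cannot correct an embedder that ranks the wrong document first.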

Proposed direction

1. Adopt EmbeddingGemma with Matryoshka (the key lever)

EmbeddingGemma (308M, Gemma-3-based) is purpose-built for on-device RAG: #1 on MTEB under 500M params, <200 MB RAM with QAT, 2048-token context, 100+ languages, and a LiteRT/litert-community build — so it can run on the same LiteRT runtime we already use for the LLM (no new ONNX dependency strictly required; evaluate LiteRT vs ONNX). Crucially it uses Matryoshka Representation Learning (MRL): one model whose output can be truncated to 768 / 512 / 256 / 128 dims with minimal quality loss — perfect for device tiering.
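As a rough illustration of what Matryoshka truncation involves on the client side, the sketch below keeps the first targetDim components of a full-size vector and re-normalizes to unit length so cosine distance stays meaningful. The function name is hypothetical, and whether the runtime already returns normalized vectors should be checked against the model build we pick.

```kotlin
import kotlin.math.sqrt

// Hypothetical helper: truncate an MRL embedding to its leading components
// and L2-normalize the result.
fun truncateMatryoshka(embedding: FloatArray, targetDim: Int): FloatArray {
    require(targetDim in 1..embedding.size) {
        "targetDim must be between 1 and ${embedding.size}"
    }
    val head = embedding.copyOfRange(0, targetDim)
    var sumSquares = 0.0
    for (v in head) sumSquares += v.toDouble() * v
    val norm = sqrt(sumSquares)
    if (norm == 0.0) return head // all-zero vector: nothing to normalize
    for (i in head.indices) head[i] = (head[i] / norm).toFloat()
    return head
}
```

The same full-size output could then feed a 256-dim index on one device and a 128-dim index on another without running the model twice.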

2. Tier the embedding by device capability

Pick the embedding model + truncation dim at first run from minDeviceRamMb, reusing the existing tier logic:

| Device tier | Embedder | Dim | Reranker |
| --- | --- | --- | --- |
| Entry (~3–4 GB) | USE-Lite (fallback) or EmbeddingGemma @128 | 100 / 128 | none |
| Mid (~6–7 GB) | EmbeddingGemma | 256 | none |
| High / Flagship (~10 GB+) | EmbeddingGemma | 512–768 | cross-encoder rerank |
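One possible shape for the tier selection, sketched under assumptions: the type names, the RAM thresholds (6,000 and 10,000 MB), and the choice of USE-Lite at entry and 512 dims at the top are placeholders taken loosely from the table, not values from NativeLmModelCatalog.kt.

```kotlin
// All names and thresholds below are hypothetical placeholders.
enum class EmbedderKind { USE_LITE, EMBEDDING_GEMMA }

data class EmbeddingTier(
    val embedder: EmbedderKind,
    val dimension: Int,
    val useReranker: Boolean,
)

// Maps reported device RAM (in MB) to an embedding configuration.
fun selectEmbeddingTier(deviceRamMb: Int): EmbeddingTier = when {
    deviceRamMb >= 10_000 ->
        EmbeddingTier(EmbedderKind.EMBEDDING_GEMMA, 512, useReranker = true)
    deviceRamMb >= 6_000 ->
        EmbeddingTier(EmbedderKind.EMBEDDING_GEMMA, 256, useReranker = false)
    else ->
        EmbeddingTier(EmbedderKind.USE_LITE, 100, useReranker = false)
}
```

The result would need to be persisted at first run, since the chosen dimension determines which index entity the device writes to.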

3. Add task-aware embedding (query vs document)

EmbeddingGemma is asymmetric — it expects task prefixes (verify exact strings against the model card, but roughly task: search result | query: {q} for queries and title: none | text: {chunk} for documents). Extend EmbeddingEngine with an EmbeddingTask { QUERY, DOCUMENT } parameter and apply the right prefix in the retriever vs ingestor. This alone typically lifts retrieval meaningfully.
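A sketch of how the task parameter and prefixes might look. The prefix strings are the approximate ones quoted above and still need to be verified against the model card. The interface name and its non-suspending signature are assumptions; the existing EmbeddingEngine may differ.

```kotlin
enum class EmbeddingTask { QUERY, DOCUMENT }

// Prefix strings are the rough forms quoted in this issue, not verified.
fun applyTaskPrefix(
    text: String,
    task: EmbeddingTask,
    title: String? = null,
): String = when (task) {
    EmbeddingTask.QUERY -> "task: search result | query: $text"
    EmbeddingTask.DOCUMENT -> "title: ${title ?: "none"} | text: $text"
}

// Hypothetical extension of the engine contract: the task is explicit and
// the output dimension is visible to callers.
interface TaskAwareEmbeddingEngine {
    val dimension: Int
    fun embed(text: String, task: EmbeddingTask): FloatArray
}
```

With this shape the retriever would call embed(query, EmbeddingTask.QUERY) and the ingestor embed(chunk, EmbeddingTask.DOCUMENT). A symmetric model such as USE-Lite can ignore the task.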

4. Add a reranker on capable tiers

After fusion, run a lightweight cross-encoder reranker (e.g. ms-marco-MiniLM-L-6-v2 or bge-reranker-v2-m3) over the top ~30–100 candidates to produce a precision-tuned top-k (+5 to +15 NDCG@10 typical). Gate it to High/Flagship tiers to keep low-end latency acceptable.
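The rerank stage could slot in after fusion roughly as follows. This is a sketch only: Candidate, CrossEncoder, and the default pool and top-k sizes are hypothetical, and the scorer stands in for whichever on-device cross-encoder is eventually benchmarked.

```kotlin
// Hypothetical types; not existing classes in the app.
data class Candidate(val chunkId: Long, val text: String, val fusedScore: Double)

fun interface CrossEncoder {
    // Higher score means the passage is more relevant to the query.
    fun score(query: String, passage: String): Float
}

fun rerank(
    query: String,
    fused: List<Candidate>, // already ordered by fused score, best first
    scorer: CrossEncoder?,  // null on tiers where the reranker is gated off
    poolSize: Int = 30,
    topK: Int = 8,
): List<Candidate> {
    if (scorer == null) return fused.take(topK)
    return fused.take(poolSize)
        .map { it to scorer.score(query, it.text) }
        .sortedByDescending { it.second }
        .take(topK)
        .map { it.first }
}
```

Passing a null scorer keeps Entry and Mid devices on the plain fused order, so the tier gate stays a single decision at construction time.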

5. Better, structure-aware chunking

Move from a pure character window toward token-aware chunking that breaks on sentence or paragraph boundaries, so tables and numeric rows aren't split mid-record. Consider larger chunks now that EmbeddingGemma supports 2048 tokens.
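A minimal sketch of boundary-aware packing, assuming a character budget as a stand-in for a real token count (the tokenizer is not available here). The function name and the splitting regex are illustrative; this is not the existing TextChunker, and it does not yet handle overlap or table rows.

```kotlin
// Hypothetical chunker: packs whole sentences into chunks of at most
// maxChars characters. A single sentence longer than maxChars is emitted
// as its own oversized chunk rather than being cut mid-sentence.
fun chunkBySentence(text: String, maxChars: Int = 500): List<String> {
    require(maxChars > 0) { "maxChars must be positive" }
    // Split after ., ! or ? followed by whitespace, and on blank lines.
    val sentences = text.split(Regex("(?<=[.!?])\\s+|\\n{2,}"))
        .map { it.trim() }
        .filter { it.isNotEmpty() }
    val chunks = mutableListOf<String>()
    val current = StringBuilder()
    for (sentence in sentences) {
        // Flush when adding this sentence (plus a space) would exceed the budget.
        if (current.isNotEmpty() && current.length + 1 + sentence.length > maxChars) {
            chunks += current.toString()
            current.clear()
        }
        if (current.isNotEmpty()) current.append(' ')
        current.append(sentence)
    }
    if (current.isNotEmpty()) chunks += current.toString()
    return chunks
}
```

Swapping the length check for a token count from the embedder's tokenizer would make the budget match the model's real context limit.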

Quick wins (no re-import, ship first)

These were prototyped and reverted; they help even before the embedder swap:

  • Per-document cap in DefaultDocumentRetriever so one large source can't fill every top-k slot (maxChunksPerDocument in RagConfig).
  • Bump retrieval k (5 → 8), widen vectorPoolSize/keywordPoolSize, raise maxContextChars.
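The per-document cap can be sketched as a single pass over the ranked list. RankedChunk and capPerDocument are hypothetical names; only maxChunksPerDocument comes from the proposal above.

```kotlin
// Hypothetical result type; the real retriever's types may differ.
data class RankedChunk(val chunkId: Long, val documentId: Long, val score: Double)

// Walks the ranked list best-first and keeps at most maxChunksPerDocument
// chunks from any one document until k results are collected.
fun capPerDocument(
    ranked: List<RankedChunk>,
    maxChunksPerDocument: Int,
    k: Int,
): List<RankedChunk> {
    require(maxChunksPerDocument > 0) { "maxChunksPerDocument must be positive" }
    require(k >= 0) { "k must not be negative" }
    val takenPerDocument = HashMap<Long, Int>()
    val result = ArrayList<RankedChunk>(k)
    for (chunk in ranked) {
        if (result.size == k) break
        val count = takenPerDocument[chunk.documentId] ?: 0
        if (count < maxChunksPerDocument) {
            takenPerDocument[chunk.documentId] = count + 1
            result += chunk
        }
    }
    return result
}
```

Relative order is preserved, so the cap only removes surplus chunks from a dominant source and lets lower-ranked chunks from other documents fill the freed slots.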

Migration / infra notes

  • HNSW dim is fixed per entity → add a new entity per dimension (e.g. GemmaChunkEntity @256) alongside the existing 100-dim one; migrate by re-embedding from stored chunk text (no re-extraction/re-OCR needed).
  • Expose dimension and task on EmbeddingEngine so the index and embedder can't silently diverge.
  • Add iOS embedding impl (currently none) — EmbeddingGemma LiteRT works cross-platform.
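The re-embedding migration and the dimension guard might combine along these lines. Everything here is a sketch: StoredChunk, ReembeddedChunk, and reembedChunks are hypothetical, and the actual ObjectBox reads and writes are deliberately left out.

```kotlin
// Hypothetical shapes standing in for the stored chunk text and the new
// per-dimension entity; not the app's actual ObjectBox entities.
data class StoredChunk(val id: Long, val documentId: Long, val text: String)

class ReembeddedChunk(
    val id: Long,
    val documentId: Long,
    val text: String,
    val embedding: FloatArray,
)

// Lazily re-embeds stored chunk text and fails fast if the embedder's output
// size does not match the dimension the target index was declared with.
fun reembedChunks(
    chunks: Sequence<StoredChunk>,
    indexDimension: Int,
    embedDocument: (String) -> FloatArray,
): Sequence<ReembeddedChunk> = chunks.map { chunk ->
    val vector = embedDocument(chunk.text)
    check(vector.size == indexDimension) {
        "Embedder produced ${vector.size} dims but the index expects $indexDimension"
    }
    ReembeddedChunk(chunk.id, chunk.documentId, chunk.text, vector)
}
```

Using a Sequence keeps memory flat on entry devices, since chunks are embedded one at a time as the caller writes them out.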

Open questions / risks

  • P2P/mesh sync across tiers: two devices with different embedding dims can't share a vector index. Standardize on an interop dimension (likely 256) for shared/synced packs; let flagship optionally use higher dims locally only. Needs a decision.
  • Cold-start cost: EmbeddingGemma is heavier than USE-Lite — measure ingest latency on entry devices; that's why the low tier keeps USE-Lite/128-dim.
  • Reranker model size/latency budget on-device — benchmark before enabling by default.
