RAG embedding quality: replace USE-Lite (100-dim) with a device-tiered embedding + reranking stack

## Summary
RAG retrieval sometimes grounds on the **wrong document** (observed: a query about one source is answered from an unrelated one). The root cause is primarily the **embedding model**, not the LLM: USE-Lite produces only **100-dimensional** vectors, which are too coarse to separate semantically close documents (e.g. car vs. health insurance, or two similar PDFs). There is also **no reranker**, and the chunker is character-based, so even when the right chunk is retrieved, the relevant passage may be split across chunk boundaries or diluted by surrounding text.

This issue proposes upgrading the embedding pipeline and making embedding **scale with device capability**, mirroring the LLM tiering the app already has.

## Current state (from code analysis)
- **Embedder:** Universal Sentence Encoder (USE-Lite), `universal_sentence_encoder.tflite` (~6 MB) via **MediaPipe Tasks TextEmbedder** — `lib/src/androidMain/kotlin/com/sagar/aicore/MediaPipeEmbeddingEngine.kt`.
- **Dimension:** **100-dim**, hard-coded; not exposed by the interface. `EmbeddingEngine` (`lib/src/commonMain/kotlin/com/sagar/aicore/EmbeddingEngine.kt`) is symmetric/task-agnostic (`embed(text)` only — no query vs document distinction).
- **Vector index:** ObjectBox `@HnswIndex(dimensions = 100L, distanceType = COSINE, …)` on `DocumentChunkEntity` (`sample-app/.../data/db/Entities.kt`). HNSW dimension is a compile-time literal, so changing dims requires a **new entity**.
- **Retrieval:** `DefaultDocumentRetriever` — hybrid vector + BM25 fused via Reciprocal Rank Fusion. `RagConfig` defaults: `chunkSize=500`, `chunkOverlap=50`, `relevanceMaxDistance=0.75`, `vectorPoolSize=30`, `keywordPoolSize=120`, `maxContextChars=4000`. **No per-document cap; no reranker.**
- **Chunker:** `TextChunker` is a character sliding window (not token-aware).
- **Device tiers already exist for LLMs** in `NativeLmModelCatalog.kt` (`minDeviceRamMb`, ~3.5 GB → 12 GB+), but **embedding does not tier** — every device uses USE-Lite.
- A planning doc already exists: `docs/EMBEDDING_GEMMA_PLAN.md` (EmbeddingGemma 256-dim default, USE-Lite low-end fallback, task-aware interface, parallel `GemmaChunkEntity`, re-index from stored text). This issue extends/formalizes that.
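
For readers unfamiliar with the fusion step named above, here is a minimal sketch of Reciprocal Rank Fusion. It is not the code in `DefaultDocumentRetriever`; the function name and the constant `k = 60` (the value commonly used in the RRF literature) are assumptions for illustration.

```kotlin
// Reciprocal Rank Fusion: score(d) = sum over ranked lists of 1 / (k + rank(d)),
// with rank starting at 1. Items missing from a list contribute nothing for it.
// ASSUMPTION: k = 60 is a common default, not a value taken from this codebase.
fun reciprocalRankFusion(
    rankings: List<List<String>>,
    k: Int = 60,
): List<Pair<String, Double>> {
    val scores = LinkedHashMap<String, Double>()
    for (ranking in rankings) {
        ranking.forEachIndexed { index, id ->
            scores[id] = (scores[id] ?: 0.0) + 1.0 / (k + index + 1)
        }
    }
    // Highest fused score first; ties keep first-seen order (stable sort).
    return scores.entries
        .sortedByDescending { it.value }
        .map { it.key to it.value }
}
```

A chunk that appears in both the vector list and the BM25 list outranks one that appears high in only one of them, which is why a weak embedder can still drag an unrelated document into the fused top-k.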

## Proposed direction

### 1. Adopt EmbeddingGemma with Matryoshka (the key lever)
[EmbeddingGemma](https://developers.googleblog.com/en/introducing-embeddinggemma/) (308M parameters, Gemma-3-based) is purpose-built for on-device RAG. Per the announcement it is the **highest-ranking open multilingual embedding model under 500M parameters on MTEB**, runs in **<200 MB RAM with QAT**, has a **2048-token** context, covers **100+ languages**, and ships as a **LiteRT/`litert-community` build**. That means it can run on the **same LiteRT runtime we already use for the LLM**; a new ONNX dependency is not strictly required, though LiteRT vs ONNX should still be evaluated. Crucially, it uses **Matryoshka Representation Learning (MRL)**: one model whose **768**-dim output can be truncated to **512 / 256 / 128** dims with minimal quality loss, which maps directly onto device tiering.
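
As a rough illustration of how MRL truncation is usually applied, the sketch below keeps the leading `targetDim` components and re-normalizes to unit length so cosine distance stays meaningful. The function name is hypothetical and the code is independent of any EmbeddingGemma runtime API.

```kotlin
import kotlin.math.sqrt

// Truncate a Matryoshka embedding to its first targetDim components and
// L2-normalize the result. ASSUMPTION: the input is the full model output
// (e.g. 768 floats); this helper is illustrative, not part of the codebase.
fun truncateAndNormalize(embedding: FloatArray, targetDim: Int): FloatArray {
    require(targetDim in 1..embedding.size) {
        "targetDim must be in 1..${embedding.size}"
    }
    val head = embedding.copyOf(targetDim)
    var sumSq = 0.0
    for (v in head) sumSq += v.toDouble() * v
    val norm = sqrt(sumSq)
    if (norm == 0.0) return head // all-zero vector: nothing to normalize
    for (i in head.indices) head[i] = (head[i] / norm).toFloat()
    return head
}
```

Because every tier truncates the same underlying vector, a 768-dim embedding computed on one device can be cut down to 256 dims later without re-running the model.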

### 2. Tier the embedding by device capability
Pick the embedding model and truncation dimension at first run based on device RAM, reusing the existing `minDeviceRamMb` tier logic:

| Device tier | Embedder | Dim | Reranker |
|---|---|---|---|
| Entry (~3–4 GB) | USE-Lite (fallback) **or** EmbeddingGemma @128 | 100 / 128 | none |
| Mid (~6–7 GB) | EmbeddingGemma | 256 | none |
| High / Flagship (~10 GB+) | EmbeddingGemma | 512–768 | cross-encoder rerank |
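
The table could translate into a selection function along these lines. All type names and the RAM thresholds (6,000 MB and 10,000 MB) are assumptions read off the approximate tier labels, and the sketch arbitrarily picks USE-Lite for the entry tier and 512 dims for the high tier; the real cut-offs should come from the logic in `NativeLmModelCatalog.kt`.

```kotlin
// Hypothetical types; none of these exist in the codebase as written.
enum class Embedder { USE_LITE, EMBEDDING_GEMMA }

data class EmbeddingTier(
    val embedder: Embedder,
    val dimension: Int,
    val rerank: Boolean,
)

// ASSUMPTION: thresholds approximate the "~6-7 GB" and "~10 GB+" rows of the
// table; devices report less RAM than their nominal size, so round down.
fun selectEmbeddingTier(deviceRamMb: Int): EmbeddingTier = when {
    deviceRamMb >= 10_000 -> EmbeddingTier(Embedder.EMBEDDING_GEMMA, 512, rerank = true)
    deviceRamMb >= 6_000 -> EmbeddingTier(Embedder.EMBEDDING_GEMMA, 256, rerank = false)
    else -> EmbeddingTier(Embedder.USE_LITE, 100, rerank = false)
}
```

The chosen tier would need to be persisted after first run, since switching dimension later means re-embedding the stored chunks.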

### 3. Add task-aware embedding (query vs document)
EmbeddingGemma is asymmetric: it expects task prefixes (verify exact strings against the model card, but roughly `task: search result | query: {q}` for queries and `title: none | text: {chunk}` for documents). Extend `EmbeddingEngine` with an `EmbeddingTask { QUERY, DOCUMENT }` parameter and apply the query prefix in the retriever and the document prefix in the ingestor. Because the model was trained with these prefixes, omitting them is likely to cost retrieval quality, so this change should help on its own; measure the effect on our evaluation queries.
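
A minimal sketch of the prefixing step, using the prefix strings quoted above (which still need to be checked against the model card). The helper name `withTaskPrefix` and the optional `title` parameter are assumptions for illustration.

```kotlin
enum class EmbeddingTask { QUERY, DOCUMENT }

// Builds the prompt string handed to the embedder.
// ASSUMPTION: prefix formats are the ones quoted in this issue; confirm them
// against the EmbeddingGemma model card before shipping.
fun withTaskPrefix(text: String, task: EmbeddingTask, title: String? = null): String =
    when (task) {
        EmbeddingTask.QUERY -> "task: search result | query: $text"
        EmbeddingTask.DOCUMENT -> "title: ${title ?: "none"} | text: $text"
    }
```

Keeping the prefix logic in one function means the ingestor and retriever cannot drift apart on formatting, and a USE-Lite engine can simply ignore the task.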

### 4. Add a reranker on capable tiers
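After fusion, run a lightweight **cross-encoder reranker** over the top ~30–100 candidates to produce a precision-tuned top-k. Candidate models include `ms-marco-MiniLM-L-6-v2` (small, English-focused) and `bge-reranker-v2-m3` (multilingual but much larger, so it may not fit an on-device budget). Gains of roughly +5 to +15 NDCG@10 are commonly cited for reranking, but they vary by dataset and should be verified on our own queries. Gate the reranker to High/Flagship tiers to keep low-end latency acceptable.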
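
The rerank stage itself is model-agnostic and could be sketched as follows. `Candidate`, `RelevanceScorer`, and `rerank` are hypothetical names; the scorer stands in for whatever cross-encoder runtime is eventually chosen.

```kotlin
// Hypothetical types for illustration only.
data class Candidate(val chunkId: String, val text: String, val fusedScore: Double)

// Stand-in for a cross-encoder: scores a (query, passage) pair jointly.
fun interface RelevanceScorer {
    fun score(query: String, passage: String): Double
}

// Takes the best poolSize candidates by fused score, rescores each with the
// cross-encoder, and returns the topK by the new score.
fun rerank(
    query: String,
    candidates: List<Candidate>,
    scorer: RelevanceScorer,
    poolSize: Int = 30,
    topK: Int = 8,
): List<Candidate> =
    candidates
        .sortedByDescending { it.fusedScore }
        .take(poolSize)
        .map { it to scorer.score(query, it.text) }
        .sortedByDescending { it.second }
        .take(topK)
        .map { it.first }
```

Cost scales linearly with `poolSize` because each candidate needs one model forward pass, which is the main reason to benchmark before enabling this outside the top tier.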

### 5. Better, structure-aware chunking
Move from a pure character window toward token-aware chunking that breaks on sentence or paragraph boundaries, so tables and numeric rows aren't split mid-record. Consider larger chunks now that EmbeddingGemma supports 2048 tokens.
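
As a starting point, a boundary-aware chunker might look like the sketch below. It still budgets in characters rather than tokens (a token-aware version would need the embedder's tokenizer), and the function name and regex are assumptions, not the existing `TextChunker`.

```kotlin
// Packs whole sentences into chunks of at most maxChars characters.
// A sentence longer than maxChars is kept whole rather than cut mid-record.
// ASSUMPTION: sentences end at '.', '!' or '?' followed by whitespace, or at a
// newline; decimals such as "1.5" are not split because no whitespace follows.
fun chunkBySentence(text: String, maxChars: Int = 500): List<String> {
    require(maxChars > 0) { "maxChars must be positive" }
    val sentences = text
        .split(Regex("(?<=[.!?])\\s+|\\n+"))
        .map { it.trim() }
        .filter { it.isNotEmpty() }
    val chunks = mutableListOf<String>()
    val current = StringBuilder()
    for (sentence in sentences) {
        // Flush when adding this sentence (plus a joining space) would overflow.
        if (current.isNotEmpty() && current.length + 1 + sentence.length > maxChars) {
            chunks += current.toString()
            current.clear()
        }
        if (current.isNotEmpty()) current.append(' ')
        current.append(sentence)
    }
    if (current.isNotEmpty()) chunks += current.toString()
    return chunks
}
```

Overlap between chunks is omitted here for brevity; it could be reintroduced by carrying the last sentence of each chunk into the next one.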

## Quick wins (no re-import, ship first)
These were prototyped and reverted; they help even before the embedder swap:
- **Per-document cap** in `DefaultDocumentRetriever` so one large source can't fill every top-k slot (`maxChunksPerDocument` in `RagConfig`).
- Bump retrieval `k` (5 → 8), widen `vectorPoolSize`/`keywordPoolSize`, raise `maxContextChars`.
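
The per-document cap can be expressed as a single pass over the already-ranked list. This is a sketch of the idea, not the reverted prototype; `RankedChunk` and `capPerDocument` are hypothetical names.

```kotlin
// Hypothetical result type for illustration.
data class RankedChunk(val documentId: String, val chunkId: String, val score: Double)

// Walks the ranked list in order and keeps at most maxChunksPerDocument chunks
// from any one document, stopping once k chunks have been selected.
fun capPerDocument(
    ranked: List<RankedChunk>,
    maxChunksPerDocument: Int,
    k: Int,
): List<RankedChunk> {
    require(maxChunksPerDocument > 0 && k >= 0)
    val taken = HashMap<String, Int>()
    val result = ArrayList<RankedChunk>()
    for (chunk in ranked) {
        if (result.size == k) break
        val count = taken[chunk.documentId] ?: 0
        if (count < maxChunksPerDocument) {
            result += chunk
            taken[chunk.documentId] = count + 1
        }
    }
    return result
}
```

The cap only helps if the candidate pool is deeper than `k`, which is one reason to widen `vectorPoolSize`/`keywordPoolSize` at the same time.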

## Migration / infra notes
- HNSW dim is fixed per entity → add a **new entity per dimension** (e.g. `GemmaChunkEntity` @256) alongside the existing 100-dim one; migrate by **re-embedding from stored chunk text** (no re-extraction/re-OCR needed).
- Expose `dimension` and `task` on `EmbeddingEngine` so the index and embedder can't silently diverge.
- Add iOS embedding impl (currently none) — EmbeddingGemma LiteRT works cross-platform.
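
To make the "can't silently diverge" point concrete, a re-embedding pass could check the engine's declared dimension against the target index before writing anything. Every name below (`SizedEmbeddingEngine`, `StoredChunk`, `reembedChunks`) is hypothetical, and the ObjectBox write is left out.

```kotlin
// Hypothetical interface: an engine that declares its output dimension.
interface SizedEmbeddingEngine {
    val dimension: Int
    fun embedDocument(text: String): FloatArray
}

// Hypothetical view of a stored chunk: only id and text are needed to re-embed.
data class StoredChunk(val id: Long, val text: String)

// Re-embeds stored chunk text for a new index, failing fast on any dimension
// mismatch instead of writing vectors the HNSW index cannot hold.
fun reembedChunks(
    chunks: List<StoredChunk>,
    engine: SizedEmbeddingEngine,
    indexDimension: Int,
): Map<Long, FloatArray> {
    require(engine.dimension == indexDimension) {
        "Engine emits ${engine.dimension} dims but index expects $indexDimension"
    }
    return chunks.associate { chunk ->
        val vector = engine.embedDocument(chunk.text)
        check(vector.size == indexDimension) {
            "Chunk ${chunk.id}: got ${vector.size} dims, expected $indexDimension"
        }
        chunk.id to vector
    }
}
```

In practice the pass would run in batches in the background and write into the new entity, leaving the 100-dim index usable until migration completes.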

## Open questions / risks
- **P2P/mesh sync across tiers:** two devices with different embedding dims can't share a vector index. Standardize on an **interop dimension (likely 256)** for shared/synced packs; let flagship optionally use higher dims locally only. Needs a decision.
- Cold-start cost: EmbeddingGemma is heavier than USE-Lite — measure ingest latency on entry devices; that's why the low tier keeps USE-Lite/128-dim.
- Reranker model size/latency budget on-device — benchmark before enabling by default.

## References
- [Introducing EmbeddingGemma — Google Developers Blog](https://developers.googleblog.com/en/introducing-embeddinggemma/)
- [EmbeddingGemma model overview — Google AI for Developers](https://ai.google.dev/gemma/docs/embeddinggemma)
- [litert-community/embeddinggemma-300m — Hugging Face](https://huggingface.co/litert-community/embeddinggemma-300m)
- [Welcome EmbeddingGemma — Hugging Face blog](https://huggingface.co/blog/embeddinggemma)
- [MarkTechPost: EmbeddingGemma 308M, SOTA MTEB <500M](https://www.marktechpost.com/2025/09/04/google-ai-releases-embeddinggemma-a-308m-parameter-on-device-embedding-model-with-state-of-the-art-mteb-results/)
- Rerankers: [BGE reranker guide](https://markaicode.com/bge-reranker-cross-encoder-reranking-rag/) · [Best reranker models 2026](https://docs.bswen.com/blog/2026-02-25-best-reranker-models/)
- Internal: `docs/EMBEDDING_GEMMA_PLAN.md`
