feat(embedding): EmbeddingGemma on-device RAG embedder (ONNX, 256-dim) with USE-Lite fallback #24

Closed
sagar-develop wants to merge 1 commit into main from claude/embeddinggemma

@sagar-develop

Owner

Replaces the 2018-era Universal Sentence Encoder (USE-Lite, 100-dim) with EmbeddingGemma 300M as the default RAG embedder, lifting retrieval quality for both chat answers and every Studio artifact. USE-Lite stays as the friction-free, low-end fallback. Implements the design in docs/EMBEDDING_GEMMA_PLAN.md (included in this PR).

Why the earlier attempt failed (and how this fixes it)

EmbeddingGemma is a 300M transformer, not a TFLite Task model. Three landmines, all addressed here:

  1. Wrong loader: MediaPipe TextEmbedder only accepts TFLite Task models. → New, separate OnnxEmbeddingEngine (ONNX Runtime); the MediaPipe path is untouched.
  2. Dimension lock: @HnswIndex(dimensions = 100L) is an annotation literal, so the existing store cannot change dimension at runtime. → New GemmaChunkEntity at 256-dim parallel to the 100-dim store; no in-place dim change.
  3. Missing task prompts: EmbeddingGemma needs query: / text: instruction prefixes; the old embed(text) was symmetric. → Interface is now task-aware (EmbeddingTask.QUERY/DOCUMENT).
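
To make the third point concrete, here is a minimal sketch of task-aware prompt building. Only EmbeddingTask.QUERY/DOCUMENT and the (text, task, title) parameter shape come from this PR; `buildPrompt` is a hypothetical helper, and the prompt templates are written as I understand them from the EmbeddingGemma model card, so they should be confirmed against the pinned model revision.

```kotlin
// EmbeddingTask.QUERY/DOCUMENT are named in the PR; everything else here is illustrative.
enum class EmbeddingTask { QUERY, DOCUMENT }

// Assumed templates (to be verified against the model card of the pinned revision):
//   query:    "task: search result | query: {text}"
//   document: "title: {title or "none"} | text: {text}"
fun buildPrompt(text: String, task: EmbeddingTask, title: String? = null): String =
    when (task) {
        EmbeddingTask.QUERY -> "task: search result | query: $text"
        EmbeddingTask.DOCUMENT -> "title: ${title ?: "none"} | text: $text"
    }

fun main() {
    // The same text is prefixed differently depending on which side of retrieval it is on.
    println(buildPrompt("how do I reset my password?", EmbeddingTask.QUERY))
    println(buildPrompt("Open Settings, then Account.", EmbeddingTask.DOCUMENT, title = "Password reset"))
}
```

A symmetric embedder such as USE-Lite would simply ignore `task` and `title`, which is how a single interface can cover both engines.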

What's implemented

Engine (lib/)

  • EmbeddingEngine — task-aware: dimensions + embed(text, task, title). MediaPipeEmbeddingEngine adapted (dim 100, symmetric).
  • OnnxEmbeddingEngine (EmbeddingGemma via ONNX Runtime): instruction prompts → tokenize → mean-pool over the attention mask (or use a pre-pooled sentence_embedding) → Matryoshka-truncate to 256 → L2-normalize.
  • HfGemmaTokenizer — HuggingFace tokenizer reading the model's tokenizer.json.
  • ModelFormat.ONNX_EMBEDDER + ModelDescriptor.companions so the tokenizer downloads alongside the model.
  • Deps: onnxruntime-android, ai.djl.huggingface:tokenizers (arm64); consumer ProGuard keeps.
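
The post-processing steps listed for OnnxEmbeddingEngine (mean-pool over the attention mask, Matryoshka truncation, L2 normalization) can be sketched in plain Kotlin. This is not the PR's `pool()` implementation: the function names and the `Array<FloatArray>` layout for `last_hidden_state` are assumptions chosen to keep the example free of ONNX Runtime types.

```kotlin
import kotlin.math.sqrt

// tokenStates: one row of hidden values per token; mask: 1 for real tokens, 0 for padding.
fun meanPool(tokenStates: Array<FloatArray>, mask: IntArray): FloatArray {
    require(tokenStates.size == mask.size) { "expected one mask entry per token" }
    val hidden = tokenStates.firstOrNull()?.size ?: 0
    val sum = FloatArray(hidden)
    var count = 0
    for (i in tokenStates.indices) {
        if (mask[i] == 0) continue // padding tokens do not contribute
        for (j in 0 until hidden) sum[j] += tokenStates[i][j]
        count++
    }
    if (count > 0) for (j in 0 until hidden) sum[j] /= count
    return sum
}

// Matryoshka truncation keeps the leading `dims` components, then re-normalizes to unit length.
fun truncateAndNormalize(vector: FloatArray, dims: Int): FloatArray {
    val cut = vector.copyOf(minOf(dims, vector.size))
    var squares = 0.0
    for (v in cut) squares += v.toDouble() * v
    val norm = sqrt(squares).toFloat()
    if (norm > 0f) for (i in cut.indices) cut[i] /= norm
    return cut
}

fun main() {
    val states = arrayOf(
        floatArrayOf(1f, 2f, 3f, 4f),
        floatArrayOf(3f, 2f, 1f, 0f),
        floatArrayOf(9f, 9f, 9f, 9f), // padding row, masked out below
    )
    val pooled = meanPool(states, intArrayOf(1, 1, 0))
    // The PR truncates to 256; 2 is used here only to keep the toy vectors small.
    val embedding = truncateAndNormalize(pooled, dims = 2)
    println(embedding.joinToString())
}
```

Normalizing after truncation matters: the leading components of a unit vector are not themselves unit length, so skipping this step would skew distance-based thresholds.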

App (sample-app/)

  • GemmaChunkEntity (256-dim HNSW) beside DocumentChunkEntity (100-dim). The repository routes vector ops by dimension and returns neutral Chunk/ScoredChunk DTOs (decoupling retriever/Studio from the active store).
  • RagHolder picks the active embedder (Gemma when its files are present and RAM ≥ 4 GB, else USE-Lite), downloads the model + companion tokenizer, and migrateToGemma() re-indexes legacy chunks from their stored text (document-scoped, idempotent, runs in the background).
  • Ingest/retrieve are task-aware; per-embedder distance gate (USE 0.75 vs Gemma 0.55 — provisional).
  • Backups round-trip both stores (schema v2, each ChunkDto carries its dim); non-active-embedder chunks re-index on first use.
  • Catalog descriptor for EmbeddingGemma (ungated ONNX mirror; surface Gemma terms in the onboarding gate).
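
The embedder selection rule and the per-embedder distance gate described above could look roughly like the sketch below. The thresholds (0.75 and 0.55) and the 4 GB floor are taken from this PR; the `Embedder` enum, `pickEmbedder`, `filterRelevant`, the fields of `ScoredChunk`, and reading "4 GB" as 4 GiB of total RAM are all assumptions for illustration.

```kotlin
// Dimensions and provisional distance thresholds as stated in the PR.
enum class Embedder(val dimensions: Int, val maxDistance: Float) {
    USE_LITE(dimensions = 100, maxDistance = 0.75f),
    GEMMA(dimensions = 256, maxDistance = 0.55f), // provisional, to be re-tuned
}

// Assumption: "RAM >= 4 GB" is interpreted as 4 GiB of total memory.
const val MIN_GEMMA_RAM_BYTES: Long = 4L * 1024 * 1024 * 1024

fun pickEmbedder(gemmaFilesPresent: Boolean, totalRamBytes: Long): Embedder =
    if (gemmaFilesPresent && totalRamBytes >= MIN_GEMMA_RAM_BYTES) Embedder.GEMMA else Embedder.USE_LITE

// ScoredChunk is named in the PR; these fields are hypothetical.
data class ScoredChunk(val text: String, val distance: Float)

// Drops hits that are farther than the active embedder's relevance threshold.
fun filterRelevant(hits: List<ScoredChunk>, embedder: Embedder): List<ScoredChunk> =
    hits.filter { it.distance <= embedder.maxDistance }

fun main() {
    val active = pickEmbedder(gemmaFilesPresent = true, totalRamBytes = 6L * 1024 * 1024 * 1024)
    val hits = listOf(ScoredChunk("close match", 0.40f), ScoredChunk("weak match", 0.60f))
    println("$active keeps ${filterRelevant(hits, active).map { it.text }}")
}
```

One practical caveat: devices marketed as 4 GB often report less than 4 GiB of total memory, so the exact comparison value would need checking on real hardware.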

⚠️ Build & verify checklist (not done in this sandbox)

No Android SDK here, so this was not compiled (same precedent as #22) and these steps must run on a real build/device:

  • Build regenerates the ObjectBox model — the new GemmaChunkEntity makes the ObjectBox plugin update sample-app/objectbox-models/default.json and generate GemmaChunkEntity_ / MyObjectBox. Commit the regenerated default.json.
  • Tokenizer runtime — confirm ai.djl.huggingface:tokenizers ships an arm64 .so for Android and loads tokenizer.json; if not, swap to onnxruntime-extensions (the documented fallback). Highest-risk item.
  • ONNX I/O names — verify the chosen EmbeddingGemma ONNX export's input names (input_ids/attention_mask) and output (sentence_embedding vs last_hidden_state); OnnxEmbeddingEngine.pool() handles both but the export should be confirmed.
  • Pin model artifacts — the catalog URL/sizeBytes/sha256 for the model + tokenizer are placeholders against the onnx-community mirror; pin to a verified revision (prefer a QAT/INT8 build).
  • Re-tune RELEVANCE_MAX_DISTANCE_GEMMA against real corpora.
  • On-device (CPH2723) — USE-vs-Gemma retrieval A/B, embed latency/memory, migration on an upgraded install, low-end USE fallback, and a backup round-trip.
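
For the "pin model artifacts" item, a download can be checked against the pinned sizeBytes and sha256 before it is handed to the engine. The sketch below uses only the JDK; `sha256Hex` and `verifyArtifact` are hypothetical helpers, not functions from this PR.

```kotlin
import java.io.File
import java.security.MessageDigest

// Streams the file through SHA-256 so large model files are not loaded into memory at once.
fun sha256Hex(file: File): String {
    val digest = MessageDigest.getInstance("SHA-256")
    file.inputStream().use { input ->
        val buffer = ByteArray(8192)
        while (true) {
            val read = input.read(buffer)
            if (read < 0) break
            digest.update(buffer, 0, read)
        }
    }
    return digest.digest().joinToString("") { "%02x".format(it) }
}

// True only when both the size and the hash match the pinned catalog values.
fun verifyArtifact(file: File, expectedSizeBytes: Long, expectedSha256: String): Boolean =
    file.length() == expectedSizeBytes && sha256Hex(file).equals(expectedSha256, ignoreCase = true)

fun main() {
    val file = File.createTempFile("artifact", ".bin").apply {
        writeText("abc")
        deleteOnExit()
    }
    // Well-known SHA-256 digest of the three bytes "abc".
    val expected = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
    println(verifyArtifact(file, expectedSizeBytes = 3, expectedSha256 = expected))
}
```

Checking the size first is a cheap way to reject a truncated or placeholder download before paying for the hash.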

Out of scope (follow-ups)

  • Reranker second stage; ORT-format/mobile size optimization; iOS embedder (the engine is KMP-portable); token-aware chunking.

https://claude.ai/code/session_01GY7vyycq3iQTQxiooMnxJW


Generated by Claude Code

…) with USE-Lite fallback

Engine (lib):
- Task-aware EmbeddingEngine (QUERY/DOCUMENT + dimensions); OnnxEmbeddingEngine
  (EmbeddingGemma 300M via ONNX Runtime: instruction prompts, mean-pool,
  Matryoshka-256, L2-norm) + HfGemmaTokenizer (tokenizer.json).
- ModelFormat.ONNX_EMBEDDER and ModelDescriptor.companions (tokenizer download).

App (sample-app):
- GemmaChunkEntity (256-dim HNSW) alongside the 100-dim USE store; repository
  routes vector ops by dimension and returns neutral Chunk/ScoredChunk DTOs.
- RagHolder selects the active embedder (Gemma on capable devices, USE fallback),
  downloads model+companion, and re-indexes legacy chunks (migrateToGemma).
- Ingest/retrieve are task-aware; per-embedder distance gate.
- Backups round-trip both stores (schema v2, per-chunk dim); catalog descriptor.

Note: not compiled in this sandbox (no Android SDK); first build regenerates the
ObjectBox model. Tokenizer runtime + ONNX output names need on-device verification.
@sagar-develop

Owner Author

Closing without merging. On-device testing found that the EmbeddingGemma ONNX embedder won't load as configured: the catalog descriptor lists only tokenizer.json as a companion and omits the model_quantized.onnx_data external-weights file (~295 MB), so the in-app download fetches a weightless 567 KB graph and ORT session creation would fail. In addition, the branch is 31 commits behind main, and the regenerated ObjectBox default.json (for the new GemmaChunkEntity) wasn't committed. Revisit as a fresh branch off current main with the weights companion wired in.
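
A pre-flight check along the following lines could catch a missing companion before session creation is attempted. This is a sketch under assumptions: the `ModelDescriptor` shape here is simplified and hypothetical (the PR only states that it has `companions`), `missingFiles` is an invented helper, and the graph file name model_quantized.onnx is inferred from the weights file name mentioned above.

```kotlin
// Simplified, hypothetical stand-in for the catalog descriptor.
data class ModelDescriptor(val modelFile: String, val companions: List<String>)

// Returns every file the descriptor requires that is not in the downloaded set.
fun missingFiles(descriptor: ModelDescriptor, presentFiles: Set<String>): List<String> =
    (listOf(descriptor.modelFile) + descriptor.companions).filterNot { it in presentFiles }

fun main() {
    val descriptor = ModelDescriptor(
        modelFile = "model_quantized.onnx", // assumed graph file name
        companions = listOf("model_quantized.onnx_data", "tokenizer.json"),
    )
    // Simulates the failure described above: graph and tokenizer present, weights absent.
    val downloaded = setOf("model_quantized.onnx", "tokenizer.json")
    println(missingFiles(descriptor, downloaded))
}
```

Listing the external-weights file as a companion means the same check also guards the fallback decision: the Gemma path is selected only when the returned list is empty.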
