feat(embedding): EmbeddingGemma on-device RAG embedder (ONNX, 256-dim) with USE-Lite fallback #24
Closed
sagar-develop wants to merge 1 commit into
…) with USE-Lite fallback

Engine (lib):
- Task-aware EmbeddingEngine (QUERY/DOCUMENT + dimensions); OnnxEmbeddingEngine (EmbeddingGemma 300M via ONNX Runtime: instruction prompts, mean-pool, Matryoshka-256, L2-norm) + HfGemmaTokenizer (tokenizer.json).
- ModelFormat.ONNX_EMBEDDER and ModelDescriptor.companions (tokenizer download).

App (sample-app):
- GemmaChunkEntity (256-dim HNSW) alongside the 100-dim USE store; repository routes vector ops by dimension and returns neutral Chunk/ScoredChunk DTOs.
- RagHolder selects the active embedder (Gemma on capable devices, USE fallback), downloads model+companion, and re-indexes legacy chunks (migrateToGemma).
- Ingest/retrieve are task-aware; per-embedder distance gate.
- Backups round-trip both stores (schema v2, per-chunk dim); catalog descriptor.

Note: not compiled in this sandbox (no Android SDK); first build regenerates the ObjectBox model. Tokenizer runtime + ONNX output names need on-device verification.
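The post-processing chain named in the commit message (mean-pool, Matryoshka-256, L2-norm) can be illustrated with a minimal sketch. It is written in plain Python rather than the PR's Android code, and it is not the PR's implementation: the function names are hypothetical, tokenization and instruction prompts are left out, and the model is assumed to return one hidden-state vector per token together with a 0/1 attention mask.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [x / count for x in total]


def truncate_and_normalize(vector, dims):
    """Keep the first `dims` components (Matryoshka truncation), then L2-normalize."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero vector cannot be normalized; return it unchanged
    return [x / norm for x in head]


def embed_from_hidden_states(token_vectors, attention_mask, dims=256):
    """Hypothetical helper: pool, truncate, and normalize in one call."""
    return truncate_and_normalize(mean_pool(token_vectors, attention_mask), dims)


# Toy input: three 4-dim token vectors, the last one is padding (mask 0).
toy_vectors = [[3.0, 4.0, 1.0, 1.0], [3.0, 4.0, 5.0, 5.0], [100.0, 100.0, 100.0, 100.0]]
toy_mask = [1, 1, 0]
toy_embedding = embed_from_hidden_states(toy_vectors, toy_mask, dims=2)
```

Normalizing after truncation, not before, is what keeps the shortened vector at unit length, so distances computed over the 256-dim index stay comparable across chunks.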
Closing without merging. On-device test found the EmbeddingGemma ONNX embedder won't load as configured: the catalog descriptor lists only |
Replaces the 2018-era Universal Sentence Encoder (USE-Lite, 100-dim) with EmbeddingGemma 300M as the default RAG embedder, lifting retrieval quality for both chat answers and every Studio artifact. USE-Lite stays as the friction-free, low-end fallback. Implements the design in docs/EMBEDDING_GEMMA_PLAN.md (included in this PR).

Why the earlier attempt failed (and how this fixes it)

EmbeddingGemma is a 300M transformer, not a TFLite Task model. Three landmines, all addressed here:
- TextEmbedder only accepts TFLite Task models. → New, separate OnnxEmbeddingEngine (ONNX Runtime); the MediaPipe path is untouched.
- @HnswIndex(dimensions = 100L) is an annotation literal. → New GemmaChunkEntity at 256-dim parallel to the 100-dim store; no in-place dim change.
- … query:/text: instruction prefixes; the old embed(text) was symmetric. → Interface is now task-aware (EmbeddingTask.QUERY/DOCUMENT).

What's implemented

Engine (lib/)
- EmbeddingEngine: task-aware (dimensions + embed(text, task, title)). MediaPipeEmbeddingEngine adapted (dim 100, symmetric).
- OnnxEmbeddingEngine: EmbeddingGemma via ONNX Runtime (instruction prompts → tokenize → mean-pool over the attention mask, or a pre-pooled sentence_embedding → Matryoshka-truncate to 256 → L2-normalize).
- HfGemmaTokenizer: HuggingFace tokenizer reading the model's tokenizer.json.
- ModelFormat.ONNX_EMBEDDER + ModelDescriptor.companions so the tokenizer downloads alongside the model.
- onnxruntime-android, ai.djl.huggingface:tokenizers (arm64); consumer ProGuard keeps.

App (sample-app/)
- GemmaChunkEntity (256-dim HNSW) beside DocumentChunkEntity (100-dim). The repository routes vector ops by dimension and returns neutral Chunk/ScoredChunk DTOs (decoupling retriever/Studio from the active store).
- RagHolder picks the active embedder (Gemma when its files are present and RAM ≥ 4 GB, else USE-Lite), downloads the model + companion tokenizer, and migrateToGemma() re-indexes legacy chunks from their stored text (document-scoped, idempotent, runs in the background).
- … ChunkDto carries its dim); non-active-embedder chunks re-index on first use.

No Android SDK here, so this was not compiled (same precedent as #22) and these steps must run on a real build/device:
- GemmaChunkEntity makes the ObjectBox plugin update sample-app/objectbox-models/default.json and generate GemmaChunkEntity_/MyObjectBox. Commit the regenerated default.json.
- ai.djl.huggingface:tokenizers ships an arm64 .so for Android and loads tokenizer.json; if not, swap to onnxruntime-extensions (the documented fallback). Highest-risk item.
- … input_ids/attention_mask) and output (sentence_embedding vs last_hidden_state); OnnxEmbeddingEngine.pool() handles both but the export should be confirmed.
- sizeBytes/sha256 for the model + tokenizer are placeholders against the onnx-community mirror; pin to a verified revision (prefer a QAT/INT8 build).
- … RELEVANCE_MAX_DISTANCE_GEMMA against real corpora.

Out of scope (follow-ups)
https://claude.ai/code/session_01GY7vyycq3iQTQxiooMnxJW
Generated by Claude Code
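
The embedder selection rule in the PR description (Gemma when its files are present and RAM ≥ 4 GB, else USE-Lite) and the routing of vector operations by dimension can be sketched as follows. This is an illustration in plain Python, not the PR's RagHolder or repository code: the function and constant names are hypothetical, "4 GB" is read as 4 GiB as an assumption, and only the entity names and dimensions come from the PR text.

```python
# Hypothetical threshold: the PR says "RAM >= 4 GB"; 4 GiB is assumed here.
GEMMA_MIN_RAM_BYTES = 4 * 1024 ** 3

# Vector dimension of each embedder, as stated in the PR description.
EMBEDDER_DIMS = {"use-lite": 100, "gemma": 256}

# Chunk store backing each dimension (entity names taken from the PR).
STORE_BY_DIM = {100: "DocumentChunkEntity", 256: "GemmaChunkEntity"}


def select_embedder(gemma_files_present, total_ram_bytes):
    """Pick Gemma only when its files exist and the device has enough RAM."""
    if gemma_files_present and total_ram_bytes >= GEMMA_MIN_RAM_BYTES:
        return "gemma"
    return "use-lite"


def store_for_vector(vector):
    """Route a vector to the store whose index dimension matches its length."""
    try:
        return STORE_BY_DIM[len(vector)]
    except KeyError:
        raise ValueError(f"no chunk store for {len(vector)}-dim vectors") from None


# Usage: a 6 GiB device with the model files downloaded.
active = select_embedder(gemma_files_present=True, total_ram_bytes=6 * 1024 ** 3)
query_vector = [0.0] * EMBEDDER_DIMS[active]
target_store = store_for_vector(query_vector)
```

Keying the store on vector length rather than on the embedder name means a backup that mixes 100-dim and 256-dim chunks can be restored without knowing which embedder is currently active.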