Summary
Explore adopting QMD's advanced search pipeline components to improve traul's search quality. This is a tech spike — research and prototype before committing to implementation.
Reference: tobi/qmd — on-device hybrid search engine using node-llama-cpp with GGUF models.
Context
Traul currently uses a simple hybrid search: FTS5 (BM25) + vector similarity (sqlite-vec) merged via Reciprocal Rank Fusion. QMD adds three additional stages that significantly improve search quality:
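For reference, the RRF merge traul already does can be sketched as follows. This is a minimal illustration, not traul's actual code; `k = 60` is the conventional RRF constant and is an assumption here (traul's value is not stated in this issue):

```typescript
// Reciprocal Rank Fusion: each backend's ranked list contributes
// 1 / (k + rank) per document; documents appearing in several lists
// accumulate score. k = 60 is the conventional default (assumed).
function rrfFuse(rankings: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      // rank is 0-based, so the top result contributes 1 / (k + 1)
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return scores;
}

// Example: "a" tops both the FTS and the vector ranking, so it wins.
const fused = rrfFuse([["a", "b"], ["a", "c"]]);
```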
1. Query Expansion
- Uses a small LLM (~1.7GB GGUF) to generate alternative query variants before searching
- Produces typed variants: `lex` (keyword rewrites for FTS), `vec` (semantic rewrites for vector search), `hyde` (hypothetical document for embedding)
- Each variant is routed to the appropriate search backend
- QMD model: `qmd-query-expansion-1.7B` (custom fine-tune by @tobi)
- Why it matters for traul: multilingual queries (Russian ↔ English) would benefit from expansion — a Russian query could generate English keyword variants and vice versa
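The typed-variant routing above can be sketched like this. The `QueryVariant` type and `routeVariant` function are hypothetical names for illustration; only the `lex`/`vec`/`hyde` variant kinds come from QMD:

```typescript
// QMD-style typed query variants (lex/vec/hyde) and a hypothetical router.
type VariantKind = "lex" | "vec" | "hyde";

interface QueryVariant {
  kind: VariantKind;
  text: string;
}

// lex  -> FTS5 (BM25 keyword search)
// vec  -> vector similarity over the query's embedding
// hyde -> vector similarity over the embedding of a hypothetical document
function routeVariant(v: QueryVariant): "fts" | "vector" {
  return v.kind === "lex" ? "fts" : "vector";
}

// Example: a Russian query expanded into cross-language variants.
const variants: QueryVariant[] = [
  { kind: "lex", text: "product metrics KPI" }, // English keyword rewrite
  { kind: "vec", text: "how we track product health" },
  { kind: "hyde", text: "Our key product metrics are retention and DAU." },
];
const targets = variants.map(routeVariant); // ["fts", "vector", "vector"]
```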
2. LLM Reranking
- After RRF fusion produces candidates, a reranker model rescores them
- Uses `Qwen3-Reranker-0.6B` (~0.6GB GGUF) via `node-llama-cpp`
- Position-aware score blending: combines RRF rank with reranker score
- Chunked reranking to stay within context limits
- Why it matters for traul: current RRF scores are purely statistical — reranking adds semantic understanding of query-result relevance
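Position-aware blending could look something like the sketch below. This assumes a simple linear blend; QMD's actual formula may differ, and `alpha` and the rank prior constant `k` are hypothetical parameters:

```typescript
// Position-aware score blending sketch: keep some trust in the RRF
// ordering (a 1/(rank + k) prior) and mix in the reranker's semantic score.
interface Candidate {
  id: string;
  rrfRank: number;     // 0-based rank from RRF fusion
  rerankScore: number; // reranker relevance in [0, 1]
}

function blend(c: Candidate, alpha = 0.7, k = 10): number {
  const rankPrior = 1 / (c.rrfRank + k);
  return alpha * c.rerankScore + (1 - alpha) * rankPrior;
}

const candidates: Candidate[] = [
  { id: "a", rrfRank: 0, rerankScore: 0.2 }, // ranked first by RRF, weak semantically
  { id: "b", rrfRank: 4, rerankScore: 0.9 }, // ranked fifth by RRF, strong semantically
];

// A strong reranker score can overtake a better RRF rank.
const resorted = [...candidates].sort((x, y) => blend(y) - blend(x));
```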
3. Smart Chunking
- Documents split into overlapping chunks (900 tokens, 135 token overlap)
- Best chunk per document selected via keyword matching
- Reranking operates on chunks, not full documents
- Relevance for traul: most messages are short (< 500 tokens), so chunking may only matter for long-form content (Claude Code sessions, call transcripts, markdown docs)
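The chunking scheme (900-token windows, 135-token overlap) plus best-chunk selection can be sketched as below. Whitespace words stand in for tokens here; a real implementation would use the model's tokenizer, and both function names are hypothetical:

```typescript
// Overlapping chunking: 900-token windows advancing by 900 - 135 = 765,
// so adjacent chunks share 135 tokens. Short inputs are returned whole.
function chunkTokens(tokens: string[], size = 900, overlap = 135): string[][] {
  if (tokens.length <= size) return [tokens]; // short messages: no chunking
  const step = size - overlap;
  const chunks: string[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size));
    if (start + size >= tokens.length) break; // last window reached the end
  }
  return chunks;
}

// Best chunk per document: the chunk sharing the most terms with the query.
function bestChunk(chunks: string[][], queryTerms: string[]): number {
  let best = 0;
  let bestScore = -1;
  chunks.forEach((chunk, i) => {
    const terms = new Set(chunk);
    const score = queryTerms.filter((t) => terms.has(t)).length;
    if (score > bestScore) {
      bestScore = score;
      best = i;
    }
  });
  return best;
}
```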
Prerequisite
This spike depends on migrating from Ollama to node-llama-cpp for embeddings (separate effort). The same node-llama-cpp singleton would serve embedding, expansion, and reranking models.
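The shared-singleton idea can be sketched generically, without committing to node-llama-cpp's API surface (which the separate migration effort would pin down). `ModelHandle`, `ModelPool`, and the injected `loadGguf` callback are all hypothetical stand-ins:

```typescript
// Lazy singleton pool: each GGUF path is loaded once and the load promise
// is cached, so embedding, expansion, and reranking callers (including
// concurrent ones) share a single load.
type ModelHandle = { path: string };

class ModelPool {
  private cache = new Map<string, Promise<ModelHandle>>();

  constructor(private loadGguf: (path: string) => Promise<ModelHandle>) {}

  // Returns the cached load promise; concurrent callers get the same one.
  get(path: string): Promise<ModelHandle> {
    let pending = this.cache.get(path);
    if (!pending) {
      pending = this.loadGguf(path);
      this.cache.set(path, pending);
    }
    return pending;
  }
}
```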
Models & Resource Budget
| Component | Model | Size | Purpose |
|---|---|---|---|
| Embeddings | `Qwen3-Embedding-0.6B-Q8_0` | ~639MB | Replace snowflake-arctic-embed2 |
| Query expansion | `qmd-query-expansion-1.7B-q4_k_m` | ~1.7GB | Generate query variants |
| Reranking | `Qwen3-Reranker-0.6B-Q8_0` | ~0.6GB | Rescore candidates |
| **Total VRAM** | | **~3GB** | All three loaded |
Test Suite (write BEFORE implementation)
Search Quality Tests
search/quality/expansion.test.ts
- Query expansion generates meaningful variants for English queries
- Query expansion generates meaningful variants for Russian queries
- Cross-language expansion: Russian query produces English lex variants
- Cross-language expansion: English query produces Russian vec variants
- Expansion with empty/single-word queries doesn't crash
- Expansion respects timeout (doesn't hang on slow model)
search/quality/reranking.test.ts
- Reranker scores relevant results higher than irrelevant ones
- Reranker handles mixed-language results (Russian + English)
- Reranker improves NDCG over raw RRF scores (golden set comparison)
- Reranker gracefully degrades when model unavailable (falls back to RRF)
- Reranker respects candidate limit (doesn't process more than N docs)
- Position-aware blending produces monotonically decreasing scores
search/quality/chunking.test.ts
- Short messages (< 500 tokens) are not chunked
- Long content (Claude Code sessions) is chunked with overlap
- Best chunk selection picks the chunk with most query term overlap
- Chunk boundaries respect sentence/paragraph breaks when possible
- Chunk overlap ensures no content is lost between chunks
search/quality/pipeline.test.ts
- Full pipeline (expand → search → rerank) returns better results than FTS-only
- Full pipeline returns better results than current hybrid (vec + FTS + RRF)
- Pipeline gracefully degrades: no expansion model → skip expansion
- Pipeline gracefully degrades: no reranker → skip reranking
- Pipeline latency stays under 2s for single-word queries (warm models)
- Pipeline latency stays under 5s for complex multi-word queries (warm models)
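The graceful-degradation behavior these tests pin down can be sketched as a pipeline where each optional stage is skipped when its model is unavailable. The `Stages` shape and `runPipeline` are hypothetical illustrations:

```typescript
// Pipeline sketch: expand -> search -> rerank, where expansion and
// reranking are optional and absent stages are skipped, not errors.
interface Stages {
  expand?: (query: string) => string[];
  rerank?: (query: string, ids: string[]) => string[];
}

function runPipeline(
  query: string,
  search: (queries: string[]) => string[],
  stages: Stages,
): string[] {
  // No expansion model loaded -> search with the raw query only.
  const queries = stages.expand ? stages.expand(query) : [query];
  const candidates = search(queries);
  // No reranker loaded -> fall back to the fused (RRF) ordering.
  return stages.rerank ? stages.rerank(query, candidates) : candidates;
}
```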
Golden Set for Quality Measurement
Build a golden test set of ~20 queries with expected top-3 results from the actual traul database:
- 5 Russian keyword queries (e.g., "метрики продукта")
- 5 English keyword queries (e.g., "deployment issues")
- 5 semantic/conceptual queries (e.g., "how to measure user satisfaction")
- 5 cross-language queries (e.g., Russian query where best results are in English)
Measure: Precision@3, NDCG@10, Mean Reciprocal Rank before and after each pipeline stage.
Integration Tests
search/integration/node-llama-cpp.test.ts
- LlamaCpp singleton loads embedding model on first use
- LlamaCpp singleton reuses model across calls (no reload)
- Multiple concurrent embed calls don't crash
- Model unloads after idle timeout
- Graceful error when GGUF file missing
search/integration/model-lifecycle.test.ts
- Embedding + reranker models can coexist in memory
- All three models (embed + expand + rerank) load within 3GB VRAM
- Process exits cleanly with models loaded (no hanging)
Deliverables
- Test suite written and committed (tests will fail initially — TDD)
- Spike branch with prototype implementation
- Benchmark report: before/after search quality (golden set) and latency
- Decision doc: go/no-go on each component (expansion, reranking, chunking) based on quality gains vs. resource cost
Open Questions
- Can we use QMD's custom `qmd-query-expansion-1.7B` model, or do we need our own fine-tune for message-style content?
- Is 3GB total VRAM acceptable for a CLI tool? Should models load on-demand and unload aggressively?
- For short messages, does reranking actually improve results, or is RRF sufficient?
- Should we consider QMD as a dependency rather than reimplementing? (traul indexes messages, QMD indexes documents — different data models, but search pipeline could be shared)