
Explore QMD-style search pipeline: query expansion, LLM reranking, chunking #17

@dandaka


Summary

Explore adopting QMD's advanced search pipeline components to improve traul's search quality. This is a tech spike — research and prototype before committing to implementation.

Reference: tobi/qmd — on-device hybrid search engine using node-llama-cpp with GGUF models.

Context

Traul currently uses a simple hybrid search: FTS5 (BM25) + vector similarity (sqlite-vec) merged via Reciprocal Rank Fusion. QMD adds three additional stages that significantly improve search quality:
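For context, the Reciprocal Rank Fusion step mentioned above can be sketched in a few lines: each document scores 1/(k + rank) in every list it appears in, and the sums are sorted. This is a minimal illustrative sketch (the function names and k = 60 convention are assumptions, not traul's actual code):

```typescript
// Reciprocal Rank Fusion: merge several ranked result lists into one.
// Each document's fused score is the sum of 1 / (k + rank) over every
// list it appears in; k (conventionally 60) damps the top-rank bonus.
function fuseRRF(rankings: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, rank) => {
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return scores;
}

// Sort fused scores into a final ranking, best first.
function topRRF(rankings: string[][], k = 60): string[] {
  return [...fuseRRF(rankings, k).entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}
```

A document ranked moderately in both lists (e.g. FTS and vector) beats one ranked well in only a single list, which is exactly the behavior the reranking stage later refines.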

1. Query Expansion

  • Uses a small LLM (~1.7GB GGUF) to generate alternative query variants before searching
  • Produces typed variants: lex (keyword rewrites for FTS), vec (semantic rewrites for vector search), hyde (hypothetical document for embedding)
  • Each variant is routed to the appropriate search backend
  • QMD model: qmd-query-expansion-1.7B (custom fine-tune by @tobi)
  • Why it matters for traul: multilingual queries (Russian ↔ English) would benefit from expansion — a Russian query could generate English keyword variants and vice versa
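The routing of typed variants can be sketched as follows. The `lex`/`vec`/`hyde` kinds come from QMD; the routing function and backend names are illustrative assumptions, not traul's actual interfaces:

```typescript
// Typed query variants as produced by QMD-style expansion.
//   lex  → keyword rewrite, routed to FTS5/BM25
//   vec  → semantic rewrite, routed to vector search
//   hyde → hypothetical answer document, embedded then vector-searched
type VariantKind = "lex" | "vec" | "hyde";

interface QueryVariant {
  kind: VariantKind;
  text: string;
}

// Route each variant to the backend that can use it (names hypothetical).
function routeVariants(variants: QueryVariant[]): {
  ftsQueries: string[];
  vectorQueries: string[];
} {
  const ftsQueries: string[] = [];
  const vectorQueries: string[] = [];
  for (const v of variants) {
    if (v.kind === "lex") ftsQueries.push(v.text);
    else vectorQueries.push(v.text); // vec and hyde both go through embeddings
  }
  return { ftsQueries, vectorQueries };
}
```

For the cross-language case, a Russian query like "метрики продукта" ("product metrics") could yield an English `lex` variant for FTS alongside the original for vector search.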

2. LLM Reranking

  • After RRF fusion produces candidates, a reranker model rescores them
  • Uses Qwen3-Reranker-0.6B (~0.6GB GGUF) via node-llama-cpp
  • Position-aware score blending: combines RRF rank with reranker score
  • Chunked reranking to stay within context limits
  • Why it matters for traul: current RRF scores are purely statistical — reranking adds semantic understanding of query-result relevance
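One plausible shape for position-aware blending is below. QMD's exact formula isn't reproduced here; the alpha weight and the 1/(1 + rank) prior are assumptions for illustration:

```typescript
interface Candidate {
  docId: string;
  rrfRank: number;     // 0-based position after RRF fusion
  rerankScore: number; // relevance score from the reranker, assumed in [0, 1]
}

// Position-aware blending: convert the RRF rank into a decaying prior
// and mix it with the reranker's semantic score. alpha weights the
// reranker; the rank prior preserves RRF's ordering as a tiebreaker.
function blendScores(c: Candidate, alpha = 0.7): number {
  const rankPrior = 1 / (1 + c.rrfRank);
  return alpha * c.rerankScore + (1 - alpha) * rankPrior;
}
```

The design intent is that a semantically strong result can climb past a statistically higher-ranked one, but never on reranker score alone.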

3. Smart Chunking

  • Documents split into overlapping chunks (900 tokens, 135 token overlap)
  • Best chunk per document selected via keyword matching
  • Reranking operates on chunks, not full documents
  • Why it matters for traul: most messages are short (< 500 tokens), so chunking may only matter for long-form content (Claude Code sessions, call transcripts, markdown docs)
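The overlapping-window scheme with QMD's numbers (900-token chunks, 135-token overlap) can be sketched over a pre-tokenized sequence; the tokenizer itself is out of scope and the function name is hypothetical:

```typescript
// Split a token sequence into overlapping windows. Defaults mirror
// QMD's settings: 900-token chunks with a 135-token overlap, i.e. a
// stride of 765 tokens. Short inputs pass through unchunked.
function chunkTokens<T>(tokens: T[], size = 900, overlap = 135): T[][] {
  if (tokens.length <= size) return [tokens]; // short messages: no chunking
  const step = size - overlap;
  const chunks: T[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size));
    if (start + size >= tokens.length) break; // last window reached the end
  }
  return chunks;
}
```

The overlap guarantees that any 135-token span straddling a chunk boundary appears whole in at least one chunk, so no content is lost to the split.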

Prerequisite

This spike depends on migrating from Ollama to node-llama-cpp for embeddings (separate effort). The same node-llama-cpp singleton would serve embedding, expansion, and reranking models.
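The singleton lifecycle can be sketched independently of node-llama-cpp's actual API (deliberately not invoked here); the point is memoized async loading so the three models share one registry. `ModelRegistry` and its loader callback are illustrative names:

```typescript
// Memoize async model loads so each GGUF file is loaded at most once
// and shared across embedding, expansion, and reranking callers.
class ModelRegistry<M> {
  private loads = new Map<string, Promise<M>>();

  constructor(private loader: (modelPath: string) => Promise<M>) {}

  // First call per path triggers the load; later (and concurrent)
  // calls reuse the in-flight or completed promise.
  get(modelPath: string): Promise<M> {
    let pending = this.loads.get(modelPath);
    if (!pending) {
      pending = this.loader(modelPath);
      this.loads.set(modelPath, pending);
    }
    return pending;
  }

  // Drop a model so an idle-timeout policy can reclaim memory.
  unload(modelPath: string): void {
    this.loads.delete(modelPath);
  }
}
```

Storing the promise rather than the resolved model also answers the "multiple concurrent embed calls" test below: racing callers await the same load instead of triggering duplicates.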

Models & Resource Budget

Component        Model                            Size    Purpose
Embeddings       Qwen3-Embedding-0.6B-Q8_0        ~639MB  Replace snowflake-arctic-embed2
Query expansion  qmd-query-expansion-1.7B-q4_k_m  ~1.7GB  Generate query variants
Reranking        Qwen3-Reranker-0.6B-Q8_0         ~0.6GB  Rescore candidates
Total VRAM                                        ~3GB    All three loaded

Test Suite (write BEFORE implementation)

Search Quality Tests

search/quality/expansion.test.ts
- Query expansion generates meaningful variants for English queries
- Query expansion generates meaningful variants for Russian queries
- Cross-language expansion: Russian query produces English lex variants
- Cross-language expansion: English query produces Russian vec variants
- Expansion with empty/single-word queries doesn't crash
- Expansion respects timeout (doesn't hang on slow model)

search/quality/reranking.test.ts
- Reranker scores relevant results higher than irrelevant ones
- Reranker handles mixed-language results (Russian + English)
- Reranker improves NDCG over raw RRF scores (golden set comparison)
- Reranker gracefully degrades when model unavailable (falls back to RRF)
- Reranker respects candidate limit (doesn't process more than N docs)
- Position-aware blending produces monotonically decreasing scores

search/quality/chunking.test.ts
- Short messages (< 500 tokens) are not chunked
- Long content (Claude Code sessions) is chunked with overlap
- Best chunk selection picks the chunk with most query term overlap
- Chunk boundaries respect sentence/paragraph breaks when possible
- Chunk overlap ensures no content is lost between chunks

search/quality/pipeline.test.ts
- Full pipeline (expand → search → rerank) returns better results than FTS-only
- Full pipeline returns better results than current hybrid (vec + FTS + RRF)
- Pipeline gracefully degrades: no expansion model → skip expansion
- Pipeline gracefully degrades: no reranker → skip reranking
- Pipeline latency stays under 2s for single-word queries (warm models)
- Pipeline latency stays under 5s for complex multi-word queries (warm models)

Golden Set for Quality Measurement

Build a golden test set of ~20 queries with expected top-3 results from the actual traul database:

  • 5 Russian keyword queries (e.g., "метрики продукта")
  • 5 English keyword queries (e.g., "deployment issues")
  • 5 semantic/conceptual queries (e.g., "how to measure user satisfaction")
  • 5 cross-language queries (e.g., Russian query where best results are in English)

Measure: Precision@3, NDCG@10, Mean Reciprocal Rank before and after each pipeline stage.
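These three metrics are straightforward to compute per query over binary relevance judgments; a sketch (function names are illustrative, and NDCG uses the standard 1/log2(rank + 1) binary-gain form):

```typescript
// Offline quality metrics for one query.
//   ranked:   system output, best first
//   relevant: ids judged relevant in the golden set

function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
  return ranked.slice(0, k).filter((id) => relevant.has(id)).length / k;
}

function reciprocalRank(ranked: string[], relevant: Set<string>): number {
  const i = ranked.findIndex((id) => relevant.has(id));
  return i === -1 ? 0 : 1 / (i + 1);
}

// Binary-relevance NDCG@k: DCG with 1/log2(rank + 1) gains, normalized
// by the ideal DCG for the number of relevant documents.
function ndcgAtK(ranked: string[], relevant: Set<string>, k: number): number {
  const dcg = ranked
    .slice(0, k)
    .reduce((s, id, i) => s + (relevant.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  let idcg = 0;
  for (let i = 0; i < Math.min(relevant.size, k); i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}
```

Averaging these over the ~20 golden queries after each pipeline stage gives the before/after numbers for the benchmark report.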

Integration Tests

search/integration/node-llama-cpp.test.ts
- LlamaCpp singleton loads embedding model on first use
- LlamaCpp singleton reuses model across calls (no reload)
- Multiple concurrent embed calls don't crash
- Model unloads after idle timeout
- Graceful error when GGUF file missing

search/integration/model-lifecycle.test.ts
- Embedding + reranker models can coexist in memory
- All three models (embed + expand + rerank) load within 3GB VRAM
- Process exits cleanly with models loaded (no hanging)

Deliverables

  1. Test suite written and committed (tests will fail initially — TDD)
  2. Spike branch with prototype implementation
  3. Benchmark report: before/after search quality (golden set) and latency
  4. Decision doc: go/no-go on each component (expansion, reranking, chunking) based on quality gains vs. resource cost

Open Questions

  • Can we use QMD's custom qmd-query-expansion-1.7B model, or do we need our own fine-tune for message-style content?
  • Is 3GB total VRAM acceptable for a CLI tool? Should models load on-demand and unload aggressively?
  • For short messages, does reranking actually improve results, or is RRF sufficient?
  • Should we consider QMD as a dependency rather than reimplementing? (traul indexes messages, QMD indexes documents — different data models, but search pipeline could be shared)
