Summary
Explore adopting QMD's advanced search pipeline components to improve traul's search quality. This is a tech spike — research and prototype before committing to implementation.
Reference: tobi/qmd — on-device hybrid search engine using node-llama-cpp with GGUF models.
Context
Traul currently uses a simple hybrid search: FTS5 (BM25) + vector similarity (sqlite-vec) merged via Reciprocal Rank Fusion. QMD adds three additional stages that significantly improve search quality:
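For reference, the RRF merge traul already does can be sketched as follows. This is a minimal illustration, not traul's actual code; `k = 60` is the conventional RRF constant and is an assumption here (traul's value is not stated in this issue):

```typescript
// Reciprocal Rank Fusion: each backend's ranked list contributes
// 1 / (k + rank) per document; documents appearing in several lists
// accumulate score. k = 60 is the conventional default (assumed).
function rrfFuse(rankings: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      // rank is 0-based, so the top result contributes 1 / (k + 1)
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return scores;
}

// Example: "a" tops both the FTS and the vector ranking, so it wins.
const fused = rrfFuse([["a", "b"], ["a", "c"]]);
```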
1. Query Expansion
- Uses a small LLM (~1.7GB GGUF) to generate alternative query variants before searching
- Produces typed variants: `lex` (keyword rewrites for FTS), `vec` (semantic rewrites for vector search), `hyde` (hypothetical document for embedding)
- Each variant is routed to the appropriate search backend
- QMD model: `qmd-query-expansion-1.7B` (custom fine-tune by @tobi)
- Why it matters for traul: multilingual queries (Russian ↔ English) would benefit from expansion — a Russian query could generate English keyword variants and vice versa
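The typed-variant routing above can be sketched like this. The `QueryVariant` type and `routeVariant` function are hypothetical names for illustration; only the `lex`/`vec`/`hyde` variant kinds come from QMD:

```typescript
// QMD-style typed query variants (lex/vec/hyde) and a hypothetical router.
type VariantKind = "lex" | "vec" | "hyde";

interface QueryVariant {
  kind: VariantKind;
  text: string;
}

// lex  -> FTS5 (BM25 keyword search)
// vec  -> vector similarity over the query's embedding
// hyde -> vector similarity over the embedding of a hypothetical document
function routeVariant(v: QueryVariant): "fts" | "vector" {
  return v.kind === "lex" ? "fts" : "vector";
}

// Example: a Russian query expanded into cross-language variants.
const variants: QueryVariant[] = [
  { kind: "lex", text: "product metrics KPI" }, // English keyword rewrite
  { kind: "vec", text: "how we track product health" },
  { kind: "hyde", text: "Our key product metrics are retention and DAU." },
];
const targets = variants.map(routeVariant); // ["fts", "vector", "vector"]
```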
2. LLM Reranking
- After RRF fusion produces candidates, a reranker model rescores them
- Uses `Qwen3-Reranker-0.6B` (~0.6GB GGUF) via `node-llama-cpp`
- Position-aware score blending: combines RRF rank with reranker score
- Chunked reranking to stay within context limits
- Why it matters for traul: current RRF scores are purely statistical — reranking adds semantic understanding of query-result relevance
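Position-aware blending could look something like the sketch below. This assumes a simple linear blend; QMD's actual formula may differ, and `alpha` and the rank prior constant `k` are hypothetical parameters:

```typescript
// Position-aware score blending sketch: keep some trust in the RRF
// ordering (a 1/(rank + k) prior) and mix in the reranker's semantic score.
interface Candidate {
  id: string;
  rrfRank: number;     // 0-based rank from RRF fusion
  rerankScore: number; // reranker relevance in [0, 1]
}

function blend(c: Candidate, alpha = 0.7, k = 10): number {
  const rankPrior = 1 / (c.rrfRank + k);
  return alpha * c.rerankScore + (1 - alpha) * rankPrior;
}

const candidates: Candidate[] = [
  { id: "a", rrfRank: 0, rerankScore: 0.2 }, // ranked first by RRF, weak semantically
  { id: "b", rrfRank: 4, rerankScore: 0.9 }, // ranked fifth by RRF, strong semantically
];

// A strong reranker score can overtake a better RRF rank.
const resorted = [...candidates].sort((x, y) => blend(y) - blend(x));
```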
3. Smart Chunking
- Documents split into overlapping chunks (900 tokens, 135 token overlap)
- Best chunk per document selected via keyword matching
- Reranking operates on chunks, not full documents
- Relevance for traul: most messages are short (< 500 tokens), so chunking may only matter for long-form content (Claude Code sessions, call transcripts, markdown docs)
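The chunking scheme (900-token windows, 135-token overlap) plus best-chunk selection can be sketched as below. Whitespace words stand in for tokens here; a real implementation would use the model's tokenizer, and both function names are hypothetical:

```typescript
// Overlapping chunking: 900-token windows advancing by 900 - 135 = 765,
// so adjacent chunks share 135 tokens. Short inputs are returned whole.
function chunkTokens(tokens: string[], size = 900, overlap = 135): string[][] {
  if (tokens.length <= size) return [tokens]; // short messages: no chunking
  const step = size - overlap;
  const chunks: string[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size));
    if (start + size >= tokens.length) break; // last window reached the end
  }
  return chunks;
}

// Best chunk per document: the chunk sharing the most terms with the query.
function bestChunk(chunks: string[][], queryTerms: string[]): number {
  let best = 0;
  let bestScore = -1;
  chunks.forEach((chunk, i) => {
    const terms = new Set(chunk);
    const score = queryTerms.filter((t) => terms.has(t)).length;
    if (score > bestScore) {
      bestScore = score;
      best = i;
    }
  });
  return best;
}
```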
Prerequisite
This spike depends on migrating from Ollama to node-llama-cpp for embeddings (separate effort). The same node-llama-cpp singleton would serve embedding, expansion, and reranking models.
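The shared-singleton idea can be sketched generically, without committing to node-llama-cpp's API surface (which the separate migration effort would pin down). `ModelHandle`, `ModelPool`, and the injected `loadGguf` callback are all hypothetical stand-ins:

```typescript
// Lazy singleton pool: each GGUF path is loaded once and the load promise
// is cached, so embedding, expansion, and reranking callers (including
// concurrent ones) share a single load.
type ModelHandle = { path: string };

class ModelPool {
  private cache = new Map<string, Promise<ModelHandle>>();

  constructor(private loadGguf: (path: string) => Promise<ModelHandle>) {}

  // Returns the cached load promise; concurrent callers get the same one.
  get(path: string): Promise<ModelHandle> {
    let pending = this.cache.get(path);
    if (!pending) {
      pending = this.loadGguf(path);
      this.cache.set(path, pending);
    }
    return pending;
  }
}
```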
Models & Resource Budget
| Component | Model | Size | Purpose |
|---|---|---|---|
| Embeddings | `Qwen3-Embedding-0.6B-Q8_0` | ~639MB | Replace snowflake-arctic-embed2 |
| Query expansion | `qmd-query-expansion-1.7B-q4_k_m` | ~1.7GB | Generate query variants |
| Reranking | `Qwen3-Reranker-0.6B-Q8_0` | ~0.6GB | Rescore candidates |
| **Total VRAM** | | **~3GB** | All three loaded |
Test Suite (write BEFORE implementation)
Search Quality Tests
search/quality/expansion.test.ts
- Query expansion generates meaningful variants for English queries
- Query expansion generates meaningful variants for Russian queries
- Cross-language expansion: Russian query produces English lex variants
- Cross-language expansion: English query produces Russian vec variants
- Expansion with empty/single-word queries doesn't crash
- Expansion respects timeout (doesn't hang on slow model)
search/quality/reranking.test.ts
- Reranker scores relevant results higher than irrelevant ones
- Reranker handles mixed-language results (Russian + English)
- Reranker improves NDCG over raw RRF scores (golden set comparison)
- Reranker gracefully degrades when model unavailable (falls back to RRF)
- Reranker respects candidate limit (doesn't process more than N docs)
- Position-aware blending produces monotonically decreasing scores
search/quality/chunking.test.ts
- Short messages (< 500 tokens) are not chunked
- Long content (Claude Code sessions) is chunked with overlap
- Best chunk selection picks the chunk with most query term overlap
- Chunk boundaries respect sentence/paragraph breaks when possible
- Chunk overlap ensures no content is lost between chunks
search/quality/pipeline.test.ts
- Full pipeline (expand → search → rerank) returns better results than FTS-only
- Full pipeline returns better results than current hybrid (vec + FTS + RRF)
- Pipeline gracefully degrades: no expansion model → skip expansion
- Pipeline gracefully degrades: no reranker → skip reranking
- Pipeline latency stays under 2s for single-word queries (warm models)
- Pipeline latency stays under 5s for complex multi-word queries (warm models)
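The graceful-degradation behavior these tests pin down can be sketched as a pipeline where each optional stage is skipped when its model is unavailable. The `Stages` shape and `runPipeline` are hypothetical illustrations:

```typescript
// Pipeline sketch: expand -> search -> rerank, where expansion and
// reranking are optional and absent stages are skipped, not errors.
interface Stages {
  expand?: (query: string) => string[];
  rerank?: (query: string, ids: string[]) => string[];
}

function runPipeline(
  query: string,
  search: (queries: string[]) => string[],
  stages: Stages,
): string[] {
  // No expansion model loaded -> search with the raw query only.
  const queries = stages.expand ? stages.expand(query) : [query];
  const candidates = search(queries);
  // No reranker loaded -> fall back to the fused (RRF) ordering.
  return stages.rerank ? stages.rerank(query, candidates) : candidates;
}
```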
Golden Set for Quality Measurement
Build a golden test set of ~20 queries with expected top-3 results from the actual traul database:
- 5 Russian keyword queries (e.g., "метрики продукта")
- 5 English keyword queries (e.g., "deployment issues")
- 5 semantic/conceptual queries (e.g., "how to measure user satisfaction")
- 5 cross-language queries (e.g., Russian query where best results are in English)
Measure: Precision@3, NDCG@10, Mean Reciprocal Rank before and after each pipeline stage.
Integration Tests
search/integration/node-llama-cpp.test.ts
- LlamaCpp singleton loads embedding model on first use
- LlamaCpp singleton reuses model across calls (no reload)
- Multiple concurrent embed calls don't crash
- Model unloads after idle timeout
- Graceful error when GGUF file missing
search/integration/model-lifecycle.test.ts
- Embedding + reranker models can coexist in memory
- All three models (embed + expand + rerank) load within 3GB VRAM
- Process exits cleanly with models loaded (no hanging)
Deliverables
- Test suite written and committed (tests will fail initially — TDD)
- Spike branch with prototype implementation
- Benchmark report: before/after search quality (golden set) and latency
- Decision doc: go/no-go on each component (expansion, reranking, chunking) based on quality gains vs. resource cost
Open Questions
- Can we use QMD's custom `qmd-query-expansion-1.7B` model, or do we need our own fine-tune for message-style content?
- Is 3GB total VRAM acceptable for a CLI tool? Should models load on-demand and unload aggressively?
- For short messages, does reranking actually improve results, or is RRF sufficient?
- Should we consider QMD as a dependency rather than reimplementing? (traul indexes messages, QMD indexes documents — different data models, but search pipeline could be shared)