Hybrid search over text corpora, fusing BM25 and dense embeddings with Reciprocal Rank Fusion.
Indexes 5,000 arXiv abstracts (cs.LG / cs.AI / cs.CL / cs.CV) twice: once with BM25 (sparse, lexical) and once with sentence-transformer embeddings (dense, semantic). At query time, both rankings are fused with Reciprocal Rank Fusion (RRF). The result beats either index alone on a 20-query benchmark.
In production search, BM25 is unbeatable for exact terminology and dense models are unbeatable for paraphrase. Modern systems (Elasticsearch hybrid, Vespa, OpenSearch, pgvector + tsvector) ship a fusion layer for exactly this reason. RRF is the simplest, most robust fusion strategy and it works without any score normalization, learning, or parameter tuning beyond a single constant.
This pattern (RRF over BM25 plus dense embeddings) is the retrieval backbone of a CV intelligence platform I am currently building in private.
- Sparse index (
src/bm25_index.py): arank-bm25Okapi BM25 over whitespace-tokenized lowercased text. - Dense index (
src/dense_index.py):sentence-transformers/all-MiniLM-L6-v2(22 M params), embeddings normalized to unit L2 norm, cosine similarity = dot product. - Fusion (
src/rrf.py): for each document, RRF score = sum over rankers of1 / (60 + rank). The constant 60 is the value from Cormack et al. and is robust across corpora. - Pseudo-relevance evaluation: for each of 20 reference queries, we take the union of the top-30 from BM25 and dense, rerank that union with a cross-encoder (
cross-encoder/ms-marco-MiniLM-L-6-v2), and treat the cross-encoder's top-10 as the gold relevance set. We then measure Precision@10 of BM25, dense, and RRF against this gold.
git clone https://github.com/Mathos34/hybrid-rrf-search
cd hybrid-rrf-search
python -m venv .venv && source .venv/bin/activate # or .venv\Scripts\activate on Windows
pip install -r requirements.txt
python scripts/download.py # ~5,000 abstracts via HuggingFace streaming
python benchmark.py
python scripts/make_viz.pyEnd-to-end is about 4 minutes on a laptop CPU (most of it is computing dense embeddings the first time; results are cached as runs/dense_embeddings.npy).
5,000 arXiv abstracts. 20 reference queries. P@10 against cross-encoder pseudo-relevance.
| Method | Mean P@10 | Latency / query |
|---|---|---|
| BM25 alone | 55.5% | 10.3 ms |
| Dense (MiniLM-L6-v2) alone | 40.0% | 13.2 ms |
| RRF (BM25 + dense) | 57.0% | 22.4 ms |
RRF outperforms both individual rankers without any tuning. Dense underperforms here because MiniLM is a small model and scientific abstracts are lexically dense; this is exactly the regime where fusion helps most.
Caveat on the metric: cross-encoder pseudo-relevance is a soft proxy for human judgments and tends to slightly favor lexical matches. With graded human relevance the absolute numbers would change, but the relative ordering (RRF >= max(BM25, dense)) is well-established in the literature.
- Cormack, Clarke, Buettcher, Reciprocal Rank Fusion outperforms Condorcet and individual rank learning methods, SIGIR 2009.
- Robertson, Walker, Jones, Hancock-Beaulieu, Gatford, Okapi at TREC-3, NIST 1995.
- Reimers and Gurevych, Sentence-BERT, EMNLP 2019.
Built by Mathis Lacombe, AI Maker at the Intelligence Lab, ECE Paris. LinkedIn · Hugging Face

