hybrid-rrf-search

Hybrid search over text corpora, fusing BM25 and dense embeddings with Reciprocal Rank Fusion.

What it does

Indexes 5,000 arXiv abstracts (cs.LG / cs.AI / cs.CL / cs.CV) twice: once with BM25 (sparse, lexical) and once with sentence-transformer embeddings (dense, semantic). At query time, both rankings are fused with Reciprocal Rank Fusion (RRF). The result beats either index alone on a 20-query benchmark.

Why it matters

In production search, BM25 is unbeatable for exact terminology and dense models are unbeatable for paraphrase. Modern systems (Elasticsearch hybrid, Vespa, OpenSearch, pgvector + tsvector) ship a fusion layer for exactly this reason. RRF is the simplest, most robust fusion strategy and it works without any score normalization, learning, or parameter tuning beyond a single constant.

This pattern (RRF over BM25 plus dense embeddings) is the retrieval backbone of a CV intelligence platform I am currently building in private.

How it works

Sparse index (src/bm25_index.py): a rank-bm25 Okapi BM25 over whitespace-tokenized lowercased text.
Dense index (src/dense_index.py): sentence-transformers/all-MiniLM-L6-v2 (22 M params), embeddings normalized to unit L2 norm, cosine similarity = dot product.
Fusion (src/rrf.py): for each document, RRF score = sum over rankers of 1 / (60 + rank). The constant 60 is the value from Cormack et al. and is robust across corpora.
Pseudo-relevance evaluation: for each of 20 reference queries, we take the union of the top-30 from BM25 and dense, rerank that union with a cross-encoder (cross-encoder/ms-marco-MiniLM-L-6-v2), and treat the cross-encoder's top-10 as the gold relevance set. We then measure Precision@10 of BM25, dense, and RRF against this gold.

Architecture

Quickstart

git clone https://github.com/Mathos34/hybrid-rrf-search
cd hybrid-rrf-search
python -m venv .venv && source .venv/bin/activate   # or .venv\Scripts\activate on Windows
pip install -r requirements.txt
python scripts/download.py     # ~5,000 abstracts via HuggingFace streaming
python benchmark.py
python scripts/make_viz.py

End-to-end is about 4 minutes on a laptop CPU (most of it is computing dense embeddings the first time; results are cached as runs/dense_embeddings.npy).

Results

5,000 arXiv abstracts. 20 reference queries. P@10 against cross-encoder pseudo-relevance.

Method	Mean P@10	Latency / query
BM25 alone	55.5%	10.3 ms
Dense (MiniLM-L6-v2) alone	40.0%	13.2 ms
RRF (BM25 + dense)	57.0%	22.4 ms

RRF outperforms both individual rankers without any tuning. Dense underperforms here because MiniLM is a small model and scientific abstracts are lexically dense; this is exactly the regime where fusion helps most.

Caveat on the metric: cross-encoder pseudo-relevance is a soft proxy for human judgments and tends to slightly favor lexical matches. With graded human relevance the absolute numbers would change, but the relative ordering (RRF >= max(BM25, dense)) is well-established in the literature.

References

Cormack, Clarke, Buettcher, Reciprocal Rank Fusion outperforms Condorcet and individual rank learning methods, SIGIR 2009.
Robertson, Walker, Jones, Hancock-Beaulieu, Gatford, Okapi at TREC-3, NIST 1995.
Reimers and Gurevych, Sentence-BERT, EMNLP 2019.

About

Built by Mathis Lacombe, AI Maker at the Intelligence Lab, ECE Paris. LinkedIn · Hugging Face

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
assets		assets
scripts		scripts
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
benchmark.py		benchmark.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

hybrid-rrf-search

What it does

Why it matters

How it works

Architecture

Quickstart

Results

References

About

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

hybrid-rrf-search

What it does

Why it matters

How it works

Architecture

Quickstart

Results

References

About

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages