Skip to content

Mathos34/hybrid-rrf-search

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

hybrid-rrf-search

Hybrid search over text corpora, fusing BM25 and dense embeddings with Reciprocal Rank Fusion.

Python PyTorch License

result

What it does

Indexes 5,000 arXiv abstracts (cs.LG / cs.AI / cs.CL / cs.CV) twice: once with BM25 (sparse, lexical) and once with sentence-transformer embeddings (dense, semantic). At query time, both rankings are fused with Reciprocal Rank Fusion (RRF). The result beats either index alone on a 20-query benchmark.

Why it matters

In production search, BM25 is unbeatable for exact terminology and dense models are unbeatable for paraphrase. Modern systems (Elasticsearch hybrid, Vespa, OpenSearch, pgvector + tsvector) ship a fusion layer for exactly this reason. RRF is the simplest, most robust fusion strategy and it works without any score normalization, learning, or parameter tuning beyond a single constant.

This pattern (RRF over BM25 plus dense embeddings) is the retrieval backbone of a CV intelligence platform I am currently building in private.

How it works

  • Sparse index (src/bm25_index.py): a rank-bm25 Okapi BM25 over whitespace-tokenized lowercased text.
  • Dense index (src/dense_index.py): sentence-transformers/all-MiniLM-L6-v2 (22 M params), embeddings normalized to unit L2 norm, cosine similarity = dot product.
  • Fusion (src/rrf.py): for each document, RRF score = sum over rankers of 1 / (60 + rank). The constant 60 is the value from Cormack et al. and is robust across corpora.
  • Pseudo-relevance evaluation: for each of 20 reference queries, we take the union of the top-30 from BM25 and dense, rerank that union with a cross-encoder (cross-encoder/ms-marco-MiniLM-L-6-v2), and treat the cross-encoder's top-10 as the gold relevance set. We then measure Precision@10 of BM25, dense, and RRF against this gold.

Architecture

architecture

Quickstart

git clone https://github.com/Mathos34/hybrid-rrf-search
cd hybrid-rrf-search
python -m venv .venv && source .venv/bin/activate   # or .venv\Scripts\activate on Windows
pip install -r requirements.txt
python scripts/download.py     # ~5,000 abstracts via HuggingFace streaming
python benchmark.py
python scripts/make_viz.py

End-to-end is about 4 minutes on a laptop CPU (most of it is computing dense embeddings the first time; results are cached as runs/dense_embeddings.npy).

Results

5,000 arXiv abstracts. 20 reference queries. P@10 against cross-encoder pseudo-relevance.

Method Mean P@10 Latency / query
BM25 alone 55.5% 10.3 ms
Dense (MiniLM-L6-v2) alone 40.0% 13.2 ms
RRF (BM25 + dense) 57.0% 22.4 ms

RRF outperforms both individual rankers without any tuning. Dense underperforms here because MiniLM is a small model and scientific abstracts are lexically dense; this is exactly the regime where fusion helps most.

Caveat on the metric: cross-encoder pseudo-relevance is a soft proxy for human judgments and tends to slightly favor lexical matches. With graded human relevance the absolute numbers would change, but the relative ordering (RRF >= max(BM25, dense)) is well-established in the literature.

References

  • Cormack, Clarke, Buettcher, Reciprocal Rank Fusion outperforms Condorcet and individual rank learning methods, SIGIR 2009.
  • Robertson, Walker, Jones, Hancock-Beaulieu, Gatford, Okapi at TREC-3, NIST 1995.
  • Reimers and Gurevych, Sentence-BERT, EMNLP 2019.

About

Built by Mathis Lacombe, AI Maker at the Intelligence Lab, ECE Paris. LinkedIn · Hugging Face

About

Hybrid search over text corpora, fusing BM25 and dense embeddings with Reciprocal Rank Fusion.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages