fix: Add unique IDs for RRF deduplication in hybrid retrieval #644

Open
ananyasingh7052258502-coder wants to merge 1 commit into param20h:dev from ananyasingh7052258502-coder:fix-rrf-dedup

@ananyasingh7052258502-coder

Contributor

📋 PR Checklist

Thank you for contributing to PDF-Assistant-RAG!


🔗 Related Issue

Closes #634


📝 What does this PR do?

Adds unique chunk IDs in hybrid retrieval to fix RRF deduplication.

Problem:
RRF failed to deduplicate chunks because the BM25 retriever did not assign IDs to its chunks. Chunks from the vector retriever had IDs but BM25 chunks had none, so a chunk returned by both retrievers was never merged into a single result.

Solution:

  1. Added chunk['id'] generation in CustomBM25Retriever - retriever.py
  2. Added chunk['id'] generation in _query_single_index - bm25.py
  3. ID format: bm25_{document_id}_{page}_{index}
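
A minimal sketch of how an ID in this format might be built. The function names `make_bm25_chunk_id` and `assign_ids`, and the chunk keys `document_id` and `page`, are assumptions for illustration; this is not the code in this PR.

```python
def make_bm25_chunk_id(chunk: dict, index: int) -> str:
    """Build an ID of the form bm25_{document_id}_{page}_{index}.

    Hypothetical helper: the keys 'document_id' and 'page' are assumed
    to exist on the chunk; missing values fall back to 'unknown'.
    """
    document_id = chunk.get("document_id", "unknown")
    page = chunk.get("page", "unknown")
    return f"bm25_{document_id}_{page}_{index}"


def assign_ids(chunks: list[dict]) -> list[dict]:
    """Set chunk['id'] on every chunk that does not already have one."""
    for index, chunk in enumerate(chunks):
        if not chunk.get("id"):
            chunk["id"] = make_bm25_chunk_id(chunk, index)
    return chunks
```

Chunks that already carry an `id` are left untouched in this sketch, so IDs assigned upstream are not overwritten.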

Now RRF can properly identify and merge duplicate chunks from Vector + BM25.
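
To illustrate why RRF depends on IDs, here is a rough sketch of Reciprocal Rank Fusion keyed by `chunk['id']`. The function name `rrf_merge`, the `rrf_score` key, and the sample IDs are hypothetical, and `k=60` is the commonly used default rather than a value taken from this repository. Note the assumption that both retrievers emit the same `id` for the same chunk; only then do their scores accumulate on one entry.

```python
from collections import defaultdict


def rrf_merge(result_lists, k=60):
    """Fuse ranked result lists with Reciprocal Rank Fusion.

    Each chunk contributes 1 / (k + rank) to the score of its ID, so a
    chunk returned by several retrievers is merged into one entry.
    """
    scores = defaultdict(float)
    chunks_by_id = {}
    for results in result_lists:
        for rank, chunk in enumerate(results, start=1):
            chunk_id = chunk["id"]  # raises KeyError if a chunk has no ID
            scores[chunk_id] += 1.0 / (k + rank)
            # Keep the first copy of the chunk seen for this ID.
            chunks_by_id.setdefault(chunk_id, chunk)
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [{**chunks_by_id[cid], "rrf_score": scores[cid]} for cid in ranked_ids]


# Hypothetical results: chunk "c2" is returned by both retrievers.
vector_results = [{"id": "c1"}, {"id": "c2"}]
bm25_results = [{"id": "c2"}, {"id": "c3"}]
merged = rrf_merge([vector_results, bm25_results])
print([chunk["id"] for chunk in merged])
```

Without an `id` on the BM25 chunks there is no shared key to accumulate scores on, which is the failure described in the problem statement.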


🗂️ Type of Change

  • [x] 🐛 Bug fix
  • [ ] ✨ New feature
  • [ ] 🔧 Refactor / code cleanup
  • [ ] 📝 Documentation update
  • [ ] 🎨 UI / styling change
  • [ ] ⚙️ CI / tooling / config change
  • [ ] 🧪 Tests

🧪 How was this tested?

  • [x] Ran the backend locally (uvicorn app.main:app --reload)
  • [x] Ran the frontend locally (npm run dev inside frontend/)
  • [ ] Tested the affected API endpoints manually
  • [ ] Added / updated tests

📸 Screenshots (if UI change)


⚠️ Anything to flag for reviewers?

N/A - Simple bug fix. Added unique ID generation for BM25 chunks to enable RRF deduplication. No breaking changes.


✅ Self-Review Checklist

  • [x] My branch is based on dev, not main
  • [x] I have not added any secrets / API keys
  • [x] I have not modified main branch or any HuggingFace deployment config
  • [x] My code follows the existing style (no unnecessary formatting changes)
  • [ ] I have updated relevant docs / comments if needed

@param20h

Owner

merge conflict

@ananyasingh7052258502-coder

Contributor Author

@param20h Sir, I am unable to resolve the merge conflict from the mobile app.
Could you please rebase this onto the latest dev branch? Thank you, sir 🙏

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Fragile RRF deduplication due to missing explicit IDs in BM25 & Chroma retrievers

2 participants