feat(chat): forward document_ids through retrieval for multi-document chat #669

Merged
param20h merged 1 commit into param20h:dev from Yuthika10:feat/multi-document-chat-649-dev
Jun 23, 2026

@Yuthika10

Contributor

🔗 Related Issue

Closes #649

📝 What does this PR do?

document_ids already existed as a field on ChatRequest, and both storage layers already supported it — query_chunks (vectorstore) filters with ChromaDB's $in, and query_bm25 queries per-document indexes. But retrieve() never forwarded document_ids to either, and the chat routes never passed it down, so asking a question across multiple documents silently behaved exactly like no document filter at all. This PR connects that existing capability to the request flow.

  • retrieve() accepts document_ids and forwards it to both query_chunks and query_bm25 (and its trace metadata factory).
  • PDFSearchTool and the agent (get_agent_executor, generate_answer, generate_answer_stream) thread document_ids through to retrieval.
  • /ask and /ask/stream forward document_ids and validate every requested document: 404 if any are missing or not owned by the user, 400 if any are still processing — mirroring the existing single-document guard.
  • The response cache key now includes the selected document_ids, so a multi-document query doesn't collide with a single-document query (or a different set of documents) and return a stale answer.
  • The agent system prompt gets comparison guidance when more than one document is selected, so findings are attributed per source.
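The route guard described in the list above could look roughly like the sketch below. It is not the project's code: `RouteError`, `Document`, the `"ready"`/`"processing"` status values, and `validate_documents` are hypothetical names chosen for illustration, and the check order (ownership and readiness evaluated per document, in request order) is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


class RouteError(Exception):
    """Stand-in for an HTTP error response (hypothetical, not the project's class)."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


@dataclass
class Document:
    id: str
    owner_id: str
    status: str  # assumed values: "ready" or "processing"


def validate_documents(store: Dict[str, Document], user_id: str,
                       document_ids: Optional[List[str]]) -> None:
    """Raise unless every requested document exists, is owned, and is ready."""
    for doc_id in document_ids or []:
        doc = store.get(doc_id)
        # Missing and not-owned are reported identically, so the response
        # does not reveal whether another user's document exists.
        if doc is None or doc.owner_id != user_id:
            raise RouteError(404, f"Document {doc_id} not found")
        if doc.status != "ready":
            raise RouteError(400, f"Document {doc_id} is still processing")
```

A route would call `validate_documents(...)` before retrieval, so a request naming even one invalid document fails as a whole rather than silently searching a subset.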

document_id keeps precedence over document_ids everywhere, matching the existing vectorstore logic.
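As a minimal sketch of the forwarding and the precedence rule, assuming the two storage functions are passed in as callables and that chunks carry a `document_id` metadata key (both assumptions, as are the exact signatures):

```python
from typing import Any, Callable, Dict, List, Optional, Sequence


def build_where(document_id: Optional[str],
                document_ids: Optional[Sequence[str]]) -> Optional[Dict[str, Any]]:
    """Build a Chroma-style metadata filter; document_id wins if both are set."""
    if document_id:
        return {"document_id": document_id}  # metadata key name is assumed
    if document_ids:
        return {"document_id": {"$in": list(document_ids)}}
    return None  # no filter: search across every document


def retrieve(query: str,
             query_chunks: Callable[..., List[Any]],
             query_bm25: Callable[..., List[Any]],
             document_id: Optional[str] = None,
             document_ids: Optional[Sequence[str]] = None) -> List[Any]:
    """Forward both filters to the vector and BM25 searches, then combine."""
    dense = query_chunks(query, document_id=document_id, document_ids=document_ids)
    sparse = query_bm25(query, document_id=document_id, document_ids=document_ids)
    return dense + sparse  # the real merge/rerank step is not shown in this PR text
```

Here `build_where` only illustrates the precedence a storage layer might apply; the point of `retrieve` is that neither keyword is dropped on the way down.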

🗂️ Type of Change

  • ✨ New feature

🧪 How was this tested?

  • Added / updated tests

Added tests/test_multi_document_chat.py (7 tests): document_ids reaches both the vector and BM25 calls in retrieve(); single-document requests still leave document_ids as None; the comparison guidance appears in the agent prompt only when more than one document is selected; and the route guard returns 404 for missing/unowned documents and 400 for not-ready ones. Full suite passes (239 tests). I also updated one existing mock (fake_retrieve in test_rag_tools.py) to accept the document_ids kwarg the tool now forwards — a signature-only change, assertions unchanged.
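To illustrate why the existing mock needed a signature-only update, here is a small sketch. `run_tool` is a hypothetical stand-in for the tool's call into retrieval, and both mock signatures are assumed rather than copied from `test_rag_tools.py`.

```python
def fake_retrieve_old(query, document_id=None):
    """Mock with a signature from before the tool forwarded document_ids."""
    return []


def fake_retrieve_new(query, document_id=None, document_ids=None):
    """Same mock, widened to accept the new keyword; still returns no chunks."""
    return []


def run_tool(retrieve_fn, query, document_ids=None):
    """Hypothetical stand-in for the tool forwarding document_ids to retrieval."""
    return retrieve_fn(query, document_id=None, document_ids=document_ids)


try:
    run_tool(fake_retrieve_old, "compare the two reports", ["a", "b"])
    old_mock_accepts_kwarg = True
except TypeError:
    # Python rejects the unexpected keyword argument 'document_ids'.
    old_mock_accepts_kwarg = False

result = run_tool(fake_retrieve_new, "compare the two reports", ["a", "b"])
```

Because only the parameter list changes, any assertions built on the mock's return value keep working as before.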

I didn't test via live API calls — the agent path needs a HuggingFace model, so verification is through the unit and route tests rather than a real generation.

⚠️ Anything to flag for reviewers?

The storage-layer support for document_ids (vectorstore $in filter, per-document BM25) was already present on dev from earlier work — it just wasn't reachable because retrieve() and the routes didn't forward the field. This PR is the wiring plus the ownership guard, cache-key fix, and comparison prompt, not the storage filters themselves.

One thing worth a look: the cache-key change. Previously the cache keyed on document_id only, so a multi-document query (where document_id is None) would have keyed on an empty string and could return a cached answer from an unrelated query. I prefixed multi-doc cache keys with multi: plus the sorted document IDs so they're distinct and order-independent. Happy to adjust if you'd prefer a different cache strategy.
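A minimal sketch of that cache-key scheme follows. Only the `multi:` prefix and the sorting come from the description above; the comma separator, the `|`-joined layout, the SHA-256 hashing, and the function names are assumptions made for illustration.

```python
import hashlib
from typing import Optional, Sequence


def cache_scope(document_id: Optional[str],
                document_ids: Optional[Sequence[str]]) -> str:
    """Return the document part of the cache key; document_id takes precedence."""
    if document_id:
        return document_id
    if document_ids:
        # Sorting makes the key independent of the order the IDs were sent in.
        return "multi:" + ",".join(sorted(document_ids))
    return ""  # no document filter at all


def cache_key(user_id: str, question: str,
              document_id: Optional[str] = None,
              document_ids: Optional[Sequence[str]] = None) -> str:
    """Hash user, document scope, and question into one key (layout assumed)."""
    raw = "|".join([user_id, cache_scope(document_id, document_ids), question])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


print(cache_scope(None, ["b", "a"]))  # prints "multi:a,b"
```

With this shape, `["a", "b"]` and `["b", "a"]` share a cache entry, while a single-document query, an unfiltered query, and a different document set each get their own.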

✅ Self-Review Checklist

  • My branch is based on dev, not main
  • I have not added any secrets / API keys
  • I have not modified main branch or any HuggingFace deployment config
  • My code follows the existing style (no unnecessary formatting changes)
  • I have updated relevant docs / comments if needed

@Yuthika10 requested a review from param20h as a code owner June 22, 2026 13:13
@Yuthika10

Contributor Author

Hi @param20h! This PR closes issue #649. Could you please review it? I would love your feedback for future PRs.

@param20h merged commit c1524bc into param20h:dev Jun 23, 2026
8 checks passed
@github-actions (bot) added labels on Jun 23, 2026: enhancement (New feature or improvement), gssoc (GirlScript Summer of Code 2026 issue/PR), gssoc:approved (Approved for GSSoC base points, +50 pts), mentor:param20h (Mentor for this PR)

Labels

  • enhancement (New feature or improvement)
  • gssoc:approved (Approved for GSSoC base points, +50 pts)
  • gssoc (GirlScript Summer of Code 2026 issue/PR)
  • level:intermediate (+35 pts)
  • mentor:param20h (Mentor for this PR)
  • type:feature (+10 pts)

Development

Successfully merging this pull request may close these issues.

[FEAT] Wire up multi-document chat - document_ids is in ChatRequest and both vectorstore/BM25 support it, but retrieve() never forwards it

2 participants