feat(chat): forward document_ids through retrieval for multi-document chat by Yuthika10 · Pull Request #669 · param20h/PDF-Assistant-RAG

Yuthika10 · 2026-06-22T13:13:54Z

🔗 Related Issue

Closes #649

📝 What does this PR do?

document_ids already existed as a field on ChatRequest, and both storage layers already supported it — query_chunks (vectorstore) filters with ChromaDB's $in, and query_bm25 queries per-document indexes. But retrieve() never forwarded document_ids to either, and the chat routes never passed it down, so asking a question across multiple documents silently behaved exactly like no document filter at all. This PR connects that existing capability to the request flow.

retrieve() accepts document_ids and forwards it to both query_chunks and query_bm25 (and its trace metadata factory).
PDFSearchTool and the agent (get_agent_executor, generate_answer, generate_answer_stream) thread document_ids through to retrieval.
/ask and /ask/stream forward document_ids and validate every requested document: 404 if any are missing or not owned by the user, 400 if any are still processing — mirroring the existing single-document guard.
The response cache key now includes the selected document_ids, so a multi-document query doesn't collide with a single-document query (or a different set of documents) and return a stale answer.
The agent system prompt gets comparison guidance when more than one document is selected, so findings are attributed per source.
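
For reviewers who want the guard's behavior spelled out, here is a minimal sketch of the validation described above. It is not the PR's code: `GuardError`, `Document`, `validate_documents`, the in-memory `store`, and the `"ready"` / `"processing"` status values are all hypothetical stand-ins, and a real route would raise its framework's HTTP exception instead. The sketch also assumes the ownership check runs over every requested ID before the readiness check.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


class GuardError(Exception):
    """Stand-in for an HTTP error (hypothetical; not the project's exception type)."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


@dataclass
class Document:
    id: str
    owner_id: str
    status: str  # assumed values: "ready" or "processing"


def validate_documents(
    store: Dict[str, Document], user_id: str, document_ids: Iterable[str]
) -> List[Document]:
    """Return the requested documents, or raise 404 / 400 like the route guard."""
    docs: List[Document] = []
    for doc_id in document_ids:
        doc = store.get(doc_id)
        # Missing and not-owned are reported identically so that the
        # response does not reveal whether another user's document exists.
        if doc is None or doc.owner_id != user_id:
            raise GuardError(404, f"Document {doc_id} not found")
        docs.append(doc)

    not_ready = [d.id for d in docs if d.status != "ready"]
    if not_ready:
        raise GuardError(400, f"Documents still processing: {', '.join(not_ready)}")
    return docs
```

Checking ownership for the whole set first means a request that mixes an unowned document with a processing one gets the 404, which matches how the single-document guard is described.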

document_id keeps precedence over document_ids everywhere, matching the existing vectorstore logic.
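
The precedence rule can be illustrated with a small filter builder. This is a sketch under assumptions, not the vectorstore's actual code: `build_where` is a hypothetical helper and the `document_id` metadata key is assumed. The only part taken from the description is that the multi-document case uses ChromaDB's `$in` operator.

```python
from typing import Any, Dict, Optional, Sequence


def build_where(
    document_id: Optional[str] = None,
    document_ids: Optional[Sequence[str]] = None,
) -> Optional[Dict[str, Any]]:
    """Build a ChromaDB-style metadata filter; a single document_id wins."""
    if document_id:
        # Single-document filter takes precedence, even if document_ids is set.
        return {"document_id": document_id}
    if document_ids:
        # Multi-document filter: match any of the selected documents.
        return {"document_id": {"$in": list(document_ids)}}
    # No filter at all: search across everything the caller is allowed to see.
    return None
```

For example, `build_where("a", ["b", "c"])` yields the single-document filter for `"a"`, while `build_where(None, ["b", "c"])` yields the `$in` filter.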

🗂️ Type of Change

✨ New feature

🧪 How was this tested?

Added / updated tests

Added tests/test_multi_document_chat.py (7 tests): document_ids reaches both the vector and BM25 calls in retrieve(); single-document requests still leave document_ids as None; the comparison guidance appears in the agent prompt only when more than one document is selected; and the route guard returns 404 for missing/unowned documents and 400 for not-ready ones. Full suite passes (239 tests). I also updated one existing mock (fake_retrieve in test_rag_tools.py) to accept the document_ids kwarg the tool now forwards — a signature-only change, assertions unchanged.
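
The shape of the forwarding tests can be sketched as follows. This is an illustration rather than the contents of tests/test_multi_document_chat.py: the `retrieve` here is a toy stand-in that takes its two backends as injected callables (`vector_search`, `keyword_search`), whereas the real tests presumably patch `query_chunks` and `query_bm25`. All signatures are assumptions.

```python
from typing import Any, Callable, List, Optional, Sequence
from unittest.mock import MagicMock


def retrieve(
    query: str,
    *,
    vector_search: Callable[..., List[Any]],
    keyword_search: Callable[..., List[Any]],
    document_id: Optional[str] = None,
    document_ids: Optional[Sequence[str]] = None,
) -> List[Any]:
    """Toy retrieve(): forwards both filters unchanged to each backend."""
    vector_hits = vector_search(query, document_id=document_id, document_ids=document_ids)
    keyword_hits = keyword_search(query, document_id=document_id, document_ids=document_ids)
    return list(vector_hits) + list(keyword_hits)


def test_document_ids_reach_both_backends() -> None:
    vector_search = MagicMock(return_value=["v1"])
    keyword_search = MagicMock(return_value=["k1"])

    hits = retrieve(
        "compare the two reports",
        vector_search=vector_search,
        keyword_search=keyword_search,
        document_ids=["a", "b"],
    )

    assert hits == ["v1", "k1"]
    vector_search.assert_called_once_with(
        "compare the two reports", document_id=None, document_ids=["a", "b"]
    )
    keyword_search.assert_called_once_with(
        "compare the two reports", document_id=None, document_ids=["a", "b"]
    )


def test_single_document_leaves_document_ids_none() -> None:
    vector_search = MagicMock(return_value=[])
    keyword_search = MagicMock(return_value=[])

    retrieve("q", vector_search=vector_search, keyword_search=keyword_search, document_id="a")

    vector_search.assert_called_once_with("q", document_id="a", document_ids=None)
    keyword_search.assert_called_once_with("q", document_id="a", document_ids=None)
```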

I didn't test via live API calls — the agent path needs a HuggingFace model, so verification is through the unit and route tests rather than a real generation.

⚠️ Anything to flag for reviewers?

The storage-layer support for document_ids (vectorstore $in filter, per-document BM25) was already present on dev from earlier work — it just wasn't reachable because retrieve() and the routes didn't forward the field. This PR is the wiring plus the ownership guard, cache-key fix, and comparison prompt, not the storage filters themselves.

One thing worth a look: the cache-key change. Previously the cache keyed on document_id only, so a multi-document query (where document_id is None) would have keyed on an empty string and could return a cached answer from an unrelated query. I prefixed multi-doc cache keys with multi: plus the sorted document IDs so they're distinct and order-independent. Happy to adjust if you'd prefer a different cache strategy.
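
To make the cache-key change concrete, here is a minimal sketch of the idea. Only the `multi:` prefix and the sorted IDs come from the description above. The function names, the comma separator, the `user_id` component, and the SHA-256 hashing are assumptions for illustration and may differ from the project's cache code.

```python
import hashlib
from typing import Optional, Sequence


def document_scope(
    document_id: Optional[str] = None,
    document_ids: Optional[Sequence[str]] = None,
) -> str:
    """Return the document part of the cache key; a single document_id wins."""
    if document_id:
        return document_id
    if document_ids:
        # Sorting makes the key independent of the order the IDs were selected in.
        return "multi:" + ",".join(sorted(document_ids))
    # No document filter: this is the empty string the old key also produced.
    return ""


def cache_key(
    user_id: str,
    question: str,
    document_id: Optional[str] = None,
    document_ids: Optional[Sequence[str]] = None,
) -> str:
    """Hash the user, document scope, and question into one cache key."""
    raw = "|".join([user_id, document_scope(document_id, document_ids), question])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

With this shape, `document_scope(None, ["b", "a"])` and `document_scope(None, ["a", "b"])` both give `"multi:a,b"`, and neither collides with the empty scope of an unfiltered query.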

✅ Self-Review Checklist

My branch is based on dev, not main
I have not added any secrets / API keys
I have not modified main branch or any HuggingFace deployment config
My code follows the existing style (no unnecessary formatting changes)
I have updated relevant docs / comments if needed


Yuthika10 · 2026-06-22T13:15:27Z

Hi @param20h! This PR closes issue #649. Could you please go through it? I would love your feedback for future PRs.

feat(chat): forward document_ids through retrieval for multi-document chat

6366111

Yuthika10 requested a review from param20h as a code owner June 22, 2026 13:13

param20h approved these changes Jun 23, 2026


param20h merged commit c1524bc into param20h:dev Jun 23, 2026
8 checks passed

github-actions Bot added enhancement New feature or improvement gssoc GirlScript Summer of Code 2026 issue/PR gssoc:approved Approved for GSSoC base points (+50 pts) mentor:param20h Mentor for this PR labels Jun 23, 2026

param20h added level:intermediate +35 pts type:accessibility +15 pts type:feature +10 pts and removed type:accessibility +15 pts labels Jun 23, 2026


param20h merged 1 commit into param20h:dev from Yuthika10:feat/multi-document-chat-649-dev

2 participants