feat: enhance citation handling with confidence and match kind#230
feat: enhance citation handling with confidence and match kind#230rkarmaka wants to merge 4 commits into
Conversation
- Updated the citation emission process to include confidence scores and match kinds when a SourceRegistry is attached, allowing the UI to display verification badges. - Modified the API interfaces and internal state management to accommodate the new citation attributes. - Added tests to ensure proper handling of confidence and match kind in citation updates. - Implemented a confidence threshold in the DeepResearcherAgent to filter citations based on their strength before inclusion in reports. This change improves the user experience by providing clearer insights into citation validity and enhances the overall citation verification process. Signed-off-by: Ranit Karmakar <rkarmaka@mtu.edu>
b556739 to
aac7b3c
Compare
Greptile SummaryThis PR adds structured confidence scores and match-kind annotations to the citation pipeline, flowing from the Python
Confidence Score: 5/5Safe to merge; the full citation pipeline from registry resolution through SSE emission to UI badge rendering is well-tested and all previously flagged issues are addressed. All prior review bugs (NameError in dedup block, misleading docstrings, incorrect reason field for ambiguous citations, inaccurate config description) are resolved in this PR. The new code is backed by a thorough parametrised test suite covering every match kind and threshold semantics. The only remaining item is a minor telemetry inaccuracy in the stripped summary log field, which has no effect on report output or UI behaviour. citation_verification.py — the summary telemetry log is worth a second read before wiring it to production monitoring alerts. Important Files Changed
Sequence DiagramsequenceDiagram
participant Agent as DeepResearcherAgent
participant Registry as SourceRegistry
participant CB as AgentEventCallback
participant SSE as SSE Stream
participant Client as deep-research-client.ts
participant Store as ChatStore
participant UI as CitationCard
Agent->>Registry: resolve_url(url)
Registry-->>Agent: _ResolveMatch(kind, confidence)
Note over Agent: verify_citations sets resolved and verified flags
CB->>Registry: resolve_url(url)
Registry-->>CB: "_ResolveMatch(kind=child_path, confidence=0.6)"
CB->>SSE: citation_use event with confidence and match_kind
SSE-->>Client: artifact.update
Client->>Store: onCitationUpdate with confidence and matchKind
Store->>Store: monotonic confidence merge
Store-->>UI: CitationSource with matchKind and confidence
UI->>UI: render ConfidenceBadge
Reviews (5): Last reviewed commit: "Merge develop; reconcile citation dedup ..." | Re-trigger Greptile |
- Updated the `citation_passthrough_threshold` documentation to specify its role in marking citations as verified in the UI, rather than filtering them from the report. - Adjusted the `verify_citations` function to ensure that only unresolved citations are stripped from the report, allowing all resolved citations to remain and carry their confidence scores. - Enhanced comments throughout the code to improve clarity on citation verification processes and UI interactions. These changes improve the understanding of citation handling and ensure that the UI accurately reflects citation confidence without losing important context in reports.
…tConfig - Updated the documentation for the confidence cutoff parameter to clarify its role in marking citations as verified in the UI. - Improved the explanation of how the threshold affects citation verification without filtering them from the report body. - Ensured that the default value and its implications for citation handling are clearly articulated. These changes aim to provide better guidance on citation confidence settings and their impact on the user interface.
|
Thank you @rkarmaka, we will review this soon |
torkian
left a comment
There was a problem hiding this comment.
Friendly follow-up on the P1 Greptile flagged — I went deeper and there are actually two related bugs in this block that have the same root cause and the same fix:
- NameError on every report with a references section (already flagged): the dedup loop at line 1018 reads
valid_citationsbefore it's defined. - Silent no-op dedup: even after fixing the NameError, line 1055 (
valid_citations = [_citation_to_dict(v) for v in verifications if v.resolved]) reassignsvalid_citations, discarding the deduped result from line 1044. Line 1057 similarly clobbers any duplicate entries appended toremoved_citationsinside the loop. So the dedup pass would run, find duplicates, and then have its work thrown away.
Both go away if we initialize valid_citations / removed_citations / removed_records from verifications before the dedup block instead of after. The dedup then mutates the already-built lists, and the URL-replacement block at the end is unaffected. Suggested change attached on the relevant line range — happy to put it up as a PR against your fork if that's easier.
| @@ -792,15 +1048,47 @@ def verify_citations(report_text: str, registry: SourceRegistry) -> CitationVeri | |||
| for garbled, canonical in url_replacements.items(): | |||
| ref_section = ref_section.replace(garbled, canonical) | |||
|
|
|||
| if not removed_citations: | |||
| logger.debug("[CitationVerify] Result: all %d citation(s) valid — no changes", len(valid_citations)) | |||
| verified = body + ref_section if url_replacements else report_text | |||
| # The report keeps every citation that resolved, even with low confidence. | |||
| # Only genuinely unresolved (unmatched / ambiguous / unverifiable) entries | |||
| # get stripped — those are likely fabrications. ``verified`` is reported | |||
| # separately on each record so the UI can render strength. | |||
| valid_citations = [_citation_to_dict(v) for v in verifications if v.resolved] | |||
| removed_records = [v for v in verifications if not v.resolved] | |||
| removed_citations = [_citation_to_dict(v, include_reason=True) for v in removed_records] | |||
There was a problem hiding this comment.
Move the initial build of valid_citations/removed_records/removed_citations (currently at lines 1055-1057) above the dedup block. Fixes the NameError and prevents the dedup's output from being silently overwritten.
| # The report keeps every citation that resolved, even with low confidence. | |
| # Only genuinely unresolved (unmatched / ambiguous / unverifiable) entries | |
| # get stripped — those are likely fabrications. ``verified`` is reported | |
| # separately on each record so the UI can render strength. | |
| valid_citations = [_citation_to_dict(v) for v in verifications if v.resolved] | |
| removed_records = [v for v in verifications if not v.resolved] | |
| removed_citations = [_citation_to_dict(v, include_reason=True) for v in removed_records] | |
| # Dedup: collapse multiple [N] reference lines that resolve to the same | |
| # registry source. The model often makes the same tool call twice (e.g. | |
| # ``mcp_time__get_current_time`` for two timezones) and emits a separate | |
| # ``[N] tool_name`` line for each call; without this pass both lines | |
| # survive verification because each is independently valid. We keep the | |
| # lowest-numbered occurrence and rewrite later inline citations to that | |
| # number so the prose still cites the source. | |
| seen_keys: dict[str, int] = {} # canonical_key -> kept citation number | |
| duplicate_rewrites: dict[int, int] = {} # duplicate_num -> canonical_num | |
| deduped_valid: list[dict] = [] | |
| for c in valid_citations: | |
| key = c["url"] or c["citation_key"] | |
| if key is None: | |
| # Defensive: a valid citation must have one of url/citation_key. | |
| # If neither is set we cannot dedup, so keep the entry. | |
| deduped_valid.append(c) | |
| continue | |
| canonical_num = seen_keys.get(key) | |
| if canonical_num is None: | |
| seen_keys[key] = c["number"] | |
| deduped_valid.append(c) | |
| continue | |
| duplicate_rewrites[c["number"]] = canonical_num | |
| removed_citations.append( | |
| { | |
| "number": c["number"], | |
| "line": c["line"], | |
| "reason": f"duplicate_of_citation_{canonical_num}", | |
| } | |
| ) | |
| logger.debug( | |
| "[CitationVerify] [%d] REMOVE — duplicate of [%d]: %s", | |
| c["number"], | |
| canonical_num, | |
| key, | |
| ) | |
| valid_citations = deduped_valid | |
| # Apply URL replacements (garbled -> canonical) in the references section | |
| if url_replacements: | |
| for garbled, canonical in url_replacements.items(): | |
| ref_section = ref_section.replace(garbled, canonical) |
There was a problem hiding this comment.
Thanks — merged develop and applied your reorder. Worth flagging: your fix cleared the NameError, but my refactor's if not removed_records early-return was also discarding the deduped result (and skipping duplicate-line stripping), since removed_records is unresolved-only.
There was a problem hiding this comment.
Nice catch on the removed_records vs removed_citations distinction — that's the subtle one. Keying the early-return off removed_citations (the superset of unresolved + dedup entries) is the right call: a report with duplicate citations but no unresolved ones now correctly strips/rewrites the dup lines instead of bailing early. Pulled 80f24de0 and traced it — the reorder + your early-return fix read correctly together, and invalid_numbers = removed_numbers - set(duplicate_rewrites) cleanly separates the strip-vs-rewrite paths. LGTM 👍
…fix NameError + duplicate stripping) Signed-off-by: Ranit Karmakar <karmakarranit6@gmail.com>
bfdeb86 to
80f24de
Compare
|
/ok to test |
@cdgamarose-nv, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/ |
|
/ok to test aac7b3c |
@cdgamarose-nv, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/ |
|
/ok to test 80f24de |
|
This PR mostly targets observability and UX - not sure if this is an issue that has been reported or surfaced. Citation handling is still heuristic-based, so I'm not sure if this affects correctness at all. @AjayThorve, what do you think? |
This change improves the user experience by providing clearer insights into citation validity and enhances the overall citation verification process.