Skip to content

feat: enhance citation handling with confidence and match kind#230

Open
rkarmaka wants to merge 4 commits into
NVIDIA-AI-Blueprints:developfrom
rkarmaka:feat/citation-confidence-scoring
Open

feat: enhance citation handling with confidence and match kind#230
rkarmaka wants to merge 4 commits into
NVIDIA-AI-Blueprints:developfrom
rkarmaka:feat/citation-confidence-scoring

Conversation

@rkarmaka
Copy link
Copy Markdown

  • Updated the citation emission process to include confidence scores and match kinds when a SourceRegistry is attached, allowing the UI to display verification badges.
  • Modified the API interfaces and internal state management to accommodate the new citation attributes.
  • Added tests to ensure proper handling of confidence and match kind in citation updates.
  • Implemented a confidence threshold in the DeepResearcherAgent to filter citations based on their strength before inclusion in reports.

This change improves the user experience by providing clearer insights into citation validity and enhances the overall citation verification process.

- Updated the citation emission process to include confidence scores and match kinds when a SourceRegistry is attached, allowing the UI to display verification badges.
- Modified the API interfaces and internal state management to accommodate the new citation attributes.
- Added tests to ensure proper handling of confidence and match kind in citation updates.
- Implemented a confidence threshold in the DeepResearcherAgent to filter citations based on their strength before inclusion in reports.

This change improves the user experience by providing clearer insights into citation validity and enhances the overall citation verification process.

Signed-off-by: Ranit Karmakar <rkarmaka@mtu.edu>
@rkarmaka rkarmaka force-pushed the feat/citation-confidence-scoring branch from b556739 to aac7b3c Compare May 12, 2026 20:46
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 12, 2026

Greptile Summary

This PR adds structured confidence scores and match-kind annotations to the citation pipeline, flowing from the Python SourceRegistry.resolve_url() all the way to a new ConfidenceBadge UI component on each CitationCard. It also introduces a citation_passthrough_threshold config on DeepResearcherAgent that sets a per-citation verified UI hint without filtering the report body — fixing the long-standing semantic confusion between "stripping" and "badge hint."

  • Backend: MatchKind / VerifiedCitation dataclasses added to citation_verification.py; verify_citations now returns a typed verifications list alongside the existing valid_citations/removed_citations dicts; _emit_cited_urls attaches confidence and match_kind to citation_use SSE events via EventData(extra="allow").
  • Frontend: CitationSource extended with confidence? and matchKind?; store merge logic is monotonic (never downgrades confidence); ConfidenceBadge renders for medium/low-tier matches, suppressed for exact/normalized to keep the common case noise-free.
  • Tests: New TestVerifyCitationsConfidence class covers each match kind end-to-end, threshold semantics, structured log content, and OTel summary payload.

Confidence Score: 5/5

Safe to merge; the full citation pipeline from registry resolution through SSE emission to UI badge rendering is well-tested and all previously flagged issues are addressed.

All prior review bugs (NameError in dedup block, misleading docstrings, incorrect reason field for ambiguous citations, inaccurate config description) are resolved in this PR. The new code is backed by a thorough parametrised test suite covering every match kind and threshold semantics. The only remaining item is a minor telemetry inaccuracy in the stripped summary log field, which has no effect on report output or UI behaviour.

citation_verification.py — the summary telemetry log is worth a second read before wiring it to production monitoring alerts.

Important Files Changed

Filename Overview
src/aiq_agent/common/citation_verification.py Introduces MatchKind, _ResolveMatch, and VerifiedCitation data classes; refactors verify_citations to produce structured VerifiedCitation records with OTel counter and structured logging. All previously-flagged bugs fixed. Minor telemetry inaccuracy in stripped and kept_below_threshold summary fields.
frontends/aiq_api/src/aiq_api/jobs/callbacks.py _emit_cited_urls now calls resolve_url and attaches confidence/match_kind to citation_use SSE events via **extra. EventData uses extra=allow so new fields flow through correctly.
frontends/ui/src/features/chat/store.ts addDeepResearchCitation extended with confidence/matchKind; monotonic confidence merge logic correctly uses Math.max.
frontends/ui/src/features/layout/components/CitationCard.tsx Adds ConfidenceBadge for medium/low tier match kinds; exact/normalized matches stay visually unchanged.
src/aiq_agent/agents/deep_researcher/agent.py citation_passthrough_threshold wired through constructor and config; all previously-flagged misleading comments corrected.
src/aiq_agent/agents/deep_researcher/register.py Adds citation_passthrough_threshold Field with ge/le validation; description correctly states it is a UI hint, not a report filter.
frontends/ui/src/features/chat/types.ts Adds CitationMatchKind union type and extends CitationSource/ChatActions with optional confidence and matchKind fields.
frontends/ui/src/adapters/api/deep-research-client.ts Extends ArtifactUpdateEvent and onCitationUpdate callback to carry optional confidence/match_kind.
tests/aiq_agent/common/test_citation_verification.py Comprehensive new TestVerifyCitationsConfidence class covering all match kinds, threshold semantics, and structured log content.

Sequence Diagram

sequenceDiagram
    participant Agent as DeepResearcherAgent
    participant Registry as SourceRegistry
    participant CB as AgentEventCallback
    participant SSE as SSE Stream
    participant Client as deep-research-client.ts
    participant Store as ChatStore
    participant UI as CitationCard

    Agent->>Registry: resolve_url(url)
    Registry-->>Agent: _ResolveMatch(kind, confidence)
    Note over Agent: verify_citations sets resolved and verified flags
    CB->>Registry: resolve_url(url)
    Registry-->>CB: "_ResolveMatch(kind=child_path, confidence=0.6)"
    CB->>SSE: citation_use event with confidence and match_kind
    SSE-->>Client: artifact.update
    Client->>Store: onCitationUpdate with confidence and matchKind
    Store->>Store: monotonic confidence merge
    Store-->>UI: CitationSource with matchKind and confidence
    UI->>UI: render ConfidenceBadge
Loading

Reviews (5): Last reviewed commit: "Merge develop; reconcile citation dedup ..." | Re-trigger Greptile

Comment thread src/aiq_agent/common/citation_verification.py Outdated
Comment thread src/aiq_agent/agents/deep_researcher/agent.py Outdated
Comment thread src/aiq_agent/agents/deep_researcher/agent.py Outdated
Comment thread src/aiq_agent/common/citation_verification.py
- Updated the `citation_passthrough_threshold` documentation to specify its role in marking citations as verified in the UI, rather than filtering them from the report.
- Adjusted the `verify_citations` function to ensure that only unresolved citations are stripped from the report, allowing all resolved citations to remain and carry their confidence scores.
- Enhanced comments throughout the code to improve clarity on citation verification processes and UI interactions.

These changes improve the understanding of citation handling and ensure that the UI accurately reflects citation confidence without losing important context in reports.
Comment thread src/aiq_agent/agents/deep_researcher/register.py Outdated
…tConfig

- Updated the documentation for the confidence cutoff parameter to clarify its role in marking citations as verified in the UI.
- Improved the explanation of how the threshold affects citation verification without filtering them from the report body.
- Ensured that the default value and its implications for citation handling are clearly articulated.

These changes aim to provide better guidance on citation confidence settings and their impact on the user interface.
@AjayThorve
Copy link
Copy Markdown
Collaborator

Thank you @rkarmaka, we will review this soon

Copy link
Copy Markdown

@torkian torkian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Friendly follow-up on the P1 Greptile flagged — I went deeper and there are actually two related bugs in this block that have the same root cause and the same fix:

  1. NameError on every report with a references section (already flagged): the dedup loop at line 1018 reads valid_citations before it's defined.
  2. Silent no-op dedup: even after fixing the NameError, line 1055 (valid_citations = [_citation_to_dict(v) for v in verifications if v.resolved]) reassigns valid_citations, discarding the deduped result from line 1044. Line 1057 similarly clobbers any duplicate entries appended to removed_citations inside the loop. So the dedup pass would run, find duplicates, and then have its work thrown away.

Both go away if we initialize valid_citations / removed_citations / removed_records from verifications before the dedup block instead of after. The dedup then mutates the already-built lists, and the URL-replacement block at the end is unaffected. Suggested change attached on the relevant line range — happy to put it up as a PR against your fork if that's easier.

Comment on lines +1008 to +1057
@@ -792,15 +1048,47 @@ def verify_citations(report_text: str, registry: SourceRegistry) -> CitationVeri
for garbled, canonical in url_replacements.items():
ref_section = ref_section.replace(garbled, canonical)

if not removed_citations:
logger.debug("[CitationVerify] Result: all %d citation(s) valid — no changes", len(valid_citations))
verified = body + ref_section if url_replacements else report_text
# The report keeps every citation that resolved, even with low confidence.
# Only genuinely unresolved (unmatched / ambiguous / unverifiable) entries
# get stripped — those are likely fabrications. ``verified`` is reported
# separately on each record so the UI can render strength.
valid_citations = [_citation_to_dict(v) for v in verifications if v.resolved]
removed_records = [v for v in verifications if not v.resolved]
removed_citations = [_citation_to_dict(v, include_reason=True) for v in removed_records]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move the initial build of valid_citations/removed_records/removed_citations (currently at lines 1055-1057) above the dedup block. Fixes the NameError and prevents the dedup's output from being silently overwritten.

Suggested change
# The report keeps every citation that resolved, even with low confidence.
# Only genuinely unresolved (unmatched / ambiguous / unverifiable) entries
# get stripped — those are likely fabrications. ``verified`` is reported
# separately on each record so the UI can render strength.
valid_citations = [_citation_to_dict(v) for v in verifications if v.resolved]
removed_records = [v for v in verifications if not v.resolved]
removed_citations = [_citation_to_dict(v, include_reason=True) for v in removed_records]
# Dedup: collapse multiple [N] reference lines that resolve to the same
# registry source. The model often makes the same tool call twice (e.g.
# ``mcp_time__get_current_time`` for two timezones) and emits a separate
# ``[N] tool_name`` line for each call; without this pass both lines
# survive verification because each is independently valid. We keep the
# lowest-numbered occurrence and rewrite later inline citations to that
# number so the prose still cites the source.
seen_keys: dict[str, int] = {} # canonical_key -> kept citation number
duplicate_rewrites: dict[int, int] = {} # duplicate_num -> canonical_num
deduped_valid: list[dict] = []
for c in valid_citations:
key = c["url"] or c["citation_key"]
if key is None:
# Defensive: a valid citation must have one of url/citation_key.
# If neither is set we cannot dedup, so keep the entry.
deduped_valid.append(c)
continue
canonical_num = seen_keys.get(key)
if canonical_num is None:
seen_keys[key] = c["number"]
deduped_valid.append(c)
continue
duplicate_rewrites[c["number"]] = canonical_num
removed_citations.append(
{
"number": c["number"],
"line": c["line"],
"reason": f"duplicate_of_citation_{canonical_num}",
}
)
logger.debug(
"[CitationVerify] [%d] REMOVE — duplicate of [%d]: %s",
c["number"],
canonical_num,
key,
)
valid_citations = deduped_valid
# Apply URL replacements (garbled -> canonical) in the references section
if url_replacements:
for garbled, canonical in url_replacements.items():
ref_section = ref_section.replace(garbled, canonical)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks — merged develop and applied your reorder. Worth flagging: your fix cleared the NameError, but my refactor's if not removed_records early-return was also discarding the deduped result (and skipping duplicate-line stripping), since removed_records is unresolved-only.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch on the removed_records vs removed_citations distinction — that's the subtle one. Keying the early-return off removed_citations (the superset of unresolved + dedup entries) is the right call: a report with duplicate citations but no unresolved ones now correctly strips/rewrites the dup lines instead of bailing early. Pulled 80f24de0 and traced it — the reorder + your early-return fix read correctly together, and invalid_numbers = removed_numbers - set(duplicate_rewrites) cleanly separates the strip-vs-rewrite paths. LGTM 👍

…fix NameError + duplicate stripping)

Signed-off-by: Ranit Karmakar <karmakarranit6@gmail.com>
@rkarmaka rkarmaka force-pushed the feat/citation-confidence-scoring branch from bfdeb86 to 80f24de Compare May 29, 2026 12:59
@cdgamarose-nv
Copy link
Copy Markdown
Collaborator

/ok to test

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Jun 1, 2026

/ok to test

@cdgamarose-nv, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@cdgamarose-nv
Copy link
Copy Markdown
Collaborator

/ok to test aac7b3c

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Jun 1, 2026

/ok to test aac7b3c

@cdgamarose-nv, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@cdgamarose-nv
Copy link
Copy Markdown
Collaborator

/ok to test 80f24de

@cdgamarose-nv
Copy link
Copy Markdown
Collaborator

This PR mostly targets observability and UX - not sure if this is an issue that has been reported or surfaced. Citation handling is still heuristic-based, so I'm not sure if this affects correctness at all. @AjayThorve, what do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants