feat: enhance citation handling with confidence and match kind by rkarmaka · Pull Request #230 · NVIDIA-AI-Blueprints/aiq

rkarmaka · 2026-05-12T20:43:40Z

Updated the citation emission process to include confidence scores and match kinds when a SourceRegistry is attached, allowing the UI to display verification badges.
Modified the API interfaces and internal state management to accommodate the new citation attributes.
Added tests to ensure proper handling of confidence and match kind in citation updates.
Implemented a confidence threshold in the DeepResearcherAgent to filter citations based on their strength before inclusion in reports.

This change improves the user experience by providing clearer insights into citation validity and enhances the overall citation verification process.

Signed-off-by: Ranit Karmakar <rkarmaka@mtu.edu>

greptile-apps · 2026-05-12T20:50:43Z

Greptile Summary

This PR adds structured confidence scores and match-kind annotations to the citation pipeline, flowing from the Python SourceRegistry.resolve_url() all the way to a new ConfidenceBadge UI component on each CitationCard. It also introduces a citation_passthrough_threshold config on DeepResearcherAgent that sets a per-citation verified UI hint without filtering the report body — fixing the long-standing semantic confusion between "stripping" and "badge hint."

- Backend: MatchKind / VerifiedCitation dataclasses added to citation_verification.py; verify_citations now returns a typed verifications list alongside the existing valid_citations/removed_citations dicts; _emit_cited_urls attaches confidence and match_kind to citation_use SSE events via EventData(extra="allow").
- Frontend: CitationSource extended with confidence? and matchKind?; store merge logic is monotonic (never downgrades confidence); ConfidenceBadge renders for medium/low-tier matches and is suppressed for exact/normalized matches to keep the common case noise-free.
- Tests: New TestVerifyCitationsConfidence class covers each match kind end-to-end, threshold semantics, structured log content, and OTel summary payload.
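
The split between a resolved citation (kept in the report) and a verified one (a badge hint for the UI) can be illustrated with a minimal sketch. The names `MatchKind`, `VerifiedCitation`, `resolved`, and `verified` come from the summary above; the `UNMATCHED` member, the `classify` helper, the field layout, and the 0.8 threshold are assumptions made for illustration, not the PR's actual code.

```python
from dataclasses import dataclass
from enum import Enum


class MatchKind(str, Enum):
    # exact, normalized and child_path are named in the summary;
    # UNMATCHED is a hypothetical stand-in for "did not resolve".
    EXACT = "exact"
    NORMALIZED = "normalized"
    CHILD_PATH = "child_path"
    UNMATCHED = "unmatched"


@dataclass(frozen=True)
class VerifiedCitation:
    number: int
    url: str
    match_kind: MatchKind
    confidence: float
    resolved: bool  # True -> the citation stays in the report body
    verified: bool  # True -> the UI may treat it as a strong match


def classify(number, url, match_kind, confidence, threshold=0.8):
    """Hypothetical helper: the threshold only drives `verified`, never removal."""
    resolved = match_kind is not MatchKind.UNMATCHED
    verified = resolved and confidence >= threshold
    return VerifiedCitation(number, url, match_kind, confidence, resolved, verified)


exact = classify(1, "https://example.com/a", MatchKind.EXACT, 1.0)
child = classify(2, "https://example.com/a/b", MatchKind.CHILD_PATH, 0.6)
missing = classify(3, "https://example.org/x", MatchKind.UNMATCHED, 0.0)

# Only unresolved entries are dropped; the low-confidence match is kept.
kept_numbers = [c.number for c in (exact, child, missing) if c.resolved]
```

In this sketch the child-path match at confidence 0.6 remains in the report but is not marked verified, which mirrors the "badge hint, not filter" semantics described above.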

Confidence Score: 5/5

Safe to merge; the full citation pipeline from registry resolution through SSE emission to UI badge rendering is well-tested and all previously flagged issues are addressed.

All prior review bugs (NameError in dedup block, misleading docstrings, incorrect reason field for ambiguous citations, inaccurate config description) are resolved in this PR. The new code is backed by a thorough parametrised test suite covering every match kind and threshold semantics. The only remaining item is a minor telemetry inaccuracy in the stripped summary log field, which has no effect on report output or UI behaviour.

citation_verification.py — the summary telemetry log is worth a second read before wiring it to production monitoring alerts.

Important Files Changed

Filename	Overview
src/aiq_agent/common/citation_verification.py	Introduces MatchKind, _ResolveMatch, and VerifiedCitation data classes; refactors verify_citations to produce structured VerifiedCitation records with OTel counter and structured logging. All previously-flagged bugs fixed. Minor telemetry inaccuracy in stripped and kept_below_threshold summary fields.
frontends/aiq_api/src/aiq_api/jobs/callbacks.py	_emit_cited_urls now calls resolve_url and attaches confidence/match_kind to citation_use SSE events via **extra. EventData uses extra=allow so new fields flow through correctly.
frontends/ui/src/features/chat/store.ts	addDeepResearchCitation extended with confidence/matchKind; monotonic confidence merge logic correctly uses Math.max.
frontends/ui/src/features/layout/components/CitationCard.tsx	Adds ConfidenceBadge for medium/low tier match kinds; exact/normalized matches stay visually unchanged.
src/aiq_agent/agents/deep_researcher/agent.py	citation_passthrough_threshold wired through constructor and config; all previously-flagged misleading comments corrected.
src/aiq_agent/agents/deep_researcher/register.py	Adds citation_passthrough_threshold Field with ge/le validation; description correctly states it is a UI hint, not a report filter.
frontends/ui/src/features/chat/types.ts	Adds CitationMatchKind union type and extends CitationSource/ChatActions with optional confidence and matchKind fields.
frontends/ui/src/adapters/api/deep-research-client.ts	Extends ArtifactUpdateEvent and onCitationUpdate callback to carry optional confidence/match_kind.
tests/aiq_agent/common/test_citation_verification.py	Comprehensive new TestVerifyCitationsConfidence class covering all match kinds, threshold semantics, and structured log content.
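
The monotonic merge noted for store.ts can be sketched as follows. The real store is TypeScript and uses Math.max; this Python version only illustrates the idea. The `merge_citation` name and dictionary shape are hypothetical, and letting the match kind follow the higher-confidence update is an assumption, since the summary does not say how matchKind is merged.

```python
def merge_citation(existing: dict, update: dict) -> dict:
    """Merge a citation update without ever lowering the stored confidence."""
    merged = dict(existing)
    new_conf = update.get("confidence")
    old_conf = existing.get("confidence")
    # Accept the update only if there was no confidence yet or it is higher.
    if new_conf is not None and (old_conf is None or new_conf > old_conf):
        merged["confidence"] = new_conf
        # Assumption: the match kind travels with the stronger observation.
        if update.get("match_kind") is not None:
            merged["match_kind"] = update["match_kind"]
    return merged


strong = {"url": "https://example.com/a", "confidence": 0.9, "match_kind": "exact"}
weaker_update = {"confidence": 0.6, "match_kind": "child_path"}
after_weaker = merge_citation(strong, weaker_update)
```

A later event with a weaker match leaves the stored entry unchanged, so a badge never degrades as duplicate events arrive.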

Sequence Diagram

sequenceDiagram
    participant Agent as DeepResearcherAgent
    participant Registry as SourceRegistry
    participant CB as AgentEventCallback
    participant SSE as SSE Stream
    participant Client as deep-research-client.ts
    participant Store as ChatStore
    participant UI as CitationCard

    Agent->>Registry: resolve_url(url)
    Registry-->>Agent: _ResolveMatch(kind, confidence)
    Note over Agent: verify_citations sets resolved and verified flags
    CB->>Registry: resolve_url(url)
    Registry-->>CB: "_ResolveMatch(kind=child_path, confidence=0.6)"
    CB->>SSE: citation_use event with confidence and match_kind
    SSE-->>Client: artifact.update
    Client->>Store: onCitationUpdate with confidence and matchKind
    Store->>Store: monotonic confidence merge
    Store-->>UI: CitationSource with matchKind and confidence
    UI->>UI: render ConfidenceBadge

Reviews (5): Last reviewed commit: "Merge develop; reconcile citation dedup ..."

- Updated the `citation_passthrough_threshold` documentation to specify its role in marking citations as verified in the UI, rather than filtering them from the report.
- Adjusted the `verify_citations` function to ensure that only unresolved citations are stripped from the report, allowing all resolved citations to remain and carry their confidence scores.
- Enhanced comments throughout the code to improve clarity on citation verification processes and UI interactions.

These changes improve the understanding of citation handling and ensure that the UI accurately reflects citation confidence without losing important context in reports.

…tConfig

- Updated the documentation for the confidence cutoff parameter to clarify its role in marking citations as verified in the UI.
- Improved the explanation of how the threshold affects citation verification without filtering them from the report body.
- Ensured that the default value and its implications for citation handling are clearly articulated.

These changes aim to provide better guidance on citation confidence settings and their impact on the user interface.

AjayThorve · 2026-05-13T15:09:38Z

Thank you @rkarmaka, we will review this soon

torkian

Friendly follow-up on the P1 Greptile flagged — I went deeper and there are actually two related bugs in this block that have the same root cause and the same fix:

1. NameError on every report with a references section (already flagged): the dedup loop at line 1018 reads `valid_citations` before it's defined.
2. Silent no-op dedup: even after fixing the NameError, line 1055 (`valid_citations = [_citation_to_dict(v) for v in verifications if v.resolved]`) reassigns `valid_citations`, discarding the deduped result from line 1044. Line 1057 similarly clobbers any duplicate entries appended to `removed_citations` inside the loop. So the dedup pass would run, find duplicates, and then have its work thrown away.

Both go away if we initialize valid_citations / removed_citations / removed_records from verifications before the dedup block instead of after. The dedup then mutates the already-built lists, and the URL-replacement block at the end is unaffected. Suggested change attached on the relevant line range — happy to put it up as a PR against your fork if that's easier.
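
The ordering problem can be reproduced in isolation. The sketch below is a simplified stand-in, not the code from citation_verification.py: the record dictionaries and the `dedup`, `build_then_dedup`, and `dedup_then_rebuild` functions are hypothetical, and only the second bug (the silent overwrite) is modeled.

```python
def dedup(valid):
    """Keep the first citation per URL; report later ones as duplicates."""
    seen = {}  # url -> kept citation number
    kept, removed = [], []
    for c in valid:
        first = seen.get(c["url"])
        if first is None:
            seen[c["url"]] = c["number"]
            kept.append(c)
        else:
            removed.append({"number": c["number"], "reason": f"duplicate_of_citation_{first}"})
    return kept, removed


def _unresolved(records):
    return [{"number": r["number"], "reason": "unresolved"} for r in records if not r["resolved"]]


def build_then_dedup(records):
    # Suggested order: build the lists first, then let dedup refine them.
    valid = [r for r in records if r["resolved"]]
    removed = _unresolved(records)
    valid, dup_removed = dedup(valid)
    removed.extend(dup_removed)
    return valid, removed


def dedup_then_rebuild(records):
    # Buggy order: dedup runs, then both lists are rebuilt from scratch.
    valid, removed = dedup([r for r in records if r["resolved"]])
    valid = [r for r in records if r["resolved"]]  # discards the dedup result
    removed = _unresolved(records)                 # discards duplicate entries
    return valid, removed


records = [
    {"number": 1, "url": "https://a.example", "resolved": True},
    {"number": 2, "url": "https://a.example", "resolved": True},
    {"number": 3, "url": "https://b.example", "resolved": False},
]
good_valid, good_removed = build_then_dedup(records)
bad_valid, bad_removed = dedup_then_rebuild(records)
```

With the lists built first, citation 2 is reported as a duplicate of citation 1; with the rebuild placed after the dedup pass, citation 2 survives and the duplicate entry is lost.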

torkian · 2026-05-28T22:24:47Z

@@ -792,15 +1048,47 @@ def verify_citations(report_text: str, registry: SourceRegistry) -> CitationVeri
        for garbled, canonical in url_replacements.items():
            ref_section = ref_section.replace(garbled, canonical)

-    if not removed_citations:
-        logger.debug("[CitationVerify] Result: all %d citation(s) valid — no changes", len(valid_citations))
-        verified = body + ref_section if url_replacements else report_text
+    # The report keeps every citation that resolved, even with low confidence.
+    # Only genuinely unresolved (unmatched / ambiguous / unverifiable) entries
+    # get stripped — those are likely fabrications. ``verified`` is reported
+    # separately on each record so the UI can render strength.
+    valid_citations = [_citation_to_dict(v) for v in verifications if v.resolved]
+    removed_records = [v for v in verifications if not v.resolved]
+    removed_citations = [_citation_to_dict(v, include_reason=True) for v in removed_records]


Move the initial build of valid_citations/removed_records/removed_citations (currently at lines 1055-1057) above the dedup block. Fixes the NameError and prevents the dedup's output from being silently overwritten.

Suggested change

    # The report keeps every citation that resolved, even with low confidence.
    # Only genuinely unresolved (unmatched / ambiguous / unverifiable) entries
    # get stripped — those are likely fabrications. ``verified`` is reported
    # separately on each record so the UI can render strength.
    valid_citations = [_citation_to_dict(v) for v in verifications if v.resolved]
    removed_records = [v for v in verifications if not v.resolved]
    removed_citations = [_citation_to_dict(v, include_reason=True) for v in removed_records]
    # Dedup: collapse multiple [N] reference lines that resolve to the same
    # registry source. The model often makes the same tool call twice (e.g.
    # ``mcp_time__get_current_time`` for two timezones) and emits a separate
    # ``[N] tool_name`` line for each call; without this pass both lines
    # survive verification because each is independently valid. We keep the
    # lowest-numbered occurrence and rewrite later inline citations to that
    # number so the prose still cites the source.
    seen_keys: dict[str, int] = {}  # canonical_key -> kept citation number
    duplicate_rewrites: dict[int, int] = {}  # duplicate_num -> canonical_num
    deduped_valid: list[dict] = []
    for c in valid_citations:
        key = c["url"] or c["citation_key"]
        if key is None:
            # Defensive: a valid citation must have one of url/citation_key.
            # If neither is set we cannot dedup, so keep the entry.
            deduped_valid.append(c)
            continue
        canonical_num = seen_keys.get(key)
        if canonical_num is None:
            seen_keys[key] = c["number"]
            deduped_valid.append(c)
            continue
        duplicate_rewrites[c["number"]] = canonical_num
        removed_citations.append(
            {
                "number": c["number"],
                "line": c["line"],
                "reason": f"duplicate_of_citation_{canonical_num}",
            }
        )
        logger.debug(
            "[CitationVerify]   [%d] REMOVE — duplicate of [%d]: %s",
            c["number"],
            canonical_num,
            key,
        )
    valid_citations = deduped_valid
    # Apply URL replacements (garbled -> canonical) in the references section
    if url_replacements:
        for garbled, canonical in url_replacements.items():
            ref_section = ref_section.replace(garbled, canonical)

rkarmaka · 2026-05-29

Thanks, merged develop and applied your reorder. Worth flagging: your fix cleared the NameError, but my refactor's `if not removed_records` early return was also discarding the deduped result (and skipping duplicate-line stripping), since `removed_records` is unresolved-only.

torkian · 2026-05-29

Nice catch on the `removed_records` vs `removed_citations` distinction; that's the subtle one. Keying the early return off `removed_citations` (the superset of unresolved + dedup entries) is the right call: a report with duplicate citations but no unresolved ones now correctly strips/rewrites the dup lines instead of bailing early. Pulled 80f24de0 and traced it: the reorder and your early-return fix read correctly together, and `invalid_numbers = removed_numbers - set(duplicate_rewrites)` cleanly separates the strip-vs-rewrite paths. LGTM 👍
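
The strip-versus-rewrite separation discussed here can be sketched as below. Only the names `removed_citations`, `duplicate_rewrites`, `removed_numbers`, and `invalid_numbers` come from the thread; the `apply_removals` function and the regex-based handling of inline `[N]` markers are assumptions, since the thread does not show how the real code edits the report body.

```python
import re


def apply_removals(body: str, removed_citations: list[dict], duplicate_rewrites: dict[int, int]) -> str:
    # Early return keyed on removed_citations (unresolved + duplicates),
    # so a report with only duplicates is still processed.
    if not removed_citations:
        return body

    removed_numbers = {c["number"] for c in removed_citations}
    # Duplicates are rewritten, not stripped, so exclude them here.
    invalid_numbers = removed_numbers - set(duplicate_rewrites)

    def fix(match: re.Match) -> str:
        n = int(match.group(1))
        if n in invalid_numbers:
            return ""  # strip citations that never resolved
        if n in duplicate_rewrites:
            return f"[{duplicate_rewrites[n]}]"  # point at the kept citation
        return match.group(0)

    return re.sub(r"\[(\d+)\]", fix, body)


body = "Claim A [1]. Claim B [2]. Claim C [3]."
only_duplicates = apply_removals(
    body,
    [{"number": 2, "reason": "duplicate_of_citation_1"}],
    {2: 1},
)
```

Had the early return been keyed on an unresolved-only list, the call above would have returned the body unchanged, because nothing in it is unresolved.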

cdgamarose-nv · 2026-06-01T23:53:54Z

/ok to test

copy-pr-bot · 2026-06-01T23:53:56Z

/ok to test

@cdgamarose-nv, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

cdgamarose-nv · 2026-06-01T23:54:15Z

/ok to test aac7b3c

copy-pr-bot · 2026-06-01T23:54:18Z

/ok to test aac7b3c

@cdgamarose-nv, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

cdgamarose-nv · 2026-06-01T23:54:56Z

/ok to test 80f24de

cdgamarose-nv · 2026-06-02T00:11:47Z

This PR mostly targets observability and UX - not sure if this is an issue that has been reported or surfaced. Citation handling is still heuristic-based, so I'm not sure if this affects correctness at all. @AjayThorve, what do you think?

rkarmaka force-pushed the feat/citation-confidence-scoring branch from b556739 to aac7b3c Compare May 12, 2026 20:46

greptile-apps Bot reviewed May 12, 2026

Comment thread src/aiq_agent/common/citation_verification.py Outdated

Comment thread src/aiq_agent/agents/deep_researcher/agent.py Outdated

Comment thread src/aiq_agent/agents/deep_researcher/agent.py Outdated

Comment thread src/aiq_agent/common/citation_verification.py

greptile-apps Bot reviewed May 12, 2026

Comment thread src/aiq_agent/agents/deep_researcher/register.py Outdated

AjayThorve added AIQ2.2 labels May 12, 2026

AjayThorve requested review from AjayThorve and cdgamarose-nv May 13, 2026 15:09

AjayThorve added the external-contribution label May 27, 2026

torkian reviewed May 28, 2026

Merge develop; reconcile citation dedup with verifications refactor (…

80f24de

…fix NameError + duplicate stripping) Signed-off-by: Ranit Karmakar <karmakarranit6@gmail.com>

rkarmaka force-pushed the feat/citation-confidence-scoring branch from bfdeb86 to 80f24de Compare May 29, 2026 12:59
