feat(diagnostics): autonomous-triage enrichments + confidence threshold #170
Merged
Three coordinated changes to reduce the per-report context-asking
loop and make the confidence badge actually visible on typical drafts.
1. Diagnostic payload enrichments (server + client)
Antecedent diagnostics gain:
- np_boundary_char: single char immediately after captured term —
helps identify when an exclusion-set extension is the right fix
- ref_marker_before: which definite-reference marker preceded the
term (so / 前述 / 該 / said / the) — informs possessive vs bare-
intro vs reference classification
- body_cross_refs: numeric refs to other claims in the same claim
text (`claim N` / `請求項N` shapes) — surfaces incorporation-by-
reference candidates without requiring the underlying draft
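A minimal sketch of how the three antecedent fields could be derived from a claim text and the character span of the captured term. The marker tuple, the cross-reference regex, and the function names `marker_before` and `antecedent_fields` are assumptions for illustration, not the actual `extract_antecedent_basis` code.

```python
import re

# Hypothetical marker list, copied from the examples in the description above.
REF_MARKERS = ("said", "the", "so", "前述", "該")

# Assumed pattern for the `claim N` / `請求項N` shapes.
CROSS_REF_RE = re.compile(r"(?:claims?\s+|請求項\s*)(\d+)", re.IGNORECASE)


def marker_before(prefix: str):
    """Return the reference marker that ends `prefix`, or None."""
    lowered = prefix.rstrip().lower()
    for marker in REF_MARKERS:
        if not lowered.endswith(marker):
            continue
        head = lowered[: len(lowered) - len(marker)]
        # Latin markers must start at a word boundary ("also" is not "so");
        # CJK markers attach directly to the preceding text.
        if marker.isascii() and head and head[-1].isalnum():
            continue
        return marker
    return None


def antecedent_fields(claim_text: str, start: int, end: int) -> dict:
    """Structural metadata for the term at claim_text[start:end]."""
    return {
        # Single character right after the term ("" if the term ends the text).
        "np_boundary_char": claim_text[end:end + 1],
        "ref_marker_before": marker_before(claim_text[:start]),
        # Sorted, de-duplicated claim numbers cited in the claim body.
        "body_cross_refs": sorted(
            {int(n) for n in CROSS_REF_RE.findall(claim_text)}
        ),
    }
```

Only a single character, a marker token, and a list of integers leave the function, which matches the "structural metadata" intent of the payload.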
Spec-support diagnostics gain:
- phrase_charlen / phrase_first_chars / phrase_last_char: shape
markers for triaging tokenization-class FPs
- has_leading_ref_marker: whether the captured phrase retains a
qualifier prefix — surfaces normalize-chain failures
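The spec-support shape markers can be sketched in the same spirit. The function names, the marker tuple, and the choice of three leading characters for `phrase_first_chars` are assumptions; the description does not say how many characters the real field keeps.

```python
# Hypothetical marker list, copied from the examples in the description above.
REF_MARKERS = ("said", "the", "so", "前述", "該")


def has_leading_ref_marker(phrase: str) -> bool:
    """True if the phrase still starts with a reference marker."""
    lowered = phrase.lower()
    for marker in REF_MARKERS:
        if not lowered.startswith(marker):
            continue
        rest = lowered[len(marker):]
        # A Latin marker must end at a word boundary ("theory" is not "the").
        if marker.isascii() and rest and rest[0].isalnum():
            continue
        return True
    return False


def phrase_shape(phrase: str, first_n: int = 3) -> dict:
    """Shape markers only: counts, edge characters, and one boolean flag."""
    return {
        "phrase_charlen": len(phrase),
        "phrase_first_chars": phrase[:first_n],  # first_n is an assumption
        "phrase_last_char": phrase[-1:],         # "" for an empty phrase
        "has_leading_ref_marker": has_leading_ref_marker(phrase),
    }
```

A phrase such as `"said housing"` would be flagged, hinting that the normalize chain left the qualifier in place.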
Context windows widened 30/22/18/12 → 60/45/35/25 (Latin/JA/Hangul/
Han) on BOTH client and server. The previous narrow windows
frequently truncated verb-object / Markush / possessive boundaries
needed for classification. Still under Privacy §6 "full paragraph"
threshold; still capped to 5 findings/report.
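As an illustration of the widened windows and the report cap, here is a small sketch. It assumes the window is applied on each side of the captured span and that script detection happens elsewhere; the description does not say whether the widths are per side or total, and the names `CONTEXT_WINDOW`, `excerpt`, and `cap_findings` are hypothetical.

```python
# New per-script window widths from this PR (previously 30/22/18/12).
CONTEXT_WINDOW = {"latin": 60, "ja": 45, "hangul": 35, "han": 25}

# Report cap stated in the description.
MAX_FINDINGS_PER_REPORT = 5


def excerpt(text: str, start: int, end: int, script: str) -> str:
    """Return the captured span plus up to `width` chars on each side."""
    width = CONTEXT_WINDOW[script]
    return text[max(0, start - width): end + width]


def cap_findings(findings: list) -> list:
    """Keep at most MAX_FINDINGS_PER_REPORT findings in a report."""
    return findings[:MAX_FINDINGS_PER_REPORT]
```

Keeping the widths in one shared table is one way to hold client and server to the same values.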
All new fields are structural metadata (single chars, char counts,
boolean flags, ref number lists). No additional draft prose enters
the payload beyond the widened excerpt windows.
2. Confidence-badge threshold lowered 75 → 65
Initial post-PR-168 diagnostic on the deployed bundle showed the
+25 ML-decision-tree boost (the only mechanism that pushes scores
into the 75-100 range) only fires when intros_pool > 53 — i.e., on
drafts with 50+ claims. Typical 10-20 claim test drafts never had
findings at the threshold, so the badge was effectively invisible.
At threshold 65: measured ~66% precision (vs 38% baseline; +28pp
lift) with ~11.5% coverage = ~1-3 badges per typical draft. Honest
1pp gap from the 70% target disclosed in the title attribute, which
also surfaces baseline for comparison.
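The precision and coverage trade-off behind the threshold change can be reproduced with a small helper. This is a sketch under assumptions: the badge is taken to show when `score >= threshold`, and the `(score, is_true_positive)` pairs in the usage note are toy data, not the PR's measurements.

```python
BADGE_THRESHOLD = 65  # lowered from 75 in this PR


def show_badge(score: float, threshold: float = BADGE_THRESHOLD) -> bool:
    """Assumed gate: the badge is shown at or above the threshold."""
    return score >= threshold


def precision_and_coverage(findings, threshold=BADGE_THRESHOLD):
    """findings: iterable of (score, is_true_positive) pairs.

    precision = true positives among badged findings / badged findings
    coverage  = badged findings / all findings
    """
    findings = list(findings)
    badged = [tp for score, tp in findings if show_badge(score, threshold)]
    precision = sum(badged) / len(badged) if badged else 0.0
    coverage = len(badged) / len(findings) if findings else 0.0
    return precision, coverage
```

For example, with the toy labels `[(80, True), (70, True), (66, False), (60, True), (40, False), (30, False)]`, lowering the threshold from 75 to 65 raises the number of badged findings from one to three while precision drops from 1.0 to 2/3, the same shape of trade-off the measurements above describe.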
3. Tests + harness
pytest 2704 passed, 11 skipped. No walker changes — extractor and
threshold updates only.
Why
Two parallel issues addressed in one PR:
- Confidence badge invisible on typical drafts: the +25 ML-tree boost in `compute_confidence_score` only fires when `intros_pool > 53` (drafts with 50+ claims). Typical 10-20 claim test drafts never had findings at threshold 75, so the badge was effectively never shown.
- Diagnostic payloads lack autonomous-triage context: the previous narrow context windows (30/22/18/12 chars) and limited per-finding shape markers forced per-report context-asking loops. With bulk reports coming, the payload alone needs to be enough.
Changes
Diagnostic enrichments (Privacy §6 compliant — structural metadata, not draft prose)
Antecedent (`extract_antecedent_basis`) gains:
- `np_boundary_char`: single char immediately after captured term (identifies whether an exclusion-set extension is the right fix)
- `ref_marker_before`: which definite-reference marker preceded the term (so / 前述 / 該 / said / the); informs possessive vs bare-intro classification
- `body_cross_refs`: list of cited claim numbers in the same claim text (surfaces incorporation-by-reference candidates autonomously)
Spec-support (`extract_spec_support`) gains:
- `phrase_charlen` / `phrase_first_chars` / `phrase_last_char`: shape markers for tokenization-class FP triage
- `has_leading_ref_marker`: whether captured phrase retains a qualifier prefix (surfaces normalize-chain failures)
Context windows widened 30/22/18/12 → 60/45/35/25 (Latin/JA/Hangul/Han) on BOTH client and server. Still under Privacy §6 "full paragraph" threshold; still capped to 5 findings/report.
Confidence badge threshold 75 → 65
Measured 66.4% precision (vs 38% baseline; +28pp lift) at threshold 65 with 11.5% coverage = ~1-3 badges per typical draft. Honest 1pp gap from the 70% target disclosed in the title attribute alongside the baseline number.
Tests
`pytest -q` → 2704 passed, 11 skipped. Frontend build clean.