
Data-quality cleanup: knowledge_type casing, NER pollution, contradictions + periodic janitor #83

Merged
arshadansari27 merged 1 commit into main from worktree-prod-data-quality-fixes on May 25, 2026
Conversation

@arshadansari27
Owner

Summary

Production audit on 2026-05-26 (141k triples) found three drift patterns the prior cleanup PRs didn't close, plus a new ingestion failure mode. This PR fixes them at the write boundary AND adds a periodic janitor that backfills the historical drift idempotently.

Findings → fixes (data-driven)

| Finding (production data) | Fix |
| --- | --- |
| knowledge_type bifurcated: Fact (43768)/fact (36), Claim (11611)/claim (1678), Event/event, Relationship/relationship, plus Relation (126) overlapping Relationship | TripleInput validator lowercases + collapses Relation→relationship |
| ~12k entities with URL as both subject and rdfs:label; spaCy UPPERCASE labels as rdf:type (schema:PERSON 3124, schema:ORG 3746, …) | _emit_ner_missed skips URL text + numeric labels, maps to schema.org canonical |
| 1 job failed 2026-05-25 with Invalid IRI code point '\n' | Strip control chars in TripleInput/EventInput/EntityInput |
| 6/8 contradictions in prod came from the same chunk (extraction conflation, not real contradictions); one had identical objects | Same-chunk dedup + identical-object guard in /api/knowledge/contradictions |
| Admin knowledge_type=Entity filter returned 0 rows (storage uses lowercase entity) | Case-insensitive valid_types + inferred added |
| KS_GRAPH_FEDERATED had 0 triples ever, no producer since migration 017 | Removed from rag.py trust_tier logic, admin/content.py sweep, and namespaces.py |
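
For readers who want a concrete picture of the first row, the write-boundary normalisation might look roughly like the sketch below. It is not the code from `models.py`: the function name, the alias table, and the whitespace trimming are assumptions; only the lowercasing and the Relation→relationship collapse come from this PR.

```python
# Hypothetical alias table. Only "relation" -> "relationship" is confirmed by the PR.
_KNOWLEDGE_TYPE_ALIASES = {"relation": "relationship"}


def normalise_knowledge_type(value: str) -> str:
    """Lowercase a knowledge_type and collapse known aliases.

    Trimming surrounding whitespace is an assumption of this sketch.
    """
    lowered = value.strip().lower()
    return _KNOWLEDGE_TYPE_ALIASES.get(lowered, lowered)


if __name__ == "__main__":
    for raw in ("Fact", "fact", "Claim", "Relation", "Relationship"):
        print(raw, "->", normalise_knowledge_type(raw))
```

Because the function is a pure mapping onto its own fixed points, applying it twice gives the same result as applying it once, which is the property the periodic backfill relies on.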

New: periodic maintenance sweep

src/knowledge_service/maintenance/ — idempotent cleanup of accumulated drift:

  • normalize_knowledge_types: lowercases existing ks:knowledgeType RDF-star annotations (one-time backfill on first run, then catches any drift).
  • normalize_spacy_rdf_types: remaps existing schema:PERSON/schema:ORG/etc. to canonical names; drops schema:MONEY/CARDINAL/PERCENT from rdf:type (never valid as types).
  • Scheduled background task (6h default, env MAINTENANCE_INTERVAL_SECONDS; 0 disables).
  • Manual trigger: POST /api/admin/maintenance/run returns per-operation stats.
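
A minimal sketch of how such a scheduled task could be wired, assuming an asyncio loop that treats an interval of 0 as "disabled". The names `interval_from_env` and `maintenance_loop`, the run-then-sleep ordering, and the log-and-continue failure handling are all assumptions of this sketch, not the service's actual code.

```python
import asyncio
import logging
import os
from typing import Awaitable, Callable

logger = logging.getLogger("maintenance")


def interval_from_env(default: float = 21600.0) -> float:
    """Read MAINTENANCE_INTERVAL_SECONDS, falling back to the 6h default."""
    return float(os.environ.get("MAINTENANCE_INTERVAL_SECONDS", default))


async def maintenance_loop(
    run_sweep: Callable[[], Awaitable[object]],
    interval_seconds: float,
) -> None:
    """Run `run_sweep` forever, sleeping `interval_seconds` between runs."""
    if interval_seconds <= 0:
        return  # 0 disables the periodic sweep entirely
    while True:
        try:
            await run_sweep()
        except Exception:
            # Assumed failure policy: log and try again on the next tick.
            logger.exception("maintenance sweep failed; retrying next interval")
        await asyncio.sleep(interval_seconds)
```

In a lifespan hook this would typically be started with `asyncio.create_task(...)` and cancelled on shutdown; cancellation propagates because `asyncio.CancelledError` is not caught by `except Exception`.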

Out of scope (filed for Raphael in aegis)

154 of 200 most-recent ingestion jobs (2026-05-25) produced zero triples, all of them arxiv papers. Root cause is in aegis/worker/src/aegis_worker/activities/content.py:30: _ARXIV_ABS_RE classifies arxiv.org/abs/… as content_type=pdf but the URL is sent as-is. The /abs/ page is an HTML abstract (~500 chars), too dense for triple extraction.

Fix in aegis: rewrite arxiv.org/abs/… to arxiv.org/pdf/… before calling knowledge_connector.ingest_content(), or fetch+send the full PDF text via raw_text=. This should move the ~1428 arxiv items in the corpus from 0-yield to useful extraction.
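
The URL rewrite suggested above could be sketched as follows. This is an illustration only: `abs_to_pdf` and its regex are hypothetical and are not the `_ARXIV_ABS_RE` that lives in aegis, and the paper ID in the usage lines is a placeholder.

```python
import re

# Hypothetical pattern: captures the arxiv host and everything after /abs/.
_ABS_URL_RE = re.compile(r"^(https?://(?:www\.)?arxiv\.org)/abs/(.+)$")


def abs_to_pdf(url: str) -> str:
    """Rewrite an arxiv abstract URL to its PDF counterpart.

    URLs that do not match the /abs/ shape are returned unchanged.
    """
    match = _ABS_URL_RE.match(url)
    if match is None:
        return url
    host, paper_id = match.groups()
    return f"{host}/pdf/{paper_id}"


if __name__ == "__main__":
    print(abs_to_pdf("https://arxiv.org/abs/0000.00000"))  # placeholder ID
    print(abs_to_pdf("https://example.org/abs/0000.00000"))  # left untouched
```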

Test plan

  • 699 unit tests pass; ruff check + ruff format --check clean
  • New: TestKnowledgeTypeNormalisation (lowercasing + Relation alias)
  • New: TestRdfTermSanitisation (control-char stripping)
  • New: TestNerFallbackFiltering (URL skip, numeric drop, schema.org remap)
  • New: TestSameSourceDedup (chunk-id filter, identical object skip)
  • New: TestNormalizeKnowledgeTypes / TestNormalizeSpacyRdfTypes (idempotent maintenance ops against real pyoxigraph Store)
  • Deploy + run POST /api/admin/maintenance/run once to verify backfill stats match the 2026-05-26 audit numbers
  • Re-run /api/admin/stats/types after deploy to confirm bifurcation gone

🤖 Generated with Claude Code

…ntradiction noise + periodic janitor

Production audit at 2026-05-26 (#82 + this) showed three persistent drift
patterns the existing fixes hadn't closed:

1. **knowledge_type bifurcation** — same logical type stored as both
   `Fact (43,768)` and `fact (36)` (and the same for Claim/Event/Relationship,
   plus a separate `Relation (126)` that overlaps Relationship). Cause:
   `TripleInput` preserved the LLM's casing; `_normalise_knowledge_type`
   now lowercases and collapses `Relation→relationship` at validation time.

2. **spaCy NER pollution** — 11,752 entities (8% of the graph) had a URL
   as both subject and `rdfs:label`, and used UPPERCASE spaCy labels
   (`schema:PERSON 3,124`, `schema:ORG 3,746`, `schema:GPE 696`,
   `schema:WORK_OF_ART 455`, …) as `rdf:type`. `_emit_ner_missed` now
   skips URL-shaped text + numeric labels (CARDINAL, MONEY, PERCENT,
   QUANTITY, DATE, TIME, ORDINAL) and remaps the remainder to schema.org
   canonical names (ORG→Organization, GPE→Place, …).

3. **Contradiction false positives** — 6 of 8 production contradictions
   came from the same chunk_id (extraction conflation: LLM emits two
   distinct numbers from one paragraph under one subject URI). One
   contradiction had identical `object_a == object_b`. The endpoint now
   filters both.
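
As a rough illustration of item 3, the two filters could be expressed like this. The `Contradiction` record and its `chunk_id_a`/`chunk_id_b` fields are hypothetical stand-ins for whatever the endpoint actually returns; only `object_a`, `object_b`, and the notion of a shared chunk id come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Contradiction:
    # Hypothetical record shape, for illustration only.
    subject: str
    predicate: str
    object_a: str
    object_b: str
    chunk_id_a: Optional[str]
    chunk_id_b: Optional[str]


def filter_contradictions(items: List[Contradiction]) -> List[Contradiction]:
    """Drop pairs that are not real contradictions."""
    kept = []
    for item in items:
        if item.object_a == item.object_b:
            continue  # identical objects cannot contradict each other
        if item.chunk_id_a is not None and item.chunk_id_a == item.chunk_id_b:
            continue  # same chunk: most likely extraction conflation
        kept.append(item)
    return kept
```

Pairs with an unknown chunk id are kept in this sketch, on the assumption that missing provenance should not silently hide a possible contradiction.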

Other fixes:
- **Invalid IRI code point '\n'** (1 failed job 2026-05-25): strip control
  chars in `TripleInput`/`EventInput`/`EntityInput` so a `\n` in a URI
  doesn't sink the whole job at pyoxigraph insert time.
- **Admin /knowledge/triples filter** was case-sensitive (`Entity` allowed,
  `entity` rejected) but storage uses lowercase — silently returned 0 rows
  for the most common filter. Now case-insensitive + `inferred` accepted.
- **`KS_GRAPH_FEDERATED` dead refs** dropped from `rag.py` (no producer
  since federation column was removed in migration 017; named graph has
  always been empty in production) and from `admin/content.py` sweep.
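
The control-character fix in the first bullet might look like the sketch below. Which code points the real validators strip is not stated here, so the C0/C1 range (plus DEL) used in this regex is an assumption, as is the helper name.

```python
import re

# Assumed range: C0 controls (including \n, \r, \t), DEL, and C1 controls.
_CONTROL_CHARS_RE = re.compile(r"[\x00-\x1f\x7f-\x9f]")


def strip_control_chars(value: str) -> str:
    """Remove control characters that are invalid inside an IRI."""
    return _CONTROL_CHARS_RE.sub("", value)


if __name__ == "__main__":
    dirty = "http://example.org/entity\n/with-newline"
    print(repr(strip_control_chars(dirty)))
```

Applying this to subject, predicate, and object before they reach the store means one stray `\n` costs a character rather than the whole job.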

**New: periodic maintenance sweep** (`src/knowledge_service/maintenance/`)
- `normalize_knowledge_types`: lowercases existing `ks:knowledgeType`
  RDF-star annotations + Relation→relationship.
- `normalize_spacy_rdf_types`: remaps existing `schema:PERSON`/`schema:ORG`
  etc. to canonical, drops `schema:MONEY`/`schema:CARDINAL`/… rdf:type
  triples.
- Background asyncio task in lifespan, every 6h by default
  (`MAINTENANCE_INTERVAL_SECONDS=21600`, set to 0 to disable).
- Admin trigger: `POST /api/admin/maintenance/run` returns per-op stats.
- Idempotent — first run is the historical backfill, subsequent runs
  catch any drift from new extraction bugs.
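
The remap-or-drop rule shared by `_emit_ner_missed` (new entities) and `normalize_spacy_rdf_types` (existing data) could be sketched as a single lookup. The dropped labels and the ORG/GPE targets are taken from this commit message; the PERSON target, the URL test, and the choice to return `None` for unmapped labels are assumptions of the sketch.

```python
import re
from typing import Optional

# Labels listed in the commit message as never valid rdf:type values.
_DROPPED_LABELS = frozenset(
    {"CARDINAL", "MONEY", "PERCENT", "QUANTITY", "DATE", "TIME", "ORDINAL"}
)

# ORG and GPE targets come from the commit message; PERSON -> Person is assumed.
_CANONICAL_TYPES = {"ORG": "Organization", "GPE": "Place", "PERSON": "Person"}

# Assumed definition of "URL-shaped" text.
_URL_RE = re.compile(r"^https?://", re.IGNORECASE)


def canonical_type(text: str, label: str) -> Optional[str]:
    """Return the schema.org type name for a spaCy entity, or None to skip it."""
    if _URL_RE.match(text.strip()):
        return None  # URL-shaped text is not an entity
    if label in _DROPPED_LABELS:
        return None  # numeric/temporal labels are dropped
    return _CANONICAL_TYPES.get(label)  # unmapped labels are skipped here
```

Keeping the mapping in one table lets the write path and the backfill agree by construction, so a label fixed in one place cannot drift in the other.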

Out of scope here — arxiv 0-yield problem owned by aegis worker:
`worker/src/aegis_worker/activities/content.py:30` classifies
`arxiv.org/abs/…` as `pdf` but sends the URL as-is. The /abs/ page is an HTML
abstract (~500 chars); 154/200 jobs on 2026-05-25 extracted 0 triples
because the abstract is too dense for triple extraction. Fix in aegis is
to rewrite to `arxiv.org/pdf/…` (or fetch+send full text) so the actual
paper text reaches knowledge-service. Filed as follow-up for Raphael.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@arshadansari27 arshadansari27 merged commit f5f1cbf into main May 25, 2026
5 checks passed
arshadansari27 added a commit that referenced this pull request May 25, 2026
…ing, maintenance endpoint (#85)

PRs #83 (data-quality fixes + periodic janitor) and #84 (NamedNode-shape
hotfix) changed three things the docs hadn't caught up to:

1. **`KS_GRAPH_FEDERATED` is gone.** Removed from README's named-graph
   list, the architecture doc's graph table, and the CLAUDE.md trust_tier
   description. The graph was empty in production for as long as the
   service has been deployed; the producer column was dropped in
   migration 017 (PR #82). What remains is 4 graphs: ontology / asserted
   / extracted / inferred.

2. **`knowledge_type` is no longer free-form.** `_normalise_knowledge_type`
   in `models.py` now lowercases at validation and collapses `Relation→
   relationship`. Updated the Status table in README, the Knowledge Types
   Reference in API.md, and every code example in both docs to show the
   lowercase canonical form (`claim`, `fact`, `event`, `entity`,
   `relationship`, `temporalfact`). Capitalised input is still accepted on
   the wire — the prose in API.md says so explicitly.

3. **`POST /api/admin/maintenance/run` exists.** Added the endpoint to
   API.md (full section with request/response/curl), to the admin row in
   the endpoints table, and a Maintenance Service section to CLAUDE.md
   describing `normalize_knowledge_types` / `normalize_spacy_rdf_types`,
   the lifespan wiring, and failure policy. Added the
   `MAINTENANCE_INTERVAL_SECONDS` and `MAINTENANCE_INITIAL_DELAY_SECONDS`
   env vars to API.md's Configuration table and to `docs/deployment.md`.

Also documented:
- The contradictions endpoint's new same-chunk + identical-object filters
  (README + CLAUDE.md).
- The control-char sanitisation on subject/predicate/object that
  prevents `Invalid IRI code point '\n'` job failures (CLAUDE.md Models
  section).
- The NER fallback's URL-skip + numeric-label drop + schema.org canonical
  remap (CLAUDE.md NLP Phase section).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>