Data-quality cleanup: knowledge_type casing, NER pollution, contradictions + periodic janitor by arshadansari27 · Pull Request #83 · arshadansari27/knowledge-service

arshadansari27 · 2026-05-25T21:43:08Z

Summary

Production audit on 2026-05-26 (141k triples) found three drift patterns the prior cleanup PRs didn't close, plus a new ingestion failure mode. This PR fixes them at the write boundary AND adds a periodic janitor that backfills the historical drift idempotently.

Findings → fixes (data-driven)

| Finding (production data) | Fix |
| --- | --- |
| `knowledge_type` bifurcated: `Fact (43768)/fact (36)`, `Claim (11611)/claim (1678)`, `Event/event`, `Relationship/relationship`, plus `Relation (126)` overlapping Relationship | `TripleInput` validator lowercases + collapses `Relation→relationship` |
| ~12k entities with URL as both subject and `rdfs:label`; spaCy UPPERCASE labels as rdf:type (`schema:PERSON 3124`, `schema:ORG 3746`, …) | `_emit_ner_missed` skips URL text + numeric labels, maps to schema.org canonical |
| 1 job failed 2026-05-25 with `Invalid IRI code point '\n'` | Strip control chars in `TripleInput`/`EventInput`/`EntityInput` |
| 6/8 contradictions in prod came from the same chunk (extraction conflation, not real contradictions); one had identical objects | Same-chunk dedup + identical-object guard in `/api/knowledge/contradictions` |
| Admin `knowledge_type=Entity` filter returned 0 rows (storage uses lowercase `entity`) | Case-insensitive valid_types + `inferred` added |
| `KS_GRAPH_FEDERATED` had 0 triples ever, no producer since migration 017 | Removed from `rag.py` trust_tier logic, `admin/content.py` sweep, and `namespaces.py` |
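To make the first and third rows concrete, here is a minimal sketch of write-boundary sanitisation. It is not the PR's actual validator code: the function names, the alias table, and the exact set of stripped characters are assumptions for illustration, and the real checks live in the `TripleInput`/`EventInput`/`EntityInput` models.

```python
import re

# Assumption: the PR only documents the Relation -> relationship alias.
_KNOWLEDGE_TYPE_ALIASES = {"relation": "relationship"}

# Assumption: "control chars" means the C0 range (includes \n, \r, \t) plus DEL.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def normalise_knowledge_type(value: str) -> str:
    """Lowercase the type and collapse known aliases to one canonical name."""
    lowered = value.strip().lower()
    return _KNOWLEDGE_TYPE_ALIASES.get(lowered, lowered)


def strip_control_chars(value: str) -> str:
    """Remove control characters so a stray newline cannot reach the IRI parser."""
    return _CONTROL_CHARS.sub("", value)
```

Normalising at validation time means every writer gets the same canonical form, so the storage layer never has to reconcile `Fact` with `fact` again.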

New: periodic maintenance sweep

src/knowledge_service/maintenance/ — idempotent cleanup of accumulated drift:

- `normalize_knowledge_types`: lowercases existing `ks:knowledgeType` RDF-star annotations (one-time backfill on first run, then catches any drift).
- `normalize_spacy_rdf_types`: remaps existing `schema:PERSON`/`schema:ORG`/etc. to canonical names; drops `schema:MONEY`/`CARDINAL`/`PERCENT` from `rdf:type` (never valid as types).
- Scheduled background task (6h default, env `MAINTENANCE_INTERVAL_SECONDS`; 0 disables).
- Manual trigger: `POST /api/admin/maintenance/run` returns per-operation stats.
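The scheduling behaviour above can be sketched as a plain asyncio loop. This is an illustration under assumptions, not the service's lifespan code: `read_interval`, `maintenance_loop`, and the `max_runs` parameter are hypothetical names, the sleep-then-run ordering is a guess, and logging and continuing after a failed sweep is an assumed failure policy.

```python
import asyncio
import logging
import os
from typing import Awaitable, Callable, Mapping, Optional

logger = logging.getLogger(__name__)


def read_interval(env: Optional[Mapping[str, str]] = None) -> float:
    """Read MAINTENANCE_INTERVAL_SECONDS, defaulting to 6 hours (21600 s)."""
    env = os.environ if env is None else env
    return float(env.get("MAINTENANCE_INTERVAL_SECONDS", 21600))


async def maintenance_loop(
    run_once: Callable[[], Awaitable[None]],
    interval: float,
    max_runs: Optional[int] = None,  # hypothetical knob, handy for tests
) -> int:
    """Run `run_once` every `interval` seconds; an interval of 0 disables it."""
    if interval <= 0:
        return 0
    runs = 0
    while max_runs is None or runs < max_runs:
        await asyncio.sleep(interval)
        try:
            await run_once()
        except Exception:
            # Assumed policy: one failed sweep should not kill the loop.
            logger.exception("maintenance sweep failed")
        runs += 1
    return runs
```

Because the operations are idempotent, a sweep that repeats or overlaps with a manual trigger should only redo work that is already done.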

Out of scope (filed for Raphael in aegis)

154 of 200 most-recent ingestion jobs (2026-05-25) produced zero triples — all arxiv papers. Root cause is in aegis/worker/src/aegis_worker/activities/content.py:30 — _ARXIV_ABS_RE classifies arxiv.org/abs/… as content_type=pdf but the URL is sent as-is. The /abs/ page is HTML abstract (~500 chars), too dense for triple extraction.

Fix in aegis: rewrite arxiv.org/abs/… → arxiv.org/pdf/… before calling knowledge_connector.ingest_content(), or fetch+send the full PDF text via raw_text=. Should restore the ~1428 arxiv items in the corpus from 0-yield to useful extraction.
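The first option (URL rewrite) could look roughly like the sketch below. The regex and the `abs_to_pdf` helper are hypothetical; the actual `_ARXIV_ABS_RE` in aegis is not shown in this PR and may match differently.

```python
import re

# Hypothetical pattern; captures the host part and everything after /abs/.
_ARXIV_ABS = re.compile(r"^(https?://(?:www\.)?arxiv\.org)/abs/(.+)$")


def abs_to_pdf(url: str) -> str:
    """Rewrite an arxiv abstract URL to its PDF URL; leave other URLs untouched."""
    match = _ARXIV_ABS.match(url)
    if match is None:
        return url
    return f"{match.group(1)}/pdf/{match.group(2)}"
```

The rewrite would run before `knowledge_connector.ingest_content()` is called, so the URL and the `content_type=pdf` classification agree.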

Test plan

- 699 unit tests pass; `ruff check` + `ruff format --check` clean
- New: `TestKnowledgeTypeNormalisation` (lowercasing + Relation alias)
- New: `TestRdfTermSanitisation` (control-char stripping)
- New: `TestNerFallbackFiltering` (URL skip, numeric drop, schema.org remap)
- New: `TestSameSourceDedup` (chunk-id filter, identical object skip)
- New: `TestNormalizeKnowledgeTypes` / `TestNormalizeSpacyRdfTypes` (idempotent maintenance ops against real pyoxigraph Store)
- Deploy + run `POST /api/admin/maintenance/run` once to verify backfill stats match the 2026-05-26 audit numbers
- Re-run `/api/admin/stats/types` after deploy to confirm bifurcation gone

🤖 Generated with Claude Code

…ntradiction noise + periodic janitor

Production audit at 2026-05-26 (#82 + this) showed three persistent drift patterns the existing fixes hadn't closed:

1. **knowledge_type bifurcation**: same logical type stored as both `Fact (43,768)` and `fact (36)` (and the same for Claim/Event/Relationship, plus a separate `Relation (126)` that overlaps Relationship). Cause: `TripleInput` preserved the LLM's casing; `_normalise_knowledge_type` now lowercases and collapses `Relation→relationship` at validation time.
2. **spaCy NER pollution**: 11,752 entities (8% of the graph) had a URL as both subject and `rdfs:label`, and used UPPERCASE spaCy labels (`schema:PERSON 3,124`, `schema:ORG 3,746`, `schema:GPE 696`, `schema:WORK_OF_ART 455`, …) as `rdf:type`. `_emit_ner_missed` now skips URL-shaped text + numeric labels (CARDINAL, MONEY, PERCENT, QUANTITY, DATE, TIME, ORDINAL) and remaps the remainder to schema.org canonical names (ORG→Organization, GPE→Place, …).
3. **Contradiction false positives**: 6 of 8 production contradictions came from the same chunk_id (extraction conflation: LLM emits two distinct numbers from one paragraph under one subject URI). One contradiction had identical `object_a == object_b`. The endpoint now filters both.

Other fixes:

- **Invalid IRI code point '\n'** (1 failed job 2026-05-25): strip control chars in `TripleInput`/`EventInput`/`EntityInput` so a `\n` in a URI doesn't sink the whole job at pyoxigraph insert time.
- **Admin /knowledge/triples filter** was case-sensitive (`Entity` allowed, `entity` rejected) but storage uses lowercase, so it silently returned 0 rows for the most common filter. Now case-insensitive + `inferred` accepted.
- **`KS_GRAPH_FEDERATED` dead refs** dropped from `rag.py` (no producer since federation column was removed in migration 017; named graph has always been empty in production) and from `admin/content.py` sweep.

**New: periodic maintenance sweep** (`src/knowledge_service/maintenance/`)

- `normalize_knowledge_types`: lowercases existing `ks:knowledgeType` RDF-star annotations + Relation→relationship.
- `normalize_spacy_rdf_types`: remaps existing `schema:PERSON`/`schema:ORG` etc. to canonical, drops `schema:MONEY`/`schema:CARDINAL`/… rdf:type triples.
- Background asyncio task in lifespan, every 6h by default (`MAINTENANCE_INTERVAL_SECONDS=21600`, set to 0 to disable).
- Admin trigger: `POST /api/admin/maintenance/run` returns per-op stats.
- Idempotent: first run is the historical backfill, subsequent runs catch any drift from new extraction bugs.

Out of scope here (arxiv 0-yield problem owned by aegis worker): `worker/src/aegis_worker/activities/content.py:30` classifies `arxiv.org/abs/…` as `pdf` but sends the URL as-is. The /abs/ page is HTML abstract (~500 chars); 154/200 jobs on 2026-05-25 extracted 0 triples because the abstract is too dense for triple extraction. Fix in aegis is to rewrite to `arxiv.org/pdf/…` (or fetch+send full text) so the actual paper text reaches knowledge-service. Filed as follow-up for Raphael.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
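The NER fallback filtering described in this commit message can be sketched as a single decision function. This is a hedged illustration, not the body of `_emit_ner_missed`: the `canonical_type` helper and the URL regex are hypothetical, only ORG→Organization and GPE→Place are stated in the PR (PERSON→Person is an assumption), and returning `None` for unmapped labels is a simplification of this sketch.

```python
import re
from typing import Optional

# Numeric/temporal spaCy labels the commit message lists as dropped.
NUMERIC_LABELS = frozenset(
    {"CARDINAL", "MONEY", "PERCENT", "QUANTITY", "DATE", "TIME", "ORDINAL"}
)

# ORG and GPE come from the PR; PERSON -> Person is an assumed mapping.
CANONICAL = {"ORG": "Organization", "GPE": "Place", "PERSON": "Person"}

# Hypothetical notion of "URL-shaped" text.
_URL_SHAPED = re.compile(r"^(?:https?://|www\.)", re.IGNORECASE)


def canonical_type(text: str, label: str) -> Optional[str]:
    """Return the schema.org type name for an NER hit, or None to skip it."""
    if _URL_SHAPED.match(text.strip()):
        return None  # a URL is not an entity name
    if label in NUMERIC_LABELS:
        return None  # numbers and dates are never valid rdf:type values
    return CANONICAL.get(label)  # None for labels this sketch does not map
```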

…ing, maintenance endpoint (#85)

PRs #83 (data-quality fixes + periodic janitor) and #84 (NamedNode-shape hotfix) changed three things the docs hadn't caught up to:

1. **`KS_GRAPH_FEDERATED` is gone.** Removed from README's named-graph list, the architecture doc's graph table, and the CLAUDE.md trust_tier description. The graph was empty in production for as long as the service has been deployed; the producer column was dropped in migration 017 (PR #82). What remains is 4 graphs: ontology / asserted / extracted / inferred.
2. **`knowledge_type` is no longer free-form.** `_normalise_knowledge_type` in `models.py` now lowercases at validation and collapses `Relation→relationship`. Updated the Status table in README, the Knowledge Types Reference in API.md, and every code example in both docs to show the lowercase canonical form (`claim`, `fact`, `event`, `entity`, `relationship`, `temporalfact`). Capitalised input is still accepted on the wire; the prose in API.md says so explicitly.
3. **`POST /api/admin/maintenance/run` exists.** Added the endpoint to API.md (full section with request/response/curl), to the admin row in the endpoints table, and a Maintenance Service section to CLAUDE.md describing `normalize_knowledge_types` / `normalize_spacy_rdf_types`, the lifespan wiring, and failure policy. Added the `MAINTENANCE_INTERVAL_SECONDS` and `MAINTENANCE_INITIAL_DELAY_SECONDS` env vars to API.md's Configuration table and to `docs/deployment.md`.

Also documented:

- The contradictions endpoint's new same-chunk + identical-object filters (README + CLAUDE.md).
- The control-char sanitisation on subject/predicate/object that prevents `Invalid IRI code point '\n'` job failures (CLAUDE.md Models section).
- The NER fallback's URL-skip + numeric-label drop + schema.org canonical remap (CLAUDE.md NLP Phase section).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
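As an illustration of the same-chunk and identical-object filters mentioned above, the sketch below shows one way the two guards could be expressed. The `Contradiction` dataclass and the `chunk_id_a`/`chunk_id_b` field names are assumptions; only `object_a`, `object_b`, and the notion of a `chunk_id` come from the PR text, and the endpoint's real data shape is not shown there.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Contradiction:
    """Hypothetical record for a pair of conflicting triples."""
    subject: str
    predicate: str
    object_a: str
    object_b: str
    chunk_id_a: Optional[str]
    chunk_id_b: Optional[str]


def is_reportable(c: Contradiction) -> bool:
    """Drop pairs that are extraction artefacts rather than real conflicts."""
    if c.object_a == c.object_b:
        return False  # identical objects cannot contradict each other
    if c.chunk_id_a is not None and c.chunk_id_a == c.chunk_id_b:
        return False  # both sides came from one chunk: likely conflation
    return True


def filter_contradictions(items: List[Contradiction]) -> List[Contradiction]:
    """Keep only the contradictions worth surfacing to a reviewer."""
    return [c for c in items if is_reportable(c)]
```

Pairs with no recorded chunk are kept in this sketch, on the assumption that a missing provenance value should not hide a potential conflict.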

arshadansari27 merged commit f5f1cbf into main May 25, 2026
5 checks passed

This was referenced May 25, 2026

fix(maintenance): write knowledge_type as NamedNode URI (not Literal) #84

Merged

docs: post-PR-83/84 drift sweep #85

Merged

arshadansari27 merged 1 commit into main from worktree-prod-data-quality-fixes
