Data-quality cleanup: knowledge_type casing, NER pollution, contradictions + periodic janitor #83
Merged
…ntradiction noise + periodic janitor

Production audit at 2026-05-26 (#82 + this) showed three persistent drift patterns the existing fixes hadn't closed:

1. **knowledge_type bifurcation**: same logical type stored as both `Fact (43,768)` and `fact (36)` (and the same for Claim/Event/Relationship, plus a separate `Relation (126)` that overlaps Relationship). Cause: `TripleInput` preserved the LLM's casing; `_normalise_knowledge_type` now lowercases and collapses `Relation→relationship` at validation time.
2. **spaCy NER pollution**: 11,752 entities (8% of the graph) had a URL as both subject and `rdfs:label`, and used UPPERCASE spaCy labels (`schema:PERSON 3,124`, `schema:ORG 3,746`, `schema:GPE 696`, `schema:WORK_OF_ART 455`, …) as `rdf:type`. `_emit_ner_missed` now skips URL-shaped text + numeric labels (CARDINAL, MONEY, PERCENT, QUANTITY, DATE, TIME, ORDINAL) and remaps the remainder to schema.org canonical names (ORG→Organization, GPE→Place, …).
3. **Contradiction false positives**: 6 of 8 production contradictions came from the same chunk_id (extraction conflation: LLM emits two distinct numbers from one paragraph under one subject URI). One contradiction had identical `object_a == object_b`. The endpoint now filters both.

Other fixes:

- **Invalid IRI code point '\n'** (1 failed job 2026-05-25): strip control chars in `TripleInput`/`EventInput`/`EntityInput` so a `\n` in a URI doesn't sink the whole job at pyoxigraph insert time.
- **Admin /knowledge/triples filter** was case-sensitive (`Entity` allowed, `entity` rejected) but storage uses lowercase, so it silently returned 0 rows for the most common filter. Now case-insensitive + `inferred` accepted.
- **`KS_GRAPH_FEDERATED` dead refs** dropped from `rag.py` (no producer since federation column was removed in migration 017; named graph has always been empty in production) and from `admin/content.py` sweep.
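The two write-boundary fixes described above (lowercasing `knowledge_type` and stripping control characters) can be sketched roughly as follows. This is a minimal illustration, not the service's actual `models.py`: the helper names `normalise_knowledge_type` and `strip_control_chars` and the exact character range stripped (C0 controls plus DEL) are assumptions made for the example.

```python
import re

# Alias named in the commit message: "Relation" collapses into "relationship".
_ALIASES = {"relation": "relationship"}

# Assumption: "control chars" means the C0 range plus DEL. A "\n" inside a
# URI is the case that made the triple-store insert fail.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def normalise_knowledge_type(value: str) -> str:
    """Lowercase the type and fold known aliases into one canonical form."""
    lowered = value.lower()
    return _ALIASES.get(lowered, lowered)


def strip_control_chars(term: str) -> str:
    """Remove control characters from a subject, predicate, or object string."""
    return _CONTROL_CHARS.sub("", term)


if __name__ == "__main__":
    print(normalise_knowledge_type("Fact"))
    print(normalise_knowledge_type("Relation"))
    print(strip_control_chars("http://example.org/item\n42"))
```

In a validator-based model these helpers would presumably run before any other field checks, so that capitalised input is still accepted on the wire but only the canonical form is ever stored.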
**New: periodic maintenance sweep** (`src/knowledge_service/maintenance/`)

- `normalize_knowledge_types`: lowercases existing `ks:knowledgeType` RDF-star annotations + Relation→relationship.
- `normalize_spacy_rdf_types`: remaps existing `schema:PERSON`/`schema:ORG` etc. to canonical, drops `schema:MONEY`/`schema:CARDINAL`/… rdf:type triples.
- Background asyncio task in lifespan, every 6h by default (`MAINTENANCE_INTERVAL_SECONDS=21600`, set to 0 to disable).
- Admin trigger: `POST /api/admin/maintenance/run` returns per-op stats.
- Idempotent: first run is the historical backfill, subsequent runs catch any drift from new extraction bugs.

Out of scope here: arxiv 0-yield problem owned by aegis worker. `worker/src/aegis_worker/activities/content.py:30` classifies `arxiv.org/abs/…` as `pdf` but sends the URL as-is. The /abs/ page is HTML abstract (~500 chars); 154/200 jobs on 2026-05-25 extracted 0 triples because the abstract is too dense for triple extraction. Fix in aegis is to rewrite to `arxiv.org/pdf/…` (or fetch+send full text) so the actual paper text reaches knowledge-service. Filed as follow-up for Raphael.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
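As a rough illustration of why the `normalize_spacy_rdf_types` backfill can be idempotent, the per-type decision can be written as a pure function that leaves already-canonical values alone. The function name `remap_rdf_type` is hypothetical; the ORG and GPE mappings and the numeric label list come from this commit message, while PERSON→Person is an assumed entry filling in part of the elided "…".

```python
from typing import Optional

# Numeric spaCy labels that are dropped rather than used as rdf:type.
NUMERIC_LABELS = frozenset(
    {"CARDINAL", "MONEY", "PERCENT", "QUANTITY", "DATE", "TIME", "ORDINAL"}
)

# ORG and GPE are stated in the commit message; PERSON is an assumption.
CANONICAL = {"ORG": "Organization", "GPE": "Place", "PERSON": "Person"}


def remap_rdf_type(rdf_type: str) -> Optional[str]:
    """Return the canonical type, None to drop the triple, or the input unchanged."""
    prefix, _, local = rdf_type.partition(":")
    if prefix != "schema":
        return rdf_type  # not a schema.org type: leave untouched
    if local in NUMERIC_LABELS:
        return None  # never valid as a type
    if local in CANONICAL:
        return f"schema:{CANONICAL[local]}"
    return rdf_type  # already canonical or unknown: no change


if __name__ == "__main__":
    for value in ("schema:ORG", "schema:MONEY", "schema:Organization"):
        print(value, "->", remap_rdf_type(value))
```

Because a canonical value maps to itself, applying the function a second time changes nothing, which is the property that lets the first run act as the backfill and later runs catch only new drift.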
arshadansari27 added a commit that referenced this pull request on May 25, 2026:
…ing, maintenance endpoint (#85)

PRs #83 (data-quality fixes + periodic janitor) and #84 (NamedNode-shape hotfix) changed three things the docs hadn't caught up to:

1. **`KS_GRAPH_FEDERATED` is gone.** Removed from README's named-graph list, the architecture doc's graph table, and the CLAUDE.md trust_tier description. The graph was empty in production for as long as the service has been deployed; the producer column was dropped in migration 017 (PR #82). What remains is 4 graphs: ontology / asserted / extracted / inferred.
2. **`knowledge_type` is no longer free-form.** `_normalise_knowledge_type` in `models.py` now lowercases at validation and collapses `Relation→relationship`. Updated the Status table in README, the Knowledge Types Reference in API.md, and every code example in both docs to show the lowercase canonical form (`claim`, `fact`, `event`, `entity`, `relationship`, `temporalfact`). Capitalised input is still accepted on the wire; the prose in API.md says so explicitly.
3. **`POST /api/admin/maintenance/run` exists.** Added the endpoint to API.md (full section with request/response/curl), to the admin row in the endpoints table, and a Maintenance Service section to CLAUDE.md describing `normalize_knowledge_types` / `normalize_spacy_rdf_types`, the lifespan wiring, and failure policy. Added the `MAINTENANCE_INTERVAL_SECONDS` and `MAINTENANCE_INITIAL_DELAY_SECONDS` env vars to API.md's Configuration table and to `docs/deployment.md`.

Also documented:

- The contradictions endpoint's new same-chunk + identical-object filters (README + CLAUDE.md).
- The control-char sanitisation on subject/predicate/object that prevents `Invalid IRI code point '\n'` job failures (CLAUDE.md Models section).
- The NER fallback's URL-skip + numeric-label drop + schema.org canonical remap (CLAUDE.md NLP Phase section).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
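The lifespan wiring and the two env vars mentioned above could look roughly like the sketch below. It is an illustration under stated assumptions, not the service's code: `maintenance_loop` and `run_once` are hypothetical names, the default of 0 for `MAINTENANCE_INITIAL_DELAY_SECONDS` is a guess, and "log the error and keep looping" is an assumed failure policy. Only the 21600-second default interval and the "0 disables" rule are taken from the PR.

```python
import asyncio
import logging
import os
from typing import Awaitable, Callable

logger = logging.getLogger("maintenance")


async def maintenance_loop(
    run_once: Callable[[], Awaitable[object]],
    interval_seconds: float,
    initial_delay_seconds: float = 0.0,
) -> None:
    """Call run_once forever, sleeping interval_seconds between runs."""
    if interval_seconds <= 0:
        return  # an interval of 0 disables the sweep entirely
    await asyncio.sleep(initial_delay_seconds)
    while True:
        try:
            await run_once()
        except Exception:
            # Assumed failure policy: one bad sweep must not stop later sweeps.
            # Cancellation is not an Exception, so shutdown still propagates.
            logger.exception("maintenance sweep failed")
        await asyncio.sleep(interval_seconds)


def interval_from_env() -> float:
    """Read the interval, defaulting to 6 hours as stated in the PR."""
    return float(os.environ.get("MAINTENANCE_INTERVAL_SECONDS", "21600"))


def initial_delay_from_env() -> float:
    """Read the initial delay; the default of 0 is an assumption."""
    return float(os.environ.get("MAINTENANCE_INITIAL_DELAY_SECONDS", "0"))
```

In an application lifespan hook, the loop would typically be started with `asyncio.create_task(...)` on startup and cancelled on shutdown; the admin endpoint can then call the same `run_once` coroutine directly for an on-demand sweep.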
Summary
Production audit on 2026-05-26 (141k triples) found three drift patterns the prior cleanup PRs didn't close, plus a new ingestion failure mode. This PR fixes them at the write boundary AND adds a periodic janitor that backfills the historical drift idempotently.
Findings → fixes (data-driven)
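| Finding | Fix |
| --- | --- |
| `knowledge_type` bifurcated: `Fact (43768)` / `fact (36)`, `Claim (11611)` / `claim (1678)`, `Event` / `event`, `Relationship` / `relationship`, plus `Relation (126)` overlapping Relationship | `TripleInput` validator lowercases + collapses `Relation→relationship` |
| 11,752 entities with a URL as both subject and `rdfs:label`; spaCy UPPERCASE labels as rdf:type (`schema:PERSON 3124`, `schema:ORG 3746`, …) | `_emit_ner_missed` skips URL text + numeric labels, maps to schema.org canonical |
| `Invalid IRI code point '\n'` failed an ingestion job | Control chars stripped in `TripleInput` / `EventInput` / `EntityInput` |
| `/api/knowledge/contradictions` false positives: 6 of 8 from the same chunk_id, one with identical `object_a == object_b` | Endpoint filters both cases |
| Admin `knowledge_type=Entity` filter returned 0 rows (storage uses lowercase `entity`) | Filter is case-insensitive; `inferred` added |
| `KS_GRAPH_FEDERATED` had 0 triples ever, no producer since migration 017 | Dropped from `rag.py` trust_tier logic, `admin/content.py` sweep, and `namespaces.py` |

New: periodic maintenance sweep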
`src/knowledge_service/maintenance/` provides idempotent cleanup of accumulated drift:

- `normalize_knowledge_types`: lowercases existing `ks:knowledgeType` RDF-star annotations (one-time backfill on first run, then catches any drift).
- `normalize_spacy_rdf_types`: remaps existing `schema:PERSON` / `schema:ORG` / etc. to canonical names; drops `schema:MONEY` / `CARDINAL` / `PERCENT` from rdf:type (never valid as types).
- Background asyncio task in lifespan, every 6h by default (`MAINTENANCE_INTERVAL_SECONDS`; 0 disables).
- Admin trigger: `POST /api/admin/maintenance/run` returns per-operation stats.

Out of scope (filed for Raphael in aegis)
154 of 200 most-recent ingestion jobs (2026-05-25) produced zero triples, all of them arxiv papers. Root cause is in `aegis/worker/src/aegis_worker/activities/content.py:30`: `_ARXIV_ABS_RE` classifies `arxiv.org/abs/…` as `content_type=pdf` but the URL is sent as-is. The `/abs/` page is HTML abstract (~500 chars), too dense for triple extraction.

Fix in aegis: rewrite `arxiv.org/abs/…` → `arxiv.org/pdf/…` before calling `knowledge_connector.ingest_content()`, or fetch+send the full PDF text via `raw_text=`. Should restore the ~1428 arxiv items in the corpus from 0-yield to useful extraction.

Test plan
- `ruff check` + `ruff format --check` clean
- `TestKnowledgeTypeNormalisation` (lowercasing + Relation alias)
- `TestRdfTermSanitisation` (control-char stripping)
- `TestNerFallbackFiltering` (URL skip, numeric drop, schema.org remap)
- `TestSameSourceDedup` (chunk-id filter, identical object skip)
- `TestNormalizeKnowledgeTypes` / `TestNormalizeSpacyRdfTypes` (idempotent maintenance ops against real pyoxigraph Store)
- `POST /api/admin/maintenance/run` once to verify backfill stats match the 2026-05-26 audit numbers
- `/api/admin/stats/types` after deploy to confirm bifurcation gone

🤖 Generated with Claude Code