chore(schema): migration 018 — drop unused columns and indexes#82
Merged
Audit (2026-05-23) identified 6 columns and 4 indexes with no readers in code. The writers for the columns were removed in PR #81. This migration must NOT land before #81 is on main, otherwise prod will try to write to columns that no longer exist.

Columns dropped:
- ``ingestion_jobs.source_format``: never written (no JobTracker key, no UPDATE).
- ``ingestion_jobs.entities_linked``: JobTracker wrote it after NLP; no reader. Write dropped in #81.
- ``ingestion_jobs.entities_coref``: same shape; write dropped in #81.
- ``ingestion_jobs.chunks_skipped``: ExtractPhase always returned 0; column was structurally always 0. Write + display dropped in #81.
- ``entity_aliases.source``: hard-coded ``"spacy_linking"`` by CoreferencePhase; no reader. Write dropped in #81.
- ``content_metadata.metadata``: API accepted a JSONB metadata dict and ContentStore wrote it, but no SELECT path ever surfaced it back. API field + write dropped in #81.

Indexes dropped:
- ``idx_provenance_confidence``: no ``WHERE``/``ORDER BY`` on ``provenance.confidence`` in any code path. Confidence filtering lives in pyoxigraph SPARQL annotations.
- ``idx_provenance_source_type``: no ``WHERE provenance.source_type =`` in any query path.
- ``idx_provenance_valid_range``: ``valid_from`` / ``valid_until`` are read via ``SELECT *`` but never appear in a ``WHERE`` predicate (the temporal-validity filter is in SPARQL).
- ``idx_entity_aliases_canonical``: no reverse "what aliases point at this canonical URI" lookup is implemented; only forward alias-PK lookup is used.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
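The migration file itself is not shown on this page, so the following is only a sketch of what the drop list above could translate to. It assumes PostgreSQL syntax (`DROP INDEX IF EXISTS`, `ALTER TABLE ... DROP COLUMN IF EXISTS`); the helper name `build_migration_sql` and the choice to drop indexes before columns are illustrative assumptions, not the author's actual migration.

```python
# Hypothetical sketch: table, column and index names come from the commit message,
# everything else (helper name, statement order, IF EXISTS guards) is an assumption.
DROPPED_COLUMNS = [
    ("ingestion_jobs", "source_format"),
    ("ingestion_jobs", "entities_linked"),
    ("ingestion_jobs", "entities_coref"),
    ("ingestion_jobs", "chunks_skipped"),
    ("entity_aliases", "source"),
    ("content_metadata", "metadata"),
]

DROPPED_INDEXES = [
    "idx_provenance_confidence",
    "idx_provenance_source_type",
    "idx_provenance_valid_range",
    "idx_entity_aliases_canonical",
]


def build_migration_sql() -> list:
    """Return the DDL statements as strings, indexes first, then columns."""
    statements = [f"DROP INDEX IF EXISTS {name};" for name in DROPPED_INDEXES]
    statements += [
        f"ALTER TABLE {table} DROP COLUMN IF EXISTS {column};"
        for table, column in DROPPED_COLUMNS
    ]
    return statements
```

The `IF EXISTS` guards keep the statements safe to re-run, which can help if a deploy is retried halfway through.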
arshadansari27 added a commit that referenced this pull request May 25, 2026
…ntradiction noise + periodic janitor (#83)

Production audit at 2026-05-26 (#82 + this) showed three persistent drift patterns the existing fixes hadn't closed:

1. **knowledge_type bifurcation**: same logical type stored as both `Fact (43,768)` and `fact (36)` (and the same for Claim/Event/Relationship, plus a separate `Relation (126)` that overlaps Relationship). Cause: `TripleInput` preserved the LLM's casing; `_normalise_knowledge_type` now lowercases and collapses `Relation→relationship` at validation time.
2. **spaCy NER pollution**: 11,752 entities (8% of the graph) had a URL as both subject and `rdfs:label`, and used UPPERCASE spaCy labels (`schema:PERSON 3,124`, `schema:ORG 3,746`, `schema:GPE 696`, `schema:WORK_OF_ART 455`, …) as `rdf:type`. `_emit_ner_missed` now skips URL-shaped text + numeric labels (CARDINAL, MONEY, PERCENT, QUANTITY, DATE, TIME, ORDINAL) and remaps the remainder to schema.org canonical names (ORG→Organization, GPE→Place, …).
3. **Contradiction false positives**: 6 of 8 production contradictions came from the same chunk_id (extraction conflation: LLM emits two distinct numbers from one paragraph under one subject URI). One contradiction had identical `object_a == object_b`. The endpoint now filters both.

Other fixes:
- **Invalid IRI code point '\n'** (1 failed job 2026-05-25): strip control chars in `TripleInput`/`EventInput`/`EntityInput` so a `\n` in a URI doesn't sink the whole job at pyoxigraph insert time.
- **Admin /knowledge/triples filter** was case-sensitive (`Entity` allowed, `entity` rejected) but storage uses lowercase, so it silently returned 0 rows for the most common filter. Now case-insensitive + `inferred` accepted.
- **`KS_GRAPH_FEDERATED` dead refs** dropped from `rag.py` (no producer since federation column was removed in migration 017; named graph has always been empty in production) and from `admin/content.py` sweep.
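The casing fix in item 1 and the control-character fix under "Other fixes" are both small validation-time transforms. A minimal sketch, assuming nothing beyond what the commit message states: the function names below are hypothetical stand-ins (the real `_normalise_knowledge_type` is not shown on this page), and the exact set of characters stripped is an assumption.

```python
import re

# Assumption: "control chars" means the C0 range plus DEL; the real code may differ.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def normalise_knowledge_type_sketch(value: str) -> str:
    """Lowercase the type and collapse 'relation' into 'relationship'."""
    lowered = value.lower()
    return "relationship" if lowered == "relation" else lowered


def strip_control_chars_sketch(value: str) -> str:
    """Remove control characters such as a stray newline inside a URI."""
    return _CONTROL_CHARS.sub("", value)
```

Applied at validation time, `Fact` and `fact` would land in the same bucket, and a URI containing `\n` would be cleaned before it reaches the triple store.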
**New: periodic maintenance sweep** (`src/knowledge_service/maintenance/`)
- `normalize_knowledge_types`: lowercases existing `ks:knowledgeType` RDF-star annotations + Relation→relationship.
- `normalize_spacy_rdf_types`: remaps existing `schema:PERSON`/`schema:ORG` etc. to canonical, drops `schema:MONEY`/`schema:CARDINAL`/… rdf:type triples.
- Background asyncio task in lifespan, every 6h by default (`MAINTENANCE_INTERVAL_SECONDS=21600`, set to 0 to disable).
- Admin trigger: `POST /api/admin/maintenance/run` returns per-op stats.
- Idempotent: the first run is the historical backfill, subsequent runs catch any drift from new extraction bugs.

Out of scope here: the arxiv 0-yield problem, owned by the aegis worker. `worker/src/aegis_worker/activities/content.py:30` classifies `arxiv.org/abs/…` as `pdf` but sends the URL as-is. The /abs/ page is an HTML abstract (~500 chars); 154/200 jobs on 2026-05-25 extracted 0 triples because the abstract is too dense for triple extraction. Fix in aegis is to rewrite to `arxiv.org/pdf/…` (or fetch+send full text) so the actual paper text reaches knowledge-service. Filed as follow-up for Raphael.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
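The background sweep described above (run every `MAINTENANCE_INTERVAL_SECONDS`, `0` disables it) could look roughly like the loop below. This is a sketch under stated assumptions, not the service's code: `maintenance_loop`, `interval_from_env`, the `max_runs` parameter (added so the loop can be exercised in a test) and the log-and-continue handling of failures are all hypothetical.

```python
import asyncio
import os
from typing import Awaitable, Callable, Mapping, Optional


def interval_from_env(env: Mapping[str, str] = os.environ) -> int:
    """Read the sweep interval; 21600 seconds (6h) is the default named in the PR."""
    return int(env.get("MAINTENANCE_INTERVAL_SECONDS", "21600"))


async def maintenance_loop(
    run_once: Callable[[], Awaitable[dict]],
    interval_seconds: float,
    max_runs: Optional[int] = None,
) -> int:
    """Call run_once repeatedly, sleeping interval_seconds between runs."""
    if interval_seconds <= 0:
        return 0  # an interval of 0 disables the sweep
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            await run_once()
        except Exception as exc:  # assumption: one failed sweep must not stop the loop
            print(f"maintenance sweep failed: {exc!r}")
        runs += 1
        await asyncio.sleep(interval_seconds)
    return runs
```

In a lifespan handler, such a loop would typically be started with `asyncio.create_task(...)` on startup and cancelled on shutdown.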
arshadansari27 added a commit that referenced this pull request May 25, 2026
…ing, maintenance endpoint (#85)

PRs #83 (data-quality fixes + periodic janitor) and #84 (NamedNode-shape hotfix) changed three things the docs hadn't caught up to:

1. **`KS_GRAPH_FEDERATED` is gone.** Removed from README's named-graph list, the architecture doc's graph table, and the CLAUDE.md trust_tier description. The graph was empty in production for as long as the service has been deployed; the producer column was dropped in migration 017 (PR #82). What remains is 4 graphs: ontology / asserted / extracted / inferred.
2. **`knowledge_type` is no longer free-form.** `_normalise_knowledge_type` in `models.py` now lowercases at validation and collapses `Relation→relationship`. Updated the Status table in README, the Knowledge Types Reference in API.md, and every code example in both docs to show the lowercase canonical form (`claim`, `fact`, `event`, `entity`, `relationship`, `temporalfact`). Capitalised input is still accepted on the wire, and the prose in API.md says so explicitly.
3. **`POST /api/admin/maintenance/run` exists.** Added the endpoint to API.md (full section with request/response/curl), to the admin row in the endpoints table, and a Maintenance Service section to CLAUDE.md describing `normalize_knowledge_types` / `normalize_spacy_rdf_types`, the lifespan wiring, and failure policy. Added the `MAINTENANCE_INTERVAL_SECONDS` and `MAINTENANCE_INITIAL_DELAY_SECONDS` env vars to API.md's Configuration table and to `docs/deployment.md`.

Also documented:
- The contradictions endpoint's new same-chunk + identical-object filters (README + CLAUDE.md).
- The control-char sanitisation on subject/predicate/object that prevents `Invalid IRI code point '\n'` job failures (CLAUDE.md Models section).
- The NER fallback's URL-skip + numeric-label drop + schema.org canonical remap (CLAUDE.md NLP Phase section).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
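The NER fallback behaviour listed above (skip URL-shaped text, drop numeric labels, remap to schema.org canonical names) can be illustrated with a small filter. This is a hedged sketch: `canonical_rdf_type` is a hypothetical name, the URL test is an assumed heuristic, and the remap table contains only the two pairs quoted in the commit messages on this page (ORG→Organization, GPE→Place); the rest of the real table is not shown here.

```python
import re
from typing import Optional

# Numeric spaCy labels named in the commit message as dropped.
NUMERIC_LABELS = {"CARDINAL", "MONEY", "PERCENT", "QUANTITY", "DATE", "TIME", "ORDINAL"}

# Partial table: only the mappings quoted on this page. The full table is elided there.
CANONICAL_TYPES = {"ORG": "Organization", "GPE": "Place"}

# Assumption: "URL-shaped" means the text starts with a scheme or "www.".
_URL_SHAPED = re.compile(r"^(https?://|www\.)", re.IGNORECASE)


def canonical_rdf_type(text: str, spacy_label: str) -> Optional[str]:
    """Return a schema.org type for an NER hit, or None if the hit should be skipped."""
    if _URL_SHAPED.match(text.strip()):
        return None  # URL-shaped text is not emitted as an entity
    if spacy_label in NUMERIC_LABELS:
        return None  # numeric labels are dropped
    # Labels missing from this partial table pass through unchanged in the sketch.
    return f"schema:{CANONICAL_TYPES.get(spacy_label, spacy_label)}"
```

For example, `canonical_rdf_type("Acme Corp", "ORG")` yields `schema:Organization`, while a hit whose text is a URL or whose label is `CARDINAL` is skipped.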
Summary
Audit (2026-05-23) identified 6 columns and 4 indexes with no readers anywhere in the codebase. The writers were removed in PR #81; this migration drops the now-unused storage.
Ordering: this PR MUST land after #81 is on main and the new code is deployed. Otherwise the running prod app will try to write to columns that no longer exist (`entities_linked`, `entities_coref`, `chunks_skipped`, `entity_aliases.source`, `content_metadata.metadata`) and crash on every ingest.

Columns dropped
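- `ingestion_jobs.source_format`: never written (no JobTracker key, no UPDATE).
- `ingestion_jobs.entities_linked`: JobTracker wrote it after NLP; no reader. Write dropped in #81.
- `ingestion_jobs.entities_coref`: same shape; write dropped in #81.
- `ingestion_jobs.chunks_skipped`: ExtractPhase always returned 0; column was structurally always 0. Write + display dropped in #81.
- `entity_aliases.source`: hard-coded `"spacy_linking"`; no reader. Write dropped in #81.
- `content_metadata.metadata`: API accepted a JSONB metadata dict and ContentStore wrote it, but no SELECT path ever surfaced it back. API field + write dropped in #81.

Indexes dropped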
- `idx_provenance_confidence`: no `WHERE`/`ORDER BY` on `provenance.confidence` anywhere. Confidence filtering lives in pyoxigraph SPARQL annotations, not PG.
- `idx_provenance_source_type`: no `WHERE provenance.source_type =` in code.
- `idx_provenance_valid_range`: `valid_from`/`valid_until` are written and read via `SELECT *` but never appear in a `WHERE` predicate.
- `idx_entity_aliases_canonical`: no reverse "what aliases point at this canonical URI" lookup is implemented; only forward alias-PK lookup is used.

Test plan
- `uv run pytest tests/ --ignore=tests/e2e -q`: 700 passed (this branch is off `main`, so production code still references these columns; the test suite passes because tests don't apply migrations to a live DB)
- … `aegis_knowledge`, watch one ingestion job complete clean.

Followup considerations (not in this PR)
- `content_metadata.metadata` being dropped means we lose the historical JSONB blob in prod. Per the audit (and the user's "delete the orphan side" call), this is intentional: no reader was ever surfacing it. If a future feature wants per-source metadata, add a new column or table rather than reviving this one.

🤖 Generated with Claude Code
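Because the test suite does not apply migrations to a live DB, a post-migration check against the database itself may be a useful complement to the test plan. The sketch below only builds a query that would list any dropped column still present; it assumes PostgreSQL's `information_schema.columns` view and `%s`-style placeholders as used by psycopg-like drivers, and `build_leftover_columns_query` is a hypothetical helper, not part of this PR.

```python
# Hypothetical post-migration check: an empty result set from this query would
# indicate that every column listed in the PR has actually been dropped.
DROPPED_COLUMNS = [
    ("ingestion_jobs", "source_format"),
    ("ingestion_jobs", "entities_linked"),
    ("ingestion_jobs", "entities_coref"),
    ("ingestion_jobs", "chunks_skipped"),
    ("entity_aliases", "source"),
    ("content_metadata", "metadata"),
]


def build_leftover_columns_query(columns):
    """Return (sql, params) selecting any of the given (table, column) pairs that still exist."""
    placeholders = ", ".join(["(%s, %s)"] * len(columns))
    sql = (
        "SELECT table_name, column_name FROM information_schema.columns "
        f"WHERE (table_name, column_name) IN ({placeholders})"
    )
    # Flatten [(table, column), ...] into [table, column, table, column, ...].
    params = [part for pair in columns for part in pair]
    return sql, params
```

The query does not filter by schema; if the same table names exist in several schemas, a `table_schema` condition would need to be added.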