chore(schema): migration 018 — drop unused columns and indexes by arshadansari27 · Pull Request #82 · arshadansari27/knowledge-service

arshadansari27 · 2026-05-23T11:00:39Z

Summary

Audit (2026-05-23) identified 6 columns and 4 indexes with no readers anywhere in the codebase. The writers were removed in PR #81; this migration drops the now-unused storage.

Ordering: this PR MUST land after #81 is on main and the new code is deployed. Otherwise the running prod app will try to write to columns that no longer exist (entities_linked, entities_coref, chunks_skipped, entity_aliases.source, content_metadata.metadata) and crash on every ingest.

Columns dropped

| Column | Why dropped |
|---|---|
| `ingestion_jobs.source_format` | Never written: no JobTracker key, no UPDATE. |
| `ingestion_jobs.entities_linked` | JobTracker wrote it after NLP; no reader. Write dropped in #81. |
| `ingestion_jobs.entities_coref` | Same; write dropped in #81. |
| `ingestion_jobs.chunks_skipped` | ExtractPhase always returned 0; structurally always 0. Write + admin display dropped in #81. |
| `entity_aliases.source` | Hard-coded `"spacy_linking"`; no reader. Write dropped in #81. |
| `content_metadata.metadata` | API accepted JSONB, ContentStore wrote it; no SELECT path ever read it back. API field + write dropped in #81. |

Indexes dropped

| Index | Why dropped |
|---|---|
| `idx_provenance_confidence` | No `WHERE`/`ORDER BY` on `provenance.confidence` anywhere. Confidence filtering lives in pyoxigraph SPARQL annotations, not PG. |
| `idx_provenance_source_type` | No `WHERE provenance.source_type =` in code. |
| `idx_provenance_valid_range` | `valid_from`/`valid_until` are written and read via `SELECT *` but never appear in a `WHERE` predicate. |
| `idx_entity_aliases_canonical` | No reverse-lookup ("what aliases point at this canonical?") query exists; only forward alias-PK lookup is used. |
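
The two lists above translate fairly directly into DDL. The migration file itself is not shown on this page, so the following is only a sketch of what such a migration might emit: the helper name `build_statements`, the use of `IF EXISTS`, and the choice to drop indexes before columns are assumptions, not the PR's actual contents.

```python
# Hypothetical sketch: generates PostgreSQL DDL for the objects listed in
# this PR. The real migration 018 file may be written differently.

DROPPED_COLUMNS = [
    ("ingestion_jobs", "source_format"),
    ("ingestion_jobs", "entities_linked"),
    ("ingestion_jobs", "entities_coref"),
    ("ingestion_jobs", "chunks_skipped"),
    ("entity_aliases", "source"),
    ("content_metadata", "metadata"),
]

DROPPED_INDEXES = [
    "idx_provenance_confidence",
    "idx_provenance_source_type",
    "idx_provenance_valid_range",
    "idx_entity_aliases_canonical",
]


def build_statements() -> list[str]:
    """Return one DDL statement per dropped object.

    IF EXISTS (an assumption here) makes a repeated run a no-op
    instead of an error.
    """
    statements = [f"DROP INDEX IF EXISTS {name};" for name in DROPPED_INDEXES]
    statements += [
        f"ALTER TABLE {table} DROP COLUMN IF EXISTS {column};"
        for table, column in DROPPED_COLUMNS
    ]
    return statements
```

Printing `"\n".join(build_statements())` gives ten statements, four `DROP INDEX` followed by six `ALTER TABLE ... DROP COLUMN`, which can be compared against the tables above before anything is run.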

Test plan

- `uv run pytest tests/ --ignore=tests/e2e -q`: 700 passed. This branch is off main, so production code still references these columns; the test suite passes because tests don't apply migrations to a live DB.
- Hold this PR until #81 (cleanup: 4 latent bug fixes + hasher consolidation + dead-code purge) is on main and deployed. The migration runner applies 018 at startup on the next service restart; if the deployed binary predates #81, the next ingest fails.
- After #81 ships and prod is on the new image: merge this, restart aegis_knowledge, and watch one ingestion job complete cleanly.

Followup considerations (not in this PR)

content_metadata.metadata being dropped means we lose the historical JSONB blob in prod. Per the audit (and the user's "delete the orphan side" call), this is intentional — no reader was ever surfacing it. If a future feature wants per-source metadata, add a new column or table rather than reviving this one.

🤖 Generated with Claude Code

Audit (2026-05-23) identified 6 columns and 4 indexes with no readers in code. The writers for the columns were removed in PR #81; this migration must NOT land before #81 is on main, otherwise prod will try to write to columns that no longer exist.

Columns dropped:
- ``ingestion_jobs.source_format``: never written (no JobTracker key, no UPDATE).
- ``ingestion_jobs.entities_linked``: JobTracker wrote it after NLP; no reader. Write dropped in #81.
- ``ingestion_jobs.entities_coref``: same shape; write dropped in #81.
- ``ingestion_jobs.chunks_skipped``: ExtractPhase always returned 0; column was structurally always 0. Write + display dropped in #81.
- ``entity_aliases.source``: hard-coded ``"spacy_linking"`` by CoreferencePhase; no reader. Write dropped in #81.
- ``content_metadata.metadata``: API accepted a JSONB metadata dict and ContentStore wrote it, but no SELECT path ever surfaced it back. API field + write dropped in #81.

Indexes dropped:
- ``idx_provenance_confidence``: no ``WHERE``/``ORDER BY`` on ``provenance.confidence`` in any code path. Confidence filtering lives in pyoxigraph SPARQL annotations.
- ``idx_provenance_source_type``: no ``WHERE provenance.source_type =`` in any query path.
- ``idx_provenance_valid_range``: ``valid_from`` / ``valid_until`` are read via ``SELECT *`` but never appear in a ``WHERE`` predicate (the temporal-validity filter is in SPARQL).
- ``idx_entity_aliases_canonical``: no reverse "what aliases point at this canonical URI" lookup is implemented; only forward alias-PK lookup is used.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ntradiction noise + periodic janitor (#83)

Production audit at 2026-05-26 (#82 + this) showed three persistent drift patterns the existing fixes hadn't closed:

1. **knowledge_type bifurcation**: same logical type stored as both `Fact (43,768)` and `fact (36)` (and the same for Claim/Event/Relationship, plus a separate `Relation (126)` that overlaps Relationship). Cause: `TripleInput` preserved the LLM's casing; `_normalise_knowledge_type` now lowercases and collapses `Relation→relationship` at validation time.
2. **spaCy NER pollution**: 11,752 entities (8% of the graph) had a URL as both subject and `rdfs:label`, and used UPPERCASE spaCy labels (`schema:PERSON 3,124`, `schema:ORG 3,746`, `schema:GPE 696`, `schema:WORK_OF_ART 455`, …) as `rdf:type`. `_emit_ner_missed` now skips URL-shaped text + numeric labels (CARDINAL, MONEY, PERCENT, QUANTITY, DATE, TIME, ORDINAL) and remaps the remainder to schema.org canonical names (ORG→Organization, GPE→Place, …).
3. **Contradiction false positives**: 6 of 8 production contradictions came from the same chunk_id (extraction conflation: LLM emits two distinct numbers from one paragraph under one subject URI). One contradiction had identical `object_a == object_b`. The endpoint now filters both.

Other fixes:
- **Invalid IRI code point '\n'** (1 failed job 2026-05-25): strip control chars in `TripleInput`/`EventInput`/`EntityInput` so a `\n` in a URI doesn't sink the whole job at pyoxigraph insert time.
- **Admin /knowledge/triples filter** was case-sensitive (`Entity` allowed, `entity` rejected) but storage uses lowercase, so it silently returned 0 rows for the most common filter. Now case-insensitive + `inferred` accepted.
- **`KS_GRAPH_FEDERATED` dead refs** dropped from `rag.py` (no producer since federation column was removed in migration 017; named graph has always been empty in production) and from `admin/content.py` sweep.
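
The NER fallback rule in item 2 (skip URL-shaped text, drop numeric labels, remap the rest to schema.org names) can be illustrated roughly as follows. This is a sketch under assumptions, not the code in `_emit_ner_missed`: the function name, the URL regex, and the decision to skip labels missing from the map are all hypothetical, and the map contains only the two pairs the commit message spells out because the rest of the mapping is elided there.

```python
import re
from typing import Optional

# Numeric spaCy labels named in the commit message as dropped.
NUMERIC_LABELS = frozenset(
    {"CARDINAL", "MONEY", "PERCENT", "QUANTITY", "DATE", "TIME", "ORDINAL"}
)

# Only the two remappings stated in the commit message; the full map is
# elided in the source, so this dict is deliberately partial.
CANONICAL_TYPES = {"ORG": "Organization", "GPE": "Place"}

# Assumption: "URL-shaped" means the whole span looks like a web address.
_URL_SHAPED = re.compile(r"^(?:https?://|www\.)\S+$", re.IGNORECASE)


def canonical_ner_type(text: str, label: str) -> Optional[str]:
    """Return a schema.org type for an NER hit, or None to skip it."""
    if _URL_SHAPED.match(text.strip()):
        return None  # a URL should not become an entity label
    if label in NUMERIC_LABELS:
        return None  # numbers, dates and amounts are not typed entities
    mapped = CANONICAL_TYPES.get(label)
    # Assumption: labels absent from the partial map are skipped here.
    return f"schema:{mapped}" if mapped else None
```

For example, `canonical_ner_type("Acme Corp", "ORG")` yields `schema:Organization`, while a span such as `https://example.com/a` is skipped regardless of its label.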
**New: periodic maintenance sweep** (`src/knowledge_service/maintenance/`)
- `normalize_knowledge_types`: lowercases existing `ks:knowledgeType` RDF-star annotations + Relation→relationship.
- `normalize_spacy_rdf_types`: remaps existing `schema:PERSON`/`schema:ORG` etc. to canonical, drops `schema:MONEY`/`schema:CARDINAL`/… rdf:type triples.
- Background asyncio task in lifespan, every 6h by default (`MAINTENANCE_INTERVAL_SECONDS=21600`, set to 0 to disable).
- Admin trigger: `POST /api/admin/maintenance/run` returns per-op stats.
- Idempotent: first run is the historical backfill, subsequent runs catch any drift from new extraction bugs.

Out of scope here (arxiv 0-yield problem, owned by aegis worker): `worker/src/aegis_worker/activities/content.py:30` classifies `arxiv.org/abs/…` as `pdf` but sends the URL as-is. The /abs/ page is HTML abstract (~500 chars); 154/200 jobs on 2026-05-25 extracted 0 triples because the abstract is too dense for triple extraction. Fix in aegis is to rewrite to `arxiv.org/pdf/…` (or fetch+send full text) so the actual paper text reaches knowledge-service. Filed as follow-up for Raphael.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
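
The background sweep described above (an asyncio task that runs every `MAINTENANCE_INTERVAL_SECONDS`, with 0 meaning disabled) might be wired roughly like this. The function names, the run-then-sleep ordering, and the log-and-continue handling of a failed sweep are assumptions made for illustration; the service's actual lifespan code is not shown on this page.

```python
import asyncio
import os
from typing import Awaitable, Callable, Optional


def interval_from_env() -> float:
    """Read the interval; 21600 s (6 h) is the default stated in the commit."""
    return float(os.environ.get("MAINTENANCE_INTERVAL_SECONDS", "21600"))


async def maintenance_loop(
    run_once: Callable[[], Awaitable[dict]], interval_seconds: float
) -> None:
    """Run the sweep, then sleep, forever (until the task is cancelled)."""
    while True:
        try:
            await run_once()
        except Exception as exc:
            # Assumed policy: a failed sweep is logged and retried next tick.
            print(f"maintenance sweep failed: {exc!r}")
        await asyncio.sleep(interval_seconds)


def start_maintenance(
    run_once: Callable[[], Awaitable[dict]], interval_seconds: float
) -> Optional[asyncio.Task]:
    """Start the loop as a background task; an interval of 0 disables it.

    Must be called from within a running event loop (e.g. app startup).
    """
    if interval_seconds <= 0:
        return None
    return asyncio.create_task(maintenance_loop(run_once, interval_seconds))
```

On shutdown the returned task would be cancelled and awaited. Because the sweep is described as idempotent, re-running it after a restart should be harmless.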

…ing, maintenance endpoint (#85)

PRs #83 (data-quality fixes + periodic janitor) and #84 (NamedNode-shape hotfix) changed three things the docs hadn't caught up to:

1. **`KS_GRAPH_FEDERATED` is gone.** Removed from README's named-graph list, the architecture doc's graph table, and the CLAUDE.md trust_tier description. The graph was empty in production for as long as the service has been deployed; the producer column was dropped in migration 017 (PR #82). What remains is 4 graphs: ontology / asserted / extracted / inferred.
2. **`knowledge_type` is no longer free-form.** `_normalise_knowledge_type` in `models.py` now lowercases at validation and collapses `Relation→relationship`. Updated the Status table in README, the Knowledge Types Reference in API.md, and every code example in both docs to show the lowercase canonical form (`claim`, `fact`, `event`, `entity`, `relationship`, `temporalfact`). Capitalised input is still accepted on the wire; the prose in API.md says so explicitly.
3. **`POST /api/admin/maintenance/run` exists.** Added the endpoint to API.md (full section with request/response/curl), to the admin row in the endpoints table, and a Maintenance Service section to CLAUDE.md describing `normalize_knowledge_types` / `normalize_spacy_rdf_types`, the lifespan wiring, and failure policy. Added the `MAINTENANCE_INTERVAL_SECONDS` and `MAINTENANCE_INITIAL_DELAY_SECONDS` env vars to API.md's Configuration table and to `docs/deployment.md`.

Also documented:
- The contradictions endpoint's new same-chunk + identical-object filters (README + CLAUDE.md).
- The control-char sanitisation on subject/predicate/object that prevents `Invalid IRI code point '\n'` job failures (CLAUDE.md Models section).
- The NER fallback's URL-skip + numeric-label drop + schema.org canonical remap (CLAUDE.md NLP Phase section).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
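
The `knowledge_type` normalisation described in item 2 is small enough to sketch. The version below is illustrative only: it follows the two behaviours stated in the commit message (lowercase, then collapse `relation` into `relationship`), but the function name is hypothetical and whether the real `_normalise_knowledge_type` rejects unknown values is not stated, so this sketch simply passes them through.

```python
# Canonical lowercase forms listed in the commit message.
CANONICAL_KNOWLEDGE_TYPES = frozenset(
    {"claim", "fact", "event", "entity", "relationship", "temporalfact"}
)

# The one alias the commit message names: Relation collapses to relationship.
_ALIASES = {"relation": "relationship"}


def normalise_knowledge_type_sketch(value: str) -> str:
    """Lowercase the input, then apply the alias map.

    Assumption: unknown values pass through unchanged; the real
    validator may behave differently.
    """
    lowered = value.lower()
    return _ALIASES.get(lowered, lowered)
```

Under this sketch capitalised wire input such as `Fact` or `Relation` comes out as `fact` and `relationship`, and every canonical form maps to itself, which is what makes a repeated backfill safe.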

arshadansari27 merged commit 1dfadb2 into main May 23, 2026
5 checks passed

arshadansari27 deleted the cleanup/migration-018 branch May 23, 2026 12:07

arshadansari27 mentioned this pull request May 25, 2026

docs: post-PR-83/84 drift sweep #85

Merged

3 tasks

arshadansari27 merged 1 commit into main from cleanup/migration-018
