chore(schema): migration 018 — drop unused columns and indexes#82

Merged: arshadansari27 merged 1 commit into main from cleanup/migration-018 on May 23, 2026

@arshadansari27
Owner

Summary

Audit (2026-05-23) identified 6 columns and 4 indexes with no readers anywhere in the codebase. The writers were removed in PR #81; this migration drops the now-unused storage.

Ordering: this PR MUST land after #81 is on main and the new code is deployed. Otherwise the running prod app will try to write to columns that no longer exist (entities_linked, entities_coref, chunks_skipped, entity_aliases.source, content_metadata.metadata) and crash on every ingest.

Columns dropped

| Object | Why |
| --- | --- |
| `ingestion_jobs.source_format` | Never written: no JobTracker key, no UPDATE. |
| `ingestion_jobs.entities_linked` | JobTracker wrote it after NLP; no reader. Write dropped in #81. |
| `ingestion_jobs.entities_coref` | Same as `entities_linked`; write dropped in #81. |
| `ingestion_jobs.chunks_skipped` | ExtractPhase always returned 0; structurally always 0. Write + admin display dropped in #81. |
| `entity_aliases.source` | Hard-coded "spacy_linking"; no reader. Write dropped in #81. |
| `content_metadata.metadata` | API accepted JSONB, ContentStore wrote it; no SELECT path ever read it back. API field + write dropped in #81. |

Indexes dropped

| Object | Why |
| --- | --- |
| `idx_provenance_confidence` | No WHERE/ORDER BY on `provenance.confidence` anywhere. Confidence filtering lives in pyoxigraph SPARQL annotations, not PG. |
| `idx_provenance_source_type` | No `WHERE provenance.source_type =` in code. |
| `idx_provenance_valid_range` | `valid_from`/`valid_until` are written and read via `SELECT *` but never appear in a WHERE predicate. |
| `idx_entity_aliases_canonical` | No reverse-lookup ("what aliases point at this canonical?") query exists; only forward alias-PK lookup is used. |
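
The migration file itself is not shown in this description. As a rough sketch of the DDL the two tables above imply, assuming PostgreSQL and `IF EXISTS` guards so a re-run is harmless, the statements could be rendered like this. The helper name `build_migration_018` is hypothetical; only the table, column, and index names come from the tables.

```python
# Hypothetical helper: renders the DROP statements implied by the tables above.
# Object names come from the PR description; everything else is an assumption.
DROPPED_COLUMNS = [
    ("ingestion_jobs", "source_format"),
    ("ingestion_jobs", "entities_linked"),
    ("ingestion_jobs", "entities_coref"),
    ("ingestion_jobs", "chunks_skipped"),
    ("entity_aliases", "source"),
    ("content_metadata", "metadata"),
]

DROPPED_INDEXES = [
    "idx_provenance_confidence",
    "idx_provenance_source_type",
    "idx_provenance_valid_range",
    "idx_entity_aliases_canonical",
]


def build_migration_018() -> list[str]:
    """Return one SQL statement per dropped column, then one per dropped index."""
    statements = [
        f"ALTER TABLE {table} DROP COLUMN IF EXISTS {column};"
        for table, column in DROPPED_COLUMNS
    ]
    statements += [f"DROP INDEX IF EXISTS {index};" for index in DROPPED_INDEXES]
    return statements


if __name__ == "__main__":
    print("\n".join(build_migration_018()))
```

Running the module prints ten statements, six for columns and four for indexes, which can be compared against the real migration before it is applied.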

Test plan

Followup considerations (not in this PR)

  • content_metadata.metadata being dropped means we lose the historical JSONB blob in prod. Per the audit (and the user's "delete the orphan side" call), this is intentional — no reader was ever surfacing it. If a future feature wants per-source metadata, add a new column or table rather than reviving this one.

🤖 Generated with Claude Code

Audit (2026-05-23) identified 6 columns and 4 indexes with no readers
in code. The writers for the columns were removed in PR #81 — this
migration must NOT land before #81 is on main, otherwise prod will try
to write to columns that no longer exist.

Columns dropped:

- ``ingestion_jobs.source_format`` — never written (no JobTracker key,
  no UPDATE).
- ``ingestion_jobs.entities_linked`` — JobTracker wrote it after NLP;
  no reader. Write dropped in #81.
- ``ingestion_jobs.entities_coref`` — same shape; write dropped in #81.
- ``ingestion_jobs.chunks_skipped`` — ExtractPhase always returned 0;
  column was structurally always 0. Write + display dropped in #81.
- ``entity_aliases.source`` — hard-coded ``"spacy_linking"`` by
  CoreferencePhase; no reader. Write dropped in #81.
- ``content_metadata.metadata`` — API accepted a JSONB metadata dict
  and ContentStore wrote it, but no SELECT path ever surfaced it back.
  API field + write dropped in #81.

Indexes dropped:

- ``idx_provenance_confidence`` — no ``WHERE``/``ORDER BY`` on
  ``provenance.confidence`` in any code path. Confidence filtering
  lives in pyoxigraph SPARQL annotations.
- ``idx_provenance_source_type`` — no ``WHERE provenance.source_type =``
  in any query path.
- ``idx_provenance_valid_range`` — ``valid_from`` / ``valid_until`` are
  read via ``SELECT *`` but never appear in a ``WHERE`` predicate (the
  temporal-validity filter is in SPARQL).
- ``idx_entity_aliases_canonical`` — no reverse "what aliases point at
  this canonical URI" lookup is implemented; only forward alias-PK
  lookup is used.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@arshadansari27 arshadansari27 merged commit 1dfadb2 into main May 23, 2026
5 checks passed
@arshadansari27 arshadansari27 deleted the cleanup/migration-018 branch May 23, 2026 12:07
arshadansari27 added a commit that referenced this pull request May 25, 2026
…ntradiction noise + periodic janitor (#83)

Production audit at 2026-05-26 (#82 + this) showed three persistent drift
patterns the existing fixes hadn't closed:

1. **knowledge_type bifurcation** — same logical type stored as both
   `Fact (43,768)` and `fact (36)` (and the same for Claim/Event/Relationship,
   plus a separate `Relation (126)` that overlaps Relationship). Cause:
   `TripleInput` preserved the LLM's casing; `_normalise_knowledge_type`
   now lowercases and collapses `Relation→relationship` at validation time.

2. **spaCy NER pollution** — 11,752 entities (8% of the graph) had a URL
   as both subject and `rdfs:label`, and used UPPERCASE spaCy labels
   (`schema:PERSON 3,124`, `schema:ORG 3,746`, `schema:GPE 696`,
   `schema:WORK_OF_ART 455`, …) as `rdf:type`. `_emit_ner_missed` now
   skips URL-shaped text + numeric labels (CARDINAL, MONEY, PERCENT,
   QUANTITY, DATE, TIME, ORDINAL) and remaps the remainder to schema.org
   canonical names (ORG→Organization, GPE→Place, …).

3. **Contradiction false positives** — 6 of 8 production contradictions
   came from the same chunk_id (extraction conflation: LLM emits two
   distinct numbers from one paragraph under one subject URI). One
   contradiction had identical `object_a == object_b`. The endpoint now
   filters both.
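
The NER filtering in item 2 could look roughly like the sketch below. The function name `remap_ner_label` and the URL regex are assumptions, and the mapping holds only the two entries the message names (ORG and GPE), since the rest are elided there. This is not the body of `_emit_ner_missed`.

```python
import re
from typing import Optional

# Labels the commit message says are skipped outright.
NUMERIC_LABELS = {"CARDINAL", "MONEY", "PERCENT", "QUANTITY", "DATE", "TIME", "ORDINAL"}

# Only the two mappings named in the commit message; the others are elided there.
CANONICAL_TYPES = {"ORG": "Organization", "GPE": "Place"}

# Assumption: "URL-shaped" means the text starts with a scheme or "www.".
_URL_RE = re.compile(r"^(?:[a-z][a-z0-9+.-]*://|www\.)", re.IGNORECASE)


def remap_ner_label(text: str, label: str) -> Optional[str]:
    """Return the schema.org type name for a spaCy entity, or None to skip it."""
    if _URL_RE.match(text.strip()):
        return None  # URL-shaped text is not emitted as an entity
    if label in NUMERIC_LABELS:
        return None  # numeric and temporal labels are dropped
    # Assumption: labels without a mapping pass through unchanged in this sketch.
    return CANONICAL_TYPES.get(label, label)
```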

Other fixes:
- **Invalid IRI code point '\n'** (1 failed job 2026-05-25): strip control
  chars in `TripleInput`/`EventInput`/`EntityInput` so a `\n` in a URI
  doesn't sink the whole job at pyoxigraph insert time.
- **Admin /knowledge/triples filter** was case-sensitive (`Entity` allowed,
  `entity` rejected) but storage uses lowercase — silently returned 0 rows
  for the most common filter. Now case-insensitive + `inferred` accepted.
- **`KS_GRAPH_FEDERATED` dead refs** dropped from `rag.py` (no producer
  since federation column was removed in migration 017; named graph has
  always been empty in production) and from `admin/content.py` sweep.
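
The control-character fix can be sketched as a plain string sanitiser. The name `strip_control_chars` and the exact character ranges are assumptions; the message only says control chars are stripped in the input models.

```python
import re

# C0 controls (including \n, \r, \t), DEL, and C1 controls.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f-\x9f]")


def strip_control_chars(value: str) -> str:
    """Remove control characters so the value is safe to use inside an IRI."""
    return _CONTROL_CHARS.sub("", value)
```

A URI such as `"http://example.org/a\nb"` would come out as `"http://example.org/ab"`, so one stray newline no longer fails the whole job at insert time.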

**New: periodic maintenance sweep** (`src/knowledge_service/maintenance/`)
- `normalize_knowledge_types`: lowercases existing `ks:knowledgeType`
  RDF-star annotations + Relation→relationship.
- `normalize_spacy_rdf_types`: remaps existing `schema:PERSON`/`schema:ORG`
  etc. to canonical, drops `schema:MONEY`/`schema:CARDINAL`/… rdf:type
  triples.
- Background asyncio task in lifespan, every 6h by default
  (`MAINTENANCE_INTERVAL_SECONDS=21600`, set to 0 to disable).
- Admin trigger: `POST /api/admin/maintenance/run` returns per-op stats.
- Idempotent — first run is the historical backfill, subsequent runs
  catch any drift from new extraction bugs.
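
A minimal sketch of the background loop described above, assuming a sweep coroutine is passed in. The name `maintenance_loop`, the `max_runs` parameter (added only so the loop can be exercised in a test), and the swallow-and-continue error handling are assumptions, not the code in `src/knowledge_service/maintenance/`.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def maintenance_loop(
    run_sweep: Callable[[], Awaitable[dict]],
    interval_seconds: float,
    max_runs: Optional[int] = None,
) -> int:
    """Run `run_sweep` repeatedly; an interval of 0 (or less) disables the loop."""
    if interval_seconds <= 0:
        return 0
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            await run_sweep()
        except Exception as exc:
            # Assumption: a failed sweep is reported and the loop keeps going.
            print(f"maintenance sweep failed: {exc!r}")
        runs += 1
        await asyncio.sleep(interval_seconds)
    return runs
```

In an application lifespan this would typically be started with `asyncio.create_task(maintenance_loop(sweep, 21600))` and cancelled at shutdown. Because the operations are described as idempotent, repeated runs are safe.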

Out of scope here — arxiv 0-yield problem owned by aegis worker:
`worker/src/aegis_worker/activities/content.py:30` classifies
`arxiv.org/abs/…` as `pdf` but sends the URL as-is. The /abs/ page is an HTML
abstract (~500 chars); 154/200 jobs on 2026-05-25 extracted 0 triples
because the abstract is too dense for triple extraction. The fix in aegis is
to rewrite to `arxiv.org/pdf/…` (or fetch+send full text) so the actual
paper text reaches knowledge-service. Filed as follow-up for Raphael.
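
The URL rewrite proposed for aegis might look like the sketch below. The function name `arxiv_abs_to_pdf` is hypothetical and nothing here is taken from the aegis worker; it only swaps the `/abs/` path prefix for `/pdf/` on arxiv.org hosts.

```python
from urllib.parse import urlsplit, urlunsplit


def arxiv_abs_to_pdf(url: str) -> str:
    """Rewrite an arXiv abstract URL to the matching PDF URL; leave others alone."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host in {"arxiv.org", "www.arxiv.org"} and parts.path.startswith("/abs/"):
        new_path = "/pdf/" + parts.path[len("/abs/"):]
        return urlunsplit(
            (parts.scheme, parts.netloc, new_path, parts.query, parts.fragment)
        )
    return url
```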

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
arshadansari27 added a commit that referenced this pull request May 25, 2026
…ing, maintenance endpoint (#85)

PRs #83 (data-quality fixes + periodic janitor) and #84 (NamedNode-shape
hotfix) changed three things the docs hadn't caught up to:

1. **`KS_GRAPH_FEDERATED` is gone.** Removed from README's named-graph
   list, the architecture doc's graph table, and the CLAUDE.md trust_tier
   description. The graph was empty in production for as long as the
   service has been deployed; the producer column was dropped in
   migration 017 (PR #82). What remains is 4 graphs: ontology / asserted
   / extracted / inferred.

2. **`knowledge_type` is no longer free-form.** `_normalise_knowledge_type`
   in `models.py` now lowercases at validation and collapses `Relation→
   relationship`. Updated the Status table in README, the Knowledge Types
   Reference in API.md, and every code example in both docs to show the
   lowercase canonical form (`claim`, `fact`, `event`, `entity`,
   `relationship`, `temporalfact`). Capitalised input is still accepted on
   the wire — the prose in API.md says so explicitly.

3. **`POST /api/admin/maintenance/run` exists.** Added the endpoint to
   API.md (full section with request/response/curl), to the admin row in
   the endpoints table, and a Maintenance Service section to CLAUDE.md
   describing `normalize_knowledge_types` / `normalize_spacy_rdf_types`,
   the lifespan wiring, and failure policy. Added the
   `MAINTENANCE_INTERVAL_SECONDS` and `MAINTENANCE_INITIAL_DELAY_SECONDS`
   env vars to API.md's Configuration table and to `docs/deployment.md`.
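
The normalisation in item 2 can be sketched in a few lines. This is an illustration of the described behaviour (lowercase, then collapse `relation` into `relationship`), not the actual body of `_normalise_knowledge_type` in `models.py`, which is not shown here.

```python
def normalise_knowledge_type(value: str) -> str:
    """Lowercase a knowledge type and fold 'relation' into 'relationship'."""
    lowered = value.lower()
    # "Relation" overlaps "Relationship", so both map to one canonical value.
    return "relationship" if lowered == "relation" else lowered
```

Capitalised input such as `"Fact"` or `"TemporalFact"` is still accepted and stored as `"fact"` and `"temporalfact"`.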

Also documented:
- The contradictions endpoint's new same-chunk + identical-object filters
  (README + CLAUDE.md).
- The control-char sanitisation on subject/predicate/object that
  prevents `Invalid IRI code point '\n'` job failures (CLAUDE.md Models
  section).
- The NER fallback's URL-skip + numeric-label drop + schema.org canonical
  remap (CLAUDE.md NLP Phase section).
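
The two contradiction filters can be sketched as follows. `object_a` and `object_b` are named in the earlier commit message; the `Contradiction` dataclass, the `chunk_id_a`/`chunk_id_b` fields, and `filter_contradictions` are hypothetical stand-ins for whatever the endpoint actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contradiction:
    subject: str
    object_a: str
    object_b: str
    chunk_id_a: str  # assumption: each side records the chunk it came from
    chunk_id_b: str


def filter_contradictions(items: list[Contradiction]) -> list[Contradiction]:
    """Drop pairs with identical objects or with both sides from the same chunk."""
    return [
        c
        for c in items
        if c.object_a != c.object_b and c.chunk_id_a != c.chunk_id_b
    ]
```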

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>