
feat(db): schema v3-v4 and data access layer for knowledge pipeline #32

Merged
gordonkjlee merged 10 commits into main from feat/dikw-data-access
Apr 7, 2026


Owner

@gordonkjlee gordonkjlee commented Apr 6, 2026

Infrastructure Change

Summary

Schema migrations v3-v4 and synchronous data access layer for the DIKW knowledge pipeline. Creates all database tables (session_facts, facts with FTS5, entities, graph edges, domains, consolidation lock) and their CRUD functions. Second of 5 PRs.

Stakeholder

N/A

Links

Ticket: N/A
Requirements: N/A


Description

Schema v3 creates the Information layer and supporting tables:

  • session_facts — captured/extracted facts awaiting consolidation, with content_hash (SHA-256, UNIQUE per session for dedup), source_origin ('explicit'/'inferred' for hybrid capture), consolidation_id (claimed by which run). Partial index on consolidation_id IS NULL for fast unclaimed-fact queries.
  • session_fact_sources — provenance junction table linking facts to multiple source events with relevance and extraction_type.
  • domains — domain registry (data, not code). Seeded from config or created at runtime.
  • consolidation_lock — single-row advisory lock (CHECK(id = 1)) preventing concurrent consolidation. 2-minute stale threshold for crash recovery.
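The per-session hash dedup described for session_facts can be sketched without a database. This is a minimal in-memory illustration, not the PR's code: `contentHash`, `SessionFactStore` and the `session_id` key name are hypothetical, and whether the real implementation normalises content before hashing is not stated here.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: SHA-256 over the raw fact text, hex-encoded.
// Assumption: no whitespace or case normalisation before hashing.
function contentHash(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// In-memory stand-in for a UNIQUE(session_id, content_hash) constraint.
class SessionFactStore {
  private seen = new Set<string>();

  // Mirrors INSERT OR IGNORE: true if the fact was stored,
  // false if the same content was already captured in this session.
  insert(sessionId: string, content: string): boolean {
    const key = `${sessionId}:${contentHash(content)}`;
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}

const sessionFacts = new SessionFactStore();
const stored = sessionFacts.insert("s1", "prefers dark mode");
const repeated = sessionFacts.insert("s1", "prefers dark mode");
const otherSession = sessionFacts.insert("s2", "prefers dark mode");
console.log(stored, repeated, otherSession); // true false true
```

The same content is rejected within one session but accepted in another, which matches the "intra-session dedup rejection, cross-session allowed" behaviour the tests describe.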

Schema v4 creates the Knowledge layer:

  • facts + facts_fts (FTS5 virtual table) + INSERT and DELETE sync triggers. Graduated, entity-linked, deduplicated facts. is_latest flag for fast current-state queries. FTS5 UPDATE trigger intentionally omitted: fact content is immutable (never modified, only superseded).
  • entities — graph nodes with canonical_name NOT NULL and UNIQUE(canonical_name, type) constraint for dedup. access_count/last_accessed_at for activation tracking (inspired by ACT-R frequency + recency signals).
  • fact_entities — junction table linking facts to entities with relationship type.
  • entity_edges — entity-to-entity relationships with saturating potentiation for strength (spreading activation, Collins & Loftus, 1975). Strength follows new = old + (1 - old) × α (α=0.3), inspired by LTP saturation: early co-occurrences cause large jumps, later ones diminish. Monotonically increasing by construction, with no practical floating-point precision concern. EDGE_POTENTIATION_ALPHA is a named constant for future parametric feedback.
  • sources — provenance records.
  • consolidations — run records with stats (facts_in, graduated, rejected, entities_created, supersessions).
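To make the edge-strength rule concrete, here is a small sketch of new = old + (1 - old) × α applied repeatedly. The `potentiate` function name and the starting strength of 0 are assumptions for illustration; only the formula and the constant's name and value come from the description above.

```typescript
// Value taken from the PR description (α = 0.3).
const EDGE_POTENTIATION_ALPHA = 0.3;

// Hypothetical helper: one co-occurrence moves strength a fixed
// fraction of the remaining distance towards 1.
function potentiate(old: number, alpha: number = EDGE_POTENTIATION_ALPHA): number {
  return old + (1 - old) * alpha;
}

// Assumption: a brand-new edge starts from strength 0.
let strength = 0;
const history: string[] = [];
for (let i = 0; i < 5; i++) {
  strength = potentiate(strength);
  history.push(strength.toFixed(4));
}
console.log(history.join(" ")); // 0.3000 0.5100 0.6570 0.7599 0.8319
```

Under that starting assumption the strength after n co-occurrences is 1 - (1 - α)^n, so each step adds less than the previous one and the value approaches 1 without exceeding it.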

Data access modules (all synchronous, better-sqlite3):

  • session-facts.ts: insertSessionFact (INSERT OR IGNORE for hash dedup), getSessionFacts, getUnconsolidatedFacts, claimForConsolidation (atomic claim; caller must hold consolidation lock), linkFactSource, getFactSources
  • facts.ts: insertFact (validates INSERT succeeded), getFact, getFactsByDomain, getFactsByEntity, supersedeFact (transaction: mark old + insert new; throws if old fact not found), keywordSearch (FTS5 BM25; throws on malformed FTS5 syntax, so callers should sanitise or catch), incrementFactAccess
  • entities.ts: findEntity (canonical name match; non-deterministic without type filter), findEntityByCanonical (exact match, no normalisation), findOrCreateEntity (transaction-wrapped, safe with UNIQUE constraint), createEntity, linkFactEntity, upsertEntityEdge (saturating potentiation with EDGE_POTENTIATION_ALPHA), getEntityEdges, updateEntityAccess
  • domains.ts: getDomains, createDomain, ensureDomain (idempotent)
  • consolidation-lock.ts: acquireLock (with 2-minute stale detection and takeover), releaseLock (returns boolean), getLockState
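The advisory lock's behaviour (acquire, release returning a boolean, takeover after the 2-minute stale threshold) can be modelled in memory as below. This is a sketch under assumptions: the class, field names, the explicit `now` parameter and the strict `>` comparison are all hypothetical. The real module works against the single-row consolidation_lock table.

```typescript
const STALE_THRESHOLD_MS = 2 * 60 * 1000; // 2-minute threshold from the PR

interface LockState {
  holder: string;
  startedAt: number; // epoch milliseconds
}

// In-memory stand-in for the single-row lock table.
class ConsolidationLock {
  private state: LockState | null = null;

  // `now` is passed in so the sketch is deterministic (an assumption;
  // the real function presumably reads the clock itself).
  acquire(holder: string, now: number): boolean {
    const current = this.state;
    const free = current === null;
    const reacquire = current !== null && current.holder === holder;
    const stale = current !== null && now - current.startedAt > STALE_THRESHOLD_MS;
    if (free || reacquire || stale) {
      // Re-acquisition resets startedAt so an active holder never looks stale.
      this.state = { holder, startedAt: now };
      return true;
    }
    return false;
  }

  // Only the current holder can release; the boolean reports success.
  release(holder: string): boolean {
    if (this.state?.holder !== holder) return false;
    this.state = null;
    return true;
  }

  getState(): LockState | null {
    return this.state;
  }
}

const lock = new ConsolidationLock();
console.log(lock.acquire("run-a", 0));       // true
console.log(lock.acquire("run-b", 60_000));  // false: run-a is not stale yet
console.log(lock.acquire("run-b", 120_001)); // true: stale takeover
console.log(lock.release("run-a"));          // false: run-a no longer holds it
```

A single-threaded model cannot show races between processes. In a database-backed version the takeover would need to be a conditional write, so that two processes observing the same stale row cannot both win.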

Testing

48 new tests across 3 test files + 2 updated assertions in sessions.test.ts:

  • session-facts (10 tests): insert with auto-hash, intra-session dedup rejection, cross-session allowed, ordered retrieval, unclaimed filtering, atomic claim, provenance linking
  • facts (14 tests): insert, is_latest boolean cast, domain filtering, subdomain filtering, entity-linked retrieval, supersession chains (A→B→C with only C is_latest), FTS5 keyword search with rank type assertion, access count increment, throws on superseding nonexistent fact
  • entities (25 tests): canonical name normalisation, case-insensitive find, findEntityByCanonical exact-match contract, type filter, find-or-create, metadata round-trip, fact-entity linking + idempotency, saturating edge potentiation (monotonic increase verified across 50 iterations), edge retrieval, access tracking, domain CRUD + idempotency, lock acquire/release/stale takeover (2-min threshold)
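The supersession-chain case (A→B→C with only C is_latest) and the throw on a nonexistent fact can be illustrated with an in-memory model. The class and field names here, including `supersededBy`, are hypothetical; the real supersedeFact runs as a SQLite transaction.

```typescript
interface Fact {
  id: number;
  content: string; // immutable: never edited, only superseded
  isLatest: boolean;
  supersededBy: number | null; // hypothetical field name
}

class FactStore {
  private facts = new Map<number, Fact>();
  private nextId = 1;

  insert(content: string): Fact {
    const fact: Fact = { id: this.nextId++, content, isLatest: true, supersededBy: null };
    this.facts.set(fact.id, fact);
    return fact;
  }

  // Check first, then mutate, so a missing old fact leaves the store untouched.
  supersede(oldId: number, content: string): Fact {
    const old = this.facts.get(oldId);
    if (old === undefined) throw new Error(`fact ${oldId} not found`);
    const fresh = this.insert(content);
    old.isLatest = false;
    old.supersededBy = fresh.id;
    return fresh;
  }

  latest(): Fact[] {
    return Array.from(this.facts.values()).filter((f) => f.isLatest);
  }
}

const factStore = new FactStore();
const factA = factStore.insert("uses Postgres 14");
const factB = factStore.supersede(factA.id, "uses Postgres 15");
const factC = factStore.supersede(factB.id, "uses Postgres 16");
console.log(factStore.latest().map((f) => f.content)); // [ 'uses Postgres 16' ]
```

Older facts keep their original content and stay queryable as history; only the flag and the forward pointer change.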

All tests use :memory: databases — no file system side effects.


Impact Assessment

Breaking changes: None. New tables only — existing sessions and session_events tables unchanged.

Components affected: Schema version moves from 2 → 4. Migrations are additive (CREATE TABLE IF NOT EXISTS). Existing databases auto-migrate on server start.

Rollback plan: Drop the new tables with one DROP TABLE IF EXISTS statement per table (SQLite's DROP TABLE takes a single table name): session_facts, session_fact_sources, domains, consolidation_lock, facts, facts_fts, entities, fact_entities, entity_edges, sources, consolidations. Then reset PRAGMA user_version = 2.
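Because SQLite's DROP TABLE accepts one table name per statement, the rollback expands to one statement per table plus the version reset. A small sketch that generates the script follows; `V3_V4_TABLES` and `rollbackStatements` are hypothetical names, and the table list is copied from the rollback plan.

```typescript
// Tables introduced by schema v3 and v4, as listed in the rollback plan.
const V3_V4_TABLES = [
  "session_facts",
  "session_fact_sources",
  "domains",
  "consolidation_lock",
  "facts",
  "facts_fts",
  "entities",
  "fact_entities",
  "entity_edges",
  "sources",
  "consolidations",
] as const;

// Builds one DROP statement per table, then resets the schema version.
function rollbackStatements(targetVersion: number = 2): string[] {
  const drops = V3_V4_TABLES.map((table) => `DROP TABLE IF EXISTS ${table};`);
  return [...drops, `PRAGMA user_version = ${targetVersion};`];
}

console.log(rollbackStatements().join("\n"));
```

The joined script could then be handed to better-sqlite3's `db.exec`. Dropping the facts table also removes the triggers defined on it, so they need no separate statements.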


Complexity

  • Simple (config tweak, dependency update)
  • Moderate (new workflow, migration change)
  • Complex (core infrastructure change, many dependencies)

Checklist

  • npm run build succeeds
  • npm test passes
  • Testing complete (migration tested, workflow triggered, or config validated)
  • Code follows project standards
  • Code is production-ready and can be reviewed

Add schema v3 (session_facts, session_fact_sources, domains,
consolidation_lock) and v4 (facts with FTS5, entities, fact_entities,
entity_edges, sources, consolidations) migrations.

Implement synchronous data access modules:
- session-facts: insert with SHA-256 dedup, claim for consolidation,
  provenance linking via junction table
- facts: CRUD, FTS5 keyword search, supersession chains
- entities: find-or-create with canonical name normalisation,
  fact-entity linking, weighted graph edges with strength capping
- domains: get/create/ensure (idempotent)
- consolidation-lock: advisory lock with 5-minute stale detection

48 new tests covering insertion, dedup, FTS5 triggers, supersession
chains, entity graph edges, provenance linking, lock acquisition.
@gordonkjlee gordonkjlee self-assigned this Apr 6, 2026
Replace linear edge strength increment (0.1 per co-occurrence) with
logarithmic potentiation curve: strength = 1 - 1/(1 + count × K).
K=0.5 models LTP saturation — early co-occurrences cause large jumps,
later ones diminish. EDGE_POTENTIATION_K is a named constant for
Phase 3 parametric feedback adjustment.

Reduce consolidation lock stale threshold from 5 minutes to 2 minutes.
Heuristic consolidation takes milliseconds; even Tier 1 sampling is
well under 60 seconds. Faster recovery from crashed processes.
- Add FTS5 DELETE trigger on facts table with comment explaining why
  UPDATE trigger is omitted (immutable fact content per ADR-4)
- Add missing partial index idx_session_facts_unclaimed on
  session_facts(created_at) WHERE consolidation_id IS NULL
- Hardcode user_version in applyV4 (was using CURRENT_VERSION variable)
- Add NOT NULL constraint on entities.canonical_name (code always writes it)
- Throw in supersedeFact when oldId does not exist (was silent no-op)
- Add test for findEntityByCanonical contract (no normalisation)
- Add test for supersedeFact with nonexistent oldId
@gordonkjlee gordonkjlee force-pushed the feat/dikw-data-access branch from 9d4391c to b781a6d April 6, 2026 23:00
Replace internal document references with self-explanatory descriptions.
Design docs are gitignored — source code must be understandable without
access to them.
@gordonkjlee gordonkjlee force-pushed the feat/dikw-data-access branch from b781a6d to bf0f0d4 April 6, 2026 23:02
- Fix Entity.canonical_name type: string | null → string (matches NOT
  NULL schema constraint)
- Remove DEFAULT 1.0 from entity_edges.strength (force explicit values
  via upsertEntityEdge logarithmic formula)
- Document findEntity non-determinism without type filter
- Document keywordSearch FTS5 syntax throw behaviour
Schema:
- Add UNIQUE(canonical_name, type) on entities — prevents duplicate
  entities and makes findOrCreateEntity safe across processes
- Remove DEFAULT from entity_edges.strength — forces explicit values

Edge potentiation:
- Replace inverse-formula approach with exponential saturation:
  new = old + (1 - old) * alpha. Monotonically increasing by
  construction, no floating-point precision loss, no schema change.
  Rename EDGE_POTENTIATION_K → EDGE_POTENTIATION_ALPHA (0.3).

Terminology:
- "saturating potentiation" not "logarithmic" (the formula is
  hyperbolic/exponential, not logarithmic)
- "inspired by LTP saturation" not "models LTP"
- Remove "(pattern separation)" from content_hash comment — hash
  dedup collapses identical inputs, the opposite of pattern separation

Defensive checks:
- findOrCreateEntity wrapped in transaction
- insertFact checks result.changes
- releaseLock returns boolean
- claimForConsolidation JSDoc documents lock precondition

Tests:
- Tighten FTS5 rank assertion (typeof number, not toBeDefined)
- Tighten search result count (toHaveLength, not >= 1)
- Edge tests verify monotonic increase across 50 iterations
Schema:
- consolidations.session_id nullable (consolidation spans sessions)
- Add Phase 2 comment on sources table (no data access yet)
- Add comment explaining intentional FK omission on v3/v4 tables

Lock:
- Reset started_at on re-acquisition to prevent stale detection
  while holder is still active

Facts:
- Export sanitiseFtsQuery() helper — wraps terms in double quotes
  to force literal matching, strips stray quote characters
- insertFact: distinguish undefined (default to now) from null
  (explicitly unknown valid_from) for bitemporal correctness
- JSDoc documenting valid_from default behaviour

Entities:
- findEntityByCanonical: document non-determinism without type
- createEntity: document UNIQUE constraint throw
- upsertEntityEdge: document entity existence responsibility
- Fix precision claim: "no practical precision concern" not "no loss"
- sanitiseFtsQuery: empty input, stray quotes, FTS5 operators, single term
- insertFact with valid_from: null stores null (unknown validity start)
- insertFact without valid_from defaults to now
- Fix Consolidation.session_id type: string → string | null (matches
  nullable schema — consolidation spans multiple sessions)
- Add optimistic WHERE clause to stale lock takeover (verify holder +
  timestamp unchanged between SELECT and UPDATE)
- Document supersedeFact: valid_from always set to now (intentional)
- Document sanitiseFtsQuery: per-term matching, not phrase
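The sanitiser behaviour described in these commits (strip stray quotes, wrap each term in double quotes, match per term rather than as a phrase) might look roughly like the sketch below. It is a guess at the shape, not the exported helper: the name `sanitiseFtsQuerySketch`, the whitespace splitting and the empty-string result for empty input are assumptions.

```typescript
// Hypothetical re-creation of the described behaviour.
function sanitiseFtsQuerySketch(input: string): string {
  return input
    .replace(/"/g, "")                  // strip stray quote characters
    .split(/\s+/)                       // split into whitespace-separated terms
    .filter((term) => term.length > 0)  // drop empties from leading/trailing space
    .map((term) => `"${term}"`)         // quote each term so FTS5 treats it literally
    .join(" ");                         // space-separated terms are an implicit AND in FTS5
}

console.log(sanitiseFtsQuerySketch('hello AND "world'));
// "hello" "AND" "world"
console.log(sanitiseFtsQuerySketch("   ") === ""); // true
```

Quoting each term separately means operators such as AND, OR and NOT arrive as ordinary strings, and every term must match somewhere in the row without having to be adjacent to the others. A caller would still need to decide what an empty result means before passing it to MATCH.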
@gordonkjlee gordonkjlee merged commit 801a582 into main Apr 7, 2026
3 checks passed
@gordonkjlee gordonkjlee deleted the feat/dikw-data-access branch April 7, 2026 14:15
