Catalog convention: encode cross-project lineage #12

@ProfessorPolymorphic

Context

The current per-project catalog files (catalog/<project>.json) declare FKs and relationships only against tables within their own project. Each project is parsed in isolation, so cross-project lineage is invisible to downstream consumers.

This was discovered while implementing the Data Governance Explorer's lineage rendering on the AISPEG site (slice ui-insight/AISPEG#59 / epic ui-insight/AISPEG#53). An audit of all five vendored catalogs found:

Metric                                                          Value
--------------------------------------------------------------  -----
Total tables                                                    71
Distinct table names                                            67
Table names shared across projects                              4 (AllowedValues, ActivityLog, Document, users)
FK strings whose target table exists only in another project    0
Relationships whose target exists only in another project       0

The same-named tables are a shared pattern (each project owns its own AllowedValues, etc.), not pointers to a common table. Nothing in the catalogs states that audit-dashboard.AuditReport references the institutional Personnel registry, even though that dependency exists in practice.
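For reference, the "target exists only in another project" check can be sketched roughly as below. This is not the audit code that produced the numbers above. The catalog shape is an assumption: only foreign_key appears in this issue, while project, tables[].name and tables[].columns are guessed field names, and FK strings are assumed to use the bare <TableName>.<column> form.

```typescript
// Assumed catalog shape; field names other than foreign_key are guesses.
interface AuditColumn { name: string; foreign_key?: string }
interface AuditTable { name: string; columns: AuditColumn[] }
interface AuditCatalog { project: string; tables: AuditTable[] }

// Counts FK strings whose target table is missing from the declaring
// project but present in at least one other project.
function countForeignOnlyTargets(catalogs: AuditCatalog[]): number {
  let count = 0;
  for (const catalog of catalogs) {
    const ownTables = new Set<string>(catalog.tables.map((t) => t.name));
    const otherTables = new Set<string>();
    for (const other of catalogs) {
      if (other === catalog) continue;
      for (const t of other.tables) otherTables.add(t.name);
    }
    for (const table of catalog.tables) {
      for (const column of table.columns) {
        if (column.foreign_key === undefined) continue;
        // Bare form: everything before the first dot is the table name.
        const targetTable = column.foreign_key.split(".")[0];
        if (!ownTables.has(targetTable) && otherTables.has(targetTable)) {
          count += 1;
        }
      }
    }
  }
  return count;
}
```

A result of 0 from a check like this is consistent with the audit: every FK currently resolves inside its own project, so no cross-project pointer is encoded anywhere.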

Why this matters

Stakeholders ask questions like:

  • "What other AI4RA projects depend on Personnel?"
  • "If we change the canonical UDM Document shape, who is downstream?"
  • "Which projects share a single source of truth versus duplicating the entity?"

These are unanswerable from today's catalog. The downstream lineage renderer (AISPEG#59) will stay empty until the catalogs encode cross-project references.

Proposal

Two viable representations — pick one or combine.

Option A — extend FK strings with an explicit project prefix

Allow foreign_key / relationship target to take the form
<otherproject>:<TableName>.<column> when the target lives in another
project's catalog. Bare <TableName>.<column> continues to mean
"resolves inside this catalog". Example:

{ "name": "Lead_Personnel_ID", "type": "Integer", "foreign_key": "openera:Personnel.Personnel_ID" }

Pros: minimal schema change; round-trips through the existing parser with
a small extension; preserves the column-level FK granularity.

Cons: requires every project to know the slug of every other project; no
top-level overview of inter-project dependencies.
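A minimal sketch of the parser extension Option A implies, to show how small the change could be. The function and type names are hypothetical and are not taken from the existing parser. It also assumes project slugs, table names and column names never contain a colon, dot or whitespace.

```typescript
// Hypothetical helper: not part of the existing parser.
interface FkTarget {
  project: string | null; // null means "resolves inside this catalog"
  table: string;
  column: string;
}

function parseForeignKey(fk: string): FkTarget {
  // Optional "<otherproject>:" prefix, then "<TableName>.<column>".
  const match = /^(?:([^:.\s]+):)?([^:.\s]+)\.([^:.\s]+)$/.exec(fk.trim());
  if (match === null) {
    throw new Error(`Unrecognized foreign_key target: "${fk}"`);
  }
  const [, project, table, column] = match;
  return { project: project ?? null, table, column };
}

// Prefixed form: the target lives in another project's catalog.
const crossProject = parseForeignKey("openera:Personnel.Personnel_ID");
// { project: "openera", table: "Personnel", column: "Personnel_ID" }

// Bare form: unchanged meaning, resolves inside the declaring catalog.
const local = parseForeignKey("Personnel.Personnel_ID");
// { project: null, table: "Personnel", column: "Personnel_ID" }
```

Returning project: null for the bare form keeps existing catalogs valid without any edits, which is the backward-compatibility property Option A relies on.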

Option B — add a top-level cross_project_references array

Add to each catalog/<project>.json:

"cross_project_references": [
  {
    "source_table": "AuditReport",
    "source_column": "Lead_Personnel_ID",
    "target_project": "openera",
    "target_table": "Personnel",
    "target_column": "Personnel_ID",
    "kind": "foreign-key"
  }
]

Pros: zero impact on the existing tables[] schema; a single read gives
the full outbound dependency graph for a project; supports relationship-
level entries (set kind: "declared-relationship" and omit source_column).

Cons: introduces a second source of truth — the per-column FK string and
the cross-reference array can disagree.
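One possible TypeScript typing for these entries, with a small validator, is sketched below. The field names come from the example above. Everything else is an assumption made for illustration: the two kind values are the only ones mentioned in this issue, target_column is treated as optional for relationship-level entries, and the self-reference rule is a guess at what a schema check might enforce.

```typescript
type ReferenceKind = "foreign-key" | "declared-relationship";

interface CrossProjectReference {
  source_table: string;
  source_column?: string; // omitted for relationship-level entries
  target_project: string;
  target_table: string;
  target_column?: string; // assumption: optional for relationship-level entries
  kind: ReferenceKind;
}

// Hypothetical validator: returns a list of problems, empty when the entry is fine.
function validateReference(ref: CrossProjectReference, ownProject: string): string[] {
  const problems: string[] = [];
  if (ref.target_project === ownProject) {
    // Assumed rule: in-project links belong in tables[], not here.
    problems.push("target_project points at the declaring project");
  }
  if (ref.kind === "foreign-key" && (!ref.source_column || !ref.target_column)) {
    problems.push("foreign-key entries need source_column and target_column");
  }
  if (ref.kind === "declared-relationship" && ref.source_column !== undefined) {
    problems.push("declared-relationship entries should omit source_column");
  }
  return problems;
}

const exampleReference: CrossProjectReference = {
  source_table: "AuditReport",
  source_column: "Lead_Personnel_ID",
  target_project: "openera",
  target_table: "Personnel",
  target_column: "Personnel_ID",
  kind: "foreign-key",
};
const exampleProblems = validateReference(exampleReference, "audit-dashboard");
```

A check like this does not remove the second-source-of-truth risk noted above. It only catches malformed entries, not entries that disagree with a per-column FK string.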

Option C (recommended) — Option B as the canonical surface, Option A as syntactic sugar that the build script expands

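Under this option, the build script at
ui-insight/AISPEG/scripts/build-governance-catalog.ts would parse both
forms, expand each prefixed FK string into the equivalent
cross_project_references entry, and emit a flat lineage list. Catalog
authors choose whichever form is more ergonomic per project.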
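A rough sketch of what that expansion step could look like, assuming the Option A and Option B shapes proposed above. This is not the contents of build-governance-catalog.ts. The catalog fields project, tables[].name and tables[].columns, the LineageEdge type, and both function names are hypothetical.

```typescript
// Assumed catalog shape; only foreign_key and cross_project_references come from this issue.
interface Column { name: string; type: string; foreign_key?: string }
interface Table { name: string; columns: Column[] }
interface CrossRef {
  source_table: string; source_column?: string;
  target_project: string; target_table: string; target_column?: string;
  kind: string;
}
interface Catalog { project: string; tables: Table[]; cross_project_references?: CrossRef[] }

interface LineageEdge {
  sourceProject: string; sourceTable: string; sourceColumn: string | null;
  targetProject: string; targetTable: string; targetColumn: string | null;
  kind: string;
}

function edgeKey(e: LineageEdge): string {
  return [e.sourceProject, e.sourceTable, e.sourceColumn ?? "",
    e.targetProject, e.targetTable, e.targetColumn ?? "", e.kind].join("|");
}

function buildLineage(catalogs: Catalog[]): LineageEdge[] {
  // Keyed by edge identity so the same link stated in both forms appears once.
  const edges = new Map<string, LineageEdge>();
  const add = (e: LineageEdge): void => { edges.set(edgeKey(e), e); };
  for (const catalog of catalogs) {
    // Option A sugar: "<otherproject>:<TableName>.<column>" on a column.
    for (const table of catalog.tables) {
      for (const column of table.columns) {
        const m = /^([^:.\s]+):([^:.\s]+)\.([^:.\s]+)$/.exec(column.foreign_key ?? "");
        if (m === null) continue; // bare or absent FK: resolves inside this catalog
        const [, targetProject, targetTable, targetColumn] = m;
        add({ sourceProject: catalog.project, sourceTable: table.name,
          sourceColumn: column.name, targetProject, targetTable, targetColumn,
          kind: "foreign-key" });
      }
    }
    // Option B: explicit top-level entries.
    for (const ref of catalog.cross_project_references ?? []) {
      add({ sourceProject: catalog.project, sourceTable: ref.source_table,
        sourceColumn: ref.source_column ?? null, targetProject: ref.target_project,
        targetTable: ref.target_table, targetColumn: ref.target_column ?? null,
        kind: ref.kind });
    }
  }
  return Array.from(edges.values());
}

// Answers "what other projects depend on <project>.<table>?"
function dependentsOf(edges: LineageEdge[], project: string, table: string): string[] {
  const projects = new Set<string>();
  for (const e of edges) {
    if (e.targetProject === project && e.targetTable === table) projects.add(e.sourceProject);
  }
  return Array.from(projects).sort();
}
```

With a flat list like this, the Personnel question under "Why this matters" reduces to dependentsOf(edges, "openera", "Personnel"). Note that de-duplicating by key only collapses identical statements; if the two forms disagree about a target, both edges survive, so the build would still need a policy for that case.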

Acceptance criteria

  • Decision recorded in docs/ (markdown ADR)
  • At least one project's catalog uses the new form
  • Schema documented (e.g. docs/catalog-schema.md)
  • AISPEG#59 follow-up rendering issue references this issue
