IDs provenance implementation plan #22

@ofrei

Rationale

.vmap rows carry two independent identities:

  • provenance identity: source_shard and source_index, which locate the original payload row
  • variant identity: prepared chrom:pos:a1:a2, which drives exact matching

For normal import and preparation paths, .vmap id values are also expected to match the original payload IDs for the corresponding provenance row. That invariant is useful: after project_payload.py matches a payload row through source_shard / source_index, it can sanity-check that the payload ID still agrees with the .vmap ID. A mismatch is evidence of a wrong .vmap, corrupted provenance, or an incompatible payload.

assign_vmap_ids.py intentionally breaks that invariant. It replaces .vmap IDs with IDs copied from a prepared .vmap / .vtable ID source by exact chrom:pos:a1:a2 match. Those assigned IDs are meaningful for downstream output, but they no longer describe the original source row ID namespace. Downstream tools therefore need metadata that says how .vmap.id should be interpreted.

This is more general than a projection flag such as --retain-snp-id. The metadata should describe the provenance of the .vmap ID column; projection defaults and provenance sanity checks should derive behavior from that metadata.

Proposed metadata

Add a .vmap metadata field:

"ids_provenance": "source"

Allowed values:

  • source: .vmap.id values are preserved from the original source payload rows. Downstream tools may compare .vmap.id against payload row IDs after provenance matching.
  • assign_vmap_ids: .vmap.id values were intentionally assigned by assign_vmap_ids.py from an external prepared ID source. Downstream tools must not compare .vmap.id against original payload row IDs.

Legacy .vmap metadata that lacks ids_provenance should be interpreted as source.

assign_vmap_ids.py should set ids_provenance: assign_vmap_ids unconditionally, even if every assigned ID happens to equal the previous .vmap ID. The field describes the contract and intent of the ID column, not accidental equality.
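A minimal sketch of how a reader for this field might look, assuming .vmap metadata is available as a plain dict. The helper name `read_ids_provenance` and the constant `ALLOWED_IDS_PROVENANCE` are hypothetical and not part of the existing codebase.

```python
# Hypothetical helper; names are illustrative, not existing project API.
ALLOWED_IDS_PROVENANCE = ("source", "assign_vmap_ids")


def read_ids_provenance(metadata: dict) -> str:
    """Return the declared provenance of the .vmap ID column.

    Legacy metadata without the field is interpreted as "source".
    Any value outside the allowed set is rejected.
    """
    value = metadata.get("ids_provenance", "source")
    if value not in ALLOWED_IDS_PROVENANCE:
        raise ValueError(
            f"unsupported ids_provenance {value!r}; "
            f"expected one of {ALLOWED_IDS_PROVENANCE}"
        )
    return value
```

Rejecting unknown values, rather than silently treating them as source, keeps a future non-source value from being sanity-checked against payload IDs by an older tool.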

Intended spec change

Update the .vmap object contract:

  • .vmap.id is an ID column with declared provenance.
  • If ids_provenance is absent or source, .vmap.id is expected to match the source payload ID addressed by source_shard / source_index.
  • If ids_provenance is assign_vmap_ids, .vmap.id is an assigned output ID namespace and is not expected to match source payload IDs.
  • Tools that preserve source-row provenance and do not rewrite IDs must preserve or emit ids_provenance: source.
  • Tools that copy or rewrite IDs from another prepared ID source must set ids_provenance to the appropriate non-source value.

Update projection semantics:

  • Projection ID output should default from ids_provenance.
  • For ids_provenance: source or absent metadata, preserve current default behavior: output generated chrom:pos:a1:a2 IDs unless the user explicitly requests .vmap IDs.
  • For ids_provenance: assign_vmap_ids, default to outputting .vmap IDs because those IDs were intentionally assigned for downstream use.
  • Existing explicit projection flags should remain as overrides for backward compatibility.
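The projection rules above could be resolved roughly as follows. This is a sketch under assumptions: the function name is hypothetical, and the mode names `auto`, `variant-key`, and `vmap-id` are borrowed from the override floated under Open decisions, not from an existing flag.

```python
from typing import Optional

# Hypothetical mode names; "auto" means "derive from ids_provenance".
OUTPUT_ID_MODES = ("variant-key", "vmap-id")


def resolve_output_id_source(
    ids_provenance: Optional[str], explicit: Optional[str] = None
) -> str:
    """Pick which IDs projection writes: generated keys or .vmap IDs."""
    # An explicit user choice always overrides the metadata default.
    if explicit is not None and explicit != "auto":
        if explicit not in OUTPUT_ID_MODES:
            raise ValueError(f"unknown output ID mode {explicit!r}")
        return explicit
    # Absent metadata is treated as source.
    provenance = "source" if ids_provenance is None else ids_provenance
    if provenance == "source":
        return "variant-key"  # current default: chrom:pos:a1:a2
    if provenance == "assign_vmap_ids":
        return "vmap-id"  # assigned IDs are meant for downstream output
    raise ValueError(f"unsupported ids_provenance {provenance!r}")
```

Under this sketch, an existing flag such as --retain-snp-id would map to passing `explicit="vmap-id"`, so its behavior stays unchanged regardless of metadata.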

Update provenance validation semantics:

  • When projecting with ids_provenance: source or absent metadata, after matching by source_shard / source_index, compare the payload ID to .vmap.id and fail clearly on mismatch.
  • When projecting with ids_provenance: assign_vmap_ids, skip that source-ID sanity check.
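As an illustration of the validation rule, the check might look like the sketch below. It assumes a matched .vmap row is available as a dict with `id`, `source_shard`, and `source_index` keys; that representation and the function name `check_source_id` are assumptions, not the actual project_payload.py internals.

```python
from typing import Optional


def check_source_id(
    vmap_row: dict, payload_id: str, ids_provenance: Optional[str] = None
) -> None:
    """Fail clearly if a source-provenance .vmap ID disagrees with the payload.

    Called after the payload row has been located via source_shard /
    source_index. Absent provenance is treated as "source".
    """
    provenance = "source" if ids_provenance is None else ids_provenance
    if provenance == "assign_vmap_ids":
        return  # assigned IDs are not expected to match payload IDs
    if provenance != "source":
        raise ValueError(f"unsupported ids_provenance {provenance!r}")
    if vmap_row["id"] != payload_id:
        raise ValueError(
            "source ID mismatch at "
            f"source_shard={vmap_row['source_shard']} "
            f"source_index={vmap_row['source_index']}: "
            f".vmap id {vmap_row['id']!r} != payload id {payload_id!r}"
        )
```

Including the provenance coordinates in the error message points the user at the exact payload row, which helps distinguish a wrong .vmap from a corrupted or incompatible payload.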

Implementation outline

  1. Add metadata helpers for reading ids_provenance, defaulting absent values to source, and validating allowed values.
  2. Update .vmap metadata validation to accept the optional field.
  3. Ensure importers and existing preparation tools either preserve absent/source semantics or explicitly write ids_provenance: source when metadata is rewritten.
  4. Update assign_vmap_ids.py metadata writing to set ids_provenance: assign_vmap_ids.
  5. Update projection utilities to derive default ID output behavior from ids_provenance, while retaining explicit override flags.
  6. Add provenance sanity checking for payload IDs only when ids_provenance resolves to source, whether set explicitly or defaulted from absent metadata.
  7. Update SPEC.md, relevant spec/ files, README/docs, TESTS.md, and changelog wording.
  8. Add tests for:
    • legacy .vmap metadata without ids_provenance behaves as source
    • normal prepared .vmap preserves or emits source ID provenance
    • assign_vmap_ids.py sets ids_provenance: assign_vmap_ids
    • projection defaults to .vmap IDs for assigned-ID .vmap
    • projection keeps current default for source-ID .vmap
    • source-ID provenance mismatch fails clearly
    • assigned-ID provenance skips the source-ID sanity check

Open decisions

  • Whether to introduce a clearer projection override such as --output-id-source {auto,variant-key,vmap-id} while retaining --retain-snp-id as a compatibility alias.
  • Whether source-ID mismatch should always hard-fail, or whether any legacy compatibility mode is needed.
  • Whether future non-source values should name tools (assign_vmap_ids) or broader semantics (assigned). The initial proposal uses assign_vmap_ids because it maps directly to the current primitive and avoids ambiguity.
