IDs provenance implementation plan
Rationale
.vmap rows carry two independent identities:
- provenance identity: source_shard and source_index, which locate the original payload row
- variant identity: prepared chrom:pos:a1:a2, which drives exact matching
For normal import and preparation paths, .vmap id values are also expected to match the original payload IDs for the corresponding provenance row. That invariant is useful: after project_payload.py matches a payload row through source_shard / source_index, it can sanity-check that the payload ID still agrees with the .vmap ID. A mismatch is evidence of a wrong .vmap, corrupted provenance, or an incompatible payload.
assign_vmap_ids.py intentionally breaks that invariant. It replaces .vmap IDs with IDs copied from a prepared .vmap / .vtable ID source by exact chrom:pos:a1:a2 match. Those assigned IDs are meaningful for downstream output, but they no longer describe the original source row ID namespace. Downstream tools therefore need metadata that says how .vmap.id should be interpreted.
This is more general than a projection flag such as --retain-snp-id. The metadata should describe the provenance of the .vmap ID column; projection defaults and provenance sanity checks should derive behavior from that metadata.
Proposed metadata
Add a .vmap metadata field:
"ids_provenance": "source"
Allowed values:
- source: .vmap.id values are preserved from the original source payload rows. Downstream tools may compare .vmap.id against payload row IDs after provenance matching.
- assign_vmap_ids: .vmap.id values were intentionally assigned by assign_vmap_ids.py from an external prepared ID source. Downstream tools must not compare .vmap.id against original payload row IDs.
Legacy .vmap metadata that lacks ids_provenance should be interpreted as source.
assign_vmap_ids.py should set ids_provenance: assign_vmap_ids unconditionally, even if every assigned ID happens to equal the previous .vmap ID. The field describes the contract and intent of the ID column, not accidental equality.
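The reading and writing rules above could be sketched roughly as follows. This assumes the .vmap metadata is available as a plain dict (for example, parsed from JSON); the names resolve_ids_provenance, mark_ids_assigned, and ALLOWED_IDS_PROVENANCE are hypothetical and do not refer to existing project code.

```python
# Hypothetical helpers; the metadata is assumed to be a dict parsed from JSON.
ALLOWED_IDS_PROVENANCE = ("source", "assign_vmap_ids")
DEFAULT_IDS_PROVENANCE = "source"


def resolve_ids_provenance(metadata: dict) -> str:
    """Return the effective ids_provenance, treating a missing field as 'source'."""
    value = metadata.get("ids_provenance", DEFAULT_IDS_PROVENANCE)
    if value not in ALLOWED_IDS_PROVENANCE:
        raise ValueError(
            f"unsupported ids_provenance {value!r}; "
            f"expected one of {ALLOWED_IDS_PROVENANCE}"
        )
    return value


def mark_ids_assigned(metadata: dict) -> dict:
    """Return a copy of the metadata marked as assigned.

    The field is set unconditionally; no comparison of old and new IDs is made.
    """
    updated = dict(metadata)
    updated["ids_provenance"] = "assign_vmap_ids"
    return updated
```

Returning a copy from mark_ids_assigned is a design choice of this sketch only; a tool that rewrites metadata in place would work equally well.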
Intended spec change
Update the .vmap object contract:
.vmap.id is an ID column with declared provenance.
- If ids_provenance is absent or source, .vmap.id is expected to match the source payload ID addressed by source_shard / source_index.
- If ids_provenance is assign_vmap_ids, .vmap.id is an assigned output ID namespace and is not expected to match source payload IDs.
- Tools that preserve source-row provenance and do not rewrite IDs must preserve or emit ids_provenance: source.
- Tools that copy or rewrite IDs from another prepared ID source must set ids_provenance to the appropriate non-source value.
Update projection semantics:
- Projection ID output should default from ids_provenance.
- For ids_provenance: source or absent metadata, preserve current default behavior: output generated chrom:pos:a1:a2 IDs unless the user explicitly requests .vmap IDs.
- For ids_provenance: assign_vmap_ids, default to outputting .vmap IDs because those IDs were intentionally assigned for downstream use.
- Existing explicit projection flags should remain as overrides for backward compatibility.
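As a rough illustration of these defaults, the selection logic might look like the sketch below. The function names, the retain_vmap_ids parameter (standing in for whatever explicit flag the user passed, with None meaning no flag), and the row layout as a dict with chrom, pos, a1, a2, and id keys are all assumptions made for this example. The mode strings variant-key and vmap-id are borrowed from the override proposed under Open decisions and are not settled names.

```python
from typing import Optional


def default_output_id_mode(
    ids_provenance: Optional[str],
    retain_vmap_ids: Optional[bool] = None,
) -> str:
    """Choose between 'vmap-id' and 'variant-key' output IDs.

    An explicit user choice (retain_vmap_ids not None) always wins;
    otherwise the default is derived from ids_provenance.
    """
    if retain_vmap_ids is not None:
        return "vmap-id" if retain_vmap_ids else "variant-key"
    if ids_provenance == "assign_vmap_ids":
        return "vmap-id"
    # 'source' or absent metadata keeps the current default.
    return "variant-key"


def output_id(row: dict, mode: str) -> str:
    """Render the output ID for one .vmap row under the chosen mode."""
    if mode == "vmap-id":
        return row["id"]
    return f'{row["chrom"]}:{row["pos"]}:{row["a1"]}:{row["a2"]}'
```

For example, a row with chrom 1, pos 12345, a1 A, a2 G, and id rs99 would be written as 1:12345:A:G under source provenance and as rs99 under assign_vmap_ids provenance, unless an explicit flag overrides the default.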
Update provenance validation semantics:
- When projecting with ids_provenance: source or absent metadata, after matching by source_shard / source_index, compare the payload ID to .vmap.id and fail clearly on mismatch.
- When projecting with ids_provenance: assign_vmap_ids, skip that source-ID sanity check.
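A minimal sketch of the conditional check, assuming each .vmap row is a dict with id, source_shard, and source_index keys and that the payload ID has already been looked up through provenance matching. ProvenanceMismatchError and check_source_id are hypothetical names, and the hard failure shown here is only one of the options still listed under Open decisions.

```python
from typing import Optional


class ProvenanceMismatchError(ValueError):
    """Raised when a payload ID disagrees with a source-provenance .vmap ID."""


def check_source_id(
    ids_provenance: Optional[str],
    vmap_row: dict,
    payload_id: str,
) -> None:
    """Compare payload and .vmap IDs unless the .vmap IDs were assigned."""
    if ids_provenance == "assign_vmap_ids":
        # Assigned IDs live in a different namespace; nothing to compare.
        return
    if payload_id != vmap_row["id"]:
        raise ProvenanceMismatchError(
            f"payload ID {payload_id!r} does not match .vmap ID "
            f"{vmap_row['id']!r} at source_shard={vmap_row['source_shard']!r}, "
            f"source_index={vmap_row['source_index']!r}"
        )
```

Including source_shard and source_index in the message is one way to make the failure clear, since those two values identify the payload row that was compared.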
Implementation outline
- Add metadata helpers for reading ids_provenance, defaulting absent values to source, and validating allowed values.
- Update .vmap metadata validation to accept the optional field.
- Ensure importers and existing preparation tools either preserve absent/source semantics or explicitly write ids_provenance: source when metadata is rewritten.
- Update assign_vmap_ids.py metadata writing to set ids_provenance: assign_vmap_ids.
- Update projection utilities to derive default ID output behavior from ids_provenance, while retaining explicit override flags.
- Add provenance sanity checking for payload IDs only when ids_provenance is source.
- Update SPEC.md, relevant spec/ files, README/docs, TESTS.md, and changelog wording.
- Add tests for:
  - legacy .vmap metadata without ids_provenance behaves as source
  - normal prepared .vmap preserves or emits source ID provenance
  - assign_vmap_ids.py sets ids_provenance: assign_vmap_ids
  - projection defaults to .vmap IDs for assigned-ID .vmap
  - projection keeps current default for source-ID .vmap
  - source-ID provenance mismatch fails clearly
  - assigned-ID provenance skips the source-ID sanity check
Open decisions
- Whether to introduce a clearer projection override such as --output-id-source {auto,variant-key,vmap-id} while retaining --retain-snp-id as a compatibility alias.
- Whether source-ID mismatch should always hard-fail, or whether any legacy compatibility mode is needed.
- Whether future non-source values should name tools (assign_vmap_ids) or broader semantics (assigned). The initial proposal uses assign_vmap_ids because it maps directly to the current primitive and avoids ambiguity.