This tutorial uses the checked-in fixture under examples/tutorial-1/ to show the normal prepare -> restrict -> project workflow on small but realistic data.
The reference VCF defines the target variant universe. The study summary statistics and the cohort BFILE are payloads: each keeps its own source-row provenance and is projected independently through its own restricted .vmap.
Release users can download genomatch-tutorial-1-fixtures-<version>.zip from the matching GitHub Release and unzip it in a working directory. The archive expands to examples/tutorial-1/, so the commands below work unchanged.
examples/tutorial-1/
reference.grch38.vcf
study.grch37.tsv
study.grch37.meta.yaml
cohort.21.{bed,bim,fam}
cohort.22.{bed,bim,fam}
cohort.X.{bed,bim,fam}
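Before running the workflow it can help to confirm that the fixture unpacked completely. The following is a minimal sketch, not part of the toolkit: `check_fixture` is a hypothetical helper name, and its file list simply mirrors the listing above.

```shell
# Sketch: report fixture files that are missing from a directory.
# check_fixture is a hypothetical helper, not a genomatch command.
# The body runs in a subshell so its variables do not leak to the caller.
check_fixture() (
    dir="${1:-examples/tutorial-1}"
    missing=0
    # Single-file inputs: reference VCF, study table, study metadata.
    for name in reference.grch38.vcf study.grch37.tsv study.grch37.meta.yaml; do
        if [ ! -f "$dir/$name" ]; then
            echo "missing: $dir/$name" >&2
            missing=1
        fi
    done
    # PLINK 1 shards: each needs its .bed, .bim and .fam file.
    for shard in cohort.21 cohort.22 cohort.X; do
        for ext in bed bim fam; do
            if [ ! -f "$dir/$shard.$ext" ]; then
                echo "missing: $dir/$shard.$ext" >&2
                missing=1
            fi
        done
    done
    # Exit status is 0 only when nothing was missing.
    [ "$missing" -eq 0 ]
)
```

Calling `check_fixture examples/tutorial-1` prints any missing paths to stderr and returns a non-zero status if at least one file is absent.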
Warnings are expected. The fixture includes allele swaps, strand flips, ambiguous SNPs, indels, duplicate IDs, invalid alleles, long alleles, a multiallelic VCF row, liftover collisions, invalid summary-statistic values, and missing genotype calls. See examples/tutorial-1/manifest.tsv for row-level intent.
Reference-aware steps require MATCH_CONFIG to point at local GRCh37/GRCh38 FASTA and chain assets. See docs/install.md and docs/downloads.md.
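A missing or mistyped MATCH_CONFIG is easy to catch up front. The sketch below defines a hypothetical `check_match_config` helper that can be called once the variable has been exported. It only checks that the variable is set and names a readable file; it makes no assumption about the config format and does not validate the FASTA or chain assets the config refers to.

```shell
# Sketch: fail early when MATCH_CONFIG is unset or does not name a readable file.
# check_match_config is a hypothetical helper, not a genomatch command.
check_match_config() {
    if [ -z "${MATCH_CONFIG:-}" ]; then
        echo "MATCH_CONFIG is not set" >&2
        return 1
    fi
    if [ ! -f "$MATCH_CONFIG" ] || [ ! -r "$MATCH_CONFIG" ]; then
        echo "MATCH_CONFIG is not a readable file: $MATCH_CONFIG" >&2
        return 1
    fi
    return 0
}
```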
export MATCH_CONFIG=/path/to/ref/config.yaml
mkdir -p work/tutorial-1 out/tutorial-1
prepare_variants.py \
--input examples/tutorial-1/reference.grch38.vcf \
--input-format vcf \
--dst-build GRCh38 \
--dst-contig-naming ncbi \
--output work/tutorial-1/reference
This writes work/tutorial-1/reference.vmap. It is the target universe used for membership.
prepare_variants.py \
--input-format sumstats \
--sumstats-metadata examples/tutorial-1/study.grch37.meta.yaml \
--dst-build GRCh38 \
--dst-contig-naming ncbi \
--drop-strand-ambiguous \
--output work/tutorial-1/study
The study starts on GRCh37 and is lifted to GRCh38 during preparation. The metadata uses non-canonical column names so the clean projection can derive fields such as BETA, Z, effective N, and allele frequencies.
This fixture deliberately includes many allele flips and several A/T or C/G SNPs. In such summary-statistics data, strand orientation for those ambiguous SNPs cannot be resolved confidently, so the tutorial drops them from the study payload. In real datasets, if alleles are known to be reported on the positive strand for a known reference genome, users usually do not need --drop-strand-ambiguous.
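The reason A/T and C/G SNPs are ambiguous is that each allele is the complement of the other, so the pair reads the same on either strand: A/T on the forward strand is T/A on the reverse strand. The sketch below illustrates that test for a single pair of alleles. `is_strand_ambiguous` is a hypothetical helper written for this tutorial, not the logic that --drop-strand-ambiguous uses internally.

```shell
# Sketch: succeed (status 0) when two single-base alleles form a
# strand-ambiguous pair, i.e. each allele is the complement of the other.
# is_strand_ambiguous is a hypothetical helper for illustration only.
is_strand_ambiguous() {
    # Normalise to upper case so "a"/"t" is treated like "A"/"T".
    _pair=$(printf '%s/%s' "$1" "$2" | tr '[:lower:]' '[:upper:]')
    case "$_pair" in
        A/T|T/A|C/G|G/C) true ;;
        *) false ;;   # any other pair, including indels, is not ambiguous here
    esac
}
```

For example, `is_strand_ambiguous A T` succeeds, while `is_strand_ambiguous A C` fails because the complement of A/C is T/G, which is distinguishable from A/C.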
prepare_variants_sharded.py \
--input examples/tutorial-1/cohort.@.bim \
--input-format bim \
--prefix work/tutorial-1/cohort.@ \
--output work/tutorial-1/cohort \
--dst-build GRCh38 \
--dst-contig-naming ncbi \
--shards 21,22,X
This imports the PLINK 1 BIM shards, prepares each chromosome group, and writes one concatenated work/tutorial-1/cohort.vmap.
restrict_vmap.py \
work/tutorial-1/study.vmap \
work/tutorial-1/reference.vmap \
--output work/tutorial-1/study.reference.vmap
restrict_vmap.py \
work/tutorial-1/cohort.vmap \
work/tutorial-1/reference.vmap \
--output work/tutorial-1/cohort.reference.vmap
The restricted .vmap files preserve payload-specific row provenance. The reference controls membership; it is not applied as a payload here.
project_payload.py \
--input-format sumstats-clean \
--sumstats-metadata examples/tutorial-1/study.grch37.meta.yaml \
--vmap work/tutorial-1/study.reference.vmap \
--output out/tutorial-1/study.reference.tsv \
--use-af-inference
The output is clean, canonical summary statistics in the GRCh38 reference universe. Invalid numeric values in retained payload rows are written as missing values where the clean pipeline permits that.
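One quick way to see the effect of preparation and restriction is to compare row counts before and after projection. The sketch below assumes that both the input study table and the projected output are TSV files with exactly one header line; that layout is an assumption, not something this tutorial guarantees. `count_data_rows` is a hypothetical helper.

```shell
# Sketch: print the number of data rows in a TSV file.
# ASSUMPTION: the file has exactly one header line.
# count_data_rows is a hypothetical helper, not a genomatch command.
count_data_rows() {
    awk 'NR > 1 { n++ } END { print n + 0 }' "$1"
}

# Example use after the projection step has run:
#   before=$(count_data_rows examples/tutorial-1/study.grch37.tsv)
#   after=$(count_data_rows out/tutorial-1/study.reference.tsv)
#   echo "study rows: $before -> $after"
```

A lower count in the output is expected for this fixture, since rows that were dropped during preparation or that fall outside the reference universe are not projected.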
project_payload.py \
--input examples/tutorial-1/cohort.@.bim \
--input-format bfile \
--vmap work/tutorial-1/cohort.reference.vmap \
--output out/tutorial-1/cohort.@ \
--skip-ploidy-check
This writes sharded PLINK 1 BFILE outputs under out/tutorial-1/cohort.<contig>. The tutorial uses --skip-ploidy-check because the fixture focuses on variant mapping and includes compact toy chrX genotypes.
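Because a PLINK 1 .bim file has one line per variant and no header, line counts give a quick per-shard summary of what was projected. The sketch below lists every .bim file that matches a shard prefix together with its variant count. `summarize_bim_shards` is a hypothetical helper, and it matches shards with a glob rather than assuming particular contig labels in the output names.

```shell
# Sketch: print "<bim path><TAB><variant count>" for every shard of a prefix.
# summarize_bim_shards is a hypothetical helper, not a genomatch command.
summarize_bim_shards() {
    # $1 is a shard prefix such as out/tutorial-1/cohort
    for _bim in "$1".*.bim; do
        # If the glob matched nothing, the literal pattern is returned; skip it.
        [ -f "$_bim" ] || continue
        # A .bim file has no header, so the line count is the variant count.
        printf '%s\t%s\n' "$_bim" "$(awk 'END { print NR }' "$_bim")"
    done
}
```

Running `summarize_bim_shards examples/tutorial-1/cohort` and `summarize_bim_shards out/tutorial-1/cohort` side by side shows how many variants each input shard started with and how many each output shard retained.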
The checked-in fixture is generated from local reference assets:
MATCH_CONFIG=/path/to/ref/config.yaml \
python scripts/generate_tutorial_1_fixture.py
The generator rewrites every file under examples/tutorial-1/, including README.md and manifest.tsv, and validates the tutorial command path before completing.