T. Quinn Smith edited this page Feb 18, 2026 · 18 revisions

EGGS v1.0

Written by T. Quinn Smith

Principal Investigator: Zachary A. Szpiech

The Pennsylvania State University

Building

Precompiled executables are found in the bin/ directory of the git repo.

Building on macOS and Linux

make

The compiled executable will be in the bin/ directory.

Options


EGGS v1.0
---------

Written by T. Quinn Smith
Principal Investigator: Zachary A. Szpiech
The Pennsylvania State University

Usage: eggs [OPTIONS]

Reads from stdin and writes to stdout by default.
OPTIONS:
    -h,--help                       Print help and exit.
    -e,--eigenstrat GENO,SNP,IND    Ignores all other options except -o. Reads in EIGENSTRAT/ANCESTRYMAP files
                                         and outputs VCF equivalent. Assumes all sites have REF and ALT alleles.
                                         No hash checking is performed as in ADMIXTOOLS.
    -u,--unphase                    Left and right genotypes are swapped with a probability of 0.5
    -p,--unpolarize                 Biallelic site alleles swapped with a probability of 0.5
    -s,--pseudohap                  Pseudohaploidize all samples. Automatically removes phase.
                                         When one allele is present for a sample, that allele is used.
    -g,--seqerr     DOUBLE          Simulate next-generation sequencing error. For biallelic sites, switch allele
                                         to other allele with probability DOUBLE. Ignored when -d or -s is used.
    -o,--out        STR             Basename to use for output files instead of stdout.
    -m,--mask       VCF             Filename to use as mask for missing genotypes.
                                         Number of records must be greater than input.
    -b,--beta       VCF/STR         Calculate mu/sigma of missingness per site from VCF or supply
                                        values as "mu,sigma". Defines beta distribution for missingness.
    -r,--random     VCF             Calculates proportion of missing samples per site from file and uses that
                                        distribution to randomly introduce missing genotypes.
    -d,--deamin     STR             Two comma-separated proportions "prob1,prob2" where prob1 is the probability
                                        the site is a transition and prob2 is the probability of deamination.
    -l,--length     INT             Only used with ms-style input. Sets length of segment in base-pairs.
                                        Default 1,000,000 base-pairs if not provided or invalid.
    -a,--hap                        Split diploid to separate samples.
                                        For ms-style input, lineages are their own sample in VCF output.
    -x,--ms                         Output ms-style replicates. Cannot use missing data options.
                                        If a genotype is missing, then the ancestral is used. If multiallelic VCF site,
                                        then any alternative alleles are treated as derived.
    -t,--stats                      Print summary statistics for missingness at the individual and locus level.
                                        With -o, produce .ind.tsv and .loci.tsv files.
                                        Ignore other options. Only used with VCF input.
    -k,--keep                       Keep INFO tags in header and in VCF records.

Functionality

EGGS reads plain-text or gzip-compressed input from stdin and writes gzip-compressed output to stdout.

Example Data

The examples/ directory contains three files: empirical.vcf.gz, sim.vcf.gz, and sim.ms.gz. empirical.vcf.gz contains missing genotypes. sim.vcf.gz is a simulated segment without missing genotypes. sim.ms.gz contains 10 replicates of 200 haploid individuals in ms format.

Examples

The following examples demonstrate EGGS's functionality.

Convert EIGENSTRAT/ANCESTRYMAP to VCF

eggs -e 1240k_public.geno,1240k_public.snp,1240k_public.ind | zcat | bgzip -c > 1240k.vcf.bgz

NOTE: For larger EIGENSTRAT files, such as 1240k, this command takes about 45 minutes to complete because EGGS is single-threaded. The output is deliberately recompressed with bgzip so that the VCF can be indexed and samples can be extracted later.

Unphase Genotypes

With 50% probability, swap the left and right alleles of each sample's genotype.

eggs -u < sim.vcf.gz
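
As a rough illustration of what unphasing does to a single genotype, here is a minimal Python sketch. It is not EGGS's code: the function name `unphase` is hypothetical, and the choice to emit the '/' separator is an assumption.

```python
import random

def unphase(genotype, rng):
    """Swap the two alleles with probability 0.5 and drop the phase marker."""
    left, right = genotype.replace("|", "/").split("/")
    if rng.random() < 0.5:
        left, right = right, left
    # Assumption: unphased output uses the VCF '/' separator.
    return f"{left}/{right}"

rng = random.Random(1)  # seeded only so the sketch is repeatable
phased = ["0|1", "1|0", "0|0", "1|1"]
unphased = [unphase(g, rng) for g in phased]
print(unphased)
```

Homozygous genotypes are unaffected by the swap; only heterozygous genotypes lose their left/right information.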

Unpolarize Genotypes

With 50% probability at each site, swap the ancestral allele '0' with the derived allele '1'.

eggs -p < sim.vcf.gz
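
The per-site swap can be pictured with the following Python sketch. It only relabels genotype fields and makes no claim about how EGGS treats the REF and ALT columns; `unpolarize_site` is a hypothetical name.

```python
import random

SWAP = str.maketrans("01", "10")

def unpolarize_site(genotypes, rng):
    """With probability 0.5, relabel '0' as '1' and '1' as '0' across the whole site."""
    if rng.random() < 0.5:
        return [g.translate(SWAP) for g in genotypes]
    return list(genotypes)

rng = random.Random(7)  # seeded only so the sketch is repeatable
site = ["0|0", "0|1", "1|1", ".|."]
flipped = ["1|1", "1|0", "0|0", ".|."]
results = [unpolarize_site(site, rng) for _ in range(20)]
print(results[0])
```

The decision is made once per site, so every sample at that site is relabeled together; missing alleles are left untouched.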

Pseudohaploidize

At all sites, for each heterozygous sample, make the sample '0/0' or '1/1' with equal probability.

eggs -s < sim.vcf.gz
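
A minimal Python sketch of this rule is shown below, including the case from the help text where only one allele is present. The function name is hypothetical, and the handling of fully missing genotypes is an assumption.

```python
import random

def pseudohaploidize(genotype, rng):
    """Return a homozygous, unphased genotype built from one randomly drawn allele."""
    alleles = genotype.replace("|", "/").split("/")
    called = [a for a in alleles if a != "."]
    if not called:
        return "./."  # assumption: a fully missing genotype stays missing
    pick = rng.choice(called)  # with one called allele, that allele is used
    return f"{pick}/{pick}"

rng = random.Random(3)  # seeded only so the sketch is repeatable
examples = ["0|0", "0|1", "1/1", "./1", "./."]
print([pseudohaploidize(g, rng) for g in examples])
```

Because the result is always homozygous, phase information is removed as a side effect.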

Deamination

Each site is treated as a transition with probability 0.7, and each reference allele then deaminates to the alternative allele with probability 0.05.

eggs -d 0.7,0.05 < sim.vcf.gz
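
The two-step model can be sketched in Python as follows. This is one reading of the description, under the assumption that deamination is applied only at sites drawn as transitions; `deaminate_site` and its parameter names are hypothetical.

```python
import random

def deaminate_site(genotypes, p_transition, p_deam, rng):
    """Apply the two-step deamination model to one site's genotypes."""
    # Step 1: the site counts as a transition with probability p_transition.
    if rng.random() >= p_transition:
        return list(genotypes)
    # Step 2: each reference allele ('0') independently becomes '1'
    # with probability p_deam; other alleles and '.' are left as-is.
    out = []
    for g in genotypes:
        sep = "|" if "|" in g else "/"
        alleles = ["1" if a == "0" and rng.random() < p_deam else a
                   for a in g.split(sep)]
        out.append(sep.join(alleles))
    return out

rng = random.Random(5)  # seeded only so the sketch is repeatable
site = ["0|0", "0|1", "1/1", "./."]
print(deaminate_site(site, 0.7, 0.05, rng))
```

With the values from the command above, a site is modified only occasionally: it must first be drawn as a transition (0.7), and each reference allele is then converted with probability 0.05.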

Missingness Statistics

Print summary statistics for './.' at the individual and locus level.

eggs -t < empirical.vcf.gz
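
To make "individual level" and "locus level" concrete, the Python sketch below computes the proportion of missing genotypes per sample and per site for a tiny genotype matrix. The layout and field names of EGGS's actual report are not reproduced here; `missingness` is a hypothetical helper.

```python
MISSING = {"./.", ".|.", "."}

def missingness(matrix):
    """matrix[i][j] is the genotype of sample j at site i."""
    n_sites, n_samples = len(matrix), len(matrix[0])
    # Locus level: fraction of samples missing at each site.
    per_locus = [sum(g in MISSING for g in row) / n_samples for row in matrix]
    # Individual level: fraction of sites missing for each sample.
    per_individual = [sum(row[j] in MISSING for row in matrix) / n_sites
                      for j in range(n_samples)]
    return per_individual, per_locus

genotypes = [
    ["0/0", "./.", "0/1", "./."],
    ["0/1", "./.", "1/1", "0/0"],
]
per_individual, per_locus = missingness(genotypes)
print(per_individual)
print(per_locus)
```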

Split Lineages

For ms-style input, each lineage becomes a sample in the resulting VCF. For VCF input, the diploid samples are split to separate samples.

eggs -a < sim.vcf.gz

or

eggs -a < sim.ms.gz
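
For VCF input, the split can be pictured as in the Python sketch below. The `_1`/`_2` sample-name suffixes are a hypothetical naming scheme chosen for illustration, not necessarily the names EGGS writes.

```python
def split_diploids(sample_names, genotypes):
    """Turn each diploid genotype at a site into two haploid samples."""
    new_names, new_genotypes = [], []
    for name, g in zip(sample_names, genotypes):
        left, right = g.replace("|", "/").split("/")
        # Hypothetical naming scheme: one suffix per haplotype.
        new_names += [f"{name}_1", f"{name}_2"]
        new_genotypes += [left, right]
    return new_names, new_genotypes

names, haploid = split_diploids(["A", "B"], ["0|1", "1|1"])
print(names)
print(haploid)
```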

Convert to MS-style output

If a genotype is missing, the ancestral allele ('0') is used. At multiallelic VCF sites, any alternative allele is treated as derived ('1').

eggs -x < sim.vcf.gz
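
The recoding rule for a single site can be sketched in Python as follows. The helper names are hypothetical, and the sketch covers only the allele recoding, not the rest of the ms output format.

```python
def to_ms_allele(allele):
    """Missing and reference alleles become '0'; any alternative allele becomes '1'."""
    return "0" if allele in (".", "0") else "1"

def site_to_ms_column(genotypes):
    """Flatten one site's diploid genotypes into one 0/1 state per haplotype."""
    column = []
    for g in genotypes:
        for allele in g.replace("|", "/").split("/"):
            column.append(to_ms_allele(allele))
    return column

column = site_to_ms_column(["0|1", "./.", "2|0"])
print(column)
```

Here the missing genotype contributes two ancestral states, and the second alternative allele ('2') is collapsed to derived.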

Random Missing Genotypes

Introduce missing genotypes ('./.') for each sample according to the empirical distribution captured in empirical.vcf.gz.

eggs -r empirical.vcf.gz < sim.vcf.gz
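
One plausible reading of this option, sketched in Python: for each simulated site, draw a missing proportion from the per-site proportions observed in the empirical file, then mask each genotype with that probability. This is an assumption about the procedure, not a description of EGGS's internals, and `add_random_missing` is a hypothetical name.

```python
import random

def add_random_missing(matrix, empirical_props, rng):
    """matrix[i][j] is the genotype of sample j at site i."""
    out = []
    for row in matrix:
        p = rng.choice(empirical_props)  # this site's missing proportion
        out.append(["./." if rng.random() < p else g for g in row])
    return out

rng = random.Random(9)  # seeded only so the sketch is repeatable
simulated = [["0|1", "1|1", "0|0"], ["0|0", "0|1", "1|1"]]
empirical_props = [0.0, 0.1, 0.5]  # illustrative values, not taken from empirical.vcf.gz
masked = add_random_missing(simulated, empirical_props, rng)
print(masked)
```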

Beta Distribution Missing Genotypes

Introduce missing genotypes ('./.') for each sample according to a beta distribution parameterized from empirical.vcf.gz. When a file is supplied, the mean and standard deviation of missingness per site are calculated for you, and the calculated values appear in the header. You can also supply the mean and standard deviation explicitly as "mean,std-dev".

eggs -b empirical.vcf.gz < sim.vcf.gz

or

eggs -b 0.1,0.2 < sim.vcf.gz
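
A mean and standard deviation determine a beta distribution through the standard method-of-moments relations. The Python sketch below applies them to the values 0.1 and 0.2 from the command above; whether EGGS uses exactly this conversion is an assumption, and `beta_shape_from_moments` is a hypothetical name.

```python
import random

def beta_shape_from_moments(mean, sd):
    """Method-of-moments shape parameters (alpha, beta) of a beta distribution."""
    var = sd * sd
    if not 0.0 < mean < 1.0 or not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("mean/sd pair is not attainable by a beta distribution")
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common

alpha, beta = beta_shape_from_moments(0.1, 0.2)
rng = random.Random(11)  # seeded only so the sketch is repeatable
# One draw per site gives that site's proportion of missing samples.
site_missingness = [rng.betavariate(alpha, beta) for _ in range(5)]
print(alpha, beta)
```

For these inputs the shape parameters are approximately alpha = 0.125 and beta = 1.125, which describes a distribution where most sites have little missingness and a few have a lot.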

Dispersal and Distribution (EGGS's Method)

See the manuscript for EGGS's method.

eggs -m empirical.vcf.gz < sim.vcf.gz

Convert ms-style to VCF

Split ms replicates into separate VCF files, treating each replicate as a 1 Mb segment.

eggs -l 1000000 < sim.ms.gz
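
ms reports segregating-site positions as fractions of the simulated segment, so a segment length is needed to obtain base-pair coordinates. The Python sketch below shows one simple mapping. The rounding and the rule that keeps positions strictly increasing are assumptions, not EGGS's documented behavior, and `ms_positions_to_bp` is a hypothetical name.

```python
def ms_positions_to_bp(fractions, length=1_000_000):
    """Map sorted fractional ms positions to 1-based, strictly increasing coordinates."""
    positions, previous = [], 0
    for frac in fractions:
        # Assumption: truncate to an integer, then bump past the previous
        # site so that no two records share a coordinate.
        pos = max(int(frac * length), previous + 1)
        positions.append(pos)
        previous = pos
    return positions

bp = ms_positions_to_bp([0.0, 0.0000001, 0.5, 0.5, 0.75])
print(bp)
```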

Make Replicates Look Like aDNA

We can convert many simulated replicates into separate VCF files that contain the kinds of error seen in ancient DNA (aDNA). Here, we introduce missing data according to a beta distribution, unphase the samples, unpolarize the sites, introduce deamination, and pseudohaploidize each sample.

eggs -b 0.1,0.2 -u -p -d 0.7,0.05 -s < sim.ms.gz

Citation