Home
Written by T. Quinn Smith
Principal Investigator: Zachary A. Szpiech
The Pennsylvania State University
Precompiled executables are in the bin/ directory of the git repository. To build from source, run:
make
The compiled executable will be in the bin/ directory.
EGGS v1.0
---------
Written by T. Quinn Smith
Principal Investigator: Zachary A. Szpiech
The Pennsylvania State University
Usage: eggs [OPTIONS]
Reads from stdin and writes to stdout by default.
OPTIONS:
-h,--help Print help and exit.
-e,--eigenstrat GENO,SNP,IND Ignores all other options except -o. Reads in EIGENSTRAT/ANCESTRYMAP files
and outputs VCF equivalent. Assumes all sites have REF and ALT alleles.
No hash checking is performed as in ADMIXTOOLS.
-u,--unphase Left and right genotypes are swapped with a probability of 0.5
-p,--unpolarize Biallelic site alleles swapped with a probability of 0.5
-s,--pseudohap Pseudohaploidize all samples. Automatically removes phase.
When one allele is present for a sample, that allele is used.
-g,--seqerr DOUBLE Simulate next-generation sequencing error. For biallelic sites, switch allele
to other allele with probability DOUBLE. Ignored when -d or -s is used.
-o,--out STR Basename to use for output files instead of stdout.
-m,--mask VCF Filename to use as mask for missing genotypes.
Number of records must be greater than input.
-b,--beta VCF/STR Calculate mu/sigma of missingness per site from VCF or supply
values as "mu,sigma". Defines beta distribution for missingness.
-r,--random VCF Calculates proportion of missing samples per site from file and uses that
distribution to randomly introduce missing genotypes.
-d,--deamin STR Two comma-separated proportions "prob1,prob2" where prob1 is the probability
the site is a transition and prob2 is the probability of deamination.
-l,--length INT Only used with ms-style input. Sets length of segment in base-pairs.
Default 1,000,000 base-pairs if not provided or invalid.
-a,--hap Split diploid to separate samples.
For ms-style input, lineages are their own sample in VCF output.
-x,--ms Output ms-style replicates. Cannot use missing data options.
If a genotype is missing, then the ancestral is used. If multiallelic VCF site,
then any alternative alleles are treated as derived.
-t,--stats Print summary statistics for missingness at the individual and locus level.
With -o, produce .ind.tsv and .loci.tsv files.
Ignore other options. Only used with VCF input.
-k,--keep Keep INFO tags in header and in VCF records.
EGGS reads plain text or GZIP-compressed input from stdin and writes GZIP-compressed output to stdout.
The examples/ directory contains three files: empirical.vcf.gz, sim.vcf.gz, and sim.ms.gz. empirical.vcf.gz contains missing genotypes. sim.vcf.gz is a segment without missing genotypes. sim.ms.gz contains 10 replicates of 200 haploid individuals in ms formatting.
The examples below demonstrate EGGS's functionality.
Convert EIGENSTRAT/ANCESTRYMAP files to VCF.
eggs -e 1240k_public.geno,1240k_public.snp,1240k_public.ind | zcat | bgzip -c > 1240k.vcf.bgz
NOTE: For larger EIGENSTRAT files, such as 1240k, this command takes about 45 minutes to complete because EGGS is single-threaded. We deliberately recompress the output with bgzip so that the file can be indexed and samples can be extracted later.
With 50% probability, swap the left and right alleles for each sample.
eggs -u < sim.vcf.gz
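To make the operation concrete, here is a minimal Python sketch of unphasing a single diploid genotype. It is an illustration only, not EGGS's implementation: the function name `unphase` and the choice to write the result with the unphased separator `/` are assumptions.

```python
import random

def unphase(genotype, rng):
    """Swap the two alleles of a phased diploid genotype with probability 0.5.

    `genotype` is a VCF GT string such as "0|1". The result is written with
    "/" to mark it as unphased (an assumption made for this sketch).
    """
    left, right = genotype.split("|")
    if rng.random() < 0.5:
        left, right = right, left
    return f"{left}/{right}"

rng = random.Random(1)  # fixed seed so the sketch is reproducible
print([unphase(gt, rng) for gt in ["0|1", "1|0", "0|0", "1|1"]])
```

Homozygous genotypes are unaffected by the swap, so only heterozygous calls lose their phase information.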
With 50% probability at each site, swap the ancestral allele ('0') with the derived allele ('1').
eggs -p < sim.vcf.gz
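The per-site swap can be sketched in Python as follows. This is a hedged illustration of the idea, not EGGS's code: `unpolarize_site` is a hypothetical name, only the genotype labels are shown, and how EGGS treats the REF and ALT columns is not covered here.

```python
import random

def unpolarize_site(genotypes, rng):
    """With probability 0.5, swap the labels '0' and '1' for every sample
    at one biallelic site. Missing alleles ('.') are left untouched."""
    if rng.random() >= 0.5:
        return list(genotypes)  # site keeps its original polarization
    swap = str.maketrans("01", "10")
    return [gt.translate(swap) for gt in genotypes]

site = ["0|1", "1|1", ".|."]
print(unpolarize_site(site, random.Random(1)))
```

The decision is made once per site, so all samples at a site are either swapped together or left as they are.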
At all sites, for each heterozygous sample, make the sample '0/0' or '1/1' with equal probability.
eggs -s < sim.vcf.gz
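A minimal Python sketch of pseudohaploidization for one genotype is shown below. It follows the description above and the help text (a single present allele is used as-is), but the function name `pseudohaploidize` and the homozygous output format are assumptions made for illustration.

```python
import random
import re

def pseudohaploidize(genotype, rng):
    """Pick one observed allele at random and report it as a homozygous,
    unphased genotype. Fully missing genotypes stay missing."""
    alleles = [a for a in re.split(r"[|/]", genotype) if a != "."]
    if not alleles:
        return "./."
    pick = rng.choice(alleles)  # heterozygotes: each allele with probability 0.5
    return f"{pick}/{pick}"

rng = random.Random(7)
print([pseudohaploidize(gt, rng) for gt in ["0|1", "1|1", "0/.", "./."]])
```

Because the result carries a single allele per sample, any phase in the input is discarded.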
Each site is treated as a transition with probability 0.7; at such sites, each reference allele then deaminates to the alternative allele with probability 0.05.
eggs -d 0.7,0.05 < sim.vcf.gz
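The two-stage process can be sketched in Python. This is an illustrative reading of the description above, not EGGS's implementation; `deaminate_site` and its argument names are hypothetical.

```python
import random

def deaminate_site(alleles, p_transition, p_deamination, rng):
    """Simulate deamination at one biallelic site.

    alleles: list of allele calls ('0' reference, '1' alternative, '.' missing).
    Stage 1: the site is treated as a transition with probability p_transition.
    Stage 2: at a transition site, each '0' becomes '1' with probability
    p_deamination.
    """
    if rng.random() >= p_transition:
        return list(alleles)  # not a transition site: nothing changes
    return [
        "1" if a == "0" and rng.random() < p_deamination else a
        for a in alleles
    ]

rng = random.Random(42)
print(deaminate_site(["0", "0", "1", ".", "0"], 0.7, 0.05, rng))
```

With the values used in the command above, a given reference allele is converted with overall probability 0.7 × 0.05 = 0.035.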
Print summary statistics for missing genotypes ('./.') at the individual and locus level.
eggs -t < empirical.vcf.gz
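As a rough illustration of what individual-level and locus-level missingness mean, the sketch below computes both proportions from a small genotype matrix. The function name, the input layout, and the set of strings counted as missing are assumptions; the exact statistics EGGS reports may differ.

```python
def missingness_summary(genotypes):
    """genotypes[i][j] is the GT string of individual j at locus i.

    Returns (per_individual, per_locus) proportions of missing genotypes.
    """
    missing = {"./.", ".|.", "."}  # strings treated as missing in this sketch
    n_loci = len(genotypes)
    n_ind = len(genotypes[0])
    per_locus = [sum(gt in missing for gt in row) / n_ind for row in genotypes]
    per_individual = [
        sum(row[j] in missing for row in genotypes) / n_loci
        for j in range(n_ind)
    ]
    return per_individual, per_locus

matrix = [
    ["0/1", "./.", "1/1"],
    ["./.", "./.", "0/0"],
]
per_individual, per_locus = missingness_summary(matrix)
print(per_individual, per_locus)
```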
For ms-style input, each lineage becomes a sample in the resulting VCF. For VCF input, the diploid samples are split into separate samples.
eggs -a < sim.vcf.gz
or
eggs -a < sim.ms.gz
Output ms-style replicates. If a genotype is missing, the ancestral allele ('0') is used. At a multiallelic VCF site, any alternative allele is treated as derived ('1').
eggs -x < sim.vcf.gz
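The allele-recoding rules described above can be sketched in Python. The helper names are hypothetical and the sketch only covers the recoding of genotypes into ms-style haplotype rows; it does not reproduce the rest of the ms output format.

```python
import re

def ms_alleles(genotype):
    """Recode one VCF GT string into ms-style alleles.

    Missing ('.') and reference ('0') alleles become '0' (ancestral);
    any alternative allele ('1', '2', ...) becomes '1' (derived).
    """
    return ["0" if a in (".", "0") else "1" for a in re.split(r"[|/]", genotype)]

def haplotype_rows(sites):
    """sites[i][j] is the GT of diploid sample j at site i.
    Returns one 0/1 string per haplotype, with one character per site."""
    n_samples = len(sites[0])
    rows = [[] for _ in range(2 * n_samples)]
    for site in sites:
        for j, gt in enumerate(site):
            first, second = ms_alleles(gt)
            rows[2 * j].append(first)
            rows[2 * j + 1].append(second)
    return ["".join(r) for r in rows]

print(haplotype_rows([["0|1", "./."], ["2|0", "1|1"]]))
```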
Introduce missing genotypes ('./.') for each sample according to the empirical distribution of per-site missingness captured in empirical.vcf.gz.
eggs -r empirical.vcf.gz < sim.vcf.gz
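One way to picture this option is the sketch below: draw a per-site missing proportion from the empirical values, then mask each genotype at that site independently with that probability. This sampling scheme, the function name, and the argument layout are all assumptions for illustration; see the manuscript for what EGGS actually does.

```python
import random

def mask_from_empirical(sim_sites, empirical_props, rng):
    """sim_sites[i][j] is the GT of sample j at simulated site i.
    empirical_props is a list of per-site missing proportions measured
    on an empirical data set (hypothetical input for this sketch)."""
    masked = []
    for site in sim_sites:
        p = rng.choice(empirical_props)  # missing proportion for this site
        masked.append(["./." if rng.random() < p else gt for gt in site])
    return masked

sim = [["0|1", "1|1", "0|0"], ["0|0", "0|1", "1|0"]]
print(mask_from_empirical(sim, [0.0, 0.3, 0.6], random.Random(5)))
```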
Introduce missing genotypes ('./.') for each sample according to the beta distribution parameterized by empirical.vcf.gz. When a file is supplied, the mean and standard deviation of missingness per site are calculated for you, and the calculated values appear in the header. You can also supply the mean and standard deviation explicitly as "mu,sigma".
eggs -b empirical.vcf.gz < sim.vcf.gz
or
eggs -b 0.1,0.2 < sim.vcf.gz
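For intuition, a mean and standard deviation can be turned into beta shape parameters with the standard method-of-moments formulas. Whether EGGS uses exactly this conversion is an assumption; the sketch below only shows the textbook relationship, and `beta_params` is a hypothetical helper.

```python
import random

def beta_params(mu, sigma):
    """Method-of-moments shape parameters (alpha, beta) of a beta
    distribution with mean mu and standard deviation sigma."""
    var = sigma ** 2
    if not 0 < mu < 1 or var <= 0 or var >= mu * (1 - mu):
        raise ValueError("need 0 < mu < 1 and 0 < sigma^2 < mu * (1 - mu)")
    common = mu * (1 - mu) / var - 1
    return mu * common, (1 - mu) * common

alpha, beta = beta_params(0.1, 0.2)
rng = random.Random(11)
# Draw a per-site missing proportion for a handful of sites.
print(alpha, beta, [round(rng.betavariate(alpha, beta), 3) for _ in range(5)])
```

The constraint in `beta_params` is worth noting: not every "mu,sigma" pair corresponds to a valid beta distribution, because the variance must be smaller than mu × (1 − mu).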
Use empirical.vcf.gz as a mask for missing genotypes. See the manuscript for the details of EGGS's method.
eggs -m empirical.vcf.gz < sim.vcf.gz
Split ms replicates into separate VCF files, each with a segment length of 1 Mb.
eggs -l 1000000 < sim.ms.gz
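In ms output, segregating-site positions are given as fractions of the segment between 0 and 1, so a segment length is needed to place them on a base-pair scale. The sketch below shows one plausible mapping; the truncation and the rule that bumps colliding positions forward are assumptions, not necessarily what EGGS does.

```python
def ms_positions_to_bp(positions, length=1_000_000):
    """Map ms fractional positions (sorted, in [0, 1)) to 1-based,
    strictly increasing base-pair coordinates on a segment of `length` bp."""
    bp = []
    previous = 0
    for frac in positions:
        # Truncate to an integer coordinate, then make sure it does not
        # collide with (or precede) the previous site.
        pos = max(int(frac * length), previous + 1)
        bp.append(pos)
        previous = pos
    return bp

print(ms_positions_to_bp([0.0, 0.25, 0.5], length=1000))
```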
We can convert many simulated replicates into separate VCF files that contain errors typical of ancient DNA (aDNA). Here, we introduce missing data according to a beta distribution, unphase the samples, unpolarize the sites, introduce deamination, and pseudohaploidize each sample.
eggs -b 0.1,0.2 -u -p -d 0.7,0.05 -s < sim.ms.gz