2. Simulate Data
FastGxC provides a built-in data simulator for generating synthetic datasets in the package’s required input format. The simulator enables (i) verification of a successful installation, (ii) validation of input/output schemas and pipeline wiring, and (iii) controlled end-to-end testing prior to analysis of empirical data. Users may specify sample size, numbers of genes and variants, context structure, allele frequencies, heritability, missingness, and residual correlation to reproduce a range of scenarios. For reproducibility, a random seed can be set through the seed argument.
simulate_data() generates synthetic genotype and gene expression data across user-defined contexts (e.g., cell types or tissues) in the exact input format required by FastGxC.
# Directory to store the simulated data and number of contexts
data_dir_sim <- "simulated_example/"
n_contexts <- 10
simulate_data(
  data_dir = data_dir_sim,               # Output directory
  N = 300,                               # Number of individuals
  n_genes = 100,                         # Number of genes
  n_snps_per_gene = 1000,                # Number of cis SNPs per gene
  n_contexts = n_contexts,               # Number of contexts
  maf = 0.2,                             # Minor allele frequency of each SNP
  w_corr = 0.2,                          # Within-individual residual correlation across contexts
  missing = 0.05,                        # Fraction of missing values in the expression matrix (0: no missing values)
  hsq = c(rep(0, n_contexts - 1), 0.2),  # eQTL heritability per context
  mus = rep(0, n_contexts),              # Mean expression per context
  cisDist = 1e6,                         # cis window
  seed = 1                               # Random seed for reproducibility
)
Note: Ensure that data_dir_sim is writable (it will be created if absent).
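The hsq and mus arguments in the call above are vectors rather than scalars. The sketch below builds them on their own, without calling FastGxC, to show what the expressions evaluate to. It assumes, as the argument comments suggest, that each vector carries one entry per context; the name active_contexts is only illustrative.

```r
n_contexts <- 10

# Heritability of 0 in the first nine contexts and 0.2 in the last one,
# built with the same expression that is passed to simulate_data() above.
hsq <- c(rep(0, n_contexts - 1), 0.2)

# Mean expression of 0 in every context.
mus <- rep(0, n_contexts)

# Assumption: both vectors are expected to have one entry per context.
stopifnot(length(hsq) == n_contexts, length(mus) == n_contexts)

# Indices of the contexts with nonzero heritability (illustrative name).
active_contexts <- which(hsq > 0)
print(active_contexts)
```

With these settings only the last context has nonzero eQTL heritability, so the example call presumably simulates an effect that is present in a single context.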
The simulator writes four tab-delimited files to data_dir_sim. Unless noted, missing values are encoded as NA.
SNPs.txt - Genotype matrix (SNPs x individuals).
Rows index SNPs; columns index individuals; entries are minor-allele counts (0, 1, or 2) or NA.
snpid ind1 ind2 ind3 ind4 ind5 ...
snp1 0 2 1 NA 0 ...
snp2 1 2 NA 1 1 ...
snp3 0 1 0 1 1 ...
snp4 1 NA 0 1 1 ...
snp5 0 0 0 1 1 ...
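To check a genotype file against the layout described above, a minimal sketch in base R could look like the following. It parses a few inline rows copied from the sample rather than the real file, so it runs without any simulator output; to inspect actual output, the text argument would be replaced by the path to SNPs.txt. The variable names are illustrative.

```r
# A few tab-delimited rows in the SNPs.txt layout (copied from the sample above).
snp_lines <- c(
  "snpid\tind1\tind2\tind3\tind4\tind5",
  "snp1\t0\t2\t1\tNA\t0",
  "snp2\t1\t2\tNA\t1\t1",
  "snp3\t0\t1\t0\t1\t1"
)

# First column (snpid) becomes the row names; "NA" is read as missing.
snps <- read.delim(text = snp_lines, row.names = 1, na.strings = "NA")
geno <- as.matrix(snps)

# Every entry should be 0, 1, 2, or missing.
stopifnot(all(is.na(geno) | geno %in% 0:2))

# Fraction of missing genotypes and per-SNP frequency of the counted allele.
missing_rate <- mean(is.na(geno))
allele_freq <- rowMeans(geno, na.rm = TRUE) / 2

print(missing_rate)
print(allele_freq)
```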
snpsloc.txt - SNP annotations and context mask.
Contains snpid, chromosome, position, followed by one column per context. Context columns use 1/0 to indicate whether the SNP is tested in that context.
snpid chr pos context1 context2 ... context10
snp1 chr1 1 1 1 1
snp2 chr1 1 1 1 0
snp3 chr1 1 1 0 1
snp4 chr1 1 0 1 1
snp5 chr1 1 1 1 1
expression.txt - Expression matrix (design × genes).
Each row corresponds to an individual–context pairing in the design column, followed by one column per gene.
design gene1 gene2 gene3 ... gene100
ind1 - context1 0.4369 NA -1.8113 ... 1.1721
ind1 - context10 0.6437 -0.4357 -0.8296 ... -0.3944
ind1 - context2 0.1092 -0.1294 -0.3222 ... -0.4939
ind1 - context3 NA -0.1947 -0.1111 ... -0.9999
ind1 - context4 0.8347 -0.9876 NA ... -0.3833
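Downstream code often needs the individual and the context as separate columns. The sketch below splits the design labels shown in the sample, assuming the separator is exactly " - " as displayed there (the separator in a real file should be confirmed before relying on this). It also shows that the rows are in alphabetical order, so context10 sorts before context2, and how to recover numeric order. All names are illustrative.

```r
# Design labels in the order they appear in the sample above.
design <- c("ind1 - context1", "ind1 - context10", "ind1 - context2")

# Split each label on the literal separator " - " (assumed from the sample).
parts <- do.call(rbind, strsplit(design, " - ", fixed = TRUE))
design_df <- data.frame(
  individual = parts[, 1],
  context = parts[, 2],
  stringsAsFactors = FALSE
)

# Recover numeric context order from the trailing number in each context name.
context_number <- as.integer(sub("^context", "", design_df$context))
natural_order <- order(context_number)

print(design_df[natural_order, ])
```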
geneloc.txt - Gene annotations and context mask.
Contains geneid, chromosome, and the start (s1) and end (s2) positions of the gene, followed by one column per context. Context columns use 1/0 to indicate whether the gene is tested in that context.
geneid chr s1 s2 context1 context2 context3 context4 context5 context6
gene1 chr1 0 1001 1 1 1 1 1 1
gene2 chr1 100000001 100001001 1 1 1 1 1 1
gene3 chr1 200000001 200001001 1 1 1 1 1 1
gene4 chr1 300000001 300001001 1 1 0 1 1 1
gene5 chr1 400000001 400001001 1 1 1 0 1 1
gene6 chr1 500000001 500001001 1 1 1 1 0 1
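The 1/0 context columns can be turned into a logical mask to list which genes are tested in a given context. The following is a minimal sketch on two inline rows in the geneloc.txt layout, truncated to three context columns for brevity; it is an illustration of reading the mask, not FastGxC's own code, and the variable names are hypothetical.

```r
# Two rows in the geneloc.txt layout, truncated to three context columns.
geneloc_lines <- c(
  "geneid\tchr\ts1\ts2\tcontext1\tcontext2\tcontext3",
  "gene3\tchr1\t200000001\t200001001\t1\t1\t1",
  "gene4\tchr1\t300000001\t300001001\t1\t1\t0"
)
geneloc <- read.delim(text = geneloc_lines, stringsAsFactors = FALSE)

# Pick out the context columns and convert 1/0 to TRUE/FALSE.
context_cols <- grep("^context", names(geneloc), value = TRUE)
mask <- as.matrix(geneloc[, context_cols]) == 1
rownames(mask) <- geneloc$geneid

# Genes tested in context3, and the number of genes tested per context.
tested_in_context3 <- rownames(mask)[mask[, "context3"]]
genes_per_context <- colSums(mask)

print(tested_in_context3)
print(genes_per_context)
```

The same pattern applies to the context columns of snpsloc.txt, which use the same 1/0 encoding.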