
2. Simulate Data

Lena Krockenberger edited this page Feb 19, 2026 · 31 revisions

FastGxC provides a built-in data simulator that generates synthetic datasets in the package’s required input format. The simulator supports (i) verifying a successful installation, (ii) validating input/output schemas and pipeline wiring, and (iii) controlled end-to-end testing before analyzing empirical data. Users can specify the sample size, the number of genes and of cis SNPs per gene, the number of contexts, the minor allele frequency, per-context heritability and mean expression, missingness, and within-individual residual correlation to reproduce a range of scenarios. For reproducibility, set the random seed through the seed argument.

Simulate data

simulate_data() generates synthetic genotype and gene expression data across user-defined contexts (e.g., cell types or tissues) in the exact input format required by FastGxC.

# Directory to store the simulated data and number of contexts
data_dir_sim <- "simulated_example/"
n_contexts   <- 10
  
simulate_data(
  data_dir         = data_dir_sim,                  # Output directory
  N                = 300,                           # Number of individuals
  n_genes          = 100,                           # Number of genes
  n_snps_per_gene  = 1000,                          # Number of cis SNPs per gene
  n_contexts       = n_contexts,                    # Number of contexts
  maf              = 0.2,                           # Minor allele frequency of each SNP
  w_corr           = 0.2,                           # Within-individual residual correlation across contexts
  missing          = 0.05,                          # Fraction of missing values in the expression matrix (0: no missing values)
  hsq              = c(rep(0, n_contexts-1), 0.2),  # eQTL heritability per context
  mus              = rep(0,   n_contexts),          # Mean expression per context
  cisDist          = 1e6,                           # cis window
  seed             = 1                              # seed for reproducibility
)

Note: Ensure that data_dir_sim is writable; the directory is created if it does not exist.
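To explore other scenarios, the hsq vector can be built in different ways before it is passed to simulate_data(). The sketch below only constructs and sanity-checks such vectors in base R. The variable names are illustrative, and it is an assumption that the simulator accepts any vector with one heritability value per context; consult the function documentation before relying on it.

```r
n_contexts <- 10

# eQTL heritability in the last context only (the setting used in the call above)
hsq_last_only <- c(rep(0, n_contexts - 1), 0.2)

# The same heritability in every context
hsq_all_equal <- rep(0.2, n_contexts)

# No eQTL signal in any context (a null scenario)
hsq_null <- rep(0, n_contexts)

# Assumed requirement: one value per context, each in [0, 1)
scenarios <- list(last_only = hsq_last_only,
                  all_equal = hsq_all_equal,
                  null      = hsq_null)
stopifnot(all(vapply(scenarios, length, integer(1)) == n_contexts))
stopifnot(all(unlist(scenarios) >= 0 & unlist(scenarios) < 1))
```

Any of these vectors could then replace the hsq argument in the call above, with the other arguments left unchanged.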

Output Files

The simulator writes four tab-delimited files to data_dir_sim. Unless noted, missing values are encoded as NA.
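As a quick check that the files were written, they can be loaded with base R. This is a minimal sketch, not part of FastGxC: read_sim_file is a hypothetical helper, and it assumes each file has a header row and uses NA for missing values, as in the previews on this page.

```r
# Hypothetical helper: read one tab-delimited simulator output file
read_sim_file <- function(path) {
  read.delim(path, header = TRUE, sep = "\t",
             na.strings = "NA", check.names = FALSE,
             stringsAsFactors = FALSE)
}

data_dir_sim <- "simulated_example/"
files <- c("SNPs.txt", "snpsloc.txt", "expression.txt", "geneloc.txt")
paths <- file.path(data_dir_sim, files)

# Only attempt to read when all four files are present
if (all(file.exists(paths))) {
  sim <- lapply(paths, read_sim_file)
  names(sim) <- files
  print(sapply(sim, dim))  # number of rows and columns of each table
}
```

Comparing the printed dimensions against N, n_genes, n_snps_per_gene, and n_contexts is a simple way to confirm that the simulation ran as configured.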

SNPs.txt - Genotype matrix (SNPs x individuals).

Rows index SNPs; columns index individuals; entries are minor-allele counts (0/1/2) or NA.

snpid   ind1  ind2  ind3  ind4  ind5 ...
snp1     0     2     1     NA    0   ...
snp2     1     2     NA    1     1   ...
snp3     0     1     0     1     1   ...
snp4     1     NA    0     1     1   ...
snp5     0     0     0     1     1   ...
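The allele frequency and missing-call rate of each SNP can be recovered directly from a matrix in this layout. The sketch below uses a toy matrix copied from the first three rows shown above; with only five individuals the frequencies will not match the simulated maf, so it illustrates the calculation only.

```r
# Toy genotype matrix (SNPs x individuals) copied from the preview above
G <- rbind(
  snp1 = c(0, 2, 1, NA, 0),
  snp2 = c(1, 2, NA, 1, 1),
  snp3 = c(0, 1, 0, 1, 1)
)
colnames(G) <- paste0("ind", 1:5)

# Frequency of the counted allele: mean allele count / 2, ignoring NA
freq <- rowMeans(G, na.rm = TRUE) / 2

# Fraction of missing genotype calls per SNP
miss <- rowMeans(is.na(G))

print(round(freq, 3))
print(miss)
```

On the full simulated SNPs.txt, the same two lines applied to the numeric part of the table should give frequencies scattered around the maf value passed to simulate_data().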

snpsloc.txt - SNP annotations and context mask.

Contains snpid, chromosome, position, followed by one column per context. Context columns use 1/0 to indicate whether the SNP is tested in that context.

snpid   chr   pos   context1  context2  ...  context10
snp1    chr1   1        1         1     ...      1
snp2    chr1   1        1         1     ...      0
snp3    chr1   1        1         0     ...      1
snp4    chr1   1        0         1     ...      1
snp5    chr1   1        1         1     ...      1
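The context mask can be used to list which SNPs are tested in each context. The sketch below rebuilds the rows shown above as a small data frame, keeping only the three context columns that appear in the preview, and relies on the assumption that every mask column name starts with "context".

```r
# Toy snpsloc table with the three context columns visible in the preview
snpsloc <- data.frame(
  snpid     = paste0("snp", 1:5),
  chr       = "chr1",
  pos       = 1,
  context1  = c(1, 1, 1, 0, 1),
  context2  = c(1, 1, 0, 1, 1),
  context10 = c(1, 0, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Assumption: mask columns are exactly those whose names start with "context"
context_cols <- grep("^context", names(snpsloc), value = TRUE)

# SNPs flagged as tested in each context
tested <- lapply(snpsloc[context_cols],
                 function(mask) snpsloc$snpid[mask == 1])

# Number of contexts in which each SNP is tested
n_tested <- rowSums(snpsloc[context_cols] == 1)
names(n_tested) <- snpsloc$snpid

print(tested$context1)
print(n_tested)
```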

expression.txt - Expression matrix (design x genes).

Each row corresponds to an individual–context pairing in the design column, followed by one column per gene.

design            gene1     gene2      gene3    ...   gene100
ind1 - context1   0.4369    NA         -1.8113  ...   1.1721
ind1 - context10  0.6437    -0.4357    -0.8296  ...  -0.3944
ind1 - context2   0.1092    -0.1294    -0.3222  ...  -0.4939
ind1 - context3   NA        -0.1947    -0.1111  ...  -0.9999
ind1 - context4   0.8347    -0.9876    NA       ...  -0.3833
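For downstream checks it is often convenient to split the design column into separate individual and context labels. The sketch below does this on a toy table built from the first rows shown above. It assumes the separator is exactly " - " (space, hyphen, space), as it appears in the preview; the column names individual and context are illustrative.

```r
# Toy expression table copied from the preview above (two genes only)
expr <- data.frame(
  design = c("ind1 - context1", "ind1 - context10", "ind1 - context2"),
  gene1  = c(0.4369, 0.6437, 0.1092),
  gene2  = c(NA, -0.4357, -0.1294),
  stringsAsFactors = FALSE
)

# Split each label on the literal separator " - "
parts <- do.call(rbind, strsplit(expr$design, " - ", fixed = TRUE))
expr$individual <- parts[, 1]
expr$context    <- parts[, 2]

# Individuals x contexts matrix for a single gene
gene1_wide <- tapply(expr$gene1, list(expr$individual, expr$context), mean)

# Observed fraction of missing expression values
gene_cols    <- grep("^gene", names(expr), value = TRUE)
missing_frac <- mean(is.na(as.matrix(expr[gene_cols])))

print(gene1_wide)
print(missing_frac)
```

On the full file, missing_frac should be close to the missing value passed to simulate_data().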

geneloc.txt - Gene annotations and context mask.

Gene location file with geneid, chromosome, start (s1) and end (s2) positions of the gene, followed by one column per context. Context columns use 1/0 to indicate whether the gene is tested in that context.

geneid  chr   s1          s2          context1  context2  context3  context4  context5  context6
gene1   chr1  0           1001            1         1         1         1         1         1
gene2   chr1  100000001   100001001       1         1         1         1         1         1
gene3   chr1  200000001   200001001       1         1         1         1         1         1
gene4   chr1  300000001   300001001       1         1         0         1         1         1
gene5   chr1  400000001   400001001       1         1         1         0         1         1
gene6   chr1  500000001   500001001       1         1         1         1         0         1
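Gene and SNP coordinates together determine which SNP-gene pairs fall inside the cis window. The sketch below shows one common convention: a SNP is in cis if it lies within cisDist of the gene body on the same chromosome. Whether FastGxC uses exactly this definition is an assumption, and the SNP identifiers and positions are hypothetical values chosen only to illustrate the logic.

```r
cisDist <- 1e6

# First two genes from the preview above
geneloc <- data.frame(
  geneid = c("gene1", "gene2"),
  chr    = "chr1",
  s1     = c(0, 100000001),
  s2     = c(1001, 100001001),
  stringsAsFactors = FALSE
)

# Hypothetical SNP positions (not taken from the simulator output)
snpsloc <- data.frame(
  snpid = c("snpA", "snpB", "snpC"),
  chr   = "chr1",
  pos   = c(500, 100500000, 50000000),
  stringsAsFactors = FALSE
)

# All gene-SNP combinations on the same chromosome
pairs <- merge(geneloc, snpsloc, by = "chr")

# Assumed cis definition: SNP within cisDist of the gene start or end
in_cis    <- pairs$pos >= pairs$s1 - cisDist & pairs$pos <= pairs$s2 + cisDist
cis_pairs <- pairs[in_cis, c("geneid", "snpid")]

print(cis_pairs)
```

Here snpA pairs with gene1, snpB pairs with gene2, and snpC lies outside both windows.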
