This repository contains code to reproduce the empirical results from our paper.
- Snakemake
- Matplotlib
- C++17
Four genomic datasets used in our paper are in the ./sequence_data/ directory. Their detailed information is shown below.
| Dataset Name | Chromosome | Starting Position | Complexity Level |
|---|---|---|---|
| D-easy/chr6_RepeatMasker | Chr6 | 559,707 | Easy |
| D-med/chrY_RBMY1A1 | ChrY | 22,410,155 | Medium |
| D-hard/chrY_SimpleRepeat | ChrY | 22,420,785 | Hard |
| D-hardest/chr21_centromere | Chr21 | 10,962,854 | Hardest |
The `./kmer_info` module analyzes k-mer statistics from sequence data.
```bash
g++ -std=c++17 kmer_info.cpp -o kmer_info
snakemake --core [N]
```

Here `[N]` is the number of CPU cores to use.
Results are stored in dataset-specific folders (e.g., ./chr21_centromere/):
- Occurrence k-mer count: `occ_counts.txt` lists the number of occurrences `i` and the count of k-mers with occurrence `i` in the format "occurrence : count". A visual representation is provided in `kmer_distribution_histogram.png`.
- Overlapped k-mers: `diff_counts.txt` contains the value of `(occ-sep)` and the count of k-mers with this difference in the format "occ-sep : count". (See the paper for definitions of `occ` and `sep`.)
- Hamming distance distribution: `HD_counts.txt` shows the Hamming distance and the count of k-mer pairs. This information is also visualized in `hamming_distance_distribution_histogram.png`.
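The count files can be post-processed with a few lines of Python. The sketch below is not part of the repository; it assumes each line holds exactly one "key : count" pair of integers, as the format description above suggests, and that the count refers to distinct k-mers. The function names `parse_counts` and `total_kmer_positions` are hypothetical.

```python
def parse_counts(text):
    """Parse lines of the form 'key : count' into a dict mapping int -> int."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, _, value = line.partition(":")
        counts[int(key)] = int(value)
    return counts


def total_kmer_positions(occ_counts):
    """Sum occurrence * count, assuming 'count' is the number of distinct k-mers."""
    return sum(occ * n for occ, n in occ_counts.items())


# Made-up sample in the 'occurrence : count' layout, for illustration only.
sample = "1 : 950\n2 : 20\n3 : 2\n"
occ = parse_counts(sample)
print(occ)                        # {1: 950, 2: 20, 3: 2}
print(total_kmer_positions(occ))  # 996
```

The same parser should also handle `diff_counts.txt`, since `int()` accepts negative keys and surrounding whitespace.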
This module compares the performance of the estimators $\hat{r}$ and $r_{mash}$.
- Navigate to the comparison directory:

  ```bash
  cd ./Comparison_of_two_estimators
  make
  ```

- Modify `config.yaml` to specify your experiments:

  ```yaml
  analyses:
    - name: "chr6_RepeatMasker"
      output_dir: "chr6_RepeatMasker"
      fasta: "../sequence_data/chr6.fasta"
      simulation_params:
        length: 100000
        start_pos: 559707
        precision: 1e-10
        replicates: 100
        fixed_r: 0.01
      plot_params:
        k_variation:
          thresholds: [32, 630]
          ylimits: [0.8, 0.02, 1.0]
        r_variation:
          threshold: 0.25
          fixed_k: 20
  ```

- Run the analysis:

  ```bash
  snakemake --core [N]
  ```

- Results are stored in the specified output directory (e.g., `./chr6_RepeatMasker/`) with simulation data in the `./specified/varying_k` and `./specified/varying_r` subdirectories. The output files are in CSV format with three columns representing $\hat{r}$ and $r_{mash}$ for each simulation replicate. Boxplots visualizing the comparison between these estimators are generated in the specified output directory, with separate panels based on the threshold parameters defined in the configuration file.
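To inspect the CSV output without the boxplots, a per-column median gives a rough numeric summary of each estimator. This is a minimal sketch, assuming the files contain only numeric rows with no header line; the helper name `column_medians` and the sample values are hypothetical and not taken from the repository.

```python
import csv
import io
import statistics


def column_medians(csv_text):
    """Return the median of each column of a header-less numeric CSV."""
    rows = [
        [float(cell) for cell in row]
        for row in csv.reader(io.StringIO(csv_text))
        if row  # ignore empty lines
    ]
    # zip(*rows) transposes the row list into columns.
    return [statistics.median(col) for col in zip(*rows)]


# Made-up three-column sample, one row per replicate.
sample = "0.01,0.011,0.012\n0.01,0.009,0.010\n0.01,0.010,0.014\n"
print(column_medians(sample))
```

For a file on disk, pass the result of `open(path).read()` as `csv_text`.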
This module is similar to the previous one but evaluates estimator performance using relative error metrics.
- Navigate to the relative comparison directory:

  ```bash
  cd ./Relative_Comparison_of_two_estimators
  make
  ```

- Modify `config.yaml` to specify your experiments (similar format to the previous module):

  ```yaml
  analyses:
    - name: "chr6_RepeatMasker"
      output_dir: "chr6_RepeatMasker"
      fasta: "../sequence_data/chr6.fasta"
      simulation_params:
        length: 100000
        start_pos: 559707
        precision: 1e-10
        replicates: 100
        fixed_r: 0.01
      plot_params:
        k_variation:
          thresholds: [32, 630]
          ylimits: [0.8, 0.02, 1.0]
        r_variation:
          threshold: 0.25
          fixed_k: 20
  ```

- Run the analysis:

  ```bash
  snakemake --core [N]
  ```

- Results are stored in the specified output directory (e.g., `./chr6_RepeatMasker/`) with simulation data in the `./specified/varying_k` and `./specified/varying_r` subdirectories. The output files are in CSV format with three columns representing the relative errors of $\hat{r}$ and $r_{mash}$ for each simulation replicate. Boxplots visualizing these relative errors are generated in the specified output directory, with separate panels based on the threshold parameters defined in the configuration.
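For reference, the relative error of an estimate against a known mutation rate can be computed as below. This sketch assumes the common signed definition (estimate - r) / r; the paper may define it differently (for example as an absolute value), so treat the function `relative_error` as illustrative rather than as the module's exact metric.

```python
def relative_error(estimate, true_r):
    """Signed relative error of an estimate with respect to the true rate."""
    if true_r == 0:
        raise ValueError("relative error is undefined for true_r == 0")
    return (estimate - true_r) / true_r


# Made-up estimates around fixed_r = 0.01, for illustration only.
fixed_r = 0.01
for est in (0.012, 0.009):
    print(f"estimate={est}: relative error={relative_error(est, fixed_r):+.2f}")
```

A positive value means the estimator overshoots the true rate, a negative value that it undershoots.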
This module generates a heatmap showing the deviation of $\hat{r}$.
- Navigate to the heatmap directory:

  ```bash
  cd ./Heatmap/
  make
  ```

- Modify `config.yaml`:

  ```yaml
  analyses:
    - name: "chr21_centromere"
      output_dir: "chr21_centromere"
      fasta: "../sequence_data/chr21.fasta"
      simulation_params:
        length: 100000
        start_pos: 10962854
        precision: 1e-10
        replicates: 100
        r_values: [0.001, 0.011, 0.021, 0.031, 0.041, 0.051, 0.061, 0.071, 0.081, 0.091, 0.101, 0.151, 0.201, 0.251, 0.301]
        k_values: [4, 8, 16, 32, 60, 90, 120, 150, 180, 210, 240, 270, 300, 330, 360, 390, 420, 450, 480, 510, 540, 570, 600]
      plot_params:
        heatmap:
          title: "Estimate Performance"
          cmap: "coolwarm"
          annotate: True
          figsize: "14,10"
  ```

- Run the analysis:

  ```bash
  snakemake --core [N]
  ```

- Results are stored in the specified output directory (e.g., `./chr21_centromere/`), with raw simulation data in `./specified_folder/results`. The primary output is a heatmap (`heatmap_deviation.png`) showing the bias of the $\hat{r}$ estimator across the specified $r$ and $k$ value combinations. The heatmap is colored according to the selected colormap, with annotations displaying the actual values when enabled in the configuration. The raw data for each $(r,k)$ combination are stored in separate CSV files in the results directory, containing the estimator values for each replicate.
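The grid behind such a heatmap can be sketched as follows. This example assumes "bias" means the mean of the replicate estimates minus the true $r$, which is the textbook definition and may differ from the deviation measure used in the paper. The function `bias_grid` and the in-memory `estimates` dictionary are hypothetical stand-ins for the per-combination CSV files.

```python
from statistics import mean


def bias_grid(estimates, r_values, k_values):
    """Build a matrix of mean(r_hat) - r, with one row per r and one column per k.

    `estimates` maps each (r, k) pair to its list of replicate r_hat values.
    """
    return [
        [mean(estimates[(r, k)]) - r for k in k_values]
        for r in r_values
    ]


# Made-up replicate values for a 2 x 2 grid, for illustration only.
r_values = [0.01, 0.1]
k_values = [8, 16]
estimates = {
    (0.01, 8): [0.011, 0.013],
    (0.01, 16): [0.010, 0.010],
    (0.1, 8): [0.09, 0.11],
    (0.1, 16): [0.12, 0.10],
}
for r, row in zip(r_values, bias_grid(estimates, r_values, k_values)):
    print(r, [f"{b:+.4f}" for b in row])
```

The resulting nested list has the shape expected by most plotting routines that draw a matrix as an image.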
This module examines the bias of
- Navigate to the comparison directory:

  ```bash
  cd ./Comparison_of_rstrong_and_sketch/
  make
  ```

- Modify `config.yaml` to specify your experiments:

  ```yaml
  analyses:
    - name: "chr6_RepeatMasker_sketch_100"
      output_dir: "chr6_RepeatMasker_sketch_100"
      fasta: "../sequence_data/chr6.fasta"
      simulation_params:
        length: 100000
        start_pos: 559707
        precision: 1e-10
        replicates: 100
        sketch_repeats: 100
      plot_params:
        k_variation:
          thresholds: [32, 630]
          ylimits: [0.8, 0.02, 1.0]
          fixed_r: 0.01
        r_variation:
          threshold: 0.25
          fixed_k: 20
  ```

- Run the analysis:

  ```bash
  snakemake --core [N]
  ```

- Results are stored in the specified output directory (e.g., `./chr6_RepeatMasker_sketch_100/`) with simulation data in the `./specified/varying_k` and `./specified/varying_r` subdirectories. Output files are in CSV format with three columns representing r_strong, r_sketch ($\theta=0.1$), and r_sketch ($\theta=0.01$). Visualization is provided through boxplots in the same directory.
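One way to read these three columns numerically is to average, over replicates, how far each sketch-based value lies from r_strong. The sketch below assumes each row is the triple (r_strong, r_sketch at $\theta=0.1$, r_sketch at $\theta=0.01$) in that order; the function name `sketch_offsets` and the sample rows are hypothetical.

```python
from statistics import mean


def sketch_offsets(rows):
    """Mean difference r_sketch - r_strong for each of the two sketch columns."""
    offset_theta_01 = mean(s01 - strong for strong, s01, _ in rows)
    offset_theta_001 = mean(s001 - strong for strong, _, s001 in rows)
    return offset_theta_01, offset_theta_001


# Made-up rows: (r_strong, r_sketch theta=0.1, r_sketch theta=0.01).
rows = [
    (0.010, 0.011, 0.013),
    (0.012, 0.012, 0.013),
    (0.011, 0.010, 0.013),
]
d01, d001 = sketch_offsets(rows)
print(f"theta=0.1: {d01:+.4f}  theta=0.01: {d001:+.4f}")
```

An offset near zero suggests that sketching adds little systematic error on top of r_strong for that value of $\theta$.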
This module focuses on isolating the bias from the sketching process by applying only one mutation and replicating the sketching process multiple times.
- Navigate to the sketching bias directory:

  ```bash
  cd ./Sketch_bias_plot/
  make
  ```

- Modify `config.yaml` to specify your experiments:

  ```yaml
  analyses:
    - name: "chr6_RepeatMasker"
      output_dir: "chr6_RepeatMasker"
      fasta: "../sequence_data/chr6.fasta"
      simulation_params:
        length: 100000
        start_pos: 559707
        precision: 1e-10
        sketch_repeats: 100
      plot_params:
        k_variation:
          thresholds: [32, 630]
          ylimits: [0.8, 0.02, 1.0]
          fixed_r: 0.01
        r_variation:
          threshold: 0.151
          ylimits: [0.2, 1.1]
          fixed_k: 20
  ```

- Run the analysis:

  ```bash
  snakemake --core [N]
  ```

- Results are stored in the specified output directory (e.g., `./chr6_RepeatMasker/`) with simulation data in the `./specified/varying_k` and `./specified/varying_r` subdirectories. The output files are in CSV format with three columns representing r_strong (constant within each output file), r_sketch ($\theta=0.1$), and r_sketch ($\theta=0.01$). Boxplots visualizing these results are also provided in the same directory.
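Because r_strong is constant within a file here, any variation in the other two columns comes from the sketching step alone. A minimal sketch of that check follows, assuming rows are (r_strong, r_sketch at $\theta=0.1$, r_sketch at $\theta=0.01$) triples; the function `sketch_spread` and the dictionary keys are hypothetical names.

```python
from statistics import mean, pstdev


def sketch_spread(rows):
    """Return r_strong plus (mean offset, population std dev) per sketch column."""
    strong_values = {row[0] for row in rows}
    if len(strong_values) != 1:
        raise ValueError("expected a constant r_strong column")
    r_strong = strong_values.pop()
    summary = {"r_strong": r_strong}
    for name, idx in (("theta_0.1", 1), ("theta_0.01", 2)):
        column = [row[idx] for row in rows]
        summary[name] = (mean(column) - r_strong, pstdev(column))
    return summary


# Made-up rows from one hypothetical output file.
rows = [(0.01, 0.012, 0.011), (0.01, 0.008, 0.009)]
print(sketch_spread(rows))
```

The `ValueError` guards against accidentally mixing rows from files generated with different mutated sequences.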
This experiment demonstrates the theoretical bounds of parameter
- Navigate to the error bounds directory:

  ```bash
  cd ./Error_bounds_of_q/
  make
  ```

- Configure with `config.yaml`:

  ```yaml
  analyses:
    - name: "chr21_centromere"
      output_dir: "chr21_centromere"
      fasta: "../sequence_data/chr21.fasta"
      simulation_params:
        length: 100000
        start_pos: 10962854
        precision: 1e-8
        replicates: 100
      plot_params:
        r_variation:
          fixed_k: 30
  ```

- Run the analysis:

  ```bash
  snakemake --core [N]
  ```

- Results are stored in the specified output directory (e.g., `./chr21_centromere/`) with detailed data in `./specified/varying_r`. Each output file corresponds to a specific $r$ value and contains:
  - The first row showing the lower and upper theoretical bounds in the format "lower_bound, upper_bound"
  - Subsequent rows showing the $\hat{r}$ values for each simulation replicate

  A visualization comparing the theoretical bounds with the empirical $\hat{r}$ values is also generated, allowing for visual confirmation of the theoretical predictions across different $r$ values while keeping $k$ fixed as specified in the configuration.
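The file layout described above can also be checked numerically by counting how many replicates fall inside the bounds. This sketch assumes one $\hat{r}$ value per row after the first line; `parse_bounds_file` and `fraction_within` are hypothetical helper names, and the sample values are made up.

```python
def parse_bounds_file(text):
    """Split a results file into (lower_bound, upper_bound, list of r_hat values)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    lower, upper = (float(x) for x in lines[0].split(","))
    estimates = [float(ln) for ln in lines[1:]]
    return lower, upper, estimates


def fraction_within(lower, upper, estimates):
    """Fraction of estimates lying in the closed interval [lower, upper]."""
    inside = sum(lower <= e <= upper for e in estimates)
    return inside / len(estimates)


# Made-up file content: bounds on the first row, then one r_hat per row.
sample = "0.008, 0.012\n0.009\n0.011\n0.013\n0.010\n"
lower, upper, estimates = parse_bounds_file(sample)
print(fraction_within(lower, upper, estimates))  # 0.75
```

A fraction close to 1 for a given $r$ is consistent with what the plot shows visually.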
This experiment demonstrates
- Navigate to the module directory:

  ```bash
  cd ./P_empty_and_unstableness/
  make
  ```

- Configure with `config.yaml`:

  ```yaml
  analyses:
    - name: "chr21_centromere"
      output_dir: "chr21_centromere"
      fasta: "../sequence_data/D-hardest.fasta"
      simulation_params:
        length: 100000
        start_pos: 0
        precision: 1e-8
        replicates: 100
      plot_params:
        r_variation:
          threshold: 0.201
          fixed_k: 30
  ```

- Run the analysis:

  ```bash
  snakemake --core [N]
  ```

- Results are stored in the specified output directory (e.g., `./chr21_centromere/`) with detailed data in `./specified/varying_r`. Each output file corresponds to a specific $r$ value and contains:
  - The first row showing the value of $P_{empty}$
  - Subsequent rows showing the $\hat{r}$ values for each simulation replicate

  A visualization of the empirical $\hat{r}$ values together with the curve of $P_{empty}$ is also generated.
This folder is used to generate the heat map of $P_{empty}$.
- Navigate to the module directory:

  ```bash
  cd ./P_empty_and_unstableness/
  make
  ```

- Configure with `config.yaml`:

  ```yaml
  analyses:
    - name: "chrY_SimpleRepeat"
      output_dir: "chrY_SimpleRepeat"
      simulation_params:
        length: 2264
        start_pos: 0
        precision: 1e-10
        replicates: 100
        r_values: [0.001, 0.031, 0.061, 0.091, 0.121, 0.151, 0.181, 0.211, 0.241, 0.271, 0.301, 0.331, 0.361]
        k_values: [8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]
      plot_params:
        heatmap:
          title: "HARD"
          cmap: "Greys"
          annotate: True
          figsize: "10,10"
  ```

  Note that you do not need to provide a sequence: $P_{empty}$ depends only on $L$, $r$, and $k$.

- Run the analysis:

  ```bash
  snakemake --core [N]
  ```

- Results are stored in the specified output directory (e.g., `./chrY_SimpleRepeat/`), with raw simulation data in `./specified_folder/results`. The primary output is a heatmap (`heatmap.png`) showing the bias of the $\hat{r}$ estimator across the specified $r$ and $k$ value combinations. The heatmap is colored according to the selected colormap, with annotations displaying the actual values when enabled in the configuration. The $P_{empty}$ values for each $(r,k)$ combination are stored in separate CSV files in the results directory.