kmer-stats-repeat: Reproduction Framework

This repository contains code to reproduce the empirical results from our paper.

Dependencies

Snakemake
Matplotlib
C++17

Datasets

Four genomic datasets used in our paper are in the ./sequence_data/ directory. Their detailed information is shown below.

Dataset Name	Chromosome	Starting Position	Complexity Level
D-easy/chr6_RepeatMasker	Chr6	559,707	Easy
D-med/chrY_RBMY1A1	ChrY	22,410,155	Medium
D-hard/chrY_SimpleRepeat	ChrY	22,420,785	Hard
D-hardest/chr21_centromere	Chr21	10,962,854	Hardest

$K$-mer statistics of datasets

The ./kmer_info analyzes k-mer statistics from sequence data.

Compilation

g++ -std=c++17 kmer_info.cpp -o kmer_info

Running the Analysis

snakemake --core [N]

Where [N] is the number of CPU cores to utilize.

Results

Results are stored in dataset-specific folders (e.g., ./chr21_centromere/):

Occurrence k-mer count: occ_counts.txt lists the number of occurrences i and the count of k-mers with occurrence i in the format "occurrence : count". A visual representation is provided in kmer_distribution_histogram.png.
Overlapped k-mers: diff_counts.txt contains the value of (occ-sep) and the count of k-mers with this difference in the format "occ-sep : count". (See paper for definitions of occ and sep).
Hamming distance distribution: HD_counts.txt shows the Hamming distance and the count of k-mer pairs. This information is also visualized in hamming_distance_distribution_histogram.png.

Comparison of Estimators

This module compares the performance of estimators $\hat{r}$ and $r_{mash}$.

Setup and Configuration

Navigate to the comparison directory:
```
cd ./Comparison_of_two_estimators
make
```

Modify config.yaml to specify your experiments:

analyses:
  - name: "chr6_RepeatMasker"
    output_dir: "chr6_RepeatMasker"
    fasta: "../sequence_data/chr6.fasta"
    simulation_params:
      length: 100000
      start_pos: 559707
      precision: 1e-10
      replicates: 100
      fixed_r: 0.01
    plot_params:
      k_variation:
        thresholds: [32, 630]
        ylimits: [0.8, 0.02, 1.0]
      r_variation:
        threshold: 0.25
        fixed_k: 20

Run the analysis:
```
snakemake --core [N]
```
Results are stored in the specified output directory (e.g., ./chr6_RepeatMasker/) with simulation data in ./specified/varying_k and ./specified/varying_r subdirectories. The output files are in CSV format with three columns representing $\hat{r}$ and $r_{mash}$ for each simulation replicate. Boxplots visualizing the comparison between these estimators are generated in the specified output directory, with separate panels based on the threshold parameters defined in the configuration file.

Relative Error Comparison

This module is similar to the previous one but evaluates estimator performance using relative error metrics.

Setup and Configuration

Navigate to the relative comparison directory:

cd ./Relative_Comparison_of_two_estimators
make

Modify config.yaml to specify your experiments (similar format to the previous module):

analyses:
  - name: "chr6_RepeatMasker"
    output_dir: "chr6_RepeatMasker"
    fasta: "../sequence_data/chr6.fasta"
    simulation_params:
      length: 100000
      start_pos: 559707
      precision: 1e-10
      replicates: 100
      fixed_r: 0.01
    plot_params:
      k_variation:
        thresholds: [32, 630]
        ylimits: [0.8, 0.02, 1.0]
      r_variation:
        threshold: 0.25
        fixed_k: 20

Run the analysis:
```
snakemake --core [N]
```
Results are stored in the specified output directory (e.g., ./chr6_RepeatMasker/) with simulation data in ./specified/varying_k and ./specified/varying_r subdirectories. The output files are in CSV format with three columns representing the relative errors of $\hat{r}$ and $r_{mash}$ for each simulation replicate. Boxplots visualizing these relative errors are generated in the specified output directory, with separate panels based on the threshold parameters defined in the configuration.

Heatmap Analysis

This module generates a heatmap showing the deviation of $\hat{r}$ across different $(r,k)$ parameter combinations.

Setup and Configuration

Navigate to the heatmap directory:
```
cd ./Heatmap/
make
```

Modify config.yaml:

analyses:
  - name: "chr21_centromere"
    output_dir: "chr21_centromere"
    fasta: "../sequence_data/chr21.fasta"
    simulation_params:
      length: 100000
      start_pos: 10962854
      precision: 1e-10
      replicates: 100
      r_values: [0.001, 0.011, 0.021, 0.031, 0.041, 0.051, 0.061, 0.071, 0.081, 0.091, 0.101, 0.151, 0.201, 0.251, 0.301]
      k_values: [4, 8, 16, 32, 60, 90, 120, 150, 180, 210, 240, 270, 300, 330, 360, 390, 420, 450, 480, 510, 540, 570, 600]
    plot_params:
       heatmap:
        title: "Estimate Performance"
        cmap: "coolwarm"
        annotate: True
        figsize: "14,10"

Run the analysis:
```
snakemake --core [N]
```
Results are stored in the specified output directory (e.g., ./chr21_centromere/), with raw simulation data in ./specified_folder/results. The primary output is a comprehensive heatmap visualization (heatmap_deviation.png) showing the bias of $\hat{r}$ estimator across the specified r and k value combinations. The heatmap is colored according to the selected colormap, with annotations displaying the actual values when enabled in the configuration. The raw data for each $(r,k)$ combination are stored in separate CSV files in the results directory, containing the estimator values for each replicate.

Sketching Bias Analysis

This module examines the bias of $r_{sketch}$ compared to $r_{strong}$.

Setup and Configuration

Navigate to the comparison directory:

cd ./Comparison_of_rstrong_and_sketch/
make

Modify config.yaml to specify your experiments:

analyses:
  - name: "chr6_RepeatMasker_sketch_100"
    output_dir: "chr6_RepeatMasker_sketch_100"
    fasta: "../sequence_data/chr6.fasta"
    simulation_params:
      length: 100000
      start_pos: 559707
      precision: 1e-10
      replicates: 100
      sketch_repeats: 100
    plot_params:
      k_variation:
        thresholds: [32, 630]
        ylimits: [0.8, 0.02, 1.0]
        fixed_r: 0.01
      r_variation:
        threshold: 0.25
        fixed_k: 20

Run the analysis:
```
snakemake --core [N]
```
Results are stored in the specified output directory (e.g., ./chr6_RepeatMasker_sketch_100/) with simulation data in ./specified/varying_k and ./specified/varying_r subdirectories. Output files are in CSV format with three columns representing r_strong, r_sketch ($\theta=0.1$), and r_sketch ($\theta=0.01$). Visualization is provided through boxplots in the same directory.

Single-Mutation Sketching Bias

This module focuses on isolating the bias from the sketching process by applying only one mutation and replicating the sketching process multiple times.

Navigate to the sketching bias directory:
```
cd ./Sketch_bias_plot/
make
```

Modify config.yaml to specify your experiments:

analyses:
  - name: "chr6_RepeatMasker"
    output_dir: "chr6_RepeatMasker"
    fasta: "../sequence_data/chr6.fasta"
    simulation_params:
      length: 100000
      start_pos: 559707
      precision: 1e-10
      sketch_repeats: 100
    plot_params:
      k_variation:
        thresholds: [32, 630]
        ylimits: [0.8, 0.02, 1.0]
        fixed_r: 0.01
      r_variation:
        threshold: 0.151
        ylimits: [0.2, 1.1]
        fixed_k: 20

Run the analysis:
```
snakemake --core [N]
```
Results are stored in the specified output directory (e.g., ./chr6_RepeatMasker/) with simulation data in ./specified/varying_k and ./specified/varying_r subdirectories. The output files are in CSV format with three columns representing r_strong (constant within each output file), r_sketch ($\theta=0.1$), and r_sketch ($\theta=0.01$). Boxplots visualizing these results are also provided in the same directory.

Error Bounds of q

This experiment demonstrates the theoretical bounds of parameter $q$.

Setup and Configuration

Navigate to the error bounds directory:
```
cd ./Error_bounds_of_q/
make
```

Configure with config.yaml:

analyses:
  - name: "chr21_centromere"
    output_dir: "chr21_centromere"
    fasta: "../sequence_data/chr21.fasta"
    simulation_params:
      length: 100000
      start_pos: 10962854
      precision: 1e-8
      replicates: 100
    plot_params:
      r_variation:
        fixed_k: 30

Run the analysis:
```
snakemake --core [N]
```
Results are stored in the specified output directory (e.g., ./chr21_centromere/) with detailed data in ./specified/varying_r. Each output file corresponds to a specific r value and contains:
- The first row showing the lower and upper theoretical bounds in the format "lower_bound, upper_bound"
- Subsequent rows showing the $\hat{r}$ values for each simulation replicate
A visualization comparing the theoretical bounds with the empirical results of $\hat{r}$ values is also generated, allowing for visual confirmation of the theoretical predictions across different $r$ values while keeping $k$ fixed as specified in the configuration.

Relation of $P_{empty}$ and Unstableness

This experiment demonstrates $P_{empty}$ can be a diagnostic criterion for determining the unstableness for large $r$ and $k$.

Setup and Configuration

Navigate to the error bounds directory:
```
cd ./P_empty_and_unstableness/
make
```

Configure with config.yaml:

analyses:
  - name: "chr21_centromere"
    output_dir: "chr21_centromere"
    fasta: "../sequence_data/D-hardest.fasta"
    simulation_params:
      length: 100000
      start_pos: 0
      precision: 1e-8
      replicates: 100
    plot_params:
      r_variation:
        threshold: 0.201
        fixed_k: 30

Run the analysis:
```
snakemake --core [N]
```
Results are stored in the specified output directory (e.g., ./chr21_centromere/) with detailed data in ./specified/varying_r. Each output file corresponds to a specific r value and contains:
- The first row showing the value of $P_{empty}$
- Subsequent rows showing the $\hat{r}$ values for each simulation replicate
A visualization empirical results of $\hat{r}$ values and the curve of $P_{empty}$ is also generated.

$P_{empty}$ Heatmap

This folder is used to generate the heat map of $P_{empty}$ for the settings of different $k$ and $r$.

Setup and Configuration

Navigate to the error bounds directory:
```
cd ./P_empty_and_unstableness/
make
```

Configure with config.yaml:

analyses:  
  - name: "chrY_SimpleRepeat"
    output_dir: "chrY_SimpleRepeat"
    simulation_params:
      length: 2264
      start_pos: 0
      precision: 1e-10
      replicates: 100
      r_values: [0.001, 0.031, 0.061, 0.091, 0.121, 0.151, 0.181, 0.211, 0.241, 0.271, 0.301, 0.331, 0.361]
      k_values: [8,10,12,14,16,18,20,22,24,26,28,30,32]
    plot_params:
       heatmap:
        title: "HARD"
        cmap: "Greys"
        annotate: True
        figsize: "10,10"

Not you do not need to input the sequence, $P_{empty}$ only depends on $L,r$ and $k$.

Run the analysis:
```
snakemake --core [N]
```
Results are stored in the specified output directory (e.g., ./chrY_SimpleRepeat/), with raw simulation data in ./specified_folder/results. The primary output is a comprehensive heatmap visualization (heatmap.png) showing the bias of $\hat{r}$ estimator across the specified r and k value combinations. The heatmap is colored according to the selected colormap, with annotations displaying the actual values when enabled in the configuration. The $P_{empty}$ for each $(r,k)$ combination are stored in separate CSV files in the results directory.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

kmer-stats-repeat: Reproduction Framework

Dependencies

Datasets

$K$-mer statistics of datasets

Compilation

Running the Analysis

Results

Comparison of Estimators

Setup and Configuration

Relative Error Comparison

Setup and Configuration

Heatmap Analysis

Setup and Configuration

Sketching Bias Analysis

Setup and Configuration

Single-Mutation Sketching Bias

Error Bounds of q

Setup and Configuration

Relation of $P_{empty}$ and Unstableness

Setup and Configuration

$P_{empty}$ Heatmap

Setup and Configuration

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
Ape_phylo_cen_of_ChrX_and_ChrY		Ape_phylo_cen_of_ChrX_and_ChrY
Comparison_of_rstrong_and_rsketch		Comparison_of_rstrong_and_rsketch
Comparison_of_two_estimator		Comparison_of_two_estimator
Error_bounds_of_q		Error_bounds_of_q
Heatmap		Heatmap
P_empty		P_empty
P_empty_and_unstablenss		P_empty_and_unstablenss
Relative_comparison_of_two_estimators		Relative_comparison_of_two_estimators
Sketch_bias_plot		Sketch_bias_plot
kmer_info		kmer_info
sequence_data		sequence_data
.gitmodules		.gitmodules
README.md		README.md

medvedevgroup/kmer-stats-repeat_Reproduce

Folders and files

Latest commit

History

Repository files navigation

kmer-stats-repeat: Reproduction Framework

Dependencies

Datasets

$K$-mer statistics of datasets

Compilation

Running the Analysis

Results

Comparison of Estimators

Setup and Configuration

Relative Error Comparison

Setup and Configuration

Heatmap Analysis

Setup and Configuration

Sketching Bias Analysis

Setup and Configuration

Single-Mutation Sketching Bias

Error Bounds of q

Setup and Configuration

Relation of $P_{empty}$ and Unstableness

Setup and Configuration

$P_{empty}$ Heatmap

Setup and Configuration

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages