Skip to content

greninger-lab/rumina_paper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RUMINA: high-throughput UMI deduplication for amplicon and whole-genome sequencing with enhanced error correction.

This repository contains code used in the associated manuscript. Zenodo repositories containing larger input, intermediate, and output files are also listed.

DOI Description
https://doi.org/10.5281/zenodo.18167358 iCLiP inputs part 1
https://doi.org/10.5281/zenodo.18167426 iCLiP inputs part 2
https://doi.org/10.5281/zenodo.18167490 iCLiP outputs part 1
https://doi.org/10.5281/zenodo.18167533 iCLiP outputs part 2
https://doi.org/10.5281/zenodo.18167542 iCLiP outputs part 3
https://doi.org/10.5281/zenodo.18176728 iCLiP outputs part 4
https://doi.org/10.5281/zenodo.18167580 TCR
https://doi.org/10.5281/zenodo.18176787 HIV simulation

Read below for a per-experiment, file-oriented description of data uploaded both here and to Zenodo.

Table of Contents

Prelude

NOTE: All analysis was performed on MacOS Sequioa 15.3.2.

NOTE: RUMINA 0.9.81 was used for all analysis. To install it with Cargo (read about installing Cargo here):

cargo install rumina --version 0.9.81

Software needed

This repository contains a CLI program for UMI cluster analysis, tag_splitter, that can be compiled with Cargo using the Makefile tag_splitter/Makefile:

To compile it:

cd tag_splitter
make
./tag_splitter -h

NOTE: For the instructions below, it is assumed that you have all software listed somewhere in $PATH. For information on how to move software to $PATH, see this thread.

Overall description

This work contains three datasets:

  • iCLiP (individual-nucleotide resolution UV crosslinking and immunoprecipitation)
  • TCR (T-cell receptor sequencing)
  • hiv_simulation (HIV simulated data)

Benchmarking data and software output is generated by running the memtest.py script, which runs a software of interest (UMI-tools, RUMINA, UMICollapse) on all BAMs within a designated folder.

Your working folder to run the memtest.py script should be organized as follows:

memtest.py
|  
| - folder called "iclip" containing iCLIP BAMs  
|  
| - folder called "tcr" containing TCR BAMs  
|  
| - folder called "hiv_sim" containing hiv_simulation BAMs  
|
| - folder called "reports" used for storing reports generated by memtest.py

You can create an empty reports directory using mkdir reports if it doesn't already exist.

Instructions for generating the BAMs from FASTQ data are available in the per-dataset descriptions below. Alternatively, you could use the input BAMs already provided in the Zenodo repositories.

The memtest.py script saves software output BAMs in the same directory as the input BAM, under a sub-directory labeled after the tool and its threads/method/singleton strategy and iteration. For example:

Running RUMINA 8 threads with acyclic method and singletons for 3 iterations on the iclip dataset produces the following folders:

iclip/RUMINA_8_threads.acyclic_0_--singletons
iclip/RUMINA_8_threads.acyclic_1_--singletons
iclip/RUMINA_8_threads.acyclic_2_--singletons

The _0_, _1_, and _2_ denote the iteration. Inside each folder are the output BAMs of each run. All scripts running on the outputs of tools depend on this directory structure.

Notes on running memtest.py

memtest.py contains variables that can be modified to determine the tools to run and on what datasets, with what arguments. The DATASETS variable contains the following:

#   dataset         paired-end  ref fasta                           stratify by length
DATASETS = [
    ("iclip",       False,      None,                                   False),
    ("tcr",         False,      None,                                   False),
    ("hiv_sim",     True,       "hiv_simulation/HIV_HXB2.fasta",        False),
]  # directories containing bamfiles to test

Each dataset is a 4-tuple, with 1) the dataset folder containing input BAMs, 2) whether or not it's paired-end, 3) ref fasta path (unused except for hiv_simulation) and 4) whether or not to stratify UMI grouping additionally by length. Cloning this repository should contain the ref fasta in hiv_simulation.

Iterations can be controlled with the NUM_ITERATIONS variable, and programs exceeding the maximum allowed runtime in seconds will be terminated.

NUM_ITERATIONS = 1

# amount of time a tool is allowed to use before being terminated, in seconds
MAX_TIME = 14400

Tools to run, and with what conditions, are listed in the METHODS variable.

# table structure:
    # tool to run       method              singletons          threads     paired-end method
METHODS = [
    ## other tools 
    (run_umitools,		None,		        None,		        1,	        None),
    (run_umicollapse,	None,		        None,		        1,	        None),

    ## directional method
    (run_rumina,		"directional",		"--singletons",		1,	        None),
    (run_rumina,		"directional",		"--singletons",		4,	        None),
    (run_rumina,		"directional",		"--singletons",		8,	        None),
    
    (run_rumina,		"directional",		None,		        1,	        None),
    (run_rumina,		"directional",		None,		        4,	        None),
    (run_rumina,		"directional",		None,		        8,	        None),
    
    ## acyclic method
    (run_rumina,		"acyclic",		    "--singletons",		1,	        None),
    (run_rumina,		"acyclic",		    "--singletons",		4,	        None),
    (run_rumina,		"acyclic",		    "--singletons",		8,	        None),

    (run_rumina,		"acyclic",		    None,		        1,	        None),
    (run_rumina,		"acyclic",		    None,		        4,	        None),
    (run_rumina,		"acyclic",		    None,		        8,	        None),
    
    ## raw method
    (run_rumina,		"raw",		        "--singletons",		1,	        None),
    (run_rumina,		"raw",		        "--singletons",		4,	        None),
    (run_rumina,		"raw",		        "--singletons",		8,	        None),

    (run_rumina,		"raw",		        None,		        1,          None),
    (run_rumina,		"raw",		        None,		        4,	        None),
    (run_rumina,		"raw",		        None,		        8,	        None),
]

You can adjust these variables by commenting out lines to reduce the output generation/benchmarking to whatever dataset/tools/arguments you are interested in.

Run the script with python3 memtest.py.

File and directory descriptions

analysis

All files relating to this section are located in this repository.

  • bench.Rmd: benchmarking tables
  • decompress.sh: script to decompress TRUST4 output files
  • filereport_read_run_PRJN286202_tsv.txt: sample information for iCLIP data
  • iclip.Rmd: analysis code for iCLiP dataset
  • tcr.Rmd: analysis code for TCR dataset
  • tcr.tar.gz: compressed archive of TRUST4 outputs, see decompress.sh
  • performance: benchmarking report
  • reproducibility: data for varying conditions of RUMINA and other tools run on the iCLIP datasets, as well as the original iCLIP data.

iclip

Instructions

  1. Download the sequencing datasets (SRA/ERA accessions) listed in fastq/sra.txt, and preprocess and align them using the numbered scripts in the same directory:
cd iclip/fastq 

# 1. use fasterq-dump to download the FASTQ data
sh 1_dl.sh

# 2. use umi_tools extract to move the barcodes from the read sequence into the header.
sh 2_extract.sh

# 3. Generate barcode metadata with reaper
python3 3_gen_reaper_metadata.py

# 4. Demux fastqs with reaper
python3 4_demux.py

Before aligning, download the gzipped FASTAs for mm9 chromosomes 1-19, X, Y and M from here. Unzip them, and concatenate them into ../mm9.fa.

for file in *.fa; do cat $file >> ../mm9.fa; done
# 5. Finally, align to the mm9 reference genome using bowtie-1.1.2
sh to_bam.sh

# 6. Move the BAMs to the parent directory "iclip" for downstream steps
mv *.bam ../
  1. You can also use the BAMs uploaded, as they are outputs of step 0.
  2. run memtest.py with the "iclip" entry in the DATASET variable uncommented.
DATASETS = [
    ("iclip",       False,      None,       False),
    #("tcr",         False,      None,       False),
    #("hiv_sim",     True,       "hiv_simulation/HIV_HXB2.fasta",        False),
]
python3 memtest.py
  1. run algo_comp.sh to compare reproducibility values between tools/samples. This will generate output reports in the following directories:
iclip/other_tools       # UMI-tools and UMICollapse
iclip/original          # unprocessed, input  bams (this takes a while...)
iclip/directional       # RUMINA directional
iclip/acyclic           # RUMINA acyclic
iclip/raw               # RUMINA raw
# activate python2 environment found in "twoenv.tar.gz" beforehand
sh algo_comp.sh

You can find pre-generated reproducibility reports in analysis/reproducibility. If you still want to generate them yourself, move the generated folders into the same folder as iclip.Rmd and other files in analysis/ to avoid breaking the script. You should now be able to run iclip.Rmd.

  1. Run group.sh to generate per-file cluster reports using the tag_splitter CLI program. These reports are already included in the Zenodo repository should you wish to skip this step.
sh group.sh
  1. Also, run depth.sh to generate depth reports for each file.
sh depth.sh

File descriptions

Files in this repository
  • CGAT: libraries from CGAT required for iCLIP reproducibility analysis
  • fastq: scripts and tables for processing raw FASTQ data. They are ordered in order of execution (e.g. 1, 2, 3)
  • algo_comp.sh: driver script to compare reproducibility values on output iCLIP data across tools
  • calc_repro.py and repro.py: core code for calculating reproducibility values
Files on Zenodo
Input BAMs generated from the FASTQ processing steps are located in these repositories

https://doi.org/10.5281/zenodo.18167358
https://doi.org/10.5281/zenodo.18167426

These repositories contain the inputs for all tools across 35 samples, in BAM format.

You can find the steps required to produce these BAMs here in the iclip/fastq folder.

Analysis files

https://doi.org/10.5281/zenodo.18167490 contains copies of the fastq processing and reproducibility calculation scripts located in iclip/reproducibility. Files ending with _cluster_report.tsv are cluster grouping reports generated with the tag_splitter program.

https://doi.org/10.5281/zenodo.18167533 contains the first of the output files generated by each tool: the output files are grouped in gzipped tarballs, with each archive belonging to a tool-method-singletons-thread combination.

https://doi.org/10.5281/zenodo.18167542 and https://doi.org/10.5281/zenodo.18176728 contain the remainder the output tarballs. In addition, cluster reports generated on the output bams with the tag_splitter program to investigate UMI clustering patterns are found in cluster_reports.tar.gz

twoenv.tar.gz is a python2 virtual environment for running the CGAT reproducibility scripts.

tcr

Instructions

  1. Download the FASTQs using 1_dl.sh
cd tcr
sh 1_dl.sh
  1. Extract the UMIs from FASTQ read sequence to the header
sh extract_umi.sh
  1. Then map the FASTQs to generate the BAMs. You'll need to download hg38.analysisSet.fa.gz from here.
unpigz hg38.analyisSet.fa.gz
bowtie2-build hg38.analysisSet.fa hg38
sh map.sh
  1. if you skipped this step and downloaded the bam folder from Zenodo, move the BAMs from outside the bams folder into tcr
mv bams/*.bam tcr
for file in tcr/*.bam; do samtools index $file; done
  1. run memtest.py with the "tcr" entry in the DATASET variable uncommented.
DATASETS = [
    #("iclip",       False,      None,       False),
    ("tcr",         False,      None,       False),
    #("hiv_sim",     True,       "hiv_simulation/HIV_HXB2.fasta",        False),
]
  1. run the trust4.sh script to generate trust4 outputs
cd tcr
sh trust4.sh
  1. alternatively to steps 1-3, decompress the trust4 reports hosted on Zenodo

  2. For cluster reports, either download the ones already on Zenodo or run group.sh:

cd tcr
sh group.sh
  1. For depth, run depth.sh:
cd tcr
sh depth.sh

File descriptions

Files in this repository
  • extract_umi.sh: script to move UMI barcodes from read sequence to header
  • map.sh: how to align fastqs to generate BAMs (see Zenodo for the actual bams)
  • sras: a list of SRAs to download for FASTQ data to feed into extract_umi.sh and map.sh for BAM data
  • trust4.sh: script to run trust4 on tool OUTPUT BAMs.
Files on Zenodo

All files for this section can be found at https://doi.org/10.5281/zenodo.18167580.

  • Output for every combination of tool/thread/method/singletons can be found in the *.tar.gz tarballs, where the directory name indicates the tool, threads, method, and singleton strategy used.
  • human_IMGT+C.fa and human_bcrtcr.fa: files required by TRUST4
  • bams.zip: folder of input BAMs, generated from the FASTQs (see above)
  • cluster_reports.tar.gz: tarball of output cluster report generated with the tag_splitter program.
  • *.fastq.gz: input FASTQ files
  • iclip, tcr: files used to prepare the iCLiP and TCR datasets.
  • hiv_simulation: files used to prepare the simulated HIV dataset.
  • memtest.py: the benchmarking script used to output runtime and memory usage. Comments on usage are included within the script.
  • hum_bcrtcr.fa and human_IMGT+C.fa: auxiliary files needed to run TRUST4.
  • tag_splitter: the UMI cluster inspection CLI tool used to gather clsuter composition metrics. Use the Makefile inside to build (requires Cargo).

hiv_simulation

For more details, see the README in this folder.

Instructions

  1. To generate the simulated HIV FASTQs, run generate.sh
cd hiv_simulation
sh generate.sh

Note that generating all files for all conditions (see the conditions table in generate.sh) is rather time- and memory-intensive. You can skip this step by using the BAM files provided in the Zenodo repository.

sh 4_to_bam.sh
  1. move the *.BAM files to their own folder called "hiv_sim" (see the overall description section)
  2. run memtest.py with the "hiv_sim" dataset uncommented
DATASETS = [
    #("iclip",       False,      None,       False),
    #("tcr",         False,      None,       False),
    ("hiv_sim",     True,       "hiv_simulation/HIV_HXB2.fasta",        False),
]
  1. Run group.sh to generate per-file cluster reports using the tag_splitter CLI program. These reports are already included in the Zenodo repository should you wish to skip this step.
cp hiv_simulation/group.sh hiv_sim/
sh hiv_sim/group.sh
  1. Also, run depth.sh to generate depth reports for each file.
cp hiv_simulation/depth.sh hiv_sim/
sh hiv_sim/depth.sh
  1. Finally, run iVar variants to generate SNV data.
cp hiv_simulation/ivar_2025Aug14.sh hiv_sim/ 
sh hiv_sim/ivar_2025Aug14.sh

File descriptions

Files in this repository
  • iVar_reports: per-output-file SNV reports generated using iVar variants.
  • HIV_HXB2.fasta: HIV ref used for alignment/simulation
  • bench_hiv-scalability_2025-08-29_15_50_23.csv: benchmark data
  • rumina_hiv_20250926.html: HTML file generated from R analysis
Files on Zenodo

All files for this section can be found in https://doi.org/10.5281/zenodo.18176787.

  • As with other sections, BAM outputs are grouped into tar-balled archives for every combination of tool/threads/method/singleton strategy. You can find these archives ending in .tar.gz.

  • *.bam: input BAMs generated by aligning the FASTQ data generated from the simulation scripts

  • ivar_2025Aug14.sh: script to run iVar variants on all output files and generate reports in another directory.

  • group.sh: script to generate UMI cluster reports using tag_splitter

  • cluster_reports.tar.gz: holds said cluster reports.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages