RUMINA: high-throughput UMI deduplication for amplicon and whole-genome sequencing with enhanced error correction.
This repository contains code used in the associated manuscript. Zenodo repositories containing larger input, intermediate, and output files are also listed.
| DOI | Description |
|---|---|
| https://doi.org/10.5281/zenodo.18167358 | iCLiP inputs part 1 |
| https://doi.org/10.5281/zenodo.18167426 | iCLiP inputs part 2 |
| https://doi.org/10.5281/zenodo.18167490 | iCLiP outputs part 1 |
| https://doi.org/10.5281/zenodo.18167533 | iCLiP outputs part 2 |
| https://doi.org/10.5281/zenodo.18167542 | iCLiP outputs part 3 |
| https://doi.org/10.5281/zenodo.18176728 | iCLiP outputs part 4 |
| https://doi.org/10.5281/zenodo.18167580 | TCR |
| https://doi.org/10.5281/zenodo.18176787 | HIV simulation |
Read below for a per-experiment, file-oriented description of data uploaded both here and to Zenodo.
- Prelude
- Software needed
- Overall description
- Notes on running
memtest.py - File and directory descriptions
NOTE: All analysis was performed on MacOS Sequioa 15.3.2.
NOTE: RUMINA 0.9.81 was used for all analysis. To install it with Cargo (read about installing Cargo here):
cargo install rumina --version 0.9.81- R and Rstudio (along with packages specified in .Rmd files)
- Python 2.7 (iCLIP calc_repro.py) and Python 3.11+ (all other analysis) along with packages specified in .py files.
- Cargo 1.86.0+
- GNU Parallel v20250322
- samtools 1.21 (htslib 1.22)
- iVar v1.3
- UMI-tools v1.1.6
- UMICollapse v1.0.0
- RUMINA v0.9.81
- fastp v0.23.2
- Reaper v17-257
- Bowtie2 v2.5.4
- BWA MEM 0.7.19-r1273
This repository contains a CLI program for UMI cluster analysis, tag_splitter, that can be compiled with Cargo using the Makefile tag_splitter/Makefile:
To compile it:
cd tag_splitter
make
./tag_splitter -hNOTE: For the instructions below, it is assumed that you have all software listed somewhere in $PATH. For information on how to move software to $PATH, see this thread.
This work contains three datasets:
iCLiP(individual-nucleotide resolution UV crosslinking and immunoprecipitation)TCR(T-cell receptor sequencing)hiv_simulation(HIV simulated data)
Benchmarking data and software output is generated by running the memtest.py script, which runs a software of interest (UMI-tools, RUMINA, UMICollapse) on all BAMs within a designated folder.
Your working folder to run the memtest.py script should be organized as follows:
memtest.py
|
| - folder called "iclip" containing iCLIP BAMs
|
| - folder called "tcr" containing TCR BAMs
|
| - folder called "hiv_sim" containing hiv_simulation BAMs
|
| - folder called "reports" used for storing reports generated by memtest.py
You can create an empty reports directory using mkdir reports if it doesn't already exist.
Instructions for generating the BAMs from FASTQ data are available in the per-dataset descriptions below. Alternatively, you could use the input BAMs already provided in the Zenodo repositories.
The memtest.py script saves software output BAMs in the same directory as the input BAM, under a sub-directory labeled after the tool and its threads/method/singleton strategy and iteration. For example:
Running RUMINA 8 threads with acyclic method and singletons for 3 iterations on the iclip dataset produces the following folders:
iclip/RUMINA_8_threads.acyclic_0_--singletons
iclip/RUMINA_8_threads.acyclic_1_--singletons
iclip/RUMINA_8_threads.acyclic_2_--singletons
The _0_, _1_, and _2_ denote the iteration. Inside each folder are the output BAMs of each run. All scripts running on the outputs of tools depend on this directory structure.
memtest.py contains variables that can be modified to determine the tools to run and on what datasets, with what arguments. The DATASETS variable contains the following:
# dataset paired-end ref fasta stratify by length
DATASETS = [
("iclip", False, None, False),
("tcr", False, None, False),
("hiv_sim", True, "hiv_simulation/HIV_HXB2.fasta", False),
] # directories containing bamfiles to testEach dataset is a 4-tuple, with 1) the dataset folder containing input BAMs, 2) whether or not it's paired-end, 3) ref fasta path (unused except for hiv_simulation) and 4) whether or not to stratify UMI grouping additionally by length. Cloning this repository should contain the ref fasta in hiv_simulation.
Iterations can be controlled with the NUM_ITERATIONS variable, and programs exceeding the maximum allowed runtime in seconds will be terminated.
NUM_ITERATIONS = 1
# amount of time a tool is allowed to use before being terminated, in seconds
MAX_TIME = 14400Tools to run, and with what conditions, are listed in the METHODS variable.
# table structure:
# tool to run method singletons threads paired-end method
METHODS = [
## other tools
(run_umitools, None, None, 1, None),
(run_umicollapse, None, None, 1, None),
## directional method
(run_rumina, "directional", "--singletons", 1, None),
(run_rumina, "directional", "--singletons", 4, None),
(run_rumina, "directional", "--singletons", 8, None),
(run_rumina, "directional", None, 1, None),
(run_rumina, "directional", None, 4, None),
(run_rumina, "directional", None, 8, None),
## acyclic method
(run_rumina, "acyclic", "--singletons", 1, None),
(run_rumina, "acyclic", "--singletons", 4, None),
(run_rumina, "acyclic", "--singletons", 8, None),
(run_rumina, "acyclic", None, 1, None),
(run_rumina, "acyclic", None, 4, None),
(run_rumina, "acyclic", None, 8, None),
## raw method
(run_rumina, "raw", "--singletons", 1, None),
(run_rumina, "raw", "--singletons", 4, None),
(run_rumina, "raw", "--singletons", 8, None),
(run_rumina, "raw", None, 1, None),
(run_rumina, "raw", None, 4, None),
(run_rumina, "raw", None, 8, None),
]You can adjust these variables by commenting out lines to reduce the output generation/benchmarking to whatever dataset/tools/arguments you are interested in.
Run the script with python3 memtest.py.
All files relating to this section are located in this repository.
- bench.Rmd: benchmarking tables
- decompress.sh: script to decompress TRUST4 output files
- filereport_read_run_PRJN286202_tsv.txt: sample information for iCLIP data
- iclip.Rmd: analysis code for iCLiP dataset
- tcr.Rmd: analysis code for TCR dataset
- tcr.tar.gz: compressed archive of TRUST4 outputs, see decompress.sh
- performance: benchmarking report
- reproducibility: data for varying conditions of RUMINA and other tools run on the iCLIP datasets, as well as the original iCLIP data.
- Download the sequencing datasets (SRA/ERA accessions) listed in fastq/sra.txt, and preprocess and align them using the numbered scripts in the same directory:
cd iclip/fastq
# 1. use fasterq-dump to download the FASTQ data
sh 1_dl.sh
# 2. use umi_tools extract to move the barcodes from the read sequence into the header.
sh 2_extract.sh
# 3. Generate barcode metadata with reaper
python3 3_gen_reaper_metadata.py
# 4. Demux fastqs with reaper
python3 4_demux.pyBefore aligning, download the gzipped FASTAs for mm9 chromosomes 1-19, X, Y and M from here. Unzip them, and concatenate them into ../mm9.fa.
for file in *.fa; do cat $file >> ../mm9.fa; done# 5. Finally, align to the mm9 reference genome using bowtie-1.1.2
sh to_bam.sh
# 6. Move the BAMs to the parent directory "iclip" for downstream steps
mv *.bam ../- You can also use the BAMs uploaded, as they are outputs of step 0.
- run memtest.py with the "iclip" entry in the DATASET variable uncommented.
DATASETS = [
("iclip", False, None, False),
#("tcr", False, None, False),
#("hiv_sim", True, "hiv_simulation/HIV_HXB2.fasta", False),
]python3 memtest.py- run algo_comp.sh to compare reproducibility values between tools/samples. This will generate output reports in the following directories:
iclip/other_tools # UMI-tools and UMICollapse
iclip/original # unprocessed, input bams (this takes a while...)
iclip/directional # RUMINA directional
iclip/acyclic # RUMINA acyclic
iclip/raw # RUMINA raw# activate python2 environment found in "twoenv.tar.gz" beforehand
sh algo_comp.shYou can find pre-generated reproducibility reports in analysis/reproducibility. If you still want to generate them yourself, move the generated folders into the same folder as iclip.Rmd and other files in analysis/ to avoid breaking the script. You should now be able to run iclip.Rmd.
- Run group.sh to generate per-file cluster reports using the
tag_splitterCLI program. These reports are already included in the Zenodo repository should you wish to skip this step.
sh group.sh- Also, run depth.sh to generate depth reports for each file.
sh depth.sh- CGAT: libraries from CGAT required for iCLIP reproducibility analysis
- fastq: scripts and tables for processing raw FASTQ data. They are ordered in order of execution (e.g. 1, 2, 3)
- algo_comp.sh: driver script to compare reproducibility values on output iCLIP data across tools
- calc_repro.py and repro.py: core code for calculating reproducibility values
https://doi.org/10.5281/zenodo.18167358
https://doi.org/10.5281/zenodo.18167426
These repositories contain the inputs for all tools across 35 samples, in BAM format.
You can find the steps required to produce these BAMs here in the iclip/fastq folder.
https://doi.org/10.5281/zenodo.18167490 contains copies of the fastq processing and reproducibility calculation scripts located in iclip/reproducibility. Files ending with _cluster_report.tsv are cluster grouping reports generated with the tag_splitter program.
https://doi.org/10.5281/zenodo.18167533 contains the first of the output files generated by each tool: the output files are grouped in gzipped tarballs, with each archive belonging to a tool-method-singletons-thread combination.
https://doi.org/10.5281/zenodo.18167542 and https://doi.org/10.5281/zenodo.18176728 contain the remainder the output tarballs. In addition, cluster reports generated on the output bams with the tag_splitter program to investigate UMI clustering patterns are found in cluster_reports.tar.gz
twoenv.tar.gz is a python2 virtual environment for running the CGAT reproducibility scripts.
- Download the FASTQs using
1_dl.sh
cd tcr
sh 1_dl.sh- Extract the UMIs from FASTQ read sequence to the header
sh extract_umi.sh- Then map the FASTQs to generate the BAMs. You'll need to download
hg38.analysisSet.fa.gzfrom here.
unpigz hg38.analyisSet.fa.gz
bowtie2-build hg38.analysisSet.fa hg38
sh map.sh- if you skipped this step and downloaded the
bamfolder from Zenodo, move the BAMs from outside thebamsfolder intotcr
mv bams/*.bam tcr
for file in tcr/*.bam; do samtools index $file; done- run memtest.py with the "tcr" entry in the DATASET variable uncommented.
DATASETS = [
#("iclip", False, None, False),
("tcr", False, None, False),
#("hiv_sim", True, "hiv_simulation/HIV_HXB2.fasta", False),
]- run the trust4.sh script to generate trust4 outputs
cd tcr
sh trust4.sh-
alternatively to steps 1-3, decompress the trust4 reports hosted on Zenodo
-
For cluster reports, either download the ones already on Zenodo or run group.sh:
cd tcr
sh group.sh- For depth, run depth.sh:
cd tcr
sh depth.sh- extract_umi.sh: script to move UMI barcodes from read sequence to header
- map.sh: how to align fastqs to generate BAMs (see Zenodo for the actual bams)
- sras: a list of SRAs to download for FASTQ data to feed into extract_umi.sh and map.sh for BAM data
- trust4.sh: script to run trust4 on tool OUTPUT BAMs.
All files for this section can be found at https://doi.org/10.5281/zenodo.18167580.
- Output for every combination of tool/thread/method/singletons can be found in the *.tar.gz tarballs, where the directory name indicates the tool, threads, method, and singleton strategy used.
- human_IMGT+C.fa and human_bcrtcr.fa: files required by TRUST4
- bams.zip: folder of input BAMs, generated from the FASTQs (see above)
- cluster_reports.tar.gz: tarball of output cluster report generated with the
tag_splitterprogram. - *.fastq.gz: input FASTQ files
- iclip, tcr: files used to prepare the iCLiP and TCR datasets.
- hiv_simulation: files used to prepare the simulated HIV dataset.
- memtest.py: the benchmarking script used to output runtime and memory usage. Comments on usage are included within the script.
- hum_bcrtcr.fa and human_IMGT+C.fa: auxiliary files needed to run TRUST4.
- tag_splitter: the UMI cluster inspection CLI tool used to gather clsuter composition metrics. Use the Makefile inside to build (requires Cargo).
For more details, see the README in this folder.
- To generate the simulated HIV FASTQs, run generate.sh
cd hiv_simulation
sh generate.shNote that generating all files for all conditions (see the conditions table in generate.sh) is rather time- and memory-intensive. You can skip this step by using the BAM files provided in the Zenodo repository.
sh 4_to_bam.sh- move the *.BAM files to their own folder called "hiv_sim" (see the overall description section)
- run memtest.py with the "hiv_sim" dataset uncommented
DATASETS = [
#("iclip", False, None, False),
#("tcr", False, None, False),
("hiv_sim", True, "hiv_simulation/HIV_HXB2.fasta", False),
]- Run group.sh to generate per-file cluster reports using the
tag_splitterCLI program. These reports are already included in the Zenodo repository should you wish to skip this step.
cp hiv_simulation/group.sh hiv_sim/
sh hiv_sim/group.sh- Also, run depth.sh to generate depth reports for each file.
cp hiv_simulation/depth.sh hiv_sim/
sh hiv_sim/depth.sh- Finally, run iVar variants to generate SNV data.
cp hiv_simulation/ivar_2025Aug14.sh hiv_sim/
sh hiv_sim/ivar_2025Aug14.sh- iVar_reports: per-output-file SNV reports generated using iVar variants.
- HIV_HXB2.fasta: HIV ref used for alignment/simulation
- bench_hiv-scalability_2025-08-29_15_50_23.csv: benchmark data
- rumina_hiv_20250926.html: HTML file generated from R analysis
All files for this section can be found in https://doi.org/10.5281/zenodo.18176787.
-
As with other sections, BAM outputs are grouped into tar-balled archives for every combination of tool/threads/method/singleton strategy. You can find these archives ending in .tar.gz.
-
*.bam: input BAMs generated by aligning the FASTQ data generated from the simulation scripts
-
ivar_2025Aug14.sh: script to run iVar variants on all output files and generate reports in another directory.
-
group.sh: script to generate UMI cluster reports using
tag_splitter -
cluster_reports.tar.gz: holds said cluster reports.