RUMINA: high-throughput UMI deduplication for amplicon and whole-genome sequencing with enhanced error correction.

This repository contains code used in the associated manuscript. Zenodo repositories containing larger input, intermediate, and output files are also listed.

DOI	Description
https://doi.org/10.5281/zenodo.18167358	iCLiP inputs part 1
https://doi.org/10.5281/zenodo.18167426	iCLiP inputs part 2
https://doi.org/10.5281/zenodo.18167490	iCLiP outputs part 1
https://doi.org/10.5281/zenodo.18167533	iCLiP outputs part 2
https://doi.org/10.5281/zenodo.18167542	iCLiP outputs part 3
https://doi.org/10.5281/zenodo.18176728	iCLiP outputs part 4
https://doi.org/10.5281/zenodo.18167580	TCR
https://doi.org/10.5281/zenodo.18176787	HIV simulation

Read below for a per-experiment, file-oriented description of data uploaded both here and to Zenodo.

R and Rstudio (along with packages specified in .Rmd files)
Python 2.7 (iCLIP calc_repro.py) and Python 3.11+ (all other analysis) along with packages specified in .py files.
Cargo 1.86.0+
GNU Parallel v20250322
samtools 1.21 (htslib 1.22)
iVar v1.3
UMI-tools v1.1.6
UMICollapse v1.0.0
RUMINA v0.9.81
fastp v0.23.2
Reaper v17-257
Bowtie2 v2.5.4
BWA MEM 0.7.19-r1273

This repository contains a CLI program for UMI cluster analysis, tag_splitter, that can be compiled with Cargo using the Makefile tag_splitter/Makefile:

To compile it:

cd tag_splitter
make
./tag_splitter -h

NOTE: For the instructions below, it is assumed that you have all software listed somewhere in $PATH. For information on how to move software to $PATH, see this thread.

Overall description

This work contains three datasets:

iCLiP (individual-nucleotide resolution UV crosslinking and immunoprecipitation)
TCR (T-cell receptor sequencing)
hiv_simulation (HIV simulated data)

Benchmarking data and software output is generated by running the memtest.py script, which runs a software of interest (UMI-tools, RUMINA, UMICollapse) on all BAMs within a designated folder.

Your working folder to run the memtest.py script should be organized as follows:

memtest.py
|  
| - folder called "iclip" containing iCLIP BAMs  
|  
| - folder called "tcr" containing TCR BAMs  
|  
| - folder called "hiv_sim" containing hiv_simulation BAMs  
|
| - folder called "reports" used for storing reports generated by memtest.py

You can create an empty reports directory using mkdir reports if it doesn't already exist.

Instructions for generating the BAMs from FASTQ data are available in the per-dataset descriptions below. Alternatively, you could use the input BAMs already provided in the Zenodo repositories.

The memtest.py script saves software output BAMs in the same directory as the input BAM, under a sub-directory labeled after the tool and its threads/method/singleton strategy and iteration. For example:

Running RUMINA 8 threads with acyclic method and singletons for 3 iterations on the iclip dataset produces the following folders:

iclip/RUMINA_8_threads.acyclic_0_--singletons
iclip/RUMINA_8_threads.acyclic_1_--singletons
iclip/RUMINA_8_threads.acyclic_2_--singletons

The _0_, _1_, and _2_ denote the iteration. Inside each folder are the output BAMs of each run. All scripts running on the outputs of tools depend on this directory structure.

Notes on running `memtest.py`

memtest.py contains variables that can be modified to determine the tools to run and on what datasets, with what arguments. The DATASETS variable contains the following:

#   dataset         paired-end  ref fasta                           stratify by length
DATASETS = [
    ("iclip",       False,      None,                                   False),
    ("tcr",         False,      None,                                   False),
    ("hiv_sim",     True,       "hiv_simulation/HIV_HXB2.fasta",        False),
]  # directories containing bamfiles to test

Each dataset is a 4-tuple, with 1) the dataset folder containing input BAMs, 2) whether or not it's paired-end, 3) ref fasta path (unused except for hiv_simulation) and 4) whether or not to stratify UMI grouping additionally by length. Cloning this repository should contain the ref fasta in hiv_simulation.

Iterations can be controlled with the NUM_ITERATIONS variable, and programs exceeding the maximum allowed runtime in seconds will be terminated.

NUM_ITERATIONS = 1

# amount of time a tool is allowed to use before being terminated, in seconds
MAX_TIME = 14400

Tools to run, and with what conditions, are listed in the METHODS variable.

# table structure:
    # tool to run       method              singletons          threads     paired-end method
METHODS = [
    ## other tools 
    (run_umitools,		None,		        None,		        1,	        None),
    (run_umicollapse,	None,		        None,		        1,	        None),

    ## directional method
    (run_rumina,		"directional",		"--singletons",		1,	        None),
    (run_rumina,		"directional",		"--singletons",		4,	        None),
    (run_rumina,		"directional",		"--singletons",		8,	        None),
    
    (run_rumina,		"directional",		None,		        1,	        None),
    (run_rumina,		"directional",		None,		        4,	        None),
    (run_rumina,		"directional",		None,		        8,	        None),
    
    ## acyclic method
    (run_rumina,		"acyclic",		    "--singletons",		1,	        None),
    (run_rumina,		"acyclic",		    "--singletons",		4,	        None),
    (run_rumina,		"acyclic",		    "--singletons",		8,	        None),

    (run_rumina,		"acyclic",		    None,		        1,	        None),
    (run_rumina,		"acyclic",		    None,		        4,	        None),
    (run_rumina,		"acyclic",		    None,		        8,	        None),
    
    ## raw method
    (run_rumina,		"raw",		        "--singletons",		1,	        None),
    (run_rumina,		"raw",		        "--singletons",		4,	        None),
    (run_rumina,		"raw",		        "--singletons",		8,	        None),

    (run_rumina,		"raw",		        None,		        1,          None),
    (run_rumina,		"raw",		        None,		        4,	        None),
    (run_rumina,		"raw",		        None,		        8,	        None),
]

You can adjust these variables by commenting out lines to reduce the output generation/benchmarking to whatever dataset/tools/arguments you are interested in.

Run the script with python3 memtest.py.

File and directory descriptions

`analysis`

All files relating to this section are located in this repository.

bench.Rmd: benchmarking tables
decompress.sh: script to decompress TRUST4 output files
filereport_read_run_PRJN286202_tsv.txt: sample information for iCLIP data
iclip.Rmd: analysis code for iCLiP dataset
tcr.Rmd: analysis code for TCR dataset
tcr.tar.gz: compressed archive of TRUST4 outputs, see decompress.sh
performance: benchmarking report
reproducibility: data for varying conditions of RUMINA and other tools run on the iCLIP datasets, as well as the original iCLIP data.

`iclip`

Instructions

Download the sequencing datasets (SRA/ERA accessions) listed in fastq/sra.txt, and preprocess and align them using the numbered scripts in the same directory:

cd iclip/fastq 

# 1. use fasterq-dump to download the FASTQ data
sh 1_dl.sh

# 2. use umi_tools extract to move the barcodes from the read sequence into the header.
sh 2_extract.sh

# 3. Generate barcode metadata with reaper
python3 3_gen_reaper_metadata.py

# 4. Demux fastqs with reaper
python3 4_demux.py

Before aligning, download the gzipped FASTAs for mm9 chromosomes 1-19, X, Y and M from here. Unzip them, and concatenate them into ../mm9.fa.

for file in *.fa; do cat $file >> ../mm9.fa; done

# 5. Finally, align to the mm9 reference genome using bowtie-1.1.2
sh to_bam.sh

# 6. Move the BAMs to the parent directory "iclip" for downstream steps
mv *.bam ../

You can also use the BAMs uploaded, as they are outputs of step 0.
run memtest.py with the "iclip" entry in the DATASET variable uncommented.

DATASETS = [
    ("iclip",       False,      None,       False),
    #("tcr",         False,      None,       False),
    #("hiv_sim",     True,       "hiv_simulation/HIV_HXB2.fasta",        False),
]

python3 memtest.py

run algo_comp.sh to compare reproducibility values between tools/samples. This will generate output reports in the following directories:

iclip/other_tools       # UMI-tools and UMICollapse
iclip/original          # unprocessed, input  bams (this takes a while...)
iclip/directional       # RUMINA directional
iclip/acyclic           # RUMINA acyclic
iclip/raw               # RUMINA raw

# activate python2 environment found in "twoenv.tar.gz" beforehand
sh algo_comp.sh

You can find pre-generated reproducibility reports in analysis/reproducibility. If you still want to generate them yourself, move the generated folders into the same folder as iclip.Rmd and other files in analysis/ to avoid breaking the script. You should now be able to run iclip.Rmd.

Run group.sh to generate per-file cluster reports using the tag_splitter CLI program. These reports are already included in the Zenodo repository should you wish to skip this step.

sh group.sh

Also, run depth.sh to generate depth reports for each file.

sh depth.sh

File descriptions

Files in this repository

CGAT: libraries from CGAT required for iCLIP reproducibility analysis
fastq: scripts and tables for processing raw FASTQ data. They are ordered in order of execution (e.g. 1, 2, 3)
algo_comp.sh: driver script to compare reproducibility values on output iCLIP data across tools
calc_repro.py and repro.py: core code for calculating reproducibility values

Files on Zenodo

Input BAMs generated from the FASTQ processing steps are located in these repositories

https://doi.org/10.5281/zenodo.18167358
https://doi.org/10.5281/zenodo.18167426

These repositories contain the inputs for all tools across 35 samples, in BAM format.

You can find the steps required to produce these BAMs here in the iclip/fastq folder.

Analysis files

https://doi.org/10.5281/zenodo.18167490 contains copies of the fastq processing and reproducibility calculation scripts located in iclip/reproducibility. Files ending with _cluster_report.tsv are cluster grouping reports generated with the tag_splitter program.

https://doi.org/10.5281/zenodo.18167533 contains the first of the output files generated by each tool: the output files are grouped in gzipped tarballs, with each archive belonging to a tool-method-singletons-thread combination.

https://doi.org/10.5281/zenodo.18167542 and https://doi.org/10.5281/zenodo.18176728 contain the remainder the output tarballs. In addition, cluster reports generated on the output bams with the tag_splitter program to investigate UMI clustering patterns are found in cluster_reports.tar.gz

twoenv.tar.gz is a python2 virtual environment for running the CGAT reproducibility scripts.

`tcr`

Instructions

Download the FASTQs using 1_dl.sh

cd tcr
sh 1_dl.sh

Extract the UMIs from FASTQ read sequence to the header

sh extract_umi.sh

Then map the FASTQs to generate the BAMs. You'll need to download hg38.analysisSet.fa.gz from here.

unpigz hg38.analyisSet.fa.gz
bowtie2-build hg38.analysisSet.fa hg38
sh map.sh

if you skipped this step and downloaded the bam folder from Zenodo, move the BAMs from outside the bams folder into tcr

mv bams/*.bam tcr
for file in tcr/*.bam; do samtools index $file; done

run memtest.py with the "tcr" entry in the DATASET variable uncommented.

DATASETS = [
    #("iclip",       False,      None,       False),
    ("tcr",         False,      None,       False),
    #("hiv_sim",     True,       "hiv_simulation/HIV_HXB2.fasta",        False),
]

run the trust4.sh script to generate trust4 outputs

cd tcr
sh trust4.sh

alternatively to steps 1-3, decompress the trust4 reports hosted on Zenodo
For cluster reports, either download the ones already on Zenodo or run group.sh:

cd tcr
sh group.sh

For depth, run depth.sh:

cd tcr
sh depth.sh

File descriptions

Files in this repository

extract_umi.sh: script to move UMI barcodes from read sequence to header
map.sh: how to align fastqs to generate BAMs (see Zenodo for the actual bams)
sras: a list of SRAs to download for FASTQ data to feed into extract_umi.sh and map.sh for BAM data
trust4.sh: script to run trust4 on tool OUTPUT BAMs.

Files on Zenodo

All files for this section can be found at https://doi.org/10.5281/zenodo.18167580.

Output for every combination of tool/thread/method/singletons can be found in the *.tar.gz tarballs, where the directory name indicates the tool, threads, method, and singleton strategy used.
human_IMGT+C.fa and human_bcrtcr.fa: files required by TRUST4
bams.zip: folder of input BAMs, generated from the FASTQs (see above)
cluster_reports.tar.gz: tarball of output cluster report generated with the tag_splitter program.
*.fastq.gz: input FASTQ files
iclip, tcr: files used to prepare the iCLiP and TCR datasets.
hiv_simulation: files used to prepare the simulated HIV dataset.
memtest.py: the benchmarking script used to output runtime and memory usage. Comments on usage are included within the script.
hum_bcrtcr.fa and human_IMGT+C.fa: auxiliary files needed to run TRUST4.
tag_splitter: the UMI cluster inspection CLI tool used to gather clsuter composition metrics. Use the Makefile inside to build (requires Cargo).

`hiv_simulation`

For more details, see the README in this folder.

Instructions

To generate the simulated HIV FASTQs, run generate.sh

cd hiv_simulation
sh generate.sh

Note that generating all files for all conditions (see the conditions table in generate.sh) is rather time- and memory-intensive. You can skip this step by using the BAM files provided in the Zenodo repository.

sh 4_to_bam.sh

move the *.BAM files to their own folder called "hiv_sim" (see the overall description section)
run memtest.py with the "hiv_sim" dataset uncommented

DATASETS = [
    #("iclip",       False,      None,       False),
    #("tcr",         False,      None,       False),
    ("hiv_sim",     True,       "hiv_simulation/HIV_HXB2.fasta",        False),
]

Run group.sh to generate per-file cluster reports using the tag_splitter CLI program. These reports are already included in the Zenodo repository should you wish to skip this step.

cp hiv_simulation/group.sh hiv_sim/
sh hiv_sim/group.sh

Also, run depth.sh to generate depth reports for each file.

cp hiv_simulation/depth.sh hiv_sim/
sh hiv_sim/depth.sh

Finally, run iVar variants to generate SNV data.

cp hiv_simulation/ivar_2025Aug14.sh hiv_sim/ 
sh hiv_sim/ivar_2025Aug14.sh

File descriptions

Files in this repository

iVar_reports: per-output-file SNV reports generated using iVar variants.
HIV_HXB2.fasta: HIV ref used for alignment/simulation
bench_hiv-scalability_2025-08-29_15_50_23.csv: benchmark data
rumina_hiv_20250926.html: HTML file generated from R analysis

Files on Zenodo

All files for this section can be found in https://doi.org/10.5281/zenodo.18176787.

As with other sections, BAM outputs are grouped into tar-balled archives for every combination of tool/threads/method/singleton strategy. You can find these archives ending in .tar.gz.
*.bam: input BAMs generated by aligning the FASTQ data generated from the simulation scripts
ivar_2025Aug14.sh: script to run iVar variants on all output files and generate reports in another directory.
group.sh: script to generate UMI cluster reports using tag_splitter
cluster_reports.tar.gz: holds said cluster reports.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RUMINA: high-throughput UMI deduplication for amplicon and whole-genome sequencing with enhanced error correction.

Table of Contents

Prelude

Software needed

Overall description

Notes on running `memtest.py`

File and directory descriptions

`analysis`

`iclip`

Instructions

File descriptions

Files in this repository

Files on Zenodo

Input BAMs generated from the FASTQ processing steps are located in these repositories

Analysis files

`tcr`

Instructions

File descriptions

Files in this repository

Files on Zenodo

`hiv_simulation`

Instructions

File descriptions

Files in this repository

Files on Zenodo

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
analysis		analysis
hiv_simulation		hiv_simulation
iclip		iclip
tag_splitter		tag_splitter
tcr		tcr
LICENSE		LICENSE
README.md		README.md
memtest.py		memtest.py

Folders and files

Latest commit

History

Repository files navigation

RUMINA: high-throughput UMI deduplication for amplicon and whole-genome sequencing with enhanced error correction.

Table of Contents

Prelude

Software needed

Overall description

Notes on running memtest.py

File and directory descriptions

analysis

iclip

Instructions

File descriptions

Files in this repository

Files on Zenodo

Input BAMs generated from the FASTQ processing steps are located in these repositories

Analysis files

tcr

Instructions

File descriptions

Files in this repository

Files on Zenodo

hiv_simulation

Instructions

File descriptions

Files in this repository

Files on Zenodo

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Notes on running `memtest.py`

`analysis`

`iclip`

`tcr`

`hiv_simulation`

Packages