BANDITS is a Bayesian hierarchical model for detecting differential splicing of genes and transcripts,
via differential transcript usage (DTU),
between two or more conditions.
The method uses a Bayesian hierarchical framework, which allows for sample specific proportions
in a Dirichlet-multinomial model, and samples the allocation of fragments to the transcripts.
Parameters are inferred via Markov chain Monte Carlo (MCMC) techniques and a DTU test is performed
via a multivariate Wald test on the posterior densities for the average relative abundance of transcripts.
Simone Tiberi and Mark D Robinson (2020). BANDITS: Bayesian differential splicing accounting for sample-to-sample variability and mapping uncertainty.
Genome Biology 21 (69). doi: 10.1186/s13059-020-01967-8
BANDITS is available on Bioconductor and can be installed with the command:
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("BANDITS")The vignette illustrating how to use the package can be accessed on Bioconductor or from R via:
vignette("BANDITS")or
browseVignettes("BANDITS")The package inputs the equivalence classes and respective counts, representing what transcripts each read is compatible with.
These can be obtained by aligning reads either directly to a reference transcriptome with pseudo-alignmers, via salmon or kallisto bustools, or to a reference genome with splice-aware genome alignment algorithms, via STAR, and checking the transcripts compatible with each genome alignment with salmon
NOTE: when using salmon, use the option --dumpEq to obtain the equivalence classes, and when using STAR, use the option --quantMode TranscriptomeSAM to obtain alignments translated into transcript coordinates.
Below we show three pipelines for aligning reads with salmon, kallisto bustools and STAR.
To obtain the example raw data, download or clone the ARMOR github repository:
git clone https://github.com/csoneson/ARMOR.git
# set a base_dir variable to the downloaded repo
base_dir="~/ARMOR"
# input reads:
fastq_files=$base_dir/example_data/FASTQThe example data consits of four paired-end RNA-seq reads of 63 base pairs.
Create a variable for the fasta format reference transcriptome (cDNA):
fasta_tr=$base_dir/example_data/reference/Ensembl.GRCh38.93/Homo_sapiens.GRCh38.cdna.all.1.1.10M.fa.gzMake the directory for the salmon output and the genome index
# create a directory for salmon
mkdir $base_dir/salmon
# create a directory for the genome index
mkdir $base_dir/salmon/Salmon_index
idx=$base_dir/salmon/Salmon_index
# create a directory for the output of the alignment
mkdir $base_dir/salmon/alignment
out_Salmon=$base_dir/salmon/alignmentBuild salmon index
salmon index -i $idx -t $fasta_tr -p 4 --type quasi -k 31Align reads and quantify transcript abundance with salmon:
salmon quant -i $idx -l A -1 $fastq_files/SRR1039508_R1.fastq.gz -2 $fastq_files/SRR1039508_R2.fastq.gz \
-p 4 -o $out_Salmon/sample1 --seqBias --gcBias --dumpEqThe option --dumpEq is essential to obtain the equivalence classes from salmon.
In the output folder ($out_Salmon/sample1), the file quant.sf contains the estimated transcripts abundances, while the equivalence classes (and respective counts) are stored in aux_info/eq_classes.txt.
Create a variable for the fasta format reference transcriptome (cDNA):
fasta_tr=$base_dir/example_data/reference/Ensembl.GRCh38.93/Homo_sapiens.GRCh38.cdna.all.1.1.10M.fa.gzMake the directory for the kallisto output and the genome index
# create a directory for kallisto
mkdir $base_dir/kallisto
# create a directory for the genome index
mkdir $base_dir/kallisto/kallisto_index
idx=$base_dir/kallisto/kallisto_index
# create a directory for the output of the alignment
mkdir $base_dir/kallisto/alignment
out_kallisto_alignment=$base_dir/kallisto/alignment
# create a directory for the output of the equivalence classes
mkdir $base_dir/kallisto/pseudo
out_kallisto_pseudo=$base_dir/kallisto/pseudoBuild kallisto index
kallisto index -i $idx/transcripts.idx $fasta_trAlign reads and quantify transcript abundance with kallisto:
kallisto quant -i $idx/transcripts.idx -o $out_kallisto_alignment/sample1 --bias --threads 4 \
$fastq_files/SRR1039508_R1.fastq.gz $fastq_files/SRR1039508_R2.fastq.gz In the output folder ($out_kallisto_alignment/sample1), the file abundance.tsv contains the estimated transcripts abundances.
Compute the equivalence classes and respective counts with kallisto bustools:
kallisto bus -i $idx/transcripts.idx -o $out_kallisto_pseudo/sample1 --threads 4 \
-x bulk --paired $fastq_files/SRR1039508_R1.fastq.gz $fastq_files/SRR1039508_R2.fastq.gz
bustools text $out_kallisto_pseudo/sample1/output.bus -o $out_kallisto_pseudo/sample1/bus.tsvIn the output folder ($out_kallisto_pseudo/sample1), the file matrix.ec contains the transcripts forming each equivalence class, while bus.tsv contains the equivalence classes counts.
Create a variable for the fasta format reference genome (DNA) and gtf file:
fasta=$base_dir/example_data/reference/Ensembl.GRCh38.93/Homo_sapiens.GRCh38.dna.chromosome.1.1.10M.fa
gtf=$base_dir/example_data/reference/Ensembl.GRCh38.93/Homo_sapiens.GRCh38.93.1.1.10M.gtfMake the directory for the STAR output and the genome index
# create a directory for STAR
mkdir $base_dir/STAR
# create a directory for the genome index
mkdir $base_dir/STAR/genome_index
GDIR=$base_dir/STAR/genome_index
# create a directory for the output of the alignment
mkdir $base_dir/STAR/alignment
outDir=$base_dir/STAR/alignmentGenerate the Genome index:
STAR --runMode genomeGenerate --runThreadN 4 --genomeDir $GDIR \
--genomeFastaFiles $fasta --sjdbGTFfile $gtf --sjdbOverhang 62Note that sjdbOverhang ideally should be set to the lenght of the reads -1 (our reads are 63 bps).
Align reads with STAR:
cd $outDir
STAR --runMode alignReads --runThreadN 4 --genomeDir $GDIR \
--readFilesIn <(zcat $fastq_files/SRR1039508_R1.fastq.gz) <(zcat $fastq_files/SRR1039508_R2.fastq.gz) \
--outFileNamePrefix sample1 --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAMWhen running STAR --quantMode TranscriptomeSAM is essential to obtain the transcript alignments.
Use gffread to build a reference transcriptome (fasta format) compatible with the DNA fasta and gtf files used for STAR:
gffread -w cDNA.fa -g $fasta $gtf
cdna=$outDir/cDNA.faUse salmon on the transcript alignments to compute the equivalence classes:
salmon quant -t $cdna -l A -a sample1Aligned.toTranscriptome.out.bam -o sample1 -p 4 --dumpEqThe option --dumpEq is essential to obtain the equivalence classes from salmon.
In the output folder ($outDir/sample1), the file quant.sf contains the estimated transcripts abundances, while the equivalence classes (and respective counts) are stored in aux_info/eq_classes.txt.
