GitHub - NarlikarLab/exoDIVERSITY


exoDIVERSITY is a tool for resolving diverse protein-DNA footprints from exonuclease-based ChIP experiments such as ChIP-exo or ChIP-nexus.

The core engine is written in C, with a Python wrapper for parallel processing and plotting. R is used to plot the final heatmap images.
The following packages need to be installed for running exoDIVERSITY:
* Python 2.7.x (not compatible with Python 3.x)
* python-numpy
* python-ctypes
* python-re
* python-matplotlib
* R >= 3.3
* R packages: RColorBrewer and plotfunctions

Extra tools needed:
* bedtools v2.25.0
* twoBitToFa: UCSC tool to extract fasta sequences from .2bit file
=> Linux 64 bit version: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/
=> macOS version: http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.x86_64/

Extra files needed:
* .2bit file for the respective genome assembly
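Before running make, it can help to confirm that the external tools listed above are reachable. The following is a minimal sketch, not part of exoDIVERSITY itself; the executable names "bedtools", "twoBitToFa" and "R" are assumptions about how the tools are installed on your system, and the helper names are hypothetical.

```python
import os


def find_on_path(program, path=None):
    """Return the full path of an executable found on PATH, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_tools(programs=("bedtools", "twoBitToFa", "R"), path=None):
    """List the programs (assumed names) that cannot be found on PATH."""
    return [p for p in programs if find_on_path(p, path) is None]


if __name__ == "__main__":
    for name in missing_tools():
        print("not found on PATH: " + name)
```

The script only checks that the executables exist; it does not verify versions such as bedtools v2.25.0 or R >= 3.3.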

INSTALLATION:

exoDIVERSITY is available at:

To install exoDIVERSITY execute the following commands:
wget https://github.com/NarlikarLab/exoDIVERSITY/releases/download/v1.2/exoDIVERSITY.tar.gz
tar -xvf exoDIVERSITY.tar.gz
cd exoDIVERSITY
make

To execute exoDIVERSITY from anywhere, add the exoDIVERSITY directory to the PATH variable.

USAGE: exoDiversity [options]
-f: Input fasta file
-r: Reads file either in BAM (sorted) or bedGraph format
-format: Format of the reads file BAM or BED
-g: Genome file containing the sizes of all chromosomes
-ctrl: Control reads file in the same format as the reads file
-o: Output directory (must be new)
-rev: is 1 if reverse complement is to be considered; otherwise 0. Default 1
-mask: is 1 if repeats are to be masked; otherwise 0. Default 0
-initialWidth: The width of the motifs at starting point
-minMode: Minimum number of modes in which data should be divided. Default 1
-maxMode: Maximum number of modes in which the data should be divided. Default 10
-rWidth: The width of the read windows for both positive and negative strand. Default 5
-gobeyond: 0 or 1. 1 allows the read windows to go beyond the motif on both strands. Default 0
-nproc: The number of processors to be used for computation. Default is the number of cores the system has
-v: 0 or 1. 1 to save plots for the posterior scores. Default 0
-bin: Binarize read counts based on median, first quartile or third quartile or keep when file is already in binary form {median,Q1,Q3,keep}. Default median
-ntrials: Number of trials for each model. Default 5
-pcZeros: Pseudo count for 0s in reads data. Default 1
-pcOnes: Pseudo count for 1s in reads data. Default 1
-twobit: 2bit file (from UCSC browser) for sequence alignment plots
If sequence-wise positive and negative strand read counts are already available:
-p: The positive strand reads file
-n: The negative strand reads file
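To illustrate what the -bin option describes, here is a minimal sketch of binarizing read counts against the median, first quartile or third quartile. It is not taken from the exoDIVERSITY source: the strict "greater than" comparison, the linear interpolation of quantiles, and computing the threshold over all positions are assumptions, and the function names are hypothetical.

```python
def quantile(values, q):
    """Linearly interpolated quantile of a non-empty list, q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] * (1 - frac) + ordered[upper] * frac


# Mapping from the documented -bin values to quantile levels.
THRESHOLD_QUANTILE = {"median": 0.5, "Q1": 0.25, "Q3": 0.75}


def binarize(counts, mode="median"):
    """Return 0/1 values; 'keep' leaves already-binary data unchanged."""
    if mode == "keep":
        return list(counts)
    threshold = quantile(counts, THRESHOLD_QUANTILE[mode])
    # Assumption: counts strictly above the threshold become 1.
    return [1 if c > threshold else 0 for c in counts]


print(binarize([0, 1, 2, 3, 4], "median"))
```

With this convention a lower threshold (Q1) marks more positions as 1, and a higher threshold (Q3) marks fewer.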

OUTPUT:
The output of exoDIVERSITY contains the following components:
1) A "reads" directory containing the read counts for each sequence for both the positive and negative strands.
It also contains the final binarized read counts files for both the strands.
2) For each number of modes <m> (from -minMode to -maxMode), a directory <m>modes containing the model with <m> modes.
Each <m>modes directory contains the following files:
For i in {0,..<m>}
a) logo_<i>.png and logo_<i>_rc.png are the motifs and their reverse complement forms
b) reads_<i>.png are the positive and negative strand read profiles for each motif
c) info.txt: This file contains information about the mode, the motif start position in the sequence, the positions of the positive and negative strand read windows and the strand for each sequence
d) events.bed: It contains the motif regions in each sequence in the sorted order of the modes
e) bestModelParams.txt: It contains all the parameter values for the best model learned
f) seqsInProbSpace.png: It is a plot containing the probability of a sequence belonging to each of the modes in the model for all the sequences.
3) A "settings.txt" file containing all the values of the parameters used for the run
4) "exoDiversity.html" contains the output for the best model identified by exoDiversity.
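The output layout described above can be inspected with a short script. This is a sketch under stated assumptions: it assumes the per-model directories are literally named like "3modes" (the number followed by "modes"), and it only checks the four files whose names do not depend on the mode index. The function name is hypothetical.

```python
import os
import re

# Files listed above whose names do not depend on the mode index <i>.
EXPECTED_FILES = ("info.txt", "events.bed", "bestModelParams.txt",
                  "seqsInProbSpace.png")
MODE_DIR = re.compile(r"^(\d+)modes$")  # assumed naming, e.g. "3modes"


def summarize_output(outdir):
    """Map each mode count m to the expected files missing from <m>modes."""
    summary = {}
    for name in os.listdir(outdir):
        match = MODE_DIR.match(name)
        full = os.path.join(outdir, name)
        if match and os.path.isdir(full):
            present = set(os.listdir(full))
            summary[int(match.group(1))] = [
                f for f in EXPECTED_FILES if f not in present
            ]
    return summary
```

Calling summarize_output on an output directory returns an empty list for every model whose directory is complete, which makes interrupted runs easy to spot.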

EXAMPLE:
An example FASTA file, along with small BAM files for the experimental and control reads, is provided in the Example directory. exoDIVERSITY can be run as follows to produce output similar to that in the Example/out directory.

./exoDiversity -f Example4/combined_FoxA1_CTCF.fasta -r Example4/combined_small.bam -ctrl Example4/combined_control_small.bam -o Example4/out -twobit /data/genomeData/hg19/hg19.2bit -format BAM
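When many runs are scripted, the command above can be assembled programmatically. The sketch below only builds the argument list from the documented flags and checks that the output directory is new, as -o requires; the helper names are hypothetical, and the paths are copied from the example command and must be adapted to your system.

```python
import os


def build_command(executable, options):
    """Turn [("f", "seqs.fasta"), ...] into [executable, "-f", "seqs.fasta", ...]."""
    command = [executable]
    for flag, value in options:
        command.extend(["-" + flag, str(value)])
    return command


def output_dir_is_new(path):
    """The -o directory must not exist before the run."""
    return not os.path.exists(path)


example_options = [
    ("f", "Example4/combined_FoxA1_CTCF.fasta"),
    ("r", "Example4/combined_small.bam"),
    ("ctrl", "Example4/combined_control_small.bam"),
    ("o", "Example4/out"),
    ("twobit", "/data/genomeData/hg19/hg19.2bit"),
    ("format", "BAM"),
]

command = build_command("./exoDiversity", example_options)
print(" ".join(command))
# To launch the run (not done here):
#   import subprocess; subprocess.check_call(command)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.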