OFF-PEAK

CNV detection tool for WES and targeted sequencing data.

16.04.24 UPDATE: An option --nochr was added to allow the use of reference genome with chromosome notation without "chr". 09.04.24 UPDATE: There was an error for the preprocessing step in R (error in ops.factor) because of the lack of "stringsAsFactors = FALSE". It has now been added to all scripts and subscripts.

This software was written by Mathieu Quinodoz in the group of Prof. Rivolta from the IOB in Basel, Switzerland. It was developped on Ubuntu 22.04.1 LTS. It is published in AJHG. [Link]

Prerequisites

BEDTools [Link] (>= v2.25.0)
mosdepth [Link] (>= v0.3.3)
R [Link] (>= v3.2.0) with following libraries: optparse, gplots, ExomeDepth, pROC and caTools

ExomeDepth library has been removed from CRAN. It can be installed the following way:

url <- "https://cran.r-project.org/src/contrib/Archive/ExomeDepth/ExomeDepth_1.1.16.tar.gz"
pkgFile <- "ExomeDepth_1.1.16.tar.gz"
download.file(url = url, destfile = pkgFile)
untar(pkgFile)
install.packages(c("Biostrings","IRanges","Rsamtools","GenomicRanges","aod","VGAM","GenomicAlignments","dplyr","magrittr"))
install.packages(pkgs=pkgFile, type="source", repos=NULL)
unlink(pkgFile)

Installation

The tool does not require compilation.

Usage

This tool contains four modules: 1° target processing, 2° coverage computation, 3° CNV discovery and 4° CNV plot.

1) Target processing

The main script 01_targets-processing.sh takes as input a BED file which contains the target regions of the sequencing data as well as the reference genome (FASTA format). It is outputing a BED file containing processed on-target and off-target regions. It is called with bash and its computation time for an exome BED file is few minutes:

bash 01_targets-processing.sh
  --genome [hg19|hg38]
  --targets target.bed
  --name target-panel
  --ref ref_genome.fa
  [other options]

The output file (target-panel.bed) will be placed in the data folder.

Required arguments

Option	Value	Description
--targets	STRING	BED file containing the target regions of the exome or targeted sequencing
--genome	[hg19/hg38]	Genome build used
--name	STRING	Name of the output file (without extension)
--ref	STRING	Reference genome in FASTA format (same as used to map the reads in BAM files)

Optional arguments

Option	Default	Value	Description
--minOntarget	100	1 - Inf	Minimal size of on-targets. Smaller on-target regions will be extended on each side to reach this value.
--maxOntarget	300	> minOntarget	Maximal size of on-targets. Larger on-target regions will be splitted into regions of equal size.
--minOfftarget	1	1 - Inf	Minimal size of off-targets. Smaller on-target regions will be discarded.
--maxOfftarget	50000	> minOfftarget	Maximal size of off-targets. Larger on-target regions will be splitted into regions of equal size.
--paddingOfftarget	300	0 - Inf	Padding around on-target regions to avoid targeted reads to be counted in off-target regions.
--nochr	---	---	Use if your BED file and your BAM files have the chromosome notation without "chr" like 1, 2,... instead of chr1, chr2,...

2) Coverage computation

The main script 02_coverage-count.sh takes as input a text file listing the sample BAM files (.bai index files are needed in the same folder, they can be generated with samtools index command) and IDs as well as the processed BED file from step 1. It is outputing a file containing the coverage for each sample and for each target. It is called with bash and its computation time is 2-3 minutes per sample for WES:

bash 02_coverage-count.sh
  --listBAM list_BAM.txt
  --mosdepth mosdepth-executable
  --work output_directory
  --targetsBED data/target-panel.bed

Required arguments

Option	Value	Description
--listBAM	STRING	Text file containing two tab-delimited columns: BAM files (with absolute location, for example /user/Download/Patient1.bam) and IDs (for example Patient1). It can also have only one column with BAM files. In this case, IDs will be deduced from the name of BAM files (not recommended).
--mosdepth	STRING	Mosdepth executable file that can be downloaded from the github page here: [Link]
--work	STRING	Output directory
--targetsBED	STRING	BED file with processed targets produced in step 1

3) CNV discovery

The main script 03_OFF-PEAK.R takes as input the coverage data per target computed in step 2 and outputs multiple files including annotated CNVs, CNV plots and statistics. It is called with Rscript and its computation time is ~1 hour per 10 samples:

Rscript 03_OFF-PEAK.R
  --output output_directory
  --data ALL.target.tsv
  --databasefile data/data-hg19.RData
  [other options]

Required arguments

Option	Value	Description
--output	STRING	Output directory
--data	STRING	Output of step 2
--databasefile	STRING	Absolute path to RData file containing various information found in data folder (data-hg19.RData or data-hg38.RData)

Optional arguments

Option	Default	Value	Description
--mincor	0.9	0 - 0.99	Minimal correlation of control samples compared to analyzed one.
--minsignal	2500	1 - Inf	Minimal signal for a target to be analyzed.
--maxvar	-0.2	-1 - 1	Maximal variance for a target to be analyzed.
--leaveoneout	1	0 or 1	If 1, leave-one-out PCA (LOO-PCA) will be used. If 0, standard PCA will be used.
--downsample	20000	100 - Inf	Number of downsampled targets for optimization of PC removal.
--nbFake	500	10 - 10000	Number of fake CNVs used for optimization of PC removal.
--stopPC	0.0001	0 - 0.1	Stopping criteria for optimization of PC removal.
--minZ	4	2 - 10	Minimal absolute Z-score for single target CNV processing.
--minOfftarget	1000	1 - Inf	Minimal size of off-targets without exons.
--chromosome-plots	-	-	If present, coverage plots for each chromosome will be done.
--genome-plots	-	-	If present, genome-wide coverage plots will be done.
--nb-plots	10	0 - Inf	Number of plots for CNVs per individual

4) CNV plotting

In step 3 , the 20 best CNVs per sample will be automatically plotted. This script is used to further plot other CNVs. The main script 04_OFF-PEAK-plot.R takes as input intermediate data from step 3 in order to do additional plots. It is called with Rscript and its computation time is few seconds per CNV:

Rscript 04_OFF-PEAK-plot.R
  --ID ID
  --chr chr
  --begin begin
  --end end
  --batch output_directory
  --out output-directory
  --databasefile data-hg19.RData
  [other options]

Required arguments

Option	Value	Description
--ID	STRING	Sample ID
--chr	STRING	Chromosome
--begin	number	Begin position of CNV
--end	number	End position of CNV
--batch	STRING	Output directory of step 3 (output_directory)
--out	STRING	Output directory
--databasefile	STRING	Absolute path to RData file containing various information (data-hg19.RData or data-hg38.RData)

Optional arguments

Option	Default	Value	Description
--side	20	1 - 100	Number of target plotted on each side of the region.

Output files

The output files are grouped in 6 general folders and one folder per sample.

01_general_stats with:

File name	Explanation
Pairwise-correlations-all.tsv	Text file with pairwise correlation between samples
Heatmap-correlations-all.pdf	PDF file with heatmap of the pairwise correlations
Maximal-correlation-per-sample.tsv	Text file with maximal correlation per sample
Maximal-correlation-per-sample.pdf	PDF file representing maximal correlation per sample
log_parameters.tsv	Text file with the log of parameters used

02_BED-files

File name	Explanation
CNVs-all-IGV.bed	BED file that be loaded on IGV for visualization of all CNVs
CNVs-targets-only-IGV.bed	BED file that be loaded on IGV for visualization of on-target CNVs
CNVs-all-AnnotSV.bed	BED file that be used for AnnotSV annotation of all CNVs
CNVs-targets-only-AnnotSV.bed	BED file that be used for AnnotSV annotation of on-target CNVs

03_Samples-info

File name	Explanation
Samples_quality_info.tsv	Quality information about samples
Samples_low-quality.tsv	Samples with low quality

04_CNVs-results

File name	Explanation
CNVs-all.tsv	All CNVs detected
CNVs-targets-only.tsv	On-target CNVs
The folder also includes filtered CNVs:
Type	Explanation
---	---
HQ	only high quality CNVs (PQ>2, CQ>2.5, QUAL>3)
unique	no other samples with overlapping CNVs
HQ-sample	only samples of high quality

05_RData-files

File name	Explanation
data-ID.RData	Data for all targets
data-plot-ID.RData	Data used for plotting CNVs
data-targets-ID.RData	Data for on-targets only

06_PC-plots

File name	Explanation
PC-removal-ID.pdf	PDF representing AUCs on fake CNVs depending on PC removal
Variance-explained-ID.pdf	PDF representing the variance explained by PCs

Individual folder

File name	Explanation
ID-merged.tsv	All CNVs found in the sample
ID-merged_targets-only.tsv	On-target CNVs found in the sample

Subfolders:

Sufolder name	Explanation
plots_on-targets_off-targets	Folder with graphical representation of best CNVs
plots_on-targets-only	Folder with graphical representation of best on-target CNVs
plots_genome	Folder with genome-wide representation of coverage
plots_chromosomes	Folder with chromosome representation of coverage

Explanation of fields in CNV text files:

Field	Explanation
ID	ID of the sample
Chromosome	Chromosome
Begin	Coordinate of the beginning of the CNV
End	Coordinate of the end of the CNV
Begin-min	Minimal begin position of detected CNV (including low quality targets)
End-max	Maximal end position of detected CNV (including low quality targets)
Type	Deletion or duplication
Ploidy	Most likely ploidy based on Z-scores
Nb_targets	Number of targets in the CNV
Targets	ID of targets in the CNV
Genes	Genes affected by the CNV
Exons	Exons affected by the CNV
Genes-possible	Genes affected by minimal to maximal positions (including low quality targets)
ncRNAs	non-coding RNAs affected by the CNV
RefSeq_Functional_Element	Function elements affected by the CNV
Nb_overlapping_samples	Number of samples with an overlapping CNV
gnomAD-CNV_1%	Frequent included CNVs from gnomAD with AF>1%
gnomAD-CNV_ALL	All included CNVs from gnomAD
Counts	Total coverage in the targets included in the CNV
Average_counts	Average total coverage for the controls
SD_counts	Standard deviation of the total coverage for the controls
Ratio	Ratio between sample coverage and control coverage
Z-score_cor	Corrected Z-score for the coverage
PQ	Ploidy quality
CQ	CNV quality (based on "shape")
QUAL	Overall quality
Rank_sample	Rank of the CNV in the sample based on absolute Z-score
Nb_CNV_HQ	Number of high quality CNVs in the sample
ClinVar-patho_included	All included CNVs from ClinVar
ClinVar-patho_overlap	All overlaping CNVs from ClinVar
Nb-exon-before-chr	Number of exons between the CNV and the beginning of the chromosome
Nb-exon-after-chr	Number of exons between the CNV and the end of the chromosome

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

OFF-PEAK

Prerequisites

Installation

Usage

1) Target processing

Required arguments

Optional arguments

2) Coverage computation

Required arguments

3) CNV discovery

Required arguments

Optional arguments

4) CNV plotting

Required arguments

Optional arguments

Output files

01_general_stats with:

02_BED-files

03_Samples-info

04_CNVs-results

05_RData-files

06_PC-plots

Individual folder

Subfolders:

Explanation of fields in CNV text files:

About

Uh oh!

Releases 1

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 120 Commits
data		data
subscripts		subscripts
01_targets-processing.sh		01_targets-processing.sh
02_coverage-count.sh		02_coverage-count.sh
03_OFF-PEAK.R		03_OFF-PEAK.R
04_OFF-PEAK-plot.R		04_OFF-PEAK-plot.R
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

OFF-PEAK

Prerequisites

Installation

Usage

1) Target processing

Required arguments

Optional arguments

2) Coverage computation

Required arguments

3) CNV discovery

Required arguments

Optional arguments

4) CNV plotting

Required arguments

Optional arguments

Output files

01_general_stats with:

02_BED-files

03_Samples-info

04_CNVs-results

05_RData-files

06_PC-plots

Individual folder

Subfolders:

Explanation of fields in CNV text files:

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages