SoloVar is a modular toolkit for somatic variant analysis (SNV & CNV) in tumor samples without a matching normal. It provides end-to-end pipelines for trimming, mapping, variant calling, and annotation, all driven by simple YAML configuration files.
| Script | Purpose |
|---|---|
| trimming_pipeline.sh | QC and trim raw FASTQ files |
| mapping_pipeline.sh | Map reads, mark duplicates, BQSR, QC |
| qc_pipeline.py | Generate QC plots & stats from mapping logs |
| somatic_variant_calling_pipeline.sh | Somatic variant calling & filtering (Mutect2) |
| vcf_annotation_pipeline.sh | Annotate/filter VCFs, convert to MAF, merge |
| cnvkit_pipeline.sh | Copy number variation (CNV) analysis |

- Clone the repo & install dependencies
- Prepare your YAML config files (see below)
- Run the desired pipeline:
# Example: Run mapping pipeline
./mapping_pipeline.sh config.yaml

You must have the following tools installed and available in your $PATH:
- bwa (for mapping)
- picard (for sorting, marking duplicates)
- gatk (for BQSR, Mutect2, etc.)
- fastqc, multiqc (for QC)
- fastp (for trimming)
- python3 (for QC pipeline)
- matplotlib, seaborn, pandas (for qc_pipeline.py)
- funcotator, vep, OncoKB annotator (for annotation)
- cnvkit.py (for CNV analysis)
- parallel (GNU parallel, for CNVkit pipeline)

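Before launching a pipeline, it can help to confirm that the command-line tools are actually reachable. The following is a minimal sketch, not part of SoloVar: `missing_tools` and `REQUIRED_TOOLS` are hypothetical names, and the list only covers executables whose command names are given in the tool list (annotation tools are omitted because their command names depend on how they were installed).

```python
import shutil

# Executable names taken from the tool list above. Annotation tools are
# left out on purpose: their command names vary between installations.
REQUIRED_TOOLS = [
    "bwa", "picard", "gatk", "fastqc", "multiqc",
    "fastp", "python3", "cnvkit.py", "parallel",
]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the subset of `tools` that cannot be found on $PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from $PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.
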
Install with:
conda install -c bioconda bwa picard gatk4 fastqc multiqc fastp cnvkit parallel
pip install pyyaml matplotlib seaborn pandas
# For annotation: follow GATK Funcotator, VEP, and OncoKB installation guides

The qc_pipeline.py script generates summary plots and statistics from mapping logs after the mapping pipeline completes. It is called automatically by mapping_pipeline.sh, but can also be run manually:
python3 qc_pipeline.py <combined_log.txt> <output_directory>

Inputs:
- combined_log.txt: Log file generated by the mapping pipeline
- output_directory: Where to save plots and stats

Outputs:
- samples_stats.txt: Table of mapping and pairing stats per sample
- combined_mapping_plot_1.png: Multi-panel QC plot

Dependencies: python3, matplotlib, seaborn, pandas
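
When running the QC step manually from another script, a small wrapper can build the command and check that both documented output files were written. This is a sketch under assumptions: `qc_command` and `missing_qc_outputs` are hypothetical helpers, and the only facts taken from the description above are the argument order and the two output file names.

```python
from pathlib import Path

# Output file names as listed in the Outputs section above.
QC_OUTPUTS = ("samples_stats.txt", "combined_mapping_plot_1.png")


def qc_command(combined_log, output_directory):
    """Build the argv for a manual qc_pipeline.py run."""
    return ["python3", "qc_pipeline.py", str(combined_log), str(output_directory)]


def missing_qc_outputs(output_directory):
    """Return the documented output files that are absent from the directory."""
    out = Path(output_directory)
    return [name for name in QC_OUTPUTS if not (out / name).is_file()]
```

A caller could pass the result of `qc_command(...)` to `subprocess.run(..., check=True)` and then treat a non-empty `missing_qc_outputs(...)` result as a failed run.
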
Example config for trimming_pipeline.sh:
samples: /path/to/sample_list.txt
output_directory: /path/to/output
read_type: paired
threads: 4

Example config for mapping_pipeline.sh:
fastq_dir: /path/to/fastqs
reference: /path/to/genome.fa
output_directory: /path/to/output
threads: 8
known_sites: /path/to/known_sites.vcf

Example config for somatic_variant_calling_pipeline.sh:
bwa_files: /path/to/bams
output_directory: /path/to/output
reference: /path/to/genome.fa
intervals: /path/to/intervals.bed
pon: /path/to/pon.vcf
gr: /path/to/gnomad.vcf
subgr: /path/to/common_sites.vcf
threads: 8

Example config for vcf_annotation_pipeline.sh:
vcf_dir: /path/to/vcfs
output_directory: /path/to/output
reference: /path/to/genome.fa
funcotator_data_sources: /path/to/funcotator_data
oncokb_enabled: yes
token: <your_oncokb_token>
threads: 4

Example config for cnvkit_pipeline.sh:
bed: /path/to/regions.bed
ref: /path/to/reference.fa
acc: /path/to/access-5kb-mappable.hg38.bed
out_dir: /path/to/output/
inputsamples: /path/to/bam_list.txt
cellularity_file: /path/to/cellularity.txt
threads: 8
annotation_file: /path/to/annotation.bed

Trimming pipeline:
- Runs FastQC & MultiQC on raw and trimmed reads
- Trims reads with fastp (paired/single-end)

Mapping pipeline:
- Maps reads with BWA
- Sorts, marks duplicates, and optionally recalibrates base quality
- Runs QC with qc_pipeline.py

Somatic variant calling pipeline:
- Calls variants with Mutect2
- Filters variants, calculates contamination, and selects high-confidence calls

Annotation pipeline:
- Annotates VCFs with Funcotator
- Converts to MAF, annotates with OncoKB, merges MAFs (if enabled)

CNV pipeline:
- Runs CNVkit to perform copy number analysis on BAM files
- Supports purity correction if a cellularity file is provided

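
The variant calling steps listed above (Mutect2, contamination estimate, filtering, selection of passing calls) resemble the usual GATK tumor-only sequence. The sketch below builds those commands as argument lists. It is an illustration, not the contents of somatic_variant_calling_pipeline.sh: the function name, the output file names, and the mapping of config keys to flags (`gr` as the germline resource, `subgr` as the common-sites file for pileups, `pon` as the panel of normals) are assumptions, and the real script may use different options.

```python
from pathlib import Path


def tumor_only_commands(cfg, bam, sample):
    """Return GATK argv lists for one tumor-only sample (illustrative only)."""
    out = Path(cfg["output_directory"])
    # Hypothetical output names; the real pipeline may name files differently.
    raw_vcf = str(out / f"{sample}.unfiltered.vcf.gz")
    pileups = str(out / f"{sample}.pileups.table")
    contamination = str(out / f"{sample}.contamination.table")
    filtered_vcf = str(out / f"{sample}.filtered.vcf.gz")
    pass_vcf = str(out / f"{sample}.pass.vcf.gz")
    bam = str(bam)

    return [
        # 1. Call candidate somatic variants without a matched normal.
        ["gatk", "Mutect2", "-R", cfg["reference"], "-I", bam,
         "-L", cfg["intervals"],
         "--germline-resource", cfg["gr"],      # assumed meaning of `gr`
         "--panel-of-normals", cfg["pon"],
         "-O", raw_vcf],
        # 2. Summarize read support at common sites (assumed meaning of `subgr`).
        ["gatk", "GetPileupSummaries", "-I", bam,
         "-V", cfg["subgr"], "-L", cfg["subgr"], "-O", pileups],
        # 3. Estimate cross-sample contamination from the pileup table.
        ["gatk", "CalculateContamination", "-I", pileups, "-O", contamination],
        # 4. Apply Mutect2 filters, using the contamination estimate.
        ["gatk", "FilterMutectCalls", "-R", cfg["reference"], "-V", raw_vcf,
         "--contamination-table", contamination, "-O", filtered_vcf],
        # 5. One way to keep only calls that passed every filter.
        ["gatk", "SelectVariants", "-R", cfg["reference"], "-V", filtered_vcf,
         "--exclude-filtered", "-O", pass_vcf],
    ]
```

Returning argument lists instead of running the tools keeps the sketch side-effect free; each list could be handed to `subprocess.run(cmd, check=True)` in order.
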
The Filtering Pipeline is a conceptual step shown for users who wish to implement additional filtering after annotation (e.g., by allele frequency, known databases, or custom logic). No script or YAML is provided; users can design their own filtering as needed.
- All scripts require a YAML config as the only argument
- Make sure all paths in your YAML are absolute or relative to your working directory
- Check each script's header for more details and options
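
Because every script takes a single config file and relative paths are resolved against the working directory, it can be useful to normalize paths before a run. The sketch below is one possible pre-flight helper, not SoloVar code. It assumes the flat `key: value` layout used in the example configs and parses it by hand to stay dependency-free; a real YAML parser such as PyYAML's `yaml.safe_load` is the more robust choice. The names `parse_flat_config`, `resolve_paths`, and `NON_PATH_KEYS` are hypothetical.

```python
from pathlib import Path

# Keys from the example configs whose values are not filesystem paths.
NON_PATH_KEYS = {"threads", "read_type", "oncokb_enabled", "token"}


def parse_flat_config(text):
    """Parse flat 'key: value' lines. This is not a general YAML parser."""
    cfg = {}
    for line in text.splitlines():
        # Drop trailing comments; values containing '#' are not supported here.
        line = line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        cfg[key.strip()] = value.strip()
    return cfg


def resolve_paths(cfg, base_dir=None):
    """Return a copy of cfg with relative path values made absolute."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    resolved = dict(cfg)
    for key, value in cfg.items():
        if key in NON_PATH_KEYS:
            continue
        path = Path(value)
        resolved[key] = str(path if path.is_absolute() else base / path)
    return resolved
```

For example, `resolve_paths(parse_flat_config(open("config.yaml").read()))` would turn `fastq_dir: fastqs` into an absolute path under the current working directory while leaving `threads` untouched.
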
Open an issue or pull request for questions, bugs, or improvements!




