This repository contains a Nextflow pipeline for sequencing preprocessing steps used in CoCUT&Tag analysis.
The workflow supports two input modes:

- `paired_fastq`: start from already demultiplexed paired FASTQs
- `demux`: start from raw sciCUT&Tag-style `R1`/`R2`/`I1`/`I2` FASTQs plus primer and Tn5 annotation tables

The pipeline covers:
- demultiplexing of FASTQs using the provided index annotation tables
- barcode correction / FASTQ header rewriting using a Well-ID matrix
- adapter trimming with Cutadapt
- paired-end alignment with Bowtie2
- SAM to BAM conversion
- BAM to compressed BED fragment generation
- normalized BigWig generation

Repository contents:

- `main.nf`: main Nextflow workflow
- `nextflow.config`: runtime profiles and resource defaults
- `bin/rewrite_fastq_barcodes`: wrapper that prefers the compiled barcode-rewrite binary
- `bin/rewrite_fastq_barcodes.py`: barcode rewrite helper extracted from the notebook
- `bin/modify_scict_header.sh`: generic header normalizer used for sciCT demultiplexing mode
- `src/rewrite_fastq_barcodes.cpp`: fast C++ implementation of barcode rewriting
- `tools/build_rewrite_fastq_barcodes.sh`: build script for the C++ binary
- `envs/cuttag-preprocess.yml`: Conda environment definition
- `envs/cuttag-preprocess-container.yml`: lighter Conda environment used inside the Singularity/Apptainer image
- `containers/cuttag-preprocess.def`: Singularity/Apptainer definition file
- `containers/cuttag-preprocess.sif`: default Singularity image path expected by the config
- `METHODS.md`: extended workflow notes and examples

The barcode rewrite step now prefers a compiled C++ implementation because this stage is dominated by gzip I/O and header string processing, which is much faster in native code than in pure Python.
Build the fast binary once:

    ./tools/build_rewrite_fastq_barcodes.sh

This creates:

    bin/rewrite_fastq_barcodes_cpp

The wrapper `bin/rewrite_fastq_barcodes` will:

- use the compiled binary if it already exists
- try to build it automatically with `g++` if available
- fall back to the original Python implementation if compilation is not possible

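
The three-step preference order of the wrapper can be sketched in Python as follows. This is an illustration only, not the contents of `bin/rewrite_fastq_barcodes`: the function name `choose_rewrite_command` and its `compiler` argument are hypothetical, and the sketch assumes the build script is executable and writes the binary to the path listed above.

```python
import shutil
import subprocess
from pathlib import Path


def choose_rewrite_command(repo_root, compiler="g++"):
    """Return the command (as a list) that would run barcode rewriting.

    Hypothetical helper mirroring the documented order of preference:
    existing binary, freshly built binary, Python fallback.
    """
    root = Path(repo_root)
    binary = root / "bin" / "rewrite_fastq_barcodes_cpp"
    build_script = root / "tools" / "build_rewrite_fastq_barcodes.sh"
    python_helper = root / "bin" / "rewrite_fastq_barcodes.py"

    # 1. Use the compiled binary if it already exists.
    if binary.is_file():
        return [str(binary)]

    # 2. Try to build it when the compiler is on PATH and the script is present.
    if shutil.which(compiler) and build_script.is_file():
        try:
            result = subprocess.run([str(build_script)], cwd=root, check=False)
        except OSError:
            result = None  # e.g. the script is not executable
        if result is not None and result.returncode == 0 and binary.is_file():
            return [str(binary)]

    # 3. Fall back to the Python implementation.
    return ["python3", str(python_helper)]
```

Calling `choose_rewrite_command(".")` from the repository root returns the command to prepend to the rewrite arguments. Building once ahead of time keeps step 2 from running inside every pipeline task.
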
Required parameters:

- `--input_mode`: `paired_fastq` or `demux`
- `--input_dir`: directory containing paired `*_R1.fq.gz` and `*_R2.fq.gz`
- `--ref`: Bowtie2 index basename
- `--chrom_sizes`: chromosome sizes file for BigWig generation

For demux mode, use these instead of --input_dir:

- `--primer_annot`: primer annotation file
- `--tn5_annot`: Tn5 barcode annotation file
- `--fastq1`: the R1 FASTQ file
- `--fastq2`: the R2 FASTQ file
- `--umi1`: the I1 FASTQ file
- `--umi2`: the I2 FASTQ file
- `--demux_min_reads`: minimum total reads for a sample to be retained after demultiplexing, default `10000`
- `--demux_swap_index_ends`: swap the first and last 8 bases of `I1`/`I2` before demultiplexing, default `true`

Both annotation files are required for demux mode.
For the current sciCUT&Tag test data, the index reads are arranged as:

- `I1`: `i7 ... j7`
- `I2`: `j5 ... i5`

while sciCTextract expects:

- `I1`: `j7 ... i7`
- `I2`: `i5 ... j5`

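
Converting between these two layouts amounts to exchanging the 8-base ends of each index read while leaving the middle untouched. A minimal Python sketch of that operation is shown here; the function name `swap_index_ends` is hypothetical, the sequences are placeholders rather than real barcodes, and the pipeline's own implementation may differ.

```python
def swap_index_ends(read, n=8):
    """Swap the first and last n characters of an index read, keeping the middle."""
    if n <= 0:
        raise ValueError("n must be positive")
    if len(read) < 2 * n:
        raise ValueError(f"index read shorter than {2 * n} characters: {read!r}")
    return read[-n:] + read[n:-n] + read[:n]


# Placeholder I1 read laid out as i7 ... j7 (not real barcode sequences).
i7, middle, j7 = "AAAAAAAA", "GGTT", "CCCCCCCC"
print(swap_index_ends(i7 + middle + j7))  # CCCCCCCCGGTTAAAAAAAA
```

Applying the same function to the FASTQ quality string keeps bases and qualities aligned. The swap is its own inverse, so applying it twice returns the original read.
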
The pipeline therefore swaps the first and last 8 bases of each index read by default in demux mode. Disable this only if your run already matches the native sciCTextract layout:

    --demux_swap_index_ends false

Primer annotation required columns:

- `Sample` or `ID`
- `i7_index_seq`
- `i5_index_seq`

The preprocessing step converts this to:

- `ID`
- `i7_index_seq`
- `i5_index_seq`
- `i7_index_id`
- `i5_index_id`

Tn5 annotation required columns:

- `Sample Name`
- `Tn5_s7`
- `Tn5_s7_seq`
- `Tn5_s5`
- `Tn5_s5_seq`

Barcode rewriting input:

- `--barcode_matrix`: CSV with columns `PAGE-1-s7`, `PAGE-1-s5`, `PAGE-2-s7`, `PAGE-2-s5`, `Well-ID`

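
Before launching a run, it can help to confirm that each table carries the columns listed above. The following Python sketch does this check; it is not part of the pipeline, the names `missing_columns`, `PRIMER_REQUIRED`, `TN5_REQUIRED`, and `MATRIX_REQUIRED` are hypothetical, and it assumes all three tables are comma-separated with the header in the first row.

```python
import csv
import io

# Each tuple is a group of acceptable alternatives for one required column.
PRIMER_REQUIRED = [("Sample", "ID"), ("i7_index_seq",), ("i5_index_seq",)]
TN5_REQUIRED = [("Sample Name",), ("Tn5_s7",), ("Tn5_s7_seq",),
                ("Tn5_s5",), ("Tn5_s5_seq",)]
MATRIX_REQUIRED = [("PAGE-1-s7",), ("PAGE-1-s5",), ("PAGE-2-s7",),
                   ("PAGE-2-s5",), ("Well-ID",)]


def missing_columns(handle, required):
    """Return the requirement groups that no column in the CSV header satisfies."""
    header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [group for group in required
            if not any(name in present for name in group)]


# In-memory example; with real files, pass open(path, newline="") instead.
example = io.StringIO("ID,i7_index_seq,i5_index_seq\n")
print(missing_columns(example, PRIMER_REQUIRED))  # []
```

An empty list means every required column was found. Any returned group names the column, or set of alternatives, that is absent.
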
Filename-based sample filtering is configurable.
By default, no samples are excluded by filename.
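
The effect of the two filtering options in this section, `--enable_sample_filter` and `--skip_patterns`, can be sketched as below. How the pipeline actually matches patterns is not stated here, so this sketch assumes plain case-sensitive substring matching; the function names `parse_skip_patterns` and `keep_sample` and the example filenames are hypothetical.

```python
def parse_skip_patterns(value):
    """Split a comma-separated pattern string, dropping empty entries."""
    return [part.strip() for part in value.split(",") if part.strip()]


def keep_sample(filename, skip_patterns="", enable_filter=True):
    """Return True if the sample should be processed.

    Assumption: a pattern excludes a file when it occurs anywhere in the name.
    """
    if not enable_filter:
        return True
    return not any(p in filename for p in parse_skip_patterns(skip_patterns))


names = ["S1_R1.fq.gz", "PosCtrl_R1.fq.gz"]
print([n for n in names if keep_sample(n, "PosCtrl,NegCtrl")])  # ['S1_R1.fq.gz']
```

With an empty pattern string, every file is kept, which matches the default of excluding nothing.
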
Disable sample filtering completely:

    --enable_sample_filter false

Provide your own comma-separated exclusion patterns:

    --skip_patterns PosCtrl,NegCtrl,MyControl

Use the included Conda environment:

    nextflow run main.nf -profile conda \
      --input_mode paired_fastq \
      --input_dir /path/to/fastq \
      --barcode_matrix /path/to/barcode_matrix.csv \
      --ref /path/to/bowtie2/index_basename \
      --chrom_sizes /path/to/genome.chrom.sizes \
      --out_dir results

For SLURM:

    nextflow run main.nf -profile slurm,conda \
      --input_mode paired_fastq \
      --input_dir /path/to/fastq \
      --barcode_matrix /path/to/barcode_matrix.csv \
      --ref /path/to/bowtie2/index_basename \
      --chrom_sizes /path/to/genome.chrom.sizes \
      --out_dir results

Example sciCT demultiplexing mode:

    nextflow run main.nf -profile slurm,conda \
      --input_mode demux \
      --primer_annot /path/to/Primer_Annotation.csv \
      --tn5_annot /path/to/Tn5_Barcode_Annotation.csv \
      --fastq1 /path/to/sample_R1.fastq.gz \
      --fastq2 /path/to/sample_R2.fastq.gz \
      --umi1 /path/to/sample_I1.fastq.gz \
      --umi2 /path/to/sample_I2.fastq.gz \
      --demux_min_reads 10000 \
      --barcode_matrix /path/to/barcode_matrix.csv \
      --ref /path/to/bowtie2/index_basename \
      --chrom_sizes /path/to/genome.chrom.sizes \
      --out_dir results

Build the image from the included definition file:

    containers/build_singularity_image.sh

On a local Linux machine where you have `sudo` privileges, the helper runs:

    sudo singularity build containers/cuttag-preprocess.sif containers/cuttag-preprocess.def

If you use Apptainer instead:

    sudo apptainer build containers/cuttag-preprocess.sif containers/cuttag-preprocess.def

If you cannot build locally, use a remote builder or ask your HPC admins to build it:

    singularity build --remote containers/cuttag-preprocess.sif containers/cuttag-preprocess.def

Equivalent helper command:

    containers/build_singularity_image.sh remote

Then run:

    nextflow run main.nf -profile singularity \
      --input_dir /path/to/fastq \
      --barcode_matrix /path/to/barcode_matrix.csv \
      --ref /path/to/bowtie2/index_basename \
      --chrom_sizes /path/to/genome.chrom.sizes \
      --out_dir results

To use a different image location:

    nextflow run main.nf -profile singularity \
      --singularity_image /path/to/container.sif \
      --input_dir /path/to/fastq \
      --barcode_matrix /path/to/barcode_matrix.csv \
      --ref /path/to/bowtie2/index_basename \
      --chrom_sizes /path/to/genome.chrom.sizes \
      --out_dir results

If your references or data are outside the working directory, bind those filesystem roots into the container. The default bind path is `./data` relative to the directory where you launch Nextflow.

    nextflow run main.nf -profile singularity \
      --container_bind_paths ./data \
      --input_dir /path/to/fastq \
      --barcode_matrix /path/to/barcode_matrix.csv \
      --ref /varidata/research/projects/bbc/versioned_references/latest/data/hg38_gencode/indexes/bowtie2/hg38_gencode \
      --chrom_sizes /varidata/research/projects/bbc/versioned_references/2024-10-31_10.56.03_v17/data/hg38_gencode/sequence/hg38_gencode.fa.fai \
      --out_dir results

For multiple roots, provide a comma-separated list:

    --container_bind_paths ./data,/scratch,/home

Notes:

- sample-name filtering is available, but no filename patterns are excluded unless `--skip_patterns` is provided
- demux mode keeps only samples with total reads greater than `--demux_min_reads`, default `10000`
- demux mode swaps the first and last 8 bases of `I1`/`I2` by default before calling `sciCTextract`
- barcode rewriting is always applied and requires `--barcode_matrix`
- adapter sequence defaults to `CTGTCTCTTATACACATCT`
- alignment resources default to `16` CPUs, `256 GB`, and `18h`

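
The read-count threshold applied in demux mode can be illustrated with a few lines of Python. This is a sketch under stated assumptions, not the pipeline's code: the function name `retained_samples` and the sample names are hypothetical, and the strict "greater than" comparison follows the wording of the note above.

```python
def retained_samples(read_counts, min_reads=10000):
    """Keep samples whose total read count is strictly greater than min_reads."""
    return {sample: n for sample, n in read_counts.items() if n > min_reads}


# Hypothetical per-sample totals after demultiplexing.
counts = {"sampleA": 250000, "sampleB": 10000, "sampleC": 42}
print(retained_samples(counts))  # {'sampleA': 250000}
```

Under this reading, a sample with exactly `10000` reads is dropped. Lower `--demux_min_reads` if shallow test samples need to pass through.
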

Requirements:

- `nextflow` must be installed on the host system.
- The Conda profile creates the software environment automatically from `envs/cuttag-preprocess.yml`.
- The Singularity and Apptainer profiles use `containers/cuttag-preprocess.sif` by default.
- Singularity and Apptainer bind `./data` from the launch directory by default; override with `--container_bind_paths` if needed.
- Sample filtering is controlled by `--enable_sample_filter` and `--skip_patterns`.