Skip to content

liezeltamon/single-cell-preprocess

Repository files navigation

Snakemake workflow: single-cell-preprocess

Snakemake GitHub actions status workflow catalog

A Snakemake workflow for preprocessing multiplexed 10X single-cell RNA-seq data: empty droplet removal, HTO-based sample demultiplexing, doublet detection, QC filtering, and quality metric reporting.

Overview

The workflow is built using Snakemake and consists of the following steps:

  1. Empty droplet removalDropletUtils::emptyDrops() distinguishes cell-containing droplets from empty droplets using the raw Cell Ranger count matrix, retaining barcodes that pass the configured FDR threshold.
  2. Sample demultiplexingcellhashR runs up to six HTO-based demultiplexing algorithms (HTODemux, MultiSeqDemux, DropletUtils, GMM-Demux, BFF-raw, BFF-cluster) and calls a consensus singlet assignment per barcode by majority vote.
  3. Doublet detectionscDblFinder simulates artificial doublets in PCA space and classifies each droplet as a singlet or doublet using a random forest classifier.
  4. Conventional QC filtering — Per-sample MAD-based outlier removal on library size, feature count, and mitochondrial fraction using scuttle::isOutlier().
  5. Per-sample QC metrics — Computes and plots summary statistics (median, mean, SD, min, max) per sample or grouping variable.
  6. Aggregate QC — Aggregates QC metrics across all samples, normalises values to MADs, and renders cross-sample heatmaps for a global quality overview.

Detailed information about input data and workflow configuration can be found in the config/README.md.

Input data

The workflow expects Cell Ranger multi pipeline outputs. Samples are auto-detected as subdirectories of data/cellranger/.

Input Path Notes
Raw feature-barcode matrix data/cellranger/{sample}/outs/multi/count/raw_feature_bc_matrix.h5 Must contain both Gene Expression and Multiplexing Capture (HTO) libraries
HTO-to-sample mapping results/process_droplets/hto_to_sample_mapping/{sample}/hto_to_sample_mapping.tsv Tab-separated; columns: hto_id, sample_name

Output

All outputs are written to results/process_droplets_pipeline/{CONFIG_FILENAME}/, where CONFIG_FILENAME is set in workflow/Snakefile (default: config). Change this variable to namespace outputs from different config runs.

Directory Key output files
empty/{sample}/ whitelist.txt, blacklist.txt, output.qs, plots.pdf, session_info.txt
dehash/{sample}/ whitelist.txt, barcode_metadata.csv, metrics.csv, output.csv, plots.pdf, session_info.txt
doublet/{sample}/ whitelist.txt, barcode_metadata.csv, output.qs, plots.pdf, session_info.txt
filter/{sample}/ whitelist.txt, barcode_metadata.csv, plots.pdf, session_info.txt
qc_sc_sample/{sample}/ metrics.csv, plots.pdf
qc_sc_aggregate/ metrics.csv, heatmaps.pdf

whitelist.txt files at each step carry the set of high-quality barcodes surviving that step; barcode_metadata.csv files carry per-cell annotations accumulated across steps.

Usage

The usage of this workflow is described in the Snakemake Workflow Catalog.

If you use this workflow in a paper, please cite the repository URL or its DOI and the tools listed in the References section.

Deployment options

Change to the workflow directory and adjust options in config/config.yml.

cd path/to/single-cell-preprocess

Perform a dry run to check the workflow before execution:

snakemake --dry-run

Run with test files using conda:

snakemake --cores 2 --sdm conda --directory .test

Run with apptainer / singularity:

snakemake --cores 2 --sdm conda apptainer --directory .test

Run on an HPC cluster via SLURM (recommended for production):

# Load required modules first
module load R/4.3.2-gfbf-2023a

sbatch -J process_droplets_pipeline -p short,long \
  --mem=80G --cpus-per-task=4 \
  --output=%x.log.out --error=%x.log.err \
  --wrap="snakemake -s Snakefile --cores 4 --rerun-incomplete"

Authors

References

Köster, J., Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., Forster, J., Lee, S., Twardziok, S. O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., & Nahnsen, S. Sustainable data analysis with Snakemake. F1000Research, 10:33, 2021. https://doi.org/10.12688/f1000research.29032.2

Lun, A. T. L., Riesenfeld, S., Andrews, T., Dao, T. P., Gomes, T., & Marioni, J. C. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biology, 20:63, 2019. https://doi.org/10.1186/s13059-019-1662-y

Bimber Lab. cellhashR: A Package for Demultiplexing Cell Hashing Data. R package version 1.2.1, 2026. https://github.com/BimberLab/cellhashR

Germain, P.-L., Lun, A., Garcia Meixide, C., Macnair, W., & Robinson, M. D. Doublet identification in single-cell sequencing data using scDblFinder. F1000Research, 10:979, 2021. https://doi.org/10.12688/f1000research.73600.2

McCarthy, D. J., Campbell, K. R., Lun, A. T. L., & Willis, Q. F. Scater: pre-processing, quality control, normalisation and visualisation of single-cell RNA-seq data in R. Bioinformatics, 33(8), 1179–1186, 2017. https://doi.org/10.1093/bioinformatics/btw777

About

A Snakemake pipeline for processing 10X single-cell data

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors