
scRNA-seq PBMC Workflow

DOI CI Snakemake Docker License: MIT

Reproducible, containerized single-cell RNA-seq workflow built with Snakemake + Docker, controlled via a Python CLI wrapper.

The repository demonstrates production-grade workflow engineering and reproducible analysis infrastructure rather than novel biological discovery.

End-to-end execution:

FASTQ → QC → STARsolo → Seurat → DESeq2/TOST → enrichment → network analysis

CI runs the toy workflow (including STAR index build on chr1) to validate reproducibility.

Rule graph (untrimmed reads)


Orchestration Logic: Directed Acyclic Graph (DAG) automatically generated by Snakemake.

Example outputs

A small set of outputs from a full PBMC run is available under docs/example_outputs/, illustrating key QC, clustering, and differential expression results.

A stable snapshot of representative full-run analysis outputs is archived on Zenodo (see Data availability).

Quick Start (Toy Demo)

  • Runs a chromosome 1 mini-reference with downsampled FASTQs.
  • Execution time: ~5–10 minutes after image pull and toy data download.
  • Toy data download size: ~81.3 MB (toy bundle)
  • The toy run covers the upstream steps plus Seurat object creation (no downstream analysis beyond the Seurat object)

1. Clone the repository

git clone https://github.com/inkasimo/scRNAseq-pbmc-workflow.git

2. Pull the versioned Docker image (hosted on GitHub Container Registry)

docker pull ghcr.io/inkasimo/scrnaseq-pbmc-workflow:v2.0.0

First pull may take several minutes (image ~3 GB).

3. Install wrapper dependency (host only)

Ensure venv support is available (Linux users may need python3-venv installed)

python3 -m venv .venv
source .venv/bin/activate
pip install -r wrapper-requirements.txt
  • Installs only pyyaml>=6.0
  • Required only if using run_analysis.py.
  • Not needed if running Snakemake directly via Docker.

If you use a venv, activate the environment before running run_analysis.py.

4. Download toy bundle

python3 run_analysis.py download_toy

This downloads and extracts:

  • data/ref/toy/ (chr1 reference files)
  • data/toy/toy_donor/ (toy FASTQs)
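
To confirm the bundle extracted correctly, you can list the two directories created above:

# both directories should exist and contain files after the download step
ls data/ref/toy data/toy/toy_donor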

5. Run toy workflow

Dry run:

python3 run_analysis.py toy --dry-run

Raw mode:

python3 run_analysis.py toy

Trimmed mode:

python3 run_analysis.py toy --trimmed

Outputs written to:

results/
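
For a quick overview of what the toy run produced, list the top levels of the results tree (the exact layout depends on the mode and is documented in docs/results_layout.md):

# show the first two directory levels under results/
find results -maxdepth 2 -type d | sort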

What this pipeline does

Upstream

  • FASTQ acquisition (download or reuse of existing data)
  • Raw and optional trimmed QC (FastQC + MultiQC)
  • Optional read trimming (Cutadapt; non-default)
  • Reference preparation (barcode whitelist, STAR index)
  • STARsolo alignment and gene–cell count matrix generation

Downstream

  • Seurat objects
  • Cell-level QC and annotation
  • Differential expression (DESeq2) and equivalence testing (TOST)
  • Enrichment analysis
  • Network analysis
  • Module enrichment analysis

Focus

  • Reproducible, containerized execution
  • Explicit DAG-based workflow structure
  • Clear separation of engineering (upstream) and statistical analysis (downstream)

Full Dataset Execution

Dataset

  • 10x Genomics PBMC 5k
  • Donors 1–4
  • 3′ Gene Expression

Requirements

A full run requires ~25–30 GB of RAM for STAR index generation and alignment, and takes several hours depending on the number of cores.
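
Before launching a full run, it is worth confirming that the Docker daemon (or, on Windows, the WSL2 VM backing it) actually has this much memory available, for example:

# total memory visible to the Docker daemon, in bytes
docker info --format '{{.MemTotal}}'

# on Linux/WSL2 hosts: overall memory seen by the host
free -h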

Required

  • Docker

All core tools are provided inside the Docker image, including:

  • Snakemake
  • STAR/STARsolo
  • FastQC
  • MultiQC
  • Cutadapt
  • Seurat
  • DESeq2
  • igraph
  • muumi

Optional (wrapper only)

Python ≥3.9 with venv support (used only for the execution wrapper)

1. Pull the Docker image

Use the same Docker image as shown in Quick Start.

Execution with the Python wrapper (recommended)

Wrapper requirements (host-side only)

See Quick Start

Example runs:

Inspect available sections:

python3 run_analysis.py --list-sections

Inspect donors:

python3 run_analysis.py --list-donors

Download data:

python3 run_analysis.py download_data --cpus 8 --cores 8

QC only:

python3 run_analysis.py qc --cpus 8 --cores 8

Align all donors:

python3 run_analysis.py align \
  --donor all \
  --cpus 8 --cores 8 \
  -j 1 \
  --set-threads starsolo=8

Dry run (no execution, sanity check):

python3 run_analysis.py all --dry-run

Trimmed flag:

Use the --trimmed flag to enable read trimming and run all downstream steps using trimmed reads instead of raw reads.

python3 run_analysis.py all --trimmed

Execution directly with Snakemake (no wrapper)

The workflow can also be run directly with Snakemake inside the Docker container.

Dry run

docker run --rm -it \
  -v "$(pwd)":/work \
  -w /work \
  ghcr.io/inkasimo/scrnaseq-pbmc-workflow:v2.0.0 \
  snakemake -n -p

Run a specific target (example: one donor alignment):

docker run --rm -it \
  -v "$(pwd)":/work \
  -w /work \
  ghcr.io/inkasimo/scrnaseq-pbmc-workflow:v2.0.0 \
  snakemake results/alignment/starsolo/donor1/starsolo.done
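
The same pattern extends to other Snakemake invocations. As one illustration (not a required step), the rule graph shown at the top of this README can be exported by capturing Snakemake's --rulegraph output on the host and rendering it with a host-side Graphviz installation:

# export the rule graph as DOT (no -t flag, so stdout redirects cleanly to a host file)
docker run --rm \
  -v "$(pwd)":/work \
  -w /work \
  ghcr.io/inkasimo/scrnaseq-pbmc-workflow:v2.0.0 \
  snakemake --rulegraph > rulegraph.dot

# render with host-side Graphviz
dot -Tpng rulegraph.dot > rulegraph.png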

Repository structure

containers/           # Dockerfile(s) for reproducible execution
workflow/             # Snakemake workflow (rules, DAG)
config/               # User-editable configuration (config.yaml)
resources/            # Static resources bundled with the workflow
  barcodes/           # 10x barcode whitelist(s)
  genesets/           # Hallmark and C7 .gmt files
data/                 # Input data and references (not versioned)
  raw/                # FASTQ files (downloaded or user-provided)
  ref/                # Reference genome, GTF, STAR index
  toy/                # Toy data if downloaded 
  trimmed/            # Trimmed FASTQ files (generated only if read trimming is enabled)
results/              # Outputs and logs (not versioned)
  qc/                 # FastQC / MultiQC reports
  alignment/          # STARsolo outputs
  logs/               # Execution logs
  downstream/         # Downstream analysis results
    deg_and_tost/     # DEG and TOST analysis results
    seurat/           # Seurat objects and related plots and tables
    networks/         # Network analysis results
docs/                 # Documentation (technical summary, user manual, white paper, results layout, rulegraph)
  example_outputs/    # A small set of representative execution artifacts
scripts/              # R-scripts and helpers
run_analysis.py       # Optional Python wrapper for section-based execution

Workflow

Execution flow


config/config.yaml
        |
        v
run_analysis.py  (host-side CLI wrapper)
        |
        v
docker run -v $PWD:/work -w /work ...  (bind-mount repo)
        |
        v
+------------------------------------------------------+
|                  Docker Container                    |
|                                                      |
|            Snakemake (workflow/Snakefile)            | 
|               → rules → tools/scripts                |
|                                                      |
+------------------------------------------------------+

Upstream (engineering)


FASTQs
  |
  +-- download (optional)
  |
  +-- validate presence
  |
  v
QC (raw)
  - FastQC
  - MultiQC
  |
  +-- trim (optional)
  |     |
  |     v
  |   QC (trimmed)
  |     - FastQC
  |     - MultiQC
  |
  v
Reference
  - genome FASTA
  - GTF
  - barcode whitelist
  - STAR index
  |
  v
Alignment / Counting 
  - STARsolo
  - gene x cell matrix
  - alignment on raw OR trimmed reads depending on mode

Downstream (analysis)


Per-donor
  |
  +-- Seurat object
  +-- cell QC + filtering
  +-- normalization + HVGs
  +-- clustering + annotation
  |
  v
Cross-donor
  |
  +-- pseudobulk by donor & cell type
  +-- DESeq2 (DE)
  +-- TOST (equivalence test)
  +-- enrichment (GSEA / ORA)
  +-- Network analysis

Resources and reproducibility

Barcode whitelist

The 10x Genomics barcode whitelist is bundled directly in the repository:

resources/barcodes/3M-3pgex-may-2023_TRU.txt

This avoids reliance on unstable upstream URLs and ensures reproducible execution.

Gene sets (MSigDB)

Gene sets used for enrichment analyses are stored locally under:

resources/genesets/

This includes:

MSigDB Hallmark gene sets:

h.all.v2026.1.Hs.symbols.gmt

MSigDB C7 (immunologic signatures):

c7.all.v2026.1.Hs.symbols.gmt

Hallmark (H) and Immunologic Signature (C7) gene sets were obtained from the Molecular Signatures Database (MSigDB, Broad Institute) and are included locally to ensure reproducible execution of the workflow. All enrichment steps (GSEA and ORA) explicitly reference these local files.

Users should ensure compliance with MSigDB licensing terms when reusing these resources.

Liberzon A, et al. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Systems (2015).

Godec J, et al. Compendium of Immune Signatures Identifies Conserved and Species-Specific Biology in Response to Inflammation. Immunity (2016).

Scalability & Portability

The workflow is built on a "Container-First" architecture. While primary development and validation were conducted on a high-spec local workstation (WSL2/Docker), the use of Snakemake and digest-pinned images ensures the pipeline is architecturally ready for HPC environments (Slurm/Singularity) with minimal configuration of resource profiles.
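
On HPC systems where Docker is unavailable, the same pinned image can typically be consumed via Singularity/Apptainer, which pulls and converts Docker images directly. A minimal sketch (not validated on a cluster; bind paths, cache locations, and scheduler integration depend on the site):

# pull and convert the Docker image into a local Singularity image file
singularity pull scrnaseq-pbmc-workflow.sif docker://ghcr.io/inkasimo/scrnaseq-pbmc-workflow:v2.0.0

# dry run inside the converted container, bind-mounting the repository as /work
singularity exec --bind "$(pwd)":/work --pwd /work scrnaseq-pbmc-workflow.sif snakemake -n -p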

Documentation

Comprehensive project documentation is maintained within the docs/ directory to ensure full technical provenance and ease of onboarding:

  • Technical White Paper (docs/White_Paper_scRNAseq_Architectural_Framework.pdf) — The primary technical specification. This 45-page document details the system architecture and presents the full biological results from the 10x PBMC (Donors 1-4) run, including cell-type annotation, donor-aware TOST equivalence testing, and consensus network analysis.
  • Executive Summary (docs/White_Paper_Summary.pdf) — A concise 7-page overview of the framework’s core value propositions, high-level DAG structure, and key biological validation results.
  • User Manual (docs/user_manual.md) — Operational instructions covering the Python Controller, Docker image deployment, and configuration schema.
  • Results Architecture (docs/results_layout.md) — A detailed map of the deterministic output tree, designed for rapid discovery and audit of analytical artifacts.

Notes

  • Large data files are excluded via .gitignore
  • FASTQ downloading is controlled via io.download_fastqs in config/config.yaml
    (automatically set by the wrapper for relevant sections; see the check below this list)
  • STAR index requires ~25–30 GB RAM
  • Biological analyses are included to validate pipeline correctness and demonstrate statistically coherent downstream usage, not to claim novel biological findings.
  • R package versions inside the Docker image are managed with renv to ensure reproducible R environments. Users do not need to interact with renv directly.
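
For the download_fastqs note above, the current setting can be inspected before a manual Snakemake run by grepping the config (key name as referenced above; the exact structure of config.yaml may differ):

grep -n "download_fastqs" config/config.yaml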

Result notes

  • Read trimming had negligible impact based on QC; untrimmed branch used for final runs
  • STARsolo mapping rates were consistent across donors (~69–72%)
  • STARsolo selected over Cell Ranger for transparent, reproducible integration into the workflow
  • Donor-aware pseudobulk modeling used to avoid pseudo-replication
  • Differential expression revealed strong transcriptional separation between lymphoid and myeloid compartments with large marker gene sets
  • Equivalence testing (TOST) used to explicitly identify conserved gene programs
  • Cell-type–specific co-expression networks revealed distinct modular immune programs and functional organization across CD4 T cells, B cells, and monocytes

Non-goals

This pipeline is not intended to benchmark methods or claim novel biological findings.

Data availability

A complete, versioned archive of pipeline outputs is available on Zenodo (see the Zenodo archive linked from the repository).

This archive includes:

  • Full pipeline outputs (alignment + downstream analysis)
  • A small toy dataset (downsampled FASTQs) for demonstration and sanity-check runs

Citation

If you use this workflow in a publication, please cite the repository.
