A modular, production-ready Scanpy pipeline for processing and analyzing a single 10x Genomics single-cell RNA-seq sample. This project is optimized for human cancer datasets, but works for any 10x scRNA-seq run.
- 10x matrix ingestion (MTX + barcodes + features)
- Gene ID normalization (Ensembl → Symbol)
- QC filtering (mitochondrial %, UMI counts, genes/cell)
- Normalization, log1p, HVG selection
- PCA, UMAP, t-SNE embeddings
- Leiden clustering
- Cell type annotation (CellTypist)
- Cell-type marker discovery
- Multi-database enrichment (GO, KEGG, Reactome, WikiPathways)
- Final biological summaries (Cell Types → DEGs → Markers → Pathways)
singlecell_pipeline/
│
├── config_cli.py # CLI + global configuration
├── loader_10x.py # 10x feature–barcode loading
├── gene_names.py # Gene ID normalization logic
├── group_de.py # DE tests, UMAP per group, compositions
├── markers.py # Cell-type-specific marker detection
├── pathway_enrichment.py # Enrichr/gseapy enrichment + semantic dedup
├── summary_ct_deg.py # Summaries (DEGs → markers → pathways)
├── pipeline.py # High-level Scanpy orchestration
└── main_single.py # Entry point: single-sample pipeline run
Version: v1.0
A clean, modular codebase designed for clinical/translational scRNA-seq workflows.
- Auto-detects matrix.mtx[.gz], barcodes.tsv[.gz], features.tsv/genes.tsv
- Handles sparse matrices efficiently
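The file auto-detection above can be sketched as follows. This is a minimal illustration, not the code in loader_10x.py: the function name `find_10x_files` and the `CANDIDATES` table are hypothetical, and accepting gzipped features/genes files is an assumption that goes beyond the names listed above.

```python
from pathlib import Path
from typing import Dict

# Accepted file names per role, in order of preference.
# The .gz variants of features/genes are an assumption, not stated in the README.
CANDIDATES = {
    "matrix": ("matrix.mtx.gz", "matrix.mtx"),
    "barcodes": ("barcodes.tsv.gz", "barcodes.tsv"),
    "features": ("features.tsv.gz", "features.tsv", "genes.tsv.gz", "genes.tsv"),
}


def find_10x_files(folder: str) -> Dict[str, Path]:
    """Return the first existing file for each role, or raise if a role is missing."""
    root = Path(folder)
    found: Dict[str, Path] = {}
    for role, names in CANDIDATES.items():
        match = next((root / n for n in names if (root / n).is_file()), None)
        if match is None:
            tried = ", ".join(names)
            raise FileNotFoundError(f"No {role} file in {root} (tried: {tried})")
        found[role] = match
    return found
```

Once the three files are confirmed, a folder in this layout can typically be read with Scanpy's `sc.read_10x_mtx`, which returns a sparse count matrix.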
- Detects Ensembl IDs
- Maps to HGNC gene symbols via mygene.info
- Ensures uniqueness and consistency of adata.var_names
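A rough sketch of the detection and uniqueness steps is shown below. It is not the logic in gene_names.py: the helper names, the 50% detection threshold, and the `-1`, `-2` suffix scheme are assumptions for illustration, and the mygene.info lookup is replaced by a plain dictionary so the example runs offline.

```python
import re
from typing import Dict, List, Sequence

# Ensembl gene IDs such as ENSG00000141510 or ENSMUSG00000059552,
# optionally followed by a version suffix like ".17".
ENSEMBL_RE = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def looks_like_ensembl(names: Sequence[str], threshold: float = 0.5) -> bool:
    """True when at least `threshold` of the names match the Ensembl pattern."""
    if not names:
        return False
    hits = sum(1 for n in names if ENSEMBL_RE.match(n))
    return hits / len(names) >= threshold


def to_unique_symbols(ids: Sequence[str], id_to_symbol: Dict[str, str]) -> List[str]:
    """Map IDs to symbols, keep unmapped IDs as-is, and suffix repeated names."""
    seen: Dict[str, int] = {}
    out: List[str] = []
    for gene_id in ids:
        base = gene_id.split(".")[0]  # drop the version suffix before lookup
        symbol = id_to_symbol.get(base) or gene_id
        count = seen.get(symbol, 0)
        seen[symbol] = count + 1
        out.append(symbol if count == 0 else f"{symbol}-{count}")
    return out
```

For example, with a mapping of `{"ENSG00000141510": "TP53"}`, two versions of that ID become `TP53` and `TP53-1`, while unmapped IDs pass through unchanged. AnnData also provides `var_names_make_unique()` for the suffixing step.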
Calculates:
- pct_counts_mt
- n_genes_by_counts
- total_counts
Filters:
- Cells with <200 or >6000 detected genes
- Cells with >15% mitochondrial reads
- Genes expressed in <3 cells
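The cell-level thresholds above can be expressed as a small, library-free sketch. The class and function names are hypothetical, and treating genes with an `MT-` prefix as mitochondrial is the usual human convention, assumed here rather than taken from the pipeline code.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class QCThresholds:
    min_genes: int = 200           # cells with fewer detected genes are dropped
    max_genes: int = 6000          # cells with more detected genes are dropped
    max_pct_mt: float = 15.0       # cells above this mitochondrial % are dropped
    min_cells_per_gene: int = 3    # genes seen in fewer cells are dropped


def pct_mito(gene_counts: Dict[str, int]) -> float:
    """Percentage of a cell's counts that come from MT- genes (assumed prefix)."""
    total = sum(gene_counts.values())
    if total == 0:
        return 0.0
    mito = sum(c for g, c in gene_counts.items() if g.upper().startswith("MT-"))
    return 100.0 * mito / total


def cell_passes_qc(n_genes: int, pct_mt: float,
                   t: QCThresholds = QCThresholds()) -> bool:
    """Keep a cell only when it sits inside every threshold (bounds inclusive)."""
    return t.min_genes <= n_genes <= t.max_genes and pct_mt <= t.max_pct_mt
```

In Scanpy, the same three metrics are normally produced by `sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)` after flagging mitochondrial genes in `adata.var["mt"]`.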
- normalize_total
- log1p
- HVG selection (Seurat v3 flavor)
- PCA (50 components)
- UMAP
- t-SNE (for n_cells < 50k)
- Leiden clustering (resolution 0.5)
- Cluster-level visualizations included
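The normalization, embedding, and clustering steps above might be wired together roughly as below. This is a sketch, not the contents of pipeline.py: `choose_embeddings` and `run_core_steps` are hypothetical names, and `target_sum=1e4` and `n_top_genes=2000` are assumed defaults that the README does not state. Because the `seurat_v3` HVG flavor expects raw counts, the sketch stores them in a layer before normalizing; this is one way to reconcile the listed step order, and the pipeline may do it differently.

```python
from typing import List

TSNE_MAX_CELLS = 50_000  # t-SNE is only run below this cell count


def choose_embeddings(n_cells: int) -> List[str]:
    """PCA and UMAP always; t-SNE only for datasets under the size limit."""
    steps = ["pca", "umap"]
    if n_cells < TSNE_MAX_CELLS:
        steps.append("tsne")
    return steps


def run_core_steps(adata, n_top_genes: int = 2000, target_sum: float = 1e4):
    """Normalize, select HVGs, embed, and cluster an AnnData object in place."""
    import scanpy as sc  # imported lazily so the helper above has no dependencies

    adata.layers["counts"] = adata.X.copy()  # keep raw counts for seurat_v3 HVGs
    sc.pp.normalize_total(adata, target_sum=target_sum)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(
        adata, n_top_genes=n_top_genes, flavor="seurat_v3", layer="counts"
    )
    sc.pp.pca(adata, n_comps=50)
    sc.pp.neighbors(adata)
    sc.tl.umap(adata)
    if "tsne" in choose_embeddings(adata.n_obs):
        sc.tl.tsne(adata)
    sc.tl.leiden(adata, resolution=0.5)
    return adata
```

Note that the `seurat_v3` flavor requires the scikit-misc package, and `sc.tl.leiden` needs a Leiden backend such as leidenalg to be installed.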
- Auto-detection from metadata OR
- CellTypist ML classifier fallback
- Generates UMAP/t-SNE/PCA plots colored by cell type
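The "metadata first, classifier as fallback" decision can be sketched as a column lookup. The candidate column names and the function name are assumptions made for illustration; the actual pipeline may look for different metadata fields.

```python
import re
from typing import Iterable, Optional, Sequence

# Hypothetical metadata column names that would indicate existing annotations.
CANDIDATE_COLUMNS = ("cell_type", "celltype", "cell_types", "annotation")


def _normalize(name: str) -> str:
    """Lowercase and collapse separators so 'Cell Type' matches 'cell_type'."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def find_celltype_column(
    columns: Iterable[str], candidates: Sequence[str] = CANDIDATE_COLUMNS
) -> Optional[str]:
    """Return the original column name holding cell types, or None if absent."""
    lookup = {_normalize(c): c for c in columns}
    for cand in candidates:
        if cand in lookup:
            return lookup[cand]
    return None
```

When the function returns `None`, the fallback path would call the classifier instead, for example `celltypist.annotate(adata, model=..., majority_voting=True)`; the choice of model is left open here because the README does not name one.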
- Global markers
- Per-cell-type markers
- Rank plots, heatmaps, dotplots
Databases supported via gseapy/Enrichr:
- GO Biological Process
- GO Molecular Function
- GO Cellular Component
- KEGG
- Reactome
- WikiPathways
Includes:
- Semantic deduplication (MiniLM + FAISS)
- Top pathway barplots
- Combined enrichment tables
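To illustrate what semantic deduplication does to an enrichment table, here is a simplified sketch. It swaps in a plain-Python cosine similarity over precomputed vectors in place of the MiniLM embeddings and FAISS index named above; the function names and the 0.9 threshold are assumptions.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def dedup_terms(
    terms: List[str],
    vectors: Dict[str, Sequence[float]],
    threshold: float = 0.9,
) -> List[str]:
    """Greedy dedup: keep a term only if it is not too similar to a kept term.

    `terms` should already be sorted by significance, so that the most
    significant member of each near-duplicate group is the one retained.
    """
    kept: List[str] = []
    for term in terms:
        if all(cosine(vectors[term], vectors[k]) < threshold for k in kept):
            kept.append(term)
    return kept
```

The greedy pass compares every term against all kept terms, which is fine for a few hundred pathways; an approximate nearest-neighbor index such as FAISS becomes worthwhile when the term lists grow much larger.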
Creates a comprehensive biological table linking: Cell Type → DEGs → Marker Genes → Pathways
scpipeline `
  --single-10x-dir "<path to the 10x files: features, barcodes, matrix>" `
  --single-sample-label "<sample_name>" `
  --single-group-label "<sample_group>"

Arguments:
- --single-10x-dir: input file location
- --single-sample-label: sample name (GSM ID / accession)
- --single-group-label: sample type (tumor, cancer, or disease name)
All results are saved to:
<10x_folder>/SC_RESULTS/
This includes:
- QC plots
- HVG tables
- Embeddings (UMAP/t-SNE)
- Clusters
- Cell types
- Marker gene tables
- Enrichment results
- Summary spreadsheets and text files
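To show how the three command-line flags relate to the output location, here is a minimal argparse sketch. It is not the code in config_cli.py: `build_parser` and `results_dir` are hypothetical names, and marking all three flags as required is an assumption.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Parser exposing the three single-sample flags documented in this README."""
    parser = argparse.ArgumentParser(
        prog="scpipeline", description="Single-sample 10x scRNA-seq pipeline"
    )
    parser.add_argument("--single-10x-dir", required=True, type=Path,
                        help="Folder with the matrix, barcodes, and features files")
    parser.add_argument("--single-sample-label", required=True,
                        help="Sample name, e.g. a GSM ID or accession")
    parser.add_argument("--single-group-label", required=True,
                        help="Sample type, e.g. tumor, cancer, or disease name")
    return parser


def results_dir(tenx_dir: Path) -> Path:
    """Results live in an SC_RESULTS folder inside the 10x input folder."""
    return tenx_dir / "SC_RESULTS"
```

Parsing `--single-10x-dir data/run1` with this parser and passing the result to `results_dir` yields `data/run1/SC_RESULTS`.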
- Cancer single-cell analysis
- Tumor microenvironment decomposition
- Biomarker discovery
- Translational/preclinical studies
- ML-based cell type prediction