I replicated the Scanpy pipeline to annotate cells from a mystery single-cell dataset labeled as ‘bone marrow’ and to assess whether bone marrow is indeed the tissue of origin.
The standard Scanpy pipeline, run in a Jupyter notebook, involved the following steps:
- Install and import necessary libraries (Scanpy, anndata, Decoupler, Pandas)
- Data import and formatting
- Quality Control (QC)
- Filtering out low-quality cells
- Finding Highly Variable Features

Fig.3: Highly Variable Features

- Dimensionality Reduction: using Principal Component Analysis (PCA)

Fig.4: PC vs Variance ratio

- UMAP and Leiden Clustering

Fig.5: UMAP clustering

- Cell Annotation: using Decoupler and PanglaoDB.
- Use the top-scoring cell type per cluster for annotation.
- Verification of cell annotation with gene marker list: (heatmap, dotplot, violin plot)

Fig.6: Dotplot of annotated UMAP clusters with markers
Fig.7: UMAP of annotated cells
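
The annotation step in the list above assigns each cluster the cell type with the highest enrichment score. Below is a minimal sketch of that selection in plain Python. It assumes the per-cluster scores have already been computed (for example by Decoupler against PanglaoDB markers) and collected into a nested dictionary. The function name, cluster IDs, and score values are hypothetical and serve only to illustrate the logic.

```python
def top_cell_type_per_cluster(scores):
    """Map each cluster to its highest-scoring cell type.

    `scores` is assumed to look like {cluster_id: {cell_type: score}}.
    """
    annotation = {}
    for cluster, type_scores in scores.items():
        # Pick the cell type whose score is largest for this cluster
        annotation[cluster] = max(type_scores, key=type_scores.get)
    return annotation


# Made-up scores for two clusters (illustration only, not results from the dataset)
example_scores = {
    "0": {"Neutrophils": 8.1, "Monocytes": 5.4, "NK cells": 0.3},
    "1": {"Neutrophils": 0.2, "Monocytes": 1.9, "NK cells": 7.6},
}
cluster_to_type = top_cell_type_per_cluster(example_scores)
print(cluster_to_type)
```

In a Scanpy workflow, a dictionary like this would typically be mapped onto the cluster column in `adata.obs`. The column name depends on how the clustering step was run.
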
Several populations of blood cells were identified in the dataset (see Fig.7 above):
- Neutrophils: Innate immune cells that are usually the first line of defense against infection, functioning through phagocytosis or extracellular traps.
- $\gamma\delta$ T cells: An alternative type of T lymphocyte that bridges innate and adaptive immunity to help fight infections or cancer.
- Nuocytes (ILC2s): A type of innate lymphoid cell that plays a role in antibody or B cell-mediated immunity against parasites and allergens, or tissue repair through cytokine production.
- NK cells: Innate lymphocytes that destroy virus-infected and cancer cells through the release of cytotoxic molecules that act on target cells.
- Naive B cells: Inactive B lymphocytes that are precursors to plasma cells or memory B cells, and not yet exposed to activating antigen signals.
- Platelets: Fragments of megakaryocyte cells that help with blood clotting and prevention of bleeding.
- Plasma cells: Activated B lymphocytes that serve as antibody factories to help fight off infections from bacteria or viruses.
- Monocytes: Circulating immune cells that perform phagocytosis, antigen presentation, and other functions. They can differentiate into macrophages or dendritic cells upon entering tissues.
The tissue of origin is probably not bone marrow; lung or blood is more likely, given the high proportion of immune cells in the sample and the absence of mesenchymal or hematopoietic stem and progenitor populations. Because most of the identified cell types belong to the innate immune system, the tissue could be in an early stage of infection or immune activation.
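
The composition argument above can be made explicit by tabulating the annotated labels and checking whether any progenitor population appears. The following is a minimal sketch, assuming one annotation label per cell. The toy label list and the set of progenitor names are hypothetical placeholders, not values taken from the dataset.

```python
from collections import Counter

# Hypothetical label names for populations one would expect in bone marrow
PROGENITOR_LABELS = {
    "Hematopoietic stem cells",
    "Mesenchymal stem cells",
    "Erythroid progenitors",
}


def composition(labels):
    """Return the fraction of cells carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def progenitors_present(labels, progenitor_labels=PROGENITOR_LABELS):
    """Return the progenitor labels that occur at least once, sorted by name."""
    return sorted(set(labels) & set(progenitor_labels))


# Toy per-cell labels (illustration only)
toy_labels = ["Neutrophils"] * 5 + ["Monocytes"] * 3 + ["NK cells"] * 2
fractions = composition(toy_labels)
found = progenitors_present(toy_labels)
print(fractions, found)
```

An empty `found` list is consistent with, but does not prove, a non-marrow origin, since automated annotation can miss or mislabel rare populations.
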
It was surprising that many marker genes were shared across many, though not all, cell populations. This could reflect the limited ability of single-cell sequencing to separate similar cell populations at low sequencing depth. Alternatively, the markers used for annotation may not be specific enough, which would make the automated annotation less reliable and the conclusions less solid.
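
One way to quantify how much marker lists overlap is the Jaccard index between the marker sets of each pair of cell types. The sketch below assumes the markers are available as a dictionary of sets. The cell type names and gene identifiers are placeholders, not real markers from PanglaoDB.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_marker_overlap(markers):
    """Return {(type_1, type_2): jaccard} for every pair of cell types."""
    return {
        (t1, t2): jaccard(markers[t1], markers[t2])
        for t1, t2 in combinations(sorted(markers), 2)
    }


# Placeholder marker sets (illustration only)
toy_markers = {
    "type_A": {"g1", "g2", "g3"},
    "type_B": {"g2", "g3", "g4"},
    "type_C": {"g5"},
}
overlap = pairwise_marker_overlap(toy_markers)
for pair, score in overlap.items():
    print(pair, round(score, 2))
```

Pairs with a high index share most of their markers, so scores for those cell types are harder to tell apart and their annotations deserve a manual check.
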
Main libraries in the analysis environment:

| Name | Version | Build/Channel Info |
|---|---|---|
| scanpy | 1.11.5 | pyhd8ed1ab_0 (conda-forge) |
| anndata | 0.11.4 | pyhd8ed1ab_0 (conda-forge) |
| scvi-tools | 1.3.3 | pypi_0 (pypi) |
| decoupler | 2.1.2 | pypi_0 (pypi) |
| umap-learn | 0.5.9.post2 | py310hff52083_0 (conda-forge) |
| leidenalg | 0.11.0 | pypi_0 (pypi) |
| numpy | 2.2.6 | py310hefbff90_0 (conda-forge) |
| scipy | 1.15.2 | py310h1d65ade_0 (conda-forge) |
| pandas | 2.3.3 | py310h0158d43_1 (conda-forge) |
| seaborn | 0.13.2 | hd8ed1ab_3 (conda-forge) |
### The full library composition can be found in the `scanpy_env.yml` file
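
To check whether a local environment matches the versions in the table above, the installed versions can be compared with the standard library's `importlib.metadata`. This is a minimal sketch: the `EXPECTED` dictionary copies a few rows from the table, and the helper name `compare_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# A few versions copied from the table above
EXPECTED = {
    "scanpy": "1.11.5",
    "anndata": "0.11.4",
    "decoupler": "2.1.2",
    "leidenalg": "0.11.0",
}


def compare_versions(expected):
    """Return {package: (expected, installed)} for every mismatch.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


for name, (wanted, installed) in compare_versions(EXPECTED).items():
    print(f"{name}: expected {wanted}, found {installed}")
```

An empty result means every listed package is installed at the expected version.
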

