spatial-tk: Spatial Transcriptomics Analysis Toolkit

A modular Python toolkit for analyzing Xenium spatial transcriptomics data. It provides a command-line interface with a separate subcommand for each stage of the analysis pipeline, so each stage can be run, rerun, or skipped independently.

Features

Modular Pipeline: Each analysis step is a separate command for maximum flexibility
In-place Processing: Optionally modify datasets in place, without duplicating them, to save disk space
Multiple Resolutions: Support for multi-resolution clustering analysis
Rich Annotations: Marker-based and MLM enrichment-based cell type annotation
Flexible Differential Analysis: Compare groups or find cluster markers
Configuration Files: TOML config files for reproducible pipelines
Comprehensive Testing: Unit and functional tests ensure reliability

Installation

From Source

# Clone or navigate to the repository
cd clustering-tools

# Install in development mode (recommended)
pip install -e .

# Or install normally
pip install .

Dependencies

The package requires Python ≥3.9 and includes dependencies for:

Spatial data handling (spatialdata, squidpy)
Single-cell analysis (scanpy, anndata)
Cell type annotation (decoupler)
Visualization (matplotlib, seaborn)
Clustering (igraph, leidenalg)

See pyproject.toml for the complete dependency list.

Configuration Files

All commands support TOML configuration files for reproducible pipelines. Each command has its own section in the config file, and CLI arguments override config values when both are provided.

Basic Usage

# Use a config file
spatial-tk concat --config config.toml --input samples.csv --output merged.zarr
spatial-tk normalize --config config.toml --input merged.zarr --inplace

Config File Format

Create a config.toml file with sections for each command:

[concat]
input = "samples.csv"
output = "merged.zarr"
downsample = 1.0

[normalize]
input = "merged.zarr"
inplace = true
min_genes = 100
min_cells = 3
n_top_genes = 2000
save_plots = false

[cluster]
input = "merged.zarr"
inplace = true
leiden_resolution = "0.2,0.5,1.0"
save_plots = true

[annotate]
input = "merged.zarr"
inplace = true
markers = "markers.csv"
calculate_ulm = true
panglao_min_sensitivity = 0.5
tmin = 2
save_plots = true

[differential]
input = "merged.zarr"
output_dir = "results/"
groupby = "leiden_res0p5"
method = "wilcoxon"
n_genes = 100
save_plots = false

Config Key Naming

Config keys use underscores (e.g., min_genes, n_top_genes), which correspond to CLI arguments with hyphens (--min-genes, --n-top-genes). The config system automatically handles this conversion.
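
To make the naming rule concrete, here is a minimal sketch of the underscore/hyphen mapping. The helper names `key_to_flag` and `flag_to_key` are hypothetical and are not part of spatial-tk; the toolkit's own conversion code may differ.

```python
def key_to_flag(key: str) -> str:
    """Turn a config key such as 'n_top_genes' into the flag '--n-top-genes'."""
    return "--" + key.replace("_", "-")


def flag_to_key(flag: str) -> str:
    """Turn a CLI flag such as '--min-genes' into the config key 'min_genes'."""
    return flag.lstrip("-").replace("-", "_")


print(key_to_flag("n_top_genes"))   # --n-top-genes
print(flag_to_key("--min-genes"))   # min_genes
```

The two functions are inverses of each other for the option names used in this README, so a key can be round-tripped without loss.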

CLI Arguments Override Config

When both a config file and CLI arguments are provided, CLI arguments take precedence:

# Config specifies downsample = 0.5, but CLI overrides it to 0.8
spatial-tk concat --config config.toml --input samples.csv --output merged.zarr --downsample 0.8
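
The precedence rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `merge_settings` is a hypothetical helper, and treating `None` as "flag not given" is an assumption about how unset CLI options are represented, not a description of spatial-tk's internals.

```python
from typing import Any, Dict, Optional


def merge_settings(
    config_section: Dict[str, Any],
    cli_args: Dict[str, Optional[Any]],
) -> Dict[str, Any]:
    """Start from the config section, then apply CLI values that were actually given."""
    merged = dict(config_section)
    for key, value in cli_args.items():
        if value is not None:  # assumption: None means the flag was not passed
            merged[key] = value
    return merged


# Values as they might come from the [concat] section of config.toml
config_section = {"input": "samples.csv", "output": "merged.zarr", "downsample": 0.5}
# Only --downsample was given on the command line
cli_args = {"input": None, "output": None, "downsample": 0.8}

settings = merge_settings(config_section, cli_args)
print(settings["downsample"])  # 0.8
```

Options that appear only in the config file pass through unchanged, which is what makes a shared config file usable across several commands.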

Example Config File

See example_config.toml in the repository root for a complete example with all available options documented.

A typical pipeline run, passing every option on the command line instead of through a config file:

# 1. Concatenate multiple samples
spatial-tk concat --input samples.csv --output merged.zarr

# 2. Normalize (inplace to save space)
spatial-tk normalize --input merged.zarr --inplace

# 3. Cluster with multiple resolutions
spatial-tk cluster --input merged.zarr --inplace --leiden-resolution 0.2,0.5,1.0

# 4. Annotate cell types
spatial-tk annotate --input merged.zarr --inplace --markers markers.csv

# 5. Differential expression analysis
spatial-tk differential --input merged.zarr --output-dir results/ --groupby leiden_res0p5

Commands

`spatial-tk concat`

Concatenate multiple Xenium .zarr files into a single dataset.

spatial-tk concat --input samples.csv --output merged.zarr

# With downsampling for testing
spatial-tk concat --input samples.csv --output merged.zarr --downsample 0.1

Arguments:

--input: Path to CSV file with columns: sample, path, [optional metadata]
--output: Path to output .zarr file
--downsample: Fraction of cells to keep (0-1, default: 1.0)
--config: Path to TOML configuration file (optional)

CSV Format:

sample,path,status,location
sample1,/path/to/sample1.zarr,HIV,Drexel
sample2,/path/to/sample2.zarr,NEG,OSU
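
Before running `concat`, it can be useful to check a manifest for the two required columns. The following is a minimal sketch using only the standard library; `read_manifest` is a hypothetical helper, not a spatial-tk function, and it assumes that every column other than `sample` and `path` is treated as optional metadata.

```python
import csv
import io
from typing import Dict, List

REQUIRED_COLUMNS = ("sample", "path")


def read_manifest(handle) -> List[Dict[str, str]]:
    """Read a sample manifest and fail early if a required column is missing."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"manifest is missing required column(s): {missing}")
    return list(reader)


# The example manifest from this README, held in memory instead of on disk
example = io.StringIO(
    "sample,path,status,location\n"
    "sample1,/path/to/sample1.zarr,HIV,Drexel\n"
    "sample2,/path/to/sample2.zarr,NEG,OSU\n"
)
rows = read_manifest(example)
metadata_columns = [c for c in rows[0] if c not in REQUIRED_COLUMNS]
print(metadata_columns)  # ['status', 'location']
```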

`spatial-tk normalize`

Perform QC, filtering, normalization, and feature selection.

# Save to new file
spatial-tk normalize --input data.zarr --output normalized.zarr

# Modify in place
spatial-tk normalize --input data.zarr --inplace

# With custom parameters and plots
spatial-tk normalize --input data.zarr --inplace \
  --min-genes 200 \
  --min-cells 5 \
  --n-top-genes 3000 \
  --save-plots

Arguments:

--input: Input .zarr file
--output: Output .zarr file (mutually exclusive with --inplace)
--inplace: Modify input file in place
--min-genes: Minimum genes per cell (default: 100)
--min-cells: Minimum cells per gene (default: 3)
--n-top-genes: Number of highly variable genes (default: 2000)
--save-plots: Generate QC plots
--config: Path to TOML configuration file (optional)
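
To illustrate what the `--min-genes` and `--min-cells` thresholds mean, here is a small pure-Python sketch on a toy count matrix (rows are cells, columns are genes). It assumes cells are filtered before genes, which is a common convention but not confirmed for spatial-tk; `filter_counts` is a hypothetical helper.

```python
from typing import List, Tuple


def filter_counts(
    counts: List[List[int]], min_genes: int, min_cells: int
) -> Tuple[List[int], List[int]]:
    """Return the indices of kept cells (rows) and kept genes (columns)."""
    # Keep cells in which at least `min_genes` genes have a nonzero count.
    kept_cells = [
        i for i, row in enumerate(counts) if sum(1 for v in row if v > 0) >= min_genes
    ]
    # Among the kept cells, keep genes detected in at least `min_cells` cells.
    n_genes = len(counts[0]) if counts else 0
    kept_genes = [
        j
        for j in range(n_genes)
        if sum(1 for i in kept_cells if counts[i][j] > 0) >= min_cells
    ]
    return kept_cells, kept_genes


counts = [
    [5, 0, 1, 2],  # 3 genes detected
    [0, 0, 0, 1],  # 1 gene detected, dropped when min_genes=2
    [3, 1, 0, 4],  # 3 genes detected
]
cells, genes = filter_counts(counts, min_genes=2, min_cells=2)
print(cells, genes)  # [0, 2] [0, 3]
```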

`spatial-tk cluster`

Perform PCA, neighbor graph computation, UMAP, and Leiden clustering.

# Single resolution
spatial-tk cluster --input data.zarr --inplace --leiden-resolution 0.5

# Multiple resolutions with plots
spatial-tk cluster --input data.zarr --inplace \
  --leiden-resolution 0.2,0.5,1.0,2.0 \
  --save-plots

Arguments:

--input: Input normalized .zarr file
--output: Output .zarr file (mutually exclusive with --inplace)
--inplace: Modify input file in place
--leiden-resolution: Clustering resolution(s), comma-separated (default: 0.5)
--save-plots: Generate UMAP plots
--config: Path to TOML configuration file (optional)
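
The examples in this README refer to clustering results by column names such as `leiden_res0p5` and `leiden_res1p0`. The sketch below reproduces that pattern from a comma-separated resolution string. The naming rule is inferred from those examples rather than taken from the source code, and both helper names are hypothetical.

```python
from typing import List


def parse_resolutions(value: str) -> List[float]:
    """Split a string such as '0.2,0.5,1.0' into a list of floats."""
    return [float(part) for part in value.split(",") if part.strip()]


def resolution_key(resolution: float, prefix: str = "leiden_res") -> str:
    """Build a column name by replacing the decimal point with 'p' (inferred rule)."""
    return prefix + str(float(resolution)).replace(".", "p")


keys = [resolution_key(r) for r in parse_resolutions("0.2,0.5,1.0,2.0")]
print(keys)  # ['leiden_res0p2', 'leiden_res0p5', 'leiden_res1p0', 'leiden_res2p0']
```

These are the names to pass to `--cluster-key` or `--groupby` in later steps.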

`spatial-tk spatial_neighbors`

Build a spatial graph on coordinates using Squidpy.

# kNN graph on obsm['spatial']
spatial-tk spatial_neighbors --input data.zarr --inplace \
  --spatial-key spatial --n-neighs 8

# Radius-based graph with cosine transform, writing to new file
spatial-tk spatial_neighbors --input data.zarr --output neighbors.zarr \
  --coord-type generic --radius 50,200 --transform cosine

Arguments:

--input: Input .zarr file
--output: Output .zarr file (mutually exclusive with --inplace)
--inplace: Modify input file in place
--table-key: Optional table key in SpatialData.tables
--spatial-key: Coordinate key in adata.obsm (default: spatial)
--library-key: Optional obs column containing library ids
--library-id: Optional single-library convenience value
--coord-type: grid or generic (default: inferred by Squidpy)
--n-neighs: Number of neighbors (default: 6)
--radius: Scalar radius or min,max interval
--transform: spectral, cosine, or none
--key-added: Output prefix in adata.obsp/adata.uns (default: spatial)
--config: Path to TOML configuration file (optional)
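
The `--radius` option accepts either a single number or a `min,max` pair. A minimal sketch of how such a value could be parsed is shown below; `parse_radius` is a hypothetical helper and the validation rules are assumptions, not spatial-tk's actual behavior.

```python
from typing import Tuple, Union


def parse_radius(value: str) -> Union[float, Tuple[float, float]]:
    """Parse '50' into 50.0 and '50,200' into (50.0, 200.0)."""
    parts = [float(p) for p in value.split(",")]
    if len(parts) == 1:
        return parts[0]
    if len(parts) == 2:
        low, high = parts
        if low > high:
            raise ValueError("radius interval must be given as min,max")
        return (low, high)
    raise ValueError("expected a single radius or a min,max pair")


print(parse_radius("50"))      # 50.0
print(parse_radius("50,200"))  # (50.0, 200.0)
```

Squidpy's `squidpy.gr.spatial_neighbors` takes its `radius` argument in the same two shapes (a float or a two-element tuple), so a parsed value of this form could be passed through directly.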

`spatial-tk spatial_cluster`

Cluster cells into spatial neighborhoods based on local cell-type composition vectors.

# Use existing spatial graph and choose best K by silhouette
spatial-tk spatial_cluster --input data.zarr --inplace \
  --cell-type-key cell_type_res0p5 --max-clusters 20

# Force final cluster count while still saving full K sweep
spatial-tk spatial_cluster --input data.zarr --inplace \
  --cell-type-key cell_type_res0p5 --force-n-clusters 12

# Use HDBSCAN mode instead of k-means
spatial-tk spatial_cluster --input data.zarr --inplace \
  --cell-type-key cell_type_res0p5 --mode hdbscan \
  --hdbscan-min-cluster-size 8 --hdbscan-min-samples 4

Arguments:

--input: Input .zarr file
--output: Output .zarr file (mutually exclusive with --inplace)
--inplace: Modify input file in place
--table-key: Optional table key in SpatialData.tables
--cell-type-key: Required adata.obs column with cell-type labels
--connectivities-key: adata.obsp graph key (default: spatial_connectivities)
--neighbor-k: Number of neighbors to use when computing the graph on demand (only if the --connectivities-key graph is missing)
--spatial-key: Coordinate key for on-demand neighbor calculation (default: spatial)
--library-key: Optional obs library key for on-demand neighbors
--output-key: Output obs column for selected labels (default: spatial_cluster)
--results-key: adata.uns key for detailed outputs (default: spatial_cluster)
--mode: Clustering mode: kmeans (default) or hdbscan
--min-clusters: Minimum cluster count to test (default: 2)
--max-clusters: Maximum cluster count to test (default: 20)
--force-n-clusters: Force final selected cluster count (k-means mode only)
--random-state: Random seed for reproducibility (default: 0)
--hdbscan-min-cluster-size: HDBSCAN minimum cluster size
--hdbscan-min-samples: HDBSCAN min_samples
--hdbscan-cluster-selection-epsilon: HDBSCAN cluster selection epsilon
--hdbscan-metric: HDBSCAN distance metric
--hdbscan-allow-single-cluster: Allow one-cluster HDBSCAN solution
--include-self/--exclude-self: Include/exclude focal cell in neighborhood window
--normalize-composition/--raw-composition: Store proportions or raw counts
--config: Path to TOML configuration file (optional)
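
The idea of a local cell-type composition vector can be sketched without any dependencies. For each cell, count the cell-type labels in its spatial neighborhood and optionally convert the counts to proportions. This is an illustration only: `composition_vectors` is a hypothetical function, the neighbor lists stand in for the spatial graph, and the defaults chosen here are not necessarily those of spatial-tk.

```python
from collections import Counter
from typing import Dict, List, Sequence


def composition_vectors(
    neighbors: Dict[int, Sequence[int]],
    labels: Sequence[str],
    include_self: bool = True,
    normalize: bool = True,
) -> Dict[int, List[float]]:
    """Return one vector per cell, ordered by the sorted unique labels."""
    categories = sorted(set(labels))
    vectors = {}
    for cell, neigh in neighbors.items():
        # The neighborhood window optionally includes the focal cell itself.
        window = list(neigh) + ([cell] if include_self else [])
        counts = Counter(labels[i] for i in window)
        vec = [float(counts.get(c, 0)) for c in categories]
        total = sum(vec)
        if normalize and total > 0:
            vec = [v / total for v in vec]  # proportions instead of raw counts
        vectors[cell] = vec
    return vectors


labels = ["T", "T", "B", "B"]                       # one label per cell
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}  # toy spatial graph
vectors = composition_vectors(neighbors, labels)
print(vectors[0])  # proportions of B and T around cell 0
```

Clustering these vectors (with k-means or HDBSCAN, as the `--mode` option describes) groups cells by the mix of cell types around them rather than by their own expression.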

`spatial-tk annotate`

Annotate cell types using marker genes and/or MLM scoring.

# Basic annotation with markers
spatial-tk annotate --input data.zarr --inplace --markers markers.csv

# With MLM enrichment scores
spatial-tk annotate --input data.zarr --inplace \
  --markers markers.csv \
  --calculate-ulm \
  --save-plots

# Annotate specific clustering
spatial-tk annotate --input data.zarr --inplace \
  --markers markers.csv \
  --cluster-key leiden_res1p0

Arguments:

--input: Input clustered .zarr file
--output: Output .zarr file (mutually exclusive with --inplace)
--inplace: Modify input file in place
--markers: Path to marker genes CSV (columns: cell_type, gene)
--cluster-key: Specific cluster column to annotate (default: all leiden_res*)
--calculate-ulm: Calculate MLM enrichment scores for pathways/TFs
--panglao-min-sensitivity: Min sensitivity for PanglaoDB markers (default: 0.5)
--tmin: Minimum marker genes per cell type (default: 2)
--save-plots: Generate annotation plots
--config: Path to TOML configuration file (optional)

MLM Resources:

hallmark: MSigDB Hallmark gene sets
collectri: CollecTRI TF regulons
dorothea: DoRothEA TF activities
progeny: PROGENy pathway activities
PanglaoDB: Filtered cell type markers

`spatial-tk differential`

Differential expression analysis with two modes:

Mode A: Compare two specific groups (e.g., HIV vs NEG)
Mode B: Find marker genes for all groups/clusters

# Mode B: Find markers for all clusters
spatial-tk differential \
  --input data.zarr \
  --output-dir results/ \
  --groupby leiden_res0p5

# Mode A: Compare two groups
spatial-tk differential \
  --input data.zarr \
  --output-dir results/ \
  --groupby status \
  --compare-groups HIV,NEG

# With obsm enrichment scores
spatial-tk differential \
  --input data.zarr \
  --output-dir results/ \
  --groupby status \
  --compare-groups HIV,NEG \
  --obsm-layer score_mlm_PanglaoDB \
  --save-plots

# Compare cell types
spatial-tk differential \
  --input data.zarr \
  --output-dir results/ \
  --groupby cell_type_res0p5 \
  --n-genes 50

Arguments:

--input: Input .zarr file with annotations
--output-dir: Directory for results
--groupby: Column in obs to group by (e.g., "leiden_res0p5", "status", "cell_type")
--compare-groups: Two groups to compare (Mode A), comma-separated
--obsm-layer: Optional obsm layer for enrichment analysis (e.g., "score_mlm_PanglaoDB")
--method: Statistical test method (default: wilcoxon)
--layer: Layer to use for expression (default: None uses .X)
--n-genes: Number of top genes to save (default: 100)
--save-plots: Generate differential analysis plots
--config: Path to TOML configuration file (optional)
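
The two modes map naturally onto the `groups` and `reference` arguments of `scanpy.tl.rank_genes_groups`. The sketch below builds those keyword arguments from the CLI-style options. It is an assumption that spatial-tk calls scanpy this way, and in particular that the second name in `--compare-groups` acts as the reference group; `rank_genes_kwargs` is a hypothetical helper.

```python
from typing import Any, Dict, Optional


def rank_genes_kwargs(
    groupby: str,
    compare_groups: Optional[str] = None,
    method: str = "wilcoxon",
) -> Dict[str, Any]:
    """Build keyword arguments for scanpy.tl.rank_genes_groups."""
    kwargs: Dict[str, Any] = {"groupby": groupby, "method": method}
    if compare_groups is None:
        # Mode B: every group is tested against all remaining cells.
        kwargs.update(groups="all", reference="rest")
    else:
        # Mode A: one group is tested against one reference group.
        parts = [p.strip() for p in compare_groups.split(",")]
        if len(parts) != 2 or not all(parts):
            raise ValueError("--compare-groups expects exactly two names, e.g. HIV,NEG")
        test_group, reference = parts  # assumption: second name is the reference
        kwargs.update(groups=[test_group], reference=reference)
    return kwargs


print(rank_genes_kwargs("leiden_res0p5"))
print(rank_genes_kwargs("status", compare_groups="HIV,NEG"))
```

With scanpy installed, the result could be used as `scanpy.tl.rank_genes_groups(adata, **kwargs)`.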

Example Workflows

Full Pipeline with Config File

# Create config.toml with your settings
# Then run pipeline with config
spatial-tk concat --config config.toml --input samples.csv --output data.zarr
spatial-tk normalize --config config.toml --input data.zarr --inplace
spatial-tk cluster --config config.toml --input data.zarr --inplace
spatial-tk annotate --config config.toml --input data.zarr --inplace
spatial-tk differential --config config.toml --input data.zarr --output-dir results/

Full Pipeline (In-place to Save Space)

# Step 1: Concatenate samples
spatial-tk concat --input samples.csv --output data.zarr

# Steps 2-4: Process in place; step 5 writes its results to results/
spatial-tk normalize --input data.zarr --inplace --save-plots
spatial-tk cluster --input data.zarr --inplace --leiden-resolution 0.5,1.0 --save-plots
spatial-tk annotate --input data.zarr --inplace --markers markers.csv --calculate-ulm --save-plots
spatial-tk differential --input data.zarr --output-dir results/ --groupby leiden_res0p5 --save-plots

Separate Files for Each Step

spatial-tk concat --input samples.csv --output step1_concat.zarr
spatial-tk normalize --input step1_concat.zarr --output step2_normalized.zarr
spatial-tk cluster --input step2_normalized.zarr --output step3_clustered.zarr
spatial-tk annotate --input step3_clustered.zarr --output step4_annotated.zarr
spatial-tk differential --input step4_annotated.zarr --output-dir results/ --groupby leiden_res0p5

Compare Disease Status

# Process and normalize
spatial-tk concat --input samples.csv --output data.zarr
spatial-tk normalize --input data.zarr --inplace

# Compare HIV vs NEG
spatial-tk differential \
  --input data.zarr \
  --output-dir hiv_vs_neg/ \
  --groupby status \
  --compare-groups HIV,NEG \
  --save-plots

Multi-Resolution Analysis

spatial-tk concat --input samples.csv --output data.zarr
spatial-tk normalize --input data.zarr --inplace
spatial-tk cluster --input data.zarr --inplace --leiden-resolution 0.2,0.5,1.0,2.0

# Annotate all resolutions
spatial-tk annotate --input data.zarr --inplace --markers markers.csv --save-plots

# Differential analysis for each resolution
for res in 0p2 0p5 1p0 2p0; do
  spatial-tk differential \
    --input data.zarr \
    --output-dir results_res${res}/ \
    --groupby leiden_res${res}
done

Output Files

Concat

{output}.zarr: Concatenated spatial dataset

Normalize

{output}.zarr: Normalized dataset with QC metrics
plots/qc_*.png: QC plots (if --save-plots)

Cluster

{output}.zarr: Dataset with clustering results
plots/umap_leiden_res*.png: UMAP plots (if --save-plots)

Annotate

{output}.zarr: Dataset with cell type annotations
plots/umap_celltype_res*.png: Annotated UMAP plots (if --save-plots)
plots/marker_dotplot_res*.png: Marker expression dotplots
plots/deg_*.png: Differential expression plots

Differential

de_genes_*.csv: Differential expression results
de_{obsm_layer}_*.csv: obsm enrichment results (if --obsm-layer used)
plots/: Visualization plots (if --save-plots)

Development

Running Tests

# Install with dev dependencies
pip install -e ".[dev]"

# Run all tests with full external datasets
make test

# Run only unit tests (fast)
make test-unit

# Run functional tests with ROI fixtures
make test-functional

# Run functional tests with ROI fixtures (default fast tier)
SPATIAL_TK_TEST_TIER=fast pytest tests/functional/

# Run functional tests with full external datasets
SPATIAL_TK_TEST_TIER=full pytest tests/functional/

# Run with coverage
pytest --cov=spatial_tk --cov-report=html

Creating Test Data

python scripts/create_test_data.py \
  --input-csv example.csv \
  --output-dir tests/test_data \
  --n-cells 500

Functional Test Data Tiers

Functional tests support two sample manifests via tests/conftest.py:

Fast tier (default): tests/test_data/test_samples_fast.csv
- Uses in-repo ROI fixtures under tests/test_data/rois/.
- Also mirrored in tests/test_data/test_samples.csv for compatibility.
Full tier: tests/test_data/test_samples_full.csv
- Uses full-size external .zarr paths (for slower validation runs).

Environment variables:

SPATIAL_TK_TEST_TIER=fast|full chooses the tier (default: fast).
SPATIAL_TK_FAST_SAMPLES_CSV=/path/to/custom.csv overrides fast manifest path.
SPATIAL_TK_FULL_SAMPLES_CSV=/path/to/custom.csv overrides full manifest path.

Makefile shortcuts:

make test runs full-suite tests with SPATIAL_TK_TEST_TIER=full.
make test-unit runs only unit tests.
make test-functional runs functional tests with SPATIAL_TK_TEST_TIER=fast (ROI fixtures).
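
The tier selection described above can be sketched as a small resolver. This is not the code in tests/conftest.py, only an illustration of the documented behavior; `resolve_manifest` is a hypothetical function, and rejecting unknown tier names is an assumption.

```python
import os
from typing import Mapping

DEFAULT_MANIFESTS = {
    "fast": "tests/test_data/test_samples_fast.csv",
    "full": "tests/test_data/test_samples_full.csv",
}
OVERRIDE_VARS = {
    "fast": "SPATIAL_TK_FAST_SAMPLES_CSV",
    "full": "SPATIAL_TK_FULL_SAMPLES_CSV",
}


def resolve_manifest(env: Mapping[str, str] = os.environ) -> str:
    """Pick the sample manifest for the selected tier, honoring path overrides."""
    tier = env.get("SPATIAL_TK_TEST_TIER", "fast")
    if tier not in DEFAULT_MANIFESTS:
        raise ValueError(f"unknown test tier: {tier!r}")
    return env.get(OVERRIDE_VARS[tier], DEFAULT_MANIFESTS[tier])


print(resolve_manifest({}))  # tests/test_data/test_samples_fast.csv
print(resolve_manifest({"SPATIAL_TK_TEST_TIER": "full"}))
```

Passing the environment as a mapping keeps the function easy to exercise without modifying the real process environment.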

Generating ROI Subset Fixtures

Use tests/test_data/generate_roi_subsets.py to generate ROI .zarr subsets from a single input .zarr:

python tests/test_data/generate_roi_subsets.py \
  --input-zarr /path/to/source.zarr \
  --output-dir tests/test_data/roi_generation \
  --sample-name SampleA \
  --n-rois 5 \
  --min-cells 1000 \
  --max-cells 5000 \
  --overwrite

Building Package

# Build distribution
python -m build

# Install locally
pip install dist/spatial_tk-*.whl

Marker Gene CSV Format

cell_type,gene
T cells,CD3D
T cells,CD3E
B cells,MS4A1
B cells,CD19
Macrophages,CD68
Macrophages,CD14
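
A marker file in this format can be loaded into a mapping from cell type to gene list, and the `--tmin` threshold can then be pictured as dropping cell types with too few usable markers. The sketch below is an illustration under that interpretation: `load_markers` and `apply_tmin` are hypothetical helpers, and the gene panel is made up for the example.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Set


def load_markers(handle) -> Dict[str, List[str]]:
    """Group the 'gene' column by the 'cell_type' column."""
    markers = defaultdict(list)
    for row in csv.DictReader(handle):
        markers[row["cell_type"]].append(row["gene"])
    return dict(markers)


def apply_tmin(
    markers: Dict[str, List[str]], measured_genes: Set[str], tmin: int = 2
) -> Dict[str, List[str]]:
    """Keep cell types with at least `tmin` markers present in the dataset."""
    kept = {}
    for cell_type, genes in markers.items():
        present = [g for g in genes if g in measured_genes]
        if len(present) >= tmin:
            kept[cell_type] = present
    return kept


example = io.StringIO(
    "cell_type,gene\n"
    "T cells,CD3D\n"
    "T cells,CD3E\n"
    "B cells,MS4A1\n"
    "B cells,CD19\n"
    "Macrophages,CD68\n"
    "Macrophages,CD14\n"
)
markers = load_markers(example)
panel = {"CD3D", "CD3E", "MS4A1", "CD68", "CD14"}  # hypothetical panel lacking CD19
print(sorted(apply_tmin(markers, panel, tmin=2)))  # ['Macrophages', 'T cells']
```

In this toy case B cells are dropped because only one of their two markers is on the panel.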

Advanced Usage

Python API

The package can also be used programmatically:

from spatial_tk.core import data_io, preprocessing, clustering, annotation
from spatial_tk.utils.helpers import get_table, set_table

# Load data
sdata = data_io.load_existing_spatial_data("data.zarr")
adata = get_table(sdata)

# Process
adata = preprocessing.normalize_and_log(adata)
adata = clustering.run_pca(adata)
adata = clustering.compute_neighbors_and_umap(adata)
adata = clustering.cluster_leiden(adata, resolution=0.5)

# Save
set_table(sdata, adata)
data_io.save_spatial_data(sdata, "processed.zarr")

Citation

This tool is based on the Scverse ecosystem and follows best practices from:

License

MIT License

Support

For issues, questions, or contributions, please contact the Hope Lab or open an issue on GitHub.
