EC_codeximaging

Overview

This repository contains a Python-based pipeline for preprocessing, cell segmentation, and cell phenotyping of multiplexed immunofluorescence whole-slide images (WSI) from endometrial cancer tissue.

The pipeline enables large-scale single-cell analysis of multiplexed imaging data by performing tile-based segmentation, marker quantification, and downstream clustering to identify cell phenotypes within the tumor microenvironment.

This codebase was used for image preprocessing, cell segmentation, and cell phenotyping in the manuscript:

"The Spatial Immune Landscape of Mismatch Repair Deficient Endometrial Cancer: Implications for Clinical Outcomes", submitted to Modern Pathology.

Additional implementation details describing the scripts and workflow are provided in Script Info.pdf.

Pipeline Overview

The pipeline performs the following major steps:

1. Tile preprocessing and E-cadherin intensity extraction

Whole-slide images (downloaded from OMERO as QPTIFFs) are divided into tiles. For each tile, the mean intensity of the top 5% of pixels in the E-cadherin channel is calculated.

2. Tile clustering

Per-sample k-means clustering (k = 3) is applied to E-cadherin intensity values to identify epithelial and non-epithelial tissue regions.

Clusters with higher E-cadherin signal are classified as:

E-cadherin positive (Ecad⁺)
E-cadherin negative (Ecad⁻)

3. Segmentation strategy assignment

Different segmentation strategies are applied depending on tile classification:

Ecad⁺ tiles: whole-cell segmentation
Ecad⁻ tiles: nuclear segmentation with pixel expansion

4. Cell segmentation

Segmentation generates:

cell-by-marker intensity matrix
cell metadata including:
- centroid coordinates
- area
- perimeter
- axis ratio
- tile coordinates
- slide/sample identifiers

Separate outputs are generated for Ecad⁺ and Ecad⁻ tiles and later concatenated into a unified dataset.

5. Marker thresholding and normalization

Cell lineage markers are thresholded using k-means clustering to identify marker-positive populations.

The marker matrix is then normalized per sample:

biomarker values capped at the 99.9th percentile
z-score normalization
arcsinh transformation

6. Dimensionality reduction and clustering

Cell phenotypes are identified through:

UMAP embedding
k-means clustering

Clustering results are used to define cell populations based on marker expression.

7. Downstream analysis and visualization

The pipeline generates:

UMAP visualizations
marker expression heatmaps
cluster composition plots
quality control plots (e.g., sample batch effects)

Key Scripts

Segmentation and preprocessing

main_cell_segmentation.py
Performs tile clustering based on E-cadherin intensity and generates cell segmentations.

run_preprocess.py
Prepares image tiles and preprocessing inputs.

Image labeling and export

main_label_images.py
Generates label images for segmented cells.

main_label_images_to_omero.py
Uploads segmentation outputs to OMERO.

Biomarker extraction

main_biomarker_tables.py
Generates biomarker expression tables for downstream analysis.

main_biomarker_tables_to_omero.py
Uploads biomarker tables to OMERO.

Cell phenotyping

main_cell_segmentation_analysis.py
Performs marker normalization, dimensionality reduction, and clustering.

main_celltype_cluster_tables.py
Exports cluster assignments representing cell phenotypes.

Environment

The pipeline was executed using Python within a conda environment.

Example environments used in the pipeline:

module load condaenvs/new/deepcell conda activate omero

Dependencies include common scientific Python packages such as:

numpy
pandas
scikit-learn
matplotlib
umap-learn

Additional dependencies may be required depending on the execution environment.

Downstream analysis (`analysis/`)

The analysis/ directory contains a config-driven pipeline for downstream statistical and spatial analysis of the cell-segmentation and phenotyping outputs. It expects per-cell metadata (e.g. .feather), a normalized counts matrix, and optional clinical data.

Running the analysis

python analysis/main.py

All paths and which analyses to run are set in analysis/config.py. Update RESULTS_DIR, METADATA_PATH, and CLINICAL_DIR to match your data, then set the relevant RUN_* flags to True or False.

Analysis modules

Module	Config flag	Description
Proportions	`RUN_GEN_PROPORTION_SUMMARY_TABLE`	Per-sample cell type proportions (total, T-cell–normalized, parent-type–normalized) and statistical tests vs. clinical variables.
T cell / macrophage proportions	`RUN_GEN_PROPORTION_TCELLS`	Proportion summaries focused on T cell and macrophage subsets.
Cell type ratios	`RUN_GEN_CELL_TYPE_RATIO_SUMMARY_TABLE`	User-defined ratios (e.g. CD8⁺/CD4⁺ T cells) per sample and region, with clinical association tests.
Fraction intratumoral	`RUN_GEN_FRACTION_INTRA_SUMMARY_TABLE`	Fraction of each cell type in intratumoral vs. peritumoral region and clinical associations.
Median distance	`RUN_GEN_MEDIAN_DISTANCE_SUMMARY_TABLE`	Median nearest-neighbor distance (µm) between cell type pairs per sample/region; optional self-interactions.
Scimap neighborhoods	`RUN_SCIMAP_NEIGHBORHOODS`	Spatial neighborhood analysis with scimap: spatial counts/expression, k-means neighborhood clusters, barplots, boxplots, and optional OMERO tables. Must run before the interaction summary.
Scimap interactions	`RUN_GEN_SCIMAP_INTERACTION_SUMMARY_TABLE`	Summary of cell–cell interactions (e.g. proportion of anchor cells near a target type within a radius) from scimap output.

Outputs are written under RESULTS_DIR (e.g. Summary_tables/, proportion and ratio CSVs, boxplots). When boxplots are enabled, results are compared to clinical variables (e.g. recurrence, stage) with configurable statistical tests.

Analysis dependencies

From analysis/requirements.txt:

anndata
matplotlib
numpy
pandas
pyarrow
scipy
seaborn

The scimap-based steps require scimap and plotly (for optional interactive plots). The pipeline also expects a small amount of project-specific structure (e.g. utils.io, stats.tests, visualization.boxplots); see the repository layout and analysis/main.py for import paths.

Test data

The repository includes minimal test data in test_data/ so that the downstream analysis pipeline can be run without access to the full study dataset. Included files:

File	Description
`metadata.csv`	Per-cell metadata (cell type, centroid coordinates, slide ID, region `peri_intra_tumoral`, morphology, etc.) for a small number of cells across a few synthetic slides.
`matrix_raw.csv`	Raw cell-by-marker intensity matrix matching the metadata rows.
`EC_Clinical_Data.csv`	Slide-level clinical variables (e.g. `slide_id`, `Recurred`, `ICI response`) for the same slides.

To run the analysis pipeline on this test data, set in analysis/config.py: RESULTS_DIR to the path to your results directory (e.g. the repo root or a copy of test_data/), METADATA_PATH to the relative path to the metadata file, and CLINICAL_DIR to the full path to test_data/EC_Clinical_Data.csv. The pipeline loads metadata via read_feather, so convert metadata.csv to .feather first if needed (e.g. pd.read_csv('test_data/metadata.csv').to_feather('test_data/metadata.feather')). The test set is small and intended only for verifying the code and workflow.

Data Availability

Multiplexed immunofluorescence images will available on the NYU Langone OMERO Public server.

Notes

Detailed documentation describing preprocessing, cell segmentation, and cell phenotyping is available in Script Info.pdf.

Name		Name	Last commit message	Last commit date
Latest commit History 391 Commits
.github		.github
analysis		analysis
bash_scripts		bash_scripts
config		config
src		src
test_data		test_data
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
Script Info.pdf		Script Info.pdf
main_assign_celltypes.py		main_assign_celltypes.py
main_biomarker_tables.py		main_biomarker_tables.py
main_cell_phenotyping_analysis.py		main_cell_phenotyping_analysis.py
main_cell_segmentation.py		main_cell_segmentation.py
main_cell_segmentation_analysis.py		main_cell_segmentation_analysis.py
main_celltype_cluster_tables.py		main_celltype_cluster_tables.py
main_celltype_tables.py		main_celltype_tables.py
main_label_images.py		main_label_images.py
main_label_images_csv.py		main_label_images_csv.py
main_label_images_to_omero.py		main_label_images_to_omero.py
run_preprocess.py		run_preprocess.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

EC_codeximaging

Overview

Pipeline Overview

1. Tile preprocessing and E-cadherin intensity extraction

2. Tile clustering

3. Segmentation strategy assignment

4. Cell segmentation

5. Marker thresholding and normalization

6. Dimensionality reduction and clustering

7. Downstream analysis and visualization

Key Scripts

Segmentation and preprocessing

Image labeling and export

Biomarker extraction

Cell phenotyping

Environment

Downstream analysis (`analysis/`)

Running the analysis

Analysis modules

Analysis dependencies

Test data

Data Availability

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

EC_codeximaging

Overview

Pipeline Overview

1. Tile preprocessing and E-cadherin intensity extraction

2. Tile clustering

3. Segmentation strategy assignment

4. Cell segmentation

5. Marker thresholding and normalization

6. Dimensionality reduction and clustering

7. Downstream analysis and visualization

Key Scripts

Segmentation and preprocessing

Image labeling and export

Biomarker extraction

Cell phenotyping

Environment

Downstream analysis (analysis/)

Running the analysis

Analysis modules

Analysis dependencies

Test data

Data Availability

Notes

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Downstream analysis (`analysis/`)

Packages