
Feature/ml stain normalization#1

Draft
LAdam-ix wants to merge 80 commits into main from feature/ml-stain-normalization

Conversation

@LAdam-ix
Collaborator

@LAdam-ix LAdam-ix commented Mar 5, 2026

No description provided.

@LAdam-ix LAdam-ix self-assigned this Mar 5, 2026
@coderabbitai

coderabbitai bot commented Mar 5, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dc5e9541-70a7-4e71-a226-26d4d2fedec7


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request establishes the foundational structure for a machine learning project focused on stain normalization of histopathological images. It encompasses a complete development lifecycle, from environment setup and code quality enforcement to data preprocessing, model implementation, training, and comprehensive result analysis. The changes enable robust experimentation and deployment of stain normalization techniques, providing tools for both tile-level and whole-slide image processing.

Highlights

  • Project Setup and Tooling: Initialized the project with PDM for dependency management, configured pre-commit hooks for code quality (Ruff, MyPy, commitizen), and set up GitLab CI for automated linting.
  • Stain Normalization Model and Training: Introduced a U-Net based model for stain normalization, utilizing a custom L1+SSIM loss function and integrating with PyTorch Lightning for training and evaluation workflows.
  • Data Handling and Augmentation: Implemented a flexible data module using Hydra for configuration, supporting various datasets (train, val, test, predict) and custom image modification transformations (HEDFactor, ExposureAdjustment, HVSModification, CombinedModifications) to simulate stain variations.
  • Analysis and Export Callbacks: Developed specialized callbacks for exporting predicted tiles, computing comprehensive stain analysis metrics (SSIM, PCC, NMI, Lab PSNR, stain vector comparison), and reassembling normalized tiles into whole-slide pyramid TIFFs.
  • Preprocessing and Demo Utilities: Added scripts for generating tissue masks and tiling whole-slide images, along with a user-friendly demo script to apply the stain normalization model to individual images or directories.
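The augmentation transforms listed above are not shown in this thread; as a rough illustration of the idea only (not the PR's actual implementation), an exposure adjustment and a stain-intensity perturbation in optical-density space can be sketched in plain NumPy:

```python
import numpy as np

rng = np.random.default_rng(42)


def adjust_exposure(img: np.ndarray, low: float = 0.9, high: float = 1.1) -> np.ndarray:
    """Randomly scale the brightness of a float RGB image in [0, 1]
    (the ExposureAdjustment transform described above works along these lines)."""
    return np.clip(img * rng.uniform(low, high), 0.0, 1.0)


def scale_stain_intensity(img: np.ndarray, factor: float) -> np.ndarray:
    """Scale stain intensity in optical-density space (Beer-Lambert):
    factor > 1 darkens the stain, factor < 1 lightens it."""
    od = -np.log(np.clip(img, 1e-6, 1.0))  # per-channel optical density
    return np.exp(-od * factor)
```

With `factor = 1.0` the perturbation is the identity; HEDFactor presumably applies the same idea per stain channel after a color deconvolution, which this sketch omits.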
Changelog
  • .flake8
    • Added configuration for the flake8 linter, setting max line length and ignoring specific error codes.
  • .gitignore
    • Added a comprehensive .gitignore file to exclude common Python-related build artifacts, caches, environments, and project-specific data.
  • .gitlab-ci.yml
    • Added GitLab CI configuration to include a Python linting stage, leveraging a shared CI template.
  • .mypy.ini
    • Added MyPy configuration for strict type checking, ignoring missing imports and disabling specific error codes.
  • .pre-commit-config.yaml
    • Added pre-commit hooks for YAML validation, commit message validation using Commitizen, and Ruff for linting and formatting.
  • .ruff.toml
    • Added Ruff linter and formatter configuration, enabling various checks and specifying ignore rules and formatting options.
  • README.md
    • Updated the README file with a detailed project description, demo instructions, and a list of available command-line arguments, written in Slovak.
  • analyze_dataset.py
    • Added a script to compare image datasets against a reference or between two datasets using various stain analysis metrics like SSIM, PCC, NMI, and stain vector differences.
  • configs/data/datasets/stain_normalization/predict.yaml
    • Added Hydra configuration for the prediction dataset, specifying the target class and MLflow URI for data.
  • configs/data/datasets/stain_normalization/test.yaml
    • Added Hydra configuration for the test dataset, including modification and normalization defaults and an MLflow URI.
  • configs/data/datasets/stain_normalization/train.yaml
    • Added Hydra configuration for the training dataset, including modification and normalization defaults and an MLflow URI.
  • configs/data/datasets/stain_normalization/val.yaml
    • Added Hydra configuration for the validation dataset, including modification and normalization defaults and an MLflow URI.
  • configs/data/modify/test.yaml
    • Added Hydra configuration for image modification transformations used in testing, including HEDFactor, ExposureAdjustment, HVSModification, and CombinedModifications.
  • configs/data/modify/train.yaml
    • Added Hydra configuration for image modification transformations used in training, including HEDFactor, ExposureAdjustment, HVSModification, and CombinedModifications.
  • configs/data/normalize/default.yaml
    • Added Hydra configuration for default image normalization parameters, specifying mean, standard deviation, and max pixel value.
  • configs/default.yaml
    • Added the main Hydra configuration file, defining defaults for logging, datasets, callbacks, trainer settings, and MLflow metadata.
  • configs/hydra/default.yaml
    • Added Hydra-specific configuration to disable logging and set the run directory.
  • configs/logger/mlflow.yaml
    • Added MLflow logger configuration, specifying experiment name, run name, and custom tags.
  • demo.py
    • Added a demo script that loads a pre-trained stain normalization model and applies it to single images or entire folders, saving the normalized outputs.
  • preprocessing/mask_generator.py
    • Added a script for generating tissue masks from whole-slide images using pyvips and OpenSlide, saving them as TIFF files.
  • preprocessing/tiler.py
    • Added a script for tiling whole-slide images, applying tissue mask filtering, and saving the resulting slide and tile metadata as MLflow datasets.
  • pyproject.toml
    • Configured PDM for project metadata, dependencies, development dependencies, and script commands for tasks like mask generation, tiling, training, and linting.
  • stain_normalization/main.py
    • Added the main entry point for the stain normalization application, integrating Hydra for configuration, PyTorch Lightning for training, and MLflow for logging.
  • stain_normalization/analysis/__init__.py
    • Added the __init__.py file for the analysis module, exposing the StainAnalyzer class and various report metrics.
  • stain_normalization/analysis/analyzer.py
    • Implemented the StainAnalyzer class to compare images using metrics such as stain vectors, SSIM, PCC, NMI, and Lab brightness PSNR, and to generate statistical reports.
  • stain_normalization/callbacks/__init__.py
    • Added the __init__.py file for the callbacks module, exposing custom PyTorch Lightning callbacks.
  • stain_normalization/callbacks/_base.py
    • Added a base NormalizationCallback class providing common denormalization utilities for other callbacks.
  • stain_normalization/callbacks/analysis_export.py
    • Implemented the AnalysisExport callback to compute and log stain analysis metrics for original, modified, and predicted images during model testing.
  • stain_normalization/callbacks/tiles_export.py
    • Implemented the TilesExport callback to save original, modified, and predicted image tiles to disk during model testing and prediction.
  • stain_normalization/callbacks/wsi_assembler.py
    • Implemented the WSIAssembler callback to efficiently reassemble predicted image tiles into whole-slide pyramid TIFFs, handling overlaps and memory management.
  • stain_normalization/data/__init__.py
    • Added the __init__.py file for the data module, exposing the DataModule class.
  • stain_normalization/data/data_module.py
    • Implemented the DataModule class for managing data loading across different stages (train, val, test, predict) using PyTorch DataLoaders.
  • stain_normalization/data/datasets/__init__.py
    • Added the __init__.py file for the datasets module, exposing specific dataset classes for different modes.
  • stain_normalization/data/datasets/predict_dataset.py
    • Implemented the PredictDataset class for loading image tiles for prediction, applying normalization if specified.
  • stain_normalization/data/datasets/test_dataset.py
    • Implemented the TestDataset class for loading image tiles for testing, applying both modification and normalization transformations.
  • stain_normalization/data/datasets/train_dataset.py
    • Implemented the TrainDataset class for loading image tiles for training, applying modification and normalization to generate input-target pairs.
  • stain_normalization/data/modification/__init__.py
    • Added the __init__.py file for the image modification module, exposing various transformation classes.
  • stain_normalization/data/modification/combiend_modification.py
    • Implemented the CombinedModifications class to apply random intensity scaling and brightness adjustments to H&E channels in HED color space.
  • stain_normalization/data/modification/exposure_adjustment.py
    • Implemented the ExposureAdjustment class to randomly scale the brightness of an image.
  • stain_normalization/data/modification/hed_factor.py
    • Implemented the HEDFactor class to randomly adjust the intensity of Hematoxylin and Eosin stains in HED color space.
  • stain_normalization/data/modification/hvs_modification.py
    • Implemented the HVSModification class to randomly modify hue, saturation, and value channels in HSV color space.
  • stain_normalization/data/utils/__init__.py
    • Added the __init__.py file for data utilities, exposing the collate_fn.
  • stain_normalization/data/utils/collate_fn.py
    • Implemented a custom collate_fn for batching data, specifically handling metadata in prediction and test modes.
  • stain_normalization/metrics/__init__.py
    • Added the __init__.py file for the metrics module, exposing image and vector-based metric functions.
  • stain_normalization/metrics/image_metrics.py
    • Implemented image-based metrics including Normalized Median Intensity (NMI), Pearson Correlation Coefficient (PCC), and Lab brightness PSNR.
  • stain_normalization/metrics/vector_metrics.py
    • Implemented vector-based metrics for comparing stain vectors using CIE76 Delta E, including logic for handling stain swapping.
  • stain_normalization/modeling/__init__.py
    • Added the __init__.py file for the modeling module, exposing the L1SSIMLoss and UNet classes.
  • stain_normalization/modeling/l1ssim_loss.py
    • Implemented the L1SSIMLoss, a composite loss function combining L1, SSIM, gradient, and luminance losses for image similarity.
  • stain_normalization/modeling/unet.py
    • Implemented a U-Net architecture with double convolutions, downsampling, upsampling, and an output convolution for image-to-image tasks.
  • stain_normalization/stain_normalization_model.py
    • Implemented the StainNormalizationModel as a PyTorch LightningModule, integrating the U-Net, custom loss, and evaluation metrics for training, validation, testing, and prediction.
  • stain_normalization/type_aliases.py
    • Defined type aliases for various data structures used throughout the project, such as Sample, PredictSample, Batch, PredictBatch, and Outputs.
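The custom collate_fn itself is not shown in this thread; a minimal sketch of the pattern, assuming each sample is an image array paired with a metadata dict (an assumption — the PR may structure samples differently):

```python
import numpy as np


def collate_with_metadata(
    samples: list[tuple[np.ndarray, dict]],
) -> tuple[np.ndarray, list[dict]]:
    """Stack image arrays into one batch array; keep the metadata dicts
    as a plain list, since dicts of mixed types cannot be stacked."""
    images, metas = zip(*samples)
    return np.stack(images), list(metas)
```

This is the usual reason a custom collate function is needed for predict/test modes: the default collation tries to batch every field, including non-tensor metadata.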

gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive new project for histopathological image stain normalization. It includes the full pipeline from data loading and preprocessing, a U-Net based model, a PyTorch Lightning training setup, and extensive analysis and visualization callbacks. The overall structure is well-organized and leverages modern tools like Hydra, PDM, and Ruff.

My review has identified several issues. Most critically, there are bugs in the data augmentation pipeline where transforms expect float images but receive uint8, which could lead to incorrect behavior or crashes. I've also found some high-severity issues in preprocessing scripts, such as a redefined variable and leftover debugging code. Additionally, there are several medium-severity suggestions to improve maintainability, correctness, and adherence to best practices, including adding missing type hints, removing unreachable code, and fixing typos. Addressing these points will significantly improve the robustness and quality of the new project.

```python
    Returns:
        Modified RGB image as a float32 NumPy array with values in [0.0, 1.0].
    """
    hed_image = separate_stains(img, hed_from_rgb)
```

critical

The skimage.color.separate_stains function expects the input image to be of type float with values in the range [0, 1]. However, the input img from the dataset is a uint8 array. This will lead to incorrect results or a crash. You should convert the image to float and normalize it at the beginning of the apply method.

Suggested change:

```diff
-hed_image = separate_stains(img, hed_from_rgb)
+img_float = img.astype(np.float32) / 255.0
+hed_image = separate_stains(img_float, hed_from_rgb)
```

```python
    saturation_scale = np.random.uniform(*self.saturation_range)
    value_scale = np.random.uniform(*self.value_range)

    hsv_image = rgb2hsv(img)
```

critical

The skimage.color.rgb2hsv function expects the input image to be of type float. The img passed here is uint8, which will lead to incorrect color space conversion. You need to convert the image to float and normalize it to the [0, 1] range first.

Suggested change:

```diff
-hsv_image = rgb2hsv(img)
+img_float = img.astype(np.float32) / 255.0
+hsv_image = rgb2hsv(img_float)
```

```python
    h_factor = np.random.uniform(*self.h_range)
    e_factor = np.random.uniform(*self.e_range)

    hed_image = separate_stains(img, hed_from_rgb)
```

critical

The skimage.color.separate_stains function expects a float image in the range [0, 1], but it receives a uint8 image from the dataset. This is a critical bug that will cause incorrect behavior. Please convert the input image to float before processing.

Suggested change:

```diff
-hed_image = separate_stains(img, hed_from_rgb)
+img_float = img.astype(np.float32) / 255.0
+hed_image = separate_stains(img_float, hed_from_rgb)
```

Comment on lines +104 to +115

```python
#main()
slides = [("/home/jovyan/staining/demo_data/P-2016_0077-08-1_hed_h0.6_e1.5.tiff")]
train_slides_df, train_tiles_df = tiling(slides=slides, handler=handler)

mlflow.set_experiment(experiment_name="Stain-Normalization")
with mlflow.start_run(run_name="P-2016_0077-08-1_hed all tissue tiles") as _:
    save_mlflow_dataset(
        slides=train_slides_df,
        tiles=train_tiles_df,
        dataset_name="P-2016_0077-08-1_hed",
    )
```

high

This block of code seems to be leftover from debugging or testing. It's outside of any function and will execute upon module import. It also contains a hardcoded path and comments out the main() function call. This should be cleaned up or moved into a dedicated test script.

Suggested change:

```diff
-#main()
-slides = [("/home/jovyan/staining/demo_data/P-2016_0077-08-1_hed_h0.6_e1.5.tiff")]
-train_slides_df, train_tiles_df = tiling(slides=slides, handler=handler)
-mlflow.set_experiment(experiment_name="Stain-Normalization")
-with mlflow.start_run(run_name="P-2016_0077-08-1_hed all tissue tiles") as _:
-    save_mlflow_dataset(
-        slides=train_slides_df,
-        tiles=train_tiles_df,
-        dataset_name="P-2016_0077-08-1_hed",
-    )
+if __name__ == "__main__":
+    main()
```
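The suggested fix relies on Python's import guard: code under `if __name__ == "__main__":` runs only when the file is executed as a script, never on import. A hedged sketch of the cleaned-up module layout (the helper name and print are illustrative, not from the PR; the path is the one from the debug block):

```python
def build_slide_list(root: str) -> list[str]:
    """Illustrative helper: the hardcoded slide path, parameterized by root."""
    return [f"{root}/P-2016_0077-08-1_hed_h0.6_e1.5.tiff"]


def main() -> None:
    slides = build_slide_list("/home/jovyan/staining/demo_data")
    print(f"tiling {len(slides)} slide(s)")


if __name__ == "__main__":
    # Executes only via `python tiler.py`, not when the module is imported.
    main()
```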

Comment on lines +1 to +3

```python
from stain_normalization.data.modification.combiend_modification import (
    CombinedModifications,
)
```

high

There is a typo in the filename combiend_modification. It should be combined_modification. Please rename the file and update the import statement accordingly.

Suggested change:

```diff
-from stain_normalization.data.modification.combiend_modification import (
-    CombinedModifications,
-)
+from stain_normalization.data.modification.combined_modification import (
+    CombinedModifications,
+)
```

Comment on lines +26 to +27

```python
        outputs: list[torch.Tensor],
        batch: tuple[torch.Tensor, list],
```

medium

The type hints for outputs and batch are incorrect or too generic.

  • outputs is a single torch.Tensor returned from test_step, not a list[torch.Tensor].
  • The list in the batch tuple is a list of dictionaries. A more specific type hint would improve readability.

Suggested change:

```diff
-outputs: list[torch.Tensor],
-batch: tuple[torch.Tensor, list],
+outputs: torch.Tensor,
+batch: tuple[torch.Tensor, list[dict]],
```

Comment on lines +123 to +124

```python
    if analyzer is None:
        return
```

medium

This check for analyzer is None appears to be unreachable. The preceding logic ensures that either run_reference_mode or run_paired_mode is called, both of which return a StainAnalyzer instance. If neither condition is met, parser.error() is called, which terminates the script. This block can be safely removed.

```python
    return np.array(Image.open(path).convert("RGB"))


def iterate_tiles(slides, tiles):
```

medium

For better code clarity and maintainability, please add type hints to the function arguments (slides, tiles) and the return value. Based on the usage, slides and tiles appear to be pandas DataFrames, and the function returns an iterator.

Suggested change:

```diff
-def iterate_tiles(slides, tiles):
+def iterate_tiles(slides: pd.DataFrame, tiles: pd.DataFrame) -> Iterator[tuple[str, np.ndarray, str]]:
```
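For illustration, annotating a generator that yields tuples follows this shape (a generic, self-contained example — not the project's actual iterate_tiles, which takes DataFrames):

```python
from collections.abc import Iterator


def iterate_pairs(ids: list[str], rows: list[dict[str, int]]) -> Iterator[tuple[str, int]]:
    """A generator function's return annotation is Iterator[...] of what it yields."""
    for tile_id, row in zip(ids, rows):
        yield tile_id, row[tile_id]
```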

```python
    """
    # Calculate RGB from optical density by reversing the process in estimate_stain_vectors.
    # default i0=240 (transmitted light intensity)
    rgb = np.clip(240 * np.exp(-od_vector), 0, 255) / 255.0
```

medium

The value 240 is a magic number representing the transmitted light intensity (i0) used in optical density calculations. It's better to define this as a named constant at the module level (e.g., TRANSMITTED_LIGHT_INTENSITY = 240) to improve readability and make it easier to change if needed.

Suggested change:

```diff
-rgb = np.clip(240 * np.exp(-od_vector), 0, 255) / 255.0
+rgb = np.clip(TRANSMITTED_LIGHT_INTENSITY * np.exp(-od_vector), 0, 255) / 255.0
```
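For context, here is a self-contained sketch of the forward and inverse transforms around that line, with the constant named as the reviewer proposes. The forward function is an assumption about what estimate_stain_vectors does (it is not shown in this thread):

```python
import numpy as np

TRANSMITTED_LIGHT_INTENSITY = 240  # i0 in the Beer-Lambert model


def rgb_to_od(rgb01: np.ndarray) -> np.ndarray:
    """RGB in [0, 1] -> optical density, assuming transmitted intensity i0."""
    intensity = np.clip(rgb01 * 255.0, 1.0, 255.0)  # floor avoids log(0)
    return -np.log(intensity / TRANSMITTED_LIGHT_INTENSITY)


def od_to_rgb(od_vector: np.ndarray) -> np.ndarray:
    """Inverse transform: the reviewed line, with the magic number named."""
    return np.clip(TRANSMITTED_LIGHT_INTENSITY * np.exp(-od_vector), 0, 255) / 255.0
```

The pair round-trips for mid-range pixel values; values brighter than i0/255 are clipped, which is exactly the behavior the reviewed line encodes.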

```python
from math import exp


class L1SSIMLoss(nn.Module):
    def __init__(self, lambda_dssim: float = 0.6, lambda_l1: float = 0.2, lambda_lum: float = 0.2, lambda_gdl: float = 0.1):
```

medium

The weights for the different loss components (lambda_dssim, lambda_l1, lambda_lum, lambda_gdl) sum to 1.1. While not strictly required, it's common practice for these weights to sum to 1.0, as it makes their relative contributions clearer. Consider normalizing the weights or clarifying in a comment why they don't sum to 1.
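If the intent is purely relative weighting, one option (a sketch, not necessarily what the author wants) is to normalize the lambdas at construction time so they always sum to 1.0 while keeping their ratios:

```python
def normalize_loss_weights(**lambdas: float) -> dict[str, float]:
    """Rescale loss-term weights to sum to 1.0, preserving their ratios."""
    total = sum(lambdas.values())
    if total <= 0:
        raise ValueError("loss weights must sum to a positive value")
    return {name: value / total for name, value in lambdas.items()}
```

With the PR's defaults, normalize_loss_weights(dssim=0.6, l1=0.2, lum=0.2, gdl=0.1) divides everything by 1.1, so the DSSIM term still contributes three times as much as the L1 term.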

@matejpekar
Copy link
Member

Make multiple smaller ones instead of one large PR. It is impossible to review

@LAdam-ix
Copy link
Collaborator Author

LAdam-ix commented Mar 6, 2026

@matejpekar Will it be enough if I separate it like this:

  • one PR where the actual ML learning is done (model, data pipeline, configs, loss),

  • another for what was used to analyse it (callbacks, metrics),

  • last one for preprocessing stuff and other things that didn't fit elsewhere, like demo

Would that be enough modularization, or did you imagine some other way?

Also, should I do it by creating new branches and opening separate merge requests to main, or chain the requests together (branch 1 to main, branch 2 to branch 1, and so on)?


3 participants