
feat: tissue masks #3

Merged
Adames4 merged 32 commits into master from feature/tissue-masks
Feb 16, 2026

Conversation


@Adames4 Adames4 commented Feb 11, 2026

Tissue masks generation script.

Closes IBD-17

Blocked by #2

Dependency graph:

                         +--------------+
                  -------| tissue-masks |<------+           +------------+      +----------------------+
                 /       +--------------+       |       +---| tile-masks |<-----| preprocessing-report |
                /                               |       |   +------------+      +----------------------+
+---------+    /                            +--------+  |
| dataset | <-+                             | tiling |<-+
+---------+    \                            +--------+  |
                \                               |       |   +------------+
                 \       +-----------------+    |       +---| embeddings |
                  -------| quality-control |<---+           +------------+
                         +-----------------+

Summary by CodeRabbit

  • New Features

    • Tissue mask preprocessing pipeline with parallel processing, MLflow artifact logging, and a CLI entrypoint
    • Added processed dataset configurations for three datasets
  • Chores

    • Added rationai-masks dependency
    • Broadened mypy exclusion patterns
  • Refactor

    • More specific type hints for dataset creation utilities
    • Updated dataset input parameter name in the dataset creation run command
  • Scripts

    • Added orchestration script to schedule remote tissue-mask jobs

@Adames4 Adames4 self-assigned this Feb 11, 2026

coderabbitai bot commented Feb 11, 2026

📝 Walkthrough


Adds tissue-mask preprocessing: new configs and scripts, a new preprocessing module using Ray/OpenSlide/VIPS with MLflow artifact handling, a mypy exclusion update, an added dependency, and a CLI flag rename in a create_dataset script.

Changes

Cohort / File(s) Summary
Type Checking Config
.mypy.ini
Broadened mypy exclude pattern to skip both scripts and project_name directories.
Processed Dataset Configs
configs/dataset/processed/ftn.yaml, configs/dataset/processed/ikem.yaml, configs/dataset/processed/knl_patos.yaml
Added three processed-dataset YAMLs referencing raw dataset defaults and MLflow artifact URIs.
Preprocessing Config
configs/preprocessing/tissue_masks.yaml
New YAML for tissue mask preprocessing (level, concurrency, artifact_path, metadata/hyperparams).
Preprocessing Code
preprocessing/create_dataset.py, preprocessing/tissue_masks.py
create_dataset.py: narrowed type hints to re.Pattern[str]. Added preprocessing/tissue_masks.py: Ray-parallel slide processing, OpenSlide/VIPS mask creation, writes multi-page TIFF with MPP, and MLflow artifact logging.
Project Metadata & Dependency
pyproject.toml
Added rationai-masks dependency and a UV source entry for it.
Scripts / Job Submission
scripts/preprocessing/create_dataset.py, scripts/preprocessing/tissue_masks.py
create_dataset script: changed flag from --data to --dataset. Added scripts/preprocessing/tissue_masks.py to submit a kube job that runs the tissue_masks workflow with specified resources and secure storage. Review job resource and storage settings.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Hydra as HydraConfig
    participant Main as tissue_masks.main
    participant Ray as Ray
    participant OpenSlide as OpenSlide
    participant VIPS as libvips
    participant MLStorage as MLflowArtifacts
    participant MLFlow as MLFlowLogger

    User->>Hydra: launch with `tissue_masks` config
    Hydra->>Main: provide config & logger
    Main->>MLStorage: download dataset CSV
    MLStorage-->>Main: dataset list
    Main->>Ray: submit parallel `process_slide` tasks
    Ray->>OpenSlide: open slide, read resolution
    OpenSlide-->>Ray: provide mpp/resolution
    Ray->>VIPS: build level image & compute mask
    VIPS-->>Ray: mask image bytes
    Ray->>MLStorage: write TIFF and upload artifact
    Ray-->>Main: task complete
    Main->>MLFlow: log artifact path
    MLFlow-->>User: artifacts recorded
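
The fan-out shown in the diagram can be sketched with the stdlib's ThreadPoolExecutor standing in for Ray, since the submit-and-gather shape is the same; function and parameter names here are illustrative, not the actual module:

```python
# Stand-in sketch of the Ray fan-out from the diagram; ThreadPoolExecutor
# replaces Ray, and names/signatures are assumptions, not the real code.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def process_slide(slide_path: str, output_dir: str) -> str:
    # The real task opens the slide with OpenSlide, builds the level image
    # and mask with pyvips, and writes a multi-page TIFF with MPP metadata;
    # here we only compute the destination path.
    return str(Path(output_dir) / (Path(slide_path).stem + ".tiff"))


def run(slide_paths: list[str], output_dir: str, max_concurrent: int = 4) -> list[str]:
    # Submit one task per slide and gather results in input order.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = [pool.submit(process_slide, p, output_dir) for p in slide_paths]
        return [f.result() for f in futures]
```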

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

  • feat: dataset #2: touches the same preprocessing/create_dataset.py functions and the mypy exclude setting — strong code-level relation.

Suggested reviewers

  • vejtek

Poem

🐇 I hop through slides where pixels hide,
I call on Ray to split the tide,
OpenSlide whispers, VIPS paints light,
MLflow tucks the masks in tight,
Hooray — small hops make data bright!

🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
Docstring Coverage (⚠️ Warning): docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Description Check (✅ Passed): check skipped; CodeRabbit's high-level summary is enabled.
Title Check (✅ Passed): the title 'feat: tissue masks' directly and clearly summarizes the main change, adding tissue mask generation to the preprocessing pipeline.
Merge Conflict Detection (✅ Passed): no merge conflicts detected when merging into master.



No actionable comments were generated in the recent review. 🎉



@gemini-code-assist

Summary of Changes

Hello @Adames4, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the capability to generate tissue masks from whole slide images, a foundational step for digital pathology workflows. It establishes a new processing script that utilizes parallel computing for efficiency, integrates necessary configuration settings for various datasets and the mask generation process itself, and incorporates a specialized library for mask operations. Furthermore, the changes include a Kubernetes job definition, ensuring that this new functionality can be deployed and scaled effectively within the existing infrastructure, thereby enhancing the overall preprocessing pipeline.

Highlights

  • Tissue Mask Generation Script: A new Python script (preprocessing/tissue_masks.py) was introduced to generate tissue masks from whole slide images, leveraging pyvips for image processing and ray for parallel execution.
  • Configuration Updates: New configuration files were added for processed datasets (configs/dataset/processed/ftn.yaml, ikem.yaml, knl_patos.yaml) and for the tissue mask preprocessing step (configs/preprocessing/tissue_masks.yaml), defining parameters and MLflow logging details.
  • New Dependency: The rationai-masks library was added as a new dependency in pyproject.toml and uv.lock to provide specialized functionalities for mask generation.
  • Kubernetes Job Integration: A new Kubernetes job script (scripts/preprocessing/tissue_masks.py) was added to enable scalable deployment and execution of the tissue mask generation process on a cluster.
Changelog
  • configs/dataset/processed/ftn.yaml
    • Added configuration for the FTN processed dataset, specifying its raw source and MLflow artifact URI.
  • configs/dataset/processed/ikem.yaml
    • Added configuration for the IKEM processed dataset, specifying its raw source and MLflow artifact URI.
  • configs/dataset/processed/knl_patos.yaml
    • Added configuration for the KNL_PATOS processed dataset, specifying its raw source and MLflow artifact URI.
  • configs/preprocessing/tissue_masks.yaml
    • Added a new configuration file for tissue mask preprocessing, defining parameters like processing level, concurrency, artifact path, and metadata for MLflow logging.
  • preprocessing/tissue_masks.py
    • Added a new Python script that implements the logic for generating tissue masks from whole slide images, utilizing OpenSlide, pyvips, and ray for parallel processing, and logging artifacts to MLflow.
  • pyproject.toml
    • Updated project dependencies to include rationai-masks and added its Git source.
  • scripts/preprocessing/tissue_masks.py
    • Added a new script to define and submit a Kubernetes job for running the tissue mask generation process.
  • uv.lock
    • Updated the dependency lock file to reflect the addition of rationai-masks.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a script for generating tissue masks, including the core processing logic, configuration files, and a job submission script. The changes are well-structured. My review includes several suggestions to improve the robustness and maintainability of the new scripts, such as addressing TODO comments in configuration, improving the parallel processing setup, and replacing placeholder values in the job submission script.


Copilot AI left a comment


Pull request overview

This pull request adds tissue mask generation functionality as part of the preprocessing pipeline for the ulcerative colitis project. It depends on PR #2 (dataset creation) and introduces a Ray-based distributed processing system to generate tissue masks from whole slide images at a specified pyramid level.

Changes:

  • Added rationai-masks library dependency for tissue mask generation
  • Created preprocessing script to download datasets from MLflow and generate tissue masks in parallel using Ray
  • Added configuration files for tissue mask generation and processed dataset URIs for three institutions (IKEM, FTN, KNL PATOS)

Reviewed changes

Copilot reviewed 7 out of 9 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
pyproject.toml Added rationai-masks dependency for tissue mask generation
uv.lock Updated lock file with rationai-masks and its transitive dependencies
preprocessing/tissue_masks.py Main processing script that downloads datasets and generates tissue masks using Ray for parallel processing
scripts/preprocessing/tissue_masks.py Kubernetes job submission script for running tissue mask generation
configs/preprocessing/tissue_masks.yaml Configuration for tissue mask generation including level and concurrency settings
configs/dataset/processed/ikem.yaml Configuration with MLflow URI for IKEM processed dataset
configs/dataset/processed/ftn.yaml Configuration with MLflow URI for FTN processed dataset
configs/dataset/processed/knl_patos.yaml Configuration with MLflow URI for KNL PATOS processed dataset


@Adames4 Adames4 requested a review from vejtek February 11, 2026 13:05
@Adames4 Adames4 mentioned this pull request Feb 12, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@configs/dataset/processed/ftn.yaml`:
- Around line 1-5: The yaml currently contains a placeholder MLflow artifact URI
under the uri key (the line starting with "uri: mlflow-artifacts:/86/... # TODO
update URI"); replace that placeholder with the final, correct MLflow artifact
URI for this dataset (or delete the uri entry/remove this ftn config until the
proper run/artifact is available) so dataset resolution won't fail when this
config is used.

In `@configs/dataset/processed/ikem.yaml`:
- Around line 1-5: The YAML currently contains a placeholder MLflow artifact URI
in the uri key
(mlflow-artifacts:/86/52d0081d60ba4585a16a3bd341d5ab09/artifacts/dataset.csv)
which must be replaced before merging; update the uri value to the final MLflow
artifact URI for the correct run (or remove the uri entry / this config until
the final artifact is available) so dataset resolution using this config
succeeds.

In `@configs/dataset/processed/knl_patos.yaml`:
- Around line 1-5: The config currently contains a placeholder MLflow artifact
URI in the uri field (uri:
mlflow-artifacts:/86/afbcfd43cb3c4fd0b1e9b5dbe7327d91/artifacts/dataset.csv)
which must be replaced or removed; update the uri value to the final, correct
MLflow artifact URI for the dataset (or delete the uri entry / the entire
processed config until the correct run/URI is available) so that dataset
resolution using the defaults entries (the /dataset/raw/knl_patos@_here_ and
_self_ defaults) will point to the correct artifact.

In `@preprocessing/tissue_masks.py`:
- Around line 46-57: The code creates a TemporaryDirectory on the driver and
then runs process_items which may schedule process_slide on remote Ray workers,
causing missing writes because workers can't access the driver's temp path;
replace the driver-local TemporaryDirectory with a shared/configurable output
path (use config.shared_output_dir or add one to config), pass that Path into
process_items via fn_kwargs (preserving "output_path"), and ensure
logger.log_artifacts points to the same shared path (config.artifact_path) or
validate/raise if running in single-node mode (config.max_concurrent implies
multi-node); update any documentation/comments to state that process_slide
requires a shared filesystem when using remote workers.
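
The suggested fix can be sketched as follows; process_items and process_slide are simplified stand-ins for the project's Ray-backed helpers, and the names, signatures, and config field (shared_output_dir) are assumptions drawn from the comment, not the actual code:

```python
# Sketch: a configured shared output directory passed to workers via
# fn_kwargs, replacing a driver-local TemporaryDirectory that remote Ray
# workers cannot see. All names here are hypothetical.
from pathlib import Path
from typing import Callable, Iterable


def process_items(items: Iterable[str], fn: Callable[..., Path], fn_kwargs: dict) -> list[Path]:
    # Stand-in for the Ray-parallel helper: apply fn to every item.
    return [fn(item, **fn_kwargs) for item in items]


def process_slide(slide_path: str, output_path: Path) -> Path:
    # Stand-in: the real task writes the TIFF mask under output_path, which
    # must be a filesystem visible to every worker node.
    return output_path / (Path(slide_path).stem + ".tiff")


def main(shared_output_dir: str, slides: list[str]) -> list[Path]:
    # Shared, configurable path (e.g. config.shared_output_dir) so the driver
    # and remote workers read and write the same location; log_artifacts
    # would then point at this same path.
    output_path = Path(shared_output_dir)
    return process_items(slides, fn=process_slide, fn_kwargs={"output_path": output_path})
```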

In `@scripts/preprocessing/tissue_masks.py`:
- Around line 1-16: The submit_job call currently uses Ellipsis placeholders
(username=... and dataset path in script) which will fail at runtime; update the
module to accept real inputs by parsing CLI args or environment variables and
pass them into submit_job (replace the Ellipsis in username and the placeholder
in the "uv run ... +dataset=..." script entry). Locate the submit_job invocation
and the script list (symbols: submit_job, job_name, username, script) and wire
values from argparse (or os.environ) into those parameters, validating inputs
before calling submit_job and ensuring the final script string includes the
concrete dataset path.
🧹 Nitpick comments (1)
pyproject.toml (1)

16-18: Pin the rationai-masks git source to a tag/commit for reproducible builds.

With only a git URL, dependency resolution can drift as the repo changes. Consider pinning rev/tag in the UV source (and keep it aligned with your dependency spec) to make builds deterministic.

♻️ Example pinning with a specific commit/tag
 [tool.uv.sources]
-rationai-masks = { git = "https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/masks.git" }
+rationai-masks = { git = "https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/masks.git", rev = "<commit-or-tag>" }

Also applies to: 27-30

@Adames4 Adames4 changed the base branch from feature/dataset to master February 14, 2026 09:25
@Adames4 Adames4 requested review from matejpekar and vejtek February 14, 2026 10:03

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @.mypy.ini:
- Line 6: Remove "project_name" from the mypy exclusion pattern so the main
source package is type-checked; update the exclude setting from `exclude =
^(scripts|project_name)/` to only exclude scripts (e.g., `exclude =
^(scripts)/`) so files under the project package (like __init__.py, __main__.py,
project_name_model.py and subpackages) are included in mypy's strict checks.

In `@preprocessing/tissue_masks.py`:
- Line 29: The current mask_path assignment uses
Path(slide_path).with_suffix(".tiff").name which drops directory context and can
cause filename collisions (e.g., slide.tiff from different folders); update the
mask_path construction to append a unique identifier to the filename (for
example, a short hash of slide_path or include the dataframe index) so each
output is distinct—locate where mask_path is set (variable mask_path using
output_path and slide_path) and change the filename generation to include the
chosen unique token before the .tiff suffix to prevent silent overwrites.
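
The suggested collision-free naming can be sketched by appending a short hash of the full slide path, so identically named slides from different folders map to distinct mask files; mask_filename is a hypothetical helper name, not the actual code:

```python
# Sketch: derive a unique mask filename from the full slide path so that
# "a/slide.svs" and "b/slide.svs" no longer silently overwrite each other.
import hashlib
from pathlib import Path


def mask_filename(slide_path: str) -> str:
    # Short, deterministic token from the whole path keeps names stable
    # across runs while preventing cross-folder collisions.
    digest = hashlib.sha1(slide_path.encode("utf-8")).hexdigest()[:8]
    return f"{Path(slide_path).stem}_{digest}.tiff"
```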
🧹 Nitpick comments (2)
pyproject.toml (1)

17-17: Consider pinning rationai-masks to a specific tag or revision.

Like rationai-mlkit, this dependency has no version constraint and resolves from a git source with no rev, tag, or branch specifier. This means builds may break or behave differently if the upstream masks repo introduces breaking changes. Consider pinning to a specific commit or tag for reproducible builds.

This is consistent with the existing rationai-mlkit entry, so if you intentionally track HEAD for all internal libs, this is fine — just be aware of the trade-off.

Also applies to: 29-29

preprocessing/tissue_masks.py (1)

54-57: Consider try/finally to guarantee ray.shutdown().

Python's atexit handlers cover most exit paths, but wrapping in try/finally is more explicit and also ensures shutdown on KeyboardInterrupt (which bypasses atexit).

♻️ Suggested refactor
 if __name__ == "__main__":
     ray.init(runtime_env={"excludes": [".git", ".venv"]})
-    main()
-    ray.shutdown()
+    try:
+        main()
+    finally:
+        ray.shutdown()

@Adames4 Adames4 merged commit c290bca into master Feb 16, 2026
3 checks passed