
feat: tile masks#6

Open
Adames4 wants to merge 24 commits into feature/tiling from feature/tile-masks

Conversation

@Adames4
Collaborator

@Adames4 Adames4 commented Feb 12, 2026

Tile masks script.

Closes IBD-20

Blocked by #5

Dependency graph:

                         +--------------+
                  -------| tissue-masks |<------+           +------------+      +----------------------+
                 /       +--------------+       |       +---| tile-masks |<-----| preprocessing-report |
                /                               |       |   +------------+      +----------------------+
+---------+    /                            +--------+  |
| dataset | <-+                             | tiling |<-+
+---------+    \                            +--------+  |
                \                               |       |   +------------+
                 \       +-----------------+    |       +---| embeddings |
                  -------| quality-control |<---+           +------------+
                         +-----------------+


@Adames4 Adames4 requested a review from Copilot February 12, 2026 20:39
@Adames4 Adames4 self-assigned this Feb 12, 2026
@coderabbitai
coderabbitai bot commented Feb 12, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@gemini-code-assist

Summary of Changes

Hello @Adames4, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust system for generating tile masks, which are essential for quality control in digital pathology workflows. By creating masks that categorize tiles based on blur and artifact presence, it enables more refined analysis and filtering of image data. The implementation utilizes distributed computing with Ray and integrates with MLflow for artifact management, providing a scalable and traceable solution for preprocessing large-scale datasets.

Highlights

  • New Tile Mask Generation Script: A new Python script preprocessing/tile_masks.py has been introduced to generate masks for image tiles, distinguishing between clean, blurred, and artifact-containing regions.
  • Hydra Configuration for Datasets: Multiple new YAML configuration files have been added under configs/dataset/tiled/ for different datasets (ftn, ikem, knl_patos) and various tiling parameters (mpp, tile_extent, level), enabling flexible execution of the tile mask script.
  • Distributed Processing with Ray: The tile mask generation process leverages Ray for distributed execution, allowing for efficient parallel processing of slides and their associated tiles.
  • MLflow Integration: The script integrates with MLflow for downloading input tile artifacts and logging generated tile mask artifacts, ensuring traceability and reproducibility.
  • Kubernetes Job Submission Script: A helper script scripts/preprocessing/tile_masks.py was added to facilitate submitting the tile mask generation as a Kubernetes job, streamlining deployment in a cluster environment.
Changelog
  • configs/dataset/tiled/ftn/0_320.yaml
    • Added configuration for FTN dataset with 0.17 mpp, 320 tile extent, and level 0.
  • configs/dataset/tiled/ftn/0_430.yaml
    • Added configuration for FTN dataset with 0.17 mpp, 430 tile extent, and level 0.
  • configs/dataset/tiled/ftn/1_224.yaml
    • Added configuration for FTN dataset with 0.52 mpp, 224 tile extent, and level 1.
  • configs/dataset/tiled/ftn/2_224.yaml
    • Added configuration for FTN dataset with 1.55 mpp, 224 tile extent, and level 2.
  • configs/dataset/tiled/ikem/0_320.yaml
    • Added configuration for IKEM dataset with 0.17 mpp, 320 tile extent, and level 0.
  • configs/dataset/tiled/ikem/0_430.yaml
    • Added configuration for IKEM dataset with 0.17 mpp, 430 tile extent, and level 0.
  • configs/dataset/tiled/ikem/1_224.yaml
    • Added configuration for IKEM dataset with 0.52 mpp, 224 tile extent, and level 1.
  • configs/dataset/tiled/ikem/2_224.yaml
    • Added configuration for IKEM dataset with 1.55 mpp, 224 tile extent, and level 2.
  • configs/dataset/tiled/knl_patos/0_320.yaml
    • Added configuration for KNL_PATOS dataset with 0.17 mpp, 320 tile extent, and level 0.
  • configs/dataset/tiled/knl_patos/0_430.yaml
    • Added configuration for KNL_PATOS dataset with 0.17 mpp, 430 tile extent, and level 0.
  • configs/dataset/tiled/knl_patos/1_224.yaml
    • Added configuration for KNL_PATOS dataset with 0.52 mpp, 224 tile extent, and level 1.
  • configs/dataset/tiled/knl_patos/2_224.yaml
    • Added configuration for KNL_PATOS dataset with 1.55 mpp, 224 tile extent, and level 2.
  • configs/preprocessing/tile_masks.yaml
    • Added a new configuration file for the tile masks preprocessing script, defining parameters like max_concurrent, artifact_path, and metadata.
  • preprocessing/tile_masks.py
    • Implemented the core Python script for generating tile masks, including functions for downloading slide tiles, processing individual slides in parallel using Ray, and writing BigTIFF masks.
  • scripts/preprocessing/tile_masks.py
    • Added a Kubernetes job submission script to run the tile masks preprocessing, specifying resource requirements and execution commands.
Activity
  • No specific review comments or activity have been recorded for this pull request yet.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a script for generating tile masks, along with associated configurations. My review focuses on improving configuration management, code clarity, and fixing a non-functional submission script. The main issues identified are placeholder values in configuration files, hardcoded values in the processing script, and a critical issue with a job submission script that appears to be an incomplete template. I have provided specific comments and suggestions to address these points.

Comment on lines +4 to +17
submit_job(
    job_name="ulcerative-colitis-tile-masks-...",
    username=...,
    public=False,
    cpu=64,
    memory="32Gi",
    script=[
        "git clone https://github.com/RationAI/ulcerative-colitis.git workdir",
        "cd workdir",
        "uv sync --frozen",
        "uv run --active -m preprocessing.tile_masks +dataset=tiled/.../...",
    ],
    storage=[storage.secure.Data],
)

critical

This script appears to be a template and is not runnable in its current state. It contains placeholder values such as ... for job_name and username, and an incomplete dataset path. These placeholders must be replaced with actual values before this can be used. Committing template files can lead to confusion and errors.
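
If the template is kept in the repository, one option is a guard that refuses to submit while placeholders remain. The following is only a sketch: find_placeholders is a hypothetical helper, not part of the job submission library, and it assumes placeholders are either the Ellipsis object or strings containing "...", as in the snippet above.

```python
from typing import Any


def find_placeholders(value: Any, path: str = "job") -> list[str]:
    """Return the paths of values that still look like template placeholders.

    A value counts as a placeholder if it is the Ellipsis object or a string
    containing "...". Dicts, lists and tuples are searched recursively.
    """
    if value is Ellipsis:
        return [path]
    if isinstance(value, str):
        return [path] if "..." in value else []
    if isinstance(value, dict):
        return [p for k, v in value.items() for p in find_placeholders(v, f"{path}.{k}")]
    if isinstance(value, (list, tuple)):
        return [p for i, v in enumerate(value) for p in find_placeholders(v, f"{path}[{i}]")]
    return []


# Arguments copied from the template in this PR, placeholders left as-is.
template_kwargs = {
    "job_name": "ulcerative-colitis-tile-masks-...",
    "username": ...,
    "cpu": 64,
    "memory": "32Gi",
    "script": [
        "cd workdir",
        "uv run --active -m preprocessing.tile_masks +dataset=tiled/.../...",
    ],
}

unfilled = find_placeholders(template_kwargs)
if unfilled:
    print("Refusing to submit; unfilled placeholders:", ", ".join(unfilled))
```

Calling such a check before submit_job would turn a silently broken submission into an explicit error message.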

Comment on lines +9 to +10
test_preliminary: "mlflow-artifacts:/86/7b9a446145b14965981bbac88e8e2c8b/artifacts/test preliminary - knl_patos" # TODO update URI
test_final: "mlflow-artifacts:/86/7b9a446145b14965981bbac88e8e2c8b/artifacts/test final - knl_patos" # TODO update URI

medium

The URIs in this configuration file contain TODO update URI comments, indicating they are placeholders. These should be updated with the correct values before merging to ensure the configuration is complete and functional.

Comment on lines +9 to +10
test_preliminary: "mlflow-artifacts:/86/6782155362d54ecc9f1beccb4362d359/artifacts/test preliminary - knl_patos" # TODO update URI
test_final: "mlflow-artifacts:/86/6782155362d54ecc9f1beccb4362d359/artifacts/test final - knl_patos" # TODO update URI
\ No newline at end of file

medium

The URIs in this configuration file contain TODO update URI comments, indicating they are placeholders. These should be updated with the correct values before merging to ensure the configuration is complete and functional. Additionally, the file is missing a final newline character, which is a standard convention for text files and can prevent issues with some tools.

Comment on lines +9 to +11
train: "mlflow-artifacts:/86/5814484b6cd7467e9d712889655479af/artifacts/train - ftn" # TODO update URI
test_preliminary: "mlflow-artifacts:/86/5814484b6cd7467e9d712889655479af/artifacts/test preliminary - ftn" # TODO update URI
test_final: "mlflow-artifacts:/86/5814484b6cd7467e9d712889655479af/artifacts/test final - ftn" # TODO update URI
\ No newline at end of file

medium

The URIs in this configuration file contain TODO update URI comments, indicating they are placeholders. These should be updated with the correct values before merging to ensure the configuration is complete and functional. Additionally, the file is missing a final newline character, which is a standard convention for text files and can prevent issues with some tools.

Comment on lines +9 to +11
train: "mlflow-artifacts:/86/4486e598446d412d926ac66dadb35e51/artifacts/train - ikem" # TODO update URI
test_preliminary: "mlflow-artifacts:/86/4486e598446d412d926ac66dadb35e51/artifacts/test preliminary - ikem" # TODO update URI
test_final: "mlflow-artifacts:/86/4486e598446d412d926ac66dadb35e51/artifacts/test final - ikem" # TODO update URI
\ No newline at end of file

medium

The URIs in this configuration file contain TODO update URI comments, indicating they are placeholders. These should be updated with the correct values before merging to ensure the configuration is complete and functional. Additionally, the file is missing a final newline character, which is a standard convention for text files and can prevent issues with some tools.

description: Tile masks for ${dataset.institution} at tiling level ${dataset.level}
hyperparams:
  mask_level: ${level}
  tiling_level: ${dataset.level}
\ No newline at end of file

medium

This file is missing a final newline character. It's a good practice to add one for consistency and to avoid potential issues with some file processing tools.
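
Since the same finding recurs across several config files in this PR, a small check could catch it before review. This is a minimal sketch using only the standard library; missing_final_newline and add_final_newline are hypothetical helper names, and the temporary file only imitates the config shown above.

```python
import tempfile
from pathlib import Path


def missing_final_newline(path: Path) -> bool:
    """Return True if the file is non-empty and does not end with a newline."""
    data = path.read_bytes()
    return bool(data) and not data.endswith(b"\n")


def add_final_newline(path: Path) -> bool:
    """Append a newline if one is missing; return True if the file was changed."""
    if not missing_final_newline(path):
        return False
    with path.open("ab") as handle:
        handle.write(b"\n")
    return True


# Demonstration on a temporary file shaped like the config under review.
with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "tile_masks.yaml"
    config.write_bytes(b"hyperparams:\n  mask_level: ${level}")
    was_missing = missing_final_newline(config)
    changed = add_final_newline(config)
    still_missing = missing_final_newline(config)

print(was_missing, changed, still_missing)
```

An editor setting or a pre-commit hook that enforces final newlines would achieve the same result without custom code.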

Comment on lines +20 to +27
slidess, tiless = [], []
for uri in uris:
    path = mlflow.artifacts.download_artifacts(artifact_uri=uri)
    slidess.append(pd.read_parquet(Path(path) / "slides.parquet"))
    tiless.append(pd.read_parquet(Path(path) / "tiles.parquet"))

slides = pd.concat(slidess).reset_index(drop=True)
tiles = pd.concat(tiless).reset_index(drop=True)

medium

The variable names slidess and tiless are unconventional. Using a _list suffix, like slides_list and tiles_list, would be more idiomatic and improve readability.

Suggested change
- slidess, tiless = [], []
- for uri in uris:
-     path = mlflow.artifacts.download_artifacts(artifact_uri=uri)
-     slidess.append(pd.read_parquet(Path(path) / "slides.parquet"))
-     tiless.append(pd.read_parquet(Path(path) / "tiles.parquet"))
- slides = pd.concat(slidess).reset_index(drop=True)
- tiles = pd.concat(tiless).reset_index(drop=True)
+ slides_list, tiles_list = [], []
+ for uri in uris:
+     path = mlflow.artifacts.download_artifacts(artifact_uri=uri)
+     slides_list.append(pd.read_parquet(Path(path) / "slides.parquet"))
+     tiles_list.append(pd.read_parquet(Path(path) / "tiles.parquet"))
+ slides = pd.concat(slides_list).reset_index(drop=True)
+ tiles = pd.concat(tiles_list).reset_index(drop=True)

return slides, tiles


@ray.remote(memory=4 * 1024**3)

medium

The memory for the Ray remote function is hardcoded to 4GB. This might not be optimal for all environments. Consider defining this as a constant at the top of the file (e.g., RAY_WORKER_MEMORY = 4 * 1024**3) to make it more visible and easier to change, or making it configurable via the Hydra config.
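
A minimal sketch of the constant-based variant, assuming the memory budget is expressed in GiB. The names GIB, DEFAULT_WORKER_MEMORY_GIB and worker_memory_bytes are hypothetical, as are process_slide and cfg.worker_memory_gib in the trailing comment; the actual function and config key names in this PR may differ.

```python
GIB = 1024**3

# Matches the 4 * 1024**3 value currently hardcoded in the decorator.
DEFAULT_WORKER_MEMORY_GIB = 4.0


def worker_memory_bytes(memory_gib: float = DEFAULT_WORKER_MEMORY_GIB) -> int:
    """Convert a per-task memory budget in GiB to the byte count Ray expects."""
    if memory_gib <= 0:
        raise ValueError(f"memory_gib must be positive, got {memory_gib}")
    return int(memory_gib * GIB)


print(worker_memory_bytes())     # default budget in bytes
print(worker_memory_bytes(0.5))  # a smaller budget, e.g. for local testing

# With Ray, the value could be applied at call time rather than in the
# decorator, which lets a config value override it (names are assumptions):
#   process_slide.options(memory=worker_memory_bytes(cfg.worker_memory_gib)).remote(...)
```

Setting the value through .options() at call time keeps the decorator free of environment-specific numbers.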

Comment on lines +44 to +45
blur_slide_tiles = slide_tiles[slide_tiles["blur"] > 0.25]
artifacts_slide_tiles = slide_tiles[slide_tiles["artifacts"] > 0.25]

medium

The values 0.25 for blur and artifacts thresholds are magic numbers. It's better to define them as named constants at the top of the file (e.g., BLUR_THRESHOLD = 0.25) or pass them as configuration parameters. This improves readability and makes it easier to change these values in the future.

Comment on lines +9 to +11
train: "mlflow-artifacts:/86/de450f835f0d4462a91b35f4a79a500f/artifacts/train - ftn" # TODO update URI
test_preliminary: "mlflow-artifacts:/86/de450f835f0d4462a91b35f4a79a500f/artifacts/test preliminary - ftn" # TODO update URI
test_final: "mlflow-artifacts:/86/de450f835f0d4462a91b35f4a79a500f/artifacts/test final - ftn" # TODO update URI
\ No newline at end of file

medium

The URIs in this configuration file contain TODO update URI comments, indicating they are placeholders. These should be updated with the correct values before merging to ensure the configuration is complete and functional. Additionally, the file is missing a final newline character, which is a standard convention for text files and can prevent issues with some tools.

Copilot AI left a comment

Pull request overview

Adds a new “tile masks” preprocessing stage that consumes MLflow tiling outputs (slides/tiles artifacts) and produces per-slide TIFF masks for different tile-quality subsets, wired into the existing Hydra + Ray preprocessing pipeline.

Changes:

  • Added preprocessing/tile_masks.py to download tiling artifacts, classify tiles (blur/artifacts/clean), and write mask TIFFs per slide via Ray workers.
  • Added a corresponding Hydra config configs/preprocessing/tile_masks.yaml.
  • Added tiled dataset configs for multiple institutions/levels/tile extents with tiling_uris pointing to MLflow artifacts, plus a kube job submission helper script.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

Summary per file:

| File | Description |
| --- | --- |
| scripts/preprocessing/tile_masks.py | Kube job submission script for running the tile-masks preprocessing stage. |
| preprocessing/tile_masks.py | Implements tile-mask generation from tiling artifacts using Ray + OpenSlide + pyvips. |
| configs/preprocessing/tile_masks.yaml | Default configuration for tile-masks run (concurrency, mask level, MLflow metadata). |
| configs/dataset/tiled/knl_patos/0_320.yaml | Tiled dataset config (KNL Patos, level 0, tile 320) with tiling MLflow URIs. |
| configs/dataset/tiled/knl_patos/0_430.yaml | Tiled dataset config (KNL Patos, level 0, tile 430) with tiling MLflow URIs. |
| configs/dataset/tiled/knl_patos/1_224.yaml | Tiled dataset config (KNL Patos, level 1, tile 224) with tiling MLflow URIs. |
| configs/dataset/tiled/knl_patos/2_224.yaml | Tiled dataset config (KNL Patos, level 2, tile 224) with tiling MLflow URIs. |
| configs/dataset/tiled/ikem/0_320.yaml | Tiled dataset config (IKEM, level 0, tile 320) with tiling MLflow URIs. |
| configs/dataset/tiled/ikem/0_430.yaml | Tiled dataset config (IKEM, level 0, tile 430) with tiling MLflow URIs. |
| configs/dataset/tiled/ikem/1_224.yaml | Tiled dataset config (IKEM, level 1, tile 224) with tiling MLflow URIs. |
| configs/dataset/tiled/ikem/2_224.yaml | Tiled dataset config (IKEM, level 2, tile 224) with tiling MLflow URIs. |
| configs/dataset/tiled/ftn/0_320.yaml | Tiled dataset config (FTN, level 0, tile 320) with tiling MLflow URIs. |
| configs/dataset/tiled/ftn/0_430.yaml | Tiled dataset config (FTN, level 0, tile 430) with tiling MLflow URIs. |
| configs/dataset/tiled/ftn/1_224.yaml | Tiled dataset config (FTN, level 1, tile 224) with tiling MLflow URIs. |
| configs/dataset/tiled/ftn/2_224.yaml | Tiled dataset config (FTN, level 2, tile 224) with tiling MLflow URIs. |

"uv sync --frozen",
"uv run --active -m preprocessing.tile_masks +dataset=tiled/.../...",
],
storage=[storage.secure.Data],

Copilot AI Feb 12, 2026

storage.secure.Data looks like a typo: the other preprocessing job scripts all use storage.secure.DATA. If Data doesn’t exist, submitting the job will raise an AttributeError; use the same storage.secure.DATA constant here.

Suggested change
- storage=[storage.secure.Data],
+ storage=[storage.secure.DATA],

Comment on lines +20 to +27
slidess, tiless = [], []
for uri in uris:
    path = mlflow.artifacts.download_artifacts(artifact_uri=uri)
    slidess.append(pd.read_parquet(Path(path) / "slides.parquet"))
    tiless.append(pd.read_parquet(Path(path) / "tiles.parquet"))

slides = pd.concat(slidess).reset_index(drop=True)
tiles = pd.concat(tiless).reset_index(drop=True)

Copilot AI Feb 12, 2026

The variable names slidess / tiless read like accidental duplicates and make the code harder to follow. Consider renaming them to something clearer (e.g., slides_parts / tiles_parts).

Suggested change
- slidess, tiless = [], []
- for uri in uris:
-     path = mlflow.artifacts.download_artifacts(artifact_uri=uri)
-     slidess.append(pd.read_parquet(Path(path) / "slides.parquet"))
-     tiless.append(pd.read_parquet(Path(path) / "tiles.parquet"))
- slides = pd.concat(slidess).reset_index(drop=True)
- tiles = pd.concat(tiless).reset_index(drop=True)
+ slides_parts, tiles_parts = [], []
+ for uri in uris:
+     path = mlflow.artifacts.download_artifacts(artifact_uri=uri)
+     slides_parts.append(pd.read_parquet(Path(path) / "slides.parquet"))
+     tiles_parts.append(pd.read_parquet(Path(path) / "tiles.parquet"))
+ slides = pd.concat(slides_parts).reset_index(drop=True)
+ tiles = pd.concat(tiles_parts).reset_index(drop=True)

Comment on lines +42 to +43
slide_tiles = tiles[tiles["slide_id"] == slide.id]

Copilot AI Feb 12, 2026

slide_tiles = tiles[tiles["slide_id"] == slide.id] performs a full scan of the entire tiles DataFrame for every slide. For large datasets this becomes O(#slides × #tiles) and can dominate runtime. Consider pre-indexing/grouping once (e.g., set tiles index to slide_id and use .loc[...], or build a dict[slide_id, tiles_subset] before launching Ray tasks) so each task only touches its own slide’s rows.
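
The pre-grouping idea could look roughly like the sketch below. The toy DataFrames stand in for the real slides and tiles tables; only the slide_id and id column names come from the snippet above, while the sample values and the tiles_by_slide and empty names are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the concatenated tiling artifacts.
tiles = pd.DataFrame(
    {
        "slide_id": [1, 1, 2, 3, 3, 3],
        "x": [0, 224, 0, 0, 224, 448],
        "blur": [0.1, 0.4, 0.0, 0.3, 0.2, 0.9],
    }
)
slides = pd.DataFrame({"id": [1, 2, 3, 4]})

# Group once: a single pass over tiles instead of one boolean scan per slide.
tiles_by_slide = {slide_id: group for slide_id, group in tiles.groupby("slide_id")}

# Slides without any tiles get an empty frame with the same columns.
empty = tiles.iloc[0:0]

for slide in slides.itertuples():
    slide_tiles = tiles_by_slide.get(slide.id, empty)
    print(slide.id, len(slide_tiles))
```

When each Ray task receives only its own subset, less data is serialized per task as well, though whether that matters in practice depends on the dataset size.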

Comment on lines +44 to +45
blur_slide_tiles = slide_tiles[slide_tiles["blur"] > 0.25]
artifacts_slide_tiles = slide_tiles[slide_tiles["artifacts"] > 0.25]

Copilot AI Feb 12, 2026

The blur/artifacts thresholds are hard-coded as 0.25. Given that other preprocessing steps expose thresholds via Hydra config (e.g., tissue_threshold in tiling), it would be more maintainable to make these configurable (e.g., blur_threshold, artifacts_threshold in configs/preprocessing/tile_masks.yaml) so they can be tuned without code changes.
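
One way to thread the thresholds through is sketched below. A plain dataclass stands in for the Hydra config so the example stays self-contained; TileMaskThresholds and split_tiles are hypothetical names, the field names follow the keys suggested above, and the defaults preserve the current 0.25 behaviour.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class TileMaskThresholds:
    """Stand-in for values that would come from configs/preprocessing/tile_masks.yaml."""

    blur_threshold: float = 0.25
    artifacts_threshold: float = 0.25


def split_tiles(
    slide_tiles: pd.DataFrame, thresholds: TileMaskThresholds
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return (blurred, with_artifacts) subsets using strict '>' comparisons."""
    blurred = slide_tiles[slide_tiles["blur"] > thresholds.blur_threshold]
    with_artifacts = slide_tiles[slide_tiles["artifacts"] > thresholds.artifacts_threshold]
    return blurred, with_artifacts


# Made-up scores for four tiles of one slide.
slide_tiles = pd.DataFrame(
    {"blur": [0.1, 0.3, 0.25, 0.9], "artifacts": [0.0, 0.5, 0.26, 0.1]}
)

default_blur, default_artifacts = split_tiles(slide_tiles, TileMaskThresholds())
strict_blur, _ = split_tiles(slide_tiles, TileMaskThresholds(blur_threshold=0.5))
print(len(default_blur), len(default_artifacts), len(strict_blur))
```

With Hydra, the same two fields could be read from the config object and overridden on the command line, so tuning a threshold would not require a code change.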
