Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @Adames4, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a robust system for generating tile masks, which are essential for quality control in digital pathology workflows. By creating masks that categorize tiles based on blur and artifact presence, it enables more refined analysis and filtering of image data. The implementation utilizes distributed computing with Ray and integrates with MLflow for artifact management, providing a scalable and traceable solution for preprocessing large-scale datasets. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a script for generating tile masks, along with associated configurations. My review focuses on improving configuration management, code clarity, and fixing a non-functional submission script. The main issues identified are placeholder values in configuration files, hardcoded values in the processing script, and a critical issue with a job submission script that appears to be an incomplete template. I have provided specific comments and suggestions to address these points.
| submit_job( | ||
| job_name="ulcerative-colitis-tile-masks-...", | ||
| username=..., | ||
| public=False, | ||
| cpu=64, | ||
| memory="32Gi", | ||
| script=[ | ||
| "git clone https://github.com/RationAI/ulcerative-colitis.git workdir", | ||
| "cd workdir", | ||
| "uv sync --frozen", | ||
| "uv run --active -m preprocessing.tile_masks +dataset=tiled/.../...", | ||
| ], | ||
| storage=[storage.secure.Data], | ||
| ) |
There was a problem hiding this comment.
This script appears to be a template and is not runnable in its current state. It contains placeholder values such as ... for job_name and username, and an incomplete dataset path. These placeholders must be replaced with actual values before this can be used. Committing template files can lead to confusion and errors.
| test_preliminary: "mlflow-artifacts:/86/7b9a446145b14965981bbac88e8e2c8b/artifacts/test preliminary - knl_patos" # TODO update URI | ||
| test_final: "mlflow-artifacts:/86/7b9a446145b14965981bbac88e8e2c8b/artifacts/test final - knl_patos" # TODO update URI |
| test_preliminary: "mlflow-artifacts:/86/6782155362d54ecc9f1beccb4362d359/artifacts/test preliminary - knl_patos" # TODO update URI | ||
| test_final: "mlflow-artifacts:/86/6782155362d54ecc9f1beccb4362d359/artifacts/test final - knl_patos" # TODO update URI No newline at end of file |
There was a problem hiding this comment.
The URIs in this configuration file contain TODO update URI comments, indicating they are placeholders. These should be updated with the correct values before merging to ensure the configuration is complete and functional. Additionally, the file is missing a final newline character, which is a standard convention for text files and can prevent issues with some tools.
| train: "mlflow-artifacts:/86/5814484b6cd7467e9d712889655479af/artifacts/train - ftn" # TODO update URI | ||
| test_preliminary: "mlflow-artifacts:/86/5814484b6cd7467e9d712889655479af/artifacts/test preliminary - ftn" # TODO update URI | ||
| test_final: "mlflow-artifacts:/86/5814484b6cd7467e9d712889655479af/artifacts/test final - ftn" # TODO update URI No newline at end of file |
There was a problem hiding this comment.
The URIs in this configuration file contain TODO update URI comments, indicating they are placeholders. These should be updated with the correct values before merging to ensure the configuration is complete and functional. Additionally, the file is missing a final newline character, which is a standard convention for text files and can prevent issues with some tools.
| train: "mlflow-artifacts:/86/4486e598446d412d926ac66dadb35e51/artifacts/train - ikem" # TODO update URI | ||
| test_preliminary: "mlflow-artifacts:/86/4486e598446d412d926ac66dadb35e51/artifacts/test preliminary - ikem" # TODO update URI | ||
| test_final: "mlflow-artifacts:/86/4486e598446d412d926ac66dadb35e51/artifacts/test final - ikem" # TODO update URI No newline at end of file |
There was a problem hiding this comment.
The URIs in this configuration file contain TODO update URI comments, indicating they are placeholders. These should be updated with the correct values before merging to ensure the configuration is complete and functional. Additionally, the file is missing a final newline character, which is a standard convention for text files and can prevent issues with some tools.
| description: Tile masks for ${dataset.institution} at tiling level ${dataset.level} | ||
| hyperparams: | ||
| mask_level: ${level} | ||
| tiling_level: ${dataset.level} No newline at end of file |
| slidess, tiless = [], [] | ||
| for uri in uris: | ||
| path = mlflow.artifacts.download_artifacts(artifact_uri=uri) | ||
| slidess.append(pd.read_parquet(Path(path) / "slides.parquet")) | ||
| tiless.append(pd.read_parquet(Path(path) / "tiles.parquet")) | ||
|
|
||
| slides = pd.concat(slidess).reset_index(drop=True) | ||
| tiles = pd.concat(tiless).reset_index(drop=True) |
There was a problem hiding this comment.
The variable names slidess and tiless are unconventional. Using a _list suffix, like slides_list and tiles_list, would be more idiomatic and improve readability.
| slidess, tiless = [], [] | |
| for uri in uris: | |
| path = mlflow.artifacts.download_artifacts(artifact_uri=uri) | |
| slidess.append(pd.read_parquet(Path(path) / "slides.parquet")) | |
| tiless.append(pd.read_parquet(Path(path) / "tiles.parquet")) | |
| slides = pd.concat(slidess).reset_index(drop=True) | |
| tiles = pd.concat(tiless).reset_index(drop=True) | |
| slides_list, tiles_list = [], [] | |
| for uri in uris: | |
| path = mlflow.artifacts.download_artifacts(artifact_uri=uri) | |
| slides_list.append(pd.read_parquet(Path(path) / "slides.parquet")) | |
| tiles_list.append(pd.read_parquet(Path(path) / "tiles.parquet")) | |
| slides = pd.concat(slides_list).reset_index(drop=True) | |
| tiles = pd.concat(tiles_list).reset_index(drop=True) |
| return slides, tiles | ||
|
|
||
|
|
||
| @ray.remote(memory=4 * 1024**3) |
There was a problem hiding this comment.
The memory for the Ray remote function is hardcoded to 4GB. This might not be optimal for all environments. Consider defining this as a constant at the top of the file (e.g., RAY_WORKER_MEMORY = 4 * 1024**3) to make it more visible and easier to change, or making it configurable via the Hydra config.
| blur_slide_tiles = slide_tiles[slide_tiles["blur"] > 0.25] | ||
| artifacts_slide_tiles = slide_tiles[slide_tiles["artifacts"] > 0.25] |
There was a problem hiding this comment.
| train: "mlflow-artifacts:/86/de450f835f0d4462a91b35f4a79a500f/artifacts/train - ftn" # TODO update URI | ||
| test_preliminary: "mlflow-artifacts:/86/de450f835f0d4462a91b35f4a79a500f/artifacts/test preliminary - ftn" # TODO update URI | ||
| test_final: "mlflow-artifacts:/86/de450f835f0d4462a91b35f4a79a500f/artifacts/test final - ftn" # TODO update URI No newline at end of file |
There was a problem hiding this comment.
The URIs in this configuration file contain TODO update URI comments, indicating they are placeholders. These should be updated with the correct values before merging to ensure the configuration is complete and functional. Additionally, the file is missing a final newline character, which is a standard convention for text files and can prevent issues with some tools.
There was a problem hiding this comment.
Pull request overview
Adds a new “tile masks” preprocessing stage that consumes MLflow tiling outputs (slides/tiles artifacts) and produces per-slide TIFF masks for different tile-quality subsets, wired into the existing Hydra + Ray preprocessing pipeline.
Changes:
- Added
preprocessing/tile_masks.pyto download tiling artifacts, classify tiles (blur/artifacts/clean), and write mask TIFFs per slide via Ray workers. - Added a corresponding Hydra config
configs/preprocessing/tile_masks.yaml. - Added tiled dataset configs for multiple institutions/levels/tile extents with
tiling_urispointing to MLflow artifacts, plus a kube job submission helper script.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/preprocessing/tile_masks.py | Kube job submission script for running the tile-masks preprocessing stage. |
| preprocessing/tile_masks.py | Implements tile-mask generation from tiling artifacts using Ray + OpenSlide + pyvips. |
| configs/preprocessing/tile_masks.yaml | Default configuration for tile-masks run (concurrency, mask level, MLflow metadata). |
| configs/dataset/tiled/knl_patos/0_320.yaml | Tiled dataset config (KNL Patos, level 0, tile 320) with tiling MLflow URIs. |
| configs/dataset/tiled/knl_patos/0_430.yaml | Tiled dataset config (KNL Patos, level 0, tile 430) with tiling MLflow URIs. |
| configs/dataset/tiled/knl_patos/1_224.yaml | Tiled dataset config (KNL Patos, level 1, tile 224) with tiling MLflow URIs. |
| configs/dataset/tiled/knl_patos/2_224.yaml | Tiled dataset config (KNL Patos, level 2, tile 224) with tiling MLflow URIs. |
| configs/dataset/tiled/ikem/0_320.yaml | Tiled dataset config (IKEM, level 0, tile 320) with tiling MLflow URIs. |
| configs/dataset/tiled/ikem/0_430.yaml | Tiled dataset config (IKEM, level 0, tile 430) with tiling MLflow URIs. |
| configs/dataset/tiled/ikem/1_224.yaml | Tiled dataset config (IKEM, level 1, tile 224) with tiling MLflow URIs. |
| configs/dataset/tiled/ikem/2_224.yaml | Tiled dataset config (IKEM, level 2, tile 224) with tiling MLflow URIs. |
| configs/dataset/tiled/ftn/0_320.yaml | Tiled dataset config (FTN, level 0, tile 320) with tiling MLflow URIs. |
| configs/dataset/tiled/ftn/0_430.yaml | Tiled dataset config (FTN, level 0, tile 430) with tiling MLflow URIs. |
| configs/dataset/tiled/ftn/1_224.yaml | Tiled dataset config (FTN, level 1, tile 224) with tiling MLflow URIs. |
| configs/dataset/tiled/ftn/2_224.yaml | Tiled dataset config (FTN, level 2, tile 224) with tiling MLflow URIs. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "uv sync --frozen", | ||
| "uv run --active -m preprocessing.tile_masks +dataset=tiled/.../...", | ||
| ], | ||
| storage=[storage.secure.Data], |
There was a problem hiding this comment.
storage.secure.Data looks like a typo/inconsistent with the other preprocessing job scripts (they all use storage.secure.DATA). If Data doesn’t exist, this will raise an AttributeError when submitting the job; use the same storage.secure.DATA constant here.
| storage=[storage.secure.Data], | |
| storage=[storage.secure.DATA], |
| slidess, tiless = [], [] | ||
| for uri in uris: | ||
| path = mlflow.artifacts.download_artifacts(artifact_uri=uri) | ||
| slidess.append(pd.read_parquet(Path(path) / "slides.parquet")) | ||
| tiless.append(pd.read_parquet(Path(path) / "tiles.parquet")) | ||
|
|
||
| slides = pd.concat(slidess).reset_index(drop=True) | ||
| tiles = pd.concat(tiless).reset_index(drop=True) |
There was a problem hiding this comment.
The variable names slidess / tiless read like accidental duplicates and make the code harder to follow. Consider renaming them to something clearer (e.g., slides_parts / tiles_parts).
| slidess, tiless = [], [] | |
| for uri in uris: | |
| path = mlflow.artifacts.download_artifacts(artifact_uri=uri) | |
| slidess.append(pd.read_parquet(Path(path) / "slides.parquet")) | |
| tiless.append(pd.read_parquet(Path(path) / "tiles.parquet")) | |
| slides = pd.concat(slidess).reset_index(drop=True) | |
| tiles = pd.concat(tiless).reset_index(drop=True) | |
| slides_parts, tiles_parts = [], [] | |
| for uri in uris: | |
| path = mlflow.artifacts.download_artifacts(artifact_uri=uri) | |
| slides_parts.append(pd.read_parquet(Path(path) / "slides.parquet")) | |
| tiles_parts.append(pd.read_parquet(Path(path) / "tiles.parquet")) | |
| slides = pd.concat(slides_parts).reset_index(drop=True) | |
| tiles = pd.concat(tiles_parts).reset_index(drop=True) |
| slide_tiles = tiles[tiles["slide_id"] == slide.id] | ||
|
|
There was a problem hiding this comment.
slide_tiles = tiles[tiles["slide_id"] == slide.id] performs a full scan of the entire tiles DataFrame for every slide. For large datasets this becomes O(#slides × #tiles) and can dominate runtime. Consider pre-indexing/grouping once (e.g., set tiles index to slide_id and use .loc[...], or build a dict[slide_id, tiles_subset] before launching Ray tasks) so each task only touches its own slide’s rows.
| blur_slide_tiles = slide_tiles[slide_tiles["blur"] > 0.25] | ||
| artifacts_slide_tiles = slide_tiles[slide_tiles["artifacts"] > 0.25] |
There was a problem hiding this comment.
The blur/artifacts thresholds are hard-coded as 0.25. Given that other preprocessing steps expose thresholds via Hydra config (e.g., tissue_threshold in tiling), it would be more maintainable to make these configurable (e.g., blur_threshold, artifacts_threshold in configs/preprocessing/tile_masks.yaml) so they can be tuned without code changes.
Tile masks script.
Closes IBD-20
Blocked by #5
Dependency graph: