Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @Adames4, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant new capability for processing whole slide images by implementing a robust tiling pipeline. It includes new scripts for performing quality control checks on slides and subsequently extracting tiles based on specified parameters and masks. The changes are supported by new configuration files that define dataset-specific mask URIs and various tiling strategies, enabling flexible and reproducible tile generation for downstream analysis in computational pathology. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces new scripts and configurations for tiling and quality control. The overall structure is good, leveraging ray for parallel processing and hydra for configuration management. However, there are several areas for improvement. Multiple configuration and script files contain placeholders (TODOs and ...) that render them non-functional and need to be addressed. Additionally, the tiling script contains some maintainability issues, such as a non-descriptive function name and hardcoded values that should be refactored into constants for better clarity and robustness. I've provided specific comments and suggestions to address these points.
| tissue_mask_uri: mlflow-artifacts:/86/04778b10de254572b69ce0a101c1eee4/artifacts/tissue_masks # TODO update URI | ||
| qc_mask_uri: mlflow-artifacts:/86/c8edfb2541e84b44b1a28be3540c1a35/artifacts # TODO update URI No newline at end of file |
There was a problem hiding this comment.
| tissue_mask_uri: mlflow-artifacts:/86/13359cdd5d1a47ddabc352b9aa0d7635/artifacts/tissue_masks # TODO update URI | ||
| qc_mask_uri: mlflow-artifacts:/86/98443fe2b67445d5a56598bff15b7f27/artifacts # TODO update URI No newline at end of file |
| tissue_mask_uri: mlflow-artifacts:/86/8ef6d6f0c9af4f35a087596960f675aa/artifacts/tissue_masks # TODO update URI | ||
| qc_mask_uri: mlflow-artifacts:/86/75fc3e53112f4634ae5238777d87e88c/artifacts # TODO update URI No newline at end of file |
There was a problem hiding this comment.
Pull request overview
Adds a preprocessing tiling stage (plus QC masks generation) to the ulcerative-colitis pipeline, wiring it into the existing Hydra/MLflow/Ray-based preprocessing structure and updating project dependencies accordingly.
Changes:
- Introduces
preprocessing/tiling.pyto build slide/tile datasets using tissue + QC masks and save results to MLflow. - Adds
preprocessing/quality_control.pyto generate QC masks/metrics viarationai.AsyncClient, organize outputs, and log artifacts to MLflow. - Extends configs (preprocessing + experiments + datasets-with-masks) and updates dependencies (
pyproject.toml,uv.lock) for tiling/QC tooling.
Reviewed changes
Copilot reviewed 23 out of 25 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| uv.lock | Locks new dependencies required by tiling/QC (incl. rationai-tiling, rationai-sdk, ratiopath, and geo/image stack). |
| pyproject.toml | Adds tiling/QC-related runtime deps and Git sources for rationai-tiling + rationai-sdk. |
| preprocessing/tissue_masks.py | Small variable rename (df → dataset) for clarity. |
| preprocessing/tiling.py | New tiling pipeline: dataset split, slide reading via Ray, mask overlap computation, MLflow dataset export. |
| preprocessing/quality_control.py | New QC pipeline: async slide checks, mask organization, metrics aggregation, MLflow artifact logging. |
| scripts/preprocessing/tiling.py | New kube-jobs submission script for tiling. |
| scripts/preprocessing/quality_control.py | New kube-jobs submission script for QC. |
| configs/preprocessing/tiling.yaml | Adds global tiling config schema (mpp/tile_extent/stride/splits/metadata). |
| configs/preprocessing/quality_control.yaml | Adds global QC config schema (output_dir, timeouts, qc_parameters, metadata). |
| configs/experiment/preprocessing/tiling/ftn_*.yaml | Adds FTN tiling experiment presets for multiple mpp/tile sizes. |
| configs/experiment/preprocessing/tiling/ikem_*.yaml | Adds IKEM tiling experiment presets for multiple mpp/tile sizes. |
| configs/experiment/preprocessing/tiling/knl_patos_*.yaml | Adds KNL Patos tiling experiment presets for multiple mpp/tile sizes. |
| configs/dataset/processed_w_masks/ftn.yaml | Adds dataset variant with tissue/QC mask MLflow URIs (currently marked TODO). |
| configs/dataset/processed_w_masks/ikem.yaml | Adds dataset variant with tissue/QC mask MLflow URIs (currently marked TODO). |
| configs/dataset/processed_w_masks/knl_patos.yaml | Adds dataset variant with tissue/QC mask MLflow URIs (currently marked TODO). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
preprocessing/tiling.py
Outdated
| def nancy(row: dict[str, Any], df: pd.DataFrame) -> dict[str, Any]: | ||
| row["nancy_index"] = df.loc[Path(row["path"]).stem, "nancy"] | ||
| return row | ||
|
|
||
|
|
There was a problem hiding this comment.
Function name nancy() is not self-describing (it mutates the row to add nancy_index). Consider renaming to something like add_nancy_index/attach_nancy_label to make the Ray pipeline easier to follow and grep.
| def nancy(row: dict[str, Any], df: pd.DataFrame) -> dict[str, Any]: | |
| row["nancy_index"] = df.loc[Path(row["path"]).stem, "nancy"] | |
| return row | |
| def add_nancy_index(row: dict[str, Any], nancy_index_df: pd.DataFrame) -> dict[str, Any]: | |
| """ | |
| Attach the 'nancy_index' value from `nancy_index_df` to the given `row`. | |
| The row is expected to contain a 'path' key whose stem matches an index | |
| entry in `nancy_index_df`, which must have a 'nancy' column. | |
| """ | |
| row["nancy_index"] = nancy_index_df.loc[Path(row["path"]).stem, "nancy"] | |
| return row | |
| def nancy(row: dict[str, Any], df: pd.DataFrame) -> dict[str, Any]: | |
| """Backward-compatible wrapper; prefer :func:`add_nancy_index`.""" | |
| return add_nancy_index(row, df) |
| "git clone https://gitlab.ics.muni.cz/rationai/digital-pathology/pathology/ulcerative-colitis.git workdir", | ||
| "cd workdir", | ||
| "uv sync --frozen", | ||
| "uv run -m preprocessing.quality_control +dataset=processed/...", | ||
| ], |
There was a problem hiding this comment.
This job script clones ulcerative-colitis from an internal GitLab URL, while the existing scripts in scripts/preprocessing/ clone from GitHub. If that’s unintentional, align the clone URL across scripts (or parameterize it) so job submission is reproducible for all users/environments.
| tissue_mask_uri: mlflow-artifacts:/86/8ef6d6f0c9af4f35a087596960f675aa/artifacts/tissue_masks # TODO update URI | ||
| qc_mask_uri: mlflow-artifacts:/86/75fc3e53112f4634ae5238777d87e88c/artifacts # TODO update URI No newline at end of file |
There was a problem hiding this comment.
This dataset config hard-codes MLflow artifact URIs and leaves a # TODO update URI note. If these URIs are not final/stable, the tiling pipeline will fail or use incorrect artifacts. Please either update them to the final run URIs, or switch to a placeholder/override-based approach so the committed default config is usable.
| tissue_mask_uri: mlflow-artifacts:/86/8ef6d6f0c9af4f35a087596960f675aa/artifacts/tissue_masks # TODO update URI | |
| qc_mask_uri: mlflow-artifacts:/86/75fc3e53112f4634ae5238777d87e88c/artifacts # TODO update URI | |
| tissue_mask_uri: ${oc.env:TISSUE_MASK_URI,} | |
| qc_mask_uri: ${oc.env:QC_MASK_URI,} |
| tissue_mask_uri: mlflow-artifacts:/86/04778b10de254572b69ce0a101c1eee4/artifacts/tissue_masks # TODO update URI | ||
| qc_mask_uri: mlflow-artifacts:/86/c8edfb2541e84b44b1a28be3540c1a35/artifacts # TODO update URI No newline at end of file |
There was a problem hiding this comment.
This dataset config hard-codes MLflow artifact URIs and leaves a # TODO update URI note. If these URIs are not final/stable, the tiling pipeline will fail or use incorrect artifacts. Please either update them to the final run URIs, or switch to a placeholder/override-based approach so the committed default config is usable.
| tissue_mask_uri: mlflow-artifacts:/86/04778b10de254572b69ce0a101c1eee4/artifacts/tissue_masks # TODO update URI | |
| qc_mask_uri: mlflow-artifacts:/86/c8edfb2541e84b44b1a28be3540c1a35/artifacts # TODO update URI | |
| tissue_mask_uri: OVERRIDE_ME_TISSUE_MASK_URI | |
| qc_mask_uri: OVERRIDE_ME_QC_MASK_URI |
| import ray | ||
| from omegaconf import DictConfig | ||
| from rationai.mlkit import with_cli_args | ||
| from rationai.mlkit.autolog import autolog |
There was a problem hiding this comment.
Import style for autolog is inconsistent with the other preprocessing modules (e.g., preprocessing/tissue_masks.py and preprocessing/create_dataset.py import autolog from rationai.mlkit). To keep imports consistent and avoid depending on internal module paths, consider importing autolog from the same public entrypoint here as well.
| from rationai.mlkit.autolog import autolog | |
| from rationai.mlkit import autolog |
| assert isclose( | ||
| splits["train"] + splits["test_preliminary"] + splits["test_final"], 1.0 | ||
| ), "Splits must sum to 1.0" |
There was a problem hiding this comment.
split_dataset() uses an assert to validate that split fractions sum to 1.0. Assertions can be stripped with python -O, which would skip this validation and allow bad configs to proceed; prefer raising a ValueError (or hydra/omegaconf validation) with a clear message instead.
| assert isclose( | |
| splits["train"] + splits["test_preliminary"] + splits["test_final"], 1.0 | |
| ), "Splits must sum to 1.0" | |
| total_split = ( | |
| splits["train"] + splits["test_preliminary"] + splits["test_final"] | |
| ) | |
| if not isclose(total_split, 1.0): | |
| raise ValueError( | |
| f"Splits must sum to 1.0, but got {total_split!r} " | |
| f"(train={splits['train']!r}, " | |
| f"test_preliminary={splits['test_preliminary']!r}, " | |
| f"test_final={splits['test_final']!r})" | |
| ) |
| preliminary_size = splits["test_preliminary"] / (1.0 - splits["train"]) | ||
| test_preliminary, test_final = train_test_split_groups( | ||
| test, | ||
| train_size=preliminary_size, | ||
| groups=test["case_id"], |
There was a problem hiding this comment.
preliminary_size = splits["test_preliminary"] / (1.0 - splits["train"]) will raise ZeroDivisionError when train == 1.0 and test_preliminary > 0.0 (even if splits still sum to 1). Add explicit validation for split ranges/relationships (e.g., require train < 1 when preliminary/final splits are non-zero) or handle the train==1.0 case by returning empty test splits.
scripts/preprocessing/tiling.py
Outdated
| memory="128Gi", | ||
| shm="48Gi", | ||
| script=[ | ||
| "git clone https://gitlab.ics.muni.cz/rationai/digital-pathology/pathology/ulcerative-colitis.git workdir", |
There was a problem hiding this comment.
This job script clones ulcerative-colitis from an internal GitLab URL, while the other preprocessing job scripts in this repo clone from https://github.com/RationAI/ulcerative-colitis.git. Unless the repo has moved, this inconsistency will make the tiling job fail for users without GitLab access; align the clone URL/host with the rest of the scripts (or parameterize it).
| "git clone https://gitlab.ics.muni.cz/rationai/digital-pathology/pathology/ulcerative-colitis.git workdir", | |
| "git clone https://github.com/RationAI/ulcerative-colitis.git workdir", |
Tiling script.
Closes IBD-19
Blocked by #3 , #4
[IGNORE QC FILES]
Dependency graph: