Conversation
📝 Walkthrough

Adds tissue-mask preprocessing: new configs and scripts, a new preprocessing module using Ray/OpenSlide/VIPS with MLflow artifact handling, a mypy exclusion update, an added dependency, and a CLI flag rename in a create_dataset script.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Hydra as HydraConfig
    participant Main as tissue_masks.main
    participant Ray as Ray
    participant OpenSlide as OpenSlide
    participant VIPS as libvips
    participant MLStorage as MLflowArtifacts
    participant MLFlow as MLFlowLogger
    User->>Hydra: launch with tissue_masks config
    Hydra->>Main: provide config & logger
    Main->>MLStorage: download dataset CSV
    MLStorage-->>Main: dataset list
    Main->>Ray: submit parallel process_slide tasks
    Ray->>OpenSlide: open slide, read resolution
    OpenSlide-->>Ray: provide mpp/resolution
    Ray->>VIPS: build level image & compute mask
    VIPS-->>Ray: mask image bytes
    Ray->>MLStorage: write TIFF and upload artifact
    Ray-->>Main: task complete
    Main->>MLFlow: log artifact path
    MLFlow-->>User: artifacts recorded
```
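The "open slide, read resolution" step in the diagram typically maps a target resolution (microns per pixel, mpp) onto a pyramid level before the mask is computed. As a rough, hypothetical sketch of that mapping (the function name and the downsample model are assumptions, not the PR's actual code):

```python
def pick_level(base_mpp: float, downsamples: list[float], target_mpp: float) -> int:
    """Pick the deepest pyramid level whose resolution is still at least
    as fine as the requested target (mpp grows with the downsample)."""
    best = 0
    for level, downsample in enumerate(downsamples):
        if base_mpp * downsample <= target_mpp:
            best = level
    return best

# With a base resolution of 0.25 mpp and OpenSlide-style downsamples,
# a 1.0 mpp target lands on level 2 (0.25 * 4.0 = 1.0).
print(pick_level(0.25, [1.0, 2.0, 4.0, 8.0], 1.0))  # → 2
```

In OpenSlide terms, `base_mpp` would come from the slide properties and `downsamples` from the slide's `level_downsamples`; real code would also need to handle slides that lack mpp metadata.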
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
No actionable comments were generated in the recent review.
Summary of Changes

Hello @Adames4, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces the capability to generate tissue masks from whole slide images, a foundational step for digital pathology workflows. It establishes a new processing script that utilizes parallel computing for efficiency, integrates necessary configuration settings for various datasets and the mask generation process itself, and incorporates a specialized library for mask operations. Furthermore, the changes include a Kubernetes job definition, ensuring that this new functionality can be deployed and scaled effectively within the existing infrastructure, thereby enhancing the overall preprocessing pipeline.
Code Review
This pull request introduces a script for generating tissue masks, including the core processing logic, configuration files, and a job submission script. The changes are well-structured. My review includes several suggestions to improve the robustness and maintainability of the new scripts, such as addressing TODO comments in configuration, improving the parallel processing setup, and replacing placeholder values in the job submission script.
Pull request overview
This pull request adds tissue mask generation functionality as part of the preprocessing pipeline for the ulcerative colitis project. It depends on PR #2 (dataset creation) and introduces a Ray-based distributed processing system to generate tissue masks from whole slide images at a specified pyramid level.
Changes:
- Added `rationai-masks` library dependency for tissue mask generation
- Created preprocessing script to download datasets from MLflow and generate tissue masks in parallel using Ray
- Added configuration files for tissue mask generation and processed dataset URIs for three institutions (IKEM, FTN, KNL PATOS)
Reviewed changes
Copilot reviewed 7 out of 9 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| `pyproject.toml` | Added `rationai-masks` dependency for tissue mask generation |
| `uv.lock` | Updated lock file with `rationai-masks` and its transitive dependencies |
| `preprocessing/tissue_masks.py` | Main processing script that downloads datasets and generates tissue masks using Ray for parallel processing |
| `scripts/preprocessing/tissue_masks.py` | Kubernetes job submission script for running tissue mask generation |
| `configs/preprocessing/tissue_masks.yaml` | Configuration for tissue mask generation including level and concurrency settings |
| `configs/dataset/processed/ikem.yaml` | Configuration with MLflow URI for IKEM processed dataset |
| `configs/dataset/processed/ftn.yaml` | Configuration with MLflow URI for FTN processed dataset |
| `configs/dataset/processed/knl_patos.yaml` | Configuration with MLflow URI for KNL PATOS processed dataset |
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@configs/dataset/processed/ftn.yaml`:
- Around line 1-5: The yaml currently contains a placeholder MLflow artifact URI
under the uri key (the line starting with "uri: mlflow-artifacts:/86/... # TODO
update URI"); replace that placeholder with the final, correct MLflow artifact
URI for this dataset (or delete the uri entry/remove this ftn config until the
proper run/artifact is available) so dataset resolution won't fail when this
config is used.
In `@configs/dataset/processed/ikem.yaml`:
- Around line 1-5: The YAML currently contains a placeholder MLflow artifact URI
in the uri key
(mlflow-artifacts:/86/52d0081d60ba4585a16a3bd341d5ab09/artifacts/dataset.csv)
which must be replaced before merging; update the uri value to the final MLflow
artifact URI for the correct run (or remove the uri entry / this config until
the final artifact is available) so dataset resolution using this config
succeeds.
In `@configs/dataset/processed/knl_patos.yaml`:
- Around line 1-5: The config currently contains a placeholder MLflow artifact
URI in the uri field (uri:
mlflow-artifacts:/86/afbcfd43cb3c4fd0b1e9b5dbe7327d91/artifacts/dataset.csv)
which must be replaced or removed; update the uri value to the final, correct
MLflow artifact URI for the dataset (or delete the uri entry / the entire
processed config until the correct run/URI is available) so that dataset
resolution using the defaults entries (the /dataset/raw/knl_patos@_here_ and
_self_ defaults) will point to the correct artifact.
In `@preprocessing/tissue_masks.py`:
- Around line 46-57: The code creates a TemporaryDirectory on the driver and
then runs process_items which may schedule process_slide on remote Ray workers,
causing missing writes because workers can't access the driver's temp path;
replace the driver-local TemporaryDirectory with a shared/configurable output
path (use config.shared_output_dir or add one to config), pass that Path into
process_items via fn_kwargs (preserving "output_path"), and ensure
logger.log_artifacts points to the same shared path (config.artifact_path) or
validate/raise if running in single-node mode (config.max_concurrent implies
multi-node); update any documentation/comments to state that process_slide
requires a shared filesystem when using remote workers.
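The fix described above (replacing the driver-local temp directory with a shared path threaded through `fn_kwargs`) could look roughly like the sketch below. Ray is replaced here by a plain serial loop, and `process_items`/`process_slide` are hypothetical names taken from the comment, so treat this purely as an illustration of the data flow, not the script's real implementation:

```python
import tempfile
from pathlib import Path

def process_slide(slide_path: str, output_path: Path) -> Path:
    # In the real script this would compute and write the tissue mask;
    # here we only demonstrate that every task writes under the same
    # shared root instead of a driver-local TemporaryDirectory.
    out = output_path / (Path(slide_path).stem + ".tiff")
    out.write_bytes(b"stub-mask")
    return out

def process_items(items, fn, fn_kwargs):
    # Stand-in for the Ray-backed helper: fn_kwargs (including the shared
    # output_path) is forwarded unchanged to every task.
    return [fn(item, **fn_kwargs) for item in items]

shared_root = Path(tempfile.mkdtemp())  # stand-in for a cluster-visible mount
written = process_items(
    ["a/slide1.mrxs", "b/slide2.mrxs"],
    process_slide,
    fn_kwargs={"output_path": shared_root},
)
print([p.name for p in written])  # → ['slide1.tiff', 'slide2.tiff']
```

With real Ray workers, `shared_root` would have to be a mount visible to every node (e.g. a PVC or NFS path from config), which is exactly why a driver-local temp directory fails.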
In `@scripts/preprocessing/tissue_masks.py`:
- Around line 1-16: The submit_job call currently uses Ellipsis placeholders
(username=... and dataset path in script) which will fail at runtime; update the
module to accept real inputs by parsing CLI args or environment variables and
pass them into submit_job (replace the Ellipsis in username and the placeholder
in the "uv run ... +dataset=..." script entry). Locate the submit_job invocation
and the script list (symbols: submit_job, job_name, username, script) and wire
values from argparse (or os.environ) into those parameters, validating inputs
before calling submit_job and ensuring the final script string includes the
concrete dataset path.
🧹 Nitpick comments (1)
pyproject.toml (1)
16-18: Pin the `rationai-masks` git source to a tag/commit for reproducible builds.

With only a git URL, dependency resolution can drift as the repo changes. Consider pinning `rev`/`tag` in the UV source (and keep it aligned with your dependency spec) to make builds deterministic.

♻️ Example pinning with a specific commit/tag

```diff
 [tool.uv.sources]
-rationai-masks = { git = "https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/masks.git" }
+rationai-masks = { git = "https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/masks.git", rev = "<commit-or-tag>" }
```

Also applies to: 27-30
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@.mypy.ini`:
- Line 6: Remove "project_name" from the mypy exclusion pattern so the main
source package is type-checked; update the exclude setting from `exclude =
^(scripts|project_name)/` to only exclude scripts (e.g., `exclude =
^(scripts)/`) so files under the project package (like __init__.py, __main__.py,
project_name_model.py and subpackages) are included in mypy's strict checks.
In `@preprocessing/tissue_masks.py`:
- Line 29: The current mask_path assignment uses
Path(slide_path).with_suffix(".tiff").name which drops directory context and can
cause filename collisions (e.g., slide.tiff from different folders); update the
mask_path construction to append a unique identifier to the filename (for
example, a short hash of slide_path or include the dataframe index) so each
output is distinct—locate where mask_path is set (variable mask_path using
output_path and slide_path) and change the filename generation to include the
chosen unique token before the .tiff suffix to prevent silent overwrites.
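A minimal way to implement the suggested unique token is a short digest of the full slide path inserted before the `.tiff` suffix. The function name and hash length below are illustrative, not the script's actual code:

```python
import hashlib
from pathlib import Path

def unique_mask_name(slide_path: str) -> str:
    """Append a short digest of the full path so slides with identical
    filenames in different folders no longer collide."""
    digest = hashlib.sha1(slide_path.encode()).hexdigest()[:8]
    return f"{Path(slide_path).stem}_{digest}.tiff"

# Same filename, different folders -> distinct mask names.
a = unique_mask_name("cohort_a/slide.mrxs")
b = unique_mask_name("cohort_b/slide.mrxs")
print(a != b)  # → True
```

The digest is deterministic, so re-running the job maps each slide to the same mask file; using the dataframe index instead would be shorter but ties the filename to row ordering.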
🧹 Nitpick comments (2)
pyproject.toml (1)
17-17: Consider pinning `rationai-masks` to a specific tag or revision.

Like `rationai-mlkit`, this dependency has no version constraint and resolves from a git source with no `rev`, `tag`, or `branch` specifier. This means builds may break or behave differently if the upstream `masks` repo introduces breaking changes. Consider pinning to a specific commit or tag for reproducible builds.

This is consistent with the existing `rationai-mlkit` entry, so if you intentionally track `HEAD` for all internal libs, this is fine — just be aware of the trade-off.

Also applies to: 29-29
preprocessing/tissue_masks.py (1)
54-57: Consider `try`/`finally` to guarantee `ray.shutdown()`.

Python's `atexit` handlers cover most exit paths, but wrapping in `try`/`finally` is more explicit and also ensures shutdown on `KeyboardInterrupt` (which bypasses `atexit`).

♻️ Suggested refactor

```diff
 if __name__ == "__main__":
     ray.init(runtime_env={"excludes": [".git", ".venv"]})
-    main()
-    ray.shutdown()
+    try:
+        main()
+    finally:
+        ray.shutdown()
```
Tissue mask generation script.
Closes IBD-17
Blocked by #2
Dependency graph:
Summary by CodeRabbit

- New Features
- Chores
- Refactor
- Scripts