Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @Adames4, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a new feature for generating embeddings from tiled whole slide images, which is crucial for downstream machine learning tasks. It includes a significant refactoring of the project's directory structure, the addition of numerous configuration files to support various datasets and tile encoders, and the implementation of a dedicated script for the embeddings generation process. These changes aim to establish a robust and flexible preprocessing pipeline for creating image embeddings using state-of-the-art foundation models. Highlights
Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a new preprocessing step for generating tile embeddings using various foundation models. It includes the main script, model wrappers, dataset configurations, and a job submission script. The overall structure is good, but there are several critical issues that need to be addressed before merging. These include placeholder values in configuration and job scripts, hardcoded secrets, missing type definitions that will lead to runtime errors, and suboptimal error handling that could mask failures in automated pipelines.
| train: "mlflow-artifacts:/86/bbbe4603bc30495d85ac99093fc9269a/artifacts/train - ftn" # TODO update URI | ||
| test_preliminary: "mlflow-artifacts:/86/bbbe4603bc30495d85ac99093fc9269a/artifacts/test preliminary - ftn" # TODO update URI | ||
| test_final: "mlflow-artifacts:/86/bbbe4603bc30495d85ac99093fc9269a/artifacts/test final - ftn" # TODO update URI No newline at end of file |
There was a problem hiding this comment.
These MLFlow artifact URIs appear to be placeholders and are marked with # TODO update URI. Committing placeholder values can lead to runtime errors or incorrect data being used if not updated. Please replace them with the final URIs or use a more dynamic configuration approach to avoid hardcoding them. This comment applies to all similar new dataset configuration files in this pull request.
| print(f"Embeddings for slide {slide_name} already exist, skipping...") | ||
| continue | ||
|
|
||
| try: |
There was a problem hiding this comment.
Catching a broad Exception can hide bugs and make debugging difficult. More importantly, the script continues after printing the error, and will exit with a success status code even if slides fail to process. This can be misleading in automated pipelines. Consider tracking failures and exiting with a non-zero status code if any errors occurred. Also, use a proper logger to log the full traceback for easier debugging.
| embeddings_path = (dest / slide_name).with_suffix(".parquet") | ||
|
|
||
| if embeddings_path.exists(): | ||
| print(f"Embeddings for slide {slide_name} already exist, skipping...") |
There was a problem hiding this comment.
Using print() for logging makes it difficult to control log levels and redirect output. It's better practice to use the standard logging module. This would allow for more flexible configuration, such as writing to a file, setting verbosity, and formatting messages. This applies to other print statements in this file as well (e.g., line 230).
There was a problem hiding this comment.
Pull request overview
Adds an embeddings preprocessing stage to the pipeline, including tile datasets/typing utilities, Hydra configs for multiple foundation-model tile encoders, and the required ML dependencies (timm + HF hub stack).
Changes:
- Add
preprocessing/embeddings.pyto compute and store tile embeddings (Parquet) using timm foundation models pulled from Hugging Face. - Introduce
ml/data/typing scaffolding (tile datasets + label modes) and update entrypoints/imports away from the removedproject_name/*placeholders. - Add Hydra configs for embeddings runs across institutions/levels and add
timmto dependencies (lockfile updates included).
Reviewed changes
Copilot reviewed 49 out of 54 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| uv.lock | Locks new dependencies for embeddings models (HF hub stack, timm, torchvision, safetensors, etc.). |
| pyproject.toml | Adds timm>=1.0.24 dependency. |
| scripts/preprocessing/embeddings.py | Adds kube job submission script for running embeddings preprocessing. |
| preprocessing/embeddings.py | Implements embeddings extraction pipeline + foundation model wrappers (timm/HF). |
| ml/typing.py | Adds typed metadata + sample/type aliases for tiles/embeddings data. |
| ml/project_name_model.py | Updates typing import to ml.typing. |
| ml/main.py | Switches imports to ml.data / ml.project_name_model. |
| ml/data/data_module.py | Adjusts dataloader return typing (removes old project_name.typing dependency). |
| ml/data/init.py | Exposes DataModule from ml.data. |
| ml/data/README.md | Adds suggested project structure documentation for ml/data. |
| ml/data/datasets/init.py | Exposes Tiles and TilesPredict datasets. |
| ml/data/datasets/tiles.py | Adds tile datasets built on MetaTiledSlides / OpenSlideTilesDataset. |
| ml/data/datasets/labels.py | Adds label modes + slide processing/label extraction utilities. |
| ml/modeling/README.md | Adds suggested project structure documentation for ml/modeling. |
| project_name/typing.py | Removes placeholder typing aliases from old template package. |
| project_name/data/init.py | Removes old template re-export for DataModule. |
| configs/preprocessing/embeddings.yaml | Adds embeddings preprocessing config (output path, dataloader, mlflow metadata). |
| configs/preprocessing/tile_encoder/prov_gigapath.yaml | Adds Hydra tile-encoder config for ProvGigaPath. |
| configs/preprocessing/tile_encoder/virchow.yaml | Adds Hydra tile-encoder config for Virchow. |
| configs/preprocessing/tile_encoder/virchow2.yaml | Adds Hydra tile-encoder config for Virchow2. |
| configs/preprocessing/tile_encoder/uni.yaml | Adds Hydra tile-encoder config for UNI. |
| configs/preprocessing/tile_encoder/uni2.yaml | Adds Hydra tile-encoder config for UNI2. |
| configs/experiment/preprocessing/embeddings/knl_patos_1_prov_gigapath.yaml | Adds experiment preset for knl_patos L1 + ProvGigaPath embeddings. |
| configs/experiment/preprocessing/embeddings/knl_patos_1_uni.yaml | Adds experiment preset for knl_patos L1 + UNI embeddings. |
| configs/experiment/preprocessing/embeddings/knl_patos_1_uni2.yaml | Adds experiment preset for knl_patos L1 + UNI2 embeddings. |
| configs/experiment/preprocessing/embeddings/knl_patos_1_virchow.yaml | Adds experiment preset for knl_patos L1 + Virchow embeddings. |
| configs/experiment/preprocessing/embeddings/knl_patos_1_virchow2.yaml | Adds experiment preset for knl_patos L1 + Virchow2 embeddings. |
| configs/experiment/preprocessing/embeddings/knl_patos_2_virchow2.yaml | Adds experiment preset for knl_patos L2 + Virchow2 embeddings. |
| configs/experiment/preprocessing/embeddings/ikem_1_prov_gigapath.yaml | Adds experiment preset for ikem L1 + ProvGigaPath embeddings. |
| configs/experiment/preprocessing/embeddings/ikem_1_uni.yaml | Adds experiment preset for ikem L1 + UNI embeddings. |
| configs/experiment/preprocessing/embeddings/ikem_1_uni2.yaml | Adds experiment preset for ikem L1 + UNI2 embeddings. |
| configs/experiment/preprocessing/embeddings/ikem_1_virchow.yaml | Adds experiment preset for ikem L1 + Virchow embeddings. |
| configs/experiment/preprocessing/embeddings/ikem_1_virchow2.yaml | Adds experiment preset for ikem L1 + Virchow2 embeddings. |
| configs/experiment/preprocessing/embeddings/ikem_2_virchow2.yaml | Adds experiment preset for ikem L2 + Virchow2 embeddings. |
| configs/experiment/preprocessing/embeddings/ftn_1_prov_gigapath.yaml | Adds experiment preset for ftn L1 + ProvGigaPath embeddings. |
| configs/experiment/preprocessing/embeddings/ftn_1_uni.yaml | Adds experiment preset for ftn L1 + UNI embeddings. |
| configs/experiment/preprocessing/embeddings/ftn_1_uni2.yaml | Adds experiment preset for ftn L1 + UNI2 embeddings. |
| configs/experiment/preprocessing/embeddings/ftn_1_virchow.yaml | Adds experiment preset for ftn L1 + Virchow embeddings. |
| configs/experiment/preprocessing/embeddings/ftn_1_virchow2.yaml | Adds experiment preset for ftn L1 + Virchow2 embeddings. |
| configs/experiment/preprocessing/embeddings/ftn_2_virchow2.yaml | Adds experiment preset for ftn L2 + Virchow2 embeddings. |
| configs/dataset/tiled/knl_patos/0_320.yaml | Adds tiled dataset config for knl_patos L0 @ 320px. |
| configs/dataset/tiled/knl_patos/0_430.yaml | Adds tiled dataset config for knl_patos L0 @ 430px. |
| configs/dataset/tiled/knl_patos/1_224.yaml | Adds tiled dataset config for knl_patos L1 @ 224px. |
| configs/dataset/tiled/knl_patos/2_224.yaml | Adds tiled dataset config for knl_patos L2 @ 224px. |
| configs/dataset/tiled/ikem/0_320.yaml | Adds tiled dataset config for ikem L0 @ 320px. |
| configs/dataset/tiled/ikem/0_430.yaml | Adds tiled dataset config for ikem L0 @ 430px. |
| configs/dataset/tiled/ikem/1_224.yaml | Adds tiled dataset config for ikem L1 @ 224px. |
| configs/dataset/tiled/ikem/2_224.yaml | Adds tiled dataset config for ikem L2 @ 224px. |
| configs/dataset/tiled/ftn/0_320.yaml | Adds tiled dataset config for ftn L0 @ 320px. |
| configs/dataset/tiled/ftn/0_430.yaml | Adds tiled dataset config for ftn L0 @ 430px. |
| configs/dataset/tiled/ftn/1_224.yaml | Adds tiled dataset config for ftn L1 @ 224px. |
| configs/dataset/tiled/ftn/2_224.yaml | Adds tiled dataset config for ftn L2 @ 224px. |
Comments suppressed due to low confidence (1)
ml/project_name_model.py:6
ml.typingdoes not defineInputorOutputs, so this import will fail at runtime (andProjectNameModel.forwardtype hints won’t resolve). Either addInput/Outputstype aliases toml/typing.py(matching what the model expects) or update this import and the annotations to use the existing aliases (e.g.Outputor more specific types).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| except Exception as e: | ||
| print(f"Error processing slide {slide_name}: {e}") |
There was a problem hiding this comment.
Catching a broad Exception and only printing str(e) makes failures hard to diagnose and doesn’t record stack traces in MLflow. Prefer logging the full exception (e.g., via the logger or logging.exception) and consider failing the run (or collecting failures) if embeddings are incomplete.
Embeddings script.
Closes IBD-21
Blocked by #5
Dependency graph: