
feat: embeddings#8

Open
Adames4 wants to merge 25 commits into feature/tiling from feature/embeddings

Conversation

Collaborator

@Adames4 Adames4 commented Feb 12, 2026

Embeddings script.

Closes IBD-21

Blocked by #5

Dependency graph:

                         +--------------+
                  -------| tissue-masks |<------+           +------------+      +----------------------+
                 /       +--------------+       |       +---| tile-masks |<-----| preprocessing-report |
                /                               |       |   +------------+      +----------------------+
+---------+    /                            +--------+  |
| dataset | <-+                             | tiling |<-+
+---------+    \                            +--------+  |
                \                               |       |   +------------+
                 \       +-----------------+    |       +---| embeddings |
                  -------| quality-control |<---+           +------------+
                         +-----------------+
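
The diagram can also be written down as data. The following is a minimal sketch, assuming each arrow points from a stage to the stage it depends on (that reading, and the `DEPENDENCIES` and `run_order` names, are illustrative and not taken from the repository). It uses the standard-library `graphlib` module to derive one valid execution order.

```python
from graphlib import TopologicalSorter

# Each stage maps to the stages it depends on. This is an assumed reading of
# the diagram: an arrow points from a stage to the stage it consumes.
DEPENDENCIES = {
    "dataset": set(),
    "tissue-masks": {"dataset"},
    "quality-control": {"dataset"},
    "tiling": {"tissue-masks", "quality-control"},
    "tile-masks": {"tiling"},
    "embeddings": {"tiling"},
    "preprocessing-report": {"tile-masks"},
}


def run_order(dependencies):
    """Return one valid order in which every stage follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(run_order(DEPENDENCIES))
```

Under this reading, embeddings can only run once tiling has finished, which matches the PR targeting feature/tiling and being blocked by #5.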

@Adames4 Adames4 requested a review from Copilot February 12, 2026 21:45
@Adames4 Adames4 self-assigned this Feb 12, 2026

coderabbitai bot commented Feb 12, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@gemini-code-assist

Summary of Changes

Hello @Adames4, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new feature for generating embeddings from tiled whole slide images, which is crucial for downstream machine learning tasks. It includes a significant refactoring of the project's directory structure, the addition of numerous configuration files to support various datasets and tile encoders, and the implementation of a dedicated script for the embeddings generation process. These changes aim to establish a robust and flexible preprocessing pipeline for creating image embeddings using state-of-the-art foundation models.

Highlights

  • New Embeddings Generation Feature: Introduced a comprehensive system for generating embeddings from tiled whole slide images using various foundation models, addressing the IBD-21 issue.
  • Extensive Configuration Updates: Added numerous YAML configuration files for different datasets (ftn, ikem, knl_patos) with varying tiling parameters (mpp, tile_extent, level) and experiment-specific preprocessing settings for embeddings using different tile encoders.
  • Project Structure Refactoring: Renamed the core project directory from project_name to ml and reorganized internal modules, including data and modeling components, to improve clarity and maintainability.
  • New Data Handling Modules: Implemented new Python modules for handling dataset labels (labels.py) and creating specialized tiled datasets (tiles.py) for both training and prediction, along with a dedicated typing.py for clearer type definitions.
  • Integration of Foundation Models: Incorporated several foundation models (ProvGigaPath, UNI, UNI2, Virchow, Virchow2) for tile encoding, allowing for diverse embedding generation strategies.
  • Dependency Management Updates: Updated pyproject.toml and uv.lock to include new dependencies such as timm, huggingface-hub, safetensors, shellingham, typer-slim, and torchvision, supporting the new embeddings functionality.
Changelog
  • configs/dataset/tiled/ftn/0_320.yaml
    • Added new dataset configuration for FTN institution with mpp 0.17, tile_extent 320, level 0.
  • configs/dataset/tiled/ftn/0_430.yaml
    • Added new dataset configuration for FTN institution with mpp 0.17, tile_extent 430, level 0.
  • configs/dataset/tiled/ftn/1_224.yaml
    • Added new dataset configuration for FTN institution with mpp 0.52, tile_extent 224, level 1.
  • configs/dataset/tiled/ftn/2_224.yaml
    • Added new dataset configuration for FTN institution with mpp 1.55, tile_extent 224, level 2.
  • configs/dataset/tiled/ikem/0_320.yaml
    • Added new dataset configuration for IKEM institution with mpp 0.17, tile_extent 320, level 0.
  • configs/dataset/tiled/ikem/0_430.yaml
    • Added new dataset configuration for IKEM institution with mpp 0.17, tile_extent 430, level 0.
  • configs/dataset/tiled/ikem/1_224.yaml
    • Added new dataset configuration for IKEM institution with mpp 0.52, tile_extent 224, level 1.
  • configs/dataset/tiled/ikem/2_224.yaml
    • Added new dataset configuration for IKEM institution with mpp 1.55, tile_extent 224, level 2.
  • configs/dataset/tiled/knl_patos/0_320.yaml
    • Added new dataset configuration for KNL_PATOS institution with mpp 0.17, tile_extent 320, level 0.
  • configs/dataset/tiled/knl_patos/0_430.yaml
    • Added new dataset configuration for KNL_PATOS institution with mpp 0.17, tile_extent 430, level 0.
  • configs/dataset/tiled/knl_patos/1_224.yaml
    • Added new dataset configuration for KNL_PATOS institution with mpp 0.52, tile_extent 224, level 1.
  • configs/dataset/tiled/knl_patos/2_224.yaml
    • Added new dataset configuration for KNL_PATOS institution with mpp 1.55, tile_extent 224, level 2.
  • configs/experiment/preprocessing/embeddings/ftn_1_prov_gigapath.yaml
    • Added experiment configuration for FTN dataset with ProvGigaPath encoder.
  • configs/experiment/preprocessing/embeddings/ftn_1_uni.yaml
    • Added experiment configuration for FTN dataset with UNI encoder.
  • configs/experiment/preprocessing/embeddings/ftn_1_uni2.yaml
    • Added experiment configuration for FTN dataset with UNI2 encoder.
  • configs/experiment/preprocessing/embeddings/ftn_1_virchow.yaml
    • Added experiment configuration for FTN dataset with Virchow encoder.
  • configs/experiment/preprocessing/embeddings/ftn_1_virchow2.yaml
    • Added experiment configuration for FTN dataset with Virchow2 encoder.
  • configs/experiment/preprocessing/embeddings/ftn_2_virchow2.yaml
    • Added experiment configuration for FTN dataset (level 2) with Virchow2 encoder.
  • configs/experiment/preprocessing/embeddings/ikem_1_prov_gigapath.yaml
    • Added experiment configuration for IKEM dataset with ProvGigaPath encoder.
  • configs/experiment/preprocessing/embeddings/ikem_1_uni.yaml
    • Added experiment configuration for IKEM dataset with UNI encoder.
  • configs/experiment/preprocessing/embeddings/ikem_1_uni2.yaml
    • Added experiment configuration for IKEM dataset with UNI2 encoder.
  • configs/experiment/preprocessing/embeddings/ikem_1_virchow.yaml
    • Added experiment configuration for IKEM dataset with Virchow encoder.
  • configs/experiment/preprocessing/embeddings/ikem_1_virchow2.yaml
    • Added experiment configuration for IKEM dataset with Virchow2 encoder.
  • configs/experiment/preprocessing/embeddings/ikem_2_virchow2.yaml
    • Added experiment configuration for IKEM dataset (level 2) with Virchow2 encoder.
  • configs/experiment/preprocessing/embeddings/knl_patos_1_prov_gigapath.yaml
    • Added experiment configuration for KNL_PATOS dataset with ProvGigaPath encoder.
  • configs/experiment/preprocessing/embeddings/knl_patos_1_uni.yaml
    • Added experiment configuration for KNL_PATOS dataset with UNI encoder.
  • configs/experiment/preprocessing/embeddings/knl_patos_1_uni2.yaml
    • Added experiment configuration for KNL_PATOS dataset with UNI2 encoder.
  • configs/experiment/preprocessing/embeddings/knl_patos_1_virchow.yaml
    • Added experiment configuration for KNL_PATOS dataset with Virchow encoder.
  • configs/experiment/preprocessing/embeddings/knl_patos_1_virchow2.yaml
    • Added experiment configuration for KNL_PATOS dataset with Virchow2 encoder.
  • configs/experiment/preprocessing/embeddings/knl_patos_2_virchow2.yaml
    • Added experiment configuration for KNL_PATOS dataset (level 2) with Virchow2 encoder.
  • configs/preprocessing/embeddings.yaml
    • Added base configuration for embeddings preprocessing, including output directory, dataloader settings, and MLflow metadata.
  • configs/preprocessing/tile_encoder/prov_gigapath.yaml
    • Added configuration for ProvGigaPath tile encoder.
  • configs/preprocessing/tile_encoder/uni.yaml
    • Added configuration for UNI tile encoder.
  • configs/preprocessing/tile_encoder/uni2.yaml
    • Added configuration for UNI2 tile encoder.
  • configs/preprocessing/tile_encoder/virchow.yaml
    • Added configuration for Virchow tile encoder.
  • configs/preprocessing/tile_encoder/virchow2.yaml
    • Added configuration for Virchow2 tile encoder.
  • ml/__init__.py
    • Renamed project_name/__init__.py to ml/__init__.py.
  • ml/__main__.py
    • Renamed project_name/__main__.py to ml/__main__.py and updated imports.
  • ml/data/README.md
    • Renamed project_name/data/README.md to ml/data/README.md.
  • ml/data/__init__.py
    • Added __init__.py to ml/data to expose DataModule.
  • ml/data/data_module.py
    • Renamed project_name/data/data_module.py to ml/data/data_module.py.
    • Updated imports to reflect the new ml package structure.
    • Removed Input type hints from dataloader methods for broader compatibility.
  • ml/data/datasets/__init__.py
    • Added __init__.py to ml/data/datasets to expose Tiles and TilesPredict.
  • ml/data/datasets/labels.py
    • Added new module labels.py defining LabelMode enum and functions for processing slide labels and retrieving target columns.
  • ml/data/datasets/tiles.py
    • Added new module tiles.py implementing _Tiles, Tiles, and TilesPredict datasets for handling tiled slide images, including transformations and label inclusion logic.
  • ml/modeling/README.md
    • Renamed project_name/modeling/README.md to ml/modeling/README.md.
  • ml/modeling/__init__.py
    • Renamed project_name/modeling/__init__.py to ml/modeling/__init__.py.
  • ml/project_name_model.py
    • Renamed project_name/project_name_model.py to ml/project_name_model.py and updated imports.
  • ml/typing.py
    • Added new module typing.py defining various type aliases and TypedDicts for metadata and data samples, enhancing type clarity across the project.
  • preprocessing/embeddings.py
    • Added new script embeddings.py for generating tile embeddings using various foundation models, handling dataset loading, model inference, and saving results to parquet files.
  • project_name/data/__init__.py
    • Removed project_name/data/__init__.py as part of the project refactoring.
  • project_name/typing.py
    • Removed project_name/typing.py as part of the project refactoring, replaced by ml/typing.py.
  • pyproject.toml
    • Updated pyproject.toml to add timm as a new dependency.
  • scripts/preprocessing/embeddings.py
    • Added a new script for submitting embeddings generation jobs to Kubernetes, including environment setup and execution commands.
  • uv.lock
    • Updated uv.lock to include new dependencies: hf-xet, huggingface-hub, safetensors, shellingham, timm, torchvision, and typer-slim.
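
The tiled dataset configs in the changelog pair an mpp value with a tile_extent. As a rough sketch, assuming mpp means micrometres per pixel and tile_extent is the tile side length in pixels (neither is stated in this thread, and the `TILED_CONFIGS` and `tile_extent_um` names are illustrative), the physical side length of a tile is their product:

```python
# (mpp, tile_extent) pairs copied from the changelog entries, keyed by the
# config file stem. Units are an assumption: micrometres per pixel and pixels.
TILED_CONFIGS = {
    "0_320": (0.17, 320),
    "0_430": (0.17, 430),
    "1_224": (0.52, 224),
    "2_224": (1.55, 224),
}


def tile_extent_um(mpp, tile_extent):
    """Physical side length of a square tile in micrometres."""
    return mpp * tile_extent


def field_of_view(configs=TILED_CONFIGS):
    """Map each config name to its tile side length in micrometres."""
    return {
        name: round(tile_extent_um(mpp, extent), 2)
        for name, (mpp, extent) in configs.items()
    }


if __name__ == "__main__":
    for name, side in field_of_view().items():
        print(f"{name}: {side} um per tile side")
```

Under those assumptions the four configs cover roughly 54, 73, 116 and 347 micrometres per tile side, so the level 1 and level 2 tiles see progressively more tissue context at the same 224-pixel input size.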

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new preprocessing step for generating tile embeddings using various foundation models. It includes the main script, model wrappers, dataset configurations, and a job submission script. The overall structure is good, but there are several critical issues that need to be addressed before merging. These include placeholder values in configuration and job scripts, hardcoded secrets, missing type definitions that will lead to runtime errors, and suboptimal error handling that could mask failures in automated pipelines.

Comment on lines +9 to +11
train: "mlflow-artifacts:/86/bbbe4603bc30495d85ac99093fc9269a/artifacts/train - ftn" # TODO update URI
test_preliminary: "mlflow-artifacts:/86/bbbe4603bc30495d85ac99093fc9269a/artifacts/test preliminary - ftn" # TODO update URI
test_final: "mlflow-artifacts:/86/bbbe4603bc30495d85ac99093fc9269a/artifacts/test final - ftn" # TODO update URI


high

These MLflow artifact URIs appear to be placeholders and are marked with # TODO update URI. Committing placeholder values can lead to runtime errors or incorrect data being used if they are not updated. Please replace them with the final URIs or use a more dynamic configuration approach to avoid hardcoding them. This comment applies to all similar new dataset configuration files in this pull request.
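
One way to keep such placeholders from slipping through is a small pre-merge check. The sketch below is hypothetical (the `find_todo_lines` helper is not part of this PR): it scans the raw text of a config file for trailing comments containing a marker, since YAML parsers discard comments.

```python
def find_todo_lines(yaml_text, marker="TODO"):
    """Return (line_number, line) pairs whose trailing comment contains the marker.

    Naive on purpose: everything after the first '#' is treated as a comment,
    so a '#' inside a quoted value can produce a false positive.
    """
    hits = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        _, hash_sign, comment = line.partition("#")
        if hash_sign and marker in comment:
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # Illustrative input only; the URI below is a made-up placeholder.
    sample = 'train: "mlflow-artifacts:/<run>/artifacts/train"  # TODO update URI\n'
    for number, line in find_todo_lines(sample):
        print(f"line {number}: {line}")
```

A check like this could run in CI over configs/dataset/ and fail the build while any marker remains.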

Collaborator Author


Waiting for #5

    print(f"Embeddings for slide {slide_name} already exist, skipping...")
    continue

try:


high

Catching a broad Exception can hide bugs and make debugging difficult. More importantly, the script continues after printing the error, and will exit with a success status code even if slides fail to process. This can be misleading in automated pipelines. Consider tracking failures and exiting with a non-zero status code if any errors occurred. Also, use a proper logger to log the full traceback for easier debugging.
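
A minimal sketch of that suggestion follows, assuming a per-slide callable; the `process_all` and `main` names and the `process_slide` parameter are illustrative and not taken from preprocessing/embeddings.py.

```python
import logging
import sys

logger = logging.getLogger("embeddings")


def process_all(slide_names, process_slide):
    """Process every slide and return the names of the slides that failed."""
    failed = []
    for slide_name in slide_names:
        try:
            process_slide(slide_name)
        except Exception:
            # logger.exception logs at ERROR level and appends the traceback.
            logger.exception("Error processing slide %s", slide_name)
            failed.append(slide_name)
    return failed


def main(slide_names, process_slide):
    """Run all slides, then exit non-zero if any of them failed."""
    failed = process_all(slide_names, process_slide)
    if failed:
        logger.error("%d slide(s) failed: %s", len(failed), ", ".join(failed))
        sys.exit(1)  # non-zero status so a job scheduler sees the failure
```

The loop still visits every slide, so one bad slide does not block the rest, but the process no longer reports success when some embeddings are missing.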

embeddings_path = (dest / slide_name).with_suffix(".parquet")

if embeddings_path.exists():
    print(f"Embeddings for slide {slide_name} already exist, skipping...")


medium

Using print() for logging makes it difficult to control log levels and redirect output. It's better practice to use the standard logging module. This would allow for more flexible configuration, such as writing to a file, setting verbosity, and formatting messages. This applies to other print statements in this file as well (e.g., line 230).
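
As a rough illustration, the skip check could go through the standard logging module like this. The `configure_logging` and `should_skip` helpers are hypothetical names, and the format string is only one reasonable choice.

```python
import logging
from pathlib import Path


def configure_logging(level=logging.INFO):
    """Set up root logging once and return a logger for the embeddings script."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    return logging.getLogger("embeddings")


def should_skip(embeddings_path: Path, slide_name: str, logger: logging.Logger) -> bool:
    """Return True (and log at INFO level) when the output file already exists."""
    if embeddings_path.exists():
        logger.info("Embeddings for slide %s already exist, skipping", slide_name)
        return True
    return False
```

With this in place, verbosity is controlled by the level passed to `configure_logging`, and handlers can redirect the same messages to a file without touching the call sites.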


Copilot AI left a comment


Pull request overview

Adds an embeddings preprocessing stage to the pipeline, including tile datasets/typing utilities, Hydra configs for multiple foundation-model tile encoders, and the required ML dependencies (timm + HF hub stack).

Changes:

  • Add preprocessing/embeddings.py to compute and store tile embeddings (Parquet) using timm foundation models pulled from Hugging Face.
  • Introduce ml/ data/typing scaffolding (tile datasets + label modes) and update entrypoints/imports away from the removed project_name/* placeholders.
  • Add Hydra configs for embeddings runs across institutions/levels and add timm to dependencies (lockfile updates included).

Reviewed changes

Copilot reviewed 49 out of 54 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
uv.lock Locks new dependencies for embeddings models (HF hub stack, timm, torchvision, safetensors, etc.).
pyproject.toml Adds timm>=1.0.24 dependency.
scripts/preprocessing/embeddings.py Adds kube job submission script for running embeddings preprocessing.
preprocessing/embeddings.py Implements embeddings extraction pipeline + foundation model wrappers (timm/HF).
ml/typing.py Adds typed metadata + sample/type aliases for tiles/embeddings data.
ml/project_name_model.py Updates typing import to ml.typing.
ml/__main__.py Switches imports to ml.data / ml.project_name_model.
ml/data/data_module.py Adjusts dataloader return typing (removes old project_name.typing dependency).
ml/data/__init__.py Exposes DataModule from ml.data.
ml/data/README.md Adds suggested project structure documentation for ml/data.
ml/data/datasets/__init__.py Exposes Tiles and TilesPredict datasets.
ml/data/datasets/tiles.py Adds tile datasets built on MetaTiledSlides / OpenSlideTilesDataset.
ml/data/datasets/labels.py Adds label modes + slide processing/label extraction utilities.
ml/modeling/README.md Adds suggested project structure documentation for ml/modeling.
project_name/typing.py Removes placeholder typing aliases from old template package.
project_name/data/__init__.py Removes old template re-export for DataModule.
configs/preprocessing/embeddings.yaml Adds embeddings preprocessing config (output path, dataloader, mlflow metadata).
configs/preprocessing/tile_encoder/prov_gigapath.yaml Adds Hydra tile-encoder config for ProvGigaPath.
configs/preprocessing/tile_encoder/virchow.yaml Adds Hydra tile-encoder config for Virchow.
configs/preprocessing/tile_encoder/virchow2.yaml Adds Hydra tile-encoder config for Virchow2.
configs/preprocessing/tile_encoder/uni.yaml Adds Hydra tile-encoder config for UNI.
configs/preprocessing/tile_encoder/uni2.yaml Adds Hydra tile-encoder config for UNI2.
configs/experiment/preprocessing/embeddings/knl_patos_1_prov_gigapath.yaml Adds experiment preset for knl_patos L1 + ProvGigaPath embeddings.
configs/experiment/preprocessing/embeddings/knl_patos_1_uni.yaml Adds experiment preset for knl_patos L1 + UNI embeddings.
configs/experiment/preprocessing/embeddings/knl_patos_1_uni2.yaml Adds experiment preset for knl_patos L1 + UNI2 embeddings.
configs/experiment/preprocessing/embeddings/knl_patos_1_virchow.yaml Adds experiment preset for knl_patos L1 + Virchow embeddings.
configs/experiment/preprocessing/embeddings/knl_patos_1_virchow2.yaml Adds experiment preset for knl_patos L1 + Virchow2 embeddings.
configs/experiment/preprocessing/embeddings/knl_patos_2_virchow2.yaml Adds experiment preset for knl_patos L2 + Virchow2 embeddings.
configs/experiment/preprocessing/embeddings/ikem_1_prov_gigapath.yaml Adds experiment preset for ikem L1 + ProvGigaPath embeddings.
configs/experiment/preprocessing/embeddings/ikem_1_uni.yaml Adds experiment preset for ikem L1 + UNI embeddings.
configs/experiment/preprocessing/embeddings/ikem_1_uni2.yaml Adds experiment preset for ikem L1 + UNI2 embeddings.
configs/experiment/preprocessing/embeddings/ikem_1_virchow.yaml Adds experiment preset for ikem L1 + Virchow embeddings.
configs/experiment/preprocessing/embeddings/ikem_1_virchow2.yaml Adds experiment preset for ikem L1 + Virchow2 embeddings.
configs/experiment/preprocessing/embeddings/ikem_2_virchow2.yaml Adds experiment preset for ikem L2 + Virchow2 embeddings.
configs/experiment/preprocessing/embeddings/ftn_1_prov_gigapath.yaml Adds experiment preset for ftn L1 + ProvGigaPath embeddings.
configs/experiment/preprocessing/embeddings/ftn_1_uni.yaml Adds experiment preset for ftn L1 + UNI embeddings.
configs/experiment/preprocessing/embeddings/ftn_1_uni2.yaml Adds experiment preset for ftn L1 + UNI2 embeddings.
configs/experiment/preprocessing/embeddings/ftn_1_virchow.yaml Adds experiment preset for ftn L1 + Virchow embeddings.
configs/experiment/preprocessing/embeddings/ftn_1_virchow2.yaml Adds experiment preset for ftn L1 + Virchow2 embeddings.
configs/experiment/preprocessing/embeddings/ftn_2_virchow2.yaml Adds experiment preset for ftn L2 + Virchow2 embeddings.
configs/dataset/tiled/knl_patos/0_320.yaml Adds tiled dataset config for knl_patos L0 @ 320px.
configs/dataset/tiled/knl_patos/0_430.yaml Adds tiled dataset config for knl_patos L0 @ 430px.
configs/dataset/tiled/knl_patos/1_224.yaml Adds tiled dataset config for knl_patos L1 @ 224px.
configs/dataset/tiled/knl_patos/2_224.yaml Adds tiled dataset config for knl_patos L2 @ 224px.
configs/dataset/tiled/ikem/0_320.yaml Adds tiled dataset config for ikem L0 @ 320px.
configs/dataset/tiled/ikem/0_430.yaml Adds tiled dataset config for ikem L0 @ 430px.
configs/dataset/tiled/ikem/1_224.yaml Adds tiled dataset config for ikem L1 @ 224px.
configs/dataset/tiled/ikem/2_224.yaml Adds tiled dataset config for ikem L2 @ 224px.
configs/dataset/tiled/ftn/0_320.yaml Adds tiled dataset config for ftn L0 @ 320px.
configs/dataset/tiled/ftn/0_430.yaml Adds tiled dataset config for ftn L0 @ 430px.
configs/dataset/tiled/ftn/1_224.yaml Adds tiled dataset config for ftn L1 @ 224px.
configs/dataset/tiled/ftn/2_224.yaml Adds tiled dataset config for ftn L2 @ 224px.
Comments suppressed due to low confidence (1)

ml/project_name_model.py:6

  • ml.typing does not define Input or Outputs, so this import will fail at runtime (and ProjectNameModel.forward type hints won’t resolve). Either add Input/Outputs type aliases to ml/typing.py (matching what the model expects) or update this import and the annotations to use the existing aliases (e.g. Output or more specific types).


Comment on lines +229 to +230
except Exception as e:
    print(f"Error processing slide {slide_name}: {e}")

Copilot AI Feb 12, 2026


Catching a broad Exception and only printing str(e) makes failures hard to diagnose and doesn’t record stack traces in MLflow. Prefer logging the full exception (e.g., via the logger or logging.exception) and consider failing the run (or collecting failures) if embeddings are incomplete.

@Adames4 Adames4 requested a review from vejtek February 12, 2026 22:17
@Adames4 Adames4 removed the request for review from vejtek February 28, 2026 20:51