13 changes: 13 additions & 0 deletions .gemini/config.yaml
@@ -0,0 +1,13 @@
have_fun: false
memory_config:
  disabled: false
code_review:
  disable: false
  comment_severity_threshold: MEDIUM
  max_review_comments: -1
  pull_request_opened:
    help: false
    summary: true
    code_review: true
    include_drafts: true
ignore_patterns: []
41 changes: 41 additions & 0 deletions .gemini/styleguide.md
@@ -0,0 +1,41 @@
# Gemini Code Assist PR Review Styleguide
**Repository:** `ulcerative-colitis` (RationAI)
**Context:** This is a research-focused machine learning repository dedicated to predicting the Nancy index from colon WSIs in digital pathology.

## 🎯 Primary Review Focus
- **Ignore formatting and linting:** We use `ruff` for formatting/linting and `uv` for package management. Assume CI/CD will catch styling issues. Do not comment on line length, quotes, or basic PEP-8 formatting.
- **Focus on ML logic and correctness:** Look for off-by-one errors in array slicing, incorrect tensor device allocations, tensor shape mismatches, and data leakage between train/val/test splits.
- **Research context over production validation:** This is research code. Do not suggest adding heavy input validation, complex exception handling, or enterprise-grade defensive programming unless the current logic will explicitly crash the pipeline. Prioritize readability.
- **Testing:** Do not block PRs or aggressively request unit tests. We do not enforce strict unit testing for this repository.

## 📝 General Comment Style
- Keep comments **short and actionable**.
- Prefer **bullet points** over long paragraphs.
- Point to specific lines or sections when possible.
- Suggest improvements rather than rewriting entire snippets.
- Avoid repetition of what the code already clearly states.
- Defer to the repo’s existing conventions unless there’s a clear bug or inconsistency.

## 🔬 Domain-Specific Guidance (RationAI & ratiopath)
- **Use `ratiopath`:** This project relies on our library `ratiopath`.
  - If you see custom tiling logic, suggest using `ratiopath.tiling`.
  - If you see custom annotation parsing (ASAP/GeoJSON), suggest using `ratiopath.parsers`.
  - Check if Ray-based distributed processing in `ratiopath` is being used efficiently for large-scale WSI tasks.
- **WSI Handling:** Verify that `openslide` or `ratiopath` calls use the correct downsample levels and that tile offsets are calculated correctly.

## 🏗️ Architecture & Reproducibility
- **Hydra Configs (`configs/`):** Reproducibility is paramount. If a PR introduces a new module, model architecture, or preprocessing step, check if the author has updated or created the corresponding YAML configuration. Remind them if it seems missing.
- **Experiment Tracking (MLflow):** When PRs add new loss functions, evaluation metrics, or training loops in `project_name/` (or `ml/`), ensure that these new metrics are properly logged to MLflow.
- **Repository Structure:**
  - `preprocessing/`: Ensure data transformations (tiling, QC, tissue masks) are logically sound.
  - `project_name/` (future `ml/`): Focus on training loops, PyTorch Lightning modules, and model definitions.
  - `postprocessing/`: Focus on ensembling and final prediction logic.
  - `scripts/`: These are job submission templates. Do not review them as strictly as core Python code.

## 📚 Types & Documentation
- **Type Hinting:** We run `mypy` in strict mode, but passing it is *not required* for a PR to be merged. Gently suggest type hints for complex function signatures, but do not nitpick missing `Any` types or incomplete typing.
- **Docstrings:** Docstrings are not strictly required everywhere. However, if a docstring *is* provided, ensure it generally follows the **Google Docstring Style**.

## 💻 Libraries & Best Practices
- **PyTorch & PyTorch Lightning:** Suggest idiomatic PyTorch Lightning constructs (e.g., using `self.log` correctly). Watch out for detached tensors or memory leaks in custom training steps.
- **Data Processing (NumPy, Pandas, OpenSlide):** Suggest vectorized operations over `for` loops where applicable for performance. Ensure OpenSlide WSI coordinate extractions are logical (e.g., matching the correct level/downsample).
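
As an illustration of the level/downsample check mentioned above, here is a minimal sketch. It relies on the fact that OpenSlide's `read_region(location, level, size)` expects `location` in level-0 pixels while `size` is given in pixels of the requested level. The helper name `tile_read_args` and the grid layout (non-overlapping square tiles indexed by column and row) are assumptions made for this example, not part of this repository or of `ratiopath`.

```python
def tile_read_args(
    col: int,
    row: int,
    tile_size: int,
    level: int,
    level_downsamples: list[float],
) -> tuple[tuple[int, int], int, tuple[int, int]]:
    """Hypothetical helper: build (location, level, size) for read_region.

    Assumes a grid of non-overlapping square tiles defined at `level`.
    """
    downsample = level_downsamples[level]
    # `location` must be in level-0 pixels, so scale the level-space offset up.
    x0 = int(col * tile_size * downsample)
    y0 = int(row * tile_size * downsample)
    # `size` stays in pixels of the requested level and is NOT scaled.
    return (x0, y0), level, (tile_size, tile_size)


# Example with made-up downsample factors (level 1 is 4x smaller than level 0).
location, level, size = tile_read_args(
    col=2, row=1, tile_size=256, level=1, level_downsamples=[1.0, 4.0]
)
print(location, level, size)
```

A common bug worth flagging in review is passing the level-space offset (here `(512, 256)`) directly as `location`, which silently reads the wrong region.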
Empty file removed configs/experiment/.gitkeep
Empty file.
10 changes: 10 additions & 0 deletions configs/experiment/preprocessing/split_dataset/ftn.yaml
@@ -0,0 +1,10 @@
# @package _global_

defaults:
  - /dataset/processed/ftn@dataset
  - _self_

splits:
  train: 0.7
  test_preliminary: 0.15
  test_final: 0.15
10 changes: 10 additions & 0 deletions configs/experiment/preprocessing/split_dataset/ikem.yaml
@@ -0,0 +1,10 @@
# @package _global_

defaults:
  - /dataset/processed/ikem@dataset
  - _self_

splits:
  train: 0.7
  test_preliminary: 0.15
  test_final: 0.15
10 changes: 10 additions & 0 deletions configs/experiment/preprocessing/split_dataset/knl_patos.yaml
@@ -0,0 +1,10 @@
# @package _global_

defaults:
  - /dataset/processed/knl_patos@dataset
  - _self_

splits:
  train: 0.0
  test_preliminary: 0.5
  test_final: 0.5
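
Reviewer note: the three experiment configs can be sanity-checked against the arithmetic in `preprocessing/split_dataset.py`, which first carves off `train` and then splits the remainder using `test_preliminary / (1 - train)`. The sketch below is a stand-alone illustration in plain Python; the helper name `preliminary_fraction_of_remainder` is made up for this example, and the fractions are copied from the configs in this PR.

```python
from math import isclose

# Split fractions copied from the three experiment configs in this PR.
EXPERIMENT_SPLITS = {
    "ftn": {"train": 0.7, "test_preliminary": 0.15, "test_final": 0.15},
    "ikem": {"train": 0.7, "test_preliminary": 0.15, "test_final": 0.15},
    "knl_patos": {"train": 0.0, "test_preliminary": 0.5, "test_final": 0.5},
}


def preliminary_fraction_of_remainder(splits: dict[str, float]) -> float:
    """Hypothetical helper: share of the non-train rows given to test_preliminary."""
    assert isclose(sum(splits.values()), 1.0), "Splits must sum to 1.0"
    # The fractions are relative to the whole dataset, so rescale to the remainder.
    return splits["test_preliminary"] / (1.0 - splits["train"])


for name, splits in EXPERIMENT_SPLITS.items():
    fraction = preliminary_fraction_of_remainder(splits)
    print(f"{name}: train={splits['train']}, preliminary share of remainder={fraction:.2f}")
```

For all three configs the second-stage fraction comes out at about one half, so the two test sets end up roughly equal in size whether or not a training share is taken first.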
15 changes: 15 additions & 0 deletions configs/preprocessing/split_dataset.yaml
@@ -0,0 +1,15 @@
# @package _global_

splits:
  train: ???
  test_preliminary: ???
  test_final: ???

n_folds: 5
random_state: 42

metadata:
  run_name: "✂️ Dataset splitting: ${dataset.institution}"
  description: Dataset splitting for ${dataset.institution} dataset.
  hyperparams: ${splits}

89 changes: 89 additions & 0 deletions preprocessing/split_dataset.py
@@ -0,0 +1,89 @@
from math import isclose
from pathlib import Path
from tempfile import TemporaryDirectory

import hydra
import pandas as pd
from mlflow.artifacts import download_artifacts
from omegaconf import DictConfig
from rationai.mlkit import autolog, with_cli_args
from rationai.mlkit.lightning.loggers import MLFlowLogger
from ratiopath.model_selection import train_test_split
from sklearn.model_selection import StratifiedGroupKFold


def split_dataset(
    dataset: pd.DataFrame, splits: DictConfig, random_state: int
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    assert isclose(sum(splits.values()), 1.0), "Splits must sum to 1.0"

    if isclose(splits["train"], 0.0):
        # No training share: the whole dataset goes to the test splits.
        train = pd.DataFrame(columns=dataset.columns)
        test = dataset
    else:
        train, test = train_test_split(
            dataset,
            train_size=splits["train"],
            random_state=random_state,
            stratify=dataset["nancy"],
            groups=dataset["case_id"],
        )

    if isclose(splits["test_preliminary"], 0.0):
        test_preliminary = pd.DataFrame(columns=dataset.columns)
        test_final = test
    else:
        # Split fractions are relative to the full dataset, so rescale
        # test_preliminary to a fraction of the non-train remainder.
        preliminary_size = splits["test_preliminary"] / (1.0 - splits["train"])
        test_preliminary, test_final = train_test_split(
            test,
            train_size=preliminary_size,
            random_state=random_state,
            stratify=test["nancy"],
            groups=test["case_id"],
        )

    return train, test_preliminary, test_final


def add_folds(train: pd.DataFrame, n_folds: int, random_state: int) -> pd.DataFrame:
    if train.empty:
        return train

    splitter = StratifiedGroupKFold(
        n_splits=n_folds, shuffle=True, random_state=random_state
    )
    train["fold"] = -1
    # Each row lands in exactly one validation fold; store that fold's index.
    for fold, (_, val_idx) in enumerate(
        splitter.split(train, y=train["nancy"], groups=train["case_id"])
    ):
        train.loc[train.iloc[val_idx].index, "fold"] = fold
    return train


@with_cli_args(["+preprocessing=split_dataset"])
@hydra.main(config_path="../configs", config_name="preprocessing", version_base=None)
@autolog
def main(config: DictConfig, logger: MLFlowLogger) -> None:
    dataset = pd.read_csv(download_artifacts(config.dataset.uri))

    train, test_preliminary, test_final = split_dataset(
        dataset, config.splits, config.random_state
    )
    train = add_folds(train, config.n_folds, config.random_state)

    with TemporaryDirectory() as tmpdir:
        for name, df in (
            ("train", train),
            ("test_preliminary", test_preliminary),
            ("test_final", test_final),
        ):
            # Skip empty splits so that no empty CSV is logged as an artifact.
            if df.empty:
                continue

            output_path = Path(tmpdir) / f"{name}.csv"
            df.to_csv(output_path, index=False)
            logger.log_artifact(str(output_path))


if __name__ == "__main__":
    main()
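
Reviewer note: because both split stages pass `groups=dataset["case_id"]`, no case should appear in more than one output CSV. A quick way to confirm this on the logged artifacts is a pairwise overlap check on the case IDs. The sketch below uses plain Python on toy IDs; the function name `find_group_overlap` and the sample IDs are invented for illustration and are not part of this PR.

```python
def find_group_overlap(
    splits: dict[str, list[str]],
) -> dict[tuple[str, str], set[str]]:
    """Return case IDs shared by any pair of splits (empty dict means no leakage)."""
    overlaps: dict[tuple[str, str], set[str]] = {}
    names = list(splits)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = set(splits[first]) & set(splits[second])
            if shared:
                overlaps[(first, second)] = shared
    return overlaps


# Toy case IDs, made up for illustration (one entry per slide).
clean = {
    "train": ["case_a", "case_a", "case_b"],
    "test_preliminary": ["case_c"],
    "test_final": ["case_d", "case_d"],
}
leaky = {
    "train": ["case_a", "case_b"],
    "test_preliminary": ["case_b"],
    "test_final": ["case_d"],
}

print(find_group_overlap(clean))
print(find_group_overlap(leaky))
```

With the real artifacts, the lists would come from the `case_id` column of each CSV, for example `pd.read_csv("train.csv")["case_id"].tolist()`.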
1 change: 1 addition & 0 deletions pyproject.toml
@@ -19,6 +19,7 @@ dependencies = [
    "ray>=2.52.1",
    "torch>=2.9.0",
    "torchmetrics>=1.8.2",
    "ratiopath>=1.1.2",
]

[dependency-groups]
16 changes: 16 additions & 0 deletions scripts/preprocessing/split_dataset.py
@@ -0,0 +1,16 @@
from kube_jobs import submit_job


submit_job(
    job_name="ulcerative-colitis-dataset-split-...",
    username=...,
    public=False,
    cpu=2,
    memory="4Gi",
    script=[
        "git clone https://github.com/RationAI/ulcerative-colitis.git workdir",
        "cd workdir",
        "uv sync --frozen",
        "uv run -m preprocessing.split_dataset +experiment=preprocessing/split_dataset/...",
    ],
)