Conversation
📝 Walkthrough

Adds dataset-splitting functionality and related configs: new dataset split YAMLs, a Python module for stratified group splitting and fold assignment with MLflow artifact logging, a Kubernetes job submission script, a project PR review styleguide, and a Gemini tool config file.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Config as Configuration (Hydra)
    participant Main as main()
    participant MLflow as MLflow Service
    participant Split as split_dataset()
    participant Fold as add_folds()
    participant Artifact as Artifact Storage
    Config->>Main: provide config.splits & config.n_folds
    Main->>MLflow: download dataset artifact
    MLflow-->>Main: dataset CSV
    Main->>Split: call split_dataset(dataset, splits)
    Split->>Split: validate ratios & perform stratified group split
    Split-->>Main: return train, test_preliminary, test_final
    Main->>Fold: call add_folds(train, n_folds)
    Fold->>Fold: assign StratifiedGroupKFold 'fold' column
    Fold-->>Main: return train with folds
    Main->>Artifact: log CSV artifacts (train, tests)
    Artifact-->>Main: ack
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Summary of Changes

This pull request introduces a robust and configurable system for splitting datasets. It provides a dedicated Python script for performing stratified group-based splits into training, preliminary testing, and final testing sets, along with K-fold cross-validation assignment. The changes also include specific configurations for various datasets and updates to project dependencies to support this new functionality.
Code Review
This pull request introduces a new preprocessing step for splitting datasets. It includes the Python script to perform the splitting, along with Hydra configuration files for different datasets and a script to run it as a Kubernetes job. The changes are well-structured. My feedback includes a few suggestions for improvement: fixing a typo in the configuration, using a logger instead of print statements for better logging, making the random seed configurable for reproducibility, and addressing a dependency on a feature branch.
Actionable comments posted: 3
🧹 Nitpick comments (4)
scripts/preprocessing/split_dataset.py (1)
**11-14:** Pin the cloned repository ref to keep job results reproducible.

Lines 11-14 execute against whatever HEAD is at run time. For preprocessing experiments, this makes runs non-deterministic and hard to audit.
Suggested change

```diff
- "git clone https://github.com/RationAI/ulcerative-colitis.git workdir",
+ "git clone https://github.com/RationAI/ulcerative-colitis.git workdir",
+ "cd workdir && git checkout ${GIT_REF}",
- "cd workdir",
+ "cd workdir",
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/preprocessing/split_dataset.py` around lines 11 - 14, The git clone command in scripts/preprocessing/split_dataset.py ("git clone https://github.com/RationAI/ulcerative-colitis.git workdir") must pin the repository ref so runs are reproducible; change that command to clone a specific commit/tag/branch (e.g., use "git clone --branch <TAG_OR_BRANCH> --depth 1 ..." or clone then "git checkout <COMMIT_SHA>") or replace the URL with a ref-qualified URL, and ensure the subsequent command ("uv run -m preprocessing.split_dataset +experiment=preprocessing/split_dataset/...") runs against that pinned ref; update any CI/env variables if you need to pass the chosen tag/sha.

preprocessing/split_dataset.py (3)
**67:** Use logger instead of `print()` for consistency.

The function receives a `logger` parameter but uses `print()` for output. This is inconsistent, and the printed distribution bypasses the handlers that capture the rest of the run's output.

♻️ Proposed fix

```diff
- print("whole dataset", dataset["nancy"].value_counts() / len(dataset))
+ logger.log_text(
+     str(dataset["nancy"].value_counts() / len(dataset)),
+     "dataset_distribution.txt"
+ )
```

Alternatively, if you need console output, consider using Python's `logging` module configured with appropriate handlers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preprocessing/split_dataset.py` at line 67, Replace the direct print call with the provided logger to ensure consistent logging: in preprocessing/split_dataset.py where the code prints "whole dataset" (the print at line showing dataset["nancy"].value_counts() / len(dataset)), change it to use the passed-in logger (e.g., logger.info or logger.debug) and include the same message and computed value so output is captured by the app's logging/MLflow handlers; keep the message content but route it through the logger used elsewhere in the function that accepts the logger parameter.
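A minimal sketch of this nitpick using Python's stdlib `logging` as a stand-in for the MLflow-backed logger in the real code; the toy `dataset` and its `nancy` column are illustrative, not the project's actual data:

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("preprocessing.split_dataset")

# Illustrative frame; in the real module this comes from the downloaded artifact.
dataset = pd.DataFrame({"nancy": [0, 0, 1, 2, 1, 0]})

# Same message content as the original print(), routed through a logger so
# it is captured by whatever handlers the application configures.
distribution = dataset["nancy"].value_counts() / len(dataset)
logger.info("whole dataset\n%s", distribution)
```

The key point is that the message survives in structured logs rather than only on stdout, which matters once the job runs inside Kubernetes.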
**20-30:** Edge case: `train=1.0` bypasses the zero-check and calls `train_test_split`.

When `splits["train"]` equals `1.0`, the `isclose(splits["train"], 0.0)` check on line 20 is `False`, so `train_test_split` is called with `train_size=1.0`. Depending on ratiopath's implementation, this may raise an error or produce unexpected results since no samples remain for the test set.

Consider adding an explicit check for `train=1.0`:

♻️ Proposed fix

```diff
- if isclose(splits["train"], 0.0):
+ if isclose(splits["train"], 0.0):
      train = pd.DataFrame(columns=dataset.columns)
      test = dataset
+ elif isclose(splits["train"], 1.0):
+     train = dataset
+     test = pd.DataFrame(columns=dataset.columns)
  else:
      train, test = train_test_split(
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preprocessing/split_dataset.py` around lines 20 - 30, The code currently only checks for train==0.0 but calls train_test_split when splits["train"]==1.0 which can error because no test samples remain; update the logic around splits["train"] in preprocessing/split_dataset.py to explicitly handle the edge case where splits["train"] is close to 1.0 (e.g., isclose(splits["train"], 1.0)) by setting train = dataset and test = empty DataFrame with the same columns (similarly to the 0.0 branch) instead of calling train_test_split; keep using the existing symbols splits["train"], train_test_split, dataset, and dataset.columns so the rest of the code (stratify/groups) is not invoked when train_size is 1.0.
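A self-contained sketch of the branch logic this comment asks for, assuming plain positional slicing in place of ratiopath's stratified group `train_test_split`; the function name and toy frame are illustrative only:

```python
from math import isclose

import pandas as pd


def split_train_test(dataset: pd.DataFrame, train_ratio: float):
    """Illustrates the 0.0 / 1.0 edge-case branches; not the real splitter."""
    if isclose(train_ratio, 0.0):
        return pd.DataFrame(columns=dataset.columns), dataset
    if isclose(train_ratio, 1.0):
        # New branch: keep everything in train and return an empty test
        # frame instead of calling the splitter with train_size=1.0.
        return dataset, pd.DataFrame(columns=dataset.columns)
    # Stand-in for the stratified group split in the real code.
    n_train = int(round(len(dataset) * train_ratio))
    return dataset.iloc[:n_train], dataset.iloc[n_train:]


df = pd.DataFrame({"case_id": [1, 1, 2, 2], "nancy": [0, 1, 0, 1]})
train, test = split_train_test(df, 1.0)
```

Returning an empty frame with the same columns keeps downstream CSV logging uniform across all three splits.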
**54-57:** Simplify fold assignment indexing.

The current indexing `train.loc[train.iloc[val_idx].index, "fold"]` is convoluted. Since `val_idx` contains positional indices, you can use the column position directly with `iloc`:

♻️ Proposed simplification

```diff
  for fold, (_, val_idx) in enumerate(
      splitter.split(train, y=train["nancy"], groups=train["case_id"])
  ):
-     train.loc[train.iloc[val_idx].index, "fold"] = fold
+     train.iloc[val_idx, train.columns.get_loc("fold")] = fold
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preprocessing/split_dataset.py` around lines 54 - 57, Replace the convoluted indexing in the fold assignment inside the splitter loop: instead of using train.loc[train.iloc[val_idx].index, "fold"] = fold, assign directly by positional indices—use train.iloc with val_idx and the column position for "fold" (obtainable via train.columns.get_loc("fold")), or alternatively use train.loc with train.index[val_idx] to set the "fold" column; update the assignment in the loop that iterates over splitter.split(train, y=train["nancy"], groups=train["case_id"]) accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@configs/preprocessing/split_dataset.yaml`:
- Around line 11-12: Update the metadata strings for the dataset splitting job:
replace the misspelled "spliting" with "splitting" in both the run_name and
description entries (symbols: run_name, description) so they read "Dataset
splitting: ${dataset.institution}" and "Dataset splitting for
${dataset.institution} dataset."
In `@preprocessing/split_dataset.py`:
- Line 18: Replace the runtime assertion in split_dataset.py with explicit
validation: instead of using assert isclose(sum(splits.values()), 1.0), compute
total = sum(splits.values()) and if not math.isclose(total, 1.0): raise
ValueError(f"Splits must sum to 1.0; got {total}") (ensure math.isclose is
imported); update any surrounding code in the function that depends on the
assert to use this explicit check for production-safe validation.
In `@scripts/preprocessing/split_dataset.py`:
- Around line 5-14: The job definition contains unresolved placeholders—replace
the Ellipsis used for username and the "..." in job_name and the uv run
+experiment path with real values or make them parameters; specifically update
the job_name, username, and the script entry that runs "uv run -m
preprocessing.split_dataset +experiment=preprocessing/split_dataset/..." so they
accept variables or environment args (e.g., CLI flags or template variables) and
validate them before submission so the script and uv command are executable
(look for the job_name, username, and the script list entries to implement the
parameterization and validation).
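The assert-to-`ValueError` change described in the second inline comment can be sketched as follows; the `splits` keys mirror the config names quoted in the review, and the helper name is hypothetical:

```python
from math import isclose


def validate_splits(splits: dict[str, float]) -> None:
    """Reject split ratios that do not sum to 1.0."""
    total = sum(splits.values())
    if not isclose(total, 1.0):
        # Unlike assert, this check survives `python -O` and reports
        # the offending value in the error message.
        raise ValueError(f"Splits must sum to 1.0; got {total}")


validate_splits({"train": 0.7, "test_preliminary": 0.15, "test_final": 0.15})
```

`isclose` matters here because float ratios such as `0.7 + 0.15 + 0.15` may not sum to exactly `1.0`.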
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (10)
- .gemini/config.yaml
- .gemini/styleguide.md
- configs/experiment/.gitkeep
- configs/experiment/preprocessing/split_dataset/ftn.yaml
- configs/experiment/preprocessing/split_dataset/ikem.yaml
- configs/experiment/preprocessing/split_dataset/knl_patos.yaml
- configs/preprocessing/split_dataset.yaml
- preprocessing/split_dataset.py
- pyproject.toml
- scripts/preprocessing/split_dataset.py
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@preprocessing/split_dataset.py`:
- Line 36: The computation of preliminary_size uses preliminary_size =
splits["test_preliminary"] / (1.0 - splits["train"]) which can blow up when
splits["train"] is ≈1.0; add a small epsilon guard for the denominator (e.g.,
denom = max(1e-8, 1.0 - splits["train"])) and then clamp preliminary_size to the
valid range [0.0, 1.0] (e.g., preliminary_size = max(0.0, min(preliminary_size,
1.0))); apply these changes where preliminary_size is computed and before
calling train_test_split so train_test_split receives a valid fraction.
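The guard-and-clamp described above can be sketched as a small helper; the function name is hypothetical and the epsilon value follows the review's suggestion:

```python
def preliminary_fraction(splits: dict[str, float]) -> float:
    """Fraction of the non-train remainder that goes to test_preliminary."""
    # Guard against division by ~0 when splits["train"] is close to 1.0,
    # then clamp to the [0.0, 1.0] range that train_test_split accepts.
    denom = max(1e-8, 1.0 - splits["train"])
    return max(0.0, min(splits["test_preliminary"] / denom, 1.0))


frac = preliminary_fraction(
    {"train": 0.7, "test_preliminary": 0.15, "test_final": 0.15}
)
```

With the 1.0-train edge case handled explicitly earlier in the function, this path should rarely trigger, but the clamp keeps the fraction valid even under float drift.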
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- configs/preprocessing/split_dataset.yaml
- preprocessing/split_dataset.py
🚧 Files skipped from review as they are similar to previous changes (1)
- configs/preprocessing/split_dataset.yaml
🧹 Nitpick comments (1)
.gemini/styleguide.md (1)
28-34: Tighten path naming and script-review wording to avoid ambiguity.
`project_name/` looks placeholder-like in a repo-specific styleguide, and line 33 could be read as relaxing correctness/security checks for scripts. Consider clarifying both.

Proposed wording refinement

```diff
- - **Experiment Tracking (MLflow):** When PRs add new loss functions, evaluation metrics, or training loops in `project_name/` (or `ml/`), ensure that these new metrics are properly logged to MLflow.
+ - **Experiment Tracking (MLflow):** When PRs add new loss functions, evaluation metrics, or training loops in the repository’s model/training package (currently `ml/` when introduced), ensure these metrics are logged to MLflow.
- - `project_name/` (future `ml/`): Focus on training loops, PyTorch Lightning modules, and model definitions.
+ - `ml/` (when introduced): Focus on training loops, PyTorch Lightning modules, and model definitions.
  - `postprocessing/`: Focus on ensembling and final prediction logic.
- - `scripts/`: These are job submission templates. Do not review them as strictly as core Python code.
+ - `scripts/`: These are job submission templates. Prioritize correctness, reproducibility, and safety; be lighter only on style-level nits.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.gemini/styleguide.md around lines 28 - 34, The wording is ambiguous: replace the vague placeholder `project_name/` and the phrase about `scripts/` being "not reviewed as strictly" with clear, repo-specific guidance—use the intended directory name (e.g., `ml/`) or state “(replace with repo ML package name)” where `project_name/` appears, and change the `scripts/` line to clarify they are job submission templates that may have relaxed style requirements but still must meet security and correctness checks; update the lines under "Experiment Tracking (MLflow):" and "Repository Structure:" to mention `ml/` (or an explicit repo ML package) and to state that `scripts/` are exempt from some style reviews but are still subject to security/correctness validation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 82159faf-d14e-452b-a680-2bb3a986c0ef
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (2)
- .gemini/styleguide.md
- pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
- pyproject.toml
Blocked by RationAI/ratiopath#34