
feat: dataset #2

Merged
vejtek merged 13 commits into master from feature/dataset
Feb 13, 2026

Conversation

@Adames4
Collaborator

@Adames4 Adames4 commented Feb 6, 2026

Dataset creation script.

For each institution (IKEM, FTN, and KNL PATOS), the script creates a dataset with slide_id, case_id, path, and nancy (label) columns.

Closes IBD-16
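The flow described above — scan an institution folder for matching slides, then join them with case-level labels — could be sketched roughly like this. The column names and the Nancy-label join come from the description; the pattern, file layout, and helper name are assumptions for illustration:

```python
import re
from pathlib import Path

import pandas as pd

# Hypothetical single-institution pattern; the real patterns live in the
# per-institution YAML configs.
SLIDE_PATTERN = re.compile(r"(?P<case_id>[0-9]{1,5})_2[1-4]_HE\.czi")


def build_dataset(folder: Path, labels_df: pd.DataFrame) -> pd.DataFrame:
    """Match slide files in `folder` and left-join Nancy labels on case_id (sketch)."""
    rows = []
    for path in folder.iterdir():
        m = SLIDE_PATTERN.fullmatch(path.name)
        if m:
            rows.append(
                {"slide_id": path.stem, "case_id": m["case_id"], "path": str(path)}
            )
    slides_df = pd.DataFrame(rows, columns=["slide_id", "case_id", "path"])
    # labels_df is assumed to carry one Nancy score per case_id.
    return slides_df.merge(labels_df[["case_id", "nancy"]], on="case_id", how="left")
```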

Dependency graph:

                         +--------------+
                  -------| tissue-masks |<------+           +------------+      +----------------------+
                 /       +--------------+       |       +---| tile-masks |<-----| preprocessing-report |
                /                               |       |   +------------+      +----------------------+
+---------+    /                            +--------+  |
| dataset | <-+                             | tiling |<-+
+---------+    \                            +--------+  |
                \                               |       |   +------------+
                 \       +-----------------+    |       +---| embeddings |
                  -------| quality-control |<---+           +------------+
                         +-----------------+

Summary by CodeRabbit

  • New Features

    • Added dataset creation module for building labeled slide datasets from folder structures
    • Introduced support for multiple institutional dataset sources (FTN, IKEM, KNL_PATOS)
    • Added preprocessing pipeline configuration for ulcerative colitis research workflow
  • Chores

    • Restructured configuration architecture with new base configuration
    • Updated development tooling and project settings

@Adames4 Adames4 self-assigned this Feb 6, 2026
@coderabbitai

coderabbitai bot commented Feb 6, 2026

📝 Walkthrough

Walkthrough

This PR introduces a preprocessing pipeline for ulcerative colitis dataset creation, adding Hydra configuration scaffolding for three institutions (FTN, IKEM, KNL_PATOS), a Python module to load and join labels with slides, and a Kubernetes job submission script, while restructuring the configuration system.

Changes

Cohort / File(s) Summary
Hydra Configuration & Defaults
configs/base.yaml, configs/preprocessing.yaml, configs/preprocessing/create_dataset.yaml, configs/default.yaml
Restructured Hydra config with new base.yaml setting defaults and experiment metadata; added preprocessing.yaml scaffolding; created create_dataset.yaml with run metadata; removed legacy default.yaml.
Dataset Institution Configurations
configs/dataset/raw/ftn.yaml, configs/dataset/raw/ikem.yaml, configs/dataset/raw/knl_patos.yaml
Added three institution-specific dataset configs defining folder paths, regex patterns for slide filenames, and label file references.
Preprocessing Module
preprocessing/create_dataset.py
New module with get_labels, get_slides, and create_dataset functions to load label files, match slide patterns, join on case_id, and filter/normalize data, plus a main() orchestrator integrating MLflow logging and artifact writing.
Job Scheduling
scripts/preprocessing/create_dataset.py
New script submitting Kubernetes job for dataset creation with specified resources, git clone/setup, and preprocessing invocation.
Development Configuration
.gitignore, .mypy.ini
Added playground.ipynb to .gitignore; updated .mypy.ini to exclude scripts directory from type checking.
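The .mypy.ini change summarized above would look roughly like this (a sketch based on standard mypy config syntax, not the exact file contents):

```ini
[mypy]
# Operational scripts are excluded from type checking.
exclude = scripts/
```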

Sequence Diagram

sequenceDiagram
    participant CLI as CLI Entry
    participant Main as main()
    participant CreateDS as create_dataset()
    participant GetLabels as get_labels()
    participant GetSlides as get_slides()
    participant FileSystem as FileSystem
    participant MLFlow as MLFlowLogger

    CLI->>Main: invoke with config
    Main->>CreateDS: call with config params
    CreateDS->>GetLabels: load label files
    GetLabels->>FileSystem: read CSV/Excel
    FileSystem-->>GetLabels: label dataframes
    GetLabels-->>CreateDS: normalized labels
    CreateDS->>GetSlides: filter slides by pattern
    GetSlides->>FileSystem: scan folder, match regex
    FileSystem-->>GetSlides: matching slides
    GetSlides-->>CreateDS: slide dataframe
    CreateDS->>CreateDS: join on case_id
    CreateDS->>CreateDS: filter/validate
    CreateDS-->>Main: dataset + missing lists
    Main->>FileSystem: write dataset.csv
    Main->>MLFlow: log dataset artifact
    alt missing_slides non-empty
        Main->>MLFlow: log missing_slides.txt
    end
    alt missing_labels non-empty
        Main->>MLFlow: log missing_labels.txt
    end
    MLFlow-->>Main: logged
    Main-->>CLI: complete

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A dataset creation spell is cast,
With labels joined to slides so fast,
From FTN, IKEM, and KNL we blend,
Three institutions' data, preprocessed end-to-end! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'feat: dataset' is vague and overly generic, failing to describe the specific nature of the changes or what aspect of dataset functionality is being added. Consider a more descriptive title such as 'feat: add dataset creation script for multiple institutions' to clearly communicate the main purpose of the changes.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Merge Conflict Detection ✅ Passed ✅ No merge conflicts detected when merging into master

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@gemini-code-assist

Summary of Changes

Hello @Adames4, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust framework for creating and managing datasets crucial for the ulcerative colitis project. It centralizes data processing logic, configures institution-specific data sources, and integrates with a Kubernetes job submission system for efficient execution. The changes streamline the initial data preparation phase, making it more scalable and maintainable for future development and experimentation.

Highlights

  • Dataset Creation Script: Introduced a new Python script (preprocessing/create_dataset.py) to generate structured datasets for different institutions (IKEM, FTN, KNL PATOS) by processing raw data folders and label files. This script extracts slide and case IDs, joins them with Nancy scores, and handles institution-specific data nuances.
  • Configuration Management: Implemented a new configuration structure using Hydra, including configs/base.yaml, configs/preprocessing.yaml, and institution-specific data configurations (configs/data/raw/*.yaml). This modular setup allows easy management of experiment parameters and data sources.
  • Project Dependency and Environment Updates: The pyproject.toml file has been significantly updated to reflect the project's new name ('ulcerative-colitis'), author, Python version requirement (3.12), and a comprehensive list of dependencies. It also introduces uv for dependency management and defines dev and run dependency groups.
  • Kubernetes Job Submission Script: Added a utility script (scripts/preprocessing/create_dataset.py) to submit the dataset creation process as a Kubernetes job, enabling scalable and managed execution of data preprocessing tasks for different cohorts.
  • Code Quality and Ignored Files: Updated .mypy.ini to exclude the scripts directory from type checking and added playground.ipynb to .gitignore to maintain a clean repository.
Changelog
  • .gitignore
    • Added 'playground.ipynb' to the list of ignored files.
  • .mypy.ini
    • Excluded the 'scripts' directory from MyPy type checking.
  • configs/base.yaml
    • Added a new base configuration file, defining default Hydra and MLflow logger settings, and a metadata field for 'experiment_name'.
  • configs/data/raw/ftn.yaml
    • Added a new configuration file for the FTN institution, specifying its data folder and label files.
  • configs/data/raw/ikem.yaml
    • Added a new configuration file for the IKEM institution, specifying its data folder and label files.
  • configs/data/raw/knl_patos.yaml
    • Added a new configuration file for the KNL PATOS institution, specifying its data folder and label files.
  • configs/default.yaml
    • Removed the old 'default.yaml' configuration file.
  • configs/preprocessing.yaml
    • Added a new preprocessing configuration file, inheriting from 'base.yaml' and defining the 'project_dir'.
  • configs/preprocessing/create_dataset.yaml
    • Added a new configuration file for the 'create_dataset' preprocessing step, defining 'run_name' and 'description' metadata.
  • preprocessing/create_dataset.py
    • Added a new Python script for creating datasets from raw data and labels.
    • Implemented functions to extract labels from Excel/CSV files and identify slides using institution-specific regex patterns.
    • Includes logic to join slide information with labels, handling case-level labels for IKEM and filtering out 'ileum' slides.
    • Logs the generated dataset, as well as lists of missing slides and labels, as MLflow artifacts.
    • Utilizes Hydra for configuration and rationai.mlkit for autologging.
  • pyproject.toml
    • Updated the project name from 'project-name' to 'ulcerative-colitis'.
    • Updated author information.
    • Changed the required Python version from '>=3.11' to '>=3.12,<3.13'.
    • Revised and expanded project dependencies, including new packages like 'aiohttp', 'mlflow', 'omegaconf', 'openpyxl', 'pandas', 'pyvips', 'ray', 'torch', and 'torchmetrics'.
    • Removed the 'tool.pdm.dev-dependencies' and 'tool.pdm.scripts' sections.
    • Introduced 'dependency-groups' for 'dev' (mypy, ruff, ipykernel) and 'run' (rationai-kube-jobs).
    • Added 'tool.uv.sources' to specify git sources for 'rationai-mlkit' and 'rationai-kube-jobs'.
  • scripts/preprocessing/create_dataset.py
    • Added a new script to submit dataset creation jobs to a Kubernetes cluster using the 'kube_jobs' library.
    • Allows selection of the cohort (IKEM, FTN, or KNL PATOS) and configures job parameters such as CPU, memory, and storage.
    • Includes commands for cloning the repository, syncing dependencies with uv, and running the dataset creation script.
Activity
  • Adames4 created this pull request to implement a new dataset creation feature.
  • The pull request addresses and closes issue IBD-16, titled 'Create Dataset MR'.
  • The changes introduce a new 'dataset' component into the project's dependency graph, which is a prerequisite for 'tiling', 'quality-control', and 'tissue-masks'.
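The "missing slides" and "missing labels" bookkeeping mentioned in the changelog can be sketched with a pandas outer merge using the `indicator` flag. This is a minimal illustration with hypothetical names, not the module's actual implementation:

```python
import pandas as pd


def split_join(slides_df: pd.DataFrame, labels_df: pd.DataFrame):
    """Outer-join on case_id and report unmatched rows on both sides (sketch)."""
    merged = slides_df.merge(labels_df, on="case_id", how="outer", indicator=True)
    # "both" -> matched dataset rows; "left_only" -> slides with no label;
    # "right_only" -> labels with no slide file.
    dataset = merged[merged["_merge"] == "both"].drop(columns="_merge")
    missing_labels = merged.loc[merged["_merge"] == "left_only", "case_id"].tolist()
    missing_slides = merged.loc[merged["_merge"] == "right_only", "case_id"].tolist()
    return dataset, missing_labels, missing_slides
```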


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a dataset creation script and its associated configurations. The changes are generally well-structured, but I have identified a few areas for improvement regarding code robustness, clarity, and maintainability. My feedback includes suggestions to fix a typo in a configuration file, improve regular expressions, handle potential file system issues, refactor duplicated code, and make a utility script more robust. Please see the detailed comments below.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@configs/preprocessing.yaml`:
- Line 9: The project_dir value contains a typo: change the string for the
project_dir key from
"/mnt/projects/inflammatory_bowel_dissease/ulcerative_colitis" to
"/mnt/projects/inflammatory_bowel_disease/ulcerative_colitis" (fix "dissease" →
"disease") in configs/preprocessing.yaml; verify no other configs reference the
misspelled path and update any references to the project_dir key or its value
(e.g., CI mounts, deployment manifests) accordingly.

In `@preprocessing/create_dataset.py`:
- Around line 53-57: The index string operations on labels_df assume a string
dtype and can fail if the index is numeric; convert the index to strings first
(e.g., via labels_df.index = labels_df.index.astype(str)) before applying
.str.lstrip, .str.strip, and .str.replace in create_dataset.py so the
transformations on labels_df.index always use the string accessor safely.

In `@scripts/preprocessing/create_dataset.py`:
- Line 8: The current username=... placeholder in create_dataset.py is an
Ellipsis literal and must be replaced with a real value; update the code that
calls submit_job to read the username from a configuration source (e.g., an
environment variable like KUBE_USER or a config file) instead of using the
Ellipsis, validate it is non-empty and raise a clear error if missing so
submit_job receives a real string username; look for the username assignment and
the submit_job(...) invocation to make this change (replace the placeholder with
env/config lookup and a runtime check).
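The environment-lookup-with-validation suggested in the last item could look like the following. The variable name KUBE_USER and the helper name are assumptions, not the project's actual code:

```python
import os


def get_username() -> str:
    """Read the Kubernetes username from the environment (sketch; env var name assumed)."""
    username = os.environ.get("KUBE_USER", "").strip()
    if not username:
        raise RuntimeError("KUBE_USER is not set; export it before submitting the job.")
    return username
```

The returned value would then be passed to submit_job in place of the Ellipsis placeholder.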
🧹 Nitpick comments (4)
.mypy.ini (1)

5-6: Broad mypy exclusion for scripts/.

Excluding the entire scripts directory means any future scripts added there will also bypass type checking. If only the current script needs exclusion, consider a more targeted pattern (e.g., exclude = scripts/preprocessing/). Otherwise, this is fine as a pragmatic choice for operational scripts.

preprocessing/create_dataset.py (1)

12-36: Minor regex inconsistencies.

  1. The ikem pattern uses a ^ anchor (line 20), which is redundant since fullmatch() is used on line 65 (it implicitly anchors both ends). The other patterns omit it. Consider removing ^ for consistency.
  2. The knl_patos pattern (line 35) uses a capturing group (0[1-9]|1[0-2]) — use a non-capturing group (?:0[1-9]|1[0-2]) to match the style of the ikem pattern.
Suggested diff
-    "ikem": re.compile(r"^[0-9]{1,5}_2[1-4]_HE(?:_0[1-6])?\.czi"),
+    "ikem": re.compile(r"[0-9]{1,5}_2[1-4]_HE(?:_0[1-6])?\.czi"),
-    "knl_patos": re.compile(r"[0-9]{1,5}_25_[A-F]_HE(0[1-9]|1[0-2])\.czi"),
+    "knl_patos": re.compile(r"[0-9]{1,5}_25_[A-F]_HE(?:0[1-9]|1[0-2])\.czi"),
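The claim that fullmatch() makes the ^ anchor redundant is easy to verify with a minimal check, independent of the project's code:

```python
import re

anchored = re.compile(r"^[0-9]{1,5}_2[1-4]_HE\.czi")
unanchored = re.compile(r"[0-9]{1,5}_2[1-4]_HE\.czi")

name = "prefix_123_21_HE.czi"
# fullmatch requires the entire string to match, so both patterns reject
# the prefixed filename regardless of the ^ anchor ...
assert anchored.fullmatch(name) is None
assert unanchored.fullmatch(name) is None
# ... whereas search() without an anchor would accept it.
assert unanchored.search(name) is not None
```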
pyproject.toml (1)

16-17: rationai-mlkit has no version constraint.

The dependency is declared without a version specifier and points to a git main branch. Any breaking change upstream will silently break this project. Consider pinning to a specific tag, commit, or version range.

-rationai-mlkit = { git = "https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/mlkit.git" }
+rationai-mlkit = { git = "https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/mlkit.git", rev = "<commit-or-tag>" }
scripts/preprocessing/create_dataset.py (1)

4-4: COHORT is hardcoded — consider parameterizing.

Users must edit the script source to change the institution. A lightweight improvement would be to accept it as a CLI argument.

+import sys
+
-COHORT = "ikem"  # "ikem", "ftn", or "knl_patos"
+VALID_COHORTS = {"ikem", "ftn", "knl_patos"}
+if len(sys.argv) < 2 or sys.argv[1] not in VALID_COHORTS:
+    print(f"Usage: python {sys.argv[0]} <{'|'.join(VALID_COHORTS)}>")
+    sys.exit(1)
+COHORT = sys.argv[1]


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@preprocessing/create_dataset.py`:
- Around line 84-96: The KeyError comes from treating "slide_id" as a column
when it is the DataFrame index; update the missing_slides computation to handle
both cases by using the index when on == "slide_id" (e.g., take
dataset_df[dataset_df["path"].isna()].index.to_list()) and keep using
dataset_df[...][on].to_list() when on == "case_id"; locate the logic around the
variables on, slides_df, dataset_df and the missing_slides line and implement
this conditional branch or normalize the DataFrame (reset_index to make slide_id
a column) before computing missing_slides so the lookup never assumes a
non-existent column.
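The conditional branch suggested above could be sketched as follows; the variable names follow the comment, but the DataFrame shapes are hypothetical:

```python
import pandas as pd


def missing_slides(dataset_df: pd.DataFrame, on: str) -> list:
    """List keys whose path is NaN, whether `on` is the index or a column (sketch)."""
    unmatched = dataset_df[dataset_df["path"].isna()]
    if on == dataset_df.index.name:
        # slide_id lives in the index, not in a column.
        return unmatched.index.to_list()
    return unmatched[on].to_list()
```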
🧹 Nitpick comments (1)
preprocessing/create_dataset.py (1)

25-25: FTN regex is missing the $ end anchor.

The ikem and knl_patos patterns both end with $, but the ftn pattern does not. While this is not a functional bug because fullmatch is used on line 65 (which implicitly anchors both ends), the inconsistency could mislead a future maintainer who uses match or search instead.

Suggested fix
-    "ftn": re.compile(r"^[0-9]{1,6}_2[0-5]\.czi"),
+    "ftn": re.compile(r"^[0-9]{1,6}_2[0-5]\.czi$"),

@Adames4 Adames4 requested a review from vejtek February 11, 2026 08:47
This was referenced Feb 11, 2026
@Adames4 Adames4 requested a review from vejtek February 13, 2026 15:23

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@configs/dataset/raw/knl_patos.yaml`:
- Around line 1-17: The file ending lacks a trailing newline which can cause
diffs and parser warnings; open the YAML file containing the keys institution,
folder, regex_pattern, and labels and ensure the file terminates with a single
newline character (i.e., add a final blank line at EOF) so the file ends with
'\n'.

In `@preprocessing/create_dataset.py`:
- Around line 12-26: The function get_labels builds dfs and then calls
pd.concat(dfs) which will raise ValueError if labels is empty; add an early
guard after the loop (or at function start) that checks if dfs is empty and
returns an empty DataFrame (e.g., return pd.DataFrame()) to avoid pd.concat([])
failing. Locate get_labels and the dfs list and implement the empty-check before
calling pd.concat.
- Line 35: Update the function type annotations to include the generic type
parameter for regex patterns to satisfy the linter: change any parameter typed
as re.Pattern to re.Pattern[str] (e.g., in get_slides(pattern: re.Pattern[str])
and the other function at the other occurrence on line 51), ensuring you keep
the existing re import; no logic changes required.
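The empty-input guard for get_labels described in the first item could look like this; the loading loop is simplified to Excel only, whereas the real code also handles CSV:

```python
import pandas as pd


def get_labels(label_files: list) -> pd.DataFrame:
    """Concatenate per-file label frames, tolerating an empty file list (sketch)."""
    dfs = [pd.read_excel(f) for f in label_files]  # real loader also handles CSV
    if not dfs:
        # pd.concat([]) raises ValueError, so return an empty frame instead.
        return pd.DataFrame()
    return pd.concat(dfs)
```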
🧹 Nitpick comments (2)
configs/dataset/raw/knl_patos.yaml (1)

12-12: Year 25 is hardcoded in the regex — confirm this is intentional.

The pattern only matches 2025 filenames. If future data arrives for 2026+, this config will silently ignore those files. If this is expected (one config per year/batch), this is fine; otherwise consider a more flexible year group.

preprocessing/create_dataset.py (1)

62-65: NaN values in lokalita column will pass the != "ileum" filter.

NaN != "ileum" evaluates to True in pandas, so rows with missing lokalita values will be retained. If that's intentional (these rows get filtered later by dropna on line 72 if they also lack nancy or path), consider adding a comment to clarify. Otherwise, use dataset_df["lokalita"].str.lower() != "ileum" or add an explicit NaN check.

Comment on lines +1 to +17
institution: knl_patos
folder: /mnt/data/KNL_PATOS/colon/IBD_AI
# [0-9]{1,5} - case ID (1 to 5 digits) (in year scope)
# _ - underscore separator
# 25 - year 2025
# _ - underscore separator
# [A-F] - block identifier (A to F)
# _ - underscore separator
# HE - stain type
# (0[1-9]|1[0-2]) - slide number (01 to 12)
# .czi - file extension
regex_pattern: ^[0-9]{1,5}_25_[A-F]_HE(0[1-9]|1[0-2])\.czi$
labels:
- IBD_AI_Liberec.xlsx
- IBD_AI_Liberec_02.xlsx
- IBD_AI_Liberec_10_2025.xlsx
- IBD_AI_Liberec_28_10_2025.xlsx No newline at end of file

⚠️ Potential issue | 🟡 Minor

Missing newline at end of file.

Most editors and POSIX tools expect a trailing newline. Add one to avoid diff noise and potential YAML parser warnings.
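A portable way to apply the fix is a generic shell snippet like the one below (not project tooling; the function name is made up):

```shell
# Append a final newline only when the file does not already end with one.
fix_trailing_newline() {
  f="$1"
  # $(tail -c 1 ...) is empty iff the last byte is a newline, because
  # command substitution strips trailing newlines.
  if [ -s "$f" ] && [ -n "$(tail -c 1 "$f")" ]; then
    printf '\n' >> "$f"
  fi
}

fix_trailing_newline configs/dataset/raw/knl_patos.yaml
```

The function is idempotent, so running it twice leaves the file unchanged.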


@vejtek vejtek merged commit 2ad25ac into master Feb 13, 2026
2 of 3 checks passed
