📝 Walkthrough

This PR introduces a preprocessing pipeline for ulcerative colitis dataset creation, adding Hydra configuration scaffolding for three institutions (FTN, IKEM, KNL_PATOS), a Python module to load and join labels with slides, and a Kubernetes job submission script, while restructuring the configuration system.
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as CLI Entry
    participant Main as main()
    participant CreateDS as create_dataset()
    participant GetLabels as get_labels()
    participant GetSlides as get_slides()
    participant FileSystem as FileSystem
    participant MLFlow as MLFlowLogger
    CLI->>Main: invoke with config
    Main->>CreateDS: call with config params
    CreateDS->>GetLabels: load label files
    GetLabels->>FileSystem: read CSV/Excel
    FileSystem-->>GetLabels: label dataframes
    GetLabels-->>CreateDS: normalized labels
    CreateDS->>GetSlides: filter slides by pattern
    GetSlides->>FileSystem: scan folder, match regex
    FileSystem-->>GetSlides: matching slides
    GetSlides-->>CreateDS: slide dataframe
    CreateDS->>CreateDS: join on case_id
    CreateDS->>CreateDS: filter/validate
    CreateDS-->>Main: dataset + missing lists
    Main->>FileSystem: write dataset.csv
    Main->>MLFlow: log dataset artifact
    alt missing_slides non-empty
        Main->>MLFlow: log missing_slides.txt
    end
    alt missing_labels non-empty
        Main->>MLFlow: log missing_labels.txt
    end
    MLFlow-->>Main: logged
    Main-->>CLI: complete
```
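The flow in the diagram can be sketched in miniature. This is a hypothetical illustration only: the function names (`create_dataset`, `get_labels`, `get_slides`) and column names come from the walkthrough, but the bodies here are made-up stand-ins for the real CSV/Excel reads and folder scans.

```python
import pandas as pd

def get_labels() -> pd.DataFrame:
    # Stand-in for reading CSV/Excel label files from disk.
    return pd.DataFrame({"case_id": ["1", "2"], "nancy": [0, 2]})

def get_slides() -> pd.DataFrame:
    # Stand-in for scanning a folder and regex-matching slide filenames.
    return pd.DataFrame({"case_id": ["1", "3"], "path": ["a.czi", "c.czi"]})

def create_dataset() -> tuple[pd.DataFrame, list[str], list[str]]:
    labels_df = get_labels()
    slides_df = get_slides()
    # An outer join keeps unmatched rows so the missing lists can be derived.
    dataset_df = slides_df.merge(labels_df, on="case_id", how="outer")
    missing_labels = dataset_df[dataset_df["nancy"].isna()]["case_id"].to_list()
    missing_slides = dataset_df[dataset_df["path"].isna()]["case_id"].to_list()
    return dataset_df.dropna(), missing_slides, missing_labels
```

The returned triple mirrors the "dataset + missing lists" arrow in the diagram: rows present in both sources survive `dropna`, and the unmatched IDs feed the `missing_slides.txt` / `missing_labels.txt` artifacts.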
Estimated Code Review Effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Summary of Changes

Hello @Adames4, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a robust framework for creating and managing datasets crucial for the ulcerative colitis project. It centralizes data processing logic, configures institution-specific data sources, and integrates with a Kubernetes job submission system for efficient execution. The changes streamline the initial data preparation phase, making it more scalable and maintainable for future development and experimentation.
Code Review
This pull request introduces a dataset creation script and its associated configurations. The changes are generally well-structured, but I have identified a few areas for improvement regarding code robustness, clarity, and maintainability. My feedback includes suggestions to fix a typo in a configuration file, improve regular expressions, handle potential file system issues, refactor duplicated code, and make a utility script more robust. Please see the detailed comments below.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@configs/preprocessing.yaml`:
- Line 9: The project_dir value contains a typo: change the string for the
project_dir key from
"/mnt/projects/inflammatory_bowel_dissease/ulcerative_colitis" to
"/mnt/projects/inflammatory_bowel_disease/ulcerative_colitis" (fix "dissease" →
"disease") in configs/preprocessing.yaml; verify no other configs reference the
misspelled path and update any references to the project_dir key or its value
(e.g., CI mounts, deployment manifests) accordingly.
In `@preprocessing/create_dataset.py`:
- Around line 53-57: The index string operations on labels_df assume a string
dtype and can fail if the index is numeric; convert the index to strings first
(e.g., via labels_df.index = labels_df.index.astype(str)) before applying
.str.lstrip, .str.strip, and .str.replace in create_dataset.py so the
transformations on labels_df.index always use the string accessor safely.
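A minimal before/after sketch of this fix. The `labels_df` frame and its numeric index here are invented; only the `astype(str)` step mirrors the suggested change.

```python
import pandas as pd

# Hypothetical labels frame whose index happens to be numeric.
labels_df = pd.DataFrame({"nancy": [1, 2]}, index=[101, 205])

# labels_df.index.str.strip() would raise AttributeError on a numeric
# index; casting to str first makes the .str accessor always available.
labels_df.index = labels_df.index.astype(str)
labels_df.index = labels_df.index.str.strip().str.lstrip("0")
```

With the cast in place, the same chain of `.str` operations works regardless of whether the label files produced a string or integer index.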
In `@scripts/preprocessing/create_dataset.py`:
- Line 8: The current username=... placeholder in create_dataset.py is an
Ellipsis literal and must be replaced with a real value; update the code that
calls submit_job to read the username from a configuration source (e.g., an
environment variable like KUBE_USER or a config file) instead of using the
Ellipsis, validate it is non-empty and raise a clear error if missing so
submit_job receives a real string username; look for the username assignment and
the submit_job(...) invocation to make this change (replace the placeholder with
env/config lookup and a runtime check).
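One possible implementation of this suggestion, sketched under the assumption that an environment variable is the configuration source; `KUBE_USER` is the example name from the comment, and the real `submit_job` call site is not reproduced here.

```python
import os

def get_username() -> str:
    """Read the Kubernetes username from the environment, failing fast if unset."""
    username = os.environ.get("KUBE_USER", "").strip()
    if not username:
        raise RuntimeError(
            "KUBE_USER is not set; export it before submitting the job"
        )
    return username
```

The caller would then pass `username=get_username()` to `submit_job(...)` instead of the `...` placeholder, so a missing value fails with a clear message rather than an opaque type error deep in the submission code.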
🧹 Nitpick comments (4)
.mypy.ini (1)

5-6: Broad mypy exclusion for `scripts/`.

Excluding the entire `scripts` directory means any future scripts added there will also bypass type checking. If only the current script needs exclusion, consider a more targeted pattern (e.g., `exclude = scripts/preprocessing/`). Otherwise, this is fine as a pragmatic choice for operational scripts.

preprocessing/create_dataset.py (1)
12-36: Minor regex inconsistencies.

- The `ikem` pattern uses a `^` anchor (line 20), which is redundant since `fullmatch()` is used on line 65 (it implicitly anchors both ends). The other patterns omit it. Consider removing `^` for consistency.
- The `knl_patos` pattern (line 35) uses a capturing group `(0[1-9]|1[0-2])`; use a non-capturing group `(?:0[1-9]|1[0-2])` to match the style of the `ikem` pattern.

Suggested diff:

```diff
- "ikem": re.compile(r"^[0-9]{1,5}_2[1-4]_HE(?:_0[1-6])?\.czi"),
+ "ikem": re.compile(r"[0-9]{1,5}_2[1-4]_HE(?:_0[1-6])?\.czi"),
- "knl_patos": re.compile(r"[0-9]{1,5}_25_[A-F]_HE(0[1-9]|1[0-2])\.czi"),
+ "knl_patos": re.compile(r"[0-9]{1,5}_25_[A-F]_HE(?:0[1-9]|1[0-2])\.czi"),
```

pyproject.toml (1)
16-17: `rationai-mlkit` has no version constraint.

The dependency is declared without a version specifier and points to a git main branch. Any breaking change upstream will silently break this project. Consider pinning to a specific tag, commit, or version range.

```diff
-rationai-mlkit = { git = "https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/mlkit.git" }
+rationai-mlkit = { git = "https://gitlab.ics.muni.cz/rationai/digital-pathology/libraries/mlkit.git", rev = "<commit-or-tag>" }
```

scripts/preprocessing/create_dataset.py (1)
4-4: COHORT is hardcoded; consider parameterizing.

Users must edit the script source to change the institution. A lightweight improvement would be to accept it as a CLI argument.

```diff
+import sys
+
-COHORT = "ikem"  # "ikem", "ftn", or "knl_patos"
+VALID_COHORTS = {"ikem", "ftn", "knl_patos"}
+if len(sys.argv) < 2 or sys.argv[1] not in VALID_COHORTS:
+    print(f"Usage: python {sys.argv[0]} <{'|'.join(VALID_COHORTS)}>")
+    sys.exit(1)
+COHORT = sys.argv[1]
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@preprocessing/create_dataset.py`:
- Around line 84-96: The KeyError comes from treating "slide_id" as a column
when it is the DataFrame index; update the missing_slides computation to handle
both cases by using the index when on == "slide_id" (e.g., take
dataset_df[dataset_df["path"].isna()].index.to_list()) and keep using
dataset_df[...][on].to_list() when on == "case_id"; locate the logic around the
variables on, slides_df, dataset_df and the missing_slides line and implement
this conditional branch or normalize the DataFrame (reset_index to make slide_id
a column) before computing missing_slides so the lookup never assumes a
non-existent column.
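A sketch of the conditional branch this prompt describes. The variable names (`dataset_df`, `on`, `path`) follow the review comment; the frames used below are invented for illustration.

```python
import pandas as pd

def list_missing_slides(dataset_df: pd.DataFrame, on: str) -> list:
    mask = dataset_df["path"].isna()
    if on == "slide_id":
        # slide_id lives on the index, not in a column, so read the index.
        return dataset_df[mask].index.to_list()
    # e.g. on == "case_id", which is a regular column.
    return dataset_df[mask][on].to_list()
```

Alternatively, normalizing with `dataset_df.reset_index()` before the lookup makes `slide_id` an ordinary column, so a single code path works for both join keys.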
🧹 Nitpick comments (1)
preprocessing/create_dataset.py (1)

25-25: FTN regex is missing the `$` end anchor.

The `ikem` and `knl_patos` patterns both end with `$`, but the `ftn` pattern does not. While this is not a functional bug because `fullmatch` is used on line 65 (which implicitly anchors both ends), the inconsistency could mislead a future maintainer who uses `match` or `search` instead.

Suggested fix:

```diff
- "ftn": re.compile(r"^[0-9]{1,6}_2[0-5]\.czi"),
+ "ftn": re.compile(r"^[0-9]{1,6}_2[0-5]\.czi$"),
```
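The anchoring behavior this comment relies on is easy to verify: `fullmatch()` implicitly anchors at both ends, while `match()` only anchors the start. The pattern below is the PR's `ftn` pattern; the filenames are made up.

```python
import re

# The ftn pattern as it appears in the PR: ^ at the start, no $ at the end.
ftn = re.compile(r"^[0-9]{1,6}_2[0-5]\.czi")

# fullmatch() requires the whole string to match, so the missing $ is harmless.
assert ftn.fullmatch("1234_23.czi") is not None
assert ftn.fullmatch("1234_23.czi.bak") is None

# match() only anchors the start, so without $ a trailing suffix slips through.
assert ftn.match("1234_23.czi.bak") is not None
```

This is why the reviewer calls the missing `$` a maintainability issue rather than a bug: the code is correct today, but only as long as `fullmatch` stays in place.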
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@configs/dataset/raw/knl_patos.yaml`:
- Around line 1-17: The file ending lacks a trailing newline which can cause
diffs and parser warnings; open the YAML file containing the keys institution,
folder, regex_pattern, and labels and ensure the file terminates with a single
newline character (i.e., add a final blank line at EOF) so the file ends with
'\n'.
In `@preprocessing/create_dataset.py`:
- Around line 12-26: The function get_labels builds dfs and then calls
pd.concat(dfs) which will raise ValueError if labels is empty; add an early
guard after the loop (or at function start) that checks if dfs is empty and
returns an empty DataFrame (e.g., return pd.DataFrame()) to avoid pd.concat([])
failing. Locate get_labels and the dfs list and implement the empty-check before
calling pd.concat.
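A minimal version of the suggested guard. The per-file DataFrame built here is a placeholder for the real Excel/CSV read in `get_labels`.

```python
import pandas as pd

def get_labels(labels: list[str]) -> pd.DataFrame:
    # Placeholder: the real function reads each label file from disk.
    dfs = [pd.DataFrame({"file": [name]}) for name in labels]
    if not dfs:
        # pd.concat([]) raises ValueError("No objects to concatenate"),
        # so return an empty frame for an empty labels list instead.
        return pd.DataFrame()
    return pd.concat(dfs)
```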
- Line 35: Update the function type annotations to include the generic type
parameter for regex patterns to satisfy the linter: change any parameter typed
as re.Pattern to re.Pattern[str] (e.g., in get_slides(pattern: re.Pattern[str])
and the other function at the other occurrence on line 51), ensuring you keep
the existing re import; no logic changes required.
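For illustration, a function annotated the way the prompt suggests (requires Python 3.9+ for `re.Pattern[str]` at runtime). The folder-scan body is a stand-in; only the annotation style is the point.

```python
import re

def get_slides(pattern: re.Pattern[str]) -> list[str]:
    # Stand-in for scanning a folder; only names matching the whole
    # pattern are kept, mirroring the use of fullmatch() in the module.
    names = ["1_21_HE.czi", "notes.txt"]
    return [name for name in names if pattern.fullmatch(name)]
```

The generic parameter tells the type checker whether the pattern matches `str` or `bytes`; `re.compile(r"...")` on a string literal produces a `re.Pattern[str]`, so the annotation costs nothing at the call sites.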
🧹 Nitpick comments (2)
configs/dataset/raw/knl_patos.yaml (1)
12-12: Year25is hardcoded in the regex — confirm this is intentional.The pattern only matches 2025 filenames. If future data arrives for 2026+, this config will silently ignore those files. If this is expected (one config per year/batch), this is fine; otherwise consider a more flexible year group.
preprocessing/create_dataset.py (1)

62-65: NaN values in the `lokalita` column will pass the `!= "ileum"` filter.

`NaN != "ileum"` evaluates to `True` in pandas, so rows with missing `lokalita` values will be retained. If that's intentional (these rows get filtered later by `dropna` on line 72 if they also lack `nancy` or `path`), consider adding a comment to clarify. Otherwise, use `dataset_df["lokalita"].str.lower() != "ileum"` or add an explicit NaN check.
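The pitfall is easy to reproduce; the frame below is invented, and only the `lokalita` column name comes from the comment.

```python
import pandas as pd

df = pd.DataFrame({"lokalita": ["ileum", "colon", None]})

# The missing-value row compares unequal to "ileum", so it survives.
kept = df[df["lokalita"] != "ileum"]
assert len(kept) == 2

# An explicit notna() check makes the intent unambiguous.
kept_strict = df[(df["lokalita"] != "ileum") & df["lokalita"].notna()]
assert len(kept_strict) == 1
```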
```yaml
institution: knl_patos
folder: /mnt/data/KNL_PATOS/colon/IBD_AI
# [0-9]{1,5}      - case ID (1 to 5 digits) (in year scope)
# _               - underscore separator
# 25              - year 2025
# _               - underscore separator
# [A-F]           - block identifier (A to F)
# _               - underscore separator
# HE              - stain type
# (0[1-9]|1[0-2]) - slide number (01 to 12)
# .czi            - file extension
regex_pattern: ^[0-9]{1,5}_25_[A-F]_HE(0[1-9]|1[0-2])\.czi$
labels:
  - IBD_AI_Liberec.xlsx
  - IBD_AI_Liberec_02.xlsx
  - IBD_AI_Liberec_10_2025.xlsx
  - IBD_AI_Liberec_28_10_2025.xlsx
```

(No newline at end of file.)
Missing newline at end of file.
Most editors and POSIX tools expect a trailing newline. Add one to avoid diff noise and potential YAML parser warnings.
🤖 Prompt for AI Agents
In `@configs/dataset/raw/knl_patos.yaml` around lines 1 - 17, The file ending
lacks a trailing newline which can cause diffs and parser warnings; open the
YAML file containing the keys institution, folder, regex_pattern, and labels and
ensure the file terminates with a single newline character (i.e., add a final
blank line at EOF) so the file ends with '\n'.
Dataset creation script.

For each institution (IKEM, FTN, and KNL PATOS) the script creates a dataset with `slide_id`, `case_id`, `path`, and the `nancy` label.

Closes IBD-16
Dependency graph:
Summary by CodeRabbit
New Features
Chores