Conversation
📝 Walkthrough

Adds a quality-control workflow: new QC config, an async Python QC module with MLFlow integration, a kube job submission script, a dependency addition, and minor YAML/INI formatting fixes across dataset configs (no functional changes to dataset entries).

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    actor User
    participant Hydra as Hydra Config
    participant Main as main()
    participant MLFlow as MLFlow Logger
    participant QC as qc_main()
    participant Async as AsyncClient
    participant Slides as Slide Storage
    participant FS as Filesystem
    participant CSV as CSV Merge
    User->>Hydra: load config
    Hydra->>Main: provide DictConfig
    Main->>MLFlow: initialize logger / autolog
    Main->>Main: download_dataset(uri)
    Main->>FS: prepare output dir
    Main->>QC: start qc_main(...)
    QC->>Async: create AsyncClient (request_timeout, max_concurrent)
    Async->>Slides: fetch/process slides (concurrent)
    Slides-->>Async: return masks, metrics, errors
    Async->>FS: write per-slide CSVs and mask files
    QC->>FS: organize_masks into artifact subdirs
    QC->>CSV: concatenate per-slide CSVs -> qc_metrics.csv
    CSV->>MLFlow: log qc_metrics.csv and artifacts
    MLFlow-->>User: artifacts stored / run complete
```
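The concurrent fan-out step in the diagram (an AsyncClient processing slides bounded by `max_concurrent`) can be sketched with a plain asyncio semaphore. This is an illustrative stand-in, not the rationai-sdk API: `check_slide`, `check_slides`, and the slide names here are invented for the sketch.

```python
import asyncio


async def check_slide(slide: str) -> dict:
    # Stand-in for a per-slide QC request; a real client would fetch
    # the slide and compute mask metrics here.
    await asyncio.sleep(0)
    return {"slide": slide, "ok": True}


async def check_slides(slides: list[str], max_concurrent: int = 4) -> list[dict]:
    # Bound concurrency the way a client's max_concurrent setting would.
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(slide: str) -> dict:
        async with sem:
            return await check_slide(slide)

    # gather preserves input order even though tasks finish out of order.
    return await asyncio.gather(*(bounded(s) for s in slides))


results = asyncio.run(check_slides([f"slide_{i}.tiff" for i in range(8)]))
```

The semaphore caps in-flight requests without serializing them, which matches the `max_concurrent` knob the QC step exposes.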
Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers
Poem
🚥 Pre-merge checks

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Summary of Changes

Hello @Adames4, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a quality control (QC) masks generation script to identify and flag potential issues in whole slide images. It adds necessary configurations, a Python script for QC processing, and integrates the …

Highlights
Changelog
Activity
Code Review
This pull request introduces a new quality control script for generating QC masks. The changes include new configuration files, the main processing script, and a job submission script. The code is generally well-structured. I've identified a potential bug related to handling empty CSV file lists and several areas for improvement regarding configuration, maintainability, and code consistency. There are also some placeholder values in configuration and scripts that need to be addressed.
Pull request overview
Adds a preprocessing “quality control” step that runs slide QC via the rationai-sdk, organizes generated mask outputs, and logs results/artifacts to MLflow, along with configs and dependency updates to support the new workflow.
Changes:
- Introduces `preprocessing/quality_control.py` to run async QC checks, reorganize mask files, merge per-slide CSV metrics, and log artifacts.
- Adds Hydra configuration for the QC step and "processed dataset" configs pointing QC at MLflow-hosted dataset CSV artifacts.
- Updates project dependencies/lockfile to include `rationai-sdk` (and transitive deps like `httpx`, `lz4`, `tenacity`).
Reviewed changes
Copilot reviewed 7 out of 9 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| `preprocessing/quality_control.py` | New QC implementation and artifact organization/aggregation logic. |
| `configs/preprocessing/quality_control.yaml` | QC step runtime/config parameters and MLflow metadata. |
| `configs/dataset/processed/ikem.yaml` | Processed dataset config pointing to an MLflow dataset artifact URI (currently TODO). |
| `configs/dataset/processed/ftn.yaml` | Same as above for FTN (currently TODO). |
| `configs/dataset/processed/knl_patos.yaml` | Same as above for KNL PATOS (currently TODO). |
| `scripts/preprocessing/quality_control.py` | Kube job submission template for running the QC step. |
| `pyproject.toml` | Adds `rationai-sdk` dependency and uv git source. |
| `uv.lock` | Locks `rationai-sdk` and new transitive dependencies. |
| `preprocessing/.gitkeep` | Placeholder file (present in PR metadata). |
Actionable comments posted: 2
🧹 Nitpick comments (4)
pyproject.toml (1)

16-16: New `rationai-sdk` dependency added, consistent with existing patterns.

The dependency follows the same no-version-pin + git-source pattern as the other `rationai-*` packages. Consider pinning to a specific tag or revision for reproducible builds, though this is consistent with how `rationai-mlkit` and `rationai-masks` are already declared.

Also applies to: 32-32
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` at line 16, The new dependency "rationai-sdk" is added without a pinned version which makes builds non-reproducible; update the pyproject.toml dependency declaration for "rationai-sdk" (the same list that contains "rationai-mlkit" and "rationai-masks") to pin it to a specific release tag or commit SHA (e.g., use a version constraint or a git URL with @<tag-or-sha>) so installs are deterministic and reproducible.

preprocessing/quality_control.py (2)
41-49: Consider using `removeprefix` instead of `replace` for stripping the mask prefix.

`str.replace` replaces all occurrences of the substring, not just the leading prefix. If the prefix string happens to appear elsewhere in the filename, the result would be mangled. `removeprefix` (Python 3.9+) is more precise and expresses intent better.

Suggested fix
```diff
- slide_name = file.name.replace(f"{mask_prefix}_", "")
+ slide_name = file.name.removeprefix(f"{mask_prefix}_")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preprocessing/quality_control.py` around lines 41 - 49, The code in organize_masks uses slide_name = file.name.replace(f"{mask_prefix}_", "") which can remove occurrences of the mask prefix anywhere in the filename; change this to use the string method removeprefix to only strip the leading prefix (e.g., compute slide_name = file.name.removeprefix(f"{mask_prefix}_")) so filenames that contain the prefix later are preserved before computing destination and calling file.rename.
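As a quick standalone illustration of the difference the comment describes (not the PR's code; the filename is invented):

```python
# A filename that happens to contain the mask prefix a second time.
mask_prefix = "qc"
name = "qc_sample_qc_01.png"

# str.replace strips every occurrence, mangling the slide name.
replaced = name.replace(f"{mask_prefix}_", "")  # "sample_01.png"

# str.removeprefix (Python 3.9+) strips only the leading prefix.
stripped = name.removeprefix(f"{mask_prefix}_")  # "sample_qc_01.png"
```

Only the `removeprefix` variant preserves the interior `qc_` that belongs to the slide name itself.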
71-75: Repeated file open/close for each failed slide.

The error log file is opened and closed on every failure. If many slides fail, this is needlessly expensive. Consider opening it once (or collecting errors in a list and writing at the end).
Suggested approach
```diff
+    errors: list[str] = []
     async for result in tqdm(
         client.qc.check_slides(
             slides,
             output_path,
             config=SlideCheckConfig(**qc_parameters),
             timeout=request_timeout,
             max_concurrent=max_concurrent,
         ),
         total=len(slides),
     ):
         if not result.success:
-            with open(output_path / "qc_errors.log", "a") as log_file:
-                log_file.write(
-                    f"Failed to process {result.wsi_path}: {result.error}\n"
-                )
+            errors.append(
+                f"Failed to process {result.wsi_path}: {result.error}\n"
+            )
+
+    if errors:
+        with open(output_path / "qc_errors.log", "w") as log_file:
+            log_file.writelines(errors)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preprocessing/quality_control.py` around lines 71 - 75, The current loop opens and closes qc_errors.log for every failed slide (using result and output_path), which is inefficient; change the code in the function that iterates over results so you either (a) open output_path / "qc_errors.log" once before the loop and write each failure inside the loop, or (b) collect failure messages into a list (using result.wsi_path and result.error) and write them all to output_path / "qc_errors.log" once after the loop; ensure you still append or create the file appropriately and preserve the same message format.

scripts/preprocessing/quality_control.py (1)
4-17: Template placeholders and unpinned git clone.

A couple of observations:

The `...` (Ellipsis) placeholders on lines 5, 6, and 14 will pass Python's parser but will fail at runtime if not replaced. Consider adding a comment at the top of the file (e.g., `# TODO: Fill in job_name, username, and dataset before running`) to make this explicit for other developers.
`git clone` on line 11 will check out the default branch. If reproducibility matters for QC runs, consider pinning to a tag or commit:

```diff
 script=[
     "git clone https://github.com/RationAI/ulcerative-colitis.git workdir",
     "cd workdir",
+    "git checkout <tag-or-commit>",
     "uv sync --frozen",
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/preprocessing/quality_control.py` around lines 4 - 17, The template uses placeholder ellipses in the submit_job call (job_name, username, dataset) and a non-pinned git clone which will break reproducibility; add a top-of-file TODO comment like “# TODO: Fill in job_name, username, and dataset before running” and update the submit_job invocation (referencing submit_job, job_name, username, script, storage) to require real values (or validate/raise if still set to ...), and change the git clone entry in the script to pin the repository to a stable ref (e.g., clone + checkout a tag/commit or use git clone --branch <tag> && git checkout <commit>) so uv run (preprocessing.quality_control +dataset=...) uses a reproducible commit instead of the default branch.
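One way to make the "validate/raise if still set to ..." idea concrete is a small guard that fails fast before submission. This is a hedged sketch; `require_filled` is a hypothetical helper, and the field names mirror the template's `job_name`/`username` placeholders rather than any real API:

```python
def require_filled(**fields):
    """Raise if any template field is still the `...` (Ellipsis) placeholder."""
    missing = [name for name, value in fields.items() if value is Ellipsis]
    if missing:
        raise ValueError(
            f"Fill in template fields before running: {', '.join(missing)}"
        )
    return fields


# TODO-style placeholders, as in the job template.
job_name = ...
username = "alice"

try:
    require_filled(job_name=job_name, username=username)
    ok = True
except ValueError as exc:
    ok = False
    message = str(exc)
```

Calling the guard right before `submit_job` turns a confusing runtime failure inside the cluster into an immediate, named error on the submitting machine.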
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@preprocessing/quality_control.py`:
- Around line 82-89: The current code calls pd.concat([pd.read_csv(f) for f in
csvs]) which raises ValueError when csvs is empty; update the block in
preprocessing/quality_control.py to first check the csvs list (the variable csvs
obtained from Path(output_path).glob("*.csv")) and only call
pd.concat/pd.read_csv when csvs is non-empty — otherwise write an empty
qc_metrics.csv (e.g., pd.DataFrame()) or skip creating it; ensure the subsequent
removal loop (for f in csvs: f.unlink()) still only runs when csvs is non-empty
so you never call pd.concat on an empty list and never try to unlink
non-existent files.
- Around line 109-118: qc_main is being called with config.qc_parameters (an
OmegaConf DictConfig) while its signature expects a QCParameters TypedDict; to
fix the type mismatch convert the DictConfig to a plain container before passing
it: replace passing config.qc_parameters directly with
OmegaConf.to_container(config.qc_parameters, resolve=True) (and add an import
for OmegaConf if missing) so qc_main(..., qc_parameters=...) receives a proper
dict/TypedDict-compatible object.
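The conversion the comment asks for is `OmegaConf.to_container(config.qc_parameters, resolve=True)`. For a quick check of the underlying idea without omegaconf installed, the effect can be mimicked on nested mappings; `to_plain` below is a hypothetical stand-in, not the OmegaConf API:

```python
from collections.abc import Mapping, Sequence


def to_plain(obj):
    """Recursively convert Mapping/Sequence wrappers into plain dicts and lists."""
    if isinstance(obj, Mapping):
        return {k: to_plain(v) for k, v in obj.items()}
    if isinstance(obj, Sequence) and not isinstance(obj, (str, bytes)):
        return [to_plain(v) for v in obj]
    return obj


# Hypothetical QC parameters; the keys are illustrative, not the real schema.
qc_parameters = to_plain({"tissue_threshold": 0.1, "checks": ("blur", "folds")})
```

After conversion the object is a plain `dict`, so unpacking it with `**qc_parameters` into a TypedDict-typed function behaves as the type checker expects, with no DictConfig-specific attribute access left over.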
---
Nitpick comments:
In `@preprocessing/quality_control.py`:
- Around line 41-49: The code in organize_masks uses slide_name =
file.name.replace(f"{mask_prefix}_", "") which can remove occurrences of the
mask prefix anywhere in the filename; change this to use the string method
removeprefix to only strip the leading prefix (e.g., compute slide_name =
file.name.removeprefix(f"{mask_prefix}_")) so filenames that contain the prefix
later are preserved before computing destination and calling file.rename.
- Around line 71-75: The current loop opens and closes qc_errors.log for every
failed slide (using result and output_path), which is inefficient; change the
code in the function that iterates over results so you either (a) open
output_path / "qc_errors.log" once before the loop and write each failure inside
the loop, or (b) collect failure messages into a list (using result.wsi_path and
result.error) and write them all to output_path / "qc_errors.log" once after the
loop; ensure you still append or create the file appropriately and preserve the
same message format.
In `@pyproject.toml`:
- Line 16: The new dependency "rationai-sdk" is added without a pinned version
which makes builds non-reproducible; update the pyproject.toml dependency
declaration for "rationai-sdk" (the same list that contains "rationai-mlkit" and
"rationai-masks") to pin it to a specific release tag or commit SHA (e.g., use a
version constraint or a git URL with @<tag-or-sha>) so installs are
deterministic and reproducible.
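A hedged sketch of what a pinned declaration could look like under uv's git sources; the repository URL for `rationai-sdk` and the `<tag-or-sha>` value are placeholders, not verified coordinates:

```toml
[project]
dependencies = [
    "rationai-sdk",
]

[tool.uv.sources]
# Pin to a release tag or commit SHA for reproducible installs.
rationai-sdk = { git = "https://github.com/RationAI/rationai-sdk.git", rev = "<tag-or-sha>" }
```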
In `@scripts/preprocessing/quality_control.py`:
- Around line 4-17: The template uses placeholder ellipses in the submit_job
call (job_name, username, dataset) and a non-pinned git clone which will break
reproducibility; add a top-of-file TODO comment like “# TODO: Fill in job_name,
username, and dataset before running” and update the submit_job invocation
(referencing submit_job, job_name, username, script, storage) to require real
values (or validate/raise if still set to ...), and change the git clone entry
in the script to pin the repository to a stable ref (e.g., clone + checkout a
tag/commit or use git clone --branch <tag> && git checkout <commit>) so uv run
(preprocessing.quality_control +dataset=...) uses a reproducible commit instead
of the default branch.
🧹 Nitpick comments (1)
preprocessing/quality_control.py (1)
46-48: Use `removeprefix` instead of `replace` for safer prefix stripping.

`str.replace` replaces all occurrences of the substring in the filename, not just the leading prefix. If the mask prefix string happens to appear elsewhere in a slide filename, this will corrupt the output name.

Suggested fix

```diff
- slide_name = file.name.replace(f"{mask_prefix}_", "")
+ slide_name = file.name.removeprefix(f"{mask_prefix}_")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preprocessing/quality_control.py` around lines 46 - 48, The loop that computes slide_name uses file.name.replace(f"{mask_prefix}_", "") which may remove occurrences beyond the leading prefix; update the logic in the for-loop that iterates over output_path.glob(...) so slide_name is derived using file.name.removeprefix(f"{mask_prefix}_") (keeping the same variables: file, mask_prefix, slide_name, destination, prefix_dir) to only strip the leading prefix safely before constructing destination.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@preprocessing/quality_control.py`:
- Around line 82-89: The code currently calls pd.concat([...]) on csvs which
will raise ValueError when csvs is empty; update the block in
preprocessing/quality_control.py to first check if csvs is non-empty (e.g., if
not csvs: create an empty DataFrame or write an empty qc_metrics.csv and skip
unlinking) and only call pd.concat when csvs contains files; ensure the
subsequent loop that unlinks individual files only runs when files were actually
read.
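A minimal sketch of the guarded merge described above, using pandas as the snippet under review does; `merge_metrics` is an illustrative helper name, and only the `qc_metrics.csv` filename is taken from the PR:

```python
import tempfile
from pathlib import Path

import pandas as pd


def merge_metrics(output_path: Path) -> Path:
    """Concatenate per-slide CSVs into qc_metrics.csv, tolerating an empty list."""
    csvs = sorted(output_path.glob("*.csv"))
    merged = output_path / "qc_metrics.csv"
    if csvs:
        # pd.concat raises ValueError on an empty list, hence the guard above.
        pd.concat([pd.read_csv(f) for f in csvs]).to_csv(merged, index=False)
        for f in csvs:
            f.unlink()  # only remove files that were actually merged
    else:
        pd.DataFrame().to_csv(merged, index=False)  # empty marker instead of a crash
    return merged


tmp = Path(tempfile.mkdtemp())
(tmp / "a.csv").write_text("slide,score\na,0.9\n")
(tmp / "b.csv").write_text("slide,score\nb,0.8\n")
merged = merge_metrics(tmp)
rows = pd.read_csv(merged)

empty_out = merge_metrics(Path(tempfile.mkdtemp()))
```

The unlink loop sits inside the same branch as the concat, so per-slide files are only removed when they were actually read, matching the fix the comment asks for.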
---
Nitpick comments:
In `@preprocessing/quality_control.py`:
- Around line 46-48: The loop that computes slide_name uses
file.name.replace(f"{mask_prefix}_", "") which may remove occurrences beyond the
leading prefix; update the logic in the for-loop that iterates over
output_path.glob(...) so slide_name is derived using
file.name.removeprefix(f"{mask_prefix}_") (keeping the same variables: file,
mask_prefix, slide_name, destination, prefix_dir) to only strip the leading
prefix safely before constructing destination.
QC masks generation script.
Closes IBD-18
Blocked by #2
Dependency graph:
Summary by CodeRabbit
New Features
Chores