
feat: quality control #4

Merged
Adames4 merged 28 commits into master from feature/quality-control on Feb 22, 2026

Conversation


@Adames4 Adames4 commented Feb 11, 2026

QC mask generation script.

Closes IBD-18

Blocked by #2

Dependency graph:

                         +--------------+
                  -------| tissue-masks |<------+           +------------+      +----------------------+
                 /       +--------------+       |       +---| tile-masks |<-----| preprocessing-report |
                /                               |       |   +------------+      +----------------------+
+---------+    /                            +--------+  |
| dataset | <-+                             | tiling |<-+
+---------+    \                            +--------+  |
                \                               |       |   +------------+
                 \       +-----------------+    |       +---| embeddings |
                  -------| quality-control |<---+           +------------+
                         +-----------------+

Summary by CodeRabbit

  • New Features

    • Added a quality-control preprocessing workflow for slide datasets with configurable checks, mask generation, and consolidated QC metrics.
  • Chores

    • Added a new SDK dependency.
    • Added default QC configuration settings.
    • Added a job submission script to run QC processing.
    • Fixed YAML formatting and trailing-newline issues across dataset/config files.

@Adames4 Adames4 self-assigned this Feb 11, 2026
@Adames4 Adames4 requested a review from Copilot February 11, 2026 13:14

coderabbitai bot commented Feb 11, 2026

📝 Walkthrough

Adds a quality-control workflow: new QC config, an async Python QC module with MLFlow integration, a kube job submission script, a dependency addition, and minor YAML/INI formatting fixes across dataset configs (no functional changes to dataset entries).

Changes

  • YAML / INI formatting fixes
    Files: .mypy.ini, configs/dataset/processed/ftn.yaml, configs/dataset/processed/ikem.yaml, configs/dataset/processed/knl_patos.yaml, configs/dataset/raw/ftn.yaml, configs/dataset/raw/ikem.yaml, configs/dataset/raw/knl_patos.yaml, configs/preprocessing/tissue_masks.yaml
    Minor formatting fixes: trailing newlines and list-indentation corrections. No semantic/data changes.
  • Quality control configuration
    Files: configs/preprocessing/quality_control.yaml
    New QC configuration with output paths, request_timeout, max_concurrent, qc_parameters (mask/sample levels, checks), and metadata/hyperparams.
  • Quality control module
    Files: preprocessing/quality_control.py
    New module implementing a QCParameters TypedDict, get_qc_masks, organize_masks, an async qc_main, download_dataset, and a Hydra-compatible main entry point; MLFlow logging and CSV merging included.
  • Kubernetes job submission script
    Files: scripts/preprocessing/quality_control.py
    New script that submits a Kubernetes batch job to run the QC workflow (resource requests, repo sync, execution commands).
  • Dependency / project config
    Files: pyproject.toml
    Adds the rationai-sdk dependency and a corresponding git source entry under tool sources.
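From the file summaries above, a rough sketch of the new module's surface might look like the following. The field names and signatures are inferred from the walkthrough, not taken from the actual implementation, and the async body is stubbed:

```python
from pathlib import Path
from typing import TypedDict


class QCParameters(TypedDict):
    # Illustrative fields; the real TypedDict mirrors
    # configs/preprocessing/quality_control.yaml (mask/sample levels, checks).
    mask_level: int
    sample_level: int
    checks: list[str]


def organize_masks(output_path: Path) -> None:
    """Move <prefix>_<slide>.png mask files into per-prefix subdirectories."""
    # list() so renames don't disturb the directory iteration
    for file in list(output_path.glob("*_*.png")):
        prefix, _, slide_name = file.name.partition("_")
        prefix_dir = output_path / prefix
        prefix_dir.mkdir(exist_ok=True)
        file.rename(prefix_dir / slide_name)


async def qc_main(
    slides: list[Path],
    output_path: Path,
    qc_parameters: QCParameters,
    request_timeout: float,
    max_concurrent: int,
) -> None:
    """In the real module this drives the rationai-sdk AsyncClient and
    writes per-slide CSV metrics; stubbed here."""
    output_path.mkdir(parents=True, exist_ok=True)
```

The real entry point is additionally wrapped for Hydra and logs artifacts to MLflow, per the walkthrough.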

Sequence Diagram

sequenceDiagram
    actor User
    participant Hydra as Hydra Config
    participant Main as main()
    participant MLFlow as MLFlow Logger
    participant QC as qc_main()
    participant Async as AsyncClient
    participant Slides as Slide Storage
    participant FS as Filesystem
    participant CSV as CSV Merge

    User->>Hydra: load config
    Hydra->>Main: provide DictConfig
    Main->>MLFlow: initialize logger / autolog
    Main->>Main: download_dataset(uri)
    Main->>FS: prepare output dir
    Main->>QC: start qc_main(...)

    QC->>Async: create AsyncClient (request_timeout, max_concurrent)
    Async->>Slides: fetch/process slides (concurrent)
    Slides-->>Async: return masks, metrics, errors
    Async->>FS: write per-slide CSVs and mask files
    QC->>FS: organize_masks into artifact subdirs
    QC->>CSV: concatenate per-slide CSVs -> qc_metrics.csv
    CSV->>MLFlow: log qc_metrics.csv and artifacts
    MLFlow-->>User: artifacts stored / run complete

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • matejpekar
  • vejtek

Poem

🐰 I hopped through configs, neat and small,
Masks gathered gently, one and all,
Async feet raced slides by night,
MLFlow hummed with logging light,
A QC carrot—crisp and bright! 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check: ❓ Inconclusive. The title 'feat: quality control' relates to the main change (adding QC functionality) but is overly broad and generic. Resolution: make the title more specific, e.g. 'feat: add quality control mask generation workflow'.
✅ Passed checks (1 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.


@gemini-code-assist

Summary of Changes

Hello @Adames4, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a quality control (QC) masks generation script to identify and flag potential issues in whole slide images. It adds necessary configurations, a Python script for QC processing, and integrates the rationai-sdk for interacting with the RationAI platform. The changes aim to automate and standardize the quality control process for digital pathology slides.

Highlights

  • Quality Control Masks Generation: This PR introduces a script for generating quality control (QC) masks, addressing issue IBD-18. It includes configurations and a Python script to perform QC checks on slides.
  • Dependency on rationai-sdk: The PR adds a dependency on the rationai-sdk for interacting with the RationAI platform.
  • Configuration Files: New YAML configuration files are added for different datasets (ftn, ikem, knl_patos) to specify dataset URIs and default settings.
Changelog
  • configs/dataset/processed/ftn.yaml
    • Added configuration file for the ftn dataset.
  • configs/dataset/processed/ikem.yaml
    • Added configuration file for the ikem dataset.
  • configs/dataset/processed/knl_patos.yaml
    • Added configuration file for the knl_patos dataset.
  • configs/preprocessing/quality_control.yaml
    • Added configuration file for quality control parameters and metadata.
  • preprocessing/quality_control.py
    • Added Python script for generating quality control masks.
  • pyproject.toml
    • Added rationai-sdk as a dependency and updated git sources.
  • scripts/preprocessing/quality_control.py
    • Added script for submitting a quality control job.
  • uv.lock
    • Updated uv.lock file to include new dependencies and versions.
Activity
  • New files added for dataset configurations (ftn, ikem, knl_patos).
  • Quality control configuration file added.
  • Python script added for quality control mask generation.
  • rationai-sdk dependency added to pyproject.toml.
  • Job submission script added.
  • Dependencies updated in uv.lock.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new quality control script for generating QC masks. The changes include new configuration files, the main processing script, and a job submission script. The code is generally well-structured. I've identified a potential bug related to handling empty CSV file lists and several areas for improvement regarding configuration, maintainability, and code consistency. There are also some placeholder values in configuration and scripts that need to be addressed.


Copilot AI left a comment


Pull request overview

Adds a preprocessing “quality control” step that runs slide QC via the rationai-sdk, organizes generated mask outputs, and logs results/artifacts to MLflow, along with configs and dependency updates to support the new workflow.

Changes:

  • Introduces preprocessing/quality_control.py to run async QC checks, reorganize mask files, merge per-slide CSV metrics, and log artifacts.
  • Adds Hydra configuration for the QC step and “processed dataset” configs pointing QC at MLflow-hosted dataset CSV artifacts.
  • Updates project dependencies/lockfile to include rationai-sdk (and transitive deps like httpx, lz4, tenacity).

Reviewed changes

Copilot reviewed 7 out of 9 changed files in this pull request and generated 6 comments.

Show a summary per file

  • preprocessing/quality_control.py: New QC implementation and artifact organization/aggregation logic.
  • configs/preprocessing/quality_control.yaml: QC step runtime/config parameters and MLflow metadata.
  • configs/dataset/processed/ikem.yaml: Processed dataset config pointing to an MLflow dataset artifact URI (currently TODO).
  • configs/dataset/processed/ftn.yaml: Same as above for FTN (currently TODO).
  • configs/dataset/processed/knl_patos.yaml: Same as above for KNL PATOS (currently TODO).
  • scripts/preprocessing/quality_control.py: Kube job submission template for running the QC step.
  • pyproject.toml: Adds the rationai-sdk dependency and a uv git source.
  • uv.lock: Locks rationai-sdk and new transitive dependencies.
  • preprocessing/.gitkeep: Placeholder file (present in PR metadata).


@Adames4 Adames4 mentioned this pull request Feb 12, 2026
@Adames4 Adames4 changed the base branch from feature/dataset to master February 17, 2026 17:04

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
pyproject.toml (1)

16-16: New rationai-sdk dependency added — consistent with existing patterns.

The dependency follows the same no-version-pin + git-source pattern as the other rationai-* packages. Consider pinning to a specific tag or revision for reproducible builds, though this is consistent with how rationai-mlkit and rationai-masks are already declared.

Also applies to: 32-32

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` at line 16, The new dependency "rationai-sdk" is added
without a pinned version which makes builds non-reproducible; update the
pyproject.toml dependency declaration for "rationai-sdk" (the same list that
contains "rationai-mlkit" and "rationai-masks") to pin it to a specific release
tag or commit SHA (e.g., use a version constraint or a git URL with
@<tag-or-sha>) so installs are deterministic and reproducible.
preprocessing/quality_control.py (2)

41-49: Consider using removeprefix instead of replace for stripping the mask prefix.

str.replace replaces all occurrences of the substring, not just the leading prefix. If the prefix string happens to appear elsewhere in the filename, the result would be mangled. removeprefix (Python 3.9+) is more precise and expresses intent better.

Suggested fix
-        slide_name = file.name.replace(f"{mask_prefix}_", "")
+        slide_name = file.name.removeprefix(f"{mask_prefix}_")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preprocessing/quality_control.py` around lines 41 - 49, The code in
organize_masks uses slide_name = file.name.replace(f"{mask_prefix}_", "") which
can remove occurrences of the mask prefix anywhere in the filename; change this
to use the string method removeprefix to only strip the leading prefix (e.g.,
compute slide_name = file.name.removeprefix(f"{mask_prefix}_")) so filenames
that contain the prefix later are preserved before computing destination and
calling file.rename.
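A minimal illustration of the difference between the two methods, using a hypothetical filename in which the prefix also appears mid-name:

```python
mask_prefix = "tissue"
name = "tissue_sample_tissue_region.png"  # hypothetical slide filename

# str.replace removes every occurrence of the substring:
mangled = name.replace(f"{mask_prefix}_", "")     # "sample_region.png"

# str.removeprefix (Python 3.9+) strips only a leading match:
clean = name.removeprefix(f"{mask_prefix}_")      # "sample_tissue_region.png"
```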

71-75: Repeated file open/close for each failed slide.

The error log file is opened and closed on every failure. If many slides fail, this is needlessly expensive. Consider opening it once (or collecting errors in a list and writing at the end).

Suggested approach
+        errors: list[str] = []
         async for result in tqdm(
             client.qc.check_slides(
                 slides,
                 output_path,
                 config=SlideCheckConfig(**qc_parameters),
                 timeout=request_timeout,
                 max_concurrent=max_concurrent,
             ),
             total=len(slides),
         ):
             if not result.success:
-                with open(output_path / "qc_errors.log", "a") as log_file:
-                    log_file.write(
-                        f"Failed to process {result.wsi_path}: {result.error}\n"
-                    )
+                errors.append(
+                    f"Failed to process {result.wsi_path}: {result.error}\n"
+                )
+
+        if errors:
+            with open(output_path / "qc_errors.log", "w") as log_file:
+                log_file.writelines(errors)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preprocessing/quality_control.py` around lines 71 - 75, The current loop
opens and closes qc_errors.log for every failed slide (using result and
output_path), which is inefficient; change the code in the function that
iterates over results so you either (a) open output_path / "qc_errors.log" once
before the loop and write each failure inside the loop, or (b) collect failure
messages into a list (using result.wsi_path and result.error) and write them all
to output_path / "qc_errors.log" once after the loop; ensure you still append or
create the file appropriately and preserve the same message format.
scripts/preprocessing/quality_control.py (1)

4-17: Template placeholders and unpinned git clone.

A couple of observations:

  1. The ... (Ellipsis) placeholders on lines 5, 6, and 14 will pass Python's parser but will fail at runtime if not replaced. Consider adding a comment at the top of the file (e.g., # TODO: Fill in job_name, username, and dataset before running) to make this explicit for other developers.

  2. git clone on line 11 will check out the default branch. If reproducibility matters for QC runs, consider pinning to a tag or commit:

     script=[
         "git clone https://github.com/RationAI/ulcerative-colitis.git workdir",
         "cd workdir",
+        "git checkout <tag-or-commit>",
         "uv sync --frozen",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/preprocessing/quality_control.py` around lines 4 - 17, The template
uses placeholder ellipses in the submit_job call (job_name, username, dataset)
and a non-pinned git clone which will break reproducibility; add a top-of-file
TODO comment like “# TODO: Fill in job_name, username, and dataset before
running” and update the submit_job invocation (referencing submit_job, job_name,
username, script, storage) to require real values (or validate/raise if still
set to ...), and change the git clone entry in the script to pin the repository
to a stable ref (e.g., clone + checkout a tag/commit or use git clone --branch
<tag> && git checkout <commit>) so uv run (preprocessing.quality_control
+dataset=...) uses a reproducible commit instead of the default branch.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@preprocessing/quality_control.py`:
- Around line 82-89: The current code calls pd.concat([pd.read_csv(f) for f in
csvs]) which raises ValueError when csvs is empty; update the block in
preprocessing/quality_control.py to first check the csvs list (the variable csvs
obtained from Path(output_path).glob("*.csv")) and only call
pd.concat/pd.read_csv when csvs is non-empty — otherwise write an empty
qc_metrics.csv (e.g., pd.DataFrame()) or skip creating it; ensure the subsequent
removal loop (for f in csvs: f.unlink()) still only runs when csvs is non-empty
so you never call pd.concat on an empty list and never try to unlink
non-existent files.
- Around line 109-118: qc_main is being called with config.qc_parameters (an
OmegaConf DictConfig) while its signature expects a QCParameters TypedDict; to
fix the type mismatch convert the DictConfig to a plain container before passing
it: replace passing config.qc_parameters directly with
OmegaConf.to_container(config.qc_parameters, resolve=True) (and add an import
for OmegaConf if missing) so qc_main(..., qc_parameters=...) receives a proper
dict/TypedDict-compatible object.
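The guard described in the first inline comment could look roughly like this; merge_qc_metrics is a hypothetical standalone version of the merging block, not the module's actual code:

```python
from pathlib import Path

import pandas as pd


def merge_qc_metrics(output_path: Path) -> None:
    """Merge per-slide CSVs into qc_metrics.csv, tolerating an empty run."""
    csvs = sorted(output_path.glob("*.csv"))
    if not csvs:
        # No slides produced metrics; write an empty file instead of letting
        # pd.concat raise "No objects to concatenate", and skip unlinking.
        pd.DataFrame().to_csv(output_path / "qc_metrics.csv", index=False)
        return
    merged = pd.concat([pd.read_csv(f) for f in csvs], ignore_index=True)
    merged.to_csv(output_path / "qc_metrics.csv", index=False)
    for f in csvs:
        f.unlink()
```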



@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
preprocessing/quality_control.py (1)

46-48: Use removeprefix instead of replace for safer prefix stripping.

str.replace replaces all occurrences of the substring in the filename, not just the leading prefix. If the mask prefix string happens to appear elsewhere in a slide filename, this will corrupt the output name.

Suggested fix
-        slide_name = file.name.replace(f"{mask_prefix}_", "")
+        slide_name = file.name.removeprefix(f"{mask_prefix}_")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preprocessing/quality_control.py` around lines 46 - 48, The loop that
computes slide_name uses file.name.replace(f"{mask_prefix}_", "") which may
remove occurrences beyond the leading prefix; update the logic in the for-loop
that iterates over output_path.glob(...) so slide_name is derived using
file.name.removeprefix(f"{mask_prefix}_") (keeping the same variables: file,
mask_prefix, slide_name, destination, prefix_dir) to only strip the leading
prefix safely before constructing destination.

@Adames4 Adames4 requested a review from vejtek February 17, 2026 18:15
@Adames4 Adames4 requested a review from matejpekar February 19, 2026 14:37
@Adames4 Adames4 requested a review from a team February 21, 2026 09:56
@matejpekar matejpekar removed the request for review from a team February 21, 2026 18:05
@Adames4 Adames4 merged commit e911e24 into master Feb 22, 2026
2 checks passed