feat: implement core data pipeline and image segmentation #5
dhalmazna wants to merge 8 commits into chore/project-setup
Conversation
Summary of Changes (Gemini Code Assist): This pull request establishes the foundational data pipeline for the project by introducing a new, self-contained `ciao/data` module.
Code Review
This pull request introduces a new data module for handling image loading, preprocessing, and segmentation. The overall structure is good, and the use of vectorized operations in the segmentation logic is great for performance. I have provided a few suggestions to improve type safety, address a performance bottleneck in one of the utility functions, and reduce code duplication for better maintainability.
ciao/data/segmentation.py
Outdated
```python
adjacency_list = []
for mask in adj_masks:
    neighbors = []
    for i in range(len(adj_masks)):
        if mask & (1 << i):
            neighbors.append(i)
    adjacency_list.append(tuple(neighbors))
return tuple(adjacency_list)
```
The current implementation of bitmasks_to_adjacency_list has a time complexity of O(N^2), where N is the number of segments. This can be very slow if the number of segments is large. The inner loop can be made much more efficient by iterating only up to the number of bits in the mask, rather than through all possible segment IDs.
```python
adjacency_list = []
for mask in adj_masks:
    neighbors = []
    i = 0
    temp_mask = mask
    while temp_mask > 0:
        if temp_mask & 1:
            neighbors.append(i)
        temp_mask >>= 1
        i += 1
    adjacency_list.append(tuple(neighbors))
return tuple(adjacency_list)
```
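As a quick sanity check, the bit-scanning approach can be exercised standalone on a small hand-built mask list. This sketch wraps it in a function (the name `bitmasks_to_adjacency_list` follows the function named in the review; the sample masks are invented for illustration):

```python
def bitmasks_to_adjacency_list(adj_masks):
    """Decode per-segment neighbor bitmasks into tuples of neighbor IDs."""
    adjacency_list = []
    for mask in adj_masks:
        neighbors = []
        i = 0
        while mask > 0:
            if mask & 1:
                neighbors.append(i)
            mask >>= 1
            i += 1
        adjacency_list.append(tuple(neighbors))
    return tuple(adjacency_list)

# Segment 0 borders 1 and 2 (0b110); segments 1 and 2 each border 0 (0b001).
print(bitmasks_to_adjacency_list([0b110, 0b001, 0b001]))
# → ((1, 2), (0,), (0,))
```

The loop now runs only up to the highest set bit of each mask, so sparse masks decode quickly regardless of the total segment count.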
ciao/data/loader.py
Outdated
```python
from collections.abc import Iterator
from pathlib import Path
from typing import Any

# Supported image formats
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".webp")


def get_image_loader(config: Any) -> Iterator[Path]:
```
The config parameter is typed as Any, which bypasses static type checking and reduces code clarity. Since the project uses Hydra, it's better to use a more specific type. Using omegaconf.DictConfig will improve type safety and make it clear what kind of object is expected.
```python
from collections.abc import Iterator
from pathlib import Path

from omegaconf import DictConfig

# Supported image formats
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".webp")


def get_image_loader(config: DictConfig) -> Iterator[Path]:
```
ciao/data/segmentation.py
Outdated
```python
# Vectorized horizontal adjacency
left = segments[:, :-1].ravel()
right = segments[:, 1:].ravel()
mask_h = left != right
edges_h = np.column_stack([left[mask_h], right[mask_h]])

for seg1, seg2 in edges_h:
    adjacency_sets[seg1].add(seg2)
    adjacency_sets[seg2].add(seg1)

# Vectorized vertical adjacency
top = segments[:-1, :].ravel()
bottom = segments[1:, :].ravel()
mask_v = top != bottom
edges_v = np.column_stack([top[mask_v], bottom[mask_v]])

for seg1, seg2 in edges_v:
    adjacency_sets[seg1].add(seg2)
    adjacency_sets[seg2].add(seg1)

if neighborhood == 8:
    # Vectorized diagonal adjacency (down-right)
    top_left = segments[:-1, :-1].ravel()
    bottom_right = segments[1:, 1:].ravel()
    mask_dr = top_left != bottom_right
    edges_dr = np.column_stack([top_left[mask_dr], bottom_right[mask_dr]])

    for seg1, seg2 in edges_dr:
        adjacency_sets[seg1].add(seg2)
        adjacency_sets[seg2].add(seg1)

    # Vectorized diagonal adjacency (down-left)
    top_right = segments[:-1, 1:].ravel()
    bottom_left = segments[1:, :-1].ravel()
    mask_dl = top_right != bottom_left
    edges_dl = np.column_stack([top_right[mask_dl], bottom_left[mask_dl]])

    for seg1, seg2 in edges_dl:
        adjacency_sets[seg1].add(seg2)
        adjacency_sets[seg2].add(seg1)
```
The logic for finding and adding edges is repeated for horizontal, vertical, and diagonal neighbors. This code duplication can be reduced by first collecting all edge pairs and then iterating once to populate the adjacency sets. This refactoring improves maintainability and makes the function's intent clearer.
```python
# Vectorized adjacency
edge_collections = []

# Horizontal
left = segments[:, :-1].ravel()
right = segments[:, 1:].ravel()
mask_h = left != right
edge_collections.append(np.column_stack([left[mask_h], right[mask_h]]))

# Vertical
top = segments[:-1, :].ravel()
bottom = segments[1:, :].ravel()
mask_v = top != bottom
edge_collections.append(np.column_stack([top[mask_v], bottom[mask_v]]))

if neighborhood == 8:
    # Diagonal (down-right)
    top_left = segments[:-1, :-1].ravel()
    bottom_right = segments[1:, 1:].ravel()
    mask_dr = top_left != bottom_right
    edge_collections.append(np.column_stack([top_left[mask_dr], bottom_right[mask_dr]]))

    # Diagonal (down-left)
    top_right = segments[:-1, 1:].ravel()
    bottom_left = segments[1:, :-1].ravel()
    mask_dl = top_right != bottom_left
    edge_collections.append(np.column_stack([top_right[mask_dl], bottom_left[mask_dl]]))

all_edges = np.concatenate(edge_collections)
for seg1, seg2 in all_edges:
    adjacency_sets[seg1].add(seg2)
    adjacency_sets[seg2].add(seg1)
```
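A minimal, self-contained sketch of this edge-collection pattern on a toy 2×2 label map (4-neighborhood case only; variable names mirror the suggestion, the data is invented):

```python
import numpy as np

segments = np.array([[0, 0],
                     [1, 2]])
n_segments = 3
adjacency_sets = [set() for _ in range(n_segments)]

edge_collections = []
# Horizontal neighbor pairs that cross a segment boundary
left, right = segments[:, :-1].ravel(), segments[:, 1:].ravel()
mask_h = left != right
edge_collections.append(np.column_stack([left[mask_h], right[mask_h]]))
# Vertical neighbor pairs that cross a segment boundary
top, bottom = segments[:-1, :].ravel(), segments[1:, :].ravel()
mask_v = top != bottom
edge_collections.append(np.column_stack([top[mask_v], bottom[mask_v]]))

# Single population pass over all collected edges
for seg1, seg2 in np.concatenate(edge_collections):
    adjacency_sets[seg1].add(seg2)
    adjacency_sets[seg2].add(seg1)

print(adjacency_sets)
```

Segment 0 spans the whole top row, so it touches both 1 and 2, while 1 and 2 also touch each other horizontally.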
Pull request overview
This PR adds a new self-contained ciao/data/ module that covers image path loading, ImageNet-style preprocessing, and square/hexagonal segmentation with adjacency encoding for downstream pipeline steps.
Changes:
- Added `ciao/data` package with loader, preprocessing, and segmentation utilities.
- Implemented square + hexagonal segmentation and adjacency bitmask encoding.
- Removed the unused `networkx` dependency from project metadata / lockfile.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| `ciao/data/segmentation.py` | New segmentation implementation (square + hex) and adjacency encoding utilities. |
| `ciao/data/preprocessing.py` | New ImageNet-style preprocessing + image loading helper. |
| `ciao/data/loader.py` | New Hydra-config-driven image path iterator (single image or directory). |
| `ciao/data/__init__.py` | Exposes the new data utilities as package exports. |
| `pyproject.toml` | Drops `networkx` from dependencies. |
| `uv.lock` | Lockfile updated to reflect removal of `networkx`. |
| `.mypy.ini` | Enables `explicit_package_bases` and fixes formatting of `disable_error_code`. |
ciao/data/preprocessing.py
Outdated
```python
image = Image.open(image_path).convert("RGB")
input_tensor = preprocess(image).to(device)  # (3, 224, 224) - on correct device
```
Image.open(image_path) is not closed, which can leak file descriptors when processing many images. Use a context manager (or explicitly close) so the underlying file handle is released after converting/reading.
```python
with Image.open(image_path) as image:
    image = image.convert("RGB")
    input_tensor = preprocess(image).to(device)  # (3, 224, 224) - on correct device
```
ciao/data/segmentation.py
Outdated
```python
# Use np.unique to assign segment IDs efficiently
_, segments_flat = np.unique(qr_stacked, axis=0, return_inverse=True)
segments = segments_flat.reshape((height, width)).astype(np.int32)

# Build hex_to_id mapping for adjacency construction
unique_qr = np.unique(qr_stacked, axis=0)
```
np.unique(qr_stacked, axis=0) is computed twice (once for return_inverse=True and again for unique_qr). Reuse the unique array from the first call to avoid the extra full-array sort/scan and to keep the segment-id mapping tightly coupled to the inverse indices.
```python
# Use np.unique to assign segment IDs efficiently and get unique coordinates
unique_qr, segments_flat = np.unique(qr_stacked, axis=0, return_inverse=True)
segments = segments_flat.reshape((height, width)).astype(np.int32)

# Build hex_to_id mapping for adjacency construction
```
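A small standalone illustration of the single-call pattern, using toy `qr_stacked` data invented for the example (one axial `(q, r)` pair per pixel of a 2×2 image):

```python
import numpy as np

# Toy (q, r) axial hex coordinates, one row per pixel of a 2x2 image
qr_stacked = np.array([[0, 0], [0, 1], [0, 0], [1, 1]])
height, width = 2, 2

# One np.unique call yields both the unique coordinates and the inverse
# indices, so the id mapping stays coupled to the segment labels.
unique_qr, segments_flat = np.unique(qr_stacked, axis=0, return_inverse=True)
segments = segments_flat.reshape((height, width)).astype(np.int32)

print(unique_qr)  # unique (q, r) pairs, lexicographically sorted
print(segments)   # per-pixel segment IDs indexing into unique_qr
```

Pixels 0 and 2 share the coordinate `(0, 0)`, so they receive the same segment ID, and `unique_qr[segments[y, x]]` recovers each pixel's hex coordinate.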
ciao/data/loader.py
Outdated
```python
for ext in IMAGE_EXTENSIONS:
    yield from directory.glob(f"**/*{ext}")
```
Directory mode traverses the tree once per extension (for ext in IMAGE_EXTENSIONS: directory.glob(...)), which can be unnecessarily expensive on large folders and still misses upper-case extensions. Consider a single rglob('*') traversal and filter by path.suffix.lower() in IMAGE_EXTENSIONS for both performance and completeness.
```python
for path in directory.rglob("*"):
    if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
        yield path
```
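The single-traversal filter can be verified end to end against a throwaway directory (file names and the wrapper `iter_images` are illustrative, not part of the PR):

```python
import tempfile
from pathlib import Path

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".webp")

def iter_images(directory: Path):
    """One rglob pass; suffix.lower() also catches upper-case extensions."""
    for path in sorted(directory.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            yield path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("a.png", "b.TXT", "sub/c.JPG", "sub/d.webp"):
        (root / name).touch()
    found = [p.relative_to(root).as_posix() for p in iter_images(root)]
    print(found)  # → ['a.png', 'sub/c.JPG', 'sub/d.webp']
```

Note that `sub/c.JPG` would be missed by the original per-extension `glob` loop, while `b.TXT` is correctly skipped.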
ciao/data/loader.py
Outdated
| """ | ||
| if config.data.get("image_path"): | ||
| # Single image mode | ||
| yield Path(config.data.image_path) |
In single-image mode, image_path is yielded without validating that it exists and is a file. Adding an is_file() check here (similar to the directory validation for batch_path) would fail fast with a clearer configuration error.
```python
image = Path(config.data.image_path)
if not image.is_file():
    raise ValueError(
        f"image_path must be a valid file, got: {image}. "
        "Check for typos or incorrect path configuration."
    )
yield image
```
Context:
This PR introduces the completely self-contained `data/` module.

What was changed:
- `loader.py`: Image loading utilities.
- `preprocessing.py`: Image tensor transformations.
- `segmentation.py`: Hexagonal and square image segmentation logic.
- `replacement.py`: Image masking strategies (mean color, blur, interlacing, solid color).

Related Task:
XAI-29
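The PR body lists `replacement.py` but its code is not shown in this review, so purely as a hedged illustration, a mean-color masking strategy like the one described could look roughly like this (the function name, signature, and toy data are invented for the sketch):

```python
import numpy as np

def replace_with_mean_color(image: np.ndarray, segments: np.ndarray,
                            segment_id: int) -> np.ndarray:
    """Fill one segment of an (H, W, 3) image with that segment's mean color."""
    out = image.copy()
    mask = segments == segment_id
    # Mean over the segment's pixels, broadcast back onto the masked region
    out[mask] = image[mask].mean(axis=0).astype(image.dtype)
    return out

# 2x2 toy image: left column is segment 0, right column is segment 1
image = np.array([[[10, 0, 0], [200, 0, 0]],
                  [[30, 0, 0], [100, 0, 0]]], dtype=np.uint8)
segments = np.array([[0, 1],
                     [0, 1]])
masked = replace_with_mean_color(image, segments, 0)
print(masked[:, :, 0])  # segment 0's red values (10, 30) become their mean, 20
```

The blur, interlacing, and solid-color strategies named above would plug into the same masked-assignment pattern with a different fill computation.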