
feat: implement core data pipeline and image segmentation#5

Open
dhalmazna wants to merge 8 commits into chore/project-setup from feat/data-pipeline

Conversation


@dhalmazna dhalmazna commented Mar 4, 2026

Context:
This PR introduces a fully self-contained data/ module.

What was changed:

  • loader.py: Image loading utilities.
  • preprocessing.py: Image tensor transformations.
  • segmentation.py: Hexagonal and square image segmentation logic.
  • replacement.py: Image masking strategies (mean color, blur, interlacing, solid color).

Related Task:
XAI-29
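For reviewers unfamiliar with the masking strategies in replacement.py, here is a minimal, self-contained sketch of the mean-color idea on plain Python lists. The name mean_color_replacement and the list-of-tuples layout are illustrative stand-ins, not the module's actual API, which presumably operates on image tensors:

```python
def mean_color_replacement(pixels, segments, target_segment):
    """Replace every pixel of `target_segment` with the segment's mean color.

    `pixels` is a list of rows of (r, g, b) tuples and `segments` a
    parallel list of rows of segment IDs -- a toy stand-in for the
    tensors the real replacement.py presumably works on.
    """
    # Collect all pixels belonging to the target segment
    members = [
        pixels[y][x]
        for y, row in enumerate(segments)
        for x, seg in enumerate(row)
        if seg == target_segment
    ]
    n = len(members)
    # Per-channel integer mean of the segment's pixels
    mean = tuple(sum(c[i] for c in members) // n for i in range(3))
    # Rebuild the image, substituting the mean color inside the segment
    return [
        [mean if seg == target_segment else pixels[y][x]
         for x, seg in enumerate(row)]
        for y, row in enumerate(segments)
    ]

pixels = [[(10, 0, 0), (30, 0, 0)],
          [(0, 50, 0), (0, 70, 0)]]
segments = [[0, 0],
            [1, 1]]
print(mean_color_replacement(pixels, segments, 0))
# → [[(20, 0, 0), (20, 0, 0)], [(0, 50, 0), (0, 70, 0)]]
```

The sketch only pins down the semantics; the real module would vectorize this over tensors rather than loop in Python.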

Copilot AI review requested due to automatic review settings March 4, 2026 16:36

coderabbitai bot commented Mar 4, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 43c3c4cc-19d7-415c-9499-34b9a44d4522

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request establishes the foundational data pipeline for the project by introducing a new, self-contained data module. This module provides robust utilities for loading images, applying standard preprocessing steps, and performing advanced image segmentation using both square and hexagonal grids. These changes are crucial for setting up the project's image analysis capabilities and simplify future development by centralizing data handling logic.

Highlights

  • New data module introduced: A new data module has been added, encapsulating image loading, preprocessing, and segmentation functionalities.
  • Image Loading Utilities: Implemented loader.py to provide flexible image loading from single paths or batch directories, supporting common image formats.
  • Image Preprocessing: Developed preprocessing.py to handle standard ImageNet-style transformations, including resizing, cropping, and normalization for image tensors.
  • Image Segmentation: Introduced segmentation.py which supports both hexagonal and square grid segmentation, along with generating adjacency lists for spatial relationships.
  • Dependency Management: The networkx dependency has been removed from pyproject.toml and uv.lock, streamlining the project's dependencies.
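The hexagonal-grid work relies on rounding fractional hex coordinates to the nearest cell. As background for reviewers, here is a scalar sketch of the standard cube-coordinate rounding trick; this is an assumption about what _hex_round_vectorized vectorizes, not the PR's actual code:

```python
def hex_round(q: float, r: float) -> tuple[int, int]:
    """Round fractional axial hex coordinates to the nearest hex.

    Uses cube coordinates (q + r + s == 0): round each component,
    then recompute the component with the largest rounding error so
    the constraint still holds.
    """
    s = -q - r
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    if dq > dr and dq > ds:
        rq = -rr - rs  # q had the largest error: derive it from r and s
    elif dr > ds:
        rr = -rq - rs  # r had the largest error: derive it from q and s
    return rq, rr

print(hex_round(1.2, -0.1))  # → (1, 0)
```

A vectorized version would apply the same three-way correction to whole NumPy arrays of q and r at once, which is presumably what the underscore-prefixed helper in segmentation.py does.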
Changelog
  • .mypy.ini
    • Added explicit_package_bases = True to mypy configuration.
  • ciao/data/__init__.py
    • Added new file to initialize the data module.
    • Exported get_image_loader, load_and_preprocess_image, and create_segmentation for module-level access.
  • ciao/data/loader.py
    • Added new file containing get_image_loader function.
    • Implemented logic to load images from a single path or iterate through images in a batch directory.
    • Defined supported image extensions.
  • ciao/data/preprocessing.py
    • Added new file containing load_and_preprocess_image function.
    • Defined ImageNet preprocessing transforms (resize, center crop, ToTensor, Normalize).
    • Implemented image loading with PIL and tensor conversion to specified device.
  • ciao/data/segmentation.py
    • Added new file implementing image segmentation logic.
    • Included functions for vectorized hexagonal rounding (_hex_round_vectorized).
    • Provided utilities to convert adjacency bitmasks to lists and vice-versa.
    • Implemented _build_square_adjacency_list and _build_fast_adjacency_list for square and hexagonal grids respectively.
    • Developed _create_square_grid and _create_hexagonal_grid to generate segment IDs and adjacency lists.
    • Exposed create_segmentation as the main entry point for generating segmentations.
  • pyproject.toml
    • Removed networkx from project dependencies.
  • uv.lock
    • Removed networkx from the lock file's dependencies and requires-dist sections.
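The adjacency bitmask encoding mentioned under segmentation.py can be illustrated with a small pure-Python round trip. The two function names below mirror the described utilities but are written from scratch for illustration: bit i of a segment's mask is set iff segment i is a neighbor.

```python
def adjacency_list_to_bitmasks(adjacency_list):
    """Pack each segment's neighbor tuple into an int bitmask."""
    return tuple(
        sum(1 << neighbor for neighbor in neighbors)
        for neighbors in adjacency_list
    )

def bitmasks_to_adjacency_list(adj_masks):
    """Unpack bitmasks back into tuples of neighbor IDs."""
    adjacency_list = []
    for mask in adj_masks:
        neighbors = []
        i = 0
        while mask:          # stop once no set bits remain
            if mask & 1:
                neighbors.append(i)
            mask >>= 1
            i += 1
        adjacency_list.append(tuple(neighbors))
    return tuple(adjacency_list)

# Round trip on a 3-segment line graph: 0 - 1 - 2
adj = ((1,), (0, 2), (1,))
masks = adjacency_list_to_bitmasks(adj)
print(masks)                                      # → (2, 5, 2)
print(bitmasks_to_adjacency_list(masks) == adj)   # → True
```

Python's arbitrary-precision ints make this encoding compact for any number of segments, at the cost of the unpacking loop the review comments below discuss.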
Activity
  • No specific activity (comments, reviews, approvals) was found in the provided context.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new data module for handling image loading, preprocessing, and segmentation. The overall structure is good, and the use of vectorized operations in the segmentation logic is great for performance. I have provided a few suggestions to improve type safety, address a performance bottleneck in one of the utility functions, and reduce code duplication for better maintainability.

Comment on lines +53 to +60
adjacency_list = []
for mask in adj_masks:
    neighbors = []
    for i in range(len(adj_masks)):
        if mask & (1 << i):
            neighbors.append(i)
    adjacency_list.append(tuple(neighbors))
return tuple(adjacency_list)


high

The current implementation of bitmasks_to_adjacency_list has a time complexity of O(N^2), where N is the number of segments. This can be very slow if the number of segments is large. The inner loop can be made much more efficient by iterating only up to the number of bits in the mask, rather than through all possible segment IDs.

    adjacency_list = []
    for mask in adj_masks:
        neighbors = []
        i = 0
        temp_mask = mask
        while temp_mask > 0:
            if temp_mask & 1:
                neighbors.append(i)
            temp_mask >>= 1
            i += 1
        adjacency_list.append(tuple(neighbors))
    return tuple(adjacency_list)

Comment on lines +3 to +12
from collections.abc import Iterator
from pathlib import Path
from typing import Any


# Supported image formats
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".webp")


def get_image_loader(config: Any) -> Iterator[Path]:


medium

The config parameter is typed as Any, which bypasses static type checking and reduces code clarity. Since the project uses Hydra, it's better to use a more specific type. Using omegaconf.DictConfig will improve type safety and make it clear what kind of object is expected.

from collections.abc import Iterator
from pathlib import Path

from omegaconf import DictConfig


# Supported image formats
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".webp")


def get_image_loader(config: DictConfig) -> Iterator[Path]:

Comment on lines +78 to +117
# Vectorized horizontal adjacency
left = segments[:, :-1].ravel()
right = segments[:, 1:].ravel()
mask_h = left != right
edges_h = np.column_stack([left[mask_h], right[mask_h]])

for seg1, seg2 in edges_h:
    adjacency_sets[seg1].add(seg2)
    adjacency_sets[seg2].add(seg1)

# Vectorized vertical adjacency
top = segments[:-1, :].ravel()
bottom = segments[1:, :].ravel()
mask_v = top != bottom
edges_v = np.column_stack([top[mask_v], bottom[mask_v]])

for seg1, seg2 in edges_v:
    adjacency_sets[seg1].add(seg2)
    adjacency_sets[seg2].add(seg1)

if neighborhood == 8:
    # Vectorized diagonal adjacency (down-right)
    top_left = segments[:-1, :-1].ravel()
    bottom_right = segments[1:, 1:].ravel()
    mask_dr = top_left != bottom_right
    edges_dr = np.column_stack([top_left[mask_dr], bottom_right[mask_dr]])

    for seg1, seg2 in edges_dr:
        adjacency_sets[seg1].add(seg2)
        adjacency_sets[seg2].add(seg1)

    # Vectorized diagonal adjacency (down-left)
    top_right = segments[:-1, 1:].ravel()
    bottom_left = segments[1:, :-1].ravel()
    mask_dl = top_right != bottom_left
    edges_dl = np.column_stack([top_right[mask_dl], bottom_left[mask_dl]])

    for seg1, seg2 in edges_dl:
        adjacency_sets[seg1].add(seg2)
        adjacency_sets[seg2].add(seg1)


medium

The logic for finding and adding edges is repeated for horizontal, vertical, and diagonal neighbors. This code duplication can be reduced by first collecting all edge pairs and then iterating once to populate the adjacency sets. This refactoring improves maintainability and makes the function's intent clearer.

    # Vectorized adjacency
    edge_collections = []

    # Horizontal
    left = segments[:, :-1].ravel()
    right = segments[:, 1:].ravel()
    mask_h = left != right
    edge_collections.append(np.column_stack([left[mask_h], right[mask_h]]))

    # Vertical
    top = segments[:-1, :].ravel()
    bottom = segments[1:, :].ravel()
    mask_v = top != bottom
    edge_collections.append(np.column_stack([top[mask_v], bottom[mask_v]]))

    if neighborhood == 8:
        # Diagonal (down-right)
        top_left = segments[:-1, :-1].ravel()
        bottom_right = segments[1:, 1:].ravel()
        mask_dr = top_left != bottom_right
        edge_collections.append(np.column_stack([top_left[mask_dr], bottom_right[mask_dr]]))

        # Diagonal (down-left)
        top_right = segments[:-1, 1:].ravel()
        bottom_left = segments[1:, :-1].ravel()
        mask_dl = top_right != bottom_left
        edge_collections.append(np.column_stack([top_right[mask_dl], bottom_left[mask_dl]]))

    all_edges = np.concatenate(edge_collections)
    for seg1, seg2 in all_edges:
        adjacency_sets[seg1].add(seg2)
        adjacency_sets[seg2].add(seg1)


Copilot AI left a comment


Pull request overview

This PR adds a new self-contained ciao/data/ module that covers image path loading, ImageNet-style preprocessing, and square/hexagonal segmentation with adjacency encoding for downstream pipeline steps.

Changes:

  • Added ciao/data package with loader, preprocessing, and segmentation utilities.
  • Implemented square + hexagonal segmentation and adjacency bitmask encoding.
  • Removed the unused networkx dependency from project metadata / lockfile.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
ciao/data/segmentation.py New segmentation implementation (square + hex) and adjacency encoding utilities.
ciao/data/preprocessing.py New ImageNet-style preprocessing + image loading helper.
ciao/data/loader.py New Hydra-config-driven image path iterator (single image or directory).
ciao/data/__init__.py Exposes the new data utilities as package exports.
pyproject.toml Drops networkx from dependencies.
uv.lock Lockfile updated to reflect removal of networkx.
.mypy.ini Enables explicit_package_bases and fixes formatting of disable_error_code.


Comment on lines +34 to +35
image = Image.open(image_path).convert("RGB")
input_tensor = preprocess(image).to(device) # (3, 224, 224) - on correct device

Copilot AI Mar 4, 2026


Image.open(image_path) is not closed, which can leak file descriptors when processing many images. Use a context manager (or explicitly close) so the underlying file handle is released after converting/reading.

Suggested change
image = Image.open(image_path).convert("RGB")
input_tensor = preprocess(image).to(device)  # (3, 224, 224) - on correct device
with Image.open(image_path) as image:
    image = image.convert("RGB")
    input_tensor = preprocess(image).to(device)  # (3, 224, 224) - on correct device

Copilot uses AI. Check for mistakes.
Comment on lines +231 to +236
# Use np.unique to assign segment IDs efficiently
_, segments_flat = np.unique(qr_stacked, axis=0, return_inverse=True)
segments = segments_flat.reshape((height, width)).astype(np.int32)

# Build hex_to_id mapping for adjacency construction
unique_qr = np.unique(qr_stacked, axis=0)

Copilot AI Mar 4, 2026


np.unique(qr_stacked, axis=0) is computed twice (once for return_inverse=True and again for unique_qr). Reuse the unique array from the first call to avoid the extra full-array sort/scan and to keep the segment-id mapping tightly coupled to the inverse indices.

Suggested change
# Use np.unique to assign segment IDs efficiently
_, segments_flat = np.unique(qr_stacked, axis=0, return_inverse=True)
segments = segments_flat.reshape((height, width)).astype(np.int32)

# Build hex_to_id mapping for adjacency construction
unique_qr = np.unique(qr_stacked, axis=0)
# Use np.unique to assign segment IDs efficiently and get unique coordinates
unique_qr, segments_flat = np.unique(
    qr_stacked, axis=0, return_inverse=True
)
segments = segments_flat.reshape((height, width)).astype(np.int32)

# Build hex_to_id mapping for adjacency construction

Comment on lines +37 to +38
for ext in IMAGE_EXTENSIONS:
    yield from directory.glob(f"**/*{ext}")

Copilot AI Mar 4, 2026


Directory mode traverses the tree once per extension (for ext in IMAGE_EXTENSIONS: directory.glob(...)), which can be unnecessarily expensive on large folders and still misses upper-case extensions. Consider a single rglob('*') traversal and filter by path.suffix.lower() in IMAGE_EXTENSIONS for both performance and completeness.

Suggested change
for ext in IMAGE_EXTENSIONS:
    yield from directory.glob(f"**/*{ext}")
for path in directory.rglob("*"):
    if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
        yield path

"""
if config.data.get("image_path"):
# Single image mode
yield Path(config.data.image_path)

Copilot AI Mar 4, 2026


In single-image mode, image_path is yielded without validating that it exists and is a file. Adding an is_file() check here (similar to the directory validation for batch_path) would fail fast with a clearer configuration error.

Suggested change
yield Path(config.data.image_path)
image = Path(config.data.image_path)
if not image.is_file():
    raise ValueError(
        f"image_path must be a valid file, got: {image}. "
        "Check for typos or incorrect path configuration."
    )
yield image

@dhalmazna dhalmazna self-assigned this Mar 4, 2026
@dhalmazna dhalmazna requested a review from vejtek March 4, 2026 17:23
