feat: implement core data pipeline and image segmentation #5
dhalmazna wants to merge 8 commits into chore/project-setup
Conversation
Summary of Changes (Gemini Code Assist): This pull request establishes the foundational data pipeline for the project by introducing a new, self-contained `ciao/data` module.
Code Review
This pull request introduces a new data module for handling image loading, preprocessing, and segmentation. The overall structure is good, and the use of vectorized operations in the segmentation logic is great for performance. I have provided a few suggestions to improve type safety, address a performance bottleneck in one of the utility functions, and reduce code duplication for better maintainability.
ciao/data/segmentation.py
Outdated
```python
adjacency_list = []
for mask in adj_masks:
    neighbors = []
    for i in range(len(adj_masks)):
        if mask & (1 << i):
            neighbors.append(i)
    adjacency_list.append(tuple(neighbors))
return tuple(adjacency_list)
```
The current implementation of bitmasks_to_adjacency_list has a time complexity of O(N^2), where N is the number of segments. This can be very slow if the number of segments is large. The inner loop can be made much more efficient by iterating only up to the number of bits in the mask, rather than through all possible segment IDs.
```python
adjacency_list = []
for mask in adj_masks:
    neighbors = []
    i = 0
    temp_mask = mask
    while temp_mask > 0:
        if temp_mask & 1:
            neighbors.append(i)
        temp_mask >>= 1
        i += 1
    adjacency_list.append(tuple(neighbors))
return tuple(adjacency_list)
```
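As a quick sanity check, the bit-scanning approach can be exercised standalone on a small hand-built mask list. This sketch wraps it in a function (the name `bitmasks_to_adjacency_list` follows the function named in the review; the sample masks are invented for illustration):

```python
def bitmasks_to_adjacency_list(adj_masks):
    """Decode per-segment neighbor bitmasks into tuples of neighbor IDs."""
    adjacency_list = []
    for mask in adj_masks:
        neighbors = []
        i = 0
        while mask > 0:
            if mask & 1:
                neighbors.append(i)
            mask >>= 1
            i += 1
        adjacency_list.append(tuple(neighbors))
    return tuple(adjacency_list)

# Segment 0 borders 1 and 2 (0b110); segments 1 and 2 each border 0 (0b001).
print(bitmasks_to_adjacency_list([0b110, 0b001, 0b001]))
# → ((1, 2), (0,), (0,))
```

The loop now runs only up to the highest set bit of each mask, so sparse masks decode quickly regardless of the total segment count.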
ciao/data/loader.py
Outdated
```python
from collections.abc import Iterator
from pathlib import Path
from typing import Any

# Supported image formats
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".webp")


def get_image_loader(config: Any) -> Iterator[Path]:
```
The config parameter is typed as Any, which bypasses static type checking and reduces code clarity. Since the project uses Hydra, it's better to use a more specific type. Using omegaconf.DictConfig will improve type safety and make it clear what kind of object is expected.
```python
from collections.abc import Iterator
from pathlib import Path

from omegaconf import DictConfig

# Supported image formats
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".webp")


def get_image_loader(config: DictConfig) -> Iterator[Path]:
```
ciao/data/segmentation.py
Outdated
```python
# Vectorized horizontal adjacency
left = segments[:, :-1].ravel()
right = segments[:, 1:].ravel()
mask_h = left != right
edges_h = np.column_stack([left[mask_h], right[mask_h]])

for seg1, seg2 in edges_h:
    adjacency_sets[seg1].add(seg2)
    adjacency_sets[seg2].add(seg1)

# Vectorized vertical adjacency
top = segments[:-1, :].ravel()
bottom = segments[1:, :].ravel()
mask_v = top != bottom
edges_v = np.column_stack([top[mask_v], bottom[mask_v]])

for seg1, seg2 in edges_v:
    adjacency_sets[seg1].add(seg2)
    adjacency_sets[seg2].add(seg1)

if neighborhood == 8:
    # Vectorized diagonal adjacency (down-right)
    top_left = segments[:-1, :-1].ravel()
    bottom_right = segments[1:, 1:].ravel()
    mask_dr = top_left != bottom_right
    edges_dr = np.column_stack([top_left[mask_dr], bottom_right[mask_dr]])

    for seg1, seg2 in edges_dr:
        adjacency_sets[seg1].add(seg2)
        adjacency_sets[seg2].add(seg1)

    # Vectorized diagonal adjacency (down-left)
    top_right = segments[:-1, 1:].ravel()
    bottom_left = segments[1:, :-1].ravel()
    mask_dl = top_right != bottom_left
    edges_dl = np.column_stack([top_right[mask_dl], bottom_left[mask_dl]])

    for seg1, seg2 in edges_dl:
        adjacency_sets[seg1].add(seg2)
        adjacency_sets[seg2].add(seg1)
```
The logic for finding and adding edges is repeated for horizontal, vertical, and diagonal neighbors. This code duplication can be reduced by first collecting all edge pairs and then iterating once to populate the adjacency sets. This refactoring improves maintainability and makes the function's intent clearer.
```python
# Vectorized adjacency
edge_collections = []

# Horizontal
left = segments[:, :-1].ravel()
right = segments[:, 1:].ravel()
mask_h = left != right
edge_collections.append(np.column_stack([left[mask_h], right[mask_h]]))

# Vertical
top = segments[:-1, :].ravel()
bottom = segments[1:, :].ravel()
mask_v = top != bottom
edge_collections.append(np.column_stack([top[mask_v], bottom[mask_v]]))

if neighborhood == 8:
    # Diagonal (down-right)
    top_left = segments[:-1, :-1].ravel()
    bottom_right = segments[1:, 1:].ravel()
    mask_dr = top_left != bottom_right
    edge_collections.append(np.column_stack([top_left[mask_dr], bottom_right[mask_dr]]))

    # Diagonal (down-left)
    top_right = segments[:-1, 1:].ravel()
    bottom_left = segments[1:, :-1].ravel()
    mask_dl = top_right != bottom_left
    edge_collections.append(np.column_stack([top_right[mask_dl], bottom_left[mask_dl]]))

all_edges = np.concatenate(edge_collections)
for seg1, seg2 in all_edges:
    adjacency_sets[seg1].add(seg2)
    adjacency_sets[seg2].add(seg1)
```
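A minimal, self-contained sketch of this edge-collection pattern on a toy 2×2 label map (4-neighborhood case only; variable names mirror the suggestion, the data is invented):

```python
import numpy as np

segments = np.array([[0, 0],
                     [1, 2]])
n_segments = 3
adjacency_sets = [set() for _ in range(n_segments)]

edge_collections = []
# Horizontal neighbor pairs that cross a segment boundary
left, right = segments[:, :-1].ravel(), segments[:, 1:].ravel()
mask_h = left != right
edge_collections.append(np.column_stack([left[mask_h], right[mask_h]]))
# Vertical neighbor pairs that cross a segment boundary
top, bottom = segments[:-1, :].ravel(), segments[1:, :].ravel()
mask_v = top != bottom
edge_collections.append(np.column_stack([top[mask_v], bottom[mask_v]]))

# Single population pass over all collected edges
for seg1, seg2 in np.concatenate(edge_collections):
    adjacency_sets[seg1].add(seg2)
    adjacency_sets[seg2].add(seg1)

print(adjacency_sets)
```

Segment 0 spans the whole top row, so it touches both 1 and 2, while 1 and 2 also touch each other horizontally.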
Pull request overview
This PR adds a new self-contained ciao/data/ module that covers image path loading, ImageNet-style preprocessing, and square/hexagonal segmentation with adjacency encoding for downstream pipeline steps.
Changes:
- Added `ciao/data` package with loader, preprocessing, and segmentation utilities.
- Implemented square + hexagonal segmentation and adjacency bitmask encoding.
- Removed the unused `networkx` dependency from project metadata / lockfile.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| `ciao/data/segmentation.py` | New segmentation implementation (square + hex) and adjacency encoding utilities. |
| `ciao/data/preprocessing.py` | New ImageNet-style preprocessing + image loading helper. |
| `ciao/data/loader.py` | New Hydra-config-driven image path iterator (single image or directory). |
| `ciao/data/__init__.py` | Exposes the new data utilities as package exports. |
| `pyproject.toml` | Drops `networkx` from dependencies. |
| `uv.lock` | Lockfile updated to reflect removal of `networkx`. |
| `.mypy.ini` | Enables `explicit_package_bases` and fixes formatting of `disable_error_code`. |
ciao/data/preprocessing.py
Outdated
```python
image = Image.open(image_path).convert("RGB")
input_tensor = preprocess(image).to(device)  # (3, 224, 224) - on correct device
```
Image.open(image_path) is not closed, which can leak file descriptors when processing many images. Use a context manager (or explicitly close) so the underlying file handle is released after converting/reading.
```python
with Image.open(image_path) as image:
    image = image.convert("RGB")
    input_tensor = preprocess(image).to(device)  # (3, 224, 224) - on correct device
```
ciao/data/segmentation.py
Outdated
```python
# Use np.unique to assign segment IDs efficiently
_, segments_flat = np.unique(qr_stacked, axis=0, return_inverse=True)
segments = segments_flat.reshape((height, width)).astype(np.int32)

# Build hex_to_id mapping for adjacency construction
unique_qr = np.unique(qr_stacked, axis=0)
```
np.unique(qr_stacked, axis=0) is computed twice (once for return_inverse=True and again for unique_qr). Reuse the unique array from the first call to avoid the extra full-array sort/scan and to keep the segment-id mapping tightly coupled to the inverse indices.
```python
# Use np.unique to assign segment IDs efficiently and get unique coordinates
unique_qr, segments_flat = np.unique(qr_stacked, axis=0, return_inverse=True)
segments = segments_flat.reshape((height, width)).astype(np.int32)

# Build hex_to_id mapping for adjacency construction
```
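A small standalone illustration of the single-call pattern, using toy `qr_stacked` data invented for the example (one axial `(q, r)` pair per pixel of a 2×2 image):

```python
import numpy as np

# Toy (q, r) axial hex coordinates, one row per pixel of a 2x2 image
qr_stacked = np.array([[0, 0], [0, 1], [0, 0], [1, 1]])
height, width = 2, 2

# One np.unique call yields both the unique coordinates and the inverse
# indices, so the id mapping stays coupled to the segment labels.
unique_qr, segments_flat = np.unique(qr_stacked, axis=0, return_inverse=True)
segments = segments_flat.reshape((height, width)).astype(np.int32)

print(unique_qr)  # unique (q, r) pairs, lexicographically sorted
print(segments)   # per-pixel segment IDs indexing into unique_qr
```

Pixels 0 and 2 share the coordinate `(0, 0)`, so they receive the same segment ID, and `unique_qr[segments[y, x]]` recovers each pixel's hex coordinate.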
ciao/data/loader.py
Outdated
```python
for ext in IMAGE_EXTENSIONS:
    yield from directory.glob(f"**/*{ext}")
```
Directory mode traverses the tree once per extension (for ext in IMAGE_EXTENSIONS: directory.glob(...)), which can be unnecessarily expensive on large folders and still misses upper-case extensions. Consider a single rglob('*') traversal and filter by path.suffix.lower() in IMAGE_EXTENSIONS for both performance and completeness.
```python
for path in directory.rglob("*"):
    if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
        yield path
```
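The single-traversal filter can be verified end to end against a throwaway directory (file names and the wrapper `iter_images` are illustrative, not part of the PR):

```python
import tempfile
from pathlib import Path

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".webp")

def iter_images(directory: Path):
    """One rglob pass; suffix.lower() also catches upper-case extensions."""
    for path in sorted(directory.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            yield path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("a.png", "b.TXT", "sub/c.JPG", "sub/d.webp"):
        (root / name).touch()
    found = [p.relative_to(root).as_posix() for p in iter_images(root)]
    print(found)  # → ['a.png', 'sub/c.JPG', 'sub/d.webp']
```

Note that `sub/c.JPG` would be missed by the original per-extension `glob` loop, while `b.TXT` is correctly skipped.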
ciao/data/loader.py
Outdated
| """ | ||
| if config.data.get("image_path"): | ||
| # Single image mode | ||
| yield Path(config.data.image_path) |
In single-image mode, image_path is yielded without validating that it exists and is a file. Adding an is_file() check here (similar to the directory validation for batch_path) would fail fast with a clearer configuration error.
```python
image = Path(config.data.image_path)
if not image.is_file():
    raise ValueError(
        f"image_path must be a valid file, got: {image}. "
        "Check for typos or incorrect path configuration."
    )
yield image
```
Context:
This PR introduces the completely self-contained `data/` module.

What was changed:
- `loader.py`: Image loading utilities.
- `preprocessing.py`: Image tensor transformations.
- `segmentation.py`: Hexagonal and square image segmentation logic.
- `replacement.py`: Image masking strategies (mean color, blur, interlacing, solid color).

Related Task:
XAI-29
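The PR body lists `replacement.py` but its code is not shown in this review, so purely as a hedged illustration, a mean-color masking strategy like the one described could look roughly like this (the function name, signature, and toy data are invented for the sketch):

```python
import numpy as np

def replace_with_mean_color(image: np.ndarray, segments: np.ndarray,
                            segment_id: int) -> np.ndarray:
    """Fill one segment of an (H, W, 3) image with that segment's mean color."""
    out = image.copy()
    mask = segments == segment_id
    # Mean over the segment's pixels, broadcast back onto the masked region
    out[mask] = image[mask].mean(axis=0).astype(image.dtype)
    return out

# 2x2 toy image: left column is segment 0, right column is segment 1
image = np.array([[[10, 0, 0], [200, 0, 0]],
                  [[30, 0, 0], [100, 0, 0]]], dtype=np.uint8)
segments = np.array([[0, 1],
                     [0, 1]])
masked = replace_with_mean_color(image, segments, 0)
print(masked[:, :, 0])  # segment 0's red values (10, 30) become their mean, 20
```

The blur, interlacing, and solid-color strategies named above would plug into the same masked-assignment pattern with a different fill computation.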