feat(data): add multi-dataset ingestion pipeline #81

Open
Akshu121796 wants to merge 1 commit into Brijeshthummar02:master from Akshu121796:feat/multi-dataset-ingestion
@Akshu121796

Summary

Implements a modular multi-dataset ingestion pipeline for NeuroVision to support loading MRI datasets from different sources through a unified interface.

Added

  • New dataset_loader.py module

  • Unified dataset loading interface via MultiDatasetLoader

  • TCGA dataset support

  • BraTS dataset support

  • Dataset validation checks for:

    • Required columns
    • File existence
    • Image and mask paths
    • Supported file extensions
  • Base loader architecture for future dataset integrations
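
For illustration only, the architecture listed above could be sketched roughly as follows. This is not the code in dataset_loader.py: it uses plain lists of dicts instead of a DataFrame to stay dependency-free, and the names `BaseLoader`, `CsvIndexLoader`, `FolderPairLoader`, `LOADERS`, `load_dataset`, `validate_records`, and the extension list are all hypothetical. Only the `image_path` and `mask_path` columns, the `data_mask.csv` index, and the `images/` and `masks/` folders come from this thread.

```python
import csv
from abc import ABC, abstractmethod
from pathlib import Path

REQUIRED_COLUMNS = ("image_path", "mask_path")
# Hypothetical list: the PR does not state which extensions it accepts.
SUPPORTED_EXTENSIONS = {".png", ".tif", ".tiff"}


def validate_records(records, root):
    """Return a list of problems found in the records; an empty list means valid."""
    problems = []
    for i, row in enumerate(records):
        for col in REQUIRED_COLUMNS:
            value = row.get(col)
            if not value:
                problems.append(f"row {i}: missing required column {col!r}")
                continue
            path = Path(root) / value
            if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
                problems.append(f"row {i}: unsupported extension {path.suffix!r}")
            if not path.is_file():
                problems.append(f"row {i}: file not found: {value}")
    return problems


class BaseLoader(ABC):
    """Hypothetical base class: each dataset source implements load()."""

    def __init__(self, root):
        self.root = Path(root)

    @abstractmethod
    def load(self):
        """Return a list of dicts with image_path and mask_path keys."""


class CsvIndexLoader(BaseLoader):
    """TCGA-style source: a CSV index (assumed name data_mask.csv) lists the pairs."""

    def load(self):
        with open(self.root / "data_mask.csv", newline="") as fh:
            return [dict(row) for row in csv.DictReader(fh)]


class FolderPairLoader(BaseLoader):
    """BraTS-style source: images/ and masks/ folders, paired by file name (assumed)."""

    def load(self):
        names = sorted(p.name for p in (self.root / "images").iterdir())
        return [{"image_path": f"images/{n}", "mask_path": f"masks/{n}"} for n in names]


# Registry that maps a source name to its loader class.
LOADERS = {"tcga": CsvIndexLoader, "brats": FolderPairLoader}


def load_dataset(source, root):
    """Single entry point: pick a loader by name, load, then validate."""
    if source not in LOADERS:
        raise ValueError(f"unknown dataset source: {source!r}")
    records = LOADERS[source](root).load()
    problems = validate_records(records, root)
    if problems:
        raise ValueError("; ".join(problems))
    return records
```

Under these assumptions, supporting another dataset means adding one `BaseLoader` subclass and one registry entry, while the validation step stays shared.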

Documentation

  • Added Multi-Dataset Ingestion Pipeline section to README
  • Included supported dataset sources
  • Added usage examples
  • Updated project structure documentation

Benefits

  • Supports multiple MRI dataset formats and directory structures
  • Provides a consistent loading workflow across datasets
  • Improves maintainability and future extensibility
  • Enables easier integration of additional MRI datasets

Closes #69

@Akshu121796

Author

Hi @Brijeshthummar02,
I've completed the implementation for Issue #69.

Implemented:
• Modular multi-dataset ingestion pipeline
• Unified dataset loading interface through MultiDatasetLoader
• Support for TCGA and BraTS dataset formats
• Dataset validation checks for structure, file existence, and supported formats
• Extensible architecture using base loader classes for future dataset integration.

Files added/updated:
dataset_loader.py
README.md

Looking forward to your feedback.
If no changes are needed, please add the required label (gssoc:approved).
Thank you!

@Akshu121796

Author

Hey @Brijeshthummar02,
I've also noticed a couple of documentation inconsistencies while working on the repository:

• The README appears somewhat crowded in a few sections, which may affect readability.
• The repository clone command currently points to a different GitHub repository and results in a "Not Found" page.
• The corresponding cd command references a directory name that does not match the current repository.

I haven't opened a separate issue yet, as I wanted to confirm whether these are already known or intentional.

If you'd like, I can create a dedicated documentation issue (or work on a fix) after completing the currently assigned tasks.
Thank you!

@Brijeshthummar02

Owner

@Akshu121796 how did you test the newly added ingestion pipeline?

@Akshu121796

Author

Hi @Brijeshthummar02,
Thank you for reviewing the PR.

For testing, I validated the ingestion pipeline primarily through the existing repository dataset structure and validation logic:
• Verified TCGA loading against the repository's existing data_mask.csv structure and ensured the loader correctly reads the dataset into a unified DataFrame format.
• Tested validation checks by confirming the required columns (image_path, mask_path) are present and by exercising the file existence and supported-extension validation logic.
• Tested the MultiDatasetLoader interface by ensuring dataset loaders are selected through a common loading entry point and return a consistent schema.
• For the BraTS loader, the repository does not currently include a BraTS dataset sample. The implementation was therefore validated through the expected directory structure (images/ and masks/) and validation rules rather than a full training dataset import.

If you would prefer, I can further strengthen the PR by adding sample test cases (or lightweight mock dataset fixtures) to demonstrate the loading and validation flow for both supported dataset types.
Thank you for the feedback.
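
As a rough idea of what the lightweight mock fixtures mentioned above might look like (a sketch, not code from this PR): the helpers below build tiny TCGA-style and BraTS-style layouts from empty placeholder files and run a stand-in file-existence check against them. The names `make_tcga_fixture`, `make_brats_fixture`, and `missing_files` are hypothetical, as are the placeholder file names and the assumption that a BraTS mask shares its image's file name. The real tests would call the loaders in dataset_loader.py instead of `missing_files`.

```python
import csv
import tempfile
from pathlib import Path


def make_tcga_fixture(root, n=2):
    """Create placeholder image/mask files plus a data_mask.csv index."""
    root = Path(root)
    rows = []
    for i in range(n):
        image, mask = f"case_{i}.tif", f"case_{i}_mask.tif"
        (root / image).write_bytes(b"")  # empty placeholder, not a real scan
        (root / mask).write_bytes(b"")
        rows.append({"image_path": image, "mask_path": mask})
    with open(root / "data_mask.csv", "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["image_path", "mask_path"])
        writer.writeheader()
        writer.writerows(rows)
    return rows


def make_brats_fixture(root, n=2):
    """Create images/ and masks/ folders with identically named placeholders."""
    root = Path(root)
    (root / "images").mkdir()
    (root / "masks").mkdir()
    rows = []
    for i in range(n):
        name = f"scan_{i}.png"
        (root / "images" / name).write_bytes(b"")
        (root / "masks" / name).write_bytes(b"")
        rows.append({"image_path": f"images/{name}", "mask_path": f"masks/{name}"})
    return rows


def missing_files(rows, root):
    """Stand-in for the file-existence check: list referenced paths that are absent."""
    root = Path(root)
    return [
        row[col]
        for row in rows
        for col in ("image_path", "mask_path")
        if not (root / row[col]).is_file()
    ]


def test_fixtures_validate_cleanly():
    with tempfile.TemporaryDirectory() as tmp:
        tcga_root = Path(tmp) / "tcga"
        brats_root = Path(tmp) / "brats"
        tcga_root.mkdir()
        brats_root.mkdir()
        assert missing_files(make_tcga_fixture(tcga_root), tcga_root) == []
        assert missing_files(make_brats_fixture(brats_root), brats_root) == []


def test_missing_mask_is_reported():
    with tempfile.TemporaryDirectory() as tmp:
        rows = make_brats_fixture(tmp)
        (Path(tmp) / rows[0]["mask_path"]).unlink()  # simulate a lost mask file
        assert missing_files(rows, tmp) == [rows[0]["mask_path"]]
```

The two `test_` functions follow the naming that pytest collects by default, so a file like this could be dropped into a tests folder without any dataset download.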

@Brijeshthummar02

Owner

@Akshu121796 please review my comment on the code and reply to it.

@Akshu121796

Author

Hi @Brijeshthummar02,
I don't see any inline code review comments on the PR diff. Could you point me to the specific comment or code section you'd like me to review? My previous response covered the testing approach; let me know if you need additional details.

Development

Successfully merging this pull request may close these issues.

Build Multi-Dataset Ingestion Pipeline

2 participants