refactor: overhaul dataset loaders, add UTKFace, and extend AmuletDataset schema #73

Open
asim29 wants to merge 1 commit into pr/abstract-base-classes from pr/dataset-loaders-and-amuletdataset-extensions
asim29 (Collaborator) commented Apr 21, 2026

Summary

This PR reworks the dataset substrate that all risk-module pipelines depend on. AmuletDataset gains two new fields, every loader is updated to populate them, ucimlrepo is removed in favour of GDrive downloads, and a full load_utkface loader is added.

Stacked on: pr/abstract-base-classes


What changed and why

AmuletDataset schema extension (amulet/datasets/__data.py)

Two new fields are added to the dataclass:

| Field | Type | Purpose |
| --- | --- | --- |
| modality | Literal["image", "tabular"] | Describes tensor shape seen by models. "image" for (C, H, W) samples; "tabular" for 1-D feature vectors. LFW is "tabular" because its images are flattened in the loader. Required field (no default) so callers are forced to be explicit. |
| sensitive_columns | list[str] \| None | Names of z_train/z_test columns in order. None when the dataset has no sensitive attributes. |
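
A minimal sketch of what the extended dataclass might look like. The other field names (train_set, test_set, num_features, num_classes) are taken from the test plan; the class name AmuletDatasetSketch, the Any types, and the None default for sensitive_columns are assumptions for illustration, not the actual definition in amulet/datasets/__data.py.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Literal


@dataclass
class AmuletDatasetSketch:
    """Illustrative stand-in for AmuletDataset; not the real class."""

    train_set: Any
    test_set: Any
    num_features: int
    num_classes: int
    # No default: omitting modality raises TypeError at construction time.
    modality: Literal["image", "tabular"]
    # Assumed default of None, for datasets without sensitive attributes.
    sensitive_columns: list[str] | None = None


ds = AmuletDatasetSketch(
    train_set=None,
    test_set=None,
    num_features=1,
    num_classes=2,
    modality="tabular",
)
```

A dataclass does not check the Literal annotation at runtime, so a value such as modality="audio" would only be caught by a static type checker unless the class validates it explicitly.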

load_census rewrite (amulet/datasets/__tabular_datasets.py)

  • Removes the ucimlrepo dependency entirely. The Census CSV is now downloaded directly from Google Drive (_CENSUS_GDRIVE_ID) on first use and cached locally, matching how other datasets work.
  • pyproject.toml: drops ucimlrepo==0.0.3; uv.lock updated accordingly.
  • Returns AmuletDataset with modality="tabular" and sensitive_columns=["race", "sex"].
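
The download-on-first-use behaviour described above can be sketched roughly as follows. The helper name ensure_cached, the ".part" temporary suffix, and the fake downloader are hypothetical; the real loader's Google Drive fetch is not shown in this PR description, so it is replaced here by a stand-in callable.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_cached(path: Path, download: Callable[[Path], None]) -> Path:
    """Return path, calling download() only if the file is missing."""
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temporary name first so an interrupted transfer
        # does not leave a partial file that looks like a valid cache.
        tmp = path.with_name(path.name + ".part")
        download(tmp)
        tmp.replace(path)
    return path


# Demo with a fake downloader standing in for the real Google Drive fetch.
calls = []


def fake_download(dest: Path) -> None:
    calls.append(dest)
    dest.write_text("age,race,sex\n39,White,Male\n")


with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "census" / "census.csv"
    ensure_cached(target, fake_download)  # first call downloads
    ensure_cached(target, fake_download)  # second call reuses the file
    download_count = len(calls)
```

Passing the downloader in as a callable keeps the caching logic testable without network access.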

load_lfw rewrite (amulet/datasets/__tabular_datasets.py)

The previous implementation loaded images via scikit-learn and fetched attributes from a fixed URL. The rewrite:

  • Downloads the LFW attributes file from Google Drive (_LFW_ATTRIBUTES_GDRIVE_ID) on first use.
  • Extracts private helpers: _lfw_read_attributes, _lfw_attr_labels, _lfw_build_images_npz, _lfw_build_processed_cache.
  • Supports configurable target, attribute_1, attribute_2 parameters (previously hardcoded to gender/race).
  • Builds a parameter-keyed .npz cache so repeated calls with the same arguments are fast.
  • Returns AmuletDataset with modality="tabular" (images are flattened) and sensitive_columns=[attribute_1, attribute_2].
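
One way a parameter-keyed cache name could be derived is sketched below. The function cache_path, the naming scheme, and the "smiling" target value are assumptions for illustration; the actual key format used by _lfw_build_processed_cache is not given in this description.

```python
from pathlib import Path


def cache_path(root: Path, dataset: str, **params: str) -> Path:
    """Build a .npz path whose name encodes the loader arguments."""
    # Sort keys so the file name does not depend on keyword order.
    key = "_".join(f"{name}-{value}" for name, value in sorted(params.items()))
    return root / f"{dataset}_{key}.npz"


root = Path("data")
a = cache_path(root, "lfw", target="smiling", attribute_1="gender", attribute_2="race")
b = cache_path(root, "lfw", attribute_2="race", target="smiling", attribute_1="gender")
c = cache_path(root, "lfw", target="gender", attribute_1="smiling", attribute_2="race")
```

Here a and b resolve to the same file, while c (a different assignment of target and attributes) gets its own cache entry, which is the property that makes repeated calls with identical arguments fast.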

load_celeba rewrite (amulet/datasets/__image_datasets.py)

  • Replaces the old CSV-pixel-string approach with numpy-based image loading from the raw image archive.
  • Images are downloaded from Google Drive (_CELEBA_IMAGES_GDRIVE_ID, _CELEBA_ATTRS_GDRIVE_ID) and processed into a .npz cache.
  • Returns AmuletDataset with modality="image" and sensitive_columns populated from the target attribute config.
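
As a rough sketch of numpy-based processing into a .npz cache, the snippet below converts a batch of decoded images to channel-first float arrays and round-trips them through np.savez_compressed. The synthetic input, the array keys "x" and "y", the file name, and the [0, 1] scaling are assumptions; the PR states the (C, H, W) layout for image modality but does not spell out CelebA's preprocessing.

```python
import tempfile
from pathlib import Path

import numpy as np


def to_model_array(images_hwc: np.ndarray) -> np.ndarray:
    """Convert (N, H, W, C) uint8 images to (N, C, H, W) float32 in [0, 1]."""
    return images_hwc.transpose(0, 3, 1, 2).astype(np.float32) / 255.0


# Synthetic stand-in for decoded images; a real loader would read the archive.
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(4, 8, 8, 3), dtype=np.uint8)
labels = np.array([0, 1, 1, 0], dtype=np.int64)

with tempfile.TemporaryDirectory() as root:
    cache_file = Path(root) / "celeba_sketch.npz"
    np.savez_compressed(cache_file, x=to_model_array(raw), y=labels)
    # Later calls would skip processing and load the arrays directly.
    with np.load(cache_file) as cached:
        x, y = cached["x"], cached["y"]
```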

load_utkface — new loader (amulet/datasets/__image_datasets.py)

| Detail | Value |
| --- | --- |
| Source | Google Drive archive download (_UTKFACE_GDRIVE_ID) |
| Labels | Parsed from filenames: age (0–116 int), gender (0/1), race (0–4) |
| Cache | Parameter-keyed .npz (target, attribute_1, attribute_2) |
| Age discretization | Optional age_bins via np.digitize, applied to any attribute that is "age" |
| Output shape | (N, 3, 64, 64) float32 in [0, 1] |
| sensitive_columns | [attribute_1, attribute_2] |
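
The label parsing and age discretization can be illustrated with a short sketch. It assumes the usual UTKFace naming convention of age_gender_race_timestamp.jpg; the helper parse_utkface_name, the sample file names, and the bin edges are hypothetical and may differ from what load_utkface does.

```python
from __future__ import annotations

import numpy as np


def parse_utkface_name(filename: str) -> tuple[int, int, int] | None:
    """Return (age, gender, race) from a UTKFace-style name, or None."""
    parts = filename.split("_")
    # Expect age, gender, race, and a trailing timestamp part.
    if len(parts) < 4:
        return None
    try:
        age, gender, race = (int(p) for p in parts[:3])
    except ValueError:
        return None
    return age, gender, race


names = [
    "25_0_2_20170116174525125.jpg",
    "61_1_20170109150557335.jpg",  # race field missing, so it is skipped
    "3_1_4_20161221193016140.jpg",
]
parsed = [p for p in map(parse_utkface_name, names) if p is not None]

# Discretize raw ages into bucket indices; the bin edges are illustrative.
ages = np.array([age for age, _, _ in parsed])
age_bins = [18, 40, 65]
age_classes = np.digitize(ages, age_bins)
```

With these edges, np.digitize maps ages below 18 to bucket 0, ages from 18 to 39 to bucket 1, and so on, so the two valid files above fall into buckets 1 and 0 respectively.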

Exported from amulet.datasets and registered in load_data() in amulet/utils/__pipeline.py.

CIFAR-10/100, FMNIST, MNIST loaders

All four loaders receive modality="image" in their AmuletDataset construction. Docstrings trimmed to Google style (imperative summary, concise Args/Returns).

Test update (tests/unit/test_dataset_dataclass.py)

Replaced the PR1 stub with the refactor-branch version that covers modality and sensitive_columns fields.


Test plan

  • uv run pre-commit run --all-files passes
  • uv run pytest tests/unit/test_dataset_dataclass.py passes
  • from amulet.datasets import load_utkface resolves
  • AmuletDataset(train_set=..., test_set=..., num_features=1, num_classes=2) raises TypeError (missing modality)
  • AmuletDataset(..., modality="tabular") constructs with sensitive_columns=None

🤖 Generated with Claude Code

…aset schema

AmuletDataset gains two typed fields: modality ("image" | "tabular"), a
required field that describes the tensor shape seen by models, and
sensitive_columns (list[str] | None), which names the z_train/z_test
columns in order.

Dataset loader changes:
- load_census: replaces ucimlrepo fetch with GDrive download (_CENSUS_GDRIVE_ID);
  drops ucimlrepo from dependencies; all callers now get modality and
  sensitive_columns populated.
- load_lfw: rewrites image/attribute pipeline using helper functions
  (_lfw_read_attributes, _lfw_attr_labels, _lfw_build_images_npz,
  _lfw_build_processed_cache); supports configurable target and two sensitive
  attributes with parameter-keyed .npz caching.
- load_celeba: rewrites to use GDrive downloads (_CELEBA_IMAGES_GDRIVE_ID,
  _CELEBA_ATTRS_GDRIVE_ID) and numpy-based image processing instead of
  CSV pixel strings; populates modality and sensitive_columns.
- load_utkface: new loader. Parses age/gender/race from filenames, downloads
  archive from GDrive, builds parameter-keyed .npz cache. Supports age
  discretization via age_bins. Exported from amulet.datasets.
- load_cifar10, load_cifar100, load_fmnist, load_mnist: add modality="image"
  to AmuletDataset construction; docstrings trimmed to Google style.

pipeline: adds load_utkface import and elif branch in load_data().
pyproject.toml: removes ucimlrepo==0.0.3 (replaced by GDrive download).
tests: updates test_dataset_dataclass.py to the refactor branch version
  which covers modality and sensitive_columns fields.
asim29 force-pushed the pr/abstract-base-classes branch from db745b5 to 63758d2 on April 21, 2026 21:46
asim29 force-pushed the pr/dataset-loaders-and-amuletdataset-extensions branch from be3e9a1 to 13e03b9 on April 21, 2026 21:46
asim29 self-assigned this on Apr 22, 2026