refactor: overhaul dataset loaders, add UTKFace, and extend AmuletDataset schema by asim29 · Pull Request #73 · ssg-research/amulet

asim29 · 2026-04-21T21:44:17Z

Summary

This PR reworks the dataset substrate that all risk-module pipelines depend on. AmuletDataset gains two new fields, every loader is updated to populate them, ucimlrepo is removed in favour of GDrive downloads, and a full load_utkface loader is added.

Stacked on: pr/abstract-base-classes

What changed and why

`AmuletDataset` schema extension (`amulet/datasets/__data.py`)

Two new fields are added to the dataclass:

| Field | Type | Purpose |
| --- | --- | --- |
| `modality` | `Literal["image", "tabular"]` | Describes tensor shape seen by models. `"image"` for `(C, H, W)` samples; `"tabular"` for 1-D feature vectors. LFW is `"tabular"` because its images are flattened in the loader. Required field (no default) so callers are forced to be explicit. |
| `sensitive_columns` | `list[str] \| None` | Names of `z_train`/`z_test` columns in order. `None` when the dataset has no sensitive attributes. |
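
For reviewers, the required/optional split above can be sketched with a plain dataclass. `AmuletDatasetSketch` is a simplified stand-in, not the class in `amulet/datasets/__data.py`: the field list is taken from the test plan, and the real class may carry more fields. Note that `Literal` is only checked by static type checkers, not at runtime.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Literal


@dataclass
class AmuletDatasetSketch:
    """Simplified stand-in for AmuletDataset (hypothetical, for illustration)."""

    train_set: Any
    test_set: Any
    num_features: int
    num_classes: int
    # Required: no default, so omitting it raises TypeError at construction.
    modality: Literal["image", "tabular"]
    # Optional: names of the sensitive-attribute columns, in column order.
    sensitive_columns: list[str] | None = None


def missing_modality_raises() -> bool:
    """Return True if constructing without `modality` raises TypeError."""
    try:
        AmuletDatasetSketch(  # type: ignore[call-arg]
            train_set=None, test_set=None, num_features=1, num_classes=2
        )
    except TypeError:
        return True
    return False
```

Because `modality` has no default, it has to be declared before `sensitive_columns`; dataclasses reject a non-default field that follows a defaulted one.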

`load_census` rewrite (`amulet/datasets/__tabular_datasets.py`)

- Removes the `ucimlrepo` dependency entirely. The Census CSV is now downloaded directly from Google Drive (`_CENSUS_GDRIVE_ID`) on first use and cached locally, matching how other datasets work.
- `pyproject.toml`: drops `ucimlrepo==0.0.3`; `uv.lock` updated accordingly.
- Returns `AmuletDataset` with `modality="tabular"` and `sensitive_columns=["race", "sex"]`.
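
The "download on first use, then reuse the local copy" behaviour can be sketched as below. This is a minimal illustration, not the loader's code: `ensure_cached` is a hypothetical helper, and the downloader is passed in as a callable because the PR description does not say which Google Drive client is used.

```python
from pathlib import Path
from typing import Callable


def ensure_cached(path: Path, download: Callable[[Path], None]) -> Path:
    """Return `path`, calling `download(path)` only if the file is missing."""
    if not path.exists():
        # Create the cache directory lazily, on the first download.
        path.parent.mkdir(parents=True, exist_ok=True)
        download(path)
    return path
```

Injecting the downloader also lets unit tests substitute a fake that writes a small CSV, so they never touch the network.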

`load_lfw` rewrite (`amulet/datasets/__tabular_datasets.py`)

The previous implementation loaded images via scikit-learn and fetched attributes from a fixed URL. The rewrite:

- Downloads the LFW attributes file from Google Drive (`_LFW_ATTRIBUTES_GDRIVE_ID`) on first use.
- Extracts private helpers: `_lfw_read_attributes`, `_lfw_attr_labels`, `_lfw_build_images_npz`, `_lfw_build_processed_cache`.
- Supports configurable `target`, `attribute_1`, `attribute_2` parameters (previously hardcoded to gender/race).
- Builds a parameter-keyed `.npz` cache so repeated calls with the same arguments are fast.
- Returns `AmuletDataset` with `modality="tabular"` (images are flattened) and `sensitive_columns=[attribute_1, attribute_2]`.
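
One way to read "parameter-keyed" is that the cache filename is derived from the loader arguments, so each `(target, attribute_1, attribute_2)` combination gets its own file. The sketch below assumes that reading; `processed_cache_path`, the `lfw_` prefix, and the name normalisation are all hypothetical and may differ from what `_lfw_build_processed_cache` does.

```python
from pathlib import Path


def processed_cache_path(
    cache_dir: Path, target: str, attribute_1: str, attribute_2: str
) -> Path:
    """Build a cache filename that is unique per (target, attribute_1, attribute_2)."""
    # Normalise names so "Male" and " male" map to the same entry (an assumption).
    parts = [
        p.strip().lower().replace(" ", "_")
        for p in (target, attribute_1, attribute_2)
    ]
    return cache_dir / f"lfw_{'__'.join(parts)}.npz"
```

A loader could then check whether this path exists before rebuilding, writing it with `np.savez` and reading it back with `np.load`.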

`load_celeba` rewrite (`amulet/datasets/__image_datasets.py`)

- Replaces the old CSV-pixel-string approach with numpy-based image loading from the raw image archive.
- Images are downloaded from Google Drive (`_CELEBA_IMAGES_GDRIVE_ID`, `_CELEBA_ATTRS_GDRIVE_ID`) and processed into a `.npz` cache.
- Returns `AmuletDataset` with `modality="image"` and `sensitive_columns` populated from the target attribute config.

`load_utkface` — new loader (`amulet/datasets/__image_datasets.py`)

| Detail | Value |
| --- | --- |
| Source | Google Drive archive download (`_UTKFACE_GDRIVE_ID`) |
| Labels | Parsed from filenames: `age` (0–116 int), `gender` (0/1), `race` (0–4) |
| Cache | Parameter-keyed `.npz` (`target`, `attribute_1`, `attribute_2`) |
| Age discretization | Optional `age_bins` via `np.digitize`, applied to any attribute that is `"age"` |
| Output shape | `(N, 3, 64, 64)` float32 in `[0, 1]` |
| `sensitive_columns` | `[attribute_1, attribute_2]` |
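
The label parsing and age binning in the table can be illustrated as follows. This is a sketch under assumptions, not the loader's code: it relies on the public UTKFace naming scheme `age_gender_race_timestamp`, treats `age_bins` as a list of bin edges, and returns `None` for malformed names (how the real loader handles those is not stated here). Both function names are hypothetical.

```python
from __future__ import annotations

import numpy as np


def parse_utkface_filename(name: str) -> tuple[int, int, int] | None:
    """Return (age, gender, race) parsed from a filename, or None if malformed."""
    parts = name.split("_")
    if len(parts) < 4:
        # Some files in the public dataset omit a field; skip those.
        return None
    try:
        age, gender, race = (int(p) for p in parts[:3])
    except ValueError:
        return None
    return age, gender, race


def discretize_age(ages: np.ndarray, age_bins: list[int]) -> np.ndarray:
    """Map raw ages to bin indices using np.digitize.

    With the default right=False, index i means age_bins[i-1] <= age < age_bins[i].
    """
    return np.digitize(ages, bins=age_bins)
```

For example, with edges `[18, 40, 65]` the ages 5, 18, 40 and 70 fall into bins 0, 1, 2 and 3 respectively.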

Exported from amulet.datasets and registered in load_data() in amulet/utils/__pipeline.py.
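
The `(N, 3, 64, 64)` float32 layout listed for `load_utkface` could be produced from decoded images roughly as shown below. This sketch assumes the images have already been resized to 64x64 and arrive as uint8 `(N, H, W, C)` arrays; `to_chw_float` is a hypothetical helper, not a function from this PR.

```python
import numpy as np


def to_chw_float(images_hwc: np.ndarray) -> np.ndarray:
    """Convert (N, H, W, C) uint8 images to (N, C, H, W) float32 in [0, 1]."""
    if images_hwc.ndim != 4:
        raise ValueError(
            f"expected 4-D (N, H, W, C) input, got shape {images_hwc.shape}"
        )
    # Move the channel axis in front of the spatial axes.
    chw = np.transpose(images_hwc, (0, 3, 1, 2))
    # Copy into contiguous float32 memory, then rescale 0..255 to 0..1.
    return np.ascontiguousarray(chw, dtype=np.float32) / 255.0
```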

CIFAR-10/100, FMNIST, MNIST loaders

All four loaders receive modality="image" in their AmuletDataset construction. Docstrings trimmed to Google style (imperative summary, concise Args/Returns).

Test update (`tests/unit/test_dataset_dataclass.py`)

Replaced the PR1 stub with the refactor-branch version that covers modality and sensitive_columns fields.

Test plan

- `uv run pre-commit run --all-files` passes
- `uv run pytest tests/unit/test_dataset_dataclass.py` passes
- `from amulet.datasets import load_utkface` resolves
- `AmuletDataset(train_set=..., test_set=..., num_features=1, num_classes=2)` raises `TypeError` (missing `modality`)
- `AmuletDataset(..., modality="tabular")` constructs with `sensitive_columns=None`

🤖 Generated with Claude Code

…aset schema

AmuletDataset gains two typed fields: modality ("image" | "tabular", required) to describe the tensor shape seen by models, and sensitive_columns (list[str] | None) to name z_train/z_test columns in order.

Dataset loader changes:
- load_census: replaces ucimlrepo fetch with GDrive download (_CENSUS_GDRIVE_ID); drops ucimlrepo from dependencies; all callers now get modality and sensitive_columns populated.
- load_lfw: rewrites image/attribute pipeline using helper functions (_lfw_read_attributes, _lfw_attr_labels, _lfw_build_images_npz, _lfw_build_processed_cache); supports configurable target and two sensitive attributes with parameter-keyed .npz caching.
- load_celeba: rewrites to use GDrive downloads (_CELEBA_IMAGES_GDRIVE_ID, _CELEBA_ATTRS_GDRIVE_ID) and numpy-based image processing instead of CSV pixel strings; populates modality and sensitive_columns.
- load_utkface: new loader. Parses age/gender/race from filenames, downloads archive from GDrive, builds parameter-keyed .npz cache. Supports age discretization via age_bins. Exported from amulet.datasets.
- load_cifar10, load_cifar100, load_fmnist, load_mnist: add modality="image" to AmuletDataset construction; docstrings trimmed to Google style.

pipeline: adds load_utkface import and elif branch in load_data().
pyproject.toml: removes ucimlrepo==0.0.3 (replaced by GDrive download).
tests: updates test_dataset_dataclass.py to the refactor branch version, which covers the modality and sensitive_columns fields.

asim29 force-pushed the pr/abstract-base-classes branch from db745b5 to 63758d2 Compare April 21, 2026 21:46

asim29 force-pushed the pr/dataset-loaders-and-amuletdataset-extensions branch from be3e9a1 to 13e03b9 Compare April 21, 2026 21:46

asim29 self-assigned this Apr 22, 2026

asim29 wants to merge 1 commit into pr/abstract-base-classes from pr/dataset-loaders-and-amuletdataset-extensions
