Allowing for dataset derived traits #5

@leifdenby

The current Python API looks like this:

# mymodule.myloader
import xarray as xr

TIME_PROFILE = "observation"
SPACE_PROFILE = "grid"
UNCERTAINTY_PROFILE = "deterministic"

def load_dataset(paths: list[str], **kwargs) -> xr.Dataset:
    ds = xr.open_mfdataset(paths, combine="by_coords", **kwargs)
    return ds

# mlwp_data_loaders.cli
from mlwp_data_loaders import load_dataset
from mlwp_data_specs import validate_dataset

# 1. Load the dataset and extract the trait profiles defined by the loader
ds, dataset_traits = load_dataset(
    [
        "/path/to/file1.nc",
        "/path/to/file2.nc",
    ],
    loader="mymodule.myloader",
    return_dataset_traits=True,
)

# 2. Get a detailed validation report by passing the extracted traits
report = validate_dataset(
    ds,
    time=dataset_traits.get("time_profile"),
    space=dataset_traits.get("space_profile"),
    uncertainty=dataset_traits.get("uncertainty_profile"),
)

In this design, the load_dataset implementation in mlwp-data-loaders reads the statically defined trait properties from the given loader module and returns the loaded dataset and the traits separately.
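
To illustrate the mechanism described above, here is a minimal sketch of how such a dispatching load_dataset could pick up the module-level constants. This is an assumption about the shape of the code, not the actual mlwp-data-loaders implementation: the helper `read_static_traits` and the stand-in module `fake_loader` are hypothetical, and the stand-in returns a plain dict instead of an xr.Dataset so the sketch runs without any files.

```python
import importlib
import sys
import types
from typing import Any

# Constant names follow the loader module shown above; the lowercase keys
# match what the CLI example passes on to validate_dataset.
TRAIT_ATTRIBUTES = ("TIME_PROFILE", "SPACE_PROFILE", "UNCERTAINTY_PROFILE")


def read_static_traits(loader_module: types.ModuleType) -> dict[str, Any]:
    """Collect module-level trait constants; a missing constant becomes None."""
    return {
        name.lower(): getattr(loader_module, name, None)
        for name in TRAIT_ATTRIBUTES
    }


def load_dataset(paths, loader, return_dataset_traits=False, **kwargs):
    """Hypothetical dispatcher: import the loader module and delegate to it."""
    module = importlib.import_module(loader)
    ds = module.load_dataset(paths, **kwargs)
    if return_dataset_traits:
        return ds, read_static_traits(module)
    return ds


# Stand-in loader module (hypothetical), registered in sys.modules so that
# importlib can find it without a file on disk.
fake_loader = types.ModuleType("fake_loader")
fake_loader.TIME_PROFILE = "observation"
fake_loader.SPACE_PROFILE = "grid"
fake_loader.UNCERTAINTY_PROFILE = "deterministic"
fake_loader.load_dataset = lambda paths, **kwargs: {"paths": list(paths)}
sys.modules["fake_loader"] = fake_loader

ds, dataset_traits = load_dataset(
    ["/path/to/file1.nc"],
    loader="fake_loader",
    return_dataset_traits=True,
)
```

Because the traits are read from the module rather than from the loaded data, every dataset produced by a given loader gets the same traits.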

The problem with this is that the traits a loader imposes are hard-coded in the loader module. @mpvginde has pointed out that we might want to let the loader define the traits of a dataset at runtime, based on the contents of what it reads from disk. See, for example, the current "loader" implementation in mxalign: https://github.com/mlwp-tools/mxalign/blob/e2232d93275c7508897a7ddb0cce8b508665f24c/src/mxalign/loaders/base.py#L81-L107

If we instead want the loader to infer the dataset traits from what it reads from disk, then:

  1. the traits for a specific dataset need to be defined at runtime
  2. the traits need to be returned from the loader somehow

I will use this issue to outline a few approaches to achieving this.
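
To make the two requirements concrete, here is a minimal sketch of one possible shape, offered only as an illustration and not as a decided design. It assumes a hypothetical optional hook named `infer_dataset_traits` on the loader module, whose result overrides the static constants. The stand-in dataset is a plain dict, and the "point" value and the x/y rule are placeholders invented for the example.

```python
import types
from typing import Any

TRAIT_ATTRIBUTES = ("TIME_PROFILE", "SPACE_PROFILE", "UNCERTAINTY_PROFILE")


def resolve_traits(loader_module: types.ModuleType, ds: Any) -> dict[str, Any]:
    """Start from static constants, then apply runtime-inferred overrides."""
    traits = {
        name.lower(): getattr(loader_module, name, None)
        for name in TRAIT_ATTRIBUTES
    }
    # Hypothetical hook name: a loader may define it to inspect the dataset.
    infer = getattr(loader_module, "infer_dataset_traits", None)
    if callable(infer):
        traits.update(infer(ds))
    return traits


def _infer_dataset_traits(ds: dict) -> dict[str, str]:
    # Placeholder rule for illustration only: x and y dimensions mean "grid".
    has_grid_dims = {"x", "y"} <= set(ds["dims"])
    return {"space_profile": "grid" if has_grid_dims else "point"}


# Stand-in loader module (hypothetical) with no static SPACE_PROFILE.
runtime_loader = types.ModuleType("runtime_loader")
runtime_loader.TIME_PROFILE = "observation"
runtime_loader.UNCERTAINTY_PROFILE = "deterministic"
runtime_loader.infer_dataset_traits = _infer_dataset_traits

traits_grid = resolve_traits(runtime_loader, {"dims": ("time", "y", "x")})
traits_point = resolve_traits(runtime_loader, {"dims": ("time", "station")})
```

With this shape, existing loaders that only define constants keep working unchanged, while a loader that defines the hook can return different traits for different inputs.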
