The current Python API looks like this:
```python
# mymodule.myloader
import xarray as xr

TIME_PROFILE = "observation"
SPACE_PROFILE = "grid"
UNCERTAINTY_PROFILE = "deterministic"


def load_dataset(paths: list[str], **kwargs) -> xr.Dataset:
    ds = xr.open_mfdataset(paths, combine="by_coords", **kwargs)
    return ds
```
```python
# mlwp_data_loaders.cli
from mlwp_data_loaders import load_dataset
from mlwp_data_specs import validate_dataset

# 1. Load the dataset and extract the trait profiles defined by the loader
ds, dataset_traits = load_dataset(
    [
        "/path/to/file1.nc",
        "/path/to/file2.nc",
    ],
    loader="mymodule.myloader",
    return_dataset_traits=True,
)

# 2. Get a detailed validation report by passing the extracted traits
report = validate_dataset(
    ds,
    time=dataset_traits.get("time_profile"),
    space=dataset_traits.get("space_profile"),
    uncertainty=dataset_traits.get("uncertainty_profile"),
)
```
In this design, the `load_dataset` implementation in `mlwp-data-loaders` reads the statically defined trait properties from the given loader module and returns the loaded dataset and the traits separately.
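To make the static lookup concrete, here is a minimal sketch of how such a wrapper could work. It is an assumption about the mechanism, not the actual `mlwp-data-loaders` code: `read_static_traits`, `TRAIT_NAMES`, and the in-memory `demo_loader` module are hypothetical names used for illustration, and a plain dict stands in for the `xr.Dataset` so the sketch runs without files on disk.

```python
import importlib
import sys
import types

# Assumed names of the module-level trait constants a loader may define.
TRAIT_NAMES = ("TIME_PROFILE", "SPACE_PROFILE", "UNCERTAINTY_PROFILE")


def read_static_traits(loader_module: types.ModuleType) -> dict:
    """Collect module-level trait constants, keyed by lower-cased name."""
    return {
        name.lower(): getattr(loader_module, name)
        for name in TRAIT_NAMES
        if hasattr(loader_module, name)
    }


def load_dataset(paths, loader, return_dataset_traits=False, **kwargs):
    """Import the loader by dotted path, call it, optionally return its traits."""
    module = importlib.import_module(loader)
    ds = module.load_dataset(paths, **kwargs)
    if return_dataset_traits:
        return ds, read_static_traits(module)
    return ds


# Hypothetical stand-in loader module, registered so import_module can find it.
demo_loader = types.ModuleType("demo_loader")
demo_loader.TIME_PROFILE = "observation"
demo_loader.SPACE_PROFILE = "grid"
demo_loader.UNCERTAINTY_PROFILE = "deterministic"
demo_loader.load_dataset = lambda paths, **kwargs: {"paths": list(paths)}
sys.modules["demo_loader"] = demo_loader

ds, dataset_traits = load_dataset(
    ["a.nc", "b.nc"], loader="demo_loader", return_dataset_traits=True
)
```

Because the traits are read from module attributes, they are fixed at import time and cannot depend on the files passed in `paths`.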
The problem is that the traits a loader imposes are hard-coded in the loader module. @mpvginde has pointed out that we might want to let the loader define the traits of a dataset at runtime, based on the contents of what it reads from disk. See, for example, the current "loader" implementation in `mxalign`: https://github.com/mlwp-tools/mxalign/blob/e2232d93275c7508897a7ddb0cce8b508665f24c/src/mxalign/loaders/base.py#L81-L107
If we instead want the loader to infer the dataset traits from what it reads from disk, then:
- the traits for a specific dataset need to be defined at runtime
- the traits need to be returned from the loader somehow
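To illustrate these two requirements, here is a rough sketch of one possible shape, where the loader returns the traits alongside the dataset. Everything in it is hypothetical: the `DatasetTraits` container, the `infer_traits` rule, the `"ensemble"` value, and the `ensemble_member` dimension name are assumptions made for illustration, and a dict of dimension sizes stands in for the dataset read from disk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetTraits:
    """Hypothetical container for the traits of one loaded dataset."""

    time_profile: Optional[str] = None
    space_profile: Optional[str] = None
    uncertainty_profile: Optional[str] = None


def infer_traits(dim_sizes: dict[str, int]) -> DatasetTraits:
    """Derive traits from the dataset contents (illustrative rule only)."""
    # Assumption: an "ensemble_member" dimension marks an ensemble dataset.
    has_members = "ensemble_member" in dim_sizes
    return DatasetTraits(
        time_profile="observation",
        space_profile="grid",
        uncertainty_profile="ensemble" if has_members else "deterministic",
    )


def load_dataset_with_traits(
    dim_sizes: dict[str, int],
) -> tuple[dict[str, int], DatasetTraits]:
    """Return the dataset and the traits inferred for it at runtime."""
    ds = dict(dim_sizes)  # placeholder for the real read from disk
    return ds, infer_traits(ds)
```

With this shape, two calls to the same loader can yield different traits: a dataset with an `ensemble_member` dimension gets `uncertainty_profile="ensemble"`, while one without it gets `"deterministic"`.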
I will use this issue to outline a few approaches to achieving this.