
Introduce MetadataSet (and a generic download) for multi-variable Metadata workflows #235

@glwagner


Summary

Introduce a MetadataSet type representing many Metadata that share a dataset, dates, region, and dir but differ in variable name. The current Metadata/Metadatum covers exactly one variable, so any workflow touching K variables of the same dataset has to fan out K near-identical constructors and K downstream calls.

Co-proposed: rename download_dataset to a generic download verb whose dispatch table can express aggregation policy — the key payoff a MetadataSet unlocks for backends like ERA5 that already support batched multi-variable requests.

Discussion origin: #233 (comment)

Motivation

The friction is concrete and recurring. From examples/era5_breeze.jl in PR #233:

meta_common = (region = era5_region, dir = era5_datadir)

set!(u,  Metadatum(:eastward_velocity;                   dataset=ds_pl, date=start_date, meta_common...))
set!(v,  Metadatum(:northward_velocity;                  dataset=ds_pl, date=start_date, meta_common...))
set!(T,  Metadatum(:temperature;                         dataset=ds_pl, date=start_date, meta_common...))
set!(qᵛ, Metadatum(:specific_humidity;                   dataset=ds_pl, date=start_date, meta_common...))
set!(qᶜ, Metadatum(:specific_cloud_liquid_water_content; dataset=ds_pl, date=start_date, meta_common...))
set!(qⁱ, Metadatum(:specific_cloud_ice_water_content;    dataset=ds_pl, date=start_date, meta_common...))

All four keyword arguments (dataset, date, region, dir) are identical on every line; only the positional variable name varies. The same pattern lives in src/DataWrangling/ECCO/ECCO_atmosphere.jl:37-42 (six Metadata) and examples/ERA5_hourly_data.jl:338-341 (four).

The asymmetry is sharper still because the download path already speaks "many variables, one dataset". Earlier in the same script:

download_dataset(pl_vars, ds_pl, dates; meta_common...)

That signature lives in ext/NumericalEarthCDSAPIExt.jl:280-336 and batches all variables for a calendar day into one CDS API request — but there's no corresponding object the user can hand to set!, Field, or FieldTimeSeries.

Proposed type

struct MetadataSet{V, D, R, N, F}
    names     :: N        # NTuple{K,Symbol} or Vector{Symbol}
    dataset   :: V        # shared
    dates     :: D        # shared; scalar or AbstractVector
    region    :: R        # shared
    dir       :: String   # shared
    filenames :: F        # auto-derived per name; overridable
end

Constructor mirrors Metadata:

mset = MetadataSet([:eastward_velocity, :northward_velocity, :temperature,
                    :specific_humidity, :specific_cloud_liquid_water_content,
                    :specific_cloud_ice_water_content];
                   dataset = ds_pl,
                   dates   = start_date,
                   region  = era5_region,
                   dir     = era5_datadir)

Iteration axis is variables — orthogonal to Metadata's date axis. Every element of a MetadataSet is itself a Metadata (or Metadatum if dates is scalar), so the design composes with every existing method without changing them.

mset[:temperature]   # → Metadata for one variable
mset[1]              # → same, by position
keys(mset)           # variable names
length(mset)         # number of variables
for m in mset ... end
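
The collection interface above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: ToyMetadata and ToyMetadataSet are hypothetical stand-ins for the package's types, the filenames field is omitted, and the element helper is invented for the example.

```julia
# Hypothetical stand-in for the package's Metadata type; the real one carries more fields.
struct ToyMetadata{V, D, R}
    name    :: Symbol
    dataset :: V
    dates   :: D
    region  :: R
    dir     :: String
end

# Simplified set: shared fields plus the list of variable names (filenames omitted).
struct ToyMetadataSet{V, D, R, N}
    names   :: N
    dataset :: V
    dates   :: D
    region  :: R
    dir     :: String
end

# Keyword constructor mirroring the proposed call shape.
ToyMetadataSet(names; dataset, dates, region=nothing, dir=".") =
    ToyMetadataSet(Tuple(names), dataset, dates, region, dir)

# Build the per-variable element on demand from the shared fields (hypothetical helper).
element(mset::ToyMetadataSet, name::Symbol) =
    ToyMetadata(name, mset.dataset, mset.dates, mset.region, mset.dir)

Base.length(mset::ToyMetadataSet) = length(mset.names)
Base.keys(mset::ToyMetadataSet)   = mset.names
Base.getindex(mset::ToyMetadataSet, i::Int) = element(mset, mset.names[i])

function Base.getindex(mset::ToyMetadataSet, name::Symbol)
    name in mset.names || throw(KeyError(name))
    return element(mset, name)
end

# Iterate over variables, yielding one metadata object per name.
function Base.iterate(mset::ToyMetadataSet, i=1)
    i > length(mset) && return nothing
    return mset[i], i + 1
end

mset = ToyMetadataSet([:temperature, :specific_humidity]; dataset=:toy_dataset, dates=1)
names_seen = [m.name for m in mset]
```

Constructing elements lazily keeps the set itself small; whether the real type should instead store its elements is a design choice this sketch does not settle.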

Initial scope (per discussion): multi-variable, any dates — both scalar dates (each element a Metadatum) and vector dates (each element a multi-date Metadata) supported from the start, covering both the ERA5 example and the ECCO atmosphere case.

New methods that exploit the set

set!(fields::NamedTuple, mset::MetadataSet)   # keyed by variable name
set!(model, mset::MetadataSet)                # if model fields match names
download(mset::MetadataSet)                   # batched per backend; see below
FieldTimeSeries(mset::MetadataSet)            # NamedTuple of FTSs, shared backend

The set!(model, mset) form is the cleanup the reviewer asked for in the #233 comment thread linked above.

The era5_breeze.jl snippet collapses to:

fields = (; u, v, T=Tᵃ, qᵛ, qᶜ, qⁱ)
set!(fields, mset)

…or, once the model exists, set!(atmos.model, mset).
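
As a rough sketch of the NamedTuple form, assuming the field keys are matched directly against variable names: the names below (set_all!, load) are hypothetical, a "field" is a plain array, and load merely mimics reading one variable's data. None of this is the package's real Field or set!.

```julia
# Hypothetical loader: returns a constant value per variable instead of reading files.
load(name::Symbol) = name === :temperature ? 280.0 : 0.01

function set_all!(fields::NamedTuple, names)
    # Fail early if a requested variable has no matching field.
    unmatched = [n for n in names if !haskey(fields, n)]
    isempty(unmatched) || throw(ArgumentError("no field for $(unmatched)"))

    # Fill each field from the data of the variable with the same name.
    for name in names
        fill!(fields[name], load(name))
    end
    return fields
end

fields = (temperature = zeros(2), specific_humidity = zeros(2))
set_all!(fields, (:temperature, :specific_humidity))
```

Matching by key means argument order does not matter, at the cost of requiring a rule that relates short field names to long variable names.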

Generic download (supersedes download_dataset)

Add NumericalEarth.DataWrangling.download as the user-facing verb. Two reasons:

  1. Naming. download_dataset(metadata) is a misnomer: the argument is metadata, not a dataset. download(metadata) reads as a verb applied to the object actually passed in.
  2. Aggregation dispatch. A single generic gives MetadataSet somewhere to put its batching policy:
download(m::Metadatum)                     # per-file
download(m::Metadata)                      # current Metadata behavior (iterates dates)
download(mset::MetadataSet)                # aggregates: hand off to a batched backend
                                           # when one exists, fall back to per-variable
download(ms::AbstractVector{<:Metadata})   # generic many-metadata case

For ERA5, download(mset) routes to the existing download_dataset(::Vector{Symbol}, ::ERA5PressureMetadata) machinery (ext/NumericalEarthCDSAPIExt.jl:280-380), bundling all variables for a calendar day into one CDS request and cutting the request count by a factor of roughly K, the number of variables. For backends with no batched form, the default just maps download over the elements.
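
The aggregation dispatch described above might look roughly like this. It is a sketch with hypothetical stand-ins (PlainDataset, BatchedDataset, ToySet, toy_download, fetch_one), and it records "requests" in a list instead of contacting any server.

```julia
# Hypothetical stand-ins for dataset types; not the package's definitions.
struct PlainDataset end      # backend with no batched request form
struct BatchedDataset end    # backend that accepts many variables per request

struct ToySet{V}
    names   :: Vector{Symbol}
    dataset :: V
end

# Record of issued "requests" so the two policies can be compared.
const REQUESTS = Vector{Vector{Symbol}}()

# Per-variable request: one entry per call.
fetch_one(name::Symbol, dataset) = (push!(REQUESTS, [name]); nothing)

# Default policy: no batching available, so map over the variables.
function toy_download(set::ToySet)
    foreach(name -> fetch_one(name, set.dataset), set.names)
    return nothing
end

# Specialized policy: dispatch on the dataset type parameter and send one bundled request.
function toy_download(set::ToySet{BatchedDataset})
    push!(REQUESTS, copy(set.names))
    return nothing
end

vars = [:temperature, :specific_humidity, :eastward_velocity]

toy_download(ToySet(vars, PlainDataset()))
plain_count = length(REQUESTS)       # one request per variable

empty!(REQUESTS)
toy_download(ToySet(vars, BatchedDataset()))
batched_count = length(REQUESTS)     # a single bundled request
```

Because the specialization hangs off the dataset type parameter, a backend opts into batching by adding one method, and every other backend keeps the fallback.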

Migration plan

  • Keep download_dataset as a deprecated thin alias forwarding to download for one minor release.
  • Per-backend methods (download_dataset(::ECCOMetadata), download_dataset(::JRA55Metadata), …) are renamed to download, and their import lines are updated. The rename currently touches:
    • src/DataWrangling/DataWrangling.jl:276 (fallback)
    • src/DataWrangling/ECCO/ECCO.jl:308
    • src/DataWrangling/JRA55/JRA55_metadata.jl:192
    • src/DataWrangling/EN4/EN4.jl:207
    • src/DataWrangling/IBCAO/IBCAO.jl:80, ETOPO/ETOPO.jl:48, IBCSO/IBCSO.jl:75, GEBCO/GEBCO.jl:69, ORCA/ORCA.jl:110
    • src/DataWrangling/OSPapa/OSPapa_*.jl
    • ext/NumericalEarthCDSAPIExt.jl:155, 174, 280, 294, 344, 362, 382
    • ext/NumericalEarthCopernicusMarineExt.jl:16, 24
    • ext/NumericalEarthWOAExt.jl:38
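
One way the deprecated alias could be expressed is with Julia's Base.@deprecate macro. The module name ToyWrangling and the body of download below are hypothetical placeholders; only the macro itself is standard Julia.

```julia
module ToyWrangling   # hypothetical module standing in for DataWrangling

# New generic verb (toy body: just reports what it would fetch).
download(metadata) = "downloading $(metadata)"

# The old name forwards to the new one and emits a deprecation warning
# when Julia runs with --depwarn=yes.
Base.@deprecate download_dataset(metadata) download(metadata)

end # module

result = ToyWrangling.download_dataset(:temperature)
```

Existing call sites keep working through the alias, and the warning points users at the new name before the alias is removed.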

Tasks

  • Add MetadataSet struct + keyword constructor.
  • Implement Base.getindex (Symbol and Int), iterate, length, keys, show.
  • set!(::NamedTuple{<:Any,<:Tuple{Vararg{Field}}}, ::MetadataSet).
  • set!(model, ::MetadataSet) — define the model-side interface for matching field names to variable names (small RFC subthread).
  • FieldTimeSeries(::MetadataSet) returning a NamedTuple of FieldTimeSeries sharing a backend.
  • Introduce download generic; rewire backends; deprecate download_dataset.
  • download(::MetadataSet) with backend hook for batched downloads; default falls back to per-element download.
  • Specialize download(::MetadataSet{<:ERA5PressureLevelsDataset}) onto the existing CDS multi-variable path.
  • Update examples/era5_breeze.jl and examples/ERA5_hourly_data.jl to use MetadataSet.
  • Refactor ECCOPrescribedAtmosphere (src/DataWrangling/ECCO/ECCO_atmosphere.jl) to use MetadataSet internally.

Open questions

  • Should MetadataSet enforce a single shared dataset, or allow heterogeneous datasets (e.g. ERA5 pressure-level + single-level in the same set)? The first cut is single-dataset; mixed could be a follow-up AbstractVector{<:Metadata} path.
  • For set!(model, mset), what's the canonical mapping from variable name (e.g. :eastward_velocity) to model field (e.g. model.velocities.u)? Per-dataset name table, or a name => field user-supplied map?
  • Naming: kept as MetadataSet (a play on "dataset"). Reasonable alternatives considered: MetadataGroup, MetadataCollection, VariableSet.
