Summary
Introduce a MetadataSet type representing many Metadata that share a dataset, dates, region, and dir but differ in variable name. The current Metadata/Metadatum covers exactly one variable, so any workflow touching K variables of the same dataset has to fan out K near-identical constructors and K downstream calls.
Co-proposed: rename download_dataset to a generic download verb whose dispatch table can express aggregation policy — the key payoff a MetadataSet unlocks for backends like ERA5 that already support batched multi-variable requests.
Discussion origin: #233 (comment)
Motivation
The friction is concrete and recurring. From examples/era5_breeze.jl in PR #233:
meta_common = (region = era5_region, dir = era5_datadir)
set!(u, Metadatum(:eastward_velocity; dataset=ds_pl, date=start_date, meta_common...))
set!(v, Metadatum(:northward_velocity; dataset=ds_pl, date=start_date, meta_common...))
set!(T, Metadatum(:temperature; dataset=ds_pl, date=start_date, meta_common...))
set!(qᵛ, Metadatum(:specific_humidity; dataset=ds_pl, date=start_date, meta_common...))
set!(qᶜ, Metadatum(:specific_cloud_liquid_water_content; dataset=ds_pl, date=start_date, meta_common...))
set!(qⁱ, Metadatum(:specific_cloud_ice_water_content; dataset=ds_pl, date=start_date, meta_common...))
All four keyword arguments (dataset, date, region, dir) are identical on every line; only the variable name varies. The same pattern lives in src/DataWrangling/ECCO/ECCO_atmosphere.jl:37-42 (six Metadata) and examples/ERA5_hourly_data.jl:338-341 (four).
The asymmetry is sharper still because the download path already speaks "many variables, one dataset". Earlier in the same script:
download_dataset(pl_vars, ds_pl, dates; meta_common...)
That signature lives in ext/NumericalEarthCDSAPIExt.jl:280-336 and batches all variables for a calendar day into one CDS API request — but there's no corresponding object the user can hand to set!, Field, or FieldTimeSeries.
Proposed type
struct MetadataSet{V, D, R, N, F}
    names     :: N       # NTuple{K,Symbol} or Vector{Symbol}
    dataset   :: V       # shared
    dates     :: D       # shared; scalar or AbstractVector
    region    :: R       # shared
    dir       :: String  # shared
    filenames :: F       # auto-derived per name; overridable
end
Constructor mirrors Metadata:
mset = MetadataSet([:eastward_velocity, :northward_velocity, :temperature,
                    :specific_humidity, :specific_cloud_liquid_water_content,
                    :specific_cloud_ice_water_content];
                   dataset = ds_pl,
                   dates = start_date,
                   region = era5_region,
                   dir = era5_datadir)
Iteration axis is variables — orthogonal to Metadata's date axis. Every element of a MetadataSet is itself a Metadata (or Metadatum if dates is scalar), so the design composes with every existing method without changing them.
mset[:temperature] # → Metadata for one variable
mset[1] # → same, by position
keys(mset) # variable names
length(mset) # number of variables
for m in mset ... end
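A minimal sketch of how this element-wise interface could be wired up, using stand-in types rather than the package's own. ToyMetadata and ToyMetadataSet are hypothetical names introduced only for illustration; the sketch omits filenames and all validation, and the real constructor may well differ.

```julia
using Dates

# Hypothetical stand-in for the single-variable Metadata type.
struct ToyMetadata{V, D, R}
    name    :: Symbol
    dataset :: V
    dates   :: D
    region  :: R
    dir     :: String
end

# Hypothetical, simplified MetadataSet (no `filenames` field).
struct ToyMetadataSet{V, D, R, N}
    names   :: N
    dataset :: V
    dates   :: D
    region  :: R
    dir     :: String
end

# Keyword constructor: names are positional, shared properties are keywords.
ToyMetadataSet(names; dataset, dates, region = nothing, dir = ".") =
    ToyMetadataSet(Tuple(names), dataset, dates, region, dir)

Base.length(s::ToyMetadataSet) = length(s.names)
Base.keys(s::ToyMetadataSet)   = s.names

# Positional indexing builds a single-variable metadata object on demand.
Base.getindex(s::ToyMetadataSet, i::Integer) =
    ToyMetadata(s.names[i], s.dataset, s.dates, s.region, s.dir)

# Indexing by name looks up the position first.
function Base.getindex(s::ToyMetadataSet, name::Symbol)
    i = findfirst(==(name), s.names)
    i === nothing && throw(KeyError(name))
    return s[i]
end

# Iteration walks the variable axis, one element per name.
Base.iterate(s::ToyMetadataSet, i = 1) = i > length(s) ? nothing : (s[i], i + 1)

mset = ToyMetadataSet([:temperature, :specific_humidity];
                      dataset = :toy_dataset, dates = Date(2020, 1, 1))

names_seen = [m.name for m in mset]
```

Because each element is built on demand from the shared fields, the set stores the common properties once and never holds K copies of them.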
Initial scope (per discussion): multi-variable, any dates — both scalar dates (each element a Metadatum) and vector dates (each element a multi-date Metadata) supported from the start, covering both the ERA5 example and the ECCO atmosphere case.
New methods that exploit the set
set!(fields::NamedTuple, mset::MetadataSet) # keyed by variable name
set!(model, mset::MetadataSet) # if model fields match names
download(mset::MetadataSet) # batched per backend; see below
FieldTimeSeries(mset::MetadataSet) # NamedTuple of FTSs, shared backend
The set!(model, mset) form is the cleanup the reviewer asked for in the same comment thread.
The era5_breeze.jl snippet collapses to:
fields = (; u, v, T=Tᵃ, qᵛ, qᶜ, qⁱ)
set!(fields, mset)
…or, once the model exists, set!(atmos.model, mset).
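One way the NamedTuple method could reduce to the existing single-variable path is sketched below. It assumes positional pairing between fields and variable names, which is only one possible reading of the snippet above; toy_set! and toy_data are hypothetical stand-ins, with plain vectors in place of fields and a Dict in place of data read by a backend.

```julia
# Hypothetical stand-in for values a backend would read from disk.
toy_data = Dict(:eastward_velocity => 1.0, :northward_velocity => -2.0)

# Stand-in for the existing single-variable set!(field, ::Metadatum).
toy_set!(field::Vector{Float64}, name::Symbol) = fill!(field, toy_data[name])

# Multi-variable form: pair each field with the variable at the same position
# and delegate to the single-variable method.
function toy_set!(fields::NamedTuple, names)
    length(fields) == length(names) ||
        throw(ArgumentError("got $(length(fields)) fields for $(length(names)) variables"))
    for (field, name) in zip(values(fields), names)
        toy_set!(field, name)
    end
    return fields
end

u = zeros(3)
v = zeros(3)
toy_set!((; u, v), (:eastward_velocity, :northward_velocity))
```

Matching by key instead of by position would need a mapping from variable names to field names, which this proposal leaves as an open question.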
Generic download (supersedes download_dataset)
Add NumericalEarth.DataWrangling.download as the user-facing verb. Two reasons:
- Naming. download_dataset(metadata) is a misnomer: the argument is metadata, not a dataset. A bare verb applied to its argument, download(metadata), says what actually happens.
- Aggregation dispatch. A single generic gives MetadataSet somewhere to put its batching policy:
download(m::Metadatum) # per-file
download(m::Metadata) # current Metadata behavior (iterates dates)
download(mset::MetadataSet) # aggregates: hand off to a batched backend
# when one exists, fall back to per-variable
download(ms::AbstractVector{<:Metadata}) # generic many-metadata case
For ERA5, download(mset) routes to the existing download_dataset(::Vector{Symbol}, ::ERA5PressureMetadata) machinery (ext/NumericalEarthCDSAPIExt.jl:280-380), bundling all variables for a calendar day into one CDS request, which cuts the number of requests by a factor of roughly K. For backends with no batched form, the default just maps download over the elements.
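The batched-or-fallback policy can be sketched with ordinary dispatch on the dataset type parameter. Everything named here (ToySet, BatchedDataset, PerFileDataset, toy_download, the requests log) is hypothetical and only simulates requests; it is not the extension's actual code.

```julia
abstract type ToyDataset end
struct BatchedDataset <: ToyDataset end   # backend with a multi-variable request
struct PerFileDataset <: ToyDataset end   # backend without one

struct ToySet{D <: ToyDataset}
    names   :: Vector{Symbol}
    dataset :: D
end

# Log of simulated requests, standing in for real network calls.
requests = String[]

# Per-variable path: one request per variable.
toy_download(name::Symbol, ::ToyDataset) = push!(requests, "single:$name")

# Default for any set: map the per-variable download over the elements.
toy_download(s::ToySet) = foreach(name -> toy_download(name, s.dataset), s.names)

# More specific method: a backend that can batch issues a single request.
toy_download(s::ToySet{BatchedDataset}) = push!(requests, "batch:" * join(s.names, ","))

toy_download(ToySet([:u, :v, :T], PerFileDataset()))   # three simulated requests
toy_download(ToySet([:u, :v, :T], BatchedDataset()))   # one simulated request
```

A backend opts into batching by adding one method; backends that add nothing keep the per-variable behavior through the default.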
Migration plan
- Keep download_dataset as a deprecated thin alias forwarding to download for one minor release.
- Per-backend methods (download_dataset(::ECCOMetadata), download_dataset(::JRA55Metadata), …) get renamed to download and their import lines updated. Currently the rename touches:
src/DataWrangling/DataWrangling.jl:276 (fallback)
src/DataWrangling/ECCO/ECCO.jl:308
src/DataWrangling/JRA55/JRA55_metadata.jl:192
src/DataWrangling/EN4/EN4.jl:207
src/DataWrangling/IBCAO/IBCAO.jl:80, ETOPO/ETOPO.jl:48, IBCSO/IBCSO.jl:75, GEBCO/GEBCO.jl:69, ORCA/ORCA.jl:110
src/DataWrangling/OSPapa/OSPapa_*.jl
ext/NumericalEarthCDSAPIExt.jl:155, 174, 280, 294, 344, 362, 382
ext/NumericalEarthCopernicusMarineExt.jl:16, 24
ext/NumericalEarthWOAExt.jl:38
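The deprecated alias in the first migration step could be written with Julia's built-in Base.@deprecate macro. The sketch below uses hypothetical toy_download and toy_download_dataset names so that it runs on its own; the trailing false keeps the macro from exporting the old name.

```julia
# Hypothetical new generic; stands in for the proposed download.
toy_download(metadata) = "downloaded $(metadata)"

# Old name forwards to the new one. A deprecation warning is printed only
# when Julia runs with --depwarn=yes.
Base.@deprecate toy_download_dataset(metadata) toy_download(metadata) false

old_result = toy_download_dataset(:ecco)
new_result = toy_download(:ecco)
```

Both calls return the same value, so downstream code keeps working during the deprecation window.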
Tasks
- MetadataSet struct + keyword constructor.
- Base.getindex (Symbol and Int), iterate, length, keys, show.
- set!(::NamedTuple{<:Any,<:Tuple{Vararg{Field}}}, ::MetadataSet).
- set!(model, ::MetadataSet): define the model-side interface for matching field names to variable names (small RFC subthread).
- FieldTimeSeries(::MetadataSet) returning a NamedTuple of FieldTimeSeries sharing a backend.
- download generic; rewire backends; deprecate download_dataset.
- download(::MetadataSet) with backend hook for batched downloads; default falls back to per-element download.
- Route download(::MetadataSet{<:ERA5PressureLevelsDataset}) onto the existing CDS multi-variable path.
- Update examples/era5_breeze.jl and examples/ERA5_hourly_data.jl to use MetadataSet.
- Update ECCOPrescribedAtmosphere (src/DataWrangling/ECCO/ECCO_atmosphere.jl) to use MetadataSet internally.
Open questions
- Should MetadataSet enforce a single shared dataset, or allow heterogeneous datasets (e.g. ERA5 pressure-level + single-level in the same set)? The first cut is single-dataset; mixed could be a follow-up AbstractVector{<:Metadata} path.
- For set!(model, mset), what's the canonical mapping from variable name (e.g. :eastward_velocity) to model field (e.g. model.velocities.u)? Per-dataset name table, or a name => field user-supplied map?
- Naming: kept as MetadataSet (play on dataset). Reasonable alternatives considered: MetadataGroup, MetadataCollection, VariableSet.