Config-driven data source for process_morton_cell (Phase 1) #10
Original plan

## Phase 2 Plan: Template-driven YAML configuration

### Why this is needed

PR #10 completed Phase 1 of #8.
The solution is to make all of this declarative via YAML configs that live outside library code.

### Proposed config structure

A single YAML file with three sections (data source, aggregation, and output):

    # configs/atl06.yaml
    data_source:
      reader: h5coro
      groups: [gt1l, gt1r, gt2l, gt2r, gt3l, gt3r]
      coordinates:
        latitude: "/{group}/land_ice_segments/latitude"
        longitude: "/{group}/land_ice_segments/longitude"
      variables:
        h_li: "/{group}/land_ice_segments/h_li"
        s_li: "/{group}/land_ice_segments/h_li_sigma"
      quality_filter:
        dataset: "/{group}/land_ice_segments/atl06_quality_summary"
        value: 0
    aggregation:
      coordinates:
        cell_ids:
          dtype: uint64
          fill_value: 0
        morton:
          dtype: int64
          fill_value: 0
      variables:
        count:
          function: count
          dtype: int32
          fill_value: 0
        h_min:
          function: min
          source: h_li
          dtype: float32
        h_mean:
          function: weighted_mean
          source: h_li
          weight_col: s_li
          dtype: float32
        h_q25:
          function: quantile
          source: h_li
          params:
            q: 0.25
          dtype: float32
        # ... etc
    output:
      grid: healpix
      indexing_scheme: nested

Design notes on the structure:
### Design decisions

#### One config file vs. multiple

Recommendation: single file with three sections. Alternatives considered:
The "swap data source, keep aggregation" use case is real but means changing ~10 lines in a YAML file, not reimplementing anything. If composability demand grows, we can add

#### Reader extensibility

Simple registry dict, not a plugin system.

    READERS = {
        "h5coro": read_h5coro_groups,
        # "xarray": read_xarray_groups,  # future
    }

Each reader has the same signature:

An entry-point plugin system would be overkill; if we ever have 4+ readers we can revisit. The registry dict is trivially extensible (one line) and has zero framework overhead.

#### Output grid flexibility
### What happens to CellStatsSchema

It currently does three jobs:

### The Lambda integration

Clean separation: PyYAML is only needed on the orchestrator side (where

### Implementation phases

Phase A: New files:

Key validation: load

Phase B: Thread config through schema.py
Phase C: Thread config through processing.py
Phase D: Lambda and CLI integration
Phase E: Second config to validate extensibility
### Backward compatibility

Every change is additive. All new

### Dependency changes
Revised plan

## Revised Phase 2 Plan: Template-driven YAML configuration

Updates to the previous plan based on design review. Three significant changes: dropping Pandera, numpy-backed function resolution, and expression support for composed operations.

### Change 1: Drop Pandera

With YAML configs, role 1 moves to the config file and role 2 is derived from the parsed config. Role 3 is used in exactly one test (

    for col, expected_dtype in config.dtypes.items():
        assert df[col].dtype == np.dtype(expected_dtype)

Pandera is ~15MB with transitive deps and adds to Lambda cold start time. Since we're the only users of

### Change 2: Numpy-backed function resolution (no curated registry)

The previous plan kept

New design:

Resolution rules:
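The dispatch:

    import importlib
    from typing import Callable

    def resolve_function(name: str) -> Callable:
        # "len" and "count" both map to the builtin len
        if name in ("len", "count"):
            return len
        # Bare names are looked up in numpy, e.g. "min" -> numpy.min
        if "." not in name:
            name = f"numpy.{name}"
        # Import the top-level module, then walk the dotted attribute path
        parts = name.split(".")
        obj = importlib.import_module(parts[0])
        for attr in parts[1:]:
            obj = getattr(obj, attr)
        return obj

This means the current hardcoded functions map to numpy as: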
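As a hedged illustration (not the original mapping table), the sketch below applies the dispatch to the function names used in the updated `atl06.yaml` later in this plan. It repeats `resolve_function` so that it runs on its own, and the sample array is made up.

```python
import importlib
from typing import Callable

import numpy as np


def resolve_function(name: str) -> Callable:
    """Same dispatch as above, repeated so this snippet is self-contained."""
    if name in ("len", "count"):
        return len
    if "." not in name:
        name = f"numpy.{name}"
    parts = name.split(".")
    obj = importlib.import_module(parts[0])
    for attr in parts[1:]:
        obj = getattr(obj, attr)
    return obj


# Function names taken from the updated atl06.yaml in this plan.
names = ["len", "min", "max", "average", "var", "quantile"]
resolved = {name: resolve_function(name) for name in names}

# Made-up sample values standing in for one cell's h_li column.
h_li = np.array([1.0, 2.0, 3.0, 4.0])

count = resolved["len"](h_li)
h_min = resolved["min"](h_li)
h_q50 = resolved["quantile"](h_li, q=0.50)
```

Bare names resolve to attributes of the `numpy` module, so `min` here is `numpy.min`, not the Python builtin.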
Users can use any numpy function without touching library code. The

### Change 3: Expression strings for composed operations

Some statistics can't be expressed as a single function call (e.g., the uncertainty of an inverse-variance weighted mean:

    h_sigma:
      expression: "1.0 / np.sqrt(np.sum(1.0 / s_li**2))"
      dtype: float32

Column resolution differs between the two modes:
The expression namespace is restricted:

    namespace = {
        "__builtins__": {},
        "np": np,
        "numpy": np,
        "len": len,
        # + every key from data_source.variables bound to its array
    }

No builtins beyond
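To make the restricted namespace concrete, here is a minimal sketch of how an expression string could be evaluated against it. The name `evaluate_expression` matches the helper listed in the PR summary, but the signature and body shown here are assumptions, and the sample values are made up. Emptying `__builtins__` limits which names an expression can reach; it should not be treated as a sandbox for untrusted input.

```python
import numpy as np


def evaluate_expression(expression: str, columns: dict) -> float:
    """Evaluate an expression string with numpy and the column arrays in scope.

    Hypothetical signature; the real helper may differ.
    """
    namespace = {
        "__builtins__": {},  # hide open(), __import__(), etc.
        "np": np,
        "numpy": np,
        "len": len,
        **columns,  # every data_source variable bound to its array
    }
    return eval(expression, namespace)


# Made-up per-observation uncertainties for one cell.
s_li = np.array([0.5, 0.5, 1.0, 1.0])

h_sigma = evaluate_expression(
    "1.0 / np.sqrt(np.sum(1.0 / s_li**2))", {"s_li": s_li}
)

# Builtins outside the namespace are not reachable by name.
try:
    evaluate_expression("open('/etc/passwd')", {})
    blocked = False
except NameError:
    blocked = True
```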
### Updated config structure

    # configs/atl06.yaml
    data_source:
      reader: h5coro
      groups: [gt1l, gt1r, gt2l, gt2r, gt3l, gt3r]
      coordinates:
        latitude: "/{group}/land_ice_segments/latitude"
        longitude: "/{group}/land_ice_segments/longitude"
      variables:
        h_li: "/{group}/land_ice_segments/h_li"
        s_li: "/{group}/land_ice_segments/h_li_sigma"
      quality_filter:
        dataset: "/{group}/land_ice_segments/atl06_quality_summary"
        value: 0
    aggregation:
      coordinates:
        cell_ids:
          dtype: uint64
          fill_value: 0
        morton:
          dtype: int64
          fill_value: 0
      variables:
        count:
          function: len
          source: h_li
          dtype: int32
          fill_value: 0
        h_min:
          function: min
          source: h_li
          dtype: float32
        h_max:
          function: max
          source: h_li
          dtype: float32
        h_mean:
          function: average
          source: h_li
          params:
            weights: s_li
          dtype: float32
        h_sigma:
          expression: "1.0 / np.sqrt(np.sum(1.0 / s_li**2))"
          dtype: float32
        h_variance:
          function: var
          source: h_li
          dtype: float32
        h_q25:
          function: quantile
          source: h_li
          params:
            q: 0.25
          dtype: float32
        h_q50:
          function: quantile
          source: h_li
          params:
            q: 0.50
          dtype: float32
        h_q75:
          function: quantile
          source: h_li
          params:
            q: 0.75
          dtype: float32
    output:
      grid: healpix
      indexing_scheme: nested

### Updated implementation phases

Phase A:
Phase B: Replace schema.py internals
Phase C: Replace processing.py internals
Phase D: Lambda and CLI integration
Phase E: Validation with a second config
### What stays the same from the previous plan

### What changes from the previous plan
## Status update: plan comments superseded

The two plan comments above (initial plan and revised plan) are now superseded by implementation. Phases A through D are complete and merged into this branch. Key differences from what was planned:

See updated PR description for current state.
## Summary

Implements config-driven pipeline configuration for magg, decoupling all ATL06-specific values into YAML templates. This replaces hardcoded aggregation logic with a declarative system where users can customize data sources, aggregation functions, and output without modifying library code.

Closes #8.

## Changes

- `config.py`: YAML-driven `PipelineConfig` with `load_config()`, `default_config()`, `validate_config()`, `resolve_function()`, `evaluate_expression()`
- `configs/atl06.yaml`: default pipeline config; defines data source (h5coro reader, 6 ground tracks, h_li/s_li variables), aggregation (9 statistics via numpy functions and expressions), and output grid (HEALPix nested)
- `schema.py`: removed `CellStatsSchema` (Pandera); `xdggs_spec()` and `xdggs_zarr_template()` now derive schema from pipeline config
- `processing.py`: `calculate_cell_statistics()` dispatches via config (no more `AGG_FUNCTIONS` registry); `process_morton_cell()` accepts `config: PipelineConfig`; params support expressions (e.g. `weights: "1.0 / s_li**2"`)
- `invoke_lambda.py` accepts `--config` flag; config serialized to JSON dict in Lambda event payload (no PyYAML needed in Lambda)
- Replaced `pandera` with `pyyaml`
- Docs: `api/config.md`, updated architecture and schema design docs

## Design decisions

- `function: min` resolves to `numpy.min`; any numpy function works without library changes
- `expression: "1.0 / np.sqrt(np.sum(1.0 / s_li**2))"` for composed operations in restricted namespace
- `params: {weights: "1.0 / s_li**2"}` allows derived inputs without switching to expression mode
- Config serialized via `dataclasses.asdict()`; no PyYAML in Lambda layer

## Test plan
- `ruff check` clean
- `mkdocs build --strict` clean

## Summary

Implements Phase 1 of #8: extracts all ATL06-hardcoded values from `process_morton_cell()` into a `DataSourceConfig` dataclass.

Previously, `process_morton_cell()` had five things baked in for ICESat-2 ATL06: ground track names, coordinate paths, variable paths, quality filtering, and column mapping. This PR makes all of them configurable via a single data object while preserving identical behavior when called without it.

## Changes

- `DataSourceConfig` dataclass: captures groups, coordinate paths, variable paths, and an optional quality filter. All HDF5 paths use `{group}` template substitution. Includes `to_dict()`/`from_dict()` for JSON serialization (ready for Lambda payloads in Phase 2).
- `ATL06_CONFIG`: default config instance containing the previously-hardcoded ATL06 values, shipped with the package.
- `process_morton_cell()` refactored: new optional `data_source` parameter (defaults to `ATL06_CONFIG`). Per-group HDF5 reading extracted into `_read_group()` helper. Quality filtering is conditional on config.
- 7 new tests: config structure, serialization roundtrip, optional quality filter, template substitution.

## Design decisions (per #8 discussion)

- Explicit groups, no globs: reduces debugging surprises
- Quality filter is optional: `None` means no filtering (for non-NASA datasets)
- Multiple value columns supported: `variables` dict maps any number of output column names to HDF5 paths
- No non-lat/lon coordinate systems: not needed yet

## What this does NOT change

- Aggregation config (schema.py) is untouched
- `calculate_cell_statistics()` signature unchanged
- Existing callers with no `data_source` argument get identical ATL06 behavior

## Test plan

- [x] All 83 tests pass locally
- [x] `ruff check` clean
- [ ] CI passes on 3.12 and 3.13

Closes #8 (Phase 1)
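For readers who want a picture of the Phase 1 `DataSourceConfig`, the following is a minimal sketch rather than the merged implementation. The field names, the `QualityFilter` helper class, and the `resolve()` method are assumptions made for illustration; only the `{group}` templating, the optional quality filter, and `to_dict()`/`from_dict()` come from the description above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class QualityFilter:
    # Hypothetical helper: keep rows where `dataset` equals `value`.
    dataset: str
    value: int


@dataclass
class DataSourceConfig:
    groups: list[str]
    coordinates: dict[str, str]
    variables: dict[str, str]
    quality_filter: Optional[QualityFilter] = None  # None means no filtering

    def resolve(self, template: str, group: str) -> str:
        """Substitute a ground-track name into a `{group}` path template."""
        return template.format(group=group)

    def to_dict(self) -> dict:
        # asdict() also converts the nested QualityFilter to a plain dict.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "DataSourceConfig":
        qf = data.get("quality_filter")
        return cls(
            groups=list(data["groups"]),
            coordinates=dict(data["coordinates"]),
            variables=dict(data["variables"]),
            quality_filter=QualityFilter(**qf) if qf else None,
        )


# Values copied from the ATL06 YAML earlier in this thread.
ATL06_CONFIG = DataSourceConfig(
    groups=["gt1l", "gt1r", "gt2l", "gt2r", "gt3l", "gt3r"],
    coordinates={
        "latitude": "/{group}/land_ice_segments/latitude",
        "longitude": "/{group}/land_ice_segments/longitude",
    },
    variables={
        "h_li": "/{group}/land_ice_segments/h_li",
        "s_li": "/{group}/land_ice_segments/h_li_sigma",
    },
    quality_filter=QualityFilter(
        dataset="/{group}/land_ice_segments/atl06_quality_summary", value=0
    ),
)

# Round-trip through JSON, as a Lambda event payload would.
payload = json.dumps(ATL06_CONFIG.to_dict())
restored = DataSourceConfig.from_dict(json.loads(payload))
h_li_path = restored.resolve(restored.variables["h_li"], "gt1l")
```

Keeping the config a plain dataclass means the Lambda side only needs `json`, which is consistent with the goal of keeping extra dependencies out of the Lambda layer.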