Config-driven data source for process_morton_cell (Phase 1) by espg · Pull Request #10 · englacial/zagg

espg · 2026-03-06T03:32:36Z

Update (2026-03-13): The scope of this PR expanded beyond Phase 1. Original description preserved below in strikethrough.

Summary

Implements config-driven pipeline configuration for magg, decoupling all ATL06-specific values into YAML templates. This replaces hardcoded aggregation logic with a declarative system where users can customize data sources, aggregation functions, and output without modifying library code.

Closes #8.

Changes

config.py — YAML-driven PipelineConfig with load_config(), default_config(), validate_config(), resolve_function(), evaluate_expression()
configs/atl06.yaml — default pipeline config; defines data source (h5coro reader, 6 ground tracks, h_li/s_li variables), aggregation (9 statistics via numpy functions and expressions), and output grid (HEALPix nested)
schema.py — removed CellStatsSchema (Pandera); xdggs_spec() and xdggs_zarr_template() now derive schema from pipeline config
processing.py — calculate_cell_statistics() dispatches via config (no more AGG_FUNCTIONS registry); process_morton_cell() accepts config: PipelineConfig; params support expressions (e.g. weights: "1.0 / s_li**2")
Lambda integration — invoke_lambda.py accepts --config flag; config serialized to JSON dict in Lambda event payload (no PyYAML needed in Lambda)
Dependency swap — replaced pandera with pyyaml
Docs — new api/config.md, updated architecture and schema design docs
111 tests passing on Python 3.12 and 3.13

Design decisions

Single YAML file, three sections — data_source, aggregation, output. Cross-validation at load time.
Numpy function resolution — function: min resolves to numpy.min; any numpy function works without library changes
Expression strings — expression: "1.0 / np.sqrt(np.sum(1.0 / s_li**2))" for composed operations in restricted namespace
Param expressions — params: {weights: "1.0 / s_li**2"} allows derived inputs without switching to expression mode
Lambda receives JSON — config serialized via dataclasses.asdict(), no PyYAML in Lambda layer

Test plan

111 tests pass locally (Python 3.12, 3.13)
ruff check clean
mkdocs build --strict clean
CI: test, ruff, build all green

Original description (Phase 1 only — superseded by the above):

~~## Summary~~

~~Implements Phase 1 of #8 — extracts all ATL06-hardcoded values from process_morton_cell() into a DataSourceConfig dataclass.~~

Previously, process_morton_cell() had five things baked in for ICESat-2 ATL06: ground track names, coordinate paths, variable paths, quality filtering, and column mapping. This PR makes all of them configurable via a single data object while preserving identical behavior when called without it.

~~## Changes~~

- DataSourceConfig dataclass — captures groups, coordinate paths, variable paths, and an optional quality filter. All HDF5 paths use {group} template substitution. Includes to_dict()/from_dict() for JSON serialization (ready for Lambda payloads in Phase 2).
~~- ATL06_CONFIG — default config instance containing the previously-hardcoded ATL06 values, shipped with the package.~~
- process_morton_cell() refactored — new optional data_source parameter (defaults to ATL06_CONFIG). Per-group HDF5 reading extracted into _read_group() helper. Quality filtering is conditional on config.
~~- 7 new tests — config structure, serialization roundtrip, optional quality filter, template substitution.~~

~~## Design decisions (per #8 discussion)~~

~~- Explicit groups, no globs — reduces debugging surprises~~
~~- Quality filter is optional — None means no filtering (for non-NASA datasets)~~
~~- Multiple value columns supported — variables dict maps any number of output column names to HDF5 paths~~
~~- No non-lat/lon coordinate systems — not needed yet~~

~~## What this does NOT change~~

~~- Aggregation config (schema.py) is untouched~~
~~- calculate_cell_statistics() signature unchanged~~
~~- Existing callers with no data_source argument get identical ATL06 behavior~~

~~## Test plan~~

~~- [x] All 83 tests pass locally~~
~~- [x] ruff check clean~~
~~- [ ] CI passes on 3.12 and 3.13~~

~~Closes #8 (Phase 1)~~

espg · 2026-03-13T19:37:20Z

Superseded (2026-03-13): Phases A–D are implemented. See status update for what changed from this plan.

Original plan (click to expand)

Phase 2 Plan: Template-driven YAML configuration

Why this is needed

PR #10 completed Phase 1 of #8 — DataSourceConfig is a dataclass, ATL06_CONFIG is the default, and process_morton_cell() accepts an optional data_source parameter. But configuration is still entangled with library code in ways that block real extensibility:

Users can't customize aggregations without forking. CellStatsSchema is a Pandera class in schema.py, AGG_FUNCTIONS is a dict in processing.py. Changing what statistics get computed means editing library source. Users who do this immediately diverge from upstream — they can't pull new releases without merge conflicts against their custom aggregation code.
Lambda deployments are locked to ATL06. The Lambda handler has no way to accept a different aggregation or data source config at invocation time. Switching products means redeploying with different code.
The data source and schema are decoupled by accident. CellStatsSchema assumes columns named h_li and s_li exist (via "source" metadata). ATL06_CONFIG happens to produce those column names. But there's nothing validating this link at config time — a mismatch surfaces as a KeyError deep in the pipeline. (We added validate_schema() to catch this, but the underlying problem is that the config is split across two unrelated Python objects.)
Future data sources need different readers. ATL06 is columnar HDF5 read via h5coro, but other NASA CMR products are 2D raster data opened via xarray. The data source config needs to specify which reader to use.
Output grid is hardcoded to HEALPix. Users should be able to override grid parameters declaratively, and we may want non-HEALPix grids in the future.

The solution is to make all of this declarative via YAML configs that live outside library code.

Proposed config structure

A single YAML file with three sections — data source, aggregation, and output:

# configs/atl06.yaml
data_source:
  reader: h5coro
  groups: [gt1l, gt1r, gt2l, gt2r, gt3l, gt3r]
  coordinates:
    latitude: "/{group}/land_ice_segments/latitude"
    longitude: "/{group}/land_ice_segments/longitude"
  variables:
    h_li: "/{group}/land_ice_segments/h_li"
    s_li: "/{group}/land_ice_segments/h_li_sigma"
  quality_filter:
    dataset: "/{group}/land_ice_segments/atl06_quality_summary"
    value: 0

aggregation:
  coordinates:
    cell_ids:
      dtype: uint64
      fill_value: 0
    morton:
      dtype: int64
      fill_value: 0
  variables:
    count:
      function: count
      dtype: int32
      fill_value: 0
    h_min:
      function: min
      source: h_li
      dtype: float32
    h_mean:
      function: weighted_mean
      source: h_li
      weight_col: s_li
      dtype: float32
    h_q25:
      function: quantile
      source: h_li
      params:
        q: 0.25
      dtype: float32
    # ... etc

output:
  grid: healpix
  indexing_scheme: nested

Design notes on the structure:

weight_col is a top-level key per variable (not nested under params) — it's common enough to deserve visibility
params is a catch-all for function-specific kwargs (like q for quantile)
fill_value defaults to NaN for float types and 0 for int types if omitted
output is optional — omitting it means HEALPix nested (current behavior)
parent_order and child_order are runtime parameters (CLI/Lambda), not config

Design decisions

One config file vs. multiple

Recommendation: single file with three sections.

Alternatives considered:

Approach	Strengths	Weaknesses
Three separate files (data_source.yaml, aggregation.yaml, output.yaml)	Maximum composability — swap data source while keeping aggregation	Cross-validation is hard (aggregation references data_source columns across file boundaries); triple the file management; Lambda needs 3 payloads
One file, three sections	One file to validate, one to ship to Lambda, column cross-validation is trivial at load time	Swapping just the data source means copying the file and changing one section
Hybrid with `!include`/`$ref`	Best of both	Over-engineered; YAML `!include` isn't part of the spec (requires custom loader)

The "swap data source, keep aggregation" use case is real but means changing ~10 lines in a YAML file, not reimplementing anything. If composability demand grows, we can add --override-data-source <path> later — a partial YAML that replaces just that section.

Reader extensibility

Simple registry dict, not a plugin system.

READERS = {
    "h5coro": read_h5coro_groups,
    # "xarray": read_xarray_groups,  # future
}

Each reader has the same signature: (urls, data_source_config, spatial_filter_params, credentials) -> list[pd.DataFrame]. The config's data_source.reader field selects which one. The current _read_group logic gets wrapped into read_h5coro_groups.

An entry-point plugin system would be overkill — if we ever have 4+ readers we can revisit. The registry dict is trivially extensible (one line) and has zero framework overhead.

Output grid flexibility

output.grid: healpix is the only supported value for now and is the default when omitted. The output section is a future extension point. When non-HEALPix grids are needed, we'd add a grid registry (same pattern as readers) and the output section would carry grid-specific parameters.

What happens to CellStatsSchema

It currently does three jobs:

Aggregation metadata → moves to YAML aggregation section
Zarr template generation (dtypes, fill values) → derived from parsed YAML
Output DataFrame validation (Pandera) → optional; the config's own cross-validation is sufficient

The CellStatsSchema class stays in schema.py as the ATL06-specific default for backward compatibility. But it's no longer the source of truth — the YAML is.

Lambda integration

Clean separation: PyYAML is only needed on the orchestrator side (where invoke_lambda.py runs). The config is serialized to a JSON dict and embedded in the Lambda event payload (<1 KB, well within the 256 KB limit). The Lambda handler deserializes plain JSON — no YAML parsing needed in Lambda, no new Lambda layer dependencies.

Implementation phases

Phase A: config.py + default YAML + tests (no existing code changes)

New files:

src/magg/config.py — PipelineConfig dataclass, load_config(), default_config(), validate_config()
src/magg/configs/atl06.yaml — YAML representation of current hardcoded values
src/magg/configs/__init__.py — empty, for importlib.resources
tests/test_config.py — loading, validation, cross-validation, roundtrip, defaults

Key validation: load atl06.yaml and verify it produces identical metadata to _agg_fields() from the current CellStatsSchema.

Phase B: Thread config through schema.py

_get_schema_fields(), _agg_fields(), _fields_by_role() gain optional config parameter
xdggs_spec() and xdggs_zarr_template() gain optional config parameter
When config=None, behavior is identical to today
Module-level constants _COORDS and _DATA_VARS stay as ATL06 defaults

Phase C: Thread config through processing.py

DataSourceConfig gains reader: str = "h5coro" field
calculate_cell_statistics() gains optional config parameter
process_morton_cell() gains optional config: PipelineConfig parameter
ATL06_CONFIG stays as a backward-compat constant

Phase D: Lambda and CLI integration

invoke_lambda.py gets --config CLI flag, serializes config to Lambda events
lambda_handler.py checks for event.get("config"), falls through to ATL06 default
pyproject.toml gets pyyaml dependency + package data for YAML files

Phase E: Second config to validate extensibility

Add configs/atl06_minimal.yaml (count + median only, no quantiles) to prove the system works for non-default configs

Backward compatibility

Every change is additive. All new config parameters default to None, which triggers the current hardcoded behavior. ATL06_CONFIG, CellStatsSchema, _COORDS, _DATA_VARS, AGG_FUNCTIONS — all remain importable and functional. Lambda events without a "config" key work exactly as today. No existing tests need to change.

Dependency changes

Add pyyaml to pyproject.toml. It's only needed on the orchestrator/CLI side — Lambda deserializes plain JSON, so no Lambda layer changes needed.
No other new dependencies. Config system uses importlib.resources + dataclasses + PyYAML.

espg · 2026-03-13T20:30:58Z

Superseded (2026-03-13): Phases A–D are implemented. Key diff from this plan: h_mean uses weights: "1.0 / s_li**2" (not weights: s_li), params support expressions, and CellStatsSchema/AGG_FUNCTIONS were fully deleted (not kept for backward compat). See status update.

Revised plan (click to expand)

Revised Phase 2 Plan: Template-driven YAML configuration

Updates to the previous plan based on design review. Three significant changes: dropping Pandera, numpy-backed function resolution, and expression support for composed operations.

Change 1: Drop Pandera

CellStatsSchema (Pandera DataFrameModel) currently does three things:

Stores aggregation metadata in pa.Field metadata dicts
Generates Zarr template arrays (dtype, fill_value)
Validates output DataFrames (column names + dtypes)

With YAML configs, role 1 moves to the config file and role 2 is derived from the parsed config. Role 3 is used in exactly one test (test_schema.py:201) — CellStatsSchema.validate(df) — and never in production code. That validation is a three-line loop without Pandera:

for col, expected_dtype in config.dtypes.items():
    assert df[col].dtype == np.dtype(expected_dtype)

Pandera is ~15MB with transitive deps and adds to Lambda cold start time. Since we're the only users of CellStatsSchema, there are no downstream consumers to worry about. Remove pandera from dependencies entirely.

Change 2: Numpy-backed function resolution (no curated registry)

The previous plan kept AGG_FUNCTIONS as a curated dict mapping string names to lambdas. This has the same problem we're trying to solve — users who want numpy.std or numpy.ptp need to PR the library.

New design: function: resolves to an actual numpy (or other module) function at load time. No curated registry.

Resolution rules:

Dotted path (numpy.min, numpy.fft.fft) → resolved via getattr on the module chain
No dot (min, average) → shorthand for numpy.min, numpy.average
len → Python builtin len()
count → alias for len (convenience)

The dispatch:

def resolve_function(name: str) -> Callable:
    if name in ("len", "count"):
        return len
    if "." not in name:
        name = f"numpy.{name}"
    parts = name.split(".")
    obj = importlib.import_module(parts[0])
    for attr in parts[1:]:
        obj = getattr(obj, attr)
    return obj

This means the current hardcoded functions map to numpy as:

YAML `function`	Resolves to	Notes
`len` or `count`	`len()`	builtin
`min`	`numpy.min`
`max`	`numpy.max`
`var`	`numpy.var`
`quantile`	`numpy.quantile`	`params: {q: 0.25}`
`median`	`numpy.median`
`average`	`numpy.average`	`params: {weights: s_li}` for weighted mean

Users can use any numpy function without touching library code. The AGG_FUNCTIONS dict is deleted.

Change 3: Expression strings for composed operations

Some statistics can't be expressed as a single function call (e.g., the uncertainty of an inverse-variance weighted mean: 1/sqrt(sum(1/sigma²))). For these, a new expression: field accepts a numpy expression as a quoted string:

h_sigma:
  expression: "1.0 / np.sqrt(np.sum(1.0 / s_li**2))"
  dtype: float32

Column resolution differs between the two modes:

function: — source provides the positional argument; additional columns referenced by name in params (e.g., weights: s_li for numpy.average)
expression: — all data_source.variables columns are available by name in the namespace directly; no source needed

The expression namespace is restricted:

namespace = {
    "__builtins__": {},
    "np": np,
    "numpy": np,
    "len": len,
    # + every key from data_source.variables bound to its array
}

No builtins beyond len, no imports, no file access — same security model as pandas.eval().

function: and expression: are mutually exclusive per variable.

Updated config structure

# configs/atl06.yaml
data_source:
  reader: h5coro
  groups: [gt1l, gt1r, gt2l, gt2r, gt3l, gt3r]
  coordinates:
    latitude: "/{group}/land_ice_segments/latitude"
    longitude: "/{group}/land_ice_segments/longitude"
  variables:
    h_li: "/{group}/land_ice_segments/h_li"
    s_li: "/{group}/land_ice_segments/h_li_sigma"
  quality_filter:
    dataset: "/{group}/land_ice_segments/atl06_quality_summary"
    value: 0

aggregation:
  coordinates:
    cell_ids:
      dtype: uint64
      fill_value: 0
    morton:
      dtype: int64
      fill_value: 0
  variables:
    count:
      function: len
      source: h_li
      dtype: int32
      fill_value: 0
    h_min:
      function: min
      source: h_li
      dtype: float32
    h_max:
      function: max
      source: h_li
      dtype: float32
    h_mean:
      function: average
      source: h_li
      params:
        weights: s_li
      dtype: float32
    h_sigma:
      expression: "1.0 / np.sqrt(np.sum(1.0 / s_li**2))"
      dtype: float32
    h_variance:
      function: var
      source: h_li
      dtype: float32
    h_q25:
      function: quantile
      source: h_li
      params:
        q: 0.25
      dtype: float32
    h_q50:
      function: quantile
      source: h_li
      params:
        q: 0.50
      dtype: float32
    h_q75:
      function: quantile
      source: h_li
      params:
        q: 0.75
      dtype: float32

output:
  grid: healpix
  indexing_scheme: nested

Updated implementation phases

Phase A: config.py + default YAML + tests

PipelineConfig dataclass, load_config(), validate_config(), resolve_function()
configs/atl06.yaml — YAML equivalent of the current hardcoded setup
Validation: load atl06.yaml, verify it produces identical aggregation results to the current AGG_FUNCTIONS + CellStatsSchema path
Test expression evaluation with restricted namespace
Test function resolution (dotted paths, shorthands, invalid names)

Phase B: Replace schema.py internals

xdggs_spec() and xdggs_zarr_template() accept a parsed config instead of reading from CellStatsSchema
Schema field helpers (_get_schema_fields, _agg_fields, _fields_by_role, _COORDS, _DATA_VARS) are replaced by config-driven equivalents
Remove CellStatsSchema class and pandera dependency

Phase C: Replace processing.py internals

calculate_cell_statistics() dispatches via resolve_function() and eval() instead of AGG_FUNCTIONS dict
Delete AGG_FUNCTIONS, _weighted_mean, _weighted_sigma
process_morton_cell() accepts config: PipelineConfig
DataSourceConfig gains reader: str field

Phase D: Lambda and CLI integration

invoke_lambda.py gets --config flag
Config serialized to JSON dict in Lambda event payload
lambda_handler.py deserializes and passes through
Add pyyaml to orchestrator dependencies (not Lambda — Lambda receives JSON)

Phase E: Validation with a second config

configs/atl06_minimal.yaml (count + median only) to prove extensibility

What stays the same from the previous plan

Single YAML file with three sections (not separate files)
Reader registry pattern (READERS dict, data_source.reader field)
Output grid as future extension point (output.grid: healpix only for now)
Lambda receives serialized JSON (no PyYAML in Lambda layer)
Backward compatibility during migration (all new parameters default to None → current behavior)

What changes from the previous plan

Pandera removed — replaced by config-driven validation
AGG_FUNCTIONS registry removed — replaced by numpy function resolution
CellStatsSchema removed — replaced by YAML config
Expression strings added — for composed operations that aren't a single function call
Function names are numpy paths — not curated aliases

espg · 2026-03-13T23:46:56Z

Status update — plan comments superseded

The two plan comments above (initial plan and revised plan) are now superseded by implementation. Phases A through D are complete and merged into this branch. Key differences from what was planned:

~~h_mean uses weights: s_li~~ → weights: "1.0 / s_li**2" (inverse-variance weighting, matching the original _weighted_mean behavior)
~~CellStatsSchema stays for backward compatibility~~ → CellStatsSchema fully removed along with pandera dependency
~~AGG_FUNCTIONS stays importable~~ → deleted entirely; resolve_function() + evaluate_expression() replace it
~~_COORDS, _DATA_VARS remain as defaults~~ → replaced by get_coords(), get_data_vars() from config
New capability not in original plan: param values can be expressions (e.g., weights: "1.0 / s_li**2") — not just bare column names or literals
Phase E (second config for validation) still TODO

See updated PR description for current state.

fully decouple icesat-2, modularize processing and aggregation templates

143c959

espg added 4 commits March 13, 2026 13:45

templating phase a changes

8be0ee2

phase b changes for custom agg templates

f03192a

first pass on phase c for templates

7ec5b66

better handling of expressions over parameter inputs

b11394d

espg added 8 commits March 13, 2026 16:50

clean up and sweep of outdated code

7a1eed7

more pruning

95ae41e

updated example notebook

05a5e08

better local vs remote handing

34e4049

refactor on templates for local vs cloud

08a48c2

docs updates

a7b6ed6

ruff fix

3d5dd30

more docs updates

cd2839b

espg marked this pull request as ready for review March 16, 2026 21:15

example plotting notebook

8a3e934

espg merged commit 7c23c81 into main Mar 16, 2026
8 checks passed

espg deleted the issue-8-datasource-config branch March 16, 2026 21:50

Conversation

espg commented Mar 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Design decisions

Test plan

Uh oh!

espg commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Phase 2 Plan: Template-driven YAML configuration

Why this is needed

Proposed config structure

Design decisions

One config file vs. multiple

Reader extensibility

Output grid flexibility

What happens to CellStatsSchema

Lambda integration

Implementation phases

Backward compatibility

Dependency changes

Uh oh!

espg commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Revised Phase 2 Plan: Template-driven YAML configuration

Change 1: Drop Pandera

Change 2: Numpy-backed function resolution (no curated registry)

Change 3: Expression strings for composed operations

Updated config structure

Updated implementation phases

What stays the same from the previous plan

What changes from the previous plan

Uh oh!

espg commented Mar 13, 2026

Status update — plan comments superseded

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

espg commented Mar 6, 2026 •

edited

Loading

espg commented Mar 13, 2026 •

edited

Loading

espg commented Mar 13, 2026 •

edited

Loading