Skip to content

Config-driven data source for process_morton_cell (Phase 1)#10

Merged
espg merged 14 commits into
mainfrom
issue-8-datasource-config
Mar 16, 2026
Merged

Config-driven data source for process_morton_cell (Phase 1)#10
espg merged 14 commits into
mainfrom
issue-8-datasource-config

Conversation

@espg
Copy link
Copy Markdown
Contributor

@espg espg commented Mar 6, 2026

Update (2026-03-13): The scope of this PR expanded beyond Phase 1. Original description preserved below in strikethrough.

Summary

Implements config-driven pipeline configuration for magg, decoupling all ATL06-specific values into YAML templates. This replaces hardcoded aggregation logic with a declarative system where users can customize data sources, aggregation functions, and output without modifying library code.

Closes #8.

Changes

  • config.py — YAML-driven PipelineConfig with load_config(), default_config(), validate_config(), resolve_function(), evaluate_expression()
  • configs/atl06.yaml — default pipeline config; defines data source (h5coro reader, 6 ground tracks, h_li/s_li variables), aggregation (9 statistics via numpy functions and expressions), and output grid (HEALPix nested)
  • schema.py — removed CellStatsSchema (Pandera); xdggs_spec() and xdggs_zarr_template() now derive schema from pipeline config
  • processing.pycalculate_cell_statistics() dispatches via config (no more AGG_FUNCTIONS registry); process_morton_cell() accepts config: PipelineConfig; params support expressions (e.g. weights: "1.0 / s_li**2")
  • Lambda integrationinvoke_lambda.py accepts --config flag; config serialized to JSON dict in Lambda event payload (no PyYAML needed in Lambda)
  • Dependency swap — replaced pandera with pyyaml
  • Docs — new api/config.md, updated architecture and schema design docs
  • 111 tests passing on Python 3.12 and 3.13

Design decisions

  • Single YAML file, three sections — data_source, aggregation, output. Cross-validation at load time.
  • Numpy function resolutionfunction: min resolves to numpy.min; any numpy function works without library changes
  • Expression stringsexpression: "1.0 / np.sqrt(np.sum(1.0 / s_li**2))" for composed operations in restricted namespace
  • Param expressionsparams: {weights: "1.0 / s_li**2"} allows derived inputs without switching to expression mode
  • Lambda receives JSON — config serialized via dataclasses.asdict(), no PyYAML in Lambda layer

Test plan

  • 111 tests pass locally (Python 3.12, 3.13)
  • ruff check clean
  • mkdocs build --strict clean
  • CI: test, ruff, build all green

Original description (Phase 1 only — superseded by the above):

## Summary

Implements Phase 1 of #8 — extracts all ATL06-hardcoded values from process_morton_cell() into a DataSourceConfig dataclass.

Previously, process_morton_cell() had five things baked in for ICESat-2 ATL06: ground track names, coordinate paths, variable paths, quality filtering, and column mapping. This PR makes all of them configurable via a single data object while preserving identical behavior when called without it.

## Changes

- DataSourceConfig dataclass — captures groups, coordinate paths, variable paths, and an optional quality filter. All HDF5 paths use {group} template substitution. Includes to_dict()/from_dict() for JSON serialization (ready for Lambda payloads in Phase 2).
- ATL06_CONFIG — default config instance containing the previously-hardcoded ATL06 values, shipped with the package.
- process_morton_cell() refactored — new optional data_source parameter (defaults to ATL06_CONFIG). Per-group HDF5 reading extracted into _read_group() helper. Quality filtering is conditional on config.
- 7 new tests — config structure, serialization roundtrip, optional quality filter, template substitution.

## Design decisions (per #8 discussion)

- Explicit groups, no globs — reduces debugging surprises
- Quality filter is optionalNone means no filtering (for non-NASA datasets)
- Multiple value columns supportedvariables dict maps any number of output column names to HDF5 paths
- No non-lat/lon coordinate systems — not needed yet

## What this does NOT change

- Aggregation config (schema.py) is untouched
- calculate_cell_statistics() signature unchanged
- Existing callers with no data_source argument get identical ATL06 behavior

## Test plan

- [x] All 83 tests pass locally
- [x] ruff check clean
- [ ] CI passes on 3.12 and 3.13

Closes #8 (Phase 1)

@espg
Copy link
Copy Markdown
Contributor Author

espg commented Mar 13, 2026

Superseded (2026-03-13): Phases A–D are implemented. See status update for what changed from this plan.

Original plan (click to expand)

Phase 2 Plan: Template-driven YAML configuration

Why this is needed

PR #10 completed Phase 1 of #8DataSourceConfig is a dataclass, ATL06_CONFIG is the default, and process_morton_cell() accepts an optional data_source parameter. But configuration is still entangled with library code in ways that block real extensibility:

  1. Users can't customize aggregations without forking. CellStatsSchema is a Pandera class in schema.py, AGG_FUNCTIONS is a dict in processing.py. Changing what statistics get computed means editing library source. Users who do this immediately diverge from upstream — they can't pull new releases without merge conflicts against their custom aggregation code.

  2. Lambda deployments are locked to ATL06. The Lambda handler has no way to accept a different aggregation or data source config at invocation time. Switching products means redeploying with different code.

  3. The data source and schema are decoupled by accident. CellStatsSchema assumes columns named h_li and s_li exist (via "source" metadata). ATL06_CONFIG happens to produce those column names. But there's nothing validating this link at config time — a mismatch surfaces as a KeyError deep in the pipeline. (We added validate_schema() to catch this, but the underlying problem is that the config is split across two unrelated Python objects.)

  4. Future data sources need different readers. ATL06 is columnar HDF5 read via h5coro, but other NASA CMR products are 2D raster data opened via xarray. The data source config needs to specify which reader to use.

  5. Output grid is hardcoded to HEALPix. Users should be able to override grid parameters declaratively, and we may want non-HEALPix grids in the future.

The solution is to make all of this declarative via YAML configs that live outside library code.


Proposed config structure

A single YAML file with three sections — data source, aggregation, and output:

# configs/atl06.yaml
data_source:
  reader: h5coro
  groups: [gt1l, gt1r, gt2l, gt2r, gt3l, gt3r]
  coordinates:
    latitude: "/{group}/land_ice_segments/latitude"
    longitude: "/{group}/land_ice_segments/longitude"
  variables:
    h_li: "/{group}/land_ice_segments/h_li"
    s_li: "/{group}/land_ice_segments/h_li_sigma"
  quality_filter:
    dataset: "/{group}/land_ice_segments/atl06_quality_summary"
    value: 0

aggregation:
  coordinates:
    cell_ids:
      dtype: uint64
      fill_value: 0
    morton:
      dtype: int64
      fill_value: 0
  variables:
    count:
      function: count
      dtype: int32
      fill_value: 0
    h_min:
      function: min
      source: h_li
      dtype: float32
    h_mean:
      function: weighted_mean
      source: h_li
      weight_col: s_li
      dtype: float32
    h_q25:
      function: quantile
      source: h_li
      params:
        q: 0.25
      dtype: float32
    # ... etc

output:
  grid: healpix
  indexing_scheme: nested

Design notes on the structure:

  • weight_col is a top-level key per variable (not nested under params) — it's common enough to deserve visibility
  • params is a catch-all for function-specific kwargs (like q for quantile)
  • fill_value defaults to NaN for float types and 0 for int types if omitted
  • output is optional — omitting it means HEALPix nested (current behavior)
  • parent_order and child_order are runtime parameters (CLI/Lambda), not config

Design decisions

One config file vs. multiple

Recommendation: single file with three sections.

Alternatives considered:

Approach Strengths Weaknesses
Three separate files (data_source.yaml, aggregation.yaml, output.yaml) Maximum composability — swap data source while keeping aggregation Cross-validation is hard (aggregation references data_source columns across file boundaries); triple the file management; Lambda needs 3 payloads
One file, three sections One file to validate, one to ship to Lambda, column cross-validation is trivial at load time Swapping just the data source means copying the file and changing one section
Hybrid with !include/$ref Best of both Over-engineered; YAML !include isn't part of the spec (requires custom loader)

The "swap data source, keep aggregation" use case is real but means changing ~10 lines in a YAML file, not reimplementing anything. If composability demand grows, we can add --override-data-source <path> later — a partial YAML that replaces just that section.

Reader extensibility

Simple registry dict, not a plugin system.

READERS = {
    "h5coro": read_h5coro_groups,
    # "xarray": read_xarray_groups,  # future
}

Each reader has the same signature: (urls, data_source_config, spatial_filter_params, credentials) -> list[pd.DataFrame]. The config's data_source.reader field selects which one. The current _read_group logic gets wrapped into read_h5coro_groups.

An entry-point plugin system would be overkill — if we ever have 4+ readers we can revisit. The registry dict is trivially extensible (one line) and has zero framework overhead.

Output grid flexibility

output.grid: healpix is the only supported value for now and is the default when omitted. The output section is a future extension point. When non-HEALPix grids are needed, we'd add a grid registry (same pattern as readers) and the output section would carry grid-specific parameters.

What happens to CellStatsSchema

It currently does three jobs:

  1. Aggregation metadata → moves to YAML aggregation section
  2. Zarr template generation (dtypes, fill values) → derived from parsed YAML
  3. Output DataFrame validation (Pandera) → optional; the config's own cross-validation is sufficient

The CellStatsSchema class stays in schema.py as the ATL06-specific default for backward compatibility. But it's no longer the source of truth — the YAML is.

Lambda integration

Clean separation: PyYAML is only needed on the orchestrator side (where invoke_lambda.py runs). The config is serialized to a JSON dict and embedded in the Lambda event payload (<1 KB, well within the 256 KB limit). The Lambda handler deserializes plain JSON — no YAML parsing needed in Lambda, no new Lambda layer dependencies.


Implementation phases

Phase A: config.py + default YAML + tests (no existing code changes)

New files:

  • src/magg/config.pyPipelineConfig dataclass, load_config(), default_config(), validate_config()
  • src/magg/configs/atl06.yaml — YAML representation of current hardcoded values
  • src/magg/configs/__init__.py — empty, for importlib.resources
  • tests/test_config.py — loading, validation, cross-validation, roundtrip, defaults

Key validation: load atl06.yaml and verify it produces identical metadata to _agg_fields() from the current CellStatsSchema.

Phase B: Thread config through schema.py

  • _get_schema_fields(), _agg_fields(), _fields_by_role() gain optional config parameter
  • xdggs_spec() and xdggs_zarr_template() gain optional config parameter
  • When config=None, behavior is identical to today
  • Module-level constants _COORDS and _DATA_VARS stay as ATL06 defaults

Phase C: Thread config through processing.py

  • DataSourceConfig gains reader: str = "h5coro" field
  • calculate_cell_statistics() gains optional config parameter
  • process_morton_cell() gains optional config: PipelineConfig parameter
  • ATL06_CONFIG stays as a backward-compat constant

Phase D: Lambda and CLI integration

  • invoke_lambda.py gets --config CLI flag, serializes config to Lambda events
  • lambda_handler.py checks for event.get("config"), falls through to ATL06 default
  • pyproject.toml gets pyyaml dependency + package data for YAML files

Phase E: Second config to validate extensibility

  • Add configs/atl06_minimal.yaml (count + median only, no quantiles) to prove the system works for non-default configs

Backward compatibility

Every change is additive. All new config parameters default to None, which triggers the current hardcoded behavior. ATL06_CONFIG, CellStatsSchema, _COORDS, _DATA_VARS, AGG_FUNCTIONS — all remain importable and functional. Lambda events without a "config" key work exactly as today. No existing tests need to change.


Dependency changes

  • Add pyyaml to pyproject.toml. It's only needed on the orchestrator/CLI side — Lambda deserializes plain JSON, so no Lambda layer changes needed.
  • No other new dependencies. Config system uses importlib.resources + dataclasses + PyYAML.

@espg
Copy link
Copy Markdown
Contributor Author

espg commented Mar 13, 2026

Superseded (2026-03-13): Phases A–D are implemented. Key diff from this plan: h_mean uses weights: "1.0 / s_li**2" (not weights: s_li), params support expressions, and CellStatsSchema/AGG_FUNCTIONS were fully deleted (not kept for backward compat). See status update.

Revised plan (click to expand)

Revised Phase 2 Plan: Template-driven YAML configuration

Updates to the previous plan based on design review. Three significant changes: dropping Pandera, numpy-backed function resolution, and expression support for composed operations.


Change 1: Drop Pandera

CellStatsSchema (Pandera DataFrameModel) currently does three things:

  1. Stores aggregation metadata in pa.Field metadata dicts
  2. Generates Zarr template arrays (dtype, fill_value)
  3. Validates output DataFrames (column names + dtypes)

With YAML configs, role 1 moves to the config file and role 2 is derived from the parsed config. Role 3 is used in exactly one test (test_schema.py:201) — CellStatsSchema.validate(df) — and never in production code. That validation is a three-line loop without Pandera:

for col, expected_dtype in config.dtypes.items():
    assert df[col].dtype == np.dtype(expected_dtype)

Pandera is ~15MB with transitive deps and adds to Lambda cold start time. Since we're the only users of CellStatsSchema, there are no downstream consumers to worry about. Remove pandera from dependencies entirely.


Change 2: Numpy-backed function resolution (no curated registry)

The previous plan kept AGG_FUNCTIONS as a curated dict mapping string names to lambdas. This has the same problem we're trying to solve — users who want numpy.std or numpy.ptp need to PR the library.

New design: function: resolves to an actual numpy (or other module) function at load time. No curated registry.

Resolution rules:

  • Dotted path (numpy.min, numpy.fft.fft) → resolved via getattr on the module chain
  • No dot (min, average) → shorthand for numpy.min, numpy.average
  • len → Python builtin len()
  • count → alias for len (convenience)

The dispatch:

def resolve_function(name: str) -> Callable:
    if name in ("len", "count"):
        return len
    if "." not in name:
        name = f"numpy.{name}"
    parts = name.split(".")
    obj = importlib.import_module(parts[0])
    for attr in parts[1:]:
        obj = getattr(obj, attr)
    return obj

This means the current hardcoded functions map to numpy as:

YAML function Resolves to Notes
len or count len() builtin
min numpy.min
max numpy.max
var numpy.var
quantile numpy.quantile params: {q: 0.25}
median numpy.median
average numpy.average params: {weights: s_li} for weighted mean

Users can use any numpy function without touching library code. The AGG_FUNCTIONS dict is deleted.


Change 3: Expression strings for composed operations

Some statistics can't be expressed as a single function call (e.g., the uncertainty of an inverse-variance weighted mean: 1/sqrt(sum(1/sigma²))). For these, a new expression: field accepts a numpy expression as a quoted string:

h_sigma:
  expression: "1.0 / np.sqrt(np.sum(1.0 / s_li**2))"
  dtype: float32

Column resolution differs between the two modes:

  • function:source provides the positional argument; additional columns referenced by name in params (e.g., weights: s_li for numpy.average)
  • expression: — all data_source.variables columns are available by name in the namespace directly; no source needed

The expression namespace is restricted:

namespace = {
    "__builtins__": {},
    "np": np,
    "numpy": np,
    "len": len,
    # + every key from data_source.variables bound to its array
}

No builtins beyond len, no imports, no file access — same security model as pandas.eval().

function: and expression: are mutually exclusive per variable.


Updated config structure

# configs/atl06.yaml
data_source:
  reader: h5coro
  groups: [gt1l, gt1r, gt2l, gt2r, gt3l, gt3r]
  coordinates:
    latitude: "/{group}/land_ice_segments/latitude"
    longitude: "/{group}/land_ice_segments/longitude"
  variables:
    h_li: "/{group}/land_ice_segments/h_li"
    s_li: "/{group}/land_ice_segments/h_li_sigma"
  quality_filter:
    dataset: "/{group}/land_ice_segments/atl06_quality_summary"
    value: 0

aggregation:
  coordinates:
    cell_ids:
      dtype: uint64
      fill_value: 0
    morton:
      dtype: int64
      fill_value: 0
  variables:
    count:
      function: len
      source: h_li
      dtype: int32
      fill_value: 0
    h_min:
      function: min
      source: h_li
      dtype: float32
    h_max:
      function: max
      source: h_li
      dtype: float32
    h_mean:
      function: average
      source: h_li
      params:
        weights: s_li
      dtype: float32
    h_sigma:
      expression: "1.0 / np.sqrt(np.sum(1.0 / s_li**2))"
      dtype: float32
    h_variance:
      function: var
      source: h_li
      dtype: float32
    h_q25:
      function: quantile
      source: h_li
      params:
        q: 0.25
      dtype: float32
    h_q50:
      function: quantile
      source: h_li
      params:
        q: 0.50
      dtype: float32
    h_q75:
      function: quantile
      source: h_li
      params:
        q: 0.75
      dtype: float32

output:
  grid: healpix
  indexing_scheme: nested

Updated implementation phases

Phase A: config.py + default YAML + tests

  • PipelineConfig dataclass, load_config(), validate_config(), resolve_function()
  • configs/atl06.yaml — YAML equivalent of the current hardcoded setup
  • Validation: load atl06.yaml, verify it produces identical aggregation results to the current AGG_FUNCTIONS + CellStatsSchema path
  • Test expression evaluation with restricted namespace
  • Test function resolution (dotted paths, shorthands, invalid names)

Phase B: Replace schema.py internals

  • xdggs_spec() and xdggs_zarr_template() accept a parsed config instead of reading from CellStatsSchema
  • Schema field helpers (_get_schema_fields, _agg_fields, _fields_by_role, _COORDS, _DATA_VARS) are replaced by config-driven equivalents
  • Remove CellStatsSchema class and pandera dependency

Phase C: Replace processing.py internals

  • calculate_cell_statistics() dispatches via resolve_function() and eval() instead of AGG_FUNCTIONS dict
  • Delete AGG_FUNCTIONS, _weighted_mean, _weighted_sigma
  • process_morton_cell() accepts config: PipelineConfig
  • DataSourceConfig gains reader: str field

Phase D: Lambda and CLI integration

  • invoke_lambda.py gets --config flag
  • Config serialized to JSON dict in Lambda event payload
  • lambda_handler.py deserializes and passes through
  • Add pyyaml to orchestrator dependencies (not Lambda — Lambda receives JSON)

Phase E: Validation with a second config

  • configs/atl06_minimal.yaml (count + median only) to prove extensibility

What stays the same from the previous plan

  • Single YAML file with three sections (not separate files)
  • Reader registry pattern (READERS dict, data_source.reader field)
  • Output grid as future extension point (output.grid: healpix only for now)
  • Lambda receives serialized JSON (no PyYAML in Lambda layer)
  • Backward compatibility during migration (all new parameters default to None → current behavior)

What changes from the previous plan

  • Pandera removed — replaced by config-driven validation
  • AGG_FUNCTIONS registry removed — replaced by numpy function resolution
  • CellStatsSchema removed — replaced by YAML config
  • Expression strings added — for composed operations that aren't a single function call
  • Function names are numpy paths — not curated aliases

@espg
Copy link
Copy Markdown
Contributor Author

espg commented Mar 13, 2026

Status update — plan comments superseded

The two plan comments above (initial plan and revised plan) are now superseded by implementation. Phases A through D are complete and merged into this branch. Key differences from what was planned:

  • h_mean uses weights: s_liweights: "1.0 / s_li**2" (inverse-variance weighting, matching the original _weighted_mean behavior)
  • CellStatsSchema stays for backward compatibilityCellStatsSchema fully removed along with pandera dependency
  • AGG_FUNCTIONS stays importable → deleted entirely; resolve_function() + evaluate_expression() replace it
  • _COORDS, _DATA_VARS remain as defaults → replaced by get_coords(), get_data_vars() from config
  • New capability not in original plan: param values can be expressions (e.g., weights: "1.0 / s_li**2") — not just bare column names or literals
  • Phase E (second config for validation) still TODO

See updated PR description for current state.

@espg espg marked this pull request as ready for review March 16, 2026 21:15
@espg espg merged commit 7c23c81 into main Mar 16, 2026
8 checks passed
@espg espg deleted the issue-8-datasource-config branch March 16, 2026 21:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Decouple processing.py from ATL06 — config-driven data source

1 participant