Python API for notebook/JupyterHub use (#13) #14

Merged: espg merged 5 commits into main from magg_deployment on Apr 7, 2026

espg commented Apr 7, 2026

Summary

  • Extract orchestration logic from CLI scripts into magg.runner.agg() Python API
  • Enable pipeline execution from Jupyter notebooks and JupyterHub environments
  • CLI (python -m magg) becomes a thin wrapper around agg()
  • Supports backend="local" (ThreadPoolExecutor) and backend="lambda" (AWS Lambda)
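The split described above, with an importable function doing the orchestration and the CLI only parsing arguments, can be sketched as follows. This is a minimal illustration and not the code in magg.runner: `agg_sketch`, `process_cell`, and the `magg-sketch` program name are hypothetical, and the Lambda branch is left unimplemented.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor


def process_cell(cell):
    # Hypothetical stand-in for the real per-cell aggregation work.
    return {"cell": cell, "status": "ok"}


def agg_sketch(cells, backend="local", max_workers=4):
    """Hypothetical stand-in for agg(): dispatch cells to the chosen backend."""
    if backend == "local":
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # Executor.map() yields results in the same order as the inputs.
            return list(pool.map(process_cell, cells))
    if backend == "lambda":
        raise NotImplementedError("Lambda invocation is omitted from this sketch")
    raise ValueError(f"unknown backend: {backend!r}")


def main(argv=None):
    """Thin CLI wrapper: parse arguments, then delegate to the Python API."""
    parser = argparse.ArgumentParser(prog="magg-sketch")
    parser.add_argument("cells", nargs="+")
    parser.add_argument("--backend", default="local")
    args = parser.parse_args(argv)
    return agg_sketch(args.cells, backend=args.backend)
```

Because `main()` only forwards to `agg_sketch()`, a notebook can call the function directly and get the same behavior as the command line.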

API

```python
from magg import load_config, agg

config = load_config("atl06.yaml")

# Local processing
results = agg(config, catalog="catalog.json", store="./output.zarr", max_cells=5)

# Lambda backend
results = agg(config, catalog="catalog.json", store="s3://bucket/out.zarr", backend="lambda")

# Dry run
results = agg(config, catalog="catalog.json", store="./out.zarr", dry_run=True)
```
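How `max_cells` and the `morton_cell` option named in the test plan narrow the work list is not spelled out here. One plausible reading, sketched below, is that `morton_cell` picks a single cell, `max_cells` truncates the list, and a dry run reports the selection without processing it. The function names, the use of strings as cell identifiers, and the shape of the summary dict are all assumptions.

```python
def select_cells(available, max_cells=None, morton_cell=None):
    """Return the cells to process (hypothetical selection rules)."""
    if morton_cell is not None:
        if morton_cell not in available:
            raise ValueError(f"cell {morton_cell!r} is not in the catalog")
        return [morton_cell]
    cells = sorted(available)
    if max_cells is not None:
        cells = cells[:max_cells]
    return cells


def dry_run_summary(available, **selection):
    """Describe what would run, without doing any processing."""
    cells = select_cells(available, **selection)
    return {
        "total_cells": len(available),
        "selected_cells": len(cells),
        "cells": cells,
    }
```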

Remaining work (issue #13)

  • Notebook example demonstrating JupyterHub usage (operator-managed + BYOC)
  • Refactor invoke_lambda.py to use agg(backend="lambda") internally
  • Verify credential flows in JupyterHub environments (env vars, uploaded files)
  • Future backends: Step Functions (backend="sfn"), Lithops (backend="lithops")

Test plan

  • Validation tests (missing catalog, missing store, unknown backend, lambda requires s3://)
  • Dry run tests (summary, max_cells, morton_cell, invalid cell)
  • Cell selection tests (all, max_cells, morton_cell, invalid)
  • Config fallback tests (catalog from config, store from config)
  • Full test suite passes (142 tests)
  • End-to-end notebook test on JupyterHub
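The validation and config-fallback cases in this list could be covered by a check along these lines. It is a sketch under assumptions: the config is modeled as a plain dict with `catalog` and `store` keys, and `resolve_run_args` is a hypothetical helper, not a function confirmed to exist in magg.

```python
KNOWN_BACKENDS = ("local", "lambda")


def resolve_run_args(config, catalog=None, store=None, backend="local"):
    """Apply config fallbacks, then validate the combination of arguments."""
    # Explicit arguments win; otherwise fall back to the config values.
    catalog = catalog or config.get("catalog")
    store = store or config.get("store")

    if not catalog:
        raise ValueError("no catalog given and none found in config")
    if not store:
        raise ValueError("no store given and none found in config")
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    # Lambda workers cannot write to the caller's local disk.
    if backend == "lambda" and not str(store).startswith("s3://"):
        raise ValueError("backend='lambda' requires an s3:// store")
    return catalog, store
```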

🤖 Generated with Claude Code

espg commented Apr 7, 2026

Note on HTTPS base path mapping

The driver=https support uses a base URL mapping stored in catalog metadata to rewrite S3 URLs to HTTPS URLs at runtime. Here's how it works:

At catalog build time, _extract_base_urls() grabs both the S3 and HTTPS URLs from CMR for the first granule that has both, then derives the common suffix to extract the divergent prefixes:

s3://nsidc-cumulus-prod-protected  →  s3_base
https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected  →  https_base

These are stored once in catalog.metadata, not per-granule. The actual catalog entries remain S3 URLs only — no size increase.
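One way to arrive at the two bases shown above is to treat the S3 object key as the shared suffix and strip it from the HTTPS URL. The sketch below is an assumption about the approach and not the actual `_extract_base_urls()` implementation; the granule path in the usage note is made up for illustration.

```python
from urllib.parse import urlparse


def extract_base_urls(s3_url, https_url):
    """Derive s3_base and https_base from one granule's pair of URLs."""
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URL: {s3_url!r}")
    key = parsed.path  # object key with its leading slash: the shared suffix
    if not key.strip("/") or not https_url.endswith(key):
        raise ValueError("the two URLs do not share the object key as a suffix")
    return {
        "s3_base": f"s3://{parsed.netloc}",
        "https_base": https_url[: -len(key)],
    }
```

Called with `s3://nsidc-cumulus-prod-protected/ATLAS/granule.h5` and the matching `https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/granule.h5`, this returns the two bases listed above.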

At runtime, when driver=https, the runner rewrites each URL by replacing s3_base with https_base. This is a single string substitution, derived from CMR data rather than hardcoded.
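The runtime rewrite amounts to a prefix swap. A minimal sketch, assuming the catalog metadata is a dict with `s3_base` and `https_base` keys (the key names follow the labels above; the function name `to_https` is hypothetical):

```python
def to_https(url, metadata):
    """Rewrite an S3 granule URL to its HTTPS equivalent via the base mapping."""
    s3_base = metadata["s3_base"]
    https_base = metadata["https_base"]
    # Swap only a leading prefix, so a matching substring later in the key
    # is never touched.
    if not url.startswith(s3_base + "/"):
        raise ValueError(f"{url!r} is not under {s3_base!r}")
    return https_base + url[len(s3_base):]
```

Raising on a non-matching URL, rather than returning it unchanged, makes a catalog that violates the single-base assumption fail loudly.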

Assumption: all granules in a catalog share the same base path. This holds today because a catalog is built from a single CMR query (one short_name + version + provider), so all granules come from the same DAAC bucket. If we ever mix data products within a single catalog (e.g., ATL06 + ATL08 from different providers, or cross-DAAC queries), this assumption would break and we'd need per-granule or per-provider base mappings. I can't think of a realistic case where this happens — catalogs are product-specific by design — but noting it here for future reference.
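If the single-base assumption ever needs checking, a cheap guard at catalog build time could list the granules that fall outside the stored base. This is a hypothetical helper, not something the PR includes:

```python
def granules_outside_base(urls, s3_base):
    """Return the granule URLs that do not live under s3_base."""
    prefix = s3_base.rstrip("/") + "/"
    return [url for url in urls if not url.startswith(prefix)]
```

An empty result means the stored mapping applies to every granule; anything else signals that per-granule or per-provider mappings would be needed.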

espg commented Apr 7, 2026

Feels like the best path for the assumption in the comment above would be to switch the catalogs to a GeoParquet format in the future.

espg merged commit e1da439 into main on Apr 7, 2026 (8 checks passed)
espg deleted the magg_deployment branch on April 7, 2026 23:46