Summary
All processing orchestration is currently locked inside CLI main() functions (__main__.py, invoke_lambda.py). This makes it impossible to run pipelines from a Jupyter notebook without shelling out. We need a Python API that exposes the same functionality as importable functions.
Motivation
Two JupyterHub scenarios drive this:
Operator-managed hub (CryoCloud, Pangeo, institutional) — operator pre-configures AWS + Earthdata credentials as env vars. Users import magg and call functions directly.
BYOC (bring your own compute) on a free hub — user provides their own AWS credentials (env vars or uploaded file) and fans out to Lambda in their own account.
In both cases, users need a Python API, not a CLI.
Proposed API
from magg import load_config , run
config = load_config ("atl06.yaml" )
# Local processing (zero AWS infrastructure needed)
results = run (config , catalog = "catalog.json" , store = "./output.zarr" , max_cells = 5 )
# Lambda backend (requires deployed Lambda + AWS creds)
results = run (config , catalog = "catalog.json" , store = "s3://bucket/out.zarr" , backend = "lambda" )
magg.run() signature
def run (
config : PipelineConfig ,
* ,
catalog : str | None = None , # path to catalog JSON (overrides config.catalog)
store : str | None = None , # store path (overrides config.output.store)
backend : str = "local" , # "local" or "lambda"
max_cells : int | None = None , # limit cells (for testing)
morton_cell : str | None = None , # process a specific cell
max_workers : int | None = None , # concurrency (default: 4 local, 1700 lambda)
overwrite : bool = False , # overwrite existing Zarr template
dry_run : bool = False , # preview without processing
# Lambda-specific
function_name : str | None = None , # Lambda function name (env var fallback)
region : str = "us-west-2" ,
) -> dict :
"""Run the aggregation pipeline."""
Returns a results dict with summary statistics (cells processed, total obs, errors, timing).
Backend dispatch
backend="local": ThreadPoolExecutor, processes cells in-process. Current __main__.py logic.
backend="lambda": Direct Lambda invocation via boto3. Current invoke_lambda.py logic.
Future: backend="sfn" (Step Functions), backend="lithops".
Credential handling
No custom credential logic needed — existing libraries handle it:
Earthdata : earthaccess.login() reads EARTHDATA_USERNAME/EARTHDATA_PASSWORD env vars, ~/.netrc, or prompts interactively
AWS : boto3 reads AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY env vars, ~/.aws/credentials, or IAM role
JupyterHub operators set env vars; BYOC users set them in their notebook session.
Implementation plan
Extract orchestration logic from __main__.py and invoke_lambda.py into magg.runner module
run() dispatches to _run_local() or _run_lambda() based on backend
CLIs become thin wrappers: parse args → load_config() → run()
Tests for the Python API
Notebook example demonstrating JupyterHub usage
Interaction with other work
Builds on Decouple processing.py from ATL06 — config-driven data source #8 (config-driven pipeline) — run() takes a PipelineConfig
Uses open_store() from magg.store for backend-agnostic store creation
Uses get_child_order(), get_store_path() config helpers
Lambda function name configurable via MAGG_LAMBDA_FUNCTION_NAME env var (already implemented)
Future backends
Step Functions : CDK-deployed state machine, triggered via boto3.client('stepfunctions').start_execution()
Lithops : auto-deploys Lambda function via container image, fexec.map() for dispatch
Backend interface should be simple enough that adding new backends is straightforward
Summary
All processing orchestration is currently locked inside CLI
main()functions (__main__.py,invoke_lambda.py). This makes it impossible to run pipelines from a Jupyter notebook without shelling out. We need a Python API that exposes the same functionality as importable functions.Motivation
Two JupyterHub scenarios drive this:
import maggand call functions directly.In both cases, users need a Python API, not a CLI.
Proposed API
magg.run()signatureReturns a results dict with summary statistics (cells processed, total obs, errors, timing).
Backend dispatch
backend="local":ThreadPoolExecutor, processes cells in-process. Current__main__.pylogic.backend="lambda": Direct Lambda invocation via boto3. Currentinvoke_lambda.pylogic.backend="sfn"(Step Functions),backend="lithops".Credential handling
No custom credential logic needed — existing libraries handle it:
earthaccess.login()readsEARTHDATA_USERNAME/EARTHDATA_PASSWORDenv vars,~/.netrc, or prompts interactivelyboto3readsAWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEYenv vars,~/.aws/credentials, or IAM roleJupyterHub operators set env vars; BYOC users set them in their notebook session.
Implementation plan
__main__.pyandinvoke_lambda.pyintomagg.runnermodulerun()dispatches to_run_local()or_run_lambda()based onbackendload_config()→run()Interaction with other work
run()takes aPipelineConfigopen_store()frommagg.storefor backend-agnostic store creationget_child_order(),get_store_path()config helpersMAGG_LAMBDA_FUNCTION_NAMEenv var (already implemented)Future backends
boto3.client('stepfunctions').start_execution()fexec.map()for dispatch