Python API for notebook/JupyterHub use #13

@espg

Summary

All processing orchestration is currently locked inside CLI main() functions (__main__.py, invoke_lambda.py). This makes it impossible to run pipelines from a Jupyter notebook without shelling out. We need a Python API that exposes the same functionality as importable functions.

Motivation

Two JupyterHub scenarios drive this:

  1. Operator-managed hub (CryoCloud, Pangeo, institutional) — operator pre-configures AWS + Earthdata credentials as env vars. Users import magg and call functions directly.
  2. BYOC (bring your own compute) on a free hub — user provides their own AWS credentials (env vars or uploaded file) and fans out to Lambda in their own account.

In both cases, users need a Python API, not a CLI.

Proposed API

from magg import load_config, run

config = load_config("atl06.yaml")

# Local processing (zero AWS infrastructure needed)
results = run(config, catalog="catalog.json", store="./output.zarr", max_cells=5)

# Lambda backend (requires deployed Lambda + AWS creds)
results = run(config, catalog="catalog.json", store="s3://bucket/out.zarr", backend="lambda")

magg.run() signature

def run(
    config: PipelineConfig,
    *,
    catalog: str | None = None,      # path to catalog JSON (overrides config.catalog)
    store: str | None = None,        # store path (overrides config.output.store)
    backend: str = "local",          # "local" or "lambda"
    max_cells: int | None = None,    # limit cells (for testing)
    morton_cell: str | None = None,  # process a specific cell
    max_workers: int | None = None,  # concurrency (default: 4 local, 1700 lambda)
    overwrite: bool = False,         # overwrite existing Zarr template
    dry_run: bool = False,           # preview without processing
    # Lambda-specific
    function_name: str | None = None,  # Lambda function name (env var fallback)
    region: str = "us-west-2",
) -> dict:
    """Run the aggregation pipeline."""

Returns a results dict with summary statistics (cells processed, total obs, errors, timing).
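A minimal sketch of how the results dict could be assembled from per-cell outcomes. The key names (`cells_processed`, `total_obs`, `errors`, `elapsed_s`) and the per-cell record shape are assumptions for illustration; the issue only states which kinds of statistics are returned.

```python
def summarize(cell_results, started_at, finished_at):
    """Fold per-cell outcomes into one summary dict (key names are hypothetical)."""
    failed = [r for r in cell_results if r.get("error")]
    succeeded = [r for r in cell_results if not r.get("error")]
    return {
        "cells_processed": len(succeeded),
        "total_obs": sum(r.get("n_obs", 0) for r in succeeded),
        # Keep failures visible so a notebook user can retry specific cells.
        "errors": [{"cell": r["cell"], "error": r["error"]} for r in failed],
        "elapsed_s": round(finished_at - started_at, 3),
    }
```

Returning a plain dict keeps the result easy to inspect in a notebook cell or to pass to `json.dumps`.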

Backend dispatch

  • backend="local": ThreadPoolExecutor, processes cells in-process. Current __main__.py logic.
  • backend="lambda": Direct Lambda invocation via boto3. Current invoke_lambda.py logic.
  • Future: backend="sfn" (Step Functions), backend="lithops".
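The dispatch described above could be sketched as a lookup table keyed by backend name. Everything here is illustrative: `dispatch`, `process_cell`, and the per-cell record shape are hypothetical names, and `_run_lambda` is left as a stub because the real boto3 invocation lives in `invoke_lambda.py`. The default worker counts are taken from the `max_workers` comment in the signature.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Defaults quoted in the run() signature: 4 local, 1700 lambda.
DEFAULT_WORKERS = {"local": 4, "lambda": 1700}

def _run_local(cells, process_cell, max_workers):
    """Process cells in-process with a thread pool, collecting errors per cell."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_cell, cell): cell for cell in cells}
        for future in as_completed(futures):
            cell = futures[future]
            try:
                results.append({"cell": cell, "n_obs": future.result()})
            except Exception as exc:  # one bad cell should not abort the run
                results.append({"cell": cell, "error": str(exc)})
    return results

def _run_lambda(cells, process_cell, max_workers):
    """Stub: the real version would invoke the deployed Lambda via boto3."""
    raise NotImplementedError("Lambda dispatch is not part of this sketch")

_BACKENDS = {"local": _run_local, "lambda": _run_lambda}

def dispatch(cells, process_cell, backend="local", max_workers=None):
    """Pick the runner for `backend` and apply the per-backend worker default."""
    if backend not in _BACKENDS:
        raise ValueError(
            f"unknown backend {backend!r}; expected one of {sorted(_BACKENDS)}"
        )
    workers = max_workers or DEFAULT_WORKERS[backend]
    return _BACKENDS[backend](cells, process_cell, workers)
```

Results arrive in completion order, so callers that need a stable order should sort by cell.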

Credential handling

No custom credential logic needed — existing libraries handle it:

  • Earthdata: earthaccess.login() reads EARTHDATA_USERNAME/EARTHDATA_PASSWORD env vars, ~/.netrc, or prompts interactively
  • AWS: boto3 reads AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY env vars, ~/.aws/credentials, or IAM role

JupyterHub operators set env vars; BYOC users set them in their notebook session.
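Since the lookup itself is left to earthaccess and boto3, a notebook may still want to report which credential sources are visible before starting a long run. The helper below is a hypothetical sketch, not part of magg: it only checks the environment variables and files named above, and it cannot detect an IAM role or predict an interactive prompt.

```python
import os
from pathlib import Path

def credential_sources(env=None, home=None):
    """Report which of the credential sources named above are present.

    This does not validate credentials; it only checks that they exist.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    return {
        "earthdata_env": all(
            env.get(key) for key in ("EARTHDATA_USERNAME", "EARTHDATA_PASSWORD")
        ),
        "earthdata_netrc": (home / ".netrc").is_file(),
        "aws_env": all(
            env.get(key) for key in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
        ),
        "aws_credentials_file": (home / ".aws" / "credentials").is_file(),
    }
```

A BYOC user could call `credential_sources()` in the first notebook cell and set the missing variables with `os.environ[...] = ...` for that session.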

Implementation plan

  1. Extract orchestration logic from __main__.py and invoke_lambda.py into magg.runner module
  2. run() dispatches to _run_local() or _run_lambda() based on backend
  3. CLIs become thin wrappers: parse args → load_config() → run()
  4. Tests for the Python API
  5. Notebook example demonstrating JupyterHub usage
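Step 3 of the plan could look roughly like the sketch below. The flag names are assumptions that mirror the `run()` keyword arguments, not the flags of the current CLIs, and `load_config` and `run` are passed in as parameters only so the sketch is runnable on its own; a real wrapper would import them from `magg`.

```python
import argparse

def build_parser():
    """Argument parser whose options map one-to-one onto run() keywords."""
    parser = argparse.ArgumentParser(prog="magg")
    parser.add_argument("config", help="path to the pipeline YAML")
    parser.add_argument("--catalog")
    parser.add_argument("--store")
    parser.add_argument("--backend", choices=["local", "lambda"], default="local")
    parser.add_argument("--max-cells", type=int)
    parser.add_argument("--morton-cell")
    parser.add_argument("--max-workers", type=int)
    parser.add_argument("--overwrite", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    return parser

def main(argv=None, *, load_config, run):
    """Thin wrapper: parse args, load the config, forward everything to run()."""
    args = vars(build_parser().parse_args(argv))
    config = load_config(args.pop("config"))
    # argparse turns "--max-cells" into "max_cells", matching run()'s keywords.
    return run(config, **args)
```

Keeping the CLI this thin means any behavior tested through the Python API is automatically what the command line does.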

Interaction with other work

  • Builds on #8, "Decouple processing.py from ATL06 — config-driven data source" (config-driven pipeline): run() takes a PipelineConfig
  • Uses open_store() from magg.store for backend-agnostic store creation
  • Uses get_child_order(), get_store_path() config helpers
  • Lambda function name configurable via MAGG_LAMBDA_FUNCTION_NAME env var (already implemented)
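The env var fallback for `function_name` presumably resolves in the order explicit argument, then `MAGG_LAMBDA_FUNCTION_NAME`. The sketch below illustrates that order only; `resolve_function_name` is a hypothetical helper, and the existing implementation may differ, for example by falling back to a built-in default instead of raising.

```python
import os

def resolve_function_name(function_name=None, env=None):
    """Return the Lambda function name: explicit argument first, then env var."""
    env = os.environ if env is None else env
    name = function_name or env.get("MAGG_LAMBDA_FUNCTION_NAME")
    if not name:
        raise ValueError(
            "no Lambda function name: pass function_name= or set "
            "MAGG_LAMBDA_FUNCTION_NAME"
        )
    return name
```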

Future backends

  • Step Functions: CDK-deployed state machine, triggered via boto3.client('stepfunctions').start_execution()
  • Lithops: auto-deploys Lambda function via container image, fexec.map() for dispatch
  • Backend interface should be simple enough that adding new backends is straightforward
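One way to keep the backend interface small is a single-method protocol plus a name registry, so that Step Functions or Lithops support becomes one new class. This is a sketch under assumptions: `Backend`, `run_cells`, `register_backend`, and `get_backend` are hypothetical names, and `EchoBackend` is a placeholder used only to show registration.

```python
from typing import Iterable, Protocol

class Backend(Protocol):
    """Anything that can process a batch of cells and return per-cell records."""

    def run_cells(self, cells: Iterable[str], max_workers: int) -> list: ...

_REGISTRY = {}

def register_backend(name):
    """Class decorator that makes a backend selectable by name."""
    def decorator(factory):
        _REGISTRY[name] = factory
        return factory
    return decorator

def get_backend(name):
    """Instantiate the backend registered under `name`."""
    if name not in _REGISTRY:
        raise ValueError(
            f"unknown backend {name!r}; registered: {sorted(_REGISTRY)}"
        )
    return _REGISTRY[name]()

@register_backend("echo")
class EchoBackend:
    """Placeholder backend that does no work; stands in for a real one."""

    def run_cells(self, cells, max_workers):
        return [{"cell": cell, "n_obs": 0} for cell in cells]
```

With this shape, `run()` would only need `get_backend(backend).run_cells(...)`, and a new backend would not require touching the dispatch code.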

Labels: enhancement