Create `intake` plugin #52

brendan-m-murphy · 2026-05-01T08:45:21Z

brendan-m-murphy
May 1, 2026
Maintainer

The intake package has classes for many data types (.nc, .csv, etc.) and readers/writers that can convert between them. It would be useful to have a set of plugins whose ArtifactWriters use intake. This would greatly expand the functionality of ogcat without having to vendor a lot of writers.

Here is what Copilot thinks:

Yes — ogcat looks like a very good fit for an optional Intake integration layer, especially because its current design already separates:

artifact description (ArtifactLocator)
source description (OperationSource)
materialisation/transformation (ArtifactWriter)
metadata derivation and policy (hooks / plugins)
core catalog persistence (kept intentionally lightweight)

That means you can add “intelligent loading” without turning ogcat itself into a data-access framework.

A few repo details support that direction:

add_artifact() is explicitly record-only unless a writer is supplied, so data loading is already optional rather than core behavior. See README.md and docs/design-note-hooks-plugins.md.
ArtifactWriter is just a protocol with write(context, source, target), so wrapping Intake objects should be straightforward. See src/ogcat/hooks.py.
FunctionArtifactWriter, path_writer, memory_writer, and UnzipArtifactWriter show the intended extension model for transformations/materialisation. See src/ogcat/writers.py and docs/api/writers-transactions.rst.
The long-term plan already mentions entry-point based plugin discovery, which is exactly the right home for an optional ogcat-intake package. See docs/ogcat_long_term_plan.md.

One note: code search results may be incomplete due to tool limits. You can browse more in the GitHub UI:

https://github.com/search?q=repo%3Aopenghg%2Fogcat+ArtifactWriter&type=code

Best overall shape

I’d recommend:

keep ogcat core free of Intake
add optional dependency groups in ogcat or, better,
create a separate plugin package such as ogcat-intake or ogcat-openghg-intake

That plugin package can provide:

reader hints / loader metadata
writers that materialise Intake datatypes
transforms between source types
metadata extraction using Intake/xarray/pandas/etc
optional runtime loading helpers

That preserves ogcat as a catalog + lifecycle engine, while Intake becomes a pluggable execution layer.

How Intake datatypes could combine with ogcat

The cleanest mental model is:

ogcat record = “what this artifact is, where it lives, and metadata about it”
Intake datatype / reader = “how to open it intelligently”
ArtifactWriter = “how to materialise or transform it into another artifact”

1. Store loader intent as metadata, not executable state

Instead of making ogcat core understand Intake deeply, store a lightweight description in record metadata, e.g.:

{
  "artifact_type": "netcdf_file",
  "reader_hint": "intake_xarray.netcdf",
  "reader_options": {
    "chunks": {"time": 100},
    "decode_times": true
  }
}

This aligns well with the repo’s existing idea of advisory “reader hints” in docs/ideas.md.

Then the plugin can offer:

load_with_intake(record) -> intake datatype / loaded object

or

build_intake_source(record) -> DataSource

This keeps the catalog portable:

without Intake installed, records still work
with Intake installed, you get smart opening
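
A minimal sketch of that portability, assuming the metadata keys from the example above (`reader_hint`, `reader_options`). The helper names `reader_spec` and `intake_available` are hypothetical and are not part of ogcat or Intake:

```python
import importlib.util
import json
from typing import Optional, Tuple


def reader_spec(metadata: dict) -> Tuple[Optional[str], dict]:
    """Return the advisory reader hint and options stored in record metadata."""
    return metadata.get("reader_hint"), dict(metadata.get("reader_options", {}))


def intake_available() -> bool:
    """True if the optional `intake` package can be imported."""
    return importlib.util.find_spec("intake") is not None


# The hint is plain data, so it survives a JSON round trip with or without Intake.
stored = json.dumps({
    "reader_hint": "intake_xarray.netcdf",
    "reader_options": {"chunks": {"time": 100}, "decode_times": True},
})
hint, options = reader_spec(json.loads(stored))
print(hint, options, "intake installed:", intake_available())
```

A plugin could check `intake_available()` before attempting a smart open and otherwise fall back to returning the plain path.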

2. Map `ArtifactLocator` + metadata -> Intake source builder

A plugin can inspect:

locator kind/path/URI
record type
suffixes / filename
user metadata
derived metadata

and choose an Intake loader.

Example mapping logic:

.nc or artifact_type=netcdf -> Intake xarray source
.csv -> Intake pandas/csv source
.parquet -> Intake parquet source
zarr directory -> Intake xarray/zarr source
remote URL / object-store URI -> Intake source with storage options

Conceptually:

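import intake  # assumes Intake v1-style open_* functions; open_netcdf needs intake-xarray

def build_source(record):
    locator = record.artifact
    # Assumption: str(locator) yields a path or URL that Intake can open.
    locator_path = str(locator)
    meta = {**record.user_metadata, **record.derived_metadata}
    reader_options = meta.get("reader_options", {})

    # Explicit metadata or the file suffix selects the NetCDF opener.
    if meta.get("artifact_type") == "netcdf" or locator_path.endswith(".nc"):
        return intake.open_netcdf(locator_path, **reader_options)

    if meta.get("artifact_type") == "csv":
        return intake.open_csv(locator_path, **reader_options)

    ...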

This gives you intelligent loading while keeping selection policy in the plugin.
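
The selection step on its own can be sketched without Intake. This is an illustration only: `SUFFIX_TO_TYPE` and `resolve_artifact_type` are hypothetical names, and the function returns a reason alongside its choice so the decision stays visible to the user:

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple

# Hypothetical suffix table; a real plugin would let users extend or override it.
SUFFIX_TO_TYPE = {".nc": "netcdf", ".csv": "csv", ".parquet": "parquet", ".zarr": "zarr"}


def resolve_artifact_type(path: str, metadata: dict) -> Tuple[Optional[str], str]:
    """Return (artifact_type, reason); explicit metadata wins over the suffix."""
    explicit = metadata.get("artifact_type")
    if explicit:
        return explicit, "metadata: artifact_type"
    suffix = PurePosixPath(path.rstrip("/")).suffix.lower()
    if suffix in SUFFIX_TO_TYPE:
        return SUFFIX_TO_TYPE[suffix], f"suffix: {suffix}"
    return None, "no matching rule"


print(resolve_artifact_type("obs/mhd_ch4.nc", {}))
print(resolve_artifact_type("s3://bucket/store.zarr/", {}))
print(resolve_artifact_type("data.bin", {"artifact_type": "csv"}))
```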

3. Use `OperationSource.kind` as the bridge between Intake datatypes and writers

This is probably the nicest fit with the current API.

OperationSource already carries:

kind
path
descriptor
metadata
payload

So an Intake plugin can define a family of source kinds like:

intake_source
xarray_dataset
pandas_dataframe
zip_file
compressed_netcdf
remote_opendap
zarr_store

Then ArtifactWriter implementations can declare what they accept using the current source_kind pattern.

Examples:

source.kind == "zip_file" -> UnzipArtifactWriter
source.kind == "xarray_dataset" -> write NetCDF or Zarr
source.kind == "pandas_dataframe" -> write CSV or Parquet
source.kind == "intake_source" -> materialise to file, directory, or derived object

This works especially well because FunctionArtifactWriter already supports source-kind validation.
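
A rough sketch of kind-based dispatch follows. `Source` is a simplified stand-in for `OperationSource`, and the `accepts` decorator, `WRITERS` registry, and `dispatch` function are hypothetical; ogcat's own validation in `FunctionArtifactWriter` may look quite different:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class Source:
    """Simplified stand-in for OperationSource (kind, payload, metadata only)."""
    kind: str
    payload: Any = None
    metadata: dict = field(default_factory=dict)


WRITERS: Dict[str, Callable[[Source], dict]] = {}


def accepts(kind: str):
    """Register a writer function for one source kind."""
    def register(func):
        WRITERS[kind] = func
        return func
    return register


@accepts("pandas_dataframe")
def write_table(source: Source) -> dict:
    return {"written_as": "parquet"}  # a real writer would write the payload here


@accepts("xarray_dataset")
def write_dataset(source: Source) -> dict:
    return {"written_as": "netcdf"}  # likewise a placeholder for the actual write


def dispatch(source: Source) -> dict:
    """Validate the source kind, then call the matching writer."""
    try:
        writer = WRITERS[source.kind]
    except KeyError:
        raise TypeError(f"no writer registered for source kind {source.kind!r}") from None
    return writer(source)


print(dispatch(Source(kind="xarray_dataset")))
```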

Transformations between types: yes, this looks easy

I agree with your instinct here.

ArtifactWriter is already the right abstraction for simple transforms, and UnzipArtifactWriter proves the pattern.

Examples of useful transforms

zip -> extracted directory
csv.gz -> csv
tar.gz -> directory
NetCDF -> Zarr
xarray Dataset -> NetCDF
pandas DataFrame -> Parquet
Intake source -> local cached artifact
remote URL -> downloaded local file
GRIB -> NetCDF
multiple files -> combined dataset artifact

In ogcat terms, these are all just:

take an OperationSource
write target artifact
register rollback
emit derived metadata

That is already exactly how writers are documented in docs/design-note-hooks-plugins.md.
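
Those four steps can be illustrated with the standard library alone. `Context` and `copy_writer` below are hypothetical stand-ins, not ogcat's actual context or writer API; the point is only the order of operations:

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, List


class Context:
    """Hypothetical stand-in for the writer context: it only collects rollbacks."""

    def __init__(self) -> None:
        self.rollbacks: List[Callable[[], None]] = []

    def register_rollback(self, undo: Callable[[], None]) -> None:
        self.rollbacks.append(undo)

    def rollback(self) -> None:
        # Undo in reverse order of registration.
        while self.rollbacks:
            self.rollbacks.pop()()


def copy_writer(context: Context, source_path: Path, target: Path) -> dict:
    """Take a source, write the target, register rollback, emit derived metadata."""
    shutil.copyfile(source_path, target)
    context.register_rollback(lambda: target.unlink(missing_ok=True))
    return {"size_bytes": target.stat().st_size, "transformed_to": "file"}


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "in.txt"
    dst = Path(tmp) / "out.txt"
    src.write_bytes(b"ch4,1900.5\n")
    ctx = Context()
    derived = copy_writer(ctx, src, dst)
    written = dst.exists()
    ctx.rollback()  # simulate a failed transaction
    removed = not dst.exists()

print(derived, written, removed)
```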

Example plugin writer pattern

Something like:

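from pathlib import Path
from ogcat import source_writer  # not used below; presumably wraps the function as a writer

def materialise_intake_source(source, target: Path):
    """Read the payload of an OperationSource and write it to target as NetCDF."""
    ds = source.payload  # intake source or loaded object
    # Assumption: an Intake source exposes .read(); a loaded object is used as-is.
    obj = ds.read() if hasattr(ds, "read") else ds

    # xarray objects expose to_netcdf(); other payload types are rejected.
    if hasattr(obj, "to_netcdf"):
        obj.to_netcdf(target)
        return {"materialized_as": "netcdf"}

    raise TypeError(f"unsupported intake payload: {type(obj).__name__}")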

Or for compression-aware transforms:

import gzip
import shutil
from pathlib import Path

def gunzip_to_file(source_path: Path, target: Path):
    """Stream-decompress a gzip file to target and return derived metadata."""
    with gzip.open(source_path, "rb") as src, target.open("wb") as dst:
        shutil.copyfileobj(src, dst)
    return {"compression": "gzip", "transformed_to": "file"}

So yes: type transforms are probably one of the strongest reasons to add Intake-adjacent plugins.

What Intake functionality would be most useful beyond loading?

A lot, but I’d prioritize the following.

1. Unified opener selection

This is the biggest immediate win.

Given a catalog record, the plugin decides:

which backend to use
whether to open lazily
whether to chunk
whether to read locally vs remotely
which format-specific options to apply

This turns ogcat from “catalog knows path” into “catalog + plugin knows how to open meaningfully”.

Best for:

NetCDF
Zarr
CSV / Parquet tabular data
multi-file collections

2. Lazy loading and deferred execution

Very useful if OpenGHG-style datasets can be large.

An Intake source can represent:

not-yet-loaded data
lazily loaded xarray / dask-backed data
remote stores

That means ogcat records can remain lightweight, while plugin consumers choose when to pay the cost of reading.

This is especially attractive if you want:

notebook workflows
remote/cloud datasets
large atmospheric datasets

3. Catalog-of-catalogs behavior

Intake’s catalog concepts could help if you want to expose subsets of ogcat records as a browsable data catalog.

For example, the plugin could generate an Intake catalog dynamically from ogcat.search(...) results:

all records of type surface_mole_fraction
all files for a site/species/date range
all artifacts with a given schema

This would let downstream users use Intake-native tooling without changing ogcat’s storage model.

That might be one of the most compelling advanced features.

4. Driver-based format extensibility

Rather than hardcoding every reader/writer in ogcat, Intake gives you an ecosystem of drivers and conventions.

That’s useful for:

xarray datasets
parquet/tabular data
cloud/object storage access
remote references
domain-specific drivers later

So the plugin layer can stay thin:

inspect metadata
select Intake driver
pass through options

5. Caching

Potentially very useful, especially for remote or expensive reads.

An Intake-based loader plugin could:

cache downloaded data locally
cache transformed artifacts
cache opened source descriptors
distinguish “reference record” vs “materialised cache artifact”

In ogcat terms, this could be modelled as:

original record points to authoritative source
optional derived/materialised record points to cached local artifact

This fits the existing separation between record metadata and writer-based artifact creation.
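
One possible way to model the two records, sketched with plain dicts. The field names (`id`, `uri`, `role`, `derived_from`) and the hashing scheme are assumptions for illustration, not ogcat's record schema:

```python
import hashlib
from pathlib import Path, PurePosixPath


def cache_path(source_uri: str, cache_dir: Path, suffix: str = "") -> Path:
    """Derive a stable local path from the authoritative source URI."""
    digest = hashlib.sha256(source_uri.encode("utf-8")).hexdigest()[:16]
    return cache_dir / f"{digest}{suffix}"


def cache_record(reference: dict, cache_dir: Path) -> dict:
    """Build a derived record that points at the cached copy and links back."""
    suffix = PurePosixPath(reference["uri"]).suffix
    local = cache_path(reference["uri"], cache_dir, suffix=suffix)
    return {
        "uri": str(local),
        "role": "materialised_cache",
        "derived_from": reference["id"],
    }


reference = {"id": "rec-001", "uri": "https://example.org/data/mhd_ch4.nc", "role": "reference"}
cached = cache_record(reference, Path("/tmp/ogcat-cache"))
print(cached)
```

Because the cache path depends only on the source URI, repeating the operation resolves to the same local artifact rather than creating a new one.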

6. Multi-file dataset assembly

This could be huge for scientific data.

A plugin could:

take many ogcat records
resolve their locators
build an Intake/xarray multi-file source
return combined dataset objects

Examples:

one record per month -> assembled time series
one record per site -> combined dataset
multiple chunks -> virtual aggregate

This is probably better as a higher-level plugin API than a core ogcat feature.
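
The record-resolution half of this can be sketched without xarray. The record fields (`site`, `start`, `path`) and the `group_paths` helper are hypothetical; each resulting list of paths could then be handed to something like `xarray.open_mfdataset`:

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_paths(records: Iterable[dict], by: str = "site") -> Dict[str, List[str]]:
    """Group record paths by a metadata key, ordered by start date within each group."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        groups[record[by]].append(record)
    return {
        key: [r["path"] for r in sorted(items, key=lambda r: r["start"])]
        for key, items in groups.items()
    }


records = [
    {"site": "MHD", "start": "2024-02-01", "path": "mhd_2024-02.nc"},
    {"site": "TAC", "start": "2024-01-01", "path": "tac_2024-01.nc"},
    {"site": "MHD", "start": "2024-01-01", "path": "mhd_2024-01.nc"},
]
print(group_paths(records))
```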

7. Storage abstraction for remote backends

The glossary and long-term plan mention future “storage profiles”.

Intake could complement that by helping with:

S3 / GCS / Azure-backed reads
fsspec integration
storage options carried in metadata
remote artifact access without materialising everything locally

That would be especially valuable if ogcat wants to catalog references to remote data, not just local files.

8. Rich metadata extraction

A plugin can use Intake/xarray/pandas readers during ingest to derive metadata such as:

dimensions
coordinates
variable names
units
time coverage
geographic bounds
chunking/compression info
schema summary / column dtypes

That fits perfectly with extract_metadata hooks and writer-derived metadata.

So Intake is useful not just for opening data, but for standardising metadata enrichment.
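
As a small illustration of the extractor idea, here is a schema summary for a CSV file using only the standard library in place of pandas. The function name, the returned keys, and the assumption of a `time` column are all hypothetical:

```python
import csv
import io
from typing import TextIO


def extract_csv_metadata(handle: TextIO) -> dict:
    """Derive a small schema summary from CSV text that has a header row."""
    reader = csv.DictReader(handle)
    columns = list(reader.fieldnames or [])
    rows = list(reader)
    # ISO 8601 timestamps sort correctly as plain strings.
    times = sorted(row["time"] for row in rows if row.get("time"))
    return {
        "columns": columns,
        "row_count": len(rows),
        "time_coverage": [times[0], times[-1]] if times else None,
    }


sample = io.StringIO("time,ch4\n2024-01-01T00:00,1900.5\n2024-01-01T01:00,1901.2\n")
print(extract_csv_metadata(sample))
```

The returned dict is plain data, so it could be merged into a record's derived metadata as-is.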

Strong design suggestion: split functionality into three plugin layers

Instead of one big Intake integration, I’d separate it into:

A. Loader plugins

Turn ogcat records into readable objects.

Examples:

load_record(record)
build_source(record)
open_dataset(record)

B. Writer/transform plugins

Turn sources into managed artifacts.

Examples:

unzip
decompress
convert format
materialise lazy source locally

C. Metadata extractor plugins

Inspect artifacts/sources and derive metadata.

Examples:

xarray dataset metadata extractor
pandas dataframe schema extractor
compression/archive inspector

That maps neatly onto current ogcat extension points:

hooks for metadata/policy
writers for materialisation
optional helper API in separate package for loading

Recommended architecture

Option 1: separate package `ogcat-intake` — best option

I think this is the cleanest.

Why

avoids heavy optional deps in core
avoids API pressure on ogcat
can evolve faster
can depend on intake, xarray, pandas, fsspec, etc.
can publish entry points later when ogcat adds plugin discovery

Package contents

ogcat_intake.loaders
ogcat_intake.writers
ogcat_intake.extractors
ogcat_intake.plugin

Example exposed API

from ogcat_intake import (
    IntakeLoaderPlugin,
    IntakeMaterializeWriter,
    XarrayMetadataExtractor,
    open_record,
)

Option 2: optional extra in ogcat

If you want tighter integration but still optional:

[project.optional-dependencies]
intake = ["intake", "intake-xarray", "pandas", "xarray", "fsspec"]

This is okay, but I’d still keep the implementation in a subpackage that feels plugin-like, not core.

Concrete integration ideas for ogcat as it exists today

1. Add advisory metadata conventions

You may not need any core code changes for this.

Define plugin conventions like:

artifact_type
reader_hint
reader_options
storage_options
materialization_format
compression
logical_dataset_type

This is very aligned with the repo’s current direction and keeps records serialisable and portable.

2. Add helper constructors for Intake sources

In plugin package:

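from ogcat import OperationSource  # assumed import path

def intake_source(obj, *, kind="intake_source", metadata=None):
    """Wrap an Intake object in an OperationSource so writers can consume it."""
    return OperationSource(
        kind=kind,
        descriptor="intake source",
        payload=obj,
        metadata=metadata or {},
    )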

Then writers can consume those directly.

3. Add format-specific writers

Examples:

IntakeToNetCDFWriter
IntakeToZarrWriter
DataFrameToParquetWriter
CompressedFileWriter
ArchiveExpandWriter

Each should:

validate source kind
write target
register rollback
emit metadata

4. Add loader resolution API

Given a record, plugin decides best opener.

Example:

def open_record(record, *, lazy=True):
    ...

Under the hood:

inspect locator + metadata
choose Intake driver
build source
optionally call .read()

This is probably the single most user-visible feature.
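
A self-contained sketch of what the `lazy` flag could mean. Records are modelled as plain dicts, and `LazySource`, `READERS`, and the stdlib readers are hypothetical placeholders for Intake sources and drivers:

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


class LazySource:
    """Hypothetical handle that defers reading until .read() is called."""

    def __init__(self, path: str, reader: Callable[[str], object]) -> None:
        self.path = path
        self._reader = reader

    def read(self) -> object:
        return self._reader(self.path)


def _read_csv(path: str) -> List[dict]:
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def _read_text(path: str) -> str:
    return Path(path).read_text()


# Reader table keyed by an advisory metadata value; users could extend it.
READERS: Dict[str, Callable[[str], object]] = {"csv": _read_csv, "text": _read_text}


def open_record(record: dict, *, lazy: bool = True):
    """Inspect metadata, choose a reader, build a source, optionally read it."""
    artifact_type = record["metadata"].get("artifact_type", "text")
    if artifact_type not in READERS:
        raise ValueError(f"no reader for artifact_type {artifact_type!r}")
    source = LazySource(record["path"], READERS[artifact_type])
    return source if lazy else source.read()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "obs.csv"
    path.write_text("site,ch4\nMHD,1900.5\n")
    record = {"path": str(path), "metadata": {"artifact_type": "csv"}}
    handle = open_record(record)            # nothing is read yet
    rows = open_record(record, lazy=False)  # read immediately
    same = handle.read() == rows

print(rows, same)
```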

5. Add search-to-catalog adapter

Very attractive for users:

catalog.search(species="ch4", site="MHD")

then

to_intake_catalog(results)

or

open_many(results)

This would let users move from metadata discovery to analysis very naturally.
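
One way `to_intake_catalog` might start is by building a plain dict in the shape of an Intake v1 YAML catalog (`sources`, each with `driver` and `args`). This is a sketch under that assumption: the record fields and the `DRIVERS` mapping are hypothetical, and the `netcdf` driver would require intake-xarray:

```python
import json
from typing import Iterable

# Hypothetical mapping from an artifact_type value to an Intake driver name.
DRIVERS = {"csv": "csv", "netcdf": "netcdf"}


def to_catalog_dict(records: Iterable[dict]) -> dict:
    """Build a dict shaped like an Intake v1 YAML catalog (sources -> driver/args)."""
    sources = {}
    for record in records:
        sources[record["id"]] = {
            "driver": DRIVERS[record["artifact_type"]],
            "args": {"urlpath": record["path"]},
            "metadata": {"site": record["site"], "species": record["species"]},
        }
    return {"sources": sources}


results = [
    {"id": "rec-001", "artifact_type": "netcdf", "path": "mhd_ch4_2024.nc",
     "site": "MHD", "species": "ch4"},
]
catalog = to_catalog_dict(results)
print(json.dumps(catalog, indent=2))
```

Serialising this dict to YAML would give a file that Intake-native tooling could open, without changing how ogcat stores its records.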

6. Add transform pipelines as plugin utilities, not core workflow

You probably don’t want ogcat core to become a pipeline orchestrator.

But a plugin utility could support simple composable transforms:

source -> decompress -> parse -> write
remote -> cache -> convert -> catalog

Built on top of chained writers/helpers.
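
A composable chain like `source -> decompress -> parse -> write` needs very little machinery. The `pipeline` helper and the three steps below are illustrative only, with an in-memory JSON string standing in for the written artifact:

```python
import csv
import gzip
import io
import json
from functools import reduce
from typing import Any, Callable


def pipeline(*steps: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Compose steps left to right: pipeline(f, g)(x) == g(f(x))."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


def decompress(blob: bytes) -> str:
    return gzip.decompress(blob).decode("utf-8")


def parse(text: str) -> list:
    return list(csv.DictReader(io.StringIO(text)))


def write(rows: list) -> str:
    return json.dumps(rows)


gz_to_json = pipeline(decompress, parse, write)
result = gz_to_json(gzip.compress(b"site,ch4\nMHD,1900.5\n"))
print(result)
```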

What I would avoid

1. Don’t persist live Intake objects in ogcat records

Store:

hints
options
format descriptors
storage config references

Do not store:

executable objects
arbitrary Python callables
pickled loaders

Keep records declarative.
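
A plugin could enforce this with a simple serialisability check before metadata is stored. `ensure_declarative` is a hypothetical helper, not an existing ogcat function:

```python
import json


def ensure_declarative(metadata: dict) -> dict:
    """Return metadata unchanged if it is JSON-serialisable, else raise ValueError."""
    try:
        json.dumps(metadata)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"record metadata must be plain data: {exc}") from exc
    return metadata


ensure_declarative({"reader_hint": "intake_xarray.netcdf", "reader_options": {"chunks": {"time": 100}}})

try:
    ensure_declarative({"loader": open})  # a callable is executable state, not data
except ValueError as error:
    print(error)
```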

2. Don’t make Intake required for normal ogcat use

The repo clearly wants plugins/hooks to carry domain behavior, not the core package.

3. Don’t overload hooks with heavy data writes

The docs are clear that writers are the right place for materialisation, while hooks should handle lifecycle/metadata/policy around it.

4. Don’t make reader selection opaque

Prefer explicit metadata conventions and overridable resolver logic. Users should be able to see why a record is opened a certain way.

A practical first milestone

If you want an incremental path, I’d start with this:

Phase 1

Create ogcat-intake with:

open_record(record)
build_intake_source(record)
intake_source(...) helper for OperationSource
one or two writers:
- GunzipWriter
- XarrayToNetCDFWriter
one metadata extractor:
- dimensions / variables / time coverage from xarray-readable artifacts

Phase 2

Add:

dynamic mapping from metadata to driver
batch open / multi-file combine
caching/materialisation helpers

Phase 3

Add:

entry-point registration with ogcat.plugins
search results -> Intake catalog adapter
remote storage profiles

That would deliver useful functionality early without overcommitting the ogcat core.

My bottom line

Yes — this is a strong fit, and the current ogcat architecture already seems designed for this sort of extension.

The best model is:

ogcat core: catalog, lifecycle, metadata, transactions, artifact protocols
Intake plugin package: smart opening, datatype-aware transforms, metadata extraction, optional caching/materialisation

And yes, transforming between types via ArtifactWriter looks especially natural in this codebase. UnzipArtifactWriter is basically the proof-of-concept for the whole idea.

If you want, I can next sketch either:

a proposed ogcat-intake package structure, or
a concrete API design for open_record(), source-kind conventions, and a few writer classes.
