Create intake plugin
#52
brendan-m-murphy started this conversation in Ideas
The `intake` package has classes for many data types (.nc, .csv, etc.) and readers/writers that can convert between them. It would be useful to have a set of plugins with `ArtifactWriters` that use `intake`. This would greatly expand the functionality of `ogcat` without having to vendor a lot of writers, etc.

Here is what Copilot thinks:
Yes, `ogcat` looks like a very good fit for an optional Intake integration layer, especially because its current design already separates `ArtifactLocator`, `OperationSource`, and `ArtifactWriter`.

That means you can add “intelligent loading” without turning `ogcat` itself into a data-access framework.

A few repo details support that direction:

- `add_artifact()` is explicitly record-only unless a writer is supplied, so data loading is already optional rather than core behavior. See `README.md` and `docs/design-note-hooks-plugins.md`.
- `ArtifactWriter` is just a protocol with `write(context, source, target)`, so wrapping Intake objects should be straightforward. See `src/ogcat/hooks.py`.
- `FunctionArtifactWriter`, `path_writer`, `memory_writer`, and `UnzipArtifactWriter` show the intended extension model for transformations/materialisation. See `src/ogcat/writers.py` and `docs/api/writers-transactions.rst`.
- The `ogcat-intake` package. See `docs/ogcat_long_term_plan.md`.

One note: code search results may be incomplete due to tool limits. You can browse more in the GitHub UI.
Best overall shape
I’d recommend:

- Keep `ogcat` core free of Intake.
- Put the integration in an optional part of `ogcat` or, better, in a separate package such as `ogcat-intake` or `ogcat-openghg-intake`.

That plugin package can provide:
That preserves `ogcat` as a catalog + lifecycle engine, while Intake becomes a pluggable execution layer.

How Intake datatypes could combine with ogcat
The cleanest mental model is:
1. Store loader intent as metadata, not executable state
Instead of making `ogcat` core understand Intake deeply, store a lightweight description in record metadata, e.g.:

```json
{
  "artifact_type": "netcdf_file",
  "reader_hint": "intake_xarray.netcdf",
  "reader_options": {
    "chunks": {"time": 100},
    "decode_times": true
  }
}
```

This aligns well with the repo’s existing idea of advisory “reader hints” in `docs/ideas.md`.

The plugin can then offer a loading entry point that interprets these hints. This keeps the catalog portable, because records hold only declarative, serialisable metadata.
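A minimal sketch of hint-driven loading, using only the standard library so that it runs without Intake installed. The registry, the `register_loader` decorator, the `load_record` signature, and the `stdlib.csv` / `stdlib.json` hint names are all hypothetical; a real plugin would register Intake-backed readers under hints such as the `intake_xarray.netcdf` value shown above.

```python
import csv
import json
from typing import Any, Callable, Dict

# Hypothetical registry mapping a reader_hint string to a loader callable.
LOADERS: Dict[str, Callable[..., Any]] = {}


def register_loader(hint: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Register a loader under a reader_hint name."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        LOADERS[hint] = func
        return func
    return decorator


@register_loader("stdlib.csv")
def _load_csv(path: str, **options: Any) -> list:
    # reader_options are forwarded to csv.DictReader (e.g. delimiter).
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, **options))


@register_loader("stdlib.json")
def _load_json(path: str, **options: Any) -> Any:
    with open(path) as handle:
        return json.load(handle, **options)


def load_record(path: str, metadata: Dict[str, Any]) -> Any:
    """Open a file according to the advisory hints stored in its metadata."""
    hint = metadata.get("reader_hint")
    if hint not in LOADERS:
        raise LookupError(f"no loader registered for reader_hint={hint!r}")
    return LOADERS[hint](path, **metadata.get("reader_options", {}))
```

The record itself stays plain data; only the plugin knows how to turn a hint into a callable.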
2. Map `ArtifactLocator` + metadata -> Intake source builder

A plugin can inspect the `ArtifactLocator` and the record metadata, and choose an Intake loader.

Example mapping logic:

- `.nc` or `artifact_type=netcdf` -> Intake xarray source
- `.csv` -> Intake pandas/csv source
- `.parquet` -> Intake parquet source

Conceptually:
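A minimal sketch of this mapping, assuming the locator can be reduced to a path string and the metadata to a plain dict. The function name `choose_loader` and the loader labels (`xarray`, `pandas_csv`, `parquet`) are illustrative only and are not real Intake driver identifiers.

```python
from pathlib import PurePosixPath
from typing import Any, Dict

# Illustrative labels only; a real plugin would map to Intake readers.
SUFFIX_TO_LOADER = {".nc": "xarray", ".csv": "pandas_csv", ".parquet": "parquet"}
ARTIFACT_TYPE_TO_LOADER = {"netcdf": "xarray"}


def choose_loader(path: str, metadata: Dict[str, Any]) -> str:
    """Pick a loader label from explicit artifact_type first, then file suffix."""
    artifact_type = metadata.get("artifact_type")
    if artifact_type in ARTIFACT_TYPE_TO_LOADER:
        return ARTIFACT_TYPE_TO_LOADER[artifact_type]
    suffix = PurePosixPath(path).suffix.lower()
    try:
        return SUFFIX_TO_LOADER[suffix]
    except KeyError:
        raise LookupError(f"no loader known for {path!r}") from None
```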
This gives you intelligent loading while keeping selection policy in the plugin.
3. Use `OperationSource.kind` as the bridge between Intake datatypes and writers

This is probably the nicest fit with the current API. `OperationSource` already carries:

- `kind`
- `path`
- `descriptor`
- `metadata`
- `payload`

So an Intake plugin can define a family of source kinds like:

- `intake_source`
- `xarray_dataset`
- `pandas_dataframe`
- `zip_file`
- `compressed_netcdf`
- `remote_opendap`
- `zarr_store`

Then `ArtifactWriter` implementations can declare what they accept using the current `source_kind` pattern.

Examples:

- `source.kind == "zip_file"` -> `UnzipArtifactWriter`
- `source.kind == "xarray_dataset"` -> write NetCDF or Zarr
- `source.kind == "pandas_dataframe"` -> write CSV or Parquet
- `source.kind == "intake_source"` -> materialise to file, directory, or derived object

This works especially well because `FunctionArtifactWriter` already supports source-kind validation.

Transformations between types: yes, this looks easy
I agree with your instinct here. `ArtifactWriter` is already the right abstraction for simple transforms, and `UnzipArtifactWriter` proves the pattern.

Examples of useful transforms
In ogcat terms, these are all just writers operating on an `OperationSource`. That is already exactly how writers are documented in `docs/design-note-hooks-plugins.md`.

Example plugin writer pattern
Something like:
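As a minimal sketch of that pattern, assuming a writer only needs the `write(context, source, target)` method mentioned earlier. The `OperationSource` class here is a stand-in that reuses the field names listed above (the types are guesses, not ogcat's actual definitions), and `RowsToJsonLinesWriter` with its `rows` source kind is hypothetical.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, Optional


@dataclass
class OperationSource:
    """Stand-in for ogcat's class; field names from the text, types assumed."""
    kind: str
    path: Optional[Path] = None
    descriptor: Any = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    payload: Any = None


class RowsToJsonLinesWriter:
    """Hypothetical writer: takes in-memory rows, writes one JSON object per line."""

    source_kind = "rows"

    def write(self, context: Any, source: OperationSource, target: Path) -> Path:
        # Reject sources this writer was not designed for.
        if source.kind != self.source_kind:
            raise ValueError(
                f"expected source kind {self.source_kind!r}, got {source.kind!r}"
            )
        with open(target, "w") as handle:
            for row in source.payload:
                handle.write(json.dumps(row) + "\n")
        return target
```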
Or for compression-aware transforms:
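A compression-aware writer could be sketched along the same lines with the standard `gzip` module. `GunzipWriter` is only a name at this point, the `gzip_file` source kind is hypothetical, and the small `OperationSource` stand-in carries just the two fields this example needs.

```python
import gzip
import shutil
from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass
class OperationSource:
    """Reduced stand-in: only the fields this sketch uses."""
    kind: str
    path: Path


class GunzipWriter:
    """Hypothetical writer that decompresses a .gz file into the target path."""

    source_kind = "gzip_file"

    def write(self, context: Any, source: OperationSource, target: Path) -> Path:
        if source.kind != self.source_kind:
            raise ValueError(f"cannot gunzip source of kind {source.kind!r}")
        # Stream the decompressed bytes so large files are not held in memory.
        with gzip.open(source.path, "rb") as compressed, open(target, "wb") as out:
            shutil.copyfileobj(compressed, out)
        return target
```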
So yes: type transforms are probably one of the strongest reasons to add Intake-adjacent plugins.
What Intake functionality would be most useful beyond loading?
A lot, but I’d prioritize the following.
1. Unified opener selection
This is the biggest immediate win.
Given a catalog record, the plugin decides:
This turns `ogcat` from “catalog knows path” into “catalog + plugin knows how to open meaningfully”.

Best for:
2. Lazy loading and deferred execution
Very useful if OpenGHG-style datasets can be large.
An Intake source can represent:
That means `ogcat` records can remain lightweight, while plugin consumers choose when to pay the cost of reading.

This is especially attractive if you want:
3. Catalog-of-catalogs behavior
Intake’s catalog concepts could help if you want to expose subsets of ogcat records as a browsable data catalog.
For example, the plugin could generate an Intake catalog dynamically from `ogcat.search(...)` results, e.g. for `surface_mole_fraction` records.

This would let downstream users use Intake-native tooling without changing ogcat’s storage model.
That might be one of the most compelling advanced features.
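A rough sketch of such an adapter, assuming search results can be reduced to plain dicts with hypothetical `name`, `path`, and `metadata` keys. The output mirrors the `sources` / `driver` / `args` layout of Intake 1.x YAML catalogs; whether a given Intake release accepts it unchanged is not checked here.

```python
from typing import Any, Dict, Iterable


def records_to_catalog(records: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a catalog-style dict from simplified record dicts."""
    sources: Dict[str, Any] = {}
    for record in records:
        metadata = dict(record.get("metadata", {}))
        # The advisory reader_hint becomes the driver; everything else stays metadata.
        driver = metadata.pop("reader_hint", "unknown")
        sources[record["name"]] = {
            "driver": driver,
            "args": {"urlpath": record["path"]},
            "metadata": metadata,
        }
    return {"sources": sources}
```

The resulting dict could be serialised to YAML or handed to whatever catalog constructor the installed Intake version provides.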
4. Driver-based format extensibility
Rather than hardcoding every reader/writer in ogcat, Intake gives you an ecosystem of drivers and conventions.
That’s useful for:
So the plugin layer can stay thin:
5. Caching
Potentially very useful, especially for remote or expensive reads.
An Intake-based loader plugin could:
In ogcat terms, this could be modelled as:
This fits the existing separation between record metadata and writer-based artifact creation.
6. Multi-file dataset assembly
This could be huge for scientific data.
A plugin could:
Examples:
This is probably better as a higher-level plugin API than a core ogcat feature.
7. Storage abstraction for remote backends
The glossary and long-term plan mention future “storage profiles”.
Intake could complement that by helping with:
That would be especially valuable if ogcat wants to catalog references to remote data, not just local files.
8. Rich metadata extraction
A plugin can use Intake/xarray/pandas readers during ingest to derive metadata such as:
That fits perfectly with `extract_metadata` hooks and writer-derived metadata.

So Intake is useful not just for opening data, but for standardising metadata enrichment.
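To illustrate the idea without depending on Intake, xarray, or pandas, here is a small standard-library sketch that derives metadata from a CSV file at ingest time. The function name and the `columns` / `row_count` keys are hypothetical, and how the result would be wired into an `extract_metadata` hook is left open.

```python
import csv
from pathlib import Path
from typing import Any, Dict


def extract_csv_metadata(path: Path) -> Dict[str, Any]:
    """Read a CSV once and return simple descriptive metadata."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        row_count = sum(1 for _ in reader)
        # fieldnames is None for an empty file, so fall back to an empty list.
        columns = list(reader.fieldnames or [])
    return {"columns": columns, "row_count": row_count}
```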
Strong design suggestion: split functionality into three plugin layers
Instead of one big Intake integration, I’d separate it into:
A. Loader plugins
Turn ogcat records into readable objects.
Examples:
- `load_record(record)`
- `build_source(record)`
- `open_dataset(record)`

B. Writer/transform plugins
Turn sources into managed artifacts.
Examples:
C. Metadata extractor plugins
Inspect artifacts/sources and derive metadata.
Examples:
That maps neatly onto current ogcat extension points:
Recommended architecture
Option 1: separate package `ogcat-intake` (best option)

I think this is the cleanest.
Why
Package contents
- `ogcat_intake.loaders`
- `ogcat_intake.writers`
- `ogcat_intake.extractors`
- `ogcat_intake.plugin`

Example exposed API
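One possible shape, sketched as a single object that bundles the three layers. The class name `IntakePlugin` and its methods are hypothetical and are not taken from ogcat or Intake.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class IntakePlugin:
    """Hypothetical facade grouping loaders, writers, and metadata extractors."""
    loaders: Dict[str, Callable[..., Any]] = field(default_factory=dict)
    writers: Dict[str, Any] = field(default_factory=dict)
    extractors: List[Callable[[Any], Dict[str, Any]]] = field(default_factory=list)

    def writer_for(self, source_kind: str) -> Any:
        """Return the writer registered for a source kind."""
        try:
            return self.writers[source_kind]
        except KeyError:
            raise LookupError(f"no writer for source kind {source_kind!r}") from None

    def extract_metadata(self, artifact: Any) -> Dict[str, Any]:
        """Run every extractor and merge the results; later extractors win on conflicts."""
        merged: Dict[str, Any] = {}
        for extractor in self.extractors:
            merged.update(extractor(artifact))
        return merged
```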
Option 2: optional extra in ogcat
If you want tighter integration but still optional:
This is okay, but I’d still keep the implementation in a subpackage that feels plugin-like, not core.
Concrete integration ideas for ogcat as it exists today
1. Add advisory metadata conventions
You may not need any core code changes for this.
Define plugin conventions like:
- `artifact_type`
- `reader_hint`
- `reader_options`
- `storage_options`
- `materialization_format`
- `compression`
- `logical_dataset_type`

This is very aligned with the repo’s current direction and keeps records serialisable and portable.
2. Add helper constructors for Intake sources
In plugin package:
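A minimal sketch of such helpers. `OperationSource` is again a stand-in built from the field names listed earlier, with assumed types and defaults; `file_source` is a hypothetical helper, and `intake_source` here simply wraps any reader-like object in the payload rather than calling Intake.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, Optional, Union


@dataclass(frozen=True)
class OperationSource:
    """Stand-in for ogcat's class; field names from the text, types assumed."""
    kind: str
    path: Optional[Path] = None
    descriptor: Any = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    payload: Any = None


def file_source(path: Union[str, Path], kind: str, **metadata: Any) -> OperationSource:
    """Describe an on-disk input of the given kind."""
    return OperationSource(kind=kind, path=Path(path), metadata=dict(metadata))


def intake_source(reader: Any, **metadata: Any) -> OperationSource:
    """Wrap a reader-like object so a writer can materialise it later."""
    return OperationSource(kind="intake_source", metadata=dict(metadata), payload=reader)
```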
Then writers can consume those directly.
3. Add format-specific writers
Examples:
- `IntakeToNetCDFWriter`
- `IntakeToZarrWriter`
- `DataFrameToParquetWriter`
- `CompressedFileWriter`
- `ArchiveExpandWriter`

Each should:
4. Add loader resolution API
Given a record, plugin decides best opener.
Example:
Under the hood, the plugin resolves a source for the record and calls `.read()` on it.

This is probably the single most user-visible feature.
5. Add search-to-catalog adapter
Very attractive for users:
then
or
This would let users move from metadata discovery to analysis very naturally.
6. Add transform pipelines as plugin utilities, not core workflow
You probably don’t want ogcat core to become a pipeline orchestrator.
But a plugin utility could support simple composable transforms:
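A minimal sketch of composable transforms as plain function composition. The `pipeline` helper is hypothetical; in a real plugin each step would more likely be a writer or a helper around one.

```python
import gzip
from functools import reduce
from typing import Any, Callable

Transform = Callable[[Any], Any]


def pipeline(*steps: Transform) -> Transform:
    """Compose steps left to right into a single callable."""
    def run(value: Any) -> Any:
        return reduce(lambda acc, step: step(acc), steps, value)
    return run


# Example: gunzip bytes, decode them to text, then split into lines.
gunzip_to_lines = pipeline(gzip.decompress, bytes.decode, str.splitlines)
```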
Built on top of chained writers/helpers.
What I would avoid
1. Don’t persist live Intake objects in ogcat records
Store:
Do not store:
Keep records declarative.
2. Don’t make Intake required for normal ogcat use
The repo clearly wants plugins/hooks to carry domain behavior, not the core package.
3. Don’t overload hooks with heavy data writes
The docs are clear that writers are the right place for materialisation, while hooks should handle lifecycle/metadata/policy around it.
4. Don’t make reader selection opaque
Prefer explicit metadata conventions and overridable resolver logic. Users should be able to see why a record is opened a certain way.
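One way to keep selection transparent is to return the reason together with the choice. The sketch below uses a hypothetical `Resolution` type and `resolve_loader` function, with an assumed precedence of caller override, then `reader_hint`, then file suffix; the loader labels are illustrative.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Any, Dict, Optional

# Illustrative suffix table; not real Intake driver names.
SUFFIX_DEFAULTS = {".nc": "xarray", ".csv": "pandas_csv", ".parquet": "parquet"}


@dataclass(frozen=True)
class Resolution:
    loader: str
    reason: str  # human-readable explanation of why this loader was chosen


def resolve_loader(
    path: str, metadata: Dict[str, Any], override: Optional[str] = None
) -> Resolution:
    """Choose a loader and record why, so users can inspect the decision."""
    if override is not None:
        return Resolution(override, "explicit override by caller")
    if "reader_hint" in metadata:
        return Resolution(metadata["reader_hint"], "reader_hint in record metadata")
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in SUFFIX_DEFAULTS:
        return Resolution(SUFFIX_DEFAULTS[suffix], f"inferred from suffix {suffix!r}")
    raise LookupError(f"cannot resolve a loader for {path!r}")
```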
A practical first milestone
If you want an incremental path, I’d start with this:
Phase 1
Create `ogcat-intake` with:

- `open_record(record)`
- `build_intake_source(record)`
- `intake_source(...)` helper for `OperationSource`
- `GunzipWriter`
- `XarrayToNetCDFWriter`

Phase 2
Add:
Phase 3
Add:

- `ogcat.plugins`

That would deliver useful functionality early without overcommitting the ogcat core.
My bottom line
Yes, this is a strong fit, and the current `ogcat` architecture already seems designed for this sort of extension.

The best model is:
And yes, transforming between types via `ArtifactWriter` looks especially natural in this codebase. `UnzipArtifactWriter` is basically the proof-of-concept for the whole idea.

If you want, I can next sketch either:

- an `ogcat-intake` package structure, or
- `open_record()`, source-kind conventions, and a few writer classes.