Add collections and datasets to dev/staging Airflow #2641

@Ben-Hodgkiss

Background

Currently, a dataset is added to Airflow across all three environments (development, staging, production) as soon as the collection property in the dataset specification is populated. This has two problems:

  • It is not obvious to anyone editing the spec that setting collection will immediately trigger changes in production
  • It provides no mechanism to test a new dataset and collection in development or staging before it goes live in production

We need a way to express which environments a dataset should be active in, so that new datasets can be introduced incrementally and the pipeline can gate behaviour on this field rather than on the presence of collection.

Proposed field

Add a new field to the dataset specification: availability

Note: we are not fully settled on availability as the field name. The name should feel consistent with existing spec field conventions.

Values

availability would accept one of three values:

| Value | Meaning |
| --- | --- |
| development | Active in development only |
| staging | Active in staging and development |
| production | Active in all environments |

The field is hierarchical: setting a value implies that all lower environments are also included. A dataset cannot be active in production but not staging, or in staging but not development, as there is no anticipated need for this.

Behaviour when empty

If availability is not set, the dataset should be treated as inactive in all environments. This replaces the current implicit behaviour where presence of collection alone is sufficient to activate a dataset.
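The hierarchy and the empty-value rule described above could be captured in one small check. This is a minimal sketch, not the pipeline's actual code: the function name `is_active` and the `ENVIRONMENT_RANK` mapping are hypothetical, and raising on unknown values is an assumption rather than something the proposal has decided.

```python
from typing import Optional

# Environments ordered from lowest to highest; the names come from the proposal.
ENVIRONMENT_RANK = {"development": 0, "staging": 1, "production": 2}


def is_active(availability: Optional[str], environment: str) -> bool:
    """Return True if a dataset with this availability is active in `environment`."""
    if environment not in ENVIRONMENT_RANK:
        raise ValueError(f"unknown environment: {environment!r}")
    value = (availability or "").strip()
    if not value:
        # Empty or missing availability: inactive in every environment.
        return False
    if value not in ENVIRONMENT_RANK:
        # Assumption: unknown values are an error rather than silently inactive.
        raise ValueError(f"unknown availability: {value!r}")
    # A value covers its own environment and every lower one.
    return ENVIRONMENT_RANK[environment] <= ENVIRONMENT_RANK[value]
```

Under this sketch, a dataset marked staging is active in development and staging but not production, and a dataset with a collection but no availability is inactive everywhere.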

Scope of work

  • Add availability as a new field in the dataset specification
  • Populate availability for all existing datasets that currently have a collection value in the spec - these should all be set to production to preserve current behaviour
  • Update the Airflow DAGs to gate dataset inclusion on the availability field rather than the presence of collection, using the environment the DAG is running in to determine which datasets are in scope
  • Confirm the approach works correctly for all three environments (development, staging, production)
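As a rough illustration of the DAG-side gating in the list above, the selection step might look like the following. This is only a sketch under stated assumptions: the row shape (dicts with `dataset`, `collection` and `availability` keys), the function name `datasets_in_scope`, and the `ENVIRONMENT` variable name are all hypothetical, and no Airflow API is used here.

```python
import os

RANK = {"development": 0, "staging": 1, "production": 2}


def datasets_in_scope(spec_rows, environment):
    """Return dataset names whose availability covers `environment`."""
    env_rank = RANK[environment]
    selected = []
    for row in spec_rows:
        availability = (row.get("availability") or "").strip()
        # Rows with no availability (or an unrecognised one) are skipped,
        # regardless of whether collection is set.
        if availability in RANK and env_rank <= RANK[availability]:
            selected.append(row["dataset"])
    return selected


# Hypothetical spec rows; real rows would come from the dataset specification.
example_rows = [
    {"dataset": "alpha", "collection": "alpha", "availability": "production"},
    {"dataset": "beta", "collection": "beta", "availability": "development"},
    {"dataset": "gamma", "collection": "gamma", "availability": ""},
]

# ENVIRONMENT is an assumed variable name; it defaults to development here.
current_environment = os.environ.get("ENVIRONMENT", "development")
```

With these example rows, only `alpha` is selected for production, while development also picks up `beta`. `gamma` is excluded everywhere because its availability is empty, even though it has a collection.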

Notes

  • When backfilling existing datasets, cross-reference against the current list of datasets with a non-empty collection field in the spec - these and only these should receive availability: production
  • Consider whether the availability field should be validated via the build specification action to reject unknown values
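The backfill and validation notes above could be prototyped along these lines. This is a hedged sketch: the function names and the dict-based row shape are hypothetical, and leaving an already-populated availability untouched during the backfill is an assumption the issue does not state.

```python
ALLOWED = {"development", "staging", "production"}


def backfill_availability(spec_rows):
    """Return copies of the rows, setting availability to production
    where collection is non-empty and availability is not yet set."""
    updated = []
    for row in spec_rows:
        new_row = dict(row)
        has_collection = bool((row.get("collection") or "").strip())
        has_availability = bool((row.get("availability") or "").strip())
        if has_collection and not has_availability:
            new_row["availability"] = "production"
        updated.append(new_row)
    return updated


def invalid_availability(spec_rows):
    """Return (dataset, value) pairs whose availability is set but not allowed."""
    problems = []
    for row in spec_rows:
        value = (row.get("availability") or "").strip()
        if value and value not in ALLOWED:
            problems.append((row.get("dataset"), value))
    return problems
```

A build step could fail when `invalid_availability` returns a non-empty list. Whether that check belongs in the build specification action is still the open question raised above.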
