Background
Currently, a dataset is added to Airflow across all three environments (development, staging, production) as soon as the collection property in the dataset specification is populated. This has two problems:
- It is not obvious to anyone editing the spec that setting collection will immediately trigger changes in production
- It provides no mechanism to test a new dataset and collection in development or staging before it goes live in production
We need a way to express which environments a dataset should be active in, so that new datasets can be introduced incrementally and the pipeline can gate behaviour on this field rather than on the presence of collection.
Proposed field
Add a new field to the dataset specification: availability
Note: we are not fully settled on availability as the field name. The name should feel consistent with existing spec field conventions.
Values
availability would accept one of three values:
| Value | Meaning |
| --- | --- |
| development | Active in development only |
| staging | Active in staging and development |
| production | Active in all environments |
The field is hierarchical - setting a value implies all lower environments are also included. There is no mechanism to be active in production but not staging, or staging but not development, as there is no anticipated need for this.
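The hierarchy can be expressed as an ordered list of environments, where a value activates every environment up to and including itself. A minimal Python sketch, assuming the three value names from the table; the name `active_environments` is hypothetical and not part of any existing spec tooling:

```python
# Environments ordered from lowest to highest; the order encodes the hierarchy.
ENVIRONMENTS = ["development", "staging", "production"]


def active_environments(availability: str) -> list[str]:
    """Return every environment implied by an availability value."""
    if availability not in ENVIRONMENTS:
        raise ValueError(f"unknown availability value: {availability!r}")
    # A value includes itself and all lower environments.
    return ENVIRONMENTS[: ENVIRONMENTS.index(availability) + 1]
```

Because the order is held in one list, there is no way to express "production but not staging", which matches the intent described above.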
Behaviour when empty
If availability is not set, the dataset should be treated as inactive in all environments. This replaces the current implicit behaviour where presence of collection alone is sufficient to activate a dataset.
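One way the pipeline could apply this rule is to filter dataset records by the environment it is running in, treating a missing or blank availability as inactive. The sketch below is illustrative only: the record shape (dicts with `dataset`, `collection` and `availability` keys) and the function names are assumptions, not the existing pipeline code.

```python
from typing import Iterable, Optional

# Rank of each environment; a higher rank implies all lower ones.
RANK = {"development": 0, "staging": 1, "production": 2}


def is_active(availability: Optional[str], environment: str) -> bool:
    """True if a dataset with this availability is active in `environment`."""
    value = (availability or "").strip()
    if not value:
        # Unset availability means inactive everywhere, even if collection is set.
        return False
    # An unknown value raises KeyError here rather than being silently accepted.
    return RANK[environment] <= RANK[value]


def datasets_in_scope(specs: Iterable[dict], environment: str) -> list[str]:
    """Names of datasets that should be processed in `environment`."""
    return [s["dataset"] for s in specs if is_active(s.get("availability"), environment)]
```

Note that `collection` is never consulted: a record with a populated collection but no availability is left out of every environment.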
Scope of work
- Add availability as a new field in the dataset specification
- Backfill availability for all existing datasets that currently have a collection value in the spec - these should all be set to production to preserve current behaviour
- Gate the pipeline on the availability field rather than the presence of collection, using the environment the DAG is running in to determine which datasets are in scope
Notes
- When backfilling existing datasets, cross-reference against the current list of datasets with a non-empty collection field in the spec - these and only these should receive availability: production
- Consider whether the availability field should be validated via the build specification action to reject unknown values
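The backfill and validation notes could be sketched as two small functions over spec records. This is a sketch under assumptions: records are treated as plain dicts with `dataset`, `collection` and `availability` keys, and `backfill` and `validate` are hypothetical names rather than part of the build specification action.

```python
ALLOWED = {"development", "staging", "production"}


def backfill(records: list[dict]) -> list[dict]:
    """Set availability to production for records with a non-empty collection."""
    result = []
    for record in records:
        updated = dict(record)  # copy so the input is left untouched
        if (record.get("collection") or "").strip():
            updated["availability"] = "production"
        result.append(updated)
    return result


def validate(records: list[dict]) -> list[str]:
    """Return one error message per record with an unknown availability value."""
    errors = []
    for record in records:
        value = (record.get("availability") or "").strip()
        # Empty is allowed: it means inactive in all environments.
        if value and value not in ALLOWED:
            errors.append(f"{record.get('dataset')}: unknown availability {value!r}")
    return errors
```

Returning a list of messages, rather than raising on the first problem, would let a build step report every invalid record in a single run.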