1 - Create an Airflow DAG for Performance #660

@Ben-Hodgkiss

Description

Overview:
We need to create a new pipeline to add performance datasets to the platform. This first task involves generating a daily provision-quality file and storing it in S3. Initially, the pipeline needs only one task, which creates the provision-quality table for a given day.

Technical Approach:

  • Alter the collection-task container so that it uses the CMD instruction rather than ENTRYPOINT. This is done by editing the Dockerfile. Test the change in dev and deploy it through to prod to confirm it doesn't affect the current pipelines. Doing this up front reduces risk later.
  • Create a separate shell script build-performance.sh that can be executed to run a performance build process. This will have access to digital-land-python so any Python commands can be created and run here. Testing can be done in digital-land-python.
  • If it's not appropriate to put code in digital-land-python, then a task/src/ directory can be created containing Python files. Tests should be added in a test/ directory. Review testing guidance in the tech docs. Note: this is not advised – it would be better to put code into digital-land-python, but this could be a good way to get started.
  • Update your script so that it writes a file to task/performance/provision-quality/entry-date=<entry date>/...parquet, containing the provision-quality dataset for that particular date. If a file already exists, it should be overwritten (i.e. the partition should be replaced if it is regenerated on the same day).
  • There is already code to create the provision-quality dataset in a Jupyter notebook. Review this and talk to the data scientists if needed. Optimise and productionise this code, as in its current notebook form it is not ready to run in a pipeline.
  • Once you have something working locally, create a new DAG in the airflow-dags repository to run this container using your script. The script writes to an S3 bucket, so ensure it only writes to the intended location and does not remove any other content.
  • Once the DAG is made, you can schedule it to start producing the relevant files. Additional work may be needed to schedule appropriately – raise this if needed.
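
The overwrite rule for the entry-date partition could be sketched as below. This is a minimal illustration, not the agreed implementation: it assumes the entry date is formatted as an ISO date (YYYY-MM-DD), and the helper names `partition_dir`, `replace_partition` and the `write_file` callback are hypothetical. The file name inside the partition is elided in this issue, so the caller supplies it.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path
from typing import Callable

# Base directory taken from the issue text. The file name inside the partition
# is elided there ("...parquet"), so callers must pass it in explicitly.
BASE_DIR = Path("task/performance/provision-quality")


def partition_dir(base_dir: Path, entry_date: date) -> Path:
    """Return the partition directory for one entry date (ISO format assumed)."""
    return base_dir / f"entry-date={entry_date.isoformat()}"


def replace_partition(
    base_dir: Path,
    entry_date: date,
    file_name: str,
    write_file: Callable[[Path], None],
) -> Path:
    """Build the partition in a staging directory, then swap it into place."""
    target = partition_dir(base_dir, entry_date)
    base_dir.mkdir(parents=True, exist_ok=True)
    # Stage next to the target so the final rename stays on one filesystem.
    staging = Path(tempfile.mkdtemp(prefix=".staging-", dir=base_dir))
    try:
        write_file(staging / file_name)
        if target.exists():
            # Remove the previous build for this date only; other dates are untouched.
            shutil.rmtree(target)
        staging.rename(target)
    except Exception:
        # On failure, discard the staging directory and leave any old partition alone.
        shutil.rmtree(staging, ignore_errors=True)
        raise
    return target / file_name
```

If the dataset is held in a pandas DataFrame `df`, the callback could be as simple as `lambda path: df.to_parquet(path)`. Staging first means a failed build never leaves a half-written partition behind, and only the directory for the given date is ever deleted.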

Acceptance Criteria:

  • A DAG exists that, when run, creates a Parquet file in S3 under the appropriate key.
  • The file contains the provision-quality dataset for the specified date.
  • DAG can be manually triggered in Airflow.
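
One way to check the first criterion after a run is sketched below. It is a rough illustration under assumptions: the S3 key prefix is not specified in this issue, so it is passed in as `base_prefix`; the entry date is assumed to be an ISO date; and the function names are hypothetical. The only S3 call used is `list_objects_v2`, which returns at most 1,000 keys per call and should be enough for a single day's partition.

```python
from datetime import date
from typing import Any


def partition_prefix(base_prefix: str, entry_date: date) -> str:
    """Build the key prefix for one entry-date partition (ISO date assumed)."""
    return f"{base_prefix.rstrip('/')}/entry-date={entry_date.isoformat()}/"


def partition_has_parquet(
    s3_client: Any, bucket: str, base_prefix: str, entry_date: date
) -> bool:
    """Return True if at least one .parquet object exists under the partition."""
    response = s3_client.list_objects_v2(
        Bucket=bucket,
        Prefix=partition_prefix(base_prefix, entry_date),
    )
    # "Contents" is absent from the response when no keys match the prefix.
    return any(
        obj["Key"].endswith(".parquet") for obj in response.get("Contents", [])
    )
```

With boto3 installed, this could be called as `partition_has_parquet(boto3.client("s3"), "<bucket>", "<prefix>", date.today())`, where the bucket and prefix are whatever the DAG is configured to write to.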
