1 - Create an Airflow DAG for Performance #660

**Overview:**  
We need a new pipeline that adds performance datasets to the platform. This first task generates a daily provision-quality file and stores it in S3. Initially, the pipeline needs only one task, which creates the provision-quality table for a given day.

**Technical Approach:**  
- Alter the `collection-task` container so that it uses `CMD` rather than `ENTRYPOINT`, by editing the Dockerfile. Test this in dev and deploy it through to prod to confirm it doesn't affect the current pipelines. Doing this up front reduces risk later.
- Create a separate shell script `build-performance.sh` that can be executed to run a performance build process. This will have access to `digital-land-python` so any Python commands can be created and run here. Testing can be done in `digital-land-python`.
- If it's not appropriate to put code in `digital-land-python`, then a `task/src/` directory can be created containing Python files. Tests should be added in a `test/` directory. Review testing guidance in the tech docs. **Note:** this is not advised – it would be better to put code into `digital-land-python`, but this could be a good way to get started.
- Update your script so that it builds a file in `task/performance/provision-quality/entry-date=<entry date>/...parquet`, which contains the provision-quality dataset for that particular date. If a file exists, it should be overwritten (i.e. the partition should be replaced if it's reproduced on the same day).
- There is already code to create the provision-quality dataset in a [Jupyter notebook](https://github.com/digital-land/jupyter-analysis/blob/main/reports/measure_odp_data_quality/provision_quality_mvp.ipynb). Review this and talk to the data scientists if needed. Optimise and productionise this code – it is not currently in the right format.
- Once you have something working locally, create a new DAG in the `airflow-dags` repository to run this container using your script. You are writing to an S3 bucket – ensure you do not remove any unintended content and only write where you should.
- Once the DAG is made, you can schedule it to start producing the relevant files. Additional work may be needed to schedule appropriately – raise this if needed.
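The partition-replacement behaviour described above can be sketched in Python roughly as follows. This is a minimal illustration, not the intended implementation: the function names, the `file_name` argument, and the ISO `YYYY-MM-DD` format for `<entry date>` are all assumptions, and the step that actually writes the Parquet data is left to a caller-supplied `write_file` callback.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Callable


def partition_dir(base: Path, entry_date: date) -> Path:
    """Return the partition directory for one entry date.

    Assumption: <entry date> is rendered as an ISO date (YYYY-MM-DD).
    """
    return base / "provision-quality" / f"entry-date={entry_date.isoformat()}"


def replace_partition(
    base: Path,
    entry_date: date,
    file_name: str,  # hypothetical: the real file name is not specified in this issue
    write_file: Callable[[Path], None],
) -> Path:
    """Remove any existing partition for entry_date, then write it afresh."""
    target = partition_dir(base, entry_date)
    if target.exists():
        # Only this one date's partition is removed; sibling partitions are untouched.
        shutil.rmtree(target)
    target.mkdir(parents=True)
    out = target / file_name
    write_file(out)  # the caller decides how the Parquet file is produced
    return out
```

Here `base` would be `Path("task/performance")`. Removing the whole partition directory before writing means a rerun cannot leave stale files next to the new one.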

**Acceptance Criteria:**  
- A DAG exists that, when run, creates a Parquet file in S3 under the appropriate key.
- The file contains the provision-quality dataset for the specified date.
- The DAG can be manually triggered in Airflow.
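
One way to check the first criterion, and to guard against writing outside the intended location, is to validate keys against the expected partition prefix. The sketch below assumes the S3 key layout mirrors the local `provision-quality/entry-date=<entry date>/` layout with an ISO date. That layout, the `base_prefix` parameter, and the function names are assumptions for illustration and are not confirmed by this issue.

```python
from datetime import date


def partition_prefix(base_prefix: str, entry_date: date) -> str:
    """Build the assumed S3 key prefix for one entry date's partition."""
    base = base_prefix.strip("/")
    return f"{base}/provision-quality/entry-date={entry_date.isoformat()}/"


def is_expected_key(key: str, base_prefix: str, entry_date: date) -> bool:
    """True if key is a Parquet object inside the partition for entry_date."""
    return key.startswith(partition_prefix(base_prefix, entry_date)) and key.endswith(".parquet")


def check_key(key: str, base_prefix: str, entry_date: date) -> str:
    """Return key unchanged, or raise if it falls outside the expected partition."""
    if not is_expected_key(key, base_prefix, entry_date):
        raise ValueError(f"refusing to touch unexpected key: {key}")
    return key
```

Calling `check_key` before every write or delete keeps the task confined to the partition it is meant to produce.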
