Overview:
We need to create a new pipeline to add performance datasets to the platform. This first task involves generating a daily provision-quality file and storing it in S3. Initially, the pipeline needs only one task, which creates the provision-quality table for a given day.
Technical Approach:
- Alter the collection-task container so that CMD is used rather than ENTRYPOINT. This is done by editing the Dockerfile. Test this in dev and deploy through to prod to ensure it doesn't affect the current pipelines. This reduces risk later.
- Create a separate shell script, build-performance.sh, that can be executed to run a performance build process. This will have access to digital-land-python, so any Python commands can be created and run here. Testing can be done in digital-land-python.
- If it's not appropriate to put code in digital-land-python, then a task/src/ directory can be created containing Python files. Tests should be added in a test/ directory. Review the testing guidance in the tech docs. Note: this is not advised. It would be better to put code into digital-land-python, but this could be a good way to get started.
- Update your script so that it builds a file in task/performance/provision-quality/entry-date=<entry date>/...parquet, which contains the provision-quality dataset for that particular date. If a file exists, it should be overwritten (i.e. the partition should be replaced if it's reproduced on the same day).
- There is already code to create the provision-quality dataset in a Jupyter notebook. Review this and talk to the data scientists if needed. Optimise and productionise this code – it is not currently in the right format.
- Once you have something working locally, create a new DAG in the airflow-dags repository to run this container using your script. You are writing to an S3 bucket, so ensure you do not remove any content unintentionally and only write where you should.
- Once the DAG is made, you can schedule it to start producing the relevant files. Additional work may be needed to schedule appropriately – raise this if needed.
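The partition-replacement step described above can be sketched locally as follows. This is a minimal sketch under stated assumptions, not the agreed implementation: the ISO (YYYY-MM-DD) form of the entry date, the file name provision-quality.parquet, and the helper names partition_dir and replace_partition are all hypothetical, since the ticket leaves the file name elided. The actual Parquet writing is passed in as a callable so the layout logic can be tested without any data library.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Callable

# Directory layout taken from the ticket.
DATASET_PREFIX = Path("task/performance/provision-quality")
# Hypothetical file name: the ticket only says "...parquet".
FILE_NAME = "provision-quality.parquet"


def partition_dir(entry_date: date, root: Path = Path(".")) -> Path:
    """Return the partition directory for one entry date.

    Assumption: the date is rendered in ISO format (YYYY-MM-DD).
    """
    return root / DATASET_PREFIX / f"entry-date={entry_date.isoformat()}"


def replace_partition(
    entry_date: date,
    write_file: Callable[[Path], None],
    root: Path = Path("."),
) -> Path:
    """Replace the partition for entry_date and return the written file path.

    Only the single entry-date=<date> directory is removed; other
    partitions under the dataset prefix are left untouched.
    """
    target = partition_dir(entry_date, root)
    if target.exists():
        # Drop the whole partition so no stale files survive a re-run.
        shutil.rmtree(target)
    target.mkdir(parents=True)
    out_path = target / FILE_NAME
    # The caller decides how the dataset is serialised to out_path.
    write_file(out_path)
    return out_path
```

With a pandas DataFrame named df (hypothetical), a call might look like replace_partition(date.today(), lambda p: df.to_parquet(p)). Removing the partition directory rather than only the file is a design choice for this sketch; the same replace-only-this-partition rule would need to be applied to the S3 key prefix when uploading.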
Acceptance Criteria:
- A DAG exists that, when run, creates a Parquet file in S3 under the appropriate key.
- The file contains the provision-quality dataset for the specified date.
- The DAG can be manually triggered in Airflow.