This is the uber-epic for the complete evolution of CKAN DataStore load to AirCan.
Acceptance
- We are using AirCan in production for data loading to datastore
- (central?) AirCan service in our cluster
- CKAN instances updated with connector for AirCan
- Monitoring / debugging working, i.e. we can see what is happening and whether there are issues
- New UI for CKAN instances for data loading experience ...
Tasks
- v0.1 MVP DataStore load working, including:
  - Staging environment: AirCan (Google Cloud Composer) + CKAN instance with extension. See [Infra] Deploy CKAN + Aircan-connector to DX #47 and MVP DX actions including continuous deployment #66. DONE. Live at https://ckan.aircan.dev.datopian.com/. Repo with the Helm chart: https://gitlab.com/datopian/tech/dx-helm-ckan-aircan
  - CI working. Tests passing.
  - Integration tests of DataStore load (in Cypress), e.g. we upload a file to the CKAN staging instance and 5 minutes later the data is in the DataStore. We have the tests, but they assume a local `npm test`. Right now they point to a temporary CKAN instance (not to DX) and they run the entire flow. No CI for this on GitHub at the moment.
  - CD of AirCan and the CKAN extension into staging (includes Terraform setup of GCC). We have an automated deployment script (see MVP DX actions including continuous deployment #66) but we don't have CD such that changes to AirCan DAGs or the CKAN extension get auto re-deployed.
  - BONUS: BigQuery DAG. DONE and working: https://github.com/datopian/aircan/blob/master/aircan/dags/api_ckan_import_to_bq.py
- v0.2 - errors and logging [epic] v0.2 Error and Logging #65
  - Refactor DAGs and ckanext-aircan etc. to take a `run_id` which you can pass in to the DAG and which it uses in logging etc. when running, so we can reliably track logs. Also move Airflow status info into logs (so we don't depend on the Airflow API).
    - Research how others solve this problem of getting unique run ids per DAG run in Airflow (and how we could pass this info down into Stackdriver so that we can filter logs). Goal is that we have a reliable `aircan_status(run_id)` function that can be turned into an API in CKAN (or elsewhere).
- v0.3 - UI integration into CKAN [epic] v0.3 #89
- v0.4 - improved datastore load e.g. more formats
- Loads XLSX ok (uses types)
- Load Google Sheets
- v0.5 - harvesting MVP
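The `run_id` idea in v0.2 (pass a run id into the DAG, tag every log record with it, and derive run status from the logs rather than the Airflow API) can be sketched in plain Python. This is illustrative only: the log record shape and the sample records are made-up stand-ins for Stackdriver entries, and only the `aircan_status(run_id)` name comes from the task list above.

```python
# Hypothetical structured log records, as they might arrive from Stackdriver.
# Each DAG task logs its run_id and a status field.
SAMPLE_LOGS = [
    {"run_id": "r-001", "task": "fetch_resource", "status": "success"},
    {"run_id": "r-001", "task": "load_to_datastore", "status": "running"},
    {"run_id": "r-002", "task": "fetch_resource", "status": "failed"},
]

def aircan_status(run_id, logs=SAMPLE_LOGS):
    """Derive an overall status for one DAG run from its log records.

    Any failed task marks the run failed; otherwise a running task
    marks it running; otherwise it succeeded.
    """
    statuses = [rec["status"] for rec in logs if rec["run_id"] == run_id]
    if not statuses:
        return "unknown"
    if "failed" in statuses:
        return "failed"
    if "running" in statuses:
        return "running"
    return "success"
```

With the sample records, `aircan_status("r-001")` returns `"running"` and `aircan_status("r-002")` returns `"failed"`; a function of this shape could back a status API in CKAN.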
Plan of work (from 4 nov)
- Test instance of CKAN + ckanext-aircan (+ AirFlow) https://ckan.aircan.dev.datopian.com/
- Move this into the "dev/test cluster" @cuducos
- Instance of Google Cloud Composer and a way to update DAGs there.
  - Should it be a test instance OR could we use production? (Think this is OK in part because we can create new DAGs if we need, so we don't interfere with existing ones. E.g. suppose we want to update datastore_load_dag and that is being used by production CKAN instances … well, we can create datastore_load_dag_v2.) ANS: use production
  - Shut down all other Cloud Composer instances
- Integration test for ckanext-aircan etc: start with a simple CSV. Scripted test to upload a file and check it is imported ckanext-aircan#26 🔥
  - With some large files (generate them automatically), e.g. does the Airflow DAG have an issue, is it very slow …
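The "generate them automatically" step for large test files could be scripted roughly as below. The column layout and file names are invented for illustration, not taken from the existing tests.

```python
import csv
import random

def generate_test_csv(path, rows, seed=0):
    """Write a synthetic CSV with `rows` data rows for load testing."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "name", "value"])  # hypothetical header
        for i in range(rows):
            writer.writerow([i, f"record-{i}", rng.random()])
    return path
```

Called with, say, a million rows, this produces a file in the tens of megabytes, enough to surface slow or failing behaviour in the DAG without hand-crafting fixtures.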
FUTURE after this
- Deploy this for other potential users
- GUI work
- Add a "Workflows/Actions" table in CKAN – see https://tech.datopian.com/flows/design/#ui-in-dms
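A rough sketch of what one row in such a Workflows/Actions table might hold, written as a Python record for concreteness. The field names and the example DAG id are assumptions for illustration, not the design in the linked doc.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class WorkflowRun:
    """One row in a hypothetical Workflows/Actions table shown in the CKAN UI."""
    run_id: str
    dag_id: str
    resource_id: str
    status: str                         # e.g. "running", "success", "failed"
    started_at: str                     # ISO 8601 timestamp
    finished_at: Optional[str] = None   # None while the run is in progress

# Example row for an in-progress run (all values invented).
row = WorkflowRun(
    run_id="r-001",
    dag_id="ckan_datastore_load",
    resource_id="abc-123",
    status="running",
    started_at="2020-11-04T12:00:00Z",
)
```

The UI would list these rows per dataset, and an `aircan_status`-style lookup could keep `status` up to date as runs progress.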
```mermaid
graph TD
  v1[v0.1 CSV load working, CI/CD setup with rich tests]
  v2[v0.2 errors, logging and UI integration]
  v3[v0.3 expand the tasks e.g. xlsx, google sheets loading]
  v4[v0.4 harvesting ...]
  v1 --> v2
  v2 --> v3
  v3 --> v4
```
Detailed
```mermaid
graph TD
  deploytotest[Deploy DAGs to test GCC]
  deploydags[Deploy DAGs into this AirFlow<br/>starting with CKAN data load]
  deploygcc[Deploy Airflow<br/>i.e. Google Cloud Composer]
  nhsdag[NHS DAG for loading to bigquery]
  nhs[NHS Done: instance updated<br/>with extension and working in production]
  logging[Logging]
  reporting[Reporting]
  othersite["Other Site Done"]
  start[Start] --> deploygcc
  start --> logging
  multinodedag --> deploytotest
  subgraph General Dev of AirCan
    errors[Error Handling]
    aircanlib[AirCan lib refactoring]
    multinodedag[Multi Node DAG]
    logging --> reporting
  end
  subgraph Deploy into Datopian Cluster
    deploytotest[Deploy DAGs to test GCC] --> deploydags
    deploygcc --> deploydags
  end
  subgraph CKAN Integration
    setschema[Set Schema from Resource]
    endckan[End CKAN work]
    setschema --> endckan
  end
  deploydags --> nhsdag
  deploydags --> othersite
  endckan --> nhs
  subgraph NHS
    nhsdag --> nhs
  end
  classDef done fill:#21bf73,stroke:#333,stroke-width:1px;
  classDef nearlydone fill:lightgreen,stroke:#333,stroke-width:1px;
  classDef inprogress fill:orange,stroke:#333,stroke-width:1px;
  classDef next fill:lightblue,stroke:#333,stroke-width:1px;
  class multinodedag done;
  class versioning nearlydone;
  class setschema,errors,deploydags,nhsdag,deploygcc inprogress;
```