Commit a68ba4e

Merge pull request #79 from TechnologyEnhancedLearning/TD-6798-Stored-Proc-Test-Exploration
Td 6798 stored proc test exploration
2 parents fff111b + fdb3457 commit a68ba4e

4 files changed

Lines changed: 233 additions & 39 deletions

databricks.yml

Lines changed: 1 addition & 34 deletions
@@ -30,6 +30,7 @@ bundle:
 
 include:
   # - 'variables/*.yml' if execute in order might be able to put vars in folders based off headings, may affect running in personal area before deploy?
+  - resources/variables/*.yml
   - resources/**/*.yml
   - resources/**/**/*.yml
   # ** double astrix doesnt seem to do sub directories
@@ -86,40 +87,6 @@ variables:
     description: Service Principal client ID for production environment
 
 
-
-  # ============================================================
-  # Data Architecture Constants
-  # Structural identifiers for medallion layers and logical domains.
-  # These should remain stable across environments.
-  # ============================================================
-
-  # ---- Medallion Layers ----
-
-  layer_bronze:
-    default: "bronze"
-    description: Bronze layer identifier for raw ingested data
-
-  layer_silver:
-    default: "silver"
-    description: Silver layer identifier for cleansed and enriched data
-
-  layer_gold:
-    default: "gold"
-    description: Gold layer identifier for curated and business-ready data
-
-
-  # ---- Data Domains ----
-
-  domain_ods:
-    default: "ods"
-    description: Operational Data Store domain for source-aligned datasets
-
-  domain_reporting:
-    default: "reporting"
-    description: Reporting domain for aggregated and analytics-ready datasets
-
-
-
   # ============================================================
   # Testing Configuration
   # Controls automated test behavior during job execution.

docs/jira-orientation-task.md

Lines changed: 196 additions & 1 deletion
@@ -1 +1,196 @@
-# This file is just a copy of initial task for exploring the poc
+
+# Specific tasks
+
+**Contribute to**
+- Do as much or as little as you like, as long as there is at least one change (even a typo fix), because we want to be able to do a git commit. Obviously I would love lots of input, but decide for yourself.
+
+**look through** literally just that, to whatever level of detail is useful; I may have put in some question prompts to help.
+
+**Add to your doc notes for** Because we will all be looking at the same files, put comments on files you are not contributing to in your own notes doc: follow the instructions for making a doc in docs with your name, and just list the file route and your observations :)
+
+
+- X review (make one little change to)
+  - **contribute to** the pull_request template in the .github folder
+  - look at (just for the gist) .github/copilot-instructions; it might interest you, it's Copilot building a context prompt for Copilot based on the repo structure
+  - **Add to your doc notes for** any recommendations or questions
+- X review (make one little change to)
+  - **contribute to** "git branch and commit naming" in docs
+  - **Add to your doc notes for** please can you look at the examples I've done (ods_ingest, dlt); I would love more input on best processes that could be in here, and an end-to-end process especially would be brilliant
+- X review (make one little change to)
+  - **look through** the project docs generally
+  - **look through** the github/workflow folder
+  - **contribute to** project-architecture.md
+  - **contribute to** tests/test-readme.md
+- Phil
+  - review the POC against any NHS GitHub repos we can access
+    - I have some I've managed to get access to
+  - support people with exploring the POC
+  - present the POC
+  - make updates based on feedback
+  - ensure DABs are updating pipelines once deployed (the test runner notebook deployed by the job in personal seems not to have been updated; see the todo in docs)
+
+
+
+# Context for all tasks
+
+*If you have any problems at all, ask Phil :) This task can be done together if preferred. (Whenever you do this kind of thing there will be some issues; don't struggle, that's not the purpose, just ask :) )*
+
+The purpose of the task is to:
+- provide an opportunity to see the POC
+- identify what's missing
+- see if it works (we may need to add permissions, for example)
+- see the documentation
+- create a pull request
+- provide feedback on someone else's pull request
+- provide feedback after/in the task
+  - this will be done in <name>-feedback.md (this isn't a typical approach, but we wouldn't normally be working on the same text files much, so it's just for convenience)
+  - improvements
+  - what would make a better example
+  - what's liked
+  - what we could apply now
+
+
+After the tasks are done and there is a pull request, we will have a meeting and merge the tasks together in the meeting as part of exploring the functionality.
+
+The task is not about overcoming obstacles; if you hit any issues, just ask straight away :)
+
+**To avoid code conflicts, change only the file in your assignment, and add any other changes to a file**
+
+The intention is that we will end up with documentation that would help anyone new to the project get involved.
+
+# Steps
+
+*There is a lot here; you may wish to do each step at a different time to break it up a bit.*
+
+## This is the process of setting up a branch for your repo folder
+
+Go to the POC workspace
+[POC DBX](https://adb-295718430158257.17.azuredatabricks.net/browse?o=295718430158257)
+Click the branch tab to get here: [branches tab github](https://github.com/TechnologyEnhancedLearning/DatabricksPOC/branches)
+- Make a branch using your Jira task name from source "main"
+  - TD-<your jira task code>-Jira-Task-Name
+  - (the TD- prefix means the task will automatically progress across the Jira board as it moves through DevOps)
+- copy the branch path
+- I have created a repo folder in your user area
+  - find the git folder "DatabricksPOC"
+  - click on the folder and change the git branch target to your new branch
+
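The branch naming convention above can be checked mechanically before pushing. This is a minimal Python sketch, not part of the repo: `is_valid_branch_name` and the pattern are hypothetical, and it assumes the Jira task code is numeric, the prefix is upper-case `TD-`, and the task name is hyphen-separated words.

```python
import re

# Assumed convention: TD-<numeric Jira code>-<Hyphen-Separated-Task-Name>
BRANCH_PATTERN = re.compile(r"TD-\d+-[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def is_valid_branch_name(name: str) -> bool:
    """Return True if the whole branch name follows the assumed TD- convention."""
    return BRANCH_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    # The branch merged in this commit follows the convention.
    print(is_valid_branch_name("TD-6798-Stored-Proc-Test-Exploration"))
    # A branch without the Jira prefix would not be picked up by the board.
    print(is_valid_branch_name("my-feature-branch"))
```

If the Jira integration also accepts a lower-case `td-` prefix, the pattern could be compiled with `re.IGNORECASE`.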
+## Personal AI context configuration
+At the top level of the workspace folder is a file called .assistant_workspace_instructions.md
+My hope is that we will update this file as a team as we develop more preferences for how AI responds in Databricks, e.g. if we want it to stop recommending a package, or it needs better context for a certain type of file.
+
+Look in your user space for a file called
+.assistant_instructions.md
+These are your personal AI instructions, used in tandem with the global ones.
+
+If you have preferences for how it responds, put them in here.
+E.g. "When recommending SQL to Python conversions, recommend me a similar file in the project I can use for comparison".
+Or provide context on the kind of work you will do.
+Or just add a commented note to the file.
+
+## Orientation
+- create a file <name>-feedback.md
+  - as you go, add any ideas, preferences, things needing readmes or more explanation, anything that helps; copy and paste the URL of the file and just say what I could change :) thank you
+- go to the docs folder and look at the orientation readme
+- add to your notes file an improvement (it can be tiny) for the ods_ingest example
+- add to your notes file an improvement (it can be tiny) for the dlt example
+
+## Test
+*If some tests don't work, like data-test, they may start working after you deploy your local DAB later.*
+- skim test-readme.md
+- go to run_tests
+  - run the whole file
+  - look through how the notebook separates the tests into sections; we could use different sections if we like
+  - (Q) what is job.pytest_marks for?
+  - look at the end of the notebook, where a failure is generated so the CI/CD pipeline can fail after we have tried all the tests
+  - run the tests with run_all at the top
+  - click on "see performance" in the bottom left of each code box as each finishes, as well as looking at the output
+- look in the following files, and click through to see the structure
+  - tests/unit-tests/utils/test_reporter_logic.py
+    - see how it tests with a small df
+    - the dlt is a wrapper; the Python is the functionality that is wrapped
+    - this can be compared to the previous code we saw
+  - tests/unit-tests/transformations/test_date_transformations.py
+    - this was added based off some real code that was simplified for the POC
+    - the actual code is here: src/transformations/date_transforms.py
+  - tests/integration-tests
+    - tests/integration-tests/utils/test_loaders.py
+      - this can be compared to the previous code we saw
+      - it's an integration test because it needs to interact with DBX for storage
+      - there may be a better way, but I found I had to create a temp area in the catalog for integration tests to store files
+  - tests/data-quality-tests/bronze/ods/test_contact_details.py
+    - look in the catalog to see what's being tested: it's the dlt output
+    - for me that is dev_catalog philip_tate_bronze_ods contact details
+    - if these tests fail for you, it may be resolved once we deploy the DAB
+
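The "try every section, then fail at the end" behaviour described for run_tests can be sketched as below. This is an illustration only, not the notebook's actual code: `run_sections` and `fail_if_any_failed` are hypothetical names, and each section is modelled as a callable returning an exit code (in a real notebook it might wrap `pytest.main([...])`, which returns 0 when all selected tests pass).

```python
from typing import Callable, Dict


def run_sections(sections: Dict[str, Callable[[], int]]) -> Dict[str, int]:
    """Run every test section, recording each exit code instead of stopping early."""
    results = {}
    for name, run in sections.items():
        results[name] = int(run())
    return results


def fail_if_any_failed(results: Dict[str, int]) -> None:
    """Raise once all sections have run, so a CI/CD job sees the failure."""
    failed = [name for name, code in results.items() if code != 0]
    if failed:
        raise RuntimeError("Test sections failed: " + ", ".join(failed))


if __name__ == "__main__":
    # Stand-in sections; real ones would call pytest on a folder or marker.
    outcome = run_sections({"unit": lambda: 0, "integration": lambda: 0})
    fail_if_any_failed(outcome)
    print(outcome)
```

Deferring the exception means one run reports every failing section, rather than only the first.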
+## Review the assigned file
+- make changes directly to the file; this will give us changes to git commit
+- for other comments, feedback and suggestions, add to the file you created for yourself in docs
+
+
+
+
+
+
+## Extra bit if you like
+- look at run_lint; this is just a useful code check before deploying code. You can target a file with it, and it's not in CI/CD
+
+## Deploy personal dab
+- look in databricks.yml
+  - look at the personal section, and then the workspace root_path; this is where your bundle will be deployed
+  - notice some pytest marks are passed from the bundle, as we may use some tests dependent on the environment
+  - also notice the sync exclude for the prod environment
+    - (Q) do you think these are the right things to exclude?
+- go to your workspace, go to the top level; you should be able to see a rocket icon, under the folder and above the triangle-square-circle icon
+  - the source should be DatabricksPOC
+  - change the target to personal (the rest are deployed by CI/CD)
+  - hit deploy to deploy to the .bundle folder in your personal area, at the top level so it's outside of source control
+- if you run your version of "[dev X] [personal] medallion_refresh run" you will run the two pipelines and create the tables in the catalog, and the data tests should start working. (The testing job runs this same task, so CI/CD currently causes a data refresh so that it can do the data quality checks; we may want to make this more targeted to the files the data quality checks cover, or just run the checks for the specific tables we expect to be affected.)
+*I have seen some issues here, so please have a look at what bundle resources are listed under the deploy*
+- some jobs should run on deploy
+- some pipelines should appear
+- go to .bundle in your user space and look for recent files
+
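One way marks passed from the bundle could select environment-dependent tests is by turning them into a pytest `-m` expression. The sketch below is an assumption about the mechanism, not the repo's code: `build_pytest_args` is a hypothetical helper, and it assumes the marks arrive as a comma-separated string.

```python
from typing import List


def build_pytest_args(marks: str, test_path: str = "tests") -> List[str]:
    """Build a pytest argument list from a comma-separated marks string (assumed format)."""
    selected = [mark.strip() for mark in marks.split(",") if mark.strip()]
    args = [test_path]
    if selected:
        # pytest's -m option accepts boolean expressions such as "unit or integration".
        args += ["-m", " or ".join(selected)]
    return args


if __name__ == "__main__":
    print(build_pytest_args("unit, integration"))
    print(build_pytest_args(""))
```

The resulting list could be handed to `pytest.main(...)`; an empty marks string leaves the selection unrestricted.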
+## Commit and pull request
+
+- click on the repo folder, then the git symbol button next to the branch name, to open git within Databricks
+- create a commit using the commit naming doc from the docs folder
+- click commit and push, then within the window click "create pull request"; it appears as a link in the window
+- **Really easy to forget** -> change the base drop-down so your branch targets "dev-data-team-shared", not main
+- set reviewers to: X, Phil, X, X
+- assignee: yourself
+- labels: documentation
+- fill out the pull request md doc; switch to preview to see it
+- look to the bottom for your list of commits
+- there is a commits tab, always worth a quick check, and a files changed tab
+- look through files changed and follow the peer review guidance in the PR template to comment, or ask questions yourself for others to contribute to and give context. The discussion is the moment where the team gets to collaborate before something eventually escapes to prod
+- click on the drop-down where it says "review required" with a red cross, "all checks passed" with a green tick, "merging is blocked" with a red triangle, and open the "all checks have passed" drop-down
+  - double click on the pytest one
+  - open the "run pytest against unit tests" drop-down on the right
+  - you should see it logging that tests have passed; these were run inside GitHub. If you wish, you can click "workflow file" to see the code behind it
+- That's it; now just wait. Once your pull request is approved you can merge it
+
+## Contribute to others' pull requests
+- give feedback in comments and in the changes tab on someone else's pull request, and once you're happy, approve their request (we can merge it in a meeting to see the CI/CD steps)
+
+
+
+
+
+
+## Task checklist
+[ ] looked at .assistant_workspace_instructions.md
+[ ] added an instruction to .assistant_instructions.md
+[ ] asked the AI tool in Databricks a question about the repo or a file in it
+[ ] looked at files recommended in orientation doc
+[ ] looked at test folder
+[ ] ran tests
+[ ] deployed personal dab
+[ ] reviewed requested doc
+[ ] added notes in <name>-feedback.md
+[ ] created pull request
+[ ] commented on someone else's pull request
+
+
+
+

resources/pipeline/ingest/ods_ingestion.yml

Lines changed: 5 additions & 4 deletions
@@ -1,13 +1,13 @@
 ###############################
 ## notes
 ###############################
+
+## ⚠️ WARNING this job will fail if pipelines are running already ⚠️
+
 ## We should think about these resource files I think potentially a .yml per layer bronze.yml may make sense
 ## We will not define schemas here
-## We use this file to expose from databricks.yml the variables we need to set up the pipeline
-## We will define too variables just for the set of pipelines here too if we start running layer based .ymls then we would have layer level variables here
+
 ###############################
-# If we are running multlipe pipelines we may define all their vars at the top
-# bundle for vars originating from databricks.ymly
 # I am naming vars pipeline. from pipeline files
 
 # dont define variables here they are global anyway so set in dab
@@ -22,6 +22,7 @@ x-bronze-config: &bronze-config
   # f"{ADLS_PROTOCOL}{container}@{storage_account}{ADLS_SUFFIX}/ -> py adds {folder_name}/"
   pipeline.storage_container_path: "abfss://${var.layer_bronze}@${var.storage_account}.dfs.core.windows.net/"
 
+
 resources:
   pipelines:
     pipeline_ods_ingestion:
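The comment in the hunk above describes the storage path as a container root to which the Python side appends a folder name. A minimal sketch of that composition follows; the constant names come from the comment, but their values, the helper names, and the `examplestorage` account and `contact_details` folder are assumptions for illustration.

```python
# Values inferred from the abfss:// URL in the YAML; treat them as assumptions.
ADLS_PROTOCOL = "abfss://"
ADLS_SUFFIX = ".dfs.core.windows.net"


def build_container_path(container: str, storage_account: str) -> str:
    """Compose the container root, mirroring pipeline.storage_container_path."""
    return f"{ADLS_PROTOCOL}{container}@{storage_account}{ADLS_SUFFIX}/"


def build_folder_path(container_path: str, folder_name: str) -> str:
    """Append a folder to the container root, as the comment says the Python code does."""
    return f"{container_path}{folder_name.strip('/')}/"


if __name__ == "__main__":
    root = build_container_path("bronze", "examplestorage")  # hypothetical account name
    print(build_folder_path(root, "contact_details"))  # hypothetical folder name
```

Keeping the trailing slash on the container root means the folder helper can concatenate without checking separators.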
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+variables:
+  # ============================================================
+  # Data Architecture Constants
+  # Structural identifiers for medallion layers and logical domains.
+  # These should remain stable across environments.
+  # ============================================================
+
+  # ---- Medallion Layers ----
+
+  layer_bronze:
+    default: "bronze"
+    description: Bronze layer identifier for raw ingested data
+
+  layer_silver:
+    default: "silver"
+    description: Silver layer identifier for cleansed and enriched data
+
+  layer_gold:
+    default: "gold"
+    description: Gold layer identifier for curated and business-ready data
+
+
+  # ---- Data Domains ----
+
+  domain_ods:
+    default: "ods"
+    description: Operational Data Store domain for source-aligned datasets
+
+  domain_reporting:
+    default: "reporting"
+    description: Reporting domain for aggregated and analytics-ready datasets
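To see how these defaults might flow into resource files through `${var.<name>}` references, here is a small Python sketch of the substitution. It only imitates the idea for illustration and is not how the Databricks CLI is implemented; `resolve_vars` and the `${var.layer_bronze}_${var.domain_ods}` template are hypothetical.

```python
import re
from typing import Dict

# Matches references of the form ${var.some_name}
VAR_REF = re.compile(r"\$\{var\.([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_vars(template: str, values: Dict[str, str]) -> str:
    """Replace each ${var.name} reference with its value, failing on unknown names."""

    def _lookup(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"Undefined bundle variable: {name}")
        return values[name]

    return VAR_REF.sub(_lookup, template)


if __name__ == "__main__":
    # Defaults copied from the variables file in this commit.
    defaults = {
        "layer_bronze": "bronze",
        "layer_silver": "silver",
        "layer_gold": "gold",
        "domain_ods": "ods",
        "domain_reporting": "reporting",
    }
    print(resolve_vars("${var.layer_bronze}_${var.domain_ods}", defaults))
```

Failing loudly on an undefined name mirrors why it helps to keep these constants in one included file: a typo in a reference surfaces immediately instead of producing a wrong schema or path.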
