Skip to content

Commit a311a5f

Browse files
authored
Merge pull request #86 from TechnologyEnhancedLearning/feedback-implementation
Feedback implementation
2 parents b9174da + c7e18f7 commit a311a5f

5 files changed

Lines changed: 69 additions & 200 deletions

File tree

databricks.yml

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -86,7 +86,6 @@ variables:
8686
default: "811de200-7873-4ab3-a435-da3cddf13c36"
8787
description: Service Principal client ID for production environment
8888

89-
9089
# ============================================================
9190
# Testing Configuration
9291
# Controls automated test behavior during job execution.
Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,39 @@
1+
# Adding Packages
2+
3+
## Warning
4+
- AI tends to suggest packages unneccesarily, and list them as imports
5+
6+
7+
## Checks before adding packages
8+
- Is the package already in use within the project?
9+
- Is a similiar package already in use, can we use only one of the two? (unit tests should allow confidence in changing prexisiting package)
10+
- search (to see where used)
11+
- requirements.txt and requirements-dev.txt
12+
- Is the package still regularly updated, trusted, managed by a team not an individual and lots of downloads?
13+
- E.g. releases https://pypi.org/project/pandas/#history
14+
- E.g. versioning and downloads https://pepy.tech/projects/pandas?timeRange=threeMonths&category=version&includeCIDownloads=true&granularity=daily&viewType=line&versions=3.0.1%2C3.0.0%2C3.0.0rc2
15+
- Is it necessary?
16+
- niceties like packages providing nicer syntaxes can make AI effective less and increase knowledge requirement for engaging with ecosystem so pros and cons should be weighed potential including the team
17+
- they increase the maintenance surface
18+
- Is there a better package?
19+
- More downloaded/maintained
20+
- Related to existing package
21+
- Recommended for databricks
22+
- Is more future proof
23+
24+
25+
## How to add packages
26+
27+
- Add to AI prompts global and individual instructions to know we are using the package so it can recommend its usage
28+
- Add to toml (if needs some configuration)
29+
- Add to requirements.txt and requirements-dev.txt so they are global not local, and usable in git pipeline
30+
- requirements.txt should be the same as requirments-dev.txt excluding packages not used in prod such as test and lint
31+
- requirements-dev.txt is for all environments excepts prod
32+
- pin version numbers when adding packages to these files
33+
- include a comment on why it was added
34+
35+
36+
# Extras
37+
- we are not currently creating package wheels with own libraries to apply to clusters 18/3/26
38+
- [JAR may be relevant for wheel](https://learn.microsoft.com/en-gb/azure/databricks/release-notes/release-types)
39+
- [sharing our packages across dbx ... may not be useful as may not help with git](https://learn.microsoft.com/en-gb/azure/databricks/compute/serverless/dependencies#create-common-utilities-to-share-across-your-workspace)

docs/Clusters notes in progress.md

Whitespace-only changes.

docs/TODO.md

Lines changed: 29 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -1,34 +1,42 @@
11
# Not covered in POC
2-
- **try moving vars to const files etc and out of bundle**
3-
include:
4-
- 'variables/*.yml' if execute in order might be able to put vars in folders based off headings, may affect running in personal area before deploy?
52

6-
# Not done
7-
- explore what black does in toml
3+
4+
# Changes to make for the POC review qqqq
5+
- get input on adhoc analysis template and file naming
6+
- add adhoc read me to someone feedback objectives
7+
8+
# Not done qqqq
9+
- add black to linting
810
- This document has excellent coverage and should be used to plan next steps and best practice examples [Dataquality](https://www.databricks.com/discover/pages/data-quality-management)
911
- read https://blogs.perficient.com/2025/03/19/delta-live-tables-and-great-expectations/
10-
- need public repo for branch rules, theyre not tweaked so cant just be exported but
11-
- can set deployment rules
12-
- and rules per branch .yml file
13-
- github auto merge staging
14-
- **version numbering**
15-
- enable copilot auto pr
16-
- recommend enable in branch rules
17-
- and require one reviewer
18-
- /addinstructions as a command in databricks ai can work so can put user space or work space instructions
19-
- lakehouse monitoring!
20-
- seperated tests requiring rerunning pipelines from those that dont (data quality, sql sp)
21-
- different job per test type and different notebook
22-
23-
24-
- seperating dabs
12+
- environment branch rules stop pr directing against staging and prod make only correct branches mergeable (can always turn rule off in emergency)
13+
- ~~version numbering~~ we can do release versioning manually makes more sense for us but it is not vital
14+
- ~~enable copilot auto pr~~
15+
- **IMP**seperated tests requiring rerunning pipelines from those that dont (data quality, sql sp)
16+
- different job per test type and different notebook
17+
18+
19+
- ~~seperating dabs~~ dont see the advantage, team may have reason for why its desired
2520
- can do bronze, silver, gold etc
2621
- means modular deployment
2722
- cicd would need to detect changes in folders in order to know which dab to deploy
2823
- wheels again may be needed for code between dabs but maybe not
2924
- cluster dealing with smaller sizes
3025
- how to test
3126

32-
- are requirements txt still needed is it due to python version and setup in git pipeline that i am not just using toml?
27+
- requirements.txt and requirements-dev.txt need sorting, individual installs need removing
28+
29+
30+
31+
32+
[review existing policys](https://adb-295718430158257.17.azuredatabricks.net/compute/policies?o=295718430158257)
33+
3334

35+
- config in the workspace level outside of git
36+
- this may be how we should manage the config
3437

38+
# something here may help seperating unit integration etc tests
39+
[An example of some tests in NHS repo](https://github.com/nhsengland/NHSE_probabilistic_linkage/blob/main/tests/DAE_only_tests.py)
40+
[An example of some tests in NHS repo2](https://github.com/nhsengland/NHSE_probabilistic_linkage/tree/main/tests)
41+
- # MAGIC %run ../utils/dataset_ingestion_utils
42+
- dbutils.notebook.run('../../notebooks_linking/clerical_review_evaluation', 0, {"params": params_serialized})

docs/phil-feedback.md

Lines changed: 1 addition & 178 deletions
Original file line numberDiff line numberDiff line change
@@ -1,178 +1 @@
1-
# Feedback
2-
3-
Collect feedback here qqqq
4-
5-
6-
## Concerns to check
7-
- Environments and Packages in https://nhsdigital.github.io/rap-community-of-practice/training_resources/python/intro-to-python/ am i doing enough here is it set by my toml?
8-
9-
## Reviewing Other DBX Repos
10-
- [Nice standard Git flow explanation a resource for getting a clear feel of the process and all things github](https://docs.github.com/en/get-started/using-github/github-flow)
11-
- add to git docs once other changes made
12-
13-
14-
## Make a RAP doc
15-
[RAP](https://github.com/NHSDigital/rap-community-of-practice/tree/main/docs)
16-
[what rap is why it matters high level gov stuff long strategy](https://analysisfunction.civilservice.gov.uk/policy-store/reproducible-analytical-pipelines-strategy/)
17-
18-
""What you need for RAP
19-
There is no specific tool that is required to build a RAP, but both R and Python provide the power and flexibility to carry out end-to-end analytical processes, from data source to final presentation.
20-
21-
Once the minimum RAP has been implemented statisticians and analysts should attempt to further develop their pipeline using:
22-
23-
functions or code modularity
24-
unit testing of functions
25-
error handling for functions
26-
documentation of functions
27-
packaging
28-
code style
29-
input data validation
30-
logging of data and the analysis
31-
continuous integration
32-
dependency management"
33-
34-
[Nice rap blog](https://analysisfunction.civilservice.gov.uk/blog/the-nature-of-reproducible-analytical-pipelines-rap/)
35-
- should we do unit tests together initially, or pair code initially
36-
“Just-in-time” learning is the best approach
37-
"As we progressed through the project we were inevitably presented with new concepts or tasks unfamiliar to us as a working group. We got into the habit of tackling these with bespoke just-in-time training sessions for the team. When it came to test parts of the code we had developed, we held a unit testing session, facilitated by the BPI Team."
38-
..."We had lots of sessions like this; when we needed to resolve merge conflicts in Git, complete peer reviewing and develop documentation"
39-
40-
41-
doc git branch and comits add
42-
43-
"https://analysisfunction.civilservice.gov.uk/blog/the-nature-of-reproducible-analytical-pipelines-rap/"
44-
- power of git
45-
- was a learning curve
46-
- "It’s a completely new way of working for us so it took some time for us to get to grips with it, but we got there. If we were to do this project again, I would push my team to use this to host their code from the first line they developed. I would also encourage them to get into the habit of committing to our repository much more frequently. When it came to quality assuring our pipeline, time elapsed between commits meant we had large chunks of code to check. This would have been a much more streamlined and manageable process had we used Git little and often from the start."
47-
48-
49-
[cute rap overview friendly read]("https://nhsdigital.github.io/rap-community-of-practice/#what-is-rap")
50-
51-
""Recommendation 7: promote and resource ‘Reproducible Analytical Pipelines’ ... as the minimum standard for academic and NHS data analysis"
52-
Data Saves Lives, 2022 government strategy report"
53-
54-
**Can we do this**
55-
"Our RAP Service
56-
The NHS England Data Science team offers support to NHSE teams looking to implement RAP.
57-
58-
We'll:
59-
60-
Work alongside your team for 6-12 weeks
61-
Work with you to recreate one of your processes in the RAP way
62-
Deliver training in Git, Python, R, Databricks, and anything else you'll need
63-
Learn more:"
64-
65-
[Rap tools git pyspark etc great](https://nhsdigital.github.io/rap-community-of-practice/training_resources/git/introduction-to-git/)
66-
- this is really good
67-
- Learning hub python course 2 days
68-
- https://analysisfunction.civilservice.gov.uk/training/introduction-to-python/
69-
- learning hub 2 day pyspark
70-
- https://analysisfunction.civilservice.gov.uk/training/introduction-to-pyspark/
71-
72-
[Rap 7min explanation 7 min questions vid nhs](https://www.youtube.com/watch?v=npEh7RmdTKM)
73-
- python git etc
74-
- process mapping 5.13
75-
- advise write functions names at the beggining just dont populate them yet
76-
- write the test names for them too but dont create them yet
77-
78-
- DBX git testing, setting us up well for rap
79-
80-
81-
- add this to code quality and the peer review doc
82-
https://nhsdigital.github.io/rap-community-of-practice/training_resources/coding_tips/refactoring-guide/
83-
84-
85-
86-
[Time management starting to use Rap](https://nhsdigital.github.io/rap-community-of-practice/implementing_RAP/rap-readiness/)
87-
- e.g. 15 hours of tool training recommended
88-
- think slice is what we are after as our next step i think!
89-
- target a level of RAP <- we should do this
90-
91-
[Preparing for RAP List of code, docs, tool resource to help get rap ready](https://nhsdigital.github.io/rap-community-of-practice/tags/#preparing-for-rap)
92-
- really useful list here
93-
- coding tips section
94-
95-
96-
Add to unit testing readme and PR
97-
https://nhsdigital.github.io/rap-community-of-practice/training_resources/python/unit-testing/
98-
- provides good guidances
99-
- good all these docs saying python and pytest
100-
101-
- add to PR does python have python doc strings
102-
"""
103-
Converts temperatures in Fahrenheit to Celsius.
104-
105-
Takes the input temperature (in Fahrenheit), calculates the value of
106-
the same temperature in Celsius, and then returns this value.
107-
108-
Args:
109-
temp: a float value representing a temperature in Fahrenheit.
110-
111-
Returns:
112-
The input temperature converted into Celsius
113-
114-
Example:
115-
fahrenheit_to_celsius(temp=77)
116-
>>> 25.0
117-
"""
118-
- this will help us identify when to use other functions, and how to adapt them if they need to change to support more callers
119-
120-
121-
IDE and python over jupyter
122-
"Jupyter notebooks require quite a bit of arduous coding gymnastics to perform what in Python files would be simple imports"
123-
but make notebooks easier with utility functions to turn into python libraries sort of https://jupyter-notebook.readthedocs.io/en/4.x/examples/Notebook/rstversions/Importing%20Notebooks.html
124-
125-
126-
Create Next steps doc
127-
Next Steps. Future task
128-
129-
PT recommends
130-
https://nhsdigital.github.io/rap-community-of-practice/implementing_RAP/thin-slice-strategy/
131-
focus on one ingestion → dashboard process then slice
132-
133-
implement, review as a team, try to maximise RAP good practice using the pipeline
134-
135-
Everyone a reviewer for PR, take detours for simple training dives on anything that feels not right or is a blocker.
136-
There are in the POC project good practice docs, references and some training reference for pyspark and python, unit testing etc.
137-
138-
139-
[RAP massive gov doc maybe useful if looking for somehting specific its long](https://www.gov.uk/guidance/the-aqua-book)
140-
141-
[How to refactor](https://nhsdigital.github.io/rap-community-of-practice/training_resources/coding_tips/refactoring-guide/)
142-
143-
[click newbie for git guide for rap](https://nhsdigital.github.io/rap-community-of-practice/implementing_RAP/skills_for_rap/git_for_rap/)
144-
145-
[Quality coding, e.g doc manual coding in repo?](https://nhsdigital.github.io/rap-community-of-practice/implementing_RAP/workflow/quality-assuring-analytical-outputs/)
146-
147-
148-
[RAP level really shows we are on the right track](https://nhsdigital.github.io/rap-community-of-practice/introduction_to_RAP/levels_of_RAP/)
149-
- we probably aim for silver as should achieve alot but ensure baseline
150-
- we also are looking at hitting most the gold with cicd, and databricks and dabs
151-
152-
153-
[python style guide if interested, can be useful not just to lint but to help name etc](https://peps.python.org/pep-0008/)
154-
155-
https://nhsdigital.github.io/rap-community-of-practice/training_resources/pyspark/pyspark-style-guide/
156-
"We avoid using pandas or koalas because it adds another layer of learning. The PySpark method chaining syntax is easy to learn, easy to read, and will be familiar for anyone who has used SQL."
157-
158-
159-
pyspark dynamic modular queries lazy.
160-
161-
unit test, expected, error handling, edge cases <- alreay got this comment but make sure i put it in
162-
163-
python > pandas
164-
165-
[pyspark RAP](https://nhsdigital.github.io/rap-community-of-practice/training_resources/pyspark/) put this in PR
166-
167-
[another pyguide](https://github.com/google/styleguide/blob/gh-pages/pyguide.md)
168-
[another approachable longer doc about RAP and what it involves](https://best-practice-and-impact.github.io/qa-of-code-guidance/principles.html)
169-
170-
171-
[gov rap policies](https://github.com/NHSDigital/rap-community-of-practice/blob/main/docs/introduction_to_RAP/gov-policy-on-rap.md)
172-
173-
174-
# Not really need but nice to cpmment somewhere
175-
[open source 1](https://www.gov.uk/guidance/be-open-and-use-open-source)
176-
[open source 2](https://gds.blog.gov.uk/2017/09/04/the-benefits-of-coding-in-the-open/)
177-
[open source 3](https://github.com/nhsx/open-source-policy/blob/main/open-source-policy.md)
178-
1+
# Feedback

0 commit comments

Comments
 (0)