This project analyses how ehrQL (Electronic Health Records Query Language) is being used across the OpenSAFELY research ecosystem. The goal is to provide insights into which features are widely used, underutilised, or deprecated, supporting better platform development, training, and documentation.
- Analyse the usage of ehrQL across OpenSAFELY repositories.
- Identify which features are most popular, rarely used, or deprecated.
- Provide actionable insights for the tech team, researchers, and documentation teams.
Key Questions Addressed:
- Which ehrQL features are most commonly used?
- Which features are barely touched?
- Are deprecated features still in use?
- How can findings guide platform improvements and training?
- Prioritise Improvements: Tech team can focus on features most relevant to users.
- Guide Researchers: Encourage adoption of up-to-date features.
- Targeted Documentation: Identify features that need better guidance or tutorials.
- Monitor Platform Health: Spot features that may be underutilised or obsolete.
The analysis was carried out in three main stages:
- Used the GitHub REST API (
/orgs/{ORG}/repos) to retrieve all repositories in the OpenSAFELY GitHub organisation. - Authentication via personal access token (up to 5,000 requests/hour).
- Paginated results to retrieve 100 repositories per request.
- Extracted and stored in CSV:
- Repository name
- Creation date
- Repository URL
- Used GitHub code search API with the query:
ehrQL language:python org:opensafely - Collected metadata for each matching file:
- File name & path
- GitHub file URL
- Raw file URL for direct download
- Downloaded all relevant files locally, with logging and error handling to track failures.
- Parsed Python files to count usage of each ehrQL feature:
- Handled Unicode errors via fallback encoding
- Case-insensitive search using a reference list of features
- Counted occurrences and mapped them to repositories
- Generated CSV outputs:
feature_counts.csv– total occurrences of each featurefeature_repository_map.csv– which repositories use each feature
To make results more accessible, an interactive dashboard was built using Streamlit.
Dashboard Views:
- All Repositories: Repository metadata, creation dates, and trends over time
- Feature Counts: Interactive heatmaps showing feature usage frequency
- Feature Details: Select a feature to view repositories using it, with charts and tables
Most Popular Features:
- where() – 3027 uses: Central for filtering datasets based on conditions
- codelist_from_csv() – 1782 uses: Commonly used to load code lists for patient cohorts
- Temporal Functions:
days()– 1162 usesmonths()– 1275 usesyears()– 336 usesstart_date– 1448 usesend_date– 2693 uses- Conditional & Aggregation Functions:
case()– 702 usesfirst_for_patient()– 436 useslast_for_patient()– 365 uses
These findings highlight key workflows, such as filtering datasets, cohort definitions, date handling, and patient-level calculations, which are central to reproducible health research.
- Worked mostly remotely, communicating progress through Slack updates, and quick calls.
- Developed skills in documenting technical decisions clearly for distributed teams.
- Integrate job execution data to track which features are actually run against patient data, providing insights into real-world feature usage over time.
- Apply learnings to future projects and continue building practical software solutions.
- Languages: Python, SQL
- Libraries: Pandas, NumPy, Regex, tqdm, Pathlib
- Tools: Git, GitHub REST API, Streamlit, Logging
- Data: CSV exports for repository metadata and feature counts
-
Clone the repository:
git clone <repository-url> -
Install dependencies:
pip install -r requirements.txt -
Run scripts to retrieve repository metadata, parse Python files, and count feature usage.
-
Launch Streamlit dashboard:
streamlit run dashboard.py