An analytics engineering project that transforms GitHub repository data into business-ready insights using dbt, DuckDB, and GitHub APIs.
This project demonstrates modern analytics engineering practices by:
- Extracting repository data from GitHub APIs
- Storing raw data in DuckDB
- Transforming data using dbt
- Creating clean analytics models
- Generating documentation and lineage graphs
- Producing insights about repository activity and contributors
The goal is to simulate a real-world analytics engineering workflow similar to those used by data teams at technology companies.

Tech stack:
- Python
- DuckDB
- dbt Core
- dbt-duckdb
- GitHub REST API
- SQL
- Git

Project structure:

github-analytics-dbt/
│
├── data/
│ └── raw/
│
├── dbt_project/
│ ├── models/
│ │ ├── staging/
│ │ ├── marts/
│ │ └── schema.yml
│ │
│ ├── analyses/
│ ├── tests/
│ └── dbt_project.yml
│
├── ingest_github.py
├── requirements.txt
├── README.md
└── github_analytics.duckdb
This project uses the GitHub REST API.
Example repository:
microsoft/playwright
Data collected:
- Repository metadata
- Stars
- Forks
- Open issues
- Pull requests
- Contributors
- Repository activity
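
As a rough illustration of how the repository-level fields above could be pulled, here is a minimal sketch using only the standard library. It assumes the public `GET /repos/{owner}/{repo}` endpoint; the helper names `fetch_repository` and `to_raw_row` and the choice of fields are hypothetical and not necessarily what `ingest_github.py` does.

```python
import json
import urllib.request
from typing import Optional

API_ROOT = "https://api.github.com"


def fetch_repository(full_name: str, token: Optional[str] = None) -> dict:
    """Fetch one repository's metadata, e.g. full_name="microsoft/playwright"."""
    request = urllib.request.Request(
        f"{API_ROOT}/repos/{full_name}",
        headers={"Accept": "application/vnd.github+json"},
    )
    if token:  # unauthenticated calls work but have a much lower rate limit
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def to_raw_row(payload: dict) -> dict:
    """Keep a subset of fields (an assumed selection), using the API's own names."""
    return {
        "full_name": payload["full_name"],
        "stargazers_count": payload["stargazers_count"],
        "forks_count": payload["forks_count"],
        "open_issues_count": payload["open_issues_count"],
        "created_at": payload["created_at"],
    }
```

Calling `to_raw_row(fetch_repository("microsoft/playwright"))` would yield one flat record for the example repository. Pull requests and contributors come from separate endpoints and are not covered by this sketch.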

Analyses covered:
- Which repositories have the most stars?
- Which repositories are growing fastest?
- What is the star-to-fork ratio?
- Top contributors by commits
- Most active repositories
- Contribution distribution
- Open vs closed pull requests
- Issue activity
- Repository engagement metrics
    git clone https://github.com/yourusername/github-analytics-dbt.git
    cd github-analytics-dbt

Mac/Linux:

    python3 -m venv venv
    source venv/bin/activate

Windows:

    python -m venv venv
    venv\Scripts\activate

Install dependencies:

    pip install -r requirements.txt

Run:

    python ingest_github.py

This will:
- Call GitHub APIs
- Retrieve repository information
- Store raw data in DuckDB
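
The last step above, storing raw data in DuckDB, might look roughly like the sketch below. The table name `raw_repositories`, its columns, and the helpers `open_warehouse` and `store_raw` are assumptions made for illustration; the actual script may organize this differently. `store_raw` only relies on `execute` with `?` placeholders, which DuckDB's Python client supports.

```python
from typing import Iterable

# Hypothetical raw table; column names mirror the GitHub API field names.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS raw_repositories (
    full_name VARCHAR,
    stargazers_count BIGINT,
    forks_count BIGINT,
    open_issues_count BIGINT
)
"""
INSERT_SQL = "INSERT INTO raw_repositories VALUES (?, ?, ?, ?)"


def open_warehouse(path: str = "github_analytics.duckdb"):
    """Open the DuckDB file from the project root (requires the duckdb package)."""
    import duckdb  # imported lazily so store_raw stays usable without it

    return duckdb.connect(path)


def store_raw(con, rows: Iterable[tuple]) -> int:
    """Append rows to the raw table and return the table's total row count."""
    con.execute(CREATE_SQL)
    for row in rows:
        con.execute(INSERT_SQL, list(row))
    return con.execute("SELECT count(*) FROM raw_repositories").fetchone()[0]
```

A call such as `store_raw(open_warehouse(), [("microsoft/playwright", stars, forks, issues)])` would append one row, where the three counts are values taken from the API response. The sketch appends on every run; a real ingestion script would probably add a load timestamp or replace existing rows to avoid duplicates.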
Navigate to dbt project:

    cd dbt_project

Run transformations:

    dbt run

Run tests:

    dbt test

Generate documentation:

    dbt docs generate

Launch docs site:

    dbt docs serve

Staging models:
- stg_repositories
- stg_contributors
Purpose:
- Rename columns
- Standardize formats
- Clean raw API data
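
The staging models themselves are dbt SQL, but the three purposes listed above can be illustrated in plain Python. The sketch below is only an analogue: the target column names in `RENAMES`, the lower-casing of names, and the zero default for missing counts are all assumptions, not the project's actual staging logic.

```python
from datetime import datetime, timezone

# Hypothetical mapping from raw API field names to staging column names.
RENAMES = {
    "full_name": "repository_name",
    "stargazers_count": "stars",
    "forks_count": "forks",
    "open_issues_count": "open_issues",
}


def stage_repository(raw: dict) -> dict:
    """Rename columns, standardize formats, and clean one raw record."""
    # Rename: map API field names to friendlier column names.
    staged = {new: raw.get(old) for old, new in RENAMES.items()}
    # Standardize: trim whitespace and lower-case the repository name.
    staged["repository_name"] = staged["repository_name"].strip().lower()
    # Clean (assumption): cast counts to int and treat missing counts as zero.
    for column in ("stars", "forks", "open_issues"):
        staged[column] = int(staged[column] or 0)
    # Standardize: ISO 8601 UTC timestamp string -> timezone-aware datetime.
    created = raw.get("created_at")
    staged["created_at"] = (
        datetime.strptime(created, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        if created
        else None
    )
    return staged
```

In the dbt project the equivalent work would be a `select` that aliases and casts columns from the raw table.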

Mart models:
- repository_summary
- contributor_summary
Purpose:
- Business-friendly reporting tables
- Aggregated metrics
- KPI calculations
| Metric | Description |
|---|---|
| Total Stars | Repository popularity |
| Total Forks | Repository adoption |
| Open Issues | Development workload |
| Contributors | Community engagement |
| Stars per Contributor | Total stars divided by number of contributors |
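
The ratio metrics can be sketched in a few lines of Python. This is an illustrative calculation only; the function names and the decision to return `None` for a zero denominator are assumptions, and the project computes its KPIs in the dbt mart models.

```python
from typing import Optional


def safe_ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    if denominator == 0:
        return None
    return numerator / denominator


def repository_kpis(stars: int, forks: int, contributors: int) -> dict:
    """Derive the two ratio metrics from the base counts."""
    return {
        "stars_per_contributor": safe_ratio(stars, contributors),
        "star_to_fork_ratio": safe_ratio(stars, forks),
    }
```

For example, `repository_kpis(100, 25, 4)` gives 25.0 stars per contributor and a star-to-fork ratio of 4.0. Guarding the division matters because a new repository can have zero forks.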

Data quality checks are implemented as dbt tests:
- Not Null Tests
- Unique Tests
- Accepted Values Tests
- Relationship Tests
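
To make the first two test types concrete, the sketch below expresses what they check in plain Python. It is an analogue rather than dbt's implementation: dbt compiles each test to a SQL query and the test passes when that query returns no rows. The function names are hypothetical.

```python
from collections import Counter


def failing_unique(rows: list, column: str) -> list:
    """Values that appear more than once; an empty list means the test passes."""
    # Nulls are ignored here, as they are by dbt's built-in unique test.
    counts = Counter(row[column] for row in rows if row[column] is not None)
    return [value for value, n in counts.items() if n > 1]


def failing_not_null(rows: list, column: str) -> list:
    """Rows whose value is missing; an empty list means the test passes."""
    return [row for row in rows if row[column] is None]
```

Applied to a contributor table, `failing_unique(rows, "contributor_id")` would surface duplicated keys, where `contributor_id` is an assumed column name.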

Example:

    tests:
      - unique
      - not_null

- Data Modeling
- SQL Transformations
- ETL Pipelines
- Data Quality Testing
- Documentation
- API Integration
- Python Automation
- DuckDB
- Data Storage
- KPI Development
- Reporting Models
- Data Storytelling

Future improvements:
- GitHub Actions CI/CD
- Incremental Models
- dbt Snapshots
- Multiple Repository Support
- Power BI Dashboard
- Streamlit Analytics Dashboard