This project builds a two-phase GitHub data pipeline:
- Phase 1 samples repositories from GitHub Search and stores repository metadata.
- Phase 2 reads the saved repositories, refreshes repository metadata, and downloads commit history from the current default branch.
The project uses Poetry for dependency management and a src package layout for Python modules.
## Project layout

- `src/github_datapipe/core/`: shared config, GitHub API, runtime, and file IO helpers.
- `src/github_datapipe/phases/phase1_repository_sampling/`: phase 1 repository discovery and persistence.
- `src/github_datapipe/phases/phase2_commit_ingestion/`: phase 2 commit fetching and persistence.
- `tests/`: automated tests for the pipeline behavior.

## Prerequisites
- Python 3.12 installed locally
- Poetry installed locally
- GitHub personal access token stored in `.env` as `github_token=YOUR_TOKEN_HERE`

## Setup

From the project root, install the local environment with:
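The token entry above is a plain `key=value` line. The project presumably loads it with a dotenv-style library; for illustration only, here is a minimal stdlib sketch of the same parsing (not the pipeline's actual loader):

```python
def load_env(path=".env"):
    """Parse simple key=value lines from a .env-style file into a dict."""
    values = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # skip blank lines and comments; keep only key=value pairs
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                values[key.strip()] = value.strip()
    return values
```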
```
poetry install
```

## Phase 1: repository sampling

The default phase 1 command collects 10 repositories using the default search query from `src/github_datapipe/core/config.py`:

```
poetry run github-datapipe sample-repos
```

Useful overrides:
```
poetry run github-datapipe sample-repos --count 25
poetry run github-datapipe sample-repos --count 25 --query "is:public stars:>50 size:>5000 archived:false"
poetry run github-datapipe sample-repos --mode fresh
```

After phase 1 completes, check:

- `runs/<run_id>/phase1_repository_sampling/repos.jsonl`
- `runs/<run_id>/phase1_repository_sampling/manifest.json`
- `runs/seen_repo_ids.json`
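The `.jsonl` outputs hold one JSON object per line. A small helper for inspecting them (the JSONL framing is from this README; any field names inside the records are not specified here):

```python
import json

def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, `len(list(read_jsonl(".../repos.jsonl")))` gives the number of sampled repositories.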
The command prints the generated run_id, which you will use for phase 2.
## Phase 2: commit ingestion

Phase 2 consumes the saved phase 1 repositories and downloads commit history from each repository's current default branch.
Run against a phase 1 run:
```
poetry run github-datapipe fetch-commits --run-id <run_id>
```

Useful overrides:

```
poetry run github-datapipe fetch-commits --run-id <run_id> --mode resume
poetry run github-datapipe fetch-commits --run-id <run_id> --max-pages-per-repo 3
poetry run github-datapipe fetch-commits --run-id <run_id> --per-page 50
```

Defaults: `mode=refresh`, `per_page=100`, `max_pages_per_repo=1`.
The default `max_pages_per_repo=1` keeps the prototype small and limits commit downloads to the first page for each repository.
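Under these defaults, the page cap maps directly onto which pages of the GitHub commits API get requested. A sketch of the URL sequence a given cap allows (illustrative only; the real client may build its requests differently):

```python
def commit_page_urls(owner, repo, per_page=100, max_pages_per_repo=1):
    """List the paged commits-API URLs permitted by the page cap."""
    base = f"https://api.github.com/repos/{owner}/{repo}/commits"
    return [
        f"{base}?per_page={per_page}&page={page}"
        for page in range(1, max_pages_per_repo + 1)
    ]
```

With the defaults, only page 1 is ever fetched per repository.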
After phase 2 completes, check:
- `runs/<run_id>/phase2_commit_ingestion/commits/commits-0001.jsonl`
- `runs/<run_id>/phase2_commit_ingestion/repo_status.jsonl`
- `runs/<run_id>/phase2_commit_ingestion/manifest.json`
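A quick way to summarize a run is to tally the per-repo outcomes in `repo_status.jsonl`. A sketch, assuming each record carries a `status` field (that field name is a guess for illustration; check the actual records for the real schema):

```python
import json
from collections import Counter

def status_counts(status_path):
    """Tally per-repo outcomes from a repo_status.jsonl file."""
    counts = Counter()
    with open(status_path) as fh:
        for line in fh:
            line = line.strip()
            if line:
                # "status" is an assumed field name, not confirmed by the README
                counts[json.loads(line)["status"]] += 1
    return counts
```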
## Tests

Run the automated tests with:

```
poetry run pytest -q
```

If the Poetry script entrypoint is not available yet, run the CLI module directly:

```
poetry run python -m github_datapipe.cli sample-repos
poetry run python -m github_datapipe.cli fetch-commits --run-id <run_id>
```

## Notes

- Phase 1 currently uses deterministic GitHub Search traversal rather than randomized sampling.
- Phase 2 stores one normalized commit per JSONL line.
- Resume mode in phase 2 skips repositories already marked `complete`.
- Truncated repositories are marked as `success_with_warning` when the page cap is reached.
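The resume-mode skip can be pictured as a filter over the last recorded status per repository. A sketch only, assuming `repo_id` and `status` field names (both are guesses for illustration, not confirmed by this README):

```python
import json

def repos_to_resume(status_path):
    """Return ids of repos whose latest recorded status is not 'complete'."""
    latest = {}
    with open(status_path) as fh:
        for line in fh:
            line = line.strip()
            if line:
                record = json.loads(line)
                # later records overwrite earlier ones, keeping the latest status
                latest[record["repo_id"]] = record["status"]
    return sorted(rid for rid, status in latest.items() if status != "complete")
```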