HBrahmbhatt/github_extractor

GitHub Data Pipeline

This project builds a two-phase GitHub data pipeline:

  • Phase 1 samples repositories from GitHub Search and stores repository metadata.
  • Phase 2 reads the saved repositories, refreshes repository metadata, and downloads commit history from the current default branch.

The project uses Poetry for dependency management and a src package layout for Python modules.

Project Layout

  • src/github_datapipe/core/: shared config, GitHub API, runtime, and file IO helpers.
  • src/github_datapipe/phases/phase1_repository_sampling/: Phase 1 repository discovery and persistence.
  • src/github_datapipe/phases/phase2_commit_ingestion/: Phase 2 commit fetching and persistence.
  • tests/: automated tests for the pipeline behavior.

Prerequisites

  • Python 3.12 installed locally
  • Poetry installed locally
  • GitHub personal access token stored in .env as:
github_token=YOUR_TOKEN_HERE

Install Dependencies

From the project root, install the local environment with:

poetry install

Run Phase 1

By default, the sample-repos command collects 10 repositories using the search query defined in src/github_datapipe/core/config.py.

poetry run github-datapipe sample-repos

Useful overrides:

poetry run github-datapipe sample-repos --count 25
poetry run github-datapipe sample-repos --count 25 --query "is:public stars:>50 size:>5000 archived:false"
poetry run github-datapipe sample-repos --mode fresh
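
Under the hood, --count and --query amount to paging through the GitHub Search API. The sketch below only plans the request sequence; the endpoint is the documented GitHub REST search URL, but the helper name and return shape are illustrative, not the project's actual code:

```python
def build_search_requests(query: str, count: int, per_page: int = 100) -> list[dict]:
    """Plan the sequence of search-API page requests needed for `count` repos.

    Illustrative sketch; does not perform any network calls.
    """
    pages = -(-count // per_page)  # ceiling division
    return [
        {
            "url": "https://api.github.com/search/repositories",
            "params": {"q": query, "per_page": min(per_page, count), "page": page},
        }
        for page in range(1, pages + 1)
    ]
```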

Phase 1 Outputs

After phase 1 completes, check:

  • runs/<run_id>/phase1_repository_sampling/repos.jsonl
  • runs/<run_id>/phase1_repository_sampling/manifest.json
  • runs/seen_repo_ids.json

The command prints the generated run_id, which you will use for phase 2.

Run Phase 2

Phase 2 consumes the saved phase 1 repositories and downloads commit history from each repository's current default branch.

Run against a phase 1 run:

poetry run github-datapipe fetch-commits --run-id <run_id>

Useful overrides:

poetry run github-datapipe fetch-commits --run-id <run_id> --mode resume
poetry run github-datapipe fetch-commits --run-id <run_id> --max-pages-per-repo 3
poetry run github-datapipe fetch-commits --run-id <run_id> --per-page 50

Phase 2 Defaults

  • mode=refresh
  • per_page=100
  • max_pages_per_repo=1

The default max_pages_per_repo=1 keeps the prototype small and limits commit downloads to the first page for each repository.
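
The page cap described above can be sketched as a bounded pagination loop. Here fetch_page stands in for the real GitHub commits API call, and the truncation flag mirrors the success_with_warning status mentioned under Notes; none of this is the project's actual code:

```python
def fetch_commit_pages(fetch_page, per_page: int = 100, max_pages_per_repo: int = 1):
    """Fetch at most `max_pages_per_repo` pages of commits (illustrative sketch).

    Returns (commits, truncated): truncated is True when the page cap was hit
    while the last page was still full, i.e. more commits may remain.
    """
    commits, truncated = [], False
    for page in range(1, max_pages_per_repo + 1):
        batch = fetch_page(page=page, per_page=per_page)
        commits.extend(batch)
        if len(batch) < per_page:  # short page: repository history exhausted
            break
    else:
        truncated = True  # cap reached with the last page full
    return commits, truncated
```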

Phase 2 Outputs

After phase 2 completes, check:

  • runs/<run_id>/phase2_commit_ingestion/commits/commits-0001.jsonl
  • runs/<run_id>/phase2_commit_ingestion/repo_status.jsonl
  • runs/<run_id>/phase2_commit_ingestion/manifest.json

Run Tests

Run the automated tests with:

poetry run pytest -q

Alternate CLI Invocation

If the Poetry script entrypoint is not available yet, run the CLI module directly:

poetry run python -m github_datapipe.cli sample-repos
poetry run python -m github_datapipe.cli fetch-commits --run-id <run_id>
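
For reference, the two subcommands and their documented flags map naturally onto an argparse parser. This is a hedged sketch, not the project's actual CLI code; defaults are filled in only where this README states them:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented commands and flags."""
    parser = argparse.ArgumentParser(prog="github-datapipe")
    sub = parser.add_subparsers(dest="command", required=True)

    sample = sub.add_parser("sample-repos")
    sample.add_argument("--count", type=int, default=10)  # documented default
    sample.add_argument("--query", default=None)          # falls back to config.py
    sample.add_argument("--mode", default=None)           # e.g. "fresh"; default unstated

    fetch = sub.add_parser("fetch-commits")
    fetch.add_argument("--run-id", required=True)
    fetch.add_argument("--mode", default="refresh")                  # documented default
    fetch.add_argument("--per-page", type=int, default=100)          # documented default
    fetch.add_argument("--max-pages-per-repo", type=int, default=1)  # documented default
    return parser
```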

Notes

  • Phase 1 currently uses deterministic GitHub Search traversal rather than randomized sampling.
  • Phase 2 stores one normalized commit per JSONL line.
  • Resume mode in phase 2 skips repositories already marked complete.
  • Truncated repositories are marked as success_with_warning when the page cap is reached.

About

This project builds a pipeline for extracting GitHub data such as repository metadata and commit history.
