HBrahmbhatt/github_extractor

GitHub Data Pipeline

This project builds a two-phase GitHub data pipeline:

  • Phase 1 samples repositories from GitHub Search and stores repository metadata.
  • Phase 2 reads the saved repositories, refreshes repository metadata, and downloads commit history from the current default branch.

The project uses Poetry for dependency management and a src package layout for Python modules.

Project Layout

  • src/github_datapipe/core/: shared config, GitHub API, runtime, and file IO helpers.
  • src/github_datapipe/phases/phase1_repository_sampling/: Phase 1 repository discovery and persistence.
  • src/github_datapipe/phases/phase2_commit_ingestion/: Phase 2 commit fetching and persistence.
  • tests/: automated tests for the pipeline behavior.

Prerequisites

  • Python 3.12 installed locally
  • Poetry installed locally
  • GitHub personal access token stored in .env as:
github_token=YOUR_TOKEN_HERE

Install Dependencies

From the project root, install the local environment with:

poetry install

Run Phase 1

By default, the sample-repos command collects 10 repositories using the search query defined in src/github_datapipe/core/config.py.

poetry run github-datapipe sample-repos

Useful overrides:

poetry run github-datapipe sample-repos --count 25
poetry run github-datapipe sample-repos --count 25 --query "is:public stars:>50 size:>5000 archived:false"
poetry run github-datapipe sample-repos --mode fresh
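
Under the hood, --count and --query amount to paging through the GitHub Search API. The sketch below only plans the request sequence; the endpoint is the documented GitHub REST search URL, but the helper name and return shape are illustrative, not the project's actual code:

```python
def build_search_requests(query: str, count: int, per_page: int = 100) -> list[dict]:
    """Plan the sequence of search-API page requests needed for `count` repos.

    Illustrative sketch; does not perform any network calls.
    """
    pages = -(-count // per_page)  # ceiling division
    return [
        {
            "url": "https://api.github.com/search/repositories",
            "params": {"q": query, "per_page": min(per_page, count), "page": page},
        }
        for page in range(1, pages + 1)
    ]
```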

Phase 1 Outputs

After phase 1 completes, check:

  • runs/<run_id>/phase1_repository_sampling/repos.jsonl
  • runs/<run_id>/phase1_repository_sampling/manifest.json
  • runs/seen_repo_ids.json

The command prints the generated run_id, which you will use for phase 2.

Run Phase 2

Phase 2 consumes the saved phase 1 repositories and downloads commit history from each repository's current default branch.

Run against a phase 1 run:

poetry run github-datapipe fetch-commits --run-id <run_id>

Useful overrides:

poetry run github-datapipe fetch-commits --run-id <run_id> --mode resume
poetry run github-datapipe fetch-commits --run-id <run_id> --max-pages-per-repo 3
poetry run github-datapipe fetch-commits --run-id <run_id> --per-page 50

Phase 2 Defaults

  • mode=refresh
  • per_page=100
  • max_pages_per_repo=1

The default max_pages_per_repo=1 keeps the prototype small and limits commit downloads to the first page for each repository.
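
The page cap described above can be sketched as a bounded pagination loop. Here fetch_page stands in for the real GitHub commits API call, and the truncation flag mirrors the success_with_warning status mentioned under Notes; none of this is the project's actual code:

```python
def fetch_commit_pages(fetch_page, per_page: int = 100, max_pages_per_repo: int = 1):
    """Fetch at most `max_pages_per_repo` pages of commits (illustrative sketch).

    Returns (commits, truncated): truncated is True when the page cap was hit
    while the last page was still full, i.e. more commits may remain.
    """
    commits, truncated = [], False
    for page in range(1, max_pages_per_repo + 1):
        batch = fetch_page(page=page, per_page=per_page)
        commits.extend(batch)
        if len(batch) < per_page:  # short page: repository history exhausted
            break
    else:
        truncated = True  # cap reached with the last page full
    return commits, truncated
```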

Phase 2 Outputs

After phase 2 completes, check:

  • runs/<run_id>/phase2_commit_ingestion/commits/commits-0001.jsonl
  • runs/<run_id>/phase2_commit_ingestion/repo_status.jsonl
  • runs/<run_id>/phase2_commit_ingestion/manifest.json

Run Tests

Run the automated tests with:

poetry run pytest -q

Alternate CLI Invocation

If the Poetry script entrypoint is not available yet, run the CLI module directly:

poetry run python -m github_datapipe.cli sample-repos
poetry run python -m github_datapipe.cli fetch-commits --run-id <run_id>
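
For reference, the two subcommands and their documented flags map naturally onto an argparse parser. This is a hedged sketch, not the project's actual CLI code; defaults are filled in only where this README states them:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented commands and flags."""
    parser = argparse.ArgumentParser(prog="github-datapipe")
    sub = parser.add_subparsers(dest="command", required=True)

    sample = sub.add_parser("sample-repos")
    sample.add_argument("--count", type=int, default=10)  # documented default
    sample.add_argument("--query", default=None)          # falls back to config.py
    sample.add_argument("--mode", default=None)           # e.g. "fresh"; default unstated

    fetch = sub.add_parser("fetch-commits")
    fetch.add_argument("--run-id", required=True)
    fetch.add_argument("--mode", default="refresh")                  # documented default
    fetch.add_argument("--per-page", type=int, default=100)          # documented default
    fetch.add_argument("--max-pages-per-repo", type=int, default=1)  # documented default
    return parser
```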

Notes

  • Phase 1 currently uses deterministic GitHub Search traversal rather than randomized sampling.
  • Phase 2 stores one normalized commit per JSONL line.
  • Resume mode in phase 2 skips repositories already marked complete.
  • Truncated repositories are marked as success_with_warning when the page cap is reached.

About

This project builds a pipeline for extracting GitHub data such as repository metadata and commit history.
