orbital-stack

A weekly data pipeline that scrapes the UNOOSA Online Index of Space Objects, validates it against a pinned schema, persists it as a versioned parquet snapshot, and surfaces a semantic diff against the prior week.

Built as a portfolio project exploring data engineering discipline on a niche public dataset: ~25k registered space objects spanning six decades of launches, with known data-quality quirks that make honest schemas and drift detection a real problem rather than a textbook example.

🇪🇸 Read in Spanish


Quick start

git clone https://github.com/Sinnick4r/orbital-stack.git
cd orbital-stack
make setup                              # uv sync + pre-commit install
make test                               # 145 tests, expect all green

# Run the pipeline against live UNOOSA (takes ~30 minutes)
uv run python -m pipelines.flows.ingest_flow --snapshot-date $(date -u +%F)

Requires Python 3.11+ and uv.

What's inside

  • src/orbital/ingest/unoosa.py - Paginated UNOOSA scraper with tenacity retries, configurable via YAML, typed Polars output.
  • src/orbital/quality/schemas.py - Pandera schema for UNOOSA raw snapshots. Strict mode: upstream column additions break the pipeline.
  • src/orbital/quality/expectations.py - Six empirical drift checks that run after schema validation and emit structured warnings without failing the pipeline: launch year range, COSPAR format/year coherence, XXXX placeholder tracking, whitespace residual, State of Registry outliers, and ±5% cardinality drift.
  • src/orbital/utils/io.py - Atomic parquet writer with hive partitioning (snapshot_date=YYYY-MM-DD), zstd compression, and overwrite protection.
  • src/orbital/transform/diff.py - DuckDB-backed semantic diff between two snapshots. Reports added rows, removed rows, and per-column modifications in tidy long form.
  • pipelines/flows/ingest_flow.py - CLI entry point orchestrating scrape → validate → expectations → save → diff.
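To make the drift checks above more concrete, here is a minimal sketch of two of them (COSPAR format/year coherence and ±5% cardinality drift) in plain standard-library Python. It is not the code in expectations.py: the function names, the `cospar_id` and `launch_year` field names, and the warning dictionaries are all hypothetical, and the regex assumes the common `YYYY-NNNA` designator shape.

```python
import re
from typing import Optional

# Assumed designator shape: launch year, 3-digit launch number, 1-3 piece letters.
COSPAR_RE = re.compile(r"^(?P<year>\d{4})-(?P<launch>\d{3})(?P<piece>[A-Z]{1,3})$")


def cospar_warnings(rows: list) -> list:
    """Return one structured warning per row; never raises on bad data."""
    warnings = []
    for row in rows:
        # "cospar_id" and "launch_year" are hypothetical field names.
        match = COSPAR_RE.match(row["cospar_id"])
        if match is None:
            warnings.append({"check": "cospar_format", "cospar_id": row["cospar_id"]})
        elif int(match["year"]) != row["launch_year"]:
            warnings.append({"check": "cospar_year_coherence", "cospar_id": row["cospar_id"]})
    return warnings


def cardinality_drift(previous_rows: int, current_rows: int,
                      tolerance: float = 0.05) -> Optional[dict]:
    """Warn when the row count moves by more than the tolerance (default 5%)."""
    if previous_rows <= 0:
        return None  # no usable baseline, so nothing to compare against
    change = (current_rows - previous_rows) / previous_rows
    if abs(change) > tolerance:
        return {"check": "cardinality_drift", "relative_change": round(change, 4)}
    return None
```

Both functions return data instead of raising, which matches the "warn without failing the pipeline" behaviour described above; how the real checks report their findings (for example through structlog) is not shown here.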

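The semantic diff can be pictured as a keyed comparison of two snapshots that emits one row per change. The sketch below uses plain Python dictionaries rather than DuckDB SQL, so it illustrates the shape of the output and not the actual implementation in diff.py. The `semantic_diff` name, the `key` parameter, and the output field names are assumptions.

```python
def semantic_diff(old_rows: list, new_rows: list, key: str) -> list:
    """Compare two snapshots by key and return changes in tidy long form.

    Each output row describes exactly one change: an added key, a removed
    key, or a single modified column on a key present in both snapshots.
    """
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    changes = []

    # Keys present only in the new snapshot.
    for k in sorted(new.keys() - old.keys()):
        changes.append({"key": k, "change": "added",
                        "column": None, "old": None, "new": None})

    # Keys present only in the old snapshot.
    for k in sorted(old.keys() - new.keys()):
        changes.append({"key": k, "change": "removed",
                        "column": None, "old": None, "new": None})

    # Keys present in both: one output row per column whose value differs.
    for k in sorted(old.keys() & new.keys()):
        columns = (old[k].keys() | new[k].keys()) - {key}
        for column in sorted(columns):
            before, after = old[k].get(column), new[k].get(column)
            if before != after:
                changes.append({"key": k, "change": "modified",
                                "column": column, "old": before, "new": after})
    return changes
```

Long form keeps every change in the same five fields, so a week with many edits to one object and a week with many new objects produce the same table shape, which is easier to filter and aggregate than a wide before/after layout.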
Tech stack

uv · Polars · DuckDB · Pandera · structlog · tenacity · Pydantic · DVC (storage only, Backblaze B2) · pytest · ruff · mypy strict

See ADR-002 for the rationale behind each choice, and ADR-003 for why the pipeline runs as plain Python without an orchestration server.

Current status

v0.1.1 (April 2026) - Phase 1 complete: weekly OrbitWatch pipeline running end-to-end in CI. 145 tests, 96% coverage. Snapshots versioned on Backblaze B2 via DVC. First automated weekly run opened PR #1 with 24,866 rows scraped and pushed. See CHANGELOG.md.

Roadmap:

  • v0.5.0 - Phase 2: canonical cross-source dataset reconciling UNOOSA with Celestrak and Space-Track.
  • v1.0.0 - Phase 3: Evidence.dev dashboard ("el tratado silencioso").

Documentation

License

MIT. See LICENSE.
