markmcmurray/osv-data-comparison


osv-compare

A Go CLI for surfacing notable, specific upstream changes in the OSV vulnerability database. It pulls the last N days of per-ecosystem snapshots from the public gs://osv-vulnerabilities bucket via GCS object versioning, then reports schema drift, JSON path-set diffs, volume changes, and record churn between days.

It exists to make upstream changes (new ecosystems, new fields, schema-version bumps, volume jumps, churn spikes) easy to spot and attribute to a specific date and ecosystem.


How it works

OSV publishes per-ecosystem zips at gs://osv-vulnerabilities/<Ecosystem>/all.zip. The bucket has object versioning enabled, and OSV republishes each zip roughly every 15 minutes. That means the noncurrent-version history is a fine-grained time-series of the feed.
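With a new version roughly every 15 minutes, a day-level comparison needs one representative version per UTC day. The following is a minimal sketch of that selection, not the tool's actual code: the Version type, pickDaily, and mustTime are hypothetical names, and the window is assumed to be half-open ([start, end)).

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Version is a hypothetical stand-in for one object version's metadata:
// the GCS generation number and the time that version was written.
type Version struct {
	Generation int64
	Updated    time.Time
}

// pickDaily returns, for each UTC day in [start, end), the latest version
// written on that day. Days with no versions are absent from the map.
func pickDaily(versions []Version, start, end time.Time) map[string]Version {
	picked := make(map[string]Version)
	for _, v := range versions {
		u := v.Updated.UTC()
		if u.Before(start) || !u.Before(end) {
			continue // outside the window
		}
		day := u.Format("2006-01-02")
		if cur, ok := picked[day]; !ok || u.After(cur.Updated) {
			picked[day] = v
		}
	}
	return picked
}

// mustTime parses an RFC 3339 timestamp and panics on malformed input.
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	versions := []Version{
		{Generation: 1, Updated: mustTime("2026-04-23T00:05:00Z")},
		{Generation: 2, Updated: mustTime("2026-04-23T23:50:00Z")}, // latest on 04-23
		{Generation: 3, Updated: mustTime("2026-04-25T12:00:00Z")}, // 04-24 is a gap day
		{Generation: 4, Updated: mustTime("2026-04-20T12:00:00Z")}, // before the window
	}
	picked := pickDaily(versions,
		mustTime("2026-04-23T00:00:00Z"), mustTime("2026-04-26T00:00:00Z"))

	days := make([]string, 0, len(picked))
	for d := range picked {
		days = append(days, d)
	}
	sort.Strings(days)
	for _, d := range days {
		fmt.Printf("%s -> generation %d\n", d, picked[d].Generation)
	}
}
```

In this sketch a gap day is simply missing from the result, so a caller would diff the two nearest days that do have a snapshot.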

osv-compare walks that history:

  1. Discover ecosystems: list top-level prefixes in the bucket (~105 ecosystems including PyPI, npm, Debian:12, Alpine:v3.20, ...) or use --ecosystem to restrict.
  2. List versions: for each ecosystem's all.zip, list noncurrent versions and pick the latest version per UTC day in the window.
  3. Download: pull each picked version (pinned by generation) into cache/<ecosystem>/<YYYY-MM-DD>.zip. Cached zips are reused on re-run unless --no-cache is passed. Downloads run with bounded concurrency (--concurrency, default 8).
  4. Analyse: stream every JSON record out of each zip (no decompression to disk) and accumulate, per ecosystem-day:
    • Record count: how big this snapshot is.
    • schema_version distribution: coarse OSV spec drift (e.g. 1.6.0=42, 1.7.0=18).
    • JSON path set: every distinct path observed across all records, e.g. affected[].ranges[].events[].introduced. Captures field shape, not values.
    • Per-record content hash (canonical-JSON SHA-256, keyed by id): for churn detection.
  5. Diff: between consecutive days, report which paths appeared or disappeared, which records were added, removed, or changed, and the overall churn %.
  6. Report: write out/report.md (markdown with a "Top suspects" section + per-ecosystem tables) and out/diffs/<Ecosystem>.json (full machine-readable diffs, ID lists capped at 1000 each).
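The JSON path set from step 4 can be illustrated with a short recursive walk. This is a sketch under assumptions rather than the code in internal/analyze: the function names are illustrative, and it assumes intermediate objects and arrays are recorded alongside leaves, with every array element collapsed to "[]".

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// walkPaths records the path of every node under v into set. Object keys are
// joined with "." and array elements collapse to "[]", so the result
// describes field shape rather than values or array lengths.
func walkPaths(prefix string, v any, set map[string]struct{}) {
	if prefix != "" {
		set[prefix] = struct{}{}
	}
	switch node := v.(type) {
	case map[string]any:
		for key, child := range node {
			p := key
			if prefix != "" {
				p = prefix + "." + key
			}
			walkPaths(p, child, set)
		}
	case []any:
		for _, child := range node {
			walkPaths(prefix+"[]", child, set)
		}
	}
}

// pathSet decodes one JSON record and returns its distinct paths, sorted.
func pathSet(record []byte) ([]string, error) {
	var v any
	if err := json.Unmarshal(record, &v); err != nil {
		return nil, err
	}
	set := make(map[string]struct{})
	walkPaths("", v, set)
	paths := make([]string, 0, len(set))
	for p := range set {
		paths = append(paths, p)
	}
	sort.Strings(paths)
	return paths, nil
}

func main() {
	// A made-up record, trimmed to the fields needed for the example.
	record := []byte(`{"id":"X-1","affected":[{"ranges":[{"events":[{"introduced":"0"},{"fixed":"1.2"}]}]}]}`)
	paths, err := pathSet(record)
	if err != nil {
		panic(err)
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```

Because the two events objects collapse into the same events[] prefix, the set contains both events[].introduced and events[].fixed no matter how many events a record carries. Taking the union of these sets across all records in a snapshot gives the per-day path set that consecutive days are diffed on.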

What "Top suspects" flags

The markdown report leads with ecosystems that triggered any of:

  • Volume jump: record count changed >20% day-over-day.
  • New paths: a JSON path appeared that wasn't there the previous day.
  • New schema_version: a new value in the schema_version field.
  • High churn: >5% of records added/removed/changed in a single day.
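The four triggers above amount to a handful of threshold checks. The sketch below shows one way to express them; DayDelta and suspectFlags are hypothetical names, and it assumes both percentages are measured against the previous day's record count, which the README does not specify.

```go
package main

import (
	"fmt"
	"math"
)

// DayDelta is a hypothetical summary of the change between two consecutive
// daily snapshots of one ecosystem.
type DayDelta struct {
	PrevCount, CurCount     int      // record counts on the previous and current day
	Added, Removed, Changed int      // record-level churn, by id
	NewPaths                []string // JSON paths not seen the previous day
	NewSchemaVersions       []string // schema_version values not seen the previous day
}

const (
	volumeJumpThreshold = 0.20 // >20% day-over-day change in record count
	highChurnThreshold  = 0.05 // >5% of records added, removed, or changed
)

// suspectFlags returns the names of the triggers that fire for d.
// Assumption: both ratios use the previous day's record count as the base.
func suspectFlags(d DayDelta) []string {
	var flags []string
	base := float64(d.PrevCount)
	if d.PrevCount > 0 &&
		math.Abs(float64(d.CurCount-d.PrevCount))/base > volumeJumpThreshold {
		flags = append(flags, "volume jump")
	}
	if len(d.NewPaths) > 0 {
		flags = append(flags, "new paths")
	}
	if len(d.NewSchemaVersions) > 0 {
		flags = append(flags, "new schema_version")
	}
	if d.PrevCount > 0 &&
		float64(d.Added+d.Removed+d.Changed)/base > highChurnThreshold {
		flags = append(flags, "high churn")
	}
	return flags
}

func main() {
	// Illustrative numbers only: 1% volume change, 8% churn, one new path.
	d := DayDelta{
		PrevCount: 1000, CurCount: 1010,
		Added: 40, Removed: 30, Changed: 10,
		NewPaths: []string{"affected[].severity[]"},
	}
	fmt.Println(suspectFlags(d))
}
```

The example shows why churn is tracked separately from volume: additions and removals can nearly cancel out in the record count while still touching a large share of the records.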

Layout

cmd/osv-compare/main.go        cobra CLI, orchestration
internal/gcs/                  versioned listing, daily-version picker, downloads
internal/snapshot/             zip → JSON record iterator, canonical hashing
internal/analyze/              DaySummary accumulator + Diff
internal/report/               markdown + per-ecosystem JSON writers

Setup

Prerequisites

  • Go 1.22+ (built/tested with 1.26)
  • gcloud CLI: install via brew install --cask google-cloud-sdk or the docs
  • A Google account that can authenticate. Any account works: the bucket is public, but listing versioned objects requires an authenticated identity.

One-time auth

gcloud auth application-default login

This opens a browser, signs you in, and writes Application Default Credentials to ~/.config/gcloud/application_default_credentials.json. The CLI picks them up automatically.

Build

git clone <this-repo> && cd osv-data-comparison
go build -o osv-compare ./cmd/osv-compare

(Or run directly: go run ./cmd/osv-compare ...)


Usage

Default: last 7 days, all ecosystems

./osv-compare

This downloads ~7 × 105 ≈ 735 zips (most are small per-ecosystem subsets; a few hundred MB in total) and writes the report to ./out/report.md. The first run takes a few minutes; subsequent runs reuse the cache and complete in seconds.

Restrict to specific ecosystems

./osv-compare --ecosystem PyPI --ecosystem npm --days 14

All flags

Flag               Default   Purpose
--days N           7         How many UTC days back to pull. Must be ≥ 2 to compute diffs.
--ecosystem E      (all)     Restrict to one ecosystem. Repeatable.
--concurrency N    8         Parallel downloads.
--cache-dir PATH   ./cache   Where downloaded zips live. Safe to delete; will re-download.
--out-dir PATH     ./out     Where report.md and diffs/ are written.
--no-cache         false     Force re-download even if cached zips exist.
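For readers curious what --concurrency controls, bounded parallelism in Go is commonly built from a buffered channel used as a semaphore. The sketch below shows that general pattern with hypothetical names (runBounded, peakInFlight); it is not taken from the tool's source, and the real downloader may be structured differently.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runBounded calls work(i) for every i in [0, n) with at most limit calls in
// flight at once, and returns the error (or nil) produced for each index.
func runBounded(n, limit int, work func(i int) error) []error {
	if limit < 1 {
		limit = 1 // a zero-capacity semaphore would block forever
	}
	errs := make([]error, n)
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit tasks are already running
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			errs[i] = work(i)        // each goroutine writes its own index
		}(i)
	}
	wg.Wait()
	return errs
}

// peakInFlight runs n short dummy tasks through runBounded and reports the
// highest number that were observed running at the same time.
func peakInFlight(n, limit int) int {
	var mu sync.Mutex
	cur, peak := 0, 0
	runBounded(n, limit, func(int) error {
		mu.Lock()
		cur++
		if cur > peak {
			peak = cur
		}
		mu.Unlock()
		time.Sleep(time.Millisecond) // stand-in for a download
		mu.Lock()
		cur--
		mu.Unlock()
		return nil
	})
	return peak
}

func main() {
	fmt.Println("peak in flight:", peakInFlight(50, 8))
}
```

Collecting one error per task, rather than aborting on the first failure, is a design choice made for this sketch; it lets a single failed download be reported without discarding the others.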

Output

After a successful run:

out/
  report.md                    # human-facing triage report
  diffs/
    PyPI.json                  # full per-day path-sets and ID-level churn
    npm.json
    ...
cache/
  PyPI/
    2026-04-23.zip             # reused on subsequent runs
    2026-04-24.zip
    ...

Open out/report.md first. The "Top suspects" section at the top is the triage starting point. Drill into per-ecosystem tables and JSON diffs from there.


Tests

go test ./...

Covers:

  • gcs.PickDailyVersions: daily-version picker with gap days, out-of-window versions, multiple-per-day.
  • snapshot.Iterate: record streaming + ignoring non-JSON entries.
  • snapshot.RecordHash: stability under key reordering.
  • analyze.walkPaths: path emission for nested objects, arrays, leaves.
  • analyze.Diff: added/removed/changed IDs, added/removed paths, churn %.
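To make the RecordHash and Diff properties concrete, here is a minimal sketch of a key-order-independent content hash and an ID-level diff. It assumes "canonical JSON" can be approximated by decoding and re-encoding with encoding/json, which writes map keys in sorted order; recordHash and diffHashes are illustrative names, not the package's actual API.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
)

// recordHash returns the hex SHA-256 of a canonical re-encoding of raw.
// Re-marshalling a decoded value sorts object keys and drops insignificant
// whitespace; UseNumber keeps numeric literals from being rounded via float64.
func recordHash(raw []byte) (string, error) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.UseNumber()
	var v any
	if err := dec.Decode(&v); err != nil {
		return "", err
	}
	canonical, err := json.Marshal(v)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(canonical)
	return hex.EncodeToString(sum[:]), nil
}

// diffHashes compares two id -> hash maps from consecutive days and returns
// the sorted IDs that were added, removed, or changed.
func diffHashes(prev, cur map[string]string) (added, removed, changed []string) {
	for id, h := range cur {
		old, ok := prev[id]
		switch {
		case !ok:
			added = append(added, id)
		case old != h:
			changed = append(changed, id)
		}
	}
	for id := range prev {
		if _, ok := cur[id]; !ok {
			removed = append(removed, id)
		}
	}
	sort.Strings(added)
	sort.Strings(removed)
	sort.Strings(changed)
	return added, removed, changed
}

func main() {
	// Same made-up record with different key order and whitespace.
	a, _ := recordHash([]byte(`{"id":"A-1","summary":"x"}`))
	b, _ := recordHash([]byte(`{ "summary": "x", "id": "A-1" }`))
	fmt.Println("stable under reordering:", a == b)

	edited, _ := recordHash([]byte(`{"id":"A-1","summary":"y"}`))
	prev := map[string]string{"A-1": a, "B-2": a}
	cur := map[string]string{"A-1": edited, "C-3": a}
	added, removed, changed := diffHashes(prev, cur)
	fmt.Println(added, removed, changed)
}
```

Array order is preserved by this encoding, so reordering the elements of a list still counts as a change; whether the real hash treats arrays the same way is not stated here.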

End-to-end smoke test (real GCS, free reads):

./osv-compare --days 2 --ecosystem PyPI --ecosystem npm

Then check that out/report.md shows two date rows for both ecosystems with non-zero record counts, and that re-running the same command logs 0 new, N from cache.


Limitations / out of scope

  • Does not compare against the OSV schema spec itself. Drift is inferred from observed fields in published records.
  • Not a monitoring system; re-run on demand. (See --days for backfill.)
  • Anonymous GCS auth is not supported by default; if you need to run unattended in CI, use a service account JSON via GOOGLE_APPLICATION_CREDENTIALS.

About

Small Go binary to pull and analyse OSV data
