A Go CLI for surfacing notable, specific upstream changes in the OSV vulnerability database. It pulls the last N days of per-ecosystem snapshots from the public gs://osv-vulnerabilities bucket via GCS object versioning, then reports schema drift, JSON path-set diffs, volume changes, and record churn between days.
It exists to make upstream changes (new ecosystems, new fields, schema-version bumps, volume jumps, churn spikes) easy to spot and attribute to a specific date and ecosystem.
OSV publishes per-ecosystem zips at gs://osv-vulnerabilities/<Ecosystem>/all.zip. The bucket has object versioning enabled, and OSV republishes each zip roughly every 15 minutes. That means the noncurrent-version history is a fine-grained time-series of the feed.
osv-compare walks that history:
- Discover ecosystems: list top-level prefixes in the bucket (~105 ecosystems including `PyPI`, `npm`, `Debian:12`, `Alpine:v3.20`, ...) or use `--ecosystem` to restrict.
- List versions: for each ecosystem's `all.zip`, list noncurrent versions and pick the latest version per UTC day in the window.
- Download: pull each picked version (pinned by generation) into `cache/<ecosystem>/<YYYY-MM-DD>.zip`. Cached zips are reused on re-run unless `--no-cache` is passed. Downloads run with bounded concurrency (`--concurrency`, default 8).
- Analyse: stream every JSON record out of each zip (no decompression to disk) and accumulate, per ecosystem-day:
  - Record count: how big this snapshot is.
  - `schema_version` distribution: coarse OSV spec drift (e.g. `1.6.0=42, 1.7.0=18`).
  - JSON path set: every distinct path observed across all records, e.g. `affected[].ranges[].events[].introduced`. Captures field shape, not values.
  - Per-record content hash (canonical-JSON SHA-256, keyed by `id`): for churn detection.
- Diff: between consecutive days: which paths appeared/disappeared, which records were added/removed/changed, and overall churn %.
- Report: write `out/report.md` (markdown with a "Top suspects" section + per-ecosystem tables) and `out/diffs/<Ecosystem>.json` (full machine-readable diffs, ID lists capped at 1000 each).
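The JSON path set from the Analyse step can be sketched as a recursive walk over decoded JSON. This is an illustrative approximation, not the project's `analyze.walkPaths`: the names `collectPaths` and `pathSet` are hypothetical, and the choice to emit nothing for empty objects and arrays is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// collectPaths records the shape of a decoded JSON value into set.
// Object keys are joined with "." and each array adds a "[]" segment,
// so values and array lengths never affect the resulting paths.
// Empty objects and arrays emit nothing in this sketch.
func collectPaths(v any, prefix string, set map[string]struct{}) {
	switch t := v.(type) {
	case map[string]any:
		for k, child := range t {
			p := k
			if prefix != "" {
				p = prefix + "." + k
			}
			collectPaths(child, p, set)
		}
	case []any:
		for _, child := range t {
			collectPaths(child, prefix+"[]", set)
		}
	default:
		// Leaf (string, number, bool, null): record the path only.
		if prefix != "" {
			set[prefix] = struct{}{}
		}
	}
}

// pathSet returns the sorted, de-duplicated paths of one JSON record.
func pathSet(raw []byte) ([]string, error) {
	var v any
	if err := json.Unmarshal(raw, &v); err != nil {
		return nil, err
	}
	set := make(map[string]struct{})
	collectPaths(v, "", set)
	paths := make([]string, 0, len(set))
	for p := range set {
		paths = append(paths, p)
	}
	sort.Strings(paths)
	return paths, nil
}

func main() {
	record := []byte(`{
		"id": "EXAMPLE-1",
		"schema_version": "1.6.0",
		"affected": [{"ranges": [{"type": "SEMVER",
			"events": [{"introduced": "0"}, {"fixed": "1.2.3"}]}]}]
	}`)
	paths, err := pathSet(record)
	if err != nil {
		panic(err)
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```

For the sample record this prints five paths, including `affected[].ranges[].events[].introduced`. Merging the per-record sets across a whole zip gives the per-day path set, and set difference between two days gives the appeared/disappeared lists.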
The markdown report leads with ecosystems that triggered any of:
- Volume jump: record count changed >20% day-over-day.
- New paths: a JSON path appeared that wasn't there the previous day.
- New schema_version: a new value in the `schema_version` field.
- High churn: >5% of records added/removed/changed in a single day.
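As a rough illustration of how these four conditions could be evaluated for one pair of consecutive days, the sketch below uses a hypothetical `DayStats` type and `triggers` function. Measuring both percentages against the previous day's record count is an assumption; the tool's actual denominators are not stated here.

```go
package main

import "fmt"

// DayStats is a hypothetical, trimmed-down summary of one ecosystem-day.
type DayStats struct {
	Hashes         map[string]string   // record id -> content hash
	Paths          map[string]struct{} // observed JSON paths
	SchemaVersions map[string]int      // schema_version -> record count
}

// pctOf returns part as a percentage of whole, or 0 when whole is 0.
func pctOf(part, whole int) float64 {
	if whole == 0 {
		return 0
	}
	return 100 * float64(part) / float64(whole)
}

// churnPercent counts records added, removed, or changed between two days,
// relative to the previous day's record count (an assumed denominator).
func churnPercent(prev, curr DayStats) float64 {
	touched := 0
	for id, h := range curr.Hashes {
		if old, ok := prev.Hashes[id]; !ok || old != h {
			touched++ // added or changed
		}
	}
	for id := range prev.Hashes {
		if _, ok := curr.Hashes[id]; !ok {
			touched++ // removed
		}
	}
	return pctOf(touched, len(prev.Hashes))
}

// triggers lists which of the four report conditions fire for curr vs prev.
func triggers(prev, curr DayStats) []string {
	var out []string
	delta := len(curr.Hashes) - len(prev.Hashes)
	if delta < 0 {
		delta = -delta
	}
	if pctOf(delta, len(prev.Hashes)) > 20 {
		out = append(out, "volume jump")
	}
	for p := range curr.Paths {
		if _, ok := prev.Paths[p]; !ok {
			out = append(out, "new paths")
			break
		}
	}
	for v := range curr.SchemaVersions {
		if _, ok := prev.SchemaVersions[v]; !ok {
			out = append(out, "new schema_version")
			break
		}
	}
	if churnPercent(prev, curr) > 5 {
		out = append(out, "high churn")
	}
	return out
}

func main() {
	prev := DayStats{
		Hashes:         map[string]string{"A": "h1", "B": "h2", "C": "h3", "D": "h4"},
		Paths:          map[string]struct{}{"id": {}},
		SchemaVersions: map[string]int{"1.6.0": 4},
	}
	curr := DayStats{
		Hashes:         map[string]string{"A": "h1", "B": "h2-changed", "C": "h3", "D": "h4"},
		Paths:          map[string]struct{}{"id": {}, "related": {}},
		SchemaVersions: map[string]int{"1.6.0": 3, "1.7.0": 1},
	}
	fmt.Println(triggers(prev, curr))
}
```

In the sample data the record count is unchanged, so no volume jump fires, but one of four records changed (25% churn), a new path appeared, and a new `schema_version` value showed up.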
cmd/osv-compare/main.go cobra CLI, orchestration
internal/gcs/ versioned listing, daily-version picker, downloads
internal/snapshot/ zip → JSON record iterator, canonical hashing
internal/analyze/ DaySummary accumulator + Diff
internal/report/ markdown + per-ecosystem JSON writers
- Go 1.22+ (built/tested with 1.26)
- gcloud CLI: install via `brew install --cask google-cloud-sdk` or the docs
- A Google account that can authenticate (any account works; the bucket is public, but listing versioned objects requires an authenticated identity).
gcloud auth application-default login

This opens a browser, signs you in, and writes Application Default Credentials to ~/.config/gcloud/application_default_credentials.json. The CLI picks them up automatically.
git clone <this-repo> && cd osv-data-compare
go build -o osv-compare ./cmd/osv-compare

(Or run directly: `go run ./cmd/osv-compare ...`)
./osv-compare

This downloads ~7 × 105 ≈ 735 zips (most are small per-ecosystem subsets, a few hundred MB total) and writes the report to ./out/report.md. First run takes a few minutes; subsequent runs reuse the cache and complete in seconds.
./osv-compare --ecosystem PyPI --ecosystem npm --days 14

| Flag | Default | Purpose |
|---|---|---|
| `--days N` | `7` | How many UTC days back to pull. Must be ≥ 2 to compute diffs. |
| `--ecosystem E` | (all) | Restrict to one ecosystem. Repeatable. |
| `--concurrency N` | `8` | Parallel downloads. |
| `--cache-dir PATH` | `./cache` | Where downloaded zips live. Safe to delete; will re-download. |
| `--out-dir PATH` | `./out` | Where `report.md` and `diffs/` are written. |
| `--no-cache` | `false` | Force re-download even if cached zips exist. |
After a successful run:
out/
  report.md          # human-facing triage report
  diffs/
    PyPI.json        # full per-day path-sets and ID-level churn
    npm.json
    ...
cache/
  PyPI/
    2026-04-23.zip   # reused on subsequent runs
    2026-04-24.zip
    ...
Open out/report.md first. The "Top suspects" section at the top is the triage starting point. Drill into per-ecosystem tables and JSON diffs from there.
go test ./...

Covers:

- `gcs.PickDailyVersions`: daily-version picker with gap days, out-of-window versions, multiple-per-day.
- `snapshot.Iterate`: record streaming + ignoring non-JSON entries.
- `snapshot.RecordHash`: stability under key reordering.
- `analyze.walkPaths`: path emission for nested objects, arrays, leaves.
- `analyze.Diff`: added/removed/changed IDs, added/removed paths, churn %.
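The "stability under key reordering" property is the one most worth seeing in code. One way to get it in Go, sketched below, relies on `encoding/json` marshalling map keys in sorted order, so a decode/re-encode round trip yields a canonical byte form before hashing. The function name `recordHash` is hypothetical, and whether the project's `snapshot.RecordHash` works this way is an assumption.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// recordHash returns a hex SHA-256 over a canonical re-encoding of raw.
// Decoding into generic maps and re-marshalling sorts object keys and drops
// insignificant whitespace, so key order in the input does not matter.
func recordHash(raw []byte) (string, error) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.UseNumber() // keep numbers as written instead of converting to float64
	var v any
	if err := dec.Decode(&v); err != nil {
		return "", err
	}
	canon, err := json.Marshal(v)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(canon)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	a := []byte(`{"id": "EXAMPLE-1", "modified": "2026-04-23T00:00:00Z"}`)
	b := []byte(`{"modified": "2026-04-23T00:00:00Z", "id": "EXAMPLE-1"}`)
	c := []byte(`{"id": "EXAMPLE-1", "modified": "2026-04-24T00:00:00Z"}`)

	ha, err := recordHash(a)
	if err != nil {
		panic(err)
	}
	hb, err := recordHash(b)
	if err != nil {
		panic(err)
	}
	hc, err := recordHash(c)
	if err != nil {
		panic(err)
	}

	fmt.Println("reordered keys match:", ha == hb)
	fmt.Println("changed value matches:", ha == hc)
}
```

Array order is preserved by this approach, so reordering list elements still counts as a change.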
End-to-end smoke test (real GCS, free reads):
./osv-compare --days 2 --ecosystem PyPI --ecosystem npm

Then check that out/report.md shows two date rows for both ecosystems with non-zero record counts, and that re-running the same command logs `0 new, N from cache`.
- Does not compare against the OSV schema spec itself. Drift is inferred from observed fields in published records.
- Not a monitoring system; re-run on demand. (See `--days` for backfill.)
- Anonymous GCS auth is not supported by default; if you need to run unattended in CI, use a service account JSON via `GOOGLE_APPLICATION_CREDENTIALS`.