Home
Wayback Machine Downloader is a Python CLI for downloading archived websites from the Internet Archive Wayback Machine and rewriting them for local browsing.
It is designed for:
- digital preservation
- recovering or mirroring defunct sites
- offline browsing of archived captures
- historical web analysis and OSINT workflows
- repeatable, resumable archive downloads
Key features:
- downloads the latest capture of each logical file by default
- can keep every archived timestamp for a target
- can build a best-effort composite snapshot as of a chosen date
- rewrites downloaded HTML, CSS, JS, and archived absolute links for local use
- resumes interrupted runs using local state files
- discovers linked page assets after HTML downloads
- can recursively mirror subdomains into a local `subdomains/` tree
- ships with GitHub Actions for CI, build, TestPyPI, and PyPI publishing
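
To illustrate the rewriting of archived absolute links, here is a minimal sketch of one plausible first step: recovering the original URL from a Wayback Machine link. The `https://web.archive.org/web/<timestamp>/<original URL>` layout is the archive's public URL format; the regular expression and the `strip_wayback_prefix` name are assumptions made for this example and are not taken from the project's code.

```python
import re

# Matches the Wayback Machine replay prefix: a timestamp of up to 14 digits,
# optionally followed by a two-letter modifier such as "im_", "cs_", or "js_".
WAYBACK_PREFIX = re.compile(
    r"https?://web\.archive\.org/web/\d{1,14}(?:[a-z]{2}_)?/"
)


def strip_wayback_prefix(text: str) -> str:
    """Hypothetical helper: replace archived absolute links with the original URLs."""
    return WAYBACK_PREFIX.sub("", text)


html = (
    '<a href="https://web.archive.org/web/20130101000000/http://example.com/a.html">a</a>'
    '<img src="http://web.archive.org/web/20130101000000im_/http://example.com/i.png">'
)
cleaned = strip_wayback_prefix(html)
print(cleaned)
```

A real rewriter would go further and map each recovered URL to a relative path inside the output tree; this sketch stops at the original URL.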
Wiki pages:
- User Guide
- Installation and Setup
- Quick Start
- CLI Reference
- Snapshot Modes and Filtering
- Local Rewriting
- Asset Discovery and Cross-Host Downloads
- Subdomain Mirroring
- Output Layout and State Files
- Troubleshooting
- FAQ
Download the latest capture of every logical file for a site:

    python -m wayback_downloader https://example.com

Preview the planned captures without downloading:

    python -m wayback_downloader --list https://example.com

Download all timestamps instead of only the newest capture:

    python -m wayback_downloader --all-timestamps https://example.com

Build a site as it looked around a given time:

    python -m wayback_downloader --snapshot-at 20130101000000 https://example.com

Rewrite an existing download tree for offline browsing:

    python -m wayback_downloader --local-only ./websites/example.com

The downloader maps archived URLs to stable logical file IDs. Those IDs drive:
- the local output path
- resume tracking in `.downloaded.txt`
- duplicate detection
- local link rewriting
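
As a rough illustration of what a stable logical file ID could look like, the sketch below normalizes a URL so that captures of the same resource collapse to a single ID. The normalization rules (ignore scheme and port, lowercase the host, treat directory URLs as `index.html`) and the `logical_file_id` name are assumptions made for this example; the project's actual rules may differ.

```python
from urllib.parse import urlsplit


def logical_file_id(original_url: str) -> str:
    """Hypothetical normalization of an archived URL into a stable ID."""
    parts = urlsplit(original_url)
    host = (parts.hostname or "").lower()  # drops the port and any userinfo
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"  # assumed default document for directory URLs
    file_id = host + path
    if parts.query:
        file_id += "?" + parts.query  # keep the query so variants stay distinct
    return file_id


# The same page captured over http and https maps to one ID.
print(logical_file_id("http://Example.com:80/"))
print(logical_file_id("https://example.com"))
print(logical_file_id("https://example.com/a/b.css?v=2"))
```

Because an ID of this kind is independent of the capture timestamp, it can serve as the local path, as the resume key, and as the deduplication key.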
The CDX API returns raw (timestamp, original_url) rows. The downloader turns
them into planned Snapshot objects using the current filters and snapshot
mode.
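
The planning step can be sketched as follows. This is a simplified model, not the project's implementation: the `Snapshot` fields, the `plan_snapshots` function, and its parameters are hypothetical, rows are grouped by original URL rather than by logical file ID, and the "as of a date" mode is assumed to mean "newest capture at or before that timestamp".

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    timestamp: str      # 14-digit Wayback timestamp, e.g. "20130101000000"
    original_url: str


def plan_snapshots(
    rows: Iterable[Tuple[str, str]],
    all_timestamps: bool = False,
    snapshot_at: Optional[str] = None,
) -> List[Snapshot]:
    """Turn raw (timestamp, original_url) rows into planned snapshots."""
    ordered = sorted(rows)  # fixed-width timestamps sort chronologically as strings
    if snapshot_at is not None:
        # Assumed semantics: ignore anything captured after the chosen date.
        ordered = [row for row in ordered if row[0] <= snapshot_at]
    if all_timestamps:
        return [Snapshot(ts, url) for ts, url in ordered]
    latest = {}
    for ts, url in ordered:
        latest[url] = Snapshot(ts, url)  # later captures overwrite earlier ones
    return list(latest.values())


rows = [
    ("20120101000000", "http://example.com/"),
    ("20140101000000", "http://example.com/"),
    ("20130601000000", "http://example.com/a.css"),
]
newest = {s.original_url: s.timestamp for s in plan_snapshots(rows)}
as_of_2013 = {s.original_url: s.timestamp
              for s in plan_snapshots(rows, snapshot_at="20130101000000")}
everything = plan_snapshots(rows, all_timestamps=True)
```

In this model the three snapshot modes differ only in how rows are filtered before planning, so the download step can stay the same across all of them.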
Each output tree can keep:
- `.cdx.json` for cached CDX results
- `.downloaded.txt` for successful logical file IDs
Those files allow interrupted runs to continue without starting from scratch.
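
A minimal sketch of how resuming from `.downloaded.txt` might work is shown below. The one-ID-per-line file format and the helper names (`load_downloaded`, `mark_downloaded`, `pending`) are assumptions for illustration only; the source does not specify the file's layout.

```python
from pathlib import Path
from typing import Iterable, List, Set

STATE_FILE = ".downloaded.txt"  # file name from the docs; format below is assumed


def load_downloaded(output_dir: Path) -> Set[str]:
    """Read the logical file IDs recorded by earlier runs (one per line, assumed)."""
    state = output_dir / STATE_FILE
    if not state.exists():
        return set()
    lines = state.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def mark_downloaded(output_dir: Path, file_id: str) -> None:
    """Append an ID right after a successful download so a crash loses little work."""
    output_dir.mkdir(parents=True, exist_ok=True)
    with (output_dir / STATE_FILE).open("a", encoding="utf-8") as fh:
        fh.write(file_id + "\n")


def pending(planned_ids: Iterable[str], output_dir: Path) -> List[str]:
    """Return the planned IDs that no earlier run has completed."""
    done = load_downloaded(output_dir)
    return [file_id for file_id in planned_ids if file_id not in done]
```

Appending to the state file after each success, rather than rewriting it at the end of the run, means that an interrupted run still leaves an accurate record of what was completed.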
The project is published on PyPI as the distribution:

    wayback-machine-downloader

The import package remains:

    import wayback_downloader

Release automation is documented in Automation and Release.