Arie Joe edited this page Jun 3, 2026 · 3 revisions

Wayback Machine Downloader Wiki

Wayback Machine Downloader is a Python CLI for downloading archived websites from the Internet Archive Wayback Machine and rewriting them for local browsing.

It is designed for:

  • digital preservation
  • recovering or mirroring defunct sites
  • offline browsing of archived captures
  • historical web analysis and OSINT workflows
  • repeatable, resumable archive downloads

What It Does

  • downloads the latest capture of each logical file by default
  • can keep every archived timestamp for a target
  • can build a best-effort composite snapshot as of a chosen date
  • rewrites downloaded HTML, CSS, JS, and archived absolute links for local use
  • resumes interrupted runs using local state files
  • discovers linked page assets after HTML downloads
  • can recursively mirror subdomains into a local subdomains/ tree
  • ships with GitHub Actions for CI, build, TestPyPI, and PyPI publishing

Start Here

Typical Commands

Download the latest capture of every logical file for a site:

python -m wayback_downloader https://example.com

Preview the planned captures without downloading:

python -m wayback_downloader --list https://example.com

Download all timestamps instead of only the newest capture:

python -m wayback_downloader --all-timestamps https://example.com

Build a best-effort composite of the site as of a given timestamp (YYYYMMDDhhmmss):

python -m wayback_downloader --snapshot-at 20130101000000 https://example.com
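
The value passed to --snapshot-at above is a 14-digit timestamp in the YYYYMMDDhhmmss form that the Wayback Machine uses. As a small convenience sketch, a Python datetime can be turned into that form as follows. The helper name wayback_timestamp is hypothetical and is not part of the tool.

```python
from datetime import datetime, timezone


def wayback_timestamp(moment: datetime) -> str:
    """Format an aware datetime as a 14-digit YYYYMMDDhhmmss string in UTC.

    Hypothetical helper for illustration; not part of wayback_downloader.
    """
    return moment.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


stamp = wayback_timestamp(datetime(2013, 1, 1, tzinfo=timezone.utc))
print(stamp)  # 20130101000000
```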

Rewrite an existing download tree for offline browsing:

python -m wayback_downloader --local-only ./websites/example.com

Core Concepts

Logical file ID

The downloader maps archived URLs to stable logical file IDs. Those IDs drive:

  • the local output path
  • resume tracking in .downloaded.txt
  • duplicate detection
  • local link rewriting
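
To make the idea concrete, here is a minimal sketch of one way such a mapping could work. It is an assumption for illustration only: the function name logical_file_id and its rules (drop scheme and host, treat directory URLs as index.html, keep the query string) are hypothetical, and the project's actual scheme may differ.

```python
from urllib.parse import urlsplit


def logical_file_id(original_url: str) -> str:
    """Map an archived URL to a stable, path-like ID (hypothetical rules)."""
    parts = urlsplit(original_url)
    path = parts.path or "/"
    # Assumption: directory-style URLs are stored as index.html.
    if path.endswith("/"):
        path += "index.html"
    file_id = path.lstrip("/")
    # Assumption: the query string stays part of the ID so variants do not collide.
    if parts.query:
        file_id += "?" + parts.query
    return file_id


# The same page captured over http and https maps to one ID.
print(logical_file_id("http://example.com/"))
print(logical_file_id("https://example.com/css/site.css"))
```

Because scheme and host are dropped in this sketch, captures of the same file under different schemes collapse to a single ID, which is the property that resume tracking and duplicate detection rely on.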

Snapshot planning

The CDX API returns raw (timestamp, original_url) rows. The downloader turns them into planned Snapshot objects using the current filters and snapshot mode.
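
The planning step can be sketched roughly as follows. This is a simplified illustration, not the project's code: the Snapshot fields, the plan_snapshots function, and the "newest capture at or before the cutoff" rule for snapshot mode are all assumptions, and rows are grouped by original URL here rather than by logical file ID.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Snapshot:
    # Field names are assumed for this sketch.
    timestamp: str
    original_url: str


def plan_snapshots(
    rows: Iterable[tuple[str, str]],
    snapshot_at: Optional[str] = None,
) -> list[Snapshot]:
    """Keep one capture per URL: the newest overall, or the newest
    at or before snapshot_at when a cutoff is given."""
    chosen: dict[str, Snapshot] = {}
    for timestamp, original_url in rows:
        # 14-digit timestamps sort correctly as plain strings.
        if snapshot_at is not None and timestamp > snapshot_at:
            continue
        current = chosen.get(original_url)
        if current is None or timestamp > current.timestamp:
            chosen[original_url] = Snapshot(timestamp, original_url)
    return sorted(chosen.values(), key=lambda snap: snap.original_url)


rows = [
    ("20120101000000", "http://example.com/"),
    ("20140101000000", "http://example.com/"),
    ("20121231000000", "http://example.com/a.css"),
]
latest = plan_snapshots(rows)
as_of_2013 = plan_snapshots(rows, snapshot_at="20130101000000")
```

With the sample rows, the default plan picks the 2014 capture of the home page, while the plan with a 2013 cutoff falls back to the 2012 capture.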

Resume state

Each output tree can keep:

  • .cdx.json for cached CDX results
  • .downloaded.txt for successful logical file IDs

Those files allow interrupted runs to continue without starting from scratch.
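
As an illustration of how resume tracking of this kind can work, the sketch below reads and appends to a .downloaded.txt file. The one-ID-per-line layout and the helper names load_downloaded and mark_downloaded are assumptions made for the example; the actual file format is not specified here.

```python
from pathlib import Path


def load_downloaded(state_file: Path) -> set[str]:
    """Return the IDs already recorded, or an empty set on a first run.

    Assumes one logical file ID per line (hypothetical format).
    """
    if not state_file.exists():
        return set()
    lines = state_file.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def mark_downloaded(state_file: Path, file_id: str) -> None:
    """Append one ID after a successful download."""
    with state_file.open("a", encoding="utf-8") as handle:
        handle.write(file_id + "\n")


def pending(planned_ids: list[str], state_file: Path) -> list[str]:
    """Filter a plan down to the IDs that still need downloading."""
    done = load_downloaded(state_file)
    return [file_id for file_id in planned_ids if file_id not in done]
```

Appending after each successful file, rather than writing the whole set at the end, means an interrupted run loses at most the file that was in flight.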

Packaging and Automation

The project is published on PyPI under the distribution name:

wayback-machine-downloader

The import package name differs from the distribution name and remains:

import wayback_downloader

Release automation is documented on the Automation and Release page.
