User Guide

Arie Joe edited this page Jun 3, 2026 · 1 revision

Overview

This guide walks through the most common real-world use of Wayback Machine Downloader:

  1. choose a target
  2. preview the archive scope
  3. download the site
  4. rewrite links for local browsing
  5. pull in missing assets if needed
  6. resume or refine the run

If you only want the quick command list, see Quick Start. This page is the fuller end-to-end guide.

1. Install the Tool

From PyPI:

python -m pip install wayback-machine-downloader

Check that it works:

wayback-machine-downloader --version

2. Pick the Right Target

Typical targets:

  • a full site URL: https://example.com
  • a bare host: example.com
  • a directory subtree: https://example.com/wiki/
  • one exact file: https://example.com/index.html

Important default behavior:

  • https://example.com/ is treated like a site prefix
  • https://example.com/wiki/ is treated like a directory prefix
  • --exact-url disables that expansion

3. Preview Before Downloading

Before a large run, list the planned captures:

python -m wayback_downloader --list https://example.com

This is useful when:

  • the archive is large
  • you are testing filters
  • you are narrowing by date
  • you want to confirm the target expands the way you expect

4. Do a First Download

The default mode downloads the newest capture of each logical file:

python -m wayback_downloader https://example.com

This is the best starting point for most users because it produces one copy of each file instead of every archived revision, so the output is cleaner than an all-timestamps dump.
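To make "newest capture of each logical file" concrete, here is a minimal sketch of that selection step. It is not the tool's actual code: `latest_per_url` and the `(timestamp, url)` pair layout are hypothetical, chosen only for illustration.

```python
def latest_per_url(captures):
    """Keep only the newest capture timestamp for each original URL.

    `captures` is an iterable of (timestamp, url) pairs, where timestamp is a
    14-digit Wayback-style string such as "20130101000000".
    Hypothetical helper, not part of the downloader's API.
    """
    newest = {}
    for timestamp, url in captures:
        # Fixed-width digit strings sort chronologically as plain strings.
        if url not in newest or timestamp > newest[url]:
            newest[url] = timestamp
    return newest


sample = [
    ("20100101000000", "https://example.com/"),
    ("20130601120000", "https://example.com/"),
    ("20110315000000", "https://example.com/style.css"),
]
selected = latest_per_url(sample)
```

In this sample the home page has two captures and only the 2013 one survives, while the stylesheet keeps its single capture.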

5. Make the Site Browseable Offline

If you plan to open the site locally, use:

python -m wayback_downloader --local https://example.com

This rewrites:

  • Wayback wrapper URLs
  • absolute site URLs
  • CSS asset URLs
  • JavaScript string URLs
  • JSON-escaped script URLs
  • srcset image URLs
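As an illustration of the first item in the list above, the sketch below strips Wayback wrapper prefixes of the form `https://web.archive.org/web/<timestamp><modifier>/` from a piece of markup. The regular expression and the `unwrap` name are assumptions made for this example; the downloader's own rewriting rules may differ and cover more cases.

```python
import re

# Matches an optional scheme and host, then /web/<digits>, an optional
# two-letter modifier such as "im_" or "js_", and the slash that precedes
# the original URL. Illustrative pattern only.
WRAPPER = re.compile(
    r"(?:https?:)?(?://web\.archive\.org)?/web/\d{1,14}(?:[a-z]{2}_)?/(?=https?://)"
)


def unwrap(text):
    """Remove Wayback wrapper prefixes, leaving the original URLs in place."""
    return WRAPPER.sub("", text)


html = '<img src="https://web.archive.org/web/20130101000000im_/https://example.com/a.png">'
cleaned = unwrap(html)
```

After unwrapping, the remaining absolute site URLs would still need to be mapped to local paths, which is a separate step.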

If you already downloaded the site without rewriting:

python -m wayback_downloader --local-only ./websites/example.com

6. Pull In Missing Assets

Some pages render incompletely after a basic mirror because the initial capture plan may not include every referenced stylesheet, script, font, or image.

Use:

python -m wayback_downloader --page-requisites --local https://example.com

This tells the downloader to:

  • download HTML-like pages
  • scan them for linked assets
  • queue those assets for download
  • rewrite the final page to point at the saved local paths
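The scan step in the list above can be pictured with a small sketch built on Python's standard `html.parser`. The `AssetCollector` class, the `find_assets` function, and the choice of tags are hypothetical; the real downloader may look at more attributes (for example `srcset` or CSS `url()` references).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collect asset URLs from a few common tags (illustrative subset)."""

    # (tag, attribute) pairs treated as assets in this sketch. Note that
    # <link href> also covers non-asset links such as rel="canonical".
    ASSET_ATTRS = {("link", "href"), ("script", "src"), ("img", "src")}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in self.ASSET_ATTRS:
                # Resolve relative references against the page URL.
                self.assets.append(urljoin(self.base_url, value))


def find_assets(html, base_url):
    collector = AssetCollector(base_url)
    collector.feed(html)
    return collector.assets


page = (
    '<link rel="stylesheet" href="/css/site.css">'
    '<script src="app.js"></script>'
    '<img src="https://example.com/logo.png">'
    '<a href="/about">About</a>'
)
assets = find_assets(page, "https://example.com/wiki/")
```

Ordinary page links such as the `<a href>` above are ignored here because they are navigation targets, not resources needed to render the page.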

7. Handle Third-Party Assets Only When Needed

By default, page-asset discovery stays on the target host.

That avoids exploding the crawl into:

  • CDNs
  • analytics
  • ad networks
  • social widgets

If you truly want cross-host assets:

python -m wayback_downloader --page-requisites --cross-host --local https://example.com

Use this carefully. It can grow the run dramatically.
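The host restriction described in this section amounts to a filter on discovered asset URLs. The sketch below assumes an exact hostname comparison; `keep_asset` is a hypothetical name, and the downloader may treat variants such as `www.` differently.

```python
from urllib.parse import urlsplit


def keep_asset(asset_url, target_host, cross_host=False):
    """Decide whether a discovered asset should be queued.

    Assumption for this sketch: hosts must match exactly unless
    cross_host is enabled.
    """
    if cross_host:
        return True
    # urlsplit lowercases the hostname for us.
    return urlsplit(asset_url).hostname == target_host.lower()


found = [
    "https://example.com/css/site.css",
    "https://cdn.jsdelivr.net/npm/lib.js",
    "https://www.google-analytics.com/analytics.js",
]
same_host_only = [u for u in found if keep_asset(u, "example.com")]
everything = [u for u in found if keep_asset(u, "example.com", cross_host=True)]
```

With the default setting only the first URL is kept, which shows why enabling cross-host discovery can multiply the size of the queue.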

8. Choose the Right Snapshot Mode

Latest capture per logical file

Best for normal browsing and most site recovery:

python -m wayback_downloader https://example.com

All timestamps

Best for deep archival collection:

python -m wayback_downloader --all-timestamps https://example.com

Composite snapshot at a point in time

Best for reconstructing how the site looked around a specific date:

python -m wayback_downloader --snapshot-at 20130101000000 https://example.com
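One plausible reading of a composite snapshot is "for each file, take the newest capture at or before the requested time". The sketch below implements that reading only; the `composite_snapshot` name and the at-or-before rule are assumptions, and the tool may instead pick the nearest capture on either side of the date.

```python
def composite_snapshot(captures, snapshot_at):
    """Pick, per URL, the newest capture whose timestamp is <= snapshot_at.

    `captures` holds (timestamp, url) pairs with 14-digit timestamps.
    Hypothetical helper; the at-or-before rule is an assumption.
    """
    chosen = {}
    for timestamp, url in captures:
        if timestamp > snapshot_at:
            continue  # captured after the requested point in time
        if url not in chosen or timestamp > chosen[url]:
            chosen[url] = timestamp
    return chosen


history = [
    ("20120501000000", "https://example.com/"),
    ("20121220000000", "https://example.com/"),
    ("20140101000000", "https://example.com/"),
    ("20150101000000", "https://example.com/new.css"),
]
snapshot = composite_snapshot(history, "20130101000000")
```

Under this rule, files that were first captured after the requested date (such as `new.css` here) are left out of the reconstruction.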

9. Narrow the Download

Limit the archive by time:

python -m wayback_downloader --from 20060101 --to 20071231 https://example.com
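The date values follow the Wayback timestamp layout `yyyyMMddhhmmss`, where shorter prefixes such as `20060101` are also accepted. The sketch below shows how such values can be produced and compared. Both helper names are hypothetical, and treating a short `--to` value as covering the whole day is an assumption about the tool.

```python
from datetime import datetime


def to_wayback_timestamp(moment):
    """Format a datetime as a 14-digit Wayback-style timestamp."""
    return moment.strftime("%Y%m%d%H%M%S")


def in_range(timestamp, start, end):
    """Check a 14-digit timestamp against possibly shorter bounds.

    Comparing on the bound's own length makes an 8-digit bound cover the
    whole day (assumed behaviour for this sketch).
    """
    return start <= timestamp[: len(start)] and timestamp[: len(end)] <= end


stamp = to_wayback_timestamp(datetime(2007, 12, 31, 23, 59, 59))
inside = in_range(stamp, "20060101", "20071231")
outside = in_range("20080101000000", "20060101", "20071231")
```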

Restrict by URL pattern:

python -m wayback_downloader --only "/\\.(css|js|png)$/i" https://example.com

Skip unwanted areas:

python -m wayback_downloader --exclude admin https://example.com
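The two filter examples above use different value styles: a slash-delimited regular expression with a trailing `i` flag, and a plain word. A minimal sketch of how such values could be interpreted follows. `compile_filter` is hypothetical, and treating plain values as case-sensitive substrings is an assumption, not documented behaviour.

```python
import re


def compile_filter(spec):
    """Turn a filter value into a predicate over URLs.

    "/pattern/flags" is treated as a regular expression ("i" means
    case-insensitive); anything else is treated as a plain substring.
    Both rules are assumptions made for this sketch.
    """
    match = re.fullmatch(r"/(.*)/([a-z]*)", spec, flags=re.DOTALL)
    if match:
        pattern, flag_letters = match.groups()
        flags = re.IGNORECASE if "i" in flag_letters else 0
        regex = re.compile(pattern, flags)
        return lambda url: regex.search(url) is not None
    return lambda url: spec in url


only = compile_filter(r"/\.(css|js|png)$/i")
exclude = compile_filter("admin")

urls = [
    "https://example.com/LOGO.PNG",
    "https://example.com/index.html",
    "https://example.com/admin/site.css",
]
kept = [u for u in urls if only(u) and not exclude(u)]
```

Combining the two predicates this way mirrors using `--only` and `--exclude` together: a URL must match the first and must not match the second.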

Download one exact file only:

python -m wayback_downloader --exact-url https://example.com/index.html

10. Resume Interrupted Runs

The downloader can keep two state files in the output tree:

  • .cdx.json
  • .downloaded.txt

Behavior:

  • failed runs keep state by default
  • successful runs remove state by default
  • --keep preserves state after success
  • --reset forces a clean restart

Examples:

python -m wayback_downloader --keep https://example.com
python -m wayback_downloader --reset https://example.com
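Resuming comes down to remembering which items are already on disk and skipping them on the next run. The sketch below assumes `.downloaded.txt` holds one identifier per line; that file layout and all three function names are hypothetical.

```python
from pathlib import Path


def load_done(state_file):
    """Read completed item ids, one per line (assumed layout)."""
    if not state_file.exists():
        return set()
    lines = state_file.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def pending(planned, state_file):
    """Return the planned items that are not yet recorded as done."""
    done = load_done(state_file)
    return [item for item in planned if item not in done]


def mark_done(state_file, item):
    """Append an item id as soon as its download has finished."""
    with state_file.open("a", encoding="utf-8") as handle:
        handle.write(item + "\n")


def reset(state_file):
    """Forget all progress, comparable in spirit to a clean restart."""
    state_file.unlink(missing_ok=True)
```

Appending after each finished item, instead of writing the file once at the end, is what lets an interrupted run pick up where it stopped.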

11. Mirror Subdomains When the Site Uses Them

If the archived site references first-party subdomains:

python -m wayback_downloader --recursive-subdomains --subdomain-depth 2 https://example.com

Those downloads are stored under:

subdomains/<host>/

Use this when the site depends on hosts such as:

  • blog.example.com
  • media.example.com
  • cdn.example.com
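A host counts as a first-party subdomain when it ends with the target host preceded by a dot. The sketch below shows that check and the storage path mentioned above. Both function names are hypothetical, and the meaning of `--subdomain-depth` is not modelled here.

```python
def is_first_party_subdomain(host, base_host):
    """True when host is a proper subdomain of base_host."""
    host, base_host = host.lower(), base_host.lower()
    # The leading dot prevents "notexample.com" from matching "example.com".
    return host != base_host and host.endswith("." + base_host)


def subdomain_dir(host):
    """Relative output directory for a mirrored subdomain."""
    return f"subdomains/{host.lower()}/"


hosts = ["blog.example.com", "cdn.example.com", "notexample.com", "example.com"]
first_party = [h for h in hosts if is_first_party_subdomain(h, "example.com")]
dirs = [subdomain_dir(h) for h in first_party]
```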

12. Understand the Output

Default path:

./websites/<backup-name>/

Important filename behavior:

  • directory-like captures become index.html
  • query strings become __q<digest> in filenames
  • invalid Windows path characters are sanitized

This renaming is expected; it is part of how resumability and rewriting stay consistent.
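The three filename rules above can be sketched as a single URL-to-path function. Several details here are assumptions chosen for illustration: the SHA-1 digest, its 8-character length, appending `__q<digest>` at the end of the name, and replacing invalid characters with an underscore. The downloader's real scheme may differ.

```python
import hashlib
import re
from urllib.parse import urlsplit

# Characters that Windows does not allow in file names (the forward slash
# is kept because it separates path segments here).
INVALID_WINDOWS_CHARS = re.compile(r'[<>:"|?*\\]')


def local_path(url):
    """Map an original URL to a relative local file path (illustrative)."""
    parts = urlsplit(url)
    path = parts.path.lstrip("/")
    # Directory-like captures are stored as index.html.
    if path == "" or path.endswith("/"):
        path += "index.html"
    # Query strings become a short digest (assumed: first 8 hex chars of SHA-1).
    if parts.query:
        digest = hashlib.sha1(parts.query.encode("utf-8")).hexdigest()[:8]
        path += f"__q{digest}"
    return INVALID_WINDOWS_CHARS.sub("_", path)


examples = {
    url: local_path(url)
    for url in (
        "https://example.com/",
        "https://example.com/wiki/",
        "https://example.com/page.php?id=7",
    )
}
```

Because the mapping depends only on the URL, running it twice yields the same path, which is the property that lets a resumed run and the link rewriter agree on where each file lives.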

Recommended Workflows

Basic offline mirror

python -m wayback_downloader --local https://example.com

More complete offline mirror

python -m wayback_downloader --page-requisites --local https://example.com

Historical archive collection

python -m wayback_downloader --all-timestamps --keep https://example.com

Date-focused reconstruction

python -m wayback_downloader --snapshot-at 20130101000000 --local https://example.com

Subdomain-heavy site

python -m wayback_downloader --page-requisites --recursive-subdomains --local https://example.com

When Something Looks Wrong

Go to:

Suggested Next Reading
