User Guide

Overview

This guide walks through the most common real-world use of Wayback Machine Downloader:

choose a target
preview the archive scope
download the site
rewrite links for local browsing
pull in missing assets if needed
resume or refine the run

If you only want the quick command list, see Quick Start. This page is the fuller end-to-end guide.

1. Install the Tool

From PyPI:

python -m pip install wayback-machine-downloader

Check that it works:

wayback-machine-downloader --version

2. Pick the Right Target

Typical targets:

a full site URL: https://example.com
a bare host: example.com
a directory subtree: https://example.com/wiki/
one exact file: https://example.com/index.html

Important default behavior:

https://example.com/ is treated like a site prefix
https://example.com/wiki/ is treated like a directory prefix
--exact-url disables that expansion
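
The expansion rules above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the downloader's own code: `classify_target` and its return labels are hypothetical, and the guide does not say how a file-like path is expanded without --exact-url, so the sketch reports that case separately instead of guessing.

```python
from urllib.parse import urlsplit


def classify_target(target: str, exact_url: bool = False) -> str:
    """Hypothetical helper: describe how a target would be expanded."""
    # A bare host such as "example.com" has no scheme, so add one before parsing.
    if "://" not in target:
        target = "https://" + target
    path = urlsplit(target).path

    if exact_url:
        return "exact"  # --exact-url disables prefix expansion
    if path in ("", "/"):
        return "site-prefix"
    if path.endswith("/"):
        return "directory-prefix"
    # Assumption: behavior for file-like paths without --exact-url is not
    # stated in this guide, so it is reported as its own category.
    return "file-like"
```

For example, `classify_target("example.com")` and `classify_target("https://example.com/")` both give "site-prefix", while `classify_target("https://example.com/wiki/")` gives "directory-prefix".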

3. Preview Before Downloading

Before a large run, list the planned captures:

python -m wayback_downloader --list https://example.com

This is useful when:

the archive is large
you are testing filters
you are narrowing by date
you want to confirm the target expands the way you expect

4. Do a First Download

The default mode downloads the newest capture of each logical file:

python -m wayback_downloader https://example.com

This is the best starting point for most users because the output is cleaner than an all-timestamps dump.
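
To make "newest capture of each logical file" concrete, here is a minimal Python sketch. It assumes a capture is a (timestamp, url) pair and that a "logical file" is identified by its URL; the downloader's real grouping key may differ, and `latest_per_url` is a hypothetical name.

```python
def latest_per_url(captures):
    """Keep only the newest timestamp for each URL."""
    newest = {}
    for timestamp, url in captures:
        # 14-digit Wayback timestamps sort chronologically as plain strings.
        if url not in newest or timestamp > newest[url]:
            newest[url] = timestamp
    return newest


# Illustrative data only; not real archive captures.
captures = [
    ("20100101000000", "https://example.com/"),
    ("20130501120000", "https://example.com/"),
    ("20120101000000", "https://example.com/style.css"),
]
selected = latest_per_url(captures)
```

Here `selected` keeps the 2013 capture of the home page and the single capture of the stylesheet, which is why the output tree holds one copy per file.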

5. Make the Site Browseable Offline

If you plan to open the site locally, use:

python -m wayback_downloader --local https://example.com

This rewrites:

Wayback wrapper URLs
absolute site URLs
CSS asset URLs
JavaScript string URLs
JSON-escaped script URLs
srcset image URLs
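
As an illustration of the first item, a Wayback wrapper URL has the form https://web.archive.org/web/<timestamp>/<original-url>, sometimes with a two-letter modifier such as im_ after the timestamp. The sketch below strips that wrapper. It is not the downloader's rewriting code; `unwrap_wayback_url` is a hypothetical helper.

```python
import re

# Timestamp of up to 14 digits, optional modifier like "im_" or "js_".
WAYBACK_WRAPPER = re.compile(
    r"^https?://web\.archive\.org/web/"
    r"(?P<timestamp>\d{1,14})(?P<modifier>[a-z]{2}_)?/"
    r"(?P<original>.+)$"
)


def unwrap_wayback_url(url: str) -> str:
    """Return the original URL if `url` is a Wayback wrapper, else `url`."""
    match = WAYBACK_WRAPPER.match(url)
    return match.group("original") if match else url
```

For example, `unwrap_wayback_url("https://web.archive.org/web/20130101000000im_/https://example.com/logo.png")` returns "https://example.com/logo.png". A real rewrite would then map that original URL to a saved local path.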

If you already downloaded the site without rewriting, run the rewrite on the existing download:

python -m wayback_downloader --local-only ./websites/example.com

6. Pull In Missing Assets

Some pages render incompletely after a basic mirror because the initial capture plan may not include every referenced stylesheet, script, font, or image.

Use:

python -m wayback_downloader --page-requisites --local https://example.com

This tells the downloader to:

download HTML-like pages
scan them for linked assets
queue those assets for download
rewrite the final page to point at the saved local paths
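
The "scan them for linked assets" step can be sketched with Python's standard html.parser. This is an assumption-laden illustration, not the downloader's scanner: `AssetCollector` and `find_assets` are hypothetical names, and the sketch only looks at img src, script src, and link href.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collect asset URLs from a small set of tag/attribute pairs."""

    # Assumption: this sketch treats every <link href> as an asset.
    ASSET_ATTRS = {("img", "src"), ("script", "src"), ("link", "href")}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in self.ASSET_ATTRS:
                # Resolve relative references against the page URL.
                self.assets.append(urljoin(self.base_url, value))


def find_assets(html, base_url):
    collector = AssetCollector(base_url)
    collector.feed(html)
    return collector.assets
```

Each URL returned by `find_assets` would then be queued for download. A fuller scanner would also need to read CSS url(...) references and srcset attributes.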

7. Handle Third-Party Assets Only When Needed

By default, page-asset discovery stays on the target host.

That keeps the crawl from expanding into third-party hosts such as:

CDNs
analytics
ad networks
social widgets

If you truly want cross-host assets:

python -m wayback_downloader --page-requisites --cross-host --local https://example.com

Use this carefully. Following third-party hosts can grow the run dramatically.
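
The same-host default can be pictured as a simple filter over discovered asset URLs. The Python sketch below is illustrative only: `filter_assets` is a hypothetical helper, and it assumes "same host" means an exact hostname match.

```python
from urllib.parse import urlsplit


def same_host(asset_url: str, target_url: str) -> bool:
    # urlsplit(...).hostname is lowercased and excludes any port.
    return urlsplit(asset_url).hostname == urlsplit(target_url).hostname


def filter_assets(asset_urls, target_url, cross_host=False):
    """Drop off-host assets unless cross-host mode is requested."""
    if cross_host:
        return list(asset_urls)
    return [url for url in asset_urls if same_host(url, target_url)]
```

With `cross_host=False`, an asset such as https://cdn.example.net/lib.js is dropped for the target https://example.com. With `cross_host=True`, it is kept, which is where the extra volume comes from.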

8. Choose the Right Snapshot Mode

Latest capture per logical file

Best for normal browsing and most site recovery:

python -m wayback_downloader https://example.com

All timestamps

Best for deep archival collection:

python -m wayback_downloader --all-timestamps https://example.com

Composite snapshot at a point in time

Best for reconstructing how the site looked around a specific date:

python -m wayback_downloader --snapshot-at 20130101000000 https://example.com
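
One way to picture a composite snapshot: for every URL, take the newest capture that is not later than the requested timestamp. The guide does not state the exact selection rule, so "at or before" is an assumption here, and `composite_snapshot` is a hypothetical helper rather than the downloader's code.

```python
def composite_snapshot(captures, snapshot_at):
    """Pick, per URL, the newest capture at or before `snapshot_at`.

    Assumption: timestamps are 14-digit strings, so string comparison
    matches chronological order.
    """
    chosen = {}
    for timestamp, url in captures:
        if timestamp > snapshot_at:
            continue  # captured after the requested point in time
        if url not in chosen or timestamp > chosen[url]:
            chosen[url] = timestamp
    return chosen
```

Different files can therefore come from different capture dates, which is what makes the result a composite instead of a single crawl.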

9. Narrow the Download

Limit the archive by time:

python -m wayback_downloader --from 20060101 --to 20071231 https://example.com
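
Wayback timestamps follow the YYYYMMDDhhmmss pattern, and shorter values such as 20060101 are prefixes of it. The Python sketch below expands a prefix to the earliest instant it can denote. `expand_timestamp` is a hypothetical helper, and whether the downloader treats a short --to value as the start or the end of that period is not stated in this guide.

```python
from datetime import datetime

# Defaults for the parts a short timestamp leaves out:
# month 01, day 01, then 00:00:00.
_EARLIEST = "00000101000000"


def expand_timestamp(ts: str) -> datetime:
    """Expand a 4- to 14-digit timestamp prefix to its earliest instant."""
    if not ts.isdigit() or len(ts) not in (4, 6, 8, 10, 12, 14):
        raise ValueError(f"unsupported timestamp: {ts!r}")
    padded = ts + _EARLIEST[len(ts):]
    return datetime.strptime(padded, "%Y%m%d%H%M%S")
```

For example, `expand_timestamp("20060101")` gives midnight on 1 January 2006, and `expand_timestamp("2007")` gives midnight on 1 January 2007.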

Restrict by URL pattern:

python -m wayback_downloader --only "/\\.(css|js|png)$/i" https://example.com

Skip unwanted areas:

python -m wayback_downloader --exclude admin https://example.com
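
The two filter styles shown above, a slash-delimited pattern with an i flag and a plain word, can be modeled as follows. This is a sketch under assumptions: `compile_filter` is hypothetical, and treating a non-delimited value as a substring test is a guess about the tool's behavior, not documented fact.

```python
import re


def compile_filter(spec: str):
    """Return a predicate url -> bool for a filter string."""
    # "/body/flags" form, e.g. "/\.(css|js|png)$/i"
    match = re.fullmatch(r"/(.*)/([a-z]*)", spec, flags=re.DOTALL)
    if match:
        body, flag_letters = match.groups()
        flags = re.IGNORECASE if "i" in flag_letters else 0
        pattern = re.compile(body, flags)
        return lambda url: pattern.search(url) is not None
    # Assumption: anything else is matched as a plain substring.
    return lambda url: spec in url


only_assets = compile_filter(r"/\.(css|js|png)$/i")
skip_admin = compile_filter("admin")
```

Here `only_assets("https://example.com/LOGO.PNG")` is true because of the i flag, and `skip_admin("https://example.com/admin/login")` is true because the URL contains "admin".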

Download one exact file only:

python -m wayback_downloader --exact-url https://example.com/index.html

10. Resume Interrupted Runs

The downloader can keep two state files in the output tree:

.cdx.json
.downloaded.txt

Behavior:

failed runs keep state by default
successful runs remove state by default
--keep preserves state after success
--reset forces a clean restart

Examples:

python -m wayback_downloader --keep https://example.com
python -m wayback_downloader --reset https://example.com
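
The state-file behavior can be illustrated with a small Python sketch built around .downloaded.txt. Everything beyond the file name is an assumption: the one-identifier-per-line format, the `run` and `load_done` helpers, and the `fetch` callback are all hypothetical and only mirror the rules listed above.

```python
from pathlib import Path


def load_done(state_file: Path) -> set:
    """Read completed file ids, one per line (assumed format)."""
    if not state_file.exists():
        return set()
    lines = state_file.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def run(file_ids, output_dir: Path, fetch, keep=False, reset=False):
    state_file = output_dir / ".downloaded.txt"
    if reset and state_file.exists():
        state_file.unlink()  # --reset: forget earlier progress
    done = load_done(state_file)
    for file_id in file_ids:
        if file_id in done:
            continue  # finished in an earlier run
        fetch(file_id)  # if this raises, the state file is left in place
        with state_file.open("a", encoding="utf-8") as handle:
            handle.write(file_id + "\n")
    if not keep and state_file.exists():
        state_file.unlink()  # success: remove state unless --keep
```

A failed run leaves the state file behind, so the next run skips what already succeeded. A successful run removes it unless `keep=True`.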

11. Mirror Subdomains When the Site Uses Them

If the archived site references first-party subdomains, enable recursive subdomain mirroring:

python -m wayback_downloader --recursive-subdomains --subdomain-depth 2 https://example.com

Those downloads are stored under:

subdomains/<host>/

Use this when the site depends on hosts such as:

blog.example.com
media.example.com
cdn.example.com
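
As a rough illustration of what "first-party subdomain" means and where such a host would land on disk, consider the Python sketch below. `is_subdomain` and `subdomain_dir` are hypothetical helpers. The only detail taken from this guide is the subdomains/<host>/ layout, and how the downloader actually decides which hosts qualify is not covered here.

```python
from pathlib import PurePosixPath


def is_subdomain(host: str, base_host: str) -> bool:
    """True if `host` is a strict subdomain of `base_host`."""
    host, base_host = host.lower(), base_host.lower()
    # The leading dot prevents "notexample.com" from matching "example.com".
    return host != base_host and host.endswith("." + base_host)


def subdomain_dir(host: str) -> PurePosixPath:
    """Relative directory for a mirrored subdomain."""
    return PurePosixPath("subdomains") / host.lower()
```

For example, `is_subdomain("blog.example.com", "example.com")` is true, and `subdomain_dir("media.example.com")` is subdomains/media.example.com.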

12. Understand the Output

Default path:

./websites/<backup-name>/

Important filename behavior:

directory-like captures become index.html
query strings become __q<digest> in filenames
invalid Windows path characters are sanitized

This renaming is expected. It is part of how resumability and rewriting stay consistent.
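
The three filename rules can be sketched in Python as follows. This is an illustration under assumptions, not the downloader's mapping: `local_path` is hypothetical, and the digest algorithm (SHA-1 truncated to 8 hex characters) and the position of the __q suffix are both guesses.

```python
import hashlib
import re
from urllib.parse import urlsplit

# Characters Windows does not allow in file names (besides "/").
_INVALID_WINDOWS = re.compile(r'[<>:"|?*\\]')


def local_path(url: str) -> str:
    """Map a URL to a relative path inside the backup directory."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"  # directory-like capture
    if parts.query:
        # Assumption: truncated SHA-1 of the query string.
        digest = hashlib.sha1(parts.query.encode("utf-8")).hexdigest()[:8]
        path += f"__q{digest}"
    segments = [_INVALID_WINDOWS.sub("_", s) for s in path.split("/") if s]
    return "/".join(segments)
```

For example, `local_path("https://example.com/wiki/")` gives "wiki/index.html". Because the digest depends only on the query string, the same URL always maps to the same file, which is what lets a resumed run and the link rewriter agree on paths.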

Recommended Workflows

Basic offline mirror

python -m wayback_downloader --local https://example.com

More complete offline mirror

python -m wayback_downloader --page-requisites --local https://example.com

Historical archive collection

python -m wayback_downloader --all-timestamps --keep https://example.com

Date-focused reconstruction

python -m wayback_downloader --snapshot-at 20130101000000 --local https://example.com

Subdomain-heavy site

python -m wayback_downloader --page-requisites --recursive-subdomains --local https://example.com

When Something Looks Wrong

Go to:

