-
Notifications
You must be signed in to change notification settings - Fork 0
User Guide
This guide walks through the most common real-world use of Wayback Machine Downloader:
- choose a target
- preview the archive scope
- download the site
- rewrite links for local browsing
- pull in missing assets if needed
- resume or refine the run
If you only want the quick command list, see Quick Start. This page is the fuller end-to-end guide.
From PyPI:
python -m pip install wayback-machine-downloaderCheck that it works:
wayback-machine-downloader --versionTypical targets:
- a full site URL:
https://example.com - a bare host:
example.com - a directory subtree:
https://example.com/wiki/ - one exact file:
https://example.com/index.html
Important default behavior:
-
https://example.com/is treated like a site prefix -
https://example.com/wiki/is treated like a directory prefix -
--exact-urldisables that expansion
Before a large run, list the planned captures:
python -m wayback_downloader --list https://example.comThis is useful when:
- the archive is large
- you are testing filters
- you are narrowing by date
- you want to confirm the target expands the way you expect
The default mode downloads the newest capture of each logical file:
python -m wayback_downloader https://example.comThis is the best starting point for most users because the output is cleaner than an all-timestamps dump.
If you plan to open the site locally, use:
python -m wayback_downloader --local https://example.comThis rewrites:
- Wayback wrapper URLs
- absolute site URLs
- CSS asset URLs
- JavaScript string URLs
- JSON-escaped script URLs
-
srcsetimage URLs
If you already downloaded the site without rewriting:
python -m wayback_downloader --local-only ./websites/example.comSome pages render incompletely after a basic mirror because the initial capture plan may not include every referenced stylesheet, script, font, or image.
Use:
python -m wayback_downloader --page-requisites --local https://example.comThis tells the downloader to:
- download HTML-like pages
- scan them for linked assets
- queue those assets for download
- rewrite the final page to point at the saved local paths
By default, page-asset discovery stays on the target host.
That avoids exploding the crawl into:
- CDNs
- analytics
- ad networks
- social widgets
If you truly want cross-host assets:
python -m wayback_downloader --page-requisites --cross-host --local https://example.comUse this carefully. It can grow the run dramatically.
Best for normal browsing and most site recovery:
python -m wayback_downloader https://example.comBest for deep archival collection:
python -m wayback_downloader --all-timestamps https://example.comBest for reconstructing how the site looked around a specific date:
python -m wayback_downloader --snapshot-at 20130101000000 https://example.comLimit the archive by time:
python -m wayback_downloader --from 20060101 --to 20071231 https://example.comRestrict by URL pattern:
python -m wayback_downloader --only "/\\.(css|js|png)$/i" https://example.comSkip unwanted areas:
python -m wayback_downloader --exclude admin https://example.comDownload one exact file only:
python -m wayback_downloader --exact-url https://example.com/index.htmlThe downloader can keep two state files in the output tree:
.cdx.json.downloaded.txt
Behavior:
- failed runs keep state by default
- successful runs remove state by default
-
--keeppreserves state after success -
--resetforces a clean restart
Examples:
python -m wayback_downloader --keep https://example.com
python -m wayback_downloader --reset https://example.comIf the archived site references first-party subdomains:
python -m wayback_downloader --recursive-subdomains --subdomain-depth 2 https://example.comThose downloads are stored under:
subdomains/<host>/
Use this when the site depends on hosts such as:
blog.example.commedia.example.comcdn.example.com
Default path:
./websites/<backup-name>/
Important filename behavior:
- directory-like captures become
index.html - query strings become
__q<digest>in filenames - invalid Windows path characters are sanitized
This is expected and is part of how resumability and rewriting stay consistent.
python -m wayback_downloader --local https://example.compython -m wayback_downloader --page-requisites --local https://example.compython -m wayback_downloader --all-timestamps --keep https://example.compython -m wayback_downloader --snapshot-at 20130101000000 --local https://example.compython -m wayback_downloader --page-requisites --recursive-subdomains --local https://example.comGo to: