Development and Testing
wayback_downloader/
    archive.py
    cli.py
    config.py
    downloader.py
    filters.py
    models.py
    paths.py
    requisites.py
    snapshots.py
    state.py
    subdomains.py
    text.py
    transport.py
    url_rewrite.py
tests/
    test_wayback_downloader.py
Create an editable install:
    python -m pip install -e .
There are no third-party runtime dependencies.
Run the CLI in module form:
    python -m wayback_downloader --help
Console-script form after editable install:
    wayback-machine-downloader --help
Primary test command:
    python -B -m unittest discover -s tests -t .
Optional import/compile sanity check:
    python -m compileall wayback_downloader tests
The suite avoids live archive traffic.
Instead it relies on:
- fake transports that return canned HTTPResponse values
- temporary directories for output
- direct assertions on path layout, state, and rewritten content
Benefits:
- deterministic
- fast
- safe to run repeatedly
- focused on downloader logic rather than archive availability
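A test in this style might look like the sketch below. The `HTTPResponse` class here is a simplified stand-in, and `FakeTransport` and `fetch_to_file` are hypothetical names used only for illustration; the project's real classes and signatures may differ.

```python
import tempfile
import unittest
from dataclasses import dataclass
from pathlib import Path


@dataclass
class HTTPResponse:
    """Simplified stand-in for the project's response type (assumed fields)."""
    status: int
    body: bytes


class FakeTransport:
    """Returns canned responses and records requested URLs; no network I/O."""

    def __init__(self, canned):
        self.canned = canned
        self.requested = []

    def get(self, url):
        self.requested.append(url)
        # URLs without a canned response behave like a 404.
        return self.canned.get(url, HTTPResponse(404, b""))


def fetch_to_file(transport, url, dest):
    """Hypothetical download step: write the body only on a 200 response."""
    response = transport.get(url)
    if response.status != 200:
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(response.body)
    return True


class FetchToFileTest(unittest.TestCase):
    def test_writes_body_on_success(self):
        url = "https://example.com/index.html"
        transport = FakeTransport({url: HTTPResponse(200, b"<html></html>")})
        with tempfile.TemporaryDirectory() as tmp:
            dest = Path(tmp) / "example.com" / "index.html"
            self.assertTrue(fetch_to_file(transport, url, dest))
            self.assertEqual(dest.read_bytes(), b"<html></html>")
        self.assertEqual(transport.requested, [url])

    def test_skips_write_when_missing(self):
        transport = FakeTransport({})
        with tempfile.TemporaryDirectory() as tmp:
            dest = Path(tmp) / "missing.html"
            self.assertFalse(fetch_to_file(transport, "https://example.com/x", dest))
            self.assertFalse(dest.exists())
```

Because the transport is injected, the same test body exercises path layout and file content without depending on archive availability.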
Update tests when changing:
- wildcard normalization
- timestamp filters
- all-timestamps behavior
- composite snapshot logic
Update tests when changing:
- query-string filename hashing
- filesystem sanitization
- directory-vs-file mapping
- blocking-file restructuring
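To make the path-mapping concerns above concrete, here is a minimal sketch of a URL-to-path function. The name `url_to_relpath`, the unsafe-character set, the 8-character SHA-256 suffix, and the `index.html` convention are all assumptions for illustration, not the actual behavior of paths.py.

```python
import hashlib
import re
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Characters treated as unsafe in file names (an assumed set).
_UNSAFE = re.compile(r'[<>:"|?*\\]')


def url_to_relpath(url: str) -> PurePosixPath:
    """Hypothetical mapping from an archived URL to a relative output path."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Directory-style URLs map to an index file inside that directory.
    if path.endswith("/"):
        path += "index.html"
    # Drop empty and dot segments, then sanitize what remains.
    segments = [
        _UNSAFE.sub("_", seg)
        for seg in path.split("/")
        if seg not in ("", ".", "..")
    ]
    if parts.query:
        # Hash the query so distinct query strings get distinct, short names.
        digest = hashlib.sha256(parts.query.encode("utf-8")).hexdigest()[:8]
        stem, dot, ext = segments[-1].rpartition(".")
        if dot:
            segments[-1] = f"{stem}_{digest}.{ext}"
        else:
            segments[-1] = f"{ext}_{digest}"
    return PurePosixPath(_UNSAFE.sub("_", parts.netloc), *segments)
```

A test for a change in this area would typically assert that two URLs differing only in their query string map to different files, and that the same URL always maps to the same file.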
Update tests when changing:
- resume logic
- worker behavior
- page requisites
- state-file cleanup
Update tests when changing:
- HTML attribute rewriting
- CSS url(...)
- JSON-escaped URLs
- srcset
- subdomain rewrites
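As one illustration of the rewriting cases listed above, the sketch below rewrites CSS `url(...)` targets. The function name `rewrite_css_urls` and the `mapping` argument (archived URL to local relative path) are hypothetical; the real rewriter in url_rewrite.py may work differently and handle more edge cases.

```python
import re

# Matches url(...) with optional single or double quotes around the target.
_CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def rewrite_css_urls(css: str, mapping: dict[str, str]) -> str:
    """Replace url(...) targets found in `mapping`; leave the rest untouched."""

    def replace(match: re.Match) -> str:
        quote, target = match.group(1), match.group(2)
        local = mapping.get(target)
        if local is None:
            # Unknown targets are preserved exactly as written.
            return match.group(0)
        return f"url({quote}{local}{quote})"

    return _CSS_URL.sub(replace, css)
```

A rewriting test can then compare the full output string, which checks both that mapped targets changed and that unmapped ones were left alone.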
If new HTTP behavior is needed, keep it inside the transport abstraction or
ArchiveClient rather than reaching straight into lower-level networking from
other modules.
If URL-to-path behavior changes, keep these aligned:
- downloader writes
- local rewriter output
- resume-state validation
paths.py is the shared source of truth.
Important invariants:
- missing files must be re-queued even if .downloaded.txt contains the ID
- failed runs should usually keep state
- successful runs should remove state unless --keep is set
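These invariants can be sketched as follows. The helper names `pending_ids` and `finish_run` are hypothetical, and the sketch assumes the state file holds one ID per line, which may not match the format used by state.py. It also simplifies "usually keep state" to "always keep state on failure".

```python
from pathlib import Path


def pending_ids(snapshots: dict[str, Path], state_file: Path) -> list[str]:
    """Return IDs that still need downloading.

    `snapshots` maps a snapshot ID to its expected output path. An ID recorded
    in the state file is skipped only if its file still exists on disk.
    """
    done: set[str] = set()
    if state_file.exists():
        lines = state_file.read_text().splitlines()
        done = {line.strip() for line in lines if line.strip()}
    return [
        snap_id
        for snap_id, path in snapshots.items()
        if snap_id not in done or not path.exists()
    ]


def finish_run(state_file: Path, succeeded: bool, keep: bool) -> None:
    """Remove state only after a successful run, and only without --keep."""
    if succeeded and not keep:
        state_file.unlink(missing_ok=True)
```

The key point is that the state file is treated as a hint rather than as proof: the presence of the output file on disk decides whether work is skipped.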
The biggest remaining gap is opt-in live integration coverage against the real Wayback service. The code is structured so this can be added later without weakening the fast offline unit suite.