Development and Testing

Arie Joe edited this page Jun 3, 2026 · 1 revision

Repository Layout

wayback_downloader/
  archive.py
  cli.py
  config.py
  downloader.py
  filters.py
  models.py
  paths.py
  requisites.py
  snapshots.py
  state.py
  subdomains.py
  text.py
  transport.py
  url_rewrite.py

tests/
  test_wayback_downloader.py

Local Setup

Create an editable install:

python -m pip install -e .

There are no third-party runtime dependencies.

Running the CLI in Development

Module form:

python -m wayback_downloader --help

Console-script form after editable install:

wayback-machine-downloader --help

Running Tests

Primary command:

python -B -m unittest discover -s tests -t .

Optional byte-compile sanity check (catches syntax errors; it does not import the modules):

python -m compileall wayback_downloader tests

Test Philosophy

The suite avoids live archive traffic.

Instead it relies on:

  • fake transports that return canned HTTPResponse values
  • temporary directories for output
  • direct assertions on path layout, state, and rewritten content

Benefits:

  • deterministic
  • fast
  • safe to run repeatedly
  • focused on downloader logic rather than archive availability
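
A minimal sketch of the fake-transport pattern described above. The names HTTPResponse, FakeTransport, and save_page are illustrative stand-ins defined here, and the real types in the package may have different fields and signatures:

```python
import tempfile
import unittest
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class HTTPResponse:
    """Hypothetical stand-in for the project's response type."""
    status: int
    body: bytes
    headers: dict = field(default_factory=dict)


class FakeTransport:
    """Returns canned responses keyed by URL and records every request."""

    def __init__(self, responses):
        self.responses = responses
        self.requested = []

    def get(self, url):
        self.requested.append(url)
        # Unknown URLs yield a 404 instead of touching the network.
        return self.responses.get(url, HTTPResponse(404, b""))


def save_page(transport, url, output_dir):
    """Toy downloader (assumption): fetch one URL, write it under output_dir."""
    response = transport.get(url)
    if response.status != 200:
        return None
    target = Path(output_dir) / "index.html"
    target.write_bytes(response.body)
    return target


class SavePageTest(unittest.TestCase):
    def test_writes_canned_body(self):
        url = "https://example.com/"
        transport = FakeTransport({url: HTTPResponse(200, b"<html></html>")})
        with tempfile.TemporaryDirectory() as tmp:
            target = save_page(transport, url, tmp)
            self.assertEqual(target.read_bytes(), b"<html></html>")
        self.assertEqual(transport.requested, [url])

    def test_missing_url_writes_nothing(self):
        transport = FakeTransport({})
        with tempfile.TemporaryDirectory() as tmp:
            self.assertIsNone(save_page(transport, "https://example.com/x", tmp))
            self.assertEqual(list(Path(tmp).iterdir()), [])
```

Because every response is canned and every output lands in a temporary directory, a test like this yields the same result on every run and leaves nothing behind.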

What to Test When Changing Behavior

Snapshot planning

Update tests when changing:

  • wildcard normalization
  • timestamp filters
  • all-timestamps behavior
  • composite snapshot logic

Path handling

Update tests when changing:

  • query-string filename hashing
  • filesystem sanitization
  • directory-vs-file mapping
  • blocking-file restructuring
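
To make the first three items concrete, here is a hedged sketch of a URL-to-path mapping. The function name, the SHA-1 digest, the 8-character digest length, and the set of sanitized characters are all assumptions chosen for illustration, and the actual rules in paths.py may differ:

```python
import hashlib
import re
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Characters that are unsafe on common filesystems (illustrative set).
_UNSAFE = re.compile(r'[<>:"|?*\\]')


def url_to_relative_path(url: str) -> str:
    """Hypothetical mapping; not the project's actual rules."""
    parts = urlsplit(url)
    # Drop empty, "." and ".." segments so the result stays inside the root.
    segments = [
        _UNSAFE.sub("_", s)
        for s in parts.path.split("/")
        if s not in ("", ".", "..")
    ]
    # Directory-like URLs map to an index file inside that directory.
    if not segments or parts.path.endswith("/"):
        segments.append("index.html")
    if parts.query:
        # A short digest keeps distinct query strings in distinct files.
        digest = hashlib.sha1(parts.query.encode("utf-8")).hexdigest()[:8]
        name = PurePosixPath(segments[-1])
        segments[-1] = f"{name.stem}_{digest}{name.suffix}"
    return "/".join([_UNSAFE.sub("_", parts.netloc), *segments])
```

A test for a change in this area would typically assert that two URLs differing only in their query string map to different files, and that the mapping is stable across runs.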

Download orchestration

Update tests when changing:

  • resume logic
  • worker behavior
  • page requisites
  • state-file cleanup

Rewriting

Update tests when changing:

  • HTML attribute rewriting
  • CSS url(...) rewriting
  • JSON-escaped URL rewriting
  • srcset rewriting
  • subdomain rewrites
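
As one example of what these tests exercise, a srcset value holds comma-separated candidates, each a URL followed by an optional width or density descriptor. The sketch below is a simplified illustration, not the project's rewriter. The function name and the rewrite_url callback are assumptions:

```python
from typing import Callable


def rewrite_srcset(value: str, rewrite_url: Callable[[str], str]) -> str:
    """Rewrite each URL in a srcset value, keeping its descriptors.

    Simplified: splits on commas, so URLs that themselves contain commas
    (for example data: URLs) are not handled.
    """
    rewritten = []
    for candidate in value.split(","):
        tokens = candidate.split()
        if not tokens:
            continue  # tolerate stray or trailing commas
        url, descriptors = tokens[0], tokens[1:]
        rewritten.append(" ".join([rewrite_url(url), *descriptors]))
    return ", ".join(rewritten)
```

A rewriting test can pass a trivial callback and assert that the descriptors (1x, 2x, 480w) survive unchanged while only the URLs are replaced.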

Implementation Notes

Keep network access behind the transport

If new HTTP behavior is needed, keep it inside the transport abstraction or ArchiveClient rather than reaching straight into lower-level networking from other modules.
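
One way to picture this boundary, as a sketch only: the Transport protocol, UrllibTransport, and SketchArchiveClient below are hypothetical, and the real interfaces in transport.py and archive.py may look different. The point is that only one class touches the networking library, and everything else depends on the interface:

```python
from typing import Protocol
from urllib.request import urlopen


class Transport(Protocol):
    """Hypothetical interface: the only place that performs HTTP."""

    def get(self, url: str) -> bytes: ...


class UrllibTransport:
    """Real implementation; the only class that calls urlopen."""

    def __init__(self, timeout: float = 30.0) -> None:
        self.timeout = timeout

    def get(self, url: str) -> bytes:
        with urlopen(url, timeout=self.timeout) as response:
            return response.read()


class SketchArchiveClient:
    """Illustrative stand-in for ArchiveClient; method name is assumed."""

    def __init__(self, transport: Transport) -> None:
        self.transport = transport

    def fetch_snapshot(self, timestamp: str, url: str) -> bytes:
        # Wayback replay URL; the id_ suffix requests the original bytes.
        return self.transport.get(
            f"https://web.archive.org/web/{timestamp}id_/{url}"
        )
```

With this shape, new HTTP behavior such as retries or extra headers would go into the transport class, and tests can substitute any object with a compatible get method.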

Reuse path rules everywhere

If URL-to-path behavior changes, keep these aligned:

  • downloader writes
  • local rewriter output
  • resume-state validation

paths.py is the shared source of truth.

Preserve resume safety

Important invariants:

  • missing files must be re-queued even if .downloaded.txt contains the ID
  • failed runs should usually keep state
  • successful runs should remove state unless --keep is set
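
These invariants can be sketched as follows. The function names, the planned mapping, and the one-ID-per-line state format are assumptions made for illustration. For simplicity the sketch always keeps state after a failure, whereas the real downloader may make exceptions:

```python
from pathlib import Path


def pending_ids(planned: dict[str, Path], state_file: Path) -> list[str]:
    """Return IDs that still need downloading.

    `planned` maps a hypothetical file ID to its expected output path.
    """
    done: set[str] = set()
    if state_file.exists():
        done = set(state_file.read_text(encoding="utf-8").splitlines())
    # An ID counts as complete only if it is recorded AND its file exists.
    return [
        file_id
        for file_id, path in planned.items()
        if file_id not in done or not path.is_file()
    ]


def finish_run(state_file: Path, succeeded: bool, keep: bool = False) -> None:
    """Remove state only after a successful run without keep."""
    if succeeded and not keep:
        state_file.unlink(missing_ok=True)
```

A regression test for resume safety would record an ID in the state file, delete the corresponding output file, and assert that the ID is queued again.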

Known Gaps

The biggest remaining gap is opt-in live integration coverage against the real Wayback service. The code is structured so this can be added later without weakening the fast offline unit suite.
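
One possible shape for such opt-in coverage, using only the standard library. The WAYBACK_LIVE_TESTS environment variable and the test class are hypothetical and not defined by the project. The test queries the public Wayback CDX endpoint directly rather than going through the package:

```python
import os
import unittest
from urllib.request import urlopen

# Hypothetical opt-in switch; the project does not define this variable.
LIVE = os.environ.get("WAYBACK_LIVE_TESTS") == "1"


@unittest.skipUnless(LIVE, "set WAYBACK_LIVE_TESTS=1 to run live tests")
class LiveAvailabilityTest(unittest.TestCase):
    def test_cdx_endpoint_responds(self):
        url = "https://web.archive.org/cdx/search/cdx?url=example.com&limit=1"
        with urlopen(url, timeout=30) as response:
            self.assertEqual(response.status, 200)
```

Because the class is skipped unless the variable is set, the default unittest discover run stays offline and fast.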
