Python 3.11+ CLI and library for extracting website intelligence from HTML pages.
- Core metadata: title, description, canonical URL, language.
- Page details: heading lists, links/images/forms/scripts/word counts.
- Technology hints: CMS/framework/analytics/server/backend signatures.
- Motto/tagline: best candidate + ranked candidate list.
- Structured warnings and machine-readable error envelope.
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

CLI usage:

```bash
python3 -m scraper_tool --url https://example.com
python3 -m scraper_tool --url https://example.com --pretty
python3 -m scraper_tool --url https://example.com --save result.json --pretty
```

Options:

- `--url`: target website URL.
- `--timeout`: HTTP timeout in seconds (default `15`).
- `--output`: output format; currently `json` only.
- `--pretty`: pretty-print JSON in console/file.
- `--save`: write JSON output to a file.
```python
from scraper_tool import analyze_url

result = analyze_url("https://example.com", timeout=15)
print(result["technologies"])
```

Stable top-level keys (in order):

- `input_url`
- `final_url`
- `status_code`
- `fetched_at`
- `title`
- `description`
- `motto_best`
- `motto_candidates`
- `technologies`
- `page_details`
- `warnings`
- `error`
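Because the top-level keys are stable and ordered, downstream code can sanity-check a parsed result before using it. A minimal sketch, assuming only the key list documented above (`check_envelope` is a hypothetical helper, not part of the library):

```python
import json

# Stable top-level keys, in the order documented above.
STABLE_KEYS = [
    "input_url", "final_url", "status_code", "fetched_at",
    "title", "description", "motto_best", "motto_candidates",
    "technologies", "page_details", "warnings", "error",
]

def check_envelope(raw: str) -> dict:
    """Parse a JSON result and verify it carries the stable keys in order."""
    result = json.loads(raw)  # json.loads preserves key order (Python 3.7+)
    if list(result) != STABLE_KEYS:
        raise ValueError(f"unexpected envelope keys: {list(result)}")
    return result
```

For example, the contents of a file written with `--save` could be passed to `check_envelope` before further processing.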
Run full test suite:

```bash
python3 -m pytest -q
```

Limitations:

- HTML-only scraping (no JavaScript rendering/browser execution).
- Technology detection is heuristic and may miss custom stacks.
- Network-dependent scraping can be affected by bot protection/rate limits.
Troubleshooting:

- If requests time out, increase `--timeout`.
- If output has `validation_error`, verify the URL scheme/domain format.
- If tech detection is empty, check whether the site loads key assets only via JavaScript.