Web-scraper-tool

Python 3.11+ CLI and library for extracting website intelligence from HTML pages.

What it extracts

Core metadata: title, description, canonical URL, language.
Page details: heading lists, plus counts of links, images, forms, scripts, and words.
Technology hints: CMS/framework/analytics/server/backend signatures.
Motto/tagline: best candidate plus a ranked candidate list.
Diagnostics: structured warnings and a machine-readable error envelope.

Install

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

CLI usage

python3 -m scraper_tool --url https://example.com
python3 -m scraper_tool --url https://example.com --pretty
python3 -m scraper_tool --url https://example.com --save result.json --pretty

CLI arguments

--url: target website URL.
--timeout: HTTP timeout in seconds (default 15).
--output: output format (currently only json is supported).
--pretty: pretty-print the JSON written to the console or file.
--save: write JSON output to file.
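
The flags above map naturally onto Python's standard argparse module. The sketch below is a hypothetical reconstruction for illustration, not the tool's actual source: the build_parser name, the help strings, treating --url as required, and parsing --timeout as a float are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the tool's source)."""
    parser = argparse.ArgumentParser(prog="scraper_tool")
    # Assumption: --url is mandatory; the README does not say so explicitly.
    parser.add_argument("--url", required=True, help="target website URL")
    # Assumption: fractional timeouts are accepted; the documented default is 15.
    parser.add_argument("--timeout", type=float, default=15, help="HTTP timeout in seconds")
    parser.add_argument("--output", choices=["json"], default="json", help="output format")
    parser.add_argument("--pretty", action="store_true", help="pretty-print JSON")
    parser.add_argument("--save", metavar="PATH", default=None, help="write JSON output to file")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--url", "https://example.com", "--pretty"])
print(args.url, args.timeout, args.pretty, args.save)
```

Run python3 -m scraper_tool --help to see the authoritative list of options and defaults.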

Python API

from scraper_tool import analyze_url

result = analyze_url("https://example.com", timeout=15)
print(result["technologies"])
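
Callers will usually want to check the error and warnings keys before trusting the other fields. The helper below is a minimal sketch that works on a plain dict, so it runs without network access. It assumes, without confirmation from this README, that error is None or empty on success and that warnings is a list. The summarize name and the sample values are made up for illustration.

```python
from typing import Any


def summarize(result: dict[str, Any]) -> str:
    """Return a one-line summary of a result dict.

    Assumes (not confirmed by the README) that `error` is None or empty on
    success and that `warnings` is a list.
    """
    if result.get("error"):
        return f"failed: {result['error']}"
    warning_count = len(result.get("warnings") or [])
    title = result.get("title") or "(no title)"
    return f"{result.get('status_code')} {title} ({warning_count} warnings)"


# Stand-in for the dict returned by analyze_url(...); the values are invented.
sample = {"status_code": 200, "title": "Example Domain", "warnings": [], "error": None}
print(summarize(sample))
```

In real use, pass the return value of analyze_url to summarize in place of the sample dict.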

Output schema

Stable top-level keys (in order):

input_url
final_url
status_code
fetched_at
title
description
motto_best
motto_candidates
technologies
page_details
warnings
error
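
Because the top-level keys are described as stable, a consumer can guard against schema drift with a small check. The following is a sketch under that assumption. EXPECTED_KEYS copies the list above, while check_top_level_keys and has_documented_order are hypothetical helper names that are not part of the library.

```python
# Key names copied from the documented schema, in the documented order.
EXPECTED_KEYS = (
    "input_url", "final_url", "status_code", "fetched_at", "title", "description",
    "motto_best", "motto_candidates", "technologies", "page_details", "warnings", "error",
)


def check_top_level_keys(result: dict) -> tuple[list[str], list[str]]:
    """Return (missing, unexpected) key names relative to the documented schema."""
    missing = [key for key in EXPECTED_KEYS if key not in result]
    unexpected = [key for key in result if key not in EXPECTED_KEYS]
    return missing, unexpected


def has_documented_order(result: dict) -> bool:
    """True when the dict holds exactly the documented keys in the documented order."""
    return list(result) == list(EXPECTED_KEYS)


# A placeholder dict with every documented key set to None.
placeholder = dict.fromkeys(EXPECTED_KEYS)
print(check_top_level_keys(placeholder), has_documented_order(placeholder))
```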

Testing

Run the full test suite:

python3 -m pytest -q

Limitations

HTML-only scraping (no JavaScript rendering/browser execution).
Technology detection is heuristic and may miss custom stacks.
Network-dependent scraping can be affected by bot protection/rate limits.
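
To make the first two limitations concrete, here is a toy signature matcher. It is an illustration only: the SIGNATURES table and the detect_technologies function are hypothetical and are not taken from the tool, whose real signature list is not shown in this README. The sketch shows why static HTML can be matched while a JavaScript-rendered shell yields nothing.

```python
# Illustrative signatures only; the tool's real signature list is not shown here.
SIGNATURES = {
    "WordPress": ("wp-content/", "wp-includes/"),
    "Google Tag Manager": ("googletagmanager.com/gtm.js",),
}


def detect_technologies(html: str) -> list[str]:
    """Return the names whose signature substrings occur in the raw HTML."""
    lowered = html.lower()
    return [
        name
        for name, needles in SIGNATURES.items()
        if any(needle in lowered for needle in needles)
    ]


# Server-rendered markup exposes asset paths that a signature can match.
static_html = '<link rel="stylesheet" href="/wp-content/themes/x/style.css">'
# A JavaScript shell exposes only a bundle; the real markup appears after execution.
js_shell = '<div id="root"></div><script src="/bundle.js"></script>'

print(detect_technologies(static_html))  # ['WordPress']
print(detect_technologies(js_shell))     # []
```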

Troubleshooting

If requests time out, increase --timeout.
If the output contains validation_error, verify the URL scheme and domain format.
If technology detection returns nothing, check whether the site loads its key assets only via JavaScript.
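
A quick pre-check can catch the most common cause of a validation_error before the tool is invoked. The sketch below uses the standard library's urllib.parse. The looks_like_http_url name and the specific rules (http or https scheme, non-empty host) are assumptions, since the tool's actual validation logic is not documented here.

```python
from urllib.parse import urlsplit


def looks_like_http_url(url: str) -> bool:
    """Rough pre-check: http(s) scheme and a non-empty host.

    Assumption: this approximates, but does not reproduce, the tool's validation.
    """
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit raises ValueError for malformed input such as a bad IPv6 literal.
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)


for candidate in ("https://example.com", "example.com", "ftp://example.com"):
    print(candidate, looks_like_http_url(candidate))
```

A bare domain such as example.com fails the check because it has no scheme; adding https:// in front is the usual fix.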
