A flexible, Dockerized web scraper that supports both static and dynamic websites using Selenium (Chrome or Firefox). Extraction is rule-based via a JSON file, and results can be written in multiple output formats such as JSON and CSV. Scrappus is developed in Python.

If you want to learn about web scraping, you can check this article. You need to know the basics of HTML5 tags to define the rules.json file, which specifies what data to extract from the target.
- Supports static and JavaScript-rendered (dynamic) websites
- Rule-based scraping via JSON file
- Output to JSON or CSV
- Dockerized for easy setup
- Supports Chrome and Firefox headless modes
git clone https://github.com/kur0bai/scrappus.git
cd scrappus
Fill in the rules.json file with the rules you need to extract data, for example:
{
"title": "h3",
"description": "meta[name='description']",
"modules": "div[class='container']",
"links": "a[href]"
}
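Before running a scrape, it can help to check that the rules file is well formed. The following is a minimal sketch, assuming each rule maps a field name to a non-empty selector string as in the example above. The helper `load_rules` and the constant `EXAMPLE_RULES` are hypothetical names for illustration and are not part of Scrappus itself.

```python
import json


def load_rules(text: str) -> dict:
    """Parse rules JSON text and check it maps field names to selector strings.

    Hypothetical helper: the expected shape is an assumption based on the
    example rules.json shown in this README.
    """
    rules = json.loads(text)
    if not isinstance(rules, dict):
        raise ValueError("rules must be a JSON object")
    for field, selector in rules.items():
        if not isinstance(selector, str) or not selector.strip():
            raise ValueError(f"rule {field!r} must be a non-empty selector string")
    return rules


# Same content as the example rules.json above.
EXAMPLE_RULES = """
{
  "title": "h3",
  "description": "meta[name='description']",
  "modules": "div[class='container']",
  "links": "a[href]"
}
"""

if __name__ == "__main__":
    for field, selector in load_rules(EXAMPLE_RULES).items():
        print(f"{field}: {selector}")
```

Failing early on a malformed rule is usually cheaper than discovering an empty column after a full browser session.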
Docker:
- Create an image:
docker build -t scrappus .
- Execute it:
docker run --rm scrappus "https://example.com" "rules.json" --output output.json --dynamic
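To make the shape of that command line concrete, here is a rough sketch of an argument parser that would accept the same arguments: a target URL, a rules file, `--output`, and `--dynamic`. This is an assumption built only from the command above, not the actual contents of main.py; `build_parser` and the help texts are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the arguments used in the docker run example."""
    parser = argparse.ArgumentParser(prog="scrappus")
    parser.add_argument("url", help="target page to scrape")
    parser.add_argument("rules", help="path to the rules JSON file")
    # No default is assumed here; the real tool may define one.
    parser.add_argument("--output", help="output file, e.g. output.json")
    parser.add_argument(
        "--dynamic",
        action="store_true",
        help="treat the page as JavaScript-rendered",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["https://example.com", "rules.json", "--output", "output.json", "--dynamic"]
    )
    print(args.url, args.rules, args.output, args.dynamic)
```

Run `python3 main.py` for the authoritative list of options; the real script may accept more or differently named flags.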
Traditional:
- Install requirements (Python virtual environment recommended):
pip install -r requirements.txt
- Run the script to see the help commands:
python3 main.py
Easy, right? 🍥 You should see the results in the output file you defined earlier.
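As an illustration of the two output formats, the sketch below picks JSON or CSV from the output file's extension. It assumes the scraped results are a list of flat dictionaries sharing the same keys; that shape, and the helper name `write_results`, are assumptions for this example rather than Scrappus's actual implementation.

```python
import csv
import json
from pathlib import Path


def write_results(rows: list, output_path: str) -> None:
    """Write a list of flat dicts to JSON or CSV, chosen by file extension."""
    path = Path(output_path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        path.write_text(
            json.dumps(rows, indent=2, ensure_ascii=False), encoding="utf-8"
        )
    elif suffix == ".csv":
        # Column order is taken from the first row; all rows must share its keys.
        fieldnames = list(rows[0].keys()) if rows else []
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
    else:
        raise ValueError(f"unsupported output format: {suffix!r}")


if __name__ == "__main__":
    sample = [{"title": "Example Domain", "links": "https://example.com"}]
    write_results(sample, "output.json")
    write_results(sample, "output.csv")
```

Opening the CSV file with `newline=""` follows the `csv` module's recommendation and avoids blank lines between rows on Windows.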
I want to clarify that this tool is open to modifications wherever they make it more useful to those who are interested. However, I am not responsible if it is used for malicious purposes, as that is not its intent. Use it at your own risk. Without further ado, I hope it helps with your projects.