Let's use an LLM to create a scraper for the Massachusetts state elections website.
- First, clone this repo
- Set up your Python virtual environment
  - Just in case your virtual environment from yesterday is still active, run `deactivate`. If doing so gives you an error, don't worry!
  - Create a new Python virtual environment: `python3 -m venv scrape_env`
  - Activate the new virtual environment: `source ./scrape_env/bin/activate`
  - Install the `lxml`, `requests`, and `BeautifulSoup` Python libraries: `pip install lxml requests beautifulsoup4`
- Visit the website: https://electionstats.state.ma.us/elections/search/year_from:2025/year_to:2025
- Check the `robots.txt` file (visit https://electionstats.state.ma.us/robots.txt). What preferences does this site express about whether it wants to be scraped?
- Inspect the page to identify the table where election data is stored:
  a. Right-click anywhere on the page > Click "Inspect" > find the table with election data
- Read the IMDB scraper example, which you will adapt to scrape the elections website, and discuss with your teammate what each line does.
- Make a new scraper in `election-scraper.py` that scrapes information from the MA state government website. Complete the following tasks by copying, pasting, and modifying lines from the IMDB scraper code one by one.
  a. Task 1: grab each row of the table (What is the CSS selector for this?)
  b. Task 2: extract the year and print it out
  c. Task 3: extract the name of the candidate and print it out
  d. Task 4: extract the name of the party and print it out
- Ask the LLM to help you write the code to save the data to a CSV file.
- Adjust your scraper to only grab 2024 elections. Hint: try manually adjusting the year filters on the elections website. Does the URL change?
- Bonus: Extract additional data fields!
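
A few of the steps above (reading `robots.txt`, changing the year filter, and saving to CSV) need only the Python standard library. Here is a minimal sketch of those pieces, not a finished scraper. The helper names (`build_search_url`, `allowed_by_robots`, `save_rows_to_csv`) and the CSV column names are made up for illustration, and the URL pattern is assumed from the search link above, so check it against what you see in your browser's address bar.

```python
import csv
from urllib.robotparser import RobotFileParser

# Assumption: the search URL follows the year_from/year_to pattern in the link above.
BASE_URL = "https://electionstats.state.ma.us/elections/search"


def build_search_url(year_from, year_to=None):
    """Return a search URL for one year, or for a range of years."""
    if year_to is None:
        year_to = year_from
    return f"{BASE_URL}/year_from:{year_from}/year_to:{year_to}"


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Check a URL against the text of a robots.txt file.

    robots_txt is the file's contents as a string, for example the
    .text of a requests.get() response for the site's robots.txt.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def save_rows_to_csv(rows, path, fieldnames=("year", "candidate", "party")):
    """Write a list of dictionaries to a CSV file with a header row.

    The default column names are hypothetical; rename them to match
    whatever fields your scraper extracts.
    """
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

For example, `build_search_url(2024)` gives a search URL with both year filters set to 2024, and `save_rows_to_csv(rows, "elections.csv")` expects `rows` to be a list of dictionaries such as `{"year": "2024", "candidate": "...", "party": "..."}`. You still need to write the part that pulls those values out of the table rows yourself.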