Welcome to the SICSS Web Scraping Workshop! In this exercise, you'll learn how to use Selenium to automate web browsers and extract data from websites. This repository contains the code and instructions you'll need to get started. Three files in this repository are used to scrape data, listed here in increasing order of complexity.
test_selenium.py - Checks that you have everything set up properly. It opens a Chrome window and navigates to the UCLA homepage.
practice.py - A "sandbox" where you can practice on a wonderful website made for this exact purpose. It will take us to
correctional_facilities.py - A more complex, real-world example of how to scrape public data from the Bureau of Prisons website.
I recommend using an IDE like Visual Studio Code (VS Code) to write your code. It will make your life much easier. However, if you want something more lightweight, a text editor like Sublime Text or Notepad++ will work just fine.
git clone https://github.com/jakemanderson/sicss.git
A virtual environment keeps the project’s packages separate from the rest of your computer.
python3 -m venv venv
macOS / Linux
source venv/bin/activate
Your prompt should now show (venv) at the start.
Windows (PowerShell)
venv\Scripts\Activate.ps1
Your prompt should now show (venv) at the start.
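If you are unsure whether activation worked, a quick check from Python can confirm it. This is a minimal sketch: the helper name `in_virtualenv` is made up for illustration, and it assumes the environment was created with the standard `venv` module as in the command above.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the venv directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter path:", sys.executable)
```

When the environment is active, the interpreter path should sit inside the `venv` folder you just created.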
pip install selenium webdriver-manager pandas
To confirm Selenium is installed:
pip show selenium
You should see something like:
Name: selenium
Version: 4.23.1
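The same check can be done for all three packages at once from Python. The sketch below uses the standard-library `importlib.metadata` module (Python 3.8+); the helper name `installed_version` is an assumption made for this example, not part of the repository.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string for a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report on the three packages installed in the previous step.
for name in ("selenium", "webdriver-manager", "pandas"):
    version = installed_version(name)
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If any line reports NOT INSTALLED, make sure the virtual environment is active and run the pip install command again.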
Use the test_selenium.py file to test Selenium with Chrome.
To run interactively, navigate to /sicss/ and run:
python
You will see three angle brackets, indicating that you are now in the Python interpreter. It will look like:
>>>
Then, copy and paste the first part of the script into the interpreter:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
options = Options()
service = Service()
driver = webdriver.Chrome(options=options, service=service)
You should now have a new Chrome window open. As you run commands in the Python interpreter, you can interact with the Chrome window.
driver.get("https://www.ucla.edu")
Your browser should load the UCLA homepage.
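To confirm from code that the page actually loaded, you can read a few properties back from the driver. The helper below is a small sketch with a made-up name (`page_summary`); it only relies on the `get` method and the `title`, `current_url`, and `page_source` attributes that Selenium drivers expose.

```python
def page_summary(driver, url):
    """Load a URL and return a few facts about the page that ended up loaded."""
    driver.get(url)
    return {
        "title": driver.title,                    # text of the page's <title> tag
        "url": driver.current_url,                # final URL after any redirects
        "html_length": len(driver.page_source),   # size of the current HTML
    }


# In the interpreter session, with `driver` already created:
# >>> page_summary(driver, "https://www.ucla.edu")
# >>> driver.quit()   # closes the browser window when you are finished
```

Calling `driver.quit()` at the end of a session keeps stray Chrome windows and driver processes from piling up.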
Now, you're ready to proceed to practice.py to try some scraping!
To get more information about an element so you know how to select it, use Chrome DevTools. You can open DevTools by right-clicking an element and clicking "Inspect".
Then, you can see the element's HTML and CSS, and you can hover over the element to see its attributes.
Even more conveniently, you can use the element picker (the arrow icon in the top-left corner of DevTools): as you hover over different parts of the page, it highlights each element and shows where it sits in the HTML.
Once your element is selected, you can see the element's attributes in the right panel. You can use the attributes to select the element in your code. There are many ways to select the element, and we want to use the one that will uniquely identify the element on the page. If you are too broad in your selection, you may select multiple elements, and you will not be able to interact with the one you want.
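One way to catch a selection that is too broad is to count the matches before interacting with anything. The helper below is a sketch, not part of the workshop files: `find_unique` is a hypothetical name, and the locators in the usage comments are examples only. It relies on Selenium's `find_elements` (plural), which returns a list of every match.

```python
def find_unique(driver, by, value):
    """Return the single element matching a locator, or raise if it is ambiguous."""
    # find_elements returns a (possibly empty) list instead of raising,
    # which makes it convenient for counting matches.
    matches = driver.find_elements(by, value)
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly 1 element for {by}={value!r}, found {len(matches)}"
        )
    return matches[0]


# Usage with a real driver (example locators, adjust to your page):
# from selenium.webdriver.common.by import By
# find_unique(driver, By.ID, "email")      # ids are usually unique on a page
# find_unique(driver, By.TAG_NAME, "a")    # raises on most pages: too broad
```

If the helper raises, go back to DevTools and look for a more specific attribute, such as an id or a distinctive class combination.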
As you continue to improve your Selenium scraping skills, I recommend making more advanced and custom functions that are as reusable as possible. Here are some resources:
What's the use of these functions?
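import random
import time

from selenium.webdriver.common.by import By

def funct1(element, text, delay=0.1):
    # Send the text one character at a time, pausing briefly between keystrokes.
    for char in text:
        element.send_keys(char)
        # Randomize each pause around `delay` (0.08-0.15 s at the default of 0.1).
        time.sleep(random.uniform(0.8 * delay, 1.5 * delay))

def funct2(driver):
    # Hard-coded login details.
    username = "jbruin@g.ucla.edu"
    password = "J0hn_W00d3n-Ce3n73R"
    # Locate the form fields by their id attributes.
    email_input = driver.find_element(By.ID, "email")
    password_input = driver.find_element(By.ID, "password")
    funct1(email_input, username, delay=0.1)
    funct1(password_input, password, delay=0.1)
    driver.find_element(By.ID, "btn-login").click()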
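One possible step toward making a function like funct2 more reusable is to keep login details out of the source file and read them at run time. The sketch below is an assumption-laden example: `load_credentials` and the environment variable names `SCRAPER_USERNAME` and `SCRAPER_PASSWORD` are hypothetical and do not appear in the repository.

```python
import os


def load_credentials(user_var="SCRAPER_USERNAME", pass_var="SCRAPER_PASSWORD"):
    """Read login details from environment variables instead of the source file."""
    # The variable names are placeholders chosen for this example.
    username = os.environ.get(user_var)
    password = os.environ.get(pass_var)
    if not username or not password:
        raise RuntimeError(f"set {user_var} and {pass_var} before running the scraper")
    return username, password
```

A login function could then accept `username` and `password` as arguments, so the same code works for any account and nothing sensitive ends up committed to Git.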

