This project provides a two-step data pipeline to scrape Rent Adjustment Program (RAP) data from the Oakland website and then clean and harmonize the scraped data.
The project is organized into the following directories:
- 01-scraper/: Contains the web scraping script (rap_scrape.py).
- 02-cleaning/: Contains the data cleaning and harmonization script (clean_data.py).
- codebooks/: Stores CSV files used for data harmonization (e.g., tenant_codebook.csv, landlord_codebook.csv).
- data/raw/: Where the raw, scraped data is saved.
- data/cleaned/: Where the cleaned and harmonized data is saved.
This project uses Poetry for dependency management.
- Install Poetry (if you haven't already):
pip install poetry
- Install Project Dependencies:
Navigate to the project's root directory and run:
poetry install
- Install ChromeDriver:
The scraper uses Selenium, which requires a ChromeDriver executable. Make sure a ChromeDriver version compatible with your Chrome browser is installed and accessible on your system's PATH, or specify its location in 01-scraper/rap_scrape.py.
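Before starting a long scrape, it can help to confirm that ChromeDriver is actually discoverable. The following is a minimal sketch using only the Python standard library; the helper name `find_chromedriver` is hypothetical and not part of this project.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the named executable on PATH, or None if absent."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_chromedriver()
    if path is None:
        print("chromedriver was not found on PATH; install it or set its "
              "location in 01-scraper/rap_scrape.py.")
    else:
        print(f"Found chromedriver at {path}")
```

This only checks that an executable with that name exists. It does not verify that the driver version matches your installed Chrome.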
The scraper (01-scraper/rap_scrape.py) navigates the Oakland RAP website and extracts raw case data.
To run the scraper:
- Open 01-scraper/rap_scrape.py in a text editor.
- Adjust the start_date and end_date variables to define the scraping period (dates use the MM-DD-YYYY format):
start_date = '01-01-2025'  # Example: January 1, 2025
end_date = '11-01-2025'  # Example: November 1, 2025
- Execute the script from the project's root directory:
poetry run python 01-scraper/rap_scrape.py
The raw data will be saved as a CSV file in the data/raw/ directory (e.g., data_01012025_11012025.csv).
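The example file name appears to be built from the two dates with the hyphens removed. As an illustration only, the sketch below validates a date range and derives a matching file name. The function `raw_filename` is hypothetical, and the actual naming logic in rap_scrape.py may differ.

```python
from datetime import datetime

DATE_FORMAT = "%m-%d-%Y"  # MM-DD-YYYY, as in the start_date/end_date examples


def raw_filename(start_date, end_date):
    """Build a file name like data_01012025_11012025.csv from two date strings."""
    start = datetime.strptime(start_date, DATE_FORMAT)  # raises ValueError if malformed
    end = datetime.strptime(end_date, DATE_FORMAT)
    if start > end:
        raise ValueError("start_date must not be later than end_date")
    return f"data_{start:%m%d%Y}_{end:%m%d%Y}.csv"


if __name__ == "__main__":
    print(raw_filename("01-01-2025", "11-01-2025"))
```

Parsing the dates up front catches typos such as a swapped day and month before the browser session starts.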
The cleaning script (02-cleaning/clean_data.py) processes the raw data, harmonizes it, and saves a cleaned version.
To clean the latest scraped file (default behavior):
Simply run the script from the project's root directory:
poetry run python 02-cleaning/clean_data.py
The script automatically identifies the most recently created raw data file in data/raw/ and processes it.
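One way to pick the newest raw file is sketched below. This is an assumption about how such a lookup could work, not the code in clean_data.py: the helper `latest_raw_file` is hypothetical, and it uses the file modification time as a stand-in for creation time, since a true creation time is not available on every platform.

```python
from pathlib import Path


def latest_raw_file(raw_dir="data/raw"):
    """Return the most recently modified CSV in raw_dir, or None if there is none."""
    csv_files = list(Path(raw_dir).glob("*.csv"))
    if not csv_files:
        return None
    # st_mtime is the last-modified timestamp; the largest value is the newest file.
    return max(csv_files, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    newest = latest_raw_file()
    print(newest if newest is not None else "No CSV files found in data/raw/")
```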
To clean a specific file:
You can specify an input file using the --file argument:
poetry run python 02-cleaning/clean_data.py --file ./data/raw/data_01012010_12312014.csv
Replace ./data/raw/data_01012010_12312014.csv with the actual path to the raw data file you wish to clean.
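For readers curious how an optional --file argument of this kind is typically wired up, here is a minimal sketch with Python's standard argparse module. The function name `parse_args` and the help text are assumptions; only the --file flag itself comes from this README.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; --file is optional and defaults to None."""
    parser = argparse.ArgumentParser(description="Clean a raw RAP data file.")
    parser.add_argument(
        "--file",
        default=None,
        help="Path to a raw CSV file; if omitted, the newest file in data/raw/ is used.",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    if args.file is None:
        print("No --file given; falling back to the latest raw file.")
    else:
        print(f"Cleaning {args.file}")
```

Defaulting to None lets the script distinguish "no file given" from an explicit path and fall back to the latest-file behavior.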
The cleaned data will be saved as a CSV file in the data/cleaned/ directory (e.g., cleaned_data_01012025_11012025.csv).
- selenium.common.exceptions.InvalidSessionIdException: This error often occurs if the browser window controlled by Selenium is closed unexpectedly, or if the computer goes to sleep during a long scrape. The scraper is designed to catch this and save any data scraped up to that point. You may need to restart the scrape.
- selenium.common.exceptions.NoSuchElementException: This error can appear in the logs when the scraper attempts to find an element (like a "Landlord Grounds" table) that is not present on a particular case's detail page. This is often expected behavior, as not all cases will have all possible data fields. The script is designed to handle these cases gracefully by recording NaN for missing data.
- petition_number type: The petition_number is extracted as a string to preserve any leading zeros or non-numeric characters. The cleaning script ensures it remains a string type.
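The NaN-on-missing-element behavior described above follows a common pattern, sketched here under stated assumptions. The helper `extract_or_nan` is hypothetical and is not taken from rap_scrape.py. The fallback exception class exists only so the snippet runs without Selenium installed.

```python
import math

try:
    from selenium.common.exceptions import NoSuchElementException
except ImportError:
    class NoSuchElementException(Exception):
        """Stand-in used only when Selenium is not installed."""


def extract_or_nan(find_value):
    """Call find_value(); return its result, or NaN if the element is missing."""
    try:
        return find_value()
    except NoSuchElementException:
        return math.nan


# Hypothetical usage inside a scraper (the element ID is made up):
# grounds = extract_or_nan(
#     lambda: driver.find_element(By.ID, "landlord-grounds").text
# )
```

Catching only NoSuchElementException keeps other failures, such as a lost browser session, visible instead of silently turning them into missing values.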