This project demonstrates a basic web‑scraping workflow using Python to extract tabular hockey statistics from a web page, transform the data into a structured format, and persist the results to a CSV file for downstream analysis.
The implementation is provided as a Jupyter Notebook (webscraping.ipynb) and is intended for instructional and exploratory use.
- Fetches and parses an HTML table containing hockey statistics
- Extracts table rows and cells using BeautifulSoup
- Cleans and normalizes text values
- Stores the extracted data in a Pandas DataFrame
- Exports the dataset to a CSV file (Hockey.csv)
- Python 3.x
- Jupyter Notebook
- Requests (for HTTP requests)
- BeautifulSoup (bs4) (for HTML parsing)
- Pandas (for data manipulation and storage)
Ensure the following packages are installed in your Python environment:
pip install requests beautifulsoup4 pandas
If you are using Jupyter:
pip install notebook
├── webscraping.ipynb # Main notebook containing the scraping logic
├── Hockey.csv # Output file generated by the notebook
└── README.md # Project documentation
- A target web page containing a hockey statistics table is requested.
- The HTML content is parsed using BeautifulSoup.
- The relevant <table> element is located.
- Each table row (<tr>) is iterated over and cell values (<td>) are extracted.
- Extracted text is cleaned using .strip().
- Each row is appended to a Pandas DataFrame.
- The DataFrame is saved to a CSV file in the project directory.
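The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names `fetch_html` and `table_to_dataframe`, the sample markup, and the choice of the first `<table>` on the page are all assumptions. It also collects rows in a list and builds the DataFrame once, a common alternative to appending row by row.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Request a page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail early on 4xx/5xx responses
    return response.text


def table_to_dataframe(html: str) -> pd.DataFrame:
    """Parse the first <table> in `html` into a DataFrame of stripped cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")  # assumption: the first table is the target
    if table is None:
        raise ValueError("no <table> element found")

    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text().strip() for td in tr.find_all("td")]
        if cells:  # rows made only of <th> cells yield no <td> values
            rows.append(cells)
    return pd.DataFrame(rows)


# Made-up sample markup standing in for the real page.
SAMPLE_HTML = """
<table>
  <tr><th>Team</th><th>Wins</th></tr>
  <tr><td> Team A </td><td> 44 </td></tr>
  <tr><td> Team B </td><td> 31 </td></tr>
</table>
"""

df = table_to_dataframe(SAMPLE_HTML)

# With no path argument, to_csv returns the CSV as a string;
# pass a path such as "Hockey.csv" to write the file to disk instead.
csv_text = df.to_csv(index=False)
```

For a live page, the HTML would come from `fetch_html(url)` with whatever URL the notebook targets.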
- Open the notebook:
  jupyter notebook webscraping.ipynb
- Run all cells in sequence.
- After execution, a file named Hockey.csv will be created in the current directory.
The output CSV file contains one row per team (or record) and one column per statistic, with cell values taken from the source table after surrounding whitespace is stripped.
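One way to check the result is to load the file back with pandas. The sketch below uses an in-memory stand-in with made-up values instead of the real Hockey.csv; the column layout shown is an assumption, since the actual columns depend on the source table.

```python
import io

import pandas as pd

# Made-up stand-in for the contents of Hockey.csv; the real columns may differ.
sample_csv = io.StringIO("0,1\nTeam A,44\nTeam B,31\n")

# For the real file, replace `sample_csv` with the path "Hockey.csv".
df = pd.read_csv(sample_csv)

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first few records
```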
- The scraper depends on the structure of the target website. Any HTML changes may require code updates.
- This project does not include rate‑limiting or advanced error handling.
- Always review and comply with the target website’s robots.txt and terms of service before scraping.
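As a rough illustration of the last two notes, the standard library's `urllib.robotparser` can evaluate robots.txt rules, and a fixed delay between requests is a simple form of rate limiting. The rules, the `example.com` URLs, and the `polite_pause` helper below are made up for the example and are not part of the notebook.

```python
import time
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules; a real check would load the site's own file
# with parser.set_url(...) followed by parser.read().
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# can_fetch(useragent, url) reports whether the rules permit fetching the URL.
allowed = parser.can_fetch("*", "https://example.com/stats")
blocked = parser.can_fetch("*", "https://example.com/private/data")


def polite_pause(seconds: float = 1.0) -> None:
    """Sleep between requests (hypothetical helper for basic rate limiting)."""
    time.sleep(seconds)
```

A robots.txt check does not replace reading the site's terms of service, which still needs to be done by hand.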
- Add column headers explicitly for clarity
- Implement exception handling for network and parsing errors
- Parameterize the target URL
- Add logging instead of print() statements
- Package the logic into reusable functions or a module
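Several of these enhancements could be combined along the following lines. This is only a sketch under stated assumptions: the logger name, the function names `fetch_html` and `rows_to_dataframe`, and the column names are hypothetical and do not come from the notebook.

```python
import logging
from typing import Optional

import pandas as pd
import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("hockey_scraper")  # hypothetical logger name


def fetch_html(url: str, timeout: float = 10.0) -> Optional[str]:
    """Fetch `url` (passed as a parameter) and return its HTML, or None on failure."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, invalid URLs, and HTTP error codes.
        logger.error("Request for %s failed: %s", url, exc)
        return None
    logger.info("Fetched %d characters from %s", len(response.text), url)
    return response.text


def rows_to_dataframe(rows: list, columns: list) -> pd.DataFrame:
    """Build a DataFrame with explicit column headers."""
    return pd.DataFrame(rows, columns=columns)


# Made-up row and column names, for illustration only.
df = rows_to_dataframe([["Team A", "44"]], ["Team", "Wins"])
```

Returning None on failure keeps the caller in control of what happens next; re-raising the exception after logging would be an equally reasonable choice.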
This project is provided for educational purposes. No warranty is expressed or implied.