Josebrown92/web-scraper

Web Scraping Hockey Statistics

Overview

This project demonstrates a basic web‑scraping workflow using Python to extract tabular hockey statistics from a web page, transform the data into a structured format, and persist the results to a CSV file for downstream analysis.

The implementation is provided as a Jupyter Notebook (webscraping.ipynb) and is intended for instructional and exploratory use.

Features

  • Fetches and parses an HTML table containing hockey statistics
  • Extracts table rows and cells using BeautifulSoup
  • Cleans and normalizes text values
  • Stores the extracted data in a Pandas DataFrame
  • Exports the dataset to a CSV file (Hockey.csv)

Technology Stack

  • Python 3.x
  • Jupyter Notebook
  • Requests (for HTTP requests)
  • BeautifulSoup (bs4) (for HTML parsing)
  • Pandas (for data manipulation and storage)

Prerequisites

Ensure the following packages are installed in your Python environment:

pip install requests beautifulsoup4 pandas

If you are using Jupyter:

pip install notebook

Project Structure

.
├── webscraping.ipynb   # Main notebook containing the scraping logic
├── Hockey.csv          # Output file generated by the notebook
└── README.md           # Project documentation

How It Works

  1. A target web page containing a hockey statistics table is requested.
  2. The HTML content is parsed using BeautifulSoup.
  3. The relevant <table> element is located.
  4. Each table row (<tr>) is iterated over and cell values (<td>) are extracted.
  5. Extracted text is cleaned using .strip().
  6. Each row is appended to a Pandas DataFrame.
  7. The DataFrame is saved to a CSV file in the project directory.
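The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names `fetch_html` and `table_to_dataframe` and the sample markup are hypothetical, and the sketch assumes the statistics sit in the first `<table>` on the page. It also collects rows in a list and builds the DataFrame once, rather than appending row by row as described in step 6.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Step 1: request the page and return its HTML text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def table_to_dataframe(html: str) -> pd.DataFrame:
    """Steps 2-6: parse the HTML and turn the first <table> into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")  # assumption: the first table is the one we want
    if table is None:
        raise ValueError("no <table> element found")
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text().strip() for td in tr.find_all("td")]
        if cells:  # rows made only of <th> cells yield no <td> values
            rows.append(cells)
    return pd.DataFrame(rows)


# Hypothetical sample markup standing in for the real page.
SAMPLE_HTML = """
<table>
  <tr><th>Team</th><th>Wins</th></tr>
  <tr><td> Example A </td><td> 10 </td></tr>
  <tr><td> Example B </td><td> 7 </td></tr>
</table>
"""

df = table_to_dataframe(SAMPLE_HTML)
# Step 7 would then be: df.to_csv("Hockey.csv", index=False)
```

With a real page, `table_to_dataframe(fetch_html(url))` would replace the sample markup, where `url` is whatever page the notebook targets.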

Usage

  1. Open the notebook:

    jupyter notebook webscraping.ipynb

  2. Run all cells in sequence.

  3. After execution, a file named Hockey.csv will be created in the current directory.

Output

The output CSV file contains one row per team (or record) and one column per statistic, with values taken from the source table after surrounding whitespace is trimmed.

Notes and Limitations

  • The scraper depends on the structure of the target website. If the page's HTML changes (for example, the table's markup or position), the scraping code may need to be updated.
  • This project does not include rate‑limiting or advanced error handling.
  • Always review and comply with the target website’s robots.txt and terms of service before scraping.
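As one possible way to address the last two points, the sketch below checks a robots.txt policy with Python's standard `urllib.robotparser` and pauses before each request. The helper names, the sample robots.txt content, and the example.com URLs are hypothetical. A real run would load the target site's own file with `set_url(...)` and `read()` instead of parsing an in-memory string.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str,
              user_agent: str = "*", delay_seconds: float = 1.0) -> bool:
    """Return False if robots.txt disallows the URL; otherwise wait, then return True."""
    if not parser.can_fetch(user_agent, url):
        return False
    time.sleep(delay_seconds)  # crude rate limit: fixed pause before each request
    return True


parser = build_parser(SAMPLE_ROBOTS)
allowed = may_fetch(parser, "https://example.com/pages/stats", delay_seconds=0)
blocked = may_fetch(parser, "https://example.com/private/stats", delay_seconds=0)
```

A fixed pause is the simplest form of rate limiting. It does not replace reading the site's terms of service, which robots.txt does not cover.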

Possible Enhancements

  • Add column headers explicitly for clarity
  • Implement exception handling for network and parsing errors
  • Parameterize the target URL
  • Add logging instead of print() statements
  • Package the logic into reusable functions or a module
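Several of these enhancements could be combined along the following lines. This is only a sketch under stated assumptions: `fetch_page`, `rows_to_dataframe`, the logger name, and the column names are hypothetical and do not come from the notebook.

```python
import logging
from typing import List, Optional

import pandas as pd
import requests

logger = logging.getLogger("webscraper")  # hypothetical logger name


def fetch_page(url: str, timeout: float = 10.0) -> Optional[str]:
    """Fetch a caller-supplied URL; log and return None on any request error."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.RequestException as exc:
        logger.error("Request to %s failed: %s", url, exc)
        return None
    logger.info("Fetched %s (%d characters)", url, len(response.text))
    return response.text


def rows_to_dataframe(rows: List[List[str]], headers: List[str]) -> pd.DataFrame:
    """Build a DataFrame with explicit column headers."""
    return pd.DataFrame(rows, columns=headers)


# An invalid URL fails inside requests before any network access happens.
result = fetch_page("not-a-url")

# Hypothetical column names and values, for illustration only.
labeled = rows_to_dataframe([["Example A", "10"]], ["Team", "Wins"])
```

Returning `None` on failure keeps the notebook running so that later cells can decide how to react. Re-raising the exception would be the stricter alternative.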

License

This project is provided for educational purposes. No warranty is expressed or implied.
