Movie Web Scraping Project

This project extracts information about the top 50 highest-rated movies from a web archive and saves the data to both CSV and SQLite database formats.

Overview

The script scrapes movie data from the archived webpage of "100 Most Highly-Ranked Films" and extracts key information including movie rankings, titles, and release years.

Data Source

Features

  • Scrapes the top 50 movies from the ranked list
  • Extracts three key data points for each movie:
    • Average Rank
    • Film Title
    • Release Year
  • Saves data to multiple formats:
    • CSV file (top_50_films.csv)
    • SQLite database (Movies.db with table Top_50)
  • Robust error handling and logging
  • Automatic retry mechanism for network requests
  • Input validation and data verification
  • Cross-platform compatibility

Requirements

Python Libraries

  • requests - For HTTP requests to fetch web content
  • beautifulsoup4 - For HTML parsing and web scraping
  • pandas - For data manipulation and export
  • sqlite3 - For database operations (built-in)

Installation

Install the required packages using pip:

pip install requests beautifulsoup4 pandas
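A quick way to confirm the installation succeeded is to import each dependency and print its version (sqlite3 ships with the Python standard library, so it needs no pip install):

```python
# Sanity check: all dependencies import cleanly after installation.
import sqlite3

import bs4
import pandas
import requests

print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)
print("pandas", pandas.__version__)
print("sqlite", sqlite3.sqlite_version)
```

If any import fails with ModuleNotFoundError, re-run the pip command above in the same environment you use to run the script.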

Usage

  1. Clone or download the project files
  2. Install the required dependencies
  3. Run the script:
python webscraping_movies.py

Output Files

After running the script, you'll find:

  • top_50_films.csv: CSV file containing the scraped movie data
  • Movies.db: SQLite database file with a table named Top_50
  • webscraping.log: Log file containing detailed execution information and any errors
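The SQLite output can be spot-checked with the standard sqlite3 module. The sketch below writes a two-row DataFrame the same way pandas creates the Top_50 table, but uses an in-memory database so it runs anywhere; the actual script writes Movies.db to disk:

```python
import sqlite3

import pandas as pd

# A tiny stand-in for the scraped data, matching the Top_50 columns.
df = pd.DataFrame({
    "Average Rank": [1, 2],
    "Film": ["Film A", "Film B"],
    "Year": [1999, 2004],
})

conn = sqlite3.connect(":memory:")  # the real script opens Movies.db instead
df.to_sql("Top_50", conn, if_exists="replace", index=False)
count = conn.execute("SELECT COUNT(*) FROM Top_50").fetchone()[0]
conn.close()  # release the connection, mirroring the script's cleanup
print("Rows stored:", count)
```

Pointing the connect() call at Movies.db and running the SELECT lets you verify the real table after a scraping run.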

Data Structure

The extracted data includes the following columns:

Column          Description
Average Rank    The movie's ranking position
Film            The title of the movie
Year            The release year of the movie

Project Structure

webscraping/
├── webscraping_movies.py    # Main scraping script
├── README.md                # Project documentation
├── top_50_films.csv         # Output CSV file (generated)
├── Movies.db                # Output SQLite database (generated)
└── webscraping.log          # Log file (generated)

Technical Details

Web Scraping Process

  1. Fetch HTML: Uses requests.get() to retrieve the webpage content
  2. Parse HTML: Utilizes BeautifulSoup to parse the HTML structure
  3. Extract Data: Targets the first table (tbody) and iterates through rows (tr)
  4. Data Processing: Extracts cell data (td) and structures it into a pandas DataFrame
  5. Export: Saves the DataFrame to both CSV and SQLite formats
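Steps 2-4 can be sketched on a small inline HTML snippet standing in for the fetched page (the real script downloads the archived page with requests.get first):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Inline HTML standing in for the downloaded page (step 1 omitted here).
html = """
<table><tbody>
  <tr><td>1</td><td>The Godfather</td><td>1972</td></tr>
  <tr><td>2</td><td>Citizen Kane</td><td>1941</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")           # step 2: parse HTML
rows = soup.find_all("tbody")[0].find_all("tr")     # step 3: first tbody, its rows

records = []
for row in rows:                                    # step 4: extract cell data
    cells = row.find_all("td")
    if len(cells) >= 3:  # skip malformed rows instead of crashing
        records.append({
            "Average Rank": cells[0].get_text(strip=True),
            "Film": cells[1].get_text(strip=True),
            "Year": cells[2].get_text(strip=True),
        })

df = pd.DataFrame(records)
print(df)
```

From here, step 5 is a pair of calls: df.to_csv("top_50_films.csv", index=False) and df.to_sql("Top_50", conn, if_exists="replace", index=False).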

Error Handling

The script includes comprehensive error handling and reliability features:

  • Network Error Handling: Automatic retry mechanism (up to 3 attempts) for network failures
  • Timeout Protection: 30-second timeout for HTTP requests to prevent hanging
  • URL Validation: Validates URL format before making requests
  • Data Validation: Checks for empty or invalid data before processing
  • File System Errors: Handles permission errors and invalid file paths
  • Database Errors: Proper SQLite error handling with connection cleanup
  • Logging System: Comprehensive logging to both file and console
  • Graceful Failure: Detailed error messages and proper exit codes
  • Progress Tracking: Real-time logging of extraction progress
  • Cross-platform Support: Uses relative paths for better compatibility
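The retry, timeout, and logging behaviour described above might look like the following minimal sketch; fetch_with_retry and the backoff policy are illustrative names and choices, not the script's exact implementation:

```python
import logging
import time

import requests

logging.basicConfig(level=logging.INFO)  # the script also logs to webscraping.log

def fetch_with_retry(url, attempts=3, timeout=30):
    """Fetch a URL, retrying on network failure (illustrative sketch)."""
    for attempt in range(1, attempts + 1):
        try:
            response = requests.get(url, timeout=timeout)  # 30 s timeout by default
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:
            logging.warning("Attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # graceful failure: caller logs the error and exits non-zero
            time.sleep(2 ** attempt)  # simple backoff between retries
```

Wrapping all network access in one function like this keeps the retry and timeout policy in a single place and makes the "up to 3 attempts" behaviour easy to adjust.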

Notes

  • The script targets a specific archived webpage to ensure consistent data availability
  • Improved path handling: Uses relative paths for better cross-platform compatibility
  • The database connection is properly closed after operations to prevent resource leaks
  • Logging: Check webscraping.log for detailed execution information and troubleshooting
  • Error Recovery: The script will attempt to continue processing even if some rows fail
  • Performance: Includes timeout and retry mechanisms to handle slow or unreliable connections

License

This project is for educational purposes as part of a Coursera Data Engineering course.
