This project extracts information about the top 50 highest-rated movies from a web archive and saves the data to both CSV and SQLite database formats.
The script scrapes movie data from the archived webpage of "100 Most Highly-Ranked Films" and extracts key information including movie rankings, titles, and release years.
- URL: 100 Most Highly-Ranked Films (Web Archive)
- Archive Date: September 2, 2023
- Scrapes the top 50 movies from the ranked list
- Extracts three key data points for each movie:
  - Average Rank
  - Film Title
  - Release Year
- Saves data to multiple formats:
  - CSV file (`top_50_films.csv`)
  - SQLite database (`Movies.db` with table `Top_50`)
- Robust error handling and logging
- Automatic retry mechanism for network requests
- Input validation and data verification
- Cross-platform compatibility
- `requests` - For HTTP requests to fetch web content
- `beautifulsoup4` - For HTML parsing and web scraping
- `pandas` - For data manipulation and export
- `sqlite3` - For database operations (built-in)
Install the required packages using pip:
```bash
pip install requests beautifulsoup4 pandas
```

- Clone or download the project files
- Install the required dependencies
- Run the script:
```bash
python webscraping_movies.py
```

After running the script, you'll find:

- `top_50_films.csv`: CSV file containing the scraped movie data
- `Movies.db`: SQLite database file with a table named `Top_50`
- `webscraping.log`: Log file containing detailed execution information and any errors
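To quickly verify the generated files, a minimal inspection snippet such as the one below can be used (illustrative only; it assumes the default file and table names listed above):

```python
import sqlite3

import pandas as pd

# Peek at the CSV output
print(pd.read_csv("top_50_films.csv").head())

# Peek at the SQLite output
conn = sqlite3.connect("Movies.db")
try:
    print(pd.read_sql_query("SELECT * FROM Top_50 LIMIT 5", conn))
finally:
    conn.close()
```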
The extracted data includes the following columns:
| Column | Description |
|---|---|
| Average Rank | The movie's ranking position |
| Film | The title of the movie |
| Year | The release year of the movie |
```text
webscraping/
├── webscraping_movies.py   # Main scraping script
├── README.md               # Project documentation
├── top_50_films.csv        # Output CSV file (generated)
├── Movies.db               # Output SQLite database (generated)
└── webscraping.log         # Log file (generated)
```
- Fetch HTML: Uses `requests.get()` to retrieve the webpage content
- Parse HTML: Utilizes BeautifulSoup to parse the HTML structure
- Extract Data: Targets the first table body (`tbody`) and iterates through its rows (`tr`)
- Data Processing: Extracts cell data (`td`) and structures it into a pandas DataFrame
- Export: Saves the DataFrame to both CSV and SQLite formats (see the sketch below)
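A condensed sketch of this flow is shown below. It is illustrative only: the archived URL is a placeholder for the one hard-coded in `webscraping_movies.py`, and the exact selectors and column handling in the script may differ.

```python
import sqlite3

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Placeholder for the archived "100 Most Highly-Ranked Films" URL used in the script
url = "https://web.archive.org/web/20230902000000/https://example.org/highest-ranked-films"

# 1. Fetch HTML (30-second timeout, as described in the next section)
response = requests.get(url, timeout=30)
response.raise_for_status()

# 2. Parse HTML
soup = BeautifulSoup(response.text, "html.parser")

# 3. Extract data: first table body, one row per film, stopping after 50 entries
rows = soup.find_all("tbody")[0].find_all("tr")
records = []
for row in rows:
    cells = row.find_all("td")
    if len(cells) >= 3:
        records.append({
            "Average Rank": cells[0].get_text(strip=True),
            "Film": cells[1].get_text(strip=True),
            "Year": cells[2].get_text(strip=True),
        })
    if len(records) == 50:
        break

# 4. Structure into a DataFrame
df = pd.DataFrame(records)

# 5. Export to CSV and SQLite, closing the connection afterwards
df.to_csv("top_50_films.csv", index=False)
conn = sqlite3.connect("Movies.db")
try:
    df.to_sql("Top_50", conn, if_exists="replace", index=False)
finally:
    conn.close()
```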
The script includes comprehensive error handling and reliability features:
- Network Error Handling: Automatic retry mechanism (up to 3 attempts) for network failures (sketched after this list)
- Timeout Protection: 30-second timeout for HTTP requests to prevent hanging
- URL Validation: Validates URL format before making requests
- Data Validation: Checks for empty or invalid data before processing
- File System Errors: Handles permission errors and invalid file paths
- Database Errors: Proper SQLite error handling with connection cleanup
- Logging System: Comprehensive logging to both file and console
- Graceful Failure: Detailed error messages and proper exit codes
- Progress Tracking: Real-time logging of extraction progress
- Cross-platform Support: Uses relative paths for better compatibility
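The retry and timeout behavior can be pictured with a small helper like the one below. This is a sketch under the assumptions stated above (3 attempts, 30-second timeout); the function name `fetch_with_retry` is hypothetical and may not match the script.

```python
import logging
import time

import requests


def fetch_with_retry(url, max_attempts=3, timeout=30):
    """Fetch a URL, retrying on network failures up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.RequestException as exc:
            logging.warning("Attempt %d/%d failed: %s", attempt, max_attempts, exc)
            if attempt == max_attempts:
                raise
            time.sleep(2)  # brief pause before the next attempt
```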
- The script targets a specific archived webpage to ensure consistent data availability
- Improved path handling: Uses relative paths for better cross-platform compatibility
- The database connection is properly closed after operations to prevent resource leaks
- Logging: Check `webscraping.log` for detailed execution information and troubleshooting (a logging setup sketch follows this list)
- Error Recovery: The script will attempt to continue processing even if some rows fail
- Performance: Includes timeout and retry mechanisms to handle slow or unreliable connections
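The file-plus-console logging described above can be configured roughly as follows (a sketch; the script's actual log format may differ):

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[
        logging.FileHandler("webscraping.log"),  # persistent log for troubleshooting
        logging.StreamHandler(),                 # real-time progress on the console
    ],
)

logging.info("Starting extraction of the top 50 films")
```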
This project is for educational purposes as part of a Coursera Data Engineering course.