This project extracts information about the top 50 highest-rated movies from a web archive and saves the data to both CSV and SQLite database formats.
The script scrapes movie data from the archived webpage of "100 Most Highly-Ranked Films" and extracts key information including movie rankings, titles, and release years.
- URL: 100 Most Highly-Ranked Films (Web Archive)
- Archive Date: September 2, 2023
- Scrapes the top 50 movies from the ranked list
- Extracts three key data points for each movie:
  - Average Rank
  - Film Title
  - Release Year
- Saves data to multiple formats:
  - CSV file (`top_50_films.csv`)
  - SQLite database (`Movies.db` with table `Top_50`)
- Robust error handling and logging
- Automatic retry mechanism for network requests
- Input validation and data verification
- Cross-platform compatibility
- `requests` - For HTTP requests to fetch web content
- `beautifulsoup4` - For HTML parsing and web scraping
- `pandas` - For data manipulation and export
- `sqlite3` - For database operations (built-in)
Install the required packages using pip:
```bash
pip install requests beautifulsoup4 pandas
```

- Clone or download the project files
- Install the required dependencies
- Run the script:
```bash
python webscraping_movies.py
```

After running the script, you'll find:

- `top_50_films.csv`: CSV file containing the scraped movie data
- `Movies.db`: SQLite database file with a table named `Top_50`
- `webscraping.log`: Log file containing detailed execution information and any errors
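To quickly verify the generated files, a minimal inspection snippet such as the one below can be used (illustrative only; it assumes the default file and table names listed above):

```python
import sqlite3

import pandas as pd

# Peek at the CSV output
print(pd.read_csv("top_50_films.csv").head())

# Peek at the SQLite output
conn = sqlite3.connect("Movies.db")
try:
    print(pd.read_sql_query("SELECT * FROM Top_50 LIMIT 5", conn))
finally:
    conn.close()
```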
The extracted data includes the following columns:
| Column | Description |
|---|---|
| Average Rank | The movie's ranking position |
| Film | The title of the movie |
| Year | The release year of the movie |
```text
webscraping/
├── webscraping_movies.py   # Main scraping script
├── README.md               # Project documentation
├── top_50_films.csv        # Output CSV file (generated)
├── Movies.db               # Output SQLite database (generated)
└── webscraping.log         # Log file (generated)
```
- Fetch HTML: Uses `requests.get()` to retrieve the webpage content
- Parse HTML: Utilizes BeautifulSoup to parse the HTML structure
- Extract Data: Targets the first table body (`tbody`) and iterates through its rows (`tr`)
- Data Processing: Extracts cell data (`td`) and structures it into a pandas DataFrame
- Export: Saves the DataFrame to both CSV and SQLite formats (see the sketch below)
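A condensed sketch of this flow is shown below. It is illustrative only: the archived URL is a placeholder for the one hard-coded in `webscraping_movies.py`, and the exact selectors and column handling in the script may differ.

```python
import sqlite3

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Placeholder for the archived "100 Most Highly-Ranked Films" URL used in the script
url = "https://web.archive.org/web/20230902000000/https://example.org/highest-ranked-films"

# 1. Fetch HTML (30-second timeout, as described in the next section)
response = requests.get(url, timeout=30)
response.raise_for_status()

# 2. Parse HTML
soup = BeautifulSoup(response.text, "html.parser")

# 3. Extract data: first table body, one row per film, stopping after 50 entries
rows = soup.find_all("tbody")[0].find_all("tr")
records = []
for row in rows:
    cells = row.find_all("td")
    if len(cells) >= 3:
        records.append({
            "Average Rank": cells[0].get_text(strip=True),
            "Film": cells[1].get_text(strip=True),
            "Year": cells[2].get_text(strip=True),
        })
    if len(records) == 50:
        break

# 4. Structure into a DataFrame
df = pd.DataFrame(records)

# 5. Export to CSV and SQLite, closing the connection afterwards
df.to_csv("top_50_films.csv", index=False)
conn = sqlite3.connect("Movies.db")
try:
    df.to_sql("Top_50", conn, if_exists="replace", index=False)
finally:
    conn.close()
```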
The script includes comprehensive error handling and reliability features:
- Network Error Handling: Automatic retry mechanism (up to 3 attempts) for network failures (sketched after this list)
- Timeout Protection: 30-second timeout for HTTP requests to prevent hanging
- URL Validation: Validates URL format before making requests
- Data Validation: Checks for empty or invalid data before processing
- File System Errors: Handles permission errors and invalid file paths
- Database Errors: Proper SQLite error handling with connection cleanup
- Logging System: Comprehensive logging to both file and console
- Graceful Failure: Detailed error messages and proper exit codes
- Progress Tracking: Real-time logging of extraction progress
- Cross-platform Support: Uses relative paths for better compatibility
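The retry and timeout behavior can be pictured with a small helper like the one below. This is a sketch under the assumptions stated above (3 attempts, 30-second timeout); the function name `fetch_with_retry` is hypothetical and may not match the script.

```python
import logging
import time

import requests


def fetch_with_retry(url, max_attempts=3, timeout=30):
    """Fetch a URL, retrying on network failures up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.RequestException as exc:
            logging.warning("Attempt %d/%d failed: %s", attempt, max_attempts, exc)
            if attempt == max_attempts:
                raise
            time.sleep(2)  # brief pause before the next attempt
```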
- The script targets a specific archived webpage to ensure consistent data availability
- Improved path handling: Uses relative paths for better cross-platform compatibility
- The database connection is properly closed after operations to prevent resource leaks
- Logging: Check `webscraping.log` for detailed execution information and troubleshooting (a logging setup sketch follows this list)
- Error Recovery: The script will attempt to continue processing even if some rows fail
- Performance: Includes timeout and retry mechanisms to handle slow or unreliable connections
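The file-plus-console logging described above can be configured roughly as follows (a sketch; the script's actual log format may differ):

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[
        logging.FileHandler("webscraping.log"),  # persistent log for troubleshooting
        logging.StreamHandler(),                 # real-time progress on the console
    ],
)

logging.info("Starting extraction of the top 50 films")
```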
This project is for educational purposes as part of a Coursera Data Engineering course.