Officemonster

This project is a comprehensive web scraping application designed to collect product data from a specified e-commerce website. The project consists of two main Jupyter notebooks: URLS Scraper.ipynb and Product Scraper.ipynb.

Installation

Clone the repository:

git clone https://github.com/faisal-fida/officemonster.git
cd officemonster

Install the dependencies:

pip install -r requirements.txt

Make sure you have the required CSV files in the URLS directory.

Usage

URLS Scraper

Open URLS Scraper.ipynb to run the script that processes multiple CSV files in the URLS directory, concatenates them, and saves the combined URLs to combined_urls.csv.
The script then extracts URLs from combined_urls.csv and saves them to urls.csv.

Product Scraper

Open Product Scraper.ipynb to run the script that reads URLs from urls.csv and scrapes product details.
The script utilizes BeautifulSoup to parse HTML content and extract relevant product information such as title, price, images, and descriptions.

Features

Data Aggregation: Combining multiple CSV files into a single DataFrame while maintaining data integrity.
Web Scraping: Handling dynamic web content and possible changes in website structure.
Error Handling: Managing HTTP errors and ensuring the script continues to run smoothly.
Efficient Data Handling: Used pandas to efficiently read, concatenate, and save CSV files.
Robust Web Scraping: Implemented functions to handle HTTP requests and parse HTML with BeautifulSoup, ensuring data is extracted even if some pages do not follow the expected structure.
Error Management: Added try-except blocks to catch and log HTTP errors, allowing the script to skip problematic URLs and continue processing others.
Data Consistency: Ensuring all CSV files have a consistent format and contain valid URLs.
Website Variability: Handling variations in web page design that could affect scraping logic.
Performance: Optimizing the script to handle large datasets and multiple HTTP requests efficiently.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
URL Processing		URL Processing
Product Scraper.ipynb		Product Scraper.ipynb
README.md		README.md
urls.csv		urls.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Officemonster

Table of Contents

Installation

Usage

URLS Scraper

Product Scraper

Features

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Officemonster

Table of Contents

Installation

Usage

URLS Scraper

Product Scraper

Features

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages