Web Scraping

Introduction

This repository hosts a collection of web scraping projects that demonstrate practical data extraction, processing, and application development. The goal is to show how to gather online data, clean it, store it in a structured format, and present it through a user-friendly interface. It is intended as a learning resource for Python-based scraping, full-stack web development with Node.js and React, and data handling.

Project Overview

1. Companies Market Cap Scraper

A Python-based web scraper that extracts company market capitalization data from a public financial website.

Key Features:

Data Extraction: Retrieves the market cap figures published on the site at the time the script runs, for each listed company.
Data Cleaning: Processes and normalizes scraped values to ensure consistency and accuracy.
CSV Export: Stores the cleaned dataset in a CSV file, ready for analysis, reporting, or integration into other systems.
Technologies: Built with Python, using BeautifulSoup for HTML parsing, Requests for HTTP requests, and pandas for data manipulation.
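As a rough illustration of the cleaning step, the sketch below converts display strings such as "$2.95 T" into plain numbers. The input format, the suffix table, and the function name `parse_market_cap` are assumptions made for this example; they are not taken from `web_scraping.py`.

```python
import re

# Multipliers for the magnitude suffixes assumed to appear in the scraped text.
_SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}

# Optional "$", a number that may contain thousands separators, optional suffix.
_PATTERN = re.compile(r"^\$?\s*([\d,]*\.?\d+)\s*([KMBT])?$", re.IGNORECASE)


def parse_market_cap(text):
    """Convert a string like '$2.95 T' to a float in dollars, or None if unparseable."""
    match = _PATTERN.match(text.strip())
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").upper()
    # Values without a suffix are taken at face value.
    return number * _SUFFIXES.get(suffix, 1.0)
```

For example, `parse_market_cap("$500 M")` returns `500000000.0`, and an unrecognized value such as `"N/A"` returns `None` so it can be filtered out before export.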

2. Myntra Ecommerce Scraping App

A full-stack application that scrapes product data from an e-commerce site (mimicking Myntra) and displays it in a modern, interactive React frontend.

Key Features:

Product Data Scraping: Gathers product details such as names, prices, images, and descriptions.
Category-Based Browsing: Enables users to browse products by categories like shoes, watches, and T-shirts.
Full-Stack Architecture:
- React Frontend: Provides a dynamic and responsive user interface for displaying products.
- Node.js Backend (Express.js): Manages data retrieval, API endpoints, and serves product information to the frontend.
Structured Data Storage: Stores and renders product data in a well-organized format, facilitating easy display and management.
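Category-based browsing depends on each scraped record carrying a consistent category label. The sketch below shows one way to group records by category in Python before they are handed to a backend. The field names (`name`, `price`, `category`) and the sample records are hypothetical, and the app's own backend is written in Node.js, so this is an illustration of the idea rather than its implementation.

```python
from collections import defaultdict


def group_by_category(products):
    """Map each category name to the list of products that belong to it."""
    grouped = defaultdict(list)
    for product in products:
        # Normalize the key so "Shoes" and "shoes " land in the same bucket.
        key = product.get("category", "uncategorized").strip().lower()
        grouped[key].append(product)
    return dict(grouped)


# Hypothetical records; the real field names in the app may differ.
sample = [
    {"name": "Runner", "price": 1999, "category": "Shoes"},
    {"name": "Classic", "price": 4999, "category": "watches"},
    {"name": "Trail", "price": 2499, "category": "shoes "},
]
by_category = group_by_category(sample)
```

Here `by_category["shoes"]` holds the "Runner" and "Trail" records, in their original order.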

Web Scraping Fundamentals

Web scraping involves programmatically extracting information from websites. The general workflow demonstrated in these projects includes:

Sending HTTP Requests: Initiating requests to target web pages.
Receiving HTML Response: Obtaining the raw HTML content of the page.
Parsing Content: Analyzing the HTML structure to locate relevant data.
Extracting Fields: Pulling out specific data points (e.g., product names, prices).
Cleaning & Normalizing: Standardizing extracted data for consistency.
Storing Output: Saving the processed data in formats like CSV or into a database for application use.
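The workflow above can be sketched end to end in a few lines. The projects themselves use Requests and BeautifulSoup; to keep this example runnable without installing anything, it swaps in the standard library's `html.parser` and `csv` modules and parses an inline HTML string instead of a live response. The HTML snippet, class names, and function names are all made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser

# Stand-in for the HTML a real HTTP request would return (request + response).
SAMPLE_HTML = """
<table>
  <tr><td class="name">Acme Corp</td><td class="cap"> $1,200 </td></tr>
  <tr><td class="name">Globex</td><td class="cap">$850</td></tr>
</table>
"""


class CellParser(HTMLParser):
    """Collect the text of every <td>, grouped by the <tr> that contains it."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self._in_cell:
            self.rows[-1][-1] += data


def scrape(html):
    parser = CellParser()
    parser.feed(html)  # parse the content and extract the fields
    cleaned = []
    for name, cap in parser.rows:
        # Clean and normalize: drop whitespace, "$", and thousands separators.
        value = int(cap.strip().lstrip("$").replace(",", ""))
        cleaned.append({"name": name.strip(), "market_cap": value})
    return cleaned


def to_csv(records):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "market_cap"])
    writer.writeheader()
    writer.writerows(records)  # store the output as CSV text
    return buffer.getvalue()


records = scrape(SAMPLE_HTML)
csv_text = to_csv(records)
```

In a real scraper the inline string would be replaced by the body of an HTTP response, and the CSV text would be written to a file rather than kept in memory.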

Key Technologies

This repository leverages a diverse set of technologies across its projects:

Backend & Scraping:
- Python: Primary language for data scraping and processing.
- BeautifulSoup: Python library for parsing HTML and XML documents.
- Requests: Python library for making HTTP requests.
- pandas: Python library for data manipulation and analysis, especially with DataFrames.
- Node.js: JavaScript runtime for the backend server.
- Express.js: Web application framework for Node.js, used for building APIs.
- MongoDB: NoSQL database for storing product data.
- CSV: Output format for structured data.

Frontend:
- React: JavaScript library for building user interfaces.
- HTML5, CSS3, JavaScript: Core web technologies.

Features & Learning Outcomes

By exploring these projects, you will gain practical experience and insights into:

Web Page Inspection: Understanding how to analyze web page structures and identify data points for extraction.
HTML Parsing: Mastering selector-based scraping techniques using libraries like BeautifulSoup.
Data Transformation: Cleaning, normalizing, and transforming raw scraped data into usable formats.
Structured Output: Generating CSV files and handling structured data for various applications.
Full-Stack Development: Integrating frontend (React) and backend (Node.js/Express.js) components.
Dynamic Data Presentation: Implementing category-based filtering and efficient product display.
Performance Optimization: Developing an awareness of loading times and optimizing data handling.
Python Automation: Automating data collection tasks with Python scripts.

Getting Started

Follow these steps to set up and run the projects in your local environment.

Prerequisites

Ensure you have the following installed:

Git
Python 3.8+
Node.js 14+ and npm (or yarn)
MongoDB (running locally or accessible via a cloud service like MongoDB Atlas)

Installation

Clone the repository:

git clone https://github.com/your-username/web-scraping.git
cd web-scraping

Set up the Companies Market Cap Scraper:

cd companiesmarketcap
pip install -r requirements.txt
cd ..

Set up the Myntra Ecommerce Scraping App:

Backend:

cd myntra/backend
npm install
# Create a .env file based on .env.example and configure your MongoDB URI
cp .env.example .env
# Open .env and add your MongoDB connection string (e.g., MONGODB_URI=mongodb://localhost:27017/myntradb)
cd ../..

Frontend:

cd myntra/frontend
npm install
cd ../..

Each set of commands above starts from, and returns to, the repository root.

Running the Applications

Run Companies Market Cap Scraper:
```
cd companiesmarketcap
python web_scraping.py
```
This will generate Companies.csv in the companiesmarketcap directory.
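To check the result without opening a spreadsheet, a small helper like the one below can print the header and first few rows. This is a minimal sketch using only the standard library; the function name `preview_csv` is hypothetical, and the column names depend on what the scraper actually writes.

```python
import csv
from itertools import islice


def preview_csv(path, limit=5):
    """Return the header and the first `limit` data rows of a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file has no rows
        rows = list(islice(reader, limit))
    return header, rows
```

Calling `preview_csv("Companies.csv")` from inside the `companiesmarketcap` directory returns the column names and up to five data rows.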

Run Myntra Ecommerce Scraping App:

Backend (in a separate terminal):
```
cd myntra/backend
npm start
```
The backend server will typically run on http://localhost:5000.

Frontend (in another separate terminal):
```
cd myntra/frontend
npm run dev
```
The frontend application will typically be available in your browser at http://localhost:5173 (or a similar port).

Project Structure

.
├── companiesmarketcap/                 # Python project for market cap scraping
│   ├── Companies.csv                   # Output CSV file
│   ├── README.md                       # Project-specific README
│   ├── requirements.txt                # Python dependencies
│   └── web_scraping.py                 # Main scraping script
├── myntra/                             # Full-stack e-commerce scraping app
│   ├── data.csv                        # Scraped data (example)
│   ├── index.html                      # Frontend entry point
│   ├── myntra.py                       # Python scraping script for Myntra (if applicable)
│   ├── package-lock.json
│   ├── package.json
│   ├── README.md                       # Project-specific README
│   ├── backend/                        # Node.js Express backend
│   │   ├── .env                        # Environment variables
│   │   ├── src/                        # Backend source code
│   │   │   ├── app.js
│   │   │   ├── server.js
│   │   │   ├── config/                 # Database configuration
│   │   │   ├── controllers/            # Request handlers
│   │   │   ├── middleware/             # Express middleware
│   │   │   ├── models/                 # Mongoose schemas
│   │   │   ├── routes/                 # API routes
│   │   │   ├── services/               # Business logic, e.g., CSV handling
│   │   │   └── utils/                  # Utility functions
│   │   └── ...                         # Node modules, etc.
│   └── frontend/                       # React.js frontend
│       ├── public/                     # Static assets
│       ├── src/                        # Frontend source code
│       │   ├── App.jsx                 # Main React component
│       │   ├── components/             # Reusable UI components
│       │   │   ├── common/
│       │   │   ├── home/
│       │   │   ├── layout/
│       │   │   └── shop/
│       │   ├── context/                # React context for state management
│       │   ├── lib/                    # Utility functions and API clients
│       │   └── pages/                  # Page-level React components
│       │       └── admin/
│       └── ...                         # Node modules, build configs, etc.
└── README.md                           # Main project README (this file)

Best Practices for Web Scraping

Adhering to best practices ensures ethical and efficient scraping:

Respect robots.txt: Always check a website's robots.txt file for scraping guidelines.
Review Terms of Service: Understand the legal implications of scraping a particular site.
Polite Scraping: Implement delays between requests to avoid overwhelming servers (e.g., time.sleep() in Python).
Error Handling: Implement robust error handling for network issues, changing website structures, and rate limits.
User-Agent: Set a descriptive User-Agent header to identify your scraper.
Avoid Unnecessary Requests: Cache data when possible and only request information you need.
Data Validation: Clean and validate all scraped data thoroughly before storage or use.
Modularity: Keep your scraping code modular and reusable.
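Several of these practices can be combined in a small amount of code. The sketch below checks robots.txt rules with the standard library's `urllib.robotparser`, pauses between requests, and retries failed fetches with exponential backoff. The `USER_AGENT` string and the function names are hypothetical, and `fetch` is an injected callable (for instance, a thin wrapper around `requests.get` that sends the User-Agent header), so the example stays independent of any particular HTTP library.

```python
import time
import urllib.robotparser

USER_AGENT = "example-scraper/0.1 (contact: you@example.com)"  # hypothetical identity


def build_robot_rules(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    rules = urllib.robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on any exception."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # wait 1x, 2x, 4x ... base_delay


def polite_crawl(urls, rules, fetch, delay=1.0, sleep=time.sleep):
    """Fetch only the URLs robots.txt allows, pausing after each request."""
    pages = {}
    for url in urls:
        if not rules.can_fetch(USER_AGENT, url):
            continue  # skip anything the site asks crawlers not to visit
        pages[url] = fetch_with_retry(fetch, url, sleep=sleep)
        sleep(delay)
    return pages
```

Passing `sleep` as a parameter keeps the delay logic easy to test, and catching every exception is a simplification: a production scraper would usually retry only on network errors and rate-limit responses.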

Contributing

Contributions are welcome! If you have suggestions for improvements or new features, please feel free to:

Fork the repository.
Create a new branch (git checkout -b feature/YourFeature).
Make your changes.
Commit your changes (git commit -m 'Add new feature').
Push to the branch (git push origin feature/YourFeature).
Open a Pull Request.

License

This project is licensed under the MIT License. See the LICENSE file for details.

This README was generated and enhanced by an AI assistant.
