aksharpatel007/web-scraping

Web Scraping

Python Node.js React HTML5 CSS3 JavaScript

Introduction

This repository hosts a collection of web scraping projects that demonstrate practical data extraction, processing, and application development. The goal is to show how to gather online data, clean it, store it in a structured format, and present it through a web interface. It is intended as a learning resource for Python-based scraping, full-stack web development with Node.js and React, and data handling.

Project Overview

1. Companies Market Cap Scraper

A Python-based web scraper that extracts company market capitalization data from a public financial website.

Key Features:

  • Data Extraction: Retrieves the market cap data published for various companies at the time the script runs.
  • Data Cleaning: Processes and normalizes scraped values to ensure consistency and accuracy.
  • CSV Export: Stores the cleaned dataset in a CSV file, ready for analysis, reporting, or integration into other systems.
  • Technologies: Built using Python with BeautifulSoup for HTML parsing, Requests for HTTP requests, and pandas for data manipulation.
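
The data cleaning and CSV export features can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names `parse_market_cap` and `rows_to_csv`, the input format (strings such as "$2.5 T" with T/B/M suffixes), and the column names are all assumptions. It uses the standard-library `csv` module in place of pandas so that it runs without extra dependencies.

```python
import csv
import io

# Multipliers for the magnitude suffixes assumed in the scraped strings.
SUFFIXES = {"T": 1e12, "B": 1e9, "M": 1e6}


def parse_market_cap(text):
    """Convert a string such as '$2.5 T' to a float number of dollars.

    Returns None when the text cannot be parsed, so bad rows can be
    filtered out instead of crashing the run.
    """
    cleaned = text.strip().replace("$", "").replace(",", "").replace(" ", "")
    if not cleaned:
        return None
    multiplier = SUFFIXES.get(cleaned[-1].upper())
    number = cleaned[:-1] if multiplier else cleaned
    try:
        return float(number) * (multiplier or 1.0)
    except ValueError:
        return None


def rows_to_csv(rows):
    """Write (name, raw_market_cap) pairs to CSV text, skipping unparseable rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "market_cap_usd"])  # hypothetical column names
    for name, raw in rows:
        value = parse_market_cap(raw)
        if value is not None:
            writer.writerow([name.strip(), value])
    return buffer.getvalue()


print(rows_to_csv([("Example Corp ", "$2.5 T"), ("Unknown Inc", "N/A")]))
```

Returning `None` for unparseable values keeps the validation decision (skip, log, or fail) with the caller rather than hiding it inside the parser.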

2. Myntra Ecommerce Scraping App

A full-stack application that scrapes product data from an e-commerce site (mimicking Myntra) and displays it in a modern, interactive React frontend.

Key Features:

  • Product Data Scraping: Gathers product details such as names, prices, images, and descriptions.
  • Category-Based Browsing: Enables users to browse products by categories like shoes, watches, and T-shirts.
  • Full-Stack Architecture:
    • React Frontend: Provides a dynamic and responsive user interface for displaying products.
    • Node.js Backend (Express.js): Manages data retrieval, API endpoints, and serves product information to the frontend.
  • Structured Data Storage: Stores and renders product data in a well-organized format, facilitating easy display and management.

Web Scraping Fundamentals

Web scraping involves programmatically extracting information from websites. The general workflow demonstrated in these projects includes:

  1. Sending HTTP Requests: Initiating requests to target web pages.
  2. Receiving HTML Response: Obtaining the raw HTML content of the page.
  3. Parsing Content: Analyzing the HTML structure to locate relevant data.
  4. Extracting Fields: Pulling out specific data points (e.g., product names, prices).
  5. Cleaning & Normalizing: Standardizing extracted data for consistency.
  6. Storing Output: Saving the processed data in formats like CSV or into a database for application use.
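
The six steps above can be sketched end to end as follows. This is an illustrative outline under stated assumptions, not the code used in these projects: it substitutes the standard-library `urllib.request` and `html.parser` for Requests and BeautifulSoup so that it runs without dependencies, and the names `fetch_html`, `TableRowParser`, `scrape`, and the sample HTML are hypothetical. `fetch_html` is defined for completeness but not called, so the example works offline.

```python
import csv
import io
import urllib.request
from html.parser import HTMLParser


def fetch_html(url, timeout=10):
    """Steps 1-2: send the request and return the decoded HTML body."""
    request = urllib.request.Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


class TableRowParser(HTMLParser):
    """Steps 3-4: collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows made of <th> cells stay empty and are skipped
                self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


def scrape(html):
    """Steps 3-6 on an HTML string: parse, extract, clean, and return CSV text."""
    parser = TableRowParser()
    parser.feed(html)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "price"])
    for name, price in parser.rows:
        # Step 5: trim whitespace and strip the currency symbol.
        writer.writerow([name.strip(), price.strip().lstrip("$")])
    return buffer.getvalue()


SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td> Running Shoes </td><td>$59.99</td></tr>
  <tr><td>Watch</td><td>$120.00</td></tr>
</table>
"""

print(scrape(SAMPLE_HTML))
```

Separating fetching from parsing means the parsing and cleaning logic can be tested against saved HTML without touching the network.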

Key Technologies

This repository leverages a diverse set of technologies across its projects:

  • Backend & Scraping:

    • Python: Primary language for data scraping and processing.
    • BeautifulSoup: Python library for parsing HTML and XML documents.
    • Requests: Python library for making HTTP requests.
    • pandas: Python library for data manipulation and analysis, especially with DataFrames.
    • Node.js: JavaScript runtime for the backend server.
    • Express.js: Web application framework for Node.js, used for building APIs.
    • MongoDB: NoSQL database for storing product data.
    • CSV Export: For structured data output.
  • Frontend:

    • React: JavaScript library for building user interfaces.
    • HTML5, CSS3, JavaScript: Core web technologies.

Features & Learning Outcomes

By exploring these projects, you will gain practical experience and insights into:

  • Web Page Inspection: Understanding how to analyze web page structures and identify data points for extraction.
  • HTML Parsing: Mastering selector-based scraping techniques using libraries like BeautifulSoup.
  • Data Transformation: Cleaning, normalizing, and transforming raw scraped data into usable formats.
  • Structured Output: Generating CSV files and handling structured data for various applications.
  • Full-Stack Development: Integrating frontend (React) and backend (Node.js/Express.js) components.
  • Dynamic Data Presentation: Implementing category-based filtering and efficient product display.
  • Performance Optimization: Developing an awareness of loading times and optimizing data handling.
  • Python Automation: Automating data collection tasks with Python scripts.

Getting Started

Follow these steps to set up and run the projects in your local environment.

Prerequisites

Ensure you have the following installed:

  • Git
  • Python 3.8+
  • Node.js 14+ and npm (or yarn)
  • MongoDB (running locally or accessible via a cloud service like MongoDB Atlas)

Installation

  1. Clone the repository:

    git clone https://github.com/aksharpatel007/web-scraping.git
    cd web-scraping
  2. Setup Companies Market Cap Scraper:

    cd companiesmarketcap
    pip install -r requirements.txt
    cd ..
  3. Setup Myntra Ecommerce Scraping App:

    Backend:

    cd myntra/backend
    npm install
    # Create a .env file based on .env.example and configure your MongoDB URI
    cp .env.example .env
    # Open .env and add your MongoDB connection string (e.g., MONGODB_URI=mongodb://localhost:27017/myntradb)
    cd ../..

    Frontend:

    cd myntra/frontend
    npm install
    cd ../..

    Each block above starts from and returns to the repository root.

Running the Applications

  1. Run Companies Market Cap Scraper:

    cd companiesmarketcap
    python web_scraping.py

    This will generate Companies.csv in the companiesmarketcap directory.

  2. Run Myntra Ecommerce Scraping App:

    Backend (in a separate terminal):

    cd myntra/backend
    npm start

    The backend server will typically run on http://localhost:5000.

    Frontend (in another separate terminal):

    cd myntra/frontend
    npm run dev

    The frontend application will typically open in your browser at http://localhost:5173 (or similar port).
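
Once the market cap scraper from step 1 has run, its output can be sanity-checked with a few lines of Python. The sketch below is an assumption-laden example rather than part of the repository: the helper name `summarize_csv` is hypothetical, the file is assumed to have a header row, and the path assumes the script is run from the repository root.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, row_count) for a CSV file with a header row."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        row_count = sum(1 for _ in reader)
        return list(reader.fieldnames or []), row_count


if __name__ == "__main__":
    # Path taken from the README; adjust if running from another directory.
    target = Path("companiesmarketcap") / "Companies.csv"
    if target.exists():
        columns, rows = summarize_csv(target)
        print(f"{target}: {rows} rows, columns = {columns}")
    else:
        print(f"{target} not found; run the scraper first.")
```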

Project Structure

.
├── companiesmarketcap/                 # Python project for market cap scraping
│   ├── Companies.csv                   # Output CSV file
│   ├── README.md                       # Project-specific README
│   ├── requirements.txt                # Python dependencies
│   └── web_scraping.py                 # Main scraping script
├── myntra/                             # Full-stack e-commerce scraping app
│   ├── data.csv                        # Scraped data (example)
│   ├── index.html                      # Frontend entry point
│   ├── myntra.py                       # Python scraping script for Myntra (if applicable)
│   ├── package-lock.json
│   ├── package.json
│   ├── README.md                       # Project-specific README
│   ├── backend/                        # Node.js Express backend
│   │   ├── .env                        # Environment variables
│   │   ├── src/                        # Backend source code
│   │   │   ├── app.js
│   │   │   ├── server.js
│   │   │   ├── config/                 # Database configuration
│   │   │   ├── controllers/            # Request handlers
│   │   │   ├── middleware/             # Express middleware
│   │   │   ├── models/                 # Mongoose schemas
│   │   │   ├── routes/                 # API routes
│   │   │   ├── services/               # Business logic, e.g., CSV handling
│   │   │   └── utils/                  # Utility functions
│   │   └── ...                         # Node modules, etc.
│   └── frontend/                       # React.js frontend
│       ├── public/                     # Static assets
│       ├── src/                        # Frontend source code
│       │   ├── App.jsx                 # Main React component
│       │   ├── components/             # Reusable UI components
│       │   │   ├── common/
│       │   │   ├── home/
│       │   │   ├── layout/
│       │   │   └── shop/
│       │   ├── context/                # React context for state management
│       │   ├── lib/                    # Utility functions and API clients
│       │   └── pages/                  # Page-level React components
│       │       └── admin/
│       └── ...                         # Node modules, build configs, etc.
└── README.md                           # Main project README (this file)

Best Practices for Web Scraping

Adhering to best practices ensures ethical and efficient scraping:

  • Respect robots.txt: Always check a website's robots.txt file for scraping guidelines.
  • Review Terms of Service: Understand the legal implications of scraping a particular site.
  • Polite Scraping: Implement delays between requests to avoid overwhelming servers (e.g., time.sleep() in Python).
  • Error Handling: Implement robust error handling for network issues, changing website structures, and rate limits.
  • User-Agent: Set a descriptive User-Agent header to identify your scraper.
  • Avoid Unnecessary Requests: Cache data when possible and only request information you need.
  • Data Validation: Clean and validate all scraped data thoroughly before storage or use.
  • Modularity: Keep your scraping code modular and reusable.
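
Several of these practices (checking robots.txt, spacing out requests, and sending a descriptive User-Agent) can be combined in a small helper layer. The following is a minimal sketch using only the standard library; the names `Throttle`, `build_robots`, `allowed_urls`, the User-Agent string, and the example URLs are hypothetical, and the robots.txt content is inlined so the example runs without network access.

```python
import time
import urllib.robotparser

# Hypothetical identifier; replace with something that describes your scraper.
USER_AGENT = "example-scraper/0.1 (contact: you@example.com)"


def build_robots(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Sleep just long enough to honor min_interval, then record the time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def allowed_urls(urls, robots, user_agent=USER_AGENT):
    """Keep only the URLs that robots.txt permits for this user agent."""
    return [url for url in urls if robots.can_fetch(user_agent, url)]


ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = build_robots(ROBOTS_TXT)
urls = ["https://example.com/products", "https://example.com/private/admin"]
throttle = Throttle(min_interval=1.0)
for url in allowed_urls(urls, robots):
    throttle.wait()  # pause before each request
    print("would fetch", url, "with User-Agent", USER_AGENT)
```

Injecting the clock and sleep functions into `Throttle` is a design choice that lets the delay logic be tested without actually waiting.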

Contributing

Contributions are welcome! If you have suggestions for improvements or new features, please feel free to:

  • Fork the repository.
  • Create a new branch (git checkout -b feature/YourFeature).
  • Make your changes.
  • Commit your changes (git commit -m 'Add new feature').
  • Push to the branch (git push origin feature/YourFeature).
  • Open a Pull Request.

License

This project is licensed under the MIT License. See the LICENSE file for details.


This README was generated and enhanced by an AI assistant.

About

Python and Full-Stack JavaScript Web Scraping workflows. Includes a dynamic Companies Market Cap pipeline (BeautifulSoup/Pandas) and a Myntra clone e-commerce app (React, Node.js, Express, MongoDB).
