Jupyter06/Web-scrapping

Diamond Scraper - Data Collection Pipeline

Overview

This project designs and implements an Extract, Transform, Load (ETL) pipeline for the diamond industry. Its primary objective is to collect, normalize, and store diamond pricing and characteristic data as the foundation for a future predictive machine learning model.

This repository covers Phase 1: Data Collection & Preparation.

System Architecture

The application is structured into highly cohesive and decoupled modules:

  • config.py: Centralized configuration (URLs, parameters, delay constraints).
  • scraper.py: Handles HTTP requests with built-in resilience (exponential backoff retries, rate limiting, and robots.txt compliance).
  • parser.py: Translates raw HTML/JSON responses into standardized Python dictionaries.
  • cleaner.py: Executes the data normalization process (type casting, categorical validation, deduplication, and IQR-based outlier detection).
  • storage.py: Manages data persistence (XLSX, JSON, SQLite).
  • pipeline.py: Orchestrator that links extraction, parsing, cleaning, and storage.
  • app_web.py & index.html: A local web dashboard (Flask/HTML) providing a user interface to trigger the pipeline and download datasets.
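The retry behaviour described for scraper.py can be illustrated with a minimal sketch of exponential backoff. This is not the repository's actual code: the names backoff_delays, fetch_with_retries, and FetchError are hypothetical, and the fetch function is passed in as a parameter so the example runs without network access. Rate limiting and robots.txt checks are left out for brevity.

```python
import random
import time
from typing import Callable, List, Optional


class FetchError(Exception):
    """Raised when a request still fails after all retries (hypothetical name)."""


def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> List[float]:
    """Return the un-jittered wait before each retry: base * 2**attempt, capped."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


def fetch_with_retries(
    fetch: Callable[[str], str],
    url: str,
    retries: int = 3,
    base: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), retrying failed attempts with exponential backoff."""
    delays = backoff_delays(retries, base)
    last_error: Optional[Exception] = None
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # a real scraper would catch narrower errors
            last_error = exc
            if attempt == retries:
                break
            delay = delays[attempt]
            # Add up to 10% random jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, 0.1 * delay))
    raise FetchError(f"giving up on {url} after {retries} retries") from last_error
```

In practice the fetch argument could wrap requests.get and raise on retryable status codes; passing sleep as a parameter keeps the function easy to test without real waiting.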

Data Schema (The 4 Cs)

The extracted dataset standardizes diamond features into six core columns: the four Cs (carat, color, clarity, cut) plus shape and price.

  • shape: Geometric appearance (Round, Oval, Princess, etc.)
  • carat: Weight of the diamond.
  • color: Color grading (D-Z scale).
  • clarity: Assessment of internal imperfections (FL, VVS1, SI2, etc.).
  • cut: Quality of the diamond's proportions and finish.
  • price: Retail price in USD.
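To show how the schema above could drive the cleaning step (type casting, categorical validation, deduplication, and IQR-based outlier detection), here is a dependency-free sketch. It is an illustration under assumptions, not the contents of cleaner.py: the function names and the price_outlier flag are hypothetical, the clarity set is the standard GIA scale rather than a list confirmed by the project, and the conventional 1.5 × IQR fence is assumed.

```python
import statistics
from typing import Dict, Iterable, List, Optional, Tuple

# Assumed category sets: the D-Z color scale and the standard GIA clarity grades.
VALID_COLORS = set("DEFGHIJKLMNOPQRSTUVWXYZ")
VALID_CLARITY = {"FL", "IF", "VVS1", "VVS2", "VS1", "VS2", "SI1", "SI2", "I1", "I2", "I3"}


def normalize_row(raw: Dict) -> Optional[Dict]:
    """Cast types and validate categories; return None for unusable rows."""
    try:
        row = {
            "shape": str(raw["shape"]).strip().title(),
            "carat": float(raw["carat"]),
            "color": str(raw["color"]).strip().upper(),
            "clarity": str(raw["clarity"]).strip().upper(),
            "cut": str(raw["cut"]).strip().title(),
            "price": float(str(raw["price"]).replace("$", "").replace(",", "")),
        }
    except (KeyError, TypeError, ValueError):
        return None
    if row["color"] not in VALID_COLORS or row["clarity"] not in VALID_CLARITY:
        return None
    if row["carat"] <= 0 or row["price"] <= 0:
        return None
    return row


def iqr_bounds(values: List[float], k: float = 1.5) -> Tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clean_rows(raw_rows: Iterable[Dict]) -> List[Dict]:
    """Normalize, drop exact duplicates, and flag price outliers."""
    seen = set()
    rows = []
    for raw in raw_rows:
        row = normalize_row(raw)
        if row is None:
            continue
        key = tuple(row.values())
        if key in seen:  # exact duplicate of an earlier row
            continue
        seen.add(key)
        rows.append(row)
    # Quartiles are not meaningful on tiny samples, so skip flagging below 4 rows.
    if len(rows) >= 4:
        low, high = iqr_bounds([r["price"] for r in rows])
    else:
        low, high = float("-inf"), float("inf")
    for r in rows:
        r["price_outlier"] = not (low <= r["price"] <= high)
    return rows
```

Flagging outliers instead of dropping them is a design choice made for this sketch; it leaves the decision to the downstream modelling phase.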

Installation & Usage

  1. Install dependencies: pip install flask requests beautifulsoup4 pandas openpyxl colorlog

  2. Launch the web interface: python app_web.py

  3. Access the dashboard: Navigate to http://127.0.0.1:5000 in your web browser.

Technical Limitations & Future Scope

Target sites are often protected by enterprise-grade Web Application Firewalls (WAFs) such as Cloudflare or DataDome, so the standard HTTP request implementation may occasionally receive 403 Forbidden responses. Rather than bypassing these protections, the pipeline includes a Synthetic Data Generator (Demo Mode) that supplies sample records, so the rest of the architecture can still be demonstrated end to end.
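As a rough idea of what a demo-mode generator might look like, the sketch below produces random records that follow the data schema. Everything here is assumed for illustration: the function name generate_diamonds, the category lists, the carat range, and the toy pricing formula are not taken from the repository, and the generated values carry no real market meaning.

```python
import random
from typing import Dict, List, Optional

# Illustrative category values; the project's real lists may differ.
SHAPES = ["Round", "Oval", "Princess"]
COLORS = list("DEFGHIJ")
CLARITIES = ["FL", "IF", "VVS1", "VVS2", "VS1", "VS2", "SI1", "SI2"]
CUTS = ["Fair", "Good", "Very Good", "Excellent"]


def generate_diamonds(n: int, seed: Optional[int] = None) -> List[Dict]:
    """Return n synthetic diamond records; a fixed seed gives repeatable output."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        carat = round(rng.uniform(0.3, 3.0), 2)
        # Toy pricing: grows with carat squared, with a random per-stone multiplier.
        price = round(carat ** 2 * rng.uniform(2000, 6000), 2)
        rows.append({
            "shape": rng.choice(SHAPES),
            "carat": carat,
            "color": rng.choice(COLORS),
            "clarity": rng.choice(CLARITIES),
            "cut": rng.choice(CUTS),
            "price": price,
        })
    return rows
```

Seeding the generator makes demo runs reproducible, which helps when comparing the output of later pipeline stages.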

Future iterations (Phase 2) may explore headless browser automation (Selenium/Playwright) or residential proxy rotation to bypass strict TLS fingerprinting algorithms.
