Diamond Scraper - Data Collection Pipeline

Overview

This project implements an Extract, Transform, Load (ETL) pipeline for diamond retail data. It collects, normalizes, and stores diamond pricing and characteristic data, which will serve as the foundation for a future predictive machine learning model.

This repository covers Phase 1: Data Collection & Preparation.

System Architecture

The application is split into loosely coupled modules, each with a single responsibility:

config.py: Centralized configuration (URLs, parameters, delay constraints).
scraper.py: Handles HTTP requests with built-in resilience (exponential backoff retries, rate limiting, and robots.txt compliance).
parser.py: Translates raw HTML/JSON responses into standardized Python dictionaries.
cleaner.py: Executes the data normalization process (type casting, categorical validation, deduplication, and IQR-based outlier detection).
storage.py: Manages data persistence (XLSX, JSON, SQLite).
pipeline.py: Orchestrator that links extraction, parsing, and storage.
app_web.py & index.html: A local web dashboard (Flask/HTML) providing a user interface to trigger the pipeline and download datasets.
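
The resilience features listed for scraper.py can be illustrated with a minimal sketch. It is not the project's actual code: the function names (backoff_delays, is_allowed, fetch_with_retries) and the default values are hypothetical, and the HTTP call is passed in as a plain callable so the example runs without network access. Rate limiting is not covered here.

```python
import time
from urllib.robotparser import RobotFileParser


def backoff_delays(max_retries, base_delay=1.0, cap=30.0):
    """Return the wait in seconds before each retry: base, 2*base, 4*base, ..., capped."""
    return [min(cap, base_delay * (2 ** attempt)) for attempt in range(max_retries)]


def is_allowed(robots_txt, user_agent, url):
    """Check a URL against the text content of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def fetch_with_retries(fetch, url, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on failure, wait with exponential backoff and try again.

    `fetch` is any callable that returns a response or raises an exception
    (a hypothetical stand-in for the real HTTP request).
    """
    delays = backoff_delays(max_retries, base_delay)
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(delays[attempt])
```

Injecting `fetch` and `sleep` keeps the retry logic testable in isolation; in a real scraper, `fetch` would wrap the HTTP client and raise on responses that should be retried.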

Data Schema (The 4 Cs)

The extracted dataset standardizes each diamond into six core columns: the four Cs (carat, color, clarity, cut), plus shape and price.

shape: Geometric appearance (Round, Oval, Princess, etc.)
carat: Weight of the diamond.
color: Color grading (D-Z scale).
clarity: Assessment of internal imperfections (FL, VVS1, SI2, etc.).
cut: Quality of the diamond's proportions and finish.
price: Retail price in USD.
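
As a rough illustration of how records in this schema might be cleaned (type casting, categorical validation, and IQR-based outlier detection), here is a minimal sketch using only the standard library. The function names, the accepted category sets, and the 1.5 x IQR threshold are assumptions made for this example, not the contents of cleaner.py.

```python
from statistics import quantiles

# Assumed category sets for illustration; the project's real lists may differ.
VALID_COLORS = {chr(c) for c in range(ord("D"), ord("Z") + 1)}
VALID_CLARITIES = {"FL", "IF", "VVS1", "VVS2", "VS1", "VS2", "SI1", "SI2", "I1", "I2", "I3"}


def normalize_record(raw):
    """Cast and validate one scraped record; return None if it is unusable."""
    try:
        carat = float(raw["carat"])
        price = float(str(raw["price"]).replace("$", "").replace(",", ""))
    except (KeyError, TypeError, ValueError):
        return None
    color = str(raw.get("color", "")).strip().upper()
    clarity = str(raw.get("clarity", "")).strip().upper()
    if color not in VALID_COLORS or clarity not in VALID_CLARITIES:
        return None
    if carat <= 0 or price <= 0:
        return None
    return {
        "shape": str(raw.get("shape", "")).strip().title(),
        "carat": carat,
        "color": color,
        "clarity": clarity,
        "cut": str(raw.get("cut", "")).strip().title(),
        "price": price,
    }


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_price_outliers(records, k=1.5):
    """Return the records whose price falls outside the IQR fences."""
    low, high = iqr_bounds([r["price"] for r in records], k)
    return [r for r in records if not (low <= r["price"] <= high)]
```

Whether flagged rows are dropped or only marked is a design choice; this sketch returns them so the caller can decide.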

Installation & Usage

Install dependencies: pip install flask requests beautifulsoup4 pandas openpyxl colorlog
Launch the web interface: python app_web.py
Access the dashboard: Navigate to http://127.0.0.1:5000 in your web browser.

Technical Limitations & Future Scope

Target sites are often protected by enterprise-grade Web Application Firewalls (WAFs) and bot-detection services such as Cloudflare or DataDome, so the standard HTTP request implementation may occasionally receive 403 Forbidden responses. As a fallback, the pipeline includes a Synthetic Data Generator (Demo Mode), which lets the rest of the pipeline be demonstrated end to end even when live scraping is blocked.
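
To give a sense of what a synthetic data generator for this schema could look like, here is a minimal sketch. The function name generate_synthetic_diamonds, the category lists, and the toy pricing rule are all assumptions for illustration; they do not describe the project's Demo Mode implementation or real market prices.

```python
import random

# Illustrative category values only (assumed, not taken from the project).
SHAPES = ["Round", "Oval", "Princess"]
COLORS = list("DEFGHIJ")
CLARITIES = ["FL", "VVS1", "VS1", "SI2"]
CUTS = ["Excellent", "Very Good", "Good"]


def generate_synthetic_diamonds(n, seed=0):
    """Return n fake diamond records matching the six-column schema."""
    rng = random.Random(seed)  # seeded so demo runs are reproducible
    rows = []
    for _ in range(n):
        carat = round(rng.uniform(0.3, 3.0), 2)
        # Toy rule: price grows faster than linearly with carat, plus noise.
        price = round(3000 * carat ** 1.8 * rng.uniform(0.8, 1.2), 2)
        rows.append({
            "shape": rng.choice(SHAPES),
            "carat": carat,
            "color": rng.choice(COLORS),
            "clarity": rng.choice(CLARITIES),
            "cut": rng.choice(CUTS),
            "price": price,
        })
    return rows
```

Because the output has the same keys as scraped records, it can be passed through the same cleaning and storage steps.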

Future iterations (Phase 2) may explore headless browser automation (Selenium/Playwright) or residential proxy rotation to bypass strict TLS fingerprinting algorithms.
