This project designs and implements an Extract, Transform, Load (ETL) pipeline for the diamond industry. The pipeline collects, normalizes, and stores diamond pricing and characteristic data to serve as the foundation for a future predictive machine learning model.
This repository covers Phase 1: Data Collection & Preparation.
The application is structured into highly cohesive and decoupled modules:
- config.py: Centralized configuration (URLs, parameters, delay constraints).
- scraper.py: Handles HTTP requests with built-in resilience (exponential backoff retries, rate limiting, and robots.txt compliance).
- parser.py: Translates raw HTML/JSON responses into standardized Python dictionaries.
- cleaner.py: Executes the data normalization process (type casting, categorical validation, deduplication, and IQR-based outlier detection).
- storage.py: Manages data persistence (XLSX, JSON, SQLite).
- pipeline.py: Orchestrator that links extraction, parsing, cleaning, and storage.
- app_web.py & index.html: A local web dashboard (Flask/HTML) providing a user interface to trigger the pipeline and download datasets.
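The retry behavior described for scraper.py can be sketched as follows. This is a minimal illustration, not the project's actual code: `TransientError`, `backoff_delays`, and `fetch_with_retries` are hypothetical names, the `fetch` callable stands in for whatever HTTP call the scraper makes, and the 10% jitter is an assumption.

```python
import random
import time
from typing import Callable, List


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, HTTP 429/5xx)."""


def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 30.0) -> List[float]:
    """Wait before each retry, without jitter: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]


def fetch_with_retries(
    fetch: Callable[[str], str],
    url: str,
    max_retries: int = 3,
    base: float = 1.0,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), retrying on TransientError with exponentially growing waits."""
    delays = backoff_delays(max_retries, base, cap)
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error to the caller
            # Up to 10% random jitter so parallel clients do not retry in lockstep.
            sleep(delays[attempt] * (1 + random.uniform(0, 0.1)))
```

Passing `sleep` as a parameter keeps the function testable without real waiting; in the real scraper the same hook could also enforce the rate limit from config.py.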
The extracted dataset standardizes diamond features into the following core columns:
- shape: Geometric appearance (Round, Oval, Princess, etc.)
- carat: Weight of the diamond.
- color: Color grading (D-Z scale).
- clarity: Assessment of internal imperfections (FL, VVS1, SI2, etc.).
- cut: Quality of the diamond's proportions and finish.
- price: Retail price in USD.
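To make the normalization steps concrete, here is a rough standard-library sketch of how rows with these columns might be cast, validated, deduplicated, and flagged for price outliers. It is an illustration under assumptions: the function names, the `price_outlier` field, the full GIA clarity list, and the 1.5 x IQR threshold are not taken from cleaner.py, which may differ (the dependency list suggests it relies on pandas).

```python
import statistics
from typing import Dict, List, Optional, Tuple

COLUMNS = ("shape", "carat", "color", "clarity", "cut", "price")
VALID_COLORS = set("DEFGHIJKLMNOPQRSTUVWXYZ")  # D-Z scale
# Assumed vocabulary: the standard GIA clarity grades.
VALID_CLARITIES = {"FL", "IF", "VVS1", "VVS2", "VS1", "VS2", "SI1", "SI2", "I1", "I2", "I3"}


def normalize(raw: Dict[str, object]) -> Optional[Dict[str, object]]:
    """Cast types and validate categories; return None for unusable rows."""
    try:
        carat = float(raw["carat"])
        price = float(str(raw["price"]).replace("$", "").replace(",", ""))
    except (KeyError, TypeError, ValueError):
        return None
    color = str(raw.get("color", "")).strip().upper()
    clarity = str(raw.get("clarity", "")).strip().upper()
    if carat <= 0 or price <= 0 or color not in VALID_COLORS or clarity not in VALID_CLARITIES:
        return None
    return {
        "shape": str(raw.get("shape", "")).strip().title(),
        "carat": carat,
        "color": color,
        "clarity": clarity,
        "cut": str(raw.get("cut", "")).strip().title(),
        "price": price,
    }


def deduplicate(rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Keep the first occurrence of each identical row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[column] for column in COLUMNS)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def iqr_bounds(values: List[float], k: float = 1.5) -> Tuple[float, float]:
    """Return (Q1 - k*IQR, Q3 + k*IQR) for the given values."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clean(raw_rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Normalize, deduplicate, then flag (not drop) price outliers."""
    rows = deduplicate([row for row in map(normalize, raw_rows) if row is not None])
    prices = [row["price"] for row in rows]
    # Quartiles are not meaningful on a handful of points, so skip flagging then.
    low, high = iqr_bounds(prices) if len(prices) >= 4 else (float("-inf"), float("inf"))
    for row in rows:
        row["price_outlier"] = not (low <= row["price"] <= high)
    return rows
```

Flagging outliers instead of deleting them is a deliberate choice in this sketch: unusually expensive stones may be legitimate listings, and the later modeling phase can decide how to treat them.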
- Install dependencies: pip install flask requests beautifulsoup4 pandas openpyxl colorlog
- Launch the web interface: python app_web.py
- Access the dashboard: Navigate to http://127.0.0.1:5000 in your web browser.
Target sites are often protected by enterprise-grade Web Application Firewalls (WAF) such as Cloudflare or DataDome, so the standard HTTP request implementation may occasionally receive 403 Forbidden responses. As a fallback, the pipeline includes a Synthetic Data Generator (Demo Mode), which produces artificial records so that the parsing, cleaning, and storage stages can still be demonstrated end to end.
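A demo-mode generator of this kind could look roughly like the sketch below. Everything in it is an assumption made for illustration: the function name `generate_synthetic_diamonds`, the value lists, the carat range, and especially the toy pricing rule, which is not calibrated to any market data and is not the project's actual generator.

```python
import random
from typing import Dict, List, Optional

# Illustrative vocabularies only; the real generator may use different values.
SHAPES = ["Round", "Oval", "Princess"]
COLORS = list("DEFGHIJ")
CLARITIES = ["FL", "IF", "VVS1", "VVS2", "VS1", "VS2", "SI1", "SI2"]
CUTS = ["Fair", "Good", "Very Good", "Excellent"]


def generate_synthetic_diamonds(n: int, seed: Optional[int] = None) -> List[Dict[str, object]]:
    """Return n artificial diamond records using the pipeline's core columns."""
    rng = random.Random(seed)  # a private RNG keeps runs reproducible per seed
    rows = []
    for _ in range(n):
        carat = round(rng.uniform(0.3, 3.0), 2)
        # Toy rule: price grows faster than linearly with carat, plus +/-20% noise.
        price = round(4000 * carat ** 1.8 * rng.uniform(0.8, 1.2), 2)
        rows.append({
            "shape": rng.choice(SHAPES),
            "carat": carat,
            "color": rng.choice(COLORS),
            "clarity": rng.choice(CLARITIES),
            "cut": rng.choice(CUTS),
            "price": price,
        })
    return rows
```

Because the records share the schema of scraped data, they can be fed through the same cleaning and storage steps. They should never be mixed into a dataset intended for model training.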
Future iterations (Phase 2) may explore headless browser automation (Selenium/Playwright) or residential proxy rotation to bypass strict TLS fingerprinting algorithms.