This machine learning project predicts the number of leads generated by vehicle advertisements, combining engineered features (target encoding, flag clustering) with a LightGBM gradient boosting model. The predictions help automotive businesses optimise listings, pricing, and marketing strategies.
Main Goal: Predict the number of leads (customer inquiries) that a vehicle listing will generate, based on vehicle characteristics, pricing, location, and advertisement features.
| Metric | Value |
|---|---|
| 📊 RMSE | 6.922 leads |
| 📈 R² | 70.3% variance explained |
| ⚡ Inference speed | < 1ms per listing |
| 📋 Features used | 16 (down from 48) |
| 🔼 Baseline improvement | 64.6% |
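
For reference, the two headline metrics can be computed from held-out predictions roughly as follows. This is a minimal sketch using the standard definitions of RMSE and R²; the function names and sample values are illustrative and are not taken from the project's code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target (leads)."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Share of target variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot


# Illustrative values only, not project data.
actual_leads = [3, 5, 7, 9]
predicted_leads = [2, 5, 8, 9]
print(rmse(actual_leads, predicted_leads))
print(r_squared(actual_leads, predicted_leads))
```
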
- Algorithm: LightGBM Regressor with Optuna hyperparameter optimisation (150 trials)
- Training data: 48,578 vehicle listings
- Validation: 5-fold cross-validation with overfitting detection
- Feature engineering: Target encoding, Jenks flag clustering, outlier removal
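
Jenks Natural Breaks splits one-dimensional values into contiguous groups that minimise the total within-group squared deviation. The sketch below is a small pure-Python dynamic-programming version for illustration. It assumes each flag combination has already been reduced to a single numeric score (for example, its mean lead count); the source does not say which quantity is clustered, and `jenks_groups` is a hypothetical helper, not the project's implementation.

```python
def jenks_groups(values, n_classes):
    """Split values into n_classes contiguous groups (after sorting) that
    minimise the total within-group sum of squared deviations."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= n_classes <= n:
        raise ValueError("n_classes must be between 1 and len(values)")

    # Prefix sums give any segment's squared deviation in O(1).
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        prefix[i + 1] = prefix[i] + x
        prefix_sq[i + 1] = prefix_sq[i] + x * x

    def ssd(i, j):
        """Sum of squared deviations from the mean for xs[i:j]."""
        s = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - s * s / (j - i)

    inf = float("inf")
    # cost[c][j]: best total deviation for the first j values in c groups.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    split = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for c in range(1, n_classes + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                candidate = cost[c - 1][i] + ssd(i, j)
                if candidate < cost[c][j]:
                    cost[c][j] = candidate
                    split[c][j] = i

    # Walk back through the split table to recover the groups.
    groups, j = [], n
    for c in range(n_classes, 0, -1):
        i = split[c][j]
        groups.append(xs[i:j])
        j = i
    return groups[::-1]


# Hypothetical per-combination scores, not project data.
scores = [52, 1, 10, 2, 51, 11, 3, 50, 12]
print(jenks_groups(scores, 3))
```

The triple loop is O(k·n²), which is fine for a few hundred distinct scores; for larger inputs a dedicated library would be a better fit.
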
| Lead Range | Category | Recommended Action |
|---|---|---|
| 0–5 | 🔴 Low Performance | Review listing quality, adjust pricing |
| 6–15 | 🟡 Moderate Performance | Minor adjustments, standard monitoring |
| 16–30 | 🟢 High Performance | Replicate success factors, scale strategy |
| 31+ | 🌟 Exceptional Performance | Case study, premium placement |
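
The table above can be applied to a single prediction with a small lookup. This is a sketch, not the project's code: `categorise_leads` is a hypothetical helper, and it assumes that fractional predictions between two bands (for example 5.5) fall into the higher band.

```python
# Inclusive upper bounds, labels, and actions taken from the table above.
LEAD_BANDS = [
    (5, "Low Performance", "Review listing quality, adjust pricing"),
    (15, "Moderate Performance", "Minor adjustments, standard monitoring"),
    (30, "High Performance", "Replicate success factors, scale strategy"),
]
TOP_BAND = ("Exceptional Performance", "Case study, premium placement")


def categorise_leads(predicted_leads):
    """Return (category, recommended action) for a predicted lead count."""
    for upper_bound, category, action in LEAD_BANDS:
        # Predictions at or below 0 also land in the lowest band.
        if predicted_leads <= upper_bound:
            return category, action
    return TOP_BAND


category, action = categorise_leads(12.4)
print(category, "->", action)
```
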
- Analysed 48,578 listings across multiple Brazilian states and cities
- Reduced 48 raw features to 16 optimised features (67% reduction, zero accuracy loss)
- Systematic outlier removal and missing value handling
- Geographic encoding: City/state target encoding with smoothing (prevents overfitting)
- Flag clustering: Jenks Natural Breaks to group vehicle feature combinations
- Price positioning: Market value vs. advertised price gap analysis
- Visual impact: Photo count optimisation (sweet spot: 8 photos)
- Hyperparameter search via Optuna (150 trials)
- Learning curve analysis for regularisation guidance
- Sklearn-compatible production pipeline (`.joblib`)
- Top drivers: phone clicks, views, location, price positioning
- State-specific lead generation patterns identified
- Feature importance ranking for actionable ad improvement
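
Target encoding with smoothing, as used for the city/state features, can be sketched as follows. The blend formula is one common variant and the `smoothing` weight is an assumed parameter; the source does not state which formula or value the project uses, and `smoothed_target_encoding` is a hypothetical helper.

```python
from collections import defaultdict


def smoothed_target_encoding(categories, targets, smoothing=10.0):
    """Map each category to a blend of its own mean target and the global mean.

    encoded = (category_sum + smoothing * global_mean) / (count + smoothing)

    Rare categories are pulled towards the global mean, which limits
    overfitting to a handful of listings.
    """
    if len(categories) != len(targets) or len(targets) == 0:
        raise ValueError("categories and targets must be non-empty and equal in length")

    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1

    return {
        category: (sums[category] + smoothing * global_mean) / (counts[category] + smoothing)
        for category in counts
    }


# Illustrative values only: one state with four listings, one with a single listing.
states = ["SP", "SP", "SP", "SP", "AC"]
leads = [10, 12, 8, 10, 40]
print(smoothed_target_encoding(states, leads))
```

To avoid target leakage, the encoding should be fitted on training folds only and then applied to the validation fold.
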
```bash
git clone https://github.com/pelabdang/leads-analysis-prediction.git
cd leads-analysis-prediction
pip install -r requirements.txt
```

```python
import pandas as pd

from src.models.model_trainer import ModelTrainer

# Load the listings to score (replace with your own file).
listings_df = pd.read_csv('your_listings.csv')

trainer = ModelTrainer()
predictions = trainer.predict_batch(listings_df)
listings_df['predicted_leads'] = predictions

# Open lower bound: with bins starting at 0, pd.cut would label a
# prediction of exactly 0 (or below) as NaN instead of 'Low'.
listings_df['performance_category'] = pd.cut(
    predictions,
    bins=[-float('inf'), 5, 15, 30, float('inf')],
    labels=['Low', 'Moderate', 'High', 'Exceptional']
)
```

```python
from flask import Flask, request, jsonify

from src.models.model_trainer import ModelTrainer

app = Flask(__name__)
model = ModelTrainer.load_model('complete_ml_pipeline')


@app.route('/predict', methods=['POST'])
def predict_leads():
    data = request.json
    prediction = model.predict_single(data)
    # Cast to a plain float in case the model returns a NumPy scalar,
    # which jsonify cannot serialise.
    return jsonify({'predicted_leads': float(prediction)})
```

```
leads-analysis-prediction/
├── config/                 # config.yaml — model & data settings
├── data/
│   ├── raw/                # Original immutable dataset
│   ├── processed/          # Cleaned, feature-engineered data
│   └── external/           # Supplementary sources
├── models/                 # complete_ml_pipeline.joblib + artefacts
├── notebooks/
│   ├── exploratory/
│   ├── feature_engineering/
│   └── modeling/
├── reports/                # Modeling & feature engineering reports
├── src/
│   ├── data/
│   ├── features/
│   ├── models/
│   └── visualization/
├── requirements.txt
└── setup.py
```
- 📞 Phone Engagement — strongest signal of buyer intent
- 👁️ Visual Presentation — 8 photos = optimal lead generation
- 🌍 Geographic Location — state/city-level variance is significant
- 💰 Price Positioning — sweet spot relative to market value
- 🚗 Vehicle Features — safety & comfort clusters drive engagement
📧 Contact: Angelo Pelisson · GitHub
📊 Performance: RMSE 6.922 · R² 70.3% · 64.6% over baseline