Trips & Travel.Com wants to expand its customer base by launching a new Wellness Tourism Package — defined as travel that allows the traveler to maintain, enhance or kick-start a healthy lifestyle and sense of well-being.
Previously, customers were contacted randomly for marketing without using any available data — leading to high marketing costs and a low conversion rate of only 18%.
This project builds a Random Forest Classification model to predict which customers are likely to purchase the Wellness Tourism Package, so the company can target the right customers and make marketing expenditure more efficient.
```
Holiday-Package-Prediction-Randomforest/
├── images/
│   ├── holiday_package_prediction.gif
│   ├── feature_distributions.png
│   ├── correlation_heatmap.png
│   ├── confusion_matrix.png
│   ├── feature_importance.png
│   └── auc.png
│
├── Holiday_Package_Prediction.ipynb
├── Travel.csv
├── holiday_package_classification_model.pkl
├── preprocessor.pkl
├── .gitignore
└── README.md
```
**Tech Stack:** `scikit-learn` · `pandas` · `numpy` · `matplotlib` · `seaborn` · `joblib`
The dataset is sourced from Kaggle — Holiday Package Purchase Prediction.
- Samples: 4,888
- Features: 20
- Target Variable: `ProdTaken` — 1 (Purchased) / 0 (Not Purchased)
- Purchase Rate: ~18% (class imbalance present)
| Feature | Description |
|---|---|
| Age | Age of the customer |
| MonthlyIncome | Monthly income of the customer |
| DurationOfPitch | Duration of the sales pitch (in minutes) |
| Passport | Whether the customer has a passport (1/0) |
| NumberOfTrips | Number of trips taken by the customer |
| PitchSatisfactionScore | Customer satisfaction score for the pitch |
| NumberOfFollowups | Number of followups made by the agent |
| TotalVisiting | Total number of people visiting (person + children) |
| CityTier | Tier of the city the customer belongs to |
| Occupation | Customer's occupation |
- Load the dataset from `Travel.csv`
- Handle missing values — median for continuous, mode for discrete features
- Fix categorical inconsistencies (`Fe Male` → `Female`, `Single` → `Unmarried`)
- Drop irrelevant feature — `CustomerID`
- Feature Engineering — create `TotalVisiting` from `NumberOfPersonVisiting` + `NumberOfChildrenVisiting`
- Exploratory Data Analysis — distributions and correlation heatmap
- Train-Test Split — 80% train / 20% test (`random_state=42`)
- Preprocessing — `OneHotEncoder` for categorical, `StandardScaler` for numerical via `ColumnTransformer`
- Train and compare multiple models — Logistic Regression, Decision Tree, Random Forest, Gradient Boosting
- Hyperparameter tuning using `RandomizedSearchCV` with 100 iterations and 3-fold CV
- Evaluate using Confusion Matrix, Classification Report, and ROC-AUC Curve
- Feature Importance analysis
- Save model and preprocessor using `joblib`
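The split-and-preprocess steps above can be sketched as follows. This is a minimal, self-contained illustration: it uses a tiny synthetic DataFrame standing in for `Travel.csv` (the column values here are made up), but the `ColumnTransformer` wiring mirrors the approach described in the workflow.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the real Travel.csv columns (illustrative values only)
df = pd.DataFrame({
    "Age":           [30, 45, 28, 52, 36, 41],
    "MonthlyIncome": [25000, 40000, 22000, 55000, 31000, 47000],
    "Gender":        ["Male", "Female", "Female", "Male", "Male", "Female"],
    "MaritalStatus": ["Unmarried", "Married", "Married", "Divorced", "Unmarried", "Married"],
    "ProdTaken":     [1, 0, 0, 1, 0, 1],
})

X = df.drop(columns=["ProdTaken"])
y = df["ProdTaken"]

# Split columns by dtype: numeric vs. categorical
num_features = X.select_dtypes(exclude="object").columns
cat_features = X.select_dtypes(include="object").columns

# OneHotEncoder for categorical, StandardScaler for numerical
preprocessor = ColumnTransformer([
    ("ohe",   OneHotEncoder(handle_unknown="ignore"), cat_features),
    ("scale", StandardScaler(),                       num_features),
])

# 80/20 split with the same random_state as the notebook
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the preprocessor on train only, then transform both splits
X_train_t = preprocessor.fit_transform(X_train)
X_test_t = preprocessor.transform(X_test)
print(X_train_t.shape)
```

Fitting the preprocessor on the training split only (and merely transforming the test split) avoids leaking test-set statistics into the scaler.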
Key Insights:
- `Passport` has the strongest positive correlation with purchase (0.26)
- `Age` and `MonthlyIncome` are highly correlated with each other (0.46)
- `DurationOfPitch` and `NumberOfFollowups` positively influence the purchase decision
- Most customers are from City Tier 1 and prefer 3-star properties
| Model | Train Accuracy | Test Accuracy |
|---|---|---|
| Logistic Regression | ~81% | ~81% |
| Decision Tree | ~100% | ~85% |
| Gradient Boosting | ~93% | ~90% |
| Random Forest | ~100% | ~93% |
Random Forest was selected as the final model based on best test accuracy and F1 score.
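A comparison loop like the one behind the table above can be sketched as below. This runs on synthetic `make_classification` data rather than the real preprocessed features, so the scores will not match the table; it only illustrates the train-vs-test accuracy comparison across the four model families.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training data
X, y = make_classification(n_samples=500, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree":       DecisionTreeClassifier(random_state=42),
    "Gradient Boosting":   GradientBoostingClassifier(random_state=42),
    "Random Forest":       RandomForestClassifier(random_state=42),
}

# Record (train accuracy, test accuracy) per model to spot overfitting
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"{name}: train={scores[name][0]:.2f}, test={scores[name][1]:.2f}")
```

A large gap between train and test accuracy (as with the ~100% train scores in the table) signals overfitting, which is why the test score drives the final model choice.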
```python
rf_params = {
    "max_depth":         [5, 8, 15, None, 10],
    "max_features":      [5, 7, "sqrt", 8],
    "min_samples_split": [2, 8, 15, 20],
    "n_estimators":      [100, 200, 500, 1000],
}
```
Tuning method: `RandomizedSearchCV` — `n_iter=100`, `cv=3`, `n_jobs=-1`
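The search itself can be sketched as follows. To keep the sketch fast it runs on small synthetic data with a trimmed grid and `n_iter=5` instead of the notebook's `n_iter=100`; only the `RandomizedSearchCV` usage pattern is the point here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; the notebook searches over the preprocessed features
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Trimmed version of rf_params above, reduced so the sketch runs quickly
rf_params = {
    "max_depth":         [5, 8, 15, None, 10],
    "min_samples_split": [2, 8, 15, 20],
    "n_estimators":      [100, 200],
}

# Random search samples n_iter parameter combinations, scoring each with 3-fold CV
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=rf_params,
    n_iter=5, cv=3, n_jobs=-1, random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```

Random search evaluates only a sample of the grid, which is why 100 iterations over this space are tractable where an exhaustive `GridSearchCV` would not be.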
Best Parameters:
```python
RandomForestClassifier(
    n_estimators=1000,
    min_samples_split=2,
    max_features=7,
    max_depth=None
)
```
Classification Report:
```
               precision    recall  f1-score   support

Not Purchased       0.93      0.99      0.96       787
    Purchased       0.97      0.68      0.80       191

     accuracy                           0.93       978
    macro avg       0.95      0.84      0.88       978
 weighted avg       0.94      0.93      0.93       978
```
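A report like the one above can be produced with the snippet below. It uses synthetic imbalanced data (~82/18, mirroring the dataset's purchase rate) rather than the real test split, so its numbers differ from the report; it shows the three evaluation tools named in the workflow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in; the notebook evaluates on the 20% test split
X, y = make_classification(n_samples=500, weights=[0.82], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
y_pred = model.predict(X_te)
y_prob = model.predict_proba(X_te)[:, 1]  # probability of the positive class

print(confusion_matrix(y_te, y_pred))
print(classification_report(y_te, y_pred,
                            target_names=["Not Purchased", "Purchased"]))
print("ROC-AUC:", roc_auc_score(y_te, y_prob))
```

Note the report's recall of 0.68 for the Purchased class: with an imbalanced target, per-class recall and ROC-AUC reveal what overall accuracy hides.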
Top Predictive Features:
- 💰 MonthlyIncome
- 🎂 Age
- ⏱️ DurationOfPitch
- 🛂 Passport
- ✈️ NumberOfTrips
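The ranking above comes from the fitted forest's `feature_importances_` attribute, which can be read out as in this minimal sketch (synthetic data with the top feature names reused as illustrative labels; the notebook uses the real transformed feature matrix):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with named columns for illustration
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
feature_names = ["MonthlyIncome", "Age", "DurationOfPitch", "Passport", "NumberOfTrips"]
X = pd.DataFrame(X, columns=feature_names)

model = RandomForestClassifier(random_state=42).fit(X, y)

# feature_importances_ is the mean impurity decrease per feature; it sums to 1
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```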
```python
import joblib
import pandas as pd

# Load the trained model and the fitted preprocessor
model = joblib.load('holiday_package_classification_model.pkl')
preprocessor = joblib.load('preprocessor.pkl')

# New customer sample
sample = pd.DataFrame({
    'TypeofContact'         : ['Self Enquiry'],
    'CityTier'              : [1],
    'DurationOfPitch'       : [20],
    'Gender'                : ['Male'],
    'NumberOfFollowups'     : [4],
    'ProductPitched'        : ['Deluxe'],
    'PreferredPropertyStar' : [3],
    'MaritalStatus'         : ['Unmarried'],
    'NumberOfTrips'         : [3],
    'Passport'              : [1],
    'PitchSatisfactionScore': [4],
    'OwnCar'                : [1],
    'Occupation'            : ['Salaried'],
    'MonthlyIncome'         : [25000],
    'Age'                   : [30],
    'Designation'           : ['Executive'],
    'TotalVisiting'         : [3],
})

# Apply the same encoding and scaling used during training, then predict
sample_transformed = preprocessor.transform(sample)
prediction = model.predict(sample_transformed)
probability = model.predict_proba(sample_transformed)

print("Prediction :", "✅ Will Purchase" if prediction[0] == 1 else "❌ Will Not Purchase")
print("Probability :", f"Not Purchased = {probability[0][0]*100:.1f}% | Purchased = {probability[0][1]*100:.1f}%")
```

1. Clone the repo
```bash
git clone https://github.com/AnmolPatel20/Holiday-Package-Prediction-Randomforest.git
cd Holiday-Package-Prediction-Randomforest
```
2. Install dependencies
```bash
pip install pandas numpy matplotlib seaborn scikit-learn joblib
```
3. Run the notebook
```bash
jupyter notebook Holiday_Package_Prediction.ipynb
```

- Both `holiday_package_classification_model.pkl` and `preprocessor.pkl` must be loaded together for prediction
- The preprocessor handles all encoding and scaling — never pass raw data directly to the model
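One way to remove the two-artifact requirement is to wrap the preprocessor and model in a single `sklearn` `Pipeline` and persist that instead. The sketch below is a hypothetical alternative, not what the notebook does: it trains on a tiny toy DataFrame, and the filename `holiday_package_pipeline.pkl` is invented for illustration.

```python
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data standing in for the real training set
X = pd.DataFrame({
    "Age":    [30, 45, 28, 52],
    "Gender": ["Male", "Female", "Female", "Male"],
})
y = [1, 0, 0, 1]

preprocessor = ColumnTransformer([
    ("ohe",   OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
    ("scale", StandardScaler(),                       ["Age"]),
])

# Preprocessing and model travel together as one estimator
pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier(random_state=42)),
]).fit(X, y)

# One artifact instead of two; predict() then accepts raw rows directly
joblib.dump(pipe, "holiday_package_pipeline.pkl")
loaded = joblib.load("holiday_package_pipeline.pkl")
print(loaded.predict(X.head(1)))
```

With this packaging it becomes impossible to accidentally load the model without its matching preprocessor.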
I'm on my machine learning journey — building, experimenting and documenting as I go. Every notebook here represents something I've genuinely tried to understand, not just run. 🚀
- GitHub: @AnmolPatel20
- Portfolio: anmolpatel20.github.io/Anmol_Portfolio
Thanks to Krish Naik Sir whose Udemy course has been a great resource throughout this learning journey.
"Not all those who wander are lost." — J.R.R. Tolkien
⭐ Star this repo if you found it helpful!




