In today's healthcare landscape, rising costs and increasing demand for affordable insurance put significant pressure on insurance providers to price premiums accurately. Setting premiums too high drives customers away, while setting them too low leads to financial losses. This project aims to build a machine learning model that predicts health insurance charges based on personal and demographic factors.
Develop a predictive model that accurately estimates health insurance premium charges, helping insurance companies make data-driven pricing decisions.
The dataset contains 1,338 records with the following features:
| Feature | Description |
|---|---|
age |
Age of the insured individual |
sex |
Gender (male / female) |
bmi |
Body Mass Index |
children |
Number of dependents covered |
smoker |
Smoking status (yes / no) |
region |
Residential region (northeast, northwest, southeast, southwest) |
charges |
Insurance premium charges (target variable) |
- Checked dataset shape, data types, and statistical summary
- Identified and removed 1 duplicate record (1,338 → 1,337 rows)
- Confirmed no missing values in any column
Univariate Analysis:
- Age — Higher concentration of younger individuals (20-25 age bracket)
- BMI — Normally distributed, centered around 30
- Gender — Nearly balanced between male and female
- Smoker — Majority are non-smokers
- Region — Fairly balanced, southeast slightly higher
- Charges — Right-skewed distribution with outliers above 50,000
Bivariate Analysis:
- Insurance charges increase with age (positive correlation)
- Smokers pay significantly higher premiums than non-smokers — the strongest differentiator
- Gender has a minor influence; males have slightly higher and more variable charges
- Number of children shows limited impact beyond 2 dependents
- Regional differences are minimal
- Correlation heatmap confirmed age (0.30) and smoker status as the most influential features
- Dummy encoding — Converted categorical columns (
sex,smoker,region) into numerical values using one-hot encoding withdrop_first=True - Train-test split — 70% training, 30% testing (
random_state=42)
In insurance premium prediction, the model can make two types of errors:
- Underestimating the price — Predicted premium is lower than actual cost, leading to financial losses for the insurer
- Overestimating the price — Predicted premium is higher than actual cost, driving customers away
To capture both types of errors, we used:
- R-squared (R2) — Measures how well the model explains the variance in charges
- Mean Squared Error (MSE) — Penalizes larger prediction errors more heavily
We built three linear regression models incrementally to understand feature importance:
| Model | Features Used | Train R2 | Test R2 | Observation |
|---|---|---|---|---|
| Model 1 | Age only | Low | Low | Underfitting — single feature is insufficient |
| Model 2 | Age + BMI | Low | Low | Slight improvement, still underfitting |
| Model 3 | All features | ~0.74 | ~0.77 | Significant improvement, generalizes well but room to grow |
Takeaway: Using all features (including encoded categorical variables) significantly improved the model, but linear regression alone could not fully capture the non-linear patterns in the data.
- Built a Decision Tree model to capture non-linear relationships
- Problem: The model overfitted — near-perfect training performance but poor test performance
- This indicated the need for an ensemble approach
- Applied Random Forest to reduce overfitting through bagging (training multiple trees on random subsets)
- Default Random Forest already showed improvement over a single decision tree
Hyperparameter Tuning:
Tuned the following parameters to optimize performance:
| Parameter | Purpose |
|---|---|
n_estimators |
Number of trees in the forest |
max_depth |
Maximum depth of each tree |
min_samples_split |
Minimum samples required to split a node |
min_samples_leaf |
Minimum samples required at a leaf node |
Tuning methods used:
- GridSearchCV — Exhaustive search over a defined parameter grid
- RandomizedSearchCV — Efficient random sampling from the parameter space
- Applied XGBoost, a gradient boosting ensemble method, for further improvement
- XGBoost builds trees sequentially, where each new tree corrects the errors of the previous one
Hyperparameter Tuning:
| Parameter | Purpose |
|---|---|
n_estimators |
Number of boosting rounds |
learning_rate |
Step size for each boosting round |
max_depth |
Maximum depth of each tree |
subsample |
Fraction of samples used per tree |
Used GridSearchCV and RandomizedSearchCV to find the best combination.
- Extracted feature importances from both the best Random Forest and XGBoost models
- Smoker status emerged as the most important predictor of insurance charges, followed by age and BMI
Linear Regression (Age only) → Underfitting
Linear Regression (Age + BMI) → Still underfitting
Linear Regression (All features) → Decent baseline (~0.75 R2)
Decision Tree → Overfitting
Random Forest (Default) → Reduced overfitting
Random Forest (Tuned) → Improved generalization
XGBoost (Default) → Strong performance
XGBoost (Tuned) → Best performance
- Language: Python
- Libraries: pandas, numpy, matplotlib, seaborn, scikit-learn, xgboost
- Environment: Google Colab / Jupyter Notebook
| File | Description |
|---|---|
Case_Study_Health_Insurance_Premium_Prediction (1).ipynb |
Detailed solution with explanations (up to Linear Regression) |
HealthInsurance_PremiumPrediction.ipynb |
Complete solution including Decision Tree, Random Forest, and XGBoost |
insurance_prediction.csv |
Dataset used for the project |
- Smoking status is the single most important factor affecting insurance premiums
- Age and BMI are the next most significant predictors
- Linear regression provides a reasonable baseline but cannot capture non-linear patterns
- Decision trees overfit easily without constraints
- Ensemble methods (Random Forest, XGBoost) with hyperparameter tuning deliver the best results
- XGBoost with tuned hyperparameters achieved the best overall performance