Intern: Omokhoa Oshose Tosayoname
Intern ID: CA/DF1/71570
Duration: 20th May 2026 – 20th June 2026
This project predicts product sales from advertising spend across three media channels: TV, Radio, and Newspaper. Using the classic Advertising dataset (200 observations), we explore how budget allocation across channels influences sales and train several regression models to forecast sales.
Business Question: How does advertising spend across TV, Radio, and Newspaper channels drive sales, and which channel delivers the highest return?
Data Loading --> EDA & Visualisation --> Feature Engineering --> Model Training --> Evaluation --> Business Insights
CodeAlpha_SalesPrediction/
├── data/
│ └── Advertising.csv # Raw dataset
├── notebooks/
│ └── sales_prediction.ipynb # Main notebook (fully executed)
├── requirements.txt
└── README.md
| Feature | Description |
|---|---|
| TV | Advertising budget spent on TV (in $000s) |
| Radio | Advertising budget spent on Radio (in $000s) |
| Newspaper | Advertising budget spent on Newspaper (in $000s) |
| Sales | Units sold (in thousands) — target variable |
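
As a rough illustration of the schema above, the sketch below parses the four columns with Python's standard `csv` module. It is not the notebook's loader (which may well use pandas). The `load_advertising` helper and the two sample rows are made up for the example, and the column names are assumed to match the table exactly.

```python
import csv
import io

# Column names assumed to match the feature table above.
COLUMNS = ("TV", "Radio", "Newspaper", "Sales")


def load_advertising(fileobj):
    """Read an open CSV file and keep only the four documented columns as floats."""
    reader = csv.DictReader(fileobj)
    return [{col: float(row[col]) for col in COLUMNS} for row in reader]


# Made-up sample rows for illustration only; not taken from Advertising.csv.
sample_csv = io.StringIO(
    "TV,Radio,Newspaper,Sales\n"
    "100.0,20.0,10.0,12.5\n"
    "50.0,5.0,30.0,8.0\n"
)
rows = load_advertising(sample_csv)
print(len(rows), rows[0]["TV"], rows[1]["Sales"])
```

For the real file, the same helper could be called on `open("data/Advertising.csv", newline="")`. Any extra column in the raw file is ignored, because only the four named columns are read.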
| Model | Features Used |
|---|---|
| Linear Regression | Base (TV, Radio, Newspaper) |
| Ridge Regression | Base |
| Lasso Regression | Base |
| Polynomial Regression (degree 2) | Base |
| Random Forest Regressor | Base + engineered features |
| XGBoost Regressor | Base + engineered features |
Engineered features include: TV×Radio interaction, TV×Newspaper, Radio×Newspaper, Total Budget, TV Share, Radio Share.
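
The engineered features listed above could be derived along these lines. This is a minimal sketch rather than the notebook's code: the function name and dictionary keys are hypothetical, and it assumes that "TV Share" and "Radio Share" mean each channel's fraction of the total budget.

```python
def engineer_features(tv, radio, newspaper):
    """Return the base features plus the six engineered ones for one observation."""
    total = tv + radio + newspaper
    return {
        "TV": tv,
        "Radio": radio,
        "Newspaper": newspaper,
        # Pairwise interaction terms
        "TV_x_Radio": tv * radio,
        "TV_x_Newspaper": tv * newspaper,
        "Radio_x_Newspaper": radio * newspaper,
        # Budget aggregates (shares assumed to be fractions of the total)
        "Total_Budget": total,
        "TV_Share": tv / total if total else 0.0,
        "Radio_Share": radio / total if total else 0.0,
    }


features = engineer_features(100.0, 50.0, 50.0)
print(features["TV_x_Radio"], features["Total_Budget"], features["TV_Share"])
```

The zero-total guard only avoids a division error for an observation with no spend at all; whether such rows occur in the dataset is not stated here.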
| Model | R² | RMSE | MAE |
|---|---|---|---|
| Random Forest | 0.9880 | 0.6148 | 0.4797 |
| Polynomial Reg (d=2) | 0.9869 | 0.6426 | 0.5262 |
| XGBoost | 0.9846 | 0.6980 | 0.5449 |
| Linear Regression | 0.8994 | 1.7816 | 1.4608 |
| Ridge Regression | 0.8988 | 1.7872 | 1.4643 |
| Lasso Regression | 0.8983 | 1.7913 | 1.4613 |
Best model: Random Forest (R² = 0.9880)
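
For readers unfamiliar with the three metrics in the table, the sketch below computes them from their standard definitions using only the standard library. The `regression_metrics` helper and the toy numbers are illustrative; the notebook most likely relies on a library implementation, and this README does not state which data split the reported scores were computed on.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R², RMSE and MAE for paired actual and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R² is undefined when y_true is constant")
    return {
        "R2": 1 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
    }


# Toy values for illustration only; not results from this project.
metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
print(metrics)
```

RMSE and MAE are in the same units as Sales (thousands of units), so an RMSE of 0.6148 corresponds to a typical error of roughly 615 units.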
- TV advertising has the strongest correlation with sales (r ≈ 0.78).
- Radio is the second most impactful channel; its interaction with TV is highly predictive.
- Newspaper spending shows the weakest relationship with sales outcomes.
- Observations in the highest TV budget quartile record nearly 3x the sales of those in the lowest quartile.
- Recommendation: Prioritise TV and Radio spend for maximum sales lift; reconsider Newspaper allocation.
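
The correlation and quartile figures above can be reproduced in spirit with the sketch below. It is an assumption-laden illustration: both helper names are hypothetical, the quartiles are formed by rank (four equal-sized groups after sorting by TV budget), which may differ slightly from the notebook's binning, and the eight data points are made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def mean_sales_by_tv_quartile(tv, sales):
    """Mean sales in each of four rank-based TV budget groups, lowest first."""
    if len(tv) < 4:
        raise ValueError("need at least four observations")
    order = sorted(range(len(tv)), key=lambda i: tv[i])
    n = len(order)
    means = []
    for k in range(4):
        group = order[k * n // 4:(k + 1) * n // 4]
        means.append(sum(sales[i] for i in group) / len(group))
    return means


# Made-up values for illustration only; not taken from Advertising.csv.
tv = [10, 20, 30, 40, 50, 60, 70, 80]
sales = [2, 4, 5, 7, 8, 10, 11, 13]

r = pearson_r(tv, sales)
quartile_means = mean_sales_by_tv_quartile(tv, sales)
top_to_bottom = quartile_means[-1] / quartile_means[0]
print(round(r, 3), quartile_means, top_to_bottom)
```

A high correlation or a large top-to-bottom ratio describes association in the data, not a guaranteed causal return on extra spend.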
- Advertising budget distributions per channel
- Sales distribution and Q-Q plot
- Scatter plots: each channel vs sales with regression lines and correlation coefficients
- Correlation heatmaps (base and engineered features)
- Budget allocation pie chart and bar chart
- Box plots for all variables
- Pairplot of all features
- Engineered feature correlation matrix
- Model performance comparison (R², RMSE, MAE)
- Actual vs Predicted plots for top two models
- Residual analysis plots
- Random Forest feature importances
- OLS regression summary (statsmodels)
- Linear regression coefficients chart
- Sales segmentation by TV budget quartile
- Advertising channel ROI proxy
- Clone this repository:

      git clone https://github.com/Tosa9/CodeAlpha_SalesPrediction.git
      cd CodeAlpha_SalesPrediction

- Install dependencies:

      pip install -r requirements.txt

- Launch the notebook:

      jupyter notebook notebooks/sales_prediction.ipynb
CodeAlpha Data Science Internship | Task 4
#CodeAlpha #DataScience #MachineLearning #SalesPrediction #Python