This project addresses a critical business challenge: customer churn prediction and retention strategy development. Customer churn represents a significant revenue loss for subscription-based businesses, and identifying at-risk customers before they leave is crucial for maintaining business growth.
The system combines data engineering, machine learning, and business analytics to predict customer churn probability and provide actionable retention strategies. It demonstrates a complete data science workflow from raw data processing to production-ready insights.
Live Demo: Interactive Dashboard
Without proper analytics, subscription-based businesses often:
- Lose valuable customers without warning
- Waste resources on customers unlikely to churn
- Miss opportunities to retain high-value customers
- Lack data-driven retention strategies
This project solves these challenges by:
- Identifying customers at high risk of churning
- Providing targeted retention recommendations
- Quantifying the business impact of churn
- Enabling proactive customer retention efforts
The system uses a comprehensive dataset containing:
Customer Data (10,000 records):
- Demographics: age, location, industry, company size
- Subscription details: plan type, payment method, tenure
- Behavioral metrics: usage patterns, login frequency
Usage Data:
- Monthly usage hours and feature utilization
- Session duration and activity patterns
- Feature adoption rates
Support Data:
- Ticket volume and resolution times
- Customer satisfaction scores
- Support interaction patterns
Data Quality:
- Clean, structured data with minimal missing values
- Realistic business scenarios and patterns
- Balanced representation across customer segments
The solution follows a systematic approach:
- Data Pipeline: Extract, transform, and load customer data from multiple sources
- Feature Engineering: Create predictive features from raw behavioral data
- Model Development: Train and validate machine learning models
- Risk Scoring: Generate churn probability scores for each customer
- Analytics Dashboard: Provide interactive insights and recommendations
- Action Planning: Generate targeted retention strategies for high-risk customers
The ETL process extracts data from a SQLite database and CSV files, cleans it, and creates aggregated features. Key features include:
- Monthly usage aggregations
- Support ticket patterns
- Engagement scores
- Risk indicators
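The aggregation step above can be sketched as follows. This is a minimal illustration, not the project's pipeline: the table names, column names, sample rows, and the risk rule are all hypothetical. It uses only the standard-library `sqlite3` module so it runs without the project's dependencies, whereas the project itself lists Pandas for data manipulation.

```python
import sqlite3

# Hypothetical schema and rows; the real tables and columns are not shown in this README.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE usage (customer_id INTEGER, month TEXT, usage_hours REAL);
    CREATE TABLE tickets (customer_id INTEGER, resolution_hours REAL);
    INSERT INTO usage VALUES (1, '2024-01', 40.0), (1, '2024-02', 20.0),
                             (2, '2024-01', 5.0),  (2, '2024-02', 3.0);
    INSERT INTO tickets VALUES (1, 4.0), (2, 30.0), (2, 18.0);
""")

# One row per customer: average monthly usage plus support ticket statistics.
query = """
    SELECT u.customer_id,
           u.avg_monthly_hours,
           COALESCE(t.ticket_count, 0) AS ticket_count,
           COALESCE(t.avg_resolution_hours, 0.0) AS avg_resolution_hours
    FROM (SELECT customer_id, AVG(usage_hours) AS avg_monthly_hours
          FROM usage GROUP BY customer_id) AS u
    LEFT JOIN (SELECT customer_id, COUNT(*) AS ticket_count,
                      AVG(resolution_hours) AS avg_resolution_hours
               FROM tickets GROUP BY customer_id) AS t
      ON t.customer_id = u.customer_id
    ORDER BY u.customer_id
"""
features = conn.execute(query).fetchall()

# Illustrative risk indicator (an assumption): low usage combined with repeated tickets.
flagged = [row[0] for row in features if row[1] < 10 and row[2] >= 2]

for row in features:
    print(row)
print("flagged customers:", flagged)
```

Pre-aggregating each table before the join keeps one row per customer, so ticket counts are not inflated by the number of usage months.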
- Algorithm: Random Forest Classifier
- Performance: 94.05% accuracy, 66.7% precision, 77.9% recall
- Feature Selection: Top 15 features identified through importance analysis
- Validation: 5-fold cross-validation for robust evaluation
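A rough sketch of how a setup like this could be evaluated with scikit-learn is shown below. It runs on synthetic data from `make_classification`, so its scores will not match the figures above, and the hyperparameters are placeholders rather than the project's tuned values. Placing the top-15 selection inside a `Pipeline` is a design choice of this sketch (it keeps feature selection from seeing the validation folds); the README does not say how the project orders these steps.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the engineered feature table (not the project's data).
X, y = make_classification(n_samples=600, n_features=30, n_informative=8,
                           weights=[0.9, 0.1], random_state=42)

pipeline = Pipeline([
    # Keep the 15 most important features according to a Random Forest.
    # threshold=-np.inf makes max_features the only selection criterion.
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        max_features=15, threshold=-np.inf)),
    # Final classifier trained on the selected features.
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# 5-fold stratified cross-validation, reporting the same three metrics as above.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(pipeline, X, y, cv=cv,
                        scoring=["accuracy", "precision", "recall"])

for metric in ("accuracy", "precision", "recall"):
    print(metric, round(scores[f"test_{metric}"].mean(), 3))
```

Reporting precision and recall alongside accuracy matters when churners are a small share of customers, because a model can score high accuracy while still missing many of them.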
Built with Streamlit and Plotly for interactive data visualization and real-time analytics. The dashboard provides comprehensive customer churn risk analysis and targeted retention strategy recommendations.
Backend:
- Python 3.9+ for core development
- Pandas for data manipulation
- NumPy for numerical computing
- SQLite for data storage
Machine Learning:
- Scikit-learn for model training
- Random Forest for classification
- GridSearchCV for hyperparameter optimization
- SHAP for model interpretability
Frontend:
- Streamlit for web application
- Plotly for interactive visualizations
- Custom CSS for professional styling
Deployment:
- Streamlit Cloud for production deployment
- Git/GitHub for version control
The interactive Streamlit dashboard provides comprehensive analytics across multiple tabs:
- Key performance indicators (KPIs)
- Churn rate and revenue at risk metrics
- High-level business impact summary
- 3D scatter plots for customer segmentation
- Interactive filters for data exploration
- Real-time chart updates based on selections
- Detailed table of customers at risk
- Search and sort functionality
- Pagination for large datasets
- Export capabilities
- Automated recommendations based on customer segments
- Personalized retention strategies
- Risk level explanations
- Executive summary of findings
- Action plan with timelines
- ROI analysis and cost-benefit breakdown
- Implementation roadmap
- Accuracy metrics and evaluation results
- Feature importance rankings
- ROC-AUC curves and confusion matrices
churn-prediction/
├── data/ # Data files and database
│ ├── customers.csv # Customer demographic data
│ ├── usage_data.csv # Usage patterns and metrics
│ ├── support_tickets.csv # Support interaction data
│ ├── churn_risk_predictions.csv # ML model predictions
│ └── churn_prediction.db # SQLite database
├── src/ # Source code
│ ├── data_pipeline/ # ETL and data processing
│ ├── feature_engineering/ # Feature creation and selection
│ ├── models/ # ML model training
│ └── dashboard/ # Streamlit application
├── models/ # Trained ML models
├── notebooks/ # Analysis notebooks
├── docs/ # Documentation
├── screenshots/ # Dashboard screenshots
└── requirements.txt # Python dependencies
The project includes three comprehensive analysis notebooks:
- Data Exploration and Cleaning: Initial data analysis, quality assessment, and cleaning procedures
- Feature Engineering Analysis: Detailed feature creation process and business logic explanation
- Model Training Analysis: Complete model development workflow and performance evaluation
These notebooks provide transparency into the analytical process and serve as documentation for the methodology.
The system has identified significant business opportunities:
- 913 high-risk customers (9.1% of total) with 85%+ churn probability
- $68,726 monthly revenue at risk
- $602,000 annual savings potential through targeted retention
- 1,218% ROI on retention efforts
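For readers who want to see how figures of this kind are typically derived, here is a small sketch. The customer records and campaign numbers are hypothetical, and the ROI formula is the common (savings - cost) / cost definition, which this README does not confirm as the one used for the numbers above. Only the 0.85 cutoff comes from the findings listed.

```python
HIGH_RISK_THRESHOLD = 0.85  # matches the "85%+ churn probability" cutoff above

# Hypothetical records: (customer_id, churn_probability, monthly_revenue)
customers = [
    (101, 0.92, 120.0),
    (102, 0.40, 80.0),
    (103, 0.88, 60.0),
    (104, 0.10, 200.0),
]

# Customers at or above the cutoff count as high-risk.
high_risk = [c for c in customers if c[1] >= HIGH_RISK_THRESHOLD]
share_high_risk = len(high_risk) / len(customers)
monthly_revenue_at_risk = sum(c[2] for c in high_risk)


def roi_percent(savings: float, cost: float) -> float:
    """Return on investment as a percentage: (savings - cost) / cost * 100."""
    return (savings - cost) / cost * 100


print(f"high-risk share: {share_high_risk:.1%}")
print(f"monthly revenue at risk: ${monthly_revenue_at_risk:,.0f}")
# Hypothetical campaign numbers, for illustration only.
print(f"ROI: {roi_percent(savings=1500.0, cost=500.0):.0f}%")
```

Applied to the reported counts, the same share calculation gives 913 / 10,000 = 9.13%, consistent with the 9.1% stated above.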
- Python 3.9+
- Git
- pip or conda package manager
# Clone the repository
git clone https://github.com/Krish3na/churn-prediction.git
cd churn-prediction
# Install dependencies
pip install -r requirements.txt
# Generate sample data
python src/data_pipeline/generate_sample_data.py
# Run complete data pipeline
python run_pipeline.py
# Launch dashboard locally
streamlit run src/dashboard/app.py
# Data pipeline
python src/data_pipeline/main.py
# Feature engineering
python src/feature_engineering/feature_engineering.py
# Model training
python src/models/train_model.py
The dashboard provides comprehensive analytics through multiple views:
Main dashboard showing key performance indicators and business metrics
Interactive 3D scatter plot for customer segmentation analysis
Detailed table of customers identified as high-risk with search and sort capabilities
AI-powered recommendations and strategic insights
Executive summary and action plan with ROI analysis
Machine learning model performance metrics and evaluation results
Geographic distribution of customer risk levels
Customer segmentation analysis by various demographic and behavioral factors
- Automated Risk Scoring: Real-time churn probability calculation
- Targeted Recommendations: Personalized retention strategies for high-risk customers
- Business Impact Analysis: Quantified ROI and savings potential
- Interactive Analytics: Multi-dimensional data exploration
- Production-Ready: Deployed and accessible via a web interface
- High-Risk Customer Identification: Pinpoints 9.1% of customers at highest churn risk
Potential improvements include:
- Integration with real-time data sources
- More advanced machine learning models (deep learning, additional ensemble methods beyond Random Forest)
- Automated alerting system for high-risk customers
- A/B testing framework for retention strategies
- API endpoints for integration with existing systems
Contributions are welcome. Please follow these steps:
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
This project is licensed under the MIT License.
For questions or support, please open an issue on GitHub or contact the development team.
This project demonstrates practical application of data science in solving real business problems, combining technical expertise with business acumen to drive measurable results.