- π Project Overview
- π― Business Problem & Objectives
- π οΈ Tech Stack & Tools
- βοΈ Data Engineering & SQL Intelligence
- π Statistical Validation of Key Insights
- π€ Predictive Churn Modeling
- π Visual Intelligence (Dashboards)
- π Key Business Insights
- π Business Impact & Strategic Recommendations
- π Conclusion
- π Repository Structure
- π¨βπ» Author & Contact
Most e-commerce businesses focus on customer acquisition, but revenue growth can often mask deep-rooted operational inefficiencies. Using the Brazilian Olist dataset (100K+ records), I developed an end-to-end analytics ecosystem that moves from raw data engineering to executive-level intelligence. By connecting logistics performance directly to customer satisfaction and financial value, this project provides a data-driven roadmap to shift a business from leaky-bucket growth to sustainable retention.
The project was further extended with a machine learning-based churn prediction framework using Random Forest modeling to identify high-risk customers and simulate retention intervention strategies.
High-growth platforms often struggle to answer critical questions that affect the bottom line. This project was designed to solve:
- The Retention Paradox: Why are ~96% of customers one-time buyers despite steady revenue?
- Logistics Friction: To what extent do delivery delays damage brand equity and review scores?
- Financial Leakage: Where are the opportunities to increase Average Order Value (AOV) through payment behavior?
- Operational Health: How can we visualize complex fulfillment cycles into a simple decision-support system?
Primary Objectives:
- Analyze revenue trends and growth patterns to identify peak performance periods.
- Evaluate delivery performance and quantify its direct impact on customer satisfaction.
- Understand customer behavior to identify homogeneous segments for targeted loyalty strategies.
- Build decision-support systems that transform raw data into visual strategy.
- Develop a predictive churn analytics pipeline to identify at-risk customers and estimate retention opportunities.
- SQL (Google BigQuery): Advanced data transformation, view orchestration, and business logic implementation.
- Python (Pandas, Matplotlib, Seaborn, SciPy, Scikit-learn): Exploratory Data Analysis (EDA) and rigorous statistical testing.
- Power BI: Professional dashboard development for executive and operational stakeholders.
- Jupyter Notebook / Colab: Research documentation and reproducible analysis.
Instead of performing analysis on raw, fragmented tables, I engineered a robust Business Intelligence Layer in BigQuery.
- Modular Data Orchestration: Created dedicated SQL views for Delivery, Revenue, Reviews, and Payments to isolate key business domains.
- Unified Master Layer: Joined 5+ relational tables into a consolidated
master_orders_featuresdataset for high-speed analysis. - Metric Engineering: Used SQL to derive complex KPIs including Delivery Precision (Actual vs. Estimated), Late Delivery Flags, and High-Value Order categorization.
- Cohort Identification: Implemented logic to track customer acquisition trends and distinguish between first-time and returning users.
I used Python to move beyond observation, applying statistical methods to ensure all findings were mathematically significant (p < 0.05).
- Logistics vs. Satisfaction: Proved via hypothesis testing that late deliveries correlate with a ~2.0 point drop in review scores.
- Revenue Drivers: Identified that Credit Card users have a ~21 BRL higher AOV compared to other payment types (Validated via ANOVA).
- Revenue Correlation: Found a moderate correlation (~0.49) between order value and freight cost, highlighting significant shipping price sensitivity among customers.
- The Retention Gap: Discovered that 96% of the 100K+ customers are one-time buyers, identifying a critical failure in the post-purchase funnel.
- Behavioral Distribution: Performed deep EDA to map the distribution of order values and delivery delay patterns across different regions.
To move beyond descriptive analytics, the project was extended into a machine learning-based customer churn prediction framework.
- Engineered customer-level behavioral features from transactional ecommerce data
- Applied temporal train-test separation to prevent future information leakage
- Built a Random Forest classification model to predict customer churn risk
- Evaluated model performance using Accuracy, Recall, F1-score, and ROC-AUC metrics
- Simulated retention intervention strategies using predicted churn probabilities
- Operational variables such as delivery performance and freight cost emerged as the strongest churn drivers
- The model identified 12K+ high-risk customers for potential retention targeting
- Simulated retention campaigns demonstrated projected churn reduction opportunities of approximately 15%
- Estimated retention simulations suggested potential protection of approximately 384K BRL in customer revenue
I translated complex datasets into four strategic Power BI dashboards to drive data-informed decision-making.
Focuses on core KPIs: Revenue, Total Orders, AOV, and the Late Delivery % rate.

Analyzes monthly revenue growth and provides Year-over-Year (YoY) comparisons.

Visualizes customer distribution, order value patterns, and the retention gap.

Evaluates delivery efficiency metrics and their direct impact on customer ratings.

- Retention Warning: Approximately 96% of customers (90K+) are one-time buyers, indicating a critical lack of post-purchase engagement.
- Credit Card Value: Credit cards aren't just a payment method; they are a driver of higher-value purchases, suggesting that optimizing payment partnerships could directly increase AOV.
- Logistics is Marketing: Customer satisfaction is driven more by delivery speed than product price; a single day of delay has a measurable negative impact on brand perception.
- Seasonal Stability: While 2017 showed explosive growth, 2018 showed stabilization, suggesting the market is maturing and requires a shift toward efficiency.
By integrating SQL engineering with statistical rigor, this project transitioned from raw data analysis to a strategic diagnostic tool, delivering the following impacts and actionable roadmaps:
- Identified a Critical Retention Gap: Exposed that 96% of the 100K+ customers are one-time buyers, proving that current revenue growth is unsustainably dependent on new acquisition rather than brand loyalty.
- Quantified Operational Loss: Statistically linked logistics to brand equity, proving that delivery delays cause a measurable 2.0 point drop in customer satisfaction scores.
- Unlocked Revenue Drivers: Discovered that specific payment behaviors (Credit Cards) correlate with a ~21 BRL increase in Average Order Value (AOV).
- Optimized Decision-Making: Developed a unified Business Intelligence layer that tracks 5+ real-time KPIs, reducing the time required for executive performance reviews.
- Enabled Predictive Retention Targeting: Built a churn prediction framework capable of identifying 12K+ high-risk customers and supporting simulated retention intervention strategies.
- Pivot to Retention Marketing: Shift budget from cold acquisition to loyalty-driven campaigns (e.g., personalized "Welcome Back" offers) to convert the 96% one-time buyer base into repeat customers.
- Optimize Fulfillment Benchmarks: Focus logistics improvements on reducing the "Actual vs. Estimated" delivery gap, as this specific metric is the primary driver of negative reviews.
- Incentivize High-Value Payments: Since credit card users spend significantly more, implement targeted promotions or cashback rewards for credit transactions to naturally lift the platform's overall AOV.
- Regional Freight Strategy: Adjust shipping subsidies or carrier partnerships in regions with high freight-to-order-value correlations (~0.49) to reduce price sensitivity and cart abandonment.
This project demonstrates that Data Analytics is not just about counting thingsβitβs about weighing them. By connecting SQL engineering with statistical rigor and visual storytelling, I was able to:
- π Expose the hidden retention problem behind positive growth numbers.
- π Quantify the financial cost of operational delays.
- π³ Identify the specific customer behaviors that drive higher revenue.
- π οΈ Build a repeatable framework that turns raw transactional logs into a strategic roadmap.
The addition of predictive churn modeling further extended the project from descriptive business intelligence into proactive customer retention analytics.
ecommerce-analytics-project/
β
βββ data/
β βββ raw/ # ignored
β βββ processed/
β β βββ customer_churn_modeling_dataset.csv
β β βββ high_risk_customers.csv
β β βββ olist_master_orders_features.csv
β βββ analytics/
β βββ 07_01_revenue_trend.csv
β βββ 07_02_delivery_performance.csv
β βββ 07_03_customer_retention.csv
β βββ 07_04_cohort_analysis.csv
β
βββ sql/
β βββ 01_create_delivery_view.sql
β βββ 02_create_revenue_view.sql
β βββ 03_create_review_view.sql
β βββ 04_create_payment_view.sql
β βββ 05_create_master_table.sql
β βββ 06_feature_engineering.sql
β βββ 07_01_revenue_trend.sql
β βββ 07_02_delivery_performance.sql
β βββ 07_03_customer_retention.sql
β βββ 07_04_cohort_analysis.sql
β βββ 08_export_master_orders_features.sql
β
βββ notebooks/
β βββ 01_eda.ipynb
β βββ 02_hypothesis_testing.ipynb
β βββ 03_sql_business_analysis.ipynb
β βββ 04_churn_prediction.ipynb
β
βββ dashboard/
β βββ olist_dashboard.pbix
β βββ Business_Performance_Dashboard.pptx
β βββ olist_dashboard.pdf
β
βββ images/
β βββ sql_outputs/ # from BigQuery
β β βββ 07_01_revenue_trend.png
β β βββ 07_02_delivery_performance.png
β β βββ 07_03_customer_retention.png
β β βββ 07_04_cohort_analysis.png
β β
β βββ dashboard/ # Power BI screenshots
β βββ 01_executive_overview.jpg
β βββ 02_revenue_trends.jpg
β βββ 03_customer_insights.jpg
β βββ 04_delivery_performance.jpg
β
βββ .gitignore
βββ requirements.txt
βββ README.md
Mohd Walid Ansari Data Analyst | Problem Solver
- Email: walidmohd2532001@gmail.com
- LinkedIn: Mohd Walid Ansari
- GitHub: @mohdwalid253
β If you find this project insightful, feel free to star the repository!