RFM-based customer segmentation using K-Means and Agglomerative Hierarchical Clustering for Myntra Gifts Ltd. — identifying Champions, Loyal, New, and At-Risk customers to enable targeted marketing strategies.
- Business Problem
- Dataset Overview
- Methodology
- EDA Highlights
- Customer Segments
- Model Results
- Project Structure
- How to Run
- Tech Stack
Myntra Gifts Ltd. is a UK-based online retailer specialising in unique all-occasion giftware. The company operates exclusively online and serves customers across 38 countries.
The Problem: All customers were treated uniformly — same marketing messages, same promotions, same communication strategy regardless of purchase behaviour. A customer who buys every month and spends thousands received identical treatment to one who purchased once two years ago.
The Goal: Apply unsupervised machine learning to segment the customer base into distinct behavioural groups using RFM (Recency, Frequency, Monetary) analysis, then assign targeted business strategies to each segment to:
- Improve customer retention
- Reduce churn among at-risk customers
- Maximise revenue from high-value segments
- Allocate marketing budget efficiently
| Property | Value |
|---|---|
| Source | Myntra Gifts Ltd. — Online Retail Transactions |
| Period | December 2009 – December 2011 |
| Raw Rows | 541,909 |
| Columns | 8 |
| Countries | 38 |
| Unique Customers (after cleaning) | ~4,300 |
Dataset URL (Google Sheets):
https://docs.google.com/spreadsheets/d/1_W3Jfp1bTWpPFmqyGgGXYJGd0rHIV8dD/export?format=csv&gid=501524341
Columns:
| Column | Type | Description |
|---|---|---|
| InvoiceNo | Object | Unique transaction ID. Values starting with 'C' = cancellation |
| StockCode | Object | Unique product identifier |
| Description | Object | Product name |
| Quantity | Integer | Units purchased. Negative = returns |
| InvoiceDate | Object → Datetime | Transaction timestamp |
| UnitPrice | Float | Price per unit in GBP |
| CustomerID | Float → Integer | Unique customer identifier (24.93% missing) |
| Country | Object | Customer's country |
Raw Data (541,909 rows)
│
▼
Data Wrangling
├── Convert InvoiceDate → datetime
├── Drop missing CustomerID (~135,080 rows)
├── Remove cancelled transactions (InvoiceNo starts with 'C')
├── Remove Quantity ≤ 0 and UnitPrice ≤ 0
├── Engineer TotalAmount = Quantity × UnitPrice
└── Remove duplicate rows
│
▼
Exploratory Data Analysis
└── 13 charts: univariate + bivariate + temporal + RFM distributions
│
▼
Hypothesis Testing (3 tests)
├── H1: UK vs non-UK order values (Mann-Whitney U)
├── H2: Frequency vs Monetary correlation (Spearman)
└── H3: Q4 vs non-Q4 revenue (Mann-Whitney U, one-sided)
│
▼
Feature Engineering → RFM Table
├── Recency : days since last purchase
├── Frequency: unique invoices per customer
└── Monetary : total GBP spend per customer
│
▼
Preprocessing
├── Outlier capping at 99th percentile (Winsorization)
├── StandardScaler (zero mean, unit variance)
└── PCA to 2 components (for visualisation only)
│
▼
ML Models
├── K-Means Clustering (Elbow + Silhouette → K=4)
└── Agglomerative Hierarchical Clustering (Ward linkage, K=4)
│
▼
Cluster Profiling → 4 Business Segments
├── Champions
├── Loyal Customers
├── New Customers
└── At-Risk Customers
│
▼
Deployment
├── kmeans_customer_segmentation.pkl
└── rfm_standard_scaler.pkl
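The Data Wrangling and RFM steps in the pipeline above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names `clean_transactions` and `build_rfm`, the toy `raw` frame, and the choice of snapshot date (one day after the last invoice) are all assumptions.

```python
import pandas as pd

def clean_transactions(df):
    """Apply the wrangling steps listed in the pipeline (hypothetical helper)."""
    df = df.copy()
    df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
    df = df.dropna(subset=["CustomerID"])                       # drop missing CustomerID
    df = df[~df["InvoiceNo"].astype(str).str.startswith("C")]   # remove cancellations
    df = df[(df["Quantity"] > 0) & (df["UnitPrice"] > 0)]       # remove returns / zero prices
    df["CustomerID"] = df["CustomerID"].astype(int)
    df["TotalAmount"] = df["Quantity"] * df["UnitPrice"]
    return df.drop_duplicates()

def build_rfm(df, snapshot_date=None):
    """Aggregate cleaned transactions into one RFM row per customer."""
    if snapshot_date is None:
        # Assumption: measure recency from the day after the last invoice
        snapshot_date = df["InvoiceDate"].max() + pd.Timedelta(days=1)
    return df.groupby("CustomerID").agg(
        Recency=("InvoiceDate", lambda s: (snapshot_date - s.max()).days),
        Frequency=("InvoiceNo", "nunique"),
        Monetary=("TotalAmount", "sum"),
    )

# Toy data: one duplicate row, one cancellation, one row without CustomerID
raw = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1002", "C1003", "1004", "1005"],
    "StockCode": ["A", "A", "B", "B", "C", "D"],
    "Quantity": [2, 2, 1, -1, 3, 5],
    "InvoiceDate": ["2011-12-01 10:00", "2011-12-01 10:00", "2011-12-05 11:00",
                    "2011-12-06 09:00", "2011-11-01 12:00", "2011-12-07 12:00"],
    "UnitPrice": [2.5, 2.5, 4.0, 4.0, 1.0, 3.0],
    "CustomerID": [1.0, 1.0, 1.0, 1.0, 2.0, None],
})

rfm = build_rfm(clean_transactions(raw))
print(rfm)
```

Counting unique `InvoiceNo` values rather than rows keeps a multi-item basket from inflating Frequency.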
Key findings across 13 charts:
- UK Dominance: The United Kingdom accounts for over 90% of all transactions — the business is heavily UK-centric, presenting both a focus opportunity and a concentration risk.
- Strong Seasonality: Revenue peaks every November–December (holiday gifting season). Year-over-year growth from 2010 to 2011 confirms business expansion.
- B2B Buying Pattern: Orders peak between 10 AM–2 PM on Wednesday and Thursday — consistent with trade/wholesale buyers ordering during UK business hours.
- Right-Skewed Spend: All RFM features are right-skewed. A small proportion of customers drive a disproportionate share of revenue — confirming the Pareto principle in this dataset.
- Low Cancellation Rate: ~2% of transactions are cancellations — small in volume but representing real operational cost.
- Price Distribution: Most products are priced GBP 1–5, making the business volume-dependent.
Hypothesis Testing Results:
| Hypothesis | Test | Result |
|---|---|---|
| UK vs non-UK average order value differs | Mann-Whitney U (two-sided) | Reject H₀ — significant difference |
| Frequency & Monetary are positively correlated | Spearman rank correlation | Reject H₀ — strong positive correlation |
| Q4 revenue is higher than non-Q4 | Mann-Whitney U (one-sided) | Reject H₀ — Q4 significantly higher |
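The three tests in the table map onto `scipy.stats` calls along these lines. The sketch runs on synthetic data, so the numbers it prints say nothing about the real dataset; the variable names, the sample construction, and the 0.05 significance level are assumptions for illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu, spearmanr

rng = np.random.default_rng(0)
alpha = 0.05  # assumed significance level

# H1: UK vs non-UK order values (two-sided Mann-Whitney U)
uk_orders = rng.lognormal(mean=3.0, sigma=0.5, size=200)
non_uk_orders = rng.lognormal(mean=3.5, sigma=0.5, size=200)
_, p_h1 = mannwhitneyu(uk_orders, non_uk_orders, alternative="two-sided")

# H2: Frequency vs Monetary (Spearman rank correlation)
frequency = rng.integers(1, 20, size=200)
monetary = frequency * 50 + rng.normal(0, 100, size=200)
rho, p_h2 = spearmanr(frequency, monetary)

# H3: Q4 revenue greater than non-Q4 (one-sided Mann-Whitney U)
q4_revenue = rng.lognormal(mean=3.4, sigma=0.5, size=200)
other_revenue = rng.lognormal(mean=3.0, sigma=0.5, size=200)
_, p_h3 = mannwhitneyu(q4_revenue, other_revenue, alternative="greater")

for name, p in [("H1", p_h1), ("H2", p_h2), ("H3", p_h3)]:
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: p = {p:.4g} -> {decision}")
print(f"Spearman rho = {rho:.3f}")
```

Rank-based tests suit this data because order values and spend are right-skewed, which makes t-tests and Pearson correlation less reliable.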
Four segments were identified using K-Means clustering on standardised RFM features.
| Segment | Recency | Frequency | Monetary | Size | Business Strategy |
|---|---|---|---|---|---|
| 🏆 Champions | Very Low | Very High | Very High | ~15% | VIP loyalty rewards, early product access, referral programmes |
| 💛 Loyal Customers | Low–Medium | Medium–High | Medium–High | ~25% | Upsell premium products, personalised recommendations, loyalty points |
| 🆕 New Customers | Low | Low | Low | ~30% | Welcome email series, onboarding discount, showcase bestsellers |
| At-Risk Customers | Very High | Low | Low | ~30% | Win-back campaign with heavy discount, satisfaction survey, retargeted ads |
RFM Profile Interpretation:
- Recency: Lower = purchased more recently (better)
- Frequency: Higher = orders more often (better)
- Monetary: Higher = spends more in total (better)
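Because K-Means returns arbitrary integer cluster IDs, the segment names have to be assigned by reading each cluster's mean RFM profile. The sketch below shows one possible way to do that. The toy `rfm` table and the `name_segments` rule are hypothetical, and the notebook may label clusters differently (for example, by manual inspection).

```python
import pandas as pd

# Hypothetical RFM table with cluster labels already assigned
rfm = pd.DataFrame({
    "Recency":   [5, 8, 30, 40, 12, 15, 300, 280],
    "Frequency": [20, 25, 8, 10, 1, 2, 1, 2],
    "Monetary":  [5000, 6000, 1500, 1800, 100, 150, 120, 90],
    "Cluster":   [0, 0, 1, 1, 2, 2, 3, 3],
})

# Mean RFM values and relative size of each cluster
profile = rfm.groupby("Cluster")[["Recency", "Frequency", "Monetary"]].mean()
profile["Share"] = rfm["Cluster"].value_counts(normalize=True).sort_index()

def name_segments(profile):
    """Assumed heuristic: name clusters by their extreme RFM means."""
    names = {}
    champions = profile["Monetary"].idxmax()      # highest spend
    names[int(champions)] = "Champions"
    rest = profile.drop(index=champions)
    at_risk = rest["Recency"].idxmax()            # longest since last purchase
    names[int(at_risk)] = "At-Risk Customers"
    rest = rest.drop(index=at_risk)
    loyal = rest["Frequency"].idxmax()            # more frequent of the remainder
    names[int(loyal)] = "Loyal Customers"
    for cluster in rest.drop(index=loyal).index:
        names[int(cluster)] = "New Customers"
    return names

segment_names = name_segments(profile)
rfm["Segment"] = rfm["Cluster"].map(segment_names)
print(profile)
print(segment_names)
```

Any such rule should be checked against the profile table after every retraining, since cluster IDs and their meaning can change between runs.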
| Model | Algorithm | K | Silhouette Score |
|---|---|---|---|
| Model 1 | K-Means Clustering | 4 | ~0.38 |
| Model 2 | Agglomerative Clustering (Ward) | 4 | ~0.35 |
Final Model: K-Means with K=4
Selected because:
- Higher Silhouette Score than Agglomerative Clustering
- Computationally scalable to full dataset (4,300+ customers)
- Cluster centroids can be saved and reused to score new customers
- Both algorithms independently agreed on K=4, validating the 4-segment structure
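A comparison like the one above can be reproduced in outline as follows: cap each feature at its 99th percentile, standardise, fit both models with K=4, and compare silhouette scores. The RFM values here are synthetic, so the printed scores will not match the ~0.38 and ~0.35 reported for the real data; `random_state` and `n_init` are assumed settings.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic, right-skewed RFM table (illustration only)
rng = np.random.default_rng(42)
rfm = pd.DataFrame({
    "Recency": rng.gamma(2.0, 45.0, size=400),
    "Frequency": rng.poisson(4, size=400) + 1,
    "Monetary": rng.lognormal(6.0, 1.0, size=400),
})

# Cap each feature at its own 99th percentile, then standardise
capped = rfm.clip(upper=rfm.quantile(0.99), axis=1)
X = StandardScaler().fit_transform(capped)

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
ward_labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X)

scores = {
    "K-Means": silhouette_score(X, kmeans_labels),
    "Agglomerative (Ward)": silhouette_score(X, ward_labels),
}
for name, score in scores.items():
    print(f"{name}: silhouette = {score:.3f}")
```

Only K-Means exposes a `predict` method for unseen rows, which is what makes it reusable for scoring new customers; scikit-learn's `AgglomerativeClustering` has no equivalent.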
Hyperparameter Tuning:
- K-Means: Elbow Method + Silhouette Score sweep across K = 2 to 10 → optimal at K=4
- Agglomerative: Ward vs Complete vs Average linkage comparison → Ward wins
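The tuning procedure can be sketched as a sweep over K that records inertia (for the elbow plot) and silhouette score, followed by a linkage comparison at the chosen K. The data below is a deliberately well-separated toy set with four groups, standing in for the scaled RFM matrix, so it only demonstrates the mechanics; `n_init` and `random_state` are assumed settings.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

# Toy stand-in for the scaled RFM matrix: four tight groups in 3-D
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [10, 0, 0], [0, 10, 0], [0, 0, 10]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(60, 3)) for c in centers])

# Sweep K = 2..10, recording inertia (elbow) and silhouette
inertia, silhouette = {}, {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertia[k] = model.inertia_
    silhouette[k] = silhouette_score(X, model.labels_)

best_k = max(silhouette, key=silhouette.get)
print(f"Best K by silhouette: {best_k}")

# Compare linkage strategies for Agglomerative clustering at K=4
linkage_scores = {
    linkage: silhouette_score(
        X, AgglomerativeClustering(n_clusters=4, linkage=linkage).fit_predict(X)
    )
    for linkage in ("ward", "complete", "average")
}
print(linkage_scores)
```

On real RFM data the elbow and the silhouette peak are rarely this clean, so it helps to read both curves together rather than rely on either alone.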
myntra-customer-segmentation/
├── notebooks/
│ └── Myntra_Customer_Segmentation.ipynb # Full analysis notebook
├── data/
│ └── README_data.md # Dataset source & description
├── models/
│ ├── kmeans_customer_segmentation.pkl # Trained K-Means model
│ └── rfm_standard_scaler.pkl # Fitted StandardScaler
├── outputs/
│ ├── rfm_segments.csv # CustomerID + RFM + Segment label
│ └── charts/ # All 13 EDA charts as PNG
├── reports/
│ └── Myntra_Segmentation_Report.pdf # Project summary report
├── .gitignore
├── requirements.txt
└── README.md
1. Clone the repository
git clone https://github.com/YOUR_USERNAME/myntra-customer-segmentation.git
cd myntra-customer-segmentation
2. Install dependencies
pip install -r requirements.txt
3. Launch the notebook
jupyter notebook notebooks/Myntra_Customer_Segmentation.ipynb
4. Run all cells. The notebook loads data directly from the Google Sheets URL, so no CSV download is needed.
5. Predict a new customer's segment
import joblib
import pandas as pd
model = joblib.load('models/kmeans_customer_segmentation.pkl')
scaler = joblib.load('models/rfm_standard_scaler.pkl')
new_customer = pd.DataFrame({
    'Recency': [10],    # days since last purchase
    'Frequency': [8],   # number of orders
    'Monetary': [450],  # total spend in GBP
})
scaled = scaler.transform(new_customer)
cluster = model.predict(scaled)
# predict() returns an integer cluster ID; map it to a segment name (e.g. Champions) using the cluster profile
print(f'Predicted cluster: {cluster[0]}')

| Library | Purpose |
|---|---|
| pandas | Data loading, wrangling, aggregation |
| numpy | Numerical operations |
| matplotlib / seaborn | All 13 EDA charts |
| scikit-learn | StandardScaler, KMeans, AgglomerativeClustering, PCA, silhouette_score |
| scipy | Hierarchical linkage, dendrogram, Mann-Whitney U, Spearman |
| joblib | Model serialisation and loading |
[Your Name]
Capstone Project: Unsupervised Machine Learning
AlmaBetter / [Your Institute Name]
This project is licensed under the MIT License.