Gauravscriptx/myntra-customer-segmentation

Myntra Customer Segmentation — Unsupervised ML

RFM-based customer segmentation using K-Means and Agglomerative Hierarchical Clustering for Myntra Gifts Ltd. — identifying Champions, Loyal, New, and At-Risk customers to enable targeted marketing strategies.


Business Problem

Myntra Gifts Ltd. is a UK-based online retailer specialising in unique all-occasion giftware. The company operates exclusively online and serves customers across 38 countries.

The Problem: All customers were treated uniformly — same marketing messages, same promotions, same communication strategy regardless of purchase behaviour. A customer who buys every month and spends thousands received identical treatment to one who purchased once two years ago.

The Goal: Apply unsupervised machine learning to segment the customer base into distinct behavioural groups using RFM (Recency, Frequency, Monetary) analysis, then assign targeted business strategies to each segment to:

  • Improve customer retention
  • Reduce churn among at-risk customers
  • Maximise revenue from high-value segments
  • Allocate marketing budget efficiently

Dataset Overview

| Property | Value |
| --- | --- |
| Source | Myntra Gifts Ltd. (Online Retail Transactions) |
| Period | December 2009 – December 2011 |
| Raw Rows | 541,909 |
| Columns | 8 |
| Countries | 38 |
| Unique Customers (after cleaning) | ~4,300 |

Dataset URL (Google Sheets):

https://docs.google.com/spreadsheets/d/1_W3Jfp1bTWpPFmqyGgGXYJGd0rHIV8dD/export?format=csv&gid=501524341

Columns:

| Column | Type | Description |
| --- | --- | --- |
| InvoiceNo | Object | Unique transaction ID. Values starting with 'C' = cancellation |
| StockCode | Object | Unique product identifier |
| Description | Object | Product name |
| Quantity | Integer | Units purchased. Negative = returns |
| InvoiceDate | Object → Datetime | Transaction timestamp |
| UnitPrice | Float | Price per unit in GBP |
| CustomerID | Float → Integer | Unique customer identifier (24.93% missing) |
| Country | Object | Customer's country |

Methodology

Raw Data (541,909 rows)
        │
        ▼
  Data Wrangling
  ├── Convert InvoiceDate → datetime
  ├── Drop missing CustomerID (~135,080 rows)
  ├── Remove cancelled transactions (InvoiceNo starts with 'C')
  ├── Remove Quantity ≤ 0 and UnitPrice ≤ 0
  ├── Engineer TotalAmount = Quantity × UnitPrice
  └── Remove duplicate rows
        │
        ▼
  Exploratory Data Analysis
  └── 13 charts: univariate + bivariate + temporal + RFM distributions
        │
        ▼
  Hypothesis Testing (3 tests)
  ├── H1: UK vs non-UK order values (Mann-Whitney U)
  ├── H2: Frequency vs Monetary correlation (Spearman)
  └── H3: Q4 vs non-Q4 revenue (Mann-Whitney U, one-sided)
        │
        ▼
  Feature Engineering → RFM Table
  ├── Recency  : days since last purchase
  ├── Frequency: unique invoices per customer
  └── Monetary : total GBP spend per customer
        │
        ▼
  Preprocessing
  ├── Outlier capping at 99th percentile (Winsorization)
  ├── StandardScaler (zero mean, unit variance)
  └── PCA to 2 components (for visualisation only)
        │
        ▼
  ML Models
  ├── K-Means Clustering (Elbow + Silhouette → K=4)
  └── Agglomerative Hierarchical Clustering (Ward linkage, K=4)
        │
        ▼
  Cluster Profiling → 4 Business Segments
  ├── Champions
  ├── Loyal Customers
  ├── New Customers
  └── At-Risk Customers
        │
        ▼
  Deployment
  ├── kmeans_customer_segmentation.pkl
  └── rfm_standard_scaler.pkl
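
The wrangling and RFM steps in the pipeline above can be sketched roughly as follows. This is a minimal illustration on a tiny hand-made frame, not the notebook's actual code: the helper names `clean_transactions` and `build_rfm`, the toy rows, and the choice of a snapshot date one day after the last transaction are all assumptions.

```python
import pandas as pd


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the wrangling steps from the pipeline diagram (hypothetical helper)."""
    out = df.copy()
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"])
    out = out.dropna(subset=["CustomerID"])                        # drop missing CustomerID
    out["CustomerID"] = out["CustomerID"].astype(int)
    out = out[~out["InvoiceNo"].astype(str).str.startswith("C")]   # drop cancellations
    out = out[(out["Quantity"] > 0) & (out["UnitPrice"] > 0)]
    out["TotalAmount"] = out["Quantity"] * out["UnitPrice"]
    return out.drop_duplicates()


def build_rfm(clean: pd.DataFrame, snapshot=None) -> pd.DataFrame:
    """Aggregate cleaned transactions into one RFM row per customer."""
    if snapshot is None:
        # Assumption: measure recency from the day after the last transaction.
        snapshot = clean["InvoiceDate"].max() + pd.Timedelta(days=1)
    return clean.groupby("CustomerID").agg(
        Recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),
        Frequency=("InvoiceNo", "nunique"),
        Monetary=("TotalAmount", "sum"),
    )


# Toy rows (a subset of the real columns): one exact duplicate, one
# cancellation, and one row with a missing CustomerID.
raw = pd.DataFrame({
    "InvoiceNo":   ["1001", "1001", "C1002", "1003", "1004", "1005"],
    "Quantity":    [2, 2, -1, 5, 3, 1],
    "InvoiceDate": ["2011-11-01 10:00", "2011-11-01 10:00", "2011-11-02 11:00",
                    "2011-12-01 09:30", "2011-12-05 14:00", "2011-12-06 12:00"],
    "UnitPrice":   [2.5, 2.5, 2.5, 1.0, 4.0, 3.0],
    "CustomerID":  [1.0, 1.0, 1.0, 1.0, 2.0, None],
})

rfm = build_rfm(clean_transactions(raw))
print(rfm)
```

Fixing the snapshot date explicitly (rather than using "today") keeps Recency reproducible when the notebook is re-run later.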

EDA Highlights

Key findings across 13 charts:

  • UK Dominance: The United Kingdom accounts for over 90% of all transactions — the business is heavily UK-centric, presenting both a focus opportunity and a concentration risk.
  • Strong Seasonality: Revenue peaks every November–December (holiday gifting season). Year-over-year growth from 2010 to 2011 confirms business expansion.
  • B2B Buying Pattern: Orders peak between 10 AM–2 PM on Wednesday and Thursday — consistent with trade/wholesale buyers ordering during UK business hours.
  • Right-Skewed Spend: All RFM features are right-skewed. A small proportion of customers drive a disproportionate share of revenue — confirming the Pareto principle in this dataset.
  • Low Cancellation Rate: ~2% of transactions are cancellations — small in volume but representing real operational cost.
  • Price Distribution: Most products are priced GBP 1–5, making the business volume-dependent.

Hypothesis Testing Results:

| Hypothesis | Test | Result |
| --- | --- | --- |
| UK vs non-UK average order value differs | Mann-Whitney U (two-sided) | Reject H₀: significant difference |
| Frequency & Monetary are positively correlated | Spearman rank correlation | Reject H₀: strong positive correlation |
| Q4 revenue is higher than non-Q4 | Mann-Whitney U (one-sided) | Reject H₀: Q4 significantly higher |
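
For readers who want to see how such tests are called, here is a minimal sketch using `scipy.stats`. The data are synthetic stand-ins generated with NumPy, and the variable names and the 0.05 significance level are assumptions; the notebook's real inputs and p-values are not reproduced here.

```python
import numpy as np
from scipy.stats import mannwhitneyu, spearmanr

rng = np.random.default_rng(42)

# Synthetic, right-skewed stand-ins for the real order values (assumption).
uk_orders = rng.lognormal(mean=3.0, sigma=0.5, size=200)
non_uk_orders = rng.lognormal(mean=3.5, sigma=0.5, size=200)
frequency = rng.integers(1, 50, size=200)
monetary = frequency * 20 + rng.normal(0, 5, size=200)
q4_revenue = rng.lognormal(mean=4.0, sigma=0.4, size=100)
non_q4_revenue = rng.lognormal(mean=3.5, sigma=0.4, size=300)

# H1: two-sided Mann-Whitney U, UK vs non-UK order values.
_, p_country = mannwhitneyu(uk_orders, non_uk_orders, alternative="two-sided")

# H2: Spearman rank correlation between Frequency and Monetary.
rho, p_corr = spearmanr(frequency, monetary)

# H3: one-sided Mann-Whitney U, Q4 greater than non-Q4.
_, p_q4 = mannwhitneyu(q4_revenue, non_q4_revenue, alternative="greater")

ALPHA = 0.05  # assumed significance level
for name, p in [("H1", p_country), ("H2", p_corr), ("H3", p_q4)]:
    decision = "reject H0" if p < ALPHA else "fail to reject H0"
    print(f"{name}: p = {p:.3g} -> {decision}")
```

Rank-based tests are a natural fit here because the spend and order-value distributions are right-skewed, so normality assumptions behind t-tests and Pearson correlation would be doubtful.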

Customer Segments

Four segments were identified using K-Means clustering on standardised RFM features.

| Segment | Recency | Frequency | Monetary | Size | Business Strategy |
| --- | --- | --- | --- | --- | --- |
| 🏆 Champions | Very Low | Very High | Very High | ~15% | VIP loyalty rewards, early product access, referral programmes |
| 💛 Loyal Customers | Low–Medium | Medium–High | Medium–High | ~25% | Upsell premium products, personalised recommendations, loyalty points |
| 🆕 New Customers | Low | Low | Low | ~30% | Welcome email series, onboarding discount, showcase bestsellers |
| ⚠️ At-Risk Customers | Very High | Low | Low | ~30% | Win-back campaign with heavy discount, satisfaction survey, retargeted ads |

RFM Profile Interpretation:

  • Recency: Lower = purchased more recently (better)
  • Frequency: Higher = orders more often (better)
  • Monetary: Higher = spends more in total (better)
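
The preprocessing applied to the RFM table before clustering (99th-percentile capping, standardisation, and a 2-component PCA for plotting) might look like the sketch below. The RFM values are randomly generated placeholders and the variable names are assumptions, so treat it as an outline of the technique rather than the notebook's code.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder RFM table with right-skewed Frequency and Monetary (assumption).
rfm = pd.DataFrame({
    "Recency": rng.integers(1, 365, size=500),
    "Frequency": rng.geometric(0.3, size=500),
    "Monetary": rng.lognormal(6, 1, size=500),
})

# Cap each feature at its own 99th percentile (upper-tail winsorisation).
caps = rfm.quantile(0.99)
capped = rfm.clip(upper=caps, axis=1)

# Standardise to zero mean and unit variance so no feature dominates distances.
scaler = StandardScaler()
scaled = scaler.fit_transform(capped)

# Project to 2 components for visualisation only; cluster on `scaled`.
coords = PCA(n_components=2, random_state=0).fit_transform(scaled)
print(coords[:3])
```

Scaling matters for K-Means because it uses Euclidean distance: without it, Monetary (in hundreds or thousands of GBP) would swamp Recency and Frequency.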

Model Results

| Model | Algorithm | K | Silhouette Score |
| --- | --- | --- | --- |
| Model 1 | K-Means Clustering | 4 | ~0.38 |
| Model 2 | Agglomerative Clustering (Ward) | 4 | ~0.35 |

Final Model: K-Means with K=4

Selected because:

  1. Higher Silhouette Score than Agglomerative Clustering
  2. Computationally scalable to the full dataset (~4,300 customers)
  3. Cluster centroids can be saved and reused to score new customers
  4. Both algorithms independently agreed on K=4, validating the 4-segment structure

Hyperparameter Tuning:

  • K-Means: Elbow Method + Silhouette Score sweep across K = 2 to 10 → optimal at K=4
  • Agglomerative: Ward vs Complete vs Average linkage comparison → Ward wins
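
A sweep of this kind can be sketched as below. It runs on synthetic blobs with four well-separated centres (an assumption chosen so the example is self-contained), so the scores it prints are not the project's results; the seeds and `n_init` value are likewise illustrative.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled RFM matrix: 4 well-separated groups.
centers = [[0, 0, 0], [10, 0, 0], [0, 10, 0], [0, 0, 10]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=42)

# K-Means sweep: record inertia (for the elbow plot) and silhouette per K.
inertias, silhouettes = {}, {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X, km.labels_)

best_k = max(silhouettes, key=silhouettes.get)
print(f"Best K by silhouette: {best_k}")

# Agglomerative clustering: compare linkage criteria at the chosen K.
linkage_scores = {}
for linkage in ("ward", "complete", "average"):
    labels = AgglomerativeClustering(n_clusters=best_k, linkage=linkage).fit_predict(X)
    linkage_scores[linkage] = silhouette_score(X, labels)

for linkage, score in linkage_scores.items():
    print(f"{linkage:>8}: silhouette = {score:.3f}")
```

Plotting `inertias` against K gives the elbow curve; the silhouette values provide a second, scale-free check on the same choice.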

Project Structure

myntra-customer-segmentation/
├── notebooks/
│   └── Myntra_Customer_Segmentation.ipynb   # Full analysis notebook
├── data/
│   └── README_data.md                        # Dataset source & description
├── models/
│   ├── kmeans_customer_segmentation.pkl      # Trained K-Means model
│   └── rfm_standard_scaler.pkl               # Fitted StandardScaler
├── outputs/
│   ├── rfm_segments.csv                      # CustomerID + RFM + Segment label
│   └── charts/                               # All 13 EDA charts as PNG
├── reports/
│   └── Myntra_Segmentation_Report.pdf        # Project summary report
├── .gitignore
├── requirements.txt
└── README.md

How to Run

1. Clone the repository

git clone https://github.com/Gauravscriptx/myntra-customer-segmentation.git
cd myntra-customer-segmentation

2. Install dependencies

pip install -r requirements.txt

3. Launch the notebook

jupyter notebook notebooks/Myntra_Customer_Segmentation.ipynb

4. Run all cells — the notebook loads data directly from the Google Sheets URL. No CSV download needed.

5. Predict a new customer's segment

import joblib
import pandas as pd

model  = joblib.load('models/kmeans_customer_segmentation.pkl')
scaler = joblib.load('models/rfm_standard_scaler.pkl')

new_customer = pd.DataFrame({
    'Recency':   [10],   # days since last purchase
    'Frequency': [8],    # number of orders
    'Monetary':  [450]   # total spend in GBP
})

scaled    = scaler.transform(new_customer)
cluster   = model.predict(scaled)
print(f'Predicted cluster: {cluster[0]}')  # integer cluster ID; map it to a segment name using the cluster profile from the notebook
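
`model.predict` returns an integer cluster index, not a segment name, and K-Means numbers its clusters arbitrarily. One way to attach names is to inspect the centroids in original units and label them. The sketch below is a hypothetical heuristic based on the segment table in this README (highest Monetary is Champions, highest remaining Recency is At-Risk, and so on); the function name `name_clusters` and the toy centroid values are assumptions, and the notebook may assign labels differently.

```python
import pandas as pd


def name_clusters(centroids: pd.DataFrame) -> dict:
    """Map 4 cluster IDs to segment names from their RFM centroids.

    Hypothetical rule of thumb, applied in order:
      1. highest Monetary              -> Champions
      2. highest Recency of the rest   -> At-Risk Customers
      3. highest Frequency of the rest -> Loyal Customers
      4. whatever is left              -> New Customers
    """
    if len(centroids) != 4:
        raise ValueError("this heuristic expects exactly 4 clusters")
    remaining = centroids.copy()
    names = {}
    for column, segment in [
        ("Monetary", "Champions"),
        ("Recency", "At-Risk Customers"),
        ("Frequency", "Loyal Customers"),
    ]:
        cluster_id = remaining[column].idxmax()
        names[int(cluster_id)] = segment
        remaining = remaining.drop(index=cluster_id)
    names[int(remaining.index[0])] = "New Customers"
    return names


# Toy centroids in original units (made-up numbers for illustration).
centroids = pd.DataFrame(
    {
        "Recency":   [250.0, 15.0, 30.0, 45.0],
        "Frequency": [1.5, 12.0, 1.2, 5.0],
        "Monetary":  [300.0, 5000.0, 250.0, 1500.0],
    },
    index=[0, 1, 2, 3],
)

segment_names = name_clusters(centroids)
print(segment_names)
```

With the saved artifacts, the centroid table could be built from `scaler.inverse_transform(model.cluster_centers_)` with columns `Recency`, `Frequency`, `Monetary`; a new customer's name is then `segment_names[cluster[0]]`. Always sanity-check the resulting labels against the cluster profile before using them.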

Tech Stack

| Library | Purpose |
| --- | --- |
| pandas | Data loading, wrangling, aggregation |
| numpy | Numerical operations |
| matplotlib / seaborn | All 13 EDA charts |
| scikit-learn | StandardScaler, KMeans, AgglomerativeClustering, PCA, silhouette_score |
| scipy | Hierarchical linkage, dendrogram, Mann-Whitney U, Spearman |
| joblib | Model serialisation and loading |

Author

[Your Name]
Capstone Project: Unsupervised Machine Learning
AlmaBetter / [Your Institute Name]


License

This project is licensed under the MIT License.
