Myntra Customer Segmentation — Unsupervised ML

RFM-based customer segmentation using K-Means and Agglomerative Hierarchical Clustering for Myntra Gifts Ltd. — identifying Champions, Loyal, New, and At-Risk customers to enable targeted marketing strategies.

Business Problem

Myntra Gifts Ltd. is a UK-based online retailer specialising in unique all-occasion giftware. The company operates exclusively online and serves customers across 38 countries.

The Problem: All customers were treated uniformly — same marketing messages, same promotions, same communication strategy regardless of purchase behaviour. A customer who buys every month and spends thousands received identical treatment to one who purchased once two years ago.

The Goal: Apply unsupervised machine learning to segment the customer base into distinct behavioural groups using RFM (Recency, Frequency, Monetary) analysis, then assign targeted business strategies to each segment to:

Improve customer retention
Reduce churn among at-risk customers
Maximise revenue from high-value segments
Allocate marketing budget efficiently

Dataset Overview

Property	Value
Source	Myntra Gifts Ltd. — Online Retail Transactions
Period	December 2009 – December 2011
Raw Rows	541,909
Columns	8
Countries	38
Unique Customers (after cleaning)	~4,300

Dataset URL (Google Sheets):

https://docs.google.com/spreadsheets/d/1_W3Jfp1bTWpPFmqyGgGXYJGd0rHIV8dD/export?format=csv&gid=501524341

Columns:

Column	Type	Description
InvoiceNo	Object	Unique transaction ID. Values starting with 'C' = cancellation
StockCode	Object	Unique product identifier
Description	Object	Product name
Quantity	Integer	Units purchased. Negative = returns
InvoiceDate	Object → Datetime	Transaction timestamp
UnitPrice	Float	Price per unit in GBP
CustomerID	Float → Integer	Unique customer identifier (24.93% missing)
Country	Object	Customer's country

Methodology

Raw Data (541,909 rows)
        │
        ▼
  Data Wrangling
  ├── Convert InvoiceDate → datetime
  ├── Drop missing CustomerID (~135,080 rows)
  ├── Remove cancelled transactions (InvoiceNo starts with 'C')
  ├── Remove Quantity ≤ 0 and UnitPrice ≤ 0
  ├── Engineer TotalAmount = Quantity × UnitPrice
  └── Remove duplicate rows
        │
        ▼
  Exploratory Data Analysis
  └── 13 charts: univariate + bivariate + temporal + RFM distributions
        │
        ▼
  Hypothesis Testing (3 tests)
  ├── H1: UK vs non-UK order values (Mann-Whitney U)
  ├── H2: Frequency vs Monetary correlation (Spearman)
  └── H3: Q4 vs non-Q4 revenue (Mann-Whitney U, one-sided)
        │
        ▼
  Feature Engineering → RFM Table
  ├── Recency  : days since last purchase
  ├── Frequency: unique invoices per customer
  └── Monetary : total GBP spend per customer
        │
        ▼
  Preprocessing
  ├── Outlier capping at 99th percentile (Winsorization)
  ├── StandardScaler (zero mean, unit variance)
  └── PCA to 2 components (for visualisation only)
        │
        ▼
  ML Models
  ├── K-Means Clustering (Elbow + Silhouette → K=4)
  └── Agglomerative Hierarchical Clustering (Ward linkage, K=4)
        │
        ▼
  Cluster Profiling → 4 Business Segments
  ├── Champions
  ├── Loyal Customers
  ├── New Customers
  └── At-Risk Customers
        │
        ▼
  Deployment
  ├── kmeans_customer_segmentation.pkl
  └── rfm_standard_scaler.pkl
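
The wrangling and RFM steps in the pipeline above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names `clean_transactions` and `build_rfm`, the sample rows, and the choice of snapshot date (one day after the last cleaned transaction) are all assumptions made for the example.

```python
import pandas as pd


def clean_transactions(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the wrangling steps listed in the pipeline (hypothetical helper)."""
    df = raw.copy()
    df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
    df = df.dropna(subset=["CustomerID"])
    # Cancellations carry an InvoiceNo that starts with 'C'
    df = df[~df["InvoiceNo"].astype(str).str.startswith("C")]
    df = df[(df["Quantity"] > 0) & (df["UnitPrice"] > 0)]
    df = df.assign(
        CustomerID=df["CustomerID"].astype(int),
        TotalAmount=df["Quantity"] * df["UnitPrice"],
    )
    return df.drop_duplicates()


def build_rfm(df: pd.DataFrame) -> pd.DataFrame:
    """One row per customer: Recency (days), Frequency (invoices), Monetary (GBP)."""
    # Assumption: recency is measured from the day after the last transaction
    snapshot = df["InvoiceDate"].max() + pd.Timedelta(days=1)
    return df.groupby("CustomerID").agg(
        Recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),
        Frequency=("InvoiceNo", "nunique"),
        Monetary=("TotalAmount", "sum"),
    )


# Made-up rows: one duplicate, one cancellation, one missing CustomerID
raw = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "C1002", "1003", "1004", "1005"],
    "Quantity": [2, 2, -1, 5, 3, 1],
    "InvoiceDate": ["2011-12-01 10:00", "2011-12-01 10:00", "2011-12-02 09:00",
                    "2011-11-01 12:00", "2011-12-05 14:00", "2011-12-06 11:00"],
    "UnitPrice": [2.5, 2.5, 2.5, 1.0, 4.0, 3.0],
    "CustomerID": [1.0, 1.0, 1.0, 2.0, 1.0, None],
})

clean = clean_transactions(raw)
rfm = build_rfm(clean)
print(rfm)
```

On the sample rows, three transactions survive cleaning and the RFM table has one row for each of the two remaining customers.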

EDA Highlights

Key findings across 13 charts:

UK Dominance: The United Kingdom accounts for over 90% of all transactions — the business is heavily UK-centric, presenting both a focus opportunity and a concentration risk.
Strong Seasonality: Revenue peaks every November–December (holiday gifting season). Year-over-year growth from 2010 to 2011 confirms business expansion.
B2B Buying Pattern: Orders peak between 10 AM and 2 PM on Wednesdays and Thursdays, consistent with trade/wholesale buyers ordering during UK business hours.
Right-Skewed Spend: All RFM features are right-skewed. A small proportion of customers drives a disproportionate share of revenue, which is consistent with the Pareto principle.
Low Cancellation Rate: ~2% of transactions are cancellations — small in volume but representing real operational cost.
Price Distribution: Most products are priced GBP 1–5, making the business volume-dependent.
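
One way to quantify the revenue concentration described under Right-Skewed Spend is to compute the share of total spend contributed by the top fraction of customers. The sketch below uses a hypothetical helper `top_share` and invented spend values; the 20% cut-off is the conventional Pareto threshold, not a figure taken from the notebook.

```python
import numpy as np
import pandas as pd


def top_share(monetary: pd.Series, top_fraction: float = 0.2) -> float:
    """Share of total spend contributed by the top `top_fraction` of customers."""
    ordered = monetary.sort_values(ascending=False)
    n_top = max(1, int(np.ceil(len(ordered) * top_fraction)))
    return float(ordered.iloc[:n_top].sum() / ordered.sum())


# Invented per-customer totals in GBP, for illustration only
spend = pd.Series([5000, 1200, 300, 200, 150, 100, 80, 70, 50, 50])
share = top_share(spend)
print(f"Top 20% of customers account for {share:.1%} of revenue")
```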

Hypothesis Testing Results:

Hypothesis	Test	Result
UK vs non-UK average order value differs	Mann-Whitney U (two-sided)	Reject H₀ — significant difference
Frequency & Monetary are positively correlated	Spearman rank correlation	Reject H₀ — strong positive correlation
Q4 revenue is higher than non-Q4	Mann-Whitney U (one-sided)	Reject H₀ — Q4 significantly higher
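
The three tests in the table map onto `scipy.stats` calls roughly as shown below. The data here is synthetic and the 0.05 significance level is an assumption; the README does not state the threshold used in the notebook.

```python
import numpy as np
from scipy.stats import mannwhitneyu, spearmanr

ALPHA = 0.05  # assumed significance level
rng = np.random.default_rng(0)

# H1: UK vs non-UK order values (two-sided Mann-Whitney U), synthetic samples
uk_orders = rng.lognormal(mean=3.0, sigma=1.0, size=500)
non_uk_orders = rng.lognormal(mean=3.5, sigma=1.0, size=200)
u_h1, p_h1 = mannwhitneyu(uk_orders, non_uk_orders, alternative="two-sided")

# H2: Frequency vs Monetary (Spearman rank correlation), synthetic samples
frequency = rng.poisson(lam=4, size=300) + 1
monetary = frequency * rng.lognormal(mean=4.0, sigma=0.5, size=300)
rho, p_h2 = spearmanr(frequency, monetary)

# H3: Q4 revenue greater than non-Q4 (one-sided Mann-Whitney U), synthetic samples
q4 = rng.lognormal(mean=3.4, sigma=1.0, size=150)
non_q4 = rng.lognormal(mean=3.0, sigma=1.0, size=450)
u_h3, p_h3 = mannwhitneyu(q4, non_q4, alternative="greater")

for name, p in [("H1", p_h1), ("H2", p_h2), ("H3", p_h3)]:
    decision = "reject H0" if p < ALPHA else "fail to reject H0"
    print(f"{name}: p = {p:.4g} -> {decision}")
```

Rank-based tests are a natural fit here because the spend distributions are right-skewed, so a t-test's normality assumption would be hard to justify.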

Customer Segments

Four segments were identified using K-Means clustering on standardised RFM features.

Segment	Recency	Frequency	Monetary	Size	Business Strategy
🏆 Champions	Very Low	Very High	Very High	~15%	VIP loyalty rewards, early product access, referral programmes
💛 Loyal Customers	Low–Medium	Medium–High	Medium–High	~25%	Upsell premium products, personalised recommendations, loyalty points
🆕 New Customers	Low	Low	Low	~30%	Welcome email series, onboarding discount, showcase bestsellers
⚠️ At-Risk Customers	Very High	Low	Low	~30%	Win-back campaign with heavy discount, satisfaction survey, retargeted ads

RFM Profile Interpretation:

Recency: Lower = purchased more recently (better)
Frequency: Higher = orders more often (better)
Monetary: Higher = spends more in total (better)
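
K-Means only returns numeric cluster ids, so segment names have to be assigned from the per-cluster RFM profile. The sketch below shows one possible rule that follows the interpretation above: the two highest-spending clusters become Champions and Loyal Customers, and the remaining two are split by recency. The function `name_segments`, the rule itself, and the profile values are illustrative assumptions, not the notebook's actual logic or results.

```python
import pandas as pd


def name_segments(profile: pd.DataFrame) -> dict:
    """Map cluster id -> segment name from mean RFM per cluster (assumes 4 clusters)."""
    by_value = profile.sort_values("Monetary", ascending=False).index.tolist()
    champions, loyal, *rest = by_value
    # Among the two low-spend clusters, the more recent buyers are treated as new
    new, at_risk = profile.loc[rest].sort_values("Recency").index.tolist()
    return {
        champions: "Champions",
        loyal: "Loyal Customers",
        new: "New Customers",
        at_risk: "At-Risk Customers",
    }


# Invented cluster means, for illustration only
profile = pd.DataFrame(
    {
        "Recency": [250.0, 12.0, 30.0, 45.0],
        "Frequency": [1.5, 14.0, 2.0, 6.0],
        "Monetary": [300.0, 6000.0, 350.0, 1800.0],
    },
    index=pd.Index([0, 1, 2, 3], name="Cluster"),
)

segment_names = name_segments(profile)
print(segment_names)
```

In practice the profile would come from something like `rfm.groupby("Cluster")[["Recency", "Frequency", "Monetary"]].mean()`, and the resulting names should be checked by eye, since cluster ids can change between training runs.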

Model Results

Model	Algorithm	K	Silhouette Score
Model 1	K-Means Clustering	4	~0.38
Model 2	Agglomerative Clustering (Ward)	4	~0.35

Final Model: K-Means with K=4

Selected because:

Higher Silhouette Score than Agglomerative Clustering
Computationally scalable to full dataset (4,300+ customers)
Cluster centroids can be saved and reused to score new customers
Both algorithms independently agreed on K=4, validating the 4-segment structure

Hyperparameter Tuning:

K-Means: Elbow Method + Silhouette Score sweep across K = 2 to 10 → optimal at K=4
Agglomerative: Ward vs Complete vs Average linkage comparison → Ward wins
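
The preprocessing and tuning steps can be sketched as follows: cap each RFM feature at its 99th percentile, standardise, sweep K from 2 to 10 for K-Means, then compare linkage methods at K=4. The RFM table here is synthetic and `random_state=42` and `n_init=10` are assumed settings, so the scores will not match the figures reported above and the best K on this toy data need not be 4.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real RFM table
rng = np.random.default_rng(42)
rfm = pd.DataFrame({
    "Recency": rng.exponential(90, 400),
    "Frequency": rng.poisson(3, 400) + 1,
    "Monetary": rng.lognormal(6, 1, 400),
})

# Cap each column at its own 99th percentile, then standardise
capped = rfm.clip(upper=rfm.quantile(0.99), axis=1)
X = StandardScaler().fit_transform(capped)

# K-Means sweep: inertia for the elbow plot, silhouette for the score comparison
inertias, kmeans_scores = {}, {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_
    kmeans_scores[k] = silhouette_score(X, km.labels_)

# Linkage comparison for agglomerative clustering at K=4
linkage_scores = {}
for linkage in ("ward", "complete", "average"):
    labels = AgglomerativeClustering(n_clusters=4, linkage=linkage).fit_predict(X)
    linkage_scores[linkage] = silhouette_score(X, labels)

best_k = max(kmeans_scores, key=kmeans_scores.get)
print("Silhouette by K:", {k: round(s, 3) for k, s in kmeans_scores.items()})
print("Silhouette by linkage:", {l: round(s, 3) for l, s in linkage_scores.items()})
print("Best K on this synthetic data:", best_k)
```

Silhouette alone can favour very small K, which is why the README pairs it with the elbow plot and with the interpretability of the resulting segments.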

Project Structure

myntra-customer-segmentation/
├── notebooks/
│   └── Myntra_Customer_Segmentation.ipynb   # Full analysis notebook
├── data/
│   └── README_data.md                        # Dataset source & description
├── models/
│   ├── kmeans_customer_segmentation.pkl      # Trained K-Means model
│   └── rfm_standard_scaler.pkl               # Fitted StandardScaler
├── outputs/
│   ├── rfm_segments.csv                      # CustomerID + RFM + Segment label
│   └── charts/                               # All 13 EDA charts as PNG
├── reports/
│   └── Myntra_Segmentation_Report.pdf        # Project summary report
├── .gitignore
├── requirements.txt
└── README.md

How to Run

1. Clone the repository

git clone https://github.com/YOUR_USERNAME/myntra-customer-segmentation.git
cd myntra-customer-segmentation

2. Install dependencies

pip install -r requirements.txt

3. Launch the notebook

jupyter notebook notebooks/Myntra_Customer_Segmentation.ipynb

4. Run all cells — the notebook loads data directly from the Google Sheets URL. No CSV download needed.

5. Predict a new customer's segment

import joblib
import pandas as pd

model  = joblib.load('models/kmeans_customer_segmentation.pkl')
scaler = joblib.load('models/rfm_standard_scaler.pkl')

new_customer = pd.DataFrame({
    'Recency':   [10],   # days since last purchase
    'Frequency': [8],    # number of orders
    'Monetary':  [450]   # total spend in GBP
})

scaled    = scaler.transform(new_customer)
cluster   = model.predict(scaled)
print(f'Predicted cluster: {cluster[0]}')  # integer cluster index (0-3); map it to a segment name using the cluster profile
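
Because `predict` returns an integer index rather than a segment name, it helps to inspect the centroid of the predicted cluster in original units before deciding which segment it represents. The sketch below is self-contained: it trains a scaler and model on synthetic RFM data, saves and reloads them with `joblib` in a temporary directory, and then inverts the scaling on the centroids. The synthetic data and training settings are assumptions for illustration; with the real project you would load the two `.pkl` files from `models/` instead.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

cols = ["Recency", "Frequency", "Monetary"]

# Synthetic stand-in for the real RFM table
rng = np.random.default_rng(1)
rfm = pd.DataFrame({
    "Recency": rng.exponential(90, 300),
    "Frequency": rng.poisson(3, 300) + 1,
    "Monetary": rng.lognormal(6, 1, 300),
})[cols]

scaler = StandardScaler().fit(rfm)
model = KMeans(n_clusters=4, n_init=10, random_state=42).fit(scaler.transform(rfm))

# Round-trip both artifacts through disk, as the deployment step does
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "kmeans_customer_segmentation.pkl"
    scaler_path = Path(tmp) / "rfm_standard_scaler.pkl"
    joblib.dump(model, model_path)
    joblib.dump(scaler, scaler_path)
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

new_customer = pd.DataFrame({"Recency": [10], "Frequency": [8], "Monetary": [450]})
cluster = int(loaded_model.predict(loaded_scaler.transform(new_customer))[0])

# Centroids back in days / orders / GBP, so the cluster can be read as a segment
centroids = pd.DataFrame(
    loaded_scaler.inverse_transform(loaded_model.cluster_centers_), columns=cols
)
print(f"Predicted cluster: {cluster}")
print(centroids.loc[cluster])
```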

Tech Stack

Library	Purpose
pandas	Data loading, wrangling, aggregation
numpy	Numerical operations
matplotlib / seaborn	All 13 EDA charts
scikit-learn	StandardScaler, KMeans, AgglomerativeClustering, PCA, silhouette_score
scipy	Hierarchical linkage, dendrogram, Mann-Whitney U, Spearman
joblib	Model serialisation and loading

Author

[Your Name]
Capstone Project: Unsupervised Machine Learning
AlmaBetter / [Your Institute Name]

License

This project is licensed under the MIT License.
