Understanding customer behavior is crucial for business success in online retail or e-commerce. This project aims to uncover hidden patterns in customer transactions, identify distinct customer segments, and provide data-driven recommendations for improving customer retention and maximizing revenue.
- Getting started
- Introduction
- EDA & Data Cleaning
- Customer Segmentation
- Market Basket Analysis
- Cohort Retention Analysis
- Clone repository

      git clone https://github.com/tatv047/online-retail-data-analysis.git
      cd online-retail-data-analysis

- Create a virtual environment

      python -m venv venv

- Activate the virtual environment
  - On Windows:

        venv\Scripts\activate

  - On macOS/Linux:

        source venv/bin/activate

- Install dependencies

      pip install -r requirements.txt

- Run the notebooks

      jupyter notebook eda.ipynb
This project uses online retail data from the UCI Machine Learning Repository. The dataset contains transactional data from a UK-based online retail company.
For more detail on the dataset, you can visit the website or check the data-dictionary. You can manually download the data or just run the first two cells in the EDA notebook and you will see a folder named "data" in your working directory.
The three critical questions that every retailer should ask to determine their strategic direction and profitability:
- Who are your customers?
  Understanding your customer base is about identifying behavioral patterns, purchase preferences, and value contribution. Are you serving new customers, premium buyers, or loyal repeat customers? Each segment requires different approaches, messaging, and investment strategies.
- What do they buy together?
  Product relationships drive significant revenue opportunities. When customers purchase item A, what else do they typically buy? These associations reveal cross-selling opportunities, inform inventory decisions, and optimize product placement strategies that can increase average order value by 10-30%.
- How are we doing in keeping customers coming back?
  Understanding retention patterns helps identify when customers are likely to churn, how different customer cohorts behave over time, and what interventions can improve overall customer retention rates.
We'll answer these questions using:
- Question 1: RFM-Based Customer Segmentation. Analyzing Recency, Frequency, and Monetary patterns to create actionable customer personas.
- Question 2: Market Basket Analysis. Uncovering product associations and purchase patterns through association rule mining.
- Question 3: Cohort and Retention Analysis. Tracking customer behavior over time to measure loyalty and identify retention opportunities.
But before diving into these advanced analyses, we'll begin with thorough Exploratory Data Analysis (EDA) and Data Cleaning.
The dataset contains 542,909 transactions, from December 2010 to December 2011.
The initial data quality assessment revealed:
- Description column: Contains null values for certain transactions.
- CustomerID column: Significant number of missing values (~25% of records)
- Co-occurrence: Many records have both Description and CustomerID missing simultaneously
Every analysis except Market Basket Analysis (MBA) is customer-level and therefore needs a CustomerID. Since there is no reliable way to impute the missing values, we remove all rows with null values. MBA could be run without CustomerID, but we remove the null values for it as well.
1. Invoice Number Validation
- Identified transactions with invoice numbers starting with 'C' (cancellations) and 'A' (adjustments).
- Retained only standard 6-digit invoice numbers for consistent transaction analysis.
- Removed ~10,000 cancelled and adjusted transactions.
2. Stock Code Standardization
- Expected format: 5-digit codes (e.g., "85123") or 5-digit codes with letter suffixes (e.g., "85123A").
- Identified ~54,000 entries with non-standard stock codes.
- Removed irregular patterns to maintain data consistency.
3. Price and Quantity Validation
- Negative Quantities: Removed as they represent returns and cancellations
- Zero/Negative Prices: Eliminated zero-price entries (promotional/sample items) and negative prices (data entry errors)
- Final Clean Dataset: 396337 valid transactions
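The three validation steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names `InvoiceNo` and `StockCode` are assumed from the UCI data dictionary, and the regular expressions are one plausible reading of "6-digit invoice" and "5-digit code with optional letter suffix".

```python
import pandas as pd

def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning rules described above (assumed column names)."""
    df = df.copy()
    df["InvoiceNo"] = df["InvoiceNo"].astype(str).str.strip()
    df["StockCode"] = df["StockCode"].astype(str).str.strip()

    # Keep only standard 6-digit invoices (drops 'C...' and 'A...' invoices).
    valid_invoice = df["InvoiceNo"].str.match(r"^\d{6}$")
    # Keep 5-digit stock codes, optionally followed by letters (e.g. 85123A).
    valid_stock = df["StockCode"].str.match(r"^\d{5}[A-Za-z]*$")
    # Drop non-positive quantities and zero or negative prices.
    valid_values = (df["Quantity"] > 0) & (df["UnitPrice"] > 0)
    # Customer-level analyses need a CustomerID.
    has_customer = df["CustomerID"].notna()

    return df[valid_invoice & valid_stock & valid_values & has_customer]

# Tiny made-up sample: only the first row passes every rule.
raw = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536366", "536367", "536368"],
    "StockCode": ["85123A", "22423", "POST", "71053", "84406B"],
    "Quantity": [6, -1, 1, 0, 8],
    "UnitPrice": [2.55, 3.39, 18.0, 3.39, 0.0],
    "CustomerID": [17850.0, 14527.0, 12583.0, 13047.0, None],
})
cleaned = clean_transactions(raw)
print(cleaned)
```

Building one boolean mask per rule keeps each rule easy to count and report separately, which is how figures such as "~10,000 cancelled and adjusted transactions" can be checked.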
The dataset reveals a strong geographic concentration:
- United Kingdom: 89.7% of all orders
- Germany: 2.3% of orders
- France: 1.9% of orders
- Ireland (EIRE): 1.7% of orders
This distribution aligns with the company's UK-based operations and domestic market focus.
The distribution of order volume by country shows the same picture: the United Kingdom again accounts for the major share.
Looking at the distribution characteristics of the individual columns, we observe:
- Quantity: Heavily right-skewed distribution with most orders between 1-12 items
- Unit Price: Right-skewed with majority of products priced under £5
- Outliers: Identified extreme values (quantities >80,000 and prices >£600) representing bulk orders, niche products or data anomalies.
- Outliers are dealt with separately, case by case, in each of the three problems.
The analysis of the most frequently purchased items reveals:
- Strong preference for decorative and gift items
- Seasonal patterns in product popularity
- Bulk purchasing behavior for certain product categories
Looking at the data over time, we observed that:
- Peak Month: November shows highest transaction volumes (holiday shopping effect)
- Peak Quarter: Q4 dominates with 35% of annual transactions
- Peak Day: Wednesday shows highest transaction activity
- Lowest Day: Sunday has minimal transaction volume
- Monthly Cycles: Noticeable spikes on the 9th and 18th of each month (potentially payday effects)
- Business Growth: Significant expansion from 2010 to 2011
EDA insights:
- Customer Concentration: 89.7% of business comes from UK customers, indicating strong domestic market penetration
- Product Portfolio: Wide range of unit prices suggests diverse product mix from low-cost to premium items
- Seasonal Sensitivity: Strong Q4 performance indicates holiday shopping dependence
- Growth Trajectory: Rapid international expansion and transaction volume growth
- Data Quality: After cleaning, we retained about 73% of the original data (396,337 of 542,909 rows) for robust analysis
Customer segmentation is the process of dividing the customer base into groups of individuals who are similar in ways that are relevant to marketing, such as gender, age, interests, and spending habits.
The four major segmentation bases are as follows:
- Demographic: Quantifiable characteristics of the market. Examples include age, gender, income, education, social and economic status, and more.
- Geographic: Division of the market according to geographic attributes. Examples include country, region, size of population, and more.
- Psychographic: Incorporation of customer measures such as attitudes, beliefs, or personality attributes. For example, customer attitudes toward a certain statement are typically used to infer their perspectives on particular beliefs that are important to a brand.
- Behavioral: Historical usage, which can be a good indicator of future actions. Examples include usage frequency, level of user engagement, the tendency to retain the product or service, and so on.
Behavioral segmentation has received particularly high attention in industry due to its focus on historical usage patterns and habits. One of the most frequently applied behavioral segmentation approaches is the Recency, Frequency, Monetary (RFM) model, which is the approach we use here.
The building blocks of RFM are as follows:
- Recency, which measures how recently the customer used the product or service.
- Frequency, which tells how frequently the customer used it.
- Monetary, which is related to how much monetary value was invested toward it.
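As a rough illustration of these three building blocks, the sketch below derives one RFM row per customer with pandas. The column names (`InvoiceNo`, `InvoiceDate`, `Quantity`, `UnitPrice`, `CustomerID`) and the choice of the last invoice date as the reference point for Recency are assumptions made for the example, not necessarily what the notebook does.

```python
import pandas as pd

def build_rfm(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per customer with Recency (days), Frequency, Monetary."""
    df = df.copy()
    df["LineTotal"] = df["Quantity"] * df["UnitPrice"]
    # Reference date (assumption): the most recent invoice in the data.
    snapshot = df["InvoiceDate"].max()
    rfm = df.groupby("CustomerID").agg(
        LastPurchase=("InvoiceDate", "max"),
        Frequency=("InvoiceNo", "nunique"),   # number of distinct invoices
        Monetary=("LineTotal", "sum"),        # total amount spent
    )
    rfm["Recency"] = (snapshot - rfm["LastPurchase"]).dt.days
    return rfm[["Recency", "Frequency", "Monetary"]]

# Made-up transactions for two customers.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["536365", "536365", "536400", "536500"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-01", "2011-01-01", "2011-01-11", "2011-01-21"]
    ),
    "Quantity": [2, 1, 3, 10],
    "UnitPrice": [5.0, 10.0, 2.0, 1.5],
})
rfm = build_rfm(tx)
print(rfm)
```

Counting distinct invoices rather than rows keeps Frequency from being inflated by invoices with many line items.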
- We saw in the EDA that there were quite a lot of outliers in "UnitPrice" and "Quantity", which in turn are reflected in the RFM values.
- To deal with this we create two datasets: outliers and non-outliers. The thresholds are the upper and lower whiskers (the max and min) of the boxplot for the original dataset.
- Non-outliers are clustered using K-means, while for outliers we define custom clusters.
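One way to express the split described above is with Tukey's fences, which is where a standard boxplot draws its whiskers. The 1.5 × IQR multiplier and the choice of the `Monetary` and `Frequency` columns are assumptions for this sketch; the notebook may use different thresholds.

```python
import pandas as pd

def iqr_fences(s: pd.Series, k: float = 1.5):
    """Lower and upper boxplot fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def split_outliers(rfm: pd.DataFrame, cols=("Monetary", "Frequency")):
    """Return (non_outliers, outliers); a row is an outlier if any column is."""
    is_outlier = pd.Series(False, index=rfm.index)
    for col in cols:
        low, high = iqr_fences(rfm[col])
        is_outlier |= (rfm[col] < low) | (rfm[col] > high)
    return rfm[~is_outlier], rfm[is_outlier]

# Made-up RFM values: the last customer has an extreme Monetary value.
rfm_demo = pd.DataFrame({
    "Monetary": [100.0, 120.0, 110.0, 130.0, 90.0, 5000.0],
    "Frequency": [1, 2, 2, 3, 1, 2],
})
non_outliers, outliers = split_outliers(rfm_demo)
print(len(non_outliers), len(outliers))
```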
Distribution of RFM for non-outliers
- Then we move on to creating RFM datapoints for each entry in the non-outliers dataset.
- K-means uses Euclidean distance to assign points to the nearest cluster center. If the features have different units or scales, the feature with the larger scale dominates the distance calculations, so the data needs to be scaled as well.
The data is now ready to be used in the K-means clustering algorithm.
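The scaling step can be illustrated with plain z-score standardization, which gives every feature zero mean and unit variance. This is a sketch under the assumption that standardization is the scaling used; with `ddof=0` it matches what scikit-learn's `StandardScaler` computes.

```python
import pandas as pd

def standardize(features: pd.DataFrame) -> pd.DataFrame:
    """Z-score each column: subtract the mean, divide by the population std."""
    return (features - features.mean()) / features.std(ddof=0)

# Made-up RFM values on very different scales.
rfm_raw = pd.DataFrame({
    "Recency": [10, 200, 35, 90],
    "Frequency": [1, 3, 8, 2],
    "Monetary": [250.0, 40.0, 3100.0, 600.0],
})
scaled = standardize(rfm_raw)
print(scaled.round(2))
```

Without this step, Monetary (in the hundreds or thousands) would dominate the Euclidean distance over Frequency (single digits).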
- While using the k-means clustering algorithm, the first step is to indicate the number of clusters ($k$) that we wish to produce in the final output.
- The algorithm starts by randomly selecting $k$ objects from the dataset to serve as the initial centers of the clusters. These selected objects are the cluster means, also known as centroids.
- Then each remaining object is assigned to its closest centroid, where closeness is measured by the Euclidean distance between the object and the cluster mean. We refer to this step as "cluster assignment".
- When the assignment is complete, the algorithm calculates the new mean value of each cluster.
- After the centers are recalculated, every observation is checked to see whether it is now closer to a different cluster, and objects are reassigned using the updated cluster means.
- This repeats over several iterations until the cluster assignments stop changing, that is, until the clusters in the current iteration are the same as those in the previous iteration.
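The steps above can be sketched in a few lines of NumPy. This is a teaching version of the loop only: the function name and parameters are hypothetical, and in practice a library implementation such as scikit-learn's `KMeans` would normally be used.

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    """Plain k-means: random initial centroids, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # Pick k distinct rows of X as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Cluster assignment: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop once the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Centroid update: mean of the points assigned to each cluster.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two well-separated groups of made-up points.
X_demo = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1],
])
labels, centroids = kmeans(X_demo, k=2)
print(labels)
```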
When working with clustering algorithms like KMeans, it's important to determine the optimal number of clusters ($k$). Two common methods are:
- Elbow Method: This method uses the inertia, which is the sum of squared distances between each point and its nearest cluster center. As $k$ increases, inertia decreases. The optimal $k$ is often identified at the "elbow point", where the rate of decrease sharply slows down, indicating diminishing returns from additional clusters.
- Average Silhouette Method: The silhouette score measures how well each point fits within its cluster compared to others. It ranges from -1 to 1, where a higher score indicates better-defined and more separated clusters. The average silhouette method calculates the mean silhouette score for different values of $k$. The optimal number of clusters is the one that maximizes the average silhouette score, indicating the best balance between cohesion and separation.
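Both criteria can be computed in one loop. The sketch below assumes scikit-learn is available and uses synthetic blobs rather than the project's RFM data; `evaluate_k` is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def evaluate_k(X: np.ndarray, k_values, seed: int = 0) -> dict:
    """Fit KMeans for each k and record inertia and mean silhouette score."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        results[k] = {
            "inertia": model.inertia_,
            "silhouette": silhouette_score(X, model.labels_),
        }
    return results

# Synthetic data: three well-separated blobs of 30 points each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_blobs = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

scores = evaluate_k(X_blobs, k_values=range(2, 6))
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
for k, s in scores.items():
    print(k, round(s["inertia"], 1), round(s["silhouette"], 3))
```

Plotting inertia against $k$ gives the elbow curve, while `best_k` is the silhouette-based choice; on real data the two do not always agree, so it is worth inspecting both.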
From the output above, we settle on the number of clusters; the analysis below uses four (clusters 0 to 3).
Having determined the number of clusters, we apply K-means to create the customer segments. Because the goal is to identify the group of users with high engagement, we assign a label (or name) to each segment.
We get the cluster label for each datapoint in the non-outliers dataset and assign it. The visualisation is as follows:
Based on this we perform the cluster analysis for the non-outlier datapoints and get the following segments:
- Cluster 0 (Orange): "Nurture"
  - Recent purchasers but with very low spend and frequency
  - May be price-conscious customers who buy only during promotions
  - May be new customers still testing your brand
  - Strategy:
    - Value Migration: Gradually introduce higher-value products through cross-selling
    - Frequency Building: Implement programs to increase purchase frequency (subscription models, bundles)
    - Price Sensitivity Management: Use graduated pricing to move them up the value chain
    - Onboarding Programs: If new customers, focus on product education and value demonstration
- Cluster 1 (Red): "Re-engage"
  - Moderate spenders (£200-800 range) with decent purchase frequency.
  - Poor recency scores indicate they're becoming inactive.
  - Risk of losing these customers.
  - Strategy:
    - Immediate Win-Back Campaign: Urgent personalized outreach, such as "We miss you" messages.
    - Investigate Churn Reasons: Survey or direct contact to understand why they've stopped purchasing.
    - Reactivation Incentives: Targeted discounts or exclusive offers to prompt immediate purchase.
    - Engagement Monitoring: Track response rates and adjust messaging accordingly.
- Cluster 2 (Green): "Reward"
  - Highest spending customers (£2000-4000 range)
  - Highest purchase frequency (8-12 transactions)
  - These are your profit drivers and brand advocates: your best segment, customers who could be treated as VIPs.
  - Strategy:
    - VIP Treatment: Exclusive access to new products, priority customer service
    - Referral Programs: Leverage their satisfaction to acquire similar high-value customers
    - Retention Focus: Ensure they never have reason to leave through exceptional service
    - Upselling Opportunities: Introduce premium products or services
    - Feedback Loop: Use their insights for product development and business decisions
- Cluster 3 (Blue): "Retain"
  - Balanced performance across all RFM dimensions
  - Reliable revenue generators with room for growth
  - Potential to migrate to Champion status with the right incentives
  - Strategy:
    - Growth Programs: Targeted campaigns to increase both spend and frequency
    - Loyalty Development: Implement tiered loyalty programs to encourage progression
    - Cross-selling Focus: Introduce complementary products to increase basket size
    - Engagement Consistency: Regular but not overwhelming communication
- We will create completely new clusters for our outliers dataset and label them.
- These are based purely on the distribution and not any clustering algorithm output.
- The distribution for these new clusters is as follows:
- Cluster -1 (Monetary Outliers) PAMPER:
  - High spenders but not necessarily frequent buyers. Their purchases are large but infrequent.
  - Strategy: Focus on maintaining their loyalty with personalized offers or luxury services that cater to their high spending capacity.
- Cluster -2 (Frequency Outliers) UPSELL:
  - Frequent buyers who spend less per purchase. These customers are consistently engaged but might benefit from upselling opportunities.
  - Strategy: Implement loyalty programs or bundle deals to encourage higher spending per visit, given their frequent engagement.
- Cluster -3 (Monetary & Frequency Outliers) DELIGHT:
  - The most valuable outliers, with extreme spending and frequent purchases. They are likely your top-tier customers who require special attention.
  - Strategy: Develop VIP programs or exclusive offers to maintain their loyalty and encourage continued engagement.
We performed clustering-based customer segmentation that combines K-means clustering with RFM modeling, and we also dealt with outliers and anomalies. The approach is simple and intuitive, yet it groups customers with similar usage behaviors and habits together. The resulting segments give insight into how the product or service is used and help convert more users into your best segments.
Everything done above can be summarised very beautifully in the figure below:
General stores, supermarkets like D-Mart, SmartBazaar, and online platforms such as Amazon, Flipkart, and Meesho need to maintain good profitability to run efficiently. One of the ways to achieve this is by performing Market Basket Analysis.
Market Basket Analysis is a specialized application of Association Rule Mining that focuses specifically on retail and e-commerce scenarios. It examines transactional data to uncover patterns and relationships between different products that are purchased together by customers.
At its core, Market Basket Analysis seeks to answer:
- Which products are frequently bought together?
- If a customer purchases product A, what is the likelihood they will also purchase product B?
- How can product placement or bundling be optimized to increase cross-selling?
Market Basket Analysis operates on a dataset where each record is a transaction, containing a set of items purchased together. By analyzing many such transactions, patterns emerge that reflect customer behavior.
Market Basket Analysis can support several business goals, such as product placement and store layout, cross-selling and upselling, inventory management, promotional strategies, and real-time suggestions and personalised recommendations for e-commerce platforms.
Market Basket Analysis involves two fundamental steps:
- Frequent Itemset Mining: Identify sets of items that frequently appear together in transactions. This is typically done using algorithms such as Apriori.
- Association Rule Generation: Generate “if-then” rules from the frequent itemsets. These rules describe how the presence of certain items in a transaction implies the presence of others.
The first step of Market Basket Analysis is to go through the dataset, where each row represents a transaction (a basket of items bought together), and extract all unique item combinations.
These are then passed to the Apriori algorithm, which calculates how often each itemset appears. This frequency is called support.
The algorithm uses a level-wise iterative approach based on the Downward Closure Property. We pass a list of transactions (baskets), each containing items, and Apriori calculates support from scratch.
The algorithm is as follows:
We need to pass the minimum support threshold as a parameter. If the threshold is 0.25 (25%), then all itemsets with support equal to or above 0.25 are considered frequent.
The Downward Closure Property being used states:
- If an itemset is frequent, all of its subsets must also be frequent.
- If an itemset is infrequent, none of its supersets can be frequent.
This property helps reduce computation by pruning unpromising candidate itemsets.
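To make the level-wise search and the pruning concrete, here is a small pure-Python sketch of the idea. It is written for readability, not speed, and is not the mlxtend implementation; the function name and the toy baskets are made up for illustration.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]): support(frozenset([i])) for i in items}
    current = {s: v for s, v in current.items() if v >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (downward closure): all (k-1)-subsets must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = {c: support(c) for c in candidates}
        current = {c: v for c, v in current.items() if v >= min_support}
        frequent.update(current)
        k += 1
    return frequent

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter"},
]
freq = apriori_frequent_itemsets(baskets, min_support=0.5)
for itemset, sup in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

In this toy example {butter, milk} appears in only one of four baskets, so it is infrequent, and the three-item candidate {bread, butter, milk} is pruned without ever counting its support.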
Once frequent itemsets are discovered, the next step is to generate association rules in the form:

$$X \rightarrow Y$$

Where:
- X (antecedent): the "if" part
- Y (consequent): the "then" part
For example: {Bread, Butter} → {Milk} means "If a customer buys bread and butter, they are also likely to buy milk."
The strength of these rules is measured using the following metrics:
1. Support

$$\text{Support}(X \rightarrow Y) = \frac{\text{Number of transactions containing both } X \text{ and } Y}{\text{Total number of transactions}}$$

This indicates how often the entire itemset (X and Y together) appears in the dataset.
Example: for the rule Bread → Butter, Support = 300 / 1000 = 0.30. Interpretation: 30% of all transactions include both items.
2. Confidence

$$\text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}$$

This measures the proportion of transactions with X that also contain Y.
Example: Confidence = 300 / 600 = 0.50. Interpretation: 50% of the customers who bought bread also bought butter.
3. Lift

$$\text{Lift}(X \rightarrow Y) = \frac{\text{Confidence}(X \rightarrow Y)}{\text{Support}(Y)}$$

This measures how much more likely Y is to be purchased when X is purchased, compared to when X and Y are independent.
- Lift = 1 → X and Y are independent
- Lift > 1 → Positive association
- Lift < 1 → Negative association
Example: Lift = 0.50 / 0.40 = 1.25. Interpretation: Customers who buy bread are 1.25 times more likely to buy butter than average.
Consider the following example of the transactional summary for a grocery store (1000 total transactions):
| Item/Combination | Frequency | Support |
|---|---|---|
| Bread | 600 | 0.60 |
| Butter | 400 | 0.40 |
| Milk | 500 | 0.50 |
| Bread + Butter | 300 | 0.30 |
| Bread + Milk | 350 | 0.35 |
| Butter + Milk | 200 | 0.20 |
| Bread + Butter + Milk | 150 | 0.15 |
Rule: Bread → Butter
| Metric | Value | Interpretation |
|---|---|---|
| Support | 0.30 | 30% of transactions contain both |
| Confidence | 0.50 | 50% of bread buyers also buy butter |
| Lift | 1.25 | Positive correlation (1.25× more likely) |
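The numbers in the table can be reproduced directly from the support values of the grocery example. The helper below is a hypothetical illustration that takes a dictionary of itemset supports and returns the three metrics for a rule.

```python
def rule_metrics(antecedent, consequent, support):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    x, y = frozenset(antecedent), frozenset(consequent)
    s_xy = support[x | y]            # support of X and Y together
    confidence = s_xy / support[x]   # P(Y | X)
    lift = confidence / support[y]   # P(Y | X) / P(Y)
    return {"support": s_xy, "confidence": confidence, "lift": lift}

# Support values taken from the grocery-store table above.
grocery_support = {
    frozenset({"Bread"}): 0.60,
    frozenset({"Butter"}): 0.40,
    frozenset({"Bread", "Butter"}): 0.30,
}
metrics = rule_metrics({"Bread"}, {"Butter"}, grocery_support)
print({name: round(value, 2) for name, value in metrics.items()})
```

Note that lift is symmetric: Butter → Bread has a different confidence (0.30 / 0.40 = 0.75) but the same lift of 1.25.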
- For our problem we will focus on the "Nurture" customers found in the customer segmentation above.
- Customers in this segment are new, infrequent, and price-sensitive, and are drawn to discounts or offers. They have low frequency, low lifetime, but decent monetary value.
- Our goals should be:
  - Rules with high confidence, so that we avoid recommending items with weak linkage.
  - Slightly low support, as the customers are low-frequency.
  - Strong lift, to discover non-random co-purchases.
- That is why the thresholds used while selecting rules will be:
  - Confidence >= 0.6
  - Support >= 0.01
  - Lift >= 2.5
- To find the most interesting rules, we will sort the rules by "lift", because we want high confidence together with high lift.
- We do not sort by confidence because confidence is biased by popular items. If "bread" is bought 90% of the time overall, almost everything will appear confidently related to it. High confidence with low lift just tells you people already buy the item a lot.
- Product recommendations for cross-selling can be based on the results found here.
- There are 3220 distinct items bought by nurture-customers.
- The transaction data will be used to create baskets. A basket refers to a single transaction or purchase instance that contains all the items purchased together.
- We will be using the mlxtend library for our work. The baskets need to be transformed with its transaction encoder before they can be used in the Apriori algorithm.
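import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

# mybasket is assumed to be a pandas Series holding one list of items per transaction
te = TransactionEncoder()  # defining the encoder object
te_array = te.fit(mybasket).transform(mybasket)  # boolean matrix: rows = baskets, columns = items
basket_df = pd.DataFrame(te_array, columns=te.columns_, index=mybasket.index)
basket_df.head(5)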
- We need to set a minimum support threshold, which we keep at min_support = 0.01 (1%), i.e. itemsets that occur in at least 1% of all transactions. With ~50,000 nurture transactions, this catches items in ≥ 500 transactions.
from mlxtend.frequent_patterns import apriori
frequent_itemsets = apriori(basket_df, min_support=0.01, use_colnames=True)  # at least 1 percent of transactions
- We get 759 itemsets in frequent_itemsets.
- On close examination, it was found that:
  - 619 itemsets had only a single item.
  - 133 itemsets had a pair.
  - 7 itemsets had triplets.
  - The top 10 most frequent itemsets had only one item.
- Given below is a heatmap for the support scores of possible combinations from the items in the top-10 pairs list.
- We then generate the rules with a minimum threshold on confidence of 0.6. There is also a threshold on lift of 2.5.
- We get 64 rules from frequent_itemsets.
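from mlxtend.frequent_patterns import association_rules

# Generate rules from the frequent_itemsets DataFrame, keeping confidence >= 0.6
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
# Keep only rules with lift >= 2.5 and list the strongest associations first
rules = rules[rules["lift"] >= 2.5].sort_values("lift", ascending=False)
rules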
- Given below are the antecedents and consequents of the top-10 rules, sorted by "lift" score:
| Rule No. | Antecedent | Consequent |
|---|---|---|
| Rule 1 | SET/10 PINK POLKADOT PARTY CANDLES | SET/10 BLUE POLKADOT PARTY CANDLES |
| Rule 2 | SET/10 BLUE POLKADOT PARTY CANDLES | SET/10 PINK POLKADOT PARTY CANDLES |
| Rule 3 | PINK HAPPY BIRTHDAY BUNTING | BLUE HAPPY BIRTHDAY BUNTING |
| Rule 4 | BLUE HAPPY BIRTHDAY BUNTING | PINK HAPPY BIRTHDAY BUNTING |
| Rule 5 | POPPY'S PLAYHOUSE LIVINGROOM | POPPY'S PLAYHOUSE BEDROOM |
| Rule 6 | POPPY'S PLAYHOUSE BEDROOM | POPPY'S PLAYHOUSE LIVINGROOM |
| Rule 7 | GLASS APOTHECARY BOTTLE PERFUME | GLASS APOTHECARY BOTTLE TONIC |
| Rule 8 | GLASS APOTHECARY BOTTLE TONIC | GLASS APOTHECARY BOTTLE PERFUME |
| Rule 9 | CHILDRENS CUTLERY DOLLY GIRL | CHILDRENS CUTLERY SPACEBOY |
| Rule 10 | CHILDRENS CUTLERY SPACEBOY | CHILDRENS CUTLERY DOLLY GIRL |
- Market Basket Analysis is one of the most practical and impactful applications of data mining for businesses. By systematically identifying frequent itemsets and generating meaningful association rules, organizations can derive actionable insights that drive smarter product placement, targeted promotions, and personalized recommendations.
- Our nurture customers buy fewer things.
- We want to find connections that are strong and unexpected, not just frequent.
- Lift therefore gives us more signal than confidence alone, and it surfaces the interesting relations listed above.
- The rows represent customer cohorts, defined by the first month they made a purchase (e.g., `2010-12`, `2011-01`, etc.).
- The columns represent the number of months since their first purchase (`Cohort Index` 0 to 12).
- The values are the % of the original cohort that returned to make a purchase that month.
- There are a few insights that you can draw from it:
  - For almost all cohorts, there's a sharp drop after month 0. For example, the `2011-01` cohort drops from `100%` to `22%` in month 1. A possible reason could be that customers are not being engaged effectively after their first purchase. There's a retention gap.
  - Cohort `2010-12` stands out: it retains ~36-50% of its customers across several months. Possible reasons could be better offers, seasonality (December = holidays), or a more loyal segment being acquired.
  - Recent cohorts (e.g. `2011-09` to `2011-12`) have very low retention after month 1 or 2. Marketing or product strategies may have changed for the worse, or seasonal customers were acquired.
  - Most cohorts reach near-zero retention by month 6–9. This shows high churn, which may mean:
    - Customers don't see long-term value
    - Lack of subscription/loyalty programs
    - One-off purchases dominate
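A retention matrix with the layout described above (cohort month in rows, months since first purchase in columns, percentages as values) can be built along these lines. This is a sketch with assumed column names (`CustomerID`, `InvoiceDate`) and made-up data, not the notebook's exact code.

```python
import pandas as pd

def retention_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Percentage of each cohort that purchased again N months later."""
    df = df.copy()
    # First purchase date of each customer, repeated on every row.
    first = df.groupby("CustomerID")["InvoiceDate"].transform("min")
    df["CohortMonth"] = first.dt.strftime("%Y-%m")
    # Whole months between this order and the first purchase.
    df["CohortIndex"] = (
        (df["InvoiceDate"].dt.year - first.dt.year) * 12
        + (df["InvoiceDate"].dt.month - first.dt.month)
    )
    counts = (
        df.groupby(["CohortMonth", "CohortIndex"])["CustomerID"]
        .nunique()
        .unstack("CohortIndex")
    )
    # Divide every column by the cohort size (column 0) to get percentages.
    return counts.divide(counts[0], axis=0) * 100

orders = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3],
    "InvoiceDate": pd.to_datetime(
        ["2010-12-05", "2011-01-10", "2010-12-20", "2011-01-03", "2011-02-15"]
    ),
})
retention = retention_matrix(orders)
print(retention)
```

In the toy data, two customers first bought in 2010-12 and one of them returned the next month, so that cohort shows 100% at index 0 and 50% at index 1.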
| Insight | Action |
|---|---|
| Low retention after first purchase | Implement onboarding email series, discounts on second purchase, loyalty rewards |
| Strong holiday cohort (2010-12) | Analyze what worked then — repeat successful campaigns |
| High churn by 3–6 months | Introduce reactivation campaigns before months 2–3 |
| Certain months worse than others | Align this with marketing spend, acquisition channel, and customer type |
"Our cohort retention analysis shows a significant drop in customer engagement after the first month, with most customers churning by month 3. However, certain cohorts like Dec 2010 show stronger long-term retention, indicating that specific periods or campaigns may be more effective. This highlights the need to improve post-purchase engagement and build loyalty pathways early in the customer lifecycle."






























