portia-da-analyst/Mall-Customer-Segmentation-Analysis

Portfolio of Evidence

Mall Customer Segmentation

Code Review, Bug Fixes & Technical Showcase

Author: Portia
Date: February 2026
Language: Python 3.12 (Jupyter Notebook)
Libraries: Pandas, NumPy, Seaborn, Matplotlib, Scikit-Learn
Dataset: Mall_Customers.csv (n = 200 customers)
Algorithm: K-Means Clustering (k = 5)

1. Project Overview

This document serves as a Portfolio of Evidence for the Mall Customer Segmentation project. It demonstrates the full data science workflow - from raw data loading and exploratory analysis through to machine learning clustering - and showcases all code written, bugs identified and fixed, and outputs produced.

The original notebook contained three critical errors that would have caused it to fail or produce misleading results. This report documents each error, explains why it was wrong, and presents the corrected code. All charts were regenerated with improved styling and annotations.

2. Bug Fixes Summary

The original notebook contained the following three bugs. Each is shown with its before/after code in Section 3.2.

| # | Location | Bug Description | Fix Applied |
|---|---|---|---|
| 1 | EDA - Gender KDE plot | Used "Gender" - column is actually named "Genre" | Changed all "Gender" references to "Genre" |
| 2 | Elbow Method cell | Fitted elbow on np.random.rand() dummy data - not actual mall features | Replaced with actual df[["Annual Income","Spending Score"]] features |
| 3 | KMeans initialisation | Typo: "n_custers=5" (missing "l") makes the KMeans constructor raise TypeError (unexpected keyword argument) | Corrected to "n_clusters=5", added random_state=42 for reproducibility |

3. Code Showcase - Jupyter Notebook

The following screenshots show each section of the fixed and improved notebook, rendered in a dark-theme code style. Each cell includes syntax highlighting and explanatory comments.

3.1 Imports & Data Loading

The notebook begins by importing all required libraries and loading the dataset. A global plot style is applied at this stage to ensure consistency across all visualisations throughout the analysis.

Code Block 1: Library imports, global plot settings, and CSV data loading.
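Since the code cells in this report are screenshots, the loading step can be sketched in text as follows. This is a minimal sketch, not the notebook's exact cell: the inline sample rows are illustrative stand-ins for Mall_Customers.csv, the helper name load_customers is hypothetical, the column names are the ones quoted in this report (the real file's headers may differ), and the style settings are assumptions.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Global plot style, applied once before any figure is drawn (assumed values).
plt.rcParams.update({"figure.figsize": (8, 5), "axes.grid": True, "font.size": 11})

# Illustrative rows only; in the notebook the source would be "Mall_Customers.csv".
SAMPLE_CSV = """CustomerID,Genre,Age,Annual Income,Spending Score
1,Male,25,20,35
2,Female,31,42,78
3,Female,47,61,50
4,Male,38,88,12
5,Female,29,95,90
"""


def load_customers(source):
    """Read the customer table and fail early if an expected column is missing."""
    frame = pd.read_csv(source)
    expected = {"CustomerID", "Genre", "Age", "Annual Income", "Spending Score"}
    missing = expected - set(frame.columns)
    if missing:
        raise KeyError(f"missing columns: {sorted(missing)}")
    return frame


df = load_customers(io.StringIO(SAMPLE_CSV))
print(df.describe())              # summary statistics for the numeric columns
print(df["Genre"].value_counts()) # category counts for the gender column
```

Checking the column set at load time would have surfaced the "Gender"/"Genre" mismatch immediately rather than in a later plotting cell.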

3.2 Bug Fixes - Documented in Code

The three original bugs are shown side-by-side with their corrections. Red-highlighted lines show the broken original code; green lines show the fix applied. Documenting each fix next to the code it replaces makes the changes easy to review in collaborative data science work.

Code Block 2: All three bug fixes documented with before/after comparison.
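A text version of the three fixes might look like the sketch below. It runs on synthetic stand-in data (an assumption, as are the variable names) and uses the column names quoted in this report; the try/except blocks only demonstrate which exception each original mistake raises.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Assumption: synthetic stand-in for the mall dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Genre": rng.choice(["Male", "Female"], size=200),
    "Annual Income": rng.uniform(15, 140, size=200),
    "Spending Score": rng.uniform(1, 100, size=200),
})

# Bug 1: the column is named "Genre", so indexing with "Gender" raises KeyError.
try:
    df["Gender"]
    bug1_error = None
except KeyError as exc:
    bug1_error = type(exc).__name__
income_by_genre = df.groupby("Genre")["Annual Income"].mean()  # corrected reference

# Bug 2: the elbow loop was fitted on np.random.rand() noise.
# Fix: build the feature matrix from the actual columns.
X = df[["Annual Income", "Spending Score"]].to_numpy()

# Bug 3: a misspelt keyword is rejected when the estimator is constructed.
try:
    KMeans(n_custers=5)
    bug3_error = None
except TypeError as exc:
    bug3_error = type(exc).__name__
model = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)  # corrected call

print(bug1_error, bug3_error, model.cluster_centers_.shape)
```

Setting n_init explicitly is an addition here; it keeps the behaviour the same across scikit-learn versions whose defaults differ.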

3.3 Univariate Distribution Plots

A for-loop iterates over the three numeric columns (Age, Annual Income, Spending Score) to produce distribution histograms with KDE overlays and mean reference lines. This replaces the original approach of saving all plots to the same filename (overwriting each other).

Code Block 3: Univariate distribution loop with improved multi-panel figure layout.

3.4 Correlation Heatmap

A correlation matrix is computed across the three numeric features and visualised as an annotated heatmap. The coolwarm diverging palette makes positive and negative correlations visually distinct. Key finding: income and spending are nearly uncorrelated (r ≈ 0.01), so the two features carry largely independent information and neither is redundant as a clustering input.

Code Block 4: Correlation heatmap with annotations and diverging colour palette.
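A minimal sketch of the heatmap cell, assuming the column names used in this report and synthetic stand-in data (so the coefficients printed here are not the ones quoted above):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Assumption: synthetic stand-in data.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "Age": rng.integers(18, 71, size=200),
    "Annual Income": rng.uniform(15, 140, size=200),
    "Spending Score": rng.uniform(1, 100, size=200),
})

# Pearson correlation between every pair of numeric features.
corr = df[["Age", "Annual Income", "Spending Score"]].corr()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(
    corr,
    annot=True,        # print each coefficient in its cell
    fmt=".2f",
    cmap="coolwarm",   # diverging palette
    vmin=-1, vmax=1,   # fix the scale so colours are comparable across runs
    center=0,
    square=True,
    ax=ax,
)
ax.set_title("Correlation matrix")
print(corr.round(2))
```

Pinning vmin and vmax to the full correlation range is a design choice: without it the palette rescales to the data and weak correlations can look strong.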

3.5 K-Means Clustering - Fitting & Labelling

K-Means is fitted to the two key features with k=5 (confirmed by the Elbow Method). The resulting cluster labels are mapped to descriptive segment names (Target Group, Careful, Standard, Careless, Sensible) based on each centroid's income and spending values. Segment names make downstream analysis and reporting far more accessible to non-technical stakeholders.

Code Block 5: K-Means fit, centroid output, and descriptive cluster name mapping.
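One way to attach names to the fitted clusters is sketched below. The naming rule is an assumption, not necessarily the one used in the notebook: each centroid takes the name of the closest reference profile, where the profiles are the segment averages reported in Section 5. The data are synthetic points scattered around those same profiles, so the example is illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Reference profiles (income in $k, spending score) taken from Section 5.
# Using them as a naming rule is an assumption made for this sketch.
PROFILES = {
    "Target Group": (66, 83),
    "Careful": (100, 49),
    "Standard": (64, 46),
    "Careless": (31, 52),
    "Sensible": (51, 15),
}

# Assumption: synthetic stand-in data, 40 points around each profile.
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=centre, scale=4.0, size=(40, 2)) for centre in PROFILES.values()
])
df = pd.DataFrame(points, columns=["Annual Income", "Spending Score"])

model = KMeans(n_clusters=5, n_init=10, random_state=42)
df["Cluster"] = model.fit_predict(df[["Annual Income", "Spending Score"]])
print(model.cluster_centers_.round(1))  # centroid output

# Distance from every centroid to every reference profile, then pick the nearest.
names = list(PROFILES)
refs = np.array(list(PROFILES.values()), dtype=float)
dists = np.linalg.norm(model.cluster_centers_[:, None, :] - refs[None, :, :], axis=2)
name_by_label = {label: names[idx] for label, idx in enumerate(dists.argmin(axis=1))}

df["Segment Name"] = df["Cluster"].map(name_by_label)
print(df["Segment Name"].value_counts())
```

Deriving the mapping from the centroids, rather than hard-coding label numbers, matters because K-Means label indices are arbitrary and can change between runs.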

3.6 Cluster Profile Analysis & Export

The final code cells compute the gender breakdown per cluster using pd.crosstab(), calculate average age, income and spending per segment using groupby().mean(), and export the fully labelled dataset to Excel. The export includes the new "Segment Name" column so business users can immediately identify which group each customer belongs to.

Code Block 6: Gender crosstab, cluster profile summary, and Excel export.
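The profiling and export cells could look roughly like this. The data and segment assignments are random stand-ins (an assumption), and export_segments is a hypothetical helper wrapping the df.to_excel() call listed in Section 6.

```python
import numpy as np
import pandas as pd

# Assumption: synthetic stand-in for the labelled dataset.
rng = np.random.default_rng(7)
n = 200
df = pd.DataFrame({
    "Genre": rng.choice(["Female", "Male"], size=n),
    "Age": rng.integers(18, 71, size=n),
    "Annual Income": rng.uniform(15, 140, size=n).round(),
    "Spending Score": rng.integers(1, 101, size=n),
    "Segment Name": rng.choice(
        ["Target Group", "Careful", "Standard", "Careless", "Sensible"], size=n
    ),
})

# Gender breakdown per segment: rows are segments, columns are genders.
gender_by_segment = pd.crosstab(df["Segment Name"], df["Genre"])

# Average age, income and spending score per segment.
profile = (
    df.groupby("Segment Name")[["Age", "Annual Income", "Spending Score"]]
    .mean()
    .round(1)
)


def export_segments(frame, path="Clustering.xlsx"):
    """Write the labelled customers to Excel, including the Segment Name column."""
    frame.to_excel(path, index=False)


print(gender_by_segment)
print(profile)
```

Calling export_segments(df) writes the workbook; pandas needs an Excel writer engine such as openpyxl installed for that step, which is why the sketch defines the helper without invoking it.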

4. Analytical Outputs - Charts & Findings

Every chart produced by the notebook is presented below alongside the analytical interpretation that connects the visual output to business insight.

4.1 Univariate - Annual Income Distribution

The income distribution is approximately normal, centred around $60k with a mild right skew. The majority of shoppers (roughly 60%) fall in the $40k-$80k bracket. The visible right tail indicates a smaller high-earning cohort - this group, once identified through clustering, represents the highest-value opportunity.

Figure 1: Annual Income histogram with KDE, mean (red) and median (orange) reference lines.

4.2 Income by Gender

Female customers show a tighter income distribution peaking around $60k-$70k. Male customers display a flatter curve with a longer high-income tail - the highest earners in the dataset skew male. Despite this, median incomes are comparable between genders (~$60k), meaning gender alone is not a strong predictor of income.

Figure 2: KDE of Annual Income split by Gender. Sample sizes and means shown in legend.

4.3 Elbow Method - Optimal k

Running K-Means for k = 1 through k = 10 on the actual mall features (Annual Income + Spending Score) produces the elbow curve below. The curve bends sharply at k = 5, after which additional clusters yield diminishing reductions in inertia. The elbow is a visual heuristic rather than a formal proof of optimality, but it clearly supports five as the number of segments for this dataset.

Figure 3: Elbow Method - WCSS vs. k. Annotated inflection point at k=5 (highlighted red).
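The elbow computation can be sketched as below. The feature matrix is synthetic (five hypothetical blob centres, an assumption) so that the sketch is self-contained; in the notebook X would come from the Annual Income and Spending Score columns.

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumption: synthetic two-feature data with five hypothetical blob centres.
rng = np.random.default_rng(3)
centres = [(20, 20), (20, 80), (55, 50), (90, 20), (90, 80)]
X = np.vstack([rng.normal(c, 5.0, size=(40, 2)) for c in centres])

ks = range(1, 11)
wcss = []  # within-cluster sum of squares, one value per k
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(model.inertia_)

# Relative drop in WCSS when moving from k to k + 1; the elbow is where
# this drop collapses.
drops = [(wcss[i - 1] - wcss[i]) / wcss[i - 1] for i in range(1, len(wcss))]

for k, value in zip(ks, wcss):
    print(f"k={k:2d}  WCSS={value:10.1f}")
```

Plotting wcss against ks gives the elbow curve; printing the relative drops is a quick numeric cross-check of where the bend sits.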

4.4 Final Cluster Visualisation

The scatter plot below is the centrepiece of the analysis. Five well-separated clusters are clearly visible in income-spending space. The star markers indicate cluster centroids. Segment names are annotated directly on the chart. K-Means always assigns each point to exactly one cluster, so the evidence of genuinely distinct groupings is the visible gaps between neighbouring clusters rather than the absence of overlap alone.

Figure 4: K-Means 5-segment scatter plot. Stars = centroids. Colour-coded by segment.
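A chart of this kind can be produced along the following lines. This is a sketch on synthetic data with hypothetical blob centres; the annotations use generic cluster numbers rather than the segment names, and the colours and marker sizes are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Assumption: synthetic two-feature data with five hypothetical blob centres.
rng = np.random.default_rng(4)
centres = [(25, 20), (25, 80), (55, 50), (90, 20), (90, 80)]
X = np.vstack([rng.normal(c, 5.0, size=(40, 2)) for c in centres])

model = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)
labels = model.labels_
centroids = model.cluster_centers_

fig, ax = plt.subplots(figsize=(8, 6))
# Customers, colour-coded by cluster label.
ax.scatter(X[:, 0], X[:, 1], c=labels, cmap="tab10", s=30, alpha=0.8)
# Centroids drawn as large stars on top.
ax.scatter(centroids[:, 0], centroids[:, 1], marker="*", s=300, c="black",
           label="Centroid")
for label, (cx, cy) in enumerate(centroids):
    ax.annotate(f"Cluster {label}", (cx, cy), xytext=(6, 6),
                textcoords="offset points")

ax.set_xlabel("Annual Income")
ax.set_ylabel("Spending Score")
ax.set_title("K-Means segments (k = 5)")
ax.legend()
```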

4.5 Spending Score by Gender

Boxplot: Median spending scores are nearly identical for male and female shoppers (~50). However, female customers show a slightly tighter IQR (interquartile range), suggesting more consistent moderate spending. Male customers display a wider spread, including more very low and very high spenders.

KDE: The density plot confirms the pattern - females cluster more sharply around the 45-55 score range, while males display a heavier low-end tail (scores 0-30). This means male shoppers are more polarised: some are very high spenders, but a larger proportion spend very little.

Figure 5: Boxplot of Spending Score by Gender. Medians nearly equal; female distribution tighter.

Figure 6: KDE of Spending Score by Gender. Male distribution has heavier low-spending tail.
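The median and IQR comparison read off the boxplot can also be computed directly. The sketch below uses synthetic stand-in data, so its numbers will not match the figures; the column names are those quoted in this report.

```python
import numpy as np
import pandas as pd

# Assumption: synthetic stand-in data.
rng = np.random.default_rng(5)
df = pd.DataFrame({
    "Genre": rng.choice(["Female", "Male"], size=200),
    "Spending Score": rng.integers(1, 101, size=200),
})

# Quartiles of the spending score for each gender.
spread = (
    df.groupby("Genre")["Spending Score"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
spread.columns = ["q1", "median", "q3"]

# Interquartile range: the height of each box in the boxplot.
spread["iqr"] = spread["q3"] - spread["q1"]
print(spread)
```

A smaller iqr value for one group corresponds to the "tighter" box described above.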

4.6 Segment Profile Comparison

The dual-panel chart below consolidates all five segment profiles. The left panel directly contrasts average income against average spending score for each group - the divergence of the "Careful" group (highest income, below-average spending) is immediately apparent. The right panel confirms that the "Careless" segment is the largest single group at 52 customers (26%), while the high-value "Target Group" and "Careful" groups are smaller but disproportionately commercially important.

Figure 7: Segment profiles - avg income vs spending (left) and customer count/share (right).

5. Segment Reference Table

The table below provides a complete reference summary for all five segments, including the average demographics computed from the clustering output.

| Segment | Avg Income | Avg Spend | Avg Age | Count | % Female | Strategy |
|---|---|---|---|---|---|---|
| Target Group | $66k | 83 | 39 | 37 (18.5%) | 51% | VIP loyalty, exclusive events |
| Careful | $100k | 49 | 35 | 29 (14.5%) | 55% | Luxury audit, aspirational campaigns |
| Standard | $64k | 46 | 37 | 43 (21.5%) | 54% | Seasonal promotions, basket upsell |
| Careless | $31k | 52 | 40 | 52 (26.0%) | 52% | Flash sales, gamified loyalty |
| Sensible | $51k | 15 | 41 | 39 (19.5%) | 51% | Value bundles, essential discounts |
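The Count column can be sanity-checked with a few lines of arithmetic. The sketch below takes the five counts from the table above and recomputes the total and each segment's percentage share.

```python
# Segment sizes as reported in the reference table.
counts = {
    "Target Group": 37,
    "Careful": 29,
    "Standard": 43,
    "Careless": 52,
    "Sensible": 39,
}

total = sum(counts.values())  # should equal the dataset size, n = 200
shares = {name: round(100 * size / total, 1) for name, size in counts.items()}

print(total)
for name, share in shares.items():
    print(f"{name:<13} {share:.1f}%")
```

The counts add up to 200 and the recomputed shares match the percentages shown in the table.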

6. Skills & Methods Demonstrated

This project demonstrates competency across the full data science pipeline. The table below maps each skill to the specific code or output produced in this notebook.

| Skill Area | Method Used | Evidence in This Report |
|---|---|---|
| Data Loading | pd.read_csv() | Code Block 1 - loaded Mall_Customers.csv |
| Data Inspection | df.describe(), df.columns, value_counts() | Section 3.1 - summary statistics output |
| Data Cleaning | df.drop("CustomerID", axis=1) | Pairplot cell - removed non-predictive column |
| EDA - Univariate | sns.histplot() with KDE, axvline() | Code Block 3 - Figure 1 |
| EDA - Bivariate | sns.kdeplot(), sns.boxplot(), pd.crosstab() | Code Blocks 3 & 6 - Figures 2, 5, 6 |
| Correlation Analysis | df.corr(), sns.heatmap() | Code Block 4 - correlation heatmap |
| Pairplot | sns.pairplot(hue="Genre") | Full multi-variable relationship view |
| ML: Cluster Selection | KMeans Elbow Method loop, inertia_ | Code Block 2 - Figure 3 (correct features) |
| ML: K-Means Fitting | KMeans(n_clusters=5).fit() | Code Block 5 - Figure 4 |
| Bug Identification | Wrong column name, dummy data, typo | Section 2 - Bug Fixes Summary |
| Export | df.to_excel() | Code Block 6 - Clustering.xlsx |
