Portfolio of Evidence
Mall Customer Segmentation
Code Review, Bug Fixes & Technical Showcase
| Author | Portia |
|---|---|
| Date | February 2026 |
| Language | Python 3.12 (Jupyter Notebook) |
| Libraries | Pandas, NumPy, Seaborn, Matplotlib, Scikit-Learn |
| Dataset | Mall_Customers.csv (n = 200 customers) |
| Algorithm | K-Means Clustering (k = 5) |
This document serves as a Portfolio of Evidence for the Mall Customer Segmentation project. It demonstrates the full data science workflow - from raw data loading and exploratory analysis through to machine learning clustering - and showcases all code written, bugs identified and fixed, and outputs produced.
The original notebook contained three critical errors that would have caused it to fail or produce misleading results. This report documents each error, explains why it was wrong, and presents the corrected code. All charts were regenerated with improved styling and annotations.
The original notebook contained the following three bugs. Each is documented in detail in Section 4.
| # | Location | Bug Description | Fix Applied |
|---|---|---|---|
| 1 | EDA - Gender KDE plot | Used "Gender" - column is actually named "Genre" | Changed all "Gender" references to "Genre" |
| 2 | Elbow Method cell | Fitted elbow on np.random.rand() dummy data - not actual mall features | Replaced with actual df[["Annual Income","Spending Score"]] features |
| 3 | KMeans initialisation | Typo: "n_custers=5" (missing "l") raises a TypeError (unexpected keyword argument) | Corrected to "n_clusters=5", added random_state=42 for reproducibility |
The following screenshots show each section of the fixed and improved notebook, rendered in a dark-theme code style. Each cell includes syntax highlighting and explanatory comments.
The notebook begins by importing all required libraries and loading the dataset. A global plot style is applied at this stage to ensure consistency across all visualisations throughout the analysis.
Code Block 1: Library imports, global plot settings, and CSV data loading.
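The setup cell can be sketched as below. This is a runnable stand-in, not the notebook's exact cell: a tiny inline CSV sample replaces Mall_Customers.csv (the column layout is assumed from the standard Kaggle dataset), and the global style settings are illustrative.

```python
import io

import pandas as pd
import matplotlib
matplotlib.use("Agg")                 # headless backend; the notebook renders inline
import matplotlib.pyplot as plt

# Assumed global style settings - the notebook's exact choices may differ
plt.style.use("ggplot")
plt.rcParams["figure.figsize"] = (10, 6)

# In the notebook: df = pd.read_csv("Mall_Customers.csv")
# A tiny inline sample with the same column layout stands in for the file here.
sample_csv = io.StringIO(
    "CustomerID,Genre,Age,Annual Income (k$),Spending Score (1-100)\n"
    "1,Male,19,15,39\n"
    "2,Male,21,15,81\n"
    "3,Female,20,16,6\n"
)
df = pd.read_csv(sample_csv)
print(df.shape)            # (3, 5)
print(list(df.columns))    # note the "Genre" column - the source of Bug 1
```

Applying the style once at the top, rather than per-cell, is what keeps the later figures visually consistent.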
The three original bugs are shown side-by-side with their corrections. Red-highlighted lines show the broken original code; green lines show the fix applied. This side-by-side documentation approach is best practice in collaborative data science work.
Code Block 2: All three bug fixes documented with before/after comparison.
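The three fixes can be condensed into one runnable sketch. Synthetic data stands in for the mall dataset (column names assumed from the CSV); each broken line is shown as a comment above its correction.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the mall data
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Genre": rng.choice(["Male", "Female"], size=200),
    "Annual Income (k$)": rng.integers(15, 140, size=200),
    "Spending Score (1-100)": rng.integers(1, 100, size=200),
})

# Bug 1 - wrong column name:
#   df["Gender"].value_counts()          # KeyError: 'Gender'
gender_counts = df["Genre"].value_counts()  # fix: the column is "Genre"

# Bug 2 - model fitted on dummy data:
#   X = np.random.rand(200, 2)           # meaningless random features
X = df[["Annual Income (k$)", "Spending Score (1-100)"]]  # fix: real features

# Bug 3 - keyword typo:
#   KMeans(n_custers=5)                  # TypeError: unexpected keyword argument
km = KMeans(n_clusters=5, random_state=42, n_init=10).fit(X)  # fix + fixed seed
print(sorted(set(km.labels_)))           # five cluster labels
```

Pinning random_state makes the cluster assignment reproducible across re-runs, which is why it accompanies the typo fix rather than being a separate change.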
A for-loop iterates over the three numeric columns (Age, Annual Income, Spending Score) to produce distribution histograms with KDE overlays and mean reference lines. This replaces the original approach of saving all plots to the same filename (overwriting each other).
Code Block 3: Univariate distribution loop with improved multi-panel figure layout.
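A minimal version of that loop is sketched below on synthetic data. The notebook uses sns.histplot with a KDE overlay; plain Matplotlib histograms stand in here to keep the sketch dependency-light, and the figure is saved once as a multi-panel file so nothing is overwritten.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Synthetic stand-in data (column names assumed from the CSV)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.integers(18, 70, size=200),
    "Annual Income (k$)": rng.integers(15, 140, size=200),
    "Spending Score (1-100)": rng.integers(1, 100, size=200),
})

cols = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
fig, axes = plt.subplots(1, len(cols), figsize=(15, 4))
for ax, col in zip(axes, cols):
    ax.hist(df[col], bins=20, alpha=0.7)
    mean = df[col].mean()
    ax.axvline(mean, color="red", linestyle="--", label=f"mean = {mean:.1f}")
    ax.set_title(col)
    ax.legend()
fig.tight_layout()
fig.savefig("distributions.png")   # one multi-panel file, nothing overwritten
```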
A correlation matrix is computed across the three numeric features and visualised as an annotated heatmap. The coolwarm diverging palette makes positive and negative correlations visually distinct. Key finding: income and spending are nearly uncorrelated (r ≈ 0.01), which confirms that using both as clustering features is meaningful.
Code Block 4: Correlation heatmap with annotations and diverging colour palette.
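The heatmap cell can be approximated as follows. The notebook calls sns.heatmap(corr, annot=True, cmap="coolwarm"); this sketch uses Matplotlib's imshow with manual annotations as a stand-in, on synthetic data in place of the real features.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Synthetic stand-in data (column names assumed from the CSV)
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Age": rng.integers(18, 70, size=200),
    "Annual Income (k$)": rng.integers(15, 140, size=200),
    "Spending Score (1-100)": rng.integers(1, 100, size=200),
})

corr = df.corr()   # 3x3 Pearson correlation matrix

# Matplotlib stand-in for sns.heatmap(corr, annot=True, cmap="coolwarm")
fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
for i in range(len(corr)):
    for j in range(len(corr)):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")
fig.colorbar(im, ax=ax)
fig.tight_layout()
```

Fixing vmin/vmax at -1 and 1 keeps the diverging palette centred on zero, so a near-white cell reads directly as "no correlation".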
K-Means is fitted to the two key features with k=5 (confirmed by the Elbow Method). The resulting cluster labels are mapped to descriptive segment names (Target Group, Careful, Standard, Careless, Sensible) based on each centroid's income and spending values. Segment names make downstream analysis and reporting far more accessible to non-technical stakeholders.
Code Block 5: K-Means fit, centroid output, and descriptive cluster name mapping.
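The fit-and-name step can be sketched like this. Synthetic blobs (via make_blobs) stand in for the real income/spending data, and the naming rule shown is hypothetical: in the notebook the five names are assigned by inspecting each centroid's income and spending values, whereas here a simple spend-ranked ordering illustrates the mapping mechanics.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Five synthetic blobs stand in for the real income/spending data
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=5.0,
                  center_box=(20, 100), random_state=42)
df = pd.DataFrame(X, columns=["Annual Income (k$)", "Spending Score (1-100)"])

km = KMeans(n_clusters=5, random_state=42, n_init=10)
df["Cluster"] = km.fit_predict(df)

centroids = pd.DataFrame(km.cluster_centers_, columns=["Income", "Spend"])
print(centroids.round(1))

# Hypothetical naming rule for illustration: rank clusters by centroid
# spending score and assign the five segment names in that order.
order = centroids.sort_values("Spend", ascending=False).index
names = dict(zip(order, ["Target Group", "Careless", "Careful",
                         "Standard", "Sensible"]))
df["Segment Name"] = df["Cluster"].map(names)
print(df["Segment Name"].value_counts())
```

Mapping integer labels to names once, immediately after fitting, means every downstream table and chart can use the readable segment names.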
The final code cells compute the gender breakdown per cluster using pd.crosstab(), calculate average age, income and spending per segment using groupby().mean(), and export the fully labelled dataset to Excel. The export includes the new "Segment Name" column so business users can immediately identify which group each customer belongs to.
Code Block 6: Gender crosstab, cluster profile summary, and Excel export.
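Those profiling steps can be sketched on a synthetic labelled frame as follows; the Excel export line is shown but commented out, since df.to_excel needs an engine such as openpyxl installed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the clustered, labelled dataset
rng = np.random.default_rng(7)
segments = ["Target Group", "Careful", "Standard", "Careless", "Sensible"]
df = pd.DataFrame({
    "Genre": rng.choice(["Male", "Female"], size=200),
    "Age": rng.integers(18, 70, size=200),
    "Annual Income (k$)": rng.integers(15, 140, size=200),
    "Spending Score (1-100)": rng.integers(1, 100, size=200),
    "Segment Name": rng.choice(segments, size=200),
})

# Gender breakdown per segment (row-normalised shares)
gender_mix = pd.crosstab(df["Segment Name"], df["Genre"], normalize="index")
print(gender_mix.round(2))

# Average demographics per segment
profile = df.groupby("Segment Name")[
    ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
].mean().round(1)
print(profile)

# Export for business users (requires the openpyxl engine):
# df.to_excel("Clustering.xlsx", index=False)
```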
Every chart produced by the notebook is presented below alongside the analytical interpretation that connects the visual output to business insight.
The income distribution is approximately normal, centred around $60k with a mild right skew. The majority of shoppers (roughly 60%) fall in the $40k-$80k bracket. The visible right tail indicates a smaller high-earning cohort - this group, once identified through clustering, represents the highest-value opportunity.
Figure 1: Annual Income histogram with KDE, mean (red) and median (orange) reference lines.
Female customers show a tighter income distribution peaking around $60k-$70k. Male customers display a flatter curve with a longer high-income tail - the highest earners in the dataset skew male. Despite this, median incomes are comparable between genders (~$60k), meaning gender alone is not a strong predictor of spending behaviour.
Figure 2: KDE of Annual Income split by Gender. Sample sizes and means shown in legend.
Running K-Means for k = 1 through k = 10 on the actual mall features (Annual Income + Spending Score) produces the elbow curve below. The curve bends sharply at k = 5, after which additional clusters yield diminishing reductions in inertia. The Elbow Method is a heuristic rather than a proof, but it strongly supports five as the appropriate number of segments for this dataset.
Figure 3: Elbow Method - WCSS vs. k. Annotated inflection point at k=5 (highlighted red).
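The corrected elbow loop can be sketched as below; synthetic blobs stand in for the real income/spending features, and n_init is set explicitly for reproducibility across scikit-learn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Five synthetic blobs stand in for the real income/spending features
X, _ = make_blobs(n_samples=200, centers=5, random_state=42)

# WCSS (inertia) for k = 1..10; the "elbow" is where the drop flattens out
wcss = [
    KMeans(n_clusters=k, random_state=42, n_init=10).fit(X).inertia_
    for k in range(1, 11)
]
for k, w in zip(range(1, 11), wcss):
    print(f"k={k:2d}  WCSS={w:10.1f}")
```

Because the synthetic data genuinely contains five groups, the printed WCSS values drop steeply up to k = 5 and flatten afterwards, mirroring the behaviour in Figure 3.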
The scatter plot below is the centrepiece of the analysis. Five well-separated clusters are clearly visible in income-spending space. The star markers indicate cluster centroids. Segment names are annotated directly on the chart. No significant boundary overlap exists, confirming the K-Means algorithm found genuinely distinct groupings.
Figure 4: K-Means 5-segment scatter plot. Stars = centroids. Colour-coded by segment.
Boxplot: Median spending scores are nearly identical for male and female shoppers (~50). However, female customers show a slightly tighter IQR (interquartile range), suggesting more consistent moderate spending. Male customers display a wider spread, including more very low and very high spenders.
KDE: The density plot confirms the pattern - females cluster more sharply around the 45-55 score range, while males display a heavier low-end tail (scores 0-30). This means male shoppers are more polarised: some are very high spenders, but a larger proportion spend very little.
Figure 5: Boxplot of Spending Score by Gender. Medians nearly equal; female distribution tighter.
Figure 6: KDE of Spending Score by Gender. Male distribution has heavier low-spending tail.
The dual-panel chart below consolidates all five segment profiles. The left panel directly contrasts average income against average spending score for each group - the divergence of the "Careful" group (highest income, below-average spending) is immediately apparent. The right panel confirms that the "Careless" segment is the largest single group at 52 customers (26%), while the high-value "Target Group" and "Careful" groups are smaller but disproportionately commercially important.
Figure 7: Segment profiles - avg income vs spending (left) and customer count/share (right).
The table below provides a complete reference summary for all five segments, including the average demographics computed from the clustering output.
| Segment | Avg Income | Avg Spend | Avg Age | Count | % Female | Strategy |
|---|---|---|---|---|---|---|
| Target Group | $66k | 83 | 39 | 37 (18.5%) | 51% | VIP loyalty, exclusive events |
| Careful | $100k | 49 | 35 | 29 (14.5%) | 55% | Luxury audit, aspirational campaigns |
| Standard | $64k | 46 | 37 | 43 (21.5%) | 54% | Seasonal promotions, basket upsell |
| Careless | $31k | 52 | 40 | 52 (26.0%) | 52% | Flash sales, gamified loyalty |
| Sensible | $51k | 15 | 41 | 39 (19.5%) | 51% | Value bundles, essential discounts |
This project demonstrates competency across the full data science pipeline. The table below maps each skill to the specific code or output produced in this notebook.
| Skill Area | Method Used | Evidence in This Report |
|---|---|---|
| Data Loading | pd.read_csv() | Code Block 1 - loaded Mall_Customers.csv |
| Data Inspection | df.describe(), df.columns, value_counts() | Section 3.1 - summary statistics output |
| Data Cleaning | df.drop("CustomerID", axis=1) | Pairplot cell - removed non-predictive column |
| EDA - Univariate | sns.histplot() with KDE, axvline() | Code Block 3 - Figure 1 |
| EDA - Bivariate | sns.kdeplot(), sns.boxplot(), pd.crosstab() | Code Blocks 3 & 6 - Figures 2, 5, 6 |
| Correlation Analysis | df.corr(), sns.heatmap() | Code Block 4 - correlation heatmap |
| Pairplot | sns.pairplot(hue="Genre") | Full multi-variable relationship view |
| ML: Cluster Selection | KMeans Elbow Method loop, inertia_ | Code Block 2 - Figure 3 (correct features) |
| ML: K-Means Fitting | KMeans(n_clusters=5).fit() | Code Block 5 - Figure 4 |
| Bug Identification | Wrong column name, dummy data, typo | Section 2 - Bug Fixes Summary |
| Export | df.to_excel() | Code Block 6 - Clustering.xlsx |
Portfolio of Evidence · Mall Customer Segmentation · Portia · February 2026