Portfolio of Evidence
Mall Customer Segmentation
Code Review, Bug Fixes & Technical Showcase
| Author | Portia |
|---|---|
| Date | February 2026 |
| Language | Python 3.12 (Jupyter Notebook) |
| Libraries | Pandas, NumPy, Seaborn, Matplotlib, Scikit-Learn |
| Dataset | Mall_Customers.csv (n = 200 customers) |
| Algorithm | K-Means Clustering (k = 5) |
This document serves as a Portfolio of Evidence for the Mall Customer Segmentation project. It demonstrates the full data science workflow - from raw data loading and exploratory analysis through to machine learning clustering - and showcases all code written, bugs identified and fixed, and outputs produced.
The original notebook contained three critical errors that would have caused it to fail or produce misleading results. This report documents each error, explains why it was wrong, and presents the corrected code. All charts were regenerated with improved styling and annotations.
The original notebook contained the following three bugs. Each is documented in detail in Section 4.
| # | Location | Bug Description | Fix Applied |
|---|---|---|---|
| 1 | EDA - Gender KDE plot | Used "Gender" - column is actually named "Genre" | Changed all "Gender" references to "Genre" |
| 2 | Elbow Method cell | Fitted elbow on np.random.rand() dummy data - not actual mall features | Replaced with actual df[["Annual Income","Spending Score"]] features |
| 3 | KMeans initialisation | Typo: "n_custers=5" (missing "l") raises a TypeError (unexpected keyword argument) | Corrected to "n_clusters=5", added random_state=42 for reproducibility |
The following screenshots show each section of the fixed and improved notebook, rendered in a dark-theme code style. Each cell includes syntax highlighting and explanatory comments.
The notebook begins by importing all required libraries and loading the dataset. A global plot style is applied at this stage to ensure consistency across all visualisations throughout the analysis.
Code Block 1: Library imports, global plot settings, and CSV data loading.
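The setup cell can be sketched as below. This is a runnable stand-in, not the notebook's exact cell: a tiny inline CSV sample replaces Mall_Customers.csv (the column layout is assumed from the standard Kaggle dataset), and the global style settings are illustrative.

```python
import io

import pandas as pd
import matplotlib
matplotlib.use("Agg")                 # headless backend; the notebook renders inline
import matplotlib.pyplot as plt

# Assumed global style settings - the notebook's exact choices may differ
plt.style.use("ggplot")
plt.rcParams["figure.figsize"] = (10, 6)

# In the notebook: df = pd.read_csv("Mall_Customers.csv")
# A tiny inline sample with the same column layout stands in for the file here.
sample_csv = io.StringIO(
    "CustomerID,Genre,Age,Annual Income (k$),Spending Score (1-100)\n"
    "1,Male,19,15,39\n"
    "2,Male,21,15,81\n"
    "3,Female,20,16,6\n"
)
df = pd.read_csv(sample_csv)
print(df.shape)            # (3, 5)
print(list(df.columns))    # note the "Genre" column - the source of Bug 1
```

Applying the style once at the top, rather than per-cell, is what keeps the later figures visually consistent.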
The three original bugs are shown side-by-side with their corrections. Red-highlighted lines show the broken original code; green lines show the fix applied. This side-by-side documentation approach is best practice in collaborative data science work.
Code Block 2: All three bug fixes documented with before/after comparison.
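The three fixes can be condensed into one runnable sketch. Synthetic data stands in for the mall dataset (column names assumed from the CSV); each broken line is shown as a comment above its correction.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the mall data
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Genre": rng.choice(["Male", "Female"], size=200),
    "Annual Income (k$)": rng.integers(15, 140, size=200),
    "Spending Score (1-100)": rng.integers(1, 100, size=200),
})

# Bug 1 - wrong column name:
#   df["Gender"].value_counts()          # KeyError: 'Gender'
gender_counts = df["Genre"].value_counts()  # fix: the column is "Genre"

# Bug 2 - model fitted on dummy data:
#   X = np.random.rand(200, 2)           # meaningless random features
X = df[["Annual Income (k$)", "Spending Score (1-100)"]]  # fix: real features

# Bug 3 - keyword typo:
#   KMeans(n_custers=5)                  # TypeError: unexpected keyword argument
km = KMeans(n_clusters=5, random_state=42, n_init=10).fit(X)  # fix + fixed seed
print(sorted(set(km.labels_)))           # five cluster labels
```

Pinning random_state makes the cluster assignment reproducible across re-runs, which is why it accompanies the typo fix rather than being a separate change.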
A for-loop iterates over the three numeric columns (Age, Annual Income, Spending Score) to produce distribution histograms with KDE overlays and mean reference lines. This replaces the original approach of saving all plots to the same filename (overwriting each other).
Code Block 3: Univariate distribution loop with improved multi-panel figure layout.
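A minimal version of that loop is sketched below on synthetic data. The notebook uses sns.histplot with a KDE overlay; plain Matplotlib histograms stand in here to keep the sketch dependency-light, and the figure is saved once as a multi-panel file so nothing is overwritten.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Synthetic stand-in data (column names assumed from the CSV)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.integers(18, 70, size=200),
    "Annual Income (k$)": rng.integers(15, 140, size=200),
    "Spending Score (1-100)": rng.integers(1, 100, size=200),
})

cols = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
fig, axes = plt.subplots(1, len(cols), figsize=(15, 4))
for ax, col in zip(axes, cols):
    ax.hist(df[col], bins=20, alpha=0.7)
    mean = df[col].mean()
    ax.axvline(mean, color="red", linestyle="--", label=f"mean = {mean:.1f}")
    ax.set_title(col)
    ax.legend()
fig.tight_layout()
fig.savefig("distributions.png")   # one multi-panel file, nothing overwritten
```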
A correlation matrix is computed across the three numeric features and visualised as an annotated heatmap. The coolwarm diverging palette makes positive and negative correlations visually distinct. Key finding: income and spending are nearly uncorrelated (r ≈ 0.01), which confirms that using both as clustering features is meaningful.
Code Block 4: Correlation heatmap with annotations and diverging colour palette.
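The heatmap cell can be approximated as follows. The notebook calls sns.heatmap(corr, annot=True, cmap="coolwarm"); this sketch uses Matplotlib's imshow with manual annotations as a stand-in, on synthetic data in place of the real features.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Synthetic stand-in data (column names assumed from the CSV)
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Age": rng.integers(18, 70, size=200),
    "Annual Income (k$)": rng.integers(15, 140, size=200),
    "Spending Score (1-100)": rng.integers(1, 100, size=200),
})

corr = df.corr()   # 3x3 Pearson correlation matrix

# Matplotlib stand-in for sns.heatmap(corr, annot=True, cmap="coolwarm")
fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
for i in range(len(corr)):
    for j in range(len(corr)):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")
fig.colorbar(im, ax=ax)
fig.tight_layout()
```

Fixing vmin/vmax at -1 and 1 keeps the diverging palette centred on zero, so a near-white cell reads directly as "no correlation".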
K-Means is fitted to the two key features with k=5 (confirmed by the Elbow Method). The resulting cluster labels are mapped to descriptive segment names (Target Group, Careful, Standard, Careless, Sensible) based on each centroid's income and spending values. Segment names make downstream analysis and reporting far more accessible to non-technical stakeholders.
Code Block 5: K-Means fit, centroid output, and descriptive cluster name mapping.
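The fit-and-name step can be sketched like this. Synthetic blobs (via make_blobs) stand in for the real income/spending data, and the naming rule shown is hypothetical: in the notebook the five names are assigned by inspecting each centroid's income and spending values, whereas here a simple spend-ranked ordering illustrates the mapping mechanics.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Five synthetic blobs stand in for the real income/spending data
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=5.0,
                  center_box=(20, 100), random_state=42)
df = pd.DataFrame(X, columns=["Annual Income (k$)", "Spending Score (1-100)"])

km = KMeans(n_clusters=5, random_state=42, n_init=10)
df["Cluster"] = km.fit_predict(df)

centroids = pd.DataFrame(km.cluster_centers_, columns=["Income", "Spend"])
print(centroids.round(1))

# Hypothetical naming rule for illustration: rank clusters by centroid
# spending score and assign the five segment names in that order.
order = centroids.sort_values("Spend", ascending=False).index
names = dict(zip(order, ["Target Group", "Careless", "Careful",
                         "Standard", "Sensible"]))
df["Segment Name"] = df["Cluster"].map(names)
print(df["Segment Name"].value_counts())
```

Mapping integer labels to names once, immediately after fitting, means every downstream table and chart can use the readable segment names.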
The final code cells compute the gender breakdown per cluster using pd.crosstab(), calculate average age, income and spending per segment using groupby().mean(), and export the fully labelled dataset to Excel. The export includes the new "Segment Name" column so business users can immediately identify which group each customer belongs to.
Code Block 6: Gender crosstab, cluster profile summary, and Excel export.
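Those profiling steps can be sketched on a synthetic labelled frame as follows; the Excel export line is shown but commented out, since df.to_excel needs an engine such as openpyxl installed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the clustered, labelled dataset
rng = np.random.default_rng(7)
segments = ["Target Group", "Careful", "Standard", "Careless", "Sensible"]
df = pd.DataFrame({
    "Genre": rng.choice(["Male", "Female"], size=200),
    "Age": rng.integers(18, 70, size=200),
    "Annual Income (k$)": rng.integers(15, 140, size=200),
    "Spending Score (1-100)": rng.integers(1, 100, size=200),
    "Segment Name": rng.choice(segments, size=200),
})

# Gender breakdown per segment (row-normalised shares)
gender_mix = pd.crosstab(df["Segment Name"], df["Genre"], normalize="index")
print(gender_mix.round(2))

# Average demographics per segment
profile = df.groupby("Segment Name")[
    ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
].mean().round(1)
print(profile)

# Export for business users (requires the openpyxl engine):
# df.to_excel("Clustering.xlsx", index=False)
```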
Every chart produced by the notebook is presented below alongside the analytical interpretation that connects the visual output to business insight.
The income distribution is approximately normal, centred around $60k with a mild right skew. The majority of shoppers (roughly 60%) fall in the $40k-$80k bracket. The visible right tail indicates a smaller high-earning cohort - this group, once identified through clustering, represents the highest-value opportunity.
Figure 1: Annual Income histogram with KDE, mean (red) and median (orange) reference lines.
Female customers show a tighter income distribution peaking around $60k-$70k. Male customers display a flatter curve with a longer high-income tail - the highest earners in the dataset skew male. Despite this, median incomes are comparable between genders (~$60k), meaning gender alone is not a strong predictor of spending behaviour.
Figure 2: KDE of Annual Income split by Gender. Sample sizes and means shown in legend.
Running K-Means for k = 1 through k = 10 on the actual mall features (Annual Income + Spending Score) produces the elbow curve below. The curve bends sharply at k = 5, after which additional clusters yield diminishing reductions in inertia. The Elbow Method is a heuristic rather than a proof, but it strongly supports five as the appropriate number of segments for this dataset.
Figure 3: Elbow Method - WCSS vs. k. Annotated inflection point at k=5 (highlighted red).
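The corrected elbow loop can be sketched as below; synthetic blobs stand in for the real income/spending features, and n_init is set explicitly for reproducibility across scikit-learn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Five synthetic blobs stand in for the real income/spending features
X, _ = make_blobs(n_samples=200, centers=5, random_state=42)

# WCSS (inertia) for k = 1..10; the "elbow" is where the drop flattens out
wcss = [
    KMeans(n_clusters=k, random_state=42, n_init=10).fit(X).inertia_
    for k in range(1, 11)
]
for k, w in zip(range(1, 11), wcss):
    print(f"k={k:2d}  WCSS={w:10.1f}")
```

Because the synthetic data genuinely contains five groups, the printed WCSS values drop steeply up to k = 5 and flatten afterwards, mirroring the behaviour in Figure 3.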
The scatter plot below is the centrepiece of the analysis. Five well-separated clusters are clearly visible in income-spending space. The star markers indicate cluster centroids. Segment names are annotated directly on the chart. No significant boundary overlap exists, confirming the K-Means algorithm found genuinely distinct groupings.
Figure 4: K-Means 5-segment scatter plot. Stars = centroids. Colour-coded by segment.
Boxplot: Median spending scores are nearly identical for male and female shoppers (~50). However, female customers show a slightly tighter IQR (interquartile range), suggesting more consistent moderate spending. Male customers display a wider spread, including more very low and very high spenders.
KDE: The density plot confirms the pattern - females cluster more sharply around the 45-55 score range, while males display a heavier low-end tail (scores 0-30). This means male shoppers are more polarised: some are very high spenders, but a larger proportion spend very little.
Figure 5: Boxplot of Spending Score by Gender. Medians nearly equal; female distribution tighter.
Figure 6: KDE of Spending Score by Gender. Male distribution has heavier low-spending tail.
The dual-panel chart below consolidates all five segment profiles. The left panel directly contrasts average income against average spending score for each group - the divergence of the "Careful" group (highest income, below-average spending) is immediately apparent. The right panel confirms that the "Careless" segment is the largest single group at 52 customers (26%), while the high-value "Target Group" and "Careful" groups are smaller but disproportionately commercially important.
Figure 7: Segment profiles - avg income vs spending (left) and customer count/share (right).
The table below provides a complete reference summary for all five segments, including the average demographics computed from the clustering output.
| Segment | Avg Income | Avg Spend | Avg Age | Count | % Female | Strategy |
|---|---|---|---|---|---|---|
| Target Group | $66k | 83 | 39 | 37 (18.5%) | 51% | VIP loyalty, exclusive events |
| Careful | $100k | 49 | 35 | 29 (14.5%) | 55% | Luxury audit, aspirational campaigns |
| Standard | $64k | 46 | 37 | 43 (21.5%) | 54% | Seasonal promotions, basket upsell |
| Careless | $31k | 52 | 40 | 52 (26.0%) | 52% | Flash sales, gamified loyalty |
| Sensible | $51k | 15 | 41 | 39 (19.5%) | 51% | Value bundles, essential discounts |
This project demonstrates competency across the full data science pipeline. The table below maps each skill to the specific code or output produced in this notebook.
| Skill Area | Method Used | Evidence in This Report |
|---|---|---|
| Data Loading | pd.read_csv() | Code Block 1 - loaded Mall_Customers.csv |
| Data Inspection | df.describe(), df.columns, value_counts() | Section 3.1 - summary statistics output |
| Data Cleaning | df.drop("CustomerID", axis=1) | Pairplot cell - removed non-predictive column |
| EDA - Univariate | sns.histplot() with KDE, axvline() | Code Block 3 - Figure 1 |
| EDA - Bivariate | sns.kdeplot(), sns.boxplot(), pd.crosstab() | Code Blocks 3 & 6 - Figures 2, 5, 6 |
| Correlation Analysis | df.corr(), sns.heatmap() | Code Block 4 - correlation heatmap |
| Pairplot | sns.pairplot(hue="Genre") | Full multi-variable relationship view |
| ML: Cluster Selection | KMeans Elbow Method loop, inertia_ | Code Block 2 - Figure 3 (correct features) |
| ML: K-Means Fitting | KMeans(n_clusters=5).fit() | Code Block 5 - Figure 4 |
| Bug Identification | Wrong column name, dummy data, typo | Section 2 - Bug Fixes Summary |
| Export | df.to_excel() | Code Block 6 - Clustering.xlsx |
Portfolio of Evidence · Mall Customer Segmentation · Portia · February 2026