Give your stakeholders the food delivery insights they need.
Clean, analyze, test, and visualize one month of FoodExpress platform data — end to end.
Overview • Dataset • Project Structure • Setup • Key Analyses • Findings • Author • Contributing
This project performs a comprehensive exploratory data analysis (EDA) and statistical inference study on one month of delivery operations for the FoodExpress platform — a rapidly growing food delivery service connecting customers with local restaurants across four city zones.
The analysis pipeline covers:
- ✅ Data cleaning — null imputation, sentinel correction, datatype standardization, outlier removal, and data reshaping
- ✅ Feature engineering — net revenue calculation, discount aggregation, and derived temporal features (day, week, segment)
- ✅ Exploratory analysis (EDA) — order distribution, customer segmentation, cuisine trends, and delivery performance patterns
- ✅ Statistical testing — hypothesis testing including t-tests and chi-square tests to validate operational and customer insights
- ✅ Business insights — identification of key drivers of customer satisfaction, revenue performance, and operational efficiency
- ✅ Interactive dashboard (Streamlit) — real-time filtering, KPI monitoring, and visualization of operational and revenue metrics
| Metric | Value |
|---|---|
| Period | September 2024 |
| Raw orders | 1,000 |
| Clean records | 990 |
| Total revenue | $35,575.93 |
| Net platform revenue | $25,013.29 |
| City zones | North · South · East · West |
The dataset foodexpress_data_uncleaned.csv contains 24 columns across six categories:
| Category | Columns |
|---|---|
| Order details | OrderID, OrderDate, OrderTime, DayOfWeek, TimePeriod |
| Customer info | CustomerSegment (New / Regular / VIP), City |
| Restaurant | RestaurantID, CuisineType, RestaurantRating |
| Delivery metrics | DeliveryTime, DeliveryDistance, DeliveryFee, WeatherCondition |
| Financials | OrderValue, DiscountAmount, PromoDiscount, TipAmount, CashPayment, CardPayment, WalletPayment, UPIPayment |
| Performance | CustomerRating, DeliveryPartnerRating |
⚠️ Raw data quality issues: 294 missing values across 4 columns, sentinel error99999inDeliveryTime, and"INVALID"/"N/A"strings inCustomerRating. All resolved in the cleaning pipeline.
FoodExpress-Data-Analytics/
│
├── app.py # Streamlit dashboard
├── FoodExpress Data Analytics Project.ipynb # Analysis notebook
├── foodexpress_data_uncleaned.csv # Raw dataset
├── requirements.txt # Dependencies
├── assets/ # Images & visuals
├── README.md # Documentation
---
## ⚙️ Setup
**Requirements:** Python 3.10+, Jupyter Notebook
```bash
# 1. Clone the repository
git clone https://github.com/doyancha/FoodExpress-Data-Analytics.git
cd FoodExpress-Data-Analytics
# 2. Install dependencies
pip install pandas numpy scipy matplotlib seaborn statsmodels jupyterlab
# 3. Launch the notebook
jupyter notebook "FoodExpress Data Analytics Project.ipynb"
The dataset
foodexpress_data_uncleaned.csvmust sit in the same directory as the notebook. All cleaning steps are self-contained within the notebook cells.
The raw CSV required six discrete cleaning operations before analysis:
| Step | Action | Outcome |
|---|---|---|
| 1 | Impute CustomerRating (79), RestaurantRating (75), DeliveryTime (70), TipAmount (70) with column median |
294 nulls resolved |
| 2 | Replace sentinel value 99999 in DeliveryTime with median of valid values |
Outlier corrected |
| 3 | Replace "INVALID" / "N/A" strings in CustomerRating → NaN → median fill |
dtype fixed to float64 |
| 4 | Convert OrderDate → datetime64; financial & rating columns → float64 |
Types corrected |
| 5 | Drop rows where DeliveryTime > P99 or OrderValue < $5 |
10 rows removed → 990 rows |
| 6 | Reshape 4 wide payment columns → PaymentMethod + PaymentAmount using melt() |
1 payment per row |
# Step 6 — Reshape payments from wide → long format
payment_cols = ['CashPayment', 'CardPayment', 'WalletPayment', 'UPIPayment']
df_payment = df_clean.melt(
id_vars=[c for c in df_clean.columns if c not in payment_cols],
value_vars=payment_cols,
var_name='PaymentType',
value_name='PaymentAmount'
)
df_payment = df_payment[df_payment['PaymentAmount'] > 0].copy()
# Result → 990 orders : 990 payments (1:1 mapping)total_revenue = 0.0
for value in df['OrderValue']:
total_revenue += value
# → Total Revenue: $35,575.93A delivery is "Premium" when DeliveryFee > $5.00 and DeliveryTime < 25 minutes.
def is_premium_delivery(order_id):
row = df[df['OrderID'] == order_id]
if row.empty:
return False
return (row['DeliveryFee'].iloc[0] > 5.00) and (row['DeliveryTime'].iloc[0] < 25)
# Vectorized scan → 49 premium deliveries founddef avg_order_value_by_cuisine(cuisine_type):
subset = df[df['CuisineType'] == cuisine_type]['OrderValue']
if subset.empty:
return None
return {'cuisine': cuisine_type, 'avg': subset.mean(), 'count': len(subset)}NetRevenue = OrderValue − DiscountAmount − PromoDiscount − (OrderValue × 0.20)
A 20% platform commission is applied to every order. The NetRevenue column reflects true platform earnings after all deductions.
COMMISSION_RATE = 0.20
df_payment['RestaurantCommission'] = df_payment['OrderValue'] * COMMISSION_RATE
df_payment['NetRevenue'] = (
df_payment['OrderValue']
- df_payment['DiscountAmount'].fillna(0)
- df_payment['PromoDiscount'].fillna(0)
- df_payment['RestaurantCommission']
)Net revenue by city zone:
| Rank | Zone | Net Revenue |
|---|---|---|
| 🥇 | North Zone | $6,631.72 |
| 🥈 | West Zone | $6,374.00 |
| 🥉 | South Zone | $6,060.36 |
| 4th | East Zone | $5,947.21 |
Six charts were produced using matplotlib and seaborn:
| Chart | Type | Key Insight |
|---|---|---|
| Orders by cuisine | Horizontal bar | Indian leads with 152 orders |
| Revenue by cuisine × city | Heatmap | Surface top cuisine-city revenue combinations |
| Payment method by segment | 100% stacked bar | Uniform spread — no segment preference |
| Daily order trend | Line plot | Temporal patterns across September |
| ry time by city | Box plot | North Zone has heavy right-skew |
| Campaign AOV impact | Paired scatter | Before vs After AOV per restaurant |
# Heatmap — cuisine × city net revenue
pivot = df_payment.pivot_table(
values='NetRevenue', index='CuisineType', columns='City', aggfunc='sum'
)
sns.heatmap(pivot, annot=True, fmt='.0f', cmap='YlOrRd', linewidths=0.5)
# 100% Stacked bar — payment method by customer segment
pct = pd.crosstab(
df_clean['CustomerSegment'], df_clean['PaymentMethod'], normalize='index'
) * 100
pct.plot(kind='bar', stacked=True,
color=['#e74c3c', '#3498db', '#2ecc71', '#f1c40f'], edgecolor='black')H₀ : μ(VIP ratings) = μ(Regular ratings)
H₁ : The means differ
from scipy import stats
vip_r = df[df['CustomerSegment'] == 'VIP Customer']['CustomerRating']
regular_r = df[df['CustomerSegment'] == 'Regular Customer']['CustomerRating']
t_stat, p_value = stats.ttest_ind(vip_r, regular_r)
Result: p ≥ 0.05 → Fail to reject H₀
No significant difference in satisfaction between VIP and Regular customers. The platform delivers a consistent experience across tiers.
H₀ : μ(North Zone delivery time) = 30 minutes
H₁ : μ ≠ 30 minutes
north = df[df['City'] == 'North Zone']['DeliveryTime'].dropna()
t_stat, p_value = stats.ttest_1samp(north, popmean=30.0)
# Observed mean: 125 min | Only 36.7% of orders delivered within 30 min
Result: p < 0.05 → Reject H₀
North Zone significantly misses its 30-minute target. Operational intervention is required.
H₀ : Payment method choice is independent of customer segment
H₁ : There is a significant association
Result: p ≥ 0.05 → Fail to reject H₀
Payment habits are uniform across all customer segments. A single payment promotion strategy is sufficient.
H₀ : Each of 5 cuisines accounts for exactly 20% of orders
H₁ : Distribution is not uniform
from scipy.stats import chisquare
observed = df_clean[df_clean['CuisineType'].isin(
['Italian', 'Chinese', 'Indian', 'Fast Food', 'Mexican']
)]['CuisineType'].value_counts()
chi2, p_value = chisquare(observed)
# χ² = 9.11 | p = 0.0585
Result: p = 0.0585 → Fail to reject H₀ (marginal)
Indian cuisine leads at 152 orders, but no statistically significant imbalance detected at α = 0.05.
factors = ['DeliveryDistance', 'OrderValue', 'RestaurantRating', 'DeliveryPartnerRating']
correlations = {f: df['CustomerRating'].corr(df[f]) for f in factors}
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
for idx, factor in enumerate(factors):
sns.regplot(data=df, x=factor, y='CustomerRating',
ax=axes.ravel()[idx], line_kws={'color': 'red'})| Factor | Direction | Strength |
|---|---|---|
DeliveryPartnerRating |
Positive | Strong — biggest satisfaction driver |
RestaurantRating |
Positive | Weak — food quality is secondary |
DeliveryDistance |
Negative | Weak — longer distance → lower satisfaction |
OrderValue |
~Zero | None — spend level doesn't predict satisfaction |
Average order value (AOV) per restaurant was compared before (Week 1) and after (Week 3) a promotional campaign.
before = df_campaign[df_campaign['Period'] == 'Before (Week 1)'] \
.groupby('RestaurantID')['OrderValue'].mean()
after = df_campaign[df_campaign['Period'] == 'After (Week 3)'] \
.groupby('RestaurantID')['OrderValue'].mean()
paired = pd.concat([before, after], axis=1, join='inner')
t_stat, p_value = stats.ttest_rel(paired.iloc[:, 0], paired.iloc[:, 1])
# AOV change: −13.3% | Revenue lift: −$7,228.79 | p ≥ 0.05
Result: p ≥ 0.05 → No statistically significant change in AOV from the campaign.
- 🏙️ North Zone generates the highest net revenue ($6,631.72) but misses the 30-minute delivery target — only 36.7% of orders arrive on time.
- 🚴 Delivery partner rating is the single strongest predictor of customer satisfaction, outweighing food quality and spend level.
- ⭐ VIP and Regular customers give statistically equivalent ratings — operations are fair and consistent across segments.
- 💳 Payment preferences are uniform across all customer segments — no targeted payment strategy is needed.
- 🍛 Indian cuisine leads in volume (152 orders), though no cuisine is statistically over- or under-represented.
- 📉 The promotional campaign had no statistically significant effect on average order value.
- Fix North Zone delivery — invest in route optimization or zone-specific partner allocation to push ≤30-min delivery rate above 70%.
- Prioritize partner quality — since partner rating drives satisfaction above all else, implement tiered incentives tied to delivery rating performance.
- Redesign the promo mechanics — flat discounts did not move AOV; test free delivery thresholds, bundling, or cuisine-specific limited-time offers.
- Shift to volume campaigns — since AOV is unresponsive to promos, redirect marketing toward order frequency and new customer acquisition.
- Formalize A/B measurement — every future campaign should have a pre/post paired t-test built in before budget is committed.
| Library | Version | Purpose |
|---|---|---|
pandas |
2.x | Data loading, cleaning, groupby, melt, feature engineering |
numpy |
1.x | Numerical operations, median imputation |
scipy |
1.x | t-tests, chi-square tests, winsorization |
matplotlib |
3.x | Base plotting engine |
seaborn |
0.13+ | Heatmaps, box plots, regression plots, facet grids |
statsmodels |
0.14+ | Tukey HSD post-hoc tests |
Doyancha
Data analyst focused on operations analytics, statistical inference, and Python-based data pipelines.
Contributions, issues, and feature requests are welcome!
- Fork the repository
- Create a feature branch —
git checkout -b feature/your-idea - Commit your changes —
git commit -m "Add: your idea" - Push to the branch —
git push origin feature/your-idea - Open a Pull Request
Please open an issue first to discuss any significant change.
This project is licensed under the MIT License — see the LICENSE file for details.









