don't try this lab until you have been briefed on using pandas
Objective: Read a tabular data file and calculate basic statistics on numerical columns.
Dataset: sales_data.csv (500 lines) with the following columns:
date(string): Transaction date in YYYY-MM-DD formatstore_id(string): Store identifier (e.g., "S001", "S002")product_id(string): Product identifier (e.g., "P123", "P456")quantity(integer): Number of items soldunit_price(float): Price per item in dollarscustomer_age(integer): Age of the customer
Tasks:
- Read the CSV file using pandas
- Display the first 5 rows of the dataset
- Calculate the sum of all quantities sold
- Calculate the average unit price
- Find the minimum and maximum customer age
- Count how many sales were made at each store
Example starter code:
import pandas as pd
# Read the data
df = pd.read_csv('sales_data.csv')
# Display first 5 rows
print("First 5 rows:")
print(df.head())
# Calculate sum of quantities
total_quantity = df['quantity'].sum()
print(f"Total quantity sold: {total_quantity}")
# Calculate average unit price
avg_price = df['unit_price'].mean()
print(f"Average unit price: ${avg_price:.2f}")
# Find min and max customer age
min_age = df['customer_age'].min()
max_age = df['customer_age'].max()
print(f"Customer age range: {min_age} to {max_age} years")
# Count sales by store
store_counts = df['store_id'].value_counts()
print("\nSales by store:")
print(store_counts)Objective: Read a data file, filter records, and perform group-by operations with simple aggregations.
Dataset: Same sales_data.csv from Exercise 1
Tasks:
- Read the CSV file using pandas
- Filter the data to show only:
- Sales with quantities greater than 5
- Products with a unit price less than $50
- Calculate the total revenue for each store (quantity × unit price)
- Find the average quantity sold for each product
- Determine which day of the week had the most sales
- Save the filtered results to a new CSV file
Example starter code:
import pandas as pd
from datetime import datetime
# Read the data
df = pd.read_csv('sales_data.csv')
# Filter data
high_quantity = df[df['quantity'] > 5]
print(f"Sales with quantity > 5: {len(high_quantity)}")
affordable_products = df[df['unit_price'] < 50]
print(f"Sales with unit price < $50: {len(affordable_products)}")
# Calculate revenue for each row
df['revenue'] = df['quantity'] * df['unit_price']
# Group by store and calculate total revenue
store_revenue = df.groupby('store_id')['revenue'].sum().reset_index()
print("\nTotal revenue by store:")
print(store_revenue)
# Group by product and calculate average quantity
product_avg_quantity = df.groupby('product_id')['quantity'].mean().reset_index()
print("\nAverage quantity by product (first 5):")
print(product_avg_quantity.head())
# Extract day of week from date
df['date'] = pd.to_datetime(df['date'])
df['day_of_week'] = df['date'].dt.day_name()
# Count sales by day of week
day_counts = df['day_of_week'].value_counts()
print("\nSales by day of week:")
print(day_counts)
# Save filtered results to CSV
affordable_high_quantity = df[(df['quantity'] > 5) & (df['unit_price'] < 50)]
affordable_high_quantity.to_csv('filtered_sales.csv', index=False)
print(f"\nSaved {len(affordable_high_quantity)} filtered records to 'filtered_sales.csv'")Objective: Process a data file, perform transformations, and output results to multiple files.
Dataset: Same sales_data.csv from previous exercises
Tasks:
- Read the CSV file using pandas
- Create new derived columns:
- Total revenue (quantity × unit price)
- Month (extracted from date)
- Age category (Young: <30, Adult: 30-50, Senior: >50)
- Calculate monthly sales totals
- Find the top 3 products by total revenue
- Create a separate summary file for each store with:
- Total revenue
- Average order value
- Number of transactions
- Most popular product
- Create a simple bar chart of revenue by store
Example starter code:
import pandas as pd
import matplotlib.pyplot as plt
import os
# Create output directory if it doesn't exist
os.makedirs('store_reports', exist_ok=True)
# Read the data
df = pd.read_csv('sales_data.csv')
# Convert date to datetime
df['date'] = pd.to_datetime(df['date'])
# Create derived columns
df['revenue'] = df['quantity'] * df['unit_price']
df['month'] = df['date'].dt.month_name()
# Create age category
def categorize_age(age):
if age < 30:
return 'Young'
elif age <= 50:
return 'Adult'
else:
return 'Senior'
df['age_category'] = df['customer_age'].apply(categorize_age)
# Calculate monthly sales
monthly_sales = df.groupby('month')['revenue'].sum().reset_index()
print("Monthly sales totals:")
print(monthly_sales)
# Find top 3 products by revenue
product_revenue = df.groupby('product_id')['revenue'].sum().reset_index()
top_products = product_revenue.sort_values('revenue', ascending=False).head(3)
print("\nTop 3 products by revenue:")
print(top_products)
# Process each store
for store_id in df['store_id'].unique():
# Filter data for this store
store_data = df[df['store_id'] == store_id]
# Calculate metrics
total_revenue = store_data['revenue'].sum()
avg_order_value = store_data['revenue'].mean()
transaction_count = len(store_data)
# Find most popular product
product_counts = store_data['product_id'].value_counts()
most_popular = product_counts.index[0]
# Create summary
summary = pd.DataFrame({
'Metric': ['Total Revenue', 'Average Order Value', 'Transaction Count', 'Most Popular Product'],
'Value': [f"${total_revenue:.2f}", f"${avg_order_value:.2f}", transaction_count, most_popular]
})
# Save to file
filename = f"store_reports/{store_id}_summary.csv"
summary.to_csv(filename, index=False)
print(f"Created summary for {store_id}")
# Create a bar chart of revenue by store
store_revenue = df.groupby('store_id')['revenue'].sum()
plt.figure(figsize=(10, 6))
store_revenue.plot(kind='bar')
plt.title('Total Revenue by Store')
plt.xlabel('Store ID')
plt.ylabel('Revenue ($)')
plt.savefig('store_revenue.png')
plt.close()
print("Created bar chart: store_revenue.png")Here's a simplified Python script to generate the sample dataset for these exercises:
import pandas as pd
import numpy as np
import random
from datetime import datetime, timedelta
# Set random seed for reproducibility
np.random.seed(42)
# Generate 500 records
n_records = 500
# Create date range for the past 30 days
end_date = datetime.now().date()
start_date = end_date - timedelta(days=30)
date_range = [start_date + timedelta(days=x) for x in range((end_date - start_date).days)]
# Generate data
data = {
'date': [random.choice(date_range).strftime('%Y-%m-%d') for _ in range(n_records)],
'store_id': [f"S{random.randint(1, 5):03d}" for _ in range(n_records)],
'product_id': [f"P{random.randint(1, 20):03d}" for _ in range(n_records)],
'quantity': np.random.randint(1, 15, n_records),
'unit_price': np.round(np.random.uniform(10, 100, n_records), 2),
'customer_age': np.random.randint(18, 70, n_records)
}
# Create DataFrame
df = pd.DataFrame(data)
# Save to CSV
df.to_csv('sales_data.csv', index=False)
print(f"Created sales_data.csv with {n_records} records")This simplified data generation script will create a dataset that students can use for all three exercises, focusing on basic data engineering concepts.