- R (v4.x) — primary analysis language
- RStudio — IDE
- R Markdown — reproducible reporting
- Key packages: tidyverse, lubridate, janitor, skimr

Full reproducible cleaning code lives in `notebooks/01_process.Rmd`. Headline steps:
- Loaded 4 core CSV files: daily activity, sleep, hourly steps, hourly calories.
- Standardized column names to `snake_case` using `janitor::clean_names()`.
- Removed 3 duplicate rows from the sleep dataset.
- Confirmed no missing values across the four files.
- Converted date columns from character strings to proper Date and POSIXct types.
- Flagged zero-step days as wear/engagement signals rather than deleting them.
- Engineered new features:
  - `day_of_week`: for weekly pattern analysis
  - `hour`: for time-of-day analysis
  - `user_type`: Sedentary / Lightly / Fairly / Very Active classification
  - `sleep_efficiency`: minutes asleep ÷ minutes in bed
  - `usage_category`: High / Moderate / Low engagement
- Joined daily activity with sleep data on user ID and date.
- Exported 7 clean datasets to `data/clean/` for reuse in the analysis phase.
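
Several of the steps above can be sketched on toy data as follows. This is a minimal illustration, not the notebook's actual code: the raw column names (`Id`, `ActivityDate`, `SleepDay`, `ActivityHour`, and so on), the date string formats, and the `zero_step_day` flag name are all assumptions, and the `user_type` and `usage_category` features are left out.

```r
library(dplyr)
library(lubridate)
library(janitor)

# Toy stand-ins for the raw CSVs. Column names, date formats, and values
# are illustrative assumptions, not taken from the real files.
activity_raw <- tibble(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(8000, 0, 12000)
)
sleep_raw <- tibble(
  Id = c(1, 1, 2),
  SleepDay = rep("4/12/2016 12:00:00 AM", 3),
  TotalMinutesAsleep = c(400, 400, 360),  # first two rows are exact duplicates
  TotalTimeInBed = c(450, 450, 400)
)
hourly_raw <- tibble(
  Id = 1,
  ActivityHour = "4/12/2016 1:00:00 PM",
  StepTotal = 500
)

activity <- activity_raw %>%
  clean_names() %>%                          # Id -> id, TotalSteps -> total_steps
  mutate(
    date = mdy(activity_date),               # character -> Date
    day_of_week = wday(date, label = TRUE),
    zero_step_day = total_steps == 0         # flag zero-step days, keep the rows
  )

sleep <- sleep_raw %>%
  clean_names() %>%
  distinct() %>%                             # drop exact duplicate rows
  mutate(
    date = as_date(mdy_hms(sleep_day)),
    sleep_efficiency = total_minutes_asleep / total_time_in_bed
  )

hourly <- hourly_raw %>%
  clean_names() %>%
  mutate(
    activity_hour = mdy_hms(activity_hour),  # character -> POSIXct
    hour = hour(activity_hour)
  )

# Keep every activity day; days without a sleep record get NA sleep columns.
activity_sleep <- left_join(activity, sleep, by = c("id", "date"))
```

A left join is used here so that activity days without a matching sleep record are kept; an inner join would be the alternative if only days with both measurements are wanted.
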
- All file loads completed without parsing errors
- User ID counts confirmed: 33 in activity, 24 in sleep
- Date ranges confirmed: April 12 – May 12, 2016
- No negative values in step, calorie, or distance columns
- All date conversions verified by class checks
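
Checks of this kind can be expressed as a handful of assertions. The sketch below runs them against a small made-up table; the column names and values are assumptions for illustration only.

```r
library(dplyr)

# Toy cleaned table standing in for one of the real datasets (hypothetical values).
activity <- tibble(
  id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  total_steps = c(8000, 0, 12000),
  calories = c(2100, 1500, 2600)
)

checks <- list(
  n_users      = n_distinct(activity$id),    # distinct user IDs
  date_range   = range(activity$date),       # first and last observed date
  any_missing  = any(is.na(activity)),
  any_negative = any(activity$total_steps < 0 | activity$calories < 0),
  date_is_date = inherits(activity$date, "Date")
)

# Stop with an error if any check fails.
stopifnot(!checks$any_missing, !checks$any_negative, checks$date_is_date)
```
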

Cleaned datasets:
- `daily_activity_clean.csv`
- `sleep_day_clean.csv`
- `hourly_steps_clean.csv`
- `hourly_calories_clean.csv`
- `user_activity_summary.csv`
- `usage_frequency.csv`
- `activity_sleep_joined.csv`
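
One way to write files like these is to keep the cleaned tables in a named list and loop over it. The sketch below uses two made-up tables and a temporary folder so it has no side effects on the project. The list name `clean_tables` is hypothetical, and the real notebook may export differently.

```r
library(readr)

# Hypothetical stand-ins for two of the cleaned tables.
clean_tables <- list(
  daily_activity_clean = data.frame(id = c(1, 2), total_steps = c(8000, 12000)),
  sleep_day_clean      = data.frame(id = c(1, 2), total_minutes_asleep = c(400, 360))
)

# Temporary output folder for the sketch; the project itself writes to data/clean/.
out_dir <- file.path(tempdir(), "clean")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# File names are derived from the list names, e.g. daily_activity_clean.csv.
for (name in names(clean_tables)) {
  write_csv(clean_tables[[name]], file.path(out_dir, paste0(name, ".csv")))
}

written <- sort(list.files(out_dir))
```
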