Phase 3: Process

Tools Used

  • R (v4.x) — primary analysis language
  • RStudio — IDE
  • R Markdown — reproducible reporting
  • Key packages: tidyverse, lubridate, janitor, skimr

Cleaning Summary

Full reproducible cleaning code lives in notebooks/01_process.Rmd. Headline steps:

  1. Loaded 4 core CSV files: daily activity, sleep, hourly steps, hourly calories.
  2. Standardized column names to snake_case using janitor::clean_names().
  3. Removed 3 duplicate rows from the sleep dataset.
  4. Confirmed no missing values across the four files.
  5. Converted date columns from character strings to proper Date and POSIXct types.
  6. Flagged zero-step days as likely non-wear or low-engagement signals rather than deleting them.
  7. Engineered new features:
    • day_of_week — for weekly pattern analysis
    • hour — for time-of-day analysis
    • user_type — Sedentary / Lightly / Fairly / Very Active classification
    • sleep_efficiency — minutes asleep ÷ minutes in bed
    • usage_category — High / Moderate / Low engagement
  8. Joined daily activity with sleep data on user ID and date.
  9. Exported 7 clean datasets to data/clean/ for reuse in the analysis phase.
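
The steps above can be sketched on toy data as follows. This is a minimal illustration, not the code in notebooks/01_process.Rmd: the raw column names (`Id`, `SleepDay`, `ActivityDate`, `ActivityHour`, and so on), the month/day/year date format, the `zero_step_day` flag name, and the choice of a left join are all assumptions. `user_type` and `usage_category` are left out because their cut-offs are not stated here.

```r
library(dplyr)
library(lubridate)
library(janitor)

# Toy stand-ins for the raw CSV files (assumed column names and date formats).
sleep_raw <- data.frame(
  Id = c(1, 1, 1, 2),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM",
               "4/13/2016 12:00:00 AM", "4/12/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(420, 420, 380, 300),
  TotalTimeInBed = c(480, 480, 400, 400)
)
activity_raw <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(10000, 0, 5000)
)
hourly_raw <- data.frame(
  Id = c(1, 1),
  ActivityHour = c("4/12/2016 1:00:00 AM", "4/12/2016 2:00:00 PM"),
  StepTotal = c(0, 350)
)

# Steps 2, 3, 5, 7: snake_case names, drop exact duplicate rows,
# parse dates, and derive sleep_efficiency.
sleep <- sleep_raw %>%
  clean_names() %>%
  distinct() %>%
  mutate(
    date = as_date(mdy_hms(sleep_day)),
    sleep_efficiency = total_minutes_asleep / total_time_in_bed
  )

# Steps 5, 6, 7: parse dates, flag (not delete) zero-step days, add day_of_week.
activity <- activity_raw %>%
  clean_names() %>%
  mutate(
    date = mdy(activity_date),
    day_of_week = wday(date, label = TRUE, abbr = FALSE),
    zero_step_day = total_steps == 0
  )

# Steps 5, 7: convert to POSIXct and extract the hour of day.
hourly <- hourly_raw %>%
  clean_names() %>%
  mutate(
    activity_hour = mdy_hms(activity_hour),
    hour = hour(activity_hour)
  )

# Step 8: join on user ID and date (a left join keeps every activity day;
# the join type is an assumption).
activity_sleep <- activity %>%
  left_join(sleep, by = c("id", "date"))
```

A left join is used here because fewer users appear in the sleep data than in the activity data, so an inner join would drop activity days with no sleep record; whether that is desirable depends on the analysis.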

Data Integrity Verification

  • All file loads completed without parsing errors
  • User ID counts confirmed: 33 in activity, 24 in sleep
  • Date ranges confirmed: April 12 – May 12, 2016
  • No negative values in step, calorie, or distance columns
  • All date conversions verified with class checks (Date and POSIXct)
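
Checks like the ones listed above could be scripted along these lines. This is a sketch on a toy data frame; the column names (`id`, `date`, `total_steps`, `calories`, `total_distance`) and the `checks` list are hypothetical, and on the real data the expected values would be the user counts and date range reported above.

```r
library(dplyr)

# Toy cleaned activity table (hypothetical column names).
activity <- data.frame(
  id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-05-12", "2016-04-12")),
  total_steps = c(10000, 0, 5000),
  calories = c(2100, 1500, 1800),
  total_distance = c(7.1, 0, 3.4)
)

checks <- list(
  n_users      = n_distinct(activity$id),      # distinct user IDs
  date_range   = range(activity$date),         # earliest and latest date
  any_missing  = anyNA(activity),              # TRUE if any NA is present
  any_negative = any(
    select(activity, total_steps, calories, total_distance) < 0
  ),
  date_is_date = inherits(activity$date, "Date")  # class check
)

# Fail loudly if any check does not hold for this toy table.
stopifnot(
  checks$n_users == 2,
  !checks$any_missing,
  !checks$any_negative,
  checks$date_is_date
)
```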

Outputs

Cleaned datasets:

  • daily_activity_clean.csv
  • sleep_day_clean.csv
  • hourly_steps_clean.csv
  • hourly_calories_clean.csv
  • user_activity_summary.csv
  • usage_frequency.csv
  • activity_sleep_joined.csv
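
One way to write a set of cleaned tables out under these file names is to loop over a named list. The sketch below is an assumption about how the export might look: the `clean_sets` list and its toy contents are hypothetical, only two of the seven names are shown, and it writes under `tempdir()` rather than data/clean/ so that running it does not overwrite real outputs.

```r
library(readr)

# Named list of cleaned tables; names become file names (toy contents).
clean_sets <- list(
  daily_activity_clean = data.frame(id = c(1, 2), total_steps = c(10000, 5000)),
  sleep_day_clean      = data.frame(id = c(1, 2), total_minutes_asleep = c(420, 300))
)

# Scratch directory mirroring the data/clean/ layout.
out_dir <- file.path(tempdir(), "data", "clean")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Write one CSV per list entry.
for (name in names(clean_sets)) {
  write_csv(clean_sets[[name]], file.path(out_dir, paste0(name, ".csv")))
}

written <- list.files(out_dir)
```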