rharken/GettingAndCleaningData

GettingAndCleaningData

GitHub repository for scripts used in the GettingAndCleaningData class project

This project contains the results of the Coursera class Getting and Cleaning Data: https://class.coursera.org/getdata-013

Assignment - From the Coursera class Course Project page

You should create one R script called run_analysis.R that does the following.

  1. Merges the training and the test sets to create one data set.
  2. Extracts only the measurements on the mean and standard deviation for each measurement.
  3. Uses descriptive activity names to name the activities in the data set
  4. Appropriately labels the data set with descriptive variable names.
  5. From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.

Project Contents:

  • README.md - This file
  • tidyds.txt - The file produced in step 5 of the assignment
  • run_analysis.R - R script required to manipulate the data and produce the file in step 5
  • CodeBook.md - Codebook explaining the data contained in the output file

The process that was followed to create the required tidy dataset output

This project was completed in RStudio. A directory was created for the project and set as RStudio's working directory. The data was then downloaded from the URL given in the assignment notes: https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip. The original study is described at: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones.

After reading the description and downloading the data file, the file was then unzipped into the working directory.
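
The download and unzip described above could be scripted along these lines. This is a minimal sketch, not part of run_analysis.R: the helper name fetch_har_data and the local file name har_dataset.zip are hypothetical, and only the URL comes from the assignment notes.

```r
# Sketch only: the helper name and local zip file name are hypothetical.
data_url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"

fetch_har_data <- function(url = data_url, zip_file = "har_dataset.zip", exdir = ".") {
  # Download only if the archive is not already present
  if (!file.exists(zip_file)) {
    download.file(url, destfile = zip_file, mode = "wb")
  }
  # Extract the archive into the working directory
  unzip(zip_file, exdir = exdir)
}

# fetch_har_data()  # uncomment to download (needs network access)
```

The call is left commented out so that sourcing the snippet does not trigger a large download by accident.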

The run_analysis.R script:

To produce the tidy dataset, the work is done in 3 steps:

  1. Read in all the required data

    There were two datasets required for labels:

    • features.txt - for column labels - **
    • activity_labels.txt - for activity descriptions - this data was read in with two columns activityID and activity

    There were also 3 datasets for each of the train and test measurements, located in their corresponding sub-directories:

    • X_train.txt - the actual measurement data for the train measurements
    • y_train.txt - the activities performed by the subjects being measured - this data was read in with one column - activityID
    • subject_train.txt - the subjects who performed the activities being measured - this data was read in with one column - subject
    • X_test.txt - the actual measurement data for the test measurements
    • y_test.txt - the activities performed by the subjects being measured - this data was read in with one column - activityID
    • subject_test.txt - the subjects who performed the activities being measured - this data was read in with one column - subject

    **Note: The column names were assigned while the measurement data was being read in. This meets the requirement for step 4, although it happens out of order. It was easier to give the data descriptive column names at this point than to rename the columns later. The other columns were also named here in preparation for later steps. This made the data easier to work with as soon as it was read in, because the contents of each column were visible.
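
    The idea of naming columns at read time can be sketched as follows. This is an illustration only: it writes a tiny stand-in for the real files to a temporary directory, and the object names (demo_dir, features, x_train and so on) and the featureID column name are assumptions rather than the names used in run_analysis.R.

    ```r
    # Build a tiny stand-in for the real files in a temporary directory
    demo_dir <- tempfile("har_demo")
    dir.create(demo_dir)
    writeLines(c("1 tBodyAcc-mean()-X", "2 tBodyAcc-std()-X", "3 tBodyAcc-max()-X"),
               file.path(demo_dir, "features.txt"))
    writeLines(c("0.28 -0.99 0.10", "0.27 -0.98 0.12"),
               file.path(demo_dir, "X_train.txt"))
    writeLines(c("5", "4"), file.path(demo_dir, "y_train.txt"))
    writeLines(c("1", "1"), file.path(demo_dir, "subject_train.txt"))

    # features.txt has two columns: an index and the feature name
    features <- read.table(file.path(demo_dir, "features.txt"),
                           col.names = c("featureID", "feature"),
                           stringsAsFactors = FALSE)

    # Assign column names while reading so the data is labelled from the start
    x_train <- read.table(file.path(demo_dir, "X_train.txt"),
                          col.names = features$feature)
    y_train <- read.table(file.path(demo_dir, "y_train.txt"),
                          col.names = "activityID")
    subject_train <- read.table(file.path(demo_dir, "subject_train.txt"),
                                col.names = "subject")

    # With the default check.names = TRUE, read.table runs the names through
    # make.names(), so characters such as "-" and "()" are replaced by dots
    names(x_train)
    ```

    The test files would be read the same way from their own sub-directory.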

  2. Bring the data sets together that are required for the output

    There were two processes involved here:

    • Bring each of the 3 datasets for testing and 3 datasets for training together as one dataset for test and one for train
    • Bring together the test and train datasets into a single dataset

    The 3 datasets in each group all had the same number of observations, so adding the subject and activity columns to the measurement data with cbind was simple. After this step, both requirement 1 and requirement 4 were met.
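
    The two combining steps can be sketched with miniature data frames. The object names and values below are hypothetical stand-ins for the six tables read in step 1; the sketch assumes cbind for the column-wise step (as stated above) and rbind for stacking train and test.

    ```r
    # Hypothetical miniature versions of the six tables read in step 1
    x_train <- data.frame(tBodyAcc.mean...X = c(0.28, 0.27),
                          tBodyAcc.std...X = c(-0.99, -0.98))
    y_train <- data.frame(activityID = c(5L, 4L))
    subject_train <- data.frame(subject = c(1L, 1L))

    x_test <- data.frame(tBodyAcc.mean...X = 0.25,
                         tBodyAcc.std...X = -0.93)
    y_test <- data.frame(activityID = 5L)
    subject_test <- data.frame(subject = 2L)

    # Column-bind the three tables within each group (row counts must match)
    train <- cbind(subject_train, y_train, x_train)
    test <- cbind(subject_test, y_test, x_test)

    # Row-bind the two groups into a single dataset (column names must match)
    combined <- rbind(train, test)
    ```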

  3. Prepare the output and write it to a file

    This was the most complex of the steps, but it was performed with a single chained dplyr statement followed by a statement that writes the tidy dataset. The chain performs these steps:

    • Start with the dataset produced out of the previous step
    • Bring in the descriptive activity names for each activity by using merge - meets step 3 requirement - this has to happen here because the merge column activityID is dropped in the next step
    • Pull out only the mean and standard deviation columns using select - Meets step 2 requirement
    • Group the data for the final summarization - required for the next step to assist in meeting step 5 requirement
    • Summarize the data by averaging each measurement column (that is not in the group_by clause above) - assist in meeting step 5 requirement

    Finally the dataset is written out into the local working directory as tidyds.txt.
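
    The chain described above might look roughly like the following. This is a sketch under assumptions, not the script itself: the input data is a hypothetical miniature, the column pattern "mean|std" is a guess at how the columns were selected, and across() requires dplyr 1.0 or later (the original script may have used an older summarising function).

    ```r
    library(dplyr)

    # Hypothetical miniature inputs standing in for the combined data and labels
    combined <- data.frame(
      subject = c(1L, 1L, 2L, 2L),
      activityID = c(1L, 1L, 2L, 2L),
      tBodyAcc.mean...X = c(0.2, 0.4, 0.1, 0.3),
      tBodyAcc.std...X = c(-0.9, -0.7, -0.8, -0.6),
      tBodyAcc.max...X = c(1, 2, 3, 4)
    )
    activity_labels <- data.frame(activityID = 1:2,
                                  activity = c("WALKING", "SITTING"),
                                  stringsAsFactors = FALSE)

    tidyDS <- combined %>%
      # add descriptive activity names
      merge(activity_labels, by = "activityID") %>%
      # keep identifiers plus mean and std columns; activityID drops out here
      select(subject, activity, matches("mean|std")) %>%
      # one group per subject/activity pair
      group_by(subject, activity) %>%
      # average every non-grouping column
      summarise(across(everything(), mean), .groups = "drop")

    # Written to a temporary directory here; the script writes to the working directory
    write.table(tidyDS, file.path(tempdir(), "tidyds.txt"), row.names = FALSE)
    ```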

    The data was left in the "wide" form because that form does not violate the principles of tidy data outlined in Hadley Wickham's paper: http://vita.had.co.nz/papers/tidy-data.pdf

    The output follows the paper's three principles:

    • Each variable forms a column.
    • Each observation forms a row.
    • Each type of observational unit forms a table.

    Some might argue that the data should instead be put in a "narrow" form, with the measurement name in one column and its value in another. In my experience, looking at multiple variables for the same subject and activity is more complicated in that form. In the database world it would require a fairly complex SQL sub-select. Since the data in the "wide" form does not violate the tidy principles, the data was left in this state. It could be converted fairly easily by using: melt(tidyDS, id.vars=c("subject", "activity"))

    and assigning the output to another dataset and writing it out.
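
    As an illustration of that conversion, here is a small sketch. It assumes melt comes from the reshape2 package (the text above does not say which package supplies it), and the data frame and the output file name narrowds.txt are hypothetical.

    ```r
    library(reshape2)

    # Hypothetical miniature of the wide tidy dataset
    tidyDS <- data.frame(subject = c(1L, 2L),
                         activity = c("WALKING", "SITTING"),
                         tBodyAcc.mean...X = c(0.3, 0.2),
                         tBodyAcc.std...X = c(-0.8, -0.7),
                         stringsAsFactors = FALSE)

    # One row per subject/activity/measurement combination
    narrowDS <- melt(tidyDS, id.vars = c("subject", "activity"))

    # Hypothetical output file name, written to a temporary directory
    write.table(narrowDS, file.path(tempdir(), "narrowds.txt"), row.names = FALSE)
    ```

    With reshape2's defaults, the measurement names land in a column called variable and the averages in a column called value.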

    Also the tidy dataset was validated by reading the dataset back in using:

    validate_tidyds<-read.table("tidyds.txt", header=TRUE)

    The table that was written out was then compared with the table that was read back in, using all.equal(tidyDS, validate_tidyds), which showed that the two datasets matched.
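
    The round-trip check can be reproduced on a small example. The data frame below is a hypothetical stand-in, and the file is written to a temporary directory rather than the working directory.

    ```r
    # Hypothetical miniature of the tidy dataset
    tidyDS <- data.frame(subject = c(1L, 2L),
                         activity = c("WALKING", "SITTING"),
                         tBodyAcc.mean...X = c(0.3, 0.2),
                         stringsAsFactors = FALSE)

    out_file <- file.path(tempdir(), "tidyds.txt")
    write.table(tidyDS, out_file, row.names = FALSE)

    # Read the file back and compare it with the in-memory table
    validate_tidyds <- read.table(out_file, header = TRUE, stringsAsFactors = FALSE)
    round_trip_ok <- isTRUE(all.equal(tidyDS, validate_tidyds))
    ```

    Note that all.equal compares numeric values within a small tolerance rather than bit for bit. If tidyDS is a dplyr tibble, converting it with as.data.frame() before the comparison may be needed so that class differences are not reported.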
