GettingAndCleaningData

GitHub repository for scripts used in the GettingAndCleaningData class project

This project contains the results of the Coursera class Getting and Cleaning Data: https://class.coursera.org/getdata-013

Assignment - From the Coursera class Course Project page

You should create one R script called run_analysis.R that does the following.

1. Merges the training and the test sets to create one data set.
2. Extracts only the measurements on the mean and standard deviation for each measurement.
3. Uses descriptive activity names to name the activities in the data set.
4. Appropriately labels the data set with descriptive variable names.
5. From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.

Project Contents:

README.md - This file
tidyds.txt - The file produced in step 5 of the assignment
run_analysis.R - R script required to manipulate the data and produce the file in step 5
CodeBook.md - Codebook explaining the data contained in the output file

The process that was followed to create the required tidy dataset output

This project was completed using RStudio and began by creating a directory for this project and then setting the current working directory for RStudio to this working area. The data was then downloaded from the site pointed to by the assignment notes: https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip. There was also a page that described how the original study was conducted: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones.

After reading the description and downloading the data file, the file was then unzipped into the working directory.
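
The download and unzip steps can also be scripted. The following is a minimal sketch, not code taken from run_analysis.R: the helper name fetch_har_data and the local file name har_dataset.zip are hypothetical, and only the URL comes from the assignment notes.

```r
# URL quoted in the assignment notes
data_url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"

# Hypothetical helper: download the archive once, then extract it
fetch_har_data <- function(url = data_url,
                           zip_path = "har_dataset.zip",
                           exdir = ".") {
  if (!file.exists(zip_path)) {
    # mode = "wb" keeps the binary archive intact on Windows
    download.file(url, destfile = zip_path, mode = "wb")
  }
  # unzip() returns the paths of the extracted files
  unzip(zip_path, exdir = exdir)
}
```

Calling fetch_har_data() from the project's working directory would leave the extracted files alongside the script.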

The run_analysis.R script:

To produce the tidy dataset, the work is done in 3 steps:

Read in all the required data

There were two datasets required for labels:
- features.txt - for column labels - **
- activity_labels.txt - for activity descriptions - this data was read in with two columns, activityID and activity

And three datasets each for the train and test measurements, within their corresponding sub-directories:

- X_train.txt - the actual measurement data for the train measurements
- y_train.txt - the activities being performed by the subject being measured - this data was read in with one column - activityID
- subject_train.txt - the subjects who performed the activities being measured - this data was read in with one column - subject
- X_test.txt - the actual measurement data for the test measurements
- y_test.txt - the activities being performed by the subject being measured - this data was read in with one column - activityID
- subject_test.txt - the subjects who performed the activities being measured - this data was read in with one column - subject
**Note: The column names were assigned while the measurement data was being read in. This meets the requirement for step 4, although out of order. It was easy to do at this point, and it meant the data already had descriptive column names rather than needing a separate step later. The other columns were also named with later steps in mind. This made the data easier to work with right after it was read in, because you could see what each column contained.

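
Reading the files with column names assigned up front could look like the sketch below. It is an illustration under stated assumptions, not the contents of run_analysis.R: it builds a tiny stand-in for the dataset layout in a temporary directory so it runs without the real download, and the object names (features, xTrain, and so on) are hypothetical. Only the train files are shown; the test files would follow the same pattern.

```r
# Tiny stand-in for the dataset layout (assumption: real files share this shape)
root <- file.path(tempdir(), "har_sketch")
dir.create(file.path(root, "train"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("1 tBodyAcc-mean()-X", "2 tBodyAcc-std()-X", "3 tBodyAcc-max()-X"),
           file.path(root, "features.txt"))
writeLines(c("1 WALKING", "2 SITTING"), file.path(root, "activity_labels.txt"))
writeLines(c("0.28 -0.99 0.5", "0.27 -0.98 0.4"),
           file.path(root, "train", "X_train.txt"))
# The dataset names this file with a lowercase y
writeLines(c("1", "2"), file.path(root, "train", "y_train.txt"))
writeLines(c("1", "3"), file.path(root, "train", "subject_train.txt"))

# Label files
features <- read.table(file.path(root, "features.txt"),
                       col.names = c("featureID", "feature"),
                       stringsAsFactors = FALSE)
activityLabels <- read.table(file.path(root, "activity_labels.txt"),
                             col.names = c("activityID", "activity"),
                             stringsAsFactors = FALSE)

# Measurement data: column names come from features.txt at read time.
# read.table() passes them through make.names(), so "-" and "()" become "."
xTrain <- read.table(file.path(root, "train", "X_train.txt"),
                     col.names = features$feature)
yTrain <- read.table(file.path(root, "train", "y_train.txt"),
                     col.names = "activityID")
subjectTrain <- read.table(file.path(root, "train", "subject_train.txt"),
                           col.names = "subject")

names(xTrain)
```

Because the names are cleaned by make.names(), a feature such as tBodyAcc-mean()-X ends up as the column tBodyAcc.mean...X.
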
Bring the data sets together that are required for the output

There were two processes involved here:
- Bring each of the 3 datasets for testing and 3 datasets for training together as one dataset for test and one for train
- Bring together the test and train datasets into a single dataset
The 3 datasets in each group all had the same number of observations, so adding the additional columns to the measurement data with cbind was simple. After this step, both requirement 1 and requirement 4 were met.

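
The two combining steps can be sketched with toy data frames. This is an illustration only: the object names are hypothetical, the values are made up, and the use of rbind to stack test and train is an assumption, since the text above only names cbind.

```r
# Toy stand-ins for the six datasets (values are made up)
xTrain <- data.frame(tBodyAcc.mean...X = c(0.28, 0.27),
                     tBodyAcc.std...X = c(-0.99, -0.98))
yTrain <- data.frame(activityID = c(1L, 2L))
subjectTrain <- data.frame(subject = c(1L, 3L))

xTest <- data.frame(tBodyAcc.mean...X = 0.25,
                    tBodyAcc.std...X = -0.97)
yTest <- data.frame(activityID = 1L)
subjectTest <- data.frame(subject = 2L)

# Step 1: add the subject and activity columns to each measurement set.
# cbind() pairs rows by position, so the row counts must match.
stopifnot(nrow(xTrain) == nrow(yTrain), nrow(xTrain) == nrow(subjectTrain))
train <- cbind(subjectTrain, yTrain, xTrain)
test <- cbind(subjectTest, yTest, xTest)

# Step 2: stack train and test into one dataset (assumed to be rbind)
combined <- rbind(train, test)
```

Here combined has one row per observation, with subject and activityID ahead of the measurement columns.
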
Prepare the output and write it to a file

This was the most complex of the steps but was performed with a single dplyr statement followed by writing the tidy dataset. The statements were chained so here are the steps performed:
- Start with the dataset produced out of the previous step
- Bring in the descriptive activity names for each activity by using merge - Meets step 3 requirement. This has to happen here because the merge column, activityID, drops out in the next step.
- Pull out only the mean and standard deviation columns using select - Meets step 2 requirement
- Group the data (by subject and activity) for the final summarization - required by the next step, which helps meet the step 5 requirement
- Summarize the data by averaging each measurement column (every column not in the group_by clause above) - helps meet the step 5 requirement

Finally, the dataset is written out to the local working directory as tidyds.txt.
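
A chain of this shape can be sketched as follows. It is not the statement from run_analysis.R: the toy data, the object names, and the contains() patterns used to pick the mean and standard deviation columns are all assumptions, and across() requires dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy stand-in for the combined dataset (values are made up)
combined <- data.frame(
  subject = c(1L, 1L, 2L, 2L),
  activityID = c(1L, 1L, 2L, 2L),
  tBodyAcc.mean...X = c(0.2, 0.4, 0.1, 0.3),
  tBodyAcc.std...X = c(-0.9, -0.7, -0.8, -0.6),
  tBodyAcc.max...X = c(0.5, 0.6, 0.7, 0.8)
)
activityLabels <- data.frame(activityID = 1:2,
                             activity = c("WALKING", "SITTING"))

tidyDS <- combined %>%
  # Step 3: attach descriptive activity names while activityID still exists
  merge(activityLabels, by = "activityID") %>%
  # Step 2: keep only mean and std columns (pattern is an assumption)
  select(subject, activity, contains(".mean."), contains(".std.")) %>%
  # Step 5: average every remaining column per subject and activity
  group_by(subject, activity) %>%
  summarise(across(everything(), mean), .groups = "drop")

# Written to a temporary directory here to avoid touching the project files
write.table(tidyDS, file.path(tempdir(), "tidyds.txt"), row.names = FALSE)
```

The result has one row per subject and activity pair, and the activityID and max columns are gone after the select.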

The data was left in the "wide" form because it did not violate the principles of tidy data outlined in Hadley Wickham's paper: http://vita.had.co.nz/papers/tidy-data.pdf

It follows the three principles:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.

Some might argue that each measurement should be brought into a "narrow" view of the dataset, with the measurement name as a variable in a single column. It's been my experience that looking at multiple variables for the same subject and activity would then be more complicated. In the database world this would be done by a fairly complex SQL sub-select. Since the data in the "wide" form does not violate the tidy principles, the data was left in this state. It could be fairly easily modified by using:

melt(tidyDS, id.vars=c("subject", "activity"))

and assigning the output to another dataset and writing it out.
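
For illustration, the conversion to the narrow form could look like the sketch below. It assumes melt comes from the reshape2 package, which the text above does not state, and it uses a made-up two-row stand-in for tidyDS; the name narrowDS and the output file name are hypothetical.

```r
library(reshape2)  # assumption: melt() is taken from reshape2

# Made-up stand-in for the wide tidy dataset
tidyDS <- data.frame(
  subject = c(1L, 2L),
  activity = c("WALKING", "SITTING"),
  tBodyAcc.mean...X = c(0.3, 0.2),
  tBodyAcc.std...X = c(-0.8, -0.7)
)

# One row per subject, activity and measurement name
narrowDS <- melt(tidyDS, id.vars = c("subject", "activity"))

# Written to a temporary directory here to avoid touching the project files
write.table(narrowDS, file.path(tempdir(), "tidyds_narrow.txt"),
            row.names = FALSE)
```

With reshape2's defaults, the measurement names land in a column called variable and the averages in a column called value.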

Also the tidy dataset was validated by reading the dataset back in using:

validate_tidyds<-read.table("tidyds.txt", header=TRUE)

The table I wrote out was then compared with the table I read back in using all.equal(tidyDS, validate_tidyds), which showed that the data in the two datasets matched.
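
The round trip can be reproduced end to end with a small stand-in. This is a sketch with made-up values and a temporary output path, not the project's actual data.

```r
# Made-up stand-in for the tidy dataset
tidyDS <- data.frame(
  subject = c(1L, 2L),
  activity = c("WALKING", "SITTING"),
  tBodyAcc.mean...X = c(0.3, 0.2),
  stringsAsFactors = FALSE
)

# Write the table, then read it straight back in
out_file <- file.path(tempdir(), "tidyds.txt")
write.table(tidyDS, out_file, row.names = FALSE)
validate_tidyds <- read.table(out_file, header = TRUE,
                              stringsAsFactors = FALSE)

# all.equal() compares numeric columns within a small tolerance and
# returns TRUE, or a description of the differences it found
check <- all.equal(tidyDS, validate_tidyds)
print(check)
```

If tidyDS is a dplyr tibble rather than a plain data frame, converting it with as.data.frame() before the comparison may be needed to avoid differences that are only about the object's class.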
