### Stats with R Exercise sheet 6
##########################
#Week 7: Correlation and Regression
##########################
## This exercise sheet contains the exercises that you will need to complete and
## submit by 23:55 on Monday, December 9. Write the code below the questions.
## If you need to provide a written answer, comment this out using a hashtag (#).
## Submit your homework via moodle.
## You are required to work together in groups of three students, but everybody
## needs to submit the group version of the homework via moodle individually.
## Please write below your (and your teammates') name, matriculation number.
## Name: 1. H T M A Riyadh, 2. Abdallah Bashir, 3. Maria Francis
## Matriculation number: 1. 2577735, 2. 2577831, 3. 2573627
## Change the name of the file by adding your matriculation numbers
## (sheet06_firstID_secondID_thirdID.R)
###########################################################################################
###########################################################################################
library(reshape)
library(languageR)
library(ggplot2)
library(cowplot)
#######################
### Exercise 1: Correlation
#######################
# a) Get some data - access the ratings data set in languageR and name it "data".
# The data set contains subjective frequency ratings and their length averaged over
# subjects, for 81 concrete English nouns.
data <- ratings
# b) Take a look at the data frame.
head(data)
# c) Let's say you're interested in whether there is a linear relationship between
# the word frequency of the 81 nouns and their length.
# Take a look at the relationship between the frequency and word length data by
# means of a scatterplot (use the ggplot library for this).
ggplot(data, aes(x = Length, y = Frequency)) +
geom_point()
# d) Judging from the graphs, do you think that word frequency and word length are
# in any way correlated with one another?
# Looking at the graph, the two variables do not appear to be strongly
# correlated, though there may be a slight negative trend.
# e) Compute the Pearson correlation coefficient for the two variables by means
# of cor().
# Tell R to only include complete pairs of observations.
# As a reminder: the Pearson coefficient is the covariance of the two variables
# divided by the product of their standard deviations.
# It ranges from -1 (perfect negative correlation) to +1 (perfect positive
# correlation).
#help("cor")
cor(x = data$Length, y = data$Frequency, use = "complete.obs", method = "pearson")
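# Quick sanity check (on synthetic vectors, so it runs without the ratings
# data; x_demo and y_demo are made-up names): Pearson's r equals the
# covariance divided by the product of the standard deviations.
x_demo <- c(1, 2, 3, 4, 5)
y_demo <- c(2, 1, 4, 3, 6)
all.equal(cov(x_demo, y_demo) / (sd(x_demo) * sd(y_demo)),
          cor(x_demo, y_demo))
# [1] TRUE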
# f) Does the correlation coefficient suggest a small, medium or large effect?
# What about the direction of the effect?
# [1] -0.4281462
# The coefficient is about -0.43, which suggests a medium effect.
# The direction of the effect is negative: longer words tend to receive lower
# frequency ratings.
# g) Note that we have a large number of tied ranks in word length data
# (since there are multiple words with the length of e.g. 5).
# Thus, we might draw more accurate conclusions by setting the method to
# Kendall's tau instead of the Pearson correlation coefficient (which is the default).
# How do you interpret the difference between these 2 correlation coefficients?
cor(data$Length, data$Frequency, use = "complete.obs", method = "kendall")
#[1] -0.316297
# The Pearson coefficient measures the linear correlation between two variables,
# while Kendall's tau is a rank-based measure and therefore handles the many
# tied ranks in Length better; its estimate (-0.32) is somewhat smaller in
# magnitude than Pearson's (-0.43).
# h) What about significance? Use the more user-friendly cor.test()!
# Take a look at the output and describe what's in there.
# What do you conclude?
# cor.test() reports the correlation estimate together with a test statistic
# and a p-value for the null hypothesis of no association.
cor.test(data$Length, data$Frequency, method = "kendall")
#data: data$Length and data$Frequency
#z = -3.9186, p-value = 8.907e-05
#alternative hypothesis: true tau is not equal to 0
#sample estimates:
# tau
#-0.316297
# Conclusion: the p-value (8.907e-05) is well below 0.05, so we reject the
# null hypothesis; length and frequency are significantly (negatively)
# correlated.
# i) Finally, also calculate Spearman's rank correlation for the same data.
cor(data$Length, data$Frequency, use="complete.obs", method = "spearman")
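# Spearman's rho is simply Pearson's r computed on the ranks of the data;
# a small self-contained check on made-up vectors (x_demo, y_demo are
# hypothetical names):
x_demo <- c(10, 8, 13, 9, 11)
y_demo <- c(8.04, 6.95, 7.58, 8.81, 8.33)
all.equal(cor(rank(x_demo), rank(y_demo)),
          cor(x_demo, y_demo, method = "spearman"))
# [1] TRUE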
#######################
### Exercise 2: Regression
#######################
# a) Fit a linear regression model to the data frame "data" from exercise 1
# for the variables Frequency (outcome variable) and Length (predictor variable).
# General form:
# "modelname <- lm(outcome ~ predictor, data = dataFrame, na.action = an action)"
linear_rm <- lm(Frequency ~ Length, data = data)
# b) How do you interpret the output? Is the relationship between the two variables
# positive or negative?
# Plot the data points and the regression line.
# Output:
# (Intercept)      Length
#      6.5015     -0.2943
# The slope for Length is negative (-0.2943), so the relationship between the
# two variables is negative: predicted frequency decreases with word length.
ggplot(data, aes(x = Length, y = Frequency))+
geom_point()+
geom_abline(intercept = 6.5015, slope=-0.2943)
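# Alternatively, instead of hard-coding the estimates, the coefficients can be
# taken from the fitted model object (or the line can be fit on the fly with
# geom_smooth(method = "lm")):
ggplot(data, aes(x = Length, y = Frequency)) +
  geom_point() +
  geom_abline(intercept = coef(linear_rm)[1], slope = coef(linear_rm)[2])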
# c) Run the plotting command again and have R display the actual words that belong
# to each point.
# (Don't worry about readability of overlapping words.)
ggplot(data, aes(x = Length, y = Frequency, label = Word)) +
  geom_abline(intercept = 6.5015, slope = -0.2943) +
  geom_text()
#######################
### Exercise 3: Regression
#######################
# a) Try this again for another example:
# Let's go back to our digsym data set.
# Set your wd and load the data frame digsym_clean.csv and store it in a variable.
# You can download this data frame from the material of week 6: t-test and friends.
dat <- read.csv("digsym_clean.csv")
str(dat)
#summary(dat)
# b) Suppose you want to predict reaction times in the digit symbol task by
# people's age.
# Fit a linear regression model to the data frame for the variables
# correct_RT_2.5sd (outcome variable) and Age (predictor variable).
# General form:
# "modelname <- lm(outcome ~ predictor, data = dataFrame, na.action = an action)"
# But first we need to cast the data to compute an RT mean (use correct_RT_2.5sd)
# for each subject, so that we have only one Age observation per Subject.
# Store the result in a new dataframe called "cast".
# In case you're wondering why we still have to do this - like the t-test,
# linear regression assumes independence of observations.
# In other words, one row should correspond to one subject or item only.
linear_rm_dat<-lm(correct_RT_2.5sd ~ Age, data = dat)
cast <- cast(dat, Subject + Age ~., fun.aggregate = mean, value = "correct_RT_2.5sd", na.rm = TRUE)
colnames(cast)[colnames(cast)=="(all)"] <- "RTavg"
head(cast)
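# The same one-row-per-subject averaging can be done with base R's
# aggregate(); a self-contained demo on made-up data ("demo" is a
# hypothetical data frame, not part of the assignment):
demo <- data.frame(Subject = c(1, 1, 2, 2),
                   Age     = c(20, 20, 30, 30),
                   RT      = c(400, 600, 500, 700))
aggregate(RT ~ Subject + Age, data = demo, FUN = mean)
# one row per Subject, with RT averaged within subject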
# c) Now fit the regression model.
linear_rm_cast <- lm(RTavg ~ Age, data = cast)
# d) Let's go over the output - what's in there?
# How do you interpret the output?
linear_rm_cast
#output:
#Coefficients:
# (Intercept) Age
# 637.93 21.22
# The slope for Age is positive (21.22), suggesting that mean RT increases
# with age in this fit.
# e) Plot the data points and the regression line.
ggplot(cast, aes(x = Age, y = RTavg))+
geom_point() +
geom_abline(intercept = 637.93, slope=21.22)
# f) Plot a histogram and qq-plot of the residuals.
# Does their distribution look like a normal distribution?
ggplot(data=cast, aes(x = residuals(linear_rm_cast))) +
geom_histogram()
ggplot(data = cast, aes(sample = residuals(linear_rm_cast))) +
stat_qq()
# The histogram looks roughly normal, but the qq-plot shows that the residuals
# deviate from normality, especially in the tails.
# g) Plot Cook's distance for the regression model from c) which estimates the
# residuals (i.e. distance between the actual values and the predicted value on
# the regression line) for individual data points in the model.
cooks_dist <- cooks.distance(linear_rm_cast)
ggplot(cast, aes(x = Age, y = cooks_dist )) +
geom_point()
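# Base R provides the same diagnostic directly: the plot method for lm objects
# draws Cook's distance (and labels the most influential observations) when
# which = 4 is selected.
plot(linear_rm_cast, which = 4)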
# h) Judging from the plot in g) it actually looks like we have 1 influential
# observation in there that has potential to distort (and pull up) our regression
# line.
# The last observation (row 37) in cast has a very high Cook's distance
# (greater than 0.6).
# In other words, the entire regression function would change by more than
# 0.6 when this particular case would be deleted.
# What is the problem with observation 37?
# Run the plotting command again and have R display the subjects that belong to
# each point.
# Observation 37 corresponds to Subject 40, which appears to be an outlier
# exerting undue influence on the regression line.
ggplot(cast, aes(x = Age, y = cooks_dist, label = Subject)) +
geom_point() +
geom_text()
# i) Make a subset of "cast" by excluding the influential subject and name it cast2.
#dim(cast)
#37 3
cast2 <- subset(cast, Subject!= 40)
# j) Fit the model from c) again, using cast2, and take a good look at the output.
linear_rm_cast2<-lm(RTavg ~ Age, data = cast2)
linear_rm_cast2
# k) What's different about the output?
# How does that change your interpretation of whether age is predictive of RTs?
#Coefficients:
# (Intercept) Age
#862.05 11.98
# With the outlier removed, the model is no longer pulled towards it: the
# slope estimate drops from 21.22 to 11.98 (and the intercept rises from
# 637.93 to 862.05), so the apparent effect of Age is much weaker.
# l) Plot the regression line again - notice the difference in slope in
# comparison to our earlier model fit?
ggplot(cast2, aes(x = Age, y = RTavg))+
geom_point() +
geom_abline(intercept = 862.05, slope = 11.98)
# m) Display the two plots side by side to better see what's going on.
plot1 <- ggplot(cast, aes(x = Age, y = RTavg))+
geom_point() +
geom_abline(intercept = 637.93, slope=21.22)+
ggtitle("With outlier")
plot2 <- ggplot(cast2, aes(x = Age, y = RTavg))+
geom_point() +
geom_abline(intercept = 862.05, slope = 11.98)+
ggtitle("Without outlier")
plot_grid(plot1, plot2)
# n) Compute the proportion of variance in RT that can be accounted for by Age.
# In other words: Compute R Squared.
# Take a look at the Navarro book (Chapter on regression) if you have trouble
# doing this.
summary(linear_rm_cast2)$r.squared
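# The same value can be computed by hand as 1 - SS_residual / SS_total;
# a self-contained check on synthetic data (demo and m_demo are made-up names):
set.seed(1)
demo <- data.frame(x = 1:20)
demo$y <- 3 * demo$x + rnorm(20, sd = 5)
m_demo <- lm(y ~ x, data = demo)
all.equal(1 - sum(residuals(m_demo)^2) / sum((demo$y - mean(demo$y))^2),
          summary(m_demo)$r.squared)
# [1] TRUE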
# o) How do you interpret R Squared?
# [1] 0.03493231
# R squared is about 0.035, i.e. Age accounts for only about 3.5% of the
# variance in mean RT. Since values closer to 1 indicate a better fit, this
# very low value means Age is a poor predictor of RT in these data.