### Stats with R Exercise sheet 9

##########################
#Week 10: Linear Mixed Effects Models
##########################


## This exercise sheet contains the exercises that you will need to complete and
## submit by 23:55 on Monday, January 13. Write the code below the questions.
## If you need to provide a written answer, comment this out using a hashtag (#).
## Submit your homework via moodle.
## You are required to work together in groups of three students, but everybody
## needs to submit the group version of the homework via moodle individually.
## You need to provide a serious attempt to each exercise in order to have
## the assignment graded as complete.


## Please write your (and your teammates') names and matriculation numbers below.
## Name: 1. H T M A Riyadh, 2. Maria Francis
## Matriculation number: 1. 2577735, 2. 2573627

## Change the name of the file by adding your matriculation numbers
## (exercise0N_firstID_secondID_thirdID.R)

###########################################################################################
###########################################################################################
library(lme4)
library(lattice)
library(Matrix)
library(ggplot2)
# a)There are 3 datasets on moodle, you can choose one of them to work with on this
#   assignment.
#   Read in the data file of your choice (gender.Rdata, sem.Rdata OR relclause.Rdata)
#   and assign it to a variable called "dat".
#   See a description of the items in the datasets below.

getwd()
dat <- read.table("gender.Rdata.txt")

# The files contain data from an experiment where people were reading sentences,
# and pressed the space bar to see the next word. The duration for which a word was
# viewed before pressing the space bar again is the reading time of the word, and is
# stored in the file as "WORD_TIME". The experiment had 24 items (given as "ITEM_ID")
# and 24 subjects (given as "PARTICIPANT"). The order in which the different sentences
# were presented in the experiment is given in the variable "itemOrder".

# For each of the files, the sentences that were shown had a different property.

# Sentences in the sem.Rdata experiment had a semantic violation, i.e. a word that
# didn't fit in with the previous words in terms of its meaning. The experiment
# contained two versions of each item, which were identical to one another except
# for the one sentence containing a semantic violation, while the other one was
# semantically correct. These conditions are named "SG" for "semantically good"
# and "SB" for "semantically bad".

# Semantic materials (the experiment is in German, English translation given
# for those who don't speak German):

# Christina schießt / raucht eine Zigarette nach der Arbeit.
# "Christina is shooting / smoking a cigarette after work."

# The critical word here is "Zigarette", as this would be very surprising in the
# context of the verb "shoot", but not in the context of the verb "smoke".
# Reading times are comparable because the critical word "Zigarette" is identical
# in both conditions.

# Syntactic items:
# Simone hatte eine(n) schreckliche(n) Traum und keine Lust zum Weiterschlafen.
# "Simone had a[masc/fem] horrible[masc/fem] dream[masc] and didn't feel like sleeping
# any longer."

# Here, there are again two conditions, one using correct grammatical gender on
# "einen schrecklichen" vs. the other one using incorrect grammatical gender
# "eine schreckliche". The critical word is "Traum" (it's either consistent or
# inconsistent with the marking on the determiner and adjective).

# Relative clause items:
# Die Nachbarin, [die_sg nom/acc einige_pl nom/acc der Mieter auf Schadensersatz
# verklagt hat_sg/ haben_pl]RC, traf sich gestern mit Angelika.
# "The neighbor, [whom some of the tenants sued for damages / who sued some of the
# tenants for damages]RC, met Angelika yesterday."

# When reading such a sentence, people will usually interpret the relative pronoun
# die as the subject of the relative clause and the following noun phrase
# "einige der Mieter" as the object. This interpretation is compatible with
# the embedded singular-marked (sg) verb hat at the end of the relative clause.
# Encountering the verb haben, which has plural marking (pl), leads to processing
# difficulty: in order to make sense of the relative clause, readers need to
# reinterpret the relative pronoun die as the object of the relative clause
# and the following noun phrase "einige der Mieter" as its subject.
# (Note that the sentences are all grammatical, as the relative pronoun and
# following NPs are chosen such that they are ambiguous between nominative (nom)
# and accusative (acc) case marking.)

# The number of the word in a sentence is given in column "SEMWDINDEX".
# 0 designates the word where the semantic violation happens (in the SB condition;
# in the SG condition, it's the corresponding word). We call this word the
# "critical word" or "critical region". -1 is the word before that, -2 is
# two words before that word, and 2 is two words after that critical word.
# "EXPWORD" shows the words. We expect longer reading times for the violation
# at word 0 or the word after that (word 1) (if people press the button quickly
# before thinking properly).

# b) Inspect "dat" and provide 2 plots.
#    The first plot should provide insights about the relationship between WORD_TIME
#    and ITEM_TYPE.
#    For the second plot you should first subset the data using only RELWDINDEX == 0 and
#    then plot the WORD_TIME for the different conditions (ITEM_TYPE).

str(dat)
summary(dat)
#first plot: one box per condition (ITEM_TYPE), pooled over participants
#boxplot(WORD_TIME~ITEM_TYPE, data = dat)
ggplot(dat, aes(x = ITEM_TYPE, y = WORD_TIME, fill = ITEM_TYPE)) +
  geom_boxplot()


#second plot: same comparison, restricted to the critical word (RELWDINDEX == 0)
dat_sub <- subset(dat, RELWDINDEX == 0)
ggplot(dat_sub, aes(x = ITEM_TYPE, y = WORD_TIME, fill = ITEM_TYPE)) +
  geom_boxplot()


# c) Decide whether you want to exclude any data points (provide not only the code,
#    but also a detailed (!) explanation).
#    Note that we are evaluating WORD_TIME as our response variable.
#    What time intervals make sense for such an experiment?

#Based on the boxplots above, we exclude observations whose WORD_TIME is 2000 or more
#and treat them as outliers: such long times for a single word most likely reflect
#inattention or a pause rather than normal reading.
dat_sub2 <- subset(dat_sub, WORD_TIME < 2000)
ggplot(dat_sub2, aes(x = ITEM_TYPE, y = WORD_TIME, fill = ITEM_TYPE)) +
  geom_boxplot()
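
# A minimal sketch (an addition, not part of the original solution) of how the
# exclusion in c) could be made explicit and reported. The helper name
# `trim_reading_times` and the bounds of 100 and 2000 are assumptions chosen for
# illustration, assuming WORD_TIME is measured in milliseconds; the lower bound
# is meant to catch key presses that are too fast to reflect actual reading.
trim_reading_times <- function(d, lower = 100, upper = 2000) {
  # keep rows with a non-missing reading time strictly inside (lower, upper)
  keep <- !is.na(d$WORD_TIME) & d$WORD_TIME > lower & d$WORD_TIME < upper
  message(sprintf("Excluding %d of %d rows (%.1f%%)",
                  sum(!keep), nrow(d), 100 * mean(!keep)))
  d[keep, , drop = FALSE]
}
# Usage on a toy data frame (made-up values, only to show the behaviour):
toy <- data.frame(WORD_TIME = c(45, 320, 410, 2500, 380))
toy_trimmed <- trim_reading_times(toy)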

# d) Make a scatter plot where for each index word as the sentence progresses (RELWDINDEX),
#    the average reading time is shown for each of the two conditions (ITEM_TYPE).
#    Please use two different colours for the different conditions.
# Average reading time per word position (RELWDINDEX) and condition (ITEM_TYPE).
avg_rt <- aggregate(WORD_TIME ~ RELWDINDEX + ITEM_TYPE, data = dat, FUN = mean)

# One point per word position and condition; colour distinguishes the conditions.
ggplot(avg_rt, aes(x = RELWDINDEX, y = WORD_TIME, color = ITEM_TYPE)) +
  geom_point() +
  labs(x = "Word position relative to critical word (RELWDINDEX)",
       y = "Mean WORD_TIME", color = "Condition") +
  ggtitle("Mean reading time by word position and condition")


# e) You do not need to use ggplot here, just follow the example below.
#    The code is a plot for the dataset 'sleepstudy' from the package 'lme4'.
#    The figure shows relationships between days without sleeping and reaction
#    time for each participant (subject) separately.

summary(sleepstudy)
print(xyplot(Reaction ~ Days | Subject, sleepstudy, aspect = "xy",
             layout = c(9,2), type = c("g", "p", "r"),
             index.cond = function(x,y) coef(lm(y ~ x))[1],
             xlab = "Days of sleep deprivation",
             ylab = "Average reaction time (ms)"))

#    Your task is to figure out how to adapt this plot for our data. What do you
#    conclude regarding the reading sentences experiment?

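# WORD_TIME is the response (y) and RELWDINDEX the predictor (x), one panel per participant.
print(xyplot(WORD_TIME ~ RELWDINDEX | PARTICIPANT, dat, aspect = "xy",
             layout = c(6,4), type = c("g", "p", "r"),
             index.cond = function(x,y) coef(lm(y ~ x))[1],
             xlab = "RELWDINDEX",
             ylab = "Word time"))
#Participants differ in their overall reading times: some are consistently slower than
#others. This between-participant variability motivates a by-participant random intercept.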

# f) Experiment with calculating a linear mixed effects model for this study,
#    and draw the appropriate conclusions (give a detailed explanation
#    for each model).
linear_model1 = lmer(WORD_TIME ~ RELWDINDEX + (1|PARTICIPANT), dat)
linear_model1
#Random effects: the by-participant intercepts have a Std. Dev. of 143.0, i.e. participants
#differ in their baseline reading time.

#Fixed Effects:
#The RELWDINDEX coefficient is the slope for word position. For each one-unit increase
#in RELWDINDEX, WORD_TIME is estimated to increase by 6.483.

linear_model2 = lmer(WORD_TIME ~ RELWDINDEX + (1|PARTICIPANT) + (1|ITEM_ID), dat)
linear_model2

#Random effects:
#The by-participant intercepts have a Std. Dev. of 143.0 and the by-item intercepts a
#Std. Dev. of 28.46, so items vary far less than participants.

#Fixed Effects:
#For each one-unit increase in RELWDINDEX, WORD_TIME is estimated to increase by 7.118.
#This is slightly larger than in the previous model (6.483) because variability between
#items (ITEM_ID) is now accounted for.

linear_model3 = lmer(WORD_TIME ~ RELWDINDEX + (RELWDINDEX|PARTICIPANT), dat)
linear_model3

#Random effects:
#The by-participant intercepts have a Std. Dev. of 141.40. The by-participant slopes for
#RELWDINDEX have a Std. Dev. of 10.63, i.e. the effect of word position varies across
#participants by about 10.63 around the fixed slope.

#Fixed Effects:
#For each one-unit increase in RELWDINDEX, WORD_TIME is estimated to increase by 6.519
#on average across participants.
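
# A minimal sketch (an addition, not part of the original answer) of how a
# random-intercept and a random-slope structure could be compared formally with a
# likelihood ratio test. It uses the 'sleepstudy' data shipped with lme4 so that it
# runs without the course data; the names `m_int`, `m_slope` and `lrt` are
# illustrative. The same call pattern should carry over to the reading-time models.
library(lme4)
m_int   <- lmer(Reaction ~ Days + (1 | Subject), sleepstudy, REML = FALSE)
m_slope <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy, REML = FALSE)
# anova() on two lmer fits reports AIC, BIC and a chi-square test;
# a small p-value favours the model with by-subject slopes.
lrt <- anova(m_int, m_slope)
print(lrt)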


# g) Let's get back to the dataset 'sleepstudy'. The following plot shows
#    subject-specific intercepts and slopes. Adapt this plot for our study
#    and draw conclusions.

model = lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
print(dotplot(ranef(model,condVar=TRUE),  scales = list(x = list(relation = 'free')))
      [["Subject"]])


model2 = lmer(WORD_TIME ~ RELWDINDEX + (RELWDINDEX|PARTICIPANT), dat)
print(dotplot(ranef(model2,condVar=TRUE),  scales = list(x = list(relation = 'free')))
      [["PARTICIPANT"]])

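#The plot shows, for each participant, the estimated deviation from the average intercept
#(baseline word time) and from the average RELWDINDEX slope, together with an interval.
#Participants whose interval does not include zero differ noticeably from the average.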
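
# A short sketch (an addition, shown on 'sleepstudy' so it runs without the course
# data) of how the values behind such a dotplot could be inspected numerically.
# The names `sleep_fit`, `subj_dev` and `subj_coef` are illustrative. ranef() returns
# each subject's deviation from the fixed effects, while coef() returns fixed effect
# plus deviation, i.e. the subject-specific intercept and slope.
library(lme4)
sleep_fit <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
subj_dev  <- ranef(sleep_fit)$Subject   # deviations from the population estimates
subj_coef <- coef(sleep_fit)$Subject    # subject-specific intercepts and slopes
head(subj_coef)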


#