These are 'problems' that draw on IRW data which may be useful in problem sets for psychometrics classes (especially those focusing on IRT).
### Logistic regression
In the [chess data](https://www.rdocumentation.org/packages/LNIRT/versions/0.5.1/topics/AmsterdamChess), there is a special covariate related to the person’s Elo rating. This is effectively their ‘ranking’ given their play in tournaments. [Get data via: `df<-irw::irw_fetch("chess_lnirt")`]
- If you consider the items in the measure that they are giving (see Figure 2 [here](https://www.ejwagenmakers.com/2005/VanderMaasWagenmakersACTpaper.pdf)), how would you anticipate that it is connected with responses?
- Can you probe this via logistic regression? In thinking about this, how might you account for the fact that responses are coming from different items?
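A minimal sketch of the item-adjustment idea, using simulated stand-in data (the column names `id`, `item`, `resp`, and `elo` are assumptions mimicking the IRW layout; check `names(df)` after fetching the real table):

```r
set.seed(252)
# Toy stand-in for the chess table: 200 players, 10 items of varying difficulty
n <- 200; k <- 10
elo <- rnorm(n, 1500, 200)
b   <- seq(-2, 2, length.out = k)              # item difficulties
df <- expand.grid(id = 1:n, item = 1:k)
df$elo  <- elo[df$id]
p <- plogis(0.005 * (df$elo - 1500) - b[df$item])
df$resp <- rbinom(nrow(df), 1, p)

# Ignoring items vs. absorbing item difficulty with item fixed effects:
m0 <- glm(resp ~ elo, data = df, family = binomial)
m1 <- glm(resp ~ elo + factor(item), data = df, family = binomial)
coef(m1)["elo"]   # positive slope: higher Elo, higher success probability
```

Comparing `m0` and `m1` is one way into the second bullet: the fixed effects account for responses coming from items of different difficulty.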
### Classical item analysis
The simplest analysis of items involves calculation of (i) item-level mean responses and (ii) correlations between the item response and the sum score. For cognitive constructs, calculations of (i) give us a simple indication of item difficulty (larger values are easier items). Calculations of (ii) tell us about the degree to which an item is ‘hanging together’ with the rest of the items. If correlations between an item and the sum score are very low, this can be indicative of a problem.
For two IRW tables (`gilbert_meta_1` and `gilbert_meta_9`), consider what the above calculations (as well as considerations of reliability) tell us about each measure. (Hint: I would say one scale looks pretty good from this perspective [perhaps with one bad item] and one scale might leave us wishing we could do a little better.) Note that each of these tables is based on outcome data for an RCT (details below). *What do the descriptive statistics you have calculated imply for the inferences the RCT aims to make?*
- `gilbert_meta_1`: “We examine the intention-to-treat impact of the MORE intervention on third-grade reading comprehension from a cluster-RCT. Our data, collected in the 2021–2022 school year, consist of 110 schools randomly assigned to treatment and control from a large urban district in the southeastern United States (N = 7,797 students).” ([link](https://journals.sagepub.com/doi/10.3102/10769986231171710))
- `gilbert_meta_9`: “Using a randomized experiment in Ecuador, this study provides evidence on whether cash, vouchers, and food transfers targeted to women and intended to reduce poverty and food insecurity also affected intimate partner violence. Results indicate that transfers reduce controlling behaviors and physical and/or sexual violence by 6 to 7 percentage points. Impacts do not vary by transfer modality, which provides evidence that transfers not only have the potential to decrease violence in the short-term, but also that cash is just as effective as in-kind transfers.” ([link](https://www.openicpsr.org/openicpsr/project/113634/version/V1/view))
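Calculations (i) and (ii) take only a few lines of base R. The response matrix here is simulated (you would first pivot the IRW long-format table to a persons × items matrix):

```r
set.seed(1)
# Simulated persons x items response matrix (500 persons, 8 items)
n <- 500; k <- 8
theta <- rnorm(n)
X <- sapply(seq(-1.5, 1.5, length.out = k),
            function(b) rbinom(n, 1, plogis(theta - b)))

item_means <- colMeans(X)                 # (i) larger = easier item
sumscore   <- rowSums(X)
it_cors    <- sapply(1:k, function(j) cor(X[, j], sumscore))
# 'corrected' variant: correlate with the sum score EXCLUDING the item itself
rest_cors  <- sapply(1:k, function(j) cor(X[, j], sumscore - X[, j]))
round(cbind(item_means, it_cors, rest_cors), 2)
```

The corrected (rest-score) version is worth knowing about since the raw item-total correlation is inflated by the item's own contribution to the sum.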
### Towards IRT models
We are going to start thinking about IRT models next week and I wanted to begin by further examining the relationship between sum scores (remind yourself what those are) and responses for a single item.
- See [here](https://github.com/ben-domingue/252/blob/main/ps2/towards_irt.R). We will construct a figure where the x-axis is the sum score and the y-axis is the proportion of respondents with that sum score who got an individual item correct and each panel is a unique item. What do these figures suggest about the relationship between sum scores and item responses?
- Pick two items, one that is easy (most people get it right) and one that is hard (most do not). Can you estimate a logistic regression wherein you’re regressing the response for a single item on the sum score? How do the intercepts from these regressions look vis-a-vis your intuition about the difficulty of the items?
- Reconsider the above analysis with the `andrich_mudfold` table. What qualitatively different pattern do you notice between the relationship between item responses and sum scores here?
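A minimal version of one panel of that figure, with simulated data standing in for the real responses (the linked `towards_irt.R` does this per item across panels):

```r
set.seed(2)
# Simulated dichotomous responses: 1000 persons, 10 items
n <- 1000; k <- 10
theta <- rnorm(n)
X <- sapply(seq(-2, 2, length.out = k),
            function(b) rbinom(n, 1, plogis(theta - b)))
ss <- rowSums(X)
# Proportion of respondents at each sum score who got item 5 correct
pc <- tapply(X[, 5], ss, mean)
plot(as.numeric(names(pc)), pc, type = "b",
     xlab = "sum score", ylab = "proportion correct, item 5")
```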
### A comparison of the 1-3PL approaches
The 1PL, 2PL, and 3PL are all commonly used with dichotomous responses. We can examine the differences in the estimated response functions for each of these models when applied to data from the Brazilian ENEM assessment; see [here](https://github.com/ben-domingue/252/blob/main/c4/enem1.R). What do you think of the differences between the approaches?
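Before looking at estimates, it can help to plot the three response functions at hand-picked parameter values (a base-R sketch, not the ENEM analysis itself):

```r
# Response functions: the 1PL fixes a = 1 and c = 0, the 2PL frees the
# discrimination a, and the 3PL adds a lower asymptote (guessing) c.
p1 <- function(th, b)       plogis(th - b)
p2 <- function(th, a, b)    plogis(a * (th - b))
p3 <- function(th, a, b, c) c + (1 - c) * plogis(a * (th - b))

th <- seq(-4, 4, by = 0.1)
plot(th, p1(th, 0), type = "l", ylim = c(0, 1),
     xlab = "theta", ylab = "P(correct)")
lines(th, p2(th, 2, 0), lty = 2)        # steeper at b
lines(th, p3(th, 2, 0, 0.2), lty = 3)   # note the floor at 0.2
```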
### Predictions
One of the most powerful ideas (in my view) related to thinking about the performance of your models is to look at predictions ([This paper](https://pubmed.ncbi.nlm.nih.gov/28841086/) is really powerful on this point). We’re going to look at some predictive comparisons in an IRT context. Code [here](https://github.com/ben-domingue/252/blob/main/ps4/prediction.R).
- Let’s start by understanding the importance of out-of-sample predictive tests. The idea here is that we want to look at predictions in ‘test’ or ‘hold-out’ data that was not used to generate model estimates (the training data upon which predictions are based) and contrast that with what we get when we look at predictions for ‘overfit’ data (i.e., the test and training data are the same). In particular, we are going to look at the RMSE between the responses and the IRT-based probabilities for the same model applied to out-of-sample versus in-sample predictions. What do we observe in that case? (e.g., what model do we use to generate data? Which predictions are better: in-sample or out-of-sample?)
- Above we are looking at simulated data and seeing RMSEs between observed responses and estimated probabilities of those responses and seeing values above 0.4. This might seem big—these differences have to be less than 1 and we’d be surprised if they were bigger than 0.5—so perhaps we should be concerned? Can you assess how good they would be if we had *perfect prediction*? So, simulate data from the Rasch model and compute the RMSE of differences between the generated responses and the probabilities used to simulate them. How does this compare to what we observed in (a)?
- Let’s now look at two different models applied to empirical data (`gilbert_meta_2`, where we won’t know the truth). Does the 1pl or 2pl seem to fit better out-of-sample?
- Let’s bring in one more table (`gilbert_meta_14`) and again compare 1pl and 2pl predictions. How does the change you observe here compare to that in (b)? That is, if someone said “for which of these datasets do you see bigger improvements as you go from the 1pl to the 2pl?”, how would you answer? [Hint: In my view, there is one difference (which I’ve maybe tried to point you towards in the code) in the two tables being compared that actually makes this a really challenging question.] Is there anything that gives you pause?
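The perfect-prediction baseline from the second bullet can be sketched directly: simulate Rasch responses and compute the RMSE against the exact probabilities that generated them.

```r
set.seed(3)
# Rasch data: 2000 persons, 20 items
n <- 2000; k <- 20
theta <- rnorm(n)
b <- seq(-2, 2, length.out = k)
p <- plogis(outer(theta, b, "-"))             # true response probabilities
y <- matrix(rbinom(length(p), 1, p), n, k)    # generated responses
rmse <- sqrt(mean((y - p)^2))
rmse
# A Bernoulli(p) response has variance p(1 - p) <= 0.25, so even a *perfect*
# model cannot push this RMSE much below ~0.4 on typical data.
```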
### Incorporating Priors
Priors can be very useful for IRT item parameters especially if you have smaller samples. We aren’t going to dive all the way into Bayesian models; rather, I am hoping to give you a conceptual guide to priors. For our purposes, priors are a way of ensuring good behavior in our parameter estimates. Estimates (the posterior distribution in the below figure) are going to be a mixture of the likelihood and the prior. When we’re worried the likelihood might be ill-behaved we can induce some better behavior via the posterior. We’ll often have poorly behaved guessing and discrimination parameters so we’re going to introduce some priors on them.
An example of how to impose priors on the discrimination and guessing parameters can be found [here](https://github.com/ben-domingue/252/blob/main/ps4/priors.R). From whatever perspective you like (parameter estimates, predictions as in #2; feel free to find some different data that lead to more dramatic differences!), consider the implications of adding priors to analysis of some item response data. I have added a small example illustrating how item parameter estimates differ with and without priors. Feel free to build on this in any way you like—for example, by examining whether including priors improves predictive performance.
### Measurement invariance and item text
We are adding item text to the IRW. We are going to take advantage of that now to probe for DIF. We are going to look at data from the `gilbert_meta_11` RCT and assess whether there is any DIF in the items as a function of respondent features.
- First look at [this code](https://github.com/ben-domingue/252/blob/main/ps5/dif_itemtext.R) and examine the items (look for this line: `unique(df$item_text)`). We are going to look at DIF as a function of gender, race, SES and IEP status. Do you anticipate observing DIF on any of these dimensions given the nature of these items?
- I’ve provided code to do a logistic regression DIF analysis starting with gender. What do you observe? Can you try similar analyses with the other covariates (just pass different arguments to `varname`)? What does this make you think of the measure used here as an outcome? Note that these analyses rely upon the industry-standard A/B/C scale that is basically green/yellow/red with respect to DIF.
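The logistic-regression DIF logic can be sketched with simulated data (here `group` plays the role of the covariate passed to `varname`, and `ss` stands in for the matching score):

```r
set.seed(4)
n <- 1500
group <- rbinom(n, 1, 0.5)
theta <- rnorm(n)
ss    <- theta + rnorm(n, 0, 0.5)                   # noisy matching score
y     <- rbinom(n, 1, plogis(theta - 0.5 * group))  # built-in uniform DIF

m_base <- glm(y ~ ss,         family = binomial)
m_unif <- glm(y ~ ss + group, family = binomial)    # adds group main effect
m_nonu <- glm(y ~ ss * group, family = binomial)    # adds group x score

anova(m_base, m_unif, test = "LRT")   # test for uniform DIF
anova(m_unif, m_nonu, test = "LRT")   # test for non-uniform DIF
```

The A/B/C flagging conventions layer effect-size thresholds on top of these tests; the sketch only shows the nested-model machinery.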
### Measurement invariance and exogenous shocks
I want to discuss measurement invariance as a function of respondent experience.
- I started thinking about this idea [here](https://academic.oup.com/psychsocgerontology/article/76/6/1231/5815719). Using data from the longitudinal HRS, we were studying the very well-known fact that spousal loss leads to a short-term rise in depression. You can see that in Figure 1. So spousal loss leads to an increase in your expected 8-item CESD score (see Table 2 [here](https://hrsonline.isr.umich.edu/sitedocs/userg/dr-005.pdf#page=44.16)) over roughly first 10 months but this settles back to where you’d typically been after the first two years. As I dug deeper into the data, I realized that it wasn’t all depression indicators that were affected but some specific ones. In the supplement we have a figure (S7) which shows that it is really the lonely and sad indicators which are affected. What do you think of this? What does this suggest about the nature of how we are measuring depression in the context of spousal loss?
- Let’s now think about a different 'experience': let’s look at being in the treatment condition. We’re going to work with data from `gilbert_meta_20`. Given that this is data from an RCT, let’s start by estimating the treatment effect. Compute the estimated treatment effect (in raw and effect size units) for both the sum score and an IRT-based ability estimate (i.e., theta). What do you observe?
- Let’s now consider treatment status as a grouping indicator for DIF. We can do a quick analysis [here](https://github.com/ben-domingue/252/blob/main/ps5/rctdif.R). Using the JG grades, several of the items exhibit DIF. What do you think this may tell us about the measure and/or the intervention?
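For the treatment-effect bullet, a minimal sketch of raw versus effect-size units (simulated outcome; standardizing by the control-group SD is one common convention, not necessarily the one used in the linked code):

```r
set.seed(5)
n <- 1000
treat <- rbinom(n, 1, 0.5)
y <- 0.3 * treat + rnorm(n)            # toy outcome with a 0.3 raw effect
raw <- unname(coef(lm(y ~ treat))["treat"])
es  <- raw / sd(y[treat == 0])         # standardize by control-group SD
c(raw = raw, effect_size = es)
```

The same two lines apply whether `y` is a sum score or an IRT theta estimate, which is what makes the comparison in the bullet possible.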
### Grit and Growth
We’re going to look at two of the most popular constructs related to academic affect from the last few decades in psychology: [grit](https://psycnet.apa.org/fulltext/2007-07951-009.html) and [growth](https://www.nature.com/articles/s41586-019-1466-y). These constructs have been very impactful (every time I go to my kids’ classrooms I’ll see a poster about one or the other of them). In the IRW, we have data from a [survey](https://ldbase.org/datasets/3f7033dd-47a5-4ef8-aeab-08a559dce0d1) that included measures of both of these for kids. I want to use them to further explore multidimensional models. [Note: This is meant to be a fun look at two serious psychological constructs.]
- Look at the items (see `grit_items` and `dweck_items` in [code](https://github.com/ben-domingue/252/blob/main/ps6/personality.R)). Conceptually do you think these two things will be independent dimensions or capture similar things?
- To prepare for analysis, I had to do two things:
- Note that I have to reverse some responses. I did this by consulting the items. The one I can’t figure out is `qgrit11`. This feels to me like it should be inverted but the `psych` package thinks otherwise. Am I missing something? [This isn’t a trick question but rather a check for whether my brain is turning into mush.]
- I dichotomized the responses here in a way that I think is sensible. Polytomous models could be used with the polytomous responses but one thing we can observe here is some loss of information when we dichotomize (i.e., what happens to the alpha values before and after?).
- Look at the coefficients from `m2`. What do these suggest about how well the two dimensions cleave the two constructs?
- If we look at `fscores(m2)`, what do the vectors of theta estimates represent? If you look at `coef(m2)`, you’ll note that these dimensions are uncorrelated (while there was some correlation between the sum scores).
- We can also estimate a model where the slopes are restricted to be 0 when the construct is not being measured, see `m3`. How would you think about the fit of all three models (I also added a 1d-2PL, see `m1`)? What does this suggest to you about these constructs and the scales being used to measure them?
### What makes an 'item'?
There is a tradition in cognitive psychology of having respondents complete ‘tasks’ multiple times, and there is (depending on the task) either subtle variation between tasks (e.g., changing the shape or color of some stimulus) or continuous variation (in a mental rotation task, we can imagine having the shape rotated at any angle between 0 and 360 degrees). With such ‘task’ or ‘trial’ data, it can be difficult (if not impossible) to apply IRT techniques given that it isn’t always clear how to define distinct items. We’re going to see what we can do!
Let’s explore this issue in the context of the hearts and flower task. The hearts and flowers task is a widely-used measure of [executive functioning](https://www.annualreviews.org/content/journals/10.1146/annurev-psych-113011-143750) (you can experiment with this task [here](https://cognitionlab.com/project/hearts-and-flowers/)). The gist of the task is that a shape is shown on one side of the screen and, depending on whether the shape is a heart or flower, the respondent has to touch the same (heart) or opposite side (flower). The [SPARK lab](https://sparklab.stanford.edu/) has made some data available for this task (we used this for a [datathon](https://itemresponsewarehouse.org/training.html) last summer). Let’s go!
- We are going to focus on the mixed block (where hearts and flowers tasks are intermingled). Would you anticipate differences in overall accuracy between the stimulus that is ‘same side’ versus the one that is ‘opposite side’? Do your expected differences manifest in the data?
- We’re going to first estimate a Rasch model for this data with a basic definition of item: the item will be defined by the stimulus shape and the side it is shown. We have one challenge in that each person can see an ‘item’ many times. To deal with this, we are going to use the ideas suggested here and use `lmer` to estimate the Rasch model. The basic idea is straightforward. We can estimate a standard Rasch model as:
`m <- glmer(resp ~ (1 | id) + 0 + item, data = df, family = 'binomial')`
This will produce a model with item *easiness* parameters (`fixef(m)`) and person ability (`ranef(m)`). Using [this code](https://github.com/ben-domingue/252/blob/main/ps6/hf.R), can you interpret the item easiness parameters? I’ve also computed abilities separately for fall and spring for students that are undergoing developmental change related to EF. Does the patterning of these parameter estimates you are seeing make sense?
- Note that our first definition of an item doesn’t utilize information about whether the stimulus is changing. This change is critical and will likely impact response probabilities. In the code, I include it via the `switch` indicator. We are now estimating an ‘explanatory’ item response model in that we’re adding in information beyond simply item and id classifications. What does the estimate for switch suggest? What do you think of our original naive definition of an item here? *Note:* In the IRT sense, this is a violation of independence for the naive definition of item.
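A base-R stand-in for the explanatory model just described (simulated data; person *fixed* effects replace `glmer`’s random effects so the sketch runs without lme4; the `switch` coefficient is the quantity of interest):

```r
set.seed(6)
# Toy mixed-block data: 100 kids, 40 trials each; 'switch' flags trials
# where the stimulus type changed from the previous trial.
n <- 100; trials <- 40
d <- expand.grid(id = factor(1:n), trial = 1:trials)
d$item <- factor(sample(c("heart_left", "heart_right",
                          "flower_left", "flower_right"),
                        nrow(d), replace = TRUE))
d$switch <- rbinom(nrow(d), 1, 0.5)
abil <- rnorm(n)
# Flowers (opposite-side rule) harder; switch trials harder still
p <- plogis(abil[as.integer(d$id)] + 1 -
            0.5 * grepl("flower", d$item) - 0.8 * d$switch)
d$resp <- rbinom(nrow(d), 1, p)

m <- glm(resp ~ switch + item + id, data = d, family = binomial)
coef(m)["switch"]   # negative: accuracy drops on switch trials
```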
Aside: I think this flexible view of items as less rigid can be quite valuable. As another example of where this might come into play, suppose we wanted to model shooting accuracy in basketball. No two shots are the same so an ‘item’ is impossible to define. But, [I think we can parametrize the key features of a ‘trial’](https://github.com/ben-domingue/irw/blob/main/data/trials/nba_shots.R) (e.g., court location, the game clock, etc) and include a person-level ‘ability’ (we had better see S. Curry near the top!) and basically take something akin to an item response modeling view of that kind of scenario.
### Explanatory IRT
We’re going to build on ideas from [Josh Gilbert](https://onlinelibrary.wiley.com/doi/10.1002/pam.70025), see [here](https://github.com/ben-domingue/252/blob/main/ps7/ilhte.R).
- We have a pretest score that we can include. Conceptually, what is the point of including this covariate if we assume that compliance with randomization was effective? Empirically, where do we observe this?
- Can you estimate a first-order treatment effect using a post-test that is a sum score and one that is a theta-based ability estimate (`m1`). [Note that I am not computing the sum score based effect in a fashion parallel to what I did for m1. You’ll also need to standardize both effects.] How similar are these in effect size units? What would you say about the efficacy of this intervention? *Note:* You could compare findings from this quick calculation to those of the [original study](https://psycnet.apa.org/record/2022-69392-001).
- Consider the raw coefficients on `treat` from `m0` and `m1a`. What scale are they each on? How are these scales similar? How are they different?
- If we look at `m2` and `m3a`, we might think this suggests that there is treatment effect heterogeneity such that the treatment is more effective for those who are higher-ability at baseline. What would you say about this inference? How might the story be in fact more complicated? What evidence do we have on this point?
### A first analysis of polytomous response data
We can use data from a [PROM](https://commonfund.nih.gov/promis) related to chronic pain. In [this code](https://github.com/ben-domingue/252/blob/main/c9/polyexample.R) we will consider the graded response model, the generalized partial credit model, and the sequential ratio model applied to this data. How would you describe the differences between the category response function and the expected response function for the three models across this data? *Note:* An alternative view of how these models have very different implications can be found [here](https://link.springer.com/article/10.1007/s41237-025-00262-9).
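The category response functions for the GRM and the GPCM can be computed directly from the model formulas; this sketch uses hand-picked parameters (one 4-category item with a = 1.5 and thresholds -1, 0, 1), not estimates from the PROM data:

```r
th <- seq(-4, 4, by = 0.1); a <- 1.5; b <- c(-1, 0, 1)

# GRM: P(X >= k) are cumulative logistic curves; category probabilities
# are differences of adjacent curves.
Pstar <- cbind(1, sapply(b, function(bk) plogis(a * (th - bk))), 0)
grm   <- Pstar[, 1:4] - Pstar[, 2:5]

# GPCM: P(X = k) proportional to exp(sum_{v<=k} a*(theta - b_v)).
gpcm_num <- sapply(0:3, function(k)
  if (k == 0) rep(1, length(th)) else exp(a * (k * th - sum(b[1:k]))))
gpcm <- gpcm_num / rowSums(gpcm_num)

matplot(th, grm,  type = "l", xlab = "theta", ylab = "P(category), GRM")
matplot(th, gpcm, type = "l", xlab = "theta", ylab = "P(category), GPCM")
```

The sequential (ratio) model would need a third construction (products of conditional step probabilities); comparing these curves panel by panel is one way into the question above.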
### Rating Scale Framework
I’m going to ask you to explore a new framework for modeling polytomous responses: the rating scale framework (see `gpcmIRT`/`rsm` and `grsm` [here](https://www.rdocumentation.org/packages/mirt/versions/1.43/topics/mirt)). It constrains thresholds (from either the GPCM or GRM models) to be equal across items. It is thought to be useful when respondent sample sizes are small and the response structure is similar across items, since it greatly reduces the number of parameters that need to be estimated (one location parameter for each item plus a single shared set of K-1 thresholds, instead of K-1 threshold parameters for each item). Let’s explore how it works in [this code](https://github.com/ben-domingue/252/blob/main/ps8/ratingscale.R).
- What do you notice about this model? Focus your attention in particular on the `b1`-`b4` and `c` parameters.
- Please describe the key assumption in this model (see the `gpcmIRT` and `rsm` entries in the ‘IRT models’ section of the help page [here](https://www.rdocumentation.org/packages/mirt/versions/1.41/topics/mirt)).
- How would you think about the choice between the RSM and the PCM in general? This is a test with relatively few items and a strong response structure. How might you anticipate a test with more items (say 50 as opposed to 10) showing sensitivity to the difference between the RSM and the PCM were we to reduce the number of respondents?
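The parameter-count arithmetic behind the RSM’s appeal, as a quick sketch (k items, K response categories; counts cover item-side parameters only):

```r
# PCM: K-1 thresholds per item; RSM: one location per item plus a single
# shared set of K-1 thresholds.
n_pcm <- function(k, K) k * (K - 1)
n_rsm <- function(k, K) k + (K - 1)
rbind(ten_items   = c(pcm = n_pcm(10, 5), rsm = n_rsm(10, 5)),   #  40 vs 14
      fifty_items = c(pcm = n_pcm(50, 5), rsm = n_rsm(50, 5)))   # 200 vs 54
```

The gap grows linearly in the number of items, which is why the RSM vs. PCM choice bites harder on longer tests with few respondents.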
### Fast responding
Let’s use IRTree models to look at whether overly fast responses in a conscientiousness dataset are also (shockingly!) related to the trait. Code is [here](https://github.com/ben-domingue/252/blob/main/ps8/fast.R). Note that I have automated reversing items (given that they may not all be coded such that 5s indicate higher levels of conscientiousness).
- What is the key parameter estimate that gives us that? What do you make of it?
- How sensitive is that result to our choice of the 2s threshold?
### Evaluating the speed-accuracy tradeoff
We are going to do a quick evaluation of the speed-accuracy tradeoff using data from a few assessments. Code is [here](https://github.com/ben-domingue/252/blob/main/ps9/sat.R). We will use data from the ROAR lexical decision task and PISA 2018 reading data from Spain.
- These are very different types of tasks! Please describe the differences that you observe in the response time distributions across the tasks.
- If you look at `m.roar` and `m.pisa`, you will get two estimates. Can you interpret these estimates? How do they have you feeling about the SAT in these two cases?
*Aside:* It is not straightforward to interpret these things. I always think it is easier to focus on the expected responses that we can get based on these estimates rather than the estimates themselves in such complex modeling scenarios. I hope this helps you see why.
- Talk about the fact that we are assuming a linear relationship between log(rt) and accuracy.
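One way to probe that linearity assumption is a binned empirical-logit plot. The data here are simulated, so the relationship is linear by construction; curvature in a plot like this for real data would argue for a more flexible speed term (e.g., a spline in log(rt)):

```r
set.seed(7)
n <- 5000
lrt <- rnorm(n, 0, 0.5)          # log response time
p   <- plogis(0.5 + lrt)         # faster (lower log-rt) -> less accurate
y   <- rbinom(n, 1, p)

# Bin log(rt) into deciles and plot the empirical logit of accuracy per bin
bins <- cut(lrt, quantile(lrt, 0:10 / 10), include.lowest = TRUE)
plot(tapply(lrt, bins, mean), qlogis(tapply(y, bins, mean)),
     xlab = "mean log(rt) within bin", ylab = "empirical logit of accuracy")

m <- glm(y ~ lrt, family = binomial)
coef(m)["lrt"]   # the linear speed-accuracy slope being assumed
```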
### Dominant or Unfolding?
We’re going to dive into the deep end and try to figure out what to do with respect to choosing between dominant or ideal point processes for a bunch of items. In [this code](https://github.com/ben-domingue/252/blob/main/ps9/unfold.R), I have fit simple models of each type for the data in question and am showing their functioning. Your job is to try to determine (for each item), which model you think is appropriate. Some things you can consider:
- The items themselves (see `ANDRICH` [here](https://cran.r-project.org/web/packages/mudfold/mudfold.pdf#page=3.08)): Does the nature of the items give you any indication as to which you should fit?
- You can examine the ‘fit’ of the model (with the big caveat that we don’t have much data).
The idea here is to get you thinking about this. There are approaches that have been developed for working through this decision but I think it is always fun to struggle a little with a problem before looking at the solution! [Note: There is not a ton of data here, so in my view that limits any certainty about the right answer to this question!]
### CDM & what makes an item
Let’s examine what we get from a Q matrix using the `frac20` data (described [here](https://www.jstatsoft.org/article/view/v093i14)), a standard dataset used in the cognitive diagnostic modeling literature. Here we have math problems involving fractions which are hypothesized to require some combination of 8 specific skills. The linkage of the items and skills is contained in the ‘Q matrix’. Let’s use the code [here](https://github.com/ben-domingue/252/blob/main/ps7/cdm.R) to explore three ways of modeling the responses in that dataset!
- Let’s first fit a lme4-style Rasch model `m1`. We can also fit a conventional Rasch model via mirt (`m2`). What do we get when we compare item parameters from such analyses?
- Let’s now contrast these findings with an analysis wherein we think of the individual items as being potentially ‘exchangeable’ with any other item that loads on the same skills. I am now asserting that the only thing about items that matters is the skills they load on. This is the `m2` object. Conceptually, we can imagine having a bunch of different items that have the same skill loadings. From the perspective of this model, we would basically be observing different responses to common stimuli. Such reasoning is a common feature of psychological studies where, for example, psychologists will ask people to do [mental rotation tasks](https://rdrr.io/cran/diffIRT/man/rotation.html) wherein they are just varying the angle (there is no good definition of items here if angle is being varied continuously). Do you think this conceptualization of the items as interchangeable ‘tasks’ in this way is appropriate here? Do you have any empirical results to support/confirm your answer to the above?
- Let’s now estimate a CDM (see `m3`). In particular, we’re going to use the DINA approach. This approach is discussed in the above paper, the core idea is captured below (see page 4):
$$
\Pr(\alpha^*_{lj}) = \delta_{j0} + \delta_{j12\ldots K^*_j} \prod_{k=1}^{K^*_j} \alpha_{lk}
$$
This equation asserts that the probability of a correct response is $\delta_{j0}$ if you do NOT have ALL of the requisite skills (where the Q matrix tells us what these are for each item) and $\delta_{j0}+\delta_{j12\ldots K^*_j}$ if you do have ALL the requisite skills (the A in DINA is for AND; there is also a DINO model that makes a different assertion about how skills combine to generate correct responses). Stop for a second: we now have IRT results and CDM results. How might we expect the results (at either the person- or item-level) to compare? How might you examine this? In the code I do some basic analyses of the output of estimating this model. Do these results behave as you may expect vis-a-vis the IRT results? Note that I am doing a few ad-hoc things. There is no standardized way of approaching comparisons of CDM and IRT outputs. What I am doing makes sense to me but there may be other approaches!
- Can you compare the item response function for the same item in `m2` and `m3`? The CDM approach leads to a qualitatively different type of item response function as compared to what we get from IRT. Can you visualize that?
- The ‘fits’ of the 3 models (`m1 m2 m3`) are challenging to compare. I show how you can compare AICs in the code; according to those, which is optimal? How satisfied are you with this comparison? Could you come up with a plan that you think might yield more concrete insight about differences in prediction quality across these three approaches?
- BONUS: Can you implement your plan for comparing the fit of the three models?
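The DINA rule in the equation above can be computed directly. Here `g` and `s` are toy guess/slip values (with $\delta_{j0}=g$ and $\delta_{j0}+\delta_{j12\ldots K^*_j}=1-s$); the Q-matrix row is made up for illustration:

```r
q_j <- c(1, 1, 0)    # toy Q-matrix row: item j requires skills 1 and 2 (of 3)
g <- 0.2; s <- 0.1   # toy guess and slip values

p_dina <- function(alpha) {
  eta    <- prod(alpha[q_j == 1])   # 1 only if ALL required skills are held
  delta0 <- g                       # baseline: guessing
  delta1 <- (1 - s) - g             # boost for holding all required skills
  delta0 + delta1 * eta
}

p_dina(c(1, 1, 0))   # all required skills: 1 - s = 0.9
p_dina(c(1, 0, 1))   # missing skill 2:     g = 0.2
```

This is why the CDM item response function is a step function over skill profiles rather than a smooth curve over theta, which is the contrast the visualization bullet above is after.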