Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
46 changes: 46 additions & 0 deletions _subfiles/Linear-models-overview/_sec_linreg_diagnostics.qmd
Comment thread
d-morrison marked this conversation as resolved.
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,52 @@ This section is adapted from @dobson4e [§6.2-6.3] and

{{< include _subfiles/Linear-models-overview/_sec-linreg-assumptions.qmd >}}

### Diagnostics for the independence assumption

The independence assumption means that the model errors are independent,
or at least uncorrelated,
across observations after conditioning on the predictors in the model.
Because those errors are unobserved,
we usually assess this assumption using residual-based plots and formal tests.

For data with a natural ordering
(for example, time, spatial location, or clinic visit sequence),
we usually assess independence with both plots and formal tests
[@kutner2005applied, Chapter 12; @chatterjee2015regression, Chapter 6].

:::{#def-independence-diagnostics}
#### Common diagnostics for independence

1. Residuals versus observation order (or time) plots:
patterns, runs, or drifts can indicate dependence
[@kutner2005applied, Chapter 12].

2. Correlograms:
plot the sample autocorrelation function (ACF),
and often the partial autocorrelation function (PACF),
to look for serial structure
[@chatterjee2015regression, Chapter 6].

3. Durbin-Watson test:
tests for first-order serial correlation in regression residuals
[@chatterjee2015regression, Chapter 9; @kutner2005applied, Chapter 12].

4. Breusch-Godfrey test:
tests for higher-order serial correlation,
and is more flexible than Durbin-Watson in many regression settings
[@chatterjee2015regression, Chapter 9; @kutner2005applied, Chapter 12].
:::

{{< include _subfiles/Linear-models-overview/_sec_linreg_independence_math.qmd >}}

{{< include _subfiles/Linear-models-overview/_sec_linreg_independence_examples.qmd >}}

No single diagnostic is definitive.
In practice,
we combine visual and formal diagnostics,
and interpret them in the context of study design
[@kutner2005applied, Chapter 12; @draper2014applied, Chapter 11].

### Direct visualization

::: notes
Expand Down
138 changes: 138 additions & 0 deletions _subfiles/Linear-models-overview/_sec_linreg_independence_examples.qmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,138 @@
:::{#exm-independence-diagnostics-simulated}
#### Simulated example for independence diagnostics

The example below simulates ordered data with positively autocorrelated errors.
That creates a setting where independence should fail.
We generate the error term from an AR(1) process with coefficient 0.7,
matching the notation in the mathematical details section.

```{r}
#| label: exm-independence-sim-data
#| code-fold: false
set.seed(204)

n_obs <- 120
x <- seq(-1, 1, length.out = n_obs)
ar_errors <- as.numeric(arima.sim(model = list(ar = 0.7), n = n_obs, sd = 1))
y <- 2 + 1.5 * x + ar_errors

serial_dep_exm <- tibble::tibble(
time_index = seq_len(n_obs),
x = x,
y = y
)

serial_dep_exm_lm <- lm(y ~ x, data = serial_dep_exm)

summary(serial_dep_exm_lm)
```

Residuals versus order:

```{r}
#| label: fig-independence-resid-order
#| fig-cap: "Residuals versus observation order for simulated data"
serial_dep_exm |>
dplyr::mutate(resid = resid(serial_dep_exm_lm)) |>
ggplot(aes(x = time_index, y = resid)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
geom_line(color = "steelblue") +
theme_classic() +
labs(
x = "Observation order",
y = "Residual"
)
```

Correlogram diagnostics:

```{r}
#| label: fig-independence-correlogram
#| fig-cap: "Residual ACF and PACF for simulated data"
local({
op <- par(no.readonly = TRUE)
on.exit(par(op), add = TRUE)
par(mfrow = c(1, 2))
acf(
resid(serial_dep_exm_lm),
main = "Residual ACF"
)
pacf(
resid(serial_dep_exm_lm),
main = "Residual PACF"
)
})
```

Durbin-Watson and Breusch-Godfrey tests:

```{r}
#| label: exm-independence-formal-tests
#| code-fold: false
lmtest::dwtest(serial_dep_exm_lm, alternative = "two.sided")
lmtest::bgtest(serial_dep_exm_lm, order = 4)
```

In this example,
the residual-order plot and correlogram suggest serial dependence,
and both formal tests reject independence.
:::

:::{#exm-independence-diagnostics-real-data}
#### Real-data examples for independence diagnostics

We can also run the same diagnostics on real ordered data sets.

##### Nile annual river flow

```{r}
#| label: exm-independence-real-nile
#| code-fold: false
nile_df <- tibble::tibble(
year = as.numeric(time(datasets::Nile)),
flow = as.numeric(datasets::Nile)
)

nile_lm <- lm(flow ~ year, data = nile_df)

summary(nile_lm)
lmtest::dwtest(nile_lm, alternative = "two.sided")
lmtest::bgtest(nile_lm, order = 4)
```

```{r}
#| label: fig-independence-real-nile-resid-order
#| fig-cap: "Residuals versus year for Nile flow model"
nile_df |>
dplyr::mutate(resid = resid(nile_lm)) |>
ggplot(aes(x = year, y = resid)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
geom_line(color = "steelblue") +
theme_classic() +
labs(
x = "Year",
y = "Residual"
)
```

##### Mauna Loa atmospheric carbon dioxide (`co2`)

```{r}
#| label: exm-independence-real-co2
#| code-fold: false
co2_df <- tibble::tibble(
decimal_year = as.numeric(time(datasets::co2)),
co2_ppm = as.numeric(datasets::co2)
)

co2_lm <- lm(co2_ppm ~ decimal_year, data = co2_df)

summary(co2_lm)
lmtest::dwtest(co2_lm, alternative = "two.sided")
lmtest::bgtest(co2_lm, order = 12)
```

For both real-data examples,
the diagnostics indicate residual dependence,
consistent with their time-ordered structure.
:::
59 changes: 59 additions & 0 deletions _subfiles/Linear-models-overview/_sec_linreg_independence_math.qmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
#### Mathematical details for Durbin-Watson and Breusch-Godfrey

Suppose the observations have a meaningful order,
indexed by $t = 1, \ldots, n$.
Let $e_t$ denote the OLS residual from a fitted regression model.

:::{#def-durbin-watson-diagnostic}
#### Durbin-Watson diagnostic

The test statistic is
$$
\ba
d
&= \frac{\sum_{t = 2}^{n} (e_t - e_{t-1})^2}
{\sum_{t = 1}^{n} e_t^2}
\ea
$$
which is approximately
$2(1 - \hat r_1)$,
where $\hat r_1$ is the sample lag-1 residual autocorrelation
[@kutner2005applied, Chapter 12; @chatterjee2015regression, Chapter 6].
This approximation is most useful in standard large-sample linear-model settings,
and is less straightforward in models with lagged outcomes
[@kutner2005applied, Chapter 12].

The null and alternatives are typically framed through
an AR(1) error model:
$$
\eps_t = \rho \eps_{t-1} + u_t.
$$
Here,
$\eps_t$ is the autocorrelated regression error process,
and $u_t$ is a white-noise innovation term.
The null is $H_0: \rho = 0$,
and alternatives can be one-sided or two-sided,
depending on whether we suspect positive,
negative,
or any serial correlation
[@kutner2005applied, Chapter 12].
:::

:::{#def-breusch-godfrey-diagnostic}
#### Breusch-Godfrey diagnostic

For a test of order $p$,
we run an auxiliary regression of residuals on:
the original regressors,
and lagged residuals $e_{t-1}, \ldots, e_{t-p}$.
If $R^2_{\text{aux}}$ is from that auxiliary model,
the LM statistic is
$$
\text{LM} = (n - p)R^2_{\text{aux}},
$$
and under
$H_0: \rho_1 = \cdots = \rho_p = 0$,
it is asymptotically
$\chi^2_p$
[@chatterjee2015regression, Chapter 6; @draper2014applied, Chapter 11].
:::
Loading