diff --git a/_subfiles/Linear-models-overview/_sec_linreg_diagnostics.qmd b/_subfiles/Linear-models-overview/_sec_linreg_diagnostics.qmd index d64f93e5d..af4c45d9a 100644 --- a/_subfiles/Linear-models-overview/_sec_linreg_diagnostics.qmd +++ b/_subfiles/Linear-models-overview/_sec_linreg_diagnostics.qmd @@ -8,6 +8,52 @@ This section is adapted from @dobson4e [ยง6.2-6.3] and {{< include _subfiles/Linear-models-overview/_sec-linreg-assumptions.qmd >}} +### Diagnostics for the independence assumption + +The independence assumption means that the model errors are independent, +or at least uncorrelated, +across observations after conditioning on the predictors in the model. +Because those errors are unobserved, +we usually assess this assumption using residual-based plots and formal tests. + +For data with a natural ordering +(for example, time, spatial location, or clinic visit sequence), +we usually assess independence with both plots and formal tests +[@kutner2005applied, Chapter 12; @chatterjee2015regression, Chapter 6]. + +:::{#def-independence-diagnostics} +#### Common diagnostics for independence + +1. Residuals versus observation order (or time) plots: +patterns, runs, or drifts can indicate dependence +[@kutner2005applied, Chapter 12]. + +2. Correlograms: +plot the sample autocorrelation function (ACF), +and often the partial autocorrelation function (PACF), +to look for serial structure +[@chatterjee2015regression, Chapter 6]. + +3. Durbin-Watson test: +tests for first-order serial correlation in regression residuals +[@chatterjee2015regression, Chapter 9; @kutner2005applied, Chapter 12]. + +4. Breusch-Godfrey test: +tests for higher-order serial correlation, +and is more flexible than Durbin-Watson in many regression settings +[@chatterjee2015regression, Chapter 9; @kutner2005applied, Chapter 12]. +::: + +{{< include _subfiles/Linear-models-overview/_sec_linreg_independence_math.qmd >}} + +{{< include _subfiles/Linear-models-overview/_sec_linreg_independence_examples.qmd >}} + +No single diagnostic is definitive. +In practice, +we combine visual and formal diagnostics, +and interpret them in the context of study design +[@kutner2005applied, Chapter 12; @draper2014applied, Chapter 11]. + ### Direct visualization ::: notes diff --git a/_subfiles/Linear-models-overview/_sec_linreg_independence_examples.qmd b/_subfiles/Linear-models-overview/_sec_linreg_independence_examples.qmd new file mode 100644 index 000000000..15e053d3f --- /dev/null +++ b/_subfiles/Linear-models-overview/_sec_linreg_independence_examples.qmd @@ -0,0 +1,138 @@ +:::{#exm-independence-diagnostics-simulated} +#### Simulated example for independence diagnostics + +The example below simulates ordered data with positively autocorrelated errors. +That creates a setting where independence should fail. +We generate the error term from an AR(1) process with coefficient 0.7, +matching the notation in the mathematical details section. + +```{r} +#| label: exm-independence-sim-data +#| code-fold: false +set.seed(204) + +n_obs <- 120 +x <- seq(-1, 1, length.out = n_obs) +ar_errors <- as.numeric(arima.sim(model = list(ar = 0.7), n = n_obs, sd = 1)) +y <- 2 + 1.5 * x + ar_errors + +serial_dep_exm <- tibble::tibble( + time_index = seq_len(n_obs), + x = x, + y = y +) + +serial_dep_exm_lm <- lm(y ~ x, data = serial_dep_exm) + +summary(serial_dep_exm_lm) +``` + +Residuals versus order: + +```{r} +#| label: fig-independence-resid-order +#| fig-cap: "Residuals versus observation order for simulated data" +serial_dep_exm |> + dplyr::mutate(resid = resid(serial_dep_exm_lm)) |> + ggplot(aes(x = time_index, y = resid)) + + geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") + + geom_line(color = "steelblue") + + theme_classic() + + labs( + x = "Observation order", + y = "Residual" + ) +``` + +Correlogram diagnostics: + +```{r} +#| label: fig-independence-correlogram +#| fig-cap: "Residual ACF and PACF for simulated data" +local({ + op <- par(no.readonly = TRUE) + on.exit(par(op), add = TRUE) + par(mfrow = c(1, 2)) + acf( + resid(serial_dep_exm_lm), + main = "Residual ACF" + ) + pacf( + resid(serial_dep_exm_lm), + main = "Residual PACF" + ) +}) +``` + +Durbin-Watson and Breusch-Godfrey tests: + +```{r} +#| label: exm-independence-formal-tests +#| code-fold: false +lmtest::dwtest(serial_dep_exm_lm, alternative = "two.sided") +lmtest::bgtest(serial_dep_exm_lm, order = 4) +``` + +In this example, +the residual-order plot and correlogram suggest serial dependence, +and both formal tests reject independence. +::: + +:::{#exm-independence-diagnostics-real-data} +#### Real-data examples for independence diagnostics + +We can also run the same diagnostics on real ordered data sets. + +##### Nile annual river flow + +```{r} +#| label: exm-independence-real-nile +#| code-fold: false +nile_df <- tibble::tibble( + year = as.numeric(time(datasets::Nile)), + flow = as.numeric(datasets::Nile) +) + +nile_lm <- lm(flow ~ year, data = nile_df) + +summary(nile_lm) +lmtest::dwtest(nile_lm, alternative = "two.sided") +lmtest::bgtest(nile_lm, order = 4) +``` + +```{r} +#| label: fig-independence-real-nile-resid-order +#| fig-cap: "Residuals versus year for Nile flow model" +nile_df |> + dplyr::mutate(resid = resid(nile_lm)) |> + ggplot(aes(x = year, y = resid)) + + geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") + + geom_line(color = "steelblue") + + theme_classic() + + labs( + x = "Year", + y = "Residual" + ) +``` + +##### Mauna Loa atmospheric carbon dioxide (`co2`) + +```{r} +#| label: exm-independence-real-co2 +#| code-fold: false +co2_df <- tibble::tibble( + decimal_year = as.numeric(time(datasets::co2)), + co2_ppm = as.numeric(datasets::co2) +) + +co2_lm <- lm(co2_ppm ~ decimal_year, data = co2_df) + +summary(co2_lm) +lmtest::dwtest(co2_lm, alternative = "two.sided") +lmtest::bgtest(co2_lm, order = 12) +``` + +For both real-data examples, +the diagnostics indicate residual dependence, +consistent with their time-ordered structure. +::: diff --git a/_subfiles/Linear-models-overview/_sec_linreg_independence_math.qmd b/_subfiles/Linear-models-overview/_sec_linreg_independence_math.qmd new file mode 100644 index 000000000..100b73e01 --- /dev/null +++ b/_subfiles/Linear-models-overview/_sec_linreg_independence_math.qmd @@ -0,0 +1,59 @@ +#### Mathematical details for Durbin-Watson and Breusch-Godfrey + +Suppose the observations have a meaningful order, +indexed by $t = 1, \ldots, n$. +Let $e_t$ denote the OLS residual from a fitted regression model. + +:::{#def-durbin-watson-diagnostic} +#### Durbin-Watson diagnostic + +The test statistic is +$$ +\ba +d +&= \frac{\sum_{t = 2}^{n} (e_t - e_{t-1})^2} +{\sum_{t = 1}^{n} e_t^2} +\ea +$$ +which is approximately +$2(1 - \hat r_1)$, +where $\hat r_1$ is the sample lag-1 residual autocorrelation +[@kutner2005applied, Chapter 12; @chatterjee2015regression, Chapter 6]. +This approximation is most useful in standard large-sample linear-model settings, +and is less straightforward in models with lagged outcomes +[@kutner2005applied, Chapter 12]. + +The null and alternatives are typically framed through +an AR(1) error model: +$$ +\eps_t = \rho \eps_{t-1} + u_t. +$$ +Here, +$\eps_t$ is the autocorrelated regression error process, +and $u_t$ is a white-noise innovation term. +The null is $H_0: \rho = 0$, +and alternatives can be one-sided or two-sided, +depending on whether we suspect positive, +negative, +or any serial correlation +[@kutner2005applied, Chapter 12]. +::: + +:::{#def-breusch-godfrey-diagnostic} +#### Breusch-Godfrey diagnostic + +For a test of order $p$, +we run an auxiliary regression of residuals on: +the original regressors, +and lagged residuals $e_{t-1}, \ldots, e_{t-p}$. +If $R^2_{\text{aux}}$ is from that auxiliary model, +the LM statistic is +$$ +\text{LM} = (n - p)R^2_{\text{aux}}, +$$ +and under +$H_0: \rho_1 = \cdots = \rho_p = 0$, +it is asymptotically +$\chi^2_p$ +[@chatterjee2015regression, Chapter 6; @draper2014applied, Chapter 11]. +:::