From bf8a541a380db92297bfe4d1da3d2b8c025b0e57 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Thu, 16 Apr 2026 23:42:21 +0000 Subject: [PATCH 01/28] Initial plan From a6ec77a310474160bf7481410cf73f219305b584 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Thu, 16 Apr 2026 23:59:47 +0000 Subject: [PATCH 02/28] fix: clarify HERS causal goal and confounder definition in predictor selection Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/5c70586d-3629-431b-bcd2-9d1bdee7a5fc Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- _sec-pred-sel-dag.qmd | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/_sec-pred-sel-dag.qmd b/_sec-pred-sel-dag.qmd index a4b722fc9..c1fe2ca0b 100644 --- a/_sec-pred-sel-dag.qmd +++ b/_sec-pred-sel-dag.qmd @@ -28,7 +28,7 @@ when estimating the total causal effect. A **confounder** of the relationship between an exposure $X$ and outcome $Y$ is a variable $C$ that: -1. Is associated with the exposure $X$. +1. Is a cause of the exposure $X$. 2. Is a cause of the outcome $Y$. 3. Is not on the causal pathway from $X$ to $Y$. @@ -88,9 +88,9 @@ The **primary outcome** was nonfatal myocardial infarction or CHD death. @fig-hers-dag shows a DAG for the HERS study, representing the assumed causal structure among key variables. -The main causal question is: - -> What is the effect of hormone therapy (HT) on recurrent CHD events? +The main goal of the HERS study was to estimate +the causal effect of hormone therapy (HT) +on recurrent CHD event risk. 
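The confounder definition revised in this patch can be illustrated with a small simulation (an editorial sketch, not part of the patch; all variable names and effect sizes are invented). A common cause `C` of the exposure `X` and the outcome `Y` induces a spurious `X`–`Y` association that adjustment removes:

```r
# Illustrative sketch: C is a common cause of exposure X and outcome Y.
# The true effect of X on Y is zero, so any unadjusted association
# between them is pure confounding bias.
set.seed(1)
n <- 10000
C <- rnorm(n)
X <- rbinom(n, 1, plogis(C)) # C causes the exposure
Y <- 2 * C + rnorm(n)        # C causes the outcome; X has no direct effect

unadjusted <- unname(coef(lm(Y ~ X))["X"])     # biased away from 0
adjusted   <- unname(coef(lm(Y ~ X + C))["X"]) # approximately 0
```

With this setup the unadjusted coefficient is substantially positive while the adjusted one is near zero, matching the revised definition's requirement that a confounder be a common cause of both variables.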
::: {#fig-hers-dag} From b464427ddc6c1fc45a692cc6b09b5eecde7ca94a Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 17 Apr 2026 00:59:32 +0000 Subject: [PATCH 03/28] Refine WCGS DAG assumptions and simplify confounder wording Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/1b76873b-23d9-43b3-bfd3-738cff3075fe Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- _sec-pred-sel-dag.qmd | 26 ++++++++++++++++++-------- 1 file changed, 18 insertions(+), 8 deletions(-) diff --git a/_sec-pred-sel-dag.qmd b/_sec-pred-sel-dag.qmd index 4e0ab8123..03a72fbb2 100644 --- a/_sec-pred-sel-dag.qmd +++ b/_sec-pred-sel-dag.qmd @@ -27,7 +27,11 @@ when estimating the total causal effect. A **confounder** of the relationship between an exposure $X$ and outcome $Y$ is a variable $C$ -that is a cause of both $X$ and $Y$. +that is a common cause of both $X$ and $Y$. +By contrast, +a mediator lies on the causal path +from $X$ to $Y$ +and is not a common cause of $X$ and $Y$. Adjusting for confounders removes confounding bias. ::: @@ -353,20 +357,26 @@ Directed Acyclic Graph (DAG) for the WCGS study. In @fig-wcgs-dag, `age` is a **confounder**: -it has causal paths to both the exposure (`TypeA`) +it has direct causal arrows to both the exposure (`TypeA`) and the outcome (`CHD`). -We model `TypeA -> smoking` +We assume `TypeA -> smoking` rather than `smoking -> TypeA` -because the Type A behavior pattern is conceptualized -as a relatively stable behavioral trait +as an explicit modeling assumption +for this teaching example: +we treat Type A behavior pattern +as a relatively stable trait that can influence smoking behavior. Under that assumption, `smoking` is a mediator on the path `TypeA -> smoking -> CHD`, not a confounder. -Cholesterol, SBP, and BMI are downstream of age and BMI -but also have independent effects on CHD. 
-Unlike in HERS, where HT was randomized, +BMI is causally affected by age. +Cholesterol and SBP are causally affected +by both age and BMI. +All three have direct effects on CHD. +Unlike in HERS, +where HT was randomized +to estimate its causal effect on CHD event risk, we cannot assume the exposure is unconfounded here. ```{r} From 668a2cb8570dbb60155118d11afa90e59c7ddbca Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 17 Apr 2026 01:44:42 +0000 Subject: [PATCH 04/28] Add prediction metrics definitions and overfitting example Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/f0c6f7ab-75e3-406b-9060-9594c5014097 Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- _sec-pred-sel-prediction.qmd | 109 ++++++++++++++++++++++++++++++++--- 1 file changed, 101 insertions(+), 8 deletions(-) diff --git a/_sec-pred-sel-prediction.qmd b/_sec-pred-sel-prediction.qmd index 202204358..6c591488a 100644 --- a/_sec-pred-sel-prediction.qmd +++ b/_sec-pred-sel-prediction.qmd @@ -24,6 +24,76 @@ the **bias–variance trade-off**. Adding a predictor helps only if the reduction in bias outweighs the increase in variance. +## Numerical example of overfitting {#sec-pred-sel-overfit-example} + +The table below gives a simple numerical example. +A parsimonious model and an intentionally overfit model +are both trained on the same development set. +The overfit model adds many noise predictors +that have no true relationship with LDL. 
+ +::: {#tbl-overfit-example} + +```{r} +#| label: overfit-example + +set.seed(204) + +hers_overfit <- + hers_ldl |> + dplyr::select(LDL, HT, age, statins) |> + tidyr::drop_na() + +n_overfit <- nrow(hers_overfit) +idx_overfit <- sample(seq_len(n_overfit), size = floor(0.7 * n_overfit)) + +train_overfit <- hers_overfit[idx_overfit, ] +valid_overfit <- hers_overfit[-idx_overfit, ] + +for (j in 1:20) { + zname <- paste0("z", j) + train_overfit[[zname]] <- rnorm(nrow(train_overfit)) + valid_overfit[[zname]] <- rnorm(nrow(valid_overfit)) +} + +model_pars <- lm(LDL ~ HT + age + statins, data = train_overfit) +model_overfit <- lm( + LDL ~ HT + age + statins + + z1 + z2 + z3 + z4 + z5 + z6 + z7 + z8 + z9 + z10 + + z11 + z12 + z13 + z14 + z15 + z16 + z17 + z18 + z19 + z20, + data = train_overfit +) + +rmse <- function(y, yhat) { + sqrt(mean((y - yhat)^2)) +} + +tibble::tibble( + Model = c("Parsimonious model", "Overfit model (+20 noise predictors)"), + Training_RMSE = c( + rmse(train_overfit$LDL, predict(model_pars, newdata = train_overfit)), + rmse(train_overfit$LDL, predict(model_overfit, newdata = train_overfit)) + ), + Validation_RMSE = c( + rmse(valid_overfit$LDL, predict(model_pars, newdata = valid_overfit)), + rmse(valid_overfit$LDL, predict(model_overfit, newdata = valid_overfit)) + ) +) |> + dplyr::mutate( + dplyr::across(c(Training_RMSE, Validation_RMSE), ~ round(.x, 2)) + ) +``` + +Numerical example showing overfitting: +lower training error but worse validation error. + +::: + +In this example, +the overfit model usually has lower training RMSE +but higher validation RMSE, +which is the defining pattern of overfitting. + ## Measures of prediction error {#sec-pred-sel-pe} To select a model that minimizes prediction error, @@ -43,16 +113,39 @@ $$\text{AIC} = -2\hat\ell + 2p, \quad (see @sec-pred-sel-aic-bic). For **binary outcomes**, -the **C-statistic** (area under the ROC curve) -measures discrimination — the model's ability to rank -events above non-events. 
-The C-statistic is analogous to $R^2$ for continuous outcomes, -but is rank-based and can be insensitive to improvements -in the quality of predicted probabilities. +common measures include: + +- **Balanced accuracy**: +$$\text{BA} = \frac{\text{Sensitivity} + \text{Specificity}}{2}$$ +where +$$\text{Sensitivity} = \Pr(\hat Y = 1 \mid Y = 1), \quad +\text{Specificity} = \Pr(\hat Y = 0 \mid Y = 0).$$ +If $J$ is **Youden's $J$ statistic**, +$$J = \text{Sensitivity} + \text{Specificity} - 1,$$ +then +$$\text{BA} = \frac{J + 1}{2}.$$ +So BA is a linear transformation of $J$, +and BA and $J$ are equivalent for comparing models. + +- **C-statistic** (area under the ROC curve, AUC): +$C = \Pr(\hat{\pi}_i > \hat{\pi}_j \mid Y_i = 1, Y_j = 0) + \frac{1}{2}\Pr(\hat{\pi}_i = \hat{\pi}_j \mid Y_i = 1, Y_j = 0)$, +where $\hat{\pi}$ is the predicted event probability. +An empirical estimator is +$\hat{C} = \frac{1}{n_1 n_0}\sum_{i:Y_i=1}\sum_{j:Y_j=0}\left[I(\hat{\pi}_i > \hat{\pi}_j) + \frac{1}{2}I(\hat{\pi}_i = \hat{\pi}_j)\right]$. +It measures discrimination: +the model's ability to rank events above non-events. For **survival outcomes**, the analogous measure is the **C-index**, -which measures the model's ability to correctly order -the timing of events for two randomly chosen subjects. +defined as +$$ +\hat C_{\text{index}} += \frac{N_{\text{concordant}} + 0.5\,N_{\text{tied}}} +{N_{\text{comparable}}}, +$$ +where pairs are comparable when one subject is observed to fail +before the other is known to fail or be censored, +and concordance means the subject with earlier failure +has the higher predicted risk score. 
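The balanced-accuracy and C-statistic definitions added in this hunk can be checked numerically (a toy sketch with made-up outcomes and predictions, not part of the patch):

```r
# Toy check of the metrics defined above, on made-up predictions.
y    <- c(1, 1, 1, 0, 0, 0, 0)                # observed outcomes
phat <- c(0.9, 0.6, 0.4, 0.6, 0.3, 0.2, 0.1)  # predicted probabilities
yhat <- as.integer(phat >= 0.5)               # classify at a 0.5 threshold

sens <- mean(yhat[y == 1] == 1) # 2/3
spec <- mean(yhat[y == 0] == 0) # 3/4
J  <- sens + spec - 1
BA <- (sens + spec) / 2
stopifnot(isTRUE(all.equal(BA, (J + 1) / 2))) # BA is a linear map of J

# Tie-aware empirical C-statistic over all event/non-event pairs
pairs <- expand.grid(i = which(y == 1), j = which(y == 0))
C_hat <- mean(
  (phat[pairs$i] > phat[pairs$j]) + 0.5 * (phat[pairs$i] == phat[pairs$j])
)
C_hat # 0.875: 10.5 concordant-pair credits out of 12 pairs
```

The tied pair (both predictions equal to 0.6) contributes half credit, which is why the empirical estimator above includes the `0.5 * (... == ...)` term.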
## Optimism and cross-validation {#sec-pred-sel-cv} From 2adf20d40bd5f2611071421ad89542ba012f7fbf Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 17 Apr 2026 01:50:28 +0000 Subject: [PATCH 05/28] Refine overfitting example code clarity in prediction section Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/f0c6f7ab-75e3-406b-9060-9594c5014097 Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- _sec-pred-sel-prediction.qmd | 34 ++++++++++++++++++++++------------ 1 file changed, 22 insertions(+), 12 deletions(-) diff --git a/_sec-pred-sel-prediction.qmd b/_sec-pred-sel-prediction.qmd index 6c591488a..abbd47586 100644 --- a/_sec-pred-sel-prediction.qmd +++ b/_sec-pred-sel-prediction.qmd @@ -38,29 +38,30 @@ that have no true relationship with LDL. #| label: overfit-example set.seed(204) +n_noise_predictors <- 20 -hers_overfit <- +hers_base <- hers_ldl |> dplyr::select(LDL, HT, age, statins) |> tidyr::drop_na() -n_overfit <- nrow(hers_overfit) +n_overfit <- nrow(hers_base) idx_overfit <- sample(seq_len(n_overfit), size = floor(0.7 * n_overfit)) -train_overfit <- hers_overfit[idx_overfit, ] -valid_overfit <- hers_overfit[-idx_overfit, ] +train_overfit <- hers_base[idx_overfit, ] +valid_overfit <- hers_base[-idx_overfit, ] -for (j in 1:20) { +for (j in seq_len(n_noise_predictors)) { zname <- paste0("z", j) train_overfit[[zname]] <- rnorm(nrow(train_overfit)) valid_overfit[[zname]] <- rnorm(nrow(valid_overfit)) } -model_pars <- lm(LDL ~ HT + age + statins, data = train_overfit) +model_parsimonious <- lm(LDL ~ HT + age + statins, data = train_overfit) +noise_terms <- paste0("z", seq_len(n_noise_predictors)) +overfit_formula <- reformulate(c("HT", "age", "statins", noise_terms), "LDL") model_overfit <- lm( - LDL ~ HT + age + statins + - z1 + z2 + z3 + z4 + z5 + z6 + z7 + z8 + z9 + z10 + - z11 + z12 + z13 + z14 + z15 + z16 + z17 + z18 + z19 + z20, + overfit_formula, data = 
train_overfit ) @@ -69,13 +70,22 @@ rmse <- function(y, yhat) { } tibble::tibble( - Model = c("Parsimonious model", "Overfit model (+20 noise predictors)"), + Model = c( + "Parsimonious model", + paste0("Overfit model (+", n_noise_predictors, " noise predictors)") + ), Training_RMSE = c( - rmse(train_overfit$LDL, predict(model_pars, newdata = train_overfit)), + rmse( + train_overfit$LDL, + predict(model_parsimonious, newdata = train_overfit) + ), rmse(train_overfit$LDL, predict(model_overfit, newdata = train_overfit)) ), Validation_RMSE = c( - rmse(valid_overfit$LDL, predict(model_pars, newdata = valid_overfit)), + rmse( + valid_overfit$LDL, + predict(model_parsimonious, newdata = valid_overfit) + ), rmse(valid_overfit$LDL, predict(model_overfit, newdata = valid_overfit)) ) ) |> From f5e3326b190415003d270f4bf6da06c56d0f85ea Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 17 Apr 2026 02:12:24 +0000 Subject: [PATCH 06/28] Add iterative 10 percent rule BWE workflow for WCGS confounders Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/1ce71256-1e47-43fa-9682-1e5394ef5bf5 Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- _sec-pred-sel-primary.qmd | 129 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 129 insertions(+) diff --git a/_sec-pred-sel-primary.qmd b/_sec-pred-sel-primary.qmd index ad4dd631d..d6bbd9a00 100644 --- a/_sec-pred-sel-primary.qmd +++ b/_sec-pred-sel-primary.qmd @@ -190,6 +190,135 @@ In practice, covariates with established biological relationships to the outcome are typically retained regardless of the change-in-estimate result, for reasons of face validity. +When multiple covariates are candidates for removal, +checking all possible subsets is often impractical. +A practical alternative is a backward-elimination procedure +guided jointly by the 10% rule and precision. + +At each step: + +1. Start from the current model. +2. 
Fit models that drop one remaining candidate covariate at a time. +3. Keep only candidates with + absolute percent change from the full-model exposure coefficient + no larger than 10%, + where the full-model coefficient is always from the original + gold-standard model. +4. Among those candidates, + drop the covariate that gives the largest reduction + in the exposure coefficient's standard error. +5. Repeat until no remaining covariate can be dropped + without either exceeding the 10% threshold + or increasing the exposure standard error. + +```{r} +#| label: wcgs-10pct-bwe + +beta_full <- coef(m_full)["dibpatType A"] +se_full <- sqrt(vcov(m_full)["dibpatType A", "dibpatType A"]) + +remaining_covars <- confounders +current_covars <- confounders +current_se <- se_full +step_num <- 1 +bwe_steps <- list() + +repeat { + candidate_rows <- lapply(remaining_covars, function(v) { + keep_covars <- setdiff(current_covars, v) + reduced_formula <- if (length(keep_covars) == 0) { + chd69 ~ dibpat + } else { + reformulate(c("dibpat", keep_covars), response = "chd69") + } + + m_red <- glm(reduced_formula, data = wcgs, family = binomial) + beta_red <- coef(m_red)["dibpatType A"] + se_red <- sqrt(vcov(m_red)["dibpatType A", "dibpatType A"]) + pct_change <- abs((beta_full - beta_red) / beta_full) * 100 + se_reduction <- current_se - se_red + + data.frame( + Step = step_num, + Candidate_drop = v, + Beta_reduced = beta_red, + Pct_change_from_full = pct_change, + SE_reduced = se_red, + SE_reduction = se_reduction, + Eligible = pct_change <= 10 & se_reduction > 0 + ) + }) + + candidates_df <- do.call(rbind, candidate_rows) + eligible_df <- candidates_df[candidates_df$Eligible, , drop = FALSE] + + if (nrow(eligible_df) == 0) { + break + } + + best_idx <- which.max(eligible_df$SE_reduction) + best_drop <- eligible_df[best_idx, , drop = FALSE] + dropped_var <- as.character(best_drop$Candidate_drop) + + bwe_steps[[step_num]] <- data.frame( + Step = step_num, + Dropped = dropped_var, + 
Pct_change_from_full = round(best_drop$Pct_change_from_full, 1), + SE_before = round(current_se, 4), + SE_after = round(best_drop$SE_reduced, 4), + SE_reduction = round(best_drop$SE_reduction, 4) + ) + + current_covars <- setdiff(current_covars, dropped_var) + remaining_covars <- current_covars + current_se <- best_drop$SE_reduced + step_num <- step_num + 1 +} + +bwe_steps_df <- if (length(bwe_steps) == 0) { + data.frame( + Step = NA_integer_, + Dropped = "None", + Pct_change_from_full = NA_real_, + SE_before = round(se_full, 4), + SE_after = round(se_full, 4), + SE_reduction = 0 + ) +} else { + do.call(rbind, bwe_steps) +} + +retained_covars <- current_covars +``` + +::: {#tbl-wcgs-10pct-bwe} + +```{r} +knitr::kable( + bwe_steps_df, + col.names = c( + "Step", + "Dropped covariate", + "% change from full model", + "SE before drop", + "SE after drop", + "SE reduction" + ) +) +``` + +Backward elimination under the 10% change-in-estimate rule, +with standard-error reduction used to choose among eligible drops. +The reference for the percent-change criterion +is always the Type A coefficient in the original full model. + +::: + +For further reading on model selection strategies in epidemiologic analyses, +see @vittinghoff2e [Chapter 10], +@kleinbaum2010logistic [Chapters 6--8], +and @me4 [Chapter 21, section "Model Selection"]. 
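The 10% change-in-estimate check that drives this backward-elimination procedure can also be sketched in isolation on simulated data (an editorial sketch; the variables and effect sizes are invented, not drawn from WCGS):

```r
# Illustrative sketch: the 10% change-in-estimate rule for one candidate
# confounder, on simulated data (not the WCGS data used in the chapter).
set.seed(42)
n <- 5000
age      <- rnorm(n)
exposure <- rbinom(n, 1, plogis(0.5 * age))  # age affects the exposure
outcome  <- rbinom(n, 1, plogis(-2 + 0.4 * exposure + 0.7 * age)) # and outcome

m_full    <- glm(outcome ~ exposure + age, family = binomial)
m_reduced <- glm(outcome ~ exposure, family = binomial)

beta_full    <- unname(coef(m_full)["exposure"])
beta_reduced <- unname(coef(m_reduced)["exposure"])
pct_change   <- abs((beta_full - beta_reduced) / beta_full) * 100

# Retain age as a confounder if dropping it shifts the estimate by > 10%
retain_age <- pct_change > 10
```

Because `age` here is a strong common cause of both exposure and outcome, dropping it shifts the exposure coefficient well past the 10% threshold, so the rule retains it.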
+ ::: {.callout-note} ## Collinearity between confounders is not a problem From 590bb557b7054edfc8a79160fb4053dcf0a19dc7 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 17 Apr 2026 02:19:23 +0000 Subject: [PATCH 07/28] Refine BWE section formatting and citation detail Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/1ce71256-1e47-43fa-9682-1e5394ef5bf5 Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- _sec-pred-sel-primary.qmd | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/_sec-pred-sel-primary.qmd b/_sec-pred-sel-primary.qmd index d6bbd9a00..6b7bbf0ab 100644 --- a/_sec-pred-sel-primary.qmd +++ b/_sec-pred-sel-primary.qmd @@ -213,6 +213,7 @@ At each step: ```{r} #| label: wcgs-10pct-bwe +#| code-fold: false beta_full <- coef(m_full)["dibpatType A"] se_full <- sqrt(vcov(m_full)["dibpatType A", "dibpatType A"]) @@ -315,7 +316,7 @@ is always the Type A coefficient in the original full model. ::: For further reading on model selection strategies in epidemiologic analyses, -see @vittinghoff2e [Chapter 10], +see @vittinghoff2e [Chapter 10, pp. 257--275], @kleinbaum2010logistic [Chapters 6--8], and @me4 [Chapter 21, section "Model Selection"]. 
From 2055aeb825c54b3c91c0b9ffce989ea5f6fc875a Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 17 Apr 2026 02:32:47 +0000 Subject: [PATCH 08/28] Refactor BWE code clarity in WCGS 10 percent section Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/1ce71256-1e47-43fa-9682-1e5394ef5bf5 Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- _sec-pred-sel-primary.qmd | 33 ++++++++++++++++++++------------- 1 file changed, 20 insertions(+), 13 deletions(-) diff --git a/_sec-pred-sel-primary.qmd b/_sec-pred-sel-primary.qmd index 6b7bbf0ab..714cbbd11 100644 --- a/_sec-pred-sel-primary.qmd +++ b/_sec-pred-sel-primary.qmd @@ -191,9 +191,12 @@ are typically retained regardless of the change-in-estimate result, for reasons of face validity. When multiple covariates are candidates for removal, -checking all possible subsets is often impractical. -A practical alternative is a backward-elimination procedure -guided jointly by the 10% rule and precision. +checking all possible subsets +is often impractical. +A practical alternative +is a backward-elimination procedure +guided jointly by the 10% rule +and precision. At each step: @@ -202,13 +205,16 @@ At each step: 3. Keep only candidates with absolute percent change from the full-model exposure coefficient no larger than 10%, - where the full-model coefficient is always from the original + where the full-model coefficient + is always from the original gold-standard model. 4. Among those candidates, - drop the covariate that gives the largest reduction + drop the covariate + that gives the largest reduction in the exposure coefficient's standard error. 5. Repeat until no remaining covariate can be dropped - without either exceeding the 10% threshold + without either + exceeding the 10% threshold or increasing the exposure standard error. 
```{r} @@ -217,20 +223,21 @@ At each step: beta_full <- coef(m_full)["dibpatType A"] se_full <- sqrt(vcov(m_full)["dibpatType A", "dibpatType A"]) +outcome_var <- "chd69" +exposure_var <- "dibpat" -remaining_covars <- confounders current_covars <- confounders current_se <- se_full step_num <- 1 bwe_steps <- list() repeat { - candidate_rows <- lapply(remaining_covars, function(v) { + candidate_results <- lapply(current_covars, function(v) { keep_covars <- setdiff(current_covars, v) reduced_formula <- if (length(keep_covars) == 0) { - chd69 ~ dibpat + reformulate(exposure_var, response = outcome_var) } else { - reformulate(c("dibpat", keep_covars), response = "chd69") + reformulate(c(exposure_var, keep_covars), response = outcome_var) } m_red <- glm(reduced_formula, data = wcgs, family = binomial) @@ -250,7 +257,7 @@ repeat { ) }) - candidates_df <- do.call(rbind, candidate_rows) + candidates_df <- do.call(rbind, candidate_results) eligible_df <- candidates_df[candidates_df$Eligible, , drop = FALSE] if (nrow(eligible_df) == 0) { @@ -271,7 +278,6 @@ repeat { ) current_covars <- setdiff(current_covars, dropped_var) - remaining_covars <- current_covars current_se <- best_drop$SE_reduced step_num <- step_num + 1 } @@ -309,7 +315,8 @@ knitr::kable( ``` Backward elimination under the 10% change-in-estimate rule, -with standard-error reduction used to choose among eligible drops. +with standard-error reduction +used to choose among eligible drops. The reference for the percent-change criterion is always the Type A coefficient in the original full model. 
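The `reformulate()` idiom adopted in this refactor can be seen in a standalone form (a sketch with placeholder variable names, not the WCGS variables):

```r
# Standalone sketch of the reformulate() idiom used in the refactor above:
# building full and reduced model formulas from character vectors,
# instead of writing each formula out by hand.
covars <- c("age", "smoke", "chol") # placeholder covariate names
f_full <- reformulate(c("exposure", covars), response = "outcome")
f_drop <- reformulate(c("exposure", setdiff(covars, "chol")),
                      response = "outcome")

deparse(f_full) # "outcome ~ exposure + age + smoke + chol"
deparse(f_drop) # "outcome ~ exposure + age + smoke"
```

This is why the elimination loop can drop covariates generically: each candidate set of retained covariates maps directly to a formula via `reformulate()`, with no hand-edited model specifications.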
From 5efc92b910c4a37be6a2c8598595af4a203c3f7c Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 17 Apr 2026 02:55:40 +0000 Subject: [PATCH 09/28] Address review thread citations and shared HERS intro consolidation Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/164b5c1e-24bc-4094-84c4-66cb1dd6d14d Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- .github/copilot-instructions.md | 32 +++++++++++----------------- _sec-pred-sel-dag.qmd | 9 -------- _sec-pred-sel-goals.qmd | 2 +- _sec-pred-sel-primary.qmd | 5 ++++- _subfiles/shared/_sec_hers_intro.qmd | 10 +++++++++ references.bib | 30 ++++++++++++++++++++++++++ 6 files changed, 57 insertions(+), 31 deletions(-) diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 30e89af40..cd81200d8 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -101,6 +101,18 @@ This applies whether the citation refers to one author or multiple authors (e.g. Pandoc renders `@dobson4e` as a noun phrase (e.g., "Dobson and Barnett (2018)"), but grammatically the citation key itself is treated as a singular pronoun/name. +## Evidence and Source Citation Requirements + +For factual claims that are not directly proved in the text, +always include a specific source citation. +Do not leave factual statements uncited. + +Always cite papers and books using +BibTeX entries in `references.bib` +and Quarto/Pandoc citation syntax +(for example `[@MickeyGreenland1989]`), +rather than plaintext author-date references. 
+ ## Code Formatting Guidelines When adding or editing text in source code (such as comments, documentation strings, or error messages) or in Quarto document text chunks: @@ -143,26 +155,6 @@ Use `code-fold: false` whenever: - The output value is referenced or explained in the surrounding text - The reader needs to see both the code and the result to follow the argument -## Landscape Tables in PDF - -When a table is too wide for portrait orientation in PDF, -wrap the table div in a `.landscape` fenced div. - -Use this pattern: - -```qmd -::: {.landscape} - -:::{#tbl-your-wide-table} -... table content ... -::: - -::: -``` - -Apply this only where needed, -so HTML and RevealJS output stay unchanged. - ## Figures and Tables: Use Div Format Always use the **Quarto div format** for figures and tables rather than chunk-option `fig-cap`/`tbl-cap`. diff --git a/_sec-pred-sel-dag.qmd b/_sec-pred-sel-dag.qmd index 03a72fbb2..2b2097f51 100644 --- a/_sec-pred-sel-dag.qmd +++ b/_sec-pred-sel-dag.qmd @@ -75,15 +75,6 @@ introducing **collider bias**. ## HERS study DAG {#sec-pred-sel-hers-dag} {{< include _subfiles/shared/_sec_hers_intro.qmd >}} - -The HERS trial enrolled 2,763 postmenopausal women with established coronary -heart disease at 20 US clinical centers. -Participants were randomized to receive either conjugated equine estrogens -(0.625 mg/day) plus medroxyprogesterone acetate (2.5 mg/day) -or a matching placebo [@HulleyStephen1998RToE]. -Women were followed for an average of 4.1 years. - -The **primary outcome** was nonfatal myocardial infarction or CHD death. @fig-hers-dag shows a DAG for the HERS study, diff --git a/_sec-pred-sel-goals.qmd b/_sec-pred-sel-goals.qmd index a24c60c03..c19fb90a8 100644 --- a/_sec-pred-sel-goals.qmd +++ b/_sec-pred-sel-goals.qmd @@ -63,6 +63,6 @@ In practice, these goals often overlap. 
For example, in the HERS analysis of predictors of CHD, the primary goal was to identify the important independent risk factors (Goal 3) while also obtaining valid estimates of each predictor's effect -(Goal 2). +(Goal 2) [@HulleyStephen1998RToE]. ::: diff --git a/_sec-pred-sel-primary.qmd b/_sec-pred-sel-primary.qmd index 714cbbd11..3acd4e613 100644 --- a/_sec-pred-sel-primary.qmd +++ b/_sec-pred-sel-primary.qmd @@ -47,7 +47,10 @@ and removing them could introduce bias. An alternative retention criterion is the **10% change-in-estimate rule** -[@kleinbaum2010logistic, Chapter 8; @vittinghoff2e, Chapter 10]: +[@kleinbaum2010logistic, Chapter 8; +@vittinghoff2e, Chapter 10; +@MickeyGreenland1989, p. 125; +@Greenland1989Modeling, p. 340]: Retain a candidate confounder $C$ in the model if dropping it changes the estimated coefficient for the primary predictor diff --git a/_subfiles/shared/_sec_hers_intro.qmd b/_subfiles/shared/_sec_hers_intro.qmd index cff857e51..ccc299493 100644 --- a/_subfiles/shared/_sec_hers_intro.qmd +++ b/_subfiles/shared/_sec_hers_intro.qmd @@ -2,3 +2,13 @@ The "heart and estrogen/progestin study" (HERS) was a clinical trial of hormone therapy for prevention of recurrent heart attacks and death among 2,763 post-menopausal women with existing coronary heart disease (CHD) [@HulleyStephen1998RToE]. + +The HERS trial enrolled 2,763 postmenopausal women +with established coronary heart disease +at 20 US clinical centers. +Participants were randomized to receive either conjugated equine estrogens +(0.625 mg/day) plus medroxyprogesterone acetate (2.5 mg/day) +or a matching placebo [@HulleyStephen1998RToE]. +Women were followed for an average of 4.1 years. + +The primary outcome was nonfatal myocardial infarction or CHD death. 
diff --git a/references.bib b/references.bib index 3bb731149..cfd39b3fa 100644 --- a/references.bib +++ b/references.bib @@ -205,6 +205,36 @@ @book{dunn2018generalized url={https://link.springer.com/book/10.1007/978-1-4419-0118-7} } +@article{MickeyGreenland1989, + author = {Mickey, Ruth M. and Greenland, Sander}, + title = {The impact of confounder selection criteria on effect estimation}, + journal = {American Journal of Epidemiology}, + volume = {129}, + number = {1}, + pages = {125--137}, + year = {1989}, + month = {01}, + issn = {0002-9262}, + doi = {10.1093/oxfordjournals.aje.a115101}, + url = {https://doi.org/10.1093/oxfordjournals.aje.a115101}, + eprint = {https://academic.oup.com/aje/article-pdf/129/1/125/127199/129-1-125.pdf} +} + +@article{Greenland1989Modeling, + author = {Greenland, Sander}, + title = {Modeling and variable selection in epidemiologic analysis}, + journal = {American Journal of Public Health}, + volume = {79}, + number = {3}, + pages = {340--349}, + year = {1989}, + month = {03}, + doi = {10.2105/AJPH.79.3.340}, + pmid = {2916724}, + pmcid = {PMC1349563}, + url = {https://doi.org/10.2105/AJPH.79.3.340} +} + @book{rpkgs, title={{R} Packages}, author={Wickham, Hadley and Bryan, Jennifer}, From ece4666f010a0100feef49052b2a2d5cddf53a59 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 17 Apr 2026 03:01:44 +0000 Subject: [PATCH 10/28] Polish 10 percent citations and remove duplicated HERS count Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/164b5c1e-24bc-4094-84c4-66cb1dd6d14d Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- _sec-pred-sel-primary.qmd | 4 ++-- _subfiles/shared/_sec_hers_intro.qmd | 3 +-- 2 files changed, 3 insertions(+), 4 deletions(-) diff --git a/_sec-pred-sel-primary.qmd b/_sec-pred-sel-primary.qmd index 3acd4e613..d794ef39b 100644 --- a/_sec-pred-sel-primary.qmd +++ b/_sec-pred-sel-primary.qmd @@ -47,8 +47,8 @@ and 
removing them could introduce bias. An alternative retention criterion is the **10% change-in-estimate rule** -[@kleinbaum2010logistic, Chapter 8; -@vittinghoff2e, Chapter 10; +[@kleinbaum2010logistic, Chapter 8, pp. 194--212; +@vittinghoff2e, Chapter 10, pp. 257--275; @MickeyGreenland1989, p. 125; @Greenland1989Modeling, p. 340]: diff --git a/_subfiles/shared/_sec_hers_intro.qmd b/_subfiles/shared/_sec_hers_intro.qmd index ccc299493..85dbb0fa7 100644 --- a/_subfiles/shared/_sec_hers_intro.qmd +++ b/_subfiles/shared/_sec_hers_intro.qmd @@ -3,8 +3,7 @@ trial of hormone therapy for prevention of recurrent heart attacks and death among 2,763 post-menopausal women with existing coronary heart disease (CHD) [@HulleyStephen1998RToE]. -The HERS trial enrolled 2,763 postmenopausal women -with established coronary heart disease +The trial was conducted at 20 US clinical centers. Participants were randomized to receive either conjugated equine estrogens (0.625 mg/day) plus medroxyprogesterone acetate (2.5 mg/day) From 4ea0880bbf6f2ee48811b865d2b04c43bcef06cc Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 17 Apr 2026 03:06:39 +0000 Subject: [PATCH 11/28] Add transition text after shared HERS intro include Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/164b5c1e-24bc-4094-84c4-66cb1dd6d14d Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- _sec-pred-sel-dag.qmd | 3 +++ 1 file changed, 3 insertions(+) diff --git a/_sec-pred-sel-dag.qmd b/_sec-pred-sel-dag.qmd index 2b2097f51..8daf66662 100644 --- a/_sec-pred-sel-dag.qmd +++ b/_sec-pred-sel-dag.qmd @@ -75,6 +75,9 @@ introducing **collider bias**. ## HERS study DAG {#sec-pred-sel-hers-dag} {{< include _subfiles/shared/_sec_hers_intro.qmd >}} + +The DAG below focuses on the assumed causal structure +among key baseline variables, treatment, and recurrent CHD events. 
@fig-hers-dag shows a DAG for the HERS study, From 2e24a258429d758a1f5764e5ecb18c39f95a667f Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 17 Apr 2026 03:11:37 +0000 Subject: [PATCH 12/28] Refine HERS intro phrasing and DAG transition wording Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/164b5c1e-24bc-4094-84c4-66cb1dd6d14d Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- _sec-pred-sel-dag.qmd | 2 +- _subfiles/shared/_sec_hers_intro.qmd | 3 +-- 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/_sec-pred-sel-dag.qmd b/_sec-pred-sel-dag.qmd index 8daf66662..f333cdab8 100644 --- a/_sec-pred-sel-dag.qmd +++ b/_sec-pred-sel-dag.qmd @@ -76,7 +76,7 @@ introducing **collider bias**. {{< include _subfiles/shared/_sec_hers_intro.qmd >}} -The DAG below focuses on the assumed causal structure +We now examine the assumed causal structure among key baseline variables, treatment, and recurrent CHD events. diff --git a/_subfiles/shared/_sec_hers_intro.qmd b/_subfiles/shared/_sec_hers_intro.qmd index 85dbb0fa7..be350c6df 100644 --- a/_subfiles/shared/_sec_hers_intro.qmd +++ b/_subfiles/shared/_sec_hers_intro.qmd @@ -3,8 +3,7 @@ trial of hormone therapy for prevention of recurrent heart attacks and death among 2,763 post-menopausal women with existing coronary heart disease (CHD) [@HulleyStephen1998RToE]. -The trial was conducted -at 20 US clinical centers. +The trial was conducted at 20 US clinical centers. Participants were randomized to receive either conjugated equine estrogens (0.625 mg/day) plus medroxyprogesterone acetate (2.5 mg/day) or a matching placebo [@HulleyStephen1998RToE]. 
From cdd6d8ce6ab2d81c75f5f33d90584436b874d899 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 17 Apr 2026 03:57:38 +0000 Subject: [PATCH 13/28] Restore landscape-table guidance and correct HERS objective wording Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/bcc2b6fa-4941-45b7-97a8-d56a91d33823 Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- .github/copilot-instructions.md | 20 ++++++++++++++++++++ _sec-pred-sel-goals.qmd | 9 +++++---- 2 files changed, 25 insertions(+), 4 deletions(-) diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index cd81200d8..77e6b814b 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -155,6 +155,26 @@ Use `code-fold: false` whenever: - The output value is referenced or explained in the surrounding text - The reader needs to see both the code and the result to follow the argument +## Landscape Tables in PDF + +When a table is too wide for portrait orientation in PDF, +wrap the table div in a `.landscape` fenced div. + +Use this pattern: + +```qmd +::: {.landscape} + +:::{#tbl-your-wide-table} +... table content ... +::: + +::: +``` + +Apply this only where needed, +so HTML and RevealJS output stay unchanged. + ## Figures and Tables: Use Div Format Always use the **Quarto div format** for figures and tables rather than chunk-option `fig-cap`/`tbl-cap`. diff --git a/_sec-pred-sel-goals.qmd b/_sec-pred-sel-goals.qmd index c19fb90a8..4e6f6b63b 100644 --- a/_sec-pred-sel-goals.qmd +++ b/_sec-pred-sel-goals.qmd @@ -60,9 +60,10 @@ Inclusive models with cautious interpretation are preferred ::: notes In practice, these goals often overlap. -For example, in the HERS analysis of predictors of CHD, -the primary goal was to identify the important independent risk factors -(Goal 3) while also obtaining valid estimates of each predictor's effect -(Goal 2) [@HulleyStephen1998RToE]. 
+For example, in HERS, +the primary objective was to determine whether estrogen plus progestin therapy +alters the risk of CHD events in postmenopausal women with established CHD +[@HulleyStephen1998RToE]. +This corresponds to estimating a treatment effect (Goal 2). ::: From 7b61dde3c8f40a929f3a611839458512bba1e99d Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 17 Apr 2026 04:14:25 +0000 Subject: [PATCH 14/28] Update WCGS DAG naming/layout and add HERS/WCGS data dictionaries Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/2c8befe5-308f-47be-aedb-1ac5a0a26ca0 Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- .github/copilot-instructions.md | 10 +++++ _sec-pred-sel-dag.qmd | 67 +++++++++++++++++++++++++-------- 2 files changed, 61 insertions(+), 16 deletions(-) diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 77e6b814b..a52559e41 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -101,6 +101,16 @@ This applies whether the citation refers to one author or multiple authors (e.g. Pandoc renders `@dobson4e` as a noun phrase (e.g., "Dobson and Barnett (2018)"), but grammatically the citation key itself is treated as a singular pronoun/name. +## DAG Naming for Categorical Variables + +In DAGs, +use node names that represent the full categorical variable, +not a specific level of that variable. + +Example: +use `personality_type`, +not `TypeA`. + ## Evidence and Source Citation Requirements For factual claims that are not directly proved in the text, diff --git a/_sec-pred-sel-dag.qmd b/_sec-pred-sel-dag.qmd index f333cdab8..562d64ca9 100644 --- a/_sec-pred-sel-dag.qmd +++ b/_sec-pred-sel-dag.qmd @@ -87,6 +87,25 @@ The main goal of the HERS study was to estimate the causal effect of hormone therapy (HT) on recurrent CHD event risk. 
+::: {#tbl-hers-data-dictionary} + +| Variable | Meaning | Type | +|---|---|---| +| `HT` | Hormone-therapy assignment (placebo vs estrogen+progestin) | Binary categorical | +| `CHD` | Recurrent coronary heart disease event outcome | Binary categorical | +| `age` | Baseline age (years) | Continuous | +| `smoking` | Baseline smoking status | Binary categorical | +| `diabetes` | Baseline diabetes status | Binary categorical | +| `BMI` | Body mass index | Continuous | +| `LDL` | Low-density lipoprotein cholesterol | Continuous | +| `HDL` | High-density lipoprotein cholesterol | Continuous | +| `statins` | Baseline statin use | Binary categorical | +| `SBP` | Systolic blood pressure | Continuous | + +Data dictionary for key HERS variables used in the DAG and examples. + +::: + ::: {#fig-hers-dag} ```{r} @@ -306,6 +325,22 @@ time urgency, and hostility), hypothesized to increase CHD risk independently of classic cardiovascular risk factors. +::: {#tbl-wcgs-data-dictionary} + +| Variable | Meaning | Type | +|---|---|---| +| `personality_type` | Personality type (Type A vs Type B behavior pattern) | Binary categorical | +| `CHD` | Coronary heart disease event outcome | Binary categorical | +| `age` | Baseline age (years) | Continuous | +| `smoking` | Baseline smoking status | Binary categorical | +| `chol` | Serum cholesterol | Continuous | +| `sbp` | Systolic blood pressure | Continuous | +| `bmi` | Body mass index | Continuous | + +Data dictionary for key WCGS variables used in the DAG and examples. + +::: + ::: {#fig-wcgs-dag} ```{r} @@ -314,21 +349,21 @@ independently of classic cardiovascular risk factors. 
wcgs_dag <- dagitty( 'dag { - TypeA [exposure, pos="0,0"] + personality_type [exposure, pos="0,0"] CHD [outcome, pos="4,0"] - age [pos="2,2"] - smoking [pos="-1,-2"] - chol [pos="2,-1"] - sbp [pos="3,-2"] - bmi [pos="-1,1"] + age [pos="1.4,2"] + bmi [pos="-0.3,1.1"] + smoking [pos="1.1,-1.8"] + chol [pos="2.7,1"] + sbp [pos="2.8,-1.2"] - TypeA -> CHD + personality_type -> CHD age -> CHD age -> chol age -> sbp age -> bmi - age -> TypeA - TypeA -> smoking + age -> personality_type + personality_type -> smoking smoking -> CHD chol -> CHD sbp -> CHD @@ -342,7 +377,7 @@ plot(wcgs_dag) ``` Directed Acyclic Graph (DAG) for the WCGS study. -`TypeA` = Type A behavior pattern (exposure); +`personality_type` = personality type (exposure; Type A vs Type B); `CHD` = coronary heart disease events (outcome); `chol` = serum cholesterol; `sbp` = systolic blood pressure. @@ -351,10 +386,10 @@ Directed Acyclic Graph (DAG) for the WCGS study. In @fig-wcgs-dag, `age` is a **confounder**: -it has direct causal arrows to both the exposure (`TypeA`) +it has direct causal arrows to both the exposure (`personality_type`) and the outcome (`CHD`). -We assume `TypeA -> smoking` -rather than `smoking -> TypeA` +We assume `personality_type -> smoking` +rather than `smoking -> personality_type` as an explicit modeling assumption for this teaching example: we treat Type A behavior pattern @@ -362,7 +397,7 @@ as a relatively stable trait that can influence smoking behavior. Under that assumption, `smoking` is a mediator on the path -`TypeA -> smoking -> CHD`, +`personality_type -> smoking -> CHD`, not a confounder. BMI is causally affected by age. Cholesterol and SBP are causally affected @@ -378,7 +413,7 @@ we cannot assume the exposure is unconfounded here. 
wcgs_adj <- adjustmentSets( wcgs_dag, - exposure = "TypeA", + exposure = "personality_type", outcome = "CHD", type = "minimal" ) @@ -388,7 +423,7 @@ wcgs_adj ::: notes The `adjustmentSets()` function returns the minimal set(s) of variables -that block all backdoor paths from `TypeA` to `CHD`. +that block all backdoor paths from `personality_type` to `CHD`. Adjusting for any of these minimal sets is sufficient to estimate the total causal effect of Type A behavior on CHD (under the assumptions encoded in the DAG). From bf2bd5047c9b0cab592adb656751b373de74703b Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 17 Apr 2026 04:19:45 +0000 Subject: [PATCH 15/28] Fix data-dictionary caption placement in predictor DAG section Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/2c8befe5-308f-47be-aedb-1ac5a0a26ca0 Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- _sec-pred-sel-dag.qmd | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/_sec-pred-sel-dag.qmd b/_sec-pred-sel-dag.qmd index 562d64ca9..a99ec3d67 100644 --- a/_sec-pred-sel-dag.qmd +++ b/_sec-pred-sel-dag.qmd @@ -89,6 +89,8 @@ on recurrent CHD event risk. ::: {#tbl-hers-data-dictionary} +Data dictionary for key HERS variables used in the DAG and examples. + | Variable | Meaning | Type | |---|---|---| | `HT` | Hormone-therapy assignment (placebo vs estrogen+progestin) | Binary categorical | @@ -102,8 +104,6 @@ on recurrent CHD event risk. | `statins` | Baseline statin use | Binary categorical | | `SBP` | Systolic blood pressure | Continuous | -Data dictionary for key HERS variables used in the DAG and examples. - ::: ::: {#fig-hers-dag} @@ -327,6 +327,8 @@ independently of classic cardiovascular risk factors. ::: {#tbl-wcgs-data-dictionary} +Data dictionary for key WCGS variables used in the DAG and examples. 
+ | Variable | Meaning | Type | |---|---|---| | `personality_type` | Personality type (Type A vs Type B behavior pattern) | Binary categorical | @@ -337,8 +339,6 @@ independently of classic cardiovascular risk factors. | `sbp` | Systolic blood pressure | Continuous | | `bmi` | Body mass index | Continuous | -Data dictionary for key WCGS variables used in the DAG and examples. - ::: ::: {#fig-wcgs-dag} From cd70aee7a1300745317b0df15feb30f4d793f52c Mon Sep 17 00:00:00 2001 From: Douglas Ezra Morrison Date: Thu, 16 Apr 2026 21:21:27 -0700 Subject: [PATCH 16/28] Update _sec-pred-sel-dag.qmd Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- _sec-pred-sel-dag.qmd | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/_sec-pred-sel-dag.qmd b/_sec-pred-sel-dag.qmd index a99ec3d67..a76df4d1b 100644 --- a/_sec-pred-sel-dag.qmd +++ b/_sec-pred-sel-dag.qmd @@ -78,7 +78,10 @@ introducing **collider bias**. We now examine the assumed causal structure among key baseline variables, treatment, and recurrent CHD events. - + @fig-hers-dag shows a DAG for the HERS study, representing the assumed causal structure among key variables. 
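As a side note on the `dagitty` workflow that patches 14–16 above revise, the pattern can be sketched in miniature. The graph below is a deliberately reduced toy version of the WCGS DAG (only the `age` and `smoking` structure is kept), so it is an illustration of the technique, not the book's actual graph:

```r
# Hypothetical miniature of the WCGS-style DAG; only the
# age/smoking structure is retained for illustration.
library(dagitty)

g <- dagitty('dag {
  personality_type [exposure]
  CHD [outcome]
  age -> personality_type
  age -> CHD
  personality_type -> smoking
  smoking -> CHD
  personality_type -> CHD
}')

# Minimal sets of variables blocking all backdoor paths
# from the exposure to the outcome:
adjustmentSets(g, exposure = "personality_type", outcome = "CHD")
# In this toy graph the only backdoor path is
# personality_type <- age -> CHD, so the minimal adjustment
# set is { age }; smoking is a mediator and is left out
# when targeting the total effect.
```

`adjustmentSets()` defaults to `type = "minimal"` and the total effect, matching the call in patch 14's diff.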
From 4f8724fd2edd2fc6c8e2de195023af97a52de0b1 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 17 Apr 2026 04:40:36 +0000 Subject: [PATCH 17/28] Apply review-thread updates for predictor selection and citations Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/a2c4b52c-8928-452a-8cdd-63780a049e89 Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- .github/copilot-instructions.md | 7 +++++-- _sec-pred-sel-primary.qmd | 24 ++++++++++++---------- _subfiles/shared/_sec_hers_conclusions.qmd | 5 ++++- _subfiles/shared/_sec_hers_intro.qmd | 2 ++ references.bib | 4 ++-- 5 files changed, 26 insertions(+), 16 deletions(-) diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index a52559e41..63a21dbf8 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -187,8 +187,11 @@ so HTML and RevealJS output stay unchanged. ## Figures and Tables: Use Div Format -Always use the **Quarto div format** for figures and tables rather than chunk-option `fig-cap`/`tbl-cap`. +Prefer the **Quarto div format** for new figures and tables +rather than chunk-option `fig-cap`/`tbl-cap`. The div format makes it easier to write and format multi-sentence captions. +Existing chunk-option captions in older files are acceptable +unless you are already refactoring that content. **Correct** (div format): ````qmd @@ -226,7 +229,7 @@ plot(x, y) ``` ```` -This applies to all new figures and tables in `.qmd` files. +Use this for newly added or substantially revised figures and tables in `.qmd` files. ## Math Notation diff --git a/_sec-pred-sel-primary.qmd b/_sec-pred-sel-primary.qmd index d794ef39b..a4dc34053 100644 --- a/_sec-pred-sel-primary.qmd +++ b/_sec-pred-sel-primary.qmd @@ -96,8 +96,9 @@ the exposure of interest is Type A behavior pattern (`dibpat`), and the outcome is CHD incidence (`chd69`). 
We fit logistic regression models and apply the 10% change-in-estimate rule -to assess whether age, cholesterol, SBP, smoking, and BMI -are meaningful confounders of the Type A → CHD relationship. +to assess whether age, cholesterol, SBP, and BMI +are meaningful confounders of the Type A → CHD relationship, +while also examining smoking as an additional covariate. ::: notes @@ -117,14 +118,14 @@ wcgs <- readRDS(fs::path_package("rme", "extdata/wcgs.rds")) # Full model (all candidate confounders) m_full <- glm( - chd69 ~ dibpat + age + chol + sbp + smoke + bmi, + I(chd69 == "Yes") ~ dibpat + age + chol + sbp + smoke + bmi, data = wcgs, family = binomial ) # Unadjusted model (Type A only) m_unadj <- glm( - chd69 ~ dibpat, + I(chd69 == "Yes") ~ dibpat, data = wcgs, family = binomial ) @@ -138,7 +139,7 @@ confounders <- c("age", "chol", "sbp", "smoke", "bmi") results <- lapply(confounders, function(v) { f <- as.formula( - paste("chd69 ~ dibpat +", + paste("I(chd69 == \"Yes\") ~ dibpat +", paste(setdiff(confounders, v), collapse = " + ")) ) m_red <- glm(f, data = wcgs, family = binomial) @@ -226,7 +227,6 @@ At each step: beta_full <- coef(m_full)["dibpatType A"] se_full <- sqrt(vcov(m_full)["dibpatType A", "dibpatType A"]) -outcome_var <- "chd69" exposure_var <- "dibpat" current_covars <- confounders @@ -237,11 +237,13 @@ bwe_steps <- list() repeat { candidate_results <- lapply(current_covars, function(v) { keep_covars <- setdiff(current_covars, v) - reduced_formula <- if (length(keep_covars) == 0) { - reformulate(exposure_var, response = outcome_var) - } else { - reformulate(c(exposure_var, keep_covars), response = outcome_var) - } + rhs_terms <- c(exposure_var, keep_covars) + reduced_formula <- as.formula( + paste( + "I(chd69 == \"Yes\") ~", + paste(rhs_terms, collapse = " + ") + ) + ) m_red <- glm(reduced_formula, data = wcgs, family = binomial) beta_red <- coef(m_red)["dibpatType A"] diff --git a/_subfiles/shared/_sec_hers_conclusions.qmd 
b/_subfiles/shared/_sec_hers_conclusions.qmd index 1977a6250..6cae52ff2 100644 --- a/_subfiles/shared/_sec_hers_conclusions.qmd +++ b/_subfiles/shared/_sec_hers_conclusions.qmd @@ -1,8 +1,10 @@ Despite observational studies suggesting up to 35–80% fewer recurrent CHD events among hormone users than non-users, +[@HulleyStephen1998RToE] HERS found **no significant difference** in CHD event rates between the hormone therapy (HT) and placebo groups -(relative hazard 0.99; 95% CI: 0.80–1.22). +(relative hazard 0.99; 95% CI: 0.80–1.22) +[@HulleyStephen1998RToE]. The HERS investigators noted a statistically significant time trend: more CHD events occurred in the HT group in year 1, @@ -13,6 +15,7 @@ A key finding was that HT significantly altered lipid profiles: - LDL cholesterol: 11% lower in the HT group. - HDL cholesterol: 10% higher in the HT group. +[@HulleyStephen1998RToE] Yet these favorable lipid changes did not translate into fewer CHD events overall. diff --git a/_subfiles/shared/_sec_hers_intro.qmd b/_subfiles/shared/_sec_hers_intro.qmd index be350c6df..45dc05434 100644 --- a/_subfiles/shared/_sec_hers_intro.qmd +++ b/_subfiles/shared/_sec_hers_intro.qmd @@ -8,5 +8,7 @@ Participants were randomized to receive either conjugated equine estrogens (0.625 mg/day) plus medroxyprogesterone acetate (2.5 mg/day) or a matching placebo [@HulleyStephen1998RToE]. Women were followed for an average of 4.1 years. +[@HulleyStephen1998RToE] The primary outcome was nonfatal myocardial infarction or CHD death. 
+[@HulleyStephen1998RToE] diff --git a/references.bib b/references.bib index cfd39b3fa..b1c3a76f4 100644 --- a/references.bib +++ b/references.bib @@ -213,7 +213,7 @@ @article{MickeyGreenland1989 number = {1}, pages = {125--137}, year = {1989}, - month = {01}, + month = jan, issn = {0002-9262}, doi = {10.1093/oxfordjournals.aje.a115101}, url = {https://doi.org/10.1093/oxfordjournals.aje.a115101}, @@ -228,7 +228,7 @@ @article{Greenland1989Modeling number = {3}, pages = {340--349}, year = {1989}, - month = {03}, + month = mar, doi = {10.2105/AJPH.79.3.340}, pmid = {2916724}, pmcid = {PMC1349563}, From a0ff2b8e50817ce3a2de2e49fcf16ad5578f9ad5 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 17 Apr 2026 04:48:39 +0000 Subject: [PATCH 18/28] Address remaining review nits for HERS citations and WCGS formula robustness Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/a2c4b52c-8928-452a-8cdd-63780a049e89 Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- _sec-pred-sel-primary.qmd | 15 +++++++++------ _subfiles/shared/_sec_hers_conclusions.qmd | 7 +++---- _subfiles/shared/_sec_hers_intro.qmd | 8 ++++---- 3 files changed, 16 insertions(+), 14 deletions(-) diff --git a/_sec-pred-sel-primary.qmd b/_sec-pred-sel-primary.qmd index a4dc34053..e7185b304 100644 --- a/_sec-pred-sel-primary.qmd +++ b/_sec-pred-sel-primary.qmd @@ -237,13 +237,16 @@ bwe_steps <- list() repeat { candidate_results <- lapply(current_covars, function(v) { keep_covars <- setdiff(current_covars, v) - rhs_terms <- c(exposure_var, keep_covars) - reduced_formula <- as.formula( - paste( - "I(chd69 == \"Yes\") ~", - paste(rhs_terms, collapse = " + ") + reduced_formula <- if (length(keep_covars) == 0) { + as.formula("I(chd69 == \"Yes\") ~ dibpat") + } else { + as.formula( + paste( + "I(chd69 == \"Yes\") ~", + paste(c(exposure_var, keep_covars), collapse = " + ") + ) ) - ) + } m_red <- glm(reduced_formula, data = 
wcgs, family = binomial) beta_red <- coef(m_red)["dibpatType A"] diff --git a/_subfiles/shared/_sec_hers_conclusions.qmd b/_subfiles/shared/_sec_hers_conclusions.qmd index 6cae52ff2..804f242a5 100644 --- a/_subfiles/shared/_sec_hers_conclusions.qmd +++ b/_subfiles/shared/_sec_hers_conclusions.qmd @@ -1,6 +1,6 @@ Despite observational studies suggesting up to 35–80% fewer recurrent CHD events among hormone users than non-users, -[@HulleyStephen1998RToE] +[@HulleyStephen1998RToE], HERS found **no significant difference** in CHD event rates between the hormone therapy (HT) and placebo groups (relative hazard 0.99; 95% CI: 0.80–1.22) @@ -13,9 +13,8 @@ suggesting that HT may have early harmful effects that are later reversed. A key finding was that HT significantly altered lipid profiles: -- LDL cholesterol: 11% lower in the HT group. -- HDL cholesterol: 10% higher in the HT group. -[@HulleyStephen1998RToE] +- LDL cholesterol: 11% lower in the HT group [@HulleyStephen1998RToE]. +- HDL cholesterol: 10% higher in the HT group [@HulleyStephen1998RToE]. Yet these favorable lipid changes did not translate into fewer CHD events overall. diff --git a/_subfiles/shared/_sec_hers_intro.qmd b/_subfiles/shared/_sec_hers_intro.qmd index 45dc05434..6194b059b 100644 --- a/_subfiles/shared/_sec_hers_intro.qmd +++ b/_subfiles/shared/_sec_hers_intro.qmd @@ -7,8 +7,8 @@ The trial was conducted at 20 US clinical centers. Participants were randomized to receive either conjugated equine estrogens (0.625 mg/day) plus medroxyprogesterone acetate (2.5 mg/day) or a matching placebo [@HulleyStephen1998RToE]. -Women were followed for an average of 4.1 years. -[@HulleyStephen1998RToE] +Women were followed for an average of 4.1 years +[@HulleyStephen1998RToE]. -The primary outcome was nonfatal myocardial infarction or CHD death. -[@HulleyStephen1998RToE] +The primary outcome was nonfatal myocardial infarction or CHD death +[@HulleyStephen1998RToE]. 
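The 10% change-in-estimate check that patches 17–18 above refactor can be sketched on simulated data. Everything here (the simulation, the variable names `x`, `y`, `age`) is an illustrative stand-in, not the WCGS dataset:

```r
# Simulated illustration of the 10% change-in-estimate rule
# (toy data; not the WCGS dataset).
set.seed(1)
n   <- 500
age <- rnorm(n, mean = 46, sd = 5)
x   <- rbinom(n, 1, plogis(-2 + 0.05 * age))           # exposure, affected by age
y   <- rbinom(n, 1, plogis(-6 + 0.5 * x + 0.08 * age)) # outcome, affected by both
dat <- data.frame(y, x, age)

m_adj   <- glm(y ~ x + age, data = dat, family = binomial)
m_unadj <- glm(y ~ x,       data = dat, family = binomial)

b_adj   <- coef(m_adj)["x"]
b_unadj <- coef(m_unadj)["x"]

# Percent change in the exposure coefficient when `age` is dropped;
# a change of 10% or more flags `age` as a meaningful confounder.
pct_change <- 100 * abs(b_unadj - b_adj) / abs(b_adj)
pct_change
```

This is the single-covariate core of the loop in `_sec-pred-sel-primary.qmd`; the patched code repeats the same comparison once per candidate covariate, rebuilding the reduced formula with `as.formula()` as shown in patch 18.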
From 6780abf22629db534e035349455ce1b9ad23b735 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Mon, 20 Apr 2026 22:12:14 +0000 Subject: [PATCH 19/28] Merge origin/main and resolve conflicts preferring branch content Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/e6898eda-9491-46e6-9187-e1695c31543d Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- _quarto-book.yml | 2 + _quarto-website.yml | 6 + .../count-regression/_sec_pois-reg_intro.qmd | 4 +- .../_sec_epi202_probability.qmd | 9 + .../_sec_epi203_statistical-inference.qmd | 114 ++++ .../_sec_epi204_glms-and-survival.qmd | 64 +++ .../_sec_overview_bernoulli_models.qmd | 2 +- count-vars.qmd | 111 +--- data.qmd | 305 ++++++++++ exam-formula-sheet.qmd | 201 +------ exploratory-descriptive.qmd | 530 ++++++++++++++++++ poisson.qmd | 110 +++- probability.qmd | 14 - 13 files changed, 1147 insertions(+), 325 deletions(-) create mode 100644 _subfiles/exam-formula-sheet/_sec_epi202_probability.qmd create mode 100644 _subfiles/exam-formula-sheet/_sec_epi203_statistical-inference.qmd create mode 100644 _subfiles/exam-formula-sheet/_sec_epi204_glms-and-survival.qmd create mode 100644 data.qmd create mode 100644 exploratory-descriptive.qmd diff --git a/_quarto-book.yml b/_quarto-book.yml index 5524c13ad..66731fa71 100644 --- a/_quarto-book.yml +++ b/_quarto-book.yml @@ -38,6 +38,8 @@ book: appendices: - appendices-are-prereqs.qmd - math-prereqs.qmd + - data.qmd + - exploratory-descriptive.qmd - probability.qmd - basic-statistical-methods.qmd - classification.qmd diff --git a/_quarto-website.yml b/_quarto-website.yml index 4110be188..fff6e27d3 100644 --- a/_quarto-website.yml +++ b/_quarto-website.yml @@ -18,6 +18,8 @@ project: - top-ten-concepts.qmd - appendices-are-prereqs.qmd - math-prereqs.qmd + - data.qmd + - exploratory-descriptive.qmd - probability.qmd - basic-statistical-methods.qmd - classification.qmd @@ -85,6 +87,10 @@ 
website: menu: - text: "Math Prerequisites" href: math-prereqs.qmd + - text: "Data" + href: data.qmd + - text: "Exploratory and Descriptive Methods" + href: exploratory-descriptive.qmd - text: "Probability" href: probability.qmd - text: "Basic Statistical Methods" diff --git a/_subfiles/count-regression/_sec_pois-reg_intro.qmd b/_subfiles/count-regression/_sec_pois-reg_intro.qmd index 4a534af98..fc2b95770 100644 --- a/_subfiles/count-regression/_sec_pois-reg_intro.qmd +++ b/_subfiles/count-regression/_sec_pois-reg_intro.qmd @@ -2,7 +2,7 @@ ::: notes This chapter presents models for -[count data](probability.qmd#sec-count-vars) outcomes. +[count data](data.qmd#sec-count-vars) outcomes. With covariates, the event rate $\lambda$ becomes a function of the covariates @@ -69,7 +69,7 @@ $$ In contrast with the other covariates (represented by $\vX$), $t$ enters this expression with a $\log{}$ transformation and without a corresponding $\beta$ coefficient; -in other words, $\logf{t}$ is an [offset term](probability.qmd#def-offset). +in other words, $\logf{t}$ is an [offset term](poisson.qmd#def-offset). 
::: --- diff --git a/_subfiles/exam-formula-sheet/_sec_epi202_probability.qmd b/_subfiles/exam-formula-sheet/_sec_epi202_probability.qmd new file mode 100644 index 000000000..27cb8580e --- /dev/null +++ b/_subfiles/exam-formula-sheet/_sec_epi202_probability.qmd @@ -0,0 +1,9 @@ +{{< include _subfiles/probability/_eq_var_lincom.qmd >}} + +$$\Expf{Y} = \Expf{\Expf{Y \mid X}}$$ + +$$\Expf{Y \mid Z} = \Expf{\Expf{Y \mid X,Z} \mid Z}$$ + +$$\Var{Y} = \Expf{\Var{Y \mid X}} + \Var{\Expf{Y \mid X}}$$ + +$$\Cov{Y,Z} = \Expf{\Cov{Y,Z \mid X}} + \Cov{\Expf{Y \mid X}, \Expf{Z \mid X}}$$ diff --git a/_subfiles/exam-formula-sheet/_sec_epi203_statistical-inference.qmd b/_subfiles/exam-formula-sheet/_sec_epi203_statistical-inference.qmd new file mode 100644 index 000000000..8d7bd905c --- /dev/null +++ b/_subfiles/exam-formula-sheet/_sec_epi203_statistical-inference.qmd @@ -0,0 +1,114 @@ +$$\defLik$$ + +$$\defLogLik$$ + +$$\defScore$$ + +$$\defHess$$ + +$$\thmHessij$$ + +$$\defOInf$$ +$$\defEInf$$ + +$$\thmADistMLE$$ + +For one parameter $\theta_k$: + +$$ +\text{CI}_{1-\alpha}(\theta_k) += +\left[ +\hth_k +\pm +\za +\HSE{\hth_k} +\right] +$$ + +$$ +Z_k +\eqdef +\frac{\hth_k-\theta_{k,0}}{\HSE{\hth_k}} +\ddist +\ndist{0,1} +\qquad +\text{under }H_0:\theta_k=\theta_{k,0} +\qquad +z_k\text{ = observed }Z_k +$$ + +$$ +\ba +p\text{-value} +&= 2\Pr \paren{\abs{Z}\ge \abs{z_k}},\quad Z\sim \ndist{0,1} +\\ +&= 2\paren{1-\Phi\paren{\abs{z_k}}} +\ea +$$ + +For Gaussian linear regression: + +$$ +t_k +\eqdef +\frac{\hb_k-\beta_{k,0}}{\HSE{\hb_k}} +\ddist +t_{n-p} +\qquad +\text{under }H_0:\beta_k=\beta_{k,0} +\qquad +t_k^{\text{obs}}\text{ = observed }t_k +$$ + +$$ +\text{CI}_{1-\alpha}(\beta_k) += +\left[ +\hb_k +\pm +t_{n-p}\paren{1-\frac{\alpha}{2}} +\HSE{\hb_k} +\right] +$$ + +$$ +\ba +p\text{-value} +&= +2\Pr \paren{\abs{T_{n-p}}\ge \abs{t_k^{\text{obs}}}} +\\ +&= +2\paren{1-F_{t_{n-p}}\paren{\abs{t_k^{\text{obs}}}}} +\ea +$$ + +$$ +\text{CI}_{1-\alpha}\paren{\mu\paren{\vxs}} += +\left[ 
+\hat{\mu}\paren{\vxs} +\pm +t_{n-p}\paren{1-\frac{\alpha}{2}} +\HSE{\hat{\mu}\paren{\vxs}} +\right] +$$ + +$$ +\text{PI}_{1-\alpha}\paren{Y^*|\vxs} += +\left[ +\hat{Y}^* +\pm +t_{n-p}\paren{1-\frac{\alpha}{2}} +\HSE{Y^*-\hat{Y}^*} +\right] +$$ + +$$ +\HSE{Y^*-\hat{Y}^*} += +\hs\sqrt{1 + \tp{(\vxs)}(\mX'\mX)^{-1}\vxs} +$$ + +{{< pagebreak >}} diff --git a/_subfiles/exam-formula-sheet/_sec_epi204_glms-and-survival.qmd b/_subfiles/exam-formula-sheet/_sec_epi204_glms-and-survival.qmd new file mode 100644 index 000000000..42246f542 --- /dev/null +++ b/_subfiles/exam-formula-sheet/_sec_epi204_glms-and-survival.qmd @@ -0,0 +1,64 @@ +{{< include _subfiles/Intro-to-GLMs/_sec_glm_structure_brief.qmd >}} + +{{< include _subfiles/logistic-regression/_sec_logistic-link-diagram.qmd >}} + +$$\eqGLMRatio$$ + +## Estimates of odds ratios from 2x2 contingency tables + +$$\hth =\frac{ad}{bc}$$ + +{{< pagebreak >}} + + + + + + + + + + + +## Survival analysis + +### Probability distribution functions + +{{< include _subfiles/probability/_sec-survival-dist-fns.qmd >}} + +### Diagram of survival distribution function relationships + +{{< include _subfiles/shared/_surv_diagram.qmd >}} + +### Survival likelihood contributions, assuming non-informative censoring + +$$ +\ba +\p(Y=y,D=d) +&= [\pdf_T(y)]^{d} [\surv_T(y)]^{1-d} +\\ +&= [\haz_T(y)]^{d} [\surv_T(y)] +\ea +$$ + +### Nonparametric time-to-event distribution estimators + +{{< include _subfiles/intro-to-survival-analysis/_eq-def-km-surv-est.qmd >}} + +{{< include _subfiles/intro-to-survival-analysis/_eq-NA-cuhaz-est.qmd >}} + +{{< pagebreak >}} + +### Proportional hazards model structure + +{{< include _subfiles/proportional-hazards-models/_coxph-model-structure.qmd >}} + +{{< pagebreak >}} + +### Proportional hazards model partial likelihood formula: + +{{< include _subfiles/proportional-hazards-models/_def-ph-partial-lik.qmd >}} + +### Proportional hazards model baseline cumulative hazard estimator: + +{{< include 
_subfiles/proportional-hazards-models/_def-breslow-baseline-cuhaz-est.qmd >}} diff --git a/_subfiles/logistic-regression/_sec_overview_bernoulli_models.qmd b/_subfiles/logistic-regression/_sec_overview_bernoulli_models.qmd index 0398d2155..d89ee30fe 100644 --- a/_subfiles/logistic-regression/_sec_overview_bernoulli_models.qmd +++ b/_subfiles/logistic-regression/_sec_overview_bernoulli_models.qmd @@ -11,7 +11,7 @@ What is logistic regression? :::{#sol-def-logistic-regression} :::{#def-logistic-regression} -**Logistic regression** is a framework for modeling [binary](probability.qmd#def-binary) outcomes, conditional on one or more *predictors* (a.k.a. *covariates*). +**Logistic regression** is a framework for modeling [binary](data.qmd#def-binary) outcomes, conditional on one or more *predictors* (a.k.a. *covariates*). ::: ::: diff --git a/count-vars.qmd b/count-vars.qmd index b0a823986..98f2b294f 100644 --- a/count-vars.qmd +++ b/count-vars.qmd @@ -22,115 +22,8 @@ What are some examples of count variables? --- -:::{#def-exposure} - -##### Exposure magnitude - -For many count outcomes, -there is some sense of -an **exposure magnitude**, -such as -**population size**, or -**duration of observation**, -which multiplicatively rescales -the expected (mean) count. - -::: - ---- - -:::{#exr-exposure-magnitude} -What are some examples of exposure magnitudes? -::: - ---- - -::: {.solution .smaller} - -outcome | exposure units ------------------------------| ------------- -disease incidence | number of individuals exposed; time at risk -car accidents | miles driven -worksite accidents | person-hours worked -population size | size of habitat - -: Examples of exposure units {#tbl-exposure-units} - -::: - -:::: notes -Exposure units are similar to -the number of trials in a binomial distribution, -but **in non-binomial count outcomes, there can be more than one event per unit of exposure**. 
- -We can use $t$ to represent continuous-valued exposures/observation durations, -and $n$ to represent discrete-valued exposures. - -:::: - ---- - -::: {#def-event-rate} -#### Event rate - -:::: notes -For a count outcome $Y$ with exposure magnitude $t$, -the **event rate** (denoted $\lambda$) -is defined as the mean of $Y$ divided by the exposure magnitude. -That is: -:::: - -$$\mu \eqdef \Expp[Y|T=t]$$ - -$$\lambda \defeq \frac{\mu}{t}$$ {#eq-def-event-rate} -::: - - -::: notes -Event rate is somewhat analogous to odds in binary outcome models; -it typically serves as an intermediate transformation between the mean of the outcome and the linear component of the model. -However, in contrast with the odds function, the transformation $\lambda = \mu/t$ is *not* considered part of the Poisson model's link function, and it treats the exposure magnitude covariate differently from the other covariates. -::: - ---- - -:::{#thm-mean-vs-event-rate} -#### Transformation function from event rate to mean - -For a count variable with mean $\mu$, event rate $\lambda$, and exposure magnitude $t$: - -$$\tf \mu = \lambda \cdot t$${#eq-lambda-to-mu} - -::: - ---- - -::: solution -Start from definition of event rate and use algebra to solve for $\mu$. -::: - ---- - -@eq-lambda-to-mu is analogous to the inverse-odds function for binary variables. 
- ---- - -::: {#thm-non-exposed} -When the exposure magnitude is 0, there is no opportunity for events to occur: - -$$\Expp[Y|T=0] = 0$$ -::: - ---- - -::: proof -$$\Expp[Y|T=0] = \lambda \cdot 0 = 0$$ -::: - ---- - #### Probability distributions for count outcomes -- [Poisson distribution](#sec-poisson-dist) +- [Poisson distribution](probability.qmd#sec-poisson-dist) -- [Negative binomial distribution](#sec-nb-dist) +- [Negative binomial distribution](probability.qmd#sec-nb-dist) diff --git a/data.qmd b/data.qmd new file mode 100644 index 000000000..df9232c48 --- /dev/null +++ b/data.qmd @@ -0,0 +1,305 @@ +--- +title: "Data" +format: + html: default + revealjs: + output-file: data-slides.html + pdf: + output-file: data-handout.pdf +--- + +{{< include shared-config.qmd >}} + +# Types of variables + +Before summarizing data, +it helps to identify the **type** of each variable, +since the appropriate descriptive methods depend on the type. +Variables are broadly classified as either **numerical** or **categorical**, +and within each class there are important subtypes. + +--- + +:::{#def-numerical-var} + +#### Numerical variable + +A **numerical** (or **quantitative**) variable is a variable that takes +values on a numeric scale, +where arithmetic operations such as subtraction (and possibly division) +are meaningful. +Numerical variables may be further classified as +[interval](#def-interval-var) or [ratio](#def-ratio-var) variables, +and as [continuous](#def-continuous-var) or [discrete](#def-discrete-var). + +Examples: age, blood pressure, cholesterol, number of cigarettes per day. + +::: + +--- + +:::{#def-interval-var} + +#### Interval variable + +An **interval variable** is a numerical variable for which +differences between values are meaningful, +but there is no natural zero point +(so ratios of values are not meaningful). + +Example: temperature in degrees Celsius — +a difference of 10°C is meaningful, +but 30°C is not "twice as hot" as 15°C. 
+ +::: + +--- + +:::{#def-ratio-var} + +#### Ratio variable + +A **ratio variable** is a numerical variable with a natural zero point, +so that both differences and ratios of values are meaningful. + +Examples: age, weight, cholesterol, blood pressure, number of cigarettes per day — +a value of 0 means complete absence of the quantity. + +::: + +--- + +:::{#def-continuous-var} + +#### Continuous variable + +A **continuous variable** is a numerical variable whose possible values form an +interval (or union of intervals) of real numbers. +A continuous variable is always numerical. + +Examples: age, blood pressure, cholesterol, BMI. + +::: + +--- + +:::{#def-discrete-var} + +#### Discrete variable + +A **discrete variable** is a variable whose possible values form a +countable set. +Discrete variables include both **numerical** types +(such as [count variables](#def-count)) and +**categorical** types +(such as binary, nominal, and ordinal variables). +In contrast, continuous variables are always numerical. + +Examples: number of cigarettes per day (discrete numerical/count), +CHD event status (discrete categorical/binary). + +::: + +--- + +:::{#def-categorical-var} + +#### Categorical variable + +A **categorical** (or **qualitative**) variable is a variable that takes +values in a finite set of categories, +where arithmetic operations such as differences are not meaningful. +Categorical variables may be **nominal** (unordered) or **ordinal** (ordered), +and are always discrete. + +Examples: behavioral pattern (Type A1, A2, B3, B4), +race/ethnicity, self-reported health status. + +::: + +--- + +:::{#def-nominal-var} + +#### Nominal variable + +A **nominal variable** is a categorical variable whose categories have +no natural ordering. + +Examples: behavioral pattern (Type A1, A2, B3, B4), +race/ethnicity, blood type. 
+ +::: + +--- + +:::{#def-ordinal-var} + +#### Ordinal variable + +An **ordinal variable** is a categorical variable whose categories have a +natural ordering. + +Examples: self-reported health (poor, fair, good, very good, excellent), +weight category. + +::: + +--- + +:::{#def-binary-var} + +#### Binary variable + +A **binary variable** takes only two possible values, +often coded 0 (absence) and 1 (presence). +A binary variable is a special case of a nominal variable. + +Examples: CHD event (yes/no), current smoker (yes/no). + +::: + +--- + +@fig-var-taxonomy illustrates the relationships among these variable types. + +::: {#fig-var-taxonomy} + +```{r} +#| fig-alt: > +#| Taxonomy of variable types. +#| Variables divide into Numerical (quantitative) +#| and Categorical (qualitative). +#| Numerical variables subdivide into Interval (no true zero) and Ratio +#| (true zero); Ratio variables further divide into Continuous and Count. +#| Categorical variables subdivide into Nominal (unordered) and Ordinal +#| (ordered); Nominal includes Binary as a special case. +nodes <- tibble::tribble( + ~id, ~x, ~y, ~label, + "V", 5, 4.5, "Variables", + "N", 2.5, 3, "Numerical\n(quantitative)", + "C", 7.5, 3, "Categorical\n(qualitative)", + "I", 1, 1.5, "Interval\n(no true zero)\ne.g. temp. in \u00b0C", + "R", 4, 1.5, "Ratio\n(true zero)\ne.g. age, weight", + "CT", 3, 0, "Continuous\ne.g. age, BMI", + "CNT", 5, 0, "Count\n(discrete)\ne.g. cigs/day", + "NOM", 6.5, 1.5, "Nominal\n(unordered)\ne.g. blood type", + "ORD", 8.5, 1.5, "Ordinal\n(ordered)\ne.g. wt. category", + "BIN", 6.5, 0, "Binary\n(2 categories)\ne.g. 
CHD event" +) +edges <- tibble::tribble( + ~from, ~to, + "V", "N", + "V", "C", + "N", "I", + "N", "R", + "R", "CT", + "R", "CNT", + "C", "NOM", + "C", "ORD", + "NOM", "BIN" +) |> + dplyr::left_join( + dplyr::select(nodes, id, x, y), + by = c("from" = "id") + ) |> + dplyr::rename(x_from = x, y_from = y) |> + dplyr::left_join( + dplyr::select(nodes, id, x, y), + by = c("to" = "id") + ) |> + dplyr::rename(x_to = x, y_to = y) +fill_colors <- c( + "V" = "#f0f0f0", + "N" = "#d0e8ff", "C" = "#ffe8d0", + "I" = "#e8f4ff", "R" = "#e8f4ff", + "CT" = "#c8e8ff", "CNT" = "#c8e8ff", + "NOM" = "#ffe0c0", "ORD" = "#ffe0c0", + "BIN" = "#ffd0a0" +) +ggplot2::ggplot() + + ggplot2::aes() + + ggplot2::geom_segment( + data = edges, + ggplot2::aes( + x = x_from, y = y_from - 0.45, + xend = x_to, yend = y_to + 0.45 + ), + color = "grey50" + ) + + ggplot2::geom_tile( + data = nodes, + ggplot2::aes(x = x, y = y, fill = id), + width = 1.7, height = 0.8, + color = "grey40", linewidth = 0.4, + show.legend = FALSE + ) + + ggplot2::geom_text( + data = nodes, + ggplot2::aes(x = x, y = y, label = label), + size = 2.8, lineheight = 0.9 + ) + + ggplot2::scale_fill_manual(values = fill_colors) + + ggplot2::scale_y_continuous( + limits = c(-0.5, 5.1), expand = c(0, 0) + ) + + ggplot2::scale_x_continuous( + limits = c(0, 10), expand = c(0, 0) + ) + + ggplot2::theme_void() +``` + +Taxonomy of variable types. +Count variables are discrete and numerical; +binary, nominal, and ordinal variables are discrete and categorical. + +::: + +::: {.callout-note} +The continuous/discrete distinction cuts across the +numerical/categorical distinction. +Continuous variables are always numerical. +Discrete variables include both numerical types (e.g., count variables) +and categorical types (e.g., binary, nominal, and ordinal variables). +::: + +--- + +@tbl-wcgs-vartypes shows selected variables from the WCGS dataset and +their types. 
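+
+In R, these variable types map loosely onto vector classes;
+a minimal sketch with toy values (assumed for illustration, not taken from the WCGS data):
+
+```{r}
+#| label: variable-type-classes
+#| code-fold: false
+x_cont <- c(39.2, 48.5, 50.1)          # continuous numerical (ratio)
+x_count <- c(0L, 20L, 3L)              # count: discrete numerical
+x_nom <- factor(c("A1", "B3", "A2"))   # nominal: unordered factor
+x_ord <- factor(
+  c("fair", "good", "poor"),
+  levels = c("poor", "fair", "good"),
+  ordered = TRUE
+)                                      # ordinal: ordered factor
+is.numeric(x_cont)   # TRUE
+is.ordered(x_ord)    # TRUE
+```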
+ +```{r} +#| label: tbl-wcgs-vartypes +#| tbl-cap: "Selected WCGS variables and their types" +tibble::tribble( + ~Variable, ~Description, ~Type, ~Scale, + "`age`", "Age (years)", "Continuous", "Ratio", + "`chol`", "Total cholesterol", "Continuous", "Ratio", + "`sbp`", "Systolic blood pressure", "Continuous", "Ratio", + "`bmi`", "Body mass index (kg/m²)", "Continuous", "Ratio", + "`weight`", "Weight (lbs)", "Continuous", "Ratio", + "`ncigs`", "Cigarettes per day", "Count (discrete)", "Ratio", + "`chd69`", "CHD event by 1969", "Binary (nominal)", "Nominal", + "`smoke`", "Current smoking", "Binary (nominal)", "Nominal", + "`arcus`", "Arcus senilis", "Binary (nominal)", "Nominal", + "`dibpat`", "Behavioral pattern (A/B)", "Binary (nominal)", "Nominal", + "`behpat`", "Behavioral pattern (A1/A2/B3/B4)", "Nominal", "Nominal", + "`wghtcat`", "Weight category", "Ordinal", "Ordinal", + "`agec`", "Age group", "Ordinal", "Ordinal" +) |> + knitr::kable() +``` + +# Random variables + +## Binary variables {#sec-binary-vars} + +{{< include binary-vars.qmd >}} + +--- + +## Count variables {#sec-count-vars} + +{{< include count-vars.qmd >}} + +--- diff --git a/exam-formula-sheet.qmd b/exam-formula-sheet.qmd index df18dca4e..0cb63d479 100644 --- a/exam-formula-sheet.qmd +++ b/exam-formula-sheet.qmd @@ -1,11 +1,12 @@ --- title: "Exam Formula Sheet" -format: +format: pdf: - output-file: exam-formula-sheet-handout.pdf toc: false - html: - code-link: false + output-file: exam-formula-sheet-handout.pdf + html: default + revealjs: + output-file: exam-formula-sheet-slides.html --- ```{r, eval = FALSE, include = FALSE} @@ -13,202 +14,16 @@ format: # quarto render "exam-formula-sheet.qmd" --profile "handout" ``` -# Exam formula sheet - {{< include latex-macros/macros.qmd >}} # Epi 202: Probability -{{< include _subfiles/probability/_eq_var_lincom.qmd >}} - -$$\Expf{Y} = \Expf{\Expf{Y \mid X}}$$ - -$$\Expf{Y \mid Z} = \Expf{\Expf{Y \mid X,Z} | Z}$$ - -$$\Var{Y} = \Expf{\Var{Y \mid X}} + 
\Var{\Expf{Y \mid X}}$$ - -$$\Cov{Y,Z} = \Expf{\Cov{Y,Z \mid X}} + \Cov{\Expf{Y \mid X}, \Expf{Z \mid X}}$$ +{{< include _subfiles/exam-formula-sheet/_sec_epi202_probability.qmd >}} # Epi 203: Statistical inference -$$\defLik$$ - -$$\defLogLik$$ - -$$\defScore$$ - -$$\defHess$$ - -$$\thmHessij$$ - -$$\defOInf$$ -$$\defEInf$$ - -$$\thmADistMLE$$ - -For one parameter $\theta_k$: - -$$ -\text{CI}_{1-\alpha}(\theta_k) -= -\left[ -\hth_k -\pm -\za -\HSE{\hth_k} -\right] -$$ - -$$ -Z_k -\eqdef -\frac{\hth_k-\theta_{k,0}}{\HSE{\hth_k}} -\ddist -\ndist{0,1} -\qquad -\text{under }H_0:\theta_k=\theta_{k,0} -\qquad -z_k\text{ = observed }Z_k -$$ - -$$ -\ba -p\text{-value} -&= 2\Pr \paren{\abs{Z}\ge \abs{z_k}},\quad Z\sim \ndist{0,1} -\\ -&= 2\paren{1-\Phi\paren{\abs{z_k}}} -\ea -$$ - -For Gaussian linear regression: - -$$ -t_k -\eqdef -\frac{\hb_k-\beta_{k,0}}{\HSE{\hb_k}} -\ddist -t_{n-p} -\qquad -\text{under }H_0:\beta_k=\beta_{k,0} -\qquad -t_k^{\text{obs}}\text{ = observed }t_k -$$ - -$$ -\text{CI}_{1-\alpha}(\beta_k) -= -\left[ -\hb_k -\pm -t_{n-p}\paren{1-\frac{\alpha}{2}} -\HSE{\hb_k} -\right] -$$ - -$$ -\ba -p\text{-value} -&= -2\Pr \paren{\abs{T_{n-p}}\ge \abs{t_k^{\text{obs}}}} -\\ -&= -2\paren{1-F_{t_{n-p}}\paren{\abs{t_k^{\text{obs}}}}} -\ea -$$ - -$$ -\text{CI}_{1-\alpha}\paren{\mu\paren{\vxs}} -= -\left[ -\hat{\mu}\paren{\vxs} -\pm -t_{n-p}\paren{1-\frac{\alpha}{2}} -\HSE{\hat{\mu}\paren{\vxs}} -\right] -$$ - -$$ -\text{PI}_{1-\alpha}\paren{Y^*|\vxs} -= -\left[ -\hat{Y}^* -\pm -t_{n-p}\paren{1-\frac{\alpha}{2}} -\HSE{Y^*-\hat{Y}^*} -\right] -$$ - -$$ -\HSE{Y^*-\hat{Y}^*} -= -\hs\sqrt{1 + \tp{(\vxs)}(\mX'\mX)^{-1}\vxs} -$$ - -{{< pagebreak >}} +{{< include _subfiles/exam-formula-sheet/_sec_epi203_statistical-inference.qmd >}} # Epi 204: Generalized linear models -{{< include _subfiles/Intro-to-GLMs/_sec_glm_structure_brief.qmd >}} - -{{< include _subfiles/logistic-regression/_sec_logistic-link-diagram.qmd >}} - -$$\eqGLMRatio$$ - -## Estimates of odds ratios from 
2x2 contingency tables - -$$\hth =\frac{ad}{bc}$$ - -{{< pagebreak >}} - - - - - - - - - - - -## Survival analysis - -### Probability distribution functions - -{{< include _subfiles/probability/_sec-survival-dist-fns.qmd >}} - -### Diagram of survival distribution function relationships - -{{< include _subfiles/shared/_surv_diagram.qmd >}} - -### Survival likelihood contributions, assuming non-informative censoring - -$$ -\ba -\p(Y=y,D=d) -&= [\pdf_T(y)]^{d} [\surv_T(y)]^{1-d} -\\ -&= [\haz_T(y)]^{d} [\surv_T(y)] -\ea -$$ - -### Nonparametric time-to-event distribution estimators - -{{< include _subfiles/intro-to-survival-analysis/_eq-def-km-surv-est.qmd >}} - -{{< include _subfiles/intro-to-survival-analysis/_eq-NA-cuhaz-est.qmd >}} - -{{< pagebreak >}} - -### Proportional hazards model structure - -{{< include _subfiles/proportional-hazards-models/_coxph-model-structure.qmd >}} - -{{< pagebreak >}} - -### Proportional hazards model partial likelihood formula: - -{{< include _subfiles/proportional-hazards-models/_def-ph-partial-lik.qmd >}} - -### Proportional hazards model baseline cumulative hazard estimator: - -{{< include _subfiles/proportional-hazards-models/_def-breslow-baseline-cuhaz-est.qmd >}} +{{< include _subfiles/exam-formula-sheet/_sec_epi204_glms-and-survival.qmd >}} diff --git a/exploratory-descriptive.qmd b/exploratory-descriptive.qmd new file mode 100644 index 000000000..9f118c8c7 --- /dev/null +++ b/exploratory-descriptive.qmd @@ -0,0 +1,530 @@ +--- +title: "Exploratory and Descriptive Methods" +format: + html: default + revealjs: + output-file: exploratory-descriptive-slides.html + pdf: + output-file: exploratory-descriptive-handout.pdf +--- + +{{< include shared-config.qmd >}} + +::: {.callout-note} +This chapter is adapted from @vittinghoff2e, Chapter 2. +::: + +--- + +# Introduction + +Before fitting regression models, +it is good practice to explore and summarize the data. 
+Exploratory data analysis (EDA) serves several purposes: + +- Understand the distribution of each variable individually. +- Detect unusual values, outliers, and missing data. +- Identify patterns and relationships among variables. +- Motivate choices of model structure (e.g., transformations). +- Provide context for interpreting model results. + +--- + +In this chapter we illustrate exploratory and descriptive techniques +using the Western Collaborative Group Study (WCGS) dataset. + +{{< include _subfiles/logistic-regression/_sec_intro_wcgs.qmd >}} + +--- + +```{r} +#| label: load-wcgs +#| eval: false +### load the data directly from a UCSF website: +url <- paste0( + "https://regression.ucsf.edu/sites/g/files/", + "tkssra6706/f/wysiwyg/home/data/wcgs.dta" +) +wcgs <- haven::read_dta(url) +``` + +```{r} +#| label: load-wcgs-local +#| include: false +here::here() |> + fs::path("Data/wcgs.rda") |> + load() +``` + +```{r} +#| label: label-wcgs +wcgs <- wcgs |> + labelled::set_variable_labels( + age = "Age (years)", + chol = "Cholesterol (mg/dL)", + sbp = "Systolic BP (mmHg)", + dbp = "Diastolic BP (mmHg)", + bmi = "BMI (kg/m\u00B2)", + weight = "Weight (lbs)", + ncigs = "Cigarettes per day", + chd69 = "CHD event by 1969", + smoke = "Current smoker", + arcus = "Arcus senilis", + dibpat = "Behavioral pattern (A/B)", + behpat = "Behavioral pattern (A1/A2/B3/B4)", + wghtcat = "Weight category", + agec = "Age group" + ) +``` + +# Summarizing a single variable + +## Continuous variables + +### Measures of center + +:::{#def-mean} + +#### Sample mean + +The **sample mean** of a variable $x$ measured on $n$ observations is: + +$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$ + +::: + +--- + +:::{#def-median} + +#### Median + +The **median** is the value that divides the sorted data in half: +50% of observations fall below and 50% above. +The median is more robust to outliers than the mean. 
+ +::: + +--- + +### Measures of spread + +:::{#def-eda-variance} + +#### Sample variance + +The **sample variance** is: + +$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$ + +::: + +--- + +:::{#def-eda-sd} + +#### Sample standard deviation + +The **sample standard deviation** is the square root of the sample variance: + +$$s = \sqrt{s^2}$$ + +The standard deviation has the same units as the original variable, +making it easier to interpret than the variance. + +::: + +--- + +:::{#def-quantile} + +#### Quantiles and percentiles + +The **$p$th quantile** (or **$(100p)$th percentile**) of a variable +is the value below which a proportion $p$ of the data falls. + +Special cases: + +- The **median** is the 50th percentile ($p = 0.5$). +- The **first quartile** (Q1) is the 25th percentile ($p = 0.25$). +- The **third quartile** (Q3) is the 75th percentile ($p = 0.75$). + +::: + +--- + +:::{#def-iqr} + +#### Interquartile range + +The **interquartile range** (IQR) is: + +$$\text{IQR} = Q_3 - Q_1$$ + +The IQR contains the middle 50% of the data and +is more robust to outliers than the standard deviation. + +::: + +--- + +### Summary statistics in R + +The `summary()` function provides a quick overview of any variable: + +```{r} +#| label: summary-chol +#| code-fold: false +summary(wcgs$chol) +``` + +For a formatted summary table, +`gtsummary::tbl_summary()` is useful: + +```{r} +#| label: tbl-wcgs-summary-continuous +#| tbl-cap: "WCGS: descriptive statistics for continuous variables" +wcgs |> + dplyr::select(age, chol, sbp, dbp, bmi, weight) |> + gtsummary::tbl_summary( + statistic = list( + gtsummary::all_continuous() ~ + "{mean} ({sd}); {median} [{p25}, {p75}]" + ), + digits = gtsummary::all_continuous() ~ 1 + ) +``` + +## Binary and categorical variables + +For binary and categorical variables, +the natural descriptive statistics are +**frequencies** (counts) and **proportions** (relative frequencies). 
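+
+In base R, these can be computed with `table()` and `prop.table()`;
+a toy sketch with assumed values:
+
+```{r}
+#| label: toy-freq-prop
+#| code-fold: false
+status <- c("yes", "no", "no", "yes", "no")
+table(status)               # frequencies: no = 3, yes = 2
+prop.table(table(status))   # proportions: no = 0.6, yes = 0.4
+```
+
+The table below applies the same idea to several WCGS variables using `gtsummary`.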
+ +```{r} +#| label: tbl-wcgs-summary-cat +#| tbl-cap: "Frequency table for categorical variables in the WCGS dataset" +wcgs |> + dplyr::select(chd69, smoke, dibpat, behpat, wghtcat) |> + gtsummary::tbl_summary() +``` + +# Graphical methods + +Graphs are often more informative than summary statistics alone. +Different graph types are appropriate for different variable types. + +## Histograms + +A **histogram** displays the distribution of a continuous variable +by dividing its range into bins and counting the number of observations in each. + +```{r} +#| label: fig-hist-chol +#| fig-cap: "Histogram of total cholesterol in the WCGS dataset" +wcgs |> + ggplot2::ggplot() + + ggplot2::aes(x = chol) + + ggplot2::geom_histogram(bins = 30, fill = "steelblue", color = "white") + + ggplot2::labs(y = "Count") +``` + +--- + +The histogram in @fig-hist-chol shows that +cholesterol in the WCGS is roughly bell-shaped (approximately normal), +with most values between 150 and 350 mg/dL. + +## Density plots + +A **density plot** shows a smoothed estimate of the distribution, +which can be easier to interpret than a histogram. + +```{r} +#| label: fig-density-chol +#| fig-cap: "Density plot of total cholesterol in the WCGS dataset" +wcgs |> + ggplot2::ggplot() + + ggplot2::aes(x = chol) + + ggplot2::geom_density(fill = "steelblue", alpha = 0.5) + + ggplot2::labs(y = "Density") +``` + +## Box plots + +A **box plot** (or box-and-whisker plot) summarizes a distribution using +the median, quartiles, and extreme values. + +The box spans Q1 to Q3 (the IQR). +The line inside the box is the median. +The whiskers extend to the most extreme observations +within $1.5 \times \text{IQR}$ of the box. +Points beyond the whiskers are plotted individually as potential outliers. 
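+
+The whisker rule can be checked numerically on a toy vector
+(values assumed for illustration):
+
+```{r}
+#| label: toy-fence-rule
+#| code-fold: false
+x <- c(1, 2, 3, 4, 100)
+q1 <- quantile(x, 0.25)               # 2
+q3 <- quantile(x, 0.75)               # 4
+upper_fence <- q3 + 1.5 * (q3 - q1)   # 7
+x[x > upper_fence]                    # 100 is flagged as a potential outlier
+```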
+ +```{r} +#| label: fig-box-chol +#| fig-cap: "Box plot of total cholesterol in the WCGS dataset" +wcgs |> + ggplot2::ggplot() + + ggplot2::aes(y = chol) + + ggplot2::geom_boxplot(fill = "steelblue") + + ggplot2::theme(axis.text.x = ggplot2::element_blank()) +``` + +## Bar charts + +A **bar chart** displays the frequency or proportion of each category +of a categorical variable. + +```{r} +#| label: fig-bar-behpat +#| fig-cap: "Bar chart of behavioral pattern in the WCGS dataset" +wcgs |> + ggplot2::ggplot() + + ggplot2::aes(x = behpat) + + ggplot2::geom_bar(fill = "steelblue") + + ggplot2::labs(y = "Count") +``` + +## Normal probability (Q-Q) plots + +A **quantile-quantile (Q-Q) plot** compares the observed quantiles of a variable +to the theoretical quantiles of a normal distribution. +If the variable is approximately normally distributed, +the points fall close to the diagonal reference line. + +```{r} +#| label: fig-qq-chol +#| fig-cap: "Normal Q-Q plot for total cholesterol in the WCGS dataset" +wcgs |> + ggplot2::ggplot() + + ggplot2::aes(sample = chol) + + ggplot2::stat_qq() + + ggplot2::stat_qq_line(color = "red") + + ggplot2::labs(x = "Theoretical quantiles") +``` + +# Bivariate relationships + +## Two continuous variables: scatter plots and correlation + +### Scatter plots + +A **scatter plot** displays the relationship between two continuous variables +by plotting each observation as a point. + +```{r} +#| label: fig-scatter-chol-sbp +#| fig-cap: "Cholesterol vs. 
systolic BP in the WCGS dataset" +wcgs |> + ggplot2::ggplot() + + ggplot2::aes(x = sbp, y = chol) + + ggplot2::geom_point(alpha = 0.3) + + ggplot2::geom_smooth(method = "lm", se = TRUE, color = "red") +``` + +### Correlation + +:::{#def-correlation} + +#### Pearson correlation coefficient + +The **Pearson correlation coefficient** measures the strength and direction +of a linear relationship between two continuous variables $X$ and $Y$: + +$$r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})} +{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}}$$ + +The correlation coefficient takes values in $[-1, 1]$: + +- $r = 1$: perfect positive linear relationship. +- $r = -1$: perfect negative linear relationship. +- $r = 0$: no linear relationship. + +::: + +--- + +```{r} +#| label: cor-chol-sbp +#| code-fold: false +cor(wcgs$chol, wcgs$sbp, use = "complete.obs") +``` + +--- + +:::{.callout-caution} +Correlation measures only **linear** relationships. +Two variables can be strongly related but have a near-zero correlation +if the relationship is nonlinear. +Furthermore, correlation does not imply causation. +::: + +## Continuous outcome by a binary or categorical variable + +### Side-by-side box plots + +Side-by-side box plots are useful for comparing the distribution of a +continuous variable across groups defined by a categorical variable. 
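+
+Before plotting, group-wise centers can be checked numerically with base R's
+`tapply()`; a toy sketch with assumed values:
+
+```{r}
+#| label: toy-group-means
+#| code-fold: false
+chol <- c(200, 210, 180, 250, 240)
+smoker <- c("yes", "yes", "no", "no", "yes")
+tapply(chol, smoker, mean)   # no: 215; yes: approximately 216.67
+```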
+ +```{r} +#| label: fig-box-chol-by-smoke +#| fig-cap: "Box plots of cholesterol by smoking status in the WCGS dataset" +wcgs |> + ggplot2::ggplot() + + ggplot2::aes(x = smoke, y = chol, fill = smoke) + + ggplot2::geom_boxplot() + + ggplot2::theme(legend.position = "none") +``` + +--- + +```{r} +#| label: fig-box-chol-by-behpat +#| fig-cap: "Box plots of cholesterol by behavioral pattern in the WCGS dataset" +wcgs |> + ggplot2::ggplot() + + ggplot2::aes(x = behpat, y = chol, fill = behpat) + + ggplot2::geom_boxplot() + + ggplot2::theme(legend.position = "none") +``` + +### Summary statistics by group + +We can compute summary statistics separately for each group: + +```{r} +#| label: tbl-chol-by-chd +#| tbl-cap: "Cholesterol summary statistics by CHD status" +wcgs |> + dplyr::select(chol, sbp, bmi, chd69) |> + gtsummary::tbl_summary( + by = chd69, + statistic = list( + gtsummary::all_continuous() ~ + "{mean} ({sd})" + ), + digits = gtsummary::all_continuous() ~ 1 + ) |> + gtsummary::add_overall() |> + gtsummary::add_p() +``` + +## Two categorical variables: contingency tables + +### Cross-tabulation + +A **contingency table** (or **cross-tabulation**) displays the joint frequency +distribution of two categorical variables. 
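+
+The mechanics of cross-tabulation can be sketched on toy vectors
+(values assumed for illustration):
+
+```{r}
+#| label: toy-crosstab
+#| code-fold: false
+smoker <- c("yes", "no", "yes", "yes", "no")
+chd <- c("no", "no", "yes", "no", "no")
+table(Smoking = smoker, CHD = chd)
+```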
+ +```{r} +#| label: tbl-cross-smoke-chd +#| tbl-cap: "Cross-tabulation of smoking status and CHD event in WCGS" +#| code-fold: false +table(Smoking = wcgs$smoke, CHD = wcgs$chd69) +``` + +--- + +We can also compute row proportions to examine the +conditional distribution of CHD given smoking status: + +```{r} +#| label: tbl-cross-smoke-chd-prop +#| code-fold: false +prop.table(table(Smoking = wcgs$smoke, CHD = wcgs$chd69), margin = 1) +``` + +--- + +Using `gtsummary` for a formatted table: + +```{r} +#| label: tbl-gtsummary-smoke-chd +#| tbl-cap: "Smoking status and CHD event in WCGS" +wcgs |> + dplyr::select(smoke, chd69) |> + gtsummary::tbl_summary(by = chd69) |> + gtsummary::add_overall() |> + gtsummary::add_p() +``` + +# Data transformations + +When a continuous variable has a **right-skewed** distribution +(long tail to the right), +a **logarithmic transformation** often makes the distribution +more approximately normal. +Log-transformed variables also have a natural multiplicative interpretation: +a one-unit increase in $\log(x)$ corresponds to multiplying $x$ by $e \approx 2.72$. + +--- + +The WCGS dataset already contains the log-transformed versions of some variables: + +- `lnsbp`: $\log(\text{SBP})$ +- `lnwght`: $\log(\text{weight})$ + +--- + +```{r} +#| label: fig-hist-sbp-log +#| fig-cap: "SBP distribution: raw and log scale (WCGS)" +p1 <- wcgs |> + ggplot2::ggplot() + + ggplot2::aes(x = sbp) + + ggplot2::geom_histogram(bins = 30, fill = "steelblue", color = "white") + + ggplot2::labs(y = "Count", title = "Raw scale") + +p2 <- wcgs |> + ggplot2::ggplot() + + ggplot2::aes(x = lnsbp) + + ggplot2::geom_histogram(bins = 30, fill = "coral", color = "white") + + ggplot2::labs(y = "Count", title = "Log scale") + +gridExtra::grid.arrange(p1, p2, ncol = 2) +``` + +--- + +The log-transformed SBP is slightly more symmetric than the raw SBP. +Whether to use the transformation in a regression model depends on +assumptions of that model and the scientific question. 
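+
+The multiplicative interpretation of the (natural) log scale can be verified
+numerically with a toy value:
+
+```{r}
+#| label: toy-log-scale
+#| code-fold: false
+x <- 120
+log(x * exp(1)) - log(x)   # exactly 1: multiplying x by e adds 1 on the log scale
+```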
+ +# Putting it all together: an EDA workflow + +A typical exploratory data analysis workflow might proceed as follows: + +1. **Understand the dataset**: number of observations, variables, and their types. + +2. **Examine each variable individually**: + - Continuous: histogram, box plot, mean, SD, median, IQR. + - Categorical: bar chart, frequency table. + - Note missing values, unusual values, and outliers. + +3. **Examine relationships between variables**: + - Two continuous variables: scatter plot, correlation. + - Continuous by category: side-by-side box plots, group means. + - Two categorical variables: contingency table. + +4. **Consider transformations** for skewed continuous variables. + +5. **Summarize findings** to guide model-building decisions. + +--- + +```{r} +#| label: tbl-wcgs-all-summary +#| tbl-cap: "Summary of selected WCGS variables" +wcgs |> + dplyr::select( + age, chol, sbp, dbp, bmi, weight, ncigs, + chd69, smoke, dibpat, behpat + ) |> + gtsummary::tbl_summary() +``` + +# References {.unnumbered} + +::: {#refs} +::: diff --git a/poisson.qmd b/poisson.qmd index a432385be..297390759 100644 --- a/poisson.qmd +++ b/poisson.qmd @@ -199,18 +199,116 @@ For the variance, see . #### Accounting for exposure -If the exposures/observation durations, denoted $T=t$ or $N=n$, vary between observations, we model: +:::{#def-exposure} -$$\mu = \lambda\cdot t$$ +##### Exposure magnitude -$\lambda$ is interpreted as the "expected event rate per unit of exposure"; that is, +For many count outcomes, +there is some sense of +an **exposure magnitude**, +such as +**population size**, or +**duration of observation**, +which multiplicatively rescales +the expected (mean) count. -$$\lambda = \frac{\Expp[Y|T=t]}{t}$$ +::: + +--- + +:::{#exr-exposure-magnitude} +What are some examples of exposure magnitudes? 
+
+:::
+
+---
+
+::: {.solution .smaller}
+
+outcome | exposure units
+-----------------------------| -------------
+disease incidence | number of individuals exposed; time at risk
+car accidents | miles driven
+worksite accidents | person-hours worked
+population size | size of habitat
+
+: Examples of exposure units {#tbl-exposure-units}
+
+:::
+
+:::: notes
+Exposure units are similar to
+the number of trials in a binomial distribution,
+but **in non-binomial count outcomes, there can be more than one event per unit of exposure**.
+
+We can use $t$ to represent continuous-valued exposures/observation durations,
+and $n$ to represent discrete-valued exposures.
+
+::::
+
+---
+
+::: {#def-event-rate}
+#### Event rate
+
+:::: notes
+For a count outcome $Y$ with exposure magnitude $t$,
+the **event rate** (denoted $\lambda$)
+is defined as the mean of $Y$ divided by the exposure magnitude.
+That is:
+::::
+
+$$\mu \eqdef \Expp[Y|T=t]$$
+
+$$\lambda \eqdef \frac{\mu}{t}$$ {#eq-def-event-rate}
+:::
+
+::: notes
+Event rate is somewhat analogous to odds in binary outcome models;
+it typically serves as an intermediate transformation between the mean of the outcome and the linear component of the model.
+However, in contrast with the odds function, the transformation $\lambda = \mu/t$ is *not* considered part of the Poisson model's link function, and it treats the exposure magnitude covariate differently from the other covariates.
+:::
+
+---
+
+:::{#thm-mean-vs-event-rate}
+#### Transformation function from event rate to mean
+
+For a count variable with mean $\mu$, event rate $\lambda$, and exposure magnitude $t$:
+
+$$\mu = \lambda \cdot t$${#eq-lambda-to-mu}
+
+:::
+
+---
+
+::: solution
+Start from the definition of the event rate (@eq-def-event-rate)
+and use algebra to solve for $\mu$.
+:::
+
+---
+
+@eq-lambda-to-mu is analogous to the inverse-odds function for binary variables.
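+
+As a numeric sketch of @eq-lambda-to-mu
+(rate and exposure values assumed for illustration):
+
+```{r}
+#| label: toy-rate-to-mean
+#| code-fold: false
+lambda <- 2          # events per person-year (assumed rate)
+t <- c(0.5, 1, 3)    # person-years of observation
+lambda * t           # expected counts: 1, 2, 6
+```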
+ +--- + +::: {#thm-non-exposed} +When the exposure magnitude is 0, there is no opportunity for events to occur: + +$$\Expp[Y|T=0] = 0$$ +::: + +--- + +::: proof +$$\Expp[Y|T=0] = \lambda \cdot 0 = 0$$ +::: + +--- :::{.callout-important} -The exposure magnitude, $T$, is *similar* to a covariate in linear or logistic regression. -However, there is an important difference: in count regression, **there is no intercept corresponding to $\Expp[Y|T=0]$**. +The exposure magnitude, $T$, is *similar* to a covariate in linear or logistic regression. +However, there is an important difference: in count regression, **there is no intercept corresponding to $\Expp[Y|T=0]$**. In other words, this model assumes that if there is no exposure, there can't be any events. ::: diff --git a/probability.qmd b/probability.qmd index 42caae8d7..9bdd8191c 100644 --- a/probability.qmd +++ b/probability.qmd @@ -297,20 +297,6 @@ because the disease prevalence is relatively low (7%). ::: -# Random variables - -## Binary variables {#sec-binary-vars} - -{{< include binary-vars.qmd >}} - ---- - -## Count variables {#sec-count-vars} - -{{< include count-vars.qmd >}} - ---- - # Key probability distributions --- From c0c22758720412eb14f00acd522b89084bbc86eb Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Mon, 20 Apr 2026 23:46:19 +0000 Subject: [PATCH 20/28] Merge latest origin/main and resolve conflicts Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/9695e504-90e6-441e-b580-5ab45286eb1a Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- .github/workflows/preview.yml | 13 ++- .../d-morrison/div-anchors/_extension.yml | 10 +++ .../d-morrison/div-anchors/div-anchors.css | 30 +++++++ .../d-morrison/div-anchors/div-anchors.js | 84 +++++++++++++++++++ .../d-morrison/div-anchors/div-anchors.lua | 24 ++++++ _quarto.yml | 3 + _subfiles/math-prereqs/_sec_vector_calc.qmd | 2 +- 7 files changed, 163 insertions(+), 3 
deletions(-) create mode 100644 _extensions/d-morrison/div-anchors/_extension.yml create mode 100644 _extensions/d-morrison/div-anchors/div-anchors.css create mode 100644 _extensions/d-morrison/div-anchors/div-anchors.js create mode 100644 _extensions/d-morrison/div-anchors/div-anchors.lua diff --git a/.github/workflows/preview.yml b/.github/workflows/preview.yml index 48e619266..8dc39e833 100644 --- a/.github/workflows/preview.yml +++ b/.github/workflows/preview.yml @@ -55,6 +55,7 @@ jobs: R CMD INSTALL . - name: Install TinyTeX packages required for PDF rendering + if: github.event.action != 'closed' && contains(github.event.pull_request.labels.*.name, 'preview:pdf') run: | Rscript -e "tinytex::tlmgr_repo('https://mirror.ctan.org/systems/texlive/tlnet')" Rscript -e "tinytex::tlmgr_install(c('luacolor', 'lua-ul'))" @@ -67,9 +68,17 @@ jobs: restore-keys: | quarto-freeze-${{ runner.os }}- - - name: Render + - name: Render html if: github.event.action != 'closed' # skip the build if the PR has been closed - uses: quarto-dev/quarto-actions/render@v2 + run: quarto render --to html + + - name: Render revealjs + if: github.event.action != 'closed' && contains(github.event.pull_request.labels.*.name, 'preview:revealjs') + run: quarto render --to revealjs + + - name: Render pdf + if: github.event.action != 'closed' && contains(github.event.pull_request.labels.*.name, 'preview:pdf') + run: quarto render --to pdf - name: list files if: github.event.action != 'closed' diff --git a/_extensions/d-morrison/div-anchors/_extension.yml b/_extensions/d-morrison/div-anchors/_extension.yml new file mode 100644 index 000000000..728f1c427 --- /dev/null +++ b/_extensions/d-morrison/div-anchors/_extension.yml @@ -0,0 +1,10 @@ +title: Div Anchors +author: d-morrison +version: 1.0.0 +quarto-required: ">=1.3.0" +contributes: + filters: + - div-anchors.lua + format: + html: + css: div-anchors.css diff --git a/_extensions/d-morrison/div-anchors/div-anchors.css 
b/_extensions/d-morrison/div-anchors/div-anchors.css new file mode 100644 index 000000000..139414323 --- /dev/null +++ b/_extensions/d-morrison/div-anchors/div-anchors.css @@ -0,0 +1,30 @@ +/* div-anchors.css + * Visual URL anchor links for theorem and proof divs. + * Mirrors the behavior of Quarto's built-in `anchor-sections` option, + * which shows a small anchor icon on hover for HTML headings. + */ + +.div-anchor { + opacity: 0; + font-size: 0.875em; + font-weight: 400; + margin-left: 0.375em; + text-decoration: none; + color: inherit; + vertical-align: middle; + transition: opacity 0.2s ease; +} + +.div-anchor:hover, +.div-anchor:focus { + opacity: 1; + text-decoration: none; +} + +/* Show the anchor when hovering over any theorem or proof div. + * .theorem covers all theorem-type environments (lemma, corollary, + * proposition, conjecture, definition, example, exercise, algorithm). + * Proof-type environments are listed separately. */ +:is(.theorem, .proof, .remark, .solution):hover .div-anchor { + opacity: 1; +} diff --git a/_extensions/d-morrison/div-anchors/div-anchors.js b/_extensions/d-morrison/div-anchors/div-anchors.js new file mode 100644 index 000000000..7718f272d --- /dev/null +++ b/_extensions/d-morrison/div-anchors/div-anchors.js @@ -0,0 +1,84 @@ +/** + * div-anchors.js + * + * Adds visual URL anchor links to Quarto theorem and proof divs, + * similarly to the `anchor-sections` option for HTML headings. + * + * After the DOM is ready, finds all theorem/proof divs that have an `id` + * attribute and inserts an anchor link (§) right after the theorem-title + * or proof-title span. The anchor is hidden by default and appears on hover, + * controlled by div-anchors.css. + */ + +(function () { + "use strict"; + + /** + * CSS selector matching all Quarto theorem and proof div environments. 
+ * + * Quarto adds the `.theorem` class to all theorem-type divs + * (theorem, lemma, corollary, proposition, conjecture, definition, + * example, exercise, algorithm), so `.theorem[id]` covers all of them. + * Proof-type environments (.proof, .remark, .solution) are listed + * separately because they do not carry the `.theorem` class. + */ + var THEOREM_SELECTOR = [ + ".theorem[id]", /* covers: theorem, lemma, corollary, proposition, + conjecture, definition, example, + exercise, algorithm */ + ".proof[id]", + ".remark[id]", + ".solution[id]", + ].join(", "); + + /** + * Create a styled anchor element pointing to the given id. + * @param {string} id - The target element's id attribute value. + * @returns {HTMLAnchorElement} + */ + function createAnchor(id) { + var a = document.createElement("a"); + a.href = "#" + id; + a.className = "div-anchor"; + a.setAttribute("aria-label", "Anchor"); + var icon = document.createElement("span"); + icon.setAttribute("aria-hidden", "true"); + icon.textContent = "\u00A7"; + a.appendChild(icon); + return a; + } + + /** + * Add anchor links to all theorem/proof divs with ids on the page. 
+ */ + function addDivAnchors() { + var divs = document.querySelectorAll(THEOREM_SELECTOR); + divs.forEach(function (div) { + var id = div.id; + if (!id) { + return; + } + + // Find the theorem-title or proof-title span + var titleSpan = div.querySelector(".theorem-title, .proof-title"); + var anchor = createAnchor(id); + + if (titleSpan) { + // Insert anchor immediately after the title span + titleSpan.insertAdjacentElement("afterend", anchor); + } else { + // Fallback: prepend to the first paragraph + var firstPara = div.querySelector("p"); + if (firstPara) { + firstPara.insertAdjacentElement("afterbegin", anchor); + } + } + }); + } + + if (document.readyState === "loading") { + document.addEventListener("DOMContentLoaded", addDivAnchors); + } else { + addDivAnchors(); + } +})(); diff --git a/_extensions/d-morrison/div-anchors/div-anchors.lua b/_extensions/d-morrison/div-anchors/div-anchors.lua new file mode 100644 index 000000000..cb16aa818 --- /dev/null +++ b/_extensions/d-morrison/div-anchors/div-anchors.lua @@ -0,0 +1,24 @@ +-- div-anchors.lua +-- Registers a JavaScript dependency that adds visual URL anchor links +-- to theorem and proof divs for HTML output, similarly to the +-- `anchor-sections` option for headings. +-- +-- This approach uses JavaScript rather than a Lua AST filter because +-- Quarto processes theorem divs as custom AST nodes that are not visible +-- to Lua element filters at extension filter time. 
+ +function Meta(meta) + if quarto.doc.isFormat("html") then + quarto.doc.addHtmlDependency({ + name = "div-anchors", + version = "1.0.0", + scripts = { + { path = "div-anchors.js" } + }, + stylesheets = { + { path = "div-anchors.css" } + } + }) + end + return meta +end diff --git a/_quarto.yml b/_quarto.yml index 4d819ae06..031f352a9 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -1,6 +1,9 @@ profile: default: website +filters: + - d-morrison/div-anchors + editor_options: chunk_output_type: console diff --git a/_subfiles/math-prereqs/_sec_vector_calc.qmd b/_subfiles/math-prereqs/_sec_vector_calc.qmd index 37ddbe5c8..919c26faf 100644 --- a/_subfiles/math-prereqs/_sec_vector_calc.qmd +++ b/_subfiles/math-prereqs/_sec_vector_calc.qmd @@ -145,7 +145,7 @@ $$\derivf{z}{\vx} = \derivf{y}{\vx} \derivf{z}{y}$$ or in Euler/Lagrange notation: -$$(f(g(\vx)))' = \vec{g}'(\vx) f(g(\vx))$$ +$$(f(g(\vx)))' = \vec{g}'(\vx) f'(g(\vx))$$ ::: From 971da91c6b28971c0c445a4979a535c89e3da14c Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Tue, 21 Apr 2026 01:10:46 +0000 Subject: [PATCH 21/28] Apply review-thread citation and formatting fixes Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/61f6ae9a-9847-4c1c-be31-600cf450c0d8 Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- .github/copilot-instructions.md | 11 ------- _quarto-revealjs.yml | 1 + _sec-pred-sel-dag.qmd | 6 ++-- _sec-pred-sel-details.qmd | 2 +- _sec-pred-sel-goals.qmd | 6 ++-- _sec-pred-sel-multiple.qmd | 5 ++-- _sec-pred-sel-prediction.qmd | 27 ++++++++++++----- predictor-selection.qmd | 4 +-- references.bib | 51 +++++++++++++++++++++++++++++++++ 9 files changed, 84 insertions(+), 29 deletions(-) diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 792687c88..378e8a89c 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -131,17 +131,6 @@ and Quarto/Pandoc citation 
syntax (for example `[@MickeyGreenland1989]`), rather than plaintext author-date references. -## Attribution for Adapted Content - -When adapting any content from another source, -always include specific attribution in the chapter text. -Use a BibTeX-backed citation key -with a chapter or page locator -(for example, `[@dobson4e, Chapter 7]` or `[@vittinghoff2e, p. 194]`). -Do not use generic acknowledgements without locators -or plaintext author-title references -when a BibTeX citation is available. - ## Code Formatting Guidelines When adding or editing text in source code (such as comments, documentation strings, or error messages) or in Quarto document text chunks: diff --git a/_quarto-revealjs.yml b/_quarto-revealjs.yml index eb97e72d8..6d17df9b4 100644 --- a/_quarto-revealjs.yml +++ b/_quarto-revealjs.yml @@ -22,6 +22,7 @@ project: - "intro-bayes.qmd" - "common-mistakes.qmd" - "notation.qmd" + - "predictor-selection.qmd" - "intro-to-R.qmd" - "exam-formula-sheet.qmd" diff --git a/_sec-pred-sel-dag.qmd b/_sec-pred-sel-dag.qmd index a76df4d1b..93f3718cd 100644 --- a/_sec-pred-sel-dag.qmd +++ b/_sec-pred-sel-dag.qmd @@ -23,7 +23,7 @@ when estimating the total causal effect. ## Key concepts from DAG theory {#sec-pred-sel-dag-concepts} :::{#def-confounder} -### Confounder +#### Confounder A **confounder** of the relationship between an exposure $X$ and outcome $Y$ is a variable $C$ @@ -39,7 +39,7 @@ Adjusting for confounders removes confounding bias. --- :::{#def-mediator} -### Mediator +#### Mediator A **mediator** $M$ is a variable that lies on the causal pathway from the exposure $X$ to the outcome $Y$: @@ -61,7 +61,7 @@ adjust for $M$. 
--- :::{#def-collider} -### Collider +#### Collider A **collider** is a variable $L$ that is caused by two or more variables on a path: diff --git a/_sec-pred-sel-details.qmd b/_sec-pred-sel-details.qmd index 01b1719a8..d0591eaf5 100644 --- a/_sec-pred-sel-details.qmd +++ b/_sec-pred-sel-details.qmd @@ -3,7 +3,7 @@ ## Collinearity {#sec-pred-sel-collinearity} :::{#def-collinearity} -### Collinearity +#### Collinearity **Collinearity** refers to high correlation between predictors, sufficient to substantially degrade the precision of diff --git a/_sec-pred-sel-goals.qmd b/_sec-pred-sel-goals.qmd index 4e6f6b63b..0c1ad9508 100644 --- a/_sec-pred-sel-goals.qmd +++ b/_sec-pred-sel-goals.qmd @@ -11,7 +11,7 @@ the inferential goal of the analysis. The primary aim is to minimize prediction error for new observations. The predictors' individual causal interpretations are secondary. -> *Example*: Walter et al. (2001) developed a model to identify +> *Example*: @Walter2001 developed a model to identify > older adults at high risk of death following hospitalization, > using backward selection on a development set > and validating predictions in an independent hospital. @@ -29,7 +29,7 @@ protect against overfitting. The aim is to obtain a valid, minimally confounded estimate of the association between a specific exposure and the outcome. -> *Example*: Grodstein et al. (2001) estimated the effect of hormone +> *Example*: @Grodstein2001 estimated the effect of hormone > therapy on CHD risk in the Nurses' Health Study (NHS), > controlling for a wide range of known CHD risk factors > to minimize confounding. @@ -45,7 +45,7 @@ DAGs (see @sec-pred-sel-dag) are especially useful for this goal. The aim is to discover which predictors are independently associated with the outcome. -> *Example*: Orwoll et al. 
(1996) examined independent predictors of +> *Example*: @Orwoll1996 examined independent predictors of > bone mass in the Study of Osteoporotic Fractures (SOF), > including all predictors statistically significant at $p < 0.05$ > in age-adjusted models. diff --git a/_sec-pred-sel-multiple.qmd b/_sec-pred-sel-multiple.qmd index 7df29a3e1..c75f688f9 100644 --- a/_sec-pred-sel-multiple.qmd +++ b/_sec-pred-sel-multiple.qmd @@ -28,7 +28,8 @@ becomes even more important. ## Allen–Cady modified backward selection {#sec-pred-sel-allen-cady} -A modified backward selection procedure due to Allen and Cady (1982) +A modified backward selection procedure due to Allen and Cady +[@AllenCady1982] limits false-positive findings while still allowing variable selection. The procedure works as follows: @@ -74,7 +75,7 @@ Instead, the recommended approach is: ## Example: risk factors for CHD in HERS {#sec-pred-sel-chd-risk-factors} -Vittinghoff et al. (2003) identified independent CHD risk factors +@Vittinghoff2003 identified independent CHD risk factors in the HERS cohort using a multipredictor Cox model. The large number of outcome events ($n = 361$) allowed all previously identified risk factors with $p < 0.2$ diff --git a/_sec-pred-sel-prediction.qmd b/_sec-pred-sel-prediction.qmd index abbd47586..1db68a355 100644 --- a/_sec-pred-sel-prediction.qmd +++ b/_sec-pred-sel-prediction.qmd @@ -10,7 +10,7 @@ more parameters to estimate means more sampling variability, and the model can start to fit noise in the training data. :::{#def-overfitting} -### Overfitting +#### Overfitting **Overfitting** occurs when a model fits the training data well but predicts poorly for new observations. @@ -123,9 +123,10 @@ $$\text{AIC} = -2\hat\ell + 2p, \quad (see @sec-pred-sel-aic-bic). For **binary outcomes**, -common measures include: +common measures include +**balanced accuracy** and the **C-statistic**. 
-- **Balanced accuracy**:
+**Balanced accuracy**:
$$\text{BA} = \frac{\text{Sensitivity} + \text{Specificity}}{2}$$
where
$$\text{Sensitivity} = \Pr(\hat Y = 1 \mid Y = 1), \quad
@@ -137,11 +138,23 @@
$$\text{BA} = \frac{J + 1}{2}.$$
So BA is a linear transformation of $J$,
and BA and $J$ are equivalent for comparing models.

-- **C-statistic** (area under the ROC curve, AUC):
-$C = \Pr(\hat{\pi}_i > \hat{\pi}_j \mid Y_i = 1, Y_j = 0) + \frac{1}{2}\Pr(\hat{\pi}_i = \hat{\pi}_j \mid Y_i = 1, Y_j = 0)$,
+**C-statistic** (area under the ROC curve, AUC):
+$$
+C = \Pr(\hat{\pi}_i > \hat{\pi}_j \mid Y_i = 1, Y_j = 0) \\
+\qquad + \frac{1}{2}\Pr(\hat{\pi}_i = \hat{\pi}_j \mid Y_i = 1, Y_j = 0)
+$$
where $\hat{\pi}$ is the predicted event probability.
An empirical estimator is
-$\hat{C} = \frac{1}{n_1 n_0}\sum_{i:Y_i=1}\sum_{j:Y_j=0}\left[I(\hat{\pi}_i > \hat{\pi}_j) + \frac{1}{2}I(\hat{\pi}_i = \hat{\pi}_j)\right]$.
+$$
+\hat{C}
+= \frac{1}{n_1 n_0}
+\sum_{i:Y_i=1}
+\sum_{j:Y_j=0}
+\left[
+I(\hat{\pi}_i > \hat{\pi}_j)
++ \frac{1}{2}I(\hat{\pi}_i = \hat{\pi}_j)
+\right].
+$$
It measures discrimination:
the model's ability to rank events above non-events.
@@ -170,7 +183,7 @@
by evaluating predictions
on observations not used to estimate the model.
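Both measures can be computed directly from predicted probabilities. A minimal R sketch, using made-up outcomes and predicted probabilities (illustrative values only, not study data) and a 0.5 classification threshold:

```r
# Toy data (hypothetical, for illustration only):
# true outcomes and predicted event probabilities
y      <- c(1, 1, 0, 0)
pi_hat <- c(0.9, 0.4, 0.6, 0.2)

# Balanced accuracy at a 0.5 classification threshold
y_hat <- as.numeric(pi_hat > 0.5)
sens  <- mean(y_hat[y == 1] == 1)  # Pr(Yhat = 1 | Y = 1)
spec  <- mean(y_hat[y == 0] == 0)  # Pr(Yhat = 0 | Y = 0)
ba    <- (sens + spec) / 2

# Empirical C-statistic: score every (event, non-event) pair,
# counting ties as 1/2
conc   <- outer(pi_hat[y == 1], pi_hat[y == 0], ">")
ties   <- outer(pi_hat[y == 1], pi_hat[y == 0], "==")
c_stat <- mean(conc + 0.5 * ties)

# Self-check against hand-computed values for this toy data
stopifnot(ba == 0.5, c_stat == 0.75)
```

Note that balanced accuracy depends on the chosen classification threshold, while the C-statistic is threshold-free: it evaluates only the ranking of predicted probabilities.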
:::{#def-kfold} -### $k$-fold cross-validation +#### $k$-fold cross-validation In **$k$-fold cross-validation**: diff --git a/predictor-selection.qmd b/predictor-selection.qmd index 528907a23..b0d23db95 100644 --- a/predictor-selection.qmd +++ b/predictor-selection.qmd @@ -15,8 +15,8 @@ format: This content is adapted from: -- @vittinghoff2e, Chapter 10 -- @heinze2018variable +- [@vittinghoff2e, Chapter 10] +- [@heinze2018variable] # Introduction {#sec-pred-sel-intro} diff --git a/references.bib b/references.bib index b1c3a76f4..c62a46bc0 100644 --- a/references.bib +++ b/references.bib @@ -31,6 +31,57 @@ @book{vittinghoff2e url={https://doi.org/10.1007/978-1-4614-1353-0} } +@book{AllenCady1982, + title={Analyzing experimental data by regression}, + author={Allen, David M. and Cady, Frederick B.}, + year={1982}, + publisher={Lifetime Learning Publications}, + address={Belmont, CA} +} + +@article{Walter2001, + title={Development and validation of a prognostic index for 1-year mortality in older adults after hospitalization}, + author={Walter, Louise C. and Brand, Richard J. and Counsell, Steven R. and Palmer, R. Mark and Landefeld, C. Seth and Fortinsky, Richard H. and Covinsky, Kenneth E.}, + journal={JAMA}, + year={2001}, + volume={285}, + number={23}, + pages={2987--2994}, + doi={10.1001/jama.285.23.2987} +} + +@article{Grodstein2001, + title={Postmenopausal hormone use and secondary prevention of coronary events in the Nurses' Health Study: a prospective, observational study}, + author={Grodstein, Francine and Manson, JoAnn E. and Stampfer, Meir J. and Colditz, Graham A. and Willett, Walter C. and Rosner, Bernard and Speizer, Frank E. and Hennekens, Charles H.}, + journal={Annals of Internal Medicine}, + year={2001}, + volume={135}, + number={1}, + pages={1--8}, + doi={10.7326/0003-4819-135-1-200107030-00007} +} + +@article{Orwoll1996, + title={Axial bone mass in older women}, + author={Orwoll, Eric S. and Bauer, Douglas C. and Vogt, Thomas M. 
and Fox, Kathleen M. and Study of Osteoporotic Fractures Research Group}, + journal={Annals of Internal Medicine}, + year={1996}, + volume={124}, + number={3}, + pages={187--196} +} + +@article{Vittinghoff2003, + title={Risk factors and secondary prevention in women with heart disease: the Heart and Estrogen/progestin Replacement Study}, + author={Vittinghoff, Eric and Shlipak, Michael G. and Varosy, Peter D. and McDermott, Mary and Robbins, John A. and Harris, Tamara B. and Newman, Anne B. and Bauer, Douglas C.}, + journal={Annals of Internal Medicine}, + year={2003}, + volume={139}, + number={4}, + pages={274--281}, + doi={10.7326/0003-4819-139-4-200308190-00006} +} + @book{klein2003survival, title={Survival analysis: techniques for censored and truncated data}, author={Klein, John P and Moeschberger, Melvin L}, From 77ca8cb2c706d4260770a27dea94c35fd5c73249 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Thu, 23 Apr 2026 00:25:15 +0000 Subject: [PATCH 22/28] chore: plan unresolved comment follow-ups Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/16d2a398-6c43-4e34-ad49-a39af311e9f5 Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- renv/activate.R | 118 +++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 111 insertions(+), 7 deletions(-) diff --git a/renv/activate.R b/renv/activate.R index 815945c22..f42fe916e 100644 --- a/renv/activate.R +++ b/renv/activate.R @@ -3,6 +3,7 @@ local({ # the requested version of renv version <- "1.2.2" + attr(version, "md5") <- "32f21c37db37534c7968cbcf3dbb3157" attr(version, "sha") <- NULL # the project directory @@ -168,6 +169,16 @@ local({ if (quiet) return(invisible()) + # also check for config environment variables that should suppress messages + # https://github.com/rstudio/renv/issues/2214 + enabled <- Sys.getenv("RENV_CONFIG_STARTUP_QUIET", unset = NA) + if (!is.na(enabled) && tolower(enabled) %in% c("true", "1")) + 
return(invisible()) + + enabled <- Sys.getenv("RENV_CONFIG_SYNCHRONIZED_CHECK", unset = NA) + if (!is.na(enabled) && tolower(enabled) %in% c("false", "0")) + return(invisible()) + msg <- sprintf(fmt, ...) cat(msg, file = stdout(), sep = if (appendLF) "\n" else "") @@ -215,6 +226,20 @@ local({ section <- header(sprintf("Bootstrapping renv %s", friendly)) catf(section) + # ensure the target library path exists; required for file.copy(..., recursive = TRUE) + dir.create(library, showWarnings = FALSE, recursive = TRUE) + + # try to install renv from cache + md5 <- attr(version, "md5", exact = TRUE) + if (length(md5)) { + pkgpath <- renv_bootstrap_find(version) + if (length(pkgpath) && file.exists(pkgpath)) { + ok <- file.copy(pkgpath, library, recursive = TRUE) + if (isTRUE(ok)) + return(invisible()) + } + } + # attempt to download renv catf("- Downloading renv ... ", appendLF = FALSE) withCallingHandlers( @@ -240,7 +265,6 @@ local({ # add empty line to break up bootstrapping from normal output catf("") - return(invisible()) } @@ -257,12 +281,20 @@ local({ repos <- Sys.getenv("RENV_CONFIG_REPOS_OVERRIDE", unset = NA) if (!is.na(repos)) { - # check for RSPM; if set, use a fallback repository for renv - rspm <- Sys.getenv("RSPM", unset = NA) - if (identical(rspm, repos)) - repos <- c(RSPM = rspm, CRAN = cran) + # split on ';' if present + parts <- strsplit(repos, ";", fixed = TRUE)[[1L]] - return(repos) + # split into named repositories if present + idx <- regexpr("=", parts, fixed = TRUE) + keys <- substring(parts, 1L, idx - 1L) + vals <- substring(parts, idx + 1L) + names(vals) <- keys + + # if we have a single unnamed repository, call it CRAN + if (length(vals) == 1L && identical(keys, "")) + names(vals) <- "CRAN" + + return(vals) } @@ -511,6 +543,51 @@ local({ } + renv_bootstrap_find <- function(version) { + + path <- renv_bootstrap_find_cache(version) + if (length(path) && file.exists(path)) { + catf("- Using renv %s from global package cache", version) + 
return(path) + } + + } + + renv_bootstrap_find_cache <- function(version) { + + md5 <- attr(version, "md5", exact = TRUE) + if (is.null(md5)) + return() + + # infer path to renv cache + cache <- Sys.getenv("RENV_PATHS_CACHE", unset = "") + if (!nzchar(cache)) { + root <- Sys.getenv("RENV_PATHS_ROOT", unset = NA) + if (!is.na(root)) + cache <- file.path(root, "cache") + } + + if (!nzchar(cache)) { + tools <- asNamespace("tools") + if (is.function(tools$R_user_dir)) { + root <- tools$R_user_dir("renv", "cache") + cache <- file.path(root, "cache") + } + } + + # start completing path to cache + file.path( + cache, + renv_bootstrap_cache_version(), + renv_bootstrap_platform_prefix(), + "renv", + version, + md5, + "renv" + ) + + } + renv_bootstrap_download_tarball <- function(version) { # if the user has provided the path to a tarball via @@ -979,7 +1056,7 @@ local({ renv_bootstrap_validate_version_release <- function(version, description) { expected <- description[["Version"]] - is.character(expected) && identical(expected, version) + is.character(expected) && identical(c(expected), c(version)) } renv_bootstrap_hash_text <- function(text) { @@ -1158,6 +1235,21 @@ local({ } renv_bootstrap_run <- function(project, libpath, version) { + tryCatch( + renv_bootstrap_run_impl(project, libpath, version), + error = function(e) { + msg <- paste( + "failed to bootstrap renv: the project will not be loaded.", + paste("Reason:", conditionMessage(e)), + "Use `renv::activate()` to re-initialize the project.", + sep = "\n" + ) + warning(msg, call. 
= FALSE) + } + ) + } + + renv_bootstrap_run_impl <- function(project, libpath, version) { # perform bootstrap bootstrap(version, libpath) @@ -1181,6 +1273,18 @@ local({ } + renv_bootstrap_cache_version <- function() { + # NOTE: users should normally not override the cache version; + # this is provided just to make testing easier + Sys.getenv("RENV_CACHE_VERSION", unset = "v5") + } + + renv_bootstrap_cache_version_previous <- function() { + version <- renv_bootstrap_cache_version() + number <- as.integer(substring(version, 2L)) + paste("v", number - 1L, sep = "") + } + renv_json_read <- function(file = NULL, text = NULL) { jlerr <- NULL From 1342e63fefa2ee1d11fd0a08416355c6c4000b21 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Thu, 23 Apr 2026 01:00:40 +0000 Subject: [PATCH 23/28] Address unresolved review comments in predictor-selection goals and attribution guidance Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/16d2a398-6c43-4e34-ad49-a39af311e9f5 Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- .github/copilot-instructions.md | 9 ++- _sec-pred-sel-goals.qmd | 40 ++++++----- renv/activate.R | 118 ++------------------------------ 3 files changed, 37 insertions(+), 130 deletions(-) diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 50a8a1392..7f627fa87 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -80,11 +80,16 @@ Attribution should appear: so that readers can locate the source material. Use Pandoc's citation locator syntax: `[@citekey, Chapter 8]` or `[@citekey, p. 194]`. Include both chapter and page when both are known. +Do **not** invent page numbers. +Only cite a page after opening the source PDF +and confirming the exact page. +If you have not verified a page directly from the source PDF, +cite chapter-level location only. 
Examples of attribution with page numbers: -- "Adapted from @vittinghoff2e [Chapter 10, pp. 257–275]." -- "The following example is based on @kleinbaum2010logistic [Chapter 8, pp. 194–212]." +- "Adapted from @vittinghoff2e [Chapter 10]." +- "The following example is based on @kleinbaum2010logistic [Chapter 8]." Do **not** reproduce verbatim text from copyrighted sources without clear quotation marks and attribution. Paraphrase and summarize with a citation instead. diff --git a/_sec-pred-sel-goals.qmd b/_sec-pred-sel-goals.qmd index 0c1ad9508..43d249390 100644 --- a/_sec-pred-sel-goals.qmd +++ b/_sec-pred-sel-goals.qmd @@ -11,10 +11,12 @@ the inferential goal of the analysis. The primary aim is to minimize prediction error for new observations. The predictors' individual causal interpretations are secondary. -> *Example*: @Walter2001 developed a model to identify -> older adults at high risk of death following hospitalization, -> using backward selection on a development set -> and validating predictions in an independent hospital. +:::{#exm-pred-sel-goal-pred} +*Example*: @Walter2001 developed a model to identify +older adults at high risk of death following hospitalization, +using backward selection on a development set +and validating predictions in an independent hospital. +::: For this goal, we use measures of **prediction error** (PE) to choose among candidate models. @@ -29,10 +31,12 @@ protect against overfitting. The aim is to obtain a valid, minimally confounded estimate of the association between a specific exposure and the outcome. -> *Example*: @Grodstein2001 estimated the effect of hormone -> therapy on CHD risk in the Nurses' Health Study (NHS), -> controlling for a wide range of known CHD risk factors -> to minimize confounding. 
+:::{#exm-pred-sel-goal-eval} +*Example*: @Grodstein2001 estimated the effect of hormone +therapy on CHD risk in the Nurses' Health Study (NHS), +controlling for a wide range of known CHD risk factors +to minimize confounding. +::: In observational studies, confounding is the primary concern. In randomized trials (like HERS), treatment is already unconfounded @@ -45,10 +49,12 @@ DAGs (see @sec-pred-sel-dag) are especially useful for this goal. The aim is to discover which predictors are independently associated with the outcome. -> *Example*: @Orwoll1996 examined independent predictors of -> bone mass in the Study of Osteoporotic Fractures (SOF), -> including all predictors statistically significant at $p < 0.05$ -> in age-adjusted models. +:::{#exm-pred-sel-goal-id} +*Example*: @Orwoll1996 examined independent predictors of +bone mass in the Study of Osteoporotic Fractures (SOF), +including all predictors statistically significant at $p < 0.05$ +in age-adjusted models. +::: This goal is the most challenging because inferences about multiple predictors are of direct interest, @@ -60,10 +66,10 @@ Inclusive models with cautious interpretation are preferred ::: notes In practice, these goals often overlap. -For example, in HERS, -the primary objective was to determine whether estrogen plus progestin therapy -alters the risk of CHD events in postmenopausal women with established CHD -[@HulleyStephen1998RToE]. -This corresponds to estimating a treatment effect (Goal 2). +For example, +an analysis can estimate the causal effect of treatment (Goal 2) +while also selecting additional prognostic covariates +to improve individualized risk prediction performance (Goal 1) +in the same model-building workflow. 
::: diff --git a/renv/activate.R b/renv/activate.R index f42fe916e..815945c22 100644 --- a/renv/activate.R +++ b/renv/activate.R @@ -3,7 +3,6 @@ local({ # the requested version of renv version <- "1.2.2" - attr(version, "md5") <- "32f21c37db37534c7968cbcf3dbb3157" attr(version, "sha") <- NULL # the project directory @@ -169,16 +168,6 @@ local({ if (quiet) return(invisible()) - # also check for config environment variables that should suppress messages - # https://github.com/rstudio/renv/issues/2214 - enabled <- Sys.getenv("RENV_CONFIG_STARTUP_QUIET", unset = NA) - if (!is.na(enabled) && tolower(enabled) %in% c("true", "1")) - return(invisible()) - - enabled <- Sys.getenv("RENV_CONFIG_SYNCHRONIZED_CHECK", unset = NA) - if (!is.na(enabled) && tolower(enabled) %in% c("false", "0")) - return(invisible()) - msg <- sprintf(fmt, ...) cat(msg, file = stdout(), sep = if (appendLF) "\n" else "") @@ -226,20 +215,6 @@ local({ section <- header(sprintf("Bootstrapping renv %s", friendly)) catf(section) - # ensure the target library path exists; required for file.copy(..., recursive = TRUE) - dir.create(library, showWarnings = FALSE, recursive = TRUE) - - # try to install renv from cache - md5 <- attr(version, "md5", exact = TRUE) - if (length(md5)) { - pkgpath <- renv_bootstrap_find(version) - if (length(pkgpath) && file.exists(pkgpath)) { - ok <- file.copy(pkgpath, library, recursive = TRUE) - if (isTRUE(ok)) - return(invisible()) - } - } - # attempt to download renv catf("- Downloading renv ... 
", appendLF = FALSE) withCallingHandlers( @@ -265,6 +240,7 @@ local({ # add empty line to break up bootstrapping from normal output catf("") + return(invisible()) } @@ -281,20 +257,12 @@ local({ repos <- Sys.getenv("RENV_CONFIG_REPOS_OVERRIDE", unset = NA) if (!is.na(repos)) { - # split on ';' if present - parts <- strsplit(repos, ";", fixed = TRUE)[[1L]] + # check for RSPM; if set, use a fallback repository for renv + rspm <- Sys.getenv("RSPM", unset = NA) + if (identical(rspm, repos)) + repos <- c(RSPM = rspm, CRAN = cran) - # split into named repositories if present - idx <- regexpr("=", parts, fixed = TRUE) - keys <- substring(parts, 1L, idx - 1L) - vals <- substring(parts, idx + 1L) - names(vals) <- keys - - # if we have a single unnamed repository, call it CRAN - if (length(vals) == 1L && identical(keys, "")) - names(vals) <- "CRAN" - - return(vals) + return(repos) } @@ -543,51 +511,6 @@ local({ } - renv_bootstrap_find <- function(version) { - - path <- renv_bootstrap_find_cache(version) - if (length(path) && file.exists(path)) { - catf("- Using renv %s from global package cache", version) - return(path) - } - - } - - renv_bootstrap_find_cache <- function(version) { - - md5 <- attr(version, "md5", exact = TRUE) - if (is.null(md5)) - return() - - # infer path to renv cache - cache <- Sys.getenv("RENV_PATHS_CACHE", unset = "") - if (!nzchar(cache)) { - root <- Sys.getenv("RENV_PATHS_ROOT", unset = NA) - if (!is.na(root)) - cache <- file.path(root, "cache") - } - - if (!nzchar(cache)) { - tools <- asNamespace("tools") - if (is.function(tools$R_user_dir)) { - root <- tools$R_user_dir("renv", "cache") - cache <- file.path(root, "cache") - } - } - - # start completing path to cache - file.path( - cache, - renv_bootstrap_cache_version(), - renv_bootstrap_platform_prefix(), - "renv", - version, - md5, - "renv" - ) - - } - renv_bootstrap_download_tarball <- function(version) { # if the user has provided the path to a tarball via @@ -1056,7 +979,7 @@ local({ 
renv_bootstrap_validate_version_release <- function(version, description) { expected <- description[["Version"]] - is.character(expected) && identical(c(expected), c(version)) + is.character(expected) && identical(expected, version) } renv_bootstrap_hash_text <- function(text) { @@ -1235,21 +1158,6 @@ local({ } renv_bootstrap_run <- function(project, libpath, version) { - tryCatch( - renv_bootstrap_run_impl(project, libpath, version), - error = function(e) { - msg <- paste( - "failed to bootstrap renv: the project will not be loaded.", - paste("Reason:", conditionMessage(e)), - "Use `renv::activate()` to re-initialize the project.", - sep = "\n" - ) - warning(msg, call. = FALSE) - } - ) - } - - renv_bootstrap_run_impl <- function(project, libpath, version) { # perform bootstrap bootstrap(version, libpath) @@ -1273,18 +1181,6 @@ local({ } - renv_bootstrap_cache_version <- function() { - # NOTE: users should normally not override the cache version; - # this is provided just to make testing easier - Sys.getenv("RENV_CACHE_VERSION", unset = "v5") - } - - renv_bootstrap_cache_version_previous <- function() { - version <- renv_bootstrap_cache_version() - number <- as.integer(substring(version, 2L)) - paste("v", number - 1L, sep = "") - } - renv_json_read <- function(file = NULL, text = NULL) { jlerr <- NULL From fcd80741c92974384a1473ec25a5d2192afb3e72 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Thu, 23 Apr 2026 01:13:24 +0000 Subject: [PATCH 24/28] chore: plan merge-conflict resolution Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/a1c7e567-a3a5-49bb-8990-460a20877456 Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- renv/activate.R | 118 +++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 111 insertions(+), 7 deletions(-) diff --git a/renv/activate.R b/renv/activate.R index 815945c22..f42fe916e 100644 --- a/renv/activate.R +++ b/renv/activate.R @@ -3,6 
+3,7 @@ local({ # the requested version of renv version <- "1.2.2" + attr(version, "md5") <- "32f21c37db37534c7968cbcf3dbb3157" attr(version, "sha") <- NULL # the project directory @@ -168,6 +169,16 @@ local({ if (quiet) return(invisible()) + # also check for config environment variables that should suppress messages + # https://github.com/rstudio/renv/issues/2214 + enabled <- Sys.getenv("RENV_CONFIG_STARTUP_QUIET", unset = NA) + if (!is.na(enabled) && tolower(enabled) %in% c("true", "1")) + return(invisible()) + + enabled <- Sys.getenv("RENV_CONFIG_SYNCHRONIZED_CHECK", unset = NA) + if (!is.na(enabled) && tolower(enabled) %in% c("false", "0")) + return(invisible()) + msg <- sprintf(fmt, ...) cat(msg, file = stdout(), sep = if (appendLF) "\n" else "") @@ -215,6 +226,20 @@ local({ section <- header(sprintf("Bootstrapping renv %s", friendly)) catf(section) + # ensure the target library path exists; required for file.copy(..., recursive = TRUE) + dir.create(library, showWarnings = FALSE, recursive = TRUE) + + # try to install renv from cache + md5 <- attr(version, "md5", exact = TRUE) + if (length(md5)) { + pkgpath <- renv_bootstrap_find(version) + if (length(pkgpath) && file.exists(pkgpath)) { + ok <- file.copy(pkgpath, library, recursive = TRUE) + if (isTRUE(ok)) + return(invisible()) + } + } + # attempt to download renv catf("- Downloading renv ... 
", appendLF = FALSE) withCallingHandlers( @@ -240,7 +265,6 @@ local({ # add empty line to break up bootstrapping from normal output catf("") - return(invisible()) } @@ -257,12 +281,20 @@ local({ repos <- Sys.getenv("RENV_CONFIG_REPOS_OVERRIDE", unset = NA) if (!is.na(repos)) { - # check for RSPM; if set, use a fallback repository for renv - rspm <- Sys.getenv("RSPM", unset = NA) - if (identical(rspm, repos)) - repos <- c(RSPM = rspm, CRAN = cran) + # split on ';' if present + parts <- strsplit(repos, ";", fixed = TRUE)[[1L]] - return(repos) + # split into named repositories if present + idx <- regexpr("=", parts, fixed = TRUE) + keys <- substring(parts, 1L, idx - 1L) + vals <- substring(parts, idx + 1L) + names(vals) <- keys + + # if we have a single unnamed repository, call it CRAN + if (length(vals) == 1L && identical(keys, "")) + names(vals) <- "CRAN" + + return(vals) } @@ -511,6 +543,51 @@ local({ } + renv_bootstrap_find <- function(version) { + + path <- renv_bootstrap_find_cache(version) + if (length(path) && file.exists(path)) { + catf("- Using renv %s from global package cache", version) + return(path) + } + + } + + renv_bootstrap_find_cache <- function(version) { + + md5 <- attr(version, "md5", exact = TRUE) + if (is.null(md5)) + return() + + # infer path to renv cache + cache <- Sys.getenv("RENV_PATHS_CACHE", unset = "") + if (!nzchar(cache)) { + root <- Sys.getenv("RENV_PATHS_ROOT", unset = NA) + if (!is.na(root)) + cache <- file.path(root, "cache") + } + + if (!nzchar(cache)) { + tools <- asNamespace("tools") + if (is.function(tools$R_user_dir)) { + root <- tools$R_user_dir("renv", "cache") + cache <- file.path(root, "cache") + } + } + + # start completing path to cache + file.path( + cache, + renv_bootstrap_cache_version(), + renv_bootstrap_platform_prefix(), + "renv", + version, + md5, + "renv" + ) + + } + renv_bootstrap_download_tarball <- function(version) { # if the user has provided the path to a tarball via @@ -979,7 +1056,7 @@ local({ 
renv_bootstrap_validate_version_release <- function(version, description) { expected <- description[["Version"]] - is.character(expected) && identical(expected, version) + is.character(expected) && identical(c(expected), c(version)) } renv_bootstrap_hash_text <- function(text) { @@ -1158,6 +1235,21 @@ local({ } renv_bootstrap_run <- function(project, libpath, version) { + tryCatch( + renv_bootstrap_run_impl(project, libpath, version), + error = function(e) { + msg <- paste( + "failed to bootstrap renv: the project will not be loaded.", + paste("Reason:", conditionMessage(e)), + "Use `renv::activate()` to re-initialize the project.", + sep = "\n" + ) + warning(msg, call. = FALSE) + } + ) + } + + renv_bootstrap_run_impl <- function(project, libpath, version) { # perform bootstrap bootstrap(version, libpath) @@ -1181,6 +1273,18 @@ local({ } + renv_bootstrap_cache_version <- function() { + # NOTE: users should normally not override the cache version; + # this is provided just to make testing easier + Sys.getenv("RENV_CACHE_VERSION", unset = "v5") + } + + renv_bootstrap_cache_version_previous <- function() { + version <- renv_bootstrap_cache_version() + number <- as.integer(substring(version, 2L)) + paste("v", number - 1L, sep = "") + } + renv_json_read <- function(file = NULL, text = NULL) { jlerr <- NULL From f2c42d47ed6d4613fb998e49bed0b3c72e255b52 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Thu, 23 Apr 2026 01:57:25 +0000 Subject: [PATCH 25/28] Merge origin/main to resolve PR conflicts Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/a1c7e567-a3a5-49bb-8990-460a20877456 Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- .github/workflows/preview.yml | 1 + 1 file changed, 1 insertion(+) diff --git a/.github/workflows/preview.yml b/.github/workflows/preview.yml index 6c3c6682f..ca8b8ae29 100644 --- a/.github/workflows/preview.yml +++ b/.github/workflows/preview.yml @@ 
-19,6 +19,7 @@ on: - '*.qmd' - '_subfiles/**' - '*.scss' + - 'latex-macros' concurrency: preview-${{ github.ref }} From d9103826fd58d5f6433ee38644852598032c9cc7 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Thu, 23 Apr 2026 02:02:45 +0000 Subject: [PATCH 26/28] Restore page-number locator examples in citation-format guidance Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/310a2689-2fe1-4988-85ab-66e02f3247cb Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- .github/copilot-instructions.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 7f627fa87..84b11ccc0 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -86,10 +86,11 @@ and confirming the exact page. If you have not verified a page directly from the source PDF, cite chapter-level location only. -Examples of attribution with page numbers: +Examples of attribution with page numbers +(formatting examples): -- "Adapted from @vittinghoff2e [Chapter 10]." -- "The following example is based on @kleinbaum2010logistic [Chapter 8]." +- "Adapted from [@vittinghoff2e, Chapter 10, p. 194]." +- "The following example is based on [@kleinbaum2010logistic, Chapter 8, p. 230]." Do **not** reproduce verbatim text from copyrighted sources without clear quotation marks and attribution. Paraphrase and summarize with a citation instead. 
From f09ad6f50634ca6fc7989ec9d960a72b24e2b022 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Thu, 23 Apr 2026 06:09:09 +0000 Subject: [PATCH 27/28] Merge origin/main and resolve latex-macros submodule conflict Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/d5884b3b-b081-4e11-8703-5073239f0b17 Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- .github/workflows/preview.yml | 4 +- _quarto-website.yml | 37 ++ .../_sec_linreg_model_selection.qmd | 369 ++++++++++++++++++ references.bib | 10 + 4 files changed, 418 insertions(+), 2 deletions(-) diff --git a/.github/workflows/preview.yml b/.github/workflows/preview.yml index ca8b8ae29..108f03392 100644 --- a/.github/workflows/preview.yml +++ b/.github/workflows/preview.yml @@ -92,11 +92,11 @@ jobs: - name: Render docx (website profile) if: github.event.action != 'closed' && contains(github.event.pull_request.labels.*.name, 'preview:docx') - run: quarto render --profile website --to docx + run: quarto render --profile website --to docx --no-clean - name: Render revealjs (website profile) if: github.event.action != 'closed' && contains(github.event.pull_request.labels.*.name, 'preview:revealjs') - run: quarto render --profile website --to revealjs + run: quarto render --profile website --to revealjs --no-clean - name: Render html (website profile) if: github.event.action != 'closed' # skip the build if the PR has been closed diff --git a/_quarto-website.yml b/_quarto-website.yml index c3d9ac594..d40f6e357 100644 --- a/_quarto-website.yml +++ b/_quarto-website.yml @@ -120,6 +120,32 @@ website: bibliography: references.bib callouty-theorem: + thm: + override-title: false # do not have the filter set + # or override a callout title + callout: # parameters of the wrapping callout + # see Quarto callout docs for details + type: note + appearance: minimal + prp: + override-title: false + callout: + type: note + appearance: minimal + 
exr: + override-title: false + callout: + type: tip + appearance: minimal + proof: # also support wrapping proof-like environments + override-title: true # override the callout title with + # the proof-type label, + # appending the element name when provided + callout: + type: note + appearance: default + collapse: true + icon: true solution: override-title: true callout: @@ -127,6 +153,13 @@ callouty-theorem: appearance: default collapse: true icon: true + remark: + override-title: true + callout: + type: tip + appearance: default + collapse: false + icon: true format: html: @@ -182,6 +215,8 @@ format: fig-height: 5 fig-width: 7 pdf: + filters: + - callouty-theorem geometry: - top=15mm - bottom=20mm @@ -207,6 +242,8 @@ format: \appto\cl@section{\quarto@cl@chapter}} \makeatother docx: + filters: + - callouty-theorem toc: true toc-depth: 3 number-sections: true diff --git a/_subfiles/Linear-models-overview/_sec_linreg_model_selection.qmd b/_subfiles/Linear-models-overview/_sec_linreg_model_selection.qmd index 3145ba8c7..a3110efed 100644 --- a/_subfiles/Linear-models-overview/_sec_linreg_model_selection.qmd +++ b/_subfiles/Linear-models-overview/_sec_linreg_model_selection.qmd @@ -236,6 +236,375 @@ $$ } $$ +#### Train/validation/test splits + +In many applications, +we split the data into +training, +validation, +and test sets. +This extends the basic train/test split +by separating model tuning +from final model assessment. +This framework follows +[@james2021islr2e, pp. 198-201]. + +- Use the **training set** to estimate model parameters. +- Use the **validation set** to compare candidate models, + choose transformations, + or tune hyperparameters. +- Use the **test set** once at the end + to estimate final out-of-sample performance. + +Keeping the test set untouched during model building +helps avoid optimistic bias +from repeatedly trying many models. +If data are limited, +$k$-fold cross-validation +can replace a single validation split +for more stable tuning. 
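The $k$-fold alternative mentioned above can be sketched in a few lines of base R. This is a minimal illustration on `mtcars` (the dataset used in the example that follows); the seed, the choice $k = 5$, and the two candidate formulas are arbitrary choices for this sketch, not part of the worked example below:

```r
# 5-fold cross-validation as a tuning alternative to a single
# validation split: each candidate model is trained on k-1 folds
# and scored on the held-out fold, then the fold RMSEs are averaged.
set.seed(1)
n <- nrow(mtcars)
k <- 5
fold <- sample(rep(1:k, length.out = n))  # random fold labels, sizes as equal as possible

candidate_formulas <- list(
  linear    = mpg ~ wt,
  quadratic = mpg ~ wt + I(wt^2)
)

cv_rmse <- sapply(candidate_formulas, function(f) {
  fold_rmse <- sapply(1:k, function(j) {
    fit <- lm(f, data = mtcars[fold != j, ])             # train on the other folds
    pred <- predict(fit, newdata = mtcars[fold == j, ])  # predict the held-out fold
    sqrt(mean((mtcars$mpg[fold == j] - pred)^2))
  })
  mean(fold_rmse)  # average held-out RMSE across the k folds
})

cv_rmse  # compare candidates; smaller is better
```

Unlike a single validation split, every observation is used both for fitting and for evaluation, which stabilizes the comparison when $n$ is small.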
+ +:::{#exm-train-validation-test-split} +#### Numerical example + +The example below +uses R's built-in `mtcars` dataset +($n=32$ cars) +to predict fuel efficiency (`mpg`) +from vehicle weight (`wt`). +It uses one random split +to compare linear, +quadratic, +cubic, +and quartic models +(`mpg ~ wt`, +`mpg ~ wt + I(wt^2)`, +`mpg ~ wt + I(wt^2) + I(wt^3)`, +and `mpg ~ wt + I(wt^2) + I(wt^3) + I(wt^4)`) +on a validation set, +then reports the chosen model's test RMSE +on untouched test data +[@james2021islr2e, p. 213]. + +```{r} +#| code-fold: false +set.seed(108) +n <- nrow(mtcars) + +n_train <- floor(0.6 * n) +n_valid <- floor(0.2 * n) + +idx_train <- sample.int(n, size = n_train) +idx_remaining <- setdiff(seq_len(n), idx_train) +idx_valid <- sample(idx_remaining, size = n_valid) +idx_test <- setdiff(idx_remaining, idx_valid) + +split_sizes <- tibble::tibble( + split = c("training", "validation", "test"), + n = c(length(idx_train), length(idx_valid), length(idx_test)) +) +``` + +:::{#tbl-train-validation-test-split-sizes} +```{r} +split_sizes +``` + +Sizes of the training, +validation, +and test splits +::: + +```{r} +train_dat <- mtcars[idx_train, ] +valid_dat <- mtcars[idx_valid, ] +test_dat <- mtcars[idx_test, ] + +model_linear <- lm(mpg ~ wt, data = train_dat) +model_quadratic <- lm(mpg ~ wt + I(wt^2), data = train_dat) +model_cubic <- lm(mpg ~ wt + I(wt^2) + I(wt^3), data = train_dat) +model_quartic <- lm( + mpg ~ wt + I(wt^2) + I(wt^3) + I(wt^4), + data = train_dat +) + +compute_rmse <- function(model, data) { + sqrt(mean((data$mpg - predict(model, newdata = data))^2)) +} + +candidate_models <- list( + linear = model_linear, + quadratic = model_quadratic, + cubic = model_cubic, + quartic = model_quartic +) + +validation_results <- tibble::tibble( + model = names(candidate_models) +) |> + dplyr::mutate( + training_RMSE = purrr::map_dbl( + model, + \(model_name) compute_rmse(candidate_models[[model_name]], train_dat) + ), + validation_RMSE = purrr::map_dbl( + 
model, + \(model_name) compute_rmse(candidate_models[[model_name]], valid_dat) + ) + ) + +model_labels <- c( + linear = "linear model", + quadratic = "quadratic model", + cubic = "cubic model", + quartic = "quartic model" +) + +chosen_model_name <- + validation_results |> + dplyr::arrange(validation_RMSE) |> + dplyr::slice(1) |> + dplyr::pull(model) + +chosen_model_label <- model_labels[[chosen_model_name]] + +chosen_model <- candidate_models[[chosen_model_name]] +chosen_model_validation_rmse <- compute_rmse(chosen_model, valid_dat) + +performance_comparison <- tibble::tibble( + split = c("validation", "test"), + model = chosen_model_name, + RMSE = c( + chosen_model_validation_rmse, + compute_rmse(chosen_model, test_dat) + ) +) +``` + +```{r} +#| label: setup-model-fit-figure-partitions +partition_colors <- c( + training = "#1b9e77", + validation = "#d95f02", + test = "#7570b3" +) +partition_shapes <- c( + training = 16, + validation = 17, + test = 15 +) +wt_range <- range(c(train_dat$wt, valid_dat$wt, test_dat$wt)) +mpg_range <- range(c(train_dat$mpg, valid_dat$mpg, test_dat$mpg)) +wt_grid <- seq(wt_range[1], wt_range[2], length.out = 200) +``` + +```{r} +#| label: fig-linear-model-superimposed +#| fig-cap: Linear model fit superimposed on data partitions +#| fig-alt: Scatterplot of miles per gallon versus vehicle weight. +#| Points are colored and shaped by training, +#| validation, +#| and test partitions. +#| The fitted linear regression line is superimposed. 
+#| code-fold: true +plot( + train_dat$wt, + train_dat$mpg, + xlab = "wt", + ylab = "mpg", + pch = partition_shapes[["training"]], + xlim = wt_range, + ylim = mpg_range, + col = partition_colors[["training"]] +) +points( + valid_dat$wt, + valid_dat$mpg, + pch = partition_shapes[["validation"]], + col = partition_colors[["validation"]] +) +points( + test_dat$wt, + test_dat$mpg, + pch = partition_shapes[["test"]], + col = partition_colors[["test"]] +) +abline(model_linear, col = "blue", lwd = 2) +legend( + "topright", + legend = c("training", "validation", "test", "linear fit"), + col = c(unname(partition_colors), "blue"), + pch = c(unname(partition_shapes), NA), + lty = c(NA, NA, NA, 1), + lwd = c(NA, NA, NA, 2), + bty = "n" +) +``` + +```{r} +#| label: fig-quadratic-model-superimposed +#| fig-cap: Quadratic model fit superimposed on data partitions +#| fig-alt: Scatterplot of miles per gallon versus vehicle weight. +#| Points are colored and shaped by training, +#| validation, +#| and test partitions. +#| The fitted quadratic curve is superimposed. 
+#| code-fold: true +plot( + train_dat$wt, + train_dat$mpg, + xlab = "wt", + ylab = "mpg", + pch = partition_shapes[["training"]], + xlim = wt_range, + ylim = mpg_range, + col = partition_colors[["training"]] +) +points( + valid_dat$wt, + valid_dat$mpg, + pch = partition_shapes[["validation"]], + col = partition_colors[["validation"]] +) +points( + test_dat$wt, + test_dat$mpg, + pch = partition_shapes[["test"]], + col = partition_colors[["test"]] +) +quadratic_pred <- predict( + model_quadratic, + newdata = data.frame(wt = wt_grid) +) +lines(wt_grid, quadratic_pred, col = "blue", lwd = 2) +legend( + "topright", + legend = c("training", "validation", "test", "quadratic fit"), + col = c(unname(partition_colors), "blue"), + pch = c(unname(partition_shapes), NA), + lty = c(NA, NA, NA, 1), + lwd = c(NA, NA, NA, 2), + bty = "n" +) +``` + +```{r} +#| label: fig-cubic-model-superimposed +#| fig-cap: Cubic model fit superimposed on data partitions +#| fig-alt: Scatterplot of miles per gallon versus vehicle weight. +#| Points are colored and shaped by training, +#| validation, +#| and test partitions. +#| The fitted cubic curve is superimposed. 
+#| code-fold: true +plot( + train_dat$wt, + train_dat$mpg, + xlab = "wt", + ylab = "mpg", + pch = partition_shapes[["training"]], + xlim = wt_range, + ylim = mpg_range, + col = partition_colors[["training"]] +) +points( + valid_dat$wt, + valid_dat$mpg, + pch = partition_shapes[["validation"]], + col = partition_colors[["validation"]] +) +points( + test_dat$wt, + test_dat$mpg, + pch = partition_shapes[["test"]], + col = partition_colors[["test"]] +) +cubic_pred <- predict( + model_cubic, + newdata = data.frame(wt = wt_grid) +) +lines(wt_grid, cubic_pred, col = "blue", lwd = 2) +legend( + "topright", + legend = c("training", "validation", "test", "cubic fit"), + col = c(unname(partition_colors), "blue"), + pch = c(unname(partition_shapes), NA), + lty = c(NA, NA, NA, 1), + lwd = c(NA, NA, NA, 2), + bty = "n" +) +``` + +```{r} +#| label: fig-quartic-model-superimposed +#| fig-cap: Quartic model fit superimposed on data partitions +#| fig-alt: Scatterplot of miles per gallon versus vehicle weight. +#| Points are colored and shaped by training, +#| validation, +#| and test partitions. +#| The fitted quartic curve is superimposed. 
+#| code-fold: true
+plot(
+  train_dat$wt,
+  train_dat$mpg,
+  xlab = "wt",
+  ylab = "mpg",
+  pch = partition_shapes[["training"]],
+  xlim = wt_range,
+  ylim = mpg_range,
+  col = partition_colors[["training"]]
+)
+points(
+  valid_dat$wt,
+  valid_dat$mpg,
+  pch = partition_shapes[["validation"]],
+  col = partition_colors[["validation"]]
+)
+points(
+  test_dat$wt,
+  test_dat$mpg,
+  pch = partition_shapes[["test"]],
+  col = partition_colors[["test"]]
+)
+quartic_pred <- predict(
+  model_quartic,
+  newdata = data.frame(wt = wt_grid)
+)
+lines(wt_grid, quartic_pred, col = "blue", lwd = 2)
+legend(
+  "topright",
+  legend = c("training", "validation", "test", "quartic fit"),
+  col = c(unname(partition_colors), "blue"),
+  pch = c(unname(partition_shapes), NA),
+  lty = c(NA, NA, NA, 1),
+  lwd = c(NA, NA, NA, 2),
+  bty = "n"
+)
+```
+
+:::{#tbl-train-validation-results}
+```{r}
+validation_results
+```
+
+Training and validation RMSE for candidate models
+:::
+
+The chosen model is
+the `r chosen_model_label`,
+because it has the lowest validation RMSE.
+ +:::{#tbl-train-validation-test-rmse} +```{r} +performance_comparison +``` + +Validation and test RMSE +for the chosen model +::: + +::: + --- #### Cross-validation diff --git a/references.bib b/references.bib index c62a46bc0..d51cde7d5 100644 --- a/references.bib +++ b/references.bib @@ -493,6 +493,16 @@ @book{james2013introduction url={https://www.statlearning.com/} } +@book{james2021islr2e, + title={An Introduction to Statistical Learning: with Applications in {R}}, + author={James, Gareth and Witten, Daniela and Hastie, Trevor and Tibshirani, Robert}, + edition={2}, + year={2021}, + publisher={Springer}, + doi={10.1007/978-1-0716-1418-1}, + url={https://www.statlearning.com/} +} + @book{polin2011fetal, title={Fetal and neonatal physiology}, From b93c01df3f52590cea677a1caaa32f84bac36216 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Mon, 27 Apr 2026 07:19:52 +0000 Subject: [PATCH 28/28] Merge origin/main and resolve latex-macros submodule conflict again Agent-Logs-Url: https://github.com/d-morrison/rme/sessions/e649b933-3b3d-4bf2-b3e1-e8b2d46c2fd1 Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- .github/copilot-instructions.md | 46 ++++++- _extensions/slidebreak/_extension.yml | 7 + _extensions/slidebreak/slidebreak.lua | 6 + _quarto-book.yml | 1 + _quarto-website.yml | 3 + .../_sec_linreg_diagnostics.qmd | 29 +++- .../_sec_empirical_quantiles.qmd | 130 ++++++++++++++++++ estimation.qmd | 2 - nonparametric-models.qmd | 20 +++ probability.qmd | 21 +++ 10 files changed, 255 insertions(+), 10 deletions(-) create mode 100644 _extensions/slidebreak/_extension.yml create mode 100644 _extensions/slidebreak/slidebreak.lua create mode 100644 _subfiles/nonparametric-models/_sec_empirical_quantiles.qmd create mode 100644 nonparametric-models.qmd diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 84b11ccc0..c78c3638c 100644 --- 
a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -5,7 +5,7 @@ > > Before committing ANY changes to `.qmd`, `.R`, or config files: > -> 1. **Run `quarto render` on the FULL repository** (not individual files) +> 1. **Run `quarto render --to html` on each chapter whose subfiles were edited** (not individual subfiles; not the full book; HTML format only unless you are specifically fixing a non-HTML format issue) > 2. **Verify it completes successfully** (exit code 0, no errors) > 3. **Run linter on changed files**: > - For R files: `lintr::lint("path/to/file.R")` @@ -19,7 +19,8 @@ > **CRITICAL RULES:** > - **CI is NOT the test** - you must test locally BEFORE pushing > - **NEVER rely on CI to discover rendering, lint, or spelling errors** - that's your job -> - **ALWAYS run full `quarto render`** - testing individual files is insufficient +> - **Only render HTML format** (`--to html`) unless specifically fixing a non-HTML format issue +> - **Only render edited chapters** — run `quarto render --to html` on the parent `.qmd` that includes each edited subfile; do not render the full book > - **Only fix lint/spell issues in code YOU changed** - don't fix unrelated pre-existing issues > - **This is a hard requirement - no exceptions, no excuses** @@ -28,7 +29,7 @@ **CRITICAL**: Do not make assumptions about what code will do - always test it yourself. - **Test your changes**: Run the actual commands to verify functionality -- **Run `quarto render` on FULL repository**: Testing individual files misses cross-file issues +- **Run `quarto render --to html` on each edited chapter**: Only render chapters whose subfiles were edited, in HTML format (not the full book, not all formats) - **Verify output**: Check that expected files are created with correct content - **Never claim success without evidence**: Only report that something works after you've confirmed it yourself @@ -176,6 +177,19 @@ Example: # First, check if the input is valid. 
Then, process the data. Finally, return the result. ``` +## Definition Formatting + +When introducing or editing formal statistical definitions in `.qmd` files: + +- Use a definition div with an id beginning `#def-` +- Put the definition title in a heading inside the div + and choose the heading level to match the surrounding section depth + (for example, `####` or `#####`) +- If a definition uses other statistical terms + (for example, empirical CDF), + ensure those terms also have formal `#def-` div definitions + in the relevant scope before relying on them + ## Quarto Code Chunk Options When the code **and** its console output are both needed for the surrounding narrative to make sense, use `#| code-fold: false` so that neither is hidden: @@ -372,6 +386,27 @@ This repository uses JAGS (Just Another Gibbs Sampler) for Bayesian analysis: - R packages `rjags` and `runjags` depend on JAGS being installed - The lint workflow also needs these packages installed to avoid false positives +## Quarto Slide Breaks + +Use `{{< slidebreak >}}` instead of raw `---` to insert slide breaks in `.qmd` files. +The `{{< slidebreak >}}` shortcode produces a slide break only in `revealjs` output, +and produces no output in other formats (HTML, PDF, etc.). +This allows the same source file to render correctly across multiple output formats. + +Example: + +```markdown +## Slide 1 + +Content here. + +{{< slidebreak >}} + +## Slide 2 + +More content. +``` + ## Computer Algebra Systems (CAS) The Copilot environment includes two computer algebra systems for symbolic mathematics. @@ -433,7 +468,7 @@ maxima --very-quiet --batch=/tmp/cas_check.mac **CRITICAL REQUIREMENTS** before requesting code review: -1. **Run `quarto render` yourself locally** on the full repository and ensure it passes +1. **Run `quarto render --to html`** on each chapter whose subfiles you edited — do NOT render the full book, and do NOT render all formats unless you are specifically fixing a non-HTML format issue 2. 
**Verify it completes successfully** (exit code 0, no errors) 3. **Do NOT rely solely on CI workflows** to catch rendering issues 4. **Test the actual command and wait for completion** - do not assume success @@ -447,7 +482,8 @@ maxima --very-quiet --batch=/tmp/cas_check.mac - All documents render correctly (check for missing images, broken math, etc.) **Common mistakes to avoid**: -- Testing individual files only (use `quarto render` for full repository) +- Rendering all chapters when only a few subfiles changed (only render chapters with edited subfiles) +- Rendering all output formats when only HTML is needed (use `--to html` unless fixing a specific format issue) - Not waiting for the command to complete - Assuming success without checking the exit code - Claiming "it works" without actually running the command diff --git a/_extensions/slidebreak/_extension.yml b/_extensions/slidebreak/_extension.yml new file mode 100644 index 000000000..613bd810c --- /dev/null +++ b/_extensions/slidebreak/_extension.yml @@ -0,0 +1,7 @@ +title: Slidebreak +author: UC Davis SERG +version: 1.0.0 +quarto-required: ">=1.2.0" +contributes: + shortcodes: + - slidebreak.lua diff --git a/_extensions/slidebreak/slidebreak.lua b/_extensions/slidebreak/slidebreak.lua new file mode 100644 index 000000000..dce6bcfd4 --- /dev/null +++ b/_extensions/slidebreak/slidebreak.lua @@ -0,0 +1,6 @@ +function slidebreak(args, kwargs, meta) + if quarto.doc.isFormat("revealjs") then + return pandoc.HorizontalRule() + end + return {} +end diff --git a/_quarto-book.yml b/_quarto-book.yml index a8220be80..8cbbdce8f 100644 --- a/_quarto-book.yml +++ b/_quarto-book.yml @@ -44,6 +44,7 @@ book: - basic-statistical-methods.qmd - classification.qmd - estimation.qmd + - nonparametric-models.qmd - inference.qmd - intro-MLEs.qmd - intro-bayes.qmd diff --git a/_quarto-website.yml b/_quarto-website.yml index d40f6e357..02ef73b7d 100644 --- a/_quarto-website.yml +++ b/_quarto-website.yml @@ -24,6 +24,7 @@ project: - 
basic-statistical-methods.qmd - classification.qmd - estimation.qmd + - nonparametric-models.qmd - inference.qmd - intro-MLEs.qmd - intro-bayes.qmd @@ -100,6 +101,8 @@ website: href: classification.qmd - text: "Estimation" href: estimation.qmd + - text: "Nonparametric Models" + href: nonparametric-models.qmd - text: "Inference" href: inference.qmd - text: "Introduction to MLEs" diff --git a/_subfiles/Linear-models-overview/_sec_linreg_diagnostics.qmd b/_subfiles/Linear-models-overview/_sec_linreg_diagnostics.qmd index 3543712f0..7d2510fa5 100644 --- a/_subfiles/Linear-models-overview/_sec_linreg_diagnostics.qmd +++ b/_subfiles/Linear-models-overview/_sec_linreg_diagnostics.qmd @@ -692,9 +692,32 @@ A QQ plot compares the empirical distribution of a dataset to a theoretical refe Each point corresponds to one observation: - The **x-axis** (Theoretical Quantiles) shows the $N(0,1)$ quantile - corresponding to the plotting position $p_i = (\text{rank}_i - 0.5) / n$ - assigned to each observation based on its rank. -- The **y-axis** (Standardized Residuals) shows the observed standardized residual $r_i$. + corresponding to the plotting position $p_i$ + used for the $i$th ordered residual. + In R, `qqnorm()` obtains these plotting positions + from `ppoints(n)`. +- The **y-axis** (Standardized Residuals) shows the observed standardized residuals, + plotted in increasing order as $r_{(1)}\le\cdots\le r_{(n)}$. + In QQ-plot terminology, + these ordered values are [sample quantiles](nonparametric-models.qmd#def-sample-quantile). + Let $r_{(1)} \le \cdots \le r_{(n)}$ + be the [order statistics](nonparametric-models.qmd#def-order-statistics) + of the standardized residuals. + Let $\hat F_r$ be their [empirical CDF](nonparametric-models.qmd#def-empirical-cdf) + and $\hat Q_r$ the corresponding sample quantile function. 
+ The QQ plot uses points
+ $$\left(\Phi^{-1}(p_i),\, r_{(i)}\right), \quad i = 1, \ldots, n,$$
+ where $p_i$ denotes the plotting positions returned by `ppoints(n)`.
+ For the common midpoint choice $p_i=(i-0.5)/n$,
+ we have $\hat Q_r(p_i)=r_{(i)}$.
+ More generally,
+ `qqnorm()` pairs the $i$th ordered residual
+ with the corresponding theoretical quantile
+ computed from its plotting position.
+ So both labels are correct:
+ `qqnorm()` emphasizes the quantile role ("Sample Quantiles"),
+ while `autoplot.lm()` emphasizes the underlying ordered values
+ ("Standardized residuals").

 If the standardized residuals approximately follow $N(0,1)$,
 the points should fall approximately on a straight line.
diff --git a/_subfiles/nonparametric-models/_sec_empirical_quantiles.qmd b/_subfiles/nonparametric-models/_sec_empirical_quantiles.qmd
new file mode 100644
index 000000000..74d2076c6
--- /dev/null
+++ b/_subfiles/nonparametric-models/_sec_empirical_quantiles.qmd
@@ -0,0 +1,130 @@
+:::{#def-empirical-cdf}
+#### Empirical CDF
+
+For observed values $x_1,\ldots,x_n$,
+the empirical CDF is
+$$\hat F(t)=\frac{1}{n}\sum_{j=1}^n I(x_j\le t), \quad t\in\mathbb{R}.$$
+Here $I(A)$ is the indicator function:
+$I(A)=1$ if $A$ holds,
+and $I(A)=0$ otherwise.
+:::
+
+:::{#def-order-statistics}
+#### Order statistics
+
+For observed values $x_1,\ldots,x_n$,
+the order statistics are the sorted values
+$$x_{(1)}\le x_{(2)}\le\cdots\le x_{(n)}.$$
+:::
+
+:::{#exm-order-statistics}
+#### Numerical example: order statistics
+
+If the observed values are $4,1,7,3$,
+then the order statistics are
+$x_{(1)}=1$, $x_{(2)}=3$, $x_{(3)}=4$, and $x_{(4)}=7$.
+:::
+
+:::{#def-sample-quantile}
+#### Sample quantile
+
+Given [empirical CDF](#def-empirical-cdf) $\hat F$,
+the sample quantile function (EQF)
+(the generalized inverse of $\hat F$)
+is
+$$\hat Q(p)=\inf\{t:\hat F(t)\ge p\}, \quad 0 < p \le 1.$$
+:::
+
+```{r}
+#| fig-cap: >
+#| Relationship between the empirical CDF and sample quantiles
+#| for data 4, 1, 7, 3.
+#| (Order statistics are 1, 3, 4, 7.) +#| fig-subcap: +#| - > +#| Empirical CDF horizontal pieces only. +#| Closed circles mark included endpoints. +#| Open circles mark excluded endpoints. +#| - > +#| Sample quantile function horizontal pieces only. +#| Open circles mark excluded left endpoints. +#| Closed circles mark included right endpoints. +#| layout-ncol: 2 +#| fig-width: 5 +#| fig-height: 4 +x <- c(4, 1, 7, 3) +n <- length(x) +x_ord <- sort(x) +p_ord <- seq_len(n) / n +ecdf_x_padding <- 0.8 +eqf_y_padding <- 0.5 +x_min <- min(x_ord) - ecdf_x_padding +x_max <- max(x_ord) + ecdf_x_padding + +x_left <- c(x_min, x_ord) +x_right <- c(x_ord, x_max) +f_levels <- c(0, p_ord) + +plot( + 0, + 0, + type = "n", + xlim = c(x_min, x_max), + ylim = c(0, 1.05), + xlab = "t", + ylab = expression(hat(plain("F"))(t)) +) +segments(x_left, f_levels, x_right, f_levels, lwd = 2, col = "blue") +points(x_ord, f_levels[seq_len(n)], pch = 1, col = "blue") +points(x_ord, f_levels[seq_len(n) + 1], pch = 19, col = "blue") + +q_left <- c(0, p_ord[-n]) +q_right <- p_ord +plot( + 0, + 0, + type = "n", + xlim = c(0, 1), + ylim = c(min(x_ord) - eqf_y_padding, max(x_ord) + eqf_y_padding), + xlab = "p", + ylab = expression(hat(Q)(p)) +) +segments(q_left, x_ord, q_right, x_ord, lwd = 2, col = "blue") +points(q_left, x_ord, pch = 1, col = "blue") +points(q_right, x_ord, pch = 19, col = "blue") +``` + +Because both functions are step functions, +they are not one-to-one. +Flat regions in one function map many inputs to the same output, +and jump points in one correspond to intervals in the other. +So they are generalized inverses rather than exact two-sided inverses. 
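The step-function definitions above can also be checked numerically. In base R, `ecdf()` computes $\hat F$, and `quantile()` with `type = 1` implements exactly this inverse-ECDF definition of $\hat Q$ (the default `type = 7` interpolates instead). A short check on the same example data:

```r
# Verify the empirical CDF and sample quantile definitions
# on the example data 4, 1, 7, 3 (order statistics 1, 3, 4, 7).
x <- c(4, 1, 7, 3)

Fhat <- ecdf(x)
Fhat(3)  # 2 of the 4 observations are <= 3, so Fhat(3) = 0.5

# Qhat(p) = inf{t : Fhat(t) >= p}; type = 1 is the inverse-ECDF rule
quantile(x, probs = 0.6, type = 1, names = FALSE)
# Fhat(3) = 0.5 < 0.6 <= Fhat(4) = 0.75, so Qhat(0.6) = 4
```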
diff --git a/estimation.qmd b/estimation.qmd
index 2375fbd81..e36adf794 100644
--- a/estimation.qmd
+++ b/estimation.qmd
@@ -403,8 +403,6 @@ $$\SE{\hth} = \sqrt{\mselr{\hth}}$$
 (this result is equivalent to @eq-unbiased-mse)
 :::
 
----
-
 # References {.unnumbered}
 
 ::: {#refs}
diff --git a/nonparametric-models.qmd b/nonparametric-models.qmd
new file mode 100644
index 000000000..77a8ce503
--- /dev/null
+++ b/nonparametric-models.qmd
@@ -0,0 +1,20 @@
+---
+title: "Nonparametric Models"
+format:
+  html: default
+  revealjs:
+    output-file: nonparametric-models-slides.html
+  pdf:
+    output-file: nonparametric-models-handout.pdf
+  docx:
+    output-file: nonparametric-models-handout.docx
+
+---
+
+{{< include shared-config.qmd >}}
+
+---
+
+# Empirical CDF and quantiles
+
+{{< include _subfiles/nonparametric-models/_sec_empirical_quantiles.qmd >}}
diff --git a/probability.qmd b/probability.qmd
index 012b0d048..210ba98d2 100644
--- a/probability.qmd
+++ b/probability.qmd
@@ -345,6 +345,27 @@ We will see more of this distribution later.
 
 ---
 
+:::{#def-cdf}
+#### Cumulative distribution function (CDF)
+
+For a random variable $X$,
+its population CDF is
+$$F(t)=\Pr(X\le t), \quad t\in\mathbb{R}.$$
+:::
+
+:::{#def-quantile-function}
+#### Quantile function (population inverse CDF)
+
+For a random variable $X$
+with [cumulative distribution function (CDF)](#def-cdf) $F$,
+its population quantile function
+(generalized inverse of $F$)
+is
+$$Q(p)=\inf\{t:F(t)\ge p\}, \quad 0 < p \le 1.$$
+:::