jnepv · jnepv · Mar 15, 2026 · Mar 15, 2026 · Mar 19, 2026
diff --git a/02_activities/assignments/assignment_1.html b/02_activities/assignments/assignment_1.html
diff --git a/02_activities/assignments/assignment_1.qmd b/02_activities/assignments/assignment_1.qmd
@@ -13,99 +13,215 @@ Please bring questions that you cannot work out on your own to office hours, wor
 
 You will need to install PLINK and run the analyses. Please follow the OS-specific setup guide in [`SETUP.md`](../../SETUP.md). PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner.
 
+#### Setting up 
+
+```{r setup, include = FALSE}
+# Set up a consistent path for all chunks of code
+knitr::opts_knit$set(root.dir = normalizePath("../../"))
+```
+
+```{r, message = FALSE, warning = FALSE, results = 'hide'}
+library(data.table)
+library(ggplot2)
+library(seqminer)
+library(HardyWeinberg)
+library(dplyr)
+```
+
 #### Question 1: Data inspection
 
 Before fitting any models, it is essential to understand the data. Use R or bash code to answer the following questions about the `gwa.qc.A1.fam`, `gwa.qc.A1.bim`, and `gwa.qc.A1.bed` files, available at the following Google Drive link: <https://drive.google.com/drive/folders/11meVqGCY5yAyI1fh-fAlMEXQt0VmRGuz?usp=drive_link>. Please download all three files from this link and place them in `02_activities/data/`.
 
 (i) Read the .fam file. How many samples does the dataset contain?
+
+```{r, message = FALSE, warning = FALSE}
 
-```         
-# Your answer here...
-```
+fam_all <- fread("./02_activities/data/gwa.qc.A1.fam")
+
+N = dim(fam_all)[1]; N
+
+```     
+
+There are 4,000 samples in the dataset.
 
 (ii) What is the 'variable type' of the response variable (i.e.Continuous or binary)?
 
-```         
-# Your answer here...
+``` {bash}         
+
+head ./02_activities/data/gwa.qc.A1.fam
+
 ```
 
+The phenotype is a continuous variable. 
+
 (iii) Read the .bim file. How many SNPs does the dataset contain?
 
-```         
-# Your answer here...
+``` {r, message = FALSE, warning = FALSE}
+
+bim_all <- fread("./02_activities/data/gwa.qc.A1.bim") 
+M = nrow(bim_all); M
+
 ```
 
+There are 101,083 SNPs in the dataset.
+
 #### Question 2: Allele Frequency Estimation
 
 (i) Load the genotype matrix for SNPs rs1861, rs3813199, rs3128342, and rs11804831 using additive coding. What are the allele frequencies (AFs) for these four SNPs?
 
-```         
-# Your code here...
+``` {bash}
+
+setopt interactivecomments
+# Text file containing the list of SNPs to extract
+printf "%s\n" rs1861 rs3813199 rs3128342 rs11804831 > ./02_activities/data/snplistA1.txt
+
+# Genotype matrix for the four SNPs with additive coding
+plink2 --bfile ./02_activities/data/gwa.qc.A1 \
+      --extract ./02_activities/data/snplistA1.txt \
+      --recode A \
+      --out ./02_activities/data/geno.additiveA1
+
+# Calculate allele frequencies for the four SNPs
+plink2 --bfile ./02_activities/data/gwa.qc.A1 \
+      --extract ./02_activities/data/snplistA1.txt \
+      --freq --out ./02_activities/data/gwa_qc_freqA1
+
 ```
 
-(ii) What are the minor allele frequencies (MAFs) for these four SNPs?
+``` {r, message = FALSE, warning = FALSE}
+
+# View frequency table
+freq <- fread("./02_activities/data/gwa_qc_freqA1.afreq")
+head(freq)
 
-```         
-# Your code here...
 ```
 
+The allele frequencies for the four SNPs are as follows:
+ - rs1861: C = 0.9460141; A = 0.0539859 
+ - rs3813199: G = 0.9430874; A = 0.0569126  
+ - rs3128342: C = 0.694879; A = 0.3051210
+ - rs11804831: T = 0.845659; C = 0.1543410
+
+(ii) What are the minor allele frequencies (MAFs) for these four SNPs?
+
+The minor allele frequencies for the four SNPs are as follows:
+ - rs1861: MAF = 0.0539859
+ - rs3813199: MAF = 0.0569126 
+ - rs3128342: MAF = 0.3051210 
+ - rs11804831: MAF = 0.1543410 
+
 #### Question 3: Hardy–Weinberg Equilibrium (HWE) Test
 
 (i) Conduct the Hardy–Weinberg Equilibrium (HWE) test for all SNPs in the .bim file. Then, load the file containing the HWE p-value results and display the first few rows of the resulting data frame.
+
+```{bash, results = 'hide', message = FALSE, warning = FALSE}
+
+plink2 --bfile ./02_activities/data/gwa.qc.A1 --hardy --out ./02_activities/data/gwa_qc_hwe_A1
+
+```
+
+```{r, message = FALSE, warning = FALSE}
+
+# View HWE results
+hwe <- fread("./02_activities/data/gwa_qc_hwe_A1.hardy")
+head(hwe)
 
-```         
-# Your code here...
 ```
 
 (ii) What are the HWE p-values for SNPs rs1861, rs3813199, rs3128342, and rs11804831?
 
-```         
-# Your code here...
+``` {r, message = FALSE, warning = FALSE}
+
+snps <- c("rs1861", "rs3813199", "rs3128342", "rs11804831")
+
+hwe_subset <- hwe %>%
+  filter(ID %in% snps)
+
+hwe_subset
+
 ```
 
+The p-values for the HWE test for the four SNPs are as follows:
+ - rs1861: p-value = 0.275
+ - rs3813199: p-value = 1.000
+ - rs3128342: p-value = 0.330
+ - rs11804831: p-value = 0.113
+
 #### Question 4: Genetic Association Test
 
 (i) Conduct a linear regression to test the association between SNP rs1861 and the phenotype. What is the p-value?
 
-```         
-# Your code here...
+``` {r, message = FALSE, warning = FALSE}
+
+# Load the genotype matrix for SNP rs1861 with additive coding
+geno_subset <- fread("./02_activities/data/geno.additiveA1.raw")
+data_sub <- geno_subset[, c("PHENOTYPE", "rs1861_C")]
+data_sub <- data_sub %>%
+  mutate(rs1861_C = 2 - rs1861_C)
+data_sub <- na.omit(data_sub)  # Remove rows with missing values
+head(data_sub)
+
+# Fit linear regression model
+model1 <- lm(PHENOTYPE ~ rs1861_C, data = data_sub)
+summary(model1)
+
 ```
 
+The p-value for the association between SNP rs1861 and the phenotype is <2e-16
+
 (ii) How would you interpret the beta coefficient from this regression?
 
-```         
-# Your answer here...
-```
+For each additional copy of the C allele at SNP rs1861, the phenotype is expected to decrease by approximately 0.967 units. The significant p-value suggests evidence of an association between this SNP and the phenotype in our sample.
 
 (iii) Plot the scatterplot of phenotype versus the genotype of SNP rs1861. Add the regression line to the plot.
 
-```         
-# Your code here...
+```{r, message = FALSE, warning = FALSE}
+
+plot1 <- ggplot(data = data_sub, aes(x = rs1861_C, y = PHENOTYPE)) + 
+  geom_point() +
+  geom_smooth(method = "lm", se = FALSE, color = "blue") +
+  labs(x= "Genotype (0/1/2)", y = "Phenotype")
+
+print(plot1)
+
 ```
 
 (iv) Convert the genotype coding for rs1861 to recessive coding.
 
-```         
-# Your code here...
+```{r}
+
+geno_subset_recessive <- data_sub  # make a copy
+geno_subset_recessive$rs1861_C <- ifelse(geno_subset_recessive$rs1861_C == 2, 1, 0)
+
 ```
 
 (v) Conduct a linear regression to test the association between the recessive-coded rs1861 and the phenotype. What is the p-value?
 
-```         
-# Your code here...
+``` {r, message = FALSE, warning = FALSE}
+
+# Fit linear regression model
+model2 <- lm(PHENOTYPE ~ rs1861_C, data = geno_subset_recessive)
+summary(model2)
+
 ```
 
+The p-value for the association between the recessive-coded rs1861 and the phenotype is 3.08e-12.
+
 (vi) Plot the scatterplot of phenotype versus the recessive-coded genotype of rs1861. Add the regression line to the plot.
 
-```         
-# Your code here...
+```{r}
+plot2 <- ggplot(data = geno_subset_recessive, aes(x = rs1861_C, y = PHENOTYPE)) + 
+  geom_point() +
+  geom_smooth(method = "lm", se = FALSE, color = "blue") +
+  labs(x= "Genotype (0/1/2)", y = "Phenotype")
+
+print(plot2)
+
 ```
 
 (vii) Which model fits better? Justify your answer.
 
-```         
-# Your answer here...
-```
+The additive model fits better than the recessive model. This is evident from the much smaller p-value in the additive model (<2e-16) compared to the recessive model (3.08e-12), indicating stronger evidence of association. Additionally, the scatterplot for the additive model shows a clearer linear relationship between genotype and phenotype, while the recessive model's plot is more binary, suggesting that the additive coding captures more of the variation in the phenotype.
 
 ### Criteria
 

diff --git a/02_activities/assignments/assignment_1_files/figure-html/unnamed-chunk-11-1.png b/02_activities/assignments/assignment_1_files/figure-html/unnamed-chunk-11-1.png
diff --git a/02_activities/assignments/assignment_1_files/figure-html/unnamed-chunk-14-1.png b/02_activities/assignments/assignment_1_files/figure-html/unnamed-chunk-14-1.png
diff --git a/...ents/assignment_1_files/libs/bootstrap/bootstrap-d6a003b94517c951b2d65075d42fb01b.min.css b/...ents/assignment_1_files/libs/bootstrap/bootstrap-d6a003b94517c951b2d65075d42fb01b.min.css