diff --git a/02_activities/assignments/assignment_1.html b/02_activities/assignments/assignment_1.html new file mode 100644 index 0000000..86edfd2 --- /dev/null +++ b/02_activities/assignments/assignment_1.html @@ -0,0 +1,995 @@ + + + + + + + + + +Assignment #1 + + + + + + + + + + + + + + + + + + + + +
+ +
+ +
+
+

Assignment #1

+
+ + + +
+ + + + +
+ + + +
+ + +
+

Assignment 1

+

You only need to write lines of code for each question. When answering questions that ask you to identify or interpret something, the length of your response doesn’t matter. For example, if the answer is just ‘yes,’ ‘no,’ or a number, you can just give that answer without adding anything else.

+

We will go through comparable code and concepts in the live learning session. If you run into trouble, start by using the help help() function in R, to get information about the datasets and function in question. The internet is also a great resource when coding (though note that no outside searches are required by the assignment!). If you do incorporate code from the internet, please cite the source within your code (providing a URL is sufficient).

+

Please bring questions that you cannot work out on your own to office hours, work periods or share with your peers on Slack. We will work with you through the issue.

+

You will need to install PLINK and run the analyses. Please follow the OS-specific setup guide in SETUP.md. PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner.

+
+

Question 1: Data inspection

+

Before fitting any models, it is essential to understand the data. Use R or bash code to answer the following questions about the gwa.qc.A1.fam, gwa.qc.A1.bim, and gwa.qc.A1.bed files, available at the following Google Drive link: https://drive.google.com/drive/folders/11meVqGCY5yAyI1fh-fAlMEXQt0VmRGuz?usp=drive_link. Please download all three files from this link and place them in 02_activities/data/.

+
    +
  1. Read the .fam file. How many samples does the dataset contain?
  2. +
+

``` pgsssq         cd “/c/Users/Mahsa Edalatkah/gen_data” head ./02_activities/data/gwa.qc.A1.fam wc -l <02_activities/data/gwa.qc.A1.fam

+
# 4000
+(ii) What is the 'variable type' of the response variable (i.e.Continuous or binary)?
+
+
+::: {.cell}
+
+```{.bash .cell-code}
+head ./02_activities/data/gwa.qc.A1.fam
+
+
0   A2001   0   0   1   -0.694438129641973
+1   A2002   0   0   1   1.85384536141856
+2   A2003   0   0   1   2.08263677761584
+3   A2004   0   0   1   2.73871473943968
+4   A2005   0   0   1   1.34114035564636
+5   A2006   0   0   1   0.416778586749647
+6   A2007   0   0   1   2.38297123290054
+7   A2008   0   0   1   1.51429928826958
+8   A2009   0   0   1   0.718686390529039
+9   A2010   0   0   1   2.08904136245205
+
+

:::

+
+
+
+

continuous

+
    +
  1. Read the .bim file. How many SNPs does the dataset contain?
  2. +
+
+
cd "/c/Users/Mahsa Edalatkah/gen_data"
+head ./02_activities/data/gwa.qc.A1.bim
+wc -l ./02_activities/data/gwa.qc.A1.bim
+
+
1   rs3737728   0   1011278 A   G
+1   rs1320565   0   1109721 T   C
+1   rs3813199   0   1148140 A   G
+1   rs11804831  0   1184667 C   T
+1   rs3766178   0   1468043 C   T
+1   rs3128342   0   1476697 A   C
+1   rs880051    0   1483590 A   G
+1   rs2296716   0   1487687 T   C
+1   rs2474460   0   1833906 C   T
+1   rs3820011   0   1878053 A   C
+101083 ./02_activities/data/gwa.qc.A1.bim
+
+
+
+
+

101083 SNPs

+
+

Question 2: Allele Frequency Estimation

+
    +
  1. Load the genotype matrix for SNPs rs1861, rs3813199, rs3128342, and rs11804831 using additive coding. What are the allele frequencies (AFs) for these four SNPs?
  2. +
+

{bash,results='hide',message=FALSE, warning=FALSE}         cd "/c/Users/Mahsa Edalatkah/gen_data" printf "%s\n" rs1861 rs3813199 rs3128342 rs11804831 > ./02_activities/data/SNP_List_A1 plink2 --bfile ./02_activities/data/gwa.qc.A1 --extract ./02_activities/data/SNP_List_A1 --export A --freq --out ./02_activities/data/SNP_List_A1_Additive head ./02_activities/data/SNP_List_A1_Additive.afreq

+
    +
  1. What are the minor allele frequencies (MAFs) for these four SNPs?
  2. +
+
+
+
+

rs3813199 0.0569126

+
+
+

rs11804831 0.154341

+
+
+

rs3128342 0.305121

+
+
+

rs1861 0.0539859

+
+

Question 3: Hardy–Weinberg Equilibrium (HWE) Test

+
    +
  1. Conduct the Hardy–Weinberg Equilibrium (HWE) test for all SNPs in the .bim file. Then, load the file containing the HWE p-value results and display the first few rows of the resulting data frame.
  2. +
+
+
cd "C:/Users/Mahsa Edalatkah/gen_data"       
+plink2 --bfile ./02_activities/data/gwa.qc.A1 --memory 4000 --hardy --out ./02_activities/data/gwa_qc_A1_HWE
+head ./02_activities/data/gwa_qc_A1_HWE.hardy
+
+
PLINK v2.0.0-a.7 64-bit (10 Mar 2026)               cog-genomics.org/plink/2.0/
+(C) 2005-2026 Shaun Purcell, Christopher Chang    GNU General Public License v3
+Logging to ./02_activities/data/gwa_qc_A1_HWE.log.
+Options in effect:
+  --bfile ./02_activities/data/gwa.qc.A1
+  --hardy
+  --memory 4000
+  --out ./02_activities/data/gwa_qc_A1_HWE
+
+Start time: Mon Apr 27 21:39:17 2026
+24414 MiB RAM detected; reserving 4000 MiB for main workspace.
+Using up to 8 compute threads.
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+./02_activities/data/gwa.qc.A1.fam.
+101083 variants loaded from ./02_activities/data/gwa.qc.A1.bim.
+1 quantitative phenotype loaded (4000 values).
+Calculating allele frequencies... 0%64%done.
+--hardy: 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%
+--hardy: Autosomal Hardy-Weinberg report (founders only) written to
+./02_activities/data/gwa_qc_A1_HWE.hardy .
+End time: Mon Apr 27 21:39:17 2026
+#CHROM  ID  A1  AX  HOM_A1_CT   HET_A1_CT   TWO_AX_CT   O(HET_A1)   E(HET_A1)   P
+1   rs3737728   G   A   1713    1841    428 0.46233 0.447932    0.0437892
+1   rs1320565   C   T   3368    589 19  0.148139    0.145262    0.273429
+1   rs3813199   G   A   3531    428 12  0.107781    0.107347    1
+1   rs11804831  T   C   2820    1061    81  0.267794    0.26104 0.113354
+1   rs3766178   T   C   2391    1378    214 0.34597 0.350629    0.415877
+1   rs3128342   C   A   1927    1655    382 0.417508    0.424044    0.330273
+1   rs880051    G   A   2715    1159    115 0.290549    0.287583    0.544917
+1   rs2296716   C   T   3334    620 28  0.155701    0.155354    1
+1   rs2474460   T   C   1204    1946    821 0.490053    0.495349    0.501191
+
+
+
+
library(data.table)
+library(ggplot2)
+library(seqminer)
+library(HardyWeinberg)
+
+
Loading required package: mice
+
+
+

+Attaching package: 'mice'
+
+
+
The following object is masked from 'package:stats':
+
+    filter
+
+
+
The following objects are masked from 'package:base':
+
+    cbind, rbind
+
+
+
Loading required package: nnet
+
+
+
Loading required package: Rsolnp
+
+
+
Loading required package: shape
+
+
library(dplyr)
+
+

+Attaching package: 'dplyr'
+
+
+
The following object is masked from 'package:HardyWeinberg':
+
+    recode
+
+
+
The following objects are masked from 'package:data.table':
+
+    between, first, last
+
+
+
The following objects are masked from 'package:stats':
+
+    filter, lag
+
+
+
The following objects are masked from 'package:base':
+
+    intersect, setdiff, setequal, union
+
+
+
+
HWE_A1 <- fread("C:/Users/Mahsa Edalatkah/gen_data/02_activities/data/gwa_qc_A1_HWE.hardy", header= TRUE)
+head(HWE_A1)
+
+
   #CHROM         ID     A1     AX HOM_A1_CT HET_A1_CT TWO_AX_CT O(HET_A1)
+    <int>     <char> <char> <char>     <int>     <int>     <int>     <num>
+1:      1  rs3737728      G      A      1713      1841       428  0.462330
+2:      1  rs1320565      C      T      3368       589        19  0.148139
+3:      1  rs3813199      G      A      3531       428        12  0.107781
+4:      1 rs11804831      T      C      2820      1061        81  0.267794
+5:      1  rs3766178      T      C      2391      1378       214  0.345970
+6:      1  rs3128342      C      A      1927      1655       382  0.417508
+   E(HET_A1)         P
+       <num>     <num>
+1:  0.447932 0.0437892
+2:  0.145262 0.2734290
+3:  0.107347 1.0000000
+4:  0.261040 0.1133540
+5:  0.350629 0.4158770
+6:  0.424044 0.3302730
+
+
+
    +
  1. What are the HWE p-values for SNPs rs1861, rs3813199, rs3128342, and rs11804831?
  2. +
+
+
names(HWE_A1)
+
+
 [1] "#CHROM"    "ID"        "A1"        "AX"        "HOM_A1_CT" "HET_A1_CT"
+ [7] "TWO_AX_CT" "O(HET_A1)" "E(HET_A1)" "P"        
+
+
HWE_A1[ID %in% c("rs1861", "rs3813199", "rs3128342", "rs11804831"),
+       .(ID, P)]
+
+
           ID        P
+       <char>    <num>
+1:  rs3813199 1.000000
+2: rs11804831 0.113354
+3:  rs3128342 0.330273
+4:     rs1861 0.274719
+
+
+
+
+

Question 4: Genetic Association Test

+
    +
  1. Conduct a linear regression to test the association between SNP rs1861 and the phenotype. What is the p-value?
  2. +
+
+
library(data.table)
+additive_A1 <- fread("C:/Users/Mahsa Edalatkah/gen_data/02_activities/data/SNP_List_A1_Additive.raw", header= TRUE) 
+head(additive_A1)
+
+
     FID    IID   PAT   MAT   SEX PHENOTYPE rs3813199_G rs11804831_T
+   <int> <char> <int> <int> <int>     <num>       <int>        <int>
+1:     0  A2001     0     0     1 -0.694438           2            2
+2:     1  A2002     0     0     1  1.853850           2            2
+3:     2  A2003     0     0     1  2.082640           2            1
+4:     3  A2004     0     0     1  2.738710           2            2
+5:     4  A2005     0     0     1  1.341140           2            1
+6:     5  A2006     0     0     1  0.416779           2            1
+   rs3128342_C rs1861_C
+         <int>    <int>
+1:           2       NA
+2:           2        2
+3:           2        2
+4:           1        2
+5:           1        2
+6:           2       NA
+
+
additive_A1$PHENOTYPE = additive_A1$PHENOTYPE -1
+grep("rs1861", names(additive_A1), value = TRUE)
+
+
[1] "rs1861_C"
+
+
model_additive <- lm (PHENOTYPE~ rs1861_C, data= additive_A1)
+summary(model_additive)
+
+

+Call:
+lm(formula = PHENOTYPE ~ rs1861_C, data = additive_A1)
+
+Residuals:
+    Min      1Q  Median      3Q     Max 
+-3.5439 -0.6850  0.0021  0.6993  3.3268 
+
+Coefficients:
+            Estimate Std. Error t value Pr(>|t|)    
+(Intercept) -0.94762    0.09486  -9.989   <2e-16 ***
+rs1861_C     0.97382    0.04943  19.703   <2e-16 ***
+---
+Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+Residual standard error: 1.003 on 3962 degrees of freedom
+  (36 observations deleted due to missingness)
+Multiple R-squared:  0.08924,   Adjusted R-squared:  0.08901 
+F-statistic: 388.2 on 1 and 3962 DF,  p-value: < 2.2e-16
+
+
+
+
+
+

p-value: < 2.2e-16

+
    +
  1. How would you interpret the beta coefficient from this regression?
  2. +
+
+
+

In this additive model, each additional copy of the C allele at rs1861 is associated with a 0.97-unit increase in the phenotype. This association is statistically significant (p < 2×10⁻¹⁶).

+
    +
  1. Plot the scatterplot of phenotype versus the genotype of SNP rs1861. Add the regression line to the plot.
  2. +
+
+
library(ggplot2)
+
+plot_1 <- ggplot(additive_A1, aes(x = rs1861_C, y = PHENOTYPE)) +
+  geom_jitter(width = 0.1, alpha = 0.3) +
+  geom_smooth(method = "lm", se = FALSE, color = "red") +
+  labs(x = "Number of C alleles (rs1861)",
+       y = "Phenotype",
+       title = "Phenotype vs rs1861 genotype")
+plot_1
+
+
`geom_smooth()` using formula = 'y ~ x'
+
+
+
Warning: Removed 36 rows containing non-finite outside the scale range
+(`stat_smooth()`).
+
+
+
Warning: Removed 36 rows containing missing values or values outside the scale range
+(`geom_point()`).
+
+
+
+
+

+
+
+
+
+
    +
  1. Convert the genotype coding for rs1861 to recessive coding.
  2. +
+
+
additive_A1$rs1861_rec <- ifelse(additive_A1$rs1861_C == 2, 1, 0)     
+table(additive_A1$rs1861_C, additive_A1$rs1861_rec)
+
+
   
+       0    1
+  0   15    0
+  1  398    0
+  2    0 3551
+
+
+
    +
  1. Conduct a linear regression to test the association between the recessive-coded rs1861 and the phenotype. What is the p-value?
  2. +
+
+
model_rec <- lm(PHENOTYPE ~ rs1861_rec, data = additive_A1)
+summary(model_rec)        
+
+

+Call:
+lm(formula = PHENOTYPE ~ rs1861_rec, data = additive_A1)
+
+Residuals:
+    Min      1Q  Median      3Q     Max 
+-3.5437 -0.6892  0.0015  0.7016  3.3270 
+
+Coefficients:
+             Estimate Std. Error t value Pr(>|t|)    
+(Intercept) -0.007686   0.049446  -0.155    0.876    
+rs1861_rec   1.007543   0.052242  19.286   <2e-16 ***
+---
+Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+Residual standard error: 1.005 on 3962 degrees of freedom
+  (36 observations deleted due to missingness)
+Multiple R-squared:  0.08582,   Adjusted R-squared:  0.08559 
+F-statistic:   372 on 1 and 3962 DF,  p-value: < 2.2e-16
+
+
+
+
+

p-value: < 2.2e-16

+
    +
  1. Plot the scatterplot of phenotype versus the recessive-coded genotype of rs1861. Add the regression line to the plot.
  2. +
+
+
library(ggplot2)
+plot_2 <- ggplot(additive_A1, aes(x = factor(rs1861_rec), y = PHENOTYPE)) +
+  geom_jitter(width = 0.1, alpha = 0.3) +
+  geom_smooth(aes(group = 1), method = "lm", se = FALSE, color = "red") +
+  labs(x = "Recessive genotype (0 = non-CC, 1 = CC)",
+       y = "Phenotype",
+       title = "Phenotype vs rs1861 (recessive model)")
+plot_2
+
+
`geom_smooth()` using formula = 'y ~ x'
+
+
+
+
+

+
+
+
+
+
    +
  1. Which model fits better? Justify your answer.
  2. +
+
+
AIC (model_additive, model_rec)
+
+
               df      AIC
+model_additive  3 11276.91
+model_rec       3 11291.74
+
+
+
+
+

additive is a better fit because it has a smaller AIC value compared to the recessive modeling

+
+

Criteria

+ +++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
CriteriaCompleteIncomplete
Data InspectionCorrect sample/SNP counts and variable type identified.Missing or incorrect counts or variable type.
Allele Frequency EstimationCorrect allele and minor allele frequencies computed.Frequencies missing or wrong.
Hardy–Weinberg Equilibrium TestCorrect PLINK command and p-value extraction in R.PLINK command or extraction incorrect/missing.
Genetic Association TestCorrect regressions, plots, coding, and interpretation.Regression, plots, or interpretation missing/incomplete.
+
+
+

Submission Information

+

📌 Please review our Assignment Submission Guide for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

+
+

Note:

+

If you like, you may collaborate with others in the cohort. If you choose to do so, please indicate with whom you have worked with in your pull request by tagging their GitHub username. Separate submissions are required.

+
+
+
+

Submission Parameters

+
    +
  • Submission Due Date: 11:59 PM – 16/03/2026

  • +
  • Branch name for your repo should be: assignment-1

  • +
  • What to submit for this assignment:

    +
      +
    • Populate this Quarto document (assignment_1.qmd).
    • +
    • Render the document with Quarto: quarto render assignment_1.qmd.
    • +
    • Submit both assignment_1.qmd and the rendered HTML file assignment_1.html in your pull request.
    • +
  • +
  • What the pull request link should look like for this assignment: https://github.com/<your_github_username>/gen_data/pull/<pr_id>

    +
      +
    • Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support team review your submission easily.
    • +
  • +
+
+

Checklist:

+
    +
  • Created a branch with the correct naming convention.
  • +
  • Ensured that the repository is public.
  • +
  • Reviewed the PR description guidelines and adhered to them.
  • +
  • Verified that the link is accessible in a private browser window.
  • +
  • Confirmed that both assignment_1.qmd and assignment_1.html are included in the pull request.
  • +
+

If you encounter any difficulties or have questions, please don’t hesitate to reach out to our team via our Slack help channel. Our technical facilitators and learning support team are here to help you navigate any challenges.

+
+
+
+ +
+ + +
+ + + + + \ No newline at end of file diff --git a/02_activities/assignments/assignment_1.qmd b/02_activities/assignments/assignment_1.qmd index 550af3d..a90a3e4 100644 --- a/02_activities/assignments/assignment_1.qmd +++ b/02_activities/assignments/assignment_1.qmd @@ -13,99 +13,159 @@ Please bring questions that you cannot work out on your own to office hours, wor You will need to install PLINK and run the analyses. Please follow the OS-specific setup guide in [`SETUP.md`](../../SETUP.md). PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner. +```{r setup, include=FALSE} +# Set up a consistent path for all chunks of code +knitr::opts_knit$set(root.dir = normalizePath("../../")) +getwd() +``` + #### Question 1: Data inspection Before fitting any models, it is essential to understand the data. Use R or bash code to answer the following questions about the `gwa.qc.A1.fam`, `gwa.qc.A1.bim`, and `gwa.qc.A1.bed` files, available at the following Google Drive link: . Please download all three files from this link and place them in `02_activities/data/`. (i) Read the .fam file. How many samples does the dataset contain? -``` -# Your answer here... -``` +``` {bash}         +cd "/c/Users/Mahsa Edalatkah/gen_data" +head ./02_activities/data/gwa.qc.A1.fam +wc -l <02_activities/data/gwa.qc.A1.fam +``` +# 4000 (ii) What is the 'variable type' of the response variable (i.e.Continuous or binary)? -``` -# Your answer here... +```{bash} +head ./02_activities/data/gwa.qc.A1.fam ``` - +# continuous (iii) Read the .bim file. How many SNPs does the dataset contain? -``` -# Your answer here... +```{bash} +cd "/c/Users/Mahsa Edalatkah/gen_data" +head ./02_activities/data/gwa.qc.A1.bim +wc -l ./02_activities/data/gwa.qc.A1.bim + ``` +# 101083 SNPs + #### Question 2: Allele Frequency Estimation (i) Load the genotype matrix for SNPs rs1861, rs3813199, rs3128342, and rs11804831 using additive coding. What are the allele frequencies (AFs) for these four SNPs? -``` -# Your code here... +```{bash,results='hide',message=FALSE, warning=FALSE}         +cd "/c/Users/Mahsa Edalatkah/gen_data" +printf "%s\n" rs1861 rs3813199 rs3128342 rs11804831 > ./02_activities/data/SNP_List_A1 +plink2 --bfile ./02_activities/data/gwa.qc.A1 --extract ./02_activities/data/SNP_List_A1 --export A --freq --out ./02_activities/data/SNP_List_A1_Additive +head ./02_activities/data/SNP_List_A1_Additive.afreq ``` (ii) What are the minor allele frequencies (MAFs) for these four SNPs? -``` -# Your code here... -``` +# rs3813199 0.0569126 +# rs11804831 0.154341 +# rs3128342 0.305121 +# rs1861 0.0539859 #### Question 3: Hardy–Weinberg Equilibrium (HWE) Test (i) Conduct the Hardy–Weinberg Equilibrium (HWE) test for all SNPs in the .bim file. Then, load the file containing the HWE p-value results and display the first few rows of the resulting data frame. -``` -# Your code here... +``` {bash} +cd "C:/Users/Mahsa Edalatkah/gen_data" +plink2 --bfile ./02_activities/data/gwa.qc.A1 --memory 4000 --hardy --out ./02_activities/data/gwa_qc_A1_HWE +head ./02_activities/data/gwa_qc_A1_HWE.hardy +``` + +```{r} +library(data.table) +library(ggplot2) +library(seqminer) +library(HardyWeinberg) +library(dplyr) +``` + +```{r} +HWE_A1 <- fread("C:/Users/Mahsa Edalatkah/gen_data/02_activities/data/gwa_qc_A1_HWE.hardy", header= TRUE) +head(HWE_A1) ``` (ii) What are the HWE p-values for SNPs rs1861, rs3813199, rs3128342, and rs11804831? -``` -# Your code here... +``` {r} +names(HWE_A1) +HWE_A1[ID %in% c("rs1861", "rs3813199", "rs3128342", "rs11804831"), + .(ID, P)] ``` #### Question 4: Genetic Association Test (i) Conduct a linear regression to test the association between SNP rs1861 and the phenotype. What is the p-value? -``` -# Your code here... +```{r} +library(data.table) +additive_A1 <- fread("C:/Users/Mahsa Edalatkah/gen_data/02_activities/data/SNP_List_A1_Additive.raw", header= TRUE) +head(additive_A1) +additive_A1$PHENOTYPE = additive_A1$PHENOTYPE -1 +grep("rs1861", names(additive_A1), value = TRUE) +model_additive <- lm (PHENOTYPE~ rs1861_C, data= additive_A1) +summary(model_additive) ``` +# p-value: < 2.2e-16 + (ii) How would you interpret the beta coefficient from this regression? -``` -# Your answer here... -``` +# In this additive model, each additional copy of the C allele at rs1861 is associated with a 0.97-unit increase in the phenotype. This association is statistically significant (p < 2×10⁻¹⁶). + (iii) Plot the scatterplot of phenotype versus the genotype of SNP rs1861. Add the regression line to the plot. -``` -# Your code here... +``` {r} +library(ggplot2) + +plot_1 <- ggplot(additive_A1, aes(x = rs1861_C, y = PHENOTYPE)) + + geom_jitter(width = 0.1, alpha = 0.3) + + geom_smooth(method = "lm", se = FALSE, color = "red") + + labs(x = "Number of C alleles (rs1861)", + y = "Phenotype", + title = "Phenotype vs rs1861 genotype") +plot_1 ``` (iv) Convert the genotype coding for rs1861 to recessive coding. -``` -# Your code here... +```{r} +additive_A1$rs1861_rec <- ifelse(additive_A1$rs1861_C == 2, 1, 0) +table(additive_A1$rs1861_C, additive_A1$rs1861_rec) ``` (v) Conduct a linear regression to test the association between the recessive-coded rs1861 and the phenotype. What is the p-value? -``` -# Your code here... +```{r} +model_rec <- lm(PHENOTYPE ~ rs1861_rec, data = additive_A1) +summary(model_rec) ``` - +# p-value: < 2.2e-16 (vi) Plot the scatterplot of phenotype versus the recessive-coded genotype of rs1861. Add the regression line to the plot. -``` -# Your code here... +```{r} +library(ggplot2) +plot_2 <- ggplot(additive_A1, aes(x = factor(rs1861_rec), y = PHENOTYPE)) + + geom_jitter(width = 0.1, alpha = 0.3) + + geom_smooth(aes(group = 1), method = "lm", se = FALSE, color = "red") + + labs(x = "Recessive genotype (0 = non-CC, 1 = CC)", + y = "Phenotype", + title = "Phenotype vs rs1861 (recessive model)") +plot_2 ``` (vii) Which model fits better? Justify your answer. -``` -# Your answer here... +```{r} +AIC (model_additive, model_rec) ``` +# additive is a better fit because it has a smaller AIC value compared to the recessive modeling ### Criteria