library(data.table)
+library(ggplot2)
+library(seqminer)
+library(HardyWeinberg)
+library(dplyr)Assignment #1
+Assignment 1
+You only need to write lines of code for each question. When answering questions that ask you to identify or interpret something, the length of your response doesn’t matter. For example, if the answer is just ‘yes,’ ‘no,’ or a number, you can just give that answer without adding anything else.
+We will go through comparable code and concepts in the live learning session. If you run into trouble, start by using the help help() function in R, to get information about the datasets and function in question. The internet is also a great resource when coding (though note that no outside searches are required by the assignment!). If you do incorporate code from the internet, please cite the source within your code (providing a URL is sufficient).
+Please bring questions that you cannot work out on your own to office hours, work periods or share with your peers on Slack. We will work with you through the issue.
+You will need to install PLINK and run the analyses. Please follow the OS-specific setup guide in SETUP.md. PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner.
Setting up
+Question 1: Data inspection
+Before fitting any models, it is essential to understand the data. Use R or bash code to answer the following questions about the gwa.qc.A1.fam, gwa.qc.A1.bim, and gwa.qc.A1.bed files, available at the following Google Drive link: https://drive.google.com/drive/folders/11meVqGCY5yAyI1fh-fAlMEXQt0VmRGuz?usp=drive_link. Please download all three files from this link and place them in 02_activities/data/.
-
+
- Read the .fam file. How many samples does the dataset contain? +
fam_all <- fread("./02_activities/data/gwa.qc.A1.fam")
+
+N = dim(fam_all)[1]; N[1] 4000
+There are 4,000 samples in the dataset.
+-
+
- What is the ‘variable type’ of the response variable (i.e.Continuous or binary)? +
+head ./02_activities/data/gwa.qc.A1.fam0 A2001 0 0 1 -0.694438129641973
+1 A2002 0 0 1 1.85384536141856
+2 A2003 0 0 1 2.08263677761584
+3 A2004 0 0 1 2.73871473943968
+4 A2005 0 0 1 1.34114035564636
+5 A2006 0 0 1 0.416778586749647
+6 A2007 0 0 1 2.38297123290054
+7 A2008 0 0 1 1.51429928826958
+8 A2009 0 0 1 0.718686390529039
+9 A2010 0 0 1 2.08904136245205
+The phenotype is a continuous variable.
+-
+
- Read the .bim file. How many SNPs does the dataset contain? +
bim_all <- fread("./02_activities/data/gwa.qc.A1.bim")
+M = nrow(bim_all); M[1] 101083
+There are 101,083 SNPs in the dataset.
+Question 2: Allele Frequency Estimation
+-
+
- Load the genotype matrix for SNPs rs1861, rs3813199, rs3128342, and rs11804831 using additive coding. What are the allele frequencies (AFs) for these four SNPs? +
+setopt interactivecomments
+# Text file containing the list of SNPs to extract
+printf "%s\n" rs1861 rs3813199 rs3128342 rs11804831 > ./02_activities/data/snplistA1.txt
+
+# Genotype matrix for the four SNPs with additive coding
+plink2 --bfile ./02_activities/data/gwa.qc.A1 \
+ --extract ./02_activities/data/snplistA1.txt \
+ --recode A \
+ --out ./02_activities/data/geno.additiveA1
+
+# Calculate allele frequencies for the four SNPs
+plink2 --bfile ./02_activities/data/gwa.qc.A1 \
+ --extract ./02_activities/data/snplistA1.txt \
+ --freq --out ./02_activities/data/gwa_qc_freqA1bash: line 1: setopt: command not found
+PLINK v2.00a5.12 M1 (25 Jun 2024) www.cog-genomics.org/plink/2.0/
+(C) 2005-2024 Shaun Purcell, Christopher Chang GNU General Public License v3
+Logging to ./02_activities/data/geno.additiveA1.log.
+Options in effect:
+ --bfile ./02_activities/data/gwa.qc.A1
+ --export A
+ --extract ./02_activities/data/snplistA1.txt
+ --out ./02_activities/data/geno.additiveA1
+
+Start time: Thu Mar 19 15:01:27 2026
+36864 MiB RAM detected; reserving 18432 MiB for main workspace.
+Using up to 14 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+./02_activities/data/gwa.qc.A1.fam.
+101083 variants loaded from ./02_activities/data/gwa.qc.A1.bim.
+1 quantitative phenotype loaded (4000 values).
+--extract: 4 variants remaining.
+4 variants remaining after main filters.
+
+--export A pass 1/1: loading... 0%0%writing... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
+--export A: ./02_activities/data/geno.additiveA1.raw written.
+End time: Thu Mar 19 15:01:27 2026
+PLINK v2.00a5.12 M1 (25 Jun 2024) www.cog-genomics.org/plink/2.0/
+(C) 2005-2024 Shaun Purcell, Christopher Chang GNU General Public License v3
+Logging to ./02_activities/data/gwa_qc_freqA1.log.
+Options in effect:
+ --bfile ./02_activities/data/gwa.qc.A1
+ --extract ./02_activities/data/snplistA1.txt
+ --freq
+ --out ./02_activities/data/gwa_qc_freqA1
+
+Start time: Thu Mar 19 15:01:27 2026
+36864 MiB RAM detected; reserving 18432 MiB for main workspace.
+Using up to 14 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+./02_activities/data/gwa.qc.A1.fam.
+101083 variants loaded from ./02_activities/data/gwa.qc.A1.bim.
+1 quantitative phenotype loaded (4000 values).
+--extract: 4 variants remaining.
+Calculating allele frequencies... 0%0%done.
+--freq: 0%0%25%50%75%
+--freq: Allele frequencies (founders only) written to
+./02_activities/data/gwa_qc_freqA1.afreq .
+End time: Thu Mar 19 15:01:27 2026
+# View frequency table
+freq <- fread("./02_activities/data/gwa_qc_freqA1.afreq")
+head(freq) #CHROM ID REF ALT PROVISIONAL_REF? ALT_FREQS OBS_CT
+ <int> <char> <char> <char> <char> <num> <int>
+1: 1 rs3813199 G A Y 0.0569126 7942
+2: 1 rs11804831 T C Y 0.1543410 7924
+3: 1 rs3128342 C A Y 0.3051210 7928
+4: 1 rs1861 C A Y 0.0539859 7928
+The allele frequencies for the four SNPs are as follows: - rs1861: C = 0.9460141; A = 0.0539859 - rs3813199: G = 0.9430874; A = 0.0569126
+- rs3128342: C = 0.694879; A = 0.3051210 - rs11804831: T = 0.845659; C = 0.1543410
-
+
- What are the minor allele frequencies (MAFs) for these four SNPs? +
The minor allele frequencies for the four SNPs are as follows: - rs1861: MAF = 0.0539859 - rs3813199: MAF = 0.0569126 - rs3128342: MAF = 0.3051210 - rs11804831: MAF = 0.1543410
+Question 3: Hardy–Weinberg Equilibrium (HWE) Test
+-
+
- Conduct the Hardy–Weinberg Equilibrium (HWE) test for all SNPs in the .bim file. Then, load the file containing the HWE p-value results and display the first few rows of the resulting data frame. +
+plink2 --bfile ./02_activities/data/gwa.qc.A1 --hardy --out ./02_activities/data/gwa_qc_hwe_A1# View HWE results
+hwe <- fread("./02_activities/data/gwa_qc_hwe_A1.hardy")
+head(hwe) #CHROM ID A1 AX HOM_A1_CT HET_A1_CT TWO_AX_CT O(HET_A1)
+ <int> <char> <char> <char> <int> <int> <int> <num>
+1: 1 rs3737728 G A 1713 1841 428 0.462330
+2: 1 rs1320565 C T 3368 589 19 0.148139
+3: 1 rs3813199 G A 3531 428 12 0.107781
+4: 1 rs11804831 T C 2820 1061 81 0.267794
+5: 1 rs3766178 T C 2391 1378 214 0.345970
+6: 1 rs3128342 C A 1927 1655 382 0.417508
+ E(HET_A1) P
+ <num> <num>
+1: 0.447932 0.0437892
+2: 0.145262 0.2734290
+3: 0.107347 1.0000000
+4: 0.261040 0.1133540
+5: 0.350629 0.4158770
+6: 0.424044 0.3302730
+-
+
- What are the HWE p-values for SNPs rs1861, rs3813199, rs3128342, and rs11804831? +
snps <- c("rs1861", "rs3813199", "rs3128342", "rs11804831")
+
+hwe_subset <- hwe %>%
+ filter(ID %in% snps)
+
+hwe_subset #CHROM ID A1 AX HOM_A1_CT HET_A1_CT TWO_AX_CT O(HET_A1)
+ <int> <char> <char> <char> <int> <int> <int> <num>
+1: 1 rs3813199 G A 3531 428 12 0.107781
+2: 1 rs11804831 T C 2820 1061 81 0.267794
+3: 1 rs3128342 C A 1927 1655 382 0.417508
+4: 1 rs1861 C A 3551 398 15 0.100404
+ E(HET_A1) P
+ <num> <num>
+1: 0.107347 1.000000
+2: 0.261040 0.113354
+3: 0.424044 0.330273
+4: 0.102143 0.274719
+The p-values for the HWE test for the four SNPs are as follows: - rs1861: p-value = 0.275 - rs3813199: p-value = 1.000 - rs3128342: p-value = 0.330 - rs11804831: p-value = 0.113
+Question 4: Genetic Association Test
+-
+
- Conduct a linear regression to test the association between SNP rs1861 and the phenotype. What is the p-value? +
# Load the genotype matrix for SNP rs1861 with additive coding
+geno_subset <- fread("./02_activities/data/geno.additiveA1.raw")
+data_sub <- geno_subset[, c("PHENOTYPE", "rs1861_C")]
+data_sub <- data_sub %>%
+ mutate(rs1861_C = 2 - rs1861_C)
+data_sub <- na.omit(data_sub) # Remove rows with missing values
+head(data_sub) PHENOTYPE rs1861_C
+ <num> <num>
+1: 1.85385 0
+2: 2.08264 0
+3: 2.73871 0
+4: 1.34114 0
+5: 2.38297 0
+6: 1.51430 0
+# Fit linear regression model
+model1 <- lm(PHENOTYPE ~ rs1861_C, data = data_sub)
+summary(model1)
+Call:
+lm(formula = PHENOTYPE ~ rs1861_C, data = data_sub)
+
+Residuals:
+ Min 1Q Median 3Q Max
+-3.5439 -0.6850 0.0021 0.6993 3.3268
+
+Coefficients:
+ Estimate Std. Error t value Pr(>|t|)
+(Intercept) 2.00003 0.01680 119.0 <2e-16 ***
+rs1861_C -0.97382 0.04943 -19.7 <2e-16 ***
+---
+Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+Residual standard error: 1.003 on 3962 degrees of freedom
+Multiple R-squared: 0.08924, Adjusted R-squared: 0.08901
+F-statistic: 388.2 on 1 and 3962 DF, p-value: < 2.2e-16
+The p-value for the association between SNP rs1861 and the phenotype is <2e-16
+-
+
- How would you interpret the beta coefficient from this regression? +
For each additional copy of the C allele at SNP rs1861, the phenotype is expected to decrease by approximately 0.967 units. The significant p-value suggests evidence of an association between this SNP and the phenotype in our sample.
+-
+
- Plot the scatterplot of phenotype versus the genotype of SNP rs1861. Add the regression line to the plot. +
plot1 <- ggplot(data = data_sub, aes(x = rs1861_C, y = PHENOTYPE)) +
+ geom_point() +
+ geom_smooth(method = "lm", se = FALSE, color = "blue") +
+ labs(x= "Genotype (0/1/2)", y = "Phenotype")
+
+print(plot1)
-
+
- Convert the genotype coding for rs1861 to recessive coding. +
geno_subset_recessive <- data_sub # make a copy
+geno_subset_recessive$rs1861_C <- ifelse(geno_subset_recessive$rs1861_C == 2, 1, 0)-
+
- Conduct a linear regression to test the association between the recessive-coded rs1861 and the phenotype. What is the p-value? +
# Fit linear regression model
+model2 <- lm(PHENOTYPE ~ rs1861_C, data = geno_subset_recessive)
+summary(model2)
+Call:
+lm(formula = PHENOTYPE ~ rs1861_C, data = geno_subset_recessive)
+
+Residuals:
+ Min 1Q Median 3Q Max
+-3.9087 -0.7086 0.0022 0.7202 3.4248
+
+Coefficients:
+ Estimate Std. Error t value Pr(>|t|)
+(Intercept) 1.90204 0.01662 114.430 < 2e-16 ***
+rs1861_C -1.89035 0.27021 -6.996 3.08e-12 ***
+---
+Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+Residual standard error: 1.045 on 3962 degrees of freedom
+Multiple R-squared: 0.0122, Adjusted R-squared: 0.01195
+F-statistic: 48.94 on 1 and 3962 DF, p-value: 3.081e-12
+The p-value for the association between the recessive-coded rs1861 and the phenotype is 3.08e-12.
+-
+
- Plot the scatterplot of phenotype versus the recessive-coded genotype of rs1861. Add the regression line to the plot. +
plot2 <- ggplot(data = geno_subset_recessive, aes(x = rs1861_C, y = PHENOTYPE)) +
+ geom_point() +
+ geom_smooth(method = "lm", se = FALSE, color = "blue") +
+ labs(x= "Genotype (0/1/2)", y = "Phenotype")
+
+print(plot2)`geom_smooth()` using formula = 'y ~ x'
+
-
+
- Which model fits better? Justify your answer. +
The additive model fits better than the recessive model. This is evident from the much smaller p-value in the additive model (<2e-16) compared to the recessive model (3.08e-12), indicating stronger evidence of association. Additionally, the scatterplot for the additive model shows a clearer linear relationship between genotype and phenotype, while the recessive model’s plot is more binary, suggesting that the additive coding captures more of the variation in the phenotype.
+Criteria
+| Criteria | +Complete | +Incomplete | +
|---|---|---|
| Data Inspection | +Correct sample/SNP counts and variable type identified. | +Missing or incorrect counts or variable type. | +
| Allele Frequency Estimation | +Correct allele and minor allele frequencies computed. | +Frequencies missing or wrong. | +
| Hardy–Weinberg Equilibrium Test | +Correct PLINK command and p-value extraction in R. | +PLINK command or extraction incorrect/missing. | +
| Genetic Association Test | +Correct regressions, plots, coding, and interpretation. | +Regression, plots, or interpretation missing/incomplete. | +
Submission Information
+📌 Please review our Assignment Submission Guide for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.
+Note:
+If you like, you may collaborate with others in the cohort. If you choose to do so, please indicate with whom you have worked with in your pull request by tagging their GitHub username. Separate submissions are required.
++
Submission Parameters
+-
+
Submission Due Date:
11:59 PM – 16/03/2026
+Branch name for your repo should be:
assignment-1
+What to submit for this assignment:
+-
+
- Populate this Quarto document (
assignment_1.qmd).
+ - Render the document with Quarto:
quarto render assignment_1.qmd.
+ - Submit both
assignment_1.qmdand the rendered HTML fileassignment_1.htmlin your pull request.
+
+- Populate this Quarto document (
What the pull request link should look like for this assignment:
+https://github.com/<your_github_username>/gen_data/pull/<pr_id>-
+
- Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support team review your submission easily. +
+
+
Checklist:
+-
+
- Created a branch with the correct naming convention. +
- Ensured that the repository is public. +
- Reviewed the PR description guidelines and adhered to them. +
- Verified that the link is accessible in a private browser window. +
- Confirmed that both
assignment_1.qmdandassignment_1.htmlare included in the pull request.
+
If you encounter any difficulties or have questions, please don’t hesitate to reach out to our team via our Slack help channel. Our technical facilitators and learning support team are here to help you navigate any challenges.
+