diff --git a/02_activities/assignments/assignment_1.html b/02_activities/assignments/assignment_1.html new file mode 100644 index 0000000..2c61bcf --- /dev/null +++ b/02_activities/assignments/assignment_1.html @@ -0,0 +1,976 @@ + + + + + + + + + +Assignment #1 + + + + + + + + + + + + + + + + + + + + + +
+ +
+ +
+
+

Assignment #1

+
+ + + +
+ + + + +
+ + + +
+ + +
+

Assignment 1

+

You only need to write lines of code for each question. When answering questions that ask you to identify or interpret something, the length of your response doesn’t matter. For example, if the answer is just ‘yes,’ ‘no,’ or a number, you can just give that answer without adding anything else.

+

We will go through comparable code and concepts in the live learning session. If you run into trouble, start by using the help help() function in R, to get information about the datasets and function in question. The internet is also a great resource when coding (though note that no outside searches are required by the assignment!). If you do incorporate code from the internet, please cite the source within your code (providing a URL is sufficient).

+

Please bring questions that you cannot work out on your own to office hours, work periods or share with your peers on Slack. We will work with you through the issue.

+

You will need to install PLINK and run the analyses. Please follow the OS-specific setup guide in SETUP.md. PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner.

+
+

Question 1: Data inspection

+

Before fitting any models, it is essential to understand the data. Use R or bash code to answer the following questions about the gwa.qc.A1.fam, gwa.qc.A1.bim, and gwa.qc.A1.bed files, available at the following Google Drive link: https://drive.google.com/drive/folders/11meVqGCY5yAyI1fh-fAlMEXQt0VmRGuz?usp=drive_link. Please download all three files from this link and place them in 02_activities/data/.

+
+
library(data.table)
+library(ggplot2)
+library(seqminer)
+library(HardyWeinberg)
+library(dplyr)
+
+
    +
  1. Read the .fam file. How many samples does the dataset contain?
  2. +
+
+
cd ~/Downloads/gen_data/02_activities/data
+pwd
+head gwa.qc.A1.fam
+
+
/c/Users/obedk/Downloads/gen_data/02_activities/data
+0   A2001   0   0   1   -0.694438129641973
+1   A2002   0   0   1   1.85384536141856
+2   A2003   0   0   1   2.08263677761584
+3   A2004   0   0   1   2.73871473943968
+4   A2005   0   0   1   1.34114035564636
+5   A2006   0   0   1   0.416778586749647
+6   A2007   0   0   1   2.38297123290054
+7   A2008   0   0   1   1.51429928826958
+8   A2009   0   0   1   0.718686390529039
+9   A2010   0   0   1   2.08904136245205
+
+
+
+
# Read the .fam file
+fam_file <- read.table("~/Downloads/gen_data/02_activities/data/gwa.qc.A1.fam", header = FALSE)
+
+# Count the number of samples and display the result
+num_samples <- nrow(fam_file)
+print(paste("Number of samples in the dataset:", num_samples))
+
+
[1] "Number of samples in the dataset: 4000"
+
+
+
    +
  1. What is the ‘variable type’ of the response variable (i.e.Continuous or binary)?
  2. +
+
Continuous variable.
+
    +
  1. Read the .bim file. How many SNPs does the dataset contain?
  2. +
+
+
# Read the .bim file
+bim_file <- read.table("~/Downloads/gen_data/02_activities/data/gwa.qc.A1.bim", header = FALSE)
+
+# Count the number of SNPs and display the result
+num_snps <- nrow(bim_file)
+print(paste("Number of SNPs in the dataset:", num_snps))
+
+
[1] "Number of SNPs in the dataset: 101083"
+
+
+
+
+

Question 2: Allele Frequency Estimation

+
    +
  1. Load the genotype matrix for SNPs rs1861, rs3813199, rs3128342, and rs11804831 using additive coding. What are the allele frequencies (AFs) for these four SNPs?
  2. +
+
+
# Step 1: Create SNP list file with 4 specified SNPs
+printf "%s\n" rs1861 rs3813199 rs3128342 rs11804831 > ~/Downloads/gen_data/02_activities/data/snplist.txt
+
+
+
# Step 2: Extract genotypes for the specified SNPs and recode to additive format
+/c/PLINK/plink2.exe --bfile ~/Downloads/gen_data/02_activities/data/gwa.qc.A1 \
+      --extract ~/Downloads/gen_data/02_activities/data/snplist.txt \
+      --recode A \
+      --out ~/Downloads/gen_data/02_activities/data/geno.additive
+
+
+
# Step 3: Load genotype data matrix
+geno_subset <- fread("~/Downloads/gen_data/02_activities/data/geno.additive.raw")
+geno_subset=na.omit(geno_subset)
+geno_subset$PHENOTYPE=geno_subset$PHENOTYPE-1
+head(geno_subset)
+
+
     FID    IID   PAT   MAT   SEX PHENOTYPE rs3813199_G rs11804831_T
+   <int> <char> <int> <int> <int>     <num>       <int>        <int>
+1:     1  A2002     0     0     1   0.85385           2            2
+2:     2  A2003     0     0     1   1.08264           2            1
+3:     3  A2004     0     0     1   1.73871           2            2
+4:     4  A2005     0     0     1   0.34114           2            1
+5:     6  A2007     0     0     1   1.38297           2            2
+6:     7  A2008     0     0     1   0.51430           2            2
+   rs3128342_C rs1861_C
+         <int>    <int>
+1:           2        2
+2:           2        2
+3:           1        2
+4:           1        2
+5:           2        2
+6:           1        2
+
+
+
+
# Step 4: Calculate allele frequencies for the specified SNPs
+/c/PLINK/plink2.exe --bfile ~/Downloads/gen_data/02_activities/data/gwa.qc.A1 \
+      --extract ~/Downloads/gen_data/02_activities/data/snplist.txt \
+      --freq \
+      --out ~/Downloads/gen_data/02_activities/data/gwa_qc_A1_freqs
+
+# The allele frequencies will be saved in the file gwa_qc_A1_freqs.afreq with column ALT_FREQS representing the allele frequencies for the specified SNPs.
+
+
+
# Step 5: View allele frequencies of the specified SNPs
+freq <- fread("~/Downloads/gen_data/02_activities/data/gwa_qc_A1_freqs.afreq")
+head(freq)
+
+
   #CHROM         ID    REF    ALT PROVISIONAL_REF? ALT_FREQS OBS_CT
+    <int>     <char> <char> <char>           <char>     <num>  <int>
+1:      1  rs3813199      G      A                Y 0.0569126   7942
+2:      1 rs11804831      T      C                Y 0.1543410   7924
+3:      1  rs3128342      C      A                Y 0.3051210   7928
+4:      1     rs1861      C      A                Y 0.0539859   7928
+
+
+
    +
  1. What are the minor allele frequencies (MAFs) for these four SNPs?
  2. +
+
+
# Calculate minor allele frequencies (MAFs) for the specified SNPs   
+awk 'NR==1 {print "SNP_ID\tMAF"} NR>1 {maf = ($6 > 0.5) ? 1 - $6 : $6; print $2 "\t" maf}' ~/Downloads/gen_data/02_activities/data/gwa_qc_A1_freqs.afreq
+
+# The MAFs will be calculated based on the ALT_FREQS column in the gwa_qc_A1_freqs.afreq file. The output will show the SNP ID and its corresponding MAF. Since the MAF is the frequency of the less common allele, it is calculated as 1 - ALT_FREQS if ALT_FREQS is greater than 0.5, otherwise it is just ALT_FREQS.
+
+
SNP_ID  MAF
+rs3813199   0.0569126
+rs11804831  0.154341
+rs3128342   0.305121
+rs1861  0.0539859
+
+
+
+
+

Question 3: Hardy–Weinberg Equilibrium (HWE) Test

+
    +
  1. Conduct the Hardy–Weinberg Equilibrium (HWE) test for all SNPs in the .bim file. Then, load the file containing the HWE p-value results and display the first few rows of the resulting data frame.
  2. +
+
+
# Step 1: Conduct HWE test for all SNPs in the .bim file
+/c/PLINK/plink2.exe --bfile ~/Downloads/gen_data/02_activities/data/gwa.qc.A1 --hardy --out ~/Downloads/gen_data/02_activities/data/gwa_qc_A1_hwe
+
+

This produces gwa_qc_A1_hwe.hardy containing observed and expected genotype counts plus p-values.

+
+
# Step 2: View HWE results
+hwe <- fread("~/Downloads/gen_data/02_activities/data/gwa_qc_A1_hwe.hardy")
+head(hwe)
+
+
   #CHROM         ID     A1     AX HOM_A1_CT HET_A1_CT TWO_AX_CT O(HET_A1)
+    <int>     <char> <char> <char>     <int>     <int>     <int>     <num>
+1:      1  rs3737728      G      A      1713      1841       428  0.462330
+2:      1  rs1320565      C      T      3368       589        19  0.148139
+3:      1  rs3813199      G      A      3531       428        12  0.107781
+4:      1 rs11804831      T      C      2820      1061        81  0.267794
+5:      1  rs3766178      T      C      2391      1378       214  0.345970
+6:      1  rs3128342      C      A      1927      1655       382  0.417508
+   E(HET_A1)         P
+       <num>     <num>
+1:  0.447932 0.0437892
+2:  0.145262 0.2734290
+3:  0.107347 1.0000000
+4:  0.261040 0.1133540
+5:  0.350629 0.4158770
+6:  0.424044 0.3302730
+
+
+
    +
  1. What are the HWE p-values for SNPs rs1861, rs3813199, rs3128342, and rs11804831?
  2. +
+
+
# Extract HWE p-values for the specified SNPs       
+grep -wFf ~/Downloads/gen_data/02_activities/data/snplist.txt ~/Downloads/gen_data/02_activities/data/gwa_qc_A1_hwe.hardy | awk '{print $2 "\t" $10}'
+
+
rs3813199   1
+rs11804831  0.113354
+rs3128342   0.330273
+rs1861  0.274719
+
+
+
+
+

Question 4: Genetic Association Test

+
    +
  1. Conduct a linear regression to test the association between SNP rs1861 and the phenotype. What is the p-value?
  2. +
+
+
#Pick SNP rs1861
+table(geno_subset$rs1861_C)
+
+

+   0    1    2 
+  14  389 3459 
+
+
freq[freq$ID=='rs1861',]
+
+
   #CHROM     ID    REF    ALT PROVISIONAL_REF? ALT_FREQS OBS_CT
+    <int> <char> <char> <char>           <char>     <num>  <int>
+1:      1 rs1861      C      A                Y 0.0539859   7928
+
+
head(geno_subset)
+
+
     FID    IID   PAT   MAT   SEX PHENOTYPE rs3813199_G rs11804831_T
+   <int> <char> <int> <int> <int>     <num>       <int>        <int>
+1:     1  A2002     0     0     1   0.85385           2            2
+2:     2  A2003     0     0     1   1.08264           2            1
+3:     3  A2004     0     0     1   1.73871           2            2
+4:     4  A2005     0     0     1   0.34114           2            1
+5:     6  A2007     0     0     1   1.38297           2            2
+6:     7  A2008     0     0     1   0.51430           2            2
+   rs3128342_C rs1861_C
+         <int>    <int>
+1:           2        2
+2:           2        2
+3:           1        2
+4:           1        2
+5:           2        2
+6:           1        2
+
+
+

The C allele is the major allele, with frequency 1 - 0.0539859.

+
+
# Fit linear regression model
+model <- glm(PHENOTYPE ~ rs1861_C, data = geno_subset, family = gaussian)
+
+# View model summary
+summary(model)
+
+

+Call:
+glm(formula = PHENOTYPE ~ rs1861_C, family = gaussian, data = geno_subset)
+
+Coefficients:
+            Estimate Std. Error t value Pr(>|t|)    
+(Intercept) -0.93891    0.09626  -9.753   <2e-16 ***
+rs1861_C     0.96698    0.05016  19.279   <2e-16 ***
+---
+Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+(Dispersion parameter for gaussian family taken to be 1.00628)
+
+    Null deviance: 4258.2  on 3861  degrees of freedom
+Residual deviance: 3884.2  on 3860  degrees of freedom
+AIC: 10988
+
+Number of Fisher Scoring iterations: 2
+
+
table(geno_subset$rs1861_C)
+
+

+   0    1    2 
+  14  389 3459 
+
+
# The p value for the association between SNP rs1861 and the phenotype is p < 2e-16, which is highly significant. This suggests that there is a strong association between the genotype of SNP rs1861 and the phenotype being studied. 
+
+
    +
  1. How would you interpret the beta coefficient from this regression?
  2. +
+
The beta coefficient for rs1861_C represents the change in the phenotype associated with each additional copy of the C allele. If the beta coefficient is positive, it indicates that the presence of the C allele is associated with an increase in the phenotype. Conversely, if the beta coefficient is negative, it suggests that the C allele is associated with a decrease in the phenotype. The magnitude of the beta coefficient indicates the strength of this association, while its sign (positive or negative) indicates the direction of the effect.
+
+In this case, since the beta coefficient is positive (i.e., +0.96698), it suggests that individuals with more copies of the C allele tend to have higher values of the phenotype. The exact interpretation would depend on the context of the phenotype being studied and the units of measurement for both the genotype and the phenotype.
+
    +
  1. Plot the scatterplot of phenotype versus the genotype of SNP rs1861. Add the regression line to the plot.
  2. +
+
+
# 1. Clean the data (remove missing genotypes or phenotypes)
+clean_data <- geno_subset %>%
+  filter(!is.na(rs1861_C), !is.na(PHENOTYPE))
+
+# 2. Build the scatterplot
+p1 <- ggplot(clean_data, aes(x = rs1861_C, y = PHENOTYPE)) +
+  # Add points with horizontal jitter and transparency to prevent overlapping
+  geom_jitter(width = 0.1, height = 0, color = "blue", alpha = 0.3, size = 2) +
+  # Automatically calculate and draw the linear regression line (with confidence interval)
+  geom_smooth(method = "lm", color = "red", se = TRUE, linewidth = 1.2) +
+  labs(
+    title = "Linear Regression: Continuous Phenotype ~ Genotype (rs1861)",
+    x = "Genotype (Number of Alternative Alleles: 0, 1, 2)",
+    y = "Phenotype Measurement"
+  ) +
+  # Lock the x-axis to our discrete genotype categories
+  scale_x_continuous(breaks = c(0, 1, 2)) +
+  coord_cartesian(xlim = c(-0.5, 2.5)) +
+  # Apply a clean, minimal theme
+  theme_minimal()
+
+# 3. Display the plot
+print(p1)
+
+
+
+

+
+
+
+
+
    +
  1. Convert the genotype coding for rs1861 to recessive coding.
  2. +
+
+
# Make a copy for recessive data
+geno_subset_rec <- geno_subset  
+
+# Apply the recessive logic (only 2 becomes 1, everything else is 0)
+geno_subset_rec[, 7:ncol(geno_subset_rec)] <- as.data.frame(
+  lapply(geno_subset[, 7:ncol(geno_subset)], function(x) ifelse(x == 2, 1, 0))
+)
+
+
    +
  1. Conduct a linear regression to test the association between the recessive-coded rs1861 and the phenotype. What is the p-value?
  2. +
+
+
# Fit linear regression model
+model <- glm(PHENOTYPE ~ rs1861_C, data = geno_subset_rec, family = gaussian)
+summary(model)
+
+

+Call:
+glm(formula = PHENOTYPE ~ rs1861_C, family = gaussian, data = geno_subset_rec)
+
+Coefficients:
+             Estimate Std. Error t value Pr(>|t|)    
+(Intercept) -0.006248   0.050047  -0.125    0.901    
+rs1861_C     1.001393   0.052882  18.936   <2e-16 ***
+---
+Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+(Dispersion parameter for gaussian family taken to be 1.0094)
+
+    Null deviance: 4258.2  on 3861  degrees of freedom
+Residual deviance: 3896.3  on 3860  degrees of freedom
+AIC: 11000
+
+Number of Fisher Scoring iterations: 2
+
+
+

The p-value for the association between the recessive-coded rs1861 and the phenotype is p < 2e-16, which is highly significant. This suggests that there is a strong association between the recessive genotype of SNP rs1861 and the phenotype being studied. The interpretation of this result would depend on the context of the phenotype and the specific coding used for the recessive model, but it indicates that individuals with two copies of the alternative allele (coded as 1) have a significantly different phenotype compared to those with zero or one copy (coded as 0).

+
    +
  1. Plot the scatterplot of phenotype versus the recessive-coded genotype of rs1861. Add the regression line to the plot.
  2. +
+
+
# 1. Clean the data using your new recessive dataframe
+clean_data_rec <- geno_subset_rec %>%
+  filter(!is.na(rs1861_C), !is.na(PHENOTYPE))
+
+# 2. Build the continuous scatterplot
+p4 <- ggplot(clean_data_rec, aes(x = rs1861_C, y = PHENOTYPE)) +
+  # Add points with slight horizontal jitter
+  geom_jitter(width = 0.05, height = 0, color = "blue", alpha = 0.3, size = 2) +
+  # Automatically draw the linear regression line
+  geom_smooth(method = "lm", color = "red", se = TRUE, linewidth = 1.2) +
+  labs(
+    title = "Linear Regression: Continuous Phenotype ~ Recessive Genotype",
+    # Updated x-axis label to reflect recessive biology
+    x = "Genotype (0 = Carrier/No Alt, 1 = Exactly 2 Copies of Alt)",
+    y = "Continuous Phenotype"
+  ) +
+  # Lock the x-axis to exactly 0 and 1
+  scale_x_continuous(breaks = c(0, 1)) +
+  coord_cartesian(xlim = c(-0.5, 1.5)) +
+  theme_minimal()
+
+# 3. Display the plot
+print(p4)
+
+
+
+

+
+
+
+
+
    +
  1. Which model fits better? Justify your answer.
  2. +
+
To determine which model fits better, we can compare the two linear regression models (additive vs. recessive) using metrics such as the Akaike Information Criterion (AIC) or the R-squared value. The model with the lower AIC or higher R-squared would be considered a better fit to the data. Additionally, we can look at the significance of the coefficients and the overall p-values for each model to assess which one provides a more meaningful association between the genotype and the phenotype. In this case, since both models have highly significant p-values, we would need to compare their AIC values or R-squared values to make a final determination on which model fits better. The additive model with an AIC of 10988 compared with 11000 for the recessive model suggests that the additive model fits the data better, as it has a lower AIC value. This indicates that the additive model provides a better balance of goodness-of-fit and model complexity compared to the recessive model.
+
+
+

Criteria

+ +++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
CriteriaCompleteIncomplete
Data InspectionCorrect sample/SNP counts and variable type identified.Missing or incorrect counts or variable type.
Allele Frequency EstimationCorrect allele and minor allele frequencies computed.Frequencies missing or wrong.
Hardy–Weinberg Equilibrium TestCorrect PLINK command and p-value extraction in R.PLINK command or extraction incorrect/missing.
Genetic Association TestCorrect regressions, plots, coding, and interpretation.Regression, plots, or interpretation missing/incomplete.
+
+
+

Submission Information

+

📌 Please review our Assignment Submission Guide for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

+
+

Note:

+

If you like, you may collaborate with others in the cohort. If you choose to do so, please indicate with whom you have worked with in your pull request by tagging their GitHub username. Separate submissions are required.

+
+
+
+

Submission Parameters

+
    +
  • Submission Due Date: 11:59 PM – 16/03/2026

  • +
  • Branch name for your repo should be: assignment-1

  • +
  • What to submit for this assignment:

    +
      +
    • Populate this Quarto document (assignment_1.qmd).
    • +
    • Render the document with Quarto: quarto render assignment_1.qmd.
    • +
    • Submit both assignment_1.qmd and the rendered HTML file assignment_1.html in your pull request.
    • +
  • +
  • What the pull request link should look like for this assignment: https://github.com/<your_github_username>/gen_data/pull/<pr_id>

    +
      +
    • Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support team review your submission easily.
    • +
  • +
+
+

Checklist:

+
    +
  • Created a branch with the correct naming convention.
  • +
  • Ensured that the repository is public.
  • +
  • Reviewed the PR description guidelines and adhered to them.
  • +
  • Verified that the link is accessible in a private browser window.
  • +
  • Confirmed that both assignment_1.qmd and assignment_1.html are included in the pull request.
  • +
+

If you encounter any difficulties or have questions, please don’t hesitate to reach out to our team via our Slack help channel. Our technical facilitators and learning support team are here to help you navigate any challenges.

+
+
+
+ +
+ + +
+ + + + + \ No newline at end of file diff --git a/02_activities/assignments/assignment_1.qmd b/02_activities/assignments/assignment_1.qmd index 550af3d..a19fe9a 100644 --- a/02_activities/assignments/assignment_1.qmd +++ b/02_activities/assignments/assignment_1.qmd @@ -17,94 +17,242 @@ You will need to install PLINK and run the analyses. Please follow the OS-specif Before fitting any models, it is essential to understand the data. Use R or bash code to answer the following questions about the `gwa.qc.A1.fam`, `gwa.qc.A1.bim`, and `gwa.qc.A1.bed` files, available at the following Google Drive link: . Please download all three files from this link and place them in `02_activities/data/`. +```{r setup, include=FALSE} +# Set up a consistent path for all chunks of code +knitr::opts_knit$set(root.dir = normalizePath("../../")) + +# Force Quarto to use Git Bash for all {bash} chunks +knitr::opts_chunk$set(engine.path = list(bash = "C:/Program Files/Git/bin/bash.exe")) +``` + +```{r, message=FALSE, warning=FALSE, results='hide'} +library(data.table) +library(ggplot2) +library(seqminer) +library(HardyWeinberg) +library(dplyr) +``` (i) Read the .fam file. How many samples does the dataset contain? -``` -# Your answer here... +```{bash} +cd ~/Downloads/gen_data/02_activities/data +pwd +head gwa.qc.A1.fam +``` +```{r} +# Read the .fam file +fam_file <- read.table("~/Downloads/gen_data/02_activities/data/gwa.qc.A1.fam", header = FALSE) + +# Count the number of samples and display the result +num_samples <- nrow(fam_file) +print(paste("Number of samples in the dataset:", num_samples)) ``` (ii) What is the 'variable type' of the response variable (i.e.Continuous or binary)? ``` -# Your answer here... +Continuous variable. ``` (iii) Read the .bim file. How many SNPs does the dataset contain? -``` -# Your answer here... +```{r} +# Read the .bim file +bim_file <- read.table("~/Downloads/gen_data/02_activities/data/gwa.qc.A1.bim", header = FALSE) + +# Count the number of SNPs and display the result +num_snps <- nrow(bim_file) +print(paste("Number of SNPs in the dataset:", num_snps)) ``` #### Question 2: Allele Frequency Estimation (i) Load the genotype matrix for SNPs rs1861, rs3813199, rs3128342, and rs11804831 using additive coding. What are the allele frequencies (AFs) for these four SNPs? -``` -# Your code here... +```{bash,message=FALSE, warning=FALSE} +# Step 1: Create SNP list file with 4 specified SNPs +printf "%s\n" rs1861 rs3813199 rs3128342 rs11804831 > ~/Downloads/gen_data/02_activities/data/snplist.txt +``` + +```{bash, results='hide',message=FALSE, warning=FALSE} +# Step 2: Extract genotypes for the specified SNPs and recode to additive format +/c/PLINK/plink2.exe --bfile ~/Downloads/gen_data/02_activities/data/gwa.qc.A1 \ + --extract ~/Downloads/gen_data/02_activities/data/snplist.txt \ + --recode A \ + --out ~/Downloads/gen_data/02_activities/data/geno.additive +``` + +```{r} +# Step 3: Load genotype data matrix +geno_subset <- fread("~/Downloads/gen_data/02_activities/data/geno.additive.raw") +geno_subset=na.omit(geno_subset) +geno_subset$PHENOTYPE=geno_subset$PHENOTYPE-1 +head(geno_subset) +``` + +```{bash, results='hide',message=FALSE, warning=FALSE} +# Step 4: Calculate allele frequencies for the specified SNPs +/c/PLINK/plink2.exe --bfile ~/Downloads/gen_data/02_activities/data/gwa.qc.A1 \ + --extract ~/Downloads/gen_data/02_activities/data/snplist.txt \ + --freq \ + --out ~/Downloads/gen_data/02_activities/data/gwa_qc_A1_freqs + +# The allele frequencies will be saved in the file gwa_qc_A1_freqs.afreq with column ALT_FREQS representing the allele frequencies for the specified SNPs. +``` + +```{r,message=FALSE, warning=FALSE} +# Step 5: View allele frequencies of the specified SNPs +freq <- fread("~/Downloads/gen_data/02_activities/data/gwa_qc_A1_freqs.afreq") +head(freq) ``` (ii) What are the minor allele frequencies (MAFs) for these four SNPs? -``` -# Your code here... +```{bash, results='show',message=FALSE, warning=FALSE} +# Calculate minor allele frequencies (MAFs) for the specified SNPs +awk 'NR==1 {print "SNP_ID\tMAF"} NR>1 {maf = ($6 > 0.5) ? 1 - $6 : $6; print $2 "\t" maf}' ~/Downloads/gen_data/02_activities/data/gwa_qc_A1_freqs.afreq + +# The MAFs will be calculated based on the ALT_FREQS column in the gwa_qc_A1_freqs.afreq file. The output will show the SNP ID and its corresponding MAF. Since the MAF is the frequency of the less common allele, it is calculated as 1 - ALT_FREQS if ALT_FREQS is greater than 0.5, otherwise it is just ALT_FREQS. ``` #### Question 3: Hardy–Weinberg Equilibrium (HWE) Test (i) Conduct the Hardy–Weinberg Equilibrium (HWE) test for all SNPs in the .bim file. Then, load the file containing the HWE p-value results and display the first few rows of the resulting data frame. -``` -# Your code here... +```{bash,results='hide',message=FALSE, warning=FALSE} +# Step 1: Conduct HWE test for all SNPs in the .bim file +/c/PLINK/plink2.exe --bfile ~/Downloads/gen_data/02_activities/data/gwa.qc.A1 --hardy --out ~/Downloads/gen_data/02_activities/data/gwa_qc_A1_hwe +``` + +This produces gwa_qc_A1_hwe.hardy containing observed and expected genotype counts plus p-values. + +```{r,message=FALSE, warning=FALSE} +# Step 2: View HWE results +hwe <- fread("~/Downloads/gen_data/02_activities/data/gwa_qc_A1_hwe.hardy") +head(hwe) ``` (ii) What are the HWE p-values for SNPs rs1861, rs3813199, rs3128342, and rs11804831? -``` -# Your code here... +```{bash, results='show',message=FALSE, warning=FALSE} +# Extract HWE p-values for the specified SNPs +grep -wFf ~/Downloads/gen_data/02_activities/data/snplist.txt ~/Downloads/gen_data/02_activities/data/gwa_qc_A1_hwe.hardy | awk '{print $2 "\t" $10}' ``` #### Question 4: Genetic Association Test (i) Conduct a linear regression to test the association between SNP rs1861 and the phenotype. What is the p-value? -``` -# Your code here... +```{r,message=FALSE, warning=FALSE} +#Pick SNP rs1861 +table(geno_subset$rs1861_C) +freq[freq$ID=='rs1861',] +head(geno_subset) +``` + +The C allele is the major allele, with frequency 1 - 0.0539859. + +```{r,message=FALSE, warning=FALSE} +# Fit linear regression model +model <- glm(PHENOTYPE ~ rs1861_C, data = geno_subset, family = gaussian) + +# View model summary +summary(model) +table(geno_subset$rs1861_C) + +# The p value for the association between SNP rs1861 and the phenotype is p < 2e-16, which is highly significant. This suggests that there is a strong association between the genotype of SNP rs1861 and the phenotype being studied. ``` (ii) How would you interpret the beta coefficient from this regression? ``` -# Your answer here... +The beta coefficient for rs1861_C represents the change in the phenotype associated with each additional copy of the C allele. If the beta coefficient is positive, it indicates that the presence of the C allele is associated with an increase in the phenotype. Conversely, if the beta coefficient is negative, it suggests that the C allele is associated with a decrease in the phenotype. The magnitude of the beta coefficient indicates the strength of this association, while its sign (positive or negative) indicates the direction of the effect. + +In this case, since the beta coefficient is positive (i.e., +0.96698), it suggests that individuals with more copies of the C allele tend to have higher values of the phenotype. The exact interpretation would depend on the context of the phenotype being studied and the units of measurement for both the genotype and the phenotype. ``` (iii) Plot the scatterplot of phenotype versus the genotype of SNP rs1861. Add the regression line to the plot. -``` -# Your code here... +```{r,message=FALSE, warning=FALSE} + +# 1. Clean the data (remove missing genotypes or phenotypes) +clean_data <- geno_subset %>% + filter(!is.na(rs1861_C), !is.na(PHENOTYPE)) + +# 2. Build the scatterplot +p1 <- ggplot(clean_data, aes(x = rs1861_C, y = PHENOTYPE)) + + # Add points with horizontal jitter and transparency to prevent overlapping + geom_jitter(width = 0.1, height = 0, color = "blue", alpha = 0.3, size = 2) + + # Automatically calculate and draw the linear regression line (with confidence interval) + geom_smooth(method = "lm", color = "red", se = TRUE, linewidth = 1.2) + + labs( + title = "Linear Regression: Continuous Phenotype ~ Genotype (rs1861)", + x = "Genotype (Number of Alternative Alleles: 0, 1, 2)", + y = "Phenotype Measurement" + ) + + # Lock the x-axis to our discrete genotype categories + scale_x_continuous(breaks = c(0, 1, 2)) + + coord_cartesian(xlim = c(-0.5, 2.5)) + + # Apply a clean, minimal theme + theme_minimal() + +# 3. Display the plot +print(p1) ``` (iv) Convert the genotype coding for rs1861 to recessive coding. -``` -# Your code here... +```{r,message=FALSE, warning=FALSE} +# Make a copy for recessive data +geno_subset_rec <- geno_subset + +# Apply the recessive logic (only 2 becomes 1, everything else is 0) +geno_subset_rec[, 7:ncol(geno_subset_rec)] <- as.data.frame( + lapply(geno_subset[, 7:ncol(geno_subset)], function(x) ifelse(x == 2, 1, 0)) +) ``` (v) Conduct a linear regression to test the association between the recessive-coded rs1861 and the phenotype. What is the p-value? -``` -# Your code here... +```{r,message=FALSE, warning=FALSE} +# Fit linear regression model +model <- glm(PHENOTYPE ~ rs1861_C, data = geno_subset_rec, family = gaussian) +summary(model) ``` +The p-value for the association between the recessive-coded rs1861 and the phenotype is p < 2e-16, which is highly significant. This suggests that there is a strong association between the recessive genotype of SNP rs1861 and the phenotype being studied. The interpretation of this result would depend on the context of the phenotype and the specific coding used for the recessive model, but it indicates that individuals with two copies of the alternative allele (coded as 1) have a significantly different phenotype compared to those with zero or one copy (coded as 0). (vi) Plot the scatterplot of phenotype versus the recessive-coded genotype of rs1861. Add the regression line to the plot. -``` -# Your code here... +```{r,message=FALSE, warning=FALSE} +# 1. Clean the data using your new recessive dataframe +clean_data_rec <- geno_subset_rec %>% + filter(!is.na(rs1861_C), !is.na(PHENOTYPE)) + +# 2. Build the continuous scatterplot +p4 <- ggplot(clean_data_rec, aes(x = rs1861_C, y = PHENOTYPE)) + + # Add points with slight horizontal jitter + geom_jitter(width = 0.05, height = 0, color = "blue", alpha = 0.3, size = 2) + + # Automatically draw the linear regression line + geom_smooth(method = "lm", color = "red", se = TRUE, linewidth = 1.2) + + labs( + title = "Linear Regression: Continuous Phenotype ~ Recessive Genotype", + # Updated x-axis label to reflect recessive biology + x = "Genotype (0 = Carrier/No Alt, 1 = Exactly 2 Copies of Alt)", + y = "Continuous Phenotype" + ) + + # Lock the x-axis to exactly 0 and 1 + scale_x_continuous(breaks = c(0, 1)) + + coord_cartesian(xlim = c(-0.5, 1.5)) + + theme_minimal() + +# 3. Display the plot +print(p4) ``` (vii) Which model fits better? Justify your answer. ``` -# Your answer here... +To determine which model fits better, we can compare the two linear regression models (additive vs. recessive) using metrics such as the Akaike Information Criterion (AIC) or the R-squared value. The model with the lower AIC or higher R-squared would be considered a better fit to the data. Additionally, we can look at the significance of the coefficients and the overall p-values for each model to assess which one provides a more meaningful association between the genotype and the phenotype. In this case, since both models have highly significant p-values, we would need to compare their AIC values or R-squared values to make a final determination on which model fits better. The additive model with an AIC of 10988 compared with 11000 for the recessive model suggests that the additive model fits the data better, as it has a lower AIC value. This indicates that the additive model provides a better balance of goodness-of-fit and model complexity compared to the recessive model. ``` ### Criteria diff --git a/02_activities/assignments/assignment_2.html b/02_activities/assignments/assignment_2.html new file mode 100644 index 0000000..91f7505 --- /dev/null +++ b/02_activities/assignments/assignment_2.html @@ -0,0 +1,1078 @@ + + + + + + + + + +Assignment #2 + + + + + + + + + + + + + + + + + + + + + +
+ +
+ +
+
+

Assignment #2

+
+ + + +
+ + + + +
+ + + +
+ + +

You will need PLINK2 installed and available in your PATH. Please follow the OS-specific setup guide in SETUP.md. The dataset for this assignment consists of the following binary PLINK files: gwa.A2.bed, gwa.A2.bim, gwa.A2.fam , available at the following Google Drive link: https://drive.google.com/drive/folders/1rHoy3z52Yyj985ukjjtLhfIBxchpyYtZ?usp=drive_link. Please download all three files and save them in 02_activities/data/.

+
+
library(data.table)
+library(ggplot2)
+library(seqminer)
+library(HardyWeinberg)
+library(dplyr)
+library(data.table)
+library(qqman)
+
+
+
cd ~/Downloads/gen_data/02_activities/data
+pwd
+head gwa.A2.fam
+
+
/c/Users/obedk/Downloads/gen_data/02_activities/data
+0 A2001 0 0 1 -0.275420523713885
+1 A2002 0 0 1 -1.99584337519063
+2 A2003 0 0 1 -0.758778643869391
+3 A2004 0 0 1 0.53728932841857
+4 A2005 0 0 1 1.13676697273246
+5 A2006 0 0 1 -0.20518361896695
+6 A2007 0 0 1 -0.0603085146427615
+7 A2008 0 0 1 0.648606850299281
+8 A2009 0 0 1 -1.31834326423139
+9 A2010 0 0 1 2.08729043554028
+
+
+
+

Question 1: Data inspection

+

Before you run any models, first get familiar with the dataset. You may find data.table::fread() in R helpful for reading .bim and .fam files.

+
    +
  1. Read the .fam file. How many samples does the dataset contain?
  2. +
+
+
# Reading the .fam file
+fam_file <- read.table("~/Downloads/gen_data/02_activities/data/gwa.A2.fam", header = FALSE)
+
+# Count the number of samples and display the result
+num_samples <- nrow(fam_file)
+print(paste("Number of samples in the dataset:", num_samples))
+
+
[1] "Number of samples in the dataset: 4000"
+
+
+
    +
  1. Read the .bim file. How many SNPs does the dataset contain?
  2. +
+
+
# Read the .bim file
+bim_file <- read.table("~/Downloads/gen_data/02_activities/data/gwa.A2.bim", header = FALSE)
+
+# Count the number of SNPs and display the result
+num_snps <- nrow(bim_file)
+print(paste("Number of SNPs in the dataset:", num_snps))
+
+
[1] "Number of SNPs in the dataset: 306102"
+
+
+
    +
  1. Make a histogram (or density plot) of the phenotype. Does it look roughly Gaussian?
  2. +
+
+
# 1. Extracting the phenotype column (6th column of the .fam file)
+phenotype <- read.table("~/Downloads/gen_data/02_activities/data/gwa.A2.fam", header = FALSE)$V6
+
+# Filtering out missing values if present (assuming -9 indicates missing phenotypes)
+phenotype_clean <- phenotype[phenotype != -9 & !is.na(phenotype)]
+
+# 2. Make a Histogram
+hist(phenotype_clean, 
+     main = "Histogram of Phenotype", 
+     xlab = "Phenotype Value", 
+     col = "lightblue", 
+     border = "black",
+     breaks = 20) 
+
+
+
+

+
+
+
+
# 3. Make a Density Plot to better visualize the distribution
+plot(density(phenotype_clean), 
+     main = "Density Plot of Phenotype", 
+     xlab = "Phenotype Value", 
+     col = "darkblue", 
+     lwd = 2)
+
+
+
+

+
+
+
+
+

Yes, the histogram (or density plot) displays a roughly symmetric, bell-shaped curve, indicating the phenotype follows a roughly Gaussian distribution.

+
+
+

Question 2: Quality control (QC)

+

Now we will perform QC using PLINK2 for the genotype files in gwa.A2.

+
    +
  1. Using PLINK2 from the command line (bash), perform basic QC with the following filters: MAF ≥ 0.05, SNP missingness (--geno) ≤ 0.01, individual missingness (--mind) ≤ 0.10, and HWE p-value ≥ 0.00005, and output the QC’ed dataset as gwa.qc.A2.
  2. +
+
+
/c/PLINK/plink2.exe --bfile ~/Downloads/gen_data/02_activities/data/gwa.A2 --maf 0.05 --geno 0.01 --mind 0.1 --hwe 0.00005 --make-bed --out ~/Downloads/gen_data/02_activities/data/gwa.qc.A2
+
+
PLINK v2.0.0-a.7 64-bit (10 Jan 2026)               cog-genomics.org/plink/2.0/
+(C) 2005-2026 Shaun Purcell, Christopher Chang    GNU General Public License v3
+Logging to C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2.log.
+Options in effect:
+  --bfile C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.A2
+  --geno 0.01
+  --hwe 0.00005
+  --maf 0.05
+  --make-bed
+  --mind 0.1
+  --out C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2
+
+Start time: Thu Apr 02 14:56:21 2026
+32479 MiB RAM detected; reserving 16239 MiB for main workspace.
+Using up to 12 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.A2.fam.
+306102 variants loaded from
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+Calculating sample missingness rates... 0%21%42%64%85%done.
+0 samples removed due to missing genotype data (--mind).
+4000 samples (2000 females, 2000 males; 4000 founders) remaining after main
+filters.
+4000 quantitative phenotype values remaining after main filters.
+Calculating allele frequencies... 0%21%42%64%85%done.
+--geno: 196578 variants removed due to missing genotype data.
+--hwe: 6 variants removed due to Hardy-Weinberg exact test (founders only).
+8435 variants removed due to allele frequency threshold(s)
+(--maf/--max-maf/--mac/--max-mac).
+101083 variants remaining after main filters.
+Writing C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2.fam ...
+done.
+Writing C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2.bim ...
+done.
+Writing C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2.bed ...
+0%21%43%65%87%done.
+End time: Thu Apr 02 14:56:22 2026
+
+
+
+
+

Question 3: Relatedness

+

In this question, you will use PLINK2’s built-in KING-robust kinship (--king-cutoff) detect and remove related individuals.

+
    +
  1. Perform LD pruning on gwa.qc.A2 using PLINK2 with the following parameters: --indep-pairwise 500 50 0.05, and then generate a new dataset containing only the pruned SNPs.
  2. +
+
+
# 1. LD pruning to identify a set of approximately independent SNPs
+/c/PLINK/plink2.exe --bfile ~/Downloads/gen_data/02_activities/data/gwa.qc.A2 --indep-pairwise 500 50 0.05 --out ~/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned
+
+# 2. Create a new dataset using only the pruned SNPs
+/c/PLINK/plink2.exe --bfile ~/Downloads/gen_data/02_activities/data/gwa.qc.A2 --extract ~/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned.prune.in --make-bed --out ~/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned
+
+
PLINK v2.0.0-a.7 64-bit (10 Jan 2026)               cog-genomics.org/plink/2.0/
+(C) 2005-2026 Shaun Purcell, Christopher Chang    GNU General Public License v3
+Logging to C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned.log.
+Options in effect:
+  --bfile C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2
+  --indep-pairwise 500 50 0.05
+  --out C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned
+
+Start time: Thu Apr 02 14:56:22 2026
+32479 MiB RAM detected; reserving 16239 MiB for main workspace.
+Using up to 12 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2.fam.
+101083 variants loaded from
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+Calculating allele frequencies... 0%64%done.
+--indep-pairwise (11 compute threads): 0%50%79169/101083 variants removed.
+Writing...
+Variant lists written to
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned.prune.in
+and
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned.prune.out
+.
+End time: Thu Apr 02 14:56:23 2026
+PLINK v2.0.0-a.7 64-bit (10 Jan 2026)               cog-genomics.org/plink/2.0/
+(C) 2005-2026 Shaun Purcell, Christopher Chang    GNU General Public License v3
+Logging to C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned.log.
+Options in effect:
+  --bfile C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2
+  --extract C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned.prune.in
+  --make-bed
+  --out C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned
+
+Start time: Thu Apr 02 14:56:23 2026
+32479 MiB RAM detected; reserving 16239 MiB for main workspace.
+Using up to 12 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2.fam.
+101083 variants loaded from
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+--extract: 21914 variants remaining.
+21914 variants remaining after main filters.
+Writing
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned.fam ...
+done.
+Writing
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned.bim ...
+done.
+Writing
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned.bed ...
+0%61%done.
+End time: Thu Apr 02 14:56:23 2026
+
+
+
    +
  1. Use PLINK2 on the LD-pruned dataset to identify a set of unrelated individuals up to (approximately) 2nd-degree relatives (use a kinship cutoff of 0.0884).
  2. +
+
+
/c/PLINK/plink2.exe --bfile ~/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned --king-cutoff 0.0884 --out ~/Downloads/gen_data/02_activities/data/gwa_king
+
+
PLINK v2.0.0-a.7 64-bit (10 Jan 2026)               cog-genomics.org/plink/2.0/
+(C) 2005-2026 Shaun Purcell, Christopher Chang    GNU General Public License v3
+Logging to C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa_king.log.
+Options in effect:
+  --bfile C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned
+  --king-cutoff 0.0884
+  --out C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa_king
+
+Start time: Thu Apr 02 14:56:23 2026
+32479 MiB RAM detected; reserving 16239 MiB for main workspace.
+Using up to 12 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned.fam.
+21914 variants loaded from
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned.bim.
+1 quantitative phenotype loaded (4000 values).
+--king-cutoff pass 1/1: Scanning for rare variants... 0%done.
+0 variants handled by initial scan (21914 remaining).
+
+--king-cutoff pass 1/1: 0 variants complete.
+--king-cutoff pass 1/1: 1536 variants complete.
+--king-cutoff pass 1/1: 3072 variants complete.
+--king-cutoff pass 1/1: 4608 variants complete.
+--king-cutoff pass 1/1: 6144 variants complete.
+--king-cutoff pass 1/1: 7680 variants complete.
+--king-cutoff pass 1/1: 9216 variants complete.
+--king-cutoff pass 1/1: 10752 variants complete.
+--king-cutoff pass 1/1: 12288 variants complete.
+--king-cutoff pass 1/1: 13824 variants complete.
+--king-cutoff pass 1/1: 15360 variants complete.
+--king-cutoff pass 1/1: 16896 variants complete.
+--king-cutoff pass 1/1: 18432 variants complete.
+--king-cutoff pass 1/1: 19968 variants complete.
+--king-cutoff pass 1/1: 21504 variants complete.
+--king-cutoff pass 1/1: Condensing...                 done.
+--king-cutoff: 21914 variants processed.
+--king-cutoff: Excluded sample IDs written to
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa_king.king.cutoff.out.id
+, and 4000 remaining sample IDs written to
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa_king.king.cutoff.in.id
+.
+End time: Thu Apr 02 14:56:25 2026
+
+
+
    +
  1. Using the unrelated individual list from part (ii), create a dataset containing only unrelated individuals and name it gwa.qc_unrelated.A2.
  2. +
+
+
/c/PLINK/plink2.exe --bfile ~/Downloads/gen_data/02_activities/data/gwa.qc.A2 --keep ~/Downloads/gen_data/02_activities/data/gwa_king.king.cutoff.in.id --make-bed --out ~/Downloads/gen_data/02_activities/data/gwa.qc_unrelated.A2
+
+
PLINK v2.0.0-a.7 64-bit (10 Jan 2026)               cog-genomics.org/plink/2.0/
+(C) 2005-2026 Shaun Purcell, Christopher Chang    GNU General Public License v3
+Logging to C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_unrelated.A2.log.
+Options in effect:
+  --bfile C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2
+  --keep C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa_king.king.cutoff.in.id
+  --make-bed
+  --out C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_unrelated.A2
+
+Start time: Thu Apr 02 14:56:25 2026
+32479 MiB RAM detected; reserving 16239 MiB for main workspace.
+Using up to 12 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2.fam.
+101083 variants loaded from
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+--keep: 4000 samples remaining.
+4000 samples (2000 females, 2000 males; 4000 founders) remaining after main
+filters.
+4000 quantitative phenotype values remaining after main filters.
+Writing
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_unrelated.A2.fam
+... done.
+Writing
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_unrelated.A2.bim
+... done.
+Writing
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_unrelated.A2.bed
+... 0%64%done.
+End time: Thu Apr 02 14:56:25 2026
+
+
+
+
+

Question 4: principal component analysis (PCA)

+
    +
  1. Using PLINK2, perform PCA on the unrelated, LD-pruned dataset to obtain principal components.

    +

    (Hint: Refer to the PCA section in the GWAS Tutorial II. You should use a *.prune.in file to select a set of independent SNPs before performing PCA.)

  2. +
+
+
# 1. First, create a pruned dataset of unrelated individuals
+/c/PLINK/plink2.exe --bfile ~/Downloads/gen_data/02_activities/data/gwa.qc_unrelated.A2 --extract ~/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned.prune.in --make-bed --out ~/Downloads/gen_data/02_activities/data/gwa.qc_unrelated_pruned
+
+# 2. Now perform PCA:
+/c/PLINK/plink2.exe --bfile ~/Downloads/gen_data/02_activities/data/gwa.qc_unrelated_pruned --pca 20 --out ~/Downloads/gen_data/02_activities/data/gwa_unrel_PCA
+
+
PLINK v2.0.0-a.7 64-bit (10 Jan 2026)               cog-genomics.org/plink/2.0/
+(C) 2005-2026 Shaun Purcell, Christopher Chang    GNU General Public License v3
+Logging to C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_unrelated_pruned.log.
+Options in effect:
+  --bfile C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_unrelated.A2
+  --extract C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned.prune.in
+  --make-bed
+  --out C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_unrelated_pruned
+
+Start time: Thu Apr 02 14:56:25 2026
+32479 MiB RAM detected; reserving 16239 MiB for main workspace.
+Using up to 12 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_unrelated.A2.fam.
+101083 variants loaded from
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_unrelated.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+--extract: 21914 variants remaining.
+21914 variants remaining after main filters.
+Writing
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_unrelated_pruned.fam
+... done.
+Writing
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_unrelated_pruned.bim
+... done.
+Writing
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_unrelated_pruned.bed
+... 0%61%done.
+End time: Thu Apr 02 14:56:26 2026
+PLINK v2.0.0-a.7 64-bit (10 Jan 2026)               cog-genomics.org/plink/2.0/
+(C) 2005-2026 Shaun Purcell, Christopher Chang    GNU General Public License v3
+Logging to C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa_unrel_PCA.log.
+Options in effect:
+  --bfile C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_unrelated_pruned
+  --out C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa_unrel_PCA
+  --pca 20
+
+Start time: Thu Apr 02 14:56:26 2026
+32479 MiB RAM detected; reserving 16239 MiB for main workspace.
+Using up to 12 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_unrelated_pruned.fam.
+21914 variants loaded from
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_unrelated_pruned.bim.
+1 quantitative phenotype loaded (4000 values).
+Calculating allele frequencies... 0%done.
+Constructing GRM: 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%done.
+Correcting for missingness... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%done.
+Extracting eigenvalues and eigenvectors... done.
+--pca: Eigenvectors written to
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa_unrel_PCA.eigenvec ,
+and eigenvalues written to
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa_unrel_PCA.eigenval .
+End time: Thu Apr 02 14:56:34 2026
+
+
+
+
+

Question 5: GWAS analyses

+

We now test for association between each SNP and the continuous phenotype, adjusting for population structure using the top 3 PCs.

+

Assume that:

+
    +
  • You will run GWAS on the unrelated, QC’ed dataset gwa.qc_unrelated.A2.

  • +
  • You have a covariate file (e.g., the *.eigenvec output from the PCA analysis) containing PC1–PC3 for each individual.

  • +
+
    +
  1. Using PLINK2, run a linear regression GWAS using gwa.qc_unrelated.A2 as the input dataset. Adjust for PCs 1–3 as covariates.
  2. +
+
+
/c/PLINK/plink2.exe --bfile ~/Downloads/gen_data/02_activities/data/gwa.qc_unrelated.A2 --ci 0.95 --glm --covar ~/Downloads/gen_data/02_activities/data/gwa_unrel_PCA.eigenvec --covar-name PC1,PC2,PC3 --adjust --out ~/Downloads/gen_data/02_activities/data/gwa.qc_assoc_glm
+
+
PLINK v2.0.0-a.7 64-bit (10 Jan 2026)               cog-genomics.org/plink/2.0/
+(C) 2005-2026 Shaun Purcell, Christopher Chang    GNU General Public License v3
+Logging to C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_assoc_glm.log.
+Options in effect:
+  --adjust
+  --bfile C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_unrelated.A2
+  --ci 0.95
+  --covar C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa_unrel_PCA.eigenvec
+  --covar-name PC1,PC2,PC3
+  --glm
+  --out C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_assoc_glm
+
+Start time: Thu Apr 02 14:56:34 2026
+32479 MiB RAM detected; reserving 16239 MiB for main workspace.
+Using up to 12 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_unrelated.A2.fam.
+101083 variants loaded from
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_unrelated.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+3 covariates loaded from C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa_unrel_PCA.eigenvec.
+Calculating allele frequencies... 0%64%done.
+--glm linear regression on phenotype 'PHENO1': 0%64%done.
+Results written to C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_assoc_glm.PHENO1.glm.linear .
+--adjust: Genomic inflation est. lambda (based on median chisq) = 1.01779.
+--adjust values (101083 tests) written to
+C:/Users/obedk/Downloads/gen_data/02_activities/data/gwa.qc_assoc_glm.PHENO1.glm.linear.adjusted
+.
+End time: Thu Apr 02 14:56:36 2026
+
+
+
    +
  1. Create a Manhattan plot of the GWAS results.
  2. +
+
+
# Load PLINK2 logistic results
+assoc <- data.table::fread("~/Downloads/gen_data/02_activities/data/gwa.qc_assoc_glm.PHENO1.glm.linear")
+
+head(assoc)
+
+
   #CHROM     POS        ID    REF    ALT PROVISIONAL_REF?     A1 OMITTED
+    <int>   <int>    <char> <char> <char>           <char> <char>  <char>
+1:      1 1011278 rs3737728      G      A                Y      A       G
+2:      1 1011278 rs3737728      G      A                Y      A       G
+3:      1 1011278 rs3737728      G      A                Y      A       G
+4:      1 1011278 rs3737728      G      A                Y      A       G
+5:      1 1109721 rs1320565      C      T                Y      T       C
+6:      1 1109721 rs1320565      C      T                Y      T       C
+     A1_FREQ   TEST OBS_CT       BETA        SE        L95       U95    T_STAT
+       <num> <char>  <int>      <num>     <num>      <num>     <num>     <num>
+1: 0.3386490    ADD   3982  0.0182086 0.0240795 -0.0289864 0.0654035  0.756187
+2: 0.3386490    PC1   3982 -0.3779240 1.0015500 -2.3409200 1.5850800 -0.377340
+3: 0.3386490    PC2   3982  0.9818050 1.0015100 -0.9811150 2.9447300  0.980326
+4: 0.3386490    PC3   3982  0.4832040 1.0015300 -1.4797500 2.4461600  0.482467
+5: 0.0788481    ADD   3976 -0.0153297 0.0420641 -0.0977739 0.0671145 -0.364436
+6: 0.0788481    PC1   3976 -0.4762890 1.0037500 -2.4436000 1.4910200 -0.474511
+          P ERRCODE
+      <num>  <char>
+1: 0.449582       .
+2: 0.705941       .
+3: 0.326985       .
+4: 0.629501       .
+5: 0.715552       .
+6: 0.635162       .
+
+
# Keep only the additive test (one row per SNP)
+assoc_add <- assoc[TEST == "ADD"]
+
+# qqman expects columns named CHR, BP, SNP, and P.
+# In PLINK2 output they are #CHROM, POS, ID, and P.
+setnames(assoc_add, c("#CHROM","POS","ID"), c("CHR","BP","SNP"))
+
+# Manhattan plot
+manhattan(assoc_add,
+          chr = "CHR",
+          bp  = "BP",
+          snp = "SNP",
+          p   = "P",
+          xlab = "",
+          ylab = "",
+          suggestiveline = FALSE,
+          cex.axis = 1.5,
+          col = c("lightblue", "lightslateblue"))
+
+
+
+

+
+
+
+
+
    +
  1. Create a QQ plot of the GWAS p-values.
  2. +
+
+
# Q-Q plot
+qq(assoc_add$P, main = "Q-Q plot of GWAS p-values")
+
+
+
+

+
+
+
+
+
+
+

Criteria

+ +++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
CriteriaCompleteIncomplete
Data inspectionCorrect sample and SNP counts; phenotype plot with a brief comment on Gaussianity.Counts or phenotype description/plot missing or incorrect.
QC & LD pruningCorrect PLINK2 QC command and thresholds.QC/pruning commands, thresholds, or output datasets missing or incorrect.
Relatedness & PCACorrect use of PLINK2 command to obtain unrelated samples and PCA run on pruned SNPs.Relatedness step, unrelated dataset, or PCA analysis missing or incorrect.
GWAS & visualisationLinear regression GWAS with PCs as covariates; Manhattan and QQ plots produced.GWAS command, or Manhattan/QQ plots missing or clearly incorrect.
+
+
+

Submission Information

+

🚨 Please review our Assignment Submission Guide 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

+
+
+

Submission Parameters

+
    +
  • Submission Due Date: 11:59 PM – 01/04/2026

  • +
  • The branch name for your repo should be: assignment-2

  • +
  • What to submit for this assignment:

    +
      +
    • Populate this Quarto document (assignment_2.qmd).
    • +
    • Render the document with Quarto: quarto render assignment_2.qmd.
    • +
    • Submit both assignment_2.qmd and the rendered HTML file assignment_2.html in your pull request.
    • +
  • +
  • What the pull request link should look like for this assignment: https://github.com/<your_github_username>/gen_data/pull/<pr_id>

    +
      +
    • Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support team review your submission easily.
    • +
  • +
+
+

Checklist:

+
    +
  • Create a branch called assignment-2.
  • +
  • Ensure that your repository is public.
  • +
  • Review the PR description guidelines and adhere to them.
  • +
  • Verify that your link is accessible in a private browser window.
  • +
  • Confirm that both assignment_2.qmd and assignment_2.html are included in the pull request.
  • +
+
+
+ +
+ + +
+ + + + + \ No newline at end of file diff --git a/02_activities/assignments/assignment_2.qmd b/02_activities/assignments/assignment_2.qmd index e3a5449..709bc7e 100644 --- a/02_activities/assignments/assignment_2.qmd +++ b/02_activities/assignments/assignment_2.qmd @@ -5,6 +5,31 @@ format: html You will need **PLINK2** installed and available in your PATH. Please follow the OS-specific setup guide in [`SETUP.md`](../../SETUP.md). The dataset for this assignment consists of the following binary PLINK files: `gwa.A2.bed`, `gwa.A2.bim`, `gwa.A2.fam` , available at the following Google Drive link: . Please download all three files and save them in `02_activities/data/`. + +```{r setup, include=FALSE} +# Set up a consistent path for all chunks of code +knitr::opts_knit$set(root.dir = normalizePath("../../")) + +# Force Quarto to use Git Bash for all {bash} chunks +knitr::opts_chunk$set(engine.path = list(bash = "C:/Program Files/Git/bin/bash.exe")) +``` + +```{r, message=FALSE, warning=FALSE, results='hide'} +library(data.table) +library(ggplot2) +library(seqminer) +library(HardyWeinberg) +library(dplyr) +library(data.table) +library(qqman) +``` + +```{bash} +cd ~/Downloads/gen_data/02_activities/data +pwd +head gwa.A2.fam +``` + #### Question 1: Data inspection Before you run any models, first get familiar with the dataset. You may find `data.table::fread()` in R helpful for reading `.bim` and `.fam` files. @@ -12,16 +37,52 @@ Before you run any models, first get familiar with the dataset. You may find `da (i) Read the `.fam` file. How many samples does the dataset contain? ```{r} -# Your code here... +# Reading the .fam file +fam_file <- read.table("~/Downloads/gen_data/02_activities/data/gwa.A2.fam", header = FALSE) + +# Count the number of samples and display the result +num_samples <- nrow(fam_file) +print(paste("Number of samples in the dataset:", num_samples)) ``` (ii) Read the `.bim` file. How many SNPs does the dataset contain? ```{r} -# Your code here... +# Read the .bim file +bim_file <- read.table("~/Downloads/gen_data/02_activities/data/gwa.A2.bim", header = FALSE) + +# Count the number of SNPs and display the result +num_snps <- nrow(bim_file) +print(paste("Number of SNPs in the dataset:", num_snps)) ``` +(iii) Make a histogram (or density plot) of the phenotype. Does it look roughly Gaussian? + +```{r} +# 1. Extracting the phenotype column (6th column of the .fam file) +phenotype <- read.table("~/Downloads/gen_data/02_activities/data/gwa.A2.fam", header = FALSE)$V6 + +# Filtering out missing values if present (assuming -9 indicates missing phenotypes) +phenotype_clean <- phenotype[phenotype != -9 & !is.na(phenotype)] + +# 2. Make a Histogram +hist(phenotype_clean, + main = "Histogram of Phenotype", + xlab = "Phenotype Value", + col = "lightblue", + border = "black", + breaks = 20) + +# 3. Make a Density Plot to better visualize the distribution +plot(density(phenotype_clean), + main = "Density Plot of Phenotype", + xlab = "Phenotype Value", + col = "darkblue", + lwd = 2) +``` +Yes, the histogram (or density plot) displays a roughly symmetric, bell-shaped curve, indicating the phenotype follows a roughly Gaussian distribution. + #### Question 2: Quality control (QC) Now we will perform QC using PLINK2 for the genotype files in `gwa.A2`. @@ -29,7 +90,7 @@ Now we will perform QC using PLINK2 for the genotype files in `gwa.A2`. (i) Using PLINK2 from the command line (bash), perform basic QC with the following filters: MAF ≥ 0.05, SNP missingness (`--geno`) ≤ 0.01, individual missingness (`--mind`) ≤ 0.10, and HWE p-value ≥ 0.00005, and output the QC’ed dataset as `gwa.qc.A2`. ```{bash} -# Your code here... +/c/PLINK/plink2.exe --bfile ~/Downloads/gen_data/02_activities/data/gwa.A2 --maf 0.05 --geno 0.01 --mind 0.1 --hwe 0.00005 --make-bed --out ~/Downloads/gen_data/02_activities/data/gwa.qc.A2 ``` #### Question 3: Relatedness @@ -39,19 +100,23 @@ In this question, you will use **PLINK2’s built-in KING-robust kinship** (`--k i) Perform LD pruning on `gwa.qc.A2` using PLINK2 with the following parameters: `--indep-pairwise 500 50 0.05`, and then generate a new dataset containing only the pruned SNPs. ```{bash} -# Your code here... +# 1. LD pruning to identify a set of approximately independent SNPs +/c/PLINK/plink2.exe --bfile ~/Downloads/gen_data/02_activities/data/gwa.qc.A2 --indep-pairwise 500 50 0.05 --out ~/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned + +# 2. Create a new dataset using only the pruned SNPs +/c/PLINK/plink2.exe --bfile ~/Downloads/gen_data/02_activities/data/gwa.qc.A2 --extract ~/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned.prune.in --make-bed --out ~/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned ``` (ii) Use PLINK2 on the LD-pruned dataset to identify a set of unrelated individuals up to (approximately) 2nd-degree relatives (use a kinship cutoff of 0.0884). ```{bash} -# Your code here... +/c/PLINK/plink2.exe --bfile ~/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned --king-cutoff 0.0884 --out ~/Downloads/gen_data/02_activities/data/gwa_king ``` (iii) Using the unrelated individual list from part (ii), create a dataset containing only unrelated individuals and name it `gwa.qc_unrelated.A2`. ```{bash} -# Your code here... +/c/PLINK/plink2.exe --bfile ~/Downloads/gen_data/02_activities/data/gwa.qc.A2 --keep ~/Downloads/gen_data/02_activities/data/gwa_king.king.cutoff.in.id --make-bed --out ~/Downloads/gen_data/02_activities/data/gwa.qc_unrelated.A2 ``` #### Question 4: principal component analysis (PCA) @@ -61,7 +126,11 @@ i) Perform LD pruning on `gwa.qc.A2` using PLINK2 with the following parameters (Hint: Refer to the PCA section in the GWAS Tutorial II. You should use a `*.prune.in` file to select a set of independent SNPs before performing PCA.) ```{bash} -# Your code here... +# 1. First, create a pruned dataset of unrelated individuals +/c/PLINK/plink2.exe --bfile ~/Downloads/gen_data/02_activities/data/gwa.qc_unrelated.A2 --extract ~/Downloads/gen_data/02_activities/data/gwa.qc.A2_pruned.prune.in --make-bed --out ~/Downloads/gen_data/02_activities/data/gwa.qc_unrelated_pruned + +# 2. Now perform PCA: +/c/PLINK/plink2.exe --bfile ~/Downloads/gen_data/02_activities/data/gwa.qc_unrelated_pruned --pca 20 --out ~/Downloads/gen_data/02_activities/data/gwa_unrel_PCA ``` #### Question 5: GWAS analyses @@ -77,19 +146,42 @@ Assume that: (i) Using PLINK2, run a linear regression GWAS using `gwa.qc_unrelated.A2` as the input dataset. Adjust for PCs 1–3 as covariates. ```{bash} -# Your code here... +/c/PLINK/plink2.exe --bfile ~/Downloads/gen_data/02_activities/data/gwa.qc_unrelated.A2 --ci 0.95 --glm --covar ~/Downloads/gen_data/02_activities/data/gwa_unrel_PCA.eigenvec --covar-name PC1,PC2,PC3 --adjust --out ~/Downloads/gen_data/02_activities/data/gwa.qc_assoc_glm ``` (ii) Create a Manhattan plot of the GWAS results. ```{r} -# Your code here... +# Load PLINK2 logistic results +assoc <- data.table::fread("~/Downloads/gen_data/02_activities/data/gwa.qc_assoc_glm.PHENO1.glm.linear") + +head(assoc) + +# Keep only the additive test (one row per SNP) +assoc_add <- assoc[TEST == "ADD"] + +# qqman expects columns named CHR, BP, SNP, and P. +# In PLINK2 output they are #CHROM, POS, ID, and P. +setnames(assoc_add, c("#CHROM","POS","ID"), c("CHR","BP","SNP")) + +# Manhattan plot +manhattan(assoc_add, + chr = "CHR", + bp = "BP", + snp = "SNP", + p = "P", + xlab = "", + ylab = "", + suggestiveline = FALSE, + cex.axis = 1.5, + col = c("lightblue", "lightslateblue")) ``` (iii) Create a QQ plot of the GWAS p-values. ```{r} -# Your code here... +# Q-Q plot +qq(assoc_add$P, main = "Q-Q plot of GWAS p-values") ``` ### Criteria diff --git a/02_activities/assignments/unnamed-chunk-12-1.png b/02_activities/assignments/unnamed-chunk-12-1.png new file mode 100644 index 0000000..47a2554 Binary files /dev/null and b/02_activities/assignments/unnamed-chunk-12-1.png differ diff --git a/02_activities/assignments/unnamed-chunk-13-1.png b/02_activities/assignments/unnamed-chunk-13-1.png new file mode 100644 index 0000000..8fcf552 Binary files /dev/null and b/02_activities/assignments/unnamed-chunk-13-1.png differ diff --git a/02_activities/assignments/unnamed-chunk-5-1.png b/02_activities/assignments/unnamed-chunk-5-1.png new file mode 100644 index 0000000..200f8c0 Binary files /dev/null and b/02_activities/assignments/unnamed-chunk-5-1.png differ diff --git a/02_activities/assignments/unnamed-chunk-5-2.png b/02_activities/assignments/unnamed-chunk-5-2.png new file mode 100644 index 0000000..d45bf7c Binary files /dev/null and b/02_activities/assignments/unnamed-chunk-5-2.png differ