You only need to write lines of code for each question. When answering questions that ask you to identify or interpret something, the length of your response doesn’t matter. For example, if the answer is just ‘yes,’ ‘no,’ or a number, you can just give that answer without adding anything else.
+
We will go through comparable code and concepts in the live learning session. If you run into trouble, start by using the help help() function in R, to get information about the datasets and function in question. The internet is also a great resource when coding (though note that no outside searches are required by the assignment!). If you do incorporate code from the internet, please cite the source within your code (providing a URL is sufficient).
+
Please bring questions that you cannot work out on your own to office hours, work periods or share with your peers on Slack. We will work with you through the issue.
+
You will need to install PLINK and run the analyses. Please follow the OS-specific setup guide in SETUP.md. PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner.
+
+
Question 1: Data inspection
+
Before fitting any models, it is essential to understand the data. Use R or bash code to answer the following questions about the gwa.qc.A1.fam, gwa.qc.A1.bim, and gwa.qc.A1.bed files, available at the following Google Drive link: https://drive.google.com/drive/folders/11meVqGCY5yAyI1fh-fAlMEXQt0VmRGuz?usp=drive_link. Please download all three files from this link and place them in 02_activities/data/.
bim_data <-read.table(bim_file, header =FALSE)
+cat("Number of SNPs:", nrow(bim_data), "\n")
+
+
Number of SNPs: 101083
+
+
head(bim_data)
+
+
V1 V2 V3 V4 V5 V6
+1 1 rs3737728 0 1011278 A G
+2 1 rs1320565 0 1109721 T C
+3 1 rs3813199 0 1148140 A G
+4 1 rs11804831 0 1184667 C T
+5 1 rs3766178 0 1468043 C T
+6 1 rs3128342 0 1476697 A C
+
+
library(snpStats)
+
+
Loading required package: survival
+
+
+
Loading required package: Matrix
+
+
plink_data <-read.plink(
+bed = bed_file,
+bim = bim_file, # pass path, NOT the data frame
+fam = fam_file # pass path, NOT the data frame
+)
+
+geno_matrix <- plink_data$genotypes
+cat("Genotype matrix dimensions (samples x SNPs):", dim(geno_matrix), "\n")
+
+
Genotype matrix dimensions (samples x SNPs): 4000 101083
+
+
+
+
Read the .fam file. How many samples does the dataset contain?
Read the .bim file. How many SNPs does the dataset contain?
+
+
+
# Read the .bim file
+bbim_data <-read.table(bim_file, header =FALSE)
+# Number of SNPs
+num_snps <-nrow(bim_data)
+cat("The dataset contains", num_snps, "SNPs.\n")
+
+
The dataset contains 101083 SNPs.
+
+
+
+
+
Question 2: Allele Frequency Estimation
+
+
Load the genotype matrix for SNPs rs1861, rs3813199, rs3128342, and rs11804831 using additive coding. What are the allele frequencies (AFs) for these four SNPs?
Conduct the Hardy–Weinberg Equilibrium (HWE) test for all SNPs in the .bim file. Then, load the file containing the HWE p-value results and display the first few rows of the resulting data frame.
CHR SNP TEST A1 A2 GENO O.HET. E.HET. P
+1 1 rs3737728 ALL(QT) A G 428/1841/1713 0.4623 0.4479 0.04379
+2 1 rs1320565 ALL(QT) T C 19/589/3368 0.1481 0.1453 0.27340
+3 1 rs3813199 ALL(QT) A G 12/428/3531 0.1078 0.1073 1.00000
+4 1 rs11804831 ALL(QT) C T 81/1061/2820 0.2678 0.2610 0.11340
+5 1 rs3766178 ALL(QT) C T 214/1378/2391 0.3460 0.3506 0.41590
+6 1 rs3128342 ALL(QT) A C 382/1655/1927 0.4175 0.4240 0.33030
+
+
+
+
What are the HWE p-values for SNPs rs1861, rs3813199, rs3128342, and rs11804831?
+
+
+
library(dplyr)
+
+
+Attaching package: 'dplyr'
+
+
+
The following objects are masked from 'package:stats':
+
+ filter, lag
+
+
+
The following objects are masked from 'package:base':
+
+ intersect, setdiff, setequal, union
SNP P
+1 rs3813199 1.0000
+2 rs11804831 0.1134
+3 rs3128342 0.3303
+4 rs1861 0.2747
+
+
+
+
+
Question 4: Genetic Association Test
+
+
Conduct a linear regression to test the association between SNP rs1861 and the phenotype. What is the p-value?
+
+
+
# Read in the phenotype from .fam
+fam_data <-read.table("../data/gwa.qc.A1.fam", header =FALSE)
+geno_matrix <-as(geno_matrix, "numeric") # make sure numeric
+
+# Add phenotype to genotype subset
+geno_df <-as.data.frame(geno_numeric)
+geno_df$PHENOTYPE <- fam_data$V6 # column 6 is phenotype
+
+# Fit linear regression for rs1861
+model <-lm(PHENOTYPE ~ rs1861, data = geno_df)
+
+# Summary to get p-value
+summary(model)$coefficients
+
+
Estimate Std. Error t value Pr(>|t|)
+(Intercept) 0.05238114 0.09486224 0.5521811 5.808554e-01
+rs1861 0.97382401 0.04942584 19.7027317 1.617463e-82
+
+
# Extract p-value for rs1861
+p_value <-summary(model)$coefficients["rs1861", "Pr(>|t|)"]
+
+# Print it
+cat("SNP: rs1861\nLinear regression p-value:", p_value, "\n")
cat ("The beta coefficient for rs1861 is approximately 0.974, indicating that each additional copy of the minor allele is associated with an increase of about 0.974 units in the phenotype under an additive model.\n")
+
+
The beta coefficient for rs1861 is approximately 0.974, indicating that each additional copy of the minor allele is associated with an increase of about 0.974 units in the phenotype under an additive model.
+
+
+
+
Plot the scatterplot of phenotype versus the genotype of SNP rs1861. Add the regression line to the plot.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
+ℹ Please use `linewidth` instead.
+
+
print(p)
+
+
+
+
+
+
+
+
+
+
Convert the genotype coding for rs1861 to recessive coding.
+
+
+
# Create a new variable with recessive coding
+geno_df$rs1861_rec <-ifelse(geno_df$rs1861 ==2, 1, 0)
+
+# Check result
+table(geno_df$rs1861, geno_df$rs1861_rec)
+
+
+ 0 1
+ 0 15 0
+ 1 398 0
+ 2 0 3551
+
+
+
+
Conduct a linear regression to test the association between the recessive-coded rs1861 and the phenotype. What is the p-value?
+
+
+
# Fit linear regression with recessive coding
+model_rec <-lm(PHENOTYPE ~ rs1861_rec, data = geno_df)
+
+# Extract p-value
+p_value_rec <-summary(model_rec)$coefficients["rs1861_rec", "Pr(>|t|)"]
+
+# Print result
+cat("Recessive model p-value:", p_value_rec, "\n")
+
+
Recessive model p-value: 2.729298e-79
+
+
+
+
Plot the scatterplot of phenotype versus the recessive-coded genotype of rs1861. Add the regression line to the plot.
+
+
+
# Scatterplot
+plot(geno_df$rs1861_rec, geno_df$PHENOTYPE,
+xlab ="Recessive Genotype (rs1861)",
+ylab ="Phenotype",
+main ="Phenotype vs Recessive rs1861",
+pch =16)
+
+# Add regression line
+abline(model_rec, col ="red", lwd =2)
+
+
+
+
+
+
+
+
+
+
Which model fits better? Justify your answer.
+
+
+
# R-squared for additive model
+summary(model)$r.squared
+
+
[1] 0.08923678
+
+
# R-squared for recessive model
+summary(model_rec)$r.squared
+
+
[1] 0.08582295
+
+
# Compare model fits
+cat("The additive model fits slightly better than the recessive model because its R-squared is higher (0.0892 vs 0.0858), even though both p-values are highly significant.\n")
+
+
The additive model fits slightly better than the recessive model because its R-squared is higher (0.0892 vs 0.0858), even though both p-values are highly significant.
+
+
+
+
+
Criteria
+
+
+
+
+
+
+
+
+
Criteria
+
Complete
+
Incomplete
+
+
+
+
+
Data Inspection
+
Correct sample/SNP counts and variable type identified.
+
Missing or incorrect counts or variable type.
+
+
+
Allele Frequency Estimation
+
Correct allele and minor allele frequencies computed.
+
Frequencies missing or wrong.
+
+
+
Hardy–Weinberg Equilibrium Test
+
Correct PLINK command and p-value extraction in R.
+
PLINK command or extraction incorrect/missing.
+
+
+
Genetic Association Test
+
Correct regressions, plots, coding, and interpretation.
+
Regression, plots, or interpretation missing/incomplete.
+
+
+
+
+
+
Submission Information
+
📌 Please review our Assignment Submission Guide for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.
+
+
Note:
+
If you like, you may collaborate with others in the cohort. If you choose to do so, please indicate with whom you have worked with in your pull request by tagging their GitHub username. Separate submissions are required.
+
+
+
+
Submission Parameters
+
+
Submission Due Date: 11:59 PM – 16/03/2026
+
Branch name for your repo should be: assignment-1
+
What to submit for this assignment:
+
+
Populate this Quarto document (assignment_1.qmd).
+
Render the document with Quarto: quarto render assignment_1.qmd.
+
Submit both assignment_1.qmd and the rendered HTML file assignment_1.html in your pull request.
+
+
What the pull request link should look like for this assignment: https://github.com/<your_github_username>/gen_data/pull/<pr_id>
+
+
Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support team review your submission easily.
+
+
+
+
Checklist:
+
+
Created a branch with the correct naming convention.
+
Ensured that the repository is public.
+
Reviewed the PR description guidelines and adhered to them.
+
Verified that the link is accessible in a private browser window.
+
Confirmed that both assignment_1.qmd and assignment_1.html are included in the pull request.
+
+
If you encounter any difficulties or have questions, please don’t hesitate to reach out to our team via our Slack help channel. Our technical facilitators and learning support team are here to help you navigate any challenges.
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/02_activities/assignments/assignment_1.qmd b/assignment_1.qmd
similarity index 52%
rename from 02_activities/assignments/assignment_1.qmd
rename to assignment_1.qmd
index 550af3d..1d9ee79 100644
--- a/02_activities/assignments/assignment_1.qmd
+++ b/assignment_1.qmd
@@ -16,97 +16,293 @@ You will need to install PLINK and run the analyses. Please follow the OS-specif
#### Question 1: Data inspection
Before fitting any models, it is essential to understand the data. Use R or bash code to answer the following questions about the `gwa.qc.A1.fam`, `gwa.qc.A1.bim`, and `gwa.qc.A1.bed` files, available at the following Google Drive link: . Please download all three files from this link and place them in `02_activities/data/`.
+```{r}
+
+fam_file <- "../data/gwa.qc.A1.fam"
+bim_file <- "../data/gwa.qc.A1.bim"
+bed_file <- "../data/gwa.qc.A1.bed"
+
+
+fam_data <- read.table(fam_file, header = FALSE)
+cat("Number of samples:", nrow(fam_data), "\n")
+head(fam_data)
+
+bim_data <- read.table(bim_file, header = FALSE)
+cat("Number of SNPs:", nrow(bim_data), "\n")
+head(bim_data)
+
+library(snpStats)
+
+plink_data <- read.plink(
+ bed = bed_file,
+ bim = bim_file, # pass path, NOT the data frame
+ fam = fam_file # pass path, NOT the data frame
+)
+
+geno_matrix <- plink_data$genotypes
+cat("Genotype matrix dimensions (samples x SNPs):", dim(geno_matrix), "\n")
+```
(i) Read the .fam file. How many samples does the dataset contain?
-```
-# Your answer here...
+
+```{r}
+ffam_file <- "../data/gwa.qc.A1.fam"
+bim_file <- "../data/gwa.qc.A1.bim"
+bed_file <- "../data/gwa.qc.A1.bed"
+```
+```{r}
+# Read the data file
+data <- read.table("../data/gwa.qc.A1.fam", header = FALSE)
+
+# Display number of rows
+nrow(data)
+
+# Display first few rows
+head(data)
+```
+
+
+
+
+
+
+
```
(ii) What is the 'variable type' of the response variable (i.e.Continuous or binary)?
-```
-# Your answer here...
+```
+```{r}
+data <- read.table(fam_file, header = FALSE)
+```
+```{r}
+table(data$V6)
```
(iii) Read the .bim file. How many SNPs does the dataset contain?
-```
-# Your answer here...
+
+```{r}
+# Read the .bim file
+bbim_data <- read.table(bim_file, header = FALSE)
+# Number of SNPs
+num_snps <- nrow(bim_data)
+cat("The dataset contains", num_snps, "SNPs.\n")
```
#### Question 2: Allele Frequency Estimation
(i) Load the genotype matrix for SNPs rs1861, rs3813199, rs3128342, and rs11804831 using additive coding. What are the allele frequencies (AFs) for these four SNPs?
+```{r}
+ # Load required library
+library(snpStats)
+
+# Read PLINK files
+plink_data <- read.plink(
+ bed = bed_file,
+ bim = bim_file,
+ fam = fam_file
+)
+
+# Extract genotype matrix
+geno_matrix <- plink_data$genotypes
-```
-# Your code here...
+# Subset SNPs of interest
+snps_of_interest <- c("rs1861", "rs3813199", "rs3128342", "rs11804831")
+geno_sub <- geno_matrix[, snps_of_interest]
+
+# Convert to numeric (0,1,2)
+geno_numeric <- as(geno_sub, "numeric")
+
+# Function to calculate allele frequency
+calc_af <- function(snp_vector) {
+ sum(snp_vector, na.rm = TRUE) / (2 * sum(!is.na(snp_vector)))
+}
+
+# Compute allele frequencies
+allele_freqs <- apply(geno_numeric, 2, calc_af)
+
+# Print results
+allele_freqs
```
(ii) What are the minor allele frequencies (MAFs) for these four SNPs?
-```
-# Your code here...
+``` {r}
+# Function to calculate minor allele frequency
+calc_maf <- function(snp_vector) {
+ af <- sum(snp_vector, na.rm = TRUE) / (2 * sum(!is.na(snp_vector))) # allele freq
+ maf <- min(af, 1 - af) # minor allele freq
+ return(maf)
+}
+
+# Compute MAFs for all SNPs in geno_numeric
+maf_values <- apply(geno_numeric, 2, calc_maf)
+
+# Print results
+maf_values
```
#### Question 3: Hardy–Weinberg Equilibrium (HWE) Test
(i) Conduct the Hardy–Weinberg Equilibrium (HWE) test for all SNPs in the .bim file. Then, load the file containing the HWE p-value results and display the first few rows of the resulting data frame.
-```
-# Your code here...
+``` {r}
+hwe_results <- read.table("../data/gwa_hwe.hwe", header = TRUE)
+head(hwe_results)
```
(ii) What are the HWE p-values for SNPs rs1861, rs3813199, rs3128342, and rs11804831?
-```
-# Your code here...
+``` {r}
+
+library(dplyr)
+
+# Now filter SNPs
+snps_of_interest <- c("rs1861", "rs3813199", "rs3128342", "rs11804831")
+
+hwe_results %>%
+ filter(SNP %in% snps_of_interest) %>%
+ select(SNP, P)
```
#### Question 4: Genetic Association Test
(i) Conduct a linear regression to test the association between SNP rs1861 and the phenotype. What is the p-value?
-```
-# Your code here...
+
+```{r}
+# Read in the phenotype from .fam
+fam_data <- read.table("../data/gwa.qc.A1.fam", header = FALSE)
+geno_matrix <- as(geno_matrix, "numeric") # make sure numeric
+
+# Add phenotype to genotype subset
+geno_df <- as.data.frame(geno_numeric)
+geno_df$PHENOTYPE <- fam_data$V6 # column 6 is phenotype
+
+# Fit linear regression for rs1861
+model <- lm(PHENOTYPE ~ rs1861, data = geno_df)
+
+# Summary to get p-value
+summary(model)$coefficients
+
+# Extract p-value for rs1861
+p_value <- summary(model)$coefficients["rs1861", "Pr(>|t|)"]
+
+# Print it
+cat("SNP: rs1861\nLinear regression p-value:", p_value, "\n")
```
(ii) How would you interpret the beta coefficient from this regression?
-```
-# Your answer here...
-```
+```{r}
+summary(model)$coefficients
+# Extract beta (estimate) and p-value for rs1861
+beta <- summary(model)$coefficients["rs1861", "Estimate"]
+pval <- summary(model)$coefficients["rs1861", "Pr(>|t|)"]
+cat("rs1861: Beta =", beta, ", p-value =", pval, "\n")
+
+cat ("The beta coefficient for rs1861 is approximately 0.974, indicating that each additional copy of the minor allele is associated with an increase of about 0.974 units in the phenotype under an additive model.\n")
+```
(iii) Plot the scatterplot of phenotype versus the genotype of SNP rs1861. Add the regression line to the plot.
-```
-# Your code here...
+```{r}
+library(ggplot2)
+library(dplyr)
+
+# Create prediction data (smooth curve)
+geno_range <- data.frame(
+ rs1861 = seq(0, 2, length.out = 100)
+)
+
+geno_range$predicted <- predict(model, newdata = geno_range)
+
+# Prepare count data (for point sizes)
+count_data <- geno_df %>%
+ mutate(PHENOTYPE = round(PHENOTYPE)) %>%
+ filter(!is.na(rs1861), !is.na(PHENOTYPE)) %>%
+ group_by(rs1861, PHENOTYPE) %>%
+ summarise(count = n(), .groups = "drop")
+
+# Plot
+p <- ggplot() +
+ geom_point(
+ data = count_data,
+ aes(x = rs1861, y = PHENOTYPE, size = count),
+ color = "blue"
+ ) +
+ geom_line(
+ data = geno_range,
+ aes(x = rs1861, y = predicted),
+ color = "red",
+ size = 1.2
+ ) +
+ scale_size_continuous(range = c(2, 10)) +
+ labs(
+ title = "Phenotype vs Genotype (rs1861)",
+ x = "Genotype (0/1/2)",
+ y = "Phenotype",
+ size = "Count"
+ ) +
+ coord_cartesian(xlim = c(-0.5, 2.5), ylim = c(-0.1, 1.1)) +
+ theme_minimal()
+
+print(p)
```
(iv) Convert the genotype coding for rs1861 to recessive coding.
-```
-# Your code here...
+``` {r}
+# Create a new variable with recessive coding
+geno_df$rs1861_rec <- ifelse(geno_df$rs1861 == 2, 1, 0)
+
+# Check result
+table(geno_df$rs1861, geno_df$rs1861_rec)
```
(v) Conduct a linear regression to test the association between the recessive-coded rs1861 and the phenotype. What is the p-value?
-```
-# Your code here...
+```{r}
+# Fit linear regression with recessive coding
+model_rec <- lm(PHENOTYPE ~ rs1861_rec, data = geno_df)
+
+# Extract p-value
+p_value_rec <- summary(model_rec)$coefficients["rs1861_rec", "Pr(>|t|)"]
+
+# Print result
+cat("Recessive model p-value:", p_value_rec, "\n")
```
(vi) Plot the scatterplot of phenotype versus the recessive-coded genotype of rs1861. Add the regression line to the plot.
-```
-# Your code here...
+``` {r}
+# Scatterplot
+plot(geno_df$rs1861_rec, geno_df$PHENOTYPE,
+ xlab = "Recessive Genotype (rs1861)",
+ ylab = "Phenotype",
+ main = "Phenotype vs Recessive rs1861",
+ pch = 16)
+
+# Add regression line
+abline(model_rec, col = "red", lwd = 2)
```
(vii) Which model fits better? Justify your answer.
-```
-# Your answer here...
+```{r}
+# R-squared for additive model
+summary(model)$r.squared
+
+# R-squared for recessive model
+summary(model_rec)$r.squared
+
+# Compare model fits
+cat("The additive model fits slightly better than the recessive model because its R-squared is higher (0.0892 vs 0.0858), even though both p-values are highly significant.\n")
```
+
### Criteria
| Criteria | Complete | Incomplete |
diff --git a/assignment_2.html b/assignment_2.html
new file mode 100644
index 0000000..0b27c2d
--- /dev/null
+++ b/assignment_2.html
@@ -0,0 +1,979 @@
+
+
+
+
+
+
+
+
+
+Assignment #2
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Assignment #2
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
You will need PLINK2 installed and available in your PATH. Please follow the OS-specific setup guide in SETUP.md. The dataset for this assignment consists of the following binary PLINK files: gwa.A2.bed, gwa.A2.bim, gwa.A2.fam , available at the following Google Drive link: https://drive.google.com/drive/folders/1rHoy3z52Yyj985ukjjtLhfIBxchpyYtZ?usp=drive_link. Please download all three files and save them in 02_activities/data/.
+
+
Question 1: Data inspection
+
Before you run any models, first get familiar with the dataset. You may find data.table::fread() in R helpful for reading .bim and .fam files.
+
+
Read the .fam file. How many samples does the dataset contain?
Make a histogram (or density plot) of the phenotype. Does it look roughly Gaussian?
+
+
+
pheno <- fam[[6]]
+
+hist(pheno, main="Histogram of Phenotype", xlab="Phenotype")
+
+
+
+
+
+
+
+
plot(density(pheno), main="Density Plot of Phenotype")
+
+
+
+
+
+
+
+
+
+
+
Question 2: Quality control (QC)
+
Now we will perform QC using PLINK2 for the genotype files in gwa.A2.
+
+
Using PLINK2 from the command line (bash), perform basic QC with the following filters: MAF ≥ 0.05, SNP missingness (--geno) ≤ 0.01, individual missingness (--mind) ≤ 0.10, and HWE p-value ≥ 0.00005, and output the QC’ed dataset as gwa.qc.A2.
PLINK v2.00a5.12 M1 (25 Jun 2024) www.cog-genomics.org/plink/2.0/
+(C) 2005-2024 Shaun Purcell, Christopher Chang GNU General Public License v3
+Logging to ./02_activities/data/gwa.qc.A2.log.
+Options in effect:
+ --bfile ./02_activities/data/gwa.A2
+ --geno 0.01
+ --hwe 0.00005
+ --maf 0.05
+ --make-bed
+ --mind 0.1
+ --out ./02_activities/data/gwa.qc.A2
+
+Start time: Thu Apr 2 02:24:28 2026
+32768 MiB RAM detected; reserving 16384 MiB for main workspace.
+Using up to 10 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+./02_activities/data/gwa.A2.fam.
+306102 variants loaded from ./02_activities/data/gwa.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+Calculating sample missingness rates... 0%21%42%64%85%done.
+0 samples removed due to missing genotype data (--mind).
+4000 samples (2000 females, 2000 males; 4000 founders) remaining after main
+filters.
+4000 quantitative phenotype values remaining after main filters.
+Calculating allele frequencies... 0%21%42%64%85%done.
+--geno: 196578 variants removed due to missing genotype data.
+--hwe: 6 variants removed due to Hardy-Weinberg exact test (founders only).
+8435 variants removed due to allele frequency threshold(s)
+(--maf/--max-maf/--mac/--max-mac).
+101083 variants remaining after main filters.
+Writing ./02_activities/data/gwa.qc.A2.fam ... done.
+Writing ./02_activities/data/gwa.qc.A2.bim ... done.
+Writing ./02_activities/data/gwa.qc.A2.bed ... 0%21%43%65%87%done.
+End time: Thu Apr 2 02:24:28 2026
+
+
+
+
+
Question 3: Relatedness
+
In this question, you will use PLINK2’s built-in KING-robust kinship (--king-cutoff) detect and remove related individuals.
+
+
Perform LD pruning on gwa.qc.A2 using PLINK2 with the following parameters: --indep-pairwise 500 50 0.05, and then generate a new dataset containing only the pruned SNPs.
PLINK v2.00a5.12 M1 (25 Jun 2024) www.cog-genomics.org/plink/2.0/
+(C) 2005-2024 Shaun Purcell, Christopher Chang GNU General Public License v3
+Logging to ./02_activities/data/gwa.qc.A2.log.
+Options in effect:
+ --bfile ./02_activities/data/gwa.qc.A2
+ --indep-pairwise 500 50 0.05
+ --out ./02_activities/data/gwa.qc.A2
+
+Start time: Thu Apr 2 02:24:29 2026
+32768 MiB RAM detected; reserving 16384 MiB for main workspace.
+Using up to 10 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+./02_activities/data/gwa.qc.A2.fam.
+101083 variants loaded from ./02_activities/data/gwa.qc.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+Calculating allele frequencies... 0%64%done.
+--indep-pairwise (9 compute threads): 0%50%79169/101083 variants removed.
+Writing...
+Variant lists written to ./02_activities/data/gwa.qc.A2.prune.in and
+./02_activities/data/gwa.qc.A2.prune.out .
+End time: Thu Apr 2 02:24:29 2026
+PLINK v2.00a5.12 M1 (25 Jun 2024) www.cog-genomics.org/plink/2.0/
+(C) 2005-2024 Shaun Purcell, Christopher Chang GNU General Public License v3
+Logging to ./02_activities/data/gwa.qc_pruned.A2.log.
+Options in effect:
+ --bfile ./02_activities/data/gwa.qc.A2
+ --extract ./02_activities/data/gwa.qc.A2.prune.in
+ --make-bed
+ --out ./02_activities/data/gwa.qc_pruned.A2
+
+Start time: Thu Apr 2 02:24:29 2026
+32768 MiB RAM detected; reserving 16384 MiB for main workspace.
+Using up to 10 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+./02_activities/data/gwa.qc.A2.fam.
+101083 variants loaded from ./02_activities/data/gwa.qc.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+--extract: 21914 variants remaining.
+21914 variants remaining after main filters.
+Writing ./02_activities/data/gwa.qc_pruned.A2.fam ... done.
+Writing ./02_activities/data/gwa.qc_pruned.A2.bim ... done.
+Writing ./02_activities/data/gwa.qc_pruned.A2.bed ... 0%61%done.
+End time: Thu Apr 2 02:24:29 2026
+
+
+
+
Use PLINK2 on the LD-pruned dataset to identify a set of unrelated individuals up to (approximately) 2nd-degree relatives (use a kinship cutoff of 0.0884).
PLINK v2.00a5.12 M1 (25 Jun 2024) www.cog-genomics.org/plink/2.0/
+(C) 2005-2024 Shaun Purcell, Christopher Chang GNU General Public License v3
+Logging to ./02_activities/data/gwa.qc_assoc.A2.log.
+Options in effect:
+ --bfile ./02_activities/data/gwa.qc_unrelated.A2
+ --covar ./02_activities/data/gwa_unrel_PCA.A2.eigenvec
+ --covar-name PC1,PC2,PC3
+ --glm
+ --out ./02_activities/data/gwa.qc_assoc.A2
+
+Start time: Thu Apr 2 02:24:39 2026
+32768 MiB RAM detected; reserving 16384 MiB for main workspace.
+Using up to 10 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+./02_activities/data/gwa.qc_unrelated.A2.fam.
+101083 variants loaded from ./02_activities/data/gwa.qc_unrelated.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+3 covariates loaded from ./02_activities/data/gwa_unrel_PCA.A2.eigenvec.
+Calculating allele frequencies... 0%64%done.
+--glm linear regression on phenotype 'PHENO1': 0%64%done.
+Results written to ./02_activities/data/gwa.qc_assoc.A2.PHENO1.glm.linear .
+End time: Thu Apr 2 02:24:40 2026
+
+
+
+
Create a Manhattan plot of the GWAS results.
+
+
+
+
+
+
+
For example usage please run: vignette('qqman')
+
+
+
+
+
+
Citation appreciated but not required:
+
+
+
Turner, (2018). qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. Journal of Open Source Software, 3(25), 731, https://doi.org/10.21105/joss.00731.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Create a QQ plot of the GWAS p-values.
+
+
+
+
+
+
+
+
+
+
+
+
+
The QQ plot shows that the majority of observed p-values follow the expected null distribution, with a clear upward deviation at the tail, indicating the presence of significant genetic associations.
+
+
Criteria
+
+
+
+
+
+
+
+
+
Criteria
+
Complete
+
Incomplete
+
+
+
+
+
Data inspection
+
Correct sample and SNP counts; phenotype plot with a brief comment on Gaussianity.
+
Counts or phenotype description/plot missing or incorrect.
+
+
+
QC & LD pruning
+
Correct PLINK2 QC command and thresholds.
+
QC/pruning commands, thresholds, or output datasets missing or incorrect.
+
+
+
Relatedness & PCA
+
Correct use of PLINK2 command to obtain unrelated samples and PCA run on pruned SNPs.
+
Relatedness step, unrelated dataset, or PCA analysis missing or incorrect.
+
+
+
GWAS & visualisation
+
Linear regression GWAS with PCs as covariates; Manhattan and QQ plots produced.
+
GWAS command, or Manhattan/QQ plots missing or clearly incorrect.
+
+
+
+
+
+
Submission Information
+
🚨 Please review our Assignment Submission Guide 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.
+
+
+
Submission Parameters
+
+
Submission Due Date: 11:59 PM – 01/04/2026
+
The branch name for your repo should be: assignment-2
+
What to submit for this assignment:
+
+
Populate this Quarto document (assignment_2.qmd).
+
Render the document with Quarto: quarto render assignment_2.qmd.
+
Submit both assignment_2.qmd and the rendered HTML file assignment_2.html in your pull request.
+
+
What the pull request link should look like for this assignment: https://github.com/<your_github_username>/gen_data/pull/<pr_id>
+
+
Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support team review your submission easily.