diff --git a/PAGE.md b/PAGE.md new file mode 100644 index 0000000..8923b21 --- /dev/null +++ b/PAGE.md @@ -0,0 +1,170 @@ +# PAGE (Polyacrylamide Gel Electrophoresis) + +1. [Introduction](#231)
+ 1.1 [Charge Density](#2311)
+ 1.2 [Size and Shape](#2312) +3. [Native PAGE vs SDS PAGE](#232)
+ 2.1. [Native PAGE](#2321)
+ 2.2. [SDS PAGE](#2322) +4. [Procedure of SDS](#235) +5. [Interpretation](#233) +6. [Applications](#234) + +## 1. Introduction + +Polyacrylamide Gel Electrophoresis(PAGE) is a technique that separates macromolecules based on their electrophoretic mobility. Electrophoretic mobility is the ability of analytes to move towards an electrode of the opposite charge.[1] + +Compared to agarose gel which can also be used for electrophoresis, acrylamide gel is used for smaller molecules like proteins and nucleic acids because it has smaller pores[Figure1]. The separation of proteins in PAGE depends on the: +- Charge Density (Charge-to-mass ratio) +- Size and shape + +![Image](image/pores.png) +Figure 1 + +#### 1) Charge Density + +Proteins are composed of amino acids. Each of these amino acids carries charge, either positive or negative; some of them has no charge. Because of the charges that amino acids carry, proteins can have an overall charge. The pH of surroundings of the protein also affects the net charge of a protein. [2] + +In electrophoresis, pH of the buffer is set at a value such that all proteins at that pH will carry negative net charge. Proteins being negatively charged, then will migrate to anode (positive electrode) through electrophoresis. [Figure 2] + +- higher charge density migrates faster in the gel (green) +- lower charge density migrates slower in the gel (blue) + +![Image](image/demonstration.png) + +Figure 2 + +#### 2) Size and Shape + +Size: number of amino acid residues in a protein + +- larger in size, slower in migration +- smaller in size, faster in migration + +Shape: + +- globular proteins migrate faster +- elongated proteins migrate slower + +The extent of cross-linking in the gel and average pore size also affects the migration of proteins of various shapes and sizes. + +## 2. Native PAGE vs SDS PAGE + +There are two ways to run PAGE depending on the purpose of the analysis. PAGE can be run under denaturing or non-denaturing conditions. 
+ + +#### 1) Native PAGE + +In Native PAGE, the disulfide bonds are undisturbed, preserving the protein’s overall structure. As a result, the positioning of proteins through the gel is mainly influenced by the protein’s charge and the pH of the separation rather than its size. It allows the analysis of their natural state if we want to analyze bound proteins or complexes. Compared to SDS, Native PAGE do not use reducing agents, heat, and lower voltages which can damage the protein's natural state. [1] + +#### 2) SDS PAGE + +To determine the molecular weights of proteins or whether the given protein is made up of single subunit or multiple subunits, we’ll have to run SDS PAGE. In SDS PAGE, sodium dodecyl sulphate, with heat and sometimes a reducing agent are used to denature proteins before electrophoretic separation. The heat breaks the hydrogen bonds and the reducing agent cleaves disulfide bridges. [1] + +In Figure 3, we have proteins folded with positive and negative charges. Once you add the reducing agent, it cleaves the disulfide bond and unfolds the proteins. SDS with negative charge is then added to negate the charge density in proteins so that the proteins can be separated based on their molecular weight. The linearization of proteins and complex with SDS, as a result, cause the proteins to have similar charge density. With similar charge density, the proteins are then separated based on different molecular weight. + +![Image](image/SDS.png) +Figure 3 + + +![1702354442303](image/1702354442303.png) +Figure 4 + +The figure 4 shows that the protein is composed of two subunits. When the protein is treated with SDS molecule, its intact structure would get disrupted by attachment to negative charge of SDS. This leads to the protein denaturation and the mask of the original charges of amino acid by the coating. Now, having approximately same charge, density, shape, 'size' or 'molecular weight' would be the only parameter. + +## 3. 
Procedure of SDS + +![1702368233069](image/1702368233069.png) +Figure 5 + + + + + + + + + + + + + + + + + + + + + + + + + + +
Steps + Description +
Sample Preparation
  • Denature the protein by heating it with SDS and beta-mercaptoethanol
  • The SDS coating masks the original charges, giving the polypeptide chains a similar charge density and shape
  • This lets gel electrophoresis separate the proteins strictly by 'molecular weight' (size)
Gel Preparation
  • Requires bis-acrylamide (BIS), acrylamide, and a buffer for the gel mixture
  • This mixture prevents bubbles from forming during the gel electrophoresis process
  • Creates the gel matrix that allows the proteins to be separated at the end
Gel Electrophoresis
  • The negatively charged, SDS-coated proteins migrate towards the positive electrode (anode) under the electric current
  • The migration rate of each molecule indicates its molecular weight
  • This separates the protein molecules based on their size
  • The voltage strength controls the migration speed +
Staining and Visualization
  • The result of the gel electrophoresis is detected using a colored dye
  • The separated protein molecules are stained a distinct color by the dye
  • Coomassie Brilliant Blue (for proteins) or ethidium bromide (for nucleic acids), the major dyes used, is washed out wherever it is unbound
Analysis
  • The color intensity of each protein band is then analyzed using autoradiography
  • The amount of protein is directly proportional to the color intensity, i.e. the amount of bound dye
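
The link between migration rate and molecular weight in the steps above can be sketched numerically. The snippet below is only a minimal sketch, not part of the original protocol: it assumes that log10(molecular weight) falls roughly linearly with migration distance over the range of the marker ladder, and the ladder distances and sizes are made-up illustrative values. The function names `fit_ladder` and `estimate_size_kda` are hypothetical.

```python
import math


def fit_ladder(distances_mm, sizes_kda):
    """Least-squares fit of log10(size) = slope * distance + intercept."""
    xs = list(distances_mm)
    ys = [math.log10(size) for size in sizes_kda]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_size_kda(distance_mm, slope, intercept):
    """Convert a band's migration distance back to an estimated size in kDa."""
    return 10 ** (slope * distance_mm + intercept)


# Hypothetical marker ladder: migration distance (mm) and labeled size (kDa).
ladder_distances_mm = [10.0, 20.0, 30.0, 40.0]
ladder_sizes_kda = [100.0, 50.0, 25.0, 12.5]

slope, intercept = fit_ladder(ladder_distances_mm, ladder_sizes_kda)

# Hypothetical sample band that migrated 25 mm.
unknown_kda = estimate_size_kda(25.0, slope, intercept)
print(f"Estimated size: {unknown_kda:.1f} kDa")
```

The negative slope reflects that smaller proteins travel farther. With a real gel the ladder points will not sit exactly on a line, so the estimate is only as good as the fit over the ladder's range.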
+
+# 4. Interpretation
+Unlike an agarose gel electrophoresis result, which is only visible under UV light, a PAGE gel stained with Coomassie Brilliant Blue is visible to the naked eye. To begin interpreting the PAGE gel, first locate the marker ladder. The marker ladder is typically in the leftmost or rightmost lane of the gel. It is used to determine the size of the proteins present in the sample. Each band in the marker ladder is labeled with its corresponding size in kilodaltons (kDa), providing a reference for estimating the molecular weights of proteins in the sample. [7]
+
+![Image](image/gel1.png)
+
+Figure 6
+
+In Figure 6, the marker ladder is in the leftmost lane. Notice how the size of the bands decreases toward the bottom of the lane. This is consistent with the gel electrophoresis principle that smaller proteins migrate more rapidly than larger ones. As a result, smaller proteins end up near the bottom of the lane and larger proteins stay closer to the top.
+
+Another crucial aspect of interpreting the PAGE gel is the difference in band strengths. The strength of a band directly reflects the quantity of the corresponding protein within the sample.
+It provides important information on the relative abundance of specific proteins. A stronger band indicates more of the corresponding protein in the sample, while a lighter band suggests a lower concentration of that protein. [7]
+
+![Image](image/gel.png)
+Figure 7
+
+Looking at the gel representation in Figure 7, sample A has three bands at 30 kDa, 40 kDa, and 90 kDa. Comparing the strengths of the three bands lets us interpret the relative protein quantities within sample A. Notably, the 40 kDa protein band has the highest intensity, indicating a high abundance of that protein in the sample. 
Then, there are more 90kDa protein presented in the sample, followed by 30kDa protein being the least presented in the sample A. + +# 5. Applications + There are multiple PAGE applications analyzing the protein, including Western Blotting, Enzyme Zymography,Extraction for mass spectrometry, and Electrophoretic mobility shift assay. + + + + + + + + + + + + + + + + + + + + + + +
Applications + Description +
Western Blotting [3]
  • Used for antibody-based detection of specific proteins by transferring the protein molecules from the gel to a membrane
  • Identifies key traits of protein antigens: their quantity, molecular weight, and presence
  • How effectively the antigen was extracted can also be measured
Enzyme Zymography [4]
  • Overcomes challenges that traditional zymography faces when analyzing proteases
  • Non-reducing SDS-PAGE addresses this limitation without involving the protein substrate
  • As in Western blotting, the bands from electrophoresis are interpreted for proteomic analysis
Extraction for mass spectrometry [5]
  • Analyzes chemical and molecular structure by measuring the mass-to-charge ratio
  • Protein bands produced by SDS-PAGE gel are excised
  • Destaining and extraction of the protein follow
Electrophoretic mobility shift assay [6]
  • Electrophoretic mobility shift assay (EMSA) is a technique for detecting protein-nucleic acid complexes
  • Paired with PAGE because of its higher separation resolution and stability compared to agarose gel electrophoresis
  • Non-denaturing (native) PAGE is regarded as one of the best tools for EMSA, since SDS would disrupt the complexes
+ +# Reference + +[1] Polyacrylamide Gel Electrophoresis, How It Works, Technique Variants and Its Applications | Technology Networks. (n.d.). Retrieved December 12, 2023, from https://www.technologynetworks.com/analysis/articles/polyacrylamide-gel-electrophoresis-how-it-works-technique-variants-and-its-applications-359100. + +[2] “NATIVE PAGE.” YouTube, YouTube, 10 May 2019. Retrieved from December 13, 2023, from +https://www.youtube.com/watch?v=5obiHqeYEc0. + +[3] Brooks, S. A., Schumacher, U., Blancher, C., & Jones, A. (n.d.). Western Blotting 145 145 SDS-PAGE and Western Blotting Techniques. From: Methods in Molecular Medicine, 57. https://pubmed.ncbi.nlm.nih.gov/21340897/. + +[4] Pan, D., Wilson, K. A., & Tan-Wilson, A. (2017). Transfer Zymography. Methods in Molecular Biology (Clifton, N.J.), 1626, 253–269. https://doi.org/10.1007/978-1-4939-7111-4_24 + +[5] Cohen, S. L., & Chait, B. T. (1997). Mass spectrometry of whole proteins eluted from sodium dodecyl sulfate-polyacrylamide gel electrophoresis gels. Analytical Biochemistry, 247(2), 257–267. https://doi.org/10.1006/ABIO.1997.2072. + +[6] Hellman, L. M., & Fried, M. G. (n.d.). Electrophoretic Mobility Shift Assay (EMSA) for Detecting Protein-Nucleic Acid Interactions. https://doi.org/10.1038/nprot.2007.249. + +[7] How to Interpret Polyacrylamide Gels: The basics - LabXchange. (n.d.). Retrieved December 12, 2023, from https://www.labxchange.org/library/items/lb:LabXchange:02a2a79b:html:1. diff --git a/PAGE_copy.md b/PAGE_copy.md new file mode 100644 index 0000000..6a88b64 --- /dev/null +++ b/PAGE_copy.md @@ -0,0 +1,181 @@ +# PAGE (Polyacrylamide Gel Electrophoresis) + +1. [Introduction](#231)
+ 1.1 [Charge Density](#2311)
+ 1.2 [Size and Shape](#2312) +3. [Native PAGE vs SDS PAGE](#232)
+ 2.1. [Native PAGE](#2321)
+ 2.2. [SDS PAGE](#2322) +4. [Procedure of SDS](#235) +5. [Interpretation](#233) +6. [Applications](#234) + +## 1. Introduction + +Polyacrylamide Gel Electrophoresis(PAGE) is a technique that separates macromolecules based on their electrophoretic mobility which is the ability of analytes to move towards an electrode of the opposite charge.[1] + +Compared to agarose gel which can also be used for electrophoresis, acrylamide gel is used for smaller molecules like proteins and nucleic acids because it has smaller pores[Figure1]. The separation of proteins in PAGE depends on the: +- Charge Density (Charge-to-mass ratio) +- Size and shape + +Figure 1 +![Image](pores.png) + +#### 1) Charge Density + +Proteins are composed of amino acids. Each of these amino acids carries charge, either positive or negative; some of them has no charge. Thus, because of the charges, proteins carry an overall charge (or net charge). Net charge of a protein depends on pH of its surroundings. [2] + +In electrophoresis, pH of the buffer is set such that all proteins at that pH will carry negative net charge. Being negatively charged, they will migrate to anode (positive electrode). [Figure 2] + +- higher charge density migrates faster in the gel (green) +- lower charge density migrates slower in the gel (blue) + +Figure 2
+![Image](demonstration.png) + +#### 2) Size and Shape
+ +Size: number of amino acid residues in a protein + +- larger in size, slower in migration +- smaller in size, faster in migration + +Shape: + +- globular proteins migrate faster +- elongated proteins migrate slower + +The extent of cross-linking in the gel and average pore size also affects the migration of proteins of various shapes and sizes. + +## 2. Native PAGE vs SDS PAGE + +There are two ways to run PAGE depending on the purpose of the analysis. + +To better understand the difference between these methods, I'd like to distingush them between the following couple of aspects: + +#### 1) Native PAGE + +In Native PAGE, the disulfide bonds are undisturbed, preserving the protein’s overall structure. As a reult, the positioning of proteins through the gel is mainly influenced by the protein’s charge and the pH of the separation rather than its size. It allows the analysis of their natural state if we want to analyze bound proteins or complexes. Compared to SDS, Native PAGE do not use reducing agents, heat, and lower voltages which can damage the protein's natural state. [1] + +#### 2) SDS PAGE + +To determine the molecular weights of proteins or whether the given protein is made up of single subunit or multiple subunits, we’ll have to run SDS PAGE. In SDS PAGE, sodium dodecyl sulphate, with heat and sometimes a reducing agent are used to denature proteins before electrophoretic separation. The heat breaks the hydrogen bonds and the reducing agent cleaves disulfide bridges. [1] + +In Figure 3, we have proteins folded with positive and negative charges. Once you add the reducing agent, it cleaves the disulfide bond and unfolds the proteins. SDS with negative charge is then added to negate the charge density in proteins so that the proteins can be separated based on their molecular weight. The linearization of proteins and complex with SDS, as a result, cause the proteins to have similar charge density. 
With similar charge density, the proteins are then separated based on different molecular weight. [1] + +Figure 3 + +![Image](SDS.png) + +Figure 4 + +![1702354442303](image/PAGE/1702354442303.png) + +The figure 4 shows that the protein is composed of two subunits. When the protein is treated with SDS molecule, its intact structure would get disrupted by attachment to negative chage of SDS. This leads to the protein denaturation and the mask of the original charges of amino acid by the coating. Now, having approximately same charge, density, shape, 'size' or 'molecular weight' would be the only paramter. + +## 3. Procedure of SDS + +Figure 5 + +![1702368233069](image/PAGE/1702368233069.png) + + + + + + + + + + + + + + + + + + + + + + + + + + +
Steps + Description +
Sample Preparation
  • Denature the protein by heating it with SDS and beta-mercaptoethanol
  • The SDS coating masks the original charges, giving the polypeptide chains a similar charge density and shape
  • This lets gel electrophoresis separate the proteins strictly by 'molecular weight' (size)
Gel Preparation
  • Requires bis-acrylamide (BIS), acrylamide, and a buffer for the gel mixture
  • This mixture prevents bubbles from forming during the gel electrophoresis process
  • Creates the gel matrix that allows the proteins to be separated at the end
Gel Electrophoresis
  • The negatively charged, SDS-coated proteins migrate towards the positive electrode (anode) under the electric current
  • The migration rate of each molecule indicates its molecular weight
  • This separates the protein molecules based on their size
  • The voltage strength controls the migration speed +
Staining and Visualization
  • The result of the gel electrophoresis is detected using a colored dye
  • The separated protein molecules are stained a distinct color by the dye
  • Coomassie Brilliant Blue (for proteins) or ethidium bromide (for nucleic acids), the major dyes used, is washed out wherever it is unbound
Analysis
  • The color intensity of each protein band is then analyzed using autoradiography
  • The amount of protein is directly proportional to the color intensity, i.e. the amount of bound dye
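
The proportionality between band intensity and protein amount in the Analysis step can be turned into relative abundances. The snippet below is a minimal sketch under stated assumptions: the intensity numbers are invented for illustration (they are not measured from any figure here), they are assumed to be background-subtracted and within the linear range of the stain, and `relative_abundance` is a hypothetical helper name.

```python
def relative_abundance(band_intensities):
    """Return each band's share of the total intensity in one lane."""
    total = sum(band_intensities.values())
    if total <= 0:
        raise ValueError("total band intensity must be positive")
    return {band: value / total for band, value in band_intensities.items()}


# Hypothetical background-subtracted intensities for three bands in one lane.
intensities = {"30 kDa": 100.0, "40 kDa": 600.0, "90 kDa": 300.0}

fractions = relative_abundance(intensities)

# Rank the bands from most to least abundant.
ranked = sorted(fractions, key=fractions.get, reverse=True)
for band in ranked:
    print(f"{band}: {fractions[band]:.0%} of the lane's total signal")
```

These shares only compare bands within the same lane. Comparing across lanes or gels would additionally need a loading control, which this sketch does not model.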
+
+# 4. Interpretation
+Unlike an agarose gel electrophoresis result, which is only visible under UV light, a PAGE gel stained with Coomassie Brilliant Blue is visible to the naked eye.
+
+To begin interpreting the PAGE gel, first locate the marker ladder. The marker ladder is typically in the leftmost or rightmost lane of the gel. It is used to determine the size of the proteins present in the sample. Each band in the marker ladder is labeled with its corresponding size in kilodaltons (kDa), providing a reference for estimating the molecular weights of proteins in the sample.
+
+Figure 6
+
+![Image](gel1.png)
+
+In Figure 6, the marker ladder is in the leftmost lane. Notice how the size of the bands decreases toward the bottom of the lane. This is consistent with the gel electrophoresis principle that smaller proteins migrate more rapidly than larger ones. As a result, smaller proteins end up near the bottom of the lane and larger proteins stay closer to the top.
+
+Another crucial aspect of interpreting the PAGE gel is the difference in band strengths. The strength of a band directly reflects the quantity of the corresponding protein within the sample.
+It provides important information on the relative abundance of specific proteins. A stronger band indicates more of the corresponding protein in the sample, while a lighter band suggests a lower concentration of that protein.
+
+Figure 7
+
+![Image](gel.png)
+
+Looking at the gel representation in Figure 7, sample A has three bands at 30 kDa, 40 kDa, and 90 kDa. Comparing the strengths of the three bands lets us interpret the relative protein quantities within sample A. Notably, the 40 kDa protein band has the highest intensity, indicating a high abundance of that protein in the sample. 
Then, there are more 90kDa protein presented in the sample, followed by 30kDa protein being the least presented in the sample A. + +# 5. Applicaton + There are multiple PAGE applications analyzing the protein, including Western Blotting, Enzyme Zymography,Extraction for mass spectrometry, and Electrophoretic mobility shift assay. + + + + + + + + + + + + + + + + + + + + + + +
Applications + Description +
Western Blotting [3]
  • Used for antibody-based detection of specific proteins by transferring the protein molecules from the gel to a membrane
  • Identifies key traits of protein antigens: their quantity, molecular weight, and presence
  • How effectively the antigen was extracted can also be measured
Enzyme Zymography [4]
  • Overcomes challenges that traditional zymography faces when analyzing proteases
  • Non-reducing SDS-PAGE addresses this limitation without involving the protein substrate
  • As in Western blotting, the bands from electrophoresis are interpreted for proteomic analysis
Extraction for mass spectrometry [5]
  • Analyzes chemical and molecular structure by measuring the mass-to-charge ratio
  • Protein bands produced by SDS-PAGE gel are excised
  • Destaining and extraction of the protein follow
Electrophoretic mobility shift assay [6]
  • Electrophoretic mobility shift assay (EMSA) is a technique for detecting protein-nucleic acid complexes
  • Paired with PAGE because of its higher separation resolution and stability compared to agarose gel electrophoresis
  • Non-denaturing (native) PAGE is regarded as one of the best tools for EMSA, since SDS would disrupt the complexes
+
+Reference
+
+[1] Polyacrylamide Gel Electrophoresis, How It Works, Technique Variants and Its Applications | Technology Networks. (n.d.). Retrieved December 12, 2023, from https://www.technologynetworks.com/analysis/articles/polyacrylamide-gel-electrophoresis-how-it-works-technique-variants-and-its-applications-359100.
+
+[2] “NATIVE PAGE.” YouTube, YouTube, 10 May 2019. Retrieved December 13, 2023, from
+https://www.youtube.com/watch?v=5obiHqeYEc0.
+
+[3] Brooks, S. A., Schumacher, U., Blancher, C., & Jones, A. (n.d.). Western Blotting 145 145 SDS-PAGE and Western Blotting Techniques. From: Methods in Molecular Medicine, 57.`
` + +[4] Pan, D., Wilson, K. A., & Tan-Wilson, A. (2017). Transfer Zymography. Methods in Molecular Biology (Clifton, N.J.), 1626, 253–269. https://doi.org/10.1007/978-1-4939-7111-4_24.`
` + +[5] Cohen, S. L., & Chait, B. T. (1997). Mass spectrometry of whole proteins eluted from sodium dodecyl sulfate-polyacrylamide gel electrophoresis gels. Analytical Biochemistry, 247(2), 257–267. https://doi.org/10.1006/ABIO.1997.2072.`
` + +[6] Hellman, L. M., & Fried, M. G. (n.d.). Electrophoretic Mobility Shift Assay (EMSA) for Detecting Protein-Nucleic Acid Interactions. https://doi.org/10.1038/nprot.2007.249.`
` + +[7] How to Interpret Polyacrylamide Gels: The basics - LabXchange. (n.d.). Retrieved December 12, 2023, from https://www.labxchange.org/library/items/lb:LabXchange:02a2a79b:html:1.`
` + +[8] SDS-PAGE - Wikipedia. (n.d.). Retrieved December 12, 2023, from https://en.wikipedia.org/wiki/SDS-PAGE. + + + + diff --git a/README.md b/README.md deleted file mode 100644 index 8dc42cf..0000000 --- a/README.md +++ /dev/null @@ -1,5 +0,0 @@ -# BENG183 -Lecture materials -and other documentation - -Fall 2018 diff --git a/RNA-seq Data Analysis/readme.md b/RNA-seq Data Analysis/readme.md deleted file mode 100644 index 28daefd..0000000 --- a/RNA-seq Data Analysis/readme.md +++ /dev/null @@ -1,63 +0,0 @@ -# RNA-seq Data Analysis Pipeline -## Connect to linux server -Open a terminal and type
`ssh username@ieng6-###.ucsd.edu`
-## TOPHAT-CUFFLINKS Pipeline
-First let's create some target directories with the following commands
-```
-mkdir geneExpression
-cd geneExpression
-mkdir alignments
-mkdir fpkm
-mkdir diff
-```
-
-Then we can use TOPHAT to align the reads to the genome with the following template command
-`tophat -p 1 -G /path/to/genes.gtf -o out/dir path/to/genome/index path/to/reads_R1.fastq path/to/reads_R2.fastq`
-The genome index used with TOPHAT should be bowtie2 index files. -To align all the fastq files from the example data at the same time, we can create a shell script
-```
-cd alignments
-vi alignment.sh
-reads=/home/linux/ieng6/be183f/public/bengTutorial/fastq # directory where the fastq files are stored
-genes=/home/linux/ieng6/be183f/public/bengTutorial/index_gtf/genes4.gtf # this contains the positions of genes on the genome
-# TopHat needs to know where the reference genome is, so we create soft links to the reference genome index files in the current folder
-for file in /home/linux/ieng6/be183f/public/bengTutorial/index_gtf/4*; do
-    ln -s "$file" .
-done
-
-tophat -p 1 -G $genes -o C1_R1 4 ${reads}/GSM794483_C1_R1_1.ss.fq ${reads}/GSM794483_C1_R1_2.ss.fq &
-tophat -p 1 -G $genes -o C1_R2 4 ${reads}/GSM794484_C1_R2_1.ss.fq ${reads}/GSM794484_C1_R2_2.ss.fq &
-tophat -p 1 -G $genes -o C2_R1 4 ${reads}/GSM794486_C2_R1_1.ss.fq ${reads}/GSM794486_C2_R1_2.ss.fq &
-tophat -p 1 -G $genes -o C2_R2 4 ${reads}/GSM794487_C2_R2_1.ss.fq ${reads}/GSM794487_C2_R2_2.ss.fq &
-```
-Next we save the script and quit by typing `:wq`. Then we run the script by typing `bash alignment.sh`
-After the alignment step is finished, we use Cufflinks to quantify gene expression
-A template Cufflinks command looks like the following
-`cufflinks -p 1 -G path/to/genes.gtf -o path/to/outdir path/to/accepted_hits.bam`
-We can also write a shell script to process all the files at once
-```
-$ cd ../fpkm # move from the alignments folder into the fpkm folder
-$ vi fpkm.sh
-genes=/home/linux/ieng6/be183f/public/bengTutorial/index_gtf/genes4.gtf
-alignments=../alignments
-
-for condition in C1 C2; do
-for replicate in R1 R2; do
-    echo ${condition}_${replicate}
-    cufflinks -p 1 -G $genes -o ${condition}_${replicate} ${alignments}/${condition}_${replicate}/accepted_hits.bam
-done; done
-```
-We save the script and quit by typing `:wq`. Then we run the script by typing `bash fpkm.sh`
-Here, genes.fpkm_tracking and isoforms.fpkm_tracking contain gene expression values (measured as FPKM) at the gene and transcript levels, respectively.
-
-## STAR-Kallisto Pipeline
-We can also use STAR to align the reads to the genome. We first need to build index files that are compatible with STAR before the alignment step. To build the index, we can run the following template command
-`STAR --runMode genomeGenerate --genomeDir path/to/starIndex --genomeFastaFiles path/to/genome.fa`
-Then, we'll be able to execute the alignment step with the following template command
-`STAR --genomeDir path/to/starIndex/ --readFilesIn path/to/read1 path/to/read2 --outFileNamePrefix output/`
-After the mapping is finished, the mapping statistics can be viewed in `Log.final.out` and the detailed mapping results can be viewed in `Aligned.out.sam`
-Besides Cufflinks, we can also use Kallisto to quantify the gene expressions directly from the raw fastq files. To do this, we need to build the index for Kallisto first with the following template command
-`kallisto index -i path/to/output.index path/to/transcriptome.fa`
-With index file built, we are able to quantify the gene expressions with the following template command
-`kallisto quant -i path/to/output.index -o path/to/outDir path/to/read1.fastq path/to/read2.fastq`
-The results can be viewed at `abundance.tsv` where gene expressions are quantified in terms of TPM values. diff --git a/RNA-seq Data Analysis/rna-seq-lecture.MD b/RNA-seq Data Analysis/rna-seq-lecture.MD deleted file mode 100644 index 1599810..0000000 --- a/RNA-seq Data Analysis/rna-seq-lecture.MD +++ /dev/null @@ -1,148 +0,0 @@ -# RNA-seq Differential Analysis -Assess different expression levels of genes between our - samples at different timepoints, conditions, etc. -## Why? -Let's suppose we have a strain of E. coli that we want to optimize to produce ethanol. -But we don't know what genes are involved. What do we do? - -We perform differential expression analysis on our RNA-seq data. The expression -levels have already been calculated using `cufflinks`. - -## How? Cuffdiff! - -### Cuffdiff under the hood -A simple methodology for assessing differential expression levels is -differences in transcript count between sample conditions, but this can -be erroneous due to alternative splicing and biases in sequencing. - -A more robust method is to fit a **Poisson Distribution** given the -expectation that the odds of seeing a change in expression level are small. - -**But Poisson doesn't account for count uncertainty and count dispersion!** - -+ **Count Uncertainty:** The fact that reads can be shared across multiple genes -because of shared genetic data. - -+ **Count Dispersion:** The fact that the number of reads produced is highly -variable between replicates. - -From the [_Cuffdiff Paper_](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3869392/pdf/nihms439296.pdf): ->Our method addresses both of these issues by modeling how variability in measurements of a transcript's fragment count depends on both its expression and its splicing structure. 
-
-The way in which this happens is:
-
->The algorithm captures uncertainty in a transcript's fragment count as a beta distribution and the overdispersion in this count with a negative binomial, and mixes the distributions together. The resulting mixture is a beta negative binomial distribution that reflects both sources of variability in an isoform's measured expression level.
-
-Summarized in this picture:
-![text][methods]
-
-**Pretty clever!**
-### Using Cuffdiff
-
-Thankfully using `cuffdiff` is super easy.
-
-An example implementation is shown below:
-
-First thing for ease of use is to generate a file called `diff.sh`. This
-is a simple bash file and all of this could be accomplished just using
-the command line but this is a little cleaner and allows for
-easier code repeatability.
-
-```bash
-$ vi diff.sh
-genes=/home/linux/ieng6/be183f/public/bengTutorial/index_gtf/genes4.gtf
-C1_R1=../alignments/C1_R1/accepted_hits.bam
-C1_R2=../alignments/C1_R2/accepted_hits.bam
-C2_R1=../alignments/C2_R1/accepted_hits.bam
-C2_R2=../alignments/C2_R2/accepted_hits.bam
-
-# Replicates of one condition are comma-separated; conditions are space-separated
-cuffdiff -o diff_out ${genes} ${C1_R1},${C1_R2} ${C2_R1},${C2_R2}
-```
-Now we simply run: `bash diff.sh`
-Where the files noted in `C*_R*` are the alignment files (`accepted_hits.bam`) generated by
-`tophat`.
-
-In bash scripting, `${name}` expands the variable `name` defined earlier in
-the script.
-
-### Cuffdiff Output
-It will generate a tab-delimited file in the `diff_out` directory
-called `gene_exp.diff`. 
It looks like: -``` -test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change) test_stat p_value q_value significant -FBgn0002521 FBgn0002521 pho 4:1193093-1202271 q1 q2 OK 1885.82 1597.12 -0.239721 -1.27565 0.07125 0.389135 no -FBgn0004607 FBgn0004607 zfh2 4:524476-560418 q1 q2 NOTEST 6.72414 6.02587 -0.158181 0 1 1 no -FBgn0004624 FBgn0004624 CaMKII 4:1056642-1074329 q1 q2 OK 6854.63 6821.34 -0.00702337 -0.0397619 0.9532 0.97767 no -FBgn0004859 FBgn0004859 ci 4:68333-77667 q1 q2 NOTEST 33.6854 68.8131 1.03056 0 1 1 no -FBgn0005558 FBgn0005558 ey 4:718314-741787 q1 q2 OK 339.474 313.086 -0.116739 -0.511244 0.46085 0.925195 no -FBgn0005561 FBgn0005561 sv 4:1109443-1133943 q1 q2 OK 215.591 182.876 -0.237431 -0.810935 0.2223 0.819162 no -FBgn0005666 FBgn0005666 bt 4:745029-796707 q1 q2 OK 337.666 352.068 0.0602593 0.312879 0.65755 0.925195 no -FBgn0010217 FBgn0010217 ATPsyn-beta 4:1052439-1055175 q1 q2 OK 45751.2 45196.6 -0.0175982 -0.10918 0.8776 0.929994 no -FBgn0011642 FBgn0011642 Zyx102EF 4:1077990-1081542 q1 q2 OK 4612.4 4319.09 -0.0947887 -0.486887 0.48765 0.925195 no -``` - -### Accessing the data -Conveniently CSV files can be accessed with `R`, `Python`, or even `Excel` - -Here's an example code snippet of accessing only statistically significant -rows in `Bash`: -```bash -$ grep yes gene_exp.diff -``` -Where this only selects rows where the last columns is `yes`. However if -the substring `yes` is in a `locus`, `gene_id`, or any other field then this will -return false positives. - -By default this threshold is a value of 0.05 for the q-value field. -**What is a q-value?** - - - -In `Python`: -```python -import pandas as pd -# Tab delimited with a header at the zeroth row. -df = pd.read_csv('gene_exp.diff', delimiter='\t', header=0) -just_significant = df.loc[df.significant == 'yes'] -``` -Further analysis would be to get the count of significant differences by -gene locus. 
Continuing: -```python -counts_by_locus = just_significant.groupby(['locus']).count() -``` - - -### Example Figure -This is an example out put of how you could plot the results. Here the x-axis -is the Log2(fold-change) and the y-axis is the --log10(q-value). Both of these values are accessible from the -the output of `cuffdiff` above. -![alt text][diff-im] - - -*Global analysis of differential gene expression related to long-term sperm storage in oviduct of Chinese Soft-Shelled Turtle Pelodiscus sinensis* - -Liu, Tengfei, et al. - - -## Supplementary - -### What's a q-value? -When doing lots of statistical comparisons, the likelihood of getting -a statistically significant result (p-value) increases as more comparisons -are performed. This results in an increased `False Discovery Rate (FDR).` - -#### Bonferroni Correction -To combat this issue of an increased FDR from multiple comparisons, we can -adjust our threshold for statistical significance with the **Bonferroni Correction**. - -Simply, if we have `m` comparisons being performed and some initial p-value, `a`, -then our new threshold, `p` is: - -`p = a / m` - - - - -[diff-im]: https://media.nature.com/lw926/nature-assets/srep/2016/160915/srep33296/images_hires/srep33296-f3.jpg "Example plot" -[methods]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3869392/bin/nihms439296f2.jpg diff --git a/finalPaper/ChipSeq/CHIP.png b/finalPaper/ChipSeq/CHIP.png deleted file mode 100644 index 88b9375..0000000 Binary files a/finalPaper/ChipSeq/CHIP.png and /dev/null differ diff --git a/finalPaper/ChipSeq/ExperimentDesign.png b/finalPaper/ChipSeq/ExperimentDesign.png deleted file mode 100644 index e555507..0000000 Binary files a/finalPaper/ChipSeq/ExperimentDesign.png and /dev/null differ diff --git a/finalPaper/ChipSeq/Histone.png b/finalPaper/ChipSeq/Histone.png deleted file mode 100644 index 628408e..0000000 Binary files a/finalPaper/ChipSeq/Histone.png and /dev/null differ diff --git a/finalPaper/ChipSeq/Yiyang_Yin.md 
b/finalPaper/ChipSeq/Yiyang_Yin.md deleted file mode 100644 index 15fb107..0000000 --- a/finalPaper/ChipSeq/Yiyang_Yin.md +++ /dev/null @@ -1,187 +0,0 @@ -# CHIP-Sequencing -###### BENG183 Final Paper date:12-14-2018 -###### Group 5 Yiyang Yin A92112108 - -1. [Introduction](#1) -2. [History](#2) -3. [The Goal of CHIP-Sequencing](#3) -4. [Understanding functional features of the genome](#4)
- 4.1. [Nucleosome](#41)
- 4.2. [Transcription Factors and TF binding sites](#42)
- 4.3. [Histone Modification](#43)
- 4.4. [Insulator](#44) -5. [Overview of Chip-Sequencing](#5)
- 5.1. [The workflow of CHIP-Sequencing](#51)
- 5.2. [Experimental Design](#52) -6. [Data Analysis](#6)
- 6.1. [FastQ File Format](#61)
- 6.2. [Downstream Analysis](#62) -7. [Advantages of Chip-Sequencing](#7) -8. [Applications](#8) - - - -## 1. Introduction - -![CHIP-Sequencing](CHIP.png) - -Chromatin-immunoprecipitation (ChIP) followed by sequencing of the immuno-precipitated DNA is a powerful tool for the investigation of Protein-DNA interactions. To perform ChIP-seq, chromatin is isolated from cells or tissues and fragmented. Antibodies against chromatin associated proteins are used to enrich for specific chromatin fragments. The DNA is recovered, sequenced and aligned to a reference genome to determine specific protein binding loci. ChIP studies have increased our knowledge of transcription factor biology, DNA methylation and histone modifications. - -Typical steps in CHIP-Sequencing: -- **Cross-link** -- **Selection** -- **Alignment** - -## 2. History - -In 2007, there was a race to develop CHIP-Sequencing. At least three groups worked to -develop a genome-wide assay of protein binding. The three papers were submitted to -three separate journals. -- Mikkelsen et al. from the Broad submitted to -Nature and was published in August. -- Johnson et al. from Stanford submitted to -Science and was published in June. -- Barski et al. from NHLBI, NIH submitted to Cell -and was published in May. - -![Dr.Zhao](Zhao.jpg)
-Barski, along with Dr. Zhao's lab, is recognized as the first. - -## 3. The Goal of CHIP-Sequencing - -- **Determine the binding sites of various proteins in the genome** -- **Predict regulation of certain genes in cells** -- **Derive binding motifs of certain transcription factors by studying their common binding sites** - -All in all, CHIP-Sequencing focuses on studying gene regulation through DNA sequencing. - -## 4. Understanding functional features of the genome - -CHIP-Sequencing involves various concepts in genome study. Before we dive into the actual process, let us quickly remind ourselves of several definitions. - -#### 1) Nucleosome - -A nucleosome is a basic unit of DNA packaging, consisting of a segment of DNA wound around a core of eight histone proteins. - -#### 2) Transcription Factors and TF binding sites - -![Promoters](genome.png) - -Transcription factors regulate gene expression by interacting with various binding sites. - -- Core Promoter - - The core promoter is the minimal portion of the promoter required to properly initiate transcription. It contains the transcription start site (TSS), a binding site for RNA polymerase, and some general transcription factor binding sites, such as the TATA box. - -- Proximal promoter (Enhancer, Silencer) - - The proximal promoter is the proximal sequence upstream of the gene that tends to contain primary regulatory elements, such as enhancers and silencers. - -#### 3) Histone Modification - -![Histone](Histone.png) - -Covalent post-translational modifications (PTMs) to histone proteins include methylation, phosphorylation, acetylation, ubiquitylation, and sumoylation. The PTMs made to histones can impact gene expression by altering chromatin structure or recruiting histone modifiers. - -#### 4) Insulators - -An insulator functions as an enhancer-blocker, a barrier, or both. The mechanisms behind these two functions include loop formation and nucleosome modifications. - -## 5.
Overview of Chip-Sequencing - -#### 1) The workflow of CHIP-Sequencing - -The basic workflow is to isolate the target nuclei and then covalently **cross link** the proteins to DNA. The chromatin is then sheared by sonication or enzymatically digested into small fragments that are suitable for sequencing. Next, an **antibody against the specific protein or protein modification** is used to pull down the protein bound to DNA; that is, the chromatin is **sheared and immunoprecipitated** with antibody-bound magnetic beads. After that, **cross linking is reversed** and the DNA is purified. The immunoprecipitated and purified DNA is then used as the input for a next-generation sequencing library prep protocol, where it is sequenced and analyzed for DNA binding sites. - -- **Cross-link** - -Proteins are cross-linked to their bound DNA by formaldehyde, and cells are homogenized. - -- **Selection** - -Chromatin is sheared and immunoprecipitated with antibody-bound magnetic beads. - -- **Alignment** - -Immunoprecipitated DNA is then used as the input for a next-generation sequencing library prep protocol, where it is sequenced and analyzed for DNA binding sites. - -#### 2) Experimental Design - -![ExperimentalDesign](ExperimentDesign.png) - -During the whole CHIP-Sequencing process, many small details could affect the accuracy of our data. To reduce possible noise in the experiment, a designated lab protocol is followed, as the picture shows. - -## 6. Data Analysis - -#### 1) FASTQ file format - -![FASTQ](fastq.png) - -After the wet lab process, it is time for data analysis. In most CHIP-Sequencing protocols, we acquire raw sequence data in the FASTQ format, a text-based format for storing both a biological sequence and its corresponding quality scores.
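
As a rough illustration of how sequence and quality are stored together, here is a minimal Python sketch that reads FASTQ records. It assumes the standard four-line record layout (identifier, sequence, `+` separator, quality string) and Phred+33 quality encoding; the function name `read_fastq` and the example read are hypothetical and not part of any particular pipeline.

```python
import io
from typing import Iterator, List, Tuple


def read_fastq(handle) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, quality_scores) from a FASTQ text stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header)
        # Phred+33: subtract 33 from each character's ASCII code
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores


# Hypothetical single-read example; 'I' decodes to 40 and '#' to 2
example = io.StringIO("@read1\nGATTACA\n+\nIIIIII#\n")
records = list(read_fastq(example))
```

In practice an established parser (for example the one in Biopython) is usually a safer choice than a hand-written reader; the sketch is only meant to show how the quality string maps to numeric scores.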
- -For each cluster that passes filter, a single sequence is written to the corresponding sample’s R1 FASTQ file, and, for a paired-end run, a single sequence is also written to the sample’s R2 FASTQ file. Each entry in a FASTQ files consists of 4 lines: - -- A sequence identifier with information about the sequencing run and the cluster. The exact contents of this line vary based on the BCL to FASTQ conversion software used. -- The sequence (the base calls; A, C, T, G and N). -- A separator, which is simply a plus (+) sign. -- The base call quality scores. These are Phred +33 encoded, using ASCII characters to represent the numerical quality scores. - -#### 2) Down Stream Analysis - -![Downstream Analysis](analysis.png) - -ChIP-seq analysis begins with mapping of trimmed sequence reads to a reference genome. Next, peaks are found using peak-calling algorithms. To further analyze the data, binding or motif analysis are common end points of ChIP-seq workflows. At every stage, the choice of method or algorithm and the parameters used will affect the downstream results. - -It is very important to look into noise reduction for the data we get. Typically, noise comes from these features: -- **bad mapping algorithm** -- **bad antibody design** -- **multiple binding sites** -- **variation on number of DNA copies** - -We can use negative control experiments to determine the background noise of performed experiment. - -## 7. Advantages of CHIP-Sequencing - -CHIP-Sequencing has several unique advantages comparing to existing methods. - -- **Generality** - - Captures DNA targets for transcription factors or histone modifications across the entire genome of any organism. - -- **Defines transcription factor binding sites** - -- **Multipurpose** - - Reveals gene regulatory networks in combination with RNA sequencing and methylation analysis. - -- **Offers compatibility with various input DNA samples** - -## 8. 
Applications - -Although CHIP-Sequencing is a newly developed method, it has inspired many advancements. - -- **The ENCODE Project** - - The National Human Genome Research Institute (NHGRI) supports the public research consortium named ENCODE, the Encyclopedia Of DNA Elements, to identify all functional elements in the human and mouse genomes. - - ENCODE has produced vast amounts of data that can be accessed through the project's freely accessible database, the ENCODE Portal. The ENCODE "Encyclopedia" organizes these data into two levels of annotations: 1) integrative-level annotations, including a registry of candidate cis-regulatory elements and 2) ground-level annotations derived directly from experimental data. - - As a result of outreach and collaboration, ENCODE data are widely used. Lists of publications using ENCODE resources can be found on the ENCODE Portal. The ENCODE Portal also hosts data from modENCODE as well as data from the RoadMap Epigenomics and Genomics of Gene Regulation projects. - -- **FoxA1 Knock-down** - - Antoni Hurtado, et al. performed knock-down of the FoxA1 “pioneer factor”, resulting in reduced binding by the estrogen receptor (ER) at over 50% of known ER binding sites. They showed that FoxA1 is an important regulator of ER-mediated transcription, suggesting it may be a new and important therapeutic target in breast cancer (Hurtado 2011). - -- **TF binding evolution** - - Dominic Shmidt, et al. used ChIP-seq to investigate the evolution of transcription factor binding. They focused on CEBPA and HNF4 binding in the liver tissue of five vertebrate species: human, mouse, dog, opossum and chicken. ChIP-chip would have been almost impossible given the different species involved and complexities in designing probes (Schmidt 2010). 
- -## Reference - -[1]https://www.genome.gov/10005107/the-encode-project-encyclopedia-of-dna-elements/ - -[2]https://www.nhlbi.nih.gov/science/epigenome-biology - -[3]https://www.researchgate.net/publication/288346044_Hurtado_2011-NatGen - -[4]https://www.researchgate.net/profile/Dominic_Schmidt diff --git a/finalPaper/ChipSeq/Zhao.jpg b/finalPaper/ChipSeq/Zhao.jpg deleted file mode 100644 index 452d1a8..0000000 Binary files a/finalPaper/ChipSeq/Zhao.jpg and /dev/null differ diff --git a/finalPaper/ChipSeq/analysis.png b/finalPaper/ChipSeq/analysis.png deleted file mode 100644 index d5a8b83..0000000 Binary files a/finalPaper/ChipSeq/analysis.png and /dev/null differ diff --git a/finalPaper/ChipSeq/fastq.png b/finalPaper/ChipSeq/fastq.png deleted file mode 100644 index 4c31ccb..0000000 Binary files a/finalPaper/ChipSeq/fastq.png and /dev/null differ diff --git a/finalPaper/ChipSeq/genome.png b/finalPaper/ChipSeq/genome.png deleted file mode 100644 index 7d53c43..0000000 Binary files a/finalPaper/ChipSeq/genome.png and /dev/null differ diff --git a/finalPaper/IntroToMachineLearning/Derek_Jow.md b/finalPaper/IntroToMachineLearning/Derek_Jow.md deleted file mode 100644 index 5053a32..0000000 --- a/finalPaper/IntroToMachineLearning/Derek_Jow.md +++ /dev/null @@ -1,154 +0,0 @@ -# Introduction to Machine Learning -### By Derek Jow - -* [Overview](#overview) -* [Machine Learning in Bioinformatics](#machine-learning-in-bioinformatics) -* [Distance Functions](#distance-functions) -* [Hierarchical Clustering](#hierarchical-clustering) -* [K-Means Clustering](#k-means-clustering) -* [Summary](#summary) - -* * * - -## Overview - -It's a cozy friday evening and your girlfriend/boyfriend comes over to your house to watch netflix and chill. -As you log in, you are swarmed with advertisements recommending you to watch "The Avengers", "Superman", and "James Bond". -The suggestions all sound good, so how did Netflix know that you would like these movies? 
Netflix uses a special type of -classification algorithm that grouped you with the action movie enthusiasts. They noticed which movies you watched in the past -and predicted what type of movies you would like in the future based on those past decisions. Making predictions and -conclusions based on data is the crux of machine learning. - -Machine learning refers to the general type of algorithm that makes assertions or -predictions based on data. Machine learning can be used to categorize individuals into certain -groups based on shared similarities or recognize certain patterns that match an individual. There are -many different kinds of questions that can be answered with machine learning, hence it -is split into several domains: supervised vs unsupervised, clustering vs categorization, and -continuous vs discrete. - -An algorithm is said to be **supervised** if the potential types, or **labels** of the data -are known. On the other hand, an unsupervised algorithm does not have output labels and can work with -anonymous data. Unsupervised algorithms are used when the **grouping** of the data is more important -than the label themselves, and supervised algorithms are used when the type of the group provides -more meaning. - -![Unsupervised vs Supervised](img/unsupvssup.JPG) - -The second domain of machine learning is continuous vs discrete. **Continuous** algorithms -have outputs on continuous, or flowing, spectrum. **Discrete** algorithms produce information in -distinct, well-defined buckets. - -![Supervised/Unsupervised Continuous/Discrete example](img/examples.PNG) - -Lastly, a machine learning algorithm could be classifying or clustering. Classification and clustering -algorithms often answer the same question, but differ in their implementations. In general, -classification algorithms aim to find the best way to **separate** data in classes, whereas -clustering algorithms strive to **group** data into cliques. 
Depending on the circumstance, -classification and clustering may give different results based on the input data. - -![Classification vs Clustering](img/classvsclust.png) - -* * * - -## Machine Learning in Bioinformatics - -There are many applications of machine learning in bioinformatics. Machine learning algorithms are -used in image classification, detecting variation in rare diseases, and computing phylogenetic trees. In -**biomedical informatics**, we see machine learning used in **precision medicine**. The general -idea of precision medicine is to make disease predictions based on an individual's -genomic or transcriptomic (RNA-Seq) data. We can use this **personalized** information to -predict potential cancer-causing genes or discover subtypes of a disease. - -In RNA-Seq data, a datapoint is a multidimensional vector. Each row corresponds to a gene, and each column -represents an individual. Often, we use clustering algorithms to group individuals together -who have the same variation of a disease to diagnose the best therapy. - -![Rna-Seq Example](img/rna-seq.png) - -In the subsequent sections, we will examine two commonly used clustering algorithms used -in bioinformatics. - -* * * - -## Distance Functions - -Before we discuss the details of clustering algorithms, we need to define a distance function. The -**distance function** is a function that computes the difference between two points by some -mathematical quantity. The most common distance function uses **Euclidean** distance, or the distance -between points in **geometric** space. Another distance function could be **Hamming** distance, the -number of different nucleotides in DNA strings. - -* * * - -## Hierarchical Clustering - -In **hierarchical clustering**, we compute a **dendrogram** that separates data points. A -**dendrogram** is a diagram that illustrates the arrangement of clusters in a tree. However, it could -also be visualized in a **heatmap** or **venn diagram**. 
In the above RNA-Seq example, we see that -a dendrogram is produced above the heatmap that clusters individuals together based on similarity. - -![Hierarchical Clustering](img/hierImg.png) - -The following is the general procedure of a hierarchical clustering algorithm: -1. Calculate the similarity (distance function) between all possible combinations of two profiles -2. Place each profile in a separate cluster. -3. Group the two most similar clusters together to form a new cluster. -4. Recalculate the similarity between the new cluster and all the remaining clusters. -5. Repeat steps 3 and 4 until all of the profiles end up in one large cluster. - -![Hierarchical Clustering Animation](img/hClust.gif) - -In hierarchical clustering, you have several clustering methods that each use a -different distance function: -1. Unweighted Pair Group Method (UPGMA) - Calculates the average distance from each point in -the cluster to all other points in another cluster -2. Single Linkage - Measures dissimilarity between two clusters as the minimum -dissimilarity between members of the two clusters -3. Complete Linkage - Measures dissimilarity between two clusters as the greatest -dissimilarity between members of the two clusters - -* * * - -## K-Means Clustering - -In **K-Means clustering**, we compute groups by minimizing the distance of each point to -to its group's **mean**. Unlike hierarchical clustering, k-means clustering must arbitrarily -choose **"k"**, the number of clusters, and consequently select k points to serve as the initial -**means** for each cluster. At each iteration, we continuously reassign points such that they are grouped -with the clusters that minimize the point's distance to the cluster's mean. - -![K-Means Clustering](img/kImg.png) - -The following is the general procedure of the k-means clustering algorithm: -1. Select the number of clusters K -2. Select the K starting points to serve as the initial cluster means -3. 
Iterate through each point and calculate the distance between the datum and each cluster's mean -4. Assign the datum to the cluster whose mean is closest to that point -5. Repeat steps 3 and 4 until **convergence** - when points are no longer reassigned - -![K-Means Clustering](img/kClust.gif) - -Note that K-Means clustering **does not always guarantee** termination. An upper bound for the number of iterations -should be assigned to prevent infinite loops. Additionally, the selection of the initial points can -**change the outcome**. Perhaps the selection of the initial points should be given careful -consideration rather than an arbitrary choosing. - -* * * - -## Summary - -Machine Learning is perhaps the greatest export of computing to biology. Any algorithm that -classifies or categorizes data utilizes some form of machine learning. In computer science, we see -machine learning in image classification and recommender systems. In bioinformatics, -machine learning tackles important medical problems such as categorizing patients into -subtypes of a disease or detecting genes responsible for pathology. In the future, machine learning -will become an essential component of bioinformatics as we aggregate more and more -biological data. Machine learning helps us deal with the information explosion of the 21st century and will -lay the foundation for precision medicine and gene-function discovery. - -* * * - -## Sources -1. Sheng Zhong BENG 183 -2. Victoria Tom, Joey Sun BENG 183 -3. 
https://dashee87.github.io/data%20science/general/Clustering-with-Scikit-with-GIFs/ diff --git a/finalPaper/IntroToMachineLearning/img/classvsclust.png b/finalPaper/IntroToMachineLearning/img/classvsclust.png deleted file mode 100644 index 041c7c7..0000000 Binary files a/finalPaper/IntroToMachineLearning/img/classvsclust.png and /dev/null differ diff --git a/finalPaper/IntroToMachineLearning/img/examples.PNG b/finalPaper/IntroToMachineLearning/img/examples.PNG deleted file mode 100644 index f4647f3..0000000 Binary files a/finalPaper/IntroToMachineLearning/img/examples.PNG and /dev/null differ diff --git a/finalPaper/IntroToMachineLearning/img/hClust.gif b/finalPaper/IntroToMachineLearning/img/hClust.gif deleted file mode 100644 index a756ab3..0000000 Binary files a/finalPaper/IntroToMachineLearning/img/hClust.gif and /dev/null differ diff --git a/finalPaper/IntroToMachineLearning/img/hierImg.png b/finalPaper/IntroToMachineLearning/img/hierImg.png deleted file mode 100644 index 198e62b..0000000 Binary files a/finalPaper/IntroToMachineLearning/img/hierImg.png and /dev/null differ diff --git a/finalPaper/IntroToMachineLearning/img/imgClass.png b/finalPaper/IntroToMachineLearning/img/imgClass.png deleted file mode 100644 index 117078a..0000000 Binary files a/finalPaper/IntroToMachineLearning/img/imgClass.png and /dev/null differ diff --git a/finalPaper/IntroToMachineLearning/img/kClust.gif b/finalPaper/IntroToMachineLearning/img/kClust.gif deleted file mode 100644 index 97c315b..0000000 Binary files a/finalPaper/IntroToMachineLearning/img/kClust.gif and /dev/null differ diff --git a/finalPaper/IntroToMachineLearning/img/kImg.png b/finalPaper/IntroToMachineLearning/img/kImg.png deleted file mode 100644 index fed6200..0000000 Binary files a/finalPaper/IntroToMachineLearning/img/kImg.png and /dev/null differ diff --git a/finalPaper/IntroToMachineLearning/img/readme.md b/finalPaper/IntroToMachineLearning/img/readme.md deleted file mode 100644 index 8b13789..0000000 --- 
a/finalPaper/IntroToMachineLearning/img/readme.md +++ /dev/null @@ -1 +0,0 @@ - diff --git a/finalPaper/IntroToMachineLearning/img/rna-seq.png b/finalPaper/IntroToMachineLearning/img/rna-seq.png deleted file mode 100644 index 9cb9189..0000000 Binary files a/finalPaper/IntroToMachineLearning/img/rna-seq.png and /dev/null differ diff --git a/finalPaper/IntroToMachineLearning/img/unsupvssup.JPG b/finalPaper/IntroToMachineLearning/img/unsupvssup.JPG deleted file mode 100644 index b3bc620..0000000 Binary files a/finalPaper/IntroToMachineLearning/img/unsupvssup.JPG and /dev/null differ diff --git a/finalPaper/IntroToMachineLearning_2/K-means_convergence.gif b/finalPaper/IntroToMachineLearning_2/K-means_convergence.gif deleted file mode 100644 index 2976219..0000000 Binary files a/finalPaper/IntroToMachineLearning_2/K-means_convergence.gif and /dev/null differ diff --git a/finalPaper/IntroToMachineLearning_2/Victoria_Tom.md b/finalPaper/IntroToMachineLearning_2/Victoria_Tom.md deleted file mode 100644 index 4f68922..0000000 --- a/finalPaper/IntroToMachineLearning_2/Victoria_Tom.md +++ /dev/null @@ -1,143 +0,0 @@ -# A Brief Introduction to Machine Learning -### By Victoria Tom - -* [Overview](#overview) -* [Application to Bioinformatics](#application-to-bioinformatics) -* [Distance Functions](#distance-functions) -* [Clustering Methods](#clustering-methods) -* [Hierarchical Clustering](#hierarchical-clustering) -* [K-Means Clustering](#k-means-clustering) -* [Summary](#summary) - -* * * - -## Overview - -Machine learning is currently a very popular field of research, in part because of its ability to analyze very large data sets at unprecedented scale in a variety of different fields. From analyzing consumer purchasing behavior to figuring out the best way to classify a disease, machine learning has the potential to be applied to many of our current research problems. - -So what is machine learning? 
- -At a high level, **machine learning** is a general term that we use to describe computer algorithms that we can use to predict an outcome based on the data that we input. - -There are a couple different types of machine learning, which we use to solve different problems. - -One way of looking at a problem is trying to decide if we want the output to be a result of supervised or unsupervised learning. - -In **supervised learning**, we already know the **labels** of the output. These labels represent all of the different possible expected outcomes of our program. For example, if we want to use machine learning to figure out if a flower is a tulip or a rose, we already know beforehand what the expected output should be - either "tulip" for a picture of a tulip, or "rose" for a picture of a rose. We tend to use this type of learning to predict an outcome that we know. - -In **unsupervised learning**, we don't know beforehand what we expect the output to look like beforehand. We let the computer decide how many labels there should be to best separate the different types of data. Unsupervised learning algorithms tend to focus more on how best to group different pieces of data together without us needing to specify what the different groups should be. We tend to use this type of learning to analyze sets of data for outcomes that we don't know. - -We can also distinguish between discrete and continuous outputs from the machine learning algorithms. - -**Discrete** outputs tend to come from a distinct and finite set of bracketed values, with no "in-between" values. For example, a discrete output can be an integer value like "3" if we're trying to count how many cats are in a picture, or a category like "romance" if we're trying to figure out what genre a piece of prose is written in. - -**Continuous** outputs tend to be values along a sliding scale, with a potentially infinite number of values. 
An example of a continuous output would be ".9876564" for how likely it is that a given picture is of a cat, since ".9876564" is not a category that we'd predefine, nor is it a result that many other data points would also be classified as. - -Below is an image that gives examples of each kind of machine learning. - - - -* * * - -## Application to Bioinformatics - -**Clustering** is an important tool that is often used in bioinformatics because of its ability to group data into different groups that we can then analyze. Clustering algorithms start with a collection of *n* objects that we want want to divide into *k* clusters so that objects within a cluster are more "similar" to each other than objects in other clusters. - -A common application is to use clustering on gene expression data. By analyzing how different genes are grouped, we can predict functions of unknown by using known ones. We can also discover shared regulatory regions in DNA sequences and discover subtypes of a given disease. - -For example, [researchers](https://www.ncbi.nlm.nih.gov/pubmed/11707567) were able to classify different subtypes of lung cancer by analyzing the differences in gene expression between them. - - - -Bhattacharjee et al. (2001) Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses Proc. Natl. Acad. Sci. USA, Vol. 98, 13790-13795 - - -* * * - - -## Distance Functions - -Before we move into a discussion of various clustering algorithms, let's go into the basics of how we determine how to group different data points into clusters. - -In clustering, we usually try to group things that are "similar" together into clusters. We use **distance functions** as a way to quantitatively measure how "close" data points are to each other. Lower distances mean that the points are closer together, while larger values indicate less similarity. - -There are several different ways to calculate the distance between two data points. 
The most common is the **Euclidean distance function**, which uses the distance formula to calculate the distance between two points. - -Another alternative is to use **Pearson's correlation** (aka Pearson’s r) to measure how correlated the two profiles are, with values ranging between -1 and 1. - -* * * - -## Clustering Methods - -One we determine how to measure how "close" data points are to each other, we need to determine how we measure the distances between different clusters. - -One method is using the **Unweighted Pair Group Method with Arithmetic Mean (UPGMA)**, which calculates the average distance from each point in the cluster to all other points in another cluster. Once that is calculated, the two clusters with the lowest average distance are then joined together, which will create a new cluster. - -Another clustering method is **Complete Linkage**. This method instead measures how similar 2 clusters are using the biggest dissimilarity between a member of one cluster and a member of another cluster. This method will tend to produce very tight clusters. - -Another method is to do **Single Linkage**. This method measures how similar 2 clusters are using the minimum dissimilarity between members of the 2 clusters. This method of clustering tends to produce clusters in long chains and can identify outliers readily. - - - - - -* * * - -## Hierarchical Clustering - -Now let's look at one algorithm for clustering - the hierarchical clustering algorithm. - -Using the hierarchical clustering algorithm will result in a dendrogram that groups every single data point into one large cluster, with similar data points grouped closer together. - -The basic algorithm for hierarchical clustering is as follows: -1. Calculate the similarity between all possible combinations of two profiles using a distance function. -2. Place each profile in a separate cluster. -3. Group the two most similar clusters together to form a new cluster. -4. 
Recalculate the similarity between the new cluster and all the other remaining clusters using a user-defined clustering method. -5. Repeat steps 3 and 4 until all of the profiles end up in one large cluster. - -We can then "cut" our dendrogram at a level that gives us the number of clusters that we want. - -Here is an animation that shows how hierarchical clustering behaves. - - - -* * * - -## K-Means Clustering - -The other algorithm we will cover is the k-means algorithm. Unlike hierarchical clustering, we know beforehand the number of clusters that we want (k). - -The basic algorithm for k-means clustering is as follows: - -1. Choose a k-value for the number of clusters we want to end up with. -2. Select k number of starting points that we want to initialize to start our clusters. -3. For each data point, find the closest mean vector and assign the object to the corresponding cluster. -4. For each cluster, update its mean vector according to the current assignments. - - - -We keep repeating the last two steps until a stopping criteria is met. Unlike the hierarchical clustering algorithm, the k-means clustering algorithm isn't always guaranteed to terminate. It can stop during **convergence**, when the algorithm no longer reassigns points, or it can run indefinitely until it stops at a user-defined number of iterations. - -This contrasts with hierarchical clustering which has a more finite and predictable termination step (when everything is inside of one cluster). Additionally, the k-means algorithm may produce different outcomes based on how we initialize our initial k points. - -Here is an animation that shows how k-means clustering behaves. - - - - -* * * - -## Summary - -In summary, machine learning is a powerful tool that we can use to analyze data. In bioinformatics, we commonly use clustering algorithms to analyze our data, with both k-means and hierarchical clustering algorithms as viable options. 
In the future, it is likely that machine learning will be used to drive further innovations in the fields of personalized medicine and genomics. - -* * * - -## Sources -1. Bhattacharjee et al. (2001) Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses Proc. Natl. Acad. Sci. USA, Vol. 98, 13790-13795 -2. Sheng Zhong "Intro to machine learning", BENG 183 -3. Derek Jow, Joey Sun, Victoria Tom "Introduction to Machine Learning", BENG 183 -4. https://commons.wikimedia.org/wiki/File:K-means_convergence.gif -5. https://dashee87.github.io/data%20science/general/Clustering-with-Scikit-with-GIFs/ - diff --git a/finalPaper/IntroToMachineLearning_2/classvsclust.png b/finalPaper/IntroToMachineLearning_2/classvsclust.png deleted file mode 100644 index 041c7c7..0000000 Binary files a/finalPaper/IntroToMachineLearning_2/classvsclust.png and /dev/null differ diff --git a/finalPaper/IntroToMachineLearning_2/clustering.png b/finalPaper/IntroToMachineLearning_2/clustering.png deleted file mode 100644 index 07db665..0000000 Binary files a/finalPaper/IntroToMachineLearning_2/clustering.png and /dev/null differ diff --git a/finalPaper/IntroToMachineLearning_2/examples.PNG b/finalPaper/IntroToMachineLearning_2/examples.PNG deleted file mode 100644 index f4647f3..0000000 Binary files a/finalPaper/IntroToMachineLearning_2/examples.PNG and /dev/null differ diff --git a/finalPaper/IntroToMachineLearning_2/hClust.gif b/finalPaper/IntroToMachineLearning_2/hClust.gif deleted file mode 100644 index a756ab3..0000000 Binary files a/finalPaper/IntroToMachineLearning_2/hClust.gif and /dev/null differ diff --git a/finalPaper/IntroToMachineLearning_2/kImg.png b/finalPaper/IntroToMachineLearning_2/kImg.png deleted file mode 100644 index fed6200..0000000 Binary files a/finalPaper/IntroToMachineLearning_2/kImg.png and /dev/null differ diff --git a/finalPaper/IntroToMachineLearning_2/methods.png b/finalPaper/IntroToMachineLearning_2/methods.png
deleted file mode 100644 index 37eb140..0000000 Binary files a/finalPaper/IntroToMachineLearning_2/methods.png and /dev/null differ diff --git a/finalPaper/MachineLearngin_DimensionReduction/Chen_Li.md b/finalPaper/MachineLearngin_DimensionReduction/Chen_Li.md deleted file mode 100644 index bec2301..0000000 --- a/finalPaper/MachineLearngin_DimensionReduction/Chen_Li.md +++ /dev/null @@ -1,61 +0,0 @@ -# Machine Learning - Dimension Reduction -By Chen Li -## Introduction -After we obtain an expression matrix from RNA-seq, in which every row is a gene and every column is a sample, how can we get a general sense of the distribution of the data and extract important features (genes) from the data matrix? Note that the expression matrix is usually in a high dimensional space because every gene is a dimension, so directly visualizing the data is impossible for human beings. Dimension reduction techniques such as PCA and t-SNE can map the original high dimensional data to a lower dimension while retaining some spatial information (similarity and difference between samples). In this way, we can easily visualize the data in 2D space. In addition, PCA can return the *importances* of the features (genes), which measure the ability of these features to group and separate the data points. By extracting the most important genes from the raw data matrix, we can explore the data, for example by clustering, more easily. - -1. [PCA](#pca) -2. [t-SNE](#t-sne) -3. [PCA vs t-SNE](#pca-vs-t-sne) - -### PCA -PCA (Principal Component Analysis) is widely used upstream of calculations that handle high dimensional data badly. PCA is a linear transformation method. It preserves the correlation between point x and y after transformation. PCA can reduce data with 4 or more dimensions to 2D or 3D. Let’s take an expression matrix for 6 mouse samples as an example. We will only use two genes for illustration. 
YouTube Reference: [StatQuest](https://www.youtube.com/watch?v=FgakZw6K1QQ) - -![](https://github.com/danielee0707/BENG183/raw/master/images/1.png) - -Now let's go through the steps in PCA calculation: -1. The data set is first moved so that its center is at the origin. Then PCA calculates the top components with the highest variation in the data. What it does is fit a line to the data set. For an arbitrary line in the plane through the origin, the sum of squared distances from the origin to the projections of all the points is calculated and maximized. This has the same effect as minimizing the sum of squared distances between the points and the line. (In practice, the calculation is done with linear algebra.) - -![](https://github.com/danielee0707/BENG183/raw/master/images/2.png) -![](https://github.com/danielee0707/BENG183/raw/master/images/3.png) - -2. This line is called PC1, or **Principal Component** 1, and it captures the largest variation in the data. From the slope of the resulting line, we also know its composition: it contains 4 parts Gene1 plus 1 part Gene2. - -3. PC2 is simply the line perpendicular to PC1. If the data is in 3D, PC2 will reside in a plane perpendicular to PC1, and the previous steps will be repeated. All dotted lines are perpendicular to each other. - -![](https://github.com/danielee0707/BENG183/raw/master/images/4.png) -![](https://github.com/danielee0707/BENG183/raw/master/images/6.png) - -4. In the final step, we rotate the coordinates so that PC1 becomes the x-axis and PC2 becomes the y-axis. The top two PCs are able to explain 94% of all variation in the data. - -![](https://github.com/danielee0707/BENG183/raw/master/images/5.png) - -**Note:** -When performing PCA transformation, only the top several PCs are used, and other PCs (dimensions) are discarded. Information is *lost* in this process. 
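The PCA steps above (center the data, find the directions of largest variation, rotate the coordinates) can be sketched in a few lines of NumPy. This is only a minimal illustration, not the code used to produce the figures: the function name `pca_sketch` and the expression values are made up for the example, and the matrix is assumed to be arranged with samples as rows and genes as columns.

```python
import numpy as np

def pca_sketch(X, n_components=2):
    """Project the rows of X (samples x genes) onto the top principal components."""
    # Step 1: move the data so that its center is at the origin.
    X_centered = X - X.mean(axis=0)
    # Steps 2-3: the rows of Vt are perpendicular unit vectors (PC1, PC2, ...),
    # ordered by how much variation each one captures.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = S**2 / np.sum(S**2)  # fraction of total variation per PC
    # Step 4: rotate the coordinates so that PC1 is the x-axis and PC2 the y-axis.
    scores = X_centered @ Vt[:n_components].T
    return scores, Vt[:n_components], explained[:n_components]

# Hypothetical expression values: 6 mouse samples (rows) x 2 genes (columns).
X = np.array([[10.0, 6.0], [11.0, 4.0], [8.0, 5.0],
              [3.0, 3.0], [2.0, 2.8], [1.0, 1.0]])

scores, components, explained = pca_sketch(X, n_components=2)
print(components[0])  # relative contribution of Gene1 and Gene2 to PC1
print(explained)      # fraction of variation captured by each PC
```

Keeping only the first column of `scores` corresponds to discarding PC2, which is the step where information is lost.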
- -### t-SNE -By projecting data points to a plane of high variability, PCA tries to place dissimilar data points far apart and only preserves the global structure of data points (which means it may not be powerful enough to distinguish subgroups). So we will need t-SNE to see more detailed neighboring structures. - -![](https://github.com/danielee0707/BENG183/raw/master/images/7.png) -![](https://github.com/danielee0707/BENG183/raw/master/images/8.png) - -[Image Source](https://www.kaggle.com/puyokw/clustering-in-2-dimension-using-tsne/code) - -1. So how does t-SNE work? The name stands for t-distributed stochastic neighbor embedding. The underlying mathematics of t-SNE is quite advanced and will not be covered here. But basically, it applies a neighborhood-preserving mapping so that distances between neighboring points are faithfully preserved after transformation. - -2. But how do we determine neighbors? *Perplexity* represents roughly the number of potential neighbors considered for each point, so we can determine neighbors of each point and thus clusters by trying different perplexity parameters until a reasonable and clear clustering is visualized by t-SNE. This is usually determined arbitrarily. YouTube Reference: [Applied AI Course](https://www.youtube.com/watch?v=FQmCzpKWD48&list=PLupD_xFct8mHqCkuaXmeXhe0ajNDu0mhZ&index=1) - -![](https://github.com/danielee0707/BENG183/raw/master/images/9.png) - -3. Meanwhile, since there is always some randomness in t-SNE’s embedding, we need to run multiple *iterations* to improve the 2D embedding to best represent the original structure. The number of iterations is another hyperparameter to choose, and generally, the more iterations t-SNE runs, the more credible the resulting embedding will be. - -![](https://github.com/danielee0707/BENG183/raw/master/images/10.gif) - -**Note:** -* The size of each cluster in a t-SNE plot means nothing because it tends to expand dense clusters and shrink sparse ones. 
-* Since t-SNE only represents distances within a potential cluster, distances between clusters do not provide any information. - -### PCA vs t-SNE -Like many other machine learning techniques, PCA and t-SNE have their own advantages and disadvantages. When to use them depends on what you want to achieve. Please note that they are only two of the vast number of dimension reduction methods, and there are also ones that serve very similar purposes such as *ICA* and *UMAP*. [Reference](https://www.datacamp.com/community/tutorials/introduction-t-sne) -1. t-SNE is much more computationally expensive than PCA. -2. PCA is deterministic while t-SNE is not. Hyperparameters for t-SNE are somewhat arbitrary. -3. Information is lost during PCA calculation while t-SNE attempts to capture information from all dimensions. diff --git a/finalPaper/PrecisionMedicine/Mengyi_Liu.md b/finalPaper/PrecisionMedicine/Mengyi_Liu.md deleted file mode 100644 index 661d089..0000000 --- a/finalPaper/PrecisionMedicine/Mengyi_Liu.md +++ /dev/null @@ -1,126 +0,0 @@ -# Precision Medicine - Where do we go from here? -### by Mengyi Liu (Miko) - -### Sections: -##### 1. Introduction to Precision Medicine -##### 2. Personalized Medicine through the Lens of Asthma -##### 3. Current Hot Topics on Precision Medicine - -## 1. Introduction to Precision Medicine -Definition of Precision Medicine (by Dr. Su-In Lee): Tailoring of medical treatment to the individual characteristics of each patient, especially by using genetic or molecular profiling. - -In other words, precision medicine should account for personal variations in genomic sequence and environmental exposure. - -> Here is a story of how scientists used precision medicine to treat breast cancer. - -In the old paradigm, cancer types are usually distinguished by the origin of tumors. For example, lung cancer, liver cancer, breast cancer. 
- -However, tumors with the same origin may be different in appearance and behavior, aggressiveness, and vulnerability. Thus we need to treat each kind of tumor differently. - -The early treatment of breast cancer in the 1970s: -- All patients underwent removal of ovaries: The assumption was, no estrogen, no growth of tumor. -- It only helped around 70% of patients with ER+ tumors. -- But how tumor cells are built matters: For example, markers on tumor cell surface and growth circuits can lead to different tumor progression. - -Then a discovery in 1970 found that: -- 70% of breast cancers are ER (estrogen-receptor) positive. -- These patients can be treated with an anti-estrogen agent to block cancer growth. -> A much better treatment, but what about the remaining 30% of the patients? - -1984 discovery: -- 20% of breast cancers have abnormal HER-2 gene expression. -- A new drug, Herceptin, was developed to inhibit the function of this protein. - -Mid-1990s findings: -- 5% of breast cancers have an inherited defect in gene BRCA1 or BRCA2. -- No preventive treatment, but can screen for such inborn defect and watch for tumor formation closely. - -Fortunately, the treatments using an anti-estrogen agent or Herceptin can be effective for around 85% of the breast cancer patients. This improvement in cancer therapy is largely attributed to the effort to split the patients into different subgroups, then treat each subgroup on its own based on its distinct characteristics. That is the core of precision medicine. We will see how scientists implement this "splitting" in the next section as well. - -Note that the 5% of breast cancers with an inherited defect in gene BRCA1 or BRCA2 are categorized as Triple Negative Breast Cancer (TNBC). This type of breast cancer does not express the genes for estrogen receptor (ER), progesterone receptor (PR) and HER2. It is still in need of an effective treatment. 
- -In order to develop personalized treatment and advance the field of precision medicine, current genomics and machine learning researchers need to address: - -- identifying genetic or molecular markers for clinical phenotypes -- discovering disease subtypes from genetic and/or molecular data -- building prediction models for clinical events based on electronic medical record (EMR) data - -> Let's look at an example where scientists used the theory of precision medicine to study asthma. - -## 2. Personalized Medicine through the Lens of Asthma - -#### 1) What is Asthma? -A condition where a person's lungs become inflamed and produce extra mucus, making it hard to breathe. - -Biology behind asthma: An excessive allergen-induced type 2 inflammation, orchestrated by memory CD4+ T cells that produce type 2 cytokines (Th2 cells). - -The pathway of a similar disease, rhinitis, also involves Th2 cells. - -But unfortunately, there is currently no cure for asthma. Newer therapies are only partially successful in certain subtypes. - -![asthma](https://github.com/miko-798/BENG_183_mini_lecture/blob/master/asthma.png) - -Thus we need a better understanding of the disease at a molecular level. - -#### 2) A precision medicine study on asthma - -The figure below shows how scientists split the asthma patients into 4 subgroups. - -They took into account 8 parameters obtained from patients that were correlated with their disease and the severity, and did a partition-around-medoids clustering. - -![subgroups](https://github.com/miko-798/BENG_183_mini_lecture/blob/master/subgroups.png) - -In the study, *Transcriptional profiling of Th2 cells identifies pathogenic features associated with asthma*, the researchers performed RNA-Seq on Th2 cells from a total of 80 samples from 77 patients, including 3 biological replicates. 
- -Then they did RNA-Seq analysis to identify genes differentially expressed between allergic asthma, rhinitis and healthy control groups, by performing negative binomial tests for pairwise comparisons employing the Bioconductor package DESeq2. - -They found the following: - -- They identified a total of 15 distinct gene modules. -- DESeq analysis found 500 genes differentially expressed between asthmatic subjects and healthy subjects (Genes for apoptosis, zinc transporters, MAPK, NF-κB, TNF). -- The expression of most of these genes was similar between rhinitis and asthma. -- Genes that differentiate asthmatic from healthy subjects show an intermediate phenotype in allergic rhinitis subjects. - -These genetic markers identified through this study could offer insights for personalized medicine for asthma patients. - -## 3. Current Hot Topics on Precision Medicine - -> The study of precision medicine has gained an increased interest in the scientific community in the past decade. Some interesting work are presented as follows. - -#### 1) Mapping of patients data - -Atul Butte, a Distinguished Professor and the Director of the Computational Health Sciences Institute at UCSF, constructed a map illustrating how each individual patient developed different subtypes of diseases and how these diseases lead to mortality. - -Here is the map showing the paths for each patient: - -Each square and circle represent a certain subtype of disease, and each line represents a transition. - -![map](https://github.com/miko-798/BENG_183_mini_lecture/blob/master/map.png) - -This map offers physicians and scientists possible patterns of disease progression, which can help guide the treatments for future patients. Clearly, this is one way to *build prediction models for clinical events based on electronic medical record (EMR) data*. The amount of detailed information encompassed by this map is powerful and grants possibility for precision care for patients. 
- -> Let's look at another application of precision medicine. - -#### 2) Opening the black box of machine learning models - -Su-In Lee, an Associate Professor at Paul G. Allen School of Computer Science & Engineering at UW, found weaknesses in the conventional way to identify molecular markers. She tries to interpret the complex machine learning models that are used to analyze patient data, and strives to offer individualized explanations for a particular prediction, and for a particular patient. - -![su-in lee](https://github.com/miko-798/BENG_183_mini_lecture/blob/master/su-in%20lee.png) - - -These are both very interesting and cutting edge research efforts in the field of precision medicine. So what's next in precision medicine? Well, let's keep in mind the three things that scientists may want to address: - -- identifying genetic or molecular markers for clinical phenotypes -- discovering disease subtypes from genetic and/or molecular data -- building prediction models for clinical events based on electronic medical record (EMR) data - -Whether it be cancer vaccines or personalized drugs for a particular subtype of cancer, the biomedical research community will witness a growth in precision cancer care. We all have the potential to make what's next. - - - -## References: -1. Seumois, Grégory et al. “Transcriptional Profiling of Th2 Cells Identifies Pathogenic Features Associated with Asthma” Journal of immunology (Baltimore, Md. : 1950) vol. 197,2 (2016): 655-64. -2. Su-In Lee Lab: https://suinlee.cs.washington.edu/research -3. Su-In Lee: "Interpretable Machine Learning for Precision Medicine" https://youtu.be/La7KTIe2DeU -4. Precisely practicing medicine with a trillion points of data. | Atul Butte | TEDxSanFrancisco https://www.youtube.com/watch?v=fbZZ_1Jbm6w -5. Zhong, Sheng. BENG 183 lecture on precision medicine. 
https://docs.google.com/presentation/d/1VD0KbnLThYzqJ9eN6EWcOhd5i726QfYc8KnCt7bUhPM/edit#slide=id.g404a8c7ebe_0_517 diff --git a/finalPaper/PrecisionMedicine_2/Reysha_Patel.md b/finalPaper/PrecisionMedicine_2/Reysha_Patel.md deleted file mode 100644 index 518dec9..0000000 --- a/finalPaper/PrecisionMedicine_2/Reysha_Patel.md +++ /dev/null @@ -1,118 +0,0 @@ -1. [Introduction - What is Precision Medicine?](#1) -2. [Overview of Precision Medicine/Pipeline](#2)
-3. [Diabetes and Precision Medicine](#3) -4. [Benefits and Drawbacks](#4) -5. [Future of Precision Medicine](#5) -6. [References](#6) - -## 1. Introduction - What is Precision Medicine? - ->Most medical treatments are designed for the "average patient" as a one-size-fits-all approach, which may be successful for some patients but not for others. This is the way medicine has been approached for many years. Precision medicine, however, takes a new approach that tailors disease prevention and treatment to the differences in people's genes, environments, and lifestyles. The goal of precision medicine is to target the right treatments to the right patients at the right time. - -The difference between personalized medicine and precision medicine: ->Although they are often used interchangeably, personalized medicine is more focused on the diagnosis and treatment of a single person rather than the use of big data to categorize diseases and determine specific treatments for subtypes to be applied to individuals on a larger scale. - - -## 2. Overview of Precision Medicine/Pipeline - ->The first step in conducting precision medicine research requires collecting data from two groups of patients: a disease group and a control group. There are several bioinformatics approaches used to gather relevant patient data that can be applied to diagnosing and treating patients. Some popular approaches are: -- High-throughput sequencing -- RNA-Seq -- Molecular profiling -- Tumor profiling -- Hi-C -- ChIP-Seq
- ->After data is obtained, the patients or disease is categorized according to the relevant features that were determined through data collection. This clustering is usually carried out using various machine learning algorithms. Finally, the appropriate treatments can be determined on an individual basis or by disease subtype.[3] - - -![](imag.png) -An overview of the basic precision medicine protocol described above can be seen in the figure above. **Figure by Prasad, Rashmi, et al. Journal of Internal Medicine (2018).** - -## 3. Diabetes and Precision Medicine ->The treatment of diabetes has been lagging behind cancer for some time now because the diagnosis of diabetes during the last 100 years has been based upon measurement of a single metabolite, glucose. This method of diagnosis has led to the classification of diabetes into two main subgroups - Type 1 (T1D) and Type 2 (T2D). These categorizations are very imprecise. - ->T2D is an exclusionary diagnosis. If the patient does not have T1D or monogenic diabetes, then they are considered to have T2D. This suggests that about 90% of all patients with diabetes have T2D. Metformin is generally the first medication that physicians prescribe to patients with Type 2 diabetes, as it controls insulin levels. - ->However, not all patients diagnosed with T2D respond well to the treatments that are typically prescribed, like Metformin. To address this heterogeneity of T2D, a study conducted in 2018 by Lund University Diabetes Center was able to classify Type 2 diabetes into five distinct subgroups, allowing for prediction, diagnosis, and ultimately treatment. Sub-classification of the disease resulted from their clustering analysis, which was based on six characterizations: age at diagnosis, BMI, HbA1c, GADA, C-peptide together with glucose for estimation of insulin secretion, HOMA-B and insulin-sensitivity, and HOMA-IS. 
- -![cluster](clusterChar.jpeg) -This table shows the phenotypic characteristics that were used to create the T2D subtypes. **Figure by Prasad, Rashmi, et al. Journal of Internal Medicine (2018).** - - -### Type 2 Diabetes Subtypes - -#### SAID (Severe autoimmune diabetes) -SAID is severe autoimmune diabetes. It is characterized by the presence of GADA, low insulin secretion, and poor metabolic control. SAID shows the expected association with HLA genes, which is completely lacking in SIDD.[1] -###### Treatment -Insulin - -#### SIDD (Severe insulin-deficient diabetes) -SIDD is severe insulin deficient diabetes. SIDD is characterized by low insulin secretion, poor metabolic control, and an increased risk of retinopathy. SIDD also shows a strong association with variants in the TCF7L2 gene (which has previously shown strong association with T2D and insulin deficiency).[1]
- -Treatment: Often mistreated with Metformin; these patients need insulin. - -#### SIRD (Severe insulin resistant diabetes) -SIRD is severe insulin resistant diabetes. It is characterized by severe insulin resistance, obesity, late onset and markedly increased risk of nephropathy. SIRD reflects an “unhealthy” obesity with insulin resistance and often fatty liver. In support of this, SIRD shows a clear association with a variant in the TM6SF2 gene previously associated with fatty liver and nonalcoholic fatty liver disease.[1]
- -Treatment: Requires treatment that enhances insulin sensitivity. - -#### MOD (Mild obesity related diabetes) -MOD is mild obesity related diabetes, which represents healthy obesity with no insulin resistance. It is characterized by obesity, early onset and good metabolic control.[1]
- -Treatment: Lifestyle, Metformin - -#### MARD (Mild age related diabetes) -MARD is mild age related diabetes. It is characterized by late onset and good metabolic control.[1]
- -Treatment: Lifestyle, Metformin. - -![plots](plots.png) -These plots show the results of the K-means clustering algorithm that was used to create the five disease subtypes in diabetes. **Figure by Ahlqvist, Emma, et al. The Lancet Diabetes & Endocrinology (2018)** - -## 4. Benefits and Drawbacks of Precision Medicine ->Since precision medicine is an emerging field, there are many benefits and drawbacks associated with it. These benefits and drawbacks are related to a variety of different fields including governmental policy and scientific practices. - -#### Benefits - - Treatments that are derived as a result of precision medicine are more likely to be successful because there are many additional tests that are performed to determine the specific subtype of disease. These tests allow treatment to be more specific and to better target the disease. - - Another benefit of the genetic testing involved with precision medicine is that the treatment is less likely to have side effects because it is tailored to a specific disease subtype. - - Precision medicine can also be used as a means of early diagnosis. For example, if a patient is beginning to exhibit a genetic change that is common in a certain disease, that patient can be diagnosed early. - - It can also be used as preventative medicine in the example of breast cancer, where the BRCA genes are indicators that a patient may develop breast cancer. - - All of the testing in precision medicine also can help determine new disease subtypes which we saw in our diabetes example. - - Precision medicine can also reveal population health, in that we can see which countries or groups of genetically similar people are likely to develop a specific disease. [4] - -#### Drawbacks -- A big drawback of precision medicine is that it is expensive. Testing is expensive and precision medicine is based on the idea of genetic testing. -- All of the testing is also very time consuming. 
The testing itself is quick, but the sequencing and secondary analysis of data is very time consuming because the files are large. -- The large files created by the different types of testing could lead to a data storage problem. These files tend to be very large and testing every patient would create a data storage problem. -- Many of the informatics approaches that are already in place are not able to effectively integrate data; informatics approaches need to be better developed and improved so that data that is collected can be analyzed effectively. -- Precision medicine involves patient data that is heavily protected and hard to access. As testing becomes more common, patient privacy may be an issue and the security of their information could also be at stake. -- Many current precision medicine approaches and studies are difficult to access. So there is limited access to collected data as well.[4] - -## 5. Future of Precision Medicine ->Current studies utilizing precision medicine are barely scraping the surface of the capabilities of the practice. As a relatively new approach in determining both diagnoses and treatments for many diseases, precision medicine has room for growth in so many areas. For researchers, having access to appropriate and complete patient data is imperative to determining diagnosis and treatments that are accurate and can be useful in a clinical approach. Furthermore, besides HIPAA (Health Insurance Portability and Accountability Act), there is not much policy in place to protect patients' genomic data, which has the potential to be used without the patient's full consent. Finally, many of the informatics approaches that are currently used to cluster patient data are time consuming and inaccurate, so the advent of new informatics approaches is a key factor in furthering the field of precision medicine. 
More specific examples of ways in which these areas are progressing towards a future of clinical applications are listed below. -#### Data Collection -The All of Us Research Program is an initiative that began in 2015. The program was given $215 million in funding and it aims to collect various forms of data, such as sequencing information, physical examination data, and wearable device information, in order to make advances in tailoring medical care to the individual. - -Research institutes, like Memorial Sloan Kettering, are also at the center of creating an infrastructure for widespread access to certain aspects of precision medicine. Sloan Kettering is attempting to do this for cancer research and treatments by grouping patient data which can determine patient eligibility for clinical trials. - -A large amount of data is also being collected through the use of mobile or wearable devices. For example, the Fitbit collects lots of valuable information about health and fitness. This data could prove to be extremely useful in a clinical setting. - -#### Policy - -In addition to the need for legislation in the area of patient privacy regarding genomic and other health related information, there is also a need for increased funding in the area of precision medicine. Although more money has been allocated towards research in precision medicine in the past few years, the reimbursements for those who wish to participate in studies or clinical trials remain low. - -#### Better Informatics Approaches -Machine learning is widely applied in solving health informatics problems because it is able to make predictions based on large datasets. Various machine learning algorithms are able to reduce data dimensionality and determine features that are relevant in applying treatments and diagnoses. However, sometimes complex data, lack of data, or rare events (which are highly probable in patient data) are not able to be accurately handled using machine learning algorithms. 
For this reason, better informatics approaches are needed for -reaching accurate clinical results. - -Data storage is an additional area in which expansion is needed in order to keep up with growth in precision medicine. Human data is very large and must be protected, so abundant and secure databases will need to be established to store patient data. - -## 6. References -[1] Prasad, Rashmi B, and Leif Groop. “Precision Medicine in Type 2 Diabetes.” *Journal of Internal Medicine*, 2018, doi:10.1111/joim.12859. - -[2] Ahlqvist, E., et al., "Novel subgroups of adult-onset diabetes and their association with outcomes: a data-driven cluster analysis of six variables." *Lancet Diabetes Endocrinology*, 2018. 6(5) p. 361-369
- -[3] Center for Devices and Radiological Health. “Precision Medicine.” *U S Food and Drug Administration Home Page, Center for Drug Evaluation and Research*, 2018, www.fda.gov/medicaldevices/productsandmedicalprocedures/invitrodiagnostics/precisionmedicine-medicaldevices/default.htm.
- -[4] Patel, Kirti. “Precision Medicine: Pros & Cons – Kirti Patel, MD – Medium.” Medium.com, *Medium*, 1 Feb. 2015, medium.com/@kirtipatelmd/precision-medicine-blessing-or-curse-8722c3ae94cb. diff --git a/finalPaper/PrecisionMedicine_2/clusterChar.jpeg b/finalPaper/PrecisionMedicine_2/clusterChar.jpeg deleted file mode 100644 index a459f26..0000000 Binary files a/finalPaper/PrecisionMedicine_2/clusterChar.jpeg and /dev/null differ diff --git a/finalPaper/PrecisionMedicine_2/imag.png b/finalPaper/PrecisionMedicine_2/imag.png deleted file mode 100644 index 50cb92a..0000000 Binary files a/finalPaper/PrecisionMedicine_2/imag.png and /dev/null differ diff --git a/finalPaper/PrecisionMedicine_2/plots.png b/finalPaper/PrecisionMedicine_2/plots.png deleted file mode 100644 index 4108be3..0000000 Binary files a/finalPaper/PrecisionMedicine_2/plots.png and /dev/null differ diff --git a/finalPaper/readme.md b/finalPaper/readme.md deleted file mode 100644 index bbb628c..0000000 --- a/finalPaper/readme.md +++ /dev/null @@ -1 +0,0 @@ -# Read Me diff --git a/finalPaper/sequencingApplication/CAPP-Seq.png b/finalPaper/sequencingApplication/CAPP-Seq.png deleted file mode 100644 index 13c78bf..0000000 Binary files a/finalPaper/sequencingApplication/CAPP-Seq.png and /dev/null differ diff --git a/finalPaper/sequencingApplication/CTCvsctDNA.jpg b/finalPaper/sequencingApplication/CTCvsctDNA.jpg deleted file mode 100644 index 7bd9db7..0000000 Binary files a/finalPaper/sequencingApplication/CTCvsctDNA.jpg and /dev/null differ diff --git a/finalPaper/sequencingApplication/CTCvsctDNA2.jpg b/finalPaper/sequencingApplication/CTCvsctDNA2.jpg deleted file mode 100644 index 5867858..0000000 Binary files a/finalPaper/sequencingApplication/CTCvsctDNA2.jpg and /dev/null differ diff --git a/finalPaper/sequencingApplication/Ishan_Goyal.md b/finalPaper/sequencingApplication/Ishan_Goyal.md deleted file mode 100644 index f5b9311..0000000 --- a/finalPaper/sequencingApplication/Ishan_Goyal.md +++ 
/dev/null @@ -1,150 +0,0 @@ -# Using Targeted Sequencing to Further Cancer Diagnostics -By - Ishan Goyal (A12094992) - -1. [Abstract](#1)
-2. [Introduction](#2)
- a. [Liquid Biopsy vs Tissue Biopsy](#21)
- b. [Circulating Tumor Cells vs Circulating Tumor DNA](#22)
-3. [Overview of Methods](#3)
- a. [Whole Genome Sequencing](#31)
- b. [Whole Exome Sequencing](#32)
- c. [CAPP-Seq](#33)
- d. [TAm-Seq](#34)
-4. [Applications & Future Use](#4)
-5. [Data Tables & Figures](#5)
- - -## Abstract - -The field of cancer diagnostics has experienced tremendous growth and technical developments over the past decade. The advent of high-throughput sequencing technologies coupled with high-specificity screening methods is enabling the discovery of new biomarkers and potential early disease diagnosis of patients. [1] With early diagnosis comes a variety of benefits for the patient including increased treatment options and higher overall survival rate. For colon cancer, there is a 91% 5-year survival rate when diagnosed early versus only 11% survival rate if it is caught late and has spread to other organs. [7] Unfortunately, current diagnosis options such as tissue biopsies, endoscopy, or radiology are often invasive, expensive, and involve long procedures for patients. In addition, tissue biopsies only provide a snapshot of the mutations in a patient rather than a global picture of the patient's predisposition to a disease. - -In this chapter, we will be introducing the use of sequencing technologies to analyze liquid biopsy based cancer samples. Early studies in 1977 revealed a high level of cell free DNA and circulating tumor DNA in cancer patient plasma. Liquid biopsy based diagnostics hope to solve the invasive and costly drawbacks of tissue biopsy while serving as a highly specific predictor of cancer. Liquid biopsy aims to detect cancer mutations within the plasma, provide early screening options for these mutations, and monitor these mutations over time to assess tumor burden and treatment effectiveness. - - -## Introduction - -#### Liquid Biopsy vs Tissue Biopsy:
-In recent years, the personalized or stratified management of patients with advanced non-small cell lung cancer (NSCLC) has allowed for the comparison of liquid and tissue biopsy techniques. While tissue biopsy is the conventional approach, it has been discovered that tumors often display heterogeneity between different regions in the same tumor as well as with distal tumors in the patient. [4] This heterogeneity presents a challenge because it limits tissue biopsy to a mere snapshot of the entire tumor profile. In addition, patients with NSCLC are often not in the condition to undergo complex biopsy procedures that are required to extract adequate tissue samples. Lastly, the turnaround times and costs of these procedures can pose an immense burden to the patient. - -In contrast, liquid biopsy is showing great promise through the analysis of tumor material within patient blood samples. A variety of nucleic acids such as circulating cell-free DNA and cell-free RNA are often released from apoptotic and necrotic tumor cells into the bloodstream. The mutation profiles of these nucleic acids can be analyzed using sequencing technologies to give a more holistic snapshot of a patient's tumor and mutation progression. It takes approximately 50 million malignant cells to release sufficient DNA for the detection of circulating tumor-specific DNA in the blood. In contrast, current positron emission tomography techniques for biopsy analysis can detect tumors no smaller than 7-10 mm in size. This equates to roughly 1 billion malignant cells in contrast to the 50 million required for liquid biopsy identification. The advanced sensitivity and financial feasibility of liquid biopsy approaches hope to further the cancer diagnostics field and soon replace traditional tissue biopsy. [4] - -![](./CTCvsctDNA.jpg)
-[Figure 1](https://www.sciencedirect.com/science/article/pii/S2001037018300060#s0030). -**Comparison of the types of mutations and analysis that can be conducted with tissue & liquid based biopsy techniques.** - -#### Circulating Tumor Cells vs Circulating Tumor DNA:
- -Circulating Tumor Cells (CTCs): CTCs are tumor cells that have spread from tumors via blood or lymphatic vessels. The presence of CTCs in lung cancer patients has been reported as a known factor in disease metastasis and outgrowth. The key challenge with CTCs is their detection, as they require extreme levels of sensitivity to observe. One enrichment technique, CellSearch, has been approved by the FDA for monitoring metastatic breast cancer, castration-resistant prostate cancer, and colorectal cancer. [2] In clinical practice, it has been noticed that the overall survival of patients who had stable CTC counts in their blood after treatment was significantly worse. - -Circulating Tumor DNA (ctDNA): ctDNA is hypothesized to enter the bloodstream either passively through apoptotic and necrotic tumor cells or actively by living tumor cells that are targeting recipient cells at distal locations. Detection of ctDNA, similar to CTCs, is challenging as it is a small percentage of all cell-free DNA. Using PCR and advanced NGS technologies, it is possible to identify low concentrations of ctDNA within patient plasma samples. Additional methods that help improve the sensitivity of NGS ctDNA sample analysis are discussed below. [2] In the clinical setting, ctDNA can be used for early diagnosis/response prediction and to characterize molecular tumor alterations. Many studies have noted that the ctDNA concentration often spikes during patient relapse. Overall, both CTC and ctDNA analysis coupled with sequencing methodologies can yield novel insights on tumor growth and patient-specific mutations. - - -![](./CTCvsctDNA2.jpg)
-[Figure 2](http://cancerdiscovery.aacrjournals.org/content/4/6/650). -**CTC and ctDNA analysis can yield information on the tumor progression and mutation profile.** - - -## Overview of Methods - -#### Whole Genome Sequencing (WGS): -WGS is used as an initial step in the bioinformatics pipeline to get an understanding of the patient's genome-wide cancer profile. WGS provides insights into genomic loci that are mutation hotspots and can help inform further probe design. Within plasma DNA, WGS has been used to detect copy number variation (CNV), but does not have the resolution to detect SNVs or allele frequencies. WGS approaches are also more prone to a higher ratio of intronic or passenger mutations compared to a targeted approach. [5] - -#### Whole Exome Sequencing (WES): -WES makes possible the routine analysis of de novo mutations in plasma samples by comparing samples prior to and in response to therapy. A proof of concept experiment with WES involved collecting plasma samples at the beginning of treatment and at the time of relapse. The WES approach here allows a more in-depth understanding of multiple regions that may be differentially expressed. This approach is also less sensitive to copy number variation challenges and is more cost effective than WGS. [5] - - -#### CAPP-Seq: -Cancer Personalized Profiling by Deep Sequencing (CAPP-Seq) is a highly sensitive and economical method to quantify ctDNA. In general, ctDNA levels are highly correlated with tumor volume and can provide earlier treatment response assessment compared to radiographic approaches. In non-small cell lung cancer patients, ctDNA was detected in 100% of stage II-IV tumors and 50% of stage I tumors. The CAPP-Seq technique requires designing a "selector" of biotinylated DNA oligonucleotides that target recurrently mutated regions in the cancer of interest. The selector panel is often optimized using WES data and other intron breakpoints to best span the mutated areas. 
CAPP-Seq can be further applied to disease stage monitoring as it can predict tumor stage progression via ctDNA analysis with a relatively high degree of sensitivity and specificity. [6] - - -Experimentally, it is possible to use CAPP-Seq across different time points to understand how the allele frequencies of different mutations are changing in response to treatment. This approach requires analyzing both ctDNA and germline DNA as a baseline to compare the variants. Mutations that tend to increase in frequency may be experiencing selective pressure, and targeting these mutations can significantly reduce disease progression. [6] - -CAPP-Seq Workflow:
-![](./CAPP-Seq.png)
-[Figure 4](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4016134/figure/F1/). -**Population level analysis is conducted to identify a set of selector regions for CAPP-Seq. Both tissue and liquid biopsy sample are analyzed to identify and validate personalized cancer markers.** - - - -#### TAm-Seq: -TAm-Seq is a method for tagged-amplicon deep sequencing that allows for the identification of cancer mutations in circulating DNA present at allele frequencies as low as 2%. With a sensitivity and specificity of over 97%, TAm-Seq can be used to monitor tumor dynamics, track mutations, and identify the origin of relapse in a patient with multiple primary tumors. The prototypic example is where the allele frequencies of 10 different mutations in a patient with metastatic breast cancer all sharply decline upon onset of the chemotherapy, but increase after termination of the therapy. By monitoring the AF of common mutations amongst patients with different disease subsets, it may be possible to generate more personalized treatment options. [3] The image below shows the progression of a disease at various time points where the allele frequencies are being measured by TAm-Seq. PR indicates partial response, SD is stable disease, and PD is progressive disease. - -TAm-Seq Workflow: -![](./TAm-SeqMonitoring.jpg)
-[Figure 5](http://stm.sciencemag.org/content/scitransmed/4/136/136ra68/F4.large.jpg). -**The allele frequencies of mutations in breast cancer patients are monitored to view their relationship with time and treatment.** - -The TAm-Seq method uses a combination of short amplicons, two-step amplification, sample barcode sequences, and high-throughput PCR. Because the amplicons are short, TAm-Seq effectively amplifies even small amounts of fragmented DNA such as are present in circulating DNA. PCR primers are used to cover regions of interest that are identified through prior WGS and WES along with population level analysis. The regions are amplified in a two-step process. They are first amplified in parallel to preserve the allele representation. They are then selectively re-amplified along the regions of choice (single-plex). Lastly, sequence-specific adaptors are added to these amplicons to allow for pooling and sequencing. Preparing TAm-Seq libraries for sequencing from 48 samples takes less than 24 hours and involves only a few hours of hands-on time. New platforms for massively parallel sequencing allow for fast turnaround times, which make this approach practical in a clinical setting. [3] - - -## Applications & Future Use - -The cancer diagnostics industry is expected to reach a net worth of approximately 232.7 billion USD by 2025. The market of individuals who could benefit from more accurate diagnostics increases annually as the National Cancer Institute estimates approximately 14 million new cases a year. There are a variety of companies developing technologies to further cancer diagnostics. GRAIL uses high-throughput sequencing to understand cancer-causing mutations while Freenome uses machine learning to predict immune responses that result from cancer development. - - -## Data Tables & Figures - -#### Comparison of WGS, WES, & Targeted Sequencing

| Attribute / Parameter | Whole Genome Sequencing (WGS) | Whole Exome Sequencing (WES) | Targeted Sequencing |
|---|---|---|---|
| Information Level | Everything | Genes | User Defined |
| Cost Per Sample | $5000 | $2000 | $200 |
| Low Frequency Mutation Detection | Not Possible | Not Likely | Yes |
| DNA Input Amount | 1 µg | 100 - 200 ng | 10 ng |
| # of Samples in Parallel | 1 | 2 | 96 |

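The cost and parallelism rows of the table above can be turned into a quick back-of-the-envelope comparison. The following is a minimal sketch, assuming the "# of Samples in Parallel" row can be read as the number of samples handled per batch; the dictionary layout and the helper name `plan_cohort` are illustrative assumptions, not part of the original chapter.

```python
import math

# Values transcribed from the comparison table; the structure and the
# helper name `plan_cohort` are illustrative assumptions.
METHODS = {
    "WGS": {"cost_per_sample_usd": 5000, "samples_in_parallel": 1},
    "WES": {"cost_per_sample_usd": 2000, "samples_in_parallel": 2},
    "Targeted": {"cost_per_sample_usd": 200, "samples_in_parallel": 96},
}


def plan_cohort(n_samples):
    """Return {method: (total_cost_usd, batches)} for a cohort of n_samples."""
    plan = {}
    for name, spec in METHODS.items():
        total_cost = n_samples * spec["cost_per_sample_usd"]
        # Round up: a partially filled batch still has to be run.
        batches = math.ceil(n_samples / spec["samples_in_parallel"])
        plan[name] = (total_cost, batches)
    return plan


if __name__ == "__main__":
    for method, (cost, batches) in plan_cohort(96).items():
        print(f"{method}: ${cost:,} across {batches} batch(es)")
```

For a 96-sample cohort, this puts targeted sequencing in a single batch, while WGS needs one batch per sample at 25 times the per-sample cost.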
- -The ability to detect low frequency mutations is specific to targeted sequencing. Unlike WGS and WES, targeted sequencing is relatively inexpensive, can be run in parallel on up to 96 samples, and doesn't require high levels of DNA input. - -## References -[1] “Biomarkers.” Canary Foundation, www.canaryfoundation.org/canary-science/science/biomarkers/.
- -[2] Calabuig-Fariñas, Silvia, and Carlos Camps. “Circulating Tumor Cells versus Circulating Tumor DNA in Lung Cancer-Which One Will Win?” Translational Lung Cancer Research, 5 Oct. 2016, tlcr.amegroups.com/article/view/10106/8669. - -[3] Forshew, Tim. “Noninvasive Identification and Monitoring of Cancer Mutations by Targeted Deep Sequencing of Plasma DNA.” Science Translational Medicine, American Association for the Advancement of Science, 30 May 2012, stm.sciencemag.org/content/4/136/136ra68.long. - -[4] Ilie, Marius, and Paul Hofman. “Pros: Can Tissue Biopsy Be Replaced by Liquid Biopsy?” Translational Lung Cancer Research, Aug. 2016, tlcr.amegroups.com/article/view/8950/8064. - -[5] Ma, Mingwei, and Gang Chen. “‘Liquid Biopsy’-CtDNA Detection with Great Potential and Challenges.” Annals of Translational Medicine, Sept. 2015, atm.amegroups.com/article/view/7851/8632. - - -[6] Newman, Aaron M, and Maximilian Diehn. “An Ultrasensitive Method for Quantitating Circulating Tumor DNA with Broad Patient Coverage.” Nature News, Nature Publishing Group, 6 Apr. 2014, www.nature.com/articles/nm.3519. - -[7] “Understanding Statistics Used to Guide Prognosis and Evaluate Treatment.” Cancer.Net, 11 Aug. 2018, www.cancer.net/navigating-cancer-care/cancer-basics/understanding-statistics-used-guide-prognosis-and-evaluate-treatment.
- - - - diff --git a/finalPaper/sequencingApplication/TAm-SeqMonitoring.jpg b/finalPaper/sequencingApplication/TAm-SeqMonitoring.jpg deleted file mode 100644 index ed8b8b2..0000000 Binary files a/finalPaper/sequencingApplication/TAm-SeqMonitoring.jpg and /dev/null differ diff --git a/image/1702354333272.png b/image/1702354333272.png new file mode 100644 index 0000000..1203862 Binary files /dev/null and b/image/1702354333272.png differ diff --git a/image/1702354442303.png b/image/1702354442303.png new file mode 100644 index 0000000..874b920 Binary files /dev/null and b/image/1702354442303.png differ diff --git a/image/1702356275622.png b/image/1702356275622.png new file mode 100644 index 0000000..ad65655 Binary files /dev/null and b/image/1702356275622.png differ diff --git a/image/1702368233069.png b/image/1702368233069.png new file mode 100644 index 0000000..ad65655 Binary files /dev/null and b/image/1702368233069.png differ diff --git a/image/SDS.png b/image/SDS.png new file mode 100644 index 0000000..ebba739 Binary files /dev/null and b/image/SDS.png differ diff --git a/image/demonstration.png b/image/demonstration.png new file mode 100644 index 0000000..33e2716 Binary files /dev/null and b/image/demonstration.png differ diff --git a/image/gel.png b/image/gel.png new file mode 100644 index 0000000..06601df Binary files /dev/null and b/image/gel.png differ diff --git a/image/gel1.png b/image/gel1.png new file mode 100644 index 0000000..958112d Binary files /dev/null and b/image/gel1.png differ diff --git a/image/pores.png b/image/pores.png new file mode 100644 index 0000000..460aaf9 Binary files /dev/null and b/image/pores.png differ diff --git a/markdown_tutorial.md b/markdown_tutorial.md deleted file mode 100644 index 9200b86..0000000 --- a/markdown_tutorial.md +++ /dev/null @@ -1,115 +0,0 @@ -# Markdown Tutorial - -Michael Wiest 2018 - ---- - -# Text weight - -# Really heavy text! 
-## Less heavy -### Even littler -#### I'm shrinking -##### Oh no -###### plz help -This is some normal text... - -This is a new paragraph -that continues in the same line despite the line break. - - ---- - - -# Text formatting - - **I'm bold!** - - *I'm italic!* - -### COLOR! COLOR! COLOR! COLOR! - - - ---- - -# Links -## [I'm a hyperlink](https://en.wikipedia.org/wiki/YOLO_(aphorism) -[*I'm a small italic link*](https://en.wikipedia.org/wiki/YOLO_(aphorism) - -___ - - -# Quoting - -"Normal Quotes" - -Block quotes! - ->Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris a ipsum nec justo dictum convallis. Integer luctus ultricies tortor sit amet vehicula. Nam tristique faucibus pharetra. Cras nec erat ligula. Proin at odio ex. Nulla ornare imperdiet nulla, sed posuere est imperdiet et. Curabitur egestas risus vitae quam pulvinar, in vulputate leo mollis. Fusce nec tincidunt libero. Suspendisse odio urna, pretium eget lectus id, accumsan bibendum ex. Aenean accumsan malesuada sem, at blandit dui euismod a. Sed sed eros porttitor, iaculis augue id, pulvinar mi. Ut in scelerisque libero, quis malesuada sem. Donec commodo diam quis massa condimentum, non consequat orci bibendum. Praesent auctor tellus egestas felis tincidunt, eget imperdiet lorem pharetra. - ---- - -# Pictures -Two ways to format this -![Puppy0](https://www.merriam-webster.com/assets/mw/images/article/art-wap-landing-mp-lg/puppy-3143-ad4140d8f6055cda2cd8956d4af37ea9@1x.jpg "A puppy!") - - -![Puppy1][puppy2] - -[puppy2]: https://i.ytimg.com/vi/AZ2ZPmEfjvU/maxresdefault.jpg "Another Puppy!!" - - -[![Puppy2](https://pbs.twimg.com/profile_images/446566229210181632/2IeTff-V_400x400.jpeg "This puppy is a link!")](nytimes.com) - ---- - -# Code -You can reference variables like `x` or `y` inline. 
-But you can also have code blocks like: -``` -import numpy as np -random = np.random.randn(10, 10) - -print('Hello World!') -``` -But you can even format the text for languages: -```python -import numpy as np -random = np.random.randn(10, 10) - -print('Hello World!') -``` -**Pretty!** - - ----- - -# Lists - -1. Item One -2. Another - - -* Unordered Item -* Another one! - ---- - -# Tables -Note the different alignments - -|Time of day| Favorite snack | Why? | -|-----------|:----------------:|----: | -|Morning | Oreos! | Delicious | -|Afternoon | Otter Pop | It's hot out baby | -|Night | IPA | I need it. | - -# Comments (look at the source code) - -[comment]: <> (This is a comment - everything in here won't be in the final doc) - - diff --git a/template.md b/template.md deleted file mode 100644 index c9db5b7..0000000 --- a/template.md +++ /dev/null @@ -1,151 +0,0 @@ -# 2.3 C-Techs (chromosome conformation capture)-coupled -1. [Introduction](#231) -2. [Overview of 3C methods](#232)
- 2.1. [Specificity](#2321)
- 2.2. [Through-put and resolution](#2322) -3. [Hi-C](#233) -4. [ChIA-PET](#234) -5. [Selected methods comparison](#235) - - - - -## 2.3.1 Introduction - -The fundamental objective of 3C (Chromosome Conformation Capture) techniques and 3C-derived methods is to understand the physical wiring diagram of the genome by identifying the physical interactions between chromosomes. - -To capture the interactions (crosslinks between strings), there are a few general steps: -- Take a snapshot of the flowing cells - **Crosslink** with a fixative agent (formaldehyde) -- Zoom in on the crosslinked parts and exclude untangled parts - **Digest** with a restriction enzyme -- Analyze the components that come from the same chromatin - **Reverse crosslink** and **sequence** -- Finish the jigsaw puzzle and get the results - **Align** the reads and **summarize** the contacts - -> Based on these general ideas, we'll dive deeper by walking through two of the most popular techniques and then briefly introduce some other methods. - -## 2.3.2 Overview of 3C methods - -![](/assets/1-s2.0-S1360138518300827-gr1b2_lrg.jpg) -[Figure1](https://doi.org/10.1016/j.tplants.2018.03.014). Schematic Representation of Chromosome Conformation Capture (3C) and 3C-Derived Methods. These methods help to elucidate nuclear organization by detecting physical interactions between genetic elements located throughout the genome. Abbreviations: IP, immunoprecipitation; RE, restriction enzyme. **Figure by Sotelo-Silveira, Mariana, et al. Trends in Plant Science (2018).** - -To better understand the differences between these methods, I'd like to distinguish them along the following two aspects: - -#### 1) Specificity - What does _one, all, many_ mean -‘1’, ‘Many’ and ‘All’ indicate how many loci are interrogated in a given experiment. For example, ‘1 versus All’ indicates that the experiment probes the interaction profile between 1 locus and all other potential loci in the genome. 
‘All versus All’ means that one can detect the interaction profiles of all loci, genome-wide, and their interactions with all other genomic loci [1]. - -This kind of specificity is determined by the **specific primers** used before PCR. - -#### 2) Through-put and resolution -Hi-C techniques have the highest through-put (billions of reads per sample) but suffer from a relatively low resolution of 0.1-1 Mb. The other methods usually have a higher resolution, around 1 kb. For more details, refer to Table 2 in [2]. - -## 2.3.3 Hi-C -Hi-C is the highest through-put version of the 3C-derived technologies. Due to the decreasing cost of 2nd generation sequencing, Hi-C is widely used. - -The principle of Hi-C can be illustrated as: -![](/assets/hic.gif) - - -##### Hi-C critical steps [8] -- Fixation: keep DNA conformed -- Digestion: enzyme frequency and penetration -- Fill-in: biotin for junction enrichment -- Ligation: freeze interactions in sequence -- Biotin removal: junctions only -- Fragment size: small fragments sequence better -- Adapter ligation: paired-end and indexing -- PCR: create enough material for flow cell - -##### Hi-C derived techniques -- Hi-C original: [Lieberman-Aiden et al., Science 2009](https://doi.org/10.1126/science.1181369) -- Hi-C 1.0: [Belton-JM et al., Methods 2012](https://doi.org/10.1016/j.ymeth.2012.05.001) -- In situ Hi-C: [Rao et al., Cell 2014](https://doi.org/10.1016/j.cell.2014.11.021) -- Single cell Hi-C: [Nagano et al., Genome Biology 2015](https://doi.org/10.1186/s13059-015-0753-7) -- DNase Hi-C: [Ma, Wenxiu, et al., Methods](https://www.ncbi.nlm.nih.gov/pubmed/25437436) -- Hi-C 2.0: [Belaghzal et al., Methods 2017](https://www.ncbi.nlm.nih.gov/pubmed/28435001) -- DLO-Hi-C: [Lin et al., Nature Genetics 2018](https://doi.org/10.1038/s41588-018-0111-2) -- Hi-C improving: [Golloshi et al., Methods 2018](https://www.biorxiv.org/content/biorxiv/early/2018/02/13/264515.full.pdf) -- Arima 1-day Hi-C: [Ghurye et al., BioRxiv 2018](https://www.biorxiv.org/content/early/2018/02/07/261149)
 - -## 2.3.4 ChIA-PET -ChIA-PET is another method that combines ChIP and paired-end sequencing to analyze chromatin interactions. It targets interactions mediated by specific binding factors such as estrogen receptor alpha, CTCF (CTCF-mediated loops), RNA polymerase II, or a combination of key architectural factors. On the one hand, it achieves a higher resolution than Hi-C, as only ligation products involving the immunoprecipitated molecule are sequenced. On the other hand, ChIA-PET has systematic biases due to the ChIP process: -- Only one type of binding factor selected -- Different antibodies -- ChIP conditions - - -## 2.3.5 Selected methods comparison

| Method | Targets | Resolution | Notes |
|---|---|---|---|
| 3C [3] | one-vs-one | ~1–10 kb | • Sequence of bait locus must be known<br>• Easy data analysis<br>• Low throughput |
| 4C [4] | one-vs-all | ~2 kb | • Sequence of bait locus must be known<br>• Detects novel contacts<br>• Long-range contacts |
| 5C [5] | many-vs-many | ~1 kb | • High dynamic range<br>• Complete contact map of a locus<br>• 3C with ligation-mediated amplification (LMA) of a ‘carbon copy’ library of oligos designed across restriction fragment junctions of interest |
| Hi-C [6] | all-vs-all | 0.1–1 Mb | • Genome-wide nucleosome core positioning<br>• Relatively low resolution<br>• High cost |
| ChIA-PET [7] | Interaction of whole genome mediated by protein | Depends on read depth and the size of the genome region bound by the protein of interest | • Lower noise with ChIP<br>• Biased toward the selected protein |

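One way to read the Resolution column is as the bin width used when the aligned read pairs are summarized into a contact map: wider bins pool more reads per cell but blur nearby loci together. The following is a minimal sketch under stated assumptions, namely that read pairs are already aligned and available as coordinate tuples on a single chromosome. The function name `bin_contacts`, the input format, and the toy coordinates are hypothetical and not taken from any Hi-C toolkit.

```python
from collections import Counter


def bin_contacts(read_pairs, bin_size):
    """Count contacts between fixed-size genomic bins on one chromosome.

    read_pairs: iterable of (pos_a, pos_b) base-pair coordinates, one tuple
        per ligation junction (hypothetical input format).
    bin_size: bin width in base pairs, i.e. the resolution of the map.
    """
    counts = Counter()
    for pos_a, pos_b in read_pairs:
        bin_a, bin_b = pos_a // bin_size, pos_b // bin_size
        # Store the smaller bin first so (i, j) and (j, i) count as one contact.
        counts[(min(bin_a, bin_b), max(bin_a, bin_b))] += 1
    return counts


# Toy coordinates, invented for illustration only.
pairs = [(120_000, 1_450_000), (180_000, 1_390_000), (2_100_000, 150_000)]

coarse = bin_contacts(pairs, bin_size=1_000_000)  # 1 Mb bins
fine = bin_contacts(pairs, bin_size=100_000)      # 0.1 Mb bins
```

With 1 Mb bins the first two toy pairs fall into the same cell of the map, while 0.1 Mb bins keep them apart, which is why finer maps need more reads to fill each cell.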
- -# References -[1] Schmitt, Anthony D., Ming Hu, and Bing Ren. "Genome-wide mapping and analysis of chromosome architecture." Nature reviews Molecular cell biology 17.12 (2016): 743.
- -[2] Risca, Viviana I., and William J. Greenleaf. "Unraveling the 3D genome: genomics tools for multiscale exploration." Trends in Genetics 31.7 (2015): 357-372.
- -[3] Dekker J, Rippe K, Dekker M, Kleckner N. Capturing chromosome conformation. Science 2002;295(5558):1306–11.
- -[4] Simonis M, Klous P, Homminga I, Galjaard RJ, Rijkers EJ, Grosveld F, et al. High-resolution identification of balanced and complex chromosomal rearrangements by 4C technology. Nature Methods 2009;6(11):837–42.
- -[5] Dostie J, Richmond TA, Arnaout RA, Selzer RR, Lee WL, Honan TA, et al. Chromosome Conformation Capture Carbon Copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome Res 2006;16(10):1299–309.
- -[6] Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 2009;326(5950):289–93.
- -[7] Fullwood, M.J. et al. (2009) An oestrogen-receptor-alpha-bound human chromatin interactome. Nature 462, 58–64.
- -[8] https://github.com/hms-dbmi/hic-data-analysis-bootcamp/blob/master/HiC-Protocol.pptx. - - diff --git a/template_description.txt b/template_description.txt deleted file mode 100644 index c770589..0000000 --- a/template_description.txt +++ /dev/null @@ -1,3 +0,0 @@ -Basic Markdown syntax will help you understand how to organise the formatting (google it!). -Besides, GitHub can also render HTML syntax (in template.md, the table and anchor settings are all HTML syntax; you're welcome to use them for better presentation :D) -