From 1e8c08105aeb1d6d8259c927deacc0c087abeab8 Mon Sep 17 00:00:00 2001 From: hussein Date: Tue, 18 Oct 2022 13:56:12 +0200 Subject: [PATCH 1/3] ct changes(didn't add tut3) --- .../tutorial1-checkpoint.ipynb | 145 +++++++++--------- .../tutorial2-checkpoint.ipynb | 59 ++++--- rooibos/tutorial1.ipynb | 145 +++++++++--------- rooibos/tutorial2.ipynb | 59 ++++--- rooibos/tutorial3.ipynb | 6 +- 5 files changed, 229 insertions(+), 185 deletions(-) diff --git a/rooibos/.ipynb_checkpoints/tutorial1-checkpoint.ipynb b/rooibos/.ipynb_checkpoints/tutorial1-checkpoint.ipynb index a44b3e7..e342b39 100644 --- a/rooibos/.ipynb_checkpoints/tutorial1-checkpoint.ipynb +++ b/rooibos/.ipynb_checkpoints/tutorial1-checkpoint.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "markdown", - "id": "b9040fa0", + "id": "cdb7cf70", "metadata": {}, "source": [ "# Tutorial 1: Data visualization" @@ -10,7 +10,7 @@ }, { "cell_type": "markdown", - "id": "c534e59a-e519-4068-93e4-9d8456734f9e", + "id": "78a5f5ba", "metadata": {}, "source": [ "---" @@ -18,7 +18,7 @@ }, { "cell_type": "markdown", - "id": "ba277ee6-66ea-4a1c-8f57-45f216e50f4c", + "id": "9f759677", "metadata": {}, "source": [ "## Introduction" @@ -26,26 +26,26 @@ }, { "cell_type": "markdown", - "id": "bffd6602", + "id": "7e228ef6", "metadata": { "tags": [] }, "source": [ - "Welcome!, this tutorial will show you how to visualise biochemical assay data from rooibos tea samples using python. From this tutorial you will learn:\n", + "Welcome! This tutorial will show you how to visualise biochemical assay data from rooibos tea samples using Python. From this tutorial you will learn:\n", "\n", " - how to read data into python from an Excel file\n", " - how to use dataframes (pandas package)\n", " - how to visualise and compare biochemical properties of fermented and unfermented rooibos teas using histograms\n", " - how to use Google to overcome programming challenges\n", "\n", - "Let's get started! 
First let's import the python packages we'll need to load and visualize our data.\n", + "Let's get started! First let's import the Python packages we'll need to load and visualize our data.\n", "\n", "_Note_ that you will need to run all the cells in the notebook in order for it to work properly. The best way to do this is run them one by one. Try to understand what each cell is doing when you run it. In some cells, you will have to write or modify code--just follow the instructions. " ] }, { "cell_type": "markdown", - "id": "88d4f7dd-21ca-4e7a-a616-8d59e5e6a43e", + "id": "f17fdf07", "metadata": {}, "source": [ "---" @@ -53,7 +53,7 @@ }, { "cell_type": "markdown", - "id": "6a04601b-e6c1-4eb9-812e-58b57c390035", + "id": "ffc6c25b", "metadata": {}, "source": [ "First we import some libraries:" @@ -62,7 +62,7 @@ { "cell_type": "code", "execution_count": null, - "id": "d438ecdc", + "id": "2fb1973a", "metadata": {}, "outputs": [], "source": [ @@ -75,7 +75,7 @@ }, { "cell_type": "markdown", - "id": "0c62182d-3c70-4019-873b-b330b0009098", + "id": "7357715c", "metadata": {}, "source": [ "The above statements define the prefixes 'pd' and 'sns' which will be used to identify pandas and seaborn functions respectively in the following code." 
@@ -83,7 +83,7 @@ }, { "cell_type": "markdown", - "id": "465ecdc5", + "id": "dfa02a76", "metadata": {}, "source": [ " Reading in data \n", @@ -96,7 +96,7 @@ { "cell_type": "code", "execution_count": null, - "id": "c3e3fdcd", + "id": "cb22e9c1", "metadata": {}, "outputs": [], "source": [ @@ -113,7 +113,7 @@ }, { "cell_type": "markdown", - "id": "996ea564", + "id": "c8de2dbb", "metadata": {}, "source": [ " Examining data \n", @@ -124,7 +124,7 @@ { "cell_type": "code", "execution_count": null, - "id": "3b8adaf7", + "id": "26867fc5", "metadata": {}, "outputs": [], "source": [ @@ -134,7 +134,7 @@ }, { "cell_type": "markdown", - "id": "61a7ec9f-58cf-404d-8de0-5dca237ba64d", + "id": "4f2c0fce", "metadata": {}, "source": [ "---\n", @@ -147,7 +147,7 @@ { "cell_type": "code", "execution_count": null, - "id": "45b0a3d3", + "id": "bb0bedc9", "metadata": {}, "outputs": [], "source": [ @@ -156,23 +156,27 @@ }, { "cell_type": "markdown", - "id": "4fc60d44", + "id": "b4d03442", "metadata": {}, "source": [ "In both dataframes, the rows correspond to different tea samples, while the columns give the values of 8 variables, which can be explained as follows:\n", - "- type -- categorical variable denoting one of two types: nonfermented(0) or fermented(1)\n", - "- F-H2O -- continuous variable: F stands for phenolics and H2O stands for water extract. which means the phenolic content was extracted using water as solvent\n", - "- A1-H2O -- continuous variable: A1 is a symbol for TEAC which is measurement of antioxidant activity, H2O is the solvent as above \n", - "- A2-H2O -- continuous variable: A2 is a symbol for FRAP which is a measurement of antioxidant activity, H2O is the solvent again \n", + "- type -- categorical variable denoting one of two types of tea: nonfermented(0) or fermented(1)\n", + "- F-H2O -- continuous variable: F stands for phenolics and H2O stands for water extract. This column gives the phenolic content that was extracted using water as solvent. 
\n", + "- A1-H2O -- continuous variable: A1 represents \"Trolox equivalent antioxidant capacity\" (TEAC), which is a measurement of antioxidant activity. H2O is the solvent used for extraction. \n", + "- A2-H2O -- continuous variable: A2 represents \"Ferric Reducing Antioxidant Power Assay\" (FRAP), which is a different measurement of antioxidant activity. As before, H2O is the solvent. \n", "- F-MEOH -- continuous variable: F stands for phenolics (as above), this time extracted using methanol (MEOH) as solvent instead of water\n", "- A1-MEOH -- continuous variable: as above A1 represents TEAC with MEOH as solvent\n", "- A2-MEOH -- continuous variable: A2 is for FRAP and MEOH is the solvent. \n", - "- cut -- catagorical variable, indicating the cut of the rooibos (not of interest in this study)" + "- cut -- categorical variable, indicating the cut of the rooibos (not of interest in this study)\n", + "\n", + "_Note_: Antioxidants have various health benefits (you may Google \"antioxidant health benefits\"). So the antioxidant content of different tea varieties is of interest both to consumers and to rooibos producers. Phenolics are one particular type of antioxidant of special interest (you may Google \"phenolics health benefits\").\n", + "\n", + "In this study, we use these different antioxidant measurements to attempt to identify whether a rooibos sample is fermented or nonfermented. This can help us better understand the relationship between fermentation and antioxidant content: for instance, does fermentation tend to increase or decrease antioxidant content?" ] }, { "cell_type": "markdown", - "id": "86a08d92", + "id": "022f7ab8", "metadata": {}, "source": [ "Now let's verify the number of samples in each dataset. 
We do this using the 'shape' attribute for data frames:" @@ -181,7 +185,7 @@ { "cell_type": "code", "execution_count": null, - "id": "59c285dc", + "id": "f5dd5405", "metadata": {}, "outputs": [], "source": [ @@ -199,7 +203,7 @@ }, { "cell_type": "markdown", - "id": "9c9b562e-4c42-4eb4-a19d-b79aeba7c67b", + "id": "e5d7393b", "metadata": {}, "source": [ "---\n", @@ -211,7 +215,7 @@ { "cell_type": "code", "execution_count": null, - "id": "760ef925", + "id": "fc1e4d09", "metadata": {}, "outputs": [], "source": [ @@ -222,15 +226,18 @@ }, { "cell_type": "markdown", - "id": "e0d5d086", + "id": "0e6ca6e2", "metadata": {}, "source": [ + "Just so you can see what we're studying, here's a picture of samples of unfermented and fermented rooibos. See if you can guess which is which.\n", + "
\n", + "
\n", "\n" ] }, { "cell_type": "markdown", - "id": "2d6dd1b6-a5f5-459e-9476-1514c7115044", + "id": "ab41dbfa", "metadata": {}, "source": [ "---" @@ -238,17 +245,17 @@ }, { "cell_type": "markdown", - "id": "982f9d78", + "id": "29878c20", "metadata": {}, "source": [ "Renaming variables: \n", "\n", - "The variable names are rather obscure. Let's change them to improve readability. Unfortunately I don't remember how to do this--but all is not lost. We have at our disposal one of the main keys to python programming success: Google! " + "The variable names are not very descriptive. Let's change them to improve readability. Unfortunately I don't remember how to do this--but all is not lost. We have at our disposal one of the main keys to python programming success: Google! " ] }, { "cell_type": "markdown", - "id": "fa35a1e5", + "id": "f52f328c", "metadata": {}, "source": [ "\n" @@ -256,7 +263,7 @@ }, { "cell_type": "markdown", - "id": "c0fc6a29", + "id": "eb9decb4", "metadata": {}, "source": [ "Just search for `change columns names pandas`. You will soon learn how to recognize good websites that will provide working code that you can copy, paste, and modify. 
\n", @@ -271,7 +278,7 @@ { "cell_type": "code", "execution_count": null, - "id": "fa2ccec8", + "id": "9e710258", "metadata": {}, "outputs": [], "source": [ @@ -289,7 +296,7 @@ { "cell_type": "code", "execution_count": null, - "id": "926ee851", + "id": "6ede7f3c", "metadata": {}, "outputs": [], "source": [ @@ -301,7 +308,7 @@ }, { "cell_type": "markdown", - "id": "4ee611d8", + "id": "91c4a73e", "metadata": {}, "source": [ "Let's check and see if the renaming worked as we expected:" @@ -310,7 +317,7 @@ { "cell_type": "code", "execution_count": null, - "id": "37b8f240", + "id": "109bf195", "metadata": {}, "outputs": [], "source": [ @@ -324,7 +331,7 @@ }, { "cell_type": "markdown", - "id": "d7ba2371-e200-4034-9dd6-aa51efa1c9bf", + "id": "00127d43", "metadata": {}, "source": [ "---" @@ -332,7 +339,7 @@ }, { "cell_type": "markdown", - "id": "f9059c6d", + "id": "6ead091f", "metadata": {}, "source": [ "Data concatenation into a single frame: \n", @@ -343,7 +350,7 @@ { "cell_type": "code", "execution_count": null, - "id": "44996bd1", + "id": "e2d80f7a", "metadata": {}, "outputs": [], "source": [ @@ -356,7 +363,7 @@ }, { "cell_type": "markdown", - "id": "dfc0efb9-9953-4d61-bd15-6130fe2760c2", + "id": "f4b90726", "metadata": {}, "source": [ "---\n", @@ -368,7 +375,7 @@ { "cell_type": "code", "execution_count": null, - "id": "b65e70da", + "id": "1108e4ec", "metadata": {}, "outputs": [], "source": [ @@ -377,7 +384,7 @@ }, { "cell_type": "markdown", - "id": "b13c569b-f7cc-4aa8-b54a-f7e430ac6c2f", + "id": "5993750e", "metadata": {}, "source": [ "---" @@ -385,7 +392,7 @@ }, { "cell_type": "markdown", - "id": "1a63c260", + "id": "f5f29970", "metadata": {}, "source": [ "Histograms: \n", @@ -398,7 +405,7 @@ { "cell_type": "code", "execution_count": null, - "id": "58bb9759", + "id": "0dea4f38", "metadata": {}, "outputs": [], "source": [ @@ -409,18 +416,7 @@ }, { "cell_type": "markdown", - "id": "553d29f3", - "metadata": {}, - "source": [ - "From the histograms we may draw the 
following conclusions:\n", - "\n", - " - nonfermented (blue) is somewhat less left skewed. The data appears to have two peaks (this is called \"bimodal\", but with more data it's quite likely that this effect would disappear.\n", - " - fermented (orange) is clearly left skewed.\n" - ] - }, - { - "cell_type": "markdown", - "id": "7aabbce8", + "id": "4f6d43ce", "metadata": {}, "source": [ "The histogram options in the previous code can be explained as follows:\n", @@ -436,7 +432,18 @@ }, { "cell_type": "markdown", - "id": "f2ac2f7b", + "id": "bd328f26", + "metadata": {}, + "source": [ + "**Exercise From the histograms we may draw the following conclusions:\n", + "\n", + " - nonfermented (blue) has the higher mean and the smaller variance. It is also somewhat less left skewed. The data appears to have two peaks (this is called \"bimodal\"), but with more data it's quite likely that this effect would disappear.\n", + " - fermented (orange) is clearly left skewed.\n" + ] + }, + { + "cell_type": "markdown", + "id": "9416e16b", "metadata": {}, "source": [ "We can do multiple plots from the same cell if we use the 'show() command between plots. Otherwise all the data will be put on the same plot." @@ -445,7 +452,7 @@ { "cell_type": "code", "execution_count": null, - "id": "5007cd1b", + "id": "ecac5b13", "metadata": {}, "outputs": [], "source": [ @@ -458,7 +465,7 @@ }, { "cell_type": "markdown", - "id": "5511122b", + "id": "ec643423", "metadata": {}, "source": [ " - fermented (orange) TPC_MEOH has a much smaller variance than nonfermented (blue), and is also slightly skewed left.\n", @@ -467,19 +474,19 @@ }, { "cell_type": "markdown", - "id": "097edfd3-3457-4f2c-b251-8ca016def4d8", + "id": "0ecb7d1f", "metadata": {}, "source": [ "---\n", - "**Exercise 3:** Do the plots for TEAC, TPC, and FRAP for water solvent.\n", + "**Exercise 4:** Do the plots for TEAC, TPC, and FRAP for water solvent.\n", "
\n", - "##### **hint**: Remember to use show() to separate the plots." + "##### **Hint**: Remember to use show() to separate the plots." ] }, { "cell_type": "code", "execution_count": null, - "id": "3a4c14dc", + "id": "a855bb8a", "metadata": {}, "outputs": [], "source": [ @@ -492,7 +499,7 @@ }, { "cell_type": "markdown", - "id": "186cb1a1-e985-4b40-983b-d66fb4dd9021", + "id": "8a42d277", "metadata": {}, "source": [ "---" @@ -500,7 +507,7 @@ }, { "cell_type": "markdown", - "id": "1dae234a", + "id": "6e294756", "metadata": {}, "source": [ " Saving data for later use \n", @@ -511,7 +518,7 @@ { "cell_type": "code", "execution_count": null, - "id": "906e684c", + "id": "07f77364", "metadata": {}, "outputs": [], "source": [ @@ -522,7 +529,7 @@ }, { "cell_type": "markdown", - "id": "268bd8eb", + "id": "0369e402", "metadata": {}, "source": [ "Congratulations! You've finished your basic exploration of the data. In the next notebook we'll go on to more descriptive visualizations." @@ -531,9 +538,9 @@ ], "metadata": { "kernelspec": { - "display_name": "rooibos_hack", + "display_name": "teaClass_ker", "language": "python", - "name": "rooibos_hack" + "name": "teaclass_ker" }, "language_info": { "codemirror_mode": { diff --git a/rooibos/.ipynb_checkpoints/tutorial2-checkpoint.ipynb b/rooibos/.ipynb_checkpoints/tutorial2-checkpoint.ipynb index aca3822..42bce44 100644 --- a/rooibos/.ipynb_checkpoints/tutorial2-checkpoint.ipynb +++ b/rooibos/.ipynb_checkpoints/tutorial2-checkpoint.ipynb @@ -29,7 +29,7 @@ "id": "d9096c03", "metadata": {}, "source": [ - "This tutorial will show you some correlation analysis betweeen the features of the same solution (water/methanol)" + "This tutorial shows how to do a correlation analysis in Python. In particular, we will investigate correlations between the 3 features (measurements) made with water as solvent and similarly between the features with methanol as solvent." 
] }, { @@ -37,10 +37,12 @@ "id": "3993aed0", "metadata": {}, "source": [ - "Correlation analysis:\n", - "- A statistical method which is often used to determine the linear relationship between two features, and compute their dependency.\n", - "- a dependency between two features can be positive, negative, or zero (uncorrelated).\n", - "- the strength and direction of the correlation/dependecy can be summarized using the \"correlation cooficient\"" + "What is correlation analysis?\n", + "\n", + "- A statistical method which is often used to determine the degree of linear dependency between pairs of variables. \n", + "- The dependency is expressed using a single number (the _correlation coefficient_) which is between -1 and 1. We can say that two variables are negatively correlated, positively correlated, or uncorrelated according to whether the correlation coefficient is negative, positive, or 0.\n", + "- Two variables are strongly correlated if a scatter plot of the variables has points which lie nearly along a line. On the other hand, variables are uncorrelated if the scatter plot shows a \"cloud\" of points that has no overall slant.\n", + "- (You may Google \"correlation scatter plot\" to see examples of scatter plots of variable pairs with different degrees of correlation)\n" ] }, { @@ -48,7 +50,7 @@ "id": "fdab5c88", "metadata": {}, "source": [ - "### Now let us start" + "### Let's begin!" ] }, { @@ -56,7 +58,7 @@ "id": "ff304bfb", "metadata": {}, "source": [ - "let us retrieve the data from the previous tutorial" + "First, retrieve the data from the previous tutorial." ] }, { @@ -82,7 +84,7 @@ "---\n", "**Exercise 1:** Verify the data in the above data frames\n", "
\n", - "##### **hint**: Remember the 'head' command" + "_(Hint: Remember the 'head' command)_" ] }, { @@ -100,7 +102,7 @@ "id": "48049067", "metadata": {}, "source": [ - "For more detailed data analysis, we will need the 'numpy' package, plus a specialized package that can draw confidence ellipses (explained below). " + "Since we will be doing more detailed data analysis, we will use the 'numpy' package ('numpy' stands for 'Numerical Python'). We will also use custom code that draws _confidence ellipses_ (which we will explain below). The custom code may be found in the `source` directory. " ] }, { @@ -112,7 +114,8 @@ "source": [ "# ___Cell no. 2___\n", "import numpy as np # 'np' is the prefix that will identify nump packages\n", - "from source.ellipses import draw_confidence_ellipse # for representing the correlation" + "from source.ellipses import draw_confidence_ellipse # for representing the correlation (for\n", + "# code see 'source' directory)" ] }, { @@ -120,7 +123,11 @@ "id": "f547df38", "metadata": {}, "source": [ - "We will represent the data using a scatterplot, and will superimpose confidence ellipses to bring out the general orientation and extent of the data. A confidence ellipse shows where the data is most heavily concentrated (i.e. where the probability density is highest). Confidence regions are used for predicting new observations with a certain degree of confidence, which depends on the confidence parameter (measured in standard deviations) used to generate the ellipse. " + "In our case, we want to compare bivariate distributions for two different datasets: \"bivariate\" refers to the fact that we are looking at the joint distributions for two different features. \n", + "\n", + "To make an effective comparison, first we make 2-d scatterplots for the two datasets on the same axes. The two axes correspond to the two features being represented. 
The scatterplot looks like a \"cloud\" of points, where each point corresponds to one tea sample: the $x$ and $y$ coordinates of a point are given by the values of the two features for that particular sample. \n", + "\n", + "In order to characterize the overall distribution, confidence ellipses are superimposed on the scatterplots for each dataset. A confidence ellipse shows where the data is most heavily concentrated (i.e. where the probability density is highest). Confidence regions are used for predicting new observations with a certain degree of confidence, which depends on the confidence parameter (measured in standard deviations) used to generate the ellipse. When the confidence parameter is 2 and the data is roughly bivariate normal, about 86 percent of the data lies within the confidence ellipse (the familiar 95 percent figure applies to a single variable, not to a pair)." ] }, { @@ -128,12 +135,12 @@ "id": "a3c0a0fa", "metadata": {}, "source": [ - "The syntax for the `draw_Confidence_ellipse` command is as follows:\n", + "The syntax for the `draw_confidence_ellipse` command is as follows:\n", "\n", - " draw_Confidence_ellipse (data1_x, data1_y, data2_x, data2_y, \n", - " \"y-axis label\", \"x-axis label\", \"title\", x-scale, y-scale)\"\n", + " draw_confidence_ellipse (data1_x, data1_y, data2_x, data2_y, confidence_parameter,\n", + " \"x-axis label\", \"y-axis label\", \"title\", x-scale, y-scale)\n", " \n", - "Notice that we can continue Python commands on multiple lines, as long as we make break the statement in such a way that the Python compiler can see that the command is not yet finished (e.g. by breaking the statement after an open parenthesis or comma).\n", + "Notice that we can continue Python commands on multiple lines, as long as we break the statement in such a way that the Python interpreter can see that the command is not yet finished. A good way to do this is to make the break after an open parenthesis or bracket, or after a comma that separates items in a list.\n", "\n", "Let's give this a try!" 
] @@ -149,7 +156,7 @@ "\n", "draw_confidence_ellipse ( \n", " df_fer[['TPC_MEOH']], df_fer[['TEAC_MEOH']], \n", - " df_nf[['TPC_MEOH']], df_nf[['TEAC_MEOH']], \n", + " df_nf[['TPC_MEOH']], df_nf[['TEAC_MEOH']], 2, \n", " \"TPC(GAE/g)\", \"TEAC(TE/g)\", \n", " \"TEAC versus TP for $MeOH$ extracted samples\",\n", " [100, 550], [1000,5550] )\n" @@ -160,7 +167,15 @@ "id": "9f07a02a", "metadata": {}, "source": [ - "The results show there is a statistically significant positive correlation between TPC and TEAC for fermented (blue). This is reflected in the tilted orientation of the ellipse. On the other hand, the correlation between TPC and TEAC for unfermented is not statistically signficant (p > 0.05). " + "The function draw_confidence_ellipse also gives the estimated correlation coefficient between the two variables; error denotes the uncertainty in the correlation coefficient; and p denotes the p-value for the null hypothesis that the correlation coefficient is 0 (typically the null hypothesis is denoted as $H_0$).\n", + "\n", + "The p value has the following meaning (one must be very careful about this, because the p value is often misunderstood). Suppose the estimated correlation coefficient is C. In this case, the p value is the conditional probability, given that the true correlation coefficient is 0, that the measured correlation coefficient will have absolute value greater than or equal to $|C|$. In other words, the p value is the probability, given that $H_0$ is true, that a measurement that is \"at least as extreme\" as C is obtained. So it is not true that \"the p value is the probability that $H_0$ is false\", because it is calculated under the assumption that $H_0$ is true! Instead, the p value expresses a likelihood. For example, suppose my friend flips a coin 20 times and gets 20 heads. If the coin is actually fair, the probability this would happen is less than 0.00001. So it is likely that the coin is not fair (e.g. maybe it has 'head' on both sides). 
But it is not correct to say that the probability that the coin is fair is 0.00001.\n", + "\n", + "If the p value is below a certain level, then we reject the null hypothesis. The level of rejection is called the significance level. What significance level you use depends on the application. In many cases, a significance level of 0.01 is used.\n", + "\n", + "The results show there is a statistically significant positive correlation between TPC and TEAC for fermented (blue). This is reflected in the tilted orientation of the ellipse. On the other hand, the correlation between TPC and TEAC for unfermented is not statistically significant (p > 0.01). \n", + "\n", + "The graph also shows a large overlap between the fermented and nonfermented data. Both confidence ellipses contain many points with $225 < TPC < 300$ and $2000 < TEAC < 2500$. Because of the correlation, fermented data with lower values of TPC also tend to have lower values of TEAC. The relative sizes of the ellipses show that nonfermented data is more spread out, meaning a wider range of values is observed (particularly with TPC). " ] }, { @@ -200,7 +215,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": null, "id": "923db9af-983c-4476-b162-d9816d9c6806", "metadata": {}, "outputs": [], @@ -213,7 +228,7 @@ "id": "ae1025c7", "metadata": {}, "source": [ - "Evidently the y-scale is off, so we will need to change the scale. 
To do this, we find the min and max values for FRAP" ] }, { @@ -283,9 +298,9 @@ ], "metadata": { "kernelspec": { - "display_name": "rooibos_hack", + "display_name": "Python 3", "language": "python", - "name": "rooibos_hack" + "name": "python3" }, "language_info": { "codemirror_mode": { @@ -297,7 +312,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.10" + "version": "3.8.3" } }, "nbformat": 4, diff --git a/rooibos/tutorial1.ipynb b/rooibos/tutorial1.ipynb index a44b3e7..e342b39 100644 --- a/rooibos/tutorial1.ipynb +++ b/rooibos/tutorial1.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "markdown", - "id": "b9040fa0", + "id": "cdb7cf70", "metadata": {}, "source": [ "# Tutorial 1: Data visualization" @@ -10,7 +10,7 @@ }, { "cell_type": "markdown", - "id": "c534e59a-e519-4068-93e4-9d8456734f9e", + "id": "78a5f5ba", "metadata": {}, "source": [ "---" @@ -18,7 +18,7 @@ }, { "cell_type": "markdown", - "id": "ba277ee6-66ea-4a1c-8f57-45f216e50f4c", + "id": "9f759677", "metadata": {}, "source": [ "## Introduction" @@ -26,26 +26,26 @@ }, { "cell_type": "markdown", - "id": "bffd6602", + "id": "7e228ef6", "metadata": { "tags": [] }, "source": [ - "Welcome!, this tutorial will show you how to visualise biochemical assay data from rooibos tea samples using python. From this tutorial you will learn:\n", + "Welcome! This tutorial will show you how to visualise biochemical assay data from rooibos tea samples using Python. From this tutorial you will learn:\n", "\n", " - how to read data into python from an Excel file\n", " - how to use dataframes (pandas package)\n", " - how to visualise and compare biochemical properties of fermented and unfermented rooibos teas using histograms\n", " - how to use Google to overcome programming challenges\n", "\n", - "Let's get started! First let's import the python packages we'll need to load and visualize our data.\n", + "Let's get started! 
First let's import the Python packages we'll need to load and visualize our data.\n", "\n", "_Note_ that you will need to run all the cells in the notebook in order for it to work properly. The best way to do this is run them one by one. Try to understand what each cell is doing when you run it. In some cells, you will have to write or modify code--just follow the instructions. " ] }, { "cell_type": "markdown", - "id": "88d4f7dd-21ca-4e7a-a616-8d59e5e6a43e", + "id": "f17fdf07", "metadata": {}, "source": [ "---" @@ -53,7 +53,7 @@ }, { "cell_type": "markdown", - "id": "6a04601b-e6c1-4eb9-812e-58b57c390035", + "id": "ffc6c25b", "metadata": {}, "source": [ "First we import some libraries:" @@ -62,7 +62,7 @@ { "cell_type": "code", "execution_count": null, - "id": "d438ecdc", + "id": "2fb1973a", "metadata": {}, "outputs": [], "source": [ @@ -75,7 +75,7 @@ }, { "cell_type": "markdown", - "id": "0c62182d-3c70-4019-873b-b330b0009098", + "id": "7357715c", "metadata": {}, "source": [ "The above statements define the prefixes 'pd' and 'sns' which will be used to identify pandas and seaborn functions respectively in the following code." 
@@ -83,7 +83,7 @@ }, { "cell_type": "markdown", - "id": "465ecdc5", + "id": "dfa02a76", "metadata": {}, "source": [ " Reading in data \n", @@ -96,7 +96,7 @@ { "cell_type": "code", "execution_count": null, - "id": "c3e3fdcd", + "id": "cb22e9c1", "metadata": {}, "outputs": [], "source": [ @@ -113,7 +113,7 @@ }, { "cell_type": "markdown", - "id": "996ea564", + "id": "c8de2dbb", "metadata": {}, "source": [ " Examining data \n", @@ -124,7 +124,7 @@ { "cell_type": "code", "execution_count": null, - "id": "3b8adaf7", + "id": "26867fc5", "metadata": {}, "outputs": [], "source": [ @@ -134,7 +134,7 @@ }, { "cell_type": "markdown", - "id": "61a7ec9f-58cf-404d-8de0-5dca237ba64d", + "id": "4f2c0fce", "metadata": {}, "source": [ "---\n", @@ -147,7 +147,7 @@ { "cell_type": "code", "execution_count": null, - "id": "45b0a3d3", + "id": "bb0bedc9", "metadata": {}, "outputs": [], "source": [ @@ -156,23 +156,27 @@ }, { "cell_type": "markdown", - "id": "4fc60d44", + "id": "b4d03442", "metadata": {}, "source": [ "In both dataframes, the rows correspond to different tea samples, while the columns give the values of 8 variables, which can be explained as follows:\n", - "- type -- categorical variable denoting one of two types: nonfermented(0) or fermented(1)\n", - "- F-H2O -- continuous variable: F stands for phenolics and H2O stands for water extract. which means the phenolic content was extracted using water as solvent\n", - "- A1-H2O -- continuous variable: A1 is a symbol for TEAC which is measurement of antioxidant activity, H2O is the solvent as above \n", - "- A2-H2O -- continuous variable: A2 is a symbol for FRAP which is a measurement of antioxidant activity, H2O is the solvent again \n", + "- type -- categorical variable denoting one of two types of tea: nonfermented(0) or fermented(1)\n", + "- F-H2O -- continuous variable: F stands for phenolics and H2O stands for water extract. This column gives the phenolic content that was extracted using water as solvent. 
\n", + "- A1-H2O -- continuous variable: A1 represents \"Trolox equivalent antioxidant capacity\" (TEAC), which is a measurement of antioxidant activity. H2O is the solvent used for extraction. \n", + "- A2-H2O -- continuous variable: A2 represents \"Ferric Reducing Antioxidant Power Assay\" (FRAP), which is a different measurement of antioxidant activity. As before, H2O is the solvent. \n", "- F-MEOH -- continuous variable: F stands for phenolics (as above), this time extracted using methanol (MEOH) as solvent instead of water\n", "- A1-MEOH -- continuous variable: as above A1 represents TEAC with MEOH as solvent\n", "- A2-MEOH -- continuous variable: A2 is for FRAP and MEOH is the solvent. \n", - "- cut -- catagorical variable, indicating the cut of the rooibos (not of interest in this study)" + "- cut -- categorical variable, indicating the cut of the rooibos (not of interest in this study)\n", + "\n", + "_Note_: Antioxidants have various health benefits (you may Google \"antioxidant health benefits\"). So the antioxidant content of different tea varieties is of interest both to consumers and to rooibos producers. Phenolics are one particular type of antioxidant of special interest (you may Google \"phenolics health benefits\").\n", + "\n", + "In this study, we use these different antioxidant measurements to attempt to identify whether a rooibos sample is fermented or nonfermented. This can help us better understand the relationship between fermentation and antioxidant content: for instance, does fermentation tend to increase or decrease antioxidant content?" ] }, { "cell_type": "markdown", - "id": "86a08d92", + "id": "022f7ab8", "metadata": {}, "source": [ "Now let's verify the number of samples in each dataset. 
We do this using the 'shape' attribute for data frames:" @@ -181,7 +185,7 @@ { "cell_type": "code", "execution_count": null, - "id": "59c285dc", + "id": "f5dd5405", "metadata": {}, "outputs": [], "source": [ @@ -199,7 +203,7 @@ }, { "cell_type": "markdown", - "id": "9c9b562e-4c42-4eb4-a19d-b79aeba7c67b", + "id": "e5d7393b", "metadata": {}, "source": [ "---\n", @@ -211,7 +215,7 @@ { "cell_type": "code", "execution_count": null, - "id": "760ef925", + "id": "fc1e4d09", "metadata": {}, "outputs": [], "source": [ @@ -222,15 +226,18 @@ }, { "cell_type": "markdown", - "id": "e0d5d086", + "id": "0e6ca6e2", "metadata": {}, "source": [ + "Just so you can see what we're studying, here's a picture of samples of unfermented and fermented rooibos. See if you can guess which is which.\n", + "
\n", + "
\n", "\n" ] }, { "cell_type": "markdown", - "id": "2d6dd1b6-a5f5-459e-9476-1514c7115044", + "id": "ab41dbfa", "metadata": {}, "source": [ "---" @@ -238,17 +245,17 @@ }, { "cell_type": "markdown", - "id": "982f9d78", + "id": "29878c20", "metadata": {}, "source": [ "Renaming variables: \n", "\n", - "The variable names are rather obscure. Let's change them to improve readability. Unfortunately I don't remember how to do this--but all is not lost. We have at our disposal one of the main keys to python programming success: Google! " + "The variable names are not very descriptive. Let's change them to improve readability. Unfortunately I don't remember how to do this--but all is not lost. We have at our disposal one of the main keys to python programming success: Google! " ] }, { "cell_type": "markdown", - "id": "fa35a1e5", + "id": "f52f328c", "metadata": {}, "source": [ "\n" @@ -256,7 +263,7 @@ }, { "cell_type": "markdown", - "id": "c0fc6a29", + "id": "eb9decb4", "metadata": {}, "source": [ "Just search for `change columns names pandas`. You will soon learn how to recognize good websites that will provide working code that you can copy, paste, and modify. 
\n", @@ -271,7 +278,7 @@ { "cell_type": "code", "execution_count": null, - "id": "fa2ccec8", + "id": "9e710258", "metadata": {}, "outputs": [], "source": [ @@ -289,7 +296,7 @@ { "cell_type": "code", "execution_count": null, - "id": "926ee851", + "id": "6ede7f3c", "metadata": {}, "outputs": [], "source": [ @@ -301,7 +308,7 @@ }, { "cell_type": "markdown", - "id": "4ee611d8", + "id": "91c4a73e", "metadata": {}, "source": [ "Let's check and see if the renaming worked as we expected:" @@ -310,7 +317,7 @@ { "cell_type": "code", "execution_count": null, - "id": "37b8f240", + "id": "109bf195", "metadata": {}, "outputs": [], "source": [ @@ -324,7 +331,7 @@ }, { "cell_type": "markdown", - "id": "d7ba2371-e200-4034-9dd6-aa51efa1c9bf", + "id": "00127d43", "metadata": {}, "source": [ "---" @@ -332,7 +339,7 @@ }, { "cell_type": "markdown", - "id": "f9059c6d", + "id": "6ead091f", "metadata": {}, "source": [ "Data concatenation into a single frame: \n", @@ -343,7 +350,7 @@ { "cell_type": "code", "execution_count": null, - "id": "44996bd1", + "id": "e2d80f7a", "metadata": {}, "outputs": [], "source": [ @@ -356,7 +363,7 @@ }, { "cell_type": "markdown", - "id": "dfc0efb9-9953-4d61-bd15-6130fe2760c2", + "id": "f4b90726", "metadata": {}, "source": [ "---\n", @@ -368,7 +375,7 @@ { "cell_type": "code", "execution_count": null, - "id": "b65e70da", + "id": "1108e4ec", "metadata": {}, "outputs": [], "source": [ @@ -377,7 +384,7 @@ }, { "cell_type": "markdown", - "id": "b13c569b-f7cc-4aa8-b54a-f7e430ac6c2f", + "id": "5993750e", "metadata": {}, "source": [ "---" @@ -385,7 +392,7 @@ }, { "cell_type": "markdown", - "id": "1a63c260", + "id": "f5f29970", "metadata": {}, "source": [ "Histograms: \n", @@ -398,7 +405,7 @@ { "cell_type": "code", "execution_count": null, - "id": "58bb9759", + "id": "0dea4f38", "metadata": {}, "outputs": [], "source": [ @@ -409,18 +416,7 @@ }, { "cell_type": "markdown", - "id": "553d29f3", - "metadata": {}, - "source": [ - "From the histograms we may draw the 
following conclusions:\n", - "\n", - " - nonfermented (blue) is somewhat less left skewed. The data appears to have two peaks (this is called \"bimodal\", but with more data it's quite likely that this effect would disappear.\n", - " - fermented (orange) is clearly left skewed.\n" - ] - }, - { - "cell_type": "markdown", - "id": "7aabbce8", + "id": "4f6d43ce", "metadata": {}, "source": [ "The histogram options in the previous code can be explained as follows:\n", @@ -436,7 +432,18 @@ }, { "cell_type": "markdown", - "id": "f2ac2f7b", + "id": "bd328f26", + "metadata": {}, + "source": [ + "**Exercise From the histograms we may draw the following conclusions:\n", + "\n", + " - nonfermented (blue) has the higher mean and the smaller variance. It is also somewhat less left skewed. The data appears to have two peaks (this is called \"bimodal\"), but with more data it's quite likely that this effect would disappear.\n", + " - fermented (orange) is clearly left skewed.\n" + ] + }, + { + "cell_type": "markdown", + "id": "9416e16b", "metadata": {}, "source": [ "We can do multiple plots from the same cell if we use the 'show() command between plots. Otherwise all the data will be put on the same plot." @@ -445,7 +452,7 @@ { "cell_type": "code", "execution_count": null, - "id": "5007cd1b", + "id": "ecac5b13", "metadata": {}, "outputs": [], "source": [ @@ -458,7 +465,7 @@ }, { "cell_type": "markdown", - "id": "5511122b", + "id": "ec643423", "metadata": {}, "source": [ " - fermented (orange) TPC_MEOH has a much smaller variance than nonfermented (blue), and is also slightly skewed left.\n", @@ -467,19 +474,19 @@ }, { "cell_type": "markdown", - "id": "097edfd3-3457-4f2c-b251-8ca016def4d8", + "id": "0ecb7d1f", "metadata": {}, "source": [ "---\n", - "**Exercise 3:** Do the plots for TEAC, TPC, and FRAP for water solvent.\n", + "**Exercise 4:** Do the plots for TEAC, TPC, and FRAP for water solvent.\n", "
\n", - "##### **hint**: Remember to use show() to separate the plots." + "##### **Hint**: Remember to use show() to separate the plots." ] }, { "cell_type": "code", "execution_count": null, - "id": "3a4c14dc", + "id": "a855bb8a", "metadata": {}, "outputs": [], "source": [ @@ -492,7 +499,7 @@ }, { "cell_type": "markdown", - "id": "186cb1a1-e985-4b40-983b-d66fb4dd9021", + "id": "8a42d277", "metadata": {}, "source": [ "---" @@ -500,7 +507,7 @@ }, { "cell_type": "markdown", - "id": "1dae234a", + "id": "6e294756", "metadata": {}, "source": [ " Saving data for later use \n", @@ -511,7 +518,7 @@ { "cell_type": "code", "execution_count": null, - "id": "906e684c", + "id": "07f77364", "metadata": {}, "outputs": [], "source": [ @@ -522,7 +529,7 @@ }, { "cell_type": "markdown", - "id": "268bd8eb", + "id": "0369e402", "metadata": {}, "source": [ "Congratulations! You've finished your basic exploration of the data. In the next notebook we'll go on to more descriptive visualizations." @@ -531,9 +538,9 @@ ], "metadata": { "kernelspec": { - "display_name": "rooibos_hack", + "display_name": "teaClass_ker", "language": "python", - "name": "rooibos_hack" + "name": "teaclass_ker" }, "language_info": { "codemirror_mode": { diff --git a/rooibos/tutorial2.ipynb b/rooibos/tutorial2.ipynb index aca3822..42bce44 100644 --- a/rooibos/tutorial2.ipynb +++ b/rooibos/tutorial2.ipynb @@ -29,7 +29,7 @@ "id": "d9096c03", "metadata": {}, "source": [ - "This tutorial will show you some correlation analysis betweeen the features of the same solution (water/methanol)" + "This tutorial shows how to do a correlation analysis in Python. In particular, we will investigate correlations between the 3 features (measurements) made with water as solvent and similarly between the features with methanol as solvent." 
] }, { @@ -37,10 +37,12 @@ "id": "3993aed0", "metadata": {}, "source": [ - "Correlation analysis:\n", - "- A statistical method which is often used to determine the linear relationship between two features, and compute their dependency.\n", - "- a dependency between two features can be positive, negative, or zero (uncorrelated).\n", - "- the strength and direction of the correlation/dependecy can be summarized using the \"correlation cooficient\"" + "What is correlation analysis?\n", + "\n", + "- A statistical method which is often used to determine the degree of linear dependency between pairs of variables. \n", + "- The dependency is expressed using a single number (the _correlation coefficient_) which is between -1 and 1. We can say that two variables are negatively correlated, positively correlated, or uncorrelated according to whether the correlation coefficient is negative, positive, or 0.\n", + "- Two variables are strongly correlated if a scatter plot of the variables has points which lie nearly along a line. On the other hand, variables are uncorrelated if the scatter plot shows a \"cloud\" of points with no tendency to slant in either direction.\n", + "- (You may Google \"correlation scatter plot\" to see examples of scatter plots of variable pairs with different degrees of correlation)\n" ] }, { @@ -48,7 +50,7 @@ "id": "fdab5c88", "metadata": {}, "source": [ - "### Now let us start" + "### Let's begin!" ] }, { @@ -56,7 +58,7 @@ "id": "ff304bfb", "metadata": {}, "source": [ - "let us retrieve the data from the previous tutorial" + "First, retrieve the data from the previous tutorial." 
] }, { @@ -82,7 +84,7 @@ "---\n", "**Exercise 1:** Verify the data in the above data frames\n", "
\n", - "##### **hint**: Remember the 'head' command" + "_(Hint: Remember the 'head' command)_" ] }, { @@ -100,7 +102,7 @@ "id": "48049067", "metadata": {}, "source": [ - "For more detailed data analysis, we will need the 'numpy' package, plus a specialized package that can draw confidence ellipses (explained below). " + "Since we will be doing more detailed data analysis, we will use the 'numpy' package ('numpy' stands for 'Numerical Python'). We will also use custom code that draws _confidence ellipses_ (which we will explain below). The custom code may be found in the `source` directory. " ] }, { @@ -112,7 +114,8 @@ "source": [ "# ___Cell no. 2___\n", "import numpy as np # 'np' is the prefix that will identify nump packages\n", - "from source.ellipses import draw_confidence_ellipse # for representing the correlation" + "from source.ellipses import draw_confidence_ellipse # for representing the correlation (for\n", + "# code see 'source' directory)" ] }, { @@ -120,7 +123,11 @@ "id": "f547df38", "metadata": {}, "source": [ - "We will represent the data using a scatterplot, and will superimpose confidence ellipses to bring out the general orientation and extent of the data. A confidence ellipse shows where the data is most heavily concentrated (i.e. where the probability density is highest). Confidence regions are used for predicting new observations with a certain degree of confidence, which depends on the confidence parameter (measured in standard deviations) used to generate the ellipse. " + "In our case, we want to compare bivariate distributions for two different datasets: \"bivariate\" refers to the fact that we are looking at the joint distributions for two different features. \n", + "\n", + "To make an effective comparison, first we make 2-d scatterplots for the two datasets on the same axes. The two axes correspond to the two features being represented. 
The scatterplot looks like a \"cloud\" of points, where each point corresponds to one tea sample: the $x$ and $y$ coordinates of a point are given by the values of the two features for that particular sample. \n", + "\n", + "In order to characterize the overall distribution, confidence ellipses are superimposed on the scatterplots for each dataset. A confidence ellipse shows where the data is most heavily concentrated (i.e. where the probability density is highest). Confidence regions are used for predicting new observations with a certain degree of confidence, which depends on the confidence parameter (measured in standard deviations) used to generate the ellipse. When the confidence parameter is 2 and the data is roughly bivariate normal, about 86 percent of the data is expected to lie within the ellipse (the familiar 95 percent figure applies to a 2-standard-deviation interval for a single variable)." ] }, { @@ -128,12 +135,12 @@ "id": "a3c0a0fa", "metadata": {}, "source": [ - "The syntax for the `draw_Confidence_ellipse` command is as follows:\n", + "The syntax for the `draw_confidence_ellipse` command is as follows:\n", "\n", - " draw_Confidence_ellipse (data1_x, data1_y, data2_x, data2_y, \n", - " \"y-axis label\", \"x-axis label\", \"title\", x-scale, y-scale)\"\n", + " draw_confidence_ellipse (data1_x, data1_y, data2_x, data2_y, confidence_parameter,\n", + " \"x-axis label\", \"y-axis label\", \"title\", x-scale, y-scale)\n", " \n", - "Notice that we can continue Python commands on multiple lines, as long as we make break the statement in such a way that the Python compiler can see that the command is not yet finished (e.g. by breaking the statement after an open parenthesis or comma).\n", + "Notice that we can continue Python commands on multiple lines, as long as we break the statement in such a way that the Python interpreter can see that the command is not yet finished. 
A good way to do this is to make the break after an open parenthesis or bracket, or after a comma that separates items in a list.\n", "\n", "Let's give this a try!" 
] @@ -149,7 +156,7 @@ "\n", "draw_confidence_ellipse ( \n", " df_fer[['TPC_MEOH']], df_fer[['TEAC_MEOH']], \n", - " df_nf[['TPC_MEOH']], df_nf[['TEAC_MEOH']], \n", + " df_nf[['TPC_MEOH']], df_nf[['TEAC_MEOH']], 2, \n", " \"TPC(GAE/g)\", \"TEAC(TE/g)\", \n", " \"TEAC versus TP for $MeOH$ extracted samples\",\n", " [100, 550], [1000,5550] )\n" @@ -160,7 +167,15 @@ "id": "9f07a02a", "metadata": {}, "source": [ - "The results show there is a statistically significant positive correlation between TPC and TEAC for fermented (blue). This is reflected in the tilted orientation of the ellipse. On the other hand, the correlation between TPC and TEAC for unfermented is not statistically signficant (p > 0.05). " + "The function draw_confidence_ellipse also gives the estimated correlation coefficient between the two variables; error denotes the uncertainty in the correlation coefficient; and p denotes the p-value for the null hypothesis that the correlation coefficient is 0 (typically the null hypothesis is denoted as 𝐻0).\n", + "\n", + "The p value has the following meaning (one must be very careful about this, because the p value is often misunderstood). Suppose the estimated correlation coefficient is C. In this case, the p value is the conditional probability, given that the true correlation coefficient is 0, that the measured correlation coefficient will have absolute value greater than or equal to that of C. In other words, the p value is the probability, given that 𝐻0 is true, that a measurement that is \"at least as extreme\" as C is obtained. So it is not true that \"the p value is the probability that H0 is false\", because it is calculated under the assumption that H0 is true! Instead, the p value expresses a likelihood. For example, suppose my friend flips a coin 20 times and gets 20 heads. If the coin is actually fair, the probability this would happen is less than 0.00001. So it is likely that the coin is not fair (e.g. maybe it has 'heads' on both sides). 
But it is not correct to say that the probability that the coin is fair is 0.00001.\n", + "\n", + "If the p value is below a certain level, then we reject the null hypothesis. This rejection threshold is called the significance level. What significance level you use depends on the application. In many cases, a significance level of 0.01 is used.\n", + "\n", + "The results show there is a statistically significant positive correlation between TPC and TEAC for fermented (blue). This is reflected in the tilted orientation of the ellipse. On the other hand, the correlation between TPC and TEAC for unfermented is not statistically significant (p > 0.01). \n", + "\n", + "The graph also shows a large overlap between the fermented and nonfermented data. Both confidence ellipses contain many points with $225 < TPC < 300$ and $2000 < TEAC < 2500$. Because of the correlation, fermented data with lower values of TPC also tend to have lower values of TEAC. The relative sizes of the ellipses show that nonfermented data is more spread out, meaning a wider range of values is observed (particularly with TPC). " ] }, { @@ -200,7 +215,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": null, "id": "923db9af-983c-4476-b162-d9816d9c6806", "metadata": {}, "outputs": [], @@ -213,7 +228,7 @@ "id": "ae1025c7", "metadata": {}, "source": [ - "Evidently the y-scale is off, so we will need to change the scale. To do this, we find the min an max values for FRAP" + "Evidently the y-scale is off, so we will need to change the scale. 
To do this, we find the min and max values for FRAP" ] }, { @@ -283,9 +298,9 @@ ], "metadata": { "kernelspec": { - "display_name": "rooibos_hack", + "display_name": "Python 3", "language": "python", - "name": "rooibos_hack" + "name": "python3" }, "language_info": { "codemirror_mode": { @@ -297,7 +312,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.10" + "version": "3.8.3" } }, "nbformat": 4, diff --git a/rooibos/tutorial3.ipynb b/rooibos/tutorial3.ipynb index 286eb52..0e0dbd6 100644 --- a/rooibos/tutorial3.ipynb +++ b/rooibos/tutorial3.ipynb @@ -662,9 +662,9 @@ ], "metadata": { "kernelspec": { - "display_name": "rooibos_hack", + "display_name": "Python 3", "language": "python", - "name": "rooibos_hack" + "name": "python3" }, "language_info": { "codemirror_mode": { @@ -676,7 +676,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.10" + "version": "3.8.3" } }, "nbformat": 4, From 4f137093ae01eb68025addd93ac9223e36810426 Mon Sep 17 00:00:00 2001 From: Eslam Hussein <31181109+eahussein@users.noreply.github.com> Date: Sun, 4 Dec 2022 12:56:06 +0200 Subject: [PATCH 2/3] update the readMe file --- README.md | 72 +++++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 60 insertions(+), 12 deletions(-) diff --git a/README.md b/README.md index 87427ee..35b28ff 100644 --- a/README.md +++ b/README.md @@ -1,16 +1,64 @@ -# rooibosTea_classification -An educational tutorial that is based on rooibos tea data. The tutorials run through data visualization, data correlation, and finally performing binary classification on fermented and non-fermented rooibos data using basic statistical methods and some machine learning tools. The work is fairly simple, and can work on ***Google Colab!** (https://colab.research.google.com/) +# Rooibos tea classification -### This repo has four notebooks (/rooibos/..): -1. Tutorial 1: Data visualization -2. 
Tutorial 2: Data correlation -3. Tutorial 3: Classification using simple statistics -4. Tutorial 4: Classification using machine learning +## Description -#### In case you found difficulty dealing with python when working on the tutorials, please check the following links: -1. https://www.sololearn.com/learning/1073 -2. https://problemsolvingwithpython.com/ +Welcome to the project on rooibos tea classification! From the tutorials you will learn to do the following: -#### If you make use of this code in preparing results for a paper, please Cite: +- *Tutorial 1*: Data visualization +- *Tutorial 2*: Data correlation +- *Tutorial 3*: Classification using simple statistics +- *Tutorial 4*: Classification using machine learning -Hussein, E.A.; Thron, C.; Ghaziasgar, M.; Vaccari, M.; Marnewick, J.L.; Hussein, A.A. Comparison of Phenolic Content and Antioxidant Activity for Fermented and Unfermented Rooibos Samples Extracted with Water and Methanol. Plants 2022, 11, 16. https://doi.org/10.3390/plants11010016 + +## Data +98 randomly selected rooibos samples, fermented (fer, 51 samples) and nonfermented (nf, 47 samples), were kindly donated by Rooibos LTD-BPK (Clanwilliam, South Africa) during March 2020. + + +## Hackathon Task +Starting from the proposed pipeline (tutorials), investigate new ways to distinguish between fer and nf rooibos tea. + + +## Prerequisites + +All the libraries/dependencies necessary to run the tutorials are listed in the [requirements.txt](https://github.com/Hack4Dev/rooibosTea_classification/blob/main/requirements.txt) file. + + +## Installation + +All the required libraries can be installed using pip and the [requirements.txt](https://github.com/Hack4Dev/rooibosTea_classification/blob/main/requirements.txt) file in the repo: + +```bash +> pip install -r requirements.txt +``` + +### Would you like to clone this repository? Feel free! 
+ +```bash +> git clone https://github.com/Hack4Dev/rooibosTea_classification.git +``` + +Then make sure you have the right Python libraries for the tutorials. + + +### New to GitHub? + +The easiest way to get all of the lecture and tutorial material is to clone this repository. To do this you need git installed on your laptop. If you're working on a Debian-based Linux such as Ubuntu you can install git using apt (you might need to use sudo): + +``` +apt install git +``` + +You can then clone the repository by typing: + +``` +git clone https://github.com/Hack4Dev/rooibosTea_classification.git +``` + +To update your clone if changes are made, use: + +``` +cd rooibosTea_classification/ +git pull +``` + +----- From f50678dce063bde6012dae1190b9518c098a81e0 Mon Sep 17 00:00:00 2001 From: Eslam Hussein <31181109+eahussein@users.noreply.github.com> Date: Thu, 26 Jan 2023 09:05:37 +0200 Subject: [PATCH 3/3] Update ReadMe file added the citation --- README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/README.md b/README.md index 35b28ff..d04c0b8 100644 --- a/README.md +++ b/README.md @@ -62,3 +62,6 @@ git pull ``` ----- +### Original research work + +Hussein, E.A.; Thron, C.; Ghaziasgar, M.; Vaccari, M.; Marnewick, J.L.; Hussein, A.A. Comparison of Phenolic Content and Antioxidant Activity for Fermented and Unfermented Rooibos Samples Extracted with Water and Methanol. Plants 2022, 11, 16. [https://doi.org/10.3390/plants11010016](https://www.mdpi.com/2223-7747/11/1/16)