diff --git a/.DS_Store b/.DS_Store new file mode 100644 index 0000000..ff78b53 Binary files /dev/null and b/.DS_Store differ diff --git a/Assignment Colab/.DS_Store b/Assignment Colab/.DS_Store new file mode 100644 index 0000000..a4fdfb1 Binary files /dev/null and b/Assignment Colab/.DS_Store differ diff --git a/Assignment Colab/.ipynb_checkpoints/.DS_Store b/Assignment Colab/.ipynb_checkpoints/.DS_Store new file mode 100644 index 0000000..5008ddf Binary files /dev/null and b/Assignment Colab/.ipynb_checkpoints/.DS_Store differ diff --git a/Assignment Colab/.ipynb_checkpoints/Ibra Lujumba_kaggle-checkpoint.ipynb b/Assignment Colab/.ipynb_checkpoints/Ibra Lujumba_kaggle-checkpoint.ipynb new file mode 100644 index 0000000..4c4273f --- /dev/null +++ b/Assignment Colab/.ipynb_checkpoints/Ibra Lujumba_kaggle-checkpoint.ipynb @@ -0,0 +1,1439 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## ACE-Uganda Kaggle competition\n", + "### by Ibra Lujumba\n", + "#### 2019/HD07/27842U" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Contents\n", + "\n", + "- Exploratory Data Analysis\n", + "\n", + "- Data transformation\n", + "\n", + "- Feature selection\n", + "\n", + "- Building a machine learning classifier\n", + "\n", + "- Measuring performance of a classifier\n", + "\n", + "- Machine learning is an iterative process\n", + "\n", + "- Visualising decision boundaries" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 1. Exploratory Data Analysis " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exploratory data analysis (EDA) is the first step towards building a model that is capable of making predictions about the data. 
This is done to try to understand the properties of the data before any machine learning algorithm is used to make predictions using the data.\n", + "\n", + "The go-to libraries for manipulating data in Python are **numpy** and **pandas**. **matplotlib** and **seaborn** are used for data visualisation. These libraries are imported using the **import** keyword." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Importing Python modules for data analysis and visualization" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np # manipulation of arrays\n", + "import pandas as pd # manipulating dataframes\n", + "import matplotlib.pyplot as plt # data visualisation\n", + "import seaborn as sb # data visualisation, built on top of matplotlib" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Verbose warnings from Python functions are suppressed to keep the output readable. They typically arise when the data being manipulated does not exactly meet the description of a function's input arguments." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ignoring warnings that may arise\n", + "import warnings\n", + "warnings.filterwarnings('ignore')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Importing the datasets\n", + "The pandas library can read data from delimited text files. The delimiter may be a tab (\\t), comma (,), semi-colon (;) or another character, which is specified using the 'sep=' argument.\n", + "For comma-separated files (csv), the pd.read_csv() function is used. The default delimiter is a comma so there is no need to explicitly specify it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!ls ../input/ace-class-assignment/ # '!' 
specifies that this is not a Python command \n", + " # and should be executed by the system shell\n", + "\n", + "# reading in the data\n", + "data = pd.read_csv('../input/ace-class-assignment/AMP_TrainSet.csv')\n", + "new = pd.read_csv('../input/ace-class-assignment/Test.csv')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Checking the dimensions of the data as well as the datatype of each column\n", + "The dimensions of a dataframe are its number of rows and columns, represented in the format (rows, columns)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# checking dimensions of the datasets\n", + "data.shape, new.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is important to check the type of data stored in the dataframe. Machine learning algorithms utilise numeric data to make predictions.\n", + "Where a non-numeric data type exists, it should be converted to numeric values through binarisation or numeric coding of the different classes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# checking the datatypes of the variables\n", + "data.dtypes, new.dtypes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "All the values in all the variables exist as either floats or integers.\n", + "\n", + "##### Proceeding to work with the training dataset to build the classifier" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Descriptive statistics of the training dataset, such as the arithmetic mean, standard deviation, quartiles and number of non-NA values in each column, are computed next. These values inform what the next steps should be. 
\n", + "These steps include:\n", + "- cleaning the dataset to remove rows that don't meet preset criteria\n", + "- taking care of skewed distributions\n", + "- taking care of missing values" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data.describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From the count, all values for all cells in the dataset exist, i.e. there are no missing values for any of the variables.\n", + "AS_DAYM780201 and FULL_DAYM780201 have the highest mean and highest maximum. FULL_OOBM850104 has a negative mean.\n", + "For all the variables, the data points are not widely spread." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# checking the proportions of classes\n", + "data.groupby('CLASS').size()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Both classes have an equal number of entries.\n", + "\n", + "Pairwise correlation values are obtained for the variables in the train dataset to check whether the variables are independent of each other or multicollinearity exists within the data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# use this resource to understand the output https://realpython.com/numpy-scipy-pandas-correlation-python/#pearson-correlation-coefficient\n", + "pearsoncorr = data.corr(method='pearson')\n", + "\n", + "# visualizing the correlation matrix as a heatmap to make interpretation easier\n", + "plt.figure(figsize=(10,10))\n", + "top_corr = pearsoncorr.index\n", + "sb.heatmap(pearsoncorr, \n", + " xticklabels=pearsoncorr.columns,\n", + " yticklabels=pearsoncorr.columns,\n", + " cmap='RdYlGn',\n", + " annot=True,\n", + " linewidth=0.5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Looking at the last row, FULL_Charge and AS_MeanAmphiMoment 
have the highest positive correlation values with CLASS whereas the second, third and fourth variables have the most negative correlation values." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can get the p-values associated with the correlation values using the code below.\n", + "\n", + "`from scipy.stats import pearsonr`\n", + "\n", + "`data.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(len(data.columns))`\n", + "\n", + "Getting the *p-value* associated with a correlation value is necessary since it indicates whether the observed correlation between variables is significant or not. Normally, significance is confirmed if the *p-value* is below a threshold.\n", + "\n", + "In this example, all the p-values for the observed correlations were significant." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# using a scatter plot matrix to visualise correlations\n", + "plt.figure(figsize=(40,40))\n", + "sb.pairplot(data)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Some variables are significantly correlated with each other, which raises the problem of multicollinearity (variables are correlated with each other as well as with the response variable).\n", + "These variables are Full_Charge, FULL_AcidicMolPer, FULL_AURR980107,...\n", + "\n", + "Variables that require further investigation - NT_EFC195, AS_MeanAmphiMoment" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "len(data['AS_MeanAmphiMoment'].unique()), data['NT_EFC195'].unique()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This confirms that NT_EFC195 is a categorical variable with two classes, 0 and 1." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data[['CLASS','NT_EFC195']].head()" + ] + }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "NT_EFC195 assumes both values irrespective of class\n", + "\n", + "Getting the associated p-values to check the correlation of features with the class. \n", + "The value of 1 at the bottom should be ignored since these values are obtained from the last column of a matrix of pairwise correlations therefore it corresponds to self correlation " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from scipy.stats import pearsonr\n", + "data.corr(method=lambda x, y: pearsonr(x, y)[1])['CLASS']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# checking the distribution and skewness of variables\n", + "plt.figure(figsize=(10,6))\n", + "data.skew().plot(kind='bar')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Most of the variables are minimally skewed except NT_EFC195. Further checks will be done to try to understand the properties of this variable.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data.groupby('NT_EFC195').size() # majority of the instances are of Class 0." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The skewness of this variable is explained by most of its values being zero.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data.plot(kind='density', subplots=True, layout=(4,3), figsize=(10,10))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Values for AS_FUK010112, CT_RACS820104, FULL_GEOR030101 and FULL_AURR980107 lie close to zero compared to the rest of the variables.\n", + "\n", + "Transformation possibilities\n", + "* using the min-max scaler\n", + "* standardisation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 2. Data transformation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Better performance of algorithms can be obtained if the data is transformed.\n", + "Some algorithms may treat features with large values as the most important features, which biases the predictions since they are then based mainly on those variables." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Separating the predictor variables from the target variable. Transformation should not be applied to the target variable.\n", + "\n", + "Operations on data are carried out on numpy ndarrays. The pandas dataframe is first converted to an ndarray."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dataArray = data.to_numpy()\n", + "\n", + "# separating the predictor and response variables\n", + "target = dataArray[:,11]\n", + "predictors = dataArray[:,0:11]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Numeric data may be rescaled (all values are mapped into a specified range) or standardized to have a mean of zero and a standard deviation of one.\n", + "Transformations have the advantage of improving performance of machine learning classifiers.\n", + "\n", + "Rescaling is done using the MinMaxScaler class while standardising is done using the StandardScaler class. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# using minMaxScaler to set all values between 0 and 1\n", + "from sklearn.preprocessing import MinMaxScaler\n", + "scaler = MinMaxScaler(feature_range=(0,1))\n", + "rescaledPredictors = scaler.fit_transform(predictors)\n", + " \n", + "\n", + "# using StandardScaler\n", + "from sklearn.preprocessing import StandardScaler\n", + "scaler1 = StandardScaler().fit(predictors)\n", + "standardizedPredictors = scaler1.transform(predictors)\n", + " \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 3. Feature selection\n", + "This is done to identify which features contribute significantly to the prediction of class. Once these features are known, they can be used to predict class. 
This has the advantage of reducing training time, reducing the risk of overfitting and potentially improving the accuracy of the model since redundant features are not used in the prediction process.\n", + "\n", + "On this dataset, a univariate statistic, the F-test, was used as an alternative to the chi-squared test (sklearn's chi2 requires non-negative feature values and returns an error because some values in the data are negative).\n", + "\n", + "The F-test and feature selection were done on both the original and transformed data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# untransformed data\n", + "from sklearn.feature_selection import SelectKBest, f_classif\n", + "bestFeatures = SelectKBest(score_func=f_classif, k=7)\n", + "fit = bestFeatures.fit(predictors, target)\n", + "\n", + "scores = pd.DataFrame(fit.scores_) \n", + "pvalues = pd.DataFrame(fit.pvalues_)\n", + "columns = pd.DataFrame(data.columns[0:11])\n", + "\n", + "featureValues = pd.concat([columns,scores, pvalues,], axis=1) # concatenating dataframes\n", + "featureValues.columns = ['predictor', 'score', 'pvalue'] # naming the columns\n", + "\n", + "print(featureValues.nlargest(7, 'score'))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# checking the transformed data\n", + "\n", + "# rescaledPredictors\n", + "reBestFeatures = SelectKBest(score_func=f_classif, k=7)\n", + "reFit = reBestFeatures.fit(rescaledPredictors, target)\n", + "\n", + "reScores = pd.DataFrame(reFit.scores_) \n", + "rePvalues = pd.DataFrame(reFit.pvalues_)\n", + "reColumns = pd.DataFrame(data.columns[0:11])\n", + "\n", + "reFeatureValues = pd.concat([reColumns,reScores, rePvalues,], axis=1) # concatenating dataframes\n", + "reFeatureValues.columns = ['re_predictor', 're_score', 're_pvalue'] # naming the columns\n", + "\n", + "\n", + "\n", + "# standardizedPredictors\n", + "stBestFeatures = SelectKBest(score_func=f_classif, k=7)\n", + "stFit = 
stBestFeatures.fit(standardizedPredictors, target)\n", + "\n", + "stScores = pd.DataFrame(stFit.scores_) \n", + "stPvalues = pd.DataFrame(stFit.pvalues_)\n", + "stColumns = pd.DataFrame(data.columns[0:11])\n", + "\n", + "stFeatureValues = pd.concat([stColumns,stScores, stPvalues,], axis=1) # concatenating dataframes\n", + "stFeatureValues.columns = ['st_predictor', 'st_score', 'st_pvalue'] # naming the columns\n", + "\n", + "print(reFeatureValues.nlargest(7, 're_score')), print(stFeatureValues.nlargest(7, 'st_score'))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The importance of different features can be ranked using the ExtraTreesClassifier, which is also known as Extremely Randomized Trees. In the Extra Trees classifier, the features and splits are selected at random; hence, “Extremely Randomized Trees”. Since splits are chosen at random for each feature in the Extra Trees classifier, it’s less computationally expensive than a Random Forest." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# using feature importance\n", + "from sklearn.ensemble import ExtraTreesClassifier\n", + "model = ExtraTreesClassifier()\n", + "model.fit(predictors, target)\n", + "print(model.feature_importances_)\n", + "\n", + "# visualising feature importance\n", + "importances = pd.Series(model.feature_importances_, index=data.columns[0:11])\n", + "importances.nlargest(10).plot(kind='barh')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 4. Building the classification model\n", + "Finding the right model for a given dataset is an iterative process. 
It is advisable to try many combinations of data and algorithms to find the algorithm that offers satisfactory results on a given dataset.\n", + "It cannot be expected that an algorithm that performs well on a particular dataset will perform the same way on another dataset.\n", + "Therefore, an exhaustive approach is necessary to find the best performing algorithm for a given dataset.\n", + "\n", + "Some of these combinations are (for original and transformed data):\n", + "- model + train/test sets\n", + "- model + kfold crossvalidation\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Splitting the dataset into training and test datasets and using a logit function to classify instances\n", + "from sklearn.model_selection import train_test_split # random split\n", + "from sklearn.linear_model import LogisticRegression # scikit-learn models are implemented as classes\n", + "p_train, p_test, t_train, t_test = train_test_split(predictors, target, \n", + " test_size=0.30,random_state=42)\n", + "\n", + "logit = LogisticRegression() # making instance of model\n", + "\n", + "# fitting the model on untransformed data\n", + "logit.fit(p_train, t_train)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 5. Measuring model performance\n", + "Logistic regression is a supervised learning algorithm. 
Performance of the algorithm is measured as the number of correct predictions made by the algorithm out of the total number of predictions.\n", + "We can measure the performance of a classifier using precision, F1 score, the Receiver Operating Characteristic (ROC) curve, accuracy and the Matthews Correlation Coefficient.\n", + "\n", + "Other performance metrics are:\n", + "- True Positives (TP) / True Positive Rate (TPR): Number of correct positive predictions / Probability of predicting positive given that the actual class is positive\n", + "- False Negatives (FN) / False Negative Rate (FNR): Number of wrong negative predictions / Probability of predicting negative given that the actual class is positive\n", + "- True Negatives (TN) / True Negative Rate (TNR): Number of correct negative predictions / Probability of predicting negative given that the actual class is negative\n", + "- False Positives (FP) / False Positive Rate (FPR): Number of wrong positive predictions / Probability of predicting positive given that the actual class is negative\n", + "- Precision (P): Proportion of predicted positives that are correct\n", + "- Recall (R): Proportion of actual positives captured" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# predict on test data\n", + "predictions = logit.predict(p_test) " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# the confusion matrix\n", + "from sklearn import metrics\n", + "cm = metrics.confusion_matrix(t_test, predictions)\n", + "cm\n", + "sb.heatmap(cm, annot=True, fmt='.3f', linewidths=.5,\n", + " square=True, cmap='Blues') \n", + "plt.ylabel('Actual label'); plt.xlabel('Predicted label')\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# performance metrics\n", + "print(\"Accuracy: \",metrics.accuracy_score(t_test, predictions)*100)\n", + 
"print(\"Precision: \",metrics.precision_score(t_test, predictions)*100)\n", + "print(\"Recall: \",metrics.recall_score(t_test, predictions)*100)\n", + "\n", + "from sklearn.metrics import matthews_corrcoef\n", + "print('MCC: ',matthews_corrcoef(t_test, predictions)) # takes into account true and false positives and negatives, \n", + " # higher values are better\n", + "# not affected by unbalanced classes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Performance metrics are inter-related and can be visualised using an ROC curve where the true positive rate is plotted against the false positive rate. Essentially, the curve shows the tradeoff between sensitivity and specificity.\n", + "The maximum area under the curve (AUC) is 1, which corresponds to a classifier that separates the two classes perfectly; higher values of AUC are therefore desired." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pred_probs = logit.predict_proba(p_test)[::,1] # all rows, column 1: probability of the positive class\n", + "fpr, tpr,_ = metrics.roc_curve(t_test, pred_probs)\n", + "auc = metrics.roc_auc_score(t_test, pred_probs)\n", + "plt.plot(fpr, tpr, label = 'Untransformed+all Var, auc='+ str(auc))\n", + "plt.legend(loc=4)\n", + "plt.ylabel('tpr'), plt.xlabel('fpr')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The confusion matrix above shows the number of true positives, true negatives, false positives as well as false negatives.\n", + "The model made a total of 75 wrong class predictions. It can be seen that logistic regression performs well on the original dataset where all features are utilised in the prediction of the class.\n", + "\n", + "Something to keep in mind is that good performance of an algorithm on the data at hand does not guarantee how the model will perform when new data is presented to it. However, a good performance is desirable."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# performance on new data\n", + "new_pred = logit.predict(new.values)\n", + "\n", + "pred_df = pd.DataFrame(new_pred) \n", + "pred_df.columns=[\"CLASS\"]\n", + "pred_df.index.name=\"Index\" \n", + "pred_df[\"CLASS\"] = pred_df[\"CLASS\"].map({0:'False',1.0:'True'})\n", + "\n", + "#csv file output\n", + "pred_df.to_csv(\"ilujumba.csv\") \n", + "print(pred_df['CLASS'].unique())\n", + "\n", + "#printing the numbers of False and True\n", + "print(pred_df.groupby('CLASS').size()[0].sum())\n", + "print(pred_df.groupby('CLASS').size()[1].sum())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After the csv file was submitted to the competition, the model had an 83.33% accuracy." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 6. Machine learning is an iterative process\n", + "\n", + "Different combinations of logistic regression with the data were tried.\n", + "- train_test split\n", + "\n", + " model + rescaled data\n", + " \n", + " model + standardised data\n", + "\n", + "\n", + "- kfold crossvalidation\n", + "\n", + " model + rescaled data\n", + " \n", + " model + standardised data\n", + "\n", + "#### Logistic regression on rescaled data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "p1_train, p1_test, t1_train, t1_test = train_test_split(rescaledPredictors, target, \n", + " test_size=0.30,random_state=42)\n", + "\n", + "logit1 = LogisticRegression() # making instance of model\n", + "\n", + "# fitting the model on rescaled data\n", + "logit1.fit(p1_train, t1_train)\n", + "\n", + "# predict on test data\n", + "predictions1 = logit1.predict(p1_test)\n", + "\n", + "# performance metrics\n", + "print(\"Accuracy: \",metrics.accuracy_score(t1_test, predictions1)*100)\n", + "print(\"Precision: \",metrics.precision_score(t1_test, 
predictions1)*100)\n", + "print(\"Recall: \",metrics.recall_score(t1_test, predictions1)*100)\n", + "\n", + "from sklearn.metrics import matthews_corrcoef\n", + "print('MCC: ',matthews_corrcoef(t1_test, predictions1))\n", + "\n", + "# rescaling new data (note: fit_transform refits the scaler on the new data; scaler.transform would reuse the training fit)\n", + "newArray = new.to_numpy()\n", + "rescaledNew = scaler.fit_transform(newArray)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# performance on new data (rescaled)\n", + "new_pred1 = logit1.predict(rescaledNew)\n", + "\n", + "pred_df1 = pd.DataFrame(new_pred1) \n", + "pred_df1.columns=[\"CLASS\"]\n", + "pred_df1.index.name=\"Index\" \n", + "pred_df1[\"CLASS\"] = pred_df1[\"CLASS\"].map({0:'False',1.0:'True'})\n", + "\n", + "#csv file output\n", + "pred_df1.to_csv(\"ilujumba1.csv\") \n", + "print(pred_df1['CLASS'].unique())\n", + "\n", + "#printing the numbers of False and True\n", + "print(pred_df1.groupby('CLASS').size()[0].sum())\n", + "print(pred_df1.groupby('CLASS').size()[1].sum())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This combination returned an 88.11% accuracy after submission."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Logistic regression on standardized data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "p2_train, p2_test, t2_train, t2_test = train_test_split(standardizedPredictors, target, \n", + " test_size=0.30,random_state=42)\n", + "\n", + "logit2 = LogisticRegression() # making instance of model\n", + "\n", + "# fitting the model on standardized data\n", + "logit2.fit(p2_train, t2_train)\n", + "\n", + "# predict on test data\n", + "predictions2 = logit2.predict(p2_test)\n", + "\n", + "# performance metrics\n", + "print(\"Accuracy: \",metrics.accuracy_score(t2_test, predictions2)*100)\n", + "print(\"Precision: \",metrics.precision_score(t2_test, predictions2)*100)\n", + "print(\"Recall: \",metrics.recall_score(t2_test, predictions2)*100)\n", + "print('MCC: ',matthews_corrcoef(t2_test, predictions2))\n", + "\n", + "# standardizing new data\n", + "standardizedNew = scaler1.transform(newArray)\n", + "\n", + "# performance on new data (standardized)\n", + "new_pred2 = logit2.predict(standardizedNew)\n", + "\n", + "pred_df2 = pd.DataFrame(new_pred2) \n", + "pred_df2.columns=[\"CLASS\"]\n", + "pred_df2.index.name=\"Index\" \n", + "pred_df2[\"CLASS\"] = pred_df2[\"CLASS\"].map({0:'False',1.0:'True'})\n", + "\n", + "#csv file output\n", + "pred_df2.to_csv(\"ilujumba2.csv\") \n", + "print(pred_df2['CLASS'].unique())\n", + "\n", + "#printing the numbers of False and True\n", + "print(pred_df2.groupby('CLASS').size()[0].sum())\n", + "print(pred_df2.groupby('CLASS').size()[1].sum())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This model strategy returned an 86.03% accuracy after submission, which is better than when the original data is used but slightly lower than when rescaled features are used."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Using selected features, rescaled data and Logistic regression\n", + "\n", + "An attempt was made to manually select for features that showed the highest importance in the prediction of class. All further strategies on data using Logistic Regression were done using rescaled data with the assumption that it could provide better model performance results when compared to other transformations of the data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "p3_train, p3_test, t3_train, t3_test = train_test_split(rescaledPredictors[:,(0,1,2,3,7)], target, \n", + " test_size=0.30,random_state=42)\n", + "\n", + "logit3 = LogisticRegression() # making instance of model\n", + "\n", + "# fitting the model on rescaled data\n", + "logit3.fit(p3_train, t3_train)\n", + "\n", + "# predict on test data\n", + "predictions3 = logit3.predict(p3_test)\n", + "\n", + "# performance metrics\n", + "print(\"Accuracy: \",metrics.accuracy_score(t3_test, predictions3)*100)\n", + "print(\"Precision: \",metrics.precision_score(t3_test, predictions3)*100)\n", + "print(\"Recall: \",metrics.recall_score(t3_test, predictions3)*100)\n", + "\n", + "from sklearn.metrics import matthews_corrcoef\n", + "print('MCC: ',matthews_corrcoef(t3_test, predictions3))\n", + "\n", + "# rescaling new data\n", + "# newArray = new.to_numpy()\n", + "# rescaledNew = scaler.fit_transform(newArray)\n", + "\n", + "# performance on new data (rescaled)\n", + "new_pred3 = logit3.predict(rescaledNew[:,(0,1,2,3,7)])\n", + "\n", + "pred_df3 = pd.DataFrame(new_pred3) \n", + "pred_df3.columns=[\"CLASS\"]\n", + "pred_df3.index.name=\"Index\" \n", + "pred_df3[\"CLASS\"] = pred_df3[\"CLASS\"].map({0:'False',1.0:'True'})\n", + "\n", + "#csv file output\n", + "pred_df3.to_csv(\"ilujumba3.csv\") \n", + "print(pred_df3['CLASS'].unique())\n", + "\n", + "\n", + "#printing the numbers of False and 
True\n", + "print(pred_df3.groupby('CLASS').size()[0].sum())\n", + "print(pred_df3.groupby('CLASS').size()[1].sum())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Using cross-validation and Logistic Regression\n", + "\n", + "During cross-validation, the data is split into k folds. k-1 folds are used for training while the remaining fold is used for testing. The process is repeated k times until each fold has been used as a testing set." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.model_selection import KFold\n", + "from sklearn.model_selection import cross_val_score\n", + "\n", + "kfold = KFold(n_splits=10, random_state=42, shuffle=True)\n", + "model6 = LogisticRegression()\n", + "model6.fit(predictors, target)\n", + "\n", + "results = cross_val_score(model6, predictors, target, cv=kfold)\n", + "print(results.mean())\n", + "\n", + "model6_pred = model6.predict(newArray)\n", + "df6 = pd.DataFrame(model6_pred)\n", + "df6.columns = ['CLASS']\n", + "df6.index.name = 'Index'\n", + "df6['CLASS'] = df6['CLASS'].map({0.0:False, 1.0:True})\n", + "\n", + "df6.to_csv('ilujumba7.csv')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Another thing to keep in mind is to make exhaustive use of simple models on the data. The reason for this is that they have a good reputation in terms of performance on different types of data and their method of prediction is well understood. \n", + "\n", + "\n", + "Another simple algorithm for classification is the Naive Bayes classifier. This algorithm assumes that all features are independent of each other given the class and that each feature contributes equally to the resulting class. 
For this reason, it is called 'naive'.\n", + "\n", + "#### Naive Bayes classifier with kfold cross-validation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.naive_bayes import GaussianNB\n", + "kfold = KFold(n_splits=10, random_state=42, shuffle=True)\n", + "model7 = GaussianNB()\n", + "model7.fit(predictors, target)\n", + "\n", + "results = cross_val_score(model7, predictors, target)\n", + "print(results.mean())\n", + "\n", + "model7_pred = model7.predict(newArray)\n", + "df7 = pd.DataFrame(model7_pred)\n", + "df7.columns = ['CLASS']\n", + "df7.index.name = 'Index'\n", + "df7['CLASS'] = df7['CLASS'].map({0.0:'False', 1.0:'True'})\n", + "\n", + "df7.to_csv('ilujumba7.csv')\n", + "print(df7['CLASS'].unique())\n", + "\n", + "#printing the numbers of False and True\n", + "print(df7.groupby('CLASS').size()[0].sum())\n", + "print(df7.groupby('CLASS').size()[1].sum())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Naive Bayes classifier on rescaled features" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "kfold = KFold(n_splits=10, random_state=42, shuffle=True)\n", + "model8 = GaussianNB()\n", + "model8.fit(rescaledPredictors, target)\n", + "\n", + "results1 = cross_val_score(model8, rescaledPredictors, target)\n", + "print(results1.mean())\n", + "\n", + "model8_pred = model8.predict(rescaledNew)\n", + "df8 = pd.DataFrame(model8_pred)\n", + "df8.columns = ['CLASS']\n", + "df8.index.name = 'Index'\n", + "df8['CLASS'] = df8['CLASS'].map({0.0:'False', 1.0:'True'})\n", + "\n", + "df8.to_csv('ilujumba8.csv')\n", + "print(df8['CLASS'].unique())\n", + "\n", + "#printing the numbers of False and True\n", + "print(df8.groupby('CLASS').size()[0].sum())\n", + "print(df8.groupby('CLASS').size()[1].sum())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Naive Bayes and 
kfold validation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.model_selection import cross_val_predict\n", + "from sklearn.naive_bayes import GaussianNB\n", + "\n", + "kfold = KFold(n_splits=10, random_state=42, shuffle=True)\n", + "model9 = GaussianNB()\n", + "model9.fit(predictors, target)\n", + "\n", + "results = cross_val_score(model9, predictors, target, cv=10) # ten-fold cross validation\n", + "print('mean for results', results.mean())\n", + "\n", + "predic = cross_val_predict(model9, predictors, target, cv=10)\n", + "accuracy = metrics.accuracy_score(target, predic) # accuracy_score is the classification metric; r2_score is meant for regression\n", + "print('cross-predicted accuracy ', accuracy)\n", + "\n", + "model9_pred = model9.predict(newArray)\n", + "df9 = pd.DataFrame(model9_pred)\n", + "df9.columns = ['CLASS']\n", + "df9.index.name = 'Index'\n", + "df9['CLASS'] = df9['CLASS'].map({0.0:'False', 1.0:'True'})\n", + "\n", + "df9.to_csv('ilujumba9.csv')\n", + "print(df9['CLASS'].unique())\n", + "\n", + "#printing the numbers of False and True\n", + "print(df9.groupby('CLASS').size()[0].sum())\n", + "print(df9.groupby('CLASS').size()[1].sum())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This model strategy returned the highest score after submission with a 99.599% accuracy." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Comparing several algorithms to look at the nature of the decision boundaries created" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "https://medium.com/cascade-bio-blog/creating-visualizations-to-better-understand-your-data-and-models-part-2-28d5c46e956" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Classification algorithms define decision boundaries that divide the data points into their respective classes across the feature space they were trained on. 
Visualising them helps one understand the limitations of a given algorithm on a particular dataset.\n", + "Decision boundaries also show how the selected training data affects the performance of the algorithm." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Eleven scikit-learn classifiers were compared: K-Nearest Neighbors, Support Vector Machines (linear and RBF kernels), Gaussian Process Classifier, Decision Tree, Random Forest, Neural Network, AdaBoost, Naive Bayes, Quadratic Discriminant Analysis and Logistic Regression." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#importing classifiers from the sklearn library\n", + "\n", + "from matplotlib.colors import ListedColormap\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.neural_network import MLPClassifier #1\n", + "from sklearn.neighbors import KNeighborsClassifier #2\n", + "from sklearn.svm import SVC #3\n", + "from sklearn.gaussian_process import GaussianProcessClassifier #4\n", + "from sklearn.gaussian_process.kernels import RBF #5\n", + "from sklearn.tree import DecisionTreeClassifier #6\n", + "from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier #7,8\n", + "from sklearn.naive_bayes import GaussianNB #9\n", + "from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis #10\n", + "from sklearn.linear_model import LogisticRegression #11\n", + "\n", + "names = [\"Nearest Neighbors\", \"Linear SVM\", \"RBF SVM\", \"Gaussian Process\",\n", + " \"Decision Tree\", \"Random Forest\", \"Neural Net\", \"AdaBoost\",\n", + " \"Naive Bayes\", \"QDA\", \"Logistic Regression\"]\n", + "\n", + "classifiers = [\n", + " KNeighborsClassifier(3), # holds no assumption on data distribution (non-parametric)\n", + " SVC(kernel=\"linear\", C=0.025), # using a linear kernel\n", + " SVC(gamma=2, C=1), # using radial basis function kernel,C is 
low to enable a large decision margin\n", + " GaussianProcessClassifier(1.0 * RBF(1.0)), # based on Laplace approximation\n", + " DecisionTreeClassifier(),\n", + " RandomForestClassifier(n_estimators=100), # 100 trees in the forest\n", + " MLPClassifier(max_iter=1000), # maximum number of iterations allowed for convergence\n", + " AdaBoostClassifier(), # fits a sequence of weak learners on reweighted versions of the dataset\n", + " GaussianNB(), #NaiveBayes\n", + " QuadraticDiscriminantAnalysis(),\n", + " LogisticRegression()]\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Dimensionality reduction\n", + "\n", + "https://stackabuse.com/dimensionality-reduction-in-python-with-scikit-learn/\n", + "\n", + "https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since the data is multi-dimensional, Principal Component Analysis (PCA) was used to reduce it to two components.\n", + "Trial runs were done to check how much of the variation in the data is explained by the principal components.\n", + "\n", + "Another thing to keep in mind is that PCA works best on standardised/normalised data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# preprocessing the dataset\n", + "dataArray = data.to_numpy()\n", + "X, y = dataArray[:,0:11], dataArray[0:,11]\n", + "X = StandardScaler().fit_transform(X)\n", + "\n", + "# reducing dimensions of the dataset using PCA \n", + "from sklearn.decomposition import PCA\n", + "pca = PCA()\n", + "pca.fit_transform(X)\n", + "pca_variance = pca.explained_variance_ratio_ # fraction of total variance per component, matching the y-axis label\n", + "plt.figure(figsize=(8, 6))\n", + "plt.bar(range(11), pca_variance, alpha=0.5, align='center', label='individual variance')\n", + "plt.legend()\n", + "plt.ylabel('Variance ratio')\n", + "plt.xlabel('Principal components')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + 
"outputs": [], + "source": [ + "pca2 = PCA(0.95) # keeping principal components that explain 95% of the variance\n", + "ninety_five = pca2.fit_transform(X)\n", + "ninety_five.shape" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Explained variance: \", sum(pca2.explained_variance_ratio_))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Eight principal components explain 95% of the variance in the dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pca2 = PCA(3) # keeping the first three principal components\n", + "principalComponents = pca2.fit_transform(X)\n", + "\n", + "from mpl_toolkits.mplot3d import Axes3D\n", + "plt.figure(figsize=(10,6))\n", + "ax = plt.axes(projection='3d')\n", + "ax.scatter(principalComponents[:,0], principalComponents[:,1], principalComponents[:,2], \n", + " linewidths=1, alpha=.5,\n", + " edgecolor='k', s= 200,\n", + " c=data['CLASS'])\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The three principal components were visualised using a 3D plot. The figure above shows how the data points cluster in the space of the three components. 
The classes are not perfectly separated along these components, so the clusters overlap to some extent." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#converting principal component ndarrays to DataFrame format\n", + "principalDf = pd.DataFrame(data = principalComponents, columns = ['PC1', 'PC2','PC3'])\n", + "finalDf = pd.concat([principalDf, data['CLASS']], axis = 1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "finalDf.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print('Variance explained by three PCs: ',sum(pca2.explained_variance_ratio_)*100,'%')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Visualising the top 2 principal components" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fig = plt.figure(figsize = (6,6))\n", + "ax = fig.add_subplot(111) \n", + "ax.set(xlim=(-10,10), ylim=(-10,10))\n", + "ax.set_xlabel('Principal Component 1', fontsize = 15)\n", + "ax.set_ylabel('Principal Component 2', fontsize = 15)\n", + "ax.set_title('top 2 components', fontsize = 20)\n", + "\n", + "targets = [0, 1]\n", + "colors = ['r', 'g']\n", + "\n", + "for label, color in zip(targets,colors): # 'label' avoids overwriting the global 'target' array\n", + " indices = finalDf['CLASS'] == label\n", + " ax.scatter(finalDf.loc[indices, 'PC1']\n", + " , finalDf.loc[indices, 'PC2']\n", + " , c = color\n", + " , s = 50, alpha = 0.4)\n", + "ax.legend(targets)\n", + "ax.grid()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# splitting the data into training and test sets\n", + "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A mesh grid is required. 
This can be thought of as a matrix of coordinates upon which the model will make decisions.\n", + "These are then visualised to reveal decision boundaries.\n", + "The mesh grid was created based on the data and a step size of 0.02." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# creating mesh for the contour plot\n", + "\n", + "h = .02 # step size in the mesh\n", + "x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5\n", + "y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5\n", + "xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Two principal components were used to enable visualisation on a scatter plot.\n", + "\n", + "The parameters for the PCA were generated on the training data and these were applied to both the training and testing sets." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pca4 = PCA(n_components=2)\n", + "\n", + "# fitting PCA on the training set only\n", + "pca4.fit(X_train)\n", + "\n", + "#applying transform on training and testing sets\n", + "train_ = pca4.transform(X_train)\n", + "test_ = pca4.transform(X_test)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Explained variance: \", sum(pca4.explained_variance_ratio_))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "train_.shape, test_.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After transforming the data and creating the mesh grid, decision boundaries for the algorithms were created by iterating over the classifiers." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "figure = plt.figure(figsize=(27, 15))\n", + "i = 1\n", + "\n", + "datasets=[data]\n", + "for ds_cnt, ds in enumerate(datasets):\n", + " # just plot the dataset first\n", + " cm = plt.cm.RdBu\n", + " cm_bright = ListedColormap(['#FF0000', '#0000FF'])\n", + " ax = plt.subplot(len(datasets), len(classifiers) + 1, i)\n", + "\n", + " if ds_cnt == 0:\n", + " ax.set_title(\"Input data\")\n", + " # Plot the top 2 principal components for training data\n", + " ax.scatter(train_[:, 0], train_[:, 1], c=y_train, cmap=cm_bright,\n", + " edgecolors='k')\n", + " # Plot the top 2 principal components for the testing data\n", + " ax.scatter(test_[:, 0], test_[:, 1], c=y_test, cmap=cm_bright, alpha=0.6,\n", + " edgecolors='k')\n", + " ax.set_xlim(xx.min(), xx.max())\n", + " ax.set_ylim(yy.min(), yy.max())\n", + " ax.set_xticks(())\n", + " ax.set_yticks(())\n", + " i += 1\n", + "\n", + " # iterate over classifiers\n", + "\n", + " for name, clf in zip(names, classifiers):\n", + " ax = plt.subplot(len(datasets), len(classifiers) + 1, i)\n", + " clf.fit(train_, y_train)\n", + " score = clf.score(test_, y_test)\n", + "\n", + " # Plot the decision boundary. 
For that, we will assign a color to each\n", + " # point in the mesh [x_min, x_max]x[y_min, y_max].\n", + "\n", + " if hasattr(clf, \"decision_function\"):\n", + " Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) # confidence scores\n", + " else:\n", + " Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1] # probability estimates\n", + "\n", + " # Put the result into a color plot\n", + " Z = Z.reshape(xx.shape)\n", + " ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)\n", + "\n", + " # Plot the training points\n", + " ax.scatter(train_[:, 0], train_[:, 1], c=y_train, cmap=cm_bright, edgecolors='k')\n", + " # Plot the testing points\n", + " ax.scatter(test_[:, 0], test_[:, 1], c=y_test, cmap=cm_bright, edgecolors='k', alpha=0.4)\n", + "\n", + " ax.set_xlim(xx.min(), xx.max())\n", + " ax.set_ylim(yy.min(), yy.max())\n", + " ax.set_xticks(())\n", + " ax.set_yticks(())\n", + " if ds_cnt == 0:\n", + " ax.set_title(name)\n", + " ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'), size=15, horizontalalignment='right')\n", + " i += 1\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The accuracy of each algorithm is indicated in the lower right corner of its plot.\n", + "\n", + "The plots show training points in solid colors and testing points semi-transparent. Decision boundaries for GaussianProcessClassifier, RandomForest and AdaBoost are complicated while the decision boundaries for LogisticRegression, NaiveBayes, NeuralNetwork, and Linear SVM are simpler. 
GaussianProcess and RBF SVM have contoured decision boundaries which separate points with similar characteristics.\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/Assignment Colab/Ibra Lujumba_kaggle.ipynb b/Assignment Colab/Ibra Lujumba_kaggle.ipynb new file mode 100644 index 0000000..4c4273f --- /dev/null +++ b/Assignment Colab/Ibra Lujumba_kaggle.ipynb @@ -0,0 +1,1439 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## ACE-Uganda Kaggle competition\n", + "### by Ibra Lujumba\n", + "#### 2019/HD07/27842U" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Contents\n", + "\n", + "- Exploratory Data Analysis\n", + "\n", + "- Data transformation\n", + "\n", + "- Feature selection\n", + "\n", + "- Building a machine learning classifier\n", + "\n", + "- Measuring performance of a classifier\n", + "\n", + "- Machine learning is an iterative process\n", + "\n", + "- Visualising decision boundaries" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 1. Exploratory Data Analysis " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Exploratory data analysis (EDA) is the first step towards building a model that is capable of making predictions about the data. This is done to understand the properties of the data before any machine learning algorithm is used to make predictions with it.\n", + "\n", + "The go-to libraries for manipulating data in Python are **numpy** and **pandas**. **matplotlib** and **seaborn** are used for data visualisation. 
These libraries are imported using the **import** keyword." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Importing python modules for data analysis and visualization" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np # manipulation of arrays\n", + "import pandas as pd # manipulating dataframes\n", + "import matplotlib.pyplot as plt # data visualisation\n", + "import seaborn as sb # data visualisation, built on matplotlib" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Verbose warnings from Python functions are suppressed here to keep the notebook output readable. They can arise when the data being manipulated does not exactly match what a function expects as input." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#ignoring warnings that may arise\n", + "import warnings\n", + "warnings.filterwarnings('ignore')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Importing the datasets\n", + "The pandas library can read data from delimited text files. The delimiter may be a tab (\\t), comma (,), semi-colon (;) or another character, which is specified using the 'sep=' argument.\n", + "For comma-separated (csv) files, the pd.read_csv() function is used. The default delimiter is a comma so there is no need to explicitly specify it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!ls ../input/ace-class-assignment/ # '!' 
specifies that this is not a python command \n", + " # and should be executed outside the notebook\n", + "\n", + "# reading in the data\n", + "data = pd.read_csv('../input/ace-class-assignment/AMP_TrainSet.csv')\n", + "new = pd.read_csv('../input/ace-class-assignment/Test.csv')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Checking the dimensions of the data as well as the datatype of each column\n", + "The dimensions of the dataframe are the number of rows and columns, represented in the format (rows, columns)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# checking dimensions of the datasets\n", + "data.shape, new.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is important to check the type of data stored in the dataframe. Machine learning algorithms utilise numeric data to make predictions.\n", + "In cases where a non-numeric data type exists, it should be converted to numeric values through binarisation or numeric coding for the different classes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# checking the datatypes of the variables\n", + "data.dtypes, new.dtypes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "All the values in all the variables exist as either floats or integers.\n", + "\n", + "##### Proceeding to work with the training dataset to build the classifier" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Descriptive statistics of the training dataset, such as the arithmetic mean, standard deviation, quartiles and number of non-NA values in each column, are computed next. These values inform what the next steps should be. 
\n", + "These steps include:\n", + "- cleaning of the dataset to remove rows that don't meet preset criteria\n", + "- taking care of skewed distributions\n", + "- handling missing values" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data.describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From the count, every cell in the dataset has a value, i.e. there are no missing values for any of the variables.\n", + "AS_DAYM780201 and FULL_DAYM780201 have the highest mean and highest maximum. FULL_OOBM850104 has a negative mean.\n", + "For all the variables, the data points are not widely spread." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# checking the proportions of classes\n", + "data.groupby('CLASS').size()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Both classes have an equal number of entries.\n", + "\n", + "Pairwise correlation values for the variables in the training dataset were obtained to check whether the variables are independent of each other or whether multicollinearity exists within the data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# use this resource to understand the output https://realpython.com/numpy-scipy-pandas-correlation-python/#pearson-correlation-coefficient\n", + "pearsoncorr = data.corr(method='pearson')\n", + "\n", + "# visualizing the correlation matrix as a heatmap to make interpretation easier\n", + "plt.figure(figsize=(10,10))\n", + "top_corr = pearsoncorr.index\n", + "sb.heatmap(pearsoncorr, \n", + " xticklabels=pearsoncorr.columns,\n", + " yticklabels=pearsoncorr.columns,\n", + " cmap='RdYlGn',\n", + " annot=True,\n", + " linewidth=0.5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Looking at the last row, FULL_Charge and AS_MeanAmphiMoment 
have the highest positive correlation values with CLASS whereas the second, third and fourth variables have the most negative correlation values." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can get the p-values associated with the correlation values using the code below.\n", + "\n", + "`from scipy.stats import pearsonr`\n", + "\n", + "`data.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(len(data.columns))`\n", + "\n", + "Getting the *p-value* associated with a correlation value is useful since it indicates whether the observed correlation between variables is statistically significant or not. Normally, significance is accepted if the *p-value* is below a chosen threshold (commonly 0.05).\n", + "\n", + "In this example, all the p-values for the observed correlations were significant." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# using a scatter plot matrix to visualise correlations\n", + "plt.figure(figsize=(40,40))\n", + "sb.pairplot(data)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Some variables are significantly correlated with each other, which raises the problem of multicollinearity (variables are correlated with each other as well as with the response variable).\n", + "These variables are Full_Charge, FULL_AcidicMolPer, FULL_AURR980107,...\n", + "\n", + "Variables that require further investigation: NT_EFC195, AS_MeanAmphiMoment" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "len(data['AS_MeanAmphiMoment'].unique()), data['NT_EFC195'].unique()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This confirms that NT_EFC195 is a categorical variable with two classes, 0 and 1." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data[['CLASS','NT_EFC195']].head()" + ] + }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "NT_EFC195 assumes both values irrespective of class\n", + "\n", + "Getting the associated p-values to check the correlation of features with the class. \n", + "The value of 1 at the bottom should be ignored since these values are obtained from the last column of a matrix of pairwise correlations therefore it corresponds to self correlation " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from scipy.stats import pearsonr\n", + "data.corr(method=lambda x, y: pearsonr(x, y)[1])['CLASS']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# checking the distribution and skewness of variables\n", + "plt.figure(figsize=(10,6))\n", + "data.skew().plot(kind='bar')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Most of the variables are minimally skewed except NT_EFC195. Further checks will be done to try to understand the properties of this variable.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data.groupby('NT_EFC195').size() # majority of the instances are of Class 0." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The skewness of this variable is explained by most of its values being zero.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data.plot(kind='density', subplots=True, layout=(4,3), figsize=(10,10))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Values for AS_FUK010112, CT_RACS820104, FULL_GEOR030101 and FULL_AURR980107 lie close to zero compared to the rest of the variables.\n", + "\n", + "Transformation possibilities\n", + "* using the minimum and maximum scaler\n", + "* standardisation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 2. Data transformation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Better performance of algorithms can be obtained if the data is transformed.\n", + "Some algorithms may treat features with large values as the most important ones, which biases the predictions since they are then based mainly on those variables." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The predictor variables are separated from the target variable. Transformation should not be applied to the target variable.\n", + "\n", + "Operations on data are carried out on numpy ndarrays. The pandas dataframe is first converted to an ndarray." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dataArray = data.to_numpy()\n", + "\n", + "# separating the predictor and response variables\n", + "target = dataArray[:,11]\n", + "predictors = dataArray[:,0:11]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Numeric data may be rescaled (all values are put within a specified range) or standardised to have a mean of zero and a standard deviation of one.\n", + "Transformations have the advantage of improving performance of machine learning classifiers.\n", + "\n", + "Rescaling is done using the MinMaxScaler class while standardising is done using the StandardScaler class. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# using MinMaxScaler to set all values between 0 and 1\n", + "from sklearn.preprocessing import MinMaxScaler\n", + "scaler = MinMaxScaler(feature_range=(0,1))\n", + "rescaledPredictors = scaler.fit_transform(predictors)\n", + " \n", + "\n", + "# using StandardScaler\n", + "from sklearn.preprocessing import StandardScaler\n", + "scaler1 = StandardScaler().fit(predictors)\n", + "standardizedPredictors = scaler1.transform(predictors)\n", + " \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 3. Feature selection\n", + "This is done to identify which features contribute significantly to the prediction of class. Once these features are known, they can be used to predict class. 
This has the advantage of reducing training time, reducing overfitting as well as improving accuracy of the model since redundant features are not used in the prediction process.\n", + "\n", + "On this dataset, a univariate statistic, the F-test, was used as an alternative to the chi-squared test (chi2 in scikit-learn requires non-negative feature values and returns an error otherwise).\n", + "\n", + "The F-test and feature selection were done on both the original and transformed data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# untransformed data\n", + "from sklearn.feature_selection import SelectKBest, f_classif\n", + "bestFeatures = SelectKBest(score_func=f_classif, k=7)\n", + "fit = bestFeatures.fit(predictors, target)\n", + "\n", + "scores = pd.DataFrame(fit.scores_) \n", + "pvalues = pd.DataFrame(fit.pvalues_)\n", + "columns = pd.DataFrame(data.columns[0:11])\n", + "\n", + "featureValues = pd.concat([columns,scores, pvalues,], axis=1) # concatenating dataframes\n", + "featureValues.columns = ['predictor', 'score', 'pvalue'] # naming the columns\n", + "\n", + "print(featureValues.nlargest(7, 'score'))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# checking the transformed data\n", + "\n", + "# rescaledPredictors\n", + "reBestFeatures = SelectKBest(score_func=f_classif, k=7)\n", + "reFit = reBestFeatures.fit(rescaledPredictors, target)\n", + "\n", + "reScores = pd.DataFrame(reFit.scores_) \n", + "rePvalues = pd.DataFrame(reFit.pvalues_)\n", + "reColumns = pd.DataFrame(data.columns[0:11])\n", + "\n", + "reFeatureValues = pd.concat([reColumns,reScores, rePvalues,], axis=1) # concatenating dataframes\n", + "reFeatureValues.columns = ['re_predictor', 're_score', 're_pvalue'] # naming the columns\n", + "\n", + "\n", + "\n", + "# standardizedPredictors\n", + "stBestFeatures = SelectKBest(score_func=f_classif, k=7)\n", + "stFit = 
stBestFeatures.fit(standardizedPredictors, target)\n", + "\n", + "stScores = pd.DataFrame(stFit.scores_) \n", + "stPvalues = pd.DataFrame(stFit.pvalues_)\n", + "stColumns = pd.DataFrame(data.columns[0:11])\n", + "\n", + "stFeatureValues = pd.concat([stColumns,stScores, stPvalues,], axis=1) # concatenating dataframes\n", + "stFeatureValues.columns = ['st_predictor', 'st_score', 'st_pvalue'] # naming the columns\n", + "\n", + "print(reFeatureValues.nlargest(7, 're_score')), print(stFeatureValues.nlargest(7, 'st_score'))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The importance of different features can be ranked using the ExtraTreesClassifier, which implements Extremely Randomized Trees. In this classifier, the candidate features and the split thresholds are selected at random; hence the name. Since split thresholds are drawn at random for each candidate feature rather than optimised, it is less computationally expensive than a Random Forest." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# using feature importance\n", + "from sklearn.ensemble import ExtraTreesClassifier\n", + "model = ExtraTreesClassifier()\n", + "model.fit(predictors, target)\n", + "print(model.feature_importances_)\n", + "\n", + "# visualising feature importance\n", + "importances = pd.Series(model.feature_importances_, index=data.columns[0:11])\n", + "importances.nlargest(10).plot(kind='barh')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 4. Building the classification model\n", + "Finding the right model for a given dataset is an iterative process. 
It is advisable to try many combinations of data and algorithms to find one that offers satisfactory results on a given dataset.\n", + "It cannot be expected that an algorithm that performs well on a particular dataset will perform the same way on another dataset.\n", + "Therefore, an exhaustive approach is necessary to find the best performing algorithm for a given dataset.\n", + "\n", + "Some of these combinations are (for original and transformed data):\n", + "- model + train/test sets\n", + "- model + k-fold cross-validation\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Splitting the dataset into training and test sets and using logistic regression to classify instances\n", + "from sklearn.model_selection import train_test_split # random split\n", + "from sklearn.linear_model import LogisticRegression # scikit-learn models are implemented as classes\n", + "p_train, p_test, t_train, t_test = train_test_split(predictors, target, \n", + " test_size=0.30,random_state=42)\n", + "\n", + "logit = LogisticRegression() # making an instance of the model\n", + "\n", + "# fitting the model on untransformed data\n", + "logit.fit(p_train, t_train)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 5. Measuring model performance\n", + "Logistic regression is a supervised learning algorithm. 
Performance of the algorithm is measured as the number of correct predictions out of the total number of predictions made.\n", + "We can measure the performance of a classifier using precision, F1 score, the Receiver Operating Characteristic (ROC) curve, accuracy and the Matthews Correlation Coefficient.\n", + "\n", + "Other performance metrics are:\n", + "- True Positives (TP) / True Positive Rate (TPR): Number of correct positive predictions / Probability of predicting positive given that the actual class is positive\n", + "- False Negatives (FN) / False Negative Rate (FNR): Number of wrong negative predictions / Probability of predicting negative given that the actual class is positive\n", + "- True Negatives (TN) / True Negative Rate (TNR): Number of correct negative predictions / Probability of predicting negative given that the actual class is negative\n", + "- False Positives (FP) / False Positive Rate (FPR): Number of wrong positive predictions / Probability of predicting positive given that the actual class is negative\n", + "- Precision (P): Proportion of predicted positives that are correct\n", + "- Recall (R): Proportion of actual positives captured" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# predict on test data\n", + "predictions = logit.predict(p_test) " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# the confusion matrix\n", + "from sklearn import metrics\n", + "cm = metrics.confusion_matrix(t_test, predictions)\n", + "cm\n", + "sb.heatmap(cm, annot=True, fmt='d', linewidths=.5, # cells hold integer counts\n", + " square=True, cmap='Blues') \n", + "plt.ylabel('Actual label'); plt.xlabel('Predicted label')\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# performance metrics\n", + "print(\"Accuracy: \",metrics.accuracy_score(t_test, predictions)*100)\n", + 
"print(\"Precision: \",metrics.precision_score(t_test, predictions)*100)\n", + "print(\"Recall: \",metrics.recall_score(t_test, predictions)*100)\n", + "\n", + "from sklearn.metrics import matthews_corrcoef\n", + "print('MCC: ',matthews_corrcoef(t_test, predictions)) # takes into account true and false positives and negatives, \n", + " # higher values are better\n", + "# not affected by unbalanced classes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Performance metrics are inter-related and can be visualised using an ROC curve, where the true positive rate is plotted against the false positive rate. Essentially, the curve shows the tradeoff between sensitivity and specificity.\n", + "The maximum area under the curve (AUC) is 1, which corresponds to a classifier that ranks every positive instance above every negative one; higher values of AUC are therefore desired." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pred_probs = logit.predict_proba(p_test)[::,1] # all rows, column 1: probability of the positive class\n", + "fpr, tpr,_ = metrics.roc_curve(t_test, pred_probs)\n", + "auc = metrics.roc_auc_score(t_test, pred_probs)\n", + "plt.plot(fpr, tpr, label = 'Untransformed+all Var, auc='+ str(auc))\n", + "plt.legend(loc=4)\n", + "plt.ylabel('tpr'), plt.xlabel('fpr')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The confusion matrix above shows the number of true positives, true negatives, false positives as well as false negatives.\n", + "The model made a total of 75 wrong class predictions. It can be seen that logistic regression performs well on the original dataset, where all features are utilised in the prediction of the class.\n", + "\n", + "Something to keep in mind is that good performance on the training set is not a reliable indicator of how the model will perform when new data is presented to it. However, good performance is still desirable." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# performance on new data\n", + "new_pred = logit.predict(new.values)\n", + "\n", + "pred_df = pd.DataFrame(new_pred) \n", + "pred_df.columns=[\"CLASS\"]\n", + "pred_df.index.name=\"Index\" \n", + "pred_df[\"CLASS\"] = pred_df[\"CLASS\"].map({0:'False',1.0:'True'})\n", + "\n", + "#csv file output\n", + "pred_df.to_csv(\"ilujumba.csv\") \n", + "print(pred_df['CLASS'].unique())\n", + "\n", + "#printing the numbers of False and True\n", + "print(pred_df.groupby('CLASS').size()[0].sum())\n", + "print(pred_df.groupby('CLASS').size()[1].sum())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After the csv file was submitted to the competition, the model had an 83.33% accuracy." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 6. Machine learning is an iterative process\n", + "\n", + "Different combinations of logistic regression with the data were tried.\n", + "- train_test split\n", + "\n", + " model + rescaled data\n", + " \n", + " model + standardised data\n", + "\n", + "\n", + "- kfold crossvalidation\n", + "\n", + " model + rescaled data\n", + " \n", + " model + standardised data\n", + "\n", + "#### Logistic regression on rescaled data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "p1_train, p1_test, t1_train, t1_test = train_test_split(rescaledPredictors, target, \n", + " test_size=0.30,random_state=42)\n", + "\n", + "logit1 = LogisticRegression() # making instance of model\n", + "\n", + "# fitting the model on rescaled data\n", + "logit1.fit(p1_train, t1_train)\n", + "\n", + "# predict on test data\n", + "predictions1 = logit1.predict(p1_test)\n", + "\n", + "# performance metrics\n", + "print(\"Accuracy: \",metrics.accuracy_score(t1_test, predictions1)*100)\n", + "print(\"Precision: \",metrics.precision_score(t1_test, 
predictions1)*100)\n", + "print(\"Recall: \",metrics.recall_score(t1_test, predictions1)*100)\n", + "\n", + "from sklearn.metrics import matthews_corrcoef\n", + "print('MCC: ',matthews_corrcoef(t1_test, predictions1))\n", + "\n", + "# rescaling new data\n", + "newArray = new.to_numpy()\n", + "rescaledNew = scaler.fit_transform(newArray)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# performance on new data (rescaled)\n", + "new_pred1 = logit1.predict(rescaledNew)\n", + "\n", + "pred_df1 = pd.DataFrame(new_pred1) \n", + "pred_df1.columns=[\"CLASS\"]\n", + "pred_df1.index.name=\"Index\" \n", + "pred_df1[\"CLASS\"] = pred_df1[\"CLASS\"].map({0:'False',1.0:'True'})\n", + "\n", + "#csv file output\n", + "pred_df1.to_csv(\"ilujumba1.csv\") \n", + "print(pred_df1['CLASS'].unique())\n", + "\n", + "#printing the numbers of False and True\n", + "print(pred_df1.groupby('CLASS').size()[0].sum())\n", + "print(pred_df1.groupby('CLASS').size()[1].sum())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This combination returned an 88.11% accuracy after submission." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Logistic regression on standardized data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "p2_train, p2_test, t2_train, t2_test = train_test_split(standardizedPredictors, target, \n", + " test_size=0.30,random_state=42)\n", + "\n", + "logit2 = LogisticRegression() # making instance of model\n", + "\n", + "# fitting the model on standardized data\n", + "logit2.fit(p2_train, t2_train)\n", + "\n", + "# predict on test data\n", + "predictions2 = logit2.predict(p2_test)\n", + "\n", + "# performance metrics\n", + "print(\"Accuracy: \",metrics.accuracy_score(t2_test, predictions2)*100)\n", + "print(\"Precision: \",metrics.precision_score(t2_test, predictions2)*100)\n", + "print(\"Recall: \",metrics.recall_score(t2_test, predictions2)*100)\n", + "print('MCC: ',matthews_corrcoef(t2_test, predictions2))\n", + "\n", + "# standardizing new data\n", + "standardizedNew = scaler1.transform(newArray)\n", + "\n", + "# performance on new data (standardized)\n", + "new_pred2 = logit2.predict(standardizedNew)\n", + "\n", + "pred_df2 = pd.DataFrame(new_pred2) \n", + "pred_df2.columns=[\"CLASS\"]\n", + "pred_df2.index.name=\"Index\" \n", + "pred_df2[\"CLASS\"] = pred_df2[\"CLASS\"].map({0:'False',1.0:'True'})\n", + "\n", + "#csv file output\n", + "pred_df2.to_csv(\"ilujumba2.csv\") \n", + "print(pred_df2['CLASS'].unique())\n", + "\n", + "#printing the numbers of False and True\n", + "print(pred_df2.groupby('CLASS').size()[0].sum())\n", + "print(pred_df2.groupby('CLASS').size()[1].sum())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This model strategy returned an 86.03% accuracy after submission, which is better than when the original data is used but slightly lower than when rescaled features are used." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Using selected features, rescaled data and Logistic regression\n", + "\n", + "An attempt was made to manually select for features that showed the highest importance in the prediction of class. All further strategies on data using Logistic Regression were done using rescaled data with the assumption that it could provide better model performance results when compared to other transformations of the data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "p3_train, p3_test, t3_train, t3_test = train_test_split(rescaledPredictors[:,(0,1,2,3,7)], target, \n", + " test_size=0.30,random_state=42)\n", + "\n", + "logit3 = LogisticRegression() # making instance of model\n", + "\n", + "# fitting the model on rescaled data\n", + "logit3.fit(p3_train, t3_train)\n", + "\n", + "# predict on test data\n", + "predictions3 = logit3.predict(p3_test)\n", + "\n", + "# performance metrics\n", + "print(\"Accuracy: \",metrics.accuracy_score(t3_test, predictions3)*100)\n", + "print(\"Precision: \",metrics.precision_score(t3_test, predictions3)*100)\n", + "print(\"Recall: \",metrics.recall_score(t3_test, predictions3)*100)\n", + "\n", + "from sklearn.metrics import matthews_corrcoef\n", + "print('MCC: ',matthews_corrcoef(t3_test, predictions3))\n", + "\n", + "# rescaling new data\n", + "# newArray = new.to_numpy()\n", + "# rescaledNew = scaler.fit_transform(newArray)\n", + "\n", + "# performance on new data (rescaled)\n", + "new_pred3 = logit3.predict(rescaledNew[:,(0,1,2,3,7)])\n", + "\n", + "pred_df3 = pd.DataFrame(new_pred3) \n", + "pred_df3.columns=[\"CLASS\"]\n", + "pred_df3.index.name=\"Index\" \n", + "pred_df3[\"CLASS\"] = pred_df3[\"CLASS\"].map({0:'False',1.0:'True'})\n", + "\n", + "#csv file output\n", + "pred_df3.to_csv(\"ilujumba3.csv\") \n", + "print(pred_df3['CLASS'].unique())\n", + "\n", + "\n", + "#printing the numbers of False and 
True\n", + "print(pred_df3.groupby('CLASS').size()[0].sum())\n", + "print(pred_df3.groupby('CLASS').size()[1].sum())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Using cross-validation and Logistic Regression\n", + "\n", + "During cross-validation, the data is split into k folds. k-1 folds are used for training while the remaining fold is used for testing. The process is repeated k times until each fold has been used as a testing set." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.model_selection import KFold\n", + "from sklearn.model_selection import cross_val_score\n", + "\n", + "kfold = KFold(n_splits=10, random_state=42, shuffle=True) # random_state only takes effect when shuffle=True\n", + "model6 = LogisticRegression()\n", + "model6.fit(predictors, target)\n", + "\n", + "results = cross_val_score(model6, predictors, target, cv=kfold) # use the ten folds defined above\n", + "print(results.mean())\n", + "\n", + "model6_pred = model6.predict(newArray) # model6 was fitted on untransformed predictors, so predict on untransformed new data\n", + "df6 = pd.DataFrame(model6_pred)\n", + "df6.columns = ['CLASS']\n", + "df6.index.name = 'Index'\n", + "df6['CLASS'] = df6['CLASS'].map({0.0:False, 1.0:True})\n", + "\n", + "df6.to_csv('ilujumba6.csv') # separate file name so the Naive Bayes output below does not overwrite it" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Another thing to keep in mind is to make exhaustive use of simple models on the data. The reason for this is that they have a good reputation in terms of performance on different types of data and their method of prediction is well understood. \n", + "\n", + "\n", + "Another simple algorithm for classification is the Naive Bayes classifier. This algorithm assumes that all features are independent of each other given the class, and that each feature contributes equally to the resulting class. 
For this reason, it is called 'naive'.\n", + "\n", + "#### Naive Bayes classifier with kfold cross-validation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.naive_bayes import GaussianNB\n", + "kfold = KFold(n_splits=10, random_state=42, shuffle=True)\n", + "model7 = GaussianNB()\n", + "model7.fit(predictors, target)\n", + "\n", + "results = cross_val_score(model7, predictors, target)\n", + "print(results.mean())\n", + "\n", + "model7_pred = model7.predict(newArray)\n", + "df7 = pd.DataFrame(model7_pred)\n", + "df7.columns = ['CLASS']\n", + "df7.index.name = 'Index'\n", + "df7['CLASS'] = df7['CLASS'].map({0.0:'False', 1.0:'True'})\n", + "\n", + "df7.to_csv('ilujumba7.csv')\n", + "print(df7['CLASS'].unique())\n", + "\n", + "#printing the numbers of False and True\n", + "print(df7.groupby('CLASS').size()[0].sum())\n", + "print(df7.groupby('CLASS').size()[1].sum())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Naive Bayes classifier on rescaled features" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "kfold = KFold(n_splits=10, random_state=42, shuffle=True)\n", + "model8 = GaussianNB()\n", + "model8.fit(rescaledPredictors, target)\n", + "\n", + "results1 = cross_val_score(model8, rescaledPredictors, target)\n", + "print(results1.mean())\n", + "\n", + "model8_pred = model8.predict(rescaledNew)\n", + "df8 = pd.DataFrame(model8_pred)\n", + "df8.columns = ['CLASS']\n", + "df8.index.name = 'Index'\n", + "df8['CLASS'] = df8['CLASS'].map({0.0:'False', 1.0:'True'})\n", + "\n", + "df8.to_csv('ilujumba8.csv')\n", + "print(df8['CLASS'].unique())\n", + "\n", + "#printing the numbers of False and True\n", + "print(df8.groupby('CLASS').size()[0].sum())\n", + "print(df8.groupby('CLASS').size()[1].sum())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Naive Bayes and 
k-fold cross-validation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.model_selection import cross_val_predict\n", + "from sklearn.naive_bayes import GaussianNB\n", + "\n", + "kfold = KFold(n_splits=10, random_state=42, shuffle=True)\n", + "model9 = GaussianNB()\n", + "model9.fit(predictors, target)\n", + "\n", + "results = cross_val_score(model9, predictors, target, cv =10) # ten-fold cross validation\n", + "print('mean for results', results.mean())\n", + "\n", + "predic = cross_val_predict(model9, predictors, target, cv =10)\n", + "accuracy = metrics.accuracy_score(target, predic) # accuracy, not R^2, is the appropriate score for class labels\n", + "print('cross-predicted accuracy ', accuracy)\n", + "\n", + "model9_pred = model9.predict(newArray)\n", + "df9 = pd.DataFrame(model9_pred)\n", + "df9.columns = ['CLASS']\n", + "df9.index.name = 'Index'\n", + "df9['CLASS'] = df9['CLASS'].map({0.0:'False', 1.0:'True'})\n", + "\n", + "df9.to_csv('ilujumba9.csv')\n", + "print(df9['CLASS'].unique())\n", + "\n", + "#printing the numbers of False and True\n", + "print(df9.groupby('CLASS').size()[0].sum())\n", + "print(df9.groupby('CLASS').size()[1].sum())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This model strategy returned the highest score after submission with a 99.599% accuracy." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Comparing several algorithms to look at the nature of the decision boundaries created" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "https://medium.com/cascade-bio-blog/creating-visualizations-to-better-understand-your-data-and-models-part-2-28d5c46e956" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Classifiers define decision boundaries that divide the feature space they were trained on into regions, one for each class. 
Visualising these boundaries enables one to understand the limitations of a given algorithm on a given dataset.\n", + "Thus decision boundaries enable one to understand how the selected training data affects the performance of the algorithm." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Eleven sklearn classifiers, both linear and non-linear, were compared: K-Nearest Neighbors, Support Vector Machines (linear and RBF kernels), Gaussian Process Classifier, Decision Tree, Random Forest, Neural Network, AdaBoost, Naive Bayes, Quadratic Discriminant Analysis and Logistic Regression." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#importing classifiers from the sklearn library\n", + "\n", + "from matplotlib.colors import ListedColormap\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.neural_network import MLPClassifier #1\n", + "from sklearn.neighbors import KNeighborsClassifier #2\n", + "from sklearn.svm import SVC #3\n", + "from sklearn.gaussian_process import GaussianProcessClassifier #4\n", + "from sklearn.gaussian_process.kernels import RBF #5\n", + "from sklearn.tree import DecisionTreeClassifier #6\n", + "from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier #7,8\n", + "from sklearn.naive_bayes import GaussianNB #9\n", + "from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis #10\n", + "from sklearn.linear_model import LogisticRegression #11\n", + "\n", + "names = [\"Nearest Neighbors\", \"Linear SVM\", \"RBF SVM\", \"Gaussian Process\",\n", + " \"Decision Tree\", \"Random Forest\", \"Neural Net\", \"AdaBoost\",\n", + " \"Naive Bayes\", \"QDA\", \"Logistic Regression\"]\n", + "\n", + "classifiers = [\n", + " KNeighborsClassifier(3), # holds no assumption on data distribution (non-parametric)\n", + " SVC(kernel=\"linear\", C=0.025), # using a linear kernel\n", + " SVC(gamma=2, C=1), # using radial basis function kernel,C is 
low to enable a large decision margin\n", + " GaussianProcessClassifier(1.0 * RBF(1.0)), # based on Laplace approximation\n", + " DecisionTreeClassifier(),\n", + " RandomForestClassifier(n_estimators=100), # 100 trees in the forest\n", + " MLPClassifier(max_iter=1000), # maximum number of iterations allowed for convergence\n", + " AdaBoostClassifier(), # fits multiple classifiers on the same dataset\n", + " GaussianNB(), #NaiveBayes\n", + " QuadraticDiscriminantAnalysis(),\n", + " LogisticRegression()]\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Dimensionality reduction\n", + "\n", + "https://stackabuse.com/dimensionality-reduction-in-python-with-scikit-learn/\n", + "\n", + "https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since the data is multi-dimensional, Principal Component Analysis (PCA) was used to reduce it to two components.\n", + "Trial runs were done to check how much of the variation in the data is explained by the principal components.\n", + "\n", + "Another thing to keep in mind is that PCA works best on standardised/normalised data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# preprocessing the dataset\n", + "from sklearn.preprocessing import StandardScaler\n", + "dataArray = data.to_numpy()\n", + "X, y = dataArray[:,0:11], dataArray[0:,11]\n", + "X = StandardScaler().fit_transform(X)\n", + "\n", + "# reducing dimensions of the dataset using PCA \n", + "from sklearn.decomposition import PCA\n", + "pca = PCA()\n", + "pca.fit_transform(X)\n", + "pca_variance = pca.explained_variance_ratio_ # fraction of total variance per component, matching the axis label\n", + "plt.figure(figsize=(8, 6))\n", + "plt.bar(range(11), pca_variance, alpha=0.5, align='center', label='individual variance')\n", + "plt.legend()\n", + "plt.ylabel('Variance ratio')\n", + "plt.xlabel('Principal components')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + 
"outputs": [], + "source": [ + "pca2 = PCA(0.95) # keeping principal components that explain 95% of the variance\n", + "ninety_five = pca2.fit_transform(X)\n", + "ninety_five.shape" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Explained variance: \", sum(pca2.explained_variance_ratio_))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Eight principal components explain 95% of the variance in the dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pca2 = PCA(3) # keeping three principal components\n", + "principalComponents = pca2.fit_transform(X)\n", + "\n", + "from mpl_toolkits.mplot3d import Axes3D\n", + "plt.figure(figsize=(10,6))\n", + "ax = plt.axes(projection='3d')\n", + "ax.scatter(principalComponents[:,0], principalComponents[:,1], principalComponents[:,2], \n", + " linewidths=1, alpha=.5,\n", + " edgecolor='k', s= 200,\n", + " c=data['CLASS'])\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The three principal components were visualised using a 3D plot. The figure above shows how the samples cluster along the three components. 
The two classes are not perfectly separated along these components, so the clusters overlap to some extent." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#converting principal component ndarrays to DataFrame format\n", + "principalDf = pd.DataFrame(data = principalComponents, columns = ['PC1', 'PC2','PC3'])\n", + "finalDf = pd.concat([principalDf, data['CLASS']], axis = 1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "finalDf.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print('Variance explained by three PCs: ',sum(pca2.explained_variance_ratio_)*100,'%')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Visualising the top 2 principal components" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fig = plt.figure(figsize = (6,6))\n", + "ax = fig.add_subplot(111) \n", + "ax.set(xlim=(-10,10), ylim=(-10,10))\n", + "ax.set_xlabel('Principal Component 1', fontsize = 15)\n", + "ax.set_ylabel('Principal Component 2', fontsize = 15)\n", + "ax.set_title('top 2 components', fontsize = 20)\n", + "\n", + "targets = [0, 1]\n", + "colors = ['r', 'g']\n", + "\n", + "# 'label' is used as the loop variable so the global 'target' array is not overwritten\n", + "for label, color in zip(targets,colors):\n", + " indices = finalDf['CLASS'] == label\n", + " ax.scatter(finalDf.loc[indices, 'PC1']\n", + " , finalDf.loc[indices, 'PC2']\n", + " , c = color\n", + " , s = 50, alpha = 0.4)\n", + "ax.legend(targets)\n", + "ax.grid()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# splitting the data into training and test parts\n", + "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A mesh grid is required. 
This can be thought of as a matrix of coordinates upon which the model will make decisions.\n", + "These are then visualised to reveal decision boundaries.\n", + "The mesh grid was created based on the data and a step size of 0.02." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# creating mesh for the contour plot\n", + "\n", + "h = .02 # step size in the mesh\n", + "x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5\n", + "y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5\n", + "xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Two principal components were used to enable visualisation on a scatter plot.\n", + "\n", + "The parameters for the PCA were generated on the training data and these were applied to both the training and testing sets." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pca4 = PCA(n_components=2)\n", + "\n", + "# applying PCA on training set\n", + "pca4.fit(X_train)\n", + "\n", + "#applying transform on training and testing sets\n", + "train_ = pca4.transform(X_train)\n", + "test_ = pca4.transform(X_test)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Explained variance: \", sum(pca4.explained_variance_ratio_))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "train_.shape, test_.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After transforming the data and creating the meshgrid, decision boundaries for the algorithms were created by iterating over the classifiers." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "figure = plt.figure(figsize=(27, 15))\n", + "i = 1\n", + "\n", + "datasets=[data]\n", + "for ds_cnt, ds in enumerate(datasets):\n", + " # just plot the dataset first\n", + " cm = plt.cm.RdBu\n", + " cm_bright = ListedColormap(['#FF0000', '#0000FF'])\n", + " ax = plt.subplot(len(datasets), len(classifiers) + 1, i)\n", + "\n", + " if ds_cnt == 0:\n", + " ax.set_title(\"Input data\")\n", + " # Plot the top 2 principal components for training data\n", + " ax.scatter(train_[:, 0], train_[:, 1], c=y_train, cmap=cm_bright,\n", + " edgecolors='k')\n", + " # Plot the top 2 principal components for the testing data\n", + " ax.scatter(test_[:, 0], test_[:, 1], c=y_test, cmap=cm_bright, alpha=0.6,\n", + " edgecolors='k')\n", + " ax.set_xlim(xx.min(), xx.max())\n", + " ax.set_ylim(yy.min(), yy.max())\n", + " ax.set_xticks(())\n", + " ax.set_yticks(())\n", + " i += 1\n", + "\n", + " # iterate over classifiers\n", + "\n", + " for name, clf in zip(names, classifiers):\n", + " ax = plt.subplot(len(datasets), len(classifiers) + 1, i)\n", + " clf.fit(train_, y_train)\n", + " score = clf.score(test_, y_test)\n", + "\n", + " # Plot the decision boundary. 
For that, we will assign a color to each\n", + " # point in the mesh [x_min, x_max]x[y_min, y_max].\n", + "\n", + " if hasattr(clf, \"decision_function\"):\n", + " Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) # confidence scores\n", + " else:\n", + " Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1] # probability estimates\n", + "\n", + " # Put the result into a color plot\n", + " Z = Z.reshape(xx.shape)\n", + " ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)\n", + "\n", + " # Plot the training points\n", + " ax.scatter(train_[:, 0], train_[:, 1], c=y_train, cmap=cm_bright, edgecolors='k')\n", + " # Plot the testing points\n", + " ax.scatter(test_[:, 0], test_[:, 1], c=y_test, cmap=cm_bright, edgecolors='k', alpha=0.4)\n", + "\n", + " ax.set_xlim(xx.min(), xx.max())\n", + " ax.set_ylim(yy.min(), yy.max())\n", + " ax.set_xticks(())\n", + " ax.set_yticks(())\n", + " if ds_cnt == 0:\n", + " ax.set_title(name)\n", + " ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'), size=15, horizontalalignment='right')\n", + " i += 1\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Test-set accuracies of the different algorithms are indicated in the lower right corner of each plot.\n", + "\n", + "The plots show training points in solid colors and testing points semi-transparent. Decision boundaries for GaussianProcessClassifier, RandomForest and AdaBoost are complicated while the decision boundaries for LogisticRegression, NaiveBayes, NeuralNetwork, and Linear SVM are simpler. 
GaussianProcess and RBF SVM have contoured decision boundaries which separate points with similar characteristics.\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/Current Platforms of AI and ML/Current AI Platforms and where to start.ipynb b/Current Platforms of AI and ML/Current AI Platforms and where to start.ipynb index a8ee5fe..9fab383 100644 --- a/Current Platforms of AI and ML/Current AI Platforms and where to start.ipynb +++ b/Current Platforms of AI and ML/Current AI Platforms and where to start.ipynb @@ -608,7 +608,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.4" + "version": "3.7.3" }, "varInspector": { "cols": { diff --git a/Thur 20 Feb/ACE_ClassML Pipeline.ipynb b/Thur 20 Feb/ACE_ClassML Pipeline.ipynb index e55fe6d..0f344f5 100644 --- a/Thur 20 Feb/ACE_ClassML Pipeline.ipynb +++ b/Thur 20 Feb/ACE_ClassML Pipeline.ipynb @@ -48,7 +48,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "/Users/drjamesmugume/Desktop/ACE CLASS NOTES/Thur 20 Feb\r\n" + "/Users/drjamesmugume/Desktop/ACE CLASS NOTES/Thur 20 Feb\n" ] } ], @@ -65,8 +65,8 @@ "name": "stdout", "output_type": "stream", "text": [ - "ACE_ClassML Pipeline.ipynb \u001b[31mhousing.csv\u001b[m\u001b[m\r\n", - "\u001b[31mdiabetes.csv\u001b[m\u001b[m\r\n" + "ACE_ClassML Pipeline.ipynb \u001b[31mhousing.csv\u001b[m\u001b[m\n", + "\u001b[31mdiabetes.csv\u001b[m\u001b[m\n" ] } ], @@ -768,9 +768,7 @@ { "cell_type": "code", "execution_count": 10, - "metadata": { - "scrolled": false - }, + "metadata": {}, "outputs": [ { "data": { @@ -2890,9 +2888,7 @@ { 
"cell_type": "code", "execution_count": 48, - "metadata": { - "scrolled": false - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -3865,7 +3861,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.4" + "version": "3.7.3" }, "varInspector": { "cols": { @@ -3898,5 +3894,5 @@ } }, "nbformat": 4, - "nbformat_minor": 2 + "nbformat_minor": 4 }