http://bl.ocks.org/serra/5012770
http://bl.ocks.org/mbostock/raw/4063550/flare.json
https://leanpub.com/D3-Tips-and-Tricks
Regular expressions
Once you have data, you need to determine what features to use.
Applied machine learning is feature engineering
http://stackoverflow.com/questions/2674430/how-to-engineer-features-for-machine-learning
http://nerds.airbnb.com/overcoming-missing-values-in-a-rfc/
http://www.cs.princeton.edu/courses/archive/spring10/cos424/slides/18-feat.pdf
Without good feature engineering, common problems include:
- Models fail to converge
- Models produce results equivalent to random chance
- You lack the computational power to perform operations across the feature space
- You do not know which aspects of the data are the most important
From https://www.quora.com/MLconf-2015-Seattle/What-are-some-best-practices-in-Feature-Engineering:
Are the features continuous, discrete or none of the above?
What is the distribution of this feature?
Does the distribution largely depend on what subset of examples is being considered? Time-based segmentation? Type-based segmentation?
Does this feature contain holes (missing values)? Can those holes be filled, or will they stay forever? Is it possible to eliminate them in future data?
Are there duplicate and/or intersecting examples? Answering this question right is extremely important, since duplicate or connected data points might significantly affect the results of model validation if not properly excluded.
Where do the features come from? If we come up with new features that prove to be useful, how hard would it be to incorporate those features in the final design?
Is the data real-time? Are the requests real-time?
If yes, well-engineered simple features would likely rock. If no, we likely are in the business of advanced models and algorithms.
Are there features that can be used as the "truth"?
Add or remove data based on its value
Fill in missing values in data
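A minimal sketch of both ideas, assuming pandas and scikit-learn are available; the toy DataFrame and column names are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with a hole in the "age" column.
df = pd.DataFrame({"age": [23, np.nan, 31, 40], "income": [40, 52, 48, 61]})

# Option 1: remove data based on its value (drop rows with a missing age).
dropped = df.dropna(subset=["age"])

# Option 2: fill the hole with a summary statistic such as the median.
filled = df.fillna({"age": df["age"].median()})

# Option 3: the same idea via scikit-learn, so it can live inside a pipeline.
imputed = SimpleImputer(strategy="median").fit_transform(df)
```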
http://jeremykun.com/2015/04/06/markov-chain-monte-carlo-without-all-the-bullshit/
http://setosa.io/blog/2014/07/26/markov-chains/index.html
https://scottlocklin.wordpress.com/2014/07/22/neglected-machine-learning-ideas/
VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.
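A minimal sketch of VarianceThreshold from scikit-learn, using a made-up matrix whose last column is constant:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: the last column is constant, so its variance is zero.
X = np.array([[0, 2, 1],
              [0, 1, 1],
              [1, 4, 1],
              [1, 3, 1]])

# The default threshold of 0.0 drops only zero-variance features.
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)         # (4, 2) -- the constant column is gone
print(selector.get_support())  # [ True  True False]
```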
Univariate feature selection works by selecting the best features based on univariate statistical tests.
Getting a fixed number of features
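A sketch of univariate selection with a fixed number of features, using SelectKBest on the iris dataset bundled with scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep a fixed number of features: the k columns with the best
# univariate ANOVA F-score against the class label.
selector = SelectKBest(score_func=f_classif, k=2)
X_top2 = selector.fit_transform(X, y)
print(X_top2.shape)      # (150, 2)
print(selector.scores_)  # per-feature F statistics
```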
log transforms
deduplication
normalization: rescale values into the range 0 to 1. Why? Models trained with gradient descent converge faster on normalized features (see the sketch after this list)
categorical conversion
format conversion
one-hot encoding
Frequency domain
Fast Fourier transform
Coordinate transform
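A minimal sketch of two of the transformations above, normalization and one-hot encoding, assuming pandas and scikit-learn; the column names and values are made up:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"hours": [1.0, 5.0, 10.0],
                   "colour": ["red", "blue", "red"]})

# Normalization: rescale a numeric column into [0, 1] so that
# gradient-descent-based models converge faster.
df[["hours"]] = MinMaxScaler().fit_transform(df[["hours"]])

# Categorical conversion / one-hot encoding: expand a string column
# into one binary indicator column per category.
df = pd.get_dummies(df, columns=["colour"])
print(df)
```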
http://scikit-learn.org/stable/modules/clustering.html#clustering
Hierarchical clustering
K means
- known number of clusters (see the sketch after this list)
X means
- unknown number of clusters
Fractal
Topic modeling
- text data
Canopy clustering
Self-organizing maps
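A minimal K-means sketch with scikit-learn; the two 2-D blobs are made up so that the required number of clusters (k=2) is obvious:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious blobs; with K-means the number of clusters must be given up front.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster id per sample
print(km.cluster_centers_)  # the two centroids
```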
Regression is fitting a line or curve to do numeric prediction or, in some cases, causality analysis
Think of a scatter plot of Grade vs. Hours of Study:
| Hours of study | Grade |
|---|---|
| data1 | data1 |
| ... | ... |
One feature (or attribute): ŷ = b0 + b1x
Multiple features (or attributes): ŷ = b0 + b1x1 + b2x2 + ... + bpxp
Input:
| id | x | y | ŷ | y-ŷ |
|---|---|---|---|---|
| p1 | 1 | 2 | 1 | 1 |
| p2 | 1 | 1 | 1 | 0 |
| p3 | 3 | 2 | 3 | -1 |
Where x is a feature, y is the true output, ŷ is the predicted output, and y - ŷ is called the residual (observed - predicted)
Suppose we obtain the following linear model:
y = x
Sum of squared residuals (SSQ) = Σ (y - ŷ)^2 over all data points
Ordinary least squares (OLS) or linear least squares computes the least squares solution using a singular value decomposition of X. This means the algorithm attempts to minimize the sum of squares of residuals. If X is a matrix of size (n, p) this method has a cost of O(np^2), assuming n ≥ p.
Other related metrics:
Mean Squared Error (MSE) = SSQ / num of data points
Root Mean Squared Error (RMSE) = sqrt (MSE)
In the example:
SSQ = 1^2 + 0^2 + (-1)^2 = 2
MSE = 2 / 3 ≈ 0.67
RMSE = sqrt(2 / 3) ≈ 0.82
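A minimal sketch of these computations for the three example points, assuming numpy and scikit-learn. The metrics are evaluated for the hypothesised model ŷ = x from the notes; the OLS fit at the end is included only to show that the actual least-squares line for these points is different:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1.0, 1.0, 3.0])   # feature (p1, p2, p3)
y = np.array([2.0, 1.0, 2.0])   # true output

# Residuals under the hypothesised model y-hat = x.
y_hat = x
residuals = y - y_hat            # [ 1.  0. -1.]
ssq = np.sum(residuals ** 2)     # 2.0
mse = ssq / len(y)               # ~0.67
rmse = np.sqrt(mse)              # ~0.82

# The actual OLS fit of these three points (note it differs from y = x).
ols = LinearRegression().fit(x.reshape(-1, 1), y)
print(ols.intercept_, ols.coef_)  # approximately 1.25 and [0.25]
```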
Usually we calculate the (vertical) residual, or the difference between the observed and predicted y. This is because "the use of the least squares method to calculate the best-fitting line through a two-dimensional scatter plot typically requires the user to assume that one of the variables depends on the other. (We calculate the difference in the y.) However, in many cases the relationship between the two variables is more complex, and it is not valid to say that one variable is independent and the other is dependent. When analysing such data, researchers should consider plotting the three regression lines that can be calculated for any two-dimensional scatter plot.
Plotting all three regression lines gives a fuller picture of the data, and comparing their slopes provides a simple graphical assessment of the correlation coefficient. Plotting the orthogonal regression line (red) provides additional information because it makes no assumptions about the dependence or independence of the variables; as such, it appears to more accurately describe the trend in the data compared to either of the ordinary least squares regression lines." You can read the full paper here
Residual plot y - ŷ vs x
TODO: put plot
Distribution of Residuals
Frequency of y - ŷ
TODO: put histogram
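A minimal matplotlib sketch producing both placeholder figures above, assuming x, y and y_hat are available as 1-D numpy arrays; the synthetic data here is made up purely to make the script self-contained:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up linear data with noise, standing in for real x, y, y_hat.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2 * x + 1 + rng.normal(scale=1.0, size=x.size)
y_hat = 2 * x + 1
residuals = y - y_hat

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residual plot: should look like patternless noise around zero.
ax1.scatter(x, residuals)
ax1.axhline(0, color="black")
ax1.set_xlabel("x")
ax1.set_ylabel("y - y_hat")
ax1.set_title("Residual plot")

# Distribution of residuals: should be roughly normal, centred at zero.
ax2.hist(residuals, bins=20)
ax2.set_xlabel("y - y_hat")
ax2.set_title("Distribution of residuals")

plt.tight_layout()
plt.show()
```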
- create a residual plot for each feature; if x and y are indeed linearly related, the residuals should be normally distributed and centered around zero
- if poor fit, consider applying a transform (such as log transform) or non-linear regression
- residual plot should not have any patterns (under/over estimation bias)
- residual plot is a great visualization of fit, but should be used in combination with other statistical methods (see tutorials 2 and 3)
- After an initial model is created, it can be modified by changing the features (called feature engineering) or by model selection. Example techniques include:
Piecewise regression (see tutorial), polynomial regression, ridge regression, Bayesian regression and many more generalized linear models. The scikit-learn documentation has a very good outline and examples of many different techniques and when to use them.
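A minimal sketch of one of the techniques listed above, polynomial regression, as a scikit-learn pipeline; the synthetic quadratic data is made up for illustration:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic quadratic data.
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + rng.normal(scale=0.2, size=50)

# Polynomial regression: expand the feature, then fit ordinary least squares.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.score(X, y))   # R^2 on the training data
```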
As another example of changing the features: if y is exponential in x, apply a log transform:
y = b e^x
ln y = ln (b e^x)
= ln b + ln e^x
= ln b + x
Why apply transforms? It linearizes the model so linear regression can be used on some non-linear relationships.
There are other methods for transforming variables to achieve linearity outlined here
| Method | Transformation(s) | Regression equation | Predicted value (ŷ) |
|---|---|---|---|
| Standard linear regression | None | y = b0 + b1x | ŷ = b0 + b1x |
| Exponential model | Dependent variable = log(y) | log(y) = b0 + b1x | ŷ = 10^(b0 + b1x) |
| Quadratic model | Dependent variable = sqrt(y) | sqrt(y) = b0 + b1x | ŷ = (b0 + b1x)^2 |
| Reciprocal model | Dependent variable = 1/y | 1/y = b0 + b1x | ŷ = 1 / (b0 + b1x) |
| Logarithmic model | Independent variable = log(x) | y = b0 + b1 log(x) | ŷ = b0 + b1 log(x) |
| Power model | Dependent variable = log(y), Independent variable = log(x) | log(y) = b0 + b1 log(x) | ŷ = 10^(b0 + b1 log(x)) |
Transforming a data set to enhance linearity is a multi-step, trial-and-error process.
- Conduct a standard regression analysis on the raw data.
- Construct a residual plot.
- If the plot pattern is random, do not transform data.
- If the plot pattern is not random, continue.
- Compute the coefficient of determination (R2).
- Choose a transformation method (see above table).
- Transform the independent variable, dependent variable, or both.
- Conduct a regression analysis, using the transformed variables.
- Compute the coefficient of determination (R2), based on the transformed variables.
- If the transformed R2 is greater than the raw-score R2, the transformation was successful. Congratulations!
- If not, try a different transformation method.
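A minimal sketch of this trial-and-error workflow for the exponential model from the table, assuming numpy and scikit-learn; the synthetic data is made up so the comparison is self-contained:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic exponential data: y = 2 * 10**(0.3 * x) with multiplicative noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 40).reshape(-1, 1)
y = 2 * 10 ** (0.3 * x.ravel()) * rng.normal(loc=1.0, scale=0.05, size=40)

# Step 1: regression on the raw data and its R^2.
raw_r2 = LinearRegression().fit(x, y).score(x, y)

# Exponential model: transform the dependent variable to log(y), refit,
# and compare the coefficient of determination.
log_r2 = LinearRegression().fit(x, np.log10(y)).score(x, np.log10(y))

print(f"raw R^2 = {raw_r2:.3f}, log-transformed R^2 = {log_r2:.3f}")
# If log_r2 > raw_r2 the transformation helped; otherwise try another one.
```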
- Forward Selection: try each variable one by one and keep the one with the lowest sum of squared errors
- not used in practice
- Backward Selection (more practical than Forward Selection): start with all the variables and greedily remove the worst one (the variable with the highest impact on SSQ)
- has python implementation
- Shrinkage
- LASSO: adds an L1 penalty that shrinks coefficients, driving some exactly to zero and thereby eliminating variables (see the sketch after this list)
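A minimal LASSO sketch with scikit-learn; the data is made up so that only two of five features matter, and the alpha value is an arbitrary illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

# 5 features, but only the first two actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)   # coefficients on the irrelevant features shrink to ~0
```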
Other techniques are used in feature engineering section (TODO add link)
- Random trees
- transparent model
Hidden Markov model
- used when the dependency relationships between variables are known
How people interact with features (items)
Use item characteristics
How items are interconnected
TODO: model selection and evaluation
# Advice
How to understand different evidence?
How to determine the best course of action?
Genetic Algorithms
Simulated Annealing
Gradient Search
Stochastic search
Linear Programming
Integer Programming
Non-linear programming
Active learning
How to characterize systems
Discrete event simulation
Monte Carlo Simulation
Distinct states
interaction between autonomous entities
complex system with feedback
Tracking system behaviour
Imprecise categories
N grams
TFIDF
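A minimal sketch of TF-IDF features over unigrams and bigrams with scikit-learn; the documents are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]

# Unigrams and bigrams (n-grams), weighted by TF-IDF.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
print(X.shape)                            # documents x n-gram vocabulary
print(sorted(vectorizer.vocabulary_)[:8]) # a few of the learned n-grams
```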
Model Evaluation
It can be difficult to predict which classifier will work best on your dataset. Always try multiple classifiers, then pick the one or two that work best to refine and explore further.
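A minimal sketch of trying several classifiers under the same cross-validation split, assuming scikit-learn; the iris dataset and the three candidate models are arbitrary placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Score each candidate with the same 5-fold cross-validation and keep
# the one or two best performers for further tuning.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
    "svm (rbf)": SVC(),
}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```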
Learning Curves http://scikit-learn.org/stable/modules/learning_curve.html
http://www.astroml.org/sklearn_tutorial/practical.html
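A minimal learning-curve sketch with scikit-learn's learning_curve helper; the digits dataset and the naive Bayes estimator are placeholders for whatever model you are evaluating:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)

# Score the model on growing fractions of the training set to see whether
# more data would help (high variance) or not (high bias).
sizes, train_scores, valid_scores = learning_curve(
    GaussianNB(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

print(sizes)
print(train_scores.mean(axis=1))  # training score per training-set size
print(valid_scores.mean(axis=1))  # cross-validation score per size
```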
Bias vs. Variance Tradeoff
https://github.com/rcompton/ml_cheat_sheet
http://blog.dato.com/how-to-evaluate-machine-learning-models-part-4-hyperparameter-tuning
http://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/
http://mlwave.com/kaggle-ensembling-guide/
SQL on clusters
Pig
Spark
Storm
Flume, Scribe, Chukwa
Apache Nutch, Talend, Scraperwiki
Webscraper, Sqoop
Markov Chains
MCMC
