http://bl.ocks.org/serra/5012770
http://bl.ocks.org/mbostock/raw/4063550/flare.json
https://leanpub.com/D3-Tips-and-Tricks
Regular expressions
Once you have data, you need to determine what features to use.
Applied machine learning is feature engineering
http://stackoverflow.com/questions/2674430/how-to-engineer-features-for-machine-learning
http://nerds.airbnb.com/overcoming-missing-values-in-a-rfc/
http://www.cs.princeton.edu/courses/archive/spring10/cos424/slides/18-feat.pdf
Without good feature engineering, common problems include:
- Models fail to converge
- Models produce results equivalent to random chance
- You lack the computational power to perform operations across the feature space
- You do not know which aspects of the data are the most important
From https://www.quora.com/MLconf-2015-Seattle/What-are-some-best-practices-in-Feature-Engineering:
Are the features continuous, discrete or none of the above?
What is the distribution of this feature?
Does the distribution largely depend on what subset of examples is being considered? Time-based segmentation? Type-based segmentation?
Does this feature contain holes (missing values)? Can those holes be filled, or will they stay forever? Is it possible to eliminate them in future data?
Are there duplicate and/or intersecting examples? Answering this question right is extremely important, since duplicate or connected data points might significantly affect the results of model validation if not properly excluded.
Where do the features come from? If we come up with new features that prove to be useful, how hard would it be to incorporate those features in the final design?
Is the data real-time? Are the requests real-time?
If yes, well-engineered simple features would likely rock. If no, we likely are in the business of advanced models and algorithms.
Are there features that can be used as the "truth"?
Add or remove data based on its value
Fill in missing values in data
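A minimal sketch of both ideas, assuming pandas and scikit-learn are available; the toy DataFrame and column names are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with a hole in the "age" column.
df = pd.DataFrame({"age": [23, np.nan, 31, 40], "income": [40, 52, 48, 61]})

# Option 1: remove data based on its value (drop rows with a missing age).
dropped = df.dropna(subset=["age"])

# Option 2: fill the hole with a summary statistic such as the median.
filled = df.fillna({"age": df["age"].median()})

# Option 3: the same idea via scikit-learn, so it can live inside a pipeline.
imputed = SimpleImputer(strategy="median").fit_transform(df)
```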
http://jeremykun.com/2015/04/06/markov-chain-monte-carlo-without-all-the-bullshit/
http://setosa.io/blog/2014/07/26/markov-chains/index.html
https://scottlocklin.wordpress.com/2014/07/22/neglected-machine-learning-ideas/
VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.
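A minimal sketch of VarianceThreshold from scikit-learn, using a made-up matrix whose last column is constant:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: the last column is constant, so its variance is zero.
X = np.array([[0, 2, 1],
              [0, 1, 1],
              [1, 4, 1],
              [1, 3, 1]])

# The default threshold of 0.0 drops only zero-variance features.
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)         # (4, 2) -- the constant column is gone
print(selector.get_support())  # [ True  True False]
```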
Univariate feature selection works by selecting the best features based on univariate statistical tests.
Getting a fixed number of features
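A sketch of univariate selection with a fixed number of features, using SelectKBest on the iris dataset bundled with scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep a fixed number of features: the k columns with the best
# univariate ANOVA F-score against the class label.
selector = SelectKBest(score_func=f_classif, k=2)
X_top2 = selector.fit_transform(X, y)
print(X_top2.shape)      # (150, 2)
print(selector.scores_)  # per-feature F statistics
```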
log transforms
deduplication
normalization: rescale values into the range 0 to 1. Why? Models trained with gradient descent converge faster on normalized features (see the sketch after this list)
categorical conversion
format conversion
one-hot encoding
Frequency domain
Fast Fourier transform
Coordinate transform
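A minimal sketch of two of the transformations above, normalization and one-hot encoding, assuming pandas and scikit-learn; the column names and values are made up:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"hours": [1.0, 5.0, 10.0],
                   "colour": ["red", "blue", "red"]})

# Normalization: rescale a numeric column into [0, 1] so that
# gradient-descent-based models converge faster.
df[["hours"]] = MinMaxScaler().fit_transform(df[["hours"]])

# Categorical conversion / one-hot encoding: expand a string column
# into one binary indicator column per category.
df = pd.get_dummies(df, columns=["colour"])
print(df)
```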
http://scikit-learn.org/stable/modules/clustering.html#clustering
Hierarchical clustering
K means
- known number of clusters (see the sketch after this list)
X means
- unknown number of clusters
Fractal
Topic modeling
- text data
Canopy clustering
Self-organizing maps
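A minimal K-means sketch with scikit-learn; the two 2-D blobs are made up so that the required number of clusters (k=2) is obvious:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious blobs; with K-means the number of clusters must be given up front.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster id per sample
print(km.cluster_centers_)  # the two centroids
```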
Regression is fitting a line or curve to do numeric prediction or, in some cases, causality analysis
Think of a scatter plot of Grade vs. Hours of Study:
| Hours of study | Grade |
|---|---|
| data1 | data1 |
| ... | ... |
One feature (or attribute): ŷ = b0 + b1x
Multiple features (or attributes): ŷ = b0 + b1x1 + b2x2 + ... + bpxp
Input:
| id | x | y | ŷ | y-ŷ |
|---|---|---|---|---|
| p1 | 1 | 2 | 1 | 1 |
| p2 | 1 | 1 | 1 | 0 |
| p3 | 3 | 2 | 3 | -1 |
Where x is a feature, y is the true output, ŷ is the predicted output, and y - ŷ is called the residual (observed - predicted)
Suppose we obtain the following linear model:
y = x
Sum of squared residuals (SSQ) = Σ (y - ŷ)^2 over all data points
Ordinary least squares (OLS) or linear least squares computes the least squares solution using a singular value decomposition of X. This means the algorithm attempts to minimize the sum of squares of residuals. If X is a matrix of size (n, p) this method has a cost of O(np^2), assuming n ≥ p.
Other related metrics:
Mean Squared Error (MSE) = SSQ / num of data points
Root Mean Squared Error (RMSE) = sqrt (MSE)
In the example:
SSQ = 1^2 + 0^2 + (-1)^2 = 2
MSE = 2 / 3 ≈ 0.67
RMSE = sqrt(2 / 3) ≈ 0.82
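A minimal sketch of these computations for the three example points, assuming numpy and scikit-learn. The metrics are evaluated for the hypothesised model ŷ = x from the notes; the OLS fit at the end is included only to show that the actual least-squares line for these points is different:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1.0, 1.0, 3.0])   # feature (p1, p2, p3)
y = np.array([2.0, 1.0, 2.0])   # true output

# Residuals under the hypothesised model y-hat = x.
y_hat = x
residuals = y - y_hat            # [ 1.  0. -1.]
ssq = np.sum(residuals ** 2)     # 2.0
mse = ssq / len(y)               # ~0.67
rmse = np.sqrt(mse)              # ~0.82

# The actual OLS fit of these three points (note it differs from y = x).
ols = LinearRegression().fit(x.reshape(-1, 1), y)
print(ols.intercept_, ols.coef_)  # approximately 1.25 and [0.25]
```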
Usually we calculate the (vertical) residual, or the difference between the observed and predicted y. This is because "the use of the least squares method to calculate the best-fitting line through a two-dimensional scatter plot typically requires the user to assume that one of the variables depends on the other. (We calculate the difference in the y.) However, in many cases the relationship between the two variables is more complex, and it is not valid to say that one variable is independent and the other is dependent. When analysing such data, researchers should consider plotting the three regression lines that can be calculated for any two-dimensional scatter plot.
Plotting all three regression lines gives a fuller picture of the data, and comparing their slopes provides a simple graphical assessment of the correlation coefficient. Plotting the orthogonal regression line (red) provides additional information because it makes no assumptions about the dependence or independence of the variables; as such, it appears to more accurately describe the trend in the data compared to either of the ordinary least squares regression lines." You can read the full paper here
Residual plot y - ŷ vs x
TODO: put plot
Distribution of Residuals
Frequency of y - ŷ
TODO: put histogram
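A minimal matplotlib sketch producing both placeholder figures above, assuming x, y and y_hat are available as 1-D numpy arrays; the synthetic data here is made up purely to make the script self-contained:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up linear data with noise, standing in for real x, y, y_hat.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2 * x + 1 + rng.normal(scale=1.0, size=x.size)
y_hat = 2 * x + 1
residuals = y - y_hat

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residual plot: should look like patternless noise around zero.
ax1.scatter(x, residuals)
ax1.axhline(0, color="black")
ax1.set_xlabel("x")
ax1.set_ylabel("y - y_hat")
ax1.set_title("Residual plot")

# Distribution of residuals: should be roughly normal, centred at zero.
ax2.hist(residuals, bins=20)
ax2.set_xlabel("y - y_hat")
ax2.set_title("Distribution of residuals")

plt.tight_layout()
plt.show()
```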
- create a residual plot for each feature; if x and y are indeed linearly related, the residuals should be normally distributed and centered around zero
- if poor fit, consider applying a transform (such as log transform) or non-linear regression
- residual plot should not have any patterns (under/over estimation bias)
- residual plot is a great visualization of fit, but should be used in combination with other statistical methods (see tutorials 2 and 3)
- After an initial model is created, it can be modified by changing the features (called feature engineering) or by model selection. Example techniques include:
Piecewise regression (see tutorial), polynomial regression, ridge regression, Bayesian regression and many more generalized linear models. The scikit-learn documentation has a very good outline and examples of many different techniques and when to use them.
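A minimal sketch of one of the techniques listed above, polynomial regression, as a scikit-learn pipeline; the synthetic quadratic data is made up for illustration:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic quadratic data.
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + rng.normal(scale=0.2, size=50)

# Polynomial regression: expand the feature, then fit ordinary least squares.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.score(X, y))   # R^2 on the training data
```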
As another example of changing the features: if y is exponential in x, apply a log transform:
y = b e^x
ln y = ln (b e^x)
= ln b + ln e^x
= ln b + x
Why apply transforms? It linearizes the model so linear regression can be used on some non-linear relationships.
There are other methods for transforming variables to achieve linearity outlined here
| Method | Transformation(s) | Regression equation | Predicted value (ŷ) |
|---|---|---|---|
| Standard linear regression | None | y = b0 + b1x | ŷ = b0 + b1x |
| Exponential model | Dependent variable = log(y) | log(y) = b0 + b1x | ŷ = 10^(b0 + b1x) |
| Quadratic model | Dependent variable = sqrt(y) | sqrt(y) = b0 + b1x | ŷ = (b0 + b1x)^2 |
| Reciprocal model | Dependent variable = 1/y | 1/y = b0 + b1x | ŷ = 1 / (b0 + b1x) |
| Logarithmic model | Independent variable = log(x) | y = b0 + b1 log(x) | ŷ = b0 + b1 log(x) |
| Power model | Dependent variable = log(y), Independent variable = log(x) | log(y) = b0 + b1 log(x) | ŷ = 10^(b0 + b1 log(x)) |
Transforming a data set to enhance linearity is a multi-step, trial-and-error process.
- Conduct a standard regression analysis on the raw data.
- Construct a residual plot.
- If the plot pattern is random, do not transform data.
- If the plot pattern is not random, continue.
- Compute the coefficient of determination (R2).
- Choose a transformation method (see above table).
- Transform the independent variable, dependent variable, or both.
- Conduct a regression analysis, using the transformed variables.
- Compute the coefficient of determination (R2), based on the transformed variables.
- If the transformed R2 is greater than the raw-score R2, the transformation was successful. Congratulations!
- If not, try a different transformation method.
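A minimal sketch of this trial-and-error workflow for the exponential model from the table, assuming numpy and scikit-learn; the synthetic data is made up so the comparison is self-contained:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic exponential data: y = 2 * 10**(0.3 * x) with multiplicative noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 40).reshape(-1, 1)
y = 2 * 10 ** (0.3 * x.ravel()) * rng.normal(loc=1.0, scale=0.05, size=40)

# Step 1: regression on the raw data and its R^2.
raw_r2 = LinearRegression().fit(x, y).score(x, y)

# Exponential model: transform the dependent variable to log(y), refit,
# and compare the coefficient of determination.
log_r2 = LinearRegression().fit(x, np.log10(y)).score(x, np.log10(y))

print(f"raw R^2 = {raw_r2:.3f}, log-transformed R^2 = {log_r2:.3f}")
# If log_r2 > raw_r2 the transformation helped; otherwise try another one.
```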
- Forward Selection: try each variable one by one and keep the one with the lowest sum of squared errors
- not used in practice
- Backward Selection (more practical than Forward Selection): start with all the variables and greedily remove the worst one (the variable with the highest impact on SSQ)
- has python implementation
- Shrinkage
- LASSO: adds an L1 penalty that shrinks coefficients, driving some exactly to zero and thereby eliminating variables (see the sketch after this list)
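A minimal LASSO sketch with scikit-learn; the data is made up so that only two of five features matter, and the alpha value is an arbitrary illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

# 5 features, but only the first two actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)   # coefficients on the irrelevant features shrink to ~0
```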
Other techniques are used in feature engineering section (TODO add link)
- Random trees
- transparent model
Hidden Markov model
- used when the dependency relationships between variables are known
How people interact with features (items)
Use item characteristics
How items are interconnected
TODO: model selection and evaluation
# Advice
How to understand different evidence?
How to determine the best course of action?
Genetic Algorithms
Simulated Annealing
Gradient Search
Stochastic search
Linear Programming
Integer Programming
Non-linear programming
Active learning
How to characterize systems
Discrete event simulation
Monte Carlo Simulation
Distinct states
interaction between autonomous entities
complex system with feedback
Tracking system behaviour
Imprecise categories
N grams
TFIDF
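A minimal sketch of TF-IDF features over unigrams and bigrams with scikit-learn; the documents are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]

# Unigrams and bigrams (n-grams), weighted by TF-IDF.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
print(X.shape)                            # documents x n-gram vocabulary
print(sorted(vectorizer.vocabulary_)[:8]) # a few of the learned n-grams
```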
Model Evaluation
It can be difficult to predict which classifier will work best on your dataset. Always try multiple classifiers, then pick the one or two that work best to refine and explore further.
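A minimal sketch of trying several classifiers under the same cross-validation split, assuming scikit-learn; the iris dataset and the three candidate models are arbitrary placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Score each candidate with the same 5-fold cross-validation and keep
# the one or two best performers for further tuning.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
    "svm (rbf)": SVC(),
}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```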
Learning Curves http://scikit-learn.org/stable/modules/learning_curve.html
http://www.astroml.org/sklearn_tutorial/practical.html
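A minimal learning-curve sketch with scikit-learn's learning_curve helper; the digits dataset and the naive Bayes estimator are placeholders for whatever model you are evaluating:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)

# Score the model on growing fractions of the training set to see whether
# more data would help (high variance) or not (high bias).
sizes, train_scores, valid_scores = learning_curve(
    GaussianNB(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

print(sizes)
print(train_scores.mean(axis=1))  # training score per training-set size
print(valid_scores.mean(axis=1))  # cross-validation score per size
```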
Bias vs. Variance Tradeoff
https://github.com/rcompton/ml_cheat_sheet
http://blog.dato.com/how-to-evaluate-machine-learning-models-part-4-hyperparameter-tuning
http://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/
http://mlwave.com/kaggle-ensembling-guide/
SQL on clusters
Pig
Spark
Storm
Flume, Scribe, Chukwa
Apache Nutch, Talend, Scraperwiki
Webscraper, Sqoop
Markov Chains
MCMC
