
Machine learning: a quick review (part 2)

2 – Fundamentals of learning

2-1- Bias vs variance

Bias is error due to erroneous or overly simplistic assumptions in the learning algorithm you’re using. This can lead to the model underfitting your data, making it hard for it to have high predictive accuracy and for you to generalize your knowledge from the training set to the test set.

Variance is error due to too much complexity in the learning algorithm you’re using. This leads to the algorithm being highly sensitive to high degrees of variation in your training data, which can lead your model to overfit the data. You’ll be carrying too much noise from your training data for your model to be very useful for your test data.

The bias-variance decomposition essentially breaks the learning error of any algorithm into three terms: the bias, the variance, and a bit of irreducible error due to noise in the underlying dataset. Essentially, if you make the model more complex and add more variables, you'll lose bias but gain some variance; to reach the optimally reduced amount of error, you'll have to trade off bias against variance. You don't want either high bias or high variance in your model.

2-2- Cross validation

Cross-validation is a resampling method that uses different portions of the data to train and test a model on different iterations. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of known data on which training is run (the training dataset), and a dataset of unseen data against which the model is tested (called the validation dataset or test set).
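A minimal k-fold split can be written in a few lines. This sketch (NumPy-only; the function name and defaults are my own) shuffles the indices once and yields train/test index pairs so that every sample lands in the test set exactly once:

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = np.arange(n_samples)
    np.random.default_rng(seed).shuffle(idx)   # shuffle once, up front
    folds = np.array_split(idx, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx

splits = list(k_fold_indices(10, k=5))
# 5 folds; each of the 10 samples is tested exactly once.
```

In each iteration you would fit the model on `train_idx` and score it on `test_idx`, then average the k scores.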

2-3- Precision vs recall

Recall is also known as the true positive rate: the number of positives your model claims compared to the actual number of positives throughout the data. Precision is also known as the positive predictive value: the number of accurate positives your model claims compared to the total number of positives it claims. It can be easier to think of recall and precision in a concrete case: suppose there are actually 10 instances of event1, and you predict 10 event1 and 5 event2. You'd have perfect recall (there are actually 10 event1, and you predicted all 10) but only 66.7% precision, because out of the 15 events you predicted, only 10 (the event1) are correct.

Recall = 10/10 = 100%

Precision = 10/(10+5) ≈ 66.7%
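These two formulas translate directly into code. A small sketch (the function name is my own) computing both from confusion-matrix counts, using the example above:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# The example above: all 10 real event1 found (tp=10, fn=0),
# plus 5 spurious event2 predictions (fp=5).
p, r = precision_recall(tp=10, fp=5, fn=0)
# r == 1.0 (perfect recall), p == 10/15 ≈ 0.667
```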

2-4- Error Metrics

Type I error is a false positive, while Type II error is a false negative. Briefly stated, Type I error means claiming something has happened when it hasn’t, while Type II error means that you claim nothing is happening when in fact something is.

You could use measures such as the F1 score, accuracy, and the confusion matrix. The F1 score is a measure of a model's performance: the harmonic mean of the model's precision and recall, with results close to 1 being the best and those close to 0 being the worst. You would use it in classification tasks where true negatives don't matter much.
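As a quick sketch (function name mine), the F1 score applied to the precision and recall from the event1 example earlier:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision 10/15, recall 1.0 from the earlier event1 example:
f1 = f1_score(10 / 15, 1.0)   # ≈ 0.8
```

The harmonic mean punishes imbalance: a model with precision 1.0 but recall 0.1 gets an F1 of only ≈ 0.18, not the 0.55 an arithmetic average would suggest.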

Other common error metrics are the MSE, MAE, and R^2 score.

R^2 Validation is the mean R² score calculated on the validation set during cross-validation.

R^2 Test is the evaluation of the final model, performed on the test set, which contains data not seen during training (30% of the total data in our case). It is usually lower than R^2 Validation, but too large a difference would be a sign that the model overfits the training data and does not generalize well to new data.
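The three metrics are one-liners in NumPy. A minimal sketch (function names and the toy arrays are mine) with their standard definitions:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def r2_score(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
err_mse = mse(y_true, y_pred)    # 0.025
err_mae = mae(y_true, y_pred)    # 0.15
r2 = r2_score(y_true, y_pred)    # 0.98
```

Note that R² can be negative on a held-out set: a model worse than simply predicting the mean of `y_true` gives `ss_res > ss_tot`.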

2-5- Probability vs likelihood

The likelihood is P(y|x,𝛳) viewed as a function of the parameters 𝛳, and it is used for training with the MLE method: we try to find the parameters 𝛳 under which the observed events are most probable. The probability is P(y|x,𝛳) viewed as a function of the outcome with 𝛳 fixed, and it is used for testing: once we know the parameters, we want to check what the most probable outcome will be. Hence, probability is used in the test stage.
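A small sketch of MLE in the simplest setting I could pick (a Gaussian with known variance; the data and grid search are my assumptions, not from the original). We sweep candidate parameters and keep the one that maximizes the log-likelihood of the observed data:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=1.0, size=500)   # observed events

def log_likelihood(mu, x, sigma=1.0):
    """log P(x | mu, sigma) for i.i.d. Gaussian data."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu) ** 2 / (2 * sigma**2))

# Training (MLE): pick the parameter that makes the data most probable.
candidates = np.linspace(0, 6, 601)
mu_hat = candidates[np.argmax([log_likelihood(m, data) for m in candidates])]
# mu_hat lands at the grid point nearest the sample mean,
# which is the closed-form MLE for a Gaussian mean.
```

At test time the roles flip: with `mu_hat` fixed, `P(y | mu_hat)` tells us which outcomes are most probable.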

2-6- Time-series data

If we are trying to capture the time relation in the data, we cannot shuffle the data, and we should pay attention to the cross-validation step. Instead of using standard k-fold cross-validation, you have to account for the fact that a time series is not randomly distributed data; it is inherently ordered chronologically. If a pattern emerges in later time periods, for example, your model may still pick up on it even if that effect doesn't hold in earlier years!

You’ll want to do something like forward chaining where you’ll be able to model on past data then look at forward-facing data.

  • Fold 1 : training [1], test [2]
  • Fold 2 : training [1 2], test [3]
  • Fold 3 : training [1 2 3], test [4]
  • Fold 4 : training [1 2 3 4], test [5]
  • Fold 5 : training [1 2 3 4 5], test [6]
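The expanding-window folds above can be generated mechanically. A minimal sketch (function name mine) that reproduces exactly that fold pattern for any number of time blocks:

```python
def forward_chaining_splits(n_blocks):
    """Expanding-window splits: train on blocks 1..k, test on block k+1."""
    blocks = list(range(1, n_blocks + 1))
    return [(blocks[:k], [blocks[k]]) for k in range(1, n_blocks)]

for train, test in forward_chaining_splits(6):
    print("training", train, "test", test)
# Fold 1: training [1], test [2] ... Fold 5: training [1..5], test [6]
```

Every fold trains strictly on the past and tests strictly on the future, so no information leaks backwards in time.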

2-7- Discriminative vs generative model

A generative model will learn the underlying distribution of each category of data, while a discriminative model will simply learn the distinction between different categories of data. Discriminative models will generally outperform generative models on classification tasks. KNN and SVM are discriminative models, as they try to optimize P(y|x,𝛳) directly, while Naive Bayes is a generative model, as it tries to identify the underlying distribution of the classes by modeling the joint probability P(x,y|𝛳).

2-8- Parametric vs nonparametric models

Modeling approaches can be split into two major categories: parametric and non-parametric.

Parametric: a finite set of parameters is used to do the modeling, as in LR, ANN, and SVR.

In contrast, non-parametric models assume that the dataset distribution cannot be defined using any finite number of parameters; therefore, the amount of information that 𝛳 can capture grows with the number of training data points in dataset D. Examples include DTR, RFR, and SVR with an RBF kernel.

Why the non-parametric model?

  • It allows great flexibility in the possible form of the regression curve and makes no assumption about a parametric form. 
  • It only assumes that the regression curve belongs to some infinite-dimensional collection of functions.
  • It relies on the experimenter to supply only qualitative information about the function, letting the data speak for itself concerning the actual form of the regression curve.
  • It is best suited for inference in a situation where there is very little or no prior information about the regression curve.

Parametric methods make strong assumptions about the mapping from the input variables to the output variable; in turn they are faster to train and require less data, but may not be as powerful. Nonparametric methods make few or no assumptions about the target function; in turn they require a lot more data, are slower to train, and have a higher model complexity, but can result in more powerful models.
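The contrast is easy to see side by side. A sketch (the toy data, the choice of a linear model versus k-nearest-neighbours regression, and all names are my assumptions) where the parametric model is two numbers and the nonparametric "model" is the training set itself:

```python
import numpy as np

rng = np.random.default_rng(2)
x_train = np.sort(rng.uniform(0, 1, 100))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.1, 100)

# Parametric: a fixed number of parameters (slope and intercept),
# no matter how much data we collect.
slope, intercept = np.polyfit(x_train, y_train, 1)
def linear_predict(x):
    return slope * x + intercept

# Nonparametric: predictions come from the stored training data,
# so capacity grows with the dataset (k-nearest-neighbours regression).
def knn_predict(x, k=5):
    nearest = np.argsort(np.abs(x_train - x))[:k]
    return y_train[nearest].mean()

x0 = 0.25                      # true value: sin(2*pi*0.25) = 1
lin_y, knn_y = linear_predict(x0), knn_predict(x0)
# The k-NN estimate tracks the nonlinear curve; a straight line cannot.
```

The linear model compresses everything into two parameters and misses the sine shape entirely, while k-NN, having made no assumption about the curve's form, follows it wherever the data is dense enough.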
