
Machine learning: a quick review (part 3)

3- Data preparation

3-1- Imbalanced dataset

An imbalanced dataset is one where, for example, you have a classification task and 90% of the data belongs to one class. That leads to problems: an accuracy of 90% can be misleading if the model has no predictive power on the other class! Here are a few tactics to get over the hump:

  • Collect more data to even the imbalances in the dataset.
  • Resample the dataset to correct for imbalances.
  • Try a different algorithm altogether on your dataset.

What’s important here is that you have a keen sense of the damage an imbalanced dataset can cause, and of how to correct for it.
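As a minimal sketch of the resampling tactic, one common approach is to upsample the minority class with scikit-learn’s resample utility (the toy dataset below is illustrative, not from the article):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced dataset: 90 samples of class 0, 10 samples of class 1
df = pd.DataFrame({"x": range(100), "y": [0] * 90 + [1] * 10})

majority = df[df.y == 0]
minority = df[df.y == 1]

# Upsample the minority class (sampling with replacement) to match the majority
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])

print(balanced.y.value_counts())  # both classes now have 90 samples
```

Downsampling the majority class works the same way with `replace=False` and `n_samples=len(minority)`.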

3-2- Avoid overfitting

This is a simple restatement of a fundamental problem in machine learning: the possibility of overfitting training data and carrying the noise of that data through to the test set, thereby providing inaccurate generalizations.

There are three main methods to avoid overfitting:

  • Keep the model simpler: reduce variance by taking into account fewer variables and parameters, thereby removing some of the noise in the training data.
  • Use cross-validation techniques such as k-fold cross-validation.
  • Use regularization techniques such as LASSO that penalize certain model parameters if they’re likely to cause overfitting.
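The last two points can be sketched together with scikit-learn: k-fold cross-validation to estimate generalization, and LASSO’s L1 penalty to shrink unhelpful coefficients toward zero (the synthetic data and the `alpha` value are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)  # only feature 0 matters

# 5-fold cross-validation scores the model on held-out folds
model = Lasso(alpha=0.1)
scores = cross_val_score(model, X, y, cv=5)

# The L1 penalty drives the irrelevant coefficients to (near) zero
model.fit(X, y)
print(scores.mean())
print(model.coef_)
```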

3-3- Missing/corrupted data

You could find missing/corrupted data in a dataset and either drop those rows or columns, or decide to replace them with another value.

In Pandas, there are two very useful methods: isnull() and dropna() that will help you find columns of data with missing or corrupted data and drop those values. If you want to fill the invalid values with a placeholder value (for example, 0), you could use the fillna() method.
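A minimal sketch of those three Pandas methods on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

print(df.isnull().sum())   # count missing values per column
dropped = df.dropna()      # drop every row containing a NaN
filled = df.fillna(0)      # or replace NaNs with a placeholder value
```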

3-4- Outlier detection

Interquartile Range

Any set of data can be described by its five-number summary:

  1. The minimum or lowest value of the dataset
  2. The first quartile Q1, which represents a quarter of the way through the list of all data
  3. The median of the data set, which represents the midpoint of the whole list of data
  4. The third quartile Q3, which represents 3-quarters of the way through the list of all data
  5. The maximum or highest value of the data set.

IQR = Q3 – Q1.
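The five-number summary and the IQR can be computed with NumPy’s percentile function (the sample values are illustrative):

```python
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41])

# Q1, median, and Q3 are the 25th, 50th, and 75th percentiles
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

print(data.min(), q1, median, q3, data.max())  # the five-number summary
print(iqr)
```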

3-5- Feature selection/engineering

Given domain knowledge of the available parameters, transformations can be performed to engineer new features that better capture the information contained in the raw dataset.

3-6- Feature standardization

Standardizing the data by the relation (x − mean(x)) / std(x), which gives each feature zero mean and unit variance, so that most values fall roughly between -1 and 1.
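A minimal sketch of that relation in NumPy (toy values for illustration):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# (x - mean) / std gives zero mean and unit variance
z = (x - x.mean()) / x.std()

print(z.mean(), z.std())  # ~0.0 and 1.0
```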

The metric used to evaluate and compare the different models (predictors) is R² (R-squared, or coefficient of determination), which evaluates how well the predicted values fit the original values. Its maximum is 1 (and it can even be negative for models worse than predicting the mean), with a value close to 1 indicating a better model.

R² Validation is the mean R² score calculated on the validation set during cross-validation.

R² Test is the evaluation of the final model, performed on the test set, which contains data not seen during training (30% of the total data in our case). It is usually lower than R² Validation, but too large a difference would be a sign that the model overfits the training data and does not generalize well to new data.
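As a quick sketch, scikit-learn’s `r2_score` computes this metric directly (the values below are made up for illustration):

```python
from sklearn.metrics import r2_score

y_true = [3.0, -0.5, 2.0, 7.0]   # original values
y_pred = [2.5, 0.0, 2.0, 8.0]    # model predictions

score = r2_score(y_true, y_pred)
print(score)  # close to 1: the predictions fit the originals well
```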

1- Applying domain knowledge

There is some prior knowledge about the data that can be applied before processing it.

In our example, the following domain knowledge is used:

1-1- Dropping non-informative columns

1-2- Dropping/trimming based on domain knowledge
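Both steps are one-liners in Pandas; the column names below are hypothetical, standing in for whatever your domain knowledge flags as non-informative:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3],          # a row identifier carries no signal
                   "price": [100, 150, 90],
                   "notes": ["", "", ""]})   # an empty free-text column

# Drop the columns that domain knowledge says are non-informative
df = df.drop(columns=["id", "notes"])
print(df.columns.tolist())
```

Trimming rows on a domain rule works the same way with boolean indexing, e.g. `df = df[df["price"] > 0]`.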

2- Imputing missing data

We can either drop all the rows with missing data, set the missing values to zero, or impute them with the most common value or the median.
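A minimal sketch of the imputation options in Pandas (the toy columns are illustrative): median for a numeric column, the most common value (mode) for a categorical one:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, np.nan, 31.0, 27.0],
                   "city": ["A", "B", None, "B"]})

# Numeric column: impute with the median
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: impute with the most common value
df["city"] = df["city"].fillna(df["city"].mode()[0])
```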

3- Removing Corrupted data

An easy way to get rid of misshapen or corrupted numerical data is to force the type of a column.
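One way to force a column’s type in Pandas is `to_numeric` with `errors="coerce"`, which turns any entry that cannot be parsed into NaN, ready to be dropped or imputed (the corrupted value below is made up):

```python
import pandas as pd

df = pd.DataFrame({"price": ["10.5", "oops", "12.0"]})

# Forcing the numeric dtype converts corrupted entries to NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")
print(df["price"].isna().sum())  # one corrupted entry flagged
```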

4- Looking for correlations

A value close to 1 shows a strong positive correlation, and a value close to -1 a strong negative correlation. This helps us find the important features.
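In Pandas, the pairwise (Pearson) correlations come from `DataFrame.corr()`; the toy columns below are constructed so the signs are obvious:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4],
                   "y": [2, 4, 6, 8],    # moves with x
                   "z": [4, 3, 2, 1]})   # moves against x

corr = df.corr()      # pairwise Pearson correlation matrix
print(corr["x"])      # x vs y is +1, x vs z is -1
```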

5- Getting rid of outliers

To help the linear regression model generalize on unseen data, it is good practice to remove extreme samples from the training data. We apply the 1.5 x IQR rule to isolate data lower than 1.5*IQR (interquartile range) under the first quartile, and higher than 1.5*IQR above the third quartile. This is not necessary for decision tree-based predictors, which are robust to outliers.
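The 1.5 × IQR rule described above can be sketched as a boolean mask in Pandas (the series values are illustrative, with one planted outlier):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is an obvious outlier

q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1

# Keep only values within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
mask = (s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)
trimmed = s[mask]
print(trimmed.tolist())  # the outlier is gone
```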

6- Data whitening

Make the mean equal to zero and the variance equal to one. (Strictly speaking, this alone is standardization; full whitening also decorrelates the features.)

7- Splitting training/test sets

Models learn on a subset of the original data (the training set) and are evaluated on a different subset (the test set). We set the ratio to 70%-30%.
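With scikit-learn this is a single call; the 70%-30% ratio from the text maps to `test_size=0.3` (the arrays below are toy data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 70% of the rows go to training, 30% to the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(len(X_train), len(X_test))  # 7 and 3
```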

The histograms after data preprocessing are much more evenly distributed.

