
Machine learning: a quick review (part 3)

3- Data preparation

3-1- Imbalanced dataset

An imbalanced dataset is one where, for example, you have a classification task and 90% of the data belongs to one class. That leads to problems: an accuracy of 90% can be misleading if the model has no predictive power on the other class! Here are a few tactics to get over the hump:

  • Collect more data to even the imbalances in the dataset.
  • Resample the dataset to correct for imbalances.
  • Try a different algorithm altogether on your dataset.

What’s important here is that you have a keen sense of the damage an imbalanced dataset can cause, and of how to correct for it.
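As a minimal sketch of the resampling tactic, one common approach is to upsample the minority class with scikit-learn’s resample utility (the toy dataset below is illustrative, not from the article):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced dataset: 90 samples of class 0, 10 samples of class 1
df = pd.DataFrame({"x": range(100), "y": [0] * 90 + [1] * 10})

majority = df[df.y == 0]
minority = df[df.y == 1]

# Upsample the minority class (sampling with replacement) to match the majority
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])

print(balanced.y.value_counts())  # both classes now have 90 samples
```

Downsampling the majority class works the same way with `replace=False` and `n_samples=len(minority)`.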

3-2- Avoid overfitting

This is a simple restatement of a fundamental problem in machine learning: the possibility of overfitting training data and carrying the noise of that data through to the test set, thereby providing inaccurate generalizations.

There are three main methods to avoid overfitting:

  • Keep the model simpler: reduce variance by taking into account fewer variables and parameters, thereby removing some of the noise in the training data.
  • Use cross-validation techniques such as k-fold cross-validation.
  • Use regularization techniques such as LASSO that penalize certain model parameters if they’re likely to cause overfitting.
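The last two points can be sketched together with scikit-learn: k-fold cross-validation to estimate generalization, and LASSO’s L1 penalty to shrink unhelpful coefficients toward zero (the synthetic data and the `alpha` value are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)  # only feature 0 matters

# 5-fold cross-validation scores the model on held-out folds
model = Lasso(alpha=0.1)
scores = cross_val_score(model, X, y, cv=5)

# The L1 penalty drives the irrelevant coefficients to (near) zero
model.fit(X, y)
print(scores.mean())
print(model.coef_)
```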

3-3- Missing/corrupted data

You could find missing/corrupted data in a dataset and either drop those rows or columns, or decide to replace them with another value.

In Pandas, there are two very useful methods: isnull() and dropna() that will help you find columns of data with missing or corrupted data and drop those values. If you want to fill the invalid values with a placeholder value (for example, 0), you could use the fillna() method.
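A minimal sketch of those three Pandas methods on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

print(df.isnull().sum())   # count missing values per column
dropped = df.dropna()      # drop every row containing a NaN
filled = df.fillna(0)      # or replace NaNs with a placeholder value
```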

3-4- Outlier detection

Interquartile Range

Any set of data can be described by its five-number summary:

  1. The minimum or lowest value of the dataset
  2. The first quartile Q1, which represents a quarter of the way through the list of all data
  3. The median of the data set, which represents the midpoint of the whole list of data
  4. The third quartile Q3, which represents 3-quarters of the way through the list of all data
  5. The maximum or highest value of the data set.

IQR = Q3 – Q1.
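The five-number summary and the IQR can be computed with NumPy’s percentile function (the sample values are illustrative):

```python
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41])

# Q1, median, and Q3 are the 25th, 50th, and 75th percentiles
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

print(data.min(), q1, median, q3, data.max())  # the five-number summary
print(iqr)
```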

3-5- Feature selection/engineering

Given domain knowledge of the available parameters, transformations can be performed to engineer new features that better capture the information contained in the raw dataset.

3-6- Feature standardization

Standardizing the data by the relation (x − mean(x)) / std(x), which gives each feature zero mean and unit variance, so that most values fall roughly between -1 and 1.
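A minimal sketch of that relation in NumPy (toy values for illustration):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# (x - mean) / std gives zero mean and unit variance
z = (x - x.mean()) / x.std()

print(z.mean(), z.std())  # ~0.0 and 1.0
```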

The metric used to evaluate and compare the different models (predictors) is R² (R-squared, or coefficient of determination), which evaluates how well the predicted values fit the original values. Its maximum is 1 (and it can even be negative for models worse than predicting the mean), with a value close to 1 indicating a better model.

R² Validation is the mean R² score calculated on the validation set during cross-validation.

R² Test is the evaluation of the final model, performed on the test set, which contains data not seen during training (30% of the total data in our case). It is usually lower than R² Validation, but too large a difference would be a sign that the model overfits the training data and does not generalize well to new data.
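As a quick sketch, scikit-learn’s `r2_score` computes this metric directly (the values below are made up for illustration):

```python
from sklearn.metrics import r2_score

y_true = [3.0, -0.5, 2.0, 7.0]   # original values
y_pred = [2.5, 0.0, 2.0, 8.0]    # model predictions

score = r2_score(y_true, y_pred)
print(score)  # close to 1: the predictions fit the originals well
```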

1- Applying domain knowledge

There is some prior knowledge about the data that can be applied before processing it.

In our example, the following domain knowledge is used:

1-1- Dropping non-informative columns

1-2- Dropping/trimming based on domain knowledge
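Both steps are one-liners in Pandas; the column names below are hypothetical, standing in for whatever your domain knowledge flags as non-informative:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3],          # a row identifier carries no signal
                   "price": [100, 150, 90],
                   "notes": ["", "", ""]})   # an empty free-text column

# Drop the columns that domain knowledge says are non-informative
df = df.drop(columns=["id", "notes"])
print(df.columns.tolist())
```

Trimming rows on a domain rule works the same way with boolean indexing, e.g. `df = df[df["price"] > 0]`.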

2- Imputing missing data

We can either drop all the rows with missing data, set the missing values to zero, or impute them with the most common value or the median.
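A minimal sketch of the imputation options in Pandas (the toy columns are illustrative): median for a numeric column, the most common value (mode) for a categorical one:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, np.nan, 31.0, 27.0],
                   "city": ["A", "B", None, "B"]})

# Numeric column: impute with the median
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: impute with the most common value
df["city"] = df["city"].fillna(df["city"].mode()[0])
```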

3- Removing Corrupted data

An easy way to get rid of misshapen or corrupted numerical data is to force the type of a column.
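One way to force a column’s type in Pandas is `to_numeric` with `errors="coerce"`, which turns any entry that cannot be parsed into NaN, ready to be dropped or imputed (the corrupted value below is made up):

```python
import pandas as pd

df = pd.DataFrame({"price": ["10.5", "oops", "12.0"]})

# Forcing the numeric dtype converts corrupted entries to NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")
print(df["price"].isna().sum())  # one corrupted entry flagged
```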

4- Looking for correlations

A value close to 1 shows a strong positive correlation, and a value close to -1 a strong negative correlation. This helps us find the important features.
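In Pandas, the pairwise (Pearson) correlations come from `DataFrame.corr()`; the toy columns below are constructed so the signs are obvious:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4],
                   "y": [2, 4, 6, 8],    # moves with x
                   "z": [4, 3, 2, 1]})   # moves against x

corr = df.corr()      # pairwise Pearson correlation matrix
print(corr["x"])      # x vs y is +1, x vs z is -1
```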

5- Getting rid of outliers

To help the linear regression model generalize on unseen data, it is good practice to remove extreme samples from the training data. We apply the 1.5 x IQR rule to isolate data lower than 1.5*IQR (interquartile range) under the first quartile, and higher than 1.5*IQR above the third quartile. This is not necessary for decision tree-based predictors, which are robust to outliers.
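The 1.5 × IQR rule described above can be sketched as a boolean mask in Pandas (the series values are illustrative, with one planted outlier):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is an obvious outlier

q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1

# Keep only values within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
mask = (s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)
trimmed = s[mask]
print(trimmed.tolist())  # the outlier is gone
```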

6- Data whitening

Make the mean equal to zero and the variance equal to one. (Strictly speaking, this alone is standardization; full whitening also decorrelates the features.)

7- Splitting training/test sets

Models learn on a subset of the original data (the training set) and are evaluated on a different subset (the test set). We set the ratio to 70%-30%.
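With scikit-learn this is a single call; the 70%-30% ratio from the text maps to `test_size=0.3` (the arrays below are toy data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 70% of the rows go to training, 30% to the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(len(X_train), len(X_test))  # 7 and 3
```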

The histograms after data preprocessing are much more evenly distributed.

