Machine learning: a quick review (part 3)
3- Data preparation
3-1- Imbalanced dataset
An imbalanced dataset is one in which, for example, 90% of the samples in a classification task belong to a single class. This leads to problems: an accuracy of 90% is misleading if the model has no predictive power on the minority class! Here are a few tactics to get over the hump:
- Collect more data to even the imbalances in the dataset.
- Resample the dataset to correct for imbalances.
- Try a different algorithm altogether on your dataset.
What’s important here is having a keen sense of the damage an imbalanced dataset can cause, and of how to correct it.
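As a sketch of the resampling tactic, here is a minimal random-oversampling example in pandas. The column names and toy data are made up for illustration:

```python
import pandas as pd

# Toy imbalanced dataset: 9 samples of class 0, 1 sample of class 1
# (the "feature" and "label" columns are illustrative).
df = pd.DataFrame({
    "feature": range(10),
    "label": [0] * 9 + [1],
})

# Random oversampling: duplicate minority-class rows (with replacement)
# until both classes have the same number of samples.
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]
minority_up = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, minority_up])

# Both classes now contain 9 samples.
print(balanced["label"].value_counts().to_dict())
```

The mirror-image tactic, undersampling the majority class, works the same way with `majority.sample(n=len(minority))`.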
3-2- Avoid overfitting
This is a simple restatement of a fundamental problem in machine learning: the possibility of overfitting training data and carrying the noise of that data through to the test set, thereby providing inaccurate generalizations.
There are three main methods to avoid overfitting:
- Keep the model simpler: reduce variance by taking into account fewer variables and parameters, thereby removing some of the noise in the training data.
- Use cross-validation techniques such as k-fold cross-validation.
- Use regularization techniques such as LASSO that penalize certain model parameters if they’re likely to cause overfitting.
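The last two points can be combined in a few lines with scikit-learn. This is a minimal sketch on synthetic data (the data, the `alpha` value, and the fold count are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic regression data: y depends on the first feature only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

# LASSO applies an L1 penalty that shrinks uninformative
# coefficients toward zero; alpha controls the penalty strength.
model = Lasso(alpha=0.1)

# 5-fold cross-validation: train on 4 folds, validate on the 5th,
# rotating so every sample is used for validation exactly once.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean())
```

A large gap between training scores and these cross-validation scores is the classic symptom of overfitting.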
3-3- Missing/corrupted data
You could find missing/corrupted data in a dataset and either drop those rows or columns, or decide to replace them with another value.
In Pandas, there are two very useful methods: isnull() and dropna() that will help you find columns of data with missing or corrupted data and drop those values. If you want to fill the invalid values with a placeholder value (for example, 0), you could use the fillna() method.
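A quick sketch of those three methods on a tiny made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Toy DataFrame with a missing value in each column.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# isnull() flags missing entries; summing gives a per-column count.
print(df.isnull().sum())

# dropna() removes every row containing at least one missing value.
dropped = df.dropna()

# fillna(0) replaces missing entries with a placeholder value.
filled = df.fillna(0)
```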
3-4- Outlier detection
Interquartile Range
Any set of data can be described by its five-number summary.
- The minimum or lowest value of the dataset
- The first quartile Q1, which marks one quarter of the way through the sorted data
- The median of the dataset, which marks the midpoint of the sorted data
- The third quartile Q3, which marks three-quarters of the way through the sorted data
- The maximum or highest value of the data set.
IQR = Q3 – Q1.
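The five-number summary and the IQR can be computed directly with NumPy percentiles (the data here are made up, with 100 as an obvious outlier):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# The five-number summary via percentiles.
minimum = data.min()
q1 = np.percentile(data, 25)
median = np.percentile(data, 50)
q3 = np.percentile(data, 75)
maximum = data.max()

# The interquartile range: spread of the middle 50% of the data.
iqr = q3 - q1
print(minimum, q1, median, q3, maximum, iqr)
```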
3-5- Feature selection/engineering
Given domain knowledge of the available parameters, transformations can be performed to engineer new features that better capture the information contained in the raw dataset.
3-6- Feature standardization
Standardize the data by the relation (x − mean(x)) / std(x), which gives each feature zero mean and unit variance.
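The relation above in a couple of lines of NumPy (the values are arbitrary):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# Standardize: subtract the mean, divide by the standard deviation.
z = (x - x.mean()) / x.std()

print(z.mean())  # ~0 (up to floating-point error)
print(z.std())   # 1.0
```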
The metric used to evaluate and compare the different models (predictors) is R² (R-squared, or coefficient of determination), which evaluates how well the predicted values fit the original values. Its best value is 1, with a value close to 1 indicating a better model.
R² Validation is the mean R² score calculated on the validation folds during cross-validation.
R² Test is the evaluation of the final model, performed on the test set, which contains data not seen during training (30% of the total data in our case). It is usually lower than R² Validation, but too large a difference would be a sign that the model overfits the training data and does not generalize well to new data.
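As an illustration, R² can be computed with scikit-learn's `r2_score`; the true and predicted values here are made up:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical ground truth and model predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])

# R² = 1 - SS_res / SS_tot: the fraction of variance explained.
print(r2_score(y_true, y_pred))  # 0.995
```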
1- Imposing domain knowledge
There is some domain knowledge about the data that can be used before processing it.
In our example, the following domain knowledge is used:
1-1- Dropping non-informative columns
1-2- Dropping/trimming based on domain knowledge
2- Imputing missing data
We can either drop all the rows with missing data, or fill the gaps: set the missing values to zero, or impute them with the most common value or the median.
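A sketch of the two imputation strategies on a made-up Series (note how the median at 3.0 is barely affected by the extreme value 100):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 100.0])

# Impute with the median (robust to the extreme value 100).
imputed_median = s.fillna(s.median())

# ...or with the most common value (mode); mode() returns all values
# tied for the highest count, so we take the first.
imputed_mode = s.fillna(s.mode()[0])
```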
3- Removing corrupted data
An easy way to get rid of misshapen or corrupted numerical data is to force the type of a column.
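In pandas this can be done with `pd.to_numeric`: the `errors="coerce"` option turns anything that cannot be parsed as a number into NaN, which can then be dropped or imputed. The example data are made up:

```python
import pandas as pd

# A column that should be numeric but contains a corrupted entry.
s = pd.Series(["1.5", "2.0", "oops", "4.2"])

# Force the column to a numeric type; unparseable entries become NaN.
numeric = pd.to_numeric(s, errors="coerce")

# Drop the rows that failed to parse.
cleaned = numeric.dropna()
```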
4- Looking for correlations
A correlation value close to 1 indicates a strong positive correlation, and a value close to -1 a strong negative one. This helps us identify the important features.
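A minimal sketch with `DataFrame.corr()`, using synthetic columns constructed to be positively correlated, negatively correlated, and uncorrelated with `x`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "pos": 2 * x + rng.normal(scale=0.1, size=200),   # strong positive
    "neg": -x + rng.normal(scale=0.1, size=200),      # strong negative
    "noise": rng.normal(size=200),                    # uncorrelated
})

# Pairwise Pearson correlations between all columns.
corr = df.corr()
print(corr["x"].round(2))
```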
5- Getting rid of outliers
To help the linear regression model generalize to unseen data, it is good practice to remove extreme samples from the training data. We apply the 1.5 × IQR rule to discard data more than 1.5 × IQR (interquartile range) below the first quartile or more than 1.5 × IQR above the third quartile. This is not necessary for decision-tree-based predictors, which are robust to outliers.
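The 1.5 × IQR rule in NumPy, on made-up data where 100 is the obvious outlier:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Keep only samples within [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
trimmed = data[(data >= lower) & (data <= upper)]

print(trimmed)  # the outlier 100 is removed
```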
6- Data whitening
Shift and scale each feature so that its mean is zero and its variance is one.
7- Splitting training/test sets
Models learn on a subset of the original data (the training set) and are evaluated on a different subset (the test set). We set the ratio to 70%-30%.
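The split can be done with scikit-learn's `train_test_split`; the data here are arbitrary and the seed is fixed for reproducibility:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# 70% training / 30% test split, shuffled with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))  # 70 30
```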
The histograms after data preprocessing are much more evenly distributed.
Data preparation is a must in every machine learning pipeline, yet it is usually overlooked, both from a teaching perspective and from an engineering viewpoint.