Machine learning: a quick review (part 4)
4- Supervised learning
4-1- Decision tree
Decision tree regression (DTR) models partition the feature space into rectangles and learn a simple (e.g. constant) model in each of them [23].
Assuming that the feature space has been partitioned into M regions, namely R_1, ..., R_M, and that the model's prediction at each region is c_m, the DTR model will have the following formulation:

f(x) = ∑_{m=1}^{M} c_m · 1(x ∈ R_m)

where 1 is the indicator function, returning 1 where the condition in brackets is true, and 0 in any other case.
The best c_m for each region is simply the mean of the target values y_i falling in that region, and can be obtained through the minimisation of the fit's least-squares (LS) error, ∑_i (y_i − f(x_i))². A greedy algorithm is applied recursively to find an optimal splitting, until a stopping rule is activated.
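To make the greedy least-squares splitting concrete, here is a minimal sketch (not the full recursive algorithm) that finds the single best split on one numeric feature; the function name best_split and the toy data are purely illustrative:

```python
import numpy as np

def best_split(x, y):
    """Greedy LS split: try each threshold on a single feature and keep the
    one minimising the summed squared error of the two mean predictions."""
    best_t, best_sse = None, np.inf
    for t in np.unique(x):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        # each region predicts its mean, i.e. c_m = mean of the y_i in R_m
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_t, best_sse = t, sse
    return best_t, best_sse

# toy 1-D example: a step function is recovered by a single split near 0.5
x = np.linspace(0, 1, 20)
y = np.where(x < 0.5, 1.0, 3.0)
print(best_split(x, y))
```

A full decision tree repeats this search over all features, applies it recursively to the two resulting regions, and stops when a rule such as a maximum depth or a minimum number of samples per leaf is met.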
The branches/edges represent the outcome of a node, and the nodes are either:
- Conditions [Decision Nodes]
- Results [End Nodes]
The branches/edges represent the truth or falsity of the condition, and the tree makes a decision accordingly. The example below shows the decision logic of a tree that finds the smallest of 3 numbers:
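The original figure is not reproduced here; as a minimal sketch, the same decision logic can be written as nested conditions, where each if is a decision node and each return is an end node:

```python
def smallest_of_three(a, b, c):
    if a <= b:          # decision node: is a <= b ?
        if a <= c:      # decision node: is a <= c ?
            return a    # end node
        return c        # end node
    if b <= c:          # decision node: is b <= c ?
        return b        # end node
    return c            # end node

print(smallest_of_three(7, 2, 5))  # -> 2
```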
4-2- Linear/polynomial regression
A linear regression model fits y ≈ w·x + b to the training data; polynomial regression simply adds powers of the features as extra inputs. Solution: Least Squares (LS), i.e. choosing the parameters that minimise ∑_i (y_i − ŷ_i)²; in the linear case this has the closed-form solution w = (XᵀX)⁻¹Xᵀy.
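As a minimal sketch of an LS fit in NumPy, the snippet below fits a straight line and a degree-2 polynomial to toy data; the data and the chosen degree are illustrative only:

```python
import numpy as np

# toy data: a noisy quadratic
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 2.0 * x**2 - x + 0.5 + rng.normal(scale=0.1, size=x.shape)

# linear regression: design matrix [x, 1], LS solution via lstsq
X = np.column_stack([x, np.ones_like(x)])
w_lin, *_ = np.linalg.lstsq(X, y, rcond=None)

# polynomial regression: the same LS machinery applied to powers of x
w_poly = np.polyfit(x, y, deg=2)   # coefficients of x^2, x and the constant

print(w_lin)    # slope and intercept of the best straight line
print(w_poly)   # should be close to [2, -1, 0.5]
```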
4-3- Classification
- Classification predictive modeling involves assigning a class label to input examples.
- Binary classification refers to predicting one of two classes, while multi-class classification involves predicting one of more than two classes.
- Multi-label classification involves predicting one or more classes for each example, and imbalanced classification refers to classification tasks where the distribution of examples across the classes is not equal.
There are four main types of classification tasks that you may encounter:
- Binary Classification
- Multi-Class Classification
- Multi-Label Classification
- Imbalanced Classification
Popular algorithms that can be used for binary classification include:
- Logistic Regression
- k-Nearest Neighbors
- Decision Trees
- Support Vector Machine
- Naive Bayes
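As a minimal sketch, the snippet below applies one of the listed algorithms (logistic regression, via scikit-learn) to a synthetic two-class dataset; the dataset parameters are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic two-class problem
X, y = make_classification(n_samples=200, n_features=5, n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy on held-out data
print(clf.predict(X_test[:3]))     # predicted class labels (0 or 1)
```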
Popular algorithms that can be used for multi-class classification include:
- k-Nearest Neighbors
- Decision Trees
- Naive Bayes
- Random Forest
- Gradient Boosting
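A similar sketch for multi-class classification, here with k-Nearest Neighbors on the three-class Iris dataset (any of the algorithms listed above could be swapped in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # 3 classes of iris flowers
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))    # accuracy over the 3 classes
```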
For multi-label classification, specialised multi-label versions of standard classification algorithms can be used, including:
- Multi-label Decision Trees
- Multi-label Random Forests
- Multi-label Gradient Boosting
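As a minimal multi-label sketch: scikit-learn's decision trees and random forests accept a 2-D label matrix directly, so a multi-label problem can be handled without extra wrappers (the synthetic dataset parameters are illustrative):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier

# each sample may carry several of 4 labels at once
X, Y = make_multilabel_classification(n_samples=200, n_features=10, n_classes=4, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X, Y)   # Y has shape (n_samples, 4)
print(clf.predict(X[:2]))   # one 0/1 label vector per sample
```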
Examples of imbalanced classification include:
- Fraud detection
- Outlier detection
- Medical diagnostic tests
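One common and simple mitigation for class imbalance is re-weighting the classes during training; a minimal sketch using scikit-learn's class_weight='balanced' option on a skewed synthetic dataset (the 95/5 split is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 95% of samples belong to class 0, 5% to class 1 (e.g. fraud cases)
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print(y.mean())   # fraction of positives, roughly 0.05

# re-weight samples inversely to class frequency
clf = LogisticRegression(class_weight="balanced").fit(X, y)
print(clf.score(X, y))
```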
K-Nearest Neighbours: given a query point x_q, the algorithm identifies its k nearest neighbours distance-wise, with the parameter k being user-selectable (n_neighbors in scikit-learn) [24]. To calculate the distance between x_q and any other point x_j, the Minkowski distance L_p is usually used:

L_p(x_q, x_j) = (∑_d |x_{q,d} − x_{j,d}|^p)^(1/p)

With p = 1 this corresponds to the Manhattan distance and with p = 2 to the Euclidean distance.
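A minimal sketch of the Minkowski distance and of k-NN classification with scikit-learn; the toy points are illustrative, and n_neighbors and p are the user-selectable parameters mentioned above:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def minkowski(xq, xj, p):
    """L_p distance between two points (p=1: Manhattan, p=2: Euclidean)."""
    return (np.abs(xq - xj) ** p).sum() ** (1.0 / p)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(a, b, p=1))   # 7.0 (Manhattan)
print(minkowski(a, b, p=2))   # 5.0 (Euclidean)

# scikit-learn's k-NN uses the same metric (p=2 by default)
X = np.array([[0, 0], [0, 1], [5, 5], [6, 5]])
y = np.array([0, 0, 1, 1])
knn = KNeighborsClassifier(n_neighbors=3, p=2).fit(X, y)
print(knn.predict([[1, 1]]))  # -> [0]
```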
Support vector machines (SVMs) in their simplest form constitute a two-class classifier for cases where the two classes are linearly separable. SVMs work by deriving the optimal hyperplane, i.e. the hyperplane that offers the widest possible margin between instances of the two classes. Support Vector Regressors (SVRs) work in a similar way, this time trying to fit a hyperplane that accurately predicts the target values of training samples within a margin of tolerance. In the simple case with a linear kernel, the model parameters w are obtained through the minimisation of the function

(1/2) ‖w‖²   subject to   y_i (w·x_i + b) ≥ 1 for every training sample i.

An often-used non-linear kernel is the RBF kernel, formulated as

K(x, x′) = exp(−γ ‖x − x′‖²).
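A minimal sketch of a linear-kernel SVM, an RBF-kernel SVM and an SVR on toy data; the points and the C, gamma and epsilon values are illustrative:

```python
import numpy as np
from sklearn.svm import SVC, SVR

# two well-separated clusters: a linear SVM finds the max-margin hyperplane
X = np.array([[0, 0], [0, 1], [1, 0], [4, 4], [4, 5], [5, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

lin = SVC(kernel="linear", C=1.0).fit(X, y)
print(lin.coef_, lin.intercept_)   # w and b of the separating hyperplane

# RBF kernel for non-linear boundaries; gamma is the γ in exp(-γ‖x - x'‖²)
rbf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)
print(rbf.predict([[0.5, 0.5], [4.5, 4.5]]))   # expected: [0 1]

# SVR fits a hyperplane that keeps training targets within an ε-tube
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y.astype(float))
print(svr.predict([[2.5, 2.5]]))
```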
4-4- Regularization
Regularisation adds a penalty on the size of the model weights to the training loss, e.g. for least squares:

∑_i (y_i − w·x_i)² + λ ‖w‖²   (L2 / ridge)   or   ∑_i (y_i − w·x_i)² + λ ‖w‖₁   (L1 / lasso),

where the hyperparameter λ > 0 is a user-selectable parameter that controls the amount of shrinkage. This shrinkage helps avoid overfitting the training dataset.
L2 regularisation tends to spread the error among all the weights, shrinking them towards zero without eliminating any of them, while L1 produces sparser solutions, driving many weights exactly to zero. From a Bayesian viewpoint, L1 corresponds to placing a Laplace prior on the weights, while L2 corresponds to a Gaussian prior. Which norm is preferable, and how it reshapes the learned weights, also depends on the problem domain.
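A minimal sketch contrasting L2 (ridge) and L1 (lasso) shrinkage; in scikit-learn the hyperparameter λ is exposed as alpha, and the toy data are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# toy data where only 2 of 10 features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty

print(np.round(ridge.coef_, 2))  # all coefficients shrunk, none exactly zero
print(np.round(lasso.coef_, 2))  # irrelevant coefficients driven to exactly 0
```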