
Machine learning: a quick review (part 4)

4- Supervised learning

4-1- Decision tree

Decision tree regression (DTR) models partition the feature space into rectangles and learn a simple (e.g. constant) model in each of those regions [23].

Assuming that the feature space has been partitioned into M regions, namely R_1, . . . , R_M, and that the model’s prediction in each region is c_m, the DTR model has the following formulation:

f(x) = ∑_{m=1}^{M} c_m · 1(x ∈ R_m)

where 1(·) is the indicator function, returning 1 when the condition in brackets is true, and 0 in any other case.

The best c_m can be obtained through the minimisation of the fit’s LS criterion, ∑_i (y_i − f(x_i))^2; for a fixed partition, this is simply the mean of the y_i falling in each region. Since searching over all possible partitions is computationally infeasible, a greedy algorithm is used recursively to find a good splitting, until the stopping rule is activated.
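As a concrete illustration, here is a minimal sketch of this procedure using scikit-learn’s DecisionTreeRegressor, which implements this greedy, recursive least-squares splitting (the data here is synthetic and purely illustrative):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # Synthetic 1-D regression data (purely illustrative)
    rng = np.random.RandomState(0)
    X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
    y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)

    # Greedy, recursive LS splitting; max_depth acts as the stopping rule
    tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

    # Each prediction is the constant c_m of the region R_m the query falls into
    print(tree.predict([[2.5]]))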

The branches/edges represent the outcome of a node’s test, and the nodes are of two kinds:

  1. Conditions [Decision Nodes]
  2. Results [End Nodes]

The branches/edges represent the truth or falsity of a condition, and the tree makes a decision based on that, as in the example below, which shows a decision tree that evaluates the smallest of 3 numbers:
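Since the original figure is not reproduced here, the following minimal sketch expresses the same tree as nested conditions: each if is a decision node, each return an end node.

    def smallest_of_three(a, b, c):
        if a < b:          # decision node: a < b?
            if a < c:      # decision node: a < c?
                return a   # end node
            return c       # end node
        if b < c:          # decision node: b < c?
            return b       # end node
        return c           # end node

    print(smallest_of_three(7, 2, 5))  # -> 2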

4-2- Linear/polynomial regression

Solution: Least Squares (LS)
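A minimal NumPy sketch (with made-up noisy data): np.polyfit solves the LS problem in closed form, with deg=1 giving linear and deg>1 polynomial regression.

    import numpy as np

    # Made-up noisy quadratic data
    rng = np.random.RandomState(0)
    x = np.linspace(-3, 3, 50)
    y = 1.5 * x**2 - 2.0 * x + 1.0 + rng.normal(0, 1.0, x.size)

    # polyfit minimises the least-squares criterion sum_i (y_i - f(x_i))^2
    linear_coeffs = np.polyfit(x, y, deg=1)  # [slope, intercept]
    poly_coeffs = np.polyfit(x, y, deg=2)    # [a, b, c] of a*x^2 + b*x + c

    print("linear:", linear_coeffs)
    print("quadratic:", poly_coeffs)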

4-3- Classification

  • Classification predictive modeling involves assigning a class label to input examples.
  • Binary classification refers to predicting one of two classes, and multi-class classification involves predicting one of more than two classes.
  • Multi-label classification involves predicting one or more classes for each example, and imbalanced classification refers to classification tasks where the distribution of examples across the classes is not equal.

There are perhaps four main types of classification tasks that you may encounter; they are:

  • Binary Classification
  • Multi-Class Classification
  • Multi-Label Classification
  • Imbalanced Classification

Popular algorithms that can be used for binary classification include:

  • Logistic Regression
  • k-Nearest Neighbors
  • Decision Trees
  • Support Vector Machine
  • Naive Bayes

Popular algorithms that can be used for multi-class classification include:

  • k-Nearest Neighbors
  • Decision Trees
  • Naive Bayes
  • Random Forest
  • Gradient Boosting

For multi-label classification, specialized versions of standard classification algorithms can be used, so-called multi-label versions of the algorithms, including (see the sketch after this list):

  • Multi-label Decision Trees
  • Multi-label Random Forests
  • Multi-label Gradient Boosting
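As one illustration (not prescribed by the text), scikit-learn’s decision trees accept a multi-label indicator matrix directly, so a multi-label decision tree can be sketched as follows with made-up data:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Made-up data: 2 features, 3 possible labels per example
    X = np.array([[0.1, 0.9], [0.8, 0.2], [0.9, 0.9], [0.2, 0.1]])
    # Each row of Y is an indicator vector over the 3 labels
    Y = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]])

    clf = DecisionTreeClassifier().fit(X, Y)
    print(clf.predict([[0.85, 0.85]]))  # e.g. [[1 1 1]]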

Examples of imbalanced classification include (a mitigation sketch follows the list):

  • Fraud detection.
  • Outlier detection.
  • Medical diagnostic tests.
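One common mitigation, shown here purely as an illustrative choice, is to reweight classes inversely to their frequency, e.g. via class_weight='balanced' in scikit-learn:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic imbalanced problem: roughly 95% negatives, 5% positives
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                               random_state=0)

    # 'balanced' reweights samples inversely to class frequency
    clf = LogisticRegression(class_weight="balanced").fit(X, y)
    print(clf.score(X, y))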

K-Nearest Neighbours: given a point x_q, the algorithm identifies the k nearest neighbours distance-wise, with the parameter k being user-selectable (n_neighbours) [24]. To calculate the distance between x_q and any other point x_j, the Minkowski distance L_p is usually used:

L_p(x_q, x_j) = (∑_d |x_{q,d} − x_{j,d}|^p)^(1/p)

With p = 1 this corresponds to the Manhattan distance, and with p = 2 to the Euclidean distance.
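A minimal scikit-learn sketch with toy data: n_neighbors sets k, and p selects the Minkowski order (p=1 Manhattan, p=2 Euclidean).

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Toy 2-D training data with two classes
    X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
    y = np.array([0, 0, 0, 1, 1, 1])

    # k = 3 neighbours under the Euclidean (p = 2) Minkowski distance
    knn = KNeighborsClassifier(n_neighbors=3, p=2).fit(X, y)

    x_q = np.array([[4.5, 5.0]])  # query point
    print(knn.predict(x_q))       # -> [1]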

Support vector machines (SVM) in their simplest form constitute a two-class classifier for cases where the two classes are linearly separable. SVMs work by deriving the optimal hyperplane, i.e. the hyperplane that offers the widest possible margin between instances of the two classes. Support Vector Regressors (SVRs) work in a similar way, this time trying to fit a hyperplane that accurately predicts the target values of training samples within a margin of tolerance. Simple case with a linear kernel:

f(x) = sign(w^T x + b)

The model parameters w are obtained through the minimisation of the function

(1/2) ‖w‖^2, subject to y_i (w^T x_i + b) ≥ 1 for every training sample i.

An often-used non-linear kernel is the RBF kernel, formulated as

K(x, x′) = exp(−γ ‖x − x′‖^2)
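A minimal sketch with scikit-learn’s SVC and made-up ring-shaped data: gamma is the RBF parameter γ, and C is scikit-learn’s soft-margin penalty, a practical relaxation of the hard-margin objective above.

    import numpy as np
    from sklearn.svm import SVC

    # Made-up non-linearly-separable data: class 1 inside a ring of class 0
    rng = np.random.RandomState(0)
    angles = rng.uniform(0, 2 * np.pi, 100)
    inner = rng.uniform(0, 1, 100) < 0.5
    radii = np.where(inner, 0.5, 2.0) + rng.normal(0, 0.1, 100)
    X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
    y = inner.astype(int)

    # RBF kernel: K(x, x') = exp(-gamma * ||x - x'||^2)
    clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)
    print(clf.predict([[0.0, 0.4], [2.0, 0.0]]))  # -> [1 0] (inner, outer)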

4-4- Regularization

To the LS criterion, regularised regression adds a penalty on the model weights, e.g.

∑_i (y_i − f(x_i))^2 + λ ‖w‖^2 (L2/ridge) or ∑_i (y_i − f(x_i))^2 + λ ‖w‖_1 (L1/lasso)

where the hyperparameter λ > 0 is a user-selectable parameter that controls the amount of shrinkage. This shrinkage helps avoid overfitting the training dataset.

L2 regularization tends to spread the error among all the terms, while L1 is more binary/sparse, driving many weights to exactly zero while keeping the remaining ones non-zero. L1 corresponds to setting a Laplacean prior on the terms, while L2 corresponds to a Gaussian prior.
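A minimal sketch contrasting the two penalties on synthetic data (alpha plays the role of λ); the Lasso coefficients illustrate the sparsity effect described above.

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso

    # Synthetic data: only 2 of 10 features actually matter
    rng = np.random.RandomState(0)
    X = rng.normal(size=(200, 10))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, 200)

    # alpha corresponds to the shrinkage hyperparameter lambda
    ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks all weights a little
    lasso = Lasso(alpha=0.5).fit(X, y)  # L1: zeroes out irrelevant weights

    print(np.round(ridge.coef_, 2))  # all 10 weights non-zero
    print(np.round(lasso.coef_, 2))  # mostly exact zeros, sparse solution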

