1 Introduction

1.1 Classification

The goal is to learn a mapping from inputs x to outputs y, where y \in \{1, \ldots, C\} and C is the number of classes.

if C = 2, this is called binary classification
if C > 2, this is called multiclass classification
if the class labels are not mutually exclusive, we call it multi-label classification
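
The settings above differ mainly in the shape of the target. A minimal sketch, assuming classes are numbered 1..C as in the text; the helper names one_hot and multi_hot are hypothetical and chosen only for illustration:

```python
# Toy setup with C = 3 classes, numbered 1..C (illustrative values only).
C = 3

def one_hot(y, num_classes):
    """Encode one class label y in {1, ..., num_classes} as a 1-of-C vector."""
    return [1 if c == y else 0 for c in range(1, num_classes + 1)]

def multi_hot(labels, num_classes):
    """Encode a set of non-exclusive labels as a binary indicator vector."""
    return [1 if c in labels else 0 for c in range(1, num_classes + 1)]

# Multiclass: exactly one class is active per example.
multiclass_target = one_hot(2, C)

# Multi-label: any subset of the classes may be active.
multilabel_target = multi_hot({1, 3}, C)
```

Binary classification is the special case C = 2 of the multiclass encoding.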

1.2 Probabilistic predictions

To handle ambiguous cases, it is desirable to return a probability. We denote the probability distribution over possible labels, given the input vector x and training set D, by p(y | x, D). In general, this is a vector of length C. If there are just two classes (labeled 0 and 1 here), it is sufficient to return the single number p(y = 1 | x, D), since p(y = 1 | x, D) + p(y = 0 | x, D) = 1.
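
One common way to obtain such a probability vector is to pass C real-valued scores through a softmax. This is a sketch under assumptions: the notes do not say how p(y | x, D) is computed, and the scores below are made-up numbers standing in for the output of some fitted model.

```python
import math

def softmax(scores):
    """Map C real-valued scores to a probability vector of length C."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(score):
    """Map one real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Multiclass case: one probability per class, summing to 1.
probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for C = 3

# Binary case: a single number is enough, the other follows by subtraction.
p1 = sigmoid(0.8)  # hypothetical score; plays the role of p(y = 1 | x, D)
p0 = 1.0 - p1      # plays the role of p(y = 0 | x, D)
```

The class with the largest entry of probs would be the most probable label; the remaining entries say how ambiguous the case is.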

1.3 Paradigms

Supervised: the correct output is known for each training example

learn to predict the output when given an input vector
classification: 1-of-C output (one of C class labels)
regression: real-valued output

Unsupervised: create an internal representation of the input that captures regularities and structure in the data, that is, discover patterns that summarize the underlying relationships in the data. Unlike supervised learning, we are not told the desired output for each input. We formalize the task as density estimation: build models of the form p(x_i | \theta). There are two differences from the supervised case:

we have written p(x_i | \theta) instead of p(y_i | x_i, \theta): supervised learning is conditional density estimation, whereas unsupervised learning is unconditional density estimation
x_i is a vector of features, so we need to create multivariate probability models. By contrast, in supervised learning, y_i is usually just a single variable that we are trying to predict. This means that for most supervised learning problems, we can use univariate probability models, which significantly simplifies the problem
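
A minimal sketch of unconditional density estimation for vector-valued x_i. It assumes a Gaussian with diagonal covariance, that is, a product of per-feature univariate Gaussians; this modeling choice and the toy data are assumptions made for illustration, not something the notes prescribe. Here \theta is the pair (means, variances).

```python
import math

def fit_diag_gaussian(X):
    """Maximum-likelihood mean and variance of each feature."""
    n = len(X)
    d = len(X[0])
    means = [sum(x[j] for x in X) / n for j in range(d)]
    variances = [sum((x[j] - means[j]) ** 2 for x in X) / n for j in range(d)]
    return means, variances

def log_density(x, means, variances):
    """log p(x | theta) under the diagonal Gaussian."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
        for xj, m, v in zip(x, means, variances)
    )

# Toy unlabeled data: four 2-dimensional feature vectors, no targets y_i.
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 4.0], [2.0, 2.0]]
means, variances = fit_diag_gaussian(X)

# Points near the bulk of the data get a higher density than far-away points.
near = log_density([2.0, 2.0], means, variances)
far = log_density([5.0, 5.0], means, variances)
```

A full multivariate Gaussian would also need the covariances between features, which is part of why the multivariate case is harder than predicting a single y_i.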

Reinforcement learning: learn how an agent should take actions in a dynamic environment in order to maximize a cumulative reward signal

1.4 Parametric vs non-parametric models

When a model has a fixed number of parameters, it is called parametric.

When the number of parameters grows with the amount of training data, it is called non-parametric.

parametric models
- advantage: faster to use
- disadvantage: make stronger assumptions about the nature of the data distribution
non-parametric models
- advantage: more flexible
- disadvantage: often computationally intractable for large datasets
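
The contrast above can be sketched with two toy regressors. This is an illustration under assumptions: a straight-line fit stands in for a parametric model and k-nearest-neighbor averaging for a non-parametric one; neither is named in the notes, and the data are made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w * x + b: always two parameters."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = my - w * mx
    return w, b

def knn_predict(x_query, xs, ys, k=3):
    """Average the targets of the k training points nearest to x_query."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_query))[:k]
    return sum(y for _, y in nearest) / k

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# Parametric: after fitting, only (w, b) is kept and the data can be discarded.
w, b = fit_line(xs, ys)
line_pred = w * 2.5 + b

# Non-parametric: every prediction scans the whole training set.
knn_pred = knn_predict(2.5, xs, ys, k=2)
```

The line assumes a linear relationship, which happens to hold for this toy data. The nearest-neighbor predictor makes no such assumption, but it must store all training points and search them for each query, so its cost grows with the dataset.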

No free lunch theorem: there is no universally best model, because a set of assumptions that works well in one domain may work poorly in another.