2  Regression

References: Bishop 1, 1.1, 3.1, (5.2.0-5.2.2), 5.2.3-5.2.4

2.1 Linear Regression

Linear Regression: The simplest linear model for regression is one that involves a linear combination of the input variables

y(x, w) = w_0 + w_1x_1 + ... + w_Dx_D

where x = (x_1, ..., x_D)^T.

It is a linear function of the parameters w_0, ..., w_D, and a linear function of the input variables x_i, and this imposes significant limitations on the model.

We therefore extend the class of models by considering linear combinations of fixed nonlinear functions of the input variables, of the form

y(x, w) = w_0 + \sum^{M-1}_{j=1} w_j \phi_j (x)

where \phi_j(x) are known as basis functions.

The total number of parameters in this model will be M (from j = 1 to j = M - 1).

The parameter w_0 allows for any fixed offset in the data and is called bias.

It is often convenient to define an additional dummy basis function \phi_0 (x) = 1 so that

y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j (x) = w^T\phi(x)

where w = (w_0, ..., w_{m-1})^T and \phi = (\phi_0, ..., \phi_{m-1})^T

By using nonlinear basis functions, we allow the function y(x, w) to be a non-linear function of the input vector x. We still call them linear models because they are still linear in w.

2.2 Training

To find the value of the coefficients that better fit the training data, we try to minimize an error function that measure the misfit between the function y(x, w) and the training set.

One of the most used is the sum of squares of the errors between the predictions y(x_j, w) for each data point x_j and the corresponding target value t_j, so that we minimize:

E(w) = \dfrac{1}{2} \sum_{j=1}^{M-1} [y(x_j, w) - t_j]^2

It is sometimes more convenient to use the root-mean-square error defined by

E_{RMS} = \dfrac{\sqrt{2 \cdot E(w)}}{N}

in which the division by N allows us to compare different sizes of datasets on an equal footing, and the square root ensures that E_{RMS} is measured on the same scale (and in the same units) as the target variable t.

How do we find the minimum of the error function? Using gradient descent.

w^{(\tau +1)} = w^{(\tau)} - \eta \nabla E_n

where \tau denotes the iteration number, \eta is the learning rate, and \nabla E_n = \frac{\partial E}{\partial w}.

The value of w is initialized to some starting vector w^{(0)}.

For the case of the sum-of-squares error function, this gives

w^{(\tau + 1)} = w^{(\tau)} + \eta(t_n - w^{(\tau)T} \phi (x_n)) \phi (x_n)

where (t_n - w^{(\tau) T} \phi(x_n)) is simply the error between the prediction and the target. This is known as least-mean-squares algorithm.

2.3 Regularizer

To control over-fitting, we add another term in the error function, called regularizer. In this way, the final error function will be defined as

E_D(w) + \lambda E_W(w)

where E_D(w) is the error derived from the data, while E_w (w) is the regularizer term, weighted by \lambda.

One of the simplest form of regularizer is the sum-of-squares of the weight vector elements

E_W(w) = \dfrac{1}{2} w^T w

This regularizer is known as weight decay because in sequential learning algorithms, it encourages weight values to decay towards zero, unless supported by the data. It has the advantage that the error function remains a quadratic function of w, and so its exact minimizer can be found in closed form.

A more general regularizer is sometimes used

E_w(W) = \dfrac{\lambda}{2} \sum_{j = 1}^M |w_j|^q

If q = 2 is called L2 regularizer: E_w(W) = w_1^2 + w_2^2 + ... + w_n^2

If q = 1 is called L1 regularizer (lasso): E_w(W) = w_1 + w_2 + ... + w_n. It has the property that if \lambda is sufficiently large, some of the coefficients w_j are driven to zero, leading to a sparse model in which the corresponding basis functions play no role. As \lambda is increased, so an increasing number of parameters are driven to zero.

So, if we use the sum-of-squares error combined with the general regularizer, the final error function to be minimized will be

E(w) = \dfrac{1}{2} \sum_{n=1}^N [t_n - w^T \phi(x_n)]^2 + \dfrac{\lambda}{2} \sum_{j=1}^M |w_j|^q

Regularization allows complex models to be trained on datasets of limited size without severe over-fitting, essentially by limiting the effective model complexity.

2.4 Multiple Outputs

So far, we have considered the case of a single target variable t:

  • w \in R^{M}
  • \phi_j (x) \in R^{M}
  • y(x) \in R

We want to use the same set of basis functions to model all of the components of the target vector.

To do that, we simply define w \in R^{M\times K}, where K is the number of target variables, while \phi_j(x) stays the same. In this way, y(x) \in R^{K}, where every value of the vector corresponds to the prediction of that target variable.