4 Logistic Regression
References: Bishop Ch. 4.2, 4.3.0-4.3.2
4.1 Introduction
Now we talk about the generative approach in which we model the class-conditional densities p(x | C_k), as well as the class priors p(C_k), and then use these to compute posterior probabilities p(C_k | x) through Bayes’ theorem.
Consider first of all the case of two classes. The posterior probability for class C_1 can be written as
\begin{aligned} p(C_1 | x ) &= \dfrac{p(x|C_1)p(C_1)}{p(x|C_1)p(C_1) + p(x | C_2)p(C_2)} \\ &= \dfrac{\frac{p(x | C_1) p(C_1)}{p(x | C_1)p(C_1)}}{\frac{p(x|C_1)p(C_1) + p(x|C_2) p(C_2)}{p(X|C_1) p(C_1)}} \\&= \dfrac{1}{1 + \frac{p(x|C_2)p(C_2)}{p(x | C_1)p(C_1)}}\\ &= \dfrac{1}{1 + exp(-a)} = \sigma(a) \end{aligned}
To demonstrate that this is valid, we consider
a = \ln \dfrac{p(x|C_1) p(C_1) }{p(x|C_2) p(C_2)}
\begin{aligned} -1 \cdot a &= -1 \cdot \ln \dfrac{p(x|C_1) p(C_1) }{p(x|C_2) p(C_2)} \\ -a &= \ln \dfrac{p(x|C_2) p(C_2)}{p(x|C_1) p(C_1)} \\ \exp(-a) &= \exp\left(\ln \dfrac{p(x|C_2) p(C_2)}{p(x|C_1) p(C_1)}\right) \\ \exp(-a) &= \dfrac{p(x|C_2) p(C_2)}{p(x|C_1) p(C_1)} \end{aligned}
For the case of K > 2 classes, we have
\begin{aligned} p(C_k | x) &= \dfrac{p(x | C_k) p(C_k)}{\sum_j p(x | C_j) p (C_j)} \\ &= \dfrac{\exp(a_k)}{\sum_j \exp(a_j)} \end{aligned}
a_k = \ln p(x | C_k) p (C_k)
This quantity its just equivalent to the one seen before, just in a different formulation.
This formula is known as the softmax function as it represents a smoothed version of the max function because, if a_k \gg a_j for all j \neq k, then p(C_k | x) \simeq 1, and p(C_j | x) \simeq 0, in this way we don’t set to zero the class to which the example do not belong in order to still learn using the gradient descent.
4.2 Logistic Regression
p(C_1 | \phi) = y(\phi) = \sigma(w^T\phi)
\sigma(\cdot) is the logistic sigmoid function. This model is known as logistic regression, although this is a model for classification rather than regression.
For an M-dimensional feature space \phi, this model has M parameters.
4.3 Learning a Logistic Regression Model
We use maximum likelihood to determine the parameters of the model.
For a dataset \{\phi_n, t_n\}, where t_n \in \{0, 1\} and \phi_n = \phi(x_n), with n = 1, \dots, N, the likelihood function can be written as
L(w) = p(\mathbf{t} | \mathbf{X}; w) = \prod_{n=1}^N p(C_1 | \phi_n)^{t_n} p(C_0 | \phi_n)^{1 - t_n}
By defining y_n = p(C_1 | \phi_n) = \sigma(w^T \phi_n) and 1 - y_n = p(C_0 | \phi_n), we cast the problem from a maximization problem to a minimization problem in order to define an error function by taking the negative logarithm of the likelihood:
E(w) = -\ln p(\mathbf{t} | w) = - \sum_{n = 1}^N \{t_n \ln y_n + (1 - t_n) \ln (1 - y_n)\}
This is known as the cross-entropy error function for the two-class classification problem. We want to minimize this error function with respect to the parameter vector w.
Applying the chain rule to compute the gradient of the error for a single observation E_n(w), with a_n = w^T \phi_n:
\dfrac{\partial E_n(w)}{\partial w} = \dfrac{\partial E_n(w)}{\partial y_n} \cdot \dfrac{\partial y_n}{\partial a_n} \cdot \dfrac{\partial a_n}{\partial w}
The derivative of the sigmoid function y_n = \sigma(a_n) can be expressed as:
\dfrac{d\sigma}{da_n} = \sigma(a_n)(1-\sigma(a_n)) = y_n(1 - y_n)
The full gradient \nabla E(w) over the entire dataset is derived as follows:
\begin{aligned} \nabla E(w) &= \sum_{n = 1}^N \nabla E_n(w) \\ &= - \sum_{n = 1}^N \left\{ \underbrace{\dfrac{t_n}{y_n}}_{\frac{\partial}{\partial y_n} (t_n \ln y_n)} [\underbrace{y_n (1 - y_n)}_{\frac{\partial y_n}{\partial a_n}} \underbrace{\phi_n}_{\frac{\partial a_n}{\partial w}}] - \underbrace{\dfrac{1- t_n}{1 - y_n}}_{-\frac{\partial}{\partial y_n} ((1-t_n) \ln(1-y_n))} [\underbrace{y_n(1 -y_n)}_{\frac{\partial y_n}{\partial a_n}} \underbrace{\phi_n}_{\frac{\partial a_n}{\partial w}}] \right\} \\ &= - \sum_{n = 1}^N [\underbrace{y_n (1 - y_n) \phi_n}_{\frac{\partial y_n}{\partial w}}] \underbrace{\left( \dfrac{t_n}{y_n} - \dfrac{1-t_n}{1- y_n}\right)}_{\frac{\partial (-E_n)}{\partial y_n}} \\ &= - \sum_{n=1}^N [\cancel{y_n(1-y_n)}\phi_n] \underbrace{\dfrac{t_n - y_n}{\cancel{y_n (1- y_n)}}}_{\text{messa a denominatore comune}} \\ &= - \sum_{n=1}^N (t_n - y_n) \phi_n \\ &= \sum_{n=1}^N (y_n - t_n) \phi_n \quad \text{(assorbimento del segno negativo)} \end{aligned}
The update rule for gradient descent at iteration \tau, utilizing the learning rate \eta, is expressed for the entire parameter vector w as:
w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E(w) = w^{(\tau)} - \eta \sum_{n = 1}^N (y_n - t_n) \phi_n = w^{(\tau)} + \eta \sum_{n = 1}^N (t_n - y_n) \phi_n
\lambda here is the learning rate.
4.4 Problems
Maximum likelihood can exhibit severe over-fitting for datasets that are linearly separable. This arises because the maximum likelihood solution occurs when the hyperplane corresponding to \sigma = 0.5, equivalent to w^T \phi = 0 (decision boundary), separates the two classes and the magnitude of w goes to infinity. In this case, the sigmoid function will correspond to a Heaviside step function, so that every training point from each class k is assigned a posterior probability p(C_k | x) = 1.
We can avoid this problem by adding a regularization term to the error function, and adding a prior probability on weights.
p(t | w) = p(w) \prod_{n=1}^N p(C_1 | \phi_n)^{t_n} \{1 - p(C_1 | \phi_n)\}^{1 - t_n}
w_j^i = w_j^{i-1} - \lambda \sum_{n = 1}^N (t_n - y_n) \phi_n - \lambda \alpha w_j^{i - 1}
\alpha is the importance of the regularization term.
4.5 Multi-class formalization
We have seen that for a large class of distributions, the posterior probabilities are given by a softmax transformation of this way:
p(C_k \mid \boldsymbol{\phi}) = y_k(\boldsymbol{\phi}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}
a_k = \mathbf{w}_k^\top \boldsymbol{\phi}
Now to learn the parameters we will require the derivatives of y_k with respect to all the a_j. These are given by
\frac{\partial y_k}{\partial a_j} = y_k (I_{kj} - y_j)
where I_{kj} are the elements of the identity matrix.
Next we write down the likelihood function. This is most easily done using the 1-of-K coding scheme.
The likelihood function is then given by
p(\mathbf{T} \mid \mathbf{w}_1, \dots, \mathbf{w}_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(C_k \mid \phi_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}
where y_{nk} = y_k(\phi_n), and \mathbf{T} is an N \times K matrix of target variables with elements t_{nk}. Taking the negative logarithm then gives
E(\mathbf{w}_1, \dots, \mathbf{w}_K) = -\ln p(\mathbf{T} \mid \mathbf{w}_1, \dots, \mathbf{w}_K) = - \sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}
which is the cross-entropy error function for the multiclass classification problem.
We now take the gradient of the error function with respect to one of the parameter vectors \mathbf{w}_j.
\nabla_{\mathbf{w}_j} E(\mathbf{w}_1, \dots, \mathbf{w}_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj}) \boldsymbol{\phi}_n