
# Maths Behind ML- Multinomial Logistic Regression

## What is Multinomial Logistic Regression?

Multinomial Logistic Regression is an extension of logistic regression that can solve classification problems where the number of classes is more than two. It is also known as Polytomous LR, Multiclass LR, Softmax Regression, Multinomial Logit, and the Maximum Entropy classifier. For example, a handwritten digit can belong to one of ten classes (0-9), or a student's marks can fall into the first, second, or third division.
In Maths Behind ML- Logistic Regression, we saw that a logistic regression classifier can solve a binary classification problem. For example, a customer can either default on his credit card payment or not. Recall that in logistic regression, the classifier was a logistic function (also known as the sigmoid). This function was used to model the probability of the positive class. The relationship is given as
$p\left(x\right)=P\left(Y=1\mid X\right)=\frac{{e}^{{\beta}_{0}+{\beta}_{1}x}}{1+{e}^{{\beta}_{0}+{\beta}_{1}x}}$
where the coefficients $\beta_{0}$ and $\beta_{1}$ were determined using the maximum likelihood criterion.
Multinomial logistic regression extends this binary classification to a multiclass classification and uses a softmax function as a classifier, which is discussed in the next section.
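As a concrete illustration of the binary case above, here is a minimal Python sketch of the sigmoid probability; the coefficient values are hypothetical, chosen only for illustration.

```python
import math

def sigmoid_prob(x, beta0, beta1):
    """p(Y=1 | X=x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    t = beta0 + beta1 * x
    return math.exp(t) / (1.0 + math.exp(t))

# Hypothetical coefficients for illustration only.
p = sigmoid_prob(2.0, beta0=-1.0, beta1=0.5)  # probability of class 1
```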

## Softmax function as a classifier

Multinomial Logistic Regression uses a softmax function to model the relationship between the predictors and the probability of each class. It then predicts the class with the highest probability among all the classes. The softmax function is given below,

$P(y=j \mid z^{(i)}) = \phi_{softmax}(z^{(i)})_{j} = \frac{e^{z_{j}^{(i)}}}{\sum_{k=1}^{K} e^{z_{k}^{(i)}}}.$
Here,
- j is the class of the input observation i and can range from 1 to K, where K is the number of classes possible for the input observation.
- The term ${\sum_{k=1}^{K} e^{z_{k}^{(i)}}}$ normalizes the distribution so that the probabilities of all the classes sum to one.
- z is the net input vector, whose components are given as $z = w_1x_1 + \ldots + w_mx_m + b=\sum_{l=1}^{m} w_l x_l + b= \mathbf{w}^T\mathbf{x} + b.$ Here $\mathbf{w}$ is the weight vector for a class, $\mathbf{x}$ is the feature vector of one training sample, and b is the bias unit.
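The softmax classifier described above can be sketched in plain Python as follows; subtracting the maximum before exponentiating is a standard numerical-stability detail not mentioned in the text.

```python
import math

def net_input(w, x, b):
    """z = w^T x + b: the net input for one class."""
    return sum(wl * xl for wl, xl in zip(w, x)) + b

def softmax(z):
    """Map a vector of net inputs z to class probabilities that sum to one."""
    m = max(z)                                 # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))      # class with the highest probability
```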

## Optimization of coefficients and Loss function

In the previous section, we saw that the net input vector is given as
$z = w_1x_1 + \ldots + w_mx_m + b=\sum_{l=1}^{m} w_l x_l + b=\mathbf{w}^T\mathbf{x} + b.$
We already know the feature vector $\mathbf{x}$ for a training sample. The goal is to determine the weight vector $\mathbf{w}$ and the bias b so that the predicted class is as close as possible to the actual class. This is known as the Maximum Likelihood criterion. For Multinomial Logistic Regression, it is given below.

$P(Y \mid X) = \prod_{i=1}^n P(y^{(i)} \mid x^{(i)})$ and thus

$-\log P(Y \mid X) =\sum_{i=1}^n -\log P(y^{(i)} \mid x^{(i)}).$

Maximizing the likelihood function $P(Y \mid X)$ is the same as minimizing $-\log P(Y \mid X)$. Accordingly, for a single sample we can define a loss function as,

$l = -\log P(y \mid x) = -\sum_j y_j \log \hat{y}_j.$

Usually $\mathbf{y}$ is a one-hot encoded vector: $y_j = 1$ for the true class and $y_j = 0$ for all other classes. Also, since each $\hat{y}_j$ is a probability between 0 and 1, its logarithm is never greater than 0. When the predicted probability $\hat{y}_j$ of the true class is 1, the loss function becomes 0. Hence, if the classifier correctly classifies the class, the loss function is minimized. This loss function is also known as the cross entropy loss. Further, we can simplify the loss function by substituting the softmax function for $\hat{y}_j$, as given below.
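The cross entropy loss for a one-hot label can be sketched as follows; the small epsilon guarding against log(0) is an implementation detail, not part of the formula.

```python
import math

def cross_entropy(y, y_hat):
    """l = -sum_j y_j * log(y_hat_j) for a one-hot label y."""
    eps = 1e-12                      # guard against log(0)
    return -sum(yj * math.log(pj + eps) for yj, pj in zip(y, y_hat))

# Loss is near zero for a confident correct prediction...
low = cross_entropy([0, 1, 0], [0.01, 0.98, 0.01])
# ...and large for a confident wrong one.
high = cross_entropy([0, 1, 0], [0.98, 0.01, 0.01])
```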

$l = -\sum_j y_j \log \hat{y}_j$

$= -\sum_j y_j \log \frac{\exp(z_j)}{\sum_k \exp(z_k)} = \sum_j y_j \log \sum_k \exp(z_k) - \sum_j y_j z_j$

$= \log \sum_k \exp(z_k) - \sum_j y_j z_j$

Note that summation of ${y}_j$ over all classes is equal to 1.
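The simplified form $l = \log \sum_k \exp(z_k) - \sum_j y_j z_j$ can be computed directly from the net inputs; this sketch (with the usual max-subtraction stabilization, an implementation detail) agrees with the defining form $-\log \hat{y}_j$ for the true class.

```python
import math

def loss_from_logits(z, y):
    """l = log(sum_k exp(z_k)) - sum_j y_j z_j  (log-sum-exp form)."""
    m = max(z)                                        # stabilize the exponentials
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return lse - sum(yj * zj for yj, zj in zip(y, z))

# Example: net inputs for three classes, true class is the first one.
loss = loss_from_logits([2.0, 1.0, 0.1], [1, 0, 0])
```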
To understand a bit better what is going on, consider the derivative with respect to z. We get

$\partial_{z_j} l = \frac{\exp(z_j)}{\sum_k \exp(z_k)} - y_j = \mathrm{softmax}(\mathbf{z})_j - y_j = P(y = j \mid x) - y_j.$
In other words, the gradient is the difference between the probability assigned to each class by our model, as expressed by $P(y \mid x)$, and what actually happened, as expressed by y. Minimizing the loss function numerically (for example, by gradient descent) gives us the weight and bias vectors that correctly define a softmax classifier.
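Putting the pieces together, here is a minimal gradient-descent sketch of this optimization; the toy data, learning rate, and epoch count are illustrative assumptions, and each weight update uses the gradient $\mathrm{softmax}(\mathbf{z})_j - y_j$ derived above.

```python
import math

def softmax(z):
    m = max(z)                                 # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def train(X, Y, n_classes, lr=0.5, epochs=200):
    """Fit softmax-regression weights by minimizing cross entropy with SGD.

    X: list of feature vectors; Y: list of one-hot labels.
    Returns W (one weight row per class) and biases b.
    """
    n_feat = len(X[0])
    W = [[0.0] * n_feat for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(X, Y):
            z = [sum(W[j][l] * x[l] for l in range(n_feat)) + b[j]
                 for j in range(n_classes)]
            p = softmax(z)
            for j in range(n_classes):
                g = p[j] - y[j]                # gradient: softmax(z)_j - y_j
                for l in range(n_feat):
                    W[j][l] -= lr * g * x[l]
                b[j] -= lr * g
    return W, b
```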

## Conclusion

In this article, we introduced the problem of multiclass classification and looked at the mathematics behind one such classifier, multinomial logistic regression. We also discussed the mathematics behind the Maximum Likelihood criterion and the minimization of the loss function.