
Logistic Regression

Table of Contents

1. Logistic Regression
1.1. Coefficient interpretation
1.2. Multinomial regression
1.3. Regularization
1.3.1. Why does Lasso regularization induce model sparsity?
1.4. Early stopping
1.5. Other considerations

1. Logistic Regression

• Logistic regression is based on the same idea as linear regression: the model is still a linear combination of the features. The only difference is that we now want $y$ to be a probability, so the linear score is passed through a sigmoid function.
• The probability equation is:

$$P(y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_{0} + \beta X)}}$$

• Unlike linear regression, logistic regression has no closed-form solution, so the coefficients have to be found iteratively.
• The loss function for logistic regression is the log loss (cross-entropy loss), where $\hat{y}_{i}$ is the predicted probability for observation $i$:

$$\mathrm{Loss}(y, \hat{y}) = -\sum_{i=1}^{n} \left[ y_{i} \log \hat{y}_{i} + (1 - y_{i}) \log (1 - \hat{y}_{i}) \right]$$

• Note: Log loss is the negative logarithm of the likelihood function → $\text{log loss} = -1 \times \log(\text{likelihood function})$.
• Note: The likelihood function of logistic regression, where $p(x_{i})$ is the predicted probability that $y_{i} = 1$, is:

$$L(\beta_{0}, \beta) = \prod_{i=1}^{n} p(x_{i})^{y_{i}} (1 - p(x_{i}))^{1 - y_{i}}$$
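The relationship between the likelihood and the log loss can be checked numerically. The following is a minimal sketch in plain Python; the labels and predicted probabilities are made-up values for illustration, and the function names `likelihood` and `log_loss` are hypothetical helpers, not part of any library.

```python
import math

def likelihood(y_true, p_pred):
    """Bernoulli likelihood: product of p^y * (1 - p)^(1 - y)."""
    result = 1.0
    for y, p in zip(y_true, p_pred):
        result *= (p ** y) * ((1.0 - p) ** (1 - y))
    return result

def log_loss(y_true, p_pred):
    """Summed cross-entropy, following the loss formula in the notes."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1.0 - p)
        for y, p in zip(y_true, p_pred)
    )

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [1, 0, 1, 1]
p_pred = [0.9, 0.2, 0.7, 0.6]

loss = log_loss(y_true, p_pred)
neg_log_lik = -math.log(likelihood(y_true, p_pred))
print(loss, neg_log_lik)  # the two values agree up to floating-point error
```

Working with the sum of logs rather than the raw product is also the numerically safer choice, since a product of many probabilities quickly underflows.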

• To minimize the loss function, we take its derivatives w.r.t. the coefficients and update them with gradient descent:

$$\begin{aligned}
\frac{\partial \mathrm{Loss}}{\partial \beta} &= \sum_{i=1}^{n} (\hat{y}_{i} - y_{i}) \, x_{i} \\
\frac{\partial \mathrm{Loss}}{\partial \beta_{0}} &= \sum_{i=1}^{n} (\hat{y}_{i} - y_{i}) \cdot 1 \\
\nabla_{\beta} &= \begin{bmatrix} \frac{\partial \mathrm{Loss}}{\partial \beta} \\ \frac{\partial \mathrm{Loss}}{\partial \beta_{0}} \end{bmatrix} \\
\beta_{j}^{t+1} &= \beta_{j}^{t} - r \nabla_{\beta_{j}}
\end{aligned}$$

  – The update is repeated until the coefficient gradients converge to $0$.
  – $r$ → learning rate → usually in $[10^{-6}, 0.1]$.
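The gradient formulas and the update rule above can be sketched as a small batch gradient descent loop. This is a minimal illustration for a single feature in plain Python, not a production implementation: the toy data, the function name `fit_logistic`, and the stopping tolerance `tol` are assumptions made for the example. The toy classes overlap on purpose, so the optimum is finite.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, r=0.1, tol=1e-8, max_iter=10_000):
    """Batch gradient descent for one-feature logistic regression.

    r is the learning rate; tol and max_iter are illustrative stopping
    criteria (assumptions, not prescribed by the notes).
    """
    beta0, beta = 0.0, 0.0
    for _ in range(max_iter):
        preds = [sigmoid(beta0 + beta * x) for x in xs]
        # Gradients of the summed log loss w.r.t. beta and beta0.
        grad_beta = sum((p - y) * x for p, y, x in zip(preds, ys, xs))
        grad_beta0 = sum(p - y for p, y in zip(preds, ys))
        # Stop once both gradients are close to zero.
        if max(abs(grad_beta), abs(grad_beta0)) < tol:
            break
        beta -= r * grad_beta
        beta0 -= r * grad_beta0
    return beta0, beta

# Made-up toy data: larger x tends to mean class 1, with some overlap.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

beta0, beta = fit_logistic(xs, ys)
print(beta0, beta)
```

If the classes were perfectly separable, the unregularized coefficients would keep growing, which is one motivation for the regularization discussed later in these notes.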

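1.1. Coefficient interpretation

• To interpret the impact of a coefficient $\beta$ on the probability, we have to exponentiate it: $e^{\beta}$ is the odds ratio, i.e. the factor by which the odds $\frac{p}{1 - p}$ are multiplied when the corresponding feature increases by one unit while the others are held fixed.
  – $e^{\beta} - 1$ gives the % change in the odds. For example, $\beta = 0.7$ gives $e^{0.7} \approx 2.01$, so the odds roughly double (an increase of about 101%).

1.2. Multinomial regression

• We use multinomial regression when we want to predict more than two classes.
• Instead of the sigmoid, we use the softmax function: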

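$$P(y_{i} = k \mid x_{i}) = \frac{e^{\beta_{0,k} + \beta_{k} x_{i}}}{\sum_{j=1}^{K} e^{\beta_{0,j} + \beta_{j} x_{i}}}$$

where each class $k$ has its own intercept $\beta_{0,k}$ and coefficients $\beta_{k}$.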
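The softmax prediction can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: one feature, one intercept and one coefficient per class, and made-up parameter values. The names `softmax` and `predict_proba` are hypothetical helpers defined here.

```python
import math

def softmax(scores):
    """Convert K real-valued class scores into K probabilities."""
    # Subtracting the max score avoids overflow and leaves the result unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(x, intercepts, coefs):
    """Class probabilities for one single-feature observation x."""
    scores = [b0 + b * x for b0, b in zip(intercepts, coefs)]
    return softmax(scores)

# Hypothetical parameters for K = 3 classes.
intercepts = [0.0, 0.5, -0.5]
coefs = [1.0, 0.0, -1.0]

probs = predict_proba(2.0, intercepts, coefs)
predicted_class = probs.index(max(probs))  # class with the highest probability
print(probs, predicted_class)
```

With $K = 2$ and the second class's score fixed at zero, the first probability reduces to the sigmoid of the first score, which is why softmax is described as a generalization of the sigmoid.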

• The softmax function is a generalized sigmoid that produces a probability for each of the $K$ classes.
  – The predicted value will be the class with the maximum predicted probability.

1.3. Regularization

• Regularization is a technique used to avoid overfitting. It adds a penalty term to the loss function that grows with the size of the coefficients. There are two main types of regularization:
  – $L1$, Lasso, or Laplace → $\sum_{j} |\beta_{j}|$
    * Typically results in more zero-valued coefficients, which means fewer features will be used.
  – $L2$, ridge, or Gaussian → $\sum_{j} \beta_{j}^{2}$
    * Usually results in small weights for many of the features (that would have been zeroed out by $L1$).
  – Note: Both $L1$ and $L2$ penalties are usually multiplied by a coefficient, $\lambda$, which controls the degree of regularization.
    * Too high a $\lambda$ can result in underfitting, and too low a $\lambda$ can result in overfitting.
    * $\lambda$ is best tuned with cross-validation.
  – Note: When using regularization, it's better to scale our data, because the penalty treats all coefficients alike regardless of the units of the features. Scaling the data can also help the model converge faster.

1.3.1. Why does Lasso regularization induce model sparsity?

• First off, note that:
  – L1 norm:

$$\|w\|_{1} = |w_{1}| + |w_{2}| + |w_{3}| + \dots + |w_{n}|$$

  – L2 norm:

$$\|w\|_{2} = \sqrt{w_{1}^{2} + w_{2}^{2} + w_{3}^{2} + \dots + w_{n}^{2}}$$
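To make the two penalties concrete, the sketch below computes both norms and then applies repeated gradient steps that use only the penalty term, once for $|w|$ and once for $w^{2}$. It is a simplified illustration, not a full training loop: the data loss is ignored, the step size and starting weight are made up, and the L1 step is clipped at zero (an assumption made here so the weight does not overshoot and flip sign).

```python
import math

def l1_norm(w):
    """Sum of absolute values."""
    return sum(abs(wi) for wi in w)

def l2_norm(w):
    """Square root of the sum of squares."""
    return math.sqrt(sum(wi * wi for wi in w))

def l1_step(w, r):
    """One penalty-only step for |w|: move by r toward zero, stopping at zero."""
    if abs(w) <= r:
        return 0.0  # clipping at zero is an assumption of this sketch
    return w - r * math.copysign(1.0, w)

def l2_step(w, r):
    """One penalty-only step for w^2: subtract r * 2w."""
    return w - r * 2.0 * w

w = [3.0, -4.0]
print(l1_norm(w), l2_norm(w))  # 7.0 5.0

# Same starting weight and step size for both penalties (illustrative values).
w1 = w2 = 0.5
r = 0.1
for _ in range(10):
    w1 = l1_step(w1, r)
    w2 = l2_step(w2, r)
print(w1, w2)  # w1 is exactly zero, w2 is small but still positive
```

The L1 step has a fixed size, so the weight hits zero after a handful of iterations, while the L2 step shrinks the weight by a constant factor each time and only approaches zero.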

• When optimizing the cost function, we use gradient descent and update our weights by → $w^{t+1} = w^{t} - r \nabla_{w}$
  – Convergence occurs when the value of $w^{t}$ doesn't change much with further iterations → i.e. $\frac{\partial \mathrm{Loss}}{\partial w} \approx 0$ → i.e. $w^{t+1} \approx w^{t}$

• L1 norm: Considering only the penalty term, the derivative is → $\frac{\partial |w|}{\partial w} = \operatorname{sign}(w)$, which equals $1$ for $w > 0$, therefore → $w^{t+1} = w^{t} - r \cdot 1$.
  – The penalty's derivative is a constant, so every step moves $w$ toward zero by the same amount $r$, no matter how small $w$ already is: $r$ is not multiplied by the shrinking value of $w$.
  – Therefore, $w^{t}$ reaches zero within a finite number of iterations.
• L2 norm: Considering only the (squared) penalty term, the derivative is → $\frac{\partial w^{2}}{\partial w} = 2w$, therefore → $w^{t+1} = w^{t} - 2 r w$.
  – The penalty's derivative is not constant: it is proportional to $w$, so as $w$ gets smaller, the term being subtracted, $2 r w$, gets smaller as well.
  – Therefore, after a few iterations, $w^{t}$ becomes very small but not exactly zero.
  – Hence, the L2 penalty does not contribute to the sparsity of the weight vector.

1.4. Early stopping

• Another technique to avoid overfitting is early stopping.
• Simply put, it means stopping training somewhere before reaching the absolute minimum of the training loss, to avoid overfitting the training examples. In practice, the stopping point is typically chosen by monitoring the loss on a held-out validation set.

1.5. Other considerations

• We can't use the same $R^{2}$ as in linear regression. For logistic regression we use McFadden's pseudo $R^{2}$, which also lies between $0$ and $1$.
  – Its value is usually smaller than $R^{2}$. A value between $0.2$ and $0.4$ usually indicates an excellent fit.
• Logistic regression → discriminative model
  – Naive Bayes → generative model
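McFadden's pseudo $R^{2}$ mentioned above is commonly defined as $1 - \frac{\log L_{model}}{\log L_{null}}$, where the null model is an intercept-only model that predicts the overall base rate for every observation. The following is a minimal sketch in plain Python; the labels and predicted probabilities are made-up values standing in for a fitted model's output, and `mcfadden_r2` is a hypothetical helper name.

```python
import math

def log_likelihood(y_true, p_pred):
    """Bernoulli log-likelihood of the labels under the predicted probabilities."""
    return sum(
        y * math.log(p) + (1 - y) * math.log(1.0 - p)
        for y, p in zip(y_true, p_pred)
    )

def mcfadden_r2(y_true, p_pred):
    """1 - LL(model) / LL(null), where the null model predicts the base rate.

    Assumes both classes are present, so the base rate is strictly
    between 0 and 1.
    """
    base_rate = sum(y_true) / len(y_true)
    ll_model = log_likelihood(y_true, p_pred)
    ll_null = log_likelihood(y_true, [base_rate] * len(y_true))
    return 1.0 - ll_model / ll_null

# Hypothetical labels and model probabilities, for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
p_pred = [0.8, 0.3, 0.6, 0.7, 0.4, 0.2]

r2 = mcfadden_r2(y_true, p_pred)
print(r2)
```

A model that does no better than the base rate scores $0$, and the value approaches $1$ only as the predicted probabilities approach the true labels.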