Logistic Regression
Table of Contents

1. Logistic Regression
   1.1. Coefficient interpretation
   1.2. Multinomial regression
   1.3. Regularization
      1.3.1. Why does Lasso regularization induce model sparsity?
   1.4. Early stopping
   1.5. Other considerations
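1. Logistic Regression

Logistic regression builds on the same idea as linear regression: the model is still a linear combination of the features, $\beta_0 + \beta X$. The only difference is that we now want the output to be a probability, so the linear score is passed through a sigmoid function:

$$P(y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta X)}}$$

Unlike linear regression, there is no closed-form solution for logistic regression, so the coefficients have to be found iteratively.

The loss function for logistic regression is the log loss (cross-entropy loss), where $\hat{y}_i$ is the predicted probability for observation $i$:

$$Loss(y, \hat{y}) = -\sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$

Note: Log loss is a slight twist on the likelihood function. In fact, log loss $= -1 \times \log(\text{likelihood function})$.

Note: The likelihood function of logistic regression is:

$$L(\beta_0, \beta) = \prod_{i=1}^{n} p(x_i)^{y_i} \, (1 - p(x_i))^{1 - y_i}$$

To minimize the loss function, we take derivatives w.r.t. the coefficients:

$$\frac{dLoss}{d\beta} = \sum_{i=1}^{n} [\hat{y}_i - y_i] \, x_i$$

$$\frac{dLoss}{d\beta_0} = \sum_{i=1}^{n} [\hat{y}_i - y_i] \cdot 1$$

$$\nabla\beta = \left[ \frac{dLoss}{d\beta}, \; \frac{dLoss}{d\beta_0} \right]$$

$$\beta_i^{t+1} = \beta_i^{t} - r \, \nabla\beta_i$$

The update is repeated until the coefficient gradients converge to 0. Here $r$ is the learning rate, usually in the range $[10^{-6}, 0.1]$.

1.1. Coefficient interpretation

To interpret the impact of a coefficient $\beta$ on the probability, we have to exponentiate it. The result, $e^{\beta}$, is called the odds ratio: the factor by which the odds of $y = 1$ are multiplied when the corresponding feature increases by one unit.

$(e^{\beta} - 1) \times 100$ gives the % change in the odds. For example, $\beta = 0.5$ gives $e^{0.5} \approx 1.65$, i.e. roughly a 65% increase in the odds.

1.2. Multinomial regression

We use multinomial regression when we want to predict more than two classes.

Instead of the sigmoid, we use the softmax function, with one set of coefficients per class:

$$P(y = k \mid x_i) = \frac{e^{\beta_{0,k} + \beta_k x_i}}{\sum_{j=1}^{K} e^{\beta_{0,j} + \beta_j x_i}}$$

The softmax function is a generalized sigmoid that produces a probability distribution over $K$ classes.

The predicted value will be the class with the maximum predicted probability.

1.3. Regularization

Regularization is a technique used to avoid overfitting. It adds a penalty term to the loss function that grows with the size of the coefficients (the sum of their absolute values or of their squares).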
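Before turning to the specific penalty terms, the unregularized model from section 1 can be sketched in a few lines of NumPy. This is a minimal sketch rather than a reference implementation: the function names (`sigmoid`, `log_loss`, `fit_logistic`), the synthetic data, and the defaults for `r`, `n_iter`, and `tol` are all assumptions made for illustration. The gradients are also divided by n, which the formulas in section 1 do not do; this only rescales the learning rate.

```python
import numpy as np


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(y, y_hat, eps=1e-12):
    """Summed cross-entropy loss; eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.sum(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))


def fit_logistic(X, y, r=0.1, n_iter=5000, tol=1e-6):
    """Batch gradient descent on the log loss (hypothetical helper)."""
    n, d = X.shape
    beta0, beta = 0.0, np.zeros(d)
    for _ in range(n_iter):
        y_hat = sigmoid(beta0 + X @ beta)
        # Gradients from section 1, averaged over the n observations.
        grad_beta = X.T @ (y_hat - y) / n
        grad_beta0 = np.sum(y_hat - y) / n
        beta -= r * grad_beta
        beta0 -= r * grad_beta0
        # Stop once every gradient component is close to 0.
        if max(np.abs(grad_beta).max(), abs(grad_beta0)) < tol:
            break
    return beta0, beta


# Synthetic data (assumed for illustration): two features, known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_beta0, true_beta = -0.5, np.array([2.0, -1.0])
y = (rng.uniform(size=500) < sigmoid(true_beta0 + X @ true_beta)).astype(float)

beta0, beta = fit_logistic(X, y)
odds_ratios = np.exp(beta)  # interpretation from section 1.1
print("intercept:", beta0, "coefficients:", beta)
print("odds ratios:", odds_ratios)
```

With the fitted coefficients, `np.exp(beta)` gives the odds ratios discussed in section 1.1. A regularized version would add the penalty to `log_loss` and the penalty's derivative to `grad_beta`.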
There are two main types of regularization:

L1, Lasso, or Laplace: $\sum_j |\beta_j|$
* Typically results in more zero-valued coefficients, which means fewer features will be used.

L2, ridge, or Gaussian: $\sum_j \beta_j^2$
* Usually results in small weights for many of the features (including those that would have been zeroed out by L1).

Note: Both the L1 and L2 penalties are usually multiplied by a coefficient, $\lambda$, which controls the degree of regularization.
* Too high a $\lambda$ can result in underfitting, and too low a $\lambda$ can result in overfitting.
* $\lambda$ is best tuned with cross-validation.

Note: When using regularization, it's better to scale our data, because the penalty depends on the magnitude of the coefficients, which in turn depends on the units of the features. Scaling data can also help the model converge faster.

1.3.1. Why does Lasso regularization induce model sparsity?

First off, note that:

L1 norm: $\|w\|_1 = |w_1| + |w_2| + |w_3| + \dots + |w_n|$

Squared L2 norm (the ridge penalty): $\|w\|_2^2 = w_1^2 + w_2^2 + w_3^2 + \dots + w_n^2$

When optimizing the cost function, we use gradient descent and update our weights by

$$w_{t+1} = w_t - r \, \frac{\partial Loss}{\partial w}$$

Convergence occurs when the value of $w_t$ doesn't change much with further iterations, i.e. $\frac{\partial Loss}{\partial w} \approx 0$, i.e. $w_{t+1} \approx w_t$.

L1 norm: Looking only at the penalty term, the derivative is $\frac{\partial |w|}{\partial w} = \operatorname{sign}(w)$, which is $\pm 1$ for any $w \neq 0$. Therefore $w_{t+1} = w_t - r \cdot \operatorname{sign}(w_t)$.

The penalty's derivative has a constant magnitude, so the amount subtracted at each step is always $r$ and is not multiplied by the (possibly very small) value of $w$. The weight keeps moving toward zero at the same pace no matter how small it already is, and therefore $w_t$ reaches zero within a finite number of iterations.

L2 norm: The derivative is $\frac{\partial w^2}{\partial w} = 2w$, therefore $w_{t+1} = w_t - 2 r w_t$.

Here the penalty's derivative is not constant: it is proportional to $w$. For smaller values of $w$, the term being subtracted becomes smaller as well, so each step shrinks the weight by less and less. After a few iterations, $w_t$ becomes very small but not exactly zero, and hence does not contribute to the sparsity of the weight vector.

1.4. Early stopping

Another technique to avoid overfitting is early stopping. Simply put, it means stopping training somewhere before reaching the absolute minimum of the training loss, to avoid overfitting the training examples.

1.5. Other considerations

We can't use the same $R^2$ as in linear regression. For logistic regression we use McFadden's pseudo $R^2$, which also lies between 0 and 1.

Its value is usually smaller than the $R^2$ of a comparable linear regression. A value between 0.2 and 0.4 usually indicates an excellent fit.

Logistic regression → discriminative model
Naive Bayes → generative model
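McFadden's pseudo $R^2$ compares the log-likelihood of the fitted model with that of a null model that contains only an intercept: $R^2_{McFadden} = 1 - \frac{\ln L_{model}}{\ln L_{null}}$. The following is a minimal sketch of that calculation, assuming the predicted probabilities are already available. The function names (`log_likelihood`, `mcfadden_r2`) and the toy labels and probabilities are made up for illustration and do not come from a fitted model.

```python
import numpy as np


def log_likelihood(y, p, eps=1e-12):
    """Bernoulli log-likelihood of labels y under probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


def mcfadden_r2(y, p_model):
    """1 - lnL(model) / lnL(null) (hypothetical helper).

    The null model is intercept-only, so it predicts the mean of y
    for every observation.
    """
    p_null = np.full_like(p_model, np.mean(y))
    return 1.0 - log_likelihood(y, p_model) / log_likelihood(y, p_null)


# Toy labels and predicted probabilities (assumed, not from a real fit).
y = np.array([0, 0, 1, 1, 1, 0, 1, 0], dtype=float)
p_good = np.array([0.1, 0.2, 0.9, 0.8, 0.7, 0.3, 0.9, 0.2])
p_uninformative = np.full(8, 0.5)  # same predictions as the null model

r2_good = mcfadden_r2(y, p_good)
r2_uninformative = mcfadden_r2(y, p_uninformative)
print("informative model:", r2_good)
print("uninformative model:", r2_uninformative)
```

A model that predicts no better than the intercept-only baseline scores 0, and the score approaches 1 as the predicted probabilities approach the true labels.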