1. Logistic Regression
• Logistic regression builds on the same idea as linear regression: the model is still a linear combination of the features, $\beta_0 + \beta X$. The only difference is that we now want the output y to be a probability.
• The probability is obtained by passing the linear term through a sigmoid function:
  $$P(y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta X)}}$$
• Unlike linear regression, there is no closed-form solution for logistic regression, so the coefficients are found iteratively.
• The loss function for logistic regression is the log loss (cross-entropy loss), where $\hat{y}_i = p(x_i)$ is the predicted probability for observation i:
  $$\text{Loss}(y, \hat{y}) = -\sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
• Note: Log loss is a slight twist on the likelihood function. In fact → log loss = -1 × log(likelihood function).
• Note: The likelihood function of logistic regression is:
  $$L(\beta_0, \beta) = \prod_{i=1}^{n} p(x_i)^{y_i} \, (1 - p(x_i))^{1 - y_i}$$
• To minimize the loss function, we take derivatives w.r.t. the coefficients and apply gradient descent:
  $$\frac{d\,\text{Loss}}{d\beta} = \sum_{i=1}^{n} (\hat{y}_i - y_i)\, x_i \qquad \frac{d\,\text{Loss}}{d\beta_0} = \sum_{i=1}^{n} (\hat{y}_i - y_i) \cdot 1$$
  $$\nabla\beta = \begin{bmatrix} \dfrac{d\,\text{Loss}}{d\beta} \\[2ex] \dfrac{d\,\text{Loss}}{d\beta_0} \end{bmatrix}$$
  $$\beta_i^{t+1} = \beta_i^{t} - r\, \nabla\beta_i$$
  – Repeat → until the coefficient gradients converge to 0.
  – r → learning rate → usually in $[10^{-6}, 0.1]$.

1.1. Coefficient interpretation
• To interpret the impact of a coefficient $\beta$ on the probability, we have to exponentiate it, $e^{\beta}$, to get the odds ratio: the factor by which the odds are multiplied when the feature increases by one unit.
  – $e^{\beta} - 1$ gives the proportional change in the odds (multiply by 100 for the % change).

1.2. Multinomial regression
• We use multinomial regression when we want to predict more than two classes.
• Instead of the sigmoid, we use the softmax function, with one set of coefficients per class:
  $$P(y = k \mid x_i) = \frac{e^{\beta_{0k} + \beta_k x_i}}{\sum_{j=1}^{K} e^{\beta_{0j} + \beta_j x_i}}$$
• The softmax function is a generalized sigmoid that produces a probability for each of the K classes, and these probabilities sum to 1.
  – The predicted value will be the class with the maximum predicted probability.

1.3. Regularization
• Regularization is a technique used to avoid overfitting. It adds a penalty term to the loss function that grows with the magnitude of the coefficients. There are two main types of regularization:
  – L1, Lasso, or Laplace → $\sum_j |\beta_j|$
    * Typically results in more zero-valued coefficients, which means fewer features will be used.
  – L2, ridge, or Gaussian → $\sum_j \beta_j^2$
    * Usually results in small weights for many of the features (including those that would have been zeroed out by L1).
  – Note: Both L1 and L2 penalties are usually multiplied by a coefficient, $\lambda$, which controls the degree of regularization.
    * Too high a $\lambda$ can result in underfitting, and too low a $\lambda$ can result in overfitting.
    * $\lambda$ is best tuned with cross-validation.
  – Note: When using regularization, it is better to scale the data, because the penalty treats all coefficients on the same scale. Scaling the data can also help the model converge faster.

1.3.1. Why does Lasso regularization induce model sparsity?
• First off, note that
  – L1 norm: $\|w\|_1 = |w_1| + |w_2| + |w_3| + \dots + |w_n|$
  – L2 penalty (squared L2 norm): $\|w\|_2^2 = w_1^2 + w_2^2 + w_3^2 + \dots + w_n^2$
• When optimizing the cost function, we use gradient descent and update our weights by → $w_{t+1} = w_t - r \nabla w$
  – Convergence occurs when the value of $w_t$ doesn't change much with further iterations → i.e. $\frac{\partial\,\text{Loss}}{\partial w} \approx 0$ → i.e. $w_{t+1} \approx w_t$.
• L1 norm: The derivative is → $\frac{\partial |w|}{\partial w} = \text{sign}(w)$, which is 1 for $w > 0$, therefore → $w_{t+1} = w_t - r \cdot 1$.
  – The penalty's derivative is a constant, so each step subtracts the full r regardless of how small w already is. It is not multiplied by a shrinking value of w.
  – Therefore, $w_t$ reaches zero in a finite number of iterations.
• L2 norm: The derivative is → $\frac{\partial w^2}{\partial w} = 2w$, therefore → $w_{t+1} = w_t - 2 r w_t$.
  – The penalty's derivative is not constant. As w gets smaller, the subtracted term $2 r w_t$ gets smaller too, so each step shrinks w by a fixed fraction rather than a fixed amount.
  – Therefore, after a few iterations, $w_t$ becomes very small but not exactly zero.
  – Hence, L2 does not contribute to the sparsity of the weight vector.

1.4. Early stopping
• Another technique to avoid overfitting is early stopping.
• Simply, it means stopping training before reaching the absolute minimum of the training loss, so that the model does not overfit the training examples.

1.5. Other considerations
• We can't use the same $R^2$ as in linear regression. For logistic regression we use McFadden's pseudo $R^2$, which also lies between 0 and 1.
  – Its value is usually smaller than $R^2$. A value between 0.2 and 0.4 usually indicates an excellent fit.
• Logistic regression → discriminative model
  – Naive Bayes → generative model
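
The gradient-descent update from section 1 can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes a single feature, and the toy data, the function names (`sigmoid`, `fit_logistic`), the learning rate `r=0.05`, and the iteration count are all hypothetical choices made for the example.

```python
import math

def sigmoid(z):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, r=0.05, n_iter=5000):
    """Fit P(y=1|x) = sigmoid(b0 + b*x) by batch gradient descent on the log loss."""
    b0, b = 0.0, 0.0
    for _ in range(n_iter):
        preds = [sigmoid(b0 + b * x) for x in xs]
        # Gradients of the log loss: sum (y_hat - y) * x and sum (y_hat - y) * 1
        grad_b = sum((p - y) * x for p, y, x in zip(preds, ys, xs))
        grad_b0 = sum(p - y for p, y in zip(preds, ys))
        # Update rule: beta_{t+1} = beta_t - r * gradient
        b -= r * grad_b
        b0 -= r * grad_b0
    return b0, b

# Hypothetical toy data: larger x makes y = 1 more likely.
# The classes overlap on purpose so the coefficients stay finite.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b = fit_logistic(xs, ys)
print(f"b0={b0:.3f}, b={b:.3f}, P(y=1|x=4)={sigmoid(b0 + b * 4.0):.3f}")
```

A fixed iteration count stands in for the "until the gradients converge to 0" condition; a real implementation would more likely stop once the gradient norm falls below a tolerance.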
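
To make the odds-ratio interpretation from section 1.1 concrete, the sketch below checks numerically that raising a feature by one unit multiplies the odds by $e^{\beta}$. The coefficient values and the helper names (`sigmoid`, `odds`) are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Convert a probability into odds p / (1 - p)."""
    return p / (1.0 - p)

# Hypothetical fitted coefficients and feature value
b0, b = -1.0, 0.7
x = 2.0

# Odds after a one-unit increase in x, divided by the odds before
ratio = odds(sigmoid(b0 + b * (x + 1.0))) / odds(sigmoid(b0 + b * x))

# The ratio equals e^b no matter which x we start from
pct_change = (math.exp(b) - 1.0) * 100.0
print(f"odds ratio={ratio:.4f}, e^b={math.exp(b):.4f}, % change in odds={pct_change:.1f}")
```

Note that the change in probability itself depends on the starting value of x; only the odds change by a constant factor.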
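
The softmax function from section 1.2 can be sketched as follows. The class scores are hypothetical stand-ins for the per-class linear terms, and subtracting the maximum score before exponentiating is a common numerical-stability step that the notes do not mention; it leaves the result unchanged.

```python
import math

def softmax(scores):
    """Turn K real-valued class scores into K probabilities that sum to 1."""
    m = max(scores)  # shift by the max to avoid overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical linear scores for K = 3 classes
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)

# Predicted class = index of the largest probability
predicted_class = max(range(len(probs)), key=probs.__getitem__)

# With K = 2 and the second score fixed at 0, softmax reduces to the sigmoid
z = 1.3
two_class = softmax([z, 0.0])[0]
sigmoid_z = 1.0 / (1.0 + math.exp(-z))
print(probs, predicted_class, two_class, sigmoid_z)
```

The two-class check at the end is what "generalized sigmoid" means in practice.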
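
The sparsity argument from section 1.3.1 can be illustrated by applying only the penalty part of the update to a single weight. This is a simplified sketch: it ignores the data-loss gradient, uses a hypothetical starting weight and learning rate, and clamps the L1 step at zero when it would overshoot. The clamp is an assumption added here; a plain subgradient step would oscillate around zero instead.

```python
def l1_step(w, r):
    """One gradient step on |w|: move toward zero by a constant r, clamping at 0."""
    if abs(w) <= r:
        return 0.0
    return w - r if w > 0 else w + r

def l2_step(w, r):
    """One gradient step on w**2: subtract r * 2w, a fixed fraction of w."""
    return w - 2.0 * r * w

# Hypothetical starting weight and learning rate
w_l1 = w_l2 = 1.0
r = 0.1
for _ in range(20):
    w_l1 = l1_step(w_l1, r)
    w_l2 = l2_step(w_l2, r)

print(f"after 20 steps: L1 weight={w_l1}, L2 weight={w_l2:.4f}")
```

The L1 weight lands exactly on zero, while the L2 weight is multiplied by 0.8 at every step and so stays positive however long the loop runs.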