Neural Networks
Table of Contents
1. Neural Networks (NN)
1.1. What is a neuron (in NN)?
1.2. Why do we need the bias term?
1.3. How a NN learns non-linear patterns?
1.4. How does a NN learn?
1.5. Chain Rule
1.6. How do we update the weights?
1.7. Stochastic Gradient Descent (SGD)
1.8. Momentum
1.9. AdaGrad
1.10. Adam
1.11. RMSProp
1.12. AdaDelta
1.13. Vanishing and exploding gradients
1.13.1. Initialization
1.14. ReLU and Leaky ReLU
1.15. tanh
1.16. Loss Functions
1.17. Avoid Overfitting
1.17.1. Regularization
1.17.2. Dropout
1.18. How to determine the number of layers and neurons?
1. Neural Networks (NN)

1.1. What is a neuron (in NN)?

A neuron, sometimes called a perceptron, is the smallest unit of a NN: it takes the inputs, multiplies each one by a weight, adds a bias, and passes the result through a non-linear function. With a sigmoid, the output $\sigma(W^T X + b)$ is the same as logistic regression, and the loss is exactly the same as the one in logistic regression:

$$L(\hat{y}, y) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

Note: There are other non-linear functions used as well, such as ReLU, tanh, etc.

We can update the weights by taking the gradient of the loss function with respect to the weights:

$$\nabla L = \left[ \frac{\partial L}{\partial b}, \frac{\partial L}{\partial w_1}, \dots, \frac{\partial L}{\partial w_n} \right]$$

$$w_{t+1} = w_t - r \nabla L \tag{3}$$

Note: According to equation (3), in order to update the weights, we move in the opposite direction of the loss gradient, scaled by the learning rate ($r$).

1.2. Why do we need the bias term?

The bias is like the intercept in a linear equation. It is an additional parameter of the NN that shifts the neuron's output on top of the weighted sum of its inputs. Thus, the bias is an input-independent offset that helps the model fit the given data better.

The bias term also helps in cases where all the $w_i x_i$ terms are 0. Without a bias, the neuron's output for such inputs would not depend on any trainable parameter, so the model could not be trained on them. Adding a bias term lets the model be trained in such cases.

1.3. How a NN learns non-linear patterns?

Each neuron in a NN learns a decision boundary. Since a NN has many neurons, the combined learned decision boundaries create a non-linear decision boundary.

In the example below, there's no way to separate the data with one line. There are feature engineering methods (or other algorithms) that can handle this. But how can a NN separate these two classes?

The above picture is a simple example of how a NN can capture non-linear patterns. In practice, NNs have more than one hidden layer and more neurons per layer.

Note: Usually the number of neurons in each hidden layer decreases as we move forward through the network.

1.4. How does a NN learn?

Let's explain this through a small NN below.

1.5. Chain Rule

1.6. How do we update the weights?

1.7. Stochastic Gradient Descent (SGD)

1.8. Momentum

1.9. AdaGrad

1.10. Adam

1.11. RMSProp

Will add the note later

1.12. AdaDelta

Will add the note later

1.13. Vanishing and exploding gradients

Whatever optimizer we use, the chain rule multiplies many gradient terms together, and this product can cause problems.

Say we use the sigmoid activation function: the maximum value of its derivative is 0.25. If we multiply a lot of 0.25s together, the final gradient (based on the chain rule) shrinks towards 0 → this can result in underflow. This is called → vanishing gradient.

Also, some of the gradient terms include the values of the weights. If the weights are extremely large, multiplying them together can produce an extremely large value → this is called an exploding gradient.

There are a few methods to mitigate these problems.

1.13.1. Initialization

One of the methods to tackle vanishing/exploding gradients is to initialize the weights of the NN in a particular way.

A bad way to initialize the weights is to draw them from a uniform distribution between 0 and 1. Another bad way is to draw them from a normal distribution with mean 0 and standard deviation 1.

What we can do instead is to initialize the weights from a normal distribution with mean 0 and standard deviation

$$\sigma = \sqrt{\frac{2}{f_i + f_o}}$$

where $f_i$ is the fan in and $f_o$ is the fan out. Fan in is the number of inputs to a particular layer and fan out is the number of outputs of that layer. This way, we can initialize the weights for each layer of the NN. This is called Xavier or Glorot initialization.

This helps because we shrink the standard deviation according to how many terms are multiplied and summed together in each layer. Without it, the variances of the layers multiply together, which causes the variance to grow exponentially.
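As a minimal sketch of the Xavier/Glorot rule described above, the snippet below draws one layer's weights from a normal distribution with mean 0 and standard deviation sqrt(2 / (fan_in + fan_out)). The function names `xavier_std` and `init_layer`, the layer sizes, and the fixed seed are illustrative assumptions, not part of the original notes. It uses only the Python standard library.

```python
import math
import random


def xavier_std(fan_in, fan_out):
    """Standard deviation for Xavier/Glorot normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def init_layer(fan_in, fan_out, rng):
    """Return a fan_out x fan_in weight matrix drawn from N(0, sigma^2)."""
    sigma = xavier_std(fan_in, fan_out)
    return [[rng.gauss(0.0, sigma) for _ in range(fan_in)]
            for _ in range(fan_out)]


# Hypothetical layer sizes; the seed is fixed only to make the sketch repeatable.
rng = random.Random(0)
weights = init_layer(fan_in=256, fan_out=128, rng=rng)

# The empirical spread of the sampled weights should be close to sigma.
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
empirical_std = math.sqrt(sum((w - mean) ** 2 for w in flat) / len(flat))
print(f"target sigma = {xavier_std(256, 128):.4f}, empirical = {empirical_std:.4f}")
```

Wider layers get a smaller standard deviation, which is the shrinking effect the text relies on.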
So, if we can shrink the standard deviation early, the other multiplications, hopefully, won't result in exponential growth (or shrinkage) of the gradients.

This works best when we use a symmetric activation function. Examples of such functions → tanh (symmetric about the origin) and the sigmoid function (symmetric about the point (0, 0.5)).

1.14. ReLU and Leaky ReLU

What if we want to use a non-symmetric activation function? An example of such a function is the ReLU (Rectified Linear Unit) function.

Why do we want to use ReLU?

* More computationally efficient → all negative values map to 0 and all positive values map to themselves → the derivative is 0 for negative inputs and 1 for positive inputs.
* Tends to produce better model performance.
* Sparsity → not all neurons output a non-zero value (negative values → 0), which can reduce overfitting.

What are the downsides of ReLU?

* It has an uncapped activation. With sigmoid, we have saturation: the output of the neuron can be no larger than 1. ReLU, however, can output any positive value, which means that we can be susceptible to exploding gradients more often.
* We can also be susceptible to exploding forward passes: simply doing the multiplications of the forward pass all the way through the NN can produce unreasonably large numbers that overflow.
* Another problem is the dying ReLU problem.
* It comes from the fact that a neuron whose output is 0 also has a gradient of 0, so its weights stop being updated and it can stay at 0 forever.
* That means that the neuron will be completely dead and never output another value except 0.

Even with these problems, ReLU activation functions are used often in practice.
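The gradient behaviour discussed above can be checked numerically. The following is a minimal sketch, assuming scalar pre-activations and a hypothetical chain of 10 layers. The function names and the chosen inputs are illustrative only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum, 0.25, is reached at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x):
    return x if x > 0 else 0.0


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


depth = 10  # hypothetical number of layers in the chain

# Best case for the sigmoid: every layer contributes its maximum gradient.
chained_sigmoid = sigmoid_grad(0.0) ** depth

# ReLU with positive pre-activations passes the gradient through unchanged.
chained_relu = relu_grad(2.0) ** depth

# A neuron with a negative pre-activation receives no gradient at all,
# which is the mechanism behind the dying ReLU problem.
dead_grad = relu_grad(-1.0)

print(chained_sigmoid, chained_relu, dead_grad)
```

Even in the sigmoid's best case the chained gradient falls below one millionth after 10 layers, while the ReLU chain stays at 1 as long as the pre-activations are positive.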
Initialization for ReLU

Instead of Xavier initialization, we use the Kaiming initialization:

$$\sigma = \sqrt{\frac{2}{f_i}}$$

The Kaiming initialization can be used for other asymmetric activation functions like Leaky ReLU:

$$f(x) = \begin{cases} x & \text{if } x > 0 \\ 0.01x & \text{otherwise} \end{cases}$$

Leaky ReLU tries to get around the dead neuron problem by giving negative inputs a small non-zero slope, so their gradient is never exactly 0.

Another thing to do (in addition to the initialization) is feature scaling.

1.15. tanh

tanh is very similar to the sigmoid function, but instead of lying in the range (0, 1), its output lies in the range (-1, 1).

The idea is to cross-validate between all the activation functions to see which works best for your data.

Note: Different activation functions can be used at different layers of the NN.

Note: The last neuron dictates what the output looks like:

* sigmoid → binary classification
* softmax → multi-class classification (the class with the highest softmax value is your prediction)
* linear activation function → regression

1.16. Loss Functions

Regression → Mean Squared Error (MSE) →

$$L(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

$N$ is usually the batch_size.

Regression → Mean Absolute Error (MAE) →

$$L(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$

Classification → Cross Entropy (sometimes called logloss) →

$$L(y, \hat{y}) = -\left( y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right)$$

Classification → Cross Entropy for K classes →

$$L(y, \hat{y}) = -\sum_{i=1}^{K} y_i \log(\hat{y}_i)$$

1.17. Avoid Overfitting

1.17.1. Regularization

We can do regularization by adding an L1 or L2 term to the loss function. With an L1 term →

$$L(y, \hat{y}) = -\sum_{i=1}^{K} y_i \log(\hat{y}_i) + \lambda \sum_{j} |w_j|$$

An L2 term uses $w_j^2$ in place of $|w_j|$.

1.17.2. Dropout

With dropout, each neuron in a layer of the NN has some probability of sticking around for the current training iteration. The others are dropped out for this training iteration. For each layer we assign a dropout probability (e.g. P = 0.5).

The problem with dropout is that the neurons dropped during training are not dropped during prediction → so, all of a sudden, the sum arriving at the last node will be a lot higher (because we have all the neurons).

To solve this problem, we can use inverted dropout.
During training (for every mini-batch), the output of each layer is divided by the keep probability, i.e. the probability that a neuron is retained (1 minus the probability of dropping it) → output / keep probability. This ensures that the total sum coming into the last node during training matches, on average, the total sum coming into it at prediction time.

1.18. How to determine the number of layers and neurons?

If your data is linearly separable, you don't need any hidden layer at all.

Beyond that, it's safe to start with a single hidden layer. A common rule of thumb is to set the number of neurons in that hidden layer to the average of the number of input and output neurons.

Another alternative is to start with more layers or units than you need, and then examine the weights of your connections. Weights that are close to 0 suggest that the corresponding neuron can be pruned. Once you drop the neuron, you run cross-validation to see how much the NN model performance is affected.
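The two heuristics in this section can be sketched in a few lines. This is only an illustration under stated assumptions: "close to 0" is taken to mean that all of a neuron's incoming weights fall below a hypothetical threshold, and the function names, the threshold value, and the toy weights are made up for the example.

```python
def starting_hidden_size(n_inputs, n_outputs):
    """Rule of thumb from the text: average of the input and output sizes."""
    return max(1, round((n_inputs + n_outputs) / 2))


def prune_candidates(weights, threshold=1e-2):
    """Return the indices of hidden neurons whose incoming weights are all near 0.

    weights[j] holds the incoming weights of hidden neuron j.
    threshold is an illustrative cut-off, not a recommended value.
    """
    return [j for j, row in enumerate(weights)
            if all(abs(w) < threshold for w in row)]


hidden = starting_hidden_size(n_inputs=10, n_outputs=2)

# Toy incoming weights for three hidden neurons with two inputs each.
toy_weights = [
    [0.8, -0.3],       # clearly used
    [0.001, -0.004],   # every weight is near 0, so this is a pruning candidate
    [0.2, 0.0005],     # one weight still matters, so it is kept
]
candidates = prune_candidates(toy_weights)
print(hidden, candidates)
```

Any neuron flagged this way is only a candidate: as the text says, the decision should rest on how cross-validated performance changes after it is removed.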