
Neural Networks

Table of Contents

1. Neural Networks (NN)
1.1. What is a neuron (in NN)?
1.2. Why do we need the bias term?
1.3. How a NN learns non-linear patterns?
1.4. How does a NN learn?
1.5. Chain Rule
1.6. How do we update the weights?
1.7. Stochastic Gradient Descent (SGD)
1.8. Momentum
1.9. AdaGrad
1.10. Adam
1.11. RMSProp
1.12. AdaDelta
1.13. Vanishing and exploding gradients
1.13.1. Initialization
1.14. ReLU and Leaky ReLU
1.15. tanh
1.16. Loss Functions
1.17. Avoid Overfitting
1.17.1. Regularization
1.17.2. Dropout
1.18. How to determine the number of layers and neurons?

1. Neural Networks (NN)

1.1. What is a neuron (in NN)?

• A neuron, sometimes called a perceptron, is the smallest part of a NN: it takes the inputs, multiplies each one by a weight, adds a bias, and passes the sum through an activation function.
• With a sigmoid activation, the output of the neuron

𝜎 (W^{T} \cdot X + b)

is the same as logistic regression, and the loss is exactly the same as the one in logistic regression:

L (\hat{y}, y) = - \frac{1}{N} \sum_{i = 1}^{N} [ y_{i} \log {\hat{y}}_{i} + (1 - y_{i}) \log (1 - {\hat{y}}_{i}) ]

• Note: There are other non-linear functions used as well, such as ReLU, tanh, etc.
• We can update the weights by taking the gradient of the loss function with respect to the weights,

\nabla L = \begin{bmatrix} \frac{\partial L}{\partial b} \\ \frac{\partial L}{\partial w_{1}} \\ \vdots \\ \frac{\partial L}{\partial w_{n}} \end{bmatrix}

w^{t + 1} = w^{t} - r \nabla L
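The single neuron, its loss, and one gradient update can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the toy data (a logical AND), the learning rate of 0.5, and all function names are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(W, b, X):
    # X has shape (N, n_features); returns sigma(W^T x + b) for every row
    return sigmoid(X @ W + b)

def log_loss(y_hat, y):
    # Logistic-regression loss averaged over the N examples
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def gradient_step(W, b, X, y, r):
    y_hat = neuron_forward(W, b, X)
    error = y_hat - y                 # dL/dz for sigmoid + log loss
    grad_W = X.T @ error / len(y)     # dL/dw_1 ... dL/dw_n
    grad_b = error.mean()             # dL/db
    # Move against the gradient, scaled by the learning rate r
    return W - r * grad_W, b - r * grad_b

# Hypothetical toy data: logical AND, which is linearly separable
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.0, 0.0, 1.0])

W, b = np.zeros(2), 0.0
loss_before = log_loss(neuron_forward(W, b, X), y)
for _ in range(100):
    W, b = gradient_step(W, b, X, y, r=0.5)
loss_after = log_loss(neuron_forward(W, b, X), y)
print(loss_before, loss_after)
```

With all weights at 0 every prediction is 0.5, so the starting loss is log 2; the loss then decreases as the updates are applied.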

• Note: According to the update rule above, in order to update the weights we move in the opposite direction of the loss gradient, scaled by the learning rate r.

1.2. Why do we need the bias term?

• Bias is like the intercept added in a linear equation. It is an additional parameter of the NN that shifts the output of a neuron independently of the weighted sum of its inputs. Thus, the bias is a constant that helps the model fit the given data better.
• The bias term helps in cases where all the $w_{i} x_{i}$ terms are 0 (for example, when all the inputs are 0). In that case the gradient with respect to every weight is also 0, which means that the model cannot be trained. Adding a bias term lets the model be trained in such cases.

1.3. How a NN learns non-linear patterns?

• Each neuron in a NN learns a decision boundary.
• Since a NN has many neurons, the combined decision boundaries create a non-linear decision boundary.
• In the example below, there's no way to separate the data with one line. There are feature engineering methods (or other algorithms) that can handle this. But how can a NN separate these two classes?
• The above picture is a simple example of how a NN can capture non-linear patterns. In practice, NNs have more than one hidden layer and more neurons per layer.
– Note: Usually the number of neurons in each hidden layer decreases as we move forward through the network.

1.4. How does a NN learn?

• Let's explain this through a small NN below.

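[Figure: a small NN with two inputs $x_{1}$ and $x_{2}$, two hidden neurons ($h_{in}^{1} \to h_{out}^{1}$ and $h_{in}^{2} \to h_{out}^{2}$) fed through weights $w_{1}, \dots, w_{4}$, and one output neuron ($y_{in} \to {\hat{y}}_{out}$) fed through weights $w_{5}$ and $w_{6}$. Each neuron applies a sum $\sum$ followed by 𝜎.]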

h_{i n}^{1} = w_{1} x_{1} + w_{3} x_{2}

h_{i n}^{2} = w_{2} x_{1} + w_{4} x_{2}

h_{o u t}^{1} = 𝜎 (h_{i n}^{1})

h_{o u t}^{2} = 𝜎 (h_{i n}^{2})

y_{i n}^{} = w_{5} h_{o u t}^{1} + w_{6} h_{o u t}^{2}

{\hat{y}}_{o u t} = 𝜎 (y_{i n}^{})

The loss function is →

L (\hat{y}, y) = - \frac{1}{N} \sum_{i = 1}^{N} [ y_{i} \log {\hat{y}}_{i} + (1 - y_{i}) \log (1 - {\hat{y}}_{i}) ]

Loss for one example:

{\hat{y}}_{o u t} = 0.33, y = 1

\begin{array}{l} L = - \log ({\hat{y}}_{o u t}) \\ = - \log (𝜎 (y_{i n})) \\ = - \log (𝜎 (w_{5} h_{o u t}^{1} + w_{6} h_{o u t}^{2})) \\ = - \log (𝜎 (w_{5} 𝜎 (h_{i n}^{1}) + w_{6} 𝜎 (h_{i n}^{2}))) \\ = - \log (𝜎 (w_{5} 𝜎 (w_{1} x_{1} + w_{3} x_{2}) + w_{6} 𝜎 (w_{2} x_{1} + w_{4} x_{2}))) \end{array}

For this example, L = - \log (0.33) \approx 1.11.
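The forward pass of this small NN can be written out directly from the equations above. The sketch below is illustrative only: the weight values and inputs are arbitrary numbers chosen for the example, and the function names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x1, x2, w):
    w1, w2, w3, w4, w5, w6 = w
    h_in1 = w1 * x1 + w3 * x2          # input to hidden neuron 1
    h_in2 = w2 * x1 + w4 * x2          # input to hidden neuron 2
    h_out1 = sigmoid(h_in1)
    h_out2 = sigmoid(h_in2)
    y_in = w5 * h_out1 + w6 * h_out2   # input to the output neuron
    return sigmoid(y_in)               # y_hat_out

def loss_one_example(y_hat, y):
    # Log loss for a single example
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

w = (0.1, -0.2, 0.3, 0.4, -0.5, 0.6)   # arbitrary weights, for illustration
y_hat = forward(1.0, 2.0, w)
loss = loss_one_example(y_hat, y=1.0)
print(y_hat, loss)

# The worked example from the text: y_hat = 0.33, y = 1
print(loss_one_example(0.33, 1.0))
```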

1.5. Chain Rule

* Now, we have to take the gradient of L. Since the loss function is a complex function, it's hard to derive the analytical gradient in one go.
* Since the loss function is essentially a function of functions, we use the chain rule to compute the derivatives. For example,

\frac{\partial L}{\partial w_{1}} = \frac{\partial L}{\partial {\hat{y}}_{o u t}} \cdot \frac{\partial {\hat{y}}_{o u t}}{\partial y_{i n}} \cdot \frac{\partial y_{i n}}{\partial h_{o u t}^{1}} \cdot \frac{\partial h_{o u t}^{1}}{\partial h_{i n}^{1}} \cdot \frac{\partial h_{i n}^{1}}{\partial w_{1}}

\frac{\partial L}{\partial w_{6}} = \frac{\partial L}{\partial {\hat{y}}_{o u t}} \cdot \frac{\partial {\hat{y}}_{o u t}}{\partial y_{i n}} \cdot \frac{\partial y_{i n}}{\partial w_{6}}

* Note that the first two terms of $\frac{\partial L}{\partial w_{6}}$ and $\frac{\partial L}{\partial w_{1}}$ are the same. This means that we can use dynamic programming + chain rule to calculate the derivatives: the shared terms are computed once and reused.
* We start by first calculating $\frac{\partial L}{\partial w_{6}}$ and work our way back to $\frac{\partial L}{\partial w_{1}}$.
* This gives us something called backpropagation. Backpropagation is the standard way to train NNs.
* For training we need:
  * Forward Pass → To figure out how far our predictions are from the actual value.
  * Backpropagation → Once we have the loss, we can backpropagate the gradients to update all of the weights in our NN.
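The reuse of shared chain-rule terms can be made concrete for the six-weight network. The sketch below computes all six derivatives by hand and compares them with a finite-difference estimate; the weight values, inputs, and function names are assumptions for illustration. It uses the fact that, for a sigmoid output with log loss, the first two chain-rule terms multiply out to ŷ − y.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x1, x2, y):
    w1, w2, w3, w4, w5, w6 = w
    h1 = sigmoid(w1 * x1 + w3 * x2)
    h2 = sigmoid(w2 * x1 + w4 * x2)
    y_hat = sigmoid(w5 * h1 + w6 * h2)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def backprop(w, x1, x2, y):
    w1, w2, w3, w4, w5, w6 = w
    # Forward pass, keeping the intermediate values
    h1 = sigmoid(w1 * x1 + w3 * x2)
    h2 = sigmoid(w2 * x1 + w4 * x2)
    y_hat = sigmoid(w5 * h1 + w6 * h2)
    # Shared term: dL/dy_hat * dy_hat/dy_in = y_hat - y
    d_y_in = y_hat - y
    # Output-layer weights: dy_in/dw5 = h1, dy_in/dw6 = h2
    g5, g6 = d_y_in * h1, d_y_in * h2
    # Hidden layer reuses d_y_in; sigmoid'(h_in) = h_out * (1 - h_out)
    d_h_in1 = d_y_in * w5 * h1 * (1 - h1)
    d_h_in2 = d_y_in * w6 * h2 * (1 - h2)
    g1, g3 = d_h_in1 * x1, d_h_in1 * x2
    g2, g4 = d_h_in2 * x1, d_h_in2 * x2
    return np.array([g1, g2, g3, g4, g5, g6])

def numerical_grad(w, x1, x2, y, eps=1e-6):
    # Central finite differences, used only as a sanity check
    grads = np.zeros(len(w))
    for i in range(len(w)):
        w_plus, w_minus = np.array(w, dtype=float), np.array(w, dtype=float)
        w_plus[i] += eps
        w_minus[i] -= eps
        grads[i] = (loss(w_plus, x1, x2, y) - loss(w_minus, x1, x2, y)) / (2 * eps)
    return grads

w = np.array([0.1, -0.2, 0.3, 0.4, -0.5, 0.6])   # arbitrary weights
analytic = backprop(w, 1.0, 2.0, 1.0)
numeric = numerical_grad(w, 1.0, 2.0, 1.0)
print(np.max(np.abs(analytic - numeric)))
```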

1.6. How do we update the weights?

w^{t + 1} = w^{t} - r \nabla L

* If we plot the average loss obtained from all of the training examples against a particular parameter, say $w_{1}$, we get a function like the one below,

[Figure: the loss $L (\hat{y}, y)$ plotted against $w_{1}$.]

* The difference between this function and the logistic regression loss function is that here we have local optima. In logistic regression, we were guaranteed to have one minimum.
* That's because we stacked up neurons and added layers, which makes the loss non-convex, so we opened ourselves up to local optima.
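To see how the starting point decides which minimum plain gradient descent reaches, here is a small sketch on a made-up one-dimensional loss with two minima. The polynomial, the learning rate, and the starting points are assumptions chosen for illustration; they do not come from a real network.

```python
def loss(w):
    # Hypothetical non-convex loss with a global and a local minimum
    return w ** 4 - 3 * w ** 2 + w

def grad(w):
    return 4 * w ** 3 - 6 * w + 1

def gradient_descent(w, r=0.01, steps=1000):
    for _ in range(steps):
        w = w - r * grad(w)   # w^{t+1} = w^t - r * grad
    return w

w_from_left = gradient_descent(-2.0)    # ends in the deeper minimum
w_from_right = gradient_descent(2.0)    # ends in the shallower minimum
print(w_from_left, loss(w_from_left))
print(w_from_right, loss(w_from_right))
```

Both runs stop where the gradient is 0, but only the run started on the left reaches the lower loss.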

1.7. Stochastic Gradient Descent (SGD)

* There are some techniques that increase the chance of not getting stuck in a local optimum. The most popular method is Stochastic Gradient Descent (SGD).
* SGD's tendency to avoid getting stuck in local optima is just a by-product of taking a random example and updating the weights with just that single example. This randomness in the weight updates can increase the chances that we don't get stuck in a local optimum.
* The problem with SGD is that it's slow to converge.
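A minimal sketch of SGD for a single sigmoid neuron, updating the weights from one randomly chosen example at a time. The synthetic data, the learning rate, and the number of steps are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two features, label is 1 when their sum is positive
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_loss(w, b):
    y_hat = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

w, b, r = np.zeros(2), 0.0, 0.1
initial_loss = mean_loss(w, b)

for _ in range(2000):
    i = rng.integers(len(X))                 # pick one random example
    error = sigmoid(X[i] @ w + b) - y[i]     # gradient of its loss w.r.t. z
    w = w - r * error * X[i]                 # update from that example only
    b = b - r * error

final_loss = mean_loss(w, b)
print(initial_loss, final_loss)
```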

1.8. Momentum

* The problem with SGD is that it's slow to converge.
* One idea to speed up convergence is by incorporating momentum.
* The idea of momentum is to keep track of the previous updates.

\begin{array}{l} w^{t + 1} = w^{t} - r (\nabla L^{t} + 𝛾 \nabla L^{t - 1} + 𝛾^{2} \nabla L^{t - 2} + \dots + 𝛾^{n} \nabla L^{t - n}) \\ or \\ w^{t + 1} = w^{t} - V^{t} \\ V^{t} = 𝛾 V^{t - 1} + r \nabla L^{t} \end{array}

* The 𝛾 parameter is usually set to 0.9 so that the previous gradients don't matter as much as the current gradient: a gradient from k steps ago is weighted by $𝛾^{k}$.
* The problem with momentum is that sometimes we could build up so much momentum that we overshoot the global optimum.
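The momentum update can be sketched as below and compared with plain gradient descent (the same code with 𝛾 = 0). The quadratic loss with one flat and one steep direction, the learning rate, and the step count are assumptions picked so that the difference is easy to see.

```python
import numpy as np

# Hypothetical loss: 0.5 * (1 * w0^2 + 100 * w1^2), flat in w0 and steep in w1
curvature = np.array([1.0, 100.0])

def loss(w):
    return 0.5 * np.sum(curvature * w ** 2)

def grad(w):
    return curvature * w

def run(gamma, r=0.01, steps=200):
    w = np.array([1.0, 1.0])
    V = np.zeros_like(w)
    for _ in range(steps):
        V = gamma * V + r * grad(w)   # V^t = gamma * V^{t-1} + r * grad
        w = w - V                     # w^{t+1} = w^t - V^t
    return loss(w)

loss_plain = run(gamma=0.0)       # plain gradient descent
loss_momentum = run(gamma=0.9)    # with momentum
print(loss_plain, loss_momentum)
```

In this setup the run with momentum ends at a lower loss after the same number of steps, because the accumulated velocity speeds up progress along the flat direction.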

1.9. AdaGrad

* There's another method called AdaGrad, which adjusts the learning rate per parameter.

w_{1}^{t + 1} = w_{1}^{t} - r_{1}^{t} \frac{\partial L^{t}}{\partial w_{1}}

r_{1}^{t} = \frac{r_{g e n e r a l}}{\sqrt{(\frac{\partial L^{t}}{\partial w_{1}})^{2} + (\frac{\partial L^{t - 1}}{\partial w_{1}})^{2} + \dots + (\frac{\partial L^{1}}{\partial w_{1}})^{2}} + 𝜀}

* Note: $r_{g e n e r a l}$ is typically set to 0.001.
* Why would you want to do something like this?
  * It balances the update value at each step: a parameter that has accumulated large gradients gets a smaller learning rate, and a parameter that has accumulated small gradients keeps a relatively larger one.
  * This way it moderates the steps we take at each update (and for each parameter).
* Note: The 𝜀 term in the denominator is set to a small value to avoid dividing by 0.
* Note: AdaGrad really helps in the case of sparse features, because the weights associated with sparse features are updated less often, so their accumulated squared gradient stays small and their learning rate stays higher.
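A minimal sketch of the per-parameter AdaGrad update, assuming the accumulator sums the squared gradients of all steps so far. The two-parameter setup, in which the second parameter only receives a gradient every tenth step to mimic a sparse feature, is made up for illustration.

```python
import numpy as np

def adagrad_step(w, grad, G, r_general=0.001, eps=1e-8):
    G = G + grad ** 2                      # accumulate squared gradients
    r = r_general / (np.sqrt(G) + eps)     # per-parameter learning rate
    return w - r * grad, G, r

w = np.zeros(2)
G = np.zeros(2)
for t in range(100):
    # Parameter 0 gets a gradient every step; parameter 1 only every 10th step
    grad = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    w, G, r = adagrad_step(w, grad, G)

print(G)   # accumulated squared gradients per parameter
print(r)   # the rarely updated parameter keeps the larger learning rate
```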

1.10. Adam

* The other method is Adam.
* Adam combines momentum and an adaptive learning rate.

w^{t + 1} = w^{t} - \frac{r_{g e n e r a l}}{\sqrt{{\hat{V}}_{t}} + 𝜀} {\hat{m}}_{t}

m_{t} = 𝛽_{1} m_{t - 1} + (1 - 𝛽_{1}) \nabla L^{t}

V_{t} = 𝛽_{2} V_{t - 1} + (1 - 𝛽_{2}) (\nabla L^{t})^{2}

{\hat{m}}_{t} = \frac{m_{t}}{1 - 𝛽_{1}^{t}}, {\hat{V}}_{t} = \frac{V_{t}}{1 - 𝛽_{2}^{t}}

* Note: $𝛽_{1}$ and $𝛽_{2}$ are hyperparameters (commonly 0.9 and 0.999).
* Note: The only difference between $m_{t}$ and $V_{t}$ is that $V_{t}$ uses the squared gradient of the loss.
* Note: Notice that $m_{t}$ and $V_{t}$ are adjusted (i.e. ${\hat{m}}_{t}$ and ${\hat{V}}_{t}$). The reason is that these terms are estimates of the first and second moments of the gradient. Because they start at 0, they are biased towards 0 in the early steps, and dividing by $1 - 𝛽^{t}$ gives an unbiased estimate.
* Note: Adam looks like a ball rolling down a hill with momentum, but the ball also has friction. The idea is that the friction helps the parameters settle in the global optimum, while the momentum helps the parameters escape local minima.
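The Adam equations translate almost line by line into code. The sketch below is a minimal illustration: the default values used for `beta1`, `beta2`, and `eps` are the commonly quoted ones, and the weights and gradient are made-up numbers.

```python
import numpy as np

def adam_step(w, grad, m, V, t, r_general=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum)
    V = beta2 * V + (1 - beta2) * grad ** 2     # second moment (squared gradient)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    V_hat = V / (1 - beta2 ** t)
    w = w - r_general / (np.sqrt(V_hat) + eps) * m_hat
    return w, m, V

w = np.array([1.0, 1.0])
grad = np.array([0.5, -20.0])    # very different gradient magnitudes
w_new, m, V = adam_step(w, grad, np.zeros(2), np.zeros(2), t=1)
print(w_new)
```

On the first step the bias-corrected moments reduce to the gradient and its square, so each parameter moves by roughly `r_general` in the direction opposite to its gradient, regardless of the gradient's size.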

1.11. RMSProp

• Will add the note later

1.12. AdaDelta

• Will add the note later

1.13. Vanishing and exploding gradients

• Once we have the gradients, whatever optimizer we use, multiplying these gradients together (as the chain rule requires) can result in a problem.
– Let's say we use the sigmoid activation function; the maximum value of its derivative is 0.25.
– Now, if we multiply a lot of 0.25s together, the final gradient (based on the chain rule) will shrink towards 0 → this results in underflow.
– This is called → vanishing gradient.
• Also, some of the gradient terms include the values of the weights.
– If the weights are extremely large, by multiplying them together we can end up with an extremely large value → this is called an exploding gradient.
• There are a few methods to mitigate these problems.

1.13.1. Initialization

• One of the methods to tackle vanishing/exploding gradients is to initialize the weights of the NN in a particular way.
• A bad way to initialize the weights is just to use a uniform distribution between 0 and 1.
• Another bad way is to initialize these parameters with a normal distribution in which the mean is 0 and the standard deviation is 1.
• What we can do instead is to initialize the weights from a normal distribution in which the mean is 0 but the standard deviation is

𝜎 = \sqrt{\frac{2}{f_{i} + f_{o}}}

• $f_{i}$ is the fan-in and $f_{o}$ is the fan-out.
• Fan-in is the number of inputs to a particular layer and fan-out is the number of outputs of that layer.
• This way, we can initialize the weights for each layer of the NN.
• This is called Xavier or Glorot initialization.
• The reason why this helps is that we're shrinking the standard deviation according to how many terms are multiplied and summed together per layer.
– Not doing this makes the variances of each layer multiply together, and that causes the variance to grow exponentially.
– So, if we shrink the standard deviation early, the later multiplications hopefully won't result in exponential growth (or shrinkage) of the gradients.
• This works best when we use something called a symmetric activation function. Examples of such functions → sigmoid and tanh.

1.14. ReLU and Leaky ReLU

• What if we want to use a non-symmetric activation function?
– An example of such a function is the ReLU (Rectified Linear Unit) function.
• Why do we want to use ReLU?
– More computationally efficient → all negative values take on the value 0, and all positive values keep their own value → when taking derivatives, the derivative for a negative input is 0 and the derivative for a positive input is just 1.
– Tends to produce better model performance.
– Sparsity → reduces overfitting → not all neurons will output a value (negative values → 0).
• What are the downsides of ReLU?
– It has an uncapped activation. With sigmoid, we'd have something called saturation, where the output of the neuron can be no larger than 1.
– However, ReLU can output any non-negative value, which means that we could be susceptible to exploding gradients more often.
– We can also be susceptible to exploding forward passes: simply by doing multiplications in the forward pass all the way through the NN, we can get unreasonably large numbers that overflow.
– Another problem is called the dying ReLU problem.
* It comes from the fact that a neuron whose input is negative outputs 0 and has a gradient of 0, so its weights stop updating. If that happens for every training example, the neuron will be 0 forever.
* That means that the neuron will be completely dead and never output another value except 0.
– Even with these problems, ReLU activation functions are used often in practice.
• Initialization for ReLU
– Instead of Xavier initialization, we use the Kaiming initialization →

𝜎 = \sqrt{\frac{2}{f_{i}}}
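The effect of the initial standard deviation can be checked with a small forward-pass experiment. The sketch below pushes random inputs through a stack of ReLU layers and compares weights drawn with standard deviation 1 against Kaiming-scaled weights. The depth (20), the width (256), the batch size, and the helper names are assumptions for illustration.

```python
import numpy as np

def xavier_std(fan_in, fan_out):
    return np.sqrt(2.0 / (fan_in + fan_out))

def kaiming_std(fan_in, fan_out):
    # fan_out is unused; kept so all helpers share one signature
    return np.sqrt(2.0 / fan_in)

def unit_std(fan_in, fan_out):
    # The "bad" choice from the text: standard deviation of 1
    return 1.0

def relu(x):
    return np.maximum(0.0, x)

def rms_after_forward(std_fn, n_layers=20, width=256, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(100, width))          # a batch of random inputs
    for _ in range(n_layers):
        W = rng.normal(0.0, std_fn(width, width), size=(width, width))
        a = relu(a @ W)
    return np.sqrt(np.mean(a ** 2))            # typical activation size

rms_unit = rms_after_forward(unit_std)
rms_kaiming = rms_after_forward(kaiming_std)
print(xavier_std(256, 256), kaiming_std(256, 256))
print(rms_unit, rms_kaiming)
```

With standard deviation 1 the activations grow by a large factor at every layer, while the Kaiming-scaled weights keep them at roughly the same magnitude as the inputs.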

– The Kaiming initialization can be used for other asymmetric activation functions like Leaky ReLU →

f (x) = \begin{cases} x & \text{if } x > 0 \\ 0.01 x & \text{otherwise} \end{cases}
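ReLU, Leaky ReLU, and their derivatives can be sketched in a few lines. The function names are assumptions, and the slope of 0.01 is the value from the formula above.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise
    return (x > 0).astype(float)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def leaky_relu_grad(x, slope=0.01):
    # 1 for positive inputs, a small non-zero slope otherwise
    return np.where(x > 0, 1.0, slope)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(x), relu_grad(x))
print(leaky_relu(x), leaky_relu_grad(x))
```

The ReLU gradient is exactly 0 for negative inputs, while the Leaky ReLU gradient is small but non-zero there.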

– Leaky ReLU tries to get around the dead neuron problem by giving negative inputs a small non-zero slope.
• Another thing to do (in addition to the initialization) is feature scaling.

1.15. tanh

• tanh is very similar to the sigmoid function, but instead of being in the range (0, 1), it lies in the range (−1, 1).
• The idea is to cross-validate between all the activation functions to see which works best for your data.
• Note: Different activation functions can be used at different layers of a NN.
• Note: The last neuron will dictate what the output looks like: sigmoid → binary classification, softmax → multi-class classification (the class with the maximum softmax value is your prediction), linear activation function → regression.

1.16. Loss Functions

• Regression → Mean Squared Error (MSE) →

L (y, \hat{y}) = \frac{1}{N} \sum_{i = 1}^{N} (y_{i} - {\hat{y}}_{i})^{2}

– N is usually the batch_size.
• Regression → Mean Absolute Error (MAE) →

L (y, \hat{y}) = \frac{1}{N} \sum_{i = 1}^{N} | y_{i} - {\hat{y}}_{i} |

• Classification → Binary Cross Entropy (sometimes called log loss) →

L (y, \hat{y}) = - (y \log \hat{y} + (1 - y) \log (1 - \hat{y}))

• Classification → Cross Entropy for K classes →

L (y, \hat{y}) = - \sum_{k = 1}^{K} y_{k} \log ({\hat{y}}_{k})
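The four loss functions can be sketched in NumPy as follows. The function names and the example values are assumptions for illustration; the multi-class version expects a one-hot label vector and a vector of predicted probabilities.

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def binary_cross_entropy(y, y_hat):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cross_entropy(y_onehot, y_hat):
    # Only the probability assigned to the true class contributes
    return -np.sum(y_onehot * np.log(y_hat))

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 1.0])
print(mse(y, y_hat), mae(y, y_hat))

print(binary_cross_entropy(1.0, 0.33))
print(cross_entropy(np.array([0.0, 1.0, 0.0]), np.array([0.2, 0.7, 0.1])))
```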

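1.17. Avoid Overfitting

1.17.1. Regularization

• We can do regularization by adding an $L_{1}$ or $L_{2}$ penalty term to the loss function. For example, with an $L_{1}$ penalty (for $L_{2}$, replace $| w_{i} |$ with $w_{i}^{2}$) →

L (y, \hat{y}) = - \sum_{k} y_{k} \log ({\hat{y}}_{k}) + 𝜆 \sum_{i} | w_{i} |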
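A short sketch of adding the penalty term to a loss value. The function names, the weights, the data-loss value, and 𝜆 = 0.1 are assumptions made for the example.

```python
import numpy as np

def l1_penalty(weights, lam):
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    return lam * np.sum(weights ** 2)

def regularized_loss(data_loss, weights, lam, kind="l1"):
    penalty = l1_penalty(weights, lam) if kind == "l1" else l2_penalty(weights, lam)
    return data_loss + penalty

weights = np.array([1.0, -2.0, 0.5])
data_loss = 0.4                       # e.g. a cross-entropy value
loss_l1 = regularized_loss(data_loss, weights, lam=0.1, kind="l1")
loss_l2 = regularized_loss(data_loss, weights, lam=0.1, kind="l2")
print(loss_l1, loss_l2)
```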

1.17.2. Dropout

• Dropout is when, per layer of the NN, each neuron in that layer has some probability of sticking around for the current training iteration. The others will be dropped out for this training iteration.
• For each layer we assign a dropout probability (e.g. P = 0.5).
• The problem with dropout is that the neurons dropped out during training will not be dropped out during prediction → so, all of a sudden, the sum coming into the last node will be a lot higher (because we have all the neurons).
– To solve this problem, we can use inverted dropout. During training (on every mini-batch), we take the output of the layer and divide it by the keep probability, i.e. 1 minus the dropout rate →

\frac{\text{output}}{1 - \text{dropout rate}}
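Inverted dropout can be sketched as below, assuming the surviving activations are divided by the keep probability (1 minus the dropout rate). The function name, the layer shape, and the all-ones activations are made up for illustration.

```python
import numpy as np

def inverted_dropout(activations, dropout_rate, rng):
    keep_prob = 1.0 - dropout_rate
    # Each neuron is kept with probability keep_prob
    mask = rng.random(activations.shape) < keep_prob
    # Scale the survivors so the expected output is unchanged
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))    # hypothetical layer output
dropped = inverted_dropout(activations, dropout_rate=0.5, rng=rng)

print(np.unique(dropped))             # dropped neurons are 0, kept ones are scaled up
print(activations.mean(), dropped.mean())
```

At prediction time the layer output is used as is, with no mask and no scaling.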

– This ensures that the total sum coming into the last node during training will match, on average, the total sum coming into it at prediction time.

1.18. How to determine the number of layers and neurons?

• If your data is linearly separable, you don't need any hidden layer at all.
• Beyond that, it's safe to start with a single hidden layer; as a rule of thumb, the number of neurons in that hidden layer can be the average of the number of inputs and outputs.
• Another alternative is to start with more layers or units than you need, and then examine the weights of your connections.
– Weights that are close to 0 suggest that the corresponding neuron can be pruned.
– Once you drop the neuron, run cross-validation to see how much the NN model performance is affected.
