
Modeling: Part 1 - General Deep Learning & Machine Learning

Table of Contents

1. Deep Learning 101
  1.1. Choosing an activation function:
  1.2. Transfer Learning
    1.2.1. Transfer Learning: BERT Example
    1.2.2. Transfer Learning approaches
2. Deep Learning on EC2/EMR
3. Tuning Neural Networks
  3.1. Learning Rate
  3.2. Effect of Learning Rate
  3.3. Batch Size
4. Neural Network Regularization Techniques
  4.1. Grief with Gradients
    4.1.1. The Vanishing Gradient Problem
    4.1.2. Fixing the Vanishing Gradient Problem
    4.1.3. Gradient Checking
  4.2. L1 & L2 Regularization
    4.2.1. What’s the difference?
    4.2.2. Why would you want L1?
5. Ensemble methods
  5.1. Bagging
  5.2. Boosting
  5.3. Bagging vs. Boosting

NOTE: I only took notes on the important parts, not everything covered in the course. Some basics of DL and ML are already covered in my other notes.

1. Deep Learning 101

1.1. Choosing an activation function:

• For multi-class classification, use softmax on the output layer.
• RNNs do well with tanh.
• For everything else:
  – Start with ReLU
  – Last resort → PReLU, Maxout
  – Swish for really deep networks

1.2. Transfer Learning

• NLP models (and others) are too big and complex to build from scratch and re-train every time.
  – The latest may have hundreds of billions of parameters!
• Model zoos such as Hugging Face offer pre-trained models to start from.
  – Integrated with SageMaker via Hugging Face Deep Learning Containers

1.2.1. Transfer Learning: BERT Example

• Hugging Face offers a Deep Learning Container (DLC) for BERT.
• It’s pre-trained on BookCorpus and Wikipedia.
• You can fine-tune BERT (or DistilBERT etc.) with your own additional training data through transfer learning:
  – Tokenize your own training data so it is in the same format.
  – Then continue training the model on your data, with a low learning rate.

1.2.2. Transfer Learning approaches

• Continue training a pre-trained model (fine-tuning)
  – Use this when the pre-trained model was trained on far more data than you’ll ever have.
  – Use a low learning rate to ensure you are just incrementally improving the model.
• Add new trainable layers to the top of a frozen model
  – Learns to turn old features into predictions on new data.
  – Can do both → add new layers, then fine-tune as well.
• Retrain from scratch
  – If you have large amounts of training data, and it’s fundamentally different from what the model was pre-trained with.
  – And you have the computing capacity for it!
• Use it as-is
  – When the model was already trained on the kind of data you care about.
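
To make the "add new trainable layers to the top of a frozen model" approach from 1.2.2 concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `FROZEN_W` and `frozen_features` are hypothetical stand-ins for a pre-trained network, the toy data is made up, and the new "layer" is just a logistic-regression head. Only the head is updated, with a low learning rate.

```python
import math

# Hypothetical stand-in for a pre-trained model: fixed weights, never updated.
FROZEN_W = [[0.5, -0.2], [0.1, 0.8]]

def frozen_features(x):
    """Map a raw input to features using the frozen (pre-trained) weights."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(data, lr=0.05, epochs=500):
    """Train only the new logistic-regression head on top of the frozen features."""
    head_w, head_b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            f = frozen_features(x)
            p = sigmoid(sum(w * fi for w, fi in zip(head_w, f)) + head_b)
            err = p - y  # gradient of the log loss w.r.t. the head's pre-activation
            head_w = [w - lr * err * fi for w, fi in zip(head_w, f)]
            head_b -= lr * err
    return head_w, head_b

def predict(x, head_w, head_b):
    """Probability of class 1 from the frozen features plus the trained head."""
    f = frozen_features(x)
    return sigmoid(sum(w * fi for w, fi in zip(head_w, f)) + head_b)

# Toy "new task" data (made up): the label is 1 when x[0] > x[1].
data = [([2.0, 0.0], 1), ([1.5, 0.5], 1), ([0.0, 2.0], 0), ([0.5, 1.5], 0)]
frozen_before = [row[:] for row in FROZEN_W]
head_w, head_b = train_head(data)
```

After training, `predict([2.0, 0.0], head_w, head_b)` should be above 0.5 and `predict([0.0, 2.0], head_w, head_b)` below it, while `FROZEN_W` is untouched. Fine-tuning would additionally let the base weights move, again with a small learning rate.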

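Going back to the activation functions in 1.1, the two most-used ones can be sketched in a few lines of plain Python. This is only an illustration (the example logits are made up); real frameworks ship their own implementations.

```python
import math

def relu(x):
    """ReLU: pass positive values through, clamp negatives to zero."""
    return max(0.0, x)

def softmax(logits):
    """Softmax: turn raw output-layer scores into class probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up output-layer scores for a 3-class problem.
probs = softmax([2.0, 1.0, 0.1])
```

The values in `probs` sum to 1 and keep the ordering of the logits, which is why softmax suits the output layer of a multi-class classifier.
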
2. Deep Learning on EC2/EMR

• EMR supports Apache MXNet and GPU instance types.
• Appropriate instance types for deep learning:
  – P3: up to 8 Tesla V100 GPUs → more expensive option
  – P2: up to 16 K80 GPUs → less expensive option
  – G3: up to 4 M60 GPUs (all Nvidia chips)
  – G5g: AWS Graviton 2 processors / Nvidia T4G Tensor Core GPUs
    * Not yet available in EMR
    * Also used for Android game streaming
  – P4d: A100 "UltraClusters" for supercomputing
• Deep Learning AMIs

3. Tuning Neural Networks

3.1. Learning Rate

• Neural networks are trained by gradient descent (or similar means).
• We start at some random point and sample different solutions (weights), seeking to minimize some cost function over many epochs.
• How far apart these samples are is the learning rate.

3.2. Effect of Learning Rate

• Too high a learning rate means you might overshoot the optimal solution!
• Too small a learning rate will take too long to find the optimal solution.
• Learning rate is an example of a hyperparameter.

3.3. Batch Size

• How many training samples are used within each batch of each epoch.
• Somewhat counter-intuitively:
  – Smaller batch sizes can work their way out of “local minima” more easily.
  – Batch sizes that are too large can end up getting stuck in the wrong solution.
  – Random shuffling at each epoch can make this show up as very inconsistent results from run to run.
• Reminder from 3.2: small learning rates increase training time.
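
The effect of the learning rate in 3.2 is easy to see on a toy problem. The sketch below assumes a made-up cost function f(w) = w² (minimum at w = 0, gradient 2w) and runs plain gradient descent with three different learning rates; the function name and the specific rates are illustrative choices, not values from the course.

```python
def gradient_descent(lr, steps=50, start=5.0):
    """Minimize f(w) = w**2 (gradient 2*w) starting from `start`; return the final w."""
    w = start
    for _ in range(steps):
        w -= lr * 2 * w  # step against the gradient, scaled by the learning rate
    return w

too_small = gradient_descent(lr=0.001)   # barely moves away from the start in 50 steps
reasonable = gradient_descent(lr=0.1)    # ends very close to the minimum at 0
too_large = gradient_descent(lr=1.1)     # overshoots further on every step and diverges
```

With the same step budget, the tiny rate is still far from the minimum, the moderate rate has essentially converged, and the large rate ends up further away than where it started.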

4. Neural Network Regularization Techniques

• What is regularization?
  – Preventing overfitting
    * Overfitted models are good at making predictions on the data they were trained on, but not on new data they haven’t seen before.
    * Overfitted models have learned patterns in the training data that don’t generalize to the real world.
    * Often seen as high accuracy on the training data set, but lower accuracy on the test or evaluation data set.
      · When training and evaluating a model, we use training, evaluation, and testing data sets.
  – Regularization techniques are intended to prevent overfitting.
• Common regularization techniques:
  – Dropout
  – Early stopping

4.1. Grief with Gradients

4.1.1. The Vanishing Gradient Problem

• When the gradient (the slope the optimizer follows) approaches zero, training can get stuck.
• We end up working with very small numbers that slow down training, or even introduce numerical errors.
• Becomes a problem with deeper networks and RNNs, as these “vanishing gradients” are propagated back through many layers.
• Opposite problem: “exploding gradients”.

4.1.2. Fixing the Vanishing Gradient Problem

• Multi-level hierarchy
  – Break up levels into their own sub-networks trained individually.
• Long short-term memory (LSTM)
• Residual Networks
  – i.e., ResNet
  – Behaves like an ensemble of shorter networks
• Better choice of activation function
  – ReLU is a good choice

4.1.3. Gradient Checking

• A debugging technique.
• Numerically check the derivatives computed during training.
• Useful for validating the code of neural network training.
  – But you’re probably not going to be writing this code…

4.2. L1 & L2 Regularization

• Preventing overfitting in ML in general.
• A regularization term is added as weights are learned.
• L1 term is the sum of the absolute values of the weights →

\lambda \sum_{i=1}^{k} |w_i|

• L2 term is the sum of the squares of the weights →

\lambda \sum_{i=1}^{k} w_i^2

• Same idea can be applied to loss functions.

4.2.1. What’s the difference?

• L1: sum of the absolute values of the weights
  – Performs feature selection: entire features go to 0
  – Computationally inefficient
  – Sparse output
• L2: sum of the squares of the weights
  – All features remain considered, just weighted
  – Computationally efficient
  – Dense output

4.2.2. Why would you want L1?

• Feature selection can reduce dimensionality
  – Out of 100 features, maybe only 10 end up with non-zero coefficients!
  – The resulting sparsity can make up for its computational inefficiency
• But, if you think all of your features are important, L2 is probably a better choice.
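
The sparse-versus-dense difference in 4.2.1 can be illustrated with a small sketch. The penalty functions follow the two formulas above directly. The two "shrink" helpers are assumptions for illustration: soft-thresholding is one common way to apply an L1 penalty in an optimizer step, and the L2 helper is a single gradient step on the penalty alone with step size 1. The weights and λ are made up.

```python
import math

def l1_penalty(weights, lam):
    """lambda * sum of |w_i|."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """lambda * sum of w_i squared."""
    return lam * sum(w * w for w in weights)

def l1_shrink(weights, lam):
    """Soft-thresholding (assumed optimizer step): weights smaller than lam become exactly 0."""
    return [math.copysign(max(abs(w) - lam, 0.0), w) for w in weights]

def l2_shrink(weights, lam):
    """One gradient step on the L2 penalty alone: every weight is scaled toward 0."""
    return [w * (1 - 2 * lam) for w in weights]

weights = [0.9, -0.05, 0.02, -0.7]  # made-up weights: two large, two small
lam = 0.1
sparse = l1_shrink(weights, lam)
dense = l2_shrink(weights, lam)
```

In `sparse`, the two small weights are zeroed out (those features are dropped), while in `dense` all four weights survive, just smaller.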

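Returning to gradient checking from 4.1.3: the idea is to compare the derivative your code computes against a numerical estimate. The sketch below assumes a made-up one-parameter loss with a known derivative and uses a central-difference estimate; the function names and the tolerance are illustrative choices.

```python
def loss(w):
    """Toy loss with a known derivative (hypothetical example)."""
    return w ** 3 + 2 * w

def analytic_grad(w):
    """The derivative as hand-written code would compute it: 3w^2 + 2."""
    return 3 * w ** 2 + 2

def numerical_grad(f, w, eps=1e-5):
    """Central-difference estimate of df/dw."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

def relative_error(a, b):
    """Scale-independent difference between two gradient values."""
    return abs(a - b) / max(abs(a), abs(b), 1e-12)

w = 1.5
err = relative_error(analytic_grad(w), numerical_grad(loss, w))
# A deliberately buggy derivative (forgets the "+ 2") for comparison.
buggy_err = relative_error(3 * w ** 2, numerical_grad(loss, w))
```

A correct derivative gives a tiny `err`, while the buggy one gives a relative error that is orders of magnitude larger, which is how the check exposes mistakes in training code.
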
5. Ensemble methods

5.1. Bagging

• Generate N new training sets by random sampling with replacement.
• Train one model per resampled set; these models can be trained in parallel.

5.2. Boosting

• Observations are weighted.
• Some will take part in new training sets more often.
• Training is sequential; each classifier takes into account the previous one’s success.

5.3. Bagging vs. Boosting

• XGBoost is the latest hotness.
• Boosting generally yields better accuracy.
• But bagging helps avoid overfitting.
• Bagging is easier to parallelize.
• So, it depends on your goal.
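
The mechanics behind 5.1 and 5.2 can be sketched without any ML library. The helpers below are illustrative assumptions, not any library's API: `bootstrap_sample` draws a new training set with replacement (bagging), `majority_vote` is one common way to combine the resulting models' predictions, and `reweight` is a simplified boosting-style update that uses a fixed factor, whereas real algorithms such as AdaBoost derive the factor from the classifier's error rate.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Bagging: draw len(data) items at random with replacement."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Combine the predictions of independently trained models."""
    return Counter(predictions).most_common(1)[0][0]

def reweight(weights, misclassified, factor=2.0):
    """Boosting-style update: upweight misclassified observations, then renormalize."""
    raw = [w * factor if miss else w for w, miss in zip(weights, misclassified)]
    total = sum(raw)
    return [w / total for w in raw]

rng = random.Random(0)          # seeded so the resampling is repeatable
data = list(range(10))          # made-up training set of 10 observations
bags = [bootstrap_sample(data, rng) for _ in range(3)]  # N = 3 new training sets
vote = majority_vote(["cat", "dog", "cat"])
weights = reweight([0.25] * 4, [False, True, False, False])
```

Each bag can be handed to a separate worker, which is why bagging parallelizes easily. The reweighting step depends on the previous classifier's mistakes, so boosting has to run sequentially.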