MLExpert

Table of Contents

1. Naive Bayes 1.1. What is Laplace Smoothing? 1.2. How to Prepare the Data for Naive Bayes? 2. Performance 2.1. Confusion Matrix 2.2. Can we control the sensitivity and specificity tradeoff? 2.3. Cross Validation 2.4. Receiver Operator Characteristic (ROC) curve 2.5. Hyperparameter Tuning 2.6. Types of Cross Validation 2.7. Steps of ML Modeling 3. Naive Bayes Optimization 4. K Nearest Neighbors (KNN) 5. Decision Trees 5.1. How to build a CART model? 5.2. Gini Impurity 5.3. When do stop splitting? 5.4. How do we make prediction? 5.5. How does CART handle missing data? 5.6. How does CART make predictions for multi-class data? 5.7. How does CART handle regression? 5.8. How to handle categorical features in CART? 5.9. What are the down sides of CART? 5.9.1. Boosting 5.9.2. Bagging 5.9.3. XGBoost vs. LightGBM 6. Linear Regression 6.1. R-squared 6.2. Test for significance 6.2.1. When to use a t-test? 6.2.2. Which t-test? 6.2.3. How to do a t-test? 6.2.4. p-value from t-test 6.2.5. t-test critical values 6.2.6. One-sample t-test 6.2.7. Two-sample t-test 6.2.8. Two-sample t-test if variances are equal 6.2.9. Two-sample t-test if variances are unequal (Welch's t-test) 6.2.10. Paired t-test 6.2.11. t-test vs. z-test 6.3. Multicollinearity 6.3.1. How to detect multicollinearity? 6.4. Feature interaction 6.5. Simpson's paradox 7. Logistic Regression 7.1. Coefficient interpretation 7.2. Multinomial regression 7.3. Regularization 7.3.1. Why Lasso regularization induce model sparsity? 7.4. Early stopping 7.5. Other considerations 8. Support Vector Machine (SVM) 8.1. Hard-margin SVM 8.2. Soft-margin SVM 8.2.1. How to solve soft-margin SVM 8.3. SVM for non-linear data 8.4. Kernel trick 8.4.1. RBF kernel 8.5. The dual problem 8.5.1. Lagrange multiplier 8.6. More on kernel trick 8.7. Kernel trick - other resources 8.8. Multi-class SVM 8.8.1. Structured SVM 8.9. SVM for regression 8.10. Other considerations 9. k-Means 10. Singular Value Decomposition (SVD) 10.1. 
Eigendecomposition 10.2. Principal Component Analysis (PCA) 11. Neural Networks (NN) 11.1. What is a neuron (in NN)? 11.2. Why do we need the bias term? 11.3. How a NN learns non-linear patterns? 11.4. How does a NN learn? 11.5. Chain Rule 11.6. How do we update the weights? 11.7. Stochastic Gradient Descent (SGD) 11.8. Momentum 11.9. AdaGrad 11.10. Adam 11.11. RMSProp 11.12. AdaDelta 11.13. Vanishing and exploding gradients 11.13.1. Initialization 11.14. ReLU and Leaky ReLU 11.15. tanh 11.16. Loss Functions 11.17. Avoid Overfitting 11.17.1. Regularization 11.17.2. Dropout 11.18. How to determine the number of layers and neurons? 12. Convolutional Neural Networks (CNN) 13. Recurrent Neural Networks (RNN) 14. Generative Adversarial Networks (GAN) 14.1. GAN Loss Function 14.2. Example 14.3. Evaluation 15. Recommender Systems 15.1. Collaborative Filtering 15.1.1. User-based Collaborative Filtering 15.1.2. Item-based Collaborative Filtering 15.1.3. Considerations on Memory-based Filtering 15.2. Matrix Factorization 15.2.1. Implicit Ratings 15.2.2. Alternating Least Squares (ALS) 15.2.3. Predicting with ALS 15.3. Deep Learning Extension 15.4. Challenges of Collaborative Filtering 15.5. Content-based Filtering 15.6. Deep Learning Recommender Systems 16. Learning To Rank 16.1. Framing a Ranking Problem 16.2. Candidate Generation 16.3. Ranking the Top K 16.3.1. Learning To Rank 16.3.2. Learning To Rank Loss Function 16.4. RankNet 16.5. LambdaNet 16.6. nDCG 16.7. LambdaMART 16.8. Other Notes

1. Naive Bayes
• Let's say we want to identify spam messages.
• We can use Bayes' theorem to express the probability that a message is spam given that a word $w$ appears in it:

$$P(spam \mid w) = \frac{P(spam) \cdot P(w \mid spam)}{P(spam) \cdot P(w \mid spam) + P(not\ spam) \cdot P(w \mid not\ spam)}$$

• $w$ is a word from the vocabulary $V = \{w_{1}, w_{2}, \dots, w_{n}\}$, i.e. the list of words that our model recognizes.
• $P(spam)$ is the probability of seeing a spam message regardless of the words in it.
• Note: $P(spam)$ and $P(not\ spam)$ are called the priors, and $P(w \mid spam)$ and $P(w \mid not\ spam)$ are called the likelihoods. The denominator is called the evidence, and $P(spam \mid w)$ is called the posterior.
• Note: Since we're modeling the presence/absence of each word, this is called a Bernoulli model.
• A message is described by every word in the vocabulary, so the likelihood we actually need is that of the whole message. For a message that contains only the word $w_{i}$ (every other vocabulary word is absent), we expand the joint likelihood with the chain rule of probability:

$$\begin{array}{c} P(\neg w_{1}, \neg w_{2}, \dots, w_{i}, \dots, \neg w_{n} \mid spam) = \\ P(\neg w_{n} \mid spam) \cdot \\ P(\neg w_{n-1} \mid spam, \neg w_{n}) \cdot \\ P(\neg w_{n-2} \mid spam, \neg w_{n}, \neg w_{n-1}) \cdot \\ \dots \\ P(w_{i} \mid spam, \neg w_{n}, \neg w_{n-1}, \dots, \neg w_{i+1}) \cdot \\ \dots \\ P(\neg w_{1} \mid spam, \neg w_{n}, \neg w_{n-1}, \dots, w_{i}, \dots, \neg w_{2}) \end{array}$$

• where $\neg w$ indicates the absence of word $w$ from a message.
• If $n$ is large, the number of calculations gets really high. So, we make the simplifying assumption that words are independent of each other. Equation (2) then becomes:

$$\begin{array}{c} P(\neg w_{1}, \neg w_{2}, \dots, w_{i}, \dots, \neg w_{n} \mid spam) = \\ P(\neg w_{n} \mid spam) \cdot P(\neg w_{n-1} \mid spam) \cdot \dots \cdot P(w_{i} \mid spam) \cdot \dots \cdot P(\neg w_{1} \mid spam) \end{array}$$

• Note: The simplifying assumption can discard useful information, since some words are more likely to appear together in a sentence, e.g. London and England.
– Due to this simplifying assumption, the model is called Naive Bayes.
• Note: $P(w_{k} \mid spam)$ is the probability of seeing word $w_{k}$ in a spam message, i.e.

$$P(w_{k} \mid spam) = \frac{\text{no. of spam messages with word } w_{k}}{\text{total no. of spam messages}}$$

and similarly for $P(\neg w_{k} \mid spam)$ and $P(w_{k} \mid not\ spam)$.
• Note: In practice, each word $w_{k}$ is represented by a one-hot encoded vector.
• The Problem of Zero Probability: Since the probabilities of the individual words are multiplied together, if the probability of even one word is $0$, the entire product becomes $0$.
– To solve this issue, Naive Bayes applies Laplace smoothing to every word in the vocabulary.

1.1. What is Laplace Smoothing?
• Laplace smoothing is a smoothing technique that handles the problem of zero probability in Naive Bayes. Using Laplace smoothing, we can represent $P(w_{k} \mid spam)$ as

$$P(w_{k} \mid spam) = \frac{\text{no. of spam messages with word } w_{k} + \alpha}{N + \alpha \cdot n}$$

• where:
– $N$ is the total number of spam messages.
– $n$ is the number of outcomes being smoothed over: the vocabulary size when counting word occurrences (multinomial model), or $2$ (present/absent) when counting messages that contain the word, as in the Bernoulli model used here.
– $\alpha$ is the smoothing parameter.
* With $n = 2$, higher $\alpha$ values push the likelihood towards $0.5$, i.e. the probability of a word becomes $0.5$ for both spam and not-spam messages.
* This is not so useful. In practice, it's preferred to use $\alpha = 1$.

1.2. How to Prepare the Data for Naive Bayes?
• Let's say we have the following message:
– "Hey, good point here - This is interesting."
• Here are the steps we take to prepare this sentence:
– Remove extra white space
– Remove punctuation
– Tokenize (create a list of words/tokens) → ["Hey", "good", "point", "here", "-", "This", "is", "interesting"]
– Remove stop words (i.e. words that don't add much information) → ["Hey", "good", "point", "-", "interesting"]
– Remove non-alphabetic tokens → ["Hey", "good", "point", "interesting"]
– Stemming (i.e. removing the ending modifiers of words, leaving the stem of the word) → ["Hey", "good", "point", "interest"]
* Lemmatization: A more careful form of stemming which ensures that removing the word modifiers results in a proper lemma.
* The problem with lemmatization is that it is often more expensive than stemming. So, with large data, you may want to go with stemming over lemmatization.
• We can represent words as binary vectors of $0$ and $1$ (a $1$ marks that the word is present). This is called vectorization.
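The pieces of Section 1 (priors, per-word likelihoods, the independence assumption, and Laplace smoothing) can be put together in a small sketch. This is an illustration under stated assumptions, not a reference implementation: the function names `train_bernoulli_nb` and `predict_proba` and the toy messages are hypothetical, messages are assumed to be already tokenized into sets of words, and the smoothing denominator is taken as $N + 2\alpha$ (two outcomes per word, present or absent). Logs are used instead of raw products to avoid numerical underflow.

```python
import math


def train_bernoulli_nb(messages, labels, vocabulary, alpha=1.0):
    """Estimate priors and smoothed per-word likelihoods.

    messages: list of sets of tokens; labels: one class name per message.
    """
    priors, likelihoods = {}, {}
    for cls in set(labels):
        docs = [m for m, y in zip(messages, labels) if y == cls]
        priors[cls] = len(docs) / len(messages)
        # Laplace smoothing; the 2 stands for two outcomes (present/absent).
        likelihoods[cls] = {
            w: (sum(w in d for d in docs) + alpha) / (len(docs) + 2 * alpha)
            for w in vocabulary
        }
    return priors, likelihoods


def predict_proba(message, priors, likelihoods, vocabulary):
    """Posterior per class under the naive independence assumption."""
    log_scores = {}
    for cls, prior in priors.items():
        score = math.log(prior)
        for w in vocabulary:
            p = likelihoods[cls][w]
            # Present words contribute P(w|cls); absent ones 1 - P(w|cls).
            score += math.log(p if w in message else 1.0 - p)
        log_scores[cls] = score
    # Normalise: dividing by the sum plays the role of the evidence term.
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}


# Toy data (hypothetical): two spam and two not-spam messages.
messages = [{"free", "win"}, {"free"}, {"meeting"}, {"lunch", "meeting"}]
labels = ["spam", "spam", "not spam", "not spam"]
vocabulary = ["free", "win", "meeting", "lunch"]

priors, likelihoods = train_bernoulli_nb(messages, labels, vocabulary)
posterior = predict_proba({"free"}, priors, likelihoods, vocabulary)
print(round(posterior["spam"], 3))  # 0.9
```

Note that "meeting" never appears in a spam message, yet its smoothed likelihood is 0.25 rather than 0, so it cannot zero out the whole product. With `alpha=0` the sketch would fail on `log(0)`, which is exactly the zero-probability problem described above.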

2. Performance
• Continuing from Section 1, let's say we want to measure how good the spam detection model is.
– We can do so by choosing a cutoff point/decision point (say $0.5$) and predicting spam when the probability is $> 0.5$ and not-spam otherwise.
– Then, we can calculate the accuracy metric, which is simply the number of correct predictions divided by the total number of examples.
• Note: Accuracy is not a good metric for measuring the performance of the spam model because $93\%$ of the data is not-spam → imbalanced dataset. So, by just predicting $0$ (not-spam) for all the examples, we'd get $93\%$ accuracy.
• Note: One problem, particularly with imbalanced data, is that we often care more about the performance on the minority class, which in this case means predicting spam examples correctly.
– There are two ways the model can make a wrong prediction:
* False Positive → predicting spam when it's actually not-spam.
* False Negative → predicting not-spam when it's actually spam.
* The other cases are called True Positive → predicting spam when it's actually spam, and True Negative → predicting not-spam when it's actually not-spam.

2.1. Confusion Matrix
• We can summarize all the above in something called a confusion matrix.

	Actual Positive	Actual Negative
Predicted Positive	True Positive (TP)	False Positive (FP)
Predicted Negative	False Negative (FN)	True Negative (TN)
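As a rough sketch, the four cells of the matrix can be counted directly from paired lists of actual and predicted labels, and the usual ratios (accuracy, sensitivity, specificity, precision, F1) follow from them. The helper names `confusion_counts`, `ratio`, and `summarize` and the toy labels are hypothetical; spam is assumed to be encoded as 1.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for the given positive label."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def ratio(numerator, denominator):
    """Safe division: an empty denominator yields 0.0 instead of an error."""
    return numerator / denominator if denominator else 0.0


def summarize(tp, fp, fn, tn):
    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * sensitivity * precision, sensitivity + precision),
    }


# Toy example: 3 spam (1) and 7 not-spam (0) messages.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
metrics = summarize(*counts)
print(counts)  # (2, 1, 1, 6)
```

Here accuracy is 0.8 while sensitivity is only 2/3, which illustrates why accuracy alone can look better than the model's performance on the minority class.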

• Sensitivity = $\frac{TP}{TP + FN}$ → the model's ability to correctly classify spam messages (or positive cases). Higher sensitivity → fewer false negatives.
• Specificity = $\frac{TN}{TN + FP}$ → the classifier's ability to correctly classify not-spam messages (or negative cases). Higher specificity → fewer false positives.
– Note: In the case of the spam detection model, we'd prefer higher specificity so that an important message isn't falsely classified as spam.
– Note: In some other problems, such as cancer detection, we'd prefer higher sensitivity because we want as few false negatives as possible.
• Precision = $\frac{TP}{TP + FP}$ → the fraction of predicted positives that are actually positive.
• $F_{1}$ Score = $\frac{2 \cdot (sensitivity \times precision)}{sensitivity + precision}$

→ it is the harmonic mean of sensitivity and precision.

2.2. Can we control the sensitivity and specificity tradeoff?
• Higher sensitivity → fewer FN
• Higher specificity → fewer FP
• We can change the tradeoff by changing the cutoff point: lowering the cutoff labels more messages as spam, which raises sensitivity and lowers specificity.

2.3. Cross Validation
• To test the performance of our model, we usually split the data into three parts:
– Training set
– Validation set
– Test set
• The validation set gives us the opportunity to tune our model without using the test set itself.
• We use the test set only for evaluating model performance on unseen examples.

2.4. Receiver Operating Characteristic (ROC) curve
• The ROC curve plots $sensitivity$ on one axis (usually the vertical one) and $1 - specificity$ on the other axis.
• As we tune our model on the validation set, we can plot the sensitivity and specificity that each cutoff threshold produces.
– The $45^{\circ}$ line corresponds to random guessing: for every positive example that we correctly classify, we also incorrectly classify a negative example.
– Every model should aim to lie above, i.e. be better than, the $45^{\circ}$ line.
– To obtain a good balance between specificity and sensitivity, we ought to pick the threshold whose point lies farthest from the $45^{\circ}$ line.
• In order to compare different models, we use the Area Under the Curve (AUC) of the ROC. The model with the higher AUC is generally the better predictor across all cutoff points.

2.5. Hyperparameter Tuning
• Hyperparameters are settings that go along with the model but are not learned during training (e.g. $\alpha$ in Laplace smoothing); we choose them ourselves, typically by tuning on the validation set.

2.6. Types of Cross Validation
• Hold-out Validation → We assign a subset of the examples to be our validation set.
• K-fold Validation → We split the data into $k$ folds, train $k$ different models, and use a different fold as the validation set each time.
• Leave-One-Out Validation → This is k-fold validation with $k = n$, where $n$ is the number of examples → mostly used when we have a small amount of data.

2.7. Steps of ML Modeling
1. Problem
2. Hypothesis
3. Simple Heuristic
4. Measure Impact
5. More Complex Technique
6. Measure Impact
7. Tune Model
8. Replace Existing Technique
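The threshold sweep behind the ROC curve of Section 2.4 can be sketched as follows. This is a minimal illustration with hypothetical names (`roc_points`, `auc`, `best_cutoff`) and toy scores; it assumes labels are 0/1, that both classes are present, and it predicts positive when the score is greater than or equal to the cutoff so that each observed score can serve as a cutoff. The area is computed with the trapezoid rule, and the "farthest from the $45^{\circ}$ line" rule is implemented as maximizing sensitivity minus the false positive rate.

```python
def roc_points(y_true, scores):
    """Return [(cutoff, fpr, tpr)] from the strictest to the loosest cutoff."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 0)
        # fpr = 1 - specificity, tpr = sensitivity
        points.append((cutoff, fp / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoid-rule area under the (fpr, tpr) curve, starting at (0, 0)."""
    curve = [(0.0, 0.0)] + [(fpr, tpr) for _, fpr, tpr in points]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def best_cutoff(points):
    """Cutoff whose point lies farthest above the 45-degree line."""
    return max(points, key=lambda p: p[2] - p[1])[0]


# Toy validation set: 1 = spam, 0 = not-spam, with predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
area = auc(points)
cutoff = best_cutoff(points)
print(round(area, 3))  # 0.889
```

An AUC of 8/9 here matches the fraction of (spam, not-spam) pairs in which the spam message received the higher score, which is one common way to interpret the number.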

3. Naive Bayes Optimization

4. K Nearest Neighbors (KNN)
• The basic idea of KNN is to infer the label of an unlabeled example from its neighbors.
• The parameter $k$ is the number of neighbors we consider.
• The label is determined by majority voting.
– For example, for $k = 7$, if 6 of the neighbors are labeled $1$ and 1 neighbor is labeled $0$, then $6/7 = 85.7\% \to 1$ and $1/7 = 14.3\% \to 0$, so the prediction is $1$.
– Note: In order to avoid ties, it's better to pick odd numbers for $k$.
• Distance in KNN is usually defined as the Euclidean distance → $d(\vec{a}, \vec{b}) = \sqrt{(a_{1} - b_{1})^{2} + \dots + (a_{m} - b_{m})^{2}}$
– Note: We should apply some feature normalization, like min-max scaling, before calculating distances.
– Note: We could also use the Manhattan distance → $d(\vec{a}, \vec{b}) = |a_{1} - b_{1}| + \dots + |a_{m} - b_{m}|$
• KNN can be susceptible to outliers.
– To determine whether an unlabeled example is an outlier, we can average its distance to its $k$ nearest neighbors; if the average distance is greater than some threshold, we label it as an outlier (an odd example).
– We can determine the threshold through cross validation.
• How to account for categorical features in KNN?
– For ordinal categorical features, we can use Gower distance.
– For comparing two asymmetric binary variables, we can use Jaccard distance.
* Note: For asymmetric binary variables, we exclude the positions where both values are zero. For symmetric binary variables, we include the zeros as well.
– For multi-category variables, we can use Hamming distance.
• How does KNN handle regression?
– It just averages the values of the $k$ nearest neighbors.
• Practical Considerations
– KNN performs poorly when the features are high-dimensional (especially with Euclidean distance).
* One way to mitigate this problem is dimensionality reduction.
* Another way is to do forward feature selection using cross validation.
– KNN is sensitive to scaling (this is true for any model that's based on distance).
– One way to improve the performance of KNN is to weight the votes.
* For any neighbor $i$ ($i \leqslant k$), the weight is calculated as → $\frac{d_{1} + d_{2} + \dots + d_{k}}{d_{i}}$
* So, as the distance $d_{i}$ to neighbor $i$ gets smaller, that neighbor gets a higher weight.
* Note: The numerator is fixed and only the denominator changes.
– KNN is computationally expensive → $O(n \cdot d \cdot k)$, where $n$ is the number of nodes (training examples) and $d$ is the dimension of each node.
* To reduce the complexity, we can use a k-d tree data structure, which allows us to locate the relevant points faster than iterating over all the points.
* Note that the k-d tree also suffers as dimensionality increases.
• Example: Cyber Security
– Programs running on a machine make calls to the operating system every once in a while (i.e. system calls).
– Our goal is to detect intrusive processes/programs on all the machines in a system.
– Our features are the system calls, which look like this:
* $pid_{1}$ = ['mkdir', 'mount', 'poll', 'chown']
* $pid_{2}$ = ['chroot', 'bind', 'fork', 'open', 'pready']
* $pid_{3}$ = ...
– The labels are $intrusive \to 1$ and $non\text{-}intrusive \to 0$.
– We can treat system calls as just words and apply TF-IDF to them to make them numerical.
* $pid_{1}$ = [0.15, 0.12, 0.13, 0.31]
* $pid_{2}$ = [0.16, 0.31, 0.46, 0.21, 0.11]
* $pid_{3}$ = ...
– For an unlabeled example, we calculate its distance to all other nodes and determine its label by majority voting among its closest $k$ neighbors.
– Let's say we also have a categorical feature named 'priority'. It has three values: High, Medium, Low. Let's calculate the Gower distance.
* High → 0, Medium → 1, Low → 2
* Each value is first scaled as $value / \max(value)$: High → $0/2$, Medium → $1/2$, Low → $2/2$.
* The Gower distance between two examples on this feature is the absolute difference of their scaled values, e.g. $d_{Gower}(high, medium) = 1/2$ and $d_{Gower}(high, low) = 1$.
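The main ingredients of Section 4 (min-max scaling, Euclidean distance, and distance-weighted voting with weight $(d_{1} + \dots + d_{k}) / d_{i}$) can be sketched together as below. The names `min_max_scale`, `euclidean`, and `knn_predict` and the toy data are hypothetical, and the small floor on the distance is an added assumption to avoid dividing by zero when the query coincides with a training point. The search is brute force (no k-d tree).

```python
import math
from collections import defaultdict


def min_max_scale(rows):
    """Scale every column to [0, 1]; also return the fitted scaler."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]  # avoid 0-width columns

    def scale(row):
        return [(v - lo) / span for v, lo, span in zip(row, lows, spans)]

    return [scale(r) for r in rows], scale


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, k=3, weighted=True):
    """Label of `query` by (optionally distance-weighted) majority vote."""
    distances = [(euclidean(x, query), y) for x, y in zip(train_x, train_y)]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    total = sum(d for d, _ in nearest)
    votes = defaultdict(float)
    for d, label in nearest:
        # Weight = (d_1 + ... + d_k) / d_i, so closer neighbours count more.
        votes[label] += total / max(d, 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)


# Two features on very different scales; scaling keeps either from dominating.
raw_x = [[1, 100], [2, 110], [3, 120], [8, 900], [9, 950], [10, 1000]]
labels = [0, 0, 0, 1, 1, 1]
scaled_x, scale = min_max_scale(raw_x)
print(knn_predict(scaled_x, labels, scale([2.5, 115]), k=3))  # 0

# One very close neighbour can outvote two distant ones when votes are weighted.
line_x, line_y = [[0.0], [1.0], [1.1]], [1, 0, 0]
print(knn_predict(line_x, line_y, [0.05], k=3, weighted=True))   # 1
print(knn_predict(line_x, line_y, [0.05], k=3, weighted=False))  # 0
```

For regression, the same neighbor search would apply, with the vote replaced by an average of the neighbors' values.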

5. Decision Trees
• CART (Classification and Regression Trees)
• CART can handle missing data in training and prediction.

5.1. How to build a CART model?
• We start by figuring out how to split the examples by their label.
• If we have only numerical features, we have to decide which feature to split on and which value of that feature is best to split at.
• The goal is to find the feature and value that best separate our examples by label.
• The first step is to sort the data by the selected feature.
• Then, we take the average of each pair of consecutive data points (i.e. $x_{i}$ and $x_{i+1}$).
• Each of these averages is a candidate split point. We can split the data at each of those split points (picture below).

5.2. Gini Impurity
• How do we evaluate the effectiveness of each of these split points?
– We use Gini impurity to help us out.
– Gini Impurity: For each node, we square the probabilities of getting class $1$ and class $0$ and subtract their sum from one, i.e. $1 - (P_{class 0}^{2} + P_{class 1}^{2})$ (picture below). We repeat this calculation for the rest of the splits and for all the features, and choose the split with the lowest Gini impurity.

5.3. When do we stop splitting?
• We can set either a max_depth or a min_example_per_node.
– Or we can keep splitting until the nodes are pure (→ this can lead to overfitting).

5.4. How do we make predictions?
• For each data point we want to predict, we feed it through the tree until we reach one of the end nodes (leaves).
– The prediction is the majority class of the leaf node. In other words, the vote of the leaf node.

5.5. How does CART handle missing data?
• If an example is missing the value of the feature used for a split, we just use the next-best feature that splits the data (the one with the second-lowest Gini impurity).

5.6. How does CART make predictions for multi-class data?
• First, we add a term to the Gini impurity for each class →

$$1 - (P_{class 0}^{2} + P_{class 1}^{2} + P_{class 2}^{2} + \dots)$$
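The split search from Sections 5.1 and 5.2 can be sketched for a single numerical feature as follows. The names `gini` and `best_split` are hypothetical. One detail is an assumption beyond the text above: to score a split, the impurities of the two child nodes are combined as an average weighted by node size, which is the usual CART convention. The `gini` helper already handles any number of classes, matching the multi-class formula.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, impurity) of the best split on one numeric feature."""
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    n = len(pairs)
    best_threshold, best_score = None, float("inf")
    for (lo, _), (hi, _) in zip(pairs, pairs[1:]):
        if lo == hi:
            continue  # identical values cannot be separated
        threshold = (lo + hi) / 2.0  # midpoint of consecutive data points
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        # Size-weighted average of the two child impurities (assumption).
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


values = [1, 2, 3, 10, 11, 12]
labels = [0, 0, 0, 1, 1, 1]
print(best_split(values, labels))  # (6.5, 0.0)
```

A full tree would repeat this search over every feature, split on the winner, and recurse on each child until a stopping rule such as max_depth is reached.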

• In the prediction step, we again use voting (i.e. the relative majority, or mode, of the leaf).

5.7. How does CART handle regression?
• For regression models, we use Mean Squared Error (MSE) instead of Gini impurity.
– MSE of a node → We sum the squared differences between all the values in the end node (leaf) and the average of that node, and divide by the number of values. Here's how to calculate the MSE for one node:

$$MSE = \frac{\sum_{i} (l_{x_{i}} - l_{avg})^{2}}{n}$$

– MSE of a split → To get the MSE of a split, we add up all the squared differences of its child nodes (as in the numerator of equation (5), each child measured against its own average) and divide by the total number of examples under that split.
– We choose the split that gives us the smallest MSE.
• For prediction → We just average the values of the observations in a node.

5.8. How to handle categorical features in CART?
• If the categorical feature is binary, the split is easy → each child node gets one of the values. But what about categorical features with more than 2 values?
• In this case, we have to consider a split for every possible way of dividing the values into two groups.
• For example, let's say we want to split on this list of countries: [US, UK, IRI, GER]. The splits look like this:
– [US] and [UK, IRI, GER]
– [US, UK] and [IRI, GER]
– [US, IRI] and [UK, GER]
– and so on ...
• Note: A categorical feature with $n$ distinct values has $2^{n} - 1$ non-empty subsets, which gives $2^{n-1} - 1$ distinct two-way splits (7 for the four countries above).

5.9. What are the downsides of CART?
• CARTs tend to overfit very easily.
– As the depth increases, the ability of the model to overfit increases as well.
– Solutions:
* Limit max_depth to a small value (e.g. between $2$ and $5$).
* Use boosting (refer below for a simple explanation of how boosting works).
* Use bagging → Bootstrap Aggregation
– A single CART is rarely used in practice. Instead, ensemble variations of it are used very often. Examples of such methods → XGBoost, LightGBM, CatBoost, AdaBoost, RandomForest, etc.

5.9.1. Boosting
• Boosting, put very simply, is training another tree model on the error of the previous one. In other words, boosting builds an ensemble of weak learners one after another.
– Suppose we train a tree and it predicts $230$ where the true value is $270$. Boosting in this case trains the next tree on the error of the first tree (i.e. $270 - 230 = 40$).
– In general, boosting over several rounds looks like this:

$$Pred(x_{i}) = tree_{1}(label(x_{i})) + tree_{2}(error(tree_{1})) + tree_{3}(error(tree_{2})) + \dots$$
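The "train the next tree on the error of the previous ones" idea can be sketched with one-split regression trees (stumps) on a single feature. This is a simplified illustration, not how any particular library implements it: the names `fit_stump` and `boost` are hypothetical, the ensemble starts from the mean of the labels rather than from a first full tree, and a `learning_rate` that shrinks each tree's contribution is an added assumption. It needs at least two distinct feature values.

```python
def fit_stump(xs, targets):
    """One-split regression tree: the threshold minimising squared error."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2.0
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left)
        sse += sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=10, learning_rate=0.5):
    """Fit each new stump to the residual error of the ensemble so far."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [1, 2, 3, 4, 5, 6]
ys = [10, 12, 14, 30, 32, 34]
one_tree = boost(xs, ys, n_trees=1)
many_trees = boost(xs, ys, n_trees=20)
print(mse(one_tree, xs, ys) > mse(many_trees, xs, ys))  # True
```

The training error keeps shrinking as trees are added, which is also why the number of trees needs to be chosen with cross-validation rather than by training error.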

– Note: Boosted trees also overfit easily. We need to perform cross-validation to find the best number of trees and max_depth.
– Note: In boosted trees, each individual tree is called a weak learner.
* These learners are defined as having better performance than random chance.

5.9.2. Bagging
• In bagging, we build many trees, each trained on a bootstrap sample of the data (observations drawn with replacement) and also on a subset of the features. To get predictions, we just average the predictions of all trees (or take a majority vote for classification) → This is called Random Forest.
– Note: Out of pure math, it works out that roughly 36.8% (i.e. $1/e$) of the examples are left out of each bootstrap sample → We automatically get an out-of-bag sample which we can use as our validation set.
– Note: Bagging reduces the variance.
– Note: Another variant of the decision tree algorithm itself (not of bagging) is C4.5.
* It can only do classification.
* It can do $n$-ary splits (instead of binary splits).
* It uses information gain based on entropy (instead of the Gini index).

5.9.3. XGBoost vs. LightGBM
• Source
• What is Gradient Boosting? Gradient Boosting refers to a methodology in machine learning where an ensemble of weak learners is used to improve the model performance in terms of efficiency, accuracy, and interpretability.
– These learners are defined as having better performance than random chance.
– The idea is to single out the instances that are difficult to predict accurately and develop new weak learners to handle them.
• How does Gradient Boosting work?
(a) The initial model is trained and predictions are run on the whole dataset.
(b) The error between the actual value and the prediction is calculated, so the instances that were predicted poorly carry the most weight in the next round.
(c) Subsequently, a new model is trained that attempts to fix the error of the previous model, and several models are created in the same way. We arrive at the final model by combining the weighted outputs of all the models.
• Gradient Boosting can be applied to:
– Regression – adding up the (weighted) outputs of the weak learners
– Classification – adding up the outputs of the weak learners into a score that is converted into a class prediction
• What is XGBoost?
– XGBoost → eXtreme Gradient Boosting
– XGBoost focuses on computation speed and model performance. It was introduced by Tianqi Chen and is currently part of a wider toolkit by DMLC (Distributed Machine Learning Community).
– It can be used for both classification and regression.
– It supports the following kinds of boosting:
* Gradient Boosting, as controlled by the learning rate
* Stochastic Gradient Boosting, which leverages sub-sampling at the row, column, or column-per-split level
* Regularized Gradient Boosting, using L1 (Lasso) and L2 (Ridge) regularization
– Some of the other features offered from a system performance point of view are:
* Using a cluster of machines to train a model with distributed computing
* Utilization of all the available cores of a CPU during tree construction for parallelization
* Out-of-core computing when working with datasets that do not fit into memory
* Making the best use of hardware with cache optimization
– In addition to the above, the framework:
* Accepts multiple types of input data
* Works well with sparse input data for the tree and linear boosters
* Supports the use of customized objective and evaluation functions
– To learn more about XGBoost, check out this page.
• What is LightGBM?
– Similar to XGBoost, LightGBM (by Microsoft) is a distributed high-performance framework that uses decision trees for ranking, classification, and regression tasks.
– The advantages are as follows:
* Faster training speed and good accuracy, resulting from LightGBM being a histogram-based algorithm that buckets feature values into bins (which also requires less memory)
* Compatibility with large and complex datasets, while being much faster during training
* Support for both parallel learning and GPU learning
– In contrast to the level-wise (horizontal) growth in XGBoost, LightGBM carries out leaf-wise (vertical) growth, which results in more loss reduction and in turn higher accuracy while being faster.
* But this may also result in overfitting on the training data, which can be handled with the max_depth parameter that limits how deep the splitting can go. Hence, XGBoost is capable of building more robust models than LightGBM.
• Structural Differences between XGBoost and LightGBM
– LightGBM is often significantly faster than XGBoost while delivering almost equivalent performance. We might wonder, what exactly are the differences between LightGBM and XGBoost?
– Leaf Growth: LightGBM executes faster while maintaining good accuracy levels, primarily due to two novel techniques:
* Gradient-Based One-Side Sampling (GOSS)
· In Gradient Boosted Decision Trees, the data instances have no native weight, so GOSS uses the gradient of each instance as a measure of its importance instead.
· Data instances with larger gradients contribute more towards information gain.
· To maintain the accuracy of the information gain estimate, GOSS retains the instances with larger gradients and performs random sampling on the instances with smaller gradients.
· Note: We can learn more about this concept in the article – What makes LightGBM lightning fast?
· Note: The YouTube channel Machine Learning University also released a video on LightGBM that covers GOSS.
* Exclusive Feature Bundling (EFB)
· EFB is a near-lossless method to reduce the number of effective features.
· In a sparse feature space, many features (just like one-hot encoded features) rarely take non-zero values simultaneously.
· To reduce dimensionality, improve efficiency, and maintain accuracy, EFB bundles these features together, and this bundle is called an Exclusive Feature Bundle.
· This thread on EFB and LightGBM's paper can be consulted to gain better insight.
* On the other hand, XGBoost uses pre-sorted and histogram-based algorithms for computing the best split, whereas LightGBM does this with GOSS. The pre-sorted splitting works as follows:
1. For each node, enumerate over all features
2. For every feature, sort the instances by the feature value
3. Using a linear scan, decide the best split for that feature based on information gain
4. Pick the best split across all the features
– Handling Categorical Features
* Both LightGBM and XGBoost accept numerical features only. This means that the nominal features in our data need to be transformed into numerical features.
* XGBoost, by default, treats such variables as numerical variables with an order, and we don't want that. If we instead create dummies for each of the categorical values (one-hot encoding), XGBoost will be able to do its job correctly. But for larger datasets this is a problem, as encoding takes longer.
· For example, if we encode a categorical variable with three values as (0, 1, 2), XGBoost treats them as ordered, as if category $1$ is greater than category $0$, and so on. This is not what we want. That's why categorical variables need to be one-hot encoded.
* On the other hand, LightGBM accepts a parameter that marks which columns are categorical and handles this issue with ease by splitting on equality.
* Note: The H2O library provides an implementation of XGBoost that supports the native handling of categorical features.
– Handling Missing Values
* Both algorithms treat missing values by assigning them to the side that reduces the loss the most in each split.
– Feature Importance Methods
* Gain
· Every feature in a dataset has some sort of importance/weight in helping build an accurate model.
· Gain refers to the relative contribution of a particular feature in the context of a particular tree.
· This can also be understood as the amount of relevant information that the model gains from a feature for making better predictions.
· Available in both XGBoost and LightGBM.
* Split/Frequency/Weight
· This method (called Split in LightGBM and Frequency or Weight in XGBoost) calculates the relative count of times a particular feature occurs in all splits of the model's trees. One issue with this method is that it is prone to bias when categorical features have a large number of categories.
· Available in both XGBoost and LightGBM.
* Coverage
· The relative number of observations per feature.
· Available only in XGBoost.
– Processing Unit
* The algorithm we want to use often depends on the type of processing unit we have for running the models.
* Although XGBoost is comparatively slower than LightGBM on GPU, it is actually faster on CPU.
* LightGBM requires us to build the GPU distribution separately, while to run XGBoost on GPU we need to pass the 'gpu_hist' value to the 'tree_method' parameter when initializing the model (XGBoost 2.0 and later use device='cuda' together with tree_method='hist' instead).
* When working in an institution with access to GPUs and strong CPUs, we should go for XGBoost as it is more scalable than LightGBM.
* But personally, I think LightGBM makes more sense, as the training time saved can be used for better experimentation and feature engineering. We can then train our final model with XGBoost for model robustness.
– Important hyperparameters
* XGBoost parameters
· n_estimators [default 100] – Number of trees in the ensemble. A higher value means more weak learners contribute towards the final output, but increasing it significantly slows down training.
· max_depth [default 6; older versions of the scikit-learn wrapper used 3] – This parameter decides the complexity of the trees. The smaller the value, the lower the ability of the algorithm to pick up patterns (underfitting). A large value can make the model too complex and pick up patterns that do not generalize well (overfitting).
· min_child_weight [default 1] – We know that an extremely deep tree can deliver poor performance due to overfitting. The min_child_weight parameter regularizes by requiring a minimum sum of instance weights in each child node; a split that would leave a child below it is not made, which limits how deep the tree grows. So, the higher the value of this parameter, the lower the chances of the model overfitting on the training data.
· learning_rate/eta [default 0.3] – This parameter shrinks the contribution of each new tree. Lowering the learning rate, although slower to train, improves the ability of the model to look for patterns and learn them. If the value is too low, the model needs many more trees to converge.
· gamma/min_split_loss [default 0] – This is a regularization parameter that can range from 0 to infinity: the minimum loss reduction required to make a further split. The higher the value, the stronger the regularization and the lower the chances of overfitting (but the model can underfit if it's too large). The best value varies across datasets.
· colsample_bytree [default 1.0] – This parameter sets the fraction of the total number of features/predictors to be used for a tree during training. This means that every tree might use a different set of features, which reduces the chances of overfitting and also speeds up training, as not all the features are used in every tree. The value ranges from 0 to 1.
· subsample [default 1.0] – Similar to colsample_bytree, the subsample parameter sets the fraction of the total number of instances to be used for a tree during training. This also reduces the chances of overfitting and improves training time.
* LightGBM parameters
· max_depth – Similar to XGBoost, this parameter instructs the trees not to grow beyond the specified depth. A higher value increases the chances of the model overfitting.
· num_leaves – This parameter is very important for controlling the complexity of the tree. The value should be less than 2^(max_depth), since for the same number of leaves a leaf-wise tree is much deeper than a depth-wise tree. A higher value can induce overfitting.
· min_data_in_leaf – This parameter is used for controlling overfitting. A higher value can stop the tree from growing too deep but can also lead the algorithm to learn less (underfitting). According to LightGBM's official documentation, as a best practice, it should be set to the order of hundreds or thousands.
· feature_fraction – Similar to colsample_bytree in XGBoost
· bagging_fraction – Similar to subsample in XGBoost
– Tradeoff between model performance and training time
* When working with machine learning models, one big aspect of the experimentation phase is the baseline requirement of resources to train a complex model. While some might have access to great hardware, people often have limitations on what they can use.
* Example: Let us quickly create dummy datasets with sample sizes from 1,000 all the way up to 20,000 samples. We'll hold out 20% of each dummy dataset as a test set to measure model performance. For every iteration, with the sample size stepped up by 1,000 samples, we want to check how much time it takes for an XGBoost classifier to train in comparison to a LightGBM classifier. To run the code, refer to this Google Colab notebook. Run results are here.

import neptune.new as neptune

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification

import time
from tqdm.notebook import tqdm

# initialising a logger instance
run = neptune.init(
    project="common/xgboost-integration",
    api_token="ANONYMOUS",
    name="xgb-train",
    tags=["xgb-integration", "train"],
)

# configuration for our custom dataset
min_samples = 1000
max_samples = 20000
step = 1000

# dividing by 0.8 makes the 80% training split start at min_samples
for sample_size in tqdm(range(int(min_samples / 0.8), int(max_samples / 0.8), step)):
    xgb_dummy = XGBClassifier(seed=47)
    lgbm_dummy = LGBMClassifier(random_state=47)

    # logging the sample size
    run["metrics/comparison/sample_size"].log(sample_size)

    # generating the dataset of custom sample size
    dummy = make_classification(n_samples=sample_size)

    # splitting the data into train and test set
    X_train, X_test, y_train, y_test = train_test_split(
        dummy[0], dummy[1], test_size=0.2, stratify=dummy[1]
    )

    start = time.time()
    xgb_dummy.fit(X_train, y_train)
    end = time.time()

    # logging algorithm execution time
    run["metrics/comparison/xgb_runtime"].log(end - start)

    # logging model performance
    run["metrics/comparison/xgb_accuracy"].log(
        accuracy_score(y_test, xgb_dummy.predict(X_test)), step=sample_size
    )

    start = time.time()
    lgbm_dummy.fit(X_train, y_train)
    end = time.time()

    # logging algorithm execution time
    run["metrics/comparison/lgbm_runtime"].log(end - start)

    # logging model performance
    run["metrics/comparison/lgbm_accuracy"].log(
        accuracy_score(y_test, lgbm_dummy.predict(X_test)), step=sample_size
    )

run.stop()

Figure 1:XGBoost vs. LightGBM Runtime

Figure 2:XGBoost vs. LightGBM Accuracy

* From the figures, we can see that the training time for XGBoost kept increasing almost linearly with the sample size. On the other hand, the training time required by LightGBM has been a very small fraction of its contender's. Interesting!
* The accuracy scores for both models go hand-in-hand. The results indicate that not only is LightGBM faster, but there is also not much compromise in model performance. So does this mean we can just ditch XGBoost for LightGBM?
* It all comes down to the availability of hardware resources and bandwidth to figure things out.
* Although LightGBM gives good performance in a fraction of the time needed by XGBoost, what it still needs to improve on is documentation and community strength.
* Also, if the hardware is available, then since XGBoost scales better (as discussed before), we could experiment with LightGBM, get an understanding of the parameters required, and train the final model as an XGBoost model.
– Summary: Gradient Boosted Decision Trees (GBDTs) are one of the most popular choices of machine learning algorithms. XGBoost and LightGBM, which are based on GBDTs, have had great success both in enterprise applications and data science competitions. Here are the key takeaways from our comparison:
* In XGBoost, trees grow depth-wise, while in LightGBM, trees grow leaf-wise, which is the fundamental difference between the two frameworks.
* XGBoost is backed by the volume of its users, which results in enriched literature in the form of documentation and resolutions to issues, while LightGBM is yet to reach such a level of documentation.
* Both algorithms perform similarly in terms of model performance, but LightGBM training happens within a fraction of the time required by XGBoost.
* Fast training in LightGBM makes it the go-to choice for machine learning experiments.
* XGBoost requires a lot of resources to train on large amounts of data, which makes it an option that is accessible mostly to enterprises, while LightGBM is lightweight and can be used on modest hardware.
* LightGBM provides the option of passing feature names that are to be treated as categories and handles this issue with ease by splitting on equality.
* H2O's implementation of XGBoost provides the above feature as well, which is not yet provided by XGBoost's original library.
* Hyperparameter tuning is extremely important in both algorithms.

6. Linear Regression• Linear regression answers the question of how do we find the line of best fit for the data?

y = 𝛽_{0} + 𝛽_{1} x_{1} + 𝛽_{2} x_{2} + \dots + 𝛽_{n} x_{n} + 𝜀

• The challenge is to figure out which line summarizes the data best.• Note: As a reminder, if the confidence interval for a coefficient contains zero, then that coefficient is not statistically significant at the corresponding level → A confidence interval that contains zero does not prove that there is no treatment effect; it means we cannot tell whether there is one. – Having zero in one's confidence interval implies that the treatment could have either a positive or a negative effect on the outcome of interest. 6.1. R-squared•

R^{2}

(or coefficient of determination) is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).

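\begin{array}{c} SS_{res} = \sum_{i} (y_{i} - f_{i})^{2} = \sum_{i} e_{i}^{2} \\ SS_{tot} = \sum_{i} (y_{i} - \bar{y})^{2} \\ R^{2} = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\mathrm{var} (\text{errors})}{\mathrm{var} (y)} \\ \text{adj } R^{2} = 1 - \frac{n - 1}{n - p - 1} (1 - R^{2}) \\ n \to \text{number of observations}, \quad p \to \text{number of predictors (for } p = 1 \text{ the factor is } \frac{n - 1}{n - 2} \text{)} \end{array}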
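• The definitions above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper names `r_squared` and `adjusted_r_squared` and the toy data are made up for illustration. The adjustment uses the general factor (n - 1) / (n - p - 1) for p predictors, which reduces to (n - 1) / (n - 2) when there is a single predictor.

```python
import numpy as np


def r_squared(y, y_pred):
    """Return R^2 = 1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y - y_pred) ** 2)    # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalize R^2 for the number of predictors used."""
    return 1.0 - (n_samples - 1) / (n_samples - n_predictors - 1) * (1.0 - r2)


# Made-up data that is roughly linear in x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line
r2 = r_squared(y, slope * x + intercept)
adj_r2 = adjusted_r_squared(r2, n_samples=len(y), n_predictors=1)
print(f"R^2 = {r2:.4f}, adjusted R^2 = {adj_r2:.4f}")
```

Predicting the mean of y for every point gives an R² of 0, and a perfect prediction gives 1, which matches the reading of R² as the share of variance explained.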

6.2. Test for significance• We test for significance by performing a t-test for the regression coefficients.• In other words, we will test a claim about the population regression line because there is a strong correlation observed.• We will carry out a t-test for the slope by calculating the p-value and comparing it with the desired significance level.• The null hypothesis is:–

H_{0} : 𝛽_{i} = 0

→ the coefficient is equal to zero–

H_{a} : 𝛽_{i} \neq 0

→ the coefficient is NOT equal to zero• The

t

-statistic is calculated as follows•

\begin{array}{c} SE_{coef} = \sqrt{\frac{\frac{1}{n - 2} \sum_{i = 1}^{n} (\hat{y}_{i} - y_{i})^{2}}{\sum_{i = 1}^{n} (x_{i} - \bar{x})^{2}}} \\ t\text{-statistic} = \frac{coef}{SE_{coef}} \\ p\text{-value} = \text{sum of the two tail areas beyond } \pm | t\text{-statistic} | \text{ under the } t\text{-distribution with } n - 2 \text{ degrees of freedom} \\ 95\% \; CI \approx [coef - 1.96 \cdot SE_{coef}, \; coef + 1.96 \cdot SE_{coef}] \end{array}
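• As a rough illustration of the slope test above, here is a minimal sketch for simple (one-feature) linear regression, assuming NumPy and SciPy are available. The function name `slope_t_test` and the data are hypothetical. One deliberate difference from the formula above: the confidence interval uses the exact t quantile for n - 2 degrees of freedom instead of the large-sample value 1.96.

```python
import numpy as np
from scipy import stats


def slope_t_test(x, y):
    """t-test of H0: slope = 0 for a one-feature least-squares fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size

    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)

    # Standard error of the slope: residual variance over the spread of x.
    se = np.sqrt((np.sum(residuals ** 2) / (n - 2)) / np.sum((x - x.mean()) ** 2))
    t_stat = slope / se

    # Two-sided p-value: both tail areas beyond |t| with n - 2 degrees of freedom.
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

    # 95% confidence interval based on the t quantile (close to 1.96 for large n).
    t_crit = stats.t.ppf(0.975, df=n - 2)
    ci = (slope - t_crit * se, slope + t_crit * se)
    return slope, se, t_stat, p_value, ci


# Made-up noisy linear data.
x = np.arange(1.0, 11.0)
y = np.array([2.5, 3.1, 6.8, 7.2, 11.9, 10.4, 15.3, 14.8, 19.6, 18.9])

slope, se, t_stat, p_value, ci = slope_t_test(x, y)
print(f"slope={slope:.3f}, SE={se:.3f}, t={t_stat:.2f}, p={p_value:.2e}, CI={ci}")
```

The same slope, standard error and p-value can be cross-checked against `scipy.stats.linregress(x, y)`.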

6.2.1. When to use a t-test?
• A t-test is one of the most popular statistical tests for location, i.e., it deals with the population(s) mean value(s).
• There are different types of t-tests that you can perform:
– One-sample t-test
– Two-sample t-test
– Paired t-test
• Note: Remember that a t-test can only be used for one or two groups. If you need to compare three (or more) means, use the analysis of variance (ANOVA) method.
• The t-test is a parametric test, meaning that your data has to fulfill some assumptions:
– The data points are independent; AND
– The data, at least approximately, follow a normal distribution.
• Note: If your sample doesn't fit these assumptions, you can resort to non-parametric alternatives, e.g., the Mann–Whitney U test (a.k.a. the Wilcoxon rank-sum test), the Wilcoxon signed-rank test or the sign test.
6.2.2. Which t-test?
• Your choice of t-test depends on whether you are studying one group or two groups:
– One sample t-test
* Choose the one-sample t-test to check if the mean of a population is equal to some pre-set hypothesized value.
* Example: The average volume of a drink sold in 0.33 l cans - is it really equal to 330 ml

?* Example: The average weight of people from a specific city - is it different from the national average?– Two sample t-test* Choose the two-sample t-test to check if the difference between the means of two populations is equal to some pre-determined value, when the two samples have been chosen independently of each other.* In particular, you can use this test to check whether the two groups are different from one another.* Example: The average difference in weight gain in two groups of people: one group was on a high-carb diet and the other on a high-fat diet.* Example: The average difference in the results of a math test from students at two different universities.* Note: This test is sometimes referred to as an independent samples t-test, or an unpaired samples t-test.– Paired t-test* A paired t-test is used to investigate the change in the mean of a population before and after some experimental intervention, based on a paired sample, i.e., when each subject has been measured twice: before and after treatment.* In particular, you can use this test to check whether, on average, the treatment has had any effect on the population.* Example: The change in student test performance before and after taking a course.* Example: The change in blood pressure in patients before and after administering some drug. 6.2.3. How to do a t-test?• Decide on the alternative hypothesis– Use a two-tailed t-test if you only care whether the population's mean (or, in the case of two populations, the difference between the populations' means) agrees or disagrees with the pre-set value.– Use a one-tailed t-test if you want to test whether this mean (or difference in means) is greater/less than the pre-set value.• Compute your t-score value– Formulas for the test statistic in t-tests include the sample size, as well as its mean and standard deviation. 
The exact formula depends on the t-test type - check the sections dedicated to each particular test for more details.• Determine the degrees of freedom for the t-test– The degrees of freedom are the number of observations in a sample that are free to vary as we estimate statistical parameters. In the simplest case, the number of degrees of freedom equals your sample size minus the number of parameters you need to estimate. Again, the exact formula depends on the t-test you want to perform - check the sections below for details. • The degrees of freedom are essential, as they determine the distribution followed by your t-score (under the null hypothesis)• If there are

d

degrees of freedom, then the distribution of the test statistics is the t-Student distribution with d degrees of freedom.• This distribution has a shape similar to

N (0, 1)

(bell-shaped and symmetric) but has heavier tails.• Note: If the number of degrees of freedom is large (

> 30

), which generically happens for large samples, the t-Student distribution is practically indistinguishable from

N (0, 1)

Figure 3:Density of t-distribution with

𝜈

degrees of freedom

• Fun Fact: The t-Student distribution owes its name to William Sealy Gosset, who, in 1908, published his paper on the t-test under the pseudonym "Student". Gosset worked at the famous Guinness Brewery in Dublin, Ireland, and devised the t-test as an economical way to monitor the quality of beer. 6.2.4. p-value from t-test• Recall that the p-value is the probability (calculated under the assumption that the null hypothesis is true) that the test statistic will produce values at least as extreme as the t-score produced for your sample. • As probabilities correspond to areas under the density function, p-value from t-test can be nicely illustrated with the help of the following pictures:

• The following formulae say how to calculate p-value from t-test. •

\mathrm{CDF}_{t, d}

→ Cumulative Distribution Function (CDF) of the t-Student distribution with

d

degrees of freedom:– p-value from left-tailed t-test →

\mathrm{CDF}_{t, d} (t_{score})

– p-value from right-tailed t-test →

1 - \mathrm{CDF}_{t, d} (t_{score})

– p-value from two-tailed t-test →

2 \times \mathrm{CDF}_{t, d} (- | t_{score} |)

or, equivalently,

2 - 2 \times \mathrm{CDF}_{t, d} (| t_{score} |)

• Note: However, the CDF of the t-distribution is given by a somewhat complicated formula.– To find the p-value by hand, you would need to resort to statistical tables, where approximate CDF values are collected, or to specialized statistical software. 6.2.5. t-test critical values• Recall, that in the critical values approach to hypothesis testing, you need to set a significance level,

𝛼

, before computing the critical values, which in turn give rise to critical regions (a.k.a. rejection regions).• Formulas for critical values employ the quantile function of t-distribution, i.e., the inverse of the CDF:– Critical value for left-tailed t-test →

\mathrm{CDF}_{t, d}^{-1} (𝛼)

* Critical region →

(- \infty, \mathrm{CDF}_{t, d}^{-1} (𝛼))

– Critical value for right-tailed t-test →

\mathrm{CDF}_{t, d}^{-1} (1 - 𝛼)

* Critical region →

(\mathrm{CDF}_{t, d}^{-1} (1 - 𝛼), \infty)

– Critical value for two-tailed t-test →

\pm \mathrm{CDF}_{t, d}^{-1} (1 - 𝛼 / 2)

* Critical region →

(- \infty, - \mathrm{CDF}_{t, d}^{-1} (1 - 𝛼 / 2)] \cup [\mathrm{CDF}_{t, d}^{-1} (1 - 𝛼 / 2), \infty)
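• In practice, the CDF and its inverse are rarely evaluated by hand. The sketch below shows one way to get p-values and critical values from a t-score, assuming SciPy is available; `scipy.stats.t.cdf`, `sf` (which is 1 - CDF) and `ppf` (the inverse CDF) are real SciPy functions, while the wrapper names `t_p_value` and `t_critical_value` are made up for illustration.

```python
from scipy import stats


def t_p_value(t_score, df, tail="two"):
    """p-value of a t-score for a left-, right- or two-tailed t-test."""
    if tail == "left":
        return stats.t.cdf(t_score, df)
    if tail == "right":
        return stats.t.sf(t_score, df)          # sf = 1 - CDF
    if tail == "two":
        return 2 * stats.t.cdf(-abs(t_score), df)
    raise ValueError("tail must be 'left', 'right' or 'two'")


def t_critical_value(alpha, df, tail="two"):
    """Critical value at significance level alpha (inverse CDF / quantile)."""
    if tail == "left":
        return stats.t.ppf(alpha, df)
    if tail == "right":
        return stats.t.ppf(1 - alpha, df)
    if tail == "two":
        return stats.t.ppf(1 - alpha / 2, df)   # reject when |t| exceeds this
    raise ValueError("tail must be 'left', 'right' or 'two'")


# Example: t-score of 2.5 with 10 degrees of freedom, two-tailed, alpha = 0.05.
p = t_p_value(2.5, df=10, tail="two")
crit = t_critical_value(0.05, df=10, tail="two")
print(f"p-value = {p:.4f}, critical value = +/-{crit:.3f}, reject H0: {abs(2.5) >= crit}")
```

Both approaches lead to the same decision: the p-value is below alpha exactly when the t-score falls inside the critical region.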

• • Note: To decide the fate of the null hypothesis, just check if your t-score lies within the critical region:– If your t-score belongs to the critical region, reject the null hypothesis and accept the alternative hypothesis.– If your t-score is outside the critical region, then you don't have enough evidence to reject the null hypothesis. 6.2.6. One-sample t-test• The null hypothesis is that the population mean is equal to some value

𝜇_{0}

.• The alternative hypothesis is that the population mean is:– different from

𝜇_{0}

;– smaller than

𝜇_{0}

; or– greater than

𝜇_{0}

\begin{array}{c} t = \frac{\bar{x} - 𝜇_{0}}{s} \cdot \sqrt{n} \\ 𝜇_{0} \to \text{mean postulated in } H_{0} \\ n \to \text{sample size} \\ \bar{x} \to \text{sample mean} \\ s \to \text{sample standard deviation} \end{array}
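• The one-sample formula can be sketched in a few lines, assuming NumPy and SciPy are available. The function name `one_sample_t` and the can volumes are made up for illustration; the two-tailed p-value uses n - 1 degrees of freedom.

```python
import numpy as np
from scipy import stats


def one_sample_t(sample, mu0):
    """Two-tailed one-sample t-test of H0: population mean == mu0."""
    x = np.asarray(sample, dtype=float)
    n = x.size
    s = x.std(ddof=1)                       # sample standard deviation
    t_score = (x.mean() - mu0) / s * np.sqrt(n)
    df = n - 1
    p_value = 2 * stats.t.sf(abs(t_score), df)
    return t_score, df, p_value


# Hypothetical measured volumes (in ml) of cans labelled 330 ml.
volumes_ml = np.array([329.1, 330.4, 328.7, 331.0, 329.5, 328.9, 330.2, 329.3])

t_score, df, p_value = one_sample_t(volumes_ml, mu0=330.0)
print(f"t = {t_score:.3f}, df = {df}, p = {p_value:.3f}")
```

The result can be cross-checked with `scipy.stats.ttest_1samp(volumes_ml, 330.0)`, which implements the same test.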

• Note: Number of degrees of freedom in one-sample t-test →

n - 1

. 6.2.7. Two-sample t-test• The null hypothesis is that the actual difference between these groups' means,

𝜇_{1}

and

𝜇_{2}

, is equal to some pre-set value,

𝛥

.• The alternative hypothesis is that the difference

𝜇_{1} - 𝜇_{2}

is:– different from

𝛥

;– smaller than

𝛥

; or– greater than

𝛥

. • In particular, if this pre-determined difference is zero (

𝛥 = 0

) → The null hypothesis is that the population means are equal.• The alternate hypothesis is that the population means are:–

𝜇_{1}

and

𝜇_{2}

are different from one another;–

𝜇_{1}

is smaller than

𝜇_{2}

; or–

𝜇_{1}

is greater than

𝜇_{2}

. • Note: Formally, to perform a t-test, we should additionally assume that the variances of the two populations are equal (this assumption is called the homogeneity of variance).• There is a version of a t-test which can be applied without the assumption of homogeneity of variance: it is called a Welch's t-test. For your convenience, we describe both versions. 6.2.8. Two-sample t-test if variances are equal• Use this test if you know that the two populations' variances are the same (or very similar).

\begin{array}{c} t = \frac{{\bar{x}}_{1} - {\bar{x}}_{2} - 𝛥}{s_{p} \cdot \sqrt{\frac{1}{n_{1}} + \frac{1}{n_{2}}}} \\ s_{p} \to \text{pooled standard deviation} \\ s_{p} = \sqrt{\frac{(n_{1} - 1) s_{1}^{2} + (n_{2} - 1) s_{2}^{2}}{n_{1} + n_{2} - 2}} \end{array}

• Note: Number of degrees of freedom in t-test (two samples, equal variances) =

n_{1} + n_{2} - 2

. 6.2.9. Two-sample t-test if variances are unequal (Welch's t-test)• Two-sample Welch's t-test formula if variances are unequal:

t = \frac{{\bar{x}}_{1} - {\bar{x}}_{2} - 𝛥}{\sqrt{\frac{s_{1}^{2}}{n_{1}} + \frac{s_{2}^{2}}{n_{2}}}}

• Note: The number of degrees of freedom in a Welch's t-test (two-sample t-test with unequal variances) has no simple exact expression. We can approximate it with the help of the following Satterthwaite formula:

\frac{(\frac{s_{1}^{2}}{n_{1}} + \frac{s_{2}^{2}}{n_{2}})^{2}}{\frac{(s_{1}^{2} ⁄ n_{1})^{2}}{n_{1} - 1} + \frac{(s_{2}^{2} ⁄ n_{2})^{2}}{n_{2} - 1}}
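• The Welch statistic and the Satterthwaite approximation above can be sketched as follows, assuming NumPy and SciPy are available. The function name `welch_t` and the two samples are hypothetical.

```python
import numpy as np
from scipy import stats


def welch_t(sample1, sample2, delta=0.0):
    """Two-tailed Welch's t-test of H0: mu1 - mu2 == delta."""
    x1 = np.asarray(sample1, dtype=float)
    x2 = np.asarray(sample2, dtype=float)
    n1, n2 = x1.size, x2.size

    # Estimated variances of the two sample means: s_i^2 / n_i.
    v1 = x1.var(ddof=1) / n1
    v2 = x2.var(ddof=1) / n2

    t_score = (x1.mean() - x2.mean() - delta) / np.sqrt(v1 + v2)

    # Satterthwaite approximation of the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

    p_value = 2 * stats.t.sf(abs(t_score), df)
    return t_score, df, p_value


# Made-up samples with visibly different spreads.
group_a = np.array([20.1, 22.4, 19.8, 23.5, 21.0, 24.2, 20.7])
group_b = np.array([18.2, 25.9, 15.4, 27.3, 21.8, 16.0, 29.1, 19.5, 23.3])

t_score, df, p_value = welch_t(group_a, group_b)
print(f"t = {t_score:.3f}, df = {df:.2f}, p = {p_value:.3f}")
```

The approximate degrees of freedom are generally not an integer, and they always lie between the smaller of n1 - 1 and n2 - 1 and the pooled value n1 + n2 - 2. `scipy.stats.ttest_ind(group_a, group_b, equal_var=False)` runs the same test.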

• Alternatively, you can take the smaller of

n_{1} - 1

and

n_{2} - 1

as a conservative estimate for the number of degrees of freedom. • Fun Fact: The Satterthwaite formula for the degrees of freedom can be rewritten as a scaled weighted harmonic mean of the degrees of freedom of the respective samples:

n_{1} - 1

and

n_{2} - 1

, and the weights are proportional to the squares of the estimated variances of the sample means, i.e. (s_1^2 / n_1)^2 and (s_2^2 / n_2)^2. 6.2.10. Paired t-test• As we commonly perform a paired t-test when we have data about the same subjects measured twice (before and after some treatment), let us adopt the convention of referring to the samples as the pre-group and post-group.• The null hypothesis is that the true difference between the means of pre and post populations is equal to some pre-set value,

𝛥

.• The alternative hypothesis is that the actual difference between these means is:– different from

𝛥

;– smaller than

𝛥

; or– greater than

𝛥

. • Typically, this pre-determined difference is zero. We can then reformulate the hypotheses as follows:– The null hypothesis is that the pre and post means are the same, i.e., the treatment has no impact on the population.– The alternative hypothesis:* The pre and post means are different from one another (treatment has some effect);* The pre mean is smaller than post mean (treatment increases the result); or* The pre mean is greater than post mean (treatment decreases the result). • In fact, a paired t-test is technically the same as a one-sample t-test! Let us see why it is so. Let

x_{1}, \dots, x_{n}

be the pre observations and

y_{1}, \dots, y_{n}

the respective post observations. That is,

x_{i}

and

y_{i}

are the before and after measurements of the

i

-th subject.• For each subject, compute the difference,

d_{i} = x_{i} - y_{i}

. All that happens next is just a one-sample t-test performed on the sample of differences

d_{1}, \dots, d_{n}

. Take a look at the formula for the t-score:

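\begin{array}{c} t = \frac{\bar{d} - 𝛥}{s_{d}} \cdot \sqrt{n} \\ \bar{d} \to \text{mean of the differences } d_{i} \\ s_{d} \to \text{standard deviation of the differences } d_{i} \\ n \to \text{number of pairs} \end{array}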
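• To make the "paired test is a one-sample test on the differences" point concrete, here is a minimal sketch assuming NumPy and SciPy are available. The function name `paired_t` and the before/after scores are made up for illustration.

```python
import numpy as np
from scipy import stats


def paired_t(pre, post, delta=0.0):
    """Two-tailed paired t-test, computed as a one-sample test on d_i = x_i - y_i."""
    d = np.asarray(pre, dtype=float) - np.asarray(post, dtype=float)
    n = d.size
    t_score = (d.mean() - delta) / d.std(ddof=1) * np.sqrt(n)
    df = n - 1
    p_value = 2 * stats.t.sf(abs(t_score), df)
    return t_score, df, p_value


# Hypothetical test scores of the same 8 students before and after a course.
pre = np.array([72, 65, 80, 58, 90, 77, 69, 84])
post = np.array([78, 70, 79, 66, 93, 85, 70, 88])

t_score, df, p_value = paired_t(pre, post)
print(f"t = {t_score:.3f}, df = {df}, p = {p_value:.4f}")
```

A negative t-score here means the pre mean is smaller than the post mean. `scipy.stats.ttest_rel(pre, post)` and `scipy.stats.ttest_1samp(pre - post, 0.0)` give the same statistic, which is exactly the equivalence described above.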

• Note: Number of degrees of freedom in t-test (paired):

n - 1

6.2.11. t-test vs. z-test• We use a z-test when we want to test the population mean of a normally distributed dataset, which has a known population variance. If the number of degrees of freedom is large, then the t-Student distribution is very close to

N (0, 1)

.• Hence, if there are many data points (at least 30), you may swap a t-test for a z-test, and the results will be almost identical. However, for small samples with unknown variance, remember to use the t-test because, in such case, the t-Student distribution differs significantly from the

N (0, 1)

! 6.3. Multicollinearity• Multicollinearity happens when some of the independent variables are correlated with each other. In other words, some of the independent variables are not that independent.• Collinearity won't affect the performance of the model → The

R^{2}

remains unchanged.– Also, the model can still make effective predictions.– However, the way we interpret the coefficients will have to change. 6.3.1. How to detect multicollinearity?• We can look at each feature's VIF (Variance Inflation Factor).– The VIF of a feature is obtained by regressing that feature on all the other features and computing 1 / (1 - R^2) from that regression, so it grows as the feature becomes more predictable from the others.* VIF = 1 → no collinearity* 1 < VIF < 5 → moderate collinearity* VIF

\geq

5 → severe collinearity → need a mitigation strategy, like centering the features. 6.4. Feature interaction• An interaction term simply means multiplying two features together.• After introducing the interaction term, if the

R^{2}

goes up and the p-value of the interaction term is significant → then you can be reasonably confident that the two features are in fact interacting.• Can we multiply a feature by itself?– Yes! → But why would we want to do that?– Because now we can fit polynomial relationships.• Note: When adding interaction terms, take care not to overfit the data. 6.5. Simpson's paradox• Simpson's paradox is a phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined.• A good way to avoid it is to add to your model the dimensions that segment the data you're trying to predict.

Figure 4:Simpson's paradox
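• A small numeric illustration of the paradox, using plain Python and illustrative success counts (the numbers, the group names and the treatment labels are assumptions chosen to show the effect, not data from this document). Treatment A has the higher success rate inside each group, yet treatment B looks better once the groups are pooled, because A was applied mostly to the harder group.

```python
# (successes, trials) per treatment and per group; illustrative counts only.
counts = {
    "A": {"easy cases": (81, 87), "hard cases": (192, 263)},
    "B": {"easy cases": (234, 270), "hard cases": (55, 80)},
}


def rate(successes, trials):
    return successes / trials


def pooled_rate(groups):
    """Success rate after combining all groups of one treatment."""
    total_successes = sum(s for s, _ in groups.values())
    total_trials = sum(t for _, t in groups.values())
    return total_successes / total_trials


for group in ("easy cases", "hard cases"):
    a = rate(*counts["A"][group])
    b = rate(*counts["B"][group])
    print(f"{group}: A = {a:.1%}, B = {b:.1%}")

pooled_a = pooled_rate(counts["A"])
pooled_b = pooled_rate(counts["B"])
print(f"combined: A = {pooled_a:.1%}, B = {pooled_b:.1%}")
```

Including the group as a feature in the model is what lets it recover the within-group trend instead of the misleading pooled one.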

7. Logistic Regression• Logistic regression is based on the same idea as linear regression in the way that we still use a line to designate our model. The only difference is that we now want

y

to be a probability.• The probability equation, which is a sigmoid function, is:

P (y | X) = \frac{1}{1 + e^{- (𝛽_{0} + 𝛽 X)}}

• Unlike linear regression, there's no closed form solution for logistic regression.• The loss function for logistic regression is the log loss (cross-entropy loss):•

\mathrm{Loss} (y, \hat{y}) = - \sum_{i = 1}^{n} [y_{i} \log {\hat{y}}_{i} + (1 - y_{i}) \log (1 - {\hat{y}}_{i})]

• Note: Log Loss is a slight twist on the likelihood function. In fact →

\text{log loss} = - 1 \times \log (\text{likelihood function})

.• Note: The likelihood function of logistic regression is:

L (𝛽_{0}, 𝛽) = \prod_{i = 1}^{n} p (x_{i})^{y_{i}} (1 - p (x_{i}))^{1 - y_{i}}

• To minimize the loss function, we take derivatives w.r.t. coefficients:•

\begin{array}{c} \frac{d \mathrm{Loss}}{d 𝛽} = \sum_{i = 1}^{n} [{\hat{y}}_{i} - y_{i}] x_{i} \\ \frac{d \mathrm{Loss}}{d 𝛽_{0}} = \sum_{i = 1}^{n} [{\hat{y}}_{i} - y_{i}] \cdot 1 \\ 𝛻_{𝛽} = \left[ \begin{array}{c} \frac{d \mathrm{Loss}}{d 𝛽} \\ \frac{d \mathrm{Loss}}{d 𝛽_{0}} \end{array} \right] \\ 𝛽_{i}^{t + 1} = 𝛽_{i}^{t} - r 𝛻_{𝛽_{i}} \to \text{until the coefficient gradients converge to } 0 \\ r \to \text{learning rate} \to \text{usually in } [10^{- 6}, 0.1] \end{array}
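• The update rule above can be sketched with plain NumPy. This is a minimal illustration, not a production solver: the function names, the synthetic data and the "true" coefficients are made up, and the summed gradients are divided by the number of samples (an assumption that only rescales the learning rate r) so that one step size works for any dataset size.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(y, y_hat, eps=1e-12):
    """Mean cross-entropy; predictions are clipped to avoid log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Batch gradient descent on the log loss."""
    n, d = X.shape
    beta0, beta = 0.0, np.zeros(d)
    for _ in range(n_iter):
        error = sigmoid(beta0 + X @ beta) - y   # (y_hat_i - y_i)
        grad_beta = X.T @ error / n             # d Loss / d beta (averaged)
        grad_beta0 = error.sum() / n            # d Loss / d beta_0 (averaged)
        beta -= lr * grad_beta
        beta0 -= lr * grad_beta0
    return beta0, beta


# Synthetic data drawn from a known logistic model.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_beta0, true_beta = 0.5, np.array([2.0, -1.0])
y = (rng.random(500) < sigmoid(true_beta0 + X @ true_beta)).astype(float)

beta0, beta = fit_logistic(X, y)
y_hat = sigmoid(beta0 + X @ beta)
accuracy = np.mean((y_hat >= 0.5) == (y == 1))
print(f"beta0 = {beta0:.2f}, beta = {beta}, log loss = {log_loss(y, y_hat):.3f}, accuracy = {accuracy:.2f}")
```

Because the log loss is convex, the loss after training should be below the loss of the all-zero starting point, which predicts 0.5 for every example.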

7.1. Coefficient interpretation• In order to interpret the impact of coefficient

𝛽

on the probability, we have to exponentiate it,

e^{𝛽}

, to get something called the odds ratio.–

e^{𝛽} - 1

gives the % change in the odds. 7.2. Multinomial regression• We use multinomial regression when we want to predict more than two classes.• Instead of sigmoid, we're going to use softmax function.

P (y = k | x_{i}) = \frac{e^{𝛽_{0, k} + 𝛽_{k} x_{i}}}{\sum_{j = 1}^{K} e^{𝛽_{0, j} + 𝛽_{j} x_{i}}}
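• A minimal NumPy sketch of the softmax probabilities, where every class k has its own intercept and coefficient vector and all classes are scored on the same example. The function names and the parameter values are hypothetical; subtracting the row maximum before exponentiating is a standard numerical-stability step that does not change the result.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax of an (n_samples, n_classes) score matrix."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # avoids overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)


def predict_proba(X, intercepts, coefs):
    """X: (n, d), intercepts: (K,), coefs: (d, K) -> probabilities of shape (n, K)."""
    return softmax(X @ coefs + intercepts)


# Two examples with two features each, and made-up parameters for K = 3 classes.
X = np.array([[1.0, 2.0],
              [-0.5, 0.3]])
intercepts = np.array([0.1, -0.2, 0.0])
coefs = np.array([[0.5, -1.0, 0.2],
                  [1.5, 0.3, -0.7]])

proba = predict_proba(X, intercepts, coefs)
predicted_class = proba.argmax(axis=1)   # class with the maximum probability
print(proba, predicted_class)
```

With K = 2 and the parameters of the first class fixed at zero, the probability of the second class reduces to the sigmoid used in binary logistic regression.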

• A softmax function is generalized sigmoid such that it produces the probability among

K

classes.– The predicted value will be the class with the maximum predicted probability. 7.3. Regularization• Regularization is a technique used to avoid overfitting; it adds a penalty term to the loss function that grows with the size of the coefficients. There are two main types of regularization:–

L 1

, Lasso, or Laplace →

\sum_{j}^{} | 𝛽_{j} |

* Typically results in more zero-valued coefficients, which means fewer features will be used.–

L 2

, ridge, Gaussian →

\sum_{j}^{} 𝛽_{j}^{2}

* Usually results in small weights for many of the features (that would've been zeroed out by

L 1

).– Note: Both

L 1

and

L 2

usually have a coefficient,

𝜆

, multiplied to them, which allows us to control the degree of regularization.* Too high a

𝜆

can result in under-fitting and too low can result in overfitting.*

𝜆

is best tuned in cross validation.– Note: When using regularization, it's better to scale our data. Scaling data can also help the model to converge faster. 7.3.1. Why does Lasso regularization induce model sparsity?• First off, note that– L1 norm:

| | w | |_{1} = | w_{1} | + | w_{2} | + | w_{3} | + \dots + | w_{n} |

– L2 norm:

| | w | |_{2} = \sqrt{w_{1}^{2} + w_{2}^{2} + w_{3}^{2} + \dots + w_{n}^{2}}

• • When optimizing the cost function, we use gradient descent and update our weights by →

w^{t + 1} = w^{t} - r \nabla_{w}

– Convergence occurs when the value of

w^{t}

doesn't change much with further iterations → i.e.

\frac{𝜕 L o s s}{𝜕 w} \approx 0

→ i.e.

w^{t + 1} \approx w^{t}

. • L1 norm: The derivative is →

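\frac{𝜕 | w |}{𝜕 w} = \operatorname{sign} (w)

, which is +1 for positive w and -1 for negative w, therefore →

w^{t + 1} = w^{t} - r \cdot \operatorname{sign} (w^{t})

.– We can see that the derivative of the penalty has a constant magnitude, so the condition of convergence occurs faster because we only have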

r

in the subtraction terms and it's not being multiplied by any smaller value of

w

.– Therefore,

w^{t}

tends towards zero in a few iterations. • • L2 norm: The derivative →

\frac{𝜕 w^{2}}{𝜕 w} = 2 w

, therefore →

w^{t + 1} = w^{t} - 2 \cdot r \cdot w^{t}

.– We can see that our loss derivative term is not constant and thus for smaller values of

w

, our condition of convergence will not occur faster (or maybe at all) because we have a smaller value of

w

getting multiplied with

r

and thus making the whole term to be subtracted even smaller. – Therefore, after a few iterations, our

w^{t}

becomes a very small constant value but not zero. – Hence, not contributing to the sparsity of the weight vector. 7.4. Early stopping• Another technique to avoid overfitting is early stopping.• Simply, it means to stop training somewhere before reaching the absolute minimum [of loss function] to avoid overfitting the training examples. 7.5. Other considerations• We can't use the same

R^{2}

from the linear regression. For logistic regression we use something called McFadden's pseudo

R^{2}

, which also lies between

0

and

1

.– Its value is usually smaller than

R^{2}

. A value between

0.2

and

0.4

usually indicates an excellent fitting model.• Logistic regression → discriminative model– Naive Bayes → generative model

8. Support Vector Machine (SVM)• SVM removes the concept of a decision threshold by instead selecting a decision boundary which maximizes the distance between itself and the two most difficult examples to classify.• These two most difficult examples to classify are called the support vectors.– In comparison, logistic regression finds a decision boundary which minimizes the log loss (the negative log-likelihood) over all the training examples.– SVM, on the other hand, only focuses on the support vectors.• For SVM, the decision boundary (also called the hyperplane) is defined by•

w^{T} x - b = 0

• • The values of

w

and

b

will be optimized when the distance between the decision boundary and the support vectors is maximized.

Figure 5:SVM, decision boundary is the red line.

• You can think of SVM as actually making three lines: one line for the hyperplane, defined by equation

(6)

, two more lines (support vectors) with these equations:

\begin{array}{c} w^{T} x - b = 1 \\ w^{T} x - b = - 1 \end{array}

• Note: All the positive/negative examples lie in the left/right of these lines.• The distance between these two support vectors is called the margin →

\frac{2}{| | w | |}

, where

| | w | | = \sqrt{w_{1}^{2} + w_{2}^{2} + \dots + w_{n}^{2}}

. • Note: The distance between the decision boundary and the support vector →

\frac{1}{| | w | |}

. 8.1. Hard-margin SVM• • Maximizing the distance between the decision boundary and support vectors means minimizing the denominator of the margin, i.e. minimizing

| | w | |

→ this ensures the maximum margin between the two support vectors. • What we can't do is maximize our margin so much that we actually end up going past the support vectors themselves. Mathematically, this means:

\begin{array}{c} \min | | w | | \\ \text{s.t.} \\ w^{T} x - b ⩾ 1 \text{ when } y_{i} = 1 \\ w^{T} x - b ⩽ - 1 \text{ when } y_{i} = - 1 \end{array}

• Equation

(8)

ensures that SVM produces something greater (or equal) to 1, given that it is in fact a positive example (i.e.

y_{i} = 1

). Similarly for equation

(9)

.• We can summarize equations

(8)

and

(9)

into one inequality equation:

y_{i} (w^{T} x - b) ⩾ 1

• This is a form of constrained optimization problem.• Since

| | w | |

is a linear term, we can use linear programming → however, linear programming won't guarantee us a unique solution and it can be unstable in some cases.• Typically, we square the

| | w | |

to use quadratic programming to minimize

| | w | |^{2}

. Quadratic programming will guarantee us a unique solution.• If we use quadratic programming, we'd have something called a hard-margin SVM.

Figure 6:Data becomes linearly separable as quadratic transformation

8.2. Soft-margin SVM • Hard-margin SVM is sensitive to outliers. Check out the figure below. • Obviously, the solution above (the orange) isn't the best separator.– What we'd really like is to be able to give some slack to the hard-margin SVM such that it could assign support vectors like the purple one and ignore the outliers.• To ignore the outliers, we're using something called slack. – Slack allows us to relax our constraints such that every single data point like the outlier above doesn't have to be to the left of the left support vector.– This gives us the soft-margin SVM.– We do this by adding an error term to the constraint. – Adding the error term means that no longer does every single point have to exist on the correct side of the margin → Instead, we can allow some points to go within the margin.

\begin{array}{c} \min | | w | |^{2} + C \sum_{i = 1}^{N} e_{i} \\ \text{s.t.} \\ y_{i} (w^{T} x - b) ⩾ 1 - e_{i}, \quad e_{i} ⩾ 0 \end{array}

• The error,

e_{i}

, is the distance between the new support vector (in purple) and where the old support vector (left orange support vector) would have been in hard-margin SVM.–

e_{i} = 0

→ correct side of margin–

0 < e_{i} < 1

→ still correct but with a penalty–

e_{i} ⩾ 1

→ incorrect (example is on the other side of margin/hyperplane), higher penalty• Note: Here, we are minimizing the errors as well as the sum of squared weights.• Note: The parameter

C

is a regularization parameter. It indicates how much we want to penalize a particular example lying within the margin. 8.2.1. How to solve soft-margin SVM• We have the error term,

e_{i}

, in both equations

(11)

and

(12)

. • We can solve for

e_{i}

in equation

(12)

and plug it in for

e_{i}

in equation

(11)

such that we no longer have any constraints.• We can do this as follows,

\begin{array}{c} e_{i} ⩾ 1 - y_{i} (w^{T} x - b) \\ \text{since } e_{i} ⩾ 0 \\ e_{i} = \max (0, 1 - y_{i} (w^{T} x - b)) \\ \text{Substitute it in the optimization function} \\ \min | | w | |^{2} + C \sum_{i = 1}^{N} \max (0, 1 - y_{i} (w^{T} x - b)) \end{array}
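• The unconstrained objective at the end of this derivation can be minimized directly. The sketch below uses plain full-batch sub-gradient descent with a fixed step size, assuming NumPy is available; it is a simplified illustration and not the Pegasos algorithm itself. The function names, the step size and the toy data are made up. Examples with y_i (w^T x_i - b) >= 1 contribute nothing to the sub-gradient of the max(0, ...) term.

```python
import numpy as np


def svm_objective(w, b, X, y, C):
    """||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i - b))."""
    margins = y * (X @ w - b)
    return w @ w + C * np.maximum(0.0, 1.0 - margins).sum()


def fit_soft_margin_svm(X, y, C=1.0, lr=0.001, n_iter=2000):
    """Full-batch sub-gradient descent on the soft-margin objective."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        margins = y * (X @ w - b)
        active = margins < 1            # points inside the margin or misclassified
        # Sub-gradient: 2w from ||w||^2, minus C * y_i * x_i for each active point.
        grad_w = 2 * w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = C * y[active].sum()    # d/db of -y_i (w^T x_i - b) is +y_i
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Two made-up, well separated clusters with labels +1 and -1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, size=(20, 2)),
               rng.normal(-2.0, 0.5, size=(20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)

w, b = fit_soft_margin_svm(X, y)
predictions = np.sign(X @ w - b)
print("objective:", svm_objective(w, b, X, y, C=1.0), "accuracy:", np.mean(predictions == y))
```

A larger C makes margin violations more expensive, so the solution moves towards the hard-margin behaviour; a smaller C tolerates more points inside the margin.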

• This part →

\max (0, 1 - y_{i} (w^{T} x - b))

is called hinge loss.• Since hinge loss is not differentiable at the hinge point (where its second argument equals zero), we can't compute an ordinary gradient there.• To handle the non-differentiable point, we use a technique called sub-gradient descent; the Pegasos algorithm, in particular, applies it to soft-margin SVM → which can solve/optimize equation

(14)

with gradient descent.• The benefit is that we get a guaranteed global minimum, because the objective is convex. 8.3. SVM for non-linear data• To handle non-linear data, we can project it into additional dimensions.– Primarily, we can multiply elements by themselves or have feature interaction terms.• Projecting data into higher dimensions makes it possible to find a hyperplane that can separate the data. • Adding interaction terms gets difficult when we have many features → i.e. it would result in very high dimensions.– For example, starting with, say, 100 features, we can end up with thousands of features by adding all the interaction terms.– Also, based on equation

(14)

, we eventually have to take the dot product of all these thousands of features with their weights, which results in a single number.– Instead of doing that, we can use a more clever technique called the kernel trick. 8.4. Kernel trick• The kernel trick allows us to avoid transforming all of our features into these larger dimensions while still obtaining that dot product, without performing the feature transformation.• The only caveat is that we can't use the kernel trick with the SVM in its primal form, i.e. equation

(14)

.• We need to implement the kernel trick on the dual form of SVM.• We can use the representer theorem to represent SVM weights by this equation:

\begin{array}{c} w = \sum_{i = 1}^{N} 𝛼_{i} y_{i} x_{i} \\ 𝛼_{i} \begin{cases} > 0 & \text{when } i \text{ is a support vector} \\ = 0 & \text{otherwise} \end{cases} \end{array}

• Now, we take the Lagrangian dual of the SVM which gives us the new optimization form to represent SVM:•

\max_{𝛼} \sum_{i = 1}^{N} 𝛼_{i} - \frac{1}{2} \sum_{j = 1}^{N} \sum_{k = 1}^{N} 𝛼_{j} 𝛼_{k} y_{j} y_{k} x_{j}^{T} x_{k}

• The

x_{j}^{T} x_{k}

term benefits us in two ways:– If the number of examples that you're trying to classify is far less than the number of dimensions that each example has, then this computation will be a lot more efficient compared to computing

w^{T} x

.– Now that we have this new form, equation

(16)

, we can apply the kernel trick.

\max_{𝛼} \sum_{i = 1}^{N} 𝛼_{i} - \frac{1}{2} \sum_{j = 1}^{N} \sum_{k = 1}^{N} 𝛼_{j} 𝛼_{k} y_{j} y_{k} Φ (x_{j}^{T} x_{k})

• With the kernel trick, we can have some dimensions of

x

. The kernel trick function,

Φ (.)

, allows us to calculate the high-dimensional dot product, and it returns a single scalar value.– This is a massive saving when the number of examples that we have is far less than the number of dimensions that we want to project our data into.• The kernel trick function can be:– Linear function– Polynomial function– Gaussian (e.g. RBF kernel)– etc. • Note: The kernel trick is not only for SVMs. For instance, we can take the Lagrangian dual of linear regression, and we can get our

x

terms together → that means we use the kernel trick here as well.

y = \sum_{N}^{} a_{i} Φ (x^{T} x)
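To make the idea concrete, here is a minimal sketch (not part of the original notes) using the homogeneous degree-2 polynomial kernel on 2-D inputs. The helper names `phi` and `poly_kernel` are hypothetical. The point is that the kernel value computed in the original 2-D space equals the dot product in the explicit 3-D feature space.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit degree-2 feature map for a 2-D input (illustrative helper)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: (x . z)^2."""
    return dot(x, z) ** 2


x = [1.0, 2.0]
z = [3.0, -1.0]

explicit = dot(phi(x), phi(z))  # dot product after transforming to 3-D
implicit = poly_kernel(x, z)    # same value, computed without the transform
```

For higher degrees or more input features, the explicit map grows quickly while the kernel evaluation still costs one dot product in the input space.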

8.4.1. RBF kernel• RBF → Radial Basis Function

\begin{array}{c} K_{RBF} (x, x') = e^{- \frac{\| x - x' \|^{2}}{2 𝜎^{2}}} \\ 𝜎 \text{ is a tuning parameter:} \\ \text{if } 𝜎 \text{ is too small → overfitting} \\ \text{if } 𝜎 \text{ is too large → underfitting} \end{array}
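As a rough illustration of the RBF formula, the sketch below builds the kernel (Gram) matrix for three toy points. The function name `rbf_kernel` and the sample points are assumptions made for this example only.

```python
import math


def rbf_kernel(x, z, sigma=1.0):
    """RBF kernel: exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# Three toy 2-D points (made up for illustration).
points = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]

# Gram matrix: one row/column per data point.
gram = [[rbf_kernel(p, q) for q in points] for p in points]

# A small sigma makes distant points look unrelated (values near 0);
# a large sigma makes all points look similar (values near 1).
narrow = rbf_kernel(points[0], points[2], sigma=0.5)
wide = rbf_kernel(points[0], points[2], sigma=5.0)
```

The Gram matrix has one row and one column per data point, so its size depends on the number of examples rather than on the dimension of the feature space.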

• It represents a separate dimension per data point that you have, i.e. if you have 100 data points, the RBF kernel would represent your data in 100 dimensions.• Why is this useful?– The RBF kernel assigns every data point a Gaussian distribution that can have a different height or width.– Then, it traces a line (or hyperplane) through the sum of the Gaussian distributions.– It does this so it can project data points onto that line within their Gaussian distributions.– By doing this, we get linearly separable data. 8.5. The dual problem• In convex optimization, for every primal problem (e.g. equation

(14)

) we can derive a dual problem.• Let

𝛼 \in R^{N}

be the dual variables, corresponding to Lagrange multipliers that enforce the

N

inequality constraints.• The generalized Lagrangian is given below,

L (w, w_{0}, 𝛼) = \frac{1}{2} w^{T} w - \sum_{n = 1}^{N} 𝛼_{n} ({\tilde{y}}_{n} (w^{T} x_{n} + w_{0}) - 1)

• To optimize this, we must find a stationary point that satisfies

(\hat{w}, \hat{w_{0}}, \hat{𝛼}) = \min_{w, w_{0}} \max_{𝛼} L (w, w_{0}, 𝛼)

• We can do this by computing the partial derivatives w.r.t.

w

and

w_{0}

and setting to zero,

\begin{array}{c} \nabla_{w} L (w, w_{0}, 𝛼) = w - \sum_{n = 1}^{N} 𝛼_{n} {\tilde{y}}_{n} x_{n} \\ \frac{𝜕}{𝜕 w_{0}} L (w, w_{0}, 𝛼) = - \sum_{n = 1}^{N} 𝛼_{n} {\tilde{y}}_{n} \\ \text{and hence} \\ \hat{w} = \sum_{n = 1}^{N} {\hat{𝛼}}_{n} {\tilde{y}}_{n} x_{n} \\ 0 = \sum_{n = 1}^{N} {\hat{𝛼}}_{n} {\tilde{y}}_{n} \\ \text{Plugging these into the Lagrangian yields the following} \\ L (\hat{w}, {\hat{w}}_{0}, 𝛼) = \frac{1}{2} {\hat{w}}^{T} \hat{w} - \sum_{n = 1}^{N} 𝛼_{n} {\tilde{y}}_{n} {\hat{w}}^{T} x_{n} - \sum_{n = 1}^{N} 𝛼_{n} {\tilde{y}}_{n} {\hat{w}}_{0} + \sum_{n = 1}^{N} 𝛼_{n} \\ = \frac{1}{2} {\hat{w}}^{T} \hat{w} - {\hat{w}}^{T} \hat{w} - 0 + \sum_{n = 1}^{N} 𝛼_{n} \\ = - \frac{1}{2} \sum_{i = 1}^{N} \sum_{j = 1}^{N} 𝛼_{i} 𝛼_{j} {\tilde{y}}_{i} {\tilde{y}}_{j} x_{i}^{T} x_{j} + \sum_{n = 1}^{N} 𝛼_{n} \end{array}

• This is called the dual form of the objective.• We want to maximize this w.r.t.

𝛼

subject to the constraints that

\sum_{n = 1}^{N} 𝛼_{n} {\tilde{y}}_{n} = 0

and

𝛼_{n} ⩾ 0

for

n = 1 : N

.• The above objective is a quadratic problem in

N

variables.• Standard QP solvers take

O (N^{3})

time.• However, specialized algorithms, such as the sequential minimal optimization (SMO) algorithm, have been developed for this problem; empirically, SMO's running time scales between

O (N)

and

O (N^{2})

.• Since this is a convex objective, the solution must satisfy the KKT conditions,

\begin{array}{c} 𝛼_{n} ⩾ 0 \\ {\tilde{y}}_{n} f (x_{n}) - 1 ⩾ 0 \\ 𝛼_{n} ({\tilde{y}}_{n} f (x_{n}) - 1) = 0 \end{array}

• Hence either

𝛼_{n} = 0

(in which case example

n

is ignored when computing

\hat{w}

) or the constraint

{\tilde{y}}_{n} ({\hat{w}}^{T} x_{n} + {\hat{w}}_{0}) = 1

is active.• This latter condition means that example

n

lies exactly on the margin (such examples are called support vectors). We denote the set of support vectors by

S

.• To perform prediction, we use

f (x; \hat{w}, {\hat{w}}_{0}) = {\hat{w}}^{T} x + {\hat{w}}_{0} = \sum_{n \in S}^{} 𝛼_{n} {\tilde{y}}_{n} x_{n}^{T} x + {\hat{w}}_{0}

• To solve for

{\hat{w}}_{0}

, we can use the fact that for any support vector, we have

{\tilde{y}}_{n} f (x_{n}; \hat{w}, {\hat{w}}_{0}) = 1

. • Multiplying both sides by

{\tilde{y}}_{n}

, and exploiting the fact that

{\tilde{y}}_{n}^{2} = 1

, we get

{\hat{w}}_{0} = {\tilde{y}}_{n} - {\hat{w}}^{T} x_{n}

. • In practice, we get better results by averaging over all the support vectors,

{\hat{w}}_{0} = \frac{1}{| S |} \sum_{n \in S}^{} ({\tilde{y}}_{n} - {\hat{w}}^{T} x_{n}) = \frac{1}{| S |} \sum_{n \in S}^{} ({\tilde{y}}_{n} - \sum_{m \in S}^{} 𝛼_{m} {\tilde{y}}_{m} x_{m}^{T} x_{n})
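The formulas for the weights, the averaged bias, and the prediction function can be checked on a toy problem. The sketch below assumes two training points whose dual solution was worked out by hand (both multipliers equal 1/4, which satisfies the equality constraint and maximizes the dual objective for this data). The data and variable names are illustrative only.

```python
def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


# Toy data: two points with labels in {-1, +1} (made up for illustration).
X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1.0, -1.0]

# Dual solution for this toy problem, derived by hand:
# the dual objective reduces to 2a - 4a^2, which is maximal at a = 1/4.
alpha = [0.25, 0.25]

# Support vectors are the examples with a strictly positive multiplier.
support = [n for n, a in enumerate(alpha) if a > 1e-12]

# w_hat = sum_n alpha_n * y_n * x_n
w_hat = [sum(alpha[n] * y[n] * X[n][d] for n in support) for d in range(2)]

# w0_hat = average over support vectors of (y_n - w_hat^T x_n)
w0_hat = sum(y[n] - dot(w_hat, X[n]) for n in support) / len(support)


def f(x):
    """Decision function written only in terms of dot products with support vectors."""
    return sum(alpha[n] * y[n] * dot(X[n], x) for n in support) + w0_hat
```

Because `f` only touches the data through `dot(X[n], x)`, replacing that call with a kernel function is all that is needed to kernelize the predictor.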

8.5.1. Lagrange multiplier• In mathematical optimization, the method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equality constraints.• The basic idea is to convert a constrained problem into a form such that the derivative test of an unconstrained problem can still be applied.• The method can be summarized as follows: in order to find the maximum or minimum of a function

f (x)

subject to the equality constraint

g (x) = 0

, form the Lagrangian function, equation

(28)

, and find the stationary points of

L

considered as a function of

x

and the Lagrange multiplier

𝜆

L (x, 𝜆) = f (x) + 𝜆 g (x)

• This means that all partial derivatives should be zero, including the partial derivative w.r.t.

𝜆

.• The solution corresponding to the original constrained optimization is always a saddle point of the Lagrangian function.• The great advantage of this method is that it allows the optimization to be solved without explicit parameterization in terms of the constraints.– As a result, the method of Lagrangian multipliers is widely used to solve challenging constrained optimization problems.

Figure 7: The red curve shows the constraint

g (x, y) = c

. The blue curves are contours of

f (x, y)

. The point where the red constraint tangentially touches a blue contour is the maximum of

f (x, y)

along the constraint, since

d_{1} > d_{2}
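As a small worked example of the method (chosen here for illustration, not taken from Figure 7), consider maximizing f(x, y) = x + y on the unit circle, i.e. with the constraint g(x, y) = x^2 + y^2 - 1 = 0. Using the same sign convention as the Lagrangian above:

```latex
\begin{aligned}
L(x, y, \lambda) &= x + y + \lambda (x^{2} + y^{2} - 1) \\
\frac{\partial L}{\partial x} &= 1 + 2 \lambda x = 0, \qquad
\frac{\partial L}{\partial y} = 1 + 2 \lambda y = 0, \qquad
\frac{\partial L}{\partial \lambda} = x^{2} + y^{2} - 1 = 0 \\
x = y &= -\frac{1}{2 \lambda}
\;\Rightarrow\; \frac{2}{4 \lambda^{2}} = 1
\;\Rightarrow\; \lambda = \pm \frac{1}{\sqrt{2}} \\
\lambda = -\tfrac{1}{\sqrt{2}} &: \; x = y = \tfrac{1}{\sqrt{2}}, \; f = \sqrt{2} \quad \text{(maximum)} \\
\lambda = +\tfrac{1}{\sqrt{2}} &: \; x = y = -\tfrac{1}{\sqrt{2}}, \; f = -\sqrt{2} \quad \text{(minimum)}
\end{aligned}
```

Both stationary points satisfy the constraint. Comparing the values of f at those points tells us which one is the constrained maximum.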

8.6. More on kernel trick• Kernel Definition: A function that takes as its input vectors in the original space and returns the dot product of the vectors in the feature space is called a kernel function.– More formally, if we have data

x, z \in X

and a map

𝜙 : X \to R^{N}

, then

k (x, z) = ⟨ 𝜙 (x), 𝜙 (z) ⟩

is a kernel function. • Once we converted our problem into a dual problem form, equation

(25)

, it has

N

unknowns (

𝛼

) which (in general) takes

O (N^{3})

time to solve, which can be slow.• However, the principal benefit of dual problem is that we can replace all inner product operations

x_{}^{T} x'

with a call to a positive definite (Mercer) kernel function,

K (x, x')

→ This is called the kernel trick.• In particular, we can rewrite the prediction function, equation

(26)

, as follows,

f (x) = {\hat{w}}^{T} x + {\hat{w}}_{0} = \sum_{n \in S}^{} 𝛼_{n} {\tilde{y}}_{n} x_{n}^{T} x + {\hat{w}}_{0} = \sum_{n \in S}^{} 𝛼_{n} {\tilde{y}}_{n} K (x_{n}, x) + {\hat{w}}_{0}

• We also need to kernelize the bias term. This can be done by kernelizing equation

(27)

{\hat{w}}_{0} = \frac{1}{| S |} \sum_{i \in S}^{} ({\tilde{y}}_{i} - (\sum_{j \in S}^{} 𝛼_{j} {\tilde{y}}_{j} x_{j})^{T} x_{i}) = \frac{1}{| S |} \sum_{i \in S}^{} ({\tilde{y}}_{i} - \sum_{j \in S}^{} 𝛼_{j} {\tilde{y}}_{j} K (x_{j}, x_{i}))

• This kernel trick allows us to avoid having to deal with an explicit feature representation of our data, and allows us to easily apply classifiers to structured objects, such as strings and graphs. 8.7. Kernel trick - other resources• PDF screenshots here.

8.8. Multi-class SVM• One way to do this is to create an SVM per class → i.e. classify one class against all other classes → This paradigm is called one-vs-rest.– To make a prediction for an example, we feed it to all the trained [one-vs-rest] SVMs and we measure the margin each SVM produces. We choose the class whose SVM produces the largest margin between that example and the other classes.• Another way to handle multiple classes is the one-vs-one paradigm.• Here, we create pair-wise SVMs, i.e. one SVM for every pair of classes.– To get a prediction, we simply feed the unseen example through every single SVM and select whichever class was most often predicted for that example.• Note: Even though the one-vs-one paradigm requires a lot of SVMs, each SVM is trained only on the data of two classes → So, this can actually be faster in some cases depending on the data. 8.8.1. Structured SVM• Here, instead of the margin being

- 1

and

1

, the margin is actually the distance between the two closest classes. 8.9. SVM for regression• The idea is very similar to the SVM classifier. • The only difference is that the goal is to keep all points within the margin.• The slack variables (i.e. the error terms) here come from points that lie outside of the margin. 8.10. Other considerations• SVMs can linearly separate data out of the box → hard-margin SVM.• If we add slack variables → soft-margin SVM.• We can use sub-gradient descent to optimize soft-margin SVM.• We can add interaction terms to separate data even if it's not linearly separable. This is often preferred when you have either a low number of dimensions that you want to project into or if your data is extremely large.• However, if your data isn't so large, and you want to project your features into a very high dimensional space, you can use the kernel trick, which allows us to avoid computing the actual feature transformation.• SVMs are distance-based. So, we have to consider scaling our features as well.• Why use SVM over logistic regression?– If you have a low number of examples → just start with a linear SVM because it only focuses on the support vectors.– If you have a ton of data and not many features → you might be better off starting with logistic regression.
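The sub-gradient approach to soft-margin SVM mentioned in this section can be sketched in a few lines. This is a simplified Pegasos-style loop under stated assumptions: no bias term, one random example per step, step size 1 / (lam * t), and toy data invented for the example. It is an illustration, not a tuned implementation.

```python
import random


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def pegasos(X, y, lam=0.1, n_steps=1000, seed=0):
    """Sub-gradient descent on lam/2 * ||w||^2 + mean(hinge loss), no bias term."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for t in range(1, n_steps + 1):
        i = rng.randrange(len(X))          # pick one random training example
        eta = 1.0 / (lam * t)              # decaying step size
        margin = y[i] * dot(w, X[i])       # margin under the current weights
        # Sub-gradient of the regularizer: shrink the weights.
        w = [(1.0 - eta * lam) * wj for wj in w]
        # Sub-gradient of the hinge loss: only active when the margin is violated.
        if margin < 1.0:
            w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def predict(w, x):
    """Sign of the linear score, as a label in {-1.0, +1.0}."""
    return 1.0 if dot(w, x) >= 0.0 else -1.0


# Toy, linearly separable data (made up for illustration).
X_train = [[2.0, 2.0], [3.0, 1.0], [2.0, 3.0], [-2.0, -1.0], [-1.0, -3.0], [-3.0, -2.0]]
y_train = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]

w_svm = pegasos(X_train, y_train)
train_predictions = [predict(w_svm, x) for x in X_train]
```

At the kink of the hinge loss (margin exactly 1) any value between the two one-sided slopes is a valid sub-gradient; the sketch simply treats that case as "no violation".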

9. k-Means

10. Singular Value Decomposition (SVD)• What if we had a bunch of data and we didn't really know much about it?– We'd like to take the data and look for patterns in it and separate them out → so that we could understand our data better.– We can use SVD to do this.• SVD states that any matrix can be represented as the product of three matrices as follows,•

A = U 𝛴 V^{T}

•

U

→ rotation•

𝛴

→ scaling•

V^{T}

→ final rotation• For Example

[\begin{array}{ccc} 1 & - 1 & 2 \\ 3 & 2 & - 2 \end{array}] = [\begin{array}{cc} - 0.24 & 0.96 \\ 0.96 & 0.24 \end{array}] [\begin{array}{ccc} 4.2 & 0 & 0 \\ 0 & 2.2 & 0 \end{array}] [\begin{array}{ccc} 0.63 & 0.58 & - 0.57 \\ 0.74 & - 0.2 & 0.63 \\ - 0.2 & 0.82 & 0.51 \end{array}]
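The singular values in the example can be checked by hand, since they are the square roots of the eigenvalues of the 2x2 matrix A A^T. The sketch below is a minimal check in plain Python, using the closed-form eigenvalues of a symmetric 2x2 matrix. It is specific to this small example and is not a general SVD routine.

```python
import math

# The example matrix from the text.
A = [[1.0, -1.0, 2.0],
     [3.0, 2.0, -2.0]]

# G = A A^T is a symmetric 2x2 matrix.
G = [[sum(a * b for a, b in zip(r1, r2)) for r2 in A] for r1 in A]

# Eigenvalues of a 2x2 matrix from its trace and determinant.
trace = G[0][0] + G[1][1]
det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
disc = math.sqrt(trace ** 2 - 4.0 * det)
eigenvalues = [(trace + disc) / 2.0, (trace - disc) / 2.0]

# Singular values of A are the square roots of those eigenvalues.
singular_values = [math.sqrt(ev) for ev in eigenvalues]
```

The two values come out at roughly 4.21 and 2.29, which agrees with the rounded diagonal entries shown in the example.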

• Note: If we divide the square of each diagonal element of

𝛴

by the sum of the squares of all diagonal elements, we get the fraction of the variance explained by the corresponding column of the

U

matrix. – In the example above, the variance explained by the first column of

U

[\begin{array}{c} - 0.24 \\ 0.96 \end{array}]

, is equal to

\frac{4.2^{2}}{4.2^{2} + 2.2^{2}} \approx 0.78

. • Note: The third column of

𝛴

and third row of

V^{T}

are not used. 10.1. Eigendecomposition• Eigendecomposition states that a (diagonalizable) square matrix can be broken down into eigenvectors and eigenvalues.• A few problems with eigendecomposition:– It only works on square matrices.– The eigenvalues are not necessarily real or non-negative. – The eigenvectors are not necessarily perpendicular.• SVD solves these problems by:– Allowing any sort of matrix (not only limited to square matrices)– The diagonal of

𝛴

holds the square roots of the eigenvalues of

A A^{T}

→ This guarantees that these values are real and non-negative.–

V^{T}

is just the eigenvectors of

A^{T} A

.– To get the values of

U

, we can simply solve this equation →

u_{i} = \frac{A v_{i}}{𝜎_{i}}

• We can think of SVD as a generalized version of eigendecomposition. 10.2. Principal Component Analysis (PCA)• The eigendecomposition of matrix

A

is,

A = V L V^{T}

• Now, what if we take matrix

A

, standardize it (i.e. subtract the mean of each column and divide by its standard deviation), and then divide the product of its transpose with itself by

N - 1

→ This gives us the correlation matrix →

\frac{A^{T} A}{N - 1}

– The problem is that this computation is typically not stable.– So, instead, what's typically done to get PCA is to use SVD on the standardized matrix

A

.* In this case, the

U 𝛴

term → Principal Components • Note: SVD and PCA can be used for dimensionality reduction.• Note: SVD and PCA assume a linear correlation between the features.– There are non-linear dimensionality reduction techniques. An example of such a method is Kernel PCA.

11. Neural Networks (NN) 11.1. What is a neuron (in NN)? • A neuron, sometimes called a perceptron, is the smallest part of a NN: it takes its inputs, multiplies each by a weight, sums them, adds a bias, and applies an activation function. • The

𝜎 (W^{T} . X + b)

is the same as logistic regression and the loss is exactly the same as the one in logistic regression:

L (\hat{y}, y) = - \frac{1}{N} \sum_{i = 1}^{N} [y_{i} \log {\hat{y}}_{i} + (1 - y_{i}) \log (1 - {\hat{y}}_{i})]

• Note: There are other non-linear functions used as well, such as ReLU, tanh, etc.• We can update the weights by taking gradients of the loss function with respect to the weights,

\begin{array}{c} \nabla L = [\begin{array}{c} \frac{\partial L}{\partial b} \\ \frac{\partial L}{\partial w_{1}} \\ . . . \\ \frac{\partial L}{\partial w_{n}} \end{array}] \\ w^{t + 1} = w^{t} - r \nabla L \end{array}

• Note: According to equation

(34)

, in order to update weights, we move in the opposite direction of loss gradient adjusted by the learning rate (

r

). 11.2. Why do we need the bias term?• Bias is like the intercept added in a linear equation. It is an additional parameter in the Neural Network which is used to adjust the output along with the weighted sum of the inputs to the neuron. Thus, Bias is a constant which helps the model in a way that it can fit best for the given data.• The bias term helps in cases where all the

w_{i} x_{i}

terms are

0

, which means that the model cannot be trained. Adding a bias term lets the model be trained in such cases. 11.3. How a NN learns non-linear patterns?• Each neuron in a NN learns a decision boundary.• Since a NN has many neurons, the combined learned decision boundaries create a non-linear decision boundary. • In the example below, there's no way to separate the data with one line. There are feature engineering methods (or other algorithms) that can handle this. But how can a NN separate these two classes? • The above picture is a simple example of how a NN can capture non-linear patterns. In practice, NNs have more than one hidden layer and more neurons per layer.– Note: Usually the number of neurons in each hidden layer decreases as we move forward through the network. 11.4. How does a NN learn?• Let's explain this through a small NN below.

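[Figure: a small NN with inputs x_{1} and x_{2}, two hidden neurons (h^{1}, h^{2}) connected to the inputs through weights w_{1} to w_{4}, and one output neuron {\hat{y}}_{o u t} connected to the hidden neurons through weights w_{5} and w_{6}; each neuron applies a sum followed by 𝜎.]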

h_{i n}^{1} = w_{1} x_{1} + w_{3} x_{2}

h_{i n}^{2} = w_{2} x_{1} + w_{4} x_{2}

h_{o u t}^{1} = 𝜎 (h_{i n}^{1})

h_{o u t}^{2} = 𝜎 (h_{i n}^{2})

y_{i n}^{} = w_{5} h_{o u t}^{1} + w_{6} h_{o u t}^{2}

{\hat{y}}_{o u t} = 𝜎 (y_{i n}^{})

The loss function is →

L (\hat{y}, y) = - \frac{1}{N} \sum_{i = 1}^{N} [y_{i} \log {\hat{y}}_{i} + (1 - y_{i}) \log (1 - {\hat{y}}_{i})]

Loss for one example:

{\hat{y}}_{o u t} = 0.33, y = 1

\begin{array}{l} L = - \log ({\hat{y}}_{o u t}) \\ = - \log (𝜎 (y_{i n}^{})) \\ = - \log (𝜎 (w_{5} h_{o u t}^{1} + w_{6} h_{o u t}^{2})) \\ = - \log (𝜎 (w_{5} 𝜎 (h_{i n}^{1}) + w_{6} 𝜎 (h_{i n}^{2}))) \\ = - \log (𝜎 (w_{5} 𝜎 (w_{1} x_{1} + w_{3} x_{2}) + w_{6} 𝜎 (w_{2} x_{1} + w_{4} x_{2}))) \end{array}

11.5. Chain Rule

* Now, we have to take the gradient of

L .

Since the loss function is a complex function, it's hard to derive the analytical gradient directly. * Since the loss function is essentially a function of functions, we use the chain rule to compute the derivatives. For example,

\frac{\partial L}{\partial w_{1}} = \frac{\partial L}{\partial {\hat{y}}_{o u t}} . \frac{\partial {\hat{y}}_{o u t}}{\partial y_{i n}} . \frac{\partial y_{i n}}{\partial h_{o u t}^{1}} . \frac{\partial h_{o u t}^{1}}{\partial h_{i n}^{1}} . \frac{\partial h_{i n}^{1}}{\partial w_{1}}

\frac{\partial L}{\partial w_{6}} = \frac{\partial L}{\partial {\hat{y}}_{o u t}} . \frac{\partial {\hat{y}}_{o u t}}{\partial y_{i n}} . \frac{\partial y_{i n}}{\partial w_{6}}

* Note that the first two terms of

\frac{\partial L}{\partial w_{6}}

and

\frac{\partial L}{\partial w_{1}}

are the same. This means that we can use dynamic programming + chain rule to calculate derivatives. * We start by first calculating

\frac{\partial L}{\partial w_{6}}

and work our way to

\frac{\partial L}{\partial w_{1}}

. * This gives us something called backpropagation. Backpropagation is the standard way to train a NN. * For training we need: * Forward Pass → To figure out how far our predictions are from the actual value * Backpropagation → Once we have the loss, we can backpropagate the gradients to update all of the weights in our NN.
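The chain-rule bookkeeping for the small network in 11.4 can be sketched as follows. The sketch assumes the negative log-likelihood (cross-entropy) loss for a single example and no bias terms, matching the network equations. The weight values and inputs are made up, and the finite-difference check is included only to confirm that the analytic gradients are right.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(w, x1, x2):
    """Forward pass; w = [w1, w2, w3, w4, w5, w6] as in the network equations."""
    w1, w2, w3, w4, w5, w6 = w
    h1 = sigmoid(w1 * x1 + w3 * x2)
    h2 = sigmoid(w2 * x1 + w4 * x2)
    y_hat = sigmoid(w5 * h1 + w6 * h2)
    return h1, h2, y_hat


def loss(w, x1, x2, y):
    """Cross-entropy loss for one example."""
    _, _, y_hat = forward(w, x1, x2)
    return -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))


def backprop(w, x1, x2, y):
    """Gradients of the loss w.r.t. all six weights, reusing shared terms."""
    w1, w2, w3, w4, w5, w6 = w
    h1, h2, y_hat = forward(w, x1, x2)
    d_y_in = y_hat - y                        # dL/dy_in, shared by every weight
    d_w5, d_w6 = d_y_in * h1, d_y_in * h2     # output-layer weights
    d_h1_in = d_y_in * w5 * h1 * (1.0 - h1)   # pushed back through sigmoid of h1
    d_h2_in = d_y_in * w6 * h2 * (1.0 - h2)   # pushed back through sigmoid of h2
    return [d_h1_in * x1, d_h2_in * x1, d_h1_in * x2, d_h2_in * x2, d_w5, d_w6]


def numeric_grad(w, x1, x2, y, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    grads = []
    for k in range(len(w)):
        up, down = list(w), list(w)
        up[k] += eps
        down[k] -= eps
        grads.append((loss(up, x1, x2, y) - loss(down, x1, x2, y)) / (2.0 * eps))
    return grads


weights = [0.1, -0.2, 0.3, 0.4, -0.5, 0.6]   # arbitrary illustrative values
analytic = backprop(weights, 1.0, 2.0, 1.0)
numeric = numeric_grad(weights, 1.0, 2.0, 1.0)
```

The term `d_y_in` is computed once and reused for every weight, which is the "dynamic programming" aspect of backpropagation: gradients near the output are cached and extended backwards.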

11.6. How do we update the weights?

w^{t + 1} = w^{t} - r \nabla L

* If we plot the average loss obtained from all of the training examples against a particular parameter, say

w_{1}

, we get a function like below,

[Figure: plot of the loss L (\hat{y}, y) against the parameter w_{1}.]

* The difference between this function and the logistic regression function is that here we have local optima. In logistic regression, we were guaranteed to have one minimum. * That's because we stacked up neurons and added layers, so we opened ourselves up to local optima.

11.7. Stochastic Gradient Descent (SGD)

* There are some techniques that increase the chance of not getting stuck in a local optimum. The most popular method is Stochastic Gradient Descent (SGD). * SGD's tendency to avoid getting stuck in local optima is just a by-product of taking random examples and updating the weights with just that single example. This randomness in the weight updates can increase the chances that we don't get stuck in a local optimum.

11.8. Momentum

* The problem with SGD is that it's slow to converge. * One idea to speed up convergence is by incorporating momentum. * The idea of momentum is to keep track of the previous updates.

\begin{array}{l} w^{t + 1} = w^{t} - r (\nabla L^{t} + 𝛾 \nabla L^{t - 1} + 𝛾^{2} \nabla L^{t - 2} + \dots + 𝛾^{n} \nabla L^{t - n}) \\ \text{or} \\ w^{t + 1} = w^{t} - V^{t} \\ V^{t} = 𝛾 V^{t - 1} + r \nabla L^{t} \end{array}

* The

𝛾

parameter is usually set to

0.9

so that the previous gradient doesn't matter as much as the current gradient. * The problem with momentum is that sometimes we could build so much momentum that we overshoot the global optimum.
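A minimal sketch of momentum on a one-dimensional quadratic loss, L(w) = w^2, is shown below. It assumes the convention v ← gamma * v + r * grad and w ← w - v. The loss, the starting point, and the function names are made up for illustration.

```python
def grad(w):
    """Gradient of the toy loss L(w) = w^2."""
    return 2.0 * w


def sgd_momentum(w, r=0.1, gamma=0.9, n_steps=200):
    """Gradient descent with momentum; returns the full trajectory of w."""
    v = 0.0
    trajectory = [w]
    for _ in range(n_steps):
        v = gamma * v + r * grad(w)   # accumulate a decaying sum of past gradients
        w = w - v                     # step against the accumulated direction
        trajectory.append(w)
    return trajectory


trajectory = sgd_momentum(5.0)
```

Starting from w = 5, the iterate crosses zero after a few steps before settling, which is the overshoot behavior described above: the accumulated velocity carries the parameter past the optimum at w = 0.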

11.9. AdaGrad

* There's another method called AdaGrad which adjusts the learning rate per parameter. * Note:

r_{g e n e r a l}

is typically set to

0.001

w_{1}^{t + 1} = w_{1}^{t} - r_{1}^{t} \frac{\partial L^{t}}{\partial w_{1}}

r_{1}^{t} = \frac{r_{g e n e r a l}}{\sqrt{(\frac{\partial L^{t - 1}}{\partial w_{1}})^{2} + \dots + (\frac{\partial L^{t - n}}{\partial w_{1}})^{2}} + 𝜀}

* Why would you want to do something like this? * It balances the update value at each step such that when the accumulated gradient is high it lowers the learning rate, and when the accumulated gradient is low it increases the learning rate. * This way it moderates the steps we take at each update (and for each parameter). * Note: The

𝜀

term in the denominator is set to a small value to avoid dividing by

0

. * Note: AdaGrad really helps in the case of sparse features, because if we have sparse features, that means that the weights associated with those features will be updated less, and therefore the learning rate will be higher.
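The per-parameter learning rate can be illustrated with a small sketch. It follows the common AdaGrad variant in which the accumulated sum of squared gradients includes the current gradient (an assumption; the formula above starts the sum at the previous step). The gradients are synthetic: the first feature is "dense" and the second is "sparse".

```python
import math


def adagrad_step(w, grad, sum_sq, r_general=0.1, eps=1e-8):
    """One AdaGrad update; sum_sq holds the running sum of squared gradients."""
    sum_sq = [s + g * g for s, g in zip(sum_sq, grad)]
    w = [wi - r_general / (math.sqrt(s) + eps) * g
         for wi, g, s in zip(w, grad, sum_sq)]
    return w, sum_sq


w = [0.0, 0.0]
sum_sq = [0.0, 0.0]
for step in range(100):
    # Feature 0 gets a gradient every step; feature 1 only every 10th step.
    g = [1.0, 1.0 if step % 10 == 0 else 0.0]
    w, sum_sq = adagrad_step(w, g, sum_sq)

# Effective learning rate per parameter after 100 steps.
effective_lr = [0.1 / (math.sqrt(s) + 1e-8) for s in sum_sq]
```

After 100 steps the sparse feature has accumulated far less squared gradient, so its effective learning rate stays roughly three times higher than that of the dense feature.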

11.10. Adam

* The other method is Adam.* Adam combines momentum and adaptive learning rate.

w_{}^{t + 1} = w_{}^{t} - \frac{r_{g e n e r a l}}{\sqrt{{\hat{V}}_{t}} + 𝜀} {\hat{m}}_{t}

m_{t} = 𝛽_{1} m_{t - 1} + (1 - 𝛽_{1}) \nabla_{l o s s}^{t}

V_{t} = 𝛽_{2} V_{t - 1} + (1 - 𝛽_{2}) (\nabla_{l o s s}^{t})^{2}

* Note:

𝛽_{1}

and

𝛽_{2}

are hyperparameters.* Note: The only difference between

m_{t}

and

V_{t}

is the squared gradient loss term.* Note: Notice that the

m_{t}

and

V_{t}

are adjusted (i.e.

{\hat{m}}_{t}

and

{\hat{V}}_{t}

). The reason is that these terms are estimates of the first and second moments of the gradient; because they are initialized at zero, they are biased towards zero during the first steps, and to get unbiased estimates we have to adjust them by the

𝛽

parameters.* Note: Adam looks like a ball rolling down a hill with momentum, but the ball also has friction. The idea is that the friction helps the parameters settle in an optimum, while the momentum helps the parameters escape local minima.

{\hat{m}}_{t} = \frac{m_{t}}{1 - 𝛽_{1}^{t}}, {\hat{V}}_{t} = \frac{V_{t}}{1 - 𝛽_{2}^{t}}
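The Adam update and the bias correction can be put together in a short sketch for a single scalar parameter. The hyperparameter defaults below (0.9, 0.999, 1e-8) are the commonly used values and are an assumption here, as is the toy loss L(w) = w^2.

```python
import math


def adam_step(w, g, m, v, t, r_general=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * g          # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * g * g      # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)             # bias-corrected second moment
    w = w - r_general / (math.sqrt(v_hat) + eps) * m_hat
    return w, m, v


# One step on the toy loss L(w) = w^2, whose gradient is 2w.
w, m, v = 1.0, 0.0, 0.0
w, m, v = adam_step(w, 2.0 * w, m, v, t=1)
```

On the very first step the corrected moments equal the gradient and its square, so the parameter moves by almost exactly the learning rate. Without the correction the first step would be scaled by the ratio of the two uncorrected moments instead.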

11.11. RMSProp• Will add the note later 11.12. AdaDelta• Will add the note later 11.13. Vanishing and exploding gradients• Once we have the gradients, from whatever optimizer we use, multiplying these gradients together can result in a problem.– Let's say we use the sigmoid activation function: the maximum value of its derivative is

0.25

.– Now, if we multiply a lot of

0.25

s together, the final gradient (based on the chain rule) will shrink towards

0

→ this results in underflow.– This is called → vanishing gradient.• Also, some of the gradient terms include the value of weights. – If the weights are extremely large, by multiplying them together, we can end up getting an extremely large value → This is called an exploding gradient.• There are a few methods to mitigate these problems. 11.13.1. Initialization• One of the methods to tackle vanishing/exploding gradients is to initialize the weights of the NN in a particular way.• A bad way to initialize the weights is just to use a uniform distribution between

0

and

1

. • Another bad way is to initialize these parameters with a normal distribution in which the mean is

0

and the standard deviation is

1

.• What we can do instead is to initialize the weights from a normal distribution in which the mean is

0

but the standard deviation is

𝜎 = \sqrt{2 ⁄ (f_{i} + f_{o})}

. •

f_{i}

is fan in and

f_{o}

is fan out. • Fan in is the number of inputs to a particular layer and fan out is the number of outputs for that layer.• This way, we can initialize the weights for each layer of the NN.• This is called Xavier or Glorot initialization.• The reason why this helps is that we're shrinking the standard deviation according to how many times we will be multiplying these variables together per layer. – Not doing this makes the variances of each layer multiply together and that causes the variance to grow exponentially. – So, if we can shrink down the standard deviation early, the other multiplications, hopefully, won't result in exponential growth (or shrinkage) of the gradients.• This works best when we use something called a symmetric activation function. Examples of such functions → tanh and sigmoid. 11.14. ReLU and Leaky ReLU • What if we want to use a non-symmetric activation function?– An example of such functions is the ReLU (Rectified Linear Unit) function. • Why do we want to use ReLU?– More computationally efficient → All the negative values take on the value of

0

, and all positive values take on the value itself → When taking derivatives, the derivative for any negative input is

0

and the derivative for any positive input is just

1

.– Tends to produce better model performance– Sparsity → reduce overfitting → not all neurons will output a value (negative values →

0

). • What are the downsides of ReLU?– It has an uncapped activation. With sigmoid, we'd have something called saturation, where the output of the neuron could be no larger than the value of

1

.– However, ReLU can output any non-negative value, which means that we could be susceptible to exploding gradients more often.– We can also be susceptible to exploding forward passes, where simply doing the multiplications in the forward pass all the way through the NN produces unreasonably large numbers that overflow.– Another problem is called the dying ReLU problem.* It comes from the fact that a neuron that takes on a value of

0

for every input receives a zero gradient, so its weights stop updating and it stays at

0

forever.* That means that the neuron will be completely dead and never output another value except

0

.– Even with these problems, ReLU activation functions are used often in practice.• Initialization for ReLU– Instead of Xavier initialization, we use the Kaiming initialization →

𝜎 = \sqrt{2 ⁄ f_{i}}

– The Kaiming initialization can be used for other asymmetric activation functions like Leaky ReLU →

f (x) = \begin{cases} x & \text{if } x > 0 \\ 0.01 x & \text{otherwise} \end{cases}

– Leaky ReLU tries to get around the dead neuron problem by giving negative inputs a small non-zero slope.• Another thing to do (in addition to the initialization) is feature scaling. 11.15. tanh•

\tanh

is very similar to the sigmoid function, but instead of being in the range of

[0, 1]

, it lies in the range of

[- 1, 1]

. • The idea is to cross-validate between all the activation functions to see which works best for your data.• Note: Different activation functions can be used at different layers of a NN.• Note: The last neuron will dictate what the output looks like. sigmoid → binary classification, softmax → multi-class classification (the class with the maximum softmax value is your prediction), linear activation function → regression 11.16. Loss Functions• Regression → Mean Squared Error (MSE) →

L (y, \hat{y}) = \frac{\sum_{N}^{} (y_{i} - {\hat{y}}_{i})^{2}}{N}

–

N

is usually the batch_size.• Regression → Mean Absolute Error (MAE) →

L (y, \hat{y}) = \frac{\sum_{N}^{} | y_{i} - {\hat{y}}_{i} |}{N}

• Classification → Cross Entropy (sometimes called logloss) →

L (y, \hat{y}) = - (y \log \hat{y} + (1 - y) \log (1 - \hat{y}))

• Classification → Cross Entropy for

K

classes →

L (y, \hat{y}) = - \sum_{k = 1}^{K} y_{k} \log ({\hat{y}}_{k})
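The four losses can be written out directly. The sketch below uses plain Python lists. The function names and sample values are illustrative, and MAE is computed as the mean of absolute differences.

```python
import math


def mse(y, y_hat):
    """Mean squared error over a batch."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def mae(y, y_hat):
    """Mean absolute error over a batch."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def binary_cross_entropy(y, y_hat):
    """Log loss for one example; y in {0, 1}, y_hat a probability in (0, 1)."""
    return -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))


def categorical_cross_entropy(y_onehot, probs):
    """Cross entropy for K classes; y_onehot is one-hot, probs sums to 1."""
    return -sum(t * math.log(p) for t, p in zip(y_onehot, probs) if t > 0.0)


batch_mse = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
batch_mae = mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
bce = binary_cross_entropy(1.0, 0.33)
cce = categorical_cross_entropy([0.0, 1.0, 0.0], [0.2, 0.7, 0.1])
```

With a one-hot label, the categorical cross entropy reduces to the negative log of the probability assigned to the true class.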

11.17. Avoid Overfitting 11.17.1. Regularization• We can do regularization by adding an

L_{1}

or

L_{2}

penalty term to the loss function. For example, with an

L_{1}

penalty →

L (y, \hat{y}) = - \sum_{k = 1}^{K} y_{k} \log ({\hat{y}}_{k}) + 𝜆 \sum_{i}^{} | w_{i} |

11.17.2. Dropout• With dropout, each neuron in a layer is kept with some probability during a training iteration; the other neurons are dropped out for that iteration.• For each layer we assign a keep probability (e.g.

P = 0.5

).• The problem with dropout is that the neurons that were dropped during training are all active during prediction → so, all of a sudden, the last node summation will be a lot higher (because we have all the neurons).– To solve this problem, we can use inverted dropout. During training (after every mini-batch), we take the output of the layers and divide by the keep probability →

\frac{output}{keep probability}

.– This ensures that the total sum coming into the last node during training will match, on average, the total sum coming into it during prediction time. 11.18. How to determine the number of layers and neurons?• If your data is linearly separable, you don't need any hidden layer at all.• Beyond that, it's safe to start with a single hidden layer; a common rule of thumb is to set the number of neurons in that hidden layer to the average of the number of input and output neurons.• Another alternative is to start with more layers or units than you need, and then examine the weights of your connections.– The weights that are close to

0

, should allow you to prune the corresponding neuron.– Once you drop the neuron, you run cross validation to see how much the NN model performance is affected.

12. Convolutional Neural Networks (CNN)

13. Recurrent Neural Networks (RNN)

14. Generative Adversarial Networks (GAN)• Let's say we have about 200,000 satellite images and we want to improve their quality. We will capture: coastlines, ports, cities, farms, mountains, oceans, suburbs.• The labels will be equivalent aerial images corresponding to each satellite image.– The aerial images have 4x the resolution of satellite images (e.g. a 2x2 section would expand to a 4x4 section).– Note: The aerial images have to be taken from the same exact location (and preferably at the same time of day).• We'll use satellite images for creating features.• What model could we use for training this problem?– We can use a type of NN model called Generative Adversarial Networks (GAN).– A GAN consists of two NNs. One NN is called the generator.– The generator will take in a low resolution image (i.e. a satellite image), and create its best guess for what the equivalent high resolution image (i.e. aerial image) would be.– The second part of the GAN is called the discriminator. – The discriminator will take in real aerial images and the aerial image estimates (produced by the generator), and it will decide if the input image is real or fake.– Essentially, the generator is working to confuse the discriminator and the discriminator is doing its best to differentiate between the generated images and the real images. • How do GANs get trained?– Feed the low-res image to the generator to produce an estimated hi-res image.– Feed the hi-res image to the discriminator.– Propagate the discriminator loss all the way back to the generator.* Here, we're fixing the weights of the discriminator, and we're only updating the generator in accordance with the loss generated from its attempt to fool the discriminator. 14.1. GAN Loss Function• Discriminator's Loss:

\max_{D} \frac{1}{m} \sum_{i = 1}^{m} [\log D (x_{i}) + \log (1 - D (G (z_{i})))]

• Generator's Loss:

\min_{G} \frac{1}{m} \sum_{i = 1}^{m} \log (1 - D (G (z_{i})))

• Combined GAN's Loss Function (Adversarial Min-Max):

\min_{G} \max_{D} \frac{1}{m} \sum_{i = 1}^{m} [\log D (x_{i}) + \log (1 - D (G (z_{i})))]

• How do we know when this loss function has converged?– When the accuracy of the discriminator drops to around 50%, i.e. it can't do better than random chance (that means the generator can successfully fool the discriminator). 14.2. Example• Going back to our satellite example: – The generator is going to be a CNN. – Input → Low-res satellite images– Output → estimated aerial images (hi-res)– CNN has to do: * Upsample → 4x resolution* Pixel Shuffle* Residual connections (to avoid vanishing gradients)– The discriminator is also a CNN with a sigmoid binary classifier at the end.* We're using Leaky ReLU.– Mini-batch size → 16* Note: The batch size is a bit smaller than usual. The discriminator goes through a few iterations of training before we allow the generator to start learning from the discriminator's decisions. Keeping the mini-batches smaller means that the discriminator will only get 16 images to train on before the generator gets to jump in and begin training as well. This is so the discriminator doesn't out-learn the generator; otherwise the discriminator would not supply any valuable feedback to the generator, because it has already learned so much that the gradients will be small → The generator and the discriminator have to learn together.– Adjust the learning rate:* We can also adjust (lower) the learning rate of the discriminator to make sure that the generator can still keep up and learn together with it.– Mode collapse* Mode collapse is when the generator figures out an image which can fool the discriminator and it just continues to output that same exact image since it has found out how to fool it.* This can be avoided by using an unrolled GAN → it allows the generator to see what the discriminator will look like a few steps ahead, such that the generator is encouraged not to learn some local exploitation of the discriminator.* It now has to account for what the discriminator will also look for in the future. 14.3. Evaluation• In our test set, if we have the true labels (e.g. hi-res image in the example above), we can just take the MSE per pixel.• Another thing to look at is Peak Signal to Noise Ratio (PSNR) →

20. \log (p i x e l_{\max}) - 10. \log (M S E)

→This is a gauge of how noisy the produced image is → The smaller the MSE the better the PSNR.• We can also use human raters to determine if they can tell the difference between our generated images vs. the real image. Back to Top
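The PSNR formula from the Evaluation section can be sketched in a few lines of Python. This is only an illustration: the function name `psnr`, the flat-list image representation, and the default `pixel_max=255.0` (8-bit images) are assumptions, not part of the original notes.

```python
import math

def psnr(reference, generated, pixel_max=255.0):
    """PSNR in dB for two equally sized images given as flat lists of pixel values."""
    if len(reference) != len(generated):
        raise ValueError("images must have the same number of pixels")
    # Per-pixel mean squared error between the true and the generated image.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 20 * math.log10(pixel_max) - 10 * math.log10(mse)

truth = [0, 64, 128, 255]
close = [1, 63, 129, 254]   # every pixel off by 1  -> MSE = 1
far = [10, 54, 138, 245]    # every pixel off by 10 -> MSE = 100

psnr_close = psnr(truth, close)
psnr_far = psnr(truth, far)
```

Because the MSE enters through a logarithm, a 100x larger MSE lowers the PSNR by exactly 20 dB in this toy example.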

15. Recommender Systems

15.1. Collaborative Filtering
• The goal of collaborative filtering is to use the user-item matrix to get an idea of how a user would respond to an unseen item.

15.1.1. User-based Collaborative Filtering
• In user-based collaborative filtering, the goal is to find similar users and recommend to a user a particular item that is used by other similar users.
• Note: If we're using binary indicators in the user-item matrix, we can use Jaccard/Cosine/Hamming metrics to measure similarity.
– Jaccard Similarity →

\frac{\text{no. matching } 1\text{s}}{\text{no. matching } 1\text{s} + \text{no. } (u_{i} = 1 \ \& \ u_{j} = 0) + \text{no. } (u_{i} = 0 \ \& \ u_{j} = 1)}

* Note: We don't count the number of matching 0s here.
– Cosine Similarity →

\frac{\sum_{n}^{} u_{i} u_{j}}{\sqrt{\sum_{n}^{} u_{i}^{2}} \cdot \sqrt{\sum_{n}^{} u_{j}^{2}}}

– Hamming Distance → the number of positions in which the two vectors differ →

\sum_{i = 1}^{n} 1 (x_{i} \neq y_{i})
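As a minimal sketch of the three binary measures above (Jaccard, cosine, Hamming), assuming each user is a plain Python list of 0/1 indicators. The function names and the convention of returning 0.0 for empty vectors are illustrative choices, not something the notes prescribe.

```python
import math

def jaccard_similarity(a, b):
    """Matching 1s divided by positions where at least one vector has a 1."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    only_one = sum(1 for x, y in zip(a, b) if x != y)
    # Assumed convention: two all-zero vectors get similarity 0.0.
    return both / (both + only_one) if both + only_one else 0.0

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hamming_distance(a, b):
    """Number of positions where the two vectors differ (a distance, not a similarity)."""
    return sum(1 for x, y in zip(a, b) if x != y)

u_i = [1, 1, 0, 0, 1]
u_j = [1, 0, 0, 1, 1]

jac = jaccard_similarity(u_i, u_j)   # 2 matching 1s, 2 mismatches
cos = cosine_similarity(u_i, u_j)
ham = hamming_distance(u_i, u_j)
```

Note that the matching 0 in the third position affects neither the Jaccard nor the cosine score.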

• How do we predict for a particular user (given the user similarities)? →

\widehat{response}_{u_{p}} = \frac{\sum_{u}^{} sim (u_{p}, u_{i}) \cdot response_{u_{i}}}{\sum_{u}^{} sim (u_{p}, u_{i})}
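The prediction above is a similarity-weighted average of the neighbours' responses. A minimal sketch, where `predict_response` and the sample numbers are hypothetical and the neighbours are assumed to have been selected already (e.g. by KNN):

```python
def predict_response(similarities, responses):
    """Similarity-weighted average of the neighbours' responses to one item."""
    total = sum(similarities)
    if total == 0:
        raise ValueError("no similar users to predict from")
    return sum(s * r for s, r in zip(similarities, responses)) / total

# Hypothetical neighbours of the target user u_p and their ratings of one item.
sims = [0.9, 0.5, 0.1]
ratings = [5.0, 3.0, 1.0]
prediction = predict_response(sims, ratings)  # pulled towards the most similar user's rating
```

With equal similarities the prediction reduces to the plain mean of the neighbours' responses.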

• Note: We can use KNN to find the k nearest neighbor users for prediction.
• Note: We can use MSE to evaluate our predictions.
• Note: Generally, we'd like to recommend the item that generates the highest predicted value.
• Note: If the user-item matrix is not binary, we can use the following similarity measures:
– Euclidean
– Manhattan →

\sum_{i = 1}^{n} | x_{i} - y_{i} |

– Pearson Correlation →

\frac{\sum_{n}^{} (u_{1} - {\bar{u}}_{1}) (u_{2} - {\bar{u}}_{2})}{\sqrt{\sum_{n}^{} (u_{1} - {\bar{u}}_{1})^{2}} \sqrt{\sum_{n}^{} (u_{2} - {\bar{u}}_{2})^{2}}}

– Cosine
• Steps of user-based collaborative filtering:
– Find the KNN of u_{i}, the user to form a prediction for.
– Calculate the similarity scores between u_{i} and the neighbor users.
– Using the KNN, predict the response for u_{i}.

15.1.2. Item-based Collaborative Filtering
• Here, we use the similarity between the items to make recommendations, so we calculate the item-item similarity.
• To predict the user response, we use a similar formula as above →

\widehat{response}_{p_{l}} = \frac{\sum_{p}^{} sim (p_{l}, p_{i}) \cdot response_{p_{i}}}{\sum_{p}^{} sim (p_{l}, p_{i})}

• Steps of item-based collaborative filtering:
– Calculate the item-item similarity matrix
– Predict the response for u_{i}
– Recommend the item with the highest prediction

15.1.3. Considerations on Memory-based Filtering
• Time complexity of user-based →

O (u \cdot p) \to O (u + p)

– Greater diversity ↑
– More expensive ↓ (because of KNN calculations)
• Time complexity of item-based →

O (u \cdot p^{2}) \to O (u \cdot p)

– Fewer re-calculations ↑ (items change less than users)
– Lack of diversity ↓
• The user-item matrices are sparse, so the → denotes the effective time complexity.
• Both user-based and item-based are collectively part of memory-based filtering.
• Time Decay: We can apply time decay to the predictions as follows →

\widehat{response}_{u_{p}} = \frac{\sum_{u}^{} sim (u_{p}, u_{i}) \cdot d_{t} \cdot response_{u_{i}}}{\sum_{u}^{} sim (u_{p}, u_{i})}

where

d_{t} = {0.5}^{\frac{t}{\text{half-life}}}
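A small sketch of the time-decayed prediction, assuming `t` is the age of each rating measured in the same unit as the half-life (days here, which is an assumption). The function names are illustrative.

```python
def time_decay(age, half_life):
    """d_t = 0.5 ** (t / half-life): a rating loses half its influence every half-life."""
    return 0.5 ** (age / half_life)

def predict_with_decay(similarities, responses, ages, half_life):
    """Similarity-weighted prediction in which each response is damped by its age."""
    total = sum(similarities)
    weighted = sum(
        s * time_decay(t, half_life) * r
        for s, r, t in zip(similarities, responses, ages)
    )
    return weighted / total

# Two equally similar users gave the same rating, one today and one 30 days ago.
decayed = predict_with_decay([1.0, 1.0], [4.0, 4.0], ages=[0, 30], half_life=30)
```

As in the formula, the denominator is left undecayed, so old ratings pull the prediction towards zero rather than merely counting less in the average.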

– Time decay means that as time goes on, less influence will be given to a particular rating.
• Inverse User Frequency: Another thing that we can do is to apply more weight to less frequent items →

\widehat{response}_{p_{l}} = \frac{\sum_{p}^{} sim (p_{l}, p_{i}) \cdot f_{i} \cdot response_{p_{i}}}{\sum_{p}^{} sim (p_{l}, p_{i})}

where

f_{i} = \log (\frac{N}{n + 1})

– where
* N → total number of users and
* n → no. of users who interacted with item i.
– f_{i} is called Inverse User Frequency (IUF).

15.2. Matrix Factorization
• One problem with memory-based approaches is that the user-item matrix (or the item-item similarity matrix) is extremely large.
• One approach we can use is Matrix Factorization → it takes the same user-item matrix and factorizes it into some terms as follows

U \cdot P = {\tilde{u}}^{T} \tilde{p} + b_{p} + b_{u}

• Here, both {\tilde{u}}^{T} and \tilde{p} need to be learned.
• Also, note that we have two bias terms, one for the users, b_{u}, and one for the items, b_{p}.
• The loss function for matrix factorization is,

L = \frac{1}{N} \sum_{N}^{} (U \cdot P - {\tilde{u}}^{T} \tilde{p} - b_{p} - b_{u})^{2} + \lambda (\sum_{u}^{} \| u_{i} \|^{2} + \sum_{p}^{} \| p_{i} \|^{2} + \sum_{u}^{} \| b_{u, i} \|^{2} + \sum_{p}^{} \| b_{p, i} \|^{2})

• The second term is the regularization term.

15.2.1. Implicit Ratings
• If we have some binary user-item matrix, we can transform that into a non-binary matrix of implicit ratings.
– Example: Let's say the implicit rating (r_{imp}) of an item is like this,
* 1 \to view
* 2 \to like
* 3 \to comment
– We can multiply all the binary values by (1 + \alpha r_{imp}), where \alpha is an adjusting parameter which can be tuned through cross validation.
– With the implicit ratings, the loss function looks like this,

L = \frac{1}{N} \sum_{N}^{} (U \cdot P \cdot C_{pu} - {\tilde{u}}^{T} \tilde{p} - b_{p} - b_{u})^{2} + regularization
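The transformation from binary indicators to implicit ratings can be sketched as follows. The helper name `implicit_weights`, the nested-list matrix layout, and `alpha=0.5` are assumptions made for illustration; in practice alpha would be tuned through cross validation as described above.

```python
def implicit_weights(binary_matrix, implicit_ratings, alpha):
    """Scale every binary entry by (1 + alpha * r_imp)."""
    return [
        [b * (1 + alpha * r) for b, r in zip(b_row, r_row)]
        for b_row, r_row in zip(binary_matrix, implicit_ratings)
    ]

# Rows are users, columns are items.
binary = [[1, 0, 1],
          [0, 1, 1]]
# Implicit rating per entry: 1 = view, 2 = like, 3 = comment, 0 = no interaction.
r_imp = [[1, 0, 3],
         [0, 2, 1]]

weighted = implicit_weights(binary, r_imp, alpha=0.5)
```

Entries with no interaction stay at 0, while a comment (3) ends up weighted more heavily than a view (1).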

15.2.2. Alternating Least Squares (ALS)
• Since we need to optimize with respect to both {\tilde{u}}^{T} and \tilde{p}, we can't use ordinary least squares (like in linear regression).
• We have to use something called Alternating Least Squares (found in libraries such as Spark ML).
• ALS keeps one of {\tilde{u}}^{T} and \tilde{p} constant and performs OLS on the other one. Then, it fixes the other one and runs OLS on the one that was kept fixed in the previous round.
• ALS performs well in practice, and we can also tune the dimensions of U and P.

15.2.3. Predicting with ALS
• Here's the initial equation →

U \cdot P = {\tilde{u}}^{T} \tilde{p} + b_{p} + b_{u}

• By estimating this equation with ALS, we can get the prediction for a particular user and item as follows →

pred (u_{i}, p_{j}) = {\tilde{u}}_{row_{i}}^{T} {\tilde{p}}_{column_{j}} + b_{p_{j}} + b_{u_{i}}
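To make the alternating idea concrete, here is a toy ALS in pure Python. It is a sketch under simplifying assumptions, not the Spark ML implementation: the bias terms b_u and b_p are dropped, the squared error is summed rather than averaged, only observed ratings are fitted, and all names (`als`, `least_squares_step`, `solve`, the sample ratings) are hypothetical.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def solve(a, b):
    """Solve a x = b for a small dense system (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[r]] for r, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - dot(m[r][r + 1:n], x[r + 1:])) / m[r][r]
    return x

def least_squares_step(fixed, observed, lam):
    """Regularized least squares for one factor vector while the other side is fixed.

    observed is a list of (index_into_fixed, rating) pairs."""
    k = len(fixed[0])
    a = [[lam if r == c else 0.0 for c in range(k)] for r in range(k)]
    b = [0.0] * k
    for idx, rating in observed:
        f = fixed[idx]
        for r in range(k):
            b[r] += rating * f[r]
            for c in range(k):
                a[r][c] += f[r] * f[c]
    return solve(a, b)

def objective(ratings, users, items, lam):
    error = sum((r - dot(users[i], items[j])) ** 2 for (i, j), r in ratings.items())
    penalty = sum(x * x for vec in users + items for x in vec)
    return error + lam * penalty

def als(ratings, n_users, n_items, k=2, lam=0.1, n_rounds=10, seed=0):
    rng = random.Random(seed)
    users = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    items = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    history = [objective(ratings, users, items, lam)]
    for _ in range(n_rounds):
        for i in range(n_users):  # items fixed, solve for each user vector
            seen = [(j, r) for (u, j), r in ratings.items() if u == i]
            users[i] = least_squares_step(items, seen, lam)
        for j in range(n_items):  # users fixed, solve for each item vector
            seen = [(u, r) for (u, p), r in ratings.items() if p == j]
            items[j] = least_squares_step(users, seen, lam)
        history.append(objective(ratings, users, items, lam))
    return users, items, history

# Observed (user, item) -> rating entries of a tiny 3 x 3 user-item matrix.
ratings = {(0, 0): 5, (0, 1): 3, (1, 0): 4, (1, 2): 1, (2, 1): 2, (2, 2): 5}
users, items, history = als(ratings, n_users=3, n_items=3)
prediction = dot(users[0], items[2])  # predicted response of user 0 to unseen item 2
```

Each half-step solves its sub-problem exactly, so the regularized objective tracked in `history` cannot increase from one round to the next.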

• The problem with the above approach is that we'd have to retrain for every new customer that comes in.
• Note: To evaluate the performance of the predictions, we can use MSE on the validation set.
• How can we avoid having to retrain these models for every new customer?

15.3. Deep Learning Extension
• To avoid the prediction limitation of ALS, we can use the deep learning extension of matrix factorization.
– In this case, the row and column of the user-item matrix are the inputs to the NN.
– These inputs have their own separate fully connected layers, which act as embedding layers.
– Then, the results of these embedding layers are combined into a single fully connected layer and finally sent through a linear activation function.
– The NN output is the prediction for a particular user and item.
• This typically requires less retraining.

15.4. Challenges of Collaborative Filtering
• Cold-start Problem
– We can't generate predictions for brand new users with no information (or items with no purchase history).
– There are some ways to mitigate this, such as:
* Recommending popular items to new users
* Presenting new items to random subgroups
• Echo Chamber
– Let's say we recommend an item to a user, which increments some of their values in the user-item matrix.
– Now that the value has been incremented, it carries more weight for other users, and it spins into a positive feedback loop of recommending the same or very similar items over and over.
– This doesn't make for a very good customer experience because these users won't see the variety that they need to see (they just see the same thing).
– One way to avoid that is to add some random new items into the mix.
• Shilling Attacks
– This happens in systems where everyone can provide a rating.
– For example, someone can create a bunch of fake accounts to promote their own products on a platform (or, similarly, to down-rate a product).
– We can limit that by allowing one user per phone number, for example.

15.5. Content-based Filtering
• Content-based filtering represents users with respect to the items that they've interacted with.
• Now, we take the dot product of the user vector with the vector of every other item, and the item with the highest dot product is recommended.
• Content-based filtering doesn't require any data about other users to make a prediction about a particular user.
– So, if you have tons of users, you can use content-based filtering to avoid running KNN.
• The downside of content-based filtering is that it requires context about the items, i.e. products need to be evaluated in terms of some of their characteristics.

15.6. Deep Learning Recommender Systems
• We can use deep learning to combine collaborative and content-based filtering.
• It's going to be a similar NN to the one described in Section 15.3., but with the addition of inputs (and embedding layers) for user and item features.
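The dot-product recommendation described under Content-based Filtering (Section 15.5.) can be sketched as below. Building the user vector as the sum of the feature vectors of the items they interacted with is one simple assumption; the notes do not specify how the user vector is formed. Item names and features are made up for the example.

```python
def user_profile(item_features, interacted):
    """Represent a user as the sum of the feature vectors of the items they interacted with."""
    k = len(next(iter(item_features.values())))
    profile = [0.0] * k
    for item in interacted:
        for d, value in enumerate(item_features[item]):
            profile[d] += value
    return profile

def recommend(item_features, interacted):
    """Return the unseen item whose feature vector has the highest dot product with the user."""
    profile = user_profile(item_features, interacted)
    candidates = [i for i in item_features if i not in interacted]
    return max(
        candidates,
        key=lambda i: sum(p * f for p, f in zip(profile, item_features[i])),
    )

# Hypothetical item features: [is_action, is_comedy, is_documentary]
items = {
    "film_a": [1, 0, 0],
    "film_b": [1, 1, 0],
    "film_c": [0, 0, 1],
    "film_d": [1, 0, 1],
}
seen = {"film_a", "film_b"}
profile = user_profile(items, seen)
best = recommend(items, seen)
```

No other user's data appears anywhere in the computation, which is the property the section highlights.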

16. Learning To Rank
• In the Recommender Systems section, Section 15., we talked about how to recommend an item to a user.
– However, it only considered one item at a time per prediction.
– The prediction, which is a score or probability, helps us gauge whether or not we should recommend an item.
• Let's say, instead of that, we wanted to create a feed or a series of posts (or items) for a user to scroll through on their home screen.
• Now, this is a question of what order (or ranking) we place a series of items in to provide the best feed for a user.
– This is quite a different problem from classification/regression.

16.1. Framing a Ranking Problem
• Our first goal is to extract relevant posts (or items) from all of the posts available to the user → This is typically called Candidate Generation or Obtaining the Top K.
• Once we have the top K, we rank them in some particular order for the feed.

16.2. Candidate Generation
• To perform candidate generation we can again use matrix factorization.
• In matrix factorization, we perform this,

U \cdot P = {\tilde{u}}^{T} \tilde{p} + b_{p} + b_{u}

• Note that both {\tilde{u}}^{T} and \tilde{p} are in fact embeddings that represent some characteristics of a user or an item.
• To get the top K items for a user, we can take the items closest to the user in terms of the embedding vectors {\tilde{u}}^{T} and \tilde{p}.
– We can use Euclidean distance, cosine similarity, or dot product.
• Note: If, instead of matrix factorization, we've used a deep recommender system, then we could use the embedding layers in the DL model.

16.3. Ranking the Top K
• Given the top K items, we now need to rank these top candidates.
• The (bad) solution would be to just rank the items by their distance from a particular user (a closer item ranks higher). There are a couple of problems with this method:
– There could be tons of users and items, whereas once we've extracted the top K, we're only dealing with a small fraction of them and can afford a more sophisticated model.
– What if we had different systems generating our candidates?
* We could subset our users into groups (e.g. user traffic coming from Amazon or Google).
* These groups would exist in separate embedding spaces. If we want to rank items across different embedding spaces, we need some mechanism beyond the distances, because distances don't translate across different embeddings.

16.3.1. Learning To Rank
• Learning to rank models the probability that some item i should appear before some item j,

P (i > j) = {\hat{y}}_{i, j} = \frac{1}{1 + e^{- (s_{i} - s_{j})}}

• where s_{i} = f (u, d_{i}) and s_{j} = f (u, d_{j}), and d_{i} and d_{j} indicate items i and j.
• The functions, f, are neural networks.
• Here, it's modeled as a sigmoid function of the difference between s_{i} and s_{j}.
• Note: The user feature term, u, doesn't have to be a user per se. It can also be a query.
– In fact, originally, learning to rank was used as a search optimization tool.
– In our case, we're going to treat queries as users.

16.3.2. Learning To Rank Loss Function
• The loss function is the usual loss for a sigmoid output, i.e. binary cross-entropy,

L (y_{ij}, {\hat{y}}_{ij}) = - \sum_{i \neq j}^{} [ y_{ij} \log {\hat{y}}_{ij} + (1 - y_{ij}) \log (1 - {\hat{y}}_{ij}) ]
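The pairwise probability and the loss above can be sketched as follows. The scores stand in for the outputs of the scoring function f; here they are hard-coded hypothetical numbers rather than neural network outputs, and the function names are illustrative.

```python
import math

def pair_probability(s_i, s_j):
    """P(i > j): sigmoid of the score difference."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))

def pairwise_loss(scores, labels):
    """Binary cross-entropy summed over labelled pairs.

    scores: dict item -> score s; labels: dict (i, j) -> y_ij."""
    loss = 0.0
    for (i, j), y in labels.items():
        y_hat = pair_probability(scores[i], scores[j])
        loss -= y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)
    return loss

# Labels saying A should outrank B, B should outrank C, and C should not outrank A.
labels = {("A", "B"): 1, ("B", "C"): 1, ("C", "A"): 0}

good_scores = {"A": 2.0, "B": 1.0, "C": 0.0}  # agrees with the labels
bad_scores = {"A": 0.0, "B": 1.0, "C": 2.0}   # reversed order

good_loss = pairwise_loss(good_scores, labels)
bad_loss = pairwise_loss(bad_scores, labels)
```

Scores that agree with the labelled order give a lower loss than the reversed scores, and two equal scores give a probability of exactly 0.5.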

• We're also going to use gradient descent to minimize the loss function.

16.4. RankNet
• Let's say we have candidate documents A, B and C which we want to rank.
• The first step is to generate all unique pairs of these documents, because our loss function takes into account the probability that document i outranks document j → so, we have to represent all ij pairs.
• Now, let's say the user clicked on A and B when presented with the following feed: A ✓, C ×, B ✓ → so, our label ordering should be A, B, C (simply because the user clicked on A and B and not on C).
• Now, we assign the labels y_{ij} as follows:
– A, B \to y_{12} = 1
– B, C \to y_{23} = 1
– C, A \to y_{31} = 0 (because C does NOT rank over A)
• Note: If there were more documents (that the user hasn't seen), we'd assign them 0.5, because we don't know which one the user will click on.
• Now, our neural network is going to look like this: (figure not included)
• Note: The neural networks are the f functions that give us s_{i} and s_{j} for each pair of documents.
• Now, we can plug s_{i} and s_{j} into equations (35) and (36) to calculate updates based on the gradient of the loss function →

w^{t + 1} = w^{t} - r \nabla_{w} L

– The only difference here (compared to a regular NN update) is that we feed the documents of a pair into the neural network one at a time and apply the difference in their outputs to the overall loss function, instead of using two documents at once as an input.
– This strategy, which is called RankNet, was designed by Microsoft.
• Now, let's say we trained our NN with these updates and we get a new document we haven't seen.
– We first generate the pairwise documents.
– We then repeat the same process described above, except this time we simply plug s_{1} and s_{2} into the model and get a probability.
* If probability > 0.5 → we rank item 1 above item 2.
– Then, we go to the next pair of documents and repeat this process.
– Eventually, we come up with a consistent order in which to place the documents.
* Consistent means that P (i > j) propagates consistently across the entire ranking. For example, if in training P (1 > 2) = 0.5 (i.e. rank uncertainty) AND P (2 > 3) = 0.5, then P (1 > 3) = 0.5 → i.e. complete uncertainty propagates.
* Same thing for complete certainty: if P (1 > 2) = P (2 > 3) = 1, then P (1 > 3) = 1.

16.5. LambdaNet
• The pairwise RankNet loss may not be the best penalty.
– This means that we may want to consider more than just two documents at a time when trying to rank all the documents.
• It's also pretty inefficient.
– We have to put all pairs of documents through the neural network and perform SGD on every one of them.
• In terms of learning to rank, after RankNet, there came something called LambdaNet (known in the literature as LambdaRank).
• Two improvements came with LambdaNet:
– It is more efficient. The gradient is factorized such that the gradient update for a document in comparison to all the others can be found without evaluating every pair separately.
– LambdaNet enables us to use better metrics such as nDCG for evaluating a particular ranking, where nDCG → normalized Discounted Cumulative Gain.
* It considers more than just a pair of documents at a time.

16.6. nDCG
• Let's see how nDCG works.
• Let's say we have the following documents, in ranked order, along with their probabilities:
– A \to 0.5
– B \to 1
– C \to 0
• Here, if we just use the number of inversions, we would count one inversion (A ⇆ B) away from the optimal ranking → i.e. the model will be penalized by one.
• Now, let's consider this case:
– C \to 0
– A \to 0.5
– B \to 1
• The problem is that the inversion (A ⇆ B) incurs the same penalty here as when it occurred higher up the list (previous case).
• nDCG considers documents up to some position p,

DCG_{p} = \sum_{i = 1}^{p} \frac{2^{rel_{i}} - 1}{\log_{2} (i + 1)}

• where rel_{i} is the relevance of document i, which here is those probabilities.
• Now, if we calculate DCG_{3} (because there are 3 documents), we get 1.04 for the first case and 0.76 for the second case.
– It means that even though inversions would penalize (A ⇆ B) the same, the DCG is clearly lower for the second case, simply because these elements appeared further down the list.
• How do we compare DCG across different users?
– To do that, we take the DCG for a particular user and divide it by the ideal DCG → i.e. IDCG,

nDCG_{p} = \frac{DCG_{p}}{IDCG_{p}}
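The DCG and nDCG formulas can be checked against the three-document example of this section with a short sketch. The function names are illustrative, and the ideal ordering is assumed to be the relevances sorted in descending order.

```python
import math

def dcg(relevances):
    """DCG over a ranked list of relevances; position i is 1-based."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(relevances, start=1)
    )

def ndcg(relevances):
    """DCG divided by the DCG of the ideal ordering (relevances sorted descending)."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

first_case = [0.5, 1, 0]    # ranked A, B, C
second_case = [0, 0.5, 1]   # ranked C, A, B

dcg_first = dcg(first_case)
dcg_second = dcg(second_case)
ndcg_first = ndcg(first_case)
```

The computed values agree with the numbers quoted in the text up to rounding: roughly 1.045 and 0.761 for the two DCG values, and about 0.83 for the nDCG of the first case.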

• The IDCG in our example comes from the ideal ordering:
– B \to 1
– A \to 0.5
– C \to 0
• So, the IDCG_{3} for our example is 1.26 and nDCG_{3} = \frac{1.04}{1.26} \approx 0.83.
• nDCG allows us to use the same loss for all users.
• The only problem with incorporating nDCG into our loss function is that it's not differentiable.
• What LambdaNet does is that, instead of defining the gradient as the gradient of the loss function itself, it assigns the gradient a value called lambda (\lambda), which is defined as,

\lambda_{ij} = \frac{1}{1 + e^{- (s_{i} - s_{j})}} \cdot | \Delta nDCG_{ij} |
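A rough sketch of how the lambda above could be computed, taking | \Delta nDCG_{ij} | to be the absolute change in nDCG when the documents at positions i and j swap places. The sigmoid factor follows the formula exactly as written in these notes; function names and the equal placeholder scores are hypothetical.

```python
import math

def dcg(relevances):
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(relevances, start=1)
    )

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def delta_ndcg(relevances, i, j):
    """|change in nDCG| when the documents at (0-based) positions i and j are swapped."""
    swapped = list(relevances)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(ndcg(swapped) - ndcg(relevances))

def lambda_ij(scores, relevances, i, j):
    """Sigmoid of the score difference, scaled by the nDCG change of swapping i and j."""
    sigmoid = 1.0 / (1.0 + math.exp(-(scores[i] - scores[j])))
    return sigmoid * delta_ndcg(relevances, i, j)

first_case = [0.5, 1, 0]    # A, B, C: the A/B inversion sits at the top of the list
second_case = [0, 0.5, 1]   # C, A, B: the same inversion sits lower down

top_swap = delta_ndcg(first_case, 0, 1)
low_swap = delta_ndcg(second_case, 1, 2)
lam = lambda_ij([0.0, 0.0, 0.0], first_case, 0, 1)  # equal scores -> sigmoid = 0.5
```

Swapping A and B matters more at the top of the list than further down, which is exactly the position sensitivity that a plain inversion count lacks.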

• At the time, the authors didn't have any mathematical backing for it, but it was later proven that this was completely fine and that it actually does optimize for nDCG.
• Now that we use \lambda instead of the gradients, how do we update our weights?
– It's extremely similar to our old update function. The only difference is that we're subtracting different terms,

w_{k}^{t + 1} = w_{k}^{t} - r \sum_{ij}^{} \lambda_{ij} (\frac{\partial s_{i}}{\partial w_{k}} - \frac{\partial s_{j}}{\partial w_{k}})

• Note: There's no partial derivative with respect to some cost function in equation (40).

16.7. LambdaMART
• The only difference from LambdaNet is that it uses a gradient boosted tree in place of the neural network.
• LambdaMART does appear to be pairwise because it compares pairs of documents, but nDCG considers elements beyond the pairs → so, technically, it's not a pairwise model.
• How does MART work?
– MART takes a single example, {\vec{x}}_{i} = [u_{1}, u_{2}, \dots, p_{1}, p_{2}, \dots] → i.e. some user features + item features, where features can be embeddings or explicit features.
– The example is fed to the first tree and ends up at a leaf node after it traverses down the tree.
– The error is going to be the prediction at the leaf node, y_{i}, divided by the weight w_{i}.
* Here, y_{i} is just the \lambda, w_{i} is just the gradient of the \lambda with respect to the output at the leaf node, and \frac{y_{i}}{w_{i}} is the error that gets passed on to the next tree to be learned.
* Note: The i index just indicates that we perform the above calculations for all the leaf nodes.
* Note: By calculating \frac{y_{i}}{w_{i}} (i.e. \lambda divided by the derivative of \lambda), we're taking what's called a Newton Step to figure out the error to be trained on in the next tree.
– Using the concept of MART with our Lambda (instead of the actual gradient), we end up with LambdaMART.

16.8. Other Notes
• So far, in our examples, we just talked about whether or not a user clicked a particular item/post/document.
– We don't have to use just clicks. We can have things like:
* A user liked a particular post.
* A user commented on a particular post.
* Or a user performed no action.
– We can map these actions like this:
* N/A → 0
* Clicked → 1
* Liked → 2
* Commented → 3
* Messaged → 4
– This is called Implicit Relevance Feedback (similar to Implicit Ratings in Recommender Systems).
• In addition to nDCG, we can also use Mean Average Precision (MAP) (for binary cases).
• For implicit relevance feedback, we could transform it to binary by choosing some cutoff point (e.g. anything > 3 is labeled as relevant (i.e. 1), and as irrelevant (i.e. 0) otherwise).
• We could also use Mean Reciprocal Rank (MRR) (for binary cases).
• Note: We can average these metrics across all users to evaluate the model.
• Sometimes it's preferable to encode your n-ary labels (as in implicit relevance feedback) as binary so that we can use those binary evaluation metrics just for another perspective.
• All of the evaluation metrics that we talked about can also be used in the Lambda, that is,

\begin{array}{c} \lambda_{ij} = \frac{1}{1 + e^{- (s_{i} - s_{j})}} \cdot | \Delta nDCG_{ij} | \quad \text{or} \\ \lambda_{ij} = \frac{1}{1 + e^{- (s_{i} - s_{j})}} \cdot | \Delta MAP_{ij} | \quad \text{or} \\ \lambda_{ij} = \frac{1}{1 + e^{- (s_{i} - s_{j})}} \cdot | \Delta MRR_{ij} | \end{array}
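The binary metrics mentioned above (MAP and MRR), together with the cutoff-based binarization of implicit relevance feedback, can be sketched as below. The names are illustrative, and average precision is assumed to be normalized by the number of relevant items that appear in the ranked list.

```python
def binarize(relevance_levels, cutoff):
    """Turn n-ary implicit feedback into 0/1 labels: relevant if strictly above the cutoff."""
    return [1 if level > cutoff else 0 for level in relevance_levels]

def reciprocal_rank(relevant_flags):
    """1 / rank of the first relevant item, or 0 if there is none."""
    for rank, flag in enumerate(relevant_flags, start=1):
        if flag:
            return 1.0 / rank
    return 0.0

def average_precision(relevant_flags):
    """Mean of precision@rank taken at every rank that holds a relevant item."""
    hits, total = 0, 0.0
    for rank, flag in enumerate(relevant_flags, start=1):
        if flag:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean(values):
    return sum(values) / len(values)

# Hypothetical ranked feeds for two users (1 = relevant, 0 = irrelevant).
feeds = [[0, 1, 1], [1, 0, 0]]
mrr = mean([reciprocal_rank(f) for f in feeds])
map_score = mean([average_precision(f) for f in feeds])

# Implicit feedback levels 0..4 (N/A, clicked, liked, commented, messaged) with cutoff 3.
flags = binarize([0, 1, 2, 3, 4], cutoff=3)
```

Averaging the per-user values, as done for `mrr` and `map_score`, corresponds to the note about averaging these metrics across all users.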

• Sometimes we may want to de-bias our click data.
– Clicks are susceptible to Presentation/Trust Bias.
– Basically, appearing lower in the ranking can affect:
* Whether the post is even seen.
* Even if it is seen, the fact that it occurs lower in the list could make the user not trust that result and not click on it.
* This could be more relevant in search engines.
– How do we account for this bias?
* We just have to divide by the bias in the lambda equation

\lambda_{ij} = \frac{\frac{1}{1 + e^{- (s_{i} - s_{j})}} \cdot | \Delta nDCG_{ij} |}{b_{i} b_{j}}

– b_{i} is the probability that some item in rank i is clicked over some other relevant item that's ranked lower than it.
– b_{j} is the same as b_{i} but in terms of irrelevant items.

Leaky ReLU: f (x) = \begin{cases} x & \text{if } x > 0 \\ 0.01 x & \text{otherwise} \end{cases}