1. Naive Bayesβ’ Let's say we want to identify spam messages.β’ We can use Bayes Theorem to formulate probability of a spam message based on appearance of some words in the message.P(spam|w)=P(spam).P(w|spam)
P(spam).P(w|spam)+ P(not spam).P(w|not spam)β’ w represents the vocabulary, V ={w1,w2,β¦,wn}, i.e. just a list of words that our model recognizes.β’ P(spam) indicates the probability of seeing a spam message regardless of the word.β’ Note:P(spam) is called priors, and P(w|spam) and P(w|not spam) are called likelihoods. The denominator is called the evidence, and P(spam|w) is called the posterior.β’ Note: Since we're modeling the presence/absence of a particular word, this is called a Bernoulli Model.β’ In order to calculate the probability of spam, P(wi|spam), given any particular word, wi, we use chain rule in probability.P({Β¬w1,Β¬w2,β¦,wi,β¦,Β¬wn}|spam)=P(Β¬wn|spam).P(Β¬wn-1|spam,Β¬wn).P(Β¬wn-2|spam,{Β¬wn,Β¬wn-1})....P(wi|spam,{Β¬wn,Β¬wn-1,β¦,Β¬wn-i})....P(Β¬w1|spam,{Β¬wn,Β¬wn-1,β¦,wi,β¦,Β¬w2})β’ β’ where Β¬w indicates not existence of word w in a message.β’ If n is large, then the amount of calculations will get really high. So, we use a simplifying assumption that words are independent of each other. Therefore, equations (2) becomes,P({Β¬w1,Β¬w2,β¦,wi,β¦,Β¬wn}|spam)=P(Β¬wn|spam).P(Β¬wn-1|spam).β¦.P(wi|spam).β¦.P(Β¬w1|spam)β’ Note: The simplifying assumption can potentially disregard some useful information since some words are more likely to appear in a sentence, e.g. London and England. β Due to this simplifying assumption, this model is called Naive Bayes.β β’ Note:P(wk|spam) is the probability of seeing word wk in a spam message, i.e. no. of spam messages with word w
Total no. of messages with word w , similarly for P(Β¬wk|spam) or P(wk|not spam).β’ β’ Note: In practice, each word, wk, is represented by one-hot encoded vector.β’ The Problem of Zero Probability: Since probability of different words are multiplied to each other, if the probability of one word (or more) is 0, then it'll make the entire probability 0. β In order to solve this issue, Naive Bayes applies Laplace Smoothing to every word in the vocabulary.1.1. What is Laplace Smoothing?β’ Laplace smoothing is a smoothing technique that handles the problem of zero probability in Naive Bayes. Using Laplace smoothing, we can represent P(wk|spam) as,P(wk|spam)=no. of spam messages with word w + πΌ
N + πΌ.nβ’ where:β N is the total number of spam messages.β n is the vocabulary size.β πΌ is the smoothing parameter.* Using higher πΌ values will push the likelihood towards a value of 0.5, i.e. the probability of a word is equal to 0.5 for both spam and not spam messages.* This is not so useful. In practice, it's preferred to use πΌ=1.1.2. How to Prepare the Data for Naive Bayes?β’ Let's say we have the following message:β "Hey, good point here - This is interesting."β’ Here are the steps we do to prepare this sentence:β Remove white spaceβ Remove punctuationβ Tokenizing (creating a list of words/token) β ["Hey", "good", "point", "here", "-", "This", "is", "interesting"]β Remove stop words (i.e. words that don't add much information) β ["Hey", "good", "point", "-", "interesting"]β Remove non-alphabetic words β ["Hey", "good", "point", "interesting"]β Stemming (i.e. Removing the ending modifiers of words, leaving the stem of the word) β ["Hey", "good", "point", "interest"]* Lemmatization: A more calculated form of stemming which ensures the proper lemma results from removing the word modifiers.* The problem with lemmatization is that it is often more expensive (than stemming). So, with large data, you may want to go with stemming over lemmatization.StemmingLemmatizationStudying β StudyStudies β StudiStudying β StudyStudies β Studyβ’ We can represent words in terms of binary vectors of 0 and 1. This is called vectorization.Back to Top
2. Performanceβ’ Continuing from section Section 1., let's say we want to measure how good the spam detection model is.β We can do so by determining a cutoff point/decision point (say 0.5) and predict spam when the probability is >0.5 and not-spam otherwise. β Then, we can calculate the accuracy metric, which is simply the number of correctly predicted divided by total number of examples.β’ Note: Accuracy is not such a good metric to measure the performance of the spam model because 93% of the data is not-spam β imbalanced dataset. So, by just predicting 0 for all the examples, we'd got 93% in accuracy.β’ Note: One problem, particularly with imbalanced data, is that we often care more about the performance on the minority class which in this case is predicting spam examples correctly. β There are two ways the model could predict a spam incorrectly:* False Positive β predicting spam when it's actually a not-spam.* False Negative β predicting not-spam when it's actually a spam.* The other cases are called True Positive β predicting spam when it's actually a spam and True Negative β predicting not-spam when it's a not-spam.2.1. Confusion Matrixβ’ We can summarize all the above in something called a confusion matrix.
TN+FP β it represents the classifier's ability correctly classify the not-spam messages (or negative cases). Higher Specificity β fewer False Positives.β’ β Note: In the case of spam detection model, we'd prefer higher specificity such that an important message wouldn't be falsely classified as spam.β Note: In some other problems such as cancer detection, we'd prefer higher sensitivity because we want as few false negatives as possible.β’ Precision =TP
TP + FP β it just measures how accurately the positives are classified.β’ F1Score =2.(sensitivity Γ precision)
sensitivity + precision β it is the harmonic mean of the sensitivity and precision.2.2. Can we control the sensitivity and specificity tradeoff?β’ Higher Sensitivity β Less FN β’ Higher Specificity β Less FPβ’ We can change the tradeoff by changing the cutoff point.55%TPFNFPTN2.3. Cross Validationβ’ To test the performance of our model, we usually split the data into three parts:β Training setβ Validation setβ Test setβ’ The validation set gives the opportunity to tune our model without using the test set itself.β’ We use the test set merely for evaluating our model performance on unseen examples.2.4. Receiver Operator Characteristic (ROC) curveβ’ ROC curve is plotted on sensitivity on one axis and 1-specificity on the other axis.β’ As we tune our model on the validation set, we can plot the sensitivities and specificities that each cutoff threshold produces. β The 45Β° line shows that for every positive example that we correctly classify, we also incorrectly classify a negative example.β The goal for every model should be to always lie above or be better than the 45Β° line.β To obtain a good balance specificity and sensitivity, we ought to pick a threshold that maximizes the distance away from the 45Β° line.sensitivity1011-specificityβ’ In order to compare different models, we use the Area Under the Curve (AUC) of ROC. Whichever model that has higher AUC is the model that we can confidently say is a better predictor.sensitivity1011-specificityAUC2.5. Hyperparameter Tuningβ’ Hyperparameters are parameters that go along with the model that you don't necessarily train.2.6. Types of Cross Validationβ’ Hold-out Validation β We assign a subset of examples to be our validation set.β’ K-fold Validation β We train k different models and use a different validation set each time. β’ Leave-One-Out Validation β It's the k-fold validation when k=n, where n is the number of examples β more used when we have small amount of data.2.7. Steps of ML Modeling1. Problem2. Hypothesis3. Simple Heuristic4. Measure Impact5. More Complex Technique6. Measure Impact7. Tune Model8. Replace Existing TechniqueBack to Top
di* So, if the distance to neighbor node i, di, gets smaller it will get higher weight.* Note: Numerator is fixed and only the denominator changes.β KNN is computationally expensive β O(n.d.k) where n is the number of nodes, and d is the dimension of each node.* To reduce the complexity down, we can use a k-d tree data structure which allows us to locate the relevant points faster than iterating over all the points. * Note that k-d tree also suffers from increase in dimensionality.β’ Example: Cyber Securityβ Every once in a while these programs will make calls to the operating system (i.e. system calls).β Our goal is to detect intrusive processes/programs in all the machines in a system.β Our features are going to be system calls which look like this:* pid1=['mkdir', 'mount', 'poll', 'chown']* pid2=['chroot', 'bind', 'fork', 'open', 'pready']* pid3=...β The labels are instrusive β 1 and non-intrusive β 0.β We can treat system calls as just words and perform TF-IDF on the them to make them numerical.* pid1=[0.15, 0.12, 0.13, 0.31]* pid2=[0.16, 0.31, 0.46, 0.21, 0.11]* pid3=...β For an unlabeled example, we calculate its distance to all other nodes, and determine its label based on the majority voting of its closest k neighbors.β Let's say we also have a categorical features named 'priority'. It has three values: High, Medium, Low. Let's calculate Gower distance.* High β 0, Medium β 1, Low β 2* Gower Distance = valueβmax(value)* dGower(high)= 0β2, dGower(medium)= 1β2, dGower(low)= 2β2Back to Top
nβ’ β MSE split β To get the MSE of a split, we just add up all the squared differences (like equation (5)) and divide it by the total number of examples under that split.β We choose the split that gives us the smallest MSE.β’ For prediction β We just average the values of observations in a node.5.8. How to handle categorical features in CART?β’ If the categorical feature is binary, the split is easy β each node get one of the values. But what about categorical features with more than 2 values?β’ In this case, we have to make splits for every possible combinations. β’ For Example: Let's say we want represent this list of countries: [US, UK, IRI, GER]. The splits will be like this:β [US] and [UK, IRI, GER]β [US, UK] and [IRI, GER]β [US, IRI] and [UK, GER]β and so on ...β’ Note: For a categorical feature with n distinct values, there will be 2n-1 subsets.5.9. What are the down sides of CART?β’ CARTs tend to overfit very easily.β As the depth increases, the ability of the model to overfit increases as well.β Solutions:* Limit the max_depth to 2 to 5.* Use Boosting (refer below for a simple explanation of how boosting works).* Use Bagging β Bootstrap Aggregationβ CART is rarely used in practice. Instead, some ensemble variations of it is used very often. Examples of such methods β XGBoost, LightGBM, CatBoost, AdaBoost, RandomForest, etc.5.9.1. Boostingβ’ Boosting is very simply training another tree model on the error. In other words, boosting is an ensemble of weak learners.β Suppose, we train a tree and predict 230 where the true value is 270. Boosting in this case is to train on the error of first tree (i.e. 270 - 230 = 40).β In general, boosting for several times looks like this:* Pred(xi)= tree1(label(xi))+ tree2(error(tree1))+ tree3(error(tree2))+...β Note: Boosted trees also overfit easily. We need to perform cross-validation to find the best number of trees and max_depth.β Note: In boosted trees, each individual tree is called a weak learner.* These learners are defined as having better performance than random chance.5.9.2. Baggingβ’ In bagging, we create many trees based on sampling from data (observations) and also training on subset of features. To get predictions, we just average the predictions of all trees β This is called Random Forest.β Note: Out of pure math, it works out that roughly 36.7% (i.e. 1βe) of examples won't be trained on β We automatically get out-of-bag sample which we can use as our validation set.β Note: Bagging reduces the variance.β Note: Another variant is C4.5. * It can only do classification.* It can do n-ary splits (instead of binary splits)* It uses information gain based on entropy (instead of gini index)5.9.3. XGBoost vs. LightGBMβ’ Sourceβ’ What is Gradient Boosting? Gradient Boosting refers to a methodology in machine learning where an ensemble of weak learners is used to improve the model performance in terms of efficiency, accuracy, and interpretability.β These learners are defined as having better performance than random chance.β The hypothesis is to filter out instances that are difficult to accurately predict and develop new weak learners to handle them.β’ How Gradient Boosting Works?(a) The initial model is trained and predictions are run on the whole dataset. (b) The error between the actual value and prediction is calculated and more weight is given to the incorrect predictions. (c) Subsequently, a new model that attempts to fix the error of the previous model and in a similar way several models are thus created. We arrive at the final model by weighting the mean of all models.β’ Gradient Boosting Can Be Applied To:β Regression β taking the average of the outputs by the weak learnersβ Classification β finding the class prediction occurring the maximum number of timesβ β’ What is XGBoost?β XGBoost βeXtreme Gradient Boostingβ XGBfocuses on computation speed and model performance. It was introduced by Tianqi Chen and is currently a part of a wider toolkit by DMLC (Distributed Machine Learning Community).β It can be used for both classification and regression.β It supports the following kinds of boosting:* Gradient Boosting as controlled by learning rate* Stochastic Gradient Boosting that leverages sub-sampling at a row, column or column per split levels* Regularized Gradient Boosting using L1 (Lasso) and L2 (Ridge) regularizationβ Some of the other features that are offered from a system performance point of view are:* Using a cluster of machines to train a model using distributed computing* Utilization of all the available cores of a CPU during tree construction for parallelization* Out-of-core computing when working with datasets that do not fit into memory* Making the best use of hardware with cache optimizationβ In addition to the above the framework:* Accepts multiple types of input data* Works well with sparse input data for tree and linear booster* Supports the use of customized objective and evaluation functions.β To learn more about XGBoost, check out this page.β’ What is LightGBM?β Similar to XGBoost, LightGBM (by Microsoft) is a distributed high-performance framework that uses decision trees for ranking, classification, and regression tasks.β The advantages are as follows:* Faster training speed and accuracy resulting from LightGBM being a histogram-based algorithm that performs bucketing of values (also requires lesser memory)* Also compatible with large and complex datasets but is much faster during training* Support for both parallel learning and GPU learningβ In contrast to the level-wise (horizontal) growth in XGBoost, LightGBM carries out leaf-wise (vertical) growth that results in more loss reduction and in turn higher accuracy while being faster. * But this may also result in overfitting on the training data which could be handled using the max-depth parameter that specifies where the splitting would occur. Hence, XGBoost is capable of building more robust models than LightGBM.β’ Structural Differences between XGBoost and LightGBMβ It feels like LightGBM is significantly faster than XGBoost but delivers almost equivalent performance. We might wonder, what are exactly the differences between LightGBM and XGBoost?β β Leaf Growth:LightGBM has a faster rate of execution along with being able to maintain good accuracy levels primarily due to the utilization of two novel techniques:* Gradient-Based One-Side Sampling (GOSS)Β· In Gradient Boosted Decision Trees, the data instances have no native weight which is leveraged by GOSS.Β· Data instances with larger gradients contribute more towards information gain.Β· To maintain the accuracy of the information, GOSS retains instances with larger gradients and performs random sampling on instances with smaller gradients.Β· Note: We can learn more about this concept in the article β What makes LightGBM lightning fast?Β· Note: The YouTube channel Machine Learning University also released a video on LightGBM speaking about GOSS.* Exclusive Feature Bundling (EFB)Β· EFB is a near lossless method to reduce the number of effective features.Β· Just like One-Hot encoded features, in the sparse space, many features rarely take non-zero values simultaneously. Β· To reduce dimensionality, improve efficiency, and maintain accuracy, EFB bundles these features, and this bundle is called an Exclusive Feature Bundle.Β· This thread on EFB and LightGBMβs paper can be referred to gain better insight.* On the other hand, XGBoost uses a pre-sorted andhistogram-based algorithm for computing the best split, which is done with GOSS in LightGBM. The pre-sorting splitting works as:1. For each node, enumerate over all features2. For every feature, sort instances by the feature value3. Using linear scan, decide the split along with the feature basis information gain4. Pick the best-split solution along with all the features4. β Handling Categorical Features* Both LightGBM and XGBoost accept numerical features only. This means that the nominal features in our data need to be transformed into numerical features.* XGBoost, by default, treats such variables as numerical variables with order and we donβt want that. Instead, if we can create dummies for each of the categorical values (one-hot encoding), then XGBoost will be able to do its job correctly. But for larger datasets, this is a problem as encoding takes a longer time.Β· For example, if we encode a categorical variable with three values into (0, 1, 2), XGBoost treats them with order as if category 1 is greater than category 0, and so on. This is not what we want. That's why categorical variables need to be one-hot encoded.* On the other hand, LightGBM accepts a parameter to check which column is a categorical column and handles this issue with ease by splitting on equality. * Note: The H2O library provides an implementation of XGBoost that supports the native handling of categorical features. * β Handling Missing Values* Both the algorithms treat missing values by assigning them to the side that reduces loss the most in each split.* β Feature Importance Methods* GainΒ· Every feature in a dataset has some sort of importance/ weightage in helping build an accurate model.Β· Gain refers to the relative contribution of a particular feature in the context of a particular tree.Β· This can also be understood by the extent of relevant information that the model gains from a feature for making better predictions.Β· Available both in XGBoost and LightGBM.* Split/Frequency/WeightΒ· Split for LightGBM and Frequency or Weight for XGBoost method calculates the relative count of times a particular feature occurs in all splits of the modelβs trees. One issue with this method is that it is prone to biaswhen there are a large number of categories in categorical features.Β· Available both in XGBoost and LightGBM.* CoverageΒ· The relative number of observations per feature.Β· Available only in XGBoost.Β· β Processing Unit* The algorithm we want to use often depends upon the type of processing unit we have for running the models. * Although XGBoost is comparatively slower than LightGBM on GPU, it is actually faster on CPU. * LightGBM requires us to build the GPU distribution separately while to run XGBoost on GPU we need to pass the βgpu_histβ value to the βtree_methodβ parameter when initializing the model.* When working in an institution with access to GPUs and strong CPUs, we should go for XGBoost as it is more scalable than LightGBM. * But personally, I think LightGBM makes more sense as the training time saved can be used for better experimentation and feature engineering. We can train our final model to have model robustness.* β Important hyperparameters* XGBoost parametersΒ· n_estimators[default 100] β Number of trees in the ensemble. A higher value means more weak learners contribute towards the final output but increasing it significantly slows down the training time. Β· max_depth [default 3] β This parameter decides the complexity of the algorithm. The lesser the value assigned, the lower is the ability for the algorithm to pick up most patterns (underfitting). A large value can make the model too complex and pick patterns that do not generalize well (overfitting).Β· min_child_weight[default 1] β We know that an extremely deep tree can deliver poor performance due to overfitting. The min_child_weight parameter aims to regularize by limiting the depth of a tree. So, the higher the value of this parameter, the lower are the chances of the model overfitting on the training data.Β· learning_rate/ eta[default 0.3] β The rate of learning of the model is inversely proportional to the accuracy of the model. Lowering the learning rate, although slower to train, improves the ability of the model to look for patterns and learn them. If the value is too low then it raises difficulty in the model to converge.Β· gamma/ min_split_loss[default 0] β This is a regularization parameter that can range from 0 to infinity. Higher the value, higher is the strength of regularization, lower are the chances of overfitting (but can underfit if itβs too large). Hence, this parameter varies across all types of datasets.Β· colsample_bytree[default 1.0] β This parameter instructs the algorithm on the fraction of the total number of features/ predictors to be used for a tree during training. This means that every tree might use a different set of features for prediction and hence reduce the chances of overfitting and also improve the speed of training as not all the features are being used in every tree. The value ranges from 0 to 1.Β· subsample[default 1.0] β Similar to colsample_bytree, the subsample parameter instructs the algorithm on the fraction of the total number of instances to be used for a tree during training. This also reduces the chances of overfitting and improves training time.* LightGBM parametersΒ· max_depth β Similar to XGBoost, this parameter instructs the trees to not grow beyond the specified depth. A higher value increases the chances for the model to overfit.Β· num_leaves β This parameter is very important in terms of controlling the complexity of the tree. The value should be less than 2^(max_depth) as a leaf-wise tree is much deeper than a depth-wise tree for a set number of leaves. Hence, a higher value can induce overfitting.Β· min_data_in_leaf β The parameter is used for controlling overfitting. A higher value can stop the tree from growing too deep but can also lead the algorithm to learn less (underfitting). According to the LightGBMβs official documentation, as a best practice, it should be set to the order of hundreds or thousands.Β· feature_fraction β Similar to colsample_bytree in XGBoostΒ· bagging_fraction β Similar to subsample in XGBoostΒ· β Tradeoff between model performance and training time* When working with machine learning models, one big aspect involved in the experimentation phase is the baseline requirement of resources to train a complex model. While some might have access to some great hardware, often people have limitations to what they can use.* Example: Let us quickly dummy datasets with sample sizes from 1,000 all the way to 20,000 samples. Weβll take a test size of 20% from each of the dummy datasets to measure model performance. For every iteration having different sample sizes stepped up by 1,000 samples, we want to check how much time it takes for an XGBoost Classifier to train in comparison to a LightGBM Classifier. To run the code, refer to this Google Colab notebook. Run results are here.
import neptune.new as neptunefrom xgboost import XGBClassifierfrom lightgbm import LGBMClassifierfrom sklearn.model_selection import train_test_splitfrom sklearn.metrics import accuracy_scorefrom sklearn.datasets import make_classificationimport timefrom tqdm.notebook import tqdm# initialising a logger instancerun = neptune.init( project="common/xgboost-integration", api_token="ANONYMOUS", name="xgb-train", tags=["xgb-integration","train"],)# configuration for our custom datasetmin_samples =1000max_samples =20000step =1000for sample_size intqdm(range(int(min_samples/0.8),int(max_samples/0.8), step)): xgb_dummy =XGBClassifier(seed=47) lgbm_dummy =LGBMClassifier(random_sate=47) # logging the sample size run['metrics/comparison/sample_size'].log(sample_size) # generating the dataset of custom sample size dummy =make_classification(n_samples=sample_size) # splitting the data into train and test set X_train, X_test, y_train, y_test =train_test_split(dummy[0], dummy[1], test_size=0.2, stratify=dummy[1]) start = time.time() xgb_dummy.fit(X_train, y_train) end = time.time() # logging algorithm execution time run['metrics/comparison/xgb_runtime'].log(end-start) # logging model performance run['metrics/comparison/xbg_accuracy'].log(accuracy_score(y_test, xgb_dummy.predict(X_test)), step=sample_size) start = time.time() lgbm_dummy.fit(X_train, y_train) end = time.time() # logging algorithm execution time run['metrics/comparison/lgbm_runtime'].log(end-start) # logging model performance run['metrics/comparison/lgbm_accuracy'].log(accuracy_score(y_test, lgbm_dummy.predict(X_test)), step=sample_size)run.stop()
Figure 1:XGBoost vs. LightGBM Runtime
Figure 2:XGBoost vs. LightGBM Accuracy
β’ β * From the figure, we can see that the training time for XGBoost kept on increasing with an increase in sample size almost linearly. On the other hand, the training time required by LightGBM has been a very small fraction of its contender. Interesting!* The accuracy scores for both models go hand-in-hand. The results indicate that not only is LightGBM faster, there is not much compromise in model performances. So does this mean we can just ditch XGBoost for LightGBM?* It all comes down to the availability of hardware resources and bandwidth to figure things out.* Although LightGBM gives good performance at fraction of time as compared to XGBoost, what it still needs to improve on is documentation and community strength. * Also if the hardware is available, since XGBoost scales better, as discussed before we could train using LightGBM, get an understanding of the parameters required, and train the final model as an XGBoost model.* β Summary: Gradient Boosted Decision Trees (GBDTs) are one of the most popular choices of machine learning algorithms. XGBoost and LightGBM which are based on GBDTs have had great success both in enterprise applications and data science competitions. Here are the key takeaways from our comparison:* In XGBoost, trees grow depth-wise while in LightGBM, trees grow leaf-wise which is the fundamental difference between the two frameworks.* XGBoost is backed by the volume of its users that results in enriched literature in the form of documentation and resolutions to issues. While LightGBM is yet to reach such a level of documentation.* Both the algorithms perform similarly in terms of model performance but LightGBM training happens within a fraction of the time required by XGBoost.* Fast training in LightGBM makes it the go-to choice for machine learning experiments.* XGBoost requires a lot of resources to train on large amounts of data which makes it an accessible option for most enterprises while LightGBM is lightweight and can be used on modest hardware.* LightGBM provides the option for passing feature names that are to be treated as categories and handles this issue with ease by splitting on equality.* H2Oβs implementation of XGBoost provides the above feature as well which is not yet provided by XGBoostβs original library.* Hyperparameter tuning is extremely important in both algorithms.Back to Top
6. Linear Regressionβ’ Linear regression answers the question of how do we find the line of best fit for the data?y=π½0+π½1x1+π½2x2+β¦+π½nxn+πβ’ The challenge is to figure out what the best line is which summarizes the data best.β’ Note: Just as a reminder, if the confidence interval for a coefficient contains a zero, then that coefficient cannot be statistically significant β A confidence interval that contains zero is not certainty that there is no treatment effect, but it is uncertain whether there is a treatment effect.β Having zero in one's confidence interval implies that a treatment effect could have a positive/negative effect on the outcome of interest.6.1. R-squaredβ’ R2 (or coefficient of determination) is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).SSres=βi(yi-fi)2=βie2iSStot=βi(yi-β¨y)2R2=1-SSres
SStot=1-var(errors)
var(y)adj R2=1-n-1
n-2(1-R2)6.2. Test for significanceβ’ We test for significance by performing a t-test for the regression coefficients.β’ In other words, we will test a claim about the population regression line because there is a strong correlation observed.β’ We will carry out a t-test for the slope by calculating the p-value and comparing it with the desired significance level.β’ The null hypothesis is:β H0: π½i=0 β the coefficient is equal to zeroβ Ha: π½iβ 0 β the coefficient is NOT equal to zeroβ’ The t-statistic is calculated as followsβ’ SEcoef=1
n-2βn(y-yi)2
βn(xi-β¨x)2t-statistic =coef
SEcoefp-value = sum(95% tail areas) under t-distribution95% CI =[coef-1.96.SEcoef, coef+1.96.SEcoef]6.2.1. When to use a t-test?β’ A t-test is one of the most popular statistical tests for location, i.e., it deals with the population(s) mean value(s).β’ There are different types of t-tests that you can perform:β One-sample t-testβ Two-sample t-testβ Paired t-testβ’ Note: Remember that a t-test can only be used for one or two groups. If you need to compare three (or more) means, use the analysis of variance (ANOVA) method.β’ The t-test is a parametric test, meaning that your data has to fulfill some assumptions:β The data points are independent; ANDβ The data, at least approximately, follow a normal distribution.β’ Note: If your sample doesn't fit these assumptions, you can resort to a non-parametric alternatives, e.g., the MannβWhitney U test (a.k.a. the Wilcoxon rank-sum test), the Wilcoxon signed-rank test or the sign test.6.2.2. Which t-test?β’ Your choice of t-test depends on whether you are studying one group or two groups:β One sample t-test* Choose the one-sample t-test to check if the mean of a population is equal to some pre-set hypothesized value.* Example: The average volume of a drink sold in 0.33 ml cans - is it really equal to 330 ml?* Example: The average weight of people from a specific city - is it different from the national average?β Two sample t-test* Choose the two-sample t-test to check if the difference between the means of two populations is equal to some pre-determined value, when the two samples have been chosen independently of each other.* In particular, you can use this test to check whether the two groups are different from one another.* Example: The average difference in weight gain in two groups of people: one group was on a high-carb diet and the other on a high-fat diet.* Example: The average difference in the results of a math test from students at two different universities.* Note: This test is sometimes referred to as an independent samples t-test, or an unpaired samples t-test.β Paired t-test* A paired t-test is used to investigate the change in the mean of a population before and after some experimental intervention, based on a paired sample, i.e., when each subject has been measured twice: before and after treatment.* In particular, you can use this test to check whether, on average, the treatment has had any effect on the population.* Example: The change in student test performance before and after taking a course.* Example: The change in blood pressure in patients before and after administering some drug.6.2.3. How to do a t-test?β’ Decide on the alternative hypothesisβ Use a two-tailed t-test if you only care whether the population's mean (or, in the case of two populations, the difference between the populations' means) agrees or disagrees with the pre-set value.β Use a one-tailed t-test if you want to test whether this mean (or difference in means) is greater/less than the pre-set value.β’ Compute your t-score valueβ Formulas for the test statistic in t-tests include the sample size, as well as its mean and standard deviation. The exact formula depends on the t-test type - check the sections dedicated to each particular test for more details.β’ Determine the degrees of freedom for the t-testβ The degrees of freedom are the number of observations in a sample that are free to vary as we estimate statistical parameters. In the simplest case, the number of degrees of freedom equals your sample size minus the number of parameters you need to estimate. Again, the exact formula depends on the t-test you want to perform - check the sections below for details.β’ The degrees of freedom are essential, as they determine the distribution followed by your t-score (under the null hypothesis)β’ If there are d degrees of freedom, then the distribution of the test statistics is the t-Student distribution with d degrees of freedom.β’ This distribution has a shape similar to N(0,1) (bell-shaped and symmetric) but has heavier tails.β’ Note: If the number of degrees of freedom is large (> 30), which generically happens for large samples, the t-Student distribution is practically indistinguishable from N(0,1).
Figure 3:Density of t-distribution with π degrees of freedom
β’ Fun Fact:The t-Student distribution owes its name to William Sealy Gosset, who, in 1908, published his paper on the t-test under the pseudonym "Student". Gosset worked at the famous Guinness Brewery in Dublin, Ireland, and devised the t-test as an economical way to monitor the quality of beer.6.2.4. p-value from t-testβ’ Recall that the p-value is the probability (calculated under the assumption that the null hypothesis is true) that the test statistic will produce values at least as extreme as the t-score produced for your sample. β’ As probabilities correspond to areas under the density function, p-value from t-test can be nicely illustrated with the help of the following pictures:
β’ The following formulae say how to calculate p-value from t-test. β’ CDFt,d β Cumulative Distribution Function (CDF) of the t-student distribution with d degrees of freedom:β p-value from left-tailed t-test β CDFt,d(tscore)β p-value from right-tailed t-test β 1-CDFt,d(tscore)β p-value from two-tailed t-test β 2ΓCDFt,d(-|tscore|)or2- 2ΓCDFt,d(|tscore|)β’ Note: However, the CDF of the t-distribution is given by a somewhat complicated formula.β To find the p-value by hand, you would need to resort to statistical tables, where approximate CDF values are collected, or to specialized statistical software.6.2.5. t-test critical valuesβ’ Recall, that in the critical values approach to hypothesis testing, you need to set a significance level, πΌ, before computing the critical values, which in turn give rise to critical regions (a.k.a. rejection regions).β’ Formulas for critical values employ the quantile function of t-distribution, i.e., the inverse of the CDF:β Critical value for left-tailed t-test β CDF-1t,d(πΌ)* Critical region β (-β, CDF-1t,d(πΌ))β Critical value for right-tailed t-test β CDF-1t,d(1-πΌ)* Critical region β (CDF-1t,d(1-πΌ), β)β Critical value for two-tailed t-test β Β±CDF-1t,d(1-πΌβ2)* Critical region β (-β,-CDF-1t,d(1-πΌβ2)]βͺ[CDF-1t,d(1-πΌβ2), β)β’ β’ Note: To decide the fate of the null hypothesis, just check if your t-score lies within the critical region:β If your t-score belongs to the critical region, reject the null hypothesis and accept the alternative hypothesis.β If your t-score is outside the critical region, then you don't have enough evidence to reject the null hypothesis.6.2.6. One-sample t-testβ’ The null hypothesis is that the population mean is equal to some value π0.β’ The alternative hypothesis is that the population mean is:β different from π0;β smaller than π0; orβ greater than π0.t=β¨x-π0
s.nπ0βmean postulated in H0nβsample sizeβ¨xβsample meansβsample standard deviationβ’ Note: Number of degrees of freedom in one-sample t-test β n-1.6.2.7. Two-sample t-testβ’ The null hypothesis is that the actual difference between these groups' means, π1 and π2, is equal to some pre-set value, π₯.β’ The alternative hypothesis is that the difference π1-π2 is:β different from π₯;β smaller than π₯; orβ greater than π₯.β’ In particular, if this pre-determined difference is zero (π₯=0) β The null hypothesis is that the population means are equal.β’ The alternate hypothesis is that the population means are:β π1 and π2 are different from one another;β π1 is smaller than π2; andβ π1 is greater than π2.β’ Note: Formally, to perform a t-test, we should additionally assume that the variances of the two populations are equal (this assumption is called the homogeneity of variance).β’ There is a version of a t-test which can be applied without the assumption of homogeneity of variance: it is called a Welch's t-test. For your convenience, we describe both versions.6.2.8. Two-sample t-test if variances are equalβ’ Use this test if you know that the two populations' variances are the same (or very similar).t=β¨x1-β¨x2-π₯
sp.1
n1+1
n2spβpooled standard deviationsp=(n1-1)s21+(n2-1)s22
n1+n2β’ Note: Number of degrees of freedom in t-test (two samples, equal variances) = n1+n2-2.6.2.9. Two-sample t-test if variances are unequal (Welch's t-test)β’ Two-sample Welch's t-test formula if variances are unequal:t=β¨x1-β¨x2-π₯
s21
n1+s22
n2β’ Note: The number of degrees of freedom in a Welch's t-test (two-sample t-test with unequal variances) is very difficult to count. We can be approximate it with help of the following Satterthwaite formula:(s21
n1+s22
n2)2
(s21βn1)2
n1-1+(s22βn2)2
n2-1β’ Alternatively, you can take the smaller of n1-1 and n2-1 as a conservative estimate for the number of degrees of freedom.β’ Fun Fact:The Satterthwaite formula for the degrees of freedom can be rewritten as a scaled weighted harmonic mean of the degrees of freedom of the respective samples: n1-1 and n2-1, and the weights are proportional to the standard deviations of the corresponding samples.6.2.10. Paired t-testβ’ As we commonly perform a paired t-test when we have data about the same subjects measured twice (before and after some treatment), let us adopt the convention of referring to the samples as the pre-group and post-group.β’ The null hypothesis is that the true difference between the means of pre and post populations is equal to some pre-set value, π₯.β’ The alternative hypothesis is that the actual difference between these means is:β different from π₯;β smaller than π₯; orβ greater than π₯.β’ Typically, this pre-determined difference is zero. We can then reformulate the hypotheses as follows:β The null hypothesis is that the pre and post means are the same, i.e., the treatment has no impact on the population.β The alternative hypothesis:* The pre and post means are different from one another (treatment has some effect);* The pre mean is smaller than post mean (treatment increases the result); or* The pre mean is greater than post mean (treatment decreases the result).β’ In fact, a paired t-test is technically the same as a one-sample t-test! Let us see why it is so. Let x1,β¦,xn be the pre observations and y1,β¦,yn the respective post observations. That is, xi, yi are the before and after measurements of the i-th subject.β’ For each subject, compute the difference, di=xi-yi. All that happens next is just a one-sample t-test performed on the sample of differences d1,β¦,dn. Take a look at the formula for the t-score:t=β¨x-π₯
s.nβ’ Note: Number of degrees of freedom in t-test (paired): n-16.2.11. t-test vs. z-testβ’ We use a z-test when we want to test the population mean of a normally distributed dataset, which has a known population variance. If the number of degrees of freedom is large, then the t-Student distribution is very close to N(0,1).β’ Hence, if there are many data points (at least 30), you may swap a t-test for a z-test, and the results will be almost identical. However, for small samples with unknown variance, remember to use the t-test because, in such case, the t-Student distribution differs significantly from the N(0,1)!6.3. Multicollinearityβ’ Multicollinearity happens where there's a correlation between the some of independent variables. In other words, some of the independent variables are not that independent.β’ Collinearity won't affect the performance of the model β The R2 remains unchanged.β Also, the model can still make effective predictions.β However, the way we interpret the coefficients will have to change.6.3.1. How to detect multicollinearity?β’ We can look at the features VIF (Variance Inflation Factor).β VIF is derived from finding the correlation itself between certain features.* VIF = 1 β no collinearity* 1 < VIF < 5 β moderate collinearity* VIFβ₯5 β severe collinearity β need mitigation strategy like centering the features.6.4. Feature interactionβ’ This simply means multiplying the two features together.β’ After introducing the interaction term, if the R2 goes up and the p-value of the interaction term is significant β then you can be reasonably confident that the interaction terms are in fact interacting.β’ Can we multiply a feature by itself?β Yes! β But why would we want to do that?β Because now we can fit polynomial relationships.β’ Note: When adding interaction terms, be noted not to overfit the data.6.5. Simpson's paradoxβ’ Simpson's paradox is a phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined.β’ A good way to avoid it is to add as many dimensions to your model which segment the data you're trying to predict.
7. Logistic Regressionβ’ Logistic regression is based on the same idea as linear regression in the way that we still use a line to designate our model. The only difference is that we now want y to be a probability.β’ The probability equation, which is a sigmoid function, is:P(y|X)=1
1+e-(π½0+π½X)β’ Unlike linear regression, there's no closed form solution for logistic regression.β’ The loss function for logistic regression is the log loss (cross-entropy loss):β’ Loss(y,y)=-βn[yilogyi+(1-yi)log(1-yi)]β’ Note: Log Loss is a slight twist on the likelihood function. In fact β log loss =-1 Γlog(likelihood function).β’ Note: The likelihood function of logistic regression is:L(π½0,π½)=nβi=1p(xi)yi(1-p(xi))1-yiβ’ To minimize the loss function, we take derivatives w.r.t. coefficients:β’ dLoss
dπ½=βn[yi-yi]xidLoss
dπ½0=βn[yi-yi].1π»π½=[a
dLoss
dπ½
dLoss
dπ½0
]π½t+1i=π½ti-rπ»π½iβuntil coefficient gradients converge to 0r βlearning rateβusually[10-6,0.1]7.1. Coefficient interpretationβ’ In order to interpret the impact of coefficient π½ on the probability, we have to exponentiate it, eπ½, to get something called the odds ratio.β 1-eπ½ gives the % change in the odds.7.2. Multinomial regressionβ’ We use multinomial regression when we want to predict more than two classes.β’ Instead of sigmoid, we're going to use softmax function.P(y=k|xi)=eπ½0+π½xi
Kβj=1eπ½0+π½xjβ’ A softmax function is generalized sigmoid such that it produces the probability among K classes.β The predicted value will the be class with the maximum predicted probability.7.3. Regularizationβ’ Regularization is a techniques used to avoid overfitting which involves adding a term to the loss function which is the sum of all coefficients. There are two main types of regularization:β L1, Lasso, or Laplace β βj|π½j|* Typically results in more zero-valued coefficients, which means fewer features will be used.β L2, ridge, Gaussian β βjπ½2j* Usually results in small weights for many of the features (that would've been out by L1).β Note: Both L1 and L2 usually have a coefficient, π, multiplied to them which allows to control the degree of regularization.* Two high π can result in under-fitting and too low can result in overfitting.* π is best tuned in cross validation.β Note: When using regularization, it's better to scale our data. Scaling data can also help the model to converge faster.7.3.1. Why Lasso regularization induce model sparsity?β’ First off, note thatβ L1 norm:||w||1=|w1|+|w2|+|w3|+β¦+|wn|β L2 norm:||w||2=w21+w22+w23+β¦+w2nβ’ β’ When optimizing the cost function, we use gradient descent and update our weights by β wt+1=wt-rβwβ Convergence occurs when the value of wt doesn't change much with further iterations β i.e. πLoss
πwβ0 β i.e. wt+1βwt.β’ L1 norm: The derivative is β π|w|
πw=1, therefore β wt+1=wt-r.1.β We can see that our loss derivative becomes a constant, so the condition of convergence occurs faster because we only have r in the subtraction terms and it's not being multiplied by any smaller value of w.β Therefore, wt tends towards zero in a few iterations. β’ β’ L2 norm: The derivative β πw2
πw=2w, therefore β wt+1=wt-2.r.w.β We can see that our loss derivative term is not constant and thus for smaller values of w, our condition of convergence will not occur faster (or maybe at all) because we have a smaller value of w getting multiplied with r and thus making the whole term to be subtracted even smaller. β Therefore, after a few iterations, our wt becomes a very small constant value but not zero. β Hence, not contributing to the sparsity of the weight vector.7.4. Early stoppingβ’ Another technique to avoid overfitting is early stopping.β’ Simply, it means to stop training somewhere before reaching the absolute minimum [of loss function] to avoid overfitting the training examples. 7.5. Other considerationsβ’ We can't use the same R2 from the linear regression. For logistic regression we use something called McFadden's pseudo R2. which also lies between 0 and 1.β Its value is usually smaller than R2. A value of 0.2 and 0.4 usually indicates an excellent fitting model.β’ Logistic regression β discriminative modelβ Naive Bayes β generative modelBack to Top
8. Support Vector Machine (SVM)β’ SVM removes the concept of decision threshold by instead selecting a decision boundary which maximizes the distance between itself and the two most difficult examples to classify.β’ These two most difficult examples to classify are called the support vectors.β In comparison, logistic regression finds a decision boundary which minimized the negative log loss of the training examples.β SVM on the other hand, only focuses on the support vectors.β’ For SVM, the decision boundary (also called the hyperplane) is defined byβ’ wTx-b=0β’ β’ The values of w and b will be optimized when the distance between the decision boundary and the support vectors are maximized.
Figure 5:SVM, decision boundary is the red line.
β’ You can think of SVM is actually making three lines; one line for the hyperplane, defined by equation (6), two more lines (support vectors) with these equations:wTx-b=1wTx-b=-1β’ Note: All the positive/negative examples lie in the left/right of these lines.β’ The distance between these two support vectors is called the margin β 2
||w||, where ||w||=w21+w22+β¦+w2n .β’ Note: The distance between the decision boundary and the support vector β 1
β’ Now, we take the Lagrangian dual of the SVM which gives us the new optimization form to represent SVM:β’ maxβNai-1
2βNπΌjπΌkyjykxTjxkβ’ The xTjxk term benefits us in two ways:β If the number of examples that you're trying to classify is far less than the number of dimensions that each example has, then this computation will be a lot more efficient compared to computing wTx.β Now that we have this new form, equation (16), we can apply the kernel trick.maxβNai-1
2βNπΌjπΌkyjykΞ¦(xTjxk)β’ β’ With kernel trick, we can have some dimensions of x. The kernel trick function, Ξ¦(.), allows us to calculate the high dimensional dot product and it will return a singular scalar value.β This is massive savings when the number of examples that we have is far less than the dimensions that we want to project our data into.β’ The kernel trick function can be:β Linear functionβ Polynomial functionβ Gaussian (e.g. RBF kernel)β etc.β’ Note: Kernel trick is not only for SVMs. For instance, we can take Lagrangian dual of linear regression, and we can get our x terms together β that means we use the kernel trick here as well.y=βNaiΞ¦(xTx)8.4.1. RBF kernelβ’ RBF β Radial Basis FunctionKRBF(x, xβ²)=e-(||x-xβ²||2
2π2)π is a tuning parameter:if π is too small β overfittingif π is too large β underfittingβ’ It represents a separate dimension per data point that you have, i.e. if you have a 100 data points, the RBF kernel would represent your data in 100 dimensions.β’ Why is this useful?β RBF kernel assigns every data point to a Gaussian distribution that can have different height or width.β Then, it traces a line (or hyperplane) of the sum of the Gaussian distributions.β It does it so it can project data points onto that line within its Gaussian distribution.β By doing this, we get a linearly separable data.8.5. The dual problemβ’ In convex optimization, for every primal problem (e.g. equation (14)) we can derive a dual problem.β’ Let πΌβRN be the dual variables, corresponding to Lagrange multipliers that enforce the N inequality constraints.β’ The generalized Lagrangian is given below,L(w,w0,πΌ)=1
2wTw-Nβn=1πΌn(yn(wTxn+w0)-1)β’ To optimize this, we must find a stationary point that satisfies(Λw,Λw0,ΛπΌ)=minw,w0maxπΌL(w,w0,πΌ)β’ We can do this by computing the partial derivatives w.r.t. w and w0 and setting to zero,βwL(w,w0,πΌ)=w-Nβn=1πΌnynxnπ
πw0L(w,w0,πΌ)=-Nβn=1πΌnynand henceΛw=Nβn=1πΌnynxn0 =Nβn=1πΌnynPlugging these into the Lagrangian yields the followingL(Λw,Λw0,πΌ)=1
|S|βnβS(yn-βmβSπΌmymxTmxn)8.5.1. Lagrange multiplierβ’ In mathematical optimization, the method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equality constraints.β’ The basic idea is to convert a constrained problem into a form such that the derivative test of an unconstrained problem can still be applied.β’ The method can be summarized as follows: in order to find the maximum or minimum of a function f(x) subjected to the equality constraint g(x)=0, form the Lagrangian function, equation (28), and find the stationary of L considered as a function of x and the Lagrange multiplierπ.L(x,π)=f(x)+πg(x)β’ β’ This means that all partial derivatives should be zero, including the partial derivative w.r.t. π.β’ The solution corresponding to the original constrained optimization is always a saddle point of the Lagrangian function.β’ The great advantage of this method is that it allows the optimization to be solved without explicit parameterization in terms of the constraints.β As a result, the method of Lagrangian multipliers is widely used to solve challenging constrained optimization problems.
Figure 7:The red curve shows the constraint g(x,y)=c. The blue curves are contours of f(x,y). The point wherethe red constraint tangentially touches a blue contour is themaximum of f(x,y) along the constraint, since d1>d2.
|S|βiβS(yi-βjβSπΌjyjK(xj,xi))β’ This kernel trick allows us to avoid having to deal with an explicit feature representation of our data, and allows us to easily apply classifiers to structured objects, such as strings and graphs.8.7. Kernel trick - other resourcesβ’ PDF screenshots here.
8.8. Multi-class SVMβ’ One way to do this is to create an SVM per class β i.e. classify one class against every other classes β This paradigm is called one-vs-rest.β To make a prediction for an example, we feed it to all the trained [one-vs-rest] SVMs and we measure the margin each SVM produces. We choose the prediction that produces the largest margin between that example and the other classes.β’ Another way to handle multi-classes is to create a one-vs-one paradigm.β’ Here, we're creating a pair-wise SVM so that for every single pair of classes we're creating a single SVM.β To get a prediction, we simply feed the unseen example through every single SVM and select whichever class was most often predicted for that example.β’ Note: Even though the one-vs-one paradigm requires a lot of SVMs, the data required per each SVM is only two classes β So, this can actually be faster in some cases depending on the data.8.8.1. Structured SVMβ’ Here, instead of the margin being -1 and 1, the margin is actually the distance between the two closest classes.8.9. SVM for regressionβ’ The idea is very similar to SVM classifier. β’ The only difference is that the goal is to keep all points within the margin.β’ The slack variable (i.e. the error terms) here comes from points that lie outside of the margin.8.10. Other considerationsβ’ SVMs can linearly separate data out of the box β hard-margin SVM.β’ If we add slack variables β soft-margin SVM.β’ We can use sub-gradient descent to optimize soft-margin SVM.β’ We can add interaction terms to separate data even if it's not linearly separable. This is often preferred when you have either a low number of dimensions that you want to project into or if your data is extremely large.β’ However, if your data isn't so large, and you want to project your features into a very high dimensional space, you can use kernel trick, which allows us to avoid computing this actual feature transformation.β’ SVMs are distance-based. So, we have to consider scaling our features as well.β’ Why to use SVM over logistic regression?β If you have a low number of examples β just start with a linear SVM because they only focus on the support vectors.β If you have a ton of data and not many features β you might be better off starting with logistic regression.Back to Top
10. Singular Value Decomposition (SVD)β’ What if we had a bunch of data and we didn't really know much about it?β We'd like to take the data and look for patterns in it and separate them out β so that we could understand our data better.β We can use SVD to do this.β’ SVD states that any matrix can be represented by three different matrices as follows,β’ A = Uπ΄VTβ’ U β rotationβ’ π΄ β scalingβ’ VT β final rotationβ’ For Example[a
1
-1
2
3
2
-2
]=[a
-0.24
0.96
0.96
0.24
][a
4.2
0
0
0
2.2
0
][a
0.63
0.58
-0.57
0.74
-0.2
0.63
-0.2
0.82
0.51
]β’ Note: If we divide each diagonal element of π΄ by the sum all elements in the diagonal, we get percentage of the variance explained by corresponding column in the U matrix. β In the example above, the variance explained by first column of U, [a
-0.24
0.96
], is equal to 4.2
4.2+2.2=0.65.β’ Note: The third column of π΄ and third row of VT are not used.10.1. Eigendecompositionβ’ Eigendecomposition states that any square matrix can be broken down into eigenvectors and eigenvalues.β’ Few problems with eigendecomposition:β It only works on square matrices.β The eigenvalues don't necessarily lie between 0 and 1. β The ranks of eigenvectors are not perpendicular.β’ SVD solves these problem by:β Allowing any sort of matrix (not only limited to square matrices)β π΄ is eigenvalues of AAT β This allows these values to lie between 0 and 1.β VT is just the eigenvectors of ATA.β To get the values of U, we can simply solve this equation β ui=Avi
π΄β’ We can think of SVD as a generalized version of eigendecomposition.10.2. Principal Component Analysis (PCA)β’ The eigendecomposition of matrix A is,A = VLVTβ’ β’ Now, what if we take matrix A and standardize it (i.e. subtract the mean and divide it by the standard deviation) and then divide it N-1 β This means that we have a correlation matrix β ATA
N-1β The problem is that this computation is typically not stable.β So, instead, what's typically done to get PCA is to use SVD on the standardized matrix A.* In this case, the Uπ΄ term β Principal Componentsβ’ Note: SVD and PCA can be used for dimensionality reduction.β’ Note: SVD and PCA assume a linear correlation between the features.β There are non-linear dimensionality reduction techniques. Examples of such methods are Kernel PCA.Back to Top
11. Neural Networks (NN)11.1. What is a neuron (in NN)?β’ Sometimes called perceptron, is a graphical representation of the smallest part of a NN that takes an input, multiply it by a weight.x1x2x3y1βππ(w1x1+w2x2+w3x3+b)= y1π(WT.X+1.b)=Yπ(WT.X+1.b)=1
1 + e-(WT.X+1.b)β’ The π(WT.X+b) is the same as logistic regression and the loss is exactly the same as the one in logistic regression:L(y,y)=1
Nnβi=1yilogyi+(1-yi)log(1 -yi)β’ Note: There are other non-linear functions used as well, such as Relu, tanh, etc.β’ We can update the weights by taking gradients of the loss function with respect to the weights,βL=[a
βL
βb
βL
βw1
...
βL
βwn
]wt+1=wt- rβLβ’ Note: According to equation (34), in order to update weights, we move in the opposite direction of loss gradient adjusted by the learning rate (r). 11.2. Why do we need the bias term?β’ Bias is like the intercept added in a linear equation. It is an additional parameter in the Neural Network which is used to adjust the output along with the weighted sum of the inputs to the neuron. Thus, Bias is a constant which helps the model in a way that it can fit best for the given data.β’ The bias term helps in cases where all the wixi terms are 0, which means that the model cannot be trained. Adding a bias terms let the model be trained in such cases.11.3. How a NN learns non-linear patterns?β’ Each neuron in a NN learns decision boundary.β’ Since NN has many neurons, the combined learned decision boundaries creates a non-linear decision boundary. β’ In the example below, there's no way to separate the data with one line. There are feature engineering methods (or other algorithms) that can handle this. But, how a NN can separate these two classes?y1βx3x2x1ππβπβbβ’ The above picture is a simple example of how a NN can capture non-linear patterns. In practice, NN have more than one hidden layers and more neurons per layer.β Note:Usually the number of neurons in each hidden layer decreases as we move forward through the network.11.4. How does a NN learn?β’ Let's explain this through a small NN below.πβπβπβx2x1youtw1w2w3w4w5w6h1inh2inh1outh2outyinh1in= w1x1+w3x2h2in= w2x1+w4x2h1out= π(h1in)h2out= π(h2in)yin= w5h1out+ w6h2outyout= π(yin)L(y,y)=1
Nnβi=1yilogyi+(1-yi)log(1 -yi)Loss for one example: yout= 0.33, y = 1aL =log(yout)=log(π(yin))=log(π(w5h1out+ w6h2out))=log(π(w5π(h1in)+ w6π(h2in)))=log(π(w5π(w1x1+w3x2)+ w6π(w2x1+w4x2)))The loss function is β11.5. Chain Rule* Now, we have to take the gradient of L. Since loss function is a complex functionit's hard to derive the analytical gradient. * Since the loss function is essentially a function of functions, we use the chain rule tocompute the derivatives. For example,βL
βw1=βL
βyout.βyout
βyin.βyin
βh1out.βh1out
βh1in.βh1in
βw1βL
βw6=βL
βyout.βyout
βyin.βyin
βw6* Note that the first two terms of βL
βw6 and βL
βw1 are the same. This means that we can use dynamic programming + chain rule to calculate derivatives. * We start by first calculating βL
βw6 and work our way to βL
βw1.* This gives us something called backpropagation. Backpropagation is the standard wayto train NN.* For training we need: * Forward Pass β To figure out how far our predictions are from the actual value * Backpropagation β Once we have the loss, we can backpropagate those gradients to update all of the weights in our NN.11.6. How do we update the weights?wt+1= wt- rβL* If we plot the average loss obtained from all of the trained examples against a particularparameter, say w1, we get a function like below,L(y,y)w1* The difference between this function and the logistic regression function is that herewe have local optima. In logistic regression, we were guaranteed to have one minimum.* That's because we stacked up neurons and added layers, so we opened up ourselves tolocal optima.11.7. Stochastic Gradient Descent (SGD)* There are some techniques that increases the chance of not getting stuck in the local optima. The most popular method is Stochastic Gradient Descent (SGD).* SGD's characteristic of not getting stuck in local optima is just a by-product of takingrandom examples and updating the weights with just that single example. This randomnessin the weight updates, can increase the chances that we don't get stuck in a local optima.* The problem with SGD is that it's slow to converge.* One idea to speed up convergence is by incorporating momentum. * The idea of momentum is to keep track of the previous updates.11.8. Momentum* The problem with SGD is that it's slow to converge.* One idea to speed up convergence is by incorporating momentum. * The idea of momentum is to keep track of the previous updates.awt+1= wt- rβLt- πΎ r (βLt-1+βLt-2+β¦+βLt-n)or wt+1= wt-VtVt= πΎVt-1- rβL* The πΎ parameter is usually set to 0.9 so that the previous gradient doesn't matter as much as the current gradient.* The problem with momentum is that sometimes we could build so much momentum that we pass the global optima.11.9. AdaGrad* There's another method called AdaGrad which adjusts the learning rate per parameter.*Note:rgeneral is typically set to 0.001.wt+11= wt1- rt1βLt
βw1rt1=rgeneral
(βLt-1
βw1)2+β¦+(βLt-n
βw1)2+π* Why would you want to do something like this? * It balances the update value at each step such that when gradient is high it lowers the learning rate and when the gradient is low it increase the learning rate. * This way it moderates the steps we take at each update (and for each parameter). * Note: The π term in the denominator is set to a small value to avoid dividing by 0. * Note:AdaGrad really helps in the case of sparse features, because if we have sparse features, that means that the weights associated with those features will be updated less, and therefore the learning rate will be higher.11.10. Adam* The other method is Adam.* Adam combines momentum and adaptive learning rate.wt+1= wt-rgeneral
Vt+πmtmt= π½1mt-1+(1-π½1)βtlossVt= π½2Vt-1+(1-π½2)(βtloss)2* Note:π½1 and π½2 are hyperparameters.* Note: The only difference between mt and Vt is the squared gradient loss term.* Note: Notice that the mt and Vt are adjusted (i.e. mt and Vt). The reason is because these terms are technically moments of a function, and in order to get an unbiased moment on these functions we have to adjust them by the π½ parameters.* Note:Adam looks like a ball rolling down a hill with momentum, but the ball also has friction. The idea is that the friction helps the parameters settle in the global optima, while the momentum helps the parameters escape the local minimum.mt=mt
1-π½t1,Vt=Vt
1-π½t211.11. RMSPropβ’ Will add the note later11.12. AdaDeltaβ’ Will add the note later11.13. Vanishing and exploding gradientsβ’ Once we have the gradients, from whatever optimizer we use, multiplying these gradients together can result in a problem.β Let's say if we use the sigmoid activation function, the maximum value of the gradients are 0.25.β Now, if we multiply a lot of 0.25s together, this final gradient (based on chain rule) will brace towards 0 β this result in underflow.β This is called β vanishing gradient.β’ Also, some of the gradient terms include the value of weights. β If the weights are extremely large, by multiplying them together, we can end up getting an extremely large value β This is called an exploding gradient.β’ There are a few methods to mitigate these problems.11.13.1. Initializationβ’ One of the method to tackle the vanishing/exploding gradients is to initializing the weights of the NN in a particular way.β’ A bad way to initialize the weights is just to use a uniform distribution between 0 and 1. β’ Another bad way to just to initialize these parameters with a normal distribution in which the mean is 0 and the standard deviation is 1.β’ What we can do instead is to initialize the weights from a normal distribution in which the mean is 0 but the standard deviation is π=2β(fi+fo). β’ fi is fan in and fo is fan out. β’ Fan in is the number of inputs to a particular layer and fan out is the number of outputs for that layer.β’ This way, we can initialize the weights for each layer of NN.β’ This is called Xavier or Glorot initialization.β’ The reason why this helps is because we're shrinking the standard deviation by how many ever times we will be multiplying these variables together per layer. β Not doing this makes the variances of each layer multiply together and that causes the variance to grow exponentially. β So, if we can shrink down the standard deviation early, these the other multiplications, hopefully, won't result in exponential growth (or shrinkage) of the gradients.β’ This works best when we use something called a symmetric activation function. Example of such functions β sigmoid function.β’ 11.14. ReLU and Leaky ReLUβ’ What if we want to use a non-symmetric activation function?β Example of such functions is ReLU (Rectified Linear Unit) function.ReLU =max(0, x)β’ Why do we want to use ReLU?β More computationally efficient β All the negative values take on the value of 0, and all positive values take on the value itself β When taking derivatives, the derivatives of 0 is 0 and the derivative of any value is just 1.β Tends to produce better model performanceβ Sparsity β reduce overfitting β not all neurons will output a value (negative values β 0). β β’ What are the downsides of ReLU?β It has an uncapped activation. With sigmoid, we'd have something called saturation, where the output of the neuron could be no larger than the value of 1.β However, the ReLU can output any value, which means that we could be susceptible to exploding gradients more often.β As well, we can even now be susceptible to exploding forward passes where by simply doing multiplications in the forward pass all the way through the NN, we can also get unreasonably large numbers that overflow.β Another problem called dying ReLU problem.* It comes from the fact that a neuron that takes on a value of 0 will be 0 forever.* That means that the neuron will be completely dead and never output another value except 0.β Even with these problems, ReLU activation functions are used often in practice.β β’ Initialization for ReLUβ Instead of Xavier initialization, we use the Kaiming initialization β π=2βfiβ The Kaiming initialization can be used for other asymmetric activation functions like Leaky ReLU β f(x)=aa
x
if x>0
0.01x
otherwise
β Leaky ReLU tries to get around the dead neuron problem by adding a slight angle to the slope.(x)=aa
x
if x>0
0.01x
otherwise
β’ Another thing to do (in addition to the initialization) is feature scaling.β’ 11.15. tanhβ’ tanh is very similar to the sigmoid function, but instead of being in the range of [0,1], it lies in the range of [-1,1]. β’ The idea to cross validate between all the activation functions to see which works best for your data.β’ Note: Different activation functions can be used at different layers of NN.β’ Note: The last neuron will dictate what the output looks like. sigmoid β binary classification, softmax β multi-class classification (the maximum value of softmax function is your prediction), linear regression β linear activation function11.16. Loss Functionsβ’ Regression β Mean Squared Error (MSE) β L(y,y)=βN(yi-yi)2
Nβ N is usually the batch_size.β’ Regression β Mean Absolute Error (MAE) β L(y,y)=βN|yi-yi|2
Nβ’ Classification β Cross Entropy (sometimes called logloss) β L(y,y)=-(ylogyi+(1-y)log(1-yi)β’ Classification β Cross Entropy for K classes β L(y,y)=-βkyilog(yi)11.17. Avoid Overfitting11.17.1. Regularizationβ’ We can do regularization by adding L1 or L2 term to the loss function β L(y,y)=-βkyilog(yi)+ πβw|wi|11.17.2. Dropoutβ’ Dropout is when you have, per layer of the NN, a particular neuron in that layer that will have some probability of sticking around. The others will be dropped out for this training iteration.β’ For each layer we assign a dropout probability (e.g. P = 0.5).β’ The problem with dropout is that during training, the dropped out neurons (during training) will not drop out during prediction β so, all of the sudden, the last node summation will be a lot higher (because we have all the neurons).β To solve this problem, we can use inverted dropout. During training (after every mini-batch), they'll take the output of the layers and divide by the dropout rate β output
dropout rate .β This ensures that the total sum coming into the last node will match on average the total sum coming to it during prediction time. 11.18. How to determine the number of layers and neurons?β’ If your data is linearly separable, you don't need any hidden layer at all.β’ Beyond that, it's safe to start with a single hidden layer, and the number of neurons in that single hidden layer should be the average of input and output.β’ Another alternative is to start with more layers or units than you need, and then go examine the weights of your connections.β The weights that are close to 0, should allow you to prune the surrounding neuron.β Once, you drop the neuron, you run the cross validation to see how much the NN model performance is affected. Back to Top
12. Convolutional Neural Networks (CNN)Back to Top
14. Generative Adversarial Networks (GAN)β’ Let's say we have about 200,000 satellite images and we want to improve their quality. We will capture: coastlines, ports, cities, farms, mountains, oceans, suburbs.β’ The labels will be equivalent aerial images corresponding to each satellite image.β The aerial images have 4x the resolution of satellite images (e.g. a 2x2 section would expand to 4x4 section).β Note: The aerial images have to be taken from the same exact location (and preferably at the same day time).β’ We'll use satellite images for creating features.β’ What model could we use for training this problem?β We can use a type of NN models called Generative Adversarial Networks (GAN).β GANs are just two NNs in themselves. One NN is called the generator.β The generator will take in a low resolution image (i.e. satellite images), and create its best guess for what the high resolution image (i.e. aerial image) would be of the equivalent low resolution image.β The second part of the GAN is called the discriminator. β The discriminator will take in real aerial images and the aerial image estimates (produced by the generator), and it will decide if the input image is real or fake.β Essentially, the generator is working to confuse the discriminator and the discriminator is working its best to differentiate between the generated images and the real images.β’ How does GANs get trained?β Feed the low-res image to the generator to produce an estimated hi-res image.β Feed the hi-res image to the discriminator.β Propagate the discriminator loss all the way back the generator.* Here, we're fixing the weights of the discriminator, and we're only updating the generator in accordance to the loss generated from its attempt to fool the discriminator.14.1. GAN Loss Functionβ’ Discriminator's Loss:max(1
mβmlog(1-D(G(z))))β’ Combined GAN's Loss Function (Adversarial Min-Max):minGmaxD(1
mβm[logD(x)+log(1-D(G(z)))])β’ How do we know when this loss function has converged?β When the accuracy of the discriminator drops to around 50%, i.e. it can't do better than the random chance (that means the generator could successfully fool the discriminator).14.2. Exampleβ’ Going back to our satellite example: β The generator is going to be a CNN. β Input β Low-res satellite imagesβ Output β estimated aerial images (hi-res)β CNN has to do: * Upsample β 4x resolution* Pixel Shuffle* Residual connections (to avoid vanishing gradients)β The discriminator is also a CNN with a sigmoid binary classifier as the end.* We're using Leaky ReLU.β Mini-batch size β 16* Note: The batch size is a bit smaller than usual. The discriminator goes through a few iterations of training before we allow the generator to start learning from the discriminator's decisions. Keeping the mini batches smaller means that the discriminator will only get 16 images to train on before the generator gets to jump in and begin training as well. This is so the discriminator doesn't out-learn the generator, such that the discriminator doesn't supply any any valuable feedback to the generator because it's already learned so much that the gradients will be small β The generator and the discriminator have to learn together.β Adjust the learning rate:* We can also adjust (lower) the learning rate of the discriminator to make sure that the generator can still keep up and learn together with it.β Mode collapse* Mode collapse is when the generator figures out an image which can fool the discriminator and it just continues to output that same exact image since it's found out how to fool it.* This can be avoided by using unrolled GAN β it allows the generator to see what the discriminator will look like in a few more steps ahead of it such that the generator is encouraged not to learn some local exploitation of the discriminator.* It now has to account for what the discriminator will also look for in the future. 14.3. Evaluationβ’ In our test set, if we have the true labels (e.g. hi-res image in the example above), we can just take the MSE per pixel.β’ Another thing to look at is Peak Signal to Noise Ratio (PSNR) β 20.log(pixelmax)-10.log(MSE) βThis is a gauge of how noisy the produced image is β The smaller the MSE the better the PSNR.β’ We can also use human raters to determine if they can tell the difference between our generated images vs. the real image.Back to Top
15. Recommender Systems15.1. Collaborative Filteringβ’ The goal of collaborative filtering is to use the user-item matrix to get an idea of how a user would respond to an unseen item.15.1.1. User-based Collaborative Filteringβ’ In user-based collaborative filtering, the goal is to find similar users and recommend to a a user a particular item that is used by other similar users.p1p2p3p4u1u2110101?1β’ Note:If we're using binary indicators in the user-item matrix, we can use Jaccard/Cosine/Hamming metrics to measure similarity.β β Jaccard Similarity β no. matching 1s
no. matching 1s+no. ui=1 & uj=0 +no. ui=0 & uj=1* * Note:We don't count the number of matching 0s here.β Cosine Similarity β βnuiuj
βnu2i.βnu2jβ β Hamming Distance β sum of the differences β nβi=11(xiβ yi)β β’ How do we predict for a particular user (given the user similarities?) β responseup=βusim(up,ui).responseui
βusim(up,ui)β β’ Note: We can use the KNN to find k number of neighbor users for prediction.β’ Note: We can use MSE to evaluate our predictions.β’ Note: Generally, we'd like to recommend the item that generates the highest predicted value.β’ Note:If the user-item is not binary, we can use the following similarity measures:β Euclideanβ Manhattan β nβi=1|xi-yi|β Pearson Correlation β βn(u1-β¨u1)(u2-β¨u2)
βn(u1-β¨u1)2βn(u2-β¨u2)2β Cosineβ’ Steps of user-based collaborative filtering:β Find KNN of ui to form prediction for.β Calculate the similarity score between the neighbor usersβ Using the KNN, predict the response for ui.15.1.2. Item-based Collaborative Filteringβ’ Here, we use the similarity between the items to make recommendations.p1p2p1p211.86.86u1u2u3u4u5p1p21011011110cosineβ’ Here, we calculate the item-item similarity.β’ To predict user response, we use a similar formula as above β responsepl=βpsim(pl,pi).responsepi
βpsim(pl,pi)β’ Steps of item-based collaborative filtering:β Calculate item-item similarity matrixβ Predict response for uiβ Recommend the item with the highest prediction15.1.3. Considerations on Memory-based Filteringβ’ Time complexity of user-based β O(u.p)β O(u+p)β Greater diversity ββ More Expensive β (because of KNN calculations)β’ Time complexity of item-based β O(u,p2)β O(u.p)β Less re-calculations β (items change less than users)β Lack of diversity ββ’ The user-item matrices are sparse, so the β denotes the effective time complexity.β’ Both user-based and item-based are collectively part of memory-based filtering.β’ Time Decay: We can apply time decay to the predictions as follows β responseup=βusim(up,ui).dt.responseui
βusim(up,ui)where dt= 0.5t
half-lifeβ’ β Time decay means that as time goes on, less influence will be given to a particular rating.β’ Inverse User Frequency: Another thing that we can do is to apply more weights to less frequent items β responsepl=βpsim(pl,pi).fi.responsepi
βpsim(pl,pi)where fi=log(N
n+1)β where * N β total number of users and* n β no. of users who interacted with item i.β fi is called Inverse User Frequency (IUF).15.2. Matrix Factorizationβ’ One problem withe memory-based approaches is that the user-item matrix (or item-item similarity) matrix is extremely large.β’ One approach we can use is Matrix Factorization β it takes the same user-item matrix and factorizes it into some terms as followsU.P=uTp+ bp+buβ’ Here, both uT and p need to be learned.β’ Also, note that we have two bias terms, one for the users, bp, and one for the items, bu.β’ The loss function for matrix factorization is,L =1
NβN(U.P-uTp- bp-bu)2+ π(βu||ui||2+βp||pi||2+βu||bu,i||2+βp||bp,i||2)β’ The second term is the regularization term.15.2.1. Implicit Ratingsβ’ If we have some binary user item matrix, we can transform that into a non-binary matrix of implicit ratings. β Example: Let's say the implicit rating (rimp) of an item is like this,* 1 β view* 2 β like* 3 β commentβ We can multiply (1+πΌrimp) to all the binary values where πΌ is an adjusting parameter which can be tuned through cross validation.β With the implicit ratings, the loss function looks like this,L =1
NβN(U.P.Cpu-uTp- bp-bu)2+ regularization15.2.2. Alternating Least Squares (ALS)β’ Since we need to optimize with respect to both uT and p , we can't use ordinary least squares (like in linear regression).β’ We have to use something called Alternating Least Squares (found in libraries such as Spark ML).β’ ALS keeps constant one of uT or pand performs OLS on the other one. Then, it would fix the other one, and runs an OLS on the one that kept fixed in the previous round.β’ ALS performs well in practice and we can also tune the dimensions of U and P.15.2.3. Predicting with ALSβ’ Here's the initial equation β U.P=uTp+ bp+buβ’ By estimating this equation with ALS, we can get prediction for a particular user and an item as follows β pred(ui,pj)=uTrowipcolumnj+ bpj+buiβ’ The problem with the above approach is that, we'd have to retrain for every new customer that came in.β’ Note: To evaluate the performance of the predictions, we can use MSE on the validation set.β’ How can we avoid having to retrain these models for every new customer?15.3. Deep Learning Extensionβ’ To avoid the prediction limitation of ALS, we can use the deep learning extension of matrix factorization.β In this case, the row and column of the user-item matrix are the inputs to the NN.β These inputs would have their own separate fully connected layers which would act as an embedding layer.β Then, the results of these embedding layers are combined into a single fully connected layer to finally be sent through a linear activation function.β The NN output would be the prediction for a particular user and item.β’ This typically requires less retraining.15.4. Challenges of Collaborative Filteringβ’ Cold-start Problemβ We can't generate predictions for brand new users with no information (or items with no purchase history).β There are some ways to mitigate this such as:* Recommending new users popular items* Presenting new items to random subgroupsβ’ Echo Chamberβ Let's say we recommend an item to a user so that increments some of their values in the user-item matrix.β Now that value has been incremented, it has more weights for other users and it just spins into this positive feedback loop of recommending potentially the same or very similar items over and over.β This doesn't make for a very good customer experience because these users won't see the variety that they need to see (they just see the same thing).β One way to avoid that is to recommend some random new items into the mix. β’ Shilling Attacksβ This happens in systems where everyone can provide a rating.β For example, someone can provide a bunch of fake accounts to promote its own products in a platform (similarly when disliking a product).β We can limit that by allowing one user per phone number for example.15.5. Content-based Filteringβ’ Content-based filtering represents users with respect to items that they've interacted with.β’ Now, if we get the dot product of user vector with respect to every other items, and the item with the highest dot product will be recommended.β’ Content-based filtering doesn't require any user data to make a prediction about a particular user.β So, if you have tons of users, you can use content-based filtering to avoid running KNN.β’ The downside of content-based filtering is that it requires context about the items, i.e. products are need to evaluated in terms of some of their characteristics.15.6. Deep Learning Recommender Systemsβ’ We can use deep learning to combine collaborative and content-based filtering.β’ It's going to be a similar NN as described in section Section 15.3., but with addition of input (and embedding layers) of user and item features.Back to Top
16. Learning To Rankβ’ In the Recommender Systems section, Section 15., we talked about how to recommend an item to a user.β However, it only considered one item at a time per prediction.β The prediction, which is a score or probability, help us gauge whether or not we should we should recommend an item.β’ Let's say, instead of that, we wanted to create a feed or a series of posts (or items) for a user to scroll through on their home screen.β’ Now, this is a question of what order (or ranking) do we place a series of items in to provide the best feed for a user.β This is quite a different problem from classification/regression.16.1. Framing a Ranking Problemβ’ Our first goal is to extract relevant posts (or items) from all of the posts available to the user β This is typically called Candidate Generation or Obtaining the Top K.β’ Once, we have the top K, then we will rank them in some particular order for the feed.16.2. Candidate Generationβ’ To perform candidate generation we can again use matrix factorization.β’ In matrix factorization, we perform this,U.P=uTp+ bp+buβ’ Note that both uT and p are in fact embeddings that represent some characteristics about some user or some item.β’ To get the top K items for a user, we can get the closest adjacent items to the user using the embedding vectors, uT and p.β We can use Euclidean distance or cosine similarity or dot product.β’ Note: If, instead of matrix factorization, we've used a deep recommender system, then we could use the embedding layers in the DL model.16.3. Ranking the Top Kβ’ Given the top K items, now we need to rank these top candidates.β’ The (bad) solution would be to just rank the items by their distance from a particular user (closer distance will rank higher on the top). There are a couple of problems with this method:β There could be tons of users and items where, in reality, if we extract the top K, we could generate a more sophisticated model because now we're only dealing with a small fraction of the users and items.β What if we had different systems generating our candidates? * We could subset our users into groups (e.g. user traffic coming from Amazon or Google).* These groups would exist in separate embedding spaces. However, if we want to apply rankings of items across different embedding spaces, then we would have to have some mechanism to rank beyond just the distances because they don't translate across different embeddings.16.3.1. Learning To Rankβ’ Learning to rank takes some probability that some item i should appear before some item j.P(i>j)=yi,j=1
1+e-(si-sj)β’ Where si=f(u,di) and sj=f(u,dj) and di and dj indicate items i and j.β’ The functions, f, are neural networks.β’ Here, it's modeled as a sigmoid function of si and sj.β’ Note: The user feature term, u, don't have to be users per se. They can also be queries.β In fact, originally, learning to rank, was used as a search optimization tool. β In our case, we're going to treat queries as users.16.3.2. Learning To Rank Loss Functionβ’ The loss function is the same sigmoid loss function,L(yij,yij)=-βiβ jyijlogyij+(1-yij)log(1-yij)β’ We're also going to use gradient descent to minimize the loss function.16.4. RankNetβ’ Let's say we have candidate documents A, B and C which we want to rank.β’ The first step is to generate all unique pairs of these documents because our loss function takes into account the probability that document i outranks document j β so, we have to represent all ij pairs.β’ Now, let's say user clicked on A and B when presented with the following feed: A β,C Γ,B β β so, our labels should be A,B,C (simply because user clicked on A and B and not on C).β’ Now, we assign yijas follows:β A,B βy12=1β B,C βy23=1β C,A βy31=0 (because C does NOT rank over A)β’ Note:If there were more documents (that user haven't seen), we'll assign them 0.5, because we don't know which one the user will click on.β’ Now, our neural network is going to look like this:BABCCAdu......βd1d2s1s2f(u,di)=siβ’ Note: The neural networks are the f functions that'd give us si and sj for each pair of documents.β’ Now, we can replace si and sj in equations (35) and (36) to calculate updates based on the gradient of the loss function β wt+1=wt-rβloss.β The only difference here (compared to NN update) is that we're placing adjacent documents into the neural network one at a time and we're applying those differences in outputs to the overall loss function instead of using two documents at once as an input.β This strategy, which is called RankNet, was designed by Microsoft.β’ Now, let's say we trained our NN with these updates and we get a new document we haven't seen. β We first generate the pairwise documents.β We then repeat the same process described above, except this time with s1 and s2, we just simply plug them into the model and get a probability. * If probability > 0.5 β we rank item 1 above item 2.β Then, we go to the next pair of documents and repeat this process.β Eventually, we can come up with a consistent order of which we should place the documents in.* Consistent means that P(i>j) will propagate consistency across the entire rank. For example, if in training P(1>2)=0.5 (i.e. rank uncertainty), AND P(2>3)=0.5 then P(1>3)=0.5 β i.e. complete uncertainty propagates. * Same thing for complete certainty: if P(1>2)=P(2>3)=1 then P(1>3)= 1.16.5. LambdaNetβ’ RankNet pairwise may not be the best penalty.β This means that we may want to consider more than just two documents at a time when trying to rank all the documents. β’ It's also pretty inefficient.β For all pairs of documents, we have to put all of those pairs into the neural network and perform SGD on every one of them.β’ In terms of learning to rank, after RankNet, there came something called LambdaNet.β’ Two improvements came with LambdaNet:β It is more efficient. They factorize the gradient such that they can find the gradient update for a document in comparison to all the others without having to evaluate itself versus every other pair.β LambdaNet enables us to use better metrics such as nDCG for evaluating a particular ranking, where nDCG β normalized Discounted Cumulative Gain.* It considers more than just a pair of documents at a time. 16.6. nDCGβ’ Let's see how nDCG works.β’ Let's say we have the following documents along with their probabilities:β A β 0.5β B β 1β C β 0β’ Here, if we just use the number of inversions, we would count one inversion (A β B) that would give us the optimal ranking β i.e. the model will be penalized by one.β’ Now, let's consider this case:β C β 0β A β 0.5β B β 1β’ The problem is that the same penalty would be incurred for for the inversion (A β B) vs. if it was higher (previous case).β’ nDCG considers documents up to some position p,DCGp=βp2reli-1
log2(i+1)β’ where reli is the relevance of document i, which is those probabilities.β’ Now, we calculate the DCG3 (because there are 3 documents) for the first case we get 1.04 and for the second case we get 0.76.β It means that even though inversions would penalize (A β B) the same, the DCG is clearly lower for the second case, simply because this elements appeared later down the list.β’ How to compare DCG across different users?β To do that we have to take the DCG for a particular user and divide it by the ideal DCG β i.e. IDCG.nDCGp=DCGp
IDCGpβ’ The IDCG in our example is:β B β 1β A β 0.5β C β 0β’ So, the IDCG3 for our example is 1.26 and nDCG3=1.04
1.26=0.82.β’ nDCG allows us to use the same loss for all users. β’ The only problem with incorporating nDCG into our loss function is that it's not differentiable.β’ What LambdaNet does is that instead of defining the gradient as the gradient of the loss function itself, they just assigned the gradient to a value called lambda (π)which is defined as,β’ πij=1
1+e-(si-sj).|ΞnDCGij|β’ β’ At the time, the authors didn't have any mathematical backing for it, but it was later proven that this was completely fine and it actually does optimize for nDCG.β’ Now that we use π instead of the gradients, how do we update our weights?β It's extremely similar to our old update functions. The only difference is that we're subtracting different terms as shown below,wt+1=wt- rβijπij(βsi
βwk-βsj
βwk)β’ Note: There's no partial derivative in equation (40) with respect to some cost function. 16.7. LambdaMARTβ’ The only difference with LambdaNet is that it uses a gradient boosted tree in place of the neural network.β’ LambdaMART does appear to be pairwise because it compares pairwise documents, but nDCG considers elements beyond the pairs β so, technically, it's not a pairwise model.β’ How does MART work?β MART takes a single example, xi=[u1,u2,β¦,p1,p2,β¦] β i.e. some user features + item features where features can be embeddings or explicit features.β The example is fed to the first tree and finds itself at the leaf node after it traverses down the tree. β The error is going to be the prediction at the leaf node, yi, divided by the weight wi.* Here, yi is just the π and wi is just the gradient of the π with respect to the output at the leaf node and yi
wi is actually the error that gets passed on to the next tree to be learned.* Note: The i index just indicates that we perform the above calculations for all the leaf nodes.* Note: For calculating yi
wi (or π divided by the derivative of π), we're taking what's called a Newton Step to figure out our error to be trained on in the next tree.tree1tree2tree3xiyi
wiyi
wiyi
wi++β’ β Using the concept of MART with our Lambda (instead of the actual gradient), we end up with LambdaMART.16.8. Other Notesβ’ So far, in our examples, we just talked about whether a user clicked or not clicked a particular item/post/document.β We don't have to use just use clicks. We can have things like:* A user liked a particular post.* A user commented on a particular post.* Or a user performed no action.β We can map these actions like this:* N/A β 0* Clicked β 1* Liked β 2* Commented β 3* Messaged β 4β This is called Implicit Relevance Feedback. (similar to Implicit Response in Recommender Systems)β’ In addition to nDCG, we can also use Mean Average Precision (MAP) (for binary cases).β’ For implicit relevance feedback, we could transform it to binary by arranging some cutoff point (e.g. anything > 3 is labeled as relevant (i.e. 1) and irrelevant (i.e. 0) otherwise.β’ We could also use Mean Reciprocal Rank (MRR) (for binary cases). β’ Note: We can average these metrics across all users to evaluate the model.β’ Sometimes it's preferable to encode your n-ary labels (as in implicit relevance feedback) as binary so that we can use those binary evaluation metrics just for another perspective.β’ All of these evaluations metrics that we talked about, can also be used in the Lambda, that is,πij=1
1+e-(si-sj).|ΞnDCGij| or πij=1
1+e-(si-sj).|ΞMAPij| or πij=1
1+e-(si-sj).|ΞMRRij|β’ Sometimes we can to de-bias our clicks data. β Clicks are going to be susceptible to Presentation/Trust Bias.β Basically, the results appearing lower in the ranking can affect:* If the post is even seen* Even if it is seen, the fact that it occurs lower in the list could make the user not trust that result and not click on it.* This could be more relevant in search engines. β How do we account for this bias?* We just have to divide by the bias in the lambda equationπij=1
1+e-(si-sj).|ΞnDCGij|
bi bjβ’ β bi is the probability that some item in rank i is clicked over some other relevant item that's ranked lower than it. β bj is the same as bi but in terms of irrelevant items.Back to Top