Gradient boost for regression
Gradient Boost and AdaBoost are very similar. So, let’s first start by comparing the two algorithms.
Gradient Boost vs. AdaBoost
For a regression problem:
- AdaBoost starts by:
- Building a very short tree, called a stump, from the training data.
- The amount of say the stump has on the final output is based on how well it compensated for those previous errors.
- Then, AdaBoost builds the next stump based on errors that the previous stump made.
- AdaBoost continues to make stumps in this fashion until it has made the number of stumps you asked for, or it has a perfect fit.
- In contrast, Gradient Boost (GB) start by:
- Making a single leaf (instead of a tree or stump).
- This leaf represents an initial guess for the target value for all samples.
- When trying to predict a continuous value, the first guess is the average value.
- Then, GB builds a tree. Like AdaBoost, this tree is based on the errors made by the previous tree.
- Unlike AdaBoost, this tree is usually larger than a stump. However, GB still restricts the size of the tree (you specify the tree size by the number of leaves). In practice, the maximum number of leaves is set between 8 and 32.
Like AdaBoost, GB also scales the trees, but GB scales all the trees by the same amount (by learning rate), unlike AdaBoost.
- Then, GB builds another tree based on the errors made by the previous tree and scales it.
- GB continues to build trees in this fashion until it has made the number of trees you asked for, or additional trees fail to improve the fit.
Gradient Boost algorithm
- Calculate the average value of the target variable.
- Next, we build a tree based on the error of the previous tree. The error is just the difference between the observed target and the predicted (average) target.
- Note: The difference is called pseudo residual. We build a tree to predict these residuals.
- Note: The trees can be different at each step.
- By restricting the number of leaves, we get a fewer number of leaves than the residuals. As a result, some residuals end up in the same leaf. We replace these residuals with their average value.
- Now, we can combine the original leaf (average of target variable) with this tree to make a new prediction of the target variable. New prediction value = original prediction value + learning rate x prediction value from the tree.
- Learning rate: To control the contribution of trees and avoid overfitting the data, GB uses a learning rate. In other words, scaling the tree by the learning rate results in a small step in the right direction.
- Main authors of GB suggested that empirical evidence shows that taking lots of small steps in the right direction results in better predictions with a testing dataset, i.e. lower variance.
- Note: At each step, to get the new prediction, we combine the results of ALL the predictions, i.e. new prediction = original prediction + (LR) x residuals tree1 + (LR) x residuals tree2 + (LR) x residuals tree3 + …
- We repeat the steps above over and over again until we reach the maximum specified, or adding additional trees does not significantly reduce the size of residuals.