Gradient boost for regression

Gradient Boost and AdaBoost are very similar. So, let’s first start by comparing the two algorithms.

Gradient Boost vs. AdaBoost

For a regression problem:

AdaBoost starts by:
- Building a very short tree, called a stump, from the training data.
- The amount of say the stump has on the final output is based on how well it compensated for those previous errors.
- Then, AdaBoost builds the next stump based on errors that the previous stump made.
- AdaBoost continues to make stumps in this fashion until it has made the number of stumps you asked for, or it has a perfect fit.
In contrast, Gradient Boost (GB) start by:
- Making a single leaf (instead of a tree or stump).
- This leaf represents an initial guess for the target value for all samples.
- When trying to predict a continuous value, the first guess is the average value.
- Then, GB builds a tree. Like AdaBoost, this tree is based on the errors made by the previous tree.
- Unlike AdaBoost, this tree is usually larger than a stump. However, GB still restricts the size of the tree (you specify the tree size by the number of leaves). In practice, the maximum number of leaves is set between 8 and 32.
  Like AdaBoost, GB also scales the trees, but GB scales all the trees by the same amount (by learning rate), unlike AdaBoost.
- Then, GB builds another tree based on the errors made by the previous tree and scales it.
- GB continues to build trees in this fashion until it has made the number of trees you asked for, or additional trees fail to improve the fit.

Gradient Boost algorithm

Calculate the average value of the target variable.
Next, we build a tree based on the error of the previous tree. The error is just the difference between the observed target and the predicted (average) target.
- Note: The difference is called pseudo residual. We build a tree to predict these residuals.
- Note: The trees can be different at each step.
By restricting the number of leaves, we get a fewer number of leaves than the residuals. As a result, some residuals end up in the same leaf. We replace these residuals with their average value.
Now, we can combine the original leaf (average of target variable) with this tree to make a new prediction of the target variable. New prediction value = original prediction value + learning rate x prediction value from the tree.
- Learning rate: To control the contribution of trees and avoid overfitting the data, GB uses a learning rate. In other words, scaling the tree by the learning rate results in a small step in the right direction.
- Main authors of GB suggested that empirical evidence shows that taking lots of small steps in the right direction results in better predictions with a testing dataset, i.e. lower variance.
- Note: At each step, to get the new prediction, we combine the results of ALL the predictions, i.e. new prediction = original prediction + (LR) x residuals tree1 + (LR) x residuals tree2 + (LR) x residuals tree3 + …
We repeat the steps above over and over again until we reach the maximum specified, or adding additional trees does not significantly reduce the size of residuals.