Gradient boost for regression

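Gradient Boost and AdaBoost are closely related: both build trees one after another, and each new tree tries to correct the errors of the ones before it. So, let’s start by comparing the two algorithms.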

Gradient Boost vs. AdaBoost

For a regression problem:

Gradient Boost algorithm

  1. Calculate the average value of the target variable. This average is the initial prediction for every sample.
  2. Next, we build a tree based on the errors of the previous prediction. For the first tree, the error is just the difference between the observed target and the predicted (average) target.
    • Note: This difference is called a pseudo-residual. We build a tree to predict these residuals, not the target itself.
    • Note: The trees can differ in structure from one step to the next.
  3. Because we restrict the number of leaves, the tree has fewer leaves than there are residuals. As a result, several residuals end up in the same leaf. The output of such a leaf is the average of the residuals in it.
  4. Now, we can combine the original leaf (average of the target variable) with this tree to make a new prediction of the target variable. New prediction value = original prediction value + learning rate × prediction value from the tree.
    • Learning rate: To control the contribution of each tree and avoid overfitting the data, GB uses a learning rate (a value between 0 and 1). In other words, scaling the tree's output by the learning rate results in a small step in the right direction.
    • The original authors of GB point to empirical evidence that taking many small steps in the right direction gives better predictions on a testing dataset, i.e. lower variance.
    • Note: At each step, to get the new prediction, we combine the results of ALL the trees built so far, i.e. new prediction = original prediction + (LR) × output of residual tree 1 + (LR) × output of residual tree 2 + (LR) × output of residual tree 3 + …
  5. We repeat steps 2 to 4 over and over again, each time computing new residuals from the latest prediction, until we reach the maximum number of trees specified or adding more trees does not significantly reduce the size of the residuals.
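
The steps above can be sketched in plain Python. This is a minimal illustration under simplifying assumptions, not a reference implementation: it handles a single numeric feature, limits every tree to two leaves (a stump), and stops early when the residuals become negligible. The names fit_stump, gradient_boost, n_trees, and tol are hypothetical, chosen only for this example.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a two-leaf regression tree to the residuals (step 3).

    Assumes xs contains at least two distinct values.
    Each leaf outputs the average of the residuals that fall into it.
    """
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_value, right_value = mean(left), mean(right)
        # Sum of squared errors if we split at this threshold.
        sse = sum((r - left_value) ** 2 for r in left)
        sse += sum((r - right_value) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x < threshold else right_value


def gradient_boost(xs, ys, n_trees=50, learning_rate=0.1, tol=1e-6):
    """Return a prediction function built with the five steps above."""
    base = mean(ys)                                  # step 1: initial leaf
    trees = []
    predictions = [base] * len(ys)
    for _ in range(n_trees):                         # step 5: tree budget
        # Step 2: pseudo-residuals of the current prediction.
        residuals = [y - p for y, p in zip(ys, predictions)]
        if max(abs(r) for r in residuals) < tol:     # step 5: nothing left to fix
            break
        tree = fit_stump(xs, residuals)              # step 3
        trees.append(tree)
        # Step 4: take a small step in the direction the tree suggests.
        predictions = [p + learning_rate * tree(x)
                       for x, p in zip(xs, predictions)]

    def predict(x):
        # Original prediction plus the scaled output of ALL trees.
        return base + learning_rate * sum(tree(x) for tree in trees)

    return predict


# Toy data, made up for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9]
model = gradient_boost(xs, ys, n_trees=50, learning_rate=0.1)
print(model(2), model(5))
```

With a smaller learning_rate, each tree moves the prediction less, so more trees are needed to reach the same training error. That is the trade-off described in step 4.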