XGBoost trees for regression

XGBoost is extreme, i.e. it’s a big machine learning algorithm with lots of parts.

Gradient boost (-ish)
Regularization
A unique regression tree
Approximate greedy algorithm
Weighted quantile sketch
Sparsity-aware split finding
Parallel learning
Cache-aware access
Blocks for out-of-core computation

Unique regression tree

For the sake of explanation, let’s say we have this simple dataset:

Drug Effectiveness	Drug Dosage (mg)
-10	5
-7	20
7	25
8	35

The very first step in fitting xgb to the training data is to make an initial prediction.
- This prediction can be anything, but by default, it is 0.5, regardless of whether you are using xgb for regression or classification.
- The residuals show us how good the prediction is.
Just like gradient boosting, xgb fits a regression tree to the residuals.
- However, unlike gradient boost, which typically uses regular off-the-shelf regression trees, xgb uses a ***unique regression tree (xgb tree)***.

How to build an XGB tree for regression?

Note: There are many ways to build xgb trees, but here we focus on the most common way to build them.

Each tree starts out as a single leaf.
- All the residuals go to the first leaf.
We calculate a quality score, or similarity score for the residuals.

$\text{similarity score} = \frac{\text{sum of residuals squared}}{\text{number of residuals}+\lambda}$

Note: $\lambda$ is a regularization parameter. For now, let $\lambda=0$ .

Note: Because we do not square the residuals before we add them together in the numerator, 7.5 and -7.5 cancel each other out. The similarity score in the example above is $4$ .

Now we split the observations into two groups and calculate the similarity score for each node. Here, we split them based on dosage.
- Note: The number of residuals in the similarity score changes based on the number of observations that end up in each group.

Now that we have calculated the similarity scores for each node, we see that when the residuals in a node are very different, they cancel each other out and the similarity score is relatively small.
- Now, we need to quantify how much better the leaves cluster similar residuals than the root.
- We do this by calculating the gain of splitting the residuals into two groups.

$\text{Gain} = \text{Left}_{\text{similarity}} + \text{Right}_{\text{similarity}} - \text{Root}_{\text{similarity}}$

Now, we can calculate the gain for dosage < 15, we can compare it to the gain calculated for other thresholds.
- Say, we calculate the gain for dosage < 22.5. In this case, we get two observations in each node, with similarity scores of $8$ and $0$ for the left and right nodes, respectively. The gain here is $4$ .
- Since the gain for this threshold (22.5) is lower than the gain for threshold 15, dosage < 15 is better at splitting residuals into clusters of similar values.
We keep splitting observations based on different thresholds and calculate the gain based on the node similarity score. We use the threshold that gave us the largest gain for the first branch in the tree.

We keep splitting the observations (residuals) at each level of the tree as explained above. The default tree depth in xgb is 6 levels.

How to prune the XGB tree?

We prune and xgb tree based on its gain values.
We start by picking a number. xgb calls this number Gamma ( $\large{\gamma}$ ).
Then, we calculate the difference between the gain associated with the lowest branch in the tree and the value for $\large{\gamma}$ .
If the difference between the gain and $\large{\gamma}$ is negative, we will remove the branch.

Regularization parameter $\lambda$

$\lambda$ is intended to reduce the prediction’s sensitivity to individual observations.
When $\lambda > 0$ , the similarity scores are smaller. The amount of decrease is inversely proportional to the number of residuals in the node, i.e. nodes with an already low number of residuals will have the largest decrease due to regularization.
When $\lambda > 0$ , it’s easier to prune leaves because the values for gain are smaller.
Note: Sometimes $\lambda > 0$ can cause the similarity score for a brach to become negative. In such cases, even with $\gamma = 0$ , the branch gets pruned. This is the effect of $\lambda$ . It prevents overfitting the data.

Making predictions with XGB tree

Given a tree (and its branches), we can calculate the output value of each node:

$\text{output value}=\frac{\text{sum of residuals}}{\text{number of residuals} + \lambda}$

Note: With $\lambda > 0$ , we reduce the amount that any individual observation adds to the overall prediction. Thus, $\lambda$ reduces the prediction’s sensitivity to any individual observation.
Just like gradient boost, xgb makes new predictions for each observation by starting with the initial prediction and adding the output of the tree scaled by the learning rate.
XGBoost calls the learning rate, $\eta$ , and the default value is 0.3.

Now, we build another tree based on the new residuals and repeat the same steps. We keep building trees until the residuals are super small, or we have reached the maximum number.

XGBoost for Classification

Add the notes later Source

XGBoost: Mathematical details

In both regression and classification, xgb builds trees using similarity scores and then calculates the output values for the leaves. Now, we will derive the equations for the similarity scores and output values and show that the only difference between regression and classification is the loss function.

In both regression and classification xgb starts with an initial prediction of 0.5.
It then calculates the residuals.
Just like Gradient Boost, we can quantify how good the prediction is with loss function.
- For regression, we use this loss function: $\sum\limits_{i=1}^{N}L(y_i, p_i) = \sum\limits_{i=1}^{N}\frac{1}{2}(y_i - p_i)^2$
- For classification, we use this loss function: $\sum\limits_{i=1}^{N}L(y_i, p_i) = \sum\limits_{i=1}^{N}-[y_i \log{(p_i)} + (1-y_i)\log{(1-p_i)}]$

Now that we have the loss functions for both regression and classification, xgb uses those loss functions to build trees by minimizing this equation (eq.1):

$[\sum\limits_{i=1}^{N}L(y_i, p_i)] + \frac{1}{2}\lambda O_{\text{value}}^2$

where $\frac{1}{2}\lambda O_{\text{value}}^2$ is the **regularization term. The goal is to find an output value ( $O_{\text{value}}$ ) for the leaf that minimizes the whole equation.

Note: In the original xgb paper, this equation contains an extra term (which we are omitting):

$[\sum\limits_{i=1}^{N}L(y_i, p_i)] + \gamma T + \frac{1}{2}\lambda O_{\text{value}}^2$

Where $T$ is the number of Terminal nodes (or leaves) in a tree and $\gamma$ is a user-defined penalty that is meant to encourage pruning.
- We say encourages pruning because as we saw before, xgb can prune even when $\gamma = 0$ . Also, pruning takes place after the full tree is built, and it plays no role in deriving the optimal values or similarity scores.

In a way, the loss function equation is very similar to ridge regression. We square the output values from the new tree and scale it with $\lambda$ . Just like ridge regression, if $\lambda < 0$ , then we will shrink the $O_{\text{value}}$ . Note that the $\frac{1}{2}$ just makes the math easier.

To sum up, we use the equation eq.1 to build our tree such that at each level it finds the output value that minimizes the loss function.

Note: Based on the eq.1, we can see that the more emphasis we give the regularization penalty by increasing $\lambda$ , the optimal $O_{\text{value}}$ gets closer to $0$ .

Note: When the Gradient Boost algorithm also solves for the optimal output value for a leaf, it solves an equation very similar to eq.1 without the regularization.

Gradient Boost uses two techniques to solve this equation. One for the regression (because the math was easy), and a different one for classification because the math was a little harder.
Specifically, for classification, Gradient Boost uses a Second Order Taylor Approximation to simplify the math when solving for the optimal output value.

$L(y, p_i + O_{\text{value}}) \approx L(y, p_i) + [\frac{d}{dp_i}L(y, p_i)]O_{\text{value}} + \frac{1}{2}[\frac{d^2}{dp^{2}_{i}}L(y, p_i)]O^{2}_{\text{value}}$

In contrast, xgb uses the Second Order Taylor Approximation for both regression and classification. xgb uses $g$ (gradient) to show the first derivative, and $h$ (Hessian) to show the second derivative.

Deriving the equation for regression

Here’s how the optimal output value is calculated for each node:

$[\sum\limits_{i=1}^{N}L(y_i, p^{0}_{i})] + \frac{1}{2}\lambda O_{\text{value}}^2 = L(y, p_i) + \mathbf{g}O_{\text{value}} + \mathbf{h}O^{2}_{\text{value}} + \frac{1}{2}\lambda O_{\text{value}}^2 \\ = L(y, p_i) + (g_1+g_2+\dots+g_n)O_{\text{value}} + \frac{1}{2}(h_1+h_2+\dots+h_n+\lambda)O_{\text{value}}^2$

Note: Since $L(y, p_i)$ has no $O_{\text{value}}$ in it, we can leave it off the equation. We number this equation eq.2.0.

$(g_1+g_2+\dots+g_n)O_{\text{value}} + \frac{1}{2}(h_1+h_2+\dots+h_n+\lambda)O_{\text{value}}^2$

Setting eq.2.0 equal to $0$ results in (eq.2):

$\frac{d}{d_{O_{\text{value}}}} = 0 \Rightarrow O_{\text{value}} = \frac{-(g_1+g_2+\dots+g_n)}{(h_1+h_2+\dots+h_n+\lambda)}$

where $n$ is the total number of observations (residuals) that goes into the leaf.

As mentioned above the most common loss function for regression is: $\sum\limits_{i=1}^{N}L(y_i, p_i) = \sum\limits_{i=1}^{N}\frac{1}{2}(y_i - p_i)^2$ , therefore,

$g_i = -(y_i - p_i) \\ h_i = 1$

where $g_i = \frac{d}{d_{p_i}}$ and $h_i = \frac{d^2}{d^2_{p_i}}$ and $(y_i - p_i)$ is the residual. Placing these values into eq.2, we get the equation that we used in previous section:

$O_{\text{value}} = \frac{\small{\text{sum of residuals}}}{\small{\text{number of residuals}}+\lambda}$

Deriving the equation for classification

For classification, the loss function is,

$\sum\limits_{i=1}^{N}L(y_i, p_i) = \sum\limits_{i=1}^{N}-[y_i \log{(p_i)} + (1-y_i)\log{(1-p_i)}]$

In this case, the output values are in log(odds) terms, so we convert the probabilities to log(odds),

$L(y_i, p_i) = -[y_i \log{(p_i)} + (1-y_i)\log{(1-p_i)}]$
$L(y_i, {\log(odds)}_i) = -y_i {\log(odds)}_i + \log{(1 + e^{{\log(odds)}_i})}$
$g_i = \frac{d}{d{\log(odds)}_i}L(y_i, {\log(odds)}_i) = -y_i + \frac{e^{{\log(odds)}_i}}{1+e^{{\log(odds)}_i}} = -(y_i - p_i)$
$h_i = \frac{d^2}{d{\log(odds)}^2_i}L(y_i, {\log(odds)}_i) = \frac{e^{{\log(odds)}_i}}{1+e^{{\log(odds)}_i}} \times \frac{1}{1+e^{{\log(odds)}_i}} = p_i \times (1-p_i)$

If we replace the above values into eq.2, we get the optimal output value for a leaf as follows,

$O_{\text{value}} = \frac{\small{\text{sum of residuals}}}{\sum[\small{\text{previous probability}}_i \times (1 - \small{ \text{previous probability}}_i)]+\lambda}$

Note: Regardless of whether we are doing regression or classification, the optimal output value is always calculated using eq.2,

$\frac{d}{d_{O_{\text{value}}}} = 0 \Rightarrow O_{\text{value}} = \frac{-(g_1+g_2+\dots+g_n)}{(h_1+h_2+\dots+h_n+\lambda)}$

Deriving the equation for Similarity Score

We need to derive the Similarity Score in order to grow the tree. Remember that:

We derived the equation for $O_{\text{value}}$ by minimizing the sum of the loss functions plus regularizations.
Depending on the loss function, optimizing the first part of (eq.1) can be hard, so we approximated it with a Second Order Taylor Polynomial.

Here’s how xgb calculates the Similarity Score:

Multiplying eq.2.0 by $-1$ . This way, the optimal $O_{\text{value}}$ represents the highest point for eq.2.
The value of negative eq.2.0 equation is the Similarity Score (as described the original XGB manuscript).
- Note: The Similarity Score used in the implementation is $2$ times that number. The reason will become clear once we do the algebra on eq.2.

$-(g_1+g_2+\dots+g_n)O_{\text{value}} - \frac{1}{2}(h_1+h_2+\dots+h_n+\lambda)O_{\text{value}}^2$

Now, replacing $O_{\text{value}}$ with its optimal value $\frac{-(g_1+g_2+\dots+g_n)}{(h_1+h_2+\dots+h_n+\lambda)}$ , we get (eq.3)

$\small{\text{Similarity Score}} = \frac{1}{2}\frac{(g_1+g_2+\dots+g_n)^2}{(h_1+h_2+\dots+h_n+\lambda)}$

eq.3 is the equation for Similarity Score as described in the original xgb paper. However, in the implementation the $\frac{1}{2}$ is omitted because the Similarity Score is only a relative measure and as long as every Similarity Score is scaled the same amount, the results of the comparison will be the same. Therefore, we can rewrite eq.3 as follows,

$\small{\text{Similarity Score}} = \frac{(g_1+g_2+\dots+g_n)^2}{(h_1+h_2+\dots+h_n+\lambda)}$

Now, for regression with the loss function $L(y_i, p_i) = \frac{1}{2}(y_i - p_i)^2$ and $g_i = -(y_i - p_i)$ and $h_i = 1$ , we get

$\small{\text{Similarity Score}} = \frac{\small{\text{sum of residuals, squared}}}{\small{\text{number of residuals}}+\lambda}$

Similarly, for classification we get

$\small{\text{Similarity Score}} = \frac{(\small{\text{sum of residuals}})^2}{\sum[\small{\text{previous probability}}_i \times (1 - \small{ \text{previous probability}}_i)]+\lambda}$

Cover

Cover is related to the minimum number of residuals in a leaf. It is the denominator of the Similarity Score minus $\lambda$ .

Therefore, Cover for the regression is simply: $\text{number of residuals}$ , and for classification, it is: $\sum[p_i \times (1-p_i)]$

XGBoost Optimizations

XGB is an algorithm with a lot of parts. So far, we have only talked about the unique regression trees that xgb uses. The first three items in the list give us a conceptual idea of how xgb is fit to training data and how it makes predictions. The remaining items describe optimizations for large datasets. These parts are what make xgb relatively efficient with relatively large training datasets.

Approximate greedy algorithm
Parallel learning
Weighted quantile sketch
Sparsity-aware split finding
Cache-aware access
Blocks for out-of-core computation

Approximate Greedy Algorithm

Just to review how xgb fits to data:

It starts by making an initial prediction which could be anything (default is 0.5).
Calculates the residuals for each observation.
Fits a tree to the residuals. It does this by calculating the Similarity Scores and the Gain for each possible threshold. The threshold with the largest Gain is the one xgb uses.
- Note: The decision to use the threshold that gives the largest Gain is made without worrying about how the leaves will split later. That means xgb uses a greedy algorithm to build trees.
- In other words, since xgb uses a greedy algorithm, it makes a decision without looking ahead to see if it is the absolute best choice in the long term.
- In contrast, if xgb did not use a greedy algorithm, it would postpone making the final decision about a threshold, until after trying different thresholds in the leaves to see how things played out in the long run.
- This same process would be repeated for every single possible threshold for the root. In other words, by using a greedy algorithm, xgb can build a tree relatively quickly.
- That said, when we have a lot of data, then the greedy algorithm becomes slow because it still has to look at every possible threshold. With large datasets with several features, checking every single threshold in every single variable would take forever to finish.

This is where the Approximate Greedy Algorithm comes in.

For example, with a lot of observations, instead of testing every single threshold, we could divide the data into quantiles, and only use the quantiles as candidate thresholds to split the observations.
For example, instead of using the smallest two observations the define the first threshold, it uses the first quantile to define the first threshold, second quantile for the second threshold, and so on.
Note: If we only used one quantile and it split the observations in half, then, since there are no other options, that quantile would be the threshold.
- This would make finding the “best” threshold very fast since we would not have to calculate Gain or Similarity to make the decision.
- But there’s possibility that such threshold does not do a good job separating data. Adding more quantiles would mitigate this problem.
- Adding more quantiles can increase accuracy of model predictions, but the more quantiles we have, the more thresholds we will have to test, and that means it will take longer to build the tree.
- For xgb, the Approximate Greedy Algorithm means that instead of testing all possible thresholds, we only test quantiles. By default, the Approximate Greedy Algorithm uses about 33 quantiles.
- Why do we say “about” 33 quantiles instead of “exactly” 33 quantiles?
  - The answer to this question will be clear in once we talk about Parallel Learning and Weighted Quantile Sketch.

Parallel Learning & Weighted Quantile Sketch

Let’s say you have such a large dataset that it can fit into your computer’s memory at one time. Then, things that seem simple, like sorting a list of numbers and finding quantiles, become really slow. To get around this problem, a class of algorithms, called Sketches, can quickly create approximate solutions.

In simple words, Sketch algorithms split data in smaller pieces (this allows for Parallel Learning), and Quantile Sketch Algorithm combines the values from each computer to make an approximate histogram. Then, the approximate histogram is used to calculate approximate quantiles. The Approximate Greedy Algorithm uses approximate quantiles $\rightarrow$ This is Quantile Sketch Algorithm, but xgb uses Weighted Quantile Sketch algorithm.

What’s a weighted quantile?

Usually quantiles are set up so that the same number of observations are in each one. In contrast, with weighted quantiles, each observation has a corresponding weight, and the sum of the weights are similar in each quantile. The weights are derived from the Cover metric we discussed in previous sections.

Specifically, the ***weight for each observation is the 2nd derivative of the loss function (Hessian)***. That means, for regression the weights are all equal to 1 (i.e. weighted quantiles are just like normal quantiles and contain an equal number of observations).

In contrast, for classification the weights are: $\small{\text{previous probability}}_i \times (1 - \small{ \text{previous probability}}_i)$ . These weights are calculated after building each tree. To explain how this can improve predictions, let’s consider a simple example. Let’s say we have this small dataset:

obs. ID	predicted probability
1	0.1
2	0.1
3	0.9
4	0.9
5	0.6
6	0.4

If we calculate the weights we get

obs. ID	predicted probability	quantile weights
1	0.1	0.09
2	0.1	0.09
3	0.9	0.09
4	0.9	0.09
5	0.6	0.24
6	0.4	0.24

We see that when predicted probability is close to 0.5, meaning we don’t have much confidence in the classification, the weights are relatively large. In contrast, when the predicted probability is very close to 0 or 1, meaning we have a lot of confidence in the classification, the weights are relatively small.
If we split the data into equal quantiles (and considering each quantile as a unit $\rightarrow$ they end up in the same leaf together in the tree), since positive residual will cancel out negative residual, it will be very difficult to improve the predicted probabilities (especially for those low-confidence cases).

So, instead of using equal quantiles, xgb tries to make quantiles that have a similar sum of weights.
By dividing the observations into quantiles where the sum of weights are similar, we split the two observations with low confidence into separate bins.

The advantage of using the Weighted Quantile Sketch is that we get smaller quantiles when we need them.

Note: xgb only uses the above-mentioned tricks when the training dataset is huge. With small datasets, xgb just uses a normal greedy algorithm.

Sparsity-Aware Split Finding

Let’s say we have to this data with two missing values. We can still calculate the residuals (with the initial predicted probability 0.5).

Just like we normally do when we build xgb trees, we can put all of the residuals into a single leaf.
Now we need to determine if splitting the residuals into two leaves will do a better job clustering them. In order to do this we need to sort Dosages from low to high. But, with missing values we can’t do that.
xgb splits the data into two tables, one table will contain all of the observations with Dosage values, and another table will contain all of the observations without Dosage values.

Now, we use the table with all the values to build the tree. We do this as follows,
1. Sort data by Dosage
2. Use the average of first two Dosage as threshold and split the data (Note: if data is large, we use quantiles for thresholds).
3. For each leaf, we calculate two Gains. The first one, which we call $\small{\textbf{Gain}_\textbf{left}}$ , is calculated by putting all of the residuals with missing Dosage values into the leaf on the left. Similarly, we calculate the $\small{\textbf{Gain}_\textbf{right}}$ .
4. Repeat step 2 & 3 for the next two observations (Dosage) until we go through the entire dataset.
5. In the end, we choose the threshold that gave us the largest value for Gain, overall. In our example, that means picking $\small{\textbf{Gain}_\textbf{left}}$ when the threshold was *Dosage < 15.5 *.
  - Note: This path (i.e. going to the left leaf when *Dosage < 15.5 *) will be the default path for all the future observations that are missing values. So, if we get a new observation without a value for Dosage, but we still needed to predict Drug Effectiveness, then we would assume that this observation goes to the leaf on the left.

Thus, Sparsity-Aware Split Finding tells us how to build trees with missing data.

Cache-Aware Access

If you want your program to run really fast, the goal is to maximize what you can do with the cache memory. xgb puts the gradients and Hessians in the cache so that it can rapidly calculate Similarity Scores and Output Values.

Blocks for Out-of-Core Computation

When the dataset is too large for cache and main memory, then, at least some of it, must be stored on the hard drive. Because reading and writing data to the hard drive is super slow, xgb tries to minimize these actions by compressing the data.

Even though the CPU must spend some time decompressing the data that comes from the hard drive, it can do this faster than the hard drive can read the data. In other words, by spending a little bit of CPU time uncompressing the data we can avoid spending a lot of time accessing the hard drive.

Also, when there is more than one hard drive available for storage, xgb uses a database technique called Sharding to speed up disk access.

Note: xgb is fast for several reasons, some of them are not related to statistical methods at all. For example, the last two elements of xgb algorithm (cache-aware access and blocks for out-of-core computation) are merely computation hardware optimizations.

XGBoost in Python

Refer to this link