XGBoost trees for regression

XGBoost is extreme, i.e. it’s a big machine learning algorithm with lots of parts.

Unique regression tree

For the sake of explanation, let’s say we have this simple dataset:

Drug Effectiveness Drug Dosage (mg)
-10 5
-7 20
7 25
8 35

How to build an XGB tree for regression?

Note: There are many ways to build xgb trees, but here we focus on the most common way to build them.

  1. Each tree starts out as a single leaf.
    • All the residuals go to the first leaf.
  2. We calculate a quality score, or similarity score for the residuals.

similarity score=sum of residuals squarednumber of residuals+λ\text{similarity score} = \frac{\text{sum of residuals squared}}{\text{number of residuals}+\lambda}

Screen Shot 2022-01-15 at 11.44.09 PM
  1. Now we split the observations into two groups and calculate the similarity score for each node. Here, we split them based on dosage.
    • Note: The number of residuals in the similarity score changes based on the number of observations that end up in each group.
Screen Shot 2022-01-15 at 11.55.12 PM

Gain=Leftsimilarity+RightsimilarityRootsimilarity\text{Gain} = \text{Left}_{\text{similarity}} + \text{Right}_{\text{similarity}} - \text{Root}_{\text{similarity}}

  1. We keep splitting the observations (residuals) at each level of the tree as explained above. The default tree depth in xgb is 6 levels.

How to prune the XGB tree?

Regularization parameter λ\lambda

Making predictions with XGB tree

output value=sum of residualsnumber of residuals+λ\text{output value}=\frac{\text{sum of residuals}}{\text{number of residuals} + \lambda}

Screen Shot 2022-01-16 at 1.09.48 AM
Screen Shot 2022-01-16 at 1.18.05 AM

Now, we build another tree based on the new residuals and repeat the same steps. We keep building trees until the residuals are super small, or we have reached the maximum number.

XGBoost for Classification

Add the notes later Source

XGBoost: Mathematical details

In both regression and classification, xgb builds trees using similarity scores and then calculates the output values for the leaves. Now, we will derive the equations for the similarity scores and output values and show that the only difference between regression and classification is the loss function.

Now that we have the loss functions for both regression and classification, xgb uses those loss functions to build trees by minimizing this equation (eq.1):

[i=1NL(yi,pi)]+12λOvalue2[\sum\limits_{i=1}^{N}L(y_i, p_i)] + \frac{1}{2}\lambda O_{\text{value}}^2

where 12λOvalue2\frac{1}{2}\lambda O_{\text{value}}^2 is the **regularization term. The goal is to find an output value (OvalueO_{\text{value}}) for the leaf that minimizes the whole equation.

[i=1NL(yi,pi)]+γT+12λOvalue2[\sum\limits_{i=1}^{N}L(y_i, p_i)] + \gamma T + \frac{1}{2}\lambda O_{\text{value}}^2

In a way, the loss function equation is very similar to ridge regression. We square the output values from the new tree and scale it with λ\lambda. Just like ridge regression, if λ<0\lambda < 0, then we will shrink the OvalueO_{\text{value}}. Note that the 12\frac{1}{2} just makes the math easier.

To sum up, we use the equation eq.1 to build our tree such that at each level it finds the output value that minimizes the loss function.

Note: Based on the eq.1, we can see that the more emphasis we give the regularization penalty by increasing λ\lambda, the optimal OvalueO_{\text{value}} gets closer to 00.

Note: When the Gradient Boost algorithm also solves for the optimal output value for a leaf, it solves an equation very similar to eq.1 without the regularization.

L(y,pi+Ovalue)L(y,pi)+[ddpiL(y,pi)]Ovalue+12[d2dpi2L(y,pi)]Ovalue2L(y, p_i + O_{\text{value}}) \approx L(y, p_i) + [\frac{d}{dp_i}L(y, p_i)]O_{\text{value}} + \frac{1}{2}[\frac{d^2}{dp^{2}_{i}}L(y, p_i)]O^{2}_{\text{value}}

In contrast, xgb uses the Second Order Taylor Approximation for both regression and classification. xgb uses gg (gradient) to show the first derivative, and hh (Hessian) to show the second derivative.

Deriving the equation for regression

Here’s how the optimal output value is calculated for each node:

[i=1NL(yi,pi0)]+12λOvalue2=L(y,pi)+gOvalue+hOvalue2+12λOvalue2=L(y,pi)+(g1+g2++gn)Ovalue+12(h1+h2++hn+λ)Ovalue2[\sum\limits_{i=1}^{N}L(y_i, p^{0}_{i})] + \frac{1}{2}\lambda O_{\text{value}}^2 = L(y, p_i) + \mathbf{g}O_{\text{value}} + \mathbf{h}O^{2}_{\text{value}} + \frac{1}{2}\lambda O_{\text{value}}^2 \\ = L(y, p_i) + (g_1+g_2+\dots+g_n)O_{\text{value}} + \frac{1}{2}(h_1+h_2+\dots+h_n+\lambda)O_{\text{value}}^2

Note: Since L(y,pi)L(y, p_i) has no OvalueO_{\text{value}} in it, we can leave it off the equation. We number this equation eq.2.0.

(g1+g2++gn)Ovalue+12(h1+h2++hn+λ)Ovalue2(g_1+g_2+\dots+g_n)O_{\text{value}} + \frac{1}{2}(h_1+h_2+\dots+h_n+\lambda)O_{\text{value}}^2

Setting eq.2.0 equal to 00 results in (eq.2):

ddOvalue=0Ovalue=(g1+g2++gn)(h1+h2++hn+λ)\frac{d}{d_{O_{\text{value}}}} = 0 \Rightarrow O_{\text{value}} = \frac{-(g_1+g_2+\dots+g_n)}{(h_1+h_2+\dots+h_n+\lambda)}

where nn is the total number of observations (residuals) that goes into the leaf.

As mentioned above the most common loss function for regression is: i=1NL(yi,pi)=i=1N12(yipi)2\sum\limits_{i=1}^{N}L(y_i, p_i) = \sum\limits_{i=1}^{N}\frac{1}{2}(y_i - p_i)^2, therefore,

gi=(yipi)hi=1 g_i = -(y_i - p_i) \\ h_i = 1

where gi=ddpig_i = \frac{d}{d_{p_i}} and hi=d2dpi2h_i = \frac{d^2}{d^2_{p_i}} and (yipi)(y_i - p_i) is the residual. Placing these values into eq.2, we get the equation that we used in previous section:

Ovalue=sum of residualsnumber of residuals+λO_{\text{value}} = \frac{\small{\text{sum of residuals}}}{\small{\text{number of residuals}}+\lambda}

Deriving the equation for classification

For classification, the loss function is,

i=1NL(yi,pi)=i=1N[yilog(pi)+(1yi)log(1pi)]\sum\limits_{i=1}^{N}L(y_i, p_i) = \sum\limits_{i=1}^{N}-[y_i \log{(p_i)} + (1-y_i)\log{(1-p_i)}]

In this case, the output values are in log(odds) terms, so we convert the probabilities to log(odds),

L(yi,pi)=[yilog(pi)+(1yi)log(1pi)]L(y_i, p_i) = -[y_i \log{(p_i)} + (1-y_i)\log{(1-p_i)}]
L(yi,log(odds)i)=yilog(odds)i+log(1+elog(odds)i)L(y_i, {\log(odds)}_i) = -y_i {\log(odds)}_i + \log{(1 + e^{{\log(odds)}_i})}
gi=ddlog(odds)iL(yi,log(odds)i)=yi+elog(odds)i1+elog(odds)i=(yipi)g_i = \frac{d}{d{\log(odds)}_i}L(y_i, {\log(odds)}_i) = -y_i + \frac{e^{{\log(odds)}_i}}{1+e^{{\log(odds)}_i}} = -(y_i - p_i)
hi=d2dlog(odds)i2L(yi,log(odds)i)=elog(odds)i1+elog(odds)i×11+elog(odds)i=pi×(1pi)h_i = \frac{d^2}{d{\log(odds)}^2_i}L(y_i, {\log(odds)}_i) = \frac{e^{{\log(odds)}_i}}{1+e^{{\log(odds)}_i}} \times \frac{1}{1+e^{{\log(odds)}_i}} = p_i \times (1-p_i)

If we replace the above values into eq.2, we get the optimal output value for a leaf as follows,

Ovalue=sum of residuals[previous probabilityi×(1previous probabilityi)]+λO_{\text{value}} = \frac{\small{\text{sum of residuals}}}{\sum[\small{\text{previous probability}}_i \times (1 - \small{ \text{previous probability}}_i)]+\lambda}


Note: Regardless of whether we are doing regression or classification, the optimal output value is always calculated using eq.2,

ddOvalue=0Ovalue=(g1+g2++gn)(h1+h2++hn+λ)\frac{d}{d_{O_{\text{value}}}} = 0 \Rightarrow O_{\text{value}} = \frac{-(g_1+g_2+\dots+g_n)}{(h_1+h_2+\dots+h_n+\lambda)}

Deriving the equation for Similarity Score

We need to derive the Similarity Score in order to grow the tree. Remember that:

Here’s how xgb calculates the Similarity Score:

  1. Multiplying eq.2.0 by 1-1. This way, the optimal OvalueO_{\text{value}} represents the highest point for eq.2.
  2. The value of negative eq.2.0 equation is the Similarity Score (as described the original XGB manuscript).
    • Note: The Similarity Score used in the implementation is 22 times that number. The reason will become clear once we do the algebra on eq.2.

(g1+g2++gn)Ovalue12(h1+h2++hn+λ)Ovalue2-(g_1+g_2+\dots+g_n)O_{\text{value}} - \frac{1}{2}(h_1+h_2+\dots+h_n+\lambda)O_{\text{value}}^2

Now, replacing OvalueO_{\text{value}} with its optimal value (g1+g2++gn)(h1+h2++hn+λ)\frac{-(g_1+g_2+\dots+g_n)}{(h_1+h_2+\dots+h_n+\lambda)}, we get (eq.3)

Similarity Score=12(g1+g2++gn)2(h1+h2++hn+λ)\small{\text{Similarity Score}} = \frac{1}{2}\frac{(g_1+g_2+\dots+g_n)^2}{(h_1+h_2+\dots+h_n+\lambda)}

eq.3 is the equation for Similarity Score as described in the original xgb paper. However, in the implementation the 12\frac{1}{2} is omitted because the Similarity Score is only a relative measure and as long as every Similarity Score is scaled the same amount, the results of the comparison will be the same. Therefore, we can rewrite eq.3 as follows,

Similarity Score=(g1+g2++gn)2(h1+h2++hn+λ)\small{\text{Similarity Score}} = \frac{(g_1+g_2+\dots+g_n)^2}{(h_1+h_2+\dots+h_n+\lambda)}

Now, for regression with the loss function L(yi,pi)=12(yipi)2L(y_i, p_i) = \frac{1}{2}(y_i - p_i)^2 and gi=(yipi)g_i = -(y_i - p_i) and hi=1h_i = 1, we get

Similarity Score=sum of residuals, squarednumber of residuals+λ\small{\text{Similarity Score}} = \frac{\small{\text{sum of residuals, squared}}}{\small{\text{number of residuals}}+\lambda}

Similarly, for classification we get

Similarity Score=(sum of residuals)2[previous probabilityi×(1previous probabilityi)]+λ\small{\text{Similarity Score}} = \frac{(\small{\text{sum of residuals}})^2}{\sum[\small{\text{previous probability}}_i \times (1 - \small{ \text{previous probability}}_i)]+\lambda}

Cover

Cover is related to the minimum number of residuals in a leaf. It is the denominator of the Similarity Score minus λ\lambda.

Therefore, Cover for the regression is simply: number of residuals\text{number of residuals}, and for classification, it is: [pi×(1pi)]\sum[p_i \times (1-p_i)]

XGBoost Optimizations

XGB is an algorithm with a lot of parts. So far, we have only talked about the unique regression trees that xgb uses. The first three items in the list give us a conceptual idea of how xgb is fit to training data and how it makes predictions. The remaining items describe optimizations for large datasets. These parts are what make xgb relatively efficient with relatively large training datasets.

Approximate Greedy Algorithm

Just to review how xgb fits to data:

  1. It starts by making an initial prediction which could be anything (default is 0.5).
  2. Calculates the residuals for each observation.
  3. Fits a tree to the residuals. It does this by calculating the Similarity Scores and the Gain for each possible threshold. The threshold with the largest Gain is the one xgb uses.
    • Note: The decision to use the threshold that gives the largest Gain is made without worrying about how the leaves will split later. That means xgb uses a greedy algorithm to build trees.
    • In other words, since xgb uses a greedy algorithm, it makes a decision without looking ahead to see if it is the absolute best choice in the long term.
    • In contrast, if xgb did not use a greedy algorithm, it would postpone making the final decision about a threshold, until after trying different thresholds in the leaves to see how things played out in the long run.
    • This same process would be repeated for every single possible threshold for the root. In other words, by using a greedy algorithm, xgb can build a tree relatively quickly.
    • That said, when we have a lot of data, then the greedy algorithm becomes slow because it still has to look at every possible threshold. With large datasets with several features, checking every single threshold in every single variable would take forever to finish.

This is where the Approximate Greedy Algorithm comes in.

Parallel Learning & Weighted Quantile Sketch

Let’s say you have such a large dataset that it can fit into your computer’s memory at one time. Then, things that seem simple, like sorting a list of numbers and finding quantiles, become really slow. To get around this problem, a class of algorithms, called Sketches, can quickly create approximate solutions.

In simple words, Sketch algorithms split data in smaller pieces (this allows for Parallel Learning), and Quantile Sketch Algorithm combines the values from each computer to make an approximate histogram. Then, the approximate histogram is used to calculate approximate quantiles. The Approximate Greedy Algorithm uses approximate quantiles \rightarrow This is Quantile Sketch Algorithm, but xgb uses Weighted Quantile Sketch algorithm.

What’s a weighted quantile?

Usually quantiles are set up so that the same number of observations are in each one. In contrast, with weighted quantiles, each observation has a corresponding weight, and the sum of the weights are similar in each quantile. The weights are derived from the Cover metric we discussed in previous sections.

Specifically, the ***weight for each observation is the 2nd derivative of the loss function (Hessian)***. That means, for regression the weights are all equal to 1 (i.e. weighted quantiles are just like normal quantiles and contain an equal number of observations).

In contrast, for classification the weights are: previous probabilityi×(1previous probabilityi)\small{\text{previous probability}}_i \times (1 - \small{ \text{previous probability}}_i). These weights are calculated after building each tree. To explain how this can improve predictions, let’s consider a simple example. Let’s say we have this small dataset:

obs. ID predicted probability
1 0.1
2 0.1
3 0.9
4 0.9
5 0.6
6 0.4

If we calculate the weights we get

obs. ID predicted probability quantile weights
1 0.1 0.09
2 0.1 0.09
3 0.9 0.09
4 0.9 0.09
5 0.6 0.24
6 0.4 0.24
Screen Shot 2022-01-17 at 12.27.24 PM
Screen Shot 2022-01-17 at 12.30.16 PM

Note: xgb only uses the above-mentioned tricks when the training dataset is huge. With small datasets, xgb just uses a normal greedy algorithm.

Sparsity-Aware Split Finding

Let’s say we have to this data with two missing values. We can still calculate the residuals (with the initial predicted probability 0.5).

Screen Shot 2022-01-17 at 12.43.07 PM
Screen Shot 2022-01-17 at 12.47.02 PM
Screen Shot 2022-01-17 at 12.59.46 PM
Screen Shot 2022-01-17 at 1.05.41 PM

Thus, Sparsity-Aware Split Finding tells us how to build trees with missing data.

Cache-Aware Access

If you want your program to run really fast, the goal is to maximize what you can do with the cache memory. xgb puts the gradients and Hessians in the cache so that it can rapidly calculate Similarity Scores and Output Values.

Blocks for Out-of-Core Computation

When the dataset is too large for cache and main memory, then, at least some of it, must be stored on the hard drive. Because reading and writing data to the hard drive is super slow, xgb tries to minimize these actions by compressing the data.

Even though the CPU must spend some time decompressing the data that comes from the hard drive, it can do this faster than the hard drive can read the data. In other words, by spending a little bit of CPU time uncompressing the data we can avoid spending a lot of time accessing the hard drive.

Also, when there is more than one hard drive available for storage, xgb uses a database technique called Sharding to speed up disk access.

Note: xgb is fast for several reasons, some of them are not related to statistical methods at all. For example, the last two elements of xgb algorithm (cache-aware access and blocks for out-of-core computation) are merely computation hardware optimizations.

XGBoost in Python

Refer to this link