4_decision_trees

Decision Tree

Table of Content

1. Decision Trees 1.1. How to build a CART model? 1.2. Gini Impurity 1.3. When do stop splitting? 1.4. How do we make prediction? 1.5. How does CART handle missing data? 1.6. How does CART make predictions for multi-class data? 1.7. How does CART handle regression? 1.8. How to handle categorical features in CART? 1.9. What are the down sides of CART? 1.9.1. Boosting 1.9.2. Bagging 1.10. XGBoost vs. LightGBM 1.10.1. What is Gradient Boosting? 1.10.2. What is XGBoost? 1.10.3. What is LightGBM? 1.10.4. Structural Differences between XGBoost and LightGBM

1. Decision Trees• CART (Classification and Regression Trees)• CART can handle missing data in training and prediction. 1.1. How to build a CART model?• We have to start by figuring out how to split examples by their label.• Let's say if we have only numerical features, then we have to decide which features we should split on, and what value of that features is best to make the split for.• The goal is to find the best feature and value which separates our examples by labels the most.• The first thing is to order the data based on the selected feature.• Then, we find the average between each two consecutive data points (i.e.

x_{i}

,

x_{i + 1}

).• Each of these averages are split points. We can split the data based on each of those split points (picture below). 1.2. Gini Impurity• How do we evaluate the effectiveness of each of these split points?– We use Gini Impurity to help us out. • – Gini Impurity: We find the squared probability of getting class

1

and class

0

for each node (picture below). We just do the following calculations for the rest of the splits and for all the features. We choose the split with the lowest gini impurity.1.3. When do stop splitting?• We can either assign a max_depth or a min_example_per_node.– Or we can go only the nodes are pure (→ This can lead to overfitting). 1.4. How do we make prediction?• For each data point we want to predict for, we feed it through the tree until we get to one of the end nodes (leaf). – The prediction is the majority class on the leaf node. In other words, the vote of the leaf node. 1.5. How does CART handle missing data?• If we're missing data on a feature, we just use the next best features that splits the data (second lowest gini impurity) 1.6. How does CART make predictions for multi-class data?• First, we add terms to the gini impurity corresponding to each class →

1 - (P_{c l a s s 0}^{2} + P_{c l a s s 1}^{2} + P_{c l a s s 2}^{2} + . . .)

• In prediction step, we again use the voting (i.e. relative majority or mode) 1.7. How does CART handle regression?• In case of regression models, we use Mean Squared Error (MSE) instead of Gini Impurity.– MSE node → We're summing the difference between all the values in the end nodes (leaf) and the average of that node. Here's how to calculate MSE for one node:

M S E = \frac{\sum_{i}^{} (l_{x_{i}} - l_{a v g .})^{2}}{n}

• – MSE split → To get the MSE of a split, we just add up all the squared differences (like equation

) and divide it by the total number of examples under that split.– We choose the split that gives us the smallest MSE. • For prediction → We just average the values of observations in a node. 1.8. How to handle categorical features in CART?• If the categorical feature is binary, the split is easy → each node get one of the values. But what about categorical features with more than 2 values?• In this case, we have to make splits for every possible combinations. • For Example: Let's say we want represent this list of countries:

[U S, U K, I R I, G E R]

. The splits will be like this:–

[U S]

and

[U K, I R I, G E R]

–

[U S, U K]

and

[I R I, G E R]

–

[U S, I R I]

and

[U K, G E R]

– and so on ...• Note: For a categorical feature with

n

distinct values, there will be

2^{n} - 1

subsets. 1.9. What are the down sides of CART?• CARTs tend to overfit very easily.– As the depth increases, the ability of the model to overfit increases as well.– Solutions: * Limit the max_depth to

2

to

5

.* Use Boosting (refer below for a simple explanation of how boosting works).* Use Bagging → Bootstrap Aggregation– CART is rarely used in practice. Instead, some ensemble variations of it is used very often. Examples of such methods → XGBoost, LightGBM, CatBoost, AdaBoost, RandomForest, etc. 1.9.1. Boosting• Boosting is very simply training another tree model on the error. In other words, boosting is an ensemble of weak learners.– Suppose, we train a tree and predict

230

where the true value is

270

. Boosting in this case is to train on the error of first tree (i.e.

270 - 230 = 40

).– In general, boosting for several times looks like this:*

P r e d (x_{i}) = t r e e_{1} (l a b e l (x_{i})) + t r e e_{2} (e r r o r (t r e e_{1})) + t r e e_{3} (e r r o r (t r e e_{2})) + . . .

– Note: Boosted trees also overfit easily. We need to perform cross-validation to find the best number of trees and max_depth.– Note: In boosted trees, each individual tree is called a weak learner.* These learners are defined as having better performance than random chance. 1.9.2. Bagging• In bagging, we create many trees based on sampling from data (observations) and also training on subset of features. To get predictions, we just average the predictions of all trees → This is called Random Forest.– Note: Out of pure math, it works out that roughly 36.7% (i.e.

1 ⁄ e

) of examples won't be trained on → We automatically get out-of-bag sample which we can use as our validation set.– Note: Bagging reduces the variance.– Note: Another variant is C4.5. * It can only do classification.* It can do

n

-ary splits (instead of binary splits)* It uses information gain based on entropy (instead of gini index)1.10. XGBoost vs. LightGBM• Source1.10.1. What is Gradient Boosting? • Gradient Boosting refers to a methodology in machine learning where an ensemble of weak learners is used to improve the model performance in terms of efficiency, accuracy, and interpretability.– These learners are defined as having better performance than random chance.– The hypothesis is to filter out instances that are difficult to accurately predict and develop new weak learners to handle them.• How Gradient Boosting Works? (a) The initial model is trained and predictions are run on the whole dataset. (b) The error between the actual value and prediction is calculated and more weight is given to the incorrect predictions. (c) Subsequently, a new model that attempts to fix the error of the previous model and in a similar way several models are thus created. We arrive at the final model by weighting the mean of all models.• Gradient Boosting Can Be Applied To:– Regression – taking the average of the outputs by the weak learners– Classification – finding the class prediction occurring the maximum number of times– 1.10.2. What is XGBoost?• XGBoost → eXtreme Gradient Boosting– XGB focuses on computation speed and model performance. It was introduced by Tianqi Chen and is currently a part of a wider toolkit by DMLC (Distributed Machine Learning Community).– It can be used for both classification and regression.– It supports the following kinds of boosting:* Gradient Boosting as controlled by learning rate* Stochastic Gradient Boosting that leverages sub-sampling at a row, column or column per split levels* Regularized Gradient Boosting using L1 (Lasso) and L2 (Ridge) regularization– Some of the other features that are offered from a system performance point of view are:* Using a cluster of machines to train a model using distributed computing* Utilization of all the available cores of a CPU during tree construction for parallelization* Out-of-core computing when working with datasets that do not fit into memory* Making the best use of hardware with cache optimization– In addition to the above the framework:* Accepts multiple types of input data* Works well with sparse input data for tree and linear booster* Supports the use of customized objective and evaluation functions.– To learn more about XGBoost, check out this page. 1.10.3. What is LightGBM?• Similar to XGBoost, LightGBM (by Microsoft) is a distributed high-performance framework that uses decision trees for ranking, classification, and regression tasks.– The advantages are as follows:* Faster training speed and accuracy resulting from LightGBM being a histogram-based algorithm that performs bucketing of values (also requires lesser memory)* Also compatible with large and complex datasets but is much faster during training* Support for both parallel learning and GPU learning– In contrast to the level-wise (horizontal) growth in XGBoost, LightGBM carries out leaf-wise (vertical) growth that results in more loss reduction and in turn higher accuracy while being faster. * But this may also result in overfitting on the training data which could be handled using the max-depth parameter that specifies where the splitting would occur. Hence, XGBoost is capable of building more robust models than LightGBM. 1.10.4. Structural Differences between XGBoost and LightGBM• It feels like LightGBM is significantly faster than XGBoost but delivers almost equivalent performance. We might wonder, what are exactly the differences between LightGBM and XGBoost?– – Leaf Growth: LightGBM has a faster rate of execution along with being able to maintain good accuracy levels primarily due to the utilization of two novel techniques:* Gradient-Based One-Side Sampling (GOSS)· In Gradient Boosted Decision Trees, the data instances have no native weight which is leveraged by GOSS.· Data instances with larger gradients contribute more towards information gain.· To maintain the accuracy of the information, GOSS retains instances with larger gradients and performs random sampling on instances with smaller gradients.· Note: We can learn more about this concept in the article – What makes LightGBM lightning fast?· Note: The YouTube channel Machine Learning University also released a video on LightGBM speaking about GOSS.* Exclusive Feature Bundling (EFB)· EFB is a near lossless method to reduce the number of effective features.· Just like One-Hot encoded features, in the sparse space, many features rarely take non-zero values simultaneously. · To reduce dimensionality, improve efficiency, and maintain accuracy, EFB bundles these features, and this bundle is called an Exclusive Feature Bundle.· This thread on EFB and LightGBM’s paper can be referred to gain better insight.* On the other hand, XGBoost uses a pre-sorted and histogram-based algorithm for computing the best split, which is done with GOSS in LightGBM. The pre-sorting splitting works as:1. For each node, enumerate over all features2. For every feature, sort instances by the feature value3. Using linear scan, decide the split along with the feature basis information gain4. Pick the best-split solution along with all the features4. – Handling Categorical Features* Both LightGBM and XGBoost accept numerical features only. This means that the nominal features in our data need to be transformed into numerical features.* XGBoost, by default, treats such variables as numerical variables with order and we don’t want that. Instead, if we can create dummies for each of the categorical values (one-hot encoding), then XGBoost will be able to do its job correctly. But for larger datasets, this is a problem as encoding takes a longer time.· For example, if we encode a categorical variable with three values into (0, 1, 2), XGBoost treats them with order as if category 1 is greater than category

0

, and so on. This is not what we want. That's why categorical variables need to be one-hot encoded.* On the other hand, LightGBM accepts a parameter to check which column is a categorical column and handles this issue with ease by splitting on equality. * Note: The H2O library provides an implementation of XGBoost that supports the native handling of categorical features. * – Handling Missing Values* Both the algorithms treat missing values by assigning them to the side that reduces loss the most in each split.* – Feature Importance Methods* Gain· Every feature in a dataset has some sort of importance/ weightage in helping build an accurate model.· Gain refers to the relative contribution of a particular feature in the context of a particular tree.· This can also be understood by the extent of relevant information that the model gains from a feature for making better predictions.· Available both in XGBoost and LightGBM.* Split/Frequency/Weight· Split for LightGBM and Frequency or Weight for XGBoost method calculates the relative count of times a particular feature occurs in all splits of the model’s trees. One issue with this method is that it is prone to bias when there are a large number of categories in categorical features.· Available both in XGBoost and LightGBM.* Coverage· The relative number of observations per feature.· Available only in XGBoost.· – Processing Unit* The algorithm we want to use often depends upon the type of processing unit we have for running the models. * Although XGBoost is comparatively slower than LightGBM on GPU, it is actually faster on CPU. * LightGBM requires us to build the GPU distribution separately while to run XGBoost on GPU we need to pass the ‘gpu_hist’ value to the ‘tree_method’ parameter when initializing the model.* When working in an institution with access to GPUs and strong CPUs, we should go for XGBoost as it is more scalable than LightGBM. * But personally, I think LightGBM makes more sense as the training time saved can be used for better experimentation and feature engineering. We can train our final model to have model robustness.* – Important hyperparameters* XGBoost parameters· n_estimators [default 100] – Number of trees in the ensemble. A higher value means more weak learners contribute towards the final output but increasing it significantly slows down the training time. · max_depth [default 3] – This parameter decides the complexity of the algorithm. The lesser the value assigned, the lower is the ability for the algorithm to pick up most patterns (underfitting). A large value can make the model too complex and pick patterns that do not generalize well (overfitting).· min_child_weight [default 1] – We know that an extremely deep tree can deliver poor performance due to overfitting. The min_child_weight parameter aims to regularize by limiting the depth of a tree. So, the higher the value of this parameter, the lower are the chances of the model overfitting on the training data.· learning_rate/ eta [default 0.3] – The rate of learning of the model is inversely proportional to the accuracy of the model. Lowering the learning rate, although slower to train, improves the ability of the model to look for patterns and learn them. If the value is too low then it raises difficulty in the model to converge.· gamma/ min_split_loss [default 0] – This is a regularization parameter that can range from 0 to infinity. Higher the value, higher is the strength of regularization, lower are the chances of overfitting (but can underfit if it’s too large). Hence, this parameter varies across all types of datasets.· colsample_bytree [default 1.0] – This parameter instructs the algorithm on the fraction of the total number of features/ predictors to be used for a tree during training. This means that every tree might use a different set of features for prediction and hence reduce the chances of overfitting and also improve the speed of training as not all the features are being used in every tree. The value ranges from 0 to 1.· subsample [default 1.0] – Similar to colsample_bytree, the subsample parameter instructs the algorithm on the fraction of the total number of instances to be used for a tree during training. This also reduces the chances of overfitting and improves training time.* LightGBM parameters· max_depth – Similar to XGBoost, this parameter instructs the trees to not grow beyond the specified depth. A higher value increases the chances for the model to overfit.· num_leaves – This parameter is very important in terms of controlling the complexity of the tree. The value should be less than 2^(max_depth) as a leaf-wise tree is much deeper than a depth-wise tree for a set number of leaves. Hence, a higher value can induce overfitting.· min_data_in_leaf – The parameter is used for controlling overfitting. A higher value can stop the tree from growing too deep but can also lead the algorithm to learn less (underfitting). According to the LightGBM’s official documentation, as a best practice, it should be set to the order of hundreds or thousands.· feature_fraction – Similar to colsample_bytree in XGBoost· bagging_fraction – Similar to subsample in XGBoost· – Tradeoff between model performance and training time* When working with machine learning models, one big aspect involved in the experimentation phase is the baseline requirement of resources to train a complex model. While some might have access to some great hardware, often people have limitations to what they can use.* Example: Let us quickly dummy datasets with sample sizes from 1,000 all the way to 20,000 samples. We’ll take a test size of 20% from each of the dummy datasets to measure model performance. For every iteration having different sample sizes stepped up by 1,000 samples, we want to check how much time it takes for an XGBoost Classifier to train in comparison to a LightGBM Classifier. To run the code, refer to this Google Colab notebook. Run results are here.

import neptune.new as neptune from xgboost import XGBClassifierfrom lightgbm import LGBMClassifier from sklearn.model_selection import train_test_splitfrom sklearn.metrics import accuracy_scorefrom sklearn.datasets import make_classification import timefrom tqdm.notebook import tqdm # initialising a logger instancerun = neptune.init( project="common/xgboost-integration", api_token="ANONYMOUS", name="xgb-train", tags=["xgb-integration", "train"],) # configuration for our custom datasetmin_samples = 1000max_samples = 20000step = 1000 for sample_size in tqdm(range(int(min_samples/0.8), int(max_samples/0.8), step)): xgb_dummy = XGBClassifier(seed=47) lgbm_dummy = LGBMClassifier(random_sate=47) # logging the sample size run['metrics/comparison/sample_size'].log(sample_size) # generating the dataset of custom sample size dummy = make_classification(n_samples=sample_size) # splitting the data into train and test set X_train, X_test, y_train, y_test = train_test_split(dummy[0], dummy[1], test_size=0.2, stratify=dummy[1]) start = time.time() xgb_dummy.fit(X_train, y_train) end = time.time() # logging algorithm execution time run['metrics/comparison/xgb_runtime'].log(end-start) # logging model performance run['metrics/comparison/xbg_accuracy'].log( accuracy_score(y_test, xgb_dummy.predict(X_test)), step=sample_size) start = time.time() lgbm_dummy.fit(X_train, y_train) end = time.time() # logging algorithm execution time run['metrics/comparison/lgbm_runtime'].log(end-start) # logging model performance run['metrics/comparison/lgbm_accuracy'].log( accuracy_score(y_test, lgbm_dummy.predict(X_test)), step=sample_size) run.stop()

Figure 1:XGBoost vs. LightGBM Runtime

Figure 2:XGBoost vs. LightGBM Accuracy

• – * From the figure, we can see that the training time for XGBoost kept on increasing with an increase in sample size almost linearly. On the other hand, the training time required by LightGBM has been a very small fraction of its contender. Interesting!* The accuracy scores for both models go hand-in-hand. The results indicate that not only is LightGBM faster, there is not much compromise in model performances. So does this mean we can just ditch XGBoost for LightGBM?* It all comes down to the availability of hardware resources and bandwidth to figure things out. * Although LightGBM gives good performance at fraction of time as compared to XGBoost, what it still needs to improve on is documentation and community strength. * Also if the hardware is available, since XGBoost scales better, as discussed before we could train using LightGBM, get an understanding of the parameters required, and train the final model as an XGBoost model.* – Summary: Gradient Boosted Decision Trees (GBDTs) are one of the most popular choices of machine learning algorithms. XGBoost and LightGBM which are based on GBDTs have had great success both in enterprise applications and data science competitions. Here are the key takeaways from our comparison:* In XGBoost, trees grow depth-wise while in LightGBM, trees grow leaf-wise which is the fundamental difference between the two frameworks.* XGBoost is backed by the volume of its users that results in enriched literature in the form of documentation and resolutions to issues. While LightGBM is yet to reach such a level of documentation.* Both the algorithms perform similarly in terms of model performance but LightGBM training happens within a fraction of the time required by XGBoost.* Fast training in LightGBM makes it the go-to choice for machine learning experiments.* XGBoost requires a lot of resources to train on large amounts of data which makes it an accessible option for most enterprises while LightGBM is lightweight and can be used on modest hardware.* LightGBM provides the option for passing feature names that are to be treated as categories and handles this issue with ease by splitting on equality. * H2O’s implementation of XGBoost provides the above feature as well which is not yet provided by XGBoost’s original library.* Hyperparameter tuning is extremely important in both algorithms. Back to Top