XGBoost explained for absolute beginners (with code)
What is the XGBoost model?
XGBoost is short for eXtreme Gradient Boosting. It is a very powerful boosting toolkit whose excellent performance (in both accuracy and speed) kept it at the top of data science competition leaderboards for a long time, and many companies' machine learning solutions still favor this model.
XGBoost excels in parallel computation efficiency, missing-value handling, overfitting control, and predictive generalization. This article introduces XGBoost in detail, covering both the algorithmic principles and the engineering implementation.
Regarding the principles of XGBoost, its author Tianqi Chen has given a systematic presentation in a very detailed set of slides.
Boosted Tree
Boosted Tree is a commonly used machine learning method belonging to the family of ensemble learning. It works by combining multiple weak learners (usually decision trees) to improve the predictive performance of the whole model. The core idea of Boosted Tree is to train multiple decision trees step by step, each one trying to fix the mistakes of the previous ones, eventually yielding a stronger model.
Model: suppose we have \(K\) trees: \(\hat{y_i}=\sum_{k=1}^K f_k(x_i),\ f_k\in F\), where \(F\) is the function space containing all regression trees.
Objective function: \(Obj=\sum_{i=1}^n l(y_i,\hat{y_i})+\sum_{k=1}^K\Omega(f_k)\)
\(\sum_{i=1}^n l(y_i,\hat{y_i})\) is the loss (cost) function, measuring how well the model fits the training data.
\(\sum_{k=1}^K\Omega(f_k)\) is a regularization term representing the complexity of the trees; the more complex the tree, the larger this term (we will define it precisely later).
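To make this additive scheme concrete, here is a minimal sketch of boosted regression trees that fit the residuals of the previous rounds (the toy data, the shrinkage factor of 0.1, and the use of sklearn's DecisionTreeRegressor as the weak learner are all assumptions for illustration, not part of the original derivation):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression data (made up purely for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

K, eta = 50, 0.1          # number of trees and shrinkage factor (illustrative values)
pred = np.zeros_like(y)   # initial prediction: 0 for every sample
trees = []
for k in range(K):
    residual = y - pred                                   # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    pred += eta * tree.predict(X)                         # y_hat_i = sum_k f_k(x_i), shrunk by eta

print("training MSE:", np.mean((y - pred) ** 2))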
When we discuss decision trees or related tree models, the training procedure is usually heuristic. In machine learning, heuristic means using rules of thumb or approximations to solve a problem, without any guarantee of finding an optimal solution.
Gradient Boosting (how to learn)
There is no way to use SGD (stochastic gradient descent) directly when training GBDT, because the objects being learned are trees rather than numerical parameter vectors; that is, we have moved from the familiar parameter space into a function space. Gradient Boosted Decision Trees (GBDT) differ from deep learning or linear models in that the core is not optimized by direct parameter updates, but by building new decision trees that gradually reduce the error.
Solution: start from an initial prediction and add a new function \(f\) at each iteration.
1) Objective function transformation
Following this solution, the objective function can be rewritten step by step.
Here, constant denotes constant terms such as \(\Omega(f_1),\Omega(f_2),\dots\); the third line then specializes to the squared loss \(l(y_i,\hat{y_i})=\frac{1}{2}(y_i-\hat{y_i})^2\) by direct substitution.
So our goal is to find the \(f_t\) that minimizes the objective function. However, the objective after this rewriting is still complicated and contains quadratic terms, so we introduce the Taylor expansion.
The figure here is also somewhat problematic: the Taylor expansion is introduced before the squared loss has been substituted in, and in the expansion itself the last two coefficients should be the first- and second-order derivatives of the loss, \(g_i\) and \(h_i\).
Then we extract the constant terms, i.e. everything that has nothing to do with \(f_t\).
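Concretely, a sketch of the expanded form: writing \(g_i\) and \(h_i\) for the first- and second-order derivatives of the loss with respect to the previous round's prediction \(\hat{y_i}^{(t-1)}\), and dropping the constants, we are left with

\[
Obj^{(t)}\approx\sum_{i=1}^n\Big[g_i f_t(x_i)+\frac{1}{2}h_i f_t^2(x_i)\Big]+\Omega(f_t),\qquad
g_i=\partial_{\hat{y_i}^{(t-1)}} l\big(y_i,\hat{y_i}^{(t-1)}\big),\quad
h_i=\partial^2_{\hat{y_i}^{(t-1)}} l\big(y_i,\hat{y_i}^{(t-1)}\big)
\]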
2) Redefining the tree
So far we have used \(f_t(x)\) to represent a tree. In this subsection we redefine the tree: we describe a tree by a vector of scores on its leaf nodes together with an index-mapping function that maps each instance to a leaf node (a bit abstract; see the details below).
There is an error in the figure: the weight of the first leaf node should be +2.
3) Define the complexity of the tree
Here \(T\) is the number of leaf nodes, and \(\gamma\) is a parameter that penalizes the number of leaves: the more leaf nodes the tree has, the higher its complexity, so adjusting \(\gamma\) controls the complexity of the model. The last term is an L2 norm on the leaf scores, with regularization coefficient \(\lambda\).
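Written out, the complexity penalty described here (with \(w_j\) denoting the score of leaf \(j\)) is

\[
\Omega(f_t)=\gamma T+\frac{1}{2}\lambda\sum_{j=1}^{T} w_j^2
\]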
4) Revisiting the objective function
Define the set of instances falling into leaf node \(j\) as \(I_j=\{i\,|\,q(x_i)=j\}\). This definition lets us regroup the objective so that, in the third equation, both sums are written as \(\sum_{j=1}^T\).
We then find that the expression above is a sum of \(T\) independent quadratic functions.
5) Calculate the value of the leaf node
It looks messy, but it is really just a matter of abbreviating the sums as \(G_j,H_j\) and then minimizing a quadratic in each leaf score to obtain the optimal value.
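Sketching this out: with \(G_j=\sum_{i\in I_j}g_i\) and \(H_j=\sum_{i\in I_j}h_i\), each leaf contributes the quadratic \(G_j w_j+\frac{1}{2}(H_j+\lambda)w_j^2\), and minimizing it leaf by leaf gives

\[
w_j^*=-\frac{G_j}{H_j+\lambda},\qquad
Obj^*=-\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j+\lambda}+\gamma T
\]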
The following figure shows a practical example corresponding to the previous formula explanation.
To summarize: the objective function has been reduced to a function of only the five known quantities \(G,H,\gamma,\lambda,T\); the previous variable \(f_t\) has been eliminated, so there is no need to score every leaf individually!
Now comes the question: we just noted that these results assume the tree structure has been determined. But there are many possible tree structures, so how do we determine one?
6) Greedy Algorithm for Growing Trees
In the previous section we assumed that the structure of the tree is fixed. However, there are actually an infinite number of possibilities for the structure of the tree, and in this subsection we use the greedy algorithm to generate the tree:
First generate a tree with depth 0 (only one root node, also called leaf node)
For each leaf node of the current tree, try a split (creating two new leaf nodes; the original leaf node is no longer a leaf). The change in the objective function before and after adding the split is (we want the objective after the split to be smaller than before, so we subtract the objective after from the objective before):
\(Gain=\frac{1}{2}(\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda})-\gamma\)
The next thing to consider is how to find the best split.
For example, if \(x_j\) is age, what is the gain \(Gain\) when the split point is \(a\)?
In fact, a left-to-right linear scan over the instances sorted by this feature is enough to determine its best split point. Scanning from left to right: once the data has been sorted by feature value, we consider each candidate split point in turn, starting from the first sample; for each candidate we divide the samples into a "left" and a "right" subset and compute the change in the objective function before and after the split (a minimal sketch of this scan is given below). There are also some other approaches.
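A minimal sketch of this left-to-right scan for a single feature, assuming the per-sample gradients g and Hessians h have already been computed (the function name and parameters are illustrative, not library API):

import numpy as np

def best_split_for_feature(x, g, h, lam=1.0, gamma=0.0):
    """Left-to-right scan over one feature; returns (best_gain, best_threshold)."""
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()              # gradient/Hessian totals of the parent node
    G_L = H_L = 0.0
    best_gain, best_thr = -np.inf, None
    for i in range(len(x) - 1):
        G_L += g[i]
        H_L += h[i]                      # left-child statistics grow as the scan moves right
        if x[i] == x[i + 1]:             # a split is only possible between distinct feature values
            continue
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                      - G**2 / (H + lam)) - gamma   # same Gain formula as above
        if gain > best_gain:
            best_gain, best_thr = gain, (x[i] + x[i + 1]) / 2
    return best_gain, best_thr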
7) How to handle categorical variables
In many cases we do not need any special treatment for categorical variables; we can simply convert them to a one-hot encoding:
\(z_j= \begin{cases} 1& \text{if } x \text{ is in category } j\\ 0& \text{otherwise} \end{cases} \)
If there are too many categories, the resulting matrix will be very sparse; as described later, the algorithm handles sparse data efficiently.
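A minimal sketch of one-hot encoding a categorical column and feeding the result to XGBoost (the column names and values are made up for illustration):

import pandas as pd
import xgboost as xgb

df = pd.DataFrame({"city": ["beijing", "shanghai", "beijing", "shenzhen"],
                   "age": [23, 35, 41, 29]})
X = pd.get_dummies(df, columns=["city"], dtype=int)  # each category becomes a 0/1 indicator column
y = [0, 1, 0, 1]
dtrain = xgb.DMatrix(X, label=y)  # DMatrix also accepts scipy.sparse matrices for very wide encodings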
8) Pruning and Regularization
Recalling the gain above: when the reduction in training loss is smaller than the complexity introduced by the regularization term, the gain can be negative; splitting is then a trade-off between the simplicity of the model and its predictive accuracy.
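In the xgboost library this trade-off surfaces directly as hyperparameters; a minimal sketch of the relevant ones (the values shown are illustrative, not recommendations):

params = {
    "gamma": 1.0,             # minimum loss reduction required to keep a split (the -gamma term in the gain)
    "lambda": 1.0,            # L2 penalty on the leaf scores
    "alpha": 0.0,             # L1 penalty on the leaf scores
    "max_depth": 6,           # caps the depth, and hence the complexity, of each tree
    "min_child_weight": 1.0,  # minimum sum of Hessians required in a child node
}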
XGBoost core principles summarized and analyzed
After all that groundwork, we finally arrive at XGBoost itself. XGBoost is also an additive boosting model: each iteration optimizes only the sub-model of the current step.
At step \(m\) we have: \(F_m(x_i)=F_{m-1}(x_i)+f_m(x_i)\)
\(f_m(x_i)\) is the sub-model of the current step.
\(F_{m-1}(x_i)\) is the sum of the first \(m-1\) sub-models, which have already been trained and are fixed.
We then apply a second-order Taylor expansion to the loss, remove the constant terms, and bring in the complexity penalty (as before).
1) Approximation Algorithm
For performance reasons, XGBoost also provides an approximate version of the greedy criterion. In short, it uses feature quantiles as the candidate split points, which reduces the candidate set from a full scan over all samples to a scan over a small number of quantiles.
Expanding on this, there are two alternative strategies for selecting the feature quantiles, global and local:
Exact greedy criterion: this is the default exact algorithm that traverses all possible splitting points to find the one that maximizes the gain. It is the most computationally intensive, but has the best splitting effect.
Global approximate split: the feature quantiles are computed once from the whole sample, and the same candidate split points are reused at every node. Computation drops dramatically, making it suitable for larger datasets.
Local approximate split: the feature quantiles are recomputed from the current node's samples before each split, which adapts more flexibly to the feature distribution at different nodes and suits cases where the sample distribution differs greatly between nodes.
The performance of the approximation algorithm is almost the same as the exact greedy algorithm, but the computational cost is greatly reduced.
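In the Python package, the exact and approximate algorithms are selected via the tree_method parameter; a minimal sketch (parameter values are illustrative):

import xgboost as xgb

# 'exact'  : exact greedy algorithm, enumerates every possible split point
# 'approx' : approximate algorithm based on quantile sketches
# 'hist'   : histogram-based approximation, usually the fastest choice on large data
params = {"objective": "reg:squarederror", "tree_method": "approx"}
# bst = xgb.train(params, dtrain, num_boost_round=100)  # dtrain built as in the code example below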
2) Weighted Quantiles
In XGBoost, a Weighted Quantile Sketch is used to speed up the search for split points. Instead of choosing candidate split points based on the raw feature values alone, the weighted quantile algorithm uses the second-order derivative of each sample (the Hessian, \(h_i\)) as its weight, which balances the choice of split points better, especially in the approximate algorithm.
Setting the partial derivative to zero easily gives \(f_m^*(x_i)=-\frac{g_i}{h_i}\).
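This is why \(h_i\) acts as a sample weight: up to an additive constant, the second-order objective can be rewritten as a squared loss around \(-g_i/h_i\) weighted by \(h_i\) (a standard identity, sketched here in the notation above):

\[
\sum_{i=1}^n\Big[g_i f_m(x_i)+\frac{1}{2}h_i f_m^2(x_i)\Big]
=\sum_{i=1}^n\frac{1}{2}h_i\Big(f_m(x_i)+\frac{g_i}{h_i}\Big)^2+\text{const}
\]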
3) Column Sampling and Learning Rate
Column sampling means that when building each decision tree, XGBoost does not use all the features, but randomly selects some of them for splitting. This approach is derived from the idea of random forests and aims to increase the diversity of the model, thus preventing overfitting.
The learning rate is a very important hyperparameter in gradient boosted trees (GBDT), controlling how much each tree contributes to the model. It prevents the model from updating too aggressively, thereby improving stability and performance. Also called step size or shrinkage, it works by multiplying a factor onto each sub-model (i.e. onto the regression value of each leaf node) so that no single tree fits too aggressively, leaving room for more stable later iterations. XGBoost's default learning rate (eta) is 0.3.
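Both knobs map directly onto hyperparameters of the library; a minimal sketch (the values shown are illustrative, not recommendations):

params = {
    "eta": 0.1,               # learning rate / shrinkage applied to each new tree's leaf scores
    "colsample_bytree": 0.8,  # fraction of features randomly sampled when building each tree
    "colsample_bylevel": 1.0, # fraction of features sampled again at each tree depth level
    "colsample_bynode": 1.0,  # fraction of features sampled again at each split
}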
4) Missing Values and Sparsity
Briefly, XGBoost treats missing values and sparse zero values the same way, "bundling" them together; when enumerating split points, the traversal skips this bundle as a whole and only decides whether to send it to the left or the right branch. This greatly improves computational efficiency.
For example, in the example below there are six candidate splits; XGBoost traverses all six (3 cut points over the non-missing values × 2 directions for the missing values), and the largest gain found is taken as the split gain for this feature.
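In practice you can simply leave missing entries as NaN when constructing the DMatrix; a minimal sketch (the tiny array is made up for illustration):

import numpy as np
import xgboost as xgb

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 5.0],
              [4.0, 6.0]])
y = np.array([0, 1, 0, 1])

# NaN (configurable through the `missing` argument) is treated as missing;
# each learned split assigns these samples a default left/right direction.
dtrain = xgb.DMatrix(X, label=y, missing=np.nan)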
XGBoost Engineering Optimization
1) Parallel column block design
XGBoost pre-sorts each feature column and stores it in cache as a block, linking feature values to their gradient statistics through an index; at every node split it reuses these pre-sorted blocks. Moreover, different features live in independent blocks, so the computation can be distributed or multi-threaded.
2) Cache access optimization
Sorting by feature value and then fetching the gradients \(g_i,h_i\) through an index leads to non-contiguous memory access, which lowers the cache hit rate and hurts the efficiency of the algorithm. To solve this, XGBoost allocates a separate contiguous buffer for each thread to hold the gradient information.
3) Out-of-core block computation
In the case of very large amounts of data that cannot all be loaded into memory at the same time, XGBoost divides the data into blocks stored on the hard disk, and uses a separate thread dedicated to reading the data from the disk into memory, so as to achieve simultaneous computation and reading of data.
To further improve disk read data performance, XGBoost uses two other methods:
① Block compression: trade the overhead of decompression for a reduction in disk-read overhead.
② Block sharding: distribute the blocks across multiple disks to improve disk throughput.
XGBoost vs GBDT
GBDT is a machine learning algorithm, and XGBoost has some engineering implementation optimizations on top of the algorithm.
GBDT uses the first-order derivative of the loss function, which is equivalent to gradient descent in function space; XGBoost also uses the second-order derivative of the loss function, which is equivalent to Newton's method in function space.
Regularization: XGBoost explicitly adds regularization terms to control model complexity, which effectively prevents overfitting.
Column sampling: XGBoost borrows the random forest practice of randomly sampling a subset of features (columns) at each node split.
Missing values: XGBoost uses a sparsity-aware strategy to handle missing values, while GBDT has no missing-value handling strategy.
Parallelism and efficiency: XGBoost's column block design effectively supports parallel computation, giving better efficiency.
Code Implementation
You need to install xgboost first:
pip install xgboost
The code is as follows
# Import the required libraries
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes  # use the diabetes dataset
from sklearn.metrics import mean_squared_error

# 1. Load the diabetes dataset
# This dataset contains 442 samples with 10 features for predicting a continuous target variable
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target  # X is the feature data, y is the label (target variable)

# 2. Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Convert the data to DMatrix format
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# 4. Set the hyperparameters for the XGBoost model
params = {
    'objective': 'reg:squarederror',  # Objective function for the regression task: squared error
    'max_depth': 3,                   # Maximum depth of each tree, controls the complexity of the model
    'eta': 0.05,                      # Learning rate, controls the contribution of each tree to the overall model
    'eval_metric': 'rmse',            # Evaluation metric: root mean squared error (RMSE)
    'lambda': 2,                      # L2 regularization term to prevent overfitting
    'alpha': 0.5                      # L1 regularization term
}

# 5. Set the number of training rounds
num_round = 200  # Number of training rounds, i.e. how many trees to build

# 6. Define the evaluation datasets
evals = [(dtrain, 'train'), (dtest, 'eval')]  # (dataset, dataset name)

# 7. Train the XGBoost model, adding early_stopping_rounds to prevent overfitting
bst = xgb.train(params, dtrain, num_boost_round=num_round, evals=evals, early_stopping_rounds=10)

# 8. Predict on the test set using the trained model
y_pred = bst.predict(dtest)

# 9. Evaluate model performance
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# 10. Save the trained model
bst.save_model('xgboost_model.json')

# 11. Load the saved model
loaded_bst = xgb.Booster()
loaded_bst.load_model('xgboost_model.json')

# 12. Use the loaded model for prediction
y_pred_loaded = loaded_bst.predict(dtest)
mse_loaded = mean_squared_error(y_test, y_pred_loaded)
print(f"Mean Squared Error from loaded model: {mse_loaded}")
The results are as follows
[0] train-rmse:76.08309 eval-rmse:71.75905
[1] train-rmse:74.34324 eval-rmse:70.47408
[2] train-rmse:72.66427 eval-rmse:69.24759
[3] train-rmse:71.10664 eval-rmse:68.09809
[4] train-rmse:69.63498 eval-rmse:67.14668
[5] train-rmse:68.24045 eval-rmse:66.09854
[6] train-rmse:66.93042 eval-rmse:64.91738
[7] train-rmse:65.73304 eval-rmse:64.08775
[8] train-rmse:64.58640 eval-rmse:63.26052
[9] train-rmse:63.51304 eval-rmse:62.49745
[10] train-rmse:62.44810 eval-rmse:61.64759
[11] train-rmse:61.51387 eval-rmse:60.96222
[12] train-rmse:60.61767 eval-rmse:60.32972
[13] train-rmse:59.77722 eval-rmse:59.74329
[14] train-rmse:59.01348 eval-rmse:59.13121
[15] train-rmse:58.24704 eval-rmse:58.55106
[16] train-rmse:57.57392 eval-rmse:58.15165
[17] train-rmse:56.92761 eval-rmse:57.68188
[18] train-rmse:56.33319 eval-rmse:57.37781
[19] train-rmse:55.72582 eval-rmse:56.97001
[20] train-rmse:55.14420 eval-rmse:56.45029
[21] train-rmse:54.61096 eval-rmse:55.97904
[22] train-rmse:54.12594 eval-rmse:55.57225
[23] train-rmse:53.68383 eval-rmse:55.39305
[24] train-rmse:53.24822 eval-rmse:55.01127
[25] train-rmse:52.85214 eval-rmse:54.85699
[26] train-rmse:52.43814 eval-rmse:54.49904
[27] train-rmse:52.07004 eval-rmse:54.42905
[28] train-rmse:51.68191 eval-rmse:54.25354
[29] train-rmse:51.28268 eval-rmse:54.09452
[30] train-rmse:50.94229 eval-rmse:54.06703
[31] train-rmse:50.58475 eval-rmse:53.88010
[32] train-rmse:50.24739 eval-rmse:53.74475
[33] train-rmse:49.97042 eval-rmse:53.49905
[34] train-rmse:49.65855 eval-rmse:53.41597
[35] train-rmse:49.38190 eval-rmse:53.34692
[36] train-rmse:49.07203 eval-rmse:53.32202
[37] train-rmse:48.81472 eval-rmse:53.22084
[38] train-rmse:48.57124 eval-rmse:53.24058
[39] train-rmse:48.33730 eval-rmse:53.13983
[40] train-rmse:47.97171 eval-rmse:53.05406
[41] train-rmse:47.75619 eval-rmse:52.87405
[42] train-rmse:47.43067 eval-rmse:52.80852
[43] train-rmse:47.18844 eval-rmse:52.70296
[44] train-rmse:46.96694 eval-rmse:52.61260
[45] train-rmse:46.79053 eval-rmse:52.58588
[46] train-rmse:46.58746 eval-rmse:52.51602
[47] train-rmse:46.38476 eval-rmse:52.50433
[48] train-rmse:46.15591 eval-rmse:52.44922
[49] train-rmse:46.00542 eval-rmse:52.36981
[50] train-rmse:45.84480 eval-rmse:52.27445
[51] train-rmse:45.63700 eval-rmse:52.23794
[52] train-rmse:45.49250 eval-rmse:52.25740
[53] train-rmse:45.31208 eval-rmse:52.16836
[54] train-rmse:45.15374 eval-rmse:52.22044
[55] train-rmse:45.00284 eval-rmse:52.15072
[56] train-rmse:44.87677 eval-rmse:52.04112
[57] train-rmse:44.71921 eval-rmse:52.08482
[58] train-rmse:44.55626 eval-rmse:52.02783
[59] train-rmse:44.41483 eval-rmse:52.09304
[60] train-rmse:44.27997 eval-rmse:52.03098
[61] train-rmse:44.15710 eval-rmse:52.08378
[62] train-rmse:44.00683 eval-rmse:52.02136
[63] train-rmse:43.84878 eval-rmse:52.06178
[64] train-rmse:43.74180 eval-rmse:52.06495
[65] train-rmse:43.59775 eval-rmse:52.08875
[66] train-rmse:43.44009 eval-rmse:52.20317
[67] train-rmse:43.29717 eval-rmse:52.14245
[68] train-rmse:43.10437 eval-rmse:52.15464
[69] train-rmse:43.00768 eval-rmse:52.17011
[70] train-rmse:42.87951 eval-rmse:52.11852
[71] train-rmse:42.79951 eval-rmse:52.21249
[72] train-rmse:42.66769 eval-rmse:52.22331
Mean Squared Error: 2727.2736118611274
Mean Squared Error from loaded model: 2727.2736118611274
train-rmse is the RMSE between the predictions and the true values on the training set; eval-rmse is the model's RMSE on the test set.
Looking at the last round kept by the early-stopping mechanism: 42.66769 means the model's prediction error (RMSE) on the training set is about 42.67, and the lower the RMSE, the better the fit on the training data. 52.22 means the model's prediction error on the test set is noticeably higher than on the training set, which suggests some degree of overfitting: the model performs well on the training data, but its generalization to new data (the test set) is not as good as its training performance.