XGBoost explained for absolute beginners (with code)
What is the XGBoost model?
XGBoost is short for eXtreme Gradient Boosting. It is a very powerful boosting toolkit whose excellent performance (in both accuracy and speed) kept it at the top of data science competition leaderboards for a long time, and many companies' machine learning solutions still favor this model.
XGBoost excels in parallel computation efficiency, missing-value handling, overfitting control, and predictive generalization. This article introduces XGBoost in detail, covering both the algorithmic principles and the engineering implementation.
Regarding the principles of XGBoost, its author Tianqi Chen has given a systematic presentation in a very detailed set of slides.
Boosted Tree
Boosted Tree is a commonly used machine learning method belonging to the family of ensemble learning. It works by combining multiple weak learners (usually decision trees) to improve the predictive performance of the whole model. The core idea of Boosted Tree is to train multiple decision trees step by step, each one trying to fix the mistakes of the previous ones, eventually yielding a stronger model.
Model: suppose we have \(K\) trees: \(\hat{y_i}=\sum_{k=1}^K f_k(x_i),\ f_k\in F\), where \(F\) is the function space containing all regression trees.
Objective function: \(Obj=\sum_{i=1}^n l(y_i,\hat{y_i})+\sum_{k=1}^K\Omega(f_k)\)
\(\sum_{i=1}^n l(y_i,\hat{y_i})\) is the loss (cost) function, measuring how well the model fits the training data.
\(\sum_{k=1}^K\Omega(f_k)\) is a regularization term representing the complexity of the trees; the more complex the tree, the larger this term (we will define it precisely later).
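To make this additive scheme concrete, here is a minimal sketch of boosted regression trees that fit the residuals of the previous rounds (the toy data, the shrinkage factor of 0.1, and the use of sklearn's DecisionTreeRegressor as the weak learner are all assumptions for illustration, not part of the original derivation):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression data (made up purely for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

K, eta = 50, 0.1          # number of trees and shrinkage factor (illustrative values)
pred = np.zeros_like(y)   # initial prediction: 0 for every sample
trees = []
for k in range(K):
    residual = y - pred                                   # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    pred += eta * tree.predict(X)                         # y_hat_i = sum_k f_k(x_i), shrunk by eta

print("training MSE:", np.mean((y - pred) ** 2))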
When we discuss decision trees or related tree models, the training procedure is usually heuristic. In machine learning, heuristic means using rules of thumb or approximations to solve a problem, without any guarantee of finding an optimal solution.
Gradient Boosting (how to learn)
There is no way to use SGD (stochastic gradient descent) directly when training GBDT, because the objects being learned are trees rather than numerical parameter vectors; that is, we have moved from the familiar parameter space into a function space. Gradient Boosted Decision Trees (GBDT) differ from deep learning or linear models in that the core is not optimized by direct parameter updates, but by building new decision trees that gradually reduce the error.
Solution: start from an initial prediction and add a new function \(f\) at each iteration.
1) Objective function transformation
Following this solution, the objective function can be rewritten step by step.
Here, constant denotes constant terms such as \(\Omega(f_1),\Omega(f_2),\dots\); the third line then specializes to the squared loss \(l(y_i,\hat{y_i})=\frac{1}{2}(y_i-\hat{y_i})^2\) by direct substitution.
So our goal is to find the \(f_t\) that minimizes the objective function. However, the objective after this rewriting is still complicated and contains quadratic terms, so we introduce the Taylor expansion.
The figure here is also somewhat problematic: the Taylor expansion is introduced before the squared loss has been substituted in, and in the expansion itself the last two coefficients should be the first- and second-order derivatives of the loss, \(g_i\) and \(h_i\).
Then we extract the constant terms, i.e. everything that has nothing to do with \(f_t\).
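Concretely, a sketch of the expanded form: writing \(g_i\) and \(h_i\) for the first- and second-order derivatives of the loss with respect to the previous round's prediction \(\hat{y_i}^{(t-1)}\), and dropping the constants, we are left with

\[
Obj^{(t)}\approx\sum_{i=1}^n\Big[g_i f_t(x_i)+\frac{1}{2}h_i f_t^2(x_i)\Big]+\Omega(f_t),\qquad
g_i=\partial_{\hat{y_i}^{(t-1)}} l\big(y_i,\hat{y_i}^{(t-1)}\big),\quad
h_i=\partial^2_{\hat{y_i}^{(t-1)}} l\big(y_i,\hat{y_i}^{(t-1)}\big)
\]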
2) Redefining the tree
So far we have used \(f_t(x)\) to represent a tree. In this subsection we redefine the tree: we describe a tree by a vector of scores on its leaf nodes together with an index-mapping function that maps each instance to a leaf node (a bit abstract; see the details below).
There is an error in the figure: the weight of the first leaf node should be +2.
3) Define the complexity of the tree
Here \(T\) is the number of leaf nodes, and \(\gamma\) is a parameter that penalizes the number of leaves: the more leaf nodes the tree has, the higher its complexity, so adjusting \(\gamma\) controls the complexity of the model. The last term is an L2 norm on the leaf scores, with regularization coefficient \(\lambda\).
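Written out, the complexity penalty described here (with \(w_j\) denoting the score of leaf \(j\)) is

\[
\Omega(f_t)=\gamma T+\frac{1}{2}\lambda\sum_{j=1}^{T} w_j^2
\]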
4) Revisiting the objective function
Define the set of instances falling into leaf node \(j\) as \(I_j=\{i\,|\,q(x_i)=j\}\). This definition lets us regroup the objective so that, in the third equation, both sums are written as \(\sum_{j=1}^T\).
We then find that the expression above is a sum of \(T\) independent quadratic functions.
5) Calculate the value of the leaf node
It looks messy, but it is really just a matter of abbreviating the sums as \(G_j,H_j\) and then minimizing a quadratic in each leaf score to obtain the optimal value.
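Sketching this out: with \(G_j=\sum_{i\in I_j}g_i\) and \(H_j=\sum_{i\in I_j}h_i\), each leaf contributes the quadratic \(G_j w_j+\frac{1}{2}(H_j+\lambda)w_j^2\), and minimizing it leaf by leaf gives

\[
w_j^*=-\frac{G_j}{H_j+\lambda},\qquad
Obj^*=-\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j+\lambda}+\gamma T
\]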
The following figure shows a practical example corresponding to the previous formula explanation.
To summarize: the objective function has been reduced to a function of only the five known quantities \(G,H,\gamma,\lambda,T\); the previous variable \(f_t\) has been eliminated, so there is no need to score every leaf individually!
Now comes the question: we just noted that these results assume the tree structure has been determined. But there are many possible tree structures, so how do we determine one?
6) Greedy Algorithm for Growing Trees
In the previous section we assumed that the structure of the tree is fixed. However, there are actually an infinite number of possibilities for the structure of the tree, and in this subsection we use the greedy algorithm to generate the tree:
First generate a tree with depth 0 (only one root node, also called leaf node)
For each leaf node of the current tree, try a split (creating two new leaf nodes; the original leaf node is no longer a leaf). The change in the objective function before and after adding the split is (we want the objective after the split to be smaller than before, so we subtract the objective after from the objective before):
\(Gain=\frac{1}{2}(\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda})-\gamma\)
The next thing to consider is how to find the best split.
For example, if \(x_j\) is age, what is the gain \(Gain\) when the split point is \(a\)?
In fact, a left-to-right linear scan over the instances sorted by this feature is enough to determine its best split point. Scanning from left to right: once the data has been sorted by feature value, we consider each candidate split point in turn, starting from the first sample; for each candidate we divide the samples into a "left" and a "right" subset and compute the change in the objective function before and after the split (a minimal sketch of this scan is given below). There are also some other approaches.
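A minimal sketch of this left-to-right scan for a single feature, assuming the per-sample gradients g and Hessians h have already been computed (the function name and parameters are illustrative, not library API):

import numpy as np

def best_split_for_feature(x, g, h, lam=1.0, gamma=0.0):
    """Left-to-right scan over one feature; returns (best_gain, best_threshold)."""
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()              # gradient/Hessian totals of the parent node
    G_L = H_L = 0.0
    best_gain, best_thr = -np.inf, None
    for i in range(len(x) - 1):
        G_L += g[i]
        H_L += h[i]                      # left-child statistics grow as the scan moves right
        if x[i] == x[i + 1]:             # a split is only possible between distinct feature values
            continue
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                      - G**2 / (H + lam)) - gamma   # same Gain formula as above
        if gain > best_gain:
            best_gain, best_thr = gain, (x[i] + x[i + 1]) / 2
    return best_gain, best_thr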
7) How to handle categorical variables
In many cases we do not need any special treatment for categorical variables; we can simply convert them to a one-hot encoding:
\(z_j= \begin{cases} 1& \text{if } x \text{ is in category } j\\ 0& \text{otherwise} \end{cases} \)
If there are too many categories, the resulting matrix will be very sparse; as described later, the algorithm handles sparse data efficiently.
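A minimal sketch of one-hot encoding a categorical column and feeding the result to XGBoost (the column names and values are made up for illustration):

import pandas as pd
import xgboost as xgb

df = pd.DataFrame({"city": ["beijing", "shanghai", "beijing", "shenzhen"],
                   "age": [23, 35, 41, 29]})
X = pd.get_dummies(df, columns=["city"], dtype=int)  # each category becomes a 0/1 indicator column
y = [0, 1, 0, 1]
dtrain = xgb.DMatrix(X, label=y)  # DMatrix also accepts scipy.sparse matrices for very wide encodings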
8) Pruning and Regularization
Recalling the gain above: when the reduction in training loss is smaller than the complexity introduced by the regularization term, the gain can be negative; splitting is then a trade-off between the simplicity of the model and its predictive accuracy.
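In the xgboost library this trade-off surfaces directly as hyperparameters; a minimal sketch of the relevant ones (the values shown are illustrative, not recommendations):

params = {
    "gamma": 1.0,             # minimum loss reduction required to keep a split (the -gamma term in the gain)
    "lambda": 1.0,            # L2 penalty on the leaf scores
    "alpha": 0.0,             # L1 penalty on the leaf scores
    "max_depth": 6,           # caps the depth, and hence the complexity, of each tree
    "min_child_weight": 1.0,  # minimum sum of Hessians required in a child node
}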
XGBoost core principles summarized and analyzed
After all that groundwork, we finally arrive at XGBoost itself. XGBoost is also an additive boosting model: each iteration optimizes only the sub-model of the current step.
At step \(m\) we have: \(F_m(x_i)=F_{m-1}(x_i)+f_m(x_i)\)
\(f_m(x_i)\) is the sub-model of the current step.
\(F_{m-1}(x_i)\) is the sum of the first \(m-1\) sub-models, which have already been trained and are fixed.
We then apply a second-order Taylor expansion to the loss, remove the constant terms, and bring in the complexity penalty (as before).
1) Approximation Algorithm
For performance reasons, XGBoost also provides an approximate version of the greedy criterion. In short, it uses feature quantiles as the candidate split points, which reduces the candidate set from a full scan over all samples to a scan over a small number of quantiles.
Expanding on this, there are two alternative strategies for selecting the feature quantiles, global and local:
Exact greedy criterion: this is the default exact algorithm that traverses all possible splitting points to find the one that maximizes the gain. It is the most computationally intensive, but has the best splitting effect.
Global approximate split: the feature quantiles are computed once from the whole sample, and the same candidate split points are reused at every node. Computation drops dramatically, making it suitable for larger datasets.
Local approximate split: the feature quantiles are recomputed from the current node's samples before each split, which adapts more flexibly to the feature distribution at different nodes and suits cases where the sample distribution differs greatly between nodes.
The performance of the approximation algorithm is almost the same as the exact greedy algorithm, but the computational cost is greatly reduced.
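In the Python package, the exact and approximate algorithms are selected via the tree_method parameter; a minimal sketch (parameter values are illustrative):

import xgboost as xgb

# 'exact'  : exact greedy algorithm, enumerates every possible split point
# 'approx' : approximate algorithm based on quantile sketches
# 'hist'   : histogram-based approximation, usually the fastest choice on large data
params = {"objective": "reg:squarederror", "tree_method": "approx"}
# bst = xgb.train(params, dtrain, num_boost_round=100)  # dtrain built as in the code example below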
2) Weighted Quantiles
In XGBoost, a Weighted Quantile Sketch is used to speed up the search for split points. Instead of choosing candidate split points based on the raw feature values alone, the weighted quantile algorithm uses the second-order derivative of each sample (the Hessian, \(h_i\)) as its weight, which balances the choice of split points better, especially in the approximate algorithm.
Setting the partial derivative to zero easily gives \(f_m^*(x_i)=-\frac{g_i}{h_i}\).
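This is why \(h_i\) acts as a sample weight: up to an additive constant, the second-order objective can be rewritten as a squared loss around \(-g_i/h_i\) weighted by \(h_i\) (a standard identity, sketched here in the notation above):

\[
\sum_{i=1}^n\Big[g_i f_m(x_i)+\frac{1}{2}h_i f_m^2(x_i)\Big]
=\sum_{i=1}^n\frac{1}{2}h_i\Big(f_m(x_i)+\frac{g_i}{h_i}\Big)^2+\text{const}
\]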
3) Column Sampling and Learning Rate
Column sampling means that when building each decision tree, XGBoost does not use all the features, but randomly selects some of them for splitting. This approach is derived from the idea of random forests and aims to increase the diversity of the model, thus preventing overfitting.
The learning rate is a very important hyperparameter in gradient boosted trees (GBDT), controlling how much each tree contributes to the model. It prevents the model from updating too aggressively, thereby improving stability and performance. Also called step size or shrinkage, it works by multiplying a factor onto each sub-model (i.e. onto the regression value of each leaf node) so that no single tree fits too aggressively, leaving room for more stable later iterations. XGBoost's default learning rate (eta) is 0.3.
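Both knobs map directly onto hyperparameters of the library; a minimal sketch (the values shown are illustrative, not recommendations):

params = {
    "eta": 0.1,               # learning rate / shrinkage applied to each new tree's leaf scores
    "colsample_bytree": 0.8,  # fraction of features randomly sampled when building each tree
    "colsample_bylevel": 1.0, # fraction of features sampled again at each tree depth level
    "colsample_bynode": 1.0,  # fraction of features sampled again at each split
}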
4) Missing Values and Sparsity
Briefly, XGBoost treats missing values and sparse zero values the same way, "bundling" them together; when enumerating split points, the traversal skips this bundle as a whole and only decides whether to send it to the left or the right branch. This greatly improves computational efficiency.
For example, in the example below there are six candidate splits; XGBoost traverses all six (3 cut points over the non-missing values × 2 directions for the missing values), and the largest gain found is taken as the split gain for this feature.
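In practice you can simply leave missing entries as NaN when constructing the DMatrix; a minimal sketch (the tiny array is made up for illustration):

import numpy as np
import xgboost as xgb

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 5.0],
              [4.0, 6.0]])
y = np.array([0, 1, 0, 1])

# NaN (configurable through the `missing` argument) is treated as missing;
# each learned split assigns these samples a default left/right direction.
dtrain = xgb.DMatrix(X, label=y, missing=np.nan)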
XGBoost Engineering Optimization
1) Parallel column block design
XGBoost pre-sorts each feature column and stores it in cache as a block, linking feature values to their gradient statistics through an index; at every node split it reuses these pre-sorted blocks. Moreover, different features live in independent blocks, so the computation can be distributed or multi-threaded.
2) Cache access optimization
Sorting by feature value and then fetching the gradients \(g_i,h_i\) through an index leads to non-contiguous memory access, which lowers the cache hit rate and hurts the efficiency of the algorithm. To solve this, XGBoost allocates a separate contiguous buffer for each thread to hold the gradient information.
3) Out-of-core block computation
In the case of very large amounts of data that cannot all be loaded into memory at the same time, XGBoost divides the data into blocks stored on the hard disk, and uses a separate thread dedicated to reading the data from the disk into memory, so as to achieve simultaneous computation and reading of data.
To further improve disk read data performance, XGBoost uses two other methods:
① Block compression: trade the overhead of decompression for a reduction in disk-read overhead.
② Block sharding: distribute the blocks across multiple disks to improve disk throughput.
XGBoost vs GBDT
GBDT is a machine learning algorithm, and XGBoost has some engineering implementation optimizations on top of the algorithm.
GBDT uses the first-order derivative of the loss function, which is equivalent to gradient descent in function space; XGBoost also uses the second-order derivative of the loss function, which is equivalent to Newton's method in function space.
Regularization: XGBoost explicitly adds regularization terms to control model complexity, which effectively prevents overfitting.
Column sampling: XGBoost borrows the random forest practice of randomly sampling a subset of features (columns) at each node split.
Missing values: XGBoost uses a sparsity-aware strategy to handle missing values, while GBDT has no missing-value handling strategy.
Parallelism and efficiency: XGBoost's column block design effectively supports parallel computation, giving better efficiency.
Code Implementation
You need to install xgboost first:
pip install xgboost
The code is as follows
# Import the required libraries
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes  # use the diabetes dataset
from sklearn.metrics import mean_squared_error

# 1. Load the diabetes dataset
# This dataset contains 442 samples with 10 features for predicting a continuous target variable
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target  # X is the feature data, y is the label (target variable)

# 2. Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Convert the data to DMatrix format
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# 4. Set the hyperparameters for the XGBoost model
params = {
    'objective': 'reg:squarederror',  # Objective function for the regression task: squared error
    'max_depth': 3,                   # Maximum depth of each tree, controls the complexity of the model
    'eta': 0.05,                      # Learning rate, controls the contribution of each tree to the overall model
    'eval_metric': 'rmse',            # Evaluation metric: root mean squared error (RMSE)
    'lambda': 2,                      # L2 regularization term to prevent overfitting
    'alpha': 0.5                      # L1 regularization term
}

# 5. Set the number of training rounds
num_round = 200  # Number of training rounds, i.e. how many trees to build

# 6. Define the evaluation datasets
evals = [(dtrain, 'train'), (dtest, 'eval')]  # (dataset, dataset name)

# 7. Train the XGBoost model, adding early_stopping_rounds to prevent overfitting
bst = xgb.train(params, dtrain, num_boost_round=num_round, evals=evals, early_stopping_rounds=10)

# 8. Predict on the test set using the trained model
y_pred = bst.predict(dtest)

# 9. Evaluate model performance
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# 10. Save the trained model
bst.save_model('xgboost_model.json')

# 11. Load the saved model
loaded_bst = xgb.Booster()
loaded_bst.load_model('xgboost_model.json')

# 12. Use the loaded model for prediction
y_pred_loaded = loaded_bst.predict(dtest)
mse_loaded = mean_squared_error(y_test, y_pred_loaded)
print(f"Mean Squared Error from loaded model: {mse_loaded}")
The results are as follows
[0] train-rmse:76.08309 eval-rmse:71.75905
[1] train-rmse:74.34324 eval-rmse:70.47408
[2] train-rmse:72.66427 eval-rmse:69.24759
[3] train-rmse:71.10664 eval-rmse:68.09809
[4] train-rmse:69.63498 eval-rmse:67.14668
[5] train-rmse:68.24045 eval-rmse:66.09854
[6] train-rmse:66.93042 eval-rmse:64.91738
[7] train-rmse:65.73304 eval-rmse:64.08775
[8] train-rmse:64.58640 eval-rmse:63.26052
[9] train-rmse:63.51304 eval-rmse:62.49745
[10] train-rmse:62.44810 eval-rmse:61.64759
[11] train-rmse:61.51387 eval-rmse:60.96222
[12] train-rmse:60.61767 eval-rmse:60.32972
[13] train-rmse:59.77722 eval-rmse:59.74329
[14] train-rmse:59.01348 eval-rmse:59.13121
[15] train-rmse:58.24704 eval-rmse:58.55106
[16] train-rmse:57.57392 eval-rmse:58.15165
[17] train-rmse:56.92761 eval-rmse:57.68188
[18] train-rmse:56.33319 eval-rmse:57.37781
[19] train-rmse:55.72582 eval-rmse:56.97001
[20] train-rmse:55.14420 eval-rmse:56.45029
[21] train-rmse:54.61096 eval-rmse:55.97904
[22] train-rmse:54.12594 eval-rmse:55.57225
[23] train-rmse:53.68383 eval-rmse:55.39305
[24] train-rmse:53.24822 eval-rmse:55.01127
[25] train-rmse:52.85214 eval-rmse:54.85699
[26] train-rmse:52.43814 eval-rmse:54.49904
[27] train-rmse:52.07004 eval-rmse:54.42905
[28] train-rmse:51.68191 eval-rmse:54.25354
[29] train-rmse:51.28268 eval-rmse:54.09452
[30] train-rmse:50.94229 eval-rmse:54.06703
[31] train-rmse:50.58475 eval-rmse:53.88010
[32] train-rmse:50.24739 eval-rmse:53.74475
[33] train-rmse:49.97042 eval-rmse:53.49905
[34] train-rmse:49.65855 eval-rmse:53.41597
[35] train-rmse:49.38190 eval-rmse:53.34692
[36] train-rmse:49.07203 eval-rmse:53.32202
[37] train-rmse:48.81472 eval-rmse:53.22084
[38] train-rmse:48.57124 eval-rmse:53.24058
[39] train-rmse:48.33730 eval-rmse:53.13983
[40] train-rmse:47.97171 eval-rmse:53.05406
[41] train-rmse:47.75619 eval-rmse:52.87405
[42] train-rmse:47.43067 eval-rmse:52.80852
[43] train-rmse:47.18844 eval-rmse:52.70296
[44] train-rmse:46.96694 eval-rmse:52.61260
[45] train-rmse:46.79053 eval-rmse:52.58588
[46] train-rmse:46.58746 eval-rmse:52.51602
[47] train-rmse:46.38476 eval-rmse:52.50433
[48] train-rmse:46.15591 eval-rmse:52.44922
[49] train-rmse:46.00542 eval-rmse:52.36981
[50] train-rmse:45.84480 eval-rmse:52.27445
[51] train-rmse:45.63700 eval-rmse:52.23794
[52] train-rmse:45.49250 eval-rmse:52.25740
[53] train-rmse:45.31208 eval-rmse:52.16836
[54] train-rmse:45.15374 eval-rmse:52.22044
[55] train-rmse:45.00284 eval-rmse:52.15072
[56] train-rmse:44.87677 eval-rmse:52.04112
[57] train-rmse:44.71921 eval-rmse:52.08482
[58] train-rmse:44.55626 eval-rmse:52.02783
[59] train-rmse:44.41483 eval-rmse:52.09304
[60] train-rmse:44.27997 eval-rmse:52.03098
[61] train-rmse:44.15710 eval-rmse:52.08378
[62] train-rmse:44.00683 eval-rmse:52.02136
[63] train-rmse:43.84878 eval-rmse:52.06178
[64] train-rmse:43.74180 eval-rmse:52.06495
[65] train-rmse:43.59775 eval-rmse:52.08875
[66] train-rmse:43.44009 eval-rmse:52.20317
[67] train-rmse:43.29717 eval-rmse:52.14245
[68] train-rmse:43.10437 eval-rmse:52.15464
[69] train-rmse:43.00768 eval-rmse:52.17011
[70] train-rmse:42.87951 eval-rmse:52.11852
[71] train-rmse:42.79951 eval-rmse:52.21249
[72] train-rmse:42.66769 eval-rmse:52.22331
Mean Squared Error: 2727.2736118611274
Mean Squared Error from loaded model: 2727.2736118611274
train-rmse is the RMSE between the predictions and the true values on the training set; eval-rmse is the model's RMSE on the test set.
Looking at the last round kept by the early-stopping mechanism: 42.66769 means the model's prediction error (RMSE) on the training set is about 42.67, and the lower the RMSE, the better the fit on the training data. 52.22 means the model's prediction error on the test set is noticeably higher than on the training set, which suggests some degree of overfitting: the model performs well on the training data, but its generalization to new data (the test set) is not as good as its training performance.