Background knowledge needed to read this article: the adaptive boosting (AdaBoost) algorithm, Taylor's formula, one-hot encoding, and a little programming knowledge.
1. Introduction
In the previous section we learned about the adaptive boosting (AdaBoost) algorithm, one member of the family of boosting algorithms. Another important algorithm in this family is the Gradient Boosted Decision Tree 1 (GBDT). GBDT and its variants are widely used in traditional machine learning, and understanding the ideas and principles behind it is very helpful for further study.
2. Introduction to the model
Like the adaptive boosting algorithm, the gradient boosting decision tree is a boosting algorithm and can also be interpreted as an additive model: the strong estimator after the kth round equals the strong estimator after round k - 1 plus a weak estimator h(x) with a coefficient:
$$ H_k(x) = H_{k - 1}(x) + \alpha_k h_k(x) $$
<center> Equation 2-1 </center>
Suppose the cost function after round k - 1 is $Cost(y, H_{k - 1}(X))$ and the cost function after round k is $Cost(y, H_k(X))$. Our goal is to make the cost function decrease with each iteration.
Since each cost function comes with its own optimization method, Jerome H. Friedman gave a unified treatment in the original paper 2 : the negative gradient of the cost function is used to fit the decrease of the cost function in the current round. This is an approximation of the steepest descent method; in essence, the cost function is approximated by its first-order Taylor expansion. When the base estimator is a decision regression tree, the algorithm is called a gradient boosted decision tree (GBDT).
$$ \hat{y}_{k, i} = -\left[\frac{\partial Cost(y_i, H(X_i))}{\partial H(X_i)} \right]_{H(x) = H_{k-1}(x)} $$
<center> Equation 2-2 </center>
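To make the negative-gradient idea concrete, here is a minimal sketch of one generic boosting round, assuming the negative gradient of the chosen cost is supplied as a function neg_grad(y, H) and a shallow regression tree is used as the weak learner (the function names are illustrative, not from the paper):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_round(X, y, H, neg_grad, learning_rate = 0.1):
    # Pseudo-residuals from Equation 2-2: the negative gradient at the current H
    y_hat = neg_grad(y, H)
    # Fit the weak learner to the negative gradient
    h = DecisionTreeRegressor(max_depth = 3)
    h.fit(X, y_hat)
    # Take a step in the fitted direction
    return H + learning_rate * h.predict(X), h

# For the squared error used in the regression subsection below,
# the negative gradient is simply the residual y - H
squared_error_neg_grad = lambda y, H: y - H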
In what follows, as with the AdaBoost algorithm, we consider the three problems of regression, binary classification, and multi-classification, and introduce the algorithm for each problem in turn. Since regression is simpler than classification for GBDT, the regression problem is introduced first this time.
Regression
For the regression problem, we first take the squared error as the cost function:
$$ Cost(y, H(x)) = (y - H(x))^2 $$
<center> Equation 2-3 </center>
The cost function of the kth round can be rewritten as follows:
(1) Substitute Equation 2-1
(2) Expand with the squared-error cost function
(3) Regroup the parentheses
(4) Denote the difference between y and the result of round k - 1 as $\hat{y}_k$
$$ \begin{aligned} Cost(y, H_{k}(x)) & = Cost(y, H_{k - 1}(x) + \alpha_kh_k(x)) & (1) \\ & = (y - (H_{k - 1}(x) + \alpha_kh_k(x)))^2 & (2) \\ &= ((y - H_{k - 1}(x)) - \alpha_kh_k(x))^2 & (3) \\ &= (\hat{y}_k - \alpha_kh_k(x))^2 & (4) \end{aligned} $$
<center> Equation 2-4 </center>
Observe equation (4) in Equation 2-4: this is exactly the cost function of the decision regression tree introduced in a previous chapter, except that the regression tree no longer fits the label y in the training set but the $\hat{y}$ above, which is also called the residual. After obtaining the regression tree, update H(x) and start a new iteration.
In this way we obtain the simplest GBDT algorithm, least-squares regression. For the specific steps, see the regression part of the algorithm steps in Section 3. Note that the coefficient $\alpha$ can be regarded as absorbed into the regression tree.
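As a made-up toy example (not from the paper): suppose the training targets are y = (1, 2, 6). Then $H_0(x) = \bar{y} = 3$ and the first-round residuals are $\hat{y}_1 = (-2, -1, 3)$. A depth-one regression tree that puts the third sample in its own leaf predicts the leaf means, so
$$ h_1(x) = (-1.5, -1.5, 3), \quad H_1(x) = H_0(x) + h_1(x) = (1.5, 1.5, 6) $$
and the second round fits the new residuals $(-0.5, 0.5, 0)$.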
In the paper the author also introduces several other cost functions, such as least absolute deviation (LAD) and Huber loss 3 ; interested readers can consult the corresponding chapters of the paper.
Since the negative gradient computed by GBDT is a continuous value, for classification problems the base estimator is not a classification tree but still a regression tree.
Binary classification
For the cost function of classification problems, either an exponential or a logarithmic function can be used. With the exponential function, GBDT is equivalent to the AdaBoost algorithm described in the previous section, so here we use the logarithmic function as the cost function:
$$ Cost(y, H(x)) = \log (1 + e^{-2yH(x)}) $$
<center> Equation 2-5 </center>
Following the method of computing the negative gradient introduced in the model section, substitute the cost function and compute $\hat{y}$:
$$ \begin{aligned} \hat{y}_{k, i} &= -\left[\frac{\partial Cost(y_i, H(X_i))}{\partial H(X_i)} \right]_{H(x) = H_{k-1}(x)} & (1)\\ &= \frac{2y_i}{1 + e^{2y_iH_{k-1}(X_i)}} & (2) \end{aligned} $$
<center> Equation 2-6 </center>
After computing $\hat{y}$, we can fit a regression tree estimator h(x). We then need the coefficient $\alpha$ of the estimator in each iteration:
$$ \alpha_k = \underset{\alpha}{argmin} \sum_{i = 1}^{N} \log (1 + e^{-2y_i(H_{k - 1}(X_i) + \alpha h_k(X_i))}) $$
<center> Equation 2-7 </center>
Let's first look at the regression tree estimator h(x). It can be written as the following formula, where the regression tree has J leaf nodes in total, $R_j$ and $\beta_j$ denote the training samples contained in and the value of the jth leaf node respectively, and $I(x)$ is the indicator function mentioned in a previous chapter:
$$ h(x) = \sum_{j = 1}^{J} \beta_j I(x \in R_j) $$
<center> Equation 2-8 </center>
Now substitute Equation 2-8 into Equation 2-7 and rewrite it. Instead of solving for the estimator coefficient $\alpha$, we solve directly for $\gamma$, where $\gamma_{k,j} = \alpha_k \beta_{k,j}$:
$$ \gamma_{k, j} = \underset{\gamma}{argmin} \sum_{X_i \in R_{k, j}}^{} \log (1 + e^{-2y_i(H_{k - 1}(X_i) + \gamma )}) $$
<center> Equation 2-9 </center>
Since Equation 2-9 contains the logarithm of an exponential function, it has no closed-form solution. In this case a second-order Taylor expansion can be used to approximate it, which gives the following result:
$$ \gamma_{k,j} = \frac{\sum_{X_i \in R_{k,j}} \hat{y}_{k,i}}{\sum_{X_i \in R_{k,j}} |\hat{y}_{k,i}| (2 - |\hat{y}_{k,i}|)} $$
<center> Equation 2-10 </center>
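As a made-up numeric illustration of Equation 2-10: if a leaf contains three samples with $\hat{y} = (0.8, -0.4, 1.2)$, then
$$ \gamma = \frac{0.8 - 0.4 + 1.2}{0.8 \cdot (2 - 0.8) + 0.4 \cdot (2 - 0.4) + 1.2 \cdot (2 - 1.2)} = \frac{1.6}{2.56} = 0.625 $$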
After obtaining $\gamma$, update H(x) and start a new iteration.
In this way we obtain the binary classification GBDT algorithm. For the specific steps, see the binary classification part of the algorithm steps in Section 3; for the derivation of the formula, see the proofs in Section 4.
Multi-classification
Multi-classification is more complicated than binary classification. It also uses the logarithmic function as its cost function, together with the Softmax function introduced in the earlier multi-class logistic regression article, and the input label y must additionally be one-hot encoded 4 . The cost function is as follows:
$$ \begin{aligned} Cost(y, H(x)) &= -\sum_{m = 1}^M y_m \log p_m(x) & (1) \\ p_{m}(x) &= \frac{e^{H_{m}(x)}}{\sum_{l=1}^M e^{H_{l}(x)}} & (2) \end{aligned} $$
<center> Equation 2-11 </center>
Similarly, following the method of computing the negative gradient introduced in the model section, substitute the cost function and compute $\hat{y}$:
$$ \begin{aligned} \hat{y}_{k, m, i} &= -\left[\frac{\partial Cost(y_i, H(X_i))}{\partial H(X_i)} \right]_{H(x) = H_{k-1}(x)} & (1)\\ &= y_{m, i} - p_{k-1, m}(X_i) & (2) \end{aligned} $$
<center> Equation 2-12 </center>
As in binary classification, fit a regression tree and convert its leaves to the corresponding $\gamma$. The difference is that one regression tree must be fitted for each class, so multi-classification requires fitting K * M decision regression trees in total:
$$ \gamma_{k, m, j} = \underset{\gamma}{argmin} \sum_{i = 1}^{N} \sum_{m = 1}^{M} Cost\left(y_{m,i}, H_{k - 1, m}(X_i) + \sum_{j = 1}^{J} \gamma I(X_i \in R_{k, m, j})\right) $$
<center> Equation 2-13 </center>
Similarly, a Taylor expansion is used to approximate it, giving the following result:
$$ \gamma_{k, m, j} = \frac{M - 1}{M} \frac{\sum_{X_i \in R_{k, m, j}} \hat{y}_{k, m, i}}{\sum_{X_i \in R_{k, m, j}} |\hat{y}_{k, m, i}| (1 - |\hat{y}_{k, m, i}|)} $$
<center> Equation 2-14 </center>
After obtaining $\gamma$, update the H(x) of the corresponding class and start a new iteration.
In this way we obtain the multi-classification GBDT algorithm. For the specific steps, see the multi-classification part of the algorithm steps in Section 3.
3. Algorithm steps
Regression
Suppose the training set is T = { $X_i$, $y_i$ }, i = 1, ..., N, h(x) is the estimator, and the number of estimators is K.
The steps of the gradient boosting decision tree regression algorithm are as follows:
Initialize $H_0(x)$ to the mean of y, $\bar{y}$
$$ H_0(X_i) = \bar{y} $$
Iterate over the number of estimators K times:
Calculate the residual of the kth round $\hat{y}_{k}$
$$ \hat{y}_{k, i} = y_i - H_{k-1}(X_i) $$
Fit the training set with the kth round $\hat{y}_{k}$ as the label values to obtain the decision regression tree estimator $h_k(x)$
Update $H_k(x)$
$$ H_k(X_i) = H_{k-1}(X_i) + h_k(X_i) $$
end loop
Final prediction strategy:
Input x; the K decision regression tree estimators predict in turn and their outputs are summed, then the initial value $H_0$ is added to obtain the final prediction
$$ H(x) = H_{0} + \sum_{k = 1}^K h_k(x) $$
Binary classification
Suppose the training set is T = { $X_i$, $y_i$ }, i = 1, ..., N, $y_i \in \{ -1, +1 \}$, h(x) is the estimator, and the number of estimators is K.
The steps of the gradient boosting decision tree binary classification algorithm are as follows:
Initialize $H_0(x)$, where $\bar{y}$ is the mean of y
$$ H_0(X_i) = \frac{1}{2} \log \frac{1 + \bar{y}}{1 - \bar{y}} $$
Iterate over the number of estimators K times:
Calculate $\hat{y}_{k}$ for the kth round
$$ \hat{y}_{k,i} = \frac{2y_i}{1 + e^{2y_iH_{k-1}(X_i)}} $$
Fit the training set with the kth round $\hat{y}_{k}$ as the label values to obtain the decision regression tree estimator $h_k(x)$, where $h_k(x)$ contains J leaf nodes
Calculate the coefficient $\gamma$ of the jth leaf node in the kth round, where $R_{k,j}$ denotes the training samples contained in the jth leaf node of the decision regression tree estimator $h_k(x)$ fitted in the kth round
$$ \gamma_{k,j} = \frac{\sum_{X_i \in R_{k,j}} \hat{y}_{k,i}}{\sum_{X_i \in R_{k,j}} |\hat{y}_{k,i}| (2 - |\hat{y}_{k,i}|)} $$
Update $H_k(x)$, where $I(x)$ is the indicator function mentioned in the previous section
$$ H_k(X_i) = H_{k - 1}(X_i) + \sum_{j = 1}^{J} \gamma_{k,j} I(X_i \in R_{k,j}) $$
end loop
Final prediction strategy:
Input x; each of the K decision regression tree estimators determines the leaf node in turn, the coefficients $\gamma$ of those leaf nodes are accumulated, and the initial value $H_0$ is added to obtain H(x)
$$ H(x) = H_0 + \sum_{k = 1}^{K}\sum_{j = 1}^{J} \gamma_{k,j} I(x \in R_{k,j}) $$
Calculate the probabilities of the positive and negative classes separately
$$ \left\{ \begin{aligned} p_+(x) &= \frac{1}{1 + e^{-2H(x)}} \\ p_-(x) &= \frac{1}{1 + e^{2H(x)}} \end{aligned} \right. $$
Take the class with the larger probability as the final classification result
$$ \underset{m}{argmax} \space p_m(x) \quad (m \in \{+ , -\}) $$
Multi-classification
Suppose the training set is T = { $X_i$, $y_i$ }, i = 1, ..., N, y has M possible values, h(x) is the estimator, and the number of estimators is K.
The steps of the gradient boosting decision tree multi-classification algorithm are as follows:
One-hot encode the label values y in the training set
Initialize $H_{0,m}(x)$, where m denotes the mth class; the original paper initializes it to 0, while the scikit-learn implementation uses the prior probability of each class
$$ H_{0, m}(X_i) = \frac{\sum_{i=1}^{N} I(y_i = m)}{N} \quad \text{or} \quad 0 $$
Iterate over the number of estimators K times:
Traverse the number of categories M times:
Calculate the probability p(x) of the mth class for round k - 1
$$ p_{k-1, m}(X_i) = \frac{e^{H_{k-1, m}(X_i)}}{\sum_{l=1}^M e^{H_{k-1, l}(X_i)}} $$
Calculate the $\hat{y}_{k, m}$ of the mth class in the kth round
$$ \hat{y}_{k, m, i} = y_{m, i} - p_{k-1, m}(X_i) $$
Fit the training set with the kth round $\hat{y}_{k, m}$ of the mth class as the label values to obtain the decision regression tree estimator $h_{k,m}(x)$, where $h_{k,m}(x)$ contains J leaf nodes
Calculate the coefficient $\gamma$ of the jth leaf node of the mth class in the kth round, where $R_{k,m,j}$ denotes the training samples contained in the jth leaf node of the decision regression tree estimator $h_{k,m}(x)$ fitted for the mth class in the kth round
$$ \gamma_{k, m, j} = \frac{M - 1}{M} \frac{\sum_{X_i \in R_{k, m, j}} \hat{y}_{k, m, i}}{\sum_{X_i \in R_{k, m, j}} |\hat{y}_{k, m, i}| (1 - |\hat{y}_{k, m, i}|)} $$
Update $H_{k,m}(x)$, where $I(x)$ is the indicator function mentioned in the previous section
$$ H_{k, m}(X_i) = H_{k - 1, m}(X_i) + \sum_{j = 1}^{J} \gamma_{k, m, j} I(X_i \in R_{k, m, j}) $$
end loop
end loop
Final prediction strategy:
Input x; for the mth class, each of the K decision regression tree estimators determines the leaf node in turn, the coefficients $\gamma$ of those leaf nodes are accumulated, and the initial value $H_{0,m}$ of the mth class is added to obtain $H_m(x)$
$$ H_{m}(x) = H_{0,m} + \sum_{k = 1}^{K} \sum_{j = 1}^{J} \gamma_{k, m, j} I(x \in R_{k, m, j}) $$
Calculate the probability of each class in turn
$$ p_m(x) = \frac{e^{H_m(x)}}{\sum_{l = 1}^M e^{H_l(x)}} $$
Take the class with the largest probability as the final classification result
$$ \underset{m}{argmax} \space p_m(x) \quad (m = 1,2,\dots,M) $$
4. Proof of Principle
Initial value for the regression problem
For least-squares regression with the squared error as the cost function, the initial value is the mean of y:
(1) Cost function of the regression problem
(2) Take the derivative of the cost function and set it to zero
(3) The initial value is the mean of y
$$ \begin{aligned} Cost(H(x)) &= \sum_{i = 1}^{N} (y_i - H(x))^2 & (1) \\ \frac{\partial Cost(H(x))}{\partial H(x)} &= 2\sum_{i = 1}^{N} (H(x) - y_i) = 0 & (2) \\ H(x) &= \frac{\sum_{i = 1}^{N} y_i}{N} = \bar{y} & (3) \\ \end{aligned} $$
<center> Equation 4-1 </center>
Q.E.D.
Initial value for the binary classification problem
For a binary classification problem, $y \in \{ -1, +1 \}$:
(1) The number of samples with $y = +1$
(2) The number of samples with $y = -1$
(3) The two counts add up to N
$$ \begin{aligned} n_{+} &= \sum_{i = 1}^N I(y_i = +1) & (1) \\ n_{-} &= \sum_{i = 1}^N I(y_i = -1) & (2) \\ N &= n_{+} + n_{-} & (3) \\ \end{aligned} $$
<center> Equation 4-2 </center>
(1) The expression for the mean of y
(2) Simplify
$$ \begin{aligned} \bar{y} &= \frac{1 * n_{+} + (-1) * n_{-}}{N} & (1) \\ &= \frac{n_{+} - n_{-}}{N} & (2) \\ \end{aligned} $$
<center> Equation 4-3 </center>
From equation (3) in Equation 4-2 and equation (2) in Equation 4-3, the following results can be obtained:
$$ \begin{aligned} n_{+} &= \frac{N(1 + \bar{y})}{2} & (1) \\ n_{-} &= \frac{N(1 - \bar{y})}{2} & (2) \end{aligned} $$
<center> Equation 4-4 </center>
(1) Cost function of the binary classification problem
(2) Take the derivative of the cost function
(3) Split the result of (2) into two sums
(4) Substitute the results of Equation 4-4 and remove the summation signs
(5) Simplify and set it to zero
(6) Solve for the initial value
$$ \begin{aligned} Cost(H(x)) & = \sum_{i = 1}^N \log (1 + e^{-2y_iH(x)}) & (1) \\ \frac{\partial Cost(H(x))}{\partial H(x)} &= -\sum_{i = 1}^N \frac{2y_i}{1 + e^{2y_iH(x)}} & (2) \\ &= -\sum_{y_i = +1} \frac{2y_i}{1 + e^{2y_iH(x)}} - \sum_{y_i = -1} \frac{2y_i}{1 + e^{2y_iH(x)}} & (3) \\ &= - \frac{N(1 + \bar{y})}{2} * \frac{2}{1 + e^{2H(x)}} - \frac{N(1 - \bar{y})}{2} * \frac{-2}{1 + e^{-2H(x)}} & (4) \\ &= - \frac{N(1 + \bar{y})}{1 + e^{2H(x)}} + \frac{N(1 - \bar{y})}{1 + e^{-2H(x)}} = 0 & (5) \\ H(x) &= \frac{1}{2} \log \frac{1 + \bar{y}}{1 - \bar{y}} & (6) \end{aligned} $$
<center> Equation 4-5 </center>
Q.E.D.
Coefficient $\gamma$ for the binary classification problem
Let's first look at the optimization objective for $\gamma$:
(1) Cost function of the binary classification problem
(2) The optimization objective for $\gamma$ obtained from Equation 2-9
(3) Second-order Taylor expansion of equation (2) around $H_{k - 1}(x)$
(4) Note that $H(x) - H_{k - 1}(x)$ equals $\gamma$
$$ \begin{aligned} Cost(H(x)) &= \sum_{X_i \in R_{k, j}} \log (1 + e^{-2y_iH(X_i)}) & (1) \\ \gamma_{k, j} &= \underset{\gamma}{argmin} \sum_{X_i \in R_{k, j}} \log (1 + e^{-2y_i(H_{k - 1}(X_i) + \gamma )}) & (2)\\ &= \underset{\gamma}{argmin} \sum_{X_i \in R_{k, j}} Cost(H_{k - 1}(X_i)) + Cost^{'}(H_{k - 1}(X_i)) (H(X_i) - H_{k - 1}(X_i)) + \frac{1}{2} Cost^{''}(H_{k - 1}(X_i)) (H(X_i) - H_{k - 1}(X_i))^2 & (3) \\ &= \underset{\gamma}{argmin} \sum_{X_i \in R_{k, j}} Cost(H_{k - 1}(X_i)) + Cost^{'}(H_{k - 1}(X_i)) \gamma + \frac{1}{2} Cost^{''}(H_{k - 1}(X_i)) \gamma^2 & (4) \end{aligned} $$
<center> Equation 4-6 </center>
Solve for its approximation:
(1) The Taylor expansion approximation of $\gamma$ obtained from Equation 4-6
(2) Take the derivative of the function and set it to zero
(3) Obtain the expression for $\gamma$
$$ \begin{aligned} \phi (\gamma) &= \sum_{X_i \in R_{k, j}} Cost(H_{k - 1}(X_i)) + Cost^{'}(H_{k - 1}(X_i)) \gamma + \frac{1}{2} Cost^{''}(H_{k - 1}(X_i)) \gamma^2 & (1) \\ \frac{\partial \phi (\gamma)}{\partial \gamma} &= \sum_{X_i \in R_{k, j}} Cost^{'}(H_{k - 1}(X_i)) + Cost^{''}(H_{k - 1}(X_i)) \gamma = 0 & (2) \\ \gamma &= -\frac{\sum_{X_i \in R_{k, j}} Cost^{'}(H_{k - 1}(X_i))}{\sum_{X_i \in R_{k, j}} Cost^{''}(H_{k - 1}(X_i))} & (3) \\ \end{aligned} $$
<center> Equation 4-7 </center>
The first and second derivatives of the cost function are calculated as follows:
$$ \begin{aligned} Cost^{'}(H_{k - 1}(X_i)) &= -\frac{2y_i}{1 + e^{2y_iH_{k-1}(X_i)}} = -\hat{y}_i & (1) \\ Cost^{''}(H_{k - 1}(X_i)) &= \frac{4y_i^2e^{2y_iH_{k-1}(X_i)}}{(1 + e^{2y_iH_{k-1}(X_i)})^2} = 2\hat{y}_iy_i - \hat{y}_i^2 & (2) \\ \end{aligned} $$
<center> Equation 4-8 </center>
(1) The expression for $\gamma$ obtained in Equation 4-7
(2) Substitute Equation 4-8
(3) When $y = +1$, $\hat{y} \in (0, 2)$; when $y = -1$, $\hat{y} \in (-2, 0)$, so $\hat{y}_i y_i = |\hat{y}_i|$
(4) Factor $|\hat{y}_i|$ out of the denominator
$$ \begin{aligned} \gamma &= -\frac{\sum_{X_i \in R_{k, j}} Cost^{'}(H_{k - 1}(X_i))}{\sum_{X_i \in R_{k, j}} Cost^{''}(H_{k - 1}(X_i))} & (1) \\ &= \frac{\sum_{X_i \in R_{k, j}} \hat{y}_i}{\sum_{X_i \in R_{k, j}} (2\hat{y}_iy_i - \hat{y}_i^2)} & (2) \\ &= \frac{\sum_{X_i \in R_{k, j}} \hat{y}_i}{\sum_{X_i \in R_{k, j}} (2|\hat{y}_i| - \hat{y}_i^2)} & (3) \\ &= \frac{\sum_{X_i \in R_{k, j}} \hat{y}_{i}}{\sum_{X_i \in R_{k,j}} |\hat{y}_{i}| (2 - |\hat{y}_{i}|)} & (4) \\ \end{aligned} $$
<center> Equation 4-9 </center>
Q.E.D.
Coefficient $\gamma$ for the multi-classification problem
The coefficient $\gamma$ for the multi-classification problem involves the Hessian matrix because the trees of the different classes are coupled, so it cannot be solved for separately as in the binary classification case. The author's ability here is limited; if you know how to derive the result, please leave a comment or send a private message.
5. Regularization
The gradient boosting tree also needs regularization to prevent overfitting. The common regularization methods are as follows:
Learning rate
A learning-rate hyperparameter $\eta$ is added when H(x) is updated in each iteration. The following formulas show how the learning rate $\eta$ is used in the different problems:
$$ \begin{aligned} H_k(x) &= H_{k - 1}(x) + \eta \alpha_k h_k(x) & (1) \\ H_k(x) &= H_{k - 1}(x) + \eta \sum_{j = 1}^{J} \gamma_{k,j} I(x \in R_{k,j}) & (2) \\ H_{k, m}(x) &= H_{k - 1, m}(x) + \eta \sum_{j = 1}^{J} \gamma_{k, m, j} I(x \in R_{k, m, j}) & (3) \\ \end{aligned} $$
<center> Equation 5-1 </center>
The learning rate $\eta \in (0, 1]$. When $\eta$ is too small, more iterations are needed to achieve a good fit, so this hyperparameter has to be considered together with the number of estimators. In scikit-learn it is controlled by the learning_rate parameter.
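As a quick illustration of this trade-off with scikit-learn (the numbers below are arbitrary and only for comparison):

from sklearn.ensemble import GradientBoostingRegressor

# A larger step size usually needs fewer trees ...
fast = GradientBoostingRegressor(learning_rate = 0.5, n_estimators = 50)
# ... while a smaller step size usually needs more trees to reach a comparable fit
slow = GradientBoostingRegressor(learning_rate = 0.05, n_estimators = 500)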
Subsampling
Subsampling is similar to stochastic gradient descent: only a fraction of the training set is used for learning in each round, which reduces the variance but also increases the bias. In scikit-learn it is controlled by the subsample parameter, a number greater than 0 and less than or equal to 1.
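A minimal sketch of the idea, not taken from the article's code: one boosting round with squared error in which only a random fraction of the rows is used to grow the tree, assuming X, y, H are NumPy arrays and subsample is a fraction in (0, 1]:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_one_round(X, y, H, subsample = 0.5, learning_rate = 0.1):
    n = X.shape[0]
    # Draw a random subset of the rows (without replacement) for this round
    idx = np.random.choice(n, size = int(n * subsample), replace = False)
    # Residuals for the squared-error cost
    y_hat = y - H
    estimator = DecisionTreeRegressor(max_depth = 3)
    # Grow the tree only on the subsample
    estimator.fit(X[idx], y_hat[idx])
    # Update the predictions on all rows
    return H + learning_rate * estimator.predict(X), estimator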
Decision tree pruning
Pruning the decision trees follows the same methods introduced in the earlier decision tree section; regularization is achieved by constraining the base estimators. In scikit-learn this is controlled by the decision-tree-related parameters.
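For example, in scikit-learn the complexity of every base tree can be limited through the tree-related parameters of the ensemble itself; the parameter names below are real GradientBoostingClassifier parameters, while the values are arbitrary:

from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(
    max_depth = 3,           # limit the depth of each tree
    min_samples_split = 10,  # minimum number of samples required to split a node
    min_samples_leaf = 5,    # minimum number of samples required in a leaf
    max_leaf_nodes = 8,      # cap the number of leaves per tree
)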
The code implementation below only applies regularization through the learning rate; for the other methods, refer to the scikit-learn source code.
6. Code implementation
Use Python to implement the gradient boosting tree regression algorithm:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
class gbdtr:
    """
    Gradient boosting tree regression algorithm
    """

    def __init__(self, n_estimators = 100, learning_rate = 0.1):
        # Number of weak learners in the gradient boosting tree
        self.n_estimators = n_estimators
        # Learning rate
        self.learning_rate = learning_rate

    def fit(self, X, y):
        """
        Fit the gradient boosting tree regression algorithm
        """
        # Initialize H0
        self.H0 = np.average(y)
        # Initialize the predictions
        H = np.ones(X.shape[0]) * self.H0
        # Array of estimators
        estimators = []
        # Iterate n_estimators times
        for k in range(self.n_estimators):
            # Compute the residuals y_hat
            y_hat = y - H
            # Initialize the decision regression tree estimator
            estimator = DecisionTreeRegressor(max_depth = 3)
            # Fit the training set with y_hat as the label values
            estimator.fit(X, y_hat)
            # Predictions of the regression tree
            y_predict = estimator.predict(X)
            # Update the predictions
            H += self.learning_rate * y_predict
            estimators.append(estimator)
        self.estimators = np.array(estimators)

    def predict(self, X):
        """
        Predict with the gradient boosting tree regression algorithm
        """
        # Initialize the predictions
        H = np.ones(X.shape[0]) * self.H0
        # Iterate over the estimators
        for k in range(self.n_estimators):
            estimator = self.estimators[k]
            y_predict = estimator.predict(X)
            # Accumulate the predictions
            H += self.learning_rate * y_predict
        return H
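As a quick sanity check, the class above can be exercised on a small synthetic dataset; the data below is made up purely for illustration:

import numpy as np

np.random.seed(0)
X = np.random.rand(200, 2)
y = 3 * X[:, 0] - 2 * X[:, 1] + np.random.randn(200) * 0.1

reg = gbdtr(n_estimators = 100, learning_rate = 0.1)
reg.fit(X, y)
# Predictions for the first five training samples
print(reg.predict(X[:5]))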
Use Python to implement the gradient boosting tree binary classification algorithm:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
class gbdtc:
    """
    Gradient boosting tree binary classification algorithm
    """

    def __init__(self, n_estimators = 100, learning_rate = 0.1):
        # Number of weak learners in the gradient boosting tree
        self.n_estimators = n_estimators
        # Learning rate
        self.learning_rate = learning_rate

    def fit(self, X, y):
        """
        Fit the gradient boosting tree binary classification algorithm
        """
        # Label classes
        self.y_classes = np.unique(y)
        # Number of label classes
        self.n_classes = len(self.y_classes)
        # Mean of the labels
        y_avg = np.average(y)
        # Initialize H0
        self.H0 = np.log((1 + y_avg) / (1 - y_avg)) / 2
        # Initialize the predictions
        H = np.ones(X.shape[0]) * self.H0
        # Array of estimators
        estimators = []
        # Array of leaf node values
        gammas = []
        for k in range(self.n_estimators):
            # Compute y_hat
            y_hat = 2 * np.multiply(y, 1 / (1 + np.exp(2 * np.multiply(y, H))))
            # Initialize the decision regression tree estimator
            estimator = DecisionTreeRegressor(max_depth = 3, criterion="friedman_mse")
            # Fit the training set with y_hat as the label values
            estimator.fit(X, y_hat)
            # Leaf node of the current regression tree for each training sample
            leaf_ids = estimator.apply(X)
            # Indices of the training samples contained in each leaf node
            node_ids_dict = self.get_leaf_nodes(leaf_ids)
            # Dictionary of leaf node values
            gamma_dict = {}
            # Compute the leaf node values
            for leaf_id, node_ids in node_ids_dict.items():
                # y_hat contained in the current leaf node
                y_hat_sub = y_hat[node_ids]
                y_hat_sub_abs = np.abs(y_hat_sub)
                # Compute the leaf node value
                gamma = np.sum(y_hat_sub) / np.sum(np.multiply(y_hat_sub_abs, 2 - y_hat_sub_abs))
                gamma_dict[leaf_id] = gamma
                # Update the predictions
                H[node_ids] += self.learning_rate * gamma
            estimators.append(estimator)
            gammas.append(gamma_dict)
        self.estimators = estimators
        self.gammas = gammas

    def predict(self, X):
        """
        Predict with the gradient boosting tree binary classification algorithm
        """
        # Initialize the predictions
        H = np.ones(X.shape[0]) * self.H0
        # Iterate over the estimators
        for k in range(self.n_estimators):
            estimator = self.estimators[k]
            # Leaf node of the current regression tree for each sample
            leaf_ids = estimator.apply(X)
            # Indices of the samples contained in each leaf node
            node_ids_dict = self.get_leaf_nodes(leaf_ids)
            # Dictionary of leaf node values
            gamma_dict = self.gammas[k]
            # Accumulate the predictions
            for leaf_id, node_ids in node_ids_dict.items():
                gamma = gamma_dict[leaf_id]
                H[node_ids] += self.learning_rate * gamma
        # Compute the probabilities of the two classes
        probs = np.zeros((X.shape[0], self.n_classes))
        probs[:, 0] = 1 / (1 + np.exp(2 * H))
        probs[:, 1] = 1 / (1 + np.exp(-2 * H))
        return self.y_classes.take(np.argmax(probs, axis=1), axis = 0)

    def get_leaf_nodes(self, leaf_ids):
        """
        Indices of the samples contained in each leaf node
        """
        node_ids_dict = {}
        for j in range(len(leaf_ids)):
            leaf_id = leaf_ids[j]
            node_ids = node_ids_dict.setdefault(leaf_id, [])
            node_ids.append(j)
        return node_ids_dict
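A minimal usage sketch (not from the original article). Note that fit assumes labels in {-1, +1}, since it computes $H_0 = \frac{1}{2}\log\frac{1 + \bar{y}}{1 - \bar{y}}$ from the label mean; the synthetic data below is made up for illustration:

import numpy as np

np.random.seed(0)
X = np.random.rand(200, 2)
# Labels must be -1 / +1
y = np.where(X[:, 0] + X[:, 1] > 1, 1, -1)

clf = gbdtc(n_estimators = 100, learning_rate = 0.1)
clf.fit(X, y)
print(clf.predict(X[:5]))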
Use Python to implement the gradient boosting tree multi-classification algorithm:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
class gbdtmc:
    """
    Gradient boosting tree multi-classification algorithm
    """

    def __init__(self, n_estimators = 100, learning_rate = 0.1):
        # Number of weak learners in the gradient boosting tree
        self.n_estimators = n_estimators
        # Learning rate
        self.learning_rate = learning_rate

    def fit(self, X, y):
        """
        Fit the gradient boosting tree multi-classification algorithm
        """
        # Label classes and the number of samples of each class
        self.y_classes, y_counts = np.unique(y, return_counts = True)
        # Number of label classes
        self.n_classes = len(self.y_classes)
        # One-hot encode the labels
        y_onehot = np.zeros((y.size, y.max() + 1))
        y_onehot[np.arange(y.size), y] = 1
        # Initialize H0
        self.H0 = y_counts / X.shape[0]
        # Initialize the predictions
        H = np.ones((X.shape[0], 1)).dot(self.H0.reshape(1, -1))
        # Array of estimators
        estimators = []
        # Array of leaf node values
        gammas = []
        # Iterate n_estimators times
        for k in range(self.n_estimators):
            H_exp = np.exp(H)
            H_exp_sum = np.sum(H_exp, axis = 1)
            # Estimators of the current round
            sub_estimators = []
            # Leaf node values of the current round
            sub_gammas = []
            # Iterate n_classes times
            for m in range(self.n_classes):
                p_m = H_exp[:, m] / H_exp_sum
                # Compute the mth y_hat
                y_hat_m = y_onehot[:, m] - p_m
                # Initialize the decision regression tree estimator
                estimator = DecisionTreeRegressor(max_depth = 3, criterion="friedman_mse")
                # Fit the training set with the mth y_hat as the label values
                estimator.fit(X, y_hat_m)
                # Leaf node of the current regression tree for each training sample
                leaf_ids = estimator.apply(X)
                # Indices of the training samples contained in each leaf node
                node_ids_dict = self.get_leaf_nodes(leaf_ids)
                # Dictionary of leaf node values
                gamma_dict = {}
                # Compute the leaf node values
                for leaf_id, node_ids in node_ids_dict.items():
                    # y_hat contained in the current leaf node
                    y_hat_sub = y_hat_m[node_ids]
                    y_hat_sub_abs = np.abs(y_hat_sub)
                    # Compute the leaf node value
                    gamma = np.sum(y_hat_sub) / np.sum(np.multiply(y_hat_sub_abs, 1 - y_hat_sub_abs)) * (self.n_classes - 1) / self.n_classes
                    gamma_dict[leaf_id] = gamma
                    # Update the predictions
                    H[node_ids, m] += self.learning_rate * gamma
                sub_estimators.append(estimator)
                sub_gammas.append(gamma_dict)
            estimators.append(sub_estimators)
            gammas.append(sub_gammas)
        self.estimators = estimators
        self.gammas = gammas

    def predict(self, X):
        """
        Predict with the gradient boosting tree multi-classification algorithm
        """
        # Initialize the predictions
        H = np.ones((X.shape[0], 1)).dot(self.H0.reshape(1, -1))
        # Iterate over the estimators
        for k in range(self.n_estimators):
            sub_estimators = self.estimators[k]
            sub_gammas = self.gammas[k]
            # Iterate over the classes
            for m in range(self.n_classes):
                estimator = sub_estimators[m]
                # Leaf node of the current regression tree for each sample
                leaf_ids = estimator.apply(X)
                # Indices of the samples contained in each leaf node
                node_ids_dict = self.get_leaf_nodes(leaf_ids)
                # Dictionary of leaf node values
                gamma_dict = sub_gammas[m]
                # Accumulate the predictions
                for leaf_id, node_ids in node_ids_dict.items():
                    gamma = gamma_dict[leaf_id]
                    H[node_ids, m] += self.learning_rate * gamma
        H_exp = np.exp(H)
        H_exp_sum = np.sum(H_exp, axis = 1)
        # Compute the probability of each class
        probs = np.zeros((X.shape[0], self.n_classes))
        for m in range(self.n_classes):
            probs[:, m] = H_exp[:, m] / H_exp_sum
        return self.y_classes.take(np.argmax(probs, axis=1), axis = 0)

    def get_leaf_nodes(self, leaf_ids):
        """
        Indices of the samples contained in each leaf node
        """
        node_ids_dict = {}
        for j in range(len(leaf_ids)):
            leaf_id = leaf_ids[j]
            node_ids = node_ids_dict.setdefault(leaf_id, [])
            node_ids.append(j)
        return node_ids_dict
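A minimal usage sketch (not from the original article). Because the one-hot step allocates y.max() + 1 columns and indexes them with y, the labels are assumed to be integers 0, 1, ..., M - 1; the synthetic data below is made up for illustration:

import numpy as np

np.random.seed(0)
X = np.random.rand(300, 2)
# Integer labels 0, 1, 2
y = (X[:, 0] * 3).astype(int)

clf = gbdtmc(n_estimators = 100, learning_rate = 0.1)
clf.fit(X, y)
print(clf.predict(X[:5]))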
7. Third-party library implementation
Classification implementation using scikit-learn 5 :
from sklearn.ensemble import GradientBoostingClassifier
# Gradient boosting tree classifier
clf = GradientBoostingClassifier(n_estimators = 100)
# Fit the dataset
clf = clf.fit(X, y)
Regression implementation using scikit-learn 6 :
from sklearn.ensemble import GradientBoostingRegressor
# Gradient boosting tree regressor
reg = GradientBoostingRegressor(n_estimators = 100, max_depth = 3, random_state = 0, loss = 'ls')
# Fit the dataset
reg = reg.fit(X, y)
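For completeness, a minimal end-to-end sketch is shown below; the synthetic dataset from sklearn.datasets.make_regression is not part of the original example and is used only for illustration:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples = 200, n_features = 4, noise = 0.1, random_state = 0)
reg = GradientBoostingRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 3, random_state = 0)
reg.fit(X, y)
# Predictions for the first three samples and the R^2 score on the training data
print(reg.predict(X[:3]))
print(reg.score(X, y))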
8. Example demonstration
The following three figures show the results of using the gradient boosting algorithm for binary classification, multi-classification, and regression respectively:
<center> Figure 8-1 </center>
<center> Figure 8-2 </center>
<center> Figure 8-3 </center>
9. Mind map
<center> Figure 9-1 </center>
10. References
1. https://en.wikipedia.org/wiki/Gradient_boosting
2. https://www.cse.cuhk.edu.hk/irwin.king/_media/presentations/2001_greedy_function_approximation_a_gradient_boosting_machine.pdf
3. https://en.wikipedia.org/wiki/Huber_loss
4. https://en.wikipedia.org/wiki/One-hot
5. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
6. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
For full demo click here
Note: This article strives to be accurate and easy to understand, but since the author is also a beginner with limited knowledge, if there are errors or omissions in the text, readers are kindly asked to point them out by leaving a comment.
This article was first published on AI Map; you are welcome to follow it.