Background knowledge needed to read this article: the adaptive boosting (AdaBoost) algorithm, Taylor's formula, one-hot encoding, and a little programming knowledge.
1. Introduction
In the previous section we learned about the adaptive boosting (AdaBoost) algorithm, one member of the family of boosting algorithms. Another important algorithm in this family is the Gradient Boosted Decision Tree 1 (GBDT). GBDT and its variants are widely used in traditional machine learning, and understanding the ideas and principles behind it is very helpful for further study.
2. Introduction to the model
Like the adaptive boosting algorithm, the gradient boosting decision tree is a boosting algorithm and can also be interpreted as an additive model: the strong estimator after the kth round equals the strong estimator after round k - 1 plus a weak estimator h(x) with a coefficient:
$$ H_k(x) = H_{k - 1}(x) + \alpha_k h_k(x) $$
<center> Equation 2-1 </center>
Suppose the cost function after round k - 1 is $Cost(y, H_{k - 1}(X))$ and the cost function after round k is $Cost(y, H_k(X))$. Our goal is to make the cost function decrease with each iteration.
Since each cost function comes with its own optimization method, Jerome H. Friedman gave a unified treatment in the original paper 2 : the negative gradient of the cost function is used to fit the decrease of the cost function in the current round. This is an approximation of the steepest descent method; in essence, the cost function is approximated by its first-order Taylor expansion. When the base estimator is a decision regression tree, the algorithm is called a gradient boosted decision tree (GBDT).
$$ \hat{y}_{k, i} = -\left[\frac{\partial Cost(y_i, H(X_i))}{\partial H(X_i)} \right]_{H(x) = H_{k-1}(x)} $$
<center> Equation 2-2 </center>
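To make the negative-gradient idea concrete, here is a minimal sketch of one generic boosting round, assuming the negative gradient of the chosen cost is supplied as a function neg_grad(y, H) and a shallow regression tree is used as the weak learner (the function names are illustrative, not from the paper):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_round(X, y, H, neg_grad, learning_rate = 0.1):
    # Pseudo-residuals from Equation 2-2: the negative gradient at the current H
    y_hat = neg_grad(y, H)
    # Fit the weak learner to the negative gradient
    h = DecisionTreeRegressor(max_depth = 3)
    h.fit(X, y_hat)
    # Take a step in the fitted direction
    return H + learning_rate * h.predict(X), h

# For the squared error used in the regression subsection below,
# the negative gradient is simply the residual y - H
squared_error_neg_grad = lambda y, H: y - H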
In what follows, as with the AdaBoost algorithm, we consider the three problems of regression, binary classification, and multi-classification, and introduce the algorithm for each problem in turn. Since regression is simpler than classification for GBDT, the regression problem is introduced first this time.
Regression
For the regression problem, we first take the squared error as the cost function:
$$ Cost(y, H(x)) = (y - H(x))^2 $$
<center> Equation 2-3 </center>
The cost function of the kth round can be rewritten as follows:
(1) Substitute Equation 2-1
(2) Expand with the squared-error cost function
(3) Regroup the parentheses
(4) Denote the difference between y and the result of round k - 1 as $\hat{y}_k$
$$ \begin{aligned} Cost(y, H_{k}(x)) & = Cost(y, H_{k - 1}(x) + \alpha_kh_k(x)) & (1) \\ & = (y - (H_{k - 1}(x) + \alpha_kh_k(x)))^2 & (2) \\ &= ((y - H_{k - 1}(x)) - \alpha_kh_k(x))^2 & (3) \\ &= (\hat{y}_k - \alpha_kh_k(x))^2 & (4) \end{aligned} $$
<center> Equation 2-4 </center>
Observe equation (4) in Equation 2-4: this is exactly the cost function of the decision regression tree introduced in a previous chapter, except that the regression tree no longer fits the label y in the training set but the $\hat{y}$ above, which is also called the residual. After obtaining the regression tree, update H(x) and start a new iteration.
In this way we obtain the simplest GBDT algorithm, least-squares regression. For the specific steps, see the regression part of the algorithm steps in Section 3. Note that the coefficient $\alpha$ can be regarded as absorbed into the regression tree.
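As a made-up toy example (not from the paper): suppose the training targets are y = (1, 2, 6). Then $H_0(x) = \bar{y} = 3$ and the first-round residuals are $\hat{y}_1 = (-2, -1, 3)$. A depth-one regression tree that puts the third sample in its own leaf predicts the leaf means, so
$$ h_1(x) = (-1.5, -1.5, 3), \quad H_1(x) = H_0(x) + h_1(x) = (1.5, 1.5, 6) $$
and the second round fits the new residuals $(-0.5, 0.5, 0)$.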
In the paper the author also introduces several other cost functions, such as least absolute deviation (LAD) and Huber loss 3 ; interested readers can consult the corresponding chapters of the paper.
Since the negative gradient computed by GBDT is a continuous value, for classification problems the base estimator is not a classification tree but still a regression tree.
Binary classification
For the cost function of classification problems, either an exponential or a logarithmic function can be used. With the exponential function, GBDT is equivalent to the AdaBoost algorithm described in the previous section, so here we use the logarithmic function as the cost function:
$$ Cost(y, H(x)) = \log (1 + e^{-2yH(x)}) $$
<center> Equation 2-5 </center>
Following the method of computing the negative gradient introduced in the model section, substitute the cost function and compute $\hat{y}$:
$$ \begin{aligned} \hat{y}_{k, i} &= -\left[\frac{\partial Cost(y_i, H(X_i))}{\partial H(X_i)} \right]_{H(x) = H_{k-1}(x)} & (1)\\ &= \frac{2y_i}{1 + e^{2y_iH_{k-1}(X_i)}} & (2) \end{aligned} $$
<center> Equation 2-6 </center>
After computing $\hat{y}$, we can fit a regression tree estimator h(x). We then need the coefficient $\alpha$ of the estimator in each iteration:
$$ \alpha_k = \underset{\alpha}{argmin} \sum_{i = 1}^{N} \log (1 + e^{-2y_i(H_{k - 1}(X_i) + \alpha h_k(X_i))}) $$
<center> Equation 2-7 </center>
Let's first look at the regression tree estimator h(x). It can be written as the following formula, where the regression tree has J leaf nodes in total, $R_j$ and $\beta_j$ denote the training samples contained in and the value of the jth leaf node respectively, and $I(x)$ is the indicator function mentioned in a previous chapter:
$$ h(x) = \sum_{j = 1}^{J} \beta_j I(x \in R_j) $$
<center> Equation 2-8 </center>
Now substitute Equation 2-8 into Equation 2-7 and rewrite it. Instead of solving for the estimator coefficient $\alpha$, we solve directly for $\gamma$, where $\gamma_{k,j} = \alpha_k \beta_{k,j}$:
$$ \gamma_{k, j} = \underset{\gamma}{argmin} \sum_{X_i \in R_{k, j}}^{} \log (1 + e^{-2y_i(H_{k - 1}(X_i) + \gamma )}) $$
<center> Equation 2-9 </center>
Since Equation 2-9 contains the logarithm of an exponential function, it has no closed-form solution. In this case a second-order Taylor expansion can be used to approximate it, which gives the following result:
$$ \gamma_{k,j} = \frac{\sum_{X_i \in R_{k,j}} \hat{y}_{k,i}}{\sum_{X_i \in R_{k,j}} |\hat{y}_{k,i}| (2 - |\hat{y}_{k,i}|)} $$
<center> Equation 2-10 </center>
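As a made-up numeric illustration of Equation 2-10: if a leaf contains three samples with $\hat{y} = (0.8, -0.4, 1.2)$, then
$$ \gamma = \frac{0.8 - 0.4 + 1.2}{0.8 \cdot (2 - 0.8) + 0.4 \cdot (2 - 0.4) + 1.2 \cdot (2 - 1.2)} = \frac{1.6}{2.56} = 0.625 $$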
After obtaining $\gamma$, update H(x) and start a new iteration.
In this way we obtain the binary classification GBDT algorithm. For the specific steps, see the binary classification part of the algorithm steps in Section 3; for the derivation of the formula, see the proofs in Section 4.
Multi-classification
Multi-classification is more complicated than binary classification. It also uses the logarithmic function as its cost function, together with the Softmax function introduced in the earlier multi-class logistic regression article, and the input label y must additionally be one-hot encoded 4 . The cost function is as follows:
$$ \begin{aligned} Cost(y, H(x)) &= -\sum_{m = 1}^M y_m \log p_m(x) & (1) \\ p_{m}(x) &= \frac{e^{H_{m}(x)}}{\sum_{l=1}^M e^{H_{l}(x)}} & (2) \end{aligned} $$
<center> Equation 2-11 </center>
Similarly, following the method of computing the negative gradient introduced in the model section, substitute the cost function and compute $\hat{y}$:
$$ \begin{aligned} \hat{y}_{k, m, i} &= -\left[\frac{\partial Cost(y_i, H(X_i))}{\partial H(X_i)} \right]_{H(x) = H_{k-1}(x)} & (1)\\ &= y_{m, i} - p_{k-1, m}(X_i) & (2) \end{aligned} $$
<center> Equation 2-12 </center>
As in binary classification, fit a regression tree and convert its leaves to the corresponding $\gamma$. The difference is that one regression tree must be fitted for each class, so multi-classification requires fitting K * M decision regression trees in total:
$$ \gamma_{k, m, j} = \underset{\gamma}{argmin} \sum_{i = 1}^{N} \sum_{m = 1}^{M} Cost\left(y_{m,i}, H_{k - 1, m}(X_i) + \sum_{j = 1}^{J} \gamma I(X_i \in R_{k, m, j})\right) $$
<center> Equation 2-13 </center>
Similarly, a Taylor expansion is used to approximate it, giving the following result:
$$ \gamma_{k, m, j} = \frac{M - 1}{M} \frac{\sum_{X_i \in R_{k, m, j}} \hat{y}_{k, m, i}}{\sum_{X_i \in R_{k, m, j}} |\hat{y}_{k, m, i}| (1 - |\hat{y}_{k, m, i}|)} $$
<center> Equation 2-14 </center>
After obtaining $\gamma$, update the H(x) of the corresponding class and start a new iteration.
In this way we obtain the multi-classification GBDT algorithm. For the specific steps, see the multi-classification part of the algorithm steps in Section 3.
3. Algorithm steps
Regression
Suppose the training set is T = { $X_i$, $y_i$ }, i = 1, ..., N, h(x) is the estimator, and the number of estimators is K.
The steps of the gradient boosting decision tree regression algorithm are as follows:
Initialize $H_0(x)$ to the mean of y, $\bar{y}$
$$ H_0(X_i) = \bar{y} $$
Iterate over the number of estimators K times:
Calculate the residual of the kth round $\hat{y}_{k}$
$$ \hat{y}_{k, i} = y_i - H_{k-1}(X_i) $$
Fit the training set with the kth round $\hat{y}_{k}$ as the label values to obtain the decision regression tree estimator $h_k(x)$
Update $H_k(x)$
$$ H_k(X_i) = H_{k-1}(X_i) + h_k(X_i) $$
end loop
Final prediction strategy:
Input x; the K decision regression tree estimators predict in turn and their outputs are summed, then the initial value $H_0$ is added to obtain the final prediction
$$ H(x) = H_{0} + \sum_{k = 1}^K h_k(x) $$
Binary classification
Suppose the training set is T = { $X_i$, $y_i$ }, i = 1, ..., N, $y_i \in \{ -1, +1 \}$, h(x) is the estimator, and the number of estimators is K.
The steps of the gradient boosting decision tree binary classification algorithm are as follows:
Initialize $H_0(x)$, where $\bar{y}$ is the mean of y
$$ H_0(X_i) = \frac{1}{2} \log \frac{1 + \bar{y}}{1 - \bar{y}} $$
Iterate over the number of estimators K times:
Calculate $\hat{y}_{k}$ for the kth round
$$ \hat{y}_{k,i} = \frac{2y_i}{1 + e^{2y_iH_{k-1}(X_i)}} $$
Fit the training set with the kth round $\hat{y}_{k}$ as the label values to obtain the decision regression tree estimator $h_k(x)$, where $h_k(x)$ contains J leaf nodes
Calculate the coefficient $\gamma$ of the jth leaf node in the kth round, where $R_{k,j}$ denotes the training samples contained in the jth leaf node of the decision regression tree estimator $h_k(x)$ fitted in the kth round
$$ \gamma_{k,j} = \frac{\sum_{X_i \in R_{k,j}} \hat{y}_{k,i}}{\sum_{X_i \in R_{k,j}} |\hat{y}_{k,i}| (2 - |\hat{y}_{k,i}|)} $$
Update $H_k(x)$, where $I(x)$ is the indicator function mentioned in the previous section
$$ H_k(X_i) = H_{k - 1}(X_i) + \sum_{j = 1}^{J} \gamma_{k,j} I(X_i \in R_{k,j}) $$
end loop
Final prediction strategy:
Input x; each of the K decision regression tree estimators determines the leaf node in turn, the coefficients $\gamma$ of those leaf nodes are accumulated, and the initial value $H_0$ is added to obtain H(x)
$$ H(x) = H_0 + \sum_{k = 1}^{K}\sum_{j = 1}^{J} \gamma_{k,j} I(x \in R_{k,j}) $$
Calculate the probabilities of the positive and negative classes separately
$$ \left\{ \begin{aligned} p_+(x) &= \frac{1}{1 + e^{-2H(x)}} \\ p_-(x) &= \frac{1}{1 + e^{2H(x)}} \end{aligned} \right. $$
Take the class with the larger probability as the final classification result
$$ \underset{m}{argmax} \space p_m(x) \quad (m \in \{+ , -\}) $$
Multi-classification
Suppose the training set is T = { $X_i$, $y_i$ }, i = 1, ..., N, y has M possible values, h(x) is the estimator, and the number of estimators is K.
The steps of the gradient boosting decision tree multi-classification algorithm are as follows:
One-hot encode the label values y in the training set
Initialize $H_{0,m}(x)$, where m denotes the mth class; the original paper initializes it to 0, while the scikit-learn implementation uses the prior probability of each class
$$ H_{0, m}(X_i) = \frac{\sum_{i=1}^{N} I(y_i = m)}{N} \quad \text{or} \quad 0 $$
Iterate over the number of estimators K times:
Traverse the number of categories M times:
Calculate the probability p(x) of the mth class for round k - 1
$$ p_{k-1, m}(X_i) = \frac{e^{H_{k-1, m}(X_i)}}{\sum_{l=1}^M e^{H_{k-1, l}(X_i)}} $$
Calculate the $\hat{y}_{k, m}$ of the mth class in the kth round
$$ \hat{y}_{k, m, i} = y_{m, i} - p_{k-1, m}(X_i) $$
Fit the training set with the kth round $\hat{y}_{k, m}$ of the mth class as the label values to obtain the decision regression tree estimator $h_{k,m}(x)$, where $h_{k,m}(x)$ contains J leaf nodes
Calculate the coefficient $\gamma$ of the jth leaf node of the mth class in the kth round, where $R_{k,m,j}$ denotes the training samples contained in the jth leaf node of the decision regression tree estimator $h_{k,m}(x)$ fitted for the mth class in the kth round
$$ \gamma_{k, m, j} = \frac{M - 1}{M} \frac{\sum_{X_i \in R_{k, m, j}} \hat{y}_{k, m, i}}{\sum_{X_i \in R_{k, m, j}} |\hat{y}_{k, m, i}| (1 - |\hat{y}_{k, m, i}|)} $$
Update $H_{k,m}(x)$, where $I(x)$ is the indicator function mentioned in the previous section
$$ H_{k, m}(X_i) = H_{k - 1, m}(X_i) + \sum_{j = 1}^{J} \gamma_{k, m, j} I(X_i \in R_{k, m, j}) $$
end loop
end loop
Final prediction strategy:
Input x; for the mth class, each of the K decision regression tree estimators determines the leaf node in turn, the coefficients $\gamma$ of those leaf nodes are accumulated, and the initial value $H_{0,m}$ of the mth class is added to obtain $H_m(x)$
$$ H_{m}(x) = H_{0,m} + \sum_{k = 1}^{K} \sum_{j = 1}^{J} \gamma_{k, m, j} I(x \in R_{k, m, j}) $$
Calculate the probability of each class in turn
$$ p_m(x) = \frac{e^{H_m(x)}}{\sum_{l = 1}^M e^{H_l(x)}} $$
Take the class with the largest probability as the final classification result
$$ \underset{m}{argmax} \space p_m(x) \quad (m = 1,2,\dots,M) $$
4. Proof of Principle
Initial value for the regression problem
For least-squares regression with the squared error as the cost function, the initial value is the mean of y:
(1) Cost function of the regression problem
(2) Take the derivative of the cost function and set it to zero
(3) The initial value is the mean of y
$$ \begin{aligned} Cost(H(x)) &= \sum_{i = 1}^{N} (y_i - H(x))^2 & (1) \\ \frac{\partial Cost(H(x))}{\partial H(x)} &= 2\sum_{i = 1}^{N} (H(x) - y_i) = 0 & (2) \\ H(x) &= \frac{\sum_{i = 1}^{N} y_i}{N} = \bar{y} & (3) \\ \end{aligned} $$
<center> Equation 4-1 </center>
Q.E.D.
Initial value for the binary classification problem
For a binary classification problem, $y \in \{ -1, +1 \}$:
(1) The number of samples with $y = +1$
(2) The number of samples with $y = -1$
(3) The two counts add up to N
$$ \begin{aligned} n_{+} &= \sum_{i = 1}^N I(y_i = +1) & (1) \\ n_{-} &= \sum_{i = 1}^N I(y_i = -1) & (2) \\ N &= n_{+} + n_{-} & (3) \\ \end{aligned} $$
<center> Equation 4-2 </center>
(1) The expression for the mean of y
(2) Simplify
$$ \begin{aligned} \bar{y} &= \frac{1 * n_{+} + (-1) * n_{-}}{N} & (1) \\ &= \frac{n_{+} - n_{-}}{N} & (2) \\ \end{aligned} $$
<center> Equation 4-3 </center>
From equation (3) in Equation 4-2 and equation (2) in Equation 4-3, the following results can be obtained:
$$ \begin{aligned} n_{+} &= \frac{N(1 + \bar{y})}{2} & (1) \\ n_{-} &= \frac{N(1 - \bar{y})}{2} & (2) \end{aligned} $$
<center> Equation 4-4 </center>
(1) Cost function of the binary classification problem
(2) Take the derivative of the cost function
(3) Split the result of (2) into two sums
(4) Substitute the results of Equation 4-4 and remove the summation signs
(5) Simplify and set it to zero
(6) Solve for the initial value
$$ \begin{aligned} Cost(H(x)) & = \sum_{i = 1}^N \log (1 + e^{-2y_iH(x)}) & (1) \\ \frac{\partial Cost(H(x))}{\partial H(x)} &= -\sum_{i = 1}^N \frac{2y_i}{1 + e^{2y_iH(x)}} & (2) \\ &= -\sum_{y_i = +1} \frac{2y_i}{1 + e^{2y_iH(x)}} - \sum_{y_i = -1} \frac{2y_i}{1 + e^{2y_iH(x)}} & (3) \\ &= - \frac{N(1 + \bar{y})}{2} * \frac{2}{1 + e^{2H(x)}} - \frac{N(1 - \bar{y})}{2} * \frac{-2}{1 + e^{-2H(x)}} & (4) \\ &= - \frac{N(1 + \bar{y})}{1 + e^{2H(x)}} + \frac{N(1 - \bar{y})}{1 + e^{-2H(x)}} = 0 & (5) \\ H(x) &= \frac{1}{2} \log \frac{1 + \bar{y}}{1 - \bar{y}} & (6) \end{aligned} $$
<center> Equation 4-5 </center>
Q.E.D.
Coefficient $\gamma$ for the binary classification problem
Let's first look at the optimization objective for $\gamma$:
(1) Cost function of the binary classification problem
(2) The optimization objective for $\gamma$ obtained from Equation 2-9
(3) Second-order Taylor expansion of equation (2) around $H_{k - 1}(x)$
(4) Note that $H(x) - H_{k - 1}(x)$ equals $\gamma$
$$ \begin{aligned} Cost(H(x)) &= \sum_{X_i \in R_{k, j}} \log (1 + e^{-2y_iH(X_i)}) & (1) \\ \gamma_{k, j} &= \underset{\gamma}{argmin} \sum_{X_i \in R_{k, j}} \log (1 + e^{-2y_i(H_{k - 1}(X_i) + \gamma )}) & (2)\\ &= \underset{\gamma}{argmin} \sum_{X_i \in R_{k, j}} Cost(H_{k - 1}(X_i)) + Cost^{'}(H_{k - 1}(X_i)) (H(X_i) - H_{k - 1}(X_i)) + \frac{1}{2} Cost^{''}(H_{k - 1}(X_i)) (H(X_i) - H_{k - 1}(X_i))^2 & (3) \\ &= \underset{\gamma}{argmin} \sum_{X_i \in R_{k, j}} Cost(H_{k - 1}(X_i)) + Cost^{'}(H_{k - 1}(X_i)) \gamma + \frac{1}{2} Cost^{''}(H_{k - 1}(X_i)) \gamma^2 & (4) \end{aligned} $$
<center> Equation 4-6 </center>
Solve for its approximation:
(1) The Taylor expansion approximation of $\gamma$ obtained from Equation 4-6
(2) Take the derivative of the function and set it to zero
(3) Obtain the expression for $\gamma$
$$ \begin{aligned} \phi (\gamma) &= \sum_{X_i \in R_{k, j}} Cost(H_{k - 1}(X_i)) + Cost^{'}(H_{k - 1}(X_i)) \gamma + \frac{1}{2} Cost^{''}(H_{k - 1}(X_i)) \gamma^2 & (1) \\ \frac{\partial \phi (\gamma)}{\partial \gamma} &= \sum_{X_i \in R_{k, j}} Cost^{'}(H_{k - 1}(X_i)) + Cost^{''}(H_{k - 1}(X_i)) \gamma = 0 & (2) \\ \gamma &= -\frac{\sum_{X_i \in R_{k, j}} Cost^{'}(H_{k - 1}(X_i))}{\sum_{X_i \in R_{k, j}} Cost^{''}(H_{k - 1}(X_i))} & (3) \\ \end{aligned} $$
<center> Equation 4-7 </center>
The first and second derivatives of the cost function are calculated as follows:
$$ \begin{aligned} Cost^{'}(H_{k - 1}(X_i)) &= -\frac{2y_i}{1 + e^{2y_iH_{k-1}(X_i)}} = -\hat{y}_i & (1) \\ Cost^{''}(H_{k - 1}(X_i)) &= \frac{4y_i^2e^{2y_iH_{k-1}(X_i)}}{(1 + e^{2y_iH_{k-1}(X_i)})^2} = 2\hat{y}_iy_i - \hat{y}_i^2 & (2) \\ \end{aligned} $$
<center> Equation 4-8 </center>
(1) The expression for $\gamma$ obtained in Equation 4-7
(2) Substitute Equation 4-8
(3) When $y = +1$, $\hat{y} \in (0, 2)$; when $y = -1$, $\hat{y} \in (-2, 0)$, so $\hat{y}_i y_i = |\hat{y}_i|$
(4) Factor $|\hat{y}_i|$ out of the denominator
$$ \begin{aligned} \gamma &= -\frac{\sum_{X_i \in R_{k, j}} Cost^{'}(H_{k - 1}(X_i))}{\sum_{X_i \in R_{k, j}} Cost^{''}(H_{k - 1}(X_i))} & (1) \\ &= \frac{\sum_{X_i \in R_{k, j}} \hat{y}_i}{\sum_{X_i \in R_{k, j}} (2\hat{y}_iy_i - \hat{y}_i^2)} & (2) \\ &= \frac{\sum_{X_i \in R_{k, j}} \hat{y}_i}{\sum_{X_i \in R_{k, j}} (2|\hat{y}_i| - \hat{y}_i^2)} & (3) \\ &= \frac{\sum_{X_i \in R_{k, j}} \hat{y}_{i}}{\sum_{X_i \in R_{k,j}} |\hat{y}_{i}| (2 - |\hat{y}_{i}|)} & (4) \\ \end{aligned} $$
<center> Equation 4-9 </center>
Q.E.D.
Coefficient $\gamma$ for the multi-classification problem
The coefficient $\gamma$ for the multi-classification problem involves the Hessian matrix because the trees of the different classes are coupled, so it cannot be solved for separately as in the binary classification case. The author's ability here is limited; if you know how to derive the result, please leave a comment or send a private message.
5. Regularization
The gradient boosting tree also needs regularization to prevent overfitting. The common regularization methods are as follows:
Learning rate
A learning-rate hyperparameter $\eta$ is added when H(x) is updated in each iteration. The following formulas show how the learning rate $\eta$ is used in the different problems:
$$ \begin{aligned} H_k(x) &= H_{k - 1}(x) + \eta \alpha_k h_k(x) & (1) \\ H_k(x) &= H_{k - 1}(x) + \eta \sum_{j = 1}^{J} \gamma_{k,j} I(x \in R_{k,j}) & (2) \\ H_{k, m}(x) &= H_{k - 1, m}(x) + \eta \sum_{j = 1}^{J} \gamma_{k, m, j} I(x \in R_{k, m, j}) & (3) \\ \end{aligned} $$
<center> Equation 5-1 </center>
The learning rate $\eta \in (0, 1]$. When $\eta$ is too small, more iterations are needed to achieve a good fit, so this hyperparameter has to be considered together with the number of estimators. In scikit-learn it is controlled by the learning_rate parameter.
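As a quick illustration of this trade-off with scikit-learn (the numbers below are arbitrary and only for comparison):

from sklearn.ensemble import GradientBoostingRegressor

# A larger step size usually needs fewer trees ...
fast = GradientBoostingRegressor(learning_rate = 0.5, n_estimators = 50)
# ... while a smaller step size usually needs more trees to reach a comparable fit
slow = GradientBoostingRegressor(learning_rate = 0.05, n_estimators = 500)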
Subsampling
Subsampling is similar to stochastic gradient descent: only a fraction of the training set is used for learning in each round, which reduces the variance but also increases the bias. In scikit-learn it is controlled by the subsample parameter, a number greater than 0 and less than or equal to 1.
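A minimal sketch of the idea, not taken from the article's code: one boosting round with squared error in which only a random fraction of the rows is used to grow the tree, assuming X, y, H are NumPy arrays and subsample is a fraction in (0, 1]:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_one_round(X, y, H, subsample = 0.5, learning_rate = 0.1):
    n = X.shape[0]
    # Draw a random subset of the rows (without replacement) for this round
    idx = np.random.choice(n, size = int(n * subsample), replace = False)
    # Residuals for the squared-error cost
    y_hat = y - H
    estimator = DecisionTreeRegressor(max_depth = 3)
    # Grow the tree only on the subsample
    estimator.fit(X[idx], y_hat[idx])
    # Update the predictions on all rows
    return H + learning_rate * estimator.predict(X), estimator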
Decision tree pruning
Pruning the decision trees follows the same methods introduced in the earlier decision tree section; regularization is achieved by constraining the base estimators. In scikit-learn this is controlled by the decision-tree-related parameters.
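For example, in scikit-learn the complexity of every base tree can be limited through the tree-related parameters of the ensemble itself; the parameter names below are real GradientBoostingClassifier parameters, while the values are arbitrary:

from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(
    max_depth = 3,           # limit the depth of each tree
    min_samples_split = 10,  # minimum number of samples required to split a node
    min_samples_leaf = 5,    # minimum number of samples required in a leaf
    max_leaf_nodes = 8,      # cap the number of leaves per tree
)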
The code implementation below only applies regularization through the learning rate; for the other methods, refer to the scikit-learn source code.
6. Code implementation
Use Python to implement the gradient boosting tree regression algorithm:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
class gbdtr:
    """
    Gradient boosting tree regression algorithm
    """

    def __init__(self, n_estimators = 100, learning_rate = 0.1):
        # Number of weak learners in the gradient boosting tree
        self.n_estimators = n_estimators
        # Learning rate
        self.learning_rate = learning_rate

    def fit(self, X, y):
        """
        Fit the gradient boosting tree regression algorithm
        """
        # Initialize H0
        self.H0 = np.average(y)
        # Initialize the predictions
        H = np.ones(X.shape[0]) * self.H0
        # Array of estimators
        estimators = []
        # Iterate n_estimators times
        for k in range(self.n_estimators):
            # Compute the residuals y_hat
            y_hat = y - H
            # Initialize the decision regression tree estimator
            estimator = DecisionTreeRegressor(max_depth = 3)
            # Fit the training set with y_hat as the label values
            estimator.fit(X, y_hat)
            # Predictions of the regression tree
            y_predict = estimator.predict(X)
            # Update the predictions
            H += self.learning_rate * y_predict
            estimators.append(estimator)
        self.estimators = np.array(estimators)

    def predict(self, X):
        """
        Predict with the gradient boosting tree regression algorithm
        """
        # Initialize the predictions
        H = np.ones(X.shape[0]) * self.H0
        # Iterate over the estimators
        for k in range(self.n_estimators):
            estimator = self.estimators[k]
            y_predict = estimator.predict(X)
            # Accumulate the predictions
            H += self.learning_rate * y_predict
        return H
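As a quick sanity check, the class above can be exercised on a small synthetic dataset; the data below is made up purely for illustration:

import numpy as np

np.random.seed(0)
X = np.random.rand(200, 2)
y = 3 * X[:, 0] - 2 * X[:, 1] + np.random.randn(200) * 0.1

reg = gbdtr(n_estimators = 100, learning_rate = 0.1)
reg.fit(X, y)
# Predictions for the first five training samples
print(reg.predict(X[:5]))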
Use Python to implement the gradient boosting tree binary classification algorithm:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
class gbdtc:
    """
    Gradient boosting tree binary classification algorithm
    """

    def __init__(self, n_estimators = 100, learning_rate = 0.1):
        # Number of weak learners in the gradient boosting tree
        self.n_estimators = n_estimators
        # Learning rate
        self.learning_rate = learning_rate

    def fit(self, X, y):
        """
        Fit the gradient boosting tree binary classification algorithm
        """
        # Label classes
        self.y_classes = np.unique(y)
        # Number of label classes
        self.n_classes = len(self.y_classes)
        # Mean of the labels
        y_avg = np.average(y)
        # Initialize H0
        self.H0 = np.log((1 + y_avg) / (1 - y_avg)) / 2
        # Initialize the predictions
        H = np.ones(X.shape[0]) * self.H0
        # Array of estimators
        estimators = []
        # Array of leaf node values
        gammas = []
        for k in range(self.n_estimators):
            # Compute y_hat
            y_hat = 2 * np.multiply(y, 1 / (1 + np.exp(2 * np.multiply(y, H))))
            # Initialize the decision regression tree estimator
            estimator = DecisionTreeRegressor(max_depth = 3, criterion="friedman_mse")
            # Fit the training set with y_hat as the label values
            estimator.fit(X, y_hat)
            # Leaf node of the current regression tree for each training sample
            leaf_ids = estimator.apply(X)
            # Indices of the training samples contained in each leaf node
            node_ids_dict = self.get_leaf_nodes(leaf_ids)
            # Dictionary of leaf node values
            gamma_dict = {}
            # Compute the leaf node values
            for leaf_id, node_ids in node_ids_dict.items():
                # y_hat contained in the current leaf node
                y_hat_sub = y_hat[node_ids]
                y_hat_sub_abs = np.abs(y_hat_sub)
                # Compute the leaf node value
                gamma = np.sum(y_hat_sub) / np.sum(np.multiply(y_hat_sub_abs, 2 - y_hat_sub_abs))
                gamma_dict[leaf_id] = gamma
                # Update the predictions
                H[node_ids] += self.learning_rate * gamma
            estimators.append(estimator)
            gammas.append(gamma_dict)
        self.estimators = estimators
        self.gammas = gammas

    def predict(self, X):
        """
        Predict with the gradient boosting tree binary classification algorithm
        """
        # Initialize the predictions
        H = np.ones(X.shape[0]) * self.H0
        # Iterate over the estimators
        for k in range(self.n_estimators):
            estimator = self.estimators[k]
            # Leaf node of the current regression tree for each sample
            leaf_ids = estimator.apply(X)
            # Indices of the samples contained in each leaf node
            node_ids_dict = self.get_leaf_nodes(leaf_ids)
            # Dictionary of leaf node values
            gamma_dict = self.gammas[k]
            # Accumulate the predictions
            for leaf_id, node_ids in node_ids_dict.items():
                gamma = gamma_dict[leaf_id]
                H[node_ids] += self.learning_rate * gamma
        # Compute the probabilities of the two classes
        probs = np.zeros((X.shape[0], self.n_classes))
        probs[:, 0] = 1 / (1 + np.exp(2 * H))
        probs[:, 1] = 1 / (1 + np.exp(-2 * H))
        return self.y_classes.take(np.argmax(probs, axis=1), axis = 0)

    def get_leaf_nodes(self, leaf_ids):
        """
        Indices of the samples contained in each leaf node
        """
        node_ids_dict = {}
        for j in range(len(leaf_ids)):
            leaf_id = leaf_ids[j]
            node_ids = node_ids_dict.setdefault(leaf_id, [])
            node_ids.append(j)
        return node_ids_dict
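A minimal usage sketch (not from the original article). Note that fit assumes labels in {-1, +1}, since it computes $H_0 = \frac{1}{2}\log\frac{1 + \bar{y}}{1 - \bar{y}}$ from the label mean; the synthetic data below is made up for illustration:

import numpy as np

np.random.seed(0)
X = np.random.rand(200, 2)
# Labels must be -1 / +1
y = np.where(X[:, 0] + X[:, 1] > 1, 1, -1)

clf = gbdtc(n_estimators = 100, learning_rate = 0.1)
clf.fit(X, y)
print(clf.predict(X[:5]))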
Use Python to implement the gradient boosting tree multi-classification algorithm:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
class gbdtmc:
    """
    Gradient boosting tree multi-classification algorithm
    """

    def __init__(self, n_estimators = 100, learning_rate = 0.1):
        # Number of weak learners in the gradient boosting tree
        self.n_estimators = n_estimators
        # Learning rate
        self.learning_rate = learning_rate

    def fit(self, X, y):
        """
        Fit the gradient boosting tree multi-classification algorithm
        """
        # Label classes and the number of samples of each class
        self.y_classes, y_counts = np.unique(y, return_counts = True)
        # Number of label classes
        self.n_classes = len(self.y_classes)
        # One-hot encode the labels
        y_onehot = np.zeros((y.size, y.max() + 1))
        y_onehot[np.arange(y.size), y] = 1
        # Initialize H0
        self.H0 = y_counts / X.shape[0]
        # Initialize the predictions
        H = np.ones((X.shape[0], 1)).dot(self.H0.reshape(1, -1))
        # Array of estimators
        estimators = []
        # Array of leaf node values
        gammas = []
        # Iterate n_estimators times
        for k in range(self.n_estimators):
            H_exp = np.exp(H)
            H_exp_sum = np.sum(H_exp, axis = 1)
            # Estimators of the current round
            sub_estimators = []
            # Leaf node values of the current round
            sub_gammas = []
            # Iterate n_classes times
            for m in range(self.n_classes):
                p_m = H_exp[:, m] / H_exp_sum
                # Compute the mth y_hat
                y_hat_m = y_onehot[:, m] - p_m
                # Initialize the decision regression tree estimator
                estimator = DecisionTreeRegressor(max_depth = 3, criterion="friedman_mse")
                # Fit the training set with the mth y_hat as the label values
                estimator.fit(X, y_hat_m)
                # Leaf node of the current regression tree for each training sample
                leaf_ids = estimator.apply(X)
                # Indices of the training samples contained in each leaf node
                node_ids_dict = self.get_leaf_nodes(leaf_ids)
                # Dictionary of leaf node values
                gamma_dict = {}
                # Compute the leaf node values
                for leaf_id, node_ids in node_ids_dict.items():
                    # y_hat contained in the current leaf node
                    y_hat_sub = y_hat_m[node_ids]
                    y_hat_sub_abs = np.abs(y_hat_sub)
                    # Compute the leaf node value
                    gamma = np.sum(y_hat_sub) / np.sum(np.multiply(y_hat_sub_abs, 1 - y_hat_sub_abs)) * (self.n_classes - 1) / self.n_classes
                    gamma_dict[leaf_id] = gamma
                    # Update the predictions
                    H[node_ids, m] += self.learning_rate * gamma
                sub_estimators.append(estimator)
                sub_gammas.append(gamma_dict)
            estimators.append(sub_estimators)
            gammas.append(sub_gammas)
        self.estimators = estimators
        self.gammas = gammas

    def predict(self, X):
        """
        Predict with the gradient boosting tree multi-classification algorithm
        """
        # Initialize the predictions
        H = np.ones((X.shape[0], 1)).dot(self.H0.reshape(1, -1))
        # Iterate over the estimators
        for k in range(self.n_estimators):
            sub_estimators = self.estimators[k]
            sub_gammas = self.gammas[k]
            # Iterate over the classes
            for m in range(self.n_classes):
                estimator = sub_estimators[m]
                # Leaf node of the current regression tree for each sample
                leaf_ids = estimator.apply(X)
                # Indices of the samples contained in each leaf node
                node_ids_dict = self.get_leaf_nodes(leaf_ids)
                # Dictionary of leaf node values
                gamma_dict = sub_gammas[m]
                # Accumulate the predictions
                for leaf_id, node_ids in node_ids_dict.items():
                    gamma = gamma_dict[leaf_id]
                    H[node_ids, m] += self.learning_rate * gamma
        H_exp = np.exp(H)
        H_exp_sum = np.sum(H_exp, axis = 1)
        # Compute the probability of each class
        probs = np.zeros((X.shape[0], self.n_classes))
        for m in range(self.n_classes):
            probs[:, m] = H_exp[:, m] / H_exp_sum
        return self.y_classes.take(np.argmax(probs, axis=1), axis = 0)

    def get_leaf_nodes(self, leaf_ids):
        """
        Indices of the samples contained in each leaf node
        """
        node_ids_dict = {}
        for j in range(len(leaf_ids)):
            leaf_id = leaf_ids[j]
            node_ids = node_ids_dict.setdefault(leaf_id, [])
            node_ids.append(j)
        return node_ids_dict
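A minimal usage sketch (not from the original article). Because the one-hot step allocates y.max() + 1 columns and indexes them with y, the labels are assumed to be integers 0, 1, ..., M - 1; the synthetic data below is made up for illustration:

import numpy as np

np.random.seed(0)
X = np.random.rand(300, 2)
# Integer labels 0, 1, 2
y = (X[:, 0] * 3).astype(int)

clf = gbdtmc(n_estimators = 100, learning_rate = 0.1)
clf.fit(X, y)
print(clf.predict(X[:5]))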
7. Third-party library implementation
Classification implementation using scikit-learn 5 :
from sklearn.ensemble import GradientBoostingClassifier
# Gradient boosting tree classifier
clf = GradientBoostingClassifier(n_estimators = 100)
# Fit the dataset
clf = clf.fit(X, y)
Regression implementation using scikit-learn 6 :
from sklearn.ensemble import GradientBoostingRegressor
# Gradient boosting tree regressor
reg = GradientBoostingRegressor(n_estimators = 100, max_depth = 3, random_state = 0, loss = 'ls')
# Fit the dataset
reg = reg.fit(X, y)
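For completeness, a minimal end-to-end sketch is shown below; the synthetic dataset from sklearn.datasets.make_regression is not part of the original example and is used only for illustration:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples = 200, n_features = 4, noise = 0.1, random_state = 0)
reg = GradientBoostingRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 3, random_state = 0)
reg.fit(X, y)
# Predictions for the first three samples and the R^2 score on the training data
print(reg.predict(X[:3]))
print(reg.score(X, y))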
8. Example demonstration
The following three figures show the results of using the gradient boosting algorithm for binary classification, multi-classification, and regression respectively:
<center> Figure 8-1 </center>
<center> Figure 8-2 </center>
<center> Figure 8-3 </center>
9. Mind map
<center> Figure 9-1 </center>
10. References
1. https://en.wikipedia.org/wiki/Gradient_boosting
2. https://www.cse.cuhk.edu.hk/irwin.king/_media/presentations/2001_greedy_function_approximation_a_gradient_boosting_machine.pdf
3. https://en.wikipedia.org/wiki/Huber_loss
4. https://en.wikipedia.org/wiki/One-hot
5. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
6. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
For full demo click here
Note: This article strives to be accurate and easy to understand, but since the author is also a beginner with limited knowledge, if there are errors or omissions in the text, readers are kindly asked to point them out by leaving a comment.
This article was first published on AI Map; you are welcome to follow it.