Prerequisites for reading this article: ensemble learning, the Lagrange multiplier method, and a little programming knowledge
1. Introduction
In the previous section we learned about the random forest algorithm and discussed one family of ensemble learning methods, the Bagging algorithm. In this section we will study another family of ensemble learning methods, the Boosting algorithm[1], and introduce its most common representative, the Adaptive Boosting algorithm (AdaBoost)[2].
2. Introduction to the model
Boosting algorithm
The Boosting algorithm is another kind of ensemble learning. Unlike the Bagging algorithm, each round of training pays more attention to the samples that the previously trained estimators classified or regressed poorly; that is, the sample weights are adjusted after every round according to the results of the previous round, until the error of the combined output falls below a preset threshold.
Figure 2-1
Figure 2-1 shows the overall flow of the Boosting algorithm, which differs from the Bagging algorithm in two ways. First, the estimators in Bagging are trained relatively independently and carry equal weight, while each estimator in Boosting depends on the previous ones and the estimators carry different weights. Second, broadly speaking, Bagging mainly reduces variance, while Boosting mainly reduces bias.
The most representative Boosting method is the Adaptive Boosting algorithm (AdaBoost).
AdaBoost algorithm
The AdaBoost algorithm was proposed by Yoav Freund and Robert E. Schapire in 1995. They also proposed the AdaBoost.M1 and AdaBoost.M2 algorithms for multi-class problems and the AdaBoost.R algorithm for regression problems. Later, other researchers proposed variants of these algorithms: AdaBoost-SAMME, AdaBoost-SAMME.R, and AdaBoost.R2.
AdaBoost follows the same basic steps as the general Boosting framework; it is a concrete implementation of Boosting that specifies how the sample weights are updated in each round and how the estimators are combined at the end.
Due to space limitations, this article only covers the basic AdaBoost algorithm and the AdaBoost-SAMME, AdaBoost-SAMME.R, and AdaBoost.R2 algorithms implemented in scikit-learn; the other variants are not introduced one by one. Interested readers can refer to the original papers listed at the end of the article.
3. Algorithm steps
The execution steps of each algorithm are given first; the derivations of the formulas that appear in these steps are explained one by one in the next section.
Binary classification
Suppose the training set is T = { (X_i, y_i) }, i = 1, ..., N, where y_i takes values in {-1, +1}, h(x) is a weak estimator, and the number of estimators is K.
The steps of the AdaBoost algorithm are as follows:
<hr/>
Initialize the sample weight vector ω_1
$$ \begin{aligned} \omega_{1,i} &= \frac{1}{N} \quad i = 1,...,N \end{aligned} $$
Loop over the K estimators (k = 1, ..., K):
Train estimator h_k(x) with sample weights ω_k
Calculate the error rate e_k of round k
$$ \begin{aligned} e_k &= \sum_{i = 1}^{N}\omega_{k,i} I(y_i \ne h_k(X_i)) \end{aligned} $$
If the error rate e_k is greater than 0.5
break loop
Calculate the k-th estimator weight α_k (a short numerical example is given after this algorithm box)
$$ \begin{aligned} \alpha_k &= \frac{1}{2} \ln \frac{1 - e_k}{e_k}\\ \end{aligned} $$
Calculate the weight vector ω_{k+1} for round k+1
$$ \begin{aligned} \omega_{k+1,i} &= \frac{\omega_{k,i} e^{-y_i\alpha_kh_k(X_i)}}{\sum_{j = 1}^N \left(\omega_{k,j} e^{-y_j\alpha_kh_k(X_j)}\right) } \end{aligned} $$
end loop
The final combination strategy applies the sign function to the weighted sum of the estimators to obtain the final strong estimator:
$$ \begin{aligned} H(x) &= \operatorname{sign} \left(\sum_{i = 1}^{K} \alpha_i h_i(x)\right) \end{aligned} $$
<hr/>
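As a quick numerical illustration of the formulas above (this example is my own, not part of the original algorithm description): suppose the k-th estimator attains a weighted error rate e_k = 0.3. Then

$$ \alpha_k = \frac{1}{2} \ln \frac{1 - 0.3}{0.3} \approx 0.42 $$

so, before normalization, the weight of every misclassified sample is multiplied by about 1.53 (= e^{0.42}) while the weight of every correctly classified sample is multiplied by about 0.65 (= e^{-0.42}), which is exactly how the next round is made to focus on the hard samples.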
Multi-class classification
Suppose the training set is T = { (X_i, y_i) }, i = 1, ..., N, where y can take M possible values, h(x) is a weak estimator, and the number of estimators is K.
The steps of the AdaBoost-SAMME algorithm are as follows:
<hr/>
Initialize the sample weight vector ω_1
$$ \begin{aligned} \omega_{1,i} &= \frac{1}{N} \quad i = 1,...,N \end{aligned} $$
Loop over the K estimators (k = 1, ..., K):
Train estimator h_k(x) with sample weights ω_k
Calculate the error rate e_k of round k
$$ \begin{aligned} e_k &= \sum_{i = 1}^{N}\omega_{k,i} I(y_i \ne h_k(X_i)) \end{aligned} $$
Calculate the k-th estimator weight α_k
$$ \begin{aligned} \alpha_k &= \ln \frac{1 - e_k}{e_k} + \ln (M - 1) \\ \end{aligned} $$
Calculate the weight vector ω_{k+1} for round k+1
$$ \begin{aligned} \bar{\omega_{k+1,i}} &= \omega_{k,i}e^{\alpha_kI(y_i \ne h_k(X_i))} \end{aligned} $$
Normalize the weight vector ω_{k+1}
$$ \begin{aligned} \omega_{k+1,i} &= \frac{\bar{\omega_{k + 1,i}}}{\sum_{j = 1}^N \bar{\omega_{k + 1,j}} } \end{aligned} $$
end loop
The final combination strategy sums the estimator weights for each class and takes the class with the largest weighted vote as the output of the strong estimator:
$$ \begin{aligned} H(x) &= \underset{m}{\operatorname{argmax}} \left( \sum_{i = 1}^{K} \alpha_i I(h_i(x) = m) \right) \end{aligned} $$
<hr/>
The steps of the AdaBoost-SAMME.R algorithm are as follows:
<hr/>
Initialize the sample weight vector ω_1
$$ \begin{aligned} \omega_{1,i} &= \frac{1}{N} \quad i = 1,...,N \end{aligned} $$
Loop over the K estimators (k = 1, ..., K):
Train estimator h_k(x) with sample weights ω_k and calculate the weighted class probability estimate vector p_k(x)
$$ \begin{aligned} p_k^m(x) = P(y = m \mid x) \end{aligned} $$
Calculate the weight vector ω_{k+1} for round k+1
$$ \hat{y}^m = \left\{ \begin{array}{cl} 1 & y = m\\ -\frac{1}{M-1} & y \ne m \end{array}\right. \quad m = 1,\dots,M $$
$$ \begin{aligned} \bar{\omega_{k+1,i}} &= \omega_{k,i}e^{-\frac{M-1}{M} \hat{y_i}^T \ln p_k(x) } \end{aligned} $$
Normalize the weight vector ω_{k+1}
$$ \begin{aligned} \omega_{k+1,i} &= \frac{\bar{\omega_{k + 1,i}}}{\sum_{j = 1}^N \bar{\omega_{k + 1,j}} } \end{aligned} $$
end loop
The final combination strategy computes h_k(x) from the probability estimates, sums these over all rounds, and takes the class with the largest value as the output of the strong estimator:
$$ \begin{aligned} h_k(x) &= (M - 1) \left( \ln p_k^m(x) - \frac{1}{M} \sum_{i = 1}^{M} \ln p_k^i(x) \right) \\ H(x) &= \underset{m}{\operatorname{argmax}} \left( \sum_{i = 1}^{K} h_i(x)\right) \end{aligned} $$
<hr/>
Regression
Suppose the training set is T = { (X_i, y_i) }, i = 1, ..., N, h(x) is a weak estimator, and the number of estimators is K.
The steps of the AdaBoost.R2 algorithm are as follows:
<hr/>
Initialize the sample weight vector ω_1
$$ \begin{aligned} \omega_{1,i} &= \frac{1}{N} \quad i = 1,...,N \end{aligned} $$
Loop over the K estimators (k = 1, ..., K):
Train estimator h(x) with sample weight ω_k
Calculate the maximum error E_k
$$ \begin{aligned} E_k &= \max \mid y_i - h_k(X_i) \mid \end{aligned} $$
Calculate the error rate e_k of round k
$$ \begin{aligned} e_{k,i} &= \frac{\mid y_i - h_k(X_i) \mid}{E_k} & \text{(linear error)} \\ e_{k,i} &= \frac{\left( y_i - h_k(X_i) \right)^2}{E_k^2} & \text{(square error)} \\ e_{k,i} &= 1 - e^{-\frac{\mid y_i - h_k(X_i) \mid}{E_k} } & \text{(exponential error)} \\ e_k & = \sum_{i = 1}^{N}\omega_{k,i} e_{k,i} \end{aligned} $$
If the error rate e_k is greater than 0.5
break loop
Calculate the k-th estimator weight α_k
$$ \begin{aligned} \alpha_k &= \frac{e_k}{1 - e_k} \end{aligned} $$
Calculate the weight vector ω_{k+1} for round k+1
$$ \begin{aligned} \bar{\omega_{k+1,i}} &= \omega_{k,i}\alpha_k^{1 - e_{k,i}} \end{aligned} $$
Normalize the weight vector ω_{k+1}
$$ \begin{aligned} \omega_{k+1,i} &= \frac{\bar{\omega_{k + 1,i}}}{\sum_{j = 1}^N \bar{\omega_{k + 1,j}} } \end{aligned} $$
end loop
The final combination strategy takes the weighted median of the estimators' predictions (weighted by ln(1/α_i)) as the output of the strong estimator (a small code sketch of this step follows this algorithm box):
$$ \begin{aligned} H(x) &= \inf \left\{ y \in A: \sum_{h_i(x) \le y } \ln \left(\frac{1}{\alpha_i}\right) \ge \frac{1}{2} \sum_{i = 1}^{K} \ln \left(\frac{1}{\alpha_i}\right) \right\} \end{aligned} $$
<hr/>
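To make the weighted-median combination more concrete, here is a minimal NumPy sketch of this step for a single sample. It assumes `predictions` holds the K estimators' outputs h_i(x) and `alphas` holds the weights α_i computed above; the function and variable names are my own illustration, not part of the original algorithm description:

```python
import numpy as np

def weighted_median_predict(predictions, alphas):
    """Combine K regressor outputs for one sample via the weighted median.

    predictions: array of shape (K,), the K estimators' predictions h_i(x)
    alphas: array of shape (K,), the estimator weights alpha_i (each < 1)
    """
    log_weights = np.log(1.0 / alphas)           # ln(1 / alpha_i)
    order = np.argsort(predictions)              # sort predictions ascending
    cum_weights = np.cumsum(log_weights[order])  # running sum of the sorted weights
    # first prediction whose cumulative weight reaches half of the total weight
    median_idx = np.searchsorted(cum_weights, 0.5 * cum_weights[-1])
    return predictions[order[median_idx]]

# Example: three estimators predict 1.0, 3.0, 2.0 with alphas 0.2, 0.4, 0.3
print(weighted_median_predict(np.array([1.0, 3.0, 2.0]),
                              np.array([0.2, 0.4, 0.3])))  # prints 2.0
```

The same logic, vectorized over all samples, appears in the AdaBoost.R2 implementation in Section 5.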
4. Proof of Principle
AdaBoost algorithm derivation
Using the same assumptions as in the algorithm steps, suppose the training set is T = { (X_i, y_i) }, i = 1, ..., N, where y_i takes values in {-1, +1}, h(x) is a weak estimator, and the number of estimators is K.
One interpretation of the AdaBoost algorithm is as an additive model, in which multiple weighted estimators h(x) are summed to obtain the final strong estimator H(x), as follows:
(1) The strong estimator of round k-1
(2) The strong estimator of round k
(3) The strong estimator of round k can be expressed as the strong estimator of round k-1 plus the weighted estimator of round k
$$ \begin{aligned} H_{k-1}(x) &= \sum_{i = 1}^{k-1} \alpha_i h_i(x) & (1) \\ H_k(x) &= \sum_{i = 1}^{k} \alpha_i h_i(x) & (2) \\ H_k(x) &= H_{k-1}(x) + \alpha_k h_k(x) & (3) \\ \end{aligned} $$
Formula 4-1
Next, we define the cost function of the final strong estimator. The AdaBoost algorithm uses the exponential loss, which has better mathematical properties than the 0/1 loss (a short note on this follows Formula 4-2).
(1) The cost function of the strong estimator
(2) Substitute Equation (3) of Formula 4-1 into (1)
(3) Our goal is to find the estimator weight α and estimator h(x) that minimize the cost
(4) Define a new variable ω̄, which collects the strong estimator of the previous round and everything else unrelated to α and h(x)
(5) Rewrite the cost in terms of ω̄
$$ \begin{aligned} Cost(H(x)) &= \sum_{i = 1}^{N} e^{-y_iH(X_i)} & (1) \\ Cost(\alpha, h(x)) &= \sum_{i = 1}^{N} e^{-y_i(H_{k-1}(X_i) + \alpha h(X_i))} & (2) \\ \alpha_k, h_k(x) &= \underset{\alpha, h(x)}{\operatorname{argmin} } \sum_{i = 1}^{N} e^{-y_i(H_{k-1}(X_i) + \alpha h(X_i))} & (3) \\ \bar{\omega_{k,i}} &= e^{-y_iH_{k-1}(X_i)} & (4) \\ \alpha_k, h_k(x) &= \underset{\alpha, h(x)}{\operatorname{argmin} } \sum_{i = 1}^{N} \bar{\omega_{k,i}} e^{-y_i\alpha h(X_i)} & (5) \\ \end{aligned} $$
Formula 4-2
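A brief remark on why the exponential loss is convenient (this inequality is a standard fact, not spelled out in the original text): for labels y ∈ {-1, +1},

$$ I(y \ne \operatorname{sign}(H(x))) \le e^{-yH(x)} $$

so the exponential cost is a smooth, differentiable upper bound of the 0/1 training error, and driving it down also drives down the number of misclassified samples.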
Let's look at the estimator h(x) first. After each round of training, the estimator itself is already determined, so we only need to solve for the weight α of each estimator.
(1) Find the optimal estimator weight α that minimizes the cost function
(2) The cost function Cost(α)
(3) Since the label only takes the values +1 and -1, split the sum into two parts according to whether the prediction equals the label
(4) Adding and then subtracting the same term (the second and third terms) does not affect the result
(5) Combine the first two terms and the last two terms of (4) to get
$$ \begin{aligned} \alpha_k &= \underset{\alpha}{\operatorname{argmin} } \sum_{i = 1}^{N} \bar{\omega_{k,i}} e^{-y_i\alpha h_k(X_i)} & (1) \\ Cost(\alpha) &= \sum_{i = 1}^{N} \bar{\omega_{k,i}} e^{-y_i\alpha h_k(X_i)} & (2) \\ &= \sum_{y_i = h_k(X_i)}^{N} \bar{\omega_{k,i}} e^{-\alpha} + \sum_{y_i \ne h_k(X_i)}^{N} \bar{\omega_{k,i}} e^{\alpha} & (3) \\ &= \sum_{y_i = h_k(X_i)}^{N} \bar{\omega_{k,i}} e^{-\alpha} + \sum_{y_i \ne h_k(X_i)}^{N} \bar{\omega_{k,i}} e^{-\alpha} - \sum_{y_i \ne h_k(X_i)}^{N} \bar{\omega_{k,i}} e^{-\alpha} + \sum_{y_i \ne h_k(X_i)}^{N} \bar{\omega_{k,i}} e^{\alpha} & (4) \\ &= e^{-\alpha} \sum_{i = 1}^{N} \bar{\omega_{k,i}} + (e^{\alpha} - e^{-\alpha}) \sum_{i = 1}^{N} \bar{\omega_{k,i}} I(y_i \ne h_k(X_i)) & (5) \\ \end{aligned} $$
Formula 4-3
(1) Take the derivative of the cost function with respect to α and set it to zero
(2) Define the error rate e_k
(3) Substitute the error rate e_k into (1)
(4) Multiply both sides by e^α
(5) Rearrange the terms
(6) Obtain the expression for the estimator weight α
$$ \begin{aligned} \frac{\partial Cost(\alpha )}{\partial \alpha } &= -e^{-\alpha} \sum_{i = 1}^{N} \bar{\omega_{k,i}} + (e^{\alpha} + e^{-\alpha}) \sum_{i = 1}^{N} \bar{\omega_{k,i}} I(y_i \ne h_k(X_i)) = 0& (1) \\ e_k &= \frac{\sum_{i = 1}^{N}\bar{\omega_{k,i}} I(y_i \ne h_k(X_i))}{\sum_{i = 1}^{N}\bar{\omega_{k,i}}} & (2) \\ 0 &= -e^{-\alpha} + (e^\alpha + e^{-\alpha}) e_k & (3) \\ 0 &= -1 + (e^{2\alpha } + 1)e_k & (4) \\ e^{2\alpha } &= \frac{1 - e_k}{e_k} & (5) \\ \alpha &= \frac{1}{2} \ln \frac{1 - e_k}{e_k} & (6) \\ \end{aligned} $$
Formula 4-4
(1) The definition of the error rate e_k
(2) Define the normalized weight ω_k
(3) Obtain the expression of the error rate e_k in terms of ω_k
$$ \begin{aligned} e_k &= \frac{\sum_{i = 1}^{N}\bar{\omega_{k,i}} I(y_i \ne h_k(X_i))}{\sum_{i = 1}^{N}\bar{\omega_{k,i}}} & (1) \\ \omega_{k,i} &= \frac{\bar{\omega_{k,i}}}{\sum_{i = 1}^{N}\bar{\omega_{k,i}}} & (2) \\ e_k &= \sum_{i = 1}^{N}\omega_{k,i} I(y_i \ne h_k(X_i)) & (3) \\ \end{aligned} $$
Formula 4-5
Next is the update method of ω:
(1) The definition of ω̄_{k+1}
(2) Substitute Equation (3) of Formula 4-1 into (1)
(3) Replace the first factor with ω̄_k
$$ \begin{aligned} \bar{\omega_{k+1,i}} &= e^{-y_iH_{k}(X_i)} & (1) \\ &= e^{-y_i(H_{k-1}(X_i) + \alpha_kh_k(X_i))} & (2) \\ &= \bar{\omega_{k,i}}e^{-y_i\alpha_kh_k(X_i)} & (3) \end{aligned} $$
Formula 4-6
(1) Equation (3) of Formula 4-6
(2) Divide both sides by the normalization factor
(3) Replace the numerator according to the definition of (2) in Formula 4-5, and expand the denominator using (1)
(4) Replace the denominator again according to the definition of (2) in Formula 4-5
(5) The sum of ω̄ is a constant C
(6) The constant C in the numerator and denominator cancels, giving the update expression for ω
$$ \begin{aligned} \bar{\omega_{k+1,i}} &= \bar{\omega_{k,i}}e^{-y_i\alpha_kh_k(X_i)} & (1) \\ \omega_{k+1,i} &= \frac{ \bar{\omega_{k,i}}e^{-y_i\alpha_kh_k(X_i)} }{\sum_{j = 1}^N \bar{\omega_{k+1,j}}} & (2) \\ &= \frac{\omega_{k,i} \sum_{j = 1}^N \left(\bar{\omega_{k,j}}\right) e^{-y_i\alpha_kh_k(X_i)} }{\sum_{j = 1}^N \left(\bar{\omega_{k,j}} e^{-y_j\alpha_kh_k(X_j)} \right) } & (3) \\ &= \frac{\omega_{k,i} \sum_{j = 1}^N \left(\bar{\omega_{k,j}}\right) e^{-y_i\alpha_kh_k(X_i)}}{\sum_{j = 1}^N \left(\omega_{k,j} \left(\sum_{l = 1}^N \bar{\omega_{k,l}}\right) e^{-y_j\alpha_kh_k(X_j)}\right) } & (4) \\ &= \frac{\omega_{k,i} C e^{-y_i\alpha_kh_k(X_i)}}{\sum_{j = 1}^N \left(\omega_{k,j} C e^{-y_j\alpha_kh_k(X_j)}\right) } & (5) \\ &= \frac{\omega_{k,i} e^{-y_i\alpha_kh_k(X_i)}}{\sum_{j = 1}^N \left(\omega_{k,j} e^{-y_j\alpha_kh_k(X_j)}\right) } & (6) \\ \end{aligned} $$
Formula 4-7
Combining Formula 4-1 through Formula 4-7 gives the complete expressions of the AdaBoost algorithm:
$$ \begin{aligned} e_k &= \sum_{i = 1}^{N}\omega_{k,i} I(y_i \ne h_k(X_i)) & (1) \\ \alpha_k &= \frac{1}{2} \ln \frac{1 - e_k}{e_k} & (2) \\ \omega_{k+1,i} &= \frac{\omega_{k,i} e^{-y_i\alpha_kh_k(X_i)}}{\sum_{j = 1}^N \left(\omega_{k,j} e^{-y_j\alpha_kh_k(X_j)}\right) } & (3) \\ H(x) &= \operatorname{sign} \left(\sum_{i = 1}^{K} \alpha_i h_i(x)\right) & (4) \\ \end{aligned} $$
Formula 4-8
AdaBoost-SAMME algorithm derivation
Using the same assumptions as in the algorithm steps, suppose the training set is T = { (X_i, y_i) }, i = 1, ..., N, where y can take M possible values, h(x) is a weak estimator, and the number of estimators is K.
In order to adapt to the multi-classification problem, the AdaBoost-SAMME algorithm converts the original numerical label y into a vector form, as shown in Equation 4-9:
$$ \hat{y}^m = \left\{ \begin{array}{cl} 1 & y = m\\ -\frac{1}{M-1} & y \ne m \end{array}\right. \quad m = 1,\dots,M $$
Formula 4-9
The following example illustrates the meaning of Formula 4-9. Suppose the label y can take the values 1, 2, 3 and the label set is y = { 2, 1, 2, 3 }; then according to Formula 4-9 the converted label set is as shown in Formula 4-10 (a small NumPy sketch of this encoding follows Formula 4-10):
$$ \begin{array}{c} y \in \{1,2,3\} \\ y = \{2,1,2,3\} \\ \hat{y}_i^m = \left\{ \begin{array}{cl} 1 & y_i = m\\ -\frac{1}{2} & y_i \ne m \end{array}\right. \quad m = 1,2,3 \\ \hat{y} = \begin{bmatrix} -\frac{1}{2} & 1 & -\frac{1}{2} \\ 1 & -\frac{1}{2} & -\frac{1}{2} \\ -\frac{1}{2} & 1 & -\frac{1}{2} \\ -\frac{1}{2} & -\frac{1}{2} & 1 \end{bmatrix} \end{array} $$
Formula 4-10
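The same conversion can be reproduced in a couple of lines of NumPy. This is only an illustrative sketch of Formula 4-9 and Formula 4-10, mirroring the encoding trick used later in the SAMME.R implementation; the variable names are my own:

```python
import numpy as np

classes = np.array([1, 2, 3])   # the M = 3 possible label values
y = np.array([2, 1, 2, 3])      # the label set from the example
M = len(classes)

# each row gets 1 where the label matches the class and -1/(M-1) elsewhere
y_codes = np.array([-1.0 / (M - 1), 1.0])
y_hat = y_codes.take(classes == y[:, np.newaxis])
print(y_hat)
# [[-0.5  1.  -0.5]
#  [ 1.  -0.5 -0.5]
#  [-0.5  1.  -0.5]
#  [-0.5 -0.5  1. ]]
```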
The algorithm is also interpreted as an additive model: the final strong estimator H(x) is a weighted sum of the estimators h(x), and the cost function again uses the exponential function.
(1) The cost function; compared with the binary algorithm there is an extra factor of 1/M for the convenience of later calculations, and H(X_i) is now a vector
(2) Substitute the additive expression H_k = H_{k-1} + α h_k into (1)
(3) As before, define ω̄, which collects the strong estimator of the previous round and everything unrelated to α and h(x)
(4) Substitute ω̄ to get the expression of the cost function
(5) The goal is to find the estimator weight α that minimizes the cost function
$$ \begin{aligned} Cost(H(x)) &= \sum_{i = 1}^{N} e^{-\frac{1}{M} \hat{y}_iH(X_i)} & (1) \\ Cost(\alpha) &= \sum_{i = 1}^{N} e^{-\frac{1}{M}\hat{y}_i(H_{k-1}(X_i) + \alpha h_k(X_i))} & (2) \\ \bar{\omega_{k,i}} &= e^{-\frac{1}{M}\hat{y}_iH_{k-1}(X_i)} & (3) \\ Cost(\alpha) &= \sum_{i = 1}^{N} \bar{\omega_{k,i}} e^{-\frac{1}{M}\hat{y}_i\alpha h_k(X_i)} & (4) \\ \alpha_k &= \underset{\alpha}{\operatorname{argmin} } \sum_{i = 1}^{N} \bar{\omega_{k,i}} e^{-\frac{1}{M}\hat{y}_i\alpha h_k(X_i)} & (5) \\ \end{aligned} $$
Formula 4-11
Let's first look at the exponent of the cost function, that is, the dot product of the predicted vector and the label vector. Two cases need to be discussed:
When the prediction equals the label, the 1 entries of the two vectors are in the same position, and the remaining M - 1 entries are all -1/(M - 1), giving the following dot product:
$$ \begin{aligned} 1 + \left(M - 1\right)\left(-\frac{1}{M-1}\right)\left(-\frac{1}{M-1}\right) = \frac{M}{M-1}\\ \end{aligned} $$
Formula 4-12
When the prediction differs from the label, the 1 entries of the two vectors are in different positions, each 1 pairs with a -1/(M - 1), and the remaining M - 2 positions both contain -1/(M - 1), giving the following dot product:
$$ \begin{aligned} \left(-\frac{1}{M-1}\right) + \left(-\frac{1}{M-1}\right) + \left(M - 2\right) \left(-\frac{1}{M-1}\right)\left(-\frac{1}{M-1}\right) = -\frac{M}{(M-1)^2} \end{aligned} $$
Formula 4-13
Combining the above two cases, the following results are obtained:
$$ \hat{y}_ih_k(X_i) = \left\{ \begin{aligned} &\frac{M}{M-1} & \hat{y}_i = h_k(X_i) \\ &-\frac{M}{(M-1)^2} & \hat{y}_i \ne h_k(X_i) \end{aligned} \right. $$
Formula 4-14
(1) The cost function Cost(α)
(2) Substitute the two cases of Formula 4-14
(3) Adding and then subtracting the same term (the second and third terms) does not affect the result
(4) Combine the first two terms and the last two terms of (3) to get
$$ \begin{aligned} Cost(\alpha) &= \sum_{i = 1}^{N} \bar{\omega_{k,i}} e^{-\frac{1}{M}\hat{y}_i\alpha h_k(X_i)} & (1) \\ &= \sum_{\hat{y}_i = h_k(X_i)}^{N} \bar{\omega_{k,i}} e^{-\frac{\alpha}{M-1} } + \sum_{\hat{y}_i \ne h_k(X_i)}^{N} \bar{\omega_{k,i}} e^{\frac{\alpha}{(M-1)^2}} & (2) \\ &= \sum_{\hat{y}_i = h_k(X_i)}^{N} \bar{\omega_{k,i}} e^{-\frac{\alpha}{M-1} } + \sum_{\hat{y}_i \ne h_k(X_i)}^{N} \bar{\omega_{k,i}} e^{-\frac{\alpha}{M-1}} - \sum_{\hat{y}_i \ne h_k(X_i)}^{N} \bar{\omega_{k,i}} e^{-\frac{\alpha}{M-1}} + \sum_{\hat{y}_i \ne h_k(X_i)}^{N} \bar{\omega_{k,i}} e^{\frac{\alpha}{(M-1)^2}} & (3) \\ &= e^{-\frac{\alpha}{M-1} } \sum_{i = 1}^{N} \bar{\omega_{k,i}} + (e^{\frac{\alpha}{(M-1)^2}} - e^{-\frac{\alpha}{M-1}}) \sum_{i = 1}^{N} \bar{\omega_{k,i}} I(\hat{y}_i \ne h_k(X_i)) & (4) \\ \end{aligned} $$
Formula 4-15
(1) Take the derivative of the cost function with respect to α and set it to zero
(2) Define the error rate e_k
(3) Substitute the error rate e_k into (1)
(4) Multiply both sides by e^{α/(M-1)}
(5) Rearrange the terms
(6) Obtain the expression for the estimator weight α
$$ \begin{aligned} \frac{\partial Cost(\alpha )}{\partial \alpha } &= \left(-\frac{1}{M-1}\right)e^{-\frac{\alpha}{M-1}} \sum_{i = 1}^{N} \bar{\omega_{k,i}} + \left(\left(\frac{1}{(M-1)^2}\right)e^{\frac{\alpha}{(M-1)^2}} + \left(\frac{1}{(M-1)}\right)e^{-\frac{\alpha}{M-1}}\right) \sum_{i = 1}^{N} \bar{\omega_{k,i}} I(y_i \ne h_k(X_i)) = 0& (1) \\ e_k &= \frac{\sum_{i = 1}^{N}\bar{\omega_{k,i}} I(y_i \ne h_k(X_i))}{\sum_{i = 1}^{N}\bar{\omega_{k,i}}} & (2) \\ e^{-\frac{\alpha}{M-1}} &= \left(\left(\frac{1}{M-1}\right)e^{\frac{\alpha}{(M-1)^2}} + e^{-\frac{\alpha}{M-1}}\right) e_k & (3) \\ 1 &= \left(\left(\frac{1}{M-1}\right)e^{\frac{\alpha}{(M-1)^2} + \frac{\alpha}{M-1}} + 1\right) e_k & (4) \\ \frac{1 - e_k}{e_k} &= \left(\frac{1}{M-1}\right)e^{\frac{M\alpha}{(M-1)^2}} & (5) \\ \alpha &= \frac{(M-1)^2}{M}\left( \ln \left(\frac{1 - e_k}{e_k}\right) + \ln (M - 1) \right) & (6) \end{aligned} $$
Formula 4-16
The constant factor in front of the estimator weight α in Formula 4-16 has no effect on the result after normalization, and the sample weight update formula given in the algorithm steps is likewise the simplified result. For a more detailed description, please refer to the original paper, Multi-class AdaBoost[7].
AdaBoost-SAMME.R algorithm derivation
The AdaBoost-SAMME.R algorithm is a variant of the AdaBoost-SAMME algorithm that uses weighted probability estimates to update the additive model, as shown in Formula 4-17:
$$ \begin{aligned} H_k(x) = H_{k - 1}(x) + h_k(x) \end{aligned} $$
Formula 4-17
The cost function still uses the exponential function. The difference is that there is no estimator weight (equivalently, every estimator has weight 1) and the cost is written as a conditional expectation, where h(x) returns an M-dimensional vector. To make the solution h(x) unique, the constraint that the elements of the vector sum to 0 is added.
$$ \begin{array}{c} h_k(x) = \underset{h(x)}{\operatorname{argmax}} E(e^{-\frac{1}{M} \hat{y}_i (H_{k-1}(x) + h(x)) } \mid x) \\ s.t. \quad h_k^1(x) + h_k^2(x) + \cdots + h_k^M(x) = 0 \end{array} $$
Formula 4-18
The cost function can be split into separate expectations for each class and then added together:
$$ \begin{aligned} Cost(h(x)) &= E(e^{-\frac{1}{M} \hat{y}_i (H_{k-1}(x) + h(x)) } \mid x) & (1) \\ &= E(e^{-\frac{1}{M} \hat{y}_i H_{k-1}(x)}e^{-\frac{1}{M}\hat{y}_ih(x) } \mid x) & (2) \\ &= E(e^{-\frac{1}{M} \hat{y}_i H_{k-1}(x)}e^{-\frac{1}{M}\hat{y}_ih(x) } I(y = 1) \mid x) + \cdots + E(e^{-\frac{1}{M} \hat{y}_i H_{k-1}(x)}e^{-\frac{1}{M}\hat{y}_ih(x) } I(y = M) \mid x) & (3) \\ \end{aligned} $$
Formula 4-19
Let's first look at the result of ŷ·h(x) when y = 1:
(1) When y = 1, the converted vector form of y
(2) Compute the dot product
(3) Merge the last M - 1 terms
(4) Replace the sum using the constraint in Formula 4-18
(5) Obtain the simplified result
$$ \begin{aligned} \hat{y} &= [1, -\frac{1}{M - 1}, \cdots, -\frac{1}{M - 1} ] & (1) \\ \hat{y}_ih(x) &= h^1(x) + (-\frac{1}{M - 1})h^2(x) + \cdots + (-\frac{1}{M - 1})h^M(x) & (2) \\ &= h^1(x) - \frac{h^2(x) + \cdots + h^M(x)}{M - 1} & (3) \\ &= h^1(x) - \frac{-h^1(x)}{M - 1} & (4) \\ &= \frac{Mh^1(x)}{M - 1} & (5) \\ \end{aligned} $$
Formula 4-20
(1) Substitute Formula 4-20
(2) Pull the term that does not depend on the expectation outside of it
(3) Denote the remaining expectation by P(y = 1 | x), the weighted class probability
(4) In the same way, the result for each class m can be obtained
$$ \begin{aligned} E(e^{-\frac{1}{M} \hat{y}_i H_{k-1}(x)}e^{-\frac{1}{M}\hat{y}_ih(x) } I(y = 1) \mid x) &= E(e^{-\frac{1}{M} \hat{y}_i H_{k-1}(x)}e^{-\frac{h^1(x)}{M-1} } I(y = 1) \mid x) & (1) \\ &= e^{-\frac{h^1(x)}{M-1}} E(e^{-\frac{1}{M} \hat{y}_i H_{k-1}(x)} I(y = 1) \mid x) & (2) \\ P(y = 1 | x) &= E(e^{-\frac{1}{M} \hat{y}_i H_{k-1}(x)} I(y = 1) \mid x) & (3) \\ E(e^{-\frac{1}{M} \hat{y}_i H_{k-1}(x)}e^{-\frac{1}{M}\hat{y}_ih(x) } I(y = m) \mid x) &= e^{-\frac{h^m(x)}{M-1}} P(y = m | x) & (4) \\ \end{aligned} $$
Formula 4-21
Plug the above result into the cost function to get:
$$ \begin{aligned} Cost(h(x)) &= e^{-\frac{h^1(x)}{M-1}} P(y = 1 | x) + \cdots + e^{-\frac{h^M(x)}{M-1}} P(y = M | x) & (1) \\ &= \sum_{m = 1}^{M} e^{-\frac{h^m(x)}{M-1}} P(y = m | x) & (2) \\ \end{aligned} $$
Formula 4-22
The Lagrange multiplier method can now be used to solve this constrained problem; its Lagrangian L is as follows:
$$ \begin{aligned} L(h(x), \lambda ) &= \sum_{m = 1}^{M} e^{-\frac{h^m(x)}{M-1}} P(y = m | x) - \lambda \sum_{m = 1}^{M} h^m(x)\\ \end{aligned} $$
Formula 4-23
Take the derivative of the Lagrangian with respect to each component of h(x) separately:
$$ \begin{aligned} \frac{\partial L(h(x), \lambda)}{\partial h^1(x)} &= -\frac{1}{M-1} e^{-\frac{h^1(x)}{M-1}} P(y = 1 | x) - \lambda = 0 \\ \frac{\partial L(h(x), \lambda)}{\partial h^2(x)} &= -\frac{1}{M-1} e^{-\frac{h^2(x)}{M-1}} P(y = 2 | x) - \lambda = 0 \\ & \cdots \\ \frac{\partial L(h(x), \lambda)}{\partial h^M(x)} &= -\frac{1}{M-1} e^{-\frac{h^M(x)}{M-1}} P(y = M | x) - \lambda = 0 \\ \end{aligned} $$
Formula 4-24
Combining the equations in Formula 4-24 pairwise gives the result for each component; the first component is used as an example below:
(1) Equate the 1st and 2nd equations of Formula 4-24
(2) Cancel the common constant factor and take the logarithm of both sides
(3) Rearranging the terms gives
$$ \begin{aligned} -\frac{1}{M-1} e^{-\frac{h^1(x)}{M-1}} P(y = 1 | x) &= -\frac{1}{M-1} e^{-\frac{h^2(x)}{M-1}} P(y = 2 | x) & (1)\\ -\frac{h^1(x)}{M-1} + \ln P(y = 1 | x) &= -\frac{h^2(x)}{M-1} + \ln P(y = 2 | x) & (2) \\ h^1(x) - h^2(x) &= (M - 1) (\ln P(y = 1 | x) - \ln P(y = 2 | x)) & (3) \\ \end{aligned} $$
Formula 4-25
(1)~(3) The remaining pairs are obtained in the same way as Formula 4-25
(4) Sum equations (1) to (3) and simplify using the constraint that the components of h(x) sum to 0
(5) Complete the last term on each side
(6) Obtain the result for the first component
$$ \begin{aligned} h^1(x) - h^2(x) &= (M - 1) (\ln P(y = 1 | x) - \ln P(y = 2 | x)) & (1) \\ h^1(x) - h^3(x) &= (M - 1) (\ln P(y = 1 | x) - \ln P(y = 3 | x)) & (2) \\ & \cdots \\ h^1(x) - h^M(x) &= (M - 1) (\ln P(y = 1 | x) - \ln P(y = M | x)) & (3) \\ (M - 1) h^1(x) - (-h^1(x)) &= (M - 1)((M - 1)\ln P(y = 1 | x) - \sum_{m \ne 1} \ln P(y = m | x))) & (4) \\ Mh^1(x) &= (M - 1)(M\ln P(y = 1 | x) - \sum_{m = 1}^{M} \ln P(y = m | x)) & (5) \\ h^1(x) &= (M - 1)(\ln P(y = 1 | x) - \frac{1}{M} \sum_{m = 1}^{M} \ln P(y = m | x)) & (6) \\ \end{aligned} $$
Formula 4-26
Similarly, the results of each component of h(x) can be obtained
$$ \begin{aligned} h^m(x) &= (M - 1)(\ln P(y = m | x) - \frac{1}{M} \sum_{m^{'} = 1}^{M} \ln P(y = m^{'} | x)) \\ \end{aligned} $$
Formula 4-27
The sample weight update is as follows. Substituting h(x) into the update rule, you can see that only the first term is retained, because the second term sums ln p(x) over all classes and can be treated as a constant that normalization removes, so it does not affect the final result.
$$ \begin{aligned} \bar{\omega_{k,i}} &= e^{-\frac{1}{M}\hat{y}_iH_{k-1}(X_i)} & (1) \\ \bar{\omega_{k+1,i}} &= \bar{\omega_{k,i}} e^{-\frac{1}{M}\hat{y}_ih_{k}(X_i)} & (2) \\ &= \bar{\omega_{k,i}} e^{-\frac{M - 1}{M}\hat{y}_i\ln p_k(X_i)} & (3) \\ \end{aligned} $$
Formula 4-28
This gives the sample weight update formula used in the algorithm steps. For a more detailed description, please refer to the original paper, Multi-class AdaBoost[7].
5. Code implementation
Implementing the AdaBoost algorithm using Python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class adaboostc():
    """
    AdaBoost classification algorithm
    """

    def __init__(self, n_estimators = 100):
        # number of AdaBoost weak learners
        self.n_estimators = n_estimators

    def fit(self, X, y):
        """
        Fit the AdaBoost classification algorithm
        """
        # initialize the sample weight vector
        sample_weights = np.ones(X.shape[0]) / X.shape[0]
        # list of estimators
        estimators = []
        # list of estimator weights
        weights = []
        # loop over the estimators
        for i in range(self.n_estimators):
            # initialize a decision tree estimator with max depth 1 (a stump)
            estimator = DecisionTreeClassifier(max_depth = 1)
            # fit the training set with the current sample weights
            estimator.fit(X, y, sample_weight=sample_weights)
            # predict on the training set
            y_predict = estimator.predict(X)
            # compute the error rate
            e = np.sum(sample_weights[y_predict != y])
            # stop when the error rate is greater than or equal to 0.5
            if e >= 0.5:
                self.n_estimators = i
                break
            # compute the estimator weight
            weight = 0.5 * np.log((1 - e) / e)
            # update the sample weights
            temp_weights = np.multiply(sample_weights, np.exp(- weight * np.multiply(y, y_predict)))
            # normalize the sample weights
            sample_weights = temp_weights / np.sum(temp_weights)
            weights.append(weight)
            estimators.append(estimator)
        self.weights = weights
        self.estimators = estimators

    def predict(self, X):
        """
        Predict with the AdaBoost classification algorithm
        """
        y = np.zeros(X.shape[0])
        # loop over the estimators
        for i in range(self.n_estimators):
            estimator = self.estimators[i]
            weight = self.weights[i]
            # predictions of this estimator
            predicts = estimator.predict(X)
            # accumulate according to the estimator weight
            y += weight * predicts
        # return the sign of the weighted sum
        return np.sign(y)
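A quick usage sketch for the class above (the toy dataset and accuracy check are my own illustration, not from the original text); note that this implementation expects the labels to be encoded as -1 and +1:

```python
import numpy as np
from sklearn.datasets import make_classification

# toy dataset with the labels converted to -1 / +1
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
y = np.where(y == 0, -1, 1)

model = adaboostc(n_estimators=50)
model.fit(X, y)
# training accuracy
print(np.mean(model.predict(X) == y))
```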
Implementing the AdaBoost-SAMME algorithm using Python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class adaboostmc():
    """
    AdaBoost multi-class SAMME algorithm
    """

    def __init__(self, n_estimators = 100):
        # number of AdaBoost weak learners
        self.n_estimators = n_estimators

    def fit(self, X, y):
        """
        Fit the AdaBoost multi-class SAMME algorithm
        """
        # distinct label classes
        self.classes = np.unique(y)
        # number of label classes
        self.n_classes = len(self.classes)
        # initialize the sample weight vector
        sample_weights = np.ones(X.shape[0]) / X.shape[0]
        # list of estimators
        estimators = []
        # list of estimator weights
        weights = []
        # loop over the estimators
        for i in range(self.n_estimators):
            # initialize a decision tree estimator with max depth 1 (a stump)
            estimator = DecisionTreeClassifier(max_depth = 1)
            # fit the training set with the current sample weights
            estimator.fit(X, y, sample_weight=sample_weights)
            # predictions on the training set
            y_predict = estimator.predict(X)
            incorrect = y_predict != y
            # compute the error rate
            e = np.sum(sample_weights[incorrect])
            # compute the estimator weight
            weight = np.log((1 - e) / e) + np.log(self.n_classes - 1)
            # update the sample weights
            temp_weights = np.multiply(sample_weights, np.exp(weight * incorrect))
            # normalize the sample weights
            sample_weights = temp_weights / np.sum(temp_weights)
            weights.append(weight)
            estimators.append(estimator)
        self.weights = weights
        self.estimators = estimators

    def predict(self, X):
        """
        Predict with the AdaBoost multi-class SAMME algorithm
        """
        # weighted votes for each class
        results = np.zeros((X.shape[0], self.n_classes))
        # loop over the estimators
        for i in range(self.n_estimators):
            estimator = self.estimators[i]
            weight = self.weights[i]
            # predictions of this estimator
            predicts = estimator.predict(X)
            # loop over the label classes
            for j in range(self.n_classes):
                # accumulate this estimator's weight for its predicted class
                results[predicts == self.classes[j], j] += weight
        # take the class with the largest weighted vote as the final result
        return self.classes.take(np.argmax(results, axis=1), axis=0)
Implementing the AdaBoost-SAMME.R algorithm using Python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class adaboostmcr():
    """
    AdaBoost multi-class SAMME.R algorithm
    """

    def __init__(self, n_estimators = 100):
        # number of AdaBoost weak learners
        self.n_estimators = n_estimators

    def fit(self, X, y):
        """
        Fit the AdaBoost multi-class SAMME.R algorithm
        """
        # distinct label classes
        self.classes = np.unique(y)
        # number of label classes
        self.n_classes = len(self.classes)
        # initialize the sample weights
        sample_weights = np.ones(X.shape[0]) / X.shape[0]
        # list of estimators
        estimators = []
        # the encoding of y defined in the paper
        y_codes = np.array([-1. / (self.n_classes - 1), 1.])
        # convert the training labels into the matrix form used in the paper
        y_coding = y_codes.take(self.classes == y[:, np.newaxis])
        # loop over the estimators
        for i in range(self.n_estimators):
            # initialize a decision tree estimator with max depth 1 (a stump)
            estimator = DecisionTreeClassifier(max_depth = 1)
            # fit the training set with the current sample weights
            estimator.fit(X, y, sample_weight=sample_weights)
            # predicted class probabilities on the training set
            y_predict_proba = estimator.predict_proba(X)
            # clip zero probabilities to avoid negative infinity when taking the logarithm
            np.clip(y_predict_proba, np.finfo(y_predict_proba.dtype).eps, None, out=y_predict_proba)
            # update the sample weights
            temp_weights = sample_weights * np.exp(- ((self.n_classes - 1) / self.n_classes) * np.sum(np.multiply(y_coding, np.log(y_predict_proba)), axis=1))
            # normalize the sample weights
            sample_weights = temp_weights / np.sum(temp_weights)
            estimators.append(estimator)
        self.estimators = estimators

    def predict(self, X):
        """
        Predict with the AdaBoost multi-class SAMME.R algorithm
        """
        # accumulated results
        results = np.zeros((X.shape[0], self.n_classes))
        # loop over the estimators
        for i in range(self.n_estimators):
            estimator = self.estimators[i]
            # predicted class probabilities
            y_predict_proba = estimator.predict_proba(X)
            # clip zero probabilities here as well
            np.clip(y_predict_proba, np.finfo(y_predict_proba.dtype).eps, None, out=y_predict_proba)
            # take the logarithm of the probabilities
            y_predict_proba_log = np.log(y_predict_proba)
            # compute h(x)
            h = (self.n_classes - 1) * (y_predict_proba_log - (1 / self.n_classes) * np.sum(y_predict_proba_log, axis=1)[:, np.newaxis])
            # accumulate
            results += h
        # take the class with the largest accumulated value as the final result
        return self.classes.take(np.argmax(results, axis=1), axis=0)
Implementing the AdaBoost.R2 algorithm using Python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class adaboostr():
    """
    AdaBoost regression algorithm (AdaBoost.R2)
    """

    def __init__(self, n_estimators = 100):
        # number of AdaBoost weak learners
        self.n_estimators = n_estimators

    def fit(self, X, y):
        """
        Fit the AdaBoost regression algorithm
        """
        # initialize the sample weight vector
        sample_weights = np.ones(X.shape[0]) / X.shape[0]
        # list of estimators
        estimators = []
        # list of estimator weights
        weights = []
        # loop over the estimators
        for i in range(self.n_estimators):
            # initialize a decision tree estimator with max depth 3
            estimator = DecisionTreeRegressor(max_depth = 3)
            # fit the training set with the current sample weights
            estimator.fit(X, y, sample_weight=sample_weights)
            # predictions on the training set
            y_predict = estimator.predict(X)
            # compute the error vector (linear error)
            errors = np.abs(y_predict - y)
            errors = errors / np.max(errors)
            # compute the error rate
            e = np.sum(np.multiply(errors, sample_weights))
            # stop when the error rate is greater than or equal to 0.5
            if e >= 0.5:
                self.n_estimators = i
                break
            # compute the estimator weight
            weight = e / (1 - e)
            # update the sample weights
            temp_weights = np.multiply(sample_weights, np.power(weight, 1 - errors))
            # normalize the sample weights
            sample_weights = temp_weights / np.sum(temp_weights)
            weights.append(weight)
            estimators.append(estimator)
        self.weights = np.array(weights)
        self.estimators = np.array(estimators)

    def predict(self, X):
        """
        Predict with the AdaBoost regression algorithm
        """
        # the estimator weights as defined in the paper
        weights = np.log(1 / self.weights)
        # matrix of predictions, one row per sample and one column per estimator
        predictions = np.array([self.estimators[i].predict(X) for i in range(self.n_estimators)]).T
        # indices that sort each row of predictions
        sorted_idx = np.argsort(predictions, axis=1)
        # accumulate the estimator weights in sorted order to get a cumulative weight matrix,
        # similar to a cumulative distribution function
        weight_cdf = np.cumsum(weights[sorted_idx], axis=1, dtype=np.float64)
        # entries of the cumulative weight matrix that reach at least half of the total weight
        median_or_above = weight_cdf >= 0.5 * weight_cdf[:, -1][:, np.newaxis]
        # index of the weighted median for each sample
        median_idx = median_or_above.argmax(axis=1)
        # the corresponding estimator for each sample
        median_estimators = sorted_idx[np.arange(X.shape[0]), median_idx]
        # take that estimator's prediction as the final result
        return predictions[np.arange(X.shape[0]), median_estimators]
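A corresponding usage sketch for the regression class (again, the toy dataset and error check are illustrative assumptions, not from the original text):

```python
import numpy as np
from sklearn.datasets import make_regression

# toy regression dataset
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

model = adaboostr(n_estimators=50)
model.fit(X, y)
# mean absolute error on the training set
print(np.mean(np.abs(model.predict(X) - y)))
```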
6. Third-party library implementation
AdaBoost classification with scikit-learn[3]
from sklearn.ensemble import AdaBoostClassifier
# AdaBoost classifier using the SAMME algorithm
clf = AdaBoostClassifier(n_estimators = 50, random_state = 0, algorithm = "SAMME")
# AdaBoost classifier using the SAMME.R algorithm
clf = AdaBoostClassifier(n_estimators = 50, random_state = 0, algorithm = "SAMME.R")
# fit the dataset
clf = clf.fit(X, y)
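For completeness, here is a small, self-contained usage sketch on a synthetic dataset (the dataset, split, and scoring step are my own additions, not from the original text):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# generate a toy binary classification problem
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# AdaBoost classifier with the library's default settings
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
# mean accuracy on the held-out test set
print(clf.score(X_test, y_test))
```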
AdaBoost regression with scikit-learn[4]
from sklearn.ensemble import AdaBoostRegressor
# AdaBoost regressor
clf = AdaBoostRegressor(n_estimators = 50, random_state = 0)
# fit the dataset
clf = clf.fit(X, y)
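Similarly, a minimal regression sketch on a synthetic dataset (again, the dataset, split, and scoring step are illustrative assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split

# generate a toy regression dataset
X, y = make_regression(n_samples=500, n_features=10, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = AdaBoostRegressor(n_estimators=50, random_state=0)
reg.fit(X_train, y_train)
# coefficient of determination R^2 on the held-out test set
print(reg.score(X_test, y_test))
```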
7. Example demonstration
Figure 7-1 shows the result of binary classification with the adaptive boosting algorithm. Red represents the sample points with a label value of -1 and blue those with a label value of 1; the light red area is the region predicted as -1 and the light blue area the region predicted as 1.
Figure 7-1
Figure 7-2 and Figure 7-3 show the results of multi-class classification with the SAMME and SAMME.R algorithms, respectively. Red represents the sample points with a label value of 0, blue those with a label value of 1, and green those with a label value of 2; the light red area is the region predicted as 0, the light blue area the region predicted as 1, and the light green area the region predicted as 2.
Figure 7-2
Figure 7-3
Figure 7-4 shows the result of regression with the adaptive boosting algorithm.
Figure 7-4
8. Mind map
Figure 8-1
9. References
1. https://en.wikipedia.org/wiki/Boosting_(machine_learning)
2. https://en.wikipedia.org/wiki/AdaBoost
3. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
4. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html
5. https://www.face-rec.org/algorithms/Boosting-Ensemble/decision-theoretic_generalization.pdf
6. https://citeseer.ist.psu.edu/viewdoc/download?doi=10.1.1.21.5683&rep=rep1&type=pdf
7. https://hastie.su.domains/Papers/samme.pdf
For full demo please click here
Note: This article strives to be accurate and easy to understand, but since the author is also a beginner with limited experience, readers are encouraged to point out any errors or omissions by leaving a message.
This article was first published on AI Map; you are welcome to follow it.