
Background knowledge needed for this article: the decision tree learning algorithm and a little programming knowledge

1. Introduction

In the previous section, we learned about a simple and efficient algorithm: the decision tree learning algorithm (Decision Tree Learning Algorithm). In this section, we will learn about an ensemble learning 1 algorithm: the random forest algorithm 2 (Random Forest Algorithm).

2. Introduction to the model

There is an idiom, "two heads are better than one", which refers to pooling the wisdom of many people and widely absorbing useful opinions. A similar idea exists in machine learning algorithms, called ensemble learning.

Ensemble learning

Ensemble learning trains multiple estimators; when a prediction is needed, the results of the multiple estimators are combined by a combiner and output as the final result.

Figure 2-1

Figure 2-1 shows the basic flow of ensemble learning.

The advantage of ensemble learning is that it improves the generalization ability and robustness of a single estimator, giving better predictive performance than any single estimator alone. Another feature of ensemble learning is that it can be easily parallelized.
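
As a small illustration of this last point (not from the original article, with hypothetical toy data): each estimator trains on its own bootstrap sample, independently of the others, so the training loop can be spread across workers.

from concurrent.futures import ThreadPoolExecutor

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data for illustration only
X = np.random.RandomState(0).rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)

def train_one(seed):
    # Each estimator gets its own bootstrap sample, so it can be
    # trained independently of all the others
    rs = np.random.RandomState(seed)
    idx = rs.randint(0, X.shape[0], X.shape[0])
    return DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx])

# Train ten trees concurrently
with ThreadPoolExecutor() as pool:
    trees = list(pool.map(train_one, range(10)))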

Bagging algorithm

The Bagging algorithm is an ensemble learning algorithm. Its specific steps are as follows: suppose there is a training set of size N; each time, a sub-dataset of size M is sampled from the training set with replacement, and this is repeated K times to obtain K sub-datasets. K models are then trained on these K sub-datasets. At prediction time, all K models make a prediction, and the final result is obtained by averaging (for regression) or majority vote (for classification).

Figure 2-2

Figure 2-2 shows how the Bagging algorithm draws its sub-datasets.
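
A minimal sketch of this sampling step, assuming NumPy; the sizes N, M, and K are arbitrary illustration values:

import numpy as np

rng = np.random.RandomState(0)
N, M, K = 10, 10, 3   # training set size, sub-dataset size, number of models

data = np.arange(N)   # stand-in for a real training set
for k in range(K):
    # Draw M indices with replacement to form the k-th sub-dataset
    idx = rng.randint(0, N, M)
    sub_dataset = data[idx]
    print(sub_dataset)

Because sampling is with replacement, a sub-dataset typically contains duplicates of some samples while omitting others entirely.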

Random Forest Algorithm

The random forest algorithm combines multiple decision trees: each tree's training set is sampled randomly with replacement, and a random subset of features is selected as input, which is why the algorithm is called the random forest algorithm. As can be seen, the random forest algorithm is a Bagging algorithm that uses decision trees as estimators.

Figure 2-3

Figure 2-3 shows the specific flow of the random forest algorithm. For classification problems, the combiner takes the majority class among the individual results as the final result; for regression problems, it takes the average of the individual regression results as the final result.

Using the Bagging approach reduces overfitting and leads to better performance. A single decision tree is very sensitive to noise in the training set, but the Bagging algorithm reduces the correlation among the trained decision trees, which effectively alleviates this problem.

3. Algorithm steps

Assuming the training set T has size N, the number of features is M, and the size of the random forest is K, the specific steps of the random forest algorithm are as follows:

Repeat the following K times, once for each tree in the forest:

  1. Sample from the training set T with replacement N times to form a new sub-training set D
  2. Randomly select m features, where m < M
  3. Learn a complete decision tree using the new training set D and the m features

The K decision trees together form the random forest.

The choice of m in the above algorithm: for classification problems, √M features can be used at each split; for regression problems, M/3 features (but not fewer than 5) are selected.
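
As a small worked example of this rule (the feature count M here is arbitrary):

import numpy as np

M = 100                             # total number of features
m_classification = int(np.sqrt(M))  # sqrt(M) = 10 features per split
m_regression = max(M // 3, 5)       # M / 3 = 33 features, but never fewer than 5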

4. Advantages and disadvantages

Advantages of random forest algorithm:

  1. It can produce classifiers with high accuracy for many kinds of data
  2. It can handle a large number of input variables
  3. It can assess the importance of variables when determining a classification
  4. While building the forest, it can internally generate an unbiased estimate of the generalization error
  5. It includes a good method for estimating missing data, and can maintain accuracy even when a large portion of the data is missing
  6. For unbalanced classification datasets, it can balance the error
  7. It can be extended to unlabeled data, typically using unsupervised clustering, and can also be used for detecting outliers and visualizing the data
  8. The learning process is fast

Disadvantages of random forest algorithm:

  1. It sacrifices the interpretability of a single decision tree
  2. It overfits on some noisy classification or regression problems
  3. In problems with multiple categorical variables, random forests may not be able to improve the accuracy of the base learner

5. Code implementation

Random forest classification using Python:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

class rfc:
    """
    Random forest classifier
    """

    def __init__(self, n_estimators = 100, random_state = 0):
        # Size of the random forest (number of trees)
        self.n_estimators = n_estimators
        # Random seed for the forest
        self.random_state = random_state

    def fit(self, X, y):
        """
        Fit the random forest classifier
        """
        self.y_classes = np.unique(y)
        # Array of decision trees
        dts = []
        n = X.shape[0]
        rs = np.random.RandomState(self.random_state)
        for i in range(self.n_estimators):
            # Create a decision tree classifier; "sqrt" considers sqrt(M)
            # features per split ("auto" was removed in scikit-learn 1.3)
            dt = DecisionTreeClassifier(random_state=rs.randint(np.iinfo(np.int32).max), max_features = "sqrt")
            # Simulate bootstrap sampling: draw n indices with replacement
            # and use their occurrence counts as sample weights
            dt.fit(X, y, sample_weight=np.bincount(rs.randint(0, n, n), minlength = n))
            dts.append(dt)
        self.trees = dts

    def predict(self, X):
        """
        Predict with the random forest classifier
        """
        # Accumulated class probabilities
        probas = np.zeros((X.shape[0], len(self.y_classes)))
        for i in range(self.n_estimators):
            # The i-th decision tree classifier
            dt = self.trees[i]
            # Accumulate each tree's predicted class probabilities
            probas += dt.predict_proba(X)
        # Average the class probabilities over all trees
        probas /= self.n_estimators
        # Return the class with the highest average probability
        return self.y_classes.take(np.argmax(probas, axis = 1), axis = 0)
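
A minimal usage sketch of the rfc class above, using scikit-learn's bundled iris dataset:

from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = rfc(n_estimators = 100, random_state = 0)
clf.fit(X, y)
# Predict the classes of the first five samples
print(clf.predict(X[:5]))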

Implement random forest regression using Python:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

class rfr:
    """
    Random forest regressor
    """

    def __init__(self, n_estimators = 100, random_state = 0):
        # Size of the random forest (number of trees)
        self.n_estimators = n_estimators
        # Random seed for the forest
        self.random_state = random_state

    def fit(self, X, y):
        """
        Fit the random forest regressor
        """
        # Array of decision trees
        dts = []
        n = X.shape[0]
        rs = np.random.RandomState(self.random_state)
        for i in range(self.n_estimators):
            # Create a decision tree regressor; None considers all M features
            # per split, matching the old "auto" behavior for regressors
            # removed in scikit-learn 1.3
            dt = DecisionTreeRegressor(random_state=rs.randint(np.iinfo(np.int32).max), max_features = None)
            # Simulate bootstrap sampling: draw n indices with replacement
            # and use their occurrence counts as sample weights
            dt.fit(X, y, sample_weight=np.bincount(rs.randint(0, n, n), minlength = n))
            dts.append(dt)
        self.trees = dts

    def predict(self, X):
        """
        Predict with the random forest regressor
        """
        # Accumulated predictions
        ys = np.zeros(X.shape[0])
        for i in range(self.n_estimators):
            # The i-th decision tree regressor
            dt = self.trees[i]
            # Accumulate each tree's predictions
            ys += dt.predict(X)
        # Average the predictions over all trees
        ys /= self.n_estimators
        return ys
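
Likewise, a usage sketch of the rfr class, using scikit-learn's bundled diabetes regression dataset:

from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
reg = rfr(n_estimators = 100, random_state = 0)
reg.fit(X, y)
# Predict the targets of the first five samples
print(reg.predict(X[:5]))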

6. Third-party library implementation

Random forest classification implemented with scikit-learn 3:

from sklearn.ensemble import RandomForestClassifier

# Random forest classifier
clf = RandomForestClassifier(n_estimators = 100, random_state = 0)
# Fit the dataset
clf = clf.fit(X, y)
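
Once fitted, predictions go through the standard scikit-learn interface, for example (with X as the same, assumed-predefined dataset):

# Class predictions and averaged class probabilities for the same samples
pred = clf.predict(X)
proba = clf.predict_proba(X)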

Random forest regression implemented with scikit-learn 4:

from sklearn.ensemble import RandomForestRegressor

# Random forest regressor
reg = RandomForestRegressor(n_estimators = 100, random_state = 0)
# Fit the dataset
reg = reg.fit(X, y)
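
The regressor is used the same way: predict returns continuous values, and score reports the coefficient of determination R²:

# Continuous predictions and the R^2 score on the training data
pred = reg.predict(X)
print(reg.score(X, y))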

7. Animation demonstration

Figures 7-1 and 7-2 show the results of classification and regression with the random forest algorithm, respectively; Figures 7-3 and 7-4 show the results of classification and regression with the decision tree learning algorithm from the previous section. It can be seen that, compared with the unregularized decision trees of the previous section, the random forest's predictions are noticeably more stable.

Figure 7-1

Figure 7-2

Figure 7-3

Figure 7-4

8. Mind map

Figure 8-1

9. References

  1. https://en.wikipedia.org/wiki/Ensemble_learning
  2. https://en.wikipedia.org/wiki/Random_forest
  3. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
  4. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html


Note: This article strives to be accurate and easy to understand, but since the author is also a beginner with limited knowledge, readers are urged to point out and correct any errors or omissions in the text.

This article was first published on AI Map; you are welcome to follow it.

