Background knowledge needed for this article: the decision tree learning algorithm and a little programming knowledge
1. Introduction
In the previous section, we learned about a simple and efficient algorithm, the decision tree learning algorithm. In this section, we will look at an ensemble learning[1] algorithm built on top of it: the random forest algorithm[2].
2. Model Introduction
There is an idiom, "brainstorming," which refers to pooling the wisdom of many people and drawing widely on their useful opinions. Machine learning has a similar idea, known as ensemble learning.
Ensemble Learning
Ensemble learning trains multiple estimators; when a prediction is needed, the results of the individual estimators are combined by a combiner and output as the final result.
Figure 2-1
Figure 2-1 shows the basic flow of ensemble learning.
The advantage of ensemble learning is that it improves the generalization ability and robustness of a single estimator and usually delivers better predictive performance than any single estimator. Another appealing property of ensemble learning is that it parallelizes easily.
Bagging Algorithm
The Bagging algorithm is an ensemble learning algorithm. Its steps are as follows: given a training set of size N, draw a sub-dataset of size M from it by sampling with replacement, and repeat this K times to obtain K sub-datasets. Train one model on each of the K sub-datasets, for K models in total. At prediction time, run all K models and take the average (for regression) or the majority class (for classification) as the final prediction.
Figure 2-2
Figure 2-2 shows how the Bagging algorithm takes sub-datasets.
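To make the sampling step concrete, here is a minimal sketch of drawing one bootstrap sub-dataset with NumPy. The arrays X and y, the helper name bootstrap_sample, and the sub-dataset size are assumptions made for this illustration only.

import numpy as np

def bootstrap_sample(X, y, size, rng):
    """Draw one sub-dataset of the given size by sampling with replacement."""
    # Indices are drawn with replacement, so a sample may appear more than once
    idx = rng.randint(0, X.shape[0], size)
    return X[idx], y[idx]

rng = np.random.RandomState(0)
X = rng.rand(10, 3)        # toy feature matrix: N = 10 samples, 3 features
y = rng.randint(0, 2, 10)  # toy binary labels
X_sub, y_sub = bootstrap_sample(X, y, size=10, rng=rng)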
Random Forest Algorithm
The algorithm combines multiple decision trees: each tree is trained on a dataset sampled randomly with replacement, and only a randomly selected subset of the features is used as input, which is why it is called the random forest algorithm. In other words, a random forest is a Bagging algorithm whose estimators are decision trees.
Figure 2-3
Figure 2-3 shows the overall flow of the random forest algorithm. For classification, the combiner takes the majority vote of the trees' predictions as the final result; for regression, it takes the average of the trees' predictions.
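A minimal sketch of these two combiners, assuming preds holds one row of predictions per tree for the same samples and that class labels are non-negative integers (the helper names are hypothetical):

import numpy as np

def combine_classification(preds):
    """Majority vote over the rows (one row per tree)."""
    # For each sample (column), pick the most frequently predicted class
    return np.array([np.bincount(col).argmax() for col in preds.T])

def combine_regression(preds):
    """Average over the rows (one row per tree)."""
    return preds.mean(axis=0)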
Using the Bagging approach reduces overfitting and leads to better performance. A single decision tree is very sensitive to noise in the training set, but Bagging lowers the correlation between the trained decision trees, which effectively alleviates this problem.
3. Algorithm Steps
Assuming that the size of the training set T is N, the number of features is M, and the size of the random forest is K, the specific steps of the random forest algorithm are as follows:
Repeat the following K times, once for each tree in the forest:
- Sample N times with replacement from the training set T to form a new sub-training set D
- Randomly select m features, where m < M
- Train a complete decision tree on D using only the m selected features
The K trained decision trees together form the random forest.
On the choice of m in the algorithm above: for classification problems, √M features are commonly considered at each split; for regression problems, M/3 features are used, but no fewer than 5.
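As a small illustration of this rule of thumb, a sketch of the feature count per split (the function name and its arguments are assumptions for this example):

import math

def n_split_features(M, task="classification"):
    """Number of features m to consider at each split, per the rule above."""
    if task == "classification":
        return max(1, int(math.sqrt(M)))  # roughly sqrt(M)
    return min(M, max(5, M // 3))         # M / 3, but no fewer than 5 (capped at M)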
4. Advantages and Disadvantages
Advantages of the random forest algorithm:
- For many kinds of data, it can produce a highly accurate classifier
- It can handle a large number of input variables
- It can assess the importance of each variable when determining the class
- While building the forest, it can produce an internal, unbiased estimate of the generalization error
- It includes a good method for estimating missing data and maintains accuracy even when a large proportion of the data is missing
- For imbalanced classification datasets, it can balance the error
- It can be extended to unlabeled data, usually via unsupervised clustering, and can also be used to detect outliers and to explore the data
- The learning process is fast
Disadvantages of the random forest algorithm:
- It sacrifices the interpretability of a single decision tree
- It can overfit on some noisy classification or regression problems
- On problems with multiple categorical variables, random forests may not be able to improve the accuracy of the base learners
5. Code Implementation
Random forest classification using Python:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class rfc:
    """
    Random forest classifier
    """

    def __init__(self, n_estimators=100, random_state=0):
        # Size of the random forest (number of trees)
        self.n_estimators = n_estimators
        # Random seed of the forest
        self.random_state = random_state

    def fit(self, X, y):
        """
        Fit the random forest classifier
        """
        self.y_classes = np.unique(y)
        # Array of decision trees
        dts = []
        n = X.shape[0]
        rs = np.random.RandomState(self.random_state)
        for i in range(self.n_estimators):
            # Create a decision tree classifier; "sqrt" considers √M features per split
            # (the equivalent of the deprecated max_features="auto" for classifiers)
            dt = DecisionTreeClassifier(random_state=rs.randint(np.iinfo(np.int32).max), max_features="sqrt")
            # Fit the dataset with randomly generated sample weights,
            # which is equivalent to bootstrap sampling with replacement
            dt.fit(X, y, sample_weight=np.bincount(rs.randint(0, n, n), minlength=n))
            dts.append(dt)
        self.trees = dts

    def predict(self, X):
        """
        Predict with the random forest classifier
        """
        # Accumulated class probabilities
        probas = np.zeros((X.shape[0], len(self.y_classes)))
        for i in range(self.n_estimators):
            # The i-th decision tree classifier
            dt = self.trees[i]
            # Accumulate the predicted class probabilities
            probas += dt.predict_proba(X)
        # Average the class probabilities over the trees
        probas /= self.n_estimators
        # Return the class with the highest average probability
        return self.y_classes.take(np.argmax(probas, axis=1), axis=0)
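A quick usage sketch of the classifier above; the iris dataset from scikit-learn is used purely for illustration:

from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = rfc(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))  # predicted classes for the first five samples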
Random forest regression using Python:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class rfr:
    """
    Random forest regressor
    """

    def __init__(self, n_estimators=100, random_state=0):
        # Size of the random forest (number of trees)
        self.n_estimators = n_estimators
        # Random seed of the forest
        self.random_state = random_state

    def fit(self, X, y):
        """
        Fit the random forest regressor
        """
        # Array of decision trees
        dts = []
        n = X.shape[0]
        rs = np.random.RandomState(self.random_state)
        for i in range(self.n_estimators):
            # Create a decision tree regressor; max_features=None considers all features
            # per split (the equivalent of the deprecated max_features="auto" for regressors)
            dt = DecisionTreeRegressor(random_state=rs.randint(np.iinfo(np.int32).max), max_features=None)
            # Fit the dataset with randomly generated sample weights,
            # which is equivalent to bootstrap sampling with replacement
            dt.fit(X, y, sample_weight=np.bincount(rs.randint(0, n, n), minlength=n))
            dts.append(dt)
        self.trees = dts

    def predict(self, X):
        """
        Predict with the random forest regressor
        """
        # Accumulated predictions
        ys = np.zeros(X.shape[0])
        for i in range(self.n_estimators):
            # The i-th decision tree regressor
            dt = self.trees[i]
            # Accumulate the trees' predictions
            ys += dt.predict(X)
        # Average the predictions over the trees
        ys /= self.n_estimators
        return ys
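A corresponding usage sketch for the regressor, with a synthetic dataset generated by scikit-learn purely for illustration:

from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=10, random_state=0)
reg = rfr(n_estimators=100, random_state=0)
reg.fit(X, y)
print(reg.predict(X[:5]))  # averaged predictions of the individual trees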
6. Third-Party Library Implementation
Random forest classification with scikit-learn[3]:
from sklearn.ensemble import RandomForestClassifier

# Random forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=0)
# Fit the dataset
clf = clf.fit(X, y)
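Once fitted, the scikit-learn classifier can be used directly for prediction and also exposes per-feature importances; a brief sketch, assuming X is the same feature matrix used for fitting:

# Predicted classes for the given samples
y_pred = clf.predict(X)
# Impurity-based importance of each feature, averaged over the trees
print(clf.feature_importances_)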
Random forest regression with scikit-learn[4]:
from sklearn.ensemble import RandomForestRegressor

# Random forest regressor
clf = RandomForestRegressor(n_estimators=100, random_state=0)
# Fit the dataset
clf = clf.fit(X, y)
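Similarly, the fitted regressor can predict new values and report its R² score; a brief sketch, assuming X and y are the data used above:

# Averaged regression predictions of the forest
y_pred = clf.predict(X)
# Coefficient of determination (R²) on the given data
print(clf.score(X, y))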
7. Animation Demonstration
Figures 7-1 and 7-2 show the results of classification and regression with the random forest algorithm, while Figures 7-3 and 7-4 show the corresponding results of the decision tree learning algorithm from the previous section. Compared with the unregularized decision trees of the previous section, the random forest's predictions are noticeably more stable.
Figure 7-1
Figure 7-2
Figure 7-3
Figure 7-4
8. Mind Map
Figure 8-1
9. References
- [1] Ensemble learning. https://en.wikipedia.org/wiki/Ensemble_learning
- [2] Random forest. https://en.wikipedia.org/wiki/Random_forest
- [3] sklearn.ensemble.RandomForestClassifier. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- [4] sklearn.ensemble.RandomForestRegressor. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
For the full demo, please click here.
Note: This article strives to be accurate and easy to understand, but since the author is also a beginner with limited knowledge, readers are kindly asked to point out any errors or omissions in the text.
This article was first published on AI Map; you are welcome to follow it.