Background knowledge needed for this article: the decision tree learning algorithm and a little programming knowledge
1. Introduction
In the previous section, we learned about a simple and efficient algorithm, the decision tree learning algorithm. In this section, we will look at an ensemble learning[1] algorithm built on top of it: the random forest algorithm[2].
2. Model Introduction
There is an idiom, "brainstorming," which refers to pooling the wisdom of many people and drawing widely on their useful opinions. Machine learning has a similar idea, known as ensemble learning.
Ensemble Learning
Ensemble learning trains multiple estimators; when a prediction is needed, the results of the individual estimators are combined by a combiner and output as the final result.
Figure 2-1
Figure 2-1 shows the basic flow of ensemble learning.
The advantage of ensemble learning is that it improves the generalization ability and robustness of a single estimator and usually delivers better predictive performance than any single estimator. Another appealing property of ensemble learning is that it parallelizes easily.
Bagging Algorithm
The Bagging algorithm is an ensemble learning algorithm. Its steps are as follows: given a training set of size N, draw a sub-dataset of size M from it by sampling with replacement, and repeat this K times to obtain K sub-datasets. Train one model on each of the K sub-datasets, for K models in total. At prediction time, run all K models and take the average (for regression) or the majority class (for classification) as the final prediction.
Figure 2-2
Figure 2-2 shows how the Bagging algorithm takes sub-datasets.
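To make the sampling step concrete, here is a minimal sketch of drawing one bootstrap sub-dataset with NumPy. The arrays X and y, the helper name bootstrap_sample, and the sub-dataset size are assumptions made for this illustration only.

import numpy as np

def bootstrap_sample(X, y, size, rng):
    """Draw one sub-dataset of the given size by sampling with replacement."""
    # Indices are drawn with replacement, so a sample may appear more than once
    idx = rng.randint(0, X.shape[0], size)
    return X[idx], y[idx]

rng = np.random.RandomState(0)
X = rng.rand(10, 3)        # toy feature matrix: N = 10 samples, 3 features
y = rng.randint(0, 2, 10)  # toy binary labels
X_sub, y_sub = bootstrap_sample(X, y, size=10, rng=rng)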
Random Forest Algorithm
The algorithm combines multiple decision trees: each tree is trained on a dataset sampled randomly with replacement, and only a randomly selected subset of the features is used as input, which is why it is called the random forest algorithm. In other words, a random forest is a Bagging algorithm whose estimators are decision trees.
Figure 2-3
Figure 2-3 shows the overall flow of the random forest algorithm. For classification, the combiner takes the majority vote of the trees' predictions as the final result; for regression, it takes the average of the trees' predictions.
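A minimal sketch of these two combiners, assuming preds holds one row of predictions per tree for the same samples and that class labels are non-negative integers (the helper names are hypothetical):

import numpy as np

def combine_classification(preds):
    """Majority vote over the rows (one row per tree)."""
    # For each sample (column), pick the most frequently predicted class
    return np.array([np.bincount(col).argmax() for col in preds.T])

def combine_regression(preds):
    """Average over the rows (one row per tree)."""
    return preds.mean(axis=0)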
Using the Bagging approach reduces overfitting and leads to better performance. A single decision tree is very sensitive to noise in the training set, but Bagging lowers the correlation between the trained decision trees, which effectively alleviates this problem.
3. Algorithm Steps
Assuming that the size of the training set T is N, the number of features is M, and the size of the random forest is K, the specific steps of the random forest algorithm are as follows:
Repeat the following K times, once for each tree in the forest:
- Sample N times with replacement from the training set T to form a new sub-training set D
- Randomly select m features, where m < M
- Train a complete decision tree on D using only the m selected features
The K trained decision trees together form the random forest.
On the choice of m in the algorithm above: for classification problems, √M features are commonly considered at each split; for regression problems, M/3 features are used, but no fewer than 5.
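As a small illustration of this rule of thumb, a sketch of the feature count per split (the function name and its arguments are assumptions for this example):

import math

def n_split_features(M, task="classification"):
    """Number of features m to consider at each split, per the rule above."""
    if task == "classification":
        return max(1, int(math.sqrt(M)))  # roughly sqrt(M)
    return min(M, max(5, M // 3))         # M / 3, but no fewer than 5 (capped at M)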
4. Advantages and Disadvantages
Advantages of the random forest algorithm:
- For many kinds of data, it can produce a highly accurate classifier
- It can handle a large number of input variables
- It can assess the importance of each variable when determining the class
- While building the forest, it can produce an internal, unbiased estimate of the generalization error
- It includes a good method for estimating missing data and maintains accuracy even when a large proportion of the data is missing
- For imbalanced classification datasets, it can balance the error
- It can be extended to unlabeled data, usually via unsupervised clustering, and can also be used to detect outliers and to explore the data
- The learning process is fast
Disadvantages of the random forest algorithm:
- It sacrifices the interpretability of a single decision tree
- It can overfit on some noisy classification or regression problems
- On problems with multiple categorical variables, random forests may not be able to improve the accuracy of the base learners
5. Code Implementation
Random forest classification using Python:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class rfc:
    """
    Random forest classifier
    """

    def __init__(self, n_estimators=100, random_state=0):
        # Size of the random forest (number of trees)
        self.n_estimators = n_estimators
        # Random seed of the forest
        self.random_state = random_state

    def fit(self, X, y):
        """
        Fit the random forest classifier
        """
        self.y_classes = np.unique(y)
        # Array of decision trees
        dts = []
        n = X.shape[0]
        rs = np.random.RandomState(self.random_state)
        for i in range(self.n_estimators):
            # Create a decision tree classifier; "sqrt" considers √M features per split
            # (the equivalent of the deprecated max_features="auto" for classifiers)
            dt = DecisionTreeClassifier(random_state=rs.randint(np.iinfo(np.int32).max), max_features="sqrt")
            # Fit the dataset with randomly generated sample weights,
            # which is equivalent to bootstrap sampling with replacement
            dt.fit(X, y, sample_weight=np.bincount(rs.randint(0, n, n), minlength=n))
            dts.append(dt)
        self.trees = dts

    def predict(self, X):
        """
        Predict with the random forest classifier
        """
        # Accumulated class probabilities
        probas = np.zeros((X.shape[0], len(self.y_classes)))
        for i in range(self.n_estimators):
            # The i-th decision tree classifier
            dt = self.trees[i]
            # Accumulate the predicted class probabilities
            probas += dt.predict_proba(X)
        # Average the class probabilities over the trees
        probas /= self.n_estimators
        # Return the class with the highest average probability
        return self.y_classes.take(np.argmax(probas, axis=1), axis=0)
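A quick usage sketch of the classifier above; the iris dataset from scikit-learn is used purely for illustration:

from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = rfc(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))  # predicted classes for the first five samples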
Random forest regression using Python:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class rfr:
    """
    Random forest regressor
    """

    def __init__(self, n_estimators=100, random_state=0):
        # Size of the random forest (number of trees)
        self.n_estimators = n_estimators
        # Random seed of the forest
        self.random_state = random_state

    def fit(self, X, y):
        """
        Fit the random forest regressor
        """
        # Array of decision trees
        dts = []
        n = X.shape[0]
        rs = np.random.RandomState(self.random_state)
        for i in range(self.n_estimators):
            # Create a decision tree regressor; max_features=None considers all features
            # per split (the equivalent of the deprecated max_features="auto" for regressors)
            dt = DecisionTreeRegressor(random_state=rs.randint(np.iinfo(np.int32).max), max_features=None)
            # Fit the dataset with randomly generated sample weights,
            # which is equivalent to bootstrap sampling with replacement
            dt.fit(X, y, sample_weight=np.bincount(rs.randint(0, n, n), minlength=n))
            dts.append(dt)
        self.trees = dts

    def predict(self, X):
        """
        Predict with the random forest regressor
        """
        # Accumulated predictions
        ys = np.zeros(X.shape[0])
        for i in range(self.n_estimators):
            # The i-th decision tree regressor
            dt = self.trees[i]
            # Accumulate the trees' predictions
            ys += dt.predict(X)
        # Average the predictions over the trees
        ys /= self.n_estimators
        return ys
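A corresponding usage sketch for the regressor, with a synthetic dataset generated by scikit-learn purely for illustration:

from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=10, random_state=0)
reg = rfr(n_estimators=100, random_state=0)
reg.fit(X, y)
print(reg.predict(X[:5]))  # averaged predictions of the individual trees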
6. Third-Party Library Implementation
Random forest classification with scikit-learn[3]:
from sklearn.ensemble import RandomForestClassifier

# Random forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=0)
# Fit the dataset
clf = clf.fit(X, y)
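Once fitted, the scikit-learn classifier can be used directly for prediction and also exposes per-feature importances; a brief sketch, assuming X is the same feature matrix used for fitting:

# Predicted classes for the given samples
y_pred = clf.predict(X)
# Impurity-based importance of each feature, averaged over the trees
print(clf.feature_importances_)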
Random forest regression with scikit-learn[4]:
from sklearn.ensemble import RandomForestRegressor

# Random forest regressor
clf = RandomForestRegressor(n_estimators=100, random_state=0)
# Fit the dataset
clf = clf.fit(X, y)
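Similarly, the fitted regressor can predict new values and report its R² score; a brief sketch, assuming X and y are the data used above:

# Averaged regression predictions of the forest
y_pred = clf.predict(X)
# Coefficient of determination (R²) on the given data
print(clf.score(X, y))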
7. Animation Demonstration
Figures 7-1 and 7-2 show the results of classification and regression with the random forest algorithm, while Figures 7-3 and 7-4 show the corresponding results of the decision tree learning algorithm from the previous section. Compared with the unregularized decision trees of the previous section, the random forest's predictions are noticeably more stable.
Figure 7-1
Figure 7-2
Figure 7-3
Figure 7-4
8. Mind Map
Figure 8-1
9. References
- [1] Ensemble learning. https://en.wikipedia.org/wiki/Ensemble_learning
- [2] Random forest. https://en.wikipedia.org/wiki/Random_forest
- [3] sklearn.ensemble.RandomForestClassifier. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- [4] sklearn.ensemble.RandomForestRegressor. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
For the full demo, please click here.
Note: This article strives to be accurate and easy to understand, but since the author is also a beginner with limited knowledge, readers are kindly asked to point out any errors or omissions in the text.
This article was first published on AI Map; you are welcome to follow it.