Machine Learning: Random Forest Study Notes

丹追兵

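Preface

A random forest is a powerful model that produces its final result by letting a group of decision trees vote. To understand random forests properly, you first need to understand decision trees, and then how a random forest improves model performance by ensembling many trees.

The purpose of this post is to collect in one place the material I found useful while learning this model.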

Decision Tree Basics

Key points about decision trees

ID3: information gain
C4.5: information gain ratio
CART: Gini index
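
To make the three splitting criteria concrete, here is a minimal sketch that computes each of them for one categorical feature. The toy data and the helper names (`entropy`, `gini`, `information_gain`, `gain_ratio`, `gini_gain`) are illustrative assumptions, not taken from any library. Real CART uses binary splits; the multiway split here is only a simplification for comparison.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_by(values, labels):
    """Group the labels by the feature value of each row."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())

def information_gain(values, labels):
    """ID3 criterion: parent entropy minus weighted child entropy."""
    n = len(labels)
    children = split_by(values, labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in children)

def gain_ratio(values, labels):
    """C4.5 criterion: information gain divided by the split's own entropy."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0

def gini_gain(values, labels):
    """CART-style criterion: parent Gini minus weighted child Gini."""
    n = len(labels)
    children = split_by(values, labels)
    return gini(labels) - sum(len(g) / n * gini(g) for g in children)

# Hypothetical toy data: one categorical feature and a binary target.
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
play = ["no", "no", "yes", "no", "yes", "yes"]

print("information gain:", information_gain(outlook, play))
print("gain ratio:      ", gain_ratio(outlook, play))
print("Gini gain:       ", gini_gain(outlook, play))
```

The gain ratio divides by the entropy of the split itself, which penalizes features with many distinct values.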

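Pros and Cons of Decision Trees

Programming Collective Intelligence

Advantages:

  • The biggest advantage is that the model is easy to interpret.

  • It accepts both categorical and numerical data, with no need for preprocessing or normalization.

  • It allows uncertain outcomes: when a leaf node holds several possible result values but cannot be split further, the counts can be tallied to estimate a probability.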
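
The last point can be illustrated with a small sketch using scikit-learn's `DecisionTreeClassifier`. The toy data is an assumption made up for this example: rows with identical feature values carry conflicting labels, so the leaf cannot be split further and the class counts become a probability estimate.

```python
from sklearn.tree import DecisionTreeClassifier

# Four rows share the feature value 1 but disagree on the label,
# so no further split can separate them.
X = [[1], [1], [1], [1], [0], [0]]
y = [1, 1, 1, 0, 0, 0]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Each row of proba holds the class fractions of the leaf the sample falls into.
proba = tree.predict_proba([[1], [0]])
print(proba)
```

For the input `[1]`, three of the four matching training rows belong to class 1, so the leaf reports that fraction instead of a hard answer.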

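Disadvantages:

  • The algorithm works well on problems with only a few possible outcomes; on datasets with a large number of possible outcomes, the decision tree becomes extremely complex and its predictions may degrade considerably.

  • Although it can handle simple numerical data, it can only create nodes that test "greater than / less than" conditions. When the classification depends on a more complex combination of several variables, classifying with a decision tree becomes difficult. For example, if the outcome is determined by the difference between two variables, the tree becomes extremely large and its prediction accuracy drops quickly.

In short: decision trees are best suited to datasets that have clear cut points and consist of large amounts of both categorical and numerical data.

The book claims that when the outcome is determined by the difference between two variables, the tree becomes extremely large and its prediction accuracy drops quickly. We can test this with the following example: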

library(rpart)
library(rpart.plot)
library(dplyr)  # provides %>% and mutate()

set.seed(42)  # make the simulated data reproducible

# Two independent integer ages in [18, 29]
age1 <- as.integer(runif(1000, min=18, max=30))
age2 <- as.integer(runif(1000, min=18, max=30))

df <- data.frame(age1, age2)

# The label depends only on the difference between the two ages
df <- df %>% dplyr::mutate(diff=age1-age2, label = diff >= 0 & diff <= 5)

ct <- rpart.control(xval=10, minsplit=20, cp=0.01)

# Tree 1: predict the label from the two raw ages
cfit <- rpart(label~age1+age2,
              data=df, method="class", control=ct,
              parms=list(split="gini")
)
print(cfit)

rpart.plot(cfit, branch=1, branch.type=2, type=1, extra=102,
           shadow.col="gray", box.col="green",
           border.col="blue", split.col="red",
           split.cex=1.2, main="Decision Tree")

# Tree 2: predict the label from the difference itself
cfit <- rpart(label~diff,
              data=df, method="class", control=ct,
              parms=list(split="gini")
)
print(cfit)

rpart.plot(cfit, branch=1, branch.type=2, type=1, extra=102,
           shadow.col="gray", box.col="green",
           border.col="blue", split.col="red",
           split.cex=1.2, main="Decision Tree")

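Predicting from age1 and age2 gives the following decision tree:

[Figure: clipboard.png, decision tree fitted on age1 and age2]

Predicting from diff gives the following decision tree:

[Figure: clipboard.png, decision tree fitted on diff]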
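
Since the screenshots are not reproduced here, the same experiment can be approximated in Python. This is a sketch assuming scikit-learn's `DecisionTreeClassifier` in place of rpart, so the exact trees will differ from the ones in the screenshots. It compares the number of leaves of fully grown trees, and the training accuracy when both trees are limited to three leaves.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
age1 = rng.integers(18, 30, size=1000)  # integers in [18, 29]
age2 = rng.integers(18, 30, size=1000)
diff = age1 - age2
label = ((diff >= 0) & (diff <= 5)).astype(int)

X_raw = np.column_stack([age1, age2])  # the two raw ages
X_diff = diff.reshape(-1, 1)           # the derived difference only

# Fully grown trees: compare how many leaves each one needs.
full_raw = DecisionTreeClassifier(random_state=0).fit(X_raw, label)
full_diff = DecisionTreeClassifier(random_state=0).fit(X_diff, label)
print("leaves using age1, age2:", full_raw.get_n_leaves())
print("leaves using diff:      ", full_diff.get_n_leaves())

# Size-limited trees: compare training accuracy with at most 3 leaves.
small_raw = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0).fit(X_raw, label)
small_diff = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0).fit(X_diff, label)
acc_raw = small_raw.score(X_raw, label)
acc_diff = small_diff.score(X_diff, label)
print("3-leaf accuracy using age1, age2:", acc_raw)
print("3-leaf accuracy using diff:      ", acc_diff)
```

With diff as the input, two thresholds are enough to isolate the band 0 <= diff <= 5, whereas axis-aligned splits on the raw ages have to approximate a diagonal band with a staircase of many small regions.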

Random Forest Theory

sklearn official documentation

Each tree in the ensemble is built from a sample drawn with replacement (bootstrap sample) from the training set. When splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features.

As a result of this randomness, the bias of the forest usually slightly increases with respect to the bias of a single non-random tree, but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.

In contrast to the original publication, the sklearn implementation combines classifiers by averaging their probabilistic prediction, instead of letting each classifier vote for a single class.
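
The averaging described in the last paragraph can be checked directly. The sketch below, on a synthetic dataset chosen only for illustration, averages the class probabilities of the individual trees stored in `estimators_` and compares the result with the forest's own `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to have something to fit.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the per-tree class probabilities by hand.
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

matches = np.allclose(averaged, forest.predict_proba(X))
print("manual average equals forest.predict_proba:", matches)
```

Each tree in `estimators_` was fitted on its own bootstrap sample and considers a random feature subset at every split, so the individual trees disagree; the forest's prediction is the mean of their probability estimates.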

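Random Forest Implementation

from sklearn.ensemble import RandomForestClassifier

# Two training samples with two features each, and their class labels
X = [[0, 0], [1, 1]]
Y = [0, 1]

# The parameter is n_estimators (plural): the number of trees in the forest
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X, Y)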
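
The toy example above is too small to evaluate, so here is a slightly fuller sketch. The choice of the iris dataset, the 70/30 split, and `n_estimators=100` are assumptions made for illustration. It fits a forest, reads the out-of-bag estimate that the bootstrap sampling provides, and scores the model on held-out data.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# oob_score=True evaluates each tree on the training rows its bootstrap sample left out.
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)
first_row_proba = clf.predict_proba(X_test[:1])

print("out-of-bag accuracy:", clf.oob_score_)
print("held-out accuracy:  ", test_accuracy)
print("class probabilities for the first test row:", first_row_proba)
```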

Parameter Tuning

sklearn official website

The core parameters are n_estimators and max_features:

  • n_estimators: the number of trees in the forest

  • max_features: the size of the random subsets of features to consider when splitting a node. Default values: max_features=n_features for regression problems, and max_features=sqrt(n_features) for classification tasks.

Other parameters: Good results are often achieved when setting max_depth=None in combination with min_samples_split=2 (i.e., when fully developing the trees).

n_jobs=k: computations are partitioned into k jobs and run on k cores of the machine. If n_jobs=-1, all cores available on the machine are used.
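
One common way to tune these parameters is a cross-validated grid search. The sketch below uses `GridSearchCV`; the synthetic dataset and the candidate values in `param_grid` are arbitrary assumptions for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only to have something to tune on.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)

# Candidate values for the two core parameters.
param_grid = {
    "n_estimators": [10, 50],
    "max_features": ["sqrt", 0.5, None],  # None means all features
}

# n_jobs=-1 lets each forest use all available cores.
forest = RandomForestClassifier(n_jobs=-1, random_state=0)
search = GridSearchCV(forest, param_grid, cv=3)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```

More trees generally help until the score plateaus, at the cost of training time, so `n_estimators` is often chosen by budget, with `max_features` being the parameter actually searched.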

Feature Importance Evaluation

sklearn official documentation

The depth of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features.

By averaging those expected activity rates over several randomized trees one can reduce the variance of such an estimate and use it for feature selection.

In practice those estimates are stored as an attribute named feature_importances_ on the fitted model. This is an array with shape (n_features,) whose values are positive and sum to 1.0. The higher the value, the more important is the contribution of the matching feature to the prediction function.
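
A short sketch of reading `feature_importances_` from a fitted forest. The synthetic dataset is an assumption for illustration: with `shuffle=False`, `make_classification` places the two informative features in the first two columns, and the remaining four are noise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Columns 0 and 1 are informative; columns 2 to 5 are pure noise.
X, y = make_classification(n_samples=500, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_

print("shape:", importances.shape)
print("sum:  ", importances.sum())

# List the features from most to least important.
for idx in np.argsort(importances)[::-1]:
    print(f"feature {idx}: {importances[idx]:.3f}")
```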

StackOverflow

  1. You initialize an array feature_importances of all zeros with size n_features.

  2. You traverse the tree: for each internal node that splits on feature i you compute the error reduction of that node multiplied by the number of samples that were routed to the node and add this quantity to feature_importances[i].

The error reduction depends on the impurity criterion that you use (e.g. Gini, Entropy). It's the impurity of the set of observations that gets routed to the internal node minus the sum of the impurities of the two partitions created by the split.
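
The two steps above can be sketched for a single tree using the arrays exposed on a fitted tree's `tree_` attribute. This sketch rests on one assumption about the details: each child's impurity is weighted by the number of samples routed to it, and the totals are normalized to sum to 1 at the end. The result is compared against scikit-learn's own `feature_importances_`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
t = clf.tree_
n = t.weighted_n_node_samples  # samples routed to each node

# Step 1: one accumulator per feature, initialised to zero.
importances = np.zeros(X.shape[1])

# Step 2: visit every internal node and add its sample-weighted impurity reduction.
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # -1 marks a leaf, which has no split
        continue
    reduction = (n[node] * t.impurity[node]
                 - n[left] * t.impurity[left]
                 - n[right] * t.impurity[right])
    importances[t.feature[node]] += reduction

# Normalise so that the importances sum to 1.
importances /= importances.sum()

matches = np.allclose(importances, clf.feature_importances_)
print("manual importances:", importances)
print("matches clf.feature_importances_:", matches)
```

For a forest, the same per-tree quantity is averaged over all trees, which is what reduces the variance of the estimate.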

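About the author: 丹追兵 is a data analyst who programs in Python and R and works with Spark, Hadoop, Storm, and ODPS. This article comes from 丹追兵's pytrafficR column; when reposting, please credit the author and the source: https://segmentfault.com/blog...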
