ID3: information gain
C4.5: information gain ratio
CART: Gini index
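The split criteria above can be sketched in a few lines of NumPy; the toy labels below are made up purely for illustration:

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a label array (used by ID3/C4.5)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(y):
    """Gini impurity of a label array (used by CART)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

y = np.array([0, 0, 0, 1, 1, 1])                     # parent node: 3 vs 3
left, right = np.array([0, 0, 0]), np.array([1, 1, 1])

# Information gain = parent entropy - weighted average child entropy
gain = entropy(y) - (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
print(gain)   # a perfect split recovers the full 1 bit of entropy
```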

# Pros and Cons of Decision Trees

• Its biggest advantage is that it is easy to interpret.

• It accepts both categorical and numerical data; no preprocessing or normalization is required.

• It allows uncertain outcomes: when a leaf node holds several possible result values and cannot be split further, the counts can be tallied to estimate a probability.

• The algorithm works well for problems with only a few possible outcomes; on a dataset with a large number of possible outcomes, the tree becomes extremely complex and prediction quality may degrade sharply.

• Although it can handle simple numerical data, it can only create nodes with "greater than / less than" conditions. If the classification depends on a complex combination of several variables, classifying with a decision tree becomes difficult. For example, if the result is determined by the difference of two variables, the tree will grow enormous and its prediction accuracy will fall off quickly.

```r
library(rpart)
library(rpart.plot)
library(dplyr)  # needed for %>% and mutate()

age1 <- as.integer(runif(1000, min=18, max=30))
age2 <- as.integer(runif(1000, min=18, max=30))

df <- data.frame(cbind(age1, age2))

df <- df %>% dplyr::mutate(diff=age1-age2, label = diff >= 0 & diff <= 5)

ct <- rpart.control(xval=10, minsplit=20, cp=0.01)
cfit <- rpart(label~age1+age2,
              data=df, method="class", control=ct,
              parms=list(split="gini"))
print(cfit)

rpart.plot(cfit, branch=1, branch.type=2, type=1, extra=102,
           border.col="blue", split.col="red",
           split.cex=1.2, main="Decision Tree")

# Splitting directly on the engineered feature diff yields a far smaller tree
cfit <- rpart(label~diff,
              data=df, method="class", control=ct,
              parms=list(split="gini"))
print(cfit)

rpart.plot(cfit, branch=1, branch.type=2, type=1, extra=102,
           border.col="blue", split.col="red",
           split.cex=1.2, main="Decision Tree")
```

# Random Forest: Theory

From the sklearn official documentation:

Each tree in the ensemble is built from a sample drawn with replacement (bootstrap sample) from the training set. When splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features.

As a result of this randomness, the bias of the forest usually slightly increases with respect to the bias of a single non-random tree, but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.

In contrast to the original publication, the sklearn implementation combines classifiers by averaging their probabilistic prediction, instead of letting each classifier vote for a single class.
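This soft-voting behavior can be observed directly by averaging each tree's `predict_proba` by hand and comparing against the forest's output; the dataset and parameters below are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy binary-classification dataset (parameters chosen only for illustration)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# The forest's probability is the mean of each tree's class probabilities
# (soft voting), not a majority vote over hard class labels.
per_tree = np.stack([t.predict_proba(X) for t in clf.estimators_])
avg_proba = per_tree.mean(axis=0)

assert np.allclose(avg_proba, clf.predict_proba(X))
```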

# Random Forest: Implementation

```python
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X, Y)
```

# Parameter Tuning

From the sklearn documentation:

• `n_estimators`: the number of trees in the forest

• `max_features`: the size of the random subsets of features to consider when splitting a node. Default values: `max_features=n_features` for regression problems, and `max_features=sqrt(n_features)` for classification tasks.

• `n_jobs=k`: computations are partitioned into k jobs and run on k cores of the machine. If `n_jobs=-1`, all cores available on the machine are used.
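As one way to exercise these knobs together, here is a hypothetical grid search over `n_estimators` and `max_features`, run on all cores via `n_jobs=-1` (grid values and dataset are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical grid, just to illustrate the parameters discussed above
param_grid = {
    "n_estimators": [50, 100],
    "max_features": ["sqrt", None],  # None means consider all n_features
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    n_jobs=-1,  # use every available core
)
search.fit(X, y)
print(search.best_params_)
```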

# Feature Importance Evaluation

From the sklearn official documentation:

The depth of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features.

By averaging those expected activity rates over several randomized trees one can reduce the variance of such an estimate and use it for feature selection.

In practice those estimates are stored as an attribute named feature_importances_ on the fitted model. This is an array with shape (n_features,) whose values are positive and sum to 1.0. The higher the value, the more important is the contribution of the matching feature to the prediction function.
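A minimal check of those properties (the dataset parameters below are made up for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 3 informative features out of 6 (hypothetical setup)
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

imp = clf.feature_importances_
assert imp.shape == (6,)           # one value per feature
assert np.isclose(imp.sum(), 1.0)  # values are non-negative and sum to 1
```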

From StackOverflow:

1. You initialize an array `feature_importances` of all zeros with size `n_features`.

2. You traverse the tree: for each internal node that splits on feature i you compute the error reduction of that node multiplied by the number of samples that were routed to the node and add this quantity to `feature_importances[i]`.

The error reduction depends on the impurity criterion that you use (e.g. Gini, Entropy). It's the impurity of the set of observations that gets routed to the internal node minus the sum of the impurities of the two partitions created by the split.
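The two steps above can be replayed against a fitted sklearn tree using its internal `tree_` arrays (sklearn implementation details; the final normalization matches `feature_importances_`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
tree = clf.tree_

# Step 1: start from an all-zeros array of size n_features
imp = np.zeros(X.shape[1])

# Step 2: for each internal node, accumulate the impurity reduction
# weighted by the number of samples routed to that node
for n in range(tree.node_count):
    left, right = tree.children_left[n], tree.children_right[n]
    if left == -1:  # leaf node: no split, nothing to add
        continue
    gain = (tree.weighted_n_node_samples[n] * tree.impurity[n]
            - tree.weighted_n_node_samples[left] * tree.impurity[left]
            - tree.weighted_n_node_samples[right] * tree.impurity[right])
    imp[tree.feature[n]] += gain

imp /= imp.sum()  # normalize so the importances sum to 1

# Agrees with sklearn's built-in attribute
assert np.allclose(imp, clf.feature_importances_)
```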
