Background knowledge required for this article: a little programming knowledge.
1. Introduction
In daily life, whenever mealtime comes around we silently ask ourselves, "What should we eat?" After a long day of work we may not want to go far, so we decide the restaurant should be no more than 200 meters away; then, looking at the twenty dollars in our wallet, we decide the meal should cost no more than twenty; and finally we order Lanzhou ramen. As this example shows, today's Lanzhou ramen is the result of a series of prior decisions.
<center> Figure 1-1 </center>
As shown in Figure 1-1, the decision process above can be represented as a binary tree, which is called a decision tree. In machine learning, a decision tree model like the one in Figure 1-1 can also be trained from a data set; this algorithm is called Decision Tree Learning 1 .
2. Introduction to the model
Model
A decision tree learning algorithm first requires a tree structure composed of internal nodes and leaf nodes: an internal node represents a dimension (feature), and a leaf node represents a classification. Nodes are connected under certain conditions, so a decision tree can be regarded as a collection of if...else... rules.
<center> Figure 2-1 </center>
Figure 2-1 shows the basic data structure of a decision tree and the decision process it encodes.
Feature selection
Since a decision has to be made, we must choose which dimension (feature) to decide on, such as the distance to the restaurant and the change in the wallet in the previous example. In machine learning we need a quantitative indicator for choosing the more appropriate feature, namely one that yields subsets of higher "purity" after the split. Three indicators are introduced to solve this problem: Information Gain, the Gini Index, and Mean Squared Error (MSE).
Information Gain
Equation 2-1 defines an indicator of the purity of a sample set, called Information Entropy, where D is the sample set, K is the number of classes in the sample set, and p_k is the proportion of samples of the k-th class. The smaller the value of Ent(D), the higher the purity of the sample set.
$$ \operatorname{Ent}(D)=-\sum_{k=1}^{K} p_{k} \log _{2} p_{k} $$
<center> Equation 2-1 </center>
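For intuition, here is a minimal sketch of Equation 2-1 in code (not part of the implementations in Section 5; the helper name entropy and the toy labels are only illustrative):

```python
import numpy as np

def entropy(labels):
    """Information entropy Ent(D) of a label array, per Equation 2-1."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# A toy sample set with three samples of class 0 and one of class 1
print(entropy(np.array([0, 0, 0, 1])))  # ~0.811; a perfectly pure set gives 0.0
```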
Equation 2-2 measures the effect of splitting the sample set on a discrete attribute and is called Information Gain, where D is the sample set, a is the discrete attribute, V is the number of possible values of a, and D^v is the subset of samples that take the v-th value.
$$ \operatorname{Gain}(D, a)=\operatorname{Ent}(D)-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right) $$
<center> Equation 2-2 </center>
When the attribute is continuous, its possible values are not limited the way a discrete attribute's are. In this case the values of the continuous attribute in the sample set can be sorted and the average of each adjacent pair used as a candidate split point. Rewriting Equation 2-2 gives Equation 2-3, where T_a is the set of candidate split points (pairwise averages), and D_t^v is a subset of the samples: v = - denotes the samples smaller than the split point t, and v = + denotes the samples larger than t. The largest information gain over all split points is taken as the information gain of the attribute.
$$ \begin{aligned} T_{a} &=\left\{\frac{a^{i}+a^{i+1}}{2} \mid 1 \leq i \leq n-1\right\} \\ \operatorname{Gain}(D, a) &=\max _{t \in T_{a}} \operatorname{Gain}(D, a, t) \\ &=\max _{t \in T_{a}} \operatorname{Ent}(D)-\sum_{v \in\{-,+\}} \frac{\left|D_{t}^{v}\right|}{|D|} \operatorname{Ent}\left(D_{t}^{v}\right) \end{aligned} $$
<center> Equation 2-3 </center>
The larger the value of Gain(D, a), the higher the purity of the sample set after splitting on this attribute. From this, the most suitable split attribute can be found, as shown in Equation 2-4:
$$ a_{\text {best }}=\underset{a}{\operatorname{argmax}} \operatorname{Gain}(D, a) $$
<center> Equation 2-4 </center>
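The following sketch illustrates Equations 2-3 and 2-4 for a single continuous attribute: sort the values, take the pairwise averages as candidate split points, and keep the split with the largest gain. The function names entropy and best_split_by_gain are only illustrative:

```python
import numpy as np

def entropy(labels):
    """Information entropy Ent(D), as in Equation 2-1."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split_by_gain(x, labels):
    """Try each pairwise average of x as a split point t and return the
    largest information gain together with the corresponding t."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2  # T_a: pairwise averages
    base = entropy(labels)                       # Ent(D)
    best_gain, best_t = -np.inf, None
    for t in candidates:
        lt, gt = labels[x < t], labels[x > t]
        gain = base - len(lt) / len(x) * entropy(lt) - len(gt) / len(x) * entropy(gt)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t

# x is one feature column, labels are the class labels of the samples
print(best_split_by_gain(np.array([1.0, 2.0, 3.0, 4.0]), np.array([0, 0, 1, 1])))  # (1.0, 2.5)
```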
Gini Index
Equation 2-5 is another indicator of the purity of a sample set, called the Gini value (Gini), where D is the sample set, K is the number of classes in the sample set, and p_k is the proportion of samples of the k-th class. The smaller the value of Gini(D), the higher the purity of the sample set.
$$ \operatorname{Gini}(D)=1-\sum_{k=1}^{K} p_{k}^{2} $$
<center> Equation 2-5 </center>
Equation 2-6 measures the effect of splitting the sample set on a discrete attribute and is called the Gini Index, where D is the sample set, a is the discrete attribute, V is the number of possible values of a, and D^v is the subset of samples that take the v-th value.
$$ \operatorname{Gini\_index}(D, a)=\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Gini}\left(D^{v}\right) $$
<center> Equation 2-6 </center>
As in Equation 2-3, take the pairwise averages of the continuous attribute's values as candidate split points and rewrite Equation 2-6 to obtain Equation 2-7, where T_a is the set of candidate split points and D_t^v is a subset of the samples: v = - denotes the samples smaller than the split point t, and v = + denotes the samples larger than t. The smallest Gini index over all split points is taken as the Gini index of the attribute.
$$ \operatorname{Gini\_index}(D, a)=\min _{t \in T_{a}} \sum_{v \in\{-,+\}} \frac{\left|D_{t}^{v}\right|}{|D|} \operatorname{Gini}\left(D_{t}^{v}\right) $$
<center> Equation 2-7 </center>
The smaller the value of Gini_index(D, a), the higher the purity of the sample set after splitting on this attribute. From this, the most suitable split attribute can be found, as shown in Equation 2-8:
$$ a_{\text {best }}=\underset{a}{\operatorname{argmin}} \operatorname{Gini\_index}(D, a) $$
<center> Equation 2-8 </center>
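Analogously, a minimal sketch of Equations 2-5 to 2-8 for a single continuous attribute (the function names gini and best_split_by_gini are only illustrative):

```python
import numpy as np

def gini(labels):
    """Gini value of a label array, per Equation 2-5."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split_by_gini(x, labels):
    """Return the smallest weighted Gini index over all pairwise-average
    split points of x, together with the corresponding split point."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2
    best_index, best_t = np.inf, None
    for t in candidates:
        lt, gt = labels[x < t], labels[x > t]
        index = len(lt) / len(x) * gini(lt) + len(gt) / len(x) * gini(gt)
        if index < best_index:
            best_index, best_t = index, t
    return best_index, best_t
```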
Mean Squared Error (MSE)
The first two indicators allow the decision tree to be used for classification problems. If the decision tree is used for regression, a different indicator is needed to choose the split feature: the mean squared error (MSE) shown in Equation 2-9, where T_a is the set of candidate split points and y_t^v are the labels of a subset of the samples: v = - denotes the labels of the samples smaller than the split point t, and v = + denotes the labels of the samples larger than t. The term with a hat is the mean of the corresponding subset's labels.
$$ \operatorname{MSE}(D, a)=\min _{t \in T_{a}} \sum_{v \in\{-,+\}}\left(y_{t}^{v}-\hat{y_{t}^{v}}\right)^{2} $$
<center> Equation 2-9 </center>
The smaller the value of MSE(D, a), the better the decision tree fits the sample set. From this, the most suitable split attribute can be found, as shown in Equation 2-10:
$$ a_{\text {best }}=\underset{a}{\operatorname{argmin}} \operatorname{MSE}(D, a) $$
<center> Equation 2-10 </center>
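For regression, the corresponding sketch of Equations 2-9 and 2-10 picks the split point whose two subsets have the smallest total squared error around their own means (the function name best_split_by_mse is only illustrative):

```python
import numpy as np

def best_split_by_mse(x, targets):
    """Return the smallest total squared error over all pairwise-average
    split points of x, together with the corresponding split point."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2
    best_err, best_t = np.inf, None
    for t in candidates:
        lt, gt = targets[x < t], targets[x > t]
        err = np.sum((lt - lt.mean()) ** 2) + np.sum((gt - gt.mean()) ** 2)
        if err < best_err:
            best_err, best_t = err, t
    return best_err, best_t
```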
Now that we know the decision tree's data structure and how to find the best split of the data set, let's look at how to generate a decision tree.
3. Algorithm steps
Since the data structure of a decision tree is a tree, its child nodes are themselves trees, so a decision tree can be generated recursively. The steps are as follows:

1. Generate a new node;
2. If only one class C exists in the sample set, mark the node as a leaf of class C and return it;
3. Iterate over all features and compute each feature's information gain, Gini index, or mean squared error;
4. Record the best split feature in the node;
5. After splitting on the best feature, the left part is built by a recursive call to the current method and becomes the node's left child;
6. Likewise, the right part is built by a recursive call and becomes the node's right child;
7. Return the node.
4. Regularization
When a decision tree is generated recursively without any limit, the model classifies the training data very accurately, but its performance on unseen data will not be ideal. This is the so-called overfitting phenomenon. As with the solution to overfitting in linear regression earlier, the model can be regularized.
Depth of decision tree
Limiting the maximum depth of the decision tree has a regularizing effect and prevents overfitting. We only need to add a parameter to the algorithm that records the depth of the current recursion: when the preset maximum depth is reached, no new child nodes are generated; the current node is marked with the class that accounts for the largest proportion of its samples, and the current recursion returns.
Leaf size of decision tree
Another way to regularize a decision tree is to limit the minimum number of samples contained in a leaf node, which also prevents overfitting. When the number of samples in a node falls below this limit, the current node is marked with the class that accounts for the largest proportion of its samples, and the current recursion returns.
Pruning of decision trees
A decision tree can also be prevented from overfitting by pruning it, that is, cutting off redundant subtrees. There are two pruning methods: pre-pruning and post-pruning.
Pre-pruning
As the name suggests, pre-pruning decides whether to generate child nodes while the decision tree is being built. The judgment uses a validation data set to compare the accuracy with and without the child nodes: if generating the child nodes improves accuracy, they are generated; otherwise they are not.
<center> Figure 4-1 Image from Zhou Zhihua's "Machine Learning" </center>
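The core of the judgment is just an accuracy comparison on the validation set. Below is a minimal sketch of that comparison only, not of the full pruning procedure; the function and its arguments are illustrative:

```python
import numpy as np

def keep_split(y_val, leaf_prediction, child_predictions):
    """Pre-pruning test: generate the child nodes only if they classify the
    validation samples more accurately than keeping a single leaf."""
    acc_leaf = np.mean(y_val == leaf_prediction)        # accuracy without the split
    acc_children = np.mean(y_val == child_predictions)  # accuracy with the split
    return acc_children > acc_leaf

# A leaf predicting class 0 vs. children predicting [0, 1, 1, 0] on four validation samples
print(keep_split(np.array([0, 1, 1, 1]), 0, np.array([0, 1, 1, 0])))  # True -> keep the split
```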
Post-pruning
Post-pruning first generates a complete decision tree and then works upward from the leaf nodes, using the same judgment as pre-pruning: if the child nodes improve accuracy on the validation set, they are retained; otherwise the subtree is cut off.
<center> Figure 4-2 Image from Zhou Zhihua's "Machine Learning" </center>
5. Code implementation
Use Python to implement decision tree classification based on information gain:
```python
import numpy as np

class GainNode:
    """
    Node of a classification decision tree
    Based on Information Gain
    """

    def __init__(self, feature=None, threshold=None, gain=None, left=None, right=None):
        # Index of the feature used to split at this node
        self.feature = feature
        # Split threshold; for a leaf node this holds the predicted class
        self.threshold = threshold
        # Information gain of the split
        self.gain = gain
        # Left child node
        self.left = left
        # Right child node
        self.right = right

class GainTree:
    """
    Classification decision tree
    Based on Information Gain
    """

    def __init__(self, max_depth=None, min_samples_leaf=None):
        # Maximum depth of the tree
        self.max_depth = max_depth
        # Minimum number of samples in a leaf node
        self.min_samples_leaf = min_samples_leaf

    def fit(self, X, y):
        """
        Fit the classification decision tree (information gain)
        """
        y = np.array(y)
        self.root = self.buildNode(X, y, 0)
        return self

    def buildNode(self, X, y, depth):
        """
        Recursively build a tree node (information gain)
        """
        node = GainNode()
        # Return directly when there are no samples
        if len(y) == 0:
            return node
        y_classes = np.unique(y)
        # When only one class remains, return it as a leaf
        if len(y_classes) == 1:
            node.threshold = y_classes[0]
            return node
        # When the maximum depth is reached, return the majority class
        if self.max_depth is not None and depth >= self.max_depth:
            node.threshold = max(y_classes, key=y.tolist().count)
            return node
        # When the minimum leaf size is reached, return the majority class
        if self.min_samples_leaf is not None and len(y) <= self.min_samples_leaf:
            node.threshold = max(y_classes, key=y.tolist().count)
            return node
        max_gain = -np.inf
        max_middle = None
        max_feature = None
        # Iterate over all features to find the one with the largest information gain
        for i in range(X.shape[1]):
            # Information gain of the current feature
            gain, middle = self.calcGain(X[:, i], y, y_classes)
            if max_gain < gain:
                max_gain = gain
                max_middle = middle
                max_feature = i
        # No valid split found (e.g. all feature values identical): fall back to a leaf
        if max_feature is None:
            node.threshold = max(y_classes, key=y.tolist().count)
            return node
        # Feature with the largest information gain
        node.feature = max_feature
        # Split threshold
        node.threshold = max_middle
        # Information gain
        node.gain = max_gain
        X_lt = X[:, max_feature] < max_middle
        X_gt = X[:, max_feature] > max_middle
        # Recursively build the left subtree
        node.left = self.buildNode(X[X_lt, :], y[X_lt], depth + 1)
        # Recursively build the right subtree
        node.right = self.buildNode(X[X_gt, :], y[X_gt], depth + 1)
        return node

    def calcMiddle(self, x):
        """
        Pairwise averages of the sorted values of a continuous feature
        """
        middle = []
        if len(x) == 0:
            return np.array(middle)
        start = x[0]
        for i in range(len(x) - 1):
            if x[i] == x[i + 1]:
                continue
            middle.append((start + x[i + 1]) / 2)
            start = x[i + 1]
        return np.array(middle)

    def calcEnt(self, y, y_classes):
        """
        Information entropy
        """
        ent = 0
        for j in range(len(y_classes)):
            p = len(y[y == y_classes[j]]) / len(y)
            if p != 0:
                ent = ent + p * np.log2(p)
        return -ent

    def calcGain(self, x, y, y_classes):
        """
        Information gain of a continuous feature
        """
        x_sort = np.sort(x)
        middle = self.calcMiddle(x_sort)
        max_middle = -np.inf
        max_gain = -np.inf
        ent = self.calcEnt(y, y_classes)
        # Try every candidate split point
        for i in range(len(middle)):
            y_gt = y[x > middle[i]]
            y_lt = y[x < middle[i]]
            ent_gt = self.calcEnt(y_gt, y_classes)
            ent_lt = self.calcEnt(y_lt, y_classes)
            # Information gain of this split point
            gain = ent - (ent_gt * len(y_gt) / len(x) + ent_lt * len(y_lt) / len(x))
            if max_gain < gain:
                max_gain = gain
                max_middle = middle[i]
        return max_gain, max_middle

    def predict(self, X):
        """
        Predict with the classification decision tree
        """
        y = np.zeros(X.shape[0])
        root_value = self.checkNode(X, y, self.root)
        # When the root itself is a leaf, every sample gets its class value
        if root_value is not None:
            y[:] = root_value
        return y

    def checkNode(self, X, y, node, cond=None):
        """
        Route samples through a node to determine their class
        """
        # A node without children is a leaf; return its class value
        if node.left is None and node.right is None:
            return node.threshold
        X_lt = X[:, node.feature] < node.threshold
        if cond is not None:
            X_lt = X_lt & cond
        # Recurse into the left child
        lt = self.checkNode(X, y, node.left, X_lt)
        if lt is not None:
            y[X_lt] = lt
        X_gt = X[:, node.feature] > node.threshold
        if cond is not None:
            X_gt = X_gt & cond
        # Recurse into the right child
        gt = self.checkNode(X, y, node.right, X_gt)
        if gt is not None:
            y[X_gt] = gt
```
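A minimal usage sketch, assuming the GainTree class above and randomly generated toy data:

```python
import numpy as np

# Toy data: two 2-D point clusters labeled 0 and 1 (illustrative only)
np.random.seed(0)
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 3])
y = np.array([0] * 20 + [1] * 20)

clf = GainTree(max_depth=3, min_samples_leaf=5)
clf.fit(X, y)
print(clf.predict(X[:5]))  # predicted classes for the first five samples
```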
Use Python to implement decision tree classification based on Gini index:
```python
import numpy as np

class GiniNode:
    """
    Node of a classification decision tree
    Based on the Gini Index
    """

    def __init__(self, feature=None, threshold=None, gini_index=None, left=None, right=None):
        # Index of the feature used to split at this node
        self.feature = feature
        # Split threshold; for a leaf node this holds the predicted class
        self.threshold = threshold
        # Gini index of the split
        self.gini_index = gini_index
        # Left child node
        self.left = left
        # Right child node
        self.right = right

class GiniTree:
    """
    Classification decision tree
    Based on the Gini Index
    """

    def __init__(self, max_depth=None, min_samples_leaf=None):
        # Maximum depth of the tree
        self.max_depth = max_depth
        # Minimum number of samples in a leaf node
        self.min_samples_leaf = min_samples_leaf

    def fit(self, X, y):
        """
        Fit the classification decision tree (Gini index)
        """
        y = np.array(y)
        self.root = self.buildNode(X, y, 0)
        return self

    def buildNode(self, X, y, depth):
        """
        Recursively build a tree node (Gini index)
        """
        node = GiniNode()
        # Return directly when there are no samples
        if len(y) == 0:
            return node
        y_classes = np.unique(y)
        # When only one class remains, return it as a leaf
        if len(y_classes) == 1:
            node.threshold = y_classes[0]
            return node
        # When the maximum depth is reached, return the majority class
        if self.max_depth is not None and depth >= self.max_depth:
            node.threshold = max(y_classes, key=y.tolist().count)
            return node
        # When the minimum leaf size is reached, return the majority class
        if self.min_samples_leaf is not None and len(y) <= self.min_samples_leaf:
            node.threshold = max(y_classes, key=y.tolist().count)
            return node
        min_gini_index = np.inf
        min_middle = None
        min_feature = None
        # Iterate over all features to find the one with the smallest Gini index
        for i in range(X.shape[1]):
            # Gini index of the current feature
            gini_index, middle = self.calcGiniIndex(X[:, i], y, y_classes)
            if min_gini_index > gini_index:
                min_gini_index = gini_index
                min_middle = middle
                min_feature = i
        # No valid split found (e.g. all feature values identical): fall back to a leaf
        if min_feature is None:
            node.threshold = max(y_classes, key=y.tolist().count)
            return node
        # Feature with the smallest Gini index
        node.feature = min_feature
        # Split threshold
        node.threshold = min_middle
        # Gini index
        node.gini_index = min_gini_index
        X_lt = X[:, min_feature] < min_middle
        X_gt = X[:, min_feature] > min_middle
        # Recursively build the left subtree
        node.left = self.buildNode(X[X_lt, :], y[X_lt], depth + 1)
        # Recursively build the right subtree
        node.right = self.buildNode(X[X_gt, :], y[X_gt], depth + 1)
        return node

    def calcMiddle(self, x):
        """
        Pairwise averages of the sorted values of a continuous feature
        """
        middle = []
        if len(x) == 0:
            return np.array(middle)
        start = x[0]
        for i in range(len(x) - 1):
            if x[i] == x[i + 1]:
                continue
            middle.append((start + x[i + 1]) / 2)
            start = x[i + 1]
        return np.array(middle)

    def calcGiniIndex(self, x, y, y_classes):
        """
        Gini index of a continuous feature
        """
        x_sort = np.sort(x)
        middle = self.calcMiddle(x_sort)
        min_middle = np.inf
        min_gini_index = np.inf
        # Try every candidate split point
        for i in range(len(middle)):
            y_gt = y[x > middle[i]]
            y_lt = y[x < middle[i]]
            gini_gt = self.calcGini(y_gt, y_classes)
            gini_lt = self.calcGini(y_lt, y_classes)
            gini_index = gini_gt * len(y_gt) / len(x) + gini_lt * len(y_lt) / len(x)
            if min_gini_index > gini_index:
                min_gini_index = gini_index
                min_middle = middle[i]
        return min_gini_index, min_middle

    def calcGini(self, y, y_classes):
        """
        Gini value
        """
        gini = 1
        for j in range(len(y_classes)):
            p = len(y[y == y_classes[j]]) / len(y)
            gini = gini - p * p
        return gini

    def predict(self, X):
        """
        Predict with the classification decision tree
        """
        y = np.zeros(X.shape[0])
        root_value = self.checkNode(X, y, self.root)
        # When the root itself is a leaf, every sample gets its class value
        if root_value is not None:
            y[:] = root_value
        return y

    def checkNode(self, X, y, node, cond=None):
        """
        Route samples through a node to determine their class
        """
        # A node without children is a leaf; return its class value
        if node.left is None and node.right is None:
            return node.threshold
        X_lt = X[:, node.feature] < node.threshold
        if cond is not None:
            X_lt = X_lt & cond
        # Recurse into the left child
        lt = self.checkNode(X, y, node.left, X_lt)
        if lt is not None:
            y[X_lt] = lt
        X_gt = X[:, node.feature] > node.threshold
        if cond is not None:
            X_gt = X_gt & cond
        # Recurse into the right child
        gt = self.checkNode(X, y, node.right, X_gt)
        if gt is not None:
            y[X_gt] = gt
```
Use Python to implement decision tree regression based on mean squared error:
```python
import numpy as np

class RegressorNode:
    """
    Node of a regression decision tree
    """

    def __init__(self, feature=None, threshold=None, mse=None, left=None, right=None):
        # Index of the feature used to split at this node
        self.feature = feature
        # Split threshold; for a leaf node this holds the predicted value
        self.threshold = threshold
        # Mean squared error of the split
        self.mse = mse
        # Left child node
        self.left = left
        # Right child node
        self.right = right

class RegressorTree:
    """
    Regression decision tree
    """

    def __init__(self, max_depth=None, min_samples_leaf=None):
        # Maximum depth of the tree
        self.max_depth = max_depth
        # Minimum number of samples in a leaf node
        self.min_samples_leaf = min_samples_leaf

    def fit(self, X, y):
        """
        Fit the regression decision tree
        """
        y = np.array(y)
        self.root = self.buildNode(X, y, 0)
        return self

    def buildNode(self, X, y, depth):
        """
        Recursively build a tree node
        """
        node = RegressorNode()
        # Return directly when there are no samples
        if len(y) == 0:
            return node
        y_classes = np.unique(y)
        # When all labels are identical, return that value as a leaf
        if len(y_classes) == 1:
            node.threshold = y_classes[0]
            return node
        # When the maximum depth is reached, return the mean of the labels
        if self.max_depth is not None and depth >= self.max_depth:
            node.threshold = np.average(y)
            return node
        # When the minimum leaf size is reached, return the mean of the labels
        if self.min_samples_leaf is not None and len(y) <= self.min_samples_leaf:
            node.threshold = np.average(y)
            return node
        min_mse = np.inf
        min_middle = None
        min_feature = None
        # Iterate over all features to find the one with the smallest squared error
        for i in range(X.shape[1]):
            # Squared error of the current feature
            mse, middle = self.calcMse(X[:, i], y)
            if min_mse > mse:
                min_mse = mse
                min_middle = middle
                min_feature = i
        # No valid split found (e.g. all feature values identical): fall back to a leaf
        if min_feature is None:
            node.threshold = np.average(y)
            return node
        # Feature with the smallest squared error
        node.feature = min_feature
        # Split threshold
        node.threshold = min_middle
        # Squared error
        node.mse = min_mse
        X_lt = X[:, min_feature] < min_middle
        X_gt = X[:, min_feature] > min_middle
        # Recursively build the left subtree
        node.left = self.buildNode(X[X_lt, :], y[X_lt], depth + 1)
        # Recursively build the right subtree
        node.right = self.buildNode(X[X_gt, :], y[X_gt], depth + 1)
        return node

    def calcMiddle(self, x):
        """
        Pairwise averages of the sorted values of a continuous feature
        """
        middle = []
        if len(x) == 0:
            return np.array(middle)
        start = x[0]
        for i in range(len(x) - 1):
            if x[i] == x[i + 1]:
                continue
            middle.append((start + x[i + 1]) / 2)
            start = x[i + 1]
        return np.array(middle)

    def calcMse(self, x, y):
        """
        Squared error of a continuous feature
        """
        x_sort = np.sort(x)
        middle = self.calcMiddle(x_sort)
        min_middle = np.inf
        min_mse = np.inf
        # Try every candidate split point
        for i in range(len(middle)):
            y_gt = y[x > middle[i]]
            y_lt = y[x < middle[i]]
            avg_gt = np.average(y_gt)
            avg_lt = np.average(y_lt)
            mse = np.sum((y_lt - avg_lt) ** 2) + np.sum((y_gt - avg_gt) ** 2)
            if min_mse > mse:
                min_mse = mse
                min_middle = middle[i]
        return min_mse, min_middle

    def predict(self, X):
        """
        Predict with the regression decision tree
        """
        y = np.zeros(X.shape[0])
        root_value = self.checkNode(X, y, self.root)
        # When the root itself is a leaf, every sample gets its value
        if root_value is not None:
            y[:] = root_value
        return y

    def checkNode(self, X, y, node, cond=None):
        """
        Route samples through a node to determine their predicted value
        """
        # A node without children is a leaf; return its value
        if node.left is None and node.right is None:
            return node.threshold
        X_lt = X[:, node.feature] < node.threshold
        if cond is not None:
            X_lt = X_lt & cond
        # Recurse into the left child
        lt = self.checkNode(X, y, node.left, X_lt)
        if lt is not None:
            y[X_lt] = lt
        X_gt = X[:, node.feature] > node.threshold
        if cond is not None:
            X_gt = X_gt & cond
        # Recurse into the right child
        gt = self.checkNode(X, y, node.right, X_gt)
        if gt is not None:
            y[X_gt] = gt
```
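A minimal usage sketch, assuming the RegressorTree class above and a noisy toy data set:

```python
import numpy as np

# Toy data: a noisy sine curve (illustrative only)
np.random.seed(0)
X = np.sort(np.random.rand(80, 1) * 6, axis=0)
y = np.sin(X[:, 0]) + np.random.randn(80) * 0.1

reg = RegressorTree(max_depth=3, min_samples_leaf=5)
reg.fit(X, y)
print(reg.predict(X[:5]))  # predicted values for the first five samples
```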
6. Third-party library implementation
scikit-learn 2 decision tree classification implementation
```python
from sklearn import tree

# Decision tree classification
clf = tree.DecisionTreeClassifier()
# Fit the data
clf = clf.fit(X, y)
```
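The regularization settings discussed in Section 4 map to constructor parameters of the same names; for example, the values used for the regularized tree in Section 7 (X and y are assumed to be defined as above):

```python
from sklearn import tree

# Decision tree classification with regularization
clf = tree.DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
# Fit the data
clf = clf.fit(X, y)
```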
scikit-learn 3 decision tree regression implementation
```python
from sklearn import tree

# Decision tree regression
clf = tree.DecisionTreeRegressor()
# Fit the data
clf = clf.fit(X, y)
```
7. Animation demonstration
Figure 7-1 shows the classification result of an unregularized decision tree, and Figure 7-2 shows the classification result of a regularized decision tree (max_depth = 3, min_samples_leaf = 5).
<center> Figure 7-1 </center>
<center> Figure 7-2 </center>
Figure 7-3 shows the regression result of an unregularized decision tree, and Figure 7-4 shows the regression result of a regularized decision tree (max_depth = 3, min_samples_leaf = 5).
<center> Figure 7-3 </center>
<center> Figure 7-4 </center>
It can be seen that the unregularized decision tree clearly overfits the training data set, while the regularized decision tree performs relatively better.
8. Mind map
<center> Figure 8-1 </center>
9. References
- https://en.wikipedia.org/wiki/Decision_tree_learning
- https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
- https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
For the full demo, please click here
Note: This article strives to be accurate and easy to understand, but since the author is also a beginner with limited knowledge, readers are urged to point out any errors or omissions by leaving a comment.
This article was first published on AI map; you are welcome to follow it.