Training and Test Sets

Why split the data into training and test sets

In machine learning we generally split the data into a training set and a test set: the model is trained on the training set and then evaluated on the test set. The purpose of training a model is to make accurate predictions on data we will encounter later in practice, so what we care about is good performance in future, real-world use, not merely good performance on the training set. If a model focuses too much on the training set during learning, it may simply memorize the entire training set instead of learning the underlying structure of the data. Such a model fits the training set extremely well but has no ability to judge data it has never seen. This is like a student who memorizes the answers to exercises without understanding how to solve them; that way of studying does not lead to good results in real work.

To tell whether a model has merely memorized the training data or has actually learned the underlying structure of the data, we use a test set to check whether the model can make accurate predictions on data it has never seen.

How to split the data into training and test sets

from sklearn.model_selection import train_test_split
from numpy import random

random.seed(2)
# A small random dataset: 12 samples, 4 features, and 1 target column
X = random.random(size=(12, 4))
y = random.random(size=(12, 1))
# Hold out 25% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
print('X_train:\n')
print(X_train)
print('\ny_train:\n')
print(y_train)
print('\nX_test:\n')
print(X_test)
print('\ny_test:\n')
print(y_test)
X_train:

[[ 0.4203678   0.33033482  0.20464863  0.61927097]
 [ 0.22030621  0.34982629  0.46778748  0.20174323]
 [ 0.12715997  0.59674531  0.226012    0.10694568]
 [ 0.4359949   0.02592623  0.54966248  0.43532239]
 [ 0.79363745  0.58000418  0.1622986   0.70075235]
 [ 0.13457995  0.51357812  0.18443987  0.78533515]
 [ 0.64040673  0.48306984  0.50523672  0.38689265]
 [ 0.50524609  0.0652865   0.42812233  0.09653092]
 [ 0.85397529  0.49423684  0.84656149  0.07964548]]

y_train:

[[ 0.95374223]
 [ 0.02720237]
 [ 0.40627504]
 [ 0.53560417]
 [ 0.06714437]
 [ 0.08209492]
 [ 0.24717724]
 [ 0.8508505 ]
 [ 0.3663424 ]]

X_test:

[[ 0.29965467  0.26682728  0.62113383  0.52914209]
 [ 0.96455108  0.50000836  0.88952006  0.34161365]
 [ 0.56714413  0.42754596  0.43674726  0.77655918]]

y_test:

[[ 0.54420816]
 [ 0.99385201]
 [ 0.97058031]]

sklearn.model_selection.train_test_split(*arrays, **options)

Split arrays or matrices into random train and test subsets
Quick utility that wraps input validation and next(ShuffleSplit().split(X, y)) and application to input data into a single call for splitting (and optionally subsampling) data in a oneliner.

Parameters:
*arrays : sequence of indexables with same length / shape[0]
Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
test_size : float, int, or None (default is None)
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is automatically set to the complement of the train size. If train size is also None, test size is set to 0.25.
train_size : float, int, or None (default is None)
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
random_state : int or RandomState
Pseudo-random number generator state used for random sampling.
stratify : array-like or None (default is None)
If not None, data is split in a stratified fashion, using this as the class labels.
Returns:
splitting : list, length=2 * len(arrays)
List containing train-test split of inputs.
New in version 0.16: If the input is sparse, the output will be a scipy.sparse.csr_matrix. Else, output type is the same as the input type.
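
As a small illustrative sketch (not from the original post; the toy arrays below are made up), the random_state and stratify parameters can be used to make the split reproducible and to preserve class proportions:

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 12 samples with imbalanced binary labels (9 zeros, 3 ones)
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 9 + [1] * 3)

# random_state makes the shuffle reproducible; stratify=y keeps the 3:1
# class ratio in both subsets (the test set gets exactly 3 zeros and 1 one)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=4, random_state=0, stratify=y)
print(sorted(y_test))   # [0, 0, 0, 1]
print(sorted(y_train))  # [0, 0, 0, 0, 0, 0, 1, 1]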

Confusion Matrix (Error Matrix)

Reference

A confusion matrix (Kohavi and Provost, 1998) contains information about actual and predicted classifications done by a classification system. Performance of such systems is commonly evaluated using the data in the matrix. The following table shows the confusion matrix for a two class classifier.
The entries in the confusion matrix have the following meaning in the context of our study:

a is the number of correct predictions that an instance is negative,
b is the number of incorrect predictions that an instance is positive,
c is the number of incorrect predictions that an instance is negative, and
d is the number of correct predictions that an instance is positive.
                    Predicted Negative    Predicted Positive
Actual Negative            a                     b
Actual Positive            c                     d

Several standard terms have been defined for the 2 class matrix:

  • The accuracy (AC) is the proportion of the total number of predictions that were correct. It is determined using the equation:

$$AC=\frac{a+d}{a+b+c+d}$$

  • The recall or true positive rate (TP) is the proportion of positive cases that were correctly identified, as calculated using the equation:

$$Recall=\frac{d}{c+d}$$

  • The precision (P) is the proportion of the predicted positive cases that were correct, as calculated using the equation:
    $$P=\frac{d}{b+d}$$

The accuracy determined using equation 1 may not be an adequate performance measure when the number of negative cases is much greater than the number of positive cases (Kubat et al., 1998).
Accuracy is not a suitable metric when negative cases far outnumber positive cases, because accuracy can remain very high even when the number of true positives is zero.

Suppose there are 1000 cases, 995 of which are negative cases and 5 of which are positive cases. If the system classifies them all as negative, the accuracy would be 99.5%, even though the classifier missed all positive cases. Other performance measures account for this by including TP in a product: for example, geometric mean (g-mean) (Kubat et al., 1998), and F-Measure (Lewis and Gale, 1994).

$$g-mean=\sqrt{R\cdot P}$$

$$F_{\beta}=\frac{(\beta^2+1)PR}{\beta^2P+R}$$

The F1 score is simply the F-measure in the special case $$\beta = 1$$.
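
As an illustrative sketch (not part of the original post, assuming scikit-learn is available; the labels below are made up), the quantities above can be computed either from the confusion-matrix entries or with the ready-made functions in sklearn.metrics:

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels for a small binary problem (1 = positive, 0 = negative)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

# Counted by hand from the table above: a = 3 (TN), b = 1 (FP), c = 1 (FN), d = 5 (TP)
a, b, c, d = 3, 1, 1, 5
print((a + d) / (a + b + c + d), accuracy_score(y_true, y_pred))  # accuracy: 0.8
print(d / (c + d), recall_score(y_true, y_pred))                  # recall / TPR: 0.833...
print(d / (b + d), precision_score(y_true, y_pred))               # precision: 0.833...
print(f1_score(y_true, y_pred))                                   # F-measure with beta = 1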

sklearn.metrics.confusion_matrix

Compute confusion matrix to evaluate the accuracy of a classification
By definition a confusion matrix C is such that $C_{i, j}$ is equal to the number of observations known to be in group i but predicted to be in group j.
Thus in binary classification, the count of true negatives is $C_{0,0}$, false negatives is $C_{1,0}$, true positives is $C_{1,1}$ and false positives is $C_{0,1}$.
Read more in the User Guide.

Parameters
y_true : array, shape = [n_samples]
Ground truth (correct) target values.
y_pred : array, shape = [n_samples]
Estimated targets as returned by a classifier.
labels : array, shape = [n_classes], optional
List of labels to index the matrix. This may be used to reorder or select a subset of labels. If none is given, those that appear at least once in y_true or y_pred are used in sorted order.
sample_weight : array-like of shape = [n_samples], optional
Sample weights.
Returns: C : array, shape = [n_classes, n_classes]
Confusion matrix

Examples

from sklearn.metrics import confusion_matrix
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0]  # one possible prediction vector consistent with the output below
confusion_matrix(y_true, y_pred)
array([[2, 1],
       [1, 2]])
from sklearn.metrics import confusion_matrix
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
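
A small follow-up sketch (not in the original post; the labels are made up): for a binary problem the four counts can be unpacked directly from the matrix, in the row-by-row order described above.

from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
# ravel() flattens the 2x2 matrix row by row: C[0,0], C[0,1], C[1,0], C[1,1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2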

Classifier performance metrics: ROC curve and AUC

The ROC curve

ROC curve: receiver operating characteristic. Each point on the ROC curve reflects the sensitivity to the same signal stimulus at a different decision threshold.

  • Horizontal axis: false positive rate (FPR), the proportion of all negative instances that are classified as positive; equal to 1 − Specificity.

  • Vertical axis: true positive rate (TPR), i.e. Sensitivity (positive-class coverage).

For a binary classification problem, every instance is assigned to either the positive class or the negative class. In practice, four situations can occur when classifying an instance:

  • TP: the number of correct positive predictions.
    If an instance is positive and is predicted as positive, it is a true positive (True Positive, TP).

  • FN: misses, the number of positives for which no correct match was found.
    If an instance is positive but is predicted as negative, it is a false negative (False Negative, FN).

  • FP: false alarms, incorrect matches.
    If an instance is negative but is predicted as positive, it is a false positive (False Positive, FP).

  • TN: the number of correctly rejected non-matches.
    If an instance is negative and is predicted as negative, it is a true negative (True Negative, TN).

The contingency table is shown below, where 1 denotes the positive class and 0 the negative class:
                   Predicted 1              Predicted 0
Actual 1       True Positive (TP)      False Negative (FN)
Actual 0       False Positive (FP)     True Negative (TN)

From the table above, the formulas for the two axes follow:
(1) True positive rate (TPR) = TP / (TP + FN): the proportion of all actual positive instances that the classifier predicts as positive. This is the Sensitivity.

(2) False positive rate (FPR) = FP / (FP + TN): the proportion of all actual negative instances that the classifier predicts as positive. This equals 1 − Specificity.

(3) True negative rate (TNR) = TN / (FP + TN): the proportion of all actual negative instances that the classifier predicts as negative; TNR = 1 − FPR. This is the Specificity.
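
As a minimal worked sketch (not from the original post; the counts are invented for illustration), the three rates follow directly from the four counts:

# Suppose a classifier produced TP = 3, FN = 1, FP = 1, TN = 5 (hypothetical counts)
tp, fn, fp, tn = 3, 1, 1, 5
tpr = tp / (tp + fn)   # sensitivity            -> 0.75
fpr = fp / (fp + tn)   # 1 - specificity        -> 0.1666...
tnr = tn / (fp + tn)   # specificity = 1 - FPR  -> 0.8333...
print(tpr, fpr, tnr)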

Suppose we use a logistic regression classifier that outputs, for each instance, the probability of belonging to the positive class. By setting a threshold such as 0.6, instances with probability greater than or equal to 0.6 are classified as positive and the rest as negative. This yields one (FPR, TPR) pair, i.e. one point in the plane. As the threshold is gradually lowered, more and more instances are classified as positive, but these also include more actual negatives, so TPR and FPR increase together. At the largest threshold the corresponding point is (0, 0); at the smallest threshold it is (1, 1).

In the figure below, the solid line in panel (a) is the ROC curve; each point on it corresponds to one threshold.

Horizontal axis FPR: 1 − TNR, i.e. 1 − Specificity. The larger the FPR, the more actual negatives there are among the predicted positives.

Vertical axis TPR: Sensitivity (positive-class coverage). The larger the TPR, the more actual positives there are among the predicted positives.

Ideal target: TPR = 1 and FPR = 0, i.e. the point (0, 1). The closer the ROC curve gets to (0, 1) and the further it moves away from the 45-degree diagonal, the better; larger Sensitivity and Specificity mean better performance.

How to draw an ROC curve

For a given classifier and test set we obviously obtain only one classification result, i.e. a single pair of FPR and TPR values, yet a curve requires a whole series of FPR and TPR values. How are those obtained? Let us first look at how Wikipedia defines the ROC curve.

The key phrase in that definition is "as its discrimination threshold is varied". How should this "discrimination threshold" be understood? We have been ignoring an important capability of classifiers: probability output, i.e. how likely the classifier thinks a given sample is to belong to the positive (or negative) class. By digging into the internal mechanism of each classifier we can usually obtain some form of probability output, typically by mapping a real-valued score into the interval (0, 1) through some transformation.

Suppose we have obtained the probability output (the probability of being a positive sample) for every sample; how do we then vary the "discrimination threshold"? We sort the test samples by their predicted positive probability in descending order. The figure below shows an example with 20 test samples: the "Class" column gives each sample's true label (p for positive, n for negative) and the "Score" column gives the predicted probability that the sample is positive.

Next, going from the highest "Score" to the lowest, we use each value in turn as the threshold: a test sample is classified as positive if its predicted positive probability is greater than or equal to the threshold, and as negative otherwise. For example, for the 4th sample in the figure the "Score" is 0.6, so samples 1, 2, 3 and 4 are classified as positive (their scores are all at least 0.6) and the remaining samples as negative. Each choice of threshold yields one pair of FPR and TPR values, i.e. one point on the ROC curve. In this way we obtain 20 pairs of FPR and TPR values, which are plotted as the ROC curve in the figure below:

Setting the threshold to 1 and to 0 gives the points (0, 0) and (1, 1) on the ROC curve respectively. Connecting all of these (FPR, TPR) pairs produces the ROC curve; the more threshold values used, the smoother the curve.

In fact, we do not necessarily need the probability that each test sample is positive; any "score" that the classifier assigns to the sample will do (the score does not have to lie in (0, 1)). The higher the score, the more confident the classifier is that the sample is positive, and each score value is used as a threshold in turn. I find it easier to understand when the scores are converted to probabilities.
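
A minimal sketch of this procedure (not from the original post; the labels and scores below are made up for illustration): sweep every score value as the threshold, collect one (FPR, TPR) point per threshold, and compare with sklearn.metrics.roc_curve.

import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical true labels and positive-class scores for a handful of samples
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.1])

points = []
for threshold in sorted(scores, reverse=True):
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    points.append((fp / (fp + tn), tp / (tp + fn)))  # one (FPR, TPR) point per threshold

print(points)
# roc_curve performs the same sweep (plus a leading (0, 0) point)
fpr, tpr, thresholds = roc_curve(y_true, scores, drop_intermediate=False)
print(list(zip(fpr, tpr)))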

AUC

AUC (Area Under the Curve): the area under the ROC curve, a value between 0 and 1. As a single number, the AUC gives an intuitive evaluation of how good a classifier is: the larger the value, the better.

More precisely, the AUC is a probability: if you randomly pick one positive sample and one negative sample, the AUC is the probability that the classifier's score ranks the positive sample ahead of the negative one. The larger the AUC, the more likely the classifier is to place positive samples ahead of negative ones, and therefore the better it separates the classes.
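
An illustrative sketch of this interpretation (not in the original post; the labels and scores are made up): estimate the AUC as the fraction of positive/negative pairs in which the positive sample receives the higher score, and compare with sklearn.metrics.roc_auc_score.

import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and positive-class scores
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.1])

pos = scores[y_true == 1]
neg = scores[y_true == 0]
# Fraction of (positive, negative) pairs where the positive sample scores higher
# (a tie, if it occurred, would conventionally count as 0.5)
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
print(np.mean(pairs), roc_auc_score(y_true, scores))  # both print 0.75 for this data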

Why use ROC and AUC to evaluate classifiers

Given that so many metrics already exist, why use ROC and AUC at all? Because the ROC curve has a very useful property: it stays essentially unchanged when the distribution of positive and negative samples in the test set changes. Real data sets often suffer from class imbalance, i.e. a large gap between the numbers of positive and negative samples, and the class ratio in the test data may also change over time. The figure below compares ROC curves with Precision-Recall curves:

In the figure, (a) and (c) are ROC curves, while (b) and (d) are Precision-Recall curves.

(a) and (b) show the classifier's results on the original test set (with a balanced positive/negative distribution); (c) and (d) show the results after the number of negative samples in the test set is increased to 10 times the original. The ROC curves clearly keep their shape, whereas the Precision-Recall curves change substantially.
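
A small sketch of this effect (not from the original post; the make_classification data and logistic regression model are used purely for illustration): duplicating the negative test samples 10 times leaves the ROC AUC unchanged, while the average precision (the area-under-curve analogue for Precision-Recall) drops.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Roughly balanced synthetic data
X, y = make_classification(n_samples=4000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]

# Build an imbalanced test set by adding 9 extra copies of every negative sample
neg = y_test == 0
X_imb = np.vstack([X_test, np.repeat(X_test[neg], 9, axis=0)])
y_imb = np.concatenate([y_test, np.repeat(y_test[neg], 9)])
scores_imb = clf.predict_proba(X_imb)[:, 1]

# Duplicating negatives changes neither FPR nor TPR, so the ROC AUC is unchanged...
print(roc_auc_score(y_test, scores), roc_auc_score(y_imb, scores_imb))
# ...but precision = TP / (TP + FP) falls, so the Precision-Recall summary drops
print(average_precision_score(y_test, scores), average_precision_score(y_imb, scores_imb))

The following code plots ROC curves for an SVM (with two different gamma values) and a KNN classifier on the iris data set.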

from sklearn import datasets, svm, metrics
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.5)

# RBF-kernel SVM with gamma=1; the ROC curve treats class 2 as the positive class.
# Note: the curve is built from hard label predictions, so it has only a few points.
clf = svm.SVC(kernel='rbf', C=1, gamma=1).fit(X_train, y_train)
y_pred = clf.predict(X_test)
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred, pos_label=2)
print('gamma=1 AUC= ', metrics.auc(fpr, tpr))

# RBF-kernel SVM with gamma=10
clf = svm.SVC(kernel='rbf', C=1, gamma=10).fit(X_train, y_train)
y_pred_rbf = clf.predict(X_test)
fpr_rbf, tpr_rbf, thresholds_rbf = metrics.roc_curve(y_test, y_pred_rbf, pos_label=2)
print('gamma=10 AUC= ', metrics.auc(fpr_rbf, tpr_rbf))

# 3-nearest-neighbours classifier for comparison
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
y_pred_knn = neigh.predict(X_test)
fpr_knn, tpr_knn, thresholds_knn = metrics.roc_curve(y_test, y_pred_knn, pos_label=2)
print('knn AUC= ', metrics.auc(fpr_knn, tpr_knn))

plt.figure()
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='gamma=1')
plt.plot(fpr_rbf, tpr_rbf, label='gamma=10')
plt.plot(fpr_knn, tpr_knn, label='knn')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()
gamma=1 AUC=  0.927469135802
gamma=10 AUC=  0.927469135802
knn AUC=  0.936342592593

[Figure: ROC curves for the SVM with gamma=1 and gamma=10 and for the KNN classifier]

The figure above shows that the SVM with gamma=10 performs noticeably better than with gamma=1. The next, longer example (credited to Tim Head in its code header) combines tree ensembles with logistic regression and compares their ROC curves.

# Author: Tim Head <betatim@gmail.com>
#
# License: BSD 3 clause

import numpy as np
np.random.seed(10)

import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomTreesEmbedding, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.pipeline import make_pipeline

n_estimator = 10
X, y = make_classification(n_samples=80000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
# It is important to train the ensemble of trees on a different subset
# of the training data than the linear regression model to avoid
# overfitting, in particular if the total number of leaves is
# similar to the number of training samples
X_train, X_train_lr, y_train, y_train_lr = train_test_split(X_train,
                                                            y_train,
                                                            test_size=0.5)

# Unsupervised transformation based on totally random trees
rt = RandomTreesEmbedding(max_depth=3, n_estimators=n_estimator,
    random_state=0)

rt_lm = LogisticRegression()
pipeline = make_pipeline(rt, rt_lm)
pipeline.fit(X_train, y_train)
y_pred_rt = pipeline.predict_proba(X_test)[:, 1]
fpr_rt_lm, tpr_rt_lm, _ = roc_curve(y_test, y_pred_rt)

# Supervised transformation based on random forests
rf = RandomForestClassifier(max_depth=3, n_estimators=n_estimator)
rf_enc = OneHotEncoder()
rf_lm = LogisticRegression()
rf.fit(X_train, y_train)
rf_enc.fit(rf.apply(X_train))
rf_lm.fit(rf_enc.transform(rf.apply(X_train_lr)), y_train_lr)

y_pred_rf_lm = rf_lm.predict_proba(rf_enc.transform(rf.apply(X_test)))[:, 1]
fpr_rf_lm, tpr_rf_lm, _ = roc_curve(y_test, y_pred_rf_lm)

grd = GradientBoostingClassifier(n_estimators=n_estimator)
grd_enc = OneHotEncoder()
grd_lm = LogisticRegression()
grd.fit(X_train, y_train)
grd_enc.fit(grd.apply(X_train)[:, :, 0])
grd_lm.fit(grd_enc.transform(grd.apply(X_train_lr)[:, :, 0]), y_train_lr)

y_pred_grd_lm = grd_lm.predict_proba(
    grd_enc.transform(grd.apply(X_test)[:, :, 0]))[:, 1]
fpr_grd_lm, tpr_grd_lm, _ = roc_curve(y_test, y_pred_grd_lm)


# The gradient boosted model by itself
y_pred_grd = grd.predict_proba(X_test)[:, 1]
fpr_grd, tpr_grd, _ = roc_curve(y_test, y_pred_grd)

# The random forest model by itself
y_pred_rf = rf.predict_proba(X_test)[:, 1]
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_rf)

plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_rt_lm, tpr_rt_lm, label='RT + LR')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_rf_lm, tpr_rf_lm, label='RF + LR')
plt.plot(fpr_grd, tpr_grd, label='GBT')
plt.plot(fpr_grd_lm, tpr_grd_lm, label='GBT + LR')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()

plt.figure(2)
plt.xlim(0, 0.2)
plt.ylim(0.8, 1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_rt_lm, tpr_rt_lm, label='RT + LR')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_rf_lm, tpr_rf_lm, label='RF + LR')
plt.plot(fpr_grd, tpr_grd, label='GBT')
plt.plot(fpr_grd_lm, tpr_grd_lm, label='GBT + LR')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve (zoomed in at top left)')
plt.legend(loc='best')
plt.show()

[Figure: ROC curves for RT + LR, RF, RF + LR, GBT and GBT + LR]

[Figure: the same ROC curves, zoomed in at the top left]

Errors

Mean absolute error (MAE)

The absolute value function is not differentiable at zero, which makes it awkward to use with gradient descent, so the MSE is often used instead.

Mean squared error (MSE)
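
For reference (these are the standard definitions, not spelled out in the original post), with true values $y_i$, predictions $\hat{y}_i$ and $n$ samples:

$$MAE=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|$$

$$MSE=\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2$$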

from sklearn import datasets, svm, metrics
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.5)
clf = svm.SVC(kernel='rbf', C=1, gamma=1).fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Treat the predicted class labels as numbers and measure their error against the true labels
error = metrics.mean_absolute_error(y_test, y_pred)
print('mean_absolute_error: ', error)
print('mean_square_error: ', metrics.mean_squared_error(y_test, y_pred))
mean_absolute_error:  0.04
mean_square_error:  0.04

K Fold

import numpy as np
from sklearn.model_selection import KFold

X = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
# Shuffle the 10 samples and split them into 10 folds, so each fold holds out one sample
kf = KFold(n_splits=10, random_state=3, shuffle=True)
for train_indices, test_indices in kf.split(X):
    print(train_indices, test_indices)
[0 1 2 3 4 6 7 8 9] [5]
[0 1 2 3 5 6 7 8 9] [4]
[0 2 3 4 5 6 7 8 9] [1]
[0 1 3 4 5 6 7 8 9] [2]
[0 1 2 3 4 5 6 7 8] [9]
[0 1 2 3 4 5 7 8 9] [6]
[0 1 2 3 4 5 6 8 9] [7]
[1 2 3 4 5 6 7 8 9] [0]
[0 1 2 4 5 6 7 8 9] [3]
[0 1 2 3 4 5 6 7 9] [8]
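
As a small follow-up sketch (not in the original post), a KFold object is typically passed to cross_val_score, which fits and scores the model once per fold; the SVM and iris data here are reused purely for illustration.

from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='rbf', C=1, gamma=1)
# 5-fold cross-validation: fit on 4 folds, score (accuracy) on the held-out fold, 5 times
kf = KFold(n_splits=5, shuffle=True, random_state=3)
scores = cross_val_score(clf, iris.data, iris.target, cv=kf)
print(scores, scores.mean())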
