Abstract: This article introduces the evaluation metrics for deep learning classification tasks. It covers basic usage, practical techniques, and the principles and mechanisms behind them, and I hope it is helpful to you.

This article is shared from the HUAWEI Cloud Community article "Deep Learning Classification Tasks", original author: lutianfei.


Classification model

Confusion matrix

sklearn implementation:

sklearn.metrics.confusion_matrix(y_true, y_pred, labels=None)

Return value: the confusion matrix of the classification results, as an array.

Parameters: see classification_report.

The confusion matrix has the following content, where C[i][j] is the number of samples whose true label is i but whose predicted label is j.

Confusion Matrix:

[[5 0]

[3 2]]
# required imports for the helpers below; MY_CLASSES is a project-level list of class names
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def calc_confusion_matrix(y_true: list, y_pred: list, show=True, save=False, figsize=(16, 16), verbose=False):
    """
    Compute the confusion matrix.
    :param y_true: ground-truth labels
    :param y_pred: predicted labels
    :param show: whether to plot the matrix
    :param save: whether to save the plot to disk
    :param figsize: figure size of the plot
    :param verbose: whether to print the raw matrix
    :return: the confusion matrix
    """
    confusion = confusion_matrix(y_true, y_pred)
    if verbose:
        print(confusion)

    if show:
        show_confusion_matrix(confusion, figsize=figsize, save=save)

    return confusion

def show_confusion_matrix(confusion, classes=MY_CLASSES, x_rot=-60, figsize=None, save=False):
    """
    Plot the confusion matrix.
    :param confusion: confusion matrix returned by sklearn
    :param classes: list of class names used as tick labels
    :param x_rot: rotation angle of the x-axis tick labels
    :param figsize: figure size
    :param save: whether to save the figure to disk
    :return:
    """
    if figsize is not None:
        plt.rcParams['figure.figsize'] = figsize

    plt.imshow(confusion, cmap=plt.cm.YlOrRd)
    indices = range(len(confusion))
    plt.xticks(indices, classes, rotation=x_rot, fontsize=12)
    plt.yticks(indices, classes, fontsize=12)
    plt.colorbar()
    plt.xlabel('y_pred')
    plt.ylabel('y_true')

    # annotate each cell with its count (x = column = predicted, y = row = true)
    for first_index in range(len(confusion)):
        for second_index in range(len(confusion[first_index])):
            plt.text(second_index, first_index, confusion[first_index][second_index])

    if save:
        plt.savefig("./confusion_matrix.png")
    plt.show()

The confusion matrix is a visualization tool in supervised learning, mainly used to compare the classification results against the true labels of the samples. Each row of the matrix represents the true category of the instances, and each column represents the predicted category.

Understanding method:

P (Positive): the prediction is the positive class

N (Negative): the prediction is the negative class

T (True): the prediction is correct

F (False): the prediction is wrong
[Figure: the 2x2 confusion-matrix layout (TP / FP / FN / TN)]

  • True Positive (TP): a positive sample predicted as positive by the model; the prediction is 1 and the actual label is 1, so the prediction is correct.
  • False Positive (FP): a negative sample predicted as positive by the model; the prediction is 1 but the actual label is 0, so the prediction is wrong.
  • False Negative (FN): a positive sample predicted as negative by the model; the prediction is 0 but the actual label is 1, so the prediction is wrong.
  • True Negative (TN): a negative sample predicted as negative by the model; the prediction is 0 and the actual label is 0, so the prediction is correct.
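For a binary problem, these four counts can be read directly off sklearn's confusion matrix; a minimal sketch with made-up labels:

# minimal sketch (made-up labels): extract TN/FP/FN/TP from a binary confusion matrix
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# for 0/1 labels, ravel() flattens the matrix [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)  # 3 1 2 4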

True positive rate (TPR), also called sensitivity

TPR = TP/(TP+FN) -> the number of actual positives predicted as positive / the total number of actual positives

From the formula above, TPR is equivalent to Recall.

True negative rate (TNR), also called specificity

TNR = TN/(TN+FP) -> the number of actual negatives predicted as negative / the total number of actual negatives

False positive rate (FPR)

FPR = FP/(FP+TN) -> the number of actual negatives predicted as positive / the total number of actual negatives

False negative rate (FNR)

FNR = FN/(TP+FN) -> the number of actual positives predicted as negative / the total number of actual positives
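Continuing the made-up example above, the four rates follow directly from the counts:

# rates computed from the four confusion-matrix counts (same made-up example as above)
tp, fp, fn, tn = 3, 1, 2, 4

tpr = tp / (tp + fn)   # true positive rate (recall / sensitivity)
tnr = tn / (tn + fp)   # true negative rate (specificity)
fpr = fp / (fp + tn)   # false positive rate = 1 - TNR
fnr = fn / (tp + fn)   # false negative rate = 1 - TPR
print(tpr, tnr, fpr, fnr)  # 0.6 0.8 0.2 0.4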

Accuracy

Also known as the correct rate: the proportion of correctly predicted samples among all samples. It is the most commonly used classification performance metric.

Formula: Accuracy = (TP+TN)/(TP+TN+FP+FN)

Disadvantage: it has limitations when the classes are imbalanced.

For example, if negative samples account for 99% of the data, a classifier that predicts every sample as negative already reaches 99% accuracy. When the class proportions are highly imbalanced, the majority class dominates the accuracy, as the quick sketch below shows.
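# accuracy pitfall on imbalanced data (made-up data: 99 negatives, 1 positive)
from sklearn.metrics import accuracy_score

y_true = [0] * 99 + [1]      # 99% negative samples, 1% positive
y_pred = [0] * 100           # a "classifier" that always predicts negative
print(accuracy_score(y_true, y_pred))  # 0.99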

sklearn implementation:

from sklearn.metrics import accuracy_score

accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)

Return value: if normalize is True, the accuracy; if normalize is False, the number of correctly classified samples.

Parameters:

y_true: the ground-truth labels.

y_pred: the predicted labels.

normalize: a boolean indicating whether to normalize the result.

If True, the proportion of correctly classified samples (the accuracy) is returned.

If False, the number of correctly classified samples is returned.

sample_weight: sample weights; by default every sample has weight 1.

# helper wrapper
def calc_accuracy_score(y_true: list, y_pred: list, verbose=False):
    res = accuracy_score(y_true, y_pred)
    if verbose:
        print("accuracy:%s" % res)

    return res

Error rate

The proportion of misclassified samples (both false positives and false negatives) in the total number of samples.

Accuracy and error rate evaluate the result from opposite sides, and the two always sum to exactly 1.

ErrorRate = (FP+FN)/(TP+FN+FP+TN)
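Since the two sum to 1, the error rate can be derived from accuracy_score; a minimal sketch with made-up labels:

# error rate as the complement of accuracy
from sklearn.metrics import accuracy_score

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]
error_rate = 1 - accuracy_score(y_true, y_pred)
print(error_rate)  # 0.2 (1 wrong prediction out of 5)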

Precision
Precision, also called the precision rate, is defined with respect to the predictions: among all samples predicted as positive, it is the probability that a sample is actually positive. In other words, it measures how confident we can be that a positive prediction is correct. The formula is:

Precision = TP/(TP+FP)

Disadvantage: if the model predicts only a single sample as positive and that prediction is correct, precision is 100%, even though many actual positives were wrongly predicted as negative and therefore missed.

Scenario: predicting whether a stock will rise. Suppose the stock actually rises 10 times but we predict a rise only twice, and both predictions are correct. Here we want precision to be high, and recall is not important.

sklearn implementation:

from sklearn.metrics import precision_score

sklearn.metrics.precision_score(y_true, y_pred, labels=None, pos_label=1,
average='binary', sample_weight=None)

Return value: the precision, i.e., the proportion of samples predicted as positive that are truly positive.

Parameters:

y_true: the ground-truth labels.

y_pred: the predicted labels.

labels: a list, used when average is not 'binary'.

For multi-class problems it specifies which classes are evaluated; classes not in labels contribute 0 when computing the macro precision.

For multi-label problems it specifies the indices of the labels to be evaluated.

Except when average=None, the order of the elements in labels also matters.

By default, all classes appearing in y_true and y_pred are used.

pos_label: a string or integer specifying which label value is treated as the positive class.

For multi-class or multi-label problems this parameter is ignored.

If labels=[pos_label] is set and average != 'binary', only the precision of that class is computed.

average: a string or None, specifying how the precision is computed for binary or multi-class classification.

'binary': compute the precision of the binary classification task; the class given by pos_label is the positive class and its precision is reported.

It requires the elements of y_true and y_pred to be 0 or 1.

'micro': compute the precision globally over all positive and negative instances.

'macro': compute the precision of each class and return their unweighted mean.

'weighted': compute the precision of each class and return their mean weighted by the number of samples in each class.

'samples': compute the precision of each sample and return the mean; this is only meaningful for multi-label classification.

None: compute the precision of each class and return them as an array.

sample_weight: sample weights; by default every sample has weight 1.
# helper wrapper
def calc_precision_score(y_true: list, y_pred: list, labels=MY_CLASSES, average=None, verbose=False):
    res = precision_score(y_true, y_pred, labels=labels, average=average)
    if verbose:
        print("precision:%s" % res)

    return res

Recall

Recall, also called the recall rate, is defined with respect to the original samples: among all samples that are actually positive, it is the probability of being predicted as positive. The formula is:

Recall = TP/(TP+FN)

Disadvantage: if the model predicts every sample as positive, all actual positives are covered and recall is also 100%.

Example application: predicting defaults in online lending. Compared with good users, we care more about bad users and cannot afford to miss any of them, because treating too many bad users as good ones may cause future default losses that far exceed the interest repaid by good users, leading to serious losses. The higher the recall, the higher the probability that an actually bad user is caught. The spirit is similar to "better to flag a thousand by mistake than to let one slip through".

sklearn implementation:

from sklearn.metrics import recall_score

sklearn.metrics.recall_score(y_true, y_pred, labels=None,
pos_label=1,average='binary', sample_weight=None)
Return value: the recall, i.e., the proportion of truly positive samples that are predicted as positive.

Parameters: see precision_score.

# helper wrapper
def calc_recall_score(y_true: list, y_pred: list, labels=MY_CLASSES, average=None, verbose=False):
    res = recall_score(y_true, y_pred, labels=labels, average=average)
    if verbose:
        print("recall: %s" % res)

    return res

The following figure further illustrates precision and recall.
[Figure: illustration of precision vs. recall]

PR curve

From the formula of precision rate and recall rate, we can see that the numerator of precision rate and recall rate is the same, both are TP, but the denominator is different, one is (TP+FP) and the other is (TP+FN). The relationship between the two can be shown with a PR diagram:
[Figure: P-R curve]

The final output of the classification model is often a probability value. We generally need to convert the probability value into a specific category. For two-class classification, we set a threshold, and then judge it as a positive class if it is greater than this threshold, and vice versa. The above evaluation indicators (Accuracy, Precision, Recall) are all for a specific threshold. So when different models take different thresholds, how to comprehensively evaluate different models? Therefore, it is necessary to introduce the PR curve, namely the Precision-Recall curve, for evaluation.

In order to find the most suitable threshold to meet our requirements, we must traverse all the thresholds between 0 and 1, and each threshold corresponds to a pair of precision and recall, so we get this curve.

As shown in the figure below, the ordinate is precision P and the abscissa is recall R. Each point on a model's PR curve corresponds to one threshold: scores above the threshold are classified as positive and scores below it as negative, which yields one (recall, precision) pair plotted as a coordinate in the PR plane.
The entire PR curve is generated by sweeping the threshold from high to low. The closer the PR curve is to the upper-right corner (1,1), the better the model. In real scenarios, different models must be judged comprehensively according to the decision requirements.
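A small sketch of how sweeping the threshold over predicted probabilities yields the (recall, precision) pairs; the scores here are made up:

# how a PR curve arises from thresholding predicted probabilities (made-up scores)
from sklearn.metrics import precision_recall_curve

y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]   # predicted probability of the positive class

precision, recall, thresholds = precision_recall_curve(y_true, scores)
for p, r, t in zip(precision, recall, thresholds):
    print("threshold=%.2f  precision=%.2f  recall=%.2f" % (t, p, r))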

Evaluation criteria:

First look at smoothness (the smoother, the better). On the same test set, the curve that lies above another is generally better (the green line is better than the red line in the figure below). F1 is largest where precision and recall are close to each other: draw the line connecting (0,0) and (1,1); roughly where it crosses the PR curve, F1 reaches its maximum. In this sense, F1 is to the PR curve what AUC is to the ROC curve.
[Figure: P-R curves of two models (green vs. red)]

The area under the PR curve is called AP (Average Precision). It represents the average precision as recall goes from 0 to 1, and it can be computed by integration. The AP area is never greater than 1; the larger the area under the PR curve, the better the model.

A high-performing model should keep precision at a high level while recall increases.

sklearn implementation:

sklearn.metrics.precision_recall_curve(y_true, probas_pred, pos_label=None,
sample_weight=None)

Return value: a tuple whose elements are:

the precision sequence of the P-R curve. It is an increasing sequence; the i-th element is the precision when the decision threshold on the positive-class probability is thresholds[i].

the recall sequence of the P-R curve. It is a decreasing sequence; the i-th element is the recall when the decision threshold on the positive-class probability is thresholds[i].

the threshold sequence thresholds of the P-R curve. It is an increasing sequence, giving the positive-class probability thresholds used for classifying a sample as positive.

Parameters:

y_true: the ground-truth labels.
probas_pred: the predicted probability of the positive class for each sample.
pos_label: the label of the positive class.
sample_weight: sample weights; by default every sample has weight 1.
def calc_precision_recall_curve(class_info, class_name=None, show=True, save=False, verbose=False):
    """
    Compute the PR curve for one class.
    :param class_info: dict with 'gt_lbl' (binary ground-truth labels) and 'score' (predicted scores)
    :param class_name: name of the class, used for printing/plotting
    :param show: whether to plot the curve
    :param save: whether to save the plot
    :param verbose: whether to print the raw sequences
    :return: (precision, recall, thresholds)
    """
    precision, recall, thresholds = precision_recall_curve(class_info['gt_lbl'], class_info['score'])
    if verbose:
        print("%s precision:%s " % (class_name, precision,))
        print("%s recall:%s " % (class_name, recall,))
        print("%s thresholds:%s " % (class_name, thresholds,))

    if show:
        show_PR_curve(recall, precision, class_name)

    return precision, recall, thresholds

PR curve drawing method:

def show_PR_curve(recall, precision, class_name=None, save=False):
    """
    Plot the PR curve.
    :param recall: recall sequence
    :param precision: precision sequence
    :param class_name: class name used in the title and file name
    :param save: whether to save the figure
    :return:
    """
    plt.figure("%s P-R Curve" % class_name)
    plt.title('%s Precision/Recall Curve' % class_name)
    plt.xlabel('Recall')
    plt.ylabel('Precision')

    plt.plot(recall, precision)
    if save:
        plt.savefig("./%s_pr_curve.png" % class_name)
    plt.show()

F1 score (harmonic mean)

The F1 score, also known as the F-Measure, is the harmonic mean of precision and recall. The harmonic mean is pulled toward the smaller of the two values, so F1 is largest when precision and recall are close to each other. Many recommendation-system evaluation metrics use the F1 score.

Precision and recall are two contradictory yet unified metrics. To raise precision, the classifier has to predict positive only when it is "very sure", but being that conservative misses many "unsure" positive samples and lowers recall. So how do we choose a model when the recall and precision of different models each have their own advantages? To strike a balance between the two we need a new metric: the F1 score, which takes both precision and recall into account:

F1 = 2 × Precision × Recall / (Precision + Recall)

The break-even point seen in the earlier PR-curve figure corresponds to the F1 score.

In real scenarios, two models can end up with the same F1 score, one with very high precision but very low recall and the other the opposite; a single F1 score cannot make the final call. In that case, other metrics appropriate to the scenario should be chosen, as illustrated below.
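A small numeric illustration (values made up) of how the harmonic mean is dragged toward the smaller value, and of two very different models sharing the same F1:

# harmonic mean (F1) is dominated by the smaller of precision and recall
def f1(p, r):
    return 2 * p * r / (p + r)

print(f1(0.9, 0.1))   # 0.18: one very low value drags F1 down
print(f1(0.5, 0.5))   # 0.50: balanced precision/recall gives a higher F1
print(f1(0.1, 0.9))   # 0.18: same F1 as the first model, opposite trade-off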

sklearn implementation:

from sklearn.metrics import f1_score  # harmonic mean F1

f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary',
sample_weight=None)

Return value: the F1 value, i.e., the harmonic mean of precision and recall.

Parameters: see precision_score.

# helper wrapper
def calc_f1_score(y_true: list, y_pred: list, labels=MY_CLASSES, average=None, verbose=False):
    res = f1_score(y_true, y_pred, labels=labels, average=average)
    if verbose:
        print("f1_score: %s" % res)

    return res

ROC curve

The ROC (Receiver Operating Characteristic) curve, also known as the sensitivity curve, describes the performance of a binary classification model graphically and serves as a comprehensive evaluation indicator of the model.

The reason ROC and AUC can ignore class imbalance: sensitivity and 1 − specificity, i.e., the true positive rate (TPR) and the false positive rate (FPR), are defined with respect to the actual positives and the actual negatives respectively. Each of them only looks at probabilities within the actual positive samples or within the actual negative samples, so neither is affected by whether the classes are balanced.

For example, suppose 90% of the samples are positive and 10% are negative. We know accuracy is misleading here, but TPR and FPR are not: TPR only measures how many of the 90% actual positives are covered and has nothing to do with the 10% negatives, while FPR only measures how many of the 10% actual negatives are wrongly flagged and has nothing to do with the 90% positives. Looking separately at each class of actual outcomes avoids the imbalance problem, which is why TPR and FPR are chosen as the axes of ROC/AUC.

X-axis: false positive rate (FPR), the proportion of actual negatives that the model predicts as positive. Medically this corresponds to the misdiagnosis rate.

Y-axis: true positive rate (TPR), the proportion of actual positives that the model predicts as positive.

The closer the ROC curve is to the diagonal, the lower the accuracy of the model.

Curve description:

The ROC curve is obtained from the (FPR, TPR) coordinates at different thresholds. Specifically, the model's probability threshold (the probability above which a sample is judged positive) is dynamically adjusted, starting from the highest value (for example 1, corresponding to the zero point of the ROC curve) and gradually moving down to the lowest value; each threshold yields one FPR and one TPR, the corresponding position is drawn on the ROC chart, and connecting all the points gives the final ROC curve.

Curve characteristics:

  • By adjusting the classification threshold (the default threshold of logistic regression is 0.5), TPR and FPR change accordingly, forming multiple points on the ROC coordinates that reflect the model's classification performance.
  • The faster TPR grows while FPR stays low, the more convex the curve and the better the model's classification performance, i.e., the more positives are predicted correctly.
  • The ROC curve passes through the two points (0,0) and (1,1). The reason:

The default threshold of a logistic regression model is 0.5: when the class probability p returned by sigmoid() satisfies p ≥ 0.5, the model predicts class 1 (positive). If the threshold is lowered to 0, then p ≥ 0 always holds and the model predicts every sample as class 1 (right or wrong). At that point FN = TN = 0, so TPR = FPR = 1, which is the point (1,1).

If the threshold is raised to 1, the model predicts class 1 only when p ≥ 1, which essentially never happens (p cannot exceed 100%), so every sample is predicted as class 0 (right or wrong). Then FP = TP = 0, so TPR = FPR = 0, which is the point (0,0). A short sketch of these endpoints follows.
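A short sketch with made-up scores showing that roc_curve sweeps the threshold and that the resulting curve runs from (0,0) to (1,1):

# roc_curve returns one (FPR, TPR) point per threshold; the curve spans (0,0) to (1,1)
from sklearn.metrics import roc_curve

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, scores)
print(fpr)         # starts at 0.0 and ends at 1.0
print(tpr)         # starts at 0.0 and ends at 1.0
print(thresholds)  # decreasing thresholds; the first one accepts nothing as positive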

sklearn implementation:

The roc_curve function computes the ROC curve of a classification result. Its prototype is:

sklearn.metrics.roc_curve(y_true, y_score, pos_label=None, sample_weight=None,
drop_intermediate=True)

Return value: a tuple whose elements are:

the FPR sequence of the ROC curve. It is an increasing sequence; the i-th element is the false positive rate when the decision threshold on the positive-class probability is thresholds[i].

the TPR sequence of the ROC curve. It is an increasing sequence; the i-th element is the true positive rate when the decision threshold on the positive-class probability is thresholds[i].

the threshold sequence thresholds of the ROC curve. It is a decreasing sequence, giving the positive-class probability thresholds used for classifying a sample as positive.

Parameters:

y_true: the ground-truth labels.

y_score: the predicted probability of the positive class for each sample.

pos_label: the label of the positive class.

sample_weight: sample weights; by default every sample has weight 1.

drop_intermediate: a boolean. If True, some suboptimal thresholds that would not appear on the ROC curve are dropped.

# helper wrapper
def calc_roc_curve(class_info, class_name=None, show=True, save=False, verbose=False):
    """
    Compute the ROC curve for one class.
    :param class_info: dict with 'gt_lbl' (binary ground-truth labels) and 'score' (predicted scores)
    :param class_name: name of the class, used for printing/plotting
    :param show: whether to plot the curve
    :param save: whether to save the plot
    :param verbose: whether to print the raw sequences
    :return: (fpr, tpr, thresholds)
    """
    fpr, tpr, thresholds = roc_curve(class_info['gt_lbl'], class_info['score'], drop_intermediate=True)
    if verbose:
        print("%s fpr:%s " % (class_name, fpr,))
        print("%s tpr:%s " % (class_name, tpr,))
        print("%s thresholds:%s " % (class_name, thresholds,))

    if show:
        auc_score = calc_auc_score(fpr, tpr)
        show_roc_curve(fpr, tpr, auc_score, class_name)

    return fpr, tpr, thresholds

ROC curve drawing method:

def show_roc_curve(fpr, tpr, auc_score, class_name=None, save=False):
    plt.figure("%s ROC Curve" % class_name)
    plt.title('%s ROC Curve' % class_name)
    plt.xlabel('False Positive Rate')  # x-axis is FPR
    plt.ylabel('True Positive Rate')   # y-axis is TPR

    plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % auc_score)
    plt.legend(loc='lower right')
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlim([-0.1, 1.1])
    plt.ylim([-0.1, 1.1])
    if save:
        plt.savefig("./%s_auc_curve.png" % class_name)
    plt.show()

AUC (area under the ROC curve)

AUC (Area Under Curve) is defined as the area under the ROC curve; obviously this value is never greater than 1. Since the ROC curve generally lies above the line y = x, AUC usually falls between 0.5 and 1. AUC is used as an evaluation criterion because, in many cases, the ROC curve alone does not clearly show which classifier is better, whereas AUC is a single number: the classifier with the larger AUC is the better one.

sklearn implementation:

The roc_auc_score function computes the area under the ROC curve (AUC) of a classification result. Its prototype is:

sklearn.metrics.roc_auc_score(y_true, y_score, average='macro',
sample_weight=None)

Return value: the AUC value.

Parameters: see roc_curve.

# AUC can also be computed from the (fpr, tpr) sequences as follows
def calc_auc_score(fpr, tpr, verbose=False):
    res = auc(fpr, tpr)
    if verbose:
        print("auc:%s" % res)

    return res

There are two common ways to compute AUC: the trapezoidal rule and the ROC AUCH (convex hull) method; both approximate the area. See Wikipedia for details.
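A sketch of the trapezoidal approximation on made-up scores, checked against sklearn's auc and roc_auc_score (sklearn's auc itself uses the trapezoidal rule):

# trapezoidal-rule AUC versus sklearn's auc / roc_auc_score (made-up scores)
from sklearn.metrics import roc_curve, auc, roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, _ = roc_curve(y_true, scores)

# manual trapezoidal rule over the (fpr, tpr) points
area = sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2 for i in range(len(fpr) - 1))
print(area)                            # manual trapezoidal area
print(auc(fpr, tpr))                   # sklearn's trapezoidal implementation
print(roc_auc_score(y_true, scores))   # computed directly from labels and scores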

Standards for judging the quality of a classifier (prediction model) from AUC:

  • AUC = 1: a perfect classifier. With this model there exists at least one threshold that yields perfect predictions. In most practical situations a perfect classifier does not exist.
  • 0.5 < AUC < 1: better than random guessing. This classifier (model) has predictive value if the threshold is set properly.
  • AUC = 0.5: the same as random guessing (e.g., flipping a coin); the model has no predictive value.
  • AUC < 0.5: worse than random guessing; however, if its predictions are always inverted, it becomes better than random guessing.

Examples of three AUC values:
[Figure: example ROC curves with three different AUC values]

Simply put, a classifier with a larger AUC value tends to perform better.

Note: compare the definitions of TPR, FPR, Precision, and Recall. The denominators of TPR and Recall are the number of actual positives in the sample, and the denominator of FPR is the number of actual negatives. Once the sample set is fixed, these denominators are constants, so all three metrics increase monotonically with their numerators. The denominator of Precision, however, is the number of predicted positives, which changes with the threshold, so Precision is affected by both TP and FP; it is not monotonic and its change is hard to predict.

ROC-AUC curve drawing method in multi-category situation

def show_roc_info(classdict, show=True, save=False, figsize=(30, 22), fontsize=12):
    """
    For the multi-class case, compute and display the ROC-AUC plot of every class.
    :param classdict: dict mapping class name -> {'gt_lbl': [...], 'score': [...]}
    :param show: whether to display the figure
    :param save: whether to save the figure
    :param figsize: figure size
    :param fontsize: font size
    :return:
    """
    def sub_curve(fpr, tpr, auc_score, class_name, sub_idx):
        plt.subplot(6, 5, sub_idx)
        plt.title('%s ROC Curve' % class_name)
        plt.xlabel('False Positive Rate')  # x-axis is FPR
        plt.ylabel('True Positive Rate')   # y-axis is TPR

        plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % auc_score)
        plt.legend(loc='lower right')
        plt.plot([0, 1], [0, 1], 'r--')
        plt.xlim([-0.1, 1.1])
        plt.ylim([-0.1, 1.1])

    if show:
        plt.figure("Maoyan video class AUC Curve", figsize=figsize)
        plt.subplots_adjust(bottom=0.02, right=0.98, top=0.98)
    for idx, cls in enumerate(MY_CLASSES):
        if cls in classdict:
            fpr, tpr, thresholds = calc_roc_curve(classdict[cls], class_name=cls, show=False)
            auc_score = calc_auc_score(fpr, tpr)
            print("%s auc:\t\t\t%.4f" % (cls, auc_score))
            if show:
                sub_curve(fpr, tpr, auc_score, cls, idx + 1)
        else:
            print("%s auc:\t\t\t0" % cls)
            if show:
                sub_curve([0], [0], 0, cls, idx + 1)

    if save:
        plt.savefig("./maoyan_all_auc_curve.png")
    if show:
        plt.show()


Practical skills

For ROC: generally speaking, if the ROC curve is smooth, you can basically judge that there is not much overfitting (for example, the region from 0.2 to 0.4 in the figure may be problematic, but the sample there is too small). After that, tune the model by looking at AUC: the larger the area, the better the model is generally considered to be.

For the PRC (precision-recall curve) the logic is the same as for ROC. First look at smoothness (the blue line is clearly better), then check which curve lies above the other on the same test set; generally the upper one is better (the green line is better than the red line). F1 is largest where P and R are close to each other; in the smooth case, F1 is roughly largest where the line connecting (0,0) and (1,1) crosses the PRC. F1 is to the PRC what AUC is to the ROC, and a single number is more convenient than a whole curve when tuning a model.

AP

Strictly speaking, AP is the area under the PR curve, and mAP is the arithmetic mean of the APs of all classes.
In practice, approximate methods are usually used to estimate this area.

sklearn implementation:

sklearn.metrics.average_precision_score(y_true, y_score, average='macro',
sample_weight=None)

Note: this implementation is restricted to binary classification tasks or multi-label classification tasks.

Parameters:

y_true : array, shape = [n_samples] or [n_samples, n_classes]
True labels, taking values 0 and 1.

y_score : array, shape = [n_samples] or [n_samples, n_classes]

Predicted scores: values in [0, 1].
They can be probability estimates of the positive class, confidence values, or non-thresholded decision measures (such as those returned by the "decision_function" of some classifiers).

average : string, [None, 'micro', 'macro' (default), 'samples', 'weighted']
sample_weight : array-like of shape = [n_samples], optional sample weights.

# helper wrapper
def calc_AP_score(class_info, class_name=None, average="macro", verbose=True):
    res = average_precision_score(class_info['gt_lbl'], class_info['score'], average=average)
    if verbose:
        print("%s ap:\t\t\t%.4f" % (class_name, res))

    return res

AP calculation method

First, use the trained model to obtain the confidence score of every test sample. Suppose a certain class has 20 test samples; the id, confidence score, and ground-truth label of each sample are as follows:
[Table: id, confidence score, and ground-truth label of the 20 test samples]

Then sort the confidence score to get:
[Table: the samples sorted by confidence score in descending order]

Then compute the recall and precision corresponding to each top-N. For a given recall value r, the precision is taken as the maximum precision over all points with recall >= r (this guarantees that the PR curve is monotonically decreasing and avoids zig-zags). This method is called all-points interpolation, and the resulting AP value is exactly the area under the interpolated PR curve.
[Table: recall and precision at each top-N]

For example, at top-5 the recall is 2/6 and the precision is 2/5 = 0.4, and the maximum precision over all points with recall >= 2/6 is 1.

At top-6 the recall is 3/6 and the precision is 3/6, and the maximum precision over all points with recall >= 3/6 is 4/7.

Therefore:

AP = 1·(1/6) + 1·(1/6) + (4/7)·(1/6) + (4/7)·(1/6) + (5/11)·(1/6) + (6/16)·(1/6) ≈ 0.6621

The corresponding Precision-Recall curve (which is monotonically decreasing) is shown below; a code sketch of this interpolation follows it.
[Figure: the interpolated, monotonically decreasing Precision-Recall curve]
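Below is a sketch of the all-points interpolation described above; the labels and scores in the usage line are made up, and note that sklearn's average_precision_score uses a non-interpolated definition, so its values can differ slightly from this one.

# all-points-interpolated AP: precision at recall r is the max precision over recall >= r
import numpy as np

def interpolated_ap(gt_labels, scores):
    order = np.argsort(scores)[::-1]                 # sort by confidence, descending
    gt = np.asarray(gt_labels)[order]
    tp_cum = np.cumsum(gt)                           # number of true positives within top-N
    n_pos = gt.sum()
    recall = tp_cum / n_pos
    precision = tp_cum / np.arange(1, len(gt) + 1)

    # make precision monotonically decreasing: p(r) = max precision over recall >= r
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])

    # sum precision times the recall increment (area under the interpolated PR curve)
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# example usage with made-up data
print(interpolated_ap([1, 0, 1, 0, 1], [0.9, 0.8, 0.7, 0.6, 0.5]))  # ~0.756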

mAP

mean Average Precision: the mean of the per-class AP values.

Compute the AP of each class with the method above and then take the average to obtain mAP. The advantage of mAP is that it prevents the metric from being biased toward classes with more samples.

sklearn implementation:

# mAP computation wrapper
def calc_mAP_score(classdict, verbose=True):
    AP = []
    for cls in MY_CLASSES:
        if cls in classdict:
            AP.append(calc_AP_score(classdict[cls], cls))
        else:
            print("%s ap:\t 0" % cls)
            AP.append(0)
    mAP_score = np.mean(AP)
    if verbose:
        print("mAP:%s" % mAP_score)
    return mAP_score

Calculate and display the pr graph and corresponding AP value of each category in the case of multiple categories

def show_mAP_info(classdict, show=True, save=False, figsize=(30, 22), fontsize=12):
    """
    For the multi-class case, compute and display the PR curve and AP value of every class.
    :param classdict: dict mapping class name -> {'gt_lbl': [...], 'score': [...]}
    :param show: whether to display the figure
    :param save: whether to save the figure
    :param figsize: figure size
    :param fontsize: font size
    :return:
    """

    def sub_curve(recall, precision, class_name, ap_score, sub_idx):
        plt.subplot(6, 5, sub_idx)
        plt.title('%s PR Curve, ap:%.4f' % (class_name, ap_score))
        plt.xlabel('Recall')
        plt.ylabel('Precision')
        plt.plot(recall, precision)

    AP = []
    if show:
        plt.figure("Maoyan video class P-R Curve", figsize=figsize)
        plt.subplots_adjust(bottom=0.02, right=0.98, top=0.98)
    for idx, cls in enumerate(MY_CLASSES):
        if cls in classdict:
            ap_score = calc_AP_score(classdict[cls], cls)
            precision, recall, thresholds = calc_precision_recall_curve(classdict[cls], class_name=cls, show=False)
            if show:
                sub_curve(recall, precision, cls, ap_score, idx + 1)
        else:
            ap_score = 0
            print("%s ap:\t\t\t0" % cls)
            if show:
                sub_curve([0], [0], cls, ap_score, idx + 1)

        AP.append(ap_score)
    if save:
        plt.savefig("./maoyan_all_ap_curve.png")
    if show:
        plt.show()

    mAP_score = np.mean(AP)
    print("mAP:%s" % mAP_score)
    return mAP_score

Method for obtaining required label information in multi-category situation

from pandas import DataFrame  # required for the DataFrame type hints below

def get_simple_result(df: DataFrame):
    y_true = []
    y_pred = []
    pred_scores = []
    for idx, row in df.iterrows():
        video_path = row['video_path']
        gt_label = video_path.split("/")[-2]
        y_true.append(gt_label)

        pred_label = row['cls1']
        y_pred.append(pred_label)

        pred_score = row['score1']
        pred_scores.append(pred_score)

    return y_true, y_pred, pred_scores
def get_multiclass_result(df: DataFrame):
    classdict = {}
    for idx, row in df.iterrows():
        video_path = row['video_path']
        gt_label = video_path.split("/")[-2]
        pred_label = row['cls1']
        pred_score = row['score1']
        if pred_label in classdict:
            classdict[pred_label]['score'].append(pred_score)
            classdict[pred_label]['gt_lbl'].append(1 if gt_label == pred_label else 0)
        else:
            classdict[pred_label] = {'score': [pred_score], 'gt_lbl': [1 if gt_label == pred_label else 0]}
    return classdict


Log loss (log_loss)

from sklearn.metrics import log_loss
log_loss(y_true, y_pred)  # y_pred should be the predicted class probabilities, not hard labels
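A minimal sketch with made-up probabilities:

# log loss (cross-entropy) on predicted probabilities of the positive class
from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.9, 0.8, 0.35]   # predicted probability of the positive class
print(log_loss(y_true, y_prob))  # lower is better; confident correct predictions approach 0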

Text report of classification indicators (classification_report)

The classification_report function displays a text report of the main classification metrics: the precision, recall, F1 value, and support of each class.

sklearn implementation:

sklearn.metrics.classification_report(y_true, y_pred, labels=None,
target_names=None, sample_weight=None, digits=2)

Return value: a formatted string giving the classification evaluation report.
Parameters:
y_true: the ground-truth labels.
y_pred: the predicted labels.
labels: a list specifying which classes appear in the report.
target_names: a list specifying the display names of the classes in the report.
digits: the number of decimal places used to format the floating-point numbers in the report.
sample_weight: sample weights; by default every sample has weight 1.
The content of the classification report is as follows:
precision column: the precision, computed treating class 0 as positive, then class 1 as positive, and so on.
recall column: the recall, computed treating class 0 as positive, then class 1 as positive, and so on.
F1 column: the F1 value.
support column: the number of samples of that class.
avg / total row:
for precision, recall, and F1, the arithmetic mean of the column;
for the support column, the sum of the column (which equals the total number of samples).
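A minimal usage sketch with made-up labels and hypothetical class names:

# per-class precision / recall / F1 / support in one text table
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]
print(classification_report(y_true, y_pred, target_names=['cat', 'dog', 'bird'], digits=3))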

Use summary

  • For classification models, AUC, the ROC curve (the line connecting the (FPR, TPR) points), and the PR curve (the line connecting the (recall, precision) points) comprehensively evaluate the model's ability to discriminate and to rank, whereas accuracy, recall, and F1 are metrics computed only after a threshold has been fixed.
  • For the same model, both the PRC and the ROC curve can explain certain aspects of its behavior, and the two are correlated; to evaluate a model thoroughly, both curves can be drawn for a comprehensive assessment.
  • For supervised binary classification, when both positive and negative samples are sufficient, the ROC curve and AUC can be used directly to evaluate the model; when the samples are extremely imbalanced, the PR curve better reflects the model's performance.
  • While choosing the threshold, the model's classification effect can be assessed with Precision, Recall, or F1. For multi-class problems, Precision, Recall, and F1 can be computed per class and then aggregated as the overall evaluation metric.

Regression model

Mean absolute error (MAE)

sklearn implementation:

The mean_absolute_error function computes the mean absolute error (MAE) of regression predictions. Its prototype is:

sklearn.metrics.mean_absolute_error(y_true, y_pred, sample_weight=None,
multioutput='uniform_average')

Return value: the mean of the absolute prediction errors.

Parameters:

y_true: the ground-truth values.

y_pred: the predicted values.

multioutput: specifies how the error is aggregated for multi-output regression problems. It can be:

'raw_values': return the error of each output variable.

'uniform_average': return the average of the errors over all output variables.

sample_weight: sample weights; by default every sample has weight 1.

Formula: MAE = (1/n) * Σ|y_true_i − y_pred_i|

MSE (mean squared error)

The mean_squared_error function computes the mean squared error (MSE) of regression predictions. Its prototype is:

sklearn.metrics.mean_squared_error(y_true, y_pred, sample_weight=None,
multioutput='uniform_average')

Return value: the mean of the squared prediction errors.

Parameters: see mean_absolute_error.

Formula: MSE = (1/n) * Σ(y_true_i − y_pred_i)^2

RMSE (root mean squared error)

Formula: RMSE = sqrt(MSE) = sqrt((1/n) * Σ(y_true_i − y_pred_i)^2)

NRMSE (normalized root mean squared error)

NRMSE is the RMSE normalized by the scale of the observations (commonly the range or the mean of y_true), which makes errors comparable across targets of different scales.

Coefficient of determination (R2)

R2 is the ratio of the regression sum of squares to the total sum of squares. It is a statistic that measures the goodness of fit of a regression equation and reflects the proportion of the variation in the dependent variable y that can be explained by the estimated regression.

The closer R2 is to 1, the larger the proportion of the total sum of squares explained by the regression, the closer the regression line is to the observed points, the more of the variation in y is explained by x, and the better the fit of the regression.

from sklearn.metrics import r2_score

r2_score(y_true, y_pred, sample_weight=None, multioutput='uniform_average')

Formula: R2 = 1 − Σ(y_true_i − y_pred_i)^2 / Σ(y_true_i − mean(y_true))^2
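A combined sketch computing MAE, MSE, RMSE, and R2 on made-up data:

# MAE / MSE / RMSE / R2 on a small made-up regression example
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                      # RMSE is the square root of MSE
r2 = r2_score(y_true, y_pred)
print(mae, mse, rmse, r2)                # 0.5 0.375 ~0.612 ~0.949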

The following articles were referenced in summarizing these commonly used deep learning evaluation metrics:
Terry:https://zhuanlan.zhihu.com/p/86120987
cicada:https://zhuanlan.zhihu.com/p/267901426
https://www.pythonf.cn/read/128402
Lu Yuan: https://www.zhihu.com/question/30643044
