Metrics of Classification and Regression

Classification is about deciding which categories new instances belong to. For example we can organize objects based on whether they are square or round, or we might have data about different passengers on the Titanic like in project 0, and want to know whether or not each passenger survived. Then when we see new objects we can use their features to guess which class they belong to.

In regression, we want to make a prediction on continuous data. For example we might have a list of different people's height, age, and gender and wish to predict their weight. Or perhaps, like in the final project of this course, we have some housing data and wish to make a prediction about the value of a single home.

The problem at hand will determine how we choose to evaluate a model.

Classification Metrics

机器学习(ML),自然语言处理(NLP),信息检索(IR)等领域,评估(Evaluation)是一个必要的工作,而其评价指标往往有如下几点:准确率(Accuracy),精确率(Precision),召回率(Recall)和F1-Measure。(注:相对来说,IR 的 ground truth 很多时候是一个 Ordered List, 而不是一个 Bool 类型的 Unordered Collection,在都找到的情况下,排在第三名还是第四名损失并不是很大,而排在第一名和第一百名,虽然都是“找到了”,但是意义是不一样的,因此更多可能适用于 MAP 之类评估指标。)

本文将简单介绍其中几个概念。中文中这几个评价指标翻译各有不同,所以一般情况下推荐使用英文。

现在我先假定一个具体场景作为例子。

 假如某个班级有男生80人,女生20人,共计100人.目标是找出所有女生. 现在某人挑选出50个人,其中20人是女生,另外还错误的把30个男生也当作女生挑选出来了. 作为评估者的你需要来评估(evaluation)下他的工作.

首先我们可以计算准确率(accuracy),其定义是: 对于给定的测试数据集,分类器正确分类的样本数与总样本数之比。也就是损失函数是0-1损失时测试数据集上的准确率.

Accuracy

The most basic and common classification metric is accuracy. Accuracy here is described as the proportion of items classified or labeled correctly.

For instance if a classroom has 14 boys and 16 girls, can a facial recognition software correctly identify all boys and all girls? If the software can identify 10 boys and 8 girls, then the software is 60% accurate.

accuracy = number of correctly identified instances / all instances

Accuracy is the default metric used in the .score() method for classifiers in sklearn. You can read more in the documentation here.

这样说听起来有点抽象,简单说就是,前面的场景中,实际情况是那个班级有男的和女的两类,某人(也就是定义中所说的分类器)他又把班级中的人分为男女两类。accuracy需要得到的是此君分正确的人占总人数的比例。很容易,我们可以得到:他把其中70(20女+50男)人判定正确了,而总人数是100人,所以它的accuracy就是70 %(70 / 100).

由准确率,我们的确可以在一些场合,从某种意义上得到一个分类器是否有效,但它并不总是能有效的评价一个分类器的工作。举个例子,google抓取了argcv 100个页面,而它索引中共有10,000,000个页面,随机抽一个页面,分类下,这是不是argcv的页面呢?如果以accuracy来判断我的工作,那我会把所有的页面都判断为"不是argcv的页面",因为我这样效率非常高(return false,一句话),而accuracy已经到了99.999%(9,999,900/10,000,000),完爆其它很多分类器辛辛苦苦算的值,而我这个算法显然不是需求期待的,那怎么解决呢?这就是precision,recall和f1-measure出场的时间了.
(在数据不均匀的情况下,如果以accuracy作为评判依据来优化模型的话。很容易将样本误判为占比较高的类别,而使分类器没有任何实际作用)

在说precision,recall和f1-measure之前,我们需要先需要定义TP,FN,FP,TN四种分类情况. 按照前面例子,我们需要从一个班级中的人中寻找所有女生,如果把这个任务当成一个分类器的话,那么女生就是我们需要的,而男生不是,所以我们称女生为"正类",而男生为"负类".

相关(Relevant),正类 无关(NonRelevant),负类
被检索到(Retrieved) true positives(TP 正类判定为正类,例子中就是正确的判定"这位是女生") false positives(FP 负类判定为正类,"存伪",例子中就是分明是男生却判断为女生,当下伪娘横行,这个错常有人犯)
未被检索到(Not Retrieved) false negatives(FN 正类判定为负类,"去真",例子中就是,分明是女生,这哥们却判断为男生--梁山伯同学犯的错就是这个) true negatives(TN 负类判定为负类,也就是一个男生被判断为男生,像我这样的纯爷们一准儿就会在此处)

通过这张表,我们可以很容易得到这几个值: + TP=20 + FP=30 + FN=0 + TN=50

Precision And Recall

  • Precision: $$\frac{True Positive} {True Positive + False Positive}$$. Out of all the items labeled as positive, how many truly belong to the positive class.

精确率(precision)的公式是$$P=\frac{TP}{TP+FP}$$,它计算的是所有"正确被检索的item(TP)"占所有"实际被检索到的(TP+FP)"的比例.

在例子中就是希望知道此君得到的所有人中,正确的人(也就是女生)占有的比例.所以其precision也就是40%(20女生/(20女生+30误判为女生的男生)).

  • Recall: $frac{True Positive }{ True Positive + False Negative}$. Out of all the items that are truly positive, how many were correctly classified as positive. Or simply, how many positive items were 'recalled' from the dataset.

召回率(recall)的公式是$$R=\frac{TP}{TP+FN}$$,它计算的是所有"正确被检索的item(TP)"占所有"应该检索到的item(TP+FN)"的比例。

在例子中就是希望知道此君得到的女生占本班中所有女生的比例,所以其recall也就是100%.
$$\frac{20女生}{20女生+ 0误判为男生的女生}$$

F1 Score

Now that you've seen precision and recall, another metric you might consider using is the F1 score. F1 score combines precision and recall relative to a specific positive class.

The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst at 0:

$$F1 = \frac{2 \cdot precision \cdot recall}{precision + recall}$$

For more information about F1 score how to use it in sklearn, check out the documentation here.

F1值就是精确值和召回率的调和均值,也就是

$$\frac{2}{F1}=\frac{1}{P}+\frac{1}{R}$$

调整下也就是:

$$F_{1}=\frac{2PR}{P+R}=\frac{2TP}{2TP+FP+FN}$$

需要说明的是,有人列了这样个公式
$$F_{a}=\frac{(a^2+1)PR}{a^2(P+R)}$$
将F-measure一般化.

F1-measure认为精确率和召回率的权重是一样的,但有些场景下,我们可能认为精确率会更加重要,调整参数a,使用Fa-measure可以帮助我们更好的evaluate结果.

Regression Metrics

As mentioned earlier for regression problems we are dealing with model that makes continuous predictions. In this case we care about how close the prediction is.

For example with height & weight predictions it is unreasonable to expect a model to 100% accurately predict someone's weight down to a fraction of a pound! But we do care how consistently the model can make a close prediction--perhaps with 3-4 pounds.

Mean Absolute Error

One way to measure error is by using absolute error to find the predicted distance from the true value. The mean absolute error takes the total absolute error of each example and averages the error based on the number of data points. By adding up all the absolute values of errors of a model we can avoid canceling out errors from being too high or below the true values and get an overall error metric to evaluate the model on.

For more information about mean absolute error and how to use it in sklearn, check out the documentation here.

Mean Squared Error

Mean squared is the most common metric to measure model performance. In contrast with absolute error, the residual error (the difference between predicted and the true value) is squared.

Some benefits of squaring the residual error is that error terms are positive, it emphasizes larger errors over smaller errors, and is differentiable. Being differentiable allows us to use calculus to find minimum or maximum values, often resulting in being more computationally efficient.

For more information about mean squared error and how to use it in sklearn, check out the documentation here.

Regression Scoring Functions

In addition to error metrics, scikit-learn contains two scoring metrics which scale continuously from 0 to 1, with values of 0 being bad and 1 being perfect performance.

These are the metrics that you'll use in the project at the end of the course. They have the advantage of looking similar to classification metrics, with numbers closer to 1.0 being good scores and bad scores tending to be near 0.

One of these is the R2 score, which computes the coefficient of determination of predictions for true values. This is the default scoring method for regression learners in scikit-learn.

The other is the explained variance score.

While we will not dive deep into explained variance score and R2 score in this lecture , one important point to remember is that, in general, metrics for regression are such that "higher is better"; that is, higher scores indicate better performance. When using error metrics, such as mean squared error or mean absolute error, we will need to overwrite this preference.

误差分析

做回归分析,常用的误差主要有均方误差根(RMSE)和R-平方(R2)。

RMSE是预测值与真实值的误差平方根的均值。这种度量方法很流行(Netflix机器学习比赛的评价方法),是一种定量的权衡方法。

R2方法是将预测值跟只使用均值的情况下相比,看能好多少。其区间通常在(0,1)之间。0表示还不如什么都不预测,直接取均值的情况,而1表示所有预测跟真实结果完美匹配的情况。

R2的计算方法,不同的文献稍微有不同。如本文中函数R2是依据scikit-learn官网文档实现的,跟clf.score函数结果一致。

What Is Goodness-of-Fit for a Linear Model?

Linear regression calculates an equation that minimizes the distance between the fitted line and all of the data points. Technically, ordinary least squares (OLS) regression minimizes the sum of the squared residuals.

In general, a model fits the data well if the differences between the observed values and the model's predicted values are small and unbiased.

Before you look at the statistical measures for goodness-of-fit, you should check the residual plots. Residual plots can reveal unwanted residual patterns that indicate biased results more effectively than numbers. When your residual plots pass muster, you can trust your numerical results and check the goodness-of-fit statistics.

What Is R-squared?

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.

The definition of R-squared is fairly straight-forward; it is the percentage of the response variable variation that is explained by a linear model. Or:

  • R-squared = Explained variation / Total variation

  • R-squared is always between 0 and 100%:

0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data. However, there are important conditions for this guideline that I’ll talk about both in this post and my next post.

The Coefficient of Determination, r-squared

Here's a plot illustrating a very weak relationship between y and x. There are two lines on the plot, a horizontal line placed at the average response, $bar{y}$, and a shallow-sloped estimated regression line, $hat{y}$. Note that the slope of the estimated regression line is not very steep, suggesting that as the predictor x increases, there is not much of a change in the average response y. Also, note that the data points do not "hug" the estimated regression line:

image
$$SSR=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2=119.1$$

$$SSE=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2=1708.5$$

$$SSTO=\sum_{i=1}^{n}(y_i-\bar{y})^2=1827.6$$

The calculations on the right of the plot show contrasting "sums of squares" values:

  • SSR is the "regression sum of squares" and quantifies how far the estimated sloped regression line, $hat{y}_i$, is from the horizontal "no relationship line," the sample mean or $bar{y}$.

  • SSE is the "error sum of squares" and quantifies how much the data points, $y_i$, vary around the estimated regression line, $hat{y}_i$.

  • SSTO is the "total sum of squares" and quantifies how much the data points, $y_i$, vary around their mean, $bar{y}$

Note that SSTO = SSR + SSE. The sums of squares appear to tell the story pretty well. They tell us that most of the variation in the response y (SSTO = 1827.6) is just due to random variation (SSE = 1708.5), not due to the regression of y on x (SSR = 119.1). You might notice that SSR divided by SSTO is 119.1/1827.6 or 0.065.Do you see where this quantity appears on Minitab's fitted line plot?

Contrast the above example with the following one in which the plot illustrates a fairly convincing relationship between y and x. The slope of the estimated regression line is much steeper, suggesting that as the predictor x increases, there is a fairly substantial change (decrease) in the response y. And, here, the data points do "hug" the estimated regression line:
image
$$SSR=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2=6679.3$$

$$SSE=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2=1708.5$$

$$SSTO=\sum_{i=1}^{n}(y_i-\bar{y})^2=8487.8$$

The sums of squares for this data set tell a very different story, namely that most of the variation in the response y (SSTO = 8487.8) is due to the regression of y on x (SSR = 6679.3) not just due to random error (SSE = 1708.5). And, SSR divided by SSTO is 6679.3/8487.8 or 0.799, which again appears on Minitab's fitted line plot.

The previous two examples have suggested how we should define the measure formally. In short, the "coefficient of determination" or "r-squared value," denoted $r^2$, is the regression sum of squares divided by the total sum of squares. Alternatively, as demonstrated in this , since SSTO = SSR + SSE, the quantity $r^2$ also equals one minus the ratio of the error sum of squares to the total sum of squares:

$$r^2=\frac{SSR}{SSTO}=1-\frac{SSE}{SSTO}$$

Here are some basic characteristics of the measure:

  • Since $r^2$ is a proportion, it is always a number between 0 and 1.

  • If $r^2$ = 1, all of the data points fall perfectly on the regression line. The predictor x accounts for all of the variation in y!

  • If $r^2$ = 0, the estimated regression line is perfectly horizontal. The predictor x accounts for none of the variation in y!

We've learned the interpretation for the two easy cases — when r2 = 0 or r2 = 1 — but, how do we interpret r2 when it is some number between 0 and 1, like 0.23 or 0.57, say? Here are two similar, yet slightly different, ways in which the coefficient of determination r2 can be interpreted. We say either:

$r^2$ ×100 percent of the variation in y is reduced by taking into account predictor x

or:

$r^2$ ×100 percent of the variation in y is 'explained by' the variation in predictor x.

Many statisticians prefer the first interpretation. I tend to favor the second. The risk with using the second interpretation — and hence why 'explained by' appears in quotes — is that it can be misunderstood as suggesting that the predictor x causes the change in the response y. Association is not causation. That is, just because a data set is characterized by having a large r-squared value, it does not imply that x causes the changes in y. As long as you keep the correct meaning in mind, it is fine to use the second interpretation. A variation on the second interpretation is to say, "$r^2$ ×100 percent of the variation in y is accounted for by the variation in predictor x."

Students often ask: "what's considered a large r-squared value?" It depends on the research area. Social scientists who are often trying to learn something about the huge variation in human behavior will tend to find it very hard to get r-squared values much above, say 25% or 30%. Engineers, on the other hand, who tend to study more exact systems would likely find an r-squared value of just 30% merely unacceptable. The moral of the story is to read the literature to learn what typical r-squared values are for your research area!

Key Limitations of R-squared

R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots.

R-squared does not indicate whether a regression model is adequate. You can have a low R-squared value for a good model, or a high R-squared value for a model that does not fit the data!

The R-squared in your output is a biased estimate of the population R-squared.


DerekGrant
196 声望20 粉丝