
Abstract: The core idea of Focal Loss is to use a suitable function to measure the contribution of easy and hard samples to the total loss.

This article is shared from the Huawei Cloud Community post "A Better Understanding of Focal Loss Based on MindSpore", original author: chengxiaoli.

Today we look at Focal Loss, the loss function proposed by Kaiming He's team in the paper Focal Loss for Dense Object Detection and used to improve image object detection. It is ICCV 2017 work by Ross Girshick (RBG) and Kaiming He ( https://arxiv.org/pdf/1708.02002.pdf ).

Application scenario

Recently, I have been working on facial expression recognition. Datasets in this field are not large, and they often suffer from an imbalance between positive and negative samples. Generally speaking, there are two ways to deal with an imbalanced number of positive and negative samples:

1. Design a sampling strategy, which generally means resampling the minority class

2. Design the loss, which generally means assigning different weights to samples of different classes

This article is about Focal Loss, which belongs to the second strategy; a minimal sketch of the weighting idea is shown below.
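For intuition only (this sketch is not from the original article), here is a plain NumPy illustration of the second strategy: weighting each sample's cross-entropy by its class. The values of probs, labels, and class_weights are hypothetical examples.

import numpy as np

# Minimal sketch of strategy 2: weight each sample's cross-entropy by its class.
# Hypothetical values; in practice the weights would be derived from class frequencies.
probs = np.array([0.9, 0.2, 0.7])          # predicted probability of the true class, per sample
labels = np.array([0, 1, 0])               # class index of each sample
class_weights = np.array([1.0, 5.0])       # up-weight the rare class (class 1)

ce = -np.log(probs)                        # standard per-sample cross-entropy
weighted_ce = class_weights[labels] * ce   # rare-class samples contribute more to the loss
print(weighted_ce.mean())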

Theoretical analysis

Paper analysis

We know that object detectors are generally divided into two categories according to their pipeline. One type is the two-stage detector (such as the very classic Faster R-CNN and R-FCN, which require a region proposal stage), and the other is the one-stage detector (such as SSD and the YOLO series, which skip region proposals and regress detections directly).

The first type can achieve high accuracy but is slower. Although the speed can be increased by reducing the number of proposals or lowering the resolution of the input image, the improvement is not qualitative.

The second type is very fast, but its accuracy is not as good as the first.

So the goal, and the starting point of focal loss, is to let the one-stage detector reach the accuracy of the two-stage detector while keeping its speed.

So why does the accuracy gap exist in the first place? The reason is class imbalance: the extreme imbalance between positive and negative samples.

We know that in object detection an image may generate thousands of candidate locations, but only a few of them contain objects, which leads to class imbalance. So what are the consequences of class imbalance? Quoting the two consequences from the original paper:

(1) training is inefficient as most locations are easy negatives that contribute no useful learning signal;
(2) en masse, the easy negatives can overwhelm training and lead to degenerate models.

It means that the number of negative samples (samples belonging to the background) is too large; they account for most of the total loss, and most of them are easy to classify, so the model is optimized in a direction we do not want. As a result, the network cannot learn useful information and cannot classify objects accurately. In fact, there are existing algorithms that deal with class imbalance, such as OHEM (online hard example mining). The main idea of OHEM can be summarized by one sentence from the paper: "In OHEM each example is scored by its loss, non-maximum suppression (nms) is then applied, and a minibatch is constructed with the highest-loss examples." Although OHEM increases the weight of misclassified samples, it completely ignores samples that are easy to classify (a simplified sketch of its selection step is shown below).
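For intuition only (not from the original article), here is a simplified NumPy sketch of the OHEM selection step: score every candidate by its loss and keep only the top-k hardest ones for the minibatch. NMS is omitted for brevity, and the loss values are hypothetical.

import numpy as np

# Simplified OHEM-style selection (NMS omitted): keep only the highest-loss examples.
per_example_loss = np.array([0.02, 1.7, 0.05, 2.3, 0.01, 0.9])  # hypothetical per-example losses
k = 3                                                            # number of hard examples to keep
hard_idx = np.argsort(per_example_loss)[::-1][:k]                # indices of the k hardest examples
minibatch_loss = per_example_loss[hard_idx].mean()
print(hard_idx, minibatch_loss)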

Therefore, in response to the problem of category imbalance, the author proposes a new loss function: Focal Loss. This loss function is modified on the basis of the standard cross-entropy loss. This function can reduce the weight of easy-to-classify samples so that the model can focus more on difficult-to-classify samples during training. In order to prove the effectiveness of Focal Loss, the author designed a dense detector: RetinaNet, and used Focal Loss training during training. Experiments show that RetinaNet can not only achieve the speed of one-stage detector, but also the accuracy of two-stage detector.

Formula description

Before introducing focal loss, let's first take a look at the cross-entropy loss, using binary classification as an example. The original classification loss is the direct sum of the cross-entropy of each training sample, that is, every sample has the same weight. The formula is as follows:
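Written out in the paper's notation, the binary cross-entropy and the shorthand pt are:

\mathrm{CE}(p, y) =
\begin{cases}
-\log(p) & \text{if } y = 1 \\
-\log(1 - p) & \text{otherwise}
\end{cases}
\qquad
p_t =
\begin{cases}
p & \text{if } y = 1 \\
1 - p & \text{otherwise}
\end{cases}
\qquad
\mathrm{CE}(p, y) = \mathrm{CE}(p_t) = -\log(p_t)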

Plotting the two losses gives Figure 1 below: the abscissa is pt and the ordinate is the loss. CE(pt) denotes the standard cross-entropy, and FL(pt) denotes the modified cross-entropy used in focal loss. The blue curve with γ = 0 in Figure 1 is the standard cross-entropy loss.
[Figure 1: loss vs. pt for the standard cross-entropy (γ = 0) and focal loss curves with increasing γ]

This not only addresses the imbalance between positive and negative samples, but also the imbalance between easy and hard samples.
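As written in the paper, focal loss adds a modulating factor (1 − pt)^γ to the cross-entropy, and the α-balanced variant used in practice additionally weights positives and negatives differently:

\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma}\,\log(p_t)
\qquad
\mathrm{FL}(p_t) = -\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t)

The factor (1 − pt)^γ down-weights easy samples, while α_t balances positive and negative samples.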

Conclusion

The author regards class imbalance as the main obstacle preventing one-stage detectors from surpassing the top-performing two-stage methods. To solve this problem, the author proposes focal loss, which adds a modulating term to the cross-entropy so that learning focuses on hard examples and the weight of the large number of easy negatives is reduced. It addresses the imbalance between positive and negative samples and the distinction between easy and hard samples at the same time.

MindSpore code implementation

Let's take a look at the code that implements Focal Loss based on MindSpore:

import mindspore
import mindspore.common.dtype as mstype
from mindspore.common.tensor import Tensor
from mindspore.common.parameter import Parameter
from mindspore.ops import operations as P
from mindspore.ops import functional as F
from mindspore import nn
# The two imports below are assumed from the MindSpore 1.x source layout;
# they are needed for the validator and the _Loss base class used further down.
from mindspore._checkparam import Validator as validator
from mindspore.nn.loss.loss import _Loss

# _check_ndim, _check_channel_and_shape and _check_predict_channel are
# input-validation helpers defined alongside FocalLoss in MindSpore's
# nn loss module; their definitions are omitted here.


class FocalLoss(_Loss):

    def __init__(self, weight=None, gamma=2.0, reduction='mean'):
        super(FocalLoss, self).__init__(reduction=reduction)
        # Validate gamma; γ is the focusing parameter (γ >= 0) used in the modulating factor
        self.gamma = validator.check_value_type("gamma", gamma, [float])
        if weight is not None and not isinstance(weight, Tensor):
            raise TypeError("The type of weight should be Tensor, but got {}.".format(type(weight)))
        self.weight = weight
        # MindSpore operators used below
        self.expand_dims = P.ExpandDims()
        self.gather_d = P.GatherD()
        self.squeeze = P.Squeeze(axis=1)
        self.tile = P.Tile()
        self.cast = P.Cast()

    def construct(self, predict, target):
        targets = target
        # Validate the inputs
        _check_ndim(predict.ndim, targets.ndim)
        _check_channel_and_shape(targets.shape[1], predict.shape[1])
        _check_predict_channel(predict.shape[1])

        # Reshape logits and target to num_batch * num_class * num_voxels
        if predict.ndim > 2:
            predict = predict.view(predict.shape[0], predict.shape[1], -1)  # N,C,H,W => N,C,H*W
            targets = targets.view(targets.shape[0], targets.shape[1], -1)  # N,1,H,W => N,1,H*W or N,C,H*W
        else:
            predict = self.expand_dims(predict, 2)  # N,C => N,C,1
            targets = self.expand_dims(targets, 2)  # N,1 => N,1,1 or N,C,1

        # Compute log-probabilities
        log_probability = nn.LogSoftmax(1)(predict)
        # Keep only the log-probability of the ground-truth class of each voxel
        if target.shape[1] == 1:
            log_probability = self.gather_d(log_probability, 1, self.cast(targets, mindspore.int32))
            log_probability = self.squeeze(log_probability)

        # Recover probabilities
        probability = F.exp(log_probability)

        if self.weight is not None:
            convert_weight = self.weight[None, :, None]  # C => 1,C,1
            convert_weight = self.tile(convert_weight, (targets.shape[0], 1, targets.shape[2]))  # 1,C,1 => N,C,H*W
            if target.shape[1] == 1:
                convert_weight = self.gather_d(convert_weight, 1, self.cast(targets, mindspore.int32))  # selection of the weights => N,1,H*W
                convert_weight = self.squeeze(convert_weight)  # N,1,H*W => N,H*W
            # Multiply the log-probabilities by their class weights
            probability = log_probability * convert_weight

        # Compute the per-sample loss of the mini-batch
        weight = F.pows(-probability + 1.0, self.gamma)
        if target.shape[1] == 1:
            loss = (-weight * log_probability).mean(axis=1)  # N
        else:
            loss = (-weight * targets * log_probability).mean(axis=-1)  # N,C

        return self.get_loss(loss)

The method of use is as follows:

from mindspore.common import dtype as mstype
from mindspore import nn
from mindspore import Tensor

predict = Tensor([[0.8, 1.4], [0.5, 0.9], [1.2, 0.9]], mstype.float32)
target = Tensor([[1], [1], [0]], mstype.int32)
focalloss = nn.FocalLoss(weight=Tensor([1, 2]), gamma=2.0, reduction='mean')
output = focalloss(predict, target)
print(output)

0.33365273

Two important properties of Focal Loss

  1. When a sample is misclassified, pt is small, so the modulating factor (1 − pt)^γ is close to 1 and the loss is almost unaffected. When pt → 1 (the sample is correctly and easily classified), the modulating factor tends to 0, so the contribution of these well-classified samples to the total loss is scaled down.
  2. When γ = 0, focal loss reduces to the traditional cross-entropy loss; as γ increases, the effect of the modulating factor grows. The focusing parameter γ smoothly adjusts the rate at which easy samples are down-weighted, and the paper's experiments found γ = 2 to work best. Intuitively, the modulating factor reduces the loss contribution of easy samples and widens the range of samples that receive a low loss. With γ = 2, for example, an easy example with pt = 0.9 has a loss 100+ times smaller than the standard cross-entropy loss, and at pt = 0.968 about 1000 times smaller, whereas for a hard example (pt < 0.5) the loss is reduced at most 4 times. The relative weight of hard examples is therefore greatly increased (a quick numeric check of these ratios follows).
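For intuition only (not in the original article), the ratios quoted above can be checked directly: with γ = 2 the focal loss is smaller than the cross-entropy by a factor of 1 / (1 − pt)^γ.

import numpy as np

gamma = 2.0
for pt in (0.9, 0.968, 0.5):
    ce = -np.log(pt)                      # standard cross-entropy
    fl = -(1 - pt) ** gamma * np.log(pt)  # focal loss with the modulating factor
    print(pt, ce / fl)                    # ~100x, ~1000x, 4x respectively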

This increases the importance of misclassified samples. These two properties are the core of Focal Loss: in essence, it uses a suitable function to measure the contribution of easy and hard samples to the total loss.
