Abstract: This paper proposes an end-to-end line segment detection model based on the Transformer. Using a multi-scale Encoder/Decoder strategy, the model obtains more accurate line endpoint coordinates. The authors use the distance between the predicted endpoints of a line segment and the endpoints of the ground truth directly as the objective function, which regresses the endpoint coordinates of the segment more effectively.

This article is shared from the Huawei Cloud Community post "Paper Interpretation Series Seventeen: Transformer-based Line Segment Detection", by cver.

1. Article summary

Traditional morphological line segment detection first performs edge detection on the image, then post-processes the edges to obtain line segments. Typical deep learning methods first produce heat-map features for line endpoints and lines, then fuse them to obtain the line detection result. The authors propose a new Transformer-based method that requires neither edge detection nor endpoint/line heat maps: it directly produces the line segment detection result, i.e., the endpoint coordinates of each segment, end to end.

Line segment detection falls under the umbrella of object detection. The model proposed in this paper, LETR, is an extension of DETR (End-to-End Object Detection with Transformers). The difference lies in what the decoder regresses: DETR regresses the center point, width, and height of a box, while LETR regresses the endpoint coordinates of a line segment.

Therefore, I will first introduce how DETR uses the Transformer for object detection, and then focus on the parts unique to LETR.

2. How to use the Transformer for object detection (DETR)


Figure 1. DETR model structure

The figure above shows the model structure of DETR. DETR first uses a CNN backbone to extract image features, encodes them, and feeds them into a Transformer to obtain N predicted boxes, then uses an FFN to perform classification and coordinate regression; up to this point it resembles traditional object detection. It then performs bipartite matching between the N predicted boxes and the M ground-truth boxes (N > M; the extra predictions are matched to "no object", with their coordinate values set to 0), and uses the matching result and the matching loss to update the weight parameters, yielding the final detected boxes and categories. There are a few key points here:

The first is the serialization and encoding of image features.

The feature map output by the CNN backbone has dimensions C × H × W. First, a 1×1 convolution reduces the dimensionality, compressing the channels from C to d and producing a d × H × W feature map. The H and W dimensions are then merged, so the feature map becomes d × HW. The serialized feature map loses the positional information of the original image, so a positional encoding is added to obtain the final serialized encoded features.
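The serialization steps above can be sketched in a few lines of PyTorch (an illustrative sketch, not the authors' code; C = 2048 and d = 256 are typical ResNet-50/DETR values, and the positional encoding is simplified to a random tensor):

```python
import torch
import torch.nn as nn

C, d, H, W = 2048, 256, 16, 16
features = torch.randn(1, C, H, W)      # CNN backbone output: B x C x H x W

proj = nn.Conv2d(C, d, kernel_size=1)   # 1x1 conv compresses channels C -> d
x = proj(features)                      # B x d x H x W
x = x.flatten(2)                        # merge H and W: B x d x (H*W)
x = x.permute(2, 0, 1)                  # sequence-first: (H*W) x B x d

pos_embed = torch.randn(H * W, 1, d)    # stand-in for the positional encoding
encoder_input = x + pos_embed           # serialized, position-aware features
print(encoder_input.shape)              # torch.Size([256, 1, 256])
```

The resulting sequence of HW tokens, each of dimension d, is what the Transformer encoder consumes.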

Next is the Transformer's Decoder.

For object detection, the Transformer's decoder processes all decoder inputs (the object queries) at once, which differs slightly from the original Transformer, which generates outputs one by one from left to right.

Another point is that the decoder's input is randomly initialized and is trained and updated along with the model.

Bipartite matching

The Transformer's decoder outputs N object proposals. We do not know their correspondence to the ground truth, so they must be matched as a bipartite graph: the Hungarian algorithm finds the assignment that minimizes the matching loss. The matching loss is as follows:
L_match(y_i, ŷ_σ(i)) = −𝟙{c_i ≠ ∅} · p̂_σ(i)(c_i) + 𝟙{c_i ≠ ∅} · L_box(b_i, b̂_σ(i))

After obtaining the final match, this loss and the classification loss are used to update the parameters.
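The Hungarian step itself is a standard assignment problem and can be sketched with `scipy.optimize.linear_sum_assignment` (a minimal sketch: the cost here is just a pairwise L1 distance between boxes, whereas DETR's real cost also includes the classification-probability and GIoU terms):

```python
import torch
from scipy.optimize import linear_sum_assignment

N, M = 5, 3                        # N predictions, M ground-truth objects
pred = torch.rand(N, 4)            # predicted boxes (cx, cy, w, h)
gt = torch.rand(M, 4)              # ground-truth boxes

cost = torch.cdist(pred, gt, p=1)  # N x M matrix of L1 matching costs
row_ind, col_ind = linear_sum_assignment(cost.numpy())
# row_ind[k] is the prediction matched to ground truth col_ind[k];
# the remaining N - M predictions are treated as "no object".
print(list(zip(row_ind, col_ind)))
```

The solver returns exactly M pairs, so the leftover N − M proposals fall into the empty class, as described above.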

3. LETR model structure


Figure 2. LETR model structure

The Transformer structure mainly comprises the Encoder, the Decoder, and FFNs. Each encoder layer contains two sub-layers: self-attention and feed-forward. In addition to self-attention and feed-forward, each decoder layer also includes cross-attention. The attention mechanism is similar to the original Transformer's; the only difference, the decoder's cross-attention, was introduced above, so it is not repeated here.

Coarse-to-Fine strategy

As the figure above shows, LETR contains two Transformers. The authors call this a multi-scale Encoder/Decoder strategy, and the two Transformers are named the Coarse Encoder/Decoder and the Fine Encoder/Decoder.

First, the deep, small-scale feature map of the CNN backbone (ResNet's conv5; 1/32 of the original image size, with 2048 channels) is used to train one Transformer, the Coarse Encoder/Decoder, to obtain coarse-grained line segment features (during this stage the Fine Encoder/Decoder is frozen and only the Coarse Encoder/Decoder's parameters are updated). The output of the Coarse Decoder is then used as the input of the Fine Decoder, and a second Transformer, the Fine Encoder/Decoder, is trained. The input of the Fine Encoder is the shallow feature map of the CNN backbone (ResNet's conv4; 1/16 of the original image size, with 1024 channels), which is larger than the deep feature map and therefore makes better use of the image's high-resolution information.

Note: the deep and shallow feature maps of the CNN backbone must first be reduced to 256 dimensions by a 1×1 convolution before being used as Transformer input.
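The preparation of the two feature scales can be sketched as follows (an illustrative sketch; the spatial sizes assume a hypothetical 512×512 input, while the channel counts and the 256-dim reduction come from the description above):

```python
import torch
import torch.nn as nn

conv5 = torch.randn(1, 2048, 16, 16)  # deep map: 1/32 scale, 2048 channels
conv4 = torch.randn(1, 1024, 32, 32)  # shallow map: 1/16 scale, 1024 channels

reduce5 = nn.Conv2d(2048, 256, kernel_size=1)  # 1x1 conv -> 256 dims (coarse)
reduce4 = nn.Conv2d(1024, 256, kernel_size=1)  # 1x1 conv -> 256 dims (fine)

# Flatten each reduced map into a token sequence for its Transformer.
coarse_in = reduce5(conv5).flatten(2).permute(2, 0, 1)  # 256 tokens x B x 256
fine_in = reduce4(conv4).flatten(2).permute(2, 0, 1)    # 1024 tokens x B x 256
print(coarse_in.shape, fine_in.shape)
```

The fine stage thus works over four times as many tokens as the coarse stage, which is where the extra high-resolution information comes from.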

Bipartite matching

As in DETR, the N outputs of the Fine Decoder are used for classification and regression, yielding N predicted line segments. Since we do not know the correspondence between the N predictions and the M ground-truth segments (with N > M), bipartite matching is performed: finding the assignment that minimizes the matching loss. The matching loss takes the same form as DETR's above, with one small difference: where DETR uses a GIoU box term, LETR uses the distance between the line segment's endpoints.
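The endpoint-distance term can be sketched like this (a sketch under assumptions: the L1 norm and the handling of endpoint ordering are illustrative choices, not necessarily the paper's exact formulation):

```python
import torch

N, M = 4, 2
pred = torch.rand(N, 4)  # predicted segments as (x1, y1, x2, y2) endpoints
gt = torch.rand(M, 4)    # ground-truth segments

# Distance between endpoint pairs; take the smaller of the two orderings,
# since the segment (p1, p2) is the same segment as (p2, p1).
gt_flipped = gt[:, [2, 3, 0, 1]]
cost = torch.minimum(torch.cdist(pred, gt, p=1),
                     torch.cdist(pred, gt_flipped, p=1))  # N x M cost matrix
print(cost.shape)        # torch.Size([4, 2])
```

This N × M cost matrix plays the same role as the GIoU term in DETR's matching cost and feeds into the same Hungarian assignment step.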

4. Model test results

The model achieves state-of-the-art results on the Wireframe and YorkUrban datasets.

Figure 3. Comparison of the effect of line segment detection methods

Figure 4. Comparison of performance metrics of line segment detection methods on the two datasets (Table 1); PR curves of the line segment detection methods (Figure 6)


