Abstract: Text detection is the first step in text reading and recognition, and it has a significant impact on subsequent text recognition.
This article is shared from the Huawei Cloud Community post "Technology Overview 13: Curved Text Detection Algorithm (1)", by the author "I want to be quiet".
Background introduction
Text detection is the first step in text reading and recognition, and it has a significant impact on subsequent text recognition. In general scenarios, text lines can be detected and localized by configuring and adapting a general object detection algorithm. In curved text scenes, however, general object detection algorithms cannot represent text boundaries accurately. Therefore, in recent years many academic papers have proposed novel algorithms for scene text detection, which mainly follow two ideas: 1. Text detection based on region reorganization; 2. Text detection based on pixel segmentation.
Text detection algorithm based on region reorganization
PixelLink
PixelLink was proposed mainly to address the difficulty of separating adjacent text instances. The method predicts a text/non-text map together with the connection (link) relationship between each pixel and its eight neighbours (top, bottom, left, right, and the four diagonals). At inference time, pixels predicted as text are joined to the neighbours they are linked to, and the minimum enclosing rectangle of each resulting connected component is taken as the text box.
Figure 1. PixelLink algorithm framework
Because text pixels are aggregated through connected components, the method is sensitive to noise, and small false positives are easily produced at inference time. The authors alleviate this problem by discarding detections whose short side is less than 10 pixels or whose area is less than 300 pixels.
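To make this decoding step concrete, below is a minimal Python sketch of union-find style post-processing in the spirit of PixelLink, not the official implementation. The channel order of the eight link maps, the thresholds, and all variable names are assumptions for illustration; only the filtering values (short side < 10 px, area < 300 px) come from the paper.

```python
import numpy as np
import cv2

def pixellink_decode(text_prob, link_prob, pixel_thresh=0.7, link_thresh=0.7,
                     min_area=300, min_short_side=10):
    """Simplified decoding of PixelLink-style predictions.

    text_prob: (H, W) text/non-text probability map.
    link_prob: (H, W, 8) link probabilities to the 8 neighbours (assumed order below).
    Returns a list of 4-point text boxes.
    """
    h, w = text_prob.shape
    text_mask = text_prob > pixel_thresh
    parent = np.arange(h * w)

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Assumed channel order for the 8 neighbour links.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    ys, xs = np.nonzero(text_mask)
    for y, x in zip(ys, xs):
        for k, (dy, dx) in enumerate(offsets):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and text_mask[ny, nx] \
                    and link_prob[y, x, k] > link_thresh:
                union(y * w + x, ny * w + nx)

    # Group text pixels by their union-find root, one group per text instance.
    components = {}
    for y, x in zip(ys, xs):
        components.setdefault(find(y * w + x), []).append((x, y))

    boxes = []
    for pts in components.values():
        pts = np.array(pts, dtype=np.float32)
        rect = cv2.minAreaRect(pts)              # ((cx, cy), (rw, rh), angle)
        rw, rh = rect[1]
        # Filter small false positives: short side < 10 px or fewer than 300 pixels.
        if min(rw, rh) < min_short_side or len(pts) < min_area:
            continue
        boxes.append(cv2.boxPoints(rect))        # 4 corner points of the text box
    return boxes
```

In the paper each link is actually predicted from both of its endpoint pixels; the sketch above checks only one direction for brevity.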
TextSnake
TextSnake was proposed mainly to address the fact that quadrilateral boxes cannot effectively represent text of arbitrary shapes. The method represents a text region as a sequence of overlapping discs, each with its own centre, radius, and orientation. As shown in Figure 2, the text contour is reconstructed by predicting the text region, the text centre line (actually a centre region), and the radius and angle at each point on the centre line. Post-processing obtains a series of centre points from the predicted centre region to serve as disc centres, draws a circle at each centre with the corresponding radius, and finally takes the outline enclosing all circles as the text boundary.
Figure 2. TextSnake text representation method
Figure 3. Center point mechanism
The steps for obtaining the disc centres are shown in Figure 3. First, a point is randomly selected in the predicted text centre region, and the tangent and normal at that point are drawn according to the predicted direction. The midpoint of the segment where the normal intersects the two sides of the centre region (the red dot in Figure 3(a)) is taken as a centre point, i.e. the centre of a disc. This centre point is then advanced by a fixed stride in the two opposite directions along the centre line to obtain two new points, from which the corresponding midpoints are found in the same way, and so on until the two ends of the text centre region are reached.
This method can effectively detect text of arbitrary shape and orientation, but its post-processing is relatively complicated and time-consuming.
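Assuming the centre points and radii have already been extracted by the striding procedure above, the final contour can be recovered by rasterizing the discs and tracing their union, roughly as in the following sketch (illustrative names, not the official post-processing):

```python
import numpy as np
import cv2

def textsnake_contour(center_points, radii, map_shape):
    """Rough sketch of TextSnake-style contour reconstruction.

    center_points: list of (x, y) points sampled along the predicted centre line.
    radii: predicted disc radius at each centre point.
    map_shape: (H, W) of the score maps.
    Returns the polygon enclosing the union of all discs, or None.
    """
    mask = np.zeros(map_shape, dtype=np.uint8)
    for (x, y), r in zip(center_points, radii):
        # Draw each disc as a filled circle; their union approximates the text region.
        cv2.circle(mask, (int(round(x)), int(round(y))), int(round(r)), 255, -1)

    # Outer contour of the union of discs (OpenCV 4.x signature).
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    return max(contours, key=cv2.contourArea).reshape(-1, 2)
```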
CRAFT
CRAFT is a character-level text detection method proposed mainly to address the limitations of existing methods on curved text, and it is therefore also well suited to curved text detection. The idea of the paper is to detect text of arbitrary shape by regressing character regions and the affinity between characters, where the affinity indicates whether adjacent characters belong to the same text instance. In addition, because many datasets do not provide character-level annotations, the paper proposes a weakly supervised algorithm that generates character-level annotations from word-level annotations.
Figure 4. CRAFT network architecture
As shown in Figure 4, the character region score and the affinity score between adjacent characters are each obtained by regression through a single output channel.
Figure 5. Ground-truth generation method of CRAFT character area
The ground-truth generation process for the character region score and the affinity score used to train the model is shown in Figure 5. For the character region score, a 2D Gaussian map is first generated, the perspective transformation matrix from the Gaussian map to the corresponding character box is then computed, and finally this matrix is used to warp the 2D Gaussian map onto the corresponding character region. The ground truth of the affinity score is generated in the same way, except that the affinity box is used instead. The affinity box is obtained as follows: 1. For each character box, draw its diagonals to divide it into four triangles, and take the centres of the upper and lower triangles. 2. For two adjacent character boxes, use the centres of their upper and lower triangles as the four vertices of the quadrilateral that forms the affinity box.
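A minimal sketch of this warping step using OpenCV's perspective transform is shown below; the canvas size and Gaussian width are illustrative assumptions rather than the values used in the paper.

```python
import numpy as np
import cv2

def make_gaussian(size=64, sigma_ratio=0.25):
    """Canonical isotropic 2D Gaussian heat map on a size x size canvas."""
    coords = np.arange(size, dtype=np.float32)
    xx, yy = np.meshgrid(coords, coords)
    c = (size - 1) / 2.0
    sigma = size * sigma_ratio
    return np.exp(-((xx - c) ** 2 + (yy - c) ** 2) / (2 * sigma ** 2))

def warp_gaussian_to_box(score_map, char_box, gaussian):
    """Warp the canonical Gaussian onto one character (or affinity) quadrilateral.

    score_map: (H, W) float map being accumulated (region or affinity score).
    char_box:  4x2 array of quadrilateral corners, clockwise from top-left.
    """
    size = gaussian.shape[0]
    src = np.float32([[0, 0], [size - 1, 0], [size - 1, size - 1], [0, size - 1]])
    dst = np.float32(char_box)
    # Perspective transform from the Gaussian canvas to the annotated box.
    M = cv2.getPerspectiveTransform(src, dst)
    warped = cv2.warpPerspective(gaussian, M, (score_map.shape[1], score_map.shape[0]))
    # Keep the maximum where neighbouring characters overlap.
    np.maximum(score_map, warped, out=score_map)
    return score_map
```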
The weakly supervised algorithm generates character pseudo-labels as follows: 1. Use a model trained on the synthetic dataset to predict the character region score of the cropped word region; 2. Apply the watershed algorithm to separate the individual character regions; 3. Transform the coordinates back to the original image to obtain the actual character box coordinates.
Figure 6. CRAFT weakly supervised learning process
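Step 2 of the pseudo-label generation, splitting a word crop's predicted region score into per-character blobs, might be sketched as follows; the thresholds and the use of OpenCV's marker-based watershed are assumptions for illustration.

```python
import numpy as np
import cv2

def split_characters(region_score, fg_thresh=0.6, bg_thresh=0.2):
    """Split a cropped word's region score into character blobs via watershed (sketch).

    region_score: (H, W) character region score in [0, 1] from the interim model.
    Returns an int label map where each character has its own id (0 = background).
    """
    sure_fg = (region_score > fg_thresh).astype(np.uint8)   # confident character cores
    sure_bg = (region_score > bg_thresh).astype(np.uint8)   # anything plausibly character
    unknown = cv2.subtract(sure_bg, sure_fg)                # uncertain band to flood

    # Seed markers from confident character centres.
    _, markers = cv2.connectedComponents(sure_fg)
    markers = markers + 1            # reserve 1 for background
    markers[unknown == 1] = 0        # 0 = to be decided by watershed

    color = cv2.cvtColor((region_score * 255).astype(np.uint8), cv2.COLOR_GRAY2BGR)
    markers = cv2.watershed(color, markers)
    markers[markers <= 1] = 0        # drop background and boundary labels
    return markers
```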
Post-processing: at inference time, after the character and affinity maps have been predicted, pixels whose character or affinity confidence exceeds the specified thresholds are set to 1, and each connected region is then labelled. Finally, for quadrilateral text, the minimum enclosing rectangle of each connected region is used as the text box.
Figure 7. Reorganization process of the curved text border
For curved text, the process of obtaining the text contour is shown in Figure 7: first, find the local maximal line of each character region along the character direction; connecting the centres of these lines gives the centre line; the local maximal lines are then rotated to be perpendicular to the centre line; the lines at the two ends are moved outwards to the ends of the text region; finally, all the endpoints are connected to obtain the curved text polygon.
Text detection algorithm based on pixel segmentation
PSENet
PSENet is a purely segmentation-based text detection method. Its original motivation is to effectively separate adjacent text instances of arbitrary shape, which it achieves by predicting text segmentation maps at multiple scales. Specifically, as shown in Figure 1, take segmentation maps at three scales, namely (a), (e), and (f), as an example. The post-processing flow is as follows: first, assign a label to each connected component of the smallest-scale segmentation map (a); then expand (a) outwards to merge the pixels predicted as text in (e); the text pixels in (f) are merged in the same way.
Figure 1. Progressive expansion process of PSENet
This progressive, small-to-large merging of adjacent text pixels can effectively separate adjacent text instances, but at the cost of speed; the slowdown can be alleviated by implementing the expansion in C++.
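A minimal Python sketch of this progressive scale expansion follows. The official implementation uses a C++ breadth-first search; the queue-based, first-come-first-served growth below is a simplified approximation with illustrative names.

```python
import numpy as np
import cv2
from collections import deque

def progressive_scale_expansion(kernels):
    """Simplified sketch of PSENet's progressive scale expansion.

    kernels: list of binary (H, W) maps ordered from the smallest kernel
             to the full text map, e.g. [S1, ..., Sn].
    Returns an int label map where each text instance has a unique id.
    """
    h, w = kernels[0].shape
    # Seed labels from connected components of the smallest kernel.
    _, labels = cv2.connectedComponents(kernels[0].astype(np.uint8))

    for kern in kernels[1:]:
        # Grow every labelled pixel outwards into the next, larger kernel.
        queue = deque(zip(*np.nonzero(labels)))
        while queue:
            y, x = queue.popleft()
            lab = labels[y, x]
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                # Absorb pixels that belong to the larger kernel but are unlabelled.
                if 0 <= ny < h and 0 <= nx < w and kern[ny, nx] and labels[ny, nx] == 0:
                    labels[ny, nx] = lab
                    queue.append((ny, nx))
    return labels
```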
PAN
PAN is designed mainly to address the fact that existing text detection methods are too slow for industrial application, and it improves detection speed in two respects. First, in the network structure, it uses the lightweight ResNet18 as the backbone. Because ResNet18's feature extraction capability and receptive field are limited, the paper further proposes a lightweight feature enhancement module and a feature fusion module, similar to an FPN, which can be cascaded; the enhancement module strengthens the model's features and enlarges the receptive field at only a small computational cost. Second, it speeds up post-processing. The method detects text by predicting the text region, the text kernel (centre region), and a similarity vector for each pixel. Following the idea of clustering, the kernel acts as the cluster centre and text pixels are the samples to be clustered: the similarity vectors of a kernel and the pixels belonging to the same text instance should be close, while the similarity vectors of different kernels should be far apart. At inference time, connected components are first obtained from the kernel map, and then text pixels whose similarity-vector distance to a kernel is less than a threshold d are merged outwards from it. The method achieves real-time detection speed while maintaining high accuracy.
Figure 2. PAN network structure
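The pixel-aggregation step might be sketched as follows, assuming the text map, kernel map, and per-pixel similarity vectors have already been predicted; the variable names and the 4-connected growth are illustrative simplifications of the paper's procedure.

```python
import numpy as np
import cv2
from collections import deque

def pixel_aggregation(text_mask, kernel_mask, similarity, d=6.0):
    """Simplified sketch of PAN-style pixel aggregation.

    text_mask:   (H, W) bool, pixels predicted as text.
    kernel_mask: (H, W) bool, pixels predicted as text kernels (cluster centres).
    similarity:  (H, W, C) per-pixel similarity vectors.
    Returns an int label map of text instances (0 = background).
    """
    h, w = text_mask.shape
    n, labels = cv2.connectedComponents(kernel_mask.astype(np.uint8))

    # Mean similarity vector of each kernel serves as its cluster centre.
    centres = {i: similarity[labels == i].mean(axis=0) for i in range(1, n)}

    # Grow each kernel outwards over 4-connected text pixels whose similarity
    # vector stays within distance d of the kernel's centre.
    queue = deque((y, x) for y, x in zip(*np.nonzero(labels)))
    while queue:
        y, x = queue.popleft()
        lab = labels[y, x]
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if not (0 <= ny < h and 0 <= nx < w):
                continue
            if text_mask[ny, nx] and labels[ny, nx] == 0 \
                    and np.linalg.norm(similarity[ny, nx] - centres[lab]) < d:
                labels[ny, nx] = lab
                queue.append((ny, nx))
    return labels
```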
MSR
MSR is proposed to address the difficulty of detecting text at multiple scales. Unlike other text detection methods, it uses several identical backbones: the input image is downsampled to multiple scales and fed, together with the original image, into these backbones, and the features from the different backbones are then upsampled and fused, capturing rich multi-scale features. The network finally predicts the text centre region and, for each point in it, the x- and y-offsets to the nearest boundary point. At inference time, each point in the centre region produces a boundary point from its predicted x/y offsets, and the final text contour is the one enclosing all of these boundary points.
Figure 3. MSR algorithm framework
Figure 4. MSR network structure
The advantage of this method is its strong ability to detect text at multiple scales. However, because the text centre region it defines shrinks the text area only in the vertical direction and not in the horizontal direction, it cannot effectively separate horizontally adjacent text.
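A rough sketch of this contour recovery is given below. Mapping centre pixels to boundary points follows the description above, but enclosing them with a convex hull is a simplification of the polygon tracing used in the paper.

```python
import numpy as np
import cv2

def msr_boundary_points(center_mask, offset_x, offset_y):
    """Sketch of MSR-style boundary recovery (simplified).

    center_mask: (H, W) bool map of the predicted text centre area.
    offset_x/offset_y: (H, W) predicted offsets from each centre pixel to its
                       nearest boundary point.
    Returns the boundary points and, as a simplification, their convex hull.
    """
    ys, xs = np.nonzero(center_mask)
    # Each centre pixel plus its predicted offset gives one boundary point.
    bx = xs + offset_x[ys, xs]
    by = ys + offset_y[ys, xs]
    points = np.stack([bx, by], axis=1).astype(np.float32)
    hull = cv2.convexHull(points).reshape(-1, 2)
    return points, hull
```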
DB
DB is proposed mainly to address the fact that existing segmentation-based methods need a hand-set threshold for binarization, which makes post-processing time-consuming and limits performance. The method cleverly designs a binarization function that approximates the step function, so that the segmentation network can learn the binarization threshold during training. In addition, at inference time the method directly expands the text centre region by a ratio determined by its area and perimeter to obtain the final text contour, which further improves inference speed. Overall, DB provides a good algorithmic framework for pixel-segmentation-based text detection: it solves the threshold-configuration problem of this class of algorithms and offers good adaptability, since developers can retrain and optimize the backbone for the difficult points of their own scenes to achieve a better balance between speed and accuracy.
Figure 5. DB network structure
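The two key pieces of DB can be written down compactly: the differentiable binarization function from the paper, and the offset used to expand the shrunk text region back to the full contour (the polygon dilation itself is typically done with a clipping library such as pyclipper). The sketch below uses assumed parameter names; k = 50 and the unclip ratio 1.5 are the values reported in the paper.

```python
import numpy as np

def differentiable_binarization(prob_map, thresh_map, k=50):
    """DB's approximate (differentiable) binarization.

    B_hat = 1 / (1 + exp(-k * (P - T))): a steep sigmoid that approximates the
    step function, so the threshold map T can be learned end to end.
    """
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))

def unclip_offset(area, perimeter, ratio=1.5):
    """Offset used to expand a shrunk text polygon back to full size.

    D = A * r / L, where A is the polygon area, L its perimeter and r the
    unclip ratio; the polygon is then dilated outwards by D.
    """
    return area * ratio / perimeter
```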
Algorithms based on pixel segmentation can accurately predict text instances of any shape, but for overlapping text regions it is difficult to distinguish the different instances. To put this family of algorithms into practice and meet business needs, the problem of overlapping text still needs to be solved.
References
[1]. Deng D, Liu H, Li X, et al. Pixellink: Detecting scene text via instance segmentation[C] //Proceedings of the AAAI Conference on Artificial Intelligence. 2018, 32(1).
[2]. Long S, Ruan J, Zhang W, et al. Textsnake: A flexible representation for detecting text of arbitrary shapes[C]//Proceedings of the European conference on computer vision (ECCV). 2018: 20-36.
[3]. Baek Y, Lee B, Han D, et al. Character region awareness for text detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 9365-9374.
[4]. Wang W, Xie E, Li X, et al. Shape robust text detection with progressive scale expansion network[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 9336-9345.
[5]. Wang W, Xie E, Song X, et al. Efficient and accurate arbitrary-shaped text detection with pixel aggregation network[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 8440-8449.
[6]. Xue C, Lu S, Zhang W. Msr: Multi-scale shape regression for scene text detection[J]. arXiv preprint arXiv:1901.02596, 2019.
[7]. Liao M, Wan Z, Yao C, et al. Real-time scene text detection with differentiable binarization[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2020, 34(07): 11474-11481.