Abstract: Text detection is the first step in text reading and recognition, and its quality has a significant impact on the recognition that follows. In general scenarios, text lines can be detected and localized by configuring and adapting a general-purpose object detection algorithm. This article introduces text detection algorithms based on pixel segmentation.
This article is shared from the Huawei Cloud Community post "Technology Overview Fourteen: Curved Text Detection Algorithm (2)", author: I want to be quiet.
Background introduction
Text detection is the first step in text reading and recognition, and its quality has a significant impact on the recognition that follows. In general scenarios, text lines can be detected and localized by configuring and adapting a general-purpose object detection algorithm. In curved-text scenes, however, a general-purpose object detector cannot represent the text boundary accurately. For this reason, many recent papers have proposed new algorithms for scene text detection, which follow two main ideas: 1. text detection based on region reorganization; 2. text detection based on pixel segmentation. This article introduces the algorithms based on pixel segmentation.
PSENet
PSENet [1] is a purely segmentation-based text detection method. It was designed to separate adjacent text instances of arbitrary shape, and it does so by predicting text segmentation maps at multiple scales. Figure 1 shows an example with segmentation maps predicted at three scales, (a), (e), and (f). The post-processing works as follows: first, each connected component in the smallest-scale map (a) is assigned a label; then the components in (a) are expanded outward to absorb the pixels predicted as text in (e); finally, the text pixels in (f) are merged in the same way.
Figure 1. Progressive expansion process of PSENet
Merging adjacent text pixels progressively, from the smallest scale to the largest, separates adjacent text instances effectively, but it is slow. Implementing the post-processing in C++ alleviates this cost.
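The progressive expansion described above can be sketched as a breadth-first search. This is a minimal pure-Python illustration, not the authors' implementation: the function names `label_components` and `progressive_scale_expansion`, the use of 4-connectivity, and the representation of each segmentation map as a nested list of 0/1 values ordered from smallest to largest scale are all assumptions made for this example.

```python
from collections import deque

# 4-connected neighbourhood (an assumption of this sketch)
NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def label_components(mask):
    """Label the 4-connected components of a binary mask with 1, 2, ..."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or labels[y][x]:
                continue
            count += 1
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for dy, dx in NEIGHBORS:
                    ny, nx = cy + dy, cx + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


def progressive_scale_expansion(kernels):
    """kernels: binary masks ordered from the smallest scale to the largest."""
    labels, count = label_components(kernels[0])
    h, w = len(labels), len(labels[0])
    for mask in kernels[1:]:
        # Start the search from every pixel that already has a label.
        queue = deque((y, x) for y in range(h) for x in range(w)
                      if labels[y][x])
        while queue:
            cy, cx = queue.popleft()
            for dy, dx in NEIGHBORS:
                ny, nx = cy + dy, cx + dx
                # Claim unlabeled pixels that the larger map marks as text.
                if (0 <= ny < h and 0 <= nx < w
                        and mask[ny][nx] and not labels[ny][nx]):
                    labels[ny][nx] = labels[cy][cx]
                    queue.append((ny, nx))
    return labels, count


small = [[1, 0, 0, 0, 1]]   # two separate kernels
large = [[1, 1, 1, 1, 1]]   # one connected text region at the larger scale
labels, count = progressive_scale_expansion([small, large])
print(labels, count)
```

Because pixels are claimed on a first-come, first-served basis, a pixel lying between two kernels goes to whichever label reaches it first, which is what keeps the two instances apart even though the larger map connects them.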
PAN
PAN [2] addresses the fact that existing text detection methods are too slow for industrial use. It speeds up text detection in two ways. The first is the network structure. PAN uses the lightweight ResNet18 as its backbone, but the feature extraction ability of ResNet18 is limited and its receptive field is not large enough. PAN therefore adds a lightweight feature enhancement module, which resembles FPN and can be cascaded, and a feature fusion module. The feature enhancement module strengthens feature extraction and enlarges the receptive field at only a small computational cost.
The second is the post-processing. PAN detects text by predicting the text region, the core region of the text (the kernel), and a similarity vector for each pixel. Following the idea of clustering, each kernel is a cluster center and the text pixels are the samples to be clustered. During training, the similarity vector of each text pixel is pulled as close as possible to that of the kernel of the same text instance, while the similarity vectors of different kernels are pushed far apart. In the inference stage, the connected components of the kernel map are found first; then, expanding outward from each kernel, the surrounding text pixels whose similarity-vector distance to the kernel is less than a threshold d are merged into it. This method reaches real-time detection speed while maintaining high accuracy.
Figure 2. PAN network structure
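The inference-stage aggregation in PAN can be sketched in the same breadth-first style. This is a minimal illustration under stated assumptions, not the authors' code: the function name `pixel_aggregation`, the nested-list inputs, the 4-connectivity, the use of the mean similarity vector of a kernel as its cluster center, and the Euclidean distance are all choices made for this example. The threshold `d` is left as a required parameter.

```python
import math
from collections import deque

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # 4-connectivity (assumed)


def pixel_aggregation(text_mask, kernel_mask, similarity, d):
    """Assign text pixels to kernels whose similarity vector is within d.

    text_mask, kernel_mask: H x W nested lists of 0/1.
    similarity: H x W nested list of equal-length float vectors.
    Returns an H x W label map (0 = background or unassigned).
    """
    h, w = len(text_mask), len(text_mask[0])
    labels = [[0] * w for _ in range(h)]
    members = {}  # label -> list of (y, x) kernel pixels

    # Step 1: connected components of the kernel map.
    for y in range(h):
        for x in range(w):
            if not kernel_mask[y][x] or labels[y][x]:
                continue
            label = len(members) + 1
            labels[y][x] = label
            members[label] = [(y, x)]
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for dy, dx in NEIGHBORS:
                    ny, nx = cy + dy, cx + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and kernel_mask[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = label
                        members[label].append((ny, nx))
                        queue.append((ny, nx))

    # Step 2: one cluster center per kernel (mean similarity vector).
    dim = len(similarity[0][0])
    centers = {
        label: [sum(similarity[y][x][i] for y, x in pixels) / len(pixels)
                for i in range(dim)]
        for label, pixels in members.items()
    }

    # Step 3: grow each kernel into neighbouring text pixels within distance d.
    queue = deque(p for pixels in members.values() for p in pixels)
    while queue:
        cy, cx = queue.popleft()
        label = labels[cy][cx]
        for dy, dx in NEIGHBORS:
            ny, nx = cy + dy, cx + dx
            if not (0 <= ny < h and 0 <= nx < w):
                continue
            if not text_mask[ny][nx] or labels[ny][nx]:
                continue
            if math.dist(similarity[ny][nx], centers[label]) < d:
                labels[ny][nx] = label
                queue.append((ny, nx))
    return labels


text = [[1, 1, 1, 1, 1]]
kernel = [[1, 0, 0, 0, 1]]
sim = [[[0.0], [0.1], [3.0], [5.0], [5.1]]]
result = pixel_aggregation(text, kernel, sim, d=1.0)
print(result)
```

In the toy input, the middle pixel is predicted as text but its similarity vector is far from both kernels, so it stays unassigned rather than gluing the two instances together.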
MSR
MSR [3] is proposed to address the difficulty of detecting text at multiple scales. Unlike other text detection methods, it uses several identical backbones: the input image is downsampled to multiple scales, and the downsampled images are fed into these backbones together with the original image. The features from the different backbones are then upsampled and fused, which captures rich multi-scale features. The network finally predicts the text center region and, for each point in that region, the x-coordinate and y-coordinate offsets to the nearest boundary point. In the inference stage, each point in the text center region obtains its boundary point from the predicted x/y offsets, and the final text contour is the contour that encloses all of these boundary points.
Figure 3. MSR algorithm framework
Figure 4. MSR network structure
The advantage of this method is its strong ability to detect multi-scale text. However, the text center region it defines is the text region shrunk only in the vertical direction, not in the horizontal direction, so the method cannot effectively separate horizontally adjacent text instances.
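The offset decoding used by MSR at inference time can be illustrated with a few lines of Python. This is only a sketch under assumptions: the function name `recover_boundary_points` is hypothetical, the inputs are nested lists covering a single text instance, and the final step of tracing the contour that encloses the recovered points is left out.

```python
def recover_boundary_points(center_mask, offset_x, offset_y):
    """Map each center-region pixel to its predicted nearest boundary point.

    center_mask: H x W nested list of 0/1 for one text instance.
    offset_x, offset_y: H x W nested lists of predicted offsets.
    Returns a list of (x, y) boundary points.
    """
    points = []
    for y, row in enumerate(center_mask):
        for x, is_center in enumerate(row):
            if is_center:
                points.append((x + offset_x[y][x], y + offset_y[y][x]))
    return points


center = [[0, 0, 0],
          [1, 1, 1],
          [0, 0, 0]]
dx = [[0, 0, 0],
      [0, 0, 0],
      [0, 0, 0]]
dy = [[0, 0, 0],
      [-1, 1, -1],
      [0, 0, 0]]
boundary = recover_boundary_points(center, dx, dy)
print(boundary)
```

Each center pixel votes for one boundary point, so a dense center region yields a dense set of points along the text outline.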
DB
DB [4] targets a weakness of existing segmentation-based methods: they binarize the segmentation map with a threshold, which leads to time-consuming post-processing and limited performance. DB designs a binarization function that approximates the step function while remaining differentiable, so that the segmentation network can learn the text segmentation threshold during training. In addition, in the inference stage, DB directly expands the text center region by an offset computed from its area and perimeter to obtain the final text contour, which further improves inference speed. Overall, DB provides a good framework for text detection based on pixel segmentation: it solves the threshold-configuration problem of such algorithms and is also highly adaptable, since developers can modify and optimize the backbone to address the difficulties of a particular scenario and reach a better balance between performance and accuracy.
Figure 5. DB network structure
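The two ideas in DB can be made concrete with a short sketch. In the DB paper [4], the approximate binarization is B = 1 / (1 + exp(-k(P - T))), where P is the probability map value, T is the learned threshold, and k is an amplifying factor (reported as 50); the expansion offset at inference is the area of the center polygon times an expansion ratio, divided by its perimeter. The function names below and the plain-Python polygon representation are assumptions of this example, and `ratio` is left as a parameter.

```python
import math


def differentiable_binarization(p, t, k=50.0):
    """Smooth approximation of the step function (p >= t) for p, t in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-k * (p - t)))


def expansion_offset(polygon, ratio):
    """Offset used to expand a center polygon: area * ratio / perimeter.

    polygon: list of (x, y) vertices in order.
    """
    n = len(polygon)
    twice_area = 0.0
    perimeter = 0.0
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1          # shoelace formula
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(twice_area) / 2.0 * ratio / perimeter


# Values near the threshold stay soft; values away from it saturate.
print(differentiable_binarization(0.5, 0.5))
print(differentiable_binarization(0.8, 0.3))

# A 10 x 2 rectangle: area 20, perimeter 24.
rect = [(0, 0), (10, 0), (10, 2), (0, 2)]
print(expansion_offset(rect, ratio=1.5))
```

Because the binarization is a steep sigmoid rather than a hard cut, gradients flow to both the probability map and the threshold map, which is what lets the threshold be learned.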
Algorithms based on pixel segmentation can accurately predict text instances of any shape, but they have difficulty distinguishing different instances where text regions overlap. For this family of algorithms to be deployed in practice and meet business needs, the problem of overlapping text still needs to be solved.
References
[1]. Wang W, Xie E, Li X, et al. Shape robust text detection with progressive scale expansion network[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 9336-9345.
[2]. Wang W, Xie E, Song X, et al. Efficient and accurate arbitrary-shaped text detection with pixel aggregation network[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 8440-8449.
[3]. Xue C, Lu S, Zhang W. MSR: Multi-scale shape regression for scene text detection[J]. arXiv preprint arXiv:1901.02596, 2019.
[4]. Liao M, Wan Z, Yao C, et al. Real-time scene text detection with differentiable binarization[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2020, 34(07): 11474-11481.