
Author: Zhang Weichen (Jing Ming)

As mobile phone performance keeps improving, running complex AI computation directly on the device has become a core development direction for major vendors, and a large number of on-device intelligent applications have been built on top of it. This on-device AI computing model makes many scenarios feasible where latency, cost, and privacy matter. Here, we take the widely used optical character recognition (OCR) technology as an example to introduce Ant's self-developed mobile OCR technology, xNN-OCR.

Background

Text recognition is a long-standing research direction in computer vision with a wide range of applications, and its capabilities have kept expanding with the development of deep learning. Compared with cloud computing, an on-device OCR algorithm can extract text from images offline, which is of great value for scenarios with strict requirements on real-time performance, privacy protection, and cost. On the other hand, deep-learning-based OCR models keep growing more complex, typically with tens of millions of parameters and hundreds of GFLOPs of computation, so running an OCR model within the limited computing resources of a mobile phone is a very challenging task. In Alipay, we combined the self-developed on-device inference engine xNN with deep algorithm-level optimization to build xNN-OCR, a technology product that is small, fast, and accurate. Since its launch in the bank card number recognition scenario in 2018, it has supported the technology upgrade of dozens of core businesses. In this article, we give a complete overview of the technical evolution and capability opening of xNN-OCR.

xNN-OCR technology evolution

Developing an on-device model goes through the following stages: training data acquisition and labeling, network structure design, training and tuning, on-device porting, and on-device deployment. These stages are interrelated and affect each other. In terms of base algorithms, xNN-OCR has gone through three model generations, from small fonts and large fonts to heterogeneous computing. Below we introduce the latest results from three angles: core data, network design, and model compression.

Data generation

Data, like ammunition, largely determines how well a model performs, especially in text recognition: the combinations of Chinese characters are endless, and it is hard to obtain enough real data in many scenarios. To address this, we explored GAN-based text image generation. In the network design, three encoders extract background, text content, and font features respectively, and font transfer plus background fusion turn the source text content into an image with the target font and background. During training, in addition to the usual generation and adversarial losses, a recognition loss is added to supervise whether the synthesized content is correct. To also make use of the existing real data, we add a Cycle-Path to the training pipeline, which further improves the overall generation quality. Card data synthesized this way needs only 10% of the original real data volume to reach the recognition accuracy achieved with 100% real data.
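
To make the idea concrete, below is a minimal sketch of the generator-side objective such a pipeline might use: an adversarial loss (does the synthesized image look real?) combined with a recognition loss (is the rendered text still readable as the intended string?). The module names (`gen`, `disc`, `recognizer`) and tensor layouts are hypothetical placeholders, not xNN-OCR's actual code.

```python
import torch
import torch.nn.functional as F

def generator_loss(gen, disc, recognizer, src_img, style_img, target_labels, rec_weight=1.0):
    # Fuse source text content with the target font/background style.
    fake = gen(src_img, style_img)

    # Non-saturating adversarial loss: the generator tries to fool the discriminator.
    d_out = disc(fake)
    adv = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))

    # Recognition loss: a frozen CTC recognizer checks that the synthesized
    # image still reads as the intended string (supervises content correctness).
    log_probs = recognizer(fake).log_softmax(-1)                    # assumed shape (T, N, C)
    input_lens = torch.full((fake.size(0),), log_probs.size(0), dtype=torch.long)
    target_lens = (target_labels != 0).sum(dim=1)                   # assumes 0 = padding index
    rec = F.ctc_loss(log_probs, target_labels, input_lens, target_lens, blank=0)

    return adv + rec_weight * rec
```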

xNN-OCR network architecture

With data in hand, the next step is model design. Here we introduce the three main parts of the xNN-OCR algorithm: text line detection, text line recognition, and structuring.

Text detection algorithm

Compared with general object detection, text detection has to handle boxes with large aspect ratios and arbitrary rotation. For these two problems, a traditional anchor-based scheme needs a very large number of anchors, which increases the amount of computation. We therefore designed a lightweight detection network: the backbone follows the ShuffleNet design idea and uses a multi-layer Shuffle structure, while the head uses pixel-based dense prediction. Each pixel of the output map predicts a class score and a box regression, and the final detections are obtained after a fusion step. To fit the on-device computing budget, the input resolution cannot be too large, which makes small targets easy to miss and the boundaries of long targets inaccurate. We use instance balancing plus OHEM during training to address the loss of small targets, and weighted-fusion NMS at inference time to address inaccurate box prediction, which greatly improves both performance and accuracy.
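
As an illustration of what a pixel-based dense prediction head looks like, here is a minimal PyTorch sketch. The ShuffleNet-style backbone is stubbed out, and the geometry parameterization (distances to the four box edges plus a rotation angle) is an assumption in the spirit of EAST-style detectors, not the actual xNN-OCR head.

```python
import torch
import torch.nn as nn

class DenseTextHead(nn.Module):
    def __init__(self, in_channels=128):
        super().__init__()
        # Per-pixel text / non-text classification.
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)
        # Per-pixel box geometry: distances to the 4 edges plus an angle,
        # which lets the network describe inclined boxes.
        self.geometry = nn.Conv2d(in_channels, 5, kernel_size=1)

    def forward(self, feat):                       # feat: (N, C, H/4, W/4) from the backbone
        score = torch.sigmoid(self.score(feat))    # (N, 1, H/4, W/4)
        geo = self.geometry(feat)                  # (N, 5, H/4, W/4)
        return score, geo

# Pixels above a score threshold propose boxes from their geometry; the
# proposals are then merged (e.g. by weighted-fusion NMS) into final detections.
```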

Text recognition algorithm

After text line detection, a CRNN-style structure is usually used to recognize the content. Building on our earlier work, we further upgraded both the backbone and the head of the network. To obtain a high-performance lightweight backbone, we designed a NAS search strategy tailored to text recognition scenarios and searched for the most cost-effective network structure parameters on the target dataset. For the CRNN structure, we found that the head accounts for a very large share of computation, more than 50% of the total, mainly because of the one-hot sparse coding used by the Softmax classifier. By combining a dense Hamming coding scheme with the CRNN model, the time cost of the head is reduced by about 70% compared with the original Softmax classification scheme.
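
The saving from dense coding can be illustrated with a small sketch: instead of projecting every timestep onto thousands of character classes, the head predicts a few dozen bits and decodes the nearest codeword. The codebook construction and decoding rule below are simplified assumptions for illustration, not xNN-OCR's actual scheme.

```python
import torch
import torch.nn as nn

num_classes, hidden, k = 7000, 256, 32            # k bits << num_classes

softmax_head = nn.Linear(hidden, num_classes)     # hidden * 7000 weights per timestep
binary_head = nn.Linear(hidden, k)                # hidden * 32 weights per timestep

# Fixed codebook: one k-bit codeword assigned to each character class.
codebook = torch.randint(0, 2, (num_classes, k)).float()

def decode(features):                             # features: (T, N, hidden)
    bits = torch.sigmoid(binary_head(features))   # (T, N, k) soft bit predictions
    # Score = agreement with each codeword; maximizing it picks the codeword
    # with the smallest Hamming distance to the predicted bits.
    scores = bits @ codebook.t() + (1 - bits) @ (1 - codebook.t())
    return scores.argmax(dim=-1)                  # predicted class per timestep
```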

Text structuring

Text structuring means outputting the structured information corresponding to the recognized text; for example, in a card scenario, the OCR results are organized into a key-value output format. Traditional structuring designs rules based on text position and recognition results, which is cumbersome to tune, requires separate processing logic for each card type, and is costly to deploy and maintain. We instead start from text line detection and propose an instance detection algorithm to structure the card: the detection head additionally learns the category of each text box, and at structuring time the recognized content of a box is mapped directly to its category. This saves recognition computation, simplifies online tuning and deployment, and, because the model learns the implicit relationships between fields, also improves overall recognition accuracy.
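
A minimal sketch of how instance-category structuring can replace hand-written layout rules, assuming a hypothetical set of field classes for an ID-card-like scenario:

```python
from dataclasses import dataclass

# Hypothetical field classes predicted by the detection head.
FIELD_CLASSES = ["name", "id_number", "valid_date", "other"]

@dataclass
class Box:
    xyxy: tuple      # box coordinates from the detector
    field_cls: int   # field category predicted alongside the box
    text: str        # string produced by the recognition model

def structure(boxes):
    result = {}
    for box in boxes:
        key = FIELD_CLASSES[box.field_cls]
        if key != "other":
            result[key] = box.text   # key-value output, no layout rules needed
    return result

# structure([Box((10, 20, 200, 48), 0, "ZHANG SAN")]) -> {"name": "ZHANG SAN"}
```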

Model compression

To improve the efficiency and quality of on-device model development, xNN built the xNAS tool on top of its earlier lightweight structures to provide model structure search. Based on mainstream NAS search frameworks, xNAS adds the factors that on-device models care about, such as computation amount and hardware latency, and combines hyperparameter optimization (HPO), Multi-Trial NAS, One-Shot NAS, and other algorithms to search for the best mobile model structure. In the OCR scenario, we applied the NAS solution mainly to the recognition network: by searching over the channel width of each layer and the number of convolutional layers, the model's computation was reduced by 70% while accuracy improved by 2%.
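
xNAS itself is an internal tool; the toy sketch below only illustrates the shape of such a search space (per-stage channel widths and layer counts) and a latency-aware objective. Random sampling stands in for the real HPO / Multi-Trial / One-Shot search, and the scoring function is a simplified assumption.

```python
import random

SEARCH_SPACE = {
    "channels": [16, 24, 32, 48, 64],
    "layers_per_stage": [1, 2, 3],
}

def sample_arch(num_stages=4):
    # One (channel width, layer count) choice per backbone stage.
    return [
        (random.choice(SEARCH_SPACE["channels"]),
         random.choice(SEARCH_SPACE["layers_per_stage"]))
        for _ in range(num_stages)
    ]

def score(arch, accuracy_fn, latency_fn, latency_budget_ms=30.0):
    # Reward accuracy, penalize candidates that exceed the on-device latency budget.
    acc = accuracy_fn(arch)    # train / evaluate the candidate
    lat = latency_fn(arch)     # measured or predicted device latency
    penalty = max(0.0, lat - latency_budget_ms)
    return acc - 0.01 * penalty

# best = max((sample_arch() for _ in range(100)), key=lambda a: score(a, acc_fn, lat_fn))
```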

For model compression, pruning, float quantization, and fixed-point quantization are essential to improving inference performance; fixed-point capability in particular effectively reduces model size and running time. To solve the accuracy loss caused by hard-to-determine fixed-point parameters in OCR scenarios, xNN combines the NAS idea and proposes the qNAS algorithm, which effectively improves fixed-point accuracy. We applied qNAS quantization training to the text detection and recognition models: with less than 1% accuracy drop, the model package size shrank to about 1/4 of the original, and on-device CPU computation time was reduced by about 50%.
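
qNAS is likewise internal to xNN; as a generic illustration of the kind of fixed-point (int8) quantization-aware training it builds on, the standard PyTorch QAT flow looks roughly like this (the tiny model is a placeholder):

```python
import torch
import torch.nn as nn
import torch.quantization as tq

class TinyConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()        # fp32 -> int8 at the model input
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        self.relu1 = nn.ReLU()
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.relu2 = nn.ReLU()
        self.dequant = tq.DeQuantStub()    # int8 -> fp32 at the model output

    def forward(self, x):
        x = self.quant(x)
        x = self.relu1(self.conv1(x))
        x = self.relu2(self.conv2(x))
        return self.dequant(x)

model = TinyConvNet().train()
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
tq.prepare_qat(model, inplace=True)        # insert fake-quant observers

# ... fine-tune for a few epochs so the weights adapt to int8 rounding ...

model.eval()
int8_model = tq.convert(model)             # swap in real int8 kernels
# Int8 weights are roughly 1/4 the size of fp32 ones, in line with the
# package-size reduction described above.
```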

xNN-OCR performance and accuracy

On top of the base models, we have gradually covered most OCR application scenarios, including general OCR recognition and recognition of various cards. While maintaining high accuracy, they achieve near real-time performance on mobile computing platforms; the main indicators are as follows (latency is single-thread CPU time on a Qualcomm Snapdragon 855):

Capability opening

xNN-OCR, Ant's self-developed mobile OCR technology, makes OCR recognition as smooth as scanning a QR code. Within Alipay it is already widely used in core scenarios such as security risk control, document upload, and digital finance. To let more users and external businesses use xNN-OCR, we provide it to developers through Mini Program plug-ins inside Alipay; users outside Alipay can access it as an offline SDK via the Ant mPaaS product or the Alibaba Cloud Vision Open Platform.

For Alipay Mini Program access, please refer to: https://forum.alipay.com/mini-app/post/29301014
For use outside Alipay, you can consult the mPaaS team via DingTalk (23124039) or access the offline SDK through the Alibaba Cloud Vision Open Platform.

To make it easy for developers to try, we have aggregated the existing plug-ins into the "Mini Program Experience Center", which can be experienced by scanning the QR code below with Alipay.

Follow [Alibaba Mobile Technology] for three pieces of mobile technology practice and insights every week!

