Editor's note: Text recognition has long been an important research topic in document digitization. Existing text recognition methods typically use a CNN for image understanding and an RNN for character-level text generation. However, these methods require an additional language model as a post-processing step to improve recognition accuracy.
To address this, researchers from Microsoft Research Asia conducted in-depth research and proposed TrOCR, the first end-to-end Transformer-based OCR model for text recognition built on pre-trained models. The model is simple yet effective: it can be pre-trained on large-scale synthetic data and fine-tuned on manually labeled data. Experiments show that TrOCR surpasses current state-of-the-art models on both printed and handwritten data. The training code and models are now open source. We hope interested readers will read the full paper and learn about the advantages of TrOCR!
Optical character recognition (OCR) converts images of handwritten or printed text into machine-encoded text; it can be applied to scanned documents, photos, or subtitle text superimposed on images. General OCR consists of two parts: text detection and text recognition.
- Text detection locates text blocks in an image, at either the word level or the text-line level. Most current solutions treat this task as an object detection problem and use conventional object detection models such as YOLOv5 and DBNet.
- Text recognition understands text images and converts visual signals into natural language symbols. This task usually adopts an encoder-decoder architecture: existing methods use a CNN-based encoder for image understanding and an RNN-based decoder for text generation.
The Transformer model is frequently used in text recognition, where the advantages of its structure have brought significant efficiency improvements. However, existing methods still mainly use a CNN as the backbone network, adding a self-attention mechanism on top of it to understand text images; in addition, they still use a CTC decoder together with an additional character-level language model to improve overall accuracy. Although such hybrid models have achieved great success, there is still much room for improvement:
- The parameters of existing models are trained from scratch on synthetic or manually labeled data; the use of large-scale pre-trained models has not been explored.
- Image Transformer models have become increasingly popular, especially with the recent progress in self-supervised image pre-training. It is time to explore whether a pre-trained image Transformer can replace the CNN backbone, and whether pre-trained image and text Transformers can work together in a single network for text recognition.
Therefore, researchers from Microsoft Research Asia focused on text recognition and proposed TrOCR, the first end-to-end Transformer-based OCR model for text recognition built on pre-trained models. The model structure is shown in Figure 1.
Figure 1: Schematic diagram of TrOCR model structure
Unlike existing methods, TrOCR is simple and efficient: it does not use a CNN as the backbone network. Instead, the input image is divided into patches and fed into an image Transformer. Both the TrOCR encoder and decoder use the standard Transformer structure and self-attention mechanism, and the decoder generates wordpieces as the recognized text of the input image. To train TrOCR more effectively, the researchers initialized the encoder with a ViT-style pre-trained model and the decoder with a BERT-style pre-trained model.
Paper: https://arxiv.org/abs/2109.10282
Code/model: https://aka.ms/trocr
TrOCR has three advantages:
- TrOCR uses pre-trained image and text models, exploiting large-scale unlabeled data for image understanding and language modeling, without requiring an additional external language model.
- TrOCR does not require any complicated convolutional network as the backbone, making it easier to implement and maintain. Experiments show that TrOCR surpasses current state-of-the-art methods on benchmark datasets for printed and handwritten text recognition, without any complex pre/post-processing steps.
- TrOCR can easily be extended into a multilingual model by simply using a multilingual pre-trained model on the decoder side. Moreover, by adjusting the parameter configuration of the pre-trained models, deployment in the cloud or on devices becomes very simple.
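Before diving into the implementation, here is a minimal inference sketch. It assumes the Hugging Face Transformers port of TrOCR rather than the original release linked above, and the checkpoint name and `text_line.png` path are placeholders:

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Load the processor (image preprocessing + BPE tokenizer) and the model.
# The checkpoint name assumes the Hugging Face port of TrOCR.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# A single cropped text-line image, as produced by a text detector.
image = Image.open("text_line.png").convert("RGB")

# Resize and slice into patches; returns pixel values for the ViT-style encoder.
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Autoregressively generate wordpiece ids, then decode them into a string.
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```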
Implementation
Model structure
TrOCR uses the Transformer structure, including an image Transformer and a text Transformer, which extract visual features and perform language modeling, respectively, in a standard Transformer encoder-decoder setup. The encoder obtains features of the image patches; the decoder generates the wordpiece sequence while attending to the encoder output and the previously generated wordpieces.
For the encoder, TrOCR uses the ViT model structure, which resizes the input image and slices it into fixed-size square patches to form the model's input sequence. The model retains the special "[CLS]" token from the pre-trained model to represent the whole image. For the DeiT pre-trained model, it also retains the corresponding distillation token, which represents distilled knowledge from the teacher model. The decoder uses the original Transformer decoder structure.
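To make the slicing step concrete, here is a minimal sketch of ViT-style patch embedding; the 384x384 resolution and 16x16 patch size follow the ViT convention, and the layer names are illustrative rather than TrOCR's actual code:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Slice an image into fixed-size square patches and project each to a vector."""
    def __init__(self, image_size=384, patch_size=16, embed_dim=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A conv with stride == kernel size is equivalent to slicing + a linear layer.
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, images):                      # (B, 3, 384, 384)
        x = self.proj(images)                       # (B, 768, 24, 24)
        x = x.flatten(2).transpose(1, 2)            # (B, 576, 768) patch sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1)           # prepend the "[CLS]" token
```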
Model initialization
Both the encoder and the decoder are initialized from public models pre-trained on large-scale labeled/unlabeled data: the encoder from the DeiT and BEiT models, and the decoder from the RoBERTa model. Because RoBERTa's structure does not exactly match the standard Transformer decoder (for example, it lacks the encoder-decoder attention layers), the researchers randomly initialize the layers that do not exist in the RoBERTa model.
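A sketch of this warm-start, assuming the Hugging Face `VisionEncoderDecoderModel` helper (which likewise random-initializes the missing cross-attention layers when pairing a RoBERTa decoder with a vision encoder); the BEiT checkpoint name is an assumption:

```python
from transformers import VisionEncoderDecoderModel

# Warm-start: encoder from a pre-trained BEiT checkpoint, decoder from RoBERTa.
# The cross-attention layers missing from RoBERTa are randomly initialized.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/beit-base-patch16-384",  # assumed checkpoint name
    "roberta-large",
)
```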
Task flow
TrOCR's text recognition pipeline is as follows: given a text-line image, the model extracts its visual features and predicts the corresponding wordpieces conditioned on the image and the previously generated text. The ground-truth text ends with an "[EOS]" symbol, which marks the end of the sentence. During training, the researchers rotate the ground-truth wordpiece sequence so that the "[EOS]" symbol moves to the first position, feed it to the decoder, and supervise the decoder output with a cross-entropy loss. At inference time, the decoder iteratively predicts wordpieces starting from the "[EOS]" symbol, feeding each predicted wordpiece back in as the next input.
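A minimal, runnable sketch of the training-side supervision described above; the token ids, vocabulary size, and the random logits standing in for a real decoder are all illustrative:

```python
import torch
import torch.nn.functional as F

EOS, VOCAB = 2, 50265  # illustrative "[EOS]" id and RoBERTa vocabulary size

# Ground-truth wordpiece ids for one text line, ending with "[EOS]".
target = torch.tensor([[15, 87, 4, 23, EOS]])         # (batch=1, seq_len=5)

# Rotate the sequence so "[EOS]" sits first: this becomes the decoder input.
decoder_input = torch.roll(target, shifts=1, dims=1)  # [[EOS, 15, 87, 4, 23]]

# Stand-in for the decoder's output logits; a real model would produce these
# from the encoder features plus decoder_input.
logits = torch.randn(1, target.size(1), VOCAB, requires_grad=True)

# Cross-entropy between decoder outputs and the unshifted target supervises training.
loss = F.cross_entropy(logits.view(-1, VOCAB), target.view(-1))
loss.backward()
```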
Pre-training
The researchers use text recognition itself as the pre-training task, since it lets the model learn visual feature extraction and language modeling at the same time. Pre-training proceeds in two stages. In the first stage, the researchers synthesized a dataset of hundreds of millions of printed text lines with corresponding text annotations and pre-trained the TrOCR model on it. In the second stage, they built two relatively small datasets for the printed and handwritten text recognition tasks, each containing millions of text-line images, and pre-trained two independent models on the printed data and the handwritten data respectively, both initialized from the first-stage pre-trained model.
Fine-tuning
The researchers fine-tuned the pre-trained TrOCR model on printed and handwritten text recognition tasks. The model's output is based on BPE (Byte Pair Encoding) and does not depend on any task-specific dictionary.
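For example, with the RoBERTa BPE tokenizer (shown here via Hugging Face Transformers), rare words are split into subword units, so no task-specific dictionary is needed; the exact split shown in the comment depends on the vocabulary:

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
# BPE decomposes out-of-vocabulary words into known subword pieces.
print(tokenizer.tokenize("handwriting recognition"))
# e.g. ['hand', 'writing', 'Ġrecognition'] -- 'Ġ' marks a leading space
```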
Data augmentation
To increase the variety of the pre-training and fine-tuning data, the researchers used data augmentation, with a total of seven image transforms (including keeping the original image unchanged). For each sample, one transform is chosen at random from: random rotation, Gaussian blur, image dilation, image erosion, downsampling, underlining, and keeping the image as-is.
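A sketch of this random-choice augmentation using Pillow; the specific parameters (rotation range, blur radius, kernel sizes) are illustrative, not the paper's:

```python
import random
from PIL import Image, ImageDraw, ImageFilter

def underline(im: Image.Image) -> Image.Image:
    """Draw a line just above the bottom edge to mimic underlined text."""
    im = im.copy()
    draw = ImageDraw.Draw(im)
    draw.line([(0, im.height - 2), (im.width, im.height - 2)], fill="black", width=1)
    return im

def augment(im: Image.Image) -> Image.Image:
    """Randomly apply one of seven transforms (including the identity)."""
    transform = random.choice([
        lambda x: x.rotate(random.uniform(-5, 5), expand=True, fillcolor="white"),
        lambda x: x.filter(ImageFilter.GaussianBlur(radius=1)),
        lambda x: x.filter(ImageFilter.MinFilter(3)),  # thickens dark strokes (dilation)
        lambda x: x.filter(ImageFilter.MaxFilter(3)),  # thins dark strokes (erosion)
        lambda x: x.resize((max(1, x.width // 2), max(1, x.height // 2))),  # downsample
        underline,
        lambda x: x,  # keep the original image unchanged
    ])
    return transform(im)
```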
Pre-training data
To build a large-scale, high-quality dataset, the researchers randomly sampled two million document pages from publicly available PDF files on the Internet. Since these PDFs are digitally generated, high-quality printed text-line images can be obtained by converting the PDFs into page images and then extracting and cropping the text lines. The first-stage pre-training data contains a total of 680 million text lines.
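A sketch of the PDF-to-text-line step using the pdf2image library (an assumption; the paper does not name its tooling). The fixed crop box is a placeholder, since a real pipeline would use the PDF's own layout information to locate each text line:

```python
from pdf2image import convert_from_path  # requires the poppler utilities

# Render each page of a digitally generated PDF as a high-resolution image.
pages = convert_from_path("document.pdf", dpi=300)

for page_num, page in enumerate(pages):
    # A real pipeline would locate text-line bounding boxes (e.g. from the PDF's
    # layout information) and crop each line; this fixed box is a placeholder.
    line = page.crop((0, 0, page.width, 40))
    line.save(f"page{page_num}_line0.png")
```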
For the second-stage pre-training, the researchers used 5,427 handwriting fonts and the open-source TRDG text recognition data generation tool to synthesize a large number of handwritten text-line images, with text randomly sampled from Wikipedia pages. The handwritten pre-training dataset at this stage consists of this synthetic data plus the IIIT-HWS dataset, totaling 18 million text lines. In addition, the researchers collected 53,000 real-world receipt photos, recognized the text on them with a commercial OCR engine, and then corrected and cropped the images. They also used TRDG to synthesize one million printed text-line images with two receipt fonts and TRDG's built-in printed fonts. The second-stage printed pre-training dataset contains 3.3 million text lines in total. Table 1 shows the scale of the synthetic data.
Table 1: Synthetic data scale of two-stage pre-training
Pre-training results
First, the researchers compared different encoder-decoder combinations to find the best model settings. For the encoder, they compared DeiT, BEiT, and ResNet-50; DeiT and BEiT both use the base setting from their respective papers. For the decoder, they compared a base decoder initialized with RoBERTa-base and a large decoder initialized with RoBERTa-large. As controls, the researchers also ran experiments with a randomly initialized model, the CRNN baseline model, and the Tesseract open-source OCR engine.
Table 2 shows the results of the combined models. The BEiT encoder with the RoBERTa-large decoder performed best. The results also show that pre-trained models do improve the performance of the text recognition model, and that pure Transformer models outperform the CRNN model and Tesseract. Based on these results, the researchers selected two model settings for subsequent experiments: TrOCR-base, with 334M parameters, consisting of a BEiT-base encoder and a RoBERTa-large decoder; and TrOCR-large, with 558M parameters, consisting of a BEiT-large encoder and a RoBERTa-large decoder.
Table 2: Results of ablation experiments performed on the SROIE dataset
Table 3 compares the TrOCR models with the current state-of-the-art models on the SROIE dataset leaderboard. The TrOCR models, as pure Transformer models, surpass the current state-of-the-art performance without any complex pre/post-processing steps. This confirms that a Transformer-based text recognition model can match CNN-based models in visual feature extraction and RNN-based models in language modeling.
Table 3: Experimental results of the large-scale pre-trained TrOCR model on the SROIE print data set
Table 4 compares the TrOCR models with existing methods on the IAM dataset. The results show that in existing methods, a CTC decoder and an additional language model bring significant gains. Compared with (Bluche and Messina, 2017), TrOCR-large achieves better results, which indicates that the Transformer decoder is more competitive than the CTC decoder in text recognition, and that it already has sufficient language modeling capability without relying on an additional language model.
Table 4: Experimental results of the large-scale pre-trained TrOCR model on the IAM handwriting data set
TrOCR works on image patches and achieves results similar to, or even better than, CNN-based networks, which shows that a pre-trained Transformer structure is fully capable of extracting visual features. Moreover, using only synthetic + IAM data and no additional manually annotated data, TrOCR achieves results comparable to methods trained with extra manual annotations.
Summary
In this paper, the researchers proposed TrOCR, the first end-to-end Transformer-based OCR model for text recognition built on pre-trained models. Unlike existing methods, TrOCR does not rely on a conventional CNN for image understanding; instead, it uses an image Transformer as the visual encoder and a text Transformer as the decoder. Moreover, unlike character-based methods, TrOCR uses wordpieces as the basic unit of the recognition output, saving the computational overhead of an additional language model. Experiments show that, without any post-processing steps and with only a simple encoder-decoder model, TrOCR achieves state-of-the-art accuracy on both printed and handwritten text recognition.