Abstract: This ECCV 2020 paper performs text recognition by visual matching, addressing the problems of diversity and generalization in document text recognition.

This article is shared from the Huawei Cloud community post "Paper Interpretation 23: Adaptive Text Recognition Based on Visual Matching", author: wooheng.

1. Introduction

The goal of this paper is generalization and flexibility in text recognition. Previous text recognition methods [1,2,3,4] achieve good results within a single scenario, but once they are extended to a new scenario containing new fonts or new languages, they require either retraining with large amounts of data or fine-tuning on each new sample.

This paper is based on a key observation: text is a sequence built by repeating a limited number of discrete entities. The repeated entities are the characters of the text string and their glyphs, i.e. the visual representations of the characters/symbols in the text line image. Suppose we have access to glyph exemplars (i.e. cropped images of characters) and ask a visual encoder to locate these repeated glyphs in a given text line image. The output of the visual encoder is a similarity map, which encodes the visual similarity between each spatial position in the text line and each glyph in the alphabet, as shown in Figure 1. The decoder takes the similarity map and infers the most probable character string. Figure 2 summarizes the proposed method.

Figure 1: Visual matching for text recognition. Current text recognition models learn discriminative features specific to the character shapes (glyphs) of a predefined (fixed) alphabet. In contrast, this model is trained to establish visual similarity between given character glyphs (top) and the text line image to be recognized (left). This makes the model highly adaptable to unseen glyphs and new alphabets (different languages), and it can be extended to new character classes without further training, e.g. English → Greek. Brighter colors correspond to higher visual similarity.

Figure 2: Architecture for adaptive text recognition by visual matching. The paper casts text recognition as visual matching of glyph exemplars in a given text line image. Left: architecture diagram. The visual encoder Φ embeds the glyph-line image g and the text line x, and produces a similarity map S that scores each glyph against each position of the text line. Ambiguities in the (potentially imperfect) visual matching are then resolved to produce an enhanced similarity map S*. Finally, using the true glyph widths encoded in M, the similarity scores are aggregated into output class probabilities P. Right: how glyph widths are encoded into the model. Each segment of the glyph-width band (top) spans the width of its corresponding glyph exemplar, and its scalar value is that glyph's width in pixels. The glyph-width map (bottom) is a binary matrix with one column per character of the alphabet A; each column marks the extent of its glyph in the glyph-line image by setting the corresponding rows to a non-zero value (= 1).

2. Model structure

The model recognizes a given text line image by locating glyph exemplars in it via visual matching. It takes a text line image and an alphabet image containing a set of glyph exemplars as input, and outputs a sequence of probabilities over N classes, where N equals the number of exemplars in the alphabet image. For inference, the glyph-line image is assembled by concatenating the individual character glyphs of a reference font side by side, after which text lines in that font can be read.
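As an illustration, the glyph-line assembly amounts to a simple horizontal concatenation. The following NumPy sketch (function name and details are illustrative, not the authors' code) assumes all exemplar crops have already been resized to a common height:

```python
import numpy as np

def assemble_glyph_line(glyph_crops):
    """Concatenate per-character glyph crops side by side into one glyph-line image.

    glyph_crops: list of (H, W_i) uint8 arrays, one crop per character of the
    target alphabet, all resized to the same height H beforehand.
    Returns the glyph-line image and the pixel width of each glyph; the widths
    are later needed to build the glyph-width map M used by the decoder.
    """
    heights = {g.shape[0] for g in glyph_crops}
    assert len(heights) == 1, "all glyph crops must share the same height"
    widths = [g.shape[1] for g in glyph_crops]
    glyph_line = np.concatenate(glyph_crops, axis=1)  # stack along the width axis
    return glyph_line, widths
```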

The model has two main parts: (1) a visual similarity encoder (Section 2.1), which outputs a similarity map encoding the similarity of each glyph at each position of the text line image, and (2) an alphabet-agnostic decoder (Section 2.2), which takes this similarity map and infers the most probable string. Section 3 describes the training objective in detail. Figure 2 shows a schematic of the model.

2.1 Visual similarity encoder

Input: glyph exemplars for all characters of the target alphabet, and the text line image to be recognized.

Purpose: locate each target glyph within the text line image.

The visual encoder Φ embeds the glyph-line image g and the text line x, and produces a similarity map S that represents the similarity between each glyph and each position of the text line. Similarity is measured with the cosine distance:
S(i, j) = ⟨Φ(x)_i, Φ(g)_j⟩ / (‖Φ(x)_i‖ · ‖Φ(g)_j‖)

i.e. the cosine similarity between the encoded features at position i of the text line and position j of the glyph-line image.

The encoder is implemented as a U-Net with two residual blocks; the visual similarity map is obtained from the cosine distance between the encoded features of the text line and of the glyph-line image at all positions along the width.
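As a minimal sketch (PyTorch, not the authors' code), assuming the encoder has already collapsed the height dimension so that each image is represented by one feature vector per horizontal position, the similarity map is simply a matrix of cosine similarities:

```python
import torch
import torch.nn.functional as F

def similarity_map(phi_x, phi_g):
    """Cosine-similarity map between the encoded text line and glyph line.

    phi_x: (W_x, D) features of the text line, one D-dim vector per
           horizontal position (height already collapsed by the encoder).
    phi_g: (W_g, D) features of the glyph-line image, same layout.
    Returns S of shape (W_x, W_g): S[i, j] is the cosine similarity between
    position i of the text line and position j of the glyph line.
    """
    x = F.normalize(phi_x, dim=-1)  # unit-norm feature vectors
    g = F.normalize(phi_g, dim=-1)
    return x @ g.t()                # inner products of unit vectors = cosine similarity
```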

2.2 Alphabet-agnostic decoder

The alphabet-agnostic decoder converts the similarity map into a probability over the exemplar glyphs at each spatial position along the width of the text line image.

A simple implementation would take the argmax, or sum the similarity scores over the extent of each glyph in the similarity map. However, this strategy cannot resolve ambiguities in the similarities, nor does it produce smooth/consistent character predictions. The decoder therefore proceeds in two steps. First, similarity disambiguation resolves ambiguities between the glyphs of the alphabet by taking the widths and positions of the glyphs in the line image into account, producing an enhanced similarity map S*. Second, a class aggregator pools the scores in S* over the spatial extent of each glyph to compute the glyph probabilities.

Disambiguation of similarity

In an ideal similarity map, a match shows up as a square region of high similarity, because a character has the same width in the glyph-line image and in the text line image. Therefore, the glyph widths and the local x and y coordinates are encoded into the similarity map with a small MLP: two channels of x and y coordinates (normalized to [0, 1]) are stacked with the glyph-width band and fed into the MLP. For disambiguation, a self-attention module then produces the enhanced similarity map S*, with the same size as S.
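A rough sketch of this step is below (PyTorch; the layer sizes, number of attention heads, and the exact way the coordinate and width channels are injected are assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

class SimilarityDisambiguator(nn.Module):
    """Augment S with position and glyph-width channels via a small MLP,
    then refine all locations jointly with self-attention to produce S*."""

    def __init__(self, hidden=32, heads=1):
        super().__init__()
        # 4 features per location: similarity score, x coord, y coord, glyph width
        self.mlp = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.out = nn.Linear(hidden, 1)  # project back to one similarity score per location

    def forward(self, S, glyph_widths_px):
        # S: (W_x, W_g) similarity map
        # glyph_widths_px: (W_g,) width in pixels of the glyph occupying each column
        W_x, W_g = S.shape
        y = torch.linspace(0, 1, W_x).unsqueeze(1).expand(W_x, W_g)  # normalized text-line position
        x = torch.linspace(0, 1, W_g).unsqueeze(0).expand(W_x, W_g)  # normalized glyph-line position
        w = glyph_widths_px.float().unsqueeze(0).expand(W_x, W_g)
        feats = torch.stack([S, x, y, w], dim=-1)       # (W_x, W_g, 4)
        h = self.mlp(feats).reshape(1, W_x * W_g, -1)   # treat every location as a token
        h, _ = self.attn(h, h, h)                       # self-attention over all locations
        return self.out(h).reshape(W_x, W_g)            # enhanced similarity map S*
```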

Class aggregator

The enhanced similarity map S* is mapped to a probability for each exemplar glyph, S* → P. This is realized by multiplication with a matrix M, P = M·S*, where M = [m₁, m₂, …, m_|A|]ᵀ and each mᵢ is a binary vector of the form [0, …, 0, 1, …, 1, 0, …, 0] whose non-zero entries span the columns occupied by the i-th glyph in the glyph-line image.
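A minimal sketch of the aggregation is shown below (PyTorch; the final softmax over classes is an assumption about how scores are turned into probabilities, and the (text-line × glyph-line) orientation of S* follows the earlier sketches rather than the paper's notation):

```python
import torch

def class_aggregator(S_star, glyph_widths):
    """Pool the enhanced similarity map over each glyph's extent: P = M S*.

    S_star: (W_x, W_g) enhanced similarity map.
    glyph_widths: pixel widths of the |A| glyph exemplars, summing to W_g.
    Returns P of shape (|A|, W_x): a probability per character class at each
    horizontal position of the text line.
    """
    W_x, W_g = S_star.shape
    assert sum(glyph_widths) == W_g, "glyph widths must tile the glyph-line image"
    # Build the binary glyph-width map M: row i is 1 over the columns of glyph i.
    M = torch.zeros(len(glyph_widths), W_g)
    start = 0
    for i, w in enumerate(glyph_widths):
        M[i, start:start + w] = 1.0
        start += w
    P = M @ S_star.t()         # (|A|, W_x): aggregate scores over each glyph's columns
    return P.softmax(dim=0)    # normalize over classes at each position (assumed)
```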

Inference stage

A greedy algorithm is used for decoding at inference time.
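The article does not spell out the decoding procedure; assuming standard greedy CTC-style decoding (argmax per position, then collapsing repeats and removing the blank), a sketch would be:

```python
import torch

def greedy_decode(P, blank=0):
    """Greedy CTC-style decoding (assumed): pick the most probable class at
    every position, collapse consecutive repeats, and drop the blank symbol.

    P: (num_classes, W_x) class probabilities over the text line width.
    Returns the decoded sequence of class indices.
    """
    best = P.argmax(dim=0)          # most probable class at each position
    decoded, prev = [], None
    for c in best.tolist():
        if c != prev and c != blank:
            decoded.append(c)
        prev = c
    return decoded
```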

3. Training loss function

The CTC loss is applied to the class probabilities P to align the predictions with the output labels. An auxiliary per-position cross-entropy loss (L_sim) additionally supervises the similarity map S output by the visual encoder; the ground-truth character bounding boxes determine the spatial span of each character. The overall training objective combines these two losses:
L_total = L_CTC + λ · L_sim
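As a sketch of how the two terms might be combined in training (PyTorch; the weighting factor lam and the exact tensor shapes are assumptions, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def total_loss(log_probs, targets, input_lengths, target_lengths,
               sim_logits, sim_targets, lam=1.0):
    """CTC loss on the class probabilities P plus an auxiliary per-position
    cross-entropy on the similarity map (weighting factor lam is assumed).

    log_probs:   (T, B, C) log-probabilities for CTC (C includes the blank).
    sim_logits:  (B*T', C') per-position scores derived from the similarity map.
    sim_targets: (B*T',) ground-truth character index at each position,
                 obtained from the character bounding boxes.
    """
    l_ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths)
    l_sim = F.cross_entropy(sim_logits, sim_targets)
    return l_ctc + lam * l_sim
```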

4. Experimental results

The paper first compares against state-of-the-art text recognition models, and then evaluates generalization to new fonts and languages.

Figure 3: VS-1, VS-2: generalization to new fonts with/without known test glyphs, and with an increasing number of training fonts. Error rates on the FontSynth test set (in %; lower is better). Ours-cross denotes cross-font matching, where the test glyphs are unknown and training fonts are used as the glyph exemplars. When the exemplar font is randomly selected from the training set, mean and standard deviation are reported; selected shows the result with the best-matching exemplars automatically chosen by confidence. R, B, L, and I correspond to the Regular, Bold, Light, and Italic fonts in the FontSynth training set; OS denotes the Omniglot-Seq dataset.

Figure 4: VS-3: generalization from synthetic to real data. Average error rates on Google1000 English documents for models trained only on synthetic data (%; lower is better). LM denotes a 6-gram language model.

5. Conclusion

This paper proposes a text recognition method that generalizes to novel visual styles (fonts, colors, backgrounds, etc.) and is not tied to a particular alphabet size or language. It achieves this by recasting classic text recognition as a visual matching problem, and shows that the matching can even be trained with random shapes/glyphs. The model is arguably the first one-shot sequence recognition model; compared with traditional text recognition methods, it generalizes well without expensive adaptation or fine-tuning. Although the method is demonstrated on text recognition, it is applicable to other sequence recognition problems, such as speech and action recognition.

References

[1] Jeonghun Baek, Geewook Kim, Junyeop Lee, Sungrae Park, Dongyoon Han, Sangdoo Yun, Seong Joon Oh, and Hwalsuk Lee. What is wrong with scene text recognition model comparisons? dataset and model analysis. In Proc. ICCV, 2019.

[2] Zhanzhan Cheng, Yangliu Xu, Fan Bai, Yi Niu, Shiliang Pu, and Shuigeng Zhou. Aon: Towards arbitrarily-oriented text recognition. In Proc. CVPR, 2018.

[3] Chen-Yu Lee and Simon Osindero. Recursive recurrent nets with attention modeling for OCR in the wild. In Proc. CVPR, 2016.

[4] Baoguang Shi, Mingkun Yang, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. Aster: An attentional scene text recognizer with flexible rectification. PAMI, 2018.
