Abstract: Extracting key information from document images is very important in office automation applications. Traditional methods based on template matching or rules generalize poorly and perform badly on layout templates they have not seen. To address this, the paper proposes an end-to-end spatial dual-modality graph reasoning model (SDMG-R) that can effectively extract key information from previously unseen templates and therefore generalizes better.

This article is shared from the Huawei Cloud Community article "Paper Interpretation Series 12: SDMG-R Structured Extraction, Applied to Receipts with Unrestricted Layouts".

Source code: https://github.com/open-mmlab/mmocr/tree/4882c8a317cc0f59c96624ce14c8c10d05fa6dbc

1 Background

Extracting key information from document images is very important in office automation applications, for example the rapid, automated archiving and compliance checking of archive files, receipts, credit application forms, and similar documents. Traditional methods based on template matching or rules mainly rely on the layout, position coordinates, and content rules of fixed layout templates. Such information is very limited, so these methods generalize poorly and perform badly on layout templates that were not seen before. To this end, the paper proposes an end-to-end spatial dual-modality graph reasoning model (SDMG-R), which makes full use of the positional layout, semantics, and visual information of the detected text regions. This is much richer than the information used by earlier methods, so the model can effectively extract key information from previously unseen templates and generalizes better.

2 Innovative methods and highlights

2.1 Data

Most previous key information extraction work uses the SROIE and IEHHR data sets, but their training and test sets share many layout templates, so they are not well suited for evaluating how well an information extraction model generalizes. For this reason, the paper builds a new data set for the key information extraction task, named WildReceipt: it contains 25 categories and about 50,000 text regions, more than twice the size of SROIE. The detailed information is shown in Table 2-1:

Table 2-1 Key information extraction task data set

2.2 Innovation and contribution

The SDMG-R model proposed in the paper achieves better results on both the SROIE and WildReceipt data sets and outperforms previous models. The authors also perform ablation experiments and verify that the spatial relationship information and the multimodal features proposed in the paper are very important for key information extraction. The specific innovations and contributions are as follows:

  • An effective spatial dual-modality graph reasoning network (SDMG-R) is proposed, which makes full use of the semantic and visual features of the text regions together with their spatial relationships;
  • A new benchmark data set (WildReceipt) is constructed. It is about twice the size of SROIE, and its training and test sets share almost no layout templates, so it can be used for exploratory research on general key information extraction tasks;
  • The paper uses both visual and semantic features and verifies how best to combine them, comparing three feature fusion methods: concatenation, linear summation, and the Kronecker product. The Kronecker product turns out to be about two points better than the other two fusion methods, as shown in Table 2-2 below:

Table 2-2 Comparison results of feature fusion methods

3 Network structure

The overall network structure of the SDMG-R model is shown in Figure 3-1. The model takes as input the image, the detected text region coordinates, and the text content of each region. Visual features are extracted with a U-Net followed by ROI pooling, and semantic features are extracted with a Bi-LSTM. The semantic and visual features are then fused into multimodal features via the Kronecker product and fed into the spatial multimodal graph reasoning module to obtain the final node features. Finally, the classification module performs multi-class classification over the nodes.

Figure 3-1 SDMG-R network structure

3.1 Detailed steps of visual feature extraction:

  1. Resize the input image to a fixed input size (512x512 in this paper);
  2. Feed it into a U-Net, which serves as the visual feature extractor, and take the feature map of its last CNN layer;
  3. Map the text region coordinates from the input image onto this last-layer feature map and extract the visual feature of each text region with ROI pooling (a minimal sketch of these steps follows the list).
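
Below is a minimal PyTorch sketch of these three steps, under stated assumptions rather than the authors' implementation: a tiny encoder-decoder stands in for the U-Net backbone, torchvision's roi_align stands in for the ROI pooling step, and all sizes and box coordinates are illustrative.

```python
# Minimal sketch of the visual branch (not the authors' code).
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class TinyUNetLike(nn.Module):
    """Stand-in for the U-Net visual backbone: downsample then upsample
    so the last feature map keeps the 512x512 input resolution."""
    def __init__(self, out_channels=16):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.up = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, out_channels, 2, stride=2),
        )

    def forward(self, x):
        return self.up(self.down(x))

image = torch.randn(1, 3, 512, 512)          # step 1: image resized to 512x512
feature_map = TinyUNetLike()(image)          # step 2: last-layer CNN feature map

# step 3: text boxes in input-image coordinates, format (batch_idx, x1, y1, x2, y2)
boxes = torch.tensor([[0, 40., 100., 260., 140.],
                      [0, 40., 180., 300., 220.]])
# spatial_scale maps input coordinates onto the feature map (1.0 here because
# the stand-in backbone keeps the input resolution)
region_feats = roi_align(feature_map, boxes, output_size=(7, 7), spatial_scale=1.0)
visual_feats = region_feats.flatten(1)       # one visual feature vector per text region
print(visual_feats.shape)                    # torch.Size([2, 784])
```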

3.2 Detailed steps of text semantic feature extraction:

  1. First, build the character table. The paper collects a character table of 91 entries, covering digits (0-9), letters (a-z, A-Z), and task-related special characters (such as "/", "n", ".", "$", "€", "¥", ":", "-", "*", "#", etc.); any character not in the table is uniformly mapped to "unknown";
  2. Next, map the text characters to a 32-dimensional one-hot-based input encoding;
  3. Then feed the encoded sequence into a Bi-LSTM to extract a 256-dimensional semantic feature (a minimal sketch follows the list).
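
A minimal sketch of the semantic branch follows; it is an assumption-laden illustration rather than the paper's code. A dense 32-dimensional character embedding stands in for the one-hot input encoding, the two 128-dimensional directions of the Bi-LSTM concatenate into the 256-dimensional feature, the vocabulary below is only an illustrative subset, and pooling the sequence by its last timestep is a simplification.

```python
# Minimal sketch of the semantic branch (not the authors' code).
import torch
import torch.nn as nn

VOCAB = list("0123456789") + list("abcdefghijklmnopqrstuvwxyz") \
        + list("ABCDEFGHIJKLMNOPQRSTUVWXYZ") + list("/.$¥:-*#")  # illustrative subset
char2idx = {c: i + 1 for i, c in enumerate(VOCAB)}               # index 0 is "unknown"

embed = nn.Embedding(num_embeddings=len(VOCAB) + 1, embedding_dim=32)
bilstm = nn.LSTM(input_size=32, hidden_size=128, bidirectional=True, batch_first=True)

def text_feature(text: str) -> torch.Tensor:
    ids = torch.tensor([[char2idx.get(c, 0) for c in text]])   # unknown chars -> 0
    outputs, _ = bilstm(embed(ids))                            # (1, len(text), 256)
    return outputs[:, -1]                                      # one 256-dim vector per text region

print(text_feature("TOTAL $18.50").shape)   # torch.Size([1, 256])
```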

3.3 Visual + text semantic feature fusion steps:

For each text region, the visual feature and the semantic feature are fused into a single multimodal node feature via the Kronecker product (compared against concatenation and linear summation in Table 2-2). For a visual feature v of dimension m and a semantic feature s of dimension n, the Kronecker product v⊗s has dimension m·n and contains all pairwise products v_a·s_b, so it captures richer cross-modal interactions than concatenation or summation.
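
The sketch below illustrates the three fusion options compared in Table 2-2, assuming the visual and semantic features have already been projected to a common, illustrative size; the paper may apply additional learned projections around the Kronecker product.

```python
# Minimal sketch of the fusion options (illustration, not the authors' code).
import torch

def fuse(visual, semantic, method="kronecker"):
    """visual, semantic: (num_regions, dim) tensors."""
    if method == "concat":
        return torch.cat([visual, semantic], dim=-1)           # (N, 2*dim)
    if method == "sum":
        return visual + semantic                               # (N, dim)
    if method == "kronecker":
        # outer product of the two feature vectors, flattened per region
        return torch.einsum("nd,ne->nde", visual, semantic).flatten(1)  # (N, dim*dim)
    raise ValueError(method)

visual = torch.randn(2, 64)      # projected visual features of 2 text regions
semantic = torch.randn(2, 64)    # projected semantic features
print(fuse(visual, semantic, "kronecker").shape)   # torch.Size([2, 4096])
```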

3.4 Spatial relationship multi-modal graph reasoning model:

The final node features are produced by the multimodal graph reasoning module: each detected text region is a graph node carrying its fused multimodal feature, the edges encode the spatial relationships between pairs of regions, and the node features are updated iteratively by propagating information along these edges.
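
The sketch below shows one plausible message-passing layer in the spirit of this module; it is not the paper's exact formulation. Edges carry simple spatial relation features (normalized box-center offsets, an assumption here), per-edge attention weights are computed from the two node features plus the edge feature, and each node is updated by a weighted aggregation of the others.

```python
# Illustrative spatial message-passing layer (NOT the paper's exact update rule).
import torch
import torch.nn as nn

class SpatialGraphLayer(nn.Module):
    def __init__(self, node_dim, edge_dim=2, hidden=64):
        super().__init__()
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * node_dim + edge_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.update = nn.Linear(node_dim, node_dim)

    def forward(self, nodes, boxes):
        # nodes: (N, node_dim) fused features; boxes: (N, 4) as (x1, y1, x2, y2)
        centers = torch.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                               (boxes[:, 1] + boxes[:, 3]) / 2], dim=-1)
        rel = (centers[None, :, :] - centers[:, None, :]) / 512.0   # (N, N, 2) spatial relation
        pair = torch.cat([nodes[:, None, :].expand(-1, nodes.size(0), -1),
                          nodes[None, :, :].expand(nodes.size(0), -1, -1),
                          rel], dim=-1)                             # (N, N, 2*node_dim+2)
        attn = torch.softmax(self.edge_mlp(pair).squeeze(-1), dim=-1)  # (N, N) edge weights
        return torch.relu(nodes + self.update(attn @ nodes))       # weighted neighbour aggregation

nodes = torch.randn(3, 128)                  # 3 text regions with fused features
boxes = torch.tensor([[40., 100., 260., 140.],
                      [40., 180., 300., 220.],
                      [320., 180., 470., 220.]])
print(SpatialGraphLayer(node_dim=128)(nodes, boxes).shape)   # torch.Size([3, 128])
```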

3.5 Multi-classification task module

The node features obtained from the graph reasoning module are fed into the classification module, which outputs the final entity class of each node through a multi-class classification head. The loss function is the cross-entropy loss:
L = -\frac{1}{N} \sum_{i=1}^{N} \log p_{i, y_i}

where N is the number of text region nodes, y_i is the ground-truth category of node i, and p_{i, y_i} is the predicted probability that node i belongs to category y_i.
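
As a minimal sketch, the classification module can be read as a linear head over the final node features trained with cross-entropy; the feature size and number of categories below are placeholder assumptions, not values from the paper.

```python
# Minimal sketch of the classification module and its loss (illustration only).
import torch
import torch.nn as nn

node_feats = torch.randn(3, 128)            # final node features for 3 text regions (assumed size)
labels = torch.tensor([0, 5, 12])           # ground-truth category of each region

classifier = nn.Linear(128, 25)             # multi-class head over 25 assumed categories
logits = classifier(node_feats)             # (3, 25) unnormalized class scores
loss = nn.CrossEntropyLoss()(logits, labels)
print(loss.item())
```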

4 Experimental results

The results on the SROIE data set are shown in Table 4-1:

Table 4-1 Accuracy of SROIE

The results on the WildReceipt test set are shown in Table 4-2:

Table 4-2 Accuracy of WildReceipt
