Abstract: In document layout analysis, visual information, textual information, and the relations between layout components all play important roles. This paper proposes VSR, a layout analysis framework that integrates visual, semantic, and relational multi-modal information.

This article is shared from the Huawei Cloud Community post "Integrating Visual, Semantic, and Relational Multi-modal Information" by Xiao Cainiao chg.

Existing document layout analysis methods fall roughly into two categories. NLP-based methods treat layout analysis as a sequence labeling task; their weakness is that they model the layout poorly and cannot capture spatial information. CV-based methods treat layout analysis as an object detection or segmentation task; their shortcomings are (1) a lack of fine-grained semantics, (2) simple concatenation when fusing modalities, and (3) unused relation information between components. Figure 1 illustrates the motivation for VSR. To overcome these limitations, this paper proposes VSR (Vision, Semantics, Relations), a layout analysis architecture that integrates visual, semantic, and relational multi-modal information.

Figure 1: Schematic diagram of the motivation for VSR

1. Problem definition

The layout analysis task can be formulated either as sequence labeling or as object detection; the main difference lies in how component candidates are selected. In the NLP-based formulation (sequence labeling), the candidates are text tokens obtained by PDF parsing or OCR. In the CV-based formulation (object detection or segmentation), the candidates are the regions of interest (RoIs) produced by a detection network such as Mask R-CNN. VSR is built mainly around the object detection formulation, but it can also be applied directly to NLP-based methods.
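To make the two formulations concrete, the sketch below contrasts their output representations. The class names and fields are illustrative assumptions, not taken from the paper:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TokenLabel:
    """NLP-based formulation: one layout label per text token."""
    token: str                               # token text from PDF parsing or OCR
    bbox: Tuple[int, int, int, int]          # token position on the page
    label: str                               # e.g. "paragraph", "title", "table"

@dataclass
class RegionCandidate:
    """CV-based formulation: one label per detected region (RoI)."""
    bbox: Tuple[float, float, float, float]  # region proposed by e.g. Mask R-CNN
    score: float                             # detection confidence
    label: str                               # layout category of the whole region
```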

2. VSR architecture

The VSR architecture is shown in Figure 2 and consists of three modules: a two-stream ConvNet, a multi-scale adaptive aggregation module, and a relation learning module. First, the two-stream ConvNet extracts visual and semantic features. Then, instead of simple concatenation, the multi-scale adaptive aggregation module fuses the two modalities into a joint visual-semantic representation. From this aggregated representation, a set of layout component candidates is generated. Finally, the relation learning module learns the relations between the candidates and produces the final result. Each module is described in detail below.
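Before diving into the modules, here is a minimal sketch of the overall data flow. The function and module names are placeholders standing in for the components described below, not the authors' code:

```python
def vsr_forward(image, text_embedding_maps):
    # 1. Two-stream ConvNet: extract visual and semantic features separately
    visual_feats = vision_convnet(image)
    semantic_feats = semantic_convnet(text_embedding_maps)

    # 2. Multi-scale adaptive aggregation: attention-based fusion
    #    of the two modalities (rather than simple concatenation)
    fused_feats = adaptive_aggregation(visual_feats, semantic_feats)

    # 3. Generate layout component candidates from the fused features
    candidates = proposal_network(fused_feats)

    # 4. Relation learning over the candidates yields the final layout
    return relation_module(candidates)
```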

Figure 2: VSR architecture diagram

2.1 Two-stream convolutional network

VSR uses a two-stream convolutional network (ResNeXt-101 in this paper) to extract visual information from the image and semantic information from the text, respectively.

Vision ConvNet

The vision stream takes the document image as input and extracts multi-scale visual feature maps.

Semantic ConvNet

The semantic stream encodes the document text at two granularities, character-level and sentence-level, and extracts multi-scale semantic feature maps from the resulting text embedding maps.
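A sketch of the two-stream extraction is below, assuming (as stated in the summary in Section 4) that the text is rendered into page-aligned embedding maps before being fed to the second stream. The projection of the text maps to three channels and the tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torchvision

def make_resnext_backbone() -> nn.Module:
    # ResNeXt-101 backbone (the paper's choice), used here as a plain
    # feature extractor by dropping the average pool and FC head.
    m = torchvision.models.resnext101_32x8d(weights=None)
    return nn.Sequential(*list(m.children())[:-2])

vision_net = make_resnext_backbone()    # input: document image
semantic_net = make_resnext_backbone()  # input: text embedding maps

image = torch.randn(1, 3, 800, 800)
# Character- and sentence-level embeddings rendered into page-aligned maps,
# projected here to 3 channels purely for illustration.
text_maps = torch.randn(1, 3, 800, 800)

visual_feats = vision_net(image)        # shape: (1, 2048, 25, 25)
semantic_feats = semantic_net(text_maps)
```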

2.2 Multi-scale adaptive aggregation module

At each scale, this module learns attention weights that adaptively fuse the visual and semantic feature maps into an aggregated feature map FM, rather than simply concatenating the two modalities.
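A minimal sketch of such attention-based fusion at a single scale is shown below, assuming a learned per-pixel gate over the concatenated features; the exact form of the paper's aggregation may differ:

```python
import torch
import torch.nn as nn

class AdaptiveAggregation(nn.Module):
    """Fuse visual and semantic features with a learned attention gate."""

    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 conv predicts a per-pixel gate from both modalities
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, visual: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        alpha = self.gate(torch.cat([visual, semantic], dim=1))
        # Aggregated feature map FM: attention-weighted sum of the two streams
        return alpha * visual + (1 - alpha) * semantic

fuse = AdaptiveAggregation(channels=256)
fm = fuse(torch.randn(1, 256, 50, 50), torch.randn(1, 256, 50, 50))
```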

2.3 Relation learning module

Once the aggregated feature map FM is obtained, regions of interest (RoIs) can readily be generated by an RPN as the set of layout component candidates. In the experiments, Mask R-CNN is used with 7 anchor ratios (0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0) to produce the candidate set. As Figure 3 shows, exploiting the relations between candidates has several effects: (1) spatial position relations can refine the coordinates of a text box; (2) co-occurrence relations between components (for example, a table and its caption appear together) can correct a predicted label; (3) the non-overlapping property of components removes redundant boxes. The relation learning module in VSR models these relations between the component candidates and produces the final layout analysis result.

Figure 3: Effects of the VSR relation learning module
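For reference, this is how the 7 anchor ratios above could be configured when building a Mask R-CNN in detectron2; the choice of framework is an assumption, as the article does not name one:

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(
    model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
)
# A single list of aspect ratios is broadcast to every FPN level; layout
# components are often very wide or very tall, hence the extreme ratios.
cfg.MODEL.ANCHOR_GENERATOR.ASPECT_RATIOS = [[0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0]]
```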

The document is treated as a graph, with each component candidate as a node. The feature representation of each node is composed of a multi-modal feature representation and a position representation:
That is, for each candidate, the node feature combines the RoI's multi-modal feature pooled from the aggregated map FM with an encoding of its bounding-box position.
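Per the summary in Section 4, the relations between nodes are learned with self-attention. The sketch below illustrates this, with the feature size, head count, and classification head as illustrative assumptions:

```python
import torch
import torch.nn as nn

class RelationModule(nn.Module):
    """Self-attention over candidate nodes, refining per-candidate labels."""

    def __init__(self, dim: int = 256, num_classes: int = 5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (batch, num_candidates, dim), each row a candidate's
        # multi-modal RoI feature plus its position representation
        ctx, _ = self.attn(nodes, nodes, nodes)  # every node attends to all others
        return self.classifier(nodes + ctx)      # relation-refined predictions

rel = RelationModule()
logits = rel(torch.randn(1, 12, 256))  # 12 component candidates on one page
```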

2.4 Training and optimization

[Equation: overall training objective]
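A common objective for a detector paired with a relation head, sketched purely as an assumption (the loss terms and the weight lambda_rel are hypothetical, not from the paper), sums the detection losses and the relation module's classification loss:

```python
def total_loss(det_losses: dict, rel_loss: float, lambda_rel: float = 1.0) -> float:
    # det_losses: RPN and box classification/regression (and mask) losses
    # rel_loss: classification loss of the relation learning module
    return sum(det_losses.values()) + lambda_rel * rel_loss
```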

3. Experimental results

3.1 Comparative experiment

VSR achieved the best results on three public datasets: Article Regions, PubLayNet, and DocBank.

3.2 Ablation experiments

The results in Table 5, Table 6, and Table 7 respectively verify the effectiveness of three parts: A. the different granularities of text representation; B. the two-stream ConvNet and aggregation module; and C. the relation learning module.

4. Summary

The three key components of the VSR method are:
(1) Text semantics are represented at two granularities: character-level and sentence-level;
(2) A two-stream ConvNet extracts visual and semantic features separately, the two modalities are aggregated through attention, and component candidates are obtained from the aggregated features;
(3) A GNN with self-attention learns the relations between component candidates.


