Abstract: In document layout analysis, the visual information, the textual information, and the relations between layout components all play an important role. This paper proposes VSR, a layout analysis framework that integrates visual, semantic, and relational multi-modal information.
This article is shared from the Huawei Cloud Community post "Integrating Visual, Semantic, and Relational Multi-modal Information", by Xiao Cainiao chg.
Existing document layout analysis methods fall roughly into two camps. NLP-based methods treat layout analysis as a sequence labeling task; their weakness is insufficient layout modeling, since they cannot capture spatial information. CV-based methods treat layout analysis as an object detection or segmentation task; their shortcomings are (1) a lack of fine-grained semantics, (2) simple concatenation of modalities, and (3) no use of relation information. Figure 1 illustrates the motivation for VSR. To address these limitations, this paper proposes VSR (Vision, Semantic, Relation), a layout analysis architecture that integrates visual, semantic, and relational multi-modal information.
Figure 1 Schematic diagram of VSR motivation
1. Problem definition
The layout analysis task can be formulated either as sequence labeling or as object detection; the main difference lies in how component candidates are chosen. Under the NLP-based definition (sequence labeling), the candidates are text tokens obtained by PDF parsing or OCR recognition. Under the CV-based definition (object detection or segmentation), the candidates are RoIs produced by a detection network such as Mask R-CNN. VSR is built mainly around the object detection formulation, but it can also be applied directly to NLP-based methods.
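The two formulations differ mainly in what a "candidate" is. A minimal sketch of the two candidate types follows; all class and field names here are illustrative, not taken from the paper:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TokenCandidate:
    """NLP-style candidate: a text token from PDF parsing or OCR,
    to be classified with a sequence-labeling tag."""
    text: str
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    label: str = "O"                         # e.g. "B-paragraph", "I-table"

@dataclass
class RegionCandidate:
    """CV-style candidate: an RoI proposed by a detector such as Mask R-CNN,
    to be classified as a whole layout component."""
    bbox: Tuple[float, float, float, float]
    score: float
    label: str = "unknown"                   # e.g. "figure", "table", "title"
```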
2. VSR architecture
The VSR architecture, shown in Figure 2, comprises three modules: a two-stream ConvNet, a multi-scale adaptive aggregation module, and a relation learning module. First, the two-stream convolutional network extracts visual and semantic features. Then, instead of simple concatenation, the multi-scale adaptive aggregation module produces a joint representation of the visual and semantic modalities, from which the set of layout component candidates is generated. Finally, the relation learning module learns the relations among the candidates and produces the final result. Each module is described in detail below.
Figure 2 VSR architecture diagram
2.1 Two-stream convolutional network
VSR uses a two-stream convolutional neural network (ResNeXt-101 in this paper) to extract visual information from the image and semantic information from the text, respectively.
Vision ConvNet: takes the document image as input and extracts visual features.
Semantic ConvNet: takes a text embedding map (built from character- and sentence-level semantics) as input and extracts semantic features.
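As a concrete illustration, here is a minimal sketch of the two-stream backbone, assuming torchvision's ResNeXt-101 for both streams (the article only states that ResNeXt-101 is used) and a simplified semantic input, where character- and sentence-level embeddings are painted onto a 2D grid so a CNN can consume them like an image:

```python
import torch
import torch.nn as nn
from torchvision.models import resnext101_32x8d

def make_stream(in_channels: int) -> nn.Module:
    """ResNeXt-101 trunk whose first conv accepts `in_channels` inputs."""
    net = resnext101_32x8d(weights=None)
    net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2,
                          padding=3, bias=False)
    # Drop the average pool and classifier; keep the convolutional trunk.
    return nn.Sequential(*list(net.children())[:-2])

emb_dim = 64                                         # illustrative embedding size
vision_stream = make_stream(in_channels=3)           # RGB document image
semantic_stream = make_stream(in_channels=emb_dim)   # text embedding map

image = torch.randn(1, 3, 800, 608)   # document image
# char_map / sent_map: embeddings scattered onto the page grid at the
# positions where the text appears (random tensors here for brevity).
char_map = torch.randn(1, emb_dim, 800, 608)
sent_map = torch.randn(1, emb_dim, 800, 608)
semantic_map = char_map + sent_map    # fuse the two granularities

f_v = vision_stream(image)            # visual features F_V
f_s = semantic_stream(semantic_map)   # semantic features F_S
```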
2.2 Multi-scale adaptive aggregation module
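As stated in the summary (Section 4), this module fuses the two modalities through attention rather than simple concatenation. A minimal sketch, continuing the previous one, assuming a learned per-position gate alpha in [0, 1]; the actual module operates adaptively across multiple feature scales, and all names are illustrative:

```python
import torch
import torch.nn as nn

class AdaptiveAggregation(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Predict an attention map from the concatenated modalities.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_v: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
        alpha = self.gate(torch.cat([f_v, f_s], dim=1))  # attention weights
        return alpha * f_v + (1 - alpha) * f_s           # aggregated F_M

agg = AdaptiveAggregation(channels=2048)
f_m = agg(f_v, f_s)  # multi-modal feature map fed to the RPN
```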
2.3 Relationship Learning Module
Once F_M is obtained, RoIs (Regions of Interest) can readily be produced by an RPN as the layout component candidate set. In the experiments, Mask R-CNN is used with 7 anchor aspect ratios (0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0) to obtain the candidates. As shown in Figure 3, the relations among candidates can be exploited in several ways: (1) spatial relations help adjust the coordinates of text boxes; (2) co-occurrence relations between components (e.g., a table and its caption appear together) help correct predicted labels; (3) the non-overlap property between components helps remove redundant boxes. The relation learning module in VSR models these relations among the candidates and produces the final layout analysis result.
Figure 3 Schematic diagram of the function of the VSR relationship learning module
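For reference, the stated anchor configuration can be expressed with torchvision's AnchorGenerator. Only the 7 aspect ratios come from the article; the sizes below are torchvision's illustrative per-level defaults:

```python
from torchvision.models.detection.rpn import AnchorGenerator

# Ratios span very wide, flat boxes (text lines) up to tall boxes.
ratios = (0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0)
anchor_generator = AnchorGenerator(
    sizes=((32,), (64,), (128,), (256,), (512,)),  # one tuple per FPN level
    aspect_ratios=(ratios,) * 5,                   # same 7 ratios per level
)
```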
Treat the document as a graph, with each component candidate as a node. The feature representation of each node combines its multi-modal feature representation with a representation of its position.
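A minimal sketch of this relation module, assuming node features are the RoI's multi-modal feature plus a learned embedding of its box coordinates, and that the graph reasoning is realized as multi-head self-attention over all candidates (as stated in the summary); all names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class RelationModule(nn.Module):
    def __init__(self, feat_dim: int = 1024, num_classes: int = 5):
        super().__init__()
        self.pos_embed = nn.Linear(4, feat_dim)   # embed (x1, y1, x2, y2)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8,
                                          batch_first=True)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, roi_feats: torch.Tensor, boxes: torch.Tensor):
        # roi_feats: (N, feat_dim) multi-modal RoI features
        # boxes:     (N, 4) normalized candidate coordinates
        nodes = roi_feats + self.pos_embed(boxes)  # node representations
        nodes = nodes.unsqueeze(0)                 # batch of one document
        ctx, _ = self.attn(nodes, nodes, nodes)    # message passing
        return self.classifier(ctx.squeeze(0))     # refined labels

relation = RelationModule()
logits = relation(torch.randn(12, 1024), torch.rand(12, 4))  # 12 candidates
```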
2.4 Optimization and training
3. Experimental results
3.1 Comparative experiment
VSR achieves the best results on three open-source datasets: Article Regions, PubLayNet, and DocBank.
3.2 Ablation experiment
The experimental results in Table 5, Table 6, and Table 7 respectively verify the effectiveness of three components: (A) text representations at different granularities; (B) the two-stream convolutional network and aggregation module; (C) the relation learning module.
4. Summary
The three key components of the VSR method are as follows:
(1) Text semantics are represented at two granularities: character and sentence;
(2) A two-stream ConvNet extracts visual and semantic features separately; the two modalities are then aggregated via attention, and component candidates are obtained from the aggregated features;
(3) A GNN, implemented as self-attention, learns the relations among component candidates.