Abstract: DocParser is an end-to-end document structure analysis system that extracts the structure of documents (scanned versions, image versions, etc.). It covers entity recognition (here an entity is any element that needs to be detected, such as text, rows, columns, and cells) and relation classification.
This article is shared from the HUAWEI Cloud Community post "Paper Interpretation Series 15: Document Structure Analysis"; original author: Yixiao Allure.
1 Article summary
The paper proposes an end-to-end document structure analysis solution, DocParser, which extracts the structure of a document (scanned version, image version, etc.) through entity recognition (an entity is any element that needs to be detected, such as text, rows, columns, and cells) and relation classification. Weakly supervised labels are generated from TeX sources with SyncTeX, by mapping rendered content back to the TeX code that produced it.
2 Solution
Given a document set D, the goal is to generate a hierarchical structure T, where T consists of entities and the relations between them. For entities, E refers to the various elements in a document, such as figures, tables, rows, and cells. Each entity has three attributes: (1) a semantic category, (2) the coordinates of its bounding box, and (3) a confidence score. For relations, R is given by triples (Esubj, Eobj, Ψ), where the relation category Ψ ∈ {parent of, followed by, null}; null covers entities that are unrelated to the others, such as headers and footers.
The combination of entity E and its relationship R is sufficient to reconstruct the hierarchical structure T of a document.
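To make the reconstruction step concrete, here is a minimal sketch of how a tree could be rebuilt from entities and relation triples. The class names, field names, relation strings, and the `build_tree` helper are all illustrative assumptions, not the paper's code. Entities that appear only in null relations simply end up as top-level nodes.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    id: int
    category: str   # semantic category, e.g. "TABLE"
    box: tuple      # bounding box as (x0, y0, x1, y1)
    score: float    # confidence score

@dataclass
class Relation:
    subj: int       # id of E_subj
    obj: int        # id of E_obj
    kind: str       # "parent_of" or "followed_by" (hypothetical labels)

def build_tree(entities, relations):
    """Rebuild a nested structure from entities and relation triples."""
    by_id = {e.id: e for e in entities}
    children = {e.id: [] for e in entities}
    has_parent, next_of = set(), {}
    for r in relations:
        if r.kind == "parent_of":
            children[r.subj].append(r.obj)
            has_parent.add(r.obj)
        elif r.kind == "followed_by":
            next_of[r.subj] = r.obj

    def ordered(ids):
        # Order siblings by walking the followed_by chains among them.
        # Assumes the chains contain no cycles.
        idset = set(ids)
        followers = {next_of[i] for i in ids if next_of.get(i) in idset}
        out = []
        for head in (i for i in ids if i not in followers):
            cur = head
            while cur in idset and cur not in out:
                out.append(cur)
                cur = next_of.get(cur)
        return out

    def node(i):
        return {"id": i, "category": by_id[i].category,
                "children": [node(c) for c in ordered(children[i])]}

    roots = [e.id for e in entities if e.id not in has_parent]
    return [node(r) for r in ordered(roots)]
```

Calling `build_tree` with one TABLE entity, two TABLE ROW entities, two parent of relations, and one followed by relation between the rows yields a single TABLE node whose children are the rows in reading order.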
Difficulties: entities that look alike, nested levels, and the diversity across different documents.
2.1 Image Conversion
The input document is converted into an image at a predefined resolution ρ, and all images are then resized to a fixed size φ (with zero padding if necessary). After that, the images are preprocessed: their RGB channels are normalized in the same way as for the MS COCO dataset, so that weights pre-trained on that dataset can be used when the model is later initialized.
2.2 Entity Detection
A Mask R-CNN model performs image segmentation to identify all entities in a document image. The model takes the image produced by the previous stage as input and outputs a list of entities E1, ..., Em. For each entity, Mask R-CNN determines: 1) its rectangular bounding box, 2) a confidence score, 3) a binary segmentation mask (which separates the detected entity from the background pixels inside the bounding box), and 4) the entity's category label. There are 23 categories in total: CONTENT BLOCK, TABLE, TABLE ROW, TABLE COLUMN, TABLE CELL, TABULAR, FIGURE, HEADING, ABSTRACT, EQUATION, ITEMIZE, ITEM, BIBLIOGRAPHY BLOCK, TABLE CAPTION, FIGURE GRAPHIC, FIGURE CAPTION, HEADER, FOOTER, PAGE NUMBER, DATE, KEYWORDS, AUTHOR, AFFILIATION.
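The resize-and-pad step can be sketched in a few lines. This is only an illustration under stated assumptions: the summary does not say whether the resize preserves the aspect ratio (assumed here), where the padding goes (right and bottom here), or which normalization statistics are used (the per-channel means are caller-supplied placeholders, and mean subtraction is just one common form of normalization). The function names are hypothetical.

```python
def fit_and_pad(width, height, target):
    """Scale (width, height) to fit a target x target square, keeping the
    aspect ratio, and return (new_w, new_h, pad_right, pad_bottom).
    Aspect-preserving scaling is an assumption, not stated in the source."""
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # The remaining area would be filled with zeros (zero padding).
    return new_w, new_h, target - new_w, target - new_h

def normalize_pixel(rgb, channel_mean):
    """Subtract a per-channel mean from one RGB pixel.
    channel_mean is a placeholder for the dataset statistics."""
    return tuple(v - m for v, m in zip(rgb, channel_mean))
```

For example, a 1000 x 2000 page fitted into a 500 x 500 canvas becomes 250 x 500 with 250 columns of zero padding on the right.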
2.3 Relation Classification
Relation classification is essentially a set of heuristics.
2.3.1 Nesting (parent of)
Nesting is determined in four steps:
- h1: Overlaps: the overlap between detected bounding boxes is measured by IoU to build a list of candidate parent-child pairs;
- h2: Grammar Check: candidate pairs are checked against a grammar of permitted nestings;
- h3: Direct Children: the candidate list is pruned so that only direct children are kept and deeper descendants are removed;
- h4: Unique Parents: the candidate list is pruned so that each entity has only one parent node.
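The four nesting heuristics can be sketched as a small pipeline. Everything specific here is an assumption made for illustration: the `Entity` class, the example grammar passed by the caller, the IoU threshold, the rule that a parent must have a larger area than its child, and the tie-break in h4 (keep the parent with the highest IoU). The summary does not specify these details.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    id: int
    category: str
    box: tuple  # (x0, y0, x1, y1)

def area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nest(entities, grammar, min_iou=0.0):
    """Return a set of (parent_id, child_id) pairs."""
    by_id = {e.id: e for e in entities}
    # h1: overlapping boxes; the larger box is the candidate parent (assumption).
    cand = {(p.id, c.id) for p in entities for c in entities
            if p.id != c.id and area(p.box) > area(c.box)
            and iou(p.box, c.box) > min_iou}
    # h2: keep only pairs whose categories the grammar permits.
    cand = {(p, c) for p, c in cand
            if by_id[c].category in grammar.get(by_id[p].category, ())}
    # h3: drop (p, c) when some m sits between them, i.e. c is a deeper descendant.
    direct = {(p, c) for p, c in cand
              if not any((p, m) in cand and (m, c) in cand for m in by_id)}
    # h4: one parent per child; keep the highest-IoU parent (assumption).
    best = {}
    for p, c in direct:
        s = iou(by_id[p].box, by_id[c].box)
        if c not in best or s > best[c][1]:
            best[c] = (p, s)
    return {(p, c) for c, (p, _) in best.items()}
```

With a hypothetical grammar such as `{"TABLE": {"TABULAR"}, "TABULAR": {"TABLE ROW", "TABLE COLUMN", "TABLE CELL"}, "TABLE ROW": {"TABLE CELL"}, "TABLE COLUMN": {"TABLE CELL"}}`, a cell that overlaps both a row and a column is first detached from the enclosing tabular (h3) and then assigned to whichever of the two overlaps it more (h4).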
2.3.2 Ordering (followed by)
Entities are arranged in natural reading order (for example, from left to right). By default, all entities go through the following two heuristics:
- Page Layout: determines whether the page uses a single-column or a multi-column layout;
- Reading Flow: reorders the nodes according to the reading order.
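The two ordering heuristics might look like the sketch below. The layout test used here (a page counts as two-column when every box lies entirely in the left or right half and both halves are occupied) and the sort keys are assumptions chosen for illustration; the summary does not describe the actual rules, and the function names are hypothetical.

```python
def is_two_column(boxes, page_width):
    """Hypothetical layout test on (x0, y0, x1, y1) boxes."""
    mid = page_width / 2
    left = [b for b in boxes if b[2] <= mid]
    right = [b for b in boxes if b[0] >= mid]
    # Two columns only if both sides are used and no box crosses the centre line.
    return bool(left) and bool(right) and len(left) + len(right) == len(boxes)

def reading_order(boxes, page_width):
    """Sort boxes into an assumed reading order."""
    if is_two_column(boxes, page_width):
        mid = page_width / 2
        # Left column first, then top to bottom within each column.
        key = lambda b: (0 if b[2] <= mid else 1, b[1], b[0])
    else:
        # Single column: top to bottom, then left to right.
        key = lambda b: (b[1], b[0])
    return sorted(boxes, key=key)

def followed_by_pairs(ordered):
    """Turn an ordered list into consecutive (subject, object) pairs."""
    return list(zip(ordered, ordered[1:]))
```

Each consecutive pair in the sorted list then corresponds to one followed by relation.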
3 Experimental results
Structure analysis results on the ICDAR table data: