Paper Interpretation | Document Structure Analysis

Abstract: DocParser is an end-to-end system for document structure analysis. It extracts the structure of documents (scans, images, etc.) through entity recognition and relation classification. Here, "entity" refers to every element that needs to be detected, including text, rows, columns, cells, and so on.

This article is shared from the HUAWEI Cloud Community post "Paper Interpretation Series 15: Document Structure Analysis". Original author: Yixiao Allure.

1 Article summary

The paper proposes DocParser, an end-to-end solution for document structure analysis that extracts the structure of a document (scans, images, etc.) through entity recognition and relation classification. Here, an entity is any element that needs to be detected, such as text, rows, columns, and cells. Weakly supervised labels are generated from TeX sources, using SyncTeX to map the rendered output back to the TeX code.

2 Solution

Given a set of documents D, the goal is to generate a hierarchical structure T, where T consists of entities and the relations between them. Entities E are the elements of a document, such as figures, tables, rows, and cells. Each entity has three attributes: (1) a semantic category, (2) the coordinates of its bounding box, and (3) a confidence score. Relations R are given as triples (E_subj, E_obj, Ψ), where the relation category Ψ ∈ {parent of, followed by, null}. The null category covers entities that are unrelated to one another, such as headers and footers.

Together, the entities E and the relations R are sufficient to reconstruct the hierarchical structure T of a document.
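To make the entity and relation definitions concrete, here is a minimal sketch of how they might be represented and turned back into a tree. The class names, the field names, the integer indices used for E_subj and E_obj, and the helpers `order_siblings` and `build_tree` are all assumptions made for illustration; they are not taken from the DocParser code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Entity:
    category: str                            # semantic category, e.g. "TABLE"
    bbox: Tuple[float, float, float, float]  # bounding box as (x0, y0, x1, y1)
    score: float                             # confidence score


@dataclass(frozen=True)
class Relation:
    subj: int  # index of the subject entity (assumed representation)
    obj: int   # index of the object entity
    kind: str  # "parent_of" or "followed_by"; null relations are simply not stored


def order_siblings(ids: List[int], relations: List[Relation]) -> List[int]:
    """Order sibling indices by following their followed_by chain."""
    nxt = {r.subj: r.obj for r in relations if r.kind == "followed_by"}
    has_pred = set(nxt.values())
    ordered: List[int] = []
    for start in ids:
        if start in has_pred:
            continue  # not the head of a chain
        cur = start
        while cur in ids and cur not in ordered:
            ordered.append(cur)
            cur = nxt.get(cur)
    # Keep any sibling that was not reachable from a chain head.
    ordered += [i for i in ids if i not in ordered]
    return ordered


def build_tree(root: int, entities: List[Entity], relations: List[Relation]):
    """Rebuild the hierarchy below `root` as nested (category, children) tuples."""
    kids = [r.obj for r in relations if r.kind == "parent_of" and r.subj == root]
    return (
        entities[root].category,
        [build_tree(k, entities, relations) for k in order_siblings(kids, relations)],
    )


entities = [
    Entity("TABLE", (0, 0, 100, 60), 0.99),
    Entity("TABLE ROW", (0, 0, 100, 30), 0.95),
    Entity("TABLE ROW", (0, 30, 100, 60), 0.94),
    Entity("TABLE CELL", (0, 0, 50, 30), 0.90),
]
relations = [
    Relation(0, 2, "parent_of"),
    Relation(0, 1, "parent_of"),
    Relation(1, 3, "parent_of"),
    Relation(1, 2, "followed_by"),
]
tree = build_tree(0, entities, relations)
```

In this sketch the parent of relations give the nesting and the followed by relations fix the order of siblings, which is why the two row entities come out in reading order even though they were listed in reverse.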

Difficulties: entities that look alike, nested hierarchy levels, and the diversity of layouts across documents.

2.1 Image Conversion

The input document is converted into an image at a predefined resolution ρ, and every image is then resized to a fixed size φ (with zero padding if necessary). The images are then preprocessed: the RGB channels are normalized in the same way as for the MS COCO dataset, so that weights pre-trained on that dataset can be used when the model is initialized.
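The padding and normalization steps can be sketched in plain Python as follows. This is an illustration under several assumptions: the image is already resized to fit within φ, padding is added at the bottom and right, normalization is a per-channel mean subtraction, and the mean values in `CHANNEL_MEAN` are placeholders rather than the actual MS COCO statistics.

```python
from typing import List, Sequence

Image = List[List[List[float]]]  # rows -> columns -> [R, G, B]

# Placeholder per-channel means; NOT the real MS COCO statistics.
CHANNEL_MEAN = (120.0, 110.0, 100.0)


def pad_to_size(image: Image, size: int) -> Image:
    """Zero-pad an H x W x 3 image at the bottom and right to size x size."""
    height = len(image)
    width = len(image[0]) if height else 0
    if height > size or width > size:
        raise ValueError("image must be resized to fit before padding")
    padded: Image = []
    for y in range(size):
        row = []
        for x in range(size):
            if y < height and x < width:
                row.append(list(image[y][x]))
            else:
                row.append([0.0, 0.0, 0.0])  # zero padding
        padded.append(row)
    return padded


def normalize(image: Image, channel_mean: Sequence[float]) -> Image:
    """Subtract a per-channel mean from every pixel."""
    return [
        [[value - mean for value, mean in zip(pixel, channel_mean)] for pixel in row]
        for row in image
    ]


# A tiny 1 x 2 image padded to 2 x 2, then normalized.
image = [[[130.0, 120.0, 110.0], [120.0, 110.0, 100.0]]]
prepared = normalize(pad_to_size(image, 2), CHANNEL_MEAN)
```

Padding is applied before normalization here, following the order in which the steps are described above. A real pipeline would typically use an array library and an image resizing routine instead of nested lists.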

2.2 Entity Detection

A Mask R-CNN model performs image segmentation to identify all entities in a document image. It takes the image produced by the previous stage as input and outputs a list of entities E1, ..., Em. For each entity, Mask R-CNN determines: 1) its rectangular bounding box, 2) a confidence score, 3) a binary segmentation mask that separates the pixels of the detected entity from the background pixels inside the bounding box, and 4) the entity's category label. There are 23 categories in total: CONTENT BLOCK, TABLE, TABLE ROW, TABLE COLUMN, TABLE CELL, TABULAR, FIGURE, HEADING, ABSTRACT, EQUATION, ITEMIZE, ITEM, BIBLIOGRAPHY BLOCK, TABLE CAPTION, FIGURE GRAPHIC, FIGURE CAPTION, HEADER, FOOTER, PAGE NUMBER, DATE, KEYWORDS, AUTHOR, AFFILIATION.

2.3 Relation Classification

Relation classification is essentially a set of heuristics.

2.3.1 Nesting (parent of), in 4 steps:

h1: Overlaps. The overlap between bounding boxes is measured by IoU to produce candidate parent-child pairs;
h2: Grammar Check. Candidate pairs are checked against the document grammar;
h3: Direct Children. The candidate list is pruned so that only direct children are kept; indirect descendants are removed;
h4: Unique Parents. The candidate list is pruned so that each entity has only one parent node.
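The four steps above can be sketched as a small pipeline over bounding boxes. This is a rough illustration, not the paper's implementation. The following are all assumptions: the grammar table `GRAMMAR`, the rule that the larger box is the parent, the `min_iou` threshold, and the tie-breaking rule in h4 (keep the parent with the highest IoU).

```python
from typing import Dict, List, Set, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

# Hypothetical grammar: which child categories each parent category may contain.
GRAMMAR: Dict[str, Set[str]] = {
    "TABLE": {"TABULAR"},
    "TABULAR": {"TABLE ROW", "TABLE CELL"},
    "TABLE ROW": {"TABLE CELL"},
}


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def nest(categories: List[str], boxes: List[Box],
         grammar: Dict[str, Set[str]], min_iou: float = 0.0) -> Dict[int, int]:
    """Return a child -> parent mapping after applying h1 to h4."""
    n = len(boxes)
    # h1: overlapping boxes become candidates; the larger box is the parent.
    cand = {(p, c) for p in range(n) for c in range(n)
            if p != c and area(boxes[p]) > area(boxes[c])
            and iou(boxes[p], boxes[c]) > min_iou}
    # h2: keep only pairs that the grammar allows.
    cand = {(p, c) for p, c in cand
            if categories[c] in grammar.get(categories[p], set())}
    # h3: drop (p, c) when some m sits between them, i.e. c is not a direct child.
    direct = {(p, c) for p, c in cand
              if not any((p, m) in cand and (m, c) in cand for m in range(n))}
    # h4: keep one parent per child, here the one with the highest IoU.
    parent: Dict[int, int] = {}
    for p, c in sorted(direct):
        if c not in parent or iou(boxes[p], boxes[c]) > iou(boxes[parent[c]], boxes[c]):
            parent[c] = p
    return parent


categories = ["TABLE", "TABULAR", "TABLE ROW", "TABLE ROW", "TABLE CELL"]
boxes = [
    (0, 0, 100, 60),   # table
    (0, 10, 100, 60),  # tabular
    (0, 10, 100, 30),  # first row
    (0, 30, 100, 50),  # second row
    (0, 10, 50, 32),   # cell, slightly spilling into the second row
]
parents = nest(categories, boxes, GRAMMAR)
```

In this example the cell overlaps the tabular and both rows. h3 removes the tabular as a parent because the first row sits in between, and h4 picks the first row over the second because it overlaps the cell more.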

2.3.2 Ordering (followed by)

Entities are arranged in natural reading order (for example, from left to right). By default, all entities go through the following two heuristics:

Page Layout Entities: mainly determine whether the page has a single-column or a multi-column layout;
Reading Flow: reorder the nodes according to the reading order.
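A minimal sketch of these two heuristics is shown below. The article does not give the exact rules, so everything here is an assumption: a page counts as multi-column when most entities are narrower than half the page, only two columns are handled, and entities are read top to bottom within a column, left column first.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def is_multi_column(boxes: List[Box], page_width: float) -> bool:
    """Assumed rule: multi-column if most entities are narrower than half the page."""
    narrow = sum(1 for b in boxes if (b[2] - b[0]) < page_width / 2)
    return narrow > len(boxes) / 2


def reading_order(boxes: List[Box], page_width: float) -> List[int]:
    """Return entity indices in an approximate reading order."""
    if is_multi_column(boxes, page_width):
        def key(i: int):
            x0, y0, x1, _ = boxes[i]
            column = 0 if (x0 + x1) / 2 < page_width / 2 else 1
            return (column, y0, x0)  # left column first, then top to bottom
    else:
        def key(i: int):
            return (boxes[i][1], boxes[i][0])  # top to bottom, then left to right
    return sorted(range(len(boxes)), key=key)


def followed_by(order: List[int]) -> List[Tuple[int, int]]:
    """Turn an ordering into (subject, object) followed_by pairs."""
    return list(zip(order, order[1:]))


two_column = [
    (55, 10, 95, 40),  # right column, top
    (5, 50, 45, 90),   # left column, bottom
    (5, 10, 45, 40),   # left column, top
    (55, 50, 95, 90),  # right column, bottom
]
order = reading_order(two_column, page_width=100)
pairs = followed_by(order)

single_column = [(5, 40, 95, 60), (5, 10, 95, 30)]
single_order = reading_order(single_column, page_width=100)
```

A full-width heading on an otherwise two-column page would be placed by its center alone under this simple rule, so a real implementation would need extra handling for entities that span columns.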

3 Experimental results

Structure analysis results on ICDAR tables (figure not reproduced here).
