Editor's note: Reading order detection is an important task in document intelligence. It aims to recombine isolated words extracted from scanned or digital business documents into text that a reader can understand, by reordering them correctly. However, because the documents used in daily work and life come in many different templates and formats, traditional rule-based ordering often breaks down on complex layouts. Researchers from the Natural Language Computing Group at Microsoft Research Asia therefore constructed ReadingBank, the first large-scale reading order dataset, and proposed LayoutReader, a reading order detection model built on it. This article briefly introduces how ReadingBank and LayoutReader work. Interested readers are welcome to click through to the original paper for more details. The paper has been accepted as a long paper at EMNLP 2021.
Reading order detection refers to extracting the words of a scanned or digital business document and reordering them into text that a reader can understand. For most electronic documents, such as web pages and Word documents, the correct reading order is easy to obtain by analyzing the source code. But many scanned documents and PDFs carry no such information. A wrong reading order not only leaves readers unable to understand the text, it also hampers document intelligence, whose goal is to extract key information from scanned or digital business documents, turn unstructured content into structured data, and enable automated document understanding.
However, existing document intelligence models still depend on the order in which document content is fed to them. If the key information is out of order, the model is likely to make mistakes or miss information. Reading order detection is therefore a crucial step in document intelligence.
Documents in daily work and life come in a wide variety of templates and formats. To extract the reading order, traditional methods either sort words directly from left to right and top to bottom or match hand-crafted templates. But with multi-column pages, tables, and other complex layouts, these methods usually fail. To handle such diverse document types, a large-scale pre-trained language model is needed that can exploit the textual content, layout positions, and other information in the document.
Figure 1: Reading order of document images in the ReadingBank dataset
Because existing datasets cannot meet the requirements of pre-training and building a new dataset by manual annotation would be too costly, researchers from the Natural Language Computing Group at Microsoft Research Asia used the XML source inside Word documents to construct ReadingBank, the first large-scale reading order dataset, and proposed the reading order detection model LayoutReader on top of it.
ReadingBank: the first large-scale reading order dataset
Current multi-modal information extraction models (LayoutLM, LayoutLMv2) typically depend on both the text content of a document and its position on the page. ReadingBank therefore consists of two parts: the text arranged in the correct reading order, and the position of that text on the page.
Document collection
Word documents come in two formats, .doc and .docx. Only .docx documents are used here, because the approach relies on their decompressed XML source. The researchers crawled 210,000 English .docx documents in total, filtered out low-quality and non-English ones using document length and a language detection API, and randomly selected 500,000 pages from them as the dataset.
Obtaining the reading order
The reading order is the order in which the document's text would correctly be read; obtaining it without manual annotation is the hard part. The researchers therefore derived it from the XML source inside each Word document: all of a document's content is recorded in the XML and appears there in the document's own reading order. They first used the open-source tool python-docx to parse the crawled .docx documents, then traversed each document paragraph by paragraph and cell by cell to obtain the reading order.
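The researchers used python-docx for this step; the standard-library sketch below illustrates the same underlying idea, namely that text runs appear in a .docx's body XML in reading order, so a simple document-order walk recovers the word sequence. The helper names here are illustrative, not the authors' code.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside .docx body XML.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def reading_sequence_from_xml(xml_bytes):
    """Collect words from a WordprocessingML body in document order.

    Text runs (w:t elements) appear in the XML in the same order a
    reader would see them, so a depth-first walk already yields the
    reading order, covering paragraphs and table cells alike.
    """
    words = []
    for t in ET.fromstring(xml_bytes).iter(W + "t"):
        if t.text:
            words.extend(t.text.split())
    return words

def reading_sequence_from_docx(path):
    """A .docx file is a zip archive; the body lives in word/document.xml."""
    with zipfile.ZipFile(path) as z:
        return reading_sequence_from_xml(z.read("word/document.xml"))
```

In practice python-docx offers a higher-level view of the same XML (`doc.paragraphs`, `doc.tables`), which is what the researchers traversed.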
Obtaining the corresponding position information
Although a Word document encodes the correct reading order, its XML source records no positions; the layout is rendered on the fly when a user opens the document. To pin the text down and obtain accurate positions, the researchers used the PDF Metamorphosis .Net tool to convert each Word document to PDF, then ran a PDF parser to get the position of the text on each PDF page.
With the reading order and the text positions in hand, the next step is to build a one-to-one mapping between the two. The obvious approach is to match on the words themselves, e.g. "MSRA" in the Word document corresponds to "MSRA" in the PDF. But when a word occurs more than once in a document, this simple correspondence breaks down.
To distinguish occurrences of the same word at different positions, a "coloring" trick is commonly used. First, each word in the reading order is given an occurrence index: 0 the first time it appears, 1 the second time, and so on. The word is then colored, with its font color determined by a bijective function C of that index, so that after conversion to PDF the parser can read the color back and invert C to recover the index. Combining the text content with the occurrence index then yields an unambiguous correspondence between the reading order and the positions extracted from the PDF.
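A minimal sketch of the coloring trick is shown below. The 24-bit RGB encoding is an illustrative choice for the bijection C, not necessarily the one the authors used.

```python
from collections import defaultdict

def color_of(index):
    """Bijective map C from an occurrence index to an RGB triple."""
    return (index >> 16 & 0xFF, index >> 8 & 0xFF, index & 0xFF)

def index_of(rgb):
    """Inverse of color_of: recover the occurrence index from a color."""
    r, g, b = rgb
    return (r << 16) | (g << 8) | b

def assign_colors(words):
    """Tag every word with its occurrence index and font color.

    The first occurrence of a word gets index 0, the second gets 1,
    and so on, so (word, index) uniquely identifies each token.
    """
    seen = defaultdict(int)
    tagged = []
    for w in words:
        idx = seen[w]
        seen[w] += 1
        tagged.append((w, idx, color_of(idx)))
    return tagged
```

After PDF conversion, reading a token's font color and applying `index_of` recovers its occurrence index, which, together with the word itself, pins down the matching position.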
Figure 2: Constructing the ReadingBank dataset by coloring the text of Word documents
This correspondence lets the positions be attached to the words of the reading order, yielding the complete dataset. The researchers split it randomly in an 8:1:1 ratio into training, validation, and test sets. The statistics of the split are shown in Table 1 below; the three parts are well balanced, so downstream tasks do not suffer from data imbalance.
Table 1: The random split of the dataset (Avg. BLEU is the BLEU score of the left-to-right, top-to-bottom order against the ReadingBank ground truth; ARD is the average relative distance, which measures the difficulty of the data)
The reading order detection model LayoutReader
The researchers built LayoutReader on a Seq2Seq model and pre-trained it on ReadingBank. The model's input is the word sequence of a page arranged from left to right and top to bottom, and the target sequence is the reading order provided by ReadingBank. (Click through to the original paper for details.)
Encoder
To exploit the layout information, the researchers used LayoutLM as the encoder. The input sequence and the target sequence are concatenated, and an attention mask controls which positions are visible to each position, turning the encoder into a Seq2Seq model.
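The sketch below builds such a mask in the style of UniLM's Seq2Seq attention: source positions attend bidirectionally over the source, while target positions attend to the full source plus earlier target positions. The exact mask layout in LayoutReader may differ in detail.

```python
def seq2seq_mask(src_len, tgt_len):
    """Build a UniLM-style Seq2Seq attention mask.

    mask[i][j] == 1 means query position i may attend to key position j.
    Source tokens see the whole source (bidirectional); target tokens
    see the whole source plus earlier target tokens (causal).
    """
    n = src_len + tgt_len
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < src_len:
                mask[i][j] = 1          # every position sees the source
            elif i >= src_len and j <= i:
                mask[i][j] = 1          # causal attention within the target
    return mask
```

Applied inside self-attention, a 0 entry is replaced by a large negative value before the softmax, so masked positions receive no attention weight.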
Decoder
Since the input and target sequences contain the same words in different orders, the researchers modified the decoder to predict the next word from the input sequence rather than from the vocabulary; that is, the decoder predicts the index of the next word within the input sequence.
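This is a pointer-style decoding step. The toy sketch below uses a plain dot product to score input positions and greedy decoding to emit a permutation; the real model scores positions with learned attention over LayoutLM states, so everything here is illustrative.

```python
def pointer_step(dec_state, enc_states, used):
    """Score each input position and point at the best unused one.

    Instead of scoring a whole vocabulary, the decoder scores each
    position of the input sequence (a dot product here for simplicity)
    and returns the index of the highest-scoring unused position.
    """
    scores = []
    for j, h in enumerate(enc_states):
        s = sum(a * b for a, b in zip(dec_state, h))
        scores.append(float("-inf") if j in used else s)
    return max(range(len(scores)), key=lambda j: scores[j])

def decode_order(enc_states):
    """Greedy decode: repeatedly point at the next input position."""
    used, order = set(), []
    dec_state = enc_states[0]  # toy initialization; real models learn this
    for _ in range(len(enc_states)):
        j = pointer_step(dec_state, enc_states, used)
        used.add(j)
        order.append(j)
        dec_state = enc_states[j]  # condition on the last chosen word
    return order
```

Because each step emits an index into the input, the output is guaranteed to be a permutation of the input words, which is exactly the structure a reading order must have.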
Experiments and comparisons
Baseline models
The researchers compared LayoutReader with a heuristic method, text-only methods, and a layout-only method:
- Heuristic method: sort the text from left to right and top to bottom.
- Text-only methods: replace the LayoutLM encoder in LayoutReader with a text-only encoder such as BERT or UniLM, so the model cannot use layout information for its predictions.
- Layout-only method: remove the word embeddings from LayoutReader's LayoutLM encoder, so the model cannot use textual information.
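The heuristic baseline can be sketched as follows: sort word boxes by their vertical position, group boxes with similar y-coordinates into lines, then sort each line by x. The `line_tol` threshold is an illustrative parameter, not something specified in the paper.

```python
def heuristic_order(words, line_tol=5):
    """Left-to-right, top-to-bottom heuristic baseline.

    words: list of (text, x, y) with the top-left corner of each box.
    Boxes whose y-coordinates differ by less than line_tol are treated
    as belonging to the same line.
    """
    ordered = sorted(words, key=lambda w: w[2])  # top to bottom first
    lines, current = [], [ordered[0]]
    for w in ordered[1:]:
        if abs(w[2] - current[-1][2]) < line_tol:
            current.append(w)       # same visual line
        else:
            lines.append(current)   # start a new line
            current = [w]
    lines.append(current)
    result = []
    for line in lines:
        result.extend(sorted(line, key=lambda w: w[1]))  # left to right
    return [w[0] for w in result]
```

This is exactly the ordering that fails on multi-column pages: a heuristic line can span two columns, interleaving their words.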
Evaluation metrics
The researchers measured the models with two metrics:
- Average page-level BLEU: BLEU is a common metric for sequence generation; it is computed by comparing the output sequence with the target sequence provided by ReadingBank.
- Average Relative Distance (ARD): since the model's output contains the same words as the target sequence and only their order differs, the model can be evaluated by how far each word ends up from its target position. ARD also adds a penalty when a word is dropped during generation. In the definition, A is the output sequence, B is the target sequence, e_k is the k-th word of A, and I(e_k, B) is the index of e_k in B.
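The formula itself is not reproduced here, but the description above can be sketched in code: sum the distance |k - I(e_k, B)| for each predicted word, add a penalty for target words missing from the output, and normalize by the target length. The penalty of len(target) per missing word is an assumption; the paper's exact penalty term may differ.

```python
def ard(pred, target):
    """Average Relative Distance between predicted and target sequences.

    Reconstructed from the metric's description: each predicted word at
    position k is compared with its position I(e_k, target) in the
    target, and target words missing from the prediction incur a
    penalty (len(target) each, an illustrative choice). Assumes words
    are unique, e.g. already tagged with occurrence indices.
    """
    pos = {w: i for i, w in enumerate(target)}
    total = 0.0
    for k, w in enumerate(pred):
        if w in pos:
            total += abs(k - pos[w])
    missing = sum(1 for w in target if w not in set(pred))
    total += len(target) * missing
    return total / len(target)
```

A perfect ordering scores 0, and lower is better, which is why the tables below report ARD reductions as improvements.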
Reading order detection results
The model's input is the word sequence sorted from left to right and top to bottom. Comparing the outputs against ReadingBank, LayoutReader, which combines textual and layout information, achieved the best results: against the common heuristic baseline, it improved the average page-level BLEU by 0.2847 and reduced the average relative distance by 6.71. Even with the text or the layout modality removed, the model still improved on the heuristic, gaining 0.16 and 0.27 in average page-level BLEU respectively. The average relative distance of the text-only models did get worse, however, mainly because of ARD's penalty for missing words. Comparing the text-only and layout-only variants also shows that layout information plays the more important role: the layout-only model achieves higher BLEU than the text-only ones and improves the average relative distance by around 9.0.
Table 2: Results of the LayoutReader model on the ReadingBank dataset (input order is left to right, top to bottom)
Study of the input order
The experiments above all use left-to-right, top-to-bottom input. Because this matches human reading habits, it gives the model a strong hint for generation. The input order study examines how the input order during training and testing affects the results. The researchers designed two settings:
- During training, a portion of the samples is shuffled instead of being fed left to right and top to bottom, while testing keeps the left-to-right, top-to-bottom input.
- Going further, a portion of the training samples is shuffled as before, but at test time every input is a shuffled word sequence.
The results are as follows (r is the proportion of shuffled training samples):
Table 3: A portion of the samples is shuffled during training; testing keeps the left-to-right, top-to-bottom order
Table 4: A portion of the samples is shuffled during training; all samples are shuffled during testing
The results show that, compared with the baselines in the first three rows, LayoutReader's combination of textual and layout information effectively withstands shuffled input, maintaining good results in almost every setting. The second experiment shows that when no training samples are shuffled (r = 0%), i.e. training keeps the left-to-right, top-to-bottom order but testing uses shuffled samples, the results drop sharply. The researchers attribute this to overfitting the left-to-right, top-to-bottom order during training, which leaves the model unable to cope with unfamiliar, shuffled input. This is consistent with the earlier results and shows that layout information provides the stronger guidance for reading order.
Application to OCR
Modern OCR already recognizes text well, but it pays no attention to the order of the recognized text. The researchers therefore used LayoutReader to sort OCR text lines into reading order: the bounding box of each text line is intersected with the bounding box of each word, each word is assigned to the text line with the largest intersection, each line is then ranked by the smallest order index among the words it contains, and the resulting line order is compared with the order in ReadingBank. Using Tesseract, an open-source OCR engine, and a commercial OCR engine, the researchers obtained the following results:
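The line-ordering procedure just described can be sketched as follows. Boxes are (x0, y0, x1, y1) tuples, and the function names are illustrative.

```python
def intersection_area(a, b):
    """Overlap area of two (x0, y0, x1, y1) boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def order_ocr_lines(line_boxes, word_boxes, word_order):
    """Sort OCR text lines by the reading order of the words they contain.

    line_boxes: bounding box of each OCR text line.
    word_boxes: bounding box of each word, indexed by word id.
    word_order: word ids in reading order (e.g. from LayoutReader).
    Each word is assigned to the line it overlaps most; each line is
    then ranked by the smallest reading-order rank among its words.
    """
    rank = {w: i for i, w in enumerate(word_order)}
    best_rank = [float("inf")] * len(line_boxes)
    for wid, wbox in enumerate(word_boxes):
        li = max(range(len(line_boxes)),
                 key=lambda i: intersection_area(line_boxes[i], wbox))
        best_rank[li] = min(best_rank[li], rank[wid])
    return sorted(range(len(line_boxes)), key=lambda i: best_rank[i])
```

The returned list gives line indices in predicted reading order, which is what gets compared against the ReadingBank ground truth.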
Table 5: Comparison on Tesseract OCR text lines
Table 6: Comparison on a commercial OCR engine
The results show that LayoutReader improves the line order of both OCR engines; even against the commercial engine it still raises BLEU by about 0.1 and lowers ARD by about 2.
Future work
Looking ahead, the researchers at Microsoft Research Asia plan to build on ReadingBank by introducing more noise and transformations such as rotation, making reading order models robust in more scenarios. On top of the large-scale ReadingBank, they also plan to add a small amount of manual annotation in specific domains to support more fine-grained applications of reading order detection.
(Natural Language Computing Group of Microsoft Research Asia)