Abstract: Extracting key information from document images is very important in office automation applications. Traditional methods based on template matching or rules generalize poorly and perform badly on layout templates they have not seen. Therefore, this paper proposes an end-to-end spatial multimodal graph reasoning model (SDMG-R), which can effectively extract key information from unseen template data and has better generality.

This article is shared from the Huawei Cloud community post "Paper Interpretation Series 12: SDMG-R Structured Extraction – Application in Unlimited-Format Receipt Scenarios", author: a smile.

Source: github.com/open-mmlab/…

1. Background

Extracting key information from document images is critical in office automation applications, such as rapid automated filing of archived documents, receipts, credit forms and other data, compliance checks, and so on. Traditional methods based on template matching or rules mainly rely on the layout, position coordinates and content rules of fixed-layout template data. This information is strongly limiting, so these methods perform poorly in terms of generality, especially on unseen layout templates. Therefore, this paper proposes an end-to-end spatial multimodal graph reasoning model (SDMG-R), which makes full use of the position layout, semantic and visual information of the detected text areas. Because this information is richer and more sufficient than what previous methods used, the model can effectively extract key information from template data it has never seen, and has better generality.

2. Innovative methods and highlights

2.1 Data

In previous key information extraction tasks, the most commonly used datasets are SROIE and IEHHR, but their training and test sets share many template formats, so they are not suitable for evaluating or verifying the generalization ability of a general information extraction model. For this reason, this paper constructs a new dataset for the key information extraction task, named WildReceipt: it consists of 25 categories and has about 50,000 text areas, more than twice the data volume of SROIE. The details are shown in Table 2-1 below:

Table 2-1 Data sets of key information extraction tasks

2.2 Innovation points and contributions

The proposed SDMG-R performs well on both the SROIE and WildReceipt datasets and outperforms previous models. The authors also conducted ablation experiments and verified that the spatial relationship information and multi-modal features proposed in this paper have a very important impact on key information extraction. The specific innovations and contributions are as follows:

  • An effective spatial multimodal graph reasoning network (SDMG-R) is proposed, which makes full use of the semantic, visual and spatial relationship information of text regions.

  • A baseline dataset (WildReceipt) is constructed, which is twice the size of SROIE and whose training-set and test-set layout templates have little overlap, so it can be used to study general key information extraction.

  • The paper uses both visual and semantic features and verifies how best to combine them, comparing the effectiveness of three feature fusion methods (concatenation, linear summation, and Kronecker product); the Kronecker product is about two points higher than the other two fusion methods, as shown in Table 2-2 below:

Table 2-2 Comparison results of feature fusion methods

3. Network structure

The overall network structure of the SDMG-R model is shown in Figure 3-1 below. The model's inputs are the image, the coordinates of the detected text areas, and the text content of each text area. Visual features are extracted with U-Net and ROI pooling, and semantic features are extracted with a Bi-LSTM. The semantic and visual features are then fused into multi-modal features via the Kronecker product and fed into the spatial multi-modal reasoning module to obtain the final node features. Finally, the classification module performs the multi-class classification task.

Figure 3-1 Network structure of SDMG-R

3.1 Detailed steps of visual feature extraction

A. Input the original image and resize it to a fixed input size (512×512 in this paper);

B. Feed the resized image into U-Net, which serves as the visual feature extractor, to obtain the feature map of the last CNN layer;

C. Map the text area coordinates from the input image onto the last CNN feature map, and extract the visual features of the corresponding text areas by ROI pooling;
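As a minimal illustration of steps A-C (not the paper's exact implementation: the tiny backbone below stands in for U-Net, and the 7×7 ROI output size and channel numbers are assumptions), the visual feature of a text region can be extracted roughly as follows:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class TinyVisualBackbone(nn.Module):
    """Stand-in for the U-Net visual feature extractor (illustrative only)."""
    def __init__(self, out_channels=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_channels, 3, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.conv(x)            # (B, 256, H/4, W/4)

image = torch.randn(1, 3, 512, 512)    # A. image resized to the fixed 512x512 input
backbone = TinyVisualBackbone()
feat_map = backbone(image)             # B. last CNN feature map

# C. map text-box coordinates (given in input-image pixels) onto the feature
#    map and pool a fixed-size visual feature for each text region.
boxes = torch.tensor([[0., 40., 60., 200., 90.]])   # (batch_idx, x1, y1, x2, y2)
roi_feats = roi_pool(feat_map, boxes, output_size=(7, 7), spatial_scale=1 / 4)
visual_feat = roi_feats.flatten(1)     # one visual feature vector per text region
print(visual_feat.shape)               # (1, 256*7*7)
```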

3.2 Detailed steps of text semantic feature extraction

A. First, build the character table. This paper collects a character set of 91 characters, including digits (0-9), letters (a-z, A-Z), and task-related special characters (such as "/", ".", "$", ":", "-", "*", "#", etc.); characters not in the table are uniformly marked as "unknown";

B. Next, the text content is mapped character by character into 32-dimensional one-hot encodings as the semantic input;

C. Feed the encodings into the Bi-LSTM model to extract 256-dimensional semantic features;
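A minimal sketch of steps A-C, assuming PyTorch; the character table here is a toy subset of the article's 91-character set, and taking the last Bi-LSTM output as the region's semantic feature is an assumed simplification:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A. toy character table (the article's full table has 91 characters plus "unknown")
charset = list("0123456789abcdefghijklmnopqrstuvwxyz$/:-*#.")
char2idx = {c: i for i, c in enumerate(charset)}
unk = len(charset)                      # index reserved for "unknown" characters

def one_hot_encode(text):
    # B. map each character to a one-hot vector over the character table
    idx = torch.tensor([char2idx.get(c, unk) for c in text.lower()])
    return F.one_hot(idx, num_classes=len(charset) + 1).float()

# C. Bi-LSTM over the character sequence; 128 hidden units per direction
#    give a 256-dimensional semantic feature per text region.
lstm = nn.LSTM(input_size=len(charset) + 1, hidden_size=128,
               bidirectional=True, batch_first=True)
tokens = one_hot_encode("total $18.00").unsqueeze(0)   # (1, seq_len, vocab)
outputs, _ = lstm(tokens)
semantic_feat = outputs[:, -1, :]       # (1, 256) semantic feature for the region
print(semantic_feat.shape)
```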

3.3 Steps of visual + text semantic feature fusion

A. Multi-modal feature fusion: the visual and semantic features of each text region are fused through the Kronecker product.
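The original fusion formula is not reproduced here. As a rough sketch of what Kronecker-product fusion looks like (an outer product of the visual and semantic vectors followed by a learned projection; all dimensions below are illustrative assumptions, not the paper's settings):

```python
import torch
import torch.nn as nn

visual_feat = torch.randn(1, 64)     # visual feature of one text region (assumed dim)
semantic_feat = torch.randn(1, 64)   # semantic feature of the same region (assumed dim)

# Kronecker (outer) product of the two modality vectors: every visual
# component interacts with every semantic component.
fused = torch.einsum('bi,bj->bij', visual_feat, semantic_feat).flatten(1)  # (1, 64*64)

# Project the high-dimensional fused feature back to a compact node feature.
project = nn.Linear(64 * 64, 256)
node_feat = project(fused)           # (1, 256) multi-modal node feature
print(node_feat.shape)
```

Concatenation and linear summation, the two simpler alternatives compared in Table 2-2, would replace the outer product with torch.cat or element-wise addition of the two vectors.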

3.4 Spatial-relation multi-modal graph reasoning module

The final node features are produced by the multi-modal graph reasoning module, which refines each node's features using the spatial relations between text regions.
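The original reasoning formula is likewise not reproduced. The sketch below only conveys the general idea of graph message passing, in which node features are aggregated from neighbours with weights computed from pairwise spatial relations; the concrete update rule, edge parameterization and number of iterations used by SDMG-R are not shown here:

```python
import torch
import torch.nn as nn

N, D = 4, 256                         # N text-region nodes, D-dim node features (assumed)
nodes = torch.randn(N, D)             # fused multi-modal node features
boxes = torch.rand(N, 4)              # (x, y, w, h) of each text region, normalized

# Pairwise spatial relations: here just relative centre offsets, standing in
# for the paper's richer spatial relation features.
dx = boxes[:, None, 0] - boxes[None, :, 0]
dy = boxes[:, None, 1] - boxes[None, :, 1]
rel = torch.stack([dx, dy], dim=-1)                           # (N, N, 2)

edge_mlp = nn.Linear(2, 1)            # score each directed edge from its spatial relation
attn = torch.softmax(edge_mlp(rel).squeeze(-1), dim=-1)       # (N, N) edge weights

# One message-passing step: each node aggregates its neighbours' features.
update = nn.Linear(D, D)
nodes = torch.relu(nodes + attn @ update(nodes))              # refined node features
print(nodes.shape)                    # (4, 256)
```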

3.5 Multi-class classification module

The node features obtained from the graph reasoning module are fed into the classification module, and the final entity category of each node is output through the multi-class classification task. The loss function is the cross-entropy loss.
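The loss formula image is not reproduced here; the standard multi-class cross-entropy over the N text-region nodes (with y_{i,c} the one-hot ground-truth label and p_{i,c} the predicted probability of node i for class c) has the form:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log p_{i,c}$$

Whether the paper averages over nodes exactly this way or adds auxiliary terms is not shown here.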

4. Experimental results

The results on the SROIE dataset are shown in Table 4-1 below:

Table 4-1 Accuracy of SROIE

The results on the WildReceipt test set are shown in Table 4-2:

Table 4-2 Accuracy of WildReceipt
