Abstract: DocParser is an end-to-end document structure analysis scheme that extracts the structure of documents (scanned versions, image versions, etc.). It covers entity recognition (an entity here is any element that needs to be detected, such as text, rows, columns and cells) and relationship classification.

This article is shared from the Huawei Cloud community post “Paper interpretation series 15: Document structure analysis”; original author: a smile.

1. Abstract

An end-to-end document structure analysis scheme, DocParser, is proposed to extract the structure of documents (scanned versions, image versions, etc.), including entity recognition (an entity is any element to be detected, such as text, rows, columns and cells) and relationship classification. In addition, weakly supervised labels are generated with TeX and SyncTeX by mapping the TeX source code back to the rendered document.

2. Solutions

Given a document set D, the goal is to generate a hierarchy T for each document, where T contains the entities and the relations between them. The entities E are the various elements in the document, such as figures, tables, rows and cells, and each entity has three features: 1. a category, 2. the coordinates of its bounding box, 3. a confidence score. The relations R are given by triples (Esubj, Eobj, ψ) with a relation class ψ ∈ {parent of, followed by, NULL}, where NULL marks entities that are not related to each other, such as headers and footers.

Together, the entities E and their relations R are sufficient to reconstruct the hierarchy T of a document.
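To make the data model concrete, here is a minimal sketch of how the entities, the relation triples and the reconstructed hierarchy could be represented. The names `Entity`, `Node`, `build_hierarchy` and `order_siblings`, and the string labels `"parent_of"` and `"followed_by"`, are assumptions made for this illustration; the article does not describe the original implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Entity:
    category: str                            # e.g. "TABLE" or "TABLE ROW"
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1)
    score: float                             # confidence score


# A relation triple: (subject index, object index, relation class).
Relation = Tuple[int, int, str]


@dataclass
class Node:
    entity: Entity
    children: List["Node"] = field(default_factory=list)


def order_siblings(indices: List[int], successor: Dict[int, int]) -> List[int]:
    """Order siblings by following followed_by links inside the group."""
    group = set(indices)
    targets = {successor[i] for i in indices if successor.get(i) in group}
    ordered, seen = [], set()
    for start in indices:
        if start in targets:
            continue  # not the head of a chain
        cur = start
        while cur is not None and cur in group and cur not in seen:
            ordered.append(cur)
            seen.add(cur)
            cur = successor.get(cur)
    # Anything left over (e.g. a cycle) keeps its input order.
    ordered.extend(i for i in indices if i not in seen)
    return ordered


def build_hierarchy(entities: List[Entity], relations: List[Relation]) -> List[Node]:
    """Rebuild the tree T; assumes the parent_of relations contain no cycle."""
    children: Dict[int, List[int]] = {i: [] for i in range(len(entities))}
    has_parent = set()
    successor: Dict[int, int] = {}
    for subj, obj, psi in relations:
        if psi == "parent_of":
            children[subj].append(obj)
            has_parent.add(obj)
        elif psi == "followed_by":
            successor[subj] = obj
        # Any other class (NULL) carries no structure and is ignored.

    def make(i: int) -> Node:
        kids = order_siblings(children[i], successor)
        return Node(entities[i], [make(k) for k in kids])

    roots = [i for i in range(len(entities)) if i not in has_parent]
    return [make(i) for i in order_siblings(roots, successor)]
```

Calling `build_hierarchy(entities, relations)` returns the top-level nodes in reading order, with each node's children ordered the same way.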

Difficulties: entities that look alike, nested hierarchies, and the diversity of documents.

2.1 Image Conversion

Each input document is first converted to an image with a predefined resolution ρ, and all images are then resized to a fixed size φ (with zero padding if necessary). After that, the images are preprocessed: the RGB channels of all images are standardized in the same way as for the MS COCO data set, so that weights pre-trained on that data set can be used to initialize the model.
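The conversion steps can be sketched in plain Python as follows. This is only an illustration under several assumptions: the aspect ratio is preserved, the padding is centered, images are nested lists of RGB tuples, and the per-channel `mean` and `std` are caller-supplied parameters (the article gives neither the statistics nor the values of ρ and φ). The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[float, float, float]
Image = List[List[Pixel]]  # rows of RGB pixels


def target_geometry(width: int, height: int, phi: int):
    """Scale so the longer side equals phi; report the padding still needed."""
    scale = phi / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_w, pad_h = phi - new_w, phi - new_h
    left, top = pad_w // 2, pad_h // 2
    # Padding is returned as (left, top, right, bottom).
    return scale, (new_w, new_h), (left, top, pad_w - left, pad_h - top)


def zero_pad(image: Image, phi: int) -> Image:
    """Center an already resized image on a phi x phi canvas of zeros."""
    h, w = len(image), len(image[0])
    top, left = (phi - h) // 2, (phi - w) // 2
    out = [[(0, 0, 0)] * phi for _ in range(phi)]
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            out[top + y][left + x] = pixel
    return out


def standardize(image: Image, mean: Sequence[float], std: Sequence[float]) -> Image:
    """Subtract the per-channel mean and divide by the per-channel std."""
    return [
        [tuple((v - m) / s for v, m, s in zip(pixel, mean, std)) for pixel in row]
        for row in image
    ]
```

The resampling itself is left out; in practice an image library would resize the page to the size returned by `target_geometry` before `zero_pad` and `standardize` are applied.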

2.2 Entity Detection

Mask R-CNN is used to build the model and perform image segmentation, identifying all entities in a document image. With the images generated in the previous phase as input, the model outputs a list of entities E1, ..., Em. For each entity, Mask R-CNN determines: 1) its bounding box, 2) a confidence score, 3) a binary segmentation mask (which separates the detected entity from the background pixels inside the bounding box), 4) the category label of the entity. There are 23 categories in total: CONTENT BLOCK, TABLE, TABLE ROW, TABLE COLUMN, TABLE CELL, TABULAR, FIGURE, HEADING, ABSTRACT, EQUATION, ITEMIZE, ITEM, BIBLIOGRAPHY BLOCK, TABLE CAPTION, FIGURE GRAPHIC, FIGURE CAPTION, HEADER, FOOTER, PAGE NUMBER, DATE, KEYWORDS, AUTHOR, AFFILIATION.

2.3 Relation Classification

Relation classification is essentially a set of heuristics applied to the detected entities.

2.3.1 Nesting (parent of)

There are four steps:

  • H1: Overlaps: determine which detection boxes overlap by computing their IoU;

  • H2: Grammar Check: check the candidate pairs against the document grammar;

  • H3: Direct Children: modify the candidate list so that only direct children are retained; children of children are removed;

  • H4: Unique Parents: trim the candidate list so that each entity has only one parent.
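The four steps above can be sketched as a small pipeline. This is an illustration, not the paper's implementation: the IoU threshold `min_iou`, the rule that the larger box is the parent candidate, the grammar format (a dict from parent category to allowed child categories) and the tie-break in H4 (keep the parent with the highest IoU) are all assumptions, and the function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def iou(a: Box, b: Box) -> float:
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def nest(
    entities: List[Tuple[str, Box]],
    grammar: Dict[str, Set[str]],
    min_iou: float = 0.0,
) -> List[Tuple[int, int]]:
    """Return (parent index, child index) pairs after heuristics H1 to H4."""
    n = len(entities)
    candidates = set()
    for p in range(n):
        for c in range(n):
            if p == c:
                continue
            (p_cat, p_box), (c_cat, c_box) = entities[p], entities[c]
            # H1: boxes must overlap; the larger box is the parent candidate.
            if area(p_box) > area(c_box) and iou(p_box, c_box) > min_iou:
                # H2: the grammar must allow this parent/child category pair.
                if c_cat in grammar.get(p_cat, set()):
                    candidates.add((p, c))
    # H3: drop (p, c) when some m sits between them, i.e. p -> m -> c.
    direct = {
        (p, c)
        for (p, c) in candidates
        if not any((p, m) in candidates and (m, c) in candidates for m in range(n))
    }
    # H4: keep a single parent per child (assumed rule: highest IoU wins).
    best: Dict[int, Tuple[float, int]] = {}
    for p, c in sorted(direct):
        score = iou(entities[p][1], entities[c][1])
        if c not in best or score > best[c][0]:
            best[c] = (score, p)
    return sorted((p, c) for c, (_, p) in best.items())
```

Each returned pair corresponds to one "parent of" relation triple.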

2.3.2 Ordering (followed by)

Entities are arranged in natural reading order (e.g. from left to right). By default, all entities go through two heuristics:

  • Page Layout: entities are identified as belonging to a single-column or multi-column layout;

  • Reading Flow: the node order is reorganized according to the reading order.
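A simplified sketch of the two ordering heuristics follows. It assumes at most two columns, decides the layout by counting boxes narrower than a fraction of the page width (`narrow_ratio`, a made-up parameter), and reads column by column from top to bottom. The function names are hypothetical and the actual rules in the paper may differ.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def detect_columns(boxes: List[Box], page_width: float, narrow_ratio: float = 0.6) -> int:
    """Page layout: return 2 if most boxes are narrow, otherwise 1."""
    narrow = sum(1 for b in boxes if (b[2] - b[0]) < narrow_ratio * page_width)
    return 2 if narrow > len(boxes) / 2 else 1


def reading_order(boxes: List[Box], page_width: float) -> List[int]:
    """Reading flow: indices sorted by column, then top to bottom."""
    columns = detect_columns(boxes, page_width)

    def key(i: int):
        x0, y0, x1, _ = boxes[i]
        # In a two-column page, the box center decides the column.
        col = 0 if columns == 1 else int((x0 + x1) / 2 >= page_width / 2)
        return (col, y0, x0)

    return sorted(range(len(boxes)), key=key)


def followed_by(order: List[int]) -> List[Tuple[int, int, str]]:
    """Turn a reading order into (subject, object, relation) triples."""
    return [(a, b, "followed_by") for a, b in zip(order, order[1:])]
```

For a two-column page, the left column is read completely before the right one, and each consecutive pair in the resulting order yields one "followed by" triple.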

3. Experimental results

The results of ICDAR table structure analysis:
