
Introduction: This article briefly describes how Baidu Library transcodes and displays documents of various formats. The early layout data gave a good reading experience for all kinds of documents on PC. As business requirements evolved, the reading experience on mobile devices needed to improve. In converting layout data to streaming data, a simple content structure was enough to reflow PDF data on mobile. Parsing the underlying OOXML data yields a more detailed content structure and a much better mobile reflow of Word documents. Structured metadata extracted from chart pictures "from scratch" opens up new possibilities for interaction between users and documents.

The full text contains 3724 words and takes an estimated reading time of 9 minutes.

I. Display of various documents in Baidu Library

The library hosts billions of documents in more than a dozen common office formats, including Word, PPT, Excel and PDF. Its core basic service is document transcoding and presentation.

To unify transcoding and presentation across more than a dozen formats without relying on the software that natively opens each one, the plan settled on after technical investigation is as follows: transcode every document to PDF, parse the openly specified PDF data format, process the result into the library's own document format, and then typeset and render it on PC and mobile.

PC rendering uses XReader layout data derived from the PDF. In layout data, every element (text, image) carries its coordinates, its width and height, and other descriptive information. Each text fragment, image and other vector element is drawn on the page at its coordinates. Layout data is therefore well suited to showing all kinds of documents at their original proportions on PC, and it reproduces the original layout faithfully.

Mobile screens are generally small. If the layout data is scaled down proportionally, the text and formulas across the page become too small to read comfortably, as shown in Figure 1. The user can zoom in, but that clearly adds to the cost of operation.

Figure 1. Layout data scaled down proportionally on a mobile device

An ideal solution is to convert the layout data into streaming data, which can be re-typeset for different mobile screen sizes. Unlike layout data, where every element has coordinates on the current page, streaming data has no coordinates. Instead it carries structural information such as chapters, sections, paragraphs, formulas and tables. These data structures organize the most basic text, image and table information into structured document content that can be reflowed adaptively on any screen size.
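The contrast between the two kinds of data can be sketched as follows. This is only an illustration: the class names `LayoutElement` and `Paragraph` and the helper `render_lines` are hypothetical and are not the library's actual formats. The point is that a layout element is pinned to page coordinates, while a streaming paragraph can be re-wrapped for whatever width the screen offers.

```python
import textwrap
from dataclasses import dataclass


@dataclass
class LayoutElement:
    """Layout data: an element pinned to a fixed position on the page."""
    x: float
    y: float
    w: float
    h: float
    text: str


@dataclass
class Paragraph:
    """Streaming data: structure only, no coordinates."""
    text: str


def render_lines(paragraph, chars_per_line):
    """Wrap a paragraph for a screen that fits chars_per_line characters."""
    return textwrap.wrap(paragraph.text, width=chars_per_line)


p = Paragraph("Layout data fixes positions; streaming data reflows.")
narrow = render_lines(p, 20)   # small phone screen: more, shorter lines
wide = render_lines(p, 40)     # larger screen: fewer, longer lines
```

The same `Paragraph` produces a different number of lines for each width, which is exactly what a fixed set of `LayoutElement` coordinates cannot do.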

II. Technical exploration of document content structuring

2.1 Reflowed streaming data (based on XReader layout data)

The library's early "layout to stream" scheme traverses each element in the XReader layout data and extracts its coordinates x, y and its width and height w, h. Elements with close y values are treated as belonging to the same line. Within a line, adjacent text elements are spliced together according to x and w, and adjacent text and images are joined. This yields a line data structure for every line on the current page. Adjacent lines are then concatenated into paragraphs according to the y and h of each line. The end of a paragraph can be detected from signals such as: x + w of the current line is less than the page width, the line ends with a sentence-ending punctuation mark, or the x of the next line shows a first-line indent.
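The line grouping and paragraph splitting described above might look roughly like the sketch below. The `Element` class, the tolerance values and the exact way the three end-of-paragraph signals are combined are assumptions made for illustration; the article does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Element:
    x: float
    y: float
    w: float
    h: float
    text: str


def group_into_lines(elements, y_tol=2.0):
    """Group elements with close y values into lines, ordered left to right."""
    lines = []
    for el in sorted(elements, key=lambda e: (e.y, e.x)):
        if lines and abs(lines[-1][0].y - el.y) <= y_tol:
            lines[-1].append(el)
        else:
            lines.append([el])
    return [sorted(line, key=lambda e: e.x) for line in lines]


def line_text(line):
    return "".join(e.text for e in line)


def split_paragraphs(lines, content_left, content_right,
                     indent=10.0, slack=5.0, enders="。.!?！？"):
    """Concatenate lines into paragraphs.

    Assumed rule: a paragraph ends when the line is short AND ends with
    closing punctuation, or when the next line starts with an indent.
    """
    paragraphs, current = [], []
    for i, line in enumerate(lines):
        current.append(line_text(line))
        nxt = lines[i + 1] if i + 1 < len(lines) else None
        right = max(e.x + e.w for e in line)
        is_short = right < content_right - slack
        ends_sentence = line_text(line).rstrip().endswith(tuple(enders))
        next_indented = (nxt is not None
                         and min(e.x for e in nxt) >= content_left + indent)
        if nxt is None or (is_short and ends_sentence) or next_indented:
            paragraphs.append(" ".join(current))
            current = []
    return paragraphs


elements = [
    Element(60, 10.5, 40, 10, "world,"),
    Element(20, 10, 40, 10, "Hello "),
    Element(10, 25, 50, 10, "the end."),
    Element(20, 40, 80, 10, "Next paragraph"),
]
lines = group_into_lines(elements)
paragraphs = split_paragraphs(lines, content_left=10, content_right=100)
```

Lines are joined with a space here, which suits English text; for Chinese text an empty joiner would be the natural choice.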

That is the general idea of the "layout to stream" scheme. When a document's layout is more complex, for example multiple columns, text wrapped around images, tables, or the footnotes and endnotes found in papers and literature, the page must first be preprocessed with range identification, which analyzes the whole page and cuts it into multiple range structures. Within each range, the general "layout to stream" scheme gives better results.

This scheme extracts structural information such as paragraphs and lines from the layout data, which helps streaming typesetting. In some cases, however, the extracted structure is not fully accurate: paragraphs may be broken by forced line breaks, or inline images may be positioned incorrectly. Its ability to extract complex structures such as formulas, charts and tables is also weak.

PDF documents only have the coordinates of elements relative to the page and lack structural information about the content. Office documents such as Word documents do carry structural information in the source file, but it is lost when Word is transcoded to PDF. Extracting structural information directly from Word documents, which make up a large share of the library, therefore became an important goal for improving streaming typesetting on mobile.

2.2 BDJson streaming data (based on OOXML data)

Microsoft Office has a long history and Word has many versions, which fall broadly into two formats: the binary compound document format DOC and the OOXML document format DOCX. The DOC binary format is complex and Microsoft-proprietary, so parsing and transcoding it is costly. To simplify the scheme, DOC is first converted to DOCX. The core of the scheme is then to parse the DOCX format, transcode it, and produce streaming data in BDJson format.

OOXML is an open standard built on ZIP + XML. Ordinary text, character properties and paragraph properties are easy to read and parse, and the format carries structural information such as chapters, paragraphs and tables, which is convenient for streaming typesetting. Given the requirements of this typesetting work, and with future online editing of Word documents in mind, the scheme is designed to parse documents accurately at the semantic level, extract content and properties, and build Office data structures.

Data structures such as chapters and paragraphs follow the OOXML standard and can be assembled into the corresponding data structures once the data in document.xml has been parsed. For some regions, document.xml stores only indexes and basic information. The actual content has to be fetched from other XML files, assembled according to the index relationships, and inserted at the right position in the body text.
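As a rough illustration of this step, the sketch below reads the top-level paragraphs from `word/document.xml` inside a DOCX package and loads the relationship table from `word/_rels/document.xml.rels`, which maps the ids referenced in the body to other parts of the package. The function name `load_document` and the returned dictionary shape are assumptions for this example and are not the BDJson format; only the Python standard library is used.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
REL_PATH = "word/_rels/document.xml.rels"


def load_document(docx_bytes):
    """Return (paragraphs, relationships) for a DOCX file given as bytes."""
    ns = {"w": W_NS}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
        rels_xml = zf.read(REL_PATH) if REL_PATH in zf.namelist() else None

    paragraphs = []
    for p in root.iterfind("w:body/w:p", ns):   # top-level paragraphs only
        style = p.find("w:pPr/w:pStyle", ns)
        paragraphs.append({
            "style": style.get(f"{{{W_NS}}}val") if style is not None else None,
            "text": "".join(t.text or "" for t in p.iterfind(".//w:t", ns)),
        })

    # Relationship id -> target part, e.g. "rId1" -> "media/image1.png".
    relationships = {}
    if rels_xml is not None:
        for rel in ET.fromstring(rels_xml).iterfind("r:Relationship", {"r": REL_NS}):
            relationships[rel.get("Id")] = rel.get("Target")
    return paragraphs, relationships
```

A fuller parser would also walk tables, follow each relationship id to its target part and splice that content back into the paragraph stream; this sketch stops at collecting the two pieces that make that possible.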

Some data structures need adaptation because the Office structure and the HTML structure differ. Take bullets and numbering: Word allows up to 9 levels, and each level has its own character properties, paragraph properties, tab settings, picture bullets and so on, all of which must be mapped compatibly onto the simple ol and ul structures of HTML. Row spans, column spans and the hiding of merged cells in tables are also represented very differently in Office and HTML. The whole table has to be traversed, and compatible transcoding is done after the spans have been calculated and converted.
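The merged-cell conversion can be sketched as follows. In WordprocessingML, a horizontal merge is a `gridSpan` on one cell, while a vertical merge marks the first cell with `vMerge` "restart" and keeps a placeholder cell in every following row. HTML instead puts `rowspan` on the first cell and omits the covered cells. The input here is a simplified, hypothetical model (lists of dicts with `grid_span` and `v_merge` keys) rather than real parsed XML.

```python
from html import escape


def word_table_to_html_cells(rows):
    """Convert Word-style merge markers into HTML colspan/rowspan values."""
    open_merges = {}   # grid column -> output cell where a vertical merge began
    html_rows = []
    for row in rows:
        out_row, col = [], 0
        for cell in row:
            span = cell.get("grid_span", 1)
            v_merge = cell.get("v_merge")
            if v_merge == "continue" and col in open_merges:
                # Placeholder cell: hide it and extend the cell above.
                open_merges[col]["rowspan"] += 1
            else:
                out = {"text": cell.get("text", ""), "colspan": span, "rowspan": 1}
                out_row.append(out)
                if v_merge == "restart":
                    open_merges[col] = out
                else:
                    open_merges.pop(col, None)
            col += span
        html_rows.append(out_row)
    return html_rows


def render_html(html_rows):
    parts = ["<table>"]
    for row in html_rows:
        parts.append("<tr>")
        for c in row:
            attrs = ""
            if c["colspan"] > 1:
                attrs += f' colspan="{c["colspan"]}"'
            if c["rowspan"] > 1:
                attrs += f' rowspan="{c["rowspan"]}"'
            parts.append(f"<td{attrs}>{escape(c['text'])}</td>")
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)


rows = [
    [{"text": "A", "v_merge": "restart"}, {"text": "B", "grid_span": 2}],
    [{"text": "", "v_merge": "continue"}, {"text": "C"}, {"text": "D"}],
]
table_html = render_html(word_table_to_html_cells(rows))
```

Tracking merges by grid column rather than by cell index is what keeps the two models aligned when a row contains cells of different widths.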

In addition, the extraction and transcoding of some data structures was designed with online editing in mind. For example, the several kinds of formula data that Word supports (field formulas, MathType formulas and OMath formulas) are all transcoded into LaTeX. This not only makes editing easier but also lets formulas match the font and size of the body text, giving the whole page a more unified look.

With these technical solutions in place, the structural information in Word documents is extracted and the existing document transcoding and presentation pipeline is optimized, as shown in Figure 2. The structured document content lets Word documents be typeset adaptively as a stream on mobile, which greatly improves the display, as shown in Figure 3.

Figure 2. Document transcoding and presentation (layout, streaming)

Figure 3. Streaming typesetting of documents on mobile, with formulas rendered from LaTeX


2.3 Extracting structured data from chart pictures (or PDF data)

Certain types of PDF documents, such as papers, journals and financial research reports, often contain charts, which generally appear as unstructured PDF data, pictures or background images. Extracting the chart information and importing its metadata into Excel lets users edit it, examine it and generate new charts, which has great product value.

Existing tools generally require the user to manually screenshot the range where the chart sits in the document, then manually select the origin of the axes, enter information such as the axis scales, and carry out a series of tedious operations such as tracing the chart. Even then, the accuracy of the extracted data is not high.

The technical solution for extracting structured metadata from chart pictures or unstructured PDF data can be simplified into two modules: range identification and metadata extraction.

2.3.1 Range identification

Taking a PDF document as an example, first traverse all elements on the page and collect the boxes of text fragments (spans), images and other elements. Merge the original spans by y and x into larger fragments, then aggregate them into lines. The validity of each line area is judged from information such as the amount and position of its text, and some lines can be discarded.

Next, search the remaining space for empty areas as range candidates. Read the page settings to determine the bounds of the page content. Scan from top to bottom, identifying blank ranges that span the whole line. If a line has free space at both ends, add two ranges. If the current line intersects an existing range, the intersecting part is removed and the original range is cut into multiple new ranges. This produces the set of range candidate areas, drawn in purple in Figure 4.
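A heavily simplified version of this top-to-bottom scan is sketched below. It assumes a single-column page in which each text line is one box and lines do not overlap vertically, so it does not need the "cut an existing range" step that the full method describes. The `Box` type, the `candidate_ranges` function and the `min_size` threshold are hypothetical names chosen for the example.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def candidate_ranges(content, lines, min_size=5.0):
    """Collect empty rectangles inside `content` that text lines do not cover."""
    candidates = []
    cursor = content.top   # bottom edge of the area scanned so far
    for line in sorted(lines, key=lambda b: b.top):
        # Full-width blank band between the previous line and this one.
        if line.top - cursor >= min_size:
            candidates.append(Box(content.left, cursor, content.right, line.top))
        # Free space to the left and right of the line itself.
        if line.left - content.left >= min_size:
            candidates.append(Box(content.left, line.top, line.left, line.bottom))
        if content.right - line.right >= min_size:
            candidates.append(Box(line.right, line.top, content.right, line.bottom))
        cursor = max(cursor, line.bottom)
    # Blank band between the last line and the bottom of the content area.
    if content.bottom - cursor >= min_size:
        candidates.append(Box(content.left, cursor, content.right, content.bottom))
    return candidates


content = Box(0, 0, 100, 100)
text_lines = [Box(0, 0, 100, 10), Box(0, 12, 40, 22), Box(0, 80, 100, 90)]
found = candidate_ranges(content, text_lines)
```

In this example the large gap between the second and third lines becomes one full-width candidate, and the short second line contributes a candidate to its right.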

Figure 4. Set of range candidate areas

Traverse the set of range candidates and merge adjacent ranges according to their position, width and height to obtain a new set of ranges, as shown in Figure 5.

Figure 5. Set of range candidate areas (after merging)

Filter the ranges (by rectangle size, position, the text lines before and after, the amount of OCR text and other information) and trim the white margins from the edges of each range to obtain the final valid ranges, as shown in Figure 6.

Figure 6. Set of range candidate areas (filtered)
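One possible merge rule is sketched below: two candidate rectangles are combined when they touch and share the same horizontal extent (stacked vertically) or the same vertical extent (side by side). This rule, the tolerance `tol` and the names `try_merge` and `merge_ranges` are assumptions for illustration; the article only says that position, width and height are used.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def try_merge(a: Box, b: Box, tol: float = 1.0) -> Optional[Box]:
    """Return the union of two aligned, touching boxes, or None."""
    same_columns = abs(a.left - b.left) <= tol and abs(a.right - b.right) <= tol
    same_rows = abs(a.top - b.top) <= tol and abs(a.bottom - b.bottom) <= tol
    touch_vertically = a.top <= b.bottom + tol and b.top <= a.bottom + tol
    touch_horizontally = a.left <= b.right + tol and b.left <= a.right + tol
    if (same_columns and touch_vertically) or (same_rows and touch_horizontally):
        return Box(min(a.left, b.left), min(a.top, b.top),
                   max(a.right, b.right), max(a.bottom, b.bottom))
    return None


def merge_ranges(boxes, tol: float = 1.0):
    """Repeatedly merge pairs of boxes until no pair can be merged."""
    boxes = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                merged = try_merge(boxes[i], boxes[j], tol)
                if merged is not None:
                    boxes[i] = merged
                    del boxes[j]
                    changed = True
                    break
            if changed:
                break
    return boxes


result = merge_ranges([Box(0, 22, 100, 50), Box(0, 50, 100, 80), Box(40, 12, 100, 22)])
```

Restarting the scan after every merge is quadratic but simple, which is acceptable for the handful of candidates found on a single page.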

2.3.2 Metadata extraction

The range identification module produces a set of ranges on which metadata extraction can then be carried out. Not every range is a chart; it may be an ordinary picture, a flow chart and so on.

First, using the range information, the picture corresponding to the range is cropped from the current page. Image analysis then makes a preliminary decision on whether it is a chart picture and a preliminary classification of the chart type, such as bar chart or pie chart, as shown in Figure 7.

Figure 7. A chart in the form of a picture

Taking the bar chart as an example, after preprocessing with pixel analysis and an edge extraction operator, candidate lines for the x axis and y axis are identified and then filtered by length, position and other information. This gives the final x and y axes, which form the coordinate system. The tick marks on the axes are then scanned. At this stage there is a lot of interference and the error may be large, so missing tick marks are filled in on the axes through pixel comparison and cross-validation.

A complete and correct coordinate system is very important for the subsequent extraction of chart metadata. Based on the coordinate system, the whole picture can be cut into multiple subranges, and the small image in each subrange is run through OCR to obtain its text. After a series of data correction and recombination steps, the data items and data points of the chart can be assembled into the metadata of the whole chart, as shown in Figure 8.

Figure 8. Metadata extracted from the chart picture
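The axis search can be illustrated with a toy version that works on an already binarized image (a non-empty list of rows of 0 and 1, where 1 is a dark pixel). It skips the edge extraction step and assumes the axes sit at the bottom and left of the chart: the x axis is taken as the lowest row containing a long horizontal run, and the y axis as the leftmost column containing a long vertical run. The function names and the `min_fraction` threshold are hypothetical.

```python
def longest_run(values):
    """Length of the longest run of consecutive truthy values."""
    best = run = 0
    for v in values:
        run = run + 1 if v else 0
        best = max(best, run)
    return best


def find_axes(image, min_fraction=0.6):
    """Return (x_axis_row, y_axis_col); an entry is None if no axis is found."""
    height, width = len(image), len(image[0])
    row_runs = [longest_run(row) for row in image]
    col_runs = [longest_run([image[r][c] for r in range(height)])
                for c in range(width)]
    # Keep only candidate lines that are long enough to be an axis.
    x_rows = [r for r, n in enumerate(row_runs) if n >= min_fraction * width]
    y_cols = [c for c, n in enumerate(col_runs) if n >= min_fraction * height]
    x_axis = max(x_rows) if x_rows else None   # lowest long horizontal line
    y_axis = min(y_cols) if y_cols else None   # leftmost long vertical line
    return x_axis, y_axis


# A 6 x 8 toy chart: y axis in column 1, x axis in row 4, one short bar.
image = [[0] * 8 for _ in range(6)]
for r in range(5):
    image[r][1] = 1
for c in range(1, 8):
    image[4][c] = 1
for r in (2, 3):
    for c in (3, 4):
        image[r][c] = 1
axes = find_axes(image)
```

The bar in the toy image produces vertical runs too, but they are shorter than the threshold, which is the role the length filter plays in the description above.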
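Once the coordinate system is known, turning pixel positions into data values is a linear mapping calibrated by two tick marks whose labels were read by OCR. The sketch below shows that arithmetic for a bar chart. The function names, the tick and bar tuples and the output dictionary shape are illustrative assumptions, not the metadata format used by the library.

```python
def make_pixel_to_value(tick_a, tick_b):
    """Build a linear map from pixel y to data value.

    Each tick is (pixel_y, value), e.g. taken from OCR of the axis labels.
    """
    (pa, va), (pb, vb) = tick_a, tick_b
    if pa == pb:
        raise ValueError("calibration ticks must be at different pixel rows")
    return lambda pixel_y: va + (pixel_y - pa) * (vb - va) / (pb - pa)


def bars_to_metadata(bar_tops, ticks, categories):
    """Assemble chart metadata from bar top positions and category labels."""
    to_value = make_pixel_to_value(ticks[0], ticks[-1])
    return [{"category": cat, "value": round(to_value(top), 2)}
            for cat, top in zip(categories, bar_tops)]


# Pixel row 400 is labelled 0 and pixel row 100 is labelled 30
# (image y grows downward, so larger values sit at smaller pixel rows).
ticks = [(400, 0), (100, 30)]
metadata = bars_to_metadata([250, 100], ticks, ["Q1", "Q2"])
```

Using the two outermost ticks for calibration keeps the relative error small; a more robust variant could fit a line through all detected ticks.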


III. Future development of document content structuring

As the business develops, giving users better presentation and interaction beyond full-page display places higher demands on document transcoding and presentation technology. The foundation for all of this is extracting fine-grained document elements and further structured identification and extraction of document content.

