Baidu Library's new-generation document reader: a full analysis of the core technology

Introduction: Baidu Library hosts billions of documents in more than ten common office formats, including Word, PPT, Excel, TXT and PDF. Its core technologies are transcoding and presentation. Transcoding, implemented on the back end, parses the different document formats into one common data format; presentation renders that data. Previously, the Library front end rendered documents with HTML+CSS, which became an obstacle as the business developed: exporting a document as a long image, document markup, keyword highlighting, watermarking, document content analysis, copy protection and similar features were all hard to implement.

The full text is 2803 words, and the expected reading time is 14 minutes

1. Architecture

In May this year, we started using Canvas to implement a new-generation reader that supports PC, WAP and mini programs at the same time. Compared with the old reader, it has the following advantages:

Feature	Old reader	New-generation reader (CReader)
Long-image export	Not possible in the browser; relies on a server-side capability	Can be exported directly
Document markup	High implementation cost	Supported
Copying	Poor text-selection experience; copying is hard to block	Good text-selection experience; copying can be blocked
Performance	Many HTML elements; poor performance on large documents	Few HTML elements; only content in the visible area is rendered, so performance is better
Development experience	No local development, no documentation, no unit tests, weak TS type support, complex integration, low configurability	Vite + TS, efficient development experience, rich documentation, simple integration, independent of the business side's technology stack
Extensibility	Adding features is complicated	Clear division of responsibilities between layers, so features are easy to add

For the technology stack we chose Vite + TS: Vite gives us a fast development experience, and TS makes our code safer and makes the API easier for business teams to call. The overall architecture is divided into:

Logical layer: responsible for managing data loading, page creation, page rendering scheduling, event distribution and providing core API externally;

Data layer: responsible for loading document data, including document content, custom fonts and pictures;

Parsing layer: responsible for parsing the document data and outputting render data, such as text content, font size, text position, picture content and picture position;

Rendering layer: responsible for rendering the parsed content. Currently it only supports Canvas rendering; other rendering methods, such as HTML and SVG, can be added according to business needs;

Application layer: the business side that uses CReader for document rendering. Internally, we also provide an online reader to assist development.
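The division of labor between the layers can be sketched in TypeScript as follows. This is only an illustration under assumptions: every name here (RenderItem, PageDataSource, PageParser, PageRenderer, ReaderCore, JsonParser) is hypothetical and is not CReader's actual API, and the raw page data is assumed to be JSON.

```typescript
// Hypothetical render data produced by the parsing layer.
interface TextItem { kind: "text"; text: string; x: number; y: number; fontSize: number; }
interface ImageItem { kind: "image"; src: string; x: number; y: number; width: number; height: number; }
type RenderItem = TextItem | ImageItem;

// Data layer: loads the raw data of one page.
interface PageDataSource { loadPage(index: number): Promise<string>; }

// Parsing layer: turns raw data into render data.
interface PageParser { parse(raw: string): RenderItem[]; }

// Rendering layer: Canvas today; an HTML or SVG renderer could implement the same interface.
interface PageRenderer { render(pageIndex: number, items: RenderItem[]): void; }

// Assumed parser for illustration: treats the raw data as a JSON array and does no validation.
class JsonParser implements PageParser {
  parse(raw: string): RenderItem[] {
    return JSON.parse(raw) as RenderItem[];
  }
}

// Logical layer: schedules load -> parse -> render for a page.
class ReaderCore {
  private readonly data: PageDataSource;
  private readonly parser: PageParser;
  private readonly renderer: PageRenderer;

  constructor(data: PageDataSource, parser: PageParser, renderer: PageRenderer) {
    this.data = data;
    this.parser = parser;
    this.renderer = renderer;
  }

  async showPage(index: number): Promise<void> {
    const raw = await this.data.loadPage(index);
    this.renderer.render(index, this.parser.parse(raw));
  }
}
```

Because the logical layer only depends on the PageRenderer interface, swapping Canvas for another rendering method would not touch the data or parsing layers.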

2. Core technology points

2.1 Text and image rendering mechanism

A document consists mainly of text and images, and document rendering revolves around these two. Canvas is a browser capability: content is drawn through drawing instructions and then presented to the user. Because the content is not rendered as DOM elements, it does not get the interactivity that the DOM provides, such as text selection and DOM events. A document, however, is static and does not need much interactivity. Here is a comparison of the rendering mechanisms available on the front end:

After weighing the various business scenarios, we chose Canvas, although the reader as a whole is designed to support multiple rendering methods. Canvas rendering has some pitfalls. For example, Safari limits the size of a Canvas on both mobile and PC, as the WebKit source code shows:
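One way to stay under such a limit is to reduce the drawing scale when a page would exceed it. The sketch below assumes the limit is expressed as a maximum pixel area and passes it in as a parameter, since the real value depends on the browser and platform; fitCanvasSize and CanvasSize are hypothetical names.

```typescript
interface CanvasSize { width: number; height: number; scale: number; }

/**
 * Compute the backing-store size of a page canvas so that width * height
 * stays within maxArea. maxArea is supplied by the caller (an assumption;
 * the actual limit is browser-specific).
 */
function fitCanvasSize(
  cssWidth: number,
  cssHeight: number,
  pixelRatio: number,
  maxArea: number,
): CanvasSize {
  let scale = pixelRatio;
  const area = cssWidth * cssHeight * scale * scale;
  if (area > maxArea) {
    // Shrink both dimensions by the same factor so the aspect ratio is kept.
    scale = Math.sqrt(maxArea / (cssWidth * cssHeight));
  }
  return {
    width: Math.floor(cssWidth * scale),
    height: Math.floor(cssHeight * scale),
    scale,
  };
}
```

The returned scale would then be applied to the drawing context, so a very large page is drawn at a lower resolution instead of failing.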

Drawing must also not use too much memory, otherwise the call to getContext('2d') returns null; we ran into heavy memory overhead when rendering a 1000-page document. We therefore adopted a strategy of drawing only the pages in the visible area and freeing a page's memory as soon as it is no longer visible, which reduced the memory footprint of the whole reader by 90%. You may worry that Canvas rendering will have performance problems, but in our testing it could be optimized in several ways, and drawing performance did not become a bottleneck.
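The "draw only visible pages, release the rest" strategy can be sketched as two small functions. This is a simplified illustration, assuming pages are stacked vertically with a fixed gap; visiblePages and diffPages are hypothetical names, not part of CReader.

```typescript
/** Indices of the pages that intersect the viewport, for vertically stacked pages. */
function visiblePages(
  pageHeights: number[],
  gap: number,
  scrollTop: number,
  viewportHeight: number,
): number[] {
  const visible: number[] = [];
  let top = 0;
  pageHeights.forEach((height, i) => {
    const bottom = top + height;
    // A page is visible when its vertical extent overlaps the viewport.
    if (bottom > scrollTop && top < scrollTop + viewportHeight) visible.push(i);
    top = bottom + gap;
  });
  return visible;
}

/** Compare the previous and next visible sets: which pages to draw, which to release. */
function diffPages(prev: number[], next: number[]): { mount: number[]; release: number[] } {
  const prevSet = new Set(prev);
  const nextSet = new Set(next);
  return {
    mount: next.filter((i) => !prevSet.has(i)),
    release: prev.filter((i) => !nextSet.has(i)),
  };
}
```

On each scroll event the reader would draw the pages in mount and free the pages in release. One common way to free a canvas is to set its width and height to 0 and drop all references to it, though how CReader does this is not described here.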

When drawing with a custom font, you must make sure the font has finished loading, otherwise the font will not take effect. Canvas drawing also does not support character spacing, so the final position of each character on the Canvas has to be calculated after the data is parsed. Since users can zoom the document, the final character position must also take the scaling ratio into account. Similarly, a picture must be fully loaded before it is drawn, and if you want to export the Canvas as an image, the pictures drawn on it must have no cross-origin problems. The final drawing looks like this (the lines in the picture come from our debugging tool and show where each character is drawn):
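The per-character position calculation can be sketched as follows. It assumes the parsed data provides an advance width for each character in document units; Glyph, PlacedGlyph and placeGlyphs are hypothetical names used only for illustration.

```typescript
// advance: the character's width in document units (assumed to come from the parsed data).
interface Glyph { char: string; advance: number; }
interface PlacedGlyph { char: string; x: number; y: number; }

/**
 * Place each character of one line individually, because spacing is applied
 * by hand. All inputs are in document units; the output is in canvas pixels.
 */
function placeGlyphs(
  glyphs: Glyph[],
  originX: number,
  baselineY: number,
  letterSpacing: number,
  scale: number,
): PlacedGlyph[] {
  const placed: PlacedGlyph[] = [];
  let penX = originX;
  for (const g of glyphs) {
    // Apply the zoom ratio last so that spacing scales with the text.
    placed.push({ char: g.char, x: penX * scale, y: baselineY * scale });
    penX += g.advance + letterSpacing;
  }
  return placed;
}
```

Each placed character could then be drawn with ctx.fillText(char, x, y) once the custom font is ready, for example after awaiting document.fonts.load() for that font.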

2.2 Text Selection

In the old reader, because of the complexity of the DOM nodes, the text-selection experience was very poor: the selection would jump back and forth. Cross-page selection also copied advertising content, because there are advertisements between pages. Canvas does not provide any text-selection mechanism, so the whole feature has to be implemented by hand. The basic idea is to find the text that corresponds to the mouse position, which means matching the cursor coordinates against the text coordinates; for this purpose, a data layer in the reader records the coordinates of every node. Text selection also has to handle many special cases: the cursor first landing on a non-text area, selection across pages, characters of different font sizes within one line, different line spacing, and so on. To avoid affecting the original document content, a separate Canvas is created on top of the document to draw the selection highlight. The overall effect is as follows:

Cross-page text selection
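The coordinate matching described above can be sketched as a hit test over recorded character boxes. This is a simplified illustration: CharBox, hitTest and selectionRange are hypothetical names, each character is assumed to carry a global reading-order index, and the nearest-box fallback stands in for the more careful line-based handling a real reader needs.

```typescript
// One recorded character: its reading-order index, page, and bounding box in page coordinates.
interface CharBox { index: number; page: number; x: number; y: number; width: number; height: number; }

/**
 * Return the index of the character under (x, y) on the given page.
 * If the point is on a non-text area, fall back to the nearest character.
 * Returns null when the page has no characters.
 */
function hitTest(boxes: CharBox[], page: number, x: number, y: number): number | null {
  let best: number | null = null;
  let bestDist = Infinity;
  for (const b of boxes) {
    if (b.page !== page) continue;
    // Distance from the point to the box; 0 when the point is inside it.
    const dx = Math.max(b.x - x, 0, x - (b.x + b.width));
    const dy = Math.max(b.y - y, 0, y - (b.y + b.height));
    const dist = dx * dx + dy * dy;
    if (dist < bestDist) {
      bestDist = dist;
      best = b.index;
    }
  }
  return best;
}

/** Normalize a drag from anchor to focus into [start, end], whichever direction it went. */
function selectionRange(anchor: number, focus: number): [number, number] {
  return anchor <= focus ? [anchor, focus] : [focus, anchor];
}
```

Because the indices follow reading order across the whole document, a cross-page selection is just a range whose start and end fall on different pages; the highlight Canvas would then fill the boxes inside that range.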

3. Business functions

3.1 Anti-cheating

In the old reader, text copying is the browser's built-in behavior: the content of every text node can be seen with the browser's debugging tools and can also be obtained through code. With Canvas rendering, all content is drawn onto a single bitmap, which effectively prevents the document content from being extracted in these ways.

3.2 Converting documents to images

Because the old reader renders with HTML, and its DOM nodes are complicated and carry many styles, using a library such as html2canvas is not feasible. We implemented long-image export with a headless browser, which consumes server resources and is slow, taking 5-6 seconds on average. The new-generation reader renders with Canvas, which natively supports exporting images, so any page of the document can be exported as an image and different pages can be stitched together into a long image.
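Stitching pages into a long image mostly comes down to computing where each page goes on the target canvas. A minimal sketch, assuming pages are stacked vertically with an optional gap (PageSize, LongImageLayout and layoutLongImage are hypothetical names):

```typescript
interface PageSize { width: number; height: number; }
interface LongImageLayout { width: number; height: number; offsets: number[]; }

/** Compute the size of the long image and the vertical offset of each page in it. */
function layoutLongImage(pages: PageSize[], gap: number): LongImageLayout {
  const offsets: number[] = [];
  let width = 0;
  let y = 0;
  pages.forEach((p, i) => {
    if (i > 0) y += gap; // gap only between pages, not before the first one
    offsets.push(y);
    y += p.height;
    width = Math.max(width, p.width);
  });
  return { width, height: y, offsets };
}
```

A target canvas of the computed size could then receive each page with ctx.drawImage(pageCanvas, 0, offsets[i]) and be exported with canvas.toBlob() or canvas.toDataURL(). The total size still has to respect the browser's Canvas size limits, so very long documents may need to be split or scaled down.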

3.3 Document Markup

Because the new reader renders with Canvas, document markup comes naturally and is very cheap to implement. Shapes can be drawn with open-source solutions such as Fabric, which makes it easy to draw various graphics and export them as JSON, so that markup can be shared between multiple parties. Document markup effect:

4. Mini program layout

There are two layout modes for Library documents. One, called streaming, suits mobile devices but loses the original document formatting. The other, called fixed layout, is used mainly on WAP and PC and stays consistent with the original document structure. At present the mini program mainly uses streaming, supplemented by fixed layout. Fixed-layout documents are shown by loading an H5 page in a WebView, but a mini program's native components cannot be added on top of its WebView. As a result, when reading a fixed-layout document users cannot see the surrounding information, such as document recommendations, VIP guidance and the toolbar, and to download a document they have to jump from the WebView back to a native mini program page. We previously tried rendering documents in an iframe, but the experience was poor and the approach was abandoned for various reasons. The new-generation reader supports PC and WAP and can also be extended to mini programs: we recently succeeded in rendering fixed-layout documents there, and the experience is very good. Fixed-layout documents can now be embedded in native mini program pages just like streaming documents. The effect is as follows:

5. Conclusion

Overall, the new-generation Library document reader brings a better user experience and supports more complex business scenarios. For development, it uses the build tool Vite for a fast development experience. Integration is simpler: any technology stack can integrate it easily, and with the default configuration it takes 5 lines of code, which was not possible before. Comprehensive unit test coverage safeguards quality. The same code base serves WAP, PC and mini programs at the same time and covers 90% of the Library's document types, with the remaining types to be covered over time.
