Vientiane: Baidu's massive multimedia information processing system

Abstract: Compared with traditional web pages, rich media data is much harder to understand and process. Vientiane is the system that Baidu Search designed and developed to process massive amounts of rich media information, and this article gives a comprehensive overview of it. Vientiane currently handles all of the picture and video data processing required by Baidu Search and manages entity features for large-scale image and video data. It supports a daily processing throughput in the billions and is the foundation for improving the effectiveness of Baidu's products.

I. Background

Internet information has developed rapidly in recent years, evolving from the simple web pages of the early days to today's flourishing mix of text, pictures and video. What we see now is no longer a dull page of pure text, but a content carrier rich in pictures and video. The shift from text to pictures and video (including voice) is an upgrade of the channels humans use to communicate, and a step toward more natural communication. The Report on the Development of New Audiovisual Media in China (2017), jointly compiled by the State Administration of Radio, Film and Television and the Administration Department of Network Audio-Visual Programs, shows that:

“China’s online video market reached 60.9 billion yuan in 2016, up 56 percent year on year. As of August 2016, the number of self-produced online audio-visual programs had increased by 180 percent year on year, and their share of traffic had increased from 8 percent in 2015 to 14 percent.”

— Report on the Development of New Audiovisual Media in China (2017)

In terms of information capacity, pictures carry more information than text, and videos carry more than pictures. The same content can be presented as text, pictures or even video, and even for the same picture or video, different people take away different amounts of information. With the popularity of mobile phones and the prosperity of mobile apps, content is no longer presented only on web pages but increasingly in native apps, which offer a better experience, friendlier interaction and a different way of consuming information. All of this presents many new challenges to traditional search engines.

II. New challenges facing search engines

Traditional search engines are the channel through which users' queries reach information. In the HTML era, content presentation followed a fixed standard and a uniform carrier (the browser), so a search engine could conveniently extract, process and retrieve content from across the web and find the results most relevant to a user's query.

Now, however, that approach is quietly shifting:

(1) Content-centered competition: the web is no longer the only carrier of information

△ The variety of ways content is presented

The same content can appear as a news release on a PC web page, as an illustrated article in a mobile (Wise) application, or as a video in various vertical video apps. With the emergence of leading ("head") applications, barriers to content openness have begun to form. In the PC era, web pages were published openly to the outside world; now leading applications can keep their content inside their own apps.

This is especially true in the rich media era: as mobile phones have become ubiquitous and the barrier to editing pictures and video has dropped, more and more high-quality content is presented as pictures and videos, giving users an increasingly better experience. The emergence of popular video apps such as Xigua (Watermelon) Video, Douyin and Kuaishou also reflects users' recognition and pursuit of rich media. For search engines, more and more high-quality content will take rich media form and span multiple media.

(2) The status of search engines as distribution portals is challenged

With the rise of the mobile app ecosystem, users no longer consume information only through search engines; they also consume all kinds of content through a variety of neighboring apps. Because content is created and distributed through many entry points, the same content on different platforms is consumed by different audiences and produces different feedback signals. Since this feedback comes from many mobile app platforms and is not fully reflected on the traditional PC web, it seriously undermines the traditional search engine ranking mechanism.

When traditional search engines were the main entry point for users to obtain information, they could collect feedback signals on how users consumed each piece of content: clicks, dwell time, plays, likes, comments and so on. However, because content is now presented in diverse ways and walled off inside separate mobile apps, the feedback signals a search engine obtains no longer fully represent user intent, which biases relevance ranking.

△ Feedback signals are dispersed across multiple terminals

Since users can access and consume content from multiple applications, search engines need to collect users' behavioral feedback from all of these terminals in order to rank precisely.

This poses a new challenge to the search engine's crawling system: how to obtain user behavior data scattered across different carriers. The crawler needs to go beyond its former limitation of fetching only web pages, extend to pictures and videos, and even reach into pages inside mobile applications to obtain content and signals that users have authorized.

Furthermore, as information is processed inside the search engine, these signals and features are aggregated and propagated at content granularity, so that every other carrier of the same content can use them during ranking.

For example, Xie Hao's "Flight of the Bumblebee" has different user feedback signals in different videos, but they are essentially feedback signals for the same content. Traditionally, these videos come from different playback pages and sites, and their feedback signals were attributed to separate pages. The new generation of search engines built around rich media content will replace this site-based retrieval approach with a content-based one.
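To make the difference between site-based and content-based attribution concrete, here is a minimal sketch of content-keyed signal aggregation. It assumes each playback page has already been mapped to a content ID upstream (for example by fingerprint matching); the page URLs, the `content_of` mapping, the signal names and the `aggregate_by_content` function are all hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical feedback signals collected per playback page (site-based view).
page_signals = {
    "site-a.example/v/1": {"plays": 1200, "likes": 80},
    "site-b.example/watch/9": {"plays": 300, "likes": 25},
    "app-c.example/item/42": {"plays": 5000, "likes": 410},
}

# Hypothetical page -> content ID mapping, assumed to be produced upstream
# by content identification (e.g. fingerprint matching).
content_of = {
    "site-a.example/v/1": "content-001",
    "site-b.example/watch/9": "content-001",
    "app-c.example/item/42": "content-001",
}

def aggregate_by_content(page_signals, content_of):
    """Sum each feedback signal over all pages that carry the same content."""
    totals = defaultdict(lambda: defaultdict(int))
    for page, signals in page_signals.items():
        # A page with no known content ID is treated as its own content.
        content_id = content_of.get(page, page)
        for name, value in signals.items():
            totals[content_id][name] += value
    return {cid: dict(sig) for cid, sig in totals.items()}

print(aggregate_by_content(page_signals, content_of))
# {'content-001': {'plays': 6500, 'likes': 515}}
```

Keyed by page, the three entries would rank as three weak results; keyed by content, they add up to one strong signal that any carrier of that content can use.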

(3) The consumption mode of rich media information is diversified

In traditional web search, users enter keywords and the search engine returns the web pages most relevant to their needs. With rich media, users can search not only with keywords but also with images, and even with content semantics. Common product forms today include:

Search images by text (keywords)
Search videos by text (keywords)
Search images by image: find identical or similar images, for example by taking a photo and searching for matching pictures
Search videos by image: find videos that contain the image, for example by taking a screenshot of a movie to find out which movie it comes from
Search images or videos by text (content semantics): find pictures or videos whose content matches the semantics expressed in the text, for example searching for all clips in the film Three Lives, Three Worlds in which Liu Yifei appears in a kissing scene

These diverse inputs place new requirements on search engine design: besides the traditional inverted index, whose posting lists are keyed on terms, the engine also needs an index over semantic vectors.
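The two kinds of index can be contrasted in a few lines. The sketch below is a toy under stated assumptions: the items, their terms and their three-dimensional "semantic" vectors are hypothetical, and the vector search is brute-force cosine similarity for clarity. A production system would use learned embeddings and an approximate nearest-neighbour index rather than scanning every item.

```python
import math
from collections import defaultdict

# Toy corpus: keyword terms plus a hypothetical semantic embedding per item.
items = {
    "img1": {"terms": ["peach", "blossom"], "vec": [0.9, 0.1, 0.0]},
    "img2": {"terms": ["city", "night"], "vec": [0.0, 0.2, 0.95]},
    "vid3": {"terms": ["peach", "film"], "vec": [0.8, 0.3, 0.1]},
}

# Classic inverted index: term -> posting list of item IDs.
inverted = defaultdict(list)
for item_id, item in items.items():
    for term in item["terms"]:
        inverted[term].append(item_id)

def keyword_search(terms):
    """Return items that contain every query term (posting-list intersection)."""
    postings = [set(inverted.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def vector_search(query_vec, k=2):
    """Brute-force nearest neighbours by cosine similarity."""
    ranked = sorted(items, key=lambda i: cosine(query_vec, items[i]["vec"]), reverse=True)
    return ranked[:k]

print(keyword_search(["peach"]))        # ['img1', 'vid3']
print(vector_search([1.0, 0.2, 0.0]))  # ['img1', 'vid3']
```

Keyword search can only match items whose terms appear in the query, while vector search can also serve image or semantic queries once they are embedded into the same vector space.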

III. Processing and retrieval of rich media information

Rich media information processing and retrieval means collecting, screening and indexing multimedia resources such as videos and pictures, and letting users retrieve information using text or pictures as input. The following figure shows how Baidu's search engine processes rich media information.

△ Processing and retrieval of rich media information

Rich media data such as pictures and videos is first processed offline to identify its content and semantics. The results are turned into attributes that the system can recognize and use: content attributes (such as text labels and category labels), quality information (such as site or author authority, like and play counts, and pornography-filtering signals), and optical properties (such as sharpness). Together, this basic feature information forms the complete representation of a piece of rich media data.

Different rich media items can also be related to each other: identical, similar, or one containing another. For example, many video clips on the Internet may come from the same movie, or two identical video files may come from different playback sites with different watermarks or covers. In our system these relationships are called entity-granularity aggregate features (as opposed to the basic features described in the previous paragraph). When two entities are judged to be the same, their feature information can be merged and aggregated so that both entities can use it. For example, play and like counts are accumulated, and feature labels missing on one entity are filled in from the other, achieving convergence at the content level.
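The merge rule described above (counters accumulate, missing labels are filled from the other entity) can be sketched as follows. The `EntityFeatures` record, its fields and `merge_same_content` are hypothetical illustrations, not Vientiane's actual schema; the sketch assumes the "same content" decision has already been made elsewhere.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EntityFeatures:
    """Hypothetical basic-feature record for one picture or video entity."""
    entity_id: str
    plays: int = 0
    likes: int = 0
    labels: set = field(default_factory=set)
    category: Optional[str] = None

def merge_same_content(a, b):
    """Merge features of two entities already judged to be the same content.

    Counters accumulate, and labels or attributes missing on one entity are
    filled in from the other. Each entity keeps its own ID.
    """
    plays = a.plays + b.plays
    likes = a.likes + b.likes
    labels = a.labels | b.labels
    merged_a = EntityFeatures(a.entity_id, plays, likes, set(labels), a.category or b.category)
    merged_b = EntityFeatures(b.entity_id, plays, likes, set(labels), b.category or a.category)
    return merged_a, merged_b

e1 = EntityFeatures("E1", plays=1000, likes=50, labels={"movie"}, category="film")
e2 = EntityFeatures("E2", plays=400, likes=30, labels={"trailer"})
m1, m2 = merge_same_content(e1, e2)
print(m1.plays, m2.category, sorted(m1.labels))  # 1400 film ['movie', 'trailer']
```

Returning new records instead of mutating the inputs keeps each entity's original features available, which matters when a same-content judgment is later revised.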

IV. Vientiane System

In Baidu's search engine, the system that processes multimedia data such as videos and pictures is called Vientiane (the name evokes the ideas of "all-inclusive" and "everything renewed"). It processes multimedia content such as pictures and videos, providing large-scale collection, processing, screening and indexing, and supplies the data that lets users retrieve information using text, pictures and other inputs.

△ Vientiane architecture diagram

The Vientiane system carries the main picture and video data processing for Baidu's search engine, covering a huge number of pictures and videos (both landscape and portrait video). It computes over this data every day and supports image search, images in search results, video search, recommendation, and all of Baidu's major internal product lines that involve rich media.

Scale and timeliness are the two core design goals of the system for handling rich media information.

Scale: the ability to process videos, pictures and other multimedia data at large scale, while scheduling and managing heterogeneous resources (CPU, GPU, FPGA, etc.) amounting to hundreds of thousands of cores of computing power.
Timeliness: the ability to complete data production (various features and attributes, data screening, index building, etc.) on schedules that meet the requirements and cycles of product iteration, so that product improvements take effect promptly.

In addition to the basic services at the bottom, the whole Vientiane system mainly includes:

(1) Qianren system: analyzes the basic features of a single entity (picture or video), such as person, scene, OCR and sharpness analysis of a single picture;

(2) Chuyuan system: analyzes relationships between entities (identical, similar, containment, clustering, etc.), such as whether one video is a clip of another, or whether a set of pictures forms an album of the same event;

(3) Danding system: manages features, aggregating and organizing feature data at content-entity granularity;

(4) Other auxiliary systems: handle cropping, transcoding, editing, etc.

1. Qianren

The feature data produced by analyzing a single entity is called basic feature data. Analyzing and understanding images and videos is an extremely complex and costly process; across the whole system it uses close to hundreds of thousands of elastic CPU cores plus dedicated computing resources such as GPUs and FPGAs.

Some basic features can be computed cheaply on CPUs, such as the width and height of a picture. Others are expensive: they require specialized hardware such as GPUs and take a long time to produce, for example OCR, classification and sharpness. Qianren persists these high-cost features once they are computed, so that repeated computation is avoided as far as possible and productivity improves.
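One simple way to avoid recomputing expensive features is to persist results keyed by the content and the feature name. The sketch below assumes an in-memory dictionary standing in for a persistent feature store and a SHA-256 hash of the raw bytes as the content key; `FeatureStore` and `fake_ocr` are hypothetical. Note that an exact byte hash only deduplicates identical files; near-duplicates (re-encoded or watermarked copies) would need perceptual fingerprints instead.

```python
import hashlib

class FeatureStore:
    """Hypothetical persisted feature cache: (content hash, feature name) -> value."""

    def __init__(self):
        self._store = {}
        self.computed = 0  # how many times an expensive operator actually ran

    def get_or_compute(self, data: bytes, feature: str, compute):
        key = (hashlib.sha256(data).hexdigest(), feature)
        if key not in self._store:
            self._store[key] = compute(data)
            self.computed += 1
        return self._store[key]

def fake_ocr(data: bytes) -> str:
    # Stand-in for an expensive GPU operator such as OCR.
    return data.decode("utf-8", errors="ignore").upper()

store = FeatureStore()
image = b"hello vientiane"
store.get_or_compute(image, "ocr", fake_ocr)
store.get_or_compute(image, "ocr", fake_ocr)  # same bytes: served from the cache
print(store.computed)  # 1
```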

△ Qianren system

The first problem Qianren must solve is how to meet such a large computing demand with limited resources. In Qianren, every feature computation is expressed as a DAG (directed acyclic graph). Besides traditional batch feature computation, streaming computation is also one of its most important parts. The DAG execution engine manages the dependencies between features, merges duplicate computation units, and schedules work according to data hot spots, all of which improve computing performance. Qianren also optimizes heavy operators such as OCR and video fingerprinting, working out how to break through the throughput limit of a single card to achieve scale-ups of a hundredfold or even a thousandfold.
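The article does not describe the DAG engine's interface, so the following is only a sketch of the core idea under assumed names: each feature operator is a node that declares its dependencies, a request pulls in only the operators it transitively needs, and a shared upstream unit (here a hypothetical `decode` step) runs once even when several features depend on it. Streaming, hot-spot scheduling and GPU batching are omitted. It uses the standard library's `graphlib` (Python 3.9+).

```python
from collections import Counter
from graphlib import TopologicalSorter

# Hypothetical operators: name -> (dependencies, function of dependency outputs).
OPERATORS = {
    "decode":    ((), lambda: "pixels"),
    "size":      (("decode",), lambda px: (1920, 1080)),
    "ocr":       (("decode",), lambda px: "text in image"),
    "sharpness": (("decode",), lambda px: 0.87),
    "quality":   (("size", "sharpness"),
                  lambda size, sharp: sharp if size[0] >= 1280 else 0.0),
}

CALLS = Counter()  # how many times each operator actually executed

def run_dag(requested, operators=OPERATORS):
    """Run only the operators needed for the requested features, each at most once."""
    # Collect the transitive closure of dependencies of the requested features.
    needed, stack = set(), list(requested)
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(operators[name][0])
    graph = {name: set(operators[name][0]) for name in needed}
    results = {}
    # static_order() yields every node exactly once, dependencies first.
    for name in TopologicalSorter(graph).static_order():
        deps, fn = operators[name]
        CALLS[name] += 1
        results[name] = fn(*(results[d] for d in deps))
    return {name: results[name] for name in requested}

print(run_dag(["ocr", "quality"]))  # {'ocr': 'text in image', 'quality': 0.87}
print(CALLS["decode"])              # 1
```

Because each node appears once in the topological order, merging duplicate computation units falls out of the graph structure rather than needing separate bookkeeping.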

2. Chuyuan

Analyzing the features of a single image or video entity is not enough to meet business needs. In many cases we need to know the relationships between entities, for example for: (1) pornographic content identification, (2) originality identification, (3) high-quality content extraction, (4) aggregated entity search based on event, time or space relationships, and (5) recommendation based on identical or similar entities.

The analysis of these relations is carried out by Chuyuan, a subsystem of Vientiane.

△ Chuyuan system

The Chuyuan system builds on the basic features produced by Qianren. Each entity is characterized by its basic feature attributes, and the required relationships are found by comparing fingerprints across the entire entity set. One of the challenges of the system design is how to build the entire entity set dynamically and in real time.
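The article only says that relationships are found by fingerprint-level comparison. As an illustration, the sketch below assumes 64-bit perceptual fingerprints and treats two entities as identical when their fingerprints differ in at most a few bits, grouping them with union-find. The fingerprint values, the `max_distance` threshold and `cluster_by_fingerprint` are hypothetical. The all-pairs loop is O(n²) and only suitable for illustration; at Vientiane's scale, and with a dynamically updated entity set, candidates would have to be found through an index instead.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def cluster_by_fingerprint(fingerprints, max_distance=3):
    """Group entity IDs whose 64-bit fingerprints differ in <= max_distance bits."""
    ids = list(fingerprints)
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if hamming(fingerprints[a], fingerprints[b]) <= max_distance:
                parent[find(a)] = find(b)

    groups = {}
    for x in ids:
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())

fps = {
    "v1": 0xF0F0F0F0F0F0F0F0,
    "v2": 0xF0F0F0F0F0F0F0F1,  # same video, different watermark: 1 bit differs
    "v3": 0x0F0F0F0F0F0F0F0F,  # unrelated content
}
print(cluster_by_fingerprint(fps))  # [['v1', 'v2'], ['v3']]
```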

3. Danding

Features output by Qianren and by Chuyuan are ultimately stored in the feature library, the Danding system.

The Danding system not only stores entity feature information but also aggregates and propagates feature attributes at entity granularity. Two copies of the same content (for example, a video) may have different titles, like counts, repost counts and so on. If Chuyuan's analysis shows that two entities are in fact the same (such as the same video with different watermarks), then when this information is gathered in Danding, the related attributes can be combined and used by either entity.

△ Danding system

Danding's aggregation integrates information by content, laying the foundation for downstream retrieval systems to retrieve information at content granularity. Content aggregation does not erase the original feature information of individual entities; aggregation is performed dynamically while each entity's original feature attributes are retained.

For example, suppose entities E1 and E2 are two videos with the same content. E1 has higher quality (for example, it is clearer or has no black borders), while E2 has title keywords that better match user needs. Because E1 offers a better user experience, the search engine will distribute E1 to users. In that case, E1's displayed title will not be its original title; it will be selected from the title keywords of E1 and E2, or regenerated, to better meet relevance requirements. At the same time, E2's other feature attributes (such as clicks, plays and comments) are added to E1's attributes and returned to the search engine as E1's final attributes, which take part in the final ranking of results.
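The E1/E2 example can be sketched as a content-level view built on top of unchanged per-entity records. Everything here is a hypothetical illustration: the `Video` record, the `quality` score, the word-overlap title scoring and `aggregate_view` are assumptions, and the sketch only selects an existing title rather than regenerating one as the article also allows.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Video:
    """Hypothetical per-entity record; frozen so originals are never modified."""
    vid: str
    title: str
    quality: float  # higher = clearer, no black borders, etc.
    clicks: int
    plays: int

def keyword_overlap(title, query_terms):
    """Count how many query terms appear as words in the title."""
    words = set(title.lower().split())
    return sum(1 for t in query_terms if t in words)

def aggregate_view(cluster, query_terms):
    """Build a content-level view for entities already judged identical.

    The highest-quality entity is served, the best-matching title is borrowed
    from whichever entity has it, and counters are summed.
    """
    best = max(cluster, key=lambda v: v.quality)
    title = max(cluster, key=lambda v: keyword_overlap(v.title, query_terms)).title
    return {
        "serve": best.vid,
        "title": title,
        "clicks": sum(v.clicks for v in cluster),
        "plays": sum(v.plays for v in cluster),
        "members": [v.vid for v in cluster],  # original entities stay addressable
    }

e1 = Video("E1", "clip 0421", quality=0.9, clicks=120, plays=3000)
e2 = Video("E2", "flight of the bumblebee piano", quality=0.6, clicks=80, plays=2000)
view = aggregate_view([e1, e2], ["bumblebee", "piano"])
print(view["serve"], view["title"], view["plays"])
# E1 flight of the bumblebee piano 5000
```

Computing the view on demand, rather than overwriting E1, matches the requirement that aggregation must not erase each entity's original attributes.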

V. Summary

In an era where content is king, the new generation of search engines has moved beyond collection and retrieval based on web pages to collection and retrieval based on rich media and other content carriers. Compared with traditional web pages, rich media data is much harder to understand and process.

Vientiane is the system that Baidu Search designed and developed to process massive amounts of rich media information, and this article has given a comprehensive overview of it. Vientiane currently handles all of the picture and video data processing required by Baidu Search, manages entity features for large-scale image and video data, and supports a daily processing throughput in the billions, laying the foundation for improving Baidu's products.

Original link: https://mp.weixin.qq.com/s/-yhs_86CAMnsCxIYwrmMeQ


Source: Architect of Baidu, Baidu's official technical public account.