The application of graph databases in Baidu Chinese

Abstract: As industries develop rapidly, data is becoming more and more interconnected, but traditional databases struggle with deep and varied relational operations, and graph databases emerged in response. This article describes how graph databases are applied in various Baidu Chinese scenarios.

The full text is 3909 words, and the expected reading time is 6 minutes.

1. Foreword

The Baidu Chinese service covers more than 10 categories, such as characters, words, ancient poems, idioms and xiehouyu (two-part allegorical sayings), and involves more than 10 million entities. Although the data volume is small, each entity type has many attributes. An ancient poem, for example, has dozens of attributes, such as content, poetic style, interpretation, author, title, appreciation, background and tags. The product also has to support queries that combine several related conditions, such as “which dynasty did the author of Thoughts in the Still Night live in”.

With a traditional relational database such as MySQL, we would have to create a table for each category and an index for each queried attribute, and a complex query would need joins across these tables to produce a result, which does not fit common practice for online services. Beyond these general requirements, Baidu Chinese also has to meet the fast-recall standard for Aladdin cards: a response must be returned within 200 ms, and a single machine must sustain more than 1000 QPS at peak. A graph database is a natural fit for this kind of data, where relationships are complex, deep and varied.

2. Graph database selection

2.1 Graph database introduction

A graph database (GDB) is a database that uses graph structures for semantic queries, representing and storing data as nodes, edges and attributes. Its key concept is the graph, which directly relates the data items in the store to a collection of nodes and of edges that represent the relationships between nodes. These relationships allow data in the store to be linked together directly and, in many cases, retrieved with a single operation. Graph databases treat relationships between data as a priority. Querying relationships is fast because they are stored persistently in the database itself. Relationships can also be displayed visually, which makes graph databases useful for highly interconnected data.

Figure 1 shows the trends of the various database categories on DB-Engines. Graph databases have led this trend chart since 2014.

Figure 2 shows the number of database systems of each type. Relational databases and KV databases account for the largest shares, and graph databases also make up a large proportion.

2.2 Choosing a graph database

As the figure above shows, there are currently at least 32 mainstream graph databases, and each business has to choose according to its own characteristics. For Baidu Chinese, we mainly considered the following points:

1) Open source

2) Mature and extensible

3) Low deployment and operations cost, and high stability

4) Rich documentation, preferably with an active community

5) Support for batch data import and export

Neo4j is the most widely used graph database. It has a long history and a leading position in the field, and it has stood the test of time. Unfortunately, it has two major drawbacks: 1. The open source community edition only supports standalone deployment, not distributed deployment; 2. It uses the commercially unfriendly GPLv3 license.

For Baidu Chinese, departmental and business adjustments required us to migrate the underlying storage engine. The complexity of our business lies in the diversity of query scenarios rather than in deep multi-level relationships, and most of the data is static. We therefore chose HugeGraph, the graph database open-sourced by Baidu (link: hugegraph.github.io/hugegraph-d…).

It is worth mentioning that HugeGraph supports a variety of storage engines (Memory, Cassandra, ScyllaDB, RocksDB, MySQL, etc.). Within Baidu Chinese, we are more familiar with RocksDB than with the other KV storage engines. It stores data as files in a directory, which makes copying and migration easy, and it has better SSD support and throughput, so we chose RocksDB. In addition, HugeGraph has good support for data maintenance: it provides RESTful APIs and loader tools for bulk data operations, and can readily import billions of records.

HugeGraph also supports Gremlin, the graph traversal query language of Apache TinkerPop3. Just as SQL is the query language for relational databases, Gremlin is a general query language for graph databases. Gremlin can be used to create graph entities (Vertex and Edge), modify entity attributes, delete entities and, most importantly, perform graph queries and analysis.

3. Building the Baidu Chinese graph database

3.1 Deployment of the Baidu Chinese service

Deploying HugeGraph is very simple. Baidu Chinese deploys it directly on a unified virtualization and PaaS platform. To keep resource consumption low and availability high, we did not adopt a distributed deployment but went straight to a multi-master deployment. In this mode every instance stores the full data set, so data can be fetched quickly from the graph database in the same machine room. The biggest disadvantage of this mode is, of course, the possibility of data inconsistency. To solve this problem, Baidu Chinese uses a unified data intervention platform to keep data consistent across the whole cluster.

Chinese search places high demands on content recall time. To avoid long-tail latency, we did not expose the HugeGraph interface directly but added an Nginx forwarding layer in front of it. This layer has two functions, one of which is to use Nginx proxy_cache to cache data and speed up responses. Because the volume of entity content in Chinese retrieval is small, the cache in front of the database layer can directly return about 30% of hot data. Data files are shared between the servers through Afs cloud disk.

3.2 Chinese intervention platform

As described above, Baidu Chinese deploys HugeGraph with horizontal redundancy, which raises the requirements for keeping data consistent across all replicas when data is updated. Some Chinese-language data contains errors, and new internet words keep appearing, so the original data has to be updated manually. To support this kind of rapid intervention, Baidu Chinese built a unified data intervention platform that can modify online data in real time. So that the data can also be analyzed offline, the platform supports batch data import and export as well.

Currently, the intervention platform supports three basic functions: real-time data intervention, batch data intervention and data export. At the service layer, the platform records the content of each operation, regularly checks whether the intervention results match expectations, and can trace back historical operations. To make updates to the HugeGraph cluster behave “transactionally”, a mechanism of N retries is added. If the update still fails after the retries, an alarm is triggered so that a person can step in.
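The retry-then-alarm behaviour described above can be sketched roughly as follows. This is a minimal illustration in Python, not the intervention platform's actual code: the names `apply_with_retries` and `update_cluster`, the `max_retries` parameter, and the idea of modelling each instance and the alarm as plain callables are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical type: an "instance" is modelled as a callable that applies
# one update and returns True on success. A real deployment would send the
# update to a HugeGraph instance here instead.
ApplyFn = Callable[[dict], bool]


def apply_with_retries(apply: ApplyFn, update: dict, max_retries: int) -> bool:
    """Try the update once, then retry up to max_retries more times."""
    for _ in range(1 + max_retries):
        try:
            if apply(update):
                return True
        except Exception:
            # Treat an exception like a failed attempt and try again.
            pass
    return False


def update_cluster(instances: Dict[str, ApplyFn], update: dict,
                   max_retries: int,
                   alarm: Callable[[str, dict], None]) -> List[str]:
    """Apply one update to every instance and alarm for those that never succeed.

    Returns the names of the instances that still failed after all retries.
    """
    failed = []
    for name, apply in instances.items():
        if not apply_with_retries(apply, update, max_retries):
            failed.append(name)
            alarm(name, update)  # hand over to a human operator
    return failed
```

Note that this sketch only detects and reports instances that diverge. It does not roll back instances that already accepted the update, which is why the word “transactional” is in quotes.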

Before the cluster is deployed, the data has to be prepared. Since Baidu Chinese uses the RocksDB engine, we package the prepared data and upload it to Afs cloud disk. When the HugeGraph service is deployed, we pull the data package from Afs, load the data from the configured directory, and then start the HugeGraph service.

When a server fails and has to be migrated, we package the data on the failing server and upload it to Afs cloud disk to update the backup, and the new server then pulls the data package from Afs again. During this period the service is marked as abnormal on the intervention platform, and the interventions made during this window are collected and applied again, which ensures data consistency. Minute-level scheduled backups, together with the backup update during server migration, already meet the backup requirements of Baidu Chinese. For data visualization, the intervention platform offers two modes, a JSON view and the HugeGraph-Studio web graph view, so that operations staff can quickly locate data attributes.

For example, in Baidu Chinese, a poem is stored like the following:

There are four different vertex types, each representing an important attribute of a poem (including the poem itself, the poem name and the author), and they are connected to each other by edges.
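To make the vertex-and-edge layout more concrete, here is a small in-memory sketch in Python. It is only an illustration, not how HugeGraph stores data. The labels `poem_name`, `poem`, `name_poem` and `type_poem_author` are taken from the query in section 3.3. The `TinyGraph` class, the `author` label, the vertex ids, and the choice to keep the dynasty as a property of the author vertex are assumptions.

```python
from collections import defaultdict


class TinyGraph:
    """A minimal property graph: labelled vertices and labelled, directed edges."""

    def __init__(self):
        self.vertices = {}                  # id -> {"label": ..., "props": {...}}
        self.out_edges = defaultdict(list)  # source id -> [(edge label, target id)]

    def add_vertex(self, vid, label, **props):
        self.vertices[vid] = {"label": label, "props": props}

    def add_edge(self, src, label, dst):
        self.out_edges[src].append((label, dst))

    def out(self, vid, edge_label):
        """Return ids of vertices reached from vid over edges with edge_label."""
        return [dst for (label, dst) in self.out_edges[vid] if label == edge_label]


g = TinyGraph()
# Hypothetical ids and property layout, for illustration only.
g.add_vertex("p_name-1", "poem_name", text="Thoughts in the Still Night")
g.add_vertex("poem-1", "poem")
g.add_vertex("author-1", "author", name="Li Bai", dynasty="Tang")
g.add_edge("p_name-1", "name_poem", "poem-1")
g.add_edge("poem-1", "type_poem_author", "author-1")

# Walk poem name -> poem -> author, following one edge label per hop.
poems = g.out("p_name-1", "name_poem")
authors = [a for p in poems for a in g.out(p, "type_poem_author")]
dynasties = [g.vertices[a]["props"]["dynasty"] for a in authors]
print(dynasties)
```

Because each attribute is its own vertex, a question that spans several attributes becomes a short walk along edges rather than a join across tables.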

3.3 Service Query

HugeGraph can be used to query multidimensional relationships among characters, poems and other content. In the search scenario mentioned at the beginning, however, users usually enter loosely phrased queries, such as “Which dynasty is the author of Thoughts in the Still Night”. Such a query has to be converted into Gremlin syntax that the graph database can understand, that is:

shici.traversal().V().hasLabel('poem_name').hasId('p_name-静夜思').outE('name_poem').inV().hasLabel('poem').outE('type_poem_author').inV().path();

This requires a DA parsing module (data dictionary) that parses the queries entered by users into Gremlin syntax. The basic principle of DA is to classify the queries gathered from historical data into templates, then parse a user’s query into specific template variables, which are substituted into the template to form a complete Gremlin statement.
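The template-and-fill idea behind DA can be sketched as follows. This is a simplified illustration in Python, not the actual DA module: the single regular-expression template, the `to_gremlin` and `escape` functions, and the id scheme `p_name-<title>` are assumptions, and the Gremlin skeleton simply mirrors the statement shown above.

```python
import re
from typing import Optional

# Each template pairs a query pattern with a Gremlin skeleton.
# This single entry is a hypothetical example of what such a dictionary might hold.
TEMPLATES = [
    (
        re.compile(r"^which dynasty is the author of (?P<title>.+?)\??$", re.IGNORECASE),
        "shici.traversal().V().hasLabel('poem_name').hasId('p_name-{title}')"
        ".outE('name_poem').inV().hasLabel('poem')"
        ".outE('type_poem_author').inV().path()",
    ),
]


def escape(value: str) -> str:
    """Escape backslashes and single quotes so user text stays inside the Gremlin string."""
    return value.replace("\\", "\\\\").replace("'", "\\'")


def to_gremlin(query: str) -> Optional[str]:
    """Return the filled-in Gremlin statement, or None if no template matches."""
    text = query.strip()
    for pattern, skeleton in TEMPLATES:
        match = pattern.match(text)
        if match:
            slots = {k: escape(v.strip()) for k, v in match.groupdict().items()}
            return skeleton.format(**slots)
    return None


print(to_gremlin("Which dynasty is the author of Thoughts in the Still Night"))
```

A query that matches no template returns `None`, so the caller can fall back to another retrieval path. Splicing user text into a query string needs care even with escaping. Where the server accepts parameter bindings, passing the title as a binding would likely be the safer design.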

4. Summary and outlook

No software is perfect, and every system has its own focus. HugeGraph adds a general graph semantic layer on top of existing storage systems to provide graph traversal, but because of limitations in the storage layer or architecture, multi-hop traversal performs poorly and struggles to meet the low-latency requirements of OLTP scenarios. Retrieval features such as inverted indexes have to be added through plug-ins, and the supported storage types are relatively simple. Despite these shortcomings, HugeGraph’s own features together with the Lucene engine already meet the needs of Baidu Chinese search. Our next step is to keep studying and optimizing the graph database and, while keeping it stable, to extract more value from it and serve users better.

References:

Database trend chart: db-engines.com/en/ranking…

HugeGraph website: hugegraph.github.io/hugegraph-d…
