The following article is from CSDN, written by Xi Yan



Interviewee | Shao Zongwen, Tencent Cloud database product manager
Reporter | Xi Yan
Produced by | CSDN (ID: CSDNnews)

Recently, another domestic database was born! This time Tencent has launched a distributed graph database product: Tencent Cloud Digital Graph TGDB (Tencent Graph Database).

The database is said to support real-time queries over trillions of relationships, process heterogeneous data efficiently, and support real-time graph computing. In theory, a cluster can scale to more than 10,000 nodes, and on different public datasets its queries are reported to be 20 to 150 times faster than those of Neo4j, which has the largest market share worldwide.

In recent years, graph databases have become increasingly popular. According to Gartner’s top 10 data and analytics technology trends, the use of graph processing and graph databases will grow by 100% annually through 2022. Graph databases are gaining popularity far faster than other major database categories.

Until now, mainstream graph database products have come mainly from foreign vendors, and key domestic industries such as finance, e-commerce, and energy could only rely on foreign products. Against this background, when China will have a powerful domestic graph database that truly meets the needs of domestic enterprises has become a hot topic. As more vendors recognize the importance of big data, and of graph data in particular, domestic companies large and small have launched their own graph database products in an attempt to break the monopoly of foreign technology vendors. Among the large vendors, these include Alibaba Cloud’s graph database GDB, Ant Financial’s independently developed distributed graph database GeaBase, and Huawei’s GraphBase. Smaller companies have their own graph database products as well, including Nebula Graph, and TigerGraph from Vega.

In theory, in terms of technical fit, security, and cost, domestic database products should better match the needs of domestic enterprises and the pace of their informatization. Is that really the case? Our focus today: compared with the graph database products already released at home and abroad, what is special about the newly released Tencent Cloud Digital Graph TGDB? Does it have an advantage over them? To find out, CSDN invited Shao Zongwen, product manager for the Tencent Cloud graph database, to assess, from R&D background to upper-layer design, whether this graph database lives up to its name.

I. Research and development background of TGDB

Driven by digital technologies such as 5G, the Internet of Things, and artificial intelligence, enterprise data is exploding, and the complexity of the associations within that data is growing rapidly. Traditional relational databases are inefficient at processing highly connected data, which makes it hard for enterprises to dig further into the value behind massive volumes of relationship data. To make better use of the connections between data, enterprises need a database technology that stores relationships as first-class entities and allows the data model to be extended flexibly. Tencent saw the opportunity in graph databases.

After in-depth research, Tencent found that what customers need is often a complete car: beyond building the graph database engine itself, a range of partners is needed to supply the supporting components so that enterprise needs can be met. At present, Tencent’s graph database ecosystem consists mainly of top industry talent and upstream and downstream database partners, including overseas returnees and senior experts with more than 10 years of experience in the database field. Key research directions include distributed storage for graph databases, high-performance computing, and graph algorithms, as well as ecosystem components such as migration tools, visualization, data extraction, and data modeling.

II. Technological breakthroughs in the graph database

Compared with other graph database products at home and abroad, TGDB has some unique features. In general, TGDB’s technological breakthroughs lie in performance improvements and in new capabilities enabled by a flexible, extensible architecture, including a fully distributed, decentralized architecture, efficient native storage, graph partitioning, and distributed algorithms.

Decentralized distributed system architecture

According to Shao Zongwen, TGDB adopts a decentralized distributed architecture and in theory supports linear scaling. Judging from current deployments, TGDB is far from reaching its graph data storage limit. In the laboratory, the team has tested clusters of 100 nodes, but according to theoretical derivation a TGDB cluster can scale to more than 10,000 nodes, and on different public datasets its queries are 20 to 150 times faster than those of Neo4j, which has the largest market share worldwide.

This storage capacity and query speed are the result of TGDB’s system architecture design.

Internally, the TGDB distributed graph database is divided into three layers:

  • Resource management layer: manages and schedules the underlying computing and data resources. Put simply, it uses a scheduling algorithm to distribute each computing task and its data to the distributed nodes, and it handles execution, monitoring, fault tolerance, and aggregation of results.

  • Data abstraction layer: provides the property graph abstraction, covering the graph’s data structures, storage method, access modes, and message protocol.

  • Upper algorithm and application layer: provides algorithms built on the distributed computing engine. These algorithms access graph data through the data abstraction layer. Depending on its design, each algorithm’s execution is broken into units that can be distributed and processed in parallel, and these units are handed to the resource management layer for execution.

In its cluster deployment architecture, the TGDB graph database is fully distributed and decentralized: all nodes are peers, so there is no single-master point of failure and none of the system complexity otherwise needed to guard against one.

Underlying data consistency rests on a stable message queue and snapshot mechanism, so any node or process can assume that a virtual, stable intermediary platform exists for exchanging messages. That messaging platform provides global guarantees such as consistent ordering and the strongest level of delivery guarantee. It also supports multiple hot standbys, which, combined with sensible rack placement, ensure high fault tolerance.

From a technical point of view, how does TGDB achieve real-time queries over trillions of relationships? Shao Zongwen explained in detail.

He said that large-scale real-time querying cannot be solved simply by splitting query traffic or by a single optimization. It requires query plan optimization, a high-concurrency task processing mechanism, distributed management of underlying resources, and the system deployment architecture to work together.

Specifically, TGDB first transforms each query or computation request into an optimized directed acyclic graph (DAG), and the DAG model guarantees that distributed tasks complete correctly. Each vertex of the DAG is an executable task, and each edge is a logical ordering or a data transfer. Each machine node schedules the decomposed tasks in parallel: every DAG is broken down into multiple independent computing tasks that do not depend on one another, which makes them easy to distribute and execute in parallel. Because these tasks have no dependencies on one another and exchange no messages, the complexity of task control in the system is greatly reduced, and optimized workflow control for highly concurrent computing is achieved.
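The decomposition described above can be sketched in a few lines of Python. This is only an illustration of the general technique (grouping a DAG into stages of mutually independent tasks with Kahn's algorithm), not TGDB's actual planner, which the article does not describe. The function name `parallel_stages` and the sample query plan are hypothetical.

```python
from collections import defaultdict


def parallel_stages(tasks, edges):
    """Group DAG tasks into stages; tasks in one stage do not depend on each other."""
    indegree = {task: 0 for task in tasks}
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1

    # Tasks with no unmet dependencies can all run in parallel.
    ready = sorted(task for task in tasks if indegree[task] == 0)
    stages = []
    scheduled = 0
    while ready:
        stages.append(ready)
        scheduled += len(ready)
        next_ready = []
        for task in ready:
            for child in children[task]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_ready.append(child)
        ready = sorted(next_ready)

    if scheduled != len(tasks):
        raise ValueError("dependency cycle detected; input is not a DAG")
    return stages


# Hypothetical query plan: two scans feed a join, which feeds an aggregate.
PLAN_TASKS = ["scan_users", "scan_orders", "join", "aggregate"]
PLAN_EDGES = [("scan_users", "join"), ("scan_orders", "join"), ("join", "aggregate")]
STAGES = parallel_stages(PLAN_TASKS, PLAN_EDGES)
```

In this sketch the two scans land in the same stage and could be dispatched to different nodes at once, while the join and the aggregate each wait for the stage before them.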

TGDB’s distributed resource management logic is responsible for unified management and scheduling of computing and data storage resources in the cluster. Tasks can be registered and published on any machine node. It supports cross-platform migration and provides task monitoring, transfer, and recovery. Distributed resource management uses the bag-of-tasks pattern to build a resource pool within the platform, so that each node can intelligently fetch and execute computing tasks from the pool. This makes effective use of the decentralized, self-organizing architecture to achieve distributed resource scheduling that is optimized, free of bottlenecks, and highly fault tolerant.

Simply put, with this design, highly concurrent real-time queries can be broken down into units that are easy to distribute and execute in parallel, with optimization applied across the entire system.
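As a rough illustration of the bag-of-tasks pattern, the sketch below lets worker threads pull work from a shared pool whenever they are idle, instead of having a scheduler assign tasks to them. It is a single-process stand-in under stated assumptions: the in-memory `queue.Queue` plays the role of the platform's resource pool, threads play the role of nodes, and `run_bag_of_tasks` is a hypothetical name. How TGDB implements its pool is not covered in the article.

```python
import queue
import threading


def run_bag_of_tasks(tasks, num_workers=4):
    """Run callables from a shared bag; each idle worker pulls the next task itself."""
    bag = queue.Queue()
    for task_id, func in tasks.items():
        bag.put((task_id, func))

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task_id, func = bag.get_nowait()
            except queue.Empty:
                return  # the bag is empty, so this worker stops
            value = func()
            with lock:
                results[task_id] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Hypothetical independent tasks: each one squares a number.
TASKS = {f"square_{n}": (lambda n=n: n * n) for n in range(8)}
RESULTS = run_bag_of_tasks(TASKS, num_workers=3)
```

Because workers take tasks at their own pace, a slow worker simply takes fewer tasks, which is the load-balancing property the pattern is usually chosen for.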

Native graph storage

In terms of storage and computing, TGDB uses native graph storage and does not rely on any third-party storage platform such as HBase or RocksDB. The storage system was developed independently by Tencent. In this respect it resembles foreign native graph databases such as Neo4j and differs from open source products such as JanusGraph.

Native graph storage has a large performance advantage over non-native storage in query and computation speed. To illustrate, Shao Zongwen offered an analogy: communication between the upper layer of a native graph database and its storage is like a person talking to himself in his head, whereas communication between the upper layer of a non-native graph database and third-party storage is like two people talking, where one has to speak up and the other has to hear it and then reply. The performance cost of the non-native approach is therefore higher, and the disadvantage becomes more pronounced with deep graph queries, multiple rounds of iterative computation, and large volumes of changes to graph data.

Graph partitioning algorithm

Most traditional graph algorithms are expressed and computed in terms of matrices. Another defining characteristic of TGDB is that it is distributed, not only in system architecture and deployment but, more importantly, in the design and implementation of its distributed graph partitioning algorithm and other distributed graph algorithms. Support for graph partitioning is also the key to whether a graph database can truly scale linearly, and it is here that TGDB differs fundamentally from some other database products. TGDB genuinely partitions a large graph into smaller subgraphs and spreads them across the distributed nodes for storage. It does not use the Raft protocol to replicate a single node, an approach in which the graph is never partitioned and every node in the cluster stores the whole graph. That approach still essentially stores all the data on a single machine and does not really support scaling the data volume.

TGDB is a native distributed graph database. Its storage abstraction is vertices and edges rather than matrices, and the graph is partitioned into many pieces stored on multiple servers. Under this structure, traditional graph algorithms have to be completely rewritten to run in a scalable, distributed, concurrent way: they must work in terms of vertices and edges, take full account of how the graph fragments are distributed, and optimize cross-server message passing.
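To make the idea of partitioning concrete, here is a minimal sketch that assigns vertices to shards by hash and counts the edges that end up crossing shards. Hash partitioning is used here only because it is the simplest scheme to show; the article does not say which partitioning algorithm TGDB uses, and the names `shard_of` and `partition_graph` are hypothetical.

```python
import zlib
from collections import defaultdict


def shard_of(vertex, num_shards):
    """Map a vertex id to a shard with a hash that is stable across runs."""
    return zlib.crc32(vertex.encode("utf-8")) % num_shards


def partition_graph(edges, num_shards):
    """Place each vertex on one shard; store each edge with its source vertex."""
    shards = defaultdict(lambda: {"vertices": set(), "edges": []})
    cut_edges = 0
    for src, dst in edges:
        src_shard = shard_of(src, num_shards)
        dst_shard = shard_of(dst, num_shards)
        shards[src_shard]["vertices"].add(src)
        shards[dst_shard]["vertices"].add(dst)
        shards[src_shard]["edges"].append((src, dst))
        if src_shard != dst_shard:
            cut_edges += 1  # traversing this edge needs a cross-server message
    return dict(shards), cut_edges


# Hypothetical four-vertex graph split across three shards.
EDGES = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
SHARDS, CUT_EDGES = partition_graph(EDGES, num_shards=3)
```

Each cut edge is a traversal step that has to leave its server, which is why the text stresses rewriting algorithms around fragment distribution and cross-server messaging. Practical partitioners generally try to keep this count low while balancing shard sizes.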

Other features

  • Support for Neo4j’s Cypher query language

In terms of query language, TGDB supports Neo4j’s Cypher, which makes it easy to adopt as a replacement for Neo4j. It also provides an easy-to-use graphical user interface, allowing analysts to manage graphs and carry out iterative graph analysis quickly without programming.

TGDB features high scalability, high integration, fast computing, and lightweight deployment.

  • Integration with cutting-edge AI technology

TGDB currently supports a growing number of algorithms, and it can be combined with Tencent’s Plato computing engine platform to deliver algorithms, including some graph neural network algorithms. In addition, Shao Zongwen noted that, as described above, traditional graph algorithms need to be restructured and optimized for a distributed architecture. TGDB still has a great deal of research to do in this area, which is also a frontier of academic work.

III. Outlook: explosive growth will come first in finance

As an expert in the graph database field, Shao Zongwen offered predictions about the future of graph databases from two angles: technological innovation and application.

He predicts that graph databases will see explosive growth first in the financial sector, because traditional relational databases and earlier big data systems are constrained by inherent architectural limitations. For example, traditional databases cannot handle well the relationship problems involved in financial risk control, or compliance problems involving the relationships between employees and their family members, employees and customers, and customers and the business. These are all very complex relationships.

In addition, with the arrival of the 5G era, more and more information will connect people with people, people with things, and things with things, which also provides a good opportunity for the development of graph databases.

IV. TGDB’s future plans: moving into traditional industries

Currently, TGDB is used in the Internet industry, financial risk control, the Internet of Things, power networks, e-commerce, intelligent transportation, biological sequence research, medical diagnosis decision-making, disease transmission analysis, judicial decision support, public security, and other areas. In the future, Shao said, TGDB will also dig deeper into the connections within data from traditional industries such as energy and electric power. Although these industries have some capacity for data integration, he said, mining the correlations between data remains difficult. A power knowledge graph, for example, supports storing and updating the various kinds of time-series measurement data fed into the power grid, and it directly represents the grid’s dependencies and the topology of its power equipment, fully revealing device status and the relationships between devices so that devices across the whole network can be monitored and managed. Unlike traditional, time-consuming operations based on vectors and matrices, complex power networks and knowledge are represented as graph structures that can be queried and computed on directly, and the results can be stored directly as elements of the graph, greatly improving the efficiency of grid computation and analysis. This enables applications such as retrieval of grid operating modes, inference of equipment status, equipment profiling, and family defect analysis.
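A toy example may help show what querying directly on the graph means for a power network. The sketch below stores a made-up grid topology as an adjacency map and uses breadth-first search to find every device fed by a given piece of equipment. The device names, the `feeds` relationship, and the function `downstream_devices` are all hypothetical, and a real deployment would express this as a graph query against the database rather than as in-memory Python.

```python
from collections import deque


def downstream_devices(feeds, start):
    """Return every device reachable from `start` by following `feeds` edges."""
    seen = set()
    pending = deque([start])
    while pending:
        device = pending.popleft()
        for fed in feeds.get(device, []):
            if fed not in seen and fed != start:
                seen.add(fed)
                pending.append(fed)
    return seen


# Hypothetical topology: a substation feeds two transformers, which feed feeders.
GRID = {
    "substation_A": ["transformer_1", "transformer_2"],
    "transformer_1": ["feeder_1"],
    "transformer_2": ["feeder_2", "feeder_3"],
}
AFFECTED = downstream_devices(GRID, "transformer_2")
```

The traversal only touches the devices that are actually connected to the starting point, whereas a matrix formulation would operate on the full adjacency matrix of the network.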

These are all areas where TGDB can make a big impact in the future.
