Abstract:This paper mainly introduces the design, planning and current situation of GAUSSDB (DWS) database intelligent monitoring operation and maintenance service system.

This article is shared from Huawei Cloud Community “Eyes, six ways, ears and all directions, but still tireless? How does the intelligent operation and maintenance service system of data warehouse achieve?” , the original author: Master Lu.


In the early days, the database system only provided SQL commands to query its internal running state, which led to the high threshold of database operation and maintenance operation and poor usability. DBA once became a highly specialized key position, enjoying high salary and the admiration of everyone, but also brought uncertain risks for the data security of enterprises. Moreover, the command line operation and maintenance is not intuitive, and it relies heavily on the experience of the operation and maintenance personnel, so it cannot find, locate and solve problems quickly, which leads to the database operation and maintenance problems, which are difficult to find, locate and solve.

In order to deal with this dilemma, the running state of the database visualization (database control system) came into being, through visual means to human understanding the form of chart, will focus on data by means of a graphical display to operations staff, thus significantly reducing the threshold of the database operations, improve the efficiency of database operations. This stage has some representative products such as OEM(Oracle), Viewpoint (Teradata), and so on. However, during this period, the scale of the user’s data was not very large, and the database was still deployed in the user’s own data center, and it was still a stage of several DBAs operating and maintaining several sets of databases.

With the arrival of the cloud era, the cloud database gradually hosts the data storage services of customers. Cloudization concentrates all the heavy IT operation and maintenance work in the cloud background management, thus liberating customers from the professional, complex and heavy data center operation and maintenance activities and enabling customers to focus more on their core business. At the same time, as the provider of data storage services, cloud service providers need to make deep and detailed efforts in IT operation and maintenance and database operation and maintenance, give full play to their advantages of stable team, high degree of specialization, and master the operation data of massive databases. Make full use of the current scientific research achievements in the field of machine learning and artificial intelligence, and use technical means to gradually increase the number of databases that each operation and maintenance personnel can manage, so as to realize the “reduction of personnel and increase efficiency” of database operation and maintenance work. On the other hand, with database services in the cloud, the number of databases that cloud service providers need to operate and maintain is dramatically different than before, and the previous tools may not be suitable for the cloud era. How to do a good job of the operation and maintenance of the cloud database will become a huge challenge for the cloud service providers.

Database intelligent operation and maintenance system

In the traditional sense, the database monitoring service only refers to (1) collecting the running state of the database; (2) Reporting/storing database operation data; (3) Graphic display of database running status data. However, this is only a part of the database intelligent monitoring operation and maintenance system.

If the whole database intelligent monitoring operation and maintenance system is compared to a person, the database monitoring service in the traditional sense only represents the role of the eye. The service can only find problems, and the DBA is required to identify and locate problems and solve them. Therefore, DBA is the core element of the traditional database monitoring operation and maintenance system, which is also one of the reasons why DBA talent is so critical.

The advent of the cloud era and the maturity of big data analysis, artificial intelligence and other technologies give more imagination to the operation and maintenance of database monitoring. I can on the basis of the traditional database monitoring (eyes), increase the forecast analysis and returning to judge module, establishing the phenomenon – returning – solution mapping (brain), finally through a database management module performs solutions (hands), so as to realize from finding problems, positioning problem, to solve the problem of operational closed loop. And the machine is different from human beings, as long as the calculation force allows, it can do the eyes and ears, tireless, also will not feel bored, 7×24 staring at hundreds of database system of various operating data, will not miss any tiny potential problems. Therefore, in the intelligentization of database operation and maintenance, it is very important to use rules or algorithms to solidify DBA’s judgment and decision-making experience.

GAUSSDB (DWS) database intelligent operation and maintenance system

By referring to the construction experience of You-business database monitoring operation and maintenance system and combining with the characteristics of GAUSSBD (DWS) data warehouse, we plan to build a closed-loop database intelligent monitoring operation and maintenance system from three aspects: eyes, brain and hands.

GAUSSDB (DWS) uses DMS to host the intelligent operation and maintenance system of the database. DMS will string up the three steps of monitoring, analysis and processing in the process of database operation and maintenance, respectively corresponding to the three parts of eye, brain and hand in the intelligent database operation and maintenance system mentioned above, and form a closed loop of the operation and maintenance system from the conceptual design.

Monitoring part: mainly responsible for the collection, storage and visual display of the database running state data, which is basically equivalent to the traditional database monitoring business. In the selection of this part of functions and indicators, we refer to the suggestions of friends and the operation and maintenance team and divide the monitoring indicators into two categories: the operation and maintenance indicators of the underlying IT system and the operation and maintenance indicators of the database system, which are in the process of gradually completing and improving respectively. Monitoring module is the first power of DMS database operation intelligent monitoring operation and maintenance system, and to form a competitive module in a short time.

Analysis part: as the brain of the intelligent operation and maintenance system of the whole DMS database, this part is the key module responsible for the analysis and decision of operation and maintenance data. Because of its complexity, this part is still in the design conception stage. The preliminary planning has three sub-modules, the sub-module of trend analysis of time series, which is mainly used to do trend prediction analysis and to predict potential problems. Logical inference sub-module, users analyze the relationship between the problem phenomenon and the actual root cause, can realize the inference from the problem phenomenon to the trigger reason, initially consider to use the search engine technology to achieve; The sub-module of knowledge graph spectrum is mainly used to express the mapping relationship between phenomena, root causes and solutions, so as to find the most appropriate solution from the root causes located conveniently.

Processing part: It is mainly undertaken by the database management function provided by DWS. At present, it can provide database parameter configuration (few configurable parameters need to be further enriched), workload queue configuration, cluster installation/unloading, cluster restart, cluster capacity expansion, cluster data redistribution, node temperature and other operation and maintenance capabilities.

Typical users and requirements for intelligent operation and maintenance of GausSDB (DWS) database

In order to further clarify the design ideas of intelligent database operation and maintenance products, we plan to analyze their needs from the user’s point of view, then derive the function (tool) page design from the requirements, and summarize the required monitoring database indicators from the function (tool) page. By analyzing various use scenarios of database monitoring system, we make user role portraits of database monitoring system users, define three roles in the process of database operation and maintenance, and think that different roles only pay attention to one side of database operation and maintenance. In a real database operation scenario, it would be possible for the same user to have multiple roles, but here we just logically define the three roles for analysis purposes.

Application development: This mainly refers to the application development role on the client side, which is responsible for designing the specific business SQL. They care about the correctness and efficiency of business SQL execution. Application development engineers need to use Web SQL to debug the query efficiency of their SQL statements; The query monitoring page is needed to view the performance and resource consumption of business SQL in actual execution scenarios; Workload queue monitoring is required to verify that the newly developed business SQL is in the appropriate workload queues, that the configured circuit breakers are appropriate, and so on.

SRE: refers to the huawei cloud side database operations roles, they usually need to responsible for the stable operation of hundreds of thousands of cluster, they need to be able to quickly identify the anomaly of the running state of the cluster, the cluster resource bottleneck and cluster latent capacity, and they need to actively respond to customer requests for help, help customers to locate, identify and solve problems. SRE requires node resource monitoring to identify resource skews in the cluster; It is necessary to identify the change trend of the baseline of cluster resource consumption, so as to identify the expansion needs and remind users; Need to pay attention to the storage changes to calculate the time point of the next routine maintenance and automatic planning; At the same time, it also needs to respond to user needs and use the problem location tools provided by DMS to assist users to locate existing network problems.

DBA: Refers to GausSDB (DWS) database cluster experts who are familiar with database design methodology, database tuning, and database problem localization. They need to analyze location database failures and use multiple tools to comprehensively analyze location system failures, system stability and potential bottlenecks from a resource and business perspective. It also needs to help users recommend database indexes, distributed column configuration, and recommend users to buy appropriate cluster size according to the user’s business level from the perspective of business and database design. It also needs to assist application development engineers in tuning SQL statements that cause performance degradation. After the exact root cause of the failure is identified, the appropriate solution is recommended to fix the problem.

Generally speaking, there are only two types of user roles in the public cloud scenario: application development and SRE. The SRE role in the public cloud scenario usually covers the role of DBA. In fact, the purpose of subdividing operation and maintenance roles here is to present a complete operation and maintenance scene sandtable, and to list the operation and maintenance demands of customers in different categories, so as to provide a basis for further function (tool) page design and operation and maintenance scene design.

GAUSSDB (DWS) database intelligent operation and maintenance indicators

The number of database monitoring indicators is large, the form and logic are complex, according to the type of indicators can be divided into logical relations and physical relations. Among them, the internal logical relations of the index library are logical relations. For example, the top layer is the database, in which there are multiple schemas, in which there are multiple tables, and in which there are multiple users. One user can have multiple schemas and tables. The physical relationship refers to the topological relationship of the GausSDB (DWS) cluster, for example, a database cluster is composed of multiple compute nodes, and each compute node will deploy multiple compute instances. These two index relations will affect the collection dimension and aggregation display dimension of the database index.

Since the dimensionality relationships of metrics have been analyzed above, we will discuss only the specific database metric types below and will not expand on the dimensions of metrics. Database is a software service, and it must run on a host machine and operating system, so the monitoring indicators can be roughly divided into two categories:

System resource indicators: This category of indicators mainly describes the consumption of various resources on the system

Database-related indicators:This category of metrics describes the level of business load associated with data performance

The figure above summarizes the main indexes of the database collected by DMS, and the specific index items are arranged according to three levels: the index category, the atomic index and the derived index. However, this indicator map is not fixed at present. In the future, with the gradual maturity of the intelligent operation and maintenance system of GausSDB (DWS), this indicator map will be gradually improved and fixed.

Because of the special configuration of the MPP database, the database instance is run on the node as a trial process. Therefore, our metric design actually has its own dimension attributes, such as disk utilization metrics, the smallest dimension should be a DN instance, one level above the node level, and the next level above that is the entire cluster. Therefore, the actual monitoring metric we provide should be a Cartesian product between the metric dimension relationship and the cluster indicator map. To describe this situation, we introduce the concepts of atomic index, derived index, and combinatorial index. As an example of the disk utilization above, we use the disk utilization of the DN instance as an atomic metric and the disk utilization of the other dimensions as a derived metric.

  • Atomic metric: The smallest dimensional metric that describes a particular feature of a database, such as CPU usage of nodes, disk usage of DN instances, and so on.
  • Derivative metrics :(1) aggregation of atomic metrics on different dimensions, such as cluster average CPU utilization, cluster disk utilization, and so on; (2) New indicators obtained by statistical operation on atomic indicators, such as CPU skew rate, etc.
  • Composite metrics: A combination of multiple atomic or derived metrics to produce a new metric that is easier to understand. Cluster health, etc.

The DMS index to build more stay at atomic index and derived indicators stage, because we think it is the basis of the first supplement the database index form basic monitoring operational ability, can be combined with the user habits, depth excavation index under the various dimensions of operational meaning as well as a variety of indicators represent operational significance after combination.


Finally, to sum up, this paper mainly introduces the design, planning and status quo of GAUSSDB (DWS) database intelligent monitoring operation and maintenance service system. As the first article of DMS series, this article mainly plays the role of a brief introduction, so that we have a general understanding of GAUSSDB (DWS) database intelligent monitoring operation and maintenance service system. More dry details are welcome to look forward to the following articles.

Want to know more information about GUASSDB (DWS), welcome WeChat search “GAUSSDB DWS” pay attention to WeChat public number, and you share the latest and most complete PB series silo black technology, background can also obtain many learning materials Oh ~

Click on the attention, the first time to understand Huawei cloud fresh technology ~