Building and Practicing Intelligent Large-Scale Operations for Tencent Cloud Databases

Welcome to the Tencent Cloud+ Community, where you can find more hands-on technical articles from Tencent's large-scale engineering practice.

Lu Yue is the head of the Tencent Cloud database architect team. He is mainly responsible for pre-sales architecture, operations, and tuning for Tencent Cloud database products such as MySQL, Redis, and Oracle. He previously worked at NetEase and Nibiru.

Tencent Cloud's experience with large-scale database operations is covered in three parts:

Building a team of database architects
Building an automated operations platform
Intelligent large-scale operations in practice

Building a team of database architects

Motivation

Database products are specialized and complex, and we regularly run into problems when serving customers. Our customers come from every industry and have different scenario requirements, so they use databases in very different ways. Our pre-sales architects may not be deeply familiar with how databases are applied in each industry or with the architectures that different customer needs call for, so they cannot always recommend the optimal architecture.

On the other hand, we have a large number of customers, and some of our after-sales service engineers are not deeply proficient in such complex database products. They cannot fully cover difficult problems, or their communication with customers is not smooth enough, so service quality needs to improve.

For these two reasons, we formed a team of database architects.

Division of labor

After the architect team was set up, our database product service system took on three layers. The first layer is operations, which handles work related to platform stability. The second layer is the architects, who sit in the middle and handle the key issues, including building databases and operations tools. The third layer is the front-line service engineers, who handle most consulting and process questions.

The architect team's current work covers four areas:

Customer operations: alongside data operations, communicating more with customers, including following up on the problems they encounter when using the database;
Solutions: including basic solutions and industry solutions;
Service system: including platform operations for the basic products;
Platform construction: including the customer operations platform, solution delivery, and the supporting service system.

As part of building the operations platform, we made a CDB WeChat assistant that supports both active push and passive pull and helps our front-line colleagues serve customers better.

Building an automated operations platform

To serve customers better and improve service quality, a database architect team and an after-sales service system are not enough. We also need a very stable automated operations platform underneath them, which is why such a platform is necessary. The platform currently manages more than 100,000 instances and more than 20,000 physical machines, so its stability requirements are very high.

Functions

Resource management: including quadrant upper limit of instances, management and deployment of physical machines, etc.
Operation and maintenance operations: including upgrade and upper limit;
Monitoring: covers database performance monitoring, including metrics such as QPS and CPU, and availability monitoring;
Self-healing: common database problems, such as replication exceptions, are detected automatically, and the platform proactively recovers from them.
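
The self-healing item can be illustrated with a minimal sketch. It assumes the replica's status has already been collected into a dictionary whose keys mirror fields of MySQL's SHOW SLAVE STATUS output (Slave_IO_Running, Slave_SQL_Running, Seconds_Behind_Master). The function name, the lag threshold, and the action names are hypothetical and are not taken from Tencent's platform.

```python
from typing import Optional

MAX_LAG_SECONDS = 300  # hypothetical threshold, not from the article


def plan_recovery(status: dict) -> Optional[str]:
    """Return a recovery action for a replica, or None if it looks healthy.

    `status` is assumed to hold fields from SHOW SLAVE STATUS.
    The action names are illustrative labels only.
    """
    io_ok = status.get("Slave_IO_Running") == "Yes"
    sql_ok = status.get("Slave_SQL_Running") == "Yes"
    lag = status.get("Seconds_Behind_Master")

    if not io_ok:
        # The I/O thread stopped; restarting it is a common first step.
        return "restart_io_thread"
    if not sql_ok:
        # A stopped SQL thread usually means a replay error that needs review.
        return "escalate_to_engineer"
    if lag is None or lag > MAX_LAG_SECONDS:
        return "raise_lag_alarm"
    return None
```

A real platform would map each action to an audited workflow; the sketch only shows the detect-then-decide step.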

Architecture

The overall architecture of the automated operations platform has three layers from bottom to top. The bottom layer is the APP entrance, that is, our client. The middle layer is the customer entrance. The top layer is the platform backend.

Between the client and the APP layer there are two data paths: MySQL backups are stored in a cluster, and performance monitoring data is uploaded to a module that displays our monitoring in real time. On the customer side, you can access the automated operations management platform through the official website or the API and perform operations such as resource management, instance management, and data transmission. Beyond these, the platform also provides alarm monitoring, operations reports, and holographic monitoring. Every operation on the platform is recorded in the operations database, which supports back-end analysis, including big data analysis.

Monitoring module

The automated operations platform contains many modules. Here I will focus on the monitoring module. As mentioned earlier, it has two parts: performance monitoring and availability monitoring.

As shown in the figure above, our monitoring module consists of two main lines: the DB master and the dial-test server (dial Svr). If the dial Svr detects an exception on an instance, it sends this information to the DB master. After receiving the report, the DB master verifies over a long-lived connection whether the exception is real.
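
The verification step can be sketched as a confirm-before-alarm check. This is a minimal illustration, not the DB master's actual logic: `probe` stands in for a health check issued over the long-lived connection, and the function name and retry count are assumptions.

```python
from typing import Callable


def confirm_exception(probe: Callable[[], bool], retries: int = 3) -> bool:
    """Return True only if every re-check fails.

    `probe` is a hypothetical callable that returns True when the instance
    responds normally. One success is enough to treat the report as a
    false positive.
    """
    for _ in range(retries):
        if probe():
            return False
    return True
```

Requiring several consecutive failures trades a little detection latency for fewer false alarms caused by transient network problems.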

CDB instance performance is monitored mainly through the cDB_Report module, which pulls real-time performance data, summarizes it together with CDB alarms, and feeds it into our Apd Netman module, a relatively important component.

In an ordinary scenario, dial testing is relatively simple. With a very large number of instances, however, two problems arise. The first is the performance of the dial Svr: whether dial-test requests can be sent successfully and on time for so many instances. If the dial Svr performs poorly, we are forced to lengthen the interval between dial tests, and we may not discover instance problems in time. The second is the availability of the dial Svr itself: if it is a single point and it fails, we cannot know the state of any instance, which is a very dangerous situation.
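
The performance problem comes down to simple arithmetic. The sketch below assumes each probe takes a fixed time and that every prober works through its instances sequentially; the 50 ms probe time and 5 s interval are illustrative values, not measurements from the platform.

```python
def probers_needed(instances: int, probe_ms: int, interval_ms: int) -> int:
    """Sequential probers required to test every instance once per interval."""
    total_ms = instances * probe_ms
    # Integer ceiling division avoids floating-point rounding.
    return (total_ms + interval_ms - 1) // interval_ms


def achievable_interval_ms(instances: int, probe_ms: int, probers: int) -> float:
    """Shortest interval that `probers` sequential probers can sustain."""
    return instances * probe_ms / probers
```

With 100,000 instances and 50 ms per probe, a single sequential prober needs 5,000 seconds for one full round, while a 5-second interval requires 1,000 probers working in parallel under these assumptions.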

Based on these two problems, we consider the following three optimization objectives when designing the dial Svr for massive scenarios:

Based on these three optimization objectives, we built the dial Svr architecture shown in the figure below. This node sends the instances to the pingSvr nodes, which are the nodes that actually carry out the dial tests. After a dial test, the pingSvr node stores failure results in the DB, where an alarmChecker reads them in real time and raises alarms. All results, whether successful or not, are also written out, pulled by other modules, and stored in the database in real time. Each of these nodes has a disaster recovery (DR) deployment.
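
The flow described here (distribute instances to pingSvr nodes, record failures, let an alarmChecker read them) can be sketched as follows. This is a simplified model under stated assumptions: hash-based sharding is my choice for illustration, an in-memory list stands in for the DB, and all function names are hypothetical.

```python
import zlib
from typing import Callable, Dict, List


def assign_instances(instance_ids: List[str], ping_nodes: List[str]) -> Dict[str, List[str]]:
    """Shard instances across the ping nodes that are currently alive."""
    if not ping_nodes:
        raise ValueError("at least one ping node is required")
    assignment: Dict[str, List[str]] = {node: [] for node in ping_nodes}
    for instance_id in instance_ids:
        # crc32 is stable across processes, unlike the built-in hash().
        index = zlib.crc32(instance_id.encode("utf-8")) % len(ping_nodes)
        assignment[ping_nodes[index]].append(instance_id)
    return assignment


def run_dial_round(
    assignment: Dict[str, List[str]],
    probe: Callable[[str], bool],
    failure_store: List[dict],
) -> None:
    """Each ping node probes its shard and records only the failures."""
    for node, instance_ids in assignment.items():
        for instance_id in instance_ids:
            if not probe(instance_id):
                failure_store.append({"node": node, "instance": instance_id})


def alarm_checker(failure_store: List[dict]) -> List[str]:
    """Return the distinct instances that should trigger an alarm."""
    return sorted({row["instance"] for row in failure_store})
```

If a ping node fails, calling `assign_instances` again with the surviving nodes redistributes every instance, which is one simple way to avoid the single point of failure discussed earlier.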

Intelligent large-scale operations in practice

Through practice and reflection, we found that in large-scale operations our automated operations platform cannot solve the following problems:

Customized services: databases are used very differently across industries and scenarios, and we can in fact provide customized, optimized database services for each scenario;
Automatic diagnosis and tuning of database problems.

Tencent is therefore developing an intelligent product that profiles a customer's database application scenarios, through data mining or through communication between architects and customers, in order to deliver customized services.

Customized services

Based on data mining results and our communication with customers, we can group the distinctive usage patterns into four types:

Computing applications: for example, a reporting application that needs heavy computing resources during certain periods;
Storage applications: for example, an application that stores historical data;
Traffic applications: applications that pull large amounts of data;
Hot applications: for example, a news app or a red envelope app, where the gap between peak and off-peak load can be very sharp.

For these four usage patterns, we can provide customized services, as shown in the figure below:

For computing applications such as BI reports, the workload typically runs in the early morning, when the machine as a whole is relatively idle. We can therefore optimize for this scenario by allowing such workloads to use resources beyond their allocation during idle periods.

For storage applications, where customers care mainly about total capacity usage, we can provide a compression engine that automatically loads and compresses the data.

For traffic applications, the demands on SQL quality can be very high: if one or two SQL statements perform poorly, the entire database may be affected. For this kind of application, we can provide a self-service SQL optimization tool to help customers optimize their SQL.

For hot applications, we can dynamically upgrade an instance to a higher configuration before the business peak.
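
The mapping from usage pattern to customized service can be sketched as a simple rule-based classifier. The four types and their services come from the text above; the input metrics, the peak-to-trough threshold, and the function names are hypothetical, and a real system would derive the profile from data mining rather than fixed rules.

```python
HOT_RATIO = 10.0  # hypothetical peak-to-trough ratio that marks a "hot" application

SERVICE_BY_TYPE = {
    "computing": "allow extra resource use during idle periods",
    "storage": "compression engine",
    "traffic": "self-service SQL optimization tool",
    "hot": "upgrade configuration before the business peak",
}


def classify_workload(cpu: float, storage: float, traffic: float, peak_to_trough: float) -> str:
    """Pick one of the four usage patterns from normalized pressure metrics.

    cpu, storage, and traffic are assumed to be utilization ratios in [0, 1].
    """
    if peak_to_trough >= HOT_RATIO:
        return "hot"
    pressures = {"computing": cpu, "storage": storage, "traffic": traffic}
    # The dominant resource pressure decides the type.
    return max(pressures, key=pressures.get)


def recommend_service(cpu: float, storage: float, traffic: float, peak_to_trough: float) -> str:
    """Return the customized service matching the classified usage pattern."""
    return SERVICE_BY_TYPE[classify_workload(cpu, storage, traffic, peak_to_trough)]
```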

Automatic database tuning

For the automatic database tuning problem mentioned above, we can use real-time analysis and predictive analysis to compute a quality score for each instance.

Predictive analysis uses historical data to project an instance's trend over the next two to three months. This analysis can also be customized. For the computing applications mentioned above, for example, if the customer is sensitive to CPU, we can increase the weight of CPU usage in the analysis. If the customer is sensitive to storage and space utilization, we can adjust the storage indicators accordingly. Combining this predictive analysis with big data and AI models, we can produce an instance score that is used either to optimize the database automatically or to recommend optimizations.
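
A weighted score and a trend projection can be sketched as follows. This is a deliberately simple stand-in: the article mentions big data and AI models, whereas the sketch uses a weighted average and an ordinary least-squares line. The scoring direction (higher means healthier), the metric names, and the function names are assumptions.

```python
from typing import Dict, List


def instance_score(utilization: Dict[str, float], weights: Dict[str, float]) -> float:
    """Score from 0 to 100, where higher means more headroom.

    utilization values are ratios in [0, 1]; a larger weight makes the
    score more sensitive to that metric (for example CPU).
    """
    total_weight = sum(weights.values())
    load = sum(weights[name] * utilization[name] for name in weights) / total_weight
    return round(100 * (1 - load), 2)


def linear_forecast(history: List[float], steps_ahead: int) -> float:
    """Extrapolate a least-squares line fitted to evenly spaced samples."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + steps_ahead)
```

Raising the CPU weight for a CPU-sensitive customer lowers the score of a CPU-bound instance, which matches the customization described above.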

Question and answer

How do you roll back a cloud database?

Related reading

AI technology practice of Tencent Cloud CDB: CDBTune

How to operate the Redis system of 100 million QPS

Tencent Cloud CDB: in-depth analysis of MySQL binlog

This article is published by the Tencent Cloud+ Community with the author's authorization. Original link: https://cloud.tencent.com/developer/article/1146948?fromSource=waitui
