Abstract: Alibaba CloudDBA is mainly divided into offline analysis and online analysis of two functions. Help users save costs, locate problems, analyze causes and recommend solutions. CloudDBA provides real-time diagnostics, offline diagnostics, and SQL optimization. And through the MySQL parameter tuning, detection of unreasonable parameters or preparation delay.

Brief Introduction of speakers:

Xunchen, technical expert of Ali Cloud RDS kernel team, currently development of Ali Cloud CloudDBA expert system. Have rich experience in database development management and optimization.

This live video highlights, poke here!

PPT download links: click.aliyun.com/m/51146/

The following content is compiled according to the video sharing and PPT of the speakers.

This sharing mainly focuses on the following three aspects:

What does CloudDBA offer

Ii. Core competence

Three, typical practice application

What does CloudDBA offer

CloudDBA provides two main features, one for offline analytics and the other for online analytics. As we know, the daily work of DBAs is mainly divided into two parts, one is group check, and the other is online response. For example, the application of my database is suddenly stuck, or the performance of the database is jitter. These problems need the REAL-TIME response of DBAs. Oracle includes two reports, one is AWR report, and another is called ASH report, which is similar to Oracle in terms of function. The offline analysis is mainly AWR reports, followed by the online response is ACTIVE SESS HISTORY.

CloudDBA is a sass-based value-added service based on THE PaaS platform. SASS on the cloud needs to solve performance problems, diagnose problems, and provide ancillary tools. The database on the cloud is a little different from the self-built database. If the database is in the cloud, the PaaS layer of the working cloud will help solve the problem. For example, performance monitoring, HA and so on have been done. What dbAs really need to do is the upper layer, which is how to make the database run better, let users use the database well.

Whether it’s in the cloud or a self-built database, the cost of the database itself is actually very visible and very low. For those of you who are DBAs, the amount of effort spent from preparation to database launch is actually limited. But the real difficulty is how to manage the database well? Because we provide support for platform applications that make products, if users have bad habits, it is easy to break our database, and the whole business will be affected. So as you can see from the figure below, our database has a large maintenance cost, probably more than 80%. Of course, DBA is mainly to solve some problems in the application and save time and cost. For example, if the user reports that the application is stuck, for a DBA you need to log in to the database, go to the console and watch the animation, and see what’s going on? The actions are actually quite repetitive and mechanical. If CloudDBA is available, it has its own little scripts, such as locating problems, quickly entering usernames and passwords, grabbing the status and making judgments based on the status. That’s fine, but there’s a better way to do it, and deliver it as a product and as a service. When an application gets stuck, the user only needs to click a button, and the product can capture the state, analyze where the database is stuck, and suggest solutions for the next step. Even in most scenarios, commands will be generated for the user to directly copy and execute.

Ii. Core competence

1. Real-time diagnosis

We will productize, program, and record DBA experience into a database. Output the diagnostic results. We often find in our daily work that the same problem can be solved differently by different DBAs. It even said that one student met a problem when he was on duty and knew how to solve it, while another student pointed out that the class did not meet the problem, and when it happened again after a long time, everyone might forget how to solve the problem. So this is where you need to precipitate, productize, servize your work experience, and then input it. We put the solution to the problem. Skills, experience input into the Knowledge Base, which is a diagnostic program, with continuous input experience, the Knowledge Base will become richer and richer. The results are divided into symptom description, cause description, and diagnosis suggestions.

2. Offline diagnosis

Offline diagnosis is state-based, in-depth analysis, mining Top SQL to see which SQL is executed the most times, longest, and takes the most time. There are also things to analyze, to see if things make sense, and SQL Review. Because we are DBAs, if we don’t have a powerful tool to regulate developer behavior, the tool will be worn down sooner or later. In the early days, a spec was issued to developers, requiring search statements to be written in the spec or else something bad would happen. But without a tool constraint and specification, it is impossible for every development team to look at every specification statement. There is deadlock analysis.

3. The SQL optimization

The MySQL optimizer is certainly not as good as Qracle, and we often hear that it does not perform very well and that the join order of tables is not optimal. For example, if a table has an index on it and the index fails, we all know that an index fails when the fields don’t match. Our tool will help us add a function after the field. For example, if you have a transaction table, there is a field on the transaction table that takes time to get, because the current time is at least accurate to the second. Many developers add the date function directly to get, which equals a specific day. However, if you use Oracle or SQL Server3 database, there is no problem, the DBA will add a functional index to you. But if you use MySQL, and the version before 5.7 is no way, the real writing method is greater than or equal to the beginning of the day and less than or equal to the end of the day, should be within 24 hours of the day can be identified. There is also the rewriting of calculation costs. We will dynamically sample the standby database. For example, a query has no index, but has multiple fields. We’re going to dynamically sample the standby library, look at the distribution of the data on those columns, and then generate the optimal field order, the optimal index. Dynamic sampling is used because it is not possible to look at all the index orders that several fields have. This piece of content can go to the official website of Ali Cloud search, there are a lot of very detailed information and video.

Best practices

It’s not uncommon to see users upgrade specifications and then pressure test them, only to find performance drops after upgrading specifications. For example, 4C32G became 8C62G, and throughput decreased. Locate the cause of performance degradation by using the diagnosis report TOP SQL. Truncate execution time is slow. Why is it slow? As the memory of the table becomes larger and the number of pages in the memory becomes larger, MySQL truncate needs to drop the pages into the file before it is truncate. Our tool can quickly locate the cause statement. Next, the parameter of MySQL Max Present should be reduced to control the blocks within a certain range.

Another problem is that the user said the pressure jitter would appear every half hour, find out what the reason is. Because when the user raises this question, jitter occurs in a few days ago or after a few hours. Therefore, we recommend that users turn on CloudDBA so that we can track the data in their console. The following figure shows the diagnostic report obtained using TOP SQL to know when jitter occurs.

The links are also divided into different scenes. The first is the occurrence of a lock, which is the most common, which is to KILL the lock session. The second is when the business overhead DDL operation is performed, this is also very easy to solve, we will help users locate out. There is also a problem with the connection of the application and it is not closed. For example, when Java JDBC is turned on and not turned off, we can also recognize it. We advise users to use connection pooling and close their connections as soon as possible. The other one, which is neither MySQL heap nor lock, also uses connection pooling normally, may be too small and too stressful. If the specification cannot be upgraded, the application must be curbed.

When the connection is full, CloudDBA helps identify and terminate the session.

CloudDBA can help identify and optimize when the CPU reaches 100%

In addition to the above scenarios, Ali also made some parameter optimizations. MySQL has a very large number of parameters, and unreasonable parameters or delays in preparation can be detected by CloudDBA.

CloudDBA is a dynamic cleansing product that we are constantly updating. We will contact the work order system of Ali Cloud, and the work order processed by them will be reversed to our side. We will absorb and digest some of them to see which can be integrated through the program. RDBA will be embedded in the console of RDS database, and users can use it for free.

The original link