Summary: Due to the similarity of business attributes in exception SQL and the intricate relationships that affect and are affected, it can be challenging to understand the propagation of problem SQL to various resource exceptions. The DAS team continues to research and explore how to find abnormal SQL. In the process of exploration, we provide a new analysis function, SQL request behavior identification, to help users better locate SQL problems.

Business background:

DAS(Database Autonomy Service) protects the stable operation of millions of Database instances, and the most basic function of DAS is to accurately locate abnormal SQL in the process of Database operation. More than 90% of the problems of the database are from the abnormal requests of the database. Whether it is the massive transaction request behavior of the group on the Double 11 Day or the request behavior of the user’s business changes, the performance of the database is affected all the time. The self-driving car grasps the direction of the car by perceiving the change of the road condition image, while the self-driving database constantly fixes and optimizes various problems of the database by perceiving and recognizing the user’s request behavior, thus escorting the cloud database. The problem of how to locate the different scenarios of different database engines from the huge number of requests in the huge database has been a difficult problem for DBA for many years. In the field of recommendation, through the analysis of the behavior of the user to replace the mechanical web page display accurate recommendations to users expect text/video/product, enhance the user experience and product conversion, automatic driving platform as the next generation of database also needs to analyze user requests and user development business behavior, recommend the corresponding optimization of repair capacity and so on, Improve the efficiency of the automatic driving database, so that the database is faster, more stable and safer. Therefore, starting from the user request behavior and business behavior, data mining in the massive requests of massive database instances is a subject worthy of in-depth research. It is also the underlying technical ability that the database autonomous driving platform relies on very much, which supports the autonomous ability of DAS database autonomous service in all scenarios.

DAS has provided several L2 features for SQL data analysis over the years, including: Professional SQL Insight, Full SQL, Slow Logging, One-click Diagnostic, Lock Analysis, Sessions, and more. Each feature precipitates a DBA’s approach to a different problem from a different perspective, and diagnoses the problem slightly differently for different instances and different businesses. For users who are not familiar with DB operation and maintenance, DAS is providing a unified, efficient and simple way to help users locate problems. Combining with the multi-indicator characteristics of SQL slowing down, we proposed a method based on feature similarity matching VLDB 2020 to precipitate into the functions of the autonomous center. However, for the similarity of business attributes in abnormal SQL and the complicated relationship between the influence and the affected, It is a challenge to clarify the relationship between problem SQL and the propagation of abnormal phenomena in various resources. The DAS team still continues to research and explore how to find abnormal SQL. In the process of exploration, we provide a new analysis function, SQL request behavior identification, to help users better locate SQL problems.

Problem description:

The following figure is an example, the example CPU spike phenomenon, the database has a potential risk of CPU full, when the user’s request amount is less or the request SQL mode is less, through the sorting and screening of indicators is easy to find the problem SQL, but when the user’s full amount of SQL template more than tens of thousands or even hundreds of millions, The user cannot locate the exception SQL quickly from the current DAS page. We need to provide a more efficient way to locate the exception request with more data.

When users use the DAS Professional SQL Insight feature, even if we stream the full amount of SQL, compress and aggregate it into templates, the number of templates is amazing. We can see a large number of templates with similar characteristics and trends. So if we compress the template further based on the SQL request behavior, the user can better locate the problem of abnormal SQL

At present, the functions of DAS products and other products such as AWS Azure in the industry all have the preliminary ability to locate abnormal SQL. By sorting the collected SQL data in various dimensions, users can locate database problems by themselves. This method is effective for more than 80% of simple database problems. However, the results are poor in complex business scenarios and database problems that are difficult for DBA to locate. Take the metadata cluster instance under the internal control of Ali Cloud as an example, this year, the average monthly occurrence of more than 10 CPU full problems, several performance related failure problems throughout the year, but each problem is different, sometimes DBA can only find the phenomenon, it is difficult to quickly locate the root cause of the problem. Therefore, through the analysis of user request behavior, the DAS database autonomy service product will be better iterated, the database performance problem of our complex scenarios will be solved, and the stability, usability and efficiency of each engine of the whole database will be improved.

Industry products:

AWS: RDS: Performance Insight has the same function as the current DAS product, the data dimensions collected are similar, and the abnormal SQL positioning is carried out by Top N Ranking, without the function of SQL request behavior analysis





Azure: Query Performance Insight

By taking Top N to locate the SQL request, it can locate 60% of the obvious problems, but it cannot locate the database problems of the complex business of SQL request, and there is no SQL request behavior analysis function





Tencent Cloud: DB Brain function is similar to the existing function of DAS, without SQL request behavior analysis function

Huawei Cloud: Database Admin Service, similar to the current DAS function, no SQL request behavior analysis function

Challenges & Difficulties

Challenges:

Scale challenges:

The sea of performance issues in the sea of queries from the sea of the databases

The user’s business request is rich, how to locate the performance problem of multiple database engines from the massive request in the massive database instance.

Monitoring Diagnostic Challenges:

724 real time anomaly detection => 724 root cause analysis in near real time

In order to solve the database performance problems caused by potential SQL requests, the root cause localization needs to be done in near real time.

Complex database exceptions:

Exception indicators are usually related to multiple SQL requests, which cannot be explained by a single SQL and affect the SQL requests of multiple businesses. The associated problems include full table scan/index/lock problem/cache breakdown/kernel problem, etc. 1. Several problems in the index phenomenon exist similarities and differences:

Artificial root cause localization:

To help DBA or users solve performance problems, work order problems to help back-end developers reasonably arrange the process of request query, try to break up resource-intensive requests from a business perspective to help DBA find the direct and indirect relationship between different requests at the business level.

Enable autonomous services:

At the same time, we must Limit the amount of SQL more meticulous to Forecast the workload more accurate. Workload can be well-partitioned to better estimate the resource benefits of autonomous actions: Estimate the SQL Resource Cost for Autonomous Actions in the first place to resolve potential performance issues: Crack the potential performance issue at the first place

DAS solution:

Enlightening ideas:

In the development process of many back-end applications, the design of back-end architecture tends to ensure the idempotency of interfaces. For example, in order to solve the timeout problem, a retry mechanism is usually introduced. Sometimes duplicate data is requested, messages are consumed, and duplicate data is read. For example, multiple inserts or updates may result in data errors.

To address these idempotent approaches, the back end often uses methods such as SELECT then INSERT, pessimistic/optimistic/distributed locking, or managing stateful business according to state machines.

Example payment scenario state machine:

.

update bill set status=1 where id=520 and status=0;

Order behavior SQL A

update bill set status=2 where id=520 and status=1;

Payment behavior SQL B

update bill set status=3 where id=520 and status=2;

Cancel the order behavior SQL C

.

So the same business process is accompanied by multiple SQL requests, serially or in parallel, which means that there is a correlation between the execution trends of the SQL, which is relevant to the business. When we find the business anomaly, it is accompanied by the index anomaly, so when we locate the abnormal SQL, there will be abnormal phenomenon in the SQL of the same business. Therefore, we can carry out the clustering of massive SQL data through the algorithm based on these trend characteristics of SQL. Therefore, we thought that by analyzing the homology of SQL and locating abnormal SQL from the perspective of business, we could locate abnormal SQL more efficiently



Process framework:



Perceptual process:

In the process of diagnosis, the DAS backend first requests Data from the Dataset Layer, performance Data (PERF Data) and SQL request Data (SQL Query Data). Performance data is detected by MTS Anomaly Detection/Feature Extraction

Exception request localization process:

Example:

X:{sql_a, sql_b, sql_c} ==> affects mysql.cpu_usage

==> SQL set (CPU_TIME ratio)

Y: {SQL_I, SQL_J, SQL_K} ==>

==> SQL Collect Impact (Extras Session Proportion)

After perceiving the abnormal timing index, the perception layer uses the template-processed data of the full amount of SQL and the Graph Based clustering method to classify the massive SQL according to the characteristics of the request Behavior. Finally, it ranks the Query Behavior Ranking according to the contribution rating of the request Behavior after aggregation. Detecting abnormal requests and their effects on performance metrics.

Root cause analysis process:

Example: rotten SQL template SQL_I –> causes lock wait –> affects mysql.rows_lock_wait_time

> blocks SQL in Y set --> blocks session --> blocks CPU intensive SQL in X set of Y set --> blocks CPU surge

After the index anomaly is explained by SQL, there are still many failure problems that we cannot accurately locate, such as primary and standby delay, locking problems,OOM, kernel problems, etc. These problems may lead to the increase of SQL execution time, and in turn, SQL may also produce these problems.

Anomaly Propagation Analysis (Anomaly Propagation Analysis) helps us to analyze the Propagation relations among these phenomena. In this analysis, we made a comprehensive comparison based on the time sequence relationship and our historical case data. Finally, we added the obtained anomaly propagation chain, the whole DAS analysis process and suggestions to the case library at the back end and further detailed the case model. The Case Model will constantly overlay and adjust the matching parameters according to the feedback to give more accurate suggestions.

Exception SQL localization cases based on request behavior identification:

Active_session spike problem:

There are abnormal spikes in the active session of the database example in the figure below. Such spikes last for too long and may cause potential problems for some sensitive businesses. We want to locate the causes of the spikes. For CPUs, detection of active session exceptions is indicated by a yellow exception event.





Active sessions are usually strongly correlated with total execution time. Select the corresponding metric through SQL request behavior analysis and click Analysis





Find the metrics that are similar to the session, and click View, sorted by total elapsed time, to find the exception SQL that “contributes” the most to the session exception

Click on the corresponding SQL_ID to look at the details. By ranking the results of the trend behavior, you can clearly see that the SQL is getting slower compared to the historical trend. By executing the trend, you can see that the exception trend is completely different from the historical trend and coincides with the trend of active session exceptions

The number of SQL executions increased dramatically (from 1000 to over 8000), causing other SQL executions to slow down, causing active session stacking and resulting in a spike in the active_session metric

CPU usage (cpu_usage) :

The following database instance CPU is full



Except for SQL design CPU-intensive computations such as JOIN, which are more expensive, in most cases, CPU is positively correlated with the number of rows scanned, the SQL request behavior analysis selection, CPU_USAGE, and the total number of rows scanned



We can easily locate the CPU-associated metrics





Final positioning: This SQL full table scan caused the CPU to be full and caused the session to stack

Future plans

The DAS will support more engines for real-time detection and exception localization, and the Professional Edition will help more users locate more types of database instance problems with the full amount of SQL from the user. Not only let the professional DBA better use DAS to control the database instance, but also let the beginners in the database field control the database without the threshold, truly ensure that the database instance self-perception, self-optimization, self-repair.

This article is the original content of Aliyun, shall not be reproduced without permission.