New SQL request behavior recognition functionality to help solve the needle in the haystack problem of abnormal SQL detection

Brief introduction:Because of the similarity of business attributes in the exception SQL and the intricate relationships between the affected and the affected, it can be challenging to figure out the propagation of problem SQL and the various resource exceptions. The DAS team continues to research and explore how to find abnormal SQL. In the process of exploration, we provide a new analysis function, SQL request behavior identification, to help users better locate SQL problems.

Business background:

DAS(Database Autonomy Service) protects the stable operation of millions of Database instances, and the most basic function of DAS is to accurately locate abnormal SQL in the process of Database operation. More than 90% of the problems of the database are from the abnormal requests of the database. Whether it is the massive transaction request behavior of the group on the Double 11 Day or the request behavior of the user’s business changes, the performance of the database is affected all the time. The self-driving car grasps the direction of the car by perceiving the change of the road condition image, while the self-driving database constantly fixes and optimizes various problems of the database by perceiving and recognizing the user’s request behavior, thus escorting the cloud database. The problem of how to locate the different scenarios of different database engines from the huge number of requests in the huge database has been a difficult problem for DBA for many years. In the field of recommendation, through the analysis of the behavior of the user to replace the mechanical web page display accurate recommendations to users expect text/video/product, enhance the user experience and product conversion, automatic driving platform as the next generation of database also needs to analyze user requests and user development business behavior, recommend the corresponding optimization of repair capacity and so on, Improve the efficiency of the automatic driving database, so that the database is faster, more stable and safer. Therefore, starting from the user request behavior and business behavior, data mining in the massive requests of massive database instances is a subject worthy of in-depth research. It is also the underlying technical ability that the database autonomous driving platform relies on very much, which supports the autonomous ability of DAS database autonomous service in all scenarios.

DAS has provided several L2 features for SQL data analysis over the years, including: Professional SQL Insight, Full SQL, Slow Logging, One-click Diagnostic, Lock Analysis, Sessions, and more. Each feature precipitates a DBA’s approach to a different problem from a different perspective, and diagnoses the problem slightly differently for different instances and different businesses. For users who are not familiar with DB operation and maintenance, DAS is providing a unified, efficient and simple way to help users locate problems. Combining with the multi-indicator characteristics of SQL slowing down, we proposed a method based on feature similarity matching VLDB 2020 to precipitate into the functions of the autonomous center. However, for the similarity of business attributes in abnormal SQL and the complicated relationship between the influence and the affected, It is a challenge to clarify the relationship between problem SQL and the propagation of abnormal phenomena in various resources. The DAS team still continues to research and explore how to find abnormal SQL. In the process of exploration, we provide a new analysis function, SQL request behavior identification, to help users better locate SQL problems.

Problem description:

The following figure is an example, the example CPU spike phenomenon, the database has a potential risk of CPU full, when the user’s request amount is less or the request SQL mode is less, through the sorting and screening of indicators is easy to find the problem SQL, but when the user’s full amount of SQL template more than tens of thousands or even hundreds of millions, The user cannot locate the exception SQL quickly from the current DAS page. We need to provide a more efficient way to locate the exception request with more data.

When users use the DAS Professional SQL Insight feature, even if we stream the full amount of SQL, compress and aggregate it into templates, the number of templates is amazing. We can see a large number of templates with similar characteristics and trends. So if we compress the template further based on the SQL request behavior, the user can better locate the problem of abnormal SQL

At present, the functions of DAS products and other products such as AWS Azure in the industry all have the preliminary ability to locate abnormal SQL. By sorting the collected SQL data in various dimensions, users can locate database problems by themselves. This method is effective for more than 80% of simple database problems. However, the results are poor in complex business scenarios and database problems that are difficult for DBA to locate. Take the metadata cluster instance under the internal control of Ali Cloud as an example, this year, the average monthly occurrence of more than 10 CPU full problems, several performance related failure problems throughout the year, but each problem is different, sometimes DBA can only find the phenomenon, it is difficult to quickly locate the root cause of the problem. Therefore, through the analysis of user request behavior, the DAS database autonomy service product will be better iterated, the database performance problem of our complex scenarios will be solved, and the stability, usability and efficiency of each engine of the whole database will be improved.

Industry products:

AWS: RDS： Performance Insight

Same as the current DAS product function, the collected data dimension is similar, and the abnormal SQL positioning is carried out by Top N Ranking, without SQL request behavior analysis function

Azure: Query Performance Insight

By taking Top N to locate the SQL request, it can locate 60% of the obvious problems, but it cannot locate the database problems of the complex business of SQL request, and there is no SQL request behavior analysis function

Tencent Cloud: DB Brain function is similar to the existing function of DAS, without SQL request behavior analysis function

Huawei Cloud: Database Admin Service, similar to the current DAS function, no SQL request behavior analysis function

Challenges & Difficulties

Challenges:

Scale challenges:

The sea of performance issues in the sea of queries from the sea of the databases

The user’s business request is rich, how to locate the performance problem of multiple database engines from the massive request in the massive database instance.

Monitoring Diagnostic Challenges:

7*24 real time anomaly detection => 7*24 root cause analysis in near real time

In order to solve the database performance problems caused by potential SQL requests, the root cause localization needs to be done in near real time.

Complex database exceptions:

Exception indicators are usually related to multiple SQL requests, which cannot be explained by a single SQL and affect the SQL requests of multiple businesses. The associated problems include full table scan/index/lock problem/cache breakdown/kernel problem, etc. 1. Several problems in the index phenomenon exist similarities and differences:

Artificial root cause localization:

Help DBA or user solve performance problem, workorder problem

Help back-end developers arrange the process of request queries, and try to break up resource-intensive requests from a business perspective

Help DBAs find direct and indirect relationships at the business level between different requests.

Enable autonomous services:

Limit anomalous SQL more meticulous flow Limit

Forecast workload more accurate

A: Workload can be well-partitioned

Better Estimate the Resource Cost for autonomous actions: Estimate the SQL Resource Cost for autonomous actions

Solve the potential performance issue at the first place

DAS solution:

Enlightening ideas:

In the development process of many back-end applications, the design of back-end architecture tends to ensure the idempotency of interfaces. For example, in order to solve the timeout problem, a retry mechanism is usually introduced. Sometimes duplicate data is requested, messages are consumed, and duplicate data is read. For example, multiple inserts or updates may result in data errors.

To address these idempotent approaches, the back end often uses methods such as SELECT then INSERT, pessimistic/optimistic/distributed locking, or managing stateful business according to state machines.

Example payment scenario state machine:

update \`bill\` set status=1 where id=520 and status=0;

Order behavior SQL A

update \`bill\` set status=2 where id=520 and status=1;

Payment behavior SQL B

update \`bill\` set status=3 where id=520 and status=2;

Cancel the order behavior SQL C

So the same business process is accompanied by multiple SQL requests, serially or in parallel, which means that there is a correlation between the execution trends of the SQL, which is relevant to the business. When we find the business anomaly, it is accompanied by the index anomaly, so when we locate the abnormal SQL, there will be abnormal phenomenon in the SQL of the same business. Therefore, we can carry out the clustering of massive SQL data through the algorithm based on these trend characteristics of SQL. Therefore, we thought that by analyzing the homology of SQL and locating abnormal SQL from the perspective of business, we could locate abnormal SQL more efficiently

Process framework:

Perceptual process:

In the process of diagnosis, the DAS backend first requests Data from the Dataset Layer, performance Data (PERF Data) and SQL request Data (SQL Query Data). Performance data is detected by MTS Anomaly Detection/Feature Extraction

Exception request localization process:

Example:

Mysql :{SQL \_a, SQL \_b, SQL \_c} ==> mysql. CPU \_usage

==>sql 集合的影响程度 (推算cpu\_time占比)

Y: {SQL \_i, SQL \_j, SQL \_k} ==>

==> SQL Collect Impact (Extras Session Proportion)

After perceiving the abnormal timing index, the perception layer uses the template-processed data of the full amount of SQL and the Graph Based clustering method to classify the massive SQL according to the characteristics of the request Behavior. Finally, it ranks the Query Behavior Ranking according to the contribution rating of the request Behavior after aggregation. Detecting abnormal requests and their effects on performance metrics.

Root cause analysis process:

Example:

SQL \_i –> causes lock wait –> affects mysql.rows\_lock\_wait\_time

> causes SQL in template Y set to block > causes session spike

CPU-intensive SQL in set X of set Y where > is blocked

After the index anomaly is explained by SQL, there are still many failure problems that we cannot accurately locate, such as primary and standby delay, locking problems,OOM, kernel problems, etc. These problems may lead to the increase of SQL execution time, and in turn, SQL may also produce these problems.

Anomaly Propagation Analysis (Anomaly Propagation Analysis) helps us to analyze the Propagation relations among these phenomena. In this analysis, we made a comprehensive comparison based on the time sequence relationship and our historical case data. Finally, we added the obtained anomaly propagation chain, the whole DAS analysis process and suggestions to the case library at the back end and further detailed the case model. The Case Model will constantly overlay and adjust the matching parameters according to the feedback to give more accurate suggestions.

Exception SQL localization cases based on request behavior identification:

Positioning session (active\_session) spike problem:

There are abnormal spikes in the active session of the database example in the figure below. Such spikes last for too long and may cause potential problems for some sensitive businesses. We want to locate the causes of the spikes. For CPUs, detection of active session exceptions is indicated by a yellow exception event.

Active sessions are usually strongly correlated with total execution time. Select the corresponding metric through SQL request behavior analysis and click Analysis

Find the metrics that are similar to the session, and click View to sort them by total elapsed time to find the exception SQL that contributes the most to the session exception.

Click on the corresponding SQL\_ID to look at the details. By ranking the results of the trend behavior, you can clearly see that the SQL is getting slower compared to the historical trend. By executing the trend, you can see that the exception trend is completely different from the historical trend and coincides with the trend of active session exceptions

The number of SQL executions increased dramatically (from 1000 to 8000), which slowed down other SQL executions and caused active\_session to increase dramatically as a result of active session stack

CPU \_usage (CPU \_usage)

The following database instance is full of CPU.

In most cases, there is a positive correlation between the number of scanned rows and the CPU \_usage and the total number of scanned rows in the SQL request behavior analysis selection, except for the CPU-intensive calculation of the more expensive operations such as JOIN.

We can easily locate the CPU-associated metrics

Final positioning: This SQL full table scan caused the CPU to be full and caused the session to stack

Future plans

The DAS will support more engines for real-time detection and exception localization, and the Professional Edition will help more users locate more types of database instance problems with the full amount of SQL from the user. Not only let the professional DBA better use DAS to control the database instance, but also let the beginners in the database field control the database without the threshold, truly ensure that the database instance self-perception, self-optimization, self-repair.

** Related reading: **

DAS, the database autonomy service, has released the new version of the year: 1-5000, which marks the entry of “database autonomous driving” into the era of scale

Technology revelation | large stimulating behind, how to effectively assess computing resources and planning database?

Blockbuster | database autonomous service will DAS papers to be included in the global top SIGMOD, pilot “database automated driving” a new era

| DAS introduces the global Workload Optimization feature to automate SQL diagnostics

How does Autoscaling work in Aliyun

Copyright Notice:The content of this article is contributed by Aliyun real-name registered users, and the copyright belongs to the original author. Aliyun developer community does not own the copyright and does not bear the corresponding legal liability. For specific rules, please refer to User Service Agreement of Alibaba Cloud Developer Community and Guidance on Intellectual Property Protection of Alibaba Cloud Developer Community. If you find any suspected plagiarism in the community, fill in the infringement complaint form to report, once verified, the community will immediately delete the suspected infringing content.