TiDB slow log in practice at Banyu

Author profile: Liu Jiang, head of the database team at Banyu English and TUG 2020 Annual MOA, responsible for database operation and maintenance, big data operation and maintenance, and database platform construction at Banyu.

This article comes from Liu Jiang, head of the DBA group at Banyu English, who shared Banyu's TiDB slow log practice during the second session of the "Energy Titanium" event. The article covers the following three aspects:

  • The first part covers the background and requirements: why the TiDB slow log system was built, and what kind of system it needed to be.
  • The second part explains how the slow log system is built.
  • The third part walks through several online cases to show how the slow log system locates production problems.

Background and Requirements

In the first half of this year, Alibaba Cloud released a new generation of its DAS (Database Autonomy Service) and noted that more than 90% of database problems come from abnormal requests to the database. This matches our daily experience: most database problems are caused by abnormal SQL requests, while failures caused by database bugs or machine faults are relatively rare. For locating abnormal SQL, slow log analysis is a particularly effective tool.

So what are the pain points when we use slow logs to analyze problems? Before the slow log system was built, the slow logs of a cluster were scattered across multiple machines. When a database problem occurred, we had to log in to each machine to analyze it, which was very inefficient. When the cluster is large, it is almost impossible to locate problems quickly this way.

Of course, TiDB 4.0 ships with TiDB Dashboard, where we can view the slow logs of the entire cluster for, say, the last 15 minutes or half an hour. However, when the system really has a problem, slow logs are generated in large volumes, and the Dashboard runs into performance problems such as heavy computation. The Dashboard also does not support retrieval or statistical analysis, which makes it hard to locate abnormal SQL quickly.

The TiDB system schema also contains a table, **information_schema.slow_query**, which exposes slow logs in real time and can likewise be used to locate abnormal SQL. However, it is a relational table without indexes, so multi-dimensional retrieval against it is particularly slow, and relational tables are in general not well suited to multi-dimensional retrieval and statistical analysis.
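For completeness, here is a minimal sketch of querying that table directly; the host, credentials, and time window are illustrative assumptions, not Banyu's actual setup.

```python
# Minimal sketch: query information_schema.slow_query directly on a TiDB server.
# Host, port, user, and password are placeholders, not real settings.
import pymysql

conn = pymysql.connect(host="tidb-server-host", port=4000,
                       user="readonly_user", password="***",
                       database="information_schema")
try:
    with conn.cursor() as cur:
        # Without indexes on slow_query, this scan gets slower as the log grows.
        cur.execute(
            """
            SELECT time, db, query_time, total_keys, process_keys, query
            FROM information_schema.slow_query
            WHERE time > NOW() - INTERVAL 30 MINUTE
              AND query_time > 0.2
            ORDER BY query_time DESC
            LIMIT 20
            """
        )
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```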

Based on the above pain points, Banyu's slow log system needs to meet the following requirements:

First, centralized slow log collection: the slow logs of multiple or even dozens of online clusters can all be collected for centralized analysis, so that there is a single entry point.

Second, the collected slow logs must be near real time. If the collection delay is too large, the logs are of little help for handling and analyzing online problems.

Third, slow logs must support retrieval and statistical analysis. Slow logs are especially abundant when a problem occurs, and retrieval and statistical analysis are what make it possible to locate abnormal SQL quickly.

Finally, the slow log system needs to support monitoring and alerting.

How the system is built

Based on the above background and requirements, let’s take a look at how Banyu’s slow log system is built.

System architecture

The overall architecture of the slow log system is shown in the figure below. The Filebeat component is deployed when a TiDB Server machine is initialized; it collects the slow logs, tags them with the machine's IP, and writes them to Kafka. We then use Logstash to parse out the fields we care about and store them in ES. ES itself is a search engine, so data analysis and statistics on it are very fast. Finally, we use Kibana to visualize, search, and aggregate the slow log data in ES.

Of course, there is another possible architecture for the slow log system, as shown in the figure below. ClickHouse is an analytical database that has become popular in recent years, and some companies send monitoring data to ClickHouse for real-time monitoring and alerting analysis. In this architecture, we can replace the Logstash component with Flink, do some simple computation on the slow log data, and write it to ClickHouse.

Because Banyu's slow log system was built relatively early, it uses the ELK architecture. A few points about this design:

First, Filebeat is lightweight enough. We’ve parsed hundreds of megabytes of files online and found little impact on database performance.

Second, when a problem occurs online, the volume of logs generated in a short time is extremely large. If slow logs were written directly to the Logstash machines, their load would spike. Therefore, Kafka is used to absorb the peak.

When parsing slow logs with Logstash, use fuzzy matching rules as little as possible, because fuzzy matching rules cause the machine's CPU to soar.

Then, the ES index itself is schema-free, and with its inverted index data structure it is a perfect fit for statistical analysis scenarios.

At the same time, Kibana is used for visual retrieval and statistical analysis.

Finally, we read the slow log data from ES every two minutes to drive monitoring and alerting.

Log collection

Let's look at the details of each component. The Filebeat collection configuration is shown on the left of the figure below. When we deploy Filebeat, we pass in the IP of the machine it is deployed on, so that later statistics can tell which machine a slow log came from. On the right is the Kafka output configuration; once the data is collected, it is sent to the Kafka cluster.

Here is an example of a TiDB slow log in this format. The slow log starts with Time, followed by fields such as Query_time, Total_keys, Process_keys, and DB, and ends with the SQL text. After being collected by Filebeat, the entry becomes the single line of text shown in the figure below. If this line of text were stored in ES directly, there would be no way to retrieve or analyze it.
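To make the pipeline concrete, here is a minimal sketch of reading those Filebeat events back out of Kafka; the broker addresses, topic name, and custom field name are illustrative assumptions, not Banyu's real configuration.

```python
# Minimal sketch: inspect the raw slow-log events that Filebeat ships into Kafka.
# Broker addresses, topic name, and the custom "ip" field are placeholders.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "tidb-slow-log",                                  # assumed topic name
    bootstrap_servers=["kafka1:9092", "kafka2:9092"],
    auto_offset_reset="latest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:
    event = record.value
    # Filebeat puts the raw log line in "message"; the deploy-time IP is an
    # extra field added so we know which TiDB server produced the entry.
    print(event.get("fields", {}).get("ip"), event.get("message", "")[:200])
```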

Field selection

A single line of text cannot support statistical analysis or multi-dimensional retrieval; to do that, we need to parse the fields out of it. So which fields do we care about? First, look at the slow log of MySQL 5.7, shown in the figure below. When handling a MySQL failure, we first look at a statement's query time; if it is long, the statement may be the cause of the online problem. However, when the volume of online requests is high, a long query time does not necessarily mean the statement is the root cause, and it should be analyzed together with other keywords. For example, Rows_examined, a particularly important keyword in the slow log, represents the number of rows scanned logically. In general, Query_time and Rows_examined together can locate problematic SQL.

Next, let’s look at TiDB’s slow logs. First, the slow log of a TiDB 3.0.13 statement that goes through the KV interface, shown in the figure below. It contains several important keywords, such as Query_time, DB, SQL, and Prewrite_time, which are helpful for locating online problems. The figure below shows a TiDB 3.0.13 slow log in a different format, for a statement that goes through the DistSQL interface. Besides Query_time and Total_keys, it also prints Index_names, which indicates whether the statement used an index; Index_names also contains the table name and other information.

The TiDB 3.0.13 slow log also contains information such as KV_total, PD_total, and Backoff_total. Looking at the slow logs above, we can see that they carry a lot of key information; we can even see where inside the database a request is slow. If we collect these slow logs, retrieve them along certain dimensions, or even aggregate them, it becomes very helpful for finding problems.

At Banyu, the TiDB slow log fields we mainly pay attention to are the following:

  • TiDB IP: the IP address of the machine, passed in when Filebeat is deployed. With this IP we know which machine a log came from and can do statistics by IP;
  • DB: the database used when the statement was executed. We can do statistics by DB, and we can also map a DB to a specific database cluster through our internal system;
  • TABLE: for some slow logs, the table name can be parsed out, so statistics can be done by table;
  • IDX_NAME: except for Insert statements and statements that go through the KV interface, the slow log records whether the statement used an index;
  • TOTAL_KEYS: the number of keys scanned by the Coprocessor;
  • PROCESS_KEYS: the number of keys processed by the Coprocessor;
  • QUERY_TIME: the execution time of the statement;
  • SQL: the specific SQL statement.

Field parsing

The fields we need are parsed from the slow log using Logstash's Grok syntax, as shown in the figure below.
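The actual grok patterns are only shown as a figure here, so as a rough stand-in the sketch below extracts the fields listed above from a raw TiDB 3.0-style slow log entry with Python regular expressions. The patterns are illustrative and may need adjusting for other TiDB versions.

```python
# Rough Python equivalent of the grok parsing step: pull the fields we care
# about out of one raw TiDB slow-log entry. Patterns assume the 3.0-style
# format described above and are illustrative, not exhaustive.
import re

FIELD_PATTERNS = {
    "query_time":   re.compile(r"# Query_time:\s+([\d.]+)"),
    "db":           re.compile(r"# DB:\s+(\S+)"),
    "index_names":  re.compile(r"# Index_names:\s+(\S+)"),
    "total_keys":   re.compile(r"Total_keys:\s+(\d+)"),
    "process_keys": re.compile(r"Process_keys:\s+(\d+)"),
}

def parse_slow_log_entry(entry: str) -> dict:
    """Return a dict of the fields we index into ES; missing fields stay None."""
    doc = {name: None for name in FIELD_PATTERNS}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(entry)
        if match:
            doc[name] = match.group(1)
    # Everything that is not a "#" comment line is the SQL statement itself.
    sql_lines = [line for line in entry.splitlines()
                 if line and not line.startswith("#")]
    doc["sql"] = " ".join(sql_lines)
    return doc
```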

Statistical analysis

The figure below shows the slow logs of all our clusters in the last 30 minutes. Through Kibana we can see the total number of slow logs, retrieve them by DB, Query_time, Total_keys, and so on, and aggregate them by DB, Table, IP, and other dimensions. This greatly improves the efficiency of locating problem SQL.

Besides visual retrieval and aggregation through Kibana, our internal platform system also aggregates and sorts slow logs along various dimensions. The aggregation dimensions include cluster, database, table, IP, and operation type, and the sorting rules can be based on total time, average time, maximum time, number of log entries, and so on. We also periodically send slow log reports to R&D.
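As an illustration of this kind of aggregation, here is a minimal sketch using the Elasticsearch Python client; the index name and field names (for example `db` and `query_time`) are assumptions based on the fields listed earlier, not the actual mapping.

```python
# Minimal sketch: top databases by total slow-query time over the last 30
# minutes. Index name and field names are assumed; "db" must be a keyword field.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-host:9200"])

resp = es.search(index="tidb-slow-log-*", body={
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-30m"}}},
    "aggs": {
        "by_db": {
            "terms": {"field": "db", "size": 10,
                      "order": {"total_time": "desc"}},
            "aggs": {
                "total_time": {"sum": {"field": "query_time"}},
                "avg_time":   {"avg": {"field": "query_time"}},
                "max_time":   {"max": {"field": "query_time"}},
            },
        }
    },
})

for bucket in resp["aggregations"]["by_db"]["buckets"]:
    print(bucket["key"], bucket["doc_count"],
          round(bucket["total_time"]["value"], 3))
```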

Monitoring and alerting

In addition to visual retrieval and statistical analysis, we also monitor and alert on slow logs, as shown in the figure below, by counting the number of slow SQL entries in each database and setting two alert rules per database. For example, the general rule for the advertising database is: an alert fires when the execution time exceeds 200 ms and the number of such entries reaches five.

However, we found some SQL online whose execution frequency is very low but whose execution time is very long, which the general rule cannot cover. So another rule was added: an alert fires when the execution time exceeds 500 ms and the count reaches two. These two rules basically cover all slow SQL online. The alert message is shown in the figure below. As shown on the left, because we collect the DB field of the slow log and map the DB to a database cluster through our internal system, we can see which cluster and which database the slow logs occurred in, along with the number of slow logs that exceeded the threshold, the total SQL execution time, and the average time. This alert information helps us judge, in a short time, how much impact the problem SQL has on the production system.
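A minimal sketch of that periodic check might look like the following; the index name, field names, and notification hook are placeholders, and the real system presumably maps each DB to a cluster through Banyu's internal platform and attaches total and average execution time.

```python
# Minimal sketch of the 2-minute alert check: count slow queries per DB that
# exceed each rule's latency threshold and alert when the count threshold is
# reached. Index/field names and notify() are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-host:9200"])

# (minimum query_time in seconds, minimum number of matching entries)
RULES = [(0.2, 5), (0.5, 2)]

def check_slow_log_alerts():
    for min_time, min_count in RULES:
        resp = es.search(index="tidb-slow-log-*", body={
            "size": 0,
            "query": {"bool": {"filter": [
                {"range": {"@timestamp": {"gte": "now-2m"}}},
                {"range": {"query_time": {"gte": min_time}}},
            ]}},
            "aggs": {"by_db": {"terms": {"field": "db", "size": 100}}},
        })
        for bucket in resp["aggregations"]["by_db"]["buckets"]:
            if bucket["doc_count"] >= min_count:
                notify(db=bucket["key"], count=bucket["doc_count"],
                       threshold_ms=int(min_time * 1000))

def notify(db, count, threshold_ms):
    # Placeholder: the real system maps db -> cluster and sends the alert
    # through the internal platform.
    print(f"[slow-log alert] db={db} entries={count} over {threshold_ms}ms")
```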

Case sharing

Now that we’ve talked about how slow logging works, let’s look at how we can use slow logging to find online problems.

The first case is shown below. This SQL statement scanned a very large number of Total_keys and Process_keys, and its Index_names field is empty, which means it did not use an index. We quickly solved the performance problem by adding an index for this SQL. The second case is similar: we noticed that the Coprocessor CPU metric had gone up, searched the slow log system, and found that the SQL had no index. Locating the problem is fast because aggregation and retrieval through ES is extremely fast.
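The kind of search used in these two cases can be sketched as follows; again, the index and field names are assumptions based on the fields described earlier.

```python
# Minimal sketch: find recent slow queries that scanned many keys without
# using an index, as in the two cases above. Index/field names are assumed.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-host:9200"])

resp = es.search(index="tidb-slow-log-*", body={
    "size": 20,
    "sort": [{"total_keys": "desc"}],
    "query": {"bool": {
        "filter": [{"range": {"@timestamp": {"gte": "now-30m"}}},
                   {"range": {"total_keys": {"gte": 100000}}}],
        "must_not": [{"exists": {"field": "index_names"}}],
    }},
})

for hit in resp["hits"]["hits"]:
    doc = hit["_source"]
    print(doc.get("db"), doc.get("total_keys"), doc.get("sql", "")[:120])
```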

Conclusion

The above is Banyu's experience with the slow log system in discovering performance problems and preventing database performance risks; we hope it is helpful. In the future, we will continue to mine the information in slow logs, enrich the functions of the slow log system, and safeguard the Banyu database.