An overview of the project

background

Slow SQL statements whose execution time exceeds the threshold set by long_QUERy_time. You can run the select @@long_query_time command to view the specific slow query threshold of the database. In addition, slow SQL includes not only SELECT statements, but also DELETE, INSERT and other DML statements.

Hazards of slow query SQL include:

  1. Performance: Slow SQL execution takes a long time, resulting in a long wait time, which directly affects user experience.
  2. Stability: when db appears slow query, once there are other DDL operations, may cause the whole database wait; On the other hand, slow SQL can drag down the database, causing normally executed SQL to become slow. Searching for slow SQL keyword in byte online accident management platform, you can find many accidents caused by slow SQL, which are of great harm.

results

  • Published monthly slow SQL reports, sorted out best practices, and promoted the improvement of swimming lanes in the head. The number of slow SQL decreased by nearly 50%, and the weekly running times of slow SQL decreased by an order of magnitude.
  • Slow SQL configuration & Alarm subscription continued allocation rate increased from 18% to 70%, and continued optimization.

Noun explanation

  • RDS: Relational Database Service It provides relational database services and uses open source MySQL database as the main database products. Byte Cloud Relational Database Services (RDS) focuses on providing businesses with stable, reliable, and resilient online database services.
  • Mars: indicates the name of the guest performance platform.
  • Aeolus: Byte research agile BI platform, providing flexible and easy to use data query, efficient and beautiful report making, and rich and diversified data content.

Design scheme

1. The architecture diagram

2. Core functions

2.1 Comprehensive slow SQL metrics Kanban

Quantified the daily/weekly/monthly slow SQL quantity & run times of customer growth based on slow SQL data of byte RDS platform database. Measure kanban data to drive timely improvement of the stock of slow SQL and reduce database quality risks. For example, the trend chart of weekly run times & slow SQL entries is as follows:

2.2 Slow SQL governance System

2.2.1 Automatic management of RDS slow SQL Threshold Configuration

Byte relational database platform -RDS provides slow SQL threshold configuration capabilities:

  • When the SQL execution time exceeds the threshold, the SYSTEM is automatically killed, which is equivalent to the Dr Configuration of slow SQL. (If an SQL is still running after being executed for 3 months, the result is unimaginable.)

Slow SQL threshold configuration Automatic management solves the problem that all databases associated with services are configured with slow SQL threshold information. This part is realized through online periodic inspection, and the process is as follows:

2.2.2 Mars- Slow SQL governance platform

The mars-Customer Growth slow SQL governance Web page is set up in the customer growth quality workbench to display the current situation of slow SQL of related businesses and the follow-up and repair of scheduling. The purpose is to let business students have a clearer and faster understanding of the current business and provide the efficiency of problem repair. The plan is as follows:

Slow SQL follow up page:

2.2.3 Slow SQL risk assessment model – Slow SQL score

When there are many slow SQL in the business line, how to accurately and reasonably analyze which slow SQL has the highest risk?

We build a risk assessment model of incremental slow SQL based on Quert_time,Lock_time,Rows_sent,Rows_affected,Bytes_sent and other dimensions of relational databases, and score each slow SQL and each database. Sort by slow SQL score, with the highest slow SQL score having the highest risk.

Slow SQL model is as follows:

2.3 Slow SQL-CI assembly line access/exit bayonet construction

Develop slow SQL atomic nodes based on ByteCycle (ByteCycle Byte Unified Energy Efficiency Center), and provide slow SQL related blocking point capabilities. Bytecycle builds a continuous integration pipeline based on the PSM dimension. By providing slow SQL atomic nodes, it can be easily used by users. CI card points can provide you with attention to slow SQL and improve the efficiency of slow SQL improvement.

2.4 Slow SQL Monitoring & Alarm Subscription

The system provides monthly slow SQL reports, daily slow SQL troubleshooting reminders, and SQLL Kill Lark alarm cards. The related styles are as follows:

  • Slow SQL monthly report

  • Daily reminder of slow SQL problem fixes

  • After the db slow query threshold is set, the DB automatically kills the statement if the threshold is exceeded. After the subscription, the DB automatically sends the obtained kill information to the corresponding group

3. The Code scheme

RDS meta information acquisition implementation scheme

Data table design
create table cg_rds_external ( id int unsigned auto_increment primary key comment 'id', Db_name varchar(100) default 'null comment ', owners varchar(100) default' not NULL comment 'db owners', Region varchar(100) default "not NULL comment 'DB region', Proxy_port_master varchar(100) default "not null comment 'master port', Proxy_port_slave varchar(100) default "not null comment 'Slave varchar ', Sync_time datetime default CURRENT_TIMESTAMP not NULL on UPDATE CURRENT_TIMESTAMP comment 'Data synchronization time') ENGINE=InnoDB DEFAULT CHARSET= utf8MB4 comment 'RDS db '; create table cg_rds_slow_query_config ( id int unsigned auto_increment primary key comment 'id', Db_name varchar(255) default "null comment 'db name ', Region varchar(100) default "not NULL comment 'REGION of db deployment ', port varchar(100) default" not NULL comment' Port of the rule ', Db_role varchar(100) default "not NULL comment 'master or slave', max_query_time int null comment 'Timeout threshold, Create_time varchar(100) default 'null comment' create_time varchar(100) default 'null comment' create_time ', Sync_time datetime default CURRENT_TIMESTAMP not NULL on UPDATE CURRENT_TIMESTAMP comment 'Data synchronization time') ENGINE=InnoDB DEFAULT CHARSET= UTf8MB4 comment 'RDS slow query rule configuration '; create table cg_rds_db_alarm_config ( id int unsigned auto_increment primary key comment 'id', Region varchar(100) default "not NULL comment 'REGION of db deployment ', alarm_id int NULL comment' Alarm rule ID ', Db_name varchar(255) default 'null comment ', type varchar(100) default' not NULL comment 'alarm type, e.g. Lark ', group_id varchar(100) default '' not null comment 'lark id', Create_time varchar(100) default 'not null comment' create_time varchar(100) default 'not null comment' Owner varchar(100) default 'not null comment 'alarm creator ', Sync_time datetime default CURRENT_TIMESTAMP not NULL on UPDATE CURRENT_TIMESTAMP comment 'Data synchronization time')ENGINE=InnoDB DEFAULT CHARSET= utf8MB4 comment 'RDS alarm ';Copy the code

Slow SQL query details fall into the database

The data table
create table cg_slow_query_detail_info ( id int unsigned auto_increment primary key comment 'id', Db_region varchar(255) default 'null comment' db_region varchar(255) default 'null comment' DB_region varchar(255) default 'null comment' DB region', Fingerprint_md5 varchar(255) default "NULL comment", Begin_time datetime DEFAULT CURRENT_TIMESTAMP NULL COMMENT 'Slow SQL start time ', Max_run_time varchar(255) default 'null comment' maximum time spent on SQL execution ', run_count int default 0 NULL comment 'Number of SQL executions ', Psm_name varchar(255) default "NULL comment", avg_query_time varchar(255) default "null comment", Rds_address varchar(255) default 'null comment' RDS host IP :port', Psm_host varchar(255) default "NULL comment", Sync_time datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'Data synchronization time') ENGINE = InnoDB DEFAULT CHARSET = utf8MB4 comment 'SQL ';Copy the code

Scheme for obtaining details about slow SQL killed

The data table
create table cg_kill_sql_detail_info ( id int unsigned auto_increment primary key comment 'id', Db_region varchar(255) default 'null comment' db_region varchar(255) default 'null comment' DB_region varchar(255) default 'null comment' DB region', Db_role varchar(255) default '' NULL comment 'DB node: Master slave', begin_time datetime DEFAULT CURRENT_TIMESTAMP NULL comment 'Kill SQL execution start time ', Psm_name varchar(255) default 'NULL comment ', sql_detail varchar(2000) default' null comment ', Db_table_name varchar(255) default 'null comment' DB_table_name varchar(255) default 'null comment' Just take the first prime, Sync_time datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'Data synchronization time') ENGINE = InnoDB DEFAULT CHARSET = utf8MB4 comment 'RDS kill slow SQL data ';Copy the code

Metrics Monitoring rules

RDS alarm subscription monitors can only discover SQL data executed on RDS and cannot discover slow interfaces in real time. Therefore, it is recommended to use dBatman metrics to monitor alarms in slow SQL.

$key = "max:toutiao.ttds.dbatman.latency.max{db=sales_manage,port=*,host=*,dc=*}"
$value = max(q($key, "3m", "1m"))/1000
warn = $value>50
runEvery=1
Copy the code

4. Slow development of SQL governance best practices and standards

Slow SQL governance optimization can be basically divided into the following three categories:

  • Optimize shcema
  • Optimize the index, as far as possible to build a three-star index
  • Optimize query, reasonable design query

The relevant rules are as follows:

reference

Dbaplus. Cn/news – 11-406…

summary

Although slow SQL special has made some progress, but the follow-up continuous optimization still has a long way to go, if you are also interested in slow SQL optimization, database quality, welcome to join us! In addition to slow SQL special, SRE, EP and other directions of the construction is also welcome to join. You can contact [email protected] for information or send an inline resume. Welcome to chat, HC Duo Duo, waiting for you to join!