Application of SQL parsing in Meituan Dianping

Total 245 article

37 in 2018

takeaway

Database, as the core basic component, is the object that needs to be protected. Any careless online operation may cause a serious database failure, resulting in huge losses to the business. To avoid this kind of loss, management efforts are usually made. For example, develop database development specifications for r&d personnel; The newly launched SQL needs to be audited by the DBA; Maintenance operations must be approved by the management. And managing these measures effectively requires effective database training and careful SQL auditing by dbAs. Many small and medium-sized startups can manage their databases by setting norms, training, and auditing processes.

As Meituan-Dianping’s business has grown and grown, these measures have become more expensive to implement. How to rely more on technical means to improve efficiency is getting more and more attention. The industry has a lot of SQL audit, optimization suggestions and other tools based on MySQL source development, greatly reducing the BURDEN of SQL audit DBA. So can we continue to expand the MySQL source code to assist DBAs and developers to further improve efficiency? For example, more comprehensive SQL optimization capabilities; Multi-dimensional slow query analysis; Assist fault analysis, etc. One of the core technologies to achieve the above functionality is SQL parsing.

Current Situation and Scenario

SQL parsing is a complex technology, which is generally mastered by database vendors, but there are also companies that provide SQL parsing apis. Due to the rise of MySQL database middleware in recent years, it is necessary to support read and write separation, library and table separation and other functions, and it is necessary to extract table names, library names and related field values from SQL. So things like Druid (Java), MaxScale (C), Kingshard (Go), and so on, all parse SQL partially. And the real SQL parsing technology for database maintenance products are few, mainly as follows:

Meituan-dianping open source SQLAdvisor. It is based on MySQL primitive lexical analysis, combined with the analysis of SQL WHERE conditions, aggregation conditions, multi-table Join relations to give index optimization suggestions.

Go where open source Inception. Focuses on auditing SQL against built-in rules.

Cloud DBA of Ali. According to the official documentation, it also provides SQL optimization suggestions and rewriting.

The above products have very suitable application scenarios and are widely used in the industry. But SQL parsing applications are far from fully explored, such as:

Slow query reports based on table granularity. For example, a Schema that contains tables belonging to different lines of business wants to provide slow query reports with table granularity from a line of business perspective.

Generate SQL characteristics. Replace the value in the SQL statement with a question mark to facilitate SQL categorization. You can use regular expressions to do the same thing, but they are more buggy. See pt-query-digest. In pt-query-digest, for example, all numbers encountered are replaced with “? , resulting in the inability to distinguish between different numeric suffixes.

Confirm and avoid high-risk operations. For example, if a DBA accidentally drops a table, and there is no effective tool to roll back such an operation, especially a large table, the consequences can be disastrous.

SQL validity judgment. For security, audit, control and other reasons, Meituan-Dianping does not let developers directly operate the database, but instead offers RDS services. Especially for data changes, the business approval of the r&d personnel’s superior supervisor is required. If a developer writes a SQL that is syntactically incorrect, and THE RDS can’t determine whether the SQL is valid or not, there will be unnecessary communication costs.

Therefore, in order to facilitate the use of SQL parsing for all required services, we believe that it should have the following features:

Direct exposure of SQL parsing interface, as simple as possible to use. For example, if you enter SQL, you output table names, characteristics, and optimization recommendations.

The use of interfaces is not language-specific, which would otherwise be too expensive to maintain and use. For example, providing services through HTTP and so on.

A journey of a thousand miles begins with a single step. I will first introduce the principle of SQL parsing.

The principle of

SQL parsing and optimization belong to the compiler category, and C and other languages parsing no essential difference. It is divided into lexical analysis, syntax and semantic analysis, optimization, implementation code generation. The part corresponding to MySQL is shown as follows:

Figure 1 SQL parsing principle

Lexical analysis

SQL parsing consists of lexical analysis and syntax/semantic analysis. Lexical analysis is about converting input into tokens. Tokens contain Keyword (also known as symbol) and non-keyword. For example, the SQL statement select username from userinfo, after the analysis, will get 4 tokens, including 2 Keyword, respectively select and from:In general, lexical analysis can be usedFlexMySQL, however, does not use this tool, instead writing the lexical analysis section by hand (supposedly for efficiency and flexibility, reference)This article). The specific codes are in the SQL /lex.h and SQL/SQL_lex.cc files.

The MySQL Keyword is defined in SQL /lex.h.

{ "&&", SYM(AND_AND_SYM)},{ "<", SYM(LT)},{ "<=", SYM(LE)},{ "<>", SYM(NE)},{ "! =", SYM(NE)},{ "=", SYM(EQ)},{ ">", SYM(GT_SYM)},{ ">=", SYM(GE)},{ "<<", SYM(SHIFT_LEFT)},{ ">>", SYM(SHIFT_RIGHT)},{ "<=>", SYM(EQUAL_SYM)},{ "ACCESSIBLE", SYM(ACCESSIBLE_SYM)},{ "ACTION", SYM(ACTION)},{ "ADD", SYM(ADD)},{ "AFTER", SYM(AFTER_SYM)},{ "AGAINST", SYM(AGAINST)},{ "AGGREGATE", SYM(AGGREGATE_SYM)},{ "ALL", SYM(ALL)},Copy the code

The core code of lexical analysis is in the SQL/SQL_lex.c file, MySQLLex→lex_one_Token, interested students can download the source code research.

Syntax analysis

Parsing is the process of generating a syntax tree. This is the best and most complex part of the parsing process, but MySQL uses Bison to do this. Even so, it is worth studying here how to design appropriate data structures and related algorithms to store and traverse all information.

A) Parsing tree

The SQL statement:

select username, ismale from userinfo where age > 20 and level > 5 and 1 = 1Copy the code

The following syntax tree is generated:

For those of you who are new to compiler implementation, you may wonder how you can generate such a syntax tree. The principles behind this are the domain of compilers and can be found in a Wikipedia article and reference books listed here. I am also in the process of learning MySQL source code, read part of the content. Because the compiler involves too much content, my experience and time is limited, I will not do too much exploration. From an engineering point of view, learning how to use Bison to build syntax trees to solve practical problems might be more helpful. I’m going to explore this process based on Bison.

B) MySQL syntax analysis tree generation process

All the source code in SQL/SQL_Yacc. yy, there are 17K lines of code in MySQL5.6. Here is a list that involves SQL:

select username, ismale from userinfo where age > 20 and level > 5 and 1 = 1 Copy the code

Part of the parsing process is excerpted. In fact, with Bison, SQL parsing is not as difficult as expected. Especially given the context of the analysis here.

Select * from select_init {LEX * LEX = LEX; lex->sql_command= SQLCOM_SELECT; }; Select_init: SELECT_SYM / * the select keyword * / select_init2 | '(' select_paren) union_opt; select_init2: select_part2 { LEX *lex= Lex; SELECT_LEX * sel= lex->current_select; if (lex->current_select->set_braces(0)) { my_parse_error(ER(ER_SYNTAX_ERROR)); MYSQL_YYABORT; } if (sel->linkage == UNION_TYPE && sel->master_unit()->first_select()->braces) { my_parse_error(ER(ER_SYNTAX_ERROR)); MYSQL_YYABORT; } } union_clause ; select_part2: { LEX *lex= Lex; SELECT_LEX *sel= lex->current_select; if (sel->linkage ! = UNION_TYPE) mysql_init_select(lex); lex->current_select->parsing_place= SELECT_LIST; } select_options select_item_list /* select_item_list */ {Select->parsing_place= NO_MATTER; } select_into select_lock_type ; select_into: Opt_order_clause opt_limit_clause {} | into | select_from / * from * / words | into select_from | select_from into; select_from: FROM join_table_list /* Parses the table name */ where_clause /*where clause */ group_clause having_clause opt_order_clause opt_limit_clause procedure_analyse_clause { Select->context.table_list= Select->context.first_name_resolution_table= Select->table_list.first; } | FROM DUAL_SYM where_clause opt_limit_clause /* oracle compatibility: oracle always requires FROM clause, and DUAL is system table without fields. Is "SELECT 1 FROM DUAL" any better than "SELECT 1" ? Hmmm :) */ ; where_clause: /* empty */ { Select->where= 0; } | WHERE { Select->parsing_place= IN_WHERE; */ {SELECT_LEX *select= select; select->where= $3; select->parsing_place= NO_MATTER; if ($3) $3->top_level_item(); }; /* all possible expressions */expr: | expr and expr %prec AND_SYM { /* See comments in rule expr: expr or expr */ Item_cond_and *item1; Item_cond_and *item3; if (is_cond_and($1)) { item1= (Item_cond_and*) $1; if (is_cond_and($3)) { item3= (Item_cond_and*) $3; /* (X1 AND X2) AND (Y1 AND Y2) ==> AND (X1, X2, Y1, Y2) */ item3->add_at_head(item1->argument_list()); ? = $3; } else { /* (X1 AND X2) AND Y ==> AND (X1, X2, Y) */ item1->add($3); ? = $1; } } else if (is_cond_and($3)) { item3= (Item_cond_and*) $3; /* X AND (Y1 AND Y2) ==> AND (X, Y1, Y2) */ item3->add_at_head($1); ? = $3; } else { /* X AND Y */ ? = new (YYTHD->mem_root) Item_cond_and($1, $3); if (? == NULL) MYSQL_YYABORT; }}Copy the code

As you browse through the above code, you’ll see that Bison has embedded C++ code. Through C++ code, the parsed information is stored in related objects. For example, table information is stored in TABLE_LIST, order_list stores information in the ORDER by clause, and WHERE clauses are stored in Item. With this information, algorithms can be used to further process the SQL.

C) Core data structures and their relationships

In SQL parsing, the most central structure is SELECT_LEX, defined in SQL/SQL_lex.h. Only the parts relevant to the above examples are listed below.

In the above diagram, column names username and ismale are stored in item_list, table names are stored in table_list, and conditions are stored in WHERE. Item in the WHERE condition has the deepest hierarchy and more complex expression, as shown in the figure below:

Application of SQL parsing

To take a closer look at SQL parsers, here are two examples of SQL parsing being used.

Useless condition removal

The removal of useless conditions belongs to the logical optimization category of the optimizer, and can be completed only according to the SQL itself and table structure, its optimization is also more cases, code in SQL/SQL_optimizer.cc file remove_eq_conds function. In order to avoid too tedious description and pasting large sections of code, the following four situations are analyzed through diagrams:

A) 1=1 and (m > 3 and n > 4)

B) 1=2 and (m > 3 and n > 4)

C) 1=1 or (m > 3 and n > 4)

D) 1=2 or (m > 3 and n > 4)

Figure 5. Removal of a by useless condition

If you are interested in implementing this code, you need to know something about the Item class, an important data structure in MySQL. Because of its complexity, the official MySQL documentation specifically describes the Item class. Ali’s MySQL group has a similar article. For more details, check the SQL /item_* files in the source code.

SQL Feature Generation

To ensure the stable and efficient operation of the database, the basic component of the system, there are many auxiliary systems in the industry. Such as slow query systems, middleware systems. After these systems collect and receive SQL, they need to categorize the SQL for statistical information or to apply relevant policies. When categorizing, you usually need to capture SQL characteristics. Such as SQL:

Select username, ismale from userinfo where age > 20 and level > 5;Copy the code

SQL features are:

select username, ismale from userinfo where age > ? and level > ? Copy the code

Pt-query-digest, the industry’s well-known slow query analysis tool, uses regular expressions to do this, but it is buggy. Next, I’ll show you how to use SQL parsing to complete the generation of SQL characteristics.

SQL features are composed of two parts.

A) Generate Token array

B) Generate SQL features according to Token array

First of all, in the lexical analysis section, we introduced the keyword in SQL, and each keyword has a 16-bit integer corresponding to it. The non-keyword is represented by ident, which also corresponds to a 16-bit integer. The following table:The process of converting an SQL into a feature:In the SQL parsing process, it is very convenient to complete the generation of Token array. Once the Token array is generated, SQL features can be easily generated. SQL features are widely used in various systems. For example, PT-Query-Digest needs to classify SQL according to features, but its implementation based on regular expressions has many bugs. Here are a few known bugs:

Study suggest

Recently, in the process of exploring SQL parsers and optimizers, FROM the beginning of the confused to the rules to follow, also summed up some experience, here to share with you.

First, read the relevant books, which can give us a systematic view of parsers and optimizers. However, there are few books about MySQL in the market. You can see the art of database Query Optimizer: Principle Parsing and SQL Performance Optimization in Chinese.

Second, read the source code, but it’s best to base it on a version such as MySQL5.6.23, because the CODE for SQL parsing and optimization is constantly changing. This is especially true when it comes to big releases.

Third, use MORE GDB debugging, verify their guesses, check the quality of reading.

Finally, you need to write relevant code verification, only write out to be able to truly master.

Author’s brief introduction

Guang You, senior MySQL DBA of Meituan-Dianping comprehensive Business Group, graduated from University of Science and Technology of China in 2012, and joined Meituan-Dianping in 2017. He has been committed to the research of MySQL and its peripheral tools for a long time.

Jin Long joined Meituan-Dianping in 2014, mainly engaged in database operation and maintenance, high availability and related operation and maintenance platform construction. Students who are interested in high availability and architecture of operation and maintenance can follow the personal wechat public account “Own Designer” and regularly push original content related to operation and maintenance.

Xing Fan, DBA of Meituan-Dianping, joined Meituan-Dianping after graduate school in 2017. At present, she has some experience in MySQL operation and maintenance and has written some automatic scripts.

———- END ———-

Recruitment information

The DBA team of Meituan-Dianping recruits all kinds of DBA talents from Base Beijing and Shanghai. We are committed to providing stable, reliable and efficient online storage services for companies and building an industry-leading database team. There is Squirrel, a massively distributed caching system based on Redis Cluster, Cellar, a distributed KV storage system based on Tair, and thousands of MySQL instances of various architectures providing trillions of OLTP access requests per day. A truly massive, distributed, high-concurrency environment. Welcome friends to recommend or self-recommendation to jinlong.cai#dianping.com.

You might also want to see:

SQLAdvisor is an open-source SQL optimization tool

10 Must-read questions about database operation and maintenance

Evolution and idea of Meituan-Dianping database high availability architecture

Application of SQL parsing in Meituan Dianping

Current Situation and Scenario

Related Posts

ElasticSearch is distributed

Deep learning and machine learning learning routes

Efficient Java serialization in Dubbo (Kryo and FST)