1. SQL: a familiar yet unfamiliar programming language

Here are the key words: “familiar,” “unfamiliar,” and “programming language.”

It’s “familiar” because SQL is the primary way DBAs, and developers in general, operate databases, and most of us use it almost every day. It’s “unfamiliar” because many people simply use it without ever asking: how does it actually work? How can I make it work more efficiently?

SQL is, at bottom, a “programming language,” which is probably not how many people think of it. Let’s look at a simple definition (from Baidu Baike):

Structured Query Language (SQL) is a special-purpose programming language: a database query and programming language used to access data and to query, update, and manage relational database systems. It is a high-level, non-procedural language that lets users work with high-level data structures. It requires neither that users specify how the data is stored nor that they understand the specific storage method, so database systems with completely different underlying structures can all use the same structured query language as their interface for data entry and management. SQL statements can also be nested, which gives the language great flexibility and power.

To sum up, SQL is a non-procedural programming language that allows access to relational database systems.

2. Do you really know SQL?

I’m going to go through a small example to see if you really understand SQL.

This is a simple example about the order in which a SQL statement is executed. Here we break an ordinary SELECT statement into three clauses. In what order are they actually executed? There are six options, A through F; take a moment to think about your choice…

The answer is D: the FROM clause is executed first, then the WHERE clause, and finally the SELECT part.
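As a minimal sketch, the statement below (the EMP table and its columns are assumed here purely for illustration) marks each clause with its logical execution order:

    -- Logical execution order of the three clauses
    SELECT ename, sal        -- 3. SELECT: project the columns to be displayed
      FROM emp               -- 1. FROM: access the source table
     WHERE deptno = 10;      -- 2. WHERE: filter the rows that meet the condition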

For the example above, let’s construct an actual scenario and look at the execution plan to see whether the statement really executes in the order we chose. How to read an execution plan will be covered in detail later; for now, let me just walk through the execution.

  • The first step accesses the target table (EMP) with a full table scan. This corresponds to the FROM part of the statement.

  • The second step filters the retrieved result set (the filter part), that is, it keeps only the records that satisfy the condition. This corresponds to the WHERE part of the statement.

  • The third step projects the fields of the qualifying records, that is, it extracts the fields that need to be displayed. This corresponds to the SELECT part of the statement.

This is a detailed description of the order in which the parts of SQL are executed.
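If you want to verify this yourself on Oracle, one hedged way (reusing the hypothetical EMP query above) is to generate the plan with EXPLAIN PLAN and display it with DBMS_XPLAN:

    -- Generate a plan for the statement without executing it, then display it
    EXPLAIN PLAN FOR
    SELECT ename, sal FROM emp WHERE deptno = 10;

    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);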

Understanding this execution order helps a great deal in later optimization work. It is easy to see that the earlier in the execution order we can reduce the work, the greater the benefit.

3. Is SQL still important?

This raises a new question: is the SQL language still important today?

I bring this up because, as NoSQL, NewSQL, big data, and other technologies mature and spread, some people have come to hold the view that “the SQL language is no longer important.” So what is really going on?

Let’s start with a classic picture. The figure compares the application scenarios of traditional SMP relational databases, MPP NewSQL, and MPP NoSQL.

In terms of “data value density and real-time requirements,” traditional relational databases suit scenarios with high value density and strict real-time requirements (it is easy to see why information such as accounts and amounts is stored in them); MPP NewSQL comes second, and MPP NoSQL is better suited to low-value, low-real-time scenarios.

In terms of “data scale,” traditional relational databases are suitable up to roughly the TB level, while the latter two can store data at much larger scales (PB, EB).

In terms of “typical scenarios,” traditional relational databases suit OLTP online transaction systems; MPP NewSQL suits OLAP online analytical systems; NoSQL fits many other scenarios (key-value access, data mining, and so on).

Finally, in terms of “data characteristics,” the first two are suited to structured data, while NoSQL is better suited to semi-structured or even unstructured data.

To sum up, each technology has its own strengths; it is not a question of one replacing another. Traditional relational databases have distinct characteristics and remain the best choice in some situations, and SQL, as their main interactive language, is bound to stay around and keep developing for a long time.

Let’s also compare traditional databases with big data technologies. Comparing the two along dimensions such as data volume, growth pattern, diversity, and value, each has its own applicable scenarios.

In the big data field, new technologies emerge constantly, but for most users there is a real threshold to using them. A clear trend, therefore, is to introduce “SQL-like” layers that access big data in a way similar to SQL, which greatly lowers the barrier to entry for most users.

Let me answer some common objections:

“NoSQL and NewSQL have surpassed traditional databases; SQL has no place anymore!”

Each technology suits different scenarios, and sweeping statements like this don’t hold. As the main way to access relational databases, SQL still has its place.

“It’s the cloud era; who still uses relational databases!”

For high-value, strictly consistent scenarios, a relational database is still a good solution.

“I do all my programming with OR-mapping tools and never need to write SQL!”

Yes, OR-mapping tools have greatly improved productivity, but they have an obvious side effect: you lose control over statement efficiency. Many inefficient statements are generated directly by the tools. That is why some mapping tools also provide a raw SQL interface, so that critical statements can still be executed efficiently.

“In the era of big data we all use Hadoop and Spark; there’s no need to write SQL!”

Whether you use Hadoop or Spark, you can certainly write programs to do your data analysis, but it is often very inefficient to develop that way. That is exactly why “SQL-like” solutions such as Hive and Spark SQL were created: to improve productivity.

“Databases are so powerful now; there’s no need to care much about SQL performance!”

Indeed, with multi-core CPUs, large memory, flash storage, and other hardware advances, the processing power of databases has been greatly enhanced. But SQL performance still matters. As we will see later, a single simple SQL statement can easily bring a database down.

“SQL optimization? Just find a DBA; I don’t need to learn it!”

SQL optimization is indeed a DBA’s domain, but developers are ultimately responsible for their own code. If you pay attention to SQL quality during development, you can avoid many low-level problems.

“I’m just an operations DBA; I can’t optimize SQL!”

A DBA’s career can progress along the path “operations DBA -> development DBA -> data architect…”. Being able only to operate and maintain a database is undoubtedly a skills gap, and it limits your future development. Moreover, as PaaS clouds are gradually adopted, there is less and less demand for routine database operations and more and more demand for optimization, design, and architecture. SQL optimization is therefore a skill every DBA must master.

“There are optimization tools now; it’s simple!”

It is true that tools can reduce the analysis work by automatically making optimization suggestions. However, as a DBA you need to know not only the “how” but also the “why.” Moreover, the database optimizer itself is a very complex component and cannot always make the fully correct decision, which is where manual intervention and analysis come in.

“Isn’t optimization just adding an index or something?”

Yes, indexing is a very common optimization technique, but it is not the only one, and in many cases adding an index makes performance worse. A case study on this follows later.

4. SQL is still important!

Let’s use an example to show that it is still important to understand how SQL works.

This is a real case I encountered in a production environment: an Oracle database with two tables being joined. The execution plan was striking: the optimizer estimated 3505T rows returned, 127P bytes of data, a total cost of 9890G, and an elapsed time of 999:59:59.

As the execution plan showed, the two tables were joined with a Cartesian product. A Cartesian join is what you get when two tables are joined without any join condition. Except in a few special cases, Cartesian products should be avoided; no database, however powerful, can cope with them. This was a typical case of a multi-table join missing its join condition, producing a Cartesian product and causing a performance problem.
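For illustration only, here is a hedged sketch of how such a statement typically arises (the table and column names are hypothetical): one join condition is simply forgotten.

    -- Missing join condition: every row of ORDERS is combined with every row of
    -- ORDER_ITEMS, producing a Cartesian product (MERGE JOIN CARTESIAN in an Oracle plan)
    SELECT o.order_id, i.item_name
      FROM orders o, order_items i
     WHERE o.status = 'OPEN';

    -- Corrected: the join condition relates the two tables and the row explosion disappears
    SELECT o.order_id, i.item_name
      FROM orders o, order_items i
     WHERE o.status = 'OPEN'
       AND o.order_id = i.order_id;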

The case itself is nothing special: a developer’s oversight produced a poor-quality SQL statement. But at a deeper level, it carries some lessons:

  • A single developer’s oversight had serious consequences; the database turned out to be more fragile than assumed. We need to treat databases with respect.

  • A computer is not a human brain; it does not know what you need and can only process the logic that was written down.

  • Don’t simply blame the developer. Everyone makes mistakes; the key is to ensure, at the system level, that similar problems cannot happen again.

5. SQL optimization rules

Let’s look at some common optimization rules. By “optimization rules” I mean the angles from which you can think about SQL optimization. There are several ways to look at it; here are a few.

Here is a picture from Ye Zhengsheng’s blog at Alibaba, which many of you will have seen. It proposes the classic “funnel” optimization rule: the height represents the resources we invest, and the width represents the possible benefit. As the figure shows, “reducing data access” requires the least investment for the most benefit, while “adding hardware resources” requires the most investment relative to the benefit it brings. I won’t expand on this for lack of time.

This is an optimization rule I have summarized myself, known simply as the “DoDo” rule.

Rule one: “Do Less or not Do!” In other words, make the database do as little work as possible, or none at all.

How should we understand doing less work? Creating an index, for example, can improve access efficiency: the principle is to turn a full table scan into an index scan, and with an ordered structure the data can be reached with only a small amount of I/O. That boils down to doing less work.
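A tiny sketch of “doing less work” (the table and columns are hypothetical): the index gives the query an ordered structure to probe instead of scanning everything.

    -- Without an index on STATUS, this query must scan the whole table;
    -- the index below turns it into an index range scan reading only a few blocks
    CREATE INDEX idx_orders_status ON orders (status);

    SELECT order_id, amount
      FROM orders
     WHERE status = 'OPEN';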

How should we understand not doing work? A common example is the cache layer in system design: requests that previously had to hit the database are served from a cache instead. This not only improves access efficiency but also reduces the load on the database. From the database’s point of view, that is work it no longer does at all.

Rule two: “If must do, do it fast!” If the database does have to do the work, make it do so as quickly as possible.

How should we interpret this? Parallel execution, for example, is a common database technique that introduces multiple processes to speed up a previously serial process. Finishing faster means the related resources are occupied for less time, which improves overall system throughput.
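A hedged, Oracle-style sketch of “doing it fast” (the SALES table is hypothetical): the PARALLEL hint asks for several processes to scan and aggregate concurrently.

    -- Request a degree of parallelism of 4 for the scan and aggregation
    SELECT /*+ PARALLEL(s 4) */ region, SUM(amount)
      FROM sales s
     GROUP BY region;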

6. SQL execution process

The execution process of a SQL statement is complicated and varies from database to database. Below is an outline for two mainstream databases, Oracle and MySQL. Let’s start with Oracle:

  • The user submits a SQL statement.

  • The database computes a hash value from the literal text of the statement.

  • Based on the hash value, it checks whether an execution plan for this SQL already exists in the buffer.

  • If not, an execution plan has to be generated (the hard parsing process) and the result is then cached.

  • If one exists, the database checks whether it is really the same SQL. (Statements with the same hash value may differ textually, and even textually identical statements may represent different statements. I won’t expand on this here.)

  • If it is the same SQL statement, the execution plan is fetched from the buffer.

  • The execution plan is handed to the executor for execution.

  • The result is returned to the client.

For MySQL, the process is:

  • The client submits a statement.

  • The query cache is checked first; if a corresponding cached result exists, it is returned directly. (A hit is generally very unlikely, so it is usually recommended to turn the query cache off.)

  • The statement is handed to the parser, which generates a parse tree from it.

  • The preprocessor processes the parse tree into a new parse tree; some SQL rewriting happens at this stage.

  • The rewritten parse tree is submitted to the query optimizer, which generates the execution plan (see the sketch after this list for how to inspect it).

  • The execution plan is submitted to the execution engine, which calls the storage engine interface to carry out the execution. Note that MySQL’s Server layer and storage engine layer are separate.

  • The final result is returned to the client by the execution engine and, if the query cache is enabled, is cached.
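To see what the MySQL optimizer decided in the process above, one simple hedged approach (the table is hypothetical) is to prefix the statement with EXPLAIN:

    -- EXPLAIN shows the chosen plan: access type, candidate and chosen indexes, row estimates
    EXPLAIN SELECT order_id, amount FROM orders WHERE status = 'OPEN';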

7. The SQL optimizer

The optimizer came up several times in the execution process described above. It is the core component of a database, so let’s take a closer look at it.

Here is some of what I know about the optimizer. The optimizer is the essence of a database and deserves a DBA’s serious attention; unfortunately, databases are not very open about how it works (Oracle does relatively well here).

Here we see the working process of the MySQL optimizer, which roughly goes through the following stages:

  • Lexical analysis, syntax analysis, and semantic checks
  • Preprocessing (query rewriting, etc.)
  • The query optimization phase (which can be further divided into logical optimization and physical optimization)
  • The query optimizer chooses a plan based on estimates from the cost estimator (which in turn uses statistics in its calculations)
  • The plan is handed to the executor for execution

This table, compiled by Li Haixiang, a MySQL expert in the DBAplus community, compares optimizer techniques across different databases. From it we can see:

  • Databases implement these techniques to different degrees; some are supported, some are not

  • Even when a technique is supported, the implementation principles vary widely

  • This is only a small sample of optimization techniques

  • The comparison also helps explain why different databases behave differently for the same statement; an example follows below

8. SQL execution plans

Understanding execution plans is one of the prerequisites for a DBA’s optimization work; they open a window into the database. Unfortunately, there is no book called “How to Read an Execution Plan,” and the subject is complex; much of this knowledge is accumulated by DBAs over years of practice.

This is a simple example of an Oracle execution plan, illustrating the general content of an execution plan.

9. Case sharing

That was a lot of theory; let me illustrate it with a few cases, simply to help you see what actually happens.

Case 1: Database comparison

The first example is a comparison of optimizer behavior. It compares how three databases (four versions) handle the same statement. From it you can see that optimizers behave differently from database to database (and from version to version). This is also why database selection, database upgrades, and similar work require thorough evaluation and testing.

Two simple test tables are constructed; the key point is that a field in the first table contains NULL values.

The first scenario is the handling of an IN subquery. Oracle 10g and 11g behave the same, so only one is shown here.

For this example the databases differ: Oracle and PG behave similarly, while MySQL does not support hash joins and therefore uses other methods. I won’t go into the technical details here.

The second scenario is the handling of a NOT IN subquery. Here, different versions of Oracle, PG, and MySQL all behave differently. As the example shows, the 11g optimizer is more intelligent in handling this case.
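To make the scenario concrete, here is a hedged sketch of this kind of test (the table and column names are assumed); the NULL value in the first table is what makes the NOT IN case interesting:

    -- Two small test tables; T1.C1 contains a NULL value
    CREATE TABLE t1 (c1 INT);
    CREATE TABLE t2 (c1 INT);
    INSERT INTO t1 VALUES (1);
    INSERT INTO t1 VALUES (2);
    INSERT INTO t1 VALUES (NULL);
    INSERT INTO t2 VALUES (1);

    -- IN subquery: optimizers commonly transform this into a semi-join
    SELECT * FROM t2 WHERE c1 IN (SELECT c1 FROM t1);

    -- NOT IN subquery: because T1.C1 contains NULL, this returns no rows at all,
    -- and it is this NULL-aware handling that databases and versions treat differently
    SELECT * FROM t2 WHERE c1 NOT IN (SELECT c1 FROM t1);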

Case 2: Solving the problem of slow ERP order remittance

Here I’ve constructed a similar structure to simulate what happened in production.

The example is a correlated subquery which, at its core, is converted into a table join implemented with a FILTER, a variant of nested loops. Obviously, if the outer table is too large or the inner probe is too inefficient, the performance is predictable. In general, when two tables are joined, nested loops are a last resort; other methods (such as HASH JOIN or SORT MERGE) are often more effective.

Here the optimizer did not choose a better plan. Is that an optimizer bug, or a limitation of its capabilities? We can intervene manually and see whether we can get the result we actually want.

An UNNEST hint is introduced to force subquery unnesting. Pulling the subquery out gives the optimizer more options: in the resulting execution plan, the optimizer generates an inline view and then joins it to the outer table with a hash join, improving overall efficiency.
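A hedged sketch of the idea (the ORDERS and ORDER_ITEMS tables and the exact statement are assumptions, not the original case):

    -- Original form: the optimizer may run the correlated subquery as a FILTER,
    -- probing the inner query once per outer row (nested-loop style)
    SELECT o.order_id, o.amount
      FROM orders o
     WHERE o.amount > (SELECT AVG(i.amount)
                         FROM order_items i
                        WHERE i.order_id = o.order_id);

    -- With an UNNEST hint inside the subquery, the optimizer is encouraged to
    -- transform it into an inline view and join it, often with a hash join
    SELECT o.order_id, o.amount
      FROM orders o
     WHERE o.amount > (SELECT /*+ UNNEST */ AVG(i.amount)
                         FROM order_items i
                        WHERE i.order_id = o.order_id);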

This example shows that the optimizer has limitations. In some cases, you can manually intervene in the execution of statements to improve the overall execution efficiency.

Case 3: Dealing with an “ERP data cleanup” problem

The following example is a problem caused by poor structural design.

In day-to-day optimization, we tend to consider strategies in the order “statement level, object level, architecture level, business level.” In the requirements and design phases of a project, the order is reversed, and the later levels have far more influence than the earlier ones. A bad object-structure design can lead to a whole series of SQL problems, and this example is one of them.

This was a company’s back-office ERP system that had been online for more than 10 years. Over time, the accumulated data grew larger and larger, and the company planned to clean up some of the large tables. A problem occurred while a DBA was cleaning up one large table of several hundred gigabytes. Under the specified cleanup rules, only a certain proportion of the data (no more than 10%) had to be selected by a range condition (>=) on the primary key field. In practice, however, the SQL performed a full table scan and the execution time was far longer than expected. The DBA tried forcing the use of the index, but that did not help either.

The ERP system had a long history and the relevant design information had been lost, so the problem could only be analyzed from a pure database perspective: an ordinary (non-partitioned) table, queried by a range of primary key values to select a batch of records for cleanup. Intuitively, an index range scan should be the more efficient approach, yet a full table scan was being used. Further analysis showed that the primary key had no business meaning; it was simply an auto-incrementing value generated from a sequence. Oddly, the type of the primary key column was variable-length text rather than the usual numeric type. It is no longer possible to establish why the column was defined that way, but experiments show that it is precisely this “unusual” column type that led to the wrong execution path.

A test environment is constructed below.

It reproduces the problem of the case very well: with the text type the optimizer chooses a full table scan, with the numeric type an index scan, and the difference in efficiency is obvious.
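A hedged sketch of such a reproduction (the table names, row counts, and predicates are assumptions for illustration, Oracle syntax):

    -- Same data, two key types: text vs. numeric
    CREATE TABLE t_char (id VARCHAR2(10) PRIMARY KEY, payload VARCHAR2(100));
    CREATE TABLE t_num  (id NUMBER       PRIMARY KEY, payload VARCHAR2(100));

    -- Both keys come from the same ascending sequence of integers
    INSERT INTO t_char SELECT TO_CHAR(LEVEL), RPAD('x', 100, 'x') FROM dual CONNECT BY LEVEL <= 100000;
    INSERT INTO t_num  SELECT LEVEL,          RPAD('x', 100, 'x') FROM dual CONNECT BY LEVEL <= 100000;
    -- (gather optimizer statistics on both tables before comparing the plans)

    -- Text key: string ordering does not match numeric ordering, so the range
    -- cardinality estimate can be badly skewed and a full table scan may be chosen
    SELECT * FROM t_char WHERE id >= '90000';

    -- Numeric key: the estimate is accurate and an index range scan is typically chosen
    SELECT * FROM t_num  WHERE id >= 90000;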

Picture the index tree structure in your mind: for strings, how is that ordered structure actually stored? Is the ordering what you expected?

Now that the cause is known, what can be done about it? Changing the column type is clearly too costly to be practical. The strategy adopted here was “local ordering”: the range condition in the statement was changed from an open interval to a closed interval, which changes the optimizer’s cardinality estimate. (The book Cost-Based Oracle Fundamentals covers this topic in more depth.)
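A hedged sketch of the workaround, continuing the hypothetical T_CHAR table above:

    -- Open interval: the range estimate on the text key is badly skewed
    SELECT * FROM t_char WHERE id >= '90000';

    -- Closed interval: bounding the range on both sides gives the optimizer a much
    -- smaller cardinality estimate, which can tip it back to an index range scan
    SELECT * FROM t_char WHERE id BETWEEN '90000' AND '99999';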

If that still does not work, you can split the range further, or simply fall back to “row-by-row extraction plus batch binding” to get the job done.

A single poorly chosen data type can cause a great deal of trouble in later work.

Case 4: “Peeling back the layers” to find the problem

Finally, let’s walk through a complete optimization process to see how a DBA “peels back the layers” and uncovers the real nature of a problem.

This case is not meant to illustrate a particular technique, but rather a DBA’s approach to analyzing and handling problems: based on your own knowledge, judge a likely cause, then verify whether it really is the cause. By continually questioning, verifying, and correcting course, you gradually close in on the essence of the problem.

Through this example I also want to show developers that optimizing statements is not easy work for DBAs.

This was a data warehouse system in which one job suddenly ran with a large delay. To avoid affecting the business system the next day, the problem had to be solved that same day. Communication with the developers established that the SQL for this job had not been modified and the related data structures had not changed; other similar jobs (with similar SQL) were running normally, and the database itself showed no anomalies.

After the adjustment, the execution plan was the same as that of the other similar SQL; the whole plan can be summarized as “HASH JOIN” + “FULL TABLE SCAN.” In testing it was slightly faster, but the total run time was still over 2 hours.

For my first attempt, I tried a simple approach. Since similar SQL executed fine and this statement’s plan had deviated, presumably influenced by other SQL, I could manually pin the execution plan; the method used here was extracting the OUTLINE. After testing, it was no faster, so this was not the main cause of the problem.

For the second attempt, I looked at the problem from the perspective of wait events. The first thing considered was caching.

The lesson: don’t blindly believe what others tell you; verify everything yourself, following the standard process, before optimizing. In this case, doing so would have led to the solution much faster.

Q&A

Q1: Will there always be new versions of the ANSI SQL standard? Will new syntax and features be added in future releases?

A1: I haven’t thought about this carefully. The ANSI SQL standard keeps changing, and each database implements a subset of it according to its own circumstances. Personally, I expect that future versions of the standard may look toward big data and data mining and add some new syntax or features. After all, SQL, as the data access interface people are most familiar with, has a promising future in big data and related areas.

Q2: Is the ultimate goal of SQL optimization to change the SQL execution plan?

A2: The first goal is to understand the choice the optimizer has already made and judge whether it is the best one. The second is, given the optimizer’s limitations, to intervene manually so that the database executes the SQL in a better way. After all, people understand the data better than the computer does.

Q3: Can you describe the impact of data type selection on the database during development?

A3: Data types can be considered from the following perspectives:

Select an appropriate type for the data being stored. “Appropriate” here means accurate and sufficient, without waste.

Different data types differ in how efficiently the database itself can store and compute them.

Types must be compatible; in particular, joined fields should have consistent types.
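A hedged sketch (the tables and columns are hypothetical) of why type consistency matters:

    -- If CUST_ID is VARCHAR2 in one table and NUMBER in the other, the join forces an
    -- implicit conversion on one side, which can prevent that side's index from being used
    SELECT o.*
      FROM orders o
      JOIN customers c
        ON o.cust_id = c.cust_id;

    -- The same problem appears with literals: comparing a VARCHAR2 column to a number
    -- converts the column, so an index on PHONE_NO may be ignored
    SELECT * FROM customers WHERE phone_no = 13800000000;   -- implicit conversion
    SELECT * FROM customers WHERE phone_no = '13800000000'; -- index-friendly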

Q4: Can you describe the common methods of Oracle data migration and their advantages and disadvantages?

A4: There are many, depending on the migration requirements. Common ones include:

1. Physical backup and restore; 2. Logical export and import (including transportable tablespaces); 3. Data Guard; 4. Log-based synchronization (such as OGG); 5. Application-level synchronization… As for pros and cons, it mainly comes down to cost and effort; each approach has its own applicable scenarios.

Q5: What are the optimization ideas for statements that require full table scans?

A5: When a full table scan is unavoidable, the second “DoDo” rule from this talk applies: make it as fast as possible. Strategies to consider include (a small sketch follows the list):

  • Use direct path reads (DIRECT PATH READ)

  • Increase the multiblock read count (MULTIBLOCK READ COUNT)

  • Enable parallel execution (PARALLEL)

  • Provide better I/O
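A hedged Oracle-style sketch of two of these strategies (session-level setting plus hints; the BIG_T table is hypothetical):

    -- Larger multiblock reads let each I/O request fetch more blocks during the scan
    ALTER SESSION SET db_file_multiblock_read_count = 128;

    -- Parallel execution splits the full scan across multiple processes
    SELECT /*+ FULL(t) PARALLEL(t 8) */ COUNT(*)
      FROM big_t t;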

Q6: How can GROUP BY statements be optimized?

A6: For grouping, Oracle versions since 11g provide a HASH GROUP BY implementation. Hashing is a memory-intensive operation, so it can be optimized from the perspective of memory usage (for example, PGA work area sizing).

Q7: The access path (execution plan) is cached. How do you determine whether a cached plan is no longer being used?

A7: Generally there is no need to worry about reclaiming it. If you must, you can check from the shared memory information whether the plan has been used recently, and clear it with the appropriate DBMS package.

Q8: After installing Oracle on a cloud platform, we found that concurrency was poor. Why?

A8: Different clouds have different implementation strategies. For concurrency, consider vCPU usage, IO, and so on. I haven’t had much experience with this, sorry!

Q9: Is it appropriate to change a full table scan into a full index scan? And what about using the WITH clause to optimize SQL?

A9: Whether changing a full table scan into a full index scan is appropriate comes back to the basic principle of reducing the amount of data accessed, that is, making the database do less work.

The WITH clause defines a query block. One of its purposes is to avoid repeating references to the same subquery, but be aware that in some cases it can also prevent the optimizer from transforming the query.
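A hedged sketch of the WITH clause (the SALES table is hypothetical): the DEPT_TOTAL block is defined once and referenced twice, instead of repeating the aggregation subquery.

    WITH dept_total AS (
      SELECT dept_id, SUM(amount) AS total
        FROM sales
       GROUP BY dept_id
    )
    SELECT d.dept_id, d.total
      FROM dept_total d
     WHERE d.total > (SELECT AVG(total) FROM dept_total);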

Author: Han Feng

DBAplus community sharing

Content source: CreditEase Technology Institute