With the popularization of big data technology, as well as the lower threshold of use. SQL has become increasingly important as a standard part of computing engines. Both hive, an older generation warehouse, and Spark or Flink, a popular computing engine, support SQL as a standard query language according to ANSI standards, respectively. As a qualified data test engineer, I face huge ETL SQL all day long. If you have good SQL writing skills, the actual testing work is obviously very helpful. As the first article to build SQL infrastructure, “SQL query order” is undoubtedly the preferred target.

A, SQL query sequence

It is no exaggeration to say that many people who have been writing programs for years, because they are used to ORM framework, have rarely written SQL, SQL execution order is mostly blurred. If you ask, I can only give you a general idea. However, there is no suitable ORM framework for data testing, especially spark and Flink, when the object is a Parquet file or stream log. This is why it is important to have a deep understanding of the order in which SQL is executed.

SQL differs from other programming languages, most notably in the order in which code is processed. In most programming languages, code is processed in coded order. But in SQL, the first clause to be processed is always the FROM clause

fromT1 # Prepare initial datajoin t2 Join# Additional needjoinThe data ofon t1.id =T2.user_id # Connection conditionwhere x>10# Filter conditiongroup byY # groupinghaving sum(x)>10# Filter after groupingselectCalculation expression (distinct) # query resultunion# league tableorder byFilter result limit (offset) # limit the number of outputsCopy the code

The FROM operation is performed first and the LIMIT operation is performed last. Each operation produces a virtual table that serves as input to a process. Spark and Flink’s processing of SQL becomes a physical execution plan, where each SQL step actually corresponds to a different operator, and the result of each step becomes the engine’s own data set (not extended here). These e virtual tables (datasets) are transparent to the user, and only the virtual tables generated in the last step are returned to the user. If a clause is not specified in the query, the corresponding step is skipped. Knowing the above order, you can quickly determine what results a given SQL query will return, and you can easily answer questions like:

  • Can beGROUP BYThe results of WHEREScreening?
  • Can I filter the results of window functions?
  • Can beGROUP BYIs executedORDER BYOperation?

And so on.

However, in practice the database/big data engines do not run queries strictly in this order, as they also perform a series of optimizations to speed up the query.

So:

  • When you want to understand the validity of a query or why a query is returned, try this diagram.
  • However, using this diagram cannot explain query performance or indexing-related issues, which involve more variables and are therefore more complex.

Alias calculation

Feel free to use a table of data to illustrate.

SQL syntax allows you to write:

SELECT CONCAT(username, ' hello') AS hello, count(*)
FROM `user`
GROUP BY hello
Copy the code

In MySQL actual test results:

Actual test results in Spark SQL:

spark-sql> SELECT CONCAT(username, ' hello') AS hello, count(*)> FROM ( > select "tailou" as username > UNION > select "diga" as username) as t > GROUP BY hello; Diga hello 1 tailou Hello 1 Time taken: 1.808 seconds, Touch2 row(s)Copy the code

Actual test results in Flink SQL:

Flink SQL> SELECT CONCAT(username, ' hello') AS hello, count(*) FROM (select 'tailou' as username UNION  select 'diga' as username) as t GROUP BY hello;
[ERROR] Could not execute SQL statement. Reason:
org.apache.calcite.sql.validate.SqlValidatorException: Column 'hello' not found in any table

Flink SQL> SELECT CONCAT(username, ' hello') AS hello, count(*) FROM (select 'tailou' as username UNION  select 'diga' as username) as t GROUP BY CONCAT(username, ' hello');
[INFO] Result retrieval cancelled.
Copy the code

In MySQL and Spark, it looks like GROUP BY is executed after SELECT, but GROUP BY is executed first because GROUP BY references the alias in SELECT.

The engine can rewrite the query as:

SELECT CONCAT(username, ' hello') AS hello, count(*) FROM (select 'tailou' as username UNION  select 'diga' as username) as t GROUP BY CONCAT(username, ' hello');
Copy the code

Next, execute the statement in GROUP BY before performing the SELECT operation, so it works. The engine will certainly perform a series of checks to ensure that the contents in SELECT and GROUP BY match before the query starts running, so it must examine the query statement as a whole before making an execution plan.

Unlike them, Flink can’t do this kind of optimization, but instead needs to change these into normal SQL order for them to work. There’s a caveat here.

Write in the last

By exploring the execution order of SQL query statements, I hope to help more people understand the execution order of SQL and how to write SQL query statements correctly. Also from the actual test, found that in fact when facing different engines, SQL writing has a certain “dialect” distinction. In the case of Spark, anSI-based and non-ANSI-based versions are also available. Therefore, in the actual work, we need more practice and accumulation. Can be handy in the follow-up work. So in this chapter, open the follow-up SQL basic ability learning. All subsequent articles will also include examples of MySQL, Spark, and Flink engines to distinguish between them.

References:

1, juejin. Cn/post / 686455… 2, juejin. Cn/post / 698495…