This is the second day of my participation in the August Text Challenge.More challenges in August

Today’s computer languages all have a perfect logical branch judgment structure. As a standard for structured queries, SQL syntax also has statements for logical processing that allow us to use IF… ELSE to execute the expression to get different results. This syntax is called CASE syntax.

Demo

The following is an example of SQL from an ETL

select
channel_id,ifnull(p.place_type,2) as place_type,

sum(case when (o.total_price/100< =5) and time_first_order_success between unix_timestamp('${day_start}'.'yyyy-MM-dd')*1000 and unix_timestamp('${day_end}'.'yyyy-MM-dd')*1000  then 1 else 0 end) as user_less_5_cnt,
sum(case when (o.total_price/100< =10) and time_first_order_success between unix_timestamp('${day_start}'.'yyyy-MM-dd')*1000 and unix_timestamp('${day_end}'.'yyyy-MM-dd')*1000  then 1 else 0 end) as user_less_10_cnt,
sum(case when (o.total_price/100< =15) and time_first_order_success between unix_timestamp('${day_start}'.'yyyy-MM-dd')*1000 and unix_timestamp('${day_end}'.'yyyy-MM-dd')*1000  then 1 else 0 end) as user_less_15_cnt,
...
from
table1 c join
table2 u on c.id=u.channel_id
left join  (select * from table3 where cur_date between ${start_ymd} and ${end_ymd}  and status=20 ) o  on u.id=o.buyer_id and u.time_first_order_success=o.time_create
left join (select * from table3 where cur_date between ${start_ymd} and ${end_ymd} and status = 20 ) o2 on u.id = o2.buyer_id and u.time_second_order_success = o2.time_create
left join (select *, (case when tag like '% home %' then 1 else 2 end) as place_type from 4) p on p.id=u.place_id
where  IFNULL(o.total_price,ifnull(o2.total_price,0)) > 0
group by channel_id,place_type
Copy the code

Because it is easy to use, there are a lot of CASE statements in ETL.

Basic grammar

CASE[expression]WHENexpression1 THEN result_1
WHENexpression2 THEN result_2 
WHENThe expression of nTHEN result_n
...
[ELSE] [result]
END case_name
Copy the code
  • [] indicates optional

Because of the optionality of the “expression” above, CASE statements can be divided into the following two cases.

A simple expression

When a CASE is followed by an expression, it is called a simple expression. At this point, the value of the expression will be evaluated against all the remaining values of “expression 1″,” expression 2″, and “expression n”, where expression 1.. N can be a constant or an expression, but it will be equal. If comparison judgments (<,>,<=, etc.) are involved, the following query expression needs to be used.

MySQL test results

mysql> SELECT
    -> CASE 2>1
    -> WHEN 3>2 THEN "true"
    -> WHEN 3<2 THEN "false"
    -> END
    -> FROM dual;
+---------------------------------------------------------+
| CASE 2>1
WHEN 3>2 THEN "true"
WHEN 3<2 THEN "false"
END |
+---------------------------------------------------------+
| true                                                    |
+---------------------------------------------------------+
1 row in set (0.00 sec)
Copy the code

Spark Actual Test

spark-sql> SELECT CASE 2>1 WHEN 3>2 THEN "true" WHEN 3<2 THEN "false" END ;
true
Time taken: 3.708 seconds, Fetched 1 row(s)
Copy the code

Flink actual test

According to the value comparison

SELECT CASE 1 WHEN 1 THEN "true" WHEN 3<2 THEN "false" END FROM dual
Copy the code

Because it is too simple, I will write a demo, not post engine test results. It is important to note that in this writing, SQL will automatically perform implicit type conversions under different engines when the expressed value or result is digitized. For example, in MySQL

SELECT CASE 1 WHEN "1" THEN "true" WHEN 3<2 THEN "false" END FROM dual
Copy the code

with

SELECT CASE '1' WHEN 1 THEN "true" WHEN '1' THEN "false" END FROM dual
Copy the code

The results are consistent.

But in Flink

Flink SQL> SELECT CASE 1 WHEN '1' THEN 'true' WHEN 3<2 THEN 'false' END;
[ERROR] Could not execute SQL statement. Reason:
org.apache.calcite.sql.validate.SqlValidatorException: Cannot apply '=' to arguments of type '<INTEGER> = <BOOLEAN>'. Supported form(s): '<COMPARABLE_TYPE> = <COMPARABLE_TYPE>'
Copy the code

If it fails, look at spark’s performance

spark-sql>  SELECT CASE 1 WHEN '1' THEN 'true' WHEN 3<2 THEN 'false' END;
true
Time taken: 0.283 seconds, Fetched 1 row(s)
Copy the code

The result is consistent with MySQL. So take a note here – implicit type conversions are not supported in Flink.

Both MySQL and Spark are supported. Obviously, there is an interesting question, which is what happens if the engine does an implicit conversion in expression 1 and fills in the same value in expression 2? If you take a close look at the SQL above, you’ll notice that after an implicit conversion, no matter how it’s executed, the result of the first matching expression is printed first. This brings up another magic problem – the execution order of CASE expressions. After many retry experiments, it is proved that the execution order of CASE is also based on the order of expression execution.

Query expression

The system evaluates expressions according to the order of expressions. – This has been proved above. The same order is followed in the query expression. When the result of the expression is true, the result after the corresponding expression then is returned. This is the most common use of CASE. Take MySQL for example

SELECT CASE WHEN 2>1 THEN 'true' WHEN 2<1 THEN 'false' END
Copy the code

spark

spark-sql> SELECT CASE WHEN 2>1 THEN 'true' WHEN 2<1 THEN 'false' END; True Time taken: 0.064 seconds, Touch1 row(s)Copy the code

flink

Well, the case is so boring. Well, let’s talk about something interesting. – What if none of them match?

ELSE&NULL

Normally, the ELSE statement in a CASE statement is used to solve a mismatch. This can be used directly in both simple and query expressions. Such as

mysql> SELECT CASE WHEN 2<1 THEN 'true' WHEN 2<1 THEN 'false' ELSE 'hello' END;
+------------------------------------------------------------------+
| CASE WHEN 2<1 THEN 'true' WHEN 2<1 THEN 'false' ELSE 'hello' END |
+------------------------------------------------------------------+
| hello                                                            |
+------------------------------------------------------------------+
1 row in set (0.00 sec)
Copy the code

spark

spark-sql> SELECT CASE WHEN 2<1 THEN 'true' WHEN 2<1 THEN 'false' ELSE 'hello' END;
hello
Time taken: 0.08 seconds, Fetched 1 row(s)
Copy the code

flink

It’s kind of interesting. This is similar to if.. else.. It’s a perfect fit. Remember when we first introduced syntax, we saw that ELSE and result could actually be omitted. Now that we know the ELSE effect, what does SQL return when there is no ELSE and there is no ELSE? Continue to test

mysql> SELECT CASE WHEN 2<1 THEN 'true' WHEN 2<1 THEN 'false'  END;
+------------------------------------------------------+
| CASE WHEN 2<1 THEN 'true' WHEN 2<1 THEN 'false'  END |
+------------------------------------------------------+
| NULL                                                 |
+------------------------------------------------------+
1 row in set (0.00 sec)
Copy the code

spark

spark-sql> SELECT CASE WHEN 2<1 THEN 'true' WHEN 2<1 THEN 'false'  END;
NULL
Time taken: 0.098 seconds, Fetched 1 row(s)
Copy the code

flink

As the saying goes, a null pointer is the mother of all evil. To develop good coding habits, you should ask your developers to avoid missing ELSE or ifNULL exception handling when writing null values. This is an essential capability in reviewing and testing ETL SQL to identify data quality issues at the coding stage.

Alternative query expression

The above query cases are all single logical operations. That’s something like A<B, and recently A colleague asked me A strange question. As such as

select case when 1<table.id<10 end from table
Copy the code

Is this form of SQL executable? My first reaction was, no. I have never encountered this kind of SQL since I have been around SQL. But without testing it, there’s no way to prove it’s true or false. So I also executed related statements on MySQL, Spark and Flink to verify. Try it first in MySQL, and the conclusion is, support!

In the spark again

A Boolean type cannot be compared to a value, which reminds me of MySQL’s implicit conversion, because MySQL implicitly converts false to 0 when flase is compared to a value. Such as

That’s why Boolean expressions written this way hold in MySQL. Then look at flink

Again, no, according to the error message, it has also been proved that it is the implicit transformation that makes this writing possible. Spark and Flink do not support implicit conversion.

Nested CASE

As in programming languages, nesting of CASE statements is supported in SQL. To solve more complex problems, nested forms are used in SQL

SELECT CASE 1 WHEN "1" THEN (CASE 2 WHEN 2 THEN 'true' END) WHEN 3<2 THEN "false" END FROM dual
Copy the code

In the spark

spark-sql> SELECT CASE 1 WHEN '1' THEN (CASE 2 WHEN 2 THEN 'true' END) WHEN 3<2 THEN 'false' END;
true
Time taken: 0.153 seconds, Fetched 1 row(s)
Copy the code

Flink, since implicit conversion is not supported, we change SQL to

 SELECT CASE 1 WHEN 1 THEN (CASE 2 WHEN 2 THEN 'true' END) WHEN 2 THEN 'false' END;
Copy the code

CASE in the WHERE condition

Finally, there is the use of case statements, which are used dynamically in WHERE conditions. Of course, this writing method is very rare in ETL, because the extraction conditions are generally fixed and will not change, so there is no need to dynamically adjust the filter conditions of the query according to the value of the expression. Such as

SELECT * FROM (select 1 as `name` UNION SELECT 2 as `name`) t WHERE (CASE 1 when 1 THEN `name` END) = 2
Copy the code

For demonstration purposes, the simplest notation is used, with a single WHEN conditional expression. In fact, the CASE statement is the same as in select.

flink

spark

spark-sql> SELECT * FROM (select 1 as `name` UNION SELECT 2 as `name`) t WHERE (CASE 1 when 1 THEN `name` END) = 2;
2
Time taken: 1.604 seconds, Fetched 1 row(s)
Copy the code

The final summary

  • By the end of this chapter, you should have a better understanding of the general usage and usage scenarios of CASE.
  • Flink does not support implicit conversion of CASE because it requires strong type consistency. However, MySQL and Spark support implicit conversion of CASE

Good;

  • Be aware of missing ELSE writes, which can have a serious impact on data quality.
  • Because of the implicit conversion capability of different engines, there may be some deviation in expression writing. Therefore, it is recommended to use logical operators such as AND or to connect multiple Boolean expressions in order to unify the coding style. You can’t go wrong this way.