This is the 7th day of my participation in the August Text Challenge.

After writing about the WHERE statement in the last article, I felt the urge to continue with LIKE — partly to carry on the WHERE topic, and partly because LIKE, as a fuzzy-matching query, is very common in SQL. It is also something of a love-hate relationship, though: a moment of carelessness can lead to slow database queries. In this article I will cover the concrete usage of LIKE and the causes of slow queries.

In SQL, a LIKE statement typically appears in a WHERE condition. Occasionally it appears inside a CASE WHEN expression, though that is rare.

Wildcards

When it comes to LIKE statements, wildcards are essential. There are two main wildcard characters in SQL:

  • % : matches zero or more characters
  • _ : matches exactly one character

In an actual query, combined with LIKE, each wildcard stands in for the corresponding number of characters in a fuzzy match. For example, to find all rows whose name begins with "CHEN", you can write:

SELECT CASE WHEN y.name LIKE 'CHEN%' THEN 1 ELSE 0 END
FROM (SELECT 'chenqi' AS `name`) y

We use the LIKE keyword rather than the = operator because this is a partial match: the value must start with "CHEN", and the trailing "%" allows any number of characters after it. It is worth noting that MySQL matches case-insensitively here ('chenqi' matches 'CHEN%'), whereas Spark and Flink are both case-sensitive.

(Spark and Flink execution screenshots omitted.)
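Without the three engines at hand, the case-sensitivity difference can be sketched with Python's bundled sqlite3 module: SQLite's LIKE is case-insensitive for ASCII by default (like MySQL), and its case_sensitive_like pragma flips it to the Spark/Flink behavior. This is only an illustration in SQLite, not the engines themselves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# SQLite's LIKE is case-insensitive for ASCII by default,
# matching MySQL's default behavior.
insensitive = conn.execute(
    "SELECT CASE WHEN y.name LIKE 'CHEN%' THEN 1 ELSE 0 END "
    "FROM (SELECT 'chenqi' AS name) y"
).fetchone()[0]

# Turning this pragma on makes LIKE case-sensitive,
# mimicking Spark's and Flink's behavior.
conn.execute("PRAGMA case_sensitive_like = ON")
sensitive = conn.execute(
    "SELECT CASE WHEN y.name LIKE 'CHEN%' THEN 1 ELSE 0 END "
    "FROM (SELECT 'chenqi' AS name) y"
).fetchone()[0]

print(insensitive, sensitive)  # 1 0
```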

Using '%'

'%' matches zero or more characters and can be placed before or after the search keyword, as in the following cases:

  • '%Chen%' queries records containing 'Chen' anywhere
  • 'Chen%' queries records starting with 'Chen'
  • '%Chen' queries records ending with 'Chen'

(The three query results in MySQL, Spark, and Flink: screenshots omitted.)
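The three '%' placements can again be sketched with SQLite via Python's sqlite3, using a small hypothetical table of names. The table and sample values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [("Chenqi",), ("WangChen",), ("LiChenHao",)])

def matches(pattern):
    """Return the names matching a LIKE pattern."""
    rows = conn.execute(
        "SELECT name FROM t WHERE name LIKE ?", (pattern,)).fetchall()
    return [r[0] for r in rows]

print(matches("%Chen%"))  # contains 'Chen' anywhere: all three rows
print(matches("Chen%"))   # starts with 'Chen': only 'Chenqi'
print(matches("%Chen"))   # ends with 'Chen': only 'WangChen'
```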

As you can see, '%' is quite versatile, and all three engines behave the same here. Now let's look at '_'.

Using '_'

Compared with '%', '_' fixes the number of characters matched, so in a sense it is more precise. For example, if I know a value starts with 'ch' and is 4 characters long, I can fill in the rest with '_':

SELECT CASE WHEN y.name LIKE 'ch__' THEN 1 ELSE 0 END
FROM (SELECT 'chen' AS `name`) y

In that case, the following statement will not return the desired result, because 'ch_' matches exactly three characters:

SELECT CASE WHEN y.name LIKE 'ch_' THEN 1 ELSE 0 END
FROM (SELECT 'chen' AS `name`) y

In other words, '_' matches more strictly than '%'.


Using '_' and '%' together

Besides being used individually, the two wildcards can also be combined; a few selected cases are demonstrated below.

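A combined pattern can be sketched in SQLite via Python's sqlite3. The pattern '_h%' below means: exactly one arbitrary character, then a literal 'h', then anything.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# '_h%': one arbitrary character, then 'h', then zero or more characters.
ok = conn.execute("SELECT 'chen' LIKE '_h%'").fetchone()[0]
# 'hen' fails: '_' consumes the 'h', and the next character is 'e', not 'h'.
miss = conn.execute("SELECT 'hen' LIKE '_h%'").fetchone()[0]
print(ok, miss)  # 1 0
```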

NOT LIKE

Just a brief mention: NOT LIKE is simply the negation of LIKE; it returns the rows that do not match the pattern. No in-depth demonstration here.
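For completeness, a one-line SQLite sketch of NOT LIKE (sample names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("Chenqi",), ("Wang",)])
# NOT LIKE keeps only the rows that do NOT match the pattern.
rows = [r[0] for r in conn.execute(
    "SELECT name FROM t WHERE name NOT LIKE 'Chen%'")]
print(rows)  # ['Wang']
```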

Problems with using wildcards

While wildcards make fuzzy queries convenient, using them inappropriately can force full-table scans. This concern applies to MySQL; big-data compute engines such as Spark and Flink scan the data anyway, so they have no such issue.

Why LIKE can defeat an index: during the index-tree traversal, a LIKE filter can only use the characters before the first wildcard. The remaining characters are merely filter predicates and do not narrow the scanned index range. A single fuzzy-match expression can therefore contain two predicate types:

(1) the part before the first wildcard acts as the access predicate;

(2) the remaining characters act as filter predicates.

The more selective the prefix before the first wildcard, the smaller the scanned index range, and thus the faster the index lookup. The following figure illustrates this with three different LIKE expressions. All three select the same row, but they scan very different index ranges, so their performance differs greatly.

A variety of LIKE searches

The first expression has two characters before the wildcard, which limit the scanned index range to 18 rows. Only one of them matches the entire LIKE expression; the other 17 are fetched and discarded. The second expression has a longer prefix, narrowing the scanned range to two rows; the database reads only one extra row that does not belong to the result. The last expression has no filter predicate at all: the database reads only the entries that match the entire LIKE expression. By contrast, a LIKE expression that begins with a wildcard cannot serve as an access predicate. If no other condition provides one, the database must scan the entire table.
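The access-vs-filter distinction can be observed directly in SQLite through EXPLAIN QUERY PLAN (again only an analogue of MySQL's behavior). SQLite applies its LIKE prefix optimization only when LIKE is case-sensitive, and only for literal patterns, hence the pragma and the inlined pattern below. The table and patterns are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (last_name TEXT)")
conn.execute("CREATE INDEX emp_name_idx ON emp(last_name)")
# SQLite only rewrites LIKE 'abc%' into an index range scan
# when LIKE is case-sensitive.
conn.execute("PRAGMA case_sensitive_like = ON")

def plan(pattern):
    # The pattern is inlined as a literal so the planner can see it.
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail).
    row = conn.execute(
        f"EXPLAIN QUERY PLAN SELECT * FROM emp "
        f"WHERE last_name LIKE '{pattern}'").fetchone()
    return row[3]

print(plan("WIN%"))   # SEARCH ...: the prefix acts as an access predicate
print(plan("%IND%"))  # SCAN ...: a leading wildcard forces a full scan
```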

Special LIKE syntax

MySQL and Spark offer more varied LIKE-style fuzzy matching. For example, both support regular-expression patterns, which allow far more flexible templates.

Regular-expression matching

In Spark, regular-expression matching is done with RLIKE (or its synonym REGEXP); MySQL likewise supports the REGEXP operator. The earlier example can be adapted accordingly.

This syntax is not supported in Flink.
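SQLite has no built-in regexp() function, but registering one from Python enables its REGEXP operator, which roughly mimics Spark's RLIKE (case-sensitive, partial-match semantics). This is a sketch under that assumption, not Spark itself.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite parses 'x REGEXP y' as regexp(y, x); supplying the function
# gives us an RLIKE-style operator using Python's re module.
conn.create_function(
    "regexp", 2,
    lambda pattern, value: re.search(pattern, value) is not None)

print(conn.execute("SELECT 'chenqi' REGEXP '^chen'").fetchone()[0])  # 1
# Regex matching is case-sensitive, like Spark's RLIKE.
print(conn.execute("SELECT 'chenqi' REGEXP '^CHEN'").fetchone()[0])  # 0
```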

Defining an escape character

An escape character is declared with the ESCAPE keyword, which must be used together with LIKE. It lets you match a literal '%' or '_' in the data, for example:
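A minimal sketch in SQLite: '\' is declared as the escape character, so '\%' in the pattern matches a literal percent sign instead of acting as a wildcard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Pattern '50\% %' with ESCAPE '\' means: literal "50%", a space,
# then any number of characters.
hit = conn.execute(
    r"SELECT '50% off' LIKE '50\% %' ESCAPE '\'").fetchone()[0]
# '500 off' has '0' where a literal '%' is required, so it fails.
miss = conn.execute(
    r"SELECT '500 off' LIKE '50\% %' ESCAPE '\'").fetchone()[0]
print(hit, miss)  # 1 0
```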

A question

Since LIKE can cause performance problems, can using it inside CASE WHEN cause the same problems? To answer that, revisit the first article in this series on SQL execution order; after reading it, you should have a good idea.

Conclusion

  1. MySQL matches characters in LIKE patterns case-insensitively, whereas Spark and Flink are case-sensitive.
  2. Avoid query patterns that begin with a wildcard in MySQL; otherwise the index cannot be used.
  3. The regular-expression matching syntax does not work in Flink.

References:

  • Use-the-index-luke.com/sql/where-c…