34 | how can use the join?

In actual production, questions about the use of join statements generally focus on the following two categories:

  1. What’s wrong with using join?
  2. If two tables of different sizes join, which table should be used as the driver table?

In this article, I will tell you how the JOIN statement is executed and then answer these two questions. To facilitate quantitative analysis, I will create two tables T1 and T2 to explain to you.

CREATE TABLE `t2` (
`id` int(11) NOT NULL,
`a` int(11) DEFAULT NULL,
`b` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `a` (`a`)
) ENGINE=InnoDB;

drop procedure idata;
delimiter ;;
create procedure idata()
begin
  declare i int;
  set i=1;
  while(i<=1000)do
    insert into t2 values(i, i, i);
    set i=i+1;
  end while;
end;;
delimiter ;
call idata();

create table t1 like t2;
insert into t1 (select * from t2 where id<=100)
Copy the code

As you can see, both tables have a primary key index ID and an index A, and no indexes on field B. Stored procedure idata() inserts 1000 rows into table T2 and 100 rows into table T1.

Index Nested-Loop Join

Let’s take a look at this statement:

select * from t1 straight_join t2 on (t1.a=t2.a);

If the JOIN statement is used directly, the MySQL optimizer may select table T1 or T2 as the driver table, which will affect the analysis of the execution of the SQL statement.

So, to analyze performance issues during execution, I switched to Straight_JOIN to let MySQL execute the query using a fixed join so that the optimizer would only join as we specified. In this statement, T1 is the driver table and T2 is the driven table.

Now, let’s look at the explain result of this statement.

Select * from table T2 where index a is used in join;

  1. Read a row R from t1;
  2. Select column A from row R and search in table T2.
  3. Take out the rows satisfying the conditions in table T2 and form a row with R as part of the result set;
  4. Repeat steps 1 through 3 until the end of table T1 completes the loop.

The procedure is to traverse table T1 and then search table T2 for records that meet the criteria based on the value A in each row retrieved from table T1. In form, this process is similar to the nested queries we write programs, and can use the indexes of the driven table, so we call it an indexnested-loop Join, or NLJ.

The corresponding flow chart is as follows:

In this process:

  1. A full table scan is performed on driver table T1, which requires 100 rows to be scanned.
  2. For each row R, table T2 is searched according to field A, following a tree search process. Since the data we construct is one-to-one, we scan only one row at a time, and a total of 100 rows.
  3. So, the total number of lines scanned throughout the execution process is 200.

Now that we know the process, try answering the first two questions. Let’s start with the first question: Can we use join? Assuming that no JOIN is used, we can only use a single table query. Let’s look at the requirements of the above statement, using a single table query how to implement.

  1. Select * from t1; select * from t1;
  2. Loop over the 100 rows:
  • $R.a; $R.a;
  • Select * from t2 where a=$R.a;
  • Form the returned result and R into a row in the result set.

As you can see, in this query, 200 rows were also scanned, but a total of 101 statements were executed, 100 more interactions than a direct JOIN. In addition, the client concatenates the SQL statements and results itself. Obviously, this is better than joining directly.

Let’s look at the second question: how to choose the driver table? During the execution of the join statement, the driven table is scanned through the entire table, while the driven table is searched through the tree.

Suppose the number of rows in the driven table is M. Each time a row is queried in a driven table, the primary key index is searched first. The approximate complexity of each tree search is logarithm base 2 of M, denoted as log M, so the time complexity of looking up a row on the driven table is 2 log2 M.

If the number of rows in the driver table is N, the execution will scan the driver table for N rows and then match the driven table once for each row.

So the approximate complexity of the whole process is N+N2log M. Obviously, N has a greater effect on the number of rows scanned, so small tables should be used to drive tables. To summarize here, we can draw two conclusions from the above analysis:

  1. Using a JOIN statement provides better performance than forcing multiple single-table SQL statements to execute.
  2. If you use the JOIN statement, you need to have a small table as the driver table.

However, you should note that this conclusion assumes that the index of the driven table can be used. Next, let’s look at the case where the index is not used by the driven table.

Simple Nested-Loop Join

Now let’s change our SQL statement to look like this:

select * from t1 straight_join t2 on (t1.a=t2.b);

Since there is no index on field B of table T2, a full table scan is performed each time t2 is matched using the execution flow shown in Figure 2.

You can imagine the problem for a moment and continue using the algorithm shown in Figure 2 to get the correct result. This algorithm is correct if you only look at the results, and it also has a name called “Simple nested-loop Join”. However, this SQL request scans table T2 up to 100 times, with a total of 100* 1000= 100,000 rows.

These are just two small tables, and if T1 and T2 were 100,000 rows each (which, of course, is still in the small table category), scanning 10 billion rows would seem too cumbersome. Of course, MySQL doesn’t use this Simple nested-loop Join algorithm either. Instead, it uses another algorithm called blocknested-loop Join, or BNL for short.

Block Nested-Loop Join

At this point, there is no index available on the driven table, and the flow of the algorithm looks like this:

  1. Insert table T1 into join_buffer (select * from join_buffer);
  2. Scan table T2, extract each row from table T2, compare it with join_buffer, and return the rows that meet join conditions as part of the result set.

The flow chart of this process is as follows:

Accordingly, the explain results of this SQL statement are as follows:

As you can see, a full table scan is performed on both tables T1 and T2, so the total number of rows scanned is 1100. Since join_buffer is organized as an unordered array, each row in table T2 needs to be judged 100 times. The total number of judgments to be made in memory is: 100 * 1000= 100,000 times.

As we said earlier, if the Simple nested-loop Join algorithm is used for the query, the number of rows scanned is also 100,000. Therefore, in terms of time complexity, the two algorithms are the same. However, the 100,000 judgments of the Block Nested-loop Join algorithm are memory operations, which are much faster and perform better.

Next, let’s see which table should be chosen as the driver table in this case. Assuming that the number of rows in the small table is N and the number of rows in the large table is M, then in this algorithm:

  1. Both tables perform a full table scan, so the total number of rows scanned is M+N.
  2. The number of decisions in memory is M times N.

As can be seen, there is no difference between M and N in the two formulas, so the execution time is the same when selecting a large table or a small table to drive the table.

Table T1 has only 100 rows. What if table T1 is a large table and join_buffer does not fit?

The size of join_buffer is set by the join_buffer_size parameter. The default value is 256K. If you cannot fit all the data in table T1, the strategy is simply to fragment it. I’ll change the join_buffer_size to 1200 and then run:

select * from t1 straight_join t2 on (t1.a=t2.b);

The execution process becomes:

  1. Scan table T1, read rows in sequence and add them to join_buffer. When line 88 is finished, join_buffer is full, continue step 2.
  2. Scan table T2 and extract each row from T2 and compare it with join_buffer. Those that meet join conditions will be returned as part of the result set.
  3. Empty join_buffer;
  4. Continue to scan t1, read the last 12 rows sequentially into join_buffer, and continue with step 2.

The execution flowchart looks like this:

Steps 4 and 5 in the figure indicate that the join_buffer reuse is cleared. This process shows the origin of “Block” in the name of this algorithm, which means “to divide into blocks to join”.

As you can see, table T1 is split into join_buffer twice, resulting in table T2 being scanned twice. Although it is divided into two times and put into join_buffer, The Times of judging equivalent conditions remain unchanged, which is still (88+12)* 1000= 100,000 times.

Let’s look at the choice of the driver table in this case. Assume that the number of data rows in the driven table is N, which needs to be divided into K segments to complete the algorithm process, and the number of data rows in the driven table is M. Note that K is not a constant here, and as N increases, K increases, so K is represented as λ* N, which obviously ranges from (0,1). Therefore, during the execution of this algorithm:

  1. The number of lines scanned is N+λ* N* M;
  2. Memory Check N x M times.

Obviously, the number of memory judgments is not affected by which table is selected as the driver table. Considering the number of lines scanned, if the size of M and N is fixed, the result of the whole equation will be smaller if N is smaller. So the conclusion is that you should let the smaller table be the driver table. Of course, you’ll notice that λ is the key factor in the N+λ* N * M formula, and the smaller the value, the better.

We just said that the larger N is, the larger the number of segments K is. Now, when N is fixed, what parameter affects the size of K? (the size of λ) the answer is join_buffer_size.

The larger join_BUFFer_SIZE is, the more rows that can be placed at one time, the fewer segments that can be divided into, and the fewer full table scans of the driven table.

That’s why you might see some advice telling you to increase join_buffer_size if your JOIN statement is slow.

Now that you understand the two algorithms used by MySQL to perform joins, let’s try to answer the two questions at the beginning of this article

First question: Can I use the join statement?

  1. If you can use the indexnested-loop Join algorithm, which means that you can use indexes on the driven table, this is fine;
  2. If the Block nested-loop Join algorithm is used, too many rows will be scanned. In particular, join operations on large tables may have to scan the driven table many times and consume a lot of system resources. Therefore, do not use this kind of join. When deciding whether to use a JOIN statement, look for “BlockNested Loop” in the “Extra” field in explain results.

The second question is: if you want to use a JOIN, should you choose a large table as the driver table or a small table as the driver table?

  1. If indexnested-loop Join algorithm is used, a small table should be selected as the driver table.
  2. If it is the Block nested-loop Join algorithm:

If join_BUFFer_size is large enough, this is the same; When join_BUFFer_SIZE is not large enough (which is more common), you should choose a small table as the driver table.

Therefore, the conclusion of this question is that you should always use small tables as driver tables.

When deciding which table should be the driver table, the two tables should be filtered according to their own conditions. After filtering, the total amount of data of each field participating in the join should be calculated. The table with the small amount of data is the “small table” and should be used as the driver table.