Hello everyone, I’m Li Xiabing, today we are going to learn and joke about MySQL Join function.
MySQL’s join function is too weak. MySQL’s join function is too weak. MySQL’s join function is too weak, and MySQL’s join function is too weak. These norms or remarks are both true and false, sometimes right and sometimes wrong. It is necessary for us to have a deep understanding of Join before we can clearly understand them.
Let’s take a look at MySQL’s join operation.
The body of the
In daily database query, we often need to join multiple tables to obtain the data after multiple tables merge at one time, which is to use the join syntax of database. Join is a common operation in the data world to join two data sets. MySQL, Oracle, PostgreSQL, and Spark all support this operation. This article focuses on MySQL, and uses MySQL’s join as the main language if not specified below. Oracle, PostgreSQL, and Spark are the big winners, with better join algorithm optimization and implementation than MySQL.
The MySQL join has many rules, so a bad join statement may not only lead to a full table query of a certain table, but also affect the cache of the database, causing most of the hot data to be replaced, which will drag down the whole database performance.
Therefore, the industry has summarized many specifications or principles for MySQL join, such as small tables driving large tables and prohibiting join operations of more than three tables. In the following sections, we will introduce the MySQL join algorithm, compare it with the Join implementation of Oracle and Spark, and explain why the above specification or principle is formed.
Nested Loop Join, Hash Join and Sort Merge join are three common algorithms, each of which has its own advantages and disadvantages and applicable conditions. And we’ll talk about it one by one.
Nested Loop Join implementation in MySQL
A Nested Loop Join is a scan driven table that queries data in the driven table based on the index on the associated field of a Join after each record is read. It is suitable for scenarios where the subset of data being joined is small, and it is the only algorithmic implementation of MySQL Join, which will be explained in more detail later.
MySQL has two variants of the nested-loop Join algorithm: Index nested-loop Join and Block nested-loop Join.
Index nested-loop Join algorithm
Let’s start by initializing the associated table structure and data
CREATE TABLE `t1` ( `id` int(11) NOT NULL, `a` int(11) DEFAULT NULL, `b` int(11) DEFAULT NULL, PRIMARY KEY (`id`), KEY `a` (`a`) ) ENGINE=InnoDB; delimiter ;; Create procedure init_data() begin Declare I int; set i=1; while(i<=10000)do insert into t1 values(i, i, i); set i=i+1; end while; end;; delimiter ; # call store to initialize t1 call init_data(); Create table t2 like t1; insert into t2 (select * from t1 where id<=500)Copy the code
Table B has a primary key index ID and a primary key index A, but no index on field B. The stored procedure init_data inserts 10000 rows into table T1 and 500 rows into table T2.
Straight_join To prevent the MySQL optimizer from selecting tables as driver tables and affecting the execution of analysis SQL statements, we directly used Straight_join to let MySQL run queries using a fixed order of join tables, where T1 was the driver table and T2 was the driven table.
select * from t2 straight_join t1 on (t2.a=t1.a);
Copy the code
Take a look at the execution plan for this statement using the explain command described in our previous article.
As can be seen from the above figure, the a field on T1 is indexed, which is used in the join process. Therefore, the execution flow of the SQL statement is as follows:
- Select * from t2;
- Use the A field of L1 to query the T1 table as a condition;
- Take out the rows that meet the conditions in T1 and form corresponding rows with L1 to become part of the result set.
- Repeat until the T2 table has been scanned.
This process is called Index nested-loop Join, or NLJ for short, and its corresponding flow chart is shown below.
Note that in step 2, when querying from table T1 against field A, indexes are used, so only one row is scanned per scan (from explain results, varies by case scenario).
Suppose the number of rows in the driver table is N and the number of rows in the driven table is M. In the process of executing this JOIN statement, the driven table scans the whole table, while the driven table uses indexes, and every row of data in the driven table needs to be indexed in the driven table, so the approximate complexity of the whole join process is N2log2M. Obviously, N has a greater effect on the number of rows scanned, so in this case you should let small tables do the driving.
Of course, all this assumes that the associated field of the JOIN is A, and that the A field of the T1 table has an index.
If there is no index, a full table scan is performed each time t1 is matched. This also leads to an unacceptable N * M programming time complexity for the entire process. So, when there is no index, MySQL uses the Block nested-loop Join algorithm.
Block Nested-Loop Join
Block nested-loop Join (BNL for short) is the Join algorithm used by MySQL when no index is available on the driven table. The flow is as follows:
- Table T2 is read into join_buffer of the current thread. In the example of this article, SQL does not do any conditional filtering on T2, so the entire table t2 is put into memory.
- Scan table T1 and compare each row with the data in JOIN_buffer. If the join conditions are met, the data will be put into the result set.
Take this SQL for example
select * from t2 straight_join t1 on (t2.b=t1.b);
Copy the code
The explain results for this statement are shown below. It can be seen that
As can be seen, the join process performed a full table scan for both T1 and T2, and put all 500 entries in table T2 into join_buffer, and traversed each row in table T1 in Join_buffer. Therefore, a total of 500 * 10,000 memory comparison operations are required, as shown in the following figure.
The main thing to notice in the first step is that instead of putting all the data in table T2 into join_buffer, we put the data in different rows and different fields according to the specific SQL statement. For example, the following join statement will only store the b field of t2 in join_buffer if b >= 100.
select t2.b,t1.b from t2 straight_join t1 on (t2.b=t1.b) where t2.b >= 100;
Copy the code
Join_buffer is not infinite and is controlled by join_buffer_size. The default value is 256K. When the amount of data to be stored is too large, it is only segmented, and the whole execution becomes:
- Scan table T2 and store the eligible rows in join_buffer. Because the size of the rows is limited, the second step is performed when the rows are full when 100 rows are saved.
- Table T1 is scanned, and each row of data is compared with the data in JOIN_buffer. If the join conditions are met, the data is put into the result set.
- Empty join_buffer;
- Perform the first step again until all the data has been scanned, repeating 5 times since there are 500 rows in the T2 table
This flow reflects the origin of Block in the name of the algorithm, which is divided into blocks to perform the join operation. Because the data in table T2 was stored in Join_buffer for 5 times, table T1 was scanned for 5 times.
All deposit | Deposit in 5 installments | |
---|---|---|
Memory operations | 10000 * 500 | 10000 * (100 + 100 + 100 + 100 + 100) |
Scan lines | 10000 + 500 | 10,000 times 5 plus 500 |
As shown above, compared with all table data stored in join_buffer, the number of memory judgments does not change, which is the product of the number of rows of two tables, i.e. 10000 * 500. However, the driven table will be scanned multiple times, and each additional storage will require a second scan, affecting the final execution efficiency.
Based on the above two algorithms, we can draw the following conclusions, which are also most of the specifications for MySQL join statements on the web.
-
If the driven table has an Index, that is, if the Index nested-loop Join algorithm can be used, the Join operation can be used.
-
Both the Index nested-loop Join algorithm and Block nested-loop Join use small tables as driver tables.
Because the time complexity of the above two join algorithms is at least a first-order relationship with the number of rows involved in the table, and requires a large amount of memory space, it is understandable that the join operation with more than three tables is strictly prohibited according to the specification of Ali developers.
However, these two algorithms are only one of the more efficient join algorithms, such as Hash Join and Sorted Merged Join. Oracle, PostgreSQL, and Spark all support these algorithms. This is why MySQL is so weak. (MySQL 8.0 supports Hash Join, But 8.0 is not a mainstream release yet).
In fact, ali developer’s specification forbids join operations of more than three tables due to the poor performance of MySQL’s join operation when migrating from Oracle to MySQL.
A Hash Join algorithm
Hash Join is a scan-driven table. It uses the associated fields of the Join to create a Hash table in memory, and then scans the driven table, reading each row of data and finding the corresponding data from the Hash table. It is a common way to join large data sets. It is suitable for scenarios where the data volume of the driving table is small and can be put into memory. It can provide the best performance for scenarios with large tables without indexes and parallel queries. Unfortunately, it only applies to equivalent join scenarios, such as on a.id = where b.a_id.
Join (); join ()
- Take out the qualified data from the driver table T2, hash the join field value of each row, and then store it in the hash table in memory.
- The driven table T1 is traversed, every row of data that meets the conditions is taken out, and the value of its JOIN field is also hash. The result is taken into the hash table of memory to search for a match. If it is found, it will become a part of the result set.
It can be seen that this algorithm is similar to Block nested-loop Join, except that the disordered Join Buffer is changed to hash table, so that data matching does not need to traverse all the data in the Join Buffer. Instead, the matching rows are obtained directly through the hash with a time complexity close to O(1), which greatly improves the join speed of the two tables.
However, due to the hash feature, this algorithm is only applicable to equivalent connection scenarios and cannot be used in other connection scenarios.
Sorted Merge Join algorithm
Sort Merge Join Sort two tables by the associated field of the Join (if the Join has already been sorted, for example, if the field has an index, it does not need to be sorted), and then Merge the two tables once. A Merge Join performs better than a Hash Join if the two tables are already sorted and do not need to be sorted when performing a sort Merge Join. Merge joins can be applied to non-equivalent joins (>, <, >=, <=, but do not include! =, that is <>).
It is important to note that if the joined fields are already indexed, that is, sorted, the merge operation can be performed directly, but if the joined fields are not indexed, it is performed as shown in the figure below.
- The table T2 is traversed, and the data that meets the conditions is read out and sorted according to the value of connection field A;
- The table T1 is traversed, and the data that meets the conditions is read out. It is also sorted according to the value of the connection field A.
- Merge the two sorted data to get the result set.
The Sorted Merge Join algorithm spends most of its time sorting two tables, so if two tables have already been Sorted by Join fields, the algorithm is even faster than Hash Join. In the side case, this algorithm is faster than the Nested Loop Join algorithm.
Below, we summarize the differences and advantages and disadvantages of the three algorithms.
Nested Loop Join | Hash Join | Sorted Merge Join | |
---|---|---|---|
Join condition | Apply to any condition | Only applicable to equivalent connections (=) | Equivalent or non-equivalent connections (>, <, =, >=, <=), except for ‘<>’ |
Major consumption of resources | CPU and disk I/O | Memory, temporary space | Memory, temporary space |
The characteristics of | When there is a highly selective index or restricted search efficiency is higher, can quickly return the first search results | Hash Join is more effective than Nested Loop when there is no index or the index conditions are ambiguous. It is usually faster than Merge Join. In the data warehouse environment, if the number of records in the table is large, the efficiency is high | Sort Merge Join is more effective than Nested Loop when indexes are lacking or the index conditions are ambiguous. When join fields are indexed or pre-sorted, they are faster than Hash Joins and support more join conditions |
disadvantages | If there is no index or the table records many times, the efficiency is low | Creating a hash table requires a lot of memory and the first result is slow to return | All tables need to be sorted. It is designed for optimal throughput and does not return data until all results are found |
Need to index | Yes (inefficient without indexing) | no | no |
Understanding the Join operation
After talking about the algorithms related to Join, we will also talk about the business understanding of Join operation.
In the absence of business complexity, most joins are not irreplaceable. For example, the order record generally only has the user_id of the order user, and the user name needs to be obtained when the information is returned. Possible implementation schemes are as follows:
- A database operation that uses the JOIN operation, order table and user table to join and return the user name;
- Two database operations, divided into two queries, the first time to obtain the order information and user_id, the second time according to the user_id name, the use of code procedures for information merge;
- Use redundant user names or read from non-relational databases such as ES.
All of the above solutions can solve the problem of data aggregation, and are handled based on program code, which is easier to debug and optimize than database join. For example, the user name is not fetched from the database, but first looked up from the cache.
Join operation, of course, is not without its merits, so the technology has its usage scenario, these solutions or above rules are Internet development team concluded, suitable for high concurrency, light writing rereading, distributed, simple business logic, these scenarios generally on the consistency of the data is not high, and even allow dirty reads.
But, in the financial Banks or financial companies such as application scenario, the join operation is indispensable, these applications are generally low frequent concurrency, complex data writing, CPU intensive rather than IO intensive, the main business logic through database processing even includes a large number of stored procedures, the high consistency and integrity of system.
Personal blog