When executing a join statement, it is often said that the small table should drive the large table, because letting the small table drive the large table makes the query faster.

So why does MySQL recommend letting small tables drive large tables, and why does this effectively improve query efficiency?

Simple Nested-Loop Join

This is the most naive join method. MySQL does not actually use it, because for every row of the driving table it scans the entire driven table, so the number of row comparisons is on the order of a Cartesian product; the number of rows scanned during the join becomes huge and the query is very slow.

  • SLJ execution process

    select * from t1 straight_join t2 on t1.a = t2.a;

    No index exists on column a of either t1 or t2 (an assumed table definition is sketched at the end of this section)

    Execution process:

    1. Read a row of data from table t1 and take its a field
    2. Use that value of a to scan table t2 row by row, comparing it with the a field of each t2 row
    3. If a match is found, add the matched rows to the result set
    4. Repeat steps 1-3 until table t1 has been fully scanned
  • Time complexity

    During execution, table t1 is fully scanned (N rows), and for each of those rows table t2 is fully scanned again (M rows). The query is therefore slow, and the time complexity is N x M

    When the tables are large, the number of rows scanned explodes. For example, if t1 and t2 each have 100,000 rows, the total number of rows scanned is 100,000 x 100,000 = 10 billion
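
For concreteness, the examples in this article can be read against a minimal assumed schema like the one below; the exact columns (an id primary key, a join column a, and an extra column b) are assumptions for illustration only.

    -- Assumed schema for the running example: both tables have a primary key id
    -- and a join column a; no secondary index exists on a yet
    CREATE TABLE t1 (
      id INT NOT NULL PRIMARY KEY,
      a  INT NOT NULL,
      b  INT NOT NULL
    ) ENGINE=InnoDB;

    CREATE TABLE t2 LIKE t1;  -- t2 is assumed to have the same structure as t1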

Index Nested-Loop Join

This join works like a nested loop in program code, and it can use the index of the driven table during the lookup, so it is abbreviated as NLJ (Index Nested-Loop Join).

  • NLJ execution process

    select * from t1 straight_join t2 on t1.a = t2.a;

    This time column a of table t2 has an index (see the SQL sketch at the end of this section)

    straight_join is used here so that the MySQL optimizer cannot choose the join order itself and reorder which table drives which

    In this SQL statement, t1 is the driving table and t2 is the driven table. The execution flow is as follows:

    1. Read a row of data from table t1
    2. Take the value of field a from that row and use it to look up table t2
    3. Search the index tree on column a of t2 with that value and obtain the primary key id of the matching t2 row
    4. Use that primary key id to go back to the primary key index tree of t2 and fetch the whole row
    5. Combine the row from t1 and the row from t2 and add them to the result set
    6. Repeat steps 1-5 until table t1 has been fully scanned
  • Time complexity

    During execution, the cost of NLJ consists of two parts:

    • A full scan of the N rows of table t1
    • For each of those N rows, one lookup into the driven table: searching the index tree on a of t2 costs about logM, and going back to the primary key index costs another logM, so roughly 2*logM per row and N*2*logM in total

    The total time complexity is N + N*2*logM

  • Small tables drive large tables

    Clearly N has a much larger impact on the overall complexity than M, so N should be kept as small as possible. When two tables are joined this way, the small table should be used as the driving table: the time complexity is lower and the query is faster

    Of course, this conclusion assumes that the index on the driven table can be used; when it cannot, the complexity analysis is different
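
As a rough sketch of how this looks in practice (the index name idx_t2_a is an assumed name, and the exact EXPLAIN output depends on the MySQL version and the data):

    -- Assumed index so that t2 can serve as the driven table for NLJ
    ALTER TABLE t2 ADD INDEX idx_t2_a (a);

    -- straight_join keeps t1 as the driving table; with the index in place,
    -- EXPLAIN should show t2 being accessed through idx_t2_a (ref access)
    EXPLAIN SELECT * FROM t1 STRAIGHT_JOIN t2 ON t1.a = t2.a;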

Block Nested-Loop Join

To avoid the excessive row scanning of the SLJ algorithm, MySQL uses the BLJ (Block Nested-Loop Join) algorithm to join two tables when the driven table has no usable index.

  • BLJ execution process

    select * from t1 straight_join t2 on t1.a = t2.a;

    No index exists on column a of either t1 or t2

    1. Read the data of table t1 into the thread's join_buffer in memory; because the statement is select *, all fields of t1 are put into the buffer
    2. Scan table t2, take each row of t2 and compare it with the t1 data in join_buffer; rows that satisfy the join condition are added to the result set
    3. Repeat step 2 until table t2 has been fully scanned
  • Time complexity

    During BLJ execution, a full table scan is performed on both t1 and t2, so the number of rows scanned is N + M; each row of t2 is then compared against every t1 row in the buffer, so the number of comparisons is N * M. In terms of comparison count BLJ is no better than SLJ, but the N + M rows of table scanning are relatively cheap, and the N * M comparisons are all performed in memory against join_buffer, which is much faster, so performance is acceptable.

    Whether t1 or t2 is used as the driving table, the complexity is almost the same; the only difference is which table's data is loaded into join_buffer

  • Segmented query

    As shown above, the complexity is the same no matter which table drives the join. However, a larger driving table may not fit into join_buffer, and MySQL's policy in that case is to query in segments

    The size of join_buffer is set using the join_buffer_size parameter. The default size is 256K

    • Segmented query procedure

      If the driving table's data is larger than join_buffer, the query is performed in segments:

      1. Scan table t1 and read its rows sequentially into join_buffer; when join_buffer is full, stop and go to step 2
      2. Scan table t2, compare each row of t2 with the data in join_buffer, and add the rows that satisfy the join condition to the result set
      3. Empty join_buffer
      4. Continue scanning table t1, read the remaining rows sequentially into join_buffer, and repeat from step 2

      As you can see, join_buffer is cleared and reused for each segment of the driving table

    • Time complexity of segmented query

      In a segmented query, the in-memory matching against join_buffer is still used, but the driven table t2 has to be scanned once for every segment

      Assume table t1 has N rows and needs to be split into K segments to be fully processed, and the driven table t2 has M rows

      Then the time complexity is:

      • The number of rows scanned is N + K x M
      • The number of in-memory comparisons is N x M

      Clearly the number of rows scanned depends on K, the number of segments. For a given join_buffer size, the more data the driving table has, the more segments are needed and the more rows are scanned

  • Small tables drive large tables

    Therefore, based on the time complexity above, the fewer rows the driving table has, the fewer rows are scanned in total, while the total number of comparisons stays the same, so the query is more efficient.

    Of course, this only matters when the driving table is too large to be loaded into join_buffer in one go. If join_buffer is large enough to hold the whole driving table at once, the choice of driving table has little effect on query efficiency

    That is why it is often recommended to increase join_buffer_size, as sketched below
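
A minimal sketch for checking and enlarging join_buffer_size at the session level; the 1MB value is only an example, and the right size depends on your workload and available memory:

    -- Check the current size (defaults to 256KB)
    SHOW VARIABLES LIKE 'join_buffer_size';

    -- Enlarge it for the current session so the driving table needs fewer segments
    SET SESSION join_buffer_size = 1048576;  -- 1MB, example value only

    SELECT * FROM t1 STRAIGHT_JOIN t2 ON t1.a = t2.a;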

Definition of small tables

  • What is a small table

    • The amount of data left after filtering is small

      select * from t1 straight_join t2 on t1.a = t2.a where t2.id < 50;

      Table t1 has 10,000 rows, while table t2 has 100,000 rows

      This query only needs the rows of t2 with id < 50, so within this query t2 is the “small table” relative to t1, even though t2 has more rows overall.

    • Few query fields are needed

      select t1.b, t2.* from t1 straight_join t2 on t1.a = t2.a;

      Assume that only 100 rows of t1 and of t2 each participate in the join

      In the SQL statement above, t1 is the “small table”: it only needs to put its b field (plus the join column a) into join_buffer, while t2 has to put all of its fields in

    Therefore, when deciding which table is the “small table”, do not only look at each table's total size: apply the filter conditions of the SQL statement first, then consider only the fields the query actually needs; whichever table contributes less data to the join is the “small table”.

In summary, MySQL queries should let the small table drive the large table to reduce the time complexity of the join and improve query efficiency, especially when the tables involved are large. You can check which table MySQL actually chose as the driving table with EXPLAIN, as sketched below.
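
When the join order is not forced with straight_join, the optimizer picks the driving table itself; in a nested-loop plan, the first table listed in the EXPLAIN output is the driving (outer) table. The query below reuses the filtered example from above.

    -- Without straight_join the optimizer chooses the driving table itself;
    -- the first row of the EXPLAIN output is the driving table
    EXPLAIN SELECT * FROM t1 JOIN t2 ON t1.a = t2.a WHERE t2.id < 50;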