Reprint address

Small tables drive large tables, that is, small data sets drive large data sets. Before we know what a small table drives to a large table, let’s look at two query keywords, IN and EXISTS. Let’s start with two query statements to see what they do. I set up two tables, one employee table and one department table. The employee table has the department ID attribute, and I associate the two tables.

Let’s first use IN to query the data:

SELECT * 
FROM t_emp 
WHERE dept_id IN (SELECT dept_id FROM t_dept) 
LIMIT 5;
Copy the code

Next use EXISTS to query data:

SELECT * 
 FROM t_emp 
 WHERE EXISTS 
     (SELECT 1 
     FROM t_dept 
     WHERE t_dept.dept_id = t_emp.dept_id) 
 LIMIT 5;
Copy the code

The EXISTS syntax is: data in a primary query is conditionally validated in a sub-query, and data in a primary query is determined by the result TRUE and FALSE. The EXISTS subquery only returns TRUE or FALSE, so the SELECT * in the subquery can be SELECT 1 or otherwise. The EXISTS subquery is actually optimized by MySql when it is executed rather than comparing every data.

Since both IN and EXISTS can be used to query data, what is the difference between the two?

In practice, we want to index both tables’ dept_id. At the very beginning, we explained an optimization principle: small table drives large table. When we use IN for associated query, according to the execution sequence of IN operation above, we first query department table and then query employee information according to the ID information found IN department table. We all know that the employee table must have a lot of employee information, but the department table usually only has a little data information. We first query the employee information by querying the department table, and then use the query result of t_DEPT to drive the large table (T_EMP), which is very efficient and worth encouraging.

However, when an EXISTS query is used, the employee table is queried first, and then whether the information in the employee table needs to be retained according to the TRUE or FALSE returned by the department table query condition. T_emp is used to drive small tables t_DEPT. Although this way can also find the data we want, but this way of query is not worth advocating.

We use IN over EXISTS when there is more data IN the T_EMp table than IN the T_DEPT table. When there is more data IN the T_DEPT table than there is IN the T_EMP table (we are only assuming here), we use EXISTS over IN. Therefore, whether to use IN or EXISTS depends on our requirements. However, if the amount of data IN two tables is similar, there is little difference between using IN or EXISTS.