Structured Query Language (SQL) is an indispensable skill in the data science industry, and it is generally easy to learn. However, many people forget that writing queries is only the first step in SQL. We also need to make sure that the query performs well or matches the context in which we are working.

That is why this SQL tutorial will give you a look at some of the steps we can take to evaluate a query:

  • We’ll start with a brief introduction to the importance of learning SQL in data science work;
  • Next, we will learn more about SQL query processing and execution so that we can properly understand the importance of writing high-quality queries: more specifically, we will see how queries are parsed, rewritten, optimized, and finally evaluated;
  • With this in mind, we will not only revisit some of the query anti-patterns that beginners fall into when writing queries, but also learn more about alternatives and solutions to those possible mistakes; you’ll also learn more about set-based and procedural query approaches;
  • We’ll also see that these anti-patterns stem from performance concerns, and that besides the “manual” approach to improving SQL queries, we can analyze queries in a more structured and in-depth way by using tools that show us the query plan;
  • We will then delve a bit further into time complexity and big O notation to estimate the time complexity of an execution plan before executing the query; and finally,
  • We will get some general instructions on how to further fine-tune the query.

Data Science: Why SQL?

SQL is far from dead: it’s one of the most sought-after skills we find in job descriptions in the data science industry, whether you’re applying for data analyst, data engineer, data scientist, or any other role. This was confirmed by 70% of respondents to the 2016 O’Reilly Data Science Salary Survey, who indicated that SQL was used in their professional backgrounds. In addition, SQL significantly outperformed R (57%) and Python (54%) in the survey.

So there you have it: SQL is a must-have skill if you want a job in the data science industry.

Not bad for a language that was developed in the early 1970s, right?

So why is SQL used so often? Why doesn’t it die even though it’s been around for so long?

There are several reasons for this. One of the first is that companies mostly store their data in relational database management systems (RDBMSs) or relational data stream management systems (RDSMSs), and we need SQL to access the data in those systems. SQL is the universal language of data: it lets you interact with almost any database, or even build your own locally!

If this isn’t enough, keep in mind that SQL implementations differ between vendors and don’t always follow the standard. Understanding standard SQL is therefore one of the things we need to make a living in the data science industry.

In addition, it’s safe to say that newer technologies have embraced SQL, such as Hive, an SQL-like query language interface for querying and managing large data sets, and Spark SQL, which can be used to execute SQL queries. Of course, the SQL that appears in these places will still be different from the standard we learned, but the learning curve will be much easier.

For comparison, think of SQL like linear algebra: by putting all your energy into this one subject, you know you will also be able to use it to master, for example, machine learning!

In short, here are the reasons why we should learn this query language:

  • It’s fairly easy to learn, even for complete beginners. The learning curve is easy and gradual so that we can write queries right away.

  • It follows the “learn once, use everywhere” principle, so it’s a good investment of time!

  • It is a great complement to programming languages; in some cases, writing a query is even preferable to writing code because it performs better!

What are you waiting for? 🙂

SQL processing and query execution

To improve the performance of SQL queries, we first have to know, in a nutshell, what happens internally when we execute a query.

First, the query is parsed into a “parse tree”: the query is analyzed to check whether it satisfies the syntactic and semantic requirements. The parser creates an internal representation of the input query, and this internal representation is then passed on to the rewrite engine.

The optimizer is then tasked with finding the optimal execution or query plan for a given query. The execution plan defines exactly what algorithm to use for each operation and how to coordinate the execution of the operation.

To find the optimal execution plan, the optimizer enumerates all possible execution plans, determines the quality or cost of each plan, obtains information about the current database state, and then selects the best one as the final execution plan. Because the query optimizer can be imperfect, database users and administrators sometimes need to manually review and adjust the plans generated by the optimizer for better performance.

Now you may be wondering what constitutes a “good query plan.”

As we have read and seen, the cost of a plan plays an important role in its quality. More specifically, it is essential to evaluate the amount of disk I/O a plan requires, the plan’s CPU cost, and the overall response time and total execution time that the database client will observe. This is where the concept of time complexity comes in; we’ll read more about that later.

Next, the selected query plan is executed, evaluated by the system’s execution engine, and the query result is returned.

Write an SQL query

At this point, something that may not have been obvious in the previous section comes to the fore: the garbage in, garbage out (GIGO) principle applies naturally to query processing and execution, and the person who writes the query also holds a key to SQL query performance. If the optimizer receives a poorly formulated query, it can only do so much…

This means there are a few things we can do when writing queries. As you’ve seen in the introduction, the responsibility is twofold: it’s not just about writing queries that meet a certain standard, but it’s also about gathering ideas about where performance problems might be lurking in queries.

An ideal starting point is to think about the “spots” in your queries where problems might sneak in. In general, there are four clauses and keywords where beginners can expect performance issues to occur:

  • the WHERE clause;
  • any INNER JOIN or LEFT JOIN keywords; and
  • the HAVING clause.

I admit, this is simple and crude, but as a beginner, these clauses and statements are good indicators. And it’s safe to say that when we first learn to write queries, these are the places where errors occur and, ironically, where they are hard to spot.

However, we also need to understand that performance needs context to be meaningful: simply put, these clauses and keywords do not automatically lead to poor performance. Having a WHERE or HAVING clause in a query doesn’t by itself make it a bad query…

Take a look at the next sections to learn more about anti-patterns and alternatives for building queries. These tips and tricks are meant as guidance. How and whether you actually need to rewrite a query depends on, among other things, the amount of data, the database, and how many times the query is executed. It all depends on the goal of your query, and having some prior knowledge of the database you’re querying is vital!

1. Retrieve only the required data

There should be no “more data is better” mindset when writing SQL queries. With that mindset you risk not only clouding your insights by retrieving more data than you actually need, but also hurting performance because your query pulls up too much data.

That’s why, in general, it’s a good idea to look out for SELECT statements, DISTINCT clauses, and LIKE operators.

  • SELECT statements

Once the query is written, the first thing to check is whether the SELECT statement is as compact as possible. Your goal should be to remove unnecessary columns from the SELECT, forcing yourself to pull only the data that serves the query’s goal.

If you have a correlated subquery with EXISTS, try using a constant in the SELECT statement of that subquery rather than selecting the value of an actual column. This is especially handy when you are only checking whether a value exists.

Remember that a correlated subquery is a subquery that uses values from the outer query. And note that, even though NULL can work as a “constant” in this context, it is very confusing!

To understand what it means to use constants, consider the following example:

SELECT driverslicensenr, name 
FROM Drivers 
WHERE EXISTS (SELECT '1' FROM Fines 
              WHERE fines.driverslicensenr = drivers.driverslicensenr);

Tip: Be aware that having a correlated subquery is not always a good idea. You can consider getting rid of it, for example, by rewriting it with an INNER JOIN:

SELECT driverslicensenr, name 
FROM drivers 
INNER JOIN fines ON fines.driverslicensenr = drivers.driverslicensenr;
  • The DISTINCT clause

The SELECT DISTINCT statement is used to return only distinct (different) values. DISTINCT is a clause you should avoid if you can; as you have read elsewhere, adding it to a query only increases the execution time. It is therefore always worth considering whether you really need the DISTINCT operation to get the result you’re after.
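
As a small, hedged illustration (reusing the Drivers table from the earlier examples and assuming driverslicensenr uniquely identifies each row), the DISTINCT in the first query only adds a de-duplication step that the second query avoids:

-- DISTINCT adds a sort or hash de-duplication step to the plan
SELECT DISTINCT driverslicensenr, name
FROM Drivers;

-- If driverslicensenr is already unique per row, the result is the same without DISTINCT
SELECT driverslicensenr, name
FROM Drivers;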

  • The LIKE operator

When you use the LIKE operator in a query, the index is not used if the pattern starts with % or _. It will prevent the database from using an index (if one exists). Of course, you could also argue that this kind of pattern might return too many records that don’t necessarily satisfy your query goal.

Once again, your knowledge of the data stored in the database can help you formulate a pattern that filters correctly, so that only the rows that really matter for your query are found.
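
As a hedged sketch (assuming a hypothetical index on the name column of the Drivers table), the leading wildcard in the first pattern typically rules out the index, while the anchored pattern in the second can use it:

-- Leading wildcard: an index on name can usually not be used
SELECT driverslicensenr, name
FROM Drivers
WHERE name LIKE '%son';

-- Pattern anchored at the start: an index on name can typically be used
SELECT driverslicensenr, name
FROM Drivers
WHERE name LIKE 'John%';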

2. Limit the query results

When you can’t avoid filtering down your SELECT statement, you can consider limiting the results in other ways. This is where approaches such as the LIMIT clause and data type conversions come in:

  • The TOP, LIMIT and ROWNUM clauses

You can add a LIMIT or TOP clause to a query to set a maximum number of rows for the result set. For example:

SELECT TOP 3 *
FROM Drivers;

Note that you can further specify PERCENT, for example, by changing the first line of the query to SELECT TOP 50 PERCENT *.
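
In dialects that support TOP, such as SQL Server, that variant would look like this:

SELECT TOP 50 PERCENT *
FROM Drivers;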

Alternatively, you can use the LIMIT clause:

SELECT driverslicensenr, name
FROM Drivers
LIMIT 2;

In addition, you can add a ROWNUM clause, which is equivalent to using LIMIT in a query:

SELECT * 
FROM Drivers 
WHERE driverslicensenr = 123456 AND ROWNUM <= 3;
  • Data type conversion

You should always use the most efficient (that is, the smallest) data type possible. It is always a risk to use a large data type when a smaller data type is sufficient.

Also, whenever a data type conversion (cast) is added to a query, it only increases execution time.

The best alternative is to avoid data type conversions wherever possible. Note that it’s not always possible to remove or omit them from your queries, but when you do include them, you should test the effect of the conversion before you run the query.
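
As a hedged example (assuming driverslicensenr is stored as an integer; the exact behavior depends on your RDBMS), comparing the column against a string literal can force an implicit conversion, sometimes on every row, which may also defeat an index:

-- String literal compared with an integer column: an implicit conversion is needed
SELECT driverslicensenr, name
FROM Drivers
WHERE driverslicensenr = '123456';

-- Literal matches the column's data type: no conversion necessary
SELECT driverslicensenr, name
FROM Drivers
WHERE driverslicensenr = 123456;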

3. Don’t make the query more complex than it needs to be

Casting brings us to the next point: don’t over-design your queries. Keep it simple and efficient. This might seem too simple or silly as a trick, especially since queries can get complicated by nature.

However, as we’ll see in the examples in the next section, it’s easy to make simple queries more complex than they need to be from the start.

  • The OR operator

When you use the OR operator in a query, chances are the index is not used.

Keep in mind that an index is a data structure that speeds up the retrieval of data from database tables, but comes at a cost: additional writes and additional storage are required to maintain the index data structure. Indexes are used to quickly locate or find data without having to search every row in the database every time a database table is accessed. Indexes can be created with one or more columns in a database table.

If you do not use the indexes in the database, your query will inevitably take longer to run. That is why it is best to look for an alternative to the OR operator.

Consider the following query:

SELECT driverslicensenr, name 
FROM Drivers 
WHERE driverslicensenr = 123456 OR driverslicensenr = 678910 OR driverslicensenr = 345678;

The OR operator can be replaced by:

  • a condition with IN; or
SELECT driverslicensenr, name 
FROM Drivers 
WHERE driverslicensenr IN (123456, 678910, 345678);
  • two SELECT statements combined with UNION.

Tip: Be careful here not to use the UNION operation unnecessarily, because doing so traverses the same table multiple times. At the same time, realize that a UNION in a query increases the execution time. Alternatives to the UNION operation are to rewrite the query so that all conditions are placed in a single SELECT statement, or to use an OUTER JOIN instead of UNION.
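
A minimal, hedged sketch of that advice (reusing the year column that appears in the examples further below): the UNION version reads the Drivers table twice, while the single SELECT reads it once. Note that UNION also removes duplicate rows, so the two forms are only interchangeable when that de-duplication is not needed:

-- Two SELECTs combined with UNION: the same table is traversed twice
SELECT driverslicensenr, name FROM Drivers WHERE year < 1960
UNION
SELECT driverslicensenr, name FROM Drivers WHERE year > 1980;

-- All conditions in one SELECT: the table is traversed once
SELECT driverslicensenr, name
FROM Drivers
WHERE year < 1960 OR year > 1980;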

Tip: Keep in mind here, too, that an index lookup is not always preferable, even though the OR operator and the other operators mentioned in the following sections may prevent indexes from being used!

  • The NOT operator

When a query contains the NOT operator, the index will likely not be used, just as with the OR operator. This will inevitably slow the query down. If you’re not sure what this means, consider the following query:

SELECT driverslicensenr, name
FROM Drivers
WHERE NOT (year > 1980);

This query is bound to run slower than you might expect, mainly because its plan ends up far more complicated than it needs to be. In such cases it is best to look for an alternative: consider replacing NOT with a comparison operator such as >, <> or !>. The example above might then be rewritten as follows:

SELECT driverslicensenr, name
FROM Drivers
WHERE year <= 1980;

It looks cleaner, right?

  • The AND operator

The AND operator is another operator that doesn’t use an index, and it can slow down queries when used in an overly complex or inefficient way, as in the following example:

SELECT driverslicensenr, name 
FROM Drivers 
WHERE year >= 1960 AND year <= 1980;

This is better rewritten with the BETWEEN operator:

SELECT driverslicensenr, name 
FROM Drivers 
WHERE year BETWEEN 1960 AND 1980;
  • The ANY and ALL operators

The ANY and ALL operators should also be used with care, because including them in a query means the index is not used. Useful alternatives here are aggregate functions such as MIN or MAX.

Tip: When you use the proposed alternatives, be aware that all aggregate functions (such as SUM, AVG, MIN, MAX) applied over many rows can lead to long-running queries. In that case, try to either minimize the number of rows to process or pre-calculate these values. Once again, when deciding which query to use, it is important to be aware of your environment and of your query goal…
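
As a hedged sketch (assuming the year column contains no NULLs, otherwise the two forms can behave differently), a query with ALL can often be rewritten with an aggregate that collapses the subquery to a single value:

-- ALL compares the row's year against every year returned by the subquery
SELECT driverslicensenr, name
FROM Drivers
WHERE year >= ALL (SELECT year FROM Drivers);

-- MAX collapses the subquery to one value before the comparison
SELECT driverslicensenr, name
FROM Drivers
WHERE year = (SELECT MAX(year) FROM Drivers);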

  • Isolate the columns in the condition

In addition, indexes are not used if columns are used in calculations or scalar functions. One possible solution is to isolate only specific columns so that they are no longer part of a calculation or function. Consider the following example:

SELECT driverslicensenr, name 
FROM Drivers 
WHERE year + 10 = 1980;

That looks gross, right? Try rethinking the calculation and rewriting the query as follows:

SELECT driverslicensenr, name 
FROM Drivers 
WHERE year = 1970;

4. Don’t use brute force

This last tip actually means you shouldn’t try to restrict the query too much, because that, too, can hurt its performance. This is especially true for joins and for the HAVING clause.

  • Joins

  • Table order: when you join two tables, it can be important to consider the order of the tables in the join. If you notice that one table is considerably larger than the other, you may want to rewrite the query so that the largest table comes last in the join.

  • Redundant conditions on joins

When you add too many conditions to a join, you essentially force SQL to choose a certain path. That path, however, is not always the better-performing one.

  • The HAVING clause

The HAVING clause was originally added to SQL because the WHERE keyword cannot be used with aggregate functions. HAVING is typically used together with the GROUP BY clause to restrict the returned rows to only those groups that meet certain conditions. However, if you use this clause in a query, the index is not used, and as we already know, that can lead to a query that doesn’t perform well.

If you’re looking for an alternative, consider using the WHERE clause. Consider the following query:

SELECT state, COUNT(*)
FROM Drivers
WHERE state IN ('GA', 'TX')
GROUP BY state
ORDER BY state;

SELECT state, COUNT(*)
FROM Drivers
GROUP BY state
HAVING state IN ('GA', 'TX')
ORDER BY state;

The first query uses the WHERE clause to restrict the number of rows that need to be counted, whereas the second query counts all the rows in the table and then uses HAVING to throw away the groups it computed unnecessarily. In these kinds of cases, the alternative with the WHERE clause is obviously better, because no resources are wasted.

As you can see, this is not about limiting the final result set, but about limiting the intermediate number of records within the query.

Note that the difference between these two clauses is that the WHERE clause introduces a condition on individual rows, whereas the HAVING clause introduces a condition on aggregations or results of a selection, where a single result, such as MIN, MAX, or SUM, has been produced from multiple rows.

As you can see, evaluating the quality of queries, and writing and rewriting them, is no easy job once you take performance into account as much as possible. Avoiding anti-patterns and considering alternatives is also part of your responsibility when you write queries that run on databases in a professional environment.

This list is just a brief overview of some anti-patterns and tips that will hopefully help beginners; check out this discussion for more on what senior developers consider the most common anti-patterns.

Set-based versus procedural query approaches

Implicit in the anti-patterns above is the fact that they really boil down to the difference between a set-based and a procedural approach to building queries.

The procedural approach to creating queries is very similar to programming: we can tell the system what to do and how to do it.

Examples are redundant conditions on joins or the overuse of the HAVING clause, as in the examples above, where you query the database by executing one function and then calling another, or where you use logic with loops, conditions, user-defined functions (UDFs), cursors, and so on to get the final result. In this approach, you often find yourself asking for a subset of the data, then requesting another subset from that data, and so on.

So, not surprisingly, this approach is often referred to as “step by step” or “row by row” queries.

The other approach is the set-based approach, in which we simply specify what to do. Our job is to specify the conditions or requirements for the result set we want from the query. How the data is retrieved is left to the internal mechanisms that determine the query’s implementation: we let the database engine determine the best algorithms or processing logic to execute the query.

Since SQL is set-based, it is hardly surprising that this approach is more efficient than the procedural one, and it also explains why, in some cases, SQL can work faster than code.
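
To make the contrast concrete, here is a hedged sketch in a T-SQL-like dialect (the vintage column is made up for the example): the procedural version walks the table row by row with a cursor, while the set-based version states the desired result in a single statement and leaves the execution strategy to the engine:

-- Procedural, row-by-row approach using a cursor
DECLARE @nr INT;
DECLARE cur CURSOR FOR
    SELECT driverslicensenr FROM Drivers WHERE year < 1960;
OPEN cur;
FETCH NEXT FROM cur INTO @nr;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE Drivers SET vintage = 1 WHERE driverslicensenr = @nr;
    FETCH NEXT FROM cur INTO @nr;
END;
CLOSE cur;
DEALLOCATE cur;

-- Set-based approach: one statement, the engine picks the plan
UPDATE Drivers
SET vintage = 1
WHERE year < 1960;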

Tip: The set-based way of querying is also the approach that most top employers in the data science industry will expect you to master! You will often need to switch between these two types of approaches.

Note that if you find yourself with procedural queries, consider rewriting or refactoring them.

From query to execution plan

Knowing that anti-patterns are not set in stone, that they evolve as you grow as an SQL developer, and that there is a lot to consider when thinking about alternatives also means that avoiding query anti-patterns and rewriting queries can be a fairly tricky task. Any help can come in handy, which is why a more structured approach to optimizing queries, with the help of some tools, might be the way to go.

Also note that some of the anti-patterns mentioned in the previous section were rooted in performance concerns, such as the AND, OR, and NOT operators and their lack of index usage. Thinking about performance requires not only a more structured approach, but also a more in-depth one.

This structured and in-depth approach will, however, mainly be based on the query plan which, as you’ll remember, is the result of the query first being parsed into a “parse tree”, and which defines exactly what algorithm is used for each operation and how the execution of operations is coordinated.

Query optimization

As we saw in the introduction, we may need to manually check and adjust the plan generated by the optimizer. In this case, we will need to analyze the query again by looking at the query plan.

To get hold of this plan, we use the tools that our database management system provides. Some of the tools available are the following:

  • Some software packages include tools that generate a graphical representation of the query plan.

  • Other tools provide a textual description of the query plan. One example is the EXPLAIN PLAN statement in Oracle, but the name of the instruction varies depending on the RDBMS you are working with: in MySQL and PostgreSQL, for example, it is EXPLAIN, and in SQLite it is EXPLAIN QUERY PLAN.

Note that if you are using PostgreSQL, there is a difference between EXPLAIN and EXPLAIN ANALYZE. The former only gives a description of how the planner intends to execute the query, without running it; the latter actually executes the query and returns an analysis of the estimated versus the actual query plan. In general, the actual execution plan is the one produced while actually running the query, whereas the estimated execution plan works out what the plan would do without executing it. Although they are logically similar, the actual execution plan is more useful because it contains additional details and statistics about what really happened while the query was executed.

In the remainder of this section, we will learn more about EXPLAIN and ANALYZE and how to use these statements to learn more about the query plan and the possible performance of the query. To do this, we’ll start with a few examples. Two tables are used in the example: one_million and half_million.

We can use EXPLAIN to get the query plan for the one_million table; make sure to put EXPLAIN right in front of the query. When it runs, the query plan is returned:

EXPLAIN
SELECT *
FROM one_million;

QUERY PLAN
____________________________________________________
Seq Scan on one_million
(cost=0.00..18584.82 rows=1025082 width=36)
(1 row)

In this case, we can see that the estimated cost of the query is 0.00..18584.82, the estimated number of rows is 1,025,082, and the estimated width of each row is 36 bytes.

We can then update the statistics with ANALYZE.

ANALYZE one_million;
EXPLAIN
SELECT *
FROM one_million;

QUERY PLAN
____________________________________________________
Seq Scan on one_million
(cost=0.00..18334.00 rows=1000000 width=37)
(1 row)

In addition to EXPLAIN and ANALYZE, we can also use EXPLAIN ANALYZE to obtain the actual execution time:

EXPLAIN ANALYZE
SELECT *
FROM one_million;

QUERY PLAN
___________________________________________________________
Seq Scan on one_million
(cost=0.00..18334.00 rows=1000000 width=37)
(actual time=0.015.. rows=1000000 loops=1)
Total runtime: … ms
(2 rows)

The obvious drawback to using EXPLAIN ANALYZE here is that the query is actually executed, so be careful!

The algorithms we have seen so far are Seq Scans (sequential scans), or full table scans: a scan in which each row of the table is read in (sequential) order and the columns are checked against the condition. In terms of performance, a sequential scan is clearly not the best execution plan, because the whole table is being read. However, this is not too bad when the table doesn’t fit into memory: sequential reads are quite fast, even on slow disks.

We’ll see more when we discuss index scanning.

However, there are other algorithms. For example, this query plan is used for the join:

EXPLAIN ANALYZE
SELECT *
FROM one_million
JOIN half_million
ON (one_million.counter = half_million.counter);

QUERY PLAN
_________________________________________________________________
Hash Join (cost=15417.00.. rows=500000 width=42)
  (actual time=1241.471.. rows=500000 loops=1)
  Hash Cond: (one_million.counter = half_million.counter)
  ->  Seq Scan on one_million (cost=0.00.. rows=1000000 width=37)
      (actual time=0.007..1254.027 rows=1000000 loops=1)
  ->  Hash (cost=7213.00.. rows=500000 width=5)
      (actual time=1241.251.. rows=500000 loops=1)
      Buckets: 4096 Batches: 16 Memory Usage: 300kB
      ->  Seq Scan on half_million (cost=0.00.. rows=500000 width=5)
          (actual time=0.008..601.128 rows=500000 loops=1)
Total runtime: … ms

You can see that the query optimizer chose a Hash Join! Keep this operation in mind, because we will need it to estimate the time complexity of the query. For now, note that there is no index on half_million.counter; we add one in the next example:

CREATE INDEX ON half_million(counter);

EXPLAIN ANALYZE
SELECT *
FROM one_million
JOIN half_million
ON (one_million.counter = half_million.counter);

QUERY PLAN
________________________________________________________________
Merge Join (cost=4.12.. rows=500000 width=42)
  (actual time=0.033.. loops=1)
  Merge Cond: (one_million.counter = half_million.counter)
  ->  Index Scan using one_million_counter_idx on one_million
      (cost=0.00.. rows=1000000 width=37)
      (actual time=0.011..694.466 rows=500001 loops=1)
  ->  Index Scan using half_million_counter_idx on half_million
      (cost=0.00.. rows=500000 width=5)
      (actual time=0.010.. rows=500000 loops=1)
Total runtime: … ms
(5 rows)

As you can see, by creating the index, the query optimizer has now decided to use a Merge Join with index scans.

Note the difference between an index scan and a full table scan (also called a sequential scan): the former uses the data or index pages to find the appropriate records, whereas the latter scans every row of the table.

You can see that the total runtime has gone down and performance should be better, but there are now two index scans, which makes memory even more important here, especially if the table doesn’t fit into memory. In that case, we would first have to do a full index scan, which consists of fast sequential reads and poses no problem, but then we would have many random reads to fetch rows by index value. Those random reads are usually orders of magnitude slower than sequential reads, and in that situation a full table scan can actually be faster than a full index scan.

Hint: For more information about EXPLAIN or to look at examples in more detail, consider reading Understanding EXPLAIN by Guillaume Lelarge.

Time complexity and big O notation

Now that we have looked at query plans in general, we can start digging deeper and think about performance in a more formal way with the help of computational complexity theory. This area of theoretical computer science focuses on classifying computational problems according to their difficulty; these computational problems can be algorithms, but also queries.

However, with queries, we don’t necessarily classify them by their difficulty, but by the time it takes to run them and return some results. Specifically, this is called time complexity, and to express or measure this type of complexity, we can use the big O notation.

With big O notation, we express the running time in terms of how quickly it grows relative to the input, as the input gets arbitrarily large. The notation leaves out coefficients and lower-order terms so that we can focus on the important part of a query’s running time: its growth rate. Expressed in this way, with coefficients and lower-order terms dropped, the time complexity is said to be described asymptotically, that is, as the input size goes to infinity.
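
For readers who like a formal statement, the standard definition of big O (not specific to SQL) can be written as:

f(n) = O(g(n)) \iff \exists\, c > 0,\ \exists\, n_0 > 0 \ \text{such that}\ 0 \le f(n) \le c \cdot g(n) \ \text{for all}\ n \ge n_0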

In database languages, complexity is a measure of how long queries take to run as the size of a data table grows and the database grows with it.

Note that not only does the size of the database grow as more data is stored in the tables, but the indexes that exist in the database also contribute to the size growth.

Estimate the time complexity of the query plan

As mentioned earlier, the execution plan defines, among other things, which algorithm is used for each operation, which makes it possible to logically express every query execution time as a function of the size of the tables involved in the plan; this is known as a complexity function. In other words, we can use big O notation and the execution plan to estimate the complexity and performance of a query.

In the following sections, you’ll get a general idea of the four types of time complexity, and you’ll see examples of how the time complexity of a query can vary depending on the context in which you run it.

Hint: The index is part of the story!

Note, however, that the time complexities listed below are very general and can vary depending on your setup, since there are different types of indexes, different execution plans, and different implementations across databases.

Constant time: O(1)

An algorithm is said to operate in constant time if it always takes the same amount of time no matter how large the input is. For a query, if the same amount of time is always required no matter how large the table is, the query will run in constant time.

This type of query is uncommon, but here’s an example:

SELECT TOP 1 t.* 
FROM t

Here the time complexity is constant because only one arbitrary row is selected from the table. Therefore, the length of time should be independent of the size of the table.

Linear time: O(n)

An algorithm is said to run in linear time if its execution time is proportional to the input size, that is, the time grows linearly with the input size. For databases, this means the execution time is proportional to the table size: as the number of rows in the table grows, so does the query time.

One example is a query that uses a WHERE clause on an unindexed column: a full table scan or Seq Scan is needed, which results in a time complexity of O(n). This means every row has to be read to find the one with the right ID; there is no restriction at all, so every row is read, even if the condition is matched by the very first row.

Consider the following example, where the complexity of the query will be O(n) if there is no index on i_id:

SELECT i_id 
FROM item;
  • This also means that other queries, such as count queries like SELECT COUNT(*) FROM table;, will have a time complexity of O(n), because a full table scan is required unless the total row count is stored for the table, in which case the complexity would be more like O(1).

Closely related to linear execution time is the execution time of plans that contain joins. Here are some examples:

  • A hash join has an expected complexity of O(M + N). The classic hash join algorithm for an inner join of two tables first prepares a hash table of the smaller table. The hash table entries consist of the join attribute and its row, and the hash table is accessed by applying a hash function to the join attribute. Once the hash table is built, the larger table is scanned and the relevant rows of the smaller table are found by probing the hash table.

  • A merge join generally has a complexity of O(M + N), but this complexity depends heavily on the indexes on the join columns and, in the absence of indexes, on whether the rows are already sorted according to the keys used in the join:

    • If both tables are sorted by the key used in the join, the complexity of the query is O(M+N).
    • If both tables have an index on the join columns, the indexes already maintain those columns in order and there is no need to sort; the complexity will be O(M + N).
    • If neither table has an index on the join column, then both tables have to be sorted first, so the complexity is O(M log M + N log N).
    • If only one table has an index on the join column, then only the tables without indexes will need to be sorted before the merge step occurs, so the complexity will be O(M + N log N).
  • For nested loop joins, the complexity is generally O(M * N). This join is efficient when one or both of the tables are extremely small (for example, fewer than 10 records), which is a very common situation when evaluating queries, because some subqueries are written to return only one row.

Remember: a nested loop join is a join that compares every record in one table against every record in the other table.

Logarithmic time: O(log(n))

An algorithm is said to run in logarithmic time if its execution time is proportional to the logarithm of the input size; for queries, this means they run in logarithmic time if the execution time is proportional to the logarithm of the database size.

This logarithmic time complexity applies to query plans that perform an index scan or a clustered index scan. A clustered index is an index whose leaf level contains the actual data rows of the table. A clustered index is otherwise much like any other index: it is defined on one or more columns, which together form the index key, and the clustering key is the key column of the clustered index. A clustered index scan is then simply the RDBMS reading through the rows of the clustered index from top to bottom.
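
As a hedged illustration (SQL Server syntax; the index name is made up, and other systems handle clustering differently, for example PostgreSQL only physically reorders a table on demand via the CLUSTER command), a clustered index on the key column could be created like this:

-- SQL Server: store the Drivers rows physically ordered by driverslicensenr
CREATE CLUSTERED INDEX ix_drivers_licensenr
ON Drivers (driverslicensenr);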

Consider the following query example, where there is an index on i_id, which will typically yield a complexity of O(log(n)):

SELECT i_stock 
FROM item 
WHERE i_id = N;

Note: If there is no index, the time complexity becomes O(n).

Quadratic time: O(n^2)

An algorithm is said to run in quadratic time if its execution time is proportional to the square of the input size. For databases, this means that the execution time of a query is proportional to the square of the database size.

An example of a query with quadratic time complexity is the following:

SELECT * 
FROM item, author 
WHERE item.i_a_id=author.a_id

The minimum complexity here would be O(n log(n)), but the maximum complexity could be O(n^2), depending on the index information for the join attributes.

In summary, the overview above can serve as a quick reference for estimating the performance of a query based on its execution plan and time complexity.

SQL tuning

Once we understand the query plan and time complexity, we can consider further tuning our SQL queries. We can start by paying special attention to the following points:

  • Replace unnecessary full table scans of large tables with index scans (see the sketch after this list);

  • Ensure that the best table join order is being applied;

  • Ensure that indexes are used optimally; and

  • Cache full table scans of small tables.
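
As a hedged sketch of the first point (PostgreSQL syntax, reusing the one_million table from earlier; whether the planner actually switches plans depends on how selective the filter is), you can add an index on a frequently filtered column and verify the effect with EXPLAIN:

-- Add an index on the column used in the WHERE clause (assuming none exists yet)
CREATE INDEX ON one_million (counter);

-- Re-check the plan: for a selective filter it should now show an Index Scan instead of a Seq Scan
EXPLAIN
SELECT *
FROM one_million
WHERE counter = 42;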

Dig deeper into SQL

Congratulations! You’ve reached the end of this blog post, which has given you an overview of SQL query performance. Hopefully you now have more insight into anti-patterns, the query optimizer, and the tools you can use to view, estimate, and interpret the complexity of query plans. Still, there is much more to discover! For more information, consider reading Database Management Systems by R. Ramakrishnan and J. Gehrke.

Finally, I don’t want to withhold this quote from a StackOverflow user:

“My favorite anti-pattern is not testing your queries.

This applies when:

  • Your query involves more than one table;
  • You think you have an optimal design for the query and don’t bother testing your assumptions;
  • You accept the first query that works, with no clue as to whether it is anywhere close to optimized.”