Bool Query planner

This article was translated from Query Planning

directory

  1. The query

    1.1 Query without An Index

    1.2 Querying Information using ROWID

    1.3 Using Indexes to Query information

    1.4 Multi-line content lookup

    1.5 Use AND to link multiple WHERE condition queries

    1.6 Multi-column Query

    1.7 Overwrite index Query

    1.8 Use OR to link multiple WHERE condition queries

  2. The sorting

    2.1 Sort using ROWID

    2.2 Using Index Sort

    2.3 Overwrite index sort

  3. Query and sort

    3.1 Query and sort by multi-column index

    3.2 Query and sort by overwriting indexes

    3.3 Local Sorting by index

  4. Without the rowid table

An overview of the

The main feature of SQL (among all databases that use SQL statements, not just SQLite) is that it is an expressive programming language, not a procedural language. With SQL, you just tell the system what you want to calculate, you don’t describe how to do it. How the results are computed depends on the SQL database engine’s internal query planner.

There may be hundreds of execution algorithms for a single SQL statement. All algorithms can calculate the correct result, but some are fast and some are slow. The query planner acts as an AI, planning the fastest possible execution algorithm for each SQL statement.

In most cases, the query planner works very well in SQLite. But the query planner needs to use indexes to assist. These indexes need to be added by the developer when designing the database. Sometimes, the query planner will choose the suboptimal algorithm over the optimal one. In this case, some assistance from the developer is needed to help the query planner work better.

This article focuses on the principles behind the SQLite query planner and query engine. When necessary, developers can better create indexes based on these principles to help the query planner work efficiently.

See SQLite Query Planner and Next Generation Query Planner for more information.

1. Query

1.1 Query without An Index

In SQLite, most tables consist of one or more rows, each of which has a unique key (ROWId or INTEGER PRIMARY key)(the WITHOUT ROWId table is a special case). This data is usually arranged in ascending order. For example, the table used in this article is “FruitsForSale”, which mainly stores all kinds of fruits as well as their origin and price information. The table structure is as follows:

CREATE TABLE FruitsForSale(
  Fruit TEXT,
  State TEXT,
  Price REAL
);
Copy the code

After writing some data, the table is stored on disk in the form shown in Figure 1:

In this table, rowiDs are not continuous, but ordered. Typically, when SQLite creates a piece of data, the ROwiD of that piece of data is incremented by 1 from the previous ROWID. If a row is deleted, the ROWID will be incoherent. If necessary, you can specify the roWID sequence number when creating a piece of data, not just append data to the end. But no matter how added, each Rowid is unique and ordered.

When you want to query the price of peaches, the query might look something like this:

SELECT price FROM fruitsforsale WHERE fruit='Peach';
Copy the code

To satisfy this query, SQLite will read each row in the table, first retrieving the ‘fruit’ column to see if there is a value for ‘Peach’ and, if so, printing the value of ‘price’ for that row. The retrieval process is shown in figure2. This algorithm is called full table traversal — you need to read the entire table and retrieve it. The table has only 7 entries, which is fine to retrieve, but if you have 7 million entries, you need to read and traverse 1 meter of data in order to retrieve an 8-byte entry. For this reason, try to avoid full table traversal.

1.2 Querying Information using ROWID

Full table traversal can be avoided using ROWID queries (equivalent to an INTEGER PRIMARY KEY query). To query the price of peaches, directly retrieve the data whose ROWID is 4:

SELECT price FROM fruitsforsale WHERE rowid=4;
Copy the code

Because each piece of data is stored in the table in ROWID order, SQLite can perform binary lookups on the data. If the table contains N items of data, the time to query a item of data increases by a logN scale factor rather than by a N scale factor. If a table contains 10 million entries, this means traversing the entire table is N/logN times faster, or 1 million times faster.

1.3 Using Indexes to Query information

Queries with ROWId are quick, but what happens when you don’t know roWID? Rowid queries don’t work at this point.

To speed up the query, we can set the “fruit” column in the “FruitsForsalt” table as an index, like this:

CREATE INDEX Idx1 ON fruitsforsale(fruit);
Copy the code

The index table is another table associated with the original “FruitsForsale” table. The index table contains the contents of the index (fruit column) and the Rowid column. The contents of the index are first and all data is sorted by the contents of the index. Figure 4 shows the structure of the index table. The fruit” column is used as the primary key, and the “rowid” column is used as the secondary index. If multiple primary key fields have the same value, the secondary index is used to distinguish them. In the following example, the “Ornage” field values are the same, using the ROWID distinction. You may have noticed that each piece of data in the original table has a unique ROWID, so a compound key of “fruit” and “rowid” can determine a unique index for each piece of data.

Use index to query “peach price” faster:

SELECT price FROM fruitsforsale WHERE fruit='Peach';
Copy the code

SQLLite performs a binary lookup in the index table, finds fruit=’Peach’, and fetches the rowid for that row. Use Rowid to do a second binary lookup in the original table ‘FruitForSale’. After finding the corresponding row, extract the price field value. The retrieval process is shown in Figure 5.

To find the price of peaches, SQLite did two binary lookups. For tables with large amounts of data, this approach is still faster than full table traversal.

1.4 Multi-line lookup

In the previous query, a piece of data was retrieved using the fruit=’Peach’ constraint. But sometimes a constraint may correspond to more than one piece of data. For example, if we were to query the price of oranges, the following would appear:

SELECT price FROM fruitsforsale WHERE fruit='Orange';
Copy the code

Here, SQLite still does a binary lookup first to find the data indexed fruit=’Orange’. Then take out the ROWID and use the ROWID to go to the original table and do another binary lookup to find the corresponding price. SQLite then does not terminate the query, but continues to query the next eligible data. With binary lookup, the cost of querying the next piece of data is much less than the first, because the second piece of data is usually on the same page as the first, as shown in the figure above. This second lookup is so cheap that it is negligible. So the entire query takes about three binary lookups. If there are K eligible entries in the database and N entries in the entire table, the scaling factor for the time consumed by a query is approximately (K+1)*logN.

1.5 Use AND to link multiple WHERE condition queries

Next, you want to inquire about the price of oranges produced in California. The search conditions are as follows:

SELECT price FROM fruitsforsale WHERE fruit='Orange' AND state='CA';
Copy the code

One approach is to find all oranges using the fruit=’Orange’ condition, and then filter out all oranges that are not from California. The query process is shown in Figure 7. In most cases this is a reasonable approach. However, the database needed to do an additional binary lookup to filter out data from Florida, which was not as efficient as might be expected.

Now that you can index the “fruit” column, consider indexing the “state” column as well.

CREATE INDEX Idx2 ON fruitsforsale(state);
Copy the code

The “state” column in the index table is similar to the “fruit” column in Idx1, with the “state” column as the primary key and the “Rowid” column as the secondary index. In this model, the “state” column also has a lot of duplicates, so again, you need to use “rowid” to distinguish them.

Using the index table Idx2, SQLite has a new way of querying: use the index table to find the row that corresponds to California, and then filter out the rows that do not produce oranges.

The final result is the same as with idX1 queries (indexes are used to speed up SQLite queries and should not change the query results). The workload of the two indexing methods is the same, so in the case of query price, using Idx2 does not improve performance.

In this case, the last two queries take the same amount of time. Which one should we use? If the ANALYZE command is enabled, SQLite can collect statistics that use index tables. SQLite will then know to use Idx1 for index queries, which in most cases will only get one row of data (fruit=’Orange’ is a special case in this table); Using Idx2 for all queries, in many cases, you will find two rows of data. So if all other queries are the same, SQLite will select Idx1 for the index query to reduce the number of rows queried. This option is provided by the ANALYZE. If ANALYZE is not running on a database, SQLite selects each query with equal probability.

1.6 Multi-column index Query

To maximize the performance of “AND linking multiple WHERE condition queries “, you need to set up a multi-column index table based on the link items of AND. FruitsForSale create an index for the “fruit” and “state” columns of the FruitsForSale table:

CREATE INDEX Idx3 ON FruitsForSale(fruit, state);
Copy the code

The form of a multi-column index table is the same as that of a single-column index table, with the index column first and the ROWID column second. The leftmost column is used to determine the number of rows to query, and the second column is used to filter the number of rows that do not meet the requirements. If there are three columns, the third column is used to filter the results of the first two columns, and so on. This is not usually the case in our simple data model. For example, if the filter condition is fruit=’Orange’, there are two rows of data and the dirty data needs to be filtered by the second column in the index table. Because roWId is unique, each row in the index table is unique, even though both rows are the same.

Using the new index table Idx3, SQLite queries the price of oranges produced in California with only two binary lookups:

SELECT price FROM fruitsforsale WHERE fruit='Orange' AND state='CA';
Copy the code

Using the WHERE constraint in Idx3 for query, SQLite only needs to do a binary search to find the rowid corresponding to the row “Oranges produced in California”, and then perform a binary search from the original table to find the price of the corresponding orange. This is a very efficient way to query.

Since Idx3 already contains all the information in Idx1, we don’t need Idx1. If you want to query “peach price”, you can ignore the “state” field and directly use Idx3 to query:

SELECT price FROM fruitsforsale WHERE fruit='Peach';
Copy the code

Therefore, it is a good rule to follow when designing databases in the future: do not allow one index table to contain another. While SQLite can still do efficient lookups for long indexes, it is designed to minimize the number of columns in the indexed table.

1.7 Overwriting indexes

It is already very efficient to query “the price of oranges produced in California” using the index table Idx3. Add the “price” column to the index table and use an index table with three options:

CREATE INDEX Idx4 ON FruitsForSale(fruit, state, price);
Copy the code

This index table contains all the fields in the FruitesForSale table. We call this type of query “overlay query”. Because all the field information is set to index. SQLite can query the price of the corresponding fruit without having to query the original table.

SELECT price FROM fruitsforsale WHERE fruit='Orange' AND state='CA';
Copy the code

The column of results to be queried is also added to the index table, eliminating the need to associate with the original table and halving binary lookups. This query does provide a performance boost (the query is about twice as fast). But this is only a slight improvement. In terms of performance improvements, a doubling is often not as good as a multimillion fold. So for most queries, the difference between 1 microsecond and 2 microseconds is trivial.

1.8 Use OR to link multiple WHERE condition queries

Multi-column indexed tables are only suitable for queries with WHERE conditions joined by AND. Therefore, Idx3 and Idx4 are helpful only when the constraint is California production and orange; When the constraint is California production or oranges, these two indexes are no longer helpful.

SELECT price FROM FruitsForSale WHERE fruit='Orange' OR state='CA';
Copy the code

When faced with joining a WHERE condition using OR, SQLite first looks up the ROwiD of the row corresponding to each condition through the index table. Then make a union of these RowiDs and query them in the original table. Here is the query process:

As shown in the figure above, SQLite first queries the rowiDs that meet the criteria, then combines the two parts and uses those ROwiDs to query the original table. The permutations of these ROwiDs are very discrete, and SQLite will remember the indexes that were iterated after querying a ROWID once using an index, thus reducing the computation of the next query. Of course, this is just one implementation detail. The image above cannot represent the full details of the search, but it shows a general process.

The or-by-Union technique shown in the figure above is useful if there is data in the index table that meets the criteria. If there is no data in the index table that meets the constraints of an OR join, SQLite will perform a full table traversal in the original table. Rather than binary lookups through roWID collections, which can be costly in performance.

We can see that the or-by-Union technique performs multi-index query, in fact, it first queries the qualified ROwiDs through the index table, and then performs the UNION operation on these ROwiDs. Similarly, through AND join WHERE condition query, can also be through the index table to meet the conditions of the ROWID query, AND then take the intersection, many SQL database principle is such. However, if single-column index table AND OR-BY-INTERSECT are used for AND query, the performance will be poor. Therefore, multi-column index is generally used for AND query. As SQLite continues to be optimized, the post-order may support or-by-INTERSECT queries.

2. The sorting

SQLite (like many other SQL database engines) can use indexes for ORDER BY queries, not only making them faster. You can also speed up sorting.

An ORDERT BY query needs to be sorted first without the aid of an index. Consider the following statement:

SELECT * FROM fruitsforsale ORDER BY fruit;
Copy the code

SQLite retrieves all the results first, and then sorts the output by using a sorter.

If the number of rows to be output is K, then the scaling coefficient of the sorting time is KlogK. If K is small, the sorting time doesn’t matter. But as shown in the figure above, K==N, the sorting time is much longer than the time needed to traverse the full table. In addition, all retrieval results need to be placed in a temporary cache first (either run or disk cache, depending on the compile-time and run-time Settings), which means that a large cache needs to be occupied until the statement is finished executing.

2.1 Sort using ROWID

Sorting operations are expensive, and SQLite has a hard time converting ORDER BY into a non-time-consuming operation. If the data SQLite is going to output is already sorted, then you don’t need to sort it. For example, if you sort the output by rowid, you don’t need to sort:

SELECT * FROM fruitsforsale ORDER BY rowid;
Copy the code

You can also search in reverse order:

SELECT * FROM fruitsforsale ORDER BY rowid DESC;
Copy the code

So SQLite doesn’t do the sorting. But in order to output in reverse order, SQLite needs to start from the last entry of the table and traverse forward, not forward and backward. See Figure 17.

2.2 Using Index Sort

In practice, however, ordered output is rarely done directly through ROWId. In general, order retrieval is carried out by other conditions. If an index can be used for ORDER BY queries, it can also be used for sorting. For example, to output the “fruit” column sort:

SELECT * FROM fruitsforsale ORDER BY fruit;
Copy the code

The Idx1 index table is first iterated from top to bottom (or bottom to top if the query statement is “ORDER BY Fruit DESC”) to retrieve the corresponding ROwid for each fruit in ORDER. Then roWID performs binary lookup in the original table and outputs the corresponding data. Because the ROWIDS are already sorted when they are retrieved from the index, you can simply retrieve and output the data in the original table in the order of the ROWIDs, without having to reorder all the retrieved results.

But does it really save time? In the approach described at the beginning of this section, the time required to find and then sort the data is proportional to NlogN, because it requires sorting N pieces of data. For ordered search through Idx index table, we need to conduct binary search for N RowiDs, each search time is logN, and the proportion coefficient of the total time is also NlogN.

SQLite’s query planner follows the “low-cost principle.” When there are two or more query methods, SQLite will estimate the time of each query method and choose the one with the lowest cost. The cost is mostly determined by the estimated time, so the final choice depends on the size of the table to be queried and the complexity of the WHERE condition. In general, ordered lookups using indexes are preferred. The main reason is that using index lookups does not require additional temporary storage space to sort the data, reducing memory consumption.

2.3 Using overwrite index sort

If the overwrite index can be used for queries, the roWID step can be eliminated, which dramatically reduces the cost.

Using a covered index, SQLite can simply iterate over all the data and output the results in a time scale factor of N. And there is no need to create an additional temporary cache to sort the data.

3. Query and sort at the same time

The topics of query and sorting were explained separately. In practice, however, the developer needs to find and sort at the same time. Fortunately, you can do this with a single index.

3.1 Simultaneous search and sort operations through multi-column indexes

Let’s say we have a requirement: we want to query the prices of all oranges and sort the output by the place of origin of oranges. The query statement is as follows:

SELECT price FROM fruitforsale WHERE fruit='Orange' ORDER BY state;
Copy the code

This query statement contains both a query and a sort. Using two column indexes in index table Idx3, you can query the data that meets these two conditions.

SQLite performs a binary search to find the rowid corresponding to fruit=’Orange’. (Because Fruit is the leftmost column, the entire index is sorted by the spelling order of furit, so two identical fruit items are next to each other in the table.) Then rowid is used to do a binary lookup in the original table to find the price of the corresponding fruit.

You might notice that there’s no sort process here. There is no specific procedure to perform the ORDER BY operation. There is no sorting process because the data is sorted by state when it is retrieved from the index table. In an index table, if the values of the first column are the same (for example, ‘Orange’ in the figure above), the values of the corresponding second column will be arranged in the same order as the first column. So, if we iterate over two rows with the same fruit value in an index table, the state column of the two rows must be in order.

3.2 Use overlay indexes for lookups and sorts

Overwriting indexes can also be used for lookups and sorts, such as the following:

SELECT * FROM fruitforsale WHERE fruit='Orange' ORDER BY state;
Copy the code

As mentioned earlier, to satisfy the WHERE condition, SQLite performs a binary lookup, traversing the index table from top to bottom to find the data that matches the condition. If the value constrained by the WHERE condition has multiple entries in the index table, the entries must be contiguous. Traversal is done from top to bottom. Because the fruit column is state, so when fruit is equal, it’s going to be sorted by state, and so on. According to this principle, find out the data is directly in order, very efficient.

SQLite can also do descending queries:

SELECT * FROM fruitforsale WHERE fruit='Orange' ORDER BY state DESC;
Copy the code

The basic principle is similar, except this time it is traversed from the bottom up, so that the queried data is also sorted in descending order.

3.3 Using indexes for local sorting

In some cases, the index table can only satisfy the sorting of some attributes. For example, the following query:

SELECT * FROM fruitforsale ORDER BY fruit, price;
Copy the code

When traversing the covered index table, the fruit column is definitely in order, but if you have multiple data in the table with the same fruit field value, their price field values are not necessarily in order. When this happens, SQLite does a lot of local sorting, sorting on one fruit at a time rather than the entire table. Figure 22 shows this process:

In this example, instead of sorting all seven pieces of data, there are five single sorts (not really) and one double sort (fruit=’Orange’).

The advantages of doing multiple local sorts rather than the whole sort are:

  1. Multiple local sorts at the same time can reduce the CPU clock cycle as opposed to a single global sort.
  2. Each local sort can run quickly, which means you don’t have to temporarily store a lot of information in the memory cache, reducing memory footprint.
  3. Some sort keys are already sorted in the index table, and the ones that write SQL can be omitted to reduce memory usage and CPU execution time.
  4. Each time a local sort is completed, the data is returned to the application; An overall query will need to traverse the full table before returning the data. The former is better.
  5. If you use the LIMIT condition, you can also avoid traversing the entire table.

Because of these advantages, SQLite often uses indexes for local sorting rather than overall sorting.

4. Rowid-free tables

The basic principles described above apply to both roWID-containing and ROWID-free tables. The only difference is that for tables with ROWID, the ROWID column is typically used as a key for a table. After the index table is created, roWID is used to associate the index table with the original table at the right end of all the tables, and its place in the index table is replaced by the primary key.

reference

  • https://www.sqlite.org/queryplanner.html