
SQLite is widely used in mobile development, and its quality of use directly affects the product experience.

Common SQLite quality monitoring relies on post-launch feedback mechanisms, such as time-cost (slow query) monitoring or user feedback. The problems with this approach are:

  • It is hindsight: by the time a problem is reported, the negative impact has already occurred.
  • Coverage is incomplete: anything under the threshold is missed. For example, with a 500 ms threshold, an SQL statement averaging 490 ms that could be optimized to 20 ms is simply ignored.

Can SQLite usage quality be monitored before launch? That is what we tried with a tool: SQLiteLint. Despite the name "Lint", it is not a static code check; instead it analyzes SQL statements, execution sequences, table information, and so on while the app is running. It is, however, like "Lint" in spirit: it steps in at the development stage and applies best-practice rules to detect potential and suspicious SQLite usage problems.

This article introduces the ideas behind SQLiteLint and also shares some SQLite experience along the way; I hope you find it helpful.

How It Works

SQLiteLint performs its detection while the app is running, and most of its detection algorithms are independent of data volume, that is, they do not depend on the data state in production. Whenever an SQL statement is executed, SQLiteLint reviews whether the statement is written poorly. This can be done in the development, testing, or grayscale (staged rollout) phases.

The detection process is simple:

1. Collect SQL execution information while the app runs, including executed statements, table creation information, and so on. Table information can be obtained via pragma commands (a small sketch of these is given after this list). For executed statements there are two cases: a) the DB framework provides a callback interface; WeChat uses WCDB, for example, and can easily register a callback through MMDataBase.setSQLiteTrace to obtain this information. b) With Android's default DB framework, SQLiteLint provides a non-intrusive way to collect the executed SQL statements and their execution time: by hooking the sqlite3_profile callback of the SQLite C-layer API, the information needed for analysis is obtained without any extra instrumentation code from the developer.

2. Preprocess. This includes generating the corresponding SQL syntax tree, generating a parameter-free version of the SQL, determining whether it is a select * statement, and so on, to prepare for the analysis that follows. Preprocessing and the subsequent algorithm scheduling run on a separate processing thread.

3. Schedule the specific detection algorithms. A Checker is one of a variety of detection algorithms, and new ones can be added as extensions. The detection algorithms are implemented in C++, which makes it convenient to support multiple platforms. Scheduling is triggered in several ways: scheduling of recently unanalyzed SQL statements, sampled scheduling, scheduling at initialization, and scheduling for every SQL statement.

4. Issue the findings, as a problem report or a pop-up message.
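Returning to step 1, here is a minimal sketch of the built-in SQLite commands that can supply the table and index information mentioned there; t_account and name_index are hypothetical names used only for illustration.

```sql
-- Table and index metadata available directly from SQLite.
PRAGMA table_info(t_account);    -- column id, name, type, notnull, default value, pk flag
PRAGMA index_list(t_account);    -- indexes defined on the table
PRAGMA index_info(name_index);   -- columns covered by a given index

-- The original CREATE statements can also be read from sqlite_master:
SELECT name, sql FROM sqlite_master WHERE tbl_name = 't_account';
```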

As you can see, the focus is on step 3. Let’s discuss in detail the quality problem detection that SQLiteLint is currently concerned with.

Introduction to Detection Problems

1. Detect index usage problems

Index usage is the most common category of database problems, and the one with the most direct impact on performance. SQLiteLint's analysis here is mainly based on SQLite's EXPLAIN QUERY PLAN, i.e. the query plan of an SQL statement. First, a brief look at the most common keywords in query plans:


  • SCAN TABLE: a full table scan; the whole table is traversed to find the result set, complexity O(n).
  • SEARCH TABLE: a binary search on the index tree finds the rowid, and a second binary search on the table tree using that rowid fetches the target result set, complexity O(log n).
  • USE TEMP B-TREE: a temporary B-tree is built to sort the result set, costing extra space and time; such a query plan can appear with the ORDER BY keyword, for example.
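To make these concrete, here is a minimal sketch with hypothetical table and index names (t_account, name_index); the exact EXPLAIN QUERY PLAN wording varies with the SQLite version, and the older "SCAN TABLE ..." style is used throughout this article.

```sql
CREATE TABLE t_account (id INTEGER PRIMARY KEY, name TEXT, mark INTEGER);

EXPLAIN QUERY PLAN SELECT * FROM t_account WHERE name = 'foo';
-- Possible output: SCAN TABLE t_account           (no index on name)

CREATE INDEX name_index ON t_account(name);

EXPLAIN QUERY PLAN SELECT * FROM t_account WHERE name = 'foo';
-- Possible output: SEARCH TABLE t_account USING INDEX name_index (name=?)

EXPLAIN QUERY PLAN SELECT * FROM t_account ORDER BY mark;
-- Possible output: SCAN TABLE t_account
--                  USE TEMP B-TREE FOR ORDER BY   (no index on mark)
```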


By analyzing the query plan, SQLiteLint currently checks for the following index issues:

1. Full table scan caused by a missing index (SCAN TABLE in the query plan)

Although creating an index is the most basic optimization technique, in real development many developers fail to create a proper index, whether from inexperience or deadline pressure. SQLiteLint helps catch this omission. The problem is small and simple to fix, but it is the most common one. In passing, the usual conditions under which indexing is not appropriate are: far more writes than reads, and very small tables. However, it should be unusual for a client-side table to be written far more than it is read. And when the row count is small, an index may make a query slightly slower (because loading the index can cost more than a full table scan), but the difference is minimal. So the general position is that client-side queries should try to use index optimization. If a table is known to stay small, or to be write-heavy and read-light, it can be added to a whitelist and excluded from detection.

The fix for this kind of problem is, of course, to create the corresponding index.

2. Full table scan caused by an index that does not take effect (SCAN TABLE in the query plan)

In some cases an index exists but is not used, and in some of those cases the SQL statement can be rewritten so that it is. Here's an example:
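A minimal sketch, using hypothetical names (t_user, user_name_index); with SQLite's default case-insensitive LIKE, the ordinary index on name is not used for the prefix pattern:

```sql
CREATE TABLE t_user (id INTEGER PRIMARY KEY, name TEXT);
CREATE INDEX user_name_index ON t_user(name);

EXPLAIN QUERY PLAN SELECT * FROM t_user WHERE name LIKE 'lo%';
-- Possible output:
--   SCAN TABLE t_user            (user_name_index is not used)
```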

As seen above, even though the index exists, it is not actually used for the query. In this case the LIKE can be rewritten as an inequality (range) comparison:
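Continuing the sketch, rewriting the prefix LIKE as a half-open range ('lp' being the next string after the 'lo' prefix) lets the index be used:

```sql
EXPLAIN QUERY PLAN
SELECT * FROM t_user WHERE name >= 'lo' AND name < 'lp';
-- Possible output:
--   SEARCH TABLE t_user USING INDEX user_name_index (name>? AND name<?)
```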

Here you can see that the index is now used to SEARCH TABLE, avoiding a full table scan. Note, though, that not every LIKE can be optimized this way; for LIKE '%lo' or LIKE '%lo%' the inequality rewrite does not apply.

Let’s look at an example where a bit operation causes an index to fail:
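A minimal sketch with hypothetical names (t_msg, flag_index); the bitwise expression on the indexed column flag defeats the index:

```sql
CREATE TABLE t_msg (id INTEGER PRIMARY KEY, flag INTEGER, content TEXT);
CREATE INDEX flag_index ON t_msg(flag);

EXPLAIN QUERY PLAN SELECT * FROM t_msg WHERE (flag & 0x6) != 0;
-- Possible output:
--   SCAN TABLE t_msg             (flag_index is not used)
```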

Bitwise operations are among the most common constructs that defeat an index. If the business values of flag in this case can only be 0x1, 0x2, 0x4, or 0x8, then the statement can be rewritten equivalently by enumerating the matching values:
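Continuing the sketch: given that flag can only be 0x1, 0x2, 0x4, or 0x8, the condition (flag & 0x6) != 0 matches exactly the values 0x2 and 0x4, so it can be enumerated with IN:

```sql
EXPLAIN QUERY PLAN SELECT * FROM t_msg WHERE flag IN (0x2, 0x4);
-- Possible output:
--   SEARCH TABLE t_msg USING INDEX flag_index (flag=?)
```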

As seen above, converting the bitwise operation into an IN enumeration lets the index be used.

Solving full table scans caused by such ineffective indexes requires optimizing the SQL according to the actual business, sometimes with a few tricks. Some statements cannot be optimized at all; those need to be added to the whitelist.

3. Unnecessary temporary B-tree sorting (USE TEMP B-TREE in the query plan)

SQL clauses such as ORDER BY, DISTINCT, and GROUP BY may cause an extra temporary sort of the result set. Many of these cases can be optimized by creating an appropriate index. Here's an example:
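A minimal sketch with hypothetical names (t_score, id_index, mark_index):

```sql
CREATE TABLE t_score (id INTEGER, name TEXT, mark INTEGER);
CREATE INDEX id_index   ON t_score(id);
CREATE INDEX mark_index ON t_score(mark);

EXPLAIN QUERY PLAN
SELECT id, name FROM t_score WHERE mark = 60 ORDER BY id LIMIT 1;
-- Possible output:
--   SEARCH TABLE t_score USING INDEX mark_index (mark=?)
--   USE TEMP B-TREE FOR ORDER BY
```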

As seen above, even though id and mark each have their own index, USE TEMP B-TREE FOR ORDER BY still appears, even when only one row of results is needed. This case is very simple, but it happens easily when a developer is unfamiliar with SQLite indexes or is just careless. It is also easy to optimize:
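Continuing the sketch: a composite index covering both the filter column and the sort column lets SQLite read the rows already in the required order, so the temporary B-tree disappears (mark_id_index is a hypothetical name):

```sql
CREATE INDEX mark_id_index ON t_score(mark, id);

EXPLAIN QUERY PLAN
SELECT id, name FROM t_score WHERE mark = 60 ORDER BY id LIMIT 1;
-- Possible output:
--   SEARCH TABLE t_score USING INDEX mark_id_index (mark=?)
```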

This avoids the temporary tree sort, and for queries on large tables the improvement is immediately noticeable.

The fix for this kind of problem is generally to create a suitable index.

4. Insufficient index combinations

This mainly refers to the case where an index exists, but its column combination does not cover enough of the columns in the WHERE clause conditions. When SQLiteLint reports this problem, the recommendation is to first check whether the SQL statement actually has a performance issue before deciding whether to create a longer composite index. Here's an example:
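A minimal sketch with hypothetical names (t_student, genderIndex):

```sql
CREATE TABLE t_student (id INTEGER PRIMARY KEY, gender INTEGER, mark INTEGER);
CREATE INDEX genderIndex ON t_student(gender);

EXPLAIN QUERY PLAN
SELECT * FROM t_student WHERE gender = 1 AND mark = 60;
-- Possible output:
--   SEARCH TABLE t_student USING INDEX genderIndex (gender=?)
--   (the mark = 60 condition is still checked row by row afterwards)
```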

genderIndex is used, but the WHERE clause also contains the condition mark = 60, so another pass over the matched rows is still needed to get the final result set. In this particular case gender has at most three possible values, so a gender-only index is not very selective and its benefit is limited. Again, the optimization is easy:
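Continuing the sketch, a longer composite index covers both WHERE columns (genderMarkIndex is a hypothetical name):

```sql
CREATE INDEX genderMarkIndex ON t_student(gender, mark);

EXPLAIN QUERY PLAN
SELECT * FROM t_student WHERE gender = 1 AND mark = 60;
-- Possible output:
--   SEARCH TABLE t_student USING INDEX genderMarkIndex (gender=? AND mark=?)
```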

The solution to this problem is to build a larger composite index.

5. How to reduce false positives

Currently SQLiteLint finds these problems based on certain keywords in the query plan, but the query syntax SQLite supports is very rich and the corresponding query plans vary endlessly, so automatically and correctly analyzing query plans is not an easy task. SQLiteLint puts a lot of effort into this: it reconstructs an analysis tree from the query plan output and, combined with the syntax tree of the SQL statement, analyzes the query plan according to certain algorithms and rules. Building the analysis tree makes use of the numbers such as "0|1|0" at the front of each query plan row, which is not expanded on here. For example: should every query plan prefixed with SCAN TABLE be considered optimizable? Clearly not. Look at a specific case:
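A minimal sketch following the table names in the text (t3, t4; the index name t3_id_index is hypothetical), with only t3.id indexed:

```sql
CREATE TABLE t3 (id INTEGER, name TEXT);
CREATE TABLE t4 (id INTEGER, name TEXT);
CREATE INDEX t3_id_index ON t3(id);

EXPLAIN QUERY PLAN SELECT t3.name FROM t4, t3 WHERE t3.id = t4.id;
-- Possible output:
--   SCAN TABLE t4
--   SEARCH TABLE t3 USING INDEX t3_id_index (id=?)
```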

This is a join query, which SQLite generally implements as a nested loop. In this statement the t3.id column is indexed and that index is used in the second (inner) loop, but the SCAN TABLE of the first (outer) loop cannot be optimized away. For example, try indexing the id column of t4 as well:
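Continuing the sketch (t4_id_index is a hypothetical name); even with t4.id indexed, one of the two tables must still be scanned in the outer loop, possibly via a covering index:

```sql
CREATE INDEX t4_id_index ON t4(id);

EXPLAIN QUERY PLAN SELECT t3.name FROM t4, t3 WHERE t3.id = t4.id;
-- Possible output:
--   SCAN TABLE t4 USING COVERING INDEX t4_id_index
--   SEARCH TABLE t3 USING INDEX t3_id_index (id=?)
```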

As can be seen, a SCAN TABLE still cannot be avoided. SQLiteLint should not raise a false positive for a SCAN TABLE that cannot be optimized. As mentioned earlier, the query plan is organized into a tree structure; in this case the final query-plan analysis tree reflects the two loops of the join.

The analysis tree has an important property: leaf nodes that have siblings correspond to join (multi-table) queries, with the loop order running from left to right, while a leaf without siblings corresponds to a single-table query. The analysis ultimately comes down to analyzing the leaf nodes. One of the rules for traversing leaf nodes (described incompletely here) is:

If a leaf node has siblings and is the leftmost one, i.e. the first-level loop, and the WHERE clause contains no relevant constant conditional expression for that table, then its SCAN TABLE is not considered a quality problem.

So two conditions must both hold for a SCAN TABLE not to be reported: it is the first-level loop, and there is no relevant constant expression. The first-level loop has been covered above; the second condition is explained here.
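Continuing the sketch: adding a constant conditional expression on t4.id (the value 666 follows the text) changes the plan; with both t3.id and t4.id indexed, no SCAN TABLE is needed:

```sql
EXPLAIN QUERY PLAN
SELECT t3.name FROM t4, t3 WHERE t3.id = t4.id AND t4.id = 666;
-- Possible output:
--   SEARCH TABLE t4 USING COVERING INDEX t4_id_index (id=?)
--   SEARCH TABLE t3 USING INDEX t3_id_index (id=?)
```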

As can be seen from the above, when the constant conditional expression t4.id = 666 appears in the WHERE clause and both t3.id and t4.id are indexed, the query can be optimized so that no SCAN TABLE appears.
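Continuing the sketch, dropping the t4.id index brings the scan back:

```sql
DROP INDEX t4_id_index;

EXPLAIN QUERY PLAN
SELECT t3.name FROM t4, t3 WHERE t3.id = t4.id AND t4.id = 666;
-- Possible output:
--   SCAN TABLE t4
--   SEARCH TABLE t3 USING INDEX t3_id_index (id=?)
```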

After the t4.id index is deleted, SCAN TABLE appears again. For a SCAN TABLE that does not meet the second condition of the rule, SQLiteLint reports that index optimization is possible.

This was the analysis of the query plan of a fairly simple statement; there are of course more complex statements, subqueries, compound queries, and so on, which are not discussed here. The sheer complexity undoubtedly poses a great challenge to accuracy and requires constant iteration and refinement of the analysis rules. SQLiteLint's current analysis algorithm is still not rigorous enough, and there is plenty of room for improvement. There is also another way to address the accuracy issue: rank all reported problems by execution time, thread, problem level, and so on. This roundabout strategy for containing false positives also applies to all the other detections described in this article.

2. Detect redundant index problems

After the application starts, SQLiteLint checks all tables once for redundant indexes and recommends keeping only the longest index combination.

First, define a redundant index: for a given table, if an index combination such as index1 or index2 is a prefix of another index combination index3, then index3 can generally serve the purposes of index1 and index2, making them redundant. Redundant indexes add extra insert cost and space cost, so it is generally recommended to keep only index3. Here's an example:
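A minimal sketch with hypothetical names (t_file, lengthIndex, lengthTypeIndex):

```sql
CREATE TABLE t_file (id INTEGER PRIMARY KEY, name TEXT, length INTEGER, type INTEGER);
CREATE INDEX lengthIndex     ON t_file(length);
CREATE INDEX lengthTypeIndex ON t_file(length, type);

-- lengthIndex is a prefix of lengthTypeIndex: any query that lengthIndex can
-- serve (e.g. filtering on length alone) can also be served by lengthTypeIndex,
-- so lengthIndex only adds insert and space overhead. Keep lengthTypeIndex only.
```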

As you can see above, if an index on (length, type) already exists, it already covers queries that filter on length alone, so there is no need to create a separate index on length.

3. Detect select *

Here SQLiteLint recommends avoiding select * and instead selecting only the columns that are actually needed.

select * is one of the most commonly used and most convenient SQL constructs, so why treat it as a problem? The reasons are worth spelling out:

  1. For select *, SQLite still has to perform an extra step underneath to expand * into all the columns of the table.
  2. select * also forfeits the chance of using a covering index. A covering index is one whose columns already cover all the columns required by the SELECT; using one saves a lookup back into the table (see the sketch after this list).
  3. On Android, select * returns every column, so each row occupies more memory, fewer rows fit into a CursorWindow (the cursor buffer), and SQLiteQuery.fillWindow may therefore be called more often, which also hurts performance.
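A minimal sketch of point 2 with hypothetical names (t_contact, name_phone_index): selecting only indexed columns lets SQLite answer from the index alone, while select * must go back to the table for the remaining columns:

```sql
CREATE TABLE t_contact (id INTEGER PRIMARY KEY, name TEXT, phone TEXT, note TEXT);
CREATE INDEX name_phone_index ON t_contact(name, phone);

EXPLAIN QUERY PLAN SELECT name, phone FROM t_contact WHERE name = 'bob';
-- Possible output:
--   SEARCH TABLE t_contact USING COVERING INDEX name_phone_index (name=?)

EXPLAIN QUERY PLAN SELECT * FROM t_contact WHERE name = 'bob';
-- Possible output:
--   SEARCH TABLE t_contact USING INDEX name_phone_index (name=?)
```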

For the above reasons, select * is still treated as a problem by SQLiteLint, whose target is best practice.

4. Detect AUTOINCREMENT problems

SQLiteLint will check all table creation statements after the application is started. If the AUTOINCREMENT keyword is found, a problem is reported. You are advised not to use AUTOINCREMENT.

To see why this problem is detected, here’s a reference to SQLite’s official documentation:

The AUTOINCREMENT keyword imposes extra CPU, memory, disk space, and disk I/O overhead and should be avoided if not strictly needed. It is usually not needed.

You can see that AUTOINCREMENT really is not a great thing. P.S. Let me add what "strictly needed" means and why it is usually unnecessary. AUTOINCREMENT is applied to an INTEGER PRIMARY KEY column, hereafter the IPK column. The IPK column is an alias of the ROWID and is already monotonically increasing, but it may reuse the numbers of deleted rows. For example, suppose a table has four rows and the maximum ROWID is 4; delete the fourth row and insert a new one. The new row's ROWID is one more than the current maximum, i.e. 3 + 1 = 4, so the number 4 of the deleted row is reused. Adding AUTOINCREMENT prevents this reuse: the new ROWID would be 5 instead. In other words, AUTOINCREMENT guarantees historically unique, strictly increasing IDs, but how many client-side scenarios actually need that?
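A minimal sketch of the rowid-reuse behavior described above (t_plain and t_auto are hypothetical tables):

```sql
CREATE TABLE t_plain (id INTEGER PRIMARY KEY, val TEXT);
CREATE TABLE t_auto  (id INTEGER PRIMARY KEY AUTOINCREMENT, val TEXT);

INSERT INTO t_plain (val) VALUES ('a'), ('b'), ('c'), ('d');
INSERT INTO t_auto  (val) VALUES ('a'), ('b'), ('c'), ('d');

DELETE FROM t_plain WHERE id = 4;
DELETE FROM t_auto  WHERE id = 4;

INSERT INTO t_plain (val) VALUES ('e');  -- gets id 3 + 1 = 4: the number is reused
INSERT INTO t_auto  (val) VALUES ('e');  -- gets id 5: AUTOINCREMENT never reuses

SELECT max(id) FROM t_plain;  -- 4
SELECT max(id) FROM t_auto;   -- 5
```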

5. Prepared Statement is recommended

When the same SQL statement (ignoring its arguments) is executed consecutively more than a threshold number of times without a prepared statement, SQLiteLint reports a problem and suggests using a prepared statement. For example, with a threshold of 3, executing the following statements consecutively would be reported:
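A minimal sketch (t_person is a hypothetical table): three consecutive inserts that differ only in their literal values. A single prepared statement such as INSERT INTO t_person (name) VALUES (?) with bound arguments would be compiled once and reused instead.

```sql
CREATE TABLE t_person (id INTEGER PRIMARY KEY, name TEXT);  -- only for completeness

INSERT INTO t_person (name) VALUES ('alice');
INSERT INTO t_person (name) VALUES ('bob');
INSERT INTO t_person (name) VALUES ('carol');
```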

There are two benefits to using Prepared Statement optimization:

  1. Executing the same SQL statement many times (with different arguments) avoids recompiling it each time, which improves performance.
  2. It also prevents injection problems when the arguments are untrusted or uncontrolled input.

6. Suggest using the WITHOUT ROWID feature

SQLiteLint checks all table creation statements after the application starts and, based on the table information, finds tables that do not use the WITHOUT ROWID technique but could benefit from it. This reflects another idea behind SQLiteLint: checking whether some of SQLite's more advanced features could be applied.

In the right cases, WITHOUT ROWID can cut both space and time roughly in half. A brief sketch of the principle, using an example:
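A minimal sketch with hypothetical names (t_mark_rowid, t_mark), contrasting an ordinary rowid table keyed by name with the same table declared WITHOUT ROWID:

```sql
-- Ordinary rowid table: a data tree keyed by rowid plus an index tree on name.
CREATE TABLE t_mark_rowid (
    name TEXT PRIMARY KEY,
    mark INTEGER
);

-- WITHOUT ROWID table: a single data tree keyed directly by name.
CREATE TABLE t_mark (
    name TEXT PRIMARY KEY,
    mark INTEGER
) WITHOUT ROWID;
```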

For an ordinary rowid table (the rowid is generated automatically), a query by name involves two lookups: one in the name index tree to find the corresponding rowid, and one in the data tree, using that rowid, to fetch the mark column. If instead the table is created WITHOUT ROWID:

the data tree is built with name as the key and mark as the data, stored as an ordinary B-tree. For the same query, a single lookup in the data tree is now enough to obtain the mark column, saving one O(log n) search. In addition, there is no separate name index tree to maintain, which also saves insertion cost and space.

Of course, WITHOUT ROWID is not suitable everywhere, otherwise it would be the default. SQLiteLint recommends WITHOUT ROWID only if both of the following conditions are met:

  1. The table contains a non-INTEGER or composite (multi-column) PRIMARY KEY
  2. The rows are small; a good rule of thumb is that the average row size is less than one-twentieth of the page size. P.S. The default page size is 4096 bytes from SQLite 3.12.0 onward (corresponding to Android O and later) and 1024 bytes before that. To reduce false positives, SQLiteLint uses stricter criteria: the table contains no BLOB columns and no non-primary-key TEXT columns.

For condition 1: if there is no PRIMARY KEY at all, the WITHOUT ROWID feature cannot be used; and if the table has an INTEGER PRIMARY KEY, as noted earlier, that column is already equivalent to the rowid. For condition 2, rows smaller than 1/20 of the page size is the official recommendation. Here is my understanding of why. A page is SQLite's basic read/write unit (in fact the critical part is the disk read/write block, and most disk cost is in seeking, so more pages can mean more seeks). A WITHOUT ROWID table is stored as an ordinary B-tree, with data stored in every node of the tree. If the rows are large, fewer nodes fit in a page, and a lookup has to read more pages, making the lookup more expensive. This is relative to a rowid table's B*-tree storage, where all data lives in leaf nodes and interior nodes on the search path hold only keys, so a page can hold many interior nodes and a lookup costs fewer disk reads. The important thing is not to reason about this purely in terms of in-memory algorithmic complexity. The above is my own inference and may not be entirely correct; corrections are welcome.

By extension, this is also why SQLite organizes index trees as B-trees and rowid table trees as B*-trees: each index-tree node stores mainly the index columns and the rowid, which are usually not large, and compared with a B*-tree the search does not always have to descend to a leaf node to finish. For the same reason as the WITHOUT ROWID limitation, large strings are not recommended as index columns, which could of course also be added to SQLiteLint's detections.

Summary

This article introduced a tool for checking SQLite usage quality in the development, testing, or grayscale phases. The benefits of this approach are:

  • Find problems before going online
  • Focus on best practices

The bulk of this article is really a discussion of SQLite best practices, because the idea behind SQLiteLint is to automate the detection of best practices. Of course, the checks could cover a wider range, accuracy remains a challenge, and there is still a lot of room for improvement.

This article was published on the Tencent Cloud+ community with the author's authorization.