Let Python make full use of Sqlite3

I’ve been using SQlite3 a lot lately in projects that involve a lot of data processing. My initial attempt didn’t involve any database at all, and all the data would be kept in memory, including dictionary lookups, iterations, and conditional queries. That’s fine, but there’s only so much you can put into memory, and regenerating or loading data from disk into memory is a tedious and time-consuming process.

I decided to give SQlite3 a try. Because only the connection to the database is open, this increases the amount of data that can be processed and reduces the application load time to zero. In addition, I can replace many Python logic statements with SQL queries.

I’d like to share some thoughts and findings from this experience.

TL; DR

Use a large number of operations (aka executemany).
You don’t need to use the cursor (most of the time).
The cursor can be iterated.
Use the context manager.
Use compile instructions (when it makes sense).
Delay index creation.
Use placeholders to insert Python values.

1. Use a lot of operations

If you need to insert many rows into the database at once, you really shouldn’t use execute. The sqlite3 module provides a method for batch inserts: executemany.

Instead of doing something like this:

for row in iter_data(): connection.execute('INSERT INTO my_table VALUES (?) ', row)Copy the code

You can take advantage of the fact that executemany accepts a tuple’s generator as an argument:

connection.executemany( 'INSERT INTO my_table VALUE (?) ', iter_data() )Copy the code

It’s not only cleaner, it’s more efficient. Sqlite3 actually implements execute behind the scenes using Executemany, but the latter inserts one row instead of multiple rows.

I wrote a small benchmark to insert a million rows into an empty table (database in memory) :

Executemany: 1.6 seconds
The execute: 2.7 seconds

2. You don’t need a cursor

One thing I always got confused about at first was cursor management. Online examples and documentation are usually as follows:

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
# Do something with cursorCopy the code

But most of the time, you don’t need a cursor at all, and you can just use the connection object (mentioned at the end of this article). Operations like execute and executemany can be invoked directly on the connection. Here’s an example to prove it:

import sqlite3 connection = sqlite3(':memory:') # Create a table connection.execute('CREATE TABLE events(ts, msg)') # Insert values connection.executemany( 'INSERT INTO events VALUES (? ,?) ', [ (1, 'foo'), (2, 'bar'), (3, 'baz') ] ) # Print inserted rows for row in connnection.execute('SELECT * FROM events'): print(row)Copy the code

Other Translations (1)

3. Cursors can be used for iteration

You may often see examples of using fetchOne or fetchAll to process the results of a SELECT query. But I’ve found that the most natural way to process these results is to iterate directly over the cursor:

for row in connection.execute('SELECT * FROM events'):
    print(row)Copy the code

This way, as long as you get enough results, you can terminate the query without wasting resources. Of course, if you know in advance how many results you want, you can use LIMIT SQL instead, but Python generators are handy and allow you to separate data generation from data consumption.

Other Translations (1)

4. Use Context Managers

Even in the middle of processing SQL transactions, nasty things can happen. To avoid handling rollback or commit manually, you can simply use connection objects as context managers. In the following example, we create a table and insert duplicate values by mistake:

import sqlite3 connection = sqlite3.connect(':memory:') with connection: connection.execute( 'CREATE TABLE events(ts, msg, PRIMARY KEY(ts, msg))') try: with connection: connection.executemany('INSERT INTO events VALUES (? ,?) ', [ (1, 'foo'), (2, 'bar'), (3, 'baz'), (1, 'foo'), ]) except (sqlite3.OperationalError, sqlite3.IntegrityError) as e: print('Could not complete operation:', e) # No row was inserted because transaction failed for row in connection.execute('SELECT * FROM events'): print(row) connection.close()Copy the code

Other Translations (1)

5. Use Pragmas

… When it does work

There are several pragmas in your program that can be used to adjust the behavior of SQlite3. In particular, one that can improve performance is synchronous:

connection.execute('PRAGMA synchronous = OFF')Copy the code

You should know it could be dangerous. If an application crashes unexpectedly in the middle of a transaction, the database may be left in an inconsistent state. So please use with care! But if you want to insert a lot of rows faster, then this might be an option.

Other Translations (1)

6. Delay index creation

Suppose you need to create several indexes on a database, and you need to create indexes while inserting many rows. Delaying index creation until all rows have been inserted can lead to substantial performance improvements.

7. Insert Python values using placeholders

It is convenient to include values in queries using Python string operations. But doing so is very insecure, and SQlite3 gives you a better way to do it:

# Do not do this! my_timestamp = 1 c.execute("SELECT * FROM events WHERE ts = '%s'" % my_timestamp) # Do this instead my_timestamp = (1,) c.execute('SELECT * FROM events WHERE ts = ? ', my_timestamp)Copy the code

In addition, string interpolation using Python % s (or format or format string constants) is not always feasible for Executemany. So there’s no real point in trying! Keep in mind that these tips may (or may not) work to your advantage, depending on the specific use case. You should always try for yourself and decide if it’s worth it.

1. Use a lot of operations

2. You don’t need a cursor

3. Cursors can be used for iteration

4. Use Context Managers

5. Use Pragmas

6. Delay index creation

7. Insert Python values using placeholders

Related Posts

Docker + Gitlab + Jenkins built CI/CD system

Optimization practice of alibaba Cloud’s native ETCD service cluster management and control

With the database middleware Mycat+SpringBoot to complete the sub-database sub-table