Lightkv-high performance key-value storage component

LightKV is a lightweight, high-performance, and reliable Key-value storage component based on Java NIO.

A, origin

Common local storage methods of Android platform, SDK built-in SQLite, SharedPreference, etc., open source components such as ACache, DiskLruCahce, etc., have their own characteristics and applicability. SharedPreference is favored by developers with its natural key-value API, secondary storage (memory HashMap, disk XML file) and other characteristics. However, any tool can be useful. See the article don’t Abuse SharedPreference. Of course, some of the disadvantages are due to its positioning. For example, it is not suitable for storing large key-values. However, there are some areas that can be improved, such as the storage format: XML parsing is slow, takes up a lot of space, and special characters need to be escaped, which is not a good idea for storage with high frequency changes. Therefore, it is necessary to write a modified version of the key-value storage component.

Two, LightKV principle

2.1 Storage Format

We want the file to be streamed, and for simple key-value forms, it’s entirely possible to customize the format. For example, simply in order to save the key – it is good to value: the key value | | key value | | key | value…

value

For the value type, we need to support some common base types: Boolean, int, long, float, double, as well as strings and arrays (byte[]). In particular, more complex types (such as objects) can be converted with strings and arrays. As the underlying component, supporting the most basic types simplifies the complexity. For String and byte[], the length is stored before the content.

key

We observed that in practice keys are usually predefined; Therefore, we can discard some generality and use int as key instead of String. If you use an int as a key, you can use less space to carry more information.

public interface DataType {
    int OFFSET = 16;
    int MASK = 0xF0000;
    int ENCODE = 1 << 20;

    int BOOLEAN = 1 << OFFSET;
    int INT = 2 << OFFSET;
    int FLOAT = 3 << OFFSET;
    int LONG = 4 << OFFSET;
    int DOUBLE = 5 << OFFSET;
    int STRING = 6 << OFFSET;
    int ARRAY = 7 << OFFSET;
}
Copy the code

The lower 16 bits of an int are used to define the key, the 17-19 bits are used to define the type, the 20 bits are reserved, the 21 bits are encoded (more on this later), and the 32 bits (the highest bit) are valid: 1 is invalid and is skipped for reads.

Memory cache

SharePreference has one more layer of memory storage than ACache, DiskLruCache, etc., so their positioning is distinct: the latter are usually used to store large objects or files, etc., they are only responsible for providing disk storage, and it is not their responsibility to use and manage the memory after reading. Too large objects will take up too much memory, and SharePreference is a long-term holding reference, there is no space limit and elimination mechanism, so SharePreference is suitable for “lightweight storage”, which brings the benefit of fast read speed. The LightKV location is also “lightweight storage”, so the key-value is stored in memory as well, but SparseArray is used here.

2.2 Storage Operations

Mentioned above, the storage format is simply the key – value in that: the key value | | key value | | key | value… In this way, you can read it streaming, or even write it incrementally.

Solution one, incremental & asynchronous

The increment operation

Feature: the tail additional key | value;
Delete: To avoid byte movement, use the marking method — mark the top bit of the key to 1;
Modify: If the length of value remains unchanged, address to the corresponding position and write value. Otherwise, “Delete” first, then “add”;
GC: When the file content is parsed (parsed during data loading), the length of the deleted content is recorded. If the length exceeds the threshold, the file is cleared and a full write is performed. If a delete operation is performed during the operation, the total length of the deleted contents is added after the delete operation. If the length exceeds the specified threshold, the system GC.

mmap

To incrementally modify a file, you need the ability to write randomly: Java NIO would be a good choice, or even mMAP (memory-mapped files). Mmap also has several advantages: 1. Direct manipulation of kernel space: avoid copying data between kernel space and user space. 2. Automatic periodic refresh: avoid frequent disk operations. 3. Refresh when the process exits: call at the system level, without worrying about data loss caused by process exit.

If not, the mapping file phase costs more than the normal IO open file. So the API recommends mMAP for large files and regular IO for small files. The online introduction of MMAP is also more than a copy of large files. In fact, if small files are frequently written, it is also worth a try, such as Tencent’s log component Xlog and storage component MMKV, both use MMAP.

Mmap write is similar to asynchronous write, only do not need to own threads to flush data to disk, but by the operating system to schedule. This way has advantages and disadvantages: the advantage is to write fast, reduce disk wear; The disadvantage is that, like apply SharePreference, there is no atomicity, no atomicity, consistency is not guaranteed. For example, if a system-level error (such as system crash) or device exception (such as power failure or disk damage) occurs after data is written to the memory but before data is flushed to the disk, data will be lost. If, after writing to memory, some other code reads the memory before flushing to disk, the data may be inconsistent.

However, in general, the probability of system level errors and device exceptions is low, so it is relatively reliable.

Scheme two, full volume & synchronization

For some core data, we want to store it in a more reliable way. How do you define reliable? First of all, atomicity is required, so it can only be written synchronously; Then there is availability and integrity: program exceptions, system exceptions, or hardware failures can lead to data loss or errors; Mechanisms need to be added to ensure that data remains intact when exceptions and failures occur.

View the SharedPreference source code, its fault tolerance strategy is to rename the main file as the name of the backup file before writing, and delete the backup file when writing successfully. When opening the file, if the backup file is found, rename the backup file as the name of the main file. Thus, if a failure occurs while writing data, the data can be recovered from the backup file when the APP is restarted again. Such a fault-tolerant strategy is generally a good solution to ensure data availability in most data situations. We did not adopt this plan, mainly considering the operation of the scheme is relatively complex, and other concerns.

The strategy we adopt is: redundant backup + data verification.

Redundancy backup

The idea of redundancy to improve data availability is reflected in many places, such as RAID 1 disk arrays. Also, we can write two files from one memory, so that when one file fails, another file is available. For example, if the probability of one file failing is 1 in 100,000, the probability of both files failing is 1 in 10 billion. In short, redundant backup can greatly reduce the probability of data loss. You win some, you lose some, at the cost of double disk space and write time.

However, our positioning is “lightweight storage”. If only “core data” is stored, the amount of data will not be very large, so overall the benefits outweigh the costs. In terms of writing time, compared with SharedPreference, renaming and deleting a file is also a kind of IO, whose essence is to update the “metadata” of the file. Write disk in pages, a page is usually 4K.

Writing 1 byte to a file and 2497 bytes are equivalent during the disk write phase (both require 4K bytes). When the amount of data is small, write two files, compared with rename -> Write data -> Delete file operation, there is no difference.

Data validation

The method of data verification is usually to carry out some operation on the data and put the operation result after the data; When you read it, you do the same thing, and you compare it to what you did before. Common methods include parity check, CRC, MD5, SHA and so on. Parity check is mostly used in computer hardware error detection. At the software level, it usually computes hashes. Among the many Hash algorithms, we chose the 64-bit MurmurHash, which can be seen in my other article “Hash Functions”.

When considering block write or full write, block check or full check, grouping, more details, complex code, or choose the full way. That is, to collect all the key | value into the buffer, then calculate the hash in the data, along with all the written to disk.

Fish and bear’s paw

Different application scenarios have different requirements. LightKV provides both mMAP for fast write and synchronous write for more reliable write. They have the same API, but the storage mechanism is different.

public abstract class LightKV {
    final SparseArray<Object> mData = new SparseArray<>();
    / /...
}

public class AsyncKV extends LightKV {
    private FileChannel mChannel;
    private MappedByteBuffer mBuffer;
    / /...
}

public class SyncKV extends LightKV {
    private FileChannel mAChannel;
    private FileChannel mBChannel;
    private ByteBuffer mBuffer;
    / /...
}
Copy the code

AsyncKV does not have consistency, so there is no need to redundant backup, write one copy, for higher write efficiency and less disk write. SyncKV is a redundant backup, so you need to open two files, but you can use the same buffer file. The characteristics of the two are described in the previous “scheme 1” and “scheme 2”, and can be flexibly used according to specific needs.

2.3 Confused Operations

For SharePreferences stored in XML, open the file to see all key-values, even if the developer encodes the values, the keys are still visible. Isn’t the SharePreferences file in the App directory in the sandbox? Without root permission, the sandbox is indeed inaccessible to other applications (non-systems); But for the APP contrarians (the black industry?) For example, the SharePreferences file is just a piece of the puzzle, which may give you a glimpse of the key to your APP to help you crack it. Confusing content files, therefore, may add a little to the cost of cracking. For apps, there is no absolute security, just cracking the game between cost and revenue, here is not much to do.

LightKV because the use of streaming storage, and key is int type, so it is not easy to see its file content;

However, if value is a plaintext string, you can still see some of the content, as shown in the figure below:

LightKV provides an interface for confusing value(String and byte[] types) :

    public interface Encoder {
        byte[] encode(byte[] src);
        byte[] decode(byte[] des);
    }
Copy the code

Developers can implement encoding and decoding according to their own rules. There are many extensions you can make with this interface:

1, strict encryption;
2. Data compression;
3. Content confusion (in fact, both have the function of confusion)

After the confusion, open the file, are garbled.

It is worth mentioning that you can only confuse values of type String and byte[]. Since basic classes like long, double, etc., are written in binary form and opened in text form, they are already hard to read, so there is no need to confuse them.

Three, use method

As we saw earlier, SyncKV and AsyncKV both inherit from LightKV, and their in-memory storage format is the same. Both are SparseArray, so the get method is encapsulated in LightKV, and then each implements the PUT method. The method list is as follows:

Similar to SharePreferences, there are also contains, remove, clear, and commit methods, and even the usage is similar:

public class AppData {
    private static final SharedPreferences sp = 
        GlobalConfig.getAppContext().getSharedPreferences("app_data", Context.MODE_PRIVATE);
    
    private static final SharedPreferences.Editor editor = sp.edit();

    private static final String ACCOUNT = "account";
    private static final String TOKEN = "token";

    private static void putString(String key, String value) {
        editor.putString(key, value);
        editor.commit();
    }

    private static String getString(String key) {
        return sp.getString(key, ""); }}Copy the code

public class AppData {
    private static final SyncKV DATA =
            new LightKV.Builder(GlobalConfig.getAppContext(), "app_data")
                    .logger(AppLogger.getInstance())
                    .executor(AsyncTask.THREAD_POOL_EXECUTOR)
                    .encoder(new ConfuseEncoder())
                    .sync();

    public interface Keys {
        int SHOW_COUNT = 1 | DataType.INT;
        int ACCOUNT = 2 | DataType.STRING | DataType.ENCODE;
        int TOKEN = 3 | DataType.STRING | DataType.ENCODE;
    }

    public static SyncKV data(a) {
        return DATA;
    }

    public static String getString(int key) {
        return DATA.getString(key);
    }

    public static void putString(int key, String value) { DATA.putString(key, value); DATA.commit(); }}Copy the code

Of course, this is just one of many encapsulation methods, and different developers have different preferences for how to use it.

For LightKV, key is defined as follows: 1. It is best that a file corresponds to a unified key class, such as “Keys” above; 2. 2, the key assignment, can be defined according to their types from 1 to 65534, and then the corresponding DataType do “|” computing (need to infer types when parsing).

LightKV has more initialization options than SharePreferences, so the constructor pattern is used to build objects. The following describes each parameter and corresponding features one by one.

3.1 Content confusion

Confused if you need to value, just when constructing LightKV incoming Encoder, then statement key and DataType ENCODE do “|” operation. When saving and reading, LightKV will “&” key and datatype.encode, if not 0, then Encoder is called to ENCODE (save) or decode (read).

3.2 Asynchronous Loading

SharePreferences are loaded in the newly created thread, blocking reads and writes until the load is complete: LightKV also implements asynchronous loading and can specify executors, or choose not to load asynchronously. It should be noted that although asynchronous loading is provided, sometimes it does not have the effect of asynchronous loading. For example, calling get or PUT immediately while the object is being initialized blocks the current thread until the load is complete, which is no different from synchronous loading.

Call data() when the process is initialized to trigger data loading:

    fun inti(context: Context) {
        // Only initialize the object. Do not get or put
        AppData.data(a)// Other initialization work
    }
Copy the code

3.3 Error Logs

    public interface Logger {
        void e(String tag, Throwable e);
    }
Copy the code

Most components are not guaranteed to run without exceptions, and when exceptions occur, developers usually print the exception information to a log file (and sometimes upload it to the cloud). Therefore, LightKV provides an interface for printing logs, passing in the implementation class.

3.4 Selecting a Mode

At the end of the Builder, call sync() and async() to distinguish between AsyncKV and SyncKV. Also explained before respective characteristic, flexible selection can. If you don’t want to save some very important data (such as account information), use AsyncKV.

3.5 Accessing Data

After writing the initialization parameters, defining the key, and writing the get and set methods, you can access the data:

String account = AppData.getString(AppData.Keys.ACCOUNT)
if(TextUtils.isEmpty(account)){
      AppData.putString(AppData.Keys.ACCOUNT, "[email protected]")}Copy the code

3.6 Usage under Kotlin

With the help of Kotlin’s delegate property, the author extended the LightKV API to provide more convenient usage.

object AppData : KVData() {
    override val data: LightKV by lazy {
        LightKV.Builder(GlobalConfig.appContext, "app_data")
                .logger(AppLogger)
                .executor(AsyncTask.THREAD_POOL_EXECUTOR)
                .encoder(GzipEncoder)
                .async()
    }

    var showCount by int(1)
    var account by string(2)
    var token by string(3)
    var secret by array(4 or DataType.ENCODE)
}
Copy the code

val account = AppData.account
if (TextUtils.isEmpty(account)) {
   AppData.account = "[email protected]"
}
Copy the code

Compared to the Java version of the API, the key declaration is simpler, and the value corresponding to the key can be accessed like a variable.

Fourth, review

In a hurry, the test case may not be very scientific, for reference only – –

The test cases, 7 types of support each configuration of five key, a total of 35 key | value. Test machine: Mi Note 1, 16G storage

4.1 Storage Space

storage	File size (KB)
AsyncKV	4
SyncKV	1.7
SharePreferences	3.3

AsyncKV needs to map a disk space to memory because it uses mmap to open. In order to reduce fragmentation, it maps one page at a time (4K). SyncKV has a smaller file size than SharePreferences because of its compact storage format; However, since SyncKV uses double backup, the total size is about the same as SharePreferences.

When the amount of data is less than 4K, it’s almost the same; AsyncKV uses less memory when there is more content to store, because it stores the same format as SyncKV, but only one copy.

4.2 Write Performance

Ideal is written to each key | value all write memory, and then a unified call commit, so writing is the fastest. However in practical use, the key | value writing is usually random, so the following test results, are immediately after each time the put. AsyncKV is an exception because it is designed to reduce IO and let the kernel commit updates itself.

Test Machine 1: Mi Note 1 (2018 test)

storage	Write time (milliseconds)
AsyncKV	2.25
SyncKV	75.34
SharePreferences-apply	6.90
SharePreferences-commit	279.14

Test Machine 2: Huawei P30 Pro (2020 test)

storage	Write time (milliseconds)
AsyncKV	0.31
SyncKV	8.31
SharePreferences-apply	1.9
SharePreferences-commit	30.81

(The new machine writes much faster.)

AsyncKV and SharePreferences-apply are used to return the data to memory immediately, so it takes less time. SyncKV and SharePreferences- Commit are both committed to memory and disk in the current thread. LightKV is faster than SharePreferences for both synchronous and asynchronous writes: For synchronous writes, SharePreferences-commit takes 3 to 4 times longer than SyncKV; AsyncKV is also much faster than SharePreferences-apply for asynchronous writes.

As for the loading performance, I compared huawei P30pro and Mi Note machine, found that AsyncKV loading time is relatively slow in Mi Note, and in Huawei P30pro is relatively fast, so do not post the data. There is a Github link at the end of this article, and you can run benchmark yourself.

SharePreferences reads from HashMap and LightKV reads from SparseArray. The advantages and disadvantages of the two data structures have been much discussed on the Internet, so I won’t compare them this time.

Five, the summary

SharePreferences is a lightweight and convenient key-value storage component for the Android platform, but there is plenty of room for improvement. LightKV uses SharePreferences as a reference to provide better storage in terms of efficiency, security and ease of use.

Six, download

dependencies {
    implementation 'com. Horizon. Lightkv: lightkv: 1.0.7'
}
Copy the code

Project address: github.com/BillyWei001…

Refer to the article: www.cnblogs.com/mingfeng002… Cloud.tencent.com/developer/a… Segmentfault.com/r/125000000… www.jianshu.com/p/ad9756fe2… www.jianshu.com/p/07664dc4c…