Wechat full text search optimization road

This article first appeared in the September 2017 issue of Programmer magazine.

preface

Full-text Search (FTS) based on local data plays an important role in mobile applications. Different from the server-based search service, mobile terminal is limited by hardware conditions, especially in the case of relatively large data volume, search performance problems are very prominent. This paper takes SQLite FTS Extension, which is widely used on mobile platforms, as an example to introduce the basic principles of MOBILE platform FTS. Combined with the practice of wechat Android client, it focuses on some performance optimization experience of wechat on FTS.

SQLite FTS Extension

SQLite FTS Extension is a plug-in developed by SQLite for full-text search. It is embedded in the standard SQLite distribution version and has the following characteristics:

Fast search: Use inverted indexes to speed up the search process

Good stability: CURRENTLY SQLite has good stability in mobile terminals, and FTS Extension is built on SQLite

Simple access: Android and IOS platforms already support SQLite, and FTS Extension works just like SQLite tables.

Good compatibility: Benefiting from SQLite itself, SQLite FTS Extension also has good compatibility.

There are currently five versions of SQLiteFTSExtension released, but I’ll briefly mention the three major versions.

FTS3: Base version with full FTS features, support for custom word dividers, library functions including Offsets, Snippets.

FTS4: On the basis of FTS3, performance has been greatly optimized, adding correlation function calculation MatchInfo.

FTS5: It has a big change from FTS4, and has a big improvement in the storage format. The most obvious is the segmented storage of instance-list, which can support the storage of larger instance-list. And open ExtensionApi to support custom helper functions. FTS5 was released in mid-2015.

Storage architecture

Wechat full-text search was launched in late 2014 and initially focused on business searches for contacts and chat records. At the beginning of the scheme design, in order to make this function have a good experience, and considering the increasing number of access services in the future, our design objectives are as follows:

1. Fast search speed

Wechat full-text search uses SQLite FTS4 Extension to improve search speed through inverted index.

2. Business independence

Wechat’s core business is contact and message, while wechat full-text search needs to deal with a large amount of data when establishing index, updating index or deleting index. In order to ensure that full-text search does not affect wechat’s core business, the following storage architecture is adopted:

Independent DB and read-write separation: wechat full-text search is independent of the main business in the overall architecture, and the search DB is also independent of the main business DB; When the primary service data is updated, the primary service notifes and searches the corresponding service data processing module through EventBus. The service data processing module accesses the primary service database through an independent ReadOnly database connection, and does not share the database connection with the primary service storage layer.

Reduce database operations: In the search module, there are modules that deal with business data and do special work on complex data structures. For example, for a group chat of 500 members, if 500 group members are inserted into the search DB at different times, it will cause too many database operations. Therefore, wechat will concatenate all group members into a single string and insert it into the search DB.

Delayed update of hot data: A delayed update policy is adopted for hot data that is frequently updated. All index data is classified into normal data and dirty data. When data is updated, the corresponding data is marked as dirty data first, and then a timer is set to update the data to the index every 10 minutes.

3. High scalability

High scalability requires search table structure and business decoupling. The examples on the official website of SQLite FTS are all in the form of a single index table, with each column corresponding to a certain attribute of the business. When the corresponding business changes, the structure of the index table needs to be modified. In order to solve the table structure modification problem caused by business changes, wechat digitizes the business attributes and designs the following table structure:

IndexTable is responsible for the index establishment of full-text search. It is independent of logic. When searching keywords, only the corresponding DocId is needed. MetaTable filters the service logic by Type and SubType, and outputs the BusItemId.

Search optimization

Wechat full-text search was launched in version 5.4 on January 26, 2014. Until version 6.5.7 after the Spring Festival in 2017, the total number of users increased from 400 million to 900 million, and the number of heavy users also increased significantly. The data volume of wechat local search also increased significantly, resulting in a continuous decrease in search speed and an increase in user complaints. According to our statistics, from wechat version 5.4 to wechat version 6.5.7, the average search time of each task in wechat full-text search increased by more than 10 times, bringing great challenges to wechat full-text search.

To optimize the search duration, take a look at the flow chart of the search:

According to the time consuming of each stage, it can be found that in the data fetching stage, the time proportion reaches more than 80%, and the larger the amount of data in the search result set, the higher the time proportion, up to 95%. The data acquisition stage is a cyclic process, so the optimization of a cycle needs to be carried out from two aspects: reducing the single cycle time and reducing the total cycle times.

Reduce the execution time of a single loop

Deep into SQLite FTS4 Extension source code, it is found that the FTS4 library function Offsets takes up more than 70% of the execution time of a single loop, and the larger the amount of data, the longer the execution time.

FTS4 library function Offsets: used to convert word offset to byte offset, wechat uses bytes to do result sorting and result highlighting.

Function input:

Query: the keyword searched by the user
Hit Doc: The document hit by the keyword. A document is the basic unit of a full-text search. It can be a web page, an article, or a chat
Target term offset: In the search phase, the target term offset can be obtained by keyword search index

Function output:

Target byte offset: Represents the byte offset of the keyword in the hit Doc.

Such as:

Query = I hit Doc = I went shopping with my brother target word offset =0, 2

The following table can be obtained by dividing the hit Doc through the word splitter:

The final calculation can get the target byte offset =0, 6

Here is the relationship between the number of hit Doc bytes and the time taken by Offsets:

The Offsets function includes word segmentation, so the first step is to optimize the word segmentation.

To optimize word segmentation, word segmentation rules are the most important. The word segmentation rules of wechat are English and number segmentation, and non-English and number segmentation alone. For example, for the nickname “Hello520 China”, the participle results are “Hello”, “520”, “zhong”, “guo”. The reason for this word segmentation rule is that the sorting requirements for full-text search results in wechat are mainly based on other attributes rather than the relevance of documents. That is, the full-text search only needs to find documents with keywords, and does not care about the existence of several keywords in the document. Moreover, in most cases, the user’s input Query cannot form words, and there are dialects. Therefore, it is in line with the requirements to separate the whole word to build an index.

Wechat full-text search was first developed at the end of 2013, and FTS4 is the highest version of SQLite FTS Extension. However, FTS4’s own word divider cannot support Chinese well, so only ICU word divider can be used. At that time, ICU word divider is easy to access and supports Chinese well, so ICU word divider is used.

For the nickname “Hello520 China” output word segmentation, the first is UTF8 encoding, the word segmentation will do a conversion to Unicode encoding, then look up the dictionary, the final word segmentation results for post-processing. It can be found from the input and output that the two steps of transforming the code and searching the dictionary are actually redundant, so wechat abandoned the ICU word divider and customized the Simple word divider.

Simple word segmentation directly processes UTF8 encoded Doc content. Through a single char, determine the Unicode encoding range and Unicode encoding length of the current character, and make different processing according to different situations.

After the optimization of the word splitter, the time of Offsets function is reduced to 21ms after processing 100,000 bytes, but such optimization is not enough. When processing more than 10 10W result Doc, it will still exceed 200ms, so we have the next step of optimization.

Due to the limitation of the screen on the mobile terminal, only a small number of hit keywords will be highlighted when the search results are displayed at the end. However, all the target words in the Doc of the Offsets function in accounting fortune-telling are offset, so the Offsets function needs to be modified.

At the beginning, I tried to directly modify the source code of Offsets function. I found that FTS4’s API encapsulation was difficult to use, and Offsets function depended on a lot. The modified code was difficult to maintain and not readable, so I needed to find a new way to optimize it. After some research, I found that FTS5 supports custom auxiliary functions and has a good API encapsulation, so I finally used FTS5 custom auxiliary functions (MMHighLight) to re-implement the function of Offsets and add optimization logic.

Input: Query = I hit Doc = I went shopping with my brother Target word offset =0, 2 target return number =1

Step by step callback, when the word divider returns “I” for the first time, the first 0 of the offset of the target word has been met, and the target return number has reached 1, the function directly returns the offset of the target byte =0.

Reduce the total number of cycles

To reduce the total number of cycles in the fetch stage, it is easy to think of doing paging return of data at the SQL layer. Paging return means sorting at the DB layer, where sorting is determined by sorting factors. However, wechat full-text search is faced with many and complex business sorting factors, so it cannot directly use the ORDER BY in SQL, so it needs to transform through an intermediate function to reflect all sorting factors BY a comparable number, and finally use the ORDER BY sorting.

Here briefly, the more complex sorting factors are as follows:

Time segment sort: the time range is within half a year, the sorting factor depends on the next level of sorting factor, the time range is outside half a year, depending on the distance of time.

Function result sorting: The sorting factor is the result of a function calculation, not a direct database Column, and the result of a function calculation cannot be directly ordered BY, such as a string number.

Through the above analysis, the core point of reducing the total number of loops is to move the sorting from the Java layer to the SQL layer. The advantages are as follows:

Reduce I/O
Reduce data copying from the C layer to the Java layer

Therefore, the key implementation point here lies in the realization of the intermediate transformation function. The intermediate transformation function MMRank of wechat is realized by the auxiliary function of FTS5.

MMRank is implemented by converting all sorting factors into a 64-bit Long value, with the highest-priority sorting factors placed high and the lower-priority sorting factors placed low. The final SQL is as follows:

Special optimization – Chat search optimization

Wechat full text search has a more special search task, is chat records.

As shown in the figure:

The number in the red circle indicates the number of chats in this session that contain the keyword “I”, and the order of the session is the active time of the session.

The search of wechat chat records has the following two characteristics:

Have statistical properties
The number is very large (single keyword hit up to 200,000)

As can be seen from the search flow chart, wechat initially adopted the scheme of counting and sorting in the Java layer, which is not desirable in the case of big data. Given the previous analysis that the number of cycles reduced can be returned through paging, the core of which is to move sorting from the Java layer to the SQL layer, optimization 1 is presented.

Optimization plan 1: Group By

Implement SQL as follows:

In this scheme, the number of hit chat records is directly counted By Group By in the SQL layer and sorted By the latest time, but there are obvious defects:

Unable to use index acceleration: When GroupBy and Order Derby are used together, The order Derby field must contain GroupBy for the index to be hit, because using GroupBy generates intermediate subtables.
Full calculation: GroupBy calculates the number of matched chat records at the SQL layer. All sessions are counted. In the figure above, only three sessions need to be counted, which wastes a lot of resources.

Optimization scheme 2: step by step calculation

In view of the problem of full calculation in scheme 1, the method of step by step calculation is adopted.

Step 1: Find the last three active sessions

Conv1,conv2,conv3, then execute the following SQL to get the number of hits for each of the three sessions

However, there are problems with this approach, requiring multiple SQL executions.

Optimization plan 3: MessageCount

Since scheme 2 requires multiple SQL, you can customize the aggregation function to achieve one-time statistics. Perform the following steps:

Step 1: Find the last three active sessions

Get sessions conv1,conv2,conv3, and then execute the following SQL

The number of hits of three sessions can be obtained at one time.

The last

After optimization, the average time of each task for all users of wechat full-text search is less than 50ms, while the average time of each task for heavy users is less than 200ms, and the average time optimization range is more than 5 times.

There are a lot of further optimizations to be made. For example, when computing highlights, you can save some time by adding byte offset directly to DocList’s data structure.

Finally, I hope I can share some value to everyone, welcome to leave a message exchange.