Preface

Full-text search over local data plays an important role in client-side requirements for IM scenarios. Full-text search is a technique for locating a word within a large collection of documents. In the past, relational databases could only implement it with LIKE, which has several disadvantages:

  1. It cannot use database indexes and must scan the full table, so performance is poor
  2. The search quality is poor: it only supports fuzzy matching at the beginning or end of the text and cannot satisfy complex search requirements
  3. It cannot compute how relevant a document is to the search criteria

NetEase Yunxin IM has already implemented local full-text search based on SQLite and similar libraries on iOS, Android, and desktop, but this capability was missing on Web and Electron. On the Web side, due to the limitations of the browser environment, IndexedDB is the only usable local database, so it is out of the scope of this discussion for now. Electron also ships the Chromium kernel, but its access to Node.js gives us a much wider range of choices.

Let's take a look at how to implement full-text search.

Basic technical knowledge

To implement full-text search, the following two concepts are indispensable:

  • Inverted index
  • Word segmentation

These two techniques are both the core and the main difficulty of full-text search, and their implementation is relatively complex. Before discussing how the full-text index itself is implemented, let's look at how each of these two techniques works.

Inverted index

An inverted index differs from a forward index:

  • Forward index: indexed by the document's unique ID, with the document content as the body of each record
  • Inverted index: indexed by each word in the document content, with the IDs of the documents containing that word as the body of each record

Take the inverted index library search-index as a practical example. In NetEase Yunxin IM, every message object has an idClient as its unique ID. Suppose we input "今天天气真好" ("the weather is really nice today") and split the sentence into single characters (the concept of word segmentation is covered in detail below), so the input becomes "今", "天", "天", "气", "真", "好", and then write it to the library using search-index's PUT method. Finally, let's look at the structure of what is stored:

As shown in the figure, we can see the structure of the inverted index: each key is a single Chinese character produced by segmentation, and each value is an array of the idClients of the message objects containing that character. search-index also stores other content, such as weight, count, and sorted data, which are used for sorting, pagination, and per-field search, but this article won't go into those.
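To make the key/value structure concrete, here is a minimal TypeScript sketch of a character-level inverted index. It is illustrative only and is not search-index's actual storage format:

    // Each character maps to the set of idClients of messages containing it.
    const invertedIndex = new Map<string, Set<string>>();

    function putMessage(idClient: string, tokens: string[]): void {
      for (const token of tokens) {
        if (!invertedIndex.has(token)) invertedIndex.set(token, new Set());
        invertedIndex.get(token)!.add(idClient);
      }
    }

    putMessage('idClient-1', ['今', '天', '天', '气', '真', '好']);
    putMessage('idClient-2', ['天', '气']);
    // invertedIndex.get('天') -> Set { 'idClient-1', 'idClient-2' }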

Word segmentation

Word segmentation splits the original content of a message into individual words according to its semantics. Considering both the quality of Chinese word segmentation and the need to run under Node, we chose nodejieba as the underlying segmentation library. The following is a flowchart of jieba's segmentation process:

Taking "去北京大学玩" ("go to Peking University for fun") as an example, let's analyze the most important modules. For orientation, a minimal nodejieba call is sketched below:
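cut is nodejieba's standard segmentation API, and the default dictionary ships with the package:

    import nodejieba from 'nodejieba';

    // Segment the example sentence with the default dictionary.
    console.log(nodejieba.cut('去北京大学玩'));
    // Expected output: [ '去', '北京大学', '玩' ]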

Loading the dictionary

When initializing segmentation, jieba first loads its dictionary, whose general content is as follows:
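Each line of jieba's dict.txt contains a word, its word frequency, and a part-of-speech tag. The entries relevant to our example look roughly like this (the frequencies are quoted from common copies of jieba's shipped dictionary and should be treated as illustrative):

    去 123402 v
    北京 34488 ns
    大学 20025 n
    北京大学 2053 nt
    玩 4207 v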

Building a Prefix Dictionary

The prefix dictionary is then built from this dictionary, with the following structure:

Here, "北京大" is recorded as a prefix of "北京大学" with a word frequency of 0; keeping such prefixes makes it easier to build the DAG later.
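A minimal sketch of how such a prefix dictionary can be built, assuming the dictionary is given as [word, frequency] pairs (frequencies illustrative):

    type PrefixDict = Map<string, number>;

    function buildPrefixDict(entries: Array<[string, number]>): PrefixDict {
      const pd: PrefixDict = new Map();
      for (const [word, freq] of entries) {
        pd.set(word, (pd.get(word) ?? 0) + freq);
        // Register every proper prefix with frequency 0 (unless it is already
        // a word), so the DAG builder can stop scanning at unknown prefixes.
        for (let i = 1; i < word.length; i++) {
          const prefix = word.slice(0, i);
          if (!pd.has(prefix)) pd.set(prefix, 0);
        }
      }
      return pd;
    }

    const prefixDict = buildPrefixDict([
      ['去', 123402], ['北', 17860], ['北京', 34488], ['京', 6890],
      ['大', 144099], ['大学', 20025], ['北京大学', 2053],
      ['学', 17482], ['玩', 4207],
    ]);
    // prefixDict.get('北京大') -> 0, prefixDict.get('北京大学') -> 2053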

Building the DAG

DAG is an abbreviation for Directed Acyclic Graph.

The input is sliced based on the prefix dictionary. Here, "去" starts no multi-character dictionary word, so there is only one way to cut it; for "北", there are three segmentations: "北", "北京", and "北京大学"; for "京", there is only one; for "大", there are two: "大" and "大学"; for "学" and "玩", there is again only one each. In this way we obtain, for every character, all the words that begin with it, and the DAG looks as follows:
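In code, the DAG construction can be sketched like this, continuing the prefix-dictionary sketch above; dag.get(i) lists every end index j such that sentence.slice(i, j + 1) is a dictionary word:

    function buildDAG(sentence: string, pd: PrefixDict): Map<number, number[]> {
      const dag = new Map<number, number[]>();
      for (let i = 0; i < sentence.length; i++) {
        const ends: number[] = [];
        let j = i;
        let frag = sentence[i];
        while (j < sentence.length && pd.has(frag)) {
          if ((pd.get(frag) ?? 0) > 0) ends.push(j); // a real word, not just a prefix
          j += 1;
          frag = sentence.slice(i, j + 1);
        }
        if (ends.length === 0) ends.push(i); // fall back to the single character
        dag.set(i, ends);
      }
      return dag;
    }

    // For '去北京大学玩' this yields:
    // 0 -> [0], 1 -> [1, 2, 4], 2 -> [2], 3 -> [3, 4], 4 -> [4], 5 -> [5]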

Maximum probability path calculation

All paths of the above DAG diagram are as follows:

  1. 去/北/京/大/学/玩
  2. 去/北京/大/学/玩
  3. 去/北京/大学/玩
  4. 去/北京大学/玩

Each node of the DAG has a weight; for a word in the prefix dictionary, the weight is its word frequency. So our problem becomes: find the path through the DAG that maximizes the total weight of the whole sentence.

This is a typical dynamic programming problem. First, let's confirm the two conditions for dynamic programming:

  • Overlapping subproblems: for a node i and its possible successor nodes j and k:
The weight of any path reaching j through i equals the weight of the path to i plus the weight of j, i.e. R(i -> j) = R(i) + W(j), and likewise R(i -> k) = R(i) + W(k).

That is, for j and k sharing the predecessor node i, the weight of the path to i would otherwise be computed repeatedly.

  • Optimal substructure: let the optimal path of the whole sentence be Rmax, the end node be x, and its possible predecessor nodes be i, j, k. Then:
Rmax = max(Rmax_i, Rmax_j, Rmax_k) + W(x)

The problem then reduces to solving Rmax_i, Rmax_j, and Rmax_k: the optimal solution of each substructure is part of the global optimal solution.

Following this recurrence, the optimal path is finally computed as "去/北京大学/玩".
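A sketch of this dynamic program, mirroring the recurrence above and continuing the earlier sketches. Log frequencies are used so that products of probabilities become sums; run against jieba's full dictionary, this reproduces 去/北京大学/玩:

    function calcRoute(sentence: string, dag: Map<number, number[]>, pd: PrefixDict): string[] {
      const n = sentence.length;
      const logTotal = Math.log([...pd.values()].reduce((a, b) => a + b, 0));
      // route[i] = [best log-probability of sentence[i..], end index of the word chosen at i]
      const route: Array<[number, number]> = new Array(n + 1);
      route[n] = [0, 0];
      for (let i = n - 1; i >= 0; i--) {
        let best: [number, number] = [-Infinity, i];
        for (const j of dag.get(i) ?? [i]) {
          const freq = pd.get(sentence.slice(i, j + 1)) || 1;
          const score = Math.log(freq) - logTotal + route[j + 1][0];
          if (score > best[0]) best = [score, j];
        }
        route[i] = best;
      }
      const words: string[] = [];
      for (let i = 0; i < n; ) {
        const j = route[i][1];
        words.push(sentence.slice(i, j + 1));
        i = j + 1;
      }
      return words;
    }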

HMM: Hidden Markov Model

For out-of-vocabulary words (words not in the dictionary), jieba uses an HMM (Hidden Markov Model) to segment. It treats segmentation as a sequence labeling problem, where the sentence is the observation sequence and the segmentation result is the state sequence. According to the author in a jieba GitHub issue, the HMM parameters were derived from segmentation corpora available online, including the 1998 People's Daily corpus, an MSR corpus, and TXT novels collected by the author and segmented with ICTCLAS, with word frequencies finally counted using Python scripts.

The model consists of a five-tuple and relies on two basic assumptions.

The five-tuple:

  1. Set of state values
  2. Set of observations
  3. Initial state probability
  4. State transition probability
  5. State emission probability

Basic Assumptions:

  1. Homogeneity assumption: the state of the hidden Markov chain at any time t depends only on the state at time t-1 and is independent of the states and observations at all other times, as well as of t itself.
  2. Observation independence assumption: the observation at any time depends only on the state of the Markov chain at that time and is independent of all other observations and states.

The set of state values is {B: begin, E: end, M: middle, S: single}, representing the position of each character within its word: B is the beginning of a word, E the end, M the middle, and S a single-character word.

The set of observations is the set of characters in the sentence we input.

The initial state probability gives the probability that the first character of the sentence is in each of the four states B, M, E, and S; the probabilities of E and M are 0, because the first character can only be B or S, which matches reality.
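For reference, jieba's initial state probabilities take the following shape (log probabilities; a huge negative number stands in for log 0):

    P = {
      B: -0.26268660809250016,
      E: -3.14e+100,
      M: -3.14e+100,
      S: -1.4652633398537678,
    }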

The state transition probability gives the probability of moving from one state to the next, satisfying the homogeneity assumption. The structure can be represented by a nested object:

    P = {
      B: { E: -0.510825623765990, M: -0.916290731874155 },
      E: { B: -0.5897149736854513, S: -0.8085250474669937 },
      M: { E: -0.33344856811948514, M: -1.2603623820268226 },
      S: { B: -0.7211965654669841, S: -0.6658631448798212 },
    }

P['B']['E'] denotes the probability of transitioning from state B to state E (the structure stores the logarithm of the probability, which is convenient for calculation), and it equals 0.6. Similarly, P['B']['M'] shows that the probability that the next state is M is 0.4, meaning that when a character begins a word, the next character is more likely to end the word than to sit in its middle. This is intuitive, because two-character words are more common than longer words.

The state emission probability gives the probability of observing a particular character while in a given state, satisfying the observation independence assumption. The structure is the same as above and can also be represented by a nested object:

    P = {
      B: { '突': -2.70366861046, … },
      M: { … },
      S: { … },
      E: { … },
    }

P['B']['突'] means that the logarithm of the probability of observing the character '突' ("sudden") while in state B equals -2.70366861046.

Finally, the Viterbi algorithm takes the observation sequence as input, uses the initial state probabilities, the transition probabilities, and the emission probabilities as parameters, and outputs the state sequence, i.e. the maximum-probability segmentation result. This article won't cover the Viterbi algorithm in detail; interested readers can look it up.
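For completeness, here is a minimal Viterbi sketch for BEMS labeling using the three probability tables above. PREV encodes which states may precede each state, as in jieba, and the table values are assumed to be log probabilities:

    type State = 'B' | 'M' | 'E' | 'S';
    const STATES: State[] = ['B', 'M', 'E', 'S'];
    const PREV: Record<State, State[]> = {
      B: ['E', 'S'], M: ['M', 'B'], E: ['B', 'M'], S: ['S', 'E'],
    };
    const MIN = -3.14e100; // stands in for log 0

    function viterbi(
      obs: string,
      start: Record<State, number>,
      trans: Record<State, Partial<Record<State, number>>>,
      emit: Record<State, Record<string, number>>,
    ): State[] {
      let V = {} as Record<State, number>;
      let paths = {} as Record<State, State[]>;
      for (const s of STATES) {
        V[s] = start[s] + (emit[s][obs[0]] ?? MIN);
        paths[s] = [s];
      }
      for (let t = 1; t < obs.length; t++) {
        const nextV = {} as Record<State, number>;
        const nextPaths = {} as Record<State, State[]>;
        for (const s of STATES) {
          let bestScore = -Infinity;
          let bestPrev: State = PREV[s][0];
          for (const p of PREV[s]) {
            const score = V[p] + (trans[p][s] ?? MIN) + (emit[s][obs[t]] ?? MIN);
            if (score > bestScore) { bestScore = score; bestPrev = p; }
          }
          nextV[s] = bestScore;
          nextPaths[s] = [...paths[bestPrev], s];
        }
        V = nextV;
        paths = nextPaths;
      }
      // The last character can only end a word (E) or stand alone (S).
      return paths[V.E >= V.S ? 'E' : 'S'];
    }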

These two techniques form the technical core of our architecture. Building on them, we improved the Electron technical architecture of NetEase Yunxin IM.

The NetEase Yunxin IM Electron architecture

Detailed architecture diagram

Considering that full-text search is only one feature of IM, and in order not to affect other IM features while allowing faster iteration, we adopted the following architecture:

On the right is the previous architecture: the underlying store is IndexedDB, and above it sit two modules, one for writing and one for reading:

  • When a user actively sends a message, actively syncs messages, actively deletes a message, or receives a message, the message object is synced into IndexedDB;
  • When the user searches for a keyword, every message object in IndexedDB is traversed, and indexOf is used to check whether each message contains the keyword (similar to LIKE).

As a result, when there is a lot of data, queries are very slow.

On the left is the new scheme, which adds word segmentation and an inverted index database. It has no effect on the previous scheme; it simply adds a layer in front of it. Now:

  • When a user actively sends a message, syncs messages, deletes a message, or receives a message, the content of each message object is segmented and then synced into the inverted index database.
  • When the user searches for a keyword, the idClients of the matching messages are first found in the inverted index database, and then the corresponding message objects are looked up in IndexedDB by idClient and returned to the user (both paths are sketched below).
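A sketch of both paths. It assumes search-index v3's Promise-based PUT/QUERY API and a hypothetical loadFromIndexedDB placeholder for the existing IM read module; verify the API shape against the version you actually install:

    import si from 'search-index';
    import nodejieba from 'nodejieba';

    // Placeholder for the existing read module backed by IndexedDB.
    declare function loadFromIndexedDB(idClients: string[]): Promise<object[]>;

    async function initFullTextSearch() {
      const { PUT, QUERY } = await si({ name: 'nim-msg-index' });

      // Write path: segment the message text, then index it under its idClient.
      const indexMessage = (idClient: string, text: string) =>
        PUT([{ _id: idClient, tokens: nodejieba.cut(text) }]);

      // Read path: keyword -> idClients via the inverted index, then fetch the
      // full message objects from IndexedDB.
      const search = async (keyword: string) => {
        const { RESULT } = await QUERY({ AND: nodejieba.cut(keyword) });
        return loadFromIndexedDB(RESULT.map((r: { _id: string }) => r._id));
      };

      return { indexMessage, search };
    }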

Architecture advantages

The scheme has the following four advantages:

  • Fast: the inverted index maintained by search-index greatly improves search speed.
  • Cross-platform: search-index is based on LevelDB, which also supports the browser environment, so it becomes possible to implement full-text search on the Web side as well.
  • Independence: the inverted index library is separated from IndexedDB, the main IM business library. When IndexedDB writes data, it automatically notifies the write module of the inverted index library, which segments the message content, pushes it into a write queue, and finally inserts the entries into the inverted index library one by one. When a full-text search is needed, the read module of the inverted index library quickly finds the idClients of the message objects matching the keyword, and the message objects are then fetched from IndexedDB by idClient and returned.
  • Flexibility: full-text search is provided as a plug-in. It exposes a higher-order function that wraps IM and returns a new IM extended through inheritance. Thanks to JavaScript's prototype mechanism, any method that does not exist on the new IM is automatically looked up on the original IM through the prototype chain, so the plug-in can focus on implementing its own methods without caring about the specific IM version. The plug-in also supports custom word segmentation functions to satisfy different users' segmentation needs (see the sketch after this list).
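A sketch of the plug-in shape described in the flexibility point; all names here are illustrative, not the actual SDK API:

    type Segment = (text: string) => string[];

    // Higher-order function: wraps the IM class and returns an extended
    // subclass. Any method not defined on the subclass is resolved on the
    // original IM through the prototype chain, so the plug-in does not need
    // to track specific IM versions.
    function withFullTextSearch<T extends new (...args: any[]) => object>(
      IM: T,
      segment: Segment, // callers may pass a custom word segmentation function
    ) {
      return class extends IM {
        async searchMessages(keyword: string): Promise<object[]> {
          const tokens = segment(keyword);
          // ...look up idClients for `tokens` in the inverted index,
          // then read the message objects from IndexedDB.
          return [];
        }
      };
    }

    // Usage: const SearchableIM = withFullTextSearch(IM, (t) => nodejieba.cut(t));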

Results

With the architecture above, our tests show that at a volume of 200,000 messages, search time dropped from over ten seconds to under one second, making search about 20 times faster.

Conclusion

Above, we implemented full-text search over chat messages for the NetEase Yunxin IM SDK on Electron, based on nodejieba and search-index, which greatly sped up chat-record search. Of course, we plan further optimizations in the following areas:

  • Write performance: in practice we found that with large data volumes, LevelDB, the underlying database that search-index depends on, hits a write performance bottleneck and consumes considerable CPU and memory. Our investigation suggests that SQLite's write performance is much better: from our observations, write time grows only with the amount of data, while CPU and memory stay relatively stable. We may therefore consider compiling SQLite into a Node native module to replace search-index.
  • Extensibility: the decoupling of business logic is not yet complete, and some business fields are still stored in the inverted index library. Later, we may restrict the inverted index library to mapping keywords to message idClients only, move searches that carry business attributes into IndexedDB, and thus completely decouple the inverted index library from the main business library.

That's all for this article. Follow us for more technical content; we hope this share has been helpful to you.

About the author

Li Ning is a senior front-end development engineer at NetEase Yunxin, responsible for application development, componentization, and solution development for NetEase Yunxin's audio/video and IM SDKs, with rich practical experience in React componentization design, PaaS, and multi-platform development and compilation. Feel free to leave a message if you have any questions.

For more technical content, follow the "NetEase Smart Enterprise Technology+" WeChat official account.