Preface

Full-text search over local data plays an important role in the client-side requirements of IM scenarios. Full-text search is the technique of finding where a word occurs in a large collection of documents. In the past, relational databases could only implement it with LIKE, which has several disadvantages:

  1. It cannot use database indexes and must scan the entire table, so performance is poor
  2. Search quality is poor: it can only do fuzzy matching at the beginning or end of a string and cannot satisfy complex search requirements
  3. It cannot report a document's relevance to the search criteria

NetEase Yunxin IM already implements local full-text search based on SQLite and similar libraries on iOS, Android, and desktop, but lacks this capability on Web and Electron. On the Web side, due to browser constraints, the only local storage database available is IndexedDB, which is beyond the scope of this discussion. Electron also has the Chromium kernel built in, but the ability to use Node.js gives you a much wider range of choices.

Let's take a concrete look at how full-text search can be implemented.

Basic technical knowledge

To implement full-text search, two concepts are indispensable:

  • Inverted index
  • Word segmentation

These two techniques are both the core and the difficulty of implementing full-text search, and their implementation is relatively complex. Before discussing how the full-text index itself is built, let's first look at how these two techniques work.

Inverted index

The concept of an inverted index stands in contrast to that of a forward index:

  • Forward index: a structure that uses the document's unique ID as the key and the document's content as the record
  • Inverted index: a structure that uses each word in the document content as the key and the IDs of the documents containing that word as the record

For a practical example, take the inverted index library search-index. In NetEase Yunxin's IM, each message object has idClient as its unique ID. Suppose we input "今天天气真好" ("the weather is really nice today") and segment it into the single characters "今", "天", "天", "气", "真", and "好", then write the result into the library through search-index's PUT method.

As shown in the figure, you can see the structure of the inverted index: each key is a single segmented Chinese character, and its value is an array of the idClients of the message objects containing that character. Besides the above, there are other fields such as weight, count, and per-column data, which support sorting, pagination, per-field search, and other features; this article will not cover them in detail.
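
To make the write path concrete, here is a minimal sketch, assuming search-index v3's Promise-based API (PUT); the store name and message fields are illustrative, not Yunxin's actual code:

```js
const si = require('search-index');

async function main() {
  // Open (or create) the levelDB-backed inverted index once.
  const { PUT } = await si({ name: 'message-index' });
  const msg = { idClient: 'msg-001', text: '今天天气真好' };
  // Mirror the example above: split the content into single characters
  // and store them under the message's unique idClient.
  await PUT([{ _id: msg.idClient, tokens: Array.from(msg.text) }]);
}

main().catch(console.error);
```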

Word segmentation

Word segmentation divides the original message content into multiple words or phrases according to its semantics. Considering both the quality of Chinese segmentation and the need to run on Node.js, we chose Nodejieba as the base segmentation library. The following is the flow chart of jieba segmentation:

Take "去北京大学玩" ("go to Peking University to play") as an example; we will analyze several of the most important modules:
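
Before stepping through the internals, a minimal sketch of what the end result looks like through Nodejieba's cut() API:

```js
const nodejieba = require('nodejieba');

// Default mode: prefix-dictionary matching plus HMM for unknown words.
console.log(nodejieba.cut('去北京大学玩'));
// Expected: [ '去', '北京大学', '玩' ] — the optimal path derived below.
```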

Loading the dictionary

When jieba is initialized, it first loads the dictionary, whose content looks like the following:
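
The original figure is not reproduced here. jieba's main dictionary (dict.txt) is a plain-text file with one entry per line: the word, its frequency, and a part-of-speech tag. An illustrative excerpt (the exact frequencies are indicative only):

```
北京大学 2053 nt
北京 34488 ns
大学 20025 n
去 123402 v
玩 4207 v
```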

Building a Prefix Dictionary

A prefix dictionary is then constructed from the dictionary, with the following structure:

Among them, "北京大", as a prefix of "北京大学" (Peking University), is recorded with a word frequency of 0, which facilitates the subsequent construction of the DAG.
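
A minimal sketch of how such a prefix dictionary can be built (modeled on the gen_pfdict step of jieba's Python implementation; the function name is ours):

```js
// Build { word -> frequency }, registering every proper prefix of each
// dictionary word with frequency 0 so DAG construction can keep scanning.
function buildPrefixDict(dictLines) {
  const freq = new Map();
  for (const line of dictLines) {
    const [word, count] = line.split(' ');
    freq.set(word, Number(count));
    for (let i = 1; i < word.length; i++) {
      const prefix = word.slice(0, i);
      if (!freq.has(prefix)) freq.set(prefix, 0); // e.g. '北京大' -> 0
    }
  }
  return freq;
}
```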

Building the DAG

DAG is short for Directed Acyclic Graph.

Based on the prefix dictionary, the input is scanned for all possible segments. "去" has no dictionary entries extending it, so there is only one way to cut it. Starting from "北" there are three segmentations: "北", "北京", and "北京大学". From "京" there is only one; from "大" there are two: "大" and "大学"; "学" and "玩" likewise each have only one. In this way we obtain every possible segment beginning at each character, and the resulting DAG is shown in the figure below:
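
A minimal DAG-construction sketch over the prefix dictionary (modeled on jieba's get_DAG; names are ours):

```js
// dag[i] lists every end index j such that sentence.slice(i, j + 1) is a
// real word (frequency > 0) in the prefix dictionary.
function buildDAG(sentence, freq) {
  const dag = {};
  for (let i = 0; i < sentence.length; i++) {
    const ends = [];
    let j = i;
    let frag = sentence[i];
    // Keep extending while the fragment is still a known word or prefix.
    while (j < sentence.length && freq.has(frag)) {
      if (freq.get(frag) > 0) ends.push(j); // freq-0 prefixes are not words
      j += 1;
      frag = sentence.slice(i, j + 1);
    }
    if (ends.length === 0) ends.push(i); // unknown char: cut it on its own
    dag[i] = ends;
  }
  return dag;
}

// buildDAG('去北京大学玩', freq)
// -> { 0: [0], 1: [1, 2, 4], 2: [2], 3: [3, 4], 4: [4], 5: [5] }
```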

Maximum probability path calculation

All paths through the above DAG are as follows:

  1. 去/北/京/大/学/玩
  2. 去/北京/大/学/玩
  3. 去/北京/大学/玩
  4. 去/北京大学/玩

Each node carries a weight (for words in the prefix dictionary, the weight is the word frequency), so our problem becomes finding the path that maximizes the total weight of the whole sentence.

This is a typical dynamic programming problem. First, we confirm the two conditions for dynamic programming:

  • Overlapping subproblems: for a node i and its possible successors j and k: the weight of any path that reaches j through i = the weight of the path to i + the weight of j, i.e. R(i -> j) = R(i) + W(j); likewise, the weight of any path that reaches k through i = R(i) + W(k), i.e. R(i -> k) = R(i) + W(k).

That is, for j and k sharing the predecessor node i, the weight of the path to i would otherwise be computed repeatedly.

  • Optimal substructure: suppose the optimal path for the whole sentence has weight Rmax, its end node is x, and the possible predecessor nodes of x are i, j, and k. Then:

    Rmax = max(Rmax_i, Rmax_j, Rmax_k) + W(x)

    where Rmax_i denotes the weight of the optimal path ending at i.

So the problem reduces to solving Rmax_i, Rmax_j, and Rmax_k: the optimal solution of each substructure is part of the global optimal solution.

As above, the optimal path is finally calculated to be "去/北京大学/玩" ("go / Peking University / play").
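
A minimal sketch of this dynamic program (modeled on jieba's calc routine, which works right to left over log frequencies so that multiplying probabilities becomes adding logs; names are ours):

```js
// route[i] = [best log-probability for sentence.slice(i), best end index j]:
// the optimal first cut starting at i is sentence.slice(i, j + 1).
function calcRoute(sentence, dag, freq, totalFreq) {
  const n = sentence.length;
  const logTotal = Math.log(totalFreq);
  const route = new Array(n + 1);
  route[n] = [0, 0]; // base case: empty suffix
  for (let i = n - 1; i >= 0; i--) {
    let best = [-Infinity, i];
    for (const j of dag[i]) {
      const word = sentence.slice(i, j + 1);
      // log P(word) = log(freq) - log(total); unseen words fall back to 1.
      const score = Math.log(freq.get(word) || 1) - logTotal + route[j + 1][0];
      if (score > best[0]) best = [score, j];
    }
    route[i] = best;
  }
  return route;
}

// Read the path: i = 0; while (i < n) { j = route[i][1]; emit slice(i, j + 1); i = j + 1; }
```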

HMM: Hidden Markov Model

For out-of-vocabulary words (words not in the dictionary), jieba uses the HMM (Hidden Markov Model). It treats segmentation as a sequence labeling problem, with the sentence as the observation sequence and the segmentation result as the state sequence. The author of jieba mentioned in an issue that the model's parameters were derived from the segmented 1998 People's Daily corpus available online, an MSR corpus, and some TXT novels he collected himself, all segmented with ICTCLAS, with the frequencies finally counted by a Python script.

The model consists of a quintuple and relies on two basic assumptions.

The quintuple:

  1. Set of state values
  2. Set of observed values
  3. Initial state probability
  4. State transition probability
  5. State emission probability

Basic assumptions:

  1. Homogeneity assumption: the state of the hidden Markov chain at any time t depends only on the state at the previous time t-1, and is independent of the states and observations at any other time, and of t itself.
  2. Observation independence assumption: the observation at any time depends only on the state of the Markov chain at that time, and is independent of other observations and states.

The set of state values is {B: begin, E: end, M: middle, S: single}, which marks the position of each character within its word: B is the beginning of a word, E the end, M the middle, and S a single-character word. For example, in "北京大学", "北" is labeled B, "京" and "大" are M, and "学" is E, while "去" on its own is S.

The set of observation values is the set of the characters in the sentence we type.

The initial state probability gives the probability that the first character of a sentence belongs to B, M, E, or S. The probabilities of E and M are 0, because the first character can only be a B or an S, which matches reality.
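
For illustration, the shape of the initial state probabilities (natural logs, following jieba's prob_start data; a very small constant stands in for log 0):

```js
const MIN_LOG = -3.14e100; // stands in for log(0)

const initProb = {
  B: -0.26268660809250016, // most sentences open a multi-character word
  E: MIN_LOG,              // a sentence cannot start at a word's end...
  M: MIN_LOG,              // ...or in a word's middle
  S: -1.4652633398537678,  // or they open with a single-character word
};
```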

The state transition probability represents the probability of transitioning from one state to the next, which satisfies the homogeneity assumption. The structure can be represented by a nested object:

```js
P = {
  B: { E: -0.510825623765990,  M: -0.916290731874155 },
  E: { B: -0.5897149736854513, S: -0.8085250474669937 },
  M: { E: -0.33344856811948514, M: -1.2603623820268226 },
  S: { B: -0.7211965654669841, S: -0.6658631448798212 },
}
```

P['B']['E'] means the probability of moving from state B to state E is 0.6 (the stored value is its natural logarithm); similarly, P['B']['M'] means the probability that the next state is M is 0.4. In other words, when a character is at the beginning of a word, the next character is more likely to be at the end than in the middle, which is intuitive, because two-character words are more common than longer words.
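
Since the stored numbers are natural logarithms, converting them back recovers the probabilities just quoted:

```js
Math.exp(-0.510825623765990); // ≈ 0.6, i.e. P(B -> E)
Math.exp(-0.916290731874155); // ≈ 0.4, i.e. P(B -> M)
```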

The state emission probability represents the probability of the observed character given the current state, which satisfies the observation independence assumption. The structure is the same as above and can also be represented by a nested object:

```js
P = {
  B: { '突': -2.70366861046, /* ...one entry per character... */ },
  M: { /* ... */ },
  S: { /* ... */ },
  E: { /* ... */ },
}
```

P['B']['突'] means that the natural log of the probability of observing the character "突" while in state B is -2.70366861046.

Finally, the Viterbi algorithm is applied: it takes the observation sequence as input and the initial state probabilities, state transition probabilities, and state emission probabilities as parameters, and outputs the state sequence (i.e., the maximum-probability segmentation result). This article does not go through the Viterbi algorithm in detail; interested readers can look it up.
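
For readers who want the general shape anyway, here is a minimal Viterbi sketch over the B/M/E/S labels (our code, not jieba's; the parameter objects follow the nested-object representations above, with all values as natural-log probabilities):

```js
const MIN_LOG = -3.14e100; // stands in for log(0)

function viterbi(obs, states, initProb, transProb, emitProb) {
  const V = [{}]; // V[t][s]: best log-prob of any label path ending in s at t
  let path = {};  // path[s]: that best label path itself
  for (const s of states) {
    V[0][s] = (initProb[s] ?? MIN_LOG) + (emitProb[s]?.[obs[0]] ?? MIN_LOG);
    path[s] = [s];
  }
  for (let t = 1; t < obs.length; t++) {
    V.push({});
    const newPath = {};
    for (const s of states) {
      let best = -Infinity;
      let bestPrev = states[0];
      for (const prev of states) {
        const p = V[t - 1][prev] +
          (transProb[prev]?.[s] ?? MIN_LOG) +
          (emitProb[s]?.[obs[t]] ?? MIN_LOG);
        if (p > best) { best = p; bestPrev = prev; }
      }
      V[t][s] = best;
      newPath[s] = [...path[bestPrev], s];
    }
    path = newPath;
  }
  // A word can only end in E or S, so restrict the final state.
  const t = obs.length - 1;
  const last = (V[t].E ?? MIN_LOG) >= (V[t].S ?? MIN_LOG) ? 'E' : 'S';
  return path[last]; // e.g. ['S', 'B', 'M', 'M', 'E', 'S'] for '去北京大学玩'
}
```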

These two techniques form the technical core of our architecture. Based on them, we improved the Electron-side technical architecture of NetEase Yunxin IM.

NetEase Yunxin IM Electron architecture

Detailed explanation of the architecture diagram

Considering that full-text search is only one feature of the IM, and in order not to affect other IM features while allowing faster iteration, we adopted the following architecture:

On the right is the previous architecture, where the underlying store is IndexedDB and the upper layer consists of read and write modules:

  • When a user actively sends, synchronizes, deletes, or receives a message, the message object is synchronized into IndexedDB.
  • When the user queries a keyword, all message objects in IndexedDB are traversed, and indexOf is used to determine whether each message object contains the keyword (similar to LIKE).

So, when there is a large amount of data, queries become very slow.

On the left is the new architecture, which adds word segmentation and an inverted index database. It does not change the previous architecture at all; it simply adds a layer in front of it. Now:

  • When the user sends, synchronizes, deletes, or receives a message, the content of each message object is segmented into words and then synchronized into the inverted index database.
  • When the user queries a keyword, the idClients of the matching messages are first looked up in the inverted index database; the corresponding message objects are then fetched from IndexedDB by idClient and returned to the user, as sketched below.
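
A minimal sketch of this query path, assuming search-index v3's QUERY API; loadMessagesByIdClients() is a hypothetical helper over the main IndexedDB store:

```js
const nodejieba = require('nodejieba');

async function fullTextSearch(index, keyword) {
  // Segment the keyword the same way message contents were segmented on write.
  const tokens = nodejieba.cut(keyword);
  // Ask the inverted index for documents containing ALL of the tokens.
  const { RESULT } = await index.QUERY({ AND: tokens });
  const idClients = RESULT.map((r) => r._id); // _id was set to idClient on write
  // Fetch the full message objects from the main store (hypothetical helper).
  return loadMessagesByIdClients(idClients);
}
```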

Architecture advantages

The scheme has the following four advantages:

  • Fast: search-index provides an inverted index, which improves search speed.
  • Cross-platform: since both search-index and IndexedDB are based on levelDB, search-index also supports the browser environment, which makes full-text search on the Web side possible.
  • Independence: the inverted index library is separated from the main IM business library (IndexedDB). When data is written to IndexedDB, the write module of the inverted index library is automatically notified; it segments the message content, inserts the result into a storage queue, and finally writes it into the inverted index database. When a full-text search is needed, the read module of the inverted index library quickly finds the idClients of the message objects matching the keyword, then fetches the message objects from IndexedDB by idClient and returns them.
  • Flexibility: full-text search is provided as a plug-in. It exposes a higher-order function that wraps the IM and returns a new IM extending it. Thanks to JS's prototype mechanism, any method not present on the new IM is automatically looked up along the prototype chain on the old IM. The plug-in can therefore focus on implementing its own methods without caring about the specific version of the IM, and it supports user-defined segmentation functions to meet different users' needs (see the sketch after this list).
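
A minimal sketch of that plug-in pattern (the IM class and method names here are hypothetical; only the wrapping mechanism is the point):

```js
const nodejieba = require('nodejieba');

function fullTextSearchPlugin(IM, { tokenizer = (s) => nodejieba.cut(s) } = {}) {
  // Return a subclass: anything not overridden here resolves through the
  // prototype chain to the original IM, so the plug-in is version-agnostic.
  return class IMWithSearch extends IM {
    async queryMessagesByKeyword(keyword) {
      const tokens = tokenizer(keyword);
      // ...look up idClients in the inverted index, then load the full
      // message objects from IndexedDB (omitted in this sketch)...
    }
  };
}

// Usage: pass a custom segmentation function if the default doesn't fit.
// const EnhancedIM = fullTextSearchPlugin(IM, { tokenizer: myCustomCut });
```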

Results

After adopting the above architecture, our tests show that with 200,000 records, the search time drops from over ten seconds to under one second, making searches roughly 20 times faster.

Conclusion

As described above, we implemented full-text search of NetEase Yunxin IM SDK chat messages on Electron based on Nodejieba and search-index, speeding up chat-history search. Going forward, we will optimize further in areas such as:

  • Write performance: we found that levelDB, the underlying database that search-index depends on, has a write-performance bottleneck, consuming a lot of CPU and memory when the data volume is large. Our investigation showed that SQLite's write performance is much better: from observation, write time grows only in proportion to the amount of data, while CPU and memory usage stay relatively stable. We may therefore compile SQLite into a Node.js native module to replace search-index.
  • Extensibility: the current decoupling from business logic is not complete; some business fields are stored in the inverted index library. We may later restrict the inverted index library to looking up a message object's idClient by keyword only, and move searches involving business attributes into IndexedDB, completely decoupling the inverted index library from the main business library.

That's all for this article. Welcome to follow us for more technical content to come. Finally, I hope this article has been helpful to you.

About the author

Li Ning is a senior front-end development engineer at NetEase Yunxin, responsible for application development, componentization, and solution development of NetEase Yunxin's audio/video and IM SDKs. He has rich practical experience in React componentization, PaaS component design, and multi-platform development and compilation. If you have any questions, feel free to leave a message.

For more technical content, follow the [NetEase Smart Enterprise Technology+] WeChat official account.