Preface

For example, if you want to search your data by keyword, you can use an inverted index. In this article, I summarize my understanding of the two kinds of index: the forward index and the inverted index.

Body

Forward index

For MySQL InnoDB, a simplified B+ tree index (page attributes omitted) is structured as follows: the leaf nodes store the complete row data, and the non-leaf nodes store the key fields (the primary key) of the clustered index. A SQL statement that can use the clustered index searches the B+ tree from top to bottom until it reaches the entry whose key matches.

CREATE TABLE user_info (
	id int,
	name varchar(16),
	hobby varchar(256)
);

The corresponding non-clustered index differs only in the content of its leaf nodes, which store the primary key of the row instead of the complete row. A query first uses the indexed field to find one or more matching primary key ids in the leaf nodes, then uses those ids to go back to the table, and finally obtains the complete entity data. For example, suppose we want to query users whose hobby contains the keyword "basketball":

SELECT *
FROM user_info
WHERE hobby LIKE '%basketball%';

Even if we create a normal index on the hobby field, the InnoDB engine can only use it for leftmost-prefix matches such as LIKE 'basketball%'; a pattern that starts with a wildcard cannot use the string index. InnoDB has supported full-text indexes since MySQL 5.6. After creating a full-text index, use MATCH and AGAINST to query through it. This is much more efficient than scanning the B+ tree, but the full-text index takes up a lot of disk space. A full-text index is built as an inverted index.

SELECT *
FROM user_info
WHERE MATCH (hobby) AGAINST ('basketball');

Inverted index

Compared with the forward index of a B+ tree, an inverted index on the hobby field has, in its minimal form, three columns of data. The first column is the term: when a field is indexed, the analyzer splits its content into individual terms according to their meaning, and together these terms form the term dictionary for that kind of data. For example, "like basketball and singing" is split into the two terms "basketball" and "singing". The second column holds the ids of the documents (documentId) that contain each term; these ids let us trace back to the complete entity data. The third column is the position of the term within the document's field, where 0 means the beginning; it can help mark the highlighted parts of the retrieved data.
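The three-column structure described above can be sketched in a few lines of Python. This is only a minimal illustration, not how MySQL or ES store their indexes: the names `build_inverted_index`, `tokenize`, and `STOP_WORDS` are hypothetical, the stop-word list is an assumption, and positions are counted per word rather than per character.

```python
import re
from collections import defaultdict

# Assumed stop words, chosen so that "like basketball and singing"
# yields only the terms "basketball" and "singing".
STOP_WORDS = {"like", "and"}

def tokenize(text):
    """Split text into lowercase words (a stand-in for a real analyzer)."""
    return re.findall(r"\w+", text.lower())

def build_inverted_index(docs):
    """Map each term to a list of (document id, word position) postings."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            if term in STOP_WORDS:
                continue
            index[term].append((doc_id, position))
    return dict(index)

docs = {
    1: "like basketball and singing",
    2: "singing",
}
index = build_inverted_index(docs)
print(index["basketball"])  # postings for the term "basketball"
print(index["singing"])     # postings for the term "singing"
```

Looking up a term returns its postings directly, so the documents containing "singing" are found without scanning every hobby value.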

Inverted index and word segmentation

How do we build the inverted index for an incoming document, such as {"id":1,"name":"Zhang San","hobby":"basketball, singing"}? In ES, you can define in advance that the field is a string type and which analyzer it uses, and the hobby field is then preprocessed when the document is indexed. After the following three word segmentation steps, the whole sentence is split into multiple terms, and the position and document id of each term are generated and added to the data structure described above.

  1. Character Filters: Process raw text, such as removing HTML tags
  2. Tokenizer: Splits the original text into words according to certain rules
  3. Token Filters: Post-process the words produced by the Tokenizer, such as lowercasing them, deleting some, or adding new ones
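The three steps above can be chained as plain functions. The sketch below only imitates the idea and does not use the ES API; the function names and the stop-word set are hypothetical.

```python
import re

def char_filter(text):
    """Character filter: replace HTML tags in the raw text with spaces."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenizer(text):
    """Tokenizer: split the text into words on non-word characters."""
    return re.findall(r"\w+", text)

def token_filter(tokens, stop_words=frozenset({"and"})):
    """Token filter: lowercase every token, then drop assumed stop words."""
    lowered = (token.lower() for token in tokens)
    return [token for token in lowered if token not in stop_words]

def analyze(text):
    """Run the three stages in order and return the resulting terms."""
    return token_filter(tokenizer(char_filter(text)))

terms = analyze("<b>Basketball</b> and Singing")
print(terms)
```

Keeping each stage as a separate function mirrors how an analyzer is composed: any stage can be swapped out without touching the others.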

For different kinds of text we can use different, or even custom, analyzers. ES, for example, offers: Standard Analyzer (the default analyzer; splits text into words and handles punctuation), Simple Analyzer (non-letter characters are discarded and used as split points), Whitespace Analyzer (splits on whitespace), and IKAnalyzer (a popular Chinese analyzer). MySQL's full-text index also has a counterpart for Chinese word segmentation, the ngram parser.

Query order of the two indexes

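From the description above, we can outline the rough query order for each index. With a forward index, the query uses the indexed field to locate the matching primary key ids and then goes back to the table to fetch the complete rows. With an inverted index, the query text is segmented into terms, each term is looked up in the term dictionary to obtain the matching document ids, and those ids are used to fetch the complete documents.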
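To make the two query orders concrete, here is a toy comparison in Python. It is a sketch under simplifying assumptions: a sorted list of (hobby, id) pairs stands in for the leaf nodes of a secondary B+ tree index, a plain dict stands in for the term dictionary, and the sample rows and function names are invented for illustration.

```python
from bisect import bisect_left

# Toy table keyed by primary key (sample data is hypothetical)
rows = {
    1: {"id": 1, "name": "Zhang San", "hobby": "basketball, singing"},
    2: {"id": 2, "name": "Li Si", "hobby": "singing"},
    3: {"id": 3, "name": "Wang Wu", "hobby": "basketball"},
}

# Forward path: sorted (hobby, id) pairs imitate secondary index leaf nodes
secondary_index = sorted((row["hobby"], pk) for pk, row in rows.items())

def prefix_query(prefix):
    """Indexed field -> primary key ids -> back to the table."""
    start = bisect_left(secondary_index, (prefix,))
    ids = []
    for value, pk in secondary_index[start:]:
        if not value.startswith(prefix):
            break  # sorted order: no later entry can share the prefix
        ids.append(pk)
    return [rows[pk] for pk in ids]

# Inverted path: term -> list of document ids
inverted_index = {}
for pk, row in rows.items():
    for term in row["hobby"].replace(",", " ").split():
        inverted_index.setdefault(term, []).append(pk)

def term_query(text):
    """Segment the query -> term dictionary -> document ids -> documents."""
    ids = []
    for term in text.split():
        for pk in inverted_index.get(term, []):
            if pk not in ids:
                ids.append(pk)
    return [rows[pk] for pk in ids]

print([r["id"] for r in prefix_query("singing")])  # only hobbies that start with "singing"
print([r["id"] for r in term_query("singing")])    # every hobby that contains the term
```

The prefix search can only find values that begin with the keyword, which is the leftmost-prefix limitation of the string index, while the term lookup also finds "singing" in the middle of a hobby value.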

Application of inverted index idea

I once received the following requirement: we show different content to users according to rules configured on their city, grade, term, and device. Any of these attributes may be left empty (null) in a rule. If a user's attributes match multiple rules, each matched rule is scored according to the weights of its attributes, and the rule with the highest score is applied. For example, suppose the rules are configured as follows:

city      grade        term    device  rule
Shanghai  (null)       spring  (null)  Display A
Shanghai  First grade  (null)  (null)  Display B

A first-grade primary school user in Shanghai in the spring term matches both rules. We then score the two rules by the weights of their attributes and finally choose whether to display A or B.

The SQL for the query above is shown below. The idea behind the OR conditions is similar to "segmenting the query into terms and then searching the inverted index for all documents that contain any of those terms" (except that here the terms are already fixed). After several records are found, the weighted score computed for each of them plays the same role as the retrieval relevance score in ES.

SELECT *
FROM tb_rules
WHERE (city = 'Shanghai' OR grade = 'First grade' OR term = 'spring');
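The scoring step that follows the query could look roughly like the Python sketch below. The article does not give the actual weights, so the values in `WEIGHTS` are made up, as is the sample user. The sketch also assumes that a rule matches only when every attribute it sets equals the user's value, which is stricter than the OR query that fetches the candidates.

```python
# Hypothetical weights: the real values are not given in the article.
WEIGHTS = {"city": 1, "grade": 4, "term": 2, "device": 3}

RULES = [
    {"city": "Shanghai", "grade": None, "term": "spring", "device": None, "rule": "A"},
    {"city": "Shanghai", "grade": "First grade", "term": None, "device": None, "rule": "B"},
]

def score(rule, user):
    """Return the rule's weighted score, or None if a set attribute conflicts."""
    total = 0
    for attr, weight in WEIGHTS.items():
        if rule[attr] is None:
            continue  # unset attribute: matches any user and adds no score
        if rule[attr] != user.get(attr):
            return None  # the rule sets this attribute to a different value
        total += weight
    return total

def best_rule(rules, user):
    """Pick the matching rule with the highest score, or None if none match."""
    scored = [(score(rule, user), rule["rule"]) for rule in rules]
    matched = [pair for pair in scored if pair[0] is not None]
    if not matched:
        return None
    return max(matched, key=lambda pair: pair[0])[1]

user = {"city": "Shanghai", "grade": "First grade", "term": "spring", "device": "pad"}
print(best_rule(RULES, user))
```

With these assumed weights, rule A scores city + term and rule B scores city + grade, so the rule that sets the more heavily weighted attribute wins.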

Conclusion

In this article, we looked at the ideas behind the forward index and the inverted index and briefly at how they are implemented. I hope that understanding these different principles gives you different ways to solve problems at work.