How to implement full text search in Mysql

One, foreword

Today, a colleague asked me how to use Mysql to implement full text search function like ElasticSearch, and score the search keywords. The question immediately went through my head, right? Why not just use ES? Easy to use and fast. However, he said that the amount of data is not large, and the customer gave very limited time, so there is no time to build ES, so let’s have a look at Mysql’s full-text search function. Ngram full text parser has been built into MySQL since version 5.7.6 to support Chinese, Japanese, and Korean word segmentation. Before MySQL 5.7.6, full-text index only supports English full-text index, but does not support Chinese full-text index. It is necessary to use word segmentation to preprocess Chinese paragraphs into words, and then store them in the database. This article is tested using Mysql 5.7.6, InnoDB database engine.

The original resolution

Second, full text parser ngram

A ngram is a sequence of n consecutive words in a text. The Ngram full-text parser is capable of segmentation of text, where each word is a consecutive sequence of n words. For example, use the Ngram full-text parser to participle “Hello world” :

n=1: 'you'.'good'.'the'.'world' 
n=2: 'hello'.'good,'.'the world' 
n=3: Hello world.Good world 
n=4: Hello world
Copy the code

MySQL uses the global variable ngram_token_size to configure the size of n in ngram. The value ranges from 1 to 10, and the default value is 2. Usually ngram_token_size is set to the minimum word count of the word to be queried. If you need to search for words, set ngram_token_size to 1. With a default value of 2, searching for a word yields no results. Since a Chinese word has at least two characters, the default value of 2 is recommended.

Mysql default ngram_token_size

show variables like 'ngram_token_size'
Copy the code

There are two ways to set the ngram_token_size variable:

1. Specify when starting the mysqld command

mysqld --ngram_token_size=2
Copy the code

2. Modify the mysql configuration file

[mysqld] 
ngram_token_size=2
Copy the code

Full text index

Take a certain document data as an example, create a new table T_WENSHU, create a full-text index for the document content field, and import 10W test data.

Create a full-text index when creating a table

CREATE TABLE `t_wenshu` (
  `province` varchar(255) DEFAULT NULL,
  `caseclass` varchar(255) DEFAULT NULL,
  `casenumber` varchar(255) DEFAULT NULL,
  `caseid` varchar(255) DEFAULT NULL,
  `types` varchar(255) DEFAULT NULL,
  `title` varchar(255) DEFAULT NULL,
  `content` longtext,
  `updatetime` varchar(255) DEFAULT NULL,
  FULLTEXT KEY `content` (`content`) WITH PARSER `ngram`
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Copy the code

2. Use ALTER TABLE

ALTER TABLE t_wenshu ADD FULLTEXT INDEX content_index (content) WITH PARSER ngram;
Copy the code

3. Select Create Index

CREATE FULLTEXT INDEX content_index ON t_wenshu (content) WITH PARSER ngram;
Copy the code

4. Search mode

Natural language retrieval

IN NATURAL LANGUAGE MODE is the default full-text search MODE of MySQL. The natural language pattern cannot use operators and cannot specify complex queries such as the keyword must or must not appear.

Boolean retrieval

(IN BOOLEAN MODE) select ‘this’,’ this’, ‘this’,’ this’, ‘this’,’ this’, ‘this’,’ this’, ‘this’,’ this’, ‘this’,’ this’, ‘this’,’ this’, ‘this’,’ this’; This is almost taken as a stopword; Boolean search mode can use operators to support complex queries that specify whether a keyword must or must not appear, or whether the keyword has a high or low weight.

· Do not discard more than 50% of rows that match. · Does not automatically reverse order by relevance. · You can search for fields that do not have FULLTEXT index, but it is very slow. · Restrict the longest and shortest strings. · Use Stopwords. ● Search syntax rules: + must have (data bars that do not contain this keyword are ignored). - Can not have (exclude the specified keyword, contain the keyword are ignored). > Increases the weight of the matching data. < Decreases the weight value of the matching data. Turning its relevance from positive to negative means that having the word reduces its relevance (unlike - excluding it), but only ranks lower than the weight value. * Multicharacter, unlike other syntax comes first, this should come after the string. "" Enclosing a sentence in double quotation marks means it matches exactly.Copy the code

Query extended search

Note :(WITH QUERY EXPANSION) use caution as QUERY extensions can cause many irrelevant queries!

5. Search and query

1) Query the record of “theft” contained in the content, and the query statement is as follows

select caseid,content, MATCH ( content) AGAINST ('Theft') as score from t_wenshu where MATCH ( content) AGAINST ('Theft' IN NATURAL LANGUAGE MODE)
Copy the code

2) The query content contains records of “picking quarrels and provoking troubles”, and the query statement is as follows

select caseid,content, MATCH ( content) AGAINST ('Pick a quarrel') as score from t_wenshu where MATCH ( content) AGAINST ('Pick a quarrel' IN NATURAL LANGUAGE MODE) ;
Copy the code

3) Single Chinese character, query the record containing “I” in the content, the query statement is as follows

select caseid,content, MATCH ( content) AGAINST ('我') as score from t_wenshu where MATCH ( content) AGAINST ('我' IN NATURAL LANGUAGE MODE) ;
Copy the code

Note: Because the value of global variable ngram_token_size is set to 2. If you want to query a single Chinese character, change ngram_token_size = 1 in the configuration file my.ini and restart the mysqld service.

4) Query field content contains “dangerous driving” and “picking quarrels” statement as follows:

select caseid,content, MATCH (content) AGAINST ('+ Dangerous driving + Provoking trouble ') as score from t_wenshu where MATCH (content) AGAINST ('+ Dangerous driving + Provoking trouble ' IN BOOLEAN MODE);
Copy the code

5) Query field content contains “dangerous driving”, but does not contain “provoking trouble” statement as follows:

select caseid,content, MATCH (content) AGAINST ('+ Dangerous driving - causing a disturbance ') as score from t_wenshu where MATCH (content) AGAINST ('+ Dangerous driving - causing a disturbance ' IN BOOLEAN MODE);
Copy the code

6) Conent contains “dangerous driving” or “picking quarrels” as follows:

select caseid,content, MATCH (content) AGAINST ('Dangerous Driving causing trouble') as score from t_wenshu where MATCH (content) AGAINST ('Dangerous Driving causing trouble' IN BOOLEAN MODE);
Copy the code

Six, summarized

1) Before using Mysql full-text index, find out what versions are supported;

2) Full-text index is N times faster than like + %, but there may be accuracy problems;

3) If a large amount of data is needed for full-text indexing, it is recommended to add data before creating indexes.

4) For Chinese, you can use MySQL after 5.7.6, or Sphinx, Lucene and other third-party plug-ins;

5) The MATCH() function must use the same field name as the field name specified when creating the full-text index, and only the same table fields cannot cross the table;

over