preface

We all know that InnoDB’s use of “%xx” in fuzzy queries of data will result in index failure, but sometimes this is required, and there are many such requirements. For example, search engines need full-text searches based on user data keywords, e-commerce sites need full-text searches based on user query conditions, You might need to look in the details of an item, which is not something that a B+ tree index does well.

With numerical comparisons, range filtering, etc., most of the queries we need are done. However, if you want to filter queries by matching keywords, you need similarity-based queries rather than the original exact numerical comparison, and full-text indexes are designed for this scenario.

Full-text Search is a technique for finding out any information in a whole book or article stored in a database. It can obtain the relevant chapter, section, paragraph, sentence, word and other information in the full text according to the need, and can also carry out various statistics and analysis.

In the early days of MySQL, InnoDB did not support full-text retrieval technology. Since MySQL 5.6, InnoDB has started to support full-text retrieval.

Inverted index

Full-text retrieval is usually implemented using inverted index, which is an index structure like B+Tree. It stores in auxiliary tables a mapping between a word and where the word itself is in one or more documents. This is usually implemented using associative arrays and has two representations:

  • Inverted file Index: {word, ID of document where word is located}
  • Full inverted index: {word, (id of the document in which the word resides, and location in the specific document)}

Inverted File Index associative array is inverted File Index array, and the word “code” exists in Documents 1 and 4. Therefore, it is easy to store and perform full-text query. The Documents containing the query keywords can be obtained directly from Documents. The full Inverted index stores pairs, i.e., DocumentId,Position. Therefore, its inverted index is stored as shown in the figure below, for example, the keyword “code” exists in the sixth word of document 1 and the eighth word of document 4. In contrast, full Inverted Index takes up more space, but provides better location data and extends some other search features.

The full text retrieval

Creating a full-text index

Create a full-text index when creating a table.

CREATE TABLE table_name ( id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, author VARCHAR(200), 
title VARCHAR(200), content TEXT(500), FULLTEXT full_index_name (col_name) ) ENGINE=InnoDB;
Copy the code

Enter the query statement:

SELECT table_id, name, space from INFORMATION_SCHEMA.INNODB_TABLES
WHERE name LIKE 'test/%';
Copy the code

The above six index tables constitute the inverted index, known as the auxiliary index table. When the incoming document is tokenized, the single word is fully sorted and partitioned in six index tables with positional information and associated DOC_ID, according to the character set sorting weight of the word’s first character.

Create a full-text index on an existing table.

CREATE FULLTEXT INDEX full_index_name ON table_name(col_name);
Copy the code

Using full-text indexes

MySQL database supports full-text retrieval queries. Full-text indexes can only be used on InnoDB or MyISAM tables, and can only be used to create columns of type CHAR, VARCHar, or TEXT.

The syntax is as follows:

MATCH(col1,col2,...) AGAINST(expr[search_modifier])
search_modifier:
{
    IN NATURAL LANGUAGE MODE
    | IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION
    | IN BOOLEAN MODE
    | WITH QUERY EXPANSION
}
Copy the code

Full-text searches are performed using the MATCH() AGAINST() syntax, where MATCH() is a comma-separated list named for the columns to be searched. AGAINST() takes a string to search for, along with an optional modifier for the type of search to perform. Full-text search is divided into three types: natural language search, Boolean search, and query extended search. The various query modes are described below.

Natural Language

Natural Language searches interpret search strings as phrases in Natural human languages, and MATCH() defaults to Natural Language mode, which means to query documents with specified keywords.

Next, combine with demo to better understand Natural Language

SELECT
    count(*) AS count 
FROM
    `fts_articles` 
WHERE
    MATCH ( title, body ) AGAINST ( 'MySQL' );
Copy the code

Query the number of rows containing the keyword ‘MySQL’ in the title and body columns. The above statement can also be written as follows:

SELECT
    count(IF(MATCH ( title, body ) 
    against ( 'MySQL' ), 1.NULL )) AS count 
FROM
    `fts_articles`;
Copy the code

The results of the two statements are the same, but internally, the second SQL statement is faster because the first SQL statement (based on the WHERE index query) also requires sorting statistics for correlation, which is not required in the second SQL statement.

Correlations can also be queried with SQL statements:

SELECT
    *.MATCH ( title, body ) against ( 'MySQL' ) AS Relevance 
FROM
    fts_articles;
Copy the code

The calculation of correlation is based on the following four conditions:

  • Whether Word appears in the document
  • The number of times word appears in a document
  • The number of word columns in the index
  • How many documents contain the Word

For InnoDB storage engine full-text retrieval, the following factors need to be considered:

  • The word of the query is in the stopWord column, and the query with this string is ignored
  • Whether the length of the word is within the range [innodb_ft_min_token_size,innodb_ft_max_token_size]

If the word is in stopWord, then the word is not queried, as in the case of ‘for’, and the result is as follows:

SELECT
    *.MATCH ( title, body ) against ( 'for' ) AS Relevance 
FROM
    fts_articles;
Copy the code

As you can see, ‘for’ appears in documents 2,4, but because it is stopword, its relevance is 0

The parameters innodb_ft_min_token_size and innodb_ft_max_token_size control the length of the InnoDB engine’s query character, If the length is smaller than innodb_ft_min_token_size or larger than innodb_ft_max_token_size, the search will be ignored. In the InnoDB engine, the default value of the parameter innodb_ft_min_token_size is 3, and the default value of innodb_ft_max_token_size is 84

Boolean

A Boolean search uses the rules of a special query language to interpret a search string that contains the word to be searched for, and it can also contain operators that specify requirements, such as that a word must be present or absent in the matched line, or that it should be weighted higher or lower than usual. For example, the following statement asks for documents that have the string “Pease” but no “hot”, where + and – indicate that the word must exist or must not exist, respectively.

select * from fts_test where MATCH(content) AGAINST('+Pease -hot' IN BOOLEAN MODE);
Copy the code

Boolean The types supported for full-text retrieval include:

  • + : Indicates that the Word must exist
  • – : Indicates that the Word must not exist
  • (no operator) indicates that the word is optional, but its relevance is higher if present
  • Distance indicates whether the distance between the multiple words in the query is within distance, which is in bytes. This full-text search query is also calledProximity Search, such asMATCH(context) AGAINST('"Pease hot"@30' IN BOOLEAN MODE)The statement indicates that the distance between the string Pease and hot must be within 30 bytes
  • > : increases the relevance of the word when it appears
  • < : reduces relevance when the word appears
  • ~ : indicates that the word is allowed to appear, but the correlation is negative
  • * : indicates a word that begins with the word, such aslik*Which means it could belik.like.likes
  • “–

Here are some demos to see how Boolean Mode works.

Not: + –

SELECT
    * 
FROM
    `fts_articles` 
WHERE
    MATCH ( title, body ) AGAINST ( '+MySQL -YourSQL' IN BOOLEAN MODE );
Copy the code

SQL > query for ‘MySQL’ but not ‘YourSQL

Demo2: no operator

SELECT
    * 
FROM
    `fts_articles` 
WHERE
    MATCH ( title, body ) AGAINST ( 'MySQL IBM' IN BOOLEAN MODE );
Copy the code

In the above statement, the query ‘MySQL IBM’ does not have a ‘+’, ‘-‘ identifier, which means word is optional. If it does, its relevance will be higher

Demo3: @

SELECT
    * 
FROM
    `fts_articles` 
WHERE
    MATCH ( title, body ) AGAINST ( '"DB2 IBM"@3' IN BOOLEAN MODE );
Copy the code

The above statement represents the words “DB2” and “IBM” within 3 bytes of each other

Demo4: > <

SELECT
    * 
FROM
    `fts_articles` 
WHERE
    MATCH ( title, body ) AGAINST ( '+MySQL +(>database <DBMS)' IN BOOLEAN MODE );
Copy the code

In the above statement, the query contains row information for ‘MySQL’, ‘database’, and ‘DBMS’, but the row information for ‘DBMS’ is more correlated than the row information for ‘DBMS’.

demo5: ~

SELECT
    * 
FROM
    `fts_articles` 
WHERE
    MATCH ( title, body ) AGAINST ( 'MySQL ~database' IN BOOLEAN MODE );
Copy the code

The above statement queries rows containing ‘MySQL’, but reduces correlation if the row also contains ‘database’.

Demo6: *

SELECT
    * 
FROM
    `fts_articles` 
WHERE
    MATCH ( title, body ) AGAINST ( 'My*' IN BOOLEAN MODE );
Copy the code

Query for rows whose keyword contains ‘My’.

Demo7:”

SELECT
    * 
FROM
    `fts_articles` 
WHERE
    MATCH ( title, body ) AGAINST ( '"MySQL Security"' IN BOOLEAN MODE );
Copy the code

Query for rows that contain the exact phrase ‘MySQL Security’.

Query Expansion

Query extended search is a modification of natural language search, which is usually done when the keyword of the query is too short and the user needs implied knowledge (implied knowledge). For example, for a query with the word database, the user may wish to query more than the document containing the database. It may also refer to words that include MySQL, Oracle, and RDBMS, where you can use the Query Expansion pattern to open the implied knowledge of full-text retrieval

Blind QUERY EXPANSION (also called automatic) can be enabled by adding WITH QUERY EXPANSION/IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION to the QUERY statement Relevance feedback), the query is divided into two phases.

  • Stage 1: Full-text index queries based on the searched words
  • The second stage: according to the word segmentation generated in the first stage to conduct a full text search query

Let’s take a look at an example of how Query Expansion is used.

Create index
create FULLTEXT INDEX title_body_index on fts_articles(title,body);
Copy the code
-- Use Natural Language mode to query
SELECT
    * 
FROM
    `fts_articles` 
WHERE
    MATCH(title,body) AGAINST('database');
Copy the code

The Query results before using Query Expansion are as follows:

-- When using Query Expansion mode
SELECT
    * 
FROM
    `fts_articles` 
WHERE
    MATCH(title,body) AGAINST('database' WITH QUERY expansion);
Copy the code

After using Query Expansion, the Query results are as follows:

Because full-text retrieval from Query Expansion can lead to many non-relevant queries, users may need to be very careful when using it.

Deleting a full-text index

Delete full-text index:

DROP INDEX full_idx_name ON db_name.table_name;
Copy the code

Alter table alter table delete full-text index

ALTER TABLE db_name.table_name DROP INDEX full_idx_name;
Copy the code

summary

This article introduces fulltext Index from the perspective of combining theory and practice. If you are interested in MySQL, you can follow the column of MySQL.