Abstract: This paper briefly introduces the principles and applications of GaussDB(DWS) full-text search.

This article is shared from the Huawei Cloud community post "GaussDB(DWS) SQL Advanced: Full-Text Search", original author: Zhang Jingyao.

Text search, as the name implies, is the process of finding a specified pattern in a given document. GaussDB(DWS) supports full-text search over text fields and combinations of fields in a table, finding the text that matches a given pattern and presenting the matching results in the way the user expects.

Based on the author’s experience and thinking, this paper gives a brief introduction to the full-text retrieval function of GaussDB(DWS), hoping to be helpful to readers.

1. Preprocessing

There are several ways to find a pattern in a given document, such as searching for a regular expression with the grep command. In theory, text fields in a database could also be searched in a grep-like manner; in GaussDB(DWS), strings can be matched with the keyword LIKE or the operator ~. But this approach has serious drawbacks. Every piece of text has to be scanned, which is inefficient, and it is difficult to measure how well a result "matches" or how relevant it is. Moreover, it can only match strings mechanically and lacks any grammatical or semantic analysis. For example, it is hard to automatically recognize and match plural nouns or verb tense changes in English, so it cannot produce satisfactory results for text written in natural language.
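A quick illustration of this limitation: a plain pattern match treats "rats" and "rat" as unrelated strings, so searching for the plural form misses the singular (a minimal sketch using the ~ operator; the sample sentence is made up for illustration):

postgres=# select 'The rat ate the cheese' ~ 'rats';
 ?column?
----------
 f
(1 row)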

GaussDB(DWS) instead takes a search-engine-like approach to full-text retrieval. The given text and pattern are first preprocessed: words or phrases are extracted from the text, stop words that are useless for retrieval are removed, inflected words are normalized, and so on, so that both sides are converted into a form suitable for retrieval before matching.

In GaussDB(DWS), both the original document and the search condition are represented as text, in other words as strings. A preprocessed document becomes a tsvector, produced by the function to_tsvector. For example,

postgres=# select to_tsvector('a fat cat ate fat rats');
            to_tsvector
-----------------------------------
 'ate':4 'cat':3 'fat':2,5 'rat':6
(1 row)

Looking at the tsvector output above, we can see the effect of to_tsvector:

  • First, each word is extracted and its position is recorded as an integer; for example, "fat" appears as the second and fifth word of the original sentence.
  • Moreover, the word "a" is so common, appearing in almost every document, that it does little to help retrieve useful information. In terms of Shannon's information theory, the more likely a word is to appear, the less information it carries. Words like "a" and "the" carry so little information that they are dropped as stop words. Note that this does not affect the position numbering of the other words: the positions of "fat" are still 2 and 5, not 1 and 4.
  • In addition, the plural form "rats" is replaced by the singular form "rat". This operation is called normalization; for Western languages it mainly means stripping suffixes and keeping the roots of words that change form in different contexts. The goal is to make natural-language retrieval easier: a search for "rat" can then find documents containing either "rat" or "rats". The normalized words are called lexemes, such as "rat"; the original words are called tokens.

There are many benefits to converting a document to tsvector form. For example, you can easily build an index on it to improve retrieval speed and efficiency; when the number of documents is large, looking up keywords through an index is much faster than grep-style full-text scanning. You can also assign different weights to different keywords according to their importance, which makes it easy to rank the results and surface the most relevant documents.
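As a sketch of how weighting and ranking can be used, the statement below relies on the standard setweight and ts_rank functions (available in PostgreSQL-compatible databases); the post table with title and body columns is the one created later in this article, and the weight labels chosen here are only illustrative:

-- give title keywords weight 'A' and body keywords weight 'D',
-- then sort the matching posts by relevance
select title,
       ts_rank(setweight(to_tsvector(title), 'A') ||
               setweight(to_tsvector(body),  'D'),
               to_tsquery('physics | math')) as score
from post
where setweight(to_tsvector(title), 'A') ||
      setweight(to_tsvector(body),  'D') @@ to_tsquery('physics | math')
order by score desc;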

The preprocessed search condition is converted to the tsquery type, which can be done with the to_tsquery function. For example,

postgres=# select to_tsquery('a & cats & rat');
  to_tsquery  
---------------
 'cat' & 'rat'
(1 row)

As you can see from the above example:

  • Like to_tsvector, to_tsquery also removes stop words and normalizes input text, such as removing “a” and changing “cats” to “cat”.
  • The terms in the search condition itself must be connected with the and (&), or (|), and not (!) operators; otherwise the statement reports an error, as in the following example.
postgres=# select to_tsquery('cats rat');
ERROR:  syntax error in tsquery: "cats rat"
CONTEXT:  referenced column: to_tsquery

plainto_tsquery, however, does not have this restriction: it simply joins the input words into an AND condition:

postgres=# select plainto_tsquery('cats rat');
 plainto_tsquery
-----------------
 'cat' & 'rat'
(1 row)
postgres=# select plainto_tsquery('cats,rat');
 plainto_tsquery
-----------------
 'cat' & 'rat'
(1 row)

In addition to these functions, you can cast a string to the tsvector or tsquery type, for example

postgres=# select 'fat cats sat on a mat and ate a fat rat'::tsvector;
                      tsvector                      
-----------------------------------------------------
 'a' 'and' 'ate' 'cats' 'fat' 'mat' 'on' 'rat' 'sat'
(1 row)
postgres=# select 'a & fat & rats'::tsquery;
       tsquery       
----------------------
 'a' & 'fat' & 'rats'
(1 row)

The difference between the casts and the functions is that a cast does not remove stop words, does not normalize words, and, for tsvector, does not record word positions.

2. Pattern matching

Once the input document and the search condition have been converted to tsvector and tsquery, pattern matching can be performed. GaussDB(DWS) uses the @@ operator for pattern matching; it returns true on a successful match and false otherwise.

For example, create the following table,

postgres=# create table post(
postgres(# id bigint,
postgres(# author name,
postgres(# title text,
postgres(# body text);
CREATE TABLE
-- insert some tuples

If you want to retrieve the titles of posts whose body contains "physics" or "math", you can use the following statement:

postgres=# select title from post where to_tsvector(body) @@ to_tsquery('physics | math');
            title           
-----------------------------
 The most popular math books
(1 row)

You can also combine multiple fields to query:

postgres=# select title from post where to_tsvector(title || ' ' || body) @@ to_tsquery('physics | math');
            title           
-----------------------------
 The most popular math books
(1 row)

Note that different query methods may produce different results. For example, the following match fails because ::tsquery does not normalize the search condition, and the word "cats" does not appear in the tsvector:

postgres=# select to_tsvector('a fat cat ate fat rats') @@ 'cats & rat'::tsquery;
 ?column?
----------
 f
(1 row)

With the same document and search condition, the following match succeeds because to_tsquery changes "cats" to "cat":

postgres=# select to_tsvector('a fat cat ate fat rats') @@ to_tsquery('cats & rat');
 ?column?
----------
 t
(1 row)

Similarly, the following match fails because to_tsvector removes the stop word a:

postgres=# select to_tsvector('a fat cat ate fat rats') @@ 'cat & rat & a'::tsquery;
 ?column?
----------
 f
(1 row)

By contrast, ::tsvector preserves all the words, including the stop word "a". Note, however, that the match below still fails, because the cast performs no normalization on either side, so the "rat" in the query does not match the "rats" in the tsvector:

postgres=# select 'a fat cat ate fat rats'::tsvector @@ 'cat & rat & a'::tsquery;
 ?column?
----------
 f
(1 row)

Therefore, the appropriate retrieval method should be chosen according to your needs.

In addition, the @@ operator can implicitly cast the input text, for example,

postgres=# select title from post where body @@ 'physics | math';
 title
-------
(0 rows)

To be precise, text @@ text is equivalent to to_tsvector(text) @@ plainto_tsquery(text), so the match above fails: plainto_tsquery turns the OR condition 'physics | math' into the AND condition 'physic' & 'math'. Use this implicit form with great care.
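You can check what the implicit conversion produces by calling plainto_tsquery directly (a minimal verification, assuming the default english configuration is in effect):

postgres=# select plainto_tsquery('physics | math');
  plainto_tsquery
-------------------
 'physic' & 'math'
(1 row)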

3. Create and use indexes

As mentioned earlier, scanning the text fields of a table one by one is slow and inefficient, while an index lookup can greatly improve retrieval speed. GaussDB(DWS) supports full-text retrieval using the Generalized Inverted Index (GIN), a type of index commonly used in search engines. The basic idea of GIN is to map each keyword back to the documents that contain it, which speeds up queries. A GIN index can be created on a text field with the following statement:

postgres=# create index post_body_idx_1 on post using gin(to_tsvector('english', body));
CREATE INDEX

Note that you must use the to_tsvector function to generate the tsvector; a cast or implicit type conversion cannot be used. Also, the to_tsvector call used here takes an extra parameter, 'english', compared with the previous section; it specifies the text search configuration (covered in the next section). Different configurations produce different tsvectors, and therefore different index contents, so the configuration must be specified explicitly. A query can use the index only if the configuration and field in its query expression match those in the index definition. For example, of the two queries below, the first can be answered through post_body_idx_1, while the second has no matching index and can only be executed as a full table scan.

postgres=# explain select title from post where to_tsvector('english', body) @@ to_tsquery('physics | math');
                                   QUERY PLAN
-----------------------------------------------------------------------------------
 id |            operation            | E-rows | E-width | E-costs
----+---------------------------------+--------+---------+---------
  1 | ->  Streaming (type: GATHER)    |      1 |      32 | 42.02
  2 |    ->  Bitmap Heap Scan on post |      1 |      32 | 16.02
  3 |       ->  Bitmap Index Scan     |      1 |       0 | 12.00

postgres=# explain select title from post where to_tsvector('french', body) @@ to_tsquery('physics | math');
                                   QUERY PLAN
-----------------------------------------------------------------------------------
 id |          operation           | E-rows | E-width | E-costs
----+------------------------------+--------+---------+------------------
  1 | ->  Streaming (type: GATHER) |      1 |      32 | 1000000002360.50
  2 |    ->  Seq Scan on post      |      1 |      32 | 1000000002334.50

4. Text Search Configuration

This section describes how GaussDB(DWS) preprocesses documents, in other words how to_tsvector works.

Document preprocessing is generally divided into the following three steps:

  • The first step is to extract the words or phrases from the text one by one. This work is performed by a parser or word segmenter. Afterwards, the document has become a sequence of tokens.
  • The second step normalizes the tokens obtained in the previous step according to the specified rules, including removing suffixes and prefixes, converting synonyms, removing stop words, and so on, thereby producing lexemes. This is driven by a dictionary; that is, a dictionary defines the normalization rules.
  • Finally, the position (and optionally the weight) of each lexeme is recorded to obtain the tsvector, as illustrated by the sketch after this list.
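These steps can be observed with the ts_debug diagnostic function inherited from PostgreSQL (assuming it is available in your GaussDB(DWS) version; the output below is abridged and its exact formatting may vary). For each token it shows the token type, the dictionaries consulted, and the resulting lexemes:

postgres=# select token, alias, dictionaries, lexemes from ts_debug('english', 'a fat cat ate fat rats');
 token |   alias   |  dictionaries  | lexemes
-------+-----------+----------------+---------
 a     | asciiword | {english_stem} | {}
 fat   | asciiword | {english_stem} | {fat}
 cat   | asciiword | {english_stem} | {cat}
 ...

Here "a" maps to an empty lexeme list because english_stem treats it as a stop word.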

As the description above shows, once a parser and a set of dictionaries are given, the rules for document preprocessing are determined. In GaussDB(DWS), this set of rules is called a text search configuration. The text search configuration determines the results and quality of the match.

As shown in the figure below, a text search configuration consists of a parser and a set of dictionaries. The input document is first broken down into tokens by the parser; then each token is looked up dictionary by dictionary, and if it is found in a dictionary, it is normalized according to the rules of that dictionary. Some dictionaries mark the token as processed after normalization, so that subsequent dictionaries will not handle it again. Others emit new tokens after normalization and pass them on to the dictionaries that follow; such dictionaries are called filtering dictionaries.

Figure 1 Document preprocessing process

The parser used by a configuration is specified when the configuration is created and cannot be changed afterwards, for example,

postgres=# create text search configuration mytsconf (parser = default);
CREATE TEXT SEARCH CONFIGURATION

GaussDB(DWS) comes with four built-in parsers and currently does not support custom parsers.

postgres=# select prsname from pg_ts_parser;
 prsname 
----------
 default
 ngram
 pound
 zhparser
(4 rows)

Dictionaries are specified with the ALTER TEXT SEARCH CONFIGURATION command, for example

postgres=# alter text search configuration mytsconf add mapping for asciiword with english_stem,simple;
ALTER TEXT SEARCH CONFIGURATION

This specifies that mytsconf uses the english_stem and simple dictionaries to normalize tokens of type "asciiword".

The “asciiword” in the above statement is a token type. Different parsers classify tokens in different ways, which can be viewed through the ts_token_type function. For example, the ‘default’ parser classifies tokens into the following 23 types:

postgres=# select * from ts_token_type('default');
 tokid |      alias      |               description                
-------+-----------------+------------------------------------------
     1 | asciiword       | Word, all ASCII
     2 | word            | Word, all letters
     3 | numword         | Word, letters and digits
     4 | email           | Email address
     5 | url             | URL
     6 | host            | Host
     7 | sfloat          | Scientific notation
     8 | version         | Version number
     9 | hword_numpart   | Hyphenated word part, letters and digits
    10 | hword_part      | Hyphenated word part, all letters
    11 | hword_asciipart | Hyphenated word part, all ASCII
    12 | blank           | Space symbols
    13 | tag             | XML tag
    14 | protocol        | Protocol head
    15 | numhword        | Hyphenated word, letters and digits
    16 | asciihword      | Hyphenated word, all ASCII
    17 | hword           | Hyphenated word, all letters
    18 | url_path        | URL path
    19 | file            | File or path name
    20 | float           | Decimal notation
    21 | int             | Signed integer
    22 | uint            | Unsigned integer
    23 | entity          | XML entity
(23 rows)

The dictionaries available in the current database can be queried through the system catalog pg_ts_dict.
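For example (the output below is abridged and will vary with your installation; it is shown only as an illustration):

postgres=# select dictname from pg_ts_dict;
   dictname
--------------
 simple
 english_stem
 french_stem
 ...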

If a configuration is specified, the document is preprocessed according to that configuration, as in the command that created the GIN index in the previous section. If no configuration is specified, to_tsvector uses the default configuration given by the variable default_text_search_config.

postgres=# show default_text_search_config;  -- view the current default configuration
 default_text_search_config
-----------------------------
 pg_catalog.english
(1 row)
postgres=# set default_text_search_config = mytsconf;
SET
postgres=# show default_text_search_config;
 default_text_search_config
-----------------------------
 public.mytsconf
(1 row)
postgres=# reset default_text_search_config;
RESET
postgres=# show default_text_search_config;
 default_text_search_config
-----------------------------
 pg_catalog.english
(1 row)

Note that default_text_search_config is a session-level variable and is only valid in the current session. If you want the default configuration to persist, you can set the variable of the same name in the postgresql.conf configuration file, for example:
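A minimal sketch of the relevant postgresql.conf line (assuming mytsconf was created in the public schema):

# postgresql.conf -- cluster-wide default text search configuration
default_text_search_config = 'public.mytsconf'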

You need to restart the process after the change.

Conclusion

The full-text search module of GaussDB(DWS) provides powerful document search capabilities. Compared with pattern matching using the LIKE keyword or the ~ operator, full-text search offers richer semantic and syntactic support and handles natural-language text more intelligently. Combined with a suitable index, it enables efficient document retrieval.
