Understanding the Elasticsearch analyzer

This article examines the text-analysis component of ES, which ES calls an analyzer. Understanding analyzers helps us see how a search engine indexes text data and how full-text search works at a basic level. It also lets us define our own analyzers to meet specific business requirements and improve search accuracy.

This article mainly follows the Analyzer section of the Elasticsearch Reference. Since that part of the Reference is in English, this article offers a framework for reading it.

Note: this article was tested and verified against Elasticsearch 7.9 and does not cover ES installation or deployment. All demo requests are submitted through Kibana's Dev Tools console.

1. Basic Introduction

The first step is to distinguish between an analyzer and a tokenizer. In ES, an analyzer is not the same thing as a tokenizer; the tokenizer is only one part of the analyzer. The relationship can be expressed with the following formula, where the parts in square brackets are optional (zero or more char filters, exactly one tokenizer, zero or more token filters):

analyzer = [char_filter] + tokenizer + [token filter]

An ES analyzer consists of the following three parts:

  • Char filter: performs the first pass over the input characters, such as removing HTML tags (html_strip) or mapping emoticon characters to English words (mapping); a small example follows this list. For the ES built-in character filters, see the Char Filter Reference.

  • Tokenizer: splits the text into units, for example the whitespace and standard tokenizers. In ES (and the underlying Lucene) the unit produced by this stage is called a token. The ES built-in tokenizers are listed in the Tokenizer Reference.

  • Filter (token filter): filters, converts (modifies), and deletes elements of the token stream. For example, the units cut out by the whitespace tokenizer can be reduced to their stem form (driven -> drive), converted uniformly to lowercase (lowercase), or dropped as stop words (stop). ES has abundant built-in token filters; for details, see the Token Filter Reference. The output of the token filter stage is called a term.
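For example, a char filter can be tested in isolation by combining it with a tokenizer directly in the _analyze API. This is a minimal sketch; the emoticon mappings below are made up for illustration:

POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [":) => happy", ":( => sad"]
    }
  ],
  "text": "I feel :) today"
}

The char filter rewrites :) to happy before the tokenizer runs, so the returned tokens are I, feel, happy, today.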

A field of type text must be processed by an analyzer before its value is written to the index.

The analyzer processes text in the following order: input text -> char filters -> tokenizer -> token filters, and the resulting terms are written to the inverted index.
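As a minimal sketch (the index name my_index and field name title are placeholders), this is how an analyzer is assigned to a text field in the index mapping; if no analyzer is specified, ES applies the standard analyzer by default:

PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard"
      }
    }
  }
}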

2. ES built-in analyzer

For details, see the ES Built-in Analyzer Reference. Below, the standard analyzer is used as an example; it is only one of the built-in analyzers, each of which is described in great detail in the official ES documentation. The other built-in analyzers can be studied along the same lines.

standard analyzer

Usually just called the standard analyzer. For details, see the official introduction to the Standard Analyzer. Definition formula:

standard analyzer = [] + standard tokenizer + ["lowercase token filter", "stop token filter"]

From the formula above, we can see that the standard analyzer has the following characteristics:

  1. No char_filter is configured, so HTML tags are not removed; the char filter phase does nothing. See the example below.
  2. The standard tokenizer splits the text on word boundaries (Chinese text is split character by character) and strips most punctuation, but he's is kept as a single token.
  3. In the token filter phase, tokens are converted to lowercase; the stop token filter is disabled by default, so stop words are only removed if it is configured (see the configuration example after the test below).

Test the standard analyzer with the following text:

POST _analyze
{
  "analyzer": "standard",
  "text": "<html> Hello world !!! He's </ HTML >"
}

The HTML tags are not removed (only the < and > characters are stripped as punctuation), he's is treated as a single token, and the punctuation !!! is dropped:

{
  "tokens" : [
    {
      "token" : "html",
      "start_offset" : 1,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "hello",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "world",
      "start_offset" : 13,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "he's",
      "start_offset" : 23,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "html",
      "start_offset" : 31,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}
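If stop-word filtering is needed, the standard analyzer accepts a stopwords parameter when an index is created. A sketch, assuming a new index named standard_with_stopwords and the predefined _english_ stop-word list:

PUT standard_with_stopwords
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_standard": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}

With this configuration, common English stop words such as a, the, and is are removed in the token filter phase.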

3. _analyze API

ES provides the _analyze API for debugging analyzers:

POST _analyze

This API helps us build a good understanding of an analyzer's char filter, tokenizer, and token filter stages, and helps us debug our own analyzers.

  1. To debug a complete analyzer, specify the analyzer name:
POST _analyze
{
  "text": "<html> hello world !!! He's  </html>",
  "analyzer": "standard"
}
  2. To debug a char_filter / tokenizer / token filter, you can specify all of them at the same time, or only some of them:
POST _analyze
{
  "text": "<html> hello world !!! \n He's a student </html>",
  "char_filter": ["html_strip"], 
  "tokenizer": "whitespace", 
  "filter": [{
   "type":"stop",
   "stopwords":["a","the","is"]
   },
   "lowercase"
 ]
}
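The _analyze API also accepts an "explain": true flag, which reports the output of every stage (char filters, the tokenizer, and each token filter) separately, making it easy to see which component changed a token. For example, re-running the request above with explain enabled:

POST _analyze
{
  "text": "<html> hello world !!! \n He's a student </html>",
  "char_filter": ["html_strip"],
  "tokenizer": "whitespace",
  "filter": [{
   "type":"stop",
   "stopwords":["a","the","is"]
   },
   "lowercase"
 ],
  "explain": true
}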

Once a combination debugged through the API meets our business requirements, we can save that composition as a custom analyzer.

4. Customize the Analyzer

If the analyzers built into ES do not meet our business needs, we can define a custom analyzer. When creating an index, we define our own analyzer by choosing a particular combination of char_filter, tokenizer, and filter (token filter). In the index mapping, the custom analyzer can then be selected for a specific field to meet the business requirements.

The following example defines a custom analyzer with the following characteristics (the full request follows the list):

    1. Char filter: two custom char filters:
    • my_html_strip: strips HTML tags
    • my_punctuation_mapping: converts specific punctuation marks as follows: * => _, = => ~
    2. Tokenizer: whitespace, which splits on whitespace
    3. Token filter: a custom stop-word filter that removes "is", "a", "the"
PUT analyzer_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {   // the custom analyzer
          "type": "custom",
          "char_filter": ["my_html_strip", "my_punctuation_mapping"],   // char filter sequence, custom or built-in
          "tokenizer": "my_tokenizer",
          "filter": ["my_stop_token_filter"]   // token filter sequence, custom or built-in
        }
      },
      "char_filter": {
        "my_punctuation_mapping": {
          "type": "mapping",
          "mappings": ["* => _", "= => ~"]
        },
        "my_html_strip": {
          "type": "html_strip"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "whitespace"
        }
      },
      "filter": {
        "my_stop_token_filter": {
          "type": "stop",
          "ignore_case": true,
          "stopwords": ["is", "a", "the"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

!!! Note: inside the custom analyzer definition, the char filter and token filter settings are arrays, but their keys are the singular char_filter and filter, not the plural char_filters and filters. The current version of ES does not report an error if the plural forms are written by mistake, but the analyzer will not behave as expected; I stepped into this pit myself.

Test the custom analyzer; here are the request and the results:

GET analyzer_demo/_analyze
{
  "text": "<html> Hello World!! 123-456 789*123 , he is a student  </html>",
  "analyzer": "my_analyzer"
}

{
  "tokens" : [
    {
      "token" : "Hello",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "World!!",
      "start_offset" : 13,
      "end_offset" : 20,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "123-456",
      "start_offset" : 21,
      "end_offset" : 28,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "789_123",
      "start_offset" : 29,
      "end_offset" : 36,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : ",",
      "start_offset" : 37,
      "end_offset" : 38,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "he",
      "start_offset" : 39,
      "end_offset" : 41,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "student",
      "start_offset" : 47,
      "end_offset" : 54,
      "type" : "word",
      "position" : 8
    }
  ]
}

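As a final check (using the analyzer_demo index created above), the _analyze API can also be given a field name instead of an analyzer name. ES then uses the analyzer configured for that field in the mapping, so the following request returns the same tokens as the analyzer-based request above:

GET analyzer_demo/_analyze
{
  "field": "name",
  "text": "<html> Hello World!! 123-456 789*123 , he is a student  </html>"
}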