This is my 27th day of The August Update Challenge

Elasticsearch already comes with a number of built-in analyzers, such as Standard, Whitespace, and Lowercase. These work well for English and similar languages, but Elasticsearch also supports custom analyzers to meet the needs of speakers of other languages.

An analyzer consists of zero or more character filters, exactly one tokenizer, and zero or more token filters, executed in sequence: original string > character filters > tokenizer > token filters, finally producing a set of tokens.

Character filter

A character filter pre-processes the raw string before it is tokenized, cleaning it up or normalizing it. For example, if you are dealing with data in HTML format, you can use the html_strip character filter provided by Elasticsearch, which removes HTML tags, such as tag pairs like <b> and </b>. An analyzer can have zero or more character filters.
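To get a feel for what an HTML-stripping character filter does, here is a rough Python sketch. This is only an illustration of the idea, not Elasticsearch's actual implementation (the real html_strip filter also decodes HTML entities, among other things):

```python
import re

def html_strip(text: str) -> str:
    """Remove HTML tags from a string, roughly mimicking
    the html_strip character filter."""
    # Drop anything between '<' and '>' (a simplification).
    return re.sub(r"<[^>]+>", "", text)

print(html_strip("<p>I'm so <b>happy</b>!</p>"))  # I'm so happy!
```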

Tokenizer

The tokenizer is responsible for breaking the string produced by the character filters into individual terms, or tokens. Elasticsearch provides many built-in tokenizers. The Standard tokenizer, for example, splits text into terms on word boundaries and removes most punctuation; the Whitespace tokenizer simply splits text into terms on whitespace. An analyzer is allowed to have only one tokenizer.
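The whitespace tokenizer's behavior can be sketched in a line of Python (again, just an illustration of the concept, not how Elasticsearch implements it):

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Split text into tokens on runs of whitespace,
    like the whitespace tokenizer."""
    return text.split()

print(whitespace_tokenize("The QUICK brown fox"))  # ['The', 'QUICK', 'brown', 'fox']
```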

Token filter

A token filter processes each term produced by the tokenizer; when an analyzer has several token filters, each term passes through them in sequence. For example, passing each term through the lowercase token filter yields the lowercase form of that term.
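A lowercase token filter can be sketched as a simple transformation over the token list (illustrative only, not Elasticsearch's implementation):

```python
def lowercase_filter(tokens: list[str]) -> list[str]:
    """Lowercase every token, like the lowercase token filter."""
    return [token.lower() for token in tokens]

print(lowercase_filter(["The", "QUICK", "Brown"]))  # ['the', 'quick', 'brown']
```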

Custom analyzer

We want to remove HTML tags from the input, then split the text on whitespace, and finally lowercase each token.

First, a character filter that removes HTML tags:

"char_filter": {
        "html_strip_filter": {
          "type": "html_strip"
        }
      }

Then, the built-in tokenizer that splits on whitespace:

"tokenizer": "whitespace",

Finally, a token filter that lowercases every token:

"filter": {
                "lowercase_filter": {
                    "type": "lowercase"
            }}

Putting it all together:

{
  "settings": {
    "analysis": {
      "char_filter": {
        "html_strip_filter": {
          "type": "html_strip"
        }
      },
      "filter": {
        "lowercase_filter": {
          "type": "lowercase"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip_filter"],
          "tokenizer": "whitespace",
          "filter": ["lowercase_filter"]
        }
      }
    }
  }
}
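The three stages of my_analyzer can be sketched end to end in Python. This is only a rough simulation of what the custom analyzer above does to its input, not the real analysis pipeline:

```python
import re

def my_analyzer(text: str) -> list[str]:
    """Simulate the custom analyzer: html_strip character filter,
    whitespace tokenizer, then lowercase token filter."""
    # 1. Character filter: remove HTML tags (like html_strip_filter).
    text = re.sub(r"<[^>]+>", "", text)
    # 2. Tokenizer: split on whitespace (like the whitespace tokenizer).
    tokens = text.split()
    # 3. Token filter: lowercase each token (like lowercase_filter).
    return [token.lower() for token in tokens]

print(my_analyzer("<p>The QUICK Brown Fox</p>"))  # ['the', 'quick', 'brown', 'fox']
```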