Elasticsearch Analyzer

What is Analysis?

As the name implies, text analysis is the process of converting full text into a series of words (terms/tokens), also known as tokenization or word segmentation. In Elasticsearch, analysis is performed by an Analyzer, which can be one of the built-in analyzers or a custom analyzer defined on demand.

For example, the text "Thinking in Elasticsearch" can be split into three terms: thinking, in, elasticsearch.

The components of an Analyzer

  • Character Filters: preprocess the raw text, e.g. stripping HTML tags
  • Tokenizer: splits the text into terms according to a set of rules, e.g. on whitespace
  • Token Filters: post-process the terms produced by the tokenizer, e.g. lowercasing, removing stop words, adding synonyms

These three parts run in a fixed order, from top to bottom: Character Filters, then the Tokenizer, then Token Filters. In other words, incoming text is first preprocessed by the Character Filters, then split into terms by the Tokenizer, and finally the resulting terms are processed by the Token Filters.
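
This pipeline can be illustrated with a minimal Python sketch (an illustration only, not Elasticsearch's actual implementation): a character filter that strips HTML tags, a whitespace tokenizer, and a lowercasing token filter:

```python
import re

def char_filter(text):
    # Character filter: strip HTML tags from the raw text
    return re.sub(r"<[^>]+>", "", text)

def tokenizer(text):
    # Tokenizer: split the filtered text on whitespace
    return text.split()

def token_filters(tokens):
    # Token filter: lowercase every term
    return [t.lower() for t in tokens]

def analyze(text):
    # Run the three stages in order: char filters -> tokenizer -> token filters
    return token_filters(tokenizer(char_filter(text)))

print(analyze("<b>Thinking</b> in Elasticsearch"))
# ['thinking', 'in', 'elasticsearch']
```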

Built-in Analyzers in Elasticsearch

Standard Analyzer

  • The standard analyzer built into Elasticsearch; it is used by default when no analyzer is specified.

  • Splits text on word boundaries and works for most languages

  • Lowercases terms, removes most punctuation, and supports removing a configurable list of stop words.

  • Standard Tokenizer

  • Lower Case Token Filter

  • Stop Token Filter (disabled by default)

    • The default words to delete are:

      a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with

Simple Analyzer

  • Lower case processing
  • Splits on any non-letter character (digits, apostrophes, spaces, hyphens, and so on); the non-letter characters themselves are discarded.
  • Lower Case Tokenizer
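
A rough Python approximation of the Simple Analyzer's behavior (an illustration, not the real Lucene tokenizer): split on runs of non-letter characters and lowercase what remains.

```python
import re

def simple_analyze(text):
    # Split on runs of non-letter characters, then lowercase;
    # digits, apostrophes, hyphens etc. act as separators and are discarded
    return [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]

print(simple_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['the', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
```

Note how "2" is dropped entirely and "dog's" is split into dog and s, matching the Simple Analyzer output shown later in this article.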

Whitespace Analyzer

  • Splits text on whitespace only
  • Whitespace Tokenizer

Stop Analyzer

  • Compared to the Simple Analyzer, adds a Stop Token Filter that removes stop words such as a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with.
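
A Python sketch of this behavior (assuming the same split-on-non-letters rule as the Simple Analyzer, plus the stop word list above; not the actual Lucene implementation):

```python
import re

# The default English stop word list from the article above
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
}

def stop_analyze(text):
    # Tokenize like the simple analyzer, then drop English stop words
    tokens = [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]
    return [t for t in tokens if t not in STOP_WORDS]

print(stop_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['quick', 'brown', 'foxes', 'jumped', 'over', 'lazy', 'dog', 's', 'bone']
```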

Keyword Analyzer

  • Performs no tokenization at all; outputs the entire input as a single term
  • Keyword Tokenizer

Pattern Analyzer

  • Tokenizes using a regular expression.
  • The default pattern is \W+, splitting on runs of non-word characters
  • Pattern Tokenizer
  • Lower Case Token Filter
  • Stop Token Filter (disabled by default)
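
A minimal Python sketch of the same idea (splitting on a configurable regex, defaulting to \W+, and lowercasing; an illustration, not the actual Lucene implementation):

```python
import re

def pattern_analyze(text, pattern=r"\W+"):
    # Split on the regex (default \W+, i.e. runs of non-word characters)
    # and lowercase, mirroring the pattern analyzer's defaults
    return [t.lower() for t in re.split(pattern, text) if t]

print(pattern_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['the', '2', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
```

Unlike the simple analyzer, \W+ treats digits and underscores as word characters, so the token 2 is kept.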

Language Analyzer

A set of analyzers aimed at analyzing specific language text. The following types are supported: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.

Fingerprint Analyzer

  • Fingerprint analyzer: lowercases, removes extended characters, sorts the terms, removes duplicates, and concatenates the result into a single token
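
A simplified Python sketch of the fingerprint algorithm (ignoring the ASCII folding and punctuation handling the real analyzer also performs):

```python
def fingerprint_analyze(text):
    # Lowercase, split on whitespace, de-duplicate, sort,
    # then join everything back into one token
    tokens = sorted(set(text.lower().split()))
    return " ".join(tokens)

print(fingerprint_analyze("Yes yes the fox the fox"))
# 'fox the yes'
```

Because duplicates and ordering are normalized away, two texts containing the same words always produce the same fingerprint, which is useful for deduplication.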

Custom Analyzer

The difficulty of Chinese word segmentation

  • A Chinese sentence must be cut into words, not left as one undivided string
  • English words are separated by spaces, while Chinese has no natural delimiter between words
  • The same Chinese characters can carry different meanings depending on context and segmentation
    • e.g. 这个苹果不大好吃 can be segmented as 这个苹果，不大好吃 ("this apple is not very tasty") or 这个苹果不大，好吃 ("this apple is not big, but tasty")

Chinese word divider

ICU Analyzer
  • Requires installing a plugin:
    • elasticsearch-plugin install analysis-icu
  • Provides Unicode support, better support for Asian languages
  • Normalization Character Filter
  • ICU Tokenizer
  • Normalization Token Filter
  • Folding Token Filter
  • Collation Token Filter
  • Transform Token Filter
IK Analyzer
  • A lightweight Chinese word-segmentation toolkit developed in Java
  • Supports custom dictionaries and hot-reloading of dictionary updates.
  • Supports both fine-grained and smart (coarse-grained) segmentation modes.
  • Github.com/medcl/elast…

Resources: Medcl's GitHub, which provides many custom analyzers

The _analyze API

Standard Analyzer

  • The default configuration

Request:

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Results:

{
  "tokens" : [
    { "token" : "the", "start_offset" : 0, "end_offset" : 3, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "2", "start_offset" : 4, "end_offset" : 5, "type" : "<NUM>", "position" : 1 },
    { "token" : "quick", "start_offset" : 6, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "brown", "start_offset" : 12, "end_offset" : 17, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "foxes", "start_offset" : 18, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "jumped", "start_offset" : 24, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 5 },
    { "token" : "over", "start_offset" : 31, "end_offset" : 35, "type" : "<ALPHANUM>", "position" : 6 },
    { "token" : "the", "start_offset" : 36, "end_offset" : 39, "type" : "<ALPHANUM>", "position" : 7 },
    { "token" : "lazy", "start_offset" : 40, "end_offset" : 44, "type" : "<ALPHANUM>", "position" : 8 },
    { "token" : "dog's", "start_offset" : 45, "end_offset" : 50, "type" : "<ALPHANUM>", "position" : 9 },
    { "token" : "bone", "start_offset" : 51, "end_offset" : 55, "type" : "<ALPHANUM>", "position" : 10 }
  ]
}
  • Modified configuration: enable the Stop Token Filter and set max_token_length

Request:

PUT standard-demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST standard-demo/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Results:

{
  "tokens" : [
    { "token" : "2", "start_offset" : 4, "end_offset" : 5, "type" : "<NUM>", "position" : 1 },
    { "token" : "quick", "start_offset" : 6, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "brown", "start_offset" : 12, "end_offset" : 17, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "foxes", "start_offset" : 18, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "jumpe", "start_offset" : 24, "end_offset" : 29, "type" : "<ALPHANUM>", "position" : 5 },
    { "token" : "d", "start_offset" : 29, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 6 },
    { "token" : "over", "start_offset" : 31, "end_offset" : 35, "type" : "<ALPHANUM>", "position" : 7 },
    { "token" : "lazy", "start_offset" : 40, "end_offset" : 44, "type" : "<ALPHANUM>", "position" : 9 },
    { "token" : "dog's", "start_offset" : 45, "end_offset" : 50, "type" : "<ALPHANUM>", "position" : 10 },
    { "token" : "bone", "start_offset" : 51, "end_offset" : 55, "type" : "<ALPHANUM>", "position" : 11 }
  ]
}

Note that because max_token_length is 5, "jumped" is cut into jumpe and d, and the English stop word "the" is removed.

Simple Analyzer

Request:

POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Results:

{
  "tokens" : [
    { "token" : "the", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 0 },
    { "token" : "quick", "start_offset" : 6, "end_offset" : 11, "type" : "word", "position" : 1 },
    { "token" : "brown", "start_offset" : 12, "end_offset" : 17, "type" : "word", "position" : 2 },
    { "token" : "foxes", "start_offset" : 18, "end_offset" : 23, "type" : "word", "position" : 3 },
    { "token" : "jumped", "start_offset" : 24, "end_offset" : 30, "type" : "word", "position" : 4 },
    { "token" : "over", "start_offset" : 31, "end_offset" : 35, "type" : "word", "position" : 5 },
    { "token" : "the", "start_offset" : 36, "end_offset" : 39, "type" : "word", "position" : 6 },
    { "token" : "lazy", "start_offset" : 40, "end_offset" : 44, "type" : "word", "position" : 7 },
    { "token" : "dog", "start_offset" : 45, "end_offset" : 48, "type" : "word", "position" : 8 },
    { "token" : "s", "start_offset" : 49, "end_offset" : 50, "type" : "word", "position" : 9 },
    { "token" : "bone", "start_offset" : 51, "end_offset" : 55, "type" : "word", "position" : 10 }
  ]
}

Stop Analyzer

Request:

POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Results:

{
  "tokens" : [
    { "token" : "quick", "start_offset" : 6, "end_offset" : 11, "type" : "word", "position" : 1 },
    { "token" : "brown", "start_offset" : 12, "end_offset" : 17, "type" : "word", "position" : 2 },
    { "token" : "foxes", "start_offset" : 18, "end_offset" : 23, "type" : "word", "position" : 3 },
    { "token" : "jumped", "start_offset" : 24, "end_offset" : 30, "type" : "word", "position" : 4 },
    { "token" : "over", "start_offset" : 31, "end_offset" : 35, "type" : "word", "position" : 5 },
    { "token" : "lazy", "start_offset" : 40, "end_offset" : 44, "type" : "word", "position" : 7 },
    { "token" : "dog", "start_offset" : 45, "end_offset" : 48, "type" : "word", "position" : 8 },
    { "token" : "s", "start_offset" : 49, "end_offset" : 50, "type" : "word", "position" : 9 },
    { "token" : "bone", "start_offset" : 51, "end_offset" : 55, "type" : "word", "position" : 10 }
  ]
}

User-defined filtering conditions:

PUT stop-demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]
        }
      }
    }
  }
}

POST stop-demo/_analyze
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Results:

{
  "tokens" : [
    { "token" : "quick", "start_offset" : 6, "end_offset" : 11, "type" : "word", "position" : 1 },
    { "token" : "brown", "start_offset" : 12, "end_offset" : 17, "type" : "word", "position" : 2 },
    { "token" : "foxes", "start_offset" : 18, "end_offset" : 23, "type" : "word", "position" : 3 },
    { "token" : "jumped", "start_offset" : 24, "end_offset" : 30, "type" : "word", "position" : 4 },
    { "token" : "lazy", "start_offset" : 40, "end_offset" : 44, "type" : "word", "position" : 7 },
    { "token" : "dog", "start_offset" : 45, "end_offset" : 48, "type" : "word", "position" : 8 },
    { "token" : "s", "start_offset" : 49, "end_offset" : 50, "type" : "word", "position" : 9 },
    { "token" : "bone", "start_offset" : 51, "end_offset" : 55, "type" : "word", "position" : 10 }
  ]
}

Keyword Analyzer

Request:

POST _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Results:

{
  "tokens" : [
    { "token" : "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.", "start_offset" : 0, "end_offset" : 56, "type" : "word", "position" : 0 }
  ]
}

Pattern Analyzer

Request:

POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Results:

{
  "tokens" : [
    { "token" : "the", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 0 },
    { "token" : "2", "start_offset" : 4, "end_offset" : 5, "type" : "word", "position" : 1 },
    { "token" : "quick", "start_offset" : 6, "end_offset" : 11, "type" : "word", "position" : 2 },
    { "token" : "brown", "start_offset" : 12, "end_offset" : 17, "type" : "word", "position" : 3 },
    { "token" : "foxes", "start_offset" : 18, "end_offset" : 23, "type" : "word", "position" : 4 },
    { "token" : "jumped", "start_offset" : 24, "end_offset" : 30, "type" : "word", "position" : 5 },
    { "token" : "over", "start_offset" : 31, "end_offset" : 35, "type" : "word", "position" : 6 },
    { "token" : "the", "start_offset" : 36, "end_offset" : 39, "type" : "word", "position" : 7 },
    { "token" : "lazy", "start_offset" : 40, "end_offset" : 44, "type" : "word", "position" : 8 },
    { "token" : "dog", "start_offset" : 45, "end_offset" : 48, "type" : "word", "position" : 9 },
    { "token" : "s", "start_offset" : 49, "end_offset" : 50, "type" : "word", "position" : 10 },
    { "token" : "bone", "start_offset" : 51, "end_offset" : 55, "type" : "word", "position" : 11 }
  ]
}

Custom regular expressions:

PUT pattern-demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type":      "pattern",
          "pattern":   "\\W|_", 
          "lowercase": true
        }
      }
    }
  }
}

POST pattern-demo/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}


Results:

{
  "tokens" : [
    { "token" : "john", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 },
    { "token" : "smith", "start_offset" : 5, "end_offset" : 10, "type" : "word", "position" : 1 },
    { "token" : "foo", "start_offset" : 11, "end_offset" : 14, "type" : "word", "position" : 2 },
    { "token" : "bar", "start_offset" : 15, "end_offset" : 18, "type" : "word", "position" : 3 },
    { "token" : "com", "start_offset" : 19, "end_offset" : 22, "type" : "word", "position" : 4 }
  ]
}
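
The same split can be reproduced with Python's re module, using the address implied by the token offsets above (an illustration only, not the Lucene engine Elasticsearch actually uses):

```python
import re

# The custom analyzer's pattern: any non-word character, or an underscore
tokens = [t for t in re.split(r"\W|_", "John_Smith@foo-bar.com".lower()) if t]
print(tokens)
# ['john', 'smith', 'foo', 'bar', 'com']
```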

Language Analyzer (English)

Besides removing stop words, the English analyzer stems terms, as the response below shows: "foxes" becomes fox, "jumped" becomes jump, and "lazy" becomes lazi.

Request:

POST _analyze
{
  "analyzer": "english",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Results:

{
  "tokens" : [
    { "token" : "2", "start_offset" : 4, "end_offset" : 5, "type" : "<NUM>", "position" : 1 },
    { "token" : "quick", "start_offset" : 6, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "brown", "start_offset" : 12, "end_offset" : 17, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "fox", "start_offset" : 18, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "jump", "start_offset" : 24, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 5 },
    { "token" : "over", "start_offset" : 31, "end_offset" : 35, "type" : "<ALPHANUM>", "position" : 6 },
    { "token" : "lazi", "start_offset" : 40, "end_offset" : 44, "type" : "<ALPHANUM>", "position" : 8 },
    { "token" : "dog", "start_offset" : 45, "end_offset" : 50, "type" : "<ALPHANUM>", "position" : 9 },
    { "token" : "bone", "start_offset" : 51, "end_offset" : 55, "type" : "<ALPHANUM>", "position" : 10 }
  ]
}

ICU Analyzer

Request:

POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "他说的确实在理"
}

Results:

{
  "tokens" : [
    { "token" : "他", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 0 },
    { "token" : "说的", "start_offset" : 1, "end_offset" : 3, "type" : "<IDEOGRAPHIC>", "position" : 1 },
    { "token" : "确实", "start_offset" : 3, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 2 },
    { "token" : "在", "start_offset" : 5, "end_offset" : 6, "type" : "<IDEOGRAPHIC>", "position" : 3 },
    { "token" : "理", "start_offset" : 6, "end_offset" : 7, "type" : "<IDEOGRAPHIC>", "position" : 4 }
  ]
}

Resources

  • www.elastic.co/guide/en/el…