Elasticsearch Analyzer

What is Analysis?

As the name implies, text analysis is the process of converting full text into a series of words (terms/tokens), also known as tokenization or word segmentation. In Elasticsearch, analysis is performed by an Analyzer, which can be one of the built-in analyzers or a custom analyzer defined on demand.

For example, the text "Thinking in Elasticsearch" can be split into three terms: thinking, in, elasticsearch.

The components of an Analyzer

  • Character Filters: preprocess the raw text, e.g. stripping HTML tags
  • Tokenizer: splits the text into terms according to a set of rules, e.g. on whitespace
  • Token Filters: post-process the terms produced by the tokenizer, e.g. lowercasing, removing stop words, adding synonyms

These three parts run in a fixed order, from top to bottom: Character Filters, then the Tokenizer, then Token Filters. In other words, incoming text is first preprocessed by the Character Filters, then split into terms by the Tokenizer, and finally the resulting terms are processed by the Token Filters.
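
This pipeline can be illustrated with a minimal Python sketch (an illustration only, not Elasticsearch's actual implementation): a character filter that strips HTML tags, a whitespace tokenizer, and a lowercasing token filter:

```python
import re

def char_filter(text):
    # Character filter: strip HTML tags from the raw text
    return re.sub(r"<[^>]+>", "", text)

def tokenizer(text):
    # Tokenizer: split the filtered text on whitespace
    return text.split()

def token_filters(tokens):
    # Token filter: lowercase every term
    return [t.lower() for t in tokens]

def analyze(text):
    # Run the three stages in order: char filters -> tokenizer -> token filters
    return token_filters(tokenizer(char_filter(text)))

print(analyze("<b>Thinking</b> in Elasticsearch"))
# ['thinking', 'in', 'elasticsearch']
```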

Built-in Analyzers in Elasticsearch

Standard Analyzer

  • The standard analyzer built into Elasticsearch; it is used by default when no analyzer is specified.

  • Splits text on word boundaries and works for most languages

  • Lowercases terms, removes most punctuation, and supports removing a configurable list of stop words.

  • Standard Tokenizer

  • Lower Case Token Filter

  • Stop Token Filter (disabled by default)

    • The default words to delete are:

      a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with

Simple Analyzer

  • Lower case processing
  • Splits on any non-letter character (digits, apostrophes, spaces, hyphens, and so on); the non-letter characters themselves are discarded.
  • Lower Case Tokenizer
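
A rough Python approximation of the Simple Analyzer's behavior (an illustration, not the real Lucene tokenizer): split on runs of non-letter characters and lowercase what remains.

```python
import re

def simple_analyze(text):
    # Split on runs of non-letter characters, then lowercase;
    # digits, apostrophes, hyphens etc. act as separators and are discarded
    return [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]

print(simple_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['the', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
```

Note how "2" is dropped entirely and "dog's" is split into dog and s, matching the Simple Analyzer output shown later in this article.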

Whitespace Analyzer

  • Splits text on whitespace only
  • Whitespace Tokenizer

Stop Analyzer

  • Compared to the Simple Analyzer, adds a Stop Token Filter that removes stop words such as a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with.
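
A Python sketch of this behavior (assuming the same split-on-non-letters rule as the Simple Analyzer, plus the stop word list above; not the actual Lucene implementation):

```python
import re

# The default English stop word list from the article above
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
}

def stop_analyze(text):
    # Tokenize like the simple analyzer, then drop English stop words
    tokens = [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]
    return [t for t in tokens if t not in STOP_WORDS]

print(stop_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['quick', 'brown', 'foxes', 'jumped', 'over', 'lazy', 'dog', 's', 'bone']
```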

Keyword Analyzer

  • Performs no tokenization at all; outputs the entire input as a single term
  • Keyword Tokenizer

Pattern Analyzer

  • Tokenizes using a regular expression.
  • The default pattern is \W+, splitting on runs of non-word characters
  • Pattern Tokenizer
  • Lower Case Token Filter
  • Stop Token Filter (disabled by default)
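
A minimal Python sketch of the same idea (splitting on a configurable regex, defaulting to \W+, and lowercasing; an illustration, not the actual Lucene implementation):

```python
import re

def pattern_analyze(text, pattern=r"\W+"):
    # Split on the regex (default \W+, i.e. runs of non-word characters)
    # and lowercase, mirroring the pattern analyzer's defaults
    return [t.lower() for t in re.split(pattern, text) if t]

print(pattern_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['the', '2', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
```

Unlike the simple analyzer, \W+ treats digits and underscores as word characters, so the token 2 is kept.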

Language Analyzer

A set of analyzers aimed at analyzing specific language text. The following types are supported: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.

Fingerprint Analyzer

  • Fingerprint analyzer: lowercases, removes extended characters, sorts the terms, removes duplicates, and concatenates the result into a single token
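
A simplified Python sketch of the fingerprint algorithm (ignoring the ASCII folding and punctuation handling the real analyzer also performs):

```python
def fingerprint_analyze(text):
    # Lowercase, split on whitespace, de-duplicate, sort,
    # then join everything back into one token
    tokens = sorted(set(text.lower().split()))
    return " ".join(tokens)

print(fingerprint_analyze("Yes yes the fox the fox"))
# 'fox the yes'
```

Because duplicates and ordering are normalized away, two texts containing the same words always produce the same fingerprint, which is useful for deduplication.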

Custom Analyzer

The difficulty of Chinese word segmentation

  • A Chinese sentence must be cut into words, not left as one undivided string
  • English words are separated by spaces, while Chinese has no natural delimiter between words
  • The same Chinese characters can carry different meanings depending on context and segmentation
    • e.g. 这个苹果不大好吃 can be segmented as 这个苹果，不大好吃 ("this apple is not very tasty") or 这个苹果不大，好吃 ("this apple is not big, but tasty")

Chinese word divider

ICU Analyzer
  • Requires installing a plugin:
    • elasticsearch-plugin install analysis-icu
  • Provides Unicode support, better support for Asian languages
  • Normalization Character Filter
  • ICU Tokenizer
  • Normalization Token Filter
  • Folding Token Filter
  • Collation Token Filter
  • Transform Token Filter
IK Analyzer
  • A lightweight Chinese word-segmentation toolkit developed in Java
  • Supports custom dictionaries and hot-reloading of dictionary updates.
  • Supports both fine-grained and smart (coarse-grained) segmentation modes.
  • Github.com/medcl/elast…

Resources: Medcl's GitHub, which provides many custom analyzers

The _analyze API

Standard Analyzer

  • The default configuration

Request:

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Results:

{
  "tokens" : [
    { "token" : "the", "start_offset" : 0, "end_offset" : 3, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "2", "start_offset" : 4, "end_offset" : 5, "type" : "<NUM>", "position" : 1 },
    { "token" : "quick", "start_offset" : 6, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "brown", "start_offset" : 12, "end_offset" : 17, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "foxes", "start_offset" : 18, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "jumped", "start_offset" : 24, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 5 },
    { "token" : "over", "start_offset" : 31, "end_offset" : 35, "type" : "<ALPHANUM>", "position" : 6 },
    { "token" : "the", "start_offset" : 36, "end_offset" : 39, "type" : "<ALPHANUM>", "position" : 7 },
    { "token" : "lazy", "start_offset" : 40, "end_offset" : 44, "type" : "<ALPHANUM>", "position" : 8 },
    { "token" : "dog's", "start_offset" : 45, "end_offset" : 50, "type" : "<ALPHANUM>", "position" : 9 },
    { "token" : "bone", "start_offset" : 51, "end_offset" : 55, "type" : "<ALPHANUM>", "position" : 10 }
  ]
}
  • Modified configuration: enable the Stop Token Filter and set max_token_length

Request:

PUT standard-demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST standard-demo/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Results:

{
  "tokens" : [
    { "token" : "2", "start_offset" : 4, "end_offset" : 5, "type" : "<NUM>", "position" : 1 },
    { "token" : "quick", "start_offset" : 6, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "brown", "start_offset" : 12, "end_offset" : 17, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "foxes", "start_offset" : 18, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "jumpe", "start_offset" : 24, "end_offset" : 29, "type" : "<ALPHANUM>", "position" : 5 },
    { "token" : "d", "start_offset" : 29, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 6 },
    { "token" : "over", "start_offset" : 31, "end_offset" : 35, "type" : "<ALPHANUM>", "position" : 7 },
    { "token" : "lazy", "start_offset" : 40, "end_offset" : 44, "type" : "<ALPHANUM>", "position" : 9 },
    { "token" : "dog's", "start_offset" : 45, "end_offset" : 50, "type" : "<ALPHANUM>", "position" : 10 },
    { "token" : "bone", "start_offset" : 51, "end_offset" : 55, "type" : "<ALPHANUM>", "position" : 11 }
  ]
}

Note that because max_token_length is 5, "jumped" is cut into jumpe and d, and the English stop word "the" is removed.

Simple Analyzer

Request:

POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Results:

{
  "tokens" : [
    { "token" : "the", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 0 },
    { "token" : "quick", "start_offset" : 6, "end_offset" : 11, "type" : "word", "position" : 1 },
    { "token" : "brown", "start_offset" : 12, "end_offset" : 17, "type" : "word", "position" : 2 },
    { "token" : "foxes", "start_offset" : 18, "end_offset" : 23, "type" : "word", "position" : 3 },
    { "token" : "jumped", "start_offset" : 24, "end_offset" : 30, "type" : "word", "position" : 4 },
    { "token" : "over", "start_offset" : 31, "end_offset" : 35, "type" : "word", "position" : 5 },
    { "token" : "the", "start_offset" : 36, "end_offset" : 39, "type" : "word", "position" : 6 },
    { "token" : "lazy", "start_offset" : 40, "end_offset" : 44, "type" : "word", "position" : 7 },
    { "token" : "dog", "start_offset" : 45, "end_offset" : 48, "type" : "word", "position" : 8 },
    { "token" : "s", "start_offset" : 49, "end_offset" : 50, "type" : "word", "position" : 9 },
    { "token" : "bone", "start_offset" : 51, "end_offset" : 55, "type" : "word", "position" : 10 }
  ]
}

Stop Analyzer

Request:

POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Results:

{
  "tokens" : [
    { "token" : "quick", "start_offset" : 6, "end_offset" : 11, "type" : "word", "position" : 1 },
    { "token" : "brown", "start_offset" : 12, "end_offset" : 17, "type" : "word", "position" : 2 },
    { "token" : "foxes", "start_offset" : 18, "end_offset" : 23, "type" : "word", "position" : 3 },
    { "token" : "jumped", "start_offset" : 24, "end_offset" : 30, "type" : "word", "position" : 4 },
    { "token" : "over", "start_offset" : 31, "end_offset" : 35, "type" : "word", "position" : 5 },
    { "token" : "lazy", "start_offset" : 40, "end_offset" : 44, "type" : "word", "position" : 7 },
    { "token" : "dog", "start_offset" : 45, "end_offset" : 48, "type" : "word", "position" : 8 },
    { "token" : "s", "start_offset" : 49, "end_offset" : 50, "type" : "word", "position" : 9 },
    { "token" : "bone", "start_offset" : 51, "end_offset" : 55, "type" : "word", "position" : 10 }
  ]
}

User-defined filtering conditions:

PUT stop-demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]
        }
      }
    }
  }
}

POST stop-demo/_analyze
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Results:

{
  "tokens" : [
    { "token" : "quick", "start_offset" : 6, "end_offset" : 11, "type" : "word", "position" : 1 },
    { "token" : "brown", "start_offset" : 12, "end_offset" : 17, "type" : "word", "position" : 2 },
    { "token" : "foxes", "start_offset" : 18, "end_offset" : 23, "type" : "word", "position" : 3 },
    { "token" : "jumped", "start_offset" : 24, "end_offset" : 30, "type" : "word", "position" : 4 },
    { "token" : "lazy", "start_offset" : 40, "end_offset" : 44, "type" : "word", "position" : 7 },
    { "token" : "dog", "start_offset" : 45, "end_offset" : 48, "type" : "word", "position" : 8 },
    { "token" : "s", "start_offset" : 49, "end_offset" : 50, "type" : "word", "position" : 9 },
    { "token" : "bone", "start_offset" : 51, "end_offset" : 55, "type" : "word", "position" : 10 }
  ]
}

Keyword Analyzer

Request:

POST _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Results:

{
  "tokens" : [
    { "token" : "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.", "start_offset" : 0, "end_offset" : 56, "type" : "word", "position" : 0 }
  ]
}

Pattern Analyzer

Request:

POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Results:

{
  "tokens" : [
    { "token" : "the", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 0 },
    { "token" : "2", "start_offset" : 4, "end_offset" : 5, "type" : "word", "position" : 1 },
    { "token" : "quick", "start_offset" : 6, "end_offset" : 11, "type" : "word", "position" : 2 },
    { "token" : "brown", "start_offset" : 12, "end_offset" : 17, "type" : "word", "position" : 3 },
    { "token" : "foxes", "start_offset" : 18, "end_offset" : 23, "type" : "word", "position" : 4 },
    { "token" : "jumped", "start_offset" : 24, "end_offset" : 30, "type" : "word", "position" : 5 },
    { "token" : "over", "start_offset" : 31, "end_offset" : 35, "type" : "word", "position" : 6 },
    { "token" : "the", "start_offset" : 36, "end_offset" : 39, "type" : "word", "position" : 7 },
    { "token" : "lazy", "start_offset" : 40, "end_offset" : 44, "type" : "word", "position" : 8 },
    { "token" : "dog", "start_offset" : 45, "end_offset" : 48, "type" : "word", "position" : 9 },
    { "token" : "s", "start_offset" : 49, "end_offset" : 50, "type" : "word", "position" : 10 },
    { "token" : "bone", "start_offset" : 51, "end_offset" : 55, "type" : "word", "position" : 11 }
  ]
}

Custom regular expressions:

PUT pattern-demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type":      "pattern",
          "pattern":   "\\W|_", 
          "lowercase": true
        }
      }
    }
  }
}

POST pattern-demo/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}


Results:

{
  "tokens" : [
    { "token" : "john", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 },
    { "token" : "smith", "start_offset" : 5, "end_offset" : 10, "type" : "word", "position" : 1 },
    { "token" : "foo", "start_offset" : 11, "end_offset" : 14, "type" : "word", "position" : 2 },
    { "token" : "bar", "start_offset" : 15, "end_offset" : 18, "type" : "word", "position" : 3 },
    { "token" : "com", "start_offset" : 19, "end_offset" : 22, "type" : "word", "position" : 4 }
  ]
}
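
The same split can be reproduced with Python's re module, using the address implied by the token offsets above (an illustration only, not the Lucene engine Elasticsearch actually uses):

```python
import re

# The custom analyzer's pattern: any non-word character, or an underscore
tokens = [t for t in re.split(r"\W|_", "John_Smith@foo-bar.com".lower()) if t]
print(tokens)
# ['john', 'smith', 'foo', 'bar', 'com']
```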

Language Analyzer (English)

Besides removing stop words, the English analyzer stems terms, as the response below shows: "foxes" becomes fox, "jumped" becomes jump, and "lazy" becomes lazi.

Request:

POST _analyze
{
  "analyzer": "english",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Results:

{
  "tokens" : [
    { "token" : "2", "start_offset" : 4, "end_offset" : 5, "type" : "<NUM>", "position" : 1 },
    { "token" : "quick", "start_offset" : 6, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "brown", "start_offset" : 12, "end_offset" : 17, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "fox", "start_offset" : 18, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "jump", "start_offset" : 24, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 5 },
    { "token" : "over", "start_offset" : 31, "end_offset" : 35, "type" : "<ALPHANUM>", "position" : 6 },
    { "token" : "lazi", "start_offset" : 40, "end_offset" : 44, "type" : "<ALPHANUM>", "position" : 8 },
    { "token" : "dog", "start_offset" : 45, "end_offset" : 50, "type" : "<ALPHANUM>", "position" : 9 },
    { "token" : "bone", "start_offset" : 51, "end_offset" : 55, "type" : "<ALPHANUM>", "position" : 10 }
  ]
}

ICU Analyzer

Request:

POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "他说的确实在理"
}

Results:

{
  "tokens" : [
    { "token" : "他", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 0 },
    { "token" : "说的", "start_offset" : 1, "end_offset" : 3, "type" : "<IDEOGRAPHIC>", "position" : 1 },
    { "token" : "确实", "start_offset" : 3, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 2 },
    { "token" : "在", "start_offset" : 5, "end_offset" : 6, "type" : "<IDEOGRAPHIC>", "position" : 3 },
    { "token" : "理", "start_offset" : 6, "end_offset" : 7, "type" : "<IDEOGRAPHIC>", "position" : 4 }
  ]
}

Resources

  • www.elastic.co/guide/en/el…