Preface

The version of Elasticsearch used in this article is 7.x.


Recommended study material: Ruan Yiming's course "Elasticsearch Core Technology and Practice".

  • A normalizer is to a keyword field what an analyzer is to a text field.
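
As a rough illustration of that analogy, the request below defines a custom normalizer and attaches it to a keyword field, next to a text field that uses an analyzer. The index name normalizer_demo, the normalizer name lowercase_ascii, and the field names are hypothetical and chosen only for this sketch.

```
# Assumption: normalizer_demo, lowercase_ascii, tag and title are illustrative names.
# A normalizer has no tokenizer: the whole keyword value stays a single token
# and only passes through the listed filters.
PUT /normalizer_demo
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_ascii": {
          "type": "custom",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "tag": {
        "type": "keyword",
        "normalizer": "lowercase_ascii"
      },
      "title": {
        "type": "text",
        "analyzer": "standard"
      }
    }
  }
}
```

With a mapping like this, a tag value such as Elastic would be indexed as the single term elastic, while title is split into multiple tokens by the standard analyzer.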

Analyzer component

An analyzer consists of three parts: character filters (Character Filters), a tokenizer (Tokenizer), and token filters (Token Filters). Text passes through them in that order.
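
One way to see the three parts together is to declare a custom analyzer in the index settings and then run it through the _analyze API. This is only a minimal sketch: the index name analyzer_demo and the analyzer name my_custom_analyzer are made up for illustration, and the chosen filters are just one possible combination of built-in components.

```
# Assumption: analyzer_demo and my_custom_analyzer are hypothetical names.
# char_filter runs first, then the tokenizer, then the token filters in order.
PUT /analyzer_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}

# Run the analyzer defined above against a sample string.
POST /analyzer_demo/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "<p>The 2 QUICK Brown-Foxes</p>"
}
```

Here html_strip removes the markup, the standard tokenizer splits the remaining text on word boundaries, lowercase normalizes case, and stop drops English stop words such as "the".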

analyzer / search_analyzer

  • By default, the same analyzer is used for both indexing and searching.
  • When search_analyzer is set on a field, queries against that field are analyzed with search_analyzer instead.
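
A common reason to set the two differently is prefix-style autocomplete. The sketch below assumes a hypothetical index search_analyzer_demo with a title field; the filter name autocomplete_filter and the analyzer name autocomplete are likewise illustrative.

```
# Assumption: all index, filter, analyzer and field names are hypothetical.
# Index time: each word is expanded into edge n-grams (q, qu, qui, ...).
# Search time: the query text is only tokenized and lowercased by "standard".
PUT /search_analyzer_demo
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}
```

With this setup, a document whose title contains Quick is indexed under q, qu, qui, quic and quick, so a query for qui can match it without the query itself being expanded into n-grams.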

Analyzer installation method

  • See which plug-ins are installed
# shell
./bin/elasticsearch-plugin list
# URL
http://10.10.10.10:9200/_cat/plugins
# Kibana
GET /_cat/plugins
  • Install an official analysis plugin (ICU as an example)
./bin/elasticsearch-plugin install analysis-icu
  • Install an analyzer plugin from GitHub (IK as an example)
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip
  • Install the local ZIP package
./bin/elasticsearch-plugin install file:///usr/share/es/download/plugin/elasticsearch-analysis-ik-7.4.2.zip
  • Skip the confirmation prompt by adding either of these flags to the install command
-b
--batch

Analyzer Test Method

Built-in analyzer

POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Custom analyzer

GET /_analyze
{
  "char_filter": ["html_strip"], 
  "tokenizer": "hanlp",
  "filter": [ "word_delimiter_graph", "lowercase",  "stop", "stemmer"],
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Built-in analyzers and official plugins

  • Built-in analyzers
  • Official analysis plugins

English Analyzer

Standard Analyzer

  • Built-in analyzer: Standard Analyzer

Simple Analyzer

  • Built-in analyzer: Simple Analyzer

Chinese Analyzer

ICU

  • Official analysis plugin: ICU
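
Once the analysis-icu plugin is installed, it can be tried with the same _analyze request used earlier. The sketch below uses the icu_analyzer that the plugin registers; the sample text is arbitrary.

```
# Assumption: the analysis-icu plugin is installed and the node was restarted.
# The sample text is only an illustration.
POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "中华人民共和国国歌"
}
```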

SmartCN

  • Official analysis plugin: SmartCN
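
For comparison with other Chinese analyzers, the request below runs the smartcn analyzer, assuming the official analysis-smartcn plugin has been installed.

```
# Assumption: the analysis-smartcn plugin is installed
# (./bin/elasticsearch-plugin install analysis-smartcn) and the node was restarted.
POST _analyze
{
  "analyzer": "smartcn",
  "text": "中华人民共和国国歌"
}
```

Sending the same text to several analyzers is a simple way to compare how each one segments it.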

HanLP

  • Online test: http://hanlp.com/
  • GitHub: https://github.com/hankcs/HanLP
  • HanLP Elasticsearch plug-in: https://github.com/KennFalcon/elasticsearch-analysis-hanlp

IK

  • IK Elasticsearch plug-in: https://github.com/medcl/elasticsearch-analysis-ik
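
The IK plugin provides two analyzers, ik_max_word for the finest-grained split and ik_smart for a coarser one. A quick way to see the difference, assuming the plugin version matches the installed Elasticsearch version, is to send the same text to both:

```
# Assumption: the IK plugin is installed and the node was restarted.
# ik_max_word: exhaustive, fine-grained segmentation.
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}

# ik_smart: coarser segmentation with fewer tokens.
POST _analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}
```

A frequently used pairing is ik_max_word as the field's analyzer and ik_smart as its search_analyzer, though whether that suits a given dataset is worth checking against real queries.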

Pinyin

  • Pinyin: https://github.com/medcl/elas…

ansj

  • Ansj Elasticsearch plug-in: https://github.com/NLPchina/e…

jieba

  • Jieba Elasticsearch plug-in: https://github.com/sing1ee/el…

jcseg

  • GitHub: https://github.com/lionsoul20…

Further reading

  • Understanding word segmentation in NLP (differences between Chinese and English + 3 major difficulties + 3 typical methods)
  • We often say that "English and other alphabetic scripts are one-dimensional, while Chinese is two-dimensional". Could a script of even higher dimension exist?
  • Comparing the world's major languages to see the advantages and disadvantages of Chinese characters