Concept

A tokenizer takes a string as input, splits the string into separate words or token units (characters such as punctuation may be discarded), and outputs a token stream.
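For example, running a sentence through the _analyze API with the standard tokenizer (a minimal sketch; the sample sentence is borrowed from the normalization example below) shows the string being turned into a token stream:

POST _analyze
{
  "tokenizer": "standard",
  "text": "His mom likes small dogs!"
}
# output tokens: His, mom, likes, small, dogs   (the "!" is discarded)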

Normalization

There are many forms of normalization, such as tense conversion, singular/plural conversion, synonym conversion, and case conversion. For example, suppose a document contains "His mom likes small dogs": ① during index building, tenses, singular and plural forms, and synonyms in the document are normalized; ② the user can then find the document by searching for a similar phrase such as "mother liked little dog".
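A minimal sketch of this kind of normalization with the _analyze API, using the built-in lowercase and English stemmer token filters (the sample text is the sentence from the example above; synonym conversion would additionally require a synonym filter configured with a synonym list):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "stemmer", "language": "english" }
  ],
  "text": "His mom likes small dogs"
}
# His => his, likes => like, dogs => dog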

Analyzer composition

  • Character filter: pre-processes the raw text before it reaches the tokenizer (e.g. character mapping). Built into Elasticsearch: HTML strip (removes HTML tags), Mapping (string replacement), Pattern replace (regex-based replacement)
  • Tokenizer: word segmentation (splits the text into tokens)
  • Token filter: stop words, tense conversion, case conversion, synonym conversion, modal-word handling, etc. Built-in examples: lowercase, stop, synonym (a combined sketch follows this list)
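Putting the three stages together in a single _analyze call (a minimal sketch; the sample HTML text is made up): the character filter strips the tags, the tokenizer splits the words, and the token filters lowercase them and drop stop words.

POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "<b>The QUICK Brown Foxes</b>"
}
# output tokens: quick, brown, foxes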
POST _analyze {"tokenizer":"keyword", "char_filter":["html_strip"], POST _analyze {"tokenizer":"standard", "char_filter":[{"type":"mapping", "char_filter":[{"type":"mapping", "The mappings" : [" - = > _ "]}], "text", "123-456-789, I - love - u"} # replace emoticons POST _analyze {" tokenizer ":" standard ", "char_filter":[ { "type":"mapping", "mappings":[":) => happy"] } ], "Text ":" I am felling :),i-love-u"} # regular expression POST _analyze {"tokenizer": "char_filter":[ { "type":"pattern_replace", "pattern":"http://(.*)", "replacement":"$1_haha" } ], "text":"http://www.elastic.co" }Copy the code

ES built-in analyzers

  • Standard Analyzer – the default analyzer; splits on word boundaries and lowercases
  • Simple Analyzer – splits on non-letter characters (symbols are dropped) and lowercases
  • Stop Analyzer – lowercases and removes stop words (the, a, is)
  • Whitespace Analyzer – splits on whitespace, no lowercasing
  • Keyword Analyzer – treats the whole input as a single token, no word splitting
  • Pattern Analyzer – splits by regular expression, default \W+ (non-word characters)
  • Language – analyzers for more than 30 common languages
  • Custom Analyzer – a user-defined analyzer
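To see the differences, the same sentence can be run through several of the built-in analyzers (a minimal sketch; the sample sentence is made up):

POST _analyze
{ "analyzer": "standard", "text": "The 2 QUICK Brown-Foxes." }
# => the, 2, quick, brown, foxes

POST _analyze
{ "analyzer": "whitespace", "text": "The 2 QUICK Brown-Foxes." }
# => The, 2, QUICK, Brown-Foxes.

POST _analyze
{ "analyzer": "keyword", "text": "The 2 QUICK Brown-Foxes." }
# => The 2 QUICK Brown-Foxes.   (the whole input as a single token)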

#HTML Strip Character Filter
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <a>happy</a>!</p>"
}

#Mapping Character Filter
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "٠ => 0", "١ => 1", "٢ => 2", "٣ => 3", "٤ => 4",
            "٥ => 5", "٦ => 6", "٧ => 7", "٨ => 8", "٩ => 9"
          ]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My license plate is ٢٥٠١٥"
}

#Pattern Replace Character Filter
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}

#**************************************************************************
#token filter: stop words, tense conversion, case conversion, synonym conversion, modal-word handling, etc.
#has => have   him => he   apples => apple   the/oh/a => dropped

# lowercase token filter
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "THE Quick FoX JUMPs"
}

# conditional token filter: lowercase only the tokens shorter than 5 characters
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "condition",
      "filter": ["lowercase"],
      "script": {
        "source": "token.getTerm().length() < 5"
      }
    }
  ],
  "text": "THE QUICK BROWN FOX"
}

# stopwords token filter
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Teacher Ma is in the restroom"
}

# tokenizer: standard
# (the rest of this GET /my_index/_analyze request is missing in the original)

Setting type to custom tells Elasticsearch that we are defining a custom analyzer. Compare this with how a built-in analyzer is configured: there, type is set to the name of the built-in analyzer, such as standard or simple.

PUT /test_analysis
{
  "settings": {
    "analysis": {
      "char_filter": {
        "test_char_filter": {
          "type": "mapping",
          "mappings": [
            "& => and",
            "| => or"
          ]
        }
      },
      "filter": {
        "test_stopwords": {
          "type": "stop",
          "stopwords": ["is", "in", "at", "the", "a", "for"]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "test_char_filter"],
          "tokenizer": "standard",
          "filter": ["lowercase", "test_stopwords"]
        }
      }
    }
  }
}
# (the end of this request is cut off in the original; the analyzer parameter is restored)
GET /test_analysis/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Teacher ma & zhang also thinks [mother's friends] is good | nice!!!"
}
PUT /test_analysis/_mapping/my_type
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}

#**************************************************************************
# Chinese word segmentation
# (the sample texts below were Chinese in the original post; their machine-translated English renderings are kept as-is)
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "ik_max_word"
        }
      }
    }
  }
}

PUT /my_index
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

POST /my_index/_bulk
{"index": {"_id": "1"}}
{"text": "urban management call to stall the vendor"}
{"index": {"_id": "2"}}
{"text": "laughed fruit culture response vendor farmer to stall"}
{"index": {"_id": "3"}}
{"text": "the old farmer took 17 years out of the chair of trees"}
{"index": {"_id": "4"}}
{"text": "married for over 30 years of husband and wife AA system, grasp by chengguan"}
{"index": {"_id": "5"}}
{"text": ""}
# (document 5's text value is missing in the original)

GET /my_index/_analyze
{
  "text": "the national anthem of the People's Republic of China",
  "analyzer": "ik_max_word"
}
GET /my_index/_analyze
{
  "text": "the national anthem of the People's Republic of China",
  "analyzer": "ik_smart"
}
GET /my_index/_search
{
  "query": {
    "match": {
      "text": "cage osprey"
    }
  }
}
GET /my_index/_analyze
{
  "text": "super messiah who",
  "analyzer": "ik_max_word"
}
GET /my_index/_analyze
{
  "text": "porcelain is a form of blackmail and should be sentenced",
  "analyzer": "ik_max_word"
}

Chinese word segmentation (IK analyzer)

Github.com/medcl/elast…

Download the plugin, create an ik directory under plugins in the ES root directory, unzip the package into it, and restart ES.

What is the difference between ik_max_word and ik_smart? ik_max_word splits the text at the finest granularity; for example, "中华人民共和国国歌" (the national anthem of the People's Republic of China) is split into 中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌, exhausting every possible combination, which suits Term queries. ik_smart does the coarsest-grained split: the same text is split into just 中华人民共和国 and 国歌, which suits Phrase queries.

IK file description:
1) IKAnalyzer.cfg.xml: IK analyzer configuration file
2) Main dictionary: main.dic
3) English stop words: stopword.dic (these terms are not added to the inverted index)
4) Special dictionaries:
   a. quantifier.dic: quantifiers and units of measure
   b. suffix.dic: suffixes (e.g. place-name suffixes)
   c. surname.dic: Chinese surnames (the "hundred family names")
   d. preposition.dic: prepositions and modal/function words

Hot update of the word lists can be done in two ways: a. modify the source code of the IK analyzer; b. use the hot-update mechanism that IK supports natively: deploy a web server that exposes an HTTP interface and returns the Last-Modified and ETag response headers; when either header changes, IK fetches the updated word list.
