1. The default Elasticsearch tokenizer

You can use the _analyze API of ES to see how a piece of text is tokenized.

The default tokenizer in ES is designed for English and segments English sentences well. Let's look at an example: when we send the following request to analyze the sentence "What's your name", we can see that it is broken into several words.

POST _analyze
{
  "tokenizer": "standard",
  "text": "What's your name"
}        

{
  "tokens" : [
    {
      "token" : "What's",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "your",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "name",
      "start_offset" : 12,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}

When we analyze the Chinese sentence "你叫什么名字" ("what is your name") instead, we can see that the standard tokenizer splits it into individual characters, which is obviously unacceptable in actual use.

POST _analyze {"tokenizer": "standard", "text": "present"} {"tokens" : [{"token" : "you ", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 0}, {"token" : "call ", "start_offset" : 1, "end_offset" : 1 2, "type" : "IDEOGRAPHIC" > ", "position" : 1}, {" token ":" what "and" start_offset ": 2," end_offset ": 3," type ": "< IDEOGRAPHIC >", "position" : 2}, {" token ":" yao ", "start_offset" : 3, "end_offset" : 4, "type" : "< IDEOGRAPHIC >", "position" : 3}, {" token ":" name ", "start_offset" : 4, "end_offset" : 5, "type" : "< IDEOGRAPHIC >", "position" : 4}, {" token ":" words ", "start_offset" : 5, "end_offset" : 6, "type" : "<IDEOGRAPHIC>", "position" : 5 } ] }Copy the code

2. The IK tokenizer

Because English sentences are separated by spaces, tokenizing them is relatively straightforward. Chinese has no such delimiters, which makes word segmentation harder and prone to ambiguity, and developing your own segmenter is costly, so in practice existing ones are used; well-known examples include jieba and HanLP. Here we introduce a tokenizer that ships as an ES plugin: the IK tokenizer. You can download the release zip from GitHub at github.com/medcl/elast…, create an ik directory under the ES plugins directory, put the unzipped files into it, and restart Elasticsearch.
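After the restart, a quick way to confirm that the plugin has been picked up is to list the installed plugins (the exact output depends on your cluster):

GET _cat/plugins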

Now let's replace the tokenizer in the previous request with ik_smart and see what happens. You can see that ik_smart is able to split the Chinese sentence into proper words.

POST _analyze {"tokenizer": "ik_smart", "text": "ik_smart"} {"tokens" : [{"token" : "you ", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0}, {" token ":" name ", "start_offset" : 1, "end_offset" : 4, "type" : "CN_WORD", "position" : 1}, {" token ":" name ", "start_offset" : 4, "end_offset" : 6, "type" : "CN_WORD", "position" : 2 } ] }Copy the code

In addition to ik_smart, there is also an ik_max_word tokenizer.

  • ik_smart segments text in a coarse-grained way. For example, 中华人民共和国 (the People's Republic of China) is treated as a single word, so the result is just the one token 中华人民共和国.

  • ik_max_word, on the other hand, segments text in a fine-grained way, producing words of various lengths. Segmenting 中华人民共和国 this way yields many tokens, as shown in the output below.

    {
      "tokens" : [
        { "token" : "中华人民共和国", "start_offset" : 0, "end_offset" : 7, "type" : "CN_WORD", "position" : 0 },
        { "token" : "中华人民", "start_offset" : 0, "end_offset" : 4, "type" : "CN_WORD", "position" : 1 },
        { "token" : "中华", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 2 },
        { "token" : "华人", "start_offset" : 1, "end_offset" : 3, "type" : "CN_WORD", "position" : 3 },
        { "token" : "人民共和国", "start_offset" : 2, "end_offset" : 7, "type" : "CN_WORD", "position" : 4 },
        { "token" : "人民", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 5 },
        { "token" : "共和国", "start_offset" : 4, "end_offset" : 7, "type" : "CN_WORD", "position" : 6 },
        { "token" : "共和", "start_offset" : 4, "end_offset" : 6, "type" : "CN_WORD", "position" : 7 },
        { "token" : "国", "start_offset" : 6, "end_offset" : 7, "type" : "CN_CHAR", "position" : 8 }
      ]
    }
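The output above can be reproduced with an _analyze request like the following:

POST _analyze
{
  "tokenizer": "ik_max_word",
  "text": "中华人民共和国"
}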

For a specific scenario, you need to choose the appropriate tokenizer.

3. Using ik_smart and ik_max_word together

In general, to improve search quality, the two tokenizers are used together: ik_max_word is used at index time so that as many tokens as possible are indexed, while ik_smart is used at search time to keep matching as precise as possible, so that users get results that are as accurate as possible. A common scenario is searching for "进口红酒" (imported red wine) and trying not to show lipstick-related products, or at least not to rank them near the top of the list. The reason lipstick can match at all is that the query string 进口红酒 contains 口红 (lipstick), so a fine-grained tokenizer applied at search time may split out the token 口红 and match lipstick products, whereas ik_smart will not; you can compare the two with the requests below.
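A quick way to see the difference is to run the query string through both tokenizers and compare the tokens they produce (a sketch; the exact tokens depend on your IK version and dictionary, so run it against your own cluster):

POST _analyze
{
  "tokenizer": "ik_smart",
  "text": "进口红酒"
}

POST _analyze
{
  "tokenizer": "ik_max_word",
  "text": "进口红酒"
}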

We'll start by creating an index called goods in Elasticsearch, where the analyzer of the name field is ik_max_word.

PUT /goods
{
  "mappings":{
	"goods": {
		"properties": {
			"id": {
				"type": "keyword"
			},
			"name": {
				"analyzer": "ik_max_word",
				"type": "text"
			}
		}
	  }
  },
  "settings":{
            "index": {
                "refresh_interval": "1s",
                "number_of_shards": 5,
                "max_result_window": "10000000",
                "mapper": {
                    "dynamic": "false"
                },
                "number_of_replicas": 0
            }
  }
}
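As a side note, instead of passing the analyzer with every search request, ES also lets you declare the search-time analyzer directly in the mapping through the search_analyzer parameter. A minimal sketch, reusing the name field from the mapping above:

"name": {
  "type": "text",
  "analyzer": "ik_max_word",
  "search_analyzer": "ik_smart"
}

With this in place, queries against name are analyzed with ik_smart automatically, and the explicit analyzer shown in the search request later in this article becomes optional.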

Then we add some data to it with POST requests.

POST/goods/goods {" id ":" 1 ", "name" : "beautiful pink lipstick star"} working POST/goods/goods {" id ":" 2 ", "Name" : "good drink imported red wine"} working POST/goods/goods {" id ":" 3 ", "name" : "imported red wine is delicious"}Copy the code

Finally, at query time, we specify the analyzer as ik_smart.

GET /goods/goods/_search
{
  "query": {
    "match": {
      "name": {
        "query": "imported red wine",
        "analyzer": "ik_smart"
      }
    }
  }
}

You can see that the search for imported red wine returns two records, and the lipstick product is not matched.

{ "took" : 28, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : , "hits" : {0} "total" : 2, "max_score" : 0.36464313, "hits" : [{" _index ":" goods ", "_type" : "goods", "_id" : "CdLk1WoBvRMfJWIKVfOP", "_score" : 0.36464313, "_source" : {" id ":" 3 ", "name" : "imported red wine is delicious"}}, {" _index ": "Goods", "_type" : "goods", "_id" : "ctLk1WoBvRMfJWIKX_O6", "_score" : 0.36464313, "_source" : {" id ":" 2 ", "name" : "Good imported red wine"}}]}}Copy the code

4. To summarize

Tokenizers are an important part of Elasticsearch, and there are many open source tokenizers available on the web. They may be good enough for general applications, but in certain scenarios you may need to optimize the tokenizers, or even develop your own tokenizers.
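For example, a common form of such tuning is to wrap a tokenizer in a custom analyzer and combine it with token filters in the index settings. A minimal sketch, where the index name my_index and the lowercase filter are placeholders for illustration:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ik_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

Fields in the mapping can then reference my_ik_analyzer just like a built-in analyzer.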
