What is word segmentation

We talked about the inverted index in our last article. For example, if we search for "the moon", we can find the ancient poem "Quiet Night Thoughts" through the inverted index, because terms such as "the moon", "lowers its head" and "moonlight" were produced by the analyzer when the poem was indexed. In other words, the analyzer tags an article with terms, and we can then find the matching title by those terms, for example "new mobile phone Huawei". Without word segmentation we might fail to find "new mobile phone Huawei" at all, simply because the words appear in a different order.

The components of an analyzer

  • Character filter

Before a piece of text is tokenized, the raw data should be cleaned first: markup such as HTML tags (bold, underline and so on) is stripped, because all we really want is the text itself. For example, from "<b>I am a handsome man</b>" only "I am a handsome man" should be handed to the tokenizer.

  • Tokenizer

The character filter hands "I am a handsome man" to the tokenizer, which splits it into individual terms, for example "I", "am", "a", "handsome", "man".

But some of those terms are not useful keywords, and searching on them alone is meaningless. For example, if a user searches "I caught a cold, what should I do", then to make the search more efficient and the results more relevant, we want keywords like "caught a cold" and "what to do" rather than fillers like "I" or "ah". So the tokenizer's output needs to be filtered once more, and this is where token filters come in.

  • Token filters (a sketch that combines all three components follows below)
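To make the three parts concrete, here is a minimal sketch of a custom analyzer that chains an html_strip character filter, the standard tokenizer and the lowercase/stop token filters; the index name analyzer_demo and the analyzer name my_analyzer are made up for this illustration.

# define a custom analyzer: character filter -> tokenizer -> token filters
PUT /analyzer_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}

# run a sentence through the custom analyzer to see the resulting terms
GET /analyzer_demo/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<b>I am a handsome man</b>"
}

The character filter strips the <b> tag, the tokenizer splits the sentence into words, and the token filters lowercase everything and drop the stop word "a", leaving roughly "i", "am", "handsome", "man".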

Elasticsearch supports nine different analyzers by default (each can be tried out with the _analyze API, as sketched after this list)

  • Standard (filters punctuation)

  • Simple (Filters punctuation and numbers, leaving only letters)

  • Whitespace (splits on whitespace, filters nothing)

  • Stop (filters punctuation, numbers, and stop words such as modal particles)

  • Keyword (treats the input as a single whole and does nothing else, which is why keyword is used for exact matching)

  • Pattern (regular-expression matching, following Java's regex rules; in the example only the digits were filtered out)

  • Fingerprint (lowercases, removes stop words, then sorts and de-duplicates the terms)

  • Language analyzers for more than 30 common languages (Chinese is not among them)
  • Supports custom word segmentation
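As a rough sketch, the built-in analyzers can be compared with the _analyze API (the sample text below is made up):

# standard: filters punctuation, keeps the digits -> hello, world, 123
GET _analyze
{
  "analyzer": "standard",
  "text": "Hello, World 123"
}

# simple: keeps only letters -> hello, world
GET _analyze
{
  "analyzer": "simple",
  "text": "Hello, World 123"
}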

The IK analyzer

  • Let's explain once more why we need a proper analyzer. Without real word segmentation we can still search, but take the following example: when I search for Huawei (华为), the document "I love China" (我爱中华) is also returned, which is obviously not what I want. Why does this happen? It comes down to what the default analyzer returns for Chinese

We find that, by default, ES splits Chinese text into individual characters. When you search for Huawei (华为) you are really looking for items related to Huawei, but once every character is indexed on its own you match every document containing 华 or 为, which is obviously not what we want. That is why we need proper word segmentation to get more accurate matching
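You can see this single-character behaviour with the _analyze API and the default standard analyzer; the sample sentence 我爱华为 ("I love Huawei") is just an illustration:

# the standard analyzer returns one token per Chinese character: 我 / 爱 / 华 / 为
GET _analyze
{
  "analyzer": "standard",
  "text": "我爱华为"
}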

  • Elasticsearch already ships with English analyzers, since its authors are native speakers of those languages, but what about Chinese? That is where the IK analyzer comes in; its version 1.0 was released back in 2006.

IK Analyzer is an open source, lightweight Chinese word segmentation toolkit developed in Java. It was integrated into Elasticsearch by medcl, a developer and evangelist who joined Elastic in 2015. To install the IK plugin, search GitHub for github.com/medcl/elast… or download the package from the official source; the blogger wasted two or three hours on a broken package from a third party.

  • After the IK analyzer is installed, we check again
  • However, at this point a search for Huawei still finds nothing, because when the document "I love Huawei" was created, the field was not indexed with the IK analyzer. So we delete the index and create it again, as sketched below
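Here is a minimal sketch of recreating the index; the index name demo and the field name title are assumptions for illustration, and ik_max_word / ik_smart are the two analyzers provided by the IK plugin:

# drop the index that was built with the default analyzer
DELETE /demo

# recreate it so that title is indexed with IK
PUT /demo
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

# index the two test documents again
POST /demo/_doc/1
{ "title": "我爱华为" }

POST /demo/_doc/2
{ "title": "我爱中华" }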

Now a search for Huawei (华为) no longer returns the "I love China" entry; only "I love Huawei" is found. Remember that the inverted index is built at the moment a document is created, so before creating documents it is best to sketch out your data model first, choose field types according to that model, and pick the right analyzer: the right structure gives you the right index, and the right index gives you the right service. I have seen people insert data without defining a mapping. The system does generate a mapping dynamically in that case, but it may not meet your needs; for example, if the index was built character by character, a later search for Huawei as a whole word will never match.
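For completeness, a hedged sketch of that search against the assumed demo index above; with IK in place, 我爱华为 yields the term 华为 while 我爱中华 does not, so only document 1 should match:

# match query against the title field
GET /demo/_search
{
  "query": {
    "match": { "title": "华为" }
  }
}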

Configuring the IK analyzer

  • Let's take a look at the plugin's config directory
[elastic@localhost config]$ ls
extra_main.dic  extra_single_word.dic  extra_single_word_full.dic  extra_single_word_low_freq.dic  extra_stopword.dic  IKAnalyzer.cfg.xml  main.dic  preposition.dic  quantifier.dic  stopword.dic  suffix.dic  surname.dic
[elastic@localhost config]$ pwd
/home/elastic/elasticsearch2/plugins/ik/config

IKAnalyzer.cfg.xml: used to configure custom dictionaries

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Users can configure their own extension dictionaries here (relative or absolute paths); these are words the built-in dictionary may not contain, such as internet slang -->
    <entry key="ext_dict">dic/hehe.dic; dic/haha.dic</entry>
    <!-- Users can configure their own extension stop-word dictionary here, e.g. filler words like "ah", "oh" -->
    <entry key="ext_stopwords">dic/stop.dic</entry>
    <!-- Users can configure a remote extension dictionary here -->
    <entry key="remote_ext_dict">http://m.dic.cn/ext.txt</entry>
    <!-- Users can configure a remote extension stop-word dictionary here -->
    <!-- <entry key="remote_ext_stopwords">http://m.dic.cn/stop.txt</entry> -->
</properties>
  • Here is a quick example: suppose we want IK to treat internet slang such as 蓝瘦香菇 ("blue thin mushroom") as a single word; let's look at how it is segmented first, as in the sketch below
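A rough way to check is the _analyze API with the IK analyzer; without a custom dictionary entry IK typically returns pieces such as 蓝 / 瘦 / 香菇, and after adding 蓝瘦香菇 to an ext_dict file and restarting, the same request should return it as a single token:

# run the slang word through IK to see how it is segmented
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "蓝瘦香菇"
}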

  • I will now configure the file to use the remote extension dictionary (note that if the address of the remote dictionary is added or changed, Elasticsearch needs to be restarted; if the address stays the same and only the file content changes, no restart is needed)
  • Take a look at my remote address (note: to keep the browser from displaying the TXT file garbled, add charset 'utf-8'; inside the nginx server block)

My configuration file is as follows, configured with a remote extension dictionary

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Users can configure their own extension dictionary here -->
    <entry key="ext_dict"></entry>
    <!-- Users can configure their own extension stop-word dictionary here -->
    <entry key="ext_stopwords"></entry>
    <!-- Users can configure a remote extension dictionary here -->
    <entry key="remote_ext_dict">https://0e2d-222-129-5-131.ngrok.io/ext.txt</entry>
    <!-- Users can configure a remote extension stop-word dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

  • I added the word "tech bug" to the remote dictionary

main.dic

IK's built-in Chinese dictionary: more than 270,000 words are collected together in this file

quantifier.dic

Contains words related to units and measure words

suffix.dic

Contains some common suffixes

surname.dic

Chinese Surnames

stopword.dic

Stop words in English