The Standard analyzer

The Standard analyzer that ES adopts by default tokenizes English by word but splits Chinese into single characters, so it is not suitable for Chinese search. We therefore switch to a Chinese-friendly analyzer to achieve better search results.
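
For example, analyzing a Chinese phrase with the standard analyzer (a quick check you can run in the Kibana Dev Tools console; the phrase is the same one used in the IK examples below) returns one token per character, which is why whole-word Chinese searches match poorly:

GET /_analyze
{
  "text": "中华人民共和国国歌",
  "analyzer": "standard"
}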

Installing the IK analyzer

  1. Download the plugin from github.com/medcl/elast…

    Note: the IK analyzer version must exactly match the ES version!

  2. Delete the data folder

    Installing the IK analyzer requires that the original ES instance contains no data, so the IK analyzer should generally be installed before ES is put into formal use.

  3. Create an analysis-ik directory in the plugins directory

  4. Unzip the downloaded zip package into the analysis-ik directory

  5. Restart ES
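
As a rough sketch of steps 3–5 (the directory names, paths and version are illustrative and must match your own installation), the local installation amounts to:

cd elasticsearch-7.x.x
mkdir plugins/analysis-ik
unzip /path/to/elasticsearch-analysis-ik-7.x.x.zip -d plugins/analysis-ik
# stop the running node, then start it again
bin/elasticsearch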

The two installation modes

This article uses local installation, but the IK analyzer can also be installed online. The configuration directories of the two modes are different. If both configuration directories exist at the same time, ES will by default use the configuration file under the online installation path, because the online installation's configuration file path is ES's default configuration file path.

  • Local installation configuration file path: ES installation directory/plugins/analysis-ik/config/IKAnalyzer.cfg.xml (as above)
  • Online installation configuration file path: ES installation directory/config/analysis-ik/IKAnalyzer.cfg.xml
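
For reference, the online installation mentioned above is usually a single command; the release URL and version below are illustrative and must match the ES version exactly:

bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.x.x/elasticsearch-analysis-ik-7.x.x.zip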

All of the IK analyzer configuration below is done in the locally installed configuration file.

Two segmentation modes

The IK analyzer provides two segmentation modes:

  • ik_smart: performs the coarsest-grained segmentation; for example, it splits "中华人民共和国国歌" ("National Anthem of the People's Republic of China") into "中华人民共和国" (People's Republic of China) and "国歌" (national anthem).

    Testing:

    GET /_analyze
    {
      "text": "中华人民共和国国歌",
      "analyzer": "ik_smart"
    }
  • ik_max_word: splits the text at the finest granularity; for example, it splits "中华人民共和国国歌" into sub-words such as "中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 共和国, 共和, 国国, 国歌", exhausting every possible combination.

    Testing:

    GET /_analyze
    {
      "text": "中华人民共和国国歌",
      "analyzer": "ik_max_word"
    }

Testing the segmentation result

View the result of segmenting the text "中华人民共和国国歌" ("national anthem of the People's Republic of China") with ik_max_word.

GET /_analyze
{
  "text": "中华人民共和国国歌",
  "analyzer": "ik_max_word"
}
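
An abridged sketch of the kind of response this returns (the start_offset, end_offset and position fields are omitted here); ik_max_word produces one entry per sub-word:

{
  "tokens": [
    { "token": "中华人民共和国", "type": "CN_WORD" },
    { "token": "中华人民", "type": "CN_WORD" },
    { "token": "国歌", "type": "CN_WORD" }
  ]
}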

Using the IK analyzer

1. Specify the analyzer

When creating a Type, you need to set the analyzer for fields of the text type. If no analyzer is set, the Standard analyzer is used by default.

PUT /postilhub
{
  "mappings": {
    "user": {
      "properties": {
        "id": {
          "type": "keyword"  
        },
        "username": {
          "type": "keyword" 
        },
        "age": {
          "type": "integer" 
        },
        "content": {
          "type": "text",
          "analyzer": "ik_max_word"
        }
      }
    }
  }
}

ES relies on the analyzer both when storing (indexing) documents and when retrieving them. The analyzer set at storage time must be consistent with the analyzer used at retrieval time. If no analyzer is set for retrieval, ES defaults to the analyzer configured at storage time.

PUT /postilhub
{
  "mappings": {
    "user": {
      "properties": {
        "id": {
          "type": "keyword"  
        },
        "username": {
          "type": "keyword" 
        },
        "age": {
          "type": "integer" 
        },
        "content": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        }
      }
    }
  }
}

2. Add extension words

Since the IK analyzer can achieve fine-grained segmentation, being able to accurately recognize every word in a sentence is very important. However, as the Internet develops, new Internet words are coined every day, such as "Blue Lean mushroom", "Ori Ge", "emo"…… The IK analyzer is very limited in handling these new Internet words.

So we need to define new extension words on top of the IK analyzer's built-in dictionaries.

It is recommended to configure most extension dictionary and extension stop dictionary entries when the IK analyzer is first set up, because newly added extension words only apply to documents indexed afterwards, not to earlier documents.

  1. Go to the plugins folder of ES and find the root directory of the IK analyzer.

  2. Go to the config folder in the IK analyzer directory and find the IKAnalyzer.cfg.xml file.

    The .dic files are the IK analyzer's built-in dictionaries.

  3. Open the file:

    Both an extension dictionary and an extension stop dictionary can be configured here.

    • Extension dictionary: keywords to be added to the IK dictionary.

    • Extension stop dictionary: keywords that the IK analyzer should deliberately ignore during segmentation.

  4. Create the extension dictionary file

    The file name is user-defined, but the suffix must be .dic.

    In general, the extension dictionary file is named extension.dic and the extension stop dictionary file is named stop.dic.

  5. Add extension words to the extension dictionary file

    Note:

    • The extension dictionary file must be encoded in UTF-8.
    • Extension dictionaries and extension stop dictionaries can contain only one word per line.
  6. Configure the extension dictionary files in IKAnalyzer.cfg.xml

    Keep the key of the entry tag unchanged and put the extension dictionary file name between the tags. If multiple extension dictionary files are configured, separate the file names with ";" (see the configuration sketch after this list).

  7. Restart ES to load the extension dictionary.
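
A sketch of what the relevant part of IKAnalyzer.cfg.xml might look like after steps 4–6, assuming the dictionary files are named extension.dic and stop.dic (each containing one word per line, UTF-8 encoded) and sit next to the configuration file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- local extension dictionaries; separate multiple file names with ";" -->
    <entry key="ext_dict">extension.dic</entry>
    <!-- local extension stop dictionaries -->
    <entry key="ext_stopwords">stop.dic</entry>
</properties>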

3. Add extension words remotely

If extension words and stop words have to be added and maintained manually, the labor cost will undoubtedly increase greatly.

At present, a good scheme is to have Redis collect real-time statistics on search terms. At intervals, the statistics are retrieved and a program analyzes the popularity of each term. If a term is popular but not yet in the extension dictionary, a file I/O stream writes the keyword into the extension dictionary. ES monitors the extension dictionary in real time; if its contents change, the dictionary is reloaded and takes effect in ES.

Popularity indicators:

  • Keywords ranked at the top among all search terms.
  • Keywords that reach a predetermined number of searches.
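
A minimal sketch of the counting and export side of this scheme, assuming the Jedis client; the Redis key, threshold, sample term and file path are all hypothetical, and reloading is left to the remote-dictionary mechanism described below:

import redis.clients.jedis.Jedis;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

public class HotWordExporter {

    private static final String SEARCH_TERM_ZSET = "search:terms";   // hypothetical key
    private static final double HOT_THRESHOLD = 1000;                // hypothetical popularity threshold
    private static final Path EXTENSION_FILE =
            Paths.get("src/main/webapp/es/extension.txt");           // remote extension dictionary file

    /** Called by the search service for every search term. */
    public static void countSearchTerm(Jedis jedis, String term) {
        jedis.zincrby(SEARCH_TERM_ZSET, 1, term);
    }

    /** Run periodically: append hot terms that are not yet in the dictionary file. */
    public static void exportHotWords(Jedis jedis) throws IOException {
        Set<String> existing = new HashSet<>();
        if (Files.exists(EXTENSION_FILE)) {
            existing.addAll(Files.readAllLines(EXTENSION_FILE, StandardCharsets.UTF_8));
        }
        // All terms whose search count reaches the threshold, highest first.
        for (String word : jedis.zrevrangeByScore(SEARCH_TERM_ZSET, Double.MAX_VALUE, HOT_THRESHOLD)) {
            if (!existing.contains(word)) {
                Files.write(EXTENSION_FILE, (word + "\n").getBytes(StandardCharsets.UTF_8),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            countSearchTerm(jedis, "new-hot-word");
            exportHotWords(jedis);
        }
    }
}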

To add extension words remotely, create a TXT file to store the keywords, then configure the access address of this file in IKAnalyzer.cfg.xml. When the system finds that a term has become popular, it writes the keyword into the text file through a file stream, and ES picks up the new extension words automatically by monitoring the remote extension dictionary file in real time.

The following only shows how to configure the remote extension dictionary; it does not cover the whole solution for adding extension words remotely. For details, see the following article.

  1. Create the remote extension dictionary TXT file

    Each keyword in the TXT file must be on its own line (terminated by a carriage return/line feed); the format is the same as that of the .dic files.

    The TXT file is named extension.txt, and the extension stop dictionary file is named stop.txt.

    The TXT file is generally stored in the project; in a Spring Boot project, for example, it is recommended to put it in the webapp directory.

  2. Set the Spring Boot project context path

  3. Get the access address of the remote extension dictionary file: http://localhost:8080/es/extension.txt

  4. Configure the remote dictionary address in IKAnalyzer.cfg.xml (see the sketch after this list)

  5. Restart ES to load the remote extension dictionary.
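
For reference, the remote dictionary address obtained in step 3 goes into the remote_ext_dict entry of the same IKAnalyzer.cfg.xml; the stop-word URL below is a hypothetical counterpart for stop.txt:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- remote extension dictionary, polled periodically by the IK plugin -->
    <entry key="remote_ext_dict">http://localhost:8080/es/extension.txt</entry>
    <!-- remote extension stop dictionary (hypothetical URL) -->
    <entry key="remote_ext_stopwords">http://localhost:8080/es/stop.txt</entry>
</properties>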