The HanLP analysis plugin is an open-source Chinese analyzer for Elasticsearch. It is based on HanLP and exposes most of HanLP's segmentation modes. Its source code is located at:

github.com/KennFalcon/elasticsearch-analysis-hanlp

The plugin has release packages for each Elasticsearch version since 5.2.2.

1. Installation

Method 1:

A. Download the release package corresponding to your Elasticsearch version. The latest release packages can also be downloaded from Baidu Netdisk (link: pan.baidu.com/s/1mFPNJXgi… password: i0o7).

B. Run the following command to install the plugin, where ${PATH} is the absolute path to the plugin package:

./bin/elasticsearch-plugin install file://${PATH}
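For example, if the 7.4.2 package was saved under /tmp (a hypothetical location), the command would be:

./bin/elasticsearch-plugin install file:///tmp/elasticsearch-analysis-hanlp-7.4.2.zip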

Method 2:

A. Install directly from the release URL with the elasticsearch-plugin script (pick the release that matches your Elasticsearch version):

./bin/elasticsearch-plugin install https://github.com/KennFalcon/elasticsearch-analysis-hanlp/releases/download/v7.4.2/elasticsearch-analysis-hanlp-7.4.2.zip

After the installation completes, you can verify that it succeeded as follows:

$ ./bin/elasticsearch-plugin list
analysis-hanlp

If the installation was successful, you will see the output above.

2. Installation package

The release package ships with the default segmentation data from the HanLP source code. To download the complete data package, see the HanLP releases.

Package directory: ES_HOME/plugins/analysis-hanlp

Note: Because some user-defined dictionary files in the original data package are named in Chinese, their names in hanlp.properties have been changed to English. Rename the dictionary files accordingly.
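As a sketch, unpacking the full data package into the plugin directory might look like this (the archive name depends on which HanLP release you downloaded):

cd ${ES_HOME}/plugins/analysis-hanlp
unzip /tmp/data-for-1.7.5.zip     # unpacks the data/ directory containing dictionaries and models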

3. Restart Elasticsearch

Note: ES_HOME in the above description is your Elasticsearch installation path, and it must be an absolute path.

This step is essential: until Elasticsearch is restarted, the newly installed analyzer will not take effect.
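How you restart depends on how Elasticsearch is deployed. For a systemd-based package installation (an assumption; adjust to your setup), it might be:

sudo systemctl restart elasticsearch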

4. Hot update

This version adds dictionary hot updating. To use it:

A. Add a custom dictionary file under the ES_HOME/plugins/analysis-hanlp/data/dictionary/custom directory.

B. Edit hanlp.properties and add the new dictionary to CustomDictionaryPath (see the sketch after this list).

C. Wait about one minute; the dictionary is reloaded automatically.

Note: these changes must be made on every node.
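As a sketch, the relevant lines might look like the following; mydict.txt is a hypothetical file created in step A, and the entries in CustomDictionaryPath are semicolon-separated:

# in hanlp.properties: append the new dictionary after the default one
CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; mydict.txt;

# in data/dictionary/custom/mydict.txt: one entry per line, optionally
# followed by a part-of-speech tag and a frequency
自然语言处理 nz 1024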

5. Tokenizers provided

  • hanlp: HanLP's default tokenizer
  • hanlp_standard: standard tokenizer
  • hanlp_index: index tokenizer
  • hanlp_nlp: NLP tokenizer
  • hanlp_n_short: N-shortest-path tokenizer
  • hanlp_dijkstra: shortest-path tokenizer
  • hanlp_crf: CRF tokenizer (a newer implementation is available)
  • hanlp_speed: extreme-speed dictionary tokenizer

Let’s do a simple example:

GET _analyze
{
  "text": "美国阿拉斯加州发生8.0级地震",
  "tokenizer": "hanlp"
}

The returned result is:

{" tokens ": [{" token" : "the United States", "start_offset" : 0, "end_offset" : 2, "type" : "NSF", "position" : 0}, {" token ": "Alaska", "start_offset" : 2, "end_offset" : 7, "type" : "the NSF", "position" : 1}, {" token ":" ", "start_offset" : 9, 7, "end_offset" : "type" : "v", "position" : 2}, {" token ":" 8.0 ", "start_offset" : 9, "end_offset" : 12, "type" : "M", "position" : 3}, {" token ":" grade ", "start_offset" : 12, "end_offset" : 13, "type" : "q", "position" : 4}, {" token ":" earthquake ", "start_offset" : 13, "end_offset" : 15, "type" : "n", "position" : 5}]}Copy the code

For more details, see github.com/KennFalcon/elasticsearch-analysis-hanlp.