Preface

Since this is a document center, the most important feature for front-end users, beyond basic document reading, is searching documents by keyword. Whether the text is English or Chinese, this is essentially full-text search, but Chinese requires some extra processing.

Elasticsearch overview

Full-text search is one of the most common requirements, and the open-source Elasticsearch is currently the preferred full-text search engine. It can store, search, and analyze huge amounts of data quickly; Wikipedia, Stack Overflow, and GitHub all use it. Elasticsearch is built on the open-source library Lucene, but you cannot use Lucene directly: it is a full-text search library rather than a ready-to-use search engine, and you have to write your own code to call its interfaces. Elasticsearch wraps Lucene and provides a REST API that works right out of the box.

Elasticsearch installation

Use Docker to install:

docker pull elasticsearch

This fails with: Error response from daemon: manifest for elasticsearch:latest not found. The latest tag is not supported; you must specify a version number for Elasticsearch. As of this writing, the latest version of Elasticsearch is 7.6.0. If you specify the tag manually, the image pulls successfully:

docker pull elasticsearch:7.6.0

Run Elasticsearch:

docker run -d --name elasticsearch -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" elasticsearch:7.6.0
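Once the container is up, you can confirm that Elasticsearch is reachable; a plain request to port 9200 should return basic cluster information, something like the (abbreviated) response below:

curl http://localhost:9200

{
    "name" : "...",
    "cluster_name" : "docker-cluster",
    "version" : { "number" : "7.6.0", ... },
    "tagline" : "You Know, for Search"
}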

Basic Elasticsearch concepts

Node and Cluster

Elasticsearch is essentially a distributed database that allows multiple servers to work together, and each server can run multiple Elasticsearch instances. A single instance of Elasticsearch is called a node, and a group of nodes forms a cluster.
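With the single-node Docker setup above, you can list the nodes of the cluster (just one here) through the _cat API:

curl -X GET 'http://localhost:9200/_cat/nodes?v'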

Index

The top-level unit of data management in Elasticsearch is called an Index, which is the rough equivalent of a single database. The name of each Index (or database) must be lowercase.

Unlike the older Sphinx or Coreseek, Elasticsearch indexes all fields in real time: after processing, an inverted index is written, and queries search that index directly, so search results are available almost in real time.

To view all indexes on the current node, use the following command:

curl -X GET 'http://localhost:9200/_cat/indices?v'

Document

The individual entries in an Index are called documents. In Elasticsearch, the term document has a specific meaning. It refers to the top level or root object that is serialized to JSON and stored in Elasticsearch with a unique ID specified. Many documents make up an Index.

A Document is represented in JSON format. Here is an example:

{"title": "UOS account registration ", "category":" New Guide ", "content": "UOS account registration "}

Documents in the same Index are not required to have the same structure (schema), but keeping it consistent is better, as it helps search efficiency.
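As a quick illustration (assuming the doc index created in the sections below, with Elasticsearch's default dynamic mapping), indexing a document that carries an extra field the other documents lack simply succeeds, and the new field is added to the index's mapping:

curl -X POST http://localhost:9200/doc/_doc -H 'Content-Type:application/json' -d '
{
    "title": "UOS account registration",
    "category": "New Guide"
}'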

Type

Elasticsearch exposes a feature called Type that allows logical grouping of data within an Index. Documents of different types may have different fields, but they should be broadly similar. The concept of Type has evolved across recent versions as follows:

  • In 5.x, multiple types can be created under an index;
  • In 6.x, only one type can be created under an index;
  • In 7.x, Type has been removed entirely, although a default, unique type named _doc remains.
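In practice this means that on 7.x every document API is addressed through the fixed _doc endpoint; for example, fetching a document by ID (assuming an index named doc) looks like:

curl -X GET http://localhost:9200/doc/_doc/1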

IK Chinese word segmentation plugin

Search requirements

At the beginning of this article, I mentioned the document search feature, which actually involves two requirements: Chinese word segmentation (of both the search keywords and the document title and content) and keyword highlighting in search results. Why call out Chinese word segmentation separately? In short, the default tokenizer splits on the natural spaces between English words, which obviously do not exist in Chinese text.

Introduction to the IK plugin

To meet these two requirements, an additional Chinese word-segmentation plugin is needed. We chose IK for the document center project. The IK plugin integrates the Lucene IK analyzer into Elasticsearch and supports custom dictionaries. It provides ik_smart and ik_max_word, which can be specified both as analyzers and as tokenizers.

Plugin installation

Go to the Elasticsearch root directory and install it:

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.6.0/elasticsearch-analysis-ik-7.6.0.zip

Restart Elasticsearch after installation.
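To confirm that the plugin was actually loaded, you can list installed plugins either with the plugin CLI or through the _cat API; analysis-ik should appear in the output:

./bin/elasticsearch-plugin list

curl -X GET 'http://localhost:9200/_cat/plugins?v'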

Basic usage

  • First create the index to store the documents; here we use doc as the index name:
curl -X PUT http://localhost:9200/doc
  • Specify an analyzer for each field that needs word segmentation:
curl -X POST http://localhost:9200/doc/_mapping -H 'Content-Type:application/json' -d '
{
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart"
            },
            "content": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart"
            }
        }

}'
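To double-check that the mapping was applied as intended, you can read it back with the standard _mapping endpoint:

curl -X GET http://localhost:9200/doc/_mapping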

For the document center, we need to specify analyzers for the title and content fields. analyzer is the analyzer applied to the field text at index time, and search_analyzer is the analyzer applied to the search terms. ik_max_word and ik_smart differ as follows: ik_max_word splits the text at the finest granularity; for example, “National Anthem of the People’s Republic of China” is split into “People’s Republic of China”, “People’s China”, “China”, “Chinese”, “People’s Republic”, “People”, “Republic”, “Country”, “National Anthem”, exhausting all possible combinations; it is suitable for Term queries.

ik_smart splits the text in a coarse-grained way; for example, “National Anthem of the People’s Republic of China” is kept whole as “National Anthem of the People’s Republic of China”; it is suitable for Phrase queries.

We want the content field to be split as finely as possible to maximize the chance of a search hit, so we specify ik_max_word as its analyzer. Search terms, on the other hand, should be split as coarsely as possible so that results match the search semantics, so we specify ik_smart as the search_analyzer.
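You can compare the two analyzers directly with the _analyze API. A minimal sketch, using the Chinese text of the “National Anthem of the People’s Republic of China” example above:

curl -X GET "http://localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
    "analyzer": "ik_smart",
    "text": "中华人民共和国国歌"
}'

Swap ik_smart for ik_max_word to see the exhaustive fine-grained split.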

  • Store a document, for example:
curl -X POST http://localhost:9200/doc/_doc -H 'Content-Type:application/json' -d '
{
    "title": "UOS account registration",
    "content": "UOS official website account and UOS developer background account are common, if you have registered an account on the developer background, you can use the registered developer account to log into the system."
}'

If you don’t have a natural ID in your business, you can let Elasticsearch generate one instead of specifying the ID manually. The auto-generated ID is a URL-safe, Base64-encoded, 20-character GUID string. These GUIDs come from a Flake-style ID scheme, which lets multiple nodes generate unique IDs in parallel with virtually zero probability of collision. You can see that _id is returned in the response to the command above, with a value like h4dib3abkqhg1-lvunaq.
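Conversely, if your business does have a natural ID, specify it yourself by using PUT with the ID in the path (the ID 1 and the elided content here are placeholders):

curl -X PUT http://localhost:9200/doc/_doc/1 -H 'Content-Type:application/json' -d '
{
    "title": "UOS account registration",
    "content": "..."
}'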

  • Keyword search:
curl -X POST http://localhost:9200/doc/_doc/_search -H 'Content-Type:application/json' -d '
{
    "query": { "match": { "content": "注册账号" } },
    "highlight": {
        "pre_tags": ["<tag1>"],
        "post_tags": ["</tag1>"],
        "fields": { "content": {} }
    }
}'

For example, we search for documents using the keyword “注册账号” (“register an account”) and specify <tag1> as the wrapping tag for highlighted keywords. The results are as follows:

{ "took":56, "timed_out":false, "_shards":{ "total":1, "successful":1, "skipped":0, "failed":0 }, "Hits" : {" total ": {" value" : 1, the "base" : "eq"}, "max_score" : 0.4868466, "hits" : [{" _index ":" doc," "_type" : "_doc", "_id" : "IIAfcHABkQHg1 - LvWtCp", "_score" : 0.4868466, "_source" : {" title ":" UOS account registration ", "content" : "UOS official website account and UOS developer background account are common, if you have registered an account in the developer background, "}, "highlight":{"content":["UOS official website <tag1> account </tag1> account </tag1> account </tag1> generic, If you have <tag1> account </tag1> at Developer Background <tag1> registered </tag1>, you can log in using <tag1> account </tag1> who has <tag1> registered </tag1> "]}}]}}

In the hits array, the highlight key contains the returned content with the matching search keywords wrapped in <tag1> tags (commonly known as keyword highlighting). You can see that “注册账号” has been split into “注册” (“register”) and “账号” (“account”) by the segmentation plugin. If the Chinese word-segmentation plugin were not installed, only documents containing the whole phrase “注册账号”, with the characters joined together, could be found.

Configure the single-character dictionary

By default there is no single-character dictionary, and the tokenizer segments text according to the Chinese phrases in the default main dictionary. In some business scenarios this is not enough. Take the keyword search example above: this time we want a search for the single character “号” (“number”) to return the documents that contain it:

curl -X POST http://localhost:9200/doc/_doc/_search -H 'Content-Type:application/json' -d '
{
    "query": { "match": { "content": "号" } },
    "highlight": {
        "pre_tags": ["<tag1>"],
        "post_tags": ["</tag1>"],
        "fields": { "content": {} }
    }
}'

The results are as follows:

{
    "took":1,
    "timed_out":false,
    "_shards":{"total":1,"successful":1,"skipped":0,"failed":0},
    "hits":{
        "total":{"value":0,"relation":"eq"},
        "max_score":null,
        "hits":[]
    }
}

You can see that no document matches. This indicates that when the document was stored and the inverted index was built, Chinese segmentation kept “账号” as a single phrase rather than splitting it into the two characters “账” and “号”. We can verify this with the _analyze API:

curl -X GET "http://localhost:9200/doc/_analyze" -H 'Content-Type:application/json' -d '
{
    "text": "账号",
    "tokenizer": "ik_max_word"
}'

The results are as follows:

{" tokens ": [{" token" : "accounts", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0}]}

The result confirms our conclusion: “账号” is treated as one phrase and is not split into the two characters “账” and “号”.

Now let’s configure the single-character dictionary. Open {conf}/analysis-ik/config/IKAnalyzer.cfg.xml and find the line <entry key="ext_dict"></entry>, where you can configure your own extension dictionaries. Here we add the dictionary we need:

<entry key="ext_dict">extra_single_word.dic</entry>

The extra_single_word.dic file ships in the plugin’s configuration directory, but it is not enabled by default. Restart Elasticsearch for the configuration to take effect.
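Then rerun the earlier segmentation test:

curl -X GET "http://localhost:9200/doc/_analyze" -H 'Content-Type:application/json' -d '
{
    "text": "账号",
    "tokenizer": "ik_max_word"
}'

The results are as follows: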

{" tokens ": [{" token" : "accounts", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0}, {" token ":" zhang ", "Start_offset" : 0, "end_offset" : 1, "type" : "CN_WORD", "position" : 1}, {" token ":" no. ", "start_offset" : 1, "end_offset" : 2, "type":"CN_WORD", "position":2 } ] }

This time “账号” is additionally split into the two single characters “账” and “号”, indicating that the configured single-character dictionary has taken effect.

Now, if we repeat the previous search, will there be a match? We ran it again, but the result was still empty. Why? Because the document was indexed before we configured the single-character dictionary, its inverted index does not contain “账” and “号” as separate terms for the document content, so a search for “号” still matches nothing. Let’s add a new document:

curl -X POST http://localhost:9200/doc/_doc -H 'Content-Type:application/json' -d '
{
    "title": "UOS2 account registration",
    "content": "UOS2 official website account and UOS2 developer background account are common, if you have registered an account on the developer background, you can use the registered developer account to log in."
}'

Execute the search for “号” again; the highlight section of the result is as follows:

"Highlight ":{"content":["UOS2 official account <tag1> </tag1> and UOS developer backlog <tag1> </tag1> are common. You can log into the system using the registered developer account <tag1> </tag1>. "] }

You can see that this time there is a match, and the character “号” is wrapped as a keyword in the specified highlight tag <tag1>.