Overview

This article explains the basic principle of the inverted index and introduces several commonly used Elasticsearch analyzers.

The process of building an inverted index

An inverted index is a common index structure used by search engines for full-text search: it stores a mapping from each word to the documents in which that word appears. With an inverted index, we can enter a keyword and very quickly obtain the list of documents that contain it.

Let’s look at English first. Suppose we have two documents:

  1. I have a friend who loves smile
  2. love me, I love you

The simplest way to build an inverted index is to split each document into words on spaces, which gives the following result (* means the term appears in that document, blank means it does not):

Term     doc1   doc2
I        *      *
have     *
a        *
friend   *
who      *
loves    *
smile    *
love            *
me              *
you             *

If we want to search for “I Love You”, we just need to look up the documents that contain each term:

Term   doc1   doc2
I      *      *
love          *
you           *

Both documents match, and in terms of the number of hits doc2 is a better match than doc1.

This is the simplest form of an inverted index. In the inverted index of ES, the position of each term within the document is also recorded.
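As a rough illustration of the same behavior in ES (a sketch only; the index name demo and the field name content are made up here), we can index the two documents and issue a match query. Both documents are returned, and doc 2 should be ranked higher:

PUT /demo/_doc/1
{ "content": "I have a friend who loves smile" }

PUT /demo/_doc/2
{ "content": "love me, I love you" }

GET /demo/_search
{ "query": { "match": { "content": "I love you" } } }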

What we expect from word segmentation

The index above contains both loves and love. They both mean "love": one is the third-person singular form, the other the base form. Wouldn't the search results be closer to what we expect if some of these grammatical differences were removed? For example:

  • treat loves and love as the same term, love
  • filter out "a", "have" and other words that carry little meaning
  • and so on

The index now looks like this:

Term     doc1   doc2
friend   *
love     *      *
smile    *
me              *
you             *

Much leaner, isn't it? This process is called normalization. During normalization, a series of operations is applied while building the inverted index to improve the chance of matching relevant documents, such as tense conversion, singular/plural conversion, synonym conversion, and case conversion.
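A quick way to see normalization in action is the _analyze API with the built-in english analyzer (a minimal sketch; the exact tokens depend on the stop word list and stemmer of your ES version):

GET /_analyze
{ "analyzer": "english", "text": "I have a friend who loves smile" }

Here "a" is dropped as a stop word and "loves" is reduced to the stem "love", in the spirit of the simplified index above.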

Enter the analyzer

The function of word segmentation is to divide the whole document into individual terms according to certain semantics. The goal is to improve document recall and reduce the noise introduced by irrelevant data.

Improving recall means increasing the number of relevant results that can be found during a search. Noise reduction means reducing the interference of low-relevance terms in a document on the overall ranking of search results.

The word segmentation process of a document consists of the following steps:

  • Character filter

Preprocesses the raw string, for example stripping HTML tags (e.g. <b>Love</b> –> Love) and converting entities (I &amp; you –> I and you).

  • Tokenizer

Splits the string into individual terms, for example English by spaces and punctuation, and Chinese by words. Different languages have different tokenizers, ranging from the relatively simple standard tokenizer to very complex Chinese tokenizers that contain sophisticated segmentation logic, such as:

I Love you –> I/Love/you

Me and my motherland –> me/and/my/motherland

  • Token filter

Performs further processing on the terms produced by the tokenizer, such as modifying terms (e.g. loves –> love), deleting meaningless terms (a, and, this, and Chinese stop words), and adding terms (e.g. synonyms).

Word segmentation is very important: good segmentation can significantly improve recall, while poor segmentation may introduce ambiguity into the search. The processed terms are finally used to build the inverted index. A sketch of how the three stages fit together is shown below.
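The following is a minimal sketch of a custom analyzer that chains the three stages; the names my_index and my_analyzer are made up for illustration. It uses the html_strip character filter, the standard tokenizer, and the lowercase and stop token filters:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}

GET /my_index/_analyze
{ "analyzer": "my_analyzer", "text": "<b>This is a Love story</b>" }

The character filter strips the HTML tags, the tokenizer splits the text into words, and the token filters lowercase the terms and drop the stop words, leaving roughly love and story.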

Introduction to common analyzers

Elasticsearch comes with a set of built-in analyzers and also allows third-party analyzers to be installed as plugins.

Built-in analyzers

  • Standard analyzer

The ES default. It splits text at the word boundaries defined by the Unicode Consortium, removes most punctuation, and finally lowercases the terms.

  • Simple analyzer

Splits the text at every character that is not a letter and lowercases the terms.

  • Whitespace analyzer

Splits the text on whitespace.

  • Language analyzers

Language-specific analysis. The English analyzer, for example, maintains a set of English stop words such as "and" and "the" that are removed from the entries, and it can extract word stems according to English grammar rules.
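As a small illustration (not from the original article), the _analyze API can be pointed at different built-in analyzers to compare their output:

GET /_analyze
{ "analyzer": "standard", "text": "I Love you." }

GET /_analyze
{ "analyzer": "whitespace", "text": "I Love you." }

The standard analyzer should return the lowercased terms i / love / you with the punctuation removed, while the whitespace analyzer only splits on spaces and keeps I / Love / you. as-is.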

The built-in analyzers handle English well, but Chinese requires an external analyzer.

External analyzers

  • IK Chinese analyzer, ik_max_word

    Performs the finest-grained split, breaking out as many words as possible.

    For example: Nanjing Yangtze River Bridge –> Nanjing city / Nanjing / mayor / Yangtze River Bridge / Yangtze River / bridge
  • IK Chinese analyzer, ik_smart

    Performs the coarsest-grained split; a word that has already been split out will not be reused by other words. For example: Nanjing Yangtze River Bridge –> Nanjing Yangtze River Bridge
  • CJK analyzer, which supports Chinese, Japanese, and Korean by producing overlapping two-character tokens. For example: Nanjing Yangtze River Bridge –> Nanjing city / Beijing / mayor / Yangtze River / Jiangda / bridge
  • aliws, Alibaba's Chinese analyzer. For example: Nanjing Yangtze River Bridge –> Nanjing / city / Yangtze River / bridge

There are many external analyzers, many of them open source, covering different languages and different domains. You can choose the one that fits the characteristics of your own business; they are not introduced one by one here, but they are worth exploring.

Installing an analyzer plugin

Take Elasticsearch 6.3.1 and the IK analyzer as an example; the installation process is similar for other analyzer plugins.

./elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.1/elasticsearch-analysis-ik-6.3.1.zip

The GitHub release version of elasticsearch-analysis-ik must match the Elasticsearch version. After installation, restart Elasticsearch; the startup log shows that the plugin has been loaded:

[2019-11-27T12:17:15.255][INFO][o.e.p.PluginsService][node-1] loaded plugin [analysis-ik]

Test the segmentation effect

The _analyze API lets you see how a given analyzer breaks a piece of text into terms.

GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "Nanjing Yangtze River Bridge"
}

Response results:

{" tokens ": [{" token" : "nanjing", "start_offset" : 0, "end_offset" : 3, "type" : "CN_WORD", "position" : 0}, {" token ": "Nanjing", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 1}, {" token ":" mayor ", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 2}, {" token ":" Yangtze river bridge ", "start_offset" : 3, "end_offset" : 7, "type" : "CN_WORD", "position" : 3}, {" token ":" the Yangtze river ", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 4}, {" token ":" the bridge ", "start_offset" : 5, "end_offset" : 7, "type" : "CN_WORD", "position" : 5}]}Copy the code

Summary

This article introduced the basic idea of the inverted index, showed its simplified structure, and explained the basic steps of word segmentation. There are many popular analyzers available, and the open source community is very active; choose the appropriate one to integrate according to your actual project requirements and context, and pay attention to version compatibility.

For more content on Java high concurrency, distributed architecture, and other technical topics and experience sharing, follow the public account: Java Architecture Community.