Hello, everyone. I’m Li Xiaobing. In the article “Why ElasticSearch is Better than MySQL for Complex Conditions Search”, we explained how ElasticSearch supports full-text search and complex conditional queries over stored data. This article focuses on how ElasticSearch uses IK word segmentation before full-text search, so that we can gain a comprehensive and in-depth understanding of ElasticSearch full-text search and the principles of IK Chinese word segmentation.

Full-text search and exact matching

Elasticsearch supports both full-text search and exact matching on string data, but the corresponding field type must be set in advance:

  • keyword type: the value is stored without word segmentation; it supports both exact queries and segmented (match) queries;
  • text type: the value is segmented before being stored; it also supports both exact queries and segmented (match) queries.

For example, create an index named article and configure mappings for its two fields: set the content field of the article to text and the title field to keyword.

curl -XPUT http://localhost:9200/article
curl -XPOST http://localhost:9200/article/_mapping -H 'Content-Type:application/json' -d'
{
        "properties": {
            "content": {
                "type": "text"
            },
            "title": {
                "type": "keyword"
            }
        }

}'

ElasticSearch segments the content of the article (text type) into tokens before saving it, while the title of the article (keyword type) is not segmented and the original value is stored directly.

The right half of the figure shows the different storage processes for the keyword and text types. Correspondingly, ElasticSearch provides two query modes:

  • Term query, i.e. exact query: the input is not segmented, and the query is performed directly with the input term;
  • Match query, also known as segmented (match) query: the input is segmented first, and then the query is performed with the resulting tokens.

For example, suppose there are two articles, one whose title and content are both “程序员” (programmer) and another whose title and content are both “程序” (program). Their inverted indexes would then be stored in Elasticsearch as follows (assuming a particular tokenizer is used).

At this point, querying the two fields with term and match queries respectively produces the results shown on the right; a sketch of what such requests might look like is given below.
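As a minimal sketch of the two query modes, and assuming the article index mapped above (the query term 程序员 is only illustrative), the requests might look like this:

# term query: the input is used as-is against the keyword field title
curl -XPOST http://localhost:9200/article/_search -H 'Content-Type:application/json' -d'
{
    "query": {
        "term": { "title": "程序员" }
    }
}'

# match query: the input is segmented first, then matched against the text field content
curl -XPOST http://localhost:9200/article/_search -H 'Content-Type:application/json' -d'
{
    "query": {
        "match": { "content": "程序员" }
    }
}'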

The Analyzer processing flow

As can be seen, the difference between the keyword and text types, and between the term and match query modes, is whether word segmentation is carried out. In ElasticSearch, this segmentation process is collectively referred to as Text analysis, the process of converting a field from an unstructured string (text) into structured terms.

Text analysis not only performs word segmentation but also includes the following steps:

  • Character filters preprocess the original text, for example removing whitespace characters.
  • A Tokenizer splits the preprocessed text into tokens.
  • Token filters further process the tokens obtained in the previous step, for example changing tokens (lowercasing), removing tokens (dropping quantifiers), or adding tokens (adding synonyms), etc.

The component in ElasticSearch that performs Text analysis is called an Analyzer. An Analyzer is in turn made up of three kinds of components: Character filters, a Tokenizer, and Token filters.

Elasticsearch comes with 3 built-in character filters, 10 tokenizers, and 31 token filters. In addition, third-party implementations of these components can be obtained through the plugin mechanism, and developers can combine them into custom Analyzers according to their own needs. For example:

"analyzer": {
    "my_analyzer": {
        "type":           "custom",
        "char_filter":  [ "html_strip"],
        "tokenizer":      "standard",
        "filter":       [ "lowercase",]
    }
}

With the above configuration, my_analyzer works as follows:

  • The character filter is html_strip, which removes the characters associated with HTML tags;
  • The tokenizer is standard, Elasticsearch's default tokenizer;
  • The token filter is lowercase, which lowercases the resulting English tokens.
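To check what such an analyzer produces, the _analyze API can be used. The sketch below assumes my_analyzer has been defined in the settings of an index named my_index (the index name and sample text are only illustrative); it should return the tokens hello and world after the HTML tags are stripped and the words are lowercased.

curl -XPOST http://localhost:9200/my_index/_analyze -H 'Content-Type:application/json' -d'
{
    "analyzer": "my_analyzer",
    "text": "<p>Hello World</p>"
}'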

Generally speaking, the most important part of an Analyzer is the tokenizer, and the quality of its output directly affects the accuracy of search results and user satisfaction. Elasticsearch's default tokenizers are not the best choice for Chinese; at present, the industry mainly uses IK for Chinese word segmentation.
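For reference, once the IK plugin is installed, a text field of a new index can be configured to use IK's analyzers roughly as follows. This is only a sketch: using ik_max_word at index time and ik_smart at search time is a common but not mandatory combination, and both analyzers are discussed below.

curl -XPOST http://localhost:9200/article/_mapping -H 'Content-Type:application/json' -d'
{
    "properties": {
        "content": {
            "type": "text",
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_smart"
        }
    }
}'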

IK word segmentation principles

IK is currently the mainstream open-source Chinese word segmentation plugin for Elasticsearch. It has built-in basic Chinese dictionaries and segmentation algorithms that help developers quickly build Chinese word segmentation and search functions, and it also provides features such as an extension dictionary and a remote dictionary, making it easy for developers to add new Internet words and buzzwords.

IK provides three built-in dictionaries:

  • main.dic: the main dictionary, which contains common everyday words such as "programmer" and "programming";
  • quantifier.dic: the quantifier dictionary, which contains everyday quantifiers such as meter, hectare, and hour;
  • stopword.dic: the stopword dictionary, which mainly contains English stopwords such as a, such, that, and so on.

In addition, developers can extend these dictionaries by configuring an extension dictionary and a remote dictionary in IKAnalyzer.cfg.xml, for example:

<? The XML version = "1.0" encoding = "utf-8"? > <! DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd" > < properties > < comment > IK Analyzer extension configuration < / comment > <! > <entry key="ext_dict">custom/ mydic.dic </entry> <! > <entry key="ext_stopwords">custom/ext_stopword.dic</entry> <! --> <entry key="remote_ext_dict">location</entry> <! -- Here you can configure the remote extension stopword dictionary --> <entry key="remote_ext_stopwords">http://xxx.com/xxx.dic</entry> </properties>

When IK starts along with Elasticsearch, it reads the default dictionaries and the extension dictionaries and loads them into memory, storing them in a Trie (also known as a prefix tree) data structure, which is convenient for subsequent word segmentation.

The typical structure of such a dictionary tree is shown in the figure above. Each node holds one character; walking from the root node down to a node and joining the characters along the path yields the word corresponding to that node. The words stored in the tree in the figure include, for example, 程序员 (programmer), 程序 (program), 编码 (coding), and 工作 (work).

First, loading the dictionary

When IK’s Dictionary singleton object is initialized, it calls the corresponding load functions to read the dictionary files and construct three dictionary trees composed of DictSegment nodes: MainDict, QuantifierDict, and StopWords. Let’s look at how the main dictionary is loaded and built. The loadMainDict function is simple: it first creates a DictSegment as the root node of the dictionary tree, and then populates the tree by loading the default main dictionary, the extension main dictionary, and the remote main dictionary in turn.

private void loadMainDict() {
    // Create the root node of the dictionary tree
    _MainDict = new DictSegment((char) 0);
    // Read the main dictionary file
    Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_MAIN);
    loadDictFile(_MainDict, file, false, "Main Dict");
    // Load the extension dictionary
    this.loadExtDict();
    // Load the remote custom dictionary
    this.loadRemoteExtDict();
}

During the execution of loadDictFile, words are read line by line from the dictionary file and handed to the fillSegment function of DictSegment.

fillSegment is the core function for building the dictionary tree. Its implementation is shown below; the processing logic roughly consists of the following steps:

  • 1. Get one character of the word according to the current index;
  • 2. Check whether the character exists in the charMap of the current node; if not, add it;
  • 3. Call the lookforSegment function to find the child node in the dictionary tree that represents this character, and insert a new one if it does not exist;
  • 4. Recursively call fillSegment to process the next character.
private synchronized void fillSegment(char[] charArray, int begin, int length, int enabled) {
    // Get the character object for the current position
    Character beginChar = Character.valueOf(charArray[begin]);
    Character keyChar = charMap.get(beginChar);
    // If the character is not in the charMap yet, add it
    if (keyChar == null) {
        charMap.put(beginChar, beginChar);
        keyChar = beginChar;
    }
    // Search the current node's storage for the segment corresponding to keyChar; create it if it does not exist
    DictSegment ds = lookforSegment(keyChar, enabled);
    if (ds != null) {
        // Process the segment corresponding to keyChar
        if (length > 1) {
            // The word has not been fully added to the dictionary tree yet
            ds.fillSegment(charArray, begin + 1, length - 1, enabled);
        } else if (length == 1) {
            // This is already the last char of the word; set the node state:
            // enabled=1 marks a complete word, enabled=0 blocks the current word from the dictionary
            ds.nodeState = enabled;
        }
    }
}

IK's initialization process is roughly like this. For more detailed logic you can go straight to the source code; it contains Chinese comments and is relatively easy to read.

Second, word segmentation logic

IK implements Elasticsearch's related abstract classes to provide its own word segmentation logic:

  • IKAnalyzer extends Analyzer and provides the analyzer for Chinese word segmentation;
  • IKTokenizer extends Tokenizer and provides the tokenizer for Chinese word segmentation; its incrementToken method is the entry function that Elasticsearch calls to have IK perform word segmentation.

The incrementToken function calls the next method of IKSegmenter to obtain the segmentation results; next is the core method of IKSegmenter.

As shown in the figure above, IKSegmenter contains three sub-segmenters. During segmentation it traverses all the characters of the text and lets the three sub-segmenters process each character in turn:

  • LetterSegmenter, the English segmenter, is relatively simple: it groups consecutive English characters into words;
  • CN_QuantifierSegmenter, the Chinese quantifier segmenter, determines whether the current character is a numeral or a quantifier and joins consecutive numerals and quantifiers into a single word;
  • CJKSegmenter, the core segmenter, performs segmentation based on the dictionary tree.

We will only look at the implementation of CJKSegmenter. Its analyze function consists of two kinds of logic:

  • Look up the current character in the dictionary tree: if the character itself is a word, a lexeme is generated; if it is a word prefix, a Hit is placed in the temporary hit list.
  • In addition, the dictionary tree is searched again using the current character together with the temporary hits saved while processing the previous characters; if a complete word is hit, a lexeme is generated.
public void analyze(AnalyzeContext context) {
    // First process the hits saved in tmpHits while handling previous characters
    if (!this.tmpHits.isEmpty()) {
        // ...
        for (Hit hit : tmpArray) {
            hit = Dictionary.getSingleton().matchWithHit(context.getSegmentBuff(), context.getCursor(), hit);
            if (hit.isMatch()) {
                // Output the current word as a lexeme
                Lexeme newLexeme = new Lexeme(context.getBufferOffset(), hit.getBegin(), context.getCursor() - hit.getBegin() + 1, Lexeme.TYPE_CNWORD);
                context.addLexeme(newLexeme);
                // ...
            } else if (hit.isUnmatch()) {
                // The hit is not a word, remove it
                this.tmpHits.remove(hit);
            }
        }
    }
    // *********************************
    // Then match the character at the current cursor position against the main dictionary
    Hit singleCharHit = Dictionary.getSingleton().matchInMainDict(context.getSegmentBuff(), context.getCursor(), 1);
    if (singleCharHit.isMatch()) {
        // Output the current single-character word as a lexeme
        Lexeme newLexeme = new Lexeme(context.getBufferOffset(), context.getCursor(), 1, Lexeme.TYPE_CNWORD);
        context.addLexeme(newLexeme);
        // The character is also a word prefix, keep it in tmpHits
        if (singleCharHit.isPrefix()) {
            this.tmpHits.add(singleCharHit);
        }
    } else if (singleCharHit.isPrefix()) {
        // The character is a word prefix, keep it in tmpHits
        this.tmpHits.add(singleCharHit);
    }
    // ... Determine whether the buffer is finished and do the cleanup work
}

The specific code logic is shown above. To make it easier to understand, suppose the input text is 编码工作 ("coding work"):

  • First, the character 编 is processed;
  • Because tmpHits is currently empty, only the single-character match is performed;
  • 编 is looked up directly in the dictionary tree shown above (see the matchInMainDict function). The lookup hits a prefix: 编 is the prefix of a longer word but not a complete word by itself, so a Hit object recording 编 and its position in the input text is stored in tmpHits;
  • Next, the character 码 is processed;
  • Because tmpHits is not empty, the Hit object corresponding to 编 is combined with 码 to search the dictionary tree (see the matchWithHit function). This hits the word 编码 (coding), so 编码 is stored in AnalyzeContext as one of the output lexemes. However, because 编码 is already a leaf node with no children, i.e. it is not the prefix of any other word, the corresponding Hit object is removed from tmpHits;
  • The character 码 itself is then also looked up in the dictionary tree, to check whether it is a word by itself or the prefix of another word;
  • And so on, until all the characters have been processed.

Disambiguation and result output

Through the above steps, multiple overlapping results are sometimes generated. For example, 程序员爱编程 ("programmers love programming") is segmented into 程序员 (programmer), 程序 (program), 员 (staff), 爱 (love), and 编程 (programming). This is the output of IK's ik_max_word mode. However, there are situations where developers want only 程序员, 爱, and 编程, and then IK's ik_smart mode is used for disambiguation.
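The difference between the two modes can be observed directly with the _analyze API; the following is a sketch that assumes the IK plugin is installed:

curl -XPOST http://localhost:9200/_analyze -H 'Content-Type:application/json' -d'
{
    "analyzer": "ik_max_word",
    "text": "程序员爱编程"
}'

curl -XPOST http://localhost:9200/_analyze -H 'Content-Type:application/json' -d'
{
    "analyzer": "ik_smart",
    "text": "程序员爱编程"
}'

The first request should return the fine-grained tokens listed above, and the second the coarser tokens produced after disambiguation.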

IK uses IKArbitrator for disambiguation, mainly by combinatorial traversal. It extracts groups of intersecting tokens from the segmentation results of the previous stage; two tokens intersect if their positions in the text overlap. For example, the three tokens 程序员, 程序, and 员 intersect with one another, while 爱 and 编程 do not intersect with anything. Therefore 程序员, 程序, and 员 are treated as one group, 爱 as another, and 编程 as another, and from each group the candidate segmentation with the highest rule priority is selected. The specific rules are as follows:

  • Greater valid text length is preferred;
  • Fewer tokens are preferred;
  • Greater path span is preferred;
  • Later positions are preferred, because according to statistical conclusions, reverse segmentation is more likely to be correct than forward segmentation;
  • More even token lengths are preferred;
  • Greater token position weight is preferred.

According to the above rules, in the first group 程序员 clearly fits the rules better than 程序 plus 员, so the disambiguation outputs 程序员 rather than 程序 and 员.

Finally, some positions of the input text may not be covered by any of the output words, and those characters are output directly as single-character lexemes (see the outputToResult function of AnalyzeContext for details). For example, in 程序员是专业的 ("programmers are professional"), the character 是 ("is") is not segmented into any word, but in the final output it should still appear as a single character.

Afterword

The combination of ElasticSearch and IK is currently the mainstream technology for Chinese search. Understanding the basic flow and principles of its search and word segmentation will help developers build Chinese search functions more quickly, or customize segmentation strategies according to their own needs.