Overall structure of Elasticsearch

Having walked through the overall principles of ES with the diagrams above, let's now lay out its overall structure:

  • In cluster mode, an ES cluster consists of multiple nodes. Each node is an instance of ES.
  • Each node holds multiple shards. In the figure, P1 and P2 are primary shards, while R1 and R2 are replica shards (see the sketch after this list).
  • Each shard corresponds to a Lucene index.
  • A Lucene index is an umbrella term for a collection of segments. Each segment file stores indexed documents (Docs), and a commit point records information about all segments.
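
As a sketch of how this maps to the API: the number of primary and replica shards is fixed when an index is created. The index name my_index below is a placeholder, not something from the original figure:

PUT /my_index
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}

With 2 primary shards and 1 replica of each, the cluster holds exactly P1, P2, R1, and R2, distributed across its nodes.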

Supplement: Lucene index structure

What files are in Lucene's index structure shown in the figure above? There are more file types than the figure shows, and the relationships between the files are as illustrated below.

Supplement: Lucene's processing flow

As illustrated above, you also need to understand Lucene's processing flow, which will help you index and search documents more effectively.

To create an index:

  • Prepare the original documents to be indexed. The data source may be a file, a database, or the network.
  • The document content is tokenized into a series of terms.
  • The indexing component processes the documents and terms to build the term dictionary and the inverted lists (see the sketch after this list).
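
A minimal sketch of the indexing side from the ES point of view, assuming a 6.x+ cluster where documents live under the _doc endpoint (the index name and the content field are hypothetical). Indexing a document triggers exactly this analyze-and-invert pipeline for each full-text field:

PUT /my_index/_doc/1
{
  "content": "Set the shape to semi-transparent by calling set_trans(5)"
}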

The search process:

  • The query string is tokenized into a series of terms.
  • Using the inverted index, the documents containing each term are looked up and merged into the set of matching documents.
  • The relevance of the query to each matching document is scored, and results are returned ranked by that score (see the sketch after this list).
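
And a sketch of the search side, against the same hypothetical index and field: a match query runs through exactly these three steps (tokenize the query, look up and merge the postings, then score and rank):

GET /my_index/_search
{
  "query": {
    "match": {
      "content": "semi transparent shape"
    }
  }
}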

Supplement: the Elasticsearch analyzer

One of the most important steps in the figure above is analysis/language processing, so we need to supplement our knowledge of the Elasticsearch analyzer.

Analysis involves the following steps:

  • First, a block of text is broken up into individual terms suitable for use in an inverted index,
  • Then, these terms are normalized into a standard form to improve their “searchability,” or recall

The analyzer does this work. Analyzers actually package three functions together:

  • Character filters: first, the string passes through each character filter in order. Their job is to tidy up the string before tokenization. A character filter can be used to strip out HTML, or to convert & to and.
  • Tokenizer: next, the string is split into individual terms by the tokenizer. A simple tokenizer might break the text into terms whenever it encounters whitespace or punctuation.
  • Token filters: finally, the terms pass through each token filter in order. This step may change terms (for example, lowercasing Quick), remove terms (for example, stopwords such as a, and, the), or add terms (for example, synonyms such as jump and leap).

Elasticsearch provides character filters, tokenizers, and token filters out of the box. These can be combined into custom analyzers for different purposes, as in the sketch below.
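
A sketch of such a combination, with hypothetical index and analyzer names: the built-in html_strip character filter, a mapping character filter that rewrites & to and, the standard tokenizer, and the lowercase and stop token filters wired into one custom analyzer.

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "and_mapper": {
          "type": "mapping",
          "mappings": ["& => and"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "and_mapper"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}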

Built-in analyzers

Elasticsearch also ships with pre-packaged analyzers that you can use directly. The most important ones are listed next. To show the differences, let’s look at what terms each analyzer produces from the following string:

"Set the shape to semi-transparent by calling set_trans(5)"
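
You can reproduce the outputs below yourself with the _analyze API; a minimal sketch (swap standard for simple, whitespace, or english to compare the analyzers):

GET /_analyze
{
  "analyzer": "standard",
  "text": "Set the shape to semi-transparent by calling set_trans(5)"
}
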
  • Standard analyzer

The standard analyzer is the default analyzer used by Elasticsearch. It is the most common choice for analyzing text in a variety of languages. It splits text on word boundaries as defined by the Unicode Consortium, removes most punctuation, and finally lowercases the terms. It produces:

set, the, shape, to, semi, transparent, by, calling, set_trans, 5
  • Simple analyzer

The simple analyzer splits the text on anything that is not a letter and lowercases the terms. It produces:

set, the, shape, to, semi, transparent, by, calling, set, trans
  • Whitespace analyzer

The whitespace analyzer splits the text on whitespace. It produces:

Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
  • Language analyzer

Language-specific analyzers are available for many languages. They can account for the characteristics of a given language. For example, the english analyzer comes with a set of English stopwords (common words, such as and or the, that have little effect on relevance) and removes them. Because it understands the rules of English grammar, it can also extract the stems of English words.

The english analyzer produces the following terms:

set, shape, semi, transpar, call, set_tran, 5

Notice that transparent, calling, and set_trans have been reduced to their stem forms.

When to use analyzers

When we index a document, its full-text fields are analyzed into terms to create the inverted index. However, when we search a full-text field, we need to put the query string through the same analysis process, to ensure that the terms we search for are in the same form as the terms in the index.

Full-text queries understand how each field is defined, so they can do the right thing:

  • When you query a full-text field, the same analyzer is applied to the query string to produce the right list of search terms.
  • When you query an exact-value field, the query string is not analyzed; instead, you search for the exact value you specify (see the sketch after this list).
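
As a sketch of the difference (the index and field names here are hypothetical): match is a full-text query that analyzes its input, while term is an exact-value query that does not.

GET /my_index/_search
{
  "query": { "match": { "content": "2014-09-15" } }
}

GET /my_index/_search
{
  "query": { "term": { "date": "2014-09-15" } }
}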

For example

Suppose ES holds one tweet per day (12 tweets in total). Querying them behaves as follows:

GET /_search?q=2014              # 12 results
GET /_search?q=2014-09-15        # 12 results !
GET /_search?q=date:2014-09-15   # 1 result
GET /_search?q=date:2014         # 0 results !

Why do the queries return those results?

  • The date field contains an exact value: the single term 2014-09-15.
  • The _all field is a full-text field, so the analysis process converts the date into three terms: 2014, 09, and 15 (see the mapping sketch after this list).
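
For context, a sketch of the kind of mapping behind this behavior (the index and type names are hypothetical, and the _all field was still enabled by default in the Elasticsearch versions this example comes from): date is mapped as an exact-value date type, while _all is analyzed full text.

PUT /my_index
{
  "mappings": {
    "tweet": {
      "properties": {
        "date": { "type": "date" }
      }
    }
  }
}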

When we look for 2014 in the _all field, it matches all 12 tweets because they all contain 2014:

GET /_search?q=2014 # 12 results

When we query for 2014-09-15 in the _all field, the query string is first analyzed, producing a query that matches any of the terms 2014, 09, or 15. This also matches all 12 tweets, since they all contain 2014:

GET /_search?q=2014-09-15 # 12 results !

When we query 2014-09-15 in the date field, it looks for the exact date and finds only one tweet:

GET /_search?q=date:2014-09-15 # 1 result

When we query 2014 in the date field, it cannot find any document, because no document contains this exact date:

GET /_search?q=date:2014 # 0 results !