1. Distributed document storage

1.1 Routing a document to a shard

When a document is indexed, it is stored in a single primary shard. How does Elasticsearch know which shard to store a document in? When we create a document, how does it decide whether to store it in shard 1 or shard 2?

First of all, the choice certainly can't be random, otherwise we wouldn't know where to look when we need to retrieve the document later. In fact, the process is determined by the following formula:

shard = hash(routing) % number_of_primary_shards

routing is a variable value, which defaults to the document's _id and can be set to a custom value. routing is passed through a hash function to generate a number, which is then divided by number_of_primary_shards (the number of primary shards) to get the remainder. The remainder, between 0 and number_of_primary_shards - 1, is the shard where the document lives.

This explains why the number of primary shards is fixed at index creation time and never changes: if the number changed, all previous routing values would become invalid and documents could never be found again.

All document APIs (get, index, delete, bulk, update, and mget) accept a routing parameter, which allows you to customize the mapping of documents to shards. A custom routing value can be used to ensure that all related documents (for example, all documents belonging to the same user) are stored in the same shard. We discuss in detail why such a requirement arises in the chapter on designing for scale.
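As a minimal sketch (the index name, type, and routing value here are hypothetical), a custom routing value is passed as a query parameter on the document APIs. The same value must then be supplied when retrieving the document, since it determines which shard to look in:

# store all of this user's documents on the shard chosen by hash("user123")
PUT /users/user/1?routing=user123
{
  "name": "John Smith"
}

# the same routing value is required to find the document again
GET /users/user/1?routing=user123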

1.2 How primary shards and replica shards interact

For illustrative purposes, let's assume a cluster of three nodes. It contains an index named blogs that has two primary shards, and each primary shard has two replica shards. Copies of the same shard are never placed on the same node, so our cluster looks like the figure below. We can send requests to any node in the cluster. Every node is capable of handling any request, and every node knows the location of every document in the cluster, so requests can be forwarded directly to the node that holds the data. In the examples below, all requests are sent to Node 1, which we will call the coordinating node.

1.2.1 Creating, Indexing, and Deleting documents

Create, index, and delete requests are write operations, which must complete on the primary shard before they can be replicated to the associated replica shards, as shown in the figure below.

The following is the sequence of steps required to successfully create, index, or delete a document on the primary shard and any replica shards:

  1. The client sends a create, index, or delete request to Node 1.
  2. The node uses the document's _id to determine that the document belongs to shard 0. The request is forwarded to Node 3, because the primary shard of shard 0 is currently allocated on Node 3.
  3. Node 3 executes the request on the primary shard. If it succeeds, it forwards the request in parallel to the replica shards on Node 1 and Node 2. Once all of the replica shards report success, Node 3 reports success to the coordinating node, and the coordinating node reports success to the client.

By the time the client receives a successful response, the document change has been executed on the primary shard and all replica shards, and the change is safe.

There are optional request parameters that allow you to influence this process, possibly improving performance at the expense of data security. These options are rarely used because Elasticsearch is already fast, but for completeness they are described here:

  • consistency. Under the default settings, before even attempting a write operation, the primary shard requires that a specified number (a quorum, or in other words a majority) of shard copies be in an active state (a shard copy can be the primary shard or a replica shard). This avoids data inconsistency caused by writing to the losing side of a network partition. The specified number is:

    int( (primary + number_of_replicas) / 2 ) + 1

    The consistency parameter can be set to one (a write is allowed as long as the primary shard is active), all (a write is allowed only if the primary and all replica shards are active), or the default quorum (a write is allowed if a majority of shard copies are active).

    Note that number_of_replicas in the formula above refers to the number of replicas specified in the index settings, not the number of replicas currently active. If your index settings specify three replica shards, then the specified number is calculated as:

    int( (primary + 3 replicas) / 2 ) + 1 = 3

    If you start only two nodes at this point, you will not have the required number of active shard copies, and therefore you will not be able to index or delete any documents.

  • timeout. What happens if there are not enough replica shards? Elasticsearch waits, hoping more shards will appear. By default it waits up to 1 minute. If you need to, you can use the timeout parameter to give up sooner: 100 means 100 milliseconds, and 30s means 30 seconds. (A sketch of both parameters follows this list.)
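As a sketch of both parameters on a write request (the index, type, and document are hypothetical, and consistency follows the legacy API described in this section; newer versions of Elasticsearch replace it with wait_for_active_shards):

# require a quorum of active shard copies, and give up after 30 seconds
PUT /website/blog/1?consistency=quorum&timeout=30s
{
  "title": "My first blog entry"
}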

1.2.2 Querying Documents

Documents can be retrieved from the primary shard or from any replica shard, as shown in the figure below.

The following is the sequence of steps to retrieve a document from either the primary shard or a replica shard:

  1. The client sends a get request to Node 1.
  2. The node uses the document's _id to determine that the document belongs to shard 0. Copies of shard 0 exist on all three nodes. In this case, it forwards the request to Node 2.
  3. Node 2 returns the document to Node 1, which then returns the document to the client.

When processing read requests, the coordinating node load-balances by cycling through all of the shard copies in turn (round-robin) on successive requests.

A document that has been indexed may already exist on the primary shard but not yet have been copied to the replica shards when it is retrieved. In that case, a replica shard might report that the document does not exist while the primary shard returns it successfully. Once the index request has returned success to the user, the document is available on both the primary and all replica shards.
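For reference, a plain document retrieval that follows exactly this path looks like the following (index, type, and id are hypothetical):

# may be answered by the primary shard or by any replica shard
GET /website/blog/1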

1.2.3 Updating documents

The update API combines the read and write patterns described above.

Here are the steps to update a document:

  1. The client sends an update request to Node 1.
  2. It forwards the request to Node 3, where the primary shard is located.
  3. Node 3 retrieves the document from the primary shard, modifies the JSON in the _source field, and tries to re-index the document on the primary shard. If the document has already been changed by another process, it retries step 3, giving up after retry_on_conflict attempts (see the sketch after this list).
  4. If Node 3 updates the document successfully, it forwards the new version of the document in parallel to the replica shards on Node 1 and Node 2 to be re-indexed. Once all replica shards return success, Node 3 returns success to the coordinating node, and the coordinating node returns success to the client.
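As a sketch of step 3's conflict handling (the index, type, id, and field are hypothetical), the retry count is passed as a query parameter on the update request:

# retry up to 5 times if another process modifies the document concurrently
POST /website/blog/1/_update?retry_on_conflict=5
{
  "doc": {
    "views": 1
  }
}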

Document-based replication: when the primary shard forwards changes to the replica shards, it does not forward the update request. Instead, it forwards the new version of the complete document. Remember that these changes are forwarded to the replica shards asynchronously, and there is no guarantee that they arrive in the same order they were sent. If Elasticsearch forwarded only the change request, the changes might be applied in the wrong order, resulting in corrupted documents.

2. Mapping and analysis

When we add data to ES, we can insert documents directly without even creating an index first. Behind the scenes, ES creates the index for us and, by default, assigns a type to each field. In ES, different field types are searched differently: by default, dates and numbers are searched as exact values, while strings are searched as full text.

2.1 Exact values vs. full text

Elasticsearch data can be broadly divided into two categories: exact value and full text.

Exact values are exactly what they sound like, such as a date or a user ID, but strings can also be exact values, such as a username or an email address. For exact values, Foo is different from foo, and 2014 is different from 2014-09-15.

Full-text, on the other hand, is textual data (usually written in a language that is easily recognizable to humans), such as the content of a tweet or an email.

Full text usually refers to unstructured data, but there is a misconception here: natural language is in fact highly structured. The problem is that the rules of natural language are complex, which makes them difficult for computers to interpret correctly. For example, consider the sentence: May is fun but June bores me. Does it refer to months or to people?

Precise values are easy to query. The results are binary: either the query matches or the query does not. This type of query is easily expressed in SQL:

WHERE name    = "John Smith"
  AND user_id = 2
  AND date    > "2014-09-15"
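As a rough Elasticsearch counterpart (a sketch only: the field names are taken from the SQL above, and the string fields are assumed to be indexed as exact values, i.e. unanalyzed), the same conditions become a bool query of filters, each of which either matches or does not:

GET /_search
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "name":    "John Smith" }},
        { "term":  { "user_id": 2 }},
        { "range": { "date":    { "gt": "2014-09-15" }}}
      ]
    }
  }
}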

Querying full-text data is much more subtle. We’re asking not just “does this document match the query?” but “How well does this document match the query?” In other words, how relevant is this document to a given query?

We rarely do exact matching on full-text fields. Instead, we want to search within text fields. Not only that, we want search to understand our intent:

  • A search for UK should return documents containing United Kingdom.
  • A search for jump should match jumped, jumps, jumping, and perhaps even leap.
  • A search for johnny walker should match Johnnie Walker, and johnnie depp should match Johnny Depp.
  • fox news hunting should return stories about hunting on Fox News, while fox hunting news should return news stories about fox hunting.

To facilitate these kinds of queries on full text, Elasticsearch first analyzes the document, and then builds an inverted index from the results of the analysis. In the next two sections, we discuss the inverted index and the analysis process.

2.2 Inverted index

Elasticsearch uses a structure called an inverted index, which is designed for fast full-text search. An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears.

For example, suppose we have two documents, each containing the following content field:

  1. The quick brown fox jumped over the lazy dog
  2. Quick brown foxes leap over lazy dogs in summer

To create an inverted index, we first split the content field of each document into individual words (we call them terms or tokens), create a sorted list of all non-repeating terms, and then list which document each term appears in. The result is as follows:

Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
------------------------

Now, if we want to search for quick brown, we just need to find the documents in which each term appears:

Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
quick   |   X   |
------------------------
Total   |   2   |  1

Both documents match, but the first document matches better than the second. If we use a simple similarity algorithm that just counts the number of matching terms, we can say that the first document is more relevant to our query than the second.

However, there are some problems with our current inverted index:

  • Quick and quick appear as separate terms, while the user probably considers them the same word.
  • fox and foxes are very similar, as are dog and dogs; they share the same root.
  • jumped and leap do not share a root, but their meanings are very similar. They are synonyms.

Searching the index above for +Quick +fox would not return any matching documents. (Remember, the + prefix means the term must be present.) Only quick fox appears in the first document, and only Quick foxes in the second.

Our users can reasonably expect both documents to match the query. We can do better.

If we normalize the terms into a standard form, we can find documents that are not exactly the same as the terms the user searched for, but are similar enough to be relevant. For example:

  • Quick can be lowercased to quick.
  • foxes can be stemmed, reduced to its root form, fox. Similarly, dogs can be stemmed to dog.
  • jumped and leap are synonyms and can be indexed under the single term jump.

The index now looks like this:

Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
dog     |   X   |  X
fox     |   X   |  X
in      |       |  X
jump    |   X   |  X
lazy    |   X   |  X
over    |   X   |  X
quick   |   X   |  X
summer  |       |  X
the     |   X   |  X
------------------------

We are still not quite there. Our search for +Quick +fox would still fail, because Quick no longer exists in our index. However, if we apply the same normalization rules to the search string that we applied to the content field, the query becomes +quick +fox, and both documents match! This process of tokenization and normalization is called analysis.
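As a sketch (reusing the content field from the example above), such a search can be written as a query_string query; the + syntax is the same as in the examples above, and the query string is analyzed before matching:

GET /_search
{
  "query": {
    "query_string": {
      "default_field": "content",
      "query": "+quick +fox"
    }
  }
}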

2.3 Analysis and analyzers

Analysis consists of the following processes:

  • First, tokenizing a block of text into individual terms suitable for use in an inverted index,
  • Then, normalizing those terms into a standard form to improve their “searchability,” or recall.

Analyzers perform this work. An analyzer packages three functions together:

  • Character filters

    First, the string passes through each character filter in order. Their job is to tidy up the string before tokenization. A character filter could be used to strip out HTML, or to convert & into and.
  • Tokenizer

    Next, the string is divided into individual terms by a tokenizer. A simple tokenizer might split the text into terms whenever it encounters whitespace or punctuation.
  • Token filters

    Finally, the terms pass through each token filter in order. This step may change terms (for example, lowercasing Quick), remove terms (for example, stopwords like a, and, and the), or add terms (for example, synonyms like jump and leap).

Elasticsearch provides character filters, tokenizers, and token filters out of the box. These can be combined into custom analyzers for different purposes. We discuss this in more detail in the section on custom analyzers; a minimal sketch follows.
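As a sketch of such a combination (the index and analyzer names are hypothetical; html_strip, standard, and lowercase are built-in components), a custom analyzer is declared in the index settings:

# strip HTML, split on word boundaries, then lowercase each term
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type":        "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer":   "standard",
          "filter":      [ "lowercase" ]
        }
      }
    }
  }
}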

2.3.1 Built-in analyzers

However, Elasticsearch also ships with prepackaged analyzers that you can use directly. Below we list the most important ones. To demonstrate the differences, let's look at which terms each analyzer produces from the following string:

"Set the shape to semi-transparent by calling set_trans(5)"

  • Standard analyzer: the default for Elasticsearch. It is the most common choice for analyzing text in any language. It splits text on word boundaries, as defined by the Unicode Consortium, removes most punctuation, and finally lowercases the terms. It produces:

    set, the, shape, to, semi, transparent, by, calling, set_trans, 5

  • Simple analyzer: splits the text on anything that is not a letter and lowercases the terms. It produces:

    set, the, shape, to, semi, transparent, by, calling, set, trans

  • Whitespace analyzer: splits the text on whitespace. It produces:

    Set, the, shape, to, semi-transparent, by, calling, set_trans(5)

  • Language analyzers

    Language-specific analyzers are available for many languages. They can take the peculiarities of the specified language into account. For example, the english analyzer comes with a set of English stopwords (common words such as and or the, which have little effect on relevance) that it removes. Because it understands the rules of English grammar, it can also extract the stems of English words.

    The english analyzer produces the following terms:

    set, shape, semi, transpar, call, set_tran, 5

    Notice that transparent, calling, and set_trans have been reduced to their root forms.

2.3.2 When analyzers are used

When we index a document, its full-text fields are analyzed into terms that are used to create the inverted index. However, when we search within a full-text field, we need to pass the query string through the same analysis process, to ensure that the terms we search for are in the same form as the terms in the index.

Full-text queries understand how each field is defined, so they can do the right thing:

  • When you query a full-text field, the query applies the same analyzer to the query string to produce the correct list of terms to search for.
  • When you query an exact-value field, the query string is not analyzed; instead, the query searches for the exact value you specify.

Now you can understand why the query in the opening section returned that result:

  • The date field contains an exact value: the single term 2014-09-15.
  • The _all field is a full-text field, so the analysis process converted the date into three terms: 2014, 09, and 15.

When we query for 2014 in the _all field, it matches all 12 tweets because they all contain 2014:

GET /_search?q=2014              # 12 results

When we query 2014-09-15 in the _all field, it first parses the query string to produce a query that matches any of the terms in 2014, 09, or 15. This will also match all 12 tweets, since they all contain 2014:

GET /_search?q=2014-09-15        # 12 results !

When we query 2014-09-15 in the date field, it looks for the exact date and finds only one tweet:

GET /_search?q=date:2014-09-15   # 1 result

When we query 2014 in the date field, no documents are found, because no document contains that exact date:

GET /_search?q=date:2014         # 0 results !

2.3.3 Test the analyzer

Sometimes it is hard to understand the tokenization process and the terms that are actually stored in the index, especially when you are new to Elasticsearch. To understand what is happening, you can use the analyze API to see how text is analyzed. In the message body, specify the analyzer to use and the text to analyze:

GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze"
}

Each element in the result represents a single entry:

{
   "tokens": [
      {
         "token":        "text",
         "start_offset": 0,
         "end_offset":   4,
         "type":         "<ALPHANUM>",
         "position":     1
      },
      {
         "token":        "to",
         "start_offset": 5,
         "end_offset":   7,
         "type":         "<ALPHANUM>",
         "position":     2
      },
      {
         "token":        "analyze",
         "start_offset": 8,
         "end_offset":   15,
         "type":         "<ALPHANUM>",
         "position":     3
      }
   ]
}

The tokens are the actual terms that will be stored in the index. position indicates the order in which the terms appeared in the original text. start_offset and end_offset indicate the character positions the term occupied in the original string.

2.3.4 Specifying an analyzer

When Elasticsearch detects a new string field in your documents, it automatically configures it as a full-text string field and analyzes it with the standard analyzer.

You don't always want that. Perhaps you want to use a different analyzer, one better suited to the language of your data. And sometimes you want a string field to be just a string field: to index the exact value that you pass in, without analysis, such as a user ID, an internal status field, or a tag.

To do this, we must manually specify the mapping for these fields; a minimal sketch follows.
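A minimal sketch in the older mapping syntax that matches this section's examples (the index, type, and field names are hypothetical; newer Elasticsearch versions express the same idea with the text and keyword field types):

# tag is indexed as an exact value; title is analyzed with the english analyzer
PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "tag": {
          "type":  "string",
          "index": "not_analyzed"
        },
        "title": {
          "type":     "string",
          "analyzer": "english"
        }
      }
    }
  }
}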