I. Foreword

More and more enterprises use Elasticsearch (ES) to store unstructured data for their business scenarios, for example product search on e-commerce sites, data and metrics analysis, and log analysis. As a complement to traditional relational databases, ES provides capabilities that relational databases lack.

ES first came to the public eye for its full-text search capability, which is built on Lucene and its internal inverted index data structure.

In this article, the author introduces the distributed architecture of ES and its indexing and storage mechanisms. The ES API is not covered in detail; the analysis stays at the level of the overall architecture. The author will cover the use of ES in later articles.

II. What is an inverted index

To understand what an inverted index is, let us first look at an ordinary (forward) index. Take a book as an example: its table of contents lists chapter and section names together with their page numbers. If we want to read a particular section, we look it up in the table of contents, find its page number, and jump straight to the content. Going from chapter name to page number to chapter content is the forward indexing process. So what is an inverted index?

Suppose we are reading the book "Thinking in Java". We open the table of contents, which records the chapter names and their page numbers, look up the section named "inheritance", locate its page, and start reading. While reading, we notice that the text contains the word "object" many times.

Now, what if we want to find all the sections in this book that contain the word "object"?

With the existing forward index this would be like looking for a needle in a haystack. But if we had a mapping from the word "object" to the articles that contain it, the problem would be solved. Such a reversed mapping is called an inverted index.

As shown in figure 1, the articles are segmented into words, and the inverted index is built on the resulting keywords. The keywords form a dictionary, and each keyword in the dictionary points to a posting list, which stores information such as the IDs of the documents (chapters) containing that keyword and the term frequency; each entry in a posting list is called a posting. In the end, the whole inverted index looks much like a Xinhua Dictionary. The posting lists of all keywords are usually stored in a file on disk, called the inverted file.

(figure 1)

The dictionary and the inverted file are the two basic data structures in Lucene, but they are stored differently: the dictionary is kept in memory, while the inverted file is stored on disk. This article does not cover the techniques used to build and query an inverted index, such as word segmentation, TF-IDF, BM25, and vector space similarity; a basic understanding of the inverted index is enough for what follows.
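To make the structure concrete, here is a minimal sketch in plain Python (not ES or Lucene code) that builds a tiny inverted index mapping each term to a posting list of (document id, term frequency) pairs; the naive whitespace split stands in for real word segmentation:

from collections import defaultdict, Counter

docs = {
    1: "the object is created from a class",
    2: "inheritance lets a class reuse another class",
    3: "every object has state and behavior",
}

# term (dictionary entry) -> posting list of (doc_id, term_frequency)
inverted_index = defaultdict(list)

for doc_id, text in docs.items():
    term_counts = Counter(text.split())  # naive "word segmentation"
    for term, freq in term_counts.items():
        inverted_index[term].append((doc_id, freq))

print(inverted_index["object"])  # [(1, 1), (3, 1)] -> every document containing "object"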

III. Cluster architecture of ES

1. Cluster nodes

An ES cluster can consist of multiple nodes, and each node is an instance of the ES service. Nodes join a cluster by being configured with the same cluster name (cluster.name). So how does a node join a cluster just by being configured with the same cluster name? To understand this, we first need to understand the roles that nodes play in an ES cluster.

The node role is set with the following options in the configuration file conf/elasticsearch.yml:

node.master: true/false
node.data: true/false

A node in the cluster can be a candidate master node and/or a data node, and the two settings above combine into four categories:

(1) candidate master node only; (2) both candidate master node and data node; (3) data node only; (4) neither candidate master node nor data node.
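For illustration, a minimal sketch of two of these categories in elasticsearch.yml, assuming the legacy node.master / node.data settings shown above (newer ES versions replace them with node.roles):

# dedicated candidate master node (category 1)
node.master: true
node.data: false

# coordinating-only node (category 4)
node.master: false
node.data: false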

Candidate master node: only candidate master nodes can take part in election voting, and only a candidate master node can be elected as the master node.

Master node: it is responsible for creating and deleting indexes, tracking which nodes are part of the cluster, allocating shards, collecting the status of every node in the cluster, and so on. A stable master node is very important for the health of the cluster.

Data node: it is responsible for data operations such as create, delete, update, search, and aggregation. Both queries and storage are handled by data nodes, which therefore place high demands on CPU, IO, and memory; high-spec machines are usually chosen as data nodes.

In addition, there is another node role called the coordinating node. It is not assigned through configuration: a user request can be sent to any node, and that node takes care of distributing the request, gathering the results, and so on, without the request having to be forwarded by the master node. A node playing this role is called a coordinating node, and any node in the cluster can act as one. All nodes keep in touch with each other.

(figure 2)

2. Discovery mechanism

As mentioned earlier, a node joins a cluster by being configured with the cluster name. How does ES make this work?

This is where the discovery mechanism of ES, Zen Discovery, comes in.

Zen Discovery is the built-in discovery mechanism of ES. It provides both unicast and multicast discovery, and its main responsibilities are discovering the nodes in the cluster and electing the master node.

Multicast means that a node sends a request to multiple machines at once. ES does not recommend this approach in production, because multicast can generate a lot of unnecessary traffic in a large cluster.

With unicast, when a node joins an existing cluster or forms a new one, the request is sent to specific machines. Once the node contacts a member of the unicast list, it obtains the status of all nodes in the cluster, then contacts the master node and joins the cluster.

Only nodes running on the same machine cluster together automatically. ES uses unicast discovery by default. The unicast list does not need to contain every node in the cluster; it only needs enough nodes so that a newly joining node can reach at least one of them and communicate. If the candidate master nodes are used as the unicast list, listing three of them is usually enough.

This configuration is in the elasticsearch.yml file:

discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]

The cluster information collection phase uses the Gossip protocol; the configuration above essentially specifies the seed nodes. The Gossip protocol itself is not covered here.

ES officially recommends that unicast.hosts be configured with all candidate master nodes. Zen Discovery pings other nodes every ping_interval (configurable), with each attempt timing out after discovery.zen.ping_timeout (configurable). If the number of failed pings reaches ping_retries (three by default), the node is considered to be down. A failed node triggers failover: its shards are redistributed and replicated.
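As a hedged illustration, the fault-detection variants of these settings might be tuned in elasticsearch.yml of a pre-7.x cluster roughly as follows (values are examples, not recommendations):

discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_retries: 3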

If the failed node is not the master, the master updates the cluster meta information and publishes the latest version to the other nodes, which reply with ACK. Once the master has received replies from at least discovery.zen.minimum_master_nodes - 1 candidate master nodes, it sends an apply message to the other nodes and the cluster state is updated. If the failed node is the master, the remaining candidate master nodes start a master election.

2.1 Master election

Only one master is elected at a time. Through a quorum (majority) threshold on the candidate master nodes, ES guarantees that the elected master is recognized by at least a quorum of candidate master nodes.

If a candidate master node finds that it is not the master itself, that it cannot reach the master by pinging other nodes, and that at least minimum_master_nodes nodes (including itself) cannot reach the master, it initiates a master election.

The master election flow diagram:

(figure 3)

The cluster nodes are sorted by the tuple <stateVersion, id>. stateVersion is sorted from high to low, so that the node with the most up-to-date cluster meta information is preferred as master; id is sorted from low to high, to avoid split votes that would fail to elect a master when stateVersion values are equal.

The first node after sorting is the prospective master. When a candidate master node initiates an election, it votes for the node it considers to be the master according to the sorting strategy above.

2.2 Split brain

When talking about leader election in distributed systems, the split-brain phenomenon is unavoidable. What is split brain? If multiple master nodes are elected in the same cluster, causing inconsistent data updates, the phenomenon is called split brain.

In short, different nodes in the cluster disagree on the choice of master, and multiple master contenders appear.

In general, split brain can be caused by the following:

  • Network problems: network delays within the cluster cause some nodes to lose contact with the master and assume it is down, even though it is actually alive. A new master is then elected, and the shards and replicas on the old master are marked red while new primary shards are allocated.
  • Node load: the master node also acts as a data node. Under heavy traffic, ES may stop responding (appear dead) and cause long delays; other nodes then get no response from the master, conclude it is down, and elect a new master.
  • Memory reclamation: the master node also acts as a data node. When the ES process on the data node occupies a large amount of memory, it triggers large-scale JVM garbage collection, which can make the ES process unresponsive.

How to avoid split brain: we can take the following measures against the causes above:

  • Increase the response timeout appropriately to reduce misjudgments. The ping timeout is set by the parameter discovery.zen.ping_timeout; the default is 3s and it can be increased as needed.
  • Tune the election trigger. In the configuration file of the candidate master nodes, set the parameter discovery.zen.minimum_master_nodes. It represents the number of candidate master nodes that must be available for a master election to proceed. The default is 1, and the official recommendation is (master_eligible_nodes / 2) + 1, where master_eligible_nodes is the number of candidate master nodes. As long as at least discovery.zen.minimum_master_nodes candidate nodes are alive, the election can proceed normally; below that value no election is triggered, and the cluster becomes unavailable instead of producing shard chaos. (A minimal config sketch follows after this list.)
  • Role separation, that is, separating the candidate master nodes from the data nodes as described above. This reduces the load on the master node, prevents it from appearing dead, and reduces misjudgments that the master is down.
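As a minimal sketch of the second item above, a cluster with three candidate master nodes (assuming the pre-7.x Zen settings used in this article) might set:

discovery.zen.ping_timeout: 10s
discovery.zen.minimum_master_nodes: 2   # (3 / 2) + 1 = 2 with integer division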

IV. How an index is written

1. The principle of index writing

1.1 Shards

ES supports full-text search at PB scale. Usually, when the amount of data grows large, query performance gets slower and slower. One solution is to spread the data across different places for storage, and that is what ES does: the index is split into blocks called shards, much like database and table sharding in MySQL.

Different primary shards are distributed on different nodes, so in a multi-shard index, which shard should a piece of data be written to? It certainly cannot be written at random, otherwise it could not be retrieved quickly when queried. A routing strategy decides which shard a document is written to; how routing works is described below. The number of shards is specified when the index is created, and once set it cannot be changed.

1.2 Replicas

Replicas are copies of shards. Each primary shard has one or more replica shards. When the primary shard fails, replicas can serve queries and other operations. A primary shard and its replica shards are never placed on the same node, so that data is not lost and can still be queried from replicas when a node goes down. The maximum number of replica shards is N - 1, where N is the number of nodes.

Requests that create, index, or delete documents are write operations; they must be performed on the primary shard before they can be replicated to the corresponding replica shards. To improve write throughput, ES performs these writes concurrently, and to handle the data conflicts that concurrent writes can cause, it uses optimistic locking: every document has a _version number that is incremented whenever the document is modified.
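A minimal sketch in plain Python of the idea behind this version-based optimistic locking (it only illustrates the check, not ES's actual implementation):

class VersionConflict(Exception):
    pass

store = {}  # doc_id -> (version, source)

def index_doc(doc_id, source, expected_version=None):
    current_version = store.get(doc_id, (0, None))[0]
    # reject the write if the caller's view of the document is stale
    if expected_version is not None and expected_version != current_version:
        raise VersionConflict(f"expected {expected_version}, found {current_version}")
    store[doc_id] = (current_version + 1, source)
    return current_version + 1

index_doc("1", {"title": "hello"})                    # stored as version 1
index_doc("1", {"title": "hi"}, expected_version=1)   # succeeds, becomes version 2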

Once all replica shards report success, success is reported to the coordinating node, which in turn reports success to the client.

(figure 4)

1.3 The Elasticsearch index write flow

As mentioned above, a write can only be performed on the primary shard and is then synchronized to the replica shards. As shown in Figure 4, there are four primary shards: S0, S1, S2, and S3. By what strategy is a piece of data written to a specific shard? Why is a given document written to S0 rather than to S1 or S2? The choice is determined by the following formula.

shard = hash(routing) % number_of_primary_shards

The value of this formula is the remainder, between 0 and number_of_primary_shards - 1, and it identifies the shard where the document is placed. The routing value is passed through a hash function to produce a number, which is divided by number_of_primary_shards (the number of primary shards), and the remainder is taken. routing is a variable value; it defaults to the document's _id but can be set to a custom value.
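A minimal sketch of this routing calculation in plain Python (the built-in hash stands in for the Murmur3 hash that ES actually uses, so the resulting shard numbers are illustrative only):

NUMBER_OF_PRIMARY_SHARDS = 4

def route(routing_value: str) -> int:
    # routing defaults to the document _id, but a custom value can be supplied
    return hash(routing_value) % NUMBER_OF_PRIMARY_SHARDS

doc_id = "order-1001"  # hypothetical document id
print(f"document {doc_id} is written to primary shard S{route(doc_id)}")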

After a write request is sent to a node, that node acts as the coordinating node, as mentioned above. It uses the routing formula to work out which shard the document should be written to. Since the current node knows the shard locations of all the other nodes, if it finds that the target primary shard is on another node, it forwards the request to the node holding that primary shard.

In an ES cluster, every node knows where data resides in the cluster through the formula above, so every node is able to receive read and write requests.

So why is the number of primary shards fixed when the index is created and not modifiable afterwards? Because if the number changed, all previously computed routing values would become invalid and the data could never be found again.

(figure 5)

As shown in Figure 5, suppose the routing formula gives shard = hash(routing) % 4 = 1 for the current document; the detailed process is then as follows:

(1) The write request is sent to node Node1, and the routing calculation gives 1, so the data belongs on primary shard S1. (2) Node1 forwards the request to Node2, where primary shard S1 is located; Node2 accepts the request and performs the write. (3) The data is then replicated concurrently to the three R1 replica shards, with data conflicts controlled through optimistic concurrency. Once all replica shards report success, Node2 reports success to Node1, which then reports success to the client.

In this mode, as long as there are replicas, the minimum write latency is the sum of two single-shard write times, and efficiency is lower; but the benefit is obvious: it avoids losing data when a single machine's hardware fails after a write. Between data integrity and performance, data integrity is generally preferred, unless a special scenario can tolerate losing data.

To reduce disk IO and guarantee read/write performance, ES only flushes data to disk periodically (for example every 30 minutes). For data that has been written to memory but not yet flushed to disk, if the machine goes down or loses power, the data in memory is lost. How is this handled?

To address this problem, ES borrows an idea from databases and adds a commit-log-like module, called the translog in ES, which is described in the storage principles section below.

2. Storage principles

The previous section described the index write process inside ES. After data is written to shards and replicas, it is still only in memory; to guarantee that it is not lost after a power failure, it must also be persisted to disk.

As we know, ES is built on Lucene; index creation, writing, and search queries are all carried out through Lucene. Lucene works as shown in the figure below. When a new document is added, Lucene performs preprocessing such as word segmentation and then writes the document index into memory. The translog is similar to MySQL's binlog: it is used to recover the in-memory data after a crash and stores the operation log of data that has not yet been persisted.

By default, Lucene flushes the in-memory data to the file system cache every 1 second (the refresh_interval setting), and the result is called a segment. A segment cannot be searched until it has been written into the file system cache.

Therefore, refresh_interval determines how quickly ES data becomes searchable, which is why ES is described as a near-real-time system. Segments on disk cannot be modified, which avoids random disk writes; all random writes happen in memory. By default, segments are persisted to disk every 30 minutes or once more than 512 MB has accumulated; this is called a commit point, and at that moment the corresponding translog is deleted.
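As a hedged illustration, the thresholds mentioned above correspond to index-level translog settings along the following lines (names from recent ES versions, shown with their usual defaults for orientation only):

index.translog.flush_threshold_size: 512mb
index.translog.sync_interval: 5s
index.translog.durability: request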

When testing write operations, we can refresh manually to make sure data becomes searchable in time, but do not refresh manually every time a document is indexed in production: the refresh operation has a performance cost, and normal business scenarios rarely need a refresh every second.
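For reference, the manual refresh mentioned here is exposed through the refresh API (the index name my_index is hypothetical):

POST /my_index/_refresh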

You can lower the refresh frequency of an index by raising refresh_interval in its settings, for example refresh_interval = "30s". Note that the value must carry a time unit, otherwise it is interpreted as milliseconds. Setting refresh_interval = -1 turns off automatic refresh for the index.
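For example, the setting can be changed through the index settings API roughly as follows (the index name my_index is hypothetical):

PUT /my_index/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}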

(figure 6)

Index files are stored as segments and cannot be modified, so how are additions, updates, and deletions handled?

  • New: this case is easy to handle. Because the data is new, a new segment is simply added for the document.
  • Delete: because segments are immutable, a delete does not remove the document from the old segment. Instead, a .del file is added that records the list of deleted documents. A document marked as deleted can still be matched by a query, but it is filtered out of the result set before the final results are returned.
  • Update: the old segment cannot be modified to update a document, so an update is effectively a delete plus an add. The old version of the document is marked in the .del file, and the new version is indexed into a new segment. A query may match both versions of the document, but the deleted version is removed before the result set is returned.

Making segments immutable has both advantages and disadvantages.

Advantages:

  • No locks required. If you never update the index, you don’t have to worry about multiple processes changing the data at the same time.
  • Once the index is read into the kernel’s file system cache, it stays there because of its immutability. As long as there is enough space in the file system cache, most read requests go directly to memory and do not hit the disk. This provides a significant performance boost.
  • Other caches, like the Filter cache, are valid for the lifetime of the index. They do not need to be reconstructed every time the data changes, because the data does not change.
  • Writing a single large inverted index allows the data to be compressed, reducing disk I/O and the amount of index that needs to be cached in memory.

Disadvantages:

  • When old data is deleted, it is not removed immediately; it is only marked as deleted in the .del file. The old data is physically removed only when segments are merged, so a lot of space is wasted in the meantime.
  • If a piece of data is updated frequently, every update adds a new version and marks the old one as deleted, which also wastes a lot of space.
  • Each time you add data, you need to add a new segment to store the data. When the number of segments is too large, the consumption of server resources such as file handles can be very large.
  • Every query must cover all segments and exclude the old documents that have been marked as deleted, which adds overhead to each query.

2.1 Segment merging

Because a new segment is created on every refresh, the number of segments can surge in a short time, and too many segments cause real problems. A large number of segments affects read performance, and each segment consumes file handles, memory, and CPU cycles.

More importantly, every search request must check each segment in turn and merge the query results, so the more segments there are, the slower searching becomes.

Lucene therefore merges segments according to a certain strategy, and in the process the old deleted documents are purged from the file system: deleted documents are not copied into the new segment.

Merging does not interrupt indexing and searching, and the inverted-index data structure makes merging files relatively easy.

Segment merging happens automatically while indexing and searching are going on. The merge process picks a small number of similarly sized segments and merges them in the background into a larger segment; the segments involved may be either uncommitted or committed.

When the merge finishes, the old segments are deleted, the new segment is flushed to disk, a new commit point is written that includes the new segment and excludes the old, smaller ones, and the new segment is opened for search. Segment merging is computationally expensive and consumes a lot of disk I/O; it slows down the write rate and, if left unchecked, can also hurt search performance.

ES imposes resource constraints on the merge process by default, so search performance can be guaranteed.

(figure 7)
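As a hedged side note, merging can also be triggered explicitly through the force merge API, which is normally used only on indices that are no longer written to (the index name my_index is hypothetical):

POST /my_index/_forcemerge?max_num_segments=1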

V. Closing remarks

This article has introduced the architecture of ES and its index storage and write mechanisms. The overall architecture of ES is quite ingenious, and we can borrow from its design ideas when designing our own systems. This article only covered the overall architecture; the author will share more content in future articles.

Authors: Vivo website mall development team