Where does ES come from?

Question: What problems will we face when data volume reaches the trillions of records?

  1. How to store it? What database? Relational or non-relational?
  2. How to ensure data security? How to ensure the availability of services?
  3. How to search?
  4. How to do statistical analysis?

ES comes with solutions:

  1. Build clusters to share the data storage load, cope with high concurrency, and improve system availability.
  2. Store data at multiple points to improve data security.
  3. Index efficiently to improve query speed and statistical-analysis capability.
  4. Compress data to reduce storage costs.

A brief introduction to ES:

ES is an open-source, distributed, highly extensible full-text search engine built on Lucene. It reads and writes data quickly; reads in particular are near real-time. It also scales very well, out to clusters of hundreds of servers. ES exposes RESTful APIs that hide the implementation complexity, making its two core functions, search and statistics, very simple to use from any language.

ES Node Types

To realize the various capabilities of the cluster, each node in an ES cluster may take on different roles.

Master node: a node can be made a candidate master node by setting node.master=true. Candidate master nodes are voted on by the cluster and, if elected, become the master. The master node is responsible for creating and deleting indexes, allocating shards, and tracking the status of the nodes in the cluster.

Data node: a node can be made a data node by setting node.data=true. Data nodes are mainly responsible for read-write operations on data, such as CRUD, search, and aggregation.

Client node: if a node's node.master and node.data are both false, the node neither votes for the master nor reads or writes data. So is it still useful in the cluster? It is: it can still handle client requests, by distributing the requests it receives to the relevant nodes and returning the merged results. Such nodes are called client nodes.

Coordinating node: the coordinating node is not a specific node type, but rather a role that a node plays. The client node above, for example, plays the role of a coordinating node; the work of a client node is exactly the work of the coordinating-node role. Of course, the master node, a candidate master node, or a data node can also take on this role.
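
Putting the two settings together, the four combinations map onto the roles above (a quick reference built from the descriptions in this section):

node.master=true,  node.data=true   -> master-eligible data node (the default)
node.master=true,  node.data=false  -> dedicated candidate master node
node.master=false, node.data=true   -> pure data node
node.master=false, node.data=false  -> client (coordinating-only) node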

Why have a master node?

The problem is simple: every organization needs a core leadership, like a boss, and a cluster likewise needs a core leader. That leader may be provided by third-party middleware or, as in the case of ES, come from within: the master node. The master node is mainly how ES, as a distributed system, addresses CAP, namely consistency, availability, and partition tolerance. We will understand this better later, when we look at how the master node works.

Master node election

The election of the ES master node can be summarized simply: some nodes stand for election, the others vote, and a candidate with enough votes wins. The goal of the election is to select a single master node.

Who can run?

Any node with node.master=true can stand for election as master.

Who can vote?

Every node in the ES cluster can vote.

How does a node vote?

Each node sorts the nodes it knows about, and the first node in the sorted order is the one it votes for. The source code is as follows:

public DiscoveryNode electMaster(Iterable<DiscoveryNode> nodes) {
    // Sort the known master-eligible nodes and vote for the first one
    List<DiscoveryNode> sortedNodes = sortedMasterNodes(nodes);
    if (sortedNodes == null || sortedNodes.isEmpty()) {
        return null;
    }
    return sortedNodes.get(0);
}

Sort by node ID:

private static class NodeComparator implements Comparator<DiscoveryNode> {
    @Override
    public int compare(DiscoveryNode o1, DiscoveryNode o2) {
        // Master-eligible nodes sort ahead of non-master-eligible nodes
        if (o1.masterNode() && !o2.masterNode()) {
            return -1;
        }
        if (!o1.masterNode() && o2.masterNode()) {
            return 1;
        }
        // Otherwise, order by node ID
        return o1.id().compareTo(o2.id());
    }
}

How is a node elected?

If a candidate master node receives more than half of the votes of all candidate master nodes, and it also voted for itself, it becomes the master. Suppose a cluster has 9 nodes, 3 of which are candidate master nodes, one of them being A. Then the condition for A to be elected is that at least two candidate nodes, A included, vote for A as master.

Split brain in elections

Looking at the node-sorting source code, we can see that ES sorts simply by node ID. A node's ID is fixed once it joins the cluster, so in general the nodes reach a consensus in the election, and that consensus ensures the election succeeds. In the case of a network partition, however, things get complicated.

Figure 1: a five-node cluster split into two partitions, nodes 1-2 on the left and nodes 3-5 on the right.

As shown in the figure above, under a partition each node can vote only among the nodes it can see. Assume all five nodes are candidate master nodes. In that case, the left partition elects node 1 and the right partition elects node 3. When the two partitions reconnect, there are two master nodes, a situation we call split brain. So how can this be prevented? See the following source code:

public boolean hasEnoughMasterNodes(Iterable<DiscoveryNode> nodes) {
    // A configured value below 1 disables the check
    if (minimumMasterNodes < 1) {
        return true;
    }
    // Count the candidate master nodes this node can currently see
    int count = 0;
    for (DiscoveryNode node : nodes) {
        if (node.masterNode()) {
            count++;
        }
    }
    return count >= minimumMasterNodes;
}

The above method is executed before the election. If it returns true, the election proceeds normally; if it returns false, no election is held. The method is very simple: it compares the number of candidate master nodes currently visible with minimumMasterNodes. If the number of candidate master nodes is greater than or equal to minimumMasterNodes, the election can be carried out; otherwise it cannot. minimumMasterNodes comes from our configuration. In Figure 1, we can prevent split brain by setting minimumMasterNodes to 3: the left partition sees fewer than 3 candidate master nodes, so it cannot hold an election and can only wait for the partition to heal, while the right partition can vote normally. When the partition is repaired, node 3 is therefore the only elected master.

In general, for a cluster with n candidate master nodes, minimum_master_nodes should be set to n/2 + 1 (integer division); for the five candidates in Figure 1, that gives 5/2 + 1 = 3. This avoids split brain. Note that in a cluster with two candidate master nodes, setting minimum_master_nodes to 2 avoids split brain, but it also makes the cluster unavailable if a partition leaves the two candidates unable to communicate. Setting it to 1 preserves availability but cannot prevent split brain. This is why ES recommends at least 3 candidate master nodes per cluster.
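
A minimal sketch of the quorum rule in code, assuming the legacy (pre-7.x) Zen discovery setting discovery.zen.minimum_master_nodes; the class name is illustrative only:

import org.elasticsearch.common.settings.Settings;

public class QuorumConfig {
    // n/2 + 1 with integer division: 2 -> 2, 3 -> 2, 5 -> 3
    static int quorum(int candidateMasterNodes) {
        return candidateMasterNodes / 2 + 1;
    }

    public static void main(String[] args) {
        Settings settings = Settings.builder()
            .put("discovery.zen.minimum_master_nodes", quorum(5))
            .build();
        System.out.println(settings.get("discovery.zen.minimum_master_nodes")); // prints 3
    }
}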

The ES write process

  1. The client selects a node to send the request to; that node acts as the coordinating node.
  2. The coordinating node routes the document and forwards the request to the node holding the corresponding primary shard.
  3. The primary shard on that node processes the request and synchronizes the data to the replica shards.
  4. Once the coordinating node sees that the primary shard and all replica shards have finished, it returns the response to the client.
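
As a sketch of this flow from the client side, assuming the (now-deprecated) Java High Level REST Client and placeholder index/field names:

import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        // Whichever node we connect to acts as the coordinating node
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            IndexRequest request = new IndexRequest("articles")
                .id("1")
                .source("{\"content\":\"Java is really fun\"}", XContentType.JSON);
            // Routed to the primary shard; returns after replicas are in sync
            IndexResponse response = client.index(request, RequestOptions.DEFAULT);
            System.out.println(response.getResult()); // CREATED or UPDATED
        }
    }
}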

The ES read process

You can query by doc ID: the doc ID is hashed to determine which shard the document was routed to at write time, and the query is sent to that shard.

  1. The client sends a request to any node; that node becomes the coordinating node.
  2. The coordinating node hashes the doc ID for routing and forwards the request to the corresponding node. A round-robin random polling algorithm then picks one of the primary shard and its replicas, balancing the read-request load.
  3. The node that receives the request returns the document to the coordinating node.
  4. The coordinating node returns the document to the client.
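
The same flow from the client side, a sketch reusing the client built in the write example (index name and helper are placeholders):

import java.io.IOException;
import org.elasticsearch.action.get.GetRequest;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

public class ReadExample {
    // Fetch a document by doc ID; ES hashes the ID to locate the shard
    static String getById(RestHighLevelClient client, String id) throws IOException {
        GetResponse response = client.get(new GetRequest("articles", id), RequestOptions.DEFAULT);
        return response.getSourceAsString();
    }
}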

The ES search process

The most powerful thing about ES is full-text search. Say you have three documents:

  Java is really fun
  Java is very hard to learn
  J2EE is especially awesome

If you search by the keyword Java, ES finds the documents containing Java and tells you: "Java is really fun" and "Java is very hard to learn".

  1. The client sends the request to a coordinating node.
  2. The coordinating node forwards the search request to every shard, hitting either the primary shard or a replica shard; both work.
  3. Query phase: each shard returns its own search results (really just doc IDs) to the coordinating node, which merges, sorts, and paginates them to produce the final result.
  4. Fetch phase: the coordinating node then pulls the actual document data from each node by doc ID and finally returns it to the client.

Write requests go to the primary shard and are then synchronized to all replica shards; read requests can be served by the primary shard or a replica shard, chosen by the random polling algorithm.
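
A sketch of the search above through the same Java High Level REST Client (index and field names are placeholders):

import java.io.IOException;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class SearchExample {
    // Full-text match for "java": ES fans the query out to the shards
    // (query phase), then fetches the matching docs by ID (fetch phase)
    static void searchJava(RestHighLevelClient client) throws IOException {
        SearchRequest request = new SearchRequest("articles");
        request.source(new SearchSourceBuilder().query(QueryBuilders.matchQuery("content", "java")));
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        for (SearchHit hit : response.getHits()) {
            System.out.println(hit.getSourceAsString());
        }
    }
}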

Underlying principles of writing data

Data is first written to an in-memory buffer; while it sits in the buffer it cannot be searched. The data is also written to a translog log file.

When the buffer is nearly full, or a certain amount of time has elapsed, the buffered data is refreshed into a new segment file, but not directly into a segment file on disk; it goes to the OS cache first. This process is called refresh.

Every second, ES writes the data in the buffer to a new segment file; each of these segment files stores the data written to the buffer during the previous second. If the buffer holds no data, of course, no refresh happens; by default, if there is data in the buffer, a refresh runs once every second.

Operating systems actually put something called the OS cache (operating-system cache) in front of disk files: before data is written to a disk file, it first enters the OS cache, a memory cache at the operating-system level. As soon as the refresh operation flushes the buffered data into the OS cache, that data can be searched.

Why is ES only near real-time? NRT stands for near real-time. The default is to refresh every second, so ES is near real-time: written data cannot be seen until up to one second later. You can use ES's RESTful API or the Java API to perform a refresh manually, that is, to flush the data in the buffer into the OS cache so that it becomes searchable immediately. Once the data has entered the OS cache, the buffer is cleared; there is no need to keep the data in the buffer because it has already been persisted in the translog.
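
A minimal sketch of such a manual refresh via the Java High Level REST Client (the helper name is illustrative, not from the original):

import java.io.IOException;
import org.elasticsearch.action.admin.indices.refresh.RefreshRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

public class RefreshExample {
    // Flush the in-memory buffer into the OS cache so new docs are searchable now
    static void refreshNow(RestHighLevelClient client, String index) throws IOException {
        client.indices().refresh(new RefreshRequest(index), RequestOptions.DEFAULT);
    }
}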

This process repeats: new data keeps entering the buffer and the translog, and the buffered data keeps being written out to one new segment file after another. After every refresh the buffer is cleared, while the translog is retained. As this goes on, the translog grows larger and larger, and when it reaches a certain size, a commit operation is triggered.

The first step of the commit operation is to refresh any data still in the buffer into the OS cache, clearing the buffer. Then a commit point is written to a disk file, identifying all the segment files that the commit point corresponds to, and everything currently in the OS cache is forcibly fsync'ed to the disk files. Finally, the existing translog file is emptied and a new translog is started; the commit operation is then complete.

This commit operation is called flush. By default, a flush runs automatically every 30 minutes, but it is also triggered when the translog grows too large. Flush corresponds to the whole commit process; you can also trigger one manually through the ES API, fsync-ing the data in the OS cache to disk.
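
And the manual counterpart, again a sketch under the same client assumption:

import java.io.IOException;
import org.elasticsearch.action.admin.indices.flush.FlushRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

public class FlushExample {
    // Trigger a commit: fsync OS-cache data into segment files and truncate the translog
    static void flushNow(RestHighLevelClient client, String index) throws IOException {
        client.indices().flush(new FlushRequest(index), RequestOptions.DEFAULT);
    }
}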

What is the translog file for? Before a commit, data sits either in the buffer or in the OS cache. Both are memory, and once the machine dies, everything in memory is lost. So the operations on the data must be written to a dedicated log file, the translog. If the machine goes down, ES automatically reads the translog on restart and restores the data into the in-memory buffer and the OS cache.

The translog itself is also written to the OS cache first and is flushed to disk every 5 seconds by default. So, by default, up to 5 seconds of data may exist only in the buffer or in the OS cache of the translog or segment files; if the machine dies at that moment, those 5 seconds of data are lost. But this gives much better performance, and you lose at most 5 seconds of data. You can also configure the translog so that every write is fsync'ed directly to disk, but performance is then much worse.
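
That trade-off is controlled by the index.translog.durability setting ("async" plus index.translog.sync_interval for the periodic behavior, "request" to fsync on every request). A sketch of switching it at runtime, assuming the same client and a placeholder index:

import java.io.IOException;
import org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;

public class TranslogDurability {
    // Make every write fsync the translog before acknowledging: safest, slowest
    static void fsyncEveryRequest(RestHighLevelClient client, String index) throws IOException {
        UpdateSettingsRequest request = new UpdateSettingsRequest(index)
            .settings(Settings.builder()
                .put("index.translog.durability", "request")
                .build());
        client.indices().putSettings(request, RequestOptions.DEFAULT);
    }
}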

If the interviewer asks you about ES losing data, this is where you can show off: first, ES is near real-time, so data can be searched about 1 second after it is written. Second, you might indeed lose data: up to 5 seconds of data sitting in the buffer, the translog OS cache, or the segment-file OS cache, but not yet on disk. If the system goes down at that point, those 5 seconds of data are lost.

To summarize: data is written into the buffer and then refreshed into the OS cache every second; once in the OS cache it can be searched. Every 5 seconds the translog is written to its disk file (so if the machine dies and memory is lost, at most 5 seconds of data is gone). When the translog is large enough, or every 30 minutes by default, a commit is triggered, fsync-ing all OS-cache data into segment files on disk.

After the data is written to the segment file, an inverted index is created.

Underlying principles of deleting/updating data

If a doc is deleted, a .del file is generated at commit time, marking that doc as deleted; searches then use the .del file to determine whether a doc has been deleted. If an update is performed, the original doc is marked as deleted and a new doc is written.
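
A sketch of the two operations through the Java High Level REST Client (index, IDs, and field are placeholders):

import java.io.IOException;
import org.elasticsearch.action.delete.DeleteRequest;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class DeleteUpdateExample {
    // Delete: the doc is only marked as deleted (via the .del file) until a merge
    static void deleteDoc(RestHighLevelClient client) throws IOException {
        client.delete(new DeleteRequest("articles", "1"), RequestOptions.DEFAULT);
    }

    // Update: marks the old doc as deleted and indexes the new version
    static void updateDoc(RestHighLevelClient client) throws IOException {
        UpdateRequest request = new UpdateRequest("articles", "2")
            .doc("{\"content\":\"Java is very hard to learn\"}", XContentType.JSON);
        client.update(request, RequestOptions.DEFAULT);
    }
}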

A new segment file is generated on every buffer refresh, so by default a new segment file appears every second, and segment files keep piling up. Periodically, multiple segment files are merged into one: during the merge, docs marked as deleted are physically removed, the new segment file is written to disk, a commit point is written identifying the new segment files, the new segment file is opened for search, and the old segment files are deleted.
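
Merging normally happens in the background, but it can also be requested explicitly. A sketch using the force-merge API, same client assumption as before:

import java.io.IOException;
import org.elasticsearch.action.admin.indices.forcemerge.ForceMergeRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

public class MergeExample {
    // Ask ES to merge the index's segments down to one, expunging deleted docs
    static void mergeToOneSegment(RestHighLevelClient client, String index) throws IOException {
        ForceMergeRequest request = new ForceMergeRequest(index).maxNumSegments(1);
        client.indices().forcemerge(request, RequestOptions.DEFAULT);
    }
}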

In simple terms, Lucene is a JAR package containing all kinds of well-packaged algorithms and code for building inverted indexes. We develop in Java, pull in the Lucene JAR, and then build on top of the Lucene API.

With Lucene, we can index existing data, and Lucene organizes the index's data structures on the local disk for us.
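
A minimal Lucene indexing sketch (the directory path and field name are illustrative):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneExample {
    public static void main(String[] args) throws Exception {
        // Lucene lays out the inverted-index files in this local directory
        Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("content", "Java is really fun", Field.Store.YES));
            writer.addDocument(doc); // analyzed into terms -> inverted-index entries
        }
    }
}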

Next up: the inverted index.