The core concepts of Elasticsearch

Near Real Time (NRT)

Near real time (NRT) means there is a small delay between the time data is written and the time it becomes searchable.

  1. First, a Java application saves data to Elasticsearch at **2:16:20**.
  2. If another Java application then fetches that data from Elasticsearch, the delay is on the order of seconds regardless of cluster size; in this example the delay is 1 second.
  3. It needs to be mentioned here: "real time" is divided into quasi-real-time and near-real-time. Quasi-real-time is millisecond level; near-real-time is second level.
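To make the idea concrete, here is a minimal, hypothetical toy sketch (not Elasticsearch's actual implementation) of an index where newly written documents only become searchable after a periodic refresh, which is what makes search "near" real time rather than real time:

```python
class NearRealTimeIndex:
    """Toy index: writes land in a buffer and only become
    searchable after refresh(), mimicking ES's refresh cycle."""

    def __init__(self):
        self._buffer = []        # newly written docs, not yet searchable
        self._searchable = []    # docs visible to search

    def write(self, doc):
        # Writes succeed immediately, but the doc is not searchable yet.
        self._buffer.append(doc)

    def refresh(self):
        # In Elasticsearch this happens automatically, by default
        # every 1 second - hence the roughly one-second delay.
        self._searchable.extend(self._buffer)
        self._buffer = []

    def search(self, product_name):
        return [d for d in self._searchable
                if d.get("product_name") == product_name]

index = NearRealTimeIndex()
index.write({"product_name": "Pastoral fabric sofa", "product_id": "1"})
print(index.search("Pastoral fabric sofa"))  # [] - not refreshed yet
index.refresh()
print(index.search("Pastoral fabric sofa"))  # the document is now visible
```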

Cluster

A cluster contains multiple nodes. Which cluster a node belongs to is determined by a configuration setting (the cluster name, `elasticsearch` by default). For small and medium-sized enterprises it is normal to start with one node per cluster.
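For reference, that configuration lives in `elasticsearch.yml`. A minimal sketch (the setting names are standard Elasticsearch configuration; the values here are just examples):

```yaml
# elasticsearch.yml - which cluster this node joins
cluster.name: my-es-cluster   # default is "elasticsearch"
node.name: node-1             # explicit node name (otherwise randomly assigned)
```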

Node

Each node in the cluster is assigned a random name by default. The node name is very important (for O&M operations). Each node joins a cluster named `elasticsearch` by default, so a single node started on its own also forms a one-node cluster. In that case the cluster status is typically yellow rather than green; why there are yellow and green states will be explained in more detail later.
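That status can be checked with the cluster health API. A sketch of the request (Kibana Dev Tools syntax; with curl it would be `curl localhost:9200/_cluster/health`):

```
GET _cluster/health
```

The `status` field in the response is what reports green, yellow, or red.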

Document

The document is the smallest unit of data in ES. It can be a customer record, a product-category record, an order record, and so on, and is usually expressed as JSON. Each type under an index can store multiple documents. Here is an example of a simple product document.

product document
{
    "product_name": "Pastoral fabric sofa",
    "product_id": "1",
    "category_name": "Sofa",
    "category_id": "2"
}

Type

Before Elasticsearch 6.0 an index could contain multiple types; from 6.0 onward it can contain only one type, and type is completely removed after 7.0. Why was type gradually removed from Elasticsearch? In Elasticsearch, fields with the same name in different types of the same index are backed by the same underlying field; that is, types do not actually isolate data from each other, so type became less and less meaningful and was phased out. The following example shows how documents are stored under a type (the type/document nesting here is only an illustration, not a representation of the actual data structure in ES).

{
    "type": [
        {
            "product_name": "Pastoral fabric sofa",
            "product_id": "1",
            "category_name": "Cloth sofa",
            "category_id": "2"
        },
        {
            "product_name": "Rural solid wood sofa",
            "product_id": "2",
            "category_name": "Solid wood sofa",
            "category_id": "3"
        }
    ]
}

Index

Before version 6.0, an index could store multiple types, but from 6.0 onward it can store only one type, and that type contains many documents. The following is only an illustration of how index, type, and document relate to each other, not a representation of the actual ES data structure:

{
    "index": {
        "type": [
            {
                "product_name": "Rural cloth art sofa",
                "product_id": "1",
                "category_name": "Cloth art sofa",
                "category_id": "2"
            },
            {
                "product_name": "Rural solid wood sofa",
                "product_id": "2",
                "category_name": "Solid wood sofa",
                "category_id": "3"
            }
        ]
    }
}

Shards

Shards are Elasticsearch's mechanism for splitting an index into pieces. Why do we need shards? The following example explains.

  1. Suppose we have a requirement to store 3T of data in the ES cluster, but the maximum capacity of each node is only 1T, so a single server cannot hold our data.
  2. We therefore split the 3T of data into 3 parts and store them on 3 nodes (don't ask why, with 6 servers, we don't just put 500G on each — I won't accept that argument 😄, just kidding, the use of the remaining 3 servers will be explained later). This is where shards come in: the index is split into multiple shards, each shard holds part of the index's data, and these shards are scattered across multiple servers.
  3. The first benefit is horizontal scaling. For example, when the amount of data grows to 4T, we only need to add a node, create a new index with 4 shards, and then reindex the data from the previous 3-shard index (reindexing is very simple and will be introduced later).
  4. The second benefit is that, with the data distributed across multiple shards on multiple servers, all operations are executed in parallel across those servers, which improves throughput and performance. If there were only one node, all incoming requests would put their pressure on that single server; if a single server can handle 2000 requests, then four servers can handle 8000.
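How does ES decide which shard a document lands on? By default it uses a routing formula of the form `shard = hash(routing) % number_of_primary_shards`, with the document id as the default routing value. A simplified Python sketch of the idea, using an MD5 hash for illustration rather than Elasticsearch's actual Murmur3 hash:

```python
import hashlib

def route_to_shard(doc_id: str, number_of_primary_shards: int) -> int:
    """Pick a shard for a document: hash(routing) % num_primary_shards.
    ES uses Murmur3; MD5 is used here just to get a stable integer hash."""
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
    return h % number_of_primary_shards

# With 3 primary shards, document ids spread deterministically over shards 0..2:
for doc_id in ["1", "2", "3", "4"]:
    print("doc", doc_id, "-> shard", route_to_shard(doc_id, 3))
```

This formula is also why the number of primary shards cannot be changed after an index is created: changing it would route existing documents to different shards, which is why the 3-shard to 4-shard example above requires creating a new index and reindexing.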

Replica

A replica is a backup of a shard. For example, if a node breaks down, the data on that node's shards would be lost and could no longer be retrieved. If the node holding a primary shard goes down, searches are instead served from the replica of that shard on another node (the role of a replica is not only backup; it also improves throughput, etc.). A replica stores exactly the same data as its primary shard.
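Shard and replica counts are set when an index is created. A sketch of the request (Kibana Dev Tools syntax; `number_of_shards` and `number_of_replicas` are standard Elasticsearch index settings, the index name and values here are just examples):

```
PUT /product_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
```

With 3 primary shards and 1 replica each, the cluster holds 6 shards in total, and on a single node the 3 replicas cannot be assigned, which is one reason a one-node cluster shows yellow health.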