Basic concepts

ES has a few basic concepts. Understanding them at the start will make the rest of the learning process much easier.

Near real time search

ES is a near-real-time search engine, which means there is a slight delay (usually about one second) between the time a document is indexed and the time it becomes searchable.
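The delay exists because a newly indexed document only becomes visible to search after the index is refreshed, which by default happens about once per second. The toy model below is a sketch of that behavior only. `ToyIndex` and its methods are hypothetical names chosen for illustration; they are not part of ES or of any ES client.

```python
class ToyIndex:
    """Toy model of near-real-time search; not how ES is implemented."""

    def __init__(self):
        self._buffer = []      # indexed, but not yet searchable
        self._searchable = []  # visible to search

    def index(self, doc):
        # A newly indexed document lands in the buffer first.
        self._buffer.append(doc)

    def refresh(self):
        # A refresh makes everything buffered so far visible to search.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, term):
        return [d for d in self._searchable if term in d["title"]]


idx = ToyIndex()
idx.index({"title": "hello elasticsearch"})
before = idx.search("hello")  # no refresh yet, so nothing is found
idx.refresh()
after = idx.search("hello")   # the document is now visible
print(len(before), len(after))  # prints: 0 1
```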

Cluster

A cluster is a collection of one or more nodes (servers) that together hold all your data and provide indexing and search capabilities across all nodes. A cluster is identified by a unique name, which defaults to elasticsearch. This name matters because a node joins a cluster by that name. Make sure you do not reuse the same cluster name in different environments, or nodes may join the wrong cluster. For example, you could use logging-dev, logging-stage, and logging-prod for the development, staging, and production clusters. Note that a cluster with only a single node is perfectly valid. You can also run multiple independent clusters, each with its own unique cluster name.

Node

A node is a single server that is part of a cluster, stores data, and participates in the cluster’s indexing and search functions. Like a cluster, a node is identified by a name, which defaults to a random UUID assigned when the node starts. If you do not want the default, you can set any name you like; the name is useful for administration, because it tells you which server on your network corresponds to which node in the cluster. A node can be configured to join a specific cluster by cluster name. By default, each node joins a cluster named elasticsearch. This means that if you start several nodes on your network (assuming they can discover each other), they will automatically form and join a single cluster named elasticsearch. A cluster can contain as many nodes as you want. If no other ES nodes are running on your network, starting a single node forms a new single-node cluster named elasticsearch.
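To make the role of the cluster name concrete, here is a minimal sketch that groups nodes purely by their configured cluster name. It deliberately ignores network discovery, and the node names and the `form_clusters` helper are hypothetical. It only illustrates why a reused or default name can put a node in an unintended cluster.

```python
from collections import defaultdict


def form_clusters(nodes):
    """Group (node_name, cluster_name) pairs by cluster name."""
    clusters = defaultdict(list)
    for node_name, cluster_name in nodes:
        clusters[cluster_name].append(node_name)
    return dict(clusters)


nodes = [
    ("node-1", "logging-dev"),
    ("node-2", "logging-dev"),
    ("node-3", "logging-prod"),
    ("node-4", "elasticsearch"),  # started without a custom cluster name
]
clusters = form_clusters(nodes)
for name, members in clusters.items():
    print(name, members)
```

In this toy setup, node-4 ends up alone in the default cluster rather than in either logging cluster, which is the kind of surprise that explicit, environment-specific names are meant to avoid.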

Index

An index is a collection of documents with similar characteristics. For example, you might have one index for customer data, one for catalog information, and another for order information. An index is identified by a name (which must be all lowercase), and this name is used to add, delete, update, or query the documents in it. You can define as many indexes as you want in a cluster. (Do not confuse an index with a database; they are different concepts.)

Type

An index can define one or more types, which are logical partitions of the index whose semantics are up to you. Generally, a type is defined for a set of documents that share the same fields. Suppose you run a blogging platform and store all of its data in a single index, with a user type, a blog type, a comment type, and so on. For now, you can loosely think of an index as a database in a relational system and a type as a table. A type is essentially a metadata field in the index, and the data is not physically partitioned by type. As of version 6.0, an index is limited to a single type and no longer supports multiple types, and the feature is expected to be removed in 7.0. It is therefore recommended that one index correspond to one type.

Document

A document is the basic unit of information that can be indexed. For example, you can have a document for a single user’s data, a document for a single product, and a document for a single order. Documents are expressed in JSON. You can store as many documents as you want within an index/type. Note that although a document physically resides in an index, it must also be assigned a type. (Translator’s note: that said, more documents is not always better. Too many documents lead to too many indexes, which hurts read and write efficiency.)
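As a rough illustration, the sketch below builds a JSON document and the REST paths that address it. It assumes the `/<index>/<type>/<id>` addressing used by ES versions that still have types. The index name `customer`, the type name `_doc`, the ID, and the field names are all illustrative assumptions. Nothing is sent over the network; the code only prints what the requests would look like.

```python
import json


def doc_path(index, doc_id, doc_type="_doc"):
    """Build the REST path for a single document: /<index>/<type>/<id>."""
    return f"/{index}/{doc_type}/{doc_id}"


customer = {"name": "Jane Doe", "age": 30}  # illustrative fields
body = json.dumps(customer)

# (HTTP method, path, body) triples for the basic document operations.
operations = [
    ("PUT", doc_path("customer", 1), body),     # add or replace the document
    ("GET", doc_path("customer", 1), None),     # retrieve it by ID
    ("DELETE", doc_path("customer", 1), None),  # delete it
]
for method, path, payload in operations:
    print(method, path, payload or "")
```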

Shards and replicas

An index can store an amount of data that exceeds the hardware limits of a single machine. For example, an index of 1 billion documents occupying 1TB of space may not fit on a single machine’s disk, or a single node may be too slow to serve search requests on its own. To solve this problem, ES can split an index into multiple pieces, called shards. When you create an index, you can specify the number of shards. Each shard is a fully functional, independent “index” that can be hosted on any node in the cluster. Sharding has two main functions:

  • Horizontal capacity expansion
  • The ability to run operations across shards in parallel and in a distributed fashion (on multiple nodes), improving system performance and throughput

How shards are distributed and how the results of a search request are aggregated is managed internally by ES and is transparent to users.

In a network or cloud environment, failures are inevitable. To guard against shards or nodes going offline or being lost for any reason, a failover mechanism is recommended. For this purpose, Elasticsearch lets you create one or more copies of each shard, called replica shards, or replicas for short. Replicas have two main functions:

  • Replicas provide high availability in case a node or shard fails. For this reason, a replica is never allocated on the same node as the primary shard it was copied from.
  • Because searches can run on all replicas in parallel, replicas increase search throughput and concurrency.

To summarize, each index can be split into multiple shards, and an index can have zero or more replicas. Once replicated, each index has primary shards (the originals) and replica shards (copies of the primaries). The number of shards and replicas can be specified when the index is created. After the index is created, you can change the number of replicas dynamically, but the number of primary shards cannot be changed. By default, each index has 5 primary shards and 1 replica, which means that, assuming your cluster has at least two nodes, the index will have 5 primary shards and 5 replica shards, for a total of 10 shards per index.
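The shard arithmetic above, and the reason the primary count is fixed, can be sketched as follows. ES chooses a document’s shard from a hash of its routing value (the document ID by default) modulo the number of primary shards, so changing that number would send existing documents to different shards. The helper names are hypothetical, and `zlib.crc32` is only a deterministic stand-in, not the hash ES actually uses.

```python
import json
import zlib


def index_settings(primaries=5, replicas=1):
    """Settings body for index creation (the body of PUT /<index>)."""
    return {"settings": {"number_of_shards": primaries,
                         "number_of_replicas": replicas}}


def total_shards(primaries, replicas):
    # Each primary has `replicas` copies, so the total is p * (1 + r).
    return primaries * (1 + replicas)


def pick_shard(routing_key, primaries):
    # Stand-in for ES routing: hash the key, then take it modulo the
    # number of primary shards. crc32 is NOT the hash ES uses.
    return zlib.crc32(routing_key.encode("utf-8")) % primaries


print(json.dumps(index_settings()))
print(total_shards(5, 1))  # prints: 10

# Count how many keys would land on a different shard with 6 primaries.
keys = [f"doc-{i}" for i in range(1000)]
moved = sum(pick_shard(k, 5) != pick_shard(k, 6) for k in keys)
print(f"{moved} of {len(keys)} keys map to a different shard")
```

Because most keys move when the modulus changes, existing documents could no longer be found where routing expects them. That is why the number of primary shards is fixed at creation, while the replica count can be adjusted freely.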

Note: Each shard is actually a Lucene index, and a single Lucene index has a maximum number of documents it can hold. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes with the _cat/shards API.
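As a small worked example of that limit, the sketch below computes the threshold and flags primary shards that are getting close to it. The rows are hypothetical sample data, shaped on the assumption that `GET _cat/shards?format=json` returns one object per shard with string fields such as `index`, `shard`, `prirep`, and `docs`. The 80% warning threshold and the `shards_near_limit` helper are likewise assumptions made for illustration.

```python
MAX_DOCS_PER_SHARD = 2**31 - 1 - 128  # Integer.MAX_VALUE - 128


def shards_near_limit(rows, threshold=0.8):
    """Return (index, shard) pairs for primaries above threshold * limit."""
    return [
        (row["index"], row["shard"])
        for row in rows
        if row["prirep"] == "p"
        and row.get("docs")  # skip shards that report no doc count
        and int(row["docs"]) > threshold * MAX_DOCS_PER_SHARD
    ]


# Hypothetical rows, shaped like _cat/shards?format=json output.
sample = [
    {"index": "orders", "shard": "0", "prirep": "p", "docs": "1900000000"},
    {"index": "orders", "shard": "1", "prirep": "p", "docs": "120000"},
    {"index": "orders", "shard": "0", "prirep": "r", "docs": "1900000000"},
]
print(MAX_DOCS_PER_SHARD)         # prints: 2147483519
print(shards_near_limit(sample))  # prints: [('orders', '0')]
```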

Proper control of the number and size of shards can greatly improve performance. Shards should not be too large; otherwise opening and reading their files becomes slow, which hurts both query and write efficiency. In that case, control the number of documents and make sensible use of routing at the application level. Nor should there be too many shards; otherwise too many documents have to be merged during query aggregation, too much memory is used, and shard relocation takes too long when the cluster recovers. With that in mind, let’s look at some fun things…

Elasticsearch

Install Elasticsearch