Introduction

As the Tencent Cloud Elasticsearch product gains more features, the number of ES users keeps growing, and so does the scale of clusters on the cloud. In our daily O&M work, we frequently encounter availability and stability problems in clusters that were inadequately planned in the early stage and then grew large as the business expanded.

Here are some typical cluster planning problems:

  • Node specification planning: the cluster has many nodes, but each node has a low specification;

  • Index shard planning: a small index is given dozens of shards, or a large index is given only two or three;

  • Shard count planning: the cluster holds more than 100,000 shards in total.

As the saying goes, sharpening the axe does not delay the cutting of firewood: doing a thorough job of cluster evaluation and planning in the early stage saves a great deal of O&M work later, and keeps the cluster highly available and stable over the long term.

Based on the cluster problems we encounter in the daily O&M of Tencent Cloud ES clusters and the experience we have accumulated, this article introduces how to plan cluster capacity and index configuration, along with the principles and lessons we follow. Author: Wu Rong, Tencent Cloud Elasticsearch R&D engineer.

Planning cluster scale and indexes

1. Cluster scale evaluation

(1) What to evaluate?

Cluster scale evaluation mainly covers the following three aspects:

The first is computing resources, mainly the CPU and memory of a single node.

ES generally consumes computing resources during writes and queries. Summarizing the O&M experience of a large number of ES clusters: a 2C8G node can typically sustain about 5,000 docs/s of writes, and a 32C64G node about 50,000 docs/s.

The second is storage resources, mainly the disk type and capacity.

For example, which disk type should the ES cluster use, SSDs or high-performance cloud disks? And for capacity, should each node mount a single large disk or multiple smaller disks? By default, hot-warm clusters use SSDs on the hot nodes and high-performance cloud disks on the warm nodes.

In addition, Tencent Cloud ES supports mounting multiple cloud disks on a single node, and performance tests show that the throughput of three disks is roughly 2.8 times that of a single disk. If you have high requirements on write speed and I/O performance, you can mount multiple SSDs.

Schematic diagram of an ES hot-warm cluster with multiple disks per node

The third is the number of nodes, mainly the number of data nodes in the cluster.

For the same overall cluster performance, we recommend fewer, higher-specification nodes: for example, a cluster of 3 × 32C64G nodes beats a cluster of 12 × 8C16G nodes in both stability and ease of scaling.

When a high-specification cluster hits a performance bottleneck, it can simply be scaled out by adding more nodes of the same specification. A low-specification cluster, by contrast, eventually has to be scaled up vertically.

There are two ways to scale up vertically on the cloud:

The first is a rolling restart, which temporarily affects cluster stability.

The second is data migration: add the same number of higher-specification nodes to the cluster, migrate the data from the low-specification nodes to the new nodes, and then remove the low-specification nodes from the cluster. This avoids the restart, but the expansion takes longer and costs more.

Schematic diagram of vertical scaling via data migration

(2) On what basis?

Cluster scale evaluation is mainly based on the following three points:

  • The specific business scenario, such as log analysis, metric monitoring, or search;

  • The query and write QPS the business expects;

  • The total volume of index data.

(3) Cluster scale evaluation criteria

Based on our O&M experience, here are several suggestions for cluster scale evaluation:

  • A single 32C64G node can normally sustain about 50,000 writes/s;

  • If write throughput and data volume are large, prefer 32C64G nodes;

  • Every 1 TB of index data consumes roughly 2-4 GB of heap memory;

  • Prefer large-memory nodes for search scenarios;

  • Storage capacity = source data × (1 + number of replicas) × 1.45 × (1 + reserved space) ≈ source data × (1 + number of replicas) × 2.2.
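
As a worked example of the storage formula (the numbers are purely illustrative): 10 TB of source data with one replica requires roughly 10 TB × (1 + 1) × 2.2 = 44 TB of total disk capacity across the cluster.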

2. Index configuration evaluation

(1) What to evaluate?

Index configuration evaluation mainly covers two points:

First, how should indexes be partitioned over time?

When using indexes, we recommend rolling to a new index periodically. For log scenarios, create a new index daily when the write volume is small, and hourly when it is large.

The benefits of rolling indexes periodically: it bounds the size of a single index, improving read and write performance; it prevents a single index from growing so large that fault recovery slows down; and it avoids oversized hot indexes that lengthen snapshot backup and restore times.
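
As an illustration, one common way to implement periodic rolling is the rollover API. Below is a minimal sketch, assuming a hypothetical logs-000001 index and a logs_write write alias; the conditions would be tuned to the actual write volume:

# Bootstrap the first index with a write alias
PUT /logs-000001
{
  "aliases": {
    "logs_write": { "is_write_index": true }
  }
}

# Called periodically: rolls to logs-000002 once either condition is met
POST /logs_write/_rollover
{
  "conditions": {
    "max_age": "1d",
    "max_size": "50gb"
  }
}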

Second, how should the number of primary shards be set?

By default, an index on the cloud has five primary shards. The appropriate number depends on the specific scenario and data volume; guidelines and lessons are given below.
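
For illustration, the primary shard count is fixed when the index is created and can only be changed later via shrink, split, or reindex; the index name and values below are hypothetical:

PUT /logs-2024.01.01
{
  "settings": {
    "index.number_of_shards": 3,
    "index.number_of_replicas": 1
  }
}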

(2) On what basis?

Index configuration should likewise be evaluated against the specific business scenario and the index data volume, especially the amount of data added in a single day.

(3) Index configuration evaluation criteria

Index configuration can be evaluated against the following criteria:

  • Keep the size of a single shard between 30 GB and 50 GB;

  • Keep the total number of shards in the cluster under 30,000;

  • Keep the number of shards per 1 GB of heap memory to 20-30;

  • Keep the number of shards on a single node under 1,000;

  • Set the number of shards of an index equal to the number of nodes;

  • For large clusters, configure dedicated master nodes;

  • Dedicated master nodes should be 8C16G or larger;

  • For time-series data, combine the hot-warm architecture with ILM (index lifecycle management).

The limit on the total shard count deserves particular explanation. Our performance tests show that once a cluster's total shard count exceeds 100,000, index creation slows to the level of minutes.

For clusters writing at more than one million QPS in particular, if the total shard count is 100,000+ and indexes are created automatically, writes drop precipitously and the cluster becomes unavailable every time a new index is switched in.

The figure below shows a cloud cluster with 100 nodes and a total of 110,000+ shards. Every day at 8:00, when the new index was switched in, writes dropped straight to zero and the cluster was unavailable for several hours.

The write performance of the cluster is affected at 8:00 every day

For this kind of problem, our Tencent Cloud ES team has some very mature optimization schemes.

Creating indexes in advance eliminates the steep write drop when the new index is switched in at 8:00 every day. We also advise using a fixed index mapping to avoid massive put-mapping metadata updates, because updating metadata is a very expensive operation in a cluster with this many nodes and shards.
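
A minimal sketch of both techniques, with hypothetical names and fields; the legacy _template API is shown here, which newer versions replace with composable index templates:

# Fixed mapping and settings defined once in an index template
PUT /_template/logs_template
{
  "index_patterns": ["logs-*"],
  "settings": {
    "index.number_of_shards": 32,
    "index.number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" },
      "message":   { "type": "text" }
    }
  }
}

# Run from a scheduled task before the daily switch: pre-create tomorrow's index
PUT /logs-2024.01.02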

When the total shard count exceeds 100,000, which is common in log analysis scenarios, and the historical data is not important, you can simply delete historical indexes periodically.

When the historical data is important and nothing can be deleted, use the hot-warm architecture together with the index lifecycle management (ILM) feature: store data older than 7 days on the warm nodes, and shrink the index to a smaller number of primary shards as its data migrates from the hot nodes to the warm nodes. The data on the warm nodes can additionally be backed up to Tencent Cloud COS via snapshot, after which the replicas of the indexes on the warm nodes can be set to 0, further reducing the cluster's total shard count.

Cluster architecture with hot-warm separation, ILM, and COS backup
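
A minimal ILM policy sketch along these lines; the policy name and thresholds are illustrative, and the temperature node attribute matches the hot-warm setup above. In practice, replicas would be dropped to 0 only after the COS snapshot completes:

PUT /_ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_size": "50gb" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": {
            "number_of_replicas": 0,
            "require": { "temperature": "warm" }
          },
          "shrink": { "number_of_shards": 1 }
        }
      }
    }
  }
}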

ES write performance optimization

The write performance of an ES cluster is affected by many factors. Here are some suggestions for optimizing the write performance:

1. Write data without specifying doc_id and let ES generate it automatically

Each doc in an index has a globally unique doc_id, which can either be user-defined or generated automatically by ES.

With a user-defined doc_id, the write path gains an extra step: ES must check whether the doc_id already exists, updating the doc if it does and creating it if it does not.

So unless we have a specific need for custom doc_ids, let ES generate them automatically to gain some write performance.
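
A minimal illustration of the two write paths, with a hypothetical index name:

# Auto-generated doc_id: no existence check, the faster write path
POST /logs-2024.01.01/_doc
{ "message": "hello" }

# User-defined doc_id: ES must first check whether id 1 already exists (update vs. create)
PUT /logs-2024.01.01/_doc/1
{ "message": "hello" }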

2. For large clusters, create indexes in advance and use a fixed index mapping

This recommendation was already mentioned above: creating an index or adding a new field is a metadata update operation that requires the master node to synchronize the new cluster state to every node.

In a large cluster with high write QPS, the master is therefore especially prone to metadata update timeouts. These pile up as a large number of pending_tasks on the master node, leaving the cluster unavailable or even masterless.

Updating the cluster metadata timed out

A large number of pending_tasks accumulated in the cluster

3. When near-real-time search is not required, increase refresh_interval appropriately

The default refresh_interval in ES is 1s, meaning a doc becomes searchable 1s after it is written.

If the business does not need near-real-time visibility, as in log scenarios, setting refresh_interval to 30s in the index template avoids generating too many small segment files and reduces segment merging.
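
A minimal sketch on an existing, hypothetical index (the same setting can live in the index template):

PUT /logs-2024.01.01/_settings
{
  "index.refresh_interval": "30s"
}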

4. For bulk-write scenarios, write with replicas set to 0 and enable replicas after the writes complete

More and more external customers are choosing to migrate their self-built ES clusters to Tencent Cloud, usually with Logstash. Since the data is fully retained in the self-built cluster, the replicas of the indexes being written on the cloud can be set to 0 during the migration so that it completes quickly, and enabled again once the migration finishes.
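
A sketch of the two steps, with a hypothetical index name:

# During the migration: write with no replicas
PUT /logs-2024.01.01/_settings
{ "index.number_of_replicas": 0 }

# After the migration completes: enable the replica
PUT /logs-2024.01.01/_settings
{ "index.number_of_replicas": 1 }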

5. Write data in batches with the Bulk API, keeping each bulk request at around 10 MB

To improve write performance, ES provides a bulk write API. Typically the client accumulates a batch of data and writes it to ES in one request; after receiving a bulk request, the coordinating node splits the batch by routing value into several sub-bulk requests and sends them asynchronously to the nodes holding the target shards.

This greatly reduces the network round trips and latency of write requests. We recommend keeping each bulk request under about 10 MB and around 10,000 docs.

ES Bulk request diagram
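
The bulk request body is newline-delimited JSON, one action line followed by one document line (the index name and fields here are illustrative):

POST /_bulk
{ "index": { "_index": "logs-2024.01.01" } }
{ "message": "doc 1", "timestamp": "2024-01-01T08:00:00Z" }
{ "index": { "_index": "logs-2024.01.01" } }
{ "message": "doc 2", "timestamp": "2024-01-01T08:00:01Z" }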

6. Use custom routing to forward each request to as few shards as possible

As mentioned above, ES's Bulk API writes data in batches. Although the coordinating node sends the sub-requests to the shards asynchronously, it waits for every shard to respond before replying to the client, so bulk latency is determined by the slowest shard: the classic long-tail problem of distributed systems.

We can therefore set a custom routing value so that each bulk request is forwarded to as few shards as possible:

POST /_bulk?routing=user_id

Custom routing

7. Try to select an SSD disk type and mount multiple cloud disks

The cloud provides multiple disk types. A 1 TB SSD cloud disk delivers 260 MB/s of throughput, versus 150 MB/s for a high-performance cloud disk, so choosing SSDs improves both write throughput and I/O performance.

In addition, Tencent Cloud supports multiple disks per node; compared with a single-disk node, three disks deliver roughly 2.8 times the throughput.

8. Freeze historical indexes to release more memory space

We know that an ES index has three states: Open, Frozen, and Close, as shown below:

Three states of an ES index

An index in the Open state is the fastest to search, because its inverted index is loaded into memory as an FST data structure.

However, this consumes a large amount of memory that stays resident and is never garbage-collected: 1 TB of index data takes roughly 2-4 GB of JVM heap.

An index in the Frozen state can still be searched, but it occupies no memory and lives only on disk, so searches against it are relatively slow. If the cluster holds a large amount of data and historical data cannot be deleted, consider freezing historical indexes with the following API to free up memory:

POST /index_name/_freeze

To search frozen indexes, specify the ignore_throttled=false parameter in the request:

GET /index_name/_search?ignore_throttled=false
{
  "query": {
    "match": {
      "name": "wurong"
    }
  }
}

The above covers the more common write performance tuning recommendations and experience; more effective tuning has to be tailored to the specific business scenario and cluster size.

Summary of routine ES cluster O&M experience

1. Check the cluster health status

The health status of an ES cluster is Green, Yellow, or Red:

  • Green: all primary and replica shards are allocated;

  • Yellow: at least one replica shard is unallocated;

  • Red: at least one primary shard is unallocated.

We can query the cluster health status and the number of unassigned shards with the following API:

GET _cluster/health
{
  "cluster_name": "es-xxxxxxx",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 103,
  "number_of_data_nodes": 100,
  "active_primary_shards": 4610,
  "active_shards": 9212,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 8,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 99.91323210412148
}

The key fields are status, number_of_nodes, unassigned_shards, and number_of_pending_tasks.

A high number_of_pending_tasks usually means that metadata updates triggered by the master node are timing out and tasks are backing up.

We can use the following API to see which tasks are waiting to be executed:

GET /_cat/pending_tasks
insertOrder timeInQueue priority source
       1685       855ms HIGH     update-mapping [foo][t]
       1686       843ms HIGH     update-mapping [foo][t]
       1693       753ms HIGH     refresh-mapping [foo][[t]]
       1688       816ms HIGH     update-mapping [foo][t]

The priority field indicates the task's priority. Looking at the ES source code, there are six levels:

IMMEDIATE((byte) 0),
URGENT((byte) 1),
HIGH((byte) 2),
NORMAL((byte) 3),
LOW((byte) 4),
LANGUID((byte) 5);

2. Check the reason why shards are unassigned

When the cluster is Red, we can check the cause of the unassigned shards with the following API:

GET _cluster/allocation/explain

Checking why a shard is unassigned

In the output, the index and shard fields identify which shard of which index failed to allocate, and the reason field gives the high-level cause. All possible reason values are listed below:

  • INDEX_CREATED: unassigned as a result of an API that creates an index;

  • CLUSTER_RECOVERED: unassigned as a result of a full cluster recovery;

  • INDEX_REOPENED: unassigned as a result of opening a closed index;

  • DANGLING_INDEX_IMPORTED: unassigned as a result of importing a dangling index;

  • NEW_INDEX_RESTORED: unassigned as a result of restoring into a new index;

  • EXISTING_INDEX_RESTORED: unassigned as a result of restoring into a closed index;

  • REPLICA_ADDED: unassigned as a result of explicitly adding a replica;

  • ALLOCATION_FAILED: unassigned as a result of a failed allocation attempt;

  • NODE_LEFT: unassigned because the node hosting it left the cluster;

  • REINITIALIZED: unassigned because the shard moved from started back to initializing (for example, with shadow replicas);

  • REROUTE_CANCELLED: unassigned as a result of an explicit cancel-reroute command;

  • REALLOCATED_REPLICA: a better replica location was identified, cancelling the existing replica allocation.

The details field gives the more specific reason for the failure; below I summarize the causes we see most often in daily O&M work.

If a large number of shards are unassigned, we can also list all unassigned indexes and primary shards with the following API:

GET /_cat/indices?v&health=red

3. Common reasons why shards are unassigned

(1) The disk is full

the node is above the high watermark cluster setting [cluster.routing.allocation.disk.watermark.high=95%], using more disk space than the maximum allowed [95.0%], actual free: [4.055101177689788%]

If the output of the _cluster/allocation/explain command contains the message above, the disk of the node holding the shard is full.

Solution: expand the disk capacity, or delete historical data to free disk space.

Normally, when a disk fills up, ES sets all indexes on that node to read-only to protect cluster stability. Since ES 7.x the read-only block can be removed automatically once disk space is freed, but on earlier versions you must remove it manually with the following API:

PUT index_name/_settings
{
 "index": {
   "blocks": {
     "read_only_allow_delete": null
    }
  }
}

(2) The number of documents in a shard exceeds the 2.1 billion limit

failure IllegalArgumentException[number of documents in the index cannot exceed 2147483519

This limit applies to the shard dimension, not the index dimension, so this exception usually means the index's shard count was not set appropriately.

Solution: switch writes to a new index, and modify the index template to set a reasonable number of primary shards.

(3) The node holding the primary shard has left the cluster

cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster

This is usually caused by a node failing, or dropping out of the cluster under high load.

Solution: find the reason the node went offline, restart it so it rejoins the cluster, and wait for the shards to recover.

(4) The attributes required by the index do not match the node attributes

node does not match index setting [index.routing.allocation.require] filters [temperature:\"warm\",_id:\"comdNq4ZSd2Y6ycB9Oubsg\"]

Solution: reset the attributes required by the index to match the node attributes. Resetting the node attributes instead would require restarting the node, which is more costly.

For example, the following API changes the temperature attribute that the index requires of its nodes:

PUT /index_name/_settings
{
  "index": {
    "routing": {
      "allocation": {
        "require": {
          "temperature": "warm"
        }
      }
    }
  }
}

(5) A node rejoins the cluster after a long offline period and brings stale data

cannot allocate because all found copies of the shard are either stale or corrupt

Solution: reassign a primary shard with the reroute API:

POST /_cluster/reroute?pretty
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "article",
        "shard": 1,
        "node": "98365000222032",
        "accept_data_loss": true
      }
    }
  ]
}

(6) Unassigned shards hit the concurrent shard recovery limit, and the rest queue up

reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])

This usually happens after the cluster or a node is restarted, because the default concurrency for shard recovery is low.

Solution: to restore the cluster to a healthy state as quickly as possible, increase the recovery concurrency and speed with the following API:

PUT /_cluster/settings
{
    "transient" : {
        "cluster.routing.allocation.node_concurrent_recoveries": "20",
        "indices.recovery.max_bytes_per_sec": "100mb"
    }
}

Conclusion

This article described the evaluation criteria for planning cluster scale and index configuration. Planning clusters against these criteria in advance keeps them stable and available and saves a great deal of complex O&M work later.

It also introduced common write performance optimization suggestions and methods that can further improve a cluster's write performance and stability, and finally covered the methods and thought process for troubleshooting cluster problems in daily O&M. We hope this article helps every Tencent Cloud ES customer.