Elasticsearch is a distributed search engine based on Lucene, providing real-time search and analysis capabilities for massive data. The Elastic Stack is an open-source suite built around it that provides simple, easy-to-use solutions for logging, search, and system monitoring.

Didi Elasticsearch

Didi started building its Elasticsearch platform in early 2016, and now runs over 3,500 Elasticsearch instances with over 5 PB of data storage and a peak write throughput of over 20 million TPS.

Elasticsearch supports a variety of application scenarios at Didi, such as the core online map search for ride-hailing, multi-dimensional queries for customer service and operations, and nearly a thousand platform users including the Didi log service.

The sheer scale and richness of these scenarios created huge challenges for Elasticsearch, from which we gained a lot of experience and some results. This article shares Didi's practice with the Elasticsearch multi-cluster architecture.

Single-cluster architecture bottlenecks

Before introducing the single cluster architecture bottleneck, let’s take a look at the single cluster architecture of Didi Elasticsearch.

Didi Elasticsearch single cluster architecture

In Didi's single-cluster architecture, writes and queries are controlled by the Sink service and the Gateway service respectively.

| Sink service

Almost all of the data Didi writes to Elasticsearch is consumed from Kafka. The Kafka data includes service log data, MySQL binlog data, and data reported independently by services. The Sink service consumes this data into Elasticsearch in real time.

The Sink service was originally designed to control writes and protect the Elasticsearch cluster from massive data ingestion, and it has been in use ever since. The service has since been separated from the Elasticsearch platform to form the Didi DSink data delivery platform, which synchronizes data from Kafka or MQ to Elasticsearch, HDFS, Ceph, and other storage services in real time.

With the multi-cluster architecture, the Elasticsearch platform can consume a single copy of MQ data and write it to multiple Elasticsearch clusters, enabling cluster-level disaster recovery, and can fail over by replaying MQ data.
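As a minimal sketch of this dual-write idea (the class names and the in-memory stand-in client here are hypothetical, not the actual DSink code), one consumed MQ record is fanned out to every cluster on its route:

```python
import json

class SinkWorker:
    """Hypothetical sketch: write each consumed MQ record to every
    Elasticsearch cluster on its route, so a second cluster can serve
    as a disaster-recovery copy."""

    def __init__(self, clusters):
        # clusters: mapping of cluster name -> a client with a bulk() method
        self.clusters = clusters

    def handle(self, record, route):
        doc = json.loads(record)
        for name in route:
            self.clusters[name].bulk(index=doc["index"], body=doc["source"])

class FakeCluster:
    """In-memory stand-in for an Elasticsearch client, for illustration only."""
    def __init__(self):
        self.docs = []
    def bulk(self, index, body):
        self.docs.append((index, body))

clusters = {"es-log-a": FakeCluster(), "es-log-b": FakeCluster()}
worker = SinkWorker(clusters)
worker.handle('{"index": "app_log", "source": {"msg": "hi"}}',
              route=["es-log-a", "es-log-b"])
```

With a real client the `bulk()` call would batch documents, but the routing logic is the same: the route list, not the producer, decides how many clusters receive the data.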

| Gateway service

All services are queried through the Gateway service, which implements the Elasticsearch HTTP RESTful and TCP protocols, so clients can access it through the native Elasticsearch SDKs of various languages. The Gateway service also implements an SQL interface, allowing businesses to query the Elasticsearch platform directly with SQL.

The Gateway service initially provided basic capabilities such as application permission control, access logging, rate limiting, and degradation. As the platform evolved, it also gained index storage separation, DSL-level rate limiting, and multi-cluster disaster recovery.
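A toy illustration of the routing step the Gateway performs (the mapping table and function names are invented for this sketch; the real service also handles rate limiting and DSL checks): the caller passes only an index name, and the Gateway resolves which physical cluster holds it.

```python
# Hypothetical index -> cluster map; in practice this metadata comes from
# the Admin service.
INDEX_TO_CLUSTER = {
    "poi_search": "es-online",
    "app_log": "es-log",
}

def route_query(app_id, index, allowed_indexes):
    """Check the app's permission on the index, then return the cluster
    that should serve the query."""
    if index not in allowed_indexes.get(app_id, set()):
        raise PermissionError(f"{app_id} cannot read {index}")
    return INDEX_TO_CLUSTER[index]

cluster = route_query("map-svc", "poi_search", {"map-svc": {"poi_search"}})
```

Because the business side only ever names an index, the platform is free to move that index between clusters without any client change.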

| Admin service

The entire Elasticsearch platform is managed by the Admin service, which provides rich platform capabilities such as index life-cycle management, automatic index capacity planning, index health scoring, and cluster monitoring, and supplies metadata such as index and permission information to the Sink and Gateway services.

Elasticsearch single cluster bottleneck

The Elasticsearch cluster kept growing. At its peak, it consisted of hundreds of physical machines with 3,000+ indexes and over 50,000 shards, and its total capacity reached the PB level. Such a large Elasticsearch cluster faces significant stability risks in the following three areas:

  • Elasticsearch architecture bottleneck

  • Index resource sharing risks

  • Service scenarios vary greatly

Elasticsearch architecture bottleneck

The Elasticsearch architecture has a bottleneck when the cluster grows to a certain size. The bottleneck is mainly related to the Elasticsearch task processing model.

Elasticsearch looks like a P2P architecture, but it’s still a centralized distributed architecture. There is only one active Master in the cluster. Master is responsible for metadata management of the entire cluster. All metadata of a cluster is stored in the ClusterState object, including global configuration information, index information and node information. Any changes to metadata must be made by the master.

Task processing on the Elasticsearch master is done by a single thread. For each task that changes the ClusterState, the master publishes the latest ClusterState object to all nodes in the cluster and blocks until every node has acknowledged the change; only then is the task considered complete.

Such an architectural model leads to serious stability risks as clusters grow in size.

  • If a node hangs, for example when the JVM heap is exhausted but the process is still alive, it responds to the master's publishes very slowly, delaying the completion of every task.

  • When there are a large number of recovery tasks, all of them must queue because the master is single-threaded, resulting in a large number of pending_tasks, and recovery becomes very slow.

  • Master tasks have priorities. For example, the put-mapping task has a lower priority than index creation and recovery, so if some unimportant indexes are being restored while a normal index is writing documents with new fields, the dynamic mapping updates for those new fields are blocked.

  • In the master's task-processing model, a large number of listeners are called back to handle metadata changes after each task completes. Some of this callback logic becomes slow once the number of indexes and shards balloons: at 50,000–60,000 shards, some tasks take 8–9 s to process, which seriously affects the cluster's recovery ability.

Elasticsearch has made optimizations here, such as batching tasks of the same type (for example, put-mapping) so the master processes all identical tasks stacked in the queue at once, passing only the diff of the ClusterState object, and reducing the processing time of the listener modules.

But because all of the cluster's metadata changes are processed in a single master thread and must be synchronized to every node, blocking until all nodes acknowledge, the stability of this model decreases as the cluster grows.
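The stall behavior described above can be modeled with a toy single-threaded task loop (illustrative only, not Elasticsearch code): each queued task must wait for the slowest node to acknowledge the ClusterState publish, so one slow node delays every task behind it.

```python
import queue
import threading
import time

tasks = queue.Queue()
done = []

def publish(state, nodes):
    # The master blocks until every node acknowledges the new ClusterState;
    # each entry in `nodes` is that node's simulated ack delay in seconds.
    for ack_delay in nodes:
        time.sleep(ack_delay)

def master_loop(nodes):
    # Single-threaded task loop: tasks are strictly serialized.
    while True:
        task = tasks.get()
        if task is None:
            break
        publish(task, nodes)
        done.append(task)

nodes = [0.0, 0.0, 0.05]  # one slow node stalls every publish
t = threading.Thread(target=master_loop, args=(nodes,))
t.start()
for name in ["put-mapping", "create-index", "reroute"]:
    tasks.put(name)
tasks.put(None)  # sentinel to stop the loop
t.join()
```

Scaling the node count or the slow node's delay in this model makes the queue back up linearly, which mirrors why pending_tasks pile up on large clusters.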

| Index resource sharing risks

An Elasticsearch index consists of multiple shards, and the master dynamically allocates node resources to these shards, so shards of different indexes may be mixed on the same nodes.

Elasticsearch uses Shard Allocation Awareness to divide nodes in a cluster into different rack sets. When assigning an index, you can specify the rack list so that the index is assigned only to the node list corresponding to the specified rack, thereby isolating physical resources.
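For reference, rack-based allocation uses real Elasticsearch settings like the following (the rack and index names are illustrative): each node declares a rack attribute in elasticsearch.yml, the cluster is told to balance across that attribute, and an index can be pinned to a rack list.

```python
# Each node is started with a rack attribute, e.g. in elasticsearch.yml:
#   node.attr.rack_id: rack_one

# Cluster-level setting (PUT _cluster/settings): make the master spread
# shard copies across different rack_id values.
cluster_settings = {
    "persistent": {
        "cluster.routing.allocation.awareness.attributes": "rack_id"
    }
}

# Index-level setting (PUT my_index/_settings): restrict this index's
# shards to nodes whose rack_id is in the given list, isolating it
# physically from indexes on other racks.
index_settings = {
    "settings": {
        "index.routing.allocation.include.rack_id": "rack_one,rack_two"
    }
}
```

Sending these two bodies to the respective APIs is all the isolation mechanism requires; the operational difficulty, as noted below, is deciding which indexes deserve dedicated racks.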

However, in practice, many small indexes end up mixed on the same nodes because each occupies few resources. A surge in the query or write volume of one such index then affects the stability of the other indexes on those nodes, and a node failure can affect the stability of the entire cluster.

Master and ClientNode resources are shared across the entire cluster. The master risks have already been discussed; GC, jitter, and exceptions on shared ClientNodes affect all indexes in the cluster.

| Large differences between business scenarios

The business scenarios for Elasticsearch vary considerably.

  • For core online entry search, indexes are generally small after being split by city, and data is either updated offline or written at very low TPS. For example, map POI data is updated offline, and the write volume for takeout merchants and dishes is also very small. However, query QPS is very high, and queries have strict requirements on average RT and jitter.

  • For the log retrieval scenario, the real-time write volume is particularly large, with some indexes exceeding 1 million TPS. This scenario has high requirements on throughput, but not on query QPS or RT.

  • For binlog data retrieval, the write volume is much smaller than for logs, but the requirements on query complexity, QPS, and RT are high.

  • In monitoring and analysis scenarios, aggregation queries dominate, putting high memory pressure on Elasticsearch and easily causing node jitter and GC.

These scenarios have different stability and performance requirements. An Elasticsearch cluster cannot meet all the requirements even if various optimization methods are used. The best method is to divide Elasticsearch clusters based on service scenarios.

Multi-cluster Challenge

It was precisely because the single cluster posed such great stability risks that we began to plan a multi-cluster architecture. In designing the multi-cluster solution, we wanted the change to be completely invisible to the business side.

The Sink service can load data from different topics into different Elasticsearch clusters. Queries continue through the Gateway service, and the business side still passes index names as before without being aware of index distribution within the platform. The distribution details of all indexes in different clusters are masked by the Gateway service.

The biggest challenge of the whole transformation lies in the compatibility of query methods. Elasticsearch has a very flexible way of querying indexes and supports * as wildcard matching. Such an index query may query multiple indexes, such as the following three indexes:

  • index_a

  • index_b

  • index_c

A query for index* hits index_a, index_b, and index_c at the same time. The Elasticsearch implementation is simple: whether the index is named exactly or matched fuzzily, the query is resolved to a shard list, each shard is queried, and the results from all shards are merged.

This approach is problematic in a multi-cluster scenario: if index_a is in cluster A, index_b in cluster B, and index_c in cluster C, a query for index* cannot be executed on any single cluster.
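A sketch of how a gateway can resolve such a wildcard across clusters (the mapping table and function are hypothetical, shown only to make the fan-out concrete): expand the pattern against the platform's index-to-cluster map, then group the matches by cluster so each cluster receives only its own index list.

```python
import fnmatch

# Hypothetical platform metadata: which cluster holds each index.
INDEX_TO_CLUSTER = {"index_a": "A", "index_b": "B", "index_c": "C"}

def resolve(pattern):
    """Expand a wildcard index pattern into {cluster: [matching indexes]}."""
    routes = {}
    for index, cluster in INDEX_TO_CLUSTER.items():
        if fnmatch.fnmatch(index, pattern):
            routes.setdefault(cluster, []).append(index)
    return routes

# "index*" fans out to three clusters; the caller queries each cluster
# with its own index list and merges the responses.
routes = resolve("index*")
```

An exact index name degenerates to a single-cluster route, so specific and fuzzy queries go through the same code path.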

Tribenode introduction

After investigation, we found that the Elasticsearch tribenode can meet the multi-cluster query requirement. The tribenode implementation is very clever: the org.elasticsearch.tribe package contains only three files, with TribeService as the core class. The core principle of the tribenode is to merge each cluster's ClusterState object into one common ClusterState object containing the index, shard, and node distribution tables. Since Elasticsearch's working logic is driven by ClusterState metadata, the tribenode then looks like a ClientNode that can see all indexes.

The tribenode is configured with the addresses of multiple Elasticsearch clusters and joins each of them as a ClientNode, so each cluster simply sees one extra ClientNode. Through this ClientNode role the tribenode obtains each cluster's ClusterState and registers a listener for ClusterState changes. It merges the ClusterState information from all clusters into a single ClusterState object used to serve external access. Apart from registering listeners and merging ClusterState, all of the tribenode's logic reuses the ClientNode code.
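The merge principle can be sketched as follows (a simplified stand-in for TribeService, with each ClusterState reduced to plain dicts of indexes and nodes):

```python
def merge_cluster_states(states):
    """Fold each cluster's ClusterState (here just index -> shard-count and
    node maps) into one combined view, so the tribenode appears to be a
    client node that can see every index."""
    merged = {"indexes": {}, "nodes": {}}
    for cluster, state in states.items():
        for index, shards in state["indexes"].items():
            # Assumes index names are unique across clusters (the platform
            # enforces this, as discussed below).
            merged["indexes"][index] = {"cluster": cluster, "shards": shards}
        merged["nodes"].update(state["nodes"])
    return merged

states = {
    "A": {"indexes": {"index_a": 5}, "nodes": {"n1": "10.0.0.1"}},
    "B": {"indexes": {"index_b": 3}, "nodes": {"n2": "10.0.0.2"}},
}
merged = merge_cluster_states(states)
```

In the real TribeService the merge is re-run on every ClusterState change event, which is why duplicate index names across clusters force it to pick a "prefer" rule, as described below.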

You can see the benefits of tribenode:

  • It can meet the requirements of multiple cluster access and is transparent for external use.

  • The implementation is simple, elegant and reliable.

Tribenode also has some drawbacks:

  • The tribenode must join each Elasticsearch cluster as a ClientNode, and the master's change tasks must wait for the tribenode's response before proceeding, which may affect the stability of the original clusters.

  • The tribenode does not persist the ClusterState object and must fetch metadata from every Elasticsearch cluster on restart. While metadata is still being fetched, the tribenode already serves traffic, so queries fail to reach indexes in clusters that have not yet been initialized. If the tribenode connects to too many clusters, initialization becomes slow. To work around this defect, when our platform restarts a tribenode cluster, all Gateway traffic to that cluster is first cut over to a backup tribenode cluster.

  • If multiple clusters have the same index name, the tribenode can only apply one prefer rule: random, discard, or prefer a specified cluster. This can lead to unexpected query results. Didi Elasticsearch avoids duplicate index names across the clusters a tribenode connects to through unified index management.

Because of these flaws in the tribenode, Elasticsearch introduced the Cross Cluster Search design in later versions, in which a cluster does not join other clusters as a node but simply proxies requests. We are still evaluating the Cross Cluster solution and will not go into detail here.

Multi-cluster architecture topology

After the final transformation, our cluster architecture topology is as follows:

Elasticsearch clusters are divided into four types according to application scenario: log clusters, binlog clusters, document-data clusters, and independent clusters. A single cluster contains at most 100 DataNodes. We use Didi Cloud to automate cluster deployment and elastic scaling, which makes horizontal expansion by adding clusters very convenient.

Tribenodes are deployed as multiple groups of clusters, mainly to mitigate the stability problems of a single tribenode cluster.

The Gateway connects to both the tribenode clusters and the Elasticsearch clusters. Based on the application and its index list, it determines which cluster to access and proxies the request there, so an application can access indexes across multiple clusters.

The Admin service manages all Elasticsearch clusters and the mapping between indexes and clusters. A range of features have been adapted for multiple clusters.

The Sink service has been separated from the Elasticsearch platform to form the DSink data delivery platform. DSink Manager is responsible for managing the DSink nodes: it obtains index metadata from the Elasticsearch Admin service and distributes it to the corresponding DSink nodes.

Summary of multi-cluster architecture practice

| Benefits of the multi-cluster architecture

The multi-cluster architecture brings the Elasticsearch platform the following benefits:

  • Platform isolation is raised from the physical node level to the Elasticsearch cluster level, and independent Elasticsearch clusters can be provided for core online applications.

  • Data of different types is divided into clusters to avoid mutual impact and reduce the impact of faults, greatly improving platform stability.

  • The platform's scalability is improved: capacity can be expanded quickly by adding clusters.

  • The multi-cluster architecture remains imperceptible to the business side: to a business user the platform looks like an infinitely large Elasticsearch cluster, with no awareness of how indexes are actually distributed across clusters.

| Practical experience from the multi-cluster architecture

The multi-cluster architecture of Didi Elasticsearch has been evolving for a year and a half, and it has brought some challenges of its own.

  • Tribenode Stability Challenges:

    • As the number of clusters grows, the tribenode shortcomings mentioned earlier become more apparent, such as longer initialization times. Our strategy is to deploy multiple groups of tribenode clusters: some are fully connected to every cluster for disaster recovery, while others connect only to the core clusters for the more important cross-cluster access scenarios.

    • The tribenode's merged ClusterState contains a huge number of indexes and shards, and parts of Elasticsearch's search logic become slow in that situation. When a search request arrives on a Netty worker thread, Elasticsearch does not limit how many shards a single query may touch; the logic that fuzzily matches indexes to shards and fans out requests to each shard can take a long time, sometimes 1–2 s, which blocks other requests on the same Netty worker and causes response-time spikes for part of the traffic. We solved this by optimizing the index- and shard-expansion logic in the tribenode's search path.

  • Challenges in unifying the configurations and versions of multiple clusters:

    • When there was only one cluster, the platform maintained a single cluster configuration and version. As the number of clusters grows, the cluster settings of different clusters may diverge, and these differences can cause uneven load among clusters or inconsistent recovery speeds. Each cluster also has its own basic index template configuration, which may differ as well. We are still working on this issue and plan to split the Admin service into an index management service and a cluster management service, with the latter focusing on cluster version, configuration, deployment, capacity expansion, and monitoring.

    • Some of the source-level optimizations we made to Elasticsearch went live in clusters one after another, which caused version confusion between clusters. Our solution is to add internal version numbers to Elasticsearch and Lucene, release Elasticsearch updates through the company's internal distribution system, and have the cluster management service manage cluster versions.

  • Challenges to capacity balancing across multiple clusters:

    • After the multi-cluster architecture was adopted, resources could become unevenly allocated across Elasticsearch clusters. For example, some indexes grow rapidly and exhaust a cluster's resources, while shrinking indexes leave another cluster idle, so the need to migrate indexes across clusters arises. We solved the cross-cluster index migration problem by adding a version number to each index; we will introduce this scheme in detail later.

    • The Didi Elasticsearch platform implements automatic index capacity planning to balance capacity between clusters. The platform plans index capacity dynamically: when a cluster's planned capacity is insufficient, some indexes can be migrated to idle clusters, and new index access requests are preferentially served from clusters with idle resources. We will describe how the platform automatically plans index capacity in a separate article.
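A minimal sketch of such a placement decision (the function and the capacity figures are illustrative, not the platform's actual planner): place a new or migrating index on the cluster with the most free capacity that can hold it.

```python
def pick_cluster(clusters, index_gb):
    """clusters: name -> (total_gb, used_gb).
    Return the cluster with the most free capacity that fits the index."""
    candidates = [(total - used, name)
                  for name, (total, used) in clusters.items()
                  if total - used >= index_gb]
    if not candidates:
        raise RuntimeError("no cluster has enough free capacity")
    # max() on (free, name) tuples picks the largest free capacity.
    return max(candidates)[1]

# es-log-1 has only 100 GB free, so a 200 GB index lands on es-log-2.
clusters = {"es-log-1": (1000, 900), "es-log-2": (1000, 400)}
target = pick_cluster(clusters, index_gb=200)
```

A production planner would also weigh write TPS, shard counts, and scenario type, but the core greedy choice, steering growth toward idle clusters, is the same.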

Conclusion

Didi’s multi-cluster architecture was originally designed to solve the bottleneck of Elasticsearch’s single-cluster architecture. To support it, many of the platform's components had to account for multi-cluster scenarios, which added some complexity to the platform architecture. However, the stability and isolation benefits of the multi-cluster Elasticsearch architecture far outweigh that complexity. With the move to multiple clusters, we were able to withstand the Elasticsearch platform’s explosive growth, a more than fivefold increase in scale, and the multi-cluster architecture supported the rapid growth of the business.

Author: Wei Zijun

Didi Technology (ID: DIDI_tech)
