Thoughts and practice on operating and maintaining large-scale ES clusters

Elasticsearch (ES) is a distributed, RESTful search and data analytics engine, and it leads the open-source search field by a wide margin. With its rapid development in recent years, ES has evolved from a pure search engine into a comprehensive data product. It is widely used for log monitoring, full-text retrieval, database acceleration, big data analysis, and many other purposes.

▲ Database engine ranking. Data source: db-engines.com/en/ranking ▲

JD Zhilian Cloud ES serves the public cloud, the private cloud, and a large number of ES clusters inside JD Group. JD Mall, JD Logistics, JD Finance, and other business units all have heavy demand for ES services. In total, these clusters run on hundreds of thousands of cores and tens of thousands of nodes, and hold trillions of documents.

How can we use the advantages of a cloud vendor to operate and maintain ES at this scale efficiently, reliably, and stably? That is the problem we need to think about and solve. Below, I introduce our thinking and practice from several angles.

I. Infrastructure and service orchestration

The biggest advantage cloud vendors have over self-built systems is elasticity. Elasticity gives users more than ease of use; it also reduces costs. Relying on the service orchestration capability of Yunjian, JD Zhilian Cloud ES can deploy clusters quickly and flexibly, and supports elastic capabilities such as horizontal scaling, vertical scaling, and storage expansion. JD Zhilian Cloud ES also offers several storage options, including cloud disks, local disks, and object storage, to meet users' needs in different scenarios.

Service orchestration also provides fault self-healing. If a physical server fails, the ES nodes on it are automatically failed over to another server. If an ES node fails, the system automatically restarts it. If the fault persists, the system migrates the ES node to another physical machine.
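The healing rules described above can be sketched as a small decision function. This is only an illustration: the `NodeStatus` fields, the `Action` names, and the `max_restarts` parameter are hypothetical and do not come from the actual orchestration system.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    RESTART_NODE = "restart_node"
    MIGRATE_NODE = "migrate_node"


@dataclass
class NodeStatus:
    host_healthy: bool     # is the physical server healthy?
    node_healthy: bool     # is the ES node on it responding?
    restart_attempts: int  # restarts already tried for the current fault


def plan_healing(status: NodeStatus, max_restarts: int = 1) -> Action:
    """Pick the next healing step for one ES node (assumed policy)."""
    if not status.host_healthy:
        # A faulty server cannot be fixed by restarting the process on it.
        return Action.MIGRATE_NODE
    if status.node_healthy:
        return Action.NONE
    if status.restart_attempts < max_restarts:
        return Action.RESTART_NODE
    # Restarting did not clear the fault, so move the node elsewhere.
    return Action.MIGRATE_NODE
```

Keeping the decision separate from the actions that carry it out makes the policy easy to test without touching a real cluster.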

II. Operations and maintenance

▲ JD Zhilian Cloud ES operations and maintenance capabilities ▲

1. Monitoring and alerting

At this scale, an O&M monitoring system with a global view and a rich set of metrics is required to keep operations visible. Building on our accumulated operational experience, the JD Zhilian Cloud monitoring system can detect abnormal clusters in real time and analyze the causes of problems from the monitoring metrics.

2. Multi-version support

ES has a large number of versions. For historical reasons, many old systems are difficult to upgrade, and different cloud users require different versions, so many versions must be supported side by side. Supported versions include 5.x, 6.x, and 7.x, in addition to older releases. All versions are managed by the same orchestration and management system, so a new ES version can be brought online quickly.

3. Performance optimization

ES works out of the box, but it exposes a large number of configurable settings. Different business scenarios and requirements call for different cluster tuning, which makes it hard for non-specialist users to use ES well. Here are some common problems:

4. Data migration

5. Index lifecycle management

Index lifecycle management is a feature that users rely on often. Indexes are created on a daily or monthly basis, and expired indexes are deleted after a set retention period. Indexes for specified periods (for example, peak periods) are retained permanently. ES has supported index lifecycle management in X-Pack since version 6.6, initially as a beta feature, but earlier versions do not support it. JD Zhilian Cloud ES extends this function to all ES versions and provides a UI for configuring it, which is simpler and more practical than configuring it natively through Kibana or the API.
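The retention rule described above can be sketched as follows. This is a minimal illustration, not the product's implementation: the `logs-` prefix, the `%Y.%m.%d` date suffix, and the `keep_forever` parameter are assumptions made for the example.

```python
from datetime import date, datetime, timedelta
from typing import AbstractSet, Iterable, List


def expired_indexes(
    index_names: Iterable[str],
    today: date,
    retention_days: int,
    keep_forever: AbstractSet[str] = frozenset(),
    prefix: str = "logs-",
    date_format: str = "%Y.%m.%d",
) -> List[str]:
    """Return the dated indexes that are older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        # Skip permanently retained indexes and indexes with another prefix.
        if name in keep_forever or not name.startswith(prefix):
            continue
        try:
            created = datetime.strptime(name[len(prefix):], date_format).date()
        except ValueError:
            continue  # not a dated index, so leave it alone
        if created < cutoff:
            expired.append(name)
    return sorted(expired)
```

A scheduler could call this once a day and pass each returned name to the ES delete index API. Because the function only computes the list, it can be dry-run safely before any deletion happens.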

6. Exploring intelligent operations

We are exploring intelligent operations by turning our O&M knowledge and experience into product features for users. Based on each user's business scenario and the various monitoring metrics, we report cluster health and suggest solutions. Examples of the problems detected include uneven node load, unreasonable shard settings, high heap memory usage, long GC times, a high proportion of fielddata, high cluster load, too many shards, write or search thread pool queues that back up or reject requests, and abnormal fluctuations in cluster read and write traffic.
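A rule-based check over monitoring metrics is one simple way to produce findings like these. The sketch below is hypothetical: the metric names, the thresholds, and the `ratio` used for load imbalance are illustrative assumptions, not values from the product.

```python
from typing import Dict, List

# Hypothetical thresholds, chosen only for illustration.
RULES = [
    ("heap_used_percent", 85.0, "heap usage is too high"),
    ("gc_time_percent", 10.0, "GC time is too long"),
    ("shards_per_node", 1000.0, "too many shards per node"),
    ("write_rejected", 0.0, "write thread pool is rejecting requests"),
    ("search_rejected", 0.0, "search thread pool is rejecting requests"),
]


def diagnose(metrics: Dict[str, float]) -> List[str]:
    """Return one finding for every metric that exceeds its limit."""
    findings = []
    for name, limit, message in RULES:
        value = metrics.get(name)
        if value is not None and value > limit:
            findings.append(f"{message} ({name}={value}, limit={limit})")
    return findings


def load_is_uneven(node_loads: List[float], ratio: float = 2.0) -> bool:
    """Flag a cluster whose busiest node exceeds `ratio` times the mean load."""
    if not node_loads:
        return False
    mean = sum(node_loads) / len(node_loads)
    return mean > 0 and max(node_loads) > ratio * mean
```

Each finding carries the metric and the limit it violated, so a suggestion shown to the user can be traced back to the data behind it.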

7. Autonomous system driven by monitoring metrics

Automated, or autonomous, operations are the ultimate goal. As intelligent operations capabilities improve, the autonomous system can make decisions and execute them on its own based on monitoring metrics, without manual intervention.

III. Optimization for application scenarios

ES is mainly applied to log retrieval, database acceleration, monitoring metrics, and data analysis. Each scenario has its own characteristics and performance requirements, so each needs its own optimization scheme.

1. Log retrieval scenario: heavy concurrent writes, low real-time requirements, large storage volume, and data with hot and cold properties. In this scenario, the index write buffer can be enlarged to improve write performance, the refresh interval can be increased to reduce the number of segments, the translog settings can be tuned, and HDD storage can be used to reduce storage cost.

2. Database acceleration scenario: for structured queries that have no transactional requirements and must search massive amounts of data, ES is a good replacement for a relational database. JD's main applications include products, coupons, orders, reconciliation, and logistics. This scenario is latency-sensitive and requires high performance and high availability.

3. Monitoring metrics scenario: heavy concurrent writes, time-series characteristics, no need for high availability, and data with hot and cold properties.

4. Data analysis scenario: data is analyzed along multiple dimensions with aggregation queries. Write volume is large and query volume is small, but the queries are aggregations. JD's main applications include order transaction analysis and user profiling.
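For the log retrieval scenario in item 1, the refresh interval and the index write buffer map to two real ES settings: `index.refresh_interval` (a per-index setting) and `indices.memory.index_buffer_size` (a node-level setting in elasticsearch.yml). The sketch below only builds the settings; the values `30s` and `20%` are illustrative assumptions, not recommendations from the article.

```python
import json


def log_index_settings(refresh_interval: str = "30s") -> dict:
    """Build a request body for the update index settings API."""
    # A longer refresh interval creates fewer, larger segments.
    return {"index": {"refresh_interval": refresh_interval}}


# Node-level setting: it belongs in elasticsearch.yml, not in an index request.
# The value is an assumed example.
NODE_SETTINGS = {"indices.memory.index_buffer_size": "20%"}


if __name__ == "__main__":
    print(json.dumps(log_index_settings(), indent=2))
    for key, value in NODE_SETTINGS.items():
        print(f"{key}: {value}")
```

The body returned by `log_index_settings` would be sent with `PUT /<index>/_settings`. Suitable values depend on how quickly new log lines must become searchable.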

IV. Servitization

Specialized work should be left to specialists. Hosted ES is the first step: it solves the efficiency and cost problems of users building and managing clusters themselves. However, users still need to understand how ES works and how to tune it. Index configuration, cluster configuration, shard settings, and similar knowledge have no direct relation to the business, so the bar to using ES well remains high. The second step for hosted ES is therefore servitization. Users only state their business requirements and no longer care about the parameters of the ES cluster behind the service. For example, users state the write and query performance they expect for their business scenario. Users only need to define the index mapping and do not need to care about index settings or the number of shards. Everything from cluster sizing to sensible index settings is configured and optimized automatically in the background.
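As a rough illustration of deriving index settings from business requirements, the sketch below turns an expected data size and an availability requirement into `number_of_shards` and `number_of_replicas`. The function name, the 30 GB target shard size, and the replica rule are assumptions for the example; the article does not describe how the real service chooses these values.

```python
import math


def plan_index(
    expected_size_gb: float,
    high_availability: bool,
    target_shard_size_gb: float = 30.0,  # assumed sizing heuristic
) -> dict:
    """Build a create-index settings body from coarse business requirements."""
    if expected_size_gb <= 0 or target_shard_size_gb <= 0:
        raise ValueError("sizes must be positive")
    # Enough primary shards that none exceeds the target size.
    shards = math.ceil(expected_size_gb / target_shard_size_gb)
    # One replica when the business needs high availability, none otherwise.
    replicas = 1 if high_availability else 0
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }
```

For example, `plan_index(100, high_availability=True)` yields 4 primary shards with 1 replica each. The user's own mapping would be merged into the same request body.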

V. Thoughts on the future

Providing users with simple and reliable products is our ultimate goal. We will therefore improve our products in two ways. First, from the user's perspective, we will offer the product as a service: this brings it closer to how users actually work, lowers the barrier to entry, and gives users simpler and more practical ways to use ES. Second, from the operations perspective, we will strengthen autonomous operations in the background, including intelligent detection and repair, fault self-healing, automatic elasticity, and managed data and configuration, so that managed products become managed services.