Author | Xing-yu Chen (Yu Mu)   Source | Alibaba Cloud Native official account

Background

Kubernetes uses etcd to store its core metadata. After years of development, and especially with the rapid rise of cloud native over the past two years, Kubernetes has become widely recognized and widely used. As the number of clusters on Alibaba's internal container platform ASI and on the public cloud ACK service grew rapidly, the number of underlying etcd clusters grew explosively as well, from a few dozen at the start to several thousand today. These clusters are distributed around the world, serve the Kubernetes clusters and other products built on top of them, and have more than 10,000 users.

In recent years, Alibaba Cloud's cloud native etcd service has changed dramatically. This article shares the problems the etcd service ran into as business volume grew at scale and how we solved them, in the hope of giving readers useful experience in using, managing, and operating etcd.

The article is divided into three parts:

  • Etcd cluster cost optimization and utilization improvement
  • Etcd management and operations efficiency improvement
  • Etcd kernel architecture upgrade

Etcd cluster cost optimization and utilization improvement

In recent years, the number of etcd clusters has exploded. The way we run them has evolved from 1.0 to 2.0 to 3.0, as shown below:

1.0 Physical machine era

At first we managed relatively few etcd clusters, and we used Docker to run the etcd container directly on the host. This is the 1.0 mode in the figure.

2.0 Cloud era

Running etcd in 1.0 mode is very simple, but it shares the usual problems of running software on physical machines, such as low efficiency. As Alibaba moved to the cloud, etcd also switched its runtime environment entirely to cloud ECS instances, and its storage to cloud SSD or ESSD disks.

Overall, the move to the cloud brought clear advantages. With the elasticity of ECS and cloud disks in Alibaba Cloud's IaaS layer, an etcd cluster can scale vertically and horizontally quickly, and failure migration is much easier than in 1.0. Taking a cluster upgrade as an example, the total upgrade time dropped from half an hour at first to 10 minutes now. This setup handles both business peaks and ordinary daily load, and it has stably carried Alibaba's internal Double 11 peak and the Spring Festival promotions of several external public cloud customers.

3.0 Large-scale cloud era

As the number of etcd clusters keeps growing, the ECS and cloud disk costs of running them keep rising. Etcd has become one of the most expensive parts of the container service, and the cost of running it is a problem we have to face and solve.

With dedicated ECS instances and cloud disks in 2.0 mode, we found that etcd resource utilization was relatively low and that many resources were wasted. We upgraded the 2.0 mode by co-locating clusters on shared resources to reduce operating costs. After co-location, however, we ran into risks such as contention for computing resources and reduced cluster stability. Etcd clusters frequently switched leaders, which disrupted the software built on top of etcd; for example, Kubernetes controllers failed over, which affected user services.

To solve these problems, we first divided etcd clusters into types according to their service quality and SLO, similar to the BestEffort, Burstable, and Guaranteed classes in Kubernetes, and managed each type in its own resource pool. Because the co-located deployment places clusters in a highly scattered, random way, and because there are many cluster users, hot requests to etcd are spread out rather than concentrated, and stability problems such as the number of leader switches dropped sharply. We raised resource utilization while preserving stability, and cost fell significantly.
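The pooling and scattering described above can be sketched as follows. This is a minimal illustration, not the scheduler Alibaba uses: the pool layout, host names, and the `place_cluster` helper are all hypothetical, and a real placement would also weigh host load and failure domains.

```python
import random

def place_cluster(qos_class, pools, members=3, rng=random):
    """Pick `members` distinct hosts at random from the pool of one class.

    `pools` maps a service class name (hypothetical, modeled on the
    Kubernetes QoS class names) to the list of hosts reserved for it.
    Sampling without replacement keeps the members of one etcd cluster
    on different hosts, and the randomness scatters clusters so that
    hot tenants are unlikely to pile up on the same machines.
    """
    hosts = pools[qos_class]
    if len(hosts) < members:
        raise ValueError(f"pool {qos_class!r} has fewer than {members} hosts")
    return rng.sample(hosts, members)

# Hypothetical resource pools, one per service class.
pools = {
    "Guaranteed": ["g-1", "g-2", "g-3", "g-4", "g-5"],
    "Burstable": ["b-1", "b-2", "b-3", "b-4"],
    "BestEffort": ["e-1", "e-2", "e-3"],
}
```

Calling `place_cluster("Burstable", pools)` returns three different hosts from the Burstable pool only, so clusters of different classes never compete for the same machines.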

Improved operation and maintenance efficiency

In the early days, we operated and managed etcd clusters in a relatively simple way. Shell scripts covered essentially the whole etcd cluster life cycle: cluster creation, deletion, and migration were all done by script. As the number of clusters exploded, this workshop-style model became less and less suitable. We ran into problems such as slow provisioning of etcd clusters, difficulty adapting to the underlying IaaS, and inefficient management of clusters at runtime.

To solve these efficiency problems, we embraced the cloud native ecosystem and used Kubernetes as the base for running etcd. Over several years of development, we built on the open source etcd-operator, adapted it to Alibaba Cloud's underlying IaaS, fixed many bugs in the open source code, standardized etcd management and operations actions, and covered the whole etcd life cycle. We launched a new etcd operations and management console, Alpha. We use Alpha to unify the management of etcd clusters inside Alibaba and of etcd clusters for ACK on the public cloud, which has greatly improved our efficiency in managing and operating etcd. At present, nearly ten thousand clusters are managed with about 0.5 of a person's time, a very high ratio of clusters to staff. The following image shows its management interface.

To take a closer look at Alpha's capabilities, let's start with the diagram below, which depicts the typical life cycle of an etcd cluster: create, run, fail, run, stop, and destroy.
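The life cycle above can be read as a small state machine. The sketch below is only an illustration of that reading: the state names and the set of allowed transitions are assumptions inferred from the sequence in the text, not Alpha's actual model.

```python
from enum import Enum

class State(Enum):
    CREATING = "creating"
    RUNNING = "running"
    FAILED = "failed"
    STOPPED = "stopped"
    DESTROYED = "destroyed"

# Assumed transitions: a failed cluster is repaired back to running,
# and a cluster is stopped before it is destroyed.
ALLOWED = {
    State.CREATING: {State.RUNNING},
    State.RUNNING: {State.FAILED, State.STOPPED},
    State.FAILED: {State.RUNNING},
    State.STOPPED: {State.DESTROYED},
    State.DESTROYED: set(),
}

def advance(current, target):
    """Return `target` if the transition is allowed, else raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Making the transitions explicit is one way for a management tool to reject out-of-order actions, such as destroying a cluster that is still running.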

Alpha covers every stage in the diagram. Its functions fall into two parts:

1. Etcd cluster life cycle management

  • Etcd cluster creation, destruction, stop, upgrade, failure recovery, etc.
  • Etcd cluster status monitoring, including cluster health, member health, request volume, and stored data volume.
  • Etcd anomaly diagnosis, contingency plans, black-box probing, configuration inspection, etc.

2. Etcd data management

Etcd data management includes data migration, backup management and recovery, dirty data cleanup, hot data identification, and so on. This is a distinguishing feature of Alpha; we have seen very little work in this area in open source or other products. The functions we provide are as follows.

1) Etcd data backup and recovery

The two methods are as follows:

  • Traditional cold backup: snapshot data is backed up from the etcd server to Alibaba Cloud OSS or to a local server. If a failure occurs, the data can be restored from the snapshot backup file.

  • Raft Learner hot standby: for newer etcd clusters that support Raft Learner, we can use a Learner as a hot standby node. When a failure occurs, we force the Learner to become a normal node and switch client access to it, which recovers faster than the traditional method. Learners can also be deployed in a different region, which provides geo-redundant active-active capability.
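For the cold backup path, a minimal sketch of a periodic snapshot job might look like the following. It only builds the `etcdctl snapshot save` command line and decides which old files to prune; the file naming scheme, the retention rule, and both helper names are assumptions, and the upload to OSS is left out.

```python
from datetime import datetime, timezone

def snapshot_save_command(endpoint, backup_dir, now):
    """Build an `etcdctl snapshot save` command with a timestamped file name.

    Note: etcdctl 3.2/3.3 needs ETCDCTL_API=3 in the environment for the
    v3 `snapshot` subcommand; the command is not executed here.
    """
    name = now.strftime("etcd-%Y%m%dT%H%M%SZ.db")
    return ["etcdctl", f"--endpoints={endpoint}",
            "snapshot", "save", f"{backup_dir}/{name}"]

def snapshots_to_prune(snapshots, keep):
    """Return the snapshot names to delete, keeping the `keep` newest.

    Relies on the timestamped names sorting in chronological order.
    """
    ordered = sorted(snapshots)
    return ordered[:-keep] if keep > 0 else ordered

# Hypothetical endpoint and directory, for illustration only.
cmd = snapshot_save_command(
    "https://10.0.0.1:2379", "/backup",
    datetime(2021, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
```

The resulting list can be passed to `subprocess.run` by a scheduler; keeping command construction separate from execution makes the job easy to test.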

2) Dirty data cleanup

We can delete garbage key-value pairs by etcd key prefix, which reduces storage pressure on the etcd server.
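Deleting by prefix relies on the fact that etcd treats a prefix as a key range: the range end is the prefix with its last byte incremented. The sketch below shows that computation against an in-memory dictionary standing in for the key space; `delete_prefix` and the sample keys are illustrative only, not Alpha's implementation.

```python
def prefix_range_end(prefix: bytes) -> bytes:
    """Return the exclusive end of the key range covered by `prefix`.

    The last byte below 0xff is incremented and everything after it is
    dropped. If no such byte exists, b"\\x00" means "to the end of the
    key space", matching etcd's range convention.
    """
    end = bytearray(prefix)
    for i in range(len(end) - 1, -1, -1):
        if end[i] < 0xFF:
            end[i] += 1
            return bytes(end[: i + 1])
    return b"\x00"

def delete_prefix(store: dict, prefix: bytes) -> int:
    """Delete every key in [prefix, range_end) and return how many were removed."""
    end = prefix_range_end(prefix)
    doomed = [k for k in store if k >= prefix and (end == b"\x00" or k < end)]
    for k in doomed:
        del store[k]
    return len(doomed)

store = {
    b"/registry/events/a": b"1",
    b"/registry/events/b": b"2",
    b"/registry/pods/x": b"3",
}
removed = delete_prefix(store, b"/registry/events/")
```

Against a live cluster the same effect is available through `etcdctl del --prefix`, so a cleanup tool mainly needs to decide which prefixes count as garbage.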

3) Hotspot data identification

We developed the ability to aggregate hot keys by etcd key prefix and to report db storage usage per key prefix. With this capability, we have helped customers analyze etcd hot keys and resolve etcd misuse on many occasions. It is a necessary capability for large etcd clusters.
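The aggregation described above can be sketched as counting requests and summing value sizes per key prefix. The prefix depth, the input shapes (a list of requested keys and a key-to-size map), and both function names are assumptions made for illustration.

```python
from collections import Counter

def key_prefix(key: str, depth: int = 2) -> str:
    """Return the first `depth` path segments of a key, e.g. /registry/pods."""
    parts = key.strip("/").split("/")
    return "/" + "/".join(parts[:depth])

def aggregate(requested_keys, value_sizes, depth=2):
    """Return (request count per prefix, stored bytes per prefix)."""
    hits = Counter(key_prefix(k, depth) for k in requested_keys)
    usage = Counter()
    for key, size in value_sizes.items():
        usage[key_prefix(key, depth)] += size
    return hits, usage

# Hypothetical request log and value sizes.
requests = [
    "/registry/pods/default/a",
    "/registry/pods/default/b",
    "/registry/pods/kube-system/c",
    "/registry/events/default/e1",
]
sizes = {
    "/registry/pods/default/a": 2048,
    "/registry/events/default/e1": 512,
    "/registry/events/default/e2": 512,
}
hits, usage = aggregate(requests, sizes)
```

Sorting `hits` with `most_common()` surfaces the hottest prefixes, while `usage` shows which prefixes occupy the most storage; the two rankings often differ.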

4) Data migration, in two ways

  • Snapshot mode: back up the data with an etcd snapshot, then restore it in the target cluster to complete the migration.
  • Raft Learner mode: use Raft Learner to quickly split off a new cluster from the original one to complete the migration.

5) Horizontal split of data

When a cluster stores a very large amount of data, we support horizontal splitting, which stores different customers' data in different etcd clusters. We used this feature in Alibaba's internal ASI clusters to support more than 10,000 nodes.
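One simple way to picture horizontal splitting is a routing table that maps key prefixes to clusters, with the longest matching prefix winning. This is a sketch under that assumption; the article does not describe how the split is actually keyed, and the prefixes and endpoints below are hypothetical.

```python
def route(key, table, default):
    """Return the cluster endpoint for `key` using longest-prefix match."""
    best = ""
    for prefix in table:
        if key.startswith(prefix) and len(prefix) > len(best):
            best = prefix
    return table[best] if best else default

# Hypothetical split: events and one large tenant get their own clusters.
table = {
    "/registry/events/": "https://etcd-events:2379",
    "/tenants/big-customer/": "https://etcd-big:2379",
}
DEFAULT = "https://etcd-main:2379"
```

Here `route("/registry/events/default/e1", table, DEFAULT)` goes to the events cluster, while any key without a matching prefix stays on the main cluster.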

To sum up, we use Kubernetes as the base for running etcd clusters and built Alpha, a new etcd management software based on a modified open source operator, to cover management of the whole etcd life cycle. One piece of software manages all etcd clusters, and etcd management efficiency has improved significantly.

Etcd kernel architecture upgrade

Etcd is a very important piece of software in the cloud native community. Over the years, the community has fixed many bugs and improved the kernel's performance and storage capacity. However, open source software is like an unfurnished building: some problems still appear in production. Alibaba needs larger data storage scale and higher performance than the upstream version provides. In addition, etcd has weak QoS control for multi-tenant sharing, which does not fit our usage scenario.

In the early days we used open source etcd 3.2/3.3, and later we added stability and security enhancements to meet the requirements of some of our scenarios. We now use an Alibaba-internal version. Its main differences are as follows:

1. Adaptive compaction of historical data

Etcd stores historical values of user data, but it cannot keep all of them for a long time, or storage space would run out. Etcd therefore uses a compaction mechanism to periodically clean up historical values. When a cluster is large and holds a lot of data, each cleanup can have a significant impact on runtime performance, similar to a full GC. Our adaptive compaction adjusts the timing of each compaction to the volume of business requests, so that it avoids peak usage and interferes less with the service.
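The idea of timing compaction around request volume can be sketched with a simple rule: compact when current traffic is well below the recent peak, but never wait longer than a maximum interval. The thresholds, the sliding window of QPS samples, and the `should_compact` function are assumptions for illustration; the internal implementation is not described in the article.

```python
def should_compact(qps_window, since_last_s,
                   min_interval_s=300, max_interval_s=3600, quiet_ratio=0.5):
    """Decide whether to run compaction now.

    qps_window   : recent requests-per-second samples, newest last
    since_last_s : seconds since the previous compaction
    """
    if since_last_s < min_interval_s:
        return False   # too soon, avoid compacting back to back
    if since_last_s >= max_interval_s:
        return True    # overdue, compact even under load
    current, peak = qps_window[-1], max(qps_window)
    return current <= quiet_ratio * peak   # only compact in a quiet period
```

The maximum interval bounds how much history can accumulate, which keeps storage growth under control even when traffic never quiets down.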

2. Horizontal scaling with read-only nodes based on Raft Learner

Raft Learner is a special role in the Raft protocol. A Learner does not take part in leader election, but it receives the cluster's latest data from the leader. It can therefore serve as a read-only node, letting the cluster scale out horizontally and handle more read requests.
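A client-side view of this setup is a router that sends writes to voting members and spreads reads over the Learners. The sketch below assumes Learners that serve reads, as in the internal version described above; the class name, operation names, and endpoints are hypothetical.

```python
import itertools

WRITE_OPS = {"put", "delete", "txn"}

class ReadWriteRouter:
    """Round-robin reads over learners and writes over voting members."""

    def __init__(self, voters, learners):
        self._voters = itertools.cycle(voters)
        # Fall back to voters when no learner is configured.
        self._readers = itertools.cycle(learners or voters)

    def endpoint_for(self, op):
        if op in WRITE_OPS:
            return next(self._voters)
        return next(self._readers)

router = ReadWriteRouter(["v1:2379", "v2:2379", "v3:2379"],
                         ["l1:2379", "l2:2379"])
```

Because a Learner follows the leader's log, a read served by it may lag slightly behind the leader, so this kind of routing suits reads that tolerate a small delay.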

3. Hot spare node based on Raft Learner

Besides serving as a read-only node, a Raft Learner can also act as a hot spare node for the cluster. We currently use hot spare nodes for geo-redundant active-active deployment to keep the cluster highly available.

4. Etcd cluster QoS capability

On the public cloud, a large number of users share etcd clusters. In this multi-tenant scenario, we need to ensure both fair use of etcd storage resources across tenants and stability: misuse by one tenant must not bring down the whole cluster or affect other tenants. We therefore developed QoS rate limiting, which limits each tenant's read and write traffic at runtime as well as the storage space its data occupies.

Conclusion
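The two limits above, request rate and stored bytes, can be sketched per tenant with a token bucket plus a storage quota. This is a generic illustration of the technique, not the internal QoS implementation; the class, its parameters, and the choice of a token bucket are all assumptions. Time is passed in explicitly so the behavior is deterministic.

```python
class TenantLimiter:
    """Token-bucket rate limit plus a static storage quota for one tenant."""

    def __init__(self, rate_per_s, burst, quota_bytes):
        self.rate = rate_per_s
        self.burst = burst
        self.quota = quota_bytes
        self.tokens = float(burst)
        self.used = 0
        self.last = 0.0

    def _take(self, now):
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def allow_read(self, now):
        return self._take(now)

    def allow_write(self, size, now):
        if self.used + size > self.quota:
            return False          # tenant would exceed its storage quota
        if not self._take(now):
            return False          # tenant is over its request rate
        self.used += size
        return True
```

Keeping one limiter per tenant means a tenant that exhausts its own bucket or quota is rejected without consuming capacity that other tenants depend on.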

Alibaba Cloud has used etcd as the core data storage system of its container service for nearly 4 years. Along the way we have accumulated a lot of experience in operating and managing etcd, as well as best practices for using it. This article shares some of our practices in cost reduction, efficiency improvement, and kernel optimization.

In recent years, riding the cloud native wave, etcd has developed faster than ever, and it officially graduated as a CNCF project last year. Alibaba Cloud has contributed important features and bug fixes to the etcd community. We take an active part in the community, learn from it, and give back to it. We expect the number of etcd clusters to keep growing rapidly, and we will keep investing in cost reduction, efficiency, stability, and reliability. If you are interested in joining us, please contact [email protected].