Background and challenges

With the rapid growth of Tencent's internal businesses and public cloud users, the types of container services we provide (TKE managed and independent clusters, EKS elastic clusters, edge computing clusters, Mesh service mesh, Serverless Knative) are also becoming richer and richer. The core behind all of these container service types is K8S, and the core storage of K8S, ETCD, is uniformly managed by our ETCD platform, which is itself built on K8S. On top of this platform, we currently manage ETCD clusters at the thousand level and support K8S clusters at the ten-thousand level.

How can we effectively guarantee the stability of ETCD clusters at the scale of ten-thousand-level K8S clusters?

Where do the stability risks of an ETCD cluster come from?

We conducted a stability risk model analysis based on business scenarios, historical legacy problems, production operation experience, and so on. The risks mainly come from the unreasonable design of the old TKE ETCD architecture, ETCD stability issues, ETCD performance that fails to meet business requirements in some scenarios, insufficient test case coverage, lax change management, incomplete monitoring coverage, the lack of automated inspection to discover hidden risks, and data safety under extreme disaster failures.

The ETCD platform described above has, to some extent, solved the scalability, operability, observability and data security problems of the ETCD clusters backing all the container services we manage, through architecture design, change management, monitoring and inspection, and data migration and backup. Therefore, this article focuses on the ETCD kernel stability and performance challenges we face in the ten-thousand-level K8S cluster scenario, such as:

• Data inconsistency

• Memory leaks

• Deadlocks

• Process crashes

• Large packet requests causing ETCD OOM and packet loss

• Slow startup in scenarios with large data volumes

• Poor performance of the authentication interface and of the interfaces that query the total number of keys or a limited number of records

This article briefly describes how we identified, analyzed, reproduced, and addressed the above issues and challenges, and what lessons we learned and applied to the storage stability of our various container services.

At the same time, we contribute all our solutions back to the ETCD open source community. Up to now, all of our 30+ PRs have been merged by the community, and the Tencent Cloud TKE ETCD team was one of the most active contributors to the ETCD community in the first half of 2020. We will continue to contribute to the development of ETCD. During this process, we would like to thank the community maintainers from AWS, Google, Alibaba and others for their support and help.

2. Analysis of stability optimization cases

From GitLab accidentally deleting part of its primary database and losing data, to GitHub's data inconsistency causing a 24-hour outage, to the hours-long outage of AWS S3, the so-called "unsinkable aircraft carrier", all of these incidents, without exception, involved storage services. Stability is crucial to a storage service and even to a company's reputation; it can make the difference between life and death. In the stability optimization cases below, we will describe the severity of data inconsistency, two ETCD data inconsistency bugs, a lease memory leak, an MVCC deadlock and a WAL crash: how we found, analyzed, reproduced and solved each of them, and what we gained and reflected on from each case, so that similar problems can be nipped in the bud.

Data Inconsistency

When it comes to data inconsistency failures, GitHub's incident is worth describing in detail. In 2018, routine maintenance work on network equipment disconnected GitHub's East Coast network hub from its primary East Coast data center. Although network connectivity was restored within 43 seconds, the brief outage set off a chain of events that resulted in 24 hours and 11 minutes of degraded service, with some features unavailable.

GitHub uses a large number of MySQL clusters to store GitHub metadata such as Issues, PRs and Pages, and also does cross-city disaster recovery between the East and West Coasts. The core cause of the failure was that GitHub's MySQL orchestration service, Orchestrator, performed a failover during the network anomaly and directed writes to the MySQL cluster on the US West Coast (the primary had been on the East Coast before the failure). However, the East Coast MySQL contained a small window of writes that had not yet been replicated to the West Coast cluster. After the failover, the primary database could not be safely failed back to the East Coast, because the clusters in both data centers now contained writes that did not exist in the other.

In the end, GitHub had to repair data consistency at the cost of a 24-hour service degradation in order to ensure that user data was not lost.

The severity of data inconsistency failures is self-evident. However, ETCD is a distributed, highly reliable storage system based on the RAFT protocol, and we do not do cross-city disaster recovery, so hitting a bug as severe as data inconsistency seems very unlikely. But reality is cruel: we encountered not one but two incredible data inconsistency bugs. One is triggered with low probability when restarting ETCD; the other is triggered with high probability in the K8S scenario when upgrading the ETCD version with authentication enabled. Before discussing these two bugs in detail, let's take a look at the problems caused by inconsistent ETCD data in the K8S scenario.

• The scariest thing about data inconsistency is that a client may write successfully yet read null or stale data from some nodes. The client cannot tell that the write failed on some nodes and may read old data

• Reading null data may cause business Nodes, Pods and the Service routing rules on nodes to disappear. In general, this only affects Services that the user has recently changed

• Reading stale data causes business changes not to take effect, such as Service scaling, Service RS replacement, or image changes hanging in an abnormal waiting state. In general, this only affects Services that the user has recently changed

• In the ETCD platform migration scenario, the client cannot perceive write failures. If the data verification happens to pass (the verification connects to a normal node), the whole cluster can fail after migration (the Apiserver connects to an abnormal node), and the user's Nodes, deployed Services and LBs may all be deleted, seriously impacting the user's business

The first inconsistency bug was encountered while restarting ETCD. Manual attempts to reproduce it failed many times, and the analysis, locating, reproduction and fix went through many twists and turns, which was very interesting and challenging, before we finally reproduced it successfully. The real culprit turned out to be a 3-year-old bug that affected all V3 versions: an authentication API request replayed after a restart caused the authentication revision to become inconsistent, which was then amplified into inconsistency of the MVCC data across nodes, and some nodes could no longer write new data.

We then submitted several related PRs to the community and they were all merged; this problem has been fixed in the latest ETCD v3.4.9[1] and v3.3.22[2], and Jingyih from Google also raised a K8S issue and PR[3] to upgrade the ETCD client and server version of K8S 1.19 to the latest v3.4.9.



The second inconsistency bug was encountered while upgrading ETCD. Due to the lack of key error logs in ETCD, there was little useful information at the failure site and the problem was hard to locate; it could only be solved by analyzing the code and reproducing it. Manual attempts to reproduce it failed many times, so we simulated client behavior with a chaos monkey, scheduled the ETCD requests of all K8S clusters in the test environment onto our reproduction cluster, and, by comparing the differences between the 3.2 and 3.3 versions, added a large number of key logs to suspicious points such as the Lease and TXN modules and printed an error log for every ETCD apply-request failure.

In version 3.2 the Revoke Lease interface requires no permission, while in version 3.3 it requires write permission. When a lease expires, if the leader is running 3.2, the revoke request fails on 3.3 nodes due to lack of permission, resulting in inconsistent key counts, inconsistent MVCC revisions, failed TXN transactions and so on. The latest 3.2 branch has also merged the fix we submitted. We have additionally added error logs for core process failures in ETCD to improve the efficiency of locating data inconsistency problems, and improved the upgrade documentation to state explicitly that lease handling causes data inconsistency in this scenario, so that others do not step into the same pit.

Here’s what we learned and best practices from these two data inconsistency bugs:

• A theoretically correct consistency algorithm does not guarantee that the overall service implementation is consistent. For distributed storage systems implemented as log-replicated state machines, no single core mechanism can ensure that the raft, WAL, MVCC and snapshot modules cooperate correctly. Raft can only guarantee the consistency of the replicated log; it cannot guarantee that the commands succeed when the application layer executes these log entries

• ETCD version upgrades carry certain risks. The code needs to be carefully reviewed to evaluate whether there are incompatible features that affect the authentication revision or the MVCC revision; if there are, data may become inconsistent during the upgrade, and changes to production clusters must be rolled out gradually (gray release)

• We added consistency inspection alerts to all ETCD clusters, such as revision-difference monitoring and key-count-difference monitoring (see the sketch after this list)

• Back up ETCD cluster data regularly. According to Murphy's law, even the smallest probability of failure may occur. Even though ETCD itself has fairly complete automated tests (unit tests, integration tests, e2e tests, fault injection tests, etc.), there are still many scenarios that test cases cannot cover. We need to prepare for the worst scenario (for example, the WAL, SNAP and DB files of all three nodes damaged at the same time), reduce losses in extreme cases, and achieve quick recovery from the available backup data

• Gradually enable the data-corruption detection feature on clusters running ETCD v3.4.4 or later. When cluster data is inconsistent, it refuses reads and writes, stopping the loss in time and limiting the range of inconsistent data

• Continue to improve our chaos monkey and use ETCD's own fault-injection testing framework, functional, to help us verify and stress new releases over long runs, uncover deeply hidden bugs, and reduce the probability of stepping into pits in production
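As a hedged illustration of the revision-difference monitoring mentioned in this list (not the actual inspection code of our platform), the following Go sketch queries the Status of each member endpoint via the official clientv3 API and flags revision divergence. The endpoint list and threshold are assumptions for the example; a small, transient drift is normal under active writes, which is what the threshold accounts for.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3"
)

// checkRevisionDrift queries Status on each endpoint and reports the gap
// between the highest and lowest MVCC revision it observes.
func checkRevisionDrift(endpoints []string, maxDrift int64) error {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return err
	}
	defer cli.Close()

	var min, max int64
	for i, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
		st, err := cli.Status(ctx, ep) // per-endpoint status, includes Header.Revision
		cancel()
		if err != nil {
			return fmt.Errorf("status of %s: %w", ep, err)
		}
		rev := st.Header.Revision
		if i == 0 {
			min, max = rev, rev
		}
		if rev < min {
			min = rev
		}
		if rev > max {
			max = rev
		}
	}
	// Ongoing writes cause a small lag between members; alert only past the threshold.
	if max-min > maxDrift {
		return fmt.Errorf("revision drift %d exceeds threshold %d", max-min, maxDrift)
	}
	return nil
}

func main() {
	// Endpoint list and threshold are illustrative only.
	if err := checkRevisionDrift([]string{"127.0.0.1:2379"}, 10); err != nil {
		fmt.Println("alert:", err)
	}
}
```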

Memory leak (OOM)

ETCD is known to be written in Golang, but can a language with a garbage collector also leak memory? First, we need to understand how Golang garbage collection works: a background daemon monitors the state of objects, identifies objects that are no longer referenced, and frees them so the memory can be reused. If your program keeps references to objects it no longer needs, the garbage collector is not a silver bullet. For example, the following scenarios can lead to memory leaks:

• Goroutine leaks

• Deferred function calls accumulating inside a loop (defer in a for loop)

• Obtaining a short string/slice from a long string/slice keeps the entire underlying memory block alive (they share the same underlying array), so the long one is never freed

• Poor management of application memory data structures, e.g. failing to clean up expired and invalid data in time (a minimal sketch of this pattern follows this list)
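To make the last pattern concrete, here is a minimal, self-contained Go sketch of a cache whose expired entries are never removed; the cache type and TTL handling are invented for the example and are not etcd code.

```go
package main

import "time"

type entry struct {
	value    string
	expireAt time.Time
}

// leakyCache keeps expired entries forever: nothing ever deletes them,
// so the map (and everything the values reference) can only grow.
type leakyCache struct {
	items map[string]entry
}

func (c *leakyCache) Put(k, v string, ttl time.Duration) {
	c.items[k] = entry{value: v, expireAt: time.Now().Add(ttl)}
}

func (c *leakyCache) Get(k string) (string, bool) {
	e, ok := c.items[k]
	if !ok || time.Now().After(e.expireAt) {
		return "", false // expired entries look missing to callers but are never freed
	}
	return e.value, true
}

// A fixed version needs an explicit cleanup path, e.g. a periodic sweep:
func (c *leakyCache) sweep() {
	now := time.Now()
	for k, e := range c.items {
		if now.After(e.expireAt) {
			delete(c.items, k) // actually release the reference
		}
	}
}

func main() {
	c := &leakyCache{items: make(map[string]entry)}
	c.Put("a", "b", time.Minute)
	c.sweep()
}
```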

So what was our ETCD memory leak? The incident originated on a weekend at the end of March: I got up to an alert that memory usage in a production 3.4 cluster had far exceeded the safety threshold, immediately investigated, and found the following:

• Both QPS and traffic monitoring were low, so high load and slow queries were ruled out

• In the 3-node cluster, only the two follower nodes were abnormal: the leader was around 4G while the follower nodes reached 23G

• Goroutines, FDs and other resources showed no leaks

• The Go runtime memstats metrics showed that the memory requested by each node was consistent, but go_memstats_heap_released_bytes on the follower nodes was much lower than on the leader node, indicating that some data structure might not have been released for a long time

• The production cluster had pprof disabled by default; we enabled it and waited for the problem to recur. Meanwhile we searched community issues for similar cases and found that several users had reported the same thing in January, but the community had not paid attention; their usage scenarios and symptoms were the same as ours

• With the heap profile from the community we quickly located the cause: ETCD manages lease state through a heap, and when a lease expires it needs to be removed from the heap, but follower nodes were not doing this, causing the follower memory leak. It affects all 3.4 versions

• After analyzing the problem, the fix I submitted is that follower nodes do not need to maintain the lease heap; when a leader election occurs, the newly elected leader rebuilds the lease heap, and the node that steps down from leader clears its lease heap

The root cause of this memory leak bug was poor management of a memory data structure. After the problem was fixed, the ETCD community quickly released a new version (v3.4.6+) and K8S promptly updated its ETCD version.
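To make the fix easier to picture, here is a hedged Go sketch of the idea: only the leader keeps a min-heap of lease expiry times, a node rebuilds the heap when it becomes leader and drops it when it steps down. The types and method names are simplified illustrations, not the actual etcd lessor implementation.

```go
package main

import (
	"container/heap"
	"time"
)

type leaseItem struct {
	id     int64
	expiry time.Time
}

// expiryHeap is a min-heap ordered by lease expiry time.
type expiryHeap []leaseItem

func (h expiryHeap) Len() int            { return len(h) }
func (h expiryHeap) Less(i, j int) bool  { return h[i].expiry.Before(h[j].expiry) }
func (h expiryHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *expiryHeap) Push(x interface{}) { *h = append(*h, x.(leaseItem)) }
func (h *expiryHeap) Pop() interface{} {
	old := *h
	n := len(old)
	item := old[n-1]
	*h = old[:n-1]
	return item
}

type lessor struct {
	isLeader bool
	leases   map[int64]time.Time // authoritative lease state, kept on every node
	heap     expiryHeap          // expiry index, maintained only on the leader
}

// promote is called when this node becomes leader: rebuild the heap from the map.
func (l *lessor) promote() {
	l.isLeader = true
	l.heap = l.heap[:0]
	for id, exp := range l.leases {
		heap.Push(&l.heap, leaseItem{id: id, expiry: exp})
	}
}

// demote is called when this node steps down: drop the heap so a follower never
// accumulates expired entries it will never get to remove.
func (l *lessor) demote() {
	l.isLeader = false
	l.heap = nil
}

func main() {
	l := &lessor{leases: map[int64]time.Time{1: time.Now().Add(time.Minute)}}
	l.promote() // became leader: build the expiry index
	l.demote()  // stepped down: release it
}
```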



Here are the lessons and best practices from this memory leak bug:

• Keep an eye on community issues and PRs; other people's problems today are likely to be our problems tomorrow

• ETCD's own tests cannot cover resource-leak bugs like this, which only trigger after running for a certain period of time. We need to strengthen testing and stress testing of such scenarios internally

• Continuously improve and enrich the monitoring and alerting of the ETCD platform, and keep enough memory headroom to withstand unexpected factors

Storage layer Deadlock (MVCC DEADLOCK)

A deadlock occurs when two or more goroutines wait for each other because they are competing for resources (usually locks) or because of communication (channels), leaving the program stuck and unable to serve requests. Because deadlocks are usually caused by resource competition under concurrency, they are generally difficult to locate and reproduce. The nature of deadlocks dictates that we must preserve the scene of the incident well, otherwise analysis and reproduction become extremely difficult.

So how did we discover and solve this deadlock bug? The problem originated when an internal team found, while testing an ETCD cluster, that a node suddenly hung, could not recover, and could not return the key count or other information normally. After receiving this feedback, I analyzed the stuck ETCD process, looked at the monitoring, and came to the following conclusions:

• RPC requests that do not pass through the RAFT and MVCC modules, such as Member List, returned results normally, while all RPC requests that pass through them timed out (context deadline exceeded)

• The ETCD health check returned 503, and the 503 error-reporting logic also goes through RAFT and MVCC

• tcpdump and netstat ruled out RAFT network module anomalies, narrowing the suspect down to MVCC

• Analyzing the logs, I found the node got stuck because its data lagged far behind the leader, so it received a data snapshot; it then hung while applying the snapshot and never printed the snapshot-loaded log, and I confirmed that no logs were lost

• I examined the snapshot-loading code, identified several suspicious locks and related goroutines, and prepared to capture the stacks of the stuck goroutines

• Obtaining the goroutine stacks via kill and pprof, and combining how long each goroutine had been stuck with the suspect code logic, we found the two competing goroutines. One was the main goroutine performing the snapshot load to rebuild the DB: it acquires an MVCC lock and then waits for all asynchronous tasks to finish. The other goroutine executes the historical key compaction task: when it receives the stop signal it exits immediately, calling a CompactBarrier routine that in turn needs to acquire the MVCC lock, thus causing a deadlock; a simplified sketch of this pattern follows.
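The following hedged Go sketch reproduces the shape of that deadlock with simplified names (it is not the actual etcd MVCC code): one goroutine holds a store lock and waits for background tasks to finish, while a background task's shutdown path tries to take the same lock.

```go
package main

import (
	"sync"
	"time"
)

type store struct {
	mu   sync.Mutex
	done chan struct{} // closed when the compaction goroutine has fully exited
}

// restoreFromSnapshot mirrors the snapshot-rebuild path: it takes the store
// lock and then waits for all background tasks to stop.
func (s *store) restoreFromSnapshot() {
	s.mu.Lock()
	defer s.mu.Unlock()
	// ... rebuild state from the snapshot ...
	<-s.done // wait for the compaction goroutine to acknowledge shutdown
}

// compactLoop mirrors the historical-compaction task: on stop it runs a
// barrier-style cleanup that also needs the store lock.
func (s *store) compactLoop(stop <-chan struct{}) {
	defer close(s.done)
	<-stop
	time.Sleep(50 * time.Millisecond) // let the snapshot path grab mu first (mirrors the real race)
	s.mu.Lock()                       // DEADLOCK: restoreFromSnapshot holds mu and waits on s.done
	// ... run CompactBarrier-style cleanup ...
	s.mu.Unlock()
}

func main() {
	s := &store{done: make(chan struct{})}
	stop := make(chan struct{})
	go s.compactLoop(stop)
	close(stop)             // ask the background task to exit
	s.restoreFromSnapshot() // the Go runtime reports: all goroutines are asleep - deadlock!
}
```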



This bug had been hidden for a long time and affects all ETCD3 versions. It is triggered when the cluster has a large volume of writes, a lagging node is rebuilding from a snapshot, and historical version compaction is running at the same time. The fix PR I submitted has been merged into the 3.3 and 3.4 branches, and new versions have been released (v3.3.21+/v3.4.8+).

Here’s what we learned and best practices from this deadlock bug:

• ETCD's automated tests do not cover such multi-goroutine concurrency scenarios, which are hard to construct and therefore prone to bugs. Are there other similar scenarios with the same problem? We need to work with the community and continue to improve ETCD test coverage (an earlier official blog post noted that more than half of ETCD's code is already test code) to avoid this class of problem.

• Monitoring can detect abnormal node downtime in time, but before this deadlock we did not automatically restart ETCD. We therefore need to improve our health-detection mechanism (for example, probing /health to determine whether the service is normal) so that in a deadlock scenario we can preserve the stacks and restart the service automatically (see the sketch after this list).

• For scenarios with high read request volume, we need to evaluate whether the QPS capacity of the remaining two nodes of a 3-node cluster can support the business after one node goes down. If not, a 5-node cluster should be considered.
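A hedged sketch of the health-probe idea mentioned above: periodically query the member's /health endpoint and, if it keeps failing, send SIGQUIT so the Go runtime dumps all goroutine stacks (preserving the deadlock site) before the supervisor restarts the process. The URL, thresholds, PID discovery and restart mechanism are assumptions for the example, not our production probe.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"syscall"
	"time"
)

// healthy reports whether the etcd member answers /health within the timeout.
func healthy(url string) bool {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	const (
		healthURL = "http://127.0.0.1:2379/health" // illustrative endpoint
		maxFails  = 3
	)
	etcdPID := 12345 // in practice, discovered from the process table or a pid file

	fails := 0
	for range time.Tick(10 * time.Second) {
		if healthy(healthURL) {
			fails = 0
			continue
		}
		fails++
		if fails < maxFails {
			continue
		}
		// SIGQUIT makes a Go process print every goroutine's stack before exiting,
		// preserving the deadlock site; systemd/supervisor then restarts etcd.
		fmt.Fprintln(os.Stderr, "etcd unhealthy, sending SIGQUIT to capture stacks")
		_ = syscall.Kill(etcdPID, syscall.SIGQUIT)
		fails = 0
	}
}
```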

Wal Crash (Panic)

Panic is a severe runtime or business logic error that causes the entire process to exit. Panic is not new to us; we have encountered it several times in production, and the earliest instability we hit occurred during cluster operations.

Although our 3-node ETCD clusters can tolerate the failure of one node, a crash still affects users instantaneously and can even cause cluster health-probe connections to fail.

The first crash bug we encountered appeared with a certain probability when the number of cluster connections was large. From the stack we found that the community had already reported this gRPC crash (issue)[4]; it was caused by the gRPC-Go component that ETCD depends on (fixed by a gRPC PR)[5]. The crash bug[6] we hit most recently was introduced by the new v3.4.8/v3.3.21 releases, which had a lot to do with us: we contributed 3 PRs to those releases, accounting for more than half of the changes. So how did this crash bug arise and how did we reproduce it? Could it have been our own doing?

• The first crash report was "walpb: crc mismatch", and we had not submitted any code that changed the WAL logic.

• Next, by reviewing the PRs in the new version, we narrowed the suspect down to a PR from a Google engineer that fixed a crash caused by a WAL write succeeding while the snapshot write failed.

• But how exactly was it introduced? The PR included multiple test cases to verify the newly added logic, and we could not reproduce the problem locally with newly created empty clusters or with existing clusters holding relatively little data.

• The error log contained too little information to determine which function reported the error, so the first thing to do was to add logs. After adding error logs at each suspicious point, we picked an old node in our test cluster and replaced its version, after which the problem reproduced easily, and we confirmed that the culprit was the newly added validation of the snapshot file. So why was there a CRC mismatch? First, let's take a brief look at the WAL file.

• Any ETCD request that goes through the RAFT module is persisted to the WAL file before being written to the ETCD MVCC DB. If the process is killed during the apply-command phase or hits other exceptions, the data can be replayed from the WAL file to avoid data loss. WAL files contain various request records, such as member change information and key operations. To ensure data integrity and detect corruption, each WAL record carries a CRC32 computed in a rolling fashion over the records and written to the WAL file. After a restart, each record's integrity is checked while parsing the WAL file; if the data is corrupted or the CRC32 computation changes, a CRC mismatch occurs (a hedged sketch of this rolling checksum follows this list).

• The hard disk and file system showed no anomalies, ruling out data corruption. After thoroughly examining the CRC32 computation, we found that the newly added logic did not handle WAL records of the CRC type, which changed the computed CRC32 value and caused the divergence, and that this is only triggered after the first WAL file generated at cluster creation has been recycled. So for an existing cluster that has been running for some time, it reproduces 100% of the time.

• The fix was to add handling for that record type to the CRC32 logic and to add unit tests covering the scenario where WAL files have been recycled. The community has merged it and released new 3.4 and 3.3 versions (v3.4.9/v3.3.22).
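To make the CRC mismatch easier to understand, here is a hedged Go sketch of a rolling record checksum like the one described above: each record's CRC is computed over its data starting from the previous record's CRC, so corruption of any earlier record, or a change in which records feed the checksum, breaks validation of everything after it. The Castagnoli polynomial and record layout here are for illustration and are not taken verbatim from the etcd WAL code.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

var table = crc32.MakeTable(crc32.Castagnoli) // polynomial chosen for the example

type record struct {
	data []byte
	crc  uint32 // checksum chained from the previous record
}

// appendRecord computes the record's CRC by continuing from the previous CRC,
// which ties every record to all of its predecessors in the file.
func appendRecord(records []record, prevCRC uint32, data []byte) ([]record, uint32) {
	crc := crc32.Update(prevCRC, table, data)
	return append(records, record{data: data, crc: crc}), crc
}

// verify replays the chain; any record whose stored CRC does not match the
// recomputed rolling value yields the "crc mismatch" style failure seen on restart.
func verify(records []record) error {
	var crc uint32
	for i, r := range records {
		crc = crc32.Update(crc, table, r.data)
		if crc != r.crc {
			return fmt.Errorf("record %d: crc mismatch (want %x, got %x)", i, r.crc, crc)
		}
	}
	return nil
}

func main() {
	var recs []record
	var crc uint32
	recs, crc = appendRecord(recs, crc, []byte("member change"))
	recs, _ = appendRecord(recs, crc, []byte("put /foo bar"))
	recs[0].data = []byte("corrupted") // simulate corruption or a skipped record type
	fmt.Println(verify(recs))
}
```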

Although this bug was reported by the community, we learned the following lessons and best practices from it:

• Unit test cases are valuable, but writing a complete unit test case is not easy and requires consideration of various scenarios.

• When the ETCD community upgrades existing clusters, compatibility test cases between different versions are almost nonexistent, so we need to work together to contribute and make the test cases cover more scenarios.

• Standardize and automate the internal process for rolling out new versions: test-environment stress testing, chaos testing, performance comparison across versions, priority use in non-core scenarios (such as Event), and gray-scale rollout are all essential.

Quotas and speed limits (Quota&QoS)

Some expensive read and write operations on ETCD, such as full keyspace fetches, large numbers of Event queries, listing all Pods, and ConfigMap writes, consume a lot of CPU, memory and bandwidth, and can easily lead to overload or even an avalanche.

However, ETCD currently has only a very simple speed-limit protection: when the gap between ETCD's committed index and applied index exceeds a threshold of 5000, it rejects all requests and returns Too Many Requests. The defect is obvious: it cannot precisely limit expensive reads and writes to prevent cluster overload and unavailability.
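A hedged sketch of that simple protection (names simplified; not the actual etcd server code): before serving a request, compare the committed and applied Raft indexes and reject with a "too many requests" error when the backlog exceeds a fixed threshold, regardless of how cheap or expensive the request is.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// ErrTooManyRequests mirrors the error etcd returns when the apply backlog is too large.
var ErrTooManyRequests = errors.New("etcdserver: too many requests")

const maxGapBetweenApplyAndCommitIndex = 5000 // threshold mentioned in the text

type server struct {
	committedIndex uint64 // raft log index that has been committed
	appliedIndex   uint64 // raft log index that has been applied to the state machine
}

// checkBackpressure rejects new proposals when the apply loop lags too far
// behind the committed index; this coarse check is the only throttle applied.
func (s *server) checkBackpressure() error {
	ci := atomic.LoadUint64(&s.committedIndex)
	ai := atomic.LoadUint64(&s.appliedIndex)
	if ci > ai && ci-ai > maxGapBetweenApplyAndCommitIndex {
		return ErrTooManyRequests
	}
	return nil
}

func main() {
	s := &server{committedIndex: 10000, appliedIndex: 1000}
	fmt.Println(s.checkBackpressure()) // etcdserver: too many requests
}
```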

In order to solve the above challenges and avoid cluster overload, we have adopted the following solutions to ensure cluster stability:

• Relying on the rate-limiting capability of the K8S apiserver layer, e.g. the apiserver defaults of 100 writes/s and 200 reads/s

• Limiting unreasonable numbers of Pods/ConfigMaps/CRDs with K8S resource quotas

• Controlling the number of dead Pods retained via the kube-controller-manager --terminated-pod-gc-threshold parameter (the default of 12500 is high and leaves plenty of room for optimization)

• Using the K8S APIServer's ability to store different resource types separately, putting Event/ConfigMap and other non-core data into separate ETCD clusters, which both improves storage performance and reduces the failure factors of the core primary ETCD

• Rate-controlling APIServer reads and writes of Events with an Event admission webhook (see the sketch after this list)

• Flexibly adjusting the event-ttl time according to different business situations to minimize the number of Events

• Developing a QoS feature for ETCD; we have submitted a preliminary design to the community that supports setting QoS rules based on multiple object types (such as gRPC method, gRPC method plus request key prefix path, traffic, CPU-intensive operations, and latency)

• Preventing and avoiding potential cluster stability problems in advance through multi-dimensional cluster alerts (inbound/outbound traffic alerts for ETCD cluster LBs and nodes, memory alerts, abnormal resource-growth alerts refined to each K8S cluster, and abnormal-growth alerts for cluster resource read/write QPS)
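As a hedged sketch of the Event rate-control idea referenced in this list (the limits and per-namespace keying are assumptions, not our production webhook), the core can be a token-bucket check using golang.org/x/time/rate; the admission webhook would deny Event writes once a namespace exceeds its budget.

```go
package main

import (
	"sync"

	"golang.org/x/time/rate"
)

// eventLimiter rate-limits Event writes per namespace; limits are illustrative.
type eventLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	qps      rate.Limit
	burst    int
}

func newEventLimiter(qps rate.Limit, burst int) *eventLimiter {
	return &eventLimiter{
		limiters: make(map[string]*rate.Limiter),
		qps:      qps,
		burst:    burst,
	}
}

// Allow is what the admission webhook would call for each incoming Event write;
// returning false translates into a denied AdmissionReview response.
func (e *eventLimiter) Allow(namespace string) bool {
	e.mu.Lock()
	l, ok := e.limiters[namespace]
	if !ok {
		l = rate.NewLimiter(e.qps, e.burst)
		e.limiters[namespace] = l
	}
	e.mu.Unlock()
	return l.Allow()
}

func main() {
	limiter := newEventLimiter(50, 100) // 50 events/s with a burst of 100 per namespace
	_ = limiter.Allow("default")
}
```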

Multi-dimensional cluster alerts play an important role in our ETCD stability assurance, and have helped us spot problems with both our users and our own cluster components several times. An example of a user problem: an internal K8S platform had a bug that wrote a large number of CRD resources to the cluster, and the client's CRD read/write QPS was obviously high. An example of a problem with our own components: an old logging component, as cluster size grew, made unreasonably frequent List Pod calls, driving ETCD cluster traffic up to 3Gbps and causing 5XX errors in the Apiserver itself.

Through the above measures, we can greatly reduce the stability problems caused by expensive reads. However, from online practice, we currently still rely on cluster alerts to help us locate some abnormal client calls, and cannot yet automatically and precisely rate-limit abnormal clients. Because the ETCD layer cannot distinguish which client is calling, rate limiting on the ETCD side would mistakenly kill requests from normal clients, so we rely on the finer-grained rate limiting of the Apiserver. The community introduced API Priority and Fairness[7] in 1.18; it is currently in alpha, and we expect this feature to stabilize soon.

3. Analysis of performance optimization cases

The read/write performance of ETCD determines how large a cluster we can support and how many concurrent client calls we can handle, and the startup time determines how long it takes to provide service again when a node restarts or rebuilds from a snapshot received from the leader because it has fallen too far behind. Below we briefly introduce how we optimized ETCD performance: cutting startup time by half, improving password authentication performance by 12 times, improving key-count query performance by 3 times, and so on.

Optimizing startup time, key-count queries and limited-record queries

When the DB size reached 4G and the number of keys was in the millions, we found that restarting a cluster node took up to 5 minutes and key-count queries timed out; after increasing the timeout, a query took up to 21 seconds and memory ballooned by 6G. Queries that only return a limited number of records were also time-consuming and had huge memory overhead (for example, when a business uses ETCD grpc-proxy to reduce the number of watches, the grpc-proxy issues a limit read query on the watch path when creating a watcher by default). So in my spare time on a weekend, I investigated and analyzed these questions in depth: where is the startup time spent? Is there room for improvement? Why do key-count queries take so long and cost so much memory?

With these questions in mind, I analyzed and located the relevant source code in depth. First, looking at the time and memory overhead of counting keys and of queries that only return a specified number of records, the conclusions are as follows:

• When counting keys, ETCD's previous implementation traversed the entire in-memory btree and stored each key's corresponding revision in a slice

• The problem is that with a large number of keys, growing the slice involves copying data, and the slice itself requires a large amount of memory

• The optimization is therefore to add a CountRevision that counts the keys instead of collecting them in a slice, which reduces the query from 21s to 7s with no memory overhead at all (see the sketch after this list)

• As for the time and memory cost of queries that return a specified number of records, analysis showed that the limit was not being pushed down to the index layer. By pushing the query's limit parameter down to the index layer, limit-query performance in large-data scenarios improves a hundredfold with no extra memory cost.
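A hedged sketch of the counting change, using the google/btree package to stand in for etcd's internal index (names are illustrative): instead of collecting every key's revision into a slice just to take its length, walk the tree with a callback that increments a counter, which avoids the repeated slice growth and the extra memory.

```go
package main

import (
	"fmt"

	"github.com/google/btree"
)

// keyIndex stands in for the per-key entry of the in-memory index.
type keyIndex struct{ key string }

func (a keyIndex) Less(b btree.Item) bool { return a.key < b.(keyIndex).key }

// countKeysOld mirrors the original approach: materialize all entries in a slice,
// paying for repeated slice growth and keeping every entry alive at once.
func countKeysOld(t *btree.BTree) int {
	var all []keyIndex
	t.Ascend(func(i btree.Item) bool {
		all = append(all, i.(keyIndex))
		return true
	})
	return len(all)
}

// countKeysNew mirrors the optimization: just count during the traversal.
func countKeysNew(t *btree.BTree) int {
	count := 0
	t.Ascend(func(i btree.Item) bool {
		count++
		return true
	})
	return count
}

func main() {
	t := btree.New(32)
	for i := 0; i < 3; i++ {
		t.ReplaceOrInsert(keyIndex{key: fmt.Sprintf("/registry/pods/p%d", i)})
	}
	fmt.Println(countKeysOld(t), countKeysNew(t))
}
```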



Next, looking at the problem of excessive startup time, by adding logs to each stage of startup we drew the following conclusions:

• On startup, the ETCD process did not make full use of the machine's CPU cores

• 9% of the time was spent opening the backend DB, for example mmapping the entire DB file into memory

• 91% of the time was spent rebuilding the in-memory index btree. When ETCD receives a GET key request, it is passed to the MVCC layer, which first looks up the key's revision in the in-memory index btree, then fetches the corresponding value from boltdb by revision, and returns it to the client. Rebuilding the in-memory index btree is exactly the reverse process: traverse boltdb, iterating from revision 0 up to the maximum revision, parse the key, revision and other information from each value, and rebuild the btree. Because this is a serial operation, it is extremely time-consuming

• We tried to optimize the serial btree build into a highly concurrent one that uses all of the machine's cores, but after compiling and testing a new version the effect was small. We then compiled another version that printed detailed timings of each stage of the index rebuild and found that the bottleneck was inserting into the in-memory btree, which holds a global lock, so there was almost no room for optimization there

• Continuing to analyze the 91%, we found that the in-memory index was actually rebuilt twice. The first rebuild is done to obtain a key MVCC data structure, the consistent index, the variable that ensures ETCD commands are not executed repeatedly; the data inconsistency bug mentioned earlier is also closely related to the consistent index.

• The consistent index was unreasonably coupled into the MVCC layer, so I had previously submitted a PR to refactor this feature into an independent package providing methods for the ETCDServer, MVCC, Auth, Lease and other modules to call.

• After the refactor, the consistent index no longer needs to be obtained by rebuilding the in-memory index at startup; it is fetched quickly through the cindex package, which shortens the overall startup time from more than 5 minutes to about 2 minutes 30 seconds. Because this optimization depends on the consistent index refactor, which is a major change, it has not been backported to the 3.4/3.3 branches; the future 3.5 release will significantly reduce startup time when the data volume is large (a hedged sketch of reading the persisted consistent index follows).
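To show why the cindex change removes one full index rebuild, here is a hedged Go sketch: once the consistent index lives in its own package, startup can read the persisted value straight out of the backend instead of replaying the keyspace. The bucket and key names below ("meta"/"consistent_index") reflect what recent etcd versions persist but should be treated as illustrative, and error handling is minimal.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

// readConsistentIndex opens the backend read-only and returns the persisted
// consistent index without touching the keyspace at all.
func readConsistentIndex(dbPath string) (uint64, error) {
	db, err := bolt.Open(dbPath, 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		return 0, err
	}
	defer db.Close()

	var ci uint64
	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("meta")) // bucket name as persisted by etcd (illustrative)
		if b == nil {
			return fmt.Errorf("meta bucket not found")
		}
		v := b.Get([]byte("consistent_index"))
		if len(v) != 8 {
			return fmt.Errorf("unexpected consistent_index value: %x", v)
		}
		ci = binary.BigEndian.Uint64(v)
		return nil
	})
	return ci, err
}

func main() {
	ci, err := readConsistentIndex("member/snap/db") // typical backend path, illustrative
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("consistent index:", ci)
}
```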



Password authentication performance increased by 12 times

An internal business service had been running well. One day, after the number of clients increased slightly, a large number of timeouts appeared in the production ETCD cluster. Switching cloud disk types, changing the deployment environment and adjusting parameters did not help.

• The phenomenon was really strange: the DB latency metrics showed no anomaly, and there was no useful information in the logs

• The business reported large numbers of read-request timeouts, which could easily be reproduced even with the etcdctl client tool, yet the metrics for read requests were all 0

• We guided the user to enable trace logging and switch metrics to extensive mode. No trace logs appeared after enabling them, but after turning on extensive metrics I found that all the time was spent in the Authenticate interface, and the business confirmed that it used password authentication rather than certificate-based authentication

• We asked the business team to briefly disable authentication to test whether the business recovered; after they disabled authentication on one node, that node immediately returned to normal, so they chose to temporarily disable authentication to restore the production business

• So why did authentication take so long? We added logs at the suspicious points and printed the time of each authentication step, and found the time was spent waiting for a lock. Why did the lock take so long? It turned out that while the lock was held, the bcrypt encryption function was called to compute the password hash, which takes about 60ms each time, so under hundreds of concurrent requests the maximum wait for the lock reached 5s+.

• So we built a new version that narrows the scope of the lock and reduces how long it is held while blocking others. After the user deployed the new version with authentication enabled, the business no longer timed out and returned to normal.

• We then submitted the fix to the community and wrote a test tool; the improved performance is nearly 12 times higher (from 18/s to 202/s on an 8-core 32G machine), but it is still slow. The main cost is the password-hash computation in the authentication path, and community users have also reported slow password authentication. The latest v3.4.9 includes this optimization, and performance can be further improved by adjusting the bcrypt-cost parameter (a sketch of the bcrypt cost follows).
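To make the bcrypt cost concrete, the following hedged Go sketch measures how long one password comparison takes at different cost factors using golang.org/x/crypto/bcrypt. The exact timings depend on the CPU; roughly 60ms per check at the default cost is in line with what we observed. Lowering the cost, as the bcrypt-cost option allows, trades some hashing strength for authentication throughput.

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/crypto/bcrypt"
)

func measure(cost int, password []byte) {
	hash, err := bcrypt.GenerateFromPassword(password, cost)
	if err != nil {
		panic(err)
	}
	start := time.Now()
	// This is the expensive step the Authenticate path runs while holding the lock.
	if err := bcrypt.CompareHashAndPassword(hash, password); err != nil {
		panic(err)
	}
	fmt.Printf("cost=%d compare took %v\n", cost, time.Since(start))
}

func main() {
	pw := []byte("example-password")
	// MinCost(4), DefaultCost(10) and a higher cost for comparison.
	for _, cost := range []int{bcrypt.MinCost, bcrypt.DefaultCost, 12} {
		measure(cost, pw)
	}
}
```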

4. Summary

This article briefly described the ETCD stability and performance challenges we have encountered while managing ten-thousand-level K8S clusters and other businesses, and how we located, analyzed, reproduced and resolved these challenges and contributed the solutions to the community.

It also described the valuable lessons we learned from these challenges and how we applied them to subsequent ETCD stability assurance, so that we can support larger individual clusters and a larger total number of clusters.

Finally, facing ten-thousand-level numbers of K8S clusters and thousand-level numbers of ETCD clusters spread across more than 10 versions, many of the lower versions contain important bugs that could be triggered with serious consequences. We still need to invest a lot of work to continuously optimize our ETCD platform, making it more intelligent and making changes more efficient, safe and controllable (for example, supporting automated, controllable cluster upgrades). Data security is equally important: the TKE managed clusters on Tencent Cloud are already fully backed up, and for users of independent clusters we will guide them, through the application marketplace, to enable scheduled backups of ETCD to Tencent Cloud object storage COS with the backup plugin.

In the future, we will continue to be closely integrated into the ETCD community, contribute our strength to the development of the ETCD community, and work with the community to improve the various functions of ETCD.