The problem background

Weird K8S rolling update exception

One day a colleague reported that a rolling update in our test Kubernetes cluster did not take effect: kube-apiserver showed that the Deployment had already been updated to the latest version, but Pods of that latest version were never created.

Given this symptom, we initially suspected a bug in kube-controller-manager, but its logs showed no obvious abnormality. We raised the log level of kube-controller-manager and restarted it; it appeared that the controller had simply never watched the update event, so we still could not find the problem. We then looked at the kube-apiserver logs, and again found nothing obviously wrong.

So we increased the log level again and restarted kube-apiserver, and something strange happened: the previous Deployment rolled out normally!

Inconsistent etcd data?

Since kube-apiserver's logs also yielded no useful information, our first guess was that a cache update exception inside kube-apiserver might be the cause. Just as we started investigating from this angle, the same colleague reported an even weirder problem: a Pod he had just created did not show up at all when he listed Pods through kubectl! What? What kind of operation could cause that? For this kind of request, kube-apiserver pulls the data directly from etcd and returns it to the client, so could there be a problem with etcd itself?

As we all know, etcd is a strongly consistent KV store: once a write succeeds, two subsequent read requests should never return different data. Somewhat incredulously, we queried the cluster status and cluster data directly with etcdctl. The result showed that all three nodes were healthy and their RaftIndex values were consistent, and the etcd logs contained no error messages. The only suspicious thing was that the db sizes of the three nodes differed considerably. We then pointed the client endpoint at each node in turn and counted the keys on each of them: the key counts returned by the three nodes were inconsistent, and the difference between two nodes could reach several thousand keys. Querying the newly created Pod directly with etcdctl showed the same thing: the Pod could be found through some endpoints but not through others. At this point it was almost certain that the data really was inconsistent between the nodes of the etcd cluster.
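For reference, the kind of per-endpoint comparison described above can also be done programmatically. Below is a minimal sketch using the etcd clientv3 API instead of etcdctl; the endpoint addresses are placeholders and the import path depends on the client version in use.

package main

import (
    "context"
    "fmt"
    "time"

    "go.etcd.io/etcd/clientv3"
)

func main() {
    // Placeholder endpoints; replace with the real member addresses.
    endpoints := []string{"http://node1:2379", "http://node2:2379", "http://node3:2379"}

    for _, ep := range endpoints {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{ep}, // talk to a single member only
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            panic(err)
        }

        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)

        // Status reports dbSize, leader, raftTerm and raftIndex for this member.
        st, err := cli.Status(ctx, ep)
        if err != nil {
            panic(err)
        }

        // Count all keys as seen by this member (serializable, i.e. local, read).
        resp, err := cli.Get(ctx, "", clientv3.WithPrefix(),
            clientv3.WithCountOnly(), clientv3.WithSerializable())
        if err != nil {
            panic(err)
        }

        fmt.Printf("%s: dbSize=%d raftIndex=%d revision=%d keys=%d\n",
            ep, st.DbSize, st.RaftIndex, st.Header.Revision, resp.Count)

        cancel()
        cli.Close()
    }
}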

Problem analysis and troubleshooting process

When in doubt, ask Google

A data inconsistency in a strongly consistent store is a serious problem and should surely leave traces in the logs. However, perhaps because the etcd developers worry about the performance impact of excessive logging, etcd does not log very verbosely: we checked the logs of the etcd nodes but found no useful error messages, and even after raising the log level no exception messages appeared.

As a 21st-century programmer, the first thing to do when facing a weird problem you cannot immediately explain is to Google it. After all, a programmer who cannot use StackOverflow is not a good programmer! Googling "etcd data inconsistent" showed that we were not alone: others had reported similar issues to the etcd community before, but none of them provided stable reproduction steps. For example, the following issues:

Github.com/etcd-io/etc…

Github.com/etcd-io/etc…

Github.com/etcd-io/etc…

Github.com/etcd-io/etc…

Since this problem is serious, affects data consistency, and we currently run hundreds of etcd clusters in our production environment, we decided to investigate it thoroughly to avoid running into it again.

An introduction to etcd working principles and terminology

Before we begin, for the convenience of readers, here is a brief introduction to common etcd terms and the basic read and write flows.

Glossary:

etcd is a distributed KV store with strong consistency. Strong consistency simply means that after a write operation succeeds, the data read from any node is the latest value; a successful write is never followed by a failed read or a read of stale data. etcd uses the Raft protocol to implement leader election, configuration changes, and consistency of data reads and writes. The following is a brief introduction to the etcd read and write flows.

Write data flow (taking the leader node as an example):

  1. The etcd server module of any node receives a write request from the client (if the receiving node is a follower, the request is first forwarded to the leader node through the Raft module).
  2. The etcd server encapsulates the request as a Raft request and submits it to the Raft module for processing.
  3. The leader interacts with the follower nodes in the cluster through the Raft protocol, replicating the message to the followers while persisting the log to the WAL in parallel.
  4. The follower nodes respond to the request, replying whether they accept it.
  5. Once more than half of the cluster nodes ((n/2)+1 members) have accepted the log entry, the request can be committed; the Raft module then notifies the etcd server that the log has been committed and can be applied.
  6. The applierV3 module of the etcd server on each node asynchronously performs the apply operation and writes the data to the backend storage BoltDB through the MVCC module.
  7. After the node the client is connected to has applied the data successfully, the result is returned to the client.

Data reading process:

  • The etcd server module of any node receives a client read request (a Range request).
  • The request type is checked: if it is a serializable read, it enters the apply flow directly.
  • A linearizable read enters the Raft module first.
  • The Raft module sends a ReadIndex request to the leader to obtain the latest committed data index (CommittedIndex) of the cluster.
  • The node waits until its local AppliedIndex is greater than or equal to the CommittedIndex obtained through ReadIndex, then enters the apply flow.
  • Apply flow: the latest revision of the key is obtained from the KV index module by key name, and the corresponding key and value are read from BoltDB by revision.
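From the client's perspective, the difference between the two read modes is just an option on the Get request. Below is a minimal sketch with the clientv3 API; the endpoint and key are placeholders.

package main

import (
    "context"
    "fmt"
    "time"

    "go.etcd.io/etcd/clientv3"
)

func main() {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"http://node1:2379"}, // placeholder endpoint
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    // A write is forwarded to the leader via Raft no matter which member we talk to.
    if _, err := cli.Put(ctx, "/demo/key", "value"); err != nil {
        panic(err)
    }

    // Default Get: linearizable read, goes through ReadIndex as described above.
    linResp, err := cli.Get(ctx, "/demo/key")
    if err != nil {
        panic(err)
    }

    // Serializable Get: served from the local node's applied state; cheaper,
    // but may return stale data if that node lags behind.
    serResp, err := cli.Get(ctx, "/demo/key", clientv3.WithSerializable())
    if err != nil {
        panic(err)
    }

    fmt.Println(len(linResp.Kvs), len(serResp.Kvs))
}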

Preliminary validation

Normally, such a serious problem should not occur while a cluster runs steadily without external changes. Checking the release records of this etcd cluster, we found that the day before the failure, when the cluster was redeployed, the database had been filled up because of an improper db quota (dbsize) configuration and the cluster could no longer accept writes. We therefore updated the dbsize and compaction configuration and restarted etcd; after the restart, the operations colleague manually ran compact and defrag on the cluster to compress the DB space.

Through the above scenario, we can preliminarily judge the following suspicious trigger conditions:

  1. The db quota (dbsize) being full
  2. Updates to the dbsize and compaction configuration
  3. Compaction and defrag operations
  4. Restarting etcd

A problem that can be reproduced is far easier to solve; as the saying goes, a bug that can be reproduced is no longer a bug. Before trying to reproduce it, we analyzed the earlier issues in the etcd community and found that the common conditions for triggering this kind of bug included running compaction or defrag and restarting an etcd node. So we planned to simulate these operations together and see whether the problem could be reproduced in a new environment. We created a new cluster, wrote scripts that kept writing and deleting data until the db size reached a certain level, updated the configuration, restarted nodes, and triggered operations such as compaction and defrag. However, after several attempts we did not reproduce anything resembling the data inconsistency described above.
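For reference, the compaction and defrag steps used in those reproduction attempts can be driven either through etcdctl or through the clientv3 API. Below is a minimal sketch of the latter; the endpoints are placeholders and error handling is simplified.

package main

import (
    "context"
    "time"

    "go.etcd.io/etcd/clientv3"
)

func main() {
    // Placeholder member addresses.
    endpoints := []string{"http://node1:2379", "http://node2:2379", "http://node3:2379"}

    cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
    defer cancel()

    // Any Get response header carries the current store revision;
    // compact all history older than that revision.
    resp, err := cli.Get(ctx, "compaction-probe")
    if err != nil {
        panic(err)
    }
    if _, err := cli.Compact(ctx, resp.Header.Revision); err != nil {
        panic(err)
    }

    // Defragment has to be run against each member individually.
    for _, ep := range endpoints {
        if _, err := cli.Defragment(ctx, ep); err != nil {
            panic(err)
        }
    }
}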

There are some clues

Then, in subsequent tests, I accidentally noticed that depending on which endpoint the client wrote data through, the set of nodes on which that data could later be found differed. For example, data written with the endpoint set to node1 could be queried on all three nodes; data written through node2 could be queried on node2 and node3; data written through node3 could only be queried on node3.

So we made some preliminary guesses about the possible causes:

  1. The cluster has split and the leader is not sending messages to the follower nodes.
  2. The leader sends messages to the followers, but the logs are abnormal and contain no corresponding command.
  3. The leader sends messages to the followers with the corresponding command, but apply fails before reaching the KV index and BoltDB.
  4. The leader sends messages to the followers with the corresponding command, but apply goes wrong at the KV index.
  5. The leader sends messages to the followers with the corresponding command, but apply goes wrong at BoltDB.

To test these guesses, we ran a series of checks to narrow the question down:

First, we checked the cluster information with endpoint status and found that the clusterId, leader, raftTerm and raftIndex of the three nodes were consistent, while the dbSize and revision were not. A consistent clusterId and leader basically rules out a cluster split; consistent raftTerm and raftIndex indicate that the leader is synchronizing messages to the followers, which further rules out the first guess, although whether the WAL is abnormal is still unknown. The inconsistent dbSize and revision further confirm that the data of the three nodes has diverged.

Second, etcd itself ships some dump tools, such as etcd-dump-log and etcd-dump-db. etcd-dump-log dumps the contents of a specified WAL file, and etcd-dump-db dumps the data in a DB file.

So we wrote a piece of data to node3 and then analyzed the WAL of all three nodes with etcd-dump-log. According to the earlier test, data written to node3 should not be visible through the other two nodes; however, all three nodes had received the corresponding WAL entries. In other words, the WAL was not lost, which rules out the second guess.

Then we dumped the DB data and found that the data written through node3 could not be found in the DB files of the other two nodes. That is, the data never made it into their DBs at all, rather than having been written and later removed.

Since the data is in the WAL but not in the DB, the problem most likely lies in the apply process: the data may have been discarded during apply.

Since the existing logs provided no further useful information, we decided to add some logging to etcd to help locate the problem. When etcd performs an apply operation, it prints a trace log for every request that takes more than 100ms. We first lowered this 100ms threshold to 1ns so that every apply request would be recorded, which would help us pinpoint the problem.

After compiling the new version, we replaced one of the etcd nodes and then sent write requests to different nodes. Sure enough, we found an unusual error log: "error":"auth: revision in header is old". We therefore concluded that the keys missing on the node that printed this log were most likely related to this error.

Searching the code, we found that when authentication is enabled, etcd checks the AuthRevision carried in the header of the Raft request during apply: if the AuthRevision in the request is smaller than the current node's AuthRevision, the request is considered too old and the apply fails.

func (as *authStore) isOpPermitted(userName string, revision uint64, key, rangeEnd []byte, permTyp authpb.Permission_Type) error {
    // ...
    if revision < as.Revision() {
        return ErrAuthOldRevision
    }
    // ...
}

AuthRevision is read directly from the DB when etcd starts, and every change to it is promptly written back to the DB. So we simply extended the etcd-dump-db tool to decode and compare the AuthRevision stored in each node's DB: the AuthRevision of the three nodes was indeed inconsistent, with node1 having the highest value and node3 the lowest. This explains the earlier observation: data written through node1 can be found on all three nodes, while data written through node3 can only be found on node3, because Raft requests proposed by a node with a lower AuthRevision are discarded during apply on nodes with a higher AuthRevision.
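The comparison itself is easy to redo against a copy of a member's db file. Below is a minimal sketch using bbolt, assuming (as in the etcd versions we were running) that the auth revision is stored in the "auth" bucket under the "authRevision" key as a big-endian uint64; this is only an illustration, not the actual etcd-dump-db change.

package main

import (
    "encoding/binary"
    "fmt"
    "os"

    bolt "go.etcd.io/bbolt"
)

func main() {
    // Path to a copy of a member's snap/db file, passed on the command line.
    db, err := bolt.Open(os.Args[1], 0600, &bolt.Options{ReadOnly: true})
    if err != nil {
        panic(err)
    }
    defer db.Close()

    err = db.View(func(tx *bolt.Tx) error {
        b := tx.Bucket([]byte("auth"))
        if b == nil {
            return fmt.Errorf("auth bucket not found")
        }
        v := b.Get([]byte("authRevision"))
        if len(v) != 8 {
            return fmt.Errorf("unexpected authRevision value: %x", v)
        }
        fmt.Println("authRevision:", binary.BigEndian.Uint64(v))
        return nil
    })
    if err != nil {
        panic(err)
    }
}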

There are no secrets before the source code?

So far we could confirm that the reason newly written data could not be seen through certain endpoints was the inconsistent AuthRevision. However, whether the data inconsistency was originally caused by AuthRevision could not yet be determined. Why? Because AuthRevision might itself be a victim: for example, the AuthRevision inconsistency and the data inconsistency might share the same root cause, with the AuthRevision inconsistency merely amplifying the data inconsistency. Still, to get closer to the truth, let us first assume that AuthRevision is the culprit behind the inconsistent data, and then find the real reason why AuthRevision became inconsistent.

How do we find the cause? As the saying goes, there are no secrets before the source code, so our first thought was to analyze the code. AuthRevision is only increased when permission-related write operations are performed (such as adding or deleting users/roles, granting permissions to roles, and so on), and the increased AuthRevision is written to the backend cache together with the permission write itself. When the number of pending writes exceeds a threshold (10,000 records by default) or at a fixed interval (100ms by default), the backend flushes the data to the db. Because the persistence of AuthRevision and of operations such as creating a user happens in a single transaction, it is almost impossible for a user to be created without AuthRevision increasing accordingly.

func (as *authStore) UserAdd(r *pb.AuthUserAddRequest) (*pb.AuthUserAddResponse, error) {
    // ...
    tx := as.be.BatchTx()
    tx.Lock()
    defer tx.Unlock()
    putUser(tx, newUser)
    as.commitRevision(tx)
    return &pb.AuthUserAddResponse{}, nil
}

func (t *batchTxBuffered) Unlock() {
    if t.pending != 0 {
        t.backend.readTx.Lock() // blocks txReadBuffer for writing.
        t.buf.writeback(&t.backend.readTx.buf)
        t.backend.readTx.Unlock()
        if t.pending >= t.backend.batchLimit {
            t.commit(false)
        }
    }
    t.batchTx.Unlock()
}

So, since the AuthRevision of the three nodes is inconsistent, could it be that some permission-related write operations were lost on some nodes and never written to the db? If so, the contents of the authUsers and authRoles buckets should differ between nodes. For further verification, we modified the etcd-dump-db tool and added a function to compare the contents of buckets across different DB files. Unfortunately, the comparison showed no difference in the authUsers and authRoles buckets between the three nodes.

Since permission-related writes were not lost on any node, could some commands have been executed repeatedly instead? Looking at the logs from the abnormal period, we found many auth operations; comparing the auth operation logs across the three nodes, some nodes had more of them and some had fewer, which looked like commands being executed repeatedly. Because of log compaction we could not tell for certain whether operations were repeated or lost, but this information gave us plenty of inspiration for the follow-up investigation.

We also observed that although the AuthRevision of different nodes differed, the difference was small and did not change during our stress test. Since the AuthRevision gap was not widening further, the newly added logs were basically useless: the inconsistency was most likely introduced instantaneously at some point in the past. This meant that to find the root cause we needed to actually reproduce an AuthRevision inconsistency or a data inconsistency, and to capture the scene at the moment it happened.

The investigation seemed to be back at square one, but fortunately we had by now eliminated a lot of noise and could focus on the auth operations.

Chaos engineering: a successful reproduction

Since repeated manual attempts to simulate various scenarios had failed to reproduce the problem, we decided to build an automated stress-testing setup to reproduce it. The design mainly considered the following points:

  • How do we increase the probability of reproducing the problem?

    Based on the earlier findings, the data inconsistency is most likely caused by auth operations, so we implemented a monkey script that periodically writes random users and roles to the cluster and grants permissions to those roles, performs ordinary data writes at the same time, and randomly restarts nodes in the cluster, recording the exact time and execution log of every operation (a minimal sketch of such a script is shown after this list).

  • How do we make sure that, if the problem reproduces, the root cause can be located as accurately as possible?

    Based on the earlier analysis, the root cause is probably somewhere in the apply flow, so we added detailed logging to the apply path and printed committedIndex, appliedIndex, consistentIndex and similar information for every apply operation.

  • If the problem does reproduce, how do we detect it as soon as possible?

    Because the log volume is huge, only by noticing the problem immediately can we narrow the time range precisely and locate it conveniently. So we implemented a simple metrics-server that pulls the key count from each node every minute, compares them, and exposes the difference as a metric, which is scraped by Prometheus and displayed in Grafana. Once the difference exceeds a certain threshold (when a large amount of data is being written, counting keys on each node concurrently may yield a small difference, so a tolerance is needed), the unified alerting platform immediately pushes an alert to us so we can react in time.
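The following is a minimal sketch of the kind of monkey script described in the first point above, using the clientv3 auth API. User and role names, endpoints, credentials and intervals are illustrative placeholders; node restarts and the detailed operation logging are omitted.

package main

import (
    "context"
    "fmt"
    "math/rand"
    "time"

    "go.etcd.io/etcd/clientv3"
)

func main() {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"http://node1:2379", "http://node2:2379", "http://node3:2379"}, // placeholders
        Username:    "root",
        Password:    "root-password", // placeholder; authentication must be enabled
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    for i := 0; ; i++ {
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)

        user := fmt.Sprintf("user-%d", rand.Intn(100))
        role := fmt.Sprintf("role-%d", rand.Intn(100))

        // Permission-related writes: these are the operations that bump AuthRevision.
        // Errors such as "user already exists" are expected here and deliberately ignored.
        cli.UserAdd(ctx, user, "password")
        cli.RoleAdd(ctx, role)
        cli.UserGrantRole(ctx, user, role)
        cli.RoleGrantPermission(ctx, role, "/monkey/", "/monkey0",
            clientv3.PermissionType(clientv3.PermReadWrite))

        // Ordinary data writes interleaved with the auth operations.
        cli.Put(ctx, fmt.Sprintf("/monkey/key-%d", i), time.Now().String())

        // (Node restarts were triggered out of band, and every action was logged with its timestamp.)
        cancel()
        time.Sleep(500 * time.Millisecond)
    }
}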

After the plan was ready, we set up a new etcd cluster, deployed our stress-testing setup, and planned to observe it over a long period. At noon the next day we received a WeChat alert: the cluster had inconsistent data.

We immediately logged in to the stress-test machine to analyze the problem. We first stopped the stress script, then checked the AuthRevision of each node in the cluster: the AuthRevision of the three nodes was indeed different! Based on the monitoring data on the Grafana panel, we narrowed the window in which the inconsistency appeared down to about 10 minutes. Analyzing the logs of those 10 minutes, we found that the value of consistentIndex on one node was lower after it restarted than before the restart. Yet all apply operations in etcd depend on consistentIndex to guarantee idempotency: during apply, etcd checks whether the index of the entry to be applied is greater than consistentIndex; if it is, consistentIndex is set to that index and the entry is applied, otherwise the request is considered a duplicate and no actual apply takes place.

// applyEntryNormal applies an EntryNormal type raftpb request to the EtcdServer
func (s *EtcdServer) applyEntryNormal(e *raftpb.Entry) {
    shouldApplyV3 := false
    if e.Index > s.consistIndex.ConsistentIndex() {
        // set the consistent index of current executing entry
        s.consistIndex.setConsistentIndex(e.Index)
        shouldApplyV3 = true
    }
    // ...
    // do not re-apply applied entries.
    if !shouldApplyV3 {
        return
    }
    // ...
}

In other words, because consistentIndex had decreased, the operations whose idempotency etcd relies on it for were no longer idempotent, and as a result the permission-related operations were applied a second time after the etcd restart.

Problem principle analysis

Why did consistentIndex decrease? After reading the consistentIndex code, we finally found the root cause: the persistence of consistentIndex itself depends on data writes going through MVCC. When data is written through MVCC, saveIndex is called to persist consistentIndex to the backend; however, auth-related operations read and write the backend directly and do not go through MVCC. As a result, if a permission-related write happens and no data is subsequently written through MVCC, consistentIndex is not persisted. If etcd is restarted at that point, the permission-related writes are applied twice. As a side effect, AuthRevision may be incremented repeatedly, which directly makes the AuthRevision of different nodes inconsistent, which in turn leads to data inconsistency.

func putUser(lg *zap.Logger, tx backend.BatchTx, user *authpb.User) {
    b, err := user.Marshal()
    // ...
    // Writes to the backend directly, without going through MVCC,
    // so consistentIndex is not persisted along with it.
    tx.UnsafePut(authUsersBucketName, user.Name, b)
}

func (tw *storeTxnWrite) End() {
    // only update index if the txn modifies the mvcc state.
    if len(tw.changes) != 0 {
        // When data is written through MVCC, consistentIndex is persisted here.
        tw.s.saveIndex(tw.tx)
        tw.s.revMu.Lock()
        tw.s.currentRev++
    }
    tw.tx.Unlock()
    if len(tw.changes) != 0 {
        tw.s.revMu.Unlock()
    }
    tw.s.mu.RUnlock()
}

Now, why can inconsistent data be read at all, and why can different reads return different data? Doesn't etcd use the Raft algorithm to guarantee strong consistency? This has to do with how etcd implements reads.

As described in the read flow above, a linearizable read only compares the CommittedIndex obtained from the leader with the AppliedIndex of the current node, and etcd advances AppliedIndex during apply regardless of whether the apply succeeds. So even though an apply fails on the current node, the read path is not aware of the failure when it performs this check, and some nodes may therefore be unable to return the data. In addition, etcd supports multi-version concurrency control: multiple versions of the same key can exist, and if an apply fails, a particular version of the data may fail to be updated, so the latest version of the data differs between nodes and reads return different results.

Scope of impact

This problem was first introduced in 2016. All etcd v3 clusters with authentication enabled are affected; in certain scenarios, data inconsistency occurs between the nodes of an etcd cluster.

Trigger conditions

  1. An etcd v3 cluster is used and authentication is enabled.

  2. A node in the etcd cluster restarts.

  3. Before the node restarts, permission-related write operations are performed (for example, the same permission is granted or revoked multiple times in a short period of time), and no data is written before the node restarts.

  4. Write requests are sent to the cluster through nodes that were not restarted.

Repair plan

Once the root cause is known, the fix is clear: we only need to trigger the persistence of consistentIndex after auth calls commitRevision. This ensures that consistentIndex is correct when etcd restarts, which in turn guarantees the idempotency of auth operations. The specific fix has been submitted to the etcd community as PR #11652; it has been backported to the 3.4 and 3.3 branches and will be included in upcoming releases.

What if the data is already inconsistent? Is there a way to recover? Provided the etcd processes have not been restarted repeatedly, we can first find the node with the smallest AuthRevision: its data should be the most complete. We then use etcd's move-leader command to transfer leadership to that node, and then, one node at a time, remove the other members from the cluster, back up and delete their data directories, and add them back. Each re-added member synchronizes the latest data from the leader, so that in the end the data of every node in the cluster is consistent with the leader, minimizing data loss.
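The member-management part of this recovery procedure can be scripted with the clientv3 Cluster and Maintenance APIs. The following is only a rough sketch of the order of operations: member IDs and peer URLs are placeholders, and backing up and wiping the data directory happens outside the client.

package main

import (
    "context"
    "time"

    "go.etcd.io/etcd/clientv3"
)

func main() {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"http://node1:2379", "http://node2:2379", "http://node3:2379"}, // placeholders
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
    defer cancel()

    // 1. Transfer leadership to the member with the smallest AuthRevision
    //    (its ID, found beforehand, is a placeholder here). As with
    //    etcdctl move-leader, the request must reach the current leader.
    var targetLeaderID uint64 = 0x1234 // placeholder member ID
    if _, err := cli.MoveLeader(ctx, targetLeaderID); err != nil {
        panic(err)
    }

    // 2. For each of the other members, one at a time: remove it, back up and
    //    wipe its data directory out of band, then add it back so it resyncs
    //    a full snapshot from the new leader.
    var staleID uint64 = 0x5678                    // placeholder member ID
    stalePeerURLs := []string{"http://node2:2380"} // placeholder peer URL
    if _, err := cli.MemberRemove(ctx, staleID); err != nil {
        panic(err)
    }
    // ... back up and delete the member's data directory here ...
    if _, err := cli.MemberAdd(ctx, stalePeerURLs); err != nil {
        panic(err)
    }
    // ... restart the member process with --initial-cluster-state=existing ...
}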

Upgrade Suggestions

Note that upgrading carries its own risk: although the new version fixes this issue, the etcd restarts performed during the upgrade can still trigger the bug. Therefore, before upgrading to a fixed version, it is advisable to stop permission-related write operations, manually trigger an ordinary data write (so that consistentIndex is persisted), and only then restart the node, to avoid triggering the problem during the upgrade itself.

In addition, we do not recommend upgrading directly from etcd 3.2 to etcd 3.3. Major version upgrades carry certain risks and need careful testing and evaluation. We have previously found another inconsistency problem caused by the interaction of lease and auth; see issue #11689 and the related PR #11691.

The problem summary

The direct cause of this problem is that consistentIndex is not persisted during permission-related operations, which makes auth-related applies non-idempotent and leads to data inconsistency.

The reason consistentIndex is not persisted in the auth module is that consistentIndex was implemented inside the MVCC module without an exposed persistence interface, so it could only be triggered indirectly and was easy to miss. To improve this, we refactored consistentIndex into a separate module that other modules can call directly, avoiding similar problems in the future; see PR #11699 for details.

On the other hand, the apply operation in etcd is asynchronous, and failures did not produce any error log, which greatly increased the difficulty of troubleshooting. We therefore later submitted PR #11670 to the community to improve the logging when apply fails, making such problems easier to locate.

Another reason the problem was hard to locate is that AuthRevision was not exposed through any metric or API; it could only be read directly from the DB with the etcd-dump-db tool. To make debugging easier, we also added a related metric (github.com/etcd-io/etc…) and extended the Status API to improve the observability and testability of AuthRevision.

References

  1. etcd source code: github.com/etcd-io/etc…
  2. etcd storage implementation: www.codedump.info/post/201811…
  3. An introduction to how etcd works: drap.me/etcd-introd…
  4. Raft design and implementation: zhuanlan.zhihu.com/p/51063866,…

Acknowledgements

Thanks to jingyih and mitake from the etcd community for their help and advice during the PR submission process!
