We used Etcd in a project some time ago, so I studied its source code and implementation. There are plenty of articles on the Internet about how to use Etcd, but few that analyze its architecture and implementation, and the Etcd V3 documentation is quite scarce. This article analyzes the architecture and implementation of Etcd to understand its strengths, weaknesses, and bottlenecks: on the one hand to learn how a distributed system is architected, and on the other to make sure Etcd is used correctly in our services, knowing not only what it does but why, and avoiding misuse. Finally, it introduces the tools around Etcd and some precautions for using it.

Intended audience: Distributed system enthusiasts, developers who are using Etcd or plan to use it in their projects.

Etcd according to the official introduction

Etcd is a distributed, consistent key-value store for shared configuration and service discovery

That is, a distributed, consistent key-value store for shared configuration and service discovery. Etcd is already widely used in many distributed systems. The architecture and implementation part of this article mainly answers the following questions:

  • How does Etcd achieve consistency?
  • How is Etcd storage implemented?
  • How is Etcd’s Watch mechanism implemented?
  • How is Etcd’s key expiration mechanism implemented?

Why do you need Etcd?

All distributed systems face the problem of sharing data between multiple nodes, just like team collaboration: members can work separately, but they always need to share some essential information, such as who the leader is, which members are in the cluster, and how to coordinate the ordering of dependent tasks. So a distributed system either implements a reliable shared storage itself to synchronize this information (as Elasticsearch does), or relies on an external reliable shared storage service. Etcd is such a service.

What capabilities does Etcd offer?

Etcd provides the following capabilities. Those already familiar with Etcd may skip this paragraph.

  • Provides interfaces for storing and retrieving data, and ensures consistency of the data across the nodes of an Etcd cluster through the raft protocol. Used to store meta information and shared configuration (see the brief example after this list).
  • Provides a watch mechanism that lets a client listen for changes to a key or a range of keys (v2 and v3 differ here, see below). Used to watch for changes and push them out.
  • Provides key expiration and renewal mechanisms; a client renews a key by refreshing it periodically (v2 and v3 implement this differently). Used for cluster monitoring and service registration/discovery.
  • Provides atomic CAS (compare-and-swap) and CAD (compare-and-delete) support (v2 via interface parameters, v3 via batch transactions). Used for distributed locks and leader election.
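
As a quick illustration of the first two capabilities, here is a minimal sketch using the Go v3 client (the import path varies with the etcd version, for example github.com/coreos/etcd/clientv3 in older releases; the endpoint and keys are made up for the example):

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// store and read back a piece of shared configuration
	cli.Put(ctx, "/config/db_url", "mysql://10.0.0.1:3306")
	resp, err := cli.Get(ctx, "/config/db_url")
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}

	// watch a prefix and push every change to the application
	for wresp := range cli.Watch(ctx, "/config/", clientv3.WithPrefix()) {
		for _, ev := range wresp.Events {
			fmt.Printf("%s %s -> %s\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}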

More detailed usage scenarios are not covered here; if you are interested, refer to the InfoQ article linked at the end of this article.

How does Etcd achieve consistency?

This brings us to the raft protocol. This article is not specifically about raft, and there is not enough space for a detailed analysis; if you are interested, see the original paper and the raft animation linked at the end of this article. For the convenience of reading the rest of the article, here is a brief summary:

  • Compared with Paxos, raft gives up some generality by designing separate mechanisms for different sub-problems (leader election, log replication), which reduces complexity and makes it easier to understand and implement.
  • The leader election protocol built into raft is used by raft itself; the key to understanding it is to understand the clock cycles (terms) and the timeout mechanisms.
  • The key to understanding Etcd's data synchronization is to understand raft's log replication mechanism.

When implementing raft, Etcd makes full use of Go's CSP concurrency model and the power of channels.
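
To give a feel for this channel-driven style, below is a schematic sketch of the consumption loop documented for the etcd raft library: raft hands everything it wants done to the application through the Ready channel. This is a simplified illustration assuming the go.etcd.io/etcd/raft import path (older releases use github.com/coreos/etcd/raft), not Etcd's actual server code:

package raftsketch

import (
	"time"

	"go.etcd.io/etcd/raft"
)

// runRaftLoop drives a raft.Node: ticks come in on one channel, and raft pushes
// the work it wants done (persist entries, send messages, apply committed
// entries) out on the Ready channel.
func runRaftLoop(n raft.Node, storage *raft.MemoryStorage, ticker *time.Ticker) {
	for {
		select {
		case <-ticker.C:
			n.Tick() // advances election and heartbeat timeouts
		case rd := <-n.Ready():
			storage.Append(rd.Entries) // persist new log entries (the WAL in real Etcd)
			_ = rd.Messages            // these messages would be sent to the other peers here
			for range rd.CommittedEntries {
				// apply each committed entry to the local store
			}
			n.Advance() // tell raft this batch has been processed
		}
	}
}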

WAL logs are binary; when parsed they become the LogEntry data structure. Its first field is type, which has only two values: 0 indicates Normal and 1 indicates ConfChange (ConfChange means a configuration change of Etcd itself is being synchronized, such as a new node joining). The second field is term: each term represents the tenure of one leader and changes whenever the leader changes. The third field is index, which is strictly increasing and represents the change sequence number. The fourth field is the binary data, which holds the serialized pb structure of the raft request object. Etcd ships a tools/etcd-dump-logs utility that dumps WAL logs into text, which is helpful when analyzing the raft protocol.
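
For reference, the entry described above corresponds roughly to the following Go structure (a simplified sketch mirroring the fields of raftpb.Entry, with protobuf details omitted):

package raftsketch

// EntryType distinguishes normal requests from Etcd configuration changes.
type EntryType int32

const (
	EntryNormal     EntryType = 0 // an ordinary application request
	EntryConfChange EntryType = 1 // an Etcd configuration change, e.g. adding a node
)

// Entry is the parsed form of a WAL log record.
type Entry struct {
	Type  EntryType // 0 Normal, 1 ConfChange
	Term  uint64    // leader term; changes whenever the leader changes
	Index uint64    // strictly increasing change number
	Data  []byte    // serialized pb structure of the raft request object
}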

The raft protocol itself does not care about the application payload, i.e. the contents of the data field; consistency is achieved by synchronizing the WAL log, and each node applies the data received from the leader to its local store. Raft is only concerned with the synchronization state of the log; if the local store applies the data incorrectly, data inconsistency can still occur.

Etcd v2 and v3

Etcd V2 and V3 are essentially two independent applications that share the same raft protocol code, with different interfaces, different storage, and data isolation. In other words, if you upgrade from Etcd V2 to Etcd V3, the original v2 data can only be accessed through the V2 interface, and the data created by the V3 interface can only be accessed through the V3 interface. So let’s analyze it in terms of V2 and V3.

Etcd V2 storage, Watch and expiration mechanism

Etcd v2 is a pure in-memory implementation and does not write data to disk in real time. Its persistence mechanism is simple: it serializes the whole store into a JSON file. In memory the data is a simple tree structure. For example, the following data stored in Etcd is kept in memory as a tree keyed by path:

/nodes/1/name  node1
/nodes/1/ip    192.168.1.1

The store has a global currentIndex; every change increments it by 1, and each change event is associated with the currentIndex at the time of the change.

When a client calls watch (carrying a waitIndex), if waitIndex is less than or equal to currentIndex, the EventHistory table is searched for an event whose index is greater than or equal to waitIndex and whose key matches the watched key; if such an event exists it is returned directly. If there is no matching event in the history table, or the request carries no waitIndex, the watcher is placed in the WatchHub, where each key is associated with a list of watchers. When a change is executed, the event it generates is appended to the EventHistory table and the watchers associated with that key are notified.
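
The lookup logic can be summarized with the following simplified sketch (illustrative only, not Etcd's actual code; the type and field names are made up for the example):

package main

import "fmt"

type Event struct {
	Index int
	Key   string
}

type Store struct {
	currentIndex int
	eventHistory []Event                 // capped at 1000 entries in etcd v2
	watchHub     map[string][]chan Event // key -> list of waiting watchers
}

// Watch returns a historical event when waitIndex falls within the retained
// history; otherwise the watcher is parked in the watch hub until a change arrives.
func (s *Store) Watch(key string, waitIndex int) (<-chan Event, *Event) {
	if waitIndex > 0 && waitIndex <= s.currentIndex {
		for _, e := range s.eventHistory {
			if e.Index >= waitIndex && e.Key == key {
				ev := e
				return nil, &ev
			}
		}
	}
	ch := make(chan Event, 100) // etcd v2 gives each watcher a 100-event buffer
	s.watchHub[key] = append(s.watchHub[key], ch)
	return ch, nil
}

func main() {
	s := &Store{currentIndex: 4, eventHistory: []Event{{3, "/a"}, {4, "/b"}}, watchHub: map[string][]chan Event{}}
	_, ev := s.Watch("/a", 2)
	fmt.Println(ev) // finds the historical event {3 /a}
}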

There are a few details that affect usage:

  • EventHistory is limited in length, to at most 1000 entries. That is, if your client stops for a long time and then watches again, the events associated with its waitIndex may already have been evicted, in which case changes are lost.
  • If a watcher is blocked (each watch channel has a buffer of 100 events), Etcd deletes the watcher directly; in other words, the connection of the wait request is closed and the client needs to reconnect.
  • The expiration time is stored on each node of the Etcd store and is cleaned up by a periodic timer.

Thus, some limitations of Etcd V2 can be seen:

  • The expiration time can only be set per key; it is difficult to guarantee the same life cycle for multiple keys.
  • A watch can only cover one key and its children (via the recursive flag); it cannot cover multiple unrelated keys at once.
  • It is difficult to achieve complete data synchronization through the watch mechanism alone (there is a risk of losing changes), so most current usage is to learn that something changed through watch and then re-read the data, rather than relying entirely on watch's change events.

Etcd V3 storage, Watch and expiration mechanism

Etcd V3 implements watch and Store separately. Let’s first analyze the implementation of Store.

The Etcd v3 store is divided into two parts. One is the in-memory index, kvIndex, based on Google's open-source Golang btree implementation. The other is backend storage. The backend is designed to support multiple storage engines; currently boltdb is used. Boltdb is a single-machine kv store that supports transactions, and Etcd's transactions are implemented on top of boltdb's transactions. The key Etcd stores in boltdb is the revision, and the value is Etcd's own key-value pair; in other words, Etcd saves every version in boltdb, which is how the multi-version mechanism is implemented.

For example, using etcdctl to write two records via the batch interface:

etcdctl txn <<<' 
put key1 "v1" 
put key2 "v2" 

' 

Then update these two records through the batch interface:

etcdctl txn <<<' 
put key1 "v12" 
put key2 "v22" 

' 

Boltdb actually has 4 pieces of data:

rev={3 0}, key=key1, value="v1" 
rev={3 1}, key=key2, value="v2" 
rev={4 0}, key=key1, value="v12" 
rev={4 1}, key=key2, value="v22" 

The revision consists of two parts: the first part, the main rev, increments by one per transaction; the second part, the sub rev, increments by one per operation within the same transaction. In the example above, the main rev of the first write operation is 3 and that of the second is 4. Of course, the first concern that comes to mind with this mechanism is space, so Etcd provides commands and settings to control compaction, and put supports parameters to control the number of historical versions kept for a key.

To query data from boltdb you have to go through the revision, but clients always query values by key, so Etcd's in-memory kvIndex keeps the mapping from key to revision to speed up queries.
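
Putting the two parts together, the layout can be sketched roughly like this (illustrative only, not Etcd's actual code; the names are made up for the example):

package main

import "fmt"

type revision struct {
	main int64 // +1 per transaction
	sub  int64 // +1 per operation inside a transaction
}

// backend mimics boltdb: revision -> (key, value); kvIndex maps a user key to its revision history.
var backend = map[revision]struct{ key, value string }{}
var kvIndex = map[string][]revision{}

func put(rev revision, key, value string) {
	backend[rev] = struct{ key, value string }{key, value}
	kvIndex[key] = append(kvIndex[key], rev)
}

// get returns the latest version of a key: look up the newest revision in kvIndex,
// then fetch the stored value from the backend by that revision.
func get(key string) (string, bool) {
	revs := kvIndex[key]
	if len(revs) == 0 {
		return "", false
	}
	kv := backend[revs[len(revs)-1]]
	return kv.value, true
}

func main() {
	put(revision{3, 0}, "key1", "v1")
	put(revision{4, 0}, "key1", "v12")
	fmt.Println(get("key1")) // v12 true
}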

Now let's analyze the implementation of the watch mechanism. Etcd v3's watch supports both watching a fixed key and watching a range of keys (which is how a directory-like structure can be simulated), so a watchGroup contains two kinds of watchers: key watchers, whose data structure is a set of watchers per key, and range watchers, whose data structure is an IntervalTree (see the link at the end of the article if you are not familiar with it), which makes it easy to find, via the interval, the watchers that cover a given key.

At the same time, each WatchableStore contains two watcherGroups, one called synced and one called unsynced. The former means the group's watchers have caught up with the data and are waiting for new changes; the latter means the group's watchers are still behind the latest changes and are catching up.

When Etcd receives a watch request from a client, if the request carries a revision parameter it compares that revision with the store's current revision: if it is greater than the current revision the watcher is put into the synced group, otherwise into the unsynced group. At the same time, Etcd runs a background goroutine that continuously synchronizes the unsynced watchers and then migrates them to the synced group. With this mechanism, Etcd v3 supports starting a watch from an arbitrary version, without v2's limit of a 1000-event history table (assuming, of course, that the history has not been compacted).
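
The routing decision can be sketched as follows (illustrative only, not Etcd's actual code; the names are made up for the example):

package main

import "fmt"

type watcher struct {
	key      string
	startRev int64
}

type watcherGroup struct{ members []*watcher }

func (g *watcherGroup) add(w *watcher) { g.members = append(g.members, w) }

// route puts a new watcher into the synced or unsynced group based on its
// requested start revision relative to the store's current revision.
func route(w *watcher, currentRev int64, synced, unsynced *watcherGroup) {
	if w.startRev > currentRev {
		synced.add(w) // already caught up: just wait for future changes
	} else {
		// behind the store: a background goroutine replays the missing events
		// from boltdb and then moves the watcher into the synced group
		unsynced.add(w)
	}
}

func main() {
	synced, unsynced := &watcherGroup{}, &watcherGroup{}
	route(&watcher{"/a", 10}, 7, synced, unsynced)
	route(&watcher{"/b", 3}, 7, synced, unsynced)
	fmt.Println(len(synced.members), len(unsynced.members)) // 1 1
}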

In addition, as mentioned earlier, when Etcd v2 pushes events to a client, if the network is poor or the client reads slowly and the push blocks, the connection is closed directly and the client has to issue the watch request again. Etcd v3 addresses this by maintaining a queue of watchers whose push was blocked and retrying the push in another goroutine.

Etcd v3 also improves the expiration mechanism: the expiration time is set on a lease, and keys are associated with the lease. Multiple keys can thus be associated with the same lease ID, which makes it easy to give them a uniform expiration time and to renew them in a batch.
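
A minimal sketch with the Go v3 client, under the same import-path assumption as the earlier example (the keys, addresses, and TTL are made up):

package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// one lease with a 10-second TTL
	lease, err := cli.Grant(ctx, 10)
	if err != nil {
		log.Fatal(err)
	}
	// attach several keys to the same lease: they share a single expiration
	cli.Put(ctx, "/services/app/node1", "192.168.1.1", clientv3.WithLease(lease.ID))
	cli.Put(ctx, "/services/app/node2", "192.168.1.2", clientv3.WithLease(lease.ID))
	// renew all of them through a single keep-alive stream
	if _, err := cli.KeepAlive(ctx, lease.ID); err != nil {
		log.Fatal(err)
	}
}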

Some major changes of Etcd V3 compared to Etcd V2:

  • The interface is exposed as RPC through gRPC, abandoning v2's HTTP interface. The advantage is that long-lived connections bring a significant efficiency gain; the disadvantage is that it is less convenient to use than before, especially in scenarios where maintaining a long-lived connection is impractical.
  • The original directory structure has been dropped in favour of a pure KV model; users can simulate directories through prefix matching.
  • Values are no longer kept in memory, so the same amount of memory can hold more keys.
  • The watch mechanism is more reliable; data can basically be fully synchronized through the watch mechanism alone.
  • Etcd v3 provides batch operations and a transaction mechanism; users can implement Etcd v2's CAS semantics through a batch transaction request (batch transactions support if-condition checks), as in the sketch after this list.
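
A minimal sketch of such a compare-and-swap transaction with the Go v3 client (assumed client usage; the key and values are made up, and cli is a *clientv3.Client created as in the earlier examples):

package example

import (
	"context"

	"go.etcd.io/etcd/clientv3"
)

// compareAndSwap emulates v2's CAS: write newVal to /leader only if it still equals oldVal.
func compareAndSwap(ctx context.Context, cli *clientv3.Client, oldVal, newVal string) (bool, error) {
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.Value("/leader"), "=", oldVal)). // the if-condition
		Then(clientv3.OpPut("/leader", newVal)).                      // executed when the condition holds
		Else(clientv3.OpGet("/leader")).                              // executed otherwise
		Commit()
	if err != nil {
		return false, err
	}
	return resp.Succeeded, nil
}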

Etcd, Zookeeper, Consul comparison

These three products are often compared during technology selection. Etcd and Zookeeper provide very similar capabilities: both are general-purpose consistent meta-information stores, both provide a watch mechanism for change notification and distribution, and both are used by distributed systems as shared information storage. Their positions in the software ecosystem are almost the same, and they can substitute for each other.

Beyond differences in implementation details, language, and consistency protocol, the biggest difference between them lies in the surrounding ecosystem. Zookeeper is an Apache project written in Java that provides an RPC interface; it was originally incubated from the Hadoop project and is widely used in distributed systems of that generation (Hadoop, Solr, Kafka, Mesos, etc.). Etcd is an open-source product from CoreOS and is comparatively new; it has gained a following with its simple REST interface and active community, and is used in newer clusters (such as Kubernetes). Although v3 switched to a binary RPC interface for performance, it is still easier to use than Zookeeper.

Consul's goal is more specific. Etcd and Zookeeper provide distributed consistent storage, and users have to implement concrete business scenarios, such as service discovery and configuration change, on top of it themselves. Consul targets service discovery and configuration change directly and comes with KV storage. In the software ecosystem, the more abstract a component is, the wider its range of application, but it is also bound to fall short of some needs in specific business scenarios.

Etcd peripheral tools

Confd

In a distributed system, it is ideal for applications to interact directly with a service-discovery/configuration center such as Etcd, watching Etcd for service discovery and configuration changes. But we have many legacy applications whose service discovery and configuration are mostly done by changing configuration files. Etcd itself is positioned as a general-purpose KV store, so it has no mechanism or tool like Consul's to implement configuration changes; Confd is the tool that fills this gap. Confd listens for Etcd changes through the watch mechanism and then synchronizes the data into a local store of its own. Users can configure which key changes they care about and provide a configuration file template; as soon as Confd detects a data change, it renders the template with the latest data to generate a configuration file. If the new file differs from the old one, it replaces the old file and triggers a user-supplied reload script so the application reloads its configuration. Confd implements some of the functionality of consul-agent and consul-template. Its author is Kelsey Hightower of Kubernetes fame, but he seems quite busy and does not have much time for this project, so I forked it and maintain my own version, mainly adding new template functions and support for the Metad backend.

Metad

Service registration is generally implemented in one of two ways: either the scheduling system registers on the application's behalf, or the application registers itself. When the scheduling system registers for the application, the application needs a mechanism to find out "who am I" after it starts, and then to discover its own cluster and its own configuration. Metad provides such a mechanism: the client requests Metad's fixed interface /self, and Metad returns the meta information of the application the request belongs to, which simplifies the client's service discovery and configuration logic. Metad does this by keeping a mapping from IP to a meta-information path; it currently supports Etcd v3 and provides a simple, easy-to-use HTTP REST interface. It synchronizes Etcd data into local memory through the watch mechanism, acting as a proxy for Etcd, so it can also be used as an Etcd proxy, suitable for scenarios where Etcd v3's RPC interface is inconvenient to use or where you want to reduce the load on Etcd.

Etcd cluster setup script

The official one-click Etcd cluster setup script has a bug, so I prepared my own: using Docker's networking features, it builds a local Etcd cluster with one command, which is convenient for testing and experimentation. See the one-click setup script linked at the end of this article.

Precautions for using Etcd

Etcd cluster initialization problem

If one of the nodes is not started during the initial bootstrap of the cluster, accessing it through the v3 interface reports the error "Etcdserver: not capable". For backward compatibility, the cluster API version defaults to 2.3 when the cluster is first started, and it is upgraded to v3 only after all members of the cluster have joined and all of them support the v3 interface. This only happens when the cluster is initialized for the first time; if a node goes down after the cluster has already been initialized, or if the cluster is shut down and restarted (in which case the cluster API version is loaded from persisted data), the problem does not occur.

Etcd read request mechanism

In v2, a read goes through raft when quorum=true; when requested through the CLI, this parameter defaults to true. In v3, a read goes through raft when --consistency="l" (linearizable, the default); otherwise the node answers from its local data. In the SDK this is controlled by whether the WithSerializable option is enabled. With consistent reads, every read also goes through the raft protocol, which guarantees consistency but hurts performance; if a network partition occurs, the minority side of the cluster cannot serve consistent reads. Without the option, reads are answered directly from the local store, and consistency is lost. You need to decide based on the application scenario and make a trade-off between consistency and availability.
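
With the Go v3 client the two read modes look like this (a minimal sketch under the same import-path assumption as the earlier examples; the key is made up):

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// default: linearizable read, goes through raft, consistent but slower
	consistent, err := cli.Get(ctx, "/config/db_url")
	if err != nil {
		log.Fatal(err)
	}
	// serializable read: answered from the local node's data, faster and still
	// available on a minority partition, but possibly stale
	local, err := cli.Get(ctx, "/config/db_url", clientv3.WithSerializable())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(consistent.Kvs, local.Kvs)
}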

Compact mechanism of Etcd

By default Etcd does not compact automatically; you need to set a startup parameter (auto-compaction retention) or run the compaction command. Compaction is recommended if you update keys frequently, otherwise space and memory are wasted and you will eventually hit the error "etcdserver: mvcc: database space exceeded", after which data can no longer be written.
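
For example, a minimal sketch of triggering compaction from the Go v3 client (assumed usage; cli is a *clientv3.Client created as in the earlier examples). The same can be done from the command line with etcdctl's compact command or the --auto-compaction-retention startup flag:

package example

import (
	"context"

	"go.etcd.io/etcd/clientv3"
)

// compactToCurrent discards all key versions older than the store's current revision.
func compactToCurrent(ctx context.Context, cli *clientv3.Client) error {
	// any Get response carries the current store revision in its header
	resp, err := cli.Get(ctx, "compaction-probe")
	if err != nil {
		return err
	}
	_, err = cli.Compact(ctx, resp.Header.Revision)
	return err
}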

— From Midnight Coffee jolestar.com/etcd-archit…


  • Raft’s website has paper addresses and related information.
  • See this animation for raft.
  • Interval_tree
  • Etcd: Comprehensive interpretation from application scenarios to implementation principles This article describes the usage scenarios in a comprehensive way.
  • Confd Our modified version of confD repository address.
  • Metad warehouse address.
  • Etcd cluster setup script
  • Thread, Goroutine, Actor: The Pain of Concurrency, Thread, Goroutine, Actor: The pain of concurrency