In 2013, a startup called CoreOS argued that most security problems ultimately come down to the ability to update software: if you could build a server that updated itself automatically, a whole class of security problems would go away. To achieve this, the application has to be packaged into a single unit that isolates it from the base operating system, and CoreOS used container technology to do exactly that. At the same time, CoreOS also shared its cluster management scheme, so that users could manage a cluster as easily as a single machine.

The first question the CoreOS team faced in managing a cluster was: how do you make the cluster highly available? In other words, they wanted to be able to restart any node in the cluster without taking applications down, which means each application has to run as multiple replicas, and multiple replicas in turn require a coordination service. CoreOS started by looking for a suitable open source coordination service in the community, one that would meet five goals:

  1. High availability

    The coordination service acts as the control plane of the cluster and stores the deployment information of every service. If it fails, the cluster can no longer be changed and the number of service replicas can no longer be adjusted, which in turn can hurt the user experience, for example slow data loading when too few nodes are left to serve heavy traffic.

  2. Low data volume

    It only needs to store key metadata such as cluster and node configuration.

  3. Maintainability

    In a distributed system, nodes are constantly being added, removed, or replaced, and every such change must be reflected in the coordination service’s own membership. If these operations are performed on the nodes by hand, a mistake can easily cause an outage; exposing them through the coordination service’s API makes the process far smoother.

  4. Data consistency

    High availability rules out a single point of failure, so the data must be replicated across multiple nodes, and multiple replicas immediately raise the problem of keeping their data consistent.

  5. Node monitoring plus basic create, read, update, and delete operations

    When a node fails or its data changes, the coordination service should push a notification rather than rely on the control plane polling on a schedule; turning “pull” into “push” avoids unnecessary load on the coordination service (see the sketch below).
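To make the contrast concrete, here is a minimal Go sketch of the two models. The Event type, key names, and timings are purely illustrative and not part of any real CoreOS or etcd API:

```go
package main

import (
	"fmt"
	"time"
)

// Event describes a node change. This type is purely illustrative.
type Event struct{ Key, Value string }

func main() {
	events := make(chan Event)

	// "Push": the coordination service emits an event only when something
	// actually changes, so an idle cluster costs the control plane nothing.
	go func() {
		time.Sleep(time.Second)
		events <- Event{Key: "/nodes/node1", Value: "offline"}
		close(events)
	}()

	// Contrast with "pull": a loop that re-reads cluster state every few
	// seconds pays for a request even when nothing has changed.
	for ev := range events {
		fmt.Println("changed:", ev.Key, "=", ev.Value)
	}
}
```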

With these goals in mind, CoreOS went looking for such a coordination service, but in 2013 no existing project met them all, so the team decided to build its own, which brought it to the question of technology selection.

To solve the data consistency problem, CoreOS adopted the Raft consensus algorithm. Raft is designed to be easy to understand: it decomposes consensus into three relatively independent sub-problems, namely leader election, log replication, and safety.
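As a rough illustration of how these sub-problems map onto the state each node keeps, here is a minimal sketch that follows the field names used in the Raft paper; it is not etcd’s actual implementation:

```go
package main

// LogEntry is one replicated command, tagged with the term of the leader
// that appended it and its position in the log.
type LogEntry struct {
	Term    uint64
	Index   uint64
	Command []byte
}

// raftState sketches the per-node state behind the first two sub-problems.
// The third, safety, is not extra state but a rule: a node only grants its
// vote to a candidate whose log is at least as up-to-date as its own, so an
// elected leader always holds every committed entry.
type raftState struct {
	// Leader election: the latest term this node has seen, and which
	// candidate (if any) received its vote in that term.
	currentTerm uint64
	votedFor    string

	// Log replication: entries are appended by the leader and copied to
	// followers; an entry is committed once a majority has stored it.
	log         []LogEntry
	commitIndex uint64
}

func main() { _ = raftState{} }
```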

For the data model and API, CoreOS borrowed ZooKeeper’s directory-based hierarchical model, but exposed it through a simpler REST API for creating, reading, updating, deleting, and watching nodes.
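For example, under the v2 HTTP API a key can be written and read back with plain HTTP calls. The sketch below assumes an etcd server with the v2 key-value API enabled and reachable at 127.0.0.1:2379; the paths and values are made up for illustration:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
)

const base = "http://127.0.0.1:2379/v2/keys" // assumes a local etcd server with the v2 API

func main() {
	// PUT /v2/keys/services/web with value=10.0.0.1 creates the node (and any
	// missing parent directory) in the hierarchical key space.
	form := url.Values{"value": {"10.0.0.1"}}
	req, err := http.NewRequest(http.MethodPut, base+"/services/web",
		strings.NewReader(form.Encode()))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	if _, err := http.DefaultClient.Do(req); err != nil {
		panic(err)
	}

	// GET /v2/keys/services/web reads it back; adding ?wait=true instead
	// turns the same call into a watch that blocks until the node changes.
	resp, err := http.Get(base + "/services/web")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```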

For the storage engine, etcd keeps nodes in a simple in-memory tree, with each node holding little more than its path, its value, and its children. This is a typical low-data-volume design.
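A simplified sketch of such an in-memory node might look like the following. The types and the lookup helper are illustrative only, not etcd’s real v2 store code:

```go
package main

// node is a simplified in-memory tree node: a path, a value, and children.
type node struct {
	Path     string
	Value    string
	Children map[string]*node
}

// get walks the tree one path component at a time, returning nil if any
// component is missing.
func (n *node) get(parts []string) *node {
	if len(parts) == 0 {
		return n
	}
	child, ok := n.Children[parts[0]]
	if !ok {
		return nil
	}
	return child.get(parts[1:])
}

func main() {}
```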

Finally, maintainability: the Raft algorithm comes with a membership change mechanism, on top of which etcd can track members as they come online, go offline, or change.

Based on these technology choices, the CoreOS team released v0.1, the first version of etcd, written in Go, in 2013. Why the name etcd? The CoreOS team took inspiration from the Unix “/etc” directory and the “d” of distributed systems: together, etcd is a storage service for the configuration of a distributed system.

In June 2014, Google’s Kubernetes project burst onto the scene and chose etcd as the cornerstone of its service coordination, because etcd’s goals matched the features Kubernetes needed, such as high availability and the Watch mechanism. Beyond the technical fit, CoreOS was also a core member of the Kubernetes container ecosystem. Driven by Kubernetes, etcd entered an era of rapid development, and in January 2015 etcd v2.0 was released. By then etcd was already widely used for configuration storage, service discovery, and leader election, but it still had a number of functional and performance problems:

  1. Functional limitations

    First, etcd v2.0 supports neither range queries nor pagination. As Kubernetes clusters grow, they can hold thousands of resources such as Pods and Events, and without pagination every read turns into an expensive request, causing performance problems. Second, multi-key transactions are not supported, yet real scenarios often need to update several keys atomically (see the sketch after this list for how the later v3 API expresses both).

  2. Performance bottlenecks

    etcd’s performance is capped by the HTTP/1.x protocol its API is built on. HTTP/1.x has no compression mechanism, so when clients pull large numbers of Pods in batches, it is easy to run into packet loss, high CPU load, and out-of-memory problems on both the server and etcd.

  3. Memory overhead problem

    As mentioned above, etcd stores its nodes in an in-memory tree. In scenarios with large amounts of data and many configuration items, the entire dataset lives in memory, and the resulting memory footprint hurts the stability of the system.

  4. Unreliable Watch mechanism

    Kubernetes relies heavily on etcd’s Watch mechanism, but etcd v2.0 is an in-memory store and does not keep historical versions of a node; it only keeps the most recent changes in an in-memory sliding window. When the etcd server receives many writes or the network fluctuates, events are easily lost, which forces clients to re-pull all of their data and generates a flood of expensive requests.
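For comparison, the v3 Go client that later closed the gaps in the first point expresses a paged range query and a multi-key transaction roughly as follows. This sketch assumes a current etcd v3 release, its go.etcd.io/etcd/client/v3 module, and a local endpoint; the key names are made up:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Range query with a limit: fetch at most 100 keys under /pods/ instead
	// of pulling every key in one expensive request.
	resp, err := cli.Get(ctx, "/pods/", clientv3.WithPrefix(), clientv3.WithLimit(100))
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
	fmt.Println("more pages remain:", resp.More)

	// Multi-key transaction: update two keys atomically, guarded by a
	// condition on a third.
	_, err = cli.Txn(ctx).
		If(clientv3.Compare(clientv3.Version("/config/owner"), ">", 0)).
		Then(clientv3.OpPut("/config/a", "1"), clientv3.OpPut("/config/b", "2")).
		Commit()
	if err != nil {
		panic(err)
	}
}
```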

Faced with these problems, the CoreOS team listened closely to the open source community and kept improving etcd. To tackle the memory overhead, it built an MVCC database by combining a B-tree index with BoltDB, changed the data model from hierarchical to flat, and implemented persistent storage on top of BoltDB, sharply reducing etcd’s memory consumption. To remove the performance bottleneck, it adopted gRPC, which defines messages with Protobuf and significantly improves encoding and decoding performance, and used HTTP/2 multiplexing to cut the number of connections in scenarios such as watchers. With these solutions in place, etcd v3.0 was officially released in June 2016.
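On the storage side, the BoltDB library (maintained today as bbolt) that backs v3’s persistent store exposes a transactional, on-disk key-value API. The sketch below only illustrates that API; the bucket and key names are arbitrary examples, not etcd’s internal layout:

```go
package main

import (
	"fmt"

	bolt "go.etcd.io/bbolt"
)

func main() {
	db, err := bolt.Open("demo.db", 0o600, nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Writes go through a transaction and are persisted to disk, so the
	// whole keyspace no longer has to live in a memory tree.
	if err := db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("kv"))
		if err != nil {
			return err
		}
		return b.Put([]byte("/config/replicas"), []byte("3"))
	}); err != nil {
		panic(err)
	}

	// Reads use a read-only transaction.
	if err := db.View(func(tx *bolt.Tx) error {
		v := tx.Bucket([]byte("kv")).Get([]byte("/config/replicas"))
		fmt.Println("replicas:", string(v))
		return nil
	}); err != nil {
		panic(err)
	}
}
```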

Today, etcd has 38K stars on GitHub! It has become the metadata store of choice in the cloud native era.