Whether you are preparing for interviews or making a technology selection, it helps to understand five common registries and compare their mechanisms and design principles.

Previous articles:

  • How should programmers face the age-35 career crisis?
  • Two years of experience distilled: how to do project management well
  • A complete set of Java learning materials (140k words) that took half a year to organize
  • I spent three months writing a Go core manual for you
  • Message queues: from selection to principles, this article covers it all
  • RPC frameworks from principles to selection: one article to understand RPC
  • Microservice gateway selection: you have my respect!
  • More…

Hello everyone, I am Lou Zai! Before writing this article, I honestly only had a deep understanding of ETCD; I knew little about ZooKeeper and the other registries, and had never even considered whether ETCD and ZooKeeper were suitable as registries at all.

After nearly two weeks of study, I found that besides ETCD and ZooKeeper, Eureka, Nacos, and Consul are also commonly used as registries. Below, we explore the similarities and differences of these common registries, as a reference for future technology selection.

The full text is close to 8,000 words, which is a bit long, so I suggest bookmarking it first and reading it slowly. Here is the table of contents:

Basic concepts of a registry

What is a registry?

A registry involves three main roles:

  • Service provider (RPC Server): at startup, registers its services with the Registry and periodically sends heartbeats to the Registry to report liveness.
  • Service consumer (RPC Client): at startup, subscribes to the Registry, caches the list of service nodes returned by the Registry in local memory, and connects to the RPC Server.
  • Service Registry: stores the registration information of RPC Servers. When an RPC Server node changes, the Registry synchronizes the change, and RPC Clients refresh the service node list cached in their local memory.

Finally, the RPC Client selects an RPC Server from the locally cached service node list based on a load balancing algorithm and initiates the call.
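To make the three roles concrete, here is a minimal in-memory sketch in Go. The types and method names are illustrative, not any real registry's API: a provider registers, a consumer subscribes and caches the node list, and calls are balanced round-robin over the cache.

```go
package main

import (
	"fmt"
	"sync"
)

// Registry maps a service name to its list of provider addresses.
type Registry struct {
	mu    sync.RWMutex
	nodes map[string][]string
}

func NewRegistry() *Registry {
	return &Registry{nodes: make(map[string][]string)}
}

// Register is called by an RPC Server at startup.
func (r *Registry) Register(service, addr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.nodes[service] = append(r.nodes[service], addr)
}

// Subscribe is called by an RPC Client; the returned slice is
// what the client would cache in local memory.
func (r *Registry) Subscribe(service string) []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return append([]string(nil), r.nodes[service]...)
}

// pick chooses a node from the cached list by round-robin.
func pick(nodes []string, seq int) string {
	return nodes[seq%len(nodes)]
}

func main() {
	reg := NewRegistry()
	reg.Register("HelloWorldService", "100.19.20.01:16888")
	reg.Register("HelloWorldService", "100.19.20.02:16888")

	cached := reg.Subscribe("HelloWorldService") // client-side cache
	for i := 0; i < 3; i++ {
		fmt.Println("call ->", pick(cached, i))
	}
}
```

In a real system the cache would also be refreshed when the registry pushes node changes; that part is covered in the Watch discussion later.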

Functionality a registry must implement

Based on the principles described above, a registry must implement the following functions. To save some effort, I will just paste a picture:

Registry fundamentals

If you already know this part, feel free to skip it; this section is mainly a primer.

CAP theorem

The CAP theorem is an important theorem in distributed architecture:

  • Consistency: all nodes see the same data at the same time;
  • Availability: every request receives a response, regardless of success or failure;
  • Partition tolerance: the system continues to operate despite message loss or network partition between its nodes.

My understanding of P is that the failure or isolation of one part of the system does not prevent the system as a whole from operating or being used, while availability means that the failure of a single node does not prevent the system from accepting or responding to requests.

It is impossible to have all three of C, A, and P; at most two can be guaranteed, for the following reasons:

  • If C is the first priority, then A suffers: data must be synchronized across nodes before responding, otherwise different nodes would return different results; but synchronization takes time, so availability drops.
  • If A is the first priority, then any live node can accept requests, but the response is not guaranteed to be up to date, because in a distributed deployment data synchronization cannot complete instantaneously.
  • If consistency and availability are both satisfied, partition tolerance is hard to guarantee, which effectively means a single point, while partitioning is the basic premise of a distributed system.

Distributed consistency protocols

The main consistency protocols include Paxos, Raft, and ZAB.

Paxos is a message-passing-based consistency algorithm proposed by Leslie Lamport in 1990, and it is notoriously difficult to understand. The biggest difference between Paxos-based data synchronization and the traditional master/slave mode is this: Paxos only needs more than half of its replicas to be online and communicating with each other normally to keep the service continuously available without losing data.

Raft is a consensus algorithm designed by Diego Ongaro and John Ousterhout at Stanford University with understandability as an explicit goal. There are dozens of Raft implementations across many languages, ETCD among them; Google's Kubernetes also uses ETCD as its service discovery component.

Raft is a simplified take on Paxos: easy to understand, easy to implement, and, just like Paxos, Raft can keep providing service as long as more than half of the nodes are healthy. The article "ETCD Tutorial - 2. Raft Protocol" explains the Raft principle in detail and is worth reading if you are interested.

ZooKeeper Atomic Broadcast (ZAB) is the core algorithm ZooKeeper uses to guarantee distributed data consistency. ZAB borrows ideas from Paxos, but unlike Paxos it is not a general-purpose distributed consensus algorithm; it is an atomic broadcast protocol designed specifically for ZooKeeper that supports crash recovery.

Common registries

This section mainly introduces five common registries, namely Zookeeper, Eureka, Nacos, Consul and ETCD.

Zookeeper

Interestingly, ZooKeeper is not officially positioned as a registry, yet many Dubbo-based systems in China use ZooKeeper to fulfill the registry role.

There are many historical reasons for this, which we will not revisit here. ZooKeeper is a very classic piece of distributed coordination middleware. In China, under the influence of the Dubbo framework, ZooKeeper was long considered the default registry choice for RPC service frameworks. With the continued evolution of Dubbo and the emergence of dedicated registry components, registries are now gradually moving away from ZooKeeper even in RPC frameworks. Still, ZooKeeper plays a very important role in common cluster environments; in the Java ecosystem, most cluster deployments depend on ZooKeeper to manage service nodes.

How ZooKeeper implements a registry

See the article “How Zookeeper works as a registry” for more details.

ZooKeeper can act as a Service Registry, letting multiple Service Providers form a cluster, while Service Consumers obtain a specific service's access address (IP + port) from the Service Registry and then call that Service Provider. As shown below:

Whenever a service provider is deployed, it registers its service under a ZooKeeper path of the form /{service}/{version}/{ip:port}.

For example, if our HelloWorldService is deployed on two machines, two nodes are created on ZooKeeper:

  • /HelloWorldService/1.0.0/100.19.20.01:16888
  • /HelloWorldService/1.0.0/100.19.20.02:16888

This description is a little hard to understand, but the following is more intuitive:

In ZooKeeper, service registration is really just the creation of a znode, which stores the service's IP address, port, and invocation details (protocol, serialization format). This node carries the most important responsibility: it is created by the service provider (when the service is published) so that service consumers can read it to locate the provider's real network position and learn how to invoke it.
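Generating the registration path is straightforward; a small Go helper (illustrative, not part of any ZooKeeper client API) shows the /{service}/{version}/{ip:port} layout:

```go
package main

import "fmt"

// znodePath builds the registration path /{service}/{version}/{ip:port}
// that a provider creates in ZooKeeper when it publishes a service.
func znodePath(service, version, ip string, port int) string {
	return fmt.Sprintf("/%s/%s/%s:%d", service, version, ip, port)
}

func main() {
	// Two machines hosting HelloWorldService produce two znodes:
	fmt.Println(znodePath("HelloWorldService", "1.0.0", "100.19.20.01", 16888))
	fmt.Println(znodePath("HelloWorldService", "1.0.0", "100.19.20.02", 16888))
}
```

A real client would create this path as an ephemeral znode, so that it disappears automatically when the provider's session ends.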

The RPC service registration/discovery process can be summarized as follows:

  1. When a service provider starts, it registers its service name and IP address with the registry.
  2. When a service consumer invokes a service for the first time, it looks up the service's IP address list through the registry and caches it locally for subsequent use. On later calls, the consumer no longer contacts the registry, but uses a load balancing algorithm to call one of the providers in the cached IP list directly.
  3. When one of a service provider's servers goes down or offline, its IP address is removed from that service's IP list. The registry then pushes the new IP address list to the consumer machines, which cache it locally.
  4. When all servers for a service go offline, the service goes offline.
  5. Likewise, when a service provider brings a server online, the registry pushes the new IP address list to the consumer machines, which cache it locally.
  6. A service provider can use the number of service consumers as a basis for deciding when to take a service offline.

ZooKeeper provides a "heartbeat detection" function: it periodically sends a request to each service provider (in practice, it maintains a socket connection). If a provider does not respond for a long time, the service center considers it "down" and excludes it.

For example, if the machine 100.100.0.237 goes down, only /HelloWorldService/1.0.0/100.100.0.238:16888 remains on ZooKeeper.

ZooKeeper's Watch mechanism is actually a combination of push and pull:

  • A service consumer watches the relevant path (/HelloWorldService/1.0.0). Once the data on that path changes (nodes added or removed), ZooKeeper sends only an event type and node information to interested clients, not the actual changed content, so the event itself is lightweight. That is the push part.
  • A client that receives a change notification must then fetch the changed data itself, which is the pull part.
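The push-then-pull flow above can be sketched in Go: the server pushes only a lightweight event naming the changed path, and the client pulls the actual node list itself. The names here are illustrative, not ZooKeeper's real API:

```go
package main

import "fmt"

// Event is the lightweight notification a ZooKeeper-style server pushes:
// it names the changed path but carries no data.
type Event struct{ Path string }

type WatchServer struct {
	data     map[string][]string
	watchers map[string][]chan Event
}

func NewWatchServer() *WatchServer {
	return &WatchServer{
		data:     make(map[string][]string),
		watchers: make(map[string][]chan Event),
	}
}

// Watch registers interest in a path (the client side of subscribing).
func (s *WatchServer) Watch(path string) <-chan Event {
	ch := make(chan Event, 1)
	s.watchers[path] = append(s.watchers[path], ch)
	return ch
}

// SetChildren updates a path and pushes a lightweight event (push part).
func (s *WatchServer) SetChildren(path string, children []string) {
	s.data[path] = children
	for _, ch := range s.watchers[path] {
		ch <- Event{Path: path}
	}
}

// GetChildren is what the client calls after a notification (pull part).
func (s *WatchServer) GetChildren(path string) []string {
	return s.data[path]
}

func main() {
	srv := NewWatchServer()
	ev := srv.Watch("/HelloWorldService/1.0.0")

	srv.SetChildren("/HelloWorldService/1.0.0", []string{"100.19.20.02:16888"})

	e := <-ev                        // push: only the path, no payload
	nodes := srv.GetChildren(e.Path) // pull: fetch the actual node list
	fmt.Println("changed path:", e.Path, "nodes:", nodes)
}
```

Keeping the pushed event small is what makes it cheap for the server to notify many watchers at once.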

Why ZooKeeper is not ideal as a registry

ZooKeeper is excellent as a distributed coordination service, but less so as a service discovery service. For service discovery, even a response containing stale information is better than no response at all. When we query the registry for a service list, we can tolerate data from a few minutes ago, but we cannot accept the registry itself being down and unavailable.

In ZooKeeper, however, when the master node loses contact with the other nodes due to a network failure, the remaining nodes re-elect a leader. The problem is that the election takes too long, 30 to 120 seconds, and the whole ZooKeeper cluster is unavailable during the election, which means the registration service is effectively down for that period. In a cloud environment, it is quite likely that a ZooKeeper cluster will lose its master due to network problems. Although service eventually recovers, long stretches of registration unavailability caused by lengthy elections are unacceptable.

So, for a registry, availability matters more than consistency!

In the CAP model, ZooKeeper follows consistency (CP): a request to ZooKeeper gets a consistent data result at any time, but when machines go offline or down, service availability is not guaranteed.

So why doesn't ZooKeeper adopt an eventual-consistency (AP) model? Because ZooKeeper's core algorithm is ZAB, and everything about it is designed for consistency. That is fine for a distributed coordination system, but taking the consistency guarantees ZooKeeper provides for distributed coordination and applying them to a registry, that is, to service discovery scenarios, is not appropriate.

Eureka

Eureka architecture diagram

What, the picture above looks complicated? Let me give you a simplified version:

Eureka characteristics

  • Availability (AP principle): Eureka is designed around AP. As long as one Eureka Server in the cluster survives, the registration service remains available, though the information it returns may not be the latest (strong consistency is not guaranteed).
  • Decentralized architecture: Eureka Server can run multiple instances to form a cluster. Unlike ZooKeeper with its leader election, Eureka Server instances communicate peer-to-peer: there is no master/slave, and every node is a peer. Nodes improve availability by registering with one another, and each node is configured with one or more serviceUrl entries pointing to the other nodes. Each node can be treated as a replica of the others.
  • Automatic request failover: in a cluster, if one Eureka Server goes down, Eureka Clients automatically switch their requests to another Eureka Server node. Once the failed Server recovers, it is brought back under cluster management.
  • Inter-node operation replication: once a node starts receiving client requests, every operation is replicated to all the other Eureka Server nodes it currently knows about.
  • Automatic registration & heartbeat: when a new Eureka Server node starts, it first tries to fetch the full registration list from neighboring nodes to complete initialization. Eureka Server obtains all nodes through the getEurekaServiceUrls() method and keeps them updated through periodic heartbeat renewals.
  • Automatic deregistration: by default, if a Eureka Server does not receive a heartbeat from a service instance within a certain period (heartbeats are sent every 30 seconds by default), it deregisters the instance (after 90 seconds by default, configurable via eureka.instance.lease-expiration-duration-in-seconds).
  • Protected mode: when a Eureka Server node loses too many heartbeats in a short period, the node enters self-protection mode.

Beyond the features above, Eureka also has a self-protection mechanism: if, within 15 minutes, the proportion of instances with a healthy heartbeat falls below 85%, Eureka assumes there is a network failure between the clients and the registry, and the following can happen:

  • Eureka no longer removes services from the registry that would otherwise expire for not sending heartbeats;
  • Eureka still accepts registration and query requests for new services, but does not synchronize them to other nodes (i.e., the current node remains available);
  • When the network stabilizes, the new registrations on the current instance are synchronized to the other nodes.
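Assuming the self-protection trigger is simply "actual heartbeat renewals in the window fell below 85% of the expected count" (the renewal-percent threshold is configurable in Eureka), the check can be sketched in Go:

```go
package main

import "fmt"

// inSelfPreservation reports whether a Eureka-style server should enter
// self-preservation mode: actual renewals in the window fell below the
// threshold fraction (0.85 by default) of the expected renewals.
func inSelfPreservation(expected, actual int, threshold float64) bool {
	return float64(actual) < float64(expected)*threshold
}

func main() {
	// 100 instances expected to renew; only 80 heartbeats arrived.
	fmt.Println(inSelfPreservation(100, 80, 0.85)) // below 85%: protect
	// 90 of 100 heartbeats arrived: above the threshold, no protection.
	fmt.Println(inSelfPreservation(100, 90, 0.85))
}
```

While the condition holds, the server stops evicting expired leases rather than risking mass deregistration caused by a network blip.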

Eureka workflow

Having covered Eureka's core concepts, self-protection mechanism, and cluster behavior, let's review Eureka's overall workflow:

  1. A Eureka Server starts successfully and waits for servers to register. If a cluster is configured at startup, the peers periodically replicate their registries, so every Eureka Server holds an independent and complete copy of the service registry.
  2. A Eureka Client registers its service with the registry using the Eureka Server address.
  3. The Eureka Client sends a heartbeat request to the Eureka Server every 30 seconds to confirm that the client's service is healthy.
  4. If the Eureka Server receives no heartbeat from a Eureka Client within 90 seconds, the registry considers the node dead and deregisters the instance.
  5. If the Eureka Server detects that a large number of Eureka Clients have stopped sending heartbeats within a period of time, it assumes the network may be faulty and enters self-protection mode, no longer deregistering clients that fail to send heartbeats.
  6. When the Eureka Clients' heartbeats recover, the Eureka Server automatically exits self-protection mode.
  7. The Eureka Client periodically fetches the service registry from the registry, either in full or incrementally, and caches it locally.
  8. When invoking a service, the Eureka Client first looks it up in the local cache. On a miss, it refreshes the registry data from the server and updates the local cache.
  9. The Eureka Client obtains the target server's information and invokes the service.
  10. When a Eureka Client shuts down, it sends a cancellation request to the Eureka Server, which removes the instance from the registry.

Walking through Eureka's working principles, one can clearly feel the cleverness of its design, which neatly solves the registry's stability and high-availability problems.

To keep the registry highly available, Eureka tolerates non-strong consistency: data may differ between service nodes, and between Client and Server. It suits scenarios that span multiple machine rooms and demand high registry availability.

Nacos

The following is excerpted from the Nacos official site: nacos.io/zh-cn/docs/…

Nacos is dedicated to helping you discover, configure, and manage microservices. Nacos provides an easy-to-use feature set that helps you quickly implement dynamic service discovery, service configuration, service metadata, and traffic management.

Nacos helps you build, deliver, and manage microservice platforms with greater agility and ease. Nacos is the service infrastructure for building modern, "service"-centric application architectures (e.g., the microservices paradigm and the cloud-native paradigm).

Main features of Nacos

Service discovery and service health Monitoring:

  • Nacos supports both DNS-based and RPC-based service discovery. After a service provider registers a service using the native SDK, OpenAPI, or a standalone agent, service consumers can look up and discover services via DNS or HTTP API.
  • Nacos provides real-time health checks on services to prevent requests from being sent to unhealthy hosts or instances. It supports health checks at the transport layer (ping or TCP) and at the application layer (e.g., HTTP, MySQL, user-defined). For complex cloud environments and network topologies, such as VPCs and edge networks, Nacos offers two health-check modes: agent report and active server-side probing. Nacos also provides a unified health-check dashboard to help you manage service availability and traffic based on health status.

Dynamically configuring services:

  • Dynamic configuration services let you manage application and service configuration for all environments in a centralized, externalized, and dynamic manner.
  • Dynamic configuration removes the need to redeploy applications and services when configuration changes, making configuration management more efficient and agile.
  • Centralized configuration management makes stateless services easier to implement and lets services scale elastically on demand.
  • Nacos provides an easy-to-use UI (see the console demo) to manage the configuration of all services and applications. It also ships a number of out-of-the-box configuration management features, including configuration version tracking, canary release, one-click rollback, and client-side configuration update status tracking, to help you manage configuration changes more safely and reduce the risks of configuration changes in production.

Dynamic DNS service:

  • The dynamic DNS service supports weighted routing, making it easier to implement mid-tier load balancing, flexible routing policies, traffic control, and simple DNS resolution on the data center intranet. It also simplifies DNS-based service discovery and helps eliminate the risk of coupling to vendor-proprietary service discovery APIs.
  • Nacos provides simple DNS APIs to help you manage a service's associated domain names and its list of available IP:PORT entries.

A brief summary:

  • Nacos is open-sourced by Alibaba and supports both DNS-based and RPC-based service discovery.
  • The Nacos registry supports both CP and AP; switching between them is just a command. It also supports migrating from various other registries to Nacos. In short: whatever you want, it has.
  • Beyond service registration and discovery, Nacos also supports dynamic configuration. In one sentence: Nacos = Spring Cloud registry + Spring Cloud configuration center.

Consul

Consul is an open source tool from HashiCorp for service discovery and configuration in distributed systems. Consul’s solution is more “one-stop” than other distributed service registration and discovery solutions, with built-in service registration and discovery framework, distributed consistency protocol implementation, health check, Key/Value storage, and multi-data center solution, eliminating the need to rely on other tools (such as ZooKeeper, etc.).

Consul is also relatively simple to use. It is written in Go, which makes it naturally portable (Linux, Windows, and macOS are supported); the installation package contains just one executable, so deployment is easy, and it works seamlessly with lightweight containers such as Docker.

Consul call procedure

  1. When a Producer starts, it sends a POST request to Consul, telling Consul its IP address and port.
  2. After the Producer registers, Consul sends it a health-check request every 10 seconds (by default) to verify that the Producer is healthy.
  3. When a Consumer wants to send a GET request to the Producer's /api/address, it first fetches from Consul a temporary table of service IP addresses and ports, takes the Producer's IP and port from that table, and then sends the GET request to /api/address.
  4. The temporary table is refreshed every 10 seconds and contains only Producers that have passed their health checks.
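The "temporary table" in steps 3 and 4 is essentially the registered instance list filtered by health-check status. A Go sketch of that filtering (the types are illustrative, not Consul's API):

```go
package main

import "fmt"

// instance is one registered Producer as a Consul-like agent might track it.
type instance struct {
	addr    string
	port    int
	passing bool // result of the periodic health check
}

// healthyTable returns only instances whose health checks pass;
// this is the "temporary table" a Consumer fetches before calling
// the Producer's endpoint.
func healthyTable(all []instance) []string {
	var out []string
	for _, ins := range all {
		if ins.passing {
			out = append(out, fmt.Sprintf("%s:%d", ins.addr, ins.port))
		}
	}
	return out
}

func main() {
	registered := []instance{
		{"10.0.0.5", 8080, true},
		{"10.0.0.6", 8080, false}, // failed its last health check
		{"10.0.0.7", 8080, true},
	}
	fmt.Println(healthyTable(registered))
}
```

Because the table is rebuilt on each refresh, an instance that recovers simply reappears on the next cycle with no extra bookkeeping.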

Consul main features

  • CP model: uses the Raft algorithm to guarantee strong consistency, at the cost of availability;
  • Supports service registration and discovery, health checks, and a KV store;
  • Supports multiple data centers (DCs), avoiding a single point of failure in any one DC, though network latency and partitioning must be considered in multi-DC deployments;
  • Supports secure service communication: Consul can generate and distribute TLS certificates for services to establish mutual TLS connections;
  • Supports HTTP and DNS interfaces;
  • Provides an official web management UI.

Multi-data center

Beyond the basics, let's look at how Consul's multi-data-center support is implemented.

Consul supports multiple data centers out of the box, which means users don’t need to worry about building an additional layer of abstraction to scale out to multiple regions.

In the figure above, there are two datacenters that are connected over the Internet, and note that only the Server nodes participate in cross-datacenter communication for efficiency.

Within a data center, Consul nodes are divided into Clients and Servers (collectively called Agents). Server nodes store data, while Clients perform health checks and forward requests to the Servers. Among the Servers, one is the Leader and the rest are Followers; the Leader synchronizes data to the Followers. The recommended number of Servers is 3 or 5.

Consul nodes within a cluster use the Gossip protocol to maintain membership: every node knows which nodes are in the cluster and whether each is a Client or a Server. The gossip protocol within a single data center uses both TCP and UDP on port 8301; the gossip protocol across data centers also uses both TCP and UDP, on port 8302.

Read and write requests for cluster data can either go directly to a Server or be forwarded to one by a Client via RPC; requests ultimately reach the Leader node. Where some data staleness is acceptable, reads can also be served by ordinary Server nodes. Data is read, written, and replicated within the cluster over TCP port 8300.

ETCD

ETCD is a distributed, highly available, consistent key-value store written in Go, providing reliable distributed key-value storage, configuration sharing, and service discovery.

ETCD characteristics

  • Easy to use: exposes an HTTP + JSON API that can be driven with nothing more than curl;
  • Easy deployment: written in Go language, cross-platform, simple deployment and maintenance;
  • Strong consistency: Raft algorithm is used to fully ensure the strong consistency of distributed system data;
  • High availability: provides fault tolerance; in a cluster of N nodes, the service remains available even if (N-1)/2 nodes fail;
  • Persistence: After data is updated, it is persisted to disks in WAL format and Snapshot is supported.
  • Fast: each instance supports 1,000 writes per second, with peak write performance reaching 10K QPS;
  • Secure: optional SSL client certificate authentication;
  • ETCD 3.0: In addition to the above functions, it also supports gRPC communication and Watch mechanism.
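The (N-1)/2 figure in the high-availability bullet follows from Raft's majority-quorum rule: a cluster of N nodes makes progress while N/2+1 nodes are alive, so it can tolerate (N-1)/2 failures. A quick check in Go:

```go
package main

import "fmt"

// quorum returns how many nodes must be alive for a Raft cluster
// of size n to keep making progress (a strict majority).
func quorum(n int) int { return n/2 + 1 }

// tolerable returns how many node failures a cluster of size n survives.
func tolerable(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("n=%d quorum=%d tolerates=%d failures\n",
			n, quorum(n), tolerable(n))
	}
}
```

This is also why even cluster sizes add no fault tolerance: a 4-node cluster tolerates one failure, the same as a 3-node cluster, which is why 3 or 5 nodes are the usual recommendation.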

ETCD architecture

ETCD is divided into four parts:

  • HTTP Server: handles API requests from users as well as synchronization and heartbeat requests from other ETCD nodes.
  • Store: handles the transactions behind the various functions ETCD supports, including data indexing, node state changes, monitoring and feedback, and event handling and execution. It is the concrete implementation of most of the API functionality ETCD exposes to users.
  • Raft: the concrete implementation of the Raft strong-consistency algorithm, and the core of ETCD.
  • WAL: Write-Ahead Log, ETCD's storage format. Besides holding all data state and node indexes in memory, ETCD persists data through the WAL: every piece of data is logged before it is committed. Snapshot is a state snapshot taken to keep the log from growing too large; Entry represents the actual log records.

Typically, a user request is forwarded by the HTTP Server to the Store for the specific transaction. If node changes are involved, the request is handed to the Raft module for state change and logging, then synchronized to the other ETCD nodes to confirm the commit, and finally the data is committed and synchronized again.

For more information on ETCD, check out the article “One month of Living ETCD, From Raft Principles to Practice”.

Registry comparison & selection

Registry comparison

  • Service health checks: Eureka requires health-check support to be explicitly configured; ZooKeeper and Etcd consider a service unhealthy as soon as its connection (session or lease) is lost; Consul's checks are more detailed, covering things like whether memory usage has exceeded 90% or whether the file system is running out of space.
  • Multi-data center: Consul and Nacos support it; the other products require extra development work to achieve it.
  • KV storage service: apart from Eureka, all of these products can expose a K-V storage service, which is an important reason, discussed later, why they pursue strong consistency. Providing storage also makes it easier to evolve into a dynamic configuration service.
  • Trade-offs in CAP theory:
    • Eureka is a typical AP system, and Nacos can be configured as AP, which suits service discovery in distributed scenarios where availability takes priority and consistency lapses are not particularly fatal.
    • ZooKeeper, Etcd, and Consul sacrifice availability and have little advantage in service discovery scenarios.
  • Watch support: ZooKeeper supports server-side push of changes; the others detect changes through long polling.
  • Cluster self-monitoring: ZooKeeper and Nacos support metrics by default, so operators can collect and report metrics for monitoring purposes.
  • Spring Cloud integration: each currently has a corresponding Spring Boot starter providing integration capability.

Registry selection

As for comparing and choosing a registry, the sections above should already make things clear; here is my personal take:

  • CP or AP: I choose AP, so Eureka and Nacos are preferred, because availability matters more than consistency here. Between Eureka and Nacos, I pick whichever leaves me less work to do, and Nacos clearly does more for us.
  • Technology stack: Etcd and Consul are developed in Go, while Eureka, Nacos, and ZooKeeper are developed in Java. A project may prefer the registry that matches its own technology stack.
  • High availability: all of these open-source products have thought through how to build highly available clusters, with some differences between them;
  • Project activity: all of these open-source projects are generally very active.

As the saying goes, it is better to have no books than to believe everything in them. Given my limited ability, omissions and mistakes are hard to avoid; if you find problems or have better suggestions, criticism and corrections are welcome.