It is often fun to stand at a crossroads of the present and look back at the turns of history, because we can idly entertain wild what-ifs: what if this had happened earlier, or that had never happened at all? What if Archduke Franz Ferdinand, heir to the Austro-Hungarian throne, and his wife had not been shot by Gavrilo Princip, a young Serb? What if the motorcade's route that day had not passed where it did?

At the end of 2008, Taobao started an internal re-architecture project called "Multicolor Stone", which set Taobao on its path of self-developed, service-oriented, distributed architecture and gave rise to the early Internet middleware system. Taobao's service registry, ConfigServer, was born in the same year.

Around 2008, Yahoo, once an Internet giant, began opening up its big data distributed coordination product ZooKeeper, which drew on Google's published papers on Chubby and Paxos.

In November 2010, ZooKeeper was promoted from a subproject of Apache Hadoop to a top-level Apache project, officially declaring it an industrial-grade, mature, and stable product.

In 2011, Alibaba open-sourced Dubbo. To open it up properly, Dubbo had to be decoupled from Alibaba's internal systems, so it adopted the open-source ZooKeeper as its registry. The canonical Dubbo + ZooKeeper service-oriented stack is what gave ZooKeeper its reputation as a registry.

By November 11, 2015, ConfigServer had been in service for nearly 8 years. Alibaba's internal "service scale" had grown past the million mark, and this, together with the push for the "thousands of miles away" IDC disaster-recovery strategy, led Alibaba to begin the architecture upgrade from ConfigServer 2.0 to ConfigServer 3.0.

Time is approaching 2018. Standing at this ten-year crossroads, how many people are willing to slow down for a moment, amid the chase after ever-changing new technology concepts, and take a closer look at the field of service discovery? How many have stopped to think about the following question:

Is ZooKeeper really the best choice for service discovery?

Looking back at this history, we occasionally indulge in a daydream: in the service discovery scenario, what if ZooKeeper had been born earlier than our HSF registry, ConfigServer?

Would we have adopted ZooKeeper first, and then frantically patched and reworked it to fit Alibaba's service-oriented scenarios and needs?

However, standing on the shoulders of the present and of our predecessors, we have never been more firmly convinced that ZooKeeper is not the best choice in the field of service discovery. Eureka, which has been around for years, reached the same conclusion, as did the article "Eureka! Why You Shouldn't Use ZooKeeper for Service Discovery".

We do not walk this path alone.

Registry requirements analysis and key design considerations

Next, let's return to a requirement analysis of service discovery and discuss why ZooKeeper is not the most suitable registry solution, combining Alibaba's practice in key scenarios.

Is the registry a CP or AP system?

The CAP and BASE theorems are familiar to readers and have become key principles guiding the construction of distributed systems and Internet applications. We will not restate them here, but go directly into an analysis of the data consistency and availability requirements of a registry:

Data consistency requirement analysis

The essential function of a registry can be viewed as a query function S = F(service-name), where service-name is the query parameter and the return value is the list of available endpoints (ip:port) of the corresponding service.

Note: "service" is abbreviated as svc below.
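
To make this model concrete, here is a minimal sketch of the query function as a Java interface. The names (Endpoint, Registry) are illustrative, not taken from any particular product:

```java
import java.util.List;

/** Hypothetical endpoint: the (ip, port) pair a registry returns. */
record Endpoint(String ip, int port) {}

/** The registry reduced to its essence: S = F(service-name). */
interface Registry {
    /** Returns the currently available endpoints of the named service. */
    List<Endpoint> lookup(String serviceName);
}
```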

First, let's look at the impact of inconsistency in the key data, the endpoints (ip:port), that is, the consequence when the C in CAP is not met:



As shown in the figure above, suppose svcB is deployed on 10 nodes (replicas), and two queries issued from two different nodes of the caller svcA return inconsistent data for the same service name, say S1 = {ip1, ip2, ip3, ..., ip9} and S2 = {ip2, ip3, ..., ip10}. What is the effect of this inconsistency? As you have probably noticed, the traffic across svcB's nodes merely becomes a little uneven.

Compared with the other 8 nodes {ip2, ..., ip9}, ip1 and ip10 each receive slightly less request traffic. But clearly, in a distributed system, even for identically deployed peer nodes, factors such as request arrival times, hardware state, operating-system scheduling, and virtual-machine activity like GC mean that at any given moment the states of these peer nodes, and hence their traffic, can never be perfectly consistent anyway. As long as the registry converges the data to a consistent state within the time promised by its SLA (say, 1 second), that is, as long as it satisfies eventual consistency, traffic quickly becomes statistically even again. So a registry designed around an eventual-consistency model is perfectly acceptable in production practice.

Partition tolerance and availability requirement analysis

Next, let's look at the impact of registry unavailability on service invocation under a network partition, that is, when the A in CAP is not satisfied.

Consider a typical ZooKeeper five-node deployment across three machine rooms for disaster recovery (a 2-2-1 structure), as shown in the following figure:



When a network partition cuts off machine room 3, that is, when room 3 becomes an island on the network, the ZooKeeper service as a whole is still available, but node ZK5 cannot accept writes because it cannot reach the Leader.

In other words, the application service svcB in machine room 3 can no longer be deployed, restarted, scaled out, or scaled in. Yet from the perspective of the network and of service invocation, although svcA in room 3 can no longer call svcB in rooms 1 and 2, the network between svcA and svcB inside room 3 itself is perfectly fine. Why not let me call the services in my own room?

But now, because the registry has sacrificed availability (A) to preserve data consistency (C) under the partition (P), services within the same machine room cannot even call each other, which is absolutely unacceptable! It is fair to say that in practice, a registry must never, for any reason of its own, break the connectivity between services. This is an iron law of registry design. We will continue this discussion later under registry client disaster recovery.

At the same time, consider the data inconsistency in this case. If rooms 1 and 2 were also isolated from each other, and the svcA in each room could only see the IP list of the svcB in its own room, that is, if the svcB IP lists in the rooms were completely inconsistent, what would happen?

In fact, the impact is small: all calls simply become same-room calls. Indeed, when designing a registry, we sometimes deliberately exploit the registry's tolerance for inconsistent data to help applications proactively make same-room calls, optimizing the RT of the service invocation link!
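
Purely as an illustration, a hypothetical client-side selector that prefers providers in the caller's own machine room might look like the sketch below; the room field and all names are assumptions made for the example:

```java
import java.util.List;

/** Hypothetical endpoint that also carries its machine room. */
record RoomEndpoint(String ip, int port, String room) {}

final class SameRoomSelector {
    /**
     * Prefer providers in the caller's own machine room; fall back to
     * the full list only if the local room has no providers at all.
     */
    static List<RoomEndpoint> select(List<RoomEndpoint> all, String callerRoom) {
        List<RoomEndpoint> local = all.stream()
                .filter(e -> e.room().equals(callerRoom))
                .toList();
        return local.isEmpty() ? all : local;
    }
}
```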

The above analysis shows that in the CAP tradeoff, the availability of the registry is more valuable than strong data consistency, so the overall design should lean toward AP rather than CP: data inconsistency stays within an acceptable range, whereas abandoning A under P completely violates the principle that a registry must not, for any reason of its own, break the connectivity between services.

Service size, capacity, and service connectivity

How big are microservices at your company? Hundreds of microservices? Hundreds of deployed nodes? And what about three years from now? The Internet is a place where miracles happen; your "service" may become a household name overnight, with traffic and scale doubling over and over.

When the service scale of a data center exceeds a certain threshold (service scale = F{number of service publishers, number of service subscribers}), a ZooKeeper registry can quickly find itself as overwhelmed as the donkey shown below.



In fact, when ZooKeeper is used in the right place, namely for coarse-grained distributed locks and distributed coordination, the TPS and number of connections it supports are quite sufficient, because those scenarios place no strong demands on ZooKeeper's scalability or capacity.

But in service discovery and health-monitoring scenarios, as service scale grows, ZooKeeper quickly runs out of steam, whether under the write requests of frequently published services registering themselves, the write requests refreshing service health status at millisecond granularity, or the long-lived connections that practically every machine and container in the data center insists on holding to the registry. And ZooKeeper writes do not scale out: the horizontal write-scalability problem cannot be solved by adding nodes.

One practical way to cope with service-scale growth on top of ZooKeeper is to sort out the business, partition the business domains vertically, and split them across multiple ZooKeeper registries. But as a platform organization providing a universal service to the whole group, is it really feasible to ask the business to carve up its domains to the beat of your technical baton, merely because the service capability you provide is insufficient?

Moreover, this violates the connectivity principle of the registry itself. Should a search service, a map service, a media-and-entertainment service, and a game service really be disconnected from one another forever? Perhaps today, but what about tomorrow, a year from now, ten years from now? Who knows what cross-domain business innovations the future will demand? As an underlying service, a registry should certainly not let uncertainty about the future interfere with the inherent future connectivity needs of business services.

Does the registry need persistent storage and transaction logging?

Yes, and no.

We know that ZooKeeper's ZAB protocol keeps a transaction log on every ZooKeeper node for each write request, and periodically snapshots the in-memory data to disk, ensuring data consistency and durability and allowing recovery after a crash. This is a nice feature, but we have to ask: in a service discovery scenario, does the core data, a real-time list of healthy service addresses, really need to be persisted?

For this data, the answer is no.



As shown in the preceding figure, three write operations hit ZooKeeper as svcB registers (ip1), scales out to two nodes (ip1, ip2), and then scales back in when ip1 goes down.

But on closer analysis, persistently recording this change history in a transaction log makes little sense, because in service discovery, callers only care about the real-time address list and real-time health status of the service they are about to invoke. On each call, they do not care about the historical address lists or the past health status of that service.

So why do we also say yes? Because a fully featured, production-grade registry, beyond the real-time service address list and real-time health status, also stores service metadata: service version, grouping, home data center, weight, authentication policies, service labels, and so on. This metadata does need to be persisted, and the registry should provide the ability to retrieve it.
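
As a sketch only, such persisted metadata might be modeled like this, with illustrative fields taken from the list above; none of these names come from any specific product:

```java
import java.util.Map;

/**
 * Hypothetical persisted service metadata, as opposed to the volatile
 * real-time address list: version, grouping, data center, weight,
 * authentication policy, labels.
 */
record ServiceMeta(
        String serviceName,
        String version,
        String group,
        String dataCenter,
        int weight,
        String authPolicy,
        Map<String, String> labels) {}
```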

Service Health Check

When ZooKeeper serves as the service registry, service health monitoring is usually built on ZooKeeper's session activity tracking combined with Ephemeral ZNodes: the health of a service is tied to the health of its ZooKeeper session, which ultimately rests on liveness detection of a TCP long connection.
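
A minimal sketch of this binding using the plain ZooKeeper Java API is shown below. The connection string, session timeout, and paths are illustrative, and the parent znodes are assumed to already exist:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralRegistration {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble; this toy watcher ignores session events
        // (a real client would wait for SyncConnected before proceeding).
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> {});

        // Register this provider as an ephemeral znode. If the session dies
        // (TCP liveness lost, session timed out), ZooKeeper deletes the node
        // and subscribers see the provider disappear.
        zk.create("/services/svcB/providers/10.0.0.1:20880",
                  new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.EPHEMERAL);
    }
}
```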

In many cases this is fatal: when the TCP long-connection liveness detection between ZK and the service provider's machine reports normal, is the service actually healthy? Of course not! A registry should offer a richer health-monitoring scheme, and the logic that decides whether a service is healthy should be open for the service provider to define, rather than a one-size-fits-all TCP liveness probe!

A basic design principle of health detection is to reflect the true health of the service itself as faithfully as possible; a health verdict that service callers cannot trust is worse than no health detection at all.
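
For instance, the "open to the provider" idea could take a shape like the following sketch; the interface and class names are hypothetical:

```java
/**
 * Hypothetical contract: the provider, not a TCP probe, defines health
 * (thread pools not exhausted, dependencies reachable, warm-up done...).
 */
interface HealthCheck {
    boolean isHealthy();
}

/** Example: report unhealthy while the provider is still warming up. */
class WarmupAwareHealthCheck implements HealthCheck {
    private volatile boolean warmedUp = false;

    void markWarmedUp() { warmedUp = true; }

    @Override
    public boolean isHealthy() {
        return warmedUp; // the TCP connection may be alive long before this is true
    }
}
```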

Disaster recovery considerations for the registry

As mentioned earlier, in practice a registry must never break the connectivity between services for any reason of its own. A closely related availability question is: if the registry itself goes down completely, should the invocation link from svcA to svcB be affected?



No, it should not be affected.

The service invocation (request-response) link should depend only weakly on the registry, relying on it only when necessary: for service publishing, taking machines offline, service scaling, and so on.

This calls for careful design of the client that the registry provides. The client should have disaster-recovery means for when the registry service is completely unavailable; for example, a client-side cached-data mechanism (call it a client snapshot) is an effective one. In addition, the registry's health-check mechanism must be designed carefully so that in such a case clients are never pushed an emptied address list.
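
A sketch of what such a client snapshot could look like, under the assumption that the client mirrors every successful push to local disk; all names here are hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/**
 * Hypothetical client snapshot: every successful address-list push is
 * mirrored to local disk, and the snapshot is the fallback when the
 * registry itself is completely unreachable.
 */
class ClientSnapshot {
    private final Path file;

    ClientSnapshot(Path file) { this.file = file; }

    /** Called on every successful update pushed by the registry. */
    void save(List<String> endpoints) throws IOException {
        Files.write(file, endpoints);
    }

    /** Called when the registry is down and the in-memory cache is empty. */
    List<String> load() throws IOException {
        return Files.readAllLines(file);
    }
}
```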

ZooKeeper's native client has no such capability, so when implementing a registry on ZooKeeper, you must ask yourself: if all ZooKeeper nodes were killed, would every service invocation link in your production environment remain unaffected? And you should run regular failure drills to verify it.

Do you have an expert on ZooKeeper to rely on?

ZooKeeper looks like a simple product, but using it well at scale in production cannot be taken for granted. If you decide to bring ZooKeeper into production, you should expect to need the help of ZooKeeper technical experts at any moment, typically in two respects:

The Client/Session state machine is difficult to master

ZooKeeper's native client is decidedly hard to use; Curator is better, but only up to a point. Fully understanding the interaction protocol between the ZooKeeper client and server is not easy, and understanding the ZooKeeper Client/Session state machine is anything but straightforward:



Yet a ZooKeeper-based service discovery solution leans on exactly these mechanisms: the long connection and Session management, Ephemeral ZNodes, Events and Notifications, and the ping mechanism. To use ZooKeeper well for service discovery, you must understand ZooKeeper's core mechanisms, which can be exasperating: I just want service discovery, why must I know all this? If you do understand all of it and can keep it under control, congratulations, you have become a ZooKeeper technical expert.

Unbearable exception handling

When Alibaba's internal applications connect to ZooKeeper, there is a WIKI named "ZooKeeper Application Access: Must-Knows", which discusses exception handling as follows:

What is the most important thing for application developers to master when using ZooKeeper? Based on our past support experience, it has to be exception handling.

When everything (host, disk, network, and so on) is lucky enough to be working, applications and ZooKeeper will probably work just fine. Unfortunately, we face surprises all day long, and by Murphy's Law the unexpected bad thing happens precisely when you are most worried.

It is therefore essential to learn which exceptions and errors ZooKeeper produces in certain scenarios, to make sure you interpret them correctly, and to know how your application should handle them.

ConnectionLossException and Disconnected events

Simply put, this is a recoverable exception: recovery is possible within the same ZooKeeper Session, but it is the application developer's responsibility to restore the application to the correct state.

It can be triggered, for example, when the network between the application machine and a ZooKeeper node flaps, when a ZooKeeper node goes down, when the server side pauses for a long full GC, or even when your own application process hangs for a while, say recovering after a long full GC of its own.

To understand this exception, you need to understand a typical problem in distributed applications, as shown below:



In a typical client-request, server-response interaction, when the long connection between the two breaks, the client that perceives the failure is put in an awkward position: it cannot determine what state the in-flight request reached. Did the server receive it? Did it process it? Since this cannot be determined, whether the client should retry the request after reconnecting to the server is also an open question.

So on catching a disconnect event, the application developer must know immediately which request was in flight (often hard to determine), whether that request is idempotent, and must make a deliberate choice among "exactly once", "at most once", and "at least once" semantics for how the server side processes the business request.

For example, if an application catches a ConnectionLossException and the previous request was a Create operation, one possible recovery logic is to check whether the node that the previous request tried to create already exists: if it does, stop; otherwise, create it.
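
A sketch of that recovery logic with the ZooKeeper Java API might look like this; the path and data are illustrative, and a production version would bound the retries:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

class CreateRecovery {
    /**
     * On connection loss we cannot know whether the create reached the
     * server, so after reconnecting, check for the node before retrying.
     */
    static void createIfAbsent(ZooKeeper zk, String path, byte[] data)
            throws KeeperException, InterruptedException {
        try {
            zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } catch (KeeperException.ConnectionLossException e) {
            // The request may or may not have been applied on the server.
            if (zk.exists(path, false) == null) {
                createIfAbsent(zk, path, data); // retry; may disconnect again
            }
        }
    }
}
```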

As another example, suppose an application uses an exists watch to listen for the creation of a node that does not yet exist. During a ConnectionLossException, another client process may create the node and then delete it, so the current application misses the creation event of the very node it cares about. What is the impact of that miss on the application? Is it tolerable or unacceptable? Application developers must evaluate and handle this themselves according to their business semantics.

SessionExpiredException and SessionExpired events

Session expiration is an unrecoverable exception. When the application catches it, it cannot restore its state within the same Session: a new Session must be established, the ephemeral nodes associated with the old Session may be gone, and the locks it held may no longer be held.
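
A sketch of the consequence, assuming the client kept track of the ephemeral paths it owned; all names are illustrative, and a real client would also wait for the new session's SyncConnected event before re-creating:

```java
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

class SessionRecovery {
    /**
     * A session expiration cannot be recovered in place: build a brand-new
     * ZooKeeper handle (a new Session) and re-create every ephemeral node
     * the old session owned.
     */
    static ZooKeeper onSessionExpired(String connectString, List<String> ephemeralPaths)
            throws Exception {
        ZooKeeper fresh = new ZooKeeper(connectString, 15000, event -> {});
        for (String path : ephemeralPaths) {
            fresh.create(path, new byte[0],
                         ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        }
        return fresh;
    }
}
```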

While we were trying to build service discovery on ZooKeeper ourselves, a colleague at Alibaba once summarized and shared his pitfall-stepping experience on our intranet technology forum:



The article puts it bluntly:

... In the coding process we found many possible traps. As a rough estimate, more than 80% of those using ZK for the first time to implement cluster management will fall into a pit. Some pits are quite well hidden, surfacing only under network problems or abnormal scenarios, and may go unexposed for a long time ...

This article has since been shared with the cloud community, where you can read it in detail.

Go left, go right

So does Alibaba not use ZooKeeper at all? Not so!

Those familiar with Alibaba's technology stack know that Alibaba maintains what is currently the largest ZooKeeper deployment in China, and perhaps the world, with nearly 1,000 ZooKeeper service nodes overall.

Alibaba middleware also maintains TaoKeeper, a code branch of ZooKeeper built for large-scale production with high availability and easy monitoring and operations. Based on ten years of using ZooKeeper in production across our business lines, if we had to sum ZooKeeper up in one phrase, we would call it "The King of Coordination for Big Data"!



For coarse-grained distributed locks, distributed leader election, and master failover, scenarios that demand high availability but not high TPS, ZooKeeper plays an irreplaceable role. Such needs concentrate in big data and offline-task domains, because the big data field works on partitioned data sets, which task processes and threads mostly handle in parallel; yet there are always points at which these tasks and processes need to be coordinated, and that is exactly where ZooKeeper shines.

On the transaction link of trading scenarios, however, ZooKeeper has natural shortcomings in core business data access, large-scale service discovery, and large-scale health monitoring, so introducing it there should be avoided. In Alibaba's production practice, applications must strictly evaluate their scenario, capacity, and SLA requirements when applying to use ZooKeeper.

So yes, you can use ZooKeeper, but: big data goes left, trading goes right; distributed coordination goes left, service discovery goes right.

Conclusion

Thank you for your patience. By now, I trust you understand that this article is not a wholesale rejection of ZooKeeper, but a summary of our experience and lessons in designing and using service discovery and service registries, drawn from ten years of large-scale service-oriented production practice at Alibaba. We hope it inspires and helps the industry to use ZooKeeper better and to design its own service registries better.

In the end, all roads lead to Rome, and may your registry be born directly in Rome.


Originally published: June 6, 2018

Author: Kun Yu

This article comes from Chat Architecture, a partner of the cloud community. For more information, please follow Chat Architecture.