This article was reposted from the public account: Luohan eating grass


In the two years I have been writing this public account, whenever I get a chance to write about a failure, I quietly stare at the monitor for a long while before starting, and only after a lot of agonizing and struggling do I dare to pick up the pen.

Why? Because posts like this are easy to ridicule: "After all that, wasn't it just a bad configuration?" or "Was this code written by a pig? Does anyone on your team even understand performance testing?"

Comments like that are provocative and full of disdain.

But I think that, most of the time in the world of technology, objective circumstances determine subjective outcomes, and subjective outcomes in turn reflect objective circumstances.

So it is not a bad thing to string the scene and the outcome together, write them down in your own way, share them, and chat with other people who have been through the same thing.

Last month, our system had an incident caused by the collapse of the registry. The failure itself was common enough, but the root cause was something nobody expected: a distributed caching system that had been running in production for years.

What’s going on here?

Let's review how the failure unfolded

It was mid-morning on a trading day in November.

While the middleware monitoring system had not triggered a single alarm, the application team leader suddenly came running over: "Why are cache responses so slow? Are you doing something over there?"

Since this was happening in the middle of the trading session, the middleware operations team immediately went through a series of monitoring data.

We first checked the basic Zabbix alerts (CPU, memory, network, disk) and everything was fine. Then we checked the health of the services themselves. After a lot of twists and turns, nothing suspicious turned up.

Strange. It just didn't make sense.

At 10:30, an alarm message arrived: "A node in the ZK cluster is faulty, its port is unavailable and node information cannot be obtained. Please handle it immediately!"

That one was easy: a ZK service port was unavailable, so we restarted the node and it recovered immediately.

At 10:40, the entire ZK cluster was down and node data could not be obtained. The Dubbo services of the application systems and the distributed cache share the same ZK cluster, but since no application had been restarted during this period, the application services themselves were not affected for the time being (Dubbo consumers keep a local copy of the provider list, so existing calls keep working even when the registry is unreachable).

It made no sense. Neither the application side nor the cache side had released a version in nearly a month, and the distributed cache has almost no dependence on ZK apart from storing some node-related information in it.

At 10:50, the ZK cluster was fully restarted, and it crashed again ten minutes later.

Incredible. What on earth was going wrong?

At 10:55, the ZK cluster was fully restarted again. One minute later, we found that the node count had reached nearly 220,000 (22W+), and it crashed once more.

At 10:58, a monitoring script was added, and it traced the source of the nodes to the local cache service of the distributed caching system.

At 11:00, after shutting down the local cache service via the console, we restarted the ZK cluster for the third time and removed, with a script, the huge number of nodes that the local cache had generated.
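For the record, that cleanup step could look roughly like the sketch below, written against the standard ZooKeeper Java client. The path `/cache/notify`, the connection string and the class name are all hypothetical; the article does not show the real node layout or the actual script.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

import java.util.List;
import java.util.concurrent.CountDownLatch;

/**
 * Rough sketch of the emergency cleanup: after the local cache service
 * was stopped from the console, delete the flood of notification nodes
 * it had written under ZK.  "/cache/notify" is a hypothetical path.
 */
public class ZkNodeCleanup {

    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30_000,
                event -> {
                    if (event.getState() == org.apache.zookeeper.Watcher.Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                });
        connected.await();

        String parent = "/cache/notify";                // hypothetical parent path
        List<String> children = zk.getChildren(parent, false);
        System.out.println("nodes to remove: " + children.size());

        for (String child : children) {
            try {
                // -1 means "any version"; each child is assumed to be a leaf node
                zk.delete(parent + "/" + child, -1);
            } catch (KeeperException.NoNodeException ignored) {
                // already removed, e.g. by another cleanup worker
            }
        }
        zk.close();
    }
}
```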

At 11:05, the entire production ZK cluster had recovered and was running normally.

The storm had passed, but everyone wore a blank expression. How odd: why could this local cache bring down the registry? It had been online for over a year; why had there never been a problem before, and why did it have to happen today?

Everyone's head was full of question marks.

How our local cache works

Last year, in "A Tradeable Distributed Cache Middleware", I gave a fairly detailed description of our distributed cache, so here I will only briefly explain some of the core working mechanisms of our local cache through a few flow diagrams.

| The working mechanism of the local cache

| The working mechanism of the local cache – KEY preload/update

| The working mechanism of the local cache – Set/Delete operations

| The working mechanism of the local cache – Get operation
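The original diagrams are not reproduced here, so as a rough stand-in, here is a minimal sketch of the Get path they describe, under my own assumptions about the design (all class and field names are invented for illustration): reads first consult an in-process map, only keys on the local-cache rule list are kept there, and everything else goes straight to the server-side cache.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal sketch of the local-cache Get path described by the diagrams.
 * All names are illustrative; the real implementation is not shown here.
 */
public class LocalCacheClient {

    private final Map<String, String> localStore = new ConcurrentHashMap<>();
    private final Set<String> localCacheRules;   // keys allowed in the local cache
    private final RemoteCache remoteCache;       // the server-side distributed cache

    public LocalCacheClient(Set<String> localCacheRules, RemoteCache remoteCache) {
        this.localCacheRules = localCacheRules;
        this.remoteCache = remoteCache;
    }

    public String get(String key) {
        // Only rule-listed keys (rarely updated, frequently read) are
        // served from the in-process map.
        if (localCacheRules.contains(key)) {
            String v = localStore.get(key);
            if (v != null) {
                return v;
            }
            v = remoteCache.get(key);        // miss: load from the server cache
            if (v != null) {
                localStore.put(key, v);      // and keep a local copy
            }
            return v;
        }
        // Everything else bypasses the local cache entirely.
        return remoteCache.get(key);
    }

    /** Thin abstraction over the server-side cache; hypothetical. */
    public interface RemoteCache {
        String get(String key);
    }
}
```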

By the way, **for historical reasons and because of resource constraints, part of our cache system shares its ZK cluster with the application systems,** and it is precisely because of this that the hidden danger behind this accident existed.

How did the ZK cluster fail?

At this point, I believe anyone with a basic understanding of middleware can already guess the full picture of the incident.

To put it simply, in the early days after going live, ZK was used to implement the message notifications for our local cache, and a broadcast model was chosen because traffic was low and few application systems were connected.

However, as traffic and the number of connected application systems grew, the number of messages sent increased exponentially, eventually exceeding ZK's carrying capacity, and the cluster collapsed.
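To make that failure mode concrete, here is a minimal sketch of how a ZK-based broadcast notification might look, under my assumptions (the znode path and payload format are invented for illustration): every Set/Delete creates a child znode under a shared notification path, and every cache instance watches that path.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

import java.nio.charset.StandardCharsets;

/**
 * Sketch of a ZK "broadcast" notification for local-cache updates.
 * The path "/cache/notify" (assumed to already exist) and the payload
 * format are hypothetical.
 */
public class ZkBroadcastNotifier {

    private final ZooKeeper zk;

    public ZkBroadcastNotifier(ZooKeeper zk) {
        this.zk = zk;
    }

    /** Called after every Set/Delete on the server-side cache. */
    public void publish(String key, String op) throws Exception {
        byte[] payload = (op + ":" + key).getBytes(StandardCharsets.UTF_8);
        // One persistent-sequential child per notification: every cache
        // instance watching the children of /cache/notify gets a watch
        // event, and the znode count grows with every write unless
        // something cleans these children up.
        zk.create("/cache/notify/msg-", payload,
                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.PERSISTENT_SEQUENTIAL);
    }
}
```

With a scheme like this, a burst of cache writes means both a watch notification fanned out to every instance and a fresh znode per write, which is consistent with the 220,000+ node count seen during the incident.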

Sure, that explanation is roughly right, but why did the number of messages increase exponentially?

Based on how local caching works, what do we typically store in it?

  1. Data with a low update frequency but very frequent access, such as system parameters or business parameters.

  2. Large keys/values whose transfer consumes a lot of network resources and causes a noticeable performance drop.

  3. Data that must remain highly available even when server resources (such as I/O) are insufficient or unstable.

We were baffled. We only put a handful of parameter-type entries in there, and their update frequency is very low, so how could that blow up a five-node ZK cluster?

To get to the bottom of it, we immediately went through the code and found something.

According to the design, when a key completes a server-side cache operation, the "local cache mechanism – Set/Delete operation" should not trigger a message notification unless the key is on the local-cache rule list.

But there was an obvious bug that caused notifications for all keys to be sent to ZK.
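In rough, illustrative terms (the real code is not shown in the article, and these names refer back to the hypothetical helpers sketched above), the intended guard versus what the buggy version effectively did might look like this:

```java
/**
 * Contrast between the intended guard and the bug, in illustrative form.
 * "localCacheRules" and the notifier are the hypothetical helpers from
 * the earlier sketches; the real code is not shown in the article.
 */
public class SetDeleteHook {

    private final java.util.Set<String> localCacheRules;
    private final ZkBroadcastNotifier notifier;

    public SetDeleteHook(java.util.Set<String> localCacheRules, ZkBroadcastNotifier notifier) {
        this.localCacheRules = localCacheRules;
        this.notifier = notifier;
    }

    /** Intended behaviour: only rule-listed keys trigger a ZK notification. */
    public void afterWrite(String key, String op) throws Exception {
        if (localCacheRules.contains(key)) {
            notifier.publish(key, op);
        }
    }

    /** What the buggy version effectively did: every key was pushed to ZK. */
    public void afterWriteBuggy(String key, String op) throws Exception {
        notifier.publish(key, op);
    }
}
```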

That also explained things: although the application side had not released a new version recently, distributed locks had quietly been added to the cache shards through the cache console.

So once trading opened, it took only a few minutes for everything to explode.
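A distributed lock built on top of the cache means every acquire and release is itself a Set/Delete against a cache shard. A rough sketch of such a lock, under my assumptions (the setIfAbsent/delete primitives and all names are hypothetical, not the real console feature):

```java
/**
 * Sketch of a cache-based distributed lock.  Every tryLock/unlock is a
 * Set/Delete against the cache, and with the bug above, each of those
 * writes also pushed a notification znode into ZK.  The API shown here
 * is an assumption for illustration only.
 */
public class CacheDistributedLock {

    /** Minimal write-side view of the cache; hypothetical. */
    public interface WritableCache {
        boolean setIfAbsent(String key, String value, long ttlMillis);
        void delete(String key);
    }

    private final WritableCache cache;

    public CacheDistributedLock(WritableCache cache) {
        this.cache = cache;
    }

    public boolean tryLock(String lockKey, String owner, long ttlMillis) {
        return cache.setIfAbsent(lockKey, owner, ttlMillis);   // one Set per attempt
    }

    public void unlock(String lockKey) {
        cache.delete(lockKey);                                  // one Delete per release
    }
}
```

With the rule-list guard missing, every lock round-trip on the hot trading path also produced a notification znode, which would explain why the cluster blew up within minutes of the market opening.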

In addition to finding the bug, we also reached the following conclusions through post-incident verification:

  1. Using ZK for message synchronization is fragile; ZK's own carrying capacity is limited, and the notification channel should be migrated to an MQ (a rough sketch follows this list);

  2. The monitoring relied on a single method and was too weak;

  3. The deployment structure is unreasonable: the infrastructure's ZK should not be mixed with the applications' ZK.
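For the first point, here is a minimal sketch of what moving the notification off ZK might look like, assuming Kafka as the MQ; the topic name, serialization and class name are illustrative, and the article does not say which MQ was actually chosen.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

/**
 * Sketch of replacing the ZK broadcast with an MQ topic.  Kafka is used
 * only as an example; the actual MQ choice is not stated in the article.
 */
public class MqCacheNotifier {

    private final KafkaProducer<String, String> producer;

    public MqCacheNotifier(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    /** Publish one invalidation message per Set/Delete instead of one znode. */
    public void publish(String key, String op) {
        producer.send(new ProducerRecord<>("local-cache-notify", key, op));
    }
}
```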

And that’s the end of the story.

A few words at the end

After reading this story, some people may be tempted to ask: you designed the architecture and wrote the code yourselves, surely you know your own logic? How could you make such a silly mistake?

It is not that simple. For every technical team, the departure of core members and changes in the business will, to a greater or lesser extent, leave the team "knowing that the system works but not why" with respect to parts of it. Every team tries to avoid this, but it is not easy to eliminate completely.

As a technical manager, keep a good mindset: treat every failure as a chance to grow, draw a summary and lessons from it, pass them on, and make sure the same mistake is not repeated. That alone is a good outcome.

But what if, one day, a slip brings the system to a complete standstill?

Wish you all the best.

END
