Welcome to follow our WeChat official account: Shishan100

My new course, **"C2C E-commerce System Microservice Architecture 120-Day Practical Training Camp"**, is now available on the official account "ruxihu Technology Nest". If you are interested, click the link below for details:

120-Day Training Camp of C2C E-commerce System Micro-Service Architecture

Table of contents

1. Overview

2. Business scenarios

3. Online experience: How to set the Hystrix thread pool size

4. Online experience: How to set the request timeout

5. Problem solving

6. Summary

1. Overview

In my last article, I talked about a real-world case in which a friend's company ran into problems with its Spring Cloud architecture. It wasn't a big technical problem, but if you don't understand certain details well enough, you can easily make the same mistakes.

If you have not read that article yet, I suggest you take a look first: "Behind the Double 11 Carnival: How Does the Microservice Registry Handle Access from Systems with Tens of Millions of Users?" The case background of this article builds on it.

In this article, we will discuss how to keep the whole system highly available under a microservices architecture.

Setting aside infrastructure failures (a Redis cluster going down, an Elasticsearch cluster going down, MySQL going down), the microservices architecture itself relies on two core mechanisms to stay highly available:

  1. Resource isolation and circuit breaking based on Hystrix;

  2. Graceful degradation with fallbacks.

If resource isolation and degradation are done well, then in high-concurrency scenarios such as Double 11 an individual service may still fail, but the failure will not spread to the entire system. The sketch below shows the basic shape of these two mechanisms.
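As a minimal sketch, assuming service A wraps its call to service B in a Hystrix command (the class name, group key, and remote call below are hypothetical, not the actual code from this case), the two mechanisms look roughly like this:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixThreadPoolKey;

// Hypothetical command wrapping service A's call to service B.
public class GetServiceBDataCommand extends HystrixCommand<String> {

    public GetServiceBDataCommand() {
        // Commands sharing this thread pool key run in one isolated thread pool,
        // so a slow service B can only exhaust this pool, never the whole service.
        super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("ServiceB"))
                .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey("ServiceBPool")));
    }

    @Override
    protected String run() throws Exception {
        // The real remote call to service B would go here (RestTemplate, Feign, etc.).
        return "data-from-service-B";
    }

    @Override
    protected String getFallback() {
        // Degradation: return a stub or cached value instead of failing the whole request.
        return "default-value";
    }
}
```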

For those of you who have forgotten how resource isolation, circuit breaking, and degradation work in Hystrix, please review: "Please Don't Ask Me About the Underlying Principles of Spring Cloud in Interviews".

2. Business scenarios

OK, let's take a closer look at how the Hystrix thread pool size and request timeout should be tuned to their optimal values, based on system load, in a real production system.

Let’s start by reviewing the following diagram of a company’s system mentioned in the previous article.

Core service A invokes core services B and C. If core service B responds too slowly, the thread pool in core service A that calls service B will fill up and hang.

However, because Hystrix provides resource isolation, core service A can still call service C normally, so users can keep using at least part of the app's features; only the pages that depend on service B cannot be displayed, so those features are unavailable.

Of course, this is absolutely unacceptable in a production system, so you must not let the above situation happen.

In the last article, we optimized the system so that each Hystrix thread pool could comfortably handle its per-second request load, with a reasonable timeout to keep slow requests from tying up the threads.

3. Online experience: How to set the Hystrix thread pool size

OK, now the question is: in a production environment, how exactly do we set the size of each Hystrix thread pool in a service, and how do we set the timeout?

The following is a summary of production experience gained from optimizing a large number of online systems:

Suppose service A receives 30 requests per second and, for each of them, sends a request to service B, so service B also sees 30 requests per second, each taking about 200 ms to respond. How many threads does your Hystrix thread pool need?

The formula is: 30 (requests per second) × 0.2 (seconds of processing time per request) + 4 (buffer threads) = 10 (threads).

If you have any doubts about the formula, ask yourself: why can 10 threads easily handle 30 requests per second?

If a thread can execute a request in 200 milliseconds, then a thread can execute 5 requests in a second. Theoretically, with only 6 threads, you can execute 30 requests per second.

In other words, 6 of the 10 threads are enough to handle 30 requests per second; the remaining 4 threads sit idle.

So why do we need 4 extra threads? Simple: you want to leave a little buffer.

If system performance drops slightly during peak hours and many requests take around 300 milliseconds to complete, then a thread can handle only about 3 requests per second, and 10 threads can only just keep up with 30 requests per second. That is why you should leave a few extra threads.

As always, I’ll give you a picture, just to give you an intuition.
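Plugging the numbers into Hystrix could look like the sketch below; withCoreSize is the real Hystrix thread-pool property, while the command group and pool names are hypothetical.

```java
import com.netflix.hystrix.HystrixCommand.Setter;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixThreadPoolKey;
import com.netflix.hystrix.HystrixThreadPoolProperties;

public class ServiceBCommandConfig {

    // Thread pool for calls to service B, sized by the formula:
    // 30 requests/s * 0.2 s per request + 4 buffer threads = 10 threads.
    public static Setter serviceBSetter() {
        return Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("ServiceB"))
                .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey("ServiceBPool"))
                .andThreadPoolPropertiesDefaults(
                        HystrixThreadPoolProperties.Setter()
                                .withCoreSize(10)); // 30 * 0.2 + 4
    }
}
```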

4. Online experience: How to set the request timeout

Next, what should the request timeout be set to? The answer is 300 milliseconds.

Why? It’s easy, man! If you set your timeout to 500 milliseconds, what could happen?

In the extreme case, if service B is slow and takes 500 milliseconds to respond, each thread can handle at most 2 requests per second, and 10 threads can handle at most 20 requests per second.

But 30 requests per second keep coming in. What happens? Looking back at the first picture, a large number of threads get completely jammed: there is no capacity left to process that many requests, and users end up with pages that never render.

Still a little confused? Here’s another picture to give you a feel for the problem caused by this unreasonable timeout!

If your thread pool size and timeout are not set properly, then even a temporary performance dip in service B can leave service A's thread pool jammed for a while before it can get back to processing new requests.

Even after service B's interface performance recovers to under 200 milliseconds, service A's thread pool will stay jammed for a while. The larger the misconfigured timeout, say 1 second or 2 seconds, the longer the recovery takes.

Therefore, you should set the timeout to 300 milliseconds, so that any request that does not finish within 300 milliseconds immediately times out and returns.

That way, the threads in the thread pool never stay stuck for long: a request that has not finished within 300 milliseconds returns immediately, so the threads remain free and can work through the incoming requests in an orderly fashion.

So when service B’s interface performance recovers to less than 200 milliseconds, threads in service A’s thread pool can recover quickly.
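Adding the timeout to the earlier setter sketch could look like this; withExecutionTimeoutInMilliseconds is the real Hystrix command property, and the rest of the names are still hypothetical.

```java
import com.netflix.hystrix.HystrixCommand.Setter;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;
import com.netflix.hystrix.HystrixThreadPoolProperties;

public class ServiceBCommandConfigWithTimeout {

    // 10 threads for the load, plus a 300 ms timeout so a slow service B
    // releases its thread quickly instead of jamming the pool.
    public static Setter serviceBSetter() {
        return Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("ServiceB"))
                .andThreadPoolPropertiesDefaults(
                        HystrixThreadPoolProperties.Setter()
                                .withCoreSize(10))
                .andCommandPropertiesDefaults(
                        HystrixCommandProperties.Setter()
                                .withExecutionTimeoutInMilliseconds(300));
    }
}
```

When the timeout fires, Hystrix invokes the command's fallback, which is exactly the degradation logic discussed later in this article.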

This is what Hystrix parameter tuning looks like on a production system: you have to think carefully about how each of these parameters should be set.

Otherwise, you are likely to end up in a situation like the one above: you adopt the fancy Spring Cloud stack, but it behaves like a black box, with inexplicable system glitches, all sorts of freezes, outages, and so on.

5. Problem solving

All right, let's move on. Suppose the system now receives 6,000 requests per second, and core service A is deployed on 60 machines, so each machine receives 100 requests per second. How many threads does the thread pool on each machine need?

Pretty simple: if 10 threads can handle 30 requests per second, then roughly 30 threads can handle 100 requests per second.

At this point, you should know how many threads to allocate to the thread pool in service A that calls service B, and you should also know how to set the timeout.

None of these values is fixed, but you should know how to calculate them: from the service's response time, the system's peak QPS, and the number of machines, you can derive the thread pool size and the timeout.
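As a quick sanity check on the arithmetic, here is a tiny hypothetical helper that derives a pool size from peak QPS per machine, average response time, and a buffer; at 100 requests per second and 0.2 s per request it lands at roughly 30 threads, matching the figure above.

```java
// Hypothetical helper: derive a Hystrix thread pool size from load figures.
public class ThreadPoolSizing {

    /**
     * @param peakQpsPerMachine  peak requests per second hitting one machine
     * @param avgResponseSeconds average response time of the downstream call, in seconds
     * @param bufferThreads      extra threads kept as headroom for slowdowns
     */
    public static int poolSize(int peakQpsPerMachine, double avgResponseSeconds, int bufferThreads) {
        return (int) Math.ceil(peakQpsPerMachine * avgResponseSeconds) + bufferThreads;
    }

    public static void main(String[] args) {
        System.out.println(poolSize(30, 0.2, 4));   // 10, the earlier single-machine example
        System.out.println(poolSize(100, 0.2, 10)); // 30, i.e. 6,000 req/s spread across 60 machines
    }
}
```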

After setting this up, it’s time to consider service degradation.

If one of your services goes down and Hystrix trips its circuit breaker, you need to have thought through the degradation logic of each service.

Some common examples:

  • If a service that queries data fails, fall back to a local cache.

  • If a service that writes data fails, record the write operation somewhere (for example, in MySQL or an MQ) and replay it later.

  • If Redis goes down, query MySQL instead.

  • If MySQL goes down, record the operation in Elasticsearch and restore the data later.

The specific degradation strategy depends on the business; there is no fixed recipe. The sketch below illustrates the first two examples.
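A minimal sketch, assuming an in-process map and a simple queue standing in for whatever cache and MQ your project actually uses (both names are hypothetical):

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical degradation helpers; a real project would plug in its own cache / MQ clients.
public class DegradationSketch {

    // Stand-in for a local cache (an in-process map here; could be Ehcache, Guava Cache, etc.).
    private static final Map<String, String> localCache = new ConcurrentHashMap<>();

    // Stand-in for an MQ producer: failed writes are queued and replayed once the service recovers.
    private static final BlockingQueue<String> pendingWrites = new LinkedBlockingQueue<>();

    // Read path: if the query service (or Redis) fails, fall back to the local cache.
    public static String readWithFallback(String key) {
        try {
            return queryRemoteService(key);
        } catch (Exception e) {
            return localCache.getOrDefault(key, "default-value");
        }
    }

    // Write path: if the write service (or MySQL) fails, record the operation for later replay.
    public static void writeWithFallback(String record) {
        try {
            writeRemoteService(record);
        } catch (Exception e) {
            pendingWrites.offer(record);
        }
    }

    // Placeholder remote calls; they always fail here just to exercise the fallback paths.
    private static String queryRemoteService(String key) { throw new RuntimeException("query service down"); }
    private static void writeRemoteService(String record) { throw new RuntimeException("write service down"); }
}
```

In a real system this logic typically lives inside the HystrixCommand's getFallback(), so that it also covers timeouts and open circuit breakers.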

6. Summary

In summary, setting aside infrastructure failures, you need to get two things right if you want to run a microservices architecture well:

  1. First, set proper parameters for Hystrix resource isolation and timeouts, so that thread pools do not keep getting jammed at peak times;

  2. Second, for individual service failures, put a reasonable degradation strategy in place, so that when any single service goes down it degrades gracefully and the system as a whole remains available!

If you found this useful, please help share it. Your encouragement is the author's biggest motivation. Thank you!

A large wave of original articles on microservices, distributed systems, high concurrency, and high availability is on the way:

**Hadoop Architecture in Plain English** (stay tuned)

**How Hadoop NameNode Supports High-Concurrency Access in a Large Cluster**

Please scan the QR code below to continue following us:

![](https://p1-jj.byteimg.com/tos-cn-i-t2oaga2asx/gold-user-assets/2018/11/12/167088310d1d57b1~tplv-t2oaga2asx-image.image)

Shishan's Architecture Notes (ID: Shishan100)

More than ten years of BAT architecture experience, shared with you