Original: Little Sister Taste (WeChat official account ID: XjjDog). Feel free to share; reprints outside the official account must retain this statement.

Time flies like the wind. In the blink of an eye I have spent many years in this industry, seen countless formidable characters, and seen plenty of posers too.

A few days ago, just after arriving at work, I was handed the task of interviewing a fresh graduate, and it made me sigh at the gap between people.

His level is well beyond that of many architects who have worked for years. I admire him ~

Our topic is about how to build a scalable, highly available, highly reliable large website. Well, let’s get started.

1. Posing the problem

As you all know, with the explosion of data and service requests on the Internet today, even very small companies can generate tens of times more traffic for a product than they used to. Of course, this is sometimes just a dream.

Increased traffic means increased capacity for back-end services, making it a challenge to build large systems that process gigabytes of data per second and sustain hundreds of thousands of QPS. Especially when some genius dreams up a flash-sale promotion, the traffic becomes even more wildly unpredictable.

With limited resources, how can the system keep responding well and support the bosses’ dreams? How would you handle it? Do you have any systematic insights to share with me?

The graduate smiled and said, “I have summarized a few things on this topic. I could talk about it for quite a while.”

All right, I’m all ears.

2. Important indicators of service construction

“First of all, I want to make clear which metrics we should pay attention to when building services. Having a goal gives us a direction. In general, I sum it up in four points.

  1. Availability. We need to ensure the availability of the service, which is the SLA metric. Only when the service can respond properly and the error rate is kept at a low level can our service be considered normal.

  2. Service performance. Availability and performance go hand in hand. With higher service performance, limited resources can support more requests and availability increases. Therefore, service performance optimization is a continuous work, and every 1ms average performance improvement under huge traffic is worth pursuing.

  3. Reliability. There are many components of a distributed service, and each component can cause problems with different impacts. How to ensure the runtime reliability of each component and how to ensure data consistency is a challenge.

  4. Observability. To obtain the metrics that drive service optimization, our services must be designed for observability from the start. At the macro level it lets us identify component faults; at the micro level it provides the basis for performance optimization. In automated scaling scenarios such as HPA, telemetry data is even the only basis for automated decision making.

For a service, there are two main means of expansion: scale-up (vertical scaling) and scale-out (horizontal scaling).

Vertical scaling increases the processing power of a single node by upgrading the machine’s configuration, which is necessary in some business scenarios. But most services pursue horizontal scaling, using more machines to support business growth.

As long as the service is scalable and stateless, all that’s left is to pile on hardware. That sounds great, but in reality the architectural challenges are enormous.

The graduate’s opening analysis sounded a lot like that of a two-bit CTO. I nodded to myself, encouraging him to go deeper, get more specific, and come up with something different.

3. Idempotence

What if an interface call fails? In the early days of the Internet this was even more common because of flaky networks. HTTP status code 504 is the typical gateway-timeout status: the first request may time out and fail while the second succeeds. In reality, many interfaces need retries, and the introduction of asynchrony makes retries even more important.

But we also need to consider retry storms due to fast retries, because timeouts alone can mean that the server is overwhelmed and there is no reason to add fuel to the fire. Therefore, there is an exponential backoff algorithm for retries until the request is actually terminated and the exception is handled.

As you can see, service idempotence becomes particularly important once timeout and retry mechanisms are introduced. It must not only tolerate repeated calls on a single machine, but also guarantee safe re-entry across the entire distributed cluster.

In mathematics, it even has a beautiful function formula.

f(f(f(x))) = f(f(x)) = f(x)


Once an interface is idempotent, it is capable of tolerating failures. When a small number of service calls fail due to occasional network failures or machine failures, the final call can be easily completed by retry and idempotence.

For a query operation, it is inherently idempotent as long as the data set is unchanged and no additional processing is required. More challenging are the add and update operations.

There are a number of techniques to ensure idempotence, such as a unique database index, pre-generated transaction IDs, or a token mechanism that guarantees each call is unique. Among them, the token mechanism is increasingly common: the caller first requests a unique tokenId, and idempotence of the subsequent call is then built around that tokenId.
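To make the token mechanism concrete, here is a minimal single-node sketch in Java. A production version would typically keep the tokens in Redis (for example SETNX with a TTL); the class and method names below are only illustrative.

import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// A minimal, single-node sketch of the token mechanism described above.
// In production the token store would typically be Redis; names here are hypothetical.
public class IdempotentTokenService {

    private final ConcurrentHashMap<String, Boolean> issuedTokens = new ConcurrentHashMap<>();

    // Step 1: the caller asks for a token before making the real request.
    public String issueToken() {
        String tokenId = UUID.randomUUID().toString();
        issuedTokens.put(tokenId, Boolean.TRUE);
        return tokenId;
    }

    // Step 2: the real request carries the token; only the first use succeeds,
    // so retries of the same logical request are accepted exactly once.
    public boolean tryConsume(String tokenId) {
        return issuedTokens.remove(tokenId) != null;   // remove() is atomic
    }
}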

4. Health checks

Since K8s standardized health checks, they have become a mandatory option for services. In K8s there are liveness probes and readiness probes.

Liveness probes are mainly used to find out whether an application is alive. They only reflect the state of the application itself and should not depend on the health of external systems. Readiness probes indicate whether an application is ready to receive traffic; if an instance is not ready, traffic will not be routed to it.

If you use Spring Boot’s Actuator component, you can easily expose these functions through the Health endpoint. When the container or registry determines through the Health endpoint that a service has a problem, it automatically removes the problem node from the node list, and then, through continued probing, adds it back once the service returns to normal.

The health check mechanism prevents traffic from being scheduled to the wrong machine.
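To show where the Health endpoint gets its data, here is a minimal sketch of a custom Spring Boot Actuator HealthIndicator; the downstream check is a made-up placeholder, not a real dependency.

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

// A small readiness-style check that Actuator aggregates into /actuator/health.
@Component
public class DownstreamHealthIndicator implements HealthIndicator {

    @Override
    public Health health() {
        boolean reachable = pingDownstream();   // assumption: some cheap connectivity check
        return reachable
                ? Health.up().build()
                : Health.down().withDetail("downstream", "unreachable").build();
    }

    private boolean pingDownstream() {
        // placeholder; a real check might open a socket or hit a lightweight endpoint
        return true;
    }
}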

5. Automatic service discovery

Early software developers brought services online in an almost primitive, manual way, not because they wanted to, but because they had to.

For example, if I wanted to add a machine, I first had to verify that the machine was alive, then deploy the service, and finally configure the machine in load-balancing software such as Nginx. Usually I would also check the logs to see whether any traffic was actually reaching the machine.

With the help of microservices and continuous integration, we no longer need such a complicated process to go live: just click build and publish on a page, and the service automatically goes live and is discovered by other services.

The registry plays a very important role in service discovery. It acts as a centralized information hub: every service reports to it when it starts up or shuts down, and when I want to invoke a service, I look it up in the same registry.

A registry, acting as an intermediary that manages all these frequent registrations and lookups, has become a necessary facility for microservices.

These lookups can be very frequent, so the caller also keeps a local copy, so that a registry problem does not cause a massive outage, like a brain starved of oxygen. With copies, however, come consistency issues. Registries that update information by pulling lag in data freshness; components with a push mechanism handle freshness better and can sense service changes faster.

Many components can act as a service registry, as long as they can store data in a distributed way and keep it consistent. Eureka, Nacos, ZooKeeper, Consul, Redis, and even databases can fill this role.
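To show what a registry fundamentally stores, here is a toy, single-node sketch: a map from service name to live instances. Real registries such as Eureka, Nacos, ZooKeeper, or Consul add heartbeats, push/pull synchronization, and client-side caching on top of this idea; all names below are illustrative.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Toy registry: service name -> addresses of live instances.
public class ToyServiceRegistry {

    private final Map<String, List<String>> instances = new ConcurrentHashMap<>();

    // Called when a service instance starts up.
    public void register(String serviceName, String address) {
        instances.computeIfAbsent(serviceName, k -> new CopyOnWriteArrayList<>()).add(address);
    }

    // Called when a service instance shuts down (or misses its heartbeats).
    public void deregister(String serviceName, String address) {
        List<String> list = instances.get(serviceName);
        if (list != null) {
            list.remove(address);
        }
    }

    // Callers pull this list and usually cache a local copy, as described above.
    public List<String> lookup(String serviceName) {
        return instances.getOrDefault(serviceName, List.of());
    }
}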

6. Rate limiting

In web development, Tomcat defaults to a pool of 200 threads. As more requests come in and no free thread is available to handle them, requests wait (as long as the queue does not exceed acceptCount), and from the browser’s point of view the page just keeps spinning, even if you are requesting a simple Hello World.

We can also view this as a form of rate limiting. In essence, it sets a limit on resources; requests beyond that limit are buffered or simply rejected.

Rate limiting has a special meaning in high-concurrency scenarios: it is primarily used to protect underlying resources. If you want to call a service, you first need permission to call it. Rate limiting is typically implemented by the service provider and constrains what callers can do.

For example, a certain service serves callers A, B, and C. Based on traffic estimates made in advance, the requests of A, B, and C are limited to 1,000/s, 2,000/s, and 10,000/s respectively. At any given moment some callers may have requests rejected while others run normally. Rate limiting is the server’s means of self-protection.

Common rate-limiting algorithms include counters, leaky buckets, and token buckets. The counter algorithm cannot achieve smooth limiting, though, and is seldom used in practice.
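To make the token-bucket idea concrete, here is a minimal, lazily refilled sketch; in practice you would reach for a library such as Guava’s RateLimiter or Sentinel, and the numbers below are arbitrary.

// Minimal token bucket: tokens accumulate at a fixed rate up to a capacity,
// and each request consumes one token or is rejected.
public class TokenBucket {

    private final long capacity;         // maximum number of tokens the bucket holds
    private final double refillPerNano;  // refill rate in tokens per nanosecond
    private double tokens;
    private long lastRefill;

    public TokenBucket(long capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = tokensPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastRefill = System.nanoTime();
    }

    // Returns true if the request may proceed, false if it should be rejected or queued.
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false;
    }
}

A limiter such as new TokenBucket(1000, 1000) would then sit in front of caller A from the example above.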

7. Circuit breaking

Since Schneider invented the circuit breaker, the concept has taken the world by storm, from the circuit breaker in the A-share stock market to circuit breakers in services, each wonderful in its own way.

A circuit breaker works like the electrical device: when the breaker is closed, current flows; when it is open, current stops.

Typically, a request from a user requires the cooperation of multiple back-end services to complete the work. Not every one of these back-end services is necessary, and it would be unreasonable to reject a user’s entire request because of a problem with one of these services.

The circuit breaker lets certain services return default values when they run into problems, so that the overall request can still proceed normally.

Take risk control as an example. If the risk-control service becomes unavailable at some point, users should still be able to trade normally. In that case we assume by default that risk control has passed, dump these unchecked transactions somewhere else, and process them as soon as the risk-control service recovers, before the goods ship.

As the description above shows, some services can simply return default data when the breaker trips, such as a recommendation service; others need a corresponding exception flow, which is essentially an if-else; and some businesses cannot tolerate circuit breaking at all, so they can only fail fast.

Handling everything with one blanket policy is a thoughtless approach and not what we recommend.

Hystrix, Resilience4j, Sentinel, and similar components are widely used in Java. With Spring Boot integration, these frameworks are generally easy to use and can be configured or driven programmatically.
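To make the state machine concrete, here is a deliberately tiny, hand-rolled circuit-breaker sketch (CLOSED, OPEN, HALF_OPEN). It is not how Hystrix or Resilience4j are implemented internally, just an illustration of the idea, and the thresholds are made up.

import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// CLOSED: calls pass through; OPEN: calls short-circuit to the fallback;
// HALF_OPEN: one probe call decides whether to close or re-open the breaker.
public class SimpleCircuitBreaker {

    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private final AtomicInteger failures = new AtomicInteger();
    private volatile State state = State.CLOSED;
    private volatile long openedAt;

    public SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    public <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt < openMillis) {
                return fallback.get();      // breaker open: return the default value
            }
            state = State.HALF_OPEN;        // cool-down elapsed: allow one probe call
        }
        try {
            T result = action.get();
            failures.set(0);
            state = State.CLOSED;
            return result;
        } catch (RuntimeException e) {
            if (state == State.HALF_OPEN || failures.incrementAndGet() >= failureThreshold) {
                state = State.OPEN;
                openedAt = System.currentTimeMillis();
            }
            return fallback.get();
        }
    }
}

A recommendation call could then be wrapped as breaker.call(() -> recommend(userId), List::of), so an empty list is returned as the default while the breaker is open.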

8. Degradation

Degradation is a somewhat vague term. Rate limiting and circuit breaking can, to some extent, also be regarded as forms of degradation. But degradation, as the term is usually used, cuts in at a higher level.

Degradation generally looks at the distributed system as a whole and cuts traffic off at its source. For example, on Double 11, some non-critical services are suspended to protect the transaction system and avoid resource contention. Degradation usually involves human intervention: when some services are deliberately made unavailable, that is typically a degradation mode.

Where is the best place to degrade? At the entrance: Nginx, DNS, and so on.

Some Internet applications define a Minimum Viable Product (MVP): the smallest set of functionality that must keep working, with a very high SLA requirement. A series of service-splitting operations is carried out around this minimum viable product, and in some cases it even has to be rewritten.

For example, in extreme cases an e-commerce system just needs to display goods and sell them; other supporting systems, such as reviews and recommendations, can be turned off temporarily. Physical deployment and invocation relationships should be planned with these scenarios in mind.
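A minimal sketch of such a manually operated degradation switch might look like the following. In practice the flags would live in a config center (Nacos, Apollo, and the like) so operators can flip them at the entrance; the feature names and the call site are hypothetical.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Global on/off flags for features that may be sacrificed under pressure.
public class DegradationSwitch {

    private static final Map<String, Boolean> DISABLED = new ConcurrentHashMap<>();

    public static void degrade(String feature)       { DISABLED.put(feature, true);  }
    public static void restore(String feature)       { DISABLED.put(feature, false); }
    public static boolean isDegraded(String feature) { return DISABLED.getOrDefault(feature, false); }
}

// Call site: skip the recommendation service entirely while it is degraded.
// List<Item> recommendations = DegradationSwitch.isDegraded("recommendation")
//         ? List.of()                                    // MVP path: sell goods without recommendations
//         : recommendationService.recommend(userId);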

9. Warm-up

Take a look at the following situation.

In a high-concurrency environment, a DB process dies and restarts during peak hours, and the upstream load-balancing policy redistributes traffic. The freshly started DB instantly receives a third of the traffic, its load climbs wildly, and eventually it stops responding at all.

The cause is that the newly started DB has none of its caches warmed up, so its state is completely different from normal operation. Perhaps even a tenth of the normal traffic would be enough to kill it.

Similarly, a JVM process that has just started has slow response times on every interface, because its bytecode has not yet been optimized by the JIT compiler. If the load-balancing component calling it does not take this startup phase into account and routes the usual 1/n of traffic to the node, problems easily occur.

Therefore, we expect the load-balancing component to ramp traffic up gradually, based on how long the JVM process has been running, and warm the service up until it reaches its normal traffic level.
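A sketch of such warm-up weighting, in the spirit of what Dubbo’s load balancing does, could look like this; the parameter values are arbitrary examples.

// A node's effective weight grows linearly with its uptime until the warm-up period ends.
public class WarmupWeight {

    // uptimeMillis: how long the instance has been running
    // warmupMillis: how long the ramp-up period lasts (for example ten minutes)
    // weight:       the node's configured full weight
    public static int effectiveWeight(long uptimeMillis, long warmupMillis, int weight) {
        if (uptimeMillis >= warmupMillis) {
            return weight;                              // fully warmed up
        }
        int w = (int) (weight * uptimeMillis / (double) warmupMillis);
        return Math.max(1, w);                          // never leave a node at zero forever
    }
}

A load balancer would call effectiveWeight(...) when choosing nodes, so a freshly started JVM only gradually receives its full 1/n share of traffic.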

10. Back pressure

Consider the following two scenarios:

  1. No rate limiting at all. When the number of requests gets too high, the backend service can easily crash or run out of memory.

  2. Conventional rate limiting. You impose a fixed maximum on an interface and reject anything beyond it, even though the backend service could actually have handled those requests.

How can the limit be adjusted dynamically? That requires a mechanism: the caller needs to know the processing capacity of the callee, which means the callee must be able to give feedback. Back pressure is, in effect, a kind of intelligent rate limiting; it refers to a strategy rather than a single algorithm.

The idea of back pressure is that the called side does not simply throw away the caller’s traffic; instead it continuously feeds back its own processing capacity, and the caller adjusts its sending rate in real time based on that feedback. A typical example is TCP’s sliding-window flow control.

Reactive programming is an embodiment of the observer pattern: event-driven, non-blocking, elastic applications that deliver data as streams. In this model, implementing back pressure becomes much simpler.
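Here is a minimal back-pressure sketch using the JDK’s built-in Reactive Streams types (java.util.concurrent.Flow): the subscriber requests exactly one item at a time, so the publisher can never outrun it. The slow processing is simulated and purely illustrative.

import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;

public class BackPressureDemo {

    public static void main(String[] args) throws InterruptedException {
        try (SubmissionPublisher<Integer> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(new Flow.Subscriber<Integer>() {
                private Flow.Subscription subscription;

                @Override public void onSubscribe(Flow.Subscription s) {
                    this.subscription = s;
                    s.request(1);                       // ask for just one item to start with
                }

                @Override public void onNext(Integer item) {
                    System.out.println("processed " + item);   // pretend this is slow work
                    subscription.request(1);            // feedback: "I am ready for one more"
                }

                @Override public void onError(Throwable t) { t.printStackTrace(); }
                @Override public void onComplete()         { System.out.println("done"); }
            });

            for (int i = 0; i < 10; i++) {
                publisher.submit(i);                    // blocks when the subscriber's buffer is full
            }
        }
        Thread.sleep(500);                              // let the asynchronous subscriber finish
    }
}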

Back pressure makes a system more stable and efficient, and gives it more flexibility and intelligence. The HTTP 429 status code, for example, means there are too many requests and asks the client to slow down; it is exactly this kind of intelligent notification.

11. Isolation

Even within the same instance, resources of the same type sometimes need to be isolated. An obvious analogy is the Titanic, which had multiple watertight compartments. The compartments were isolated from each other, so the whole ship would not sink because one compartment took on water.

Of course, the Titanic still sank, taking poor Jack with it, because too many compartments were breached.

In some companies’ software, report queries, scheduled tasks, and regular services all run in the same Tomcat and share the same database connection pool; when requests to certain reporting interfaces spike, the other, normal services become unavailable. This is what happens when resources are mixed.

Besides splitting services along CQRS lines, a quick mechanism is to isolate the resources used by certain types of services, for example giving report queries their own database connection pool and their own rate limiter, so they cannot affect other services.
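A minimal bulkhead-style sketch of that idea, using separate thread pools rather than connection pools, might look like this; the pool sizes and task types are assumptions.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Report queries and latency-sensitive requests each get their own small pool,
// so a flood of slow reports can only exhaust the report pool.
public class BulkheadPools {

    private final ExecutorService reportPool = Executors.newFixedThreadPool(4);
    private final ExecutorService onlinePool = Executors.newFixedThreadPool(32);

    public <T> Future<T> submitReport(Callable<T> task) {
        return reportPool.submit(task);     // slow analytical work stays in here
    }

    public <T> Future<T> submitOnline(Callable<T> task) {
        return onlinePool.submit(task);     // normal traffic is unaffected
    }
}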

Coupling occurs on storage nodes as well as on stateless service nodes. Rather than keeping reporting data and normal business data in one database, separate them and serve them independently.

One monk will carry his own water, and so will two, because they live in two different temples.

12. Asynchrony

If you compare BIO and NIO, you can see that our services actually spend most of their time waiting for responses, and the CPU is never fully busy. NIO, of course, is the lower-level mechanism that avoids thread bloat and frequent context switches.

Service asynchrony is similar to NIO in that it eliminates unnecessary waiting. Especially when the call chain is long, asynchrony avoids blocking and keeps responses fast.

On a single machine we use NIO; in a distributed environment we use MQ. They are different technologies, but the principle is the same.

Asynchrony usually means a change in the programming model. In the synchronous model, a request blocks until a success or failure result is returned. The programming model is simple, but it copes poorly with sudden, time-skewed traffic, and requests fail easily. Asynchronous processing scales out smoothly and can shift instantaneous pressure later in time. Synchronous requests are like punching a steel plate; asynchronous requests are like punching a sponge. If you picture the process, the latter is clearly more flexible and the experience much friendlier.
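As a small illustration of the asynchronous model on a single machine, here is a CompletableFuture sketch in which two slow calls run concurrently; the service calls are placeholders, not real APIs.

import java.util.concurrent.CompletableFuture;

public class AsyncDemo {

    public static void main(String[] args) {
        // The two slow calls run concurrently; combination happens when both are done,
        // without a request thread blocking the whole time.
        CompletableFuture<String> user  = CompletableFuture.supplyAsync(AsyncDemo::loadUser);
        CompletableFuture<String> items = CompletableFuture.supplyAsync(AsyncDemo::loadItems);

        user.thenCombine(items, (u, i) -> u + " / " + i)
            .thenAccept(System.out::println)
            .join();                                    // join only here, for the demo's sake
    }

    private static String loadUser()  { sleep(200); return "user";  }
    private static String loadItems() { sleep(200); return "items"; }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}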

13. Caching

Caching is probably the most widely used optimization technique in software. At the very core, CPUs have multi-level caches; to bridge the gap between memory and storage, cache frameworks like Redis keep appearing.

The effect of caching can be dramatic: a page that used to load very slowly can open in an instant, and a database that was under heavy pressure can see that pressure drop immediately.

Caching, in essence, reconciles two components with very different speeds by adding an intermediate layer that stores frequently used data on the faster device.

In application development, cache is divided into local cache and distributed cache.

So what is a distributed cache? It is really a form of centralized management. If a service has multiple nodes, a local (in-heap) cache means each node holds its own copy; with a distributed cache, all nodes share one cache, which saves space and reduces management cost.
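Here is a minimal cache-aside sketch with a local in-heap map; with a distributed cache, the map operations would become Redis GET/SET calls, and the loader stands in for a database query.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Cache-aside: read from the cache first, fall back to the slow store on a miss,
// then populate the cache; evict on writes so the next read reloads fresh data.
public class CacheAside<K, V> {

    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;    // e.g. a database lookup

    public CacheAside(Function<K, V> loader) {
        this.loader = loader;
    }

    public V get(K key) {
        V value = cache.get(key);
        if (value == null) {                // miss: hit the slow store, then cache the result
            value = loader.apply(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    public void invalidate(K key) {
        cache.remove(key);
    }
}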

In the distributed-cache space, Redis is used the most. Redis supports very rich data types, including commonly used structures such as string, list, set, zset, and hash, as well as others like bitmaps.

The problems a cache introduces center on cache penetration, breakdown, and avalanche, as well as consistency, which I won’t go into here.

14. Plan B

A mature system has a Plan B. Besides multi-region active-active deployment, disaster recovery, and similar solutions, Plan B also means providing an emergency channel for normal services.

For example, run a dedicated minimum-viability system that carries only the company’s core business. In the event of a massive failure, requests are switched over to this minimum system.

Plan B is usually global; it guarantees the company’s most basic service capability, and we hope it never has to be used.

15. Monitoring and alerting

A problem can only be treated as a problem if it leaves evidence. Without evidence, you see the effects but cannot find the culprit.

And problems behave like people: once one finds that it cannot be caught, it will always come back, like a criminal who finds a loophole and tries to exploit it again.

So when dealing with an online problem, preserving the evidence of what happened is the most important thing. Without it, your company will be stuck in an endless battle.

Logging is the most common practice. By placing log points in the program logic and using a logging framework such as Logback, you can quickly locate the line of code where a problem occurred. Log the logic that may cause problems in detail and output richer information, so that when a problem occurs you can switch to the debug level and investigate.
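As a small illustration of such log points, here is a sketch using SLF4J over Logback; the order-processing names are hypothetical, and the point is the debug/info/warn split so that switching the logger to DEBUG reveals the detailed trail on demand.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class OrderService {

    private static final Logger log = LoggerFactory.getLogger(OrderService.class);

    public void placeOrder(String orderId, long amountCents) {
        log.debug("placeOrder start, orderId={}, amountCents={}", orderId, amountCents);
        try {
            chargePayment(orderId, amountCents);
            log.info("order placed, orderId={}", orderId);
        } catch (RuntimeException e) {
            // keep the evidence: the parameters plus the full stack trace
            log.warn("order failed, orderId={}, amountCents={}", orderId, amountCents, e);
            throw e;
        }
    }

    private void chargePayment(String orderId, long amountCents) {
        // placeholder for the real downstream call
    }
}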

If it is a widespread bug, it is highly recommended to debug it directly in the online environment, but not by dynamically modifying bytecode with tools such as Arthas, and certainly not by IDEA remote debugging. Instead, use an approach similar to canary releasing: route a very small amount of traffic to a newly built version and test it there. If you do not have a canary release platform, a load balancer such as Nginx can achieve something similar with weights.

Logging and monitoring systems place heavy demands on hardware, especially when your request and response bodies are large, which raises the storage and compute requirements. Their hardware cost accounts for a relatively high proportion of the whole infrastructure. But this evidence is essential for analyzing problems, so even though it is relatively expensive, many companies still invest heavily here, in both hardware and people.

MTTD (mean time to detect) and MTTR (mean time to recover) are two very important metrics, and we must pay close attention to them.

16. Closing

I looked at my watch. This fellow was very eloquent, and the allotted time was almost used up. I waved him to a stop. “What else do you know? Keep it brief!”

“Not much more. Things like how to build a DevOps team to support our development, testing, and production environments, how to do deeper performance optimization, and how to troubleshoot real incidents. And details like operating-system tuning, network programming, and multithreading, which I haven’t covered yet.”

I said, “That’s enough. You’re already very good.”

“You call yourself a graduate, and you already crush most people. Where on earth did you graduate from?!”

“From B station (Bilibili). I just ‘graduated’ yesterday,” he said with a shy smile.

I stared into his eyes and smiled too. With his hair still intact, a person really can be young all over again. Wonderful!

Xjjdog is a public account that keeps programmers from taking detours. It focuses on infrastructure and Linux: ten years of architecture, tens of billions of daily requests, discussing the world of high concurrency with you and giving you a different taste. My personal WeChat is xjjdog0; feel free to add me as a friend for further communication.
