This article first introduces the risks of the microservices architecture and then describes a variety of effective methods and techniques for avoiding failures in microservices systems, such as graceful service degradation, change management, health checks and self-healing, circuit breakers, and rate limiters.

Contents

1. Risks of microservices architecture

2. Graceful Service Degradation

3. Change management

4. Health check and load balancing

5. Self-Healing

6. Failover Caching

7. Retry Logic

8. Rate Limiters and Load Shedders

9. Fail Fast and Independently

10. Bulkheads

11. Circuit Breakers

12. Testing for Failures

13. Summary

14. Key Takeaways

A microservices architecture can effectively isolate failures through well-defined service boundaries. Like any distributed system, however, microservices face a higher probability of problems at the network, hardware, and application levels. Because services depend on one another, a failure in any component can make functionality unavailable to users. To minimize the impact of partial outages, we need to build fault-tolerant services that can gracefully handle the outages that do occur.

This article introduces the most commonly used techniques and architectural patterns for building and operating highly available microservices systems. If you are unfamiliar with these patterns, that is fine: building reliable systems is not a one-size-fits-all exercise.

1. Risks of microservices architecture

A microservices architecture divides application logic into services that interact with each other over a network. Because calls go over the network rather than staying in-process, this introduces extra complexity and potential problems for a system that must coordinate across multiple physical and logical components. The increased complexity of such a distributed system also increases the likelihood of network-specific failures.

Compared to a traditional monolithic application, one of the biggest advantages of a microservices architecture is that teams can independently design, develop, and deploy their own services and control their entire lifecycle. However, this also means that a team has no control over the dependencies of its service, which may be managed by other teams. In a microservices architecture, it is important to keep in mind that the services you depend on are operated by others and may be temporarily unavailable due to releases, configuration changes, and other events, even though the components are deployed independently of each other.

2. Graceful Service Degradation

One of the greatest benefits of a microservices architecture is that when a component fails, the failure can be isolated and the affected functionality gracefully degraded. In a photo-sharing app, for example, users may not be able to upload new images when something goes wrong, but they can still view, edit, and share images they have already uploaded.


Microservice failure independence (in theory)


In most cases, graceful degradation like the example above is difficult to achieve, because in a distributed environment applications depend on one another, and developers must implement error-handling logic (discussed later in this article) to deal with transient failures and outages.


Services depend on each other and fail simultaneously without failover logic


3. Change management

Google’s Site Reliability Engineering team found that roughly 70 percent of outages are caused by changes. Making changes to a service, such as releasing a new version of the code or modifying its configuration, can always trigger a failure or introduce new bugs.

In a microservices architecture, services depend on one another, which is why you need to minimize failures and limit their negative impact. To deal with problems caused by changes, you can implement a change management policy and automatic rollbacks.

For example, when new code is deployed or a configuration change is made, the change should be rolled out progressively to a subset of instances in the service cluster, monitored, and automatically rolled back if key metrics show a problem.


Change management – Roll back deployment


Another solution is to run two production environments. During a deployment, the changed application is deployed to only one of the environments, and production traffic is directed to the new version only after it has been verified to behave as expected. This approach is called "blue-green deployment" or "red-black deployment".

Rolling back code is not a bad thing. You should not leave broken code running in production while you wonder what went wrong; roll it back as soon as necessary, the sooner the better.
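
As a rough illustration of the progressive roll-out and automatic rollback described above, the Go sketch below deploys a new version to small batches of instances and rolls everything back if the error rate climbs. The deployTo, errorRate, and rollback functions are hypothetical stand-ins for your deployment tooling and metrics system, and the batch size, soak time, and threshold are illustrative values only.

```go
package main

import (
	"log"
	"time"
)

// Hypothetical stand-ins for real deployment tooling and a metrics system.
func deployTo(instances []string, version string) error { return nil }
func errorRate(instances []string) float64              { return 0.0 }
func rollback(instances []string, version string) error { return nil }

// canaryDeploy rolls newVersion out in small batches and automatically rolls
// back to oldVersion if the observed error rate crosses the threshold.
func canaryDeploy(all []string, newVersion, oldVersion string) error {
	const batchSize = 2
	const maxErrorRate = 0.05 // more than 5% errors triggers a rollback
	const soakTime = 5 * time.Minute

	for i := 0; i < len(all); i += batchSize {
		end := i + batchSize
		if end > len(all) {
			end = len(all)
		}

		if err := deployTo(all[i:end], newVersion); err != nil {
			return rollback(all[:end], oldVersion)
		}

		// Let the new instances take traffic, then check the key metric.
		time.Sleep(soakTime)
		if errorRate(all[:end]) > maxErrorRate {
			log.Println("error rate too high, rolling back")
			return rollback(all[:end], oldVersion)
		}
	}
	return nil
}

func main() {
	instances := []string{"orders-1", "orders-2", "orders-3", "orders-4"}
	if err := canaryDeploy(instances, "v1.4.0", "v1.3.2"); err != nil {
		log.Fatal(err)
	}
}
```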

4. Health check and load balancing

Service instances are constantly being started, restarted, and stopped because of failures, deployments, or autoscaling, which makes them temporarily or permanently unavailable. To avoid problems, your load balancer should skip unhealthy instances when routing, because they cannot serve subsystems or users.

Whether an application instance is healthy can be determined from the outside: a health-check endpoint such as GET /health can be called repeatedly, or the instances can report their own state. Modern service discovery solutions continuously collect health information from instances and configure load-balanced routing to point only to healthy components.
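
As a minimal sketch of such a health endpoint, assuming a Go service built on the standard library only, the handler below returns 200 when its dependencies look healthy and 503 otherwise; checkDatabase is a hypothetical placeholder for a real dependency check.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// checkDatabase is a hypothetical dependency check; replace it with a real
// ping against your database, message broker, downstream service, etc.
func checkDatabase() error { return nil }

func healthHandler(w http.ResponseWriter, r *http.Request) {
	status := map[string]string{"status": "ok"}
	code := http.StatusOK

	// Report unhealthy so the load balancer / service discovery stops
	// routing traffic to this instance.
	if err := checkDatabase(); err != nil {
		status = map[string]string{"status": "unhealthy", "reason": err.Error()}
		code = http.StatusServiceUnavailable
	}

	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	json.NewEncoder(w).Encode(status)
}

func main() {
	http.HandleFunc("/health", healthHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```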

5. Self-Healing

Self-healing helps an application recover: it takes steps to fix itself when it ends up in a broken state. In most cases, an external system monitors the state of each instance and restarts the service when it has been unhealthy for some time. Self-healing is usually very useful, but in some situations constantly restarting the service causes further problems, for example when the service is overloaded or its database connections time out and the application cannot report its health status correctly.

For scenarios such as a lost database connection, implementing advanced self-healing can be tricky. In these cases, extra logic has to be added to the application to handle the edge cases and to let the external system know that the service instances do not need to be restarted immediately.

6. Failover Caching

Services often fail because of network problems and changes in the system. Most of these outages are temporary thanks to self-healing and advanced load balancing, but we still need a solution that keeps the service usable while a dependency is failing. This is where failover caching helps, by providing the necessary data to our application.

A failover cache typically uses two different expiration times: a shorter one that indicates how long the cached data may be used under normal conditions, and a longer one that indicates how long it may still be used in the event of a failure.


Failover cache


In particular, it is important to note that failover caching should only be used when providing outdated data is better than no data at all.
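
To make the two-expiration idea concrete, here is a minimal in-process sketch in Go; the FailoverCache name, the TTL values, and the backendDown flag are illustrative choices, not a standard API.

```go
package cache

import (
	"errors"
	"sync"
	"time"
)

type entry struct {
	value    any
	storedAt time.Time
}

// FailoverCache stores each entry with two lifetimes: freshTTL for normal
// operation and failoverTTL for serving stale data while the backend fails.
type FailoverCache struct {
	mu          sync.RWMutex
	data        map[string]entry
	freshTTL    time.Duration // e.g. 1 minute
	failoverTTL time.Duration // e.g. 1 hour
}

func New(freshTTL, failoverTTL time.Duration) *FailoverCache {
	return &FailoverCache{
		data:        make(map[string]entry),
		freshTTL:    freshTTL,
		failoverTTL: failoverTTL,
	}
}

func (c *FailoverCache) Set(key string, value any) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = entry{value: value, storedAt: time.Now()}
}

// Get returns a cached value. Under normal conditions only fresh entries are
// returned; when backendDown is true, stale entries are still served as long
// as they are within the failover window.
func (c *FailoverCache) Get(key string, backendDown bool) (any, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()

	e, ok := c.data[key]
	if !ok {
		return nil, errors.New("cache miss")
	}
	age := time.Since(e.storedAt)
	if age <= c.freshTTL {
		return e.value, nil
	}
	if backendDown && age <= c.failoverTTL {
		return e.value, nil // stale, but better than nothing during an outage
	}
	return nil, errors.New("cache entry expired")
}
```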

To set up caching and failover caching, use standard response headers in HTTP.

For example, the max-age directive specifies how long a resource is considered fresh, while the stale-if-error directive specifies how long the resource may still be served from the cache when a failure occurs.
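
A hedged example of setting those directives on a response in Go follows; the /products route and the concrete values are illustrative, and stale-if-error comes from RFC 5861, so it must be supported by the cache or CDN sitting in front of the service.

```go
package main

import (
	"log"
	"net/http"
)

func productsHandler(w http.ResponseWriter, r *http.Request) {
	// Fresh for 10 minutes; during a failure, intermediaries that support
	// RFC 5861 may keep serving the stale copy for up to 24 hours.
	w.Header().Set("Cache-Control", "max-age=600, stale-if-error=86400")
	w.Header().Set("Content-Type", "application/json")
	w.Write([]byte(`{"products": []}`))
}

func main() {
	http.HandleFunc("/products", productsHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```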

Modern CDNs and load balancers offer a variety of caching and failover behaviors, but you can also build a shared library inside your company that implements these standard reliability solutions.

7. Retry Logic

In some cases we cannot cache the data, or we want to modify it and the operation fails. In these situations we can retry the operation, because we can expect the resource to recover after some time, or the load balancer to send the request to a healthy instance.

You should be careful about adding retry logic to your application and client, because a larger number of retry operations can make things worse and even prevent your application from recovering.

In a distributed system, a retry in one microservice can trigger multiple other requests or retries, causing a cascading effect. To minimize the impact of retries, you should limit their number and use an exponential backoff algorithm that keeps increasing the delay between attempts until a maximum limit is reached.

Since a retry is initiated by the client (a browser, another microservice, etc.) and the client does not know whether the operation failed before or after the request was processed, your application should handle requests idempotently. For example, retrying a purchase should not charge the customer twice. Using a unique idempotency key for each transaction is one way to solve this retry problem.
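
The sketch below combines a capped exponential backoff with a reused idempotency key, assuming a Go client and a hypothetical payments endpoint; the Idempotency-Key header name follows a common convention, but your API may use a different one.

```go
package main

import (
	"bytes"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"time"
)

// newIdempotencyKey generates a random key that stays the same across all
// retries of one logical transaction, so the server can deduplicate them.
func newIdempotencyKey() string {
	b := make([]byte, 16)
	rand.Read(b)
	return hex.EncodeToString(b)
}

// postWithRetry retries a POST with exponential backoff, reusing the same
// idempotency key on every attempt.
func postWithRetry(url string, body []byte, maxRetries int) (*http.Response, error) {
	key := newIdempotencyKey()
	backoff := 100 * time.Millisecond
	const maxBackoff = 5 * time.Second

	var lastErr error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
		if err != nil {
			return nil, err
		}
		req.Header.Set("Content-Type", "application/json")
		req.Header.Set("Idempotency-Key", key) // same key on every attempt

		resp, err := http.DefaultClient.Do(req)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil // success, or a client error we should not retry
		}
		if err != nil {
			lastErr = err
		} else {
			lastErr = fmt.Errorf("server error: %s", resp.Status)
			resp.Body.Close()
		}

		time.Sleep(backoff)
		backoff *= 2 // exponential backoff...
		if backoff > maxBackoff {
			backoff = maxBackoff // ...capped at a maximum limit
		}
	}
	return nil, fmt.Errorf("giving up after %d attempts: %w", maxRetries+1, lastErr)
}

func main() {
	// Hypothetical payment endpoint, shown for illustration only.
	resp, err := postWithRetry("http://payments.internal/charge", []byte(`{"amount": 42}`), 3)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```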

8. Rate Limiters and Load Shedders

Rate limiting defines how many requests a given customer or application can make, or have processed, during a period of time. With rate limiting you can, for example, filter out the customers or microservices responsible for a traffic spike, or ensure that your application does not become overloaded before autoscaling can come to the rescue.

You can also block lower-priority traffic to provide enough resources for critical transactions.


Rate limiters can prevent traffic spikes
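
As a minimal sketch, the middleware below applies a global token-bucket limiter from the golang.org/x/time/rate package; the rate of 100 requests per second and the burst of 20 are illustrative, and a production limiter would usually track limits per client or API key.

```go
package main

import (
	"log"
	"net/http"

	"golang.org/x/time/rate"
)

// A token bucket: refill 100 tokens per second, allow bursts of up to 20.
var limiter = rate.NewLimiter(rate.Limit(100), 20)

// rateLimit rejects requests with 429 once the bucket is empty instead of
// letting the service become overloaded.
func rateLimit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiter.Allow() {
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/api", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8080", rateLimit(mux)))
}
```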


Another kind of limiter is the concurrent request limiter. It is useful when you have expensive endpoints that should not be called more than a specified number of times at once, while you still want to serve the rest of the traffic.
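
A minimal sketch of such a concurrent request limiter in Go, using a buffered channel as a counting semaphore; the /expensive-report endpoint and the limit of 10 concurrent requests are illustrative.

```go
package main

import (
	"log"
	"net/http"
)

// A buffered channel used as a counting semaphore: at most 10 requests may
// run the expensive handler concurrently.
var semaphore = make(chan struct{}, 10)

func limitConcurrency(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case semaphore <- struct{}{}: // acquired a slot
			defer func() { <-semaphore }() // release it when done
			next.ServeHTTP(w, r)
		default: // all slots busy: shed this request instead of queueing it
			http.Error(w, "too many concurrent requests", http.StatusServiceUnavailable)
		}
	})
}

func main() {
	expensive := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("report generated"))
	})
	http.Handle("/expensive-report", limitConcurrency(expensive))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```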

Load shedders ensure that enough resources are always available for critical transactions: they reserve resources for high-priority requests and keep low-priority transactions from taking them all. A load shedder makes its decisions based on the overall state of the system rather than on the size of a single user's request bucket. Load shedders help your system recover, because they keep core functionality working while an outage is ongoing.

9. Fail Fast and Independently

In a microservices architecture, we want services to fail fast and independently. To isolate failures at the service level, we can use the bulkhead pattern, which is covered later in this article.

We also want our components to fail fast, because we do not want to wait for broken instances until they time out. Nothing is more frustrating than hanging requests and an unresponsive interface; it wastes resources and ruins the user experience. Our services call each other in a chain, so special attention should be paid to preventing hanging operations before these delays add up.

The first idea that comes to mind is defining a timeout for each service call. The problem with this approach is that you cannot really know what a good timeout value is, because network glitches and other issues sometimes affect only one or two operations; in that situation, you probably do not want to reject all requests just because a few of them timed out.

We can say that trying to achieve fail-fast behavior in microservices by fine-tuning timeouts is an anti-pattern that should be avoided. Instead of timeouts, you can apply the circuit breaker pattern, which relies on the success/failure statistics of recent operations.

10. Bulkheads

In shipbuilding, bulkheads divide the hull into sections, so that if one section is breached, the others remain sealed and intact.

The concept of bulkheads can also be used in software development to isolate resources.

By applying the bulkhead pattern, we can protect limited resources from being exhausted. For example, if we have two kinds of operations that talk to the same database instance, which allows only a limited number of connections, we can use two connection pools instead of one shared pool. Thanks to this client-resource separation, an operation that times out or overuses one pool will not bring down all the other operations.
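
A minimal sketch of that separation using Go's database/sql package follows; the Postgres driver, the DSN, and the pool names and sizes are assumptions for illustration.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // any SQL driver works; Postgres is assumed here
)

// openPool creates an isolated connection pool with its own cap, so one kind
// of workload cannot exhaust the connections needed by the other.
func openPool(dsn string, maxConns int) *sql.DB {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		log.Fatal(err)
	}
	db.SetMaxOpenConns(maxConns)
	db.SetMaxIdleConns(maxConns / 2)
	return db
}

func main() {
	dsn := "postgres://app@localhost/shop?sslmode=disable" // illustrative DSN

	// Two bulkheads against the same database: slow reporting queries can
	// saturate their own pool without starving the checkout flow.
	checkoutPool := openPool(dsn, 20)
	reportingPool := openPool(dsn, 5)
	defer checkoutPool.Close()
	defer reportingPool.Close()
}
```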

One of the main reasons the Titanic sank was a flaw in its bulkhead design: the water could pour over the top of the bulkheads via the deck above and eventually flood the entire hull.


The malfunctioning bulkhead of the Titanic


11. Circuit Breakers

To limit the duration of an operation, we can use a timeout. Timeouts prevent hanging operations and keep the system responsive. However, using static, finely tuned timeouts for communication between microservices is an anti-pattern, because we operate in a highly dynamic environment where it is almost impossible to pick time limits that work correctly in every case.

Instead of small, transaction-specific static timeouts, we can use circuit breakers to handle errors. Circuit breakers are named after the real-world electrical component because they behave in the same way. With circuit breakers you can protect resources and help them recover. They are especially useful in distributed systems, where repeated failures can cause a snowball effect and bring the whole system down.

A circuit breaker opens when a particular type of error occurs multiple times within a short period. An open circuit breaker rejects further requests, just as a real one stops the flow of electricity. Circuit breakers usually close again after a certain amount of time, giving the underlying service room to recover.

Keep in mind that not every error should trip the circuit breaker. For example, you probably want to ignore client-side problems such as requests with 4xx response codes, but count 5xx server-side failures. Some circuit breakers also have a half-open state, in which the service sends a single probe request to check the availability of the system while failing the other requests. If that probe succeeds, the circuit breaker returns to the closed state and accepts traffic again; otherwise, it stays open.


The circuit breaker
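
Below is a minimal sketch of such a counting circuit breaker in Go, with closed, open, and half-open states; the thresholds and the single-probe behavior are deliberately simplified compared to production-grade libraries. A caller would wrap each remote call in cb.Call(...) and treat ErrOpen as an immediate, fail-fast error instead of waiting on the broken dependency.

```go
package breaker

import (
	"errors"
	"sync"
	"time"
)

type state int

const (
	closed state = iota
	open
	halfOpen
)

var ErrOpen = errors.New("circuit breaker is open")

// CircuitBreaker trips after maxFailures consecutive failures and lets a
// single probe request through once the cooldown period has passed.
type CircuitBreaker struct {
	mu          sync.Mutex
	state       state
	failures    int
	maxFailures int
	cooldown    time.Duration
	openedAt    time.Time
}

func New(maxFailures int, cooldown time.Duration) *CircuitBreaker {
	return &CircuitBreaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Call runs fn through the breaker, recording successes and failures.
func (cb *CircuitBreaker) Call(fn func() error) error {
	cb.mu.Lock()
	switch cb.state {
	case open:
		if time.Since(cb.openedAt) < cb.cooldown {
			cb.mu.Unlock()
			return ErrOpen // fail fast instead of hitting the broken dependency
		}
		// Cooldown is over: this call becomes the probe request.
		cb.state = halfOpen
	case halfOpen:
		cb.mu.Unlock()
		return ErrOpen // a probe is already in flight; keep failing fast
	}
	cb.mu.Unlock()

	err := fn()

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		if cb.state == halfOpen || cb.failures >= cb.maxFailures {
			// Trip (or re-trip) the breaker and start a new cooldown.
			cb.state = open
			cb.openedAt = time.Now()
			cb.failures = 0
		}
		return err
	}
	// Success: close the breaker and reset the failure count.
	cb.state = closed
	cb.failures = 0
	return nil
}
```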


12. Testing for Failures

You should continually test your system against common failure scenarios to make sure your services can survive them. Test for failures frequently so your team is prepared for real incidents.

For testing, you can use an external service that identifies groups of instances and randomly terminates one instance within a group. With this approach you can test for single-instance failures, and you can even shut down whole groups of services to simulate an outage at the cloud-provider level.
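
As a toy illustration of that idea, the sketch below picks a random instance from a group and asks it to shut down through a hypothetical admin endpoint; in practice you would use your orchestrator's API or a dedicated chaos-engineering tool rather than a hand-rolled script.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

// Hypothetical list of instances in one service group.
var instances = []string{
	"http://orders-1.internal:8080",
	"http://orders-2.internal:8080",
	"http://orders-3.internal:8080",
}

// killRandomInstance picks one instance at random and asks it to shut down
// via a hypothetical admin endpoint, simulating a single-instance failure.
func killRandomInstance() {
	target := instances[rand.Intn(len(instances))]
	fmt.Println("terminating", target)

	// The /admin/shutdown path is illustrative; real setups would call the
	// orchestrator (e.g. delete a pod) or a chaos-engineering tool instead.
	if _, err := http.Post(target+"/admin/shutdown", "application/json", nil); err != nil {
		fmt.Println("termination request failed:", err)
	}
}

func main() {
	// One random termination per hour; real chaos tools add schedules,
	// blast-radius limits, and safety checks on top of this.
	for {
		killRandomInstance()
		time.Sleep(1 * time.Hour)
	}
}
```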

13. Summary

Implementing and operating reliable services is not easy. It takes a lot of effort on your part, and it also costs your company money.


Reliability has many levels and aspects, so it is important to find the solution that best fits your team. You should make reliability a factor in your business decision-making process and allocate enough budget and time for it.

14. Key Takeaways

  • Dynamic environments and distributed systems, such as microservices, lead to a higher chance of failure.
  • Services should fail independently and degrade gracefully to improve the user experience.
  • 70% of outages are caused by changes, and rolling code back is not a bad thing.
  • Fail fast and independently; teams have no control over their service dependencies.
  • Architectural patterns and techniques such as failover caching, bulkheads, circuit breakers, and rate limiters help build reliable microservices.