
Problems faced by distributed systems

Applications in complex distributed architectures have dozens of dependencies, each of which will inevitably fail at some point.

Let me elaborate

Service avalanche

In a distributed system there are usually many layers of service invocation. Because of network problems or faults in the service itself, no service is guaranteed to be 100% available. When a service has a problem, the threads calling it block. If a large number of requests arrive at that moment, many threads block and wait, and the calling service eventually crashes.

Suppose microservice A calls microservices B and C, which in turn call other microservices. This is called "fan-out". If a microservice on the fan-out chain responds too slowly or becomes unavailable, calls to microservice A occupy more and more system resources until the system crashes. This is the "avalanche effect" of service failures.

For high-traffic applications, a single back-end dependency can saturate all resources on all servers within seconds. Worse than outright failure, such a dependency can also increase latency between services and back up queues, threads, and other system resources, leading to more cascading failures throughout the system. All of this shows the need to isolate and manage failures and latency so that the failure of a single dependency cannot take down the entire application or system.

In practice, when one instance of a module fails but still receives traffic, and that faulty module also calls other modules, a cascading failure (an avalanche) follows.

To stop an avalanche from spreading, services need to be fault tolerant: they must protect themselves from being dragged down by unreliable teammates.

Common fault tolerance schemes: isolation, timeout, rate limiting, circuit breaking (fusing), and degradation.
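As a rough illustration of the first scheme, isolation, the sketch below caps how many callers may be inside one dependency at the same time and hands out a fallback as soon as the cap is reached. It uses only java.util.concurrent.Semaphore. The class name SemaphoreBulkhead and its call method are invented for this example; it shows the idea in plain Java and is not how Hystrix itself is implemented.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical helper for illustration only; not part of Hystrix.
public class SemaphoreBulkhead {

    private final Semaphore permits;

    public SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the action if a permit is free, otherwise returns the fallback immediately. */
    public <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // dependency is saturated: fail fast instead of queueing
        }
        try {
            return action.get();
        } finally {
            permits.release(); // always give the permit back
        }
    }

    public static void main(String[] args) {
        SemaphoreBulkhead bulkhead = new SemaphoreBulkhead(1);

        // The outer call holds the only permit, so the nested call is rejected.
        String nested = bulkhead.call(
                () -> bulkhead.call(() -> "inner ok", () -> "inner rejected"),
                () -> "outer rejected");
        System.out.println(nested);

        // Once the permit has been released, calls go through again.
        String afterwards = bulkhead.call(() -> "ok", () -> "rejected");
        System.out.println(afterwards);
    }
}
```

Because a rejected caller returns at once, a slow dependency can hold at most maxConcurrentCalls threads, and the rest of the service keeps its threads free.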

Hystrix

Hystrix is an open source library for handling latency and fault tolerance in distributed systems, where calls to many dependencies inevitably fail through timeouts, exceptions, and so on.

Hystrix is able to improve the resiliency of distributed systems by ensuring that the entire service does not fail in the event of a dependency failure, avoiding cascading failures.

The "circuit breaker" itself is a switching device. When a service unit fails, the circuit breaker's fault monitoring (similar to a fuse blowing) returns an expected, manageable fallback response to the caller instead of making it wait for a long time or throwing an exception it cannot handle. This ensures that the caller's threads are not tied up unnecessarily for long periods, which prevents failures from spreading and avalanching through a distributed system.

Note: Hystrix is no longer actively developed, and Alibaba's Sentinel is a common replacement, but Hystrix still has ideas and designs worth learning.

The preparatory work

To learn about Hystrix, we first prepare a service provider on port 8001.

Create the module cloud-provider-hystrix-payment8001

Dependencies:

<dependencies>
    <!-- hystrix -->
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-netflix-hystrix</artifactId>
    </dependency>
    <!-- eureka client -->
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-netflix-eureka-client</artifactId>
    </dependency>
    <!-- web -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>
    <!-- Shared API commons module, which provides the Payment entity -->
    <dependency>
        <groupId>com.xn2001.springcloud</groupId>
        <artifactId>cloud-api-commons</artifactId>
        <version>1.0-SNAPSHOT</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-devtools</artifactId>
        <scope>runtime</scope>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
</dependencies>

application.yml

To keep things quick, we use a single-node (standalone) Eureka server.

server:
  port: 8001

spring:
  application:
    name: cloud-provider-hystrix-payment

eureka:
  client:
    # Whether to register this service with the Eureka server. Default is true.
    register-with-eureka: true
    # Whether to fetch existing registration information from the Eureka server. Default is true. In a cluster this must be true to use load balancing with Ribbon.
    fetch-registry: true
    service-url:
      defaultZone: http://eureka7001.com:7001/eureka
      # defaultZone: http://eureka7001.com:7001/eureka/,http://eureka7002.com:7002/eureka/

The main startup class

@SpringBootApplication
@EnableDiscoveryClient
public class PaymentHystrixMain8001 {
    public static void main(String[] args) {
        SpringApplication.run(PaymentHystrixMain8001.class, args);
    }
}

The business layer:

For brevity, we skip the interface and write the implementation class directly.

@Service
public class PaymentService {

    public String paymentInfoOK(Integer id){
        return "Current thread:"+Thread.currentThread().getName()+"PaymentInfo_OK, id:"+id+"\t"+"O(∩_∩)O haha~";
    }

    public String paymentInfoTimeOut(Integer id){
        int timeout=3;
        try {
            TimeUnit.SECONDS.sleep(timeout);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return "Thread pool:"+Thread.currentThread().getName()+"PaymentInfo_Timeout, id:"+id+"\t"+"┭┮﹏┭┮ sob~"+" Time (seconds):"+timeout;
    }
}

Control layer:

package com.xn2001.springcloud.controller;
import com.xn2001.springcloud.service.PaymentService;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;
import javax.annotation.Resource;

/** * Created by createcount on 2020/5/21 0:03 */
@RestController
@Slf4j
public class PaymentHystrixController {

    @Resource
    private PaymentService paymentService;
    @Value("${server.port}")
    private String serverPort;

    @GetMapping(value = "/payment/hystrix/ok/{id}")
    public String paymentInfoOK(@PathVariable("id") Integer id){
        String result = paymentService.paymentInfoOK(id);
        log.info("*****result: " + result);
        return result;
    }

    @GetMapping(value = "/payment/hystrix/timeout/{id}")
    public String paymentInfoTimeOut(@PathVariable("id") Integer id){
        String result = paymentService.paymentInfoTimeOut(id);
        log.info("*****result: " + result);
        return result;
    }
}

Start the application directly.

Visit: http://localhost:8001/payment/hystrix/ok/1

http://localhost:8001/payment/hystrix/timeout/2

The second URL takes 3 seconds to respond.

Stress test

We need the JMeter tool, which can be downloaded from the official site.

After downloading and unpacking it, go to the bin directory and double-click jmeter.bat to start it.

Note: if you are not comfortable with the English interface, you can edit the jmeter.properties file in the bin directory

and add language=zh_CN. I added it at line 38.

Let's start the stress test.

View the effect:

The 20,000 threads are all requesting http://localhost:8001/payment/hystrix/timeout/2

Yet if you now visit http://localhost:8001/payment/hystrix/ok/2, it is affected too and no longer loads instantly. (If you see no effect, raise the thread count to 200,000.)

The reason is that both endpoints live in the same microservice. While the timeout endpoint is under heavy load, the server is busy handling those 20,000 threads, so the OK endpoint is dragged down as well.
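The same effect can be reproduced on a small scale without JMeter. The sketch below is a plain-JDK simulation, not Tomcat itself: a fixed thread pool stands in for the server's worker threads, tasks blocked on a latch stand in for requests stuck in paymentInfoTimeOut, and one quick task stands in for paymentInfoOK. The class and method names are invented for the example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical simulation of one shared worker pool serving a slow and a fast endpoint.
public class SharedPoolStarvation {

    /**
     * Submits slowCalls blocked tasks to a pool of poolSize workers, then submits
     * one fast task and reports whether it finished within waitMillis.
     */
    public static boolean fastCallCompletes(int poolSize, int slowCalls, long waitMillis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch release = new CountDownLatch(1); // stays closed to simulate a hung dependency
        try {
            for (int i = 0; i < slowCalls; i++) {
                pool.submit(() -> {
                    try {
                        release.await(); // the "timeout" endpoint: occupies a worker thread
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            Future<String> fast = pool.submit(() -> "paymentInfoOK"); // the "ok" endpoint
            try {
                fast.get(waitMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // the fast call is queued behind the slow ones
            }
        } finally {
            release.countDown();
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("free workers left: " + fastCallCompletes(4, 2, 500));
        System.out.println("all workers busy:  " + fastCallCompletes(4, 4, 500));
    }
}
```

With two of four workers blocked, the fast call still gets a thread. With all four blocked, it waits in the queue until the deadline passes, which is what the OK endpoint experiences during the stress test.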

Note that Hystrix can be used on both the provider and the consumer, but it is generally used on the consumer side (port 80).

Let's add the service consumer module on port 80:

cloud-consumer-feign-hystrix-order80

Dependencies:

<dependencies>

    <!-- openfeign -->
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-openfeign</artifactId>
    </dependency>
    <!-- hystrix -->
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-netflix-hystrix</artifactId>
    </dependency>
    <!-- eureka client -->
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-netflix-eureka-client</artifactId>
    </dependency>
    <!-- Shared API commons module -->
    <dependency>
        <groupId>com.xn2001.springcloud</groupId>
        <artifactId>cloud-api-commons</artifactId>
        <version>1.0-SNAPSHOT</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-devtools</artifactId>
        <scope>runtime</scope>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>

</dependencies>

The main startup class

package com.xn2001.springcloud;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.openfeign.EnableFeignClients;

/** * Created by createcount on 2020/5/21 11:32 */
@SpringBootApplication
@EnableFeignClients
public class OrderHystrixMain80 {
    public static void main(String[] args) {
        SpringApplication.run(OrderHystrixMain80.class, args);
    }
}

Business invocation interface layer

This interface uses Feign to call the 8001 endpoints; the id is taken directly from the URI path.

I covered this in my article on Feign service invocation.

package com.xn2001.springcloud.service;

import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.stereotype.Component;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;

/** * Created by createcountry on 2020/5/21 11:34 */

@FeignClient(value = "CLOUD-PROVIDER-HYSTRIX-PAYMENT")
public interface PaymentHystrixService {

    @GetMapping("/payment/hystrix/ok/{id}")
    String paymentInfoOK(@PathVariable("id") Integer id);

    @GetMapping("/payment/hystrix/timeout/{id}")
    String paymentInfoTimeOut(@PathVariable("id") Integer id);

}

Control layer

package com.xn2001.springcloud.controller;

import com.xn2001.springcloud.service.PaymentHystrixService;
import lombok.extern.slf4j.Slf4j;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

import javax.annotation.Resource;

/** * Created by createcountry on 2020/5/21 11:35 */
@RestController
@Slf4j
public class OrderHystrixController {

    @Resource
    private PaymentHystrixService paymentHystrixService;

    @GetMapping(value ="/consumer/payment/hystrix/ok/{id}")
    public String paymentInfoOK(@PathVariable("id") Integer id){
        String result = paymentHystrixService.paymentInfoOK(id);
        return  result;
    }

    @GetMapping(value ="/consumer/payment/hystrix/timeout/{id}")
    public String paymentInfoTimeOut(@PathVariable("id") Integer id){
        String result = paymentHystrixService.paymentInfoTimeOut(id);
        return result;
    }
}

Service degradation

Service degradation means calling an alternative method to protect the microservice when something exceptional happens, such as an exception or a timeout.

When something goes wrong there is a safety net, which protects the system as a whole.

For example:

The service (8001) has timed out, and the caller (80) cannot wait indefinitely, so it must degrade.

The service (8001) is down, and the caller (80) cannot keep waiting, so it must degrade.

The service (8001) is fine, but the caller (80) is itself faulty or has its own requirement (its acceptable waiting time is shorter than the provider's processing time). In this case 80 must handle the degradation itself.
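One way to picture these cases: the caller puts a deadline on the remote call and returns a fallback value when the deadline passes or the call fails. The sketch below does this with a plain java.util.concurrent.Future. It is not Hystrix code, and TimeoutFallbackSketch, callWithFallback and the sample strings are assumptions made for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper showing "degrade on timeout or error" in plain Java.
public class TimeoutFallbackSketch {

    // Daemon threads so that an abandoned slow call cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    public static String callWithFallback(Callable<String> remoteCall, long timeoutMillis, String fallback) {
        Future<String> future = POOL.submit(remoteCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // the provider is too slow: stop waiting
            return fallback;
        } catch (ExecutionException e) {
            return fallback;     // the provider threw an exception or is down
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        }
    }

    public static void main(String[] args) {
        String ok = callWithFallback(() -> "paymentInfoOK", 1000, "fallback");
        String slow = callWithFallback(() -> {
            TimeUnit.SECONDS.sleep(3); // same 3 second delay as paymentInfoTimeOut
            return "paymentInfoTimeOut";
        }, 1000, "fallback");
        String broken = callWithFallback(() -> {
            throw new IllegalStateException("8001 is down");
        }, 1000, "fallback");
        System.out.println(ok + " / " + slow + " / " + broken);
    }
}
```

The caller waits at most timeoutMillis, no matter how long the provider takes, which is the property each of the three cases asks for.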


I'll cover the use of global service degradation.

In this example, service degradation is implemented on the client (80). You can also implement it on 8001; choose according to your business needs.

Our Feign interface, PaymentHystrixService, currently has two methods. We will add degradation for both, so that a fallback method is called whenever one of them fails.

@FeignClient(value = "CLOUD-PROVIDER-HYSTRIX-PAYMENT")
public interface PaymentHystrixService {

    @GetMapping("/payment/hystrix/ok/{id}")
    String paymentInfoOK(@PathVariable("id") Integer id);

    @GetMapping("/payment/hystrix/timeout/{id}")
    String paymentInfoTimeOut(@PathVariable("id") Integer id);

}

We create a new class, PaymentFallbackService, to implement this interface and handle exceptions uniformly.

package com.xn2001.springcloud.service;

import org.springframework.stereotype.Component;

/** * Created by createcount on 2020/5/21 18:47 */
@Component
public class PaymentFallbackService implements PaymentHystrixService{
    @Override
    public String paymentInfoOK(Integer id) {
        return "--------paymentFallbackService fall back paymentInfoOK ┭┮﹏┭┮";
    }

    @Override
    public String paymentInfoTimeOut(Integer id) {
        return "--------paymentFallbackService fall back paymentInfoTimeOut ┭┮﹏┭┮";
    }
}

How do we wire the two together? Implementing the interface alone is not enough.

We add another attribute to the @FeignClient annotation on the PaymentHystrixService interface:

fallback = PaymentFallbackService.class (the class that handles the degradation)

The result is as follows.

package com.xn2001.springcloud.service;

import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.stereotype.Component;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;

/** * Created by createcountry on 2020/5/21 11:34 */

@FeignClient(value = "CLOUD-PROVIDER-HYSTRIX-PAYMENT",fallback = PaymentFallbackService.class)
public interface PaymentHystrixService {

    @GetMapping("/payment/hystrix/ok/{id}")
    String paymentInfoOK(@PathVariable("id") Integer id);

    @GetMapping("/payment/hystrix/timeout/{id}")
    String paymentInfoTimeOut(@PathVariable("id") Integer id);

}

Finally, enable the configuration in application.yml

feign:
  hystrix:
    enabled: true

Add the @EnableHystrix annotation to the startup class.

Now visit:

http://localhost/consumer/payment/hystrix/ok/5 and http://localhost/consumer/payment/hystrix/timeout/6

As you have probably guessed, the former is served normally, so the output is

Current thread: http-nio-8001-exec-1 PaymentInfo_OK, id: 5 O(∩_∩)O haha~

The latter is different: we made the provider thread wait for 3 seconds, but by default (if we configure nothing) Feign reports a timeout once a call exceeds 1 second. If this is unclear, see my blog post on Feign service invocation. The degradation therefore kicks in, calls the fallback method we just wrote, and outputs

--------paymentFallbackService fall back paymentInfoTimeOut ┭┮﹏┭┮

This way, when the server is unavailable, the client gets a message back instead of hanging and tying up server resources.

Besides global service degradation, you can also configure degradation for individual methods.

Configuring a fallback method for every single method is technically possible, but impractical.

Apart from a few important core operations that deserve their own fallback, ordinary operations can be handled globally.

Additional

There is another kind of global service degradation, which specifies one unified fallback method. It takes priority over the implementation class above.

Let’s go straight to the Controller layer and test it

Add a method:

public String paymentGlobalFallbackMethod(){
    return "Global exception handling message, please try again later: orz~";
}

Add an annotation to the class; the defaultFallback attribute is the name of your method:

@DefaultProperties(defaultFallback = "paymentGlobalFallbackMethod")

We remove fallback = PaymentFallbackService.class from the PaymentHystrixService interface.

Add the @HystrixCommand annotation to the method that waits 3 seconds.

The final class looks like this:

package com.xn2001.springcloud.controller;

import com.netflix.hystrix.contrib.javanica.annotation.DefaultProperties;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.xn2001.springcloud.service.PaymentHystrixService;
import lombok.extern.slf4j.Slf4j;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;
import javax.annotation.Resource;

/** * Created by createcountry on 2020/5/21 11:35 */
@RestController
@Slf4j
@DefaultProperties(defaultFallback = "paymentGlobalFallbackMethod")
public class OrderHystrixController {

    @Resource
    private PaymentHystrixService paymentHystrixService;

    @GetMapping(value ="/consumer/payment/hystrix/ok/{id}")
    public String paymentInfoOK(@PathVariable("id") Integer id){
        String result = paymentHystrixService.paymentInfoOK(id);
        return  result;
    }

    @GetMapping(value ="/consumer/payment/hystrix/timeout/{id}")
    @HystrixCommand
    public String paymentInfoTimeOut(@PathVariable("id") Integer id){
        String result = paymentHystrixService.paymentInfoTimeOut(id);
        return result;
    }

    public String paymentGlobalFallbackMethod(){
        return "Global exception handling message, please try again later: orz~";
    }
}

Start the service and visit http://localhost/consumer/payment/hystrix/timeout/6

You can then experiment yourself: for example, shut down 8001 so that the provider is dead, and access the URL again to see what happens.

Service circuit breaking

What is it? It is similar to service degradation, but it acts globally. Let's try it out. (Make sure you understand service degradation before reading on.)

To reduce the number of microservices we have to start, let's keep it simple and go back to the 8001 provider.

Add a new method to PaymentService (I'll explain how it works later):

When id < 0, an exception is thrown. Because we have configured degradation, the fallback method is invoked.

    @HystrixCommand(fallbackMethod = "paymentCircuitBreakerFallback", commandProperties = {
            @HystrixProperty(name = "circuitBreaker.enabled", value = "true"), // whether to enable the circuit breaker
            @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold", value = "10"), // request volume threshold
            @HystrixProperty(name = "circuitBreaker.sleepWindowInMilliseconds", value = "10000"), // sleep time window
            @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", value = "60") // error percentage threshold
    })
    public String paymentCircuitBreaker(@PathVariable("id") Integer id){
        if (id < 0) {
            throw new RuntimeException();
        }
        // Generate a serial number (IdUtil is assumed to be Hutool's cn.hutool.core.util.IdUtil)
        String serialNumber = IdUtil.simpleUUID();
        return Thread.currentThread().getName()+"\t "+"Call successful, serial number:"+serialNumber;
    }

    // The fallback (degraded) method
    public String paymentCircuitBreakerFallback(@PathVariable("id") Integer id){
        return "Id cannot be negative, please try again later ~ id:"+ id;
    }

Control layer interface:

@GetMapping("/payment/circuit/{id}")
public String paymentCircuitBreaker(@PathVariable("id") Integer id){
    String circuitBreaker = paymentService.paymentCircuitBreaker(id);
    log.info("******result: "+circuitBreaker);
    return circuitBreaker;
}

Note: check that the startup class has the @EnableHystrix annotation.

It should have three annotations:

@SpringBootApplication
@EnableDiscoveryClient
@EnableHystrix

Start our Eureka registry and 8001 service provider.

Visit http://localhost:8001/payment/circuit/3 and http://localhost:8001/payment/circuit/-3 to check that both respond as expected.

We test it with the stress test tool described above.

Let's see what happens when we start 11 threads against the failing URL.

Now visit http://localhost:8001/payment/circuit/3 and you will see the result: Id cannot be negative, please try again later ~ id: 3. This is circuit breaking, and it works like a fuse. When the errors on a route exceed the configured limits (our test used 11 threads, above the request volume threshold of 10, and all 11 hit the failing URL, so the error percentage was 100%, above the 60% threshold), the fuse blows. Even better, after 10 seconds (the sleep window in our code) it recovers automatically, because by then we are no longer requesting the failing URL. The workflow is described below.
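The arithmetic behind this test can be written as a small check. The sketch below is a simplification with invented names (TripDecision, shouldOpen): it looks only at raw totals, whereas Hystrix evaluates its counters over a rolling time window.

```java
// Hypothetical helper: decides whether the breaker should open for the given counts.
public class TripDecision {

    public static boolean shouldOpen(int totalRequests, int failedRequests,
                                     int requestVolumeThreshold, int errorThresholdPercentage) {
        // Too little traffic in the window: never open, whatever the error rate is.
        if (totalRequests == 0 || totalRequests < requestVolumeThreshold) {
            return false;
        }
        int errorPercentage = failedRequests * 100 / totalRequests;
        return errorPercentage >= errorThresholdPercentage;
    }

    public static void main(String[] args) {
        // Settings from the example above: threshold of 10 requests, 60% errors.
        System.out.println(shouldOpen(11, 11, 10, 60)); // 11 requests, all failing
        System.out.println(shouldOpen(9, 9, 10, 60));   // all failing, but below the volume threshold
        System.out.println(shouldOpen(20, 11, 10, 60)); // enough traffic, only 55% errors
    }
}
```

Only the first call returns true: the 11 failing requests clear both the volume threshold and the error percentage threshold, which is why the breaker opened in the test.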

The working process

The GitHub documentation reads as follows:

Circuit Breaker

The following diagram shows how a HystrixCommand or HystrixObservableCommand interacts with a HystrixCircuitBreaker and its flow of logic and decision-making, including how the counters behave in the circuit breaker.

The precise way that the circuit opening and closing occurs is as follows:

  1. Assuming the volume across a circuit meets a certain threshold (HystrixCommandProperties.circuitBreakerRequestVolumeThreshold())…
  2. And assuming that the error percentage exceeds the threshold error percentage (HystrixCommandProperties.circuitBreakerErrorThresholdPercentage())…
  3. Then the circuit-breaker transitions from CLOSED to OPEN.
  4. While it is open, it short-circuits all requests made against that circuit-breaker.
  5. After some amount of time (HystrixCommandProperties.circuitBreakerSleepWindowInMilliseconds()), the next single request is let through (this is the HALF-OPEN state). If the request fails, the circuit-breaker returns to the OPEN state for the duration of the sleep window. If the request succeeds, the circuit-breaker transitions to CLOSED and the logic in 1. takes over again.

My understanding

This is my own understanding of how the circuit breaker works, based on the official introduction and video tutorials.

  1. Snapshot time window: the circuit breaker needs to collect request and error statistics to decide whether to open. The snapshot time window is the most recent 10 seconds by default.

  2. Request volume threshold: within the snapshot time window, the total number of requests must reach this threshold before the breaker can open at all. The default is 20, which means that if the Hystrix command is invoked fewer than 20 times within 10 seconds (the snapshot time window), the breaker will not open even if every request times out or fails for other reasons.

  3. Error percentage threshold: suppose the total number of requests in the snapshot time window reaches the threshold, for example 30 calls (above the default of 20), and 15 of those 30 calls fail with timeout exceptions. The error percentage is then 50%, which reaches the default threshold of 50%, so the breaker opens.

  4. Once the breaker is open, the main logic is cut off and Hystrix starts a sleep time window (5 seconds by default). During this window the fallback logic temporarily stands in for the main logic. When the sleep window expires, the breaker enters the half-open state and lets a single request through to the original main logic. If that request returns normally, the breaker closes and the main logic is restored. If the request still fails, the breaker stays open and the sleep window starts over.
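The four points above can be condensed into a small state machine. The sketch below is only a teaching model under simplifying assumptions: the counters never roll over (there is no real snapshot window), and the clock is injected so the sleep window can be stepped through by hand. SimpleCircuitBreaker and all of its members are invented names, not Hystrix classes.

```java
import java.util.function.LongSupplier;

// Hypothetical, simplified model of the CLOSED -> OPEN -> HALF_OPEN cycle.
public class SimpleCircuitBreaker {

    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int requestVolumeThreshold;
    private final int errorThresholdPercentage;
    private final long sleepWindowMillis;
    private final LongSupplier clock; // returns the current time in milliseconds

    private State state = State.CLOSED;
    private int total;
    private int failures;
    private long openedAt;

    public SimpleCircuitBreaker(int requestVolumeThreshold, int errorThresholdPercentage,
                                long sleepWindowMillis, LongSupplier clock) {
        this.requestVolumeThreshold = requestVolumeThreshold;
        this.errorThresholdPercentage = errorThresholdPercentage;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    /** True if the main logic may run; false means go straight to the fallback. */
    public synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= sleepWindowMillis) {
            state = State.HALF_OPEN; // sleep window over: let one trial request through
            return true;
        }
        return state == State.CLOSED;
    }

    public synchronized void recordSuccess() {
        if (state == State.HALF_OPEN) {
            state = State.CLOSED;    // trial request succeeded: restore the main logic
            total = 0;
            failures = 0;
        } else if (state == State.CLOSED) {
            total++;
        }
    }

    public synchronized void recordFailure() {
        if (state == State.HALF_OPEN) {
            open();                  // trial request failed: restart the sleep window
        } else if (state == State.CLOSED) {
            total++;
            failures++;
            if (total >= requestVolumeThreshold
                    && failures * 100 / total >= errorThresholdPercentage) {
                open();
            }
        }
    }

    private void open() {
        state = State.OPEN;
        openedAt = clock.getAsLong();
    }

    public synchronized State state() {
        return state;
    }

    public static void main(String[] args) {
        long[] now = {0}; // fake clock, advanced by hand
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(10, 60, 10_000, () -> now[0]);
        for (int i = 0; i < 11; i++) {
            if (breaker.allowRequest()) {
                breaker.recordFailure();
            }
        }
        System.out.println(breaker.state()); // OPEN
        now[0] = 10_000;                     // the sleep window has passed
        if (breaker.allowRequest()) {
            breaker.recordSuccess();
        }
        System.out.println(breaker.state()); // CLOSED
    }
}
```

The main method replays the earlier experiment with the same settings (threshold 10, 60%, 10 second sleep window): the tenth failure opens the breaker, the eleventh request is short-circuited, and one successful trial request after the sleep window closes it again.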

Note that you can see these default settings and parameters in IDEA by pressing Shift twice and searching for HystrixCommandProperties.

Hystrix also has a graphical dashboard; check out this blog post.