Full-text overview

[TOC]

Why do we need Hystrix

Hystrix is on Github

  • Hystrix is another of Netflix’s contributions to distributed systems, and like the other Netflix components it has entered maintenance mode. Being unmaintained does not mean it is obsolete; the technology simply keeps iterating, and a design this good is still well worth studying.

  • In a distributed environment, service invocation is both a convenience and a headache. The service governance section introduced what service governance does, and in the previous lesson we used Ribbon and Feign to make service calls. Monitoring and managing those calls is the natural next step. Hystrix is about isolating services so that one failure does not cascade and make the entire system unavailable.

  • As shown in the figure above, multiple clients invoke Aservice, which has three instances in the distributed system; part of Aservice’s logic is handled by Bservice, which has two instances deployed. Now suppose communication between Aservice and Bservice fails because of a network problem, and Bservice only writes logs. From the system’s point of view, losing some logs matters far less than an outage, yet Aservice becomes unavailable just because of a communication problem with a non-critical dependency. That is hardly worth it.

  • Look at the figure above: A --> B --> C --> D. Service D goes down, so C hits an exception while processing, but C’s threads are still tied up responding to B. As concurrent requests keep arriving, C’s thread pool fills up and CPU usage climbs. Other interfaces of service C are then also affected by the high CPU load and respond slowly.

Features

Hystrix is a third-party latency and fault-tolerance library designed to isolate access points to remote systems, services, and third-party libraries. The official project has stopped active development and recommends Resilience4J instead; in China there is also Spring Cloud Alibaba.

Hystrix addresses the service avalanche scenario by isolating access between services, implementing latency and fault tolerance in a distributed system, and providing fallbacks.

  • Fault tolerance for network delays and faults
  • Break the distributed system avalanche
  • Fail fast and recover rapidly
  • Service degradation
  • Real-time monitoring, alarm

$$
\begin{aligned}
&99.99\%^{30} = 99.7\%\ \text{uptime} \\
&0.3\%\ \text{of 1 billion requests} = 3{,}000{,}000\ \text{failures} \\
&2{+}\ \text{hours downtime/month, even if all dependencies have excellent uptime}
\end{aligned}
$$
  • The official site gives a statistic: if each of 30 dependency services fails 0.01% of the time, then for every 100 million requests roughly 300,000 fail, which works out to at least two hours of downtime per month. For an Internet-scale system that is fatal.

  • The website also describes two scenarios, similar to what we built in the last chapter; both are situations that lead to a service avalanche.

Project preparation

  • In the OpenFeign chapter we talked about Feign-based service circuit breaking and mentioned that it is implemented on top of Hystrix. We also looked at the POM structure at the time: the Eureka starter has Ribbon built in, and it also has a Hystrix module built in.

  • Although hystrix is already on the classpath, let’s add the corresponding starter explicitly to enable the configuration. This is really the example from the OpenFeign project, where we provided the PaymentServiceFallbackImpl and PaymentServiceFallbackFactoryImpl classes as fallbacks; at that time we only needed to point out that OpenFeign supports both ways of configuring a fallback.

        <!-- hystrix -->
        <dependency>
            <groupId>org.springframework.cloud</groupId>
            <artifactId>spring-cloud-starter-netflix-hystrix</artifactId>
        </dependency>

    Today we will demonstrate what can happen when, as in many traditional enterprises, there is no fallback plan at all.

Interface testing

  • First we test the Payment#createByOrder interface and check the response.

  • Then we test the Payment#getTimeout/{id} method.

    • Now we use JMeter to put pressure on the Payment#getTimeout/{id} interface. Each request has to wait 4 seconds, so resources are quickly exhausted, and at that point Payment#createByOrder blocks as well.

    • The default maximum number of Tomcat threads in Spring Boot is 200. To protect our hard-working laptop, we set the thread count much lower here; this makes it easy to reproduce the situation where all threads are busy, which then affects the Payment#createByOrder interface. A configuration sketch follows below.
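    • For reference, a minimal sketch of that cap in the module’s application.yml. The property name is an assumption for Spring Boot 2.x (2.3+ renamed it to server.tomcat.threads.max), and 10 matches the value we use later:

        # assumed demo configuration: shrink Tomcat's worker pool so thread
        # exhaustion is easy to reproduce locally (the default is 200)
        server:
          tomcat:
            max-threads: 10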

  • What we tested above are the native Payment interfaces. If we pressure the order module instead, and no fallback is configured in OpenFeign, the order service also becomes slow to respond, because the Payment#getTimeout/{id} interface is handling the concurrency and its threads are full. This is the avalanche effect. Let’s deal with avalanches from two angles.

Business isolation

  • The scenario above happens because Payment#createByOrder and Payment#getTimeout/{id} both belong to the Payment service, and a Payment service is really just one Tomcat instance with one thread pool. Every request that lands on Tomcat borrows a thread from that pool, and only once it has a thread can it process the business logic. Because the pool is shared within Tomcat, once Payment#getTimeout/{id} drains it, other interfaces, even completely unrelated ones, have no threads left to apply for and can only wait for resources to be released.

  • It is like taking the elevator at rush hour: because one company’s staff all arrive at the same time, every elevator is occupied, and at that moment even the national leaders cannot get on.

  • We also know how this is usually solved: the building sets aside a dedicated elevator for special use.

  • We solve the software problem the same way: isolation. Give different interfaces different thread pools, and one of them filling up no longer causes an avalanche.

Thread isolation

  • Remember that we set the maximum number of threads for the order module to 10 above to demonstrate concurrency. Here we use the test tool to call the order/getpayment/1 interface and watch what the log prints.

  • Where our interface is called we print the current thread, and we can see the same 10 threads being reused back and forth. That is exactly why the avalanche above happens.

    @HystrixCommand(
            groupKey = "order-service-getPaymentInfo",
            commandKey = "getPaymentInfo",
            threadPoolKey = "orderServicePaymentInfo",
            commandProperties = {
                    @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", value = "1000")
            },
            threadPoolProperties = {
                    @HystrixProperty(name = "coreSize", value = "6"),
                    @HystrixProperty(name = "maxQueueSize", value = "100"),
                    @HystrixProperty(name = "keepAliveTimeMinutes", value = "2"),
                    @HystrixProperty(name = "queueSizeRejectionThreshold", value = "100")
            },
            fallbackMethod = "getPaymentInfoFallback"
    )
    @RequestMapping(value = "/getpayment/{id}",method = RequestMethod.GET)
    public ResultInfo getPaymentInfo(@PathVariable("id") Long id) {
        log.info(Thread.currentThread().getName());
        return restTemplate.getForObject(PAYMENT_URL+"/payment/get/"+id, ResultInfo.class);
    }
    public ResultInfo getPaymentInfoFallback(@PathVariable("id") Long id) {
        log.info("Now that we're in the alternative, let's go to the free thread."+Thread.currentThread().getName());
        return new ResultInfo();
    }
    @HystrixCommand(
            groupKey = "order-service-getpaymentTimeout",
            commandKey = "getpaymentTimeout",
            threadPoolKey = "orderServicegetpaymentTimeout",
            commandProperties = {
                    @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", value = "10000")
            },
            threadPoolProperties = {
                    @HystrixProperty(name = "coreSize", value = "3"),
                    @HystrixProperty(name = "maxQueueSize", value = "100"),
                    @HystrixProperty(name = "keepAliveTimeMinutes", value = "2"),
                    @HystrixProperty(name = "queueSizeRejectionThreshold", value = "100")
            }
    )
    @RequestMapping(value = "/getpaymentTimeout/{id}",method = RequestMethod.GET)
    public ResultInfo getpaymentTimeout(@PathVariable("id") Long id) {
        log.info(Thread.currentThread().getName());
        return orderPaymentService.getTimeOut(id);
    }
  • The dynamics are hard to show here, so I will just give you the resulting data.
| Concurrency | getpaymentTimeout/{id} | /getpayment/{id} |
| --- | --- | --- |
| 20 | Errors occur once the three threads fill up | Responds normally, though a little slowly; CPU thread switching takes time |
| 30 | Same as above | Same as above |
| 50 | Same as above | Also times out, because order’s calls put pressure on the Payment service |
  • If we had put Hystrix on the native Payment service, the third case above would not happen. The reason I put it on order instead is to show you the avalanche scenario: at a concurrency of 50, payment’s maximum of 10 threads only has so much throughput, so in the order#getPayment/{id} interface, even though the order module has its own threads thanks to Hystrix thread isolation, the weak native service still times out and drags the result down. This demonstration is meant to lead into using fallbacks to deal with the avalanche.
  • We can also configure a fallback in the Payment service itself via Hystrix, keeping payment low-latency so that the order module’s normal interfaces such as order#getPayment are not broken just because payment itself is slow.
  • One more thing: even with thread isolation through Hystrix, the response time of the other interfaces grows a little, because the CPU pays a cost for every thread switch. This is a real pain point; we cannot apply thread isolation indiscriminately, which leads us to semaphore isolation.

Semaphore isolation

  • Semaphore isolation is not demonstrated here, because a demonstration would not add much; the configuration is below.
    @HystrixCommand(
            commandProperties = {
                    @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", value = "1000"),
                    @HystrixProperty(name = HystrixPropertiesManager.EXECUTION_ISOLATION_STRATEGY, value = "SEMAPHORE"),
                    @HystrixProperty(name = HystrixPropertiesManager.EXECUTION_ISOLATION_SEMAPHORE_MAX_CONCURRENT_REQUESTS, value = "6")
            },
            fallbackMethod = "getPaymentInfoFallback"
    )
  • The configuration above sets the maximum semaphore to 6: once 6 requests are executing concurrently, additional requests are turned away and degraded, and the command timeout is 1s.
| Measure | Advantages | Disadvantages | Timeout | Circuit breaking | Asynchronous |
| --- | --- | --- | --- | --- | --- |
| Thread isolation | One thread pool per call; no mutual interference; guarantees availability | CPU cost of thread switching | ✓ | ✓ | ✓ |
| Semaphore isolation | Avoids CPU switching; efficient | Under high concurrency the number of semaphores to maintain grows | ✗ | ✓ | ✗ |
  • Besides isolation methods such as thread isolation and semaphore isolation, we can also improve stability through request merging and interface data caching. Request merging is covered later in this article; a caching sketch follows below.
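  • Here is a minimal caching sketch using hystrix-javanica’s @CacheResult and @CacheKey annotations. The service class and method names are assumptions, and a HystrixRequestContext must be active for the request (for example via HystrixRequestContextServletFilter):

    import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
    import com.netflix.hystrix.contrib.javanica.cache.annotation.CacheKey;
    import com.netflix.hystrix.contrib.javanica.cache.annotation.CacheResult;
    import org.springframework.stereotype.Service;

    // hypothetical service: inside one HystrixRequestContext, repeated calls with
    // the same id are answered from the request cache instead of the remote service
    @Service
    public class PaymentQueryService {

        @CacheResult
        @HystrixCommand(commandKey = "getPaymentCached")
        public ResultInfo getPaymentCached(@CacheKey Long id) {
            // the real remote call is omitted in this sketch
            return new ResultInfo();
        }
    }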

Service degradation

Trigger conditions

  • A program exception (other than HystrixBadRequestException)
  • Service invocation timeout
  • Service circuit breaking (fusing)
  • Thread pool or semaphore exhausted

  • For our timeout interface above, both thread isolation and semaphore isolation simply reject further requests once the limit is hit, which is rather crude. That is where the fallback we mentioned earlier comes in.

  • Remember too that when order ran 50 concurrent requests against the timeout interface, the getPayment interface also failed, and we found the cause was pressure on the underlying payment service. If we add a fallback inside payment, it can respond quickly even when resources are scarce, which at least keeps the order#getPayment method available.

    • But that kind of configuration is only for experiments; in real production it is neither feasible nor sensible to configure a fallback on every single method.

    • Besides the fallback customized on a specific method, Hystrix also offers a class-level default: annotate the class with @DefaultProperties(defaultFallback = "globalFallback") to define a global fallback. When a method meets the degradation conditions and its own @HystrixCommand annotation has no fallback configured, the class-level fallback is used; if there is none there either, the exception is thrown. A minimal sketch appears after this list.

      Shortcomings

      • Although DefaultProperties saves us from configuring a fallback on every interface, this “global” is not truly global: we still have to configure it on every class. I searched for a genuinely application-wide fallback but could not find one.
      • But in the OpenFeign chapter we talked about service degradation implemented by OpenFeign together with Hystrix, and one of the things mentioned there was the FallbackFactory class. You can think of it roughly like Spring’s BeanFactory: it produces the FallBack objects we need, and in that factory we can generate a proxy object of the generic fallback type whose proxied methods accept and return values according to their signatures.
      • That way we only have to configure the factory class wherever OpenFeign is used, which avoids writing lots of fallbacks. The fly in the ointment is that each place still needs to specify it. If you are interested in FallBackFactory, download the source code and take a look, or visit the OpenFeign homepage.
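  • A minimal sketch of the class-level default fallback mentioned above; the controller and its methods are assumptions, only the two annotations matter:

    import com.netflix.hystrix.contrib.javanica.annotation.DefaultProperties;
    import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;

    // hypothetical controller: any @HystrixCommand method in this class that has no
    // fallbackMethod of its own degrades to globalFallback
    @RestController
    @DefaultProperties(defaultFallback = "globalFallback")
    public class OrderQueryController {

        @HystrixCommand   // no fallbackMethod -> the class-level default is used
        @GetMapping("/order/demo")
        public ResultInfo demo() {
            throw new RuntimeException("trigger degradation");
        }

        // a default fallback must take no parameters and return a compatible type
        public ResultInfo globalFallback() {
            return new ResultInfo();
        }
    }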

Service fusing

    @HystrixCommand(
            commandProperties = {
                    @HystrixProperty(name = "circuitBreaker.enabled", value = "true"),                    // whether the circuit breaker is enabled
                    @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold", value = "10"),       // number of requests in the statistics window
                    @HystrixProperty(name = "circuitBreaker.sleepWindowInMilliseconds", value = "10000"), // how long the breaker stays open
                    @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", value = "60")      // error percentage that trips the breaker
            },
            fallbackMethod = "getInfoFallback"
    )
    @RequestMapping(value = "/get", method = RequestMethod.GET)
    public ResultInfo get(@RequestParam Long id) {
        if (id < 0) {
            int i = 1 / 0;
        }
        log.info(Thread.currentThread().getName());
        return orderPaymentService.get(id);
    }
    public ResultInfo getInfoFallback(@RequestParam Long id) {

        return new ResultInfo();
    }
  • First we turn on the circuit breaker with circuitBreaker.enabled = true
  • circuitBreaker.requestVolumeThreshold sets the number of requests in the statistics window
  • circuitBreaker.sleepWindowInMilliseconds sets the sliding time window: how long after tripping the breaker waits before letting a trial request through, commonly called the half-open state
  • circuitBreaker.errorThresholdPercentage sets the error percentage that trips the breaker
  • With the configuration above, if the error rate of the last 10 requests reaches 60%, circuit breaking and degradation are triggered and the service stays broken for 10 seconds; after 10 seconds it probes the latest service state again
  • Next we use JMeter to fire 20 requests at http://localhost/order/get?id=-1. All 20 requests error either way, but you can see the difference: the first errors are caused by the bug in our code (the division by zero), while the later ones are Hystrix’s circuit breaker error: short-circuited and fallback failed

  • Normally we would configure a fallback in Hystrix, and we already wrote two fallbacks in the degradation section above; I leave it out here precisely so you can see the difference.

  • The parameters configurable in HystrixCommand are essentially all listed in the HystrixPropertiesManager class; you can see six constants there for the circuit breaker, four of which are the settings we used above. A sketch using the constants follows below.
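  • For illustration, the same circuit-breaker settings written with those constants instead of raw strings; the constant names are assumed to come from com.netflix.hystrix.contrib.javanica.conf.HystrixPropertiesManager, and the annotation is meant to replace the one on the /get handler shown above:

    // equivalent to the raw-string configuration on the /get handler above
    @HystrixCommand(
            commandProperties = {
                    @HystrixProperty(name = HystrixPropertiesManager.CIRCUIT_BREAKER_ENABLED, value = "true"),
                    @HystrixProperty(name = HystrixPropertiesManager.CIRCUIT_BREAKER_REQUEST_VOLUME_THRESHOLD, value = "10"),
                    @HystrixProperty(name = HystrixPropertiesManager.CIRCUIT_BREAKER_SLEEP_WINDOW_IN_MILLISECONDS, value = "10000"),
                    @HystrixProperty(name = HystrixPropertiesManager.CIRCUIT_BREAKER_ERROR_THRESHOLD_PERCENTAGE, value = "60")
            },
            fallbackMethod = "getInfoFallback"
    )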

Service rate limiting

  • The two kinds of isolation we covered in the service degradation section above are themselves the policies that implement rate limiting.

Request merging

  • Besides circuit breaking, degradation, and rate limiting, Hystrix also provides request merging: as the name implies, it combines multiple requests into one to reduce concurrency.
  • For example, the order service queries order information one request at a time via order/getId?id=1, and suddenly 10,000 requests arrive. To ease the pressure we gather every 100 requests into a single call to order/getIds?ids=xxxxx, so the payment module ends up seeing only 10,000 / 100 = 100 requests. Let’s implement the following request merge with code configuration.

HystrixCollapser

@Target({ElementType.METHOD})
@Retention(RetentionPolicy.RUNTIME)
@Documented
public @interface HystrixCollapser {
    String collapserKey() default "";

    String batchMethod();

    Scope scope() default Scope.REQUEST;

    HystrixProperty[] collapserProperties() default {};
}
| Attribute | Meaning |
| --- | --- |
| collapserKey | Unique identifier |
| batchMethod | The batch-processing method, i.e. the method called after the merge |
| scope | Scope; two values [REQUEST, GLOBAL]. REQUEST: only requests within the same user request are merged. GLOBAL: requests from any thread are added to the global batch |
| collapserProperties (HystrixProperty[]) | Configures related parameters |

  • All of Hystrix’s property names are collected in HystrixPropertiesManager.java. The collapser has only two related settings, representing the maximum number of requests per batch and the statistics time window respectively.
    @HystrixCollapser(
            scope = com.netflix.hystrix.HystrixCollapser.Scope.GLOBAL,
            batchMethod = "getIds",
            collapserProperties = {
                    @HystrixProperty(name = HystrixPropertiesManager.MAX_REQUESTS_IN_BATCH, value = "3"),
                    @HystrixProperty(name = HystrixPropertiesManager.TIMER_DELAY_IN_MILLISECONDS, value = "10")
            }
    )
    @RequestMapping(value = "/getId", method = RequestMethod.GET)
    public ResultInfo getId(@RequestParam Long id) {
        if (id < 0) {
            int i = 1 / 0;
        }
        log.info(Thread.currentThread().getName());
        return null;
    }
    @HystrixCommand
    public List<ResultInfo> getIds(List<Long> ids) {
        System.out.println(ids.size() + "@@@@@@@@@");
        return orderPaymentService.getIds(ids);
    }
  • Above we configured getId so that its requests are handed to getIds: within each 10 ms window, at most 3 requests are merged into one batch. getIds then calls the Payment service to query them in one go and returns a list of ResultInfo.

  • We pressure-test the getId interface with JMeter, and in the logs the ids list never grows beyond 3, which validates the configuration of the getId interface above. This way, under high concurrency, calls are merged and the request volume hitting the downstream service drops.

  • Above we performed interface merging with an annotation on the request method; internally, Hystrix still implements this through HystrixCommand and the raw HystrixCollapser API. A minimal sketch of that API follows below.
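  • To make that concrete, here is a minimal sketch of the raw API: a HystrixCollapser subclass whose batch is executed by a nested HystrixCommand. The class names and the fake batch lookup are assumptions; what matters are the three overridden methods:

    import com.netflix.hystrix.HystrixCollapser;
    import com.netflix.hystrix.HystrixCollapser.CollapsedRequest;
    import com.netflix.hystrix.HystrixCollapserKey;
    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;

    // hypothetical raw-API version of the getId/getIds merge above
    public class PaymentCollapser extends HystrixCollapser<List<ResultInfo>, ResultInfo, Long> {

        private final Long id;

        public PaymentCollapser(Long id) {
            super(HystrixCollapserKey.Factory.asKey("paymentCollapser"));
            this.id = id;
        }

        @Override
        public Long getRequestArgument() {
            return id;
        }

        // Hystrix hands the collected single requests to one batch command
        @Override
        protected HystrixCommand<List<ResultInfo>> createCommand(Collection<CollapsedRequest<ResultInfo, Long>> requests) {
            List<Long> ids = new ArrayList<>();
            requests.forEach(r -> ids.add(r.getArgument()));
            return new BatchCommand(ids);
        }

        // distribute the batch response back to the individual callers, in order
        @Override
        protected void mapResponseToRequests(List<ResultInfo> batchResponse, Collection<CollapsedRequest<ResultInfo, Long>> requests) {
            int i = 0;
            for (CollapsedRequest<ResultInfo, Long> request : requests) {
                request.setResponse(batchResponse.get(i++));
            }
        }

        private static class BatchCommand extends HystrixCommand<List<ResultInfo>> {
            private final List<Long> ids;

            BatchCommand(List<Long> ids) {
                super(HystrixCommandGroupKey.Factory.asKey("paymentBatch"));
                this.ids = ids;
            }

            @Override
            protected List<ResultInfo> run() {
                // a real implementation would query the payment service once for all ids
                List<ResultInfo> results = new ArrayList<>();
                ids.forEach(x -> results.add(new ResultInfo()));
                return results;
            }
        }
    }

  • Several new PaymentCollapser(id).queue() calls issued within one collapse window end up in a single BatchCommand execution (for the default REQUEST scope a HystrixRequestContext must be active).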

The working process

  • The official website provides a flow diagram and a description of the process in nine steps. Let’s walk through them.

  • ① Create a HystrixCommand or HystrixObservableCommand object

    • HystrixCommand: Used to rely on a single service
    • HystrixObservableCommand: Used to rely on multiple services
  • ② A HystrixCommand executes execute or queue; a HystrixObservableCommand executes observe or toObservable

| Method | Role |
| --- | --- |
| execute | Synchronous execution; returns the result object or throws an exception |
| queue | Asynchronous execution; returns a Future |
| observe | Returns a hot Observable |
| toObservable | Returns a cold Observable |
  • ③ Check whether the request cache is enabled and whether it is hit; on a hit, the cached response is returned
  • ④ Check whether the circuit breaker is open. If it is open, degrade to the fallback; if it is closed, let the request through
  • ⑤ Check whether the thread pool or semaphore still has capacity. If resources are insufficient, go to the fallback; otherwise let the request through
  • ⑥ Execute the run or construct method. These two methods are native to Hystrix: a plain Java usage of Hystrix implements its logic in them, and Spring Cloud has wrapped this for us, so we do not look at them here (a raw command sketch appears after this list). If execution errors or times out, go to the fallback; throughout, metrics are reported to the monitoring center
  • ⑦ Calculate the circuit breaker’s health data and decide whether to attempt to let requests through again; the data collected here is what you see in the hystrix.stream dashboard, and it helps you judge the health of an interface
  • ⑧ In the flow chart you can also see that ④, ⑤ and ⑥ all point to the fallback, which is what we call service degradation; degradation really is Hystrix’s core business
  • ⑨ Return the response
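  • A minimal sketch of such a raw command, showing where run and getFallback sit and how execute/queue map onto them (the class and the stubbed business logic are assumptions):

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    import java.util.concurrent.Future;

    // hypothetical raw command: SpringCloud's @HystrixCommand generates something
    // equivalent to this for every annotated method
    public class GetPaymentCommand extends HystrixCommand<ResultInfo> {

        private final Long id;

        public GetPaymentCommand(Long id) {
            super(HystrixCommandGroupKey.Factory.asKey("order-service"));
            this.id = id;
        }

        // step 6 in the flow: the actual business logic
        @Override
        protected ResultInfo run() {
            // a real implementation would call the payment service here
            return new ResultInfo();
        }

        // steps 4/5/6 fall back here on an open circuit, rejection, error or timeout
        @Override
        protected ResultInfo getFallback() {
            return new ResultInfo();
        }

        public static void main(String[] args) throws Exception {
            ResultInfo sync = new GetPaymentCommand(1L).execute();       // synchronous
            Future<ResultInfo> async = new GetPaymentCommand(1L).queue(); // asynchronous
            async.get();
        }
    }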

HystrixDashboard

  • Besides circuit breaking, degradation, and rate limiting, another important feature of Hystrix is near-real-time monitoring, which turns interface request information into statistics and charts.

  • Installing it is also simple: the project only needs the actuator and hystrix-dashboard modules.

		<dependency>
            <groupId>org.springframework.cloud</groupId>
            <artifactId>spring-cloud-starter-netflix-hystrix-dashboard</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-actuator</artifactId>
        </dependency>
  • Add @EnableHystrixDashboard on the startup class to bring in the dashboard. No further development is needed; like Eureka, it just has to be enabled.

  • That completes the dashboard setup. The dashboard is there to monitor Hystrix’s request handling, so we also need to expose the Hystrix metrics endpoint.

  • Just add the following configuration to any module that uses a Hystrix command; here I did it in the order module.

import com.netflix.hystrix.contrib.metrics.eventstream.HystrixMetricsStreamServlet;
import org.springframework.boot.web.servlet.ServletRegistrationBean;
import org.springframework.context.annotation.Bean;
import org.springframework.stereotype.Component;

@Component
public class HystrixConfig {
    @Bean
    public ServletRegistrationBean getServlet() {
        HystrixMetricsStreamServlet streamServlet = new HystrixMetricsStreamServlet();
        ServletRegistrationBean registrationBean = new ServletRegistrationBean(streamServlet);
        registrationBean.setLoadOnStartup(1);
        // Note: with this configuration the stream is served at localhost:port/hystrix.stream;
        // in newer versions, if it is configured via the configuration file instead,
        // the actuator prefix is required: localhost:port/actuator/hystrix.stream
        registrationBean.addUrlMappings("/hystrix.stream");
        registrationBean.setName("HystrixMetricsStreamServlet");
        return registrationBean;
    }
}
  • Then visit localhost:port/hystrix.stream on the order module: a page continuously printing ping appears, which means monitoring of the order module is wired up. Of course order also needs the actuator module.
  • Now let’s use JMeter to pressure-test our circuit-breaking, degradation, and rate-limiting interfaces, and watch the state of each interface in the dashboard.

  • In the animation above our service looks quite busy. Think of e-commerce: when you watch the line chart of each interface it beats like a heartbeat; too high and you worry, too low and you are not getting the traffic you hoped for. Let’s look at the dashboard’s metrics in detail.

  • Let’s look at the current state of each interface during the run of our service.

Aggregated monitoring

  • Above we created a new hystrix-dashboard module to monitor our order module, but in practice it is not realistic to configure Hystrix only in order.
  • We configured only order above for demonstration purposes. Now we configure Hystrix in payment as well, which means we would have to switch back and forth between order’s and payment’s monitoring data in the dashboard.
  • That is where aggregated monitoring comes in. Before doing it, let’s add Hystrix to payment as well; note that we expose hystrix.stream through a bean, so no actuator access prefix is needed.

New module: hystrix-turbine

pom

<!-- hystrix turbine -->
        <dependency>
            <groupId>org.springframework.cloud</groupId>
            <artifactId>spring-cloud-starter-netflix-turbine</artifactId>
        </dependency>
  • The key addition is the turbine coordinate; the other modules (Hystrix, dashboard, and so on) are the same as before. See the source code at the end for details.

yml

spring:
  application:
    name: cloud-hystrix-turbine

eureka:
  client:
    register-with-eureka: true
    fetch-registry: true
    service-url:
      defaultZone: http://localhost:7001/eureka
  instance:
    prefer-ip-address: true

# turbine aggregated monitoring
turbine:
  app-config: cloud-order-service,cloud-payment-service
  cluster-name-expression: "'default'"
  # if the stream is exposed at /actuator/hystrix.stream, instanceUrlSuffix must be configured accordingly
  # instanceUrlSuffix: actuator/hystrix.stream

Startup class

Add the @EnableTurbine annotation to the startup class; a minimal sketch follows.
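A minimal sketch of that startup class (the class name is an assumption):

    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.cloud.netflix.turbine.EnableTurbine;

    // hypothetical startup class for the cloud-hystrix-turbine module
    @SpringBootApplication
    @EnableTurbine
    public class TurbineApplication {
        public static void main(String[] args) {
            SpringApplication.run(TurbineApplication.class, args);
        }
    }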




The source code

The source code