1 Introduction

This series is about high availability. Over the next few lectures we will look at how to use Hystrix to build a highly available service architecture.

We will use the background of a real project as the business scenario and draw out the availability problems that can arise in it.

For each problem, we will look at the solution Hystrix offers and the principle behind it.

Together we will write all the code for the high-availability architecture of every service by hand.

In short, this is a hands-on tutorial in building a highly available service architecture.

2 What is Hystrix

In a distributed system, each service may call many other services. The services being called are its dependent services, and it is common for some of them to fail.

Hystrix lets us control the calls between services in a distributed system and adds fault tolerance for latency and dependency failures.

Hystrix isolates the resources used for each dependent service, so that the failure of one dependency does not spread to every other dependency call in the system. It also provides fallback and degradation when a failure occurs.

All in all, these mechanisms help us improve the availability and stability of distributed systems.

[Figure: distributed systems, their failures, and Hystrix]

3 History of Hystrix

Hystrix is a framework for ensuring high availability. Like Spring (IoC, MVC), MyBatis, Activiti, or Lucene, it is a framework: a pre-packaged code library designed to solve a specific problem in a specific domain.

Using a framework to solve the problems of its domain greatly reduces our workload and improves the quality and efficiency of our work. Hystrix plays that role for high availability.

The API team at Netflix (think of Netflix as a counterpart to Youku or iQIYI) began working on the availability and stability of its system in 2011, and Hystrix grew out of that work.

By 2012, Hystrix had matured and stabilized, and many teams at Netflix besides the API team were using it.

Today, billions of service-to-service calls are made through Hystrix at Netflix every day, and Hystrix helps improve Netflix's overall availability and stability.

In November 2018, Hystrix announced on its GitHub page that it would no longer develop new features, and recommended that developers use other open source projects that are still active. This move into maintenance mode does not mean that Hystrix is no longer valuable. On the contrary, Hystrix has inspired many good ideas and projects, and its design is still worth studying.

4 Hystrix design principles

Provide control and fault-tolerant protection against latency and failures when calling dependent services.

In a complex distributed system, stop the failure of one dependent service from spreading through the whole system. For example, suppose service A calls service B, which calls service C. If service C fails, service B fails too, then service A fails, and the whole distributed system goes down.

Provide fail-fast and rapid recovery support.

Support graceful fallback and degradation.

Support near real-time monitoring, alerting, and operations.

5 The problem Hystrix solves

In a complex distributed system architecture, each service has many dependent services, and each of them may fail. If a service is not isolated from its dependencies, the failure of any one dependency can bring down the service itself.

For example:

A service has 30 dependent services, each with 99.99% availability.

The availability of the service is then 99.99% to the power of 30, which is about 99.7%.

99.7% availability means that 0.3% of requests may fail, because the system may be down 0.3% of the time.

For 100 million requests, a 0.3% failure rate means about 300,000 failed requests, and the system is unavailable for about 2 hours per month.

In a real production environment, it could be even worse.

In other words, even if every dependent service has 99.99% availability, having dozens of them can leave your service unavailable for a couple of hours each month.
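The arithmetic in this example can be checked with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and it assumes that dependencies fail independently, that the service needs all of them at once, and that a month has 30 days.

```java
// Sketch: availability of a service that needs all of its dependencies at once.
// Assumes independent failures; all names here are illustrative.
class AvailabilityEstimate {

    /** Availability of a service that requires n dependencies, each with the given availability. */
    static double composite(double perDependency, int n) {
        return Math.pow(perDependency, n);
    }

    /** Expected number of failed requests out of the given total. */
    static long expectedFailures(double availability, long requests) {
        return Math.round((1.0 - availability) * requests);
    }

    /** Expected downtime in hours over a month with the given number of days. */
    static double downtimeHoursPerMonth(double availability, int days) {
        return (1.0 - availability) * days * 24;
    }

    public static void main(String[] args) {
        double a = composite(0.9999, 30); // 30 dependencies at 99.99% each
        System.out.printf("availability: %.2f%%%n", a * 100);
        System.out.printf("failed requests per 100 million: %d%n",
                expectedFailures(a, 100_000_000L));
        System.out.printf("downtime hours per 30-day month: %.2f%n",
                downtimeHoursPerMonth(a, 30));
    }
}
```

Changing the number of dependencies or the per-dependency availability shows how quickly the combined figure degrades as a service takes on more dependencies.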

When a call to a dependent service is slow or fails, why does it drag down the current service? And how do failures spread so quickly through a distributed system?

[Figure: how a failing dependency drags down a service and how the failure spreads]

6 More detailed design principles for Hystrix

Prevent any single dependent service from exhausting all resources, such as all the threads in Tomcat.

Use rate limiting (load shedding) and fail-fast to contain failures.

Provide fallbacks so that failures degrade gracefully.

Use isolation techniques such as bulkheads, swimlanes, and circuit breakers to limit the impact of any one dependency's failure.

Provide near real-time statistics, monitoring, and alerting to speed up fault discovery.

Allow properties and configuration to be changed on the fly in near real time, to speed up troubleshooting and recovery.

Protect against every kind of failure in dependency calls, not just network failures.

For example, bugs or blocking in the client library used to call a dependent service are handled as well, along with any other failure of the dependency call.

7 How does Hystrix achieve its goals

Each request to an external dependency is wrapped in a HystrixCommand or HystrixObservableCommand. The request usually runs in a separate thread, which isolates its resources.

Calls that take longer than the threshold we set are timed out instead of being allowed to block for a long time. The default timeout is 1000 ms, but in practice we usually set it ourselves, slightly above the dependency's 99.5th-percentile latency.

Maintain a separate thread pool (or semaphore) for each dependent service, and reject calls to that service when the thread pool is full.

Count the successes, failures, rejections, and timeouts of calls to each dependent service.

If the failure rate of calls to a dependent service exceeds a certain threshold, the circuit breaker trips: calls to that service are stopped automatically for a period of time, after which Hystrix tries to resume them.

The fallback logic is invoked automatically when a call fails, is rejected, times out, or is short-circuited.

Support near real-time changes to properties and configuration.
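The counting, tripping, and fallback steps above can be sketched in plain Java. This is a simplified illustration, not Hystrix's implementation: `SimpleCircuitBreaker`, its parameters, and the injected clock are hypothetical names. It trips on consecutive failures, whereas Hystrix evaluates an error percentage over a rolling window.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Simplified circuit breaker sketch (not the Hystrix implementation).
// Trips after a number of consecutive failures and retries after a sleep window.
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;    // consecutive failures before tripping (illustrative knob)
    private final long sleepWindowMillis;  // how long the circuit stays open before a trial call
    private final LongSupplier clock;      // injectable time source, in milliseconds
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    synchronized State state() {
        return state;
    }

    /**
     * Runs the call if the circuit allows it; returns the fallback when the call
     * fails or is short-circuited. Holding the lock during the call keeps the
     * sketch simple but serializes callers.
     */
    synchronized <T> T execute(Supplier<T> call, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < sleepWindowMillis) {
                return fallback.get();      // short-circuited: the dependency is not called
            }
            state = State.HALF_OPEN;        // sleep window elapsed: allow one trial call
        }
        try {
            T result = call.get();
            consecutiveFailures = 0;
            state = State.CLOSED;           // a success closes the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;         // trip, or re-trip after a failed trial call
                openedAt = clock.getAsLong();
            }
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker =
                new SimpleCircuitBreaker(3, 5_000, System::currentTimeMillis);
        Supplier<String> failing = () -> { throw new IllegalStateException("dependency down"); };
        Supplier<String> cached = () -> "cached response";
        for (int i = 0; i < 5; i++) {
            System.out.println(breaker.execute(failing, cached) + " (circuit " + breaker.state() + ")");
        }
    }
}
```

Injecting the clock rather than reading the system time directly makes the sleep window easy to exercise in a test without waiting.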

Once dependencies are resource-isolated, how do we keep latency or failures in dependency calls from taking down the current service?

How does resource isolation keep the failure of one dependency from bringing down the entire system?
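One way to picture the answer is a small bounded thread pool per dependency, with a timeout and a fallback on every call. The sketch below uses only `java.util.concurrent`. `IsolatedDependency` and its parameters are hypothetical names, and this is an illustration of the idea rather than how Hystrix is implemented internally.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of thread-pool isolation: each dependency gets its own small pool,
// so a slow dependency can only exhaust its own threads.
class IsolatedDependency {
    private final ExecutorService pool;
    private final long timeoutMillis;

    IsolatedDependency(int threads, long timeoutMillis) {
        // SynchronousQueue means no queuing: once every thread is busy,
        // submit() throws RejectedExecutionException instead of waiting.
        this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>());
        this.timeoutMillis = timeoutMillis;
    }

    /** Runs the call on this dependency's pool; returns the fallback on rejection, timeout, or failure. */
    <T> T call(Callable<T> work, T fallback) {
        Future<T> future;
        try {
            future = pool.submit(work);
        } catch (RejectedExecutionException e) {
            return fallback;                 // pool is full: fail fast
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);             // interrupt the slow call so its thread can be reused
            return fallback;
        } catch (ExecutionException e) {
            return fallback;                 // the call itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) {
        IsolatedDependency inventory = new IsolatedDependency(2, 200); // hypothetical dependency
        // The call sleeps longer than the 200 ms timeout, so the fallback is returned.
        String result = inventory.call(() -> { Thread.sleep(1000); return "live data"; }, "cached data");
        System.out.println(result);
        inventory.shutdown();
    }
}
```

Because the caller waits at most `timeoutMillis` and each pool is bounded, a dependency that hangs ties up only its own few threads, and calls to other dependencies keep running on their own pools.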