Author | Shao Yuliang, head of the System Governance team, Infrastructure

This article introduces ByteDance's thinking and experience in building high availability. Let me first briefly introduce what the System Governance team does. The team sits within the Infrastructure department and is responsible for the closed-loop R&D ecosystem at ByteDance: from service development, to collaboration and development under the large-scale microservice architecture, to the corresponding release process, and further up to microservice governance, traffic scheduling, and capacity analysis, and finally to chaos engineering construction that helps business teams improve their high availability.

Now let's get down to business. First, some background on ByteDance's chaos engineering construction. As you know, ByteDance has many apps, and behind them many services, which can be roughly divided into three types:

  • Online services: the back-end services supporting Douyin, Xigua Video, and so on. These services run on our self-built large-scale Kubernetes-based PaaS as a very large microservice architecture.
  • Offline services: recommendation model training, big data report computation, and so on. They rely on massive storage and computing power.
  • Infrastructure: supports all of ByteDance's business lines in China and provides a set of PaaS capabilities, including computing and storage, to support the different usage scenarios of the various services.

Different service systems have different concerns about high availability. Let’s do a quick analysis:

  • Online services: the services themselves are stateless and run in containers on K8s, with state stored externally in MySQL, Redis, and the like. Stateless services are easy to scale out, and when failures occur they tolerate them as much as possible, sometimes with some degradation.
  • Offline services: stateful services that care about the state of computation. Big data computing jobs are characterized by long running times, and model training in particular can take a long time. They tolerate some errors (a failed job can be retried) and rely on the underlying storage system for state consistency and data integrity. So high availability for offline services depends heavily on the high availability provided by the infrastructure as a whole.
  • Infrastructure: the infrastructure itself is stateful. It is a platform for large-scale storage and computing, and it may encounter gray-swan events such as network failures and disk failures. Here we pay particular attention to data consistency.

Engineers on the System Governance team responsible for high availability have proposed different solutions for these different service types. What follows is the evolution of chaos engineering as we applied it to online (stateless) services.

Chaos engineering evolution for online services

Chaos Engineering Platform 1.0 architecture

We consider our Chaos engineering platform version 1.0 to be less of a chaos engineering system and more of a fault injection system.

The diagram above shows the architecture of version 1.0 of our platform. It gives users a visual interface for fault injection with simple configuration. We installed an Agent on the underlying physical machines; the Agent runs on the host and can inject network faults between containers.

For service steady state, when we run a chaos drill, users can configure metrics on the platform by writing a Bosun query, and we provide a threshold. The system polls the metrics to judge whether the service remains in a stable state. If a metric crosses the threshold, we roll the fault back to stop the loss; if not, the drill continues so we can see whether it meets expectations. Why can't this system be called a chaos engineering system? The Principles of Chaos Engineering define five principles:

  • Build a hypothesis around steady-state behavior
  • Diversify real world events
  • Run experiments in a production environment
  • Continuously run experiments automatically
  • Minimize blast radius

Measuring against these five principles, let's see why the platform was merely a fault injection system.

  • First, the notion of steady state was relatively simplistic.
  • A real microservice architecture sees all kinds of faults, but this platform implemented only relatively simple fault injection, such as latency and network disconnection.
  • Drilling in the production environment was something we could already do at the time.
  • Because the steady-state model was so simple, it was difficult to really evaluate whether the system was stable, so experiments could not be run automatically.
  • The system's ability to declare the scope of an experiment was limited. In addition, the technical approach at the time was to inject faults on the physical host, which itself carried certain risks, so blast radius control was not particularly good.

Chaos Engineering Platform 2.0 architecture

In 2019, we started thinking about moving from Chaos Engineering Platform 1.0 to the next generation. We wanted to build a system that truly met the standards of chaos engineering, and the result was platform version 2.0, which we regard as ByteDance's first true chaos engineering system.

Key updates in Chaos Engineering Platform 2.0:

  • Architecture upgrade: introduced a fault center layer that decouples business logic from the underlying fault injection.
  • Fault injection: with Service Mesh applied at larger scale, more faults related to network invocation are implemented in the sidecar.
  • Steady-state model: at this stage we also built a steady-state system that computes steady state from key service metrics using machine learning algorithms. We care a great deal about the steady-state system: truly automated drills require no human intervention, so something has to judge whether the system under test is stable. If all you see is a pile of metrics, it is hard to perceive stability directly, so we aggregate those metrics into a single score through specific algorithms and, say, treat 90 points as stable (a minimal sketch of this kind of aggregation follows this list). How we feed algorithms into this steady-state system is discussed below.
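
As an illustration only, not the production algorithm: a minimal sketch of aggregating a few normalized metrics into a single steady-state score. The metric names, weights, and the 90-point threshold are assumptions made up for this example.

```go
package main

import "fmt"

// Metric holds an observed value normalized against its healthy baseline:
// 1.0 means "exactly as expected", lower means degraded.
type Metric struct {
	Name   string
	Score  float64 // normalized health in [0, 1]
	Weight float64 // relative importance of this metric
}

// SteadyStateScore aggregates weighted metric health into a 0-100 score.
func SteadyStateScore(metrics []Metric) float64 {
	var weighted, total float64
	for _, m := range metrics {
		weighted += m.Score * m.Weight
		total += m.Weight
	}
	if total == 0 {
		return 0
	}
	return 100 * weighted / total
}

func main() {
	metrics := []Metric{
		{Name: "success_rate", Score: 0.99, Weight: 0.5},
		{Name: "p99_latency", Score: 0.85, Weight: 0.3},
		{Name: "error_log_rate", Score: 0.90, Weight: 0.2},
	}
	score := SteadyStateScore(metrics)
	// Assumed stability threshold: 90 points, as in the example above.
	fmt.Printf("steady-state score = %.1f, stable = %v\n", score, score >= 90)
}
```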

Fault center architecture

Our fault center borrows its architecture from K8s.

The Chaos Engineering Platform 1.0 system had a problem. Suppose a delay fault was successfully injected into a service on K8s through the Agent. K8s has flexible scheduling, so if the service crashes during the drill, K8s automatically restarts the Pod on another machine. You might think the drill succeeded, but it did not: a new instance was started without the fault. The fault center, by contrast, can keep the fault injected as the container drifts.

Therefore, we built a set of declarative APIs, which do not declare a one-off fault to inject but describe a desired state of the system. For example, if the declared state is that the network between A and B is disconnected, the fault center must ensure that A and B stay disconnected no matter how the workload is rescheduled.

Second, the whole system borrows the K8s architecture and has a rich set of controllers underneath to support different fault injection capabilities. To meet urgent business needs quickly, we can plug open-source projects such as Chaos Mesh and Chaos Blade into a controller. We also built some native controllers, such as a Service Mesh controller, an Agent controller, and a service discovery controller. A minimal reconciliation sketch follows.
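
A hedged sketch of the declarative, controller-style idea (not the actual fault center code; the type and function names are invented for this example): a controller keeps re-applying the declared fault whenever the observed state drifts, for instance after a Pod is rescheduled.

```go
package faultcenter

import (
	"fmt"
	"time"
)

// FaultSpec is the declared (desired) fault state, e.g. "network between A and B is cut".
type FaultSpec struct {
	Source string
	Target string
	Action string // e.g. "network-partition"
}

// Injector abstracts a concrete backend (sidecar, agent, Chaos Mesh, ...).
type Injector interface {
	IsApplied(spec FaultSpec) (bool, error)
	Apply(spec FaultSpec) error
}

// Reconcile keeps the observed state converged to the declared spec,
// so the fault survives Pod rescheduling ("container drift").
func Reconcile(spec FaultSpec, inj Injector, stop <-chan struct{}) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			applied, err := inj.IsApplied(spec)
			if err != nil {
				fmt.Println("check failed:", err)
				continue
			}
			if !applied {
				// A new instance may have been scheduled; re-inject the fault there.
				if err := inj.Apply(spec); err != nil {
					fmt.Println("re-apply failed:", err)
				}
			}
		}
	}
}
```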

Blast radius control

Because the fault center injects faults through a declarative API, we need to define a fault injection model.

As shown above:

  • Target: the target service into which the fault is injected.
  • Scope Filter: for blast radius control, it is very important that the business declares the intended scope of the drill, which we call the Scope Filter. It can narrow the fault injection target to a machine room, a cluster, an availability zone, or even down to the instance level or the traffic level.
  • Dependency: the sources of exceptions that may affect the service, including middleware, downstream services, and the CPU, disk, and network the service depends on.
  • Action: the fault event itself, such as a downstream service rejecting requests or dropping packets, a disk write exception, or CPU contention.

Therefore, when declaring a fault in the fault center, you describe the above information to express the fault state the service is expected to be in, for example as sketched below.
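
As a hedged illustration of this model (the field names below are assumptions for the sketch, not the platform's actual schema), a declaration could look like this:

```go
package faultmodel

// ScopeFilter limits the blast radius of an experiment.
type ScopeFilter struct {
	IDC        string   // machine room / data center
	Cluster    string   // target cluster
	Zone       string   // availability zone
	Instances  []string // optionally narrow to specific instances
	TrafficTag string   // optionally narrow to tagged traffic only
}

// FaultDeclaration describes the desired faulty state of a service.
type FaultDeclaration struct {
	Target     string      // service to inject into
	Scope      ScopeFilter // where the fault applies
	Dependency string      // what fails: a downstream service, middleware, CPU, disk, network...
	Action     string      // how it fails: "reject", "packet-loss", "disk-write-error", "cpu-contention"...
}

// Example: a downstream payment service rejects requests, limited to one cluster.
var example = FaultDeclaration{
	Target:     "order-service",
	Scope:      ScopeFilter{Cluster: "cluster-a", Zone: "az-1"},
	Dependency: "payment-service",
	Action:     "reject",
}
```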

The steady-state system

The steady-state system involves some algorithm work. Three algorithm scenarios are introduced here:

  • Dynamic time series analysis, which we call the steady-state algorithm: it tries to judge whether a service is stable, using threshold detection, the 3-sigma rule, sparse rules, and other algorithms.
  • A/B comparative steady-state analysis: referring to the Mann-Whitney U test used by Netflix; related papers and articles cover it well.
  • A detection mechanism using a metric fluctuation consistency detection algorithm, used to analyze strong and weak dependencies.

With these algorithms (and others), the steady-state system can describe system stability reasonably well. A minimal sketch of the simplest of them, the 3-sigma rule, follows.
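
As an illustration of that simplest technique (the platform's real algorithms are more sophisticated), a minimal sketch of 3-sigma anomaly detection over a window of metric samples:

```go
package steadystate

import "math"

// ThreeSigmaAnomaly reports whether the latest observation deviates from the
// historical window by more than three standard deviations (the 3-sigma rule).
func ThreeSigmaAnomaly(history []float64, latest float64) bool {
	if len(history) < 2 {
		return false // not enough data to judge
	}
	var sum float64
	for _, v := range history {
		sum += v
	}
	mean := sum / float64(len(history))

	var sq float64
	for _, v := range history {
		sq += (v - mean) * (v - mean)
	}
	std := math.Sqrt(sq / float64(len(history)))

	return math.Abs(latest-mean) > 3*std
}
```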

Automated drill

We define an automated drill as the system injecting faults without any human intervention, analyzing the stability of the service during the injection process, and being able to stop the loss or produce a conclusion at any time. A few preconditions for our automated drills:

  • Be able to define the objectives of the actual drill scenario;
  • Through the steady-state system, have automatic judgment of the steady-state hypothesis;
  • Control the impact scope of the chaos drill through the declarative API and Scope Filter, so that production losses during the experiment are minimal.

At present, the main application scenario of automated drills is strong/weak dependency analysis, including (see the sketch after this list):

  • whether the current strong/weak dependency relationships match the business's annotations;
  • whether a timeout in a weak dependency can drag down the entire call chain.
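
A hedged sketch of the idea behind strong/weak dependency analysis (not the platform's implementation; the interfaces are invented for this example): inject a timeout into one dependency at a time and ask the steady-state system whether the service stays healthy.

```go
package drill

// FaultCenter and SteadyState are assumed interfaces for this sketch:
// the former calls the fault center, the latter the steady-state system.
type FaultCenter interface {
	InjectTimeout(service, dependency string) error
	Recover(service, dependency string) error
}

type SteadyState interface {
	IsStable(service string) (bool, error)
}

// ClassifyDependency returns "weak" if the service stays stable while the
// dependency times out, and "strong" otherwise.
func ClassifyDependency(fc FaultCenter, ss SteadyState, service, dependency string) (string, error) {
	if err := fc.InjectTimeout(service, dependency); err != nil {
		return "", err
	}
	defer fc.Recover(service, dependency) // always stop the loss

	stable, err := ss.IsStable(service)
	if err != nil {
		return "", err
	}
	if stable {
		return "weak", nil
	}
	return "strong", nil
}
```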

Conclusion

Now let's review why we consider Chaos Engineering Platform 2.0 to be a chaos engineering system, again comparing against the five principles above:

  • Build a hypothesis around steady-state behavior: steady-state hypotheses are now driven by the steady-state system.
  • Diversify real-world events: fault layering is now more reasonable, supplemented with a large number of middleware and underlying faults.
  • Run experiments in production: this was already possible in 1.0 and was extended in 2.0 to support fault drills in production, pre-release, and local test environments.
  • Continuously run experiments automatically: we provide CI, SDK, and API capabilities so that business lines can continuously integrate drills into their release processes, and the API also lets them inject faults on demand.
  • Minimize blast radius: one of the reasons for providing declarative APIs is to control the blast radius.

An infrastructure chaos platform that supports drills of underlying systems

As mentioned earlier, offline services rely heavily on the consistency of the underlying state, so if storage and computing in the infrastructure are done right, the businesses above them are well supported. We are doing some internal experimentation with a new infrastructure chaos platform. For infrastructure chaos engineering, we break some of the standard principles of chaos engineering.

  • First, chaos engineering for infrastructure is not suitable for the production environment, because it relies on low-level fault injection, the impact area is very large, and the blast radius is hard to control.
  • For automated drills, infrastructure teams need more flexibility: deeper integration with their CI/CD, as well as more complex orchestration.
  • For the steady-state model, in addition to stability, we pay more attention to consistency.

To support chaos engineering in offline environments, the infrastructure chaos platform provides a safe environment in which more kinds of faults can be injected: system resource faults such as CPU, memory, and file system faults; network faults such as rejection and packet loss; and other faults including clock jumps, process kills, code-level exceptions, and file-system-level method error hooks. For automated orchestration of drills, we want to give users more flexible orchestration capabilities through this platform, for example:

  • Serial and parallel task execution
  • Pause at any time and resume from a breakpoint
  • Identification of infrastructure primary and secondary nodes

We also provide plug-in capabilities that give component teams more flexibility in injecting faults. Some teams already have hooks embedded in their systems and want to use them to inject faults more directly, while still reusing our orchestration and platform architecture. With the hook approach, a team simply implements the corresponding hook to inject a specific fault and then continues to use our orchestration architecture and platform. A minimal sketch of such a hook interface follows.
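
As a hedged illustration of the plug-in idea (the interface below is an assumption for this sketch, not the platform's actual API), a component team would implement a small hook that the orchestration engine calls:

```go
package plugin

import "context"

// FaultHook is what a component team implements to plug its own
// fault injection into the platform's orchestration.
type FaultHook interface {
	// Name identifies the hook in orchestration templates.
	Name() string
	// Inject triggers the component-specific fault described by params.
	Inject(ctx context.Context, params map[string]string) error
	// Recover undoes the fault so the drill can stop the loss at any time.
	Recover(ctx context.Context) error
}
```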

Infrastructure chaos platform architecture diagram

From chaos engineering to system high availability construction

When we started doing chaos engineering, the mission of our team was to implement chaos engineering at ByteDance. But when we built capabilities for the business lines to use, we found that the business lines did not necessarily need them. After some hard thinking we adjusted the team's mission: to help businesses improve high availability through chaos engineering or other means. After the adjustment, we went from studying chaos engineering to understanding the high availability of the business. How can we help businesses build high availability?

What is high availability

We understand high availability with a formula built from the following quantities (a reconstruction of the formula is given after the list).

  • MTTR (Mean Time To Repair): mean time to repair
  • MTBF (Mean Time Between Failures): mean time between failures
  • N: number of incidents
  • S: impact scope of each incident
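
The slide with the original formula is not reproduced here; based on the surrounding text (the value A is below one, it shrinks as MTTR, N, and S grow, and it grows as MTBF grows), a plausible reconstruction is:

```latex
A \approx 1 - \frac{\mathrm{MTTR} \times N \times S}{\mathrm{MTBF}}
```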

The value of this formula is clearly less than one, and we want it as close to one as possible (three nines, five nines). To make A large enough, we need:

  • MTTR × N × S to be as small as possible: reduce the mean repair time, reduce the number of incidents, and narrow the scope of each failure.
  • MTBF to be as large as possible: stretch the time between two failures as far as we can.

How to reduce MTTR, N, S?

Reducing the Scope of Failure (S)

When a failure occurs in the production architecture, there are several design approaches on the architecture side to reduce its impact:

  • Cell-based (unitized) architecture: user request isolation
  • Multi-room deployment: System resources are isolated
  • Independent deployment of core services: Service functions are isolated
  • Asynchronous processing

One thing chaos engineering can do here is help the SRE team verify that these architectural designs live up to expectations.

Reduce the number of failures (N)

Here I want to redefine "fault". Failures are inevitable; what we should avoid, at the architectural level of the software system, is failures turning into errors. How do we reduce the conversion rate from failure to error? The most important thing is to strengthen the system's fault tolerance, including:

  • Deployment: multi-region active-active deployment, flexible traffic scheduling, service fallback, contingency plan management;
  • Service governance: timeout configuration, circuit breaking and fail fast (see the sketch after this list).
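
As an illustration of the governance mechanisms named above (a minimal sketch, not ByteDance's actual framework; the timeout and breaker thresholds are assumed values), a call wrapper that applies a timeout and fails fast once the downstream has failed repeatedly:

```go
package governance

import (
	"context"
	"errors"
	"sync"
	"time"
)

var ErrCircuitOpen = errors.New("circuit open: failing fast")

// Breaker is a very small circuit breaker: after maxFails consecutive
// failures it rejects calls immediately for the cooldown period.
type Breaker struct {
	mu        sync.Mutex
	fails     int
	openUntil time.Time

	maxFails int
	cooldown time.Duration
	timeout  time.Duration
}

func NewBreaker() *Breaker {
	return &Breaker{maxFails: 5, cooldown: 10 * time.Second, timeout: 300 * time.Millisecond}
}

// Call runs fn with a timeout and fails fast while the circuit is open.
func (b *Breaker) Call(ctx context.Context, fn func(ctx context.Context) error) error {
	b.mu.Lock()
	if time.Now().Before(b.openUntil) {
		b.mu.Unlock()
		return ErrCircuitOpen
	}
	b.mu.Unlock()

	ctx, cancel := context.WithTimeout(ctx, b.timeout)
	defer cancel()
	err := fn(ctx)

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.fails++
		if b.fails >= b.maxFails {
			b.openUntil = time.Now().Add(b.cooldown)
			b.fails = 0
		}
		return err
	}
	b.fails = 0
	return nil
}
```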

The role of chaos engineering here is to help verify the system's fault tolerance.

Reducing mean time to repair (MTTR)

The figure above shows some of the factors that make up MTTR: the time needed for failure notification, diagnosis, repair, testing, and finally going back online. To reduce MTTR, design measures can be added for each of these factors:

  • Adequate monitoring and alarm coverage, and pushing services to actually handle their alarms.
  • Ensuring alarms are both comprehensive and accurate.
  • Efficient localization and stronger troubleshooting ability. We are currently working with our internal AIOps team on intelligent diagnosis to further reduce diagnosis time.
  • Fast stop-loss plans. From repair through testing to going back online, a contingency plan system is needed: plan libraries are prepared in advance according to diagnosed fault characteristics, so recovery is a matter of selecting the right plan and clicking a button.

Among these, chaos engineering can support emergency response drills. The drill exercises not only the system but also the emergency response capability of everyone in the organization: when an incident occurs, the team should have a standard workflow to identify, locate, and resolve the problem, and that is exactly what chaos engineering systems want to drill.

Subsequent planning

Finally, I would like to introduce our follow-up plans for high availability and chaos engineering, in three areas:

  • Refined fault capability building
    • Layered fault construction for different systems
    • Richer fault capabilities at each layer
  • Richer application scenarios for chaos engineering
    • Continue exploring automation scenarios
    • Reduce users' integration and usage costs and build a more lightweight platform
  • Extending the connotation of chaos engineering
    • Keep exploring the relationship between chaos engineering and high availability from the availability perspective
    • Establish a failure (error) budget mechanism that quantifies predicted and actual failure losses, to help decide how much to invest in chaos engineering