In the world of programmers, a few sayings are old classics:

**First rule of programming:** if the code somehow works, don't touch it.

**First rule of architecture:** never touch an old system that has been running stably for years.


These principles, learned from their predecessors, are ingrained in the minds of novice programmers and applied throughout their careers.

As for a running production environment, engineers not only dare not touch it carelessly; they would sooner burn incense and pray to it.

However, the fragility of a technical system has many sources: hardware failures, code bugs, flaws in architecture and logic, and unexpected, unpredictable online traffic. As systems migrate to cloud-native architectures, more and more dependencies and uncertainties are introduced into the architecture.

Achieving anti-fragility in an increasingly complex technical architecture has become a required course for any programmer who wants to grow into an architect.

In 2010, engineers at Netflix proposed the concept of chaos engineering: actively inject faults into the online production system to verify how it responds in various failure scenarios, and identify and fix hidden risks in advance.

It works like a vaccine: a small, deliberate dose of something harmful is introduced to prevent more serious disease later. For a complex technical system, engineers can likewise inject a limited number of controllable failures to expose weaknesses in advance and fix them, avoiding large failures with potentially serious consequences.

The introduction of chaos engineering is not only a technical innovation but also a conceptual breakthrough. It abandons the old stability mantra of "never touch the online system" that the industry's predecessors adhered to, and proposes a new approach to stability in which attack is the best form of defense, opening a new era of offensive and defensive drills in the production environment.

01 The development of chaos engineering at iQIYI

iQIYI's offensive and defensive drills were initially organized by each business team on its own. The financial payment team, for example, started chaos engineering very early because of its strict stability requirements (see the earlier article "Behind the Attack and Defense Battle").

At that stage, the methods and tools each team used for drills were scattered and inconsistent; there was no unified platform or tooling standard.

During the traffic peak of the epidemic in early 2020, iQIYI suffered a playback outage.

The post-mortem found that this large failure had actually been triggered by a small network jitter plus a code bug, which then snowballed into an avalanche. Such small problems usually hide inside a complex technical architecture and are hard to catch with ordinary testing, yet a single small exception can set off an enormous chain reaction across the whole system.

Since then, the iQIYI technical team has promoted regular, standardized offensive and defensive drills at scale in the production environments of key services, and at the same time built a company-level drill platform, Xiaolu Luanzhuang, to support the business teams' drill requirements and improve the safety and efficiency of drill execution.

By Q2 2021, iQIYI had carried out attack and defense drills on more than 20 key businesses through the Xiaolu Luanzhuang platform, and the online environment of each key service had gone through 3 to 4 rounds of real failure attacks on average.

02 How the platform is used

Xiaolu Luanzhuang plays two roles at iQIYI:

**(1) Business self-test:** the owner of each business system can run fault self-tests on the production or test environment of their own system through the platform, to verify that the high-availability safeguards built into the service (alarms, degradation, circuit breaking, disaster-recovery switchover, etc.) actually work.

**(2) Red-Blue attack and defense:** an independent architecture evaluation team set up by the company verifies the high-availability level of key business systems from a third-party perspective with randomized attack experiments, providing third-party inspection and assurance for key services.

Module diagram of the Xiaolu Luanzhuang platform

The platform's design draws on mature public-cloud attack and defense drill products. Users configure a drill in a few simple steps (a rough plan sketch follows the list):

1. Select an attack target

2. Configure the attack mode

3. Create a drill plan

The drill plan arranged above can be executed once it has been submitted and approved.

4. Drill and observe
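
To make these steps concrete, here is a rough sketch of what a drill plan assembled from them might look like. The field names and values are purely illustrative assumptions, not the Xiaolu Luanzhuang platform's actual schema.

```python
# Hypothetical drill plan -- field names are illustrative only,
# not the real schema of the Xiaolu Luanzhuang platform.
drill_plan = {
    "target": {                      # step 1: select an attack target
        "service": "video-playback",
        "hosts": ["10.0.0.12", "10.0.0.13"],
    },
    "attack": {                      # step 2: configure the attack mode
        "type": "network_delay",
        "latency_ms": 1000,
        "peer": "couchbase-cluster",
    },
    "schedule": {                    # step 3: create a drill plan (needs approval)
        "start": "2021-04-20T14:00:00+08:00",
        "duration_minutes": 10,
        "approvers": ["service-owner", "sre-oncall"],
    },
    "observation": {                 # step 4: drill and observe
        "dashboards": ["playback-qps", "couchbase-latency"],
        "abort_on_alarm": True,      # stop the drill if a critical alarm fires
    },
}
```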

Beyond configuring and orchestrating drills, the Xiaolu Luanzhuang platform is also integrated with the company's internal resource management and monitoring systems. It handles the permissions, approval workflow, and observability issues that arise during a drill, and produces concise, useful failure drill reports for users.

Observation during a drill: monitoring and alarms

03 Key business attack and defense drill cases

1. Case 1: Couchbase cache failure test

(1) Background of the drill

One cause of the playback failure mentioned above was that the SDK connection the playback service used to access Couchbase stopped working properly under network jitter, which set off a chain reaction.

After the failure, the playback service engineers added circuit-breaker reinforcement around the dependencies in the architecture: when access to the Couchbase dependency times out, the circuit breaker trips automatically and service requests are switched to a standby KV database. The effect of this reinforcement was then verified by replaying the network fault in an attack and defense drill.
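
As a rough illustration of this kind of reinforcement, the sketch below shows a timeout-triggered fallback from the primary cache to a standby KV store. It is a minimal Python sketch under assumed interfaces: `couchbase_get` and `hikv_get` are hypothetical client callables, and the playback service's real circuit breaker is certainly more elaborate (error-rate windows, half-open probes, and so on).

```python
import time

# Minimal sketch of a timeout-triggered fallback to a standby KV store.
# couchbase_get / hikv_get are hypothetical client callables passed in by
# the caller; this is NOT the playback service's actual implementation.
FAILURE_THRESHOLD = 5      # consecutive timeouts before the breaker opens
COOL_DOWN_SECONDS = 30     # how long to keep serving from the standby store

_failures = 0
_opened_at = 0.0           # breaker starts closed

def get_with_fallback(key, couchbase_get, hikv_get, timeout=0.2):
    global _failures, _opened_at
    breaker_open = (time.time() - _opened_at) < COOL_DOWN_SECONDS
    if not breaker_open:
        try:
            value = couchbase_get(key, timeout=timeout)
            _failures = 0          # a healthy call resets the failure counter
            return value
        except TimeoutError:
            _failures += 1
            if _failures >= FAILURE_THRESHOLD:
                _opened_at = time.time()   # trip the breaker
    # breaker is open, or this call just timed out: serve from the standby KV
    return hikv_get(key)
```

A drill like the one described next then checks that, with the fault injected, traffic really does flow to the standby store and recovers once the fault is removed.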

(2) Drill process

In the drill, we chose point-to-point network fault injection as the attack mode, adding a 1000 ms delay between the servers hosting the video playback service and the Couchbase cluster, to simulate the access timeouts that a network failure would cause.
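
On Linux hosts, one common way to inject this kind of point-to-point delay is `tc`/`netem`. The sketch below, run on a playback host, delays only traffic destined for the Couchbase cluster; the device name and subnet are placeholders, and the Xiaolu Luanzhuang injector may well implement the fault differently.

```python
import subprocess

DEV = "eth0"                    # placeholder network device
COUCHBASE_NET = "10.1.2.0/24"   # placeholder subnet of the Couchbase cluster
DELAY_MS = 1000

def inject_delay():
    # prio root qdisc: unmatched traffic keeps its normal bands,
    # matched traffic is steered to band 1:3, which carries the netem delay
    subprocess.run(["tc", "qdisc", "add", "dev", DEV, "root",
                    "handle", "1:", "prio"], check=True)
    subprocess.run(["tc", "qdisc", "add", "dev", DEV, "parent", "1:3",
                    "handle", "30:", "netem", "delay", f"{DELAY_MS}ms"], check=True)
    subprocess.run(["tc", "filter", "add", "dev", DEV, "protocol", "ip",
                    "parent", "1:0", "prio", "3", "u32",
                    "match", "ip", "dst", COUCHBASE_NET,
                    "flowid", "1:3"], check=True)

def clear_delay():
    # removing the root qdisc removes the delay and the filter with it
    subprocess.run(["tc", "qdisc", "del", "dev", DEV, "root"], check=True)
```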

(3) Drill effect

When Couchbase access timed out, service requests were immediately switched over to HiKV, the standby KV store.

2. Case 2: Member service Redis distributed lock attack and defense drill

(1) Background of the drill

The iQIYI membership service team switched uniformly to a new Redis-based distributed lock and used attack and defense drills to test its reliability under various extreme failure scenarios.

(2) Drill process

Three different fault attacks were carried out against the Redis distributed lock, to test how the lock behaved under each:

  • **Scenario 1:** the network between the business service and Redis is disconnected, then restored 5 minutes later

  • **Scenario 2:** the primary Redis instance fails, triggering a failover to the replica

  • **Scenario 3:** the primary Redis instance fails and no failover is performed; the Redis service is restarted after 5 minutes

(3) Drill effect

The drills verified the business impact of the Redis distributed lock in these three extreme scenarios and pinned down the wait/retry behavior the business side should adopt when the lock fails. Based on how lock requests responded during the Redis failures, service callers were guided to set a reasonable wait interval and retry configuration for their own business scenarios.
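
To illustrate what such a wait/retry configuration can look like on the caller side, here is a minimal sketch using redis-py. The key names, intervals, and TTLs are illustrative assumptions only, not the membership team's actual lock implementation or settings.

```python
import time
import uuid
import redis

r = redis.Redis(host="redis-master", port=6379, socket_timeout=0.5)

def acquire_lock(lock_key, ttl_ms=10_000, wait_interval=0.2, max_retries=10):
    """Try to take the lock with a bounded wait/retry policy."""
    token = str(uuid.uuid4())              # unique token so only the owner releases
    for _ in range(max_retries):
        try:
            if r.set(lock_key, token, nx=True, px=ttl_ms):
                return token               # lock acquired
        except redis.RedisError:
            pass                           # Redis unreachable (scenarios 1/3): retry
        time.sleep(wait_interval)          # bounded wait keeps callers from piling up
    return None                            # give up: caller fails fast or degrades

def release_lock(lock_key, token):
    # compare-and-delete so a lock that expired and was re-acquired is not removed
    script = ("if redis.call('get', KEYS[1]) == ARGV[1] "
              "then return redis.call('del', KEYS[1]) else return 0 end")
    try:
        r.eval(script, 1, lock_key, token)
    except redis.RedisError:
        pass                               # the TTL will expire the lock anyway
```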

04 Common pitfalls and experiences in drills

Looking back on the past year of running attack and defense drills in the production environments of more than 20 lines of business, **we have summarized the problems that proved either common or particularly harmful**, briefly arranged as follows:

05 Summary

At iQIYI, after more than a year of evangelizing chaos engineering culture and practicing systematic online attack and defense, the front-line technical leaders of core businesses have developed considerable offensive and defensive awareness. Here we summarize two traits shared by good architects across the businesses:

1. Zero trust

No service stands alone: it depends on DNS, load balancers, gateways, virtual machines/containers, databases, middleware, storage, the network, large caches, and a great number of external interfaces. Even if each individual dependency offers 99.99% availability, the product of dozens of them stacked together is no longer high.
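
A quick back-of-the-envelope calculation makes the point concrete (the figure of thirty dependencies is just an assumption for illustration):

```python
# Thirty independent dependencies at 99.99% availability each
per_dependency = 0.9999
n_dependencies = 30
combined = per_dependency ** n_dependencies
print(f"combined availability: {combined:.4%}")                       # ~99.70%
print(f"expected downtime per year: {(1 - combined) * 8760:.1f} h")   # ~26 h
```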

A good architect never reasons about a technical service with the attitude of "this dependency is a basic service, my architecture assumes it is 100% available, and if it fails, the blame lies with it." Instead, they design highly available solutions by assuming that any part of the system can fail.

2. An exploratory mindset

A good technical leader knows the reliability of their service architecture and can say clearly, for any point in the architecture diagram, whether an exception there would be contained or would bring the service down. Where the behavior is uncertain, they actively explore it through attack and defense drills. Disaster recovery, traffic cut-off, circuit-breaking, and degradation mechanisms that have been verified in test or grayscale environments are then also attacked and verified in the online environment.

Daring to run real attacks against the online services one is responsible for is a new requirement for architects in the cloud-native era, and it reflects a technical leader's confidence in the architecture and the technology.

That confidence comes from a deep grasp of the code's logic, rational thinking about architecture principles, and the courage to take responsibility for the business.

In the complex technical architectures of the cloud era ahead, holding on to this technical confidence and actively taking part in offensive and defensive drills deepens one's understanding of the architecture, finds and fixes latent service problems, opens new room for engineers to grow, and safeguards the stable development of the company's business.

Some images are from the Internet; please contact us if there are any copyright issues.