This scenario introduces the ideas and principles of chaos engineering and lets you try fault drills with AHAS Chaos, Alibaba Cloud's product in the chaos engineering field. From November 9 to 23, you can get a TOMY Tomica alloy car model after completing the experience.

Address: developer.aliyun.com/adc/series/…

This scenario involves the following technologies or products:

Container Service ACK: Alibaba Cloud Container Service for Kubernetes (ACK for short) provides high-performance, scalable container application management and supports full life cycle management of enterprise containerized applications. It was the only product from China selected in the 2020 Gartner public cloud container report, and it ranked first in China in the 2019 Forrester container report. It integrates Alibaba Cloud virtualization, storage, network, and security capabilities to help enterprises run Kubernetes containerized applications efficiently in the cloud.

Chaos: Chaos is a cloud-native chaos engineering platform that provides large-scale, low-cost, controllable, and diversified fault drill services. It offers one-stop architecture analysis, fault inspection, fault injection, steady-state measurement, and other functions that help users improve the fault tolerance and recoverability of distributed systems and move their systems to the cloud smoothly.

You have probably seen news reports about a PLA unit holding live combat drills somewhere. For an army, the best training is a real exercise. Even when routine training is systematic and thorough, real combat can still surface unexpected problems that never appeared in training. Only live exercises can expose those problems, guide the plan for the next stage of training, and improve the army's combat effectiveness.

Is it not the same for software systems, which must be designed for failure? “Everything fails, all the time.” Even if we anticipate many scenarios during development and fix every bug we find, unexpected situations still arise once the system goes online, so our software systems need the same kind of drill. Consider failure scenarios from the start of the system design phase, make design for failure part of the architecture, and prepare strategies for recovering from failure; this helps improve the availability of the overall system. Only by accepting that things will fail over time and building that into your architecture can you avoid failures or minimize their cost when they do occur.

The idea of failure-oriented design is the origin of chaos engineering. Designing for failure requires us to prepare in advance, but will the measures we have put in place actually work when a failure occurs? Do the disaster recovery tools really recover the system? Are the people who handle incidents practiced enough? These questions are hard to verify and often surface only during a real failure. This is where chaos engineering matters. Like a drill, chaos engineering deliberately creates faults to find the possible weaknesses of a system, verifies whether the system and the people operating it can handle unexpected problems as expected in a real, complex environment, and improves the resilience of the system. Fault drills (Chaos) provide exactly this capability.

Create lab resources

Alibaba Cloud provides 4 hours of ACK + Chaos cloud resources for this lab. Link: developer.aliyun.com/adc/scenari…

Install the probe

Procedure:

1 On the Container Service console page, click < in the top navigation bar.

2 In the left navigation tree, click Application Directory.

3 On the Application Directory page, click ack-ahas-pilot.

4 On the ack-ahas-pilot details page, click Create.

If the following page is displayed, the probe has been deployed.

View the overall system architecture through architecture awareness

Procedure:

1 Copy the address of the Application HA console below, open a new tab in Firefox, and paste the address to open the console.

https://chaos.console.aliyun.com/

2 At the top of the Summary page, select the region where the resource is located. For example, in the figure below, the region is switched to East China 1 (Hangzhou).

3 In the navigation tree on the left, choose Fault Drill > Architecture Awareness.

4 On the Architecture Map page, click View in the Kubernetes Monitoring View card.

5 On the Architecture Map page, open the Kubernetes monitoring view drop-down list, set the namespace to default, and click OK to view the Kubernetes monitoring view of the lab resources.

Automatic recovery scenario drill

One fault-tolerance strategy in distributed system design is automatic recovery: through health checks and similar mechanisms, a machine or application is redeployed automatically when it runs into problems. Here we use Chaos to run a fault drill that tests whether our system has this capability.
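To make the recovery mechanism concrete, here is a minimal sketch of a supervisor that redeploys an instance after several consecutive failed health checks, similar in spirit to a Kubernetes liveness probe. The class names, the threshold of 3, and the fake instance are assumptions made for illustration; they are not part of the lab environment or of any Alibaba Cloud API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Supervisor:
    """Toy model of health-check-driven recovery (all names are hypothetical)."""

    probe: Callable[[], bool]      # returns True when the instance is healthy
    restart: Callable[[], None]    # redeploys the instance
    failure_threshold: int = 3     # consecutive failures before a restart
    consecutive_failures: int = 0
    events: List[str] = field(default_factory=list)

    def tick(self) -> None:
        """Run one health check and restart the instance if it keeps failing."""
        if self.probe():
            self.consecutive_failures = 0
            self.events.append("healthy")
            return
        self.consecutive_failures += 1
        self.events.append("unhealthy")
        if self.consecutive_failures >= self.failure_threshold:
            self.restart()
            self.consecutive_failures = 0
            self.events.append("restarted")


class FakeInstance:
    """Stands in for a service instance whose health endpoint can be faulted."""

    def __init__(self) -> None:
        self.fault_injected = False
        self.restarts = 0

    def healthy(self) -> bool:
        return not self.fault_injected

    def redeploy(self) -> None:
        # A freshly deployed instance starts without the injected fault.
        self.fault_injected = False
        self.restarts += 1


def run_drill() -> List[str]:
    """Inject a fault after one healthy check and record what the supervisor does."""
    instance = FakeInstance()
    supervisor = Supervisor(probe=instance.healthy, restart=instance.redeploy)
    supervisor.tick()               # steady state before the drill
    instance.fault_injected = True  # the drill injects the fault
    for _ in range(4):
        supervisor.tick()
    return supervisor.events
```

Calling `run_drill()` yields one healthy check, three unhealthy checks, a restart, and a final healthy check, which is the pattern the drill below looks for in the real system.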

1. Define the steady state. Choose a steady-state indicator that reflects the health of the system, and monitor it while the chaos experiment runs. Here we define the steady state as: the frontend page is accessible, and the shopping cart, ordering, and other functions work normally.

2. Simulate real events.

2.1 Switch back to the Application HA console. In the left navigation bar, click My Space.

2.2 On the My Space page, click Create a Blank Drill in the New Drill drop-down list.

2.3 On the Drill configuration page, complete the following operations:

(1) Set the drill name.

(2) In the Drill Object configuration wizard, select frontend for the drill application and frontend-group for the application group, select any machine in the machine list, and click Add Drill Content.

(3) In the Select Drill Fault dialog box, choose Java Applications > Delay > Java Delay in Container, and click OK.

(4) On the Drill Configuration page, click Java Delay in Container.

(5) In the Java Delay in Container panel, enter the fully qualified class name, method name, process keyword, and target container name as follows, and click Close.

Fully qualified class name: enter com.alibabacloud.hipstershop.web.HealthController.
Method name: enter health.
Process keyword: enter java.
Target container name: select frontend.

(6) In the drill content area, click Save.

(7) Click Next.

(8) In the Monitoring Policy area of Global Configuration, click Add Policy.

(9) In the new policy dialog box, choose Service Monitoring > Service Status Observation (Http) and click OK.

(10) In the Service Status Observation (Http) panel, select GET for the request type and enter http://<frontend external endpoint>/ for the URL.

Note:

You can find the external endpoint of frontend on the Access Method tab of the frontend service page in the Container Service ACK console.

(11) In the Global Configuration wizard, click Next.

(12) In the success dialog box, click Drill Details.

2.4 On the Drill details page, click Drill.

2.5 In the dialog box that is displayed, click Confirm.

3. Test the impact of the experiment.

3.1 On the drill record details page, view the Service Status Observation (Http) time series chart. Calls to the health interface drop after the fault is injected and then return to normal automatically a short time later, which indicates that our design works.

3.2 Switch to the Container Service ACK console and click the Events tab on the frontend service page.

You can see that frontend was scaled out automatically.

4. Terminate the experiment.

4.1 Switch back to the Application HA console. On the drill record details page, click Terminate.

4.2 In the Stop Drill dialog box, click OK.

4.3 Click OK in the result feedback dialog box after the scenario is finished.
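The check made by eye on the time series chart in step 3.1 can also be expressed as code. The sketch below assumes the observations are available as (seconds since drill start, HTTP status code) pairs and that any 2xx status counts as healthy; the function names and the time budget are hypothetical and are not part of the Chaos console.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, int]  # (seconds since drill start, HTTP status code)


def recovery_time(samples: List[Sample]) -> Optional[float]:
    """Return the seconds between the first failed sample and the next success.

    Returns 0.0 if no sample failed, and None if the service never recovered.
    """
    first_failure: Optional[float] = None
    for timestamp, status in samples:
        ok = 200 <= status < 300
        if first_failure is None and not ok:
            first_failure = timestamp
        elif first_failure is not None and ok:
            return timestamp - first_failure
    return 0.0 if first_failure is None else None


def steady_state_restored(samples: List[Sample], budget_seconds: float) -> bool:
    """True when the service recovered within the allowed time budget."""
    elapsed = recovery_time(samples)
    return elapsed is not None and elapsed <= budget_seconds
```

For example, with samples taken every 5 seconds where the endpoint fails at 10 s and answers again at 20 s, `recovery_time` reports 10 seconds, and the drill passes for any budget of 10 seconds or more.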

Strong and weak dependency scenario drill

In a microservice architecture, services depend on one another in many ways. When an unimportant weak dependency goes down, a robust system should still be able to function. Here we use Chaos to run a fault drill that tests how well our system handles strong and weak dependencies.
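What handling a weak dependency means in code can be sketched as follows: the caller catches the failure of the optional service and falls back to a default result instead of failing the whole request. The product names, function names, and the choice of exceptions are assumptions for illustration only; the source does not show how the Hipster Shop frontend is implemented.

```python
import random
from typing import Callable, List

# Hypothetical catalog; the real shop's products are not listed in this article.
CATALOG = ["sunglasses", "tank top", "watch", "loafers", "hairdryer"]


def personalized_order(products: List[str], rng: random.Random) -> List[str]:
    """Stand-in for the recommendation service: returns a reshuffled ordering."""
    shuffled = products[:]
    rng.shuffle(shuffled)
    return shuffled


def render_home_page(recommend: Callable[[List[str]], List[str]]) -> List[str]:
    """Treat the recommendation call as a weak dependency.

    If the call fails or times out, fall back to the plain catalog order
    instead of failing the whole page.
    """
    try:
        return recommend(CATALOG)
    except (TimeoutError, ConnectionError):
        return CATALOG[:]


def faulted_recommend(products: List[str]) -> List[str]:
    """Simulates an injected delay that surfaces as a client-side timeout."""
    raise TimeoutError("recommendation service did not answer in time")
```

With a healthy `recommend` function the page order varies between calls; with `faulted_recommend` the page still renders, but always in catalog order. That is the same observable difference the drill in this section relies on.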

1. Define the steady state.

1.1 Switch back to the Container Service ACK console and click the external endpoint of frontend.

1.2 On the Hipster Shop page, refresh the page several times. The products appear in a different order each time, because the product recommendation service personalizes its recommendations and ranks the products accordingly. We therefore define the steady state as: each refresh shows the products in a different order.

2. Simulate real events.

2.1 Switch back to the Application HA console. In the left navigation bar, click My Space.

2.2 On the My Space page, click Create a Blank Drill in the New Drill drop-down list.

2.3 On the Drill configuration page, complete the following operations:

(1) Set the drill name.

(2) In the Drill Object configuration wizard, select recommendationservice for the drill application and recommendationservice-group for the application group, select a machine in the machine list, and click Add Drill Content.

(3) In the Select Drill Fault dialog box, choose Java Applications > Delay > Java Delay in Container, and click OK.

(4) In the drill content area, click Java Delay in Container.

(5) In the Java Delay in Container panel, enter the fully qualified class name, method name, process keyword, and target container name as follows, and click Close.

Fully qualified class name: enter com.alibabacloud.hipstershop.recomendationservice.service.RecommendationServiceImpl.
Method name: enter sortProduct.
Process keyword: enter java.
Target container name: select recommendationservice.

(6) In the drill content area, click Save.

(7) Click Next.

(8) In global configuration, click Next.

(9) In the success dialog box, click Drill Details.

2.4 On the Drill details page, click Drill.

2.5 In the dialog box that is displayed, click Confirm.

3. Test the impact of the experiment.

3.1 Switch back to the Container Service ACK console. On the Stateless page, click frontend.

3.2 On the frontend page, click the Access Method tab and click the external endpoint of frontend.

3.3 Refresh the Hipster Shop page several times. The product order no longer changes between refreshes. This shows that the recommendation service is down but other services are not affected.

4. Terminate the experiment.

4.1 Switch to the Application HA console and click Stop on the drill record details page.

4.2 In the Stop Drill dialog box, click OK.

4.3 In the dialog box that is displayed, click OK.

Failure retry scenario drill

In a microservice architecture, a large system is split into many small services that make a large number of RPC calls to one another. An RPC call may fail because of network jitter or other reasons. A retry mechanism raises the final success rate of requests, reduces the impact of faults, and makes the system run more stably. Here we use Chaos to inject a fault into the system and see how well it retries after a failure.
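As a rough illustration of the retry mechanism described above, the sketch below retries a call against several replicas in round-robin order, so that a request moves on to a healthy replica when one fails. The function and exception names are hypothetical and say nothing about how the Hipster Shop services actually call each other.

```python
from typing import Callable, List, Sequence


class AllReplicasFailed(Exception):
    """Raised when every attempt failed."""


def call_with_failover(replicas: Sequence[Callable[[], str]], max_attempts: int) -> str:
    """Try replicas in round-robin order until one succeeds or attempts run out."""
    if not replicas:
        raise ValueError("at least one replica is required")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    errors: List[Exception] = []
    for attempt in range(max_attempts):
        replica = replicas[attempt % len(replicas)]
        try:
            return replica()
        except Exception as exc:  # a real client would retry only transient errors
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} attempts failed") from errors[-1]


def faulty_view_cart() -> str:
    """Replica with an injected exception."""
    raise RuntimeError("injected exception")


def healthy_view_cart() -> str:
    """Replica that was not targeted by the drill."""
    return "cart: 2 items"
```

With `max_attempts=2`, `call_with_failover([faulty_view_cart, healthy_view_cart], max_attempts=2)` returns the healthy replica's answer. With `max_attempts=1` there is effectively no retry, and the injected exception reaches the caller as `AllReplicasFailed`.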

1. Define the steady state.

1.1 Switch back to the Container Service ACK console and click cartservice on the Stateless page.

1.2 On the cartservice page, click Scale.

1.3 In the Scale dialog box, change the desired number of pods to 2 and click OK.

When the status changes to Running, the pods have been scaled out successfully.

1.4 Switch to the Hipster Shop page and click the shopping cart.

If the following page is displayed, the shopping cart service is normal. So we define steady state as being able to use the shopping cart function of frontend properly.

2. Simulate real events.

2.1 Switch to the Application HA console and click My Space in the left navigation bar.

2.2 On the My Space page, click Create a Blank Drill in the New Drill drop-down list.

2.3 On the Drill configuration page, complete the following operations:

(1) Set the drill name.

(2) In the Drill Object configuration wizard, select cartservice for the drill application and cartservice-group for the application group, select any machine in the machine list, and click Add Drill Content.

(3) In the dialog box that is displayed, choose Java Applications > Throw Exception > Java Delay Throwing Custom Exceptions in Container, and click OK.

(4) In the drill content area, click Java Delay Throwing Custom Exceptions in Container.

(5) In the Java Delay Throwing Custom Exceptions in Container panel, enter the method name, fully qualified class name, exception, process keyword, and target container name as follows, and click Close.

Method name: enter viewCart.
Fully qualified class name: enter com.alibabacloud.hipstershop.cartserviceprovider.service.CartServiceImpl.
Exception: enter java.lang.Exception.
Process keyword: enter java.
Target container name: select cartservice.

(6) In the drill content area, click Save.

(7) Click Next.

(8) In global configuration, click Next.

(9) In the success dialog box, click Drill Details.

2.4 On the Drill details page, click Drill.

2.5 In the dialog box that is displayed, click Confirm.

3. Test the impact of the experiment.

3.1 Switch to the Hipster Shop page and click the shopping cart.

The following page is returned and the shopping cart cannot be accessed. Traffic was not switched to the instance that is still healthy, which means our system has no working failure retry capability: either it was never designed in, or it does not take effect. This fault injection exposed a flaw in the system.

3.2 Switch to the Application HA console and click Stop on the drill record details page.

3.3 In the Stop Drill dialog box, click OK.

3.4 In the dialog box that is displayed, click OK.

If the following page is displayed, the drill is complete.

Microservice drill

After working through the three scenarios above, we have a preliminary understanding of chaos engineering and know the basic functions of the Application High Availability Service. However, configuring the drill parameters manually is tedious. Next, we try a quicker and more convenient approach: strong and weak dependency governance.

Procedure:

1 Switch to the Application HA console. In the left navigation bar, click Microservice Drill, and then select the Strong and Weak Dependency Governance page.

2 On the Strong and Weak Dependency Governance page, click Create Governance Scheme.

3 On the configuration wizard page for creating a governance scheme, perform the following operations:

3.1 In Application Access, enter a scheme name, select frontend as the governance application, and click Next.

3.2 In the 30-day dependency management dialog box, click Confirm.

3.3 In dependency analysis, wait until the analysis is complete and click Next.

3.4 In Dependency Prediction, mark each dependency as strong or weak, for example nacos-standalone and checkoutservice, and click Next.

3.5 In Dependency Verification, select any use case, for example frontend and nacos-standalone, and click Verify.

3.6 In the dialog box that is displayed, click Confirm.

Note:

If the new page does not open, check whether the browser blocked the pop-up window and allow it manually.

4 On the drill details page, click Drill.

5 In the dialog box that is displayed, click Confirm.

6 Switch to the Hipster Shop page and click any function on the page. The Hipster Shop page and its functions work normally, which indicates that the frontend service depends only weakly on the nacos-standalone service.

7 Switch to the Application HA console and click Stop on the drill record details page.

8 In the Stop Drill dialog box, click OK.

9 In the result feedback dialog box, select weak dependency as the verification result, and click OK to return to strong and weak dependency governance.

10 In Dependency Verification, you can verify other use cases. After verification, click Archive Scheme.

11 In the dialog box that asks whether you are sure you want to archive this scheme, click Confirm Archive.

If the following page is displayed, the archive is complete.
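The rule applied in the verification steps above is simple: if the caller's steady state still holds while a dependency is faulted, the dependency is weak; otherwise it is strong. A minimal sketch of that rule, and of comparing it with the labels chosen in the prediction step, is shown below. The function names are hypothetical, and the example results in the usage note are made up except for nacos-standalone, which the walkthrough found to be weak.

```python
from typing import Dict, List


def classify_dependencies(steady_state_held: Dict[str, bool]) -> Dict[str, str]:
    """Label each dependency from its drill result.

    steady_state_held maps a dependency name to whether the caller's steady
    state still held while that dependency was faulted.
    """
    return {
        name: "weak" if held else "strong"
        for name, held in steady_state_held.items()
    }


def mismatches(predicted: Dict[str, str], verified: Dict[str, str]) -> List[str]:
    """Return the verified dependencies whose predicted label was wrong or missing."""
    return sorted(
        name for name, label in verified.items() if predicted.get(name) != label
    )
```

For example, verifying `{"nacos-standalone": True, "checkoutservice": False}` labels nacos-standalone as weak and checkoutservice as strong; if the prediction step had marked both as strong, `mismatches` would report nacos-standalone as the label to correct.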
