Recently, Lu Xuehui, CTO of Series Technology, spoke at the ArchSummit Global Architect Summit, delivering a keynote entitled “How to Achieve Zero Performance Failures: DevHA Practice in the Field of High Availability”, in which she detailed the practical experience and solutions behind achieving zero performance failures.

Before getting started, I would like to share a short story. Summer is coming, and some time ago I noticed mosquitoes in Shenzhen. When I turned off the lights at night, I could hear a buzzing sound around me. I'm sure everyone has had this kind of experience!

It's very similar to the performance bottlenecks I'm going to talk about today. We all know that every system has performance bottlenecks, but where exactly? Without narrowing down the range, a bottleneck is very hard to find. Once it is found, optimizing it is something most architects have no trouble with; the difficulty is that we cannot find it in the first place.

From 7 failures to 0 failures for two consecutive years

Let's start with a set of figures from our performance practice.

These figures come from one of our clients and cover 2018 to 2020. In 2018, the client suffered seven production performance failures lasting more than 500 minutes in total. Full-link pressure testing in the production environment began in 2018; by 2020, all applications had been connected to it, and the tests have been run continuously ever since to ensure there are no performance failures in production. Here's how we did it.

Performance failures occur frequently. What is the core problem?

In 2016, this customer launched a new business line that grew rapidly. At the time, they faced several problems: first, the system was old; second, requirements were iterating very quickly; third, there were many new team members.



That was the situation they were facing. With so many performance failures, if you only respond passively as they occur, they may never be solved. We therefore needed to extract the core contradictions from these complex symptoms and resolve them continuously in order to bring the whole situation under control.

When we analyzed the overall situation, we found that the most common cause of their failures was the database, which accounted for more than half of the performance failures. We also found an interesting and often overlooked data point: excluding big data, database hardware accounted for more than 50% of the total hardware cost, even though the number of database machines was not that large. In other words, database computing resources were both very expensive and a frequent source of trouble.



The second major contradiction was that the performance testing cycle was extremely long and the cost very high.

At that time, the company had a performance team of eight people, and the machines alone cost about 4.5 million, which amortized over three years comes to roughly 1.5 million per year. In this morning's session, Liu Tanren, director of the SF Technology Architecture Committee, shared that their hardware cost for pressure testing was about 20 million, so the cost of traditional performance testing is very high.



In addition, what bosses, CIOs, and CTOs found most worrying was the long cycle: scheduling a performance requirement took more than two weeks, which meant that only large projects planned well in advance could get a performance pressure test. For a daily iteration with only a week of development time, performance testing simply could not be done, and that led to frequent failures in the production environment.

The third major contradiction was that the pressure of guaranteeing performance fell squarely on the architect team, which lacked the means to objectively measure the capacity of the production environment and could only rely on experience to find performance bottlenecks and optimization directions.

How to solve the three core problems?

As with most corporate technical teams, the internal staff worked hard. The architects upgraded the distributed database to give the system more computing power; on the business side, they optimized the architecture and refactored systems to improve performance, and provided dedicated resources for core links… Although they did a solid job, performance problems persisted.

The solution we came up with at that time took three steps:



The first step addressed the database problem. At the time we did not make introducing a distributed database the core principle; instead, we optimized the computing architecture to take load off the database. It is rather like a primary school student: assign him less homework so that he has more time to handle the one thing he really should be doing.

The second step was full-link pressure testing in the production environment.

The third step, which you may find even more extreme, was brute-force high-frequency pressure testing. To keep guaranteeing that there are no performance issues online, the pressure has to be applied at high frequency. It is just like the mosquito problem: with the doors and windows closed, I can wipe out every mosquito in the house in one pass; but in daily life we keep opening doors and windows, and a few days later the mosquitoes are back. Without an ongoing mosquito-control program the house can never stay mosquito-free, and the same is true of performance in a system.

So let's take a look at each of these three steps and what we did.

1. Optimize the computing architecture to reduce the database load

At its core this involved two steps. The first was to isolate the resources used for TP-type (transactional) query computation from those used for AP-type (analytical) computation.



In a variety of ways, we avoided using the database wherever possible, because database resources are very valuable. Once this was done, the load on the database dropped and performance problems decreased.
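As a rough illustration of this kind of isolation, here is a minimal sketch in Java (the class and field names are assumptions, not the client's actual implementation): transactional (TP) queries stay on the primary OLTP data source, while analytical (AP) queries are routed to a separate replica or analytical store so they cannot compete with online traffic for database resources.

```java
import javax.sql.DataSource;

// Minimal sketch of TP/AP resource isolation at the data-access layer.
public class QueryRouter {

    public enum QueryType { TP, AP }

    private final DataSource oltpPrimary; // serves online transactions only
    private final DataSource apReplica;   // serves reports and analytics

    public QueryRouter(DataSource oltpPrimary, DataSource apReplica) {
        this.oltpPrimary = oltpPrimary;
        this.apReplica = apReplica;
    }

    // Pick the data source by declared workload type instead of mixing both on one database.
    public DataSource route(QueryType type) {
        return type == QueryType.AP ? apReplica : oltpPrimary;
    }
}
```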

The second step was to establish a simple, workable SQL specification: single SQL, single table. The rule itself is easy to follow; the hard part is taking that first step when the existing schema and code do not follow single SQL, single table.
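To illustrate what "single SQL, single table" looks like in code, here is a hypothetical sketch (table names, columns, and the JDBC usage are made up for the example): a query that would normally join orders with users is split into two single-table queries, and the join is performed in application code.

```java
import java.sql.*;
import java.util.*;

// Hypothetical example: two single-table queries instead of one multi-table join.
public class SingleTableQueryExample {

    public static List<Map<String, Object>> loadOrdersWithUserName(Connection conn, long userId)
            throws SQLException {
        // Query 1: single table "users"
        String userName = null;
        try (PreparedStatement ps = conn.prepareStatement("SELECT name FROM users WHERE id = ?")) {
            ps.setLong(1, userId);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    userName = rs.getString("name");
                }
            }
        }

        // Query 2: single table "orders"; the "join" happens here in the application
        List<Map<String, Object>> result = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement("SELECT id, amount FROM orders WHERE user_id = ?")) {
            ps.setLong(1, userId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    Map<String, Object> row = new HashMap<>();
                    row.put("orderId", rs.getLong("id"));
                    row.put("amount", rs.getBigDecimal("amount"));
                    row.put("userName", userName);
                    result.add(row);
                }
            }
        }
        return result;
    }
}
```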

I remember this client first came to us because a system of theirs had started failing within a week of going live, due to heavy contention for database resources. At the time, the Oracle experts suggested adopting Exadata, at a cost of 20 million, to solve the problem.

The customer found us through a friend. After learning about the situation, we told them we had a way to solve the problem for good, and it would not cost 20 million. The solution we proposed was to break down all of their SQL statements that ran to hundreds of lines each. A little over a month later, the system went live successfully, with plenty of database resources to spare. Through this effort, we helped the customer save roughly 20 million.

By the time we had done these two steps, many of the performance problems had been curbed.

2. Full-link pressure testing in the production environment

At that time, the customer's CTO raised a question: will the system fail again this Double 11? That question is hard to answer if all you have done is some database-level architecture optimization. So we moved on to the second step: full-link pressure testing in the production environment. Many people are alarmed the first time they hear this idea: what if it takes production down? So today I am going to focus on how we keep it safe.

In fact, the core logic of full-link pressure testing in production is very simple.



First, pressure-test traffic must be identifiable at every node. In any piece of processing logic, I should be able to tell whether the request I am handling belongs to pressure-test traffic or real production traffic.

Second, the pressure-test tag has to be propagated hop by hop through the microservice architecture; if the propagation breaks anywhere along the chain, problems will follow.

Third, pressure-test data must be isolated: data generated by the pressure test must never be mixed with data generated by the business.

Identifying pressure-test traffic is actually fairly simple. Taking HTTP traffic as an example, adding a key-value pair to the HTTP header is enough to mark it. What is hard is passing that tag down through the entire microservice architecture; tag propagation has real technical difficulty. Engineers need a good understanding of every piece of middleware the company uses, there is no one-size-fits-all approach, and the propagation solution has to be customized for each middleware.
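Here is a minimal sketch of the identification and propagation idea, assuming a servlet-based service and a made-up header name (the real product's header, filter, and context classes may differ): an inbound filter reads the pressure-test header into a thread-local context, and outbound calls copy it onto downstream requests so the tag travels hop by hop.

```java
import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;

// Thread-local holder for the pressure-test flag of the request being processed.
class StressContext {
    private static final ThreadLocal<Boolean> STRESS = ThreadLocal.withInitial(() -> false);
    static void set(boolean stress) { STRESS.set(stress); }
    static boolean isStress()       { return STRESS.get(); }
    static void clear()             { STRESS.remove(); }
}

// Inbound side: mark the current thread when the request carries the (assumed) stress header.
public class StressTagFilter implements Filter {
    static final String STRESS_HEADER = "X-Stress-Test"; // hypothetical header name

    @Override public void init(FilterConfig filterConfig) {}
    @Override public void destroy() {}

    @Override
    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        try {
            HttpServletRequest http = (HttpServletRequest) req;
            StressContext.set("true".equalsIgnoreCase(http.getHeader(STRESS_HEADER)));
            chain.doFilter(req, resp);
        } finally {
            StressContext.clear(); // never leak the flag to the next request on this thread
        }
    }
}

// Outbound side: when building a downstream HTTP/RPC request, add the same header
// if StressContext.isStress() is true, so the next hop can identify the traffic too.
```

Note that thread-locals alone are not enough once work is handed off to thread pools, async callbacks, or message queues; carrying the tag across those boundaries is exactly the per-middleware customization mentioned above.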

Finally, I want to talk about how to isolate this kind of data. I've listed some of the approaches here, though not all of them.



For example, messaging systems can be isolated by topic, search engines by index, and databases by shadow databases or tables. The principle is simple; the complexity and difficulty lie mainly in the technical details.
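To make the database case concrete, here is a minimal, assumed sketch of shadow-table routing (the prefix and helper are illustrative only, and StressContext refers to the flag holder sketched earlier): when the current request is pressure-test traffic, reads and writes are redirected to a shadow copy of the table instead of the business table.

```java
// Hypothetical shadow-table routing: pressure-test data goes to "shadow_<table>",
// while business data keeps using the original table.
public class ShadowTableRouter {
    private static final String SHADOW_PREFIX = "shadow_"; // assumed naming convention

    public static String resolveTable(String logicalTable) {
        return StressContext.isStress() ? SHADOW_PREFIX + logicalTable : logicalTable;
    }
}

// Usage: "orders" for production traffic, "shadow_orders" for pressure-test traffic.
// String sql = "INSERT INTO " + ShadowTableRouter.resolveTable("orders") + " (order_id, amount) VALUES (?, ?)";
```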

We also ran into trouble with message isolation. Take RocketMQ as an example: the messaging system is divided into producers, consumers, and the broker. Production and pressure-test messages are isolated by topic, with pressure-test data going into a shadow topic. The idea is very simple, and it also makes cleaning up and maintaining pressure-test data later very easy. During the trial run, however, we found that some pressure-test data had leaked into the production flow, which puzzled us; according to the design, that should not have happened.



It later turned out that a problem in how the test data was constructed caused the consumer to fail to consume it, and in RocketMQ a message that has failed to be consumed is placed into a retry queue for redelivery. We had not considered this part: we had only created shadow topics for the business messages, so when the retried data came back it was treated as a normal business message and flowed into production. We therefore added shadow handling for the retry queue as well, and the problem was solved smoothly.
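On the producer side, the shadow-topic routing can be as simple as the sketch below, using the Apache RocketMQ Java client (the topic names, shadow prefix, and group name are assumptions). The lesson from the incident above is that the retry flow must be shadowed too; one common way is to consume shadow topics with a dedicated shadow consumer group, so that its retry queue also stays separate from the business one.

```java
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;

// Hypothetical producer-side routing: pressure-test messages go to a shadow topic.
public class ShadowTopicProducer {
    private static final String SHADOW_PREFIX = "SHADOW_"; // assumed naming convention

    private final DefaultMQProducer producer;

    public ShadowTopicProducer(String nameServerAddr) throws Exception {
        this.producer = new DefaultMQProducer("demo_producer_group");
        this.producer.setNamesrvAddr(nameServerAddr);
        this.producer.start();
    }

    // Route by the pressure-test flag carried with the current request.
    public void send(String topic, byte[] body, boolean stressTraffic) throws Exception {
        String targetTopic = stressTraffic ? SHADOW_PREFIX + topic : topic;
        producer.send(new Message(targetTopic, body));
    }
}
```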



The point of this anecdote is that data isolation and tag propagation involve many technical details, and engineers must understand the internals of all the middleware the company uses, or problems like this will slip through.

To ensure the safety and stability of the system, in addition to the safeguards built into the technical design just described, we also perform extensive safety checks before, during, and after each full-link pressure test, each targeting different risk points.



Let me pick a few examples to share with you. One feature is the whitelist. What does it do? Suppose a link covered by the full-link pressure test runs through services A, B, C, and D, but because of resource-coordination constraints the four cannot all be onboarded at the same time, and only A and B go first. At that point A and B can recognize pressure-test traffic, but C and D cannot, so that traffic would turn into real traffic once it reached them. The whitelist handles exactly this situation: we collect the list of all services that support pressure testing, have an aggregation service turn it into whitelist configurations, and distribute those configurations so that B's pressure-test traffic is prevented from entering C.
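A minimal sketch of the whitelist check (service names, the configuration source, and the blocking behavior are assumptions): before forwarding pressure-test traffic to a downstream service, the caller consults the distributed whitelist and refuses to send if that service has not been onboarded.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical whitelist guard: only onboarded services may receive pressure-test traffic.
public class StressWhitelist {
    // In a real system this set would be pushed from a central configuration service;
    // here it is just an in-memory placeholder.
    private final Set<String> onboardedServices = ConcurrentHashMap.newKeySet();

    public void refresh(Set<String> latest) {
        onboardedServices.clear();
        onboardedServices.addAll(latest);
    }

    // Returns true only if it is safe to send this call downstream.
    public boolean allowCall(String downstreamService, boolean stressTraffic) {
        // Real traffic is never blocked; stress traffic only flows to whitelisted services.
        return !stressTraffic || onboardedServices.contains(downstreamService);
    }
}
```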

In addition, we provide monitoring. The E2E inspection platform can set RT (response time) and error-rate thresholds for different scenarios, and once a threshold is breached, the pressure test is stopped automatically.
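Conceptually, the auto-stop rule is just a periodic comparison of live metrics against per-scenario limits. The sketch below is an assumption about how such a guard might look, not the platform's actual implementation.

```java
// Hypothetical circuit breaker for a pressure-test run: stop when RT or error rate breaches its limit.
public class AutoStopGuard {
    private final double maxRtMillis;
    private final double maxErrorRate; // e.g. 0.01 for 1%

    public AutoStopGuard(double maxRtMillis, double maxErrorRate) {
        this.maxRtMillis = maxRtMillis;
        this.maxErrorRate = maxErrorRate;
    }

    // Called periodically with the latest aggregated metrics for the scenario.
    public boolean shouldStop(double currentRtMillis, double currentErrorRate) {
        return currentRtMillis > maxRtMillis || currentErrorRate > maxErrorRate;
    }
}

// Usage (names hypothetical): if (guard.shouldStop(rt, errorRate)) { loadGenerator.stop(); notifyOnCall(); }
```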



With these safeguards in place, we can run full-link pressure tests in the production environment safely, and confidently answer the CTO's earlier question: this year's Double 11 will not fail!

3. Brute-force high-frequency pressure testing

Besides big events like Double 11 and 618, performance problems can also appear on ordinary days, and we need to find and optimize them too; this is where what we call brute-force high-frequency pressure testing comes in. There is an important shift here: full-link pressure testing moves from a support model to an operations model. The difference is that in the support model, you give me requirements and I execute them, whereas in the operations model, I set the standards for you, you do the work, and then I check it.

The first thing is to get support from the CTO or the leadership of the architecture group, find people willing to do this, and set up a performance operations team. The second is that in the initial promotion stage, we need to change from supporters into advocates: when promoting a new technology like this inside a company, the architects must help the business-line engineers solve problems and build trust.



In high-frequency pressure testing, we must find every possible way to reduce the cost of use for R&D engineers; for example, through a probe, developers can run pressure tests without changing any code.
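The "no code changes" approach is usually built on bytecode instrumentation through a Java agent; the skeleton below shows only the standard premain entry point where such a probe would hook in (it is a generic sketch, not the vendor's actual probe, and the transformer here deliberately does nothing).

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Skeleton of an agent-based probe, attached with "-javaagent:probe.jar"
// so application code never has to change.
public class StressProbeAgent {

    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> classBeingRedefined,
                                    ProtectionDomain protectionDomain,
                                    byte[] classfileBuffer) {
                // A real probe would rewrite traffic entry points (HTTP, RPC, MQ clients) here
                // to read and propagate the pressure-test tag; returning null leaves the class unchanged.
                return null;
            }
        });
    }
}
```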



Many problems come up during adoption, and we distill all of them back into the product, building a series of tools that help developers get started quickly and complete the work. Analysis modules are built in as well; the pressure-test report, for example, tells the developer whether performance is falling short of the target.



This is the data I showed you at the beginning, with two additional columns: the number of applications onboarded to production pressure testing and the number of production pressure tests run. Compared with 2018, the improvement in the last two years is easy to see, and that is what sustains zero performance failures in the production environment.

Outlook for DevHA high availability

In the future, technologies and products related to high availability will certainly flourish, and the whole ecosystem will be built around R&D. Production-simulation technology will improve greatly, and the way performance problems are handled will shift from finding faults after the fact to proactively finding and addressing them in advance. Practices like brute-force high-frequency pressure testing will be run routinely to keep systems continuously robust and to provide accurate, high-frequency feedback that R&D engineers can use to keep optimizing.

Founded in 2016 by a number of senior experts from Alibaba, Series Technology is a leading system high-availability specialist in China. Its core focus is solving the governance and performance problems of microservice architectures and providing comprehensive guarantees for the performance and stability of enterprise systems. It has built a complete product matrix covering full-link pressure testing, E2E inspection, fault drills, and other modules, and is committed to helping enterprises raise system availability to 99.99%.