Reflections and Practice on Building a Full-Link Stress Testing System

Brief introduction: Alibaba has long run full-link stress tests in production to prepare for Taobao's Double 11. Through practice we found that the ability to stress test in a production environment is closely tied to the maturity of an IT organization's structure and processes. We therefore extended full-link stress testing from a test confined to production into a complete business continuity plan.

This article covers four topics. First, the significance of full-link stress testing and why it should be done in production. Second, the key technical points and solutions for putting it into practice. Third, suggestions on the process of full-link stress testing in production; since every organization has a different tolerance for risk, these are offered as options rather than rules. Fourth, how we helped third parties achieve business continuity in their production environments, including the results of their stress tests.

The significance of full-link stress testing

The diagram above shows three problems that typically come up when we talk with different IT organizations about testing.

Many people in the testing industry say they have done offline performance testing, yet many problems still appear once the system goes online. The reason is that an offline environment cannot reproduce production 1:1. When many third-party interfaces are involved, teams rarely simulate the entire online scenario. After doing a great deal of offline testing ourselves, we concluded that this is why many companies that extrapolate online capacity from offline capacity end up with poor results.

Now that IT organizations are adopting DevOps, features that used to be released once a month are released once a week, leaving less and less time for testing. Functional testing has shrunk from one or two weeks to two to four days. If performance testing cannot be completed in that time, the release is likely to carry performance problems, which directly damage the company's brand.

Normally the online load level is fairly low and rarely reaches its peak, but emergencies do happen. Last year's epidemic, for example, pushed many companies to move their business online. In education, classes that teachers used to give face to face are now delivered on online platforms. Unexpected situations like this cause great trouble for test engineers as well as development and operations teams. Before going further, I want to introduce a concept developed by Nassim Nicholas Taleb, the author of The Black Swan. The concept centers on fragility and antifragility.

What is fragility? Something fragile is like glass, which breaks easily. What is the opposite of fragility? It is not toughness or robustness; it is antifragility. What is antifragility? Take a table tennis ball. Lying on the ground, it can be destroyed with little force; one step crushes it. In high-speed play, however, the harder we hit it, the harder it rebounds. In motion, the ball shows antifragile characteristics.

The same is true of our IT systems. No code is guaranteed to be flawless, our infrastructure may be fragile, with limits on servers, databases and so on, and our frameworks are always fragile to some degree. Taking all these issues together, we hope that means such as contingency plans, risk identification and circuit breakers can make the IT system as a whole antifragile. In short, we want the system to have enough redundancy and enough contingency plans to cope with unexpected and uncertain risks.

How do we build an antifragile IT system? First, we use capabilities such as online stress testing to surface uncertain factors. Then, through real-time monitoring and contingency plans during the test, we identify these factors and handle them while the production stress test is running; more of them may be recognized in the review afterwards. Next, we make stress testing in production a routine activity so that stability holds over the long term. In the end we obtain the capabilities that antifragility requires: overall monitoring, runtime protection, and traffic routing control. Together these give the whole IT system antifragile characteristics.

Full-link stress testing solution

How do we run full-link stress tests in a production environment? What technology does it require?

Evolution of the stress testing process

How does testing generally evolve from offline to online? I divide it into four stages:

In the first stage, which the vast majority of IT organizations can already manage, a single system is stress tested offline: a single interface or a single scenario is put under load, along with some system and performance analysis. In complex business scenarios, however, this may not uncover all the problems, and much of the work is done on the initiative of individual developers or testers.

In the second stage, the organization sets up something like a test lab or a dedicated testing department. A department of that size can build a performance test environment that resembles production. More can be done there, for example full-link stress testing in the offline environment, offline regression based on accumulated experience, and performance diagnosis. This stage effectively moves testing a step forward: links are analyzed in the test environment, and further capabilities such as risk control grow on top of it.

In the third stage, which the vast majority of IT and Internet companies are now willing to try, business stress testing moves to the online production environment. It resembles the second stage, but in practice it splits into two levels. At the first level, the organization simply runs full-link stress tests; many IT companies run read-only stress tests outside production hours, because read-only tests cause no data pollution. At the next level, some organizations run full-link stress tests during normal production hours, which demands more of the organization. For example, the stress test traffic has to be dyed so that normal business data and traffic can be told apart from stress test traffic, and some environment isolation may be needed. When testing during business hours, traffic migration, rate limiting and circuit breaking all have to be considered. However carefully this is done, production business may still be affected, so a fast circuit breaker mechanism is needed for the moment a real problem appears.

With these capabilities in place, the fourth and final stage is full-link testing of the entire production path, covering both reads and writes; this is the basic capability. In practice we mostly import the database tables and, with additional technical means, stress test every link in production, including read, write and business operations. We also have the ability to run system fault drills and production change drills. At this point we finally have isolation capabilities, including monitoring isolation and log data isolation.

Key technologies of full-link stress testing

Full-link stress testing needs several key technologies:

Full-link traffic dyeing

Traffic can be dyed by adding an identifier at the load generator, such as a suffix, or by some other marker that lets the traffic be recognized and routed to the relevant tables. The traffic also has to be identifiable throughout the full link: at every middleware component and every service that stress test traffic passes through, we want to determine accurately whether a request comes from the load generator or is normal traffic. This is the first step.

Full-link data isolation

How do we isolate the data? One method is a shadow database: the operations team creates a shadow database identical to the production one, and stress test traffic is switched to it. The other is shadow tables: tables with the same structure are created inside the production database. The first method is safer, but its drawback is that the real production database is not what is being exercised during the test, so the shadow database cannot fully simulate online behavior. The shadow-table method requires a higher technical level, because we must guarantee traceability across the whole link and be able to recover all the data if something goes wrong.
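The shadow-table idea can be sketched in a few lines. This is an illustrative example only: it uses an in-memory SQLite database, and the `_shadow` suffix and the `orders` table are assumptions made for the demonstration, not the naming used by the platform in this article.

```python
import sqlite3

SHADOW_SUFFIX = "_shadow"  # hypothetical naming convention


def table_for(base_table, is_stress):
    """Pick the shadow table for stress traffic, the real table otherwise."""
    return base_table + SHADOW_SUFFIX if is_stress else base_table


def create_tables(conn):
    # The shadow table has the same schema as the production table.
    for name in ("orders", "orders" + SHADOW_SUFFIX):
        conn.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, item TEXT)")


def save_order(conn, item, is_stress):
    # The table name is built from constants, never from user input.
    table = table_for("orders", is_stress)
    conn.execute(f"INSERT INTO {table} (item) VALUES (?)", (item,))


def count_rows(conn, table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


conn = sqlite3.connect(":memory:")
create_tables(conn)
save_order(conn, "real book", is_stress=False)
save_order(conn, "test book", is_stress=True)
save_order(conn, "test pen", is_stress=True)
```

After these three writes, the production table holds only the real order and the shadow table holds the two test orders, so business queries against `orders` never see stress test data.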

Full-link risk control mechanism

This is the circuit breaker for risk. Once we find that a production stress test is really affecting the business, rules or other indicators must trigger the circuit breaker and the associated control measures automatically, whether by cutting off traffic from the load generators or by isolating the damaged part of the production system from the business. Such measures are a necessity for full-link stress testing in production.
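A rule-triggered circuit breaker of this kind might look like the sketch below. It is a simplified, assumption-laden example: the class name, the sliding window of recent requests, and the default threshold values are all hypothetical choices for illustration, and a real platform would act on richer indicators.

```python
from collections import deque


class StressTestFuse:
    """Trips when the error rate over the last `window` requests is too high."""

    def __init__(self, max_error_rate=0.01, window=100, min_samples=20):
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples
        self.results = deque(maxlen=window)  # True means the request failed
        self.tripped = False

    def record(self, failed):
        """Record one request outcome and trip the fuse if the rule is broken."""
        self.results.append(bool(failed))
        if len(self.results) >= self.min_samples:
            error_rate = sum(self.results) / len(self.results)
            if error_rate > self.max_error_rate:
                self.tripped = True  # stays tripped until an operator resets it

    def allow_stress_traffic(self):
        """The load generator checks this before sending more traffic."""
        return not self.tripped

    def reset(self):
        self.results.clear()
        self.tripped = False
```

The load generator would call `allow_stress_traffic()` before each batch and stop sending as soon as it returns `False`. Requiring a minimum number of samples avoids tripping on the first isolated failure.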

Full-link log isolation

Logs by themselves have little impact on the link. But as digitization has advanced, logs have become the most important data source for BI and operations staff analyzing the business. Without log isolation, stress testing is likely to distort BI decisions. For example, a stress test may send a large volume of traffic from one region into the production environment; a BI analyst reading the logs may conclude that business in that region has surged and make the wrong operating decision. So throughout a production stress test, logs must be handled so that those from normal production traffic and those from stress test traffic are stored separately.
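One way to keep the two kinds of logs apart is to filter records by a stress-test flag before they reach storage. The sketch below uses Python's standard `logging` module; the `stress` attribute, the logger name, and the in-memory streams that stand in for separate log files are assumptions made for the example.

```python
import io
import logging


class StressOnly(logging.Filter):
    """Pass only records marked as stress-test traffic."""

    def filter(self, record):
        return getattr(record, "stress", False)


class ProductionOnly(logging.Filter):
    """Pass only records that are not marked as stress-test traffic."""

    def filter(self, record):
        return not getattr(record, "stress", False)


# In-memory streams stand in for two separate log files or topics.
production_log = io.StringIO()
stress_log = io.StringIO()

logger = logging.getLogger("orders_demo")
logger.setLevel(logging.INFO)
logger.propagate = False

prod_handler = logging.StreamHandler(production_log)
prod_handler.addFilter(ProductionOnly())
stress_handler = logging.StreamHandler(stress_log)
stress_handler.addFilter(StressOnly())
logger.addHandler(prod_handler)
logger.addHandler(stress_handler)

# A normal request and a dyed request produce the same message,
# but each ends up in a different destination.
logger.info("order placed region=east")
logger.info("order placed region=east", extra={"stress": True})
```

With this arrangement a BI job that reads only the production destination sees one order from the east region, not two.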

Core functions of a full-link stress testing and business continuity platform

This part describes the functions a platform needs in order to truly serve as a full-link stress testing and business continuity platform.

First, a tool that generates stress test traffic from all regions. Its functions include traffic collection and traffic transformation across regions.
Second, stress test identification, including the shadow storage functions. In the diagram, the yellow part is normal traffic and the blue part is stress test traffic. We modify the load generators so that the blue traffic carries an identifier. Agent technology recognizes that traffic and, at the bottom layer, forwards it to the corresponding shadow database, shadow table, or shadow cache area.
Third, circuit breaker rule management, which needs a proper console. The console may handle probe installation management, architecture management, database table maintenance, rule maintenance, circuit breaker maintenance and so on.
Finally, the actual stress testing part. Probes or Agents installed here make the traffic land in the corresponding shadow tables and watch the relevant monitoring indicators. When, for example, the error rate reaches 1% or the response time exceeds a certain threshold, the Agent reports it promptly, and rate limiting is applied through rule configuration. With this architecture we can now save about 40% of the cost compared with a full-scale environment, with basically no intrusion into production business.

Risk prevention and control in full-link stress testing

Let's look at how the shadow database and traffic identification work. In the diagram, the orange part is the actual stress test traffic. It is marked at the load generator, currently by adding a suffix. On the server side we also add a filter, which is really an interceptor: it intercepts the identifier in the traffic, then distinguishes, dyes and tracks the request. Every request is basically transparent and visible in any middleware component and in the call stack. During the stress test, the Agent rewrites bytecode directly so that the original access conditions are replaced with the stress test conditions. The shadow database has to be built first, of course. Through the underlying tracing we can follow where the traffic goes and check whether it reaches the intended database. All test data also carries an identifier, so if some of it has not gone into the shadow storage, we can delete it from the normal tables. Every area the traffic passes through is visible to us. With this approach, most IT organizations divide the work into three phases, while some very mature ones use two:

Before going online, most problems are found during offline development, testing and debugging, and the interfaces are then optimized to make sure no code problems remain. Problems of this kind, DNS problems included, are basically solved in the offline development environment.
During deployment, issues such as third-party plug-ins and security are handled. As container technology develops, the role of this deployment phase is gradually fading.
Online, the actual stress test of the production environment is carried out. This phase may cover capacity planning and stress testing, environment-wide issues such as CDN or DNS, and capacity assessment of the whole online system.

These are the goals we currently hope to achieve at various stages throughout the testing life cycle.
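Returning to the identifier on test data mentioned earlier in this section: if dyed rows do leak into a normal table, the identifier makes cleanup a single statement. The following is a hedged sketch using an in-memory SQLite database; the `is_stress` marker column and the `orders` table are hypothetical, since the article only says that test data carries an identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# `is_stress` is a hypothetical marker column used for this illustration.
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, is_stress INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (item, is_stress) VALUES (?, ?)",
    [("real book", 0), ("test book", 1), ("real pen", 0), ("test pen", 1)],
)


def purge_leaked_test_rows(conn, table="orders"):
    """Delete marked rows from a normal table and return how many were removed."""
    cursor = conn.execute(f"DELETE FROM {table} WHERE is_stress = 1")
    conn.commit()
    return cursor.rowcount


removed = purge_leaked_test_rows(conn)
remaining = [row[0] for row in conn.execute("SELECT item FROM orders ORDER BY id")]
```

Here `removed` counts the leaked test rows and `remaining` lists only real business data, which is the recovery property the shadow-table approach depends on.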

Suggestions for the stress testing process

Because every organization is at a different level of maturity, these recommendations may not apply to all IT organizations, but you can adapt them to your own situation.

When we carry out full-link stress testing and online production stress testing for a third party, the work generally goes through five stages:

The first is the stage of sorting out the business with the third party. We do the following:

1. Assess the performance and capacity targets of the business system based on its past usage;

2. Map the architecture of the existing information systems and determine the path that the dyed traffic will follow;

3. Agree on the stress test duration, including time windows, and confirm the design of the stress test scenarios;

4. Desensitize production data. If the test involves production data, desensitization and related work may be needed.

All of this prepares for the second stage, application modification. This includes marking the traffic, using traffic monitoring to determine which business systems are involved, connecting those systems to monitoring, mocking the relevant third-party components, and agreeing the full set of test scenarios with the third party. It also includes building the tables for stress test traffic and connecting contingency plans.

The third is the stress test itself. Full-link stress testing in the live production state is used to optimize performance and evaluate the capacity of the whole system.

The fourth is to make online full-link stress testing routine. This involves rate limiting, degradation, chaos engineering acceptance, and production releases.

The fifth is to review the whole activity to see whether the contingency plans were effective and what still needs to be optimized. This is the life cycle of full-link stress testing in production.

We are now going further into the development process itself. Single-interface performance tests are probably already a familiar part of that process. We are currently building single-machine performance testing that runs from the interface level up to the business level, using single-machine test tools, so that interface performance problems are reported during the release process. Making sure an interface goes online without code-level errors will eventually remove the need for integration testing in a test environment, and teams can go straight to online testing. At the single-interface stage we support stress testing for the mainstream frameworks. We also continue to support stress testing of test environment clusters, but we still hope users can skip that step and go directly to traffic-isolated stress testing online.

The diagram above shows the capabilities we think a complete business continuity platform needs:

1. A console for initiating stress test traffic. This traffic initiation end manages all stress test traffic and scenario design;

2. A traffic isolation console, which we hope provides unified traffic switching and unified routing, so that stress test traffic can be cut off when a problem occurs;

3. Monitoring during the stress test, covering traffic monitoring and system monitoring, plus a performance monitoring platform for the whole application that includes link monitoring, JVM monitoring, component monitoring and so on;

4. Chaos engineering proper, including platforms for flow control rules, isolation rules and degradation rules, where the corresponding rules are maintained.

In the end, we hope this platform achieves the following goals. Full-link stress tests can be run anytime, anywhere, at low cost. Periodic fault drills can be run for the operations platform, and this capability can be handed to the operations team so that they can initiate changes anytime and anywhere. Online activities, including big promotions, get a safety net that prevents sudden breakdowns during an event. Because stress testing that has long been routine in production tells us the capacity limits and load levels, contingency plans rehearsed in drills give us better means of avoiding and containing emergencies. Take Alibaba as an example: drills can now basically be run monthly, because Taobao has activities every month, with three big ones each year: 6.18, Double 11 and Double 12. At present we can run small-scale drills for promotions such as Double 11, Double 12 or 6.18 on a weekly basis. Moreover, we can clearly organize stress testing activities within or across business units, and clearly define the capacity expansion plan.

Customer cases

The following are implementation cases we delivered for third parties.

Case 1

In the “Cross” case, we broke down their application systems during onboarding. We first confirmed about four stress test scenarios, and then, through traffic dyeing and traffic tracing, covered about 23 traffic flows. All the shadow tables were created online; once they were built, we verified them with a small amount of dyed traffic. We then connected the shadow database and shadow tables to the production environment, and the production stress tests caused no impact. Across the 23 scenarios we tested, no problems occurred on last year's Double 11, including warehouse overflow or overselling.

The year before, they had more than 50 people spend about four months on this work. They maintained a separate environment, which still differed from production, and order backlogs still occurred. With full-link stress testing, it took us nearly a month to stress test their five core backbone links. They are now basically able, at any time and on a self-service basis, to go online, request traffic and dye traffic. The test cycle is measured in days: a relatively small release can basically complete its full online performance regression in one to two days. For high-traffic promotions such as Double 11 and Double 12, the performance regression of the whole main link can basically be completed in one week, with a full evaluation of the current production environment's capacity, including capacity expansion and production environment changes.

Case 2

For a customer in the beauty industry, all the systems had basically been developed by third parties without any performance evaluation, so the customer knew almost nothing about their performance. The most critical problem was that the applications had become complicated because the third-party vendor had been replaced several times. In one incident, taking a single function offline caused the whole system to collapse. By our evaluation, the hardware cost of each transaction was about 0.18 yuan. I ran stress tests at Taobao in 2012; by comparison, their figure was about 9 to 10 times Taobao's in 2014. The key problem was that they still had many unknown risks. For example, they launched a new application and wanted to promote it, but something went wrong, the flash-sale (seckill) system crashed, and the campaign basically failed.

We spent about a month helping them build the online environment and sorted out 22 core links, 22 systems and about 600 servers for them. Building the first production link took the longest, about half a month. The customer carried out the follow-up work themselves. In total it took 55 days for all 22 links to establish the online capacity of the whole system in operation. Throughout the process we did not pollute any data on the production links, and the logs were fully isolated. We worked in a spirit of co-construction, helping the customer establish a routine online stress test regression mechanism.

In terms of short-term benefit, we adjusted the number of servers per application, moving some servers from low-revenue links to high-revenue links, and in the end reduced their overall resource consumption rate to about 20%. After the full-link stress test we established a baseline for them, and they now base each performance iteration on it.

They have now fully mastered the process of stress testing in production and can basically follow their own plan with every release. Their goal this year is to reduce overall server resources by at least 50 percent, and they are working on it.

This article is original content from Aliyun and may not be reproduced without permission.