Business background

As the mobile development industry enters a stage of competing for existing users rather than rapid growth, the load capacity of the overall app architecture and the optimization of each link in it have become a focus for developers.

Stress testing is the main way to address these concerns. A stress test can be used to:

  • Test the load bottlenecks of back-end services;

  • Assess overall architecture performance;

  • Find the peak load at which the business remains stable;

  • Identify the weak points of each node;

  • Optimize system resources;

  • Avoid the short-board (weakest-link) effect;

A stress test also gives the operations team an accurate figure for user carrying capacity, which helps avoid a poor user experience when a campaign or a new application launch brings sudden traffic.

Today we introduce the implementation principle and implementation path of the full-link stress testing scheme.

Full-link stress testing and its principle

Usually, we can estimate total capacity with the simple formula load capacity = single-machine performance × number of machines. In real scenarios, however, a request passes through many nodes, such as DNS, the gateway, and the database. Any of them can become the bottleneck for the whole business, so the actual service capacity may deviate considerably from the estimate.
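The gap between the simple formula and the real link can be sketched in a few lines of Python. This is only an illustration: the node names and all capacity figures are hypothetical, and the model assumes that every request passes through every node, so the slowest node caps the whole link.

```python
from typing import Dict, Tuple


def naive_capacity(single_machine_tps: float, machines: int) -> float:
    """Simple estimate: single-machine performance x number of machines."""
    return single_machine_tps * machines


def link_capacity(node_capacities: Dict[str, float]) -> Tuple[str, float]:
    """Return the bottleneck node and its capacity (the slowest node caps the link)."""
    name = min(node_capacities, key=node_capacities.get)
    return name, node_capacities[name]


# Hypothetical figures, for illustration only.
nodes = {
    "dns": 20000.0,
    "gateway": 3000.0,
    "app_servers": naive_capacity(single_machine_tps=800.0, machines=10),
    "database": 5000.0,
}

bottleneck, capacity = link_capacity(nodes)
print(bottleneck, capacity)  # gateway 3000.0
```

In this made-up example the application servers alone would suggest 8000 TPS, but the gateway limits the link to 3000 TPS, which is the kind of error the formula hides.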

Users generally use LoadRunner or similar tools to stress test server performance in the production environment. For mPaaS applications, however, these tools run into difficulties such as complex deployment, requests that cannot pass the MGS gateway, and high cost.

To solve these pain points, and in response to requirements from several customers, the mPaaS team provides the MGS full-link stress testing scheme.

The biggest difference from earlier test plans is the point of view. The full-link scheme starts from the client and treats the entire server-side link as a black box. It uses real requests and responses as the basis for evaluation and simulates real business requests, real data flows, and real user habits, so that the assessment is as realistic as possible.

Link analysis

In a standard data link, the following model is generally used:

In the full-link stress test, we treat the whole server-side implementation as a black box, so we need to focus on the first half of the link, which can be summarized as follows:

1. The client request is constructed;

2. The client request is sent and passes through the MGS gateway;

3. The client parses the response returned by the MGS gateway and processes it correctly;

4. Client requests are issued from a high-concurrency cluster.

These steps lead to the following difficulties:

Difficulty 1: Constructing the client request

mPaaS mobile gateway RPC communication is a standardized interface scheme built on top of HTTP. It reuses the HTTP request standard and defines its own data exchange format, divided into Header and Body. Roughly speaking, the Operation-Type in the Header identifies the actual API endpoint, and the Body is encapsulated according to the gateway's rules and forwarded.

For this step we use JMeter, whose flexible scripting makes it well suited to simulating real client requests.
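To make the Header/Body split concrete, here is a minimal Python sketch of how such a request could be assembled before it is sent. Only the Operation-Type header and the use of POST come from this article; the gateway URL, the operation name, the Content-Type, and the plain JSON body layout are hypothetical placeholders, and the real encapsulation rules of the gateway are not reproduced here.

```python
import json
from typing import Dict, Optional


def build_rpc_request(
    gateway_url: str,
    operation_type: str,
    params: dict,
    extra_headers: Optional[Dict[str, str]] = None,
) -> dict:
    """Assemble a gateway-style RPC request as a plain dictionary."""
    headers = {
        "Content-Type": "application/json",  # assumption, not confirmed by the article
        "Operation-Type": operation_type,    # identifies the real API endpoint
    }
    # Business headers (for example a Cookie) are added as-is.
    headers.update(extra_headers or {})
    return {
        "method": "POST",          # every request to the gateway is a POST
        "url": gateway_url,        # the gateway, not the business server
        "headers": headers,
        "body": json.dumps(params),  # hypothetical body layout
    }


request = build_rpc_request(
    "https://mgs.example.com/gateway",  # hypothetical URL
    "com.example.order.query",          # hypothetical operation name
    {"orderId": "42"},
    extra_headers={"Cookie": "session=abc"},
)
print(request["headers"]["Operation-Type"])  # com.example.order.query
```

In a JMeter script the same pieces map onto the HTTP Header Manager, the sampler URL, and the request body.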

Difficulty 2: Data encryption and decryption

The data encryption specific to mPaaS mobile gateway RPC requests is the more complex part of constructing a request. Existing client-side test solutions cannot cover this capability, so customers often disable signature verification and encryption on the gateway server in order to run a stress test.

The trouble with this approach is that the computational load that encryption and decryption place on the gateway server cannot be estimated.

In our experience, different encryption and decryption algorithm configurations affect gateway throughput by 20% to 40%. To address this, the Financial Line SRE team customized a JMeter plug-in, MGSJMeterExt, based on the user's production environment. The plug-in reproduces the encryption and decryption of the request body, so that the stress test script can include the encryption step.

Difficulty 3: Request signature construction

The mPaaS mobile gateway has its own signature verification mechanism for RPC requests. As with data encryption and decryption, no existing client-side solution covers this capability, so customers often disable signature verification on the interface for testing. With MGSJMeterExt, the message can be signed correctly in JMeter and pass server-side verification.
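The general idea of signing a request and verifying it on the server can be sketched as follows. This is not the MGS signature algorithm, which this article does not describe. It is a generic HMAC-SHA256 example in Python, and the signed fields, their order, and the shared secret are all hypothetical.

```python
import hashlib
import hmac


def sign_request(secret: bytes, operation_type: str, timestamp: str, body: str) -> str:
    """Sign a canonical string built from the request fields (hypothetical layout)."""
    canonical = "&".join(
        [f"Operation-Type={operation_type}", f"Ts={timestamp}", f"Body={body}"]
    )
    return hmac.new(secret, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_signature(
    secret: bytes, operation_type: str, timestamp: str, body: str, signature: str
) -> bool:
    """Server-side check: recompute the signature and compare in constant time."""
    expected = sign_request(secret, operation_type, timestamp, body)
    return hmac.compare_digest(expected, signature)


secret = b"demo-secret"  # hypothetical shared secret
signature = sign_request(secret, "com.example.order.query", "1700000000", '{"orderId": "42"}')

# The untouched request verifies; a tampered body does not.
print(verify_signature(secret, "com.example.order.query", "1700000000", '{"orderId": "42"}', signature))
print(verify_signature(secret, "com.example.order.query", "1700000000", '{"orderId": "43"}', signature))
```

The point for stress testing is that every request costs a hash computation on both sides, so disabling the check removes real work from the gateway.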

Difficulty 4: Setting up the cluster environment

A stress test needs to be realistic: real traffic entry points and real concurrency are what produce real results. Building a stress test environment yourself means high cluster deployment costs, which is an unnecessary expense.

We therefore recommend Alibaba Cloud PTS as the stress test platform. Compared with other solutions, it is easy to deploy, supports JMeter scripts, generates real traffic, and provides more detailed stress test reports.

Overview

The above model can be summarized as the following structure:

Full-link scheme and implementation

Part 1 Preliminary preparation and research

The goal of the early stage is to prepare for the actual stress test, provide supporting data, and establish the test objectives and overall direction.

1.1 Objectives and data preparation

1. Customers need to define their own stress test objectives. Based on these objectives and on historical operational data, they should list the specific business categories involved, the likely user behavior patterns, and the proportion of the overall business that each pattern accounts for.
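Once the proportions are known, they translate directly into how many concurrent users each scenario receives. The Python sketch below shows one way to do the split. The scenario names and percentages are hypothetical, and handing leftover users to the largest scenarios is just one reasonable rounding choice.

```python
from typing import Dict


def allocate_concurrency(total: int, mix_percent: Dict[str, int]) -> Dict[str, int]:
    """Split a total concurrency across scenarios according to integer percentages."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("scenario percentages must sum to 100")
    # Floor division first, so the allocation never exceeds the total.
    allocation = {name: total * pct // 100 for name, pct in mix_percent.items()}
    # Hand any users lost to rounding to the largest scenarios.
    leftover = total - sum(allocation.values())
    for name in sorted(mix_percent, key=mix_percent.get, reverse=True)[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical business mix derived from historical operational data.
mix = {"login": 10, "home_refresh": 60, "order_query": 30}
print(allocate_concurrency(1000, mix))
# {'login': 100, 'home_refresh': 600, 'order_query': 300}
```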

1.2 Preparing the client

1. Based on the business objectives, the client team needs to sort out the interfaces and data flows that the client implementation may involve, for example whether there are prerequisite steps such as login or mandatory steps such as refreshing the home page. Collect the real requests and responses for these steps through packet capture, and determine the values that count as meeting expectations.

2. Depending on the service structure, this step can also be carried out from the server-side interfaces.

1.3 Preparing for the Server

1. On the server side, set up data baffles (mocks or test-data isolation) for the interfaces identified in 1.2, so that test data does not contaminate the real database.

2. In the mPaaS full-link stress test, the server is treated as a black box. The performance metrics of each service on the server therefore need to be monitored, to serve as a basis for later server tuning.

1.4 Preparing the MGSJMeterExt plug-in

Since MGSJMeterExt needs to be customized based on the actual gateway environment, users are required to provide the following data:

1. Workspace related environment data

2. Encryption algorithm and public key

Q&A

Q: How is the stress test script implemented?

A: Our expert team writes the stress test scripts for simple scenarios and provides training on them. Actual scenarios may involve multiple business steps, such as logging in to obtain a token or other clearly defined prerequisite steps. For such complex business scenarios, the customer needs to complete the scripts with the help of the Alibaba expert team.

Q: Why full-link?

A: Although our stress test script is based on client logic, it actually simulates real data requests and checks whether the server's responses meet expectations, which involves the entire data link and every node on it.

Q: How are link metrics instrumented?

A: The stress test treats the system as a black box. From the user's point of view, the load capacity of the whole architecture is confirmed through the metrics that PTS provides, such as request and response rates and success rate, and by checking whether responses match the expected results. Back-end metrics vary widely because each customer's server architecture is different. The corresponding service providers can generally supply monitoring solutions for them, so mPaaS does not need to handle them.

Q: Why use PTS?

A: What the mPaaS team actually provides is the MGS communication solution, and it assists customers in writing the scripts. PTS is not mandatory: it only supplies the JMeter cluster deployment environment, and PTS resources must be purchased by users themselves. However, based on its evaluation of multiple cases, the mPaaS team finds that PTS is more cost-effective and provides a more consistent stress test environment and a complete stress test report. We therefore recommend PTS for stress testing.

Q: Are there specific standards, for example what performance a 2C4G or 4C8G machine should achieve?

A: The purpose of a stress test is to determine what performance can be achieved with given system resources. Because server architectures and the nodes involved in each business process differ, performance varies enormously between environments. That is exactly why a stress test is needed: to establish the real figures and to measure the real resource and time cost of each node.

Part 2 JMeter development and script modification

Having identified the key points of the MGS communication solution, we now need to implement them in JMeter.

2.1 Header modification

In the Header, we need to note the following:

1. The MGS gateway protocol relies on Header fields. Therefore, ensure that the gateway parameters are complete.

2. Some parameters have fixed values and can be hard-coded. For details, see the configuration file downloaded from the console.

3. If the service depends on other Header information, such as Cookie, add it directly. The MGS gateway does not filter Header information.

2.2 URL modification

  

In the URL, we need to note the following:

1. The URL should point to the MGS gateway rather than the actual business server. See the configuration file downloaded from the console for the relevant settings.

2. At present, all requests to the MGS gateway are POST. If the business interface is a GET request, MGS converts the request to GET when forwarding it, but communication with MGS itself still uses POST.

3. If there is no special requirement for the Body, it is recommended to fill it in as shown in the figure.

2.3 Request modification

In Request, we need to note the following:

1. The encryption/verification here depends on the MGSJMeterExt file, which needs to be referenced.

2. Generally, you only need to modify the //config part.

3. The following part is generally a unified solution, mainly for encryption and verification, and no modification is required.

2.4 Response modification

In Response, we should pay attention to the following points:

1. Response processing consumes resources on the load generator but does not affect the assessment of server capacity. To preserve load generator performance, leave this part empty unless the response data needs to be reused or the result needs to be checked.

2. If you do have such requirements, the secondary processing of the Response can be done here.

Part 3 Actual stress testing

The general steps can be summarized as follows:

3.1 PTS and script performance tuning

Alibaba Cloud Performance Testing Service (PTS) provides convenient, fast cloud-based stress testing. In this engagement, PTS was used to generate the stress traffic from the Internet.

Interestingly, encryption and decryption put computational pressure not only on the gateway but also on the load generator. Therefore, before using the first version of the plug-in and stress test script for a formal test, we first ran a “stress test” on the load generator itself.

First round: baseline test

PTS load generator configuration:

1. A single PTS IP unit

2. Concurrency of 500 (the maximum for a single machine)

3. Fixed pressure model

4. Test duration of two minutes

According to the test report, TPS was not high, yet RT was not high either:

The load generator's own metrics showed that its CPU utilization stayed relatively high, so there was reason to suspect that the encryption computation was limiting the load it could generate.
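Why low TPS combined with low RT points at the load generator can be checked with a back-of-the-envelope calculation. With a fixed number of concurrent threads and no think time, each thread completes roughly one request per cycle, so TPS is approximately concurrency divided by cycle time (Little's law). If the measured TPS is far below concurrency / RT, each thread is spending extra time outside the measured response time, for example encrypting the next request. The Python sketch below uses hypothetical figures, not the values from our report.

```python
def expected_tps(concurrency: int, avg_rt_seconds: float) -> float:
    """Upper bound on TPS if threads did nothing but wait for responses."""
    return concurrency / avg_rt_seconds


def generator_overhead_seconds(
    concurrency: int, measured_tps: float, avg_rt_seconds: float
) -> float:
    """Average time per request that a thread spends outside the measured RT."""
    cycle_time = concurrency / measured_tps
    return cycle_time - avg_rt_seconds


# Hypothetical figures: 500 threads, 50 ms average RT, 2000 TPS measured.
concurrency, rt, tps = 500, 0.05, 2000.0
print(round(expected_tps(concurrency, rt)))                       # ideal TPS
print(round(generator_overhead_seconds(concurrency, tps, rt), 3))  # seconds lost per request
```

In this made-up case the ideal figure is about 10000 TPS, while the measured 2000 TPS implies roughly 0.2 s per request spent on the load generator itself, four times the response time.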

Caching the encryption results for repeated content greatly reduced the computational pressure. To avoid memory problems from the cache, its size was capped.
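The idea can be illustrated with a size-capped cache in Python. This is a sketch, not the plug-in's code: expensive_encrypt is a stand-in that hashes its input instead of performing the real encryption, and the cap of 1024 entries is an arbitrary example value. Caching like this is only valid when identical input always produces identical ciphertext.

```python
import hashlib
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the expensive function really runs


def expensive_encrypt(plaintext: str) -> str:
    """Stand-in for the real encryption; a hash is used only as deterministic CPU work."""
    CALLS["count"] += 1
    return hashlib.sha256(plaintext.encode("utf-8")).hexdigest()


@lru_cache(maxsize=1024)  # upper limit keeps memory bounded; oldest entries are evicted
def cached_encrypt(plaintext: str) -> str:
    return expensive_encrypt(plaintext)


# A stress test that replays the same request body many times.
for _ in range(1000):
    cached_encrypt('{"orderId": "42"}')

print(CALLS["count"])  # 1
```

With repeated request bodies, the load generator pays the encryption cost once per distinct body instead of once per request.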

Second round of testing

The configuration was exactly the same as in the first round; only the encryption plug-in was replaced with the optimized version. According to the test report, scenario TPS improved by 75%:

The load generator's CPU usage also improved noticeably.

Third round of testing

After the baseline in the first round and the optimization in the second, the third round used two load generators at full load. The results were as follows:

The stress test script and orchestration met expectations, so the formal PTS cloud stress test could proceed in the customer’s production environment.

3.2 Stress testing in the production environment

At the start of the formal stress test, several rounds of small-scale tests were run to check whether the back-end system behaved as expected. The following problems were found:

Problem 1: Nginx traffic is not forwarded correctly

The MGS container logs showed that some containers never received any requests. Investigation found three causes:

1) One MGS container IP was missing from the Nginx forwarding configuration in the DMZ;

2) The network policy from the DMZ to each MGS container IP had not been granted access;

3) The Nginx forwarding rule was set to ip_hash, so in a test with a single source IP, traffic was forwarded to only one container.

After the IP list was corrected, network access was granted, and the forwarding rule was changed, the problem was resolved.
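The third cause is easy to reproduce in a small simulation. Nginx's ip_hash method keys on the first three octets of the client IPv4 address, so every request from one source address lands on the same back end. The Python sketch below mimics that behavior with CRC32 as a stand-in hash. It is not Nginx's actual hash function, and the container names and IP addresses are hypothetical.

```python
import zlib
from typing import List


def pick_backend(client_ip: str, backends: List[str]) -> str:
    """Choose a back end from the first three octets of the client IPv4 address."""
    key = ".".join(client_ip.split(".")[:3])
    return backends[zlib.crc32(key.encode("ascii")) % len(backends)]


backends = ["mgs-1", "mgs-2", "mgs-3"]  # hypothetical container names

# A load generator with a single source IP sends 1000 requests.
single_source = {pick_backend("203.0.113.10", backends) for _ in range(1000)}
print(len(single_source))  # 1
```

However many requests are sent, only one container ever receives traffic, which matches what the MGS logs showed.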

Problem 2: The baseline CPU load of one MGS container is too high

Preliminary testing found that one MGS container (MPAASGW-7) had a CPU load of 25% when idle, which was not expected.

After logging in to the container, we found a jps process consuming a large amount of CPU. We suspect it was invoked during early commissioning and never exited normally. Killing the jps process resolved the problem, and the container was then restarted to avoid other issues.

Note: jps (Java Virtual Machine Process Status Tool) is a command that lists the PIDs of all current Java processes. See: docs.oracle.com/javase/7/do…

Problem 3: The CoreWatch monitoring platform cannot be accessed

The CoreWatch console was inaccessible and the browser reported a 502 error. After the CoreWatch container was restarted, the page could be opened but stayed in the loading state.

The request to http://corewatch.***.com/xflush/env.js remained pending. The cause was an incorrect listener configuration on the ALB instance; once it was corrected, the problem was resolved.

3.3 Production environment stress test and summary

After all the problems in 3.2 were solved, the system was ready for the stress test. The formal stress test was run separately for the “encrypted” scenario and the “non-encrypted” scenario.

As production data cannot be disclosed, the following only illustrates the problems encountered.

Test in the “encrypted” scenario

1. TPS stops increasing when concurrency is around 500, which suggests that a bottleneck has been reached.

2. On the MGS gateway containers, the overall CPU load has reached its limit.

3. Over the same period, the CPU load of the MCUBE container is healthy, as are other metrics such as I/O and network.

4. From the above, the main bottleneck in the encrypted scenario is the MGS gateway. Based on experience and process analysis, the pressure comes mainly from the intensive computation involved in encrypting and decrypting packets. Removing this bottleneck requires scaling out the MGS containers.
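The point at which TPS stops increasing can also be located programmatically from a stepped test. The Python sketch below scans (concurrency, TPS) samples and reports the first step after which TPS grows by less than a threshold. The sample values and the 5% threshold are hypothetical, since the production figures cannot be shown.

```python
from typing import List, Optional, Tuple


def find_plateau(
    samples: List[Tuple[int, float]], min_gain: float = 0.05
) -> Optional[int]:
    """Return the concurrency after which TPS grows by less than min_gain (relative).

    samples must be sorted by concurrency; returns None if TPS keeps growing.
    """
    for (c_prev, tps_prev), (_, tps_next) in zip(samples, samples[1:]):
        if tps_next < tps_prev * (1 + min_gain):
            return c_prev
    return None


# Hypothetical stepped-load results: (concurrency, TPS).
samples = [(100, 400.0), (300, 1150.0), (500, 1800.0), (700, 1820.0), (900, 1810.0)]
print(find_plateau(samples))  # 500
```

Once the plateau is located, the resource metrics at that concurrency (here, gateway CPU) indicate which node is the limiting one.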

Test in the “non-encrypted” scenario

1. TPS stops growing when concurrency reaches around 1000. In general, this indicates that a system capacity bottleneck has been reached.

2. On the MGS gateway containers, the overall CPU load is not high, unlike in the encrypted scenario.

3. According to feedback from the network team, the number of TCP sessions from the Internet to the DMZ is three to four times the number from the DMZ to the Intranet, and the CPU pressure on the Intranet firewall is high.

4. Taken together, these three observations suggest that a network bottleneck has been reached. On site, we found that the Nginx in the DMZ did not use keep-alive connections when forwarding to the Intranet. The fix was to modify the Nginx configuration to add keepalive 1000 and then run the test again.

keepalive: By default, Nginx connects to the back end over HTTP/1.0 with short (non-keepalive) connections. For each new request it opens a new connection to the back end, and the back end closes the connection when the request completes. The keepalive parameter sets the number of idle long-lived connections that Nginx caches to the back-end servers. New requests can then reuse existing TCP connections, which reduces the cost of establishing TCP connections. See also: nginx.org/en/docs/htt…

Conclusion

After optimization, performance improved by at least 70% in the non-encrypted scenario and by 10% in the encrypted scenario, and it can be improved much further by scaling out MGS. The tuning result was far better than expected.

Author: Ali Cloud mPaaS TAM team (Wang Zekang, Bei Mo, Dong Lei, Rong Yang)
