By Hou Long

In good software, function and performance both matter. Credit goes not only to product managers' bright ideas and developers' skilled hands, but also to testers' sharp eyes. When we think of testing, we may picture clicking through pages, verifying interfaces, business integration testing, and so on. But there is another crucial link: performance testing.

So, what is performance testing? How do we measure system performance? How is system response time broken down? How do you tune performance? With these questions in mind, let's talk briefly about performance tuning today.

1. What is performance testing

Performance testing is the process of applying load to a system under test in a specific way, according to a defined test strategy, in order to obtain performance indicators such as response time, throughput, and resource utilization, and to verify whether the system can meet user demand once it goes live. It covers the test purpose, test tools, test environment, test plan, test execution, and result analysis.

2. Four indicators that measure system performance

The performance of a system is measured by the following four indicators:

Response time

The time it takes the application to complete an operation, from the moment a request is sent to the moment the response is received. Response time is the most important performance indicator of a system; it directly reflects how fast the system is.

Throughput

The number of requests the system processes per unit of time, reflecting its overall processing capacity. TPS (transactions per second) is the most common throughput metric, followed by HPS (hits per second) and QPS (queries per second).

Resource utilization

The usage of system resources such as CPU, memory, disk, and network on the application servers, database servers, and middleware servers of the system under test.

Concurrency

The number of users submitting requests simultaneously. The relationship between these four indicators is shown in Figure 1.

Figure 1

Throughput = Concurrency / Average response time.

From Figure 1, we can see:

  • When system load is low, response time is nearly constant, and throughput and resource usage grow linearly as the number of concurrent requests increases.

  • When system load is high, response time rises gradually as concurrency increases, system resources reach their limit, and throughput stops growing.

  • As concurrency keeps increasing, response time climbs rapidly, system resources stay at their limit, and throughput drops sharply.

In general, we want systems that support higher concurrency and throughput. But as the analysis above shows, raising concurrency does not always raise throughput: once resource utilization hits its limit, response time becomes the dominant factor in throughput. So where does the time go?
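The throughput formula above can be checked with a few lines of code. This is a minimal sketch; the class and method names are made up for illustration:

```java
// Minimal sketch of the Figure 1 relationship:
// throughput (requests/s) = concurrency / average response time (s).
public class ThroughputDemo {
    static double throughput(int concurrency, double avgResponseTimeSeconds) {
        return concurrency / avgResponseTimeSeconds;
    }

    public static void main(String[] args) {
        // 100 concurrent users, 2-second average response time -> 50 TPS
        System.out.println(throughput(100, 2.0));
    }
}
```

Note how the formula also explains the third bullet: once resources are saturated, average response time grows with concurrency, so the ratio (throughput) falls.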

3. Where did the time go

Figure 2

From the moment a request is sent to the moment its response is received, as shown in Figure 2, the general process is as follows:

  1. The client sends a request. The request packet travels over the network and reaches the server.
  2. The server processes the request. After receiving the packet, the server runs its business logic and performs any necessary data reads and writes.
  3. The server returns a response. After processing, the server sends the response packet back to the client.

We usually define response time as the total time consumed by steps 1, 2, and 3. Step 1 mainly consists of client request time and network time; step 2 of business logic, data reads and writes, and network time; step 3 of client rendering and network time.

Each of the three steps can harbor performance problems that lengthen response time: in step 1, a low-spec client host responds slowly; in step 2, a service thread blocks or a database query runs slowly; in step 3, network transmission is delayed. Depending on the cause, we can classify problems as hardware problems, network problems, code problems, middleware problems, and so on. Each class has its own tuning methods, so let's briefly talk about performance tuning.
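Before tuning, you need to know where in steps 1-3 the time is actually spent. A minimal way to instrument one segment (here, the server-side processing of step 2) is to wrap it with a timer; the helper below is a sketch, not part of the article's system:

```java
// Hypothetical sketch: measure the elapsed time of one segment of a request,
// e.g. the server-side business logic of step 2.
public class TimingDemo {
    // Runs the given work and returns its elapsed time in milliseconds.
    static long timeMillis(Runnable work) {
        long start = System.nanoTime();
        work.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long elapsed = timeMillis(() -> {
            // Simulate ~50 ms of business logic / data access.
            try { Thread.sleep(50); } catch (InterruptedException e) { }
        });
        System.out.println("server processing took ~" + elapsed + " ms");
    }
}
```

In real systems this role is usually played by call-chain tracing (as in Case 3 below) rather than hand-written timers, but the principle is the same: attribute the total response time to individual segments.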

4. Thief of Time – Performance tuning

Common tuning methods are:

  • Trade space for time. For example, a data cache reads data from disk into memory in advance, so the CPU fetches requested data directly from memory at much higher speed.

  • Trade time for space. For example, when uploading a large attachment, process the data in batches, completing the task within a small memory footprint.

  • Divide and conquer. Split a task into parts and execute them separately, which also makes parallel execution easier and improves efficiency.

  • Asynchronous processing. For example, the MQ message queues ubiquitous in Internet applications decouple time-consuming services on the call chain and reduce congestion.

  • Parallelism. Multiple processes or threads handle the work at the same time, shortening processing time.

  • Move closer to the user. For example, CDNs place the static resources users request physically closer to them.

  • Make everything scalable. Modularize the business and make services stateless, so the system has good horizontal scaling ability.
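As a concrete illustration of the first method, "space for time" can be sketched as a small in-memory cache. The names here (`loadFromDisk`, `CacheDemo`) are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// "Space for time" sketch: spend memory on a cache so repeated requests
// skip the slow read entirely.
public class CacheDemo {
    private final Map<String, String> cache = new HashMap<>();
    int diskReads = 0; // counts how often the slow path actually runs

    String get(String key) {
        String value = cache.get(key);
        if (value == null) {        // cache miss: pay the disk cost once
            value = loadFromDisk(key);
            cache.put(key, value);  // spend memory to save future time
        }
        return value;
    }

    private String loadFromDisk(String key) {
        diskReads++;
        return "value-for-" + key;  // imagine a slow disk read here
    }

    public static void main(String[] args) {
        CacheDemo c = new CacheDemo();
        c.get("order-1");           // slow path, fills the cache
        c.get("order-1");           // fast path, served from memory
        System.out.println("disk reads: " + c.diskReads); // prints 1
    }
}
```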

Let’s take a few cases to illustrate.

Case 1

Problem description: When an interface is load-tested, its response time grows longer and longer.

Problem analysis:

  1. The number of FailoverEvent threads in the thread stack keeps increasing until memory is exhausted.

  2. The program does not check whether the FailoverEvent thread queue already exists, so the queue is created over and over again.

Solution: Check whether the FailoverEvent thread queue exists before creating it. If it does not exist, create it; if it does, reuse the existing object.

Optimization result: The out-of-memory problem is resolved, and response time returns to normal.
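The check-before-create fix can be sketched with `ConcurrentHashMap.computeIfAbsent`, which creates the queue atomically only on first access. The registry and names below are illustrative; the article does not show the FailoverEvent internals:

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the Case 1 fix: reuse an existing queue instead of creating a
// new one on every request, so queue objects cannot grow without bound.
public class QueueRegistry {
    private static final Map<String, BlockingQueue<Runnable>> queues =
            new ConcurrentHashMap<>();

    static BlockingQueue<Runnable> getQueue(String name) {
        // Atomically creates the queue only if it does not exist yet;
        // subsequent calls return the same object.
        return queues.computeIfAbsent(name, n -> new LinkedBlockingQueue<>());
    }

    public static void main(String[] args) {
        System.out.println(getQueue("failover") == getQueue("failover")); // true
    }
}
```

A plain `if (!map.containsKey(...)) map.put(...)` check is not enough under concurrency; two threads can both miss and both create, which is why the atomic `computeIfAbsent` is used here.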

Tuning suggestions:

  1. Release references to unneeded objects early;
  2. Use StringBuffer (or StringBuilder) instead of string concatenation;
  3. Use static variables sparingly;
  4. Avoid creating objects in bursts, especially large objects;
  5. Use object-pool techniques where possible to improve performance;
  6. Do not create objects in frequently called methods, especially inside loops.

Case 2

Problem description: With no backlog, a batch processing interface takes 433 seconds to process 10,000 orders covering 4,500 SKU types.

Problem analysis: The interface invokes the downstream service single-threaded. The number of queries equals the number of SKU types divided by 11, so 4,500 SKU types require about 410 calls, each taking about 519 ms.

Solution: Call the downstream service in a multithreaded manner instead.

Optimization result: TP99 decreased from 212 seconds to 33 seconds, and TPS increased from 87 to 127 transactions per second.

Tuning suggestion: This case uses multithreading to reduce response time, but multithreading is not necessarily faster than a single thread, because it is the CPU that does the work, not the threads. A rule of thumb: if the workload involves disk or network I/O, use multiple threads; if it is pure CPU work, a single thread may do just as well. And when you do use multiple threads, always use a thread pool.
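The fix can be sketched by fanning the downstream calls out over a fixed thread pool. `queryDownstream` below is a placeholder for the real remote call, which the article does not show:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the Case 2 fix: run the ~410 downstream queries concurrently
// on a thread pool instead of one after another.
public class ParallelCalls {
    static String queryDownstream(int batchNo) {
        return "result-" + batchNo; // imagine a ~500 ms remote call here
    }

    static List<String> queryAll(int batches, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 0; i < batches; i++) {
                final int batchNo = i;
                futures.add(pool.submit(() -> queryDownstream(batchNo)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                try {
                    results.add(f.get()); // wait for each batch to finish
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            }
            return results;
        } finally {
            pool.shutdown(); // always shut the pool down when done
        }
    }

    public static void main(String[] args) {
        System.out.println(queryAll(4, 2));
    }
}
```

Because each call is network-bound, threads spend most of their time waiting, which is exactly the situation where multithreading pays off per the rule of thumb above.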

Case 3

Problem description: A data query interface has TP99 = 727 ms. Increasing concurrency does not improve throughput, and application-server CPU usage stays below 40%.

Problem analysis: Call-chain analysis showed that the selectList method was called 11 times per request, causing the total interface time to surge.

Solution: Remove the redundant calls so that selectList is invoked once per request.

Optimization result: TP99 decreased from 727 ms to 19 ms, a 38-fold improvement; TPS increased from 17.5 to 163.4 transactions per second, a 9-fold improvement.

Tuning suggestions:

  1. Design before you code;
  2. Basic rule: keep database operations out of loops;
  3. For queries, replace the for loop with an IN query (space for time);
  4. For inserting new data, use batch insert.
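Suggestion 3 can be sketched as building one IN query instead of querying per row. The table and column names below are invented for illustration, and real code should bind the values through `PreparedStatement` placeholders as shown:

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of "replace the for loop with an IN query": one round trip to the
// database instead of one per SKU. Table/column names are hypothetical.
public class BatchQuery {
    static String buildInQuery(List<Long> skuIds) {
        // One "?" placeholder per id, to be bound via PreparedStatement.
        String placeholders = skuIds.stream()
                .map(id -> "?")
                .collect(Collectors.joining(", "));
        return "SELECT sku_id, stock FROM sku_stock WHERE sku_id IN ("
                + placeholders + ")";
    }

    public static void main(String[] args) {
        // One query instead of three separate SELECTs in a loop.
        System.out.println(buildInQuery(List.of(101L, 102L, 103L)));
    }
}
```

The "space for time" trade-off here is that the database materializes a larger result set at once, in exchange for eliminating per-row network round trips.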

Case 4

Problem description: A deadlock occurs when an interface commits database operations that update data.

Table 1 shows the transactions that produce the deadlock:

Solution: Split transaction 1: query first, then delete in batches based on the query results.

Optimization result: The deadlock problem is resolved.

Tuning suggestions:

  1. Avoid large transactions;
  2. Access data objects in the same order;
  3. Avoid transactions that involve waiting for user interaction;
  4. Use a lower isolation level where appropriate, such as READ COMMITTED (RC);
  5. Add reasonable indexes to tables; if a query does not use an index, every row in the table is locked and the probability of deadlock rises sharply;
  6. Avoid running multiple scripts that read and write the same table at the same time, and pay special attention to statements that lock many rows;
  7. Set the lock wait timeout parameter, innodb_lock_wait_timeout.
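Suggestion 2, acquiring resources in a consistent order, is the classic deadlock-avoidance technique and can be sketched in a few lines. The `Account` class and global ordering by id are invented for illustration:

```java
// Sketch of "access data objects in the same order": with a global lock
// order (here, ascending id), two concurrent transfers in opposite
// directions can never each hold the lock the other needs.
public class LockOrdering {
    static class Account {
        final long id;
        long balance;
        Account(long id, long balance) { this.id = id; this.balance = balance; }
    }

    static void transfer(Account from, Account to, long amount) {
        // Always lock the lower id first, regardless of transfer direction.
        Account first = from.id < to.id ? from : to;
        Account second = (first == from) ? to : from;
        synchronized (first) {
            synchronized (second) {
                from.balance -= amount;
                to.balance += amount;
            }
        }
    }

    public static void main(String[] args) {
        Account a = new Account(1, 100), b = new Account(2, 100);
        transfer(a, b, 30); // locks a then b
        transfer(b, a, 10); // still locks a then b -> no lock cycle possible
        System.out.println(a.balance + " " + b.balance); // 80 120
    }
}
```

The same idea applies at the SQL level: if every transaction touches rows of a table in the same key order, circular waits between row locks cannot form.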

5. Summary

Response time is usually just the symptom; the root cause lies in how various resources are used. "Resources" here is broad: hardware and software resources, system, thread, and data resources at different levels. Tuning is essentially allocating resources more rationally, and its purpose is to meet business needs, so we need not pursue premature or excessive optimization. We should also recognize that performance tuning is never once-and-for-all: as the business iterates, new problems will keep appearing, so be prepared to play the long game.

Recommended reading

  • How did we optimize the delivery of ToB services by 75%?

  • Hands-on: benchmarking TensorFlow on JD Cloud GPU hosts

  • Thinking and practice in operating and maintaining large-scale ES clusters
