1. Understanding stress tests: is "XXX handles 100,000 requests per second" meaningful to you?

No melon seller will tell you his melons aren't sweet. Likewise, no open-source project will tell you how it behaves under its most demanding load, and most of the performance numbers published on official websites are measured under ideal conditions, which makes them largely meaningless for your workload.

For example, the benchmark on the official Redis site reports a read speed of about 110,000 ops/s and a write speed of about 81,000 ops/s for 256-byte values. But Redis's real advantage is its variety of data structures: String, List, Hash, Set and Sorted Set. Real work in even a slightly complicated environment mixes these structures, and string lengths vary. What you need is Redis's actual response performance under your working conditions: your mix of data structures, your access pressure, your string lengths. That value is often quite different from the official one. In my experience in the video industry, our strings are relatively long; after compression and serialization we mainly use String and List, values measured at the client are roughly 200-400 bytes, and QPS tops out at around 50,000. That 50,000, not 110,000, is the Redis performance ceiling for our business.
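If you want a first approximation closer to your own workload than the headline numbers, redis-benchmark at least lets you control the command mix and payload size. A minimal sketch; the host, port and sizes below are placeholders to be replaced with your own:

```sh
# Benchmark only the commands you actually use, with a payload size close to
# your real serialized values: -d is value size in bytes, -c parallel clients,
# -n total requests.
redis-benchmark -h 127.0.0.1 -p 6379 \
  -t set,get,lpush,lrange \
  -d 300 -c 50 -n 100000
```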

My service handles 2000 TPS. Impressive, right?

It is tempting to think that the higher the TPS, the more powerful the program, which is ridiculous; by that logic a program that computes 1+1=2 would be invincible. Specific problems need specific analysis: if your service is impressive, you have to explain under what conditions and against what kind of workload it reaches that TPS. Only then does the number mean anything.

We've had no problems online; we can hold up. Are you sure?

Having had no problems so far does not mean traffic will never grow, nor that the service will never become more complex, and the performance of a more complex service is very likely to deteriorate.

Stress testing is all done by QA; they will tell me if there is a problem. So what is your service's limit?

Look at how QA actually does stress testing, both from the service's perspective and from your own perspective as a developer. QA can only give you test results; they cannot tell you where the performance bottlenecks are.

2. What should a stress test focus on?

Stress testing is not a joke: you have four nines to protect

A good service has an index called four nines, i.e. 99.99% availability, which allows roughly 52.6 minutes of downtime per year. It is the universal standard for whether a service is excellent, and the quality of your stress testing has the most direct impact on whether you can guarantee those four nines. A stress test verifies the stability of the service and establishes the limiting conditions under which it stays stable. Stable means that service load, CPU utilization, interface response time, network latency, result correctness and so on all stay within your criteria, and these indicators influence one another.

Conditions: use everything you know to estimate what you don't, and simulate what you are about to face

Hardware conditions: the server's number of CPU cores, memory size, the network between this service and its dependencies, and which other services share the host machine. Software conditions: stability of the virtual users, GC behavior, response-time requirements, and the availability of third-party dependencies. Something as simple as a different JDK version can seriously skew stress-test results and leave you with timeouts in production that never appeared during testing. If you ignore these conditions, your service may well collapse online at a mere 500 TPS. When stress testing, keep the test environment as consistent with production as possible; only then are your results meaningful.
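A small sketch for recording the stress-test environment so it can be compared against production; the dependency host below is a placeholder:

```sh
# Capture the conditions that affect results, per the point above that even a
# different JDK version can skew a stress test.
nproc                        # CPU cores
free -h                      # memory
uname -r                     # kernel version
java -version                # exact JDK build
ping -c 5 dependency-host    # baseline latency to a third-party dependency (placeholder)
```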

What exactly do you want from the test?

TPS is just one result. Virtual users, thread count, average response time, error count, 90% response time, server load, result correctness: evaluate your service against as many data points as possible.

TPS: throughput, the number of requests the service responds to per second. TPS is the most intuitive output of a stress test and the headline indicator of service performance; when people quote a stress-test figure, it is generally the TPS. At its most straightforward, the total number of access-log lines written in one second is the TPS for that second.
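A minimal sketch of that last sentence, assuming an nginx/Apache-style access log whose fourth field is a timestamp with one-second resolution (e.g. [12/Oct/2023:14:05:31); the file name is a placeholder:

```sh
# Count log lines per second to get each second's TPS, then list the peak seconds.
awk '{ count[$4]++ } END { for (s in count) print count[s], s }' access.log \
  | sort -rn | head
```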

Error count: generally, no errors are allowed during a service stress test, i.e. the error rate must be 0%. Under complex conditions, however, some services only require the error rate to stay under X%; it depends on the service.

Average response time: in the mobile-Internet era, if opening the app makes a user wait a full second, the user is gone; seeing content immediately is the bare minimum, so a good app keeps interface response times at the millisecond level. Different scenarios have different demands: one company I worked at required the average response time to stay within 20 ms, while the algorithm interface at a previous company averaged around 100 ms. Different services have different requirements; you need to find yours.

90% response time: in JMeter terms, the response time within which 90% of transactions complete, i.e. for 90% of transactions the server's response stays near or below a certain value. For example, with response times of 1 s, 5 s and 12 s the average is 6 s; with 5 s, 6 s and 7 s the average is also 6 s, yet the second case is obviously far more stable than the first. So when reading the average transaction response time, look first at the overall trend of the curve. If the trend is smooth, both the Average Time and the 90 Percent Time are usable; if it fluctuates irregularly and heavily, the 90 Percent Time reflects reality better than the Average Time.
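A minimal sketch of computing both numbers from a file with one response time (in ms) per line, for instance a column extracted from a JMeter .jtl results file; the file name is a placeholder:

```sh
# Sort the response times, then report the average and the 90th percentile.
sort -n response_times.txt | awk '
  { v[NR] = $1; sum += $1 }
  END {
    idx = int(NR * 0.9); if (idx < 1) idx = 1
    printf "avg=%.1f ms  p90=%s ms\n", sum / NR, v[idx]
  }'
```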

Result correctness: whether results remain correct under multi-threaded access. This is often overlooked during stress testing. You need to send random requests during the test and check that the results are correct. I once had an interface that passed QA testing, returned wrong results after going live, and then passed again when re-tested on the intranet; the cause was thread-unsafe code. Spot-checking results during the stress test would have caught it easily.

Thread count: the major stress-testing tools do not monitor this metric, but it is a very important service-side metric for a stress test. Experience says that when a service crashes, its thread pool is often exhausted. Within a certain range of thread counts an application server is at its healthiest; beyond that range it becomes unstable, and near the critical point even a little network-latency jitter can bring it down. Knowing this range tells you what state your service is currently in.
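Two quick ways to watch the thread count of a Java service during a test; a sketch where <pid> is a placeholder for your process id (e.g. from jps or pgrep):

```sh
ps -o nlwp= -p <pid>                              # number of threads of the process
jstack <pid> | grep -c 'java.lang.Thread.State'   # Java threads seen in a stack dump
```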

Virtual users (concurrent threads): many people confuse TPS with virtual users. Virtual users are the users currently accessing your service, i.e. how many threads the stress tool starts in order to keep hitting your interface. In reality no user hammers your interface like a mad dog; a normal user accesses it, reads the content, and then stops for a while (the professional term is think time). But this value strongly affects the measured TPS, so what is the right number? I know of two approaches. The first is to chase the maximum TPS: the number of virtual users cannot be measured during the test anyway, so ignore it and care only about the TPS throughput of your service; concretely, find the virtual-user count at which TPS peaks and record it as the stress-test index. This has a serious flaw: some services show very high TPS at a very low virtual-user count, yet degrade very quickly as soon as the count rises slightly. The remedy is to report an interval: after finding the virtual-user count with the best TPS, add and subtract some value (for example 30) and run three stress tests; the trend across the three results then serves as the performance indicator for the service (see the sketch below). The second approach is to estimate the TPS the current service actually needs to sustain and use the stress tool to find the number of concurrent threads that reaches it, though that value is definitely just an estimate.
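A sketch of the interval strategy with JMeter in non-GUI mode. It assumes the test plan (plan.jmx, a placeholder) reads its thread count from a threads property via ${__P(threads)}:

```sh
# Sweep virtual users around a suspected optimum (here 60): best value +/- 30,
# per the interval strategy above.
for users in 30 60 90; do
  jmeter -n -t plan.jmx -Jthreads="$users" -l "results_${users}.jtl"
done
# Then compare throughput and 90% response time across the three runs,
# e.g. with the awk snippets above or JMeter's report generator.
```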

One more thing on why this value cannot simply be ignored. When real online traffic is high, you often see the following: a service behind Nginx can no longer carry the current traffic, and directly restarting it does not help; but if you adjust the front-end traffic allocation so the restarted instance receives only part of the traffic, then slowly open up the rest, the same server carries the full load again. The virtual users at any given moment are roughly the same; only the way traffic ramps up at startup differs. My view is that this happens because, when all calls arrive at once, the service has to create all of its reusable objects and threads at the same time, which easily destabilizes it. So are stress-test virtual users really that important, really irreplaceable? The number of virtual users is only a summary of the service's performance. The real health of the service shows up in thread count, response times, server load, performance bottlenecks (such as transactions or locks), and so on; the virtual-user count merely summarizes the limits of those health indicators. In other words, when a stress test says the service sustains a certain TPS within a certain range of virtual users, what it really describes is the TPS achievable given the thread count, the load, the average response time and GC stability.

4. What is JVM tuning for?

Note that JVM tuning is tuning for stability; it does not give you a big performance boost. The importance of service stability goes without saying, and GC will always be one of the instability factors a Java programmer must account for. For complex, high-concurrency services, what matters is that every GC behaves consistently, that performance metrics do not fluctuate, that collections are regular and clean, and that you find the JVM settings which make this true. For details on the JVM itself, see the book "Understanding the Java Virtual Machine in Depth."

As an aside, many people have no JVM-tuning experience, some even doubt whether it is useful, and some companies ship identical JVM settings across all services. Usually that just means they have not yet met complex production conditions. A simple example: at one company I had a service that crashed outright after running for about 14 hours. We stress tested it in the afternoon, and by the next morning the service had restarted itself. By convention a new service has to survive a 12-hour stress test, so in principle mine had passed; the test environment was complicated, anyone could log in and plenty of development scripts were running, so QA suspected a script had killed the process. The launch deadline was still some way off, so I decided to run the stress test again and successfully reproduced the crash. Checking the JVM, I found that the old generation retained a little more memory after every full GC; printing the heap with jmap showed char objects steadily increasing until memory was full. That was a simple case; at that company, with algorithm-heavy services and high traffic, we ran into many problems like it.

Another time, the LoadRunner graph of a stress test showed the response dipping sharply at regular intervals and recovering about 2 seconds later, with strong regularity. jstat showed that large objects were being created frequently and going straight into the old generation, which filled up quickly and triggered full GCs whose collection times were long enough to visibly stall the service, exactly matching the stress-test graph. The fix was to enlarge the young generation so that the large objects would be collected there by young GCs as much as possible, reducing old-generation collections. There was a slight, almost negligible performance penalty, but the service became stable, and that stability was the whole point of this JVM tuning.
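A sketch of the commands used in both investigations, plus the kind of young-generation flags the fix involved; the pid, interval and sizes are placeholders, not the values we actually used:

```sh
# Watch GC behaviour: heap-space utilisation and GC counts/times every 5 s.
jstat -gcutil <pid> 5000

# Histogram of live objects, to spot what keeps growing after each full GC
# (in the incident above it was char arrays).
jmap -histo:live <pid> | head -n 20

# Example flags for giving large, short-lived objects room to die young:
# a bigger young generation and survivor spaces, sized for your own heap.
java -Xms4g -Xmx4g -Xmn2g -XX:SurvivorRatio=8 -jar service.jar
```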

5. Common stress-testing tools and commands

LoadRunner, JMeter, self-written JAR packages, tcpcopy, and so on.

tcpcopy replays copied online traffic and is a great tool for stress testing existing interfaces and services. JMeter and LoadRunner are general stress-testing tools; JMeter is roughly the civilian version of LoadRunner and wins on being free. In the past, because the test data had to be queried from the database in real time, I wrote my own HTTP client and used some shell and awk commands to compute the corresponding metrics.
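As an aside, JMeter's non-GUI mode can also produce an HTML dashboard covering most of the metrics discussed above; a sketch with placeholder file names:

```sh
# Run a plan headless and generate the HTML dashboard report in one go.
jmeter -n -t plan.jmx -l results.jtl -e -o report/
# Or build the report later from an existing results file.
jmeter -g results.jtl -o report/
```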

jstat shows memory-reclamation status, jmap shows the sizes of objects in memory, and jstack dumps thread stacks and detects deadlocks. Performance bottlenecks, such as a single thread using so much CPU that the whole service slows down, can be found with these commands plus a few auxiliary Linux commands.
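A common recipe for that "one thread burns the CPU" case; a sketch in which <pid>, <tid> and the hex value are placeholders:

```sh
top -H -p <pid>                               # 1. find the busiest thread id (TID)
printf '%x\n' <tid>                           # 2. convert the TID to hex
jstack <pid> | grep -A 20 'nid=0x<hex-tid>'   # 3. locate that thread's stack in the dump
```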

top, vmstat, sar, dstat, traceroute, ping, nc, netstat, tcpdump, ss, etc.

Your service runs on Linux. Is a third-party dependency the problem? Is it affected by other services or by excessive resource usage? Is the network jittering? Is the NIC saturated? Is it CPU performance? Is there a disk-write bottleneck? Is the kernel swapping data in and out frequently? Is the load too high? All of this takes Linux commands to see.
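A rough mapping from those questions to commands; a sketch, not exhaustive, and the host name is a placeholder:

```sh
ping -c 5 dependency-host    # network jitter / latency to a dependency
sar -n DEV 1                 # per-NIC throughput: is the card saturated?
vmstat 1                     # CPU, run queue, swap in/out, context switches
iostat -x 1                  # disk utilization and await: write bottleneck?
uptime                       # load averages
```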

6. Where is the difficulty in performance diagnosis?

My service is throwing alarms. What do I do? Connecting to memcache manually from a client works fine; is it a network problem? I tested and network latency is very low, so what is the problem?

A service is a bit like the human body: with the same cold, one person gets a blocked nose and sore throat, another a headache, another a runny nose. You treat the symptom to reach the root cause, and there is no single answer, which is why experience matters. For example: one night I suddenly got an alarm. Logging in over VPN, the service was returning normally and throwing no errors, but load had clearly risen and the resin thread count had shot past 1000 (at normal peak the service runs 500-700 threads), with high CPU utilization. I checked: 1. A traffic surge? Not according to the statistics. 2. A network anomaly? Access to the server showed nothing, only slightly slower responses due to the higher load. 3. An application bottleneck? Memory and thread dumps showed nothing. 4. Interference from other services on the same host? Monitoring showed nothing. Careful observation showed the traffic fluctuated slightly but not significantly, so why was the load high? In the end, the front-end Nginx bandwidth was saturated. The congestion meant the proxied back-end services could not return data in time; their connections backed up, server load rose, the higher load drove up thread count and CPU utilization, individual responses became too slow, and the alarm fired. In more serious cases the connections to memcache would have timed out and the logs would have shown connection errors: the butterfly effect. To get to the point quickly you need to know your service well, like a doctor knows a patient; keep paying attention to its state, because experience really is important.

7. Should we add machines or optimize the service?

It comes down to cost. Adding machines is a cost; optimizing the service is also a cost. Your service probably depends on many third parties, and pushing them to meet your requirements is a communication cost. Most of the time, what the boss is thinking about is cost: if he believes the service still has plenty of room for optimization, he will not agree to the machines you ask for, so treat this as a resource question and weigh the costs. There is another reality: in big companies, the crying child gets the milk. In the end, though, for those of us who run services, the sense of achievement should come from squeezing every last drop out of the hardware.

