Author: Su Mu
0. Opening remarks
Performance optimization is complex and uncertain work. Unlike Java business code, which can be written once and run anywhere, performance is shaped by factors we may not even be aware of, and they can surprise or scare us. Being able to fully understand and evaluate the performance of the applications we are responsible for is, in my opinion, a very effective way to improve technical certainty and awareness. This article summarizes my own experience with performance optimization as concisely as possible, from a practical perspective, trying to avoid being wordy or stiff. The relevant body of knowledge is large, and I am limited by personal experience and technical depth, so I welcome additions and corrections.
Part 1 introduces background knowledge; readers already familiar with it can skip ahead.
1. Understanding the operating environment
Most programming languages (Java in particular) do a great deal to let us write correct code without knowing much about hardware, but fully understanding performance requires a lot of knowledge at the hardware, operating-system, and software levels.
At present we mostly use Intel 64-bit Xeon processors. There are also AMD x64 processors, and in the future there will be ARM server processors (such as Huawei Kunpeng and Alibaba Yitian), RISC-V processors, and special-purpose FPGA chips. Here we focus on the Intel 8269CY processor, which is used in large numbers in the Alibaba Cloud ECS instances we run on.
Intel Xeon Platinum 8269CY: base frequency 2.5 GHz (3.2 GHz all-core turbo, 3.8 GHz max turbo), 26 cores, 52 threads (with Hyper-Threading), 6-channel DDR4-2933 memory, up to 1 TB of configurable memory, Cascade Lake microarchitecture, 48 lanes of PCI-E 3.0, 14 nm lithography, 205 W TDP. It is among the most powerful Xeon processors of this generation and supports up to 8-socket configurations.
Automatic dynamic turbo boost provides a short-term performance lift, but keeping a few cores busy for long periods changes the picture; this has a significant impact on how we evaluate application performance (excellent results in small-scale tests, much worse results in large-scale tests). The 1 TB maximum memory configuration goes along with the processor's 48-bit virtual address (VA) space, which normally requires a 4-level page table (next-generation processors with 57-bit VA are already in design and will normally require 5 levels). Deep page tables clearly hurt memory-access performance and consume memory themselves (page tables are stored in memory), so Intel designed huge pages (2 MB and 1 GB) to reduce the impact of deep page tables. The 6-channel 2933 MHz memory bus gives a total of about 137 GB/s of bandwidth (the memory bus is 64 bits wide), but it is important to remember that the channels are designed to operate in a highly parallel way.
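The quoted bandwidth figure can be sanity-checked with simple arithmetic (6 channels × 2933 MT/s × 8 bytes per 64-bit transfer); a small illustrative snippet:

```java
public class MemoryBandwidth {
    // Peak bandwidth = channels × transfers per second (MT/s) × bytes per transfer
    // (a 64-bit bus moves 8 bytes per transfer).
    static long peakMBps(int channels, int megaTransfersPerSec, int busBytes) {
        return (long) channels * megaTransfersPerSec * busBytes;
    }

    public static void main(String[] args) {
        long mbps = peakMBps(6, 2933, 8); // 140,784 MB/s ≈ 137 GiB/s
        System.out.printf("%d MB/s = %.1f GiB/s%n", mbps, mbps / 1024.0);
    }
}
```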
The CPU core architecture of this microarchitecture is shown in the figure below: an 8-wide out-of-order design with 32 KB instruction + 32 KB data L1 cache and 1 MB L2 cache. The issue ports feed the real computation units inside the CPU, and their number is a key factor in CPU performance. Of the 8 ports, 4 can perform basic integer operations (ALU), but only 2 can perform integer multiplication/division and floating-point operations, so parallel efficiency is lower in workloads with heavy floating-point math. Each physical core's two hardware threads (the virtual threads provided by Hyper-Threading) share the 8 issue ports, so the two HTs can interfere with each other significantly (this is also why CPU utilization reported by the operating system is no longer a linear measure; see related material for details). The L1 and L2 caches are also shared, so the two HTs interact there as well.
Intel Xeon processors put a lot of effort into branch prediction, so they tend to handle branch-heavy code (typically if/else code) better than most ARM implementations do. The pointer-compression technique commonly used in Java also benefits from the flexible addressing of the x86 architecture (e.g. mov eax, [ecx*8+8]), which completes in a single instruction with no performance penalty; this does not apply to RISC (Reduced Instruction Set Computing) processors such as ARM and RISC-V.
It is understood from reliable sources that the next-generation architecture (Sunny Cove) will be significantly widened, upgrading to 10-wide issue, and the data portion of the L1 cache will grow to 48 KB, which suggests that upcoming processors will focus more on improving SIMD (Single Instruction Multiple Data) performance.
An 8269CY has 26 CPU cores, connected by the topology shown below. This generation of processors has at most 28 cores; the 8269CY disables two of them to relax yield requirements (saving cost). You can see that the 35.75 MB L3 cache (Last Level Cache) is divided into 13 slices of 2.75 MB each rather than being a single region. The 6-channel memory controllers are likewise distributed on the left and right sides, corresponding to the positions of the memory slots on the actual motherboard. All of this tells us that this is a multi-core processor designed for very strong parallelism.
Alibaba Cloud typically uses 2U dual-socket Intel machines (for reasons of heat dissipation, density, and cost-effectiveness), which basically means 2 NUMA (Non-Uniform Memory Access) nodes. With dual Intel 8269CY, one server has 52 physical cores and 104 hardware threads; Alibaba Cloud usually calls this 104 cores. NUMA was a compromise by hardware engineers (it is simply infeasible for many CPUs to access every address with uniform latency), so handling it badly can seriously degrade performance. Most of the time, the VM/container scheduling team has to keep NUMA awareness on and deploy each VM or container within a single NUMA node.
AMD has developed very well in recent years; its multi-core architecture is quite different from Intel's, and Alibaba Cloud will deploy many AMD-based instance types in the near future. AMD processors have more NUMA nodes and a more complex topology, as does Alibaba's own Yitian (ARM-based). This keeps the VM/container scheduling team busy enough.
In most cases servers use a CPU-to-memory ratio of 1:2 or 1:4; that is, a physical machine with dual Intel 8269CY is usually equipped with 192 GB or 384 GB of memory. If a VM/container requires too much memory, it becomes hard to allocate all of it from the NUMA node nearest its CPUs, and performance cannot be guaranteed.
Since a 2U chassis is twice the physical height of a 1U one, there is more room for SSDs, high-performance PCI-E devices, and so on. However, cloud vendors are certainly not willing to sell physical machines directly to users (after all, they are no longer the hosting companies of old); they always build a layer on top, i.e., ECS instances to sell to customers, which enables features such as live migration and high availability. That extra layer is implemented through virtualization technology.
1.1.3 Virtualization Technology
A physical machine is powerful, and usually we only need a small slice of it, but we do not want to be aware of others sharing the same physical machine. This gave rise to virtualization technology (simply put, making one CPU appear to work like multiple CPUs running in parallel, so that a single server can run multiple operating systems at the same time). Early virtual machine technology was implemented in software, by established vendors such as VMware, but the performance sacrifice was considerable. Hardware vendors were also optimistic about virtualization, so hardware-assisted virtualization appeared. Each vendor's implementation differs somewhat, but not by much, and fortunately there are dedicated virtualization modules to abstract over the differences.
Intel's virtualization technology is called Intel VT, and it includes VT-x (processor virtualization), VT-d (virtualization of direct I/O access), VT-c (virtualization of network connections), and SR-IOV (Single Root I/O Virtualization) for network performance. One important point is that the original one-level translation for memory access (linear address → physical address) becomes a two-level translation (guest linear address → host linear address → physical address), which introduces more memory overhead and more page-table walking. Therefore most cloud vendors enable huge pages on the host operating system (Linux typically uses transparent huge pages) to reduce memory-related virtualization overhead.
Servers have high network-performance requirements, and the network hardware supports NIC multi-queue: network interrupts in a VM generally need to be distributed across different CPU cores, to avoid the bottleneck of single-core packet forwarding.
The host operating system needs to manage one or more VMs (virtual machines) on top of it and handle the aforementioned NIC interrupts, which costs CPU. On Alibaba Cloud's earlier generation of instance types, a server had 96 cores in total (i.e., 96 hardware threads; physically 48 cores), but at most 88 could be allocated; 8 cores (about 8.3% of the machine's CPU) had to be reserved for the host operating system, and with I/O-related virtualization overhead on top, the machine lost more than 10% of its performance.
To minimize virtualization overhead, Alibaba Cloud developed its excellent "elastic bare-metal server, Shenlong (X-Dragon)", which claims not only no performance loss from virtualization but even some performance gains (mainly in network forwarding).
1.1.4 The Shenlong Server
To avoid the performance impact of virtualization, Alibaba Cloud (cloud vendors like Amazon have similar solutions) developed the Shenlong server. To put it simply, Shenlong's MOC card is designed to offload most of the virtual-machine management and network-interrupt processing from the CPU to the MOC card. The MOC card is a PCI-E 3.0 device containing an FPGA chip specially designed for network processing and two low-power x86 processors (rumored to be Intel Atom), taking over as much of the host operating system's virtualization management as possible. With this design, network forwarding performance can even reach ten times that of a bare physical machine, living up to the "bare metal" name. A 104-core physical machine can host a 104-core ECS instance without reserving any cores for the host operating system.
Virtual Private Cloud (VPC): most cloud users want their network isolated from other customers, just like a self-built data center, and the key technology behind this is network virtualization. Alibaba Cloud currently uses the VxLAN protocol, which transports data over UDP; the overall packet structure is shown in the figure below. VxLAN introduces a network identifier similar to a VLAN ID in the VxLAN frame header: the VxLAN Network Identifier (VNI) consists of 24 bits and can support up to about 16 million VxLAN segments, meeting the identification and isolation needs of large-scale networks. This extra layer adds a fixed 50-byte header to every original network packet. Of course, matching switches, routers, gateways, and so on are also required.
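The VNI width and header overhead above imply two simple consequences that are easy to verify; an illustrative sketch (constants taken from the text above):

```java
public class VxlanOverhead {
    static final int VXLAN_HEADER_OVERHEAD_BYTES = 50; // fixed encapsulation overhead
    static final int VNI_BITS = 24;

    // A 24-bit VNI gives about 16 million distinct network segments.
    static long maxSegments() {
        return 1L << VNI_BITS;
    }

    // Encapsulation shrinks the MTU available to the inner packet.
    static int innerMtu(int physicalMtu) {
        return physicalMtu - VXLAN_HEADER_OVERHEAD_BYTES;
    }
}
```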
1.3 Container Technology
Although mature virtualization optimizations have largely solved the overhead of the VM layer itself, the cost of the guest operating system is unavoidable, and since most Java applications can now be deployed as a single process, this guest-OS layer is somewhat wasteful (it does, of course, provide strong isolation and security). Container technology is built on operating-system support, currently mainly Linux. The most famous container runtime is Docker, built on Linux cgroups. The VM experience is so good that the ultimate goal of containers is to deliver the VM experience without the guest-OS overhead: inside a container, commands like top and free should show only the container's view, and the network should also be the container's view, so that when troubleshooting we can capture packets on the container's network alone.
At present, an Alibaba Cloud ECS instance type with 16 GB of memory actually exposes only 15 GB of usable memory to the operating system, and a 32 GB type only 30.75 GB. Containers do not have this problem, because the tasks running in containers are really just one or more processes on the host operating system. Because containers provide only logical isolation, applications from different enterprises are almost never deployed on the same operating system (that is, the same ECS instance).
The factors that most affect container performance are whether the container is oversold, whether cores are pinned, and the core-allocation strategy; many of the knowledge points mentioned earlier have a great impact here.
Typically, an enterprise's core applications require core binding (that is, the placement of a container's multiple vCPUs is fixed and dedicated, taking Intel HT, AMD CCD/CCX, NUMA, and so on into account) so that performance is deterministic.
Between Docker and VMs there are other, more balanced container technologies, which most companies call secure containers. They use hardware virtualization for strong isolation, but instead of a heavyweight guest OS (such as Linux) they run a very light microkernel (supporting only the kernel features a container must have, while forwarding most work to the host operating system). Cloud vendors want this technology badly: it is the basis on which they can sell reliable FaaS (Function as a Service).
2. Obtaining performance data
Before optimizing, we need to collect sufficient, accurate, and representative performance data; only then can we analyze bottlenecks and optimize effectively.
Evaluating application performance is a very complex matter. In most cases an application has many interfaces, and even the same interface can follow very different execution paths depending on its parameters and internal business logic, so we first have to decide which business scenario's performance we want to optimize (for order placement, perhaps just the order-creation path).
After performance test cases run, obtaining the real performance data we want is critical. Because of the observer effect (the effect that the act of observation has on the object being observed, very common in everyday life), collecting performance data always affects the application to some degree, so we need a deep understanding of how our performance-data collection tools work. For Java (and almost any other language), there are two main ways to find out what an application is doing:
- Instrumentation: using an agent independent of the application to monitor applications running on the JVM, including but not limited to reading JVM runtime state and replacing or modifying class definitions. A common form is injecting code before and after a function to record its execution time. Once we understand the basics, we can see that this approach has a significant performance impact: the shorter a function is and the more often it executes, the greater the impact. The benefit is also obvious: you can count exactly how many times each function executes without missing a single detail. This approach is typically used in early-stage optimization analysis.
- Sampling: interrupting program execution at a fixed frequency and pulling the execution stack of each thread for statistical analysis. The sampling frequency determines the minimum granularity and error of the results: small functions that execute frequently may be over-counted, and small functions that execute rarely may not be counted at all. Major operating systems support this at the kernel level, so the performance impact of this approach is relatively small (how small depends strongly on the sampling frequency).
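To make the sampling idea concrete, here is a minimal, illustrative sampler built on Thread.getAllStackTraces(). Real profilers rely on OS and JVM support and are far cheaper and more accurate; this sketch only shows the principle (periodic stack capture, statistical counting):

```java
import java.util.HashMap;
import java.util.Map;

public class StackSampler {
    // Periodically capture all thread stacks and count the top frame of each
    // runnable thread. Hot frames are over-counted, rare frames may be missed —
    // exactly the sampling error described in the text.
    static Map<String, Integer> sample(int iterations, long intervalMillis) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < iterations; i++) {
            for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                StackTraceElement[] stack = e.getValue();
                if (e.getKey().getState() == Thread.State.RUNNABLE && stack.length > 0) {
                    String frame = stack[0].getClassName() + "." + stack[0].getMethodName();
                    counts.merge(frame, 1, Integer::sum);
                }
            }
            try {
                Thread.sleep(intervalMillis); // the sampling interval sets the granularity
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return counts;
    }
}
```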
Time is also a very important dimension of performance data. There are two main kinds:
- CPU Time: the total number of CPU time slices occupied. Mainly used to analyze high CPU consumption.
- Wall Time: the actual elapsed time, which includes not only CPU consumption but also waits for resources, etc. Mainly used to analyze RT (response time).
Collection mode × time metric gives four combinations, each with the scenarios it suits best. Keep in mind, however, that a Java application typically needs at least 5 minutes of high-volume continuous load (a good rule of thumb, since a Server-mode JVM typically needs 5,000 to 10,000 executions of a method before JIT-compiling it) to reach a performance steady state. So unless you are specifically studying warm-up performance, wait more than five minutes after the test starts before collecting performance data.
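The warm-up rule can also be baked into hand-rolled measurements; a minimal sketch, where workload() is a hypothetical stand-in for the code path under test:

```java
public class WarmupDemo {
    static volatile long sink; // consume results so the JIT cannot discard the work

    // Hypothetical workload; replace with the real code path under test.
    static long workload(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += i * 31L;
        return sum;
    }

    // Run the workload enough times for the JIT to compile the hot path
    // before the timed section starts.
    static long timedRunNanos(int warmupIterations, int measuredIterations) {
        for (int i = 0; i < warmupIterations; i++) sink += workload(10_000);
        long start = System.nanoTime();
        for (int i = 0; i < measuredIterations; i++) sink += workload(10_000);
        return System.nanoTime() - start;
    }
}
```

For real benchmarks, a harness such as JMH handles warm-up, dead-code elimination, and statistics far more rigorously than this sketch.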
Linux also has some very useful performance monitoring tools, as shown below:
2.1 Constructing Performance Test Cases
Typically we analyze the performance of one or more typical business scenarios, not merely one or more API endpoints — for example, the performance of shopping guidance, transactions, or coupon distribution.
A good performance test case needs to reflect typical user and system behavior (not all behavior; we can never reach 100% of the real user scenario, only approach it). Examples: the distribution of items per order (the actual case would break down what fraction of orders buy one item, what fraction buy two, and so on), the share of purchases and transactions going to hot items, the number of users and items, the inventory distribution of hot items, the average number of coupons per buyer, and so on. Stress-test cases like Taobao's Double 11 involve more than 200 key parameters of this kind.
In practice we expect the test case to run stably and continuously (not, say, running out of inventory or coupons after 30 minutes), and external dependencies such as cache hit ratio and DB traffic should also reach a steady state. Application performance should likewise be stable at one-second granularity (generally no finer), meaning the computational complexity of requests is evenly distributed over time rather than swinging between highs and lows.
To support the performance test cases, the application sometimes needs corresponding modifications. For example, in scenarios that use caching, the hit ratio starts low and gradually climbs to 100% after running for a while; since that is not what happens in production, we may write some logic to keep the cache hit ratio at a chosen value.
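One way to pin the cache hit ratio at a chosen value is to randomly force a fraction of lookups to behave as misses. This is a hypothetical sketch (the target ratio, seeding, and integration point are all illustrative):

```java
import java.util.Random;

// Forces a target cache hit ratio during a stress run: with probability
// (1 - targetHitRatio), a lookup is treated as a miss even if cached.
public class HitRatioController {
    private final double targetHitRatio; // e.g. 0.90
    private final Random random;

    HitRatioController(double targetHitRatio, long seed) {
        this.targetHitRatio = targetHitRatio;
        this.random = new Random(seed);
    }

    // Call before consulting the cache; false means "pretend it was a miss"
    // and fall through to the backing store.
    boolean allowCacheHit() {
        return random.nextDouble() < targetHitRatio;
    }
}
```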
In summary, a good set of performance test cases is essential for the follow-up work and is worth the time we spend on them.
2.2 Real test environment
Keeping the test environment consistent with production is extremely important but often hard to achieve, which is why most Internet companies' full-link stress-testing solutions run directly in the production environment. If we cannot use production for performance testing, we must carefully compare our environment with production so that we know which performance data can be trusted.
Running performance tests directly in production is not easy either: it requires an end-to-end solution for distinguishing stress traffic from real traffic. Generally, a stress marker is added to the traffic and passed through the full call chain transparently. At the same time, essentially all basic components need modifications to support stress testing, mainly:
- DB: create shadow tables alongside business tables to store stress data. We do not use an extra column for logical isolation because stress data would be too easily confused with real data and could not be cleaned up separately; we do not use a separate stress database because, on one hand, that defeats our basic purpose of stress-testing the production environment, and on the other hand it doubles the number of database connections on the application side.
- Cache: add a special prefix to cache keys, such as __yt_. Caches mostly have no concept of a table and look like one huge Map, so there is no better approach than a fixed prefix. To reduce the storage cost of stress data, it is usually also necessary to: 1) shorten the cache expiry of stress data in the cache client library; 2) provide a stress-data cleanup function in the cache console.
- Messaging: pass the stress marker through on send and on consume. As far as possible the business teams should not need to be aware of it: the stress flag is carried in the message's internal structure, and business teams do not need to request a new Topic dedicated to stress testing.
- RPC: pass the stress marker through transparently. Of course, the concrete pass-through scheme differs for HTTP, Dubbo, and other RPC protocols.
- Cache and database client libraries: route requests based on the stress marker. This works together with the cache and DB schemes above.
- Asynchronous thread pools: pass the stress marker through. To reduce intrusiveness, the marker is often stored in a ThreadLocal, so remember to carry it across when handing work to asynchronous thread pools.
- In-application caches: isolate stress data from real data. If the primary key or another unique identifier of the stress data clearly distinguishes it from real data, we may not need to do much; otherwise we may need either a separate cache instance or a special prefix for stress data.
- Self-built tasks: pass the stress marker through. We need to do something similar to the messaging component above, since tasks and messages are technically very similar.
- Second-party and third-party interfaces: analyze case by case. We need to check whether these interfaces support stress testing; if they do, we pass the marker in whatever way the other side expects; if not, we need to find another way (such as setting up a store or account dedicated to stress testing).
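Several of the points above — the __yt_ cache-key prefix and ThreadLocal-based marker propagation into asynchronous thread pools — can be sketched together. This is an illustrative, hypothetical helper, not any specific framework's API:

```java
public class PressureContext {
    // ThreadLocal carrying the stress-test marker for the current request.
    private static final ThreadLocal<Boolean> STRESS =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    public static void setStress(boolean stress) { STRESS.set(stress); }
    public static boolean isStress() { return STRESS.get(); }

    // Route stress traffic to prefixed cache keys (shadow tables are analogous).
    public static String cacheKey(String rawKey) {
        return isStress() ? "__yt_" + rawKey : rawKey;
    }

    // Capture the submitting thread's flag and restore it inside the pool
    // thread, so the marker survives the hop into an asynchronous thread pool.
    public static Runnable wrap(Runnable task) {
        final boolean flag = isStress();
        return () -> {
            boolean old = isStress();
            STRESS.set(flag);
            try {
                task.run();
            } finally {
                STRESS.set(old); // never leak the flag to the pool's next task
            }
        };
    }
}
```

Frameworks such as transmittable-thread-local solve this propagation problem generically; the sketch only shows the mechanism.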
To reduce the impact of performance tests on users, they are usually run during traffic troughs, typically in the middle of the night. Of course, if we have a relatively independent compromise scheme, such as a small dedicated environment, or a dedicated cell in a unitized (cell-based) architecture, we can run performance tests at any time and make them routine.
2.3 Use of JProfiler
JProfiler is a very mature product: expensive, easy to use, designed for performance analysis of Java applications, cross-platform, and a tool I use a lot. Its general architecture is shown in the figure below; Linux agent plus Windows UI is the usage mode I recommend most. It supports both Instrumentation and Sampling, both CPU Time and Wall Time, and has an easy-to-use graphical interface.
To analyze an application, we simply upload the agent package to a directory on the machine (e.g. /opt/jprofiler11.1.2) and add a JVM startup option to load it. I usually configure it as follows:
-agentpath:/opt/jprofiler11.1.2/bin/linux-x64/libjprofilerti.so=port=8849
Then we restart the application for the change to take effect. With this configuration, the Java process waits at startup for the JProfiler UI to connect before continuing, so that we can analyze the application's performance during startup as well.
JProfiler has many features that I will not cover in detail; you can read its official documentation. Collected performance data can also be saved as a .jps file for later analysis and sharing. A typical analysis view is shown in the figure below:
Some disadvantages of JProfiler:
1) The agent must be loaded at Java application startup (there is also an attach-after-startup mode, but it has many limitations), which is inconvenient for quickly analyzing urgent performance problems; 2) it has a large impact on the performance of the Java application. Sampling is certainly much cheaper than instrumentation, but still not as cheap as perf, which I cover next.
2.4 Use of perf
Perf is the performance analysis tool for Linux, and this deserves emphasis. It can analyze not only user-mode applications but also the kernel itself. Its module structure is shown in the figure below:
Imagine a scenario: we switch to another cloud vendor, or a cloud vendor upgrades its server hardware (mainly CPUs), and we want to know the specific reason for the performance change. What can we do? Perf handles this very well.
To help us analyze execution performance, CPU designers build dedicated hardware circuits, of which the PMU (Performance Monitoring Unit) is the most important part. In short, it contains a number of performance counters (PMCs, Performance Monitoring Counters) that perf can read. Beyond that, the kernel exposes a number of software-level counters, which perf can read too. Some key metrics related to CPU architecture:
- IPC (Instructions Per Cycle): balancing power and performance, most server processors run in the 2.5-2.8 GHz range, so the number of cycles available per unit time does not differ much between chips. The more instructions executed per cycle, the better optimized our application is. Too many jump instructions (i.e., if/else code), floating-point calculations, random memory accesses, and so on clearly hurt IPC. Some people prefer CPI (Cycles Per Instruction), the reciprocal of IPC.
- LLC Cache Miss: memory-heavy applications should watch this metric. A high value indicates we are not taking advantage of the processor's or operating system's cache and prefetch mechanisms.
- Branch Misses: a very high value indicates that our branch-heavy code is not laid out to match the processor's branch-prediction algorithm and needs adjustment. If our branch logic depends on data, we can also improve performance by reshaping the data, as in the classic case of an array of 1 million elements with values between 0 and 255, counting the elements less than 128: sorting the array first makes the subsequent for loop run much faster.
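The classic example just mentioned is easy to reproduce. In the sketch below, countBelow() contains the data-dependent branch; sorting the input first makes that branch highly predictable, which is why the sorted run is much faster (the speedup itself is best observed with perf or a benchmark harness, since the counted result is identical either way):

```java
import java.util.Random;

public class BranchPredictionDemo {
    // Count elements below a threshold. On sorted data the branch outcome
    // changes only once, so the branch predictor is almost always right.
    static long countBelow(int[] data, int threshold) {
        long count = 0;
        for (int v : data) {
            if (v < threshold) count++; // the data-dependent branch
        }
        return count;
    }

    // Byte-range values (0..255), seeded for reproducibility.
    static int[] randomBytes(int n, long seed) {
        Random r = new Random(seed);
        int[] data = new int[n];
        for (int i = 0; i < n; i++) data[i] = r.nextInt(256);
        return data;
    }
}
```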
Because perf targets native Linux applications, using it on a Java application directly treats the Java program as an ordinary C++ program and cannot show Java call stacks or symbol information. The good news is that the perf-map-agent plugin project solves this: it exports Java symbol information and helps perf walk the stacks of Java threads, so we can use perf to analyze the performance of Java applications. After running perf top -p <Java process id>, you can see perf's real-time statistics as shown in the figure below:
Perf supports only the sampling + CPU Time mode, but its performance is excellent: a typical Java performance analysis introduces less than 5% overhead. The environment variable PERF_RECORD_FREQ sets the sampling frequency; the recommended value is 999. But as you can see, it is standard Linux command-line interaction, which is not that convenient. And while performance data can be recorded to a file for later analysis, remember to also save the Java process's symbol file, or you will not be able to see the Java call stacks. Despite its limitations, perf is the most suitable tool for analyzing online performance problems in real time: it requires no preparation, is readily available, and has little impact on online performance, allowing you to identify bottlenecks quickly. After installing perf (which requires sudo permission) and perf-map-agent, you usually start it like this:
export PERF_RECORD_SECONDS=5
export PERF_RECORD_FREQ=999
./perf-java-report-stack <Java process pid>
That covers the important points; to use perf well in practice, refer to the relevant documentation.
2.5 Kernel mode and User mode
Those familiar with operating systems hear these two terms a lot and know that frequent switching between kernel mode and user mode costs a lot of performance. These execution modes, designed by processor architects, are the foundation on which today's stable operating systems are built. With them, user-mode processes (ring 3 on x86) cannot execute privileged instructions or access kernel memory. Most of the time, for security reasons, the kernel does not map kernel memory directly into user space; when making a system call, parameters must be written to specific locations, and the kernel copies what it needs from there, so there is an extra memory-copy overhead. As you can see, the kernel pays a security cost on every request.
On Linux, TCP is implemented in kernel mode. There were many good reasons for that, but the kernel's update cadence is certainly slower than what today's Internet industry demands, which is how QUIC (Quick UDP Internet Connections, Google's UDP-based low-latency transport-layer protocol) was born. The mainstream direction now is to bypass the kernel and implement as much as possible in user mode.
One exception: preemptive thread scheduling cannot be done purely in user mode, because the timer interrupts it requires can only be set and handled in kernel mode. Coroutine technology has been an excellent way to reduce kernel scheduling overhead in I/O-heavy Java applications, but unfortunately it requires the executing thread to yield its remaining time slice voluntarily; otherwise the multiple user-mode threads multiplexed onto one kernel thread can starve. Alibaba's Dragonwell JVM has also tried dynamic adjustment (i.e., user threads are not bound to fixed kernel threads and can be switched when needed), but because of the clock-interrupt limitation above it does not work very well.
The same applies to today's virtualization technology, especially SR-IOV, which needs the kernel only for interface allocation and reclamation; the communication in between is completed entirely in user mode without kernel involvement. So, if you find that your application spends too much time in kernel mode, consider whether that work can be moved to user mode.
2.6 Key Metrics of the JVM
There are many JVM metrics, but there are a few key ones that you need to keep an eye on.
- GC count and duration: this covers Young GC, Full GC, concurrent GC, etc. A high Young GC frequency often indicates too many short-lived temporary objects.
- Java heap size: this includes the total Java heap (controlled by Xmx and Xms) as well as the sizes of the young and old generations. Not specifying both Xms and Xmx is likely to leave your Java process with a very small heap, and an oversized old generation usually means wasted memory most of the time (allocating more to the young generation significantly reduces Young GC frequency).
- Thread count: we usually use 4C8G (4 vCPU cores, 8 GB of memory) or 8C16G machines; allocating thousands of threads is, most of the time, a mistake.
- Metaspace size and usage: do not let the JVM expand the metaspace dynamically; fix it by setting MetaspaceSize and MaxMetaspaceSize. We need to know how much metaspace our application actually requires. Excessive metaspace usage or rapid growth can mean we are misusing dynamic proxies or scripting languages.
- CodeCache size and usage: likewise, do not let the JVM expand the CodeCache dynamically; keep it fixed by setting InitialCodeCacheSize and ReservedCodeCacheSize. A jump in usage can show whether a new class library was introduced recently.
- Off-heap size: limit the maximum off-heap size, account for every chunk of JVM memory, and do not give the OS a chance to trigger the OOM Killer.
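Most of the metrics above can be read from inside the JVM itself. Below is a minimal sketch using the standard java.lang.management MXBeans; note that the pool name "Metaspace" matches HotSpot's naming and may differ on other JVM implementations.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

// Reads a few of the key JVM metrics discussed above via the standard
// management beans: GC counts, heap usage, thread count, metaspace usage.
public class JvmMetrics {
    public static long totalGcCount() {
        long count = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            count += Math.max(0, gc.getCollectionCount()); // -1 means "unavailable"
        }
        return count;
    }

    public static long heapUsedBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    public static int liveThreadCount() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    public static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) { // HotSpot pool name
                return pool.getUsage().getUsed();
            }
        }
        return -1; // pool not found on this JVM
    }

    public static void main(String[] args) {
        System.out.println("GC count:       " + totalGcCount());
        System.out.println("Heap used:      " + heapUsedBytes());
        System.out.println("Threads:        " + liveThreadCount());
        System.out.println("Metaspace used: " + metaspaceUsedBytes());
    }
}
```

In practice these numbers are exported to a monitoring system rather than printed; the point is that nothing beyond the JDK is needed to observe them.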
2.7 Understanding the JIT
Interpreting bytecode is certainly slow, and Java's popularity owes a lot to its high-performance JIT (Just-In-Time) compiler. However, compilation itself is also quite expensive, and Java's dynamic nature makes it hard to compile a program ahead of time into native code, as C/C++ toolchains do, and then simply execute it. The result is slow Java application startup (most take more than 3 minutes, longer than a Windows operating system takes to boot), and JIT experience is not shared across the machines of the same application (which is obviously a huge waste).
The JVMs we use adopt a tiered compilation strategy, with tiers 1 through 4 from the lowest to the highest degree of optimization (in HotSpot, tiers 1-3 are produced by the C1 compiler and tier 4 by C2), tier-4 code being the fastest. The JIT compiler collects quite a bit of runtime data to guide its compilation strategy, the central assumption being that profile information can be collected progressively and that only hot methods and paths need to be compiled. However, this assumption does not always hold. For example, in the Double 11 rush scenario, traffic spikes vertically right at the start, and some code branches simply do not run before a certain point in time (for example, a discount that only becomes available after midnight).
For extremely hot functions, the JIT compiler usually inlines them, which is equivalent to copying the function body directly to the call site to remove the overhead of the call; however, functions that are too big (usually a few hundred bytes, depending on the JVM implementation) cannot be inlined. This is why programming guidelines often say not to write overly large functions.
The JVM does not JIT-compile every executed method; compilation usually kicks in after 5,000 to 10,000 invocations, and it compiles only the branches that have actually been executed (to reduce compilation time and Code Cache usage, and to optimize CPU execution). Therefore, when writing code, it is best to place the high-probability branch first after an if, and to keep the execution flow as fixed as possible.
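As a toy illustration of the branch-ordering advice, assume a made-up discount routine: putting the overwhelmingly common branch first keeps the observed execution flow stable, and the midnight-only branch is compiled only once it is actually taken.

```java
// Illustrative only: the hot branch (no special deal) comes first, so the
// JIT, which compiles only executed paths, sees a stable, predictable flow.
// The discount logic and numbers here are invented for the example.
public class DiscountRouter {
    public static long price(long base, boolean hasMidnightDeal) {
        if (!hasMidnightDeal) {     // hot path: nearly all traffic before 0:00
            return base;
        }
        return base * 9 / 10;       // cold path: compiled only after it runs
    }
}
```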
Alibaba's Dragonwell JVM adds a feature that records which methods were JIT-compiled at run time and writes them to a file (which can also be distributed to other machines in the application cluster). The next JVM startup can use this information to trigger JIT compilation the first time those methods run, instead of after thousands of executions, which dramatically improves startup speed and reduces startup CPU consumption. Dynamically generated AOP (Aspect Oriented Programming) code and lambda code cannot benefit from this, however, because the names they get at run time carry arbitrary numeric suffixes such as MethodAccessor$1586; the 1586 here will differ on the next run, so the recorded name means nothing next time.
2.7.2 De-optimization
The JIT compiler's aggressive optimizations are not always right. If it finds that the current execution needs a flow that was omitted from a previous compilation, it will de-optimize and resubmit a compilation request for that method. Until the new compilation completes, the method will likely run interpreted (or on lower-tier compiled code that has not been discarded, such as C1 output), which, combined with the overhead of the compiler threads, can cause a short-term drop in application performance. In the Double 11 rush scenario, that is, at the zero-hour peak, de-optimization is one reason the application performs noticeably worse than in the stress tests.
Alibaba's Dragonwell JVM also provides options to leave out aggressive optimizations during JIT compilation so that de-optimization cannot occur. Of course, this comes at the cost of a slight drop in steady-state application performance.
2.8 Real Cases
In the practice of performance optimization, one principle needs to be internalized over and over: let everything be reflected in data, not hearsay or experience.
Here is a performance comparison of an application optimized during a refactoring project, to show how to apply the knowledge discussed earlier. This application sits near the end of the call chain and has almost no downstream application dependencies. In particular, this case is illustrative only and does not correspond to any real application.
Application container: 8C32G, Intel 8269CY processor; the 8 vCPUs are bound to 8 hyper-threads on 4 physical cores.
Old application instances: 22.214.171.124, 126.96.36.199
New application instances: 188.8.131.52, 184.108.40.206
Test interfaces and traffic:

- Confirm-order discount rendering @ 1012 QPS
- Confirm-order discount @ 330 QPS
- Order discount verification @ 416 QPS

Collection date: 2021-05-03

Basic data:
| Item \ data | Old application (collected 21:00) | New application (collected 21:15) |
| --- | --- | --- |
| CPU | 43.44% (user: 36.78%) | 45.04% (user: 40.52%) |
| RT: confirm-order discount rendering | 5.2 ms | 6.1 ms |
| RT: confirm-order discount | 7.5 ms | 4.7 ms |
| RT: order discount verification | 1.0 ms | 1.3 ms |
| GC frequency | 18.5 | 25 |
| Network | In: 11.4 M, Out: 13.7 M | In: 7.3 M, Out: 12.4 M |
| Threads | 669 (Daemon: 554) | 789 (Daemon: 664) |
| Java exceptions per minute | 5132 | 7324 |
Cache access:

| Operation \ QPS | Old application | New application | Purpose |
| --- | --- | --- | --- |
| GET:860 | 4848 | - | Coupon rule cache |
| GET:758 | 1165 | 1365 | Seller cache |
| GET:8 | 1158 | 1280 | Seller global rule cache |
| PREFIX_GETS:688 | 1425 | - | Activity index (deprecated in the new app) |
| GET:688 | 2282 | - | Activity index (deprecated in the new app) |
| GET:100 | 21 | 21 | Store interest-free discount cache |
| PREFIX_GETS:100 | - | 73 | Store interest-free discount cache |
| GET:4 | - | 1008 | A certain store discount cache |
| GET:10086 | 2394 | 2580 | Card coupon relationship cache |
| GET:770 | 100 | - | SKU discount cache |
| PREFIX_GETS:770 | 92 | - | SKU discount cache |
| GET:88 | 21 | 21 | Item purchase-limit cache |
DB access:

| Table \ QPS | Old application | New application | Purpose |
| --- | --- | --- | --- |
| QUERY promotion_detail | 865 | - | Discount activities |
| QUERY promotion_detail_sku | 21 | - | SKU discount activities |
2.8.1 Preliminary findings
- As the differences in external dependencies show, the code logic of the old and new applications differs, and what those differences are needs further evaluation. When collecting performance data, we should always do a quick analysis and sanity check first; business-logic correctness must be ensured, otherwise the performance data is meaningless.
- Java exceptions are expensive, mainly because of collecting the exception stack trace; the large difference in exception counts between the old and new applications needs to be located and resolved.
- The RT of the new application's "confirm-order discount" interface dropped significantly, while the other interfaces got somewhat slower, which suggests the execution paths differ substantially; this also needs in-depth analysis.
3. Start performance tuning
Collecting performance data requires learning the stack from the bottom layer up; performance optimization, by contrast, should proceed from the top layer down. Higher-level optimizations are often less difficult and more profitable, but they must be deeply combined with the business. Lower-level optimization is usually harder and yields smaller gains (after all, crowds of technical elites have been working on it for years), but it generalizes well and can often be applied to many business scenarios. Next, let's talk about directions and practical examples at each layer.
3.1 Purpose and principle of optimization
Before discussing specific optimizations, let's talk about why we do performance optimization at all. In most cases, it is about cost, efficiency, and stability: achieving the same business result with fewer resources, or delivering a better user experience (usually meaning faster page response) with the same resources. Technical solutions that ignore cost rarely involve real challenges. For e-commerce platforms, we often measure machine cost per order; for Taobao this figure may be around 0.17 yuan. In the early stage of a business, cost often matters little and efficiency matters more; as the business matures, it gradually starts to pay attention to cost. Colloquially, first ask whether something can be done at all, then ask whether it is done well. So at different times, the purpose and direction of performance optimization will differ.
The Internet industry develops rapidly, and R&D efficiency is crucial to the healthy growth of a business. During optimization, the technical solutions we choose should take R&D efficiency into account (or at least not damage it too much), giving people the feeling that "this is how it should be", rather than doing something that is clearly unsustainable and expensive to maintain. A good optimization is like a work of art that impresses everyone who sees it.
Why does everyone buy things on the major e-commerce sites at midnight on Double 11? Why does everyone grab train tickets at 10 am before Spring Festival? It is all business design. Preparing a huge wave of machine resources for two months just for the few peak minutes of Double 11 is an enormous waste of cost. To avoid such waste, Taobao's Double 11 pre-sale payment time is usually set at 1 am. In its early years, 12306 went down every Spring Festival rush because too many people wanted to go home at once; later, ticket sales were gradually spread out by train departure time. See the announcement:
Since January 8 this year, to prevent large numbers of passengers from queuing online to buy tickets, the original four ticket-release times of 8:00, 10:00, 12:00 and 15:00 have been replaced by 15 release times: from 8:00 to 18:00, a portion of new tickets goes on sale every hour or half hour.
These strategies can greatly reduce the system's peak traffic while remaining basically imperceptible to users, and many such optimizations need to be considered from the very beginning (but remember to always put the business first, not the technology).
3.3 System Architecture
Examples such as the dynamic/static separation of the product detail page (static and dynamic pages served by different systems) and the co-deployment of the user interface layer (the HTTP/S layer) with the backend (the Java layer) are successful cases of architecture optimization. Business architects tend to design systems in many layers, but at run time those layers can often be deployed together to reduce cross-process, cross-machine, and cross-region communication.
Taobao's unitized architecture is also a good design from a performance perspective: almost all processing of a transaction request can be completed inside one closed unit, which reduces the need for long-haul cross-region transmission bandwidth.
The rich-client solution is also a good fit for basic data such as product and user information: in most cases such data is read from a cache like Redis anyway, and an extra RPC hop to a server always seems one too many. Of course, more work is then needed later when the data structures are upgraded.
Architecture is an eternal topic; different companies have different backgrounds, so optimization must be made according to the actual situation.
3.4 Invoking a Link
In a distributed system, it is inevitable to rely on many downstream services to complete a business action. How to depend on them, which interfaces to depend on, and how many times to call them are questions that need deep thought. With a call-chain viewing tool (a dependency-tracing system, in our case), we can take a close look at each business request and judge whether every call is really necessary and optimal. Here is an example I have heard of (again, this case is illustrative, not real):
Background: the marketing team received a user-acquisition demand. At 10:00 on the company's anniversary there will be a traffic burst expected to generate at most 300,000 UV; users just click the participation button on the activity page (estimated conversion rate 75%), and a group page pops up asking them to invite friends to join the group: the more friends invited, the bigger the discount everyone in the group enjoys.
The marketing team provided a new backend interface for the group page, which initially needed to do the following:
In the step "create a new group for the user", the initial design persisted the new group's information to the database at the same time. Based on the business's estimated conversion rate, that means a peak write traffic of 30W * 75% = 22.5W QPS, and we would need roughly 10 database instances to support such highly concurrent writes. But is that necessary? Obviously not: we all know how low the final conversion rate of such campaigns is (mostly below 8%), and it makes no sense to keep group information in the database for groups nobody ever joins. The final optimized scheme: 1) when "creating a new group for the user", only write the group information into the Redis cache; 2) only when a friend invited by the user agrees to join the group is the group information persisted to the database. The new design needs only 22.5W * 8% = 1.8W QPS of concurrent database writes, which the original single database instance can support, with no difference in business effect.
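A minimal sketch of the optimized flow, with plain maps standing in for Redis and the database (a real system would use a Redis client and a DAO; all names here are made up):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the optimized group-creation flow: group info goes to the
// cache on creation and only reaches the database once a friend joins.
// The two maps are hypothetical stand-ins for Redis and the DB.
public class GroupService {
    private final Map<String, String> cache = new ConcurrentHashMap<>(); // "Redis"
    private final Map<String, String> db = new ConcurrentHashMap<>();    // "database"

    // Step 1: creation only touches the cache (keeps 22.5W QPS off the DB).
    public void createGroup(String groupId, String groupInfo) {
        cache.put(groupId, groupInfo);
    }

    // Step 2: a successful join persists the group (about 8% of creations).
    public boolean joinGroup(String groupId) {
        String info = cache.get(groupId);
        if (info == null) {
            return false; // group expired or never existed
        }
        db.putIfAbsent(groupId, info); // first join triggers the DB write
        return true;
    }

    public boolean isPersisted(String groupId) {
        return db.containsKey(groupId);
    }
}
```

The design choice is simply to move the write behind the business event that actually matters (a join), letting cache TTLs clean up abandoned groups for free.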
In addition to the above examples, there are many methods for link optimization, such as merging multiple calls, only calling when necessary (similar to COW [Copy on write] idea) and so on, which need to be combined with specific scenarios to analyze and design.
3.5 Application Code
Optimizing application code is what we are often passionate about and good at; after all, business and system architecture optimization usually requires architects. Profiling data from JProfiler or perf is a very useful reference, and any guess not based on actual running data can easily lead us astray. Most of what we do next is "find the hotspot, optimize it", then "find the next hotspot, optimize it", and so on. Finding hotspots is not that difficult; what takes real experience is analyzing the code's logic to determine how many resources (usually CPU) it should reasonably consume, and then forming an optimization plan to reach that target. From the performance optimizations I have done, the main problems tend to be in these areas:
- Messing with strings: a lot of code likes to concatenate several Java variables with a StringBuilder (colleagues who do not even use StringBuilder and concatenate with + are an even bigger headache), then split the result back into multiple strings and convert them to other types (long, etc.). And next time you use a StringBuilder, specify its initial capacity.
- Logs flying everywhere: useful or not, just log it, it cannot hurt; after all, everyone has felt the pain of troubleshooting without logs. Well, printing useful and efficient logs is a required course for programmers.
- Love of Exceptions: I wonder whether some Java fanatics have overstated the claim that Java exceptions are as cheap as C/C++ error codes. This is not the case: capturing an Exception's call-stack backtrace can be quite costly, especially if the exception also has to be printed to a log.
- Deep copies of containers: List, HashMap, etc. are popular Java containers, but the Java language has no good mechanism to stop callers from modifying them, so people habitually deep-copy them into a new instance; it is only a line of code, after all.
- Love of JSON: serializing objects to JSON strings and deserializing JSON strings back to objects, the former mainly for logging and the latter mainly for reading configuration. JSON is fine; just do not use it everywhere (and do not give every property an overly long name).
- Repeat, repeat, repeat: one request querying the same item 3 times, the same user 2 times, and the same cache 5 times is common, and no, querying several times does not give you better consistency :(. Then there is configuration that rarely changes yet lives in a cache, so every use reads it from the cache, deserializes it, and only then uses it. Um, pretty heavy.
- The fun of multithreading: a developer who cannot write multithreaded code is not a good developer, so everyone loves to create thread pools and make everything asynchronous. When traffic is low, multithreading does seem to solve problems (RT really does shrink, for example), but when traffic rises, the problems get worse (after all, most of our machines have only 8 cores).
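For the exception point above, one common mitigation (sketched here under a made-up class name) is to skip the expensive stack-trace capture for exceptions that serve as plain business signals on hot paths, by overriding fillInStackTrace():

```java
// FastBusinessException is a hypothetical name. Overriding
// fillInStackTrace() to return `this` skips the costly stack walk that
// normally happens in the Throwable constructor, so throwing this
// exception costs little more than returning an error code.
public class FastBusinessException extends RuntimeException {
    public FastBusinessException(String message) {
        super(message);
    }

    @Override
    public synchronized Throwable fillInStackTrace() {
        return this; // no stack capture: cheap to construct and to log
    }
}
```

The trade-off is that the exception carries no stack trace, so this only suits well-understood, expected business conditions, never unexpected errors.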
Again, at this step, finding the problems is not too hard; finding good optimizations is, and that is the real test.
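As a small example of the string advice earlier in this section, pre-sizing a StringBuilder avoids repeated internal array growth and copying while a key is assembled; the key layout and sizes below are purely illustrative.

```java
// A minimal sketch of "specify the initial capacity": the capacity is a
// rough upper bound on the final length (20 digits for a long, 11 for an
// int, 5 for a boolean, plus 2 separators), so the builder never regrows.
public class KeyBuilder {
    public static String buildKey(long itemId, int year, boolean active) {
        StringBuilder sb = new StringBuilder(38);
        sb.append(itemId).append(':').append(year).append(':').append(active);
        return sb.toString();
    }
}
```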
3.6 The Cache
Most caches are Key-Value structures, so it is important to choose compact keys and values and efficient serialization and deserialization algorithms (binary serialization protocols are much faster than text-based ones). Other caches are structured as Prefix-Key-Value, the main difference being who decides which server the actual data is routed to for processing. A single cache entry should not be too large; anything over 64 KB deserves attention, because any given key is always served by one particular server, and a large hot key can easily max out that server's bandwidth or computing capacity.
A database can handle far less traffic than a cache, easily an order of magnitude less. SQL is a very convenient but actually rather inefficient communication protocol. DB optimization usually starts with reducing writes, reads, and round trips, and with batching. DB optimization is a complex discipline that is hard to cover in one article; here are just some examples I consider representative:
- Reduce network round trips with MultiQuery: databases such as MySQL support packing multiple SQL statements into one request sent to the DB server, collapsing several network interactions into one.
- Use BatchInsert instead of multiple inserts: This is common.
- Use a KV protocol in place of SQL: the Alibaba Cloud database team attached a KV engine to the database server that reads InnoDB data directly, bypassing the SQL layer of the database and making unique-key queries about 10 times faster than via SQL.
- Combine with the business: a Taobao order can use up to 10 red packets at once, which means up to 10 UPDATE SQL statements for a single order. Assuming N red packets are used in one order, analysis of the business behavior shows that the first N-1 packets are always fully consumed and only the last one may be partially consumed. For the fully consumed red packets, we can complete the update with a single SQL statement:
update red_envelop set balance = 0 where id in (...);
- Hot-spot optimization: inventory hot spots are a problem every e-commerce platform faces. Using the database to deduct inventory is certainly the most reliable solution, but it is basically hard to break through a bottleneck of about 500 TPS per hot row. The Alibaba Cloud database team designed a new set of SQL hints that, together with the MultiQuery technique mentioned above, completes an inventory deduction with a single database round trip. Combined with targeted optimization of the database kernel, a hot-row deduction capability of 8W TPS can be reached. In the statement below, commit_on_success means the update is committed immediately if it succeeds, minimizing the time the hot inventory row stays locked; target_affect_row(1) and rollback_on_fail make the update fail and roll back the whole transaction when the stock is already sold out (i.e., when inv_count - 1 >= 0 no longer holds).
insert the inventory deduction flow record; update /* commit_on_success rollback_on_fail target_affect_row(1) */ inventory set inv_count = inv_count - 1 where inv_id = 11222 and inv_count - 1 >= 0;
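The MultiQuery and BatchInsert ideas can be illustrated by the SQL they produce. The helper below only splices a multi-row INSERT to show the shape of the statement; real code would bind values through a PreparedStatement (or enable rewriteBatchedStatements=true in the MySQL JDBC driver) rather than concatenate raw strings.

```java
import java.util.List;
import java.util.StringJoiner;

// Collapses N single-row INSERTs into one multi-row statement, which is
// the essence of BatchInsert: one round trip instead of N. Values here
// are passed as pre-rendered strings purely for illustration.
public class BatchSql {
    public static String buildBatchInsert(String table, List<String> columns, List<List<String>> rows) {
        StringJoiner values = new StringJoiner(", ");
        for (List<String> row : rows) {
            values.add("(" + String.join(", ", row) + ")");
        }
        return "INSERT INTO " + table
                + " (" + String.join(", ", columns) + ") VALUES " + values;
    }
}
```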
3.8 Operating Environment
The environment our code runs in involves a great deal of knowledge we need to master, and if all the optimizations above are still not enough, we have to go a step further. This can be a difficult process, but also a fun one, because you finally get the chance to talk with the experts in all kinds of fields.
Currently, most middleware code runs alongside our business code: monitoring collectors, message clients, RPC clients, configuration push clients, DB connection components, and so on. If you find performance issues in these components, do not be afraid to raise them; done well, it hurts no one :).
I've come across some scenarios like this:
- Occasional bursts of YGC across many applications: the root cause found was that when the address list of a service we depend on changes (restart, offline, scale-out, etc.), the RPC client receives a large number of push messages, parses them, and then updates a large in-memory structure. The optimization suggestions were: 1) change address pushes from full to incremental; 2) change the address list from the service-interface dimension to the application dimension.
- Too much string concatenation in the DB connection component: the component needs to parse SQL to compute sharding (database and table split) information, but the implementation was not elegant; it concatenated too many strings, so executing SQL consumed too much memory.
There is little we can do about container technology itself beyond adopting the latest offerings (such as Alibaba Cloud's X-Dragon servers), but there is a lot we can do about core binding and container scheduling. We need to pay attention to which applications ours runs alongside: whether they compete for resources with each other, whether they are scheduled across NUMA nodes, whether resources are oversold, and so on (this, of course, requires the container team to provide the corresponding inspection tools). There are two main points to consider here:
- Whether online/offline colocation is supported: online tasks need real-time response, while offline tasks need lots of machines. In a scenario like Double 11, borrowing offline machines for a few hours can save a lot of money by reducing the corresponding online machine purchases.
- Business-aware scheduling: it is fine to deploy high-consumption and low-consumption applications together, as long as their peak times do not coincide exactly.
- Coroutines were developed to solve the problem of too many threads in IO-heavy applications.
- Value containers were developed to solve the problem of too many small objects in Java collections (e.g., the K and V of a HashMap can only be wrapper types).
- GCIH (GC Invisible Heap) was developed to solve the problem of overly long GC pauses when the Java heap is very large (and, of course, the inflexibility of Java memory management); some hot promotional data during Taobao's Double 11 is stored in GCIH.
- The JIT warmup (launch hint) feature was developed to address Java's slow startup (code must run thousands of times before being JIT-compiled, and the process repeats on every startup).
- Options to remove aggressive JIT optimizations were developed to address de-optimization at business peaks (i.e., code paths not used in normal times are suddenly needed at the peak, such as offers that take effect at 0:00).
Just because the current JVM implementation behaves the way you know it does not mean that is the only reasonable way.
3.8.4 Operating System
Most of us use Linux, and new kernel versions usually bring new features and performance improvements; the operating system also does a lot of work to support containers (Docker, etc.). On the host operating system, you need to configure transparent huge pages, spread NIC interrupts across CPUs, configure NUMA, configure the NTP (Network Time Protocol) service, and configure the clock source (otherwise clock_gettime may be slow). There is also resource isolation: CPU isolation (priority scheduling for high-priority tasks, LLC isolation, hyper-threading isolation, etc.), memory isolation (memory bandwidth, isolation of memory reclaim to avoid global reclaim), network isolation (network bandwidth, packet-level priority tiers), and file IO isolation (upper and lower bounds on file IO bandwidth, limits on specific operations), and so on.
Most kernel-level optimizations are out of our reach, but we need to know some of the key kernel parameters that affect performance and be able to understand how most kernel mechanisms work.
We usually use the Intel x86 CPU architecture, for example the Intel 8269CY we run on, which costs more than 40,000 RMB per CPU yet offers only 26C52T (26 cores, 52 threads). By comparison, AMD's EPYC 7763 has far more impressive specs (64C128T, 256 MB of L3 cache, 8-channel DDR4-3200 memory, and 204 GB/s of memory bandwidth) at just over 30,000 RMB. Of course, it is not entirely fair to compare AMD's 2021 product with Intel's 2019 product; Intel's newer parts still top out at 38C76T (though that is nearly 50% more than the previous generation).
Besides x86 processors, ARM 64-bit processors are also pushing into server products, and this industry chain can be fully domestic. The Kunpeng 920-6426 processor released by Huawei in early 2019 uses a 7nm process, has 64 CPU cores, and runs at 2.6 GHz. Although its single-core performance is only about 2/3 that of the Intel 8269CY, it has more than twice as many cores; combined with its affordable price, CPU cost drops by nearly half for the same computing power (although the saving on the whole physical machine is smaller).
Since Double 11 2020, Taobao has been running a domestically built data center in Nantong, Jiangsu province, that supports 10,000 transactions per second, using the Kunpeng 920-6426 processor. For Double 11 2021, it will also use the Yitian 710 processor developed in-house by Alibaba Cloud (also an ARM 64-bit design). In the future, processors based on the RISC-V architecture may be designed as well. All of these facts suggest that we still have a lot of room in our processor choices.
Besides general-purpose processors, for some special computations we can also use dedicated chips, for example: GPUs for accelerated computing, NPUs (neural-network accelerator chips such as Hanguang) for AI deep-learning inference, and FPGAs for high-performance network data processing (Alibaba Cloud's X-Dragon servers use the X-Dragon MOC card), and so on.
There have also been attempts to design processors that run Java bytecode directly, though they were ultimately abandoned because of their complexity.
All this shows that hardware is constantly evolving according to usage scenarios, always full of imagination.