Introduction: With the rise of microservice architectures, call dependencies on the server side have become increasingly complex. To quickly locate abnormal components and performance bottlenecks, adopting distributed tracing (Trace) has become a consensus in the field of IT operations. But what are the differences between self-built open source, hosted open source, and commercial Trace products, and how should you choose? This is a question many users face when evaluating Trace solutions, and it is also the point most easily misunderstood.

Author | Yahai    Source | Alibaba Technology official account

To answer this, we need to look at two things. The first is to sort out the core risks and typical problem scenarios of online applications. The second is to compare the capabilities of the three Trace approaches: self-built open source, hosted open source, and commercial self-developed. As the saying goes, “know yourself and know your enemy, and you will never be defeated.” Only by matching the options against your actual situation can you choose the most suitable plan.

I. “Two Types of Risks” and “Ten Typical Problems”

Online application risks fall mainly into two categories: “errors” and “slowness.” “Errors” are usually caused by the program not behaving as expected, for example the JVM loading the wrong version of a class, code taking an abnormal branch, or environment configuration mistakes. “Slowness” is usually caused by insufficient resources, for example a traffic burst maxing out the CPU, exhaustion of microservice or database thread pools, or memory leaks leading to continuous FGC.

Whether the problem is an “error” or “slowness,” users want to locate the root cause quickly, stop the loss in time, and eliminate hidden dangers. However, in the author’s five years of experience developing and operating Trace systems, including preparing for Double Eleven, most online problems cannot be effectively located and resolved with basic link-tracing capabilities alone. The complexity of online systems means that a good Trace product must provide more comprehensive and effective data diagnosis capabilities, such as code-level diagnosis, memory analysis, and thread pool analysis. At the same time, to improve the usability and stability of the Trace component itself, capabilities such as dynamic sampling, lossless statistics, and automatic convergence of interface names are also needed. This is why mainstream Trace products in the industry are gradually evolving into APM and application-observability products. For ease of understanding, this article continues to use “Trace” to refer to application-layer observability as a whole.

In summary, to guarantee the service stability of online applications, when selecting a link-tracing solution we should look beyond the common Trace basics (call chains, service monitoring, link topology) and also consider the following “ten typical problems” (using Java applications as an example), comparing how self-built open source, hosted open source, and commercial self-developed Trace products perform on each.

1. [Code-level automatic diagnosis] An interface occasionally times out; the call chain shows only the name of the timed-out interface, not the internal methods, so the root cause cannot be located and the problem is hard to reproduce. What can be done?

Anyone responsible for stability will recognize this scenario: the system has occasional interface timeouts at night or on the hour, and by the time the problem is noticed and investigated, the abnormal scene is already gone, the issue is hard to reproduce, and a manual jstack cannot diagnose it. Current open-source link-tracing implementations can only show the timed-out interface on the call chain; which section of code actually caused the problem cannot be located, so the issue is never resolved. The scenario repeats itself until a serious failure occurs, causing significant business losses.

Solving this requires an accurate, lightweight automatic slow-call monitoring capability that can faithfully restore the scene at the moment of execution and automatically record the complete method stack of a slow call without any pre-instrumented probes. As shown in the figure below, when an interface call exceeds a certain threshold (for example, 2 seconds), monitoring of the thread handling the slow request starts and stops immediately once the request ends (at the 15th second in this example). The thread snapshots captured over the request’s lifecycle are retained, restoring the complete method stack and its time consumption.
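To make the mechanism concrete, here is a minimal sketch in plain Java of threshold-triggered stack sampling, assuming a simple wrapper around the request rather than any vendor's agent; the class, constant, and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/**
 * Minimal sketch of threshold-triggered slow-call diagnosis: once a request runs
 * longer than SLOW_THRESHOLD_MS, its handling thread's stack is snapshotted
 * periodically until the request ends (hard-capped at MAX_WATCH_MS).
 */
public class SlowCallWatcher {
    private static final long SLOW_THRESHOLD_MS = 2_000;  // start watching after 2 seconds
    private static final long MAX_WATCH_MS = 15_000;      // never watch longer than 15 seconds
    private static final long SAMPLE_INTERVAL_MS = 100;   // stack snapshot period

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    /** Wraps a request; snapshots are collected only if it turns out to be slow. */
    public <T> T watch(Callable<T> request) throws Exception {
        Thread worker = Thread.currentThread();
        List<StackTraceElement[]> snapshots = Collections.synchronizedList(new ArrayList<>());

        // Delayed start: sampling begins only if the request is still running after the threshold.
        ScheduledFuture<?> sampler = scheduler.scheduleAtFixedRate(
                () -> snapshots.add(worker.getStackTrace()),
                SLOW_THRESHOLD_MS, SAMPLE_INTERVAL_MS, TimeUnit.MILLISECONDS);
        // Hard stop so a stuck request is not watched forever.
        ScheduledFuture<?> hardStop = scheduler.schedule(
                () -> sampler.cancel(false), MAX_WATCH_MS, TimeUnit.MILLISECONDS);
        try {
            return request.call();
        } finally {
            sampler.cancel(false);   // stop immediately once the request ends
            hardStop.cancel(false);
            if (!snapshots.isEmpty()) {
                // A real agent would merge these snapshots into a method-level timeline and report it.
                System.out.println("Captured " + snapshots.size() + " stack snapshots for a slow call");
            }
        }
    }
}
```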

2. [Pool monitoring] The microservice/database thread pool is often full, causing service timeouts that are very hard to troubleshoot. How can this be solved?

Business requests timing out because a microservice or database thread pool is full is an everyday occurrence. Engineers with troubleshooting experience will instinctively check the relevant component logs; Dubbo, for example, writes exception records when its thread pool is full. But if the component does not log thread pool information, or the operations engineer lacks troubleshooting experience, this kind of problem becomes very difficult. Current open-source Trace products generally provide only an overview of JVM monitoring; there is no way to inspect the state of an individual thread pool, let alone determine whether it is exhausted.

The pool monitoring provided by commercial self-developed Trace products shows the maximum thread count, current thread count, active thread count, and other metrics of a specified thread pool directly, exposing the risk of exhaustion or a high water level at a glance. Alerts can also be set on thread pool usage: for example, send an SMS notification when the current thread count of the Tomcat thread pool exceeds 80% of the maximum, and trigger a phone alert when it reaches 100%.
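For reference, here is a rough sketch of what such pool monitoring amounts to for a standard JDK ThreadPoolExecutor; the thresholds and the alert actions (plain log lines here) are placeholders, not any product's API.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/**
 * Minimal sketch of pool monitoring: periodically read the metrics of a
 * ThreadPoolExecutor and warn when usage crosses a high-water-mark threshold.
 */
public class ThreadPoolMonitor {
    public static void monitor(String poolName, ThreadPoolExecutor pool) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            int max = pool.getMaximumPoolSize();
            int current = pool.getPoolSize();
            int active = pool.getActiveCount();
            int queued = pool.getQueue().size();
            double usage = max == 0 ? 0 : (double) active / max;
            System.out.printf("[%s] max=%d current=%d active=%d queued=%d usage=%.0f%%%n",
                    poolName, max, current, active, queued, usage * 100);
            if (usage >= 1.0) {
                System.err.println("[" + poolName + "] thread pool exhausted (stand-in for a phone alert)");
            } else if (usage >= 0.8) {
                System.err.println("[" + poolName + "] thread pool above 80% (stand-in for an SMS warning)");
            }
        }, 0, 15, TimeUnit.SECONDS);
    }
}
```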

3. [Thread analysis] CPU usage is very high during a load test or after a change release. How can application performance bottlenecks be analyzed and optimized?

When running a major load test, or releasing a version containing large logic changes, we often see CPU usage suddenly spike without being able to pinpoint which piece of code is responsible. All we can do is run jstack repeatedly, compare thread-state changes by eye, and then optimize and retest based on experience, which consumes a great deal of effort for mediocre results.

Is there a faster way to analyze application performance bottlenecks? The answer is yes, and there is more than one. The most common is to manually trigger a ThreadDump over a period of time (say 5 minutes) and then analyze the thread overhead and method-stack snapshots for that window. The drawbacks of a manually triggered ThreadDump are its high performance overhead, which makes it unsuitable for continuous use, and its timing: if CPU spikes during a load test, by the time the test ends and the machines have been restarted, it is already too late to capture a manual ThreadDump.

The second approach is an always-on thread analysis capability that automatically records the state, count, CPU time, and internal method stacks of each thread category. For any time window, sorting by CPU time locates the thread category with the highest CPU cost, and clicking into its method stack reveals the exact code hot spot. In the figure below, for example, a large number of methods are BLOCKED while acquiring a database connection, which can be optimized by enlarging the database connection pool.
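A minimal sketch of the underlying idea using only the JDK's ThreadMXBean is shown below: rank live threads by CPU time and print a truncated stack for each. A real agent would additionally group threads into categories and keep history; this sketch assumes the JVM supports thread CPU-time measurement.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;
import java.util.Comparator;

/** Rank live threads by accumulated CPU time and show their top stack frames. */
public class ThreadCpuTop {
    public static void printTop(int topN) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Arrays.stream(mx.getAllThreadIds())
                .boxed()
                .sorted(Comparator.comparingLong(mx::getThreadCpuTime).reversed())
                .limit(topN)
                .forEach(id -> {
                    ThreadInfo info = mx.getThreadInfo(id, 5);   // keep only the top 5 frames
                    if (info == null) return;                    // the thread may have exited
                    System.out.printf("%s [%s] cpu=%dms%n",
                            info.getThreadName(), info.getThreadState(),
                            mx.getThreadCpuTime(id) / 1_000_000);
                    for (StackTraceElement frame : info.getStackTrace()) {
                        System.out.println("    at " + frame);
                    }
                });
    }

    public static void main(String[] args) {
        printTop(5);   // print the five most CPU-expensive threads
    }
}
```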

4. [Exception diagnosis] After a release or configuration change, a large number of interface errors are reported, but the cause cannot be located immediately. What can be done?

The biggest culprit behind online instability is change: whether an application release or a dynamic configuration change, either can cause the program to behave abnormally. So how do we quickly assess the risk of a change, find problems as early as possible, and stop losses in time?

Here we share a release-interception practice from Alibaba's internal release system. One of its most important monitoring indicators is the comparison of Java Exception/Error counts. Whether the culprit is a NullPointerException (NPE) or an OutOfMemoryError (OOM), counting all exceptions, or specific ones, lets you quickly detect online anomalies, especially when comparing the counts before and after the change on a timeline.

On a dedicated exception analysis and diagnosis page, you can view the trend and stack details of each exception type, as well as the distribution of associated interfaces, as shown in the following figure.
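At its core the indicator is just an exception count per type that can be snapshotted before and after a change. A hand-rolled sketch might look like the following; a real agent instruments exception sites automatically, whereas here record() would have to be called from catch blocks, which is an assumption.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Count exceptions by type and compare two snapshots taken around a change. */
public class ExceptionCounter {
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    /** Called from catch blocks (an assumption of this sketch, not an agent feature). */
    public void record(Throwable t) {
        counts.computeIfAbsent(t.getClass().getName(), k -> new LongAdder()).increment();
    }

    /** Snapshot of current counts, e.g. taken right before and a few minutes after a release. */
    public Map<String, Long> snapshot() {
        Map<String, Long> copy = new ConcurrentHashMap<>();
        counts.forEach((type, adder) -> copy.put(type, adder.sum()));
        return copy;
    }

    /** Exception types whose count grew between the two snapshots flag the change as risky. */
    public static void compare(Map<String, Long> before, Map<String, Long> after) {
        after.forEach((type, count) -> {
            long delta = count - before.getOrDefault(type, 0L);
            if (delta > 0) {
                System.out.printf("%s increased by %d after the change%n", type, delta);
            }
        });
    }
}
```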

5. [Memory diagnosis] The application runs frequent FGC and a memory leak is suspected, but the offending objects cannot be located. What should be done?

FullGC is one of the most common problems in Java applications; it can be caused by anything from objects being created too quickly to memory leaks. The most effective way to investigate FGC is to take a HeapDump, which makes the memory footprint of every kind of object clearly visible.
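For context, a heap dump can be triggered programmatically through the JDK's HotSpot diagnostic MXBean; the sketch below shows that call, while commercial agents wrap it (or equivalent JVMTI mechanisms) behind the one-click and auto-dump features described next. The file path and trigger condition are illustrative.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

/** Trigger a heap dump via the JDK's HotSpot diagnostic MXBean. */
public class HeapDumper {
    public static void dumpHeap(String filePath, boolean liveObjectsOnly) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // liveObjectsOnly=true forces a GC first and dumps only reachable objects;
        // the target file must not already exist.
        bean.dumpHeap(filePath, liveObjectsOnly);
    }

    public static void main(String[] args) throws Exception {
        // Dump when a leak is suspected; the .hprof file can then be analyzed with tools such as MAT.
        dumpHeap("/tmp/suspected-leak.hprof", true);
    }
}
```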

A console-based memory snapshot feature (no command line required) allows a one-click HeapDump and analysis on a chosen machine, greatly improving the efficiency of troubleshooting memory problems. It also supports automatically dumping and preserving the abnormal snapshot in memory-leak scenarios, as shown in the following figure:

6. [Online debugging] The same code behaves differently when running online than it does when debugged locally. How can this be troubleshot?

The code passes local debugging, yet as soon as it reaches the production environment all kinds of errors appear. What exactly is wrong? Most developers have lived through this nightmare. The causes include Maven dependencies resolving to multiple versions, dynamic configuration parameters differing between environments, and different dependency components being present in different environments.

To resolve cases where the code running online does not behave as expected, we need an online debugging and diagnosis tool that can show the real-time running state of the source code, the method stacks actually entered and executed along with their time consumption, and the values of static fields or object instances, making online debugging as convenient as local debugging, as shown in the figure below:
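One of the cheapest checks in this situation, sketched below, is to print which jar and classloader a suspicious class was actually loaded from, which quickly exposes Maven dependency conflicts between environments; full online debugging of method stacks and field values still requires an agent-based tool.

```java
import java.net.URL;
import java.security.CodeSource;

/** Print where a class was loaded from, to spot environment-specific dependency conflicts. */
public class ClassOriginCheck {
    public static void printOrigin(Class<?> clazz) {
        CodeSource source = clazz.getProtectionDomain().getCodeSource();
        URL location = (source == null) ? null : source.getLocation();
        System.out.println(clazz.getName() + " loaded from "
                + (location == null ? "bootstrap/JDK" : location)
                + " by " + clazz.getClassLoader());
    }

    public static void main(String[] args) {
        // In practice you would pass the class whose online behavior differs from local debugging.
        printOrigin(ClassOriginCheck.class);
        printOrigin(java.util.List.class);   // bootstrap example: no code source
    }
}
```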

7. [Full-link tracing] Users report that the website is very slow to open. How can call tracing be established across the full link, from the Web front end to the server side?

The key to linking the front end and back end into one trace is that both follow the same context-propagation protocol. Currently, open-source solutions mostly cover back-end applications and lack front-end instrumentation (Web/H5, mini programs, and so on). A front-to-back full-link tracing scheme looks as follows:

  • Header propagation format: the Jaeger format is adopted. The key is uber-trace-id, and the value is {trace-id}:{span-id}:{parent-span-id}:{flags} (a minimal parsing sketch follows this list).
  • Front-end access: use CDN script injection or NPM packages for low-code integration, supporting Web/H5, Weex, and various mini-program scenarios.
  • Back-end access: Java applications should preferably use the ARMS Agent, whose non-invasive instrumentation requires no code changes and supports advanced features such as edge diagnosis, lossless statistics, and precise sampling. Custom methods can be instrumented explicitly with the OpenTelemetry SDK. Non-Java applications are recommended to integrate via Jaeger and report data to the ARMS endpoint; ARMS is fully compatible with context propagation and trace display across multi-language applications.
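A minimal sketch of building and parsing that uber-trace-id header value is shown below, for cases where a front end or an uninstrumented service has to be wired into the same trace by hand; the class name is illustrative.

```java
/** The Jaeger propagation value: {trace-id}:{span-id}:{parent-span-id}:{flags}. */
public class UberTraceId {
    public final String traceId;
    public final String spanId;
    public final String parentSpanId;
    public final String flags;   // "1" commonly indicates a sampled trace

    public UberTraceId(String traceId, String spanId, String parentSpanId, String flags) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
        this.flags = flags;
    }

    /** Serialize to the value carried in the "uber-trace-id" header. */
    public String toHeaderValue() {
        return String.join(":", traceId, spanId, parentSpanId, flags);
    }

    /** Parse an incoming header value; returns null if it does not match the 4-part format. */
    public static UberTraceId parse(String headerValue) {
        if (headerValue == null) return null;
        String[] parts = headerValue.split(":");
        if (parts.length != 4) return null;
        return new UberTraceId(parts[0], parts[1], parts[2], parts[3]);
    }
}
```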

Alibaba Cloud ARMS currently bases its full-link tracing on the Jaeger protocol, and SkyWalking protocol support is under development to enable lossless migration for users with self-built SkyWalking. The resulting call chain spanning the front end, Java applications, and non-Java applications is shown in the figure below:

8. [Lossless statistics] Call-chain logging costs too much, but after client-side sampling is enabled the monitoring charts become inaccurate. How can both problems be solved?

Call-chain log volume is proportional to traffic. For consumer-facing services with very large traffic, reporting and storing every call chain is extremely expensive. But once client-side sampling is enabled, statistics aggregated only from the sampled logs suffer severe sample skew and no longer accurately reflect actual service traffic or latency.

To solve this, the client agent needs to support lossless statistics: within each reporting period (typically 15 seconds), no matter how many requests an interface receives, only one aggregated metric record is reported for it. Statistical indicators therefore stay accurate regardless of the call-chain sampling rate, and users can safely lower the sampling rate, cutting call-chain cost by more than 90%. The larger the traffic and cluster scale, the more significant the savings.
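A minimal sketch of such client-side pre-aggregation follows, assuming a 15-second flush window and using stdout in place of a real metrics backend: every request updates an aggregate, so the reported statistics are exact regardless of how the call chains themselves are sampled.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

/** Pre-aggregate request metrics per interface and flush one record per window. */
public class LosslessStats {
    private static class Agg {
        final LongAdder count = new LongAdder();
        final LongAdder errorCount = new LongAdder();
        final LongAdder totalMillis = new LongAdder();
    }

    private final Map<String, Agg> window = new ConcurrentHashMap<>();

    /** Called for every request, sampled or not, so the aggregates stay exact. */
    public void record(String interfaceName, long latencyMillis, boolean error) {
        Agg agg = window.computeIfAbsent(interfaceName, k -> new Agg());
        agg.count.increment();
        agg.totalMillis.add(latencyMillis);
        if (error) agg.errorCount.increment();
    }

    /** Flush and reset every 15 seconds; one line per interface, not one per request. */
    public void startReporting() {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            window.forEach((name, agg) -> {
                long count = agg.count.sumThenReset();
                if (count == 0) return;
                System.out.printf("%s count=%d errors=%d avgMs=%.1f%n",
                        name, count, agg.errorCount.sumThenReset(),
                        (double) agg.totalMillis.sumThenReset() / count);
            });
        }, 15, 15, TimeUnit.SECONDS);
    }
}
```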

9. [Automatic interface-name convergence] A RESTful interface’s URL names diverge because of parameters such as timestamps and UIDs, turning the monitoring charts into meaningless scattered points. What can be done?

When an interface name contains variable parameters such as a timestamp or UID, interfaces of the same kind end up with different names, each occurring only a handful of times. Such names provide no monitoring value and can create storage and computation hotspots that threaten cluster stability. Divergent interfaces therefore need to be classified and aggregated to restore analytical value and protect cluster stability.

What is needed is an automatic convergence algorithm for interface names that recognizes variable parameters, aggregates interfaces of the same kind, and lets users observe trends per category, better matching their monitoring needs while avoiding the data hotspots caused by interface divergence and improving overall stability and performance. For example, requests under /safe/getXXXInfo/XXXX would all be grouped together; otherwise every request becomes its own single-point chart and readability suffers.
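A simplified sketch of the convergence idea is shown below, assuming plain regex rules for recognizing variable path segments; real products use more adaptive algorithms, so these rules are only an illustration.

```java
import java.util.regex.Pattern;

/** Replace variable-looking path segments with a placeholder so endpoints aggregate. */
public class EndpointNormalizer {
    private static final Pattern NUMERIC = Pattern.compile("^\\d+$");
    private static final Pattern UUID_SEGMENT = Pattern.compile(
            "^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$");
    private static final Pattern LONG_HEX = Pattern.compile("^[0-9a-fA-F]{16,}$");

    public static String normalize(String path) {
        StringBuilder out = new StringBuilder();
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) continue;
            boolean variable = NUMERIC.matcher(segment).matches()
                    || UUID_SEGMENT.matcher(segment).matches()
                    || LONG_HEX.matcher(segment).matches();
            out.append('/').append(variable ? "{param}" : segment);
        }
        return out.length() == 0 ? "/" : out.toString();
    }

    public static void main(String[] args) {
        // Both requests aggregate into the same monitored endpoint: /safe/getXXXInfo/{param}
        System.out.println(normalize("/safe/getXXXInfo/1638326400000"));
        System.out.println(normalize("/safe/getXXXInfo/9f86d081-0000-4c2a-9d5e-1b1f0a9c0d3e"));
    }
}
```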

10. [Dynamic configuration delivery] A sudden traffic surge leaves resources short and non-core functions must be degraded immediately. How can dynamic degradation or tuning be achieved without restarting the application?

The unexpected always happens: a sudden traffic surge, an external attack, or a machine failure can leave system resources short. To keep the most important core business unaffected, we often need to degrade non-core functions dynamically, without restarting the application, in order to free resources, for example by lowering the client call-chain sampling rate or disabling diagnosis modules with high performance overhead. Conversely, there are times when we need to dynamically turn on an expensive deep-diagnosis capability, such as a memory dump, to analyze the current abnormal scene.

Whether degrading or enabling, both require pushing configuration changes down dynamically without restarting the application. Open-source Trace products usually lack this capability, so teams must build their own metadata configuration center and modify code accordingly. Commercial Trace products not only support dynamic configuration push-down but can refine it to independent settings per application: if application A has occasional slow calls, its automatic slow-call diagnosis switch can be turned on for monitoring, while application B, which is sensitive to CPU overhead, can keep the switch off. Each application gets what it needs without affecting the other.
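A minimal sketch of how dynamic configuration delivery can work inside an agent follows, assuming the configuration source is abstracted as a Supplier standing in for a config center or push channel; the setting names and refresh interval are illustrative.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/**
 * Hold the agent's current settings in an AtomicReference and refresh them from a
 * configuration source, so changes take effect without restarting the application.
 */
public class DynamicAgentConfig {
    /** Immutable settings object swapped atomically on each refresh. */
    public static final class Settings {
        public final double traceSampleRate;      // e.g. lowered during a traffic spike
        public final boolean slowCallDiagnosisOn; // expensive module that can be toggled per app

        public Settings(double traceSampleRate, boolean slowCallDiagnosisOn) {
            this.traceSampleRate = traceSampleRate;
            this.slowCallDiagnosisOn = slowCallDiagnosisOn;
        }
    }

    private final AtomicReference<Settings> current;

    public DynamicAgentConfig(Settings initial, Supplier<Settings> configSource) {
        this.current = new AtomicReference<>(initial);
        // Poll the config source; a push channel (long polling, streaming) would update the same reference.
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            Settings latest = configSource.get();
            if (latest != null) current.set(latest);
        }, 30, 30, TimeUnit.SECONDS);
    }

    /** Hot-path read: every request sees the most recently delivered settings. */
    public Settings get() {
        return current.get();
    }
}
```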

II. Open Source Self-Built vs. Open Source Hosted vs. Commercial Self-Developed

The “ten typical problems” of production environments listed above are ones that current self-built or hosted open-source Trace products cannot yet solve well. To be fair, open-source solutions have many excellent qualities, such as broad component support, unified multi-language solutions, and flexible data and page customization. But open source is not a panacea, and the production environment is not a testing ground. When the lifeline of online stability is at stake, we must evaluate carefully and research the strengths and weaknesses of each option in depth; comparing only the common basic capabilities plants serious hidden dangers for later adoption and rollout.

Due to space limits, this article analyzes the shortcomings of self-built and hosted open-source solutions through only ten typical problem scenarios. The point is that doing Trace well is not easy, and ignoring this may force you to re-walk the pitfalls that commercial self-developed products have already worked through. It is much like running an online e-commerce business: it is far more than just opening a shop online, because a chain of complex work hides behind it, such as product polishing, traffic growth, user conversion, and word-of-mouth operation, and entering the business rashly can end in heavy losses.

So what are the advantages of self-built or hosted open source? How do they compare with commercial self-developed Trace products in terms of product features, resource and labor cost, secondary development, multi-cloud deployment, stability, and ease of use? Stay tuned for the next article, a comprehensive comparison of self-built, hosted, and commercial self-developed Trace products.


This article is original content from Alibaba Cloud and may not be reproduced without permission.