Introduction: I have recently been investigating and selecting distributed call chain monitoring components, and picked three mainstream APM components to try out and compare. I originally intended to write a single article, but it grew too long, so I split it into two. It has been nearly a month since the first article; I have been busy at work, so updates have been slow. This article compares the three APM components and presents their performance test results.

1. Review

The previous article focused on hands-on practice with three distributed call chain monitoring components. The background is the microservice architecture. Its benefits need no repeating here, and the following diagram is a typical illustration.

Microservice invocation chain

As a result, system complexity increases. The more complex the system, the harder it is to solve problems such as failures or performance issues. In a three-tier architecture this is not too difficult: only three components need to be analyzed (web server, application server, and database), and there are not many servers. In an N-tier architecture, however, a large number of components and servers must be investigated. Another problem is that analyzing a single component rarely shows the big picture. When a low-visibility problem occurs, the more complex the system, the longer it takes to find the cause. Worst of all, in some cases we may never find it.

Fault location has already been mentioned above. The problems of a business system built on microservices fall into three main categories:

  • Faults are difficult to locate. A simple operation may be completed jointly by more than a dozen microservices, and these microservices are owned by different teams. When a problem arises, at worst we may need a dozen teams to solve it.
  • Call links are hard to sort out. Applications do not form a visible topology, so the owners of a service do not know who its downstream consumers are.
  • Resources are wasted and capacity estimation is difficult. For some services, CPU and memory consumption may be less than 10%, leaving the physical machines far underutilized. This is closely related to capacity estimation: estimating peak machine capacity either too high or too low is costly.

The main purpose of APM is to solve these problems: it collects, stores, and analyzes call event data in distributed systems to help developers and operators with fault diagnosis, capacity estimation, performance bottleneck location, and call link sorting. The first article already discussed the requirements for a link monitoring component:

  • Code intrusiveness
  • Probe performance cost
  • Comprehensive call link data analysis
  • Scalability

Pinpoint also lists several points in its wiki:

  • Distributed transaction tracking, which tracks messages across distributed applications
  • Automatically detect the application topology to help you understand the application architecture
  • Scale horizontally to support large server clusters
  • Provide code-level visibility to easily locate failures and bottlenecks
  • Use bytecode enhancement to add new functionality without modifying the code

Let's look at the distributed call chain monitoring components against these requirements.

2. APM comparison

The requirements are listed above, but they are not general enough. I extracted the following items for comparison:

  1. Probe performance: mainly the agent's impact on service throughput, CPU, and memory. The scale and dynamic nature of microservices make the cost of data collection significantly higher.
  2. Collector scalability: the ability to scale horizontally to support large server clusters.
  3. Comprehensive call link data analysis: code-level visibility to easily locate failures and bottlenecks.
  4. Transparency to development and easy switching: add new features without modifying code, and enable or disable them easily.
  5. Complete call chain application topology: automatically detect the application topology to help you understand the application architecture.

These five points are distilled from the main requirements.

2.1 Probe performance

Probe performance is what I care about most. After all, an APM is only a tool: if enabling the link monitoring component cut throughput in half, that would be unacceptable. I load-tested SkyWalking, Zipkin, and Pinpoint and compared them with a baseline (no probe).

The test uses a typical Spring-based application that includes Spring Boot, Spring MVC, a Redis client, and MySQL. With monitoring enabled, the probe captures 5 spans per trace (1 Tomcat, 1 Spring MVC, 2 Jedis, 1 MySQL). This is basically the same test application as SkywalkingTest.

Three concurrency levels are simulated: 500, 750, and 1000 users. Using JMeter, each thread sends 30 requests with a think time of 10 ms. The sampling rate is 1, i.e. 100%, which may differ from production settings. Pinpoint's default sampling rate setting is 20, which means 1 in 20 traces, i.e. 5%; it was changed to 100% in the agent configuration file. Zipkin also defaults to 1. Together with the baseline, that gives 12 test runs (4 configurations × 3 concurrency levels). Let's look at the summary table.
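To make the test plan concrete, the 12 runs can be enumerated as in the sketch below. It only illustrates the combinations and the request counts implied by the numbers above; it is not the actual JMeter plan, and the names `CONFIGS` and `build_test_matrix` are assumptions made for this example.

```python
from itertools import product

# Configurations under test: three probes plus a probe-free baseline.
CONFIGS = ["baseline", "Zipkin", "SkyWalking", "Pinpoint"]
CONCURRENT_USERS = [500, 750, 1000]
REQUESTS_PER_THREAD = 30  # each JMeter thread sends 30 requests


def build_test_matrix():
    """Return one entry per (configuration, concurrency level) combination."""
    return [
        {
            "config": config,
            "users": users,
            "total_requests": users * REQUESTS_PER_THREAD,
        }
        for config, users in product(CONFIGS, CONCURRENT_USERS)
    ]


matrix = build_test_matrix()
for run in matrix:
    print(run["config"], run["users"], run["total_requests"])
```

Each configuration is run at every concurrency level, so a single slow probe shows up as a lower throughput row for the same `users` value.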

The performance comparison

As the table above shows, among the three link monitoring components, the SkyWalking probe has the least impact on throughput, and Zipkin is in the middle. The Pinpoint probe has the most noticeable impact: at 500 concurrent users, the throughput of the test service drops from 1385 to 774, which is a large impact. As for CPU and memory, in load tests on our internal servers the impact was almost always within 10%.
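As a quick check on how large the Pinpoint drop is in relative terms, the calculation can be sketched as follows. The function name `throughput_drop_percent` is an assumption for this example; the two figures are the ones quoted above for 500 concurrent users.

```python
def throughput_drop_percent(baseline, with_probe):
    """Relative throughput loss caused by the probe, as a percentage of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return (baseline - with_probe) / baseline * 100.0


# Figures quoted in the text for 500 concurrent users.
pinpoint_drop = throughput_drop_percent(1385, 774)
print(f"Pinpoint at 500 users: {pinpoint_drop:.1f}% lower throughput")
```

For these figures the loss is roughly 44% of the baseline throughput.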

2.2 Collector scalability

Collector scalability means that the collector can scale horizontally to support large server clusters.

  • Zipkin: in the previous article we set up zipkin-server (essentially a package provided out of the box). The Zipkin agent communicates with zipkin-server over HTTP or MQ. Because HTTP reporting can affect normal request handling, asynchronous communication over MQ is recommended, with zipkin-server consuming the data by subscribing to specific topics. This is naturally scalable: multiple zipkin-server instances consume the monitoring data in MQ asynchronously.
zipkin

  • SkyWalking: the collector can be deployed in standalone or cluster mode. gRPC is used for communication between the collector and the agent.
  • Pinpoint: also supports cluster and standalone deployment. The Pinpoint agent sends link information to the collector through the Thrift communication framework.
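The MQ-based mode described for Zipkin is essentially a competing-consumers pattern. The sketch below imitates it with an in-process `queue.Queue` standing in for the MQ topic and threads standing in for collector instances. It is an illustration under these assumptions only, not Zipkin's actual collector code, and `run_collectors` is a hypothetical name.

```python
import queue
import threading


def run_collectors(spans, collector_count=3):
    """Drain a shared queue with several collectors; return the spans stored by each."""
    mq = queue.Queue()  # stand-in for the MQ topic
    for span in spans:
        mq.put(span)  # agents publish without waiting for a collector

    # Each collector appends only to its own list, so no extra locking is needed.
    stored = {i: [] for i in range(collector_count)}

    def collector(collector_id):
        while True:
            try:
                span = mq.get_nowait()
            except queue.Empty:
                return  # queue drained, this collector stops
            stored[collector_id].append(span)

    workers = [
        threading.Thread(target=collector, args=(i,))
        for i in range(collector_count)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return stored


published = [f"span-{n}" for n in range(100)]
result = run_collectors(published)
total = sum(len(batch) for batch in result.values())
print(f"{total} spans consumed by {len(result)} collectors")
```

Adding capacity then means starting more consumers on the same topic; the publishing side does not change.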

2.3 Comprehensive call link data analysis

Comprehensive call link data analysis provides code-level visibility to easily locate failures and bottlenecks.

  • zipkin
Zipkin link call analysis

  • skywalking
Skywalking link call analysis

SkyWalking also supports 20+ middleware, frameworks, and class libraries, such as the mainstream Dubbo and OkHttp, as well as DB and messaging middleware. The SkyWalking call analysis captured in the figure above is relatively simple: the gateway calls the user service. Thanks to its support for so many middleware components, SkyWalking's link call analysis is more complete than Zipkin's.

  • pinpoint
Pinpoint link call analysis

Pinpoint is probably the component with the most complete data analysis of the three APMs. It provides code-level visibility to easily locate failures and bottlenecks; as the figure above shows, the SQL statements executed are all recorded. Alarm rules can also be configured, with a person in charge set for each application, and alarms are raised according to the configured rules. The supported middleware and frameworks are also relatively complete.

2.4 Transparency to development and easy switching

Being transparent to development means that new functionality can be added without modifying code and can be enabled or disabled easily. We expect the functionality to work without code changes and still provide code-level visibility.

For this purpose, Zipkin uses modified class libraries and its own container (Finagle) to provide distributed transaction tracing. However, it requires code changes where needed. SkyWalking and Pinpoint are both based on bytecode enhancement: developers do not have to modify code, and more accurate data can be collected because bytecode carries more information.
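The idea of adding tracing without touching business code can be illustrated with a rough analogy. SkyWalking and Pinpoint do this on JVM bytecode through an agent; the Python sketch below only mimics the effect by wrapping a method at runtime. `UserService`, `instrument`, and `RECORDED_SPANS` are hypothetical names invented for this example.

```python
import functools
import time

RECORDED_SPANS = []  # hypothetical in-memory sink; a real agent reports to a collector


class UserService:
    """Application code: contains no tracing logic at all."""

    def find_user(self, user_id):
        return {"id": user_id, "name": "demo"}


def instrument(cls, method_name):
    """Replace cls.method_name with a timing wrapper, leaving the class source untouched."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            # Record the span even if the wrapped method raises.
            RECORDED_SPANS.append({
                "operation": f"{cls.__name__}.{method_name}",
                "duration_s": time.perf_counter() - start,
            })

    setattr(cls, method_name, wrapper)


# "Switching on" happens outside the business code.
instrument(UserService, "find_user")
user = UserService().find_user(42)
print(user, RECORDED_SPANS[0]["operation"])
```

Because the wrapping is applied from outside, leaving out the `instrument` call is enough to switch tracing off again.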

2.5 Complete call chain application topology

Automatically detect the application topology to help you understand the application architecture.

Pinpoint link topology

Skywalking link topology

zipkin dependency

The three figures above show the call topology of each APM component; all three can present a complete call chain application topology. Relatively speaking, Pinpoint's display is richer and goes down to the specific DB name, whereas Zipkin's topology is limited to service-to-service dependencies.
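How a topology can be derived from call chain data is sketched below: each span points to its parent, and an edge is drawn whenever parent and child belong to different services. The span field names (`span_id`, `parent_id`, `service`) and the function `build_topology` are assumptions for illustration and do not reflect the storage format of any of the three components.

```python
from collections import defaultdict


def build_topology(spans):
    """Map each caller service to the set of services it calls, using parent/child spans."""
    service_of = {span["span_id"]: span["service"] for span in spans}
    edges = defaultdict(set)
    for span in spans:
        parent = span.get("parent_id")
        if parent is None or parent not in service_of:
            continue  # root span, or the parent span was not collected
        caller, callee = service_of[parent], span["service"]
        if caller != callee:  # ignore calls that stay inside one service
            edges[caller].add(callee)
    return dict(edges)


# One trace: the gateway calls the user service, which uses Redis and MySQL.
spans = [
    {"span_id": "1", "parent_id": None, "service": "gateway"},
    {"span_id": "2", "parent_id": "1", "service": "user-service"},
    {"span_id": "3", "parent_id": "2", "service": "redis"},
    {"span_id": "4", "parent_id": "2", "service": "mysql"},
]
topology = build_topology(spans)
print(topology)
```

Merging the edges of many traces gives the application-level topology, which is also how a service owner can see who its downstream consumers are.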

3. Summary

This article compares three distributed call chain monitoring components along five aspects, examining each component in turn. Choose a component based on your actual business requirements and scenarios. The data compared above is for reference only. All three are open source projects, and companies generally do some secondary development to fit their own situation, such as adding support for more components or connecting to an existing big data platform.

Finally, after reading an introduction to EagleEye, I want to mention how a monitoring system can move from passive alerting to active discovery, which is very close to AIOps. The volume of link monitoring data is very large. Although compression can reduce the amount of data transmitted, do we really need to store every link? Would it be enough to identify only the abnormal cases? In the time series, what we need to identify is the point in time at which an anomaly occurs. Once identified, the anomalies can be correlated to locate the root problem. Of course, this involves both the business and the application system level and is very complicated, but I think it is the general direction of AIOps.
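One very simple way to keep only abnormal links is sketched below: a trace is stored if any of its spans failed or if it was slower than a threshold. This is only an illustration of the idea, not how EagleEye or any of the three components works; the field names, the function names, and the 1000 ms threshold are all assumptions.

```python
def is_abnormal(trace, latency_threshold_ms=1000):
    """Treat a trace as abnormal if any span failed or the whole trace was slow."""
    has_error = any(span.get("error", False) for span in trace["spans"])
    return has_error or trace["duration_ms"] > latency_threshold_ms


def select_traces_to_store(traces, latency_threshold_ms=1000):
    """Keep only the traces worth storing for later analysis."""
    return [t for t in traces if is_abnormal(t, latency_threshold_ms)]


traces = [
    {"trace_id": "a", "duration_ms": 35, "spans": [{"name": "GET /user"}]},
    {"trace_id": "b", "duration_ms": 2400, "spans": [{"name": "GET /user"}]},
    {"trace_id": "c", "duration_ms": 40, "spans": [{"name": "SELECT", "error": True}]},
]
kept = [t["trace_id"] for t in select_traces_to_store(traces)]
print(kept)
```

A fixed threshold is the crudest possible rule; identifying anomalies from the time series itself is where the AIOps part would come in.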

Recommended reading

Practice and Comparison of several distributed Call chain monitoring components (I) Practice

References

  1. Technical Overview Of Pinpoint
  2. Ali micro service’s wounds and distributed link tracking technology principle