In a monolithic architecture, all services and components reside on the same machine, so monitoring service anomalies and time consumption is relatively easy. We can use AOP to measure the overall invocation time by logging timestamps before and after the specific business logic, and print logs at key points when tracking down a problem.

In a microservices architecture, however, a single request involves multiple modules and systems and usually requires several machines to cooperate. A chain of calls may mix serial and parallel steps as well as synchronous and asynchronous ones. If we stick to the monolithic approach to service monitoring, figuring out which services, modules, and nodes are called behind a request, in what order, how long each call takes, or where in the chain a problem occurred can drive a developer to despair. This is exactly why we turn to link tracing.

Link tracing: records the link information of each request through instrumentation in the program. The data is accurate and detailed, which makes it suitable for viewing the invocation chain of a request and spotting the interfaces that respond slowly.

To implement service tracking, we have three problems to solve:

  • 1. Instrument (bury points in) services and collect the context data of each invocation.
  • 2. Analyze and process the collected data in real time.
  • 3. Visualize the call-link data.

With a distributed tracing system, the specific call link behind each request can be located precisely, so request link tracing becomes easy to implement:

So, is there a specification that everyone can follow to build a good link tracing system?

OpenTracing distributed call chain specification

To address the incompatibility between the APIs of different distributed tracing systems, the OpenTracing specification was born. OpenTracing is a lightweight standardization layer that sits between an application/library and a trace or log analyzer.

1. What is OpenTracing?

OpenTracing is a standard for distributed link tracing. Because it is platform-independent, developers can switch between link tracing systems that conform to the OpenTracing standard with only a small amount of configuration code changed.

A traced service only needs to invoke the OpenTracing interface to be supported by any tracing backend that implements that interface (such as Zipkin, Jaeger, etc.), and conversely any such backend can trace any service that invokes the interface. OpenTracing also standardizes the Span, the smallest independent unit for transferring and tracing data across processes.
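To make "invoking the OpenTracing interface" concrete, here is a minimal sketch using the io.opentracing Java API. The operation name, tag key, and log message are illustrative assumptions; the actual Tracer implementation (Jaeger, a Zipkin bridge, etc.) is registered elsewhere at startup, and the business code never depends on which one it is.

import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

public class OrderHandler {

    public void handleOrder(String orderId) {
        // Obtain whatever Tracer implementation was registered at startup.
        Tracer tracer = GlobalTracer.get();

        // A Span is the smallest unit of work: name, start/end time, tags, logs.
        Span span = tracer.buildSpan("handle-order").start();
        try {
            span.setTag("order.id", orderId);   // K/V tag
            // ... business logic ...
            span.log("order processed");        // K/V-style log event
        } finally {
            span.finish();                      // records the end time
        }
    }
}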

2. Data structure of OpenTracing

  • 1. Span

Span is the basic component and the smallest independent unit of a trace link. A Span represents a single unit of work, such as a function call or an HTTP request.

A Span contains the following information:

  • Service name (the name of the calling request)

  • Start time and end time of the service

  • K/V Tags

  • Logs in K/V form (may be truncated)

  • SpanContext (the Span's context information, which contains the Trace ID and Span ID)

  • References: references from this Span to one or more other Spans (established through their SpanContext).

  • 2. Trace

A complete request link; each call chain consists of multiple Spans.

  • 3. SpanContext

Global context information for the Trace, such as traceId.

Looking at the diagram to get a feel for the data structure: the whole of a request is a Trace, so obviously every request needs a global identifier, the TraceId. Each call within the request is a Span, and every Span carries both the global TraceId and its own SpanId, so that every call can be associated with the overall request through the TraceId.
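To make this relationship concrete, here is a purely illustrative sketch of the identifiers each call carries. The field names and types are assumptions for explanation only; real tracers encode the IDs differently (for example as 64- or 128-bit hex strings).

import java.util.List;
import java.util.Map;

// Illustrative data holders, not a real tracer's classes.
class SpanContext {
    String traceId;       // identical for every Span in one request (the Trace)
    String spanId;        // unique per unit of work
    String parentSpanId;  // links a child call back to its caller
}

class Span {
    String operationName;            // e.g. the request or method being traced
    long startMicros, finishMicros;  // start and end time of the unit of work
    Map<String, String> tags;        // K/V tags
    List<Map<String, ?>> logs;       // K/V log events
    SpanContext context;             // carries traceId / spanId across processes
}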

3. Implementation of OpenTracing

Having covered the data structures of OpenTracing, let's start with how the trace data of a service is collected.

Rather than manually burying points in each individual service, the mainstream systems Zipkin and SkyWalking collect service-link data in different ways:

  • Zipkin: intercepts requests and sends the data (over HTTP or MQ) to the Zipkin service, where it can be persisted to a database if needed.

  • SkyWalking: attaches to the business system as an agent, so no logging code needs to be added to the business system, while the call data is still recorded in the corresponding database.

The content of the collected data is shown in the figure below:

If data were collected for every single request, the volume would undoubtedly be huge. For a service already in production, most requests look much the same (except when an exception is thrown), so there is no need to sample every one of them. You can set a sampling frequency and record only part of the traffic. SkyWalking's default is 3 samples every 3 seconds, which is actually enough for performance analysis (aside: most businesses don't have that many requests anyway). At the same time, to keep a trace globally consistent, if the upstream service sampled a request (as explained under SpanContext), the downstream services must sample it as well, otherwise inconsistencies are likely.
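The rule "upstream decides, downstream follows" can be sketched as below. The propagation mechanism is simplified here (real systems carry the decision in headers such as B3's X-B3-Sampled); the point is that the sampling decision is made once at the entry service and only inherited afterwards, so a trace is either recorded completely or not at all.

import java.util.concurrent.ThreadLocalRandom;

public class SamplingDecision {

    private static final double SAMPLE_PROBABILITY = 0.1; // keep ~10% of traces (assumed rate)

    // Decide at the entry service; every downstream hop reuses the decision
    // carried in the context instead of rolling the dice again.
    public static boolean shouldSample(Boolean upstreamDecision) {
        if (upstreamDecision != null) {
            return upstreamDecision;   // follow the caller's decision
        }
        return ThreadLocalRandom.current().nextDouble() < SAMPLE_PROBABILITY;
    }
}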

After obtaining these data, the link data can be visualized to achieve service tracking.

Zipkin implements the OpenTracing specification

Zipkin is an open-source distributed real-time data tracing system, designed on the basis of the Google Dapper paper and developed and open-sourced by Twitter. Its architecture is shown as follows:

But Zipkin alone is not enough. Spring Cloud provides Sleuth, a component that offers link tracing for inter-service calls, timing analysis, and help in locating errors.

Spring Cloud Sleuth can work together with Zipkin: it sends its information to Zipkin, uses Zipkin's storage to persist it, and uses the Zipkin UI to display the data (by default Sleuth transmits Spans to Zipkin over HTTP).

1. Setting up a simple Zipkin project

Building Zipkin with Spring Boot is relatively easy, since the module is already well scaffolded: you just need to add the corresponding dependencies in Maven (or Gradle, or whichever build tool you use).

Introduce the dependencies in the Zipkin project:

<dependencies>
    <!-- Distributed link tracing -->
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-sleuth</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-zipkin</artifactId>
    </dependency>
    <!-- RabbitMQ dependency -->
    <dependency>
        <groupId>org.springframework.amqp</groupId>
        <artifactId>spring-rabbit</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
</dependencies>

The Zipkin dependency also needs to be introduced in the Service and Client projects:

<!-- Zipkin service tracing starter dependency -->
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-zipkin</artifactId>
</dependency>

Why choose RabbitMQ messaging middleware to send Span messages

By default Sleuth transmits data to Zipkin over HTTP for page rendering. If that HTTP transmission is interrupted by some unavoidable failure, the data of that transfer is lost. With RabbitMQ, by contrast, the message queue can accumulate tens of millions of messages and consumption simply resumes on the next reconnection. As thread counts and concurrency grow, sending data asynchronously via RabbitMQ shows a clear advantage. RabbitMQ also supports message and queue persistence, and high availability can be achieved through persisting messages to disk, requeueing, mirrored queues, and so on; see the RabbitMQ documentation for details.

Besides the dependencies, Zipkin is configured in YAML like this:

server:
  port: 9411
spring:
  application:
    name: zipkin-server
  main:
    allow-bean-definition-overriding: true
eureka:
  client:
    service-url:
      defaultZone: http://localhost:8761/eureka/
management:
  metrics:
    web:
      server:
        auto-time-requests: false

Add the Zipkin configuration to the other services:

spring:
  sleuth:
    sampler:
      probability: 1.0
  zipkin:
    base-url: http://localhost:9411

Since Spring Boot 2.x it is no longer recommended to build a customized Zipkin server; using the official pre-built JAR package is preferred. If we build it ourselves anyway, we need the @EnableZipkinServer or @EnableZipkinStreamServer annotation, matching the Spring Boot version.

@SpringBootApplication
@EnableEurekaClient
@EnableZipkinServer
public class ZipkinServerApplication {

    public static void main(String[] args) {
        SpringApplication.run(ZipkinServerApplication.class, args);
    }

}

2. The Zipkin UI

Once the setup is complete, we can open the Zipkin home page at http://localhost:9411/zipkin/:

Click Find Traces to see a concise display of the number of spans and the total call delay.

We enter a full call chain and access one of the nodes to get the following data:

In the Dependencies option, you can see the diagram of the call link. Of course, for testing reasons, we did not establish a very complex call link.

From this we can see that Spring Cloud integrates with Zipkin very smoothly: it can be wired directly into the Spring Cloud environment without launching a separate JAR package, which is why I chose Zipkin for this demonstration.

Common problems in link tracing system

1. Keeping TraceId globally unique

To ensure that the TraceId is globally unique, we can borrow the design of SOFATracer: an 8-character hex encoding of the IP address, a 13-digit timestamp, a 4-digit auto-increment sequence, and the PID of the current process are concatenated into a globally unique string (a sketch of the generation logic follows the table below).

| IP address | Timestamp of ID generation | Auto-increment sequence | Process PID |
| --- | --- | --- | --- |
| 8 hex characters | 13 digits | 4 digits | variable length |
| 0ad1348f | 1403169275002 | 1003 | 56696 |
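A rough sketch of generating an ID in this format is shown below. The exact sequence range and reset behaviour are assumptions; the sketch only illustrates how the four parts are concatenated (note that 10.209.52.143 encodes to 0ad1348f, matching the example above).

import java.lang.management.ManagementFactory;
import java.net.InetAddress;
import java.util.concurrent.atomic.AtomicInteger;

public class TraceIdGenerator {

    // 4-digit rolling sequence, assumed to cycle through 1000-9999
    private static final AtomicInteger SEQ = new AtomicInteger(1000);

    public static String nextTraceId() throws Exception {
        StringBuilder sb = new StringBuilder();
        // 8 hex characters: two per byte of the IPv4 address
        for (byte b : InetAddress.getLocalHost().getAddress()) {
            sb.append(String.format("%02x", b & 0xff));
        }
        // 13-digit millisecond timestamp
        sb.append(System.currentTimeMillis());
        // 4-digit auto-increment sequence
        sb.append(SEQ.getAndUpdate(s -> s >= 9999 ? 1000 : s + 1));
        // process PID (variable length), taken from "pid@hostname"
        sb.append(ManagementFactory.getRuntimeMXBean().getName().split("@")[0]);
        return sb.toString();
    }
}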

2. How to optimize the data collection process

For example, you can optimize it with messaging middleware: install RabbitMQ and change the startup mode of the Zipkin server so that it pulls messages from RabbitMQ.

Then modify the clients so that they send their messages to MQ via RabbitMQ (add the dependency to every microservice whose logs need to be collected and adjust its configuration), as sketched below.
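As a hedged example of that client-side change: with Spring Cloud Sleuth the sender can typically be switched from HTTP to RabbitMQ through configuration alone. The broker address and credentials below are placeholders, and the exact property names can vary between Sleuth versions.

spring:
  zipkin:
    sender:
      type: rabbit     # send Spans via RabbitMQ instead of HTTP
  rabbitmq:
    host: localhost    # placeholder broker address
    port: 5672
    username: guest
    password: guest

On the server side, the Zipkin service is then started with its RabbitMQ collector enabled (for example by pointing it at the broker through an environment variable such as RABBIT_ADDRESSES) so that it pulls Spans from the queue instead of receiving HTTP POSTs.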

3. Sampling rate setting

In practice, the Trace sampling rate is always a headache: set it too low and some interfaces may never collect enough useful information; set it too high and the cost becomes excessive. Jaeger offers a dynamic sampling rate, which ensures that within the same service the low-QPS interfaces are still sampled effectively while the high-QPS interfaces get a lower sampling rate.
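As a toy illustration of the idea behind dynamic sampling, the sketch below guarantees each operation at most a fixed number of sampled traces per second, so low-QPS interfaces are always represented while high-QPS interfaces end up with an effectively low rate. Jaeger's real adaptive sampler is considerably more sophisticated; this only shows the intuition.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class PerOperationSampler {

    private final int maxTracesPerSecond;
    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();
    private long currentSecond = System.currentTimeMillis() / 1000;

    public PerOperationSampler(int maxTracesPerSecond) {
        this.maxTracesPerSecond = maxTracesPerSecond;
    }

    public synchronized boolean sample(String operation) {
        long now = System.currentTimeMillis() / 1000;
        if (now != currentSecond) {   // new second: reset all per-operation counters
            currentSecond = now;
            counters.clear();
        }
        AtomicInteger count = counters.computeIfAbsent(operation, k -> new AtomicInteger());
        return count.incrementAndGet() <= maxTracesPerSecond;
    }
}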

Other link tracking systems

Zipkin is Twitter's open-source call chain analysis tool. It is currently widely used together with Spring Cloud Sleuth, and its strengths are being lightweight and easy to use and deploy.

The other options include:

  • SkyWalking: an open-source call chain analysis and application monitoring tool from China, based on bytecode injection. Its strengths are support for a wide range of plugins, a powerful UI, and no code intrusion on the instrumented side. It has joined the Apache incubator.

  • Pinpoint: a Korean open-source call chain analysis and application monitoring tool based on bytecode injection. Its strengths are support for a wide range of plugins, a powerful UI, and no code intrusion on the instrumented side.

  • CAT: Dianping's open-source monitoring platform based on code instrumentation and configuration, covering call chain analysis, application monitoring, log collection, and monitoring alarms.

| Category | Zipkin | SkyWalking | Pinpoint | CAT |
| --- | --- | --- | --- | --- |
| Implementation | Intercepts requests and sends data (HTTP, MQ) to the Zipkin service | Java probe, bytecode enhancement | Java probe, bytecode enhancement | Code burying points (interceptors, annotations, filters, etc.) |
| Access method | Based on linkerd or Sleuth, imported via configuration | Javaagent bytecode | Javaagent bytecode | Code burying points (interceptors, annotations, filters, etc.) |
| Data transport | HTTP, MQ | gRPC | Thrift | HTTP/TCP |
| Data storage | ES, MySQL, Cassandra, in-memory | ES, H2 | HBase | MySQL, HDFS |
| Granularity | Interface level | Method level | Method level | Code level |
| Community activity | High | Low | Medium | Medium |

As for selection, you can choose the link tracing system that suits your own product; for reference, see the Zhihu article comparing the monitoring systems Skywalking, Pinpoint, Cat, and Zipkin.

Hello everyone, I am Nanju, and I have been working with Java for two and a half years. Below is my WeChat; if you need the map mentioned earlier or would like to exchange experience, feel free to reach out.