This article is published on GitHub in /JavaMap, which contains a series of articles on advanced technical knowledge for Java programmers.


Why do distributed systems need link tracing?

As Internet businesses expand rapidly, software architectures grow increasingly complex. To handle massive numbers of highly concurrent user requests, systems have become more and more distributed: monolithic architectures are split into microservices, local caches become distributed caches, and inter-service communication moves to distributed message components. Together, these components form a complex distributed network.


Suppose a system has tens of thousands of services deployed, and a user clicks in the browser to order a case of Maotai on the main page. The result: the system tells the user there was an internal error, and the user is presumably quite frustrated.

The operations team passes the problem to the developers, who only know that an exception occurred; pinpointing the specific microservice that caused it requires examining services one by one.


It is very inefficient for developers to comb through logs service by service, so is there a better solution? The answer is to introduce a link tracing system.

What is link tracing?

Distributed link tracing restores the call chain of a distributed request and presents it in one place: how long each service node took, which machine the request reached, the request status at each service node, and so on.

Main link tracing functions:

  • Quick fault location: You can use the call chain and service logs to quickly locate faults.
  • Link performance visualization: Displays the link duration and service dependencies in each phase on a visual interface.
  • Link analysis: Analyzes link time and service dependencies to obtain user behavior paths. This analysis is applied in many service scenarios.

Link tracing fundamentals

Link tracing (probably) first became widely known through Google's paper Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. So if you tech gurus are sitting on a secret weapon, don't hide it: get it published.

This famous paper mainly describes the basic principles and key techniques of the Dapper tracing system. Let's pick a few of the key techniques and walk through them in detail.

Trace

Trace refers to the path a request takes through all services, which can be represented as a tree, as shown in the figure below.


A complete link in the figure is: Chrome -> Service A -> Service B -> Service C -> Service D -> Service E -> Service C -> Service A -> Chrome. The local links between services together make up the complete link, and the complete link is identified by a globally unique traceId.

Span

In the figure above, the request passes through service A, and service A calls service B and service C. But is service B or service C called first? It's hard to tell from the picture; you can only determine the order by reading the source code.

To express this parent-child relationship, the concept of the Span was introduced.

Spans at the same level share the same parent ID but have different span IDs, and the span IDs indicate the order of the calls, from small to large. The figure below clearly shows that service A calls service B before service C.

As shown in the figure below, the span ID of service C is 2, and the parent ID of service D is 2, indicating that service C and service D form a parent-child relationship. It is obvious that service C invokes service D.


Summary: by instrumenting the logs, finding all log entries with the same traceId, and combining the parent IDs and span IDs, you can string together the complete chain of a request.
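To make this concrete, here is a minimal Java sketch (all class, record, and method names are made up for illustration) showing how spans carrying a traceId, spanId, and parentId can be stitched back into call order:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// A minimal, hypothetical Span record: just enough fields to show how
// traceId, spanId, and parentId string a request chain together.
public class TraceDemo {
    record Span(String traceId, int spanId, int parentId, String service) {}

    // Collect the direct children of a given span, ordered by spanId
    // (a smaller spanId means the service was called earlier).
    static List<String> childrenOf(List<Span> spans, String traceId, int parentId) {
        List<String> result = new ArrayList<>();
        spans.stream()
             .filter(s -> s.traceId().equals(traceId) && s.parentId() == parentId)
             .sorted(Comparator.comparingInt(Span::spanId))
             .forEach(s -> result.add(s.service()));
        return result;
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(
            new Span("trace-1", 0, -1, "A"),  // root span: service A
            new Span("trace-1", 1, 0, "B"),   // A -> B (smaller spanId: called first)
            new Span("trace-1", 2, 0, "C"),   // A -> C (called second)
            new Span("trace-1", 3, 2, "D"));  // C -> D
        System.out.println(childrenOf(spans, "trace-1", 0)); // prints [B, C]
    }
}
```

Grouping by traceId, then sorting siblings by spanId, is exactly the "string the logs together" step described above.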

Annotations

Dapper also defines the concept of annotation, which is used for user-defined events to assist in locating problems.

It usually includes four annotations:

  • cs (Client Send): the client initiates a request;
  • sr (Server Receive): the server receives the request;
  • ss (Server Send): the server finishes processing and sends the result back to the client;
  • cr (Client Receive): the client receives the response from the server.


The figure above describes the process of one request and response; the four points correspond to the four annotation events.

The following figure shows a complete call from the client to the server. To calculate the total time of a call, simply subtract the time the client sent the request from the time it received the response, i.e. T4 - T1 on the timeline in the figure. To calculate the client-side network send time, use T2 - T1; other intervals can be calculated similarly.
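The timing arithmetic above can be sketched in a few lines of Java. The class and method names, and the timestamp values, are hypothetical; only the cs/sr/ss/cr semantics come from the text:

```java
// A minimal sketch of the timing arithmetic derived from the four
// annotations. Timestamps T1..T4 correspond to cs, sr, ss, cr.
public class AnnotationTiming {
    static long totalDuration(long cs, long cr) {
        return cr - cs;   // T4 - T1: full round trip as seen by the client
    }

    static long clientSendNetworkTime(long cs, long sr) {
        return sr - cs;   // T2 - T1: time on the wire, client -> server
    }

    static long serverProcessingTime(long sr, long ss) {
        return ss - sr;   // T3 - T2: time spent inside the server
    }

    public static void main(String[] args) {
        long t1 = 100, t2 = 130, t3 = 170, t4 = 200; // hypothetical microsecond values
        System.out.println(totalDuration(t1, t4));          // prints 100
        System.out.println(clientSendNetworkTime(t1, t2));  // prints 30
        System.out.println(serverProcessingTime(t2, t3));   // prints 40
    }
}
```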


In-band data and out-of-band data

The restoration of link information depends on in-band and out-of-band data.

Out-of-band data refers to events generated by each node, such as cs and ss. Each node can generate this data independently and report it centrally to storage. With out-of-band data, more link details can be analyzed on the storage side.

In-band data, such as the traceId, spanId, and parentId, identifies the trace, the span, and the span's position within the trace. This data must be passed along from the start of the link to its end; in-band transmission is what connects all the stages of a link together.
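In practice, in-band data is often carried between services in HTTP headers. The sketch below uses Zipkin's B3 header names (X-B3-TraceId and friends); the class and method names are made up for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A sketch of in-band data propagation: the caller injects traceId,
// spanId, and parentId into HTTP headers so the next hop can continue
// the same trace. Header names follow Zipkin's B3 convention.
public class InBandPropagation {
    static Map<String, String> inject(String traceId, String spanId, String parentId) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("X-B3-TraceId", traceId);  // identifies the whole trace
        headers.put("X-B3-SpanId", spanId);    // identifies this hop
        if (parentId != null) {
            headers.put("X-B3-ParentSpanId", parentId); // position within the trace
        }
        return headers;
    }

    public static void main(String[] args) {
        // Service A calls service B: the context rides along with the request.
        Map<String, String> headers = inject("463ac35c9f6413ad", "a2fb4a1d1a96d312", null);
        System.out.println(headers);
    }
}
```

The receiving service reads these headers, records its own span under the given parent, and injects the updated context into its own outbound calls.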

Sampling

Since every request generates a link, Dapper does not report all span data; to reduce performance overhead and avoid wasting storage, it samples. For example, if 1000 requests per second reach the system and the sampling rate is set to 1/1000, only one request per second is reported to storage.


By adjusting the sampling rate to control the number of spans reported, performance bottlenecks can still be found while the performance cost of tracing is kept low.
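A probabilistic sampler like the one described above can be sketched in a few lines. This is a minimal illustration, not Dapper's actual implementation; all names are made up:

```java
import java.util.concurrent.ThreadLocalRandom;

// A minimal probabilistic sampler: with rate = 1/1000, on average one
// request in a thousand is reported to storage.
public class RateSampler {
    private final double rate;

    RateSampler(double rate) {
        if (rate < 0.0 || rate > 1.0) {
            throw new IllegalArgumentException("rate must be in [0, 1]");
        }
        this.rate = rate;
    }

    // Decide once at the root of the trace; downstream spans inherit the
    // decision, so a trace is either reported completely or not at all.
    boolean shouldSample() {
        return ThreadLocalRandom.current().nextDouble() < rate;
    }

    public static void main(String[] args) {
        RateSampler sampler = new RateSampler(1.0 / 1000);
        int sampled = 0;
        for (int i = 0; i < 100_000; i++) {
            if (sampler.shouldSample()) sampled++;
        }
        System.out.println(sampled); // roughly 100 on average
    }
}
```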

Storage


Span data in a link is collected, reported, and stored in a centralized place. Dapper uses BigTable as its data warehouse; common choices today include ElasticSearch, HBase, and in-memory databases.

Link tracing system commonly used in the industry

After Google's Dapper paper was published, many companies built their own solutions on the same basic principles of link tracing: Twitter's Zipkin, Uber's Jaeger, Pinpoint, Apache's open-source SkyWalking, and domestic products such as Alibaba's EagleEye, Meituan's Mtrace, Didi's Trace, Sina's Watchman, and JD's Hydra; the domestic ones, however, are mostly not open source.

To make these systems interoperable, the OpenTracing organization developed a set of standards aimed at providing a unified interface across systems.

Here is a comparison of several open source components for the convenience of technology selection in the future.


Attached with the address of the major open source components:

  • Zipkin: zipkin.io/
  • Jaeger: www.jaegertracing.io/
  • Pinpoint: github.com/pinpoint-ap…
  • SkyWalking: skywalking.apache.org/

Next, take a look at the basic Zipkin implementation.

Implementation of Zipkin distributed link tracking system

Zipkin is an open-source Twitter project based on Google's Dapper. It collects timing data from services to troubleshoot latency problems in microservice architectures, covering data collection, storage, lookup, and presentation.

Zipkin basic architecture


While services run, they generate a great deal of link information; the component that produces this data is called the Reporter. Link information is sent to Zipkin's Collector over one of several transports, such as HTTP, RPC, or a Kafka message queue. After Zipkin processes it, the link information is saved to storage (in memory by default). O&M personnel can then query call-chain information through the Web UI, which invokes the query interface.
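To give a feel for what a Reporter actually sends over HTTP, here is a sketch that encodes a Zipkin v2 span as JSON and builds (but does not send) a POST to the Collector's /api/v2/spans endpoint on Zipkin's default port 9411. The field values are hypothetical, and in practice a tracer library builds and batches these payloads for you:

```java
import java.net.URI;
import java.net.http.HttpRequest;

// A sketch of what a Reporter sends: a Zipkin v2 span encoded as JSON,
// posted to the Collector's HTTP endpoint (/api/v2/spans).
public class ZipkinReport {
    static String spanJson(String traceId, String id, String name,
                           long timestampMicros, long durationMicros, String service) {
        return "[{\"traceId\":\"" + traceId + "\",\"id\":\"" + id + "\""
             + ",\"name\":\"" + name + "\""
             + ",\"timestamp\":" + timestampMicros
             + ",\"duration\":" + durationMicros
             + ",\"localEndpoint\":{\"serviceName\":\"" + service + "\"}}]";
    }

    public static void main(String[] args) {
        String body = spanJson("463ac35c9f6413ad", "a2fb4a1d1a96d312",
                               "get /order", 1700000000000000L, 207000L, "service-a");
        // Build the request a Reporter would make; actually sending it
        // requires a running Zipkin instance.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9411/api/v2/spans")) // default Zipkin port
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        System.out.println(request.uri() + " " + body);
    }
}
```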

Zipkin core components

Zipkin has four core components


(1) Collector

Once the Collector's collection thread receives link-tracing data, Zipkin validates, stores, and indexes it, calling the storage interface to save the data so it can be looked up later.

(2) Storage

Zipkin Storage was originally built to store data in Cassandra, because Cassandra is scalable, has a flexible schema, and was heavily used at Twitter. Besides Cassandra, ElasticSearch and MySQL storage are also supported, and third-party extensions may be added later.

(3) Query Service

After the link-tracing data is stored and indexed, the Web UI can invoke the Query Service to look it up, helping O&M personnel quickly locate online problems. The Query Service provides a simple JSON API for finding and retrieving traces.

(4) Web UI

Zipkin provides a web interface for basic query and search. O&M personnel can quickly identify online problems based on the details of a specific call chain.

Conclusion

  1. Distributed link tracing restores the call chain of each distributed request.
  2. The core concepts of link tracing include Trace, Span, Annotation, in-band and out-of-band data, sampling, and storage.
  3. Commonly used open-source components in the industry all evolved from Google's Dapper paper;
  4. Zipkin core components include Collector, Storage, Query Service, and Web UI.

— END —

Daily reminder: Hello, fellow engineers. Make liking before reading a habit; your likes are what keep me moving forward, and the next installment will be even better.

The blogger holds a master's degree from Huazhong University of Science and Technology and is a programmer who pursues technology and is passionate about life. Over the past few years, he has worked at Huawei, NetEase, and Baidu, and has many years of hands-on development experience.

Search WeChat for the official account "Architect who loves to laugh"; I have technology and stories waiting for you.

The articles are continuously updated. You can find my archived series of articles on GitHub in /JavaMap, covering interview experience and technical topics; Stars are welcome.