With the rise of microservices architecture, services are split along different dimensions, and a single request often involves multiple services. Internet applications are built from different sets of software modules, which may be developed by different teams, implemented in different programming languages, and distributed across thousands of servers in multiple data centers. You therefore need tools that help you understand system behavior and analyze performance problems, so that when a failure occurs you can locate and resolve it quickly. In a complex microservice architecture, almost every front-end request forms a complex distributed invocation chain. A complete call chain for a request might look something like this:

As the number of services increases, the analysis of the invocation chain becomes more complex. The calling relationship between them might look like this:

As the business grows, the number of services increases and changes become frequent, and the resulting complex call chains bring a series of problems:

  • How do we quickly identify problems?
  • How do we determine the scope of a fault?
  • How do we sort out service dependencies and judge whether they are reasonable?
  • How do we analyze link performance and plan capacity in real time?

Link tracing is designed to solve these problems. It can locate problems in complex service calls, and it also lets a newcomer to the back-end team know which call chain they are responsible for.

In addition, if the time taken by an interface suddenly increases, you no longer have to check the time spent by each service one by one; you can intuitively spot the service's performance bottleneck, which also makes it easier to scale capacity accurately and reasonably when traffic surges.

What is link tracing

The term "link tracing" gained currency in 2010, when Google published its Dapper paper, "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure", which explained the principles behind Google's internal distributed tracing system and how Google made tracing transparent to applications at low cost.

Put simply, link tracing means that, from the start to the end of a task, every system that was invoked and the time it took (the time span) is completely recorded.

In fact, Dapper started out as an independent call-chain tracing system and only gradually evolved into a monitoring platform, on top of which many tools were built, such as real-time alerting, overload protection, and metric querying.

Besides Google's Dapper, there are other well-known products, such as Alibaba's EagleEye, Dianping's CAT, Twitter's Zipkin, PinPoint from Naver (the parent company of the popular messaging app LINE), and SkyWalking (which has been donated to Apache).

What is Sleuth

Spring Cloud Sleuth implements a distributed tracing solution for Spring Cloud. It is compatible with Zipkin and HTrace, as well as log-based analysis stacks such as ELK (Elasticsearch, Logstash, Kibana).

Spring Cloud Sleuth provides the following features:

  • Link tracing: Sleuth lets you clearly see which services a request has passed through, making it easy to sort out the invocation relationships between services.
  • Performance analysis: with Sleuth you can conveniently see the time spent on each sampled request, analyze which service calls are time-consuming, and get an early hint to scale a service when its call duration grows with the number of requests.
  • Data analysis and link optimization: for services that are called frequently or in parallel, you can make targeted optimizations.
  • Error visualization: exceptions that the program does not catch can be viewed in Zipkin.

Terminology

Span

The basic unit of work. A single call can be regarded as a span. Dapper records each span's name, its ID, and its parent ID in order to reconstruct the relationships between spans within a trace; in the figure, each rectangular box is a span, and the front end is a span that lasts from the moment it sends the request until it receives the reply.

The initial span that starts a trace is called the root span. Its span ID is equal to the trace ID.

Dapper records the span names, as well as each span's ID and parent span ID, to reconstruct the relationships between the spans of a trace. A span without a parent ID is called a root span. All spans hang off a specific trace and share the same trace ID.

Trace

A tree structure composed of a series of spans. A trace is a complete call chain and contains one or more spans. A trace and its spans form a one-to-many relationship, while spans relate to each other as parent and child.

Here's an example: the client invokes services A, B, C, and F, and each service, such as C, is a span. If service C calls D on another thread, D is a child span of C; if D then calls E on another thread, E is a child span of D, and the chain C -> D -> E belongs to one trace. Once a link-tracing system is in place and the link data is available, front-end parsing and rendering tools can produce results like the following:

Annotation

Annotations record the moment an event occurs; a few core annotations mark the start and end of a request.

  • cs (Client Sent): the client sends a request; this annotation marks the start of the span.
  • sr (Server Received): the server receives the request and is ready to process it; sr minus cs gives the network latency of the request.
  • ss (Server Sent): the server finishes processing and sends the result back to the client; ss minus sr gives the server's processing time.
  • cr (Client Received): the span ends; the client has successfully received the server's reply; cr minus cs gives the total time the client spent obtaining the reply.

Implementation principle

First of all, thanks to Zhang Yino for the schematic diagrams of the implementation.

If you want to know where an interface is broken, you have to know which services are invoked by the interface and in what order. If you string these services together like a chain, we call it the invocation chain.

To build the invocation chain, we give each invocation an identifier and then sort the services by that identifier so the order of the calls becomes clear; for now, let's call it the spanId.

In a real scenario we need to see what a particular request invoked, so the spanId alone is not enough; we also need a unique identifier per request so that we can find every service the request called. We name this identifier the traceId.

With the spanId it is now easy to know the order of the invoked services, but not the hierarchy of the invocations. As shown in the figure below, services may be invoked one after another in a chain, or several may be invoked at the same time by the same service.

Therefore, each call should also record who made it; we use parentId as the name of this identifier.
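To make the three identifiers concrete, here is a minimal sketch (illustrative only, not Sleuth's actual implementation) of how a tracing library might create a root span and derive child spans from it:

import java.util.UUID;

// Illustrative model of the three identifiers; not Sleuth's real API.
public class SpanContext {

    final String traceId;  // shared by every span belonging to the same request
    final String spanId;   // unique per invocation
    final String parentId; // spanId of the caller; null for the root span

    private SpanContext(String traceId, String spanId, String parentId) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentId = parentId;
    }

    // The first span of a request: its span ID doubles as the trace ID.
    static SpanContext newRootSpan() {
        String id = UUID.randomUUID().toString();
        return new SpanContext(id, id, null);
    }

    // A downstream call keeps the traceId and records the caller's spanId as parentId.
    SpanContext newChildSpan() {
        return new SpanContext(traceId, UUID.randomUUID().toString(), this.spanId);
    }
}

When service A calls service B, B's span keeps A's traceId, gets a fresh spanId, and stores A's spanId as its parentId, which is exactly the hierarchy described above.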

By now we know the calling order and hierarchy, but when an interface goes wrong we still cannot find the problematic link. If there is a problem, the offending service call must be taking very long, and to compute elapsed time the three identifiers above are not enough: we also need to add timestamps, which can be fairly fine-grained, down to the microsecond.

Recording a timestamp only when the call is made is not enough to compute a time difference; we also have to record a timestamp when the service returns. And since timestamps are recorded in more than one place, each record must carry all three identifiers above, otherwise we cannot tell which call a timestamp belongs to.

Although this lets us compute the total time from service invocation to service return, that time includes both service execution time and network latency, and sometimes the two need to be separated to make targeted optimizations. So how do we calculate the network latency? We can split the call-and-return process into the following four events.

  • Client Sent, abbreviated cs: the client sends a call request to the server.
  • Server Received, abbreviated sr: the server receives the call request from the client.
  • Server Sent, abbreviated ss: the server finishes processing and is about to return the result to the client.
  • Client Received, abbreviated cr: the client receives the result from the server.

If a timestamp is recorded as each of these four events occurs, the calculations become easy: sr minus cs is the network delay of the request, ss minus sr is the service execution time, cr minus ss is the network delay of the response, and cr minus cs is the total time for the whole service invocation.
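As a small worked sketch of that arithmetic (the class and field names are illustrative, not Sleuth's API), assuming the four timestamps have already been recorded in microseconds:

// Illustrative only: derives the four durations described above from the recorded timestamps.
public class SpanTiming {

    long cs; // Client Sent
    long sr; // Server Received
    long ss; // Server Sent
    long cr; // Client Received

    long requestNetworkDelay()  { return sr - cs; } // network latency of the request
    long serverProcessingTime() { return ss - sr; } // time the server spent executing
    long responseNetworkDelay() { return cr - ss; } // network latency of the response
    long totalInvocationTime()  { return cr - cs; } // whole round trip seen by the client
}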

In practice a span records more than these parameters; it can also carry information such as the name of the calling service, the name of the called service, the result returned, the IP address, and so on. Finally, we assemble the spans that share the same parentId into one larger span block, and a complete call chain is formed.

Environment preparation

The sleuth-demo aggregate project uses Spring Boot 2.2.4.RELEASE and Spring Cloud Hoxton.SR1.

  • eureka-server: the registry
  • eureka-server02: the registry
  • gateway-server: the Spring Cloud Gateway service gateway
  • product-service: the product service; provides an interface for querying a product by primary key at http://localhost:7070/product/{id} and an interface for querying products by multiple primary keys at http://localhost:7070/product…
  • order-service: the order service; provides an interface for querying an order by primary key at http://localhost:9090/order/{id}; the order service calls the product service (a conceptual sketch of that call follows this list).
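The demo's source code is not reproduced here, but conceptually the order service's call to the product service might look like the sketch below. The controller name, the injected RestTemplate, and the hard-coded URL are assumptions for illustration; the relevant point is that Sleuth instruments any RestTemplate declared as a Spring bean, so the trace headers are propagated to product-service automatically.

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

// Conceptual sketch only; the article's actual demo code is not shown.
@RestController
public class OrderController {

    // Must be declared as a @Bean elsewhere so that Sleuth can instrument it.
    private final RestTemplate restTemplate;

    public OrderController(RestTemplate restTemplate) {
        this.restTemplate = restTemplate;
    }

    @GetMapping("/order/{id}")
    public String getOrder(@PathVariable Long id) {
        // Hypothetical downstream call from order-service to product-service.
        String product = restTemplate.getForObject("http://localhost:7070/product/" + id, String.class);
        return "order " + id + " -> " + product;
    }
}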

Introductory case

Add the dependency

Add the spring-cloud-starter-sleuth dependency to the projects that need link tracing (service gateway, product service, order service).

<!-- spring cloud sleuth dependency -->
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-sleuth</artifactId>
</dependency>

Logging

Add a logback.xml log file to each project that requires link tracing, with the following content (the logback output level must be DEBUG):

Change the project name in the log.path property to match each project.

Log core configuration:

%d{yyyy-MM-dd HH:mm:ss.SSS} [${applicationName},%X{X-B3-TraceId:-},%X{X-B3-SpanId:-}] [%thread] %-5level %logger{50} - %msg%n
<?xml version="1.0" encoding="UTF-8"?>

<!-- Log levels from low to high are TRACE < DEBUG < INFO < WARN < ERROR < FATAL; if the level is set to WARN, nothing below WARN is output -->
<!-- scan: when set to true, the configuration file is reloaded if it changes; the default is true -->
<!-- scanPeriod: interval at which the configuration file is checked for modifications; if no time unit is given, the default unit is milliseconds. Takes effect only when scan is true. The default interval is 1 minute. -->
<!-- debug: when set to true, logback's internal log messages are printed so its runtime status can be observed; the default is false -->
<configuration scan="true" scanPeriod="10 seconds">
    <!-- Logging context name -->
    <contextName>my_logback</contextName>
    <!-- name is the variable's name and value is its value; the defined value is inserted into the logger context and can then be referenced with "${}" -->
    <property name="log.path" value="${catalina.base}/gateway-server/logs"/>
    <!-- Load values from the Spring configuration -->
    <springProperty scope="context" name="applicationName" source="spring.application.name" defaultValue="localhost"/>
    <!-- Log output pattern -->
    <property name="LOG_PATTERN" value="%d{yyyy-MM-dd HH:mm:ss.SSS} [${applicationName},%X{X-B3-TraceId:-},%X{X-B3-SpanId:-}] [%thread] %-5level %logger{50} - %msg%n"/>

    <!-- Output to the console -->
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <!-- This appender is meant for development, so only the lowest level is configured; the console outputs messages at or above this level -->
        <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
            <level>DEBUG</level>
        </filter>
        <encoder>
            <pattern>${LOG_PATTERN}</pattern>
            <!-- Character set -->
            <charset>UTF-8</charset>
        </encoder>
    </appender>

    <!-- Output to files -->
    <!-- Time-based rolling output for DEBUG level logs -->
    <appender name="DEBUG_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <!-- Path and name of the log file currently being written -->
        <file>${log.path}/log_debug.log</file>
        <!-- Log file output pattern -->
        <encoder>
            <pattern>${LOG_PATTERN}</pattern>
            <charset>UTF-8</charset> <!-- Character set -->
        </encoder>
        <!-- Rolling policy: roll by date and by size -->
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <!-- Log archiving -->
            <fileNamePattern>${log.path}/debug/log-debug-%d{yyyy-MM-dd}.%i.log</fileNamePattern>
            <timeBasedFileNamingAndTriggeringPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP">
                <maxFileSize>100MB</maxFileSize>
            </timeBasedFileNamingAndTriggeringPolicy>
            <!-- Number of days to keep log files -->
            <maxHistory>15</maxHistory>
        </rollingPolicy>
        <!-- This file records only DEBUG level logs -->
        <filter class="ch.qos.logback.classic.filter.LevelFilter">
            <level>DEBUG</level>
            <onMatch>ACCEPT</onMatch>
            <onMismatch>DENY</onMismatch>
        </filter>
    </appender>

    <!-- Time-based rolling output for INFO level logs -->
    <appender name="INFO_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <!-- Path and name of the log file currently being written -->
        <file>${log.path}/log_info.log</file>
        <!-- Log file output pattern -->
        <encoder>
            <pattern>${LOG_PATTERN}</pattern>
            <charset>UTF-8</charset>
        </encoder>
        <!-- Rolling policy: roll by date and by size -->
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <!-- Daily archive path and file name format -->
            <fileNamePattern>${log.path}/info/log-info-%d{yyyy-MM-dd}.%i.log</fileNamePattern>
            <timeBasedFileNamingAndTriggeringPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP">
                <maxFileSize>100MB</maxFileSize>
            </timeBasedFileNamingAndTriggeringPolicy>
            <!-- Number of days to keep log files -->
            <maxHistory>15</maxHistory>
        </rollingPolicy>
        <!-- This file records only INFO level logs -->
        <filter class="ch.qos.logback.classic.filter.LevelFilter">
            <level>INFO</level>
            <onMatch>ACCEPT</onMatch>
            <onMismatch>DENY</onMismatch>
        </filter>
    </appender>

    <!-- Time-based rolling output for WARN level logs -->
    <appender name="WARN_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <!-- Path and name of the log file currently being written -->
        <file>${log.path}/log_warn.log</file>
        <!-- Log file output pattern -->
        <encoder>
            <pattern>${LOG_PATTERN}</pattern>
            <charset>UTF-8</charset> <!-- Character set -->
        </encoder>
        <!-- Rolling policy: roll by date and by size -->
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>${log.path}/warn/log-warn-%d{yyyy-MM-dd}.%i.log</fileNamePattern>
            <!-- Each log file is at most 100MB -->
            <timeBasedFileNamingAndTriggeringPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP">
                <maxFileSize>100MB</maxFileSize>
            </timeBasedFileNamingAndTriggeringPolicy>
            <!-- Number of days to keep log files -->
            <maxHistory>15</maxHistory>
        </rollingPolicy>
        <!-- This file records only WARN level logs -->
        <filter class="ch.qos.logback.classic.filter.LevelFilter">
            <level>WARN</level>
            <onMatch>ACCEPT</onMatch>
            <onMismatch>DENY</onMismatch>
        </filter>
    </appender>

    <!-- Time-based rolling output for ERROR level logs -->
    <appender name="ERROR_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <!-- Path and name of the log file currently being written -->
        <file>${log.path}/log_error.log</file>
        <!-- Log file output pattern -->
        <encoder>
            <pattern>${LOG_PATTERN}</pattern>
            <charset>UTF-8</charset> <!-- Character set -->
        </encoder>
        <!-- Rolling policy: roll by date and by size -->
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>${log.path}/error/log-error-%d{yyyy-MM-dd}.%i.log</fileNamePattern>
            <timeBasedFileNamingAndTriggeringPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP">
                <maxFileSize>100MB</maxFileSize>
            </timeBasedFileNamingAndTriggeringPolicy>
            <!-- Number of days to keep log files -->
            <maxHistory>15</maxHistory>
            <!-- Total log size capped at 10 GB -->
            <totalSizeCap>10GB</totalSizeCap>
        </rollingPolicy>
        <!-- This file records only ERROR level logs -->
        <filter class="ch.qos.logback.classic.filter.LevelFilter">
            <level>ERROR</level>
            <onMatch>ACCEPT</onMatch>
            <onMismatch>DENY</onMismatch>
        </filter>
    </appender>

    <!-- For loggers whose class path starts with com.example.logback, set the level to warn and output only to the console -->
    <!-- This logger specifies no appender, so it inherits the appenders defined on the root node -->
    <!-- <logger name="com.example.logback" level="warn"/> -->

    <!-- This logger can be obtained via LoggerFactory.getLogger("myLog") -->
    <!-- Because this logger automatically inherits root's appenders, and root already has a console appender while this one adds another, -->
    <!-- omitting additivity="false" would cause each log entry to be printed to the console twice -->
    <!-- additivity controls whether the appenders configured on the root logger are also used for output -->
    <logger name="myLog" level="INFO" additivity="false">
        <appender-ref ref="CONSOLE"/>
    </logger>

    <!-- Root log output level and appenders -->
    <root level="DEBUG">
        <appender-ref ref="CONSOLE"/>
        <appender-ref ref="DEBUG_FILE"/>
        <appender-ref ref="INFO_FILE"/>
        <appender-ref ref="WARN_FILE"/>
        <appender-ref ref="ERROR_FILE"/>
    </root>

</configuration>
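Nothing special is needed in application code for the trace and span IDs to show up in these logs: Sleuth puts them into the logging MDC under the X-B3-TraceId and X-B3-SpanId keys referenced by LOG_PATTERN, so any ordinary SLF4J statement picks them up. A minimal, hypothetical example:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical controller; the only point is that plain SLF4J logging is enough.
@RestController
public class DemoController {

    private static final Logger log = LoggerFactory.getLogger(DemoController.class);

    @GetMapping("/demo")
    public String demo() {
        // Printed as: ... [applicationName,<traceId>,<spanId>] ... handling /demo
        log.info("handling /demo");
        return "ok";
    }
}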

Access

Visit: http://localhost:9000/order-s. The results are as follows:

The service gateway prints:

[gateway-server,95aa725089b757f8,95aa725089b757f8]

The product service prints:

[product-service,95aa725089b757f8,e494e064842ce4e8]

The order service prints:

[order-service,95aa725089b757f8,f4ee41a6dcf08717]

From the printed information, the traceId of the entire chain is 95aa725089b757f8, and the spanIds are e494e064842ce4e8 and f4ee41a6dcf08717.

Checking log files by hand is not a good approach, and as the number of microservices and log files grows, querying them becomes more and more cumbersome. Spring officially recommends using Zipkin for link tracing: Zipkin aggregates the logs and provides visual presentation and full-text search.

Use Zipkin for link tracing

What is Zipkin

Zipkin is an open-source distributed real-time data tracing system developed and contributed by Twitter, designed on the basis of the Google Dapper paper. Its main function is to aggregate real-time monitoring data from heterogeneous systems.

It collects trace data for request chains on each server and offers a REST API that lets us query that data, so we can monitor a distributed system in real time, promptly notice rising latency, and track down the root of performance bottlenecks. Besides the development-facing API, it provides a convenient UI component: each service reports timing data to Zipkin, and Zipkin generates dependency graphs from the call relationships, helping us search trace information intuitively and analyze the details of a request chain. Zipkin provides pluggable data storage: In-Memory, MySQL, Cassandra, and Elasticsearch.

There are other mature distributed tracing implementations as well, such as Naver's PinPoint, Apache's HTrace, Alibaba's EagleEye, JD's Hydra, Sina's Watchman, Meituan-Dianping's CAT, Apache SkyWalking, and so on.

How it works

There are four components that make up Zipkin:

  • Collector: Collector component that processes trace information sent from external systems and converts it into a Span format processed internally by Zipkin for subsequent storage, analysis, presentation, and so on.
  • Storage: a Storage component that processes trace information received by the collector and stores the information in memory by default. You can modify the Storage policy to use other Storage components, such as MySQL and Elasticsearch.
  • Web UI: UI component, an upper layer application based on an API component implementation, that provides Web pages for displaying call chains and system dependencies in Zipkin.
  • RESTful API: An API component that provides an interface for querying data in storage on the Web UI.

Zipkin has two parts: the Zipkin server and the Zipkin client. The clients are the microservice applications themselves; each client is configured with the server's URL, generates the corresponding Trace and Span information, and sends it to the server. There are two ways to send the data: over a message bus such as RabbitMQ, or as HTTP messages.

Server Deployment

The server is a standalone executable JAR package that can be downloaded from https://search.maven.org/remote_content?g=io.zipkin&a=zipkin-server&v=LATEST&c=exec and started with the java -jar zipkin.jar command; the default port is 9411. The JAR we downloaded is zipkin-server-2.20.1-exec.jar, so the startup command is as follows:

java -jar zipkin-server-2.20.1-exec.jar

Visit: http://localhost:9411/ The results are as follows:

The latest version of the UI:

The previous version of the UI:

Client Deployment

Add the dependency

Add the spring-cloud-starter-zipkin dependency to the projects that require link tracing (service gateway, product service, order service).

<!-- spring cloud zipkin dependency -->
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-zipkin</artifactId>
</dependency>
The configuration file

Configure the Zipkin server address and the data transmission mode in the projects requiring link tracing (service gateway, product service, order service). The default configuration is as follows.

spring:
  zipkin:
    base-url: http://localhost:9411/ # Zipkin server address
    sender:
      type: web                      # Data transmission mode; web means sending data to the server as HTTP messages
  sleuth:
    sampler:
      probability: 1.0               # Percentage of data collected; default 0.1 (10%)
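If you prefer the message-bus transport mentioned earlier, the sender type can be switched to RabbitMQ. A sketch of that variant, assuming a local RabbitMQ broker and the spring-rabbit dependency on the classpath, might look like this (the Zipkin server must then also be configured to consume from RabbitMQ, for example via its RABBIT_ADDRESSES environment variable):

spring:
  zipkin:
    base-url: http://localhost:9411/
    sender:
      type: rabbit    # send span data over RabbitMQ instead of HTTP
  rabbitmq:           # assumed local broker; adjust to your environment
    host: localhost
    port: 5672
  sleuth:
    sampler:
      probability: 1.0
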
Access

Visit: http://localhost:9000/order-s. Here are the results:

The new version operates as follows:

Visit http://localhost:9411/, filter by time, and click the search button; the results are as follows:

Click the corresponding trace information to view the details of the request link.

You can view the dependency relationships of services on a link.

The old version operates as follows:

Visit http://localhost:9411/ and click the find button; the results are as follows:

Click the corresponding trace information to view the details of the request link.

You can view the dependency relationships of services on a link.

Zipkin Server stores trace data in memory by default, which is not suitable for production environments: if the server is shut down, restarted, or crashes, historical data disappears. Zipkin's storage policy can be changed to use other storage components, such as MySQL and Elasticsearch.
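As a rough sketch of how that switch is made when starting the server (the environment variables below are standard zipkin-server options, but verify them against the documentation of the version you actually run; hosts and credentials are placeholders):

# Elasticsearch storage (assumes a local cluster)
STORAGE_TYPE=elasticsearch ES_HOSTS=http://localhost:9200 java -jar zipkin-server-2.20.1-exec.jar

# MySQL storage (assumes a zipkin database initialized with Zipkin's MySQL schema)
STORAGE_TYPE=mysql MYSQL_HOST=localhost MYSQL_USER=zipkin MYSQL_PASS=zipkin MYSQL_DB=zipkin java -jar zipkin-server-2.20.1-exec.jar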