TL;DR

This article summarizes the design dimensions of call chain tracing systems as follows: call chain data model, metadata structure, causality, sampling strategy, and data visualization. These five dimensions form an analysis framework that helps us deconstruct any call chain tracing system on the market in theory, and select technology and design systems according to the application scenario in practice. If you are interested in researching related systems, join the Database of Tracing Systems project to survey existing projects and build a database of call chain tracing systems together.

Introduction

This article does not assume any theoretical knowledge or practical experience of call chain tracing systems, but it does assume some conceptual or hands-on familiarity with microservice architecture. After reading, readers should be able to grasp the core design dimensions of call chain tracing systems, understand the design trade-offs, use these dimensions to analyze the implementations of new and old tracing systems on the market, and even apply them to technology selection and system design in production practice.

The problem

Observability of microservices

Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.

– Melvin E. Conway

If there were a discipline called the sociology of software, Conway's Law would be one of its fundamental laws. If the entire internal information system of an Internet company is viewed as one system, its module structure will converge toward the company's organizational structure. As a company grows, its organizational structure evolves from flat to multi-tiered, information has to travel through more links, communication efficiency declines, and the efficiency of corporate action suffers. Whether measured by familiarity among team members or by alignment of department goals, communication within a department is far more efficient than communication across departments, so when the module structure approximates the organizational structure, the company's overall communication efficiency approaches its maximum. The splitting of teams is usually accompanied by the splitting of services, which is why many companies move to microservices as their business grows. After adopting microservices, the company's information system inevitably becomes a distributed system. Despite the benefits of distributed systems, such as continuous integration, incremental deployment, horizontal scaling, and fault isolation, their visibility is far lower than that of single-machine systems, and almost no one holds the full picture of the company's information system.

The ultimate goal of any distributed system is to “give developers distributed capabilities with a single-machine feel.” The call chain tracing system is an integral part of that ideal. By collecting call chain data, it helps developers shift from a machine-centric to a request-centric view when observing distributed system behavior, and together with log and monitoring metric data (telemetry) it allows microservice developers to reason about the whole while analyzing the parts, and to stay in control as the system scales up.

Usage scenarios

Once the call chain information of the system is connected, developers can carry out various forms of system behavior analysis on top of it. Common usage scenarios fall into the following categories:

Anomaly detection

Anomaly detection refers to locating and troubleshooting requests that cause abnormal system behavior. These requests are usually rare, and their occurrence is a low-probability event. Although the probability of an anomalous event being sampled is very low, its information entropy is high and it gives developers many details about the system they maintain. These details might include slow requests, slow queries, unbounded loop calls, error-level logs, and problematic logic branches not covered by tests. If the call chain tracing system can proactively surface anomalies for developers, potential risks are exposed early and can be nipped in the bud.

Steady-state analysis

Steady-state analysis refers to analyzing every aspect of microservice status under normal traffic. The analysis granularity may be a single interface, a single service, or multiple services; the analysis scope may be a single request or multiple requests; the analysis angle may cover instrumentation metrics, dependencies, traffic volumes, and so on. Steady-state analysis usually reflects the health of the system's major processes. Configuration changes, such as modifying storage nodes or the client log reporting frequency, may be reflected in the system's steady state. Steady-state analysis also has many sub-scenarios, such as:

  • Steady-state performance analysis: locate and troubleshoot steady-state performance problems of the system. These problems usually have a similar origin to anomalies, but their impact is not severe enough to trigger an alarm.
  • Service dependency analysis: construct an interface-level dependency graph, where the nodes are usually interfaces or services, the edges are the invocation relationships between them, and the edge weights represent traffic. Construction can be offline or online, corresponding to a static or dynamic dependency graph. This information can be exposed to upstream applications, such as APM, in the form of an underlying API.

Distributed profiling

Many programming languages provide profiling tools, such as Go's pprof, which sample the consumption of different resources such as CPU, memory, and goroutines, analyze the resource usage patterns of different modules within a process, and finally present the results to developers through visualizations such as call trees or flame graphs. Distributed profiling is the distributed version of such tools: developers turn on a profiling switch, sample the resource usage (such as latency) across microservices, and then analyze the performance bottleneck of an interface or service as a whole through visualizations similar to the single-machine ones.
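
As a single-machine point of reference, Go's pprof can be exposed over HTTP with just a few lines; a minimal sketch (the /debug/pprof endpoints are the ones registered by the standard net/http/pprof package, the port is arbitrary):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose profiling endpoints; inspect with e.g.
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```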

Resource attribution

The main question resource attribution answers is: “Who should pay for the cost of my service?” It requires associating resource consumption or occupation with the requester; resource attribution is also the basis of cost analysis.

Load modeling

Load modeling mainly refers to analyzing and predicting system behavior. The question answered in this scenario can be phrased as: “If X changes, how will the state of the overall system or of a critical path change?” Common applications include capacity estimation, full-link load testing, chaos testing, and so on.

Basic implementation approach

How to trace a call chain

In a microservice architecture, the information of each call chain is scattered across the microservices the request passes through, and this information must be collected and connected by some technical means to reconstruct the complete call chain. There are two basic approaches: one is the black-box method, which requires no code intrusion; the other is metadata propagation, which does.

The black-box method

The black-box method, as its name implies, treats the entire collection of microservices as a black box. After logs in a specific format are collected into a storage center, statistical methods are used to infer and rebuild the call chain:

The advantage of this method is that it requires no code intrusion, but the disadvantages are also obvious:

  • It is difficult to infer the relationships of asynchronous computing tasks
  • Statistical analysis requires a certain amount of data and computing resources and is time-consuming
  • Statistical analysis has to deal with uneven data sets, such as interfaces that receive few requests

The black-box method is rarely used in production practice and serves mainly as a theoretical reference.

Metadata propagation method

The principle of metadata propagation is to inject the necessary call chain metadata into the messages exchanged between microservices; each service is then responsible for reporting the portion of the call chain information it records, including the call chain identifier, its upstream service, and other information. Finally, the back-end system uses this information to rebuild the call chain. The schematic diagram is as follows:

Metadata propagation is the opposite of the black-box method: its advantage is that the reconstructed call chain is accurate, and its disadvantage is code intrusion. In practice, however, this instrumentation code is usually encapsulated in a unified microservice governance framework, so front-line business developers are not exposed to it. Almost all call chain tracing systems used in production rely on metadata propagation.
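
To make the idea concrete, here is a minimal sketch in Go of injecting and extracting metadata through HTTP headers. The header names are hypothetical; real systems define their own formats (for example the W3C traceparent header or Jaeger's uber-trace-id).

```go
package propagation

import "net/http"

// Hypothetical header names; real systems define their own formats.
const (
	headerTraceID      = "X-Trace-Id"
	headerParentSpanID = "X-Parent-Span-Id"
)

// Inject writes the current call chain metadata into an outgoing request.
func Inject(req *http.Request, traceID, spanID string) {
	req.Header.Set(headerTraceID, traceID)
	req.Header.Set(headerParentSpanID, spanID)
}

// Extract reads the metadata from an incoming request; the callee records
// its own span with this parent and reports it asynchronously.
func Extract(req *http.Request) (traceID, parentSpanID string) {
	return req.Header.Get(headerTraceID), req.Header.Get(headerParentSpanID)
}
```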

Basic architecture of a call chain tracing system

While there is a wide variety of call chain tracing systems on the market, their basic architecture is fairly consistent:

Instrumentation

Each microservice adds instrumentation (buried points) at cross-process junctions, such as:

  • Sending an HTTP/RPC request and receiving an HTTP/RPC response
  • Database queries
  • Cache reads and writes
  • Producing and consuming messages through message middleware

Each instrumentation point records the name of the cross-process operation, the start time, the end time, and the necessary tag key-value pairs; these are the pieces of the call chain jigsaw puzzle.
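
As an illustration, instrumenting an outbound HTTP call with the opentracing-go API might look roughly like this (a sketch; the operation name and tag are illustrative and error handling is simplified):

```go
package instrumentation

import (
	"net/http"

	"github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
)

func tracedGet(url string) (*http.Response, error) {
	// Record the cross-process operation as a span with start/end times.
	span := opentracing.GlobalTracer().StartSpan("http.get")
	defer span.Finish()
	ext.HTTPUrl.Set(span, url)

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	// Propagate metadata so the callee can attach its spans to this trace.
	_ = opentracing.GlobalTracer().Inject(
		span.Context(), opentracing.HTTPHeaders,
		opentracing.HTTPHeadersCarrier(req.Header))
	return http.DefaultClient.Do(req)
}
```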

Sampling

In practice, given the cost of computing and storage resources or the needs of specific application scenarios, it is not necessary to collect all instrumentation data. Therefore, many call chain tracing systems require data to be reported according to some sampling strategy, aiming to balance costs and benefits and improve the return on investment.

Reporting

Data can be sent directly from a service instance to the processing center, or reported through an agent on the same host. One advantage of agent-based reporting is that logic such as compression, filtering, and configuration changes can be implemented uniformly in the agent, while the service only needs a thin layer of instrumentation and sampling logic; this minimizes the impact of the tracing scheme on the business service itself. Another benefit is that the discovery mechanism of the data processing service is transparent to the service. For these reasons, deploying an agent on each host is the recommended setup for many call chain tracing systems.

Processing

The call chain data is reported to the processing center, usually called the collector. The collector performs the necessary post-processing, such as data filtering, data tagging, tail sampling, and data modeling, then writes the data in batches to different storage services and builds the necessary indexes.

Storage and indexing

Call chain data has two main characteristics: it is voluminous, and its value decreases over time. Therefore, in addition to the data model, scalability and support for data retention policies should be considered when selecting a storage service. To make queries efficient, appropriate indexes also need to be built over the stored data.

Visualization

Visualization is one of the most important aspects of using call chain data efficiently, and a high-quality interactive experience helps developers quickly obtain the information they need. In general, visualization comes in two granularities: views of a single call chain and aggregate analysis across multiple call chains, with many visualization options at each granularity.

Scalability

Without sampling, the volume of data the tracing system must process is proportional to the total number of requests across the whole site. If the average request passes through 20 services, the call chain tracing system has to handle 20 times the site's total request volume, so every layer of its architecture needs to be scalable.

If services report directly through the SDK, horizontal scaling of the reporting layer comes automatically with adding service instances; if agent-based reporting is used, it scales by adding hosts. The data processing layer should in principle be stateless and scale horizontally; because much of the call chain processing logic needs all the data of the same call chain, load balancing by TraceID is a natural choice. The scalability of data storage is guaranteed by the storage services used.
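
A minimal sketch of the TraceID-based routing idea (the hash function and partition count are arbitrary choices here):

```go
package collector

import "hash/fnv"

// partitionFor routes all spans of one trace to the same collector
// partition, so processing that needs the whole call chain sees it complete.
func partitionFor(traceID string, numPartitions uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return h.Sum32() % numPartitions
}
```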

Overload control

Instantaneous peaks are a common traffic pattern, so every component of the call chain tracing system also needs overload control logic. It is necessary both to prevent instrumentation and reporting from affecting online services under peak traffic, and to respect the capacity of each back-end module of the tracing system.

During data reporting and processing, the agent or collector can maintain local queues to smooth peaks, but once the capacity limit of a local queue is exceeded, there is a trade-off between data loss and timeliness. If data loss can be tolerated, data that cannot be processed can simply be dropped, much as a router drops packets; if timeliness during peaks can be sacrificed, local queues can be replaced with high-throughput, high-capacity message middleware such as Kafka.
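
A minimal sketch of the drop-on-overflow option, assuming an in-process bounded queue:

```go
package agent

type spanData struct{ /* reported span fields */ }

// A queue with fixed capacity; Enqueue never blocks the business goroutine.
var queue = make(chan spanData, 10000)

// Enqueue reports whether the span was accepted; when the queue is full the
// data is dropped, trading completeness for bounded memory and latency.
func Enqueue(s spanData) bool {
	select {
	case queue <- s:
		return true
	default:
		return false // queue full: drop, like a router dropping packets
	}
}
```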

Design dimensions

In a 2014 paper (So, you want to trace your distributed system?), the authors identified four design dimensions after analyzing multiple call chain tracing systems of the time:

  • Which causal relationships should be preserved?
  • How should causal relationships be tracked?
  • How should sampling be used to reduce overhead?
  • How should traces be visualized?

In this section, we take that paper as a starting point and introduce five design dimensions of call chain tracing systems: call chain data model, metadata structure, causality, sampling strategy, and data visualization.

1. Call chain data model

Every call chain tracing system needs to model call chain data appropriately, and the choice of data model affects instrumentation, collection, processing, querying, and other stages. The most common data models are the Span Model and the Event Model. You can read this article if you are interested in the topic.

Span Model

First proposed by Google in Dapper, the Span Model presents a computation task, such as handling a user request, as a set of spans, each representing one segment of the computation with a recorded start time and end time. Each span also records the span that triggered it, its parent span, which expresses causality in the system: if span A triggers span B, then span A is the parent of span B. Because the parent-child relationship implies cause and effect, spans cannot form a cycle, otherwise there would be a causal loop, so the spans of a single trace can usually be represented as a tree, as follows:

Note that in the Span Model, each span has only one parent node, that is, a given piece of computation has only one cause. A tracing system using the Span Model needs to explicitly finish a span at the instrumentation point, after which the span's information is reported to the processing center. Logically, a parent node finishes reporting only after its child nodes do; from the viewpoint of the reporting channel, however, each is reported by its local thread and they are independent of each other.
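
In code, a span in this model boils down to something like the following sketch (field names are illustrative, not any particular system's schema):

```go
package model

import "time"

type Span struct {
	TraceID      string            // identifies the whole call chain
	SpanID       string            // identifies this segment of computation
	ParentSpanID string            // empty for the root span; at most one parent
	Operation    string            // e.g. "HTTP GET /api/orders"
	StartTime    time.Time
	EndTime      time.Time
	Tags         map[string]string // extra key-value annotations
}
```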

The Span Model is conceptually aligned with the call stack and is easy for engineers to understand and master. However, it is not expressive enough to capture all types of computational dependencies, such as multiple causes producing one effect:

Event Model

X-Trace was the first project to use the Event Model. In X-Trace, an event is a moment in a computation task, and causality within the task is represented by edges between events; any two events can be connected by an edge. It is worth noting that an edge here actually represents the happens-before relationship described by Lamport (1978): if there is an edge from event A to event B, then A happens-before B and may therefore have an effect on B. In simple scenarios, an edge can be thought of as a triggering or dependency relationship, both of which are subsets of the happens-before relationship. Unlike the Span Model, each event in the Event Model can have multiple incoming edges, which makes it easy to express complex relationships such as fork/join or fan-in/fan-out. The Event Model supports a more fine-grained presentation of call chain data, as shown in the following example:

Here the dashed boxes represent threads of execution, the dots represent events, and the arrows represent edges. For ease of understanding and comparison, spans are also drawn as solid-line boxes.

The advantage of the Event Model is its expressive power; its disadvantage is that, compared with the Span Model, it is more complex and harder for engineers to adopt, and the call-stack-like visualization of the Span Model is more concise.
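
A corresponding sketch of the Event Model, where an event may have several incoming edges (field names are again illustrative):

```go
package model

import "time"

type Event struct {
	TraceID string
	EventID string
	Parents []string  // IDs of events that happened-before this one; fan-in allowed
	Label   string    // e.g. "request received", "RPC sent"
	Time    time.Time // the moment the event occurred
}
```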

2. Metadata structure

First, to avoid ambiguity: metadata here refers to the tracing-related data passed between processes along the call chain. Almost all call chain tracing systems use metadata propagation to track call chains across processes, so how should the metadata structure passed between processes be designed? Along the two dimensions of content mutability and length limitation, metadata structures can be divided into three types: static fixed-length, dynamic fixed-length, and dynamic variable-length.

Static fixed-length

Static fixed-length metadata has a fixed length and does not change during propagation; it contains only a single fixed request-level unique identifier, the TraceID. The call chain tracing system can retrieve all information related to the same request through the TraceID and then establish causality. Since the metadata contains only the TraceID, the system has to rely on external information, such as thread IDs and the host's wall clock, to infer causality.

Dynamic fixed-length

Dynamic fixed-length metadata has a fixed length, but its content can change during propagation. In addition to the TraceID, it also carries an identifier of the request's source, such as a SpanID or EventID, which establishes the upstream-downstream relationship between two nodes.

Dynamic variable-length

Dynamic variable-length metadata can change in both length and content as it propagates. It usually contains all or part of the information of every upstream node; after the current node finishes processing, its own information is appended to the upstream information and passed downstream. Each node thus receives all the information of the call chain up to itself, so no additional component is needed to rebuild the call chain.
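
The three variants can be contrasted with a small sketch (types and field names are illustrative):

```go
package metadata

// Static fixed-length: only the request-level ID is propagated.
type StaticMetadata struct {
	TraceID string
}

// Dynamic fixed-length: the length is constant, but the source identifier
// is rewritten at every hop.
type DynamicFixedMetadata struct {
	TraceID      string
	ParentSpanID string // updated by each node before calling downstream
}

// Dynamic variable-length: the path accumulated so far is carried along,
// so no separate reconstruction step is needed.
type DynamicVariableMetadata struct {
	TraceID string
	Path    []string // identifiers of all upstream nodes, appended per hop
}
```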

3. Causality

Causal relationships may exist between computing tasks within the same request (intra-request), for example:

  • Process P1 calls process P2 through HTTP or RPC
  • Process P1 writes data to or reads data from the storage service
  • Process P1 produces messages to MQ, and process P2 consumes the messages and processes them

Causal relationships may also exist between computing tasks of different requests (inter-request), for example:

  • Requests R1 and R2 try to acquire a distributed lock at the same time; R1 succeeds and R2 fails
  • Request R1 writes data to a local cache; request R2 writes data to the same cache and triggers batch processing
  • Request R1 writes data to a storage system; request R2 reads the corresponding data and processes it

In practice, developers tend to analyze problems from the perspective of a single request, so call chain tracing systems usually do not focus on causality between different requests, though they may retain the corresponding expressiveness in the data model. Typically, SDK providers help developers track causality between computation tasks of the same request by instrumenting all cross-process junctions, such as HTTP/RPC calls, database access, and message production and consumption. But sometimes a computation originating from request A is triggered by request B, as in the following example:

Request One commits data d1 to a write-back cache, and request Two then commits data d2 to the same cache, triggering d1 to be written out to persistent storage. How d1's write-out is attributed determines whether the call chain tracing system takes the submitter-preserving or the trigger-preserving perspective.

Submitter perspective

The submitter perspective means that when an aggregate or batch operation is triggered by another request, the operation is attributed to the submitter. As shown on the left of the figure above, the data left in the write-back cache by request One is finally flushed because request Two writes data, and the flush operation is attributed to request One.

Trigger perspective

The trigger perspective means that when an aggregate or batch operation is triggered by another request, the operation is attributed to the trigger. As shown on the right of the figure above, the data left in the write-back cache by request One is finally flushed because request Two writes data, and the flush operation is attributed to request Two.

4. Sampling strategy

The total volume of call chain data is proportional to business volume, and collecting it in full puts two kinds of pressure on the company's systems as a whole:

  • Network I/O pressure on each service due to data reporting
  • Computing and storage pressure on the call chain tracing services due to data collection and analysis

To relieve these two pressures, sampling is a necessary component of most call chain tracing systems. Sampling strategies commonly used in practice fall into three categories:

  • Coherent sampling: head-based coherent sampling
  • Coherent sampling: tail-based coherent sampling
  • Unitary sampling

Their schematic is shown below:

Head-based coherent sampling

Head-based coherent sampling makes the sampling decision as soon as the request enters the system, and the decision is passed along with the metadata to downstream services to keep the sampling coherent. Because the decision is made early, the pressure on the system as a whole is small; but for the same reason the information available at decision time is minimal, so it is hard to ensure that the call chains collected are the valuable ones. Head-based coherent sampling has a variant: head-based coherent sampling with anomaly backpropagation. In addition to sampling at the head, each service node caches its most recent spans; once a downstream call fails, the microservice framework propagates this signal back to upstream nodes, ensuring that the anomalous call chain data is reported.
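
A minimal sketch of the head-based decision, assuming the sampled flag is simply carried in the propagated metadata (the probability value is arbitrary):

```go
package sampling

import "math/rand"

type TraceContext struct {
	TraceID string
	Sampled bool // decided once at the head, then propagated unchanged
}

// startTrace is called where the request first enters the system.
func startTrace(traceID string, probability float64) TraceContext {
	return TraceContext{
		TraceID: traceID,
		Sampled: rand.Float64() < probability, // e.g. 0.001 for 0.1%
	}
}

// Downstream services never re-decide; they only check ctx.Sampled before
// recording and reporting spans, which keeps the sampling coherent.
```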

Tail-based coherent sampling

Tail-based coherent sampling defers the sampling decision until the request is complete; before the decision is made, the system has to cache the data to keep the sampling coherent. Because the decision is made late, data must be reported in full and stored temporarily for a while, which increases both kinds of pressure mentioned above. But also because the decision is made later with more complete information, tail-based coherent sampling can use empirical rules to make sure the important call chains are captured.

Unitary sampling

Unitary sampling does not require coherence: each component in the system decides independently whether to sample, so this scheme usually cannot reconstruct the call chain of a single request.

5. Data visualization

The visualization of call chain data usually corresponds one-to-one with the application scenarios. Efficient visualization empowers engineers, shortens troubleshooting time, and improves the quality of R&D life.

Gantt Charts

Gantt charts are often used to show the call chain data of a single request. The following is the variant of the Gantt chart most commonly used in call chain tracing systems:

The left side of the figure is usually organized as a tree: the parent node represents the caller, the child node represents the callee, sibling nodes are concurrent, and time increases monotonically from top to bottom. The right side is a bar structure similar to a standard Gantt chart.

Swimlane charts

Swimlane charts can also be used to show the call chain data of a single request. Compared with Gantt charts, they are more detailed and are often used to show the more complex computational relationships of the Event Model. An example is shown below:

A swimlane, i.e. a dashed box, represents an execution unit of computation; dots represent events at given moments; arrows represent relationships between events.

Flow Graphs

Flow graphs are often used to show aggregate information about the call chain data of multiple similar requests, which should have exactly the same call chain structure. An example is shown below:

Nodes in the graph represent events occurring in the system, edges represent causality, and weights can represent the time difference between events; together they form a directed acyclic graph. The flow graph can even express fan-out and fan-in (i.e. fork and join) causal relationships, preserving more details of the call chain.

Call Graphs

Call graphs are used to show aggregate information for multiple requests that need not have exactly the same call chain structure. Nodes in a call graph represent services, modules, or interfaces in the system, edges represent invocation (causal) relationships, and weights can carry customized information such as traffic and resource occupancy. A cycle in the call graph implies a circular dependency in the system. The following is an example of a call graph:

Calling Context Trees

Calling context trees are used to show aggregate information for multiple requests, usually with different call chain structures. Each path from the root node to a leaf node is a real call path in the distributed system, as shown in the following example:

Flame Graphs

Flame graphs are often used to show the call stack time of a single-machine program, for example in Go's pprof. Structurally similar to the calling context tree, a flame graph also shows aggregate information for multiple requests, but in a different form that displays the time consumed by each component more intuitively, for example:

From dimensions to scenarios

Now that the design dimensions are clear, let's return to the scenarios mentioned at the beginning of this article and analyze how to make choices along these dimensions, using anomaly detection and distributed profiling as examples:

Anomaly detection: developers need to inspect the complete call chain when a request goes wrong, so coherent sampling is required; moreover, since problematic requests are low-probability events, only coherent sampling can guarantee that their data is captured. Developers tend to analyze problems in terms of the impact of each request, so the trigger perspective should be chosen for intra-request causality. Gantt charts and flow graphs are the visualizations suited to a single call chain. As for metadata structure, dynamic fixed-length metadata captures upstream-downstream relationships more accurately than static fixed-length metadata, and costs less on the network than dynamic variable-length metadata, whose real-time benefit is not important for anomaly detection; so dynamic fixed-length metadata is the more appropriate choice.

Distributed profiling: profiling helps developers examine performance bottlenecks at the call chain level, but the object of analysis is aggregate data and there is no requirement for the completeness of any single call chain, so unitary sampling is the cheapest adequate option. As with anomaly detection, the trigger perspective is more intuitive to developers and incurs no extra overhead, so the trigger perspective is chosen. The visualization choices for profiling are no surprise: calling context trees and flame graphs. As for metadata structure, if the call chain depth is bounded, dynamic variable-length metadata helps developers see profiling data sooner; if the depth is not bounded, dynamic fixed-length metadata also meets the requirement, at the cost of extra computation in the data processing stage.

The call chain data model affects the final effect and capability boundary of each scenario, but does not determine whether a scenario's solution works, so it is not discussed separately here. If you need to tackle multiple scenarios at the same time in practice, consider taking the union of the requirements along each design dimension.

Case study: Jaeger

Project history

Jaeger, whose name comes from the German word for hunter, was developed by Uber's internal observability team as a complete call chain tracing solution covering instrumentation, collection, and visualization. Jaeger was open-sourced in April 2017, entered the CNCF incubator in September 2017, and graduated from CNCF as a top-level project in October 2019.

Basic architecture

Jaeger's architecture closely resembles the basic architecture of call chain tracing systems described above, and it offers two deployment options, as shown in the following two figures:

The main difference is that a Kafka buffer is added between jaeger-collector and the database to absorb peak traffic. There is no single point of failure in the Jaeger back end: jaeger-collector, Kafka, and the databases (Cassandra and Elasticsearch) all support horizontal scaling.

Application scenario: Steady state analysis

Jaeger's website introduces its main features as follows:

  • Distributed Context Propagation
  • Distributed Transaction Monitoring
  • Root Cause Analysis
  • Service Dependency Analysis
  • Performance / Latency Optimization

Rebuilding call chain relationships requires propagating metadata between processes, so distributed context propagation is the foundation of call chain tracing; it is generally not used to propagate data unrelated to tracing, such as UID or DID, which is usually disseminated through the microservice governance framework. The remaining features, distributed transaction monitoring, root cause analysis, service dependency analysis, and performance/latency optimization, mainly analyze system behavior based on the collected call chain data and the dependency relationships between services and operations.

Call chain data model: Span Model

Jaeger's call chain data model complies with the OpenTracing specification and is a typical Span Model; its core data structure is shown in the figure below:

Here’s a concrete example:

There are two kinds of causal relationships between spans: ChildOf and FollowsFrom. In a ChildOf relationship, the parent node depends on the result of the child node's execution; in a FollowsFrom relationship, the parent node does not depend on the child node's result but still has a causal relationship with it.
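
With the opentracing-go API, the two relationships are expressed as span references when starting a child span; a sketch (operation names are illustrative):

```go
package example

import "github.com/opentracing/opentracing-go"

func spawnChildren(parent opentracing.Span) {
	tracer := opentracing.GlobalTracer()

	// ChildOf: the parent waits for and depends on this span's result.
	child := tracer.StartSpan("load-from-db",
		opentracing.ChildOf(parent.Context()))
	defer child.Finish()

	// FollowsFrom: causally triggered by the parent, but the parent does
	// not depend on its outcome (e.g. fire-and-forget audit logging).
	follower := tracer.StartSpan("emit-audit-event",
		opentracing.FollowsFrom(parent.Context()))
	defer follower.Finish()
}
```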

Causality: the user decides, the trigger perspective dominates

The call chain data model adopted by Jaeger can fully associate the different processes involved in the same request. Whether the submitter perspective or the trigger perspective is used depends on the party integrating Jaeger: the trigger perspective requires no extra effort, while the submitter perspective requires extra custom development, so in most cases the trigger perspective is used.

Metadata structure: dynamic fixed length

The metadata structure that Jaeger passes between processes is as follows:

TraceID identifies which call chain the current span belongs to; SpanID and ParentID establish the parent-child relationship between upstream and downstream; the amount of data in the baggage usually does not change. Taken together, Jaeger's metadata structure is dynamic fixed-length.
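
For reference, Jaeger's default HTTP propagation encodes this context in a single uber-trace-id header of the form {trace-id}:{span-id}:{parent-span-id}:{flags}, with baggage items carried in separate uberctx-* headers; a simplified sketch of reading it:

```go
package propagation

import (
	"net/http"
	"strings"
)

// jaegerContext holds the fields carried in the uber-trace-id header.
type jaegerContext struct {
	TraceID, SpanID, ParentID, Flags string
}

func parseUberTraceID(req *http.Request) (jaegerContext, bool) {
	parts := strings.Split(req.Header.Get("uber-trace-id"), ":")
	if len(parts) != 4 {
		return jaegerContext{}, false
	}
	return jaegerContext{parts[0], parts[1], parts[2], parts[3]}, true
}
```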

Sampling strategy: head-based coherent sampling

Currently, Jaeger supports three sampling strategies:

  • Constant: sample everything or nothing
  • Probabilistic: sample with a fixed probability
  • Rate Limiting: ensure each process samples at most k traces per unit of time

In addition to specifying the sampling configuration directly at SDK initialization, Jaeger also supports adjusting the sampling strategy remotely and dynamically, but the choice must still be one of the three above. To avoid missing call chain information for low-traffic requests, whose chance of being sampled is correspondingly low, the Jaeger team also proposed Adaptive Sampling, but that proposal has seen little progress since 2017.
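
With the Go client (jaeger-client-go), the sampler is chosen when the tracer is initialized; a sketch, assuming probabilistic sampling at 0.1%:

```go
package tracing

import (
	"io"

	"github.com/opentracing/opentracing-go"
	"github.com/uber/jaeger-client-go/config"
)

// newTracer initializes a Jaeger tracer with a fixed-probability sampler.
func newTracer(service string) (opentracing.Tracer, io.Closer, error) {
	cfg := config.Configuration{
		ServiceName: service,
		Sampler: &config.SamplerConfig{
			Type:  "probabilistic", // or "const", "ratelimiting", "remote"
			Param: 0.001,           // sample roughly 0.1% of traces
		},
	}
	return cfg.NewTracer()
}
```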

Either way, the decision to sample is made when the request enters the system, so the conclusion is that Jaeger currently supports head-based coherent sampling. It is worth noting that the Jaeger team has also been discussing the possibility of introducing tail-based coherent sampling, but nothing substantial has landed yet.

Data visualization: Gantt chart, call tree, call graph

The Jaeger-UI project provides rich visualization support for call chain data, including Gantt charts for individual requests, call trees, and call graphs for global services.

Gantt chart

Call tree

The call tree is still experimental and not yet an official feature.

Call graph

You can also focus on a single node so that the call graph shows only the services related to that node, known as the Focus Graph.

A database of call chain tracing systems

In 2014, Andy Pavlo launched the website dbdb.io, the Database of Databases, to analyze the dazzling variety of database systems on the market along a set of fixed dimensions. Inspired by that project, we can likewise analyze the call chain tracing systems on the market along the design dimensions discussed in this article, gaining a more systematic understanding and recording the results of the analysis. To that end, I set up the project Database of Tracing Systems. If you are interested, you are welcome to participate in the research and build a database of call chain tracing systems together.

References

  • Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
  • So, you want to trace your distributed system? Key design insights from years of practical experience
  • Canopy: An End-to-End Performance Tracing And Analysis System
  • End-to-End Tracing Models: Analysis and Unification
  • Tracing, Fast and Slow
  • Github: Database of tracing systems
  • Database of Databases
  • X-Trace: A Pervasive Network Tracing Framework
  • Time, Clocks, and the Ordering of Events in a Distributed System
  • Evolving Distributed Tracing at Uber Engineering
  • Jaeger Docs
  • OpenTracing Specification