An overview of the Dapper distributed tracing system

Modern Internet services are usually implemented as complex, large-scale distributed systems. Internet applications are built from collections of software modules that may be developed by different teams, implemented in different programming languages, and spread across thousands of servers in multiple data centers. Therefore, tools are needed that can help engineers understand system behavior and analyze performance problems.

Dapper, the distributed tracing system used in Google's production environment, came into being to meet this need. This article describes how such a large-scale tracing system can satisfy three requirements: low overhead, application-level transparency, and deployment at scale. Dapper's design borrows ideas from earlier distributed tracing systems, especially Magpie and X-Trace, but a few design choices were critical to its success in production, notably the use of sampling and the restriction of instrumentation to a small number of common libraries.

1. Introduction

We developed Dapper to collect more information about the behavior of complex distributed systems and present it to Google developers. Such systems are of special interest because large collections of inexpensive commodity servers, the carriers of Internet services, form a particularly cost-effective platform. Understanding system behavior in this context requires observing related activities across many different applications and servers.

Here is a search-related example that illustrates some of the challenges Dapper is designed to tackle. A front-end service might distribute a Web query across hundreds of query servers, each searching its own piece of the index. The query might also be sent to several other subsystems that process ads, check spelling, or look for specialized results such as images, videos, or news. Results from all of these subsystems are filtered and merged into the final results page. We call this search model "universal search." In total, a single universal search may invoke thousands of servers spanning many different services. Moreover, users are highly sensitive to search latency, and inefficiency in any subsystem inflates the end-to-end response time. An engineer who knows only that the overall query time is abnormal has no way of knowing which service call caused the problem or why that call performed poorly. First, the engineer may not even know precisely which services a universal search invokes: new services and pieces of services are brought online or modified all the time, whether to add user-visible features or to improve performance or security. Second, one cannot expect a single engineer to understand every service involved in a universal search, since each may have been developed or maintained by a different team. Third, these services and servers may be shared with other clients, so a performance problem in universal search may actually originate in another application. For example, a back-end service may handle many different request types, and a heavily utilized shared storage system such as Bigtable may be read and written continuously by the variety of applications running on it.

From this example we can see two fundamental requirements for Dapper: ubiquitous deployment and continuous monitoring. Ubiquity is important because even if only small portions of the system go unmonitored, the usefulness of a tracing system is severely compromised and its results may not be trusted. In addition, monitoring should be available around the clock, since anomalies or other notable system behaviors that occur only once are difficult or impossible to reproduce after the fact. From these two requirements we can directly derive three concrete design goals:

1. Low overhead: the impact of the tracing system on running services must be negligible. In some highly optimized services even a small performance penalty is easily noticeable and may force the team operating the service to turn the tracing system off.

2. Application-level transparency: application programmers should not need to be aware of the tracing system. A tracing system that relies on the active cooperation of application developers to function becomes fragile and frequently breaks because of bugs or omissions in the instrumentation embedded in applications, which violates the requirement of ubiquitous deployment. This is especially important in a fast-paced development environment such as Google's.

3. Scalability: the system must be able to handle the size of Google's services and clusters for at least the next few years.

An additional design goal is to make trace data available for analysis quickly after it is generated, ideally within a minute of the data arriving in the trace repository. Although a tracing system operating on data that is hours old is still valuable, the availability of fresh information enables a much faster reaction to anomalies in production.

True application-level transparency, perhaps the most challenging of these design goals, was achieved by restricting Dapper's core tracing instrumentation to a small corpus of ubiquitous common libraries such as threading, control flow, and RPC code. Using an adaptive sampling rate makes the tracing system scalable and keeps its performance overhead low, as discussed in Section 4.4. The resulting system also includes code to collect trace data, tools to visualize it, and libraries and APIs for analyzing large collections of traces. Although Dapper alone is sometimes sufficient for a developer to pinpoint the source of a problem, it is not intended to replace all other monitoring tools; we have found that Dapper's data tends to be used for performance investigations, and other monitoring tools retain their own uses.

1.1 Summary of related work

The design space of distributed system tracing tools has already been explored in several excellent prior works, of which Pinpoint[9], Magpie[3], and X-Trace[12] are the most closely related to Dapper. Those systems tended to be described in research papers early in their development, before their most important design choices could be evaluated. Dapper, by contrast, has been running and maturing in a large-scale production environment for years, so we decided that this paper would be best focused on what deploying Dapper has taught us, how our design decisions played out, and in which ways it has proved most useful. The value of Dapper as a platform that hosts performance analysis tools built on top of it, as much as a monitoring tool in its own right, is one of the unexpected results we can identify in such a retrospective evaluation.

Although Dapper shares many high-level design ideas with Pinpoint and Magpie, its implementation contains a number of new contributions to the field of distributed tracing. For example, sampling is necessary to achieve low overhead, especially in highly optimized Web services that tend to be extremely latency-sensitive. Perhaps more surprisingly, we found that a sampling rate as low as 1/1000 still provides sufficient information for many common uses of the tracing data.

Another important feature of our system is the degree of application-level transparency we were able to achieve. Our instrumentation is confined to a low enough level of the software stack that even a distributed system as large as Google web search could be traced without any additional annotations. While this transparency is easier to achieve because our deployment enjoys a certain degree of homogeneity, we showed that such homogeneity is sufficient to reach this level of transparency.

2. Dapper’s distributed tracing

Figure 1: The path taken through a simple serving system by a user request X. The nodes identified by letters represent different processes in a distributed system.

A tracing system for distributed services needs to record information about all the work done in the system on behalf of a given initiator. For example, Figure 1 shows a service with five servers: a front end (A), two middle tiers (B and C), and two back ends (D and E). When a user (the initiator in this use case) issues a request, it first reaches the front end, which then sends two RPCs to servers B and C. B responds immediately, but C requires work from back ends D and E before it can reply to A, which in turn answers the originating request. A simple and useful implementation of distributed tracing for such a request is to collect message identifiers and timestamped events for every send and receive action on each server.

Two classes of solutions have been proposed to associate all record entries with a given initiator (for example, request X in Figure 1) and stitch the information together: black-box and annotation-based monitoring schemes. Black-box schemes [1, 15, 2] assume there is no additional information beyond the message records described above and use statistical regression techniques to infer the association. Annotation-based schemes [3, 12, 9, 16] rely on the application or middleware to explicitly tag every record with a global identifier that links it back to the originating request. Although black-box schemes are lighter-weight than annotation-based ones, they need more data to achieve sufficient accuracy because they rely on statistical inference. The key disadvantage of annotation-based schemes is, obviously, the need to instrument code. In our environment, since all applications use the same threading model, control flow, and RPC system, we found it possible to restrict instrumentation to a small set of common libraries and achieve a monitoring system that is effectively transparent to application developers.

We tend to think of a Dapper trace as a tree structure built around nested RPC calls. However, our core data model is not restricted to our particular RPC framework; we also trace other activities such as Gmail SMTP sessions, HTTP requests from the outside world, and outbound queries to SQL servers. Formally, the Dapper trace model uses trees, spans, and annotations.

2.1 Trace trees and spans

In a Dapper trace tree, the tree nodes are the basic unit of work and each node refers to a span. The edges between nodes indicate a causal relationship between a span and its parent span. Independently of its place in the larger tree, a span is simply a log of timestamped records encoding the span's start and end times, any RPC timing data, and zero or more application-specific annotations, which are discussed in Section 2.3.

Figure 2: The causal and temporal relationships of five spans within a Dapper trace tree.

Figure 2 illustrates what spans look like within a larger trace. Dapper records a human-readable name for each span, along with the span's ID and the ID of its parent, in order to reconstruct the relationships among the spans of a single trace. A span without a parent ID is called a root span. All the spans of a given trace also share a common trace ID (not shown in the figure). All of these IDs are globally unique 64-bit integers. In a typical Dapper trace we expect a single span for each RPC, and each additional tier of infrastructure adds one more level of depth to the trace tree.
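
To make this data model concrete, the following is a minimal sketch (our own, with illustrative names rather than Dapper's) of a span carrying the IDs and fields just described.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the span model described above; names are illustrative, not Dapper's.
public class Span {
    final long traceId;        // shared by every span in the same trace
    final long spanId;         // 64-bit identifier for this span
    final Long parentSpanId;   // null for the root span
    final String name;         // human-readable name, e.g. the RPC method
    long startMicros;
    long endMicros;
    final List<String> annotations = new ArrayList<>();

    Span(long traceId, long spanId, Long parentSpanId, String name) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
        this.name = name;
    }

    boolean isRoot() {
        return parentSpanId == null;
    }
}
```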

Figure 3: A detailed view of a single span shown in Figure 2

Figure 3 provides a more detailed view of the logging points in a typical Dapper trace span. This particular span, one of those in Figure 2, describes both the client-side and server-side portions of the "Helper.Call" RPC. The span's start and end times, as well as any RPC timing information, are recorded by Dapper's instrumentation of the RPC libraries. If the application developer chooses to attach their own annotations (such as "foo" in the figure, carrying business data) to the trace, these are logged alongside the rest of the span data.

Keep in mind that a span can contain information from more than one host, and this must be recorded as well. In fact, every RPC span contains annotations from both the client and the server processes, making two-host spans the most common case. Because the client and server timestamps come from different machines, we have to be mindful of clock skew. Our analysis tools take advantage of the fact that an RPC client always sends its request before the server receives it, and likewise that the server sends its response before the client receives it. In this way, the server-side timestamps of an RPC have both a lower and an upper bound.
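
As a hedged illustration of this bounding argument (our own sketch, not Dapper's code), the helper below clamps a server-side timestamp into the interval implied by the client's send and receive times.

```java
// Sketch of the ordering constraint described above: client-send <= server events <= client-receive.
// Timestamps are microseconds on each host's local clock; names are illustrative.
public final class ClockSkew {
    /** Returns a server timestamp clamped into the interval implied by the client's observations. */
    static long clampServerTimestamp(long clientSendMicros, long clientRecvMicros, long serverMicros) {
        // If the server clock runs ahead or behind, its events may appear outside the
        // [clientSend, clientRecv] window even though the RPC ordering says they belong inside it.
        return Math.max(clientSendMicros, Math.min(serverMicros, clientRecvMicros));
    }

    public static void main(String[] args) {
        long clientSend = 1_000_000, clientRecv = 1_050_000;
        long skewedServer = 990_000; // server clock lagging behind the client's
        System.out.println(clampServerTimestamp(clientSend, clientRecv, skewedServer)); // prints 1000000
    }
}
```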

2.2 Instrumentation points

Dapper can follow the path of distributed control at nearly zero cost to application developers, relying almost entirely on instrumentation of a small number of common libraries, as follows:

  • When a thread handles a traced control path, Dapper attaches the trace context to thread-local storage. The trace context is a small and easily copyable container of span attributes such as the trace ID and span ID.
  • When computation is deferred or made asynchronous, most Google developers use a common control-flow library to construct callbacks and schedule them on a thread pool or other executor. Dapper ensures that all such callbacks store their creator's trace context, and that this context is associated with the appropriate thread when the callback is invoked. In this way, the trace ID and span ID needed to reconstruct asynchronous call paths are carried along transparently. (A minimal sketch of this context propagation appears after this list.)
  • Almost all of Google's inter-process communication is built on a single RPC framework with both C++ and Java bindings. We instrumented that framework to define spans around all RPCs; the span ID and trace ID are transmitted from the client to the server. RPC-based systems are widely used at Google, making this an essential instrumentation point. We plan to instrument non-RPC communication frameworks as they mature and find a user base.
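
As promised above, here is a minimal sketch of the thread-local context propagation just described; the class names and the executor-wrapping approach are our assumptions, not Dapper's actual library code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of ThreadLocal trace-context propagation across asynchronous callbacks,
// following the description above; class names are illustrative, not Dapper's.
public final class TraceContextDemo {
    // A small, easily copied container holding span attributes.
    record TraceContext(long traceId, long spanId) {}

    static final ThreadLocal<TraceContext> CURRENT = new ThreadLocal<>();

    // Wrap a callback so that the submitting thread's context is restored when it runs.
    static Runnable propagating(Runnable task) {
        TraceContext captured = CURRENT.get(); // captured at scheduling time
        return () -> {
            TraceContext previous = CURRENT.get();
            CURRENT.set(captured);
            try {
                task.run();
            } finally {
                CURRENT.set(previous);
            }
        };
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CURRENT.set(new TraceContext(42L, 7L));
        pool.submit(propagating(() ->
                System.out.println("callback sees " + CURRENT.get())));
        pool.shutdown();
    }
}
```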

Dapper's trace data is language-independent, and many traces in production combine data from processes written in both C++ and Java. In Section 3.2 we discuss how well this application transparency has worked in practice.

2.3 Annotations

The instrumentation points described above are sufficient to derive detailed traces of complex distributed systems, making Dapper's core functionality available to otherwise unmodified Google applications. However, Dapper also allows application developers to enrich a trace with additional information wherever it helps monitor higher-level system behavior or debug problems. We let users define timestamped annotations through a simple API, the heart of which is shown in Figure 4. These annotations can carry arbitrary content. To protect Dapper users from accidental overzealous logging, each trace span has a configurable upper bound on the total number of annotations. Application-level annotations are not, however, a substitute for the structural span information and RPC-related data described earlier.

In addition to simple textual annotations, Dapper supports key-value annotations, giving developers more tracing power, such as maintaining counters, logging binary messages, and transporting arbitrary user data through a traced request within a process. Key-value annotations are used to attach application-specific data to a span within the context of a distributed trace.
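
Figure 4 itself is not reproduced here; as a stand-in, the following sketch shows roughly what a timestamped text annotation and a key-value annotation could look like from the application's point of view. The method names and the annotation cap are our own illustrative choices, not Dapper's actual API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the two annotation styles described above; not Dapper's real API.
public final class AnnotationSketch {
    record TimestampedAnnotation(long micros, String text) {}

    static final int MAX_ANNOTATIONS_PER_SPAN = 128; // stands in for the configurable upper bound

    static final List<TimestampedAnnotation> textAnnotations = new ArrayList<>();
    static final Map<String, byte[]> keyValueAnnotations = new LinkedHashMap<>();

    /** Timestamped free-text annotation, e.g. "Request accepted". */
    static void annotate(String text) {
        if (textAnnotations.size() < MAX_ANNOTATIONS_PER_SPAN) {
            textAnnotations.add(new TimestampedAnnotation(System.nanoTime() / 1000, text));
        } // beyond the limit, annotations are silently dropped in this sketch
    }

    /** Key-value annotation, e.g. a counter or a small binary message. */
    static void annotate(String key, byte[] value) {
        keyValueAnnotations.put(key, value);
    }

    public static void main(String[] args) {
        annotate("Found cached entry");
        annotate("bytes_scanned", java.nio.ByteBuffer.allocate(8).putLong(4096L).array());
        System.out.println(textAnnotations.size() + " text, " + keyValueAnnotations.size() + " kv");
    }
}
```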

2.4 Sampling rate

Low overhead is a key design goal for Dapper: service operators would understandably be reluctant to deploy a new tool of yet unproven value if it had any noticeable impact on performance. Furthermore, we want developers to be able to use the annotation API without fear of additional overhead. We have also found that some classes of Web services are indeed sensitive to the performance penalties of instrumentation. Therefore, beyond making Dapper's basic collection instrumentation as cheap as possible, we further limit overhead by recording only a fraction of all requests. We discuss this trace sampling scheme in more detail in Section 4.4.

Figure 5: Overview of the Dapper collection pipeline

2.5 Tracing Collection

Dapper's trace logging and collection pipeline has three stages (see Figure 5). First, span data is written (1) to local log files. Dapper daemons and collection infrastructure then pull that data from all production hosts (2) and finally write it (3) to Dapper's Bigtable repository. A trace is laid out as a single Bigtable row, with each column corresponding to a span. Bigtable's support for sparse table layouts suits this situation well, since a trace can contain an arbitrary number of spans. The median latency of trace data collection, that is, the time it takes for data to travel from the instrumented application to the central repository, is less than 15 seconds. The 98th-percentile latency is bimodal over time: about 75% of the time it is less than two minutes, but the remaining 25% of the time it can grow to several hours.
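
As a rough illustration of this layout (an in-memory stand-in of our own, not the Bigtable client API), each trace maps to one row keyed by trace ID and each span occupies its own column, so rows are naturally sparse.

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual sketch of the sparse "one row per trace, one column per span" layout
// described above; an in-memory stand-in, not the Bigtable client API.
public final class TraceRepositorySketch {
    // row key: trace ID -> (column qualifier: span ID -> encoded span bytes)
    final Map<Long, Map<Long, byte[]>> rows = new HashMap<>();

    void writeSpan(long traceId, long spanId, byte[] encodedSpan) {
        rows.computeIfAbsent(traceId, id -> new HashMap<>()).put(spanId, encodedSpan);
    }

    Map<Long, byte[]> readTrace(long traceId) {
        return rows.getOrDefault(traceId, Map.of());
    }
}
```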

Dapper also provides an API to simplify access to trace data in our repository. Google developers use this API to build general-purpose and application-specific analysis tools. Section 5.1 contains more information on how to use it.

2.5.1 Out-of-band Data Tracing and Collection

Tip 1: Out-of-band data. Transport-layer protocols use out-of-band (OOB) data to send urgent information: if one side has important data to communicate, the protocol can deliver it to the peer quickly, typically over a channel separate from the one used for normal data.

Tip 2: An in-band strategy transmits trace data along the call chain itself, while an out-of-band strategy collects trace data over a separate channel. Dapper's log writing and log collection approach is an out-of-band strategy.

The Dapper system performs trace logging and collection out-of-band with respect to the request tree itself. This is done for two unrelated reasons. First, an in-band collection scheme, in which trace data is sent back inside RPC response headers, would affect application network dynamics. In many of Google's larger systems it is not uncommon for a single trace to contain thousands of spans, yet RPC responses, even those near the root of such a large distributed trace, are often relatively small, typically less than 10 kilobytes. In such cases the in-band trace data would dwarf the application data and bias the results of subsequent analyses. Second, in-band collection assumes that all RPCs are perfectly nested, but we found that many middleware systems return results to their callers before all of their own back ends have responded. An in-band collection system cannot account for this kind of non-nested distributed execution.

2.6 Security and privacy considerations

Recording a certain amount of RPC payload information will enrich Dapper’s tracing capabilities because the analysis tool can find relevant samples in the payload data (parameters passed by the method) that can explain why the monitored system is behaving erratically. However, in some cases, payload data may contain internal information that should not be disclosed to unauthorized users, including engineers who are debugging.

Since security and privacy concerns are not negligible, Dapper stores the names of RPC methods but does not record any payload data by default. Instead, application-level annotations provide a convenient opt-in mechanism: application developers can choose to associate with a span whatever data they judge will be useful for later analysis.

Dapper has also provided some security benefits that its designers did not anticipate. By tracing publicly exposed security protocol parameters, Dapper can monitor whether an application satisfies security policies, for example through appropriate levels of authentication or encryption. Dapper can also provide evidence that policy-based isolation is enforced as expected, for instance that applications handling sensitive data do not interact with unauthorized system components. Measurements of this kind provide stronger assurance than source-code audits.

3. Dapper deployment status

Dapper has been our production tracing system for more than two years. In this section we report on the status of the system, focusing on how well it has met our goals of ubiquitous deployment and application-level transparency.

3.1 Dapper runtime library

Perhaps the most critical part of Dapper's code base is the instrumentation of the basic RPC, threading, and control-flow libraries, which includes creating spans, setting the sampling rate, and writing logs to local disk. Besides being lightweight, this embedded code must be stable and robust, since it is linked into a vast number of applications, which makes maintenance and bug fixing difficult. The core instrumentation is fewer than 1000 lines of C++ and fewer than 800 lines of Java. Support for key-value annotations adds another 500 lines of code.

3.2 Coverage in production environment

Dapper's penetration can be assessed along two dimensions: the fraction of production processes that can generate Dapper traces (that is, those linked against the Dapper-instrumented libraries) and the fraction of production machines running Dapper's trace collection daemon. Dapper's daemon is part of our basic server image, so it is present on virtually every server at Google. It is difficult to determine the precise fraction of Dapper-ready processes, since a process that generates no trace information is invisible to Dapper. However, given how ubiquitous the instrumented libraries are, we estimate that nearly every Google production process supports tracing.

There are cases where Dapper cannot follow the control path correctly. These typically stem from the use of non-standard control-flow primitives, which cause Dapper to mistakenly attribute work to unrelated events. Dapper provides a simple library that lets developers control trace propagation manually as a workaround. At present, 40 C++ applications and 33 Java applications require some manual trace propagation, a small fraction of the thousands of traced applications. A very small number of programs use uninstrumented communication libraries (such as raw TCP sockets or SOAP RPC) and therefore do not support Dapper tracing directly; these applications can be instrumented separately if needed.

Dapper tracing can also be turned off as a production safety measure. In fact, it was off by default in the early days of its deployment, until we built enough confidence in Dapper's stability and low overhead. Dapper's team occasionally audits configuration changes to see which services have turned tracing off. Such cases are rare and usually stem from concern about monitoring overhead; in all of them, tracing was turned back on after further investigation and measurement showed the actual overhead to be immaterial.

3.3 Use of trace annotations

Programmers tend to use application-specific annotations either as a kind of distributed debug log or to classify traces by application-specific features. For example, all Bigtable requests carry the name of the table being accessed as an annotation. Currently, 70% of all Dapper spans and 90% of all Dapper traces contain at least one application-specific annotation.

41 Java applications and 68 C++ applications have added custom annotations in order to better understand the activity within the spans of their services. It is worth noting that our Java developers have adopted the annotation API much more widely than our C++ developers. This may be because our Java workloads tend to sit closer to the end user (while C++ dominates the lower layers); such applications handle a wider variety of requests and consequently have more complex control paths.

4. Managing tracing overhead

The cost of a tracing system has two parts: first, the performance degradation of the system being monitored, caused by generating and collecting trace data; and second, the resources needed to store and analyze that data. Although one can argue that a valuable tracing infrastructure is worth some performance cost, we believed the initial rollout of the system would be far easier if its baseline cost could be claimed to be negligible.

In this section we examine three aspects of cost: the overhead of Dapper's instrumentation, the cost of trace collection, and Dapper's impact on production workloads. We also describe how Dapper's adjustable sampling mechanism helps us balance low overhead against trace representativeness.

4.1 Trace generation overhead

Trace generation overhead is the most critical part of Dapper's performance footprint, since collection and analysis can more easily be turned off in an emergency. The most important sources of trace generation overhead in the Dapper runtime are creating and destroying spans and annotations and logging them to local disk for subsequent collection. Root span creation and destruction takes 204 nanoseconds on average, while the same operation on other spans takes 176 nanoseconds. The difference is mainly the cost of allocating a globally unique trace ID for root spans.

If a span is not sampled, the cost of creating an annotation on it is almost negligible, consisting of a thread-local lookup in the Dapper runtime that averages 9 nanoseconds. If the span is sampled, annotating it with a string, as shown in Figure 4, costs 40 nanoseconds on average. These measurements were taken on a 2.2 GHz x86 server.

Writes to local disk are the most expensive operation in the Dapper runtime, but their visible overhead is much reduced because log writes are batched and performed asynchronously with respect to the traced application. Nevertheless, log-writing activity can become noticeable under heavy traffic, especially if every request is traced. We document the performance cost under a Web search workload in Section 4.3.

4.2 Trace collection overhead

Reading trace data out can also interfere with the workload being monitored. Table 1 shows the worst-case CPU usage of Dapper's log-collection daemon during a stress test at loads higher than we see in practice. In production the daemon has never used more than 0.3% of a single core for trace processing, and it has a very small memory footprint (within the noise of heap fragmentation). We also restrict the Dapper daemon to the lowest priority in the kernel scheduler in case CPU contention arises on a heavily loaded machine.

Dapper is also a light consumer of network bandwidth: each span averages only 426 bytes in transit to our repository. Taken as a fraction of overall network activity, Dapper's trace data collection accounts for less than 0.01% of the traffic in Google's production environment.

Table 1: CPU resource usage of the Dapper daemon during load tests

4.3 Impact on load in the Production Environment

High-throughput online services that fan each request out to large numbers of servers are among the most important cases for effective tracing: they generate the largest volume of trace data and are also the most sensitive to performance overhead. In Table 2 we take a Web search cluster as an example and, by adjusting the sampling rate, measure Dapper's impact on latency and throughput.

Table 2: The effect of different sampling rates on latency and throughput in a Web search cluster. The experimental errors for latency and throughput are 2.5% and 0.15%, respectively.

We see that although the impact on throughput is not significant, sampling is clearly necessary to avoid noticeable latency degradation. Once the sampling rate drops below 1/16, both the latency and throughput penalties fall within the experimental error. In practice we have found that even a sampling rate of 1/1024 still yields plenty of trace data for high-volume services. Keeping Dapper's baseline performance degradation extremely low is important because it gives applications the latitude to use the full annotation API without fear of performance penalties. A low sampling rate has the additional benefit of letting trace data persist longer on local disks before being garbage-collected, which gives the collection infrastructure more flexibility.

4.4 Adaptive sampling

The Dapper overhead attributed to any given process is proportional to the number of traces that process samples per unit time. The first production version of Dapper used a uniform sampling rate of 1/1024 for all processes at Google. This simple scheme worked very well for our high-throughput online services, because interesting events are still likely to occur often enough to be captured even at that rate.

However, lower-traffic workloads may miss important events at such low sampling rates, while at the same time they could tolerate higher sampling rates with acceptable performance overhead. The solution for such systems would be to override the default sampling rate, but that requires manual intervention, something we try hard to avoid in Dapper.

We are therefore deploying an adaptive sampling scheme that is parameterized not by a uniform sampling probability, but by a desired rate of sampled traces per unit time. In this way, workloads with low traffic automatically increase their sampling rate, while workloads with very high traffic lower it, keeping overhead under control. The actual sampling rate used is recorded along with the trace itself, which facilitates accurate analysis of the trace data later.
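
A minimal sketch of this idea follows, under the assumption of a simple periodic feedback loop; the constants and the recomputation scheme are ours, not Dapper's published algorithm.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of rate-targeted adaptive sampling as described above; the feedback scheme
// and constants are assumptions for illustration, not Dapper's actual algorithm.
public final class AdaptiveSampler {
    private final double targetTracesPerSecond;
    private volatile double probability = 1.0;             // start fully sampled
    private final AtomicLong requestsInWindow = new AtomicLong();

    AdaptiveSampler(double targetTracesPerSecond) {
        this.targetTracesPerSecond = targetTracesPerSecond;
    }

    /** Called once per incoming request; returns whether to trace it. */
    boolean shouldSample() {
        requestsInWindow.incrementAndGet();
        return ThreadLocalRandom.current().nextDouble() < probability;
    }

    /** Called periodically (e.g. from a timer) to adapt the sampling probability. */
    void recompute(double windowSeconds) {
        double observedRate = requestsInWindow.getAndSet(0) / windowSeconds;
        probability = observedRate > 0
                ? Math.min(1.0, targetTracesPerSecond / observedRate)
                : 1.0; // low-traffic processes drift back toward full sampling
    }
}
```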

4.5 Coping with Aggressive Sampling

New Dapper users often wonder whether a low sampling rate, often as low as 0.01% for high-throughput services, will hurt their analyses. Our experience at Google leads us to believe that, for high-throughput services, aggressive sampling does not get in the way of the most important analyses: if a notable execution pattern surfaces once in such a system, it will surface thousands of times. Low-throughput services, handling perhaps dozens of requests per second rather than hundreds of thousands, can afford to trace every request; this is what motivated our decision to move to adaptive sampling rates.

4.6 Additional sampling during collection

The sampling mechanisms described above are designed to minimize noticeable overhead in applications using the Dapper runtime. The Dapper team also needs to control the total amount of data written to its central repository, however, and to that end we incorporate a second round of sampling.

Our production clusters currently generate more than 1 TB of sampled trace data per day, and Dapper users would like traces to remain available for at least two weeks after being logged. The benefit of increased trace density must therefore be weighed against the machines and disk storage consumed by Dapper's central repository. Sampling a high fraction of requests also brings the Dapper collectors close to their write-throughput limit.

In order to maintain flexibility around both material resource requirements and the cumulative Bigtable write throughput, we added support for an additional level of sampling in the collection system itself. We take advantage of the fact that all spans of a given trace, even though they may be scattered across thousands of hosts, share the same trace ID. For each span seen by the collection system, we hash its trace ID into a scalar z, where 0 ≤ z ≤ 1. If z is below the coefficient configured in our collection system, we keep the span and write it to Bigtable; otherwise we discard it. By basing the sampling decision on the trace ID, we either keep or discard entire traces rather than individual spans within them. We found that this extra configuration parameter makes the collection pipeline much easier to manage, because we can adjust our global write rate by changing a single value in a configuration file.
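
The following sketch illustrates the hashing step described above; the particular mixing function is an arbitrary choice of ours, not Dapper's. Because the decision depends only on the trace ID and the global coefficient, every collector makes the same keep-or-discard choice for a given trace.

```java
// Sketch of the collection-time secondary sampling described above; the hash mixing
// function is an arbitrary illustrative choice, not Dapper's.
public final class CollectionSampler {
    /** Deterministically maps a 64-bit trace ID to a scalar z between 0 and 1. */
    static double traceIdToScalar(long traceId) {
        long h = traceId * 0x9E3779B97F4A7C15L;   // multiplicative mixing
        h ^= (h >>> 32);
        return (h >>> 11) / (double) (1L << 53);  // use 53 high bits as a fraction in [0, 1)
    }

    /** Keep the span iff its whole trace falls under the configured write coefficient. */
    static boolean keep(long traceId, double writeCoefficient) {
        return traceIdToScalar(traceId) < writeCoefficient;
    }

    public static void main(String[] args) {
        long traceId = 0x1234_5678_9ABC_DEF0L;
        System.out.println(keep(traceId, 0.10)); // same answer on every collector
    }
}
```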

It would be simpler to use a single sampling parameter for the entire tracing and collection system, but that would require quickly pushing changes to the runtime sampling configuration on every deployed node. Instead, we choose a runtime sampling rate somewhat higher than what we can afford to write to the repository, and use the secondary sampling coefficient in the collection system to gracefully shed the excess data. Maintaining Dapper's pipeline is easier this way, since we can raise or lower our global coverage and write rate with a single change to the secondary sampling configuration.

5. General-purpose Dapper tools

A few years ago, when Dapper was still a prototype, it was usable only with the patient assistance of the Dapper developers. Since then we have iteratively built up the collection infrastructure, the programming interfaces, and an interactive web-based user interface that help Dapper users solve their problems independently. In this section we summarize which approaches have worked and which have not, and we provide basic usage information for these general-purpose analysis tools.

5.1 Dapper Depot API

Dapper's "Depot API," or DAPI, provides direct access to the distributed trace records in the regional Dapper repositories. DAPI and the Dapper trace repositories were designed in tandem, and DAPI is meant to expose a clean, intuitive interface to the raw data in the repositories. We settled on the following three ways of accessing trace data:

  • Access by trace ID: DAPI can load any trace on demand given its globally unique trace ID.
  • Bulk access: DAPI can leverage MapReduce to provide parallel access to hundreds of millions of Dapper traces. The user overrides a virtual function that accepts a Dapper trace as its only argument, and the framework invokes that function once for every trace collected within a user-specified time window (see the sketch after this list).
  • Indexed access: the Dapper repositories support a single index chosen to match our most common access patterns. The index maps commonly requested trace features to the corresponding Dapper traces. Because trace IDs are allocated pseudo-randomly, this is the best way to quickly reach traces associated with a specific service or host.
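
The sketch referenced in the bulk-access item above shows how such a user-supplied override might look; the class and method names are illustrative stand-ins, not the real DAPI.

```java
import java.util.List;

// Sketch of the bulk-access pattern described in the list above; class and method
// names are illustrative stand-ins, not the real DAPI.
abstract class TraceProcessor {
    /** The single override the user supplies; invoked once per collected trace. */
    protected abstract void processTrace(DapperTrace trace);

    /** In the real system this would be driven by a MapReduce over the repository. */
    final void runOverWindow(List<DapperTrace> tracesInWindow) {
        for (DapperTrace trace : tracesInWindow) {
            processTrace(trace);
        }
    }
}

// Minimal placeholder for a trace: its ID and root span name.
record DapperTrace(long traceId, String rootSpanName) {}

// Example user job: count traces whose root span is a web-search request.
class SearchTraceCounter extends TraceProcessor {
    long count = 0;

    @Override
    protected void processTrace(DapperTrace trace) {
        if ("websearch".equals(trace.rootSpanName())) {
            count++;
        }
    }
}
```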

All three access patterns lead the user to distinct Dapper trace records. As described in Section 2.1, a Dapper trace is modeled as a tree of spans, so a trace's data structure is simply a traversable tree of individual spans. A span frequently corresponds to an RPC call, in which case the RPC timing information is available, and timestamped application-specific annotations are accessible through the span structure as well.

Choosing an appropriate custom index was the most challenging aspect of the DAPI design. The compressed storage required for an index into the trace data amounts to 26% of the size of the trace data itself, so the costs are significant. Initially we deployed two indices: one keyed by host and one keyed by service name. However, we did not find enough value in the host index to justify its storage cost: when users were interested in a particular machine, they were usually also interested in a specific service, so we eventually combined the two into a composite index that allows efficient lookup by service name, host, and timestamp.

5.1.1 Use of DAPI inside Google

DAPI usage inside Google falls into three categories: persistent online web applications built on DAPI, well-maintained DAPI-based tools that are run from the command line on demand, and one-off analysis tools that are written, run, and mostly forgotten. We know of three persistent DAPI-based applications, eight additional on-demand DAPI-based analysis tools, and roughly 15 to 20 one-off analysis tools built with the DAPI framework. It is hard to account for tools beyond this, since developers can build, run, and abandon such projects without any involvement from the Dapper team.

5.2 Dapper user interface

Most Dapper usage takes place through the interactive web-based user interface. Space does not permit listing every feature, so we illustrate a typical user workflow in Figure 6.

Figure 6: A typical user workflow in Dapper's web-based user interface.

  1. The user describes the service and time window of interest, along with whatever other information is needed to distinguish the relevant trace pattern (for example, a span name). They also specify the cost metric most relevant to their investigation (for example, service response time).
  2. A large table of performance summaries appears for all distributed execution patterns associated with the specified service. The user can sort these execution patterns as desired and select one of them to view in more detail.
  3. Once a single distributed execution pattern is selected, the user is presented with a graphical depiction of that pattern, with the service under examination highlighted in the center of the diagram.
  4. After statistics are computed for the cost metric chosen in step 1, Dapper's user interface presents a simple histogram. In this example we can see a rough distribution of response times for the selected execution pattern. The user is also shown a list of specific traces that fall into the different buckets of the histogram. In this case, clicking on the second trace in the list brings up the detailed trace view (step 5).
  5. Most Dapper users eventually end up examining individual traces, hoping to gather enough information to understand the root cause of the observed system behavior. We do not have space to review the trace view in detail, but it is characterized by a global time line (seen at the top) and the ability to expand and collapse subtrees interactively. Successive tiers of the distributed trace tree are represented by nested, differently colored rectangles, and every RPC span is further broken down into time spent within a server process (green) and time spent on the network (blue). User annotations are not shown in this screenshot, but they may be selectively included in the global time line on a span-by-span basis.

Dapper's user interface can also query live data by communicating directly with the Dapper daemons on each production server. In that mode it is not possible to view the system-level charts described above, but individual traces are still easy to pick out based on their latency and network characteristics, and the live data is available within seconds.

According to our logs, roughly 200 different Google engineers use the Dapper UI on a typical weekday; over the course of a week this amounts to roughly 750 to 1000 distinct users, and those numbers grow month over month, roughly coinciding with internal announcements of new features. It is common for users to send around links to specific traces of interest, which inevitably generates many one-time, short-duration visits to the trace viewer.

6. Experience

Dapper is widely used at Google, partly directly through the Dapper user interface and partly indirectly through programs built on the Dapper APIs. In this section we do not attempt to catalog every known use of Dapper; instead we try to cover the "basis vectors" of Dapper usage and illustrate which kinds of applications have been most successful.

6.1 Using Dapper in Development

Google's AdWords system is built around a large database of keyword targeting criteria and associated text advertisements. When new keywords or ads are inserted or modified, they must be checked for adherence to service policy terms (such as inappropriate language), a process that is made far more efficient by an automated review system.

When it came time to re-engineer the ad review service from scratch, the team used Dapper iteratively from its first prototypes through launch, and continues to rely on it to maintain the system. Dapper helped them improve the service in the following ways:

  • Performance: developers tracked progress against request latency targets and located easy optimization opportunities. Dapper was also used to identify unnecessary serial requests along the critical path, often originating in subsystems the developers did not write themselves, and to prompt the responsible teams to fix them.
  • Correctness: the ad review service revolves around a large database system that offers both read-only replicas (cheap access) and a read-write master (expensive access). Dapper was used in several situations to determine which queries could be served by a replica instead of the master. It is now used on an ongoing basis to account for queries that access the master directly and to safeguard important system invariants.
  • Understanding: ad review queries fan out across many systems, including BigTable, the aforementioned database, a multidimensional indexing service, and various other C++ and Java back-end services. Dapper traces were used to assess total query cost and to guide efforts to reduce load on the service's system dependencies.
  • Testing: new code versions go through a QA process of Dapper-traced test runs that verify correct system behavior and performance. During these runs we found many problems both in the ad review code itself and in its dependencies.

The ad review team made extensive use of the Dapper annotation API. The open-source AOP framework Guice[13] was used to mark important software components as "@Traced". These traces were then further annotated with the input and output sizes of important subroutines, status messages, and other debugging information that would otherwise be sent to a log file.
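
As a hedged illustration of how such a "@Traced" marker might be wired up with Guice's method interception (our own sketch; the interceptor body and the stand-in for the Dapper annotation call are not the ad review team's code):

```java
import com.google.inject.AbstractModule;
import com.google.inject.matcher.Matchers;
import org.aopalliance.intercept.MethodInterceptor;
import org.aopalliance.intercept.MethodInvocation;

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Sketch of wiring a "@Traced" marker annotation into Guice method interception,
// as described above; the interceptor body and Dapper call are illustrative placeholders.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Traced {}

class TracingInterceptor implements MethodInterceptor {
    @Override
    public Object invoke(MethodInvocation invocation) throws Throwable {
        long start = System.nanoTime();
        try {
            return invocation.proceed();
        } finally {
            // Stand-in for a call into the Dapper annotation API.
            System.out.printf("span annotation: %s took %d us%n",
                    invocation.getMethod().getName(), (System.nanoTime() - start) / 1000);
        }
    }
}

class TracingModule extends AbstractModule {
    @Override
    protected void configure() {
        bindInterceptor(Matchers.any(), Matchers.annotatedWith(Traced.class),
                new TracingInterceptor());
    }
}
```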

At the same time, the ad review team found Dapper lacking in some respects. For example, they would like to search over all of their trace annotations within an interactive time frame, but must instead run a custom MapReduce job or inspect traces one by one. In addition, there are other systems at Google that collect and centralize data from general-purpose debug logs, and integrating the mass of data from those systems with the Dapper repository would also be valuable.

Even so, the ad review team rates Dapper's usefulness highly overall: by analyzing data from Dapper's tracing platform, they improved their service's latency by two orders of magnitude.

6.1.1 Integration with Exception Monitoring

Google maintains a service that continuously collects and centralizes exception reports from running processes. If an exception occurs within the context of a sampled Dapper trace, the corresponding trace ID and span ID are recorded as metadata in the exception report. The front end of the exception monitoring service then links each exception report directly to its distributed trace. The ad review team used this feature to understand the broader context in which bugs occurred. Because Dapper exposes an interface built around simple, unique identifiers, it is relatively easy to integrate with other event monitoring systems.

6.2 Addressing long-tail latency

Given the number of moving parts, the size of the code base, and the breadth of deployment, debugging a service such as universal search (described in Section 1) is very challenging. In this section we describe our efforts to attenuate the long tail of universal search's latency distribution. Dapper was able to validate hypotheses about end-to-end latency, and more specifically about the critical path of search requests. When a system involves not just a handful of subsystems but dozens of development teams, even our best and most experienced engineers routinely guess wrong about the root cause of poor end-to-end performance. In such situations Dapper can supply much-needed facts and settle many important performance questions conclusively.

Figure 7: Universal search trace data showing how infrequent episodes of high network latency along the critical path affect end-to-end request latency.

While debugging the latency long tail, engineers built a small library that infers the hierarchical critical path from DAPI trace objects. These critical path structures were then used to diagnose problems and to prioritize prospective performance improvements for universal search. This work with Dapper led to the following findings:

  • Momentary network performance degradation along the critical path does not affect system throughput, but it has a significant effect on latency outliers. As Figure 7 shows, most of the slow universal search traces experienced network degradation somewhere along their critical path.
  • Many problematic and expensive query patterns resulted from unintended interactions between services. Once identified, they were usually easy to correct, but before Dapper even finding them was quite difficult.
  • Common queries were harvested from a secure log repository outside of Dapper and, using Dapper's unique trace IDs, joined with the Dapper repository. This mapping was then used to build lists of example queries that were slow for each individual subsystem in universal search.

6.3 Inferring service dependencies

At any given moment, a typical computing cluster at Google hosts thousands of logical "jobs," each a set of processes performing a common function. Google maintains many such clusters, and of course jobs running in one cluster often depend on jobs in other clusters. Because the dependencies between jobs change dynamically, it is not possible to infer all inter-service dependencies from configuration information alone. Nevertheless, various processes within the company require accurate dependency information in order to identify bottlenecks and plan service migrations. Google's "Service Dependencies" project uses trace annotations and the DAPI MapReduce interface to automate the inference of service dependencies.

Using Dapper's core instrumentation together with Dapper trace annotations, the Service Dependencies project infers dependencies between jobs and between jobs and other software components. For example, all BigTable operations are tagged with the name of the affected table. With Dapper's platform, the Service Dependencies team can thus automatically infer dependencies on named resources at a fine granularity.

6.4 Network Usage of Different Services

Google devotes substantial human and material resources to its network architecture. In the past, network administrators might focus on individual pieces of hardware and on common tools and dashboards offering a bird's-eye view of the global network. They had a good overview of the health of the entire network, but when problems arose they had few tools capable of accurately attributing network load to its application-level culprits.

Although Dapper was not designed for link-level monitoring, we found it well suited to application-level analysis of inter-cluster network activity. With Dapper, Google was able to build a continuously updated console showing the application-level hot spots responsible for the most active inter-cluster network traffic. Furthermore, Dapper lets us point to the specific causal traces behind expensive network requests rather than piecing together isolated information from different servers. The dashboard built on the Dapper API took less than two weeks to put together.

6.5 Tiered and Shared Storage Systems

Many of Google's storage systems are composed of several independently complex layers of distributed infrastructure. For example, Google App Engine[5] is built on top of a scalable entity storage system that exposes certain RDBMS functionality on top of BigTable. BigTable in turn uses Chubby[7] (a distributed lock service) and GFS. Moreover, systems like BigTable are operated as shared services, which simplifies deployment and makes better use of computing resources.

In such layered, shared systems it is not always easy to determine end-user resource consumption patterns. For example, a burst of GFS traffic from a given BigTable cell might originate from one user or from many, yet at the GFS level these two distinct usage patterns are indistinguishable. Likewise, contention for shared services is difficult to debug without a tool like Dapper.

The Dapper user interface described in Section 5.2 can group and aggregate trace performance data across the various clients of any shared service. This makes it easy for the providers of such services to rank their users along several dimensions (for example, inbound network load, outbound network load, or total time spent servicing requests).

6.6 Firefighting with Dapper

Dapper is useful for some, though not all, firefighting tasks. "Firefighting" here refers to risky operational work performed on a distributed system that is in trouble. Typically, Dapper users who are firefighting need fresh data and have no time to write new DAPI code or wait for periodic reports to run.

For services experiencing high latency or, worse still, timing out under a normal workload, the Dapper user interface can often isolate the location of the latency bottleneck. By communicating directly with the Dapper daemons, fresh data about specific high-latency traces can be gathered easily. During catastrophic failures it is usually unnecessary to look at aggregate statistics to determine the root cause; sample traces suffice, and as noted earlier they are available from the Dapper daemons almost immediately.

However, shared storage services such as those described in Section 6.5 need aggregated information as quickly as possible when a sudden spike in user activity causes problems. For post-mortems of such events the shared service can still make use of aggregated Dapper data, but unless batch analysis of the collected Dapper data can be completed within roughly 10 minutes of a problem's onset, Dapper is less useful than one would like for firefighting problems with shared storage services.

7. Other lessons learned

While our experience with Dapper so far has largely matched our expectations, there have been some positive surprises we did not fully anticipate. First, we have been pleasantly surprised by the number of use cases beyond those we designed for. In addition to the scenarios described in Section 6, these include resource accounting systems, tools that check sensitive services for compliance with specified communication patterns, and an analysis of RPC compression strategies, among others. We attribute these unexpected uses in part to the decision to open the trace data store to developers through a simple programming interface, which let us harness the creativity of a much larger community. Adding Dapper support to existing workloads also turned out to be easier than expected, requiring only that applications be rebuilt against new versions of the common libraries (threading, control flow, and the RPC framework).

Dapper's widespread use within Google has also given us valuable feedback on some of its limitations. Below we describe the most important ones we have identified so far:

  • Coalescing effects: our model implicitly assumes that the various subsystems do work for one traced request at a time. In some cases it is more efficient to buffer several requests and then perform an operation on the group at once (coalescing disk writes is one example). In such situations, a single traced request can be blamed for a deceptively large unit of work. Moreover, when several traced requests are batched together, only one of their trace IDs is propagated to the downstream spans, so the other requests go untraced. We are considering solutions that could identify these cases while requiring as little additional instrumentation as possible.
  • Tracing batch workloads: Dapper's design is aimed at online serving systems, and the original goal was to understand the system behavior caused by a single user request. However, offline data-intensive workloads, such as those that fit the MapReduce[10] model, can also benefit from better performance insight. In such cases the trace ID needs to be associated with some other meaningful unit of work, such as a key (or range of keys) in the input data, or a MapReduce shard.
  • Finding root causes: Dapper is effective at determining which part of a system is slowing the whole down, but it cannot always find the root cause. For example, a request may be slow not because of its own behavior but because other requests were queued ahead of it. Applications can use application-level annotations to relay queue sizes or overload indicators to the tracing system. In addition, if this situation is common, the paired sampling technique proposed in ProfileMe[11] could help: it samples two requests whose execution overlaps in time and observes their relative latencies throughout the system.
  • Logging kernel-level information: details of kernel-visible events would sometimes be useful for determining root causes. We have tools capable of tracing or otherwise profiling kernel execution, but it is hard to tie that information into the context of a user-level trace in a general and unobtrusive way. We are investigating a compromise solution in which we take snapshots of a few kernel-level activity parameters from user level and associate them with an active span.

8. Related work

There is a substantial body of work in the field of distributed system tracing. Some systems focus primarily on locating faults, while others target performance optimization. Dapper is indeed used to uncover system faults, but it is more often used to detect performance shortfalls and to improve the general understanding of how systems behave under production workloads.

Black-box monitoring systems related to Dapper, such as Project5[1], WAP5[15], and Sherlock[2], can arguably achieve even greater application-level transparency because they do not rely on runtime instrumentation. The downsides of black-box approaches are a degree of imprecision and the potentially higher overhead of statistically inferring critical paths.

Annotation-based instrumentation of middleware or of the applications themselves is perhaps the more popular approach to distributed system monitoring. Systems such as Pip[14] and Webmon[16] depend more heavily on application-level annotations, while X-Trace[12], Pinpoint[9], and Magpie[3] mostly focus on library and middleware modification; Dapper is closest to the latter group. Like Pinpoint, X-Trace, and early versions of Magpie, Dapper uses a global identifier to tie together related events from all parts of a distributed system. Also like these systems, Dapper tries to spare applications from annotation work by hiding the instrumentation inside common software modules. Magpie abandoned global IDs, and the problem of correctly propagating them, by instead relying on event schemas written for each application to describe the relationships between events; it is unclear to us how effective such transparency is in practice. X-Trace's core annotation requirements are somewhat more ambitious than Dapper's, since traces are collected not only at the node level but also at different software layers within nodes. Our strict requirement for low overhead steered us away from such models and toward the minimum machinery needed to tie a request to a complete trace; Dapper traces can still be enriched by optional application-level annotations.

9. Conclusions

In this article we introduced Dapper, Google's production distributed systems tracing platform, and reported our experience developing and operating it. Dapper is deployed across virtually all of Google's systems, allows most of them to be traced without any application-level modification, and has no noticeable performance impact. The benefits Dapper brings to developers and operations teams are evident from the heavy use of its main tracing user interface, and we have illustrated its value through a range of use cases, some of which were built by application developers with no involvement from the Dapper team.

To our knowledge, this is the first paper to report on a distributed systems tracing framework operating in a large production environment. Indeed, our main contributions derive from the fact that the system described here has been running at scale for over two years. We have found that combining minimal, application-transparent tracing with a simple API that lets developers enrich traces has been well worth the effort.

We believe Dapper achieves a higher degree of application-level transparency than previous annotation-based distributed tracing systems, as demonstrated by the very small amount of manual intervention its workloads required. Although this was made easier by the unusual homogeneity of our systems, it was still a significant challenge. Most importantly, our design spells out sufficient conditions for achieving application-level transparency, which we hope will help others develop solutions for more heterogeneous environments.

Finally, by opening the Dapper trace repository to internal developers, we enabled the creation of far more analysis tools than the Dapper team could ever have produced working in isolation, which greatly multiplied the return on our design and implementation effort.

Acknowledgments

We thank Mahesh Palekar, Cliff Biffle, Thomas Kotzmann, Kevin Gibbs, Yonatan Zunger, Michael Kleber, and Toby Smith for their experimental data and feedback about Dapper experiences. We also thank Silvius Rus for his assistance with load testing. Most importantly, though, we thank the outstanding team of engineers who have continued to develop and improve Dapper over the years; in order of appearance, Sharon Perl, Dick Sites, Rob von Behren, Tony DeWitt, Don Pazel, Ofer Zajicek, Anthony Zana, Hyang-Ah Kim, Joshua MacDonald, Dan Sturman, Glenn Willen, Alex Kehlenbeck, Brian McBarron, Michael Kleber, Chris Povirk, Bradley White, Toby Smith, Todd Derr, Michael De Rosa, and Athicha Muthitacharoen. They have all done a tremendous amount of work to make Dapper a day-to-day reality at Google.

References

[1] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen. Performance Debugging for Distributed Systems of Black Boxes. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, December 2003.

[2] P. Bahl, R. Chandra, A. Greenberg, S. Kandula, D. A. Maltz, and M. Zhang. Towards Highly Reliable Enterprise Network Services Via Inference of Multi-level Dependencies. In Proceedings of SIGCOMM, 2007.

[3] P. Barham, R. Isaacs, R. Mortier, and D. Narayanan. Magpie: online modelling and performance-aware systems. In Proceedings of USENIX HotOS IX, 2003.

[4] L. A. Barroso, J. Dean, and U. Holzle. Web Search for a Planet: The Google Cluster Architecture. IEEE Micro, 23(2):22–28, March/April 2003.

[5] T. O. G. Blog. Developers, start your engines. http://googleblog.blogspot.com/2008/04/developers-start-your-engines.html, 2007.

[6] T. O. G. Blog. Universal search: The best answer is still the best answer. http://googleblog.blogspot.com/2007/05/universal-search-best-answer-is-still.html, 2007.

[7] M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, pages 335–350, 2006.

[8] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A Distributed Storage System for Structured Data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design And Implementation (OSDI ’06), November 2006.

[9] M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer. Pinpoint: Problem Determination in Large, Dynamic Internet Services. In Proceedings of the ACM International Conference on Dependable Systems and Networks, 2002.

[10] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI '04), pages 137–150, December 2004.

[11] J. Dean, J. E. Hicks, C. A. Waldspurger, W. E. Weihl, and G. Chrysos. ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture, 1997.

[12] R. Fonseca, G. Porter, R. H. Katz, S. Shenker, and I. Stoica. X-Trace: A Pervasive Network Tracing Framework. In Proceedings of USENIX NSDI, 2007.

[13] B. Lee and K. Bourrillion. The Guice Project Home Page. http://code.google.com/p/google-guice/, 2007.

[14] P. Reynolds, C. Killian, J. L. Wiener, J. C. Mogul, M. A. Shah, and A. Vahdat. Pip: Detecting the Unexpected in Distributed Systems. In Proceedings of USENIX NSDI, 2006.

[15] P. Reynolds, J. L. Wiener, J. C. Mogul, M. K. Aguilera, and A. Vahdat. WAP5: Black Box Performance Debugging for Wide-Area Systems. In Proceedings of the 15th International World Wide Web Conference, 2006.

[16] T. Gschwind, K. Eshghi, P. K. Garg, and K. Wurster. WebMon: A Performance Profiler for Web Transactions. In E-Commerce Workshop, 2002.