Author: Observable

As distributed and serverless applications are adopted by more and more developers and enterprises, the hidden operations problems they bring are becoming more visible: long call chains in a microservice architecture make problems slow to locate, and day-to-day monitoring is very difficult for operations teams. As a concrete example, completing a single user request in a distributed application may require processing by several different microservices, and the failure or performance degradation of any one of them can have a significant impact on the response to that request. As the business grows, the call chain becomes even more complex. It is difficult to get a panoramic view or drill down to the root cause just by printing logs or using APM performance monitoring; troubleshooting or performance analysis becomes like blind men trying to describe an elephant.

Faced with such problems, Google published the paper "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" [1] to introduce its distributed tracing technology, and argued that a distributed tracing system should meet the following requirements:

• Low performance overhead: the performance overhead that distributed tracing imposes on services should be negligible, especially in performance-sensitive applications.

• Low intrusiveness: intrude on business code as little as possible, or not at all.

• Rapid scaling: the ability to scale quickly as the business or the number of microservices grows.

• Real-time display: low-latency data collection, real-time monitoring of the system, and quick response to system anomalies.

In addition to the above requirements, the paper also gives a complete exposition of the three core stages of distributed tracing: data collection, data persistence, and data presentation. Here, data collection means instrumenting the code to define what each request should report; data persistence means storing the reported data on disk; and data presentation means rendering the requests associated with a given TraceID on an interface.

With the publication of this paper, distributed tracing was accepted by more and more people, and the concept gradually took shape. Related products sprang up, with distributed tracing systems such as Uber's Jaeger and Twitter's Zipkin making their mark. But a problem emerged: each product had its own set of data collection standards and SDKs, and although most were based on the Google Dapper protocol, their implementations differed. To solve this problem, OpenTracing and OpenCensus were born.

OpenTracing

For many developers, supporting distributed tracing is hard. It requires that trace data be passed not only within a process but also between processes. Even harder, other components must also support distributed tracing, such as open source services like NGINX, Cassandra, and Redis, or open source libraries such as gRPC and ORMs introduced inside the service.

Before OpenTracing, many distributed tracing systems were implemented with application-level instrumentation using incompatible APIs, which made developers uneasy about tightly coupling their applications to any particular distributed tracing product. On the other hand, these application-level instrumentation APIs have very similar semantics. To address the incompatibility of the APIs of different distributed tracing systems, and to standardize how trace data is passed from one library to another and from one process to the next, the OpenTracing specification was created: a lightweight standardization layer that sits between an application or library and a tracing or log-analysis backend.

Advantages

The advantage of OpenTracing lies in its vendor-independent and platform-independent protocol standard, which lets developers add or replace the underlying monitoring implementation simply by changing the Tracer. On this basis, the Cloud Native Computing Foundation (CNCF) officially accepted OpenTracing in 2016 as its third hosted project; the first two, Kubernetes and Prometheus, have become de facto standards in the cloud native and open source worlds. This also shows how much importance the industry attaches to observability and unified standards.

OpenTracing consists of the API specification, frameworks and libraries that implement the specification, and project documentation. It standardizes the following:

• Backend-independent API interfaces: a traced service only needs to invoke the relevant API to be supported by any tracing backend that implements the interface.

• Management of the minimum tracing unit, the Span: APIs for starting a Span, ending a Span, and recording the Span's timing.

• Propagation of trace data between processes: APIs that make it easy to pass trace data across process boundaries.

• Multi-language support: full coverage of Go, Python, JavaScript, Java, C#, Objective-C, C++, Ruby, PHP, and other languages. It supports the Zipkin, LightStep, and AppDash tracers and is easily integrated into frameworks such as gRPC, Flask, DropWizard, Django, and Go kit.

Introduction to core terms

Trace

A Trace represents one complete request link through the system.

• Span – a single call within the Trace: the logical unit of work in the system, with a start time and a duration, which encapsulates several pieces of state.

Each Span encapsulates the following state:

• An operation name
• A start timestamp
• A finish timestamp
• Span Tags – a set of key-value pairs that make up the Span's tag collection.

The key of a key-value pair must be String, and the value can be a String, Boolean, or numeric type.

• Span Logs – a collection of log entries recorded on the Span.

Each Log operation contains a key-value pair and a timestamp. The key of a key-value pair must be String, and the value can be of any type.

• References – relationships to zero or more causally related Spans. References between Spans are established through the SpanContext.

• SpanContext – the carrier used to refer to other causally related Spans and to propagate a Span across process boundaries.

OpenTracing currently defines two reference types: ChildOf and FollowsFrom. Both model a direct causal relationship between a child Span and its parent Span.

In a ChildOf relationship, the parent Span waits for the child Span to return; the execution time of the child Span affects the execution time of its parent, and the parent depends on the child's result. Besides serial tasks, our logic also contains many parallel tasks that correspond to parallel Spans; in that case a parent Span can merge all child Spans and wait for every parallel child Span to finish. In distributed applications, some upstream systems do not depend in any way on the execution result of the downstream system, for example when the upstream system sends messages to the downstream system through a message queue. In that case the relationship between the downstream system's child Span and the upstream system's parent Span is FollowsFrom.
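
The following minimal Java sketch illustrates these terms with the OpenTracing API (operation name, tags, logs, and the two reference types). It assumes the opentracing-api and opentracing-util artifacts are on the classpath; GlobalTracer returns a no-op tracer unless a concrete tracer such as Jaeger or Zipkin has been registered, and the operation and tag names are made up for the example.

import io.opentracing.References;
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

import java.util.Collections;

public class SpanExample {
    public static void main(String[] args) {
        // No-op unless a concrete tracer (Jaeger, Zipkin, ...) has been registered.
        Tracer tracer = GlobalTracer.get();

        // Parent Span: operation name, tags, and logs as described above.
        Span parent = tracer.buildSpan("handle-request").start();
        parent.setTag("http.method", "GET");
        parent.log(Collections.singletonMap("event", "request received"));

        // ChildOf: the parent depends on (and waits for) this child's result.
        Span dbSpan = tracer.buildSpan("query-db").asChildOf(parent).start();
        dbSpan.finish();

        // FollowsFrom: the parent does not depend on the result,
        // e.g. a message handed to a queue and processed asynchronously.
        Span asyncSpan = tracer.buildSpan("publish-message")
                .addReference(References.FOLLOWS_FROM, parent.context())
                .start();
        asyncSpan.finish();

        parent.finish();
    }
}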

Data model

Having covered the terminology, we can see that there are three key, interconnected types in the OpenTracing specification: Tracer, Span, and SpanContext. The technical model of OpenTracing then becomes clear: a Trace call chain is implicitly defined by the Spans that belong to it. Each call is a Span, and each Span carries a global TraceID. The Trace call chain can be thought of as a directed acyclic graph (DAG) made up of multiple Spans, connected head to tail within the Trace. The TraceID and related context follow the Span "path" through the transport protocol, with the SpanContext as the carrier. This is the whole journey of a client request through a distributed application. In addition to this DAG view from the business perspective, a time-axis sequence diagram is often used to better show the timing and ordering of component invocations in the Trace call chain.
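
As a concrete illustration of how the SpanContext carries the TraceID across process boundaries, here is a minimal sketch using the OpenTracing Java propagation API; the "client" and "server" operation names and the header map are assumptions for the example, and with the default no-op tracer the extracted context is simply ignored.

import io.opentracing.Span;
import io.opentracing.SpanContext;
import io.opentracing.Tracer;
import io.opentracing.propagation.Format;
import io.opentracing.propagation.TextMapAdapter;
import io.opentracing.util.GlobalTracer;

import java.util.HashMap;
import java.util.Map;

public class PropagationExample {
    public static void main(String[] args) {
        Tracer tracer = GlobalTracer.get();

        // Client side: serialize the SpanContext (TraceID and friends) into carrier headers.
        Span clientSpan = tracer.buildSpan("http-client-call").start();
        Map<String, String> headers = new HashMap<>();
        tracer.inject(clientSpan.context(), Format.Builtin.HTTP_HEADERS, new TextMapAdapter(headers));
        clientSpan.finish();
        // "headers" would now travel with the outgoing HTTP request.

        // Server side: extract the SpanContext from the incoming headers and continue the Trace.
        SpanContext parentCtx = tracer.extract(Format.Builtin.HTTP_HEADERS, new TextMapAdapter(headers));
        Span serverSpan = tracer.buildSpan("handle-http-request").asChildOf(parentCtx).start();
        serverSpan.finish();
    }
}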

Best practices

• Application code

Developers can use OpenTracing to describe cause and effect relationships between services and add fine-grained logging information.

• Library code

Libraries that take intermediate control of requests can be integrated with OpenTracing, for example, a Web middleware library can use OpenTracing to create spans for requests, or an ORM library can use OpenTracing to describe high-level ORM semantics and perform specific SQL queries.

• RPC/IPC framework

Any cross-process subservice can use OpenTracing to standardize the format of trace data.

Related products

Products that follow the OpenTracing protocol include tracing components such as Jaeger, Zipkin, LightStep, and AppDash, which can easily be integrated into open source frameworks such as gRPC, Flask, Django, and Go kit.

OpenCensus

Across the observability landscape, in order to better implement DevOps, operations teams have started to pay attention to Logs and Metrics in addition to distributed tracing. Metrics cover machine-level indicators such as CPU, memory, disk, and network; network-protocol indicators such as gRPC request latency and error rate; and business indicators such as the number of users and the number of visits.

OpenCensus provides a unified set of measurement tools: cross-service Span capture and tracing, plus application-level Metrics.
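
A minimal Java sketch of what that unified instrumentation looks like with the OpenCensus API is shown below. It assumes the opencensus-api artifact is on the classpath; the measure name is made up, and an exporter and a View would still have to be registered for the Span and the measurement to actually leave the process.

import io.opencensus.common.Scope;
import io.opencensus.stats.Measure.MeasureLong;
import io.opencensus.stats.Stats;
import io.opencensus.stats.StatsRecorder;
import io.opencensus.trace.Tracer;
import io.opencensus.trace.Tracing;

public class OpenCensusExample {
    private static final Tracer tracer = Tracing.getTracer();
    private static final StatsRecorder statsRecorder = Stats.getStatsRecorder();
    private static final MeasureLong REQUEST_COUNT =
            MeasureLong.create("example/request_count", "Number of processed requests", "1");

    public static void main(String[] args) {
        // Trace: record a Span around a unit of work.
        try (Scope scope = tracer.spanBuilder("process-request").startScopedSpan()) {
            // Metrics: record a measurement; a View must be registered for it to be aggregated and exported.
            statsRecorder.newMeasureMap().put(REQUEST_COUNT, 1).record();
        }
    }
}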

Advantages

• While OpenTracing supports only Traces, OpenCensus supports both Traces and Metrics.

• Compared with OpenTracing, OpenCensus does not only define specifications; it also includes an agent and a collector.

• Its backing is larger than OpenTracing's, with support from Google and Microsoft.

What OpenCensus does

• Standard communication protocols and consistent APIs for handling Metrics and Traces.

• Multi-language library support: Java, C++, Go, .NET, Python, PHP, Node.js, Erlang, Ruby.

• Integration with RPC frameworks.

• Integrated storage and analysis tools.

• Fully open source, with support for third-party integrations and pluggable outputs.

• No additional servers or agents are required to support OpenCensus.

Introduction to core terms

In addition to the related terms of OpenTracing, OpenCensus also defines some new terms.

• Tags: OpenCensus allows metrics to be associated with dimensions as they are recorded, so measurement results can be analyzed from different angles.

• Stats: collects the observable results recorded by libraries and applications, and aggregates and exports the statistics. It covers Recording and Views (aggregated measurement queries).

• Trace: in addition to the Span attributes provided by OpenTracing, OpenCensus supports attributes such as Parent SpanId, Remote Parent, Attributes, Annotations, Message Events, and Links.

• Agent: the OpenCensus Agent is a daemon that allows multi-language OpenCensus deployments to use its Exporters. Instead of the traditional approach of installing and configuring an OpenCensus Exporter for each language library and each application, the OpenCensus Agent only needs to be enabled for its target language. For the operations team, this means managing a single Exporter and extracting data from multi-language applications, sending the data to the chosen backend, while minimizing the impact of repeated restarts or deployments on the applications. Finally, the Agent comes with Receivers, which let the Agent act as a pass-through for the backend: it receives observable data and routes it to the selected Exporter, such as Zipkin, Jaeger, or Prometheus.

• Collector: the Collector, an important part of OpenCensus, is written in Go and can receive traffic from any application with an available Receiver, regardless of programming language or deployment mode. A service or application that provides Metrics and Traces only needs a single Exporter component for its data to be collected from the multi-language application.

For developers, only a single Exporter needs to be managed and maintained, and all applications send data through OpenCensus. At the same time, developers are free to send the data to whichever backend the business needs, and to switch backends at any time. To address the problem of sending large amounts of data over a network where transmission may fail, the Collector has buffering and retry capabilities to ensure data integrity and availability.

• Exporters: OpenCensus can export data to various backends through a variety of Exporters, for example Prometheus for stats, OpenZipkin for traces, Stackdriver Monitoring for stats and traces, Jaeger for traces, and Graphite for stats.

Related products

Products that follow the OpenCensus protocol include Prometheus, SignalFX, Stackdriver, and Zipkin. Comparing the two in terms of functionality and features, both OpenTracing and OpenCensus have obvious strengths and weaknesses: OpenTracing supports more languages and is more loosely coupled to other systems, while OpenCensus supports Metrics as well as distributed tracing, and its support reaches from the API layer all the way down to the infrastructure layer. For many developers, a new question arose: could there be a project that integrates OpenTracing and OpenCensus and also supports log data?

OpenTelemetry

To answer that question, let's look at what a typical troubleshooting process looks like:

• Open the monitoring dashboard, find the anomaly, and identify the abnormal module (Metrics).

• Query and analyze the logs associated with the abnormal module and locate the core error message (Logs).

• Locate the code causing the problem through detailed call-chain data (Tracing).

To obtain better observability, or to quickly solve the problems above, Tracing, Metrics, and Logs are all indispensable.

At the same time, a wealth of open source and commercial solutions exist in the industry, including:

• Metrics: Zabbix, Nagios, Prometheus, InfluxDB, OpenFalcon, OpenCensus

• Tracing: Jaeger, Zipkin, SkyWalking, OpenTracing, OpenCensus

• Logs: ELK, Splunk, SumoLogic, Loki, Loggly

There are many different solutions with many different protocol formats and data types, and it is difficult for them to be compatible with one another. Meanwhile, in real business scenarios these solutions are mixed together, and developers have to write all kinds of adapters to make them work with each other.

What is OpenTelemetry

To better integrate Traces, Metrics, and Logs, OpenTelemetry was born. As a CNCF incubating project, OpenTelemetry is a set of specifications, APIs, SDKs, tools, and integrations that merges the OpenTracing and OpenCensus projects. It gives developers a unified standard for Metrics, Tracing, and Logs, all of which share the same metadata structure and can easily be correlated with one another.

OpenTelemetry is vendor-agnostic and platform-agnostic, and does not provide observability backends of its own. Depending on user requirements, the observability data can be exported to storage, query, visualization, and other backends such as Prometheus, Jaeger, or cloud vendor services.

Advantages

The core advantages of OpenTelemetry are as follows:

• Completely removes the hidden danger of vendor lock-in

For an operations engineer, when the current tooling is no longer enough but the cost of switching is too high, it is hard not to feel trapped by the vendor. OpenTelemetry aims to break this cycle by providing a standardized instrumentation framework: as a pluggable service, it can easily add support for common technical protocols and formats, making the choice of services much freer.

• Specification development and protocol unification

OpenTelemetry takes a standards-based implementation approach. The focus on standards is especially important for OpenTelemetry because trace data must interoperate across languages. Many languages ship type definitions that can be used in implementations, such as interfaces for creating reusable components. The standard includes the specifications required for the internal implementation of the observability client and the protocol specifications required for the client to communicate with the outside world. Specifically, these include:

• API: defines the data types and operations for Metrics, Tracing, and Logs.

• SDK: defines the requirements for language-specific implementations of the API, along with configuration, data-processing, and export concepts.

• Data: defines the OpenTelemetry Protocol (OTLP). Although components in OpenTelemetry support the Zipkin v2 and Jaeger Thrift protocol formats, both are provided as third-party contribution libraries; only OTLP is the officially and natively supported format of OpenTelemetry.

Each language implements the specification through its API. The APIs contain language-specific type and interface definitions, that is, the abstract classes, types, and interfaces used by concrete language implementations. They also include no-op implementations to support local testing and provide tooling for unit tests. The definition of the API lives in each language's implementation. As the OpenTelemetry Python client puts it: "The opentelemetry-api package includes abstract classes and no-op implementations that comprise the OpenTelemetry API following the specification." The JavaScript client has a similar description: "This package provides everything needed to interact with the OpenTelemetry API, including all TypeScript interfaces, enums, and no-op implementations. It can be used both on the server and in the browser."

• Multi-language SDK implementation and integration

OpenTelemetry provides an SDK for each common language, combining an exporter with the API. SDKs are concrete, executable implementations of the API. Implementations exist for C++, .NET, Erlang/Elixir, Go, Java, JavaScript, PHP, Python, Ruby, Rust, and Swift.

The OpenTelemetry SDK uses the OpenTelemetry API to generate observability data in the chosen language and exports that data to a backend. It also allows enhancements for common libraries and frameworks. Users can rely on the SDK for automatic instrumentation as well as manual instrumentation, and integration with third-party libraries (Log4j, Logback, and so on) is supported. These packages are generally based on the specifications and definitions in opentelemetry-specification, combined with the characteristics of the language itself, to implement the basic capability of collecting observability data on the client side, such as propagating metadata between services and processes, adding Trace instrumentation and exporting its data, and creating, using, and exporting Metrics. A minimal sketch follows below.
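
The following minimal Java sketch wires the SDK to an OTLP exporter and then creates a Span manually through the vendor-neutral API. It assumes the opentelemetry-sdk and opentelemetry-exporter-otlp artifacts are on the classpath; the endpoint, instrumentation name, and attribute are placeholders.

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class ManualTracingExample {
    public static void main(String[] args) {
        // Wire the SDK: export spans over OTLP/gRPC to a local Collector (placeholder endpoint).
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint("http://localhost:4317")
                .build();
        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build();
        OpenTelemetry openTelemetry = OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                .build();

        // Manual instrumentation through the vendor-neutral API.
        Tracer tracer = openTelemetry.getTracer("com.example.demo");
        Span span = tracer.spanBuilder("process-order").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("order.id", "12345"); // hypothetical business attribute
        } finally {
            span.end();
        }

        tracerProvider.shutdown();
    }
}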

• Implementation of data collection system

A fundamental principle in tracing practice is that the collection of observability data must be orthogonal to the business logic. The Collector is built on this principle of minimizing the impact of the observability client on the existing business logic. OpenTelemetry provides a collection system based on the OpenCensus Service, consisting of the Agent and the Collector. The Collector collects, transforms, and exports observability data: it receives data in a variety of formats (such as OTLP, Jaeger, and Prometheus) and sends it to one or more backends. It also supports processing and filtering the data before it is exported. The Collector Contrib packages support additional data formats and backends.

At the architectural level, the Collector has two modes. In the first, the Collector is deployed on the same host as the application (a Kubernetes DaemonSet) or in the same Pod (a Kubernetes sidecar), and the collected telemetry data is sent directly to the Collector over the loopback network; this is collectively called Agent mode. In the second, the Collector is treated as standalone middleware to which applications send their collected telemetry data; this is called Gateway mode. The two modes can be used independently or combined, as long as the data protocol format at the egress matches the format at the ingress.

• Automated code injection technology

OpenTelemetry has also begun to provide an implementation for automatic code injection, which currently supports automatic injection in various major Java frameworks.

• Cloud native architecture

OpenTelemetry was designed with cloud native features in mind and also provides the Kubernetes Operator for rapid deployment.

Data types supported by OpenTelemetry

• Metrics

Metrics are measurements about a service captured at run time. The moment at which one of these measurements is captured is logically called a Metric event; it includes not only the measurement itself but also the time it was captured and associated metadata. Application and request metrics are important indicators of availability and performance, and custom metrics provide insight into how availability affects the user experience and the business.

OpenTelemetry currently defines three Metric instruments (a rough sketch in Java follows this list):

• Counter: a value summed over time; think of a car's odometer, which only ever goes up.

• Measure: values aggregated over time, representing a value within a defined range.

• Observer: captures a set of current values at a particular point in time, like a fuel gauge in a vehicle.
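
As a rough Java sketch of these three instrument styles using the current OpenTelemetry metrics API, whose instrument names have since evolved (a counter, a histogram as the Measure-like instrument, and an observable gauge as the Observer-like instrument); it assumes a metrics-enabled SDK has been registered globally, and the metric names are made up.

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class MetricsExample {
    public static void main(String[] args) {
        Meter meter = GlobalOpenTelemetry.getMeter("com.example.demo");

        // Counter-style instrument: a monotonically increasing sum (the "odometer").
        LongCounter requests = meter.counterBuilder("http.requests")
                .setDescription("Number of handled requests")
                .build();
        requests.add(1);

        // Measure-style instrument: values aggregated over time (here a histogram of latencies).
        DoubleHistogram latency = meter.histogramBuilder("http.request.duration")
                .setUnit("ms")
                .build();
        latency.record(23.5);

        // Observer-style instrument: samples a current value on each collection (the "fuel gauge").
        meter.gaugeBuilder("jvm.heap.used.ratio")
                .buildWithCallback(measurement -> measurement.record(currentHeapRatio()));
    }

    private static double currentHeapRatio() {
        Runtime rt = Runtime.getRuntime();
        return (double) (rt.totalMemory() - rt.freeMemory()) / rt.maxMemory();
    }
}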

• Logs

Logs are timestamped text records, structured or unstructured, with metadata. Although each log is an independent data source, log records can be attached to the Spans of a Trace, and they also show up when analyzing individual nodes of a call in day-to-day use. In OpenTelemetry, any data that is not part of a distributed Trace or a Metric is a log. Logs are often used to determine the root cause of a problem and typically record who changed what and what the result of the change was.

• Traces

A Trace is the tracking of a single request, which may be initiated by an application or by a user. Distributed tracing is a form of tracing that crosses networks and applications. Each unit of work in a Trace is called a Span, and a Trace is a tree of Spans. A Span represents the work done by a service or component involved in the request, and it also provides request, error, and duration metrics that can be used to debug availability and performance issues. Each Span contains a Span context, a set of globally unique identifiers representing the unique request that the Span belongs to; this is usually referred to as the TraceID.
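
A tiny Java sketch of reading those identifiers from a Span's context is shown below; with the default no-op API the IDs come back as all zeros, so an SDK would need to be wired up (as in the earlier sketch) to get real ones.

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;

public class TraceIdExample {
    public static void main(String[] args) {
        Tracer tracer = GlobalOpenTelemetry.getTracer("com.example.demo");
        Span span = tracer.spanBuilder("checkout").startSpan();
        // Every Span carries a SpanContext; the TraceId is shared by all Spans of the same request.
        System.out.println("traceId = " + span.getSpanContext().getTraceId());
        System.out.println("spanId  = " + span.getSpanContext().getSpanId());
        span.end();
    }
}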

• Baggage

In addition to Trace propagation, OpenTelemetry provides Baggage to propagate key-value pairs. Baggage is used to index observable events in a service with properties provided by earlier services in the same transaction, which helps establish causal relationships between events. While Baggage can be used as a prototype for other cross-cutting concerns, the mechanism is primarily intended to pass values through the OpenTelemetry observability system. These values can be consumed from Baggage and used as additional dimensions for metrics, or as additional context for logs and traces.
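
A minimal Java sketch of setting and reading a Baggage entry is shown below; the key and value are hypothetical business dimensions, and actual cross-process propagation additionally requires a configured propagator in the SDK.

import io.opentelemetry.api.baggage.Baggage;
import io.opentelemetry.context.Scope;

public class BaggageExample {
    public static void main(String[] args) {
        // Attach a key-value pair that travels alongside the trace context.
        Baggage baggage = Baggage.builder()
                .put("user.tier", "premium") // hypothetical business dimension
                .build();

        try (Scope ignored = baggage.makeCurrent()) {
            // Downstream code (or another service, after propagation) can read the value
            // and use it as an extra dimension for metrics or extra context for logs.
            String tier = Baggage.current().getEntryValue("user.tier");
            System.out.println("user.tier = " + tier);
        }
    }
}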

Just a first step, or a one-stop shop?

Putting the above together, we can see that OpenTelemetry covers the specification, the API definition, the specification's implementation, and the collection and transmission of every type of observability data. An application needs only one kind of SDK to generate all data types in a unified way, and a cluster needs to deploy only one OpenTelemetry Collector to collect all of them. In addition, Metrics, Tracing, and Logging share the same metadata and can be seamlessly correlated.

OpenTelemetry solves the first step: unifying observability data. It standardizes the collection and transmission of that data through its API and SDK. OpenTelemetry does not try to rewrite every component; instead it maximizes reuse of the tools that are already common in each area of the industry by providing secure, platform-neutral, vendor-neutral protocols, components, and standards. Its positioning is very clear: unify the collection of data and its standards and specifications, and stay out of how the data is used, stored, displayed, and alerted on. However, as far as a complete observability solution is concerned, OpenTelemetry only completes the unified production of data; it has no clear plan for how to store, analyze, and alert on that data, and these problems are very real:

• How to store the different types of data

Metrics can live in Prometheus, InfluxDB, or various time-series databases. Tracing can be connected to Jaeger, OpenCensus, or Zipkin. But how to select and operate these backend services is a hard problem.

• Data analysis (visualization and correlation)

How can the collected data be analyzed in a unified way? Different data must be processed by different data platforms. To display Metrics, Logging, and Tracing on a single platform and correlate them with one another requires a lot of custom development, which is a heavy burden for operations.

• Anomaly detection and diagnosis

In addition to day-to-day visual monitoring, anomaly detection and root cause diagnosis for applications are important operations requirements, so OpenTelemetry data needs to be integrated into AIOps. But for many development and operations teams, basic DevOps is not yet fully in place, let alone AIOps.

Best practice: accessing the Application Real-Time Monitoring Service (ARMS) through OpenTelemetry

To solve the problems above, Alibaba Cloud provides the Application Real-Time Monitoring Service (ARMS) to help operations teams with data analysis, anomaly detection, and diagnosis. ARMS supports several ways of ingesting OpenTelemetry Trace data: you can report OpenTelemetry Trace data to ARMS directly, or forward it through an OpenTelemetry Collector.

(1) Direct reporting

• Report OpenTelemetry Trace data from a Java application through the ARMS Java Agent. Installing the ARMS Java Agent is recommended: it has built-in instrumentation for a large number of common components and automatically reports Trace data in the OpenTelemetry format, out of the box and without extra configuration. For details, see Monitoring Java Applications [2].

• Combine the ARMS Java Agent with the OpenTelemetry Java SDK to report Trace data. ARMS Java Agent v2.7.1.3 and later support OpenTelemetry Java SDK extensions, so you can use the OpenTelemetry SDK to add custom method instrumentation while the ARMS Java Agent automatically captures Trace data for common components. For details, see OpenTelemetry Java SDK Support [3].

• Report Trace data directly through OpenTelemetry. You can instrument the application with the OpenTelemetry SDK and report Trace data directly through Jaeger; a minimal sketch follows this list. For details, see Reporting Java Application Data through OpenTelemetry [4].
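
A minimal Java sketch of direct reporting is shown below, assuming the OTLP/gRPC exporter is used with the endpoint and Authentication token taken from the ARMS console (the same placeholders that appear in the Collector configuration later in this article); the exact endpoint, port, and protocol for your region, including the Jaeger-based option, are described in the linked documentation.

import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class ArmsDirectReportSketch {
    public static void main(String[] args) {
        // Placeholders: copy the real endpoint and token from the ARMS console.
        String endpoint = "http://<endpoint>:8090";
        String token = "<token>";

        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint(endpoint)
                .addHeader("Authentication", token)
                .build();

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build();

        // Register globally so application code can obtain tracers via GlobalOpenTelemetry.
        OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                .buildAndRegisterGlobal();
    }
}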

(2) Forward through the OpenTelemetry Collector

• Forward Trace data through the ARMS for OpenTelemetry Collector

In a Container Service for Kubernetes (ACK) environment, you can install the ARMS for OpenTelemetry Collector with one click to forward Trace data. The ARMS for OpenTelemetry Collector provides lossless trace statistics (local pre-aggregation, so statistical results are not affected by the sampling rate), dynamic configuration tuning, state management, and an out-of-the-box Trace dashboard in Grafana, while being easier to use, stable, and reliable. The access flow for the ARMS for OpenTelemetry Collector is as follows:

  1. Install the ARMS for OpenTelemetry Collector from the application marketplace in the ACK console.

A. Log in to the Container Service Management Console [5].
B. In the navigation pane, choose Marketplace > App Marketplace.
C. On the App Marketplace page, search for the ack-arms-cmonitor component and click ack-arms-cmonitor.
D. Click the one-click deployment button in the upper-right corner of the ack-arms-cmonitor page.
E. In the creation panel, select the target cluster and click Next. The default namespace is arms-prom.
F. Click OK.
G. In the navigation pane, click Clusters and then click the name of the cluster where the ack-arms-cmonitor component is installed.
H. In the navigation pane, choose Workloads > DaemonSets and select the arms-prom namespace at the top of the page.
I. Click otel-collector-service and check whether the otel-collector-service Service is running properly. If multiple Receiver ports are exposed for receiving OpenTelemetry data, the installation is successful.

  2. Set the Exporter address in the SDK to otel-collector-service:<port>.

• Forward Trace data through the open source OpenTelemetry Collector

To use the open source OpenTelemetry Collector to forward Trace data to ARMS, you only need to modify the endpoint and the authentication token in the Exporter configuration:

exporters:
  otlp:
    endpoint: <endpoint>:8090
    tls:
      insecure: true
    headers:
      Authentication: <token>

Instructions

• Replace <endpoint> with the endpoint for your reporting region, for example tracing-analysis-dc-bj.aliyuncs.com:8090.

• Replace <token> with the token obtained from the console, for example b590lhguqs@3a7*********9b_b590lhguqs@53d*****8301.

(3) OpenTelemetry Trace Usage Guide

To get more value out of OpenTelemetry Trace data, ARMS provides diagnostic capabilities such as trace details, pre-aggregated dashboards, Trace Explorer post-aggregation analysis, and correlation of call chains with service logs.

• Trace details: on the left side of the trace details panel you can view the call order and elapsed time of each interface on the chain; on the right side you can view detailed additional information and associated indicators, such as database SQL, JVM, and host monitoring metrics.

• Pre-aggregated dashboards: ARMS provides multiple pre-aggregated dashboards based on OpenTelemetry Trace data, including the application overview, interface calls, database calls, and more.

• Trace Explorer post-aggregation analysis: for OpenTelemetry Trace data, ARMS provides flexible multi-dimensional filtering and post-aggregation analysis, such as querying abnormal traces for a specific application. You can also aggregate Trace data by IP, interface, and other dimensions.

• Correlating call chains with service logs: ARMS associates OpenTelemetry Traces with service logs so that business faults on application interfaces can be traced from the call chain down to the logs.

Links

[1] Dapper, a Large-Scale Distributed Systems Tracing Infrastructure: static.googleusercontent.com/media/resea…

[2] Monitoring Java Applications: help.aliyun.com/document_de…

[3] OpenTelemetry Java SDK Support: help.aliyun.com/document_de…

[4] Reporting Java Application Data through OpenTelemetry: help.aliyun.com/document_de…

[5] Container Service Management Console: cs.console.aliyun.com/
