Author: Observable

As distributed and serverless applications are adopted by more and more developers and enterprises, the hidden operations problems they bring are becoming more visible: long call chains in a microservice architecture make problems slow to locate, and day-to-day monitoring is very difficult for operations teams. As a concrete example, completing a single user request in a distributed application may require processing by several different microservices, and the failure or performance degradation of any one of them can have a significant impact on the response to that request. As the business grows, the call chain becomes even more complex. It is difficult to get a panoramic view or drill down to the root cause just by printing logs or using APM performance monitoring; troubleshooting or performance analysis becomes like blind men trying to describe an elephant.

Faced with such problems, Google published the paper "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" [1] to introduce its distributed tracing technology, and argued that a distributed tracing system should meet the following requirements:

• Low performance overhead: the performance overhead that distributed tracing imposes on services should be negligible, especially in performance-sensitive applications.

• Low intrusiveness: intrude on business code as little as possible, or not at all.

• Rapid scaling: the ability to scale quickly as the business or the number of microservices grows.

• Real-time display: low-latency data collection, real-time monitoring of the system, and quick response to system anomalies.

In addition to the above requirements, the paper also gives a complete exposition of the three core stages of distributed tracing: data collection, data persistence, and data presentation. Here, data collection means instrumenting the code to define what each request should report; data persistence means storing the reported data on disk; and data presentation means rendering the requests associated with a given TraceID on an interface.

With the publication of this paper, distributed tracing was accepted by more and more people, and the concept gradually took shape. Related products sprang up, with distributed tracing systems such as Uber's Jaeger and Twitter's Zipkin making their mark. But a problem emerged: each product had its own set of data collection standards and SDKs, and although most were based on the Google Dapper protocol, their implementations differed. To solve this problem, OpenTracing and OpenCensus were born.

OpenTracing

For many developers, supporting distributed tracing is hard. It requires that trace data be passed not only within a process but also between processes. Even harder, other components must also support distributed tracing, such as open source services like NGINX, Cassandra, and Redis, or open source libraries such as gRPC and ORMs introduced inside the service.

Before OpenTracing, many distributed tracing systems were implemented with application-level instrumentation using incompatible APIs, which made developers uneasy about tightly coupling their applications to any particular distributed tracing product. On the other hand, these application-level instrumentation APIs have very similar semantics. To address the incompatibility of the APIs of different distributed tracing systems, and to standardize how trace data is passed from one library to another and from one process to the next, the OpenTracing specification was created: a lightweight standardization layer that sits between an application or library and a tracing or log-analysis backend.

Advantages

The advantage of OpenTracing lies in its vendor-independent and platform-independent protocol standard, which lets developers add or replace the underlying monitoring implementation simply by changing the Tracer. On this basis, the Cloud Native Computing Foundation (CNCF) officially accepted OpenTracing in 2016 as its third hosted project; the first two, Kubernetes and Prometheus, have become de facto standards in the cloud native and open source worlds. This also shows how much importance the industry attaches to observability and unified standards.

OpenTracing consists of the API specification, frameworks and libraries that implement the specification, and project documentation. It standardizes the following:

• Backend-independent API interfaces: a traced service only needs to invoke the relevant API to be supported by any tracing backend that implements the interface.

• Management of the minimum tracing unit, the Span: APIs for starting a Span, ending a Span, and recording the Span's timing.

• Propagation of trace data between processes: APIs that make it easy to pass trace data across process boundaries.

• Multi-language support: full coverage of Go, Python, JavaScript, Java, C#, Objective-C, C++, Ruby, PHP, and other languages. It supports the Zipkin, LightStep, and AppDash tracers and is easily integrated into frameworks such as gRPC, Flask, DropWizard, Django, and Go kit.

Introduction to core terms

Trace

A Trace represents one complete request link through the system.

• Span – a single call within the Trace: the logical unit of work in the system, with a start time and a duration, which encapsulates several pieces of state.

Each Span encapsulates the following state:

• An operation name
• A start timestamp
• A finish timestamp
• Span Tags – a set of key-value pairs that make up the Span's tag collection.

The key of a key-value pair must be String, and the value can be a String, Boolean, or numeric type.

• Span Logs – a collection of log entries recorded on the Span.

Each Log operation contains a key-value pair and a timestamp. The key of a key-value pair must be String, and the value can be of any type.

• References – relationships to zero or more causally related Spans. References between Spans are established through the SpanContext.

• SpanContext – the carrier used to refer to other causally related Spans and to propagate a Span across process boundaries.

OpenTracing currently defines two reference types: ChildOf and FollowsFrom. Both model a direct causal relationship between a child Span and its parent Span.

In a ChildOf relationship, the parent Span waits for the child Span to return; the execution time of the child Span affects the execution time of its parent, and the parent depends on the child's result. Besides serial tasks, our logic also contains many parallel tasks that correspond to parallel Spans; in that case a parent Span can merge all child Spans and wait for every parallel child Span to finish. In distributed applications, some upstream systems do not depend in any way on the execution result of the downstream system, for example when the upstream system sends messages to the downstream system through a message queue. In that case the relationship between the downstream system's child Span and the upstream system's parent Span is FollowsFrom.
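
The following minimal Java sketch illustrates these terms with the OpenTracing API (operation name, tags, logs, and the two reference types). It assumes the opentracing-api and opentracing-util artifacts are on the classpath; GlobalTracer returns a no-op tracer unless a concrete tracer such as Jaeger or Zipkin has been registered, and the operation and tag names are made up for the example.

import io.opentracing.References;
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

import java.util.Collections;

public class SpanExample {
    public static void main(String[] args) {
        // No-op unless a concrete tracer (Jaeger, Zipkin, ...) has been registered.
        Tracer tracer = GlobalTracer.get();

        // Parent Span: operation name, tags, and logs as described above.
        Span parent = tracer.buildSpan("handle-request").start();
        parent.setTag("http.method", "GET");
        parent.log(Collections.singletonMap("event", "request received"));

        // ChildOf: the parent depends on (and waits for) this child's result.
        Span dbSpan = tracer.buildSpan("query-db").asChildOf(parent).start();
        dbSpan.finish();

        // FollowsFrom: the parent does not depend on the result,
        // e.g. a message handed to a queue and processed asynchronously.
        Span asyncSpan = tracer.buildSpan("publish-message")
                .addReference(References.FOLLOWS_FROM, parent.context())
                .start();
        asyncSpan.finish();

        parent.finish();
    }
}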

Data model

Having covered the terminology, we can see that there are three key, interconnected types in the OpenTracing specification: Tracer, Span, and SpanContext. The technical model of OpenTracing then becomes clear: a Trace call chain is implicitly defined by the Spans that belong to it. Each call is a Span, and each Span carries a global TraceID. The Trace call chain can be thought of as a directed acyclic graph (DAG) made up of multiple Spans, connected head to tail within the Trace. The TraceID and related context follow the Span "path" through the transport protocol, with the SpanContext as the carrier. This is the whole journey of a client request through a distributed application. In addition to this DAG view from the business perspective, a time-axis sequence diagram is often used to better show the timing and ordering of component invocations in the Trace call chain.
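
As a concrete illustration of how the SpanContext carries the TraceID across process boundaries, here is a minimal sketch using the OpenTracing Java propagation API; the "client" and "server" operation names and the header map are assumptions for the example, and with the default no-op tracer the extracted context is simply ignored.

import io.opentracing.Span;
import io.opentracing.SpanContext;
import io.opentracing.Tracer;
import io.opentracing.propagation.Format;
import io.opentracing.propagation.TextMapAdapter;
import io.opentracing.util.GlobalTracer;

import java.util.HashMap;
import java.util.Map;

public class PropagationExample {
    public static void main(String[] args) {
        Tracer tracer = GlobalTracer.get();

        // Client side: serialize the SpanContext (TraceID and friends) into carrier headers.
        Span clientSpan = tracer.buildSpan("http-client-call").start();
        Map<String, String> headers = new HashMap<>();
        tracer.inject(clientSpan.context(), Format.Builtin.HTTP_HEADERS, new TextMapAdapter(headers));
        clientSpan.finish();
        // "headers" would now travel with the outgoing HTTP request.

        // Server side: extract the SpanContext from the incoming headers and continue the Trace.
        SpanContext parentCtx = tracer.extract(Format.Builtin.HTTP_HEADERS, new TextMapAdapter(headers));
        Span serverSpan = tracer.buildSpan("handle-http-request").asChildOf(parentCtx).start();
        serverSpan.finish();
    }
}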

Best practices

• Application code

Developers can use OpenTracing to describe cause and effect relationships between services and add fine-grained logging information.

• Library code

Libraries that take intermediate control of requests can be integrated with OpenTracing, for example, a Web middleware library can use OpenTracing to create spans for requests, or an ORM library can use OpenTracing to describe high-level ORM semantics and perform specific SQL queries.

• RPC/IPC framework

Any cross-process subservice can use OpenTracing to standardize the format of trace data.

Related products

Products that follow the OpenTracing protocol include tracing components such as Jaeger, Zipkin, LightStep, and AppDash, which can easily be integrated into open source frameworks such as gRPC, Flask, Django, and Go kit.

OpenCensus

Across the observability landscape, in order to better implement DevOps, operations teams have started to pay attention to Logs and Metrics in addition to distributed tracing. Metrics cover machine-level indicators such as CPU, memory, disk, and network; network-protocol indicators such as gRPC request latency and error rate; and business indicators such as the number of users and the number of visits.

OpenCensus provides a unified set of measurement tools: cross-service Span capture and tracing, plus application-level Metrics.
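
A minimal Java sketch of what that unified instrumentation looks like with the OpenCensus API is shown below. It assumes the opencensus-api artifact is on the classpath; the measure name is made up, and an exporter and a View would still have to be registered for the Span and the measurement to actually leave the process.

import io.opencensus.common.Scope;
import io.opencensus.stats.Measure.MeasureLong;
import io.opencensus.stats.Stats;
import io.opencensus.stats.StatsRecorder;
import io.opencensus.trace.Tracer;
import io.opencensus.trace.Tracing;

public class OpenCensusExample {
    private static final Tracer tracer = Tracing.getTracer();
    private static final StatsRecorder statsRecorder = Stats.getStatsRecorder();
    private static final MeasureLong REQUEST_COUNT =
            MeasureLong.create("example/request_count", "Number of processed requests", "1");

    public static void main(String[] args) {
        // Trace: record a Span around a unit of work.
        try (Scope scope = tracer.spanBuilder("process-request").startScopedSpan()) {
            // Metrics: record a measurement; a View must be registered for it to be aggregated and exported.
            statsRecorder.newMeasureMap().put(REQUEST_COUNT, 1).record();
        }
    }
}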

Advantages

• While OpenTracing supports only Traces, OpenCensus supports both Traces and Metrics.

• Compared with OpenTracing, OpenCensus does not only define specifications; it also includes an agent and a collector.

• Its backing is larger than OpenTracing's, with support from Google and Microsoft.

What OpenCensus does

• Standard communication protocols and consistent APIs for handling Metrics and Traces.

• Multi-language library support: Java, C++, Go, .NET, Python, PHP, Node.js, Erlang, Ruby.

• Integration with RPC frameworks.

• Integrated storage and analysis tools.

• Fully open source, with support for third-party integrations and pluggable outputs.

• No additional servers or agents are required to support OpenCensus.

Introduction to core terms

In addition to the related terms of OpenTracing, OpenCensus also defines some new terms.

• Tags: OpenCensus allows metrics to be associated with dimensions as they are recorded, so measurement results can be analyzed from different angles.

• Stats: collects the observable results recorded by libraries and applications, and aggregates and exports the statistics. It covers Recording and Views (aggregated measurement queries).

• Trace: in addition to the Span attributes provided by OpenTracing, OpenCensus supports attributes such as Parent SpanId, Remote Parent, Attributes, Annotations, Message Events, and Links.

• Agent: the OpenCensus Agent is a daemon that allows multi-language OpenCensus deployments to use its Exporters. Instead of the traditional approach of installing and configuring an OpenCensus Exporter for each language library and each application, the OpenCensus Agent only needs to be enabled for its target language. For the operations team, this means managing a single Exporter and extracting data from multi-language applications, sending the data to the chosen backend, while minimizing the impact of repeated restarts or deployments on the applications. Finally, the Agent comes with Receivers, which let the Agent act as a pass-through for the backend: it receives observable data and routes it to the selected Exporter, such as Zipkin, Jaeger, or Prometheus.

• Collector: the Collector, an important part of OpenCensus, is written in Go and can receive traffic from any application with an available Receiver, regardless of programming language or deployment mode. A service or application that provides Metrics and Traces only needs a single Exporter component for its data to be collected from the multi-language application.

For developers, only a single Exporter needs to be managed and maintained, and all applications send data through OpenCensus. At the same time, developers are free to send the data to whichever backend the business needs, and to switch backends at any time. To address the problem of sending large amounts of data over a network where transmission may fail, the Collector has buffering and retry capabilities to ensure data integrity and availability.

• Exporters: OpenCensus can export data to various backends through a variety of Exporters, for example Prometheus for stats, OpenZipkin for traces, Stackdriver Monitoring for stats and traces, Jaeger for traces, and Graphite for stats.

Related products

Products that follow the OpenCensus protocol include Prometheus, SignalFX, Stackdriver, and Zipkin. Comparing the two in terms of functionality and features, both OpenTracing and OpenCensus have obvious strengths and weaknesses: OpenTracing supports more languages and is more loosely coupled to other systems, while OpenCensus supports Metrics as well as distributed tracing, and its support reaches from the API layer all the way down to the infrastructure layer. For many developers, a new question arose: could there be a project that integrates OpenTracing and OpenCensus and also supports log data?

OpenTelemetry

To answer that question, let's look at what a typical troubleshooting process looks like:

• Open the monitoring dashboard, find the anomaly, and identify the abnormal module (Metrics).

• Query and analyze the logs associated with the abnormal module and locate the core error message (Logs).

• Locate the code causing the problem through detailed call-chain data (Tracing).

To obtain better observability, or to quickly solve the problems above, Tracing, Metrics, and Logs are all indispensable.

At the same time, a wealth of open source and commercial solutions exist in the industry, including:

• Metrics: Zabbix, Nagios, Prometheus, InfluxDB, OpenFalcon, OpenCensus

• Tracing: Jaeger, Zipkin, SkyWalking, OpenTracing, OpenCensus

• Logs: ELK, Splunk, SumoLogic, Loki, Loggly

There are many different solutions with many different protocol formats and data types, and it is difficult for them to be compatible with one another. Meanwhile, in real business scenarios these solutions are mixed together, and developers have to write all kinds of adapters to make them work with each other.

What is OpenTelemetry

To better integrate Traces, Metrics, and Logs, OpenTelemetry was born. As a CNCF incubating project, OpenTelemetry is a set of specifications, APIs, SDKs, tools, and integrations that merges the OpenTracing and OpenCensus projects. It gives developers a unified standard for Metrics, Tracing, and Logs, all of which share the same metadata structure and can easily be correlated with one another.

OpenTelemetry is vendor-agnostic and platform-agnostic, and does not provide observability backends of its own. Depending on user requirements, the observability data can be exported to storage, query, visualization, and other backends such as Prometheus, Jaeger, or cloud vendor services.

Advantages

The core advantages of OpenTelemetry are as follows:

• Completely removes the hidden danger of vendor lock-in

For an operations engineer, when the current tooling is no longer enough but the cost of switching is too high, it is hard not to feel trapped by the vendor. OpenTelemetry aims to break this cycle by providing a standardized instrumentation framework: as a pluggable service, it can easily add support for common technical protocols and formats, making the choice of services much freer.

• Specification development and protocol unification

OpenTelemetry takes a standards-based implementation approach. The focus on standards is especially important for OpenTelemetry because trace data must interoperate across languages. Many languages ship type definitions that can be used in implementations, such as interfaces for creating reusable components. The standard includes the specifications required for the internal implementation of the observability client and the protocol specifications required for the client to communicate with the outside world. Specifically, these include:

• API: defines the data types and operations for Metrics, Tracing, and Logs.

• SDK: defines the requirements for language-specific implementations of the API, along with configuration, data-processing, and export concepts.

• Data: defines the OpenTelemetry Protocol (OTLP). Although components in OpenTelemetry support the Zipkin v2 and Jaeger Thrift protocol formats, both are provided as third-party contribution libraries; only OTLP is the officially and natively supported format of OpenTelemetry.

Each language implements the specification through its API. The APIs contain language-specific type and interface definitions, that is, the abstract classes, types, and interfaces used by concrete language implementations. They also include no-op implementations to support local testing and provide tooling for unit tests. The definition of the API lives in each language's implementation. As the OpenTelemetry Python client puts it: "The opentelemetry-api package includes abstract classes and no-op implementations that comprise the OpenTelemetry API following the specification." The JavaScript client has a similar description: "This package provides everything needed to interact with the OpenTelemetry API, including all TypeScript interfaces, enums, and no-op implementations. It can be used both on the server and in the browser."

• Multi-language SDK implementation and integration

OpenTelemetry provides an SDK for each common language, combining an exporter with the API. SDKs are concrete, executable implementations of the API. Implementations exist for C++, .NET, Erlang/Elixir, Go, Java, JavaScript, PHP, Python, Ruby, Rust, and Swift.

The OpenTelemetry SDK uses the OpenTelemetry API to generate observability data in the chosen language and exports that data to a backend. It also allows enhancements for common libraries and frameworks. Users can rely on the SDK for automatic instrumentation as well as manual instrumentation, and integration with third-party libraries (Log4j, Logback, and so on) is supported. These packages are generally based on the specifications and definitions in opentelemetry-specification, combined with the characteristics of the language itself, to implement the basic capability of collecting observability data on the client side, such as propagating metadata between services and processes, adding Trace instrumentation and exporting its data, and creating, using, and exporting Metrics. A minimal sketch follows below.
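
The following minimal Java sketch wires the SDK to an OTLP exporter and then creates a Span manually through the vendor-neutral API. It assumes the opentelemetry-sdk and opentelemetry-exporter-otlp artifacts are on the classpath; the endpoint, instrumentation name, and attribute are placeholders.

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class ManualTracingExample {
    public static void main(String[] args) {
        // Wire the SDK: export spans over OTLP/gRPC to a local Collector (placeholder endpoint).
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint("http://localhost:4317")
                .build();
        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build();
        OpenTelemetry openTelemetry = OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                .build();

        // Manual instrumentation through the vendor-neutral API.
        Tracer tracer = openTelemetry.getTracer("com.example.demo");
        Span span = tracer.spanBuilder("process-order").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("order.id", "12345"); // hypothetical business attribute
        } finally {
            span.end();
        }

        tracerProvider.shutdown();
    }
}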

• Implementation of data collection system

A fundamental principle in tracing practice is that the collection of observability data must be orthogonal to the business logic. The Collector is built on this principle of minimizing the impact of the observability client on the existing business logic. OpenTelemetry provides a collection system based on the OpenCensus Service, consisting of the Agent and the Collector. The Collector collects, transforms, and exports observability data: it receives data in a variety of formats (such as OTLP, Jaeger, and Prometheus) and sends it to one or more backends. It also supports processing and filtering the data before it is exported. The Collector Contrib packages support additional data formats and backends.

At the architectural level, the Collector has two modes. In the first, the Collector is deployed on the same host as the application (a Kubernetes DaemonSet) or in the same Pod (a Kubernetes sidecar), and the collected telemetry data is sent directly to the Collector over the loopback network; this is collectively called Agent mode. In the second, the Collector is treated as standalone middleware to which applications send their collected telemetry data; this is called Gateway mode. The two modes can be used independently or combined, as long as the data protocol format at the egress matches the format at the ingress.

• Automated code injection technology

OpenTelemetry has also begun to provide an implementation for automatic code injection, which currently supports automatic injection in various major Java frameworks.

• Cloud native architecture

OpenTelemetry was designed with cloud native features in mind and also provides the Kubernetes Operator for rapid deployment.

Data types supported by OpenTelemetry

• Metrics

Metrics are measurements about a service captured at run time. The moment at which one of these measurements is captured is logically called a Metric event; it includes not only the measurement itself but also the time it was captured and associated metadata. Application and request metrics are important indicators of availability and performance, and custom metrics provide insight into how availability affects the user experience and the business.

OpenTelemetry currently defines three Metric instruments (a rough sketch in Java follows this list):

• Counter: a value summed over time; think of a car's odometer, which only ever goes up.

• Measure: values aggregated over time, representing a value within a defined range.

• Observer: captures a set of current values at a particular point in time, like a fuel gauge in a vehicle.
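
As a rough Java sketch of these three instrument styles using the current OpenTelemetry metrics API, whose instrument names have since evolved (a counter, a histogram as the Measure-like instrument, and an observable gauge as the Observer-like instrument); it assumes a metrics-enabled SDK has been registered globally, and the metric names are made up.

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class MetricsExample {
    public static void main(String[] args) {
        Meter meter = GlobalOpenTelemetry.getMeter("com.example.demo");

        // Counter-style instrument: a monotonically increasing sum (the "odometer").
        LongCounter requests = meter.counterBuilder("http.requests")
                .setDescription("Number of handled requests")
                .build();
        requests.add(1);

        // Measure-style instrument: values aggregated over time (here a histogram of latencies).
        DoubleHistogram latency = meter.histogramBuilder("http.request.duration")
                .setUnit("ms")
                .build();
        latency.record(23.5);

        // Observer-style instrument: samples a current value on each collection (the "fuel gauge").
        meter.gaugeBuilder("jvm.heap.used.ratio")
                .buildWithCallback(measurement -> measurement.record(currentHeapRatio()));
    }

    private static double currentHeapRatio() {
        Runtime rt = Runtime.getRuntime();
        return (double) (rt.totalMemory() - rt.freeMemory()) / rt.maxMemory();
    }
}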

• Logs

Logs are timestamped text records, structured or unstructured, with metadata. Although each log is an independent data source, log records can be attached to the Spans of a Trace, and they also show up when analyzing individual nodes of a call in day-to-day use. In OpenTelemetry, any data that is not part of a distributed Trace or a Metric is a log. Logs are often used to determine the root cause of a problem and typically record who changed what and what the result of the change was.

• Traces

A Trace is the tracking of a single request, which may be initiated by an application or by a user. Distributed tracing is a form of tracing that crosses networks and applications. Each unit of work in a Trace is called a Span, and a Trace is a tree of Spans. A Span represents the work done by a service or component involved in the request, and it also provides request, error, and duration metrics that can be used to debug availability and performance issues. Each Span contains a Span context, a set of globally unique identifiers representing the unique request that the Span belongs to; this is usually referred to as the TraceID.
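
A tiny Java sketch of reading those identifiers from a Span's context is shown below; with the default no-op API the IDs come back as all zeros, so an SDK would need to be wired up (as in the earlier sketch) to get real ones.

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;

public class TraceIdExample {
    public static void main(String[] args) {
        Tracer tracer = GlobalOpenTelemetry.getTracer("com.example.demo");
        Span span = tracer.spanBuilder("checkout").startSpan();
        // Every Span carries a SpanContext; the TraceId is shared by all Spans of the same request.
        System.out.println("traceId = " + span.getSpanContext().getTraceId());
        System.out.println("spanId  = " + span.getSpanContext().getSpanId());
        span.end();
    }
}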

• Baggage

In addition to Trace propagation, OpenTelemetry provides Baggage to propagate key-value pairs. Baggage is used to index observable events in a service with properties provided by earlier services in the same transaction, which helps establish causal relationships between events. While Baggage can be used as a prototype for other cross-cutting concerns, the mechanism is primarily intended to pass values through the OpenTelemetry observability system. These values can be consumed from Baggage and used as additional dimensions for metrics, or as additional context for logs and traces.
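
A minimal Java sketch of setting and reading a Baggage entry is shown below; the key and value are hypothetical business dimensions, and actual cross-process propagation additionally requires a configured propagator in the SDK.

import io.opentelemetry.api.baggage.Baggage;
import io.opentelemetry.context.Scope;

public class BaggageExample {
    public static void main(String[] args) {
        // Attach a key-value pair that travels alongside the trace context.
        Baggage baggage = Baggage.builder()
                .put("user.tier", "premium") // hypothetical business dimension
                .build();

        try (Scope ignored = baggage.makeCurrent()) {
            // Downstream code (or another service, after propagation) can read the value
            // and use it as an extra dimension for metrics or extra context for logs.
            String tier = Baggage.current().getEntryValue("user.tier");
            System.out.println("user.tier = " + tier);
        }
    }
}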

Just a first step, or a one-stop shop?

Putting the above together, we can see that OpenTelemetry covers the specification, the API definition, the specification's implementation, and the collection and transmission of every type of observability data. An application needs only one kind of SDK to generate all data types in a unified way, and a cluster needs to deploy only one OpenTelemetry Collector to collect all of them. In addition, Metrics, Tracing, and Logging share the same metadata and can be seamlessly correlated.

OpenTelemetry solves the first step: unifying observability data. It standardizes the collection and transmission of that data through its API and SDK. OpenTelemetry does not try to rewrite every component; instead it maximizes reuse of the tools that are already common in each area of the industry by providing secure, platform-neutral, vendor-neutral protocols, components, and standards. Its positioning is very clear: unify the collection of data and its standards and specifications, and stay out of how the data is used, stored, displayed, and alerted on. However, as far as a complete observability solution is concerned, OpenTelemetry only completes the unified production of data; it has no clear plan for how to store, analyze, and alert on that data, and these problems are very real:

• How to store the different types of data

Metrics can live in Prometheus, InfluxDB, or various time-series databases. Tracing can be connected to Jaeger, OpenCensus, or Zipkin. But how to select and operate these backend services is a hard problem.

• Data analysis (visualization and correlation)

How can the collected data be analyzed in a unified way? Different data must be processed by different data platforms. To display Metrics, Logging, and Tracing on a single platform and correlate them with one another requires a lot of custom development, which is a heavy burden for operations.

• Anomaly detection and diagnosis

In addition to day-to-day visual monitoring, anomaly detection and root cause diagnosis for applications are important operations requirements, so OpenTelemetry data needs to be integrated into AIOps. But for many development and operations teams, basic DevOps is not yet fully in place, let alone AIOps.

Best practice: accessing the Application Real-Time Monitoring Service (ARMS) through OpenTelemetry

To solve the problems above, Alibaba Cloud provides the Application Real-Time Monitoring Service (ARMS) to help operations teams with data analysis, anomaly detection, and diagnosis. ARMS supports several ways of ingesting OpenTelemetry Trace data: you can report OpenTelemetry Trace data to ARMS directly, or forward it through an OpenTelemetry Collector.

(1) Direct reporting

• Report OpenTelemetry Trace data from a Java application through the ARMS Java Agent. Installing the ARMS Java Agent is recommended: it has built-in instrumentation for a large number of common components and automatically reports Trace data in the OpenTelemetry format, out of the box and without extra configuration. For details, see Monitoring Java Applications [2].

• Combine the ARMS Java Agent with the OpenTelemetry Java SDK to report Trace data. ARMS Java Agent v2.7.1.3 and later support OpenTelemetry Java SDK extensions, so you can use the OpenTelemetry SDK to add custom method instrumentation while the ARMS Java Agent automatically captures Trace data for common components. For details, see OpenTelemetry Java SDK Support [3].

• Report Trace data directly through OpenTelemetry. You can instrument the application with the OpenTelemetry SDK and report Trace data directly through Jaeger; a minimal sketch follows this list. For details, see Reporting Java Application Data through OpenTelemetry [4].
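
A minimal Java sketch of direct reporting is shown below, assuming the OTLP/gRPC exporter is used with the endpoint and Authentication token taken from the ARMS console (the same placeholders that appear in the Collector configuration later in this article); the exact endpoint, port, and protocol for your region, including the Jaeger-based option, are described in the linked documentation.

import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class ArmsDirectReportSketch {
    public static void main(String[] args) {
        // Placeholders: copy the real endpoint and token from the ARMS console.
        String endpoint = "http://<endpoint>:8090";
        String token = "<token>";

        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint(endpoint)
                .addHeader("Authentication", token)
                .build();

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build();

        // Register globally so application code can obtain tracers via GlobalOpenTelemetry.
        OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                .buildAndRegisterGlobal();
    }
}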

(2) Forward through the OpenTelemetry Collector

• Forward Trace data through the ARMS for OpenTelemetry Collector

In a Container Service for Kubernetes (ACK) environment, you can install the ARMS for OpenTelemetry Collector with one click to forward Trace data. The ARMS for OpenTelemetry Collector provides lossless trace statistics (local pre-aggregation, so statistical results are not affected by the sampling rate), dynamic configuration tuning, state management, and an out-of-the-box Trace dashboard in Grafana, while being easier to use, stable, and reliable. The access flow for the ARMS for OpenTelemetry Collector is as follows:

  1. Install the ARMS for OpenTelemetry Collector from the application marketplace in the ACK console.

A. Log in to the Container Service Management Console [5].
B. In the navigation pane, choose Marketplace > App Marketplace.
C. On the App Marketplace page, search for the ack-arms-cmonitor component and click ack-arms-cmonitor.
D. Click the one-click deployment button in the upper-right corner of the ack-arms-cmonitor page.
E. In the creation panel, select the target cluster and click Next. The default namespace is arms-prom.
F. Click OK.
G. In the navigation pane, click Clusters and then click the name of the cluster where the ack-arms-cmonitor component is installed.
H. In the navigation pane, choose Workloads > DaemonSets and select the arms-prom namespace at the top of the page.
I. Click otel-collector-service and check whether the otel-collector-service Service is running properly. If multiple Receiver ports are exposed for receiving OpenTelemetry data, the installation is successful.

  2. Set the Exporter address in the SDK to otel-collector-service:<port>.

• Forward Trace data through the open source OpenTelemetry Collector

To use the open source OpenTelemetry Collector to forward Trace data to ARMS, you only need to modify the endpoint and the authentication token in the Exporter configuration:

exporters:
  otlp:
    endpoint: <endpoint>:8090
    tls:
      insecure: true
    headers:
      Authentication: <token>

Instructions

• Replace <endpoint> with the endpoint for your reporting region, for example tracing-analysis-dc-bj.aliyuncs.com:8090.

• Replace <token> with the token obtained from the console, for example b590lhguqs@3a7*********9b_b590lhguqs@53d*****8301.

(3) OpenTelemetry Trace Usage Guide

To get more value out of OpenTelemetry Trace data, ARMS provides diagnostic capabilities such as trace details, pre-aggregated dashboards, Trace Explorer post-aggregation analysis, and correlation of call chains with service logs.

• Trace details: on the left side of the trace details panel you can view the call order and elapsed time of each interface on the chain; on the right side you can view detailed additional information and associated indicators, such as database SQL, JVM, and host monitoring metrics.

• Pre-aggregated dashboards: ARMS provides multiple pre-aggregated dashboards based on OpenTelemetry Trace data, including the application overview, interface calls, database calls, and more.

• Trace Explorer post-aggregation analysis: for OpenTelemetry Trace data, ARMS provides flexible multi-dimensional filtering and post-aggregation analysis, such as querying abnormal traces for a specific application. You can also aggregate Trace data by IP, interface, and other dimensions.

• Correlating call chains with service logs: ARMS associates OpenTelemetry Traces with service logs so that business faults on application interfaces can be traced from the call chain down to the logs.

Links

[1] Dapper, a Large-Scale Distributed Systems Tracing Infrastructure: static.googleusercontent.com/media/resea…

[2] Monitoring Java Applications: help.aliyun.com/document_de…

[3] OpenTelemetry Java SDK Support: help.aliyun.com/document_de…

[4] Reporting Java Application Data through OpenTelemetry: help.aliyun.com/document_de…

[5] Container Service Management Console: cs.console.aliyun.com/
