Background

With the rise of cloud native concepts and cloud native architecture, traditional large monolithic systems are being split, on top of mature cloud native technology, into many microservice modules with a single responsibility; combined with container technology, this enables faster deployment, iteration, and delivery of applications. At the same time, system complexity rises sharply, and operating and maintaining such systems becomes far more challenging.

Against this background, traditional monitoring can no longer meet the operations requirements of cloud native applications, which is why the concept of observability has been introduced.

Observability

Observability has its roots in control theory, in a concept introduced in the 1960s by the Hungarian-born engineer Rudolf Kálmán: the degree to which a system's internal state can be inferred from its external outputs. In an IT system, this external output is typically the telemetry generated by application services, namely Traces, Metrics, and Logs, known as the three pillars of observability.

Peter Bourgon explored the relationship between observability and the three pillars in his February 2017 blog post “Metrics, Tracing, and Logging”.

Observability in the cloud native context first appeared in Apple engineer Cindy Sridharan's blog post “Monitoring and Observability”, in which she explained the relationship between observability and monitoring in cloud native systems.

At Google, the famous SRE discipline laid the theoretical foundation for observability long ago; before the concepts of microservices and observability even appeared, this body of theory was simply called monitoring. Google SRE placed particular emphasis on white-box monitoring, while relegating the black-box monitoring common in the industry at the time to a relatively minor role. White-box monitoring echoes the proactive stance of observability.

Baron Schwartz has also drawn a clear distinction between monitoring and observability.

Today, observability has become the cloud native approach to operations and maintenance, bringing new thinking and a new perspective.

OpenTelemetry

Currently, mainstream observability systems are built on the Trace, Metric, and Log pillars. The data structures of the three pillars are completely different, and the open source community has produced many excellent projects for these data types, such as Pinpoint, Prometheus, Fluentd, ELK, and Jaeger. However, each of these projects focuses on a specific pillar (data type) to address a specific scenario, and each is independent, with its own UI and usage style. In real cloud native microservice environments, locating a concrete problem often requires joint analysis across all three pillars, and the existing solutions suffer from several problems:

  • Operational burden: the three pillars require at least three open source systems, each maintained independently
  • Data silos: data is isolated and cannot be combined for greater value; each pillar's system has its own UI, so locating a real problem means repeatedly hopping between the three systems
  • Vendor lock-in: each open source project defines its own data format for its pillar and is fully coupled from the collection end to the presentation end, making later replacement and upgrades costly

The inconsistent data standards of Trace, Metric, and Log are the root cause of many of the problems above. OpenTelemetry was born to unify the data formats of the three pillars and is currently a CNCF incubating project.

What is OpenTelemetry

The official definition reads:

OpenTelemetry is a set of APIs, SDKs, tooling and integrations that are designed for the creation and management of telemetry data such as traces, metrics, and logs.

In other words, OpenTelemetry is a collection of standards and tools for managing observability data such as traces, metrics, and logs.

OpenTelemetry is a CNCF observability project that aims to provide a standardized solution for the observability domain: it standardizes the data model as well as the collection, processing, and export of telemetry data, and it does so as a vendor-neutral service for all three pillars.

What does OpenTelemetry solve

OpenTelemetry uses specifications to standardize the data model and the collection, processing, and export of observability data, covering Traces, Metrics, and Logs. For details, see the OpenTelemetry Specification.

To support multiple clients with a unified data format, a Protobuf-based protocol is provided. For details, see opentelemetry-proto.

As a collection of standards and tools, OpenTelemetry builds on the specifications to make working with three-pillar data more convenient:

  • Multi-language SDKs covering 10+ mainstream development languages; see the official documentation for details
  • A configuration-driven Collector service for collecting, processing, and exporting Trace, Metric, and Log data
  • Contrib components for major cloud vendors such as Alibaba Cloud, AWS, and Azure

Based on these standards and toolsets, developers can easily and quickly unify the structure of their observability data and send it to a Collector service for additional processing (e.g. TraceId association, tagging with K8s attributes), and the export stage is compatible with existing open source backends as well as custom storage solutions.
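As a sketch of this kind of processing (the endpoints and pipeline choices below are illustrative, not from the original article; the k8sattributes processor ships with the Collector contrib distribution), a Collector configuration might look like:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  # Enrich telemetry with Kubernetes metadata (pod, namespace, node labels)
  k8sattributes:
  batch:

exporters:
  otlp:
    endpoint: backend.example.com:4317  # placeholder backend address

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlp]
```

Swapping the backend is then a matter of changing the exporter section, which is precisely the vendor neutrality discussed below.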

OpenTelemetry's standardized approach to observability brings major changes to cloud native applications:

  • Unified data protocol: the OpenTelemetry Specification standardizes the Metric, Trace, and Log formats, so the data of the three pillars share the same model, which makes correlation easy and raises the value of the data
  • Unified agent: a single Collector agent handles the collection, processing, and transmission of all observability data, reducing resource consumption and making the system architecture simpler and easier to maintain
  • Cloud native friendly: incubated in CNCF, it supports all kinds of cloud native systems well; more and more cloud vendors have announced support for OpenTelemetry, including but not limited to Alibaba Cloud, AWS, and Azure, so using it in the cloud will only become more convenient
  • Vendor neutral: with unified data standards from collection to export, it is independent of any particular vendor and completely neutral; service providers can be chosen or changed at will
  • Good compatibility: seamless compatibility, in both data collection and data export, with existing mainstream observability systems, including but not limited to Prometheus, OpenTracing, OpenCensus, Fluentd, and Jaeger

What are the prospects for OpenTelemetry

OpenTelemetry is the combination of two open source projects, OpenTracing and OpenCensus, the latter contributed by Google.

In May 2019 the two projects merged and the OpenTelemetry project was officially announced as open source. Soon afterwards the CNCF Technical Oversight Committee (TOC) voted to accept OpenTelemetry as a CNCF incubating project.

In February 2021, the Trace specification reached its 1.0 release. The other specifications are still in progress, as can be seen on the official status page.

As of this writing, progress on unifying the specifications for each signal is as follows:

            API     SDK             Protocol
Tracing     stable  stable          stable
Metrics     stable  feature-freeze  stable
Logging     draft   draft           beta
Baggage     stable  stable          N/A

At the same time, more and more cloud vendors are paying attention and contributing. From the code they contribute, we can see that some cloud providers offer data-collection Receivers, while most focus on data export, which makes it easy to import the standardized data into their own services, such as Alibaba Cloud's SLS.

The project is currently in the CNCF incubation stage and the community is very active. With Google's backing and the participation and contributions of many mainstream cloud vendors, it is expected that the project will graduate from CNCF before long and become the de facto standard for cloud observability solutions.

The limitations of OpenTelemetry

OpenTelemetry standardizes observability data and positions itself as observability infrastructure, solving the collection, transmission, and processing of three-pillar data. However, much of the work after that (data storage, computation, correlation, and presentation) still depends on each vendor's platform for the time being; the open source community has no unified solution yet.

Given the complexity and customization requirements of operations work as the company's business develops rapidly, the Zhengcaiyun operations team set the goal of building a cloud native observability platform to improve efficiency and reduce the cost of R&D and operations.

We use OpenTelemetry as the data foundation of the observability platform, taking full advantage of its unified, standardized data to effectively associate and aggregate the three-pillar data and raise its value. At the same time, we provide a concrete answer to its limitations: the Zhengcaiyun Otel platform.

Zhengcaiyun Otel

Zhengcaiyun Otel is a complete monitoring and observability solution based on OpenTelemetry, mainly composed of three components: data collection, data analysis, and data presentation. Standardized observability data can be associated and aggregated simply and efficiently across the three components, providing richer interface data for the various business scenarios of observability.

Data collection

Thanks to the standardization of observability data and the toolset provided by OpenTelemetry, we can quickly build standardized services for collecting, processing, and exporting observability data.

In the concrete implementation scheme, we adopted the Agent + Collector deployment mode.

  • The Agent runs on each Node as a DaemonSet and is mainly responsible for the Receivers and Processors of data on that Node, including but not limited to the observability data (Traces, Logs, and Metrics) of resources such as applications, hosts, and the cluster.

  • The Collector, as a central data service, is responsible for gathering the data processed by the Agents, applying further processing (such as tagging and grouping), and exporting the result to the corresponding services. The Collector is deployed independently and supports high availability; only the Collector communicates with the Agents.
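A minimal sketch of the Agent side of this deployment, assuming an in-cluster Collector address (the endpoint below is a placeholder, not from the original article):

```yaml
# Agent (DaemonSet) configuration: receive telemetry locally, cap memory,
# batch, and forward everything to the central Collector over OTLP/gRPC.
receivers:
  otlp:
    protocols:
      grpc:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
  batch:

exporters:
  otlp:
    endpoint: otel-collector.observability.svc:4317  # placeholder Collector address
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```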

The data collection architecture and implementation are largely covered by the toolset OpenTelemetry provides and can be deployed easily through configuration. On top of this, we made the following optimizations and modifications:

  • Added a Processor configuration that attaches authentication information to all data, to support multi-tenancy

processors:
  resource:
    attributes:
      - key: zcy.tenement.token
        value: "a1c191dd3b084b09cb8c3c473e58be06"
        action: upsert

  • Filebeat's official Output plugins only cover Elasticsearch, Kafka, Logstash, and the like; OpenTelemetry is not supported yet

Based on the official plugin interface, we implemented an OpenTelemetry output plugin for Filebeat that standardizes the log data and uploads it directly to the Agent

At this point, OpenTelemetry has delivered everything its observability standard currently sets out to solve; analyzing, storing, and displaying the data downstream of the Collector is not a problem it is trying to solve (at least not in the short term). For that, we still have to rely on the respective vendor platforms, as the open source community has no mature, unified solution.

Although the Collector conveniently supports exporting data to Alibaba Cloud SLS, given the company's customized requirements for operations visualization, as well as operating costs, we decided to implement a complete observability solution ourselves to handle data analysis, storage, and presentation.

Data analysis

How to process the huge volume of data uploaded by the Collector is the core of the observability solution. This data serves as the source material for two main services:

  • Business visualization

    Application visualization, including application topology, application performance, application links, and application resources; cluster visualization, covering cluster health, cluster resources, and cluster status; log query and analysis; and more

  • Monitoring and alerting

    Application service monitoring; basic O&M monitoring; exception root cause analysis and locating; and more

These services have strong requirements on timeliness and correlation, so the throughput, accuracy, and flexibility of data processing all matter. Based on these factors, we adopted Flink as the base platform for data processing and analysis, with improvements for our particular needs:

  • Dynamic configuration updates: uses Flink's Broadcast mechanism to fetch and apply configuration changes in near real time
  • Rule engine: Flink task operator rules are user-defined in configuration form, and the JobManager supports dynamic task management (update, start, cancel)
  • SQL API: wraps Flink's top-level SQL API for data computation, so calculation logic is defined in ordinary SQL syntax, which is simple, general, and cheap to learn
  • Cloud native deployment on K8s: the JobManager can dynamically manage resources per task (apply, destroy)
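To illustrate the SQL API idea (the table and column names below are hypothetical, not taken from the actual platform), an aggregation such as per-service latency and error counts over one-minute windows could be expressed in Flink SQL as:

```sql
-- Hypothetical: per-service average latency and error count, 1-minute tumbling windows
SELECT
  service_name,
  TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
  AVG(duration_ms) AS avg_latency_ms,
  SUM(CASE WHEN status_code >= 500 THEN 1 ELSE 0 END) AS error_count
FROM spans
GROUP BY service_name, TUMBLE(event_time, INTERVAL '1' MINUTE)
```

Because the logic is plain SQL, the same statement can be managed as configuration and swapped by the rule engine without redeploying a job.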

Data presentation

After real-time computation and processing by Flink, the data is output and stored via Flink Sinks. Storage is split into three parts:

  • Elasticsearch stores the aggregated data commonly used for front-end visualization and acts as a data cache
  • Cassandra, with its very high write performance and efficient compression strategy, is suited to storing the raw data, which is large in volume, used infrequently, and needs only simple queries
  • Kafka stores alarm events and newly aggregated data; the former serves as the source data for the alert notification service, the latter as the source data for further stream computation

Elasticsearch is a distributed search engine with unique advantages in full-text search, fast queries, complex aggregation, and high availability. We chose Elasticsearch as the foundation of the data presentation layer and built an OpenAPI service, otel-dashboard, on top of it that reads and re-aggregates the Elasticsearch data for use by the business front ends.

Conclusion

As a complete cloud native monitoring and observability program, our solution uses OpenTelemetry as the data infrastructure, drawing on its rich toolset to quickly build data collection services, while taking full advantage of its standardized, unified data model.

All service components have been containerized and come with corresponding Helm Charts; later, we will provide a complete Otel deployment solution and a simple, quick way to get started.

References

  • OpenTelemetry
  • Cracking cloud native observability (Chinese article)


This article is published simultaneously on the WeChat official account of the Zhengcaiyun technology team; you are welcome to follow it.