Overview: Compared with traditional alarms and monitoring, observability lets us see through a complex system in a more "white box" way, helping us better observe its running state and locate and solve problems faster. Take an engine as an example: an alarm only tells you that something is wrong with the engine; dashboards showing speed, temperature and pressure help us roughly narrow down which parts are likely at fault; but pinpointing the actual problem still requires looking at the sensor data of each part.

Source: Alibaba Tech official account

Preface

The concept of observability first appeared in electrical engineering in the 1970s, and the core definition is:

A system is said to be observable if, for any possible evolution of state and control vectors, the current state can be estimated using only the information from outputs.

Compared with traditional alarms and monitoring, observability lets us see through a complex system in a more "white box" way, helping us better observe its running state and locate and solve problems faster. Take an engine as an example: an alarm only tells you that something is wrong with the engine; dashboards showing speed, temperature and pressure help us roughly narrow down which parts are likely at fault; but pinpointing the actual problem still requires looking at the sensor data of each part.

Observability of IT systems

The age of electrification began with the Second Industrial Revolution in the 1870s, marked by the widespread use of electricity and the internal combustion engine. So why did the concept of observability take nearly a hundred years to appear? Did engineers not rely on the output of various sensors to locate and troubleshoot faults before then? Of course they did; the troubleshooting methods had always been there. But as systems and their operating conditions grew more complex, a more systematic, engineered way to support this process was needed, and that is where the concept of observability came from. The core drivers are:

  • Systems are more complex: a car used to need only an engine, a transmission, a body and brakes to run; today any car has at least hundreds of components and subsystems, so fault location has become much harder.
  • Development involves more people: with globalization, the division of labor among companies and suppliers keeps getting finer, which means developing and maintaining a system requires more departments and people working together, and the cost of coordination keeps growing.
  • Operating environments vary: the working conditions of each subsystem change across different operating environments. We need to record the state of the system at every stage effectively, so that we can analyze problems and optimize the product.

After decades of rapid development, IT systems have gone through several rounds of change in development models, system architectures, deployment patterns and infrastructure. These changes brought faster development and deployment, but they also made the overall system more complex, made development depend on more people and departments, and made deployment patterns and runtime environments more dynamic and uncertain. The IT industry has therefore also reached the point where it needs a more systematic, engineered approach to observability.

The observability of IT systems is, in fact, similar to that of electrical engineering: the core is to observe the outputs of each system and application and to judge the overall working state from that data. These outputs are typically classified as Traces, Metrics, and Logs. The characteristics, application scenarios and relationships of these three data types are expanded on in detail later.

Evolution of IT observability

IT observability technology keeps evolving. Broadly speaking, observability-related technology can be applied not only to IT operations and maintenance scenarios, but also to general company-wide scenarios and to industry-specific ones.

  1. IT operations and maintenance (O&M) scenarios: the observed objects extend from the machine room and network all the way to the end user, and the observed scenarios evolve from plain errors and slow requests to the user's actual product experience.
  2. General scenarios: observation is, in essence, a general activity. Beyond O&M, it applies to security, user behavior, operations and growth, transactions, and so on, and applications can be built for these scenarios, such as attack detection, attack tracing, A/B testing, and advertising effectiveness analysis.
  3. Industry-specific scenarios: beyond the general scenarios inside a company, observation scenarios and applications can be derived for specific industries. For example, Alibaba Cloud's City Brain observes information such as road congestion, traffic lights and traffic accidents, and reduces the overall congestion rate by adjusting traffic light timing and planning routes.

How pragmatic observability lands

Returning to observability itself: at this stage we may not be able to build a single observability engine that fits every industry, so we focus on DevOps and general business scenarios within a company. The two core requirements are:

  1. Sufficient data coverage: the engine must cover different data types in different scenarios. Besides logs, monitoring and Traces in the narrow sense, it also needs to include CMDB data, change data, customer information, order/transaction information, network flows, API calls, and so on.
  2. Unified, correlated analysis: the value of data is rarely mined from a single dataset; more often we need to correlate several kinds of data to reach a goal. For example, combining a user information table with access logs lets us analyze behavior across age groups and genders and make targeted recommendations; combining login logs, the CMDB and a rule engine enables security attack detection.

Looking at the overall process, we can divide the observability effort into four components:

  1. Sensors: the prerequisite for data acquisition is having enough sensors to generate data. In the IT field these sensors take the form of SDKs, instrumentation points, external probes, and so on.
  2. Data: once the sensors generate data, we need to be able to collect the various data types and classify them for analysis.
  3. Computing power: the core of observability is covering enough data, and that data will inevitably be massive, so the system needs enough computing power to process and analyze it.
  4. Algorithms: the ultimate application of observability is mining value from data, which requires various algorithms, from basic numerical algorithms to AIOps-related algorithms and combinations of them.

Classification of observability data

  • Logs, Traces, and Metrics, as the three pillars of IT observability data, can be used for monitoring, alerting, analysis, and troubleshooting. In practice, however, we often get confused about which data type fits which scenario, so here is an overview of the characteristics, conversions and applicable scenarios of the three (a minimal sketch in code follows this list).
  • Logs: we use a broad definition of Logs as the carrier that records events and changes, covering not only text such as access logs, transaction logs and kernel logs, but also general data such as GPS readings, audio and video. Once structured, call-chain logs become Traces, and after aggregation and downsampling logs become Metrics.
  • Metrics: aggregated values, relatively compact, generally consisting of name, labels, time, and value. Metrics are usually small in volume, relatively cheap to store, and fast to query.
  • Traces: the most standardized kind of call log. In addition to defining the parent-child relationships of calls (generally via TraceID, SpanID and ParentSpanID), a Trace records details of each operation, such as service, method, attributes, status and duration. Traces can replace part of Logs, and aggregating Traces yields Metrics per service and method.
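To make the distinction concrete, here is a minimal Python sketch (the field names are illustrative, not the schema of any particular product) of the typical shape of the three data types, plus the kind of aggregation/downsampling that turns logs into metrics:

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class LogEntry:                       # an event record: free-form payload plus tags
    timestamp: int                    # unix seconds
    content: str
    tags: dict = field(default_factory=dict)

@dataclass
class Span:                           # one entry of a structured "call log" (Trace)
    trace_id: str
    span_id: str
    parent_span_id: str
    service: str
    method: str
    status: str                       # e.g. "OK" / "ERROR"
    duration_ms: float

@dataclass
class Metric:                         # an aggregated value: name + labels + time + value
    name: str
    labels: dict
    timestamp: int
    value: float

def logs_to_metrics(logs, window_s=60):
    """Downsample access logs into per-app request-count metrics (log -> metric)."""
    buckets = defaultdict(int)
    for log in logs:
        bucket = log.timestamp - log.timestamp % window_s
        buckets[(bucket, log.tags.get("app", "unknown"))] += 1
    return [Metric("request_count", {"app": app}, ts, float(n))
            for (ts, app), n in sorted(buckets.items())]
```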

The "fragmented" observability landscape

For this situation the industry has developed a variety of observability-related products, both open source and commercial, such as:

  1. Metrics: Zabbix, Nagios, Prometheus, InfluxDB, OpenFalcon, OpenCensus
  2. Traces: Jaeger, Zipkin, SkyWalking, OpenTracing, OpenCensus
  3. Logs: ELK, Splunk, SumoLogic, Loki, Loggly

A combination of these projects can more or less solve one or several specific classes of problems, but when you actually put them together all kinds of issues appear:

  • Multiple solutions: you will likely end up running at least three separate solutions for Metrics, Logging and Tracing, which is costly to maintain
  • Data silos: even though the data comes from the same business component and the same system, data produced in different solutions is hard to correlate, so its value cannot be fully exploited

With multiple solutions in play, troubleshooting requires jumping across several systems, and if those systems belong to different teams, several teams have to be involved to solve a problem, so the overall maintenance and usage cost is very high. We therefore want a single system that handles collection, storage and analysis for all types of observability data.

Observability data engine architecture

Based on the thinking above, and returning to the essence of observability, our target observability solution needs to meet the following requirements:

  1. Comprehensive data coverage: support all kinds of observability data, collected from every client, server and system
  2. Unified system: reject fragmentation and support unified storage and analysis of Traces, Metrics, and Logs in one system
  3. Data correlation: each data type can be correlated internally, cross-type correlation is also supported, and a single analysis language can analyze all types of data together
  4. Sufficient computing power: distributed and scalable, with enough computing power to analyze data even at the PB level
  5. Flexible and intelligent algorithms: beyond basic algorithms, AIOps-related anomaly detection and prediction algorithms should be included and composable

The overall architecture of the observability data engine is shown in the figure below; the four layers, from bottom to top, basically follow the guiding principle of sensors + data + computing power + algorithms:

  • Sensors: with OpenTelemetry at the core of the data sources, all kinds of data forms, devices/clients and data formats are supported, so the collection coverage is "wide" enough.
  • Data + computing power: collected data first enters our pipeline system (similar to Kafka), and different indexes are built for different data types. Dozens of petabytes of new data are currently written to and stored on our platform every day. Besides the usual query and analysis capabilities, we have built-in ETL capabilities to clean and format the data, and we also support connecting external streaming and offline computing systems.
  • Algorithms: besides basic numerical algorithms, we currently support more than a dozen anomaly detection/prediction algorithms, including streaming anomaly detection, and we support data orchestration with Scheduled SQL to generate new derived data.
  • Value discovery: value discovery is mainly realized through human-computer interaction such as visualization, alerting and interactive analysis; OpenAPI is also provided to connect external systems or users for customized functions.

Compatibility with data sources and protocols

As Alibaba fully embraces cloud native, we have gradually become compatible with open source and cloud-native observability protocols and solutions. Compared with a closed model built on proprietary protocols, being compatible with open and standard protocols greatly expands the range of data our platform can collect and avoids unnecessary reinventing of the wheel. The figure above shows our overall progress on compatibility with external protocols and agents (a minimal sensor-side example follows the list):

  • Traces: besides internal Traces and Hawkeye Traces, open formats including Jaeger, OpenTracing, Zipkin, SkyWalking, OpenTelemetry and OpenCensus are supported.
  • Logs: there are few formal protocols for logs, but there are many log collection agents. Our platform is compatible with Logtail, Logstash, Beats (FileBeat, AuditBeat), Fluentd and Fluent Bit, and it also supports the syslog protocol, so routers and switches can report data to the server directly via syslog.
  • Metrics: our time-series engine was designed to be Prometheus-compatible from the start, and it also supports data access from Telegraf, OpenFalcon, OpenTelemetry Metrics, Zabbix and others.
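As a concrete example of the "sensor" side, here is a minimal Python sketch of instrumenting an application with the OpenTelemetry SDK and shipping spans over OTLP. The service name and the collector endpoint are placeholders; in practice the exporter would point to whichever OTLP-compatible backend you use.

```python
# pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Describe the service that produces the data.
resource = Resource.create({"service.name": "payment-service"})

provider = TracerProvider(resource=resource)
# "collector:4317" is a placeholder OTLP/gRPC endpoint, not a real address.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def handle_payment(order_id: str) -> None:
    # Each handled request becomes a span carrying service/method/status/duration.
    with tracer.start_as_current_span("handle_payment") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic would go here ...
```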

Unified storage engine

For the storage engine, the first design goal was unification: the ability to store all kinds of observability data in one engine. The second was speed, for both writes and queries, at the scale we see inside and outside Alibaba (dozens of PB written per day).

Among Logs, Traces, and Metrics, Logs and Traces have similar formats and query characteristics:

  • Logs/Traces: queried mainly by keyword or TraceID, with additional filtering on certain tags such as hostname, region and app. The number of hits per query is relatively small (especially for TraceID queries), and the matching data tends to be discrete. This is the kind of data best served by a search engine, whose core is the inverted index.
  • Metrics: mostly range queries, each time querying a single metric/timeline or aggregating a group of timelines, for example the average CPU of all machines in an application over a time range. QPS tends to be high (mainly because of the many alarm rules), so good data aggregation is essential. This kind of data is usually served by a dedicated time-series engine; mainstream time-series engines are basically built on ideas similar to an LSM tree to handle high-throughput writes and queries (updates and deletes are rare). A toy sketch of the two access patterns follows this list.
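The following toy Python sketch (illustrative only, not the platform's implementation) contrasts the two read paths: an inverted index answering keyword/TraceID lookups over discrete hits, and a sorted timeline answering range-aggregation queries.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# --- Logs/Traces: inverted index, keyword or TraceID -> matching document ids ---
class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)      # term -> set of doc ids

    def add(self, doc_id, terms):
        for term in terms:
            self.postings[term].add(doc_id)

    def search(self, *terms):                 # AND query over keywords / TraceIDs
        sets = [self.postings[t] for t in terms]
        return set.intersection(*sets) if sets else set()

# --- Metrics: one timeline of (timestamp, value), queried by range + aggregation ---
class Timeline:
    def __init__(self):
        self.points = []                      # kept sorted by timestamp

    def append(self, ts, value):
        self.points.append((ts, value))       # assume in-order writes

    def avg(self, start_ts, end_ts):
        lo = bisect_left(self.points, (start_ts, float("-inf")))
        hi = bisect_right(self.points, (end_ts, float("inf")))
        window = [v for _, v in self.points[lo:hi]]
        return sum(window) / len(window) if window else None
```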

At the same time, observability data shares some common characteristics: high-throughput writes (high traffic and QPS, with bursts), large-scale queries, and time-based access patterns (hot/cold data, access locality, and so on).

Based on the analysis of these characteristics, we designed a unified observability data storage engine with the following overall architecture:

  1. The access layer supports writes in all kinds of protocols. Written data first enters a FIFO pipeline, similar to Kafka's MQ model, and the pipeline also supports data consumption so that all kinds of downstream systems can be connected (a toy sketch of this write path follows the list)
  2. On top of the pipeline sit two index structures, an inverted index and a SortedTable, which provide fast lookup for Traces/Logs and Metrics respectively
  3. Apart from the index structures themselves, the two indexes share the other mechanisms, such as the storage engine, failover logic, cache policy, and hot/cold data tiering
  4. All of this runs in the same process, which greatly reduces operation, maintenance and deployment costs
  5. The entire storage engine is built on a fully distributed framework and scales horizontally; a single store can take writes of up to PBs per day
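Below is a minimal sketch, with invented class names, of the write path just described: writes land in a FIFO pipe first (so downstream consumers can subscribe with their own offsets), then fan out to the two kinds of indexes.

```python
class Pipe:
    """A toy FIFO channel in the spirit of an MQ: producers append records,
    and each consumer tracks its own read offset."""
    def __init__(self):
        self.records = []
        self.offsets = {}                       # consumer name -> next offset

    def write(self, record):
        self.records.append(record)

    def consume(self, consumer, batch=100):
        start = self.offsets.get(consumer, 0)
        batch_records = self.records[start:start + batch]
        self.offsets[consumer] = start + len(batch_records)
        return batch_records

class UnifiedStore:
    """Single write entry point; an indexer consumes the pipe and fans records out
    to an inverted index (Traces/Logs) and sorted timelines (Metrics)."""
    def __init__(self, inverted_index, timelines):
        self.pipe = Pipe()
        self.inverted_index = inverted_index     # e.g. the InvertedIndex sketched above
        self.timelines = timelines               # e.g. dict of metric name -> Timeline

    def write(self, record):
        self.pipe.write(record)

    def build_indexes(self):
        for rec in self.pipe.consume("indexer"):
            if rec["type"] in ("log", "trace"):
                self.inverted_index.add(rec["id"], rec["terms"])
            elif rec["type"] == "metric":
                self.timelines[rec["name"]].append(rec["ts"], rec["value"])
```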

Unified analysis engine

If the storage engine is the fresh ingredient, the analysis engine is the tool for preparing it. Different ingredients call for different knives to get the best result: a slicing knife for vegetables, a cleaver for chops, a paring knife for fruit. Likewise, different types of observability data and scenarios call for different analysis methods:

  1. Metrics: usually used for alerting and graphing; values are obtained directly or with simple calculations, for example with PromQL or TSQL
  2. Traces/Logs: the simplest and most direct method is keyword query, of which TraceID query is just a special case
  3. Data analysis (mainly for Traces and Logs): Traces and Logs are also used for data analysis and mining, which calls for a Turing-complete language, and SQL is the one most widely used by programmers

Each of these analysis methods has its own applicable scenarios, and implementing all of them in one syntax/language is difficult and inconvenient (although keyword query and PromQL capabilities can be approximated by extending SQL, a simple PromQL operator may require a long string of SQL to express). So our analysis engine is compatible with both the keyword query syntax and PromQL, and on top of SQL we implemented the ability to connect keyword query, PromQL, external DBs and ML models, making SQL the top-level analysis language and enabling fusion analysis across observability data.

Here are a few examples of our query/analysis applications. The first three are relatively simple and can be handled with pure keyword queries, PromQL, or their combination with SQL; the final one shows fusion analysis in a real-world scenario:

  • Background: a payment failure error was found online, and the CPU metrics of the machines that reported the error need to be analyzed
  • Query the CPU metrics of those machines, join in the Region information of the machines (to check whether a whole region is faulty) and the machines that reported payment failures in the logs, then apply a time-series anomaly detection algorithm to quickly analyze the CPU metrics of these machines, and finally visualize the result as a line chart for a more intuitive display

The example above queries a LogStore and a MetricStore at the same time, and also joins against the CMDB and an ML model; a single statement achieves a very complex analysis, which happens often in real scenarios, especially when analyzing complex applications and anomalies. A rough sketch of this style of fusion analysis follows.
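On the platform this is expressed as a single SQL statement; as a hedged illustration of the same fusion logic (not the actual query), here is the equivalent flow in Python with pandas, with a crude 3-sigma check standing in for the time-series anomaly detection step. All table contents and column names are invented.

```python
import pandas as pd

# Hypothetical stand-ins for the LogStore, MetricStore and CMDB tables.
error_logs = pd.DataFrame({"host": ["h1", "h2"],
                           "message": ["payment failed", "payment failed"]})
cpu_metrics = pd.DataFrame({"host": ["h1", "h1", "h2", "h2"],
                            "ts":   [0, 60, 0, 60],
                            "cpu":  [35.0, 97.0, 40.0, 42.0]})
cmdb = pd.DataFrame({"host": ["h1", "h2"],
                     "region": ["cn-hangzhou", "cn-shanghai"]})

# 1. Hosts that logged the payment failure (the keyword-query part).
failed_hosts = error_logs.loc[error_logs["message"].str.contains("payment failed"), "host"]

# 2. Join their CPU metrics with the region info from the CMDB.
joined = (cpu_metrics[cpu_metrics["host"].isin(failed_hosts)]
          .merge(cmdb, on="host"))

# 3. A crude per-host 3-sigma check, standing in for the real anomaly-detection model.
stats = joined.groupby("host")["cpu"].agg(["mean", "std"]).reset_index()
joined = joined.merge(stats, on="host")
joined["anomalous"] = (joined["cpu"] - joined["mean"]).abs() > 3 * joined["std"].fillna(0)

print(joined[["host", "region", "ts", "cpu", "anomalous"]])
```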

Data orchestration

Compared with traditional monitoring data, observability data is richer and more worth mining, since the running state of a system can only be inferred from its outputs. The work therefore looks a lot like data mining: collect all kinds of complex data, format it, preprocess it, analyze it, inspect it, and finally tell a "story" from the conclusions. In building the observability engine we therefore put a lot of emphasis on data orchestration, which keeps the data flowing, distills higher-value data out of the vast raw logs, and ultimately tells us whether the system is working and, if not, why. To make the data "flow", we built several features:

  1. Data transformation: the T in big data ETL (Extract, Transform, Load). It helps us turn unstructured and semi-structured data into structured data, which is much easier to analyze.
  2. Scheduled SQL: SQL that is scheduled to run periodically. The core idea is to compact massive data so it is easier to query, for example periodically computing per-minute website request counts from access logs, aggregating CPU and memory metrics at the app and region granularity, or periodically computing the Trace topology (a toy sketch follows this list).
  3. AIOps inspection: an inspection capability, built on time-series anomaly detection algorithms, that uses machines and computing power to tell us which metrics and which dimensions have problems.
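As a hedged illustration of the Scheduled SQL idea (the product feature itself is SQL-based; this Python sketch only mimics the scheduling and rollup semantics, and `fetch_logs`/`write_metrics` are placeholder store clients), a periodic job compacts each closed window of access logs into per-minute request counts:

```python
import time
from collections import Counter

def rollup_window(fetch_logs, write_metrics, window_start, window_s=60):
    """Aggregate one closed time window of access logs into per-minute request counts.
    fetch_logs / write_metrics are placeholders for real store clients."""
    logs = fetch_logs(window_start, window_start + window_s)
    counts = Counter(log["ts"] - log["ts"] % 60 for log in logs)
    write_metrics([{"name": "requests_per_minute", "ts": ts, "value": n}
                   for ts, n in sorted(counts.items())])

def run_scheduled(fetch_logs, write_metrics, window_s=60):
    """Naive scheduler loop: process the previous window, then wait for the next to close."""
    window_start = int(time.time()) // window_s * window_s - window_s
    while True:
        rollup_window(fetch_logs, write_metrics, window_start, window_s)
        window_start += window_s
        time.sleep(max(0, window_start + window_s - time.time()))
```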

Observability engine application practice

At present the platform serves more than 100,000 internal and external users and ingests 40+ PB of data every day. Many teams build their own company- or department-level observability platforms on top of our engine for full-stack observation and business innovation. Here are some common usage scenarios:

1. Full-link observability

Full-link observability has always been an important part of DevOps. In addition to the usual monitoring, alerting and troubleshooting, it also covers user behavior replay/analysis, release verification, A/B testing and other functions. The figure below shows the full-link observability architecture of one product inside Alibaba:

  1. Data sources include the mobile client, the Web client and the backend, as well as monitoring system data and third-party data
  2. Collection is done with SLS Logtail and TLog
  3. Data is preprocessed (tagging, filtering, correlation, routing) with a mix of real-time and offline processing
  4. All data is stored in the SLS observability data engine, using the indexing, query and aggregate analysis capabilities SLS provides
  5. On top of the SLS interfaces, the team builds its full-link data display and monitoring system

2. Cost observability

The top priority of any business is revenue and profit, and profit = revenue - cost. IT costs usually account for a large share of cost, especially for Internet companies. Now that Alibaba has moved fully to the cloud, teams inside Alibaba also care about their IT spending and try to reduce it as much as possible. The following example is the monitoring system architecture of one of our Alibaba Cloud customers; besides monitoring IT infrastructure and the business, the system is also responsible for analyzing and optimizing the IT cost of the whole company. The main data collected are:

  1. Charges for each cloud product (virtual machines, network, storage, databases, SaaS, etc.), including detailed billing information
  2. Monitoring information for each product, including usage and utilization
  3. A Catalog/CMDB, including the business line, team, purpose and so on of each resource/instance

With the Catalog and the product billing information, the IT spend of each department can be calculated; combined with the usage and utilization of each instance (for example the CPU and memory utilization of each ECS instance), the IT resource efficiency of each department can be computed. Finally, a reasonableness score for each department/team's use of IT resources is derived and summarized in an operations report, which is used to push teams with low resource efficiency to optimize. A minimal sketch of this kind of calculation follows.
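This is a minimal pandas sketch of the joins described above; the field names, sample records and the 0.3 threshold are all invented for illustration, not the customer's actual schema or policy.

```python
import pandas as pd

# Invented example records: per-instance billing, utilization, and Catalog/CMDB ownership.
billing = pd.DataFrame({"instance": ["ecs-1", "ecs-2", "rds-1"],
                        "cost":     [120.0, 300.0, 450.0]})
usage = pd.DataFrame({"instance": ["ecs-1", "ecs-2", "rds-1"],
                      "cpu_util": [0.65, 0.08, 0.40]})       # average utilization
catalog = pd.DataFrame({"instance": ["ecs-1", "ecs-2", "rds-1"],
                        "team":     ["checkout", "checkout", "search"]})

merged = (billing.merge(usage, on="instance")
                 .merge(catalog, on="instance")
                 .assign(weighted=lambda df: df["cost"] * df["cpu_util"]))

# Per-team spend plus cost-weighted utilization as a rough "reasonableness" score.
report = (merged.groupby("team")
                .agg(total_cost=("cost", "sum"), weighted=("weighted", "sum"))
                .reset_index())
report["weighted_util"] = report["weighted"] / report["total_cost"]
report["low_efficiency"] = report["weighted_util"] < 0.3      # arbitrary threshold
print(report[["team", "total_cost", "weighted_util", "low_efficiency"]])
```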

3. Trace observability

With cloud native and microservices landing in more and more industries, distributed tracing (Trace) is being adopted by more and more companies. The most basic capability of Trace is to record how a request propagates across multiple services, to capture the dependencies involved, and to visualize them. And since Trace data is itself a regular, standardized access log with dependency information, much more value can be computed and mined from it.

The figure below shows the implementation architecture of SLS OpenTelemetry Trace. The core idea is to compute aggregated data from the original Trace data through data orchestration, and to build the various additional Trace features on top of the interfaces SLS provides, for example (a minimal sketch of two of these derivations follows the list):

  1. Dependency: a feature provided by most Trace systems; the Trace dependency graph is computed by aggregating the parent-child relationships in Traces
  2. Service/interface golden metrics: Traces record the latency and status code of each service/interface call, so golden metrics such as QPS, latency and error rate can be computed from them
  3. Upstream/downstream analysis: based on the computed Dependency data, aggregate by a given service to get the metrics of the upstream and downstream services it depends on
  4. Middleware analysis: calls to middleware (databases, MQ, etc.) in a Trace are generally recorded as Spans, and statistics over these Spans yield the QPS, latency and error rate of the middleware
  5. Alerting: monitoring and alerting can be configured on the golden metrics of a service/interface, or on the whole service entry point (generally, a Span whose parent Span is empty is treated as the entry call)
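Here is a minimal Python sketch of two of these derivations, using an illustrative span data model (not the platform's actual schema): building service dependency edges from parent/child spans, and computing golden metrics per service/method from span status and latency.

```python
from collections import Counter, defaultdict

spans = [  # illustrative spans: trace_id, span_id, parent, service, method, status, duration
    {"trace_id": "t1", "span_id": "a", "parent": None, "service": "frontend",
     "method": "/checkout", "status": "OK", "duration_ms": 120.0},
    {"trace_id": "t1", "span_id": "b", "parent": "a", "service": "payment",
     "method": "Pay", "status": "ERROR", "duration_ms": 95.0},
]

def dependencies(spans):
    """Service dependency edges derived from parent-child span relationships."""
    by_id = {(s["trace_id"], s["span_id"]): s for s in spans}
    edges = Counter()
    for s in spans:
        parent = by_id.get((s["trace_id"], s["parent"]))
        if parent and parent["service"] != s["service"]:
            edges[(parent["service"], s["service"])] += 1
    return edges

def golden_metrics(spans):
    """Per service/method call count, error rate and average latency."""
    agg = defaultdict(lambda: {"calls": 0, "errors": 0, "latency_sum": 0.0})
    for s in spans:
        m = agg[(s["service"], s["method"])]
        m["calls"] += 1
        m["errors"] += s["status"] != "OK"
        m["latency_sum"] += s["duration_ms"]
    return {k: {"calls": v["calls"],
                "error_rate": v["errors"] / v["calls"],
                "avg_latency_ms": v["latency_sum"] / v["calls"]}
            for k, v in agg.items()}

print(dependencies(spans))
print(golden_metrics(spans))
```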

4. Root cause analysis based on orchestration

In the early stages of observability, a lot of the work has to be done manually. What we want most is an automated system that, based on the observed data, diagnoses anomalies, produces a reliable root cause, and fixes the anomaly automatically according to that diagnosis. Fully automatic recovery is still hard to achieve today, but root cause location can be implemented with appropriate algorithms and orchestration.

The figure below shows an observability abstraction of a typical IT system architecture. Each app has its own golden metrics, business access logs/error logs, basic monitoring metrics, metrics for the middleware it calls, and the associated middleware metrics/logs; meanwhile, the upstream/downstream dependencies between apps/services can be obtained from Traces. Combining this data with algorithms and orchestration allows a degree of automated root cause analysis. The core dependencies are:

  1. Correlation: dependencies between apps/services can be computed from Traces, and the dependencies between app, PaaS and IaaS can be obtained from CMDB information. By making these connections we can "follow the thread" to the cause of a problem.
  2. Time-series anomaly detection algorithms: automatically detect whether a curve or a group of curves is anomalous, with algorithms such as ARMA, KSigma and Time2Graph (see the anomaly detection and streaming anomaly detection documentation for details).
  3. Log clustering: aggregate logs with high similarity to extract common log patterns and quickly grasp the overall picture of the logs; the pattern comparison function compares the patterns of normal and abnormal periods to quickly find the anomalies in the logs.

Time-series and log anomaly analysis tell us whether a given component has a problem, while correlation lets us "follow the thread". Combining these three core capabilities, an anomaly root cause analysis system can be orchestrated. A simple example: starting from the alarm, analyze the golden metrics of the entry service, then analyze the service's own data, the metrics of the middleware it depends on, and the metrics of the application's Pods/virtual machines; with Trace Dependency, recursively analyze whether downstream dependencies have problems; change information can also be correlated to quickly locate anomalies caused by changes. The anomalous events found in the end can be laid out on a timeline for inference, and the root cause can then be confirmed by operations/development. A rough sketch of this traversal follows.
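Putting the pieces together, here is a rough Python sketch of the "follow the thread" traversal, with invented helper names and a simple k-sigma check standing in for the real detection algorithms: start from the alarmed entry service, check its own indicator, then recursively walk the Trace dependency graph and prefer the deepest anomalous components as the suspected root cause.

```python
from statistics import mean, pstdev

def k_sigma_anomalous(history, latest, k=3.0):
    """Flag `latest` if it deviates from the historical mean by more than k std-devs."""
    if len(history) < 2:
        return False
    mu, sigma = mean(history), pstdev(history)
    return sigma > 0 and abs(latest - mu) > k * sigma

def find_root_causes(service, deps, metrics, visited=None):
    """Recursively walk downstream dependencies; return the deepest anomalous services.
    `deps` maps service -> downstream services (from Trace Dependency),
    `metrics` maps service -> (history, latest) for its golden indicator."""
    visited = visited if visited is not None else set()
    if service in visited:
        return []
    visited.add(service)

    downstream_hits = []
    for dep in deps.get(service, []):
        downstream_hits += find_root_causes(dep, deps, metrics, visited)

    history, latest = metrics[service]
    if k_sigma_anomalous(history, latest):
        # Prefer blaming anomalous downstream services over this one.
        return downstream_hits or [service]
    return downstream_hits

# Invented example: every service looks anomalous, but mysql is the deepest hit.
deps = {"frontend": ["payment"], "payment": ["mysql"], "mysql": []}
metrics = {"frontend": ([100, 102, 98, 101], 180),
           "payment":  ([80, 82, 79, 81], 160),
           "mysql":    ([5, 6, 5, 5], 45)}
print(find_root_causes("frontend", deps, metrics))   # -> ['mysql']
```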

Closing thoughts

The concept of observability is not some freshly invented "black technology"; it evolved step by step from monitoring, troubleshooting, prevention and so on. Likewise, we started out as a log engine (the log service on Alibaba Cloud) and gradually improved and upgraded it into an observability engine. For "observability", we need to look past the concept/term itself to its essence, which is usually tied to the business, for example:

  1. Making the system more stable and the user experience better
  2. Observing IT spending to eliminate unreasonable usage and save costs
  3. Observing transactions to detect order brushing/cheating, and even stopping it in time
  4. Using AIOps and other automated means to find problems, saving manpower and improving O&M efficiency

Our main direction for evolving the observability engine is how to help more departments and companies implement observability solutions quickly and effectively. The sensors, data, computing power and algorithms in the engine keep evolving and iterating, for example more convenient eBPF collection, compression algorithms with higher compression ratios, higher-performance parallel computing, and more accurate root cause analysis algorithms.


This article is original content from Alibaba Cloud and may not be reproduced without permission.