Alibaba Cloud Kubernetes observability is a one-stop observability product built for Kubernetes clusters. Based on the metrics, application traces, logs, and events of a Kubernetes cluster, it aims to provide an overall observability solution for IT development and operations personnel.

Authors: Li Huangdong, Yan Xun

Abstract

Alibaba Cloud has launched a one-stop observability system for Kubernetes, aiming to solve the problems of high architectural complexity in Kubernetes environments and the operational burden of many coexisting languages and protocols. The data collector is built on eBPF, a technology currently at the height of its popularity. The product collects application golden signals without any code instrumentation and builds a global topology, greatly reducing the difficulty of operating Kubernetes for public cloud users.

Preface

Background and Issues

Cloud native technology today is centered on containers and on standardization around the Kubernetes ecosystem. Standard, extensible scheduling, networking, storage, and container runtime interfaces provide the infrastructure, while standard, extensible declarative resources and controllers provide the operational capabilities; these two layers of standardization drive an ever finer division of labor. Against this background, a large number of companies use cloud native technology to develop and operate their applications. Because cloud native technology opens up more possibilities, today's business applications are characterized by numerous microservices, multiple development languages, and multiple communication protocols. At the same time, although cloud native technology reduces complexity at the business layer, it brings more challenges to observability:

1. Chaotic microservice architecture

Because of the division of labor, business architectures tend to end up with a large number of services and complex relationships among them (see Figure 1).

Figure 1. Chaotic microservice architecture (image source at the end of the article)

This brings a number of problems:

  • It is hard to answer what the current running architecture looks like;
  • It is hard to determine whether the downstream services that a given service depends on are healthy;
  • It is hard to determine whether the traffic coming from a given service's upstream callers is normal;
  • It is hard to tell whether an application's DNS requests are being resolved correctly;
  • It is hard to tell whether the connectivity between applications is working;
  • …

2. Multi-language applications

Within a business architecture, different applications are written in different languages (as shown in Figure 2). Traditional observability approaches require a different instrumentation method for each language.

Figure 2. Multi-language applications (image source at the end of the article)

This brings a number of problems:

  • Different languages require different instrumentation methods, and some languages have no off-the-shelf instrumentation at all;
  • The performance impact of instrumentation on the application is hard to evaluate.

3. Multiple communication protocols

Within a business architecture, different services also communicate over different protocols (as shown in Figure 3). Traditional observability approaches usually instrument specific communication interfaces at the application layer.

Figure 3. Multiple communication protocols

This brings a number of problems:

  • Different communication protocols, because they use different clients, require different instrumentation methods, and some protocols have no off-the-shelf instrumentation at all;
  • The performance impact of instrumentation on the application is hard to evaluate.

4. End-to-end complexity introduced by Kubernetes

Complexity is constant; we can only find ways to manage it, not eliminate it. Cloud native technology reduces the complexity of business applications, but it only pushes that complexity down to the container and virtualization layers of the software stack; it does not eliminate it (see Figure 4).

Figure 4. End-to-end software stack

This brings a number of problems:

  • The expected number of Deployment replicas does not match the number actually running;
  • A Service has no backends and cannot handle traffic;
  • A Pod cannot be created or scheduled;
  • A Pod cannot reach the Ready state;
  • A Node is in the Unknown state;
  • …

Solution and technical approach

To solve the above problems, we need a technology that supports multiple languages and multiple communication protocols, and a product that covers the observability requirements of the end-to-end software stack as much as possible. After research, we proposed an observability approach that is anchored on the container interface and the underlying operating system and associates upward with application performance observation (as shown in Figure 5).

Data collection

Figure 5. End-to-end observability solution

We take the container as the core: we collect the associated Kubernetes observability data, and at the same time we collect, downward, the system and network observability data of the processes related to the container and, upward, the performance data of the applications related to the container, then link all of this together through their associations to achieve end-to-end coverage of observability data.

Data transmission

Our data types include metrics, logs, and traces, and we use the OpenTelemetry Collector (Figure 6) for unified data transmission.

Figure 6. OpenTelemetry Collector
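As a rough illustration of how an application or collection agent could hand trace data to an OpenTelemetry Collector over OTLP, here is a minimal Go sketch using the OpenTelemetry Go SDK; the collector address `otel-collector:4317`, the span name, and the tracer name are placeholders, not the product's actual pipeline.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Export spans over OTLP/gRPC to an OpenTelemetry Collector,
	// which then fans the data out to the storage backends.
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector:4317"), // placeholder address
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatalf("create OTLP exporter: %v", err)
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer func() { _ = tp.Shutdown(ctx) }()
	otel.SetTracerProvider(tp)

	// Emit one example span.
	tracer := otel.Tracer("example")
	_, span := tracer.Start(ctx, "demo-operation")
	span.End()
}
```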

Data storage

Metrics are stored in ARMS Prometheus, and logs and traces are stored in XTRACE, both backed by existing ARMS infrastructure.
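The product's collectors ship data to these backends internally; purely as a generic illustration of how a Prometheus-compatible backend such as ARMS Prometheus can ingest metrics, the sketch below exposes a custom gauge in the standard exposition format using the prometheus/client_golang library. The metric name and port are made up.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// A hypothetical gauge; anything exposed in this format can be
	// scraped by a Prometheus-compatible backend.
	inflight := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "demo_inflight_requests",
		Help: "Number of requests currently being handled.",
	})
	prometheus.MustRegister(inflight)
	inflight.Set(3)

	// Serve the standard /metrics endpoint for scraping.
	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":2112", nil)
}
```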

Core functions of the product

The core scenarios include architecture awareness, analysis of errors and slow requests, resource consumption analysis, DNS resolution performance analysis, external call performance analysis, service connectivity analysis, and network traffic analysis. The product design follows the principle of going from the whole to the individual: start from the global view, find an abnormal individual such as a Service, and after locating it, view that Service's golden signals, associated information, and traces for further correlation analysis.

Figure 7. Core business scenarios

Golden signals that never go out of style

**What are golden signals?** They are the minimum set of signals describing the performance and status of a system: latency, traffic, errors, and saturation. The following is a quote from the SRE bible, Site Reliability Engineering:

The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.

Why are golden signals important? First, they directly and clearly express whether the system is serving its clients normally. Second, because they are customer-oriented, they make it possible to assess the impact on users and the severity of a situation, which saves SREs and developers a great deal of time. Imagine using CPU utilization as the golden signal: SREs and developers would wear themselves out chasing it, because high CPU utilization may have little real impact, especially in a Kubernetes environment that is running smoothly. So Kubernetes observability supports these golden signals (see the sketch after Figure 8):

  • Request count / QPS
  • Response time and its quantiles (P50, P90, P95, P99)
  • Error count
  • Slow call count

Figure 8. Golden signals
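As a rough illustration of how these four golden signals can be derived from raw request records, here is a minimal, self-contained Go sketch. The Request type, the 500 ms slow-call threshold, and the fixed 10-second window are assumptions for the example, not the product's actual definitions.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Request is a simplified, hypothetical record of one observed call.
type Request struct {
	Latency time.Duration
	Err     bool
}

// quantile returns the q-th quantile (0..1) of sorted latencies using
// nearest-rank; real systems usually use histogram buckets instead.
func quantile(sorted []time.Duration, q float64) time.Duration {
	if len(sorted) == 0 {
		return 0
	}
	idx := int(q*float64(len(sorted)-1) + 0.5)
	return sorted[idx]
}

func main() {
	window := 10 * time.Second
	slowThreshold := 500 * time.Millisecond

	reqs := []Request{
		{Latency: 20 * time.Millisecond},
		{Latency: 35 * time.Millisecond},
		{Latency: 800 * time.Millisecond},           // slow call
		{Latency: 50 * time.Millisecond, Err: true}, // error
	}

	latencies := make([]time.Duration, 0, len(reqs))
	var errs, slow int
	for _, r := range reqs {
		latencies = append(latencies, r.Latency)
		if r.Err {
			errs++
		}
		if r.Latency > slowThreshold {
			slow++
		}
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })

	// The four golden signals for this window: traffic, errors,
	// slow calls, and latency quantiles.
	qps := float64(len(reqs)) / window.Seconds()
	fmt.Printf("QPS=%.2f errors=%d slow=%d\n", qps, errs, slow)
	for _, q := range []float64{0.50, 0.90, 0.95, 0.99} {
		fmt.Printf("P%d=%v\n", int(q*100), quantile(latencies, q))
	}
}
```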

The following scenarios are supported:

1. Performance analysis

2. Slow call analysis

A global view of the application topology

He who does not plan for the whole cannot plan for a part. – Zhuge Liang

As technical and deployment architectures become more and more complex, locating a problem after it occurs becomes harder and harder, driving MTTR higher and higher. Another consequence is that analyzing the blast radius of a problem becomes very challenging. It is therefore essential to have a big picture, much like a map. The global topology has the following characteristics:

  • **System architecture awareness:** The system architecture diagram is an important reference for programmers trying to understand a new system. When we take over a system, we at least need to know where the traffic enters, what the core modules are, and which internal and external components it depends on. During troubleshooting, having a global architecture map greatly accelerates locating the problem. The following topology of a simple e-commerce application shows its architecture:

Figure 9. Architecture Awareness

  • **Dependency analysis:** When a problem lies in a downstream dependency, and that dependency is not maintained by your own team, things get more troublesome, and even more so when neither your system nor the downstream system has sufficient observability; in that case it is difficult to explain the problem to the dependency's maintainer. In our topology, upstream and downstream are connected by call relationships to form a call graph. Each edge visualizes a dependency, and on it you can see the golden signals of the corresponding calls, so you can quickly judge whether a downstream dependency has a problem. The following figure shows an example of locating high overall RT caused by a microservice making slow calls to a lower-level service, going from the entry gateway, to the internal service, to the MySQL service, and finally to the slow SQL statements:

Figure 10. Dependency analysis

  • **High availability analysis:** The topology makes it easy to see how systems interact and which of them sit on the core path or are heavily depended upon, such as CoreDNS, through which almost every component performs DNS resolution, making it a potential bottleneck. By checking CoreDNS's golden signals we can anticipate whether applications will become unhealthy or whether its capacity is insufficient (a sketch after this list shows how heavily depended-on nodes can be ranked from the call graph).

Figure 11. High availability analysis

  • **Non-intrusive:** Unlike Ant Group's Linkd and Alibaba Group's EagleEye, our solution is completely non-intrusive. Sometimes observability is missing not because it cannot be done, but because it requires the application to change its code. An SRE's intentions for better observability are good, but it is clearly not appropriate to ask every application owner in the company to change code along with you. This is where the power of a non-intrusive approach comes in: the application does not need to change its code or restart, so the cost of onboarding is very low.
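As a toy illustration of the high availability analysis above, the following Go sketch ranks services in a call graph by how many distinct callers depend on them, which is one simple way to surface critical nodes such as CoreDNS. The edge list and service names are hypothetical, not data from the product.

```go
package main

import (
	"fmt"
	"sort"
)

// Edge is one call relationship in the topology: caller -> callee.
type Edge struct{ From, To string }

// mostDependedOn ranks services by the number of distinct callers
// (in-degree), a rough proxy for how heavily depended upon they are.
func mostDependedOn(edges []Edge) []string {
	callers := map[string]map[string]bool{}
	for _, e := range edges {
		if callers[e.To] == nil {
			callers[e.To] = map[string]bool{}
		}
		callers[e.To][e.From] = true
	}
	services := make([]string, 0, len(callers))
	for s := range callers {
		services = append(services, s)
	}
	sort.Slice(services, func(i, j int) bool {
		return len(callers[services[i]]) > len(callers[services[j]])
	})
	return services
}

func main() {
	// Hypothetical call graph of a small e-commerce application.
	edges := []Edge{
		{"gateway", "cart"}, {"gateway", "product"},
		{"cart", "mysql"}, {"product", "mysql"},
		{"gateway", "coredns"}, {"cart", "coredns"}, {"product", "coredns"},
	}
	for _, s := range mostDependedOn(edges) {
		fmt.Println(s)
	}
}
```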

Traces facilitate root cause localization

A protocol trace differs from distributed tracing in that it covers only a single call. Protocol tracing is likewise non-intrusive and language-agnostic. If a request carries a distributed-tracing TraceID, the system recognizes it automatically, making it easy to drill down further into the full trace. The request and response details of the application-layer protocol help analyze the request content and return code, so you can tell which interface is at fault.

Figure 12. Protocol details
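To make the TraceID linkage concrete, here is a hedged Go sketch that parses a captured HTTP request and extracts a W3C traceparent trace ID so a protocol-level record could be linked to a distributed trace. The header formats actually recognized by the product are not specified in this article; traceparent is just one illustrative convention, and the sample request is made up.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// extractTraceID parses raw HTTP request bytes (for example, reassembled
// from captured network data) and, if a W3C traceparent header is present,
// returns the 32-hex-character trace ID.
func extractTraceID(raw string) (string, bool) {
	req, err := http.ReadRequest(bufio.NewReader(strings.NewReader(raw)))
	if err != nil {
		return "", false
	}
	tp := req.Header.Get("traceparent") // format: version-traceid-spanid-flags
	parts := strings.Split(tp, "-")
	if len(parts) != 4 || len(parts[1]) != 32 {
		return "", false
	}
	return parts[1], true
}

func main() {
	raw := "GET /cart HTTP/1.1\r\n" +
		"Host: shop.example.com\r\n" +
		"traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01\r\n\r\n"
	if id, ok := extractTraceID(raw); ok {
		fmt.Println("linked to distributed trace:", id)
	}
}
```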

Out-of-the-box alerting

No observability system is complete without alerting.

1. Alert rules are delivered from default templates, with thresholds based on industry best practices.

Figure 13. Alerting

2. Multiple configuration modes are supported

  • Static thresholds: you only need to configure the threshold value, with no need to write PromQL by hand;
  • Dynamic thresholds tuned by a sensitivity setting, suitable for scenarios where a fixed threshold is hard to determine (see the sketch after this list);
  • PromQL compatibility, which carries some learning cost and is suitable for advanced users.
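The product's actual dynamic-threshold algorithm is not disclosed in this article; the Go sketch below only illustrates the idea of static versus sensitivity-based dynamic thresholds with a simple mean-plus-standard-deviation rule. All names and numbers are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// dynamicThreshold derives an alert threshold from recent samples:
// mean plus a multiple of the standard deviation. A lower sensitivity
// factor means a tighter threshold and more sensitive alerting.
func dynamicThreshold(samples []float64, sensitivity float64) float64 {
	var sum float64
	for _, v := range samples {
		sum += v
	}
	mean := sum / float64(len(samples))

	var variance float64
	for _, v := range samples {
		variance += (v - mean) * (v - mean)
	}
	variance /= float64(len(samples))

	return mean + sensitivity*math.Sqrt(variance)
}

func main() {
	// Recent P95 latency samples in milliseconds (hypothetical).
	samples := []float64{110, 120, 115, 130, 125, 118}

	static := 200.0 // a hand-configured static threshold
	dynamic := dynamicThreshold(samples, 3)

	current := 190.0
	fmt.Printf("static alert:  %v\n", current > static)
	fmt.Printf("dynamic alert: %v (threshold %.1f ms)\n", current > dynamic, dynamic)
}
```

In this toy run the static threshold of 200 ms misses the 190 ms spike, while the dynamic threshold derived from recent behavior catches it, which is the kind of scenario where sensitivity-based thresholds help.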

Rich contextual associations

In an interview, the CEO of Datadog said that Datadog's product strategy is not to support as many features as possible, but to think about how to bridge the gaps between different teams and team members and get everyone on the same page. In our product design, we likewise associate key contextual information so that engineers from different backgrounds can understand it, which speeds up troubleshooting.

At present, we associate context such as alert information, golden signals, logs, and Kubernetes metadata, and we keep adding more valuable information. For example, alerts are automatically associated with the corresponding service or application node, so you can clearly see which applications are abnormal. Clicking the application or the alert shows the application details, the alert details, and the application's golden signals, all on a single page:

Figure 14. Context correlation

Other features

Network performance observability

High response time caused by poor network performance is a common problem. Because the underlying TCP mechanism hides some of the complexity, the application layer is insensitive to it, which causes trouble in scenarios with high packet loss or high retransmission rates. Kubernetes observability exposes retransmission, packet loss, and TCP connection information to describe network conditions. The following figure shows an example of high RT caused by heavy retransmission:

Figure 15. Network performance observability
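The product derives retransmission and packet-loss data per connection via eBPF; as a simpler, hedged illustration of the kernel counters behind such metrics, this Go sketch reads the host-wide TCP counters from /proc/net/snmp and computes a coarse retransmission ratio.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// tcpCounters returns the "Tcp:" counters from /proc/net/snmp as a map
// of field name -> value (host-wide totals since boot).
func tcpCounters() (map[string]int64, error) {
	data, err := os.ReadFile("/proc/net/snmp")
	if err != nil {
		return nil, err
	}
	var names, values []string
	for _, l := range strings.Split(string(data), "\n") {
		if !strings.HasPrefix(l, "Tcp:") {
			continue
		}
		fields := strings.Fields(l)[1:]
		if names == nil {
			names = fields // first Tcp: line holds the field names
		} else {
			values = fields // second Tcp: line holds the values
			break
		}
	}
	counters := make(map[string]int64, len(names))
	for i, n := range names {
		if i < len(values) {
			v, _ := strconv.ParseInt(values[i], 10, 64)
			counters[n] = v
		}
	}
	return counters, nil
}

func main() {
	c, err := tcpCounters()
	if err != nil {
		fmt.Println("read /proc/net/snmp:", err)
		return
	}
	// Retransmitted segments as a fraction of all sent segments; in
	// practice you would diff two samples over a time window.
	if c["OutSegs"] > 0 {
		ratio := float64(c["RetransSegs"]) / float64(c["OutSegs"])
		fmt.Printf("TCP retransmission ratio: %.4f\n", ratio)
	}
}
```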

eBPF superpowers revealed

Figure 16. Data processing flow

eBPF essentially builds an execution engine inside the kernel: a program is attached to a kernel event through a system call so that the event can be monitored. From the event we can infer the protocol, filter for the protocols we care about, process the event further, and write it into a ring buffer or into eBPF's own Map data structure for the user-mode process to read. After reading the data, the user-mode process associates it with Kubernetes metadata and pushes it to the storage backend. That is the whole flow.

eBPF's superpower is that it can subscribe to all kinds of kernel events, such as file reads and writes or network traffic. Everything a container or Pod running in Kubernetes does goes through kernel system calls; the kernel knows everything happening in every process on the machine, which makes it arguably the best vantage point for observability. That is why we chose eBPF. Another benefit of observing at the kernel is that applications do not need to change and the kernel does not need to be recompiled, making the approach truly non-intrusive. When a cluster runs dozens or hundreds of applications, a non-intrusive solution is a huge help.
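The product's collector itself is not shown in this article; purely as a hedged sketch of the attach-and-read loop described above, the following Go program uses the cilium/ebpf library, assuming a pre-compiled object file probe.o that contains a kprobe program named trace_tcp_sendmsg and a ring buffer map named events (all placeholders).

```go
package main

import (
	"errors"
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
	"github.com/cilium/ebpf/ringbuf"
)

func main() {
	// Load a pre-compiled eBPF object file; the file name, program name,
	// and map name below are placeholders for illustration only.
	coll, err := ebpf.LoadCollection("probe.o")
	if err != nil {
		log.Fatalf("load collection: %v", err)
	}
	defer coll.Close()

	// Attach the program to a kernel event (here a kprobe on tcp_sendmsg),
	// so the kernel runs it on every matching call.
	kp, err := link.Kprobe("tcp_sendmsg", coll.Programs["trace_tcp_sendmsg"], nil)
	if err != nil {
		log.Fatalf("attach kprobe: %v", err)
	}
	defer kp.Close()

	// Read events the eBPF program pushed into the ring buffer; a user-space
	// agent would decode them, enrich them with Kubernetes metadata, and ship them.
	rd, err := ringbuf.NewReader(coll.Maps["events"])
	if err != nil {
		log.Fatalf("open ringbuf: %v", err)
	}
	defer rd.Close()

	for {
		rec, err := rd.Read()
		if err != nil {
			if errors.Is(err, ringbuf.ErrClosed) {
				return
			}
			continue
		}
		log.Printf("got %d bytes of event data", len(rec.RawSample))
	}
}
```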

As eBPF is a new technology, it is natural for people to have some concerns about it. Here are some simple answers:

1. How safe is eBPF? Restrictions on eBPF programs, such as the current maximum stack size of 512 bytes and the limit of one million instructions, exist precisely to ensure that they can run safely inside the kernel.

2. What is the overhead of the eBPF probe? Roughly 1%. eBPF's efficiency comes mainly from processing data inside the kernel, which reduces copying between kernel space and user space. Put simply, the data is computed in the kernel and only the result, such as a gauge value, is handed to the user process, whereas previously the raw data had to be copied to the user process before being computed.

Conclusion

Product value

Alibaba Cloud Kubernetes observability is a one-stop observability product developed for Kubernetes clusters. Based on the metrics, application traces, logs, and events of a Kubernetes cluster, it aims to provide an overall observability solution for IT development and operations personnel. Alibaba Cloud Kubernetes observability has the following characteristics:

  • **Non-intrusive to code:** Using out-of-band collection, rich network performance data can be obtained without instrumenting any code.
  • **Language-independent:** Network protocols are parsed at the kernel level, supporting any language and any framework.
  • **High performance:** Based on eBPF, rich network performance data is obtained with very low overhead.
  • **Strong association:** Entities are associated across multiple dimensions through network topology, resource topology, and resource relationships, and various types of data (metrics, traces, logs, and events) can be associated with one another.
  • **End-to-end data coverage:** Covers observability data across the end-to-end software stack.
  • **Closed-loop scenarios:** The console's scenario design ties together the architecture-awareness topology, application observability, Prometheus observability, cloud dial testing, health inspection, the event center, log service, and cloud services, forming a complete closed loop for understanding applications, discovering exceptions, and locating exceptions.


This article is original content from Alibaba Cloud and may not be reproduced without permission.