Summary: Cloud native technology today is built on containers, providing infrastructure through standard, extensible interfaces for scheduling, networking, storage, and the container runtime.

As Kubernetes becomes the de facto standard of cloud native, observability challenges arise

Cloud native technology today is built on containers, providing infrastructure through standard, extensible interfaces for scheduling, networking, storage, and the container runtime. At the same time, standard, extensible declarative resources and controllers provide operation and maintenance capabilities. These two layers of standardization promote the separation of development and operations concerns, further increase the scale and specialization of each field, and enable comprehensive optimization of cost, efficiency, and stability.

Against this technical background, more and more companies are adopting cloud native technology to develop, operate, and maintain business applications. Because cloud native technology enables ever more complex architectures, business applications now exhibit distinctive characteristics: numerous microservices, multiple development languages, and multiple communication protocols. At the same time, cloud native technology pushes complexity down into the infrastructure, creating new challenges for observability:

1. Sprawling microservice architectures mixing multiple languages and network protocols

Due to division of labor, business architectures easily end up with a large number of services and complex call protocols and relationships. Common problems include:

  • It is impossible to accurately and clearly understand and control the overall runtime architecture of the system;
  • It is hard to answer whether connectivity between applications is correct;
  • Multiple languages and network call protocols make instrumentation costs grow linearly, and repeated instrumentation has a low ROI, so developers generally lower the priority of such requirements, even though the observability data still has to be collected.

2. As infrastructure capabilities sink lower into the stack, implementation details are hidden and problem demarcation becomes harder and harder

As infrastructure capabilities continue to sink into the platform and development and operations concerns continue to separate, each layer hides its implementation details from the others. Data is not well correlated across layers, and it becomes hard to quickly determine at which layer a problem occurs. Developers only care whether the application runs normally, not about the details of the underlying infrastructure; when a problem occurs, operations engineers need to cooperate in troubleshooting. During troubleshooting, operations engineers need enough upstream and downstream context to make progress; a vague statement such as "application XXX has high latency" is hard to act on. Therefore, developers and operations engineers need a common language to improve communication efficiency, and Kubernetes concepts such as Label and Namespace are very suitable for constructing this context information.
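As a small illustration (not tied to any particular product), an alert enriched with Kubernetes context could look like the following; the namespace, workload, and pod names are made up, but both developers and operators can act on them immediately, unlike a bare "application XXX has high latency".

```python
# Hypothetical alert record enriched with Kubernetes metadata as the
# shared vocabulary between development and operations.
alert = {
    "summary": "p99 latency above 500ms",
    "value_ms": 830,
    # Kubernetes context that both dev and ops understand:
    "labels": {
        "namespace": "order-prod",
        "workload": "deployment/checkout",
        "pod": "checkout-5f6d8c7b9-x2k4q",
        "node": "cn-hangzhou.10.0.3.17",
        "container": "checkout",
    },
}

def describe(a):
    ctx = a["labels"]
    return (f"{a['summary']} in {ctx['namespace']}/{ctx['workload']} "
            f"(pod {ctx['pod']} on node {ctx['node']})")

print(describe(alert))
```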

3. A proliferation of monitoring systems leads to inconsistent monitoring interfaces

A serious side effect of complex systems is the proliferation of monitoring systems. Data and links are not associated or unified, and the monitoring experience is inconsistent. Most operations engineers have had the experience of locating a problem with dozens of browser windows open, switching back and forth between Grafana, consoles, logs, and other tools. This is not only time-consuming, but the brain can only process a limited amount of information, so problem locating becomes inefficient. With a unified observability interface, data and information can be organized effectively, reducing distraction and page switching, improving the efficiency of problem locating, and freeing up valuable time for building business logic.

Solution and technical approach

To solve the above problems, we need a solution that supports multiple languages and communication protocols and, at the product level, covers the end-to-end observability needs of the software stack as far as possible. After research, we propose an observability solution that is based on the container interface and the underlying operating system, and that associates upward with application performance monitoring.

Collecting data from containers, node runtime environments, applications, and the network is very challenging. The cloud native community provides cAdvisor, node-exporter, and kube-state-metrics for different needs, but they still cannot cover everything, and the cost of maintaining many collectors is significant. This raises the question: is there a data collection scheme that is non-intrusive to applications and supports dynamic extension? By far the best answer is eBPF.

Data acquisition: the superpower of eBPF

eBPF is essentially an execution engine built into the kernel. A program is attached to a kernel event through a system call and thereby monitors that event. With the event in hand, we can further infer the protocol, filter out the protocols we are interested in, then process the event and put it into a ring buffer or an eBPF map for the user-space process to read. After reading this data, the user-space process associates it with Kubernetes metadata and pushes it to the storage backend. That is the whole pipeline.
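To make the pipeline concrete, here is a minimal sketch using the BCC Python bindings: a kprobe on tcp_sendmsg emits events through a perf buffer, and the user-space loop enriches them and pushes them on. The functions enrich_with_k8s_metadata() and push_to_backend() are hypothetical placeholders for the metadata-association and storage steps, not part of any specific product.

```python
from bcc import BPF

bpf_text = r"""
#include <uapi/linux/ptrace.h>
#include <net/sock.h>

struct event_t {
    u32 pid;
    u64 ts_ns;
    char comm[16];
};
BPF_PERF_OUTPUT(events);          // channel from kernel to user space

// Attach to tcp_sendmsg as an example kernel event of interest.
int trace_tcp_sendmsg(struct pt_regs *ctx, struct sock *sk) {
    struct event_t ev = {};
    ev.pid = bpf_get_current_pid_tgid() >> 32;
    ev.ts_ns = bpf_ktime_get_ns();
    bpf_get_current_comm(&ev.comm, sizeof(ev.comm));
    events.perf_submit(ctx, &ev, sizeof(ev));
    return 0;
}
"""

def enrich_with_k8s_metadata(pid):
    # Placeholder: look up the pod/namespace owning this PID, e.g. by
    # reading /proc/<pid>/cgroup and querying the Kubernetes API.
    return {"namespace": "unknown", "pod": "unknown"}

def push_to_backend(record):
    # Placeholder: push the enriched record to the storage backend.
    print(record)

b = BPF(text=bpf_text)
b.attach_kprobe(event="tcp_sendmsg", fn_name="trace_tcp_sendmsg")

def handle(cpu, data, size):
    ev = b["events"].event(data)
    record = {"pid": ev.pid, "comm": ev.comm.decode(), "ts_ns": ev.ts_ns}
    record.update(enrich_with_k8s_metadata(ev.pid))
    push_to_backend(record)

b["events"].open_perf_buffer(handle)
while True:
    b.perf_buffer_poll()
```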

eBPF's superpower is that it can subscribe to all kinds of kernel events, such as file reads and writes, network traffic, and so on. Everything that runs in a container or Pod in Kubernetes goes through kernel system calls; the kernel knows everything that happens in every process on the machine, so the kernel is almost the ideal observation point. That is why we chose eBPF. Another benefit of monitoring at the kernel level is that applications do not need to be changed and the kernel does not need to be recompiled, making the approach truly non-intrusive. When a cluster runs dozens or hundreds of applications, a non-intrusive solution is a big help.

But as a new technology, eBPF raises some concerns, such as safety and probe overhead. To guarantee kernel runtime safety, eBPF code is subject to restrictions such as a maximum stack size of 512 bytes and a maximum of 1 million instructions. As for performance, the overhead of eBPF probes is controlled at around 1%. The high performance mainly comes from processing data inside the kernel, which reduces data copies between kernel and user space. Simply put, the data is aggregated in the kernel and then handed to the user-space process, for example as a gauge value; in the past, the raw data had to be copied to the user-space process before being computed.
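The in-kernel aggregation point can be illustrated with a small sketch, again using BCC, assuming we only want a per-process sent-bytes counter rather than every raw event: the sum lives in a BPF map inside the kernel, and user space only reads the already-aggregated values periodically.

```python
from bcc import BPF
import time

bpf_text = r"""
#include <uapi/linux/ptrace.h>
#include <net/sock.h>

BPF_HASH(sent_bytes, u32, u64);   // pid -> total bytes, aggregated in kernel

// tcp_sendmsg(sk, msg, size) on recent kernels.
int trace_tcp_sendmsg(struct pt_regs *ctx, struct sock *sk,
                      struct msghdr *msg, size_t size) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 zero = 0, *val;
    val = sent_bytes.lookup_or_try_init(&pid, &zero);
    if (val) {
        __sync_fetch_and_add(val, size);   // aggregate in kernel space
    }
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="tcp_sendmsg", fn_name="trace_tcp_sendmsg")

while True:
    time.sleep(10)
    # Only the aggregated values cross the kernel/user boundary.
    for pid, total in b["sent_bytes"].items():
        print(f"pid={pid.value} sent_bytes={total.value}")
    b["sent_bytes"].clear()
```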

A programmable execution engine naturally lends itself to observability

Observability engineering helps users better understand the internal state of the system, eliminating knowledge blind spots and resolving systemic risks in a timely manner. What power does eBPF bring to observability?

Take application exceptions as an example. When an abnormal application is found, we often discover during troubleshooting that application-level observability is missing. The missing instrumentation is then added through coding, testing, and release, and the specific issue gets resolved, but this treats the symptom rather than the cause: the next time a problem appears somewhere else, the same process has to be repeated, and instrumenting yet another language or protocol costs even more. It is better to collect the data in a non-intrusive way, so that data is never missing exactly when it is needed.

The eBPF execution engine collects observability data by dynamically loading and executing eBPF scripts. For example, suppose the original Kubernetes system did no process-level monitoring, and one day a malicious process (such as a mining program) is found frantically consuming CPU. We then realize that the creation of such malicious processes should be monitored. We could integrate an open source process event detection library to achieve this, but that usually requires going through the whole cycle of packaging, testing, and release, which may take a month to complete.

By contrast, eBPF is more efficient and faster. Because eBPF supports dynamically loading programs that listen for the kernel's process-creation events, we can abstract the eBPF script into a submodule, and the collection agent only needs to load the script in this submodule to complete data collection. The data is then pushed to the backend through a unified data channel. This skips the tedious process of changing code, packaging, testing, and releasing, and meets the process-monitoring need dynamically and non-intrusively. eBPF's programmable execution engine is therefore ideal for enhancing observability, gathering rich kernel data and associating it with business applications to facilitate troubleshooting.
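As an illustration, the dynamically loaded "process creation" script might look roughly like the following BCC sketch, which traces the execve tracepoint and reports every new program started; in a real agent this program text would live in a separate submodule and be loaded on demand.

```python
from bcc import BPF

bpf_text = r"""
#include <linux/sched.h>

struct exec_event_t {
    u32 pid;
    u32 ppid;
    char fname[128];
};
BPF_PERF_OUTPUT(exec_events);

// Fires on every execve() system call, i.e. whenever a new program starts.
TRACEPOINT_PROBE(syscalls, sys_enter_execve) {
    struct exec_event_t ev = {};
    struct task_struct *task = (struct task_struct *)bpf_get_current_task();

    ev.pid = bpf_get_current_pid_tgid() >> 32;
    ev.ppid = task->real_parent->tgid;
    bpf_probe_read_user_str(&ev.fname, sizeof(ev.fname), args->filename);
    exec_events.perf_submit(args, &ev, sizeof(ev));
    return 0;
}
"""

b = BPF(text=bpf_text)

def handle(cpu, data, size):
    ev = b["exec_events"].event(data)
    # In a real collector this record would be enriched with pod metadata
    # and matched against rules (e.g. known mining program names).
    print(f"new process: pid={ev.pid} ppid={ev.ppid} file={ev.fname.decode()}")

b["exec_events"].open_perf_buffer(handle)
while True:
    b.perf_buffer_poll()
```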

From monitoring systems to observability

With the wave of cloud native, the concept of observability is taking root, yet it remains inseparable from the data foundation of the three pillars: logs, metrics, and traces. Anyone who has done operations or SRE work knows the problem of being pulled into an emergency group in the middle of the night, being asked why the database isn't working, and being unable to get to the heart of the problem without context. We believe a good observability platform should help users get that context. As Datadog's CEO put it: monitoring tools are not about being as versatile as possible, but about bridging the gap between different teams and members and getting everyone on the same page.

Therefore, in the product design of the observability platform, metrics, traces, and logs form the foundation; Alibaba Cloud's various cloud services are integrated on top, data from open source products is also supported, and key context information is associated so that engineers with different backgrounds can understand it, accelerating troubleshooting. If information is not organized effectively, it imposes a cost of understanding. Information is therefore organized by granularity into a single page in the order events -> metrics -> traces -> logs, which makes drill-down easy and provides a consistent experience without jumping back and forth between multiple systems.

So how is the data correlated, and how is the information organized? Mainly from two aspects:

**1. End-to-end:** here this means application to application, service to service. With Kubernetes' standardization and separation of concerns, each team develops and maintains its own area, so end-to-end monitoring often becomes a "nobody's responsibility" zone, and when a problem occurs it is hard to tell which link in the chain is at fault. From the end-to-end point of view, the call relationship between two services is the basis for correlation, because it is the system call that creates the association. Through eBPF, network calls can be collected very conveniently and non-intrusively, then parsed into well-known application protocols such as HTTP, gRPC, and MySQL, and finally assembled into a clear service topology that makes it easy to locate problems quickly. As shown in the figure below, if latency appears at any link in the complete chain gateway -> Java application -> Python application -> cloud service, the problem should be visible at a glance in the service topology. This is the first correlation point: end-to-end (a minimal topology-building sketch follows after the second point).

**2. Top-down full-stack association:** using the Pod as the medium, the Kubernetes layer can be associated with Workload, Service, and other objects; the infrastructure layer with nodes, storage devices, networks, and so on; and the application layer with logs, call traces, and so on.
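To make the end-to-end idea concrete, here is a minimal, product-agnostic sketch of turning parsed calls into a service topology whose edges carry golden signals. The service names and numbers are made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class EdgeStats:
    requests: int = 0
    errors: int = 0
    latencies_ms: list = field(default_factory=list)

    @property
    def avg_latency_ms(self):
        return sum(self.latencies_ms) / len(self.latencies_ms) if self.latencies_ms else 0.0

class ServiceTopology:
    def __init__(self):
        # (caller, callee, protocol) -> aggregated golden signals for the edge
        self.edges = defaultdict(EdgeStats)

    def observe(self, src, dst, protocol, latency_ms, is_error):
        e = self.edges[(src, dst, protocol)]
        e.requests += 1
        e.errors += int(is_error)
        e.latencies_ms.append(latency_ms)

    def dump(self):
        for (src, dst, proto), e in self.edges.items():
            print(f"{src} -> {dst} [{proto}] "
                  f"req={e.requests} err={e.errors} avg_rt={e.avg_latency_ms:.1f}ms")

topo = ServiceTopology()
topo.observe("gateway", "java-app", "HTTP", 12.5, False)
topo.observe("java-app", "python-app", "gRPC", 48.0, False)
topo.observe("python-app", "cloud-service", "HTTP", 230.0, True)
topo.dump()
```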

Next, the core functions of Kubernetes monitoring are introduced.

Golden signals that never go out of style

Golden signals are the smallest set of metrics for monitoring the performance and state of a system. They have two advantages: first, they directly and clearly express whether the system is serving external requests normally; second, they make it possible to quickly assess the impact on users or the severity of the situation, which greatly saves SRE or R&D time. Imagine taking CPU utilization as a golden signal: SREs or developers would exhaust themselves chasing it, because high CPU utilization may have little actual impact.

Kubernetes Monitoring supports the following golden signals:

  • Number of requests / QPS
  • Response time and quantiles (P50, P90, P95, P99)
  • Number of errors
  • Number of slow calls

As shown below:
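Beyond the product view, the following is a minimal sketch, assuming each request record carries a latency and an error flag, of how these four golden signals could be computed over a time window; the 500 ms slow-call threshold is an arbitrary example, not a product default.

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    is_error: bool

def percentile(sorted_vals, p):
    # Nearest-rank percentile on an already sorted list.
    idx = max(0, int(round(p / 100.0 * len(sorted_vals))) - 1)
    return sorted_vals[idx]

def golden_signals(requests, window_seconds, slow_ms=500.0):
    latencies = sorted(r.latency_ms for r in requests)
    return {
        "qps": len(requests) / window_seconds,
        "p50": percentile(latencies, 50),
        "p90": percentile(latencies, 90),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
        "errors": sum(r.is_error for r in requests),
        "slow_calls": sum(r.latency_ms > slow_ms for r in requests),
    }

reqs = [Request(20, False), Request(35, False), Request(800, True), Request(120, False)]
print(golden_signals(reqs, window_seconds=60))
```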

Global view of the service topology

As an old Chinese saying goes, "Those who do not plan for the whole cannot plan for a single region." With the increasing complexity of today's technical and deployment architectures, it becomes harder and harder to locate problems after they occur, and MTTR keeps rising. Another effect is that analyzing the impact surface becomes very challenging, and fixing one thing often breaks another. Therefore, a big topology map, like a geographic map, is needed. The global topology has the following characteristics:

  • **System architecture awareness:** the system architecture diagram is an important reference for engineers to understand a new system. When handed a system, they at least need to know where traffic enters, what the core modules are, and which internal and external components it depends on. During anomaly locating, having a global architecture map is a big boost to the process.
  • **Dependency analysis:** some problems lie in downstream dependencies. If a dependency is not maintained by one's own team, and neither one's own system nor the downstream system has enough observability, it is hard to explain the problem to the dependency's maintainer. In our topology, the upstream and downstream golden signals are connected by call relationships to form a call graph. An edge visualizes a dependency, and you can see the golden signals of the corresponding call; with them, you can quickly determine whether a downstream dependency has a problem.

Distributed tracing helps locate the root cause

Protocol tracing is also non-intrusive and language-neutral. If a request contains a distributed trace ID, the system automatically identifies it, making it easy to drill down further into the trace. The request and response content of application-layer protocols helps analyze the request payload and return code to determine which interface is faulty. To view code-level details or the boundaries of a request, click the trace ID to drill down into trace analysis.
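As an illustration of how a collector might automatically recognize a trace ID in a parsed HTTP request, here is a small sketch that checks a few common tracing headers (W3C traceparent, Zipkin B3, Jaeger); the product's actual list of supported headers may differ.

```python
TRACE_HEADERS = ("traceparent", "x-b3-traceid", "uber-trace-id")

def extract_trace_id(headers):
    normalized = {k.lower(): v for k, v in headers.items()}
    for name in TRACE_HEADERS:
        if name in normalized:
            value = normalized[name]
            if name == "traceparent":
                # traceparent format: version-traceid-parentid-flags
                parts = value.split("-")
                return parts[1] if len(parts) >= 2 else value
            # uber-trace-id format: traceid:spanid:parentid:flags
            return value.split(":")[0]
    return None

print(extract_trace_id(
    {"TraceParent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}))
```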

Out-of-the-box alerting

The out-of-the-box alert templates cover all the different layers, with no need to configure alerts manually. Large-scale Kubernetes operations experience is baked into the templates; carefully designed alert rules plus intelligent noise reduction and deduplication mean that once an alert fires, it is a valid alert, and it carries associated information so that the abnormal entity can be located quickly. The advantage of full-stack coverage of alert rules is that high-risk events are reported to users proactively and in time; users can then gradually achieve better system stability through troubleshooting, alert resolution, follow-up, and failure-oriented design.

Network performance monitoring

Network performance problems are very common in Kubernetes environments. Because the underlying TCP mechanisms hide the complexity of network transmission, the application layer is insensitive to it, which makes it hard to locate problems such as high packet loss rates and high retransmission rates in production. Kubernetes Monitoring uses RTT, retransmissions and packet loss, and TCP connection information to represent network conditions. Taking RTT as an example, network performance can be viewed from the dimensions of namespace, node, container, Pod, service, and workload, supporting the location of the following kinds of network problems (a small tracing sketch follows the list):

  • The load balancer cannot access a Pod. The traffic on this Pod is 0. You need to determine whether the Pod network is faulty or the load balancer configuration is faulty.
  • The performance of applications on one node seems poor, and it is necessary to determine whether the node's network has a problem, which can be done by comparing against the networks of other nodes.
  • Packet loss occurs on the link, but it is unclear at which layer; it can be checked by node, Pod, and container.
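As a sketch of where the retransmission signal can come from, the following BCC snippet counts TCP retransmissions per connection by probing tcp_retransmit_skb; a production probe would additionally map each address back to a Pod, service, node, and so on.

```python
from bcc import BPF
import socket
import struct
import time

bpf_text = r"""
#include <net/sock.h>

struct conn_t {
    u32 saddr;
    u32 daddr;
    u16 dport;
};
BPF_HASH(retransmits, struct conn_t, u64);

// tcp_retransmit_skb is called whenever the kernel retransmits a segment.
int trace_retransmit(struct pt_regs *ctx, struct sock *sk) {
    struct conn_t c = {};
    c.saddr = sk->__sk_common.skc_rcv_saddr;
    c.daddr = sk->__sk_common.skc_daddr;
    c.dport = sk->__sk_common.skc_dport;
    retransmits.increment(c);
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="tcp_retransmit_skb", fn_name="trace_retransmit")

def ip(addr):
    # Addresses are stored in network byte order in the socket.
    return socket.inet_ntoa(struct.pack("I", addr))

while True:
    time.sleep(10)
    for conn, count in b["retransmits"].items():
        print(f"{ip(conn.saddr)} -> {ip(conn.daddr)}:{socket.ntohs(conn.dport)} "
              f"retransmits={count.value}")
    b["retransmits"].clear()
```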

Kubernetes observability panoramic view

Built on top of the above product capabilities and Alibaba's rich and deep practice with containers and Kubernetes, we have distilled these valuable production practices into product capabilities to help users locate production problems more effectively and quickly. This troubleshooting panorama can be used as follows:

  • The overall structure starts with Services and Deployments (applications) as the entry point, and most developers only need to focus on this layer: whether the service or application is faulty, whether the service is reachable, and whether the number of replicas meets expectations.
  • One layer down are the Pods that actually carry the workloads. At the Pod level the focus is on erroneous and slow requests, health, resource sufficiency, and the health of downstream dependencies.
  • The lowest layer is the node, which provides the runtime environment and resources for Pods and services. Pay attention to whether the node is healthy, whether it is schedulable, and whether it has sufficient resources.

Network problems

The network is one of the trickiest and most common sources of problems in Kubernetes. Locating network problems in production environments is difficult for several reasons:

  • Kubernetes' network architecture is highly complex; nodes, Pods, containers, Services, and VPCs are dizzying;
  • Troubleshooting network problems requires specialized knowledge, and most people have a natural fear of network problems;
  • The eight fallacies of distributed computing tell us that the network is not stable, the network topology does not stay constant, and latency cannot be ignored, all of which make the end-to-end network topology uncertain.

Typical network problems in Kubernetes environments include:

  • The conntrack table becoming full;
  • IP conflicts;
  • Slow or failing CoreDNS resolution;
  • Nodes without Internet access (yes, you heard that right);
  • Service access failures;
  • Configuration problems (load balancer configuration, routing configuration, device configuration, network adapter configuration);
  • A network outage making the entire service unavailable.

There are thousands of network problems, but one thing never changes: the network has its own "golden signals" that indicate whether it is working properly:

  • Network traffic and bandwidth;
  • Packet loss rate and retransmission rate;
  • RTT.

The following example shows a slow call caused by a network problem. From the gateway's point of view, a slow call occurs; checking the topology shows that the RT of the downstream product service is relatively high, yet the product service's own golden signals show no problem with the service itself. Checking the network status between the two reveals that RTT and retransmissions are relatively high, which indicates that network performance has deteriorated and overall transmission has slowed down. The TCP retransmission mechanism hides this fact, so the application layer cannot perceive it and logs reveal nothing. Here the network's golden signals help demarcate the problem and accelerate the investigation.

Node problems

Kubernetes does a great deal of work to ensure, as far as possible, that the nodes provided for workloads and services are healthy. The node controller checks node status around the clock; when it finds a problem that affects the node's normal operation, it sets the node to NotReady or unschedulable and evicts business Pods from the problem node via the kubelet. This is Kubernetes' first line of defense. The second line of defense is the node self-healing components designed by cloud vendors for high-frequency node failure scenarios. For example, Alibaba Cloud's node repairer drains the problem node, evicts its Pods, and replaces the machine, automatically keeping the business running. Even so, over the long run nodes inevitably develop all sorts of strange problems that are time-consuming and labor-intensive to locate. Common problems can be classified by category and severity:

For these complicated problems, we have summarized the following troubleshooting flowchart:

Take a saturated CPU as an example: 1. The node status is OK, but CPU usage exceeds 90%.

2. Check the corresponding CPU triplet: utilization, TopN, and the time-series chart. First, the utilization of every core is high, which drives up the overall CPU usage. Next, we need to know who is using the CPU so heavily; the TopN list shows one Pod consuming CPU far beyond the others. Finally, we confirm when the CPU spike started.

Slow service response

Possible causes of slow service response include code design problems, network problems, resource contention, and slow downstream dependencies. In a complex Kubernetes environment, locating a slow call can proceed from several angles: first, whether the application itself is slow; second, whether the downstream or the network is slow; and finally, checking resource usage. Kubernetes Monitoring analyzes service performance horizontally and vertically, as shown in the figure below:

  • Horizontal: mainly at the end-to-end level, first check whether the service's own golden signals show a problem, then gradually check the downstream network metrics. Note that if the call to the downstream takes a long time from the client's perspective, but the downstream looks normal by its golden signals, the cause is very likely a network problem or an operating-system-level problem. In that case, network performance metrics (traffic, packet loss, retransmissions, RTT, and so on) can be used to determine the problem.
  • Vertical: having determined that the application's own external latency is high, the next step is to find the specific reason. The flame graph shows which step or which method is slow. If there is no problem with the code, the environment in which the code runs may be at fault; in that case, check whether the system's CPU or memory has problems.

Here is an example of a slow SQL query (figure below). In this example, the gateway calls the Product service, which depends on a MySQL service. Checking the golden signals step by step along the chain, we finally find that Product executes a particularly complex SQL statement joining multiple tables, which makes the MySQL service respond slowly. The MySQL protocol is based on TCP; after identifying the MySQL protocol, our eBPF probe reassembles and restores the MySQL protocol content, so SQL statements executed in any language can be collected.
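To illustrate what "restoring the MySQL protocol content" involves, here is a minimal sketch that recovers the SQL text from a single uncompressed, non-TLS COM_QUERY packet; a real probe also has to handle TCP segmentation, compression, TLS, and the other MySQL commands.

```python
COM_QUERY = 0x03  # MySQL command byte for "execute this SQL text"

def parse_mysql_query(payload: bytes):
    if len(payload) < 5:
        return None
    # MySQL packet header: 3-byte little-endian length + 1-byte sequence id.
    length = int.from_bytes(payload[0:3], "little")
    command = payload[4]
    if command != COM_QUERY:
        return None
    # The rest of the packet body is the SQL text.
    sql = payload[5:4 + length]
    return sql.decode("utf-8", errors="replace")

# Build an example COM_QUERY packet and parse it back.
sql_text = b"SELECT * FROM orders JOIN users USING(id)"
body = bytes([COM_QUERY]) + sql_text
packet = len(body).to_bytes(3, "little") + bytes([0]) + body
print(parse_mysql_query(packet))
```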

The second example is one where the application itself is slow; the natural question is which step and which function are causing the slowness. The flame graph supported by ARMS Application Monitoring helps quickly locate code-level problems by periodically sampling CPU time (as shown below).

Application / Pod status problems

The Pod is responsible for managing containers, and the container is the vehicle that actually executes the business logic. At the same time, the Pod is the smallest scheduling unit in Kubernetes, so a Pod combines business complexity and infrastructure complexity, and needs to be analyzed together with logs, traces, system metrics, and downstream service metrics. Pod traffic is a frequent problem in production environments; for example, when there are thousands of Pods in an environment, it is particularly difficult to identify which Pod the traffic mainly comes from.

Let's look at a typical case: during a release, a downstream service canaried a new Pod, and due to a code problem the Pod responded very slowly, causing upstream timeouts. The reason we can observe at Pod granularity is that we use eBPF to collect each Pod's traffic and golden signals, so the traffic between Pods, between Pods and services, and between Pods and external endpoints can easily be viewed through topologies and dashboards.

Conclusion

Through eBPF, golden signals, network metrics, and traces are collected non-intrusively across multiple languages and network protocols. By correlating various contexts such as Kubernetes objects, applications, and cloud services, and by providing specialized tools (such as flame graphs) when further drill-down is needed, a one-stop observability platform for the Kubernetes environment is realized.

If you run into any of the following while building cloud native monitoring, please do not hesitate to contact us to discuss:

  • You are not familiar with Kubernetes and need a complete, unified monitoring solution;
  • Data is fragmented across Prometheus, Alertmanager, Grafana, and other systems, and they are hard to use together;
  • The cost of instrumenting applications and infrastructure in a containerized environment is too high, and you are looking for a low-cost or non-intrusive solution.


This article is original content from Alibaba Cloud and may not be reproduced without permission.