With the rapid development and adoption of container technology, more and more enterprises are running their workloads in containers. As one of the mainstream deployment approaches, containers separate the tasks and concerns of different teams: the development team can focus on application logic and dependencies, while the operations team can focus on deployment and management without worrying about application details such as specific software versions and application-specific configuration. Development and operations teams therefore spend less time debugging and releasing, and more time delivering new features to end users. Containers also make it easier for enterprises to improve application portability and operational flexibility. According to the CNCF survey, 73% of respondents use containers to increase production agility and speed up innovation.

Why do we need container monitoring

When containers are used at scale, a monitoring system is essential for keeping the runtime environment stable and optimizing resource costs in a highly dynamic container environment that must be watched continuously. Each container image can have a large number of running instances, and because new images and versions are introduced quickly, failures can spread rapidly across containers, applications, and the underlying architecture. This makes it critical to locate the root cause of a problem immediately after it occurs, before the anomaly spreads. After extensive practice, we believe the following components should be monitored when using containers (a minimal sketch of pulling one representative metric per layer follows the list):

  • Host servers;
  • Container runtime;
  • Orchestrator control plane;
  • Middleware dependencies;
  • Applications running inside containers.
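
As a minimal sketch of the above, the snippet below pulls one representative metric for each layer from a Prometheus server that is assumed to already scrape node-exporter, cAdvisor, and the Kubernetes API server; the Prometheus URL and the application metric name are hypothetical placeholders for your own environment.

    # A minimal sketch: query one representative metric per layer from Prometheus.
    # Assumes Prometheus already scrapes node-exporter, cAdvisor, and the API server.
    # PROM_URL and the application metric name are hypothetical placeholders.
    import requests

    PROM_URL = "http://prometheus.example.internal:9090"

    QUERIES = {
        "host server":       "node_load1",                                  # node-exporter
        "container runtime": "container_memory_working_set_bytes",          # cAdvisor
        "control plane":     "sum(rate(apiserver_request_total[5m]))",      # kube-apiserver
        "application":       "sum(rate(http_requests_total[5m]))",          # assumed app metric
    }

    def instant_query(promql: str):
        """Run an instant PromQL query and return the result vector."""
        resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
        resp.raise_for_status()
        return resp.json()["data"]["result"]

    if __name__ == "__main__":
        for layer, promql in QUERIES.items():
            print(f"{layer}: {len(instant_query(promql))} series")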

With a complete monitoring system and a clear view of metrics, logs, and traces, the team not only understands what is happening in the cluster, the container runtimes, and the applications, but also gains data to support business decisions, such as when to scale instances/tasks/pods in or out and when to change instance types. DevOps engineers can further improve troubleshooting and resource management efficiency with automated alerts and related configuration, for example proactively monitoring memory usage and notifying the O&M team to add nodes before available CPU and memory are exhausted, as soon as consumption approaches a set threshold (a minimal threshold-check sketch follows the list of values below). The values include:

  • Detect problems early to avoid system outages;
  • Analyze container health across the cloud environment;
  • Identify clusters whose available resources are over- or under-allocated, and tune applications for better performance;
  • Create intelligent alerts to improve alert accuracy and avoid false alarms;
  • Use monitoring data to optimize system performance and lower operating costs.
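
As a minimal sketch of the proactive alert described above, and assuming the same hypothetical Prometheus endpoint plus a hypothetical notification webhook, the snippet below flags nodes whose available memory falls below a threshold before it is exhausted.

    # A minimal threshold-check sketch: warn the O&M team when a node's available
    # memory drops below 20%, before resources are exhausted. Both URLs are
    # hypothetical placeholders; the metrics come from node-exporter.
    import requests

    PROM_URL  = "http://prometheus.example.internal:9090"
    WEBHOOK   = "https://oncall.example.internal/notify"
    THRESHOLD = 0.20

    QUERY = "node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes"

    def check_nodes():
        resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
        resp.raise_for_status()
        for series in resp.json()["data"]["result"]:
            node = series["metric"].get("instance", "unknown")
            available_ratio = float(series["value"][1])
            if available_ratio < THRESHOLD:
                # Notify early so new nodes can be added before memory runs out.
                requests.post(WEBHOOK, json={"node": node, "available": available_ratio}, timeout=10)

    if __name__ == "__main__":
        check_nodes()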

However, when monitoring is actually put into practice, the operations team may feel that the value described above is fairly modest, as if existing operations tools could already achieve it. Yet in container scenarios, if a proper monitoring system is not built, two very painful problems surface as the business keeps expanding:

1. Troubleshooting takes too long and SLAs cannot be met.

Without monitoring, it is hard for development and operations teams to know what is running and how well it is performing. Maintaining applications, meeting SLAs, and troubleshooting become exceptionally difficult.

2. Scaling is held back and elasticity cannot be achieved.

The ability to rapidly scale an application or microservice instance as needed is an important requirement of a containerized environment. The monitoring system is the only quantitative way to measure demand and user experience: scaling out too late degrades performance and user experience, while scaling in too late wastes resources and cost (a minimal metric-driven scaling sketch follows).
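
As a minimal sketch of how monitoring data can drive that decision, the snippet below uses the official kubernetes Python client and the hypothetical Prometheus endpoint from earlier; the deployment name, namespace, latency metric, and thresholds are all placeholders, and in practice a HorizontalPodAutoscaler would usually own this loop.

    # A minimal sketch of metric-driven scaling: adjust replicas of a deployment
    # based on p95 request latency from Prometheus. All names and thresholds are
    # hypothetical; a HorizontalPodAutoscaler normally owns this decision.
    import requests
    from kubernetes import client, config

    PROM_URL = "http://prometheus.example.internal:9090"
    DEPLOYMENT, NAMESPACE = "checkout", "prod"

    def p95_latency_seconds() -> float:
        # Metric name is an assumption; substitute whatever your services expose.
        q = ('histogram_quantile(0.95, '
             'sum(rate(http_request_duration_seconds_bucket[5m])) by (le))')
        r = requests.get(f"{PROM_URL}/api/v1/query", params={"query": q}, timeout=10).json()
        return float(r["data"]["result"][0]["value"][1])

    def rescale():
        config.load_kube_config()
        apps = client.AppsV1Api()
        current = apps.read_namespaced_deployment_scale(DEPLOYMENT, NAMESPACE).spec.replicas
        latency = p95_latency_seconds()
        # Scale out early when latency climbs; scale in when there is clear headroom.
        if latency > 0.5:
            desired = current + 1
        elif latency < 0.1:
            desired = max(1, current - 1)
        else:
            desired = current
        if desired != current:
            apps.patch_namespaced_deployment_scale(
                DEPLOYMENT, NAMESPACE, {"spec": {"replicas": desired}})

    if __name__ == "__main__":
        rescale()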

Therefore, as the problems and the value of container monitoring keep accumulating and surfacing, more and more operations teams are turning their attention to building a container monitoring system. In practice, however, the rollout of container monitoring runs into all kinds of unexpected difficulties.

One is the difficulty of tracking changes caused by the ephemeral nature of containers. A container bundles not only the application code but also all the underlying services the application needs to run, and containerized applications are updated frequently as new deployments change code and dependencies, which increases the likelihood of errors. The fast-build, fast-destroy lifecycle makes it extremely difficult to track changes in large, complex systems.

Another is the difficulty of monitoring shared resources: the memory and CPU used by containers are shared across one or more hosts, so resource consumption measured on the physical host alone gives a poor indication of container performance or application health.

Finally, traditional tools struggle to meet the requirements of container monitoring. Traditional monitoring solutions often lack the metrics, tracing, and logging capabilities required for containerized environments, especially for container health and performance.

Therefore, taking the above values, problems, and difficulties together, a container monitoring system should be considered and designed along the following dimensions:

  • Non-invasiveness: whether the monitoring SDK or probe must be integrated into business code, and whether that integration affects business stability;
  • Completeness: whether the performance of the entire application can be observed from both the business and the technology-platform perspective;
  • Multi-source support: whether relevant metrics and logs can be collected from different data sources for aggregated display, analysis, and alerting;
  • Convenience: whether events and logs can be correlated, anomalies detected, and faults removed both proactively and reactively to reduce losses, and whether alert policies are easy to configure.

While clarifying business requirements and designing the monitoring system, the operations team has many open source tools to choose from, but it also needs to evaluate the potential business and project risks. These include:

  • Unknown risks that may affect service stability: whether monitoring can be introduced without leaving a footprint, and whether the monitoring process itself affects the normal operation of the system.
  • Hard-to-predict manpower and time investment for open source or in-house solutions: associated components or resources must be configured or built independently, with little corresponding support and service. As the business keeps changing, will more manpower and time be consumed, and can the open source community or the enterprise's own team quickly handle performance problems at large scale?

Alibaba Cloud Kubernetes Monitoring: making container cluster monitoring more intuitive and simple

Based on these insights and extensive hands-on experience, Alibaba Cloud launched the Kubernetes Monitoring service. Alibaba Cloud Kubernetes Monitoring is a one-stop observability product developed for Kubernetes clusters. Built on the metrics, application traces, logs, and events of a Kubernetes cluster, it aims to provide an end-to-end observability solution for IT development and operations personnel. It has the following six features:

  • Non-intrusive to code: network performance data is obtained through bypass technology, with no code instrumentation required.
  • Multi-language support: network protocols are parsed at the kernel layer, supporting any language and framework.
  • Low overhead and high performance: based on eBPF technology, network performance data is collected with very low overhead.
  • Automatic resource topology: in addition to the network topology, a resource topology shows how related resources are associated.
  • Multidimensional data presentation: supports all types of observable data (metrics, traces, logs, and events).
  • Closed-loop correlation: correlates observable data across the architecture layer, application layer, container runtime layer, container control layer, and basic resource layer.

Compared with open source container monitoring, Alibaba Cloud Kubernetes Monitoring also offers differentiated value that is closer to business scenarios:

  • No upper limit on data volume: metrics, traces, and logs are stored independently, and cloud storage ensures low-cost, high-capacity retention.
  • Efficient resource correlation and interaction: by monitoring network requests, a complete network topology is built, making it easy to view service dependencies and improving O&M efficiency. Beyond the network topology, a 3D topology view shows the network topology and the resource topology at the same time, speeding up fault location.
  • Diverse data combinations: metrics, traces, logs, and other data can be visualized and freely combined to uncover optimization opportunities.
  • A complete monitoring system: built together with the other sub-products of the Application Real-Time Monitoring Service. Application monitoring focuses on the application language runtime, application frameworks, and business code; Kubernetes Monitoring focuses on the container runtime, container control plane, and system calls of containerized applications. The two serve applications at different levels and complement each other. Prometheus is the infrastructure for collecting, storing, and querying metrics, and both application monitoring and Kubernetes Monitoring metric data depend on it (a minimal query sketch follows this list).
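
As a minimal sketch of that shared Prometheus backend, the snippet below correlates a container-layer metric from cAdvisor with an application-layer metric by pod; the Prometheus URL and the application metric name are hypothetical placeholders.

    # A minimal sketch: correlate container-layer and application-layer metrics
    # stored in the same Prometheus, keyed by pod. The application metric name
    # and the Prometheus URL are hypothetical placeholders.
    import requests

    PROM_URL = "http://prometheus.example.internal:9090"

    def by_pod(promql: str) -> dict:
        r = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10).json()
        return {s["metric"].get("pod", "?"): float(s["value"][1]) for s in r["data"]["result"]}

    cpu  = by_pod('sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)')  # cAdvisor
    reqs = by_pod('sum(rate(http_requests_total[5m])) by (pod)')                # assumed app metric

    for pod in sorted(cpu.keys() & reqs.keys()):
        print(f"{pod}: {cpu[pod]:.3f} CPU cores for {reqs[pod]:.1f} req/s")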

Based on the above product features and differentiated value, these capabilities can be applied in the following scenarios:

  • Use the default or custom inspection rules of Kubernetes Monitoring to detect anomalies in nodes, services, and workloads. Kubernetes Monitoring inspects nodes, services, and workloads along three dimensions: performance, resources, and control. The analysis results are displayed in normal, warning, and critical states with distinct colors, helping operations personnel intuitively perceive the running status of nodes, services, and workloads.

  • Use Kubernetes Monitoring to locate the root cause of failed responses from services and workloads. Kubernetes Monitoring analyzes network protocols to store the details of failed requests, and associates failed-request metrics with those details to locate the cause of the failure.
  • Use Kubernetes Monitoring to locate the cause of slow responses from services and workloads. By capturing the critical path of network links, Kubernetes Monitoring collects indicators such as DNS resolution performance, TCP retransmission rate, and network packet RTT, and uses these critical-path indicators to locate the reason for slow responses and optimize the related services (a sketch of checking such indicators follows this scenario list).

  • Use Kubernetes Monitoring to explore the application architecture and discover unexpected network traffic. Kubernetes Monitoring lets you view the large topology built from global traffic and configure static ports to identify specific services. The topology view offers intuitive, powerful interactions for exploring the application architecture and verifying whether traffic meets expectations and whether the architecture is reasonable.

  • Use Kubernetes Monitoring to detect uneven utilization of node resources and rebalance or provision node resources in advance to reduce operational risk (a sketch of spotting such imbalance also follows the list).
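
As a minimal sketch of the slow-response scenario above, the snippet below pulls two critical-path network indicators and flags services that exceed a threshold; both metric names, the service label, and the thresholds are hypothetical placeholders standing in for whatever your monitoring stack exposes.

    # A minimal sketch: flag services whose DNS resolution time or TCP
    # retransmission ratio looks abnormal. All metric names, labels, and
    # thresholds here are hypothetical placeholders.
    import requests

    PROM_URL = "http://prometheus.example.internal:9090"

    CHECKS = {
        "dns_p95_seconds": (
            'histogram_quantile(0.95, '
            'sum(rate(dns_resolution_duration_seconds_bucket[5m])) by (le, service))',
            0.2,
        ),
        "tcp_retransmit_ratio": (
            'sum(rate(tcp_retransmit_segments_total[5m])) by (service) / '
            'sum(rate(tcp_out_segments_total[5m])) by (service)',
            0.02,
        ),
    }

    for name, (promql, threshold) in CHECKS.items():
        data = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10).json()
        for series in data["data"]["result"]:
            value = float(series["value"][1])
            if value > threshold:
                # These services are candidates for slow-response root cause analysis.
                print(name, series["metric"].get("service", "?"), round(value, 4))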
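
And as a minimal sketch of spotting uneven node utilization, the snippet below compares per-node CPU usage from node-exporter data in the same hypothetical Prometheus and reports nodes that deviate strongly from the cluster average.

    # A minimal sketch: report nodes whose CPU utilization deviates strongly from
    # the cluster average, as a starting point for rebalancing or adding capacity.
    import statistics
    import requests

    PROM_URL = "http://prometheus.example.internal:9090"

    # Fraction of CPU busy per node over the last 5 minutes (node-exporter data).
    QUERY = '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))'

    data = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10).json()
    usage = {s["metric"]["instance"]: float(s["value"][1]) for s in data["data"]["result"]}

    if usage:
        mean = statistics.mean(usage.values())
        for node, busy in sorted(usage.items(), key=lambda kv: kv[1], reverse=True):
            if abs(busy - mean) > 0.25:  # flag nodes far above or below the average
                print(f"{node}: {busy:.0%} CPU busy (cluster mean {mean:.0%}); consider rebalancing")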

Kubernetes Monitoring is currently in full public beta and is free of charge during the beta period. Let Kubernetes Monitoring help you get rid of mechanical, repetitive operations.

This article is Alibaba Cloud original content and may not be reproduced without permission.