
Hello everyone, I am Li Huangdong from Aliyun. Today, I will share with you the second part of the Kubernetes monitoring open class: how to find service and workload anomalies in Kubernetes.

This talk consists of three parts:

1. The pain points of locating anomalies in Kubernetes;

2. How Kubernetes Monitoring addresses these pain points to find anomalies faster, more accurately, and more completely;

3. Typical case analysis of network performance monitoring and middleware monitoring.

Pain points of locating exceptions in Kubernetes

In today's Internet architectures, more and more companies are adopting microservices + Kubernetes. This kind of architecture has the following characteristics:

  1. First, the application layer is built on microservices. An application is decomposed into a number of decoupled services that call each other; each service generally has clear responsibilities and clear boundaries. The result is that even a simple product can consist of dozens or even hundreds of services, whose mutual dependencies and calls are very complex, and this greatly raises the cost of locating problems. At the same time, the owners of these services may come from different teams, follow different development practices, and use different languages. The impact on monitoring is that a monitoring tool has to be integrated for each language, which leads to a low return on investment. Another characteristic is multi-protocol: almost every piece of middleware (Redis, MySQL, Kafka) has its own protocol, and quickly gaining observability over all of these protocols is no small challenge.
  2. Kubernetes and containers shield the upper-layer applications from the complexity of the lower layers, but this has two consequences: the infrastructure stack keeps getting deeper, and the information gap between the upper-layer applications and the infrastructure keeps growing. For example, a user reports that a website is slow; the administrator checks the access logs, the service status, and the resource levels and finds everything normal, and at this point does not know where the problem occurred. Even though the infrastructure is suspected, the problem cannot be narrowed down, and going through it screen by screen is inefficient. The root of the problem is that the upper-layer applications and the infrastructure lack a connection between their failure signals, so end-to-end correlation is impossible.
  3. The last pain point is that the data is scattered, the tools are many, and the information is not connected. For example, suppose we receive an alert and use Grafana to look at the metrics, but the metrics are fairly coarse, so we have to look at the logs. We go to the SLS log service and the logs look fine, so we log into the machine to check there, but the container may have restarted and the logs are gone. After a few rounds we decide the problem may not be in the application after all, so we go back to distributed tracing to see whether something is wrong downstream. All in all, there are many tools and a dozen windows open in the browser, which is inefficient and makes for a poor experience.

These three pain points can be summarized as cost, efficiency, and experience. With these pain points in mind, let's look at the data model behind Kubernetes Monitoring and see how it better solves the three big problems of cost, efficiency, and experience.

How Kubernetes Monitoring finds exceptions

The pyramid below shows the density, or level of detail, of the information from top to bottom: the lower the layer, the more detailed the information. Starting from the bottom, the Trace layer collects application-layer protocol data such as HTTP, MySQL, and Redis in a non-intrusive, multi-protocol, multi-language way through eBPF technology. The protocol data is further parsed into easy-to-understand request details, response details, and per-stage timing information.

The next layer is metrics, which mainly consist of golden metrics, network metrics, and Kubernetes system metrics. The golden metrics and network metrics are collected based on eBPF, so they are likewise non-intrusive and support various protocols. With the golden metrics, we can tell at a high level whether a service is abnormal, slow, or affecting users. The network metrics monitor the state of the network, such as packet loss rate, retransmission rate, and RTT. The Kubernetes system metrics are those of the original Kubernetes monitoring stack, namely cAdvisor, MetricServer, Node Exporter, and NPD.

The next level up is events, which tell us exactly what happened; the ones we encounter most are probably Pod restarts, image pull failures, and so on. We persist Kubernetes events and keep them for a period of time to help locate problems. Our inspections and health checks also report their results as events.

The top layer is alerting. Alerting is the last link in the monitoring system. When an exception may damage the business, we need to configure alerts on metrics and events. Alerting currently supports PromQL, and intelligent alerting supports anomaly detection on historical data to discover potential abnormal events. Alert configuration supports dynamic thresholds, so you can adjust the sensitivity instead of hard-coding threshold values. Once we have traces, metrics, events, and alerts, we use a topology diagram to associate this data with Kubernetes entities. Each node corresponds to a service or workload in Kubernetes, and calls between services are represented as edges. With the topology, we can quickly identify anomalies and analyze them further, including upstream and downstream, dependencies, and the blast radius, just like having a map. In this way we gain a more comprehensive grasp of the system.
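
For readers who want to try something similar outside the product, here is a minimal sketch, assuming a Prometheus-compatible endpoint at a placeholder URL and a hypothetical `http_requests_total` counter, of how a PromQL expression can be evaluated over HTTP and compared against a static error-rate threshold. It illustrates the idea only; it is not the product's alerting implementation.

```python
import json
import urllib.parse
import urllib.request

# Assumptions: a Prometheus-compatible API is reachable at PROM_URL and the
# cluster exposes a counter named http_requests_total with a "status" label.
PROM_URL = "http://localhost:9090/api/v1/query"
QUERY = 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
ERROR_RATE_THRESHOLD = 0.05  # alert when more than 5% of requests fail


def query_prometheus(expr: str):
    """Run an instant PromQL query and return the list of result samples."""
    url = PROM_URL + "?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=5) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]


if __name__ == "__main__":
    for sample in query_prometheus(QUERY):
        _, value = sample["value"]  # each sample value is [timestamp, value-as-string]
        error_rate = float(value)
        if error_rate > ERROR_RATE_THRESHOLD:
            print(f"ALERT: error rate {error_rate:.2%} exceeds {ERROR_RATE_THRESHOLD:.0%}")
        else:
            print(f"OK: error rate {error_rate:.2%}")
```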

Best practices & Scenario analysis

Next we’ll look at best practices for finding service and workload exceptions in Kubernetes.

First, metrics. Metrics reflect the monitoring status of a service, so we should collect as many kinds as possible, and the more complete the better, not limited to golden metrics, USE metrics, Kubernetes native metrics, and so on. Metrics are macro-level data; for root cause analysis we also need Trace data. With multiple languages and protocols in play, we have to consider the cost of collecting these traces while supporting as many protocols and languages as possible. Finally, a topology ties the metrics, traces, and events together into one diagram for architecture awareness and upstream and downstream analysis.

Through these three kinds of analysis, service and workload exceptions are usually exposed. But we should not stop there: if we leave the exception to be dealt with next time, we will have to do everything all over again. The better approach is to turn what we learned about such exceptions into alert configurations that are managed automatically.

Let’s use a few specific scenarios to elaborate:

(1) Network performance monitoring

For network performance monitoring, let's take retransmission as an example. Retransmission means that the sender resends a packet when it believes the packet has been lost. Take the transmission process in the figure as an example:

  1. The sender sends the packet numbered 1; the receiver accepts it and returns ACK 2
  2. The sender sends the packet numbered 2, but it is lost on the way, so the receiver still returns ACK 2
  3. The sender sends the packets numbered 3, 4, and 5, and the receiver returns ACK 2 for each of them
  4. Once the sender has received the same ACK three times, the retransmission mechanism is triggered, and the retransmission increases latency (the sketch after this list mirrors this duplicate-ACK logic)
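
To make the trigger condition concrete, here is a toy Python sketch, not how any real TCP stack is implemented, that counts duplicate ACKs on the sender side and reports when fast retransmit would fire for the scenario above.

```python
# Illustrative only: a toy sender-side model of TCP fast retransmit.
# A real TCP stack lives in the kernel; this simply mirrors the list above.

def process_acks(acks):
    """Return the sequence numbers that would be fast-retransmitted."""
    retransmitted = []
    last_ack = None
    dup_count = 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            # Three duplicate ACKs of the same value trigger fast retransmit.
            if dup_count == 3:
                retransmitted.append(ack)  # resend the segment the peer expects
        else:
            last_ack = ack
            dup_count = 0
    return retransmitted


# The scenario above: packet 2 never arrives, so every later packet makes the
# receiver repeat ACK 2.
print(process_acks([2, 2, 2, 2]))  # -> [2]; segment 2 is retransmitted
```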

Code and logs give us no visibility here, and in such cases it is ultimately hard to find the root cause. To locate this kind of problem quickly, we need a set of network performance metrics as a basis for localization: P50, P95, and P99 percentiles to represent latency, plus traffic, retransmission, RTT, and packet loss metrics to represent the state of the network.
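
As a hedged illustration of what P50/P95/P99 mean (the product computes these from its own collected data; the samples below are made up), here is a minimal nearest-rank percentile sketch in Python:

```python
# Minimal sketch: deriving P50/P95/P99 latency from raw samples.
# Real monitoring systems compute these over time windows, often from
# histograms; the sample values below are invented for illustration.

def percentile(samples, p):
    """Nearest-rank percentile: the value below which roughly p% of samples fall."""
    ordered = sorted(samples)
    rank = max(0, int(round(p / 100.0 * len(ordered))) - 1)
    return ordered[rank]


latencies_ms = [12, 15, 14, 220, 13, 16, 18, 900, 14, 15, 17, 13, 16, 14, 15]

for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
# A few slow outliers barely move P50 but dominate P95/P99, which is why all
# three are needed to judge latency.
```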

Take a service with a high RT as an example. First, we see that an edge in the topology is red; whether an edge is colored red is judged from latency and errors. When we find this red edge, we click it to see the corresponding golden metrics.

Click the button at the bottom left to view the network data list for the current service, which can be sorted by average response time, retransmission, and RTT. We can see that the first service call has a relatively high latency, close to one second of response time, and its retransmission count is much higher than that of the other services. (This retransmission spike was actually injected with a fault-injection tool, which makes it more obvious.) From this analysis we know the network is probably the problem and can investigate further. Experienced developers will generally take the network metrics, service name, IP, and domain name to their network colleagues, rather than just telling them "my service is slow." With so little information, the other side cannot troubleshoot proactively because they would not know where to start; when we provide the relevant data, they have something to go on, and the investigation can move forward.

(2) DNS resolution exceptions

The second scenario is DNS resolution exceptions. DNS is usually the first step of protocol communication: for an HTTP request, the first step is to obtain the IP address, which is the process commonly referred to as service discovery. If this first step fails, the whole call fails directly; it is a critical path that must not break. In a Kubernetes cluster, all DNS queries go through CoreDNS, so CoreDNS easily becomes a bottleneck. Once it has problems, the impact is very large, and the whole cluster may become unavailable. As a vivid example, two months ago Akamai, a well-known CDN company, had a DNS failure that made many websites such as Airbnb inaccessible for an hour.

There are three core DNS resolution scenarios in a Kubernetes cluster:

  1. Calling an external API gateway
  2. Calling cloud services, which are generally on the public network
  3. Calling external middleware

Here are some common CoreDNS problems; you can use them as a reference to check whether your cluster has similar issues:

  1. Configuration problems (the ndots problem). ndots is a number: if the number of dots in the requested domain name is less than ndots, the resolver tries the domains in the search list first, which turns one lookup into multiple queries and can have a significant impact on performance (see the sketch after this list).
  2. Because all domain name resolution in Kubernetes goes through CoreDNS, it easily becomes a performance bottleneck. Some statistics suggest paying attention to performance once QPS reaches roughly 5000–8000, especially when the cluster depends heavily on external Redis, MySQL, and other high-traffic services.
  3. Lower versions of CoreDNS have stability issues, which are also a concern.
  4. Some languages, such as PHP, do not support connection pooling very well, so a DNS resolution is performed every time a connection is created; this phenomenon is also quite common.
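
To see why the ndots configuration in item 1 multiplies queries, here is an illustrative sketch of the resolver's search-list expansion. The ndots value and search domains mimic a typical Kubernetes resolv.conf, but they are assumptions; check the actual file inside your own Pods.

```python
# Illustrative sketch of how a resolver with "ndots" and a search list expands
# a single lookup into multiple DNS queries.

NDOTS = 5
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
]


def queries_for(name: str):
    """Return the fully qualified names the resolver will try, in order."""
    if name.endswith("."):          # already absolute: no search-list expansion
        return [name]
    if name.count(".") >= NDOTS:    # "enough" dots: try the name as-is first
        return [name + "."] + [f"{name}.{d}." for d in SEARCH]
    # Fewer dots than ndots: walk the search list first, then the bare name.
    return [f"{name}.{d}." for d in SEARCH] + [name + "."]


# A short external name triggers four queries instead of one:
for q in queries_for("api.example.com"):
    print(q)
```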

Next, let's look at where problems can arise around CoreDNS in Kubernetes. First, there may be network problems between the application and CoreDNS. Second, CoreDNS itself may have problems, for example returning SERVFAIL or REFUSED error codes, or even returning wrong results because the Corefile is misconfigured. Third, there may be network interruptions or performance problems when communicating with the external DNS. Finally, the external DNS itself may be unavailable.

For these problems, the troubleshooting steps can be summarized as follows:

First, on the client side, look at the request content and the return code; if an error code is returned, the problem is on the server side. If resolution is slow, look at the time waterfall to see where the time is spent.

Second, check whether the network is normal; traffic, retransmission, packet loss, and RTT metrics are enough for this.

Third, look at the server side: check the traffic, error, latency, and saturation metrics, and then the CPU, memory, disk, and other resource metrics; this is usually enough to locate the problem.

Fourth, look at the external DNS. Similarly, we can localize it through the request trace, return code, network traffic, retransmission, and other metrics.
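
As a rough sketch of the first, client-side step, the snippet below times name resolution through the Pod's configured resolver using only the Python standard library. The domain name is a placeholder; in a real investigation you would compare these timings and errors with the CoreDNS and network metrics above.

```python
import socket
import time

# Placeholder name: replace with the domain that is failing or slow for you.
NAME = "api.example.com"
ATTEMPTS = 5


def time_resolution(name: str) -> float:
    """Resolve a name via the Pod's configured resolver and return seconds taken."""
    start = time.monotonic()
    socket.getaddrinfo(name, None)   # raises socket.gaierror on NXDOMAIN/SERVFAIL
    return time.monotonic() - start


for i in range(ATTEMPTS):
    try:
        elapsed = time_resolution(NAME)
        print(f"attempt {i + 1}: resolved in {elapsed * 1000:.1f} ms")
    except socket.gaierror as err:
        # A resolution error here points at the server side (CoreDNS or upstream).
        print(f"attempt {i + 1}: resolution failed: {err}")
```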

Now let's look at the topology. The red line we see first represents a DNS resolution call with an exception. Clicking it shows the golden metrics of that call; clicking "view list" opens the details page, where we can see the details of the request, that is, which domain name was requested. The request goes through three phases: send, wait, and download, and these metrics all look normal. We then click to see the response and find that the response says the domain name does not exist. So at this point we should take a closer look at whether the external DNS has a problem; the steps are the same, and I will show this in the demo later, so I won't expand on it here.

(3) Full-link stress testing

The third typical scenario is full-link stress testing. In this scenario, the peak is several times the usual load, so how do we ensure stability under such a peak? There are usually several steps: first warm up and verify at that point that the links are normal; then gradually increase traffic up to the expected peak; then push higher, that is, find the maximum TPS the system can support; then increase traffic again, this time mainly to check whether the service can rate-limit properly, because the maximum TPS has already been probed and the additional load is destructive traffic. So in this process, what should we pay attention to?

First of all, for a multi-language, multi-protocol microservice architecture, for example Java, Golang, and Python applications and RPC, MySQL, Redis, and Kafka application-layer protocols, we need golden metrics for each language and protocol to verify system capability. For system bottlenecks and capacity planning, we need USE metrics to look at resource saturation at each traffic level and decide whether to scale out; with every traffic increment we check the USE metrics again, adjust capacity, and optimize step by step. For a complex architecture, we need a global big picture to help sort out upstream and downstream dependencies and the full-link architecture, and to determine the blast radius: for example, CheckoutService here is a key service, and if a problem occurs at this point, the impact will be large.

First, golden metrics for communication in every language and protocol, with a list view to drill further into the details of each call.

Second, click node details to drill down and view CPU, memory, and other USE resource metrics.

Third, the whole topology reflects the shape of the entire architecture. With this global architectural view, we can identify which services are prone to becoming bottlenecks, how large the blast radius is, and whether high-availability guarantees are needed.
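
As a hedged illustration of the capacity-planning idea in the second point, here is a small sketch that takes CPU saturation observed at a few traffic steps and extrapolates the TPS at which a chosen ceiling would be reached. The observations are invented, the linear model is a simplification, and real services rarely scale this cleanly.

```python
# Illustrative capacity-planning sketch for a load test: given CPU utilisation
# observed at a few traffic steps, estimate the TPS at which the service would
# saturate. A real test would read these values from the USE/saturation
# metrics discussed above.

SATURATION_LIMIT = 0.80  # treat 80% CPU as the safe ceiling

# (tps, average CPU utilisation across replicas) observed while ramping up
observations = [
    (100, 0.12),
    (200, 0.23),
    (400, 0.45),
    (600, 0.66),
]


def estimate_max_tps(points, limit):
    """Fit utilisation ~ slope * tps through the origin and solve for the limit."""
    slope = sum(u / t for t, u in points) / len(points)
    return limit / slope


print(f"estimated max sustainable TPS: {estimate_max_tps(observations, SATURATION_LIMIT):.0f}")
```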

(4) Accessing an external MySQL

Common problems when accessing an external MySQL include the following:

  1. The first is slow queries. A slow query shows up as a high latency metric; at that point we need to look in the trace at what the request actually is, which table and which fields are queried, and then determine whether the query volume is too large, the table is too large, or an index is missing (see the sketch after this list).
  2. The second is oversized query statements, which lead to long transmission times; with even slight network jitter, retries fail and bandwidth gets tied up. This is usually caused by batch updates and inserts, and when it happens the latency metric spikes. In this case we can pick some traces with a higher RT and check how the statement is written and whether it is excessively long.
  3. The third is error code returns, for example when the table does not exist. Parsing out the error code is very helpful, and looking further into the details and the statement makes it much easier to locate the root cause.
  4. The last is network problems, which we have already covered: generally we combine the latency metric with RTT, retransmission, and packet loss to determine whether the network has a problem.
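
As an illustration of the triage in items 1 and 2, here is a small sketch that ranks collected MySQL traces by response time and flags oversized statements. The trace records and thresholds are invented; in practice the data would come from the Trace layer described earlier.

```python
# Illustrative triage over collected MySQL traces: pick the calls with the
# highest response time and flag suspiciously long statements.

RT_THRESHOLD_MS = 500        # treat anything beyond this response time as slow
LONG_STATEMENT_CHARS = 2000  # batch inserts/updates often exceed this

traces = [
    {"rt_ms": 1240, "sql": "SELECT * FROM orders WHERE customer_name = 'foo'"},
    {"rt_ms": 35,   "sql": "SELECT id FROM users WHERE id = 42"},
    {"rt_ms": 890,  "sql": "INSERT INTO audit_log (payload) VALUES " + ", ".join(["(...)"] * 500)},
]

for t in sorted(traces, key=lambda rec: rec["rt_ms"], reverse=True):
    if t["rt_ms"] < RT_THRESHOLD_MS:
        continue
    issues = []
    if len(t["sql"]) > LONG_STATEMENT_CHARS:
        issues.append("statement unusually long (possible batch write)")
    if t["sql"].lstrip().upper().startswith("SELECT *"):
        issues.append("SELECT * (check indexes and column list)")
    print(f"{t['rt_ms']} ms: {t['sql'][:60]}... -> {'; '.join(issues) or 'inspect query plan'}")
```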

The application in the middle box depends on an external MySQL service. Clicking the line in the topology shows the golden metrics; clicking "view list" shows further details of the requests and responses. The table classifies the network data in the current topology by source and target, including request count, error count, average response time, socket retransmission, and socket RTT; clicking the arrows at the top sorts the data by the corresponding column.

(5) Multi-tenant architecture

The fifth typical case is multi-tenant architecture. Multi-tenancy means that different tenants, workloads, or teams share the same cluster; usually one tenant corresponds to one namespace, and resources are logically or physically isolated from each other without mutual interference. Common scenarios are: first, in-house users, where one team corresponds to one tenant, the network inside a namespace is unrestricted, and network policies control traffic between namespaces; second, the SaaS-provider multi-tenant architecture, where each user gets a namespace and the tenants and the platform live in different namespaces. While the Kubernetes namespace mechanism brings convenience to multi-tenant architectures, it also poses challenges for observability. First, with so many namespaces, finding information becomes tedious, which increases the cost of management and understanding. Second, tenants' traffic must be isolated from each other, and when there are many namespaces, abnormal traffic is hard to discover accurately and completely. Third, Trace support for multiple protocols and languages. I once met a customer with more than 400 namespaces in a single cluster: managing them was very painful, and because the applications used multiple protocols and languages, they would have had to be instrumented one by one to support Trace.

This is the cluster home page of our product. Kubernetes entities are grouped by namespace, and search is supported to quickly locate the cluster you want to look at. The bubble chart shows the number of entities in each namespace as well as the number of entities with exceptions; for example, the three namespaces in the box contain Pods with exceptions, and clicking in shows the details. At the bottom of the home page is a performance overview sorted by golden metrics, a Top-N view that lets you quickly see which namespaces are abnormal.
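
Outside the product UI, a rough equivalent of the "which namespaces have abnormal Pods" view can be scripted with the official Kubernetes Python client. This is a minimal sketch, assuming kubeconfig access and using the crude heuristic that any Pod not in the Running or Succeeded phase counts as abnormal (a crashing container can still be in the Running phase, so this is only an approximation).

```python
from collections import Counter

# Requires the official client: pip install kubernetes
from kubernetes import client, config


def abnormal_pods_per_namespace():
    """Count Pods that are neither Running nor Succeeded, grouped by namespace."""
    config.load_kube_config()  # or config.load_incluster_config() inside a Pod
    v1 = client.CoreV1Api()
    counts = Counter()
    for pod in v1.list_pod_for_all_namespaces(watch=False).items:
        if pod.status.phase not in ("Running", "Succeeded"):
            counts[pod.metadata.namespace] += 1
    return counts


if __name__ == "__main__":
    for namespace, count in abnormal_pods_per_namespace().most_common():
        print(f"{namespace}: {count} abnormal pod(s)")
```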

If there are many namespaces in the topology, you can filter to view only the namespaces you care about, or quickly locate one through search. Because nodes are grouped by namespace, the traffic between namespaces is visible as the lines between the groups, which makes it easy to see where a namespace's traffic comes from and whether there is any abnormal traffic.

We summarize the above scenarios as follows:

  1. Network monitoring: how to analyze service errors and interruptions caused by the network, and how to analyze the impact of network problems
  2. Service monitoring: how to use golden metrics to determine whether a service is healthy, and how to view details through multi-protocol Trace
  3. Middleware and infrastructure monitoring: how to use golden metrics and traces to analyze anomalies in middleware and infrastructure, and how to quickly determine whether a problem lies in the network, the component itself, or a downstream service
  4. Architecture awareness: how to perceive the whole architecture through the topology, sort out upstream, downstream, internal, and external dependencies, and so keep the overall situation under control; how to use the topology to ensure that the architecture has sufficient observability and stability; and how to find bottlenecks and the blast radius in the system through topology analysis

Common cases under these scenarios include: network and service availability inspection and health checks; observability assurance for middleware architecture upgrades; verification of new business launches; service performance optimization; middleware performance monitoring; solution selection; full-link stress testing; and so on.

Product value

After the above introduction, we summarize the product value of Kubernetes Monitoring as follows:

  1. Service metrics and trace data are collected in a multi-protocol, multi-language, non-intrusive way, minimizing the cost of onboarding while providing comprehensive coverage of metrics and traces;
  2. With these metrics and traces, we can analyze and drill down into services and workloads in a systematic way;
  3. By associating these metrics and traces into a topology map, we can perform architecture awareness, upstream/downstream analysis, and context correlation on one big picture, fully understand the architecture, evaluate potential performance bottlenecks, and facilitate further architecture optimization;
  4. Alerts can be configured with simple settings, so that experience and knowledge are deposited into alert rules and anomalies are discovered proactively.


This article is the original content of Aliyun and shall not be reproduced without permission.