Summary: If monitoring tells us that a system is broken, observability tells us where it is broken and why. Observability can do more than judge whether a system is healthy: it can surface risks proactively, before they turn into failures.

Author: Ten Sleep, water for

Introduction to observability

Peter Drucker once said, “If you can’t quantify it, you can’t manage it.” Observability is an important part of keeping microservices running soundly. It helps answer questions such as “Are our systems still working?”, “Is the end-user experience as expected?”, and “How do we proactively identify risks before the system fails?” If monitoring tells us that a system is broken, observability tells us where it is broken and why. Observability can do more than judge whether a system is healthy: it can surface risks proactively, before they turn into failures.

From the system's perspective, monitoring is mainly an Ops concern: it focuses on discovering problems in order to keep the system stable. Observability takes a white-box view and focuses on recall and precision. It spans Dev, Test, Ops and other roles and combines a variety of observation methods, so that the root cause is found and the problem is prevented from recurring.
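One way to read "recall and precision" here: recall is the share of real faults that the tooling identified, and precision is the share of reported root causes that turned out to be correct. The sketch below computes both from made-up counts. The function name and the numbers are illustrative assumptions, not figures from ARMS.

```python
from typing import Tuple


def precision_recall(true_positives: int, false_positives: int,
                     false_negatives: int) -> Tuple[float, float]:
    """Return (precision, recall) for a batch of diagnoses."""
    reported = true_positives + false_positives   # everything the tool flagged
    actual = true_positives + false_negatives     # everything that really failed
    precision = true_positives / reported if reported else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts: 8 correct diagnoses, 2 wrong ones, 2 faults missed.
precision, recall = precision_recall(8, 2, 2)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A tool that only raises alarms can score well on recall while leaving precision low, which is the gap that root cause analysis is meant to close.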

Observability challenges for cloud native microservice applications

At present, common microservice frameworks include Spring Cloud, Dubbo, and a range of multi-language frameworks. They provide basic capabilities such as service registration and discovery, service configuration, load balancing, API gateways, distributed microservices and so on. On top of these, service governance includes lossless offline, service fault tolerance, service routing and other capabilities, and observability includes application monitoring, tracing, log management, application diagnostics, and so on.

With the advent of cloud native, microservice architectures are increasingly widely used. Deployment has moved from machine-centered ECS cloud servers to cloud native deployment with the container at its core. To become more agile, Alibaba Cloud began to build microservices around the application. Now that microservices have reached a certain scale, Alibaba Cloud is focusing on the business core and on service governance in order to improve efficiency and stability.

Microservice observability in a cloud native environment faces three main challenges:

  • Problems are hard to discover

With the move from ECS cloud servers to Kubernetes, the microservice architecture becomes more complex, the objects to observe become more complex, and monitoring data does not cover everything.

  • Problems are hard to locate

As governance capabilities deepen, observability requirements rise. The service framework becomes more complex, the technical threshold rises, the data itself becomes more complex, and the data is poorly correlated.

  • Collaboration is difficult

As organizational roles change, observability is no longer needed by operations alone.

Application Real-Time Monitoring Service (ARMS), Alibaba Cloud's observability product, supports automatic detection of some problems. It currently covers more than 50 fault scenarios, including application changes, large requests, and QPS surges, and the recognition rate of its diagnosis reports is as high as 80%.

As shown in the figure below, 7% of online applications spend significant time in Dubbo RPC calls, yet the root cause cannot be located because of gaps in the instrumentation points.

In the course of serving customers, Alibaba Cloud has found many problems.

  • Service discovery

At present, some monitoring tools cannot diagnose problems at the service discovery level of the service framework, so many service invocation problems are hard to troubleshoot, and monitoring alone leaves customers with nowhere to start. We therefore want to provide the following service discovery monitoring and diagnosis capabilities, to help customers promptly troubleshoot application exceptions caused by service discovery problems.

(1) Monitoring of "No Provider" errors on the client;

(2) Which registry the microservice application is connected to, shown as a service discovery call diagram whose main blocks are the Provider, the Consumer, and the registry; clicking a component shows its detailed address;

(3) Whether the application's services registered successfully;

(4) The number and content of the addresses in the most recent push;

(5) Whether the heartbeat between the application and the registry is healthy;

(6) Registry status information, such as CPU, memory, and other hardware status, along with the number of registered services, the subscribed services, service content, and other information.
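Several of the checks listed above reduce to simple rules over data the client already holds. The following is a minimal sketch, assuming a hypothetical `RegistrySnapshot` structure and a 15-second heartbeat timeout chosen only for illustration. None of these names come from ARMS, Dubbo, or any registry client.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RegistrySnapshot:
    """Hypothetical view of what a consumer last received from the registry."""
    # service name -> provider addresses received in the most recent update
    providers: Dict[str, List[str]] = field(default_factory=dict)
    # seconds since the last successful heartbeat with the registry
    seconds_since_heartbeat: float = 0.0


def diagnose(snapshot: RegistrySnapshot, subscribed: List[str],
             heartbeat_timeout: float = 15.0) -> List[str]:
    """Return human-readable findings for the subscribed services."""
    findings = []
    if snapshot.seconds_since_heartbeat > heartbeat_timeout:
        findings.append(
            f"heartbeat stale: {snapshot.seconds_since_heartbeat:.0f}s since last success")
    for service in subscribed:
        addresses = snapshot.providers.get(service, [])
        if not addresses:
            findings.append(f"no provider: {service}")
        else:
            findings.append(f"{service}: {len(addresses)} address(es) in last update")
    return findings


# Illustrative data: one service has a provider, the other has none.
snapshot = RegistrySnapshot(
    providers={"OrderService": ["10.0.0.1:20880"]},
    seconds_since_heartbeat=30.0,
)
for finding in diagnose(snapshot, ["OrderService", "PayService"]):
    print(finding)
```

Reporting the address count next to the "no provider" finding helps separate an empty address list from a registry connection that has gone stale.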

  • Microservice life cycle

Microservices start slowly: 3 minutes for one server and 30 minutes for 5 servers. During application startup, we want observability over Spring bean loading, connection pool establishment, microservice service registration, and Kubernetes health checks. When the application goes offline, the service is deregistered, in-flight requests are stopped, scheduled tasks and MQ consumers are cancelled, and the service is stopped. For example, when a Spring bean fails to initialize, we want to know which bean the load is stuck on and which bean takes a long time to initialize. This helps users analyze the cause of a slow startup and automatically provides repair suggestions. However, the process as a whole currently lacks this observability.
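The offline sequence described above can be made observable by timing each step as it runs. This is a minimal sketch: the step names follow the paragraph, the actions are empty placeholders, and reading the first step as "deregister from the registry" is an assumption.

```python
import time
from typing import Callable, List, Tuple


def run_offline_sequence(
        steps: List[Tuple[str, Callable[[], None]]]) -> List[Tuple[str, float]]:
    """Run shutdown steps in order and record how long each took, in seconds."""
    timings = []
    for name, action in steps:
        started = time.perf_counter()
        action()
        timings.append((name, time.perf_counter() - started))
    return timings


# Placeholder actions; a real application would call into its framework here.
steps = [
    ("deregister from registry", lambda: None),
    ("stop in-flight requests", lambda: None),
    ("cancel scheduled tasks and MQ consumers", lambda: None),
    ("stop service", lambda: None),
]
for name, seconds in run_offline_sequence(steps):
    print(f"{name}: {seconds * 1000:.1f} ms")
```

Recording a duration per step makes a shutdown that hangs on one stage stand out immediately.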

  • Call chain

The Consumer's calls time out even though the Provider returns quickly.

In addition, there is a series of other problems: microservice configuration is chaotic and hard to sort out, and after a microservice application moves to Kubernetes, its thread pool fills up but the cause cannot be found.

Therefore, when considering how to build the system from the microservice perspective, we propose a solution that enhances microservice observability. What more can be done on top of traditional monitoring solutions?

Observability exploration and practice in microservice scenarios

What problems does microservice observability enhancement solve?

In a word: comprehensively enhance observability in microservice scenarios.

Frontline O&M personnel gain basic microservice diagnosis capabilities, which enable them to troubleshoot 80% of common microservice problems and to quickly perform performance analysis and diagnosis.

The ARMS microservice observability enhancement solution answers the following questions:

  • Why is the service slow to start?

From Pod creation through application initialization and service registration to application startup, the root cause of slow startup is analyzed end to end, completing observability of the application startup life cycle.

  • Are there hidden risks in dependencies?

Analyze the Jar packages that Spring Cloud/Dubbo applications depend on to determine whether there are Jar dependency conflicts and other problems.

  • Configuration analysis

In microservice scenarios, configurations are scattered and redundant. The solution provides runtime observability of application configuration and expert experience for configuration optimization.

  • Dubbo call chain enhancement

Cover the instrumentation points for addressing, serialization, networking, and so on, to see at a glance where the time in a Dubbo call went.

Why does the service start slowly? From Pod creation through application initialization and service registration to application startup, the root cause of slow startup is analyzed end to end, completing observability of the application startup life cycle.

By connecting the entire process in series, the time taken at each point is observed in real time, and the observability view breaks the problem down. The image above shows the ARMS container startup analysis feature. On the left is service startup. The system breaks down each step of the startup process so that you can clearly see which step of the microservice startup is slow.
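The per-step breakdown can be illustrated with a small ranking over recorded durations. The step names and timings below are invented for the example (they add up to 3 minutes only to echo the figure quoted earlier) and are not output from ARMS.

```python
from typing import Dict, List, Tuple


def slowest_steps(durations_ms: Dict[str, float],
                  top: int = 3) -> List[Tuple[str, float, float]]:
    """Return the slowest steps as (name, duration_ms, share_of_total)."""
    total = sum(durations_ms.values())
    ranked = sorted(durations_ms.items(), key=lambda item: item[1], reverse=True)
    return [(name, ms, ms / total if total else 0.0) for name, ms in ranked[:top]]


# Hypothetical startup timeline for one instance, in milliseconds.
startup = {
    "pod scheduling and image pull": 20_000,
    "JVM start": 5_000,
    "Spring bean initialization": 120_000,
    "connection pool setup": 25_000,
    "service registration": 10_000,
}
for name, ms, share in slowest_steps(startup, top=2):
    print(f"{name}: {ms / 1000:.0f}s ({share:.0%})")
```

Showing each step as a share of the total makes it clear at a glance whether the delay sits in the platform (Pod creation) or in the application (bean initialization).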

The microservice engine provides lossless online capability, that is, bringing new instances online without losing traffic. With dynamic configuration in the console and a real-time observability view of lossless online and offline, it is a complete solution that requires no code changes. Protection and governance are applied throughout microservice startup. In the connection pre-establishment stage, connections are created asynchronously in advance so that connection establishment does not block. In the service registration and discovery stage, parallel registration and subscription further speed up application startup. In the low-traffic warm-up stage, client load balancing is adjusted so that traffic to the new instance grows slowly.
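The low-traffic warm-up stage can be pictured as a load-balancing weight that grows with instance uptime. The sketch below assumes a linear ramp and a 120-second window; the source does not state which curve or window MSE uses, and `warmup_weight` is a hypothetical helper, not an MSE or Dubbo API.

```python
def warmup_weight(uptime_s: float, warmup_s: float, full_weight: int) -> int:
    """Linearly scale an instance's weight during its warm-up window."""
    if warmup_s <= 0 or uptime_s >= warmup_s:
        return full_weight
    scaled = int(full_weight * uptime_s / warmup_s)
    # Keep at least weight 1 so a new instance still receives a trickle of traffic.
    return max(1, min(scaled, full_weight))


for uptime in (0, 30, 60, 120):
    print(uptime, warmup_weight(uptime, warmup_s=120, full_weight=100))
```

A weighted load balancer on the client would then send the new instance a share of requests proportional to this weight, so caches and connection pools fill before full traffic arrives.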

Because the override relationships among microservice configurations are complex, configuration analysis is required.

The figure above shows the configuration override relationships defined by Dubbo, which follow a certain order. It is often hard to tell whether a configuration is mismatched, in effect, or overridden. In microservice scenarios where configurations are scattered and redundant, we provide runtime observability of application configuration and expert experience in configuration optimization.
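Whether a setting is in effect or overridden can be worked out by walking the configuration sources in priority order. The sketch below is generic: the layer names and their order are assumptions made for illustration and should be checked against the Dubbo documentation, and `resolve` is a hypothetical helper.

```python
from typing import Dict, List, Optional, Tuple

Layer = Tuple[str, Dict[str, str]]


def resolve(layers: List[Layer], key: str
            ) -> Tuple[Optional[str], Optional[str], List[Tuple[str, str]]]:
    """Resolve a key across layers ordered from highest to lowest priority.

    Returns (effective_value, winning_layer, overridden), where overridden
    lists the (layer, value) pairs that set the key but did not win.
    """
    hits = [(name, values[key]) for name, values in layers if key in values]
    if not hits:
        return None, None, []
    winning_layer, effective_value = hits[0]
    return effective_value, winning_layer, hits[1:]


# Assumed priority order, highest first; the values are illustrative.
layers = [
    ("JVM system properties", {"dubbo.consumer.timeout": "3000"}),
    ("configuration center", {}),
    ("application code or XML", {"dubbo.consumer.timeout": "1000"}),
    ("local properties file", {"dubbo.consumer.timeout": "5000",
                               "dubbo.protocol.port": "20880"}),
]
value, source, overridden = resolve(layers, "dubbo.consumer.timeout")
print(f"effective: {value} (from {source})")
for layer, lost_value in overridden:
    print(f"overridden: {lost_value} (from {layer})")
```

Listing the values that lost, not only the winner, is what answers the "why is my setting not taking effect" question.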

We provide the ability to analyze the Jar packages that Spring Cloud/Dubbo applications rely on, to help locate Jar dependency conflicts, security risks, performance risks, and so on.
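One simple form of dependency conflict is the same artifact appearing on the classpath in more than one version. The sketch below detects that from Jar file names alone, assuming the common "artifact-version.jar" naming. It is an illustration of the idea, not how ARMS performs its analysis, and unusual file names will be parsed incorrectly or skipped.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Assumes "artifact-version.jar", where the version starts with a digit.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_version_conflicts(jar_names: List[str]) -> Dict[str, List[str]]:
    """Map each artifact that appears in several versions to those versions."""
    versions: Dict[str, Set[str]] = defaultdict(set)
    for name in jar_names:
        match = JAR_NAME.match(name)
        if match:
            versions[match.group("artifact")].add(match.group("version"))
    return {artifact: sorted(found)
            for artifact, found in versions.items() if len(found) > 1}


# Illustrative classpath listing.
jars = [
    "guava-20.0.jar",
    "guava-31.1-jre.jar",
    "netty-all-4.1.50.Final.jar",
    "spring-core-5.3.20.jar",
]
print(find_version_conflicts(jars))
```

A fuller check would also compare the classes inside each Jar, since two differently named artifacts can ship the same class.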

Where on earth did the time in an RPC call go? An RPC call involves routing, rate limiting and degradation, serialization, and the network. On the client, the call passes through routing, Filter, Invoker, serialization, and Remote. On the server, it passes through serialization, Proxy Invoke, Filter, and the implementation.

The diagram above is a flow chart of an RPC call. The total time includes connection establishment during addressing and load balancing, serialization when packing the request, deserialization when unpacking the return value, server processing, and waiting for the server to return.
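The components listed above can be combined into a simple per-call attribution: subtract what the client and server measured from what the client observed in total, and treat the rest as time spent in between. The phase names and numbers below are illustrative assumptions, and the remainder is an estimate rather than a measurement.

```python
from typing import Dict


def breakdown(client_total_ms: float, client_phases_ms: Dict[str, float],
              server_total_ms: float) -> Dict[str, float]:
    """Attribute one call's client-observed latency to phases.

    Time not explained by the client phases or by server processing is
    reported as "network and waiting".
    """
    result = dict(client_phases_ms)
    result["server processing"] = server_total_ms
    explained = sum(result.values())
    result["network and waiting"] = max(0.0, client_total_ms - explained)
    return result


# Hypothetical call: the Consumer saw 1000 ms, the Provider spent only 20 ms.
call = breakdown(
    client_total_ms=1000.0,
    client_phases_ms={
        "addressing and load balancing": 5.0,
        "serialization": 40.0,
        "deserialization": 15.0,
    },
    server_total_ms=20.0,
)
slowest = max(call, key=call.get)
print(slowest, call[slowest])
```

This is the shape of the "Consumer times out, Provider returns quickly" case: the server time is small, so the investigation moves to the network, connection setup, or queueing on either side.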

That is our answer. We further subdivide the call chain inside the RPC framework so that the time spent in routing, serialization, network, proxy, server processing and so on is visible at a glance.

Conclusion

Building on the traditional observability approach, we have expanded the data that traditional observability covers, including Tracing, Logging, and Metrics, from the microservice perspective, and combined it with the diagnostic experience of microservice experts.

From the front end and the application down to the underlying machine, ARMS monitors every run, every slow SQL statement, and every exception of the application service in real time. It also provides complete monitoring data, showing key indicators such as request count, response time, Full GC count, slow SQL count, exception count, and the number and duration of calls between applications, so that you can keep track of how the application is running and deliver the best possible user experience.

Alibaba Cloud's microservice engine MSE has been newly upgraded. In the governance center, MSE improves the efficiency and stability of microservice development. It supports Spring Cloud and Dubbo applications from nearly the past 5 years, as well as multi-language heterogeneous microservice systems, and provides differentiated capabilities such as lossless offline, full-link grayscale, outlier instance removal, and service authentication. In the registry and configuration center, MSE offers fully managed Zookeeper/Nacos/Eureka services. They are highly available by default, with deployment across multiple availability zones and automatic detection, and they support configuration authentication, encryption, and grayscale publishing. In the cloud native gateway, MSE integrates alert monitoring, tracing, rate limiting and degradation, and certificate management. It combines the traffic gateway and the microservice gateway into one, reducing cost by 50%.


This article is the original content of Aliyun and shall not be reproduced without permission.