Author: Ten Sleep, water for

Introduction to observability

Peter Drucker is often quoted as saying, “If you can’t measure it, you can’t manage it.” Observability is an important part of keeping microservices running soundly. It helps answer questions such as “Are our systems still working?”, “Is the end-user experience as expected?”, and “How do we proactively identify risks before the system is about to fail?” If monitoring tells us that the system is broken, observability tells us where it is broken and why. Observability not only determines whether the system is healthy, but also uncovers risks proactively, before problems occur.

From the system's perspective, monitoring is mainly an Ops concern: it focuses on discovering problems in order to keep the system stable. Observability takes a white-box approach and focuses on both recall and precision. It spans the Dev, Test, and Ops roles and combines a variety of observation methods, so that the root cause is found and the problem is prevented in the future.
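As a small worked illustration of what recall and precision mean here, the sketch below scores a detector by comparing the services it flagged with the services that really had a fault. The function name `precision_recall` and the service names are hypothetical and are not taken from any product.

```python
def precision_recall(flagged: set[str], actual: set[str]) -> tuple[float, float]:
    """Score a detector: precision = correct flags / all flags,
    recall = correct flags / all real faults."""
    true_positives = len(flagged & actual)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical data: four services were flagged, three really had a fault.
flagged = {"order-svc", "pay-svc", "cart-svc", "user-svc"}
actual = {"order-svc", "pay-svc", "stock-svc"}
precision, recall = precision_recall(flagged, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision means engineers chase false alarms; low recall means real faults go unnoticed. An observability solution has to keep both high.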

Observability challenges for cloud native microservice applications

Common microservice frameworks today include Spring Cloud, Dubbo, and various multi-language microservice frameworks. They provide basic capabilities such as service registration and discovery, service configuration, load balancing, API gateways, and distributed microservices. Service governance covers capabilities such as lossless offline, service fault tolerance, and service routing. Observability covers application monitoring, link tracing, log management, application diagnostics, and more.

With the advent of cloud native, microservice architectures are used more and more widely. Deployment has moved from the machine-centered cloud server (ECS) model to container-centered cloud native deployment. To be more agile, Alibaba Cloud began to treat the application as the core and to adopt microservices. Now that microservice applications have grown to a certain scale, Alibaba Cloud focuses on the business core and on service governance in order to improve efficiency and stability.

Observability of microservices in a cloud native environment faces three main challenges:

• Difficult discovery: with the move from cloud server ECS to Kubernetes, the microservice architecture and the objects to be observed become more complex, and monitoring data coverage is incomplete.

• Difficult root-cause location: as governance capabilities deepen, the demands on observation rise. The service framework grows more complex, the technical threshold is higher, the data itself becomes more complex, and the data is poorly correlated.

• Poor collaboration: as organizational roles change, observability extends beyond operations.

Application Real-Time Monitoring Service (ARMS), Alibaba Cloud's observability product, supports automatic detection of some application problems. It currently covers more than 50 fault scenarios, including application changes, large requests, and QPS surges, and the recognition rate of its diagnosis reports is as high as 80%.

As shown in the figure below, for 7% of online applications the time is spent in Dubbo RPC calls, yet the root cause cannot be located because of insufficient instrumentation points.

While serving customers, Alibaba Cloud has found many problems.

• Service discovery: some monitoring tools currently cannot diagnose problems at the service discovery level of the service framework, which makes many service invocation problems hard to troubleshoot. With monitoring alone, customers do not know where to start. We therefore want to provide the following service discovery monitoring and diagnosis capabilities, to help customers promptly troubleshoot application exceptions caused by service discovery problems.

(1) Whether a No Provider problem occurs on the client;

(2) Which registry the microservice application is connected to, shown as a diagram of the service discovery call link whose main blocks are Provider, Consumer, and registry; clicking a component shows its detailed address;

(3) Whether the application's services are registered successfully;

(4) The number and content of the addresses in the most recent push;

(5) Whether the heartbeat between the application and the registry is healthy;

(6) Registry status information, such as CPU, memory, and other hardware status, the number of registered services, the number of subscribed services, and the service content.
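The client-side checks in the list above can be sketched as a small rule set over a snapshot of discovery state. This is only an illustration under stated assumptions: `DiscoverySnapshot`, its fields, `diagnose`, and the 15-second heartbeat timeout are all hypothetical and do not reflect the ARMS or registry APIs.

```python
from dataclasses import dataclass, field


@dataclass
class DiscoverySnapshot:
    """Hypothetical client-side view of service discovery state."""
    registry_address: str
    registered: bool
    last_pushed_addresses: list[str] = field(default_factory=list)
    seconds_since_heartbeat: float = 0.0


def diagnose(snapshot: DiscoverySnapshot, heartbeat_timeout: float = 15.0) -> list[str]:
    """Return human-readable findings; an empty list means no problem was found."""
    findings = []
    if not snapshot.registered:
        findings.append("service is not registered")
    if not snapshot.last_pushed_addresses:
        findings.append("no provider: last push contained 0 addresses")
    if snapshot.seconds_since_heartbeat > heartbeat_timeout:
        findings.append("heartbeat to registry is stale")
    return findings


# Hypothetical snapshot: registered, but the last push was empty and the heartbeat is late.
snapshot = DiscoverySnapshot(
    registry_address="registry.example.internal:8848",
    registered=True,
    last_pushed_addresses=[],
    seconds_since_heartbeat=42.0,
)
findings = diagnose(snapshot)
for finding in findings:
    print(f"[{snapshot.registry_address}] {finding}")
```

Keeping each check as an independent rule makes it straightforward to report several discovery problems at once instead of stopping at the first one.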

• Microservice life cycle: microservices start slowly, for example 3 minutes for one server and 30 minutes for five servers. During application startup, we want observability over Spring bean loading, connection pool establishment, microservice registration, and Kubernetes health checks. During application shutdown, the service is deregistered, in-flight requests stop, scheduled tasks and MQ consumption are cancelled, and the service stops. For example, when a Spring bean fails to initialize, we want to know which bean the load is stuck on and which bean takes a long time to initialize, so that users can analyze the cause of slow startup and automatically receive repair suggestions. However, the overall process currently lacks this observability.

• Call chain: the Consumer's outbound call is slow even though the Provider returns quickly.

In addition, there is a series of other problems: microservice configuration is chaotic and hard to sort out, and after a microservice application moves to Kubernetes its thread pool fills up but the cause cannot be found.

Therefore, when considering how to build the system from a microservice perspective, we propose a solution that enhances microservice observability. What more can be done on top of traditional monitoring solutions?

Observability exploration and practice in microservice scenarios

What problems does the microservice observability enhancement solve?

In a word: it comprehensively enhances observability in microservice scenarios.

Frontline O&M personnel gain basic microservice diagnosis capabilities, which enable them to troubleshoot 80% of common microservice problems and to carry out performance analysis and diagnosis quickly.

The ARMS microservice observability enhancement solution answers the following questions:

• Why is service startup slow? From Pod creation through application initialization and service registration to application startup, analyze the root cause of slow startup end to end and complete the observability of the application startup life cycle;

• Are there hidden dependency problems? Analyze the Jar packages that Spring Cloud/Dubbo depend on to locate problems such as Jar dependency conflicts;

• Configuration analysis: in microservice scenarios, configurations are scattered and redundant, so provide observability of the application's runtime configuration together with expert experience in configuration optimization;

• Dubbo call chain: enhance the instrumentation of the addressing, serialization, network, and other phases, to see at a glance where the time of a Dubbo call goes.

Why does the service start slowly? From Pod creation through application initialization and service registration to application startup, the root cause of slow startup is analyzed end to end, which completes the observability of the application startup life cycle.

By connecting the whole process end to end, the time spent at each step is observed in real time, and an observable view breaks the problem down. The figure above shows the ARMS container startup analysis feature. On the left is the service startup. The system breaks the startup process into steps, making it clear which step of the microservice startup is slow and enhancing its observability.
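A minimal sketch of the idea of timing each startup step and then pointing at the slowest one. The class `StartupTimeline`, the phase names, and the durations are hypothetical; a fake clock is injected so that the example is deterministic, whereas a real agent would read timestamps from the runtime.

```python
import time
from contextlib import contextmanager


class StartupTimeline:
    """Records how long each named startup phase takes."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.phases: list[tuple[str, float]] = []

    @contextmanager
    def phase(self, name: str):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the phase raised an exception.
            self.phases.append((name, self._clock() - start))

    def slowest(self) -> tuple[str, float]:
        return max(self.phases, key=lambda item: item[1])


# Fake clock readings in seconds, two per phase (start, end).
fake_clock = iter([0.0, 2.0, 2.0, 47.0, 47.0, 50.0]).__next__
timeline = StartupTimeline(clock=fake_clock)

with timeline.phase("pod_creation"):
    pass
with timeline.phase("spring_bean_init"):
    pass
with timeline.phase("service_registration"):
    pass

slowest_name, slowest_seconds = timeline.slowest()
print(f"slowest phase: {slowest_name} ({slowest_seconds:.1f}s)")
```

The same structure can be nested, for example one phase per Spring bean, to find out which bean the startup is stuck on.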

The microservice engine provides lossless online capability: dynamic configuration in the console, a real-time observable view of lossless online and offline, and a complete solution that requires no code changes. Protection and governance measures apply throughout microservice startup. In the connection pre-establishment phase, connections are created asynchronously in advance so that connection setup does not block. In the service registration and discovery phase, parallel registration and subscription further speed up application startup. In the small-traffic warm-up phase, the client's load balancing is adjusted so that traffic to the new instance grows slowly.
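The small-traffic warm-up phase can be illustrated with a linear weight ramp on the client side: a new instance starts with a minimal load balancing weight that grows with its uptime until the warm-up period ends. This is a sketch of the general technique, not the MSE implementation; the function name `warmup_weight` and the 120-second warm-up period are assumptions.

```python
def warmup_weight(uptime_s: float, warmup_s: float, full_weight: int) -> int:
    """Linearly ramp an instance's weight from 1 to full_weight over warmup_s seconds."""
    if uptime_s >= warmup_s:
        return full_weight
    scaled = int(full_weight * uptime_s / warmup_s)
    # Never drop to 0, so that the new instance still receives a trickle of traffic.
    return max(1, min(scaled, full_weight))


# Hypothetical instance with full weight 100 and a 120-second warm-up period.
for uptime in (0, 30, 60, 120):
    print(uptime, warmup_weight(uptime, warmup_s=120, full_weight=100))
```

A weighted load balancer that uses these values sends roughly a quarter of the normal share to an instance that is a quarter of the way through its warm-up, which gives caches, connection pools, and the JIT compiler time to warm up.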

Because the override relationships among microservice configurations are complex, configuration analysis is required.

The figure above shows the configuration override relationships in Dubbo, which follow a defined order. It is often hard to tell whether a configuration item is misconfigured, in effect, or overridden. In microservice scenarios, where configurations are scattered and redundant, we provide observability of the application's runtime configuration together with expert experience in configuration optimization.
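To make "in effect or overridden" concrete, the sketch below resolves one key across layered configuration sources and reports which layer wins and which layers are shadowed. The layer names and their order in `LAYERS`, the function `resolve`, and the sample values are assumptions for illustration; the authoritative order is the one defined by Dubbo itself.

```python
# Assumed precedence, highest first. Check the Dubbo documentation for the real order.
LAYERS = ["system-properties", "config-center", "application-code", "dubbo.properties"]


def resolve(key: str, sources: dict[str, dict[str, str]]):
    """Return (effective value, winning layer, overridden layers) for a key."""
    hits = [
        (layer, sources[layer][key])
        for layer in LAYERS
        if key in sources.get(layer, {})
    ]
    if not hits:
        return None, None, []
    winning_layer, value = hits[0]
    return value, winning_layer, [layer for layer, _ in hits[1:]]


# Hypothetical configuration: the same timeout is set in three places.
sources = {
    "system-properties": {"dubbo.consumer.timeout": "1000"},
    "application-code": {"dubbo.consumer.timeout": "3000"},
    "dubbo.properties": {"dubbo.consumer.timeout": "5000", "dubbo.protocol.port": "20880"},
}
value, winner, overridden = resolve("dubbo.consumer.timeout", sources)
print(f"effective={value} from {winner}; overridden in {overridden}")
```

Reporting the shadowed layers alongside the effective value is what lets an operator see that a setting in a properties file never takes effect.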

We provide the ability to analyze the Jar packages that Spring Cloud/Dubbo depend on, to help locate Jar dependency conflicts, security risks, performance risks, and so on.
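One simple form of dependency conflict analysis is to look for the same artifact appearing on the classpath in more than one version. The sketch below does this from Jar file names only. The regular expression is a simplifying assumption (it treats the first dash followed by a digit as the start of the version), and the Jar names are hypothetical; a real analyzer would read the Maven metadata inside each Jar.

```python
import re
from collections import defaultdict

# Assumes names shaped like "<artifact>-<version>.jar" where the version starts with a digit.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_conflicts(jar_names: list[str]) -> dict[str, list[str]]:
    """Map each artifact that appears in several versions to its sorted versions."""
    versions = defaultdict(set)
    for name in jar_names:
        match = JAR_NAME.match(name)
        if match:
            versions[match["artifact"]].add(match["version"])
    return {artifact: sorted(found) for artifact, found in versions.items() if len(found) > 1}


# Hypothetical classpath listing.
jars = [
    "guava-20.0.jar",
    "guava-31.1-jre.jar",
    "dubbo-2.7.8.jar",
    "netty-all-4.1.50.Final.jar",
]
conflicts = find_conflicts(jars)
print(conflicts)
```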

Where exactly does the time of an RPC call go? An RPC call involves routing, traffic limiting and degradation, serialization, and the network. On the client side it passes through routes, filter, Invoker, serialize, and Remote. On the server side it passes through serialize, Proxy Invoke, Filter, and impleme.

The diagram above is a flow chart of an RPC call. The time includes the connection establishment time for addressing and load balancing, the serialization time for packing, the deserialization time for unpacking, the server processing time, and the time spent waiting for the server to return.

The above is our answer. We further subdivide the call chain inside the RPC framework, so that the time spent on routing, serialization, network, proxy, server processing, and so on can be seen at a glance.
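A minimal sketch of why per-phase instrumentation matters: once the individual phases are measured, the part of the client-side latency that no phase accounts for can be attributed to the network and to waiting. The function `breakdown`, the phase names, and all durations are hypothetical.

```python
def breakdown(client_total_ms: float, phases_ms: dict[str, float]) -> dict[str, float]:
    """Add a derived 'network_and_wait' entry: total latency minus all measured phases."""
    result = dict(phases_ms)
    result["network_and_wait"] = client_total_ms - sum(phases_ms.values())
    return result


# Hypothetical call: the Consumer sees 120 ms, while the Provider only spends 15 ms.
phases = {
    "routing": 2.0,
    "serialization": 8.0,
    "server_processing": 15.0,
    "deserialization": 5.0,
}
result = breakdown(client_total_ms=120.0, phases_ms=phases)
dominant = max(result, key=result.get)
print(f"dominant phase: {dominant} ({result[dominant]:.0f} ms of 120 ms)")
```

In this made-up case the Provider is fast, yet the Consumer is slow because most of the time is spent outside the measured phases, which is the situation that a coarse call chain cannot explain.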

Conclusion

Building on traditional observability solutions, we have further expanded the data covered by traditional observability, including Tracing, Logging, and Metrics, from the perspective of microservices, and combined it with the diagnostic experience of microservice experts.

From the front end and the application down to the underlying machine, ARMS monitors every run, every slow SQL statement, and every exception of the application service in real time. It also provides complete monitoring data, showing key indicators such as request count, response time, Full GC count, slow SQL and exception counts, and the number and duration of calls between applications, so that you can stay informed of the application's running status and provide the best possible user experience.

The Alibaba Cloud microservice engine MSE has been newly upgraded. In the governance center, MSE improves the efficiency and stability of microservice development. It supports Spring Cloud and Dubbo applications from the past five years or so, as well as multi-language heterogeneous microservice systems, and it provides differentiated capabilities such as lossless offline, full-link grayscale, outlier instance removal, and service authentication. In the registry and configuration center, MSE offers fully managed Zookeeper/Nacos/Eureka services. They are highly available by default, with multi-availability-zone deployment and automatic detection, and configurations support authentication, encryption, and grayscale release. In the cloud native gateway, MSE integrates alarm monitoring, link tracing, traffic limiting and degradation, and certificate management. It combines the traffic gateway and the microservice gateway into one, which reduces cost by 50%.