Alibaba | A practical guide to panoramic monitoring

As cloud native technology has developed and matured, microservices and containers have become the inevitable choice for large-scale distributed IT architectures. These technologies make IT systems more agile, robust, and performant, but they also make the technical architecture more complex, which poses unprecedented challenges for application monitoring.

Challenges of traditional monitoring

Explosive growth in system monitoring

Traditional monitoring focuses on the system level: applications, hosts, networks, and so on. With the widespread adoption of microservices and cloud native technology, system architectures keep getting more complex and the number of applications grows explosively. When a fault occurs, the flood of system alarms turns into an alarm storm that makes it hard for engineers to locate the problem quickly. System-level monitoring at this scale also produces many false positives. Engineers are forced to spend a great deal of effort handling them and eventually become numb to alarms.

Lack of connection between monitoring results and business goals

Traditional monitoring has no business perspective and no link between the business and the IT systems behind it. As a result, users, business staff, and engineers do not share a common view. Often users have already reported a failure while engineers point to system metrics to show that nothing is wrong. Or the business has already been damaged, but engineers still cannot tell which system is at fault, and the recovery time grows much longer.

Data fragmentation among monitoring tools and lack of data analysis

Alibaba uses many kinds of monitoring tools for different objects such as the network, physical machines, applications, and clients. These tools cannot share data with each other, the monitoring data is never analyzed in one place, and it is even harder to relate it to business scenarios. Many failures can therefore be diagnosed only through engineers' experience, by switching between tools and narrowing the problem down step by step, which greatly increases the time to recover.

The monitoring maintenance cost is high and the alarm accuracy is low

Traditional monitoring requires a lot of configuration work. The process is laborious and lacks automated, intelligent means of monitoring, which is an important reason why monitoring capability varies so much from system to system. Some new businesses cannot afford to invest heavily in monitoring configuration and end up without business monitoring at all. In addition, as the business evolves, engineers have to keep adjusting alarm rules, and rules that are not adjusted in time cause false positives and missed alarms.

A business-driven monitoring philosophy

To fit the DevOps development model and solve the problems of traditional monitoring, Alibaba has distilled a business-driven, top-down panoramic monitoring system. It has three layers: business monitoring, application monitoring, and cloud resource monitoring.

Business monitoring is the top layer of the whole monitoring system. It reflects how users are actually experiencing the business, ties directly to business results, and can be understood by people in different departments and roles.
Application monitoring covers the service and system layers inside an application. It directly reflects the running state of the system, gives developers a full picture of the health of the services and middleware in the application, and helps them locate system problems quickly.
Cloud resource monitoring provides basic monitoring of the cloud resources (such as RDS, OSS, and SLB) that an application depends on. During troubleshooting it supplies detailed instance-level data so that R&D engineers can quickly determine whether the problem lies in the application system or in the underlying cloud infrastructure.

Once monitoring is layered, the metrics and alarm rules in each layer are graded by importance into several levels such as critical, warning, and normal. Alarms of different layers and levels are assigned to different roles. For example, the group's safe production team focuses solely on metrics for the group's core business, the stability owner of each department watches the department's core business, and the R&D engineers of each team receive the alarms for their own business and applications. Monitoring of cloud resource instances generally sends no alarm messages and is used mainly for troubleshooting and fault location.

This takes full advantage of DevOps and removes the traditional bottleneck of a small operations team handling all troubleshooting. It also significantly reduces the number of alarms each person has to handle, so critical business alarms are no longer drowned out by an alarm storm during a failure.
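
The assignment of alarms to roles described above can be pictured as a small routing function. The following is only an illustrative sketch, not Alibaba's implementation: the `Alarm` fields, the role names, and the choice to escalate only critical business alarms are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    BUSINESS = "business"
    APPLICATION = "application"
    CLOUD_RESOURCE = "cloud_resource"


class Severity(Enum):
    NORMAL = 1
    WARNING = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Alarm:
    layer: Layer
    severity: Severity
    team: str                  # team that owns the business or application
    group_core: bool = False   # hypothetical flag: metric of a group-level core business
    dept_core: bool = False    # hypothetical flag: metric of a department-level core business


def route_alarm(alarm: Alarm) -> list[str]:
    """Return the roles to notify for one alarm."""
    # Cloud resource data is kept for troubleshooting and notifies nobody.
    if alarm.layer is Layer.CLOUD_RESOURCE:
        return []
    recipients = [f"rd:{alarm.team}"]  # the owning R&D team gets its own alarms
    # Assumption: only critical business alarms are escalated beyond the team.
    if alarm.layer is Layer.BUSINESS and alarm.severity is Severity.CRITICAL:
        if alarm.dept_core:
            recipients.append("department-stability-owner")
        if alarm.group_core:
            recipients.append("group-safe-production-team")
    return recipients
```

With this shape, a critical alarm on a group-level core business metric reaches both the owning team and the group team, while a cloud resource alarm reaches no one and is only visible during drill-down.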

Unified monitoring architecture

Building on the concept of panoramic monitoring, Alibaba has worked out a unified monitoring architecture. The architecture does not aim for a single monolithic monitoring platform. Instead it is built in layers and abstracts three monitoring systems: cloud resources, applications, and business. Each system concentrates on finding faults in its own domain. A single CMDB resolves the problem of inconsistent monitoring metadata, and an intelligent algorithm platform, an alarm center, and a fault platform manage events and faults centrally and improve alarm accuracy.

Business monitoring

Alibaba's business monitoring uses a self-developed log collection and computation framework that extracts real-time monitoring metrics from logs through configuration on a web page. It is simple to use, highly customizable, fast to respond, and non-intrusive to the business. It also provides a complete business monitoring domain model that guides users toward full monitoring coverage.

The domain model for business monitoring includes:

Business domain: A complete business or product is called a "business domain", for example the "trading domain", "marketing domain", or "payment domain" in e-commerce.
Business scenario: The core business use cases within a business domain are called "business scenarios", for example "order confirmation" and "create order" in the trading domain. The business scenario is the core of the entire monitoring model.
Business metrics: Metrics specific to each business scenario, such as orders per minute, transaction success rate, and error codes.
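
The three-level model above maps naturally onto nested data types. The sketch below is an illustration only; the class names, fields, and the `uncovered_scenarios` helper are assumptions for the example and are not taken from Alibaba's platform.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessMetric:
    name: str       # e.g. "orders_per_minute"
    unit: str = ""


@dataclass
class BusinessScenario:
    name: str       # e.g. "create order"
    metrics: list[BusinessMetric] = field(default_factory=list)


@dataclass
class BusinessDomain:
    name: str       # e.g. "trading domain"
    scenarios: list[BusinessScenario] = field(default_factory=list)

    def uncovered_scenarios(self) -> list[str]:
        """Scenarios with no metric yet, i.e. gaps in monitoring coverage."""
        return [s.name for s in self.scenarios if not s.metrics]


trading = BusinessDomain("trading domain", [
    BusinessScenario("create order", [
        BusinessMetric("orders_per_minute", "count/min"),
        BusinessMetric("transaction_success_rate", "%"),
    ]),
    BusinessScenario("order confirmation"),
])
print(trading.uncovered_scenarios())  # ['order confirmation']
```

Putting the scenario at the center makes coverage easy to reason about: a scenario without metrics is a visible gap.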

When choosing business metrics, traditional operations staff tend to be exhaustive: they configure every observable metric and every kind of alarm in order to feel "safe". In practice, when a failure hits, the screen fills with abnormal metrics and the warning messages keep piling up. Such monitoring looks powerful, but its actual effect is counterproductive.

A careful review of Alibaba's failures over the years shows that common failures in the Alibaba Group's core business (excluding logic errors in the business itself) show up in three types of metrics: traffic, latency, and errors. We call these the golden indicators:

Traffic: Business traffic dropping to zero or fluctuating abnormally, or middleware traffic (for example message delivery) dropping to zero, may indicate a major failure.
Latency: A sudden, large spike in the latency of a service the system provides, or of a service it depends on, is usually a precursor of system problems.
Errors: The total number of errors returned by a service, and the success rate of the services the system provides or depends on.

The business monitoring platform provides a "golden indicators" plug-in that generates a full set of golden indicators from a single configuration. It is currently the most widely used metric model in business monitoring.
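
To make the three golden indicators concrete, here is a minimal sketch that derives them per scenario and per minute from log lines. The log format, the field order, and the function name are assumptions for illustration; the real plug-in is configured through the platform and its internals are not described here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical log format: "<epoch_seconds> <scenario> <latency_ms> <error_code or OK>"
LOG_LINES = [
    "1700000000 create_order 35 OK",
    "1700000010 create_order 41 OK",
    "1700000020 create_order 950 TIMEOUT",
    "1700000065 create_order 38 OK",
]


def golden_indicators(lines):
    """Aggregate traffic, latency, and errors per (scenario, minute)."""
    buckets = defaultdict(list)
    for line in lines:
        ts, scenario, latency_ms, code = line.split()
        minute = int(ts) // 60 * 60  # start of the one-minute bucket
        buckets[(scenario, minute)].append((int(latency_ms), code))

    result = {}
    for key, samples in buckets.items():
        errors = sum(1 for _, code in samples if code != "OK")
        result[key] = {
            "traffic": len(samples),
            "avg_latency_ms": mean(lat for lat, _ in samples),
            "errors": errors,
            "success_rate": (len(samples) - errors) / len(samples),
        }
    return result


for key, indicators in golden_indicators(LOG_LINES).items():
    print(key, indicators)
```

In this sample the first three lines fall into one minute and the fourth into the next, so the first bucket shows one timeout together with a much higher average latency.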

Business monitoring alarms map directly to faults, so they demand high-quality monitoring data and great flexibility: the system must satisfy the monitoring needs of different technology implementations without affecting the performance of the monitored business system. Alibaba's business monitoring uses logs as its data source, which gives it maximum flexibility and lets it adapt to almost any technology stack. Log collection uses uncompressed incremental collection, zero-copy, and other techniques to reduce its impact on the performance of the business system. A pull-mode architecture, a retry mechanism, and a data completeness model ensure reliable and complete data collection. Fully GUI-based ("white screen") configuration and thorough debugging functions keep configuration difficulty and cost to a minimum.
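
Incremental collection can be illustrated with a reader that remembers how far it has read and returns only newly appended, complete lines on each pull. This is a simplified sketch under stated assumptions: the `IncrementalReader` class is hypothetical, it ignores log rotation, and it does not attempt zero-copy, which depends on operating system facilities outside the scope of the example.

```python
import os
import tempfile


class IncrementalReader:
    """Returns only the lines appended since the previous call."""

    def __init__(self, path):
        self.path = path
        self.offset = 0  # byte position that has already been collected

    def read_new_lines(self):
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            chunk = f.read()
        # Consume only up to the last newline; a partially written line
        # stays unread and is picked up again on the next pull.
        end = chunk.rfind(b"\n") + 1
        self.offset += end
        return chunk[:end].decode("utf-8").splitlines()


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "app.log")
    with open(log_path, "wb") as f:
        f.write(b"line one\nline two\nhalf-writ")
    reader = IncrementalReader(log_path)
    print(reader.read_new_lines())  # ['line one', 'line two']
    with open(log_path, "ab") as f:
        f.write(b"ten line\n")
    print(reader.read_new_lines())  # ['half-written line']
```

Leaving the partial line unread is a small instance of the retry idea: the collector never emits incomplete data, it simply tries again on the next pull.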

Application monitoring

Alibaba's application monitoring is built in a standardized, component-based way. Combined with the Alibaba technology stack, it provides monitoring components for common systems and middleware. Operations engineers do not need to modify any program code, and the whole monitoring process is automated. When an application is launched or scaled out, application monitoring starts automatically without human intervention, which greatly reduces the maintenance cost of monitoring.

When the operations system launches an application or scales it out, it writes the change to the CMDB, and the CMDB pushes the change to a message queue (MQ). The application monitoring platform subscribes to the MQ, receives the application's configuration changes in real time, and generates new monitoring tasks, which it sends to the Agent on the target server or container. Following the task configuration, the Agent issues the corresponding collection request, fetches monitoring data from an endpoint exposed by the business application (such as an Exporter), and uploads it to the monitoring cluster for computation and storage. In parallel, the anomaly detection module generates alarm detection tasks from the same configuration changes, pulls monitoring data from the time series database to detect anomalies, and sends abnormal events to the alarm center.
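
The first half of this flow, from a CMDB change message to collection tasks for Agents, can be sketched as follows. Everything here is illustrative: `queue.Queue` stands in for the real MQ, and the `ChangeEvent` and `CollectTask` shapes, the action names, and the `/metrics` path are assumptions rather than the platform's actual message format.

```python
import queue
from dataclasses import dataclass


@dataclass
class ChangeEvent:   # hypothetical shape of a CMDB change message
    app: str
    action: str      # "launch", "scale_out", or "offline"
    instances: list


@dataclass
class CollectTask:   # hypothetical task sent to an Agent
    app: str
    target: str
    endpoint: str


def tasks_from_changes(mq, metrics_path="/metrics"):
    """Drain the queue and turn every newly added instance into a collection task."""
    tasks = []
    while True:
        try:
            event = mq.get_nowait()
        except queue.Empty:
            break
        if event.action in ("launch", "scale_out"):
            for host in event.instances:
                tasks.append(CollectTask(event.app, host, f"http://{host}{metrics_path}"))
    return tasks


mq = queue.Queue()  # stands in for the real message queue
mq.put(ChangeEvent("order-service", "scale_out", ["10.0.0.5:9100", "10.0.0.6:9100"]))
for task in tasks_from_changes(mq):
    print(task)
```

Because tasks are derived from change events rather than configured by hand, a scaled-out instance is monitored as soon as the CMDB records it.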

Cloud resource monitoring

Alibaba's cloud resource monitoring connects directly to the "Cloud Monitor" API of the Alibaba Cloud platform to obtain metric data and alarm events for the different types of cloud resources. It then joins this data with the relationships between applications and cloud resources recorded in the CMDB, producing a health view of cloud resources from the application's perspective. This solves the problem of cloud infrastructure monitoring being isolated from the application monitoring above it. Because it relies on the monitoring capability of the cloud platform and the data accumulated in the CMDB, cloud resource monitoring is also fully automatic and needs no manual configuration by users.
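
The join between cloud metrics and CMDB relationships might look like the sketch below. The dictionaries are made-up sample data, the metric field names do not reflect the real Cloud Monitor API response, and the CPU limit is an arbitrary value chosen for the example.

```python
# Hypothetical CMDB relation: application -> ids of the cloud resources it uses
CMDB_RELATIONS = {
    "order-service": ["rds-001", "slb-007"],
    "pay-service": ["rds-002"],
}

# Hypothetical metric snapshot; the field names are assumptions
CLOUD_METRICS = {
    "rds-001": {"type": "RDS", "cpu_percent": 93.0, "alarms": ["slow SQL count high"]},
    "slb-007": {"type": "SLB", "cpu_percent": 12.0, "alarms": []},
    "rds-002": {"type": "RDS", "cpu_percent": 20.0, "alarms": []},
}


def resource_health_view(app, relations=CMDB_RELATIONS, metrics=CLOUD_METRICS, cpu_limit=80.0):
    """List the cloud resources an application depends on, each with a health flag."""
    view = []
    for instance_id in relations.get(app, []):
        m = metrics.get(instance_id)
        if m is None:
            continue  # no data reported for this instance
        healthy = m["cpu_percent"] < cpu_limit and not m["alarms"]
        view.append({"instance": instance_id, "type": m["type"], "healthy": healthy})
    return view


for row in resource_health_view("order-service"):
    print(row)
```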

Intelligent detection platform

To address low alarm accuracy and high configuration and maintenance cost, Alibaba has built an intelligent detection platform. It uses AI algorithms to find anomalies in online businesses and applications accurately, and it requires no manual configuration of alarm thresholds. Because business monitoring data and application monitoring data have different characteristics, the platform applies a different anomaly detection strategy to each:

1. Intelligent baseline

Business monitoring demands extremely high alarm accuracy. At the same time, business data fluctuates constantly with the business cycle, and values at peak and trough can differ by a factor of tens or even hundreds. Traditional threshold or period-over-period alarms require experienced operations experts to keep adjusting the rules, and they easily produce false alarms. In response, Alibaba uses an intelligent baseline algorithm that automatically learns the periodicity of the data curve from its historical trend. A business alarm is triggered as soon as a business metric moves outside the tolerable range around the baseline.

To make the algorithm work well across different kinds of data, it uses online prediction (point-by-point prediction of the data for the coming period) to strike a balance between long-term and recent historical patterns. Because it has been trained on a large number of diverse business metrics inside Alibaba, with annotations based on expert experience, the platform can account for different types of business fluctuation. The algorithm adapts well to burrs in the data curve and to the rises and falls that come with the business, and any kind of business monitoring data can be connected with one click. It has long been exposed to external attacks, crawlers, and interference from internal stress tests, and is now robust against such disturbances. It supports second-level and minute-level computation, requires no manual monitoring configuration, and needs no parameter tuning as the business changes, because it adapts by learning the patterns in the data.
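
The idea of a learned baseline with a tolerance band can be illustrated with a deliberately simple stand-in. The sketch below is not Alibaba's algorithm: it predicts the next point from the values seen at the same phase of earlier cycles and uses a standard-deviation band, and it omits the blending of long-term and recent history described above. The function names and the parameters `k` and `min_band` are assumptions for the example.

```python
from statistics import mean, pstdev


def baseline_band(history, period, k=3.0, min_band=1.0):
    """Tolerable range for the next point, from the same phase of earlier cycles.

    history: past values, oldest first, sampled at a fixed interval
    period:  number of samples in one business cycle
    """
    if len(history) < period:
        raise ValueError("need at least one full cycle of history")
    # Values observed at the same position in the cycle as the point being predicted.
    same_phase = history[len(history) % period::period]
    center = mean(same_phase)
    spread = max(k * pstdev(same_phase), min_band)
    return center - spread, center + spread


def is_anomalous(value, history, period, **kwargs):
    low, high = baseline_band(history, period, **kwargs)
    return not (low <= value <= high)


# Three cycles of four samples each: a trough near 10 and a peak near 100.
history = [10, 100, 50, 10, 12, 104, 52, 11, 11, 98, 49, 9]
print(is_anomalous(11, history, period=4))          # False: normal for the trough
print(is_anomalous(100, history, period=4))         # True: peak-sized value at the trough
print(is_anomalous(100, history + [11], period=4))  # False: normal for the peak
```

The same value of 100 is an anomaly at one phase and normal at the next, which is exactly what a single fixed threshold cannot express.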

2. Anomaly detection for application metrics

Application monitoring involves a very large number of metrics, and configuring thresholds for them by hand is expensive. Enterprises often use alarm templates to apply the same threshold to many applications. But application systems differ so much from one another that it is hard to define one accurate threshold, and any single value tends either to miss real alarms or to raise false ones. System metrics also behave differently from business metrics: their periodicity is less certain, and many of them fluctuate widely with no periodic pattern at all. To suit these characteristics, Alibaba has developed an anomaly detection algorithm for application metrics that combines several methods, including fault detection, frequency fluctuation anomaly detection, peak/trough anomaly detection, long-term trend gradient detection, and floating threshold detection. Because the number of metrics is so large, all of these detection methods use lightweight algorithms so that they can be applied widely, which greatly reduces the resource consumption of the anomaly detection service.
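
Combining several cheap checks can be sketched as below. These three detectors are simple stand-ins chosen for the example (a zero flatline check, a median-based spike check, and a least-squares slope check); they are not Alibaba's algorithms, they do not cover every method listed above, and all window sizes and limits are assumed values.

```python
from statistics import median


def flatline(series, window=5):
    """Fault check: the metric has been stuck at zero for the whole window."""
    tail = series[-window:]
    return len(tail) == window and all(x == 0 for x in tail)


def spike(series, window=10, k=5.0):
    """Peak/trough check: the last point is far from the recent median."""
    recent = series[-window - 1:-1]
    if len(recent) < 3:
        return False
    med = median(recent)
    mad = median(abs(x - med) for x in recent)  # median absolute deviation
    scale = max(mad, 0.01 * abs(med), 1e-9)     # floor so flat data is not oversensitive
    return abs(series[-1] - med) / scale > k


def trend(series, window=10, max_slope=1.0):
    """Trend check: the least-squares slope over the window is too steep."""
    y = series[-window:]
    n = len(y)
    if n < 2:
        return False
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(y))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return abs(num / den) > max_slope


def detect(series):
    """Run every lightweight check and return the names of those that fired."""
    checks = {"flatline": flatline, "spike": spike, "trend": trend}
    return [name for name, check in checks.items() if check(series)]


steady = [50, 51, 49, 50, 52, 50, 49, 51, 50, 50, 51]
print(detect(steady))                       # []
print(detect(list(range(0, 110, 10))))      # ['trend']
```

Each check is a single pass over a short window, which is what keeps the cost per metric low enough to run across a very large number of metrics.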

Alarm center: Receives the alarm events of every monitoring platform in one place, records them uniformly, processes them uniformly (merging, noise reduction, suppression, and so on), and finally delivers them to the responsible handlers.

Fault management platform: Defines fault levels and manages the entire fault life cycle. An important alarm that matches a fault level definition is escalated to a fault and enters the fault management process.

CMDB: The unified operations CMDB is the metadata center of Alibaba's entire application operations system. It maintains the operations objects across Alibaba, such as products, applications, instances (containers, VMs, cloud resources), data centers, units, and environments, along with the relationships between them. The monitoring systems at every layer are associated with objects in the CMDB model.
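
As an illustration of the merging step performed by an alarm center, the sketch below collapses repeated events for the same target and rule that arrive close together. The `Event` fields, the five-minute window, and the merge key are assumptions for the example; the real alarm center's rules are not described in this article.

```python
from dataclasses import dataclass


@dataclass
class Event:
    source: str  # which monitoring platform raised the event
    target: str  # application or business scenario
    rule: str
    ts: int      # epoch seconds


def merge_events(events, window_s=300):
    """Collapse repeats of the same (target, rule) that fall inside a time window."""
    merged = []
    last_seen = {}  # (target, rule) -> index of the open entry in merged
    for ev in sorted(events, key=lambda e: e.ts):
        key = (ev.target, ev.rule)
        idx = last_seen.get(key)
        if idx is not None and ev.ts - merged[idx]["last_ts"] <= window_s:
            merged[idx]["count"] += 1
            merged[idx]["last_ts"] = ev.ts
        else:
            last_seen[key] = len(merged)
            merged.append({"target": ev.target, "rule": ev.rule,
                           "first_ts": ev.ts, "last_ts": ev.ts, "count": 1})
    return merged


events = [
    Event("application", "order-service", "latency_high", 0),
    Event("business", "order-service", "latency_high", 100),
    Event("application", "order-service", "latency_high", 1000),
]
for entry in merge_events(events):
    print(entry)
```

Here the first two events become one notification with a count of two, while the third arrives after the window has closed and starts a new one.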

With this systematic construction, we can raise warnings quickly and accurately when a failure occurs, and we can also drill down from the business entry point to analyze the key applications, resource states, and even infrastructure along the failing link. Development and operations engineers can therefore eliminate suspects one by one from a single monitoring interface and quickly pin down the cause of the fault.

The following example, a sudden latency failure in calls to the order system, walks through the troubleshooting process with panoramic monitoring:

As soon as the online problem occurs, the development owner of the order system receives a latency alarm phone call from the alarm center. (If the problem meets a fault level definition, the fault platform automatically sends a fault notice of the corresponding level to the relevant people.)
Following the link in the alarm message, the R&D engineer opens the corresponding business monitoring page and checks the golden indicators, such as overall call volume, latency, and success rate. Latency has risen significantly and the success rate has dropped, while the call volume shows no significant change, which rules out a capacity problem caused by increased upstream traffic.
Using the multi-dimensional drill-down of business monitoring to view the error code details of the transaction reveals a large increase in the "timeout" error code, which rules out a business logic problem.
Drilling further down from business monitoring to application monitoring, along the call chain that corresponds to the order metric, shows that in one application the success rate of database calls has dropped significantly and call latency has risen sharply.
Drilling down again from application monitoring to cloud resource monitoring, the engineer checks the CPU, call time, and slow SQL metrics of the database associated with that application. The slow SQL list and the application's change records show that the problem is related to a scheduled task in the latest release, and the engineer resolves it by rolling back through the operations system.
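
The drill-down in the steps above follows relationships between layers, which can be modeled as a walk over a topology graph. The sketch below is a toy model: the node names, the `TOPOLOGY` and `UNHEALTHY` data, and the `drill_down` function are assumptions for illustration, standing in for the relationships a CMDB would provide and the health states the monitoring layers would report.

```python
# Hypothetical topology relating the three monitoring layers
TOPOLOGY = {
    "scenario:create_order": ["app:order-service", "app:inventory-service"],
    "app:order-service": ["rds:order-db"],
    "app:inventory-service": ["rds:inventory-db"],
}

# Hypothetical set of nodes currently reported as unhealthy
UNHEALTHY = {"scenario:create_order", "app:order-service", "rds:order-db"}


def drill_down(node, path=()):
    """Follow unhealthy dependencies downward and return every suspect path."""
    path = path + (node,)
    bad_children = [c for c in TOPOLOGY.get(node, []) if c in UNHEALTHY]
    if not bad_children:
        return [path]  # nothing unhealthy below: this is the deepest suspect
    paths = []
    for child in bad_children:
        paths.extend(drill_down(child, path))
    return paths


for suspect_path in drill_down("scenario:create_order"):
    print(" -> ".join(suspect_path))
```

Healthy branches such as the inventory service are pruned at each step, which mirrors how an engineer eliminates suspects while moving from business to application to cloud resource.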

Monitoring and management system

Beyond strong monitoring functions, a good monitoring system also needs a matching management system. Alibaba has adopted a monitoring management system driven by fault management and has drawn up strict, quantitative fault level definitions for each department and team. Fault level definitions are tied directly to business monitoring metrics, with a trigger rule on the metric for each fault level. The safe production team reviews the core business scenarios, business metrics, and fault level definitions together with each business department. The resulting "business monitoring" configuration and "fault level definition" must be agreed upon in review. This gives the business side (operations, product, customer service) and the R&D team a shared monitoring standard, clarifies each party's responsibilities, reduces communication cost, and ties the monitoring results closely to business goals.
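
A quantitative fault level definition tied to a business metric could be written as structured rules like the following. This is a sketch under assumptions: the level names (with P1 as the most severe), the thresholds, and the durations are invented for the example and are not Alibaba's actual definitions.

```python
# Hypothetical fault level definition for one business scenario.
# Level names, thresholds, and durations are illustrative only.
FAULT_LEVELS = [
    {"level": "P1", "metric": "success_rate", "below": 0.50, "for_minutes": 3},
    {"level": "P2", "metric": "success_rate", "below": 0.80, "for_minutes": 5},
    {"level": "P3", "metric": "success_rate", "below": 0.95, "for_minutes": 10},
]


def fault_level(per_minute_values, definitions=FAULT_LEVELS):
    """Return the most severe level whose rule held for its whole duration, else None."""
    for rule in definitions:  # ordered from most to least severe
        n = rule["for_minutes"]
        recent = per_minute_values[-n:]
        if len(recent) == n and all(v < rule["below"] for v in recent):
            return rule["level"]
    return None


print(fault_level([0.99] * 10))              # None
print(fault_level([0.99] * 5 + [0.7] * 5))   # P2
print(fault_level([0.99] * 7 + [0.4] * 3))   # P1
```

Writing the definition as data rather than prose is what allows it to be reviewed by both sides and evaluated automatically against the business metrics.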

The whole fault definition process is online and structured. When a business metric crosses the bounds set by a fault definition, the fault platform automatically triggers a fault notification and sends it to the engineers of the relevant team in time. From the notification, engineers can go straight to the business monitoring data. Using the vertical topology linkage of panoramic monitoring, they drill down from business metrics to the state of the associated applications, and from there to the state of cloud resources, to locate the fault quickly. They then decide on a recovery plan based on what the troubleshooting has found and restore service quickly through the operations platform, using operations such as rollback, degradation, and traffic switching. The whole process takes place online, troubleshooting progress is pushed automatically to the relevant people, and all actions are recorded. Finally, the safe production team organizes a failure review, defines improvement measures, and extends monitoring coverage, creating a positive feedback loop for safe production.

Conclusion

Panoramic monitoring is more than a simple integration of layered monitoring for business, applications, and resources. It also provides vertical topology linkage, so that users can drill down from business metrics to application state and from application state to resource state, and it integrates intelligent health detection for the metrics at every layer. Panoramic monitoring directly addresses the pain points of traditional monitoring platforms: no business monitoring capability, monitoring data and alarms scattered across layers, and high configuration cost. Built on Alibaba's monitoring technology and its best practices for emergency troubleshooting, it provides the Alibaba economy with an integrated, one-stop monitoring solution and represents Alibaba's best practice in safe production management.