Background

Monitoring has always been an important means for server-side teams to understand the running status of their applications. After several years of development, the Ali Xiami server side now has more than 100 Java applications, nearly 50 of which carry core business. Monitoring configuration varies from person to person: some applications are monitored carefully, while others, after passing through several developers' hands, have gradually had their monitoring neglected; the monitoring items of some applications were last modified two years ago and can no longer keep up with business development.

Like most teams, Xiami also has an alarm-handling chat group. Internal monitoring and alarm platforms (such as Sunfire) deliver alarm information to the group through a bot. Because monitoring items were configured unreasonably and the monitoring granularity was too coarse, the group was bombarded with dozens or even hundreds of alarm notifications every day. Over time everyone became numb to the alarms, and most of them went unattended.

Against this background, the Xiami SRE team (SRE, Site Reliability Engineering, a discipline first proposed by Google that is committed to building highly available, highly scalable and stable sites) made monitoring governance its first priority and, after two months of development, built Xiami's new monitoring system.

Alarm cause analysis

In the past, monitoring configurations varied widely. For most applications, monitoring was limited to the overall RT and QPS of the application plus part of the service logs. When an alarm fired, in most cases we only knew that something was wrong with the application, but it was difficult to quickly determine what the problem was or where it was. A new team member might need to check configuration items, log in to machines, scan logs and even inspect offline logs; locating a problem often took more than ten minutes, and sometimes troubleshooting took half a day.

After a period of research and exploration, we found that when an application that has been running stably for a while suddenly generates an alarm, the causes usually fall into the following categories:

  • Program bug: code problems that cause null pointer exceptions, frequent Full GC, etc.
  • Upstream dependency failure: an upstream interface fails, causing interface timeouts or invocation failures in the application.
  • Single-machine failure: the Load and CPU of a container rise suddenly because the machine is affected by its host, resulting in timeouts and a full thread pool.
  • Middleware failure: common failures such as cache and DB jitter cause RT to rise and timeouts to increase for a period of time. Note, however, that high Load on a single machine may also cause cache read/write and DB problems on that machine alone.

Monitoring optimization

After analyzing the causes of alarms, the next step is to optimize monitoring. An alarm can tell you that something is wrong; good monitoring can tell you what is wrong. Our previous monitoring usually only achieved the first stage: it could not tell us where the problem was, and many auxiliary means were needed to locate it. Having analyzed the causes of alarms, we had to find a way to accurately locate problems through monitoring itself.

At present, Xiami's monitoring is divided into fault monitoring, general monitoring and basic monitoring, as shown in the following figure:




Fault monitoring

Fault monitoring means monitoring whose alarms indicate a fault. We believe that if any external factor affects an application, the impact must show up in the RT and success rate of its interfaces: either interface RT rises, or the number of interface failures rises and the success rate declines. If neither happens, the external influence can be ignored. Therefore interface monitoring makes up the major part of fault monitoring. If core interface monitoring is configured for every application, then during troubleshooting it is easy to determine whether a fault in my application is caused by an interface of an upstream application.

Therefore, we use the success rate, RT and error codes to monitor whether an interface has failed. In particular, for RT monitoring of client-facing interfaces we do not use the average RT but the 75th-percentile RT, because we want the metric to reflect what users actually experience. For example, if the alarm threshold on the 75th-percentile RT is set to 1000 ms, then when this monitoring item fires it means that 25% of user requests to the interface took longer than 1000 ms. This threshold is usually set to an RT that users cannot tolerate, such as 500 ms or 1000 ms.
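To make the idea concrete, here is a minimal sketch in Java of how a 75th-percentile RT check over one monitoring window could work. It only illustrates the nearest-rank percentile calculation; it is not the actual Sunfire configuration, and all class, method and variable names are hypothetical.

```java
import java.util.Arrays;

/** Illustrative sketch of a P75 RT alarm check over one monitoring window. */
public class Rt75Alarm {

    /** Returns the 75th-percentile RT (ms) of the samples collected in one window. */
    static long percentile75(long[] rtSamplesMs) {
        long[] sorted = rtSamplesMs.clone();
        Arrays.sort(sorted);
        // Index of the 75th percentile (nearest-rank method).
        int idx = (int) Math.ceil(0.75 * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }

    /** Fires when the P75 RT breaches the "users cannot tolerate" threshold, e.g. 1000 ms. */
    static boolean shouldAlarm(long[] rtSamplesMs, long thresholdMs) {
        return rtSamplesMs.length > 0 && percentile75(rtSamplesMs) > thresholdMs;
    }

    public static void main(String[] args) {
        long[] window = {120, 150, 200, 300, 1200, 1500, 1800, 2500}; // one window of samples
        System.out.println("alarm = " + shouldAlarm(window, 1000)); // true: >25% of requests exceed 1000 ms
    }
}
```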



In fault monitoring, we also set up three kinds of application-level monitoring: Exception, Error and message Exception, which watch exceptions and errors on the server side. This kind of monitoring is mainly used to quickly find program bugs. For example, if these three kinds of errors increase during a release, you should consider rolling back.
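As an illustration of the "errors increase during a release" rule, the hedged sketch below compares the error count during a release window against a pre-release baseline. The class name, the noise floor and the 3x factor are assumptions for illustration, not the platform's real rule.

```java
/** Illustrative check: do Exception/Error counts spike during a release window? */
public class ReleaseErrorCheck {

    /** True when errors during the release window grow to several times the baseline. */
    static boolean suggestRollback(long baselineErrorsPerMin, long releaseErrorsPerMin) {
        // Ignore noise when the application normally has (almost) no errors.
        long floor = Math.max(baselineErrorsPerMin, 5);
        return releaseErrorsPerMin > floor * 3;
    }

    public static void main(String[] args) {
        System.out.println(suggestRollback(10, 80)); // true  -> consider rolling back
        System.out.println(suggestRollback(10, 20)); // false -> within normal fluctuation
    }
}
```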

General monitoring

In many cases, an application problem is caused by a single machine: if the golden metrics of an interface change suddenly on one machine, or its errors and exceptions spike, while the other machines stay unchanged, the problem is caused by that single machine. Therefore, for each application's fault monitoring we also configure corresponding single-machine monitoring. Here we introduce two additional kinds of single-machine monitoring, HSF (Dubbo) thread pool full and HSF (Dubbo) timeout, because when a single machine's Load is high or its CPU is abnormal, the most common symptoms are that the HSF thread pool suddenly fills up and the number of HSF (Dubbo) timeouts increases. These two monitors also help locate single-machine problems. With this kind of monitoring, we can easily determine whether an interface alarm is caused by a particular machine.
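The sketch below shows one way such a single-machine check could be expressed: compare each machine's error count against the average of the other machines in the same window. It is illustrative only; the 5x factor and all names are assumptions, not the platform's actual logic.

```java
import java.util.Map;

/** Illustrative single-machine outlier check over one monitoring window. */
public class SingleMachineOutlier {

    /** Returns the IP of a machine whose errors dwarf the rest, or null if none stands out. */
    static String findOutlier(Map<String, Long> errorsByIp) {
        long total = errorsByIp.values().stream().mapToLong(Long::longValue).sum();
        int n = errorsByIp.size();
        if (n < 2) return null;
        for (Map.Entry<String, Long> e : errorsByIp.entrySet()) {
            double othersAvg = (double) (total - e.getValue()) / (n - 1);
            // A machine is an outlier when its error count is several times the fleet average.
            if (e.getValue() > Math.max(othersAvg, 1) * 5) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Long> window = Map.of("10.0.0.1", 3L, "10.0.0.2", 2L, "10.0.0.3", 120L);
        System.out.println(findOutlier(window)); // 10.0.0.3 -> looks like a single-machine failure
    }
}
```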




Basic monitoring

The first two kinds of monitoring can basically determine whether a fault is caused by a program bug, an upstream application or a single-machine failure; the remaining kind is monitoring of middleware. Here we use Sunfire's basic monitoring to watch the application's CPU, Load, JVM, HSF (Dubbo), MetaQ and other middleware metrics. If a middleware failure occurs, it raises an obvious alarm.




Alarm path optimization

After sorting out and optimizing the monitoring, each application currently has no more than 30 to 50 alarm items. If all of them were delivered to the alarm group in the old way, it would still be a disaster: nobody could read them, let alone locate problems quickly. At the same time, an application owner usually only cares about his own application's alarms; other applications' alarms are of no use to him. We therefore built an SRE platform to optimize the alarm link. The optimized alarm link is as follows:




We use stream computing to set alarm windows, aggregate alarms, and decide through alarm classification which alarms should be delivered. Alarms in the group @ the relevant engineers directly, so when someone checks the alarm group they can jump straight to the messages that @ them and quickly extract the useful information. At the same time, the SRE platform displays the alarms of an application and its upstream applications within the past hour, classified and aggregated, so where the problem lies can be seen at a glance. Our bot only sends alarm messages to the DingTalk group according to the rules, which greatly reduces the number of alarms and improves their readability. At present about 5,000 alarm messages are generated on an average day, and after decision and rule filtering only about 50 to 100 are delivered. These are exactly the alarms we feel we must handle immediately.
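The following sketch illustrates the aggregation idea with a simple in-memory window instead of the real stream-computing job: at most one message per (application, alarm item) pair is delivered per window, and duplicates inside the window are suppressed. All names and the 5-minute window are assumptions for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative windowed alarm aggregation: deliver once per key per window. */
public class AlarmAggregator {

    private final Duration window = Duration.ofMinutes(5);
    // key = application + "|" + alarm item, value = start of the current window
    private final Map<String, Instant> lastDelivered = new ConcurrentHashMap<>();

    /** Returns true when the alarm should be delivered to the group (and @ the owner). */
    boolean shouldDeliver(String application, String alarmItem, Instant now) {
        String key = application + "|" + alarmItem;
        Instant last = lastDelivered.get(key);
        if (last == null || Duration.between(last, now).compareTo(window) >= 0) {
            lastDelivered.put(key, now);
            return true;    // first alarm of this kind in the window: deliver and @ the owner
        }
        return false;       // duplicate inside the window: aggregate instead of delivering
    }

    public static void main(String[] args) {
        AlarmAggregator agg = new AlarmAggregator();
        Instant t0 = Instant.now();
        System.out.println(agg.shouldDeliver("trade-app", "p75-rt", t0));                  // true
        System.out.println(agg.shouldDeliver("trade-app", "p75-rt", t0.plusSeconds(60)));  // false, aggregated
        System.out.println(agg.shouldDeliver("trade-app", "p75-rt", t0.plusSeconds(360))); // true, new window
    }
}
```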







Working with traffic scheduling

As mentioned before, many failures are caused by a single machine. In the past, once we identified a single-machine failure, what we usually did was stop the service on that machine or replace the machine, which was very inefficient. What we actually need is the ability to cut traffic away from a machine when it has a problem and switch it back when the machine recovers, ideally automatically; a sketch of this idea follows the list below.

With the help of Alibaba's traffic scheduling platform (AHAS), we can solve the following problems well:

  • Release warm-up: avoiding the RT and Load increases and HSF timeouts caused by a release;
  • Single-machine problems: some machines receiving too much traffic, machines affected by their host, too many slow calls, services made unavailable by a full HSF thread pool, and excessively high RT.
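The sketch below only illustrates the underlying idea, namely lowering a machine's routing weight when its health signals degrade and restoring it when they recover. It is not the AHAS API; the thresholds, weights and names are assumptions for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Conceptual sketch: cut traffic away from an unhealthy machine, restore it on recovery. */
public class TrafficScheduler {

    // Routing weight per machine IP: 100 = full traffic, 0 = no traffic.
    private final Map<String, Integer> weights = new ConcurrentHashMap<>();

    /** Called periodically with each machine's health signals (e.g. Load, HSF timeouts). */
    void adjust(String ip, double load, long hsfTimeoutsPerMin) {
        boolean unhealthy = load > 8.0 || hsfTimeoutsPerMin > 50; // illustrative thresholds
        if (unhealthy) {
            weights.put(ip, 0);     // cut traffic while the machine is misbehaving
        } else {
            weights.put(ip, 100);   // machine looks healthy again: restore traffic
        }
    }

    int weightOf(String ip) {
        return weights.getOrDefault(ip, 100);
    }

    public static void main(String[] args) {
        TrafficScheduler scheduler = new TrafficScheduler();
        scheduler.adjust("10.0.0.3", 15.2, 120); // high Load and many HSF timeouts
        System.out.println(scheduler.weightOf("10.0.0.3")); // 0   -> traffic cut away
        scheduler.adjust("10.0.0.3", 1.1, 0);    // recovered
        System.out.println(scheduler.weightOf("10.0.0.3")); // 100 -> traffic restored
    }
}
```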




At present, about 40 applications are connected to the traffic scheduling platform, which schedules machine traffic more than 1,000 times per week. With its help, we no longer need to worry about application alarms caused by single-machine failures.
