Introduction: Smart alarms are essential for AIOps. The ARMS alarm O&M center makes alarm-oriented organizational collaboration more convenient and efficient.

Author | Nine debate

No system in the world is 100% perfect. To guarantee availability, a technical team needs to know the state of its services well, detect problems as soon as they occur, and quickly locate the cause. Achieving both depends on a sound monitoring and alarm system that tracks the running status of services, because the team cannot stare at dashboards all the time and watch every aspect. Alarms therefore become the primary means for teams to monitor service quality and availability.

But in practice, technical teams often get too many alarms, not too few. Consider the daily work of an SRE for a cross-border e-commerce system, which will probably be familiar to every engineer:

  1. You open IM and the O&M group shows 99+ or even 999+ unread alarm messages.
  2. You open the group and the screen is full of alarm titles, levels, and assignees, but there is too much information to quickly filter out the high-priority alarms.
  3. You open the alarms one by one, read the content, and evaluate the actual priority: service timeouts, network retransmissions, slow database responses, and so on.
  4. You find a “P1” alarm whose content is a service timeout in the transaction system, and you dispatch it to the developer of the transaction system. The developer checks and finds nothing abnormal in the transaction system.
  5. You arrive at the office, open the alarm center, and click through the list in order of severity. You hold separate meetings with business developers, network equipment maintainers, and DBAs. After a comprehensive analysis, you find that the “transaction system service timeout” alarm was caused by a “slow database response”, which was in turn caused by “network retransmission”.

As enterprise digitization deepens, the division and heterogeneity of IT systems make enterprise technical architecture more and more complex. To better ensure system stability and avoid missing faults, technical teams usually configure a large number of monitoring metrics and alarm rules in the monitoring system for infrastructure, platforms, and applications: from the network to machines, from instances to modules, and up to upper-layer services. Although fault detection improves greatly, a single exception or fault may trigger a large number of alarms, resulting in an alarm storm. For example, when a machine fails, the alarm rule that monitors the health of the machine fires; the alarm rules that monitor the running status of instances on that machine also fire; and the upstream application modules of those instances are affected and fire alarms as well. Likewise, when an instance in an application module raises an alarm, the upstream application module also raises one, so a module that contains many instances can produce hundreds of alarm messages. Worse, when the network, machines, domain names, application modules, and services raise multi-level, multi-faceted alarms at the same time, tens of thousands of alarm messages can be generated.

When anomalies occur, a traditional alarm system sends alarms to the relevant personnel through emails, SMS messages, and phone calls. A flood of alarm messages does not help them quickly find the root cause and formulate a stop-loss plan; instead, it drowns out the information that really matters. In addition, solving a problem often requires coordinating different teams and synchronizing progress in time. Point-to-point notification is not conducive to describing the problem and following up on it. Repeatedly describing the situation to the responsible people across teams greatly prolongs the MTTR.

Many small and medium-sized Internet companies have relatively complete monitoring and alarm systems, and their alarm quality and emergency response efficiency are much higher than those of large and very large enterprises. This is because the monitoring system is developed and maintained by one O&M team, and the business structure, product capabilities, and usage patterns are relatively simple and uniform. The main users of the monitoring system are product O&M engineers, so the configured monitoring and alarms are of high quality. However, as these enterprises grow, they face the same problems as large enterprises:

  • There are more and more monitoring systems, but the operation modes and product capabilities of these systems cannot be aligned.
  • Most monitoring systems offer a poor user experience and impose high learning costs on technical teams. Teams do not know which monitoring and alarm rules to configure, so they either fail to cover all risk points or generate a large number of invalid alarms.
  • More and more people are responsible for different monitoring systems. When the organizational structure changes, the subscription relationships in each monitoring system cannot be updated in time.

The end result is more and more alarms, more and more of them invalid, until the technical team gives up watching alarms and a vicious cycle begins. Looking at the causes behind these phenomena, we find that the problems concentrate on the following points:

A standardized alarm processing flow is missing

Alarm source data lacks unified standards and labels with unified dimensions

The O&M systems of each domain in an enterprise are built independently. There is no unified standard, and most alarm data contains only a title, a level, and basic content. O&M personnel spend a lot of time reading alarms and analyzing their sources and ultimate causes, and this process relies heavily on the past experience of SREs. The reason is that alarm data comes from different domains, alarm policy configuration logic is inconsistent, and labels are either missing or defined inconsistently. The SRE has to pick out valid information from a multitude of alarms, analyze the correlation between them, and find the root cause. To standardize and enrich alarm information, traditional IT O&M systems define unified alarm data standards at the enterprise level, and the alarm system of each domain must integrate according to these standards. In practice, this forced standardization inevitably runs into the following problems: 1) the adaptation cost for each O&M domain is high, so the project is hard to push forward; 2) data extensibility is poor, because a single change to the data standard affects all O&M domains.

Lack of a global perspective on alarm data processing and enrichment

IT O&M integrates alarms from different domains and processes them in a unified manner in order to obtain more information and make more accurate judgments. However, if the alarm O&M system only passively receives and dispatches alarms, its value as the O&M information center is not realized, and neither efficiency nor experience improves. Instead, alarm content needs to be actively digested, absorbed, and enriched so that noisy information becomes clear and tidy. The alarm O&M system therefore needs tools that can decompose, extract, and enhance the content of alarms.

Collaborative alarm processing fails

How can alarms be handled collaboratively?

In an organization, service stability is usually maintained through the day-to-day work of one or more teams, so alarm handling must be coordinated within and among teams. When an alarm is generated, the primary on-duty person is notified according to the current shift schedule. If the alarm is not handled within a period of time, the backup on-duty person is notified. If it is still not handled, the alarm is escalated to the administrator. When the on-duty person finds that an alarm needs to be handled by an upstream or downstream team, or needs a higher priority, they can change the alarm level and quickly transfer the alarm to other personnel, who can then handle it.
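The escalation chain described above can be sketched as a small policy table. This is a minimal illustration only: the role names, the 10-minute and 30-minute thresholds, and the function names are assumptions made for the example, not settings taken from any specific alarm product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EscalationStep:
    """One tier of the chain: who is notified, and from which minute on."""
    role: str            # hypothetical role name
    after_minutes: int   # minutes the alarm has stayed unhandled


# Hypothetical policy: tiers and thresholds are illustrative only.
POLICY: List[EscalationStep] = [
    EscalationStep("primary_on_call", 0),
    EscalationStep("backup_on_call", 10),
    EscalationStep("administrator", 30),
]


def roles_to_notify(unhandled_minutes: int,
                    policy: List[EscalationStep] = POLICY) -> List[str]:
    """Return every role that should have been notified by now, in tier order."""
    ordered = sorted(policy, key=lambda step: step.after_minutes)
    return [s.role for s in ordered if unhandled_minutes >= s.after_minutes]


def current_tier(unhandled_minutes: int) -> Optional[str]:
    """Return the highest tier reached so far, or None if no tier applies."""
    roles = roles_to_notify(unhandled_minutes)
    return roles[-1] if roles else None
```

Under these assumed thresholds, an alarm left unhandled for 12 minutes has reached the backup on-duty person, and one left for 45 minutes has reached the administrator.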

How can alarms be handled flexibly while still supporting organizational isolation?

Normally, a technical team does not want to see the alarms of other teams, and does not want its own alarms (which may contain sensitive information such as faults) to be seen by other teams. However, when an alarm needs to be handled together with other teams, it must be quickly forwarded to other personnel, who must be authorized at the same time. How can these flexible permission management requirements be met on the cloud? The traditional authorization method on the cloud is to create a sub-account for each member and grant permissions to it. This method is clearly unsuitable for alarm handling: when online services are already impaired, should you really have to ask an administrator for authorization before you can handle an alarm? Faced with these problems, enterprises of different sizes adopt different solutions:

Small-scale enterprises: Configure the people in the organization as alarm contacts on the cloud platform. When an alarm is triggered, specific people are notified based on the alarm content.

Advantages: When the team is small, alarms can be distributed and handled with simple configuration. Disadvantages: The organizational structure and the alarm contacts must be kept in sync constantly, for example, when a new employee joins or an existing employee leaves.

Large-scale enterprises: Send alarms through a unified webhook to an internal alarm platform for secondary distribution.

Advantages: A self-built system can be integrated with the internal organizational structure and permission system of the enterprise, supporting both organizational isolation and flexible alarm distribution. Disadvantages: Building an alarm platform in-house requires a large investment and is costly.

Solving the two problems above requires a more complete approach. After a lot of practice, we offer the following ideas for reference:

Standardized Alarm Event Processing Flow

Given the pain points in the O&M cases above and the difficulty of alarm standardization, we no longer force each O&M domain to adapt before integration. Instead, developers and O&M personnel use the standardized alarm event processing flow provided by the alarm O&M center to orchestrate and maintain processing flows for different scenarios, and standardize and enrich alarms from different sources in the following ways:

Flexible orchestration and combination of processing actions to quickly handle diverse alarm scenarios

From the perspective of the alarm O&M center, how alarm data should be processed varies by source and scenario. Processing flows provide data processing, data identification, and logic control capabilities. To meet a standardization or scenario requirement, an SRE uses conditions to filter the alarms of current concern and then selects and orchestrates actions into a processing flow. Once the flow is enabled, alarm data is saved to the alarm system in the expected standard form and used for notification. The alarm O&M experience of SREs is thereby captured for later automated processing.
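The idea of "a condition that selects alarms, followed by an ordered list of actions" can be sketched in a few lines. This is a hedged illustration, not the ARMS implementation: the `ProcessingFlow` class, the action helpers, the field names (`priority`, `level`, `labels`), and the `network_monitor` source are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Alarm = Dict[str, Any]
Condition = Callable[[Alarm], bool]
Action = Callable[[Alarm], Alarm]


class ProcessingFlow:
    """Hypothetical flow: run the actions in order on alarms that match the condition."""

    def __init__(self, condition: Condition, actions: List[Action]) -> None:
        self.condition = condition
        self.actions = actions

    def apply(self, alarm: Alarm) -> Alarm:
        if not self.condition(alarm):
            return alarm  # alarms outside the flow's scope pass through untouched
        result = dict(alarm)
        for action in self.actions:
            result = action(result)
        return result


def rename_field(old: str, new: str) -> Action:
    """Action: move a source-specific field to the standard field name."""
    def action(alarm: Alarm) -> Alarm:
        out = dict(alarm)
        if old in out:
            out[new] = out.pop(old)
        return out
    return action


def map_level(mapping: Dict[str, str]) -> Action:
    """Action: translate source-specific severities to the shared scale."""
    def action(alarm: Alarm) -> Alarm:
        out = dict(alarm)
        out["level"] = mapping.get(out.get("level"), out.get("level"))
        return out
    return action


def add_label(key: str, value: str) -> Action:
    """Action: attach a label so alarms share unified dimensions."""
    def action(alarm: Alarm) -> Alarm:
        out = dict(alarm)
        labels = dict(out.get("labels", {}))
        labels[key] = value
        out["labels"] = labels
        return out
    return action


flow = ProcessingFlow(
    condition=lambda a: a.get("source") == "network_monitor",
    actions=[
        rename_field("priority", "level"),
        map_level({"disaster": "P1", "high": "P2"}),
        add_label("domain", "network"),
    ],
)

raw = {"source": "network_monitor", "title": "network retransmission", "priority": "disaster"}
normalized = flow.apply(raw)
```

Because each source gets its own flow, a change in one domain's alarm format only touches that flow rather than a standard shared by every domain.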

CMDB content enrichment to break down information silos

Breaking down the “information silos” between alarms from different sources is an important and challenging task in enterprise IT O&M, and the CMDB data of the enterprise is the best raw material for it. Once CMDB data is integrated, either maintained as static data or fetched through API interfaces, the alarm event processing flow can enrich alarms with CMDB information so that alarms from different domains can be associated along shared dimensions. During alarm handling, the related IT resources can then be linked together to quickly analyze and locate the root cause.
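A minimal sketch of CMDB enrichment follows, assuming a static CMDB snapshot held in a dictionary. The resource IDs, the `business` and `resource_type` attributes, and the function names are invented for illustration; a real CMDB would have its own schema and would more likely be queried through an API.

```python
from collections import defaultdict
from typing import Any, Dict, List

Alarm = Dict[str, Any]

# Hypothetical static CMDB snapshot keyed by resource ID.
CMDB: Dict[str, Dict[str, str]] = {
    "app-host-01": {"business": "trading", "resource_type": "application"},
    "db-host-07": {"business": "trading", "resource_type": "database"},
    "switch-03": {"business": "trading", "resource_type": "network"},
    "app-host-09": {"business": "search", "resource_type": "application"},
}


def enrich(alarm: Alarm, cmdb: Dict[str, Dict[str, str]]) -> Alarm:
    """Copy CMDB attributes of the alarm's resource into its labels."""
    record = cmdb.get(alarm.get("resource"), {})
    labels = dict(alarm.get("labels", {}))
    for key, value in record.items():
        labels.setdefault(key, value)  # labels already on the alarm win
    return {**alarm, "labels": labels}


def group_by_label(alarms: List[Alarm], label: str) -> Dict[str, List[str]]:
    """Group alarm titles by one label so related alarms sit together."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for alarm in alarms:
        groups[alarm.get("labels", {}).get(label, "unknown")].append(alarm["title"])
    return dict(groups)


alarms = [
    {"source": "apm", "resource": "app-host-01", "title": "transaction service timeout"},
    {"source": "db_monitor", "resource": "db-host-07", "title": "slow database response"},
    {"source": "network_monitor", "resource": "switch-03", "title": "network retransmission"},
    {"source": "apm", "resource": "app-host-09", "title": "search latency high"},
]

enriched = [enrich(a, CMDB) for a in alarms]
by_business = group_by_label(enriched, "business")
```

In this toy data, the three alarms from the application, database, and network domains end up in the same `trading` group, which is the kind of cross-domain association that the raw alarms alone do not expose.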

Quickly understand alarm distribution through AI content recognition

With the help of AI content recognition, alarm content is analyzed and classified. O&M personnel can learn the overall distribution of system alarms from global statistics, and can identify the object type and error category of a specific alarm at a glance, shortening the path from symptom to root cause. During follow-up, the intelligent classification information can also serve as reference data for optimizing and improving the IT system.

“Alarm-oriented organizational collaboration”

Beyond standardization, organizational collaboration needs to be flexible enough for alarm handling. Instead of handling alarms based on the organization, build the organization based on the alarm. When an alarm occurs, coordinate the upstream and downstream handlers into a temporary organization for handling that alarm. Members of the temporary organization have permission to handle the alarm. After the alarm is cleared, the temporary organization can be quickly dissolved, which avoids frequent disturbance by alarms and unnecessary spread of fault information.

Contact information is automatically registered with the alarm system

For agile O&M teams, the contact information of the members who handle alarms should not be maintained manually in the alarm system, because manual maintenance does not suit organizations that change frequently. In a good alarm system, each member maintains their own contact information and notification settings. This avoids stale contact information when the organizational structure changes frequently, and accommodates each person’s preferred notification method.

Reuse the existing account system and avoid working with more than one account system

Enterprises usually use an office collaboration IM tool such as DingTalk, Feishu, or WeCom (enterprise WeChat). The alarm handling platform should not use an independent account system. If an enterprise normally works in software such as DingTalk and the alarm system supports handling alarms in DingTalk, the alarm system can easily be added to the production tool chain of the enterprise. Conversely, if the enterprise normally uses DingTalk but the alarm system requires a separate account to log in, two sets of accounts must be maintained, which may cause communication difficulties and delays in handling information.

Flexible permission assignment

When an alarm is generated, the on-duty person should coordinate the required teams and resources to resolve it as soon as possible. After the alarm is cleared, the permissions of the temporarily coordinated members should be reclaimed to ensure service security and prevent information leakage. Among the coordination methods commonly used at work, a group chat is without doubt the best fit for alarm handling. When an alarm occurs, the on-duty person temporarily pulls people into the group to view and handle the alarm. The group thus becomes a natural authorization carrier: whoever is in the group can view and handle alarms, and whoever leaves the group is no longer disturbed by them.
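The "group as authorization carrier" idea can be modeled by treating group membership as the only permission check. The sketch below is illustrative only: the `AlarmGroup` class, its methods, and the user names are hypothetical and do not correspond to any real IM or alarm platform API.

```python
from typing import Set


class AlarmGroup:
    """Hypothetical temporary group: membership is the only permission check."""

    def __init__(self, alarm_id: str, on_duty: str) -> None:
        self.alarm_id = alarm_id
        self.members: Set[str] = {on_duty}

    def invite(self, inviter: str, user: str) -> None:
        """Any current member may pull another person in; joining grants access."""
        if inviter not in self.members:
            raise PermissionError(f"{inviter} is not in the group for {self.alarm_id}")
        self.members.add(user)

    def leave(self, user: str) -> None:
        """Leaving the group revokes access and stops further notifications."""
        self.members.discard(user)

    def can_handle(self, user: str) -> bool:
        return user in self.members

    def dismiss(self) -> None:
        """After the alarm is cleared, dissolve the group and reclaim all access."""
        self.members.clear()


group = AlarmGroup("alarm-001", on_duty="sre_on_duty")
group.invite("sre_on_duty", "dba_colleague")
dba_can_handle_during_incident = group.can_handle("dba_colleague")
group.dismiss()
dba_can_handle_after_dismissal = group.can_handle("dba_colleague")
```

No administrator is involved in granting or revoking access here, which is the point of the design: the person handling the incident decides who joins, and access disappears together with the group.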

Rich extensibility

Multiple collaboration tools may be used at the same time during teamwork. For example, important alarms call for a postmortem review, and the review may produce tasks intended to resolve the underlying cause. This process may involve other tools, such as collaborative documentation tools and project management tools. The alarm system needs to connect to these systems easily and integrate into the chain of enterprise office tools.

Based on the ideas above, Alibaba Cloud has turned them into a product and deeply integrated it with ARMS monitoring to provide customers with a more complete alarm and monitoring system.

Core advantages of the ARMS alarm O&M center

Connect to 10+ monitoring data sources

ARMS already provides data sources such as application monitoring, user experience monitoring, and Prometheus, and seamlessly connects to a series of data sources commonly used on the cloud, such as Log Service and CloudMonitor. Users can connect most alarms with one click.

Strong alarm correlation capability

Built on ARMS APM capabilities, the center can quickly correlate common alarms and automatically output fault analysis reports.

ChatOps capability built on DingTalk

There is no need to import the organizational structure and no need for cloud accounts. Alarm events can be assigned and claimed in a DingTalk group, greatly improving O&M efficiency.

Drawing on Alibaba’s fault management experience, the center provides in-depth analysis of alarm data and continuously improves alarm effectiveness.

Core scenarios

Core scenario 1: Integration of multiple monitoring systems

ARMS already integrates most monitoring systems on the cloud out of the box. It also supports user-defined data sources.

Core scenario 2: Alarm compression

ARMS provides 20+ rules based on common alarms to help users quickly compress alarm events, and also supports custom compression rules.
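The source does not describe how the ARMS compression rules work internally. As a generic illustration of alarm compression, the sketch below merges alarms that share a fingerprint and arrive within a time window into a single event with a counter. The fingerprint fields (`service`, `title`), the 300-second window, and the `ts` timestamp field are assumptions made for the example.

```python
from typing import Any, Dict, List, Sequence, Tuple

Alarm = Dict[str, Any]
Event = Dict[str, Any]


def compress(alarms: List[Alarm],
             keys: Sequence[str] = ("service", "title"),
             window_seconds: int = 300) -> List[Event]:
    """Merge alarms with the same fingerprint that arrive within the window.

    Each alarm is assumed to carry a numeric "ts" field (seconds).
    """
    events: List[Event] = []
    open_events: Dict[Tuple[Any, ...], Event] = {}
    for alarm in sorted(alarms, key=lambda a: a["ts"]):
        fingerprint = tuple(alarm.get(k) for k in keys)
        event = open_events.get(fingerprint)
        if event is not None and alarm["ts"] - event["last_ts"] <= window_seconds:
            # Same problem seen again recently: count it instead of re-notifying.
            event["count"] += 1
            event["last_ts"] = alarm["ts"]
        else:
            event = {
                "fingerprint": fingerprint,
                "first_ts": alarm["ts"],
                "last_ts": alarm["ts"],
                "count": 1,
            }
            open_events[fingerprint] = event
            events.append(event)
    return events


sample = [
    {"service": "trade", "title": "timeout", "ts": 0},
    {"service": "trade", "title": "timeout", "ts": 60},
    {"service": "db", "title": "slow response", "ts": 30},
    {"service": "trade", "title": "timeout", "ts": 120},
    {"service": "trade", "title": "timeout", "ts": 1000},
]
compressed = compress(sample)
```

With this sample, the five raw alarms collapse into three events: the first three `trade` timeouts become one event with a count of 3, while the timeout at `ts=1000` opens a new event because it falls outside the window.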

Core scenario 3: Configure multiple notification channels

Alarms are handled and assigned in a DingTalk group.

Core scenario 4: Analyzing alarm data

Core scenario 5: Out-of-the-box intelligent noise reduction

Automatically identifies alarms with low entropy.
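The source does not say how entropy is computed. One plausible reading, shown here purely as an assumption, is that an alarm which repeats constantly carries little information: its self-information, -log2(p), where p is the share of that alarm in recent history, is low. The sketch below scores alarm titles this way; the 1-bit threshold and the function names are arbitrary choices for the example.

```python
import math
from collections import Counter
from typing import Dict, List


def information_bits(titles: List[str]) -> Dict[str, float]:
    """Self-information -log2(p) of each title, with p its share of the history."""
    counts = Counter(titles)
    total = sum(counts.values())
    return {title: -math.log2(count / total) for title, count in counts.items()}


def low_information_titles(titles: List[str], threshold_bits: float = 1.0) -> List[str]:
    """Titles seen so often that one more occurrence tells the team very little."""
    bits = information_bits(titles)
    return sorted(title for title, b in bits.items() if b < threshold_bits)


# Hypothetical history: one alarm dominates, two others are rare.
history = ["disk usage above 80%"] * 14 + ["transaction service timeout", "slow database response"]
noisy = low_information_titles(history)
```

In this invented history, the disk usage alarm makes up 14 of 16 entries and scores well under 1 bit, so it is flagged as noise, while each rare alarm scores 4 bits and is kept.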

This article is the original content of Aliyun and shall not be reproduced without permission.