Abstract: A look at the needs and challenges of enterprise operation and maintenance (O&M), and at how Huawei AIOps addresses them.

This article is shared from the Huawei Cloud community post “AIOps? New Power of Enterprise Operation and Maintenance!”, original author: Qiming.

As is customary, let’s first introduce the concept of AIOps. AIOps, or Artificial Intelligence for IT Operations, applies artificial intelligence to the field of operation and maintenance. Based on existing operation data (logs, monitoring information, application information, etc.), it uses machine learning to solve problems that automated operation and maintenance alone cannot solve.

Gartner predicts that current IT applications will change dramatically, as will the way the entire IT ecosystem is managed. Key to these changes is what Gartner calls the AIOps platform.

Today we will talk about the requirements and challenges facing AIOps and how we are addressing them.

AIOps requirements and challenges

(I) New technologies and new challenges call for highly intelligent telecommunication networks

In recent years, new technologies represented by 5G have been rapidly applied in telecommunication networks. They bring many benefits, such as massive connectivity, low latency, and high speed. With 5G, each of these metrics has improved by at least one order of magnitude.

However, the growth in data volume is accompanied by greater operation and maintenance difficulty, which brings the following challenges:

1. Network complexity:

As the scale of data increases, the network becomes more complex. New technologies are adopted quickly, but old technologies are not phased out at the same pace, so each new technology adds to the existing complexity. In some scenarios the complexity even multiplies.

For example, in the wireless field, 2G, 3G, 4G, and 5G coexist as “four generations under one roof”. In the core network, some ten domains coexist, including PS/CS/MS, the Internet of Things, and so on. Such high network complexity is bound to bring considerable challenges to operation and maintenance.

2. New 2B requirements

The second O&M challenge is the new To B scenario, that is, enterprise applications. 5G has promoted intelligent manufacturing, and the network is gradually being integrated into enterprises’ production and manufacturing processes. As a result, requirements for network reliability inevitably rise: once the network fails, production may be affected or even interrupted, causing great losses.

3. Cost pressure

Cost pressure mainly stems from the first two challenges, which result in either a more complex network or higher requirements. Handling them with traditional operation and maintenance methods will inevitably lead to a sharp rise in costs. Another factor driving up costs is energy consumption: 5G uses far more energy than 4G.

How do we address these challenges? AI technology is key.

(II) AI is a key technology to improve the automation and intelligence of telecommunications networks

In terms of operation and maintenance costs, statistics show that 90% of operation and maintenance work relies on manual labor, and 70% of the costs are labor costs. A natural idea, then, is to use AI technology to reduce labor costs and improve operation and maintenance efficiency.

For example, I just mentioned 5G energy consumption. Can we use artificial intelligence to reduce energy consumption? Judging from past practical experience, the answer to the above question is yes.

Next, let’s take three examples to illustrate.

1. Base station energy saving

The first example is base station energy saving. Base stations consume a great deal of power. In the early stage of network deployment, a base station serves few users, yet it is often kept running at full capacity all the time. The solution is to forecast traffic: if traffic can be predicted accurately, energy can be saved by switching off some carriers during low-traffic periods. According to statistics, using an LSTM neural network for traffic forecasting can save more than 10% of energy.
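To make the “forecast, then switch off carriers” idea concrete, here is a minimal sketch. It is not the LSTM model mentioned above: as a stand-in, it uses a seasonal-naive forecast (the average of the same hour over previous days). The function names, the traffic figures, and the threshold of 15 are all made up for illustration.

```python
from statistics import mean


def forecast_next_day(history, period=24):
    """Seasonal-naive forecast: average each hour of the day over past days.

    This is a simple stand-in for the LSTM forecaster (an assumption).
    """
    if len(history) < period or len(history) % period != 0:
        raise ValueError("history must contain whole days of hourly samples")
    days = [history[i:i + period] for i in range(0, len(history), period)]
    return [mean(day[h] for day in days) for h in range(period)]


def carrier_schedule(forecast, low_traffic_threshold):
    """Return True for each hour in which a carrier could be switched off."""
    return [load < low_traffic_threshold for load in forecast]


# Two days of made-up hourly traffic (arbitrary units): quiet at night, busy by day.
day1 = [5, 4, 3, 3, 4, 8, 20, 40, 60, 70, 75, 80,
        78, 76, 74, 70, 72, 75, 65, 50, 35, 20, 10, 6]
day2 = [7, 4, 5, 3, 4, 10, 22, 44, 62, 68, 77, 82,
        80, 74, 76, 72, 70, 73, 67, 48, 33, 22, 12, 8]

forecast = forecast_next_day(day1 + day2)
schedule = carrier_schedule(forecast, low_traffic_threshold=15)
off_hours = [hour for hour, off in enumerate(schedule) if off]
print(off_hours)  # hours of the day in which a carrier could sleep
```

With this sample data the candidate hours fall in the night. A real deployment would also need guard rules, for example waking the carrier as soon as actual traffic exceeds the forecast.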

2. Core network KPI anomaly detection

The second example is anomaly detection: a KPI anomaly detection service is deployed on the carrier’s core network. The original anomaly detection service used a fixed threshold to raise alarms. AI technology, by contrast, identifies anomalies more intelligently, promptly, and accurately.
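The difference between a fixed threshold and a data-driven baseline can be sketched as follows. This is only an illustration: the adaptive detector here is a rolling mean and standard deviation (z-score) check, a simple statistical stand-in rather than the algorithm used by the service, and the KPI values, window size, and limits are invented.

```python
from statistics import mean, stdev


def fixed_threshold_alarms(kpi, threshold):
    """Indices where the KPI exceeds a static threshold."""
    return [i for i, value in enumerate(kpi) if value > threshold]


def rolling_zscore_alarms(kpi, window=5, z_limit=3.0):
    """Indices where the KPI deviates strongly from its recent baseline."""
    alarms = []
    for i in range(window, len(kpi)):
        baseline = kpi[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        # Skip a perfectly flat baseline, where the z-score is undefined.
        if sigma > 0 and abs(kpi[i] - mu) / sigma > z_limit:
            alarms.append(i)
    return alarms


# Made-up KPI series (for example, a failure rate in percent) with a jump at index 8.
kpi = [1.0, 1.2, 0.9, 1.1, 1.0, 1.1, 0.9, 1.0, 3.0, 1.0, 1.1]

print(fixed_threshold_alarms(kpi, threshold=5.0))  # the static limit misses the jump
print(rolling_zscore_alarms(kpi))                  # the baseline check flags it
```

The static limit of 5.0 never fires because the jump stays below it, while the baseline check flags the point because it is far outside the recent range of the series.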

3. Fault identification and root cause location

Once a fault occurs on the network, a large number of alarms are triggered, and the system dispatches operation and maintenance work orders based on those alarms. A single fault can cause multiple network elements to report alarms, which may result in work orders being dispatched in multiple domains (wireless domain, transport domain, etc.).

(III) AI application development still faces challenges: high threshold and long development cycle

From the above three examples, we can see that AI is relatively reliable. But if AI is so reliable, why isn’t it being applied quickly and across the board? Because AI development still faces many challenges, which can be summarized in two phrases: high threshold, long cycle.

Above is a research report from Gartner. It analyzes the main obstacles to AI adoption from four dimensions. The top three are:

  • Staff skills

  • Understanding benefits and uses

  • Data scope and quality

This brings us back to our two phrases: high threshold, long cycle.

1. High threshold

The first aspect of the high threshold is the shortage of AI algorithm developers. The average O&M team does not have a dedicated AI algorithm developer, which inevitably leads to a lack of AI skills.

But this is not the most critical issue, because the shortage of AI personnel can be addressed through training, recruitment, and other means.

The most critical issue, and the second aspect, is that it is difficult to combine algorithms with the business. To build a good application, it is best to start from the business and choose an algorithm that fits its actual situation. In practice, this requires first a business expert with a deep understanding of operations, and second an algorithm expert with AI expertise. The two then need enough time and willingness to sit down and talk in depth, and it is time and willingness that get in the way.

The third aspect is data, which involves two problems: engineering and annotation. Developing an AI application is a considerable amount of engineering work: it must ingest huge amounts of multi-modal data, complete model training and inference, and finally present the results, including integrating with existing systems. Therefore, in addition to operation and maintenance experts and algorithm experts, many engineering developers are also needed.

2. Long cycle

The high development threshold determines the long development cycle: if such a high threshold is not well addressed, the cycle is bound to be particularly long. Long development cycles lead to two problems:

First, understanding the benefits and uses. What does this mean? If results do not appear for a long time, corporate decision makers may become skeptical of what AI can do.

Second, the longer a project runs, the higher people’s expectations of it. If the same thing achieves the same effect, say a 5% reduction in time to repair, the evaluation may be completely different depending on whether it took two years or one month to build.

To address the challenges encountered in implementing AIOps, Huawei launched the AIOps service. Now let’s take a look at what the AIOps service is and how it addresses these challenges.

Huawei AIOps service

The picture above shows the overall framework of the AIOps service. AIOps is divided into four layers from the bottom up:

Layer 1: Data collection and governance. Data collection and governance sounds easy but is hard to do. Why? Because there are many data types to handle, and interfaces and data formats are not uniform. Adapting to this data alone can be exhausting. The Huawei AIOps service first supports common interfaces and then presets support for some common devices, so that it can ultimately achieve automatic interconnection and automatic data governance.

Layer 2: AI atomic capabilities. Huawei AIOps has more than 20 atomic capabilities, covering detection, prediction, identification, and diagnosis. An atomic capability is not just an implementation of an AI algorithm. Each one has been validated with actual site data and optimized for specific operational scenarios. Each also incorporates Huawei’s accumulated operation and maintenance experience, and some can even be used directly without training.

Layer 3: Choreography. This includes process choreography, large screens, and RPA. Atomic capabilities are the basic components of AIOps intelligent operation and maintenance. Process choreography is simple and flexible: you only need to drag components from the component library and combine AI operation and maintenance capabilities to complete end-to-end graphical choreography of a scenario. This lowers the development threshold for partners and helps them build AI applications efficiently.

Layer 4: Industry AI apps. These work out of the box for the most typical scenarios. Large screens are built from a rich set of 2D and 3D visualization components: more than 30 chart controls covering line charts, topology, lists, bar charts, and other styles, plus multiple map controls, interactive controls, and media controls.

To build an operation and maintenance large screen, you only need to drag controls from the component library, combine and lay them out as needed, and configure them flexibly to produce reports that assist monitoring and analysis. For example, a DIY service health monitoring screen for the hall can display the interfaces together with their average success rate, average latency, failure rate, and number of calls. A KPI alarm list is also provided as a fault warning reference for operations personnel. Drag in the required controls and customize their style, data, and interaction to meet the display requirements. Back-end data can also come from the intermediate data defined during app choreography. After the configuration is complete, you can preview and publish the large screen with one click.

(I) RPA helps AIOps connect with existing operation and maintenance systems

Beyond being displayed, inference results must also help with troubleshooting. At this stage, they are generally fed into existing systems, such as work order systems (where work orders are sent to the handler’s email address), automatic response, and questionnaires. Doing this integration manually is time-consuming, laborious, and error-prone, so robotic process automation (RPA) is a natural fit. The RPA service can handle data integration, data transfer, and work order issuance, reducing manual effort and the cost of errors.

(II) 10+ out-of-the-box apps, supporting rapid deployment

For some of the most typical scenarios, Huawei Cloud AIOps has prepared the choreography in advance: there are more than a dozen out-of-the-box apps for scenarios such as campus networks, DC networks, IT applications, and operator networks. Deployment is flexible, supporting public cloud, HCS, on-premises deployment, and cloud/ground collaboration. The ecosystem is open: partners can develop industry apps and publish AI applications to the AI market, cooperating to build a network AI ecosystem.

Let’s use the “KPI Exception Detection” App to demonstrate how to use an out-of-the-box App.

Step 1: Import the NE list.

Step 2: Configure performance and alarm data sources.

Step 3: Associate the data source with the App.

Step 4: Start the App.

Step 5: View the large screen and analyze the fault.

AIOps enables intelligent network operation and maintenance (O&M) on the campus

So how does AIOps solve real operation and maintenance problems on a campus network?

(I) Construction and maintenance modes of the campus network

The figure above shows two construction and maintenance modes of campus network:

2B and 2C share the OMC of the large network: this is the current mainstream mode. Enterprises rent wireless and other equipment from carriers. The problem with this mode is that terminals are maintained by the enterprise while the network is maintained by the operator, so it is difficult to assign responsibility when problems occur. Another problem is that the carrier’s operation and maintenance capability and organization, built around the 2C O domain of the large network, cannot support the high SLA requirements of the enterprise intranet or growing customer demands.

2B and 2C use separate OMCs (EMS): the enterprise purchases and maintains all devices, such as 5G CPE, wireless, and core networks, with an end-to-end view. Judging from documents issued by the Ministry of Industry and Information Technology, cases such as VDF and Audi Park, and enterprises’ need for SLA guarantees, it will gradually become mainstream for enterprises to build their own 5G networks on spectrum rented from operators or on dedicated spectrum.

(II) Analysis of business scenarios and pain points: campus customers need easy-to-use network operation and maintenance with multi-domain integration

1. Typical network status

The figure above shows a common video detection service on a campus. Even for one of the most common services, about a dozen network elements are involved, from 5G wireless to transmission to edge computing, and even the core network.

2. Campus applications

The figure above lists some common campus applications, including edge AI detection, intelligent logistics, and indoor positioning. As in the previous figure, even a simple service involves multiple domains.

So what is the difference between campus and operator operation and maintenance? There are three main points:

Users: Lack of professional communication knowledge and weak network operation and maintenance ability;

Network: the networking is relatively simple but spans multiple domains, such as wireless, transmission, data communication, and IT;

SLA: The production system network has high end-to-end SLA requirements: 99.99%, 7x24.
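To see how strict this is, the following sketch converts the SLA figure into a downtime budget. It assumes that 99.99% refers to availability measured around the clock, which the text does not state explicitly, and the function name is made up for illustration.

```python
def allowed_downtime_minutes(availability, period_hours):
    """Maximum downtime in minutes that still meets the availability target."""
    return (1.0 - availability) * period_hours * 60


# 7x24 operation: every hour of the period counts toward the SLA.
per_year = allowed_downtime_minutes(0.9999, 365 * 24)
per_30_days = allowed_downtime_minutes(0.9999, 30 * 24)
print(f"{per_year:.2f} min per year, {per_30_days:.2f} min per 30 days")
```

Under that assumption the budget is under an hour per year, or a few minutes per month, which leaves very little room for slow, manual fault location.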

Therefore, customers doing campus operation and maintenance face the following pain points:

Skills: The introduction of 5G 2B makes the network more complex, and enterprise engineers lack relevant skills, making operation and maintenance difficult;

Tools: Due to the lack of effective O&M tools, complex network fault locating requires cross-domain expert on-site consultation, which is costly and time-consuming.

To sum up, cross-domain devices on the campus network need data fusion to support end-to-end analysis and presentation, and ultimately unified operation and maintenance of the enterprise ICT infrastructure. The campus network involves many network devices with fuzzy boundaries, so it needs unified cross-domain fault location and demarcation capabilities to speed up the location of production network problems.

(III) Traditional manual, tool-based operation and maintenance cannot meet the new requirements of the campus network, and intelligent transformation is urgently needed

According to the data in the figure above, we can see:

Passive operations: 75% of problems are discovered by users rather than detected proactively, and problems discovered by users are more likely to lead to complaints.

Low degree of automation: 70% of an enterprise’s operating cost is labor cost, and costs are surging.

Difficulty in troubleshooting: 90% of fault recovery time is spent on fault location; the actual repair takes very little time.

Thus, in terms of both efficiency and effect, there is demand for introducing artificial intelligence to solve these problems and enable automatic closed-loop prediction, analysis, and decision-making in network operation and maintenance.

(IV) Cross-domain fault location algorithm flow

The figure above shows the algorithm flow of cross-domain fault location. The whole process is as follows:

Input:

  • Alarms: alarms reported by devices.

  • Topo: the network topology structure.

  • Fault propagation diagram: Indicates the influence relationship between alarms.

Process introduction:

  • Noise reduction: Filter out the large number of invalid alarms in the original alarms, such as those caused by flash (transient) interruptions and oscillation.

  • Aggregation: Divide the alarms into multiple alarm groups by separating alarms that are unrelated in the Topo and aggregating alarms that may be related to the same fault.

  • Identification and location: Analyze each alarm group based on the Topo and the fault propagation diagram to identify the faults in each alarm group, and identify the root-cause NEs and root-cause alarms for each fault.

  • Diagnosis: Determine the fault type of each fault, for example, power interruption.

Output:

  • The root cause of the fault

  • The alarms involved in the fault

  • The fault type

  • Fault Recovery Suggestions
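The flow above can be sketched in a few dozen lines. This is a toy illustration under stated assumptions, not the algorithm used by the service: noise reduction is a simple duration filter, aggregation is a connected-components search over alarmed NEs, and identification treats an alarm as a root if no other alarm in its group can cause it. All names (`Alarm`, `reduce_noise`, `aggregate`, `locate_roots`, `diagnose`), alarm names, and data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    ne: str          # network element that reported the alarm
    name: str        # alarm name
    duration_s: int  # how long the alarm stayed active


def reduce_noise(alarms, min_duration_s=10):
    """Noise reduction: drop short-lived (flash) alarms."""
    return [a for a in alarms if a.duration_s >= min_duration_s]


def aggregate(alarms, topo):
    """Aggregation: group alarms whose NEs are linked in the topology."""
    alarmed = {a.ne for a in alarms}
    groups, seen = [], set()
    for start in sorted(alarmed):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            ne = stack.pop()
            if ne in component:
                continue
            component.add(ne)
            # Follow only links between NEs that currently report alarms.
            stack.extend(n for n in topo.get(ne, ()) if n in alarmed)
        seen |= component
        groups.append([a for a in alarms if a.ne in component])
    return groups


def locate_roots(group, propagation):
    """Identification: a root alarm is not caused by another alarm in the group."""
    names = {a.name for a in group}
    caused = {effect for cause in names for effect in propagation.get(cause, ())}
    return [a for a in group if a.name not in caused]


def diagnose(root, fault_types):
    """Diagnosis: map a root alarm to a fault type and a recovery suggestion."""
    return fault_types.get(root.name, ("unknown", "escalate to an expert"))


# Hypothetical inputs: symmetric topology, propagation graph, and alarms.
topo = {
    "router-1": ["gnb-1", "core-1"], "gnb-1": ["router-1"], "core-1": ["router-1"],
    "router-2": ["gnb-2", "gnb-3"], "gnb-2": ["router-2"], "gnb-3": ["router-2"],
}
propagation = {"POWER_DOWN": ["LINK_DOWN"], "LINK_DOWN": ["SESSION_DROP"]}
fault_types = {"POWER_DOWN": ("power interruption", "check the power supply")}
alarms = [
    Alarm("router-1", "POWER_DOWN", 900),
    Alarm("gnb-1", "LINK_DOWN", 880),
    Alarm("core-1", "SESSION_DROP", 870),
    Alarm("gnb-2", "LINK_DOWN", 2),           # flash alarm, filtered as noise
    Alarm("gnb-3", "HIGH_TEMPERATURE", 600),  # unrelated to the first fault
]

results = []
for group in aggregate(reduce_noise(alarms), topo):
    for root in locate_roots(group, propagation):
        fault_type, suggestion = diagnose(root, fault_types)
        results.append({
            "root_ne": root.ne,
            "root_alarm": root.name,
            "related_alarms": [a.name for a in group],
            "fault_type": fault_type,
            "suggestion": suggestion,
        })

for result in results:
    print(result)
```

Each result carries the four outputs listed above: the root cause (NE and alarm), the alarms involved, the fault type, and a recovery suggestion. In this sample, three alarms across three NEs collapse into a single power fault on `router-1`.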

(V) Implementing the algorithm flow with the AIOps framework

The whole algorithm process is explained above. Next, let’s see how to use Huawei AIOps framework to realize the algorithm process.

1. Quickly configure data sources and choreograph processes

Configure data sources: Connect alarms in multiple domains, such as wireless, transmission, and core networks, to network topology data.

Process choreography: Use existing atomic capabilities for rapid process choreography.

After the above process, you can complete the “event notification” function and save the results to a recordset (that is, a database) for large screen display. The renderings are as follows:

If you open one of the alarms, the following information is displayed:

AIOps deployment recommendations

Based on the above practices, we can summarize the following:

1. Select mature scenarios and deploy AIOps step by step

After long-term practice, the main reasons for AIOps deployment failure are summarized as follows:

Poor data: data is scattered across independent systems, with no comprehensive means of collection and management. Missing data and low data quality are the main reasons for the poor effect of AIOps.

Commands cannot be delivered: automated operation and maintenance tools are lacking, so active detection and recovery operations cannot be carried out.

The model is not intelligent: annotation information from daily operation and maintenance is not effectively accumulated, so the model cannot learn by itself.

Therefore, based on the deployment failures, we can conclude that to successfully deploy AIOps, we need:

  • Promote AIOps deployment step by step, starting from mature scenarios where the conditions are met;

  • Collect all kinds of operation and maintenance data comprehensively to improve data quality;

  • Connect the AIOps back end with automated operation and maintenance tools to strengthen diagnostic means and automatic recovery capabilities;

  • Accumulate annotation data effectively, so that the AIOps model constantly receives feedback and gains self-learning ability.

2. Select a mature AIOps service

For different types of enterprises, the choice of AIOps services is also different, as shown in the following table:

Huawei AIOps lowers the threshold for developing network AI applications and speeds up their rollout. It has accumulated 10+ out-of-the-box smart apps, covering fields such as carrier networks, campus networks, data center networks, and IT applications. It pre-integrates rich AI atomic capabilities, covering fault prediction, detection, diagnosis, and identification. It also lets users develop AI applications with zero coding to improve operation and maintenance efficiency.

Interested? Come and try it: www.hwtelcloud.com/products/ai…
