Zhao Jianchun: Chairman of the Tencent Technology Operation Channel, Assistant General Manager of the Tencent Social Network Operation Department, and a core author of the AIOps white paper

The topic I want to talk about today, AIOps, is a relatively new one. It has only been about a year since the concept was put forward and we started building toward it. Every new thing has its development cycle: at Tencent we have done a great deal of exploration, but there are still shortcomings, just as AI itself still has many shortcomings. Today I have brought some cases to share with you, and I hope they offer some useful reference.


First, start with an NLP story

I’d like to start with a small NLP story.

In the 1930s and 1940s, there were many attempts to make machines understand natural language, first through rule-based approaches such as the syntax tree on the left, and then gradually through statistical approaches.

After the 1970s, rule-based syntactic analysis gradually came to an end.

In 1972, Frederick Jelinek, a master of natural language processing, joined IBM. Around 1974 he proposed speech recognition based on statistical methods at IBM, and since then the records for speech recognition accuracy have been broken again and again.

In 2005, Google's statistics-based translation system outperformed rule-based translation, tearing down the last bastion that the rule-based approach still held.

The scholars and experts who had devoted themselves to rule-based natural language processing before the 1970s were destined to see their approach fade.

The reason I use this story as an introduction is that our operations environment is similar: every year a large number of developers join and write a large amount of code for us.

As business volume grows, the number of devices keeps increasing, the system gets bigger and bigger, and complexity grows exponentially. All of this puts pressure on us, and our monitoring logs and other data are also massive.

So I think operations systems resemble natural language processing: the language is very complex, and the scale is very large.

Second, O&M experience is essentially a set of rules. In our operations environment, many automated O&M systems are implementations of rule sets. Rules have the advantage of being easy to understand, but they often miss scenarios.

Rules are ultimately written by people, and a person facing a huge amount of data is easily overwhelmed. AIOps is not a replacement for DevOps; it is an aid and a supplement to DevOps, a process of applying AI to transform the rule-based parts of DevOps.

Rules are our experience, and also our burden; just like the experts before the 1970s, we need a transformation.

So what is AI?

AI is about deriving patterns (models) that make accurate predictions from a large number of inputs. We take many inputs such as x1, x2, x3, estimate some parameters such as a, b, c, d or w and b, and then make predictions on new data: numerical values, 0/1 classifications, or probabilities. There is also reinforcement learning, another way of gathering data and doing statistics: with each probe you learn how much success and benefit it brought, whether positive, negative, or zero.

In fact, as long as we have enough data, we can repeatedly probe and get feedback.

Scenarios where AI is easy to apply have clear characteristics: the features are clear, positive and negative examples are easy to extract (that is, what is right and what is wrong can be separated), so we can classify well and there is continuous feedback.

Data + algorithm + training can produce a model, which is different from the earlier rules and can be thought of as a model with memory.
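To make that concrete, here is a minimal, self-contained sketch (synthetic data, not any Tencent model) of "parameters learned from inputs": a logistic model whose weights w and intercept b are fitted from examples and then used for 0/1 and probabilistic predictions.

```python
# Minimal sketch of "parameters learned from inputs": a logistic model
# y_hat = sigmoid(w1*x1 + w2*x2 + w3*x3 + b), fitted from labeled examples.
# All data here is synthetic; it only illustrates the idea in the text.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                                # inputs x1, x2, x3
y = (X @ np.array([1.5, -2.0, 0.5]) + 0.3 > 0).astype(int)    # hidden rule to learn

model = LogisticRegression().fit(X, y)      # estimating the parameters w and b
print(model.coef_, model.intercept_)        # learned w1..w3 and b
print(model.predict(X[:5]))                 # 0/1 predictions
print(model.predict_proba(X[:5])[:, 1])     # probabilistic predictions
```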

However, when I first try to adopt AIOps, I run into problems: my team may be small, or it may lack algorithm experts; and even if I use someone else's algorithm or model, I still want to understand how that algorithm works.

The last difficulty is that both the provider of the algorithm and its user are reluctant to hand over their data for fear it will leak to the other side; both sides share this fear.

The rules in a traditional operations environment can be thought of as APIs, or as hand-written processing logic. They rarely change; because they are written by people, they are easy to understand; they are summarized by experts and have nothing to do with the data; they sit there as if-else, case/switch, and so on.

AI, however, as mentioned above, is essentially a set of APIs with memory. Where does this memory come from? It depends on the data and is learned statistically from the data.

Therefore, this model is constantly changing and very complex. It may be the decision paths of a decision tree, regression parameters, or the network structure and path weights of a neural network.

Because of this complexity, and because these structures, weights, and parameters are produced by training rather than written by people, the model is hard to understand.

Second, from API to learnware

With AIOps we can make a shift from APIs to learnware. The concept of "learnware" was proposed by Professor Zhou Zhihua of Nanjing University, who is at the leading level of AI research in China and is very impressive. A learnware can keep learning from data and keeps getting better as more data is added, and its algorithm is public, so you can also see how it works.

You can also take it and use it: I train the model with my data and give it to you, but I do not give you the data; I give you the parameters, the network structure, and so on. Not handing over the data solves the data-security problem.

You can also retrain and improve the model with your own data for your own environment, so it is evolvable. The algorithm is also publicly accessible and reusable, which solves some of the problems mentioned above.

This is the capability framework from the AIOps white paper we wrote with industry colleagues a while back; I won't go into the details.

The general idea is this: at the bottom are various machine learning algorithms. Combining these algorithms with the scenarios of our real environment, we train individual AIOps learnwares that can solve single-point problems. The single points are then connected in series to form AIOps application scenarios, and eventually an intelligent scheduling model, to address cost, quality, efficiency, and the other concerns of our operations work.

The five-level classification of AIOps:

  • Level 1, trial application

  • Level 2, single-point application

  • Level 3, serial application (connecting single points)

  • Level 4, intelligent solutions to most of the more important serial-scenario problems

  • Level 5, since we are talking about AI, we still hope to dream bigger: can we have an intelligent operations brain, like Skynet in the movies, that achieves multi-objective optimization of quality, cost, and efficiency?

Take a recommendation scenario: I want the user base to keep growing, I want user activity to be higher and higher, and at the same time I want consumption levels to keep rising. But these three goals conflict with one another, just as our quality and cost conflict. We hope to strike a better balance among multiple objectives, and at the highest level even optimize multiple objectives simultaneously.

In general, AIOps is expected to be a complement to DevOps, progressing from single points to series to intelligent scheduling, to solve the cost, quality, and efficiency problems in operations.

Our team has made some practical and theoretical explorations and attempts together with the efficient operations community. Today I also hope to share with you several single points connected in series, along the dimensions of cost, quality, and efficiency.

Third, we share practical cases

1. Single-point case: Cost – intelligent cool-down of memory storage

The first point is cost, namely intelligent cool-down of memory storage. Because ours is a social network business with many users and heavy access, the team prefers in-memory KV storage.

At launch the request volume may be high, but as time goes on the data volume keeps growing while the access density drops, which puts great pressure on our costs.

So we think of cooling the data down. Before, the familiar rule was to demote data according to its most recent access time; but if you think about it, using only one indicator, the last access time, as the feature for analysis is far from enough.

We sampled dozens of features from each type of data, such as periodic changes in access heat, which are shown here, along with others not listed.

Then, drawing on the experience of our colleagues, who had handled many data items manually before and therefore knew which items could be demoted to cold storage, we labeled the data and trained logistic regression and random forest models on it. In essence this is classification; most machine learning is classification.

After classification, logistic regression (LR) is shown on top and the random forest below. The random forest with 30 trees works best, because random forest is a bagging method, which improves stability.
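As a hedged illustration of this step (the CSV file, feature columns, and "can_demote" label below are hypothetical names, not the team's real data), one plausible way to train and compare the two classifiers with scikit-learn:

```python
# Hedged sketch of the cool-down classifier: train logistic regression and a
# 30-tree random forest on per-record features, keep whichever validates better.
# File name, feature set, and label column are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("kv_item_features.csv")    # hypothetical export of sampled features
feature_cols = [c for c in df.columns if c != "can_demote"]   # label: 1 = safe to sink to disk
X_train, X_test, y_train, y_test = train_test_split(
    df[feature_cols], df["can_demote"], test_size=0.2, random_state=42)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=30, random_state=42).fit(X_train, y_train)  # bagging of 30 trees

for name, clf in [("logistic regression", lr), ("random forest", rf)]:
    print(name)
    print(classification_report(y_test, clf.predict(X_test)))
```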

The end result was that we sank almost 90% of the data onto hard disks without losing any requests; the data remaining on SSD did not come under access pressure, and you can see that the demotion was very precise.

Moreover, the data latency and success rate in this case barely changed. Previously, colleagues configured the sinking manually, which was very inefficient; this module improves sinking efficiency by 8 to 10 times. That is the first case, on cost.

2. Single-point case: Quality – unified monitoring with thresholds removed

For quality, as you can see, removing thresholds from unified monitoring makes a lot of sense. There are two kinds of monitoring curves. The first is the success rate: it should be roughly a straight line, normally around 100%, but it drops when something goes wrong.

The second is something like a cumulative curve, or a CPU curve, which is very variable.

Before, we might have configured alarms by setting thresholds: a maximum and a minimum.

But the curve keeps changing, the maximum and minimum keep changing, it is highly variable, and such thresholds are very hard to set.

We handle the two kinds in different ways. The first is the success rate, for which we use the 3-sigma method. It comes from industry, where it is used to control product defect rates: within 3 sigma, 99.7% of items are good. In our alarm calculation, the points that fall outside the normal range are the "defectives", and those are what we find.
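A minimal sketch of that 3-sigma check, assuming a per-minute success-rate series and a one-day trailing window (both assumptions of mine, not stated in the talk):

```python
# Minimal 3-sigma check for a success-rate series: points more than 3 standard
# deviations below the recent mean are flagged as potential failures.
# The one-sided test and the 1440-point window are illustrative assumptions.
import numpy as np

def three_sigma_anomalies(series, window=1440):
    """Return indices where the value drops below mean - 3*std of the trailing window."""
    series = np.asarray(series, dtype=float)
    anomalies = []
    for i in range(window, len(series)):
        ref = series[i - window:i]
        mu, sigma = ref.mean(), ref.std()
        if series[i] < mu - 3 * sigma:   # ~99.7% of normal points stay inside 3 sigma
            anomalies.append(i)
    return anomalies
```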

The second step uses an isolation forest. Normal points look alike and are hard to separate, so it takes many splits before they are isolated into a leaf node; abnormal points are isolated much earlier. We can therefore see the gap: points that land in relatively shallow leaves are the abnormal ones.
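A small sketch of this second filter using scikit-learn's IsolationForest; the contamination rate is an illustrative assumption:

```python
# Sketch of the second filter: an Isolation Forest scores each point by how few
# random splits are needed to isolate it; points isolated in shallow leaves are
# flagged as anomalies. The contamination rate here is an assumed value.
import numpy as np
from sklearn.ensemble import IsolationForest

def isolation_filter(values, contamination=0.01):
    X = np.asarray(values, dtype=float).reshape(-1, 1)
    iso = IsolationForest(contamination=contamination, random_state=0).fit(X)
    return iso.predict(X) == -1   # True where the forest marks the point as an outlier
```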

Step one finds anomalies with a statistical method and step two with an unsupervised method. In the last step, we add some rules to make the alarms more reliable; these rules decide when to alarm and when to recover. Since this logic is still rule-based, we will further transform it with AI in the future.

For monitoring the second kind of curve, at present we just use a wider band, because the values are not normally distributed; a curve is a curve. We take the curve and apply 3 sigma in segments, one segment per hour, using samples from the same hour over the last 7 days.

There are also curves that we fit with polynomials. So 3 sigma, the statistical method, and polynomial fitting together form the first step, which is equivalent to the multi-path recall in a recommendation system.
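A hedged sketch of this first "recall" stage, assuming one sample per minute and a degree-4 polynomial (both illustrative choices, not the team's actual settings):

```python
# Hedged sketch of the first-stage "multi-path recall" for curve metrics: a
# per-hour 3-sigma band built from the same hour of the previous 7 days, plus
# a polynomial fit of the recent window. Both only nominate candidate anomalies.
import numpy as np

POINTS_PER_HOUR = 60  # assumes one sample per minute

def hourly_band(history_7d, hour):
    """3-sigma band for one hour of the day, from 7 days of per-minute history."""
    daily = np.asarray(history_7d, dtype=float).reshape(7, 24, POINTS_PER_HOUR)
    ref = daily[:, hour, :].ravel()
    return ref.mean() - 3 * ref.std(), ref.mean() + 3 * ref.std()

def polyfit_residuals(recent, degree=4):
    """Fit a polynomial to the recent window; large residuals are candidates."""
    recent = np.asarray(recent, dtype=float)
    x = np.arange(len(recent))
    coeffs = np.polyfit(x, recent, degree)
    return recent - np.polyval(coeffs, x)
```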

The second step is the isolation forest, on the same principle as before.

The third step is supervised manual labeling: some alarms circled on the graph are labeled as alarms that should not have fired. After labeling the training set, training produces an automatic classifier.

To enlarge the sample library, colleagues used the correlation coefficient (based on covariance) to find more samples. This is worth noting: we find similar curves, and if a curve does not train well, we bundle the similar ones together and train them jointly.
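A minimal sketch of that similarity search using the Pearson correlation coefficient; the 0.9 cutoff is an illustrative threshold, not the team's value:

```python
# Sketch of using the correlation coefficient to grow the training set: curves
# highly correlated with an already-labeled curve are pulled in and labeled the
# same way. The threshold below is only an example value.
import numpy as np

def similar_curves(labeled_curve, candidate_curves, threshold=0.9):
    """candidate_curves: dict mapping curve name -> series of the same length."""
    picked = []
    for name, curve in candidate_curves.items():
        r = np.corrcoef(labeled_curve, curve)[0, 1]
        if r >= threshold:
            picked.append((name, r))
    return picked
```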

In general, abnormal alarms are found through three-level filtering.

We have a very large number of devices and more than 1.2 million monitoring views. More than 70% of them previously had no alarms configured at all, because it is hard to set a high and a low threshold for each one. Now all of these modules have been brought into the monitoring, with 100 percent coverage. That is the case of removing thresholds in monitoring.

3. Serial application case: Quality – intelligent root cause analysis of anomalies

The third case is a serial case on quality: root cause analysis of anomalies. Our colleagues have in fact shared this case on many occasions.

We did a lot of statistics on the access relationships between our systems and generated a business access-relationship view of which business accesses which. The final chart looks like a spider web, and this is only part of it. After a failure occurs, the question is which node is the root cause of the problem.

Our initial approach to root cause analysis was to first reduce the dimensionality of the relationship graph. The column on the left is all the same module; we list every access path generated from this module, and the paths are on the right. The alarms that occur are superimposed onto the modules along each path, and a manually defined area-based algorithm then ranks them by area size. Although it is rule-based, it worked well and helped us find the top-N alarms most likely to be the root cause. Now we have updated it with AI algorithms.

The middle row is the main logic of the root cause analysis just introduced. Before superimposing the alarms, note that modules that access each other are the ones likely to cause each other's root alarms. So we divide the graph into communities according to the closeness of access relationships, grouping modules that access each other frequently into one cluster.

Then, since alarms of similar severity are more likely to cause each other, the DBSCAN density-based clustering algorithm is used to cluster them. Finally, frequent itemsets and correlation coefficients are used to find patterns that recur together, measured by correlation, contribution, and support.
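As a hedged sketch of the clustering step (the feature choice and DBSCAN parameters below are my assumptions, not the team's settings), alarms that are close in time and severity can be grouped like this:

```python
# Hedged sketch of the clustering step: DBSCAN groups alarms that are close in
# time and severity into candidate cause/effect clusters. The features, eps,
# and min_samples values are illustrative assumptions.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def cluster_alarms(timestamps, severities, eps=0.5, min_samples=3):
    X = StandardScaler().fit_transform(np.column_stack([timestamps, severities]))
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return labels   # -1 marks noise; other labels are candidate alarm clusters
```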

In addition, during one of our exchanges, someone suggested using a Bayesian algorithm to estimate the probability of each top root cause. Since this is probabilistic statistics, we are also running experiments and tests on it now.

4. Intelligent scheduling case: Efficiency – cloud-based automatic capacity expansion

Let me give another example, of intelligent scheduling. I used to think intelligent scheduling was a very ambitious goal, not just something like this; this is only a small improvement: our intelligent, fully automated capacity-expansion process.

We have talked about this intelligent process on many occasions before. All the resources a business module needs must be registered in it, and the process goes through six stages: applying for devices, acquiring resources, release and deployment, self-check, business testing, and gray release. There are actually more than twenty steps, which we combine into different processes. Let's look at the process.

(Brief explanation while the video plays) This is the automatic expansion of a module: first some resources are added, which is effectively launching a new instance of the business, but under normal circumstances it is carried out automatically. There are business packages, base packages, and permissions to set up, which are also basic steps.

We saw the pressure rise continuously, with CPU climbing above 75%. As it rose, the system found that its pressure exceeded the threshold and automatically began to expand, several times faster than before.

In this process, the twenty-odd expansion steps just listed are executed automatically. After execution, an Enterprise WeChat notification reports that the module has been brought online quickly, and then we monitor it for problems; the module goes online fast.

Two new devices are added, the load on them keeps increasing while traffic on the old devices gradually decreases, and finally an equilibrium is reached.

This is automatic capacity expansion on our Tencent Weaving Cloud platform; capacity forecasting and a great deal of the monitoring have already gone through a degree of AI-based transformation.

I also want to focus on a module called the "balance beam". The reason for it, as explained in the monitoring section, is that a module may contain a very large number of devices; although the same pressure weight is set for all of them, in the real environment the gap between the highest and lowest loads is very large.

With one adjustment by the balance beam, the load curves at the top basically shrink into a single line, and the capacity shown in the second curve rises to 40 percent, so a single adjustment increases the supported capacity by 22 percent. How does that work?

In fact, it is a machine learning task. We want the load of every single machine in the module to be as consistent as possible, so we define a loss function and use gradient descent to find the settings of the parameters W1 through WN in this loss function. After several rounds of adjustment, the loads of all the devices converge.
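A minimal sketch of that idea, under an assumed linear load model (each machine's load proportional to its routing share times a per-machine cost, which is my simplification, not the team's model): the loss is the variance of the predicted loads, and the weights are adjusted by gradient descent. Step size and iteration count would need tuning in a real system.

```python
# Sketch of the "balance beam": treat per-machine routing weights w1..wN as
# parameters, define loss = Var(predicted load), and run gradient descent so
# all per-machine loads converge. The load model is an illustrative assumption.
import numpy as np

def balance_weights(observed_load, weights, lr=0.05, steps=500):
    """Adjust routing weights so the predicted per-machine loads even out."""
    w = np.asarray(weights, dtype=float).copy()
    observed_load = np.asarray(observed_load, dtype=float)
    # assumed model: load_i = (w_i / sum(w)) * cost_i, with cost_i estimated
    # from the currently observed loads and weights
    cost = observed_load * w.sum() / w
    eps = 1e-4
    for _ in range(steps):
        load = (w / w.sum()) * cost
        grad = np.zeros_like(w)
        for i in range(len(w)):                 # numerical gradient of Var(load)
            w_eps = w.copy()
            w_eps[i] += eps
            load_eps = (w_eps / w_eps.sum()) * cost
            grad[i] = (load_eps.var() - load.var()) / eps
        w = np.clip(w - lr * grad, 1e-3, None)  # keep weights positive
    return w
```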

5. More single-point and serial applications

There are many more single-point and serial applications, some of which were shared earlier at GOPS Shanghai, so they are only mentioned briefly here as examples.

The first is intelligent multi-dimensional drill-down analysis. After an app is launched, there may be multiple release platforms, each ISP may have its own domain names, each domain may have hundreds of devices behind it, and there are multiple ISPs. Once a problem occurs, it may lie in just one block of this Rubik's cube, which is hard to locate.

For example, a slice may show a large difference between the data before and after, but if its page views are very small it is not the core issue; the core slices are those whose anomalies show both a large difference and a large contribution to the overall change, and those are the likely problems.
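An illustrative sketch of such a contribution-based ranking; the scoring formula and column names below are my assumptions, not the team's exact metric:

```python
# Illustrative sketch of the drill-down ranking: for each slice of a dimension
# (ISP, domain, device group, ...), score it by how much of the total
# before/after change it explains, so small-traffic slices with big relative
# swings rank low. The scoring formula is an assumption for illustration.
import pandas as pd

def rank_slices(df):
    """df columns: slice, before, after (request volumes per slice)."""
    df = df.copy()
    df["delta"] = df["after"] - df["before"]
    total_delta = abs(df["delta"].sum())
    df["contribution"] = df["delta"].abs() / max(total_delta, 1.0)
    return df.sort_values("contribution", ascending=False)
```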

The second is alarm convergence. We all know the beer-and-diapers story and the frequent itemset algorithm: if alarm B almost always occurs after alarm A, and the probability of A and B occurring together exceeds a certain percentage, we can discover such patterns and merge the alarms.
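A hedged sketch of that pairwise association mining over alarm time windows; a real system would more likely use a full frequent-itemset miner such as Apriori or FP-Growth, and the thresholds here are illustrative:

```python
# Pairwise association mining over alarm windows: support is how often two
# alarms fire in the same window, confidence is P(B | A). Thresholds are
# illustrative; this is a sketch, not the production algorithm.
from collections import Counter
from itertools import combinations

def alarm_pairs(windows, min_support=0.05, min_confidence=0.8):
    """windows: list of sets of alarm ids that fired within one time window."""
    n = len(windows)
    single = Counter(a for w in windows for a in w)
    pair = Counter(frozenset(p) for w in windows for p in combinations(sorted(w), 2))
    rules = []
    for p, cnt in pair.items():
        a, b = tuple(p)
        support = cnt / n
        if support < min_support:
            continue
        for x, y in [(a, b), (b, a)]:
            confidence = cnt / single[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))   # "x implies y"
    return rules
```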

The third is our intelligent health-check change report. Just as the Google speaker said earlier, Google uses various monitoring methods to find out whether a change has caused a failure, for example a surge in traffic or an abnormal alarm.

Our change-detection report automatically monitors the changes in a module's various indicators after a change, to help engineers judge whether the change is normal. The output falls into two categories; if the result is normal, it can be ignored:

One is that the change causes large fluctuations in the data but may not be an anomaly, such as a large increase in traffic;

The other is that exceptions may appear after the change, such as a large number of coredumps, which must be handled. Deciding how to handle them is also a binary classification problem, which we will also transform in the future.

Last time we also talked about intelligent customer service. Using NLP techniques, we can build intelligent customer-service robots for information retrieval and for executing operations.

Fourth, thinking and outlook

Here is a bit of simple thinking and outlook. AIOps is just getting started, but as you do it, you can think of each piece as something like a common component, a learnware, with a learned result inside it.

Some scenarios are internal to a company, and the learnware can be used internally. But for common things such as monitoring, the monitoring scenario is much the same at every company. If we define a standard reporting format and standard time intervals, train the AIOps model well, and everyone agrees on its effect, we can use it together.

Professor Zhou Zhihua has also spoken about "learnware" on many occasions. By agreeing on common reporting formats, naming conventions, and so on, common AIOps components will become more shared and more widely agreed upon, which is something we hope to put into standards in the future.

This is also the purpose of the AIOps Standards Committee.

This figure shows Metis on our Weaving Cloud platform, in which we are building a learnware library to solve some of the serial problems, like the examples just given. We also hope that over the coming months, or over some period of time, we can open it up, learn together with everyone, and jointly explore and improve it. That is the end of my talk.

To conclude with a well-known saying: people commonly overestimate the impact of a new technology over one or two years and underestimate its impact over five to ten years.

AIOps is such a technology. As an aid and improvement to DevOps, I believe it will make IT operations simpler and easier, and truly become an assistant tool and intelligent brain for operations engineers. Thank you.

Note: This article is based on Zhao Jianchun's talk at the main venue of DOIS 2018 Beijing. Reprinted from: Efficient Operation and Maintenance.