The AIOps community was initiated by Cloud Intelligence. Aimed at operation and maintenance (O&M) business scenarios, it provides an overall service system of algorithms, computing power, and datasets, as well as a solution-exchange community for intelligent O&M. The community is committed to spreading AIOps technology, solving the technical problems of the intelligent O&M industry together with customers, users, researchers, and developers from various industries, promoting the adoption of AIOps technology in enterprises, and building a healthy, win-win AIOps developer ecosystem.

Preface

Log analysis, an important sub-field of AIOps (the combination of artificial intelligence and IT operations), is receiving increasing attention from both academia and industry. Many classical models combining neural networks with log analysis have emerged and achieved good results in practical applications.

In this article, we invited Guo, an algorithm intern at Cloud Intelligence and a doctoral candidate at Beihang University, to briefly introduce recent progress in combining deep learning with this field from an academic perspective.

I. Overview of log research

  1. Research status

Log data is generated as a system runs and provides a detailed record of large-scale internal system events and user intentions. With the rapid development of large-scale IT systems, the volume of log data has grown to the point where traditional methods struggle to analyze it. In addition, logs are difficult to obtain and label. The figure below shows the process from code to log. Even logs from the same system contain highly individualized content, since developers can define any system feedback they want in code.
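
To make the code-to-log relationship concrete, here is a minimal Python sketch (our illustrative example; the block ID and message are made up, in an HDFS-like style). The format string is the fixed part that later becomes the event template, and the arguments are the runtime-varying parameters:

```python
import logging

logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s",
                    level=logging.INFO)

block_id, size = "blk_3587508140051953248", 67108864  # illustrative values

# The format string is the constant "template"; the arguments are the
# variable parameters that log parsing later has to separate out.
logging.info("Received block %s of size %d from 10.250.19.102", block_id, size)
# -> 2024-01-01 12:00:00,000 INFO Received block blk_3587... of size 67108864 from 10.250.19.102
```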

To overcome these bottlenecks, O&M personnel have tried to enhance IT operations capabilities with artificial intelligence algorithms, which produced a batch of traditional machine learning methods. In recent years, with the growth of computing power and data volume, deep learning techniques have entered the field of log analysis. Researchers believe that semi-structured log messages carry partial system semantics, similar to a natural language corpus, so they have adopted language models such as LSTM and Transformer to model and analyze log data. To address the difficulty of obtaining labels, some researchers have adopted self-supervised, unsupervised, weakly supervised, and semi-supervised methods that do not require complete labels, such as BERT-style models for logs. Other learning paradigms, such as transfer learning, ensemble learning, and continual learning, are also used to improve O&M efficiency in various respects. In short, researchers are actively exploring the research and application value of deep learning in this field.

  2. Tasks revolving around logs

Research on logs can be divided into three main directions: log compression, log parsing, and log mining. Log compression studies how to compress logs efficiently without losing important information. Log parsing automatically extracts event templates and key parameters from software logs. Log mining covers a variety of tasks, including log anomaly detection and log-based alerting, and its main purpose is to improve system reliability. The figure below shows the number and direction of recent papers: the number of papers is increasing year by year, and most focus on log mining.

II. Academic frontiers

This academic sharing focuses on the log anomaly detection task. Log anomaly detection, as the name implies, detects system anomalies from log data.

  1. Log Parsing

Massive log data exhibits high semantic similarity between messages, so logs must be represented according to actual requirements. Scholars therefore extract a fixed template/schema from logs to represent the entire log database. According to our survey, most current log anomaly detection methods require log parsing; the reasons are briefly described here.

The figure above shows the log template extraction process. From top to bottom: the original logs, the parsed log templates, the structured logs, and finally the structured data sent to various downstream log mining tasks. In detail, L1, L2, L3, L4, and L5 represent five original logs, and three templates are extracted through Drain parsing: T1, T2, and T3. After mapping, we obtain five structured logs, namely l1-l5 in the pink box. Log parsing removes information considered irrelevant, such as the timestamp and ID. Common parsing algorithms include:

  • Drain (tree-structure similarity)
  • Spell (longest common subsequence)
  • AEL (occurrence frequency of constants and variables)
  • IPLoM (iterative partitioning based on message length, token position, mapping relationships, etc.)
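
As a hedged illustration, the open-source drain3 package (a Python implementation of the Drain algorithm; the sample log lines below are our own) extracts templates roughly like this:

```python
# pip install drain3
from drain3 import TemplateMiner

miner = TemplateMiner()  # default config: tree depth, similarity threshold, etc.

logs = [
    "Deletion of file1 complete",
    "Deletion of file2 complete",
    "Opening file file3",
]

for line in logs:
    result = miner.add_log_message(line)
    # `template_mined` is the event template; variable tokens are masked as <*>
    print(result["cluster_id"], result["template_mined"])

# The two deletion logs merge into one cluster with a template like
# "Deletion of <*> complete", while the third line forms its own cluster.
```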

  2. Log anomaly detection

Most deep log anomaly detection frameworks after 2020 consist of three parts: a log parsing module -> a feature encoder -> a classifier/decoder. This section introduces some of these deep learning frameworks.

DeepLog:

DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning.

As shown in the figure below, the model works in two stages: training and testing. In the training stage, after the original logs are parsed, an LSTM network learns to predict the template of the next log entry from the preceding sequence. In the testing stage, test data is fed into the model to obtain a prediction; if the actual next template is not among the top-K predicted templates, the log is judged anomalous.
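
Below is a simplified PyTorch sketch of this idea (our illustration, not the authors' code): an LSTM predicts the next template ID from a window of previous IDs, and a log is flagged anomalous if the observed next ID falls outside the top-k predictions.

```python
import torch
import torch.nn as nn

class DeepLogLM(nn.Module):
    def __init__(self, num_templates: int, emb: int = 32, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(num_templates, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_templates)

    def forward(self, window):             # window: (batch, seq_len) template IDs
        out, _ = self.lstm(self.embed(window))
        return self.head(out[:, -1])       # logits over the next template

def is_anomalous(model: DeepLogLM, window: torch.Tensor,
                 next_id: int, k: int = 9) -> bool:
    logits = model(window)                 # window: (1, seq_len)
    topk = logits.topk(k, dim=-1).indices  # k most likely next templates
    return next_id not in topk[0]          # outside top-k => anomaly
```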

LogRobust:

Robust log-based anomaly detection on unstable log data.

The model is based on supervised learning and uses an attention-based bidirectional LSTM architecture. Logs are parsed with Drain, and the feature extractor combines word vectors with TF-IDF weighting to generate semantic log representations. The model is trained on both normal and abnormal log data, and a final classifier determines whether a log sequence is anomalous.
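
A hedged sketch of such a classifier (the module names are ours, not the paper's code): an attention-weighted bidirectional LSTM runs over pre-computed semantic vectors, for example TF-IDF-weighted word embeddings per template.

```python
import torch
import torch.nn as nn

class AttnBiLSTM(nn.Module):
    def __init__(self, sem_dim: int = 300, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(sem_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.clf = nn.Linear(2 * hidden, 2)      # normal vs. anomalous

    def forward(self, seq):                      # seq: (batch, T, sem_dim)
        h, _ = self.lstm(seq)                    # (batch, T, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)   # attention weight per log event
        ctx = (w * h).sum(dim=1)                 # weighted sum over the sequence
        return self.clf(ctx)                     # sequence-level logits
```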

HitAnomaly:

HitAnomaly: Hierarchical Transformers for Anomaly Detection in System Log.

This model is also based on supervised learning and adopts a Transformer-based architecture; the log parser is again Drain. During parsing, the template normally discards the original parameter values, but this model additionally encodes those discarded values so as to retain as much of the original log as possible, which is where its innovation lies.
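
A hedged sketch of that idea (our simplification, not the paper's full hierarchical design): the template tokens and the parameter values separated out by parsing are encoded independently and then combined, so the information the parser discards is not lost.

```python
import torch
import torch.nn as nn

class TemplateValueEncoder(nn.Module):
    def __init__(self, vocab: int, d: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.tpl_enc = nn.TransformerEncoder(layer, num_layers=1)  # template branch
        self.val_enc = nn.TransformerEncoder(layer, num_layers=1)  # parameter branch
        self.clf = nn.Linear(2 * d, 2)

    def forward(self, tpl_tokens, val_tokens):   # each: (batch, T) token IDs
        t = self.tpl_enc(self.embed(tpl_tokens)).mean(dim=1)  # template repr.
        v = self.val_enc(self.embed(val_tokens)).mean(dim=1)  # value repr.
        return self.clf(torch.cat([t, v], dim=-1))
```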

Logsy:

Self-attentive classification-based anomaly detection in unstructured logs.

This model also uses supervised learning with a Transformer-based architecture. The innovation of this work is that the whole raw log is sent to the encoder instead of a log parser, which retains the information of the original corpus to the maximum extent, although detection efficiency in real deployments suffers considerably.
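
A rough sketch of this parser-free setup (our simplification): tokenized raw log messages go straight into a Transformer encoder, and the representation at the first position is used for classification.

```python
import torch
import torch.nn as nn

class RawLogEncoder(nn.Module):
    def __init__(self, vocab: int, d: int = 128, heads: int = 4, layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.clf = nn.Linear(d, 2)

    def forward(self, tokens):                # tokens: (batch, T) word-piece IDs
        h = self.encoder(self.embed(tokens))  # encode the full raw message
        return self.clf(h[:, 0])              # classify from the first position
```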

III. Our self-developed model

TransLog

TRANSLOG: A Unified Transformer-based Framework for Log Anomaly Detection.

Portal: arxiv.org/pdf/2201.00…

This model is also based on supervised learning. Unlike the models above, whose overall framework stays largely the same, this work rethinks the approach to log anomaly detection. Main contributions:

  • Shared log semantic knowledge learned through transfer learning solves the problem that multi-source, low-resource logs are hard to detect anomalies in.
  • While reaching SOTA (state-of-the-art) performance, the number of trainable parameters can be compressed to 5% of the original, improving the deployability of the deep learning model.
  • A new Transformer-based framework with a two-stage Pretraining-Tuning approach provides a new learning paradigm for log analysis.

The starting point:

As shown in the figure below, different systems share the same anomaly types, so systems with scarce log resources can also reuse general log semantic knowledge for anomaly detection. For example, the BGL, Thunderbird, Spirit, and Liberty systems in the figure all exhibit "Program Not Running" anomalies.

Framework method:

The framework of the model is shown below; it is divided into two stages: Pretraining and Adapter-based Tuning. First, all parsed log event sequences are fed into a pre-trained language model (here, BERT) to extract representations. A Transformer encoder is then trained on the high-resource source-domain dataset to capture shared semantic information. For the target data source, we freeze the encoder's parameters and tune only the Adapter parameters on the target-domain dataset, thereby transferring knowledge from the source dataset to the target dataset.
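
A minimal sketch of the adapter-based tuning stage (our illustration of the general technique, not the TransLog code): small residual bottleneck adapters are inserted, the shared encoder is frozen, and only the adapters and the classifier head stay trainable on the target domain.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter inserted after a frozen encoder layer."""
    def __init__(self, d: int = 256, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(d, bottleneck)
        self.up = nn.Linear(bottleneck, d)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual connection

def freeze_for_adapter_tuning(model: nn.Module):
    # Assumes adapter modules and the classifier head carry "adapter"/"clf"
    # in their parameter names (a naming convention of this sketch).
    for name, p in model.named_parameters():
        p.requires_grad = ("adapter" in name) or ("clf" in name)
```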

Experimental results:

We compared six different approaches on three public datasets; our algorithm achieved SOTA (state-of-the-art) performance on all of them, while reducing the number of trainable parameters by nearly 95%.

IV. Trend outlook

Based on our deep experience in the intelligent O&M industry and our research on cutting-edge technologies, we summarize three trends in the log field:

  • Because labels for logs are inherently hard to obtain, a large number of unsupervised or weakly supervised deep learning methods will emerge, helping practitioners in this field do real-world research and development on unlabeled data.
  • With the development of multimodal O&M, external knowledge such as knowledge graphs, together with other O&M data types such as trace (call-chain) and metric data, will be introduced to enrich the original log information. Many methods combining supervision with multiple data modalities will therefore appear, moving toward better integrated O&M.
  • As the volume of O&M data keeps growing, large-scale models similar to BERT in the natural language field are gradually showing their strength. Combined with the pretraining and fine-tuning learning paradigm, a large-scale model that learns diverse O&M knowledge and log patterns has a great opportunity to become the backbone model for AIOps research.

Final words

In recent years, against the backdrop of AIOps' rapid development, the urgent need for IT tools, platform capabilities, solutions, AI scenarios, and available datasets has exploded across industries. Based on this, the AIOps community was launched by Cloud Intelligence in August 2021, aiming to raise an open-source banner and build an active user and developer community where customers, users, researchers, and developers from various industries can contribute, solve industry problems together, and advance the technology in this field.

The community has open-sourced products such as FlyFish, OMP, the Moore platform, and the Hours algorithm.

Visual Orchestration Platform - FlyFish:

Project introduction: www.cloudwise.ai/flyFish.htm…

Github address: github.com/CloudWise-O…

Gitee address: gitee.com/CloudWise/f…

Business case: www.bilibili.com/video/BV1z4…

Some large-screen dashboard examples: