Abstract: This paper introduces the design philosophy, technical architecture, core functions and practical experience of UAVStack, CreditEase's intelligent operation and maintenance platform.

Source: CreditEase Technical College Tech Salon Session 6 – online live | Practice of Building CreditEase's Intelligent Monitoring Platform

Speaker: Xie Zhiqiu, Senior Architect at CreditEase and Head of the Intelligent Monitoring Platform

First, the background of UAVStack platform

At present, many monitoring products are in common use in the industry, and mainstream products tend to excel at either monitoring depth or monitoring breadth.

  • Prometheus is representative of monitoring breadth: it has an active ecosystem and provides off-the-shelf metrics collection plug-ins for common Internet middleware such as MySQL, Redis, Kafka, RocketMQ, MongoDB and ElasticSearch. Similar products include Zabbix, Nagios and Open-Falcon.
  • There are also many products focused on monitoring depth, such as Cloud, OneAPM, PinPoint and SkyWalking. These are generally probe-based and provide deeper capabilities in application performance monitoring.

Each of these products has its own advantages and disadvantages:

  • None of them balances the breadth and depth of monitoring;
  • None of them collects real-time metrics, call chains and logs at the same time. Without integrating these three capabilities, they cannot solve the problems of data timeliness, quality control and alignment.

In order to overcome the above shortcomings, meet the company's diversified and intelligent monitoring needs, and reduce the cost and difficulty of secondary development, we independently developed UAVStack, a full-dimension monitoring and intelligent operation and maintenance base platform.

As an intelligent monitoring platform, monitoring is only the first link of intelligent operation and maintenance. We believe intelligent operation and maintenance (AIOps) proceeds in three steps: full-dimension monitoring, full-dimension correlation and full-dimension intelligence.

  • Step 1: Full-dimension monitoring. Collect full-dimension monitoring data through a unified collection system, including portrait data, real-time metric data, call chain data and log data of systems, applications and services.
  • Step 2: Full-dimension correlation. After obtaining full-dimension monitoring data, further establish correlations between the data. A correlation can be a strong one established through portraits, service flow graphs or call chain data, or one established through machine learning algorithms. With real-time correlations among the monitoring data in place, root cause analysis becomes possible.
  • Step 3: Full-dimension intelligence. By introducing AI services such as anomaly detection, root cause analysis, intelligent noise reduction and task robots, machines take over some decisions from humans, continuously raising the company's level of intelligent operation and maintenance.

Second, the overall technical architecture of UAVStack platform

2.1 Full-dimension Monitoring + Application O&M Solution

A one-stop full-dimension monitoring + application O&M solution can be built using UAV's full-dimension monitoring and application performance management toolset.

First, UAV deploys a MonitorAgent (MA) on each physical machine, virtual machine and container. The MA is a standalone JVM process running on the host.

Second, in every JEE middleware, JSE application, or other JVM language application, a monitoring probe can be embedded in the form of a Java Agent that starts with the application in the same JVM process.
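The probe-in-the-same-process idea can be sketched with the standard `java.lang.instrument` entry point. This is a minimal illustration, not UAVStack's actual probe code; the agent jar's manifest must declare `Premain-Class`, and it is launched with `java -javaagent:probe.jar=<options> -jar app.jar`:

```java
import java.lang.instrument.Instrumentation;

// Minimal Java Agent sketch: premain runs in the same JVM process,
// before the application's own main() method.
public class ProbeAgent {

    public static void premain(String args, Instrumentation inst) {
        // A real probe would register ClassFileTransformers here to rewrite
        // middleware classes; this sketch only announces itself.
        System.out.println(banner(args));
    }

    static String banner(String args) {
        return "UAV probe attached, args=" + (args == null ? "<none>" : args);
    }
}
```

Because `premain` runs before the application, the probe can intercept class loading for every class the application server subsequently loads.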

When the monitoring probe starts, it automatically profiles and monitors the application. An application portrait includes portraits of its service components, client components and log components.

  • A service component is an interface that exposes the service capability of an application, such as a service URL.
  • A client component is a client through which the application accesses other services or third-party data sources (such as MySQL, Oracle, Redis, MQ, etc.).
  • Log components are logs generated by applications.

In addition to automatically profiling and collecting real-time data for the above three components, the monitoring probe also records each request/response to generate end-to-end invocation links, draws the invocation relationships among applications/services, and generates a service map.

The monitoring agent (MA process) periodically pulls data collected by the monitoring probe and collects performance indicators of the application environment, such as CPU, memory, and disk I/O. In addition, MA provides a plug-in mechanism to support the collection of personalized metrics.

Finally, we obtain full-dimension monitoring data covering Metrics, Tracing (call chains) and Logging. Specifically:

  • Metrics include user-defined business metrics, application environment performance metrics, application cluster/instance performance metrics, and service component/client component/URL performance metrics.
  • Tracing data includes call chain indicators and client experience (UEM) data.
  • Logging data includes log and thread stack data.

2.2 Monitoring probe architecture

The UAV acquisition side mainly consists of two parts: monitoring Agent and monitoring probe.

  • The monitoring probe monitors applications.
  • The monitoring Agent monitors the application environment, periodically pulls data from the monitoring probe, and sends it to the UAV server.

The diagram above shows the architecture of the monitoring probe. As UAV's capabilities have grown, the probe has come to serve more than monitoring purposes, and it has been renamed the middleware enhancement framework.

As can be seen on the left side of the figure above, the middleware enhancement framework sits between the application server and the application. It uses middleware hijacking technology to perform classloading hijacking and bytecode rewriting on the application server code, without intruding on the upper-layer application code.

On the right is an enlarged architecture diagram of the monitoring probe. At the bottom is the application server adaptation layer, which provides adaptation support for open source middleware commonly used on the Internet (such as Tomcat, Jetty and Spring Boot). An Adapter can be extended to support other servers.

On top of the adaptation layer, a series of general extension points (SPIs) are provided. Based on these SPIs, three groups of capabilities are built: portrait collection and metric collection, related to monitoring; call chain tracing, browser tracing, JVM thread analysis, heap memory dump and other problem diagnosis and analysis tools; and traffic limiting/degradation and service security control, related to service governance. In addition, these features are adapted to Docker and Kubernetes container environments.

The top layer provides an application docking API and a data publishing API, exposing the monitoring data on the probe through HTTP and JMX.
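The JMX side of the data publishing API can be sketched with a standard MBean. This is a hypothetical illustration, not UAVStack's actual interface; the object name `uav:type=ProbeStats` is an assumption for the example:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Sketch: probe metrics exposed as a standard MBean, so the MonitorAgent
// (or any JMX client such as jconsole) can pull them from the probe's JVM.
public class ProbeDataPublisher {

    public interface ProbeStatsMBean {
        long getRequestCount();
    }

    public static class ProbeStats implements ProbeStatsMBean {
        private volatile long requestCount;
        public long getRequestCount() { return requestCount; }
        public void increment() { requestCount++; }
    }

    public static ProbeStats register() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ProbeStats stats = new ProbeStats();
            // "uav:type=ProbeStats" is a made-up name for this sketch.
            server.registerMBean(stats, new ObjectName("uav:type=ProbeStats"));
            return stats;
        } catch (Exception e) {
            throw new IllegalStateException("MBean registration failed", e);
        }
    }
}
```

The standard-MBean naming convention (implementation class `ProbeStats`, interface `ProbeStatsMBean`) is what lets JMX discover the management interface automatically.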

2.3 Data capture architecture

The architecture for UAV data capture and transmission will be described next.

As can be seen from the figure above, the monitoring Agent adopts dual-channel + dual-heartbeat mode for data transmission:

1) Dual channel refers to the two channels of HTTP heartbeat and MQ transmission:

  • The HTTP heartbeat channel transmits monitoring data related to the application environment, including container/node portrait data and real-time monitoring data.
  • The MQ transport channel is used to transfer application-related monitoring data, including real-time application data, portrait data, log data, and APM data such as call chain and JVM thread stack. The data format on the MQ data transmission channel adopts a unified Schema to facilitate data conversion and processing in the later period.

2) Dual heartbeat means that data arriving on either the HTTP channel or the MQ channel can be treated both as monitoring data and as heartbeat data: data from each channel "checks in" with the UAV monitoring backend service. Two communication paths mean higher reliability.
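The dual-heartbeat idea can be sketched as a registry in which any data arrival on any channel refreshes the node's liveness (a minimal sketch under assumed timeout semantics, not UAVStack's actual code):

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of "dual heartbeat": data on either the HTTP or the MQ channel
// doubles as a heartbeat, so a node is alive if *either* channel checked in
// recently. Channel identity doesn't matter; only the freshest check-in does.
public class HeartbeatRegistry {
    private final ConcurrentHashMap<String, Long> lastSeen = new ConcurrentHashMap<>();
    private final long timeoutMs;

    public HeartbeatRegistry(long timeoutMs) { this.timeoutMs = timeoutMs; }

    // Called whenever monitoring data arrives on any channel.
    public void checkIn(String nodeId, long nowMs) {
        lastSeen.merge(nodeId, nowMs, Math::max);
    }

    public boolean isAlive(String nodeId, long nowMs) {
        Long t = lastSeen.get(nodeId);
        return t != null && nowMs - t <= timeoutMs;
    }
}
```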

The Agent transmits data over both channels to the UAV monitoring backend, which we call the health management service.

The health management service parses and processes the monitoring data according to its type and persists it to the appropriate data sources: for example, OpenTSDB stores time-series metric data, while ES stores APM data such as logs, call chains and JVM thread analyses.

AppHub is the unified portal of UAV, providing centralized display of monitoring data and management of user permissions. Users can log in to UAV from both PC and mobile devices for an anytime, anywhere operations experience.

The health management service itself uses a microservice architecture: it consists of multiple microservices and supports cluster deployment and scaling.

Third, Core Functions and Principles of the UAVStack Platform (with cases)

3.1 UAVStack Core Functions

The figure above shows the current core functions of UAVStack, mainly including: application monitoring, environment monitoring, service flow, call chain, JVM monitoring, database monitoring, log monitoring, performance alarms, browser tracing, configuration center, space-time sand table, God's Compass, service governance, container ecosystem support, business monitoring, and intelligent operations (AIOps). Among them:

  • Browser monitoring: Used to monitor performance data of front-end Web pages;
  • Space-time sand table: provides the query of historical monitoring data;
  • God’s Compass: provides large screen display for monitoring;
  • Intelligent operation and maintenance (AIOps) includes four aspects: anomaly detection, root cause analysis, alarm convergence and intelligent noise reduction, and task robot.

In addition, there are some operation support tools not shown in the diagram, such as the UAV unified upgrade center; daily, weekly and monthly UAV monitoring reports; and UAV usage statistics. This post highlights the functions shown in white text in the image above.

3.2 Application Monitoring

Firstly, the core principles of UAV application monitoring are introduced.

3.2.1 Core principle: Non-invasive technology for application code

The core principle of UAV application monitoring is: non-intrusive technology for application code.

  • UAV is non-intrusive to the application code: the application requires no modification.
  • Applications do not need to adopt a uniform development framework to use UAV.

UAV, the project's code name, is short for "unmanned aerial vehicle": like a drone, it flies above the applications and performs its tasks intelligently and transparently.

The core technologies mainly include:

  • Middleware hijacking technology, including Java Agent probe and bytecode rewriting;
  • Application/service profiling and monitoring technologies.

3.2.2 Non-invasive technology: Application/service portrait

The monitoring probe implements automatic portrait and monitoring of applications/services through middleware hijacking technology.

  • Middleware hijacking is the embedding of our own code into the behavior of middleware.
  • The core of middleware hijacking is: mastering the class loading tree, getting priority to load, and embedding our own code.
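The hook point that middleware hijacking uses is the JDK's `ClassFileTransformer`: the agent sees every class as it is loaded and may return rewritten bytecode. This sketch only matches the target class; a real probe would rewrite the bytes (e.g. with ASM or Javassist), and the Tomcat class name here is just a plausible example:

```java
import java.lang.instrument.ClassFileTransformer;
import java.security.ProtectionDomain;

// Sketch of the class-loading hijack: registered via
// Instrumentation.addTransformer() in the agent's premain.
public class HijackTransformer implements ClassFileTransformer {

    static boolean isTarget(String internalName) {
        // Class names arrive in internal (slash-separated) form.
        return "org/apache/catalina/core/StandardContext".equals(internalName);
    }

    @Override
    public byte[] transform(ClassLoader loader, String className, Class<?> redefined,
                            ProtectionDomain pd, byte[] classfile) {
        if (isTarget(className)) {
            // A real probe returns rewritten bytecode here (e.g. via ASM);
            // returning null tells the JVM to keep the original bytes.
        }
        return null;
    }
}
```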

Take the application/service portrait for example:

  • When the StandardContext class of the web container is loaded, the portrait code is embedded via bytecode rewriting; when the JEE application starts, the embedded code executes and generates the application portrait and service portraits;
  • The application portrait includes the application identifier, application name, application URI (http(s)://&lt;local IP&gt;:&lt;port&gt;/) and application library information (obtained from the WebAppClassLoader that loads the application).
  • The service portrait is generated by scanning according to the JEE technical specifications: annotations and deployment descriptors are scanned to extract service registration information.
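The annotation-scanning step can be sketched with plain reflection. The `@Path` annotation below is a local stand-in (in the style of JAX-RS) so the example stays self-contained; the service class and URL pattern are invented for illustration:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Sketch of service-portrait extraction: scan a service class for
// registration annotations and collect the service URLs they declare.
public class ServicePortraitScanner {

    @Retention(RetentionPolicy.RUNTIME)
    public @interface Path { String value(); }   // stand-in for a JEE annotation

    public static List<String> scan(Class<?> serviceClass) {
        List<String> urls = new ArrayList<>();
        Path base = serviceClass.getAnnotation(Path.class);
        String prefix = base == null ? "" : base.value();
        for (Method m : serviceClass.getDeclaredMethods()) {
            Path p = m.getAnnotation(Path.class);
            if (p != null) urls.add(prefix + p.value());
        }
        return urls;
    }

    @Path("/order")                               // hypothetical service class
    static class OrderService {
        @Path("/create") public void create() {}
    }
}
```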

3.2.3 Non-invasive technology: Application/service monitoring

Similar to application/service profiling, application/service monitoring is embedded by bytecode rewriting when loading server-related classes.

Take Tomcat as an example:

  • In Tomcat, CoyoteAdapter handles all service requests, and StandardWrapper handles all servlets' service requests.
  • When these two classes are loaded, the UAV implants the monitoring code via bytecode rewriting. When an actual request occurs, the embedded request interception code and response interception code are invoked to collect performance indicators.
  • The collected performance indicators are cached in the global counter and collected by the monitoring Agent in a centralized manner.
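The global counter pattern described above can be sketched as follows (an illustration of the pattern, not UAVStack's actual counter): interception code updates per-URL counters cheaply, and the MonitorAgent reads and resets them on each pull.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Sketch of the probe's global counter: LongAdder keeps the hot
// request path cheap under contention; the agent drains on pull.
public class GlobalCounter {
    private final ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();

    // Called from the embedded request-interception code.
    public void record(String url) {
        counts.computeIfAbsent(url, k -> new LongAdder()).increment();
    }

    // Called by the MonitorAgent on each periodic pull; resets the window.
    public long snapshot(String url) {
        LongAdder a = counts.get(url);
        return a == null ? 0 : a.sumThenReset();
    }
}
```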

The figure above shows an actual display of application monitoring.

From the application cluster display page you can drill down to the application instance monitoring page, then to the automatically generated display pages of service components/client components and log components, and finally to the monitoring metric and log pages of each URI of a service or client component. Global drill-down lets you reach whatever monitoring detail you need.

In addition, we also provide service URL monitoring reports and client URL monitoring reports.

Take the service URL monitoring report as an example:

  • You can see at a glance, for all service URLs of an application within the selected time interval, statistics such as access count, average response time, cumulative access count, cumulative error count and success rate.
  • The time range can be the latest 10 minutes, latest 3 hours, today, yesterday, latest 7 days, or any user-defined time range.

As shown in the figure above, clicking "view details" for a URL shows detailed reports for that URL over different time periods.

3.3 Application/Service Topology: Service flow

Next, the functions related to service flow are introduced. Based on application/service portraits and monitoring data, the UAV provides service flow capabilities. Service flows cover the content of an application topology, but provide a richer representation of runtime state than an application topology.

As you can see from the figure above, the current service is in the middle, with the service that calls the current service on the left and the other third-party services that the current service calls on the right.

In the service flow diagram, the thickness of a line indicates traffic volume, and its color represents health status derived from response time and error count: green for healthy, yellow for warning, and red for critical. A thick red line, for example, means a large number of requests are failing.

As shown in the figure, we can drill down from the global service flow to the service flow of a service line, and then to the service flow of an application cluster/instance of the service line for global performance tracking.

3.4 call chain

3.4.1 Call Chain: Full chain tracing

Call chain is divided into light call chain, heavy call chain and method-level call chain.

  • The light call chain, also called the basic call chain, can be enabled and disabled at runtime without any application modification. The data obtained includes service/request URL, service class + method, calling class + method, elapsed time, result state + exception, application features and technology stack features; its performance cost is negligible.
  • The heavy call chain adds request/response packet capture on top of the light call chain, at a slightly higher cost; depending on packet size, performance typically drops by about 5%.
  • Method-level call chain: if it is not enabled, call chain nodes are generated only at the entry and exit of a service. To generate call chain nodes inside the application as well, enable the method-level call chain and configure the classes and methods to trace via the AppHub interface. Its performance cost is low; the exact overhead depends on the amount of message data.

3.4.2 Call Chain: Implementation principle

The figure above shows the specific generation process of a call chain. Call chain nodes are generated primarily at service interface code and client invocation code. If the method-level call chain is turned on, call chain nodes are also generated at the procedure method code.

Next, consider how the call chain context is passed.

Intra-service context passing: a plain ThreadLocal is used within a single thread; a transmittable ThreadLocal (TTL) is used across threads (thread pools).

Inter-service context passing: a client hook adapts the original protocol (such as HTTP, RPC, MQ) and injects the call chain context metadata into the protocol header. When the request reaches the next service, that service parses the metadata from the protocol header, rebuilds the call chain context, and passes it on in turn.
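The header-injection step can be sketched as follows. The header name `UAV-TraceID` and the `traceId:spanSeq` encoding are assumptions for this example, not UAVStack's actual wire format:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of cross-service context passing: the client hook injects
// call-chain metadata into a protocol header; the next service parses
// the header and rebuilds the call chain context.
public class ChainContextCodec {

    // Client side: called by the hook before the request is sent.
    public static void inject(Map<String, String> headers, String traceId, int spanSeq) {
        headers.put("UAV-TraceID", traceId + ":" + spanSeq);
    }

    // Server side: returns {traceId, spanSeq} or null if no context arrived.
    public static String[] extract(Map<String, String> headers) {
        String v = headers.get("UAV-TraceID");
        return v == null ? null : v.split(":");
    }
}
```

The same idea applies to RPC and MQ: any protocol with extensible metadata (headers, attachments, message properties) can carry the context.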

3.4.3 Call chain: Key technology

The implementation of the call chain mainly uses four key technologies.

  • Context passing within a service: a ThreadLocal implementation that supports value passing between parent and child (pooled) threads.
  • Passing of context between services. By client hook, the original protocol is adapted and the call chain context metadata is injected into the protocol.
  • Extracting request/response packet content: to minimize impact on the application when the heavy call chain captures request/response data, a ServletWrapper mechanism dumps the data as the user reads it (exploiting string-pool properties to minimize memory footprint).
  • Call chain data collection and processing: the Agent captures call chain data; the HM side parses it, stores it in the database, and provides a query interface. A minimalist data format keeps bandwidth usage low.
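The cross-thread part of context passing can be sketched by wrapping tasks before they enter a pool: capture the submitting thread's context, install it around `run()`, and restore afterwards so pooled threads are not polluted. This illustrates the general "transmittable ThreadLocal" idea, not the specific TTL library's internals:

```java
// Sketch: parent-to-worker context passing for thread pools.
public class ContextPassing {
    static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    public static Runnable wrap(Runnable task) {
        String captured = CONTEXT.get();          // snapshot at submit time
        return () -> {
            String prev = CONTEXT.get();
            CONTEXT.set(captured);                // install in the worker thread
            try {
                task.run();
            } finally {
                CONTEXT.set(prev);                // restore: no pool pollution
            }
        };
    }
}
```

Submitting `ContextPassing.wrap(task)` instead of `task` is enough to carry the call chain context into an executor.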

3.4.4 Call chain Display: Visualization, logs can be associated, and problems can be quickly located

This is the actual presentation of the call chain. On the call chain list,

  • With one click, you can obtain the 100 slowest call chains in the last 1 minute, last 1 hour, or last 12 hours.
  • You can also search for related call chains by custom time interval or by business keywords, based on the characteristics of the application's services.
  • At any point in the call chain, you can view the entire end-to-end call chain.
  • Slow invocation bottlenecks or faulty nodes can be quickly found through the end-to-end presentation of the complete invocation chain.
  • You can jump to the log interface from any node of the call chain to view the log corresponding to the call chain link.
  • You can jump to the call chain corresponding to the log from the log interface to check which link of the complete call link the log is located in, which helps you quickly troubleshoot and locate problems.

3.4.5 Call Chain Display: View request and response packets

With the heavy call chain enabled, we can view the detailed request/response message data.

As can be seen from the figure above, when the heavy call chain is enabled we can obtain detailed data such as request headers, request body, response headers and response body.

3.5 Log Monitoring and Management

3.5.1 Log capture Architecture

The diagram above shows the architecture of the UAV log function, which uses the popular EKK architecture for log management systems: log collection, sending to Kafka, ES storage/query, RAID historical backup/download, plus statistics and alarms based on exceptions/keywords and counts.

The Agent on the application server collects and reads logs, and sends the read data to the Kafka cluster.

  • For logs that need hot query, the logging-store program reads the logs from Kafka and saves them to the ElasticSearch cluster.
  • For logs that need cold backup, the logging-raid program reads logs from Kafka and saves them to local disk. In the early hours of each morning, local logs are compressed and backed up to the RAID cluster.

Log statistics and alarms: the logging-statistics program reads exception, keyword and Nginx logs from Kafka, aggregates them by minute, and saves the results in Redis for later statistical display and alarming.
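The per-minute aggregation can be sketched as simple minute bucketing (an in-memory illustration of the counting scheme; the real pipeline stores these buckets in Redis):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of per-minute log statistics: each keyword/exception hit is
// bucketed by minute, ready for curve display and threshold alarms.
public class MinuteStats {
    private final Map<Long, Long> buckets = new TreeMap<>();

    public void record(long epochMillis) {
        long minute = epochMillis / 60_000;       // truncate to minute bucket
        buckets.merge(minute, 1L, Long::sum);
    }

    public long countAt(long epochMillis) {
        return buckets.getOrDefault(epochMillis / 60_000, 0L);
    }
}
```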

The specific log display interface has already appeared in the section about call chain association with logs, so it will not be described here.

3.6 Performance Alarms

3.6.1 Performance Alarm: Multi-indicator expression, streaming/year-over-year/sequential, double convergence, and feedback action

After UAV obtains the full-dimension server metric set, client metric set, log metric set and custom metric set, combined alarm conditions over multiple metrics can be configured. These conditions include streaming, year-over-year and sequential comparisons ("year-over-year", for example, compares 10 o'clock today with 10 o'clock yesterday; "sequential" compares, say, the last 5 minutes with the previous 5 minutes), and they can be mixed into joint expressions.

To avoid alarm storms, UAV provides two alarm convergence strategies: time-cooling convergence and gradient convergence. For example, with a gradient convergence policy configured as 1, 5, 10, alarms are actually sent only on the 1st, 5th and 10th times the condition is met; at other times alarms are suppressed and not sent.
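The gradient policy reduces to counting consecutive triggers and sending only on the configured marks. A minimal sketch of that logic (the reset behavior when the condition clears is an assumption of this example):

```java
import java.util.Arrays;

// Sketch of gradient convergence: with gradients {1, 5, 10}, only the
// 1st, 5th and 10th consecutive triggers send a real alarm.
public class GradientConvergence {
    private final long[] gradients;
    private long hits;

    public GradientConvergence(long... gradients) {
        this.gradients = gradients.clone();
        Arrays.sort(this.gradients);
    }

    // Returns true if this trigger should produce an actual alarm.
    public boolean onTrigger() {
        hits++;
        for (long g : gradients) {
            if (hits == g) return true;
        }
        return false;                             // suppressed
    }

    // Assumed behavior: counting restarts once the condition clears.
    public void reset() { hits = 0; }
}
```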

Alarms can be sent to users by SMS, email, WeChat, or mobile app; they can also be pushed to other systems through HTTP.

3.6.2 Performance Alarms: Warning policy templates, flexible policy editing, and multiple notifications

You can create a warning policy from a warning policy template. Above is a screenshot of the warning policy templates in the system.

After a policy type is selected, the rules and conditions of the warning policy are filled in automatically from our default recommended package; users only need to configure the target scope and notification method, then save to enable alarming. You can also adjust the thresholds and alarm expressions in the template package. In addition, alarms can be dynamically enabled and disabled at runtime.

3.7 JVM Monitoring analysis

3.7.1 JVM Monitoring analysis Tools: Overall architecture

The JVM monitoring and analysis tool builds on the existing overall architecture of UAVStack, as shown in the figure above. It is divided into the front end, the back end, and the probe (MOF).

  • The front end is responsible for data display and sending user execution instructions to the background.
  • The back end is responsible for dispatching instructions to specific nodes and for collecting and processing the results.
  • The monitoring probe (MOF) receives instructions from the back end and returns results after executing them.

In the probe part:

  • Real-time JVM monitoring data, such as heap memory size, Minor GC, and Full GC, is obtained through the interfaces provided by JMX.
  • JVM heap memory sampling analysis data and CPU sampling analysis data are obtained by Java Attach API provided by JDK.
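The JMX side of the probe can be sketched with the platform MXBeans, which expose heap and GC figures without any attach step (a minimal sketch; the real probe reports many more metrics):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Sketch: real-time JVM metrics via the standard platform MXBeans.
public class JvmMetrics {

    public static long usedHeapBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    // Total collection count across all collectors (Minor GC + Full GC).
    public static long gcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long c = gc.getCollectionCount();
            if (c > 0) total += c;                // -1 means "undefined"
        }
        return total;
    }
}
```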

3.7.2 JVM Monitoring analysis Tool: Monitoring function display

Above is a page showing the monitoring capabilities of the JVM monitoring analysis tool. The main functions of the JVM monitoring and analysis tool include:

  • The Basic Information tab displays basic information about the JVM, including the JVM version, startup time, JVM parameters, and system properties.
  • The Monitor tab provides real-time JVM monitoring metrics, including CPU, threads, memory, GC statistics, and more. You can switch the time range, such as the last 10 minutes, last 3 hours, today, yesterday, last 7 days, or a user-defined range, to view the JVM monitoring data over different periods.
  • The Thread Analysis and Memory Dump Tab provides JVM thread analysis and JVM heap memory Dump online tools.
  • CPU Sampling and Memory Sampling Tab provides JVM heap memory sampling analysis and CPU sampling analysis tools.

3.7.3 JVM monitoring analysis tools: Heap memory sampling and CPU sampling analysis

  • Heap memory sampling analysis helps locate the root cause of memory leaks by sampling, in real time, the JVM's heap memory allocation, the number of instances of each class, and the amount of memory those instances occupy.
  • CPU sampling analysis samples the CPU execution time of each method in the JVM in real time to help locate hotspot methods. For example, when CPU usage reaches 100%, you can determine which methods account for the highest share.

3.8 Database Monitoring

3.8.1 Database Monitoring: Core function

Different from traditional database-side monitoring, UAV database monitoring adopts a new perspective to analyze from the application side, which makes up for the deficiencies of existing database-side monitoring and increases the database-application association analysis capability.

The functions of database monitoring include database connection pool monitoring, SQL category statistics, slow SQL statistics, slow SQL time distribution statistics, slow SQL tracking, and call chain/log association.

The monitoring function of slow SQL mainly includes slow SQL statistics + slow SQL tracking. For slow SQL monitoring, you can set thresholds to determine how slow SQL is. You can set the value by application and database instance. SQL operations that exceed the threshold are counted as slow SQL operations.
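The threshold mechanism can be sketched as a simple timing wrapper around statement execution (an illustration of the idea; the real probe hooks the JDBC layer via bytecode rewriting rather than requiring explicit wrapping):

```java
import java.util.function.Supplier;

// Sketch of application-side slow-SQL detection: time each statement
// and flag it when it exceeds a configurable threshold.
public class SlowSqlDetector {
    private final long thresholdMs;               // per app / per DB instance

    public SlowSqlDetector(long thresholdMs) { this.thresholdMs = thresholdMs; }

    public boolean isSlow(long elapsedMs) { return elapsedMs > thresholdMs; }

    public <T> T time(String sql, Supplier<T> stmt) {
        long start = System.nanoTime();
        try {
            return stmt.get();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            if (isSlow(elapsedMs)) {
                // A real probe would record this into slow-SQL statistics
                // and associate it with the current call chain.
                System.out.println("SLOW SQL (" + elapsedMs + " ms): " + sql);
            }
        }
    }
}
```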

With call chain/log aggregation enabled, slow SQL operations can be associated with the corresponding call chain and log to help diagnose and locate problems.

The figure above shows the slow SQL statistics report of the database monitoring function, displaying slow SQL statistics over a given period. Slow SQL category statistics are grouped by SQL type: I-insert, D-delete, U-update, Q-query, and B-batch operation.

Of the two reports at the bottom, the database connection pool monitor shows the total, active and idle connections in the connection pool; the other, SQL category statistics, classifies operations by SQL type.

3.8.2 Database Monitoring Case: A collection system

The application of database monitoring is illustrated with a case from an outsourced collections system.

When the collections system queried collection history, it executed a count(*) over the records. Because the execution plan was abnormal and execution was inefficient, the query consumed a large amount of resources; as a result, the database server's CPU was exhausted and the system became unavailable. As the figure shows, the number of slow SQL statements, specifically count(*) statements, increased sharply during the failure.

In the figure above you can see the detailed SQL text of the slow SQL. During the fault, connection pool resources were exhausted: active connections peaked and idle connections dropped to 0. The SQL category statistics chart also shows a marked increase in failed SQL queries during the failure.

On the slow SQL tracing page, you can view the list of slow SQL statements during faults. The three SQL statements that take a long time to execute are all count(*) statements.

Each slow SQL execution, together with its result and SQL text, can be associated with the call chain. Here the count(*) statements executed slowly and with errors; associating slow SQL executions with call chains and logs helps locate and analyze the fault.

3.9 Container ecological support

3.9.1 Container Ecological Support: Basic principles

Ecological support for the container means that all of the UAV functions can be seamlessly migrated and used on the container cloud platform. In a container environment, the monitoring Agent and application reside in different containers. Therefore, some adaptation tasks need to be performed, including application portrait/monitoring data collection, process portrait/monitoring data collection, and log collection path adaptation.

  • First, for application portrait/monitoring data collection, the monitoring Agent container must be able to reach the application container via the container's virtual IP, obtaining application portraits and real-time monitoring data through HTTP requests.
  • Secondly, in the process portrait/monitoring data collection, the PID namespace of the monitoring agent container must be consistent with that of the host machine, so that the monitoring agent can scan the /proc directory of the host machine to obtain process information.
  • Finally, for log collection path adaptation, the monitoring Agent obtains, through the API, the volume information used by both the application and the Agent. With volume information from both sides, the Agent can correctly locate the application's log output path within its own container.

3.9.2 Container Ecology Support: Application environment Monitoring – Kubernetes

All of the above UAV functions can be migrated seamlessly to the container cloud platform, so the UI is almost indistinguishable from the VM environment; only the application environment monitoring interface differs. The image above shows the application environment monitoring interface in a Kubernetes environment: on one physical host there are 10 host processes, 17 containers, and 28 in-container processes.

Application environment monitoring displays the mapping between containers and processes. Click to view container performance indicators and process performance indicators respectively.

Kubernetes-related attributes are added to the container and process property lists. This is the only slight difference in the application environment monitoring UI between the container cloud environment and the VM environment; for the other functions (such as call chains, log monitoring, and database monitoring), the interfaces are identical in both environments and users notice no difference.

3.10 Agent Plug-in Support

3.10.1 Agent Plug-in Support: Open-Falcon plug-ins and customized UAV plug-ins are supported

To make up for the lack of monitoring breadth, UAV currently offers a metric collection plug-in mechanism that supports existing Open-Falcon collection plug-ins (similar to Prometheus exporters) as well as customized UAV plug-ins, allowing UAV's monitoring capabilities to be flexibly extended to virtually all commonly used Internet middleware, such as MySQL, Redis, Kafka, RocketMQ, MongoDB and ElasticSearch.

The figure above shows the UAV monitoring curve of Kafka, RocketMQ and Redis indicators.

3.11 Service Link Monitoring and Alarm

3.11.1 Service Link Monitoring and Alarm: Solution

Most of CreditEase's businesses span multiple business lines and multiple systems. To quickly locate the faulty system at the IT level, and at the business level to determine the scale of impact and the specific business documents and customers affected, thereby addressing the pain points of business and operations staff, UAV provides a general business link monitoring and alarm access platform.

As shown in the figure, the platform includes heterogeneous business log collection, data transmission, data segmentation, filtering, and aggregation computing. The results can then be persisted, displayed on large-screen business reports, and used to generate business work orders and alarms.

During implementation, each business group first writes logs with business meaning in its application, and then configures and maintains the business-log parsing logic, the specific alarm policies, and the alarm message templates. In this way, each group can quickly build a link monitoring system for its own business.
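
The collect-segment-filter-aggregate pipeline described above can be sketched as follows. The log line format, field names, window size, and alarm threshold are illustrative assumptions, not UAV's actual configuration:

```python
import re
from collections import Counter

# Sketch of the per-team configuration described above: a regex that parses
# business log lines into events, a filtering step, and a windowed
# aggregation that feeds business alarms.

LOG_PATTERN = re.compile(
    r"(?P<ts>\d+)\s+biz=(?P<biz>\w+)\s+doc=(?P<doc>\w+)\s+status=(?P<status>\w+)"
)

def parse(lines):
    """Segment raw log lines into structured business events."""
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:
            yield m.groupdict()

def aggregate_failures(events, window=60):
    """Count failed documents per business domain within a time window."""
    counts = Counter()
    for e in events:
        if e["status"] == "FAIL":                     # filtering step
            bucket = int(e["ts"]) // window * window  # window alignment
            counts[(e["biz"], bucket)] += 1
    return counts

logs = [
    "1700000010 biz=loan doc=D001 status=OK",
    "1700000020 biz=loan doc=D002 status=FAIL",
    "1700000030 biz=loan doc=D003 status=FAIL",
    "1700000090 biz=pay doc=D004 status=FAIL",
]
stats = aggregate_failures(parse(logs))
# Fire a business alarm when failures in one window reach a threshold.
alarms = [(biz, t, n) for (biz, t), n in stats.items() if n >= 2]
print(alarms)  # only the loan domain crosses the threshold
```

In the real platform the parsing rules and thresholds are supplied by each business group through configuration rather than hard-coded.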

The advantages of this business monitoring system are:

  • Bidirectional association between the IT-level call chain and business events gives the call chain business meaning, upgrading cross-system call tracking to cross-business-domain tracking.
  • When a problem occurs, an alarm message with business meaning can be sent directly to business/operations personnel, while the call chain helps technical operations personnel quickly locate the faulty node. In addition, technical and operations staff can trace a problem by reverse-looking-up the system link from a business document ID.
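
The bidirectional association in the first bullet amounts to a two-way index between business document IDs and IT-level trace IDs. A minimal sketch, with hypothetical IDs and field names:

```python
# Business events carry a call-chain trace ID; an inverted index then lets
# staff go from a business document back to the IT-level call link, and
# from a failing trace back to the affected documents.

class BizTraceIndex:
    def __init__(self):
        self.doc_to_trace = {}
        self.trace_to_docs = {}

    def record(self, doc_id, trace_id):
        """Associate one business document with one call-chain trace."""
        self.doc_to_trace[doc_id] = trace_id
        self.trace_to_docs.setdefault(trace_id, []).append(doc_id)

    def trace_for(self, doc_id):
        """Business -> IT: find the call chain behind a business document."""
        return self.doc_to_trace.get(doc_id)

    def docs_for(self, trace_id):
        """IT -> business: find which documents a failing trace affected."""
        return self.trace_to_docs.get(trace_id, [])

index = BizTraceIndex()
index.record("ORDER-1001", "trace-af3e")
index.record("ORDER-1002", "trace-af3e")
print(index.trace_for("ORDER-1001"))  # trace-af3e
print(index.docs_for("trace-af3e"))   # ['ORDER-1001', 'ORDER-1002']
```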

3.11.2 Service Link Monitoring and Alarm: Service alarm example

This is a concrete example of a business alarm.

Above is an alarm email sent to business colleagues, detailed down to X:X:X:X: in X business links of X systems, an X problem occurred, affecting X customers; the customer name is X and the mobile phone number is X. This helps business operations personnel quickly locate the problem documents and the affected customers.

Below is the email sent to technical operations colleagues. In addition to the business information above, it includes the IT call link, allowing technical operations colleagues to quickly locate and diagnose the problem.

3.12 Intelligent O&M

At present, UAV's engineering practice in AIOps mainly includes anomaly detection, root cause analysis, alarm convergence with intelligent noise reduction, and the mission robot HIT. This sharing focuses on indicator anomaly detection and root cause analysis.

3.12.1 Intelligent O&M: Exception detection framework

The image above shows the time series anomaly detection framework used in UAV engineering practice. It mainly includes offline model optimization, online model prediction, and A/B testing. Offline model optimization and online model prediction together form a closed loop for intelligent indicator anomaly detection. The specific process is shown in the figure; the key points include:

  • For unlabeled data, unsupervised methods are used to identify anomalies. For example, for anomaly detection on continuous data, the isolation forest algorithm can be selected: a forest of multiple iTrees isolates each point, and points that are isolated after unusually few splits are judged anomalous.
  • Labeled data are analyzed with supervised learning, which learns the historical behavior of abnormal and normal populations. When new data arrives, the model can then decide directly and output whether it is anomalous.
  • However, manually annotating samples is labor-intensive, so supervised methods struggle to obtain data at the required scale. Semi-supervised methods can therefore be adopted to expand the annotated sample base.
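
The unsupervised step above can be sketched with scikit-learn's `IsolationForest` (an ensemble of iTrees). The synthetic metric series, contamination rate, and injected spike are illustrative assumptions, not UAV's production settings:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Points that an iTree can isolate in few splits get short path lengths
# and are therefore scored as anomalies by the forest.

rng = np.random.default_rng(0)
series = rng.normal(loc=50.0, scale=2.0, size=200)  # a healthy metric
series[120] = 500.0                                 # inject one spike

X = series.reshape(-1, 1)  # one feature: the metric value
model = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
labels = model.fit_predict(X)  # -1 = anomaly, 1 = normal

anomalies = np.where(labels == -1)[0]
print(anomalies)  # the injected spike at index 120 should be flagged
```

In the closed loop described above, the model parameters would be tuned offline and the fitted model served online for prediction.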

3.12.2 Intelligent O&M: All-dimensional data can be associated

Following the technical route of all-dimensional monitoring -> all-dimensional correlation -> all-dimensional intelligence, after UAV collects multi-dimensional monitoring data it needs to establish correlations between those data.

These relationships can take several forms:

  • A strong association established through portraits, such as the relationships between host and virtual machine, virtual machine and application server, application server and application, and application and service component.
  • A strong association established through the call chain or the service flow graph.
  • An association discovered by a machine learning algorithm; for example, indicators that change together within the same time window may be correlated.

It should be noted that the business characteristics of the financial industry entail dependence on third parties, so alarms are highly random, which objectively lowers the quality of learning samples. Therefore, UAV currently relies on the strong association relationships.
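
The statistical form of association in the third bullet can be sketched as Pearson correlation between metric series over the same time window. The metric names, synthetic series, and 0.9 threshold below are illustrative assumptions:

```python
import numpy as np

# Metrics whose series are strongly correlated within the same time window
# are linked as association candidates.

rng = np.random.default_rng(1)
base = rng.normal(100, 5, 60)  # app response time over one 60-sample window
window = {
    "app.response_time_ms": base,
    "db.query_time_ms": base * 0.8 + rng.normal(0, 1, 60),  # co-moves
    "host.cpu_idle_pct": rng.normal(70, 3, 60),             # independent
}

def correlated_pairs(metrics, threshold=0.9):
    """Return metric pairs whose Pearson correlation exceeds the threshold."""
    names = list(metrics)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = np.corrcoef(metrics[names[i]], metrics[names[j]])[0, 1]
            if abs(r) >= threshold:
                pairs.append((names[i], names[j], round(float(r), 3)))
    return pairs

print(correlated_pairs(window))  # only the app/db pair is linked
```

As the text notes, UAV prefers the portrait- and call-link-based strong associations over this statistical form because noisy alarm samples make learned associations unreliable.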

3.12.3 Intelligent O&M: Root cause analysis, alarm convergence, and intelligent noise reduction

With these associations in place, root cause analysis becomes possible. Alarms are collected from various channels; alarm filtering removes repeated and unimportant alarms; association analysis then links different types of alarms within the same time window, using either portrait-based or call-link-based associations. Next, the probability that each alarm is the root is computed from its preset weight, and the alarm with the highest weight is marked as the root alarm. Finally, similar historical root alarms in the alarm-handling knowledge base can be used to recommend solutions.

Alarm convergence and intelligent noise reduction are realized as part of root cause analysis and location. For example, repeated alarms, non-root alarms, and other alarms on the same link are suppressed.
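
The filter-associate-score flow above can be sketched as follows. The alarm fields, the call-link grouping key, and the per-type weights are illustrative assumptions:

```python
from collections import defaultdict

# Dedupe alarms, group them into (call link, time window) buckets, mark the
# alarm with the highest preset weight in each bucket as the root, and
# suppress the rest (alarm convergence / noise reduction).

WEIGHTS = {"db_down": 100, "slow_sql": 60, "timeout": 30}  # preset weights

def analyze(alarms, window=60):
    """Return (root alarms, suppressed alarms)."""
    seen, groups = set(), defaultdict(list)
    for a in alarms:
        key = (a["type"], a["node"])
        if key in seen:              # alarm filtering: drop repeats
            continue
        seen.add(key)
        bucket = a["ts"] // window
        groups[(a["link"], bucket)].append(a)  # associate via call link

    roots, suppressed = [], []
    for group in groups.values():
        group.sort(key=lambda a: WEIGHTS.get(a["type"], 0), reverse=True)
        roots.append(group[0])        # highest weight -> root alarm
        suppressed.extend(group[1:])  # converge non-root alarms
    return roots, suppressed

alarms = [
    {"type": "timeout", "node": "svc-a", "link": "L1", "ts": 1000},
    {"type": "timeout", "node": "svc-a", "link": "L1", "ts": 1005},  # repeat
    {"type": "slow_sql", "node": "svc-b", "link": "L1", "ts": 1010},
    {"type": "db_down", "node": "db-1", "link": "L1", "ts": 1015},
]
roots, suppressed = analyze(alarms)
print(roots[0]["type"])  # db_down is the probable root cause
print(len(suppressed))   # prints 2: the non-root alarms are converged
```

A production system would additionally consult the historical alarm-handling knowledge base to attach recommended solutions to the root alarm.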

Four, summary

The figure above shows the actual online call-relationship map of CreditEase's core business lines. As CreditEase's standard company-level intelligent monitoring software, UAV continues to cover all of CreditEase's critical business systems, supporting the company's more than 300 business lines. More and more colleagues are becoming proficient at using UAV for daily operations and maintenance, early warning, in-process problem diagnosis, and post-incident analysis.

With UAV, operations and maintenance can be carried out anytime, anywhere. UAVStack is now open source on GitHub; visit the links below for more details.

  • Official website: uavorg.github.io/main/
  • Open source address: github.com/uavorg