1. Background

1. At present, problems such as middleware container node failures and insufficient machine resources (disk, memory, CPU) occur frequently. With automated operation and maintenance (O&M) in place, cluster exceptions can be handled quickly.

2. In the past, these problems required human intervention, which meant high labor cost and the lack of a standard O&M process.

2. Goals

1. Standardization: establish a standard O&M process.

2. Visualization: make the O&M process visible on a platform and fully traceable.

3. Automation: container rebuilds, process start/stop, and fault self-healing driven by root cause analysis of selected indicators.

3. Fault Self-healing Architecture Diagram

The monitoring data collection module of fault self-healing periodically reports collected instance indicator data to the processor. The processor calls the metadata module to obtain the matching rules and the fault self-healing process. When abnormal data is matched successfully, an O&M event is generated, and event convergence filtering is applied so that a large number of events with the same attributes (such as service or equipment room) are not produced. Finally, the self-healing process is executed to recover the O&M event, send notifications, and restore the service.

Product Architecture Diagram:

Overall flow chart:

4. Solution Design

4.1 Fault Identification

Exceptions are identified by pulling instance monitoring data and aggregating multiple indicators; once an exception is confirmed, the automated fault-handling process is triggered.

Solution 1: Filtering monitoring data

Filter-based detection matches each data point on its own: the result depends only on the data itself, no time window needs to be configured, and each data point is processed as it arrives. An O&M event is triggered as soon as the specified exception threshold is reached. This detection scheme is too coarse: an instantaneous spike in some monitoring data can trigger unnecessary O&M actions, and frequent self-healing affects middleware stability. This scheme is therefore usually used to trigger alarms; using it to trigger O&M actions carries some risk.
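A minimal sketch of this filter-style matching, in Java; the `MetricEvent` and `Rule` types and the thresholds are illustrative, not the platform's real model:

```java
// Minimal sketch of filter-style matching: each data point is checked in isolation.
// MetricEvent, Rule and the thresholds are illustrative, not the actual platform types.
public class ThresholdFilter {

    public record MetricEvent(String instanceId, String metric, double value) {}

    public record Rule(String metric, double threshold) {}

    /** Returns true when a single data point already exceeds the configured threshold. */
    public static boolean matches(MetricEvent event, Rule rule) {
        return rule.metric().equals(event.metric()) && event.value() >= rule.threshold();
    }

    public static void main(String[] args) {
        Rule cpuRule = new Rule("cpu_usage", 0.9);
        MetricEvent spike = new MetricEvent("redis-node-1", "cpu_usage", 0.95);
        if (matches(spike, cpuRule)) {
            // A single instantaneous spike is enough to fire here, which is why this
            // scheme suits alarms better than triggering O&M actions.
            System.out.println("O&M event candidate: " + spike);
        }
    }
}
```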

Solution 2: Detection based on a time window

Window types:

Fixed windows: a fixed period of time is set for real-time statistics. Data is usually partitioned by key, so windows can be processed concurrently to speed up the computation.

Sliding windows: a window length and a slide length are set. If the slide length is less than the window length, windows overlap and some data is counted more than once. If the window length equals the slide length, this degenerates into a fixed window; if the window length is less than the slide length, it becomes a sampled computation.

Session windows: defined per entity, for example the set of videos watched by a specific user. Because it is uncertain when the next data point will arrive, session windows are always irregular.

Conclusion: periodically reported monitoring data can be regarded as relatively regular, unbounded data, so the first two window types are suitable for streaming computation.

Choosing the window time:

Windowing by processing time is very simple: you only look at the data currently in the window and do not worry about completeness. However, real data carries an event time, and in a distributed system events usually arrive out of order; if some part of the system is delayed, the accuracy of the result drops sharply. Windowing by event time is clearly better for business accuracy, but it has a significant drawback: because of data latency, it is hard to say in a distributed system at what point the data for a given time range is complete.

Guaranteeing data completeness:

Obviously, no matter how large the window is, there is no guarantee that all data whose event time falls within the window will arrive on time. A watermark answers the question of when the data can be considered complete so that the window can be closed and computed. As shown in the diagram below:

Set a fixed window with 2-minute aggregation; the aggregation results of the four windows are 6, 6, 7, and 12 respectively. However, after the first window closes at 12:02, its data does not become complete until 12:03, so the result is inaccurate. Introducing a watermark yields the correct aggregation result of 11. The watermark here expresses how far back data will no longer be updated: before each window is aggregated, the maximum event time of the window is determined and the tolerable delay is added to it to form the watermark. Once the event time of newly received data exceeds the watermark, the window's data is considered final; it is popped out for computation and its state no longer needs to be held in memory.
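A small sketch of this watermark rule in Java: the window is only closed once an event arrives whose event time is later than the window end plus the tolerated delay. The 1-minute lateness and the timestamps are illustrative assumptions:

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the watermark rule described above: a window is only closed once an
// event arrives whose event time is later than (window end + tolerated delay).
// The 1-minute lateness and timestamps are illustrative assumptions.
public class WatermarkSketch {

    static final Duration ALLOWED_LATENESS = Duration.ofMinutes(1);

    /** The window may be finalized once the incoming event time passes the watermark. */
    static boolean windowCanClose(Instant windowEnd, Instant newestEventTime) {
        Instant watermark = windowEnd.plus(ALLOWED_LATENESS);
        return newestEventTime.isAfter(watermark);
    }

    public static void main(String[] args) {
        Instant windowEnd = Instant.parse("2021-01-01T12:02:00Z"); // the 12:00-12:02 window
        // Still within tolerated lateness: keep the window open and wait for late data.
        System.out.println(windowCanClose(windowEnd, Instant.parse("2021-01-01T12:02:30Z"))); // false
        // Past the watermark: close the window and emit its aggregation result.
        System.out.println(windowCanClose(windowEnd, Instant.parse("2021-01-01T12:03:30Z"))); // true
    }
}
```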

Streaming computation with a fixed (tumbling) window:

The volume of monitoring data periodically reported by the middleware is not very large, so in a distributed system Redis can be used for lightweight real-time stream aggregation and for triggering the tumbling windows.

As shown in the figure above, set the matching window size to 2 minutes and allow a maximum data delay of 2 minutes, so that watermark = maximum window time + 2 minutes. Continuous rolling of the window is achieved by aggregating the results of the two windows into the Redis cache in real time. When the event time exceeds the watermark of Window1, Window1 is immediately popped out to the process handler, which checks whether the exception threshold has been exceeded; if so, an O&M event is generated and waits for self-healing. At the same time, the data of the second window, Window2, is moved into the first window, Window1, achieving a continuous rolling effect.
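A rough sketch of this two-slot rolling window kept in Redis, using Jedis hash commands; the key names, field layout, and constants are assumptions, not the production implementation:

```java
import redis.clients.jedis.Jedis;
import java.util.Map;

// Rough sketch of the two-slot rolling window kept in Redis.
// Key names ("win1"/"win2"), field layout and the 2-minute sizes are assumptions.
public class RedisTumblingWindow {

    private static final long WINDOW_MILLIS = 2 * 60 * 1000;   // window size: 2 minutes
    private static final long LATENESS_MILLIS = 2 * 60 * 1000; // tolerated delay: 2 minutes

    private final Jedis jedis = new Jedis("localhost", 6379);
    private long window1End; // end timestamp of the first (oldest) window

    public RedisTumblingWindow(long firstWindowEnd) {
        this.window1End = firstWindowEnd;
    }

    /** Aggregate one data point into the window its event time falls in. */
    public void accept(String metric, double value, long eventTimeMillis) {
        String key = eventTimeMillis < window1End ? "win1" : "win2";
        jedis.hincrByFloat(key, metric, value);

        // watermark = end of Window1 + tolerated delay; once passed, pop Window1.
        if (eventTimeMillis > window1End + LATENESS_MILLIS) {
            Map<String, String> aggregated = jedis.hgetAll("win1");
            // Hand the aggregated result to the rule processor (threshold check) here.
            System.out.println("window popped: " + aggregated);

            jedis.del("win1");
            if (jedis.exists("win2")) {
                jedis.rename("win2", "win1"); // roll Window2 into Window1
            }
            window1End += WINDOW_MILLIS;
        }
    }
}
```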

Conclusion: the tumbling window uses little cache space and aggregates quickly, but it is imprecise. First, if the window time is set large, data that would just reach the configured threshold may be split across two adjacent windows, so neither aggregation result triggers an O&M action. Second, when matching an O&M event against multiple indicators (one monitoring indicator per fixed window), the pop-out times of the windows after reaching the watermark may not be aligned, and some rules may never match; a matching wait between windows is then needed to work around this. Both problems can be solved with the sliding window approach.

Streaming computation with a sliding window:

Multi-indicator sliding window: DataEvent denotes the monitoring data of one instance, reported one or more times per minute and containing three indicators, Metrics1, Metrics2, and Metrics3. If the aggregation of three Metrics1 values exceeds the specified threshold, an O&M event is triggered. With a sliding window of 1 minute and a maximum delay of 1 minute, the data of three windows is popped out after 12:08 for aggregation and matched against the O&M event rules. As the window slides forward, data that no longer participates in the statistics is dropped from the cache, such as the indicator data drawn with dotted lines in the figure above.
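A sketch of per-indicator sliding-window aggregation kept in memory (the production version would keep this state in a cache such as Redis); the 3-minute span and the simple sum aggregation are assumptions:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of per-indicator sliding-window aggregation in memory; the 3-minute span
// and the sum() rule are assumptions, not the platform's real configuration.
public class SlidingWindowAggregator {

    record Point(long eventTimeMillis, double value) {}

    private static final long SPAN_MILLIS = 3 * 60 * 1000; // keep the last three 1-minute slots

    private final Map<String, Deque<Point>> byMetric = new ConcurrentHashMap<>();

    /** Add a point for one indicator (e.g. "metrics1") and drop points that slid out. */
    public void accept(String metric, double value, long eventTimeMillis) {
        Deque<Point> window = byMetric.computeIfAbsent(metric, k -> new ArrayDeque<>());
        window.addLast(new Point(eventTimeMillis, value));
        while (!window.isEmpty()
                && window.peekFirst().eventTimeMillis() < eventTimeMillis - SPAN_MILLIS) {
            window.pollFirst(); // data that no longer participates is not kept
        }
    }

    /** Aggregate the indicator over the current window; here a simple sum. */
    public double aggregate(String metric) {
        return byMetric.getOrDefault(metric, new ArrayDeque<>())
                .stream().mapToDouble(Point::value).sum();
    }
}
```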

4.2 Event Convergence and Self-healing Control

Event convergence:

When the same event occurs multiple times within a short period, self-healing may run in parallel or be triggered repeatedly. Self-healing usually involves restarting a container or service, and frequent self-healing affects cluster stability. Therefore, a silence period can be set for event convergence: until the silence period expires, the event is not sent to the self-healing service again.
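A sketch of the silence-period check using Redis `SET ... NX EX` via Jedis: only the first occurrence of an event key within the silence window is forwarded to self-healing, and later duplicates are converged. The key format and the 10-minute silence period are assumptions:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

// Sketch of the silence-period check: the first occurrence of an event key within the
// silence window is forwarded to self-healing, later duplicates are converged (dropped).
// The key format and the 10-minute silence period are assumptions.
public class EventConvergence {

    private static final int SILENCE_SECONDS = 600;

    private final Jedis jedis = new Jedis("localhost", 6379);

    /** Returns true if the event should be forwarded to the self-healing service. */
    public boolean shouldForward(String cluster, String instance, String eventType) {
        String silenceKey = "silence:" + cluster + ":" + instance + ":" + eventType;
        // SET key value NX EX <ttl>: only the first caller within the TTL succeeds.
        String result = jedis.set(silenceKey, "1", SetParams.setParams().nx().ex(SILENCE_SECONDS));
        return "OK".equals(result);
    }
}
```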

Self-healing control:

1. Within the same cluster, cluster-level events and instance-level events are mutually exclusive; that is, only one node in the cluster may self-heal at a time. If several instances in a cluster self-heal simultaneously (for example, during vertical scaling), the cluster becomes unavailable. Serial self-healing within a cluster can be implemented by having the MQ producer use the cluster ID as the routing key to a specific queue, while the consumer pulls from that queue and consumes in order (see the sketch after this list). As shown in the figure below:

2. When a node is newly added or taken offline, the system allows a two-minute grace period to prevent self-healing from being triggered by the instability of a node that has just joined or left the cluster.

3. Set an upper limit on the number of self-healing attempts for scenarios that self-healing cannot resolve, to prevent self-healing loops and floods of notifications.

4. Filter expired events. Each event has an expiration time, which specifies how long after the event occurs it is considered expired; validity is checked during the decision process.
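A sketch of the cluster-ID routing mentioned in item 1: events for the same cluster always land on the same queue, so a single ordered consumer processes them serially. The `MessageQueueClient` interface and the queue naming are illustrative stand-ins for the real MQ client:

```java
// Sketch of routing self-healing events to a per-cluster queue so that events for the
// same cluster are consumed serially. MessageQueueClient and the queue naming scheme
// are illustrative stand-ins for the real MQ producer API.
public class ClusterSerialRouter {

    /** Illustrative abstraction over the real MQ producer API. */
    public interface MessageQueueClient {
        void send(String queueName, String payload);
    }

    private static final int QUEUE_COUNT = 8;
    private final MessageQueueClient mq;

    public ClusterSerialRouter(MessageQueueClient mq) {
        this.mq = mq;
    }

    /** Events for one cluster always map to the same queue, so one consumer handles them in order. */
    public void publish(String clusterId, String eventPayload) {
        int queueIndex = Math.floorMod(clusterId.hashCode(), QUEUE_COUNT);
        mq.send("self-healing-queue-" + queueIndex, eventPayload);
    }
}
```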

4.3 Fault Cause Analysis

An O&M event triggers a callback that analyzes the root cause of the fault and screens out O&M actions triggered by mistake. The root cause analysis strategy corresponding to the O&M event is pulled, and self-healing is driven mainly by dynamic indicators plus a decision tree. The entire analysis and self-healing module is visualized. Indicators: monitored items such as system load, CPU usage, memory usage, network I/O, disk usage, system logs, and GC logs.

Example: decision tree model

Example: summary of node-offline conclusions
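A toy illustration of the "dynamic indicators + decision tree" idea: each node tests one indicator value and each leaf names a root-cause conclusion. The thresholds, indicator names, and conclusions are invented for the example:

```java
import java.util.Map;
import java.util.function.Predicate;

// Toy illustration of "dynamic indicators + decision tree": each node tests one
// indicator and each leaf names a root-cause conclusion. Thresholds, indicator names
// and conclusions are invented for the example.
public class RootCauseTree {

    record Node(String conclusion, Predicate<Map<String, Double>> test, Node yes, Node no) {
        static Node leaf(String conclusion) { return new Node(conclusion, null, null, null); }
        static Node branch(Predicate<Map<String, Double>> test, Node yes, Node no) {
            return new Node(null, test, yes, no);
        }
    }

    static String decide(Node node, Map<String, Double> indicators) {
        if (node.test() == null) return node.conclusion();
        return decide(node.test().test(indicators) ? node.yes() : node.no(), indicators);
    }

    public static void main(String[] args) {
        Node tree = Node.branch(m -> m.getOrDefault("disk_usage", 0.0) > 0.9,
                Node.leaf("disk full: trigger disk cleanup"),
                Node.branch(m -> m.getOrDefault("memory_usage", 0.0) > 0.95,
                        Node.leaf("memory exhausted: trigger vertical scale-up"),
                        Node.leaf("no automated action: notify on-call")));

        System.out.println(decide(tree, Map.of("disk_usage", 0.95, "memory_usage", 0.5)));
    }
}
```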

4.4 Fault Self-healing

Based on the root cause analysis and the summarized exception conclusions, the visualized event handling flow is orchestrated, and decision actions and execution actions are configured in the metadata module. When an O&M event is detected, the pre-orchestrated event handling flow executes the relevant process actions to achieve service self-healing.
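A sketch of executing a pre-orchestrated handling flow in code: each step pairs a decision action with an execution action, and the flow stops if a decision fails. The step names and the flow content are illustrative, not the real metadata model:

```java
import java.util.List;

// Sketch of executing a pre-orchestrated handling flow: a decision action gates each
// execution action. Step names and flow content are illustrative, not the real metadata model.
public class SelfHealingFlow {

    interface DecisionAction { boolean allowed(String instanceId); }
    interface ExecutionAction { void run(String instanceId); }

    record Step(String name, DecisionAction decision, ExecutionAction execution) {}

    static void execute(List<Step> flow, String instanceId) {
        for (Step step : flow) {
            if (!step.decision().allowed(instanceId)) {
                System.out.println("flow stopped at step: " + step.name());
                return; // decision failed: stop and leave the event for manual handling
            }
            step.execution().run(instanceId);
            System.out.println("step done: " + step.name());
        }
    }

    public static void main(String[] args) {
        List<Step> nodeOfflineFlow = List.of(
                new Step("check node really unreachable", id -> true, id -> {}),
                new Step("remove node from cluster", id -> true,
                        id -> System.out.println("removing " + id)),
                new Step("rebuild container", id -> true,
                        id -> System.out.println("rebuilding " + id)));
        execute(nodeOfflineFlow, "redis-node-1");
    }
}
```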

The node exception handling flow is orchestrated as follows:

5. Summary

By pulling monitoring data, abnormal data is detected and matched to trigger O&M events. Combined with the orchestrated event handling flow, some tedious self-healing actions are completed automatically, and the entire execution process is visualized and traceable. O&M scenarios such as disk cleanup and capacity expansion can also be orchestrated. In addition, troubleshooting experience can be accumulated into a knowledge base, and abnormal monitoring data can be traced to detect problems in advance and handle potential faults.

About the author

Carry, OPPO Senior Back-end Engineer

Currently responsible for the R&D of middleware automated O&M in the OPPO middleware group, focusing on distributed scheduling, message queues, Redis, and other middleware technologies.

For more content, follow the [OPPO Digital Intelligence Technology] official account.