About the author

Xiaoquanquan, Senior R&D Engineer, Baidu Cloud

Responsible for system and strategy development of the Noah external network quality monitoring platform, part of Baidu Cloud intelligent operation and maintenance, with extensive practical experience in network monitoring.

Overview

In the previous article on the Noah external network quality monitoring platform for Baidu Cloud intelligent operation and maintenance, “Baidu network monitoring combat: Falcon World War I famous (on)”, we briefly introduced one type of network anomaly: the machine-room-side anomaly (an abnormality of equipment or links on the Baidu side). In the data, this fault shows up as degraded service for multiple provinces accessing a particular Baidu machine room. Therefore, in the external network fault detection of Falcon (Baidu’s external network monitoring platform), a machine-room-side anomaly is declared when the proportion of abnormal provinces accessing a given machine room exceeds a set threshold.
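The threshold rule described above can be sketched in a few lines. This is only an illustration: the function name, the dictionary layout and the 0.5 threshold are assumptions made for the example, not Falcon’s actual interface or configuration.

```python
def machine_room_side_abnormal(province_abnormal, ratio_threshold=0.5):
    """Return True when the share of abnormal provinces exceeds the threshold.

    province_abnormal: dict of province name -> bool for one machine room.
    ratio_threshold: hypothetical value chosen for illustration only.
    """
    if not province_abnormal:
        return False  # no province reported data in this cycle
    abnormal = sum(1 for flag in province_abnormal.values() if flag)
    return abnormal / len(province_abnormal) > ratio_threshold


# Made-up states of four provinces accessing one machine room.
states = {"Hebei": True, "Henan": True, "Hubei": True, "Guangdong": False}
print(machine_room_side_abnormal(states))  # 3 of 4 abnormal, above 0.5
```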

While reviewing external network fault statistics, we found that a failure of an operator’s backbone network link also causes abnormal access from several provinces to specific machine rooms. Under the existing fault detection framework, such a backbone link anomaly is also judged to be a machine-room-side anomaly. However, machine-room-side faults and backbone link faults differ in cause and in how they appear in the data, and the stop-loss methods for the two are different. We therefore need a fault detection strategy that distinguishes the two types of anomaly, so that the automatic stop-loss system can execute the appropriate stop-loss plan for each type.

In the rest of this article, we introduce backbone network links and how their anomalies appear, and then the design of the fault detection strategy.

What is a backbone link?

A backbone network is a high-speed network that operators use to connect multiple regions. One of its important roles is therefore to carry network data transmitted across regions. A number of cross-regional links together form a complete backbone network.

Figure 1 shows a backbone link that connects the northern and southern regions: the second Beijing-Han-Guangzhou link. Network data that a province sends across the north-south regions first converges at core city nodes on the link, such as Beijing, Wuhan and Guangzhou, and is then carried over the link to its destination.

Figure 1 The second Beijing-Han-Guangzhou link

If the equipment that makes up a backbone link, such as switches and optical fibers, is congested or damaged, the network data flowing through the faulty link is affected, usually as packet loss or loss of connectivity.

The impact of backbone network anomalies on Baidu external network quality

As shown in Figure 2, Falcon, the external network monitoring system, monitors the connectivity of Baidu’s machine rooms in real time through detection points in each province.

Figure 2 Falcon external network monitoring diagram

In each judgment cycle, every province reports detection data for a number of Baidu machine rooms. Suppose a province reports m pieces of data for a specific machine room during the cycle, and n of them are abnormal. The anomaly rate n/m then measures the quality of the external network from that province to that machine room.
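As a small worked example of the anomaly rate, the sketch below computes n/m for each (province, machine room) pair in one cycle. The report layout and the counts are made up for illustration, and returning None when no data arrives is an assumption about how an empty cycle might be handled.

```python
def anomaly_rate(total, abnormal):
    """Anomaly rate n/m for one province -> machine room pair in one cycle."""
    if total == 0:
        return None  # nothing reported, so the rate is undefined
    if not 0 <= abnormal <= total:
        raise ValueError("abnormal count must be between 0 and total")
    return abnormal / total


# (province, machine room) -> (m reported, n abnormal); made-up counts.
reports = {
    ("Hubei", "Beijing"): (20, 9),
    ("Hebei", "Beijing"): (25, 1),
    ("Hainan", "Beijing"): (0, 0),
}
rates = {pair: anomaly_rate(m, n) for pair, (m, n) in reports.items()}
for pair, rate in rates.items():
    print(pair, rate)
```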

When a backbone link is abnormal, the quality of Baidu’s external network degrades. Specifically, the anomaly rate rises when users access a Baidu machine room in another region, while it is unaffected when users access a Baidu machine room in their own region.

Figures 3(a) and 3(b) show the anomaly rates of access from each province in China to Baidu’s northern and southern machine rooms, respectively, when the north-south backbone link is abnormal. The legend in the lower left corner of each figure maps province colors to anomaly rates.

Figure 3(a) Anomaly rate of access to the Beijing machine room by province during a north-south link anomaly

Figure 3(b) Anomaly rate of access to the Guangzhou machine room by province during a north-south link anomaly

As Figure 3(a) shows, the anomaly rate of access to Baidu’s Beijing machine room generally rises in provinces south of the Henan-Shandong line, while it stays low in provinces north of that line. Figure 3(b) shows the same pattern for the Guangzhou machine room: access from provinces across the north-south divide is abnormal, and access from provinces in the same region is normal.

The difference between backbone network anomalies and machine-room-side anomalies

Figure 4 shows the anomaly rate of each province when the anomaly is on the machine room side. Comparing Figure 3 and Figure 4, we can see that when a backbone link connecting two regions is abnormal, the abnormal provinces usually lie in the same region. In addition, those provinces are abnormal only when accessing a machine room in another region; their access to machine rooms in their own region remains normal. When the anomaly is on the machine room side, the abnormal provinces are scattered across the country with no obvious pattern. This difference is what separates the two types of anomaly.

Figure 4 Anomaly rate of access to the machine room by province during a machine-room-side anomaly

Because the two types of anomaly behave differently, their stop-loss plans also differ. For a machine-room-side anomaly, all traffic of the abnormal machine room can be scheduled directly to a normal machine room. For a backbone link anomaly, only cross-region access is affected, so it is the cross-region traffic that needs handling: it can be scheduled to a normal machine room in the same region as the user. For the automatic stop-loss system to apply the right plan to a backbone network anomaly, we need a fault diagnosis strategy specifically for backbone link anomalies.
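To make the difference between the two stop-loss plans concrete, here is a minimal sketch. Everything in it is hypothetical: the function name, the anomaly type labels, the region assignment of provinces and the rule of preferring a same-region room during a machine room fault are assumptions for illustration, not the behavior of the real stop-loss system.

```python
def stop_loss_plan(anomaly_type, abnormal_room, assignment,
                   room_region, province_region):
    """Return {province: new_room} for traffic that should be rescheduled."""
    plan = {}
    for province, room in assignment.items():
        home = province_region[province]
        if anomaly_type == "machine_room" and room == abnormal_room:
            # The room itself is faulty: move all of its traffic to another
            # room, preferring one in the province's own region (assumption).
            candidates = [r for r in room_region if r != abnormal_room]
            candidates.sort(key=lambda r: room_region[r] != home)
            if candidates:
                plan[province] = candidates[0]
        elif anomaly_type == "backbone_link" and room_region[room] != home:
            # Only cross-region access is affected: bring it back in-region.
            local = [r for r in room_region if room_region[r] == home]
            if local:
                plan[province] = local[0]
    return plan


# Illustrative topology and current traffic assignment.
room_region = {"Beijing": "north", "Guangzhou": "south"}
province_region = {"Hebei": "north", "Hubei": "south", "Guangdong": "south"}
assignment = {"Hebei": "Beijing", "Hubei": "Beijing", "Guangdong": "Guangzhou"}

backbone = stop_loss_plan("backbone_link", "Beijing", assignment,
                          room_region, province_region)
room_fault = stop_loss_plan("machine_room", "Beijing", assignment,
                            room_region, province_region)
print(backbone)    # only the cross-region province is moved
print(room_fault)  # every province served by the faulty room is moved
```

In the backbone case only Hubei, which crosses regions to reach Beijing, is rescheduled, while in the machine room case every province served by Beijing is moved.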

In addition, operators’ backbone network topologies mainly connect core cities in the north and south, and backbone anomalies mostly occur on north-south backbone links. The strategy design below therefore focuses on north-south backbone links (hereafter, north-south links).

Analysis of fault detection approaches

Based on the characteristics of north-south link anomalies and the nature of the problem, we considered two ideas.

The first idea is a fixed demarcation line. The most distinctive feature of a north-south link anomaly is that the anomaly rate is high for cross-region access to a machine room and low for same-region access, with a clear north-south split between provinces with high and low anomaly rates. Given this feature, an intuitive approach is to draw a north-south demarcation line from historical data and then determine the anomaly type by observing the states of provinces on either side of it.

However, after reviewing many historical north-south link anomalies, we found that the dividing line has no fixed position. It moves with the location of the faulty backbone link, and the set of abnormal provinces changes with it. Figures 5(a) and 5(b) show the anomaly rates of users accessing Baidu’s Beijing machine room when the faulty link is in Hebei and in Henan, respectively.

Figure 5(a) Link failure in Hebei

Figure 5(b) Link failure in Henan province

As Figure 5(a) shows, when the faulty link is in Hebei, the anomaly rate of access to the Beijing machine room is generally high in provinces south of Hebei; that is, the dividing line lies near the Hebei-Shandong line. In Figure 5(b), when the faulty link is in Henan, the dividing line moves south to below the Shaanxi-Henan line. Provinces south of the line have a higher anomaly rate, and because the line has moved south, there are fewer abnormal provinces than in Figure 5(a).

It is therefore difficult to directly implement the idea of fixing a north-south boundary and checking the states of provinces on either side of it to decide whether a north-south link anomaly has occurred.

The second idea is a classifier. As mentioned in the overview, we want to apply secondary processing to data already judged to be a machine-room-side anomaly, so as to separate true machine-room-side anomalies from north-south link anomalies. This is a binary classification problem, so a classifier model is a natural way to solve it.

If detection data from all 31 provinces to each machine room were available in every judgment cycle, we could accumulate historical data and train a classifier that takes 62-dimensional feature data (the abnormal states of 31 provinces for each of the northern and southern machine rooms) to distinguish north-south link anomalies from machine-room-side anomalies.

However, complete detection data from all 31 provinces to the machine rooms cannot be guaranteed in every judgment cycle, because detection data can be returned late, detection links can fail, and a single province has few detection points. In other words, the feature data is likely to contain missing values. In addition, north-south link failures are infrequent, so there are too few samples to train the classifier and the trained model tends to overfit.
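The missing-value problem is easy to see if the 62-dimensional vector is built explicitly. The sketch below is illustrative only: the province and machine room names are placeholders, and encoding a missing pair as NaN is an assumption about how such gaps might be represented.

```python
import math

# Placeholder identifiers standing in for the 31 provinces and 2 machine rooms.
PROVINCES = [f"province_{i:02d}" for i in range(31)]
ROOMS = ["north_room", "south_room"]


def build_features(states):
    """Build the 62-dimensional vector; unreported pairs become NaN.

    states: dict of (province, room) -> 0 (normal) or 1 (abnormal).
    """
    return [float(states.get((p, r), math.nan)) for r in ROOMS for p in PROVINCES]


def missing_count(vector):
    """Number of dimensions with no detection data in this cycle."""
    return sum(1 for value in vector if math.isnan(value))


# A cycle in which only three province/room pairs reported data.
cycle = {
    ("province_00", "north_room"): 0,
    ("province_17", "north_room"): 1,
    ("province_30", "south_room"): 0,
}
vector = build_features(cycle)
print(len(vector), missing_count(vector))
```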

The analysis shows that neither idea can be applied directly. We therefore combined the useful parts of both to design the backbone network fault detection strategy.

Backbone network anomaly determination policy

Combining the two ideas, we use a classifier model in the fault detection strategy and hand-craft the features, which reduces the feature dimension and lowers the risk of overfitting.

The specific steps of the judgment strategy are as follows:

First, we judge the abnormal state from each province to the machine room by comparing the province’s anomaly rate with a manually set threshold, and take this state as the ground truth of the province’s abnormal state.

After determining the abnormal state of each province for a given machine room, we sort all provinces by latitude and try each province as a candidate partition position, looking for the position with the smallest “partition error” to use as the partition line.

Each candidate partition position divides the provinces into those south of the position and those north of it. Given the characteristics of a north-south link anomaly, if the abnormal machine room is the southern one, the provinces south of the partition should form the set of normal provinces and the provinces north of it the set of abnormal provinces. If the abnormal machine room is the northern one, the situation is reversed.

For each province, if the state assigned by the partition is inconsistent with the ground truth of its abnormal state, the province is considered wrongly divided. The partition error is the number of wrongly divided provinces divided by the total number of provinces.

Figure 6 shows an example with eight provinces, where the upper part is the set of normal provinces and the lower part is the set of abnormal provinces. Comparing against the ground truth of the abnormal states, two provinces are wrongly divided, so the partition error is 2/8 = 0.25.

Figure 6 Calculation example of partition error

After traversing all partition positions, the minimum partition error and the corresponding partition position can be obtained.
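The partition search described in these steps can be sketched as follows. This is a minimal illustration rather than the production implementation: the function name and data layout are assumptions, the eight provinces and their latitudes are made up (they are not the data behind Figure 6), and candidate positions are taken as the boundaries between consecutive provinces, including the two ends.

```python
def best_partition(provinces, abnormal_room_side):
    """Find the partition with the smallest partition error.

    provinces: list of (name, latitude, is_abnormal) tuples.
    abnormal_room_side: "north" or "south", the side of the abnormal room.
    Returns (min_error, k), where the k northernmost provinces lie north of
    the partition line.
    """
    if abnormal_room_side not in ("north", "south"):
        raise ValueError("abnormal_room_side must be 'north' or 'south'")
    if not provinces:
        raise ValueError("at least one province is required")
    ordered = sorted(provinces, key=lambda p: p[1], reverse=True)  # north first
    total = len(ordered)
    best = None
    for k in range(total + 1):
        wrong = 0
        for index, (_, _, is_abnormal) in enumerate(ordered):
            is_north = index < k
            # Northern room abnormal -> southern provinces expected abnormal;
            # southern room abnormal -> northern provinces expected abnormal.
            if abnormal_room_side == "north":
                expected_abnormal = not is_north
            else:
                expected_abnormal = is_north
            if expected_abnormal != is_abnormal:
                wrong += 1
        error = wrong / total
        if best is None or error < best[0]:
            best = (error, k)
    return best


# Eight made-up provinces, listed north to south.
sample = [
    ("P1", 40.0, False), ("P2", 38.0, True), ("P3", 36.0, False),
    ("P4", 34.0, False), ("P5", 32.0, True), ("P6", 30.0, True),
    ("P7", 28.0, False), ("P8", 26.0, True),
]
print(best_partition(sample, "north"))  # (0.25, 4)
```

With the northern machine room abnormal, the best line falls after the fourth province: P2 and P7 are on the wrong side, giving an error of 2/8 = 0.25.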

Historical data shows that a north-south link anomaly has a relatively small partition error, with the partition line moving up and down around the middle of the map. For a machine-room-side anomaly, the partition position and error follow no clear pattern. Figure 7 shows a scatter plot of the two types of anomaly data.

Figure 7(a) Results of linear partition

Figure 7(b) Results of nonlinear partition

As Figures 7(a) and 7(b) show, with only these two features (partition error and partition position), neither a linear nor a nonlinear boundary separates the two types of anomaly data well.

To improve classification, we introduce additional auxiliary features:

  • Location of the machine room and median latitude of the abnormal provinces

The relative position of the two has a distinctive pattern when a north-south link is abnormal, so adding these two dimensions makes north-south link anomalies easier to identify. For example, when a north-south link anomaly occurs, the provinces that are abnormal when accessing the southern machine room usually lie at a much higher latitude than Guangdong, where that machine room is located. The median is used to remove the effect of extreme points and noise.

  • Mean anomaly rate of the provinces on each side of the partition line

When an anomaly occurs on the machine room side, abnormal provinces are generally spread evenly across the country, so the mean anomaly rates of the provinces on the two sides of the partition line usually do not differ much. These two features therefore help the classifier identify machine-room-side anomalies.
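Putting the features together, one possible layout of the resulting feature vector is sketched below. The exact encoding is an assumption: the article does not say how the partition position or the machine room location is represented, so here the position is the fraction of provinces north of the line and the machine room location is its latitude. The sample values are made up.

```python
from statistics import fmean, median


def extract_features(provinces, split, partition_error, room_latitude):
    """Build a six-dimensional feature vector for one anomaly event.

    provinces: list of (latitude, anomaly_rate, is_abnormal), north to south.
    split: number of provinces north of the partition line.
    """
    abnormal_lats = [lat for lat, _, bad in provinces if bad]
    if not abnormal_lats:
        raise ValueError("expected at least one abnormal province")
    north = [rate for _, rate, _ in provinces[:split]]
    south = [rate for _, rate, _ in provinces[split:]]
    return [
        partition_error,
        split / len(provinces),          # relative partition position
        room_latitude,                   # location of the machine room
        median(abnormal_lats),           # median latitude of abnormal provinces
        fmean(north) if north else 0.0,  # mean anomaly rate north of the line
        fmean(south) if south else 0.0,  # mean anomaly rate south of the line
    ]


# Made-up event: northern machine room (about 39.9 degrees north) is abnormal.
sample = [
    (40.0, 0.05, False), (38.0, 0.60, True), (36.0, 0.00, False),
    (34.0, 0.10, False), (32.0, 0.70, True), (30.0, 0.80, True),
    (28.0, 0.10, False), (26.0, 0.90, True),
]
features = extract_features(sample, split=4, partition_error=0.25,
                            room_latitude=39.9)
print(features)
```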

To distinguish the two types of anomaly, we train a binary classifier. The positive examples in the training data are the features extracted from north-south link anomalies by the steps above, and the negative examples are the features extracted from machine-room-side anomalies. For the classifier, we chose the support vector machine (SVM), a commonly used model, and selected the RBF kernel based on backtesting results.
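As a rough sketch of this training step, the example below fits an RBF-kernel SVM with scikit-learn. The library choice, the feature scaling step, the hyperparameters and the six-column feature layout (partition error, relative partition position, machine room latitude, median latitude of abnormal provinces, mean anomaly rate north and south of the line) are all assumptions, and the rows are synthetic toy values, not real monitoring data.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic toy rows, for illustration only.
# Label 1 = north-south link anomaly, 0 = machine-room-side anomaly.
X = [
    [0.05, 0.45, 39.9, 29.0, 0.03, 0.70],
    [0.10, 0.55, 39.9, 27.5, 0.05, 0.65],
    [0.08, 0.50, 23.1, 38.0, 0.72, 0.04],
    [0.40, 0.10, 39.9, 33.0, 0.35, 0.30],
    [0.45, 0.90, 23.1, 31.0, 0.28, 0.33],
    [0.38, 0.20, 39.9, 34.5, 0.40, 0.36],
]
y = [1, 1, 1, 0, 0, 0]

# Scale the features, then fit an SVM with the RBF kernel.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X, y)

# Classify a new, equally synthetic event.
new_event = [[0.07, 0.50, 39.9, 28.0, 0.04, 0.68]]
prediction = model.predict(new_event)
print(prediction)
```

Scaling before the SVM is a common precaution because the RBF kernel depends on distances, and the latitude columns would otherwise dominate the rate columns.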

With the steps above, we implemented the detection strategy for backbone link anomalies. Since it went online, it has produced good detection results.

Starting from a practical problem encountered in external network anomaly monitoring, this article has introduced the design ideas and the detection strategy for backbone link anomalies. The strategy resolves the confusion between backbone network anomalies and machine-room-side anomalies, and enables Noah, the Baidu Cloud intelligent operation and maintenance product, to monitor abnormal backbone links.