Effective monitoring and data collection have always been indispensable topics for back-end services. As a classic distributed system, live streaming is no exception. How do we effectively monitor and maintain a massive service cluster, and how do we capture the fragmented data scattered across that cluster to optimize the service?
Netease Yunxin's audio and video R&D engineers will discuss these questions with you.


Machines: standing on the shoulders of giants, reusing existing wheels

As a distributed cluster, the smallest unit at the physical layer is naturally the machine. For a machine, the conventional performance indicators are CPU, memory, and network card usage. There are many ways to obtain these metrics; the video cloud uses Netease's Sentinel system, an in-house monitoring system that provides very detailed, real-time performance indicators.



With this powerful wheel, the Sentinel system, we can easily do effective monitoring at the machine level. For example, when network card traffic or CPU usage becomes abnormal, we can quickly raise an alarm and handle it. Of course, it can not only monitor whether the machine is running normally, but also whether it is under malicious attack, which goes without saying.
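As a rough illustration of this kind of machine-level alerting, here is a minimal sketch of threshold checks on CPU, memory, and network card usage. The metric fields, thresholds, and the send_alarm helper are hypothetical placeholders, not the actual Sentinel API.

```python
# A minimal sketch of threshold-based alerting on machine metrics.
# MachineMetrics, THRESHOLDS, and send_alarm are illustrative assumptions,
# not the real Sentinel system interface.
from dataclasses import dataclass

@dataclass
class MachineMetrics:
    cpu_percent: float   # CPU usage, 0-100
    mem_percent: float   # memory usage, 0-100
    nic_mbps: float      # network card throughput in Mbit/s

THRESHOLDS = {"cpu_percent": 90.0, "mem_percent": 85.0, "nic_mbps": 900.0}

def send_alarm(host: str, metric: str, value: float, limit: float) -> None:
    # Placeholder: in practice this would page on-call or hit an alert gateway.
    print(f"[ALARM] {host}: {metric}={value:.1f} exceeds limit {limit:.1f}")

def check_machine(host: str, metrics: MachineMetrics) -> None:
    """Raise an alarm for any metric that exceeds its threshold."""
    for name, limit in THRESHOLDS.items():
        value = getattr(metrics, name)
        if value > limit:
            send_alarm(host, name, value, limit)
```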

Integration of performance indicators and services

Of course, machine-level data alone is not enough. As the saying goes, data that does not fit the business is not good data. For a live CDN service, the most common metrics are audio/video bit rate and latency.



A close observer might spot a few peculiar statistics.

Why count the total bit rate and also the audio and video bit rates separately?

This is because, in real-world scenarios, the total bit rate alone cannot always reconstruct the situation we need to understand. Many cases require analyzing the audio and video bit rates separately, for example when a user actively turns off video output, or when insufficient machine performance during capture causes the video to stall. Combined with frame rate statistics, these scenarios can be reconstructed quickly. Of course, the video bit rate itself is not a fixed value; the video cloud also provides QoS (adaptive bit rate) for weak networks.
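As a toy illustration of why the separated metrics matter, the sketch below uses video bit rate, audio bit rate, and frame rate to roughly distinguish the two scenarios mentioned above. The field names and thresholds are illustrative assumptions only.

```python
# A rough sketch: distinguish "user turned off video" from "capture/encoding
# bottleneck" using separate audio/video bit rates plus frame rate.
# Thresholds are made up for illustration.
def classify_stream(video_kbps: float, audio_kbps: float, fps: float) -> str:
    if video_kbps < 1 and audio_kbps > 0:
        return "video disabled by user (audio-only stream)"
    if fps < 10 and video_kbps > 0:
        return "possible capture/encoding bottleneck (low frame rate)"
    return "normal"
```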

What is push_delay?

Push delay is a parameter that measures the network condition between the client and the server. When this parameter fluctuates, it means packets from the client are taking longer than expected to reach the server, i.e. there is network jitter. How do we calculate this number? In simple terms, we use the timestamp in the RTMP header. Expressed as a formula, it looks roughly like this:

Delay = abs((arrival_time[n] − arrival_time[n−1]) − (rtmp_timestamp[n] − rtmp_timestamp[n−1]))

This calculates how the time taken for consecutive packets to reach the server varies, which we use to represent network jitter. Of course, there are many other details, such as weighting and jitter-smoothing algorithms, to reduce error and avoid false alarms.
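Here is a minimal sketch of the push_delay idea described above: for each packet, compare the interval between RTMP header timestamps with the interval between server-side arrival times, then smooth the result. The class name, fields, and smoothing factor are assumptions for illustration.

```python
# A minimal sketch of the push_delay / jitter estimate described above.
# Names and the smoothing factor (alpha) are illustrative assumptions.
import time

class PushDelayEstimator:
    def __init__(self, alpha: float = 0.125):
        self.alpha = alpha          # weight for exponential smoothing
        self.prev_rtmp_ts = None    # previous RTMP header timestamp (ms)
        self.prev_arrival = None    # previous server-side arrival time (ms)
        self.jitter_ms = 0.0

    def on_packet(self, rtmp_ts_ms: int) -> float:
        arrival_ms = time.monotonic() * 1000.0
        if self.prev_rtmp_ts is not None:
            sent_interval = rtmp_ts_ms - self.prev_rtmp_ts
            recv_interval = arrival_ms - self.prev_arrival
            delay = abs(recv_interval - sent_interval)
            # Exponential smoothing reduces the impact of single outliers.
            self.jitter_ms = (1 - self.alpha) * self.jitter_ms + self.alpha * delay
        self.prev_rtmp_ts = rtmp_ts_ms
        self.prev_arrival = arrival_ms
        return self.jitter_ms
```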

Why send_kbps?

In fact, this is easy to understand: a CDN is itself a distributed system, so it must select paths between nodes and then transmit from node to node to achieve acceleration. Send_kbps is simply the transmit bit rate from one node to the next. This raises a question: how do we trace a single stream? We give each stream a unique identifier, and as the stream passes from node to node, each node increments a counter on it called Hops.

Through this marking, we can accurately trace the path a stream takes through the nodes and aggregate the data from every node along the way.
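A simple sketch of this stream-tracing idea, assuming hypothetical report fields: each stream carries a unique identifier, and every relay node increments a Hops counter before reporting, so reports that share a stream_id can later be joined and ordered by hop count.

```python
# A sketch of stream tracing between CDN nodes. The report structure and
# function names are assumptions for illustration.
import uuid

def new_stream_context() -> dict:
    return {"stream_id": uuid.uuid4().hex, "hops": 0}

def on_relay(ctx: dict, node_name: str, send_kbps: float) -> dict:
    """Called on each node a stream passes through."""
    ctx["hops"] += 1
    report = {
        "stream_id": ctx["stream_id"],
        "hops": ctx["hops"],
        "node": node_name,
        "send_kbps": send_kbps,  # transmit bit rate toward the next node
    }
    # Reports sharing a stream_id can be joined and ordered by hops
    # to reconstruct the full path of the stream.
    return report
```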

In addition, we also capture some client information, such as source IP and user device. This information helps us move into the era of big data.

The architecture of the overall data service

In a distributed system, every node generates a large amount of statistical and performance data, so the video cloud has a complete statistics architecture to support it: from data collection at the very front end, through transmission and aggregation, to the computing clusters, and finally to the output. Each service has a role to play. Let's look at the overall architecture.



For each zone, there is a data aggregation server that collects data from the streaming servers. The raw metadata is aggregated, filtered, and compressed by the data aggregation server and then reported in a unified way to the statistics servers in the central cluster. The statistics servers store all the statistics, one by one, in the data warehouse. The various data computing clusters periodically read from the data warehouse and run their calculations; the calculation interval varies with the type of service. For example, the operation and maintenance platform mainly reads machine-level data for analysis and alarms. The big data computing cluster computes the data and produces optimization directions, which we'll talk about later. The business data display platform provides real-time data (such as bit rate and latency) for users and technical support to query. Of course, there are various other data processing services that I won't cover here.
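To make the zone-level step more concrete, here is a rough sketch of an aggregation server's reporting path: filter the raw records, compress the batch, and report it to the central statistics servers. The endpoint URL, field names, and filtering rule are made up for illustration.

```python
# A rough sketch of the zone-level aggregation step: filter, compress, and
# report raw records to the central statistics cluster.
# CENTRAL_STATS_URL and the record fields are hypothetical.
import gzip
import json
import urllib.request

CENTRAL_STATS_URL = "http://stats.example.internal/report"  # hypothetical endpoint

def aggregate_and_report(raw_records: list[dict]) -> None:
    # Drop records that carry no useful payload before reporting.
    useful = [r for r in raw_records if r.get("kbps", 0) > 0]
    body = gzip.compress(json.dumps(useful).encode("utf-8"))
    req = urllib.request.Request(
        CENTRAL_STATS_URL,
        data=body,
        headers={"Content-Encoding": "gzip", "Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)
```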

Some of the things data can do

Finally, let's talk about the data itself. In this era of big data, having data but doing nothing with it is equivalent to wasting it. So, with all this data, what did we do? The most obvious use is to adjust the scheduling strategy and add service points. For example, according to the big data calculation results in the figure above, the network weight of Nanjing Telecom is relatively poor, which indicates that the Nanjing Telecom region needs to be investigated; the large number of Nanjing Mobile users shows that more service points should be added in the Nanjing area.



In addition, data and performance indicators are also used for load balancing. For example, when a node is under heavy load or its performance is unstable, the node's scheduling priority is lowered (that is, it is less likely to be assigned to users first).
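A simplified sketch of how load and stability might lower a node's scheduling priority is shown below; the weighting scheme is an illustrative assumption, not the actual production algorithm.

```python
# A simplified sketch of load-aware scheduling priority.
# The weights and fields are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class NodeStatus:
    name: str
    cpu_percent: float      # current CPU usage, 0-100
    bandwidth_ratio: float  # used / total bandwidth, 0-1
    error_rate: float       # recent failure ratio, 0-1

def scheduling_score(node: NodeStatus) -> float:
    """Higher score means the node is preferred when assigning users."""
    load_penalty = 0.5 * (node.cpu_percent / 100.0) + 0.3 * node.bandwidth_ratio
    stability_penalty = 0.2 * node.error_rate
    return 1.0 - load_penalty - stability_penalty

def pick_node(nodes: list[NodeStatus]) -> NodeStatus:
    return max(nodes, key=scheduling_score)
```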

The above are my views on monitoring a live CDN service, but monitoring and data collection for a live CDN service is a topic worth endless discussion and optimization. Feel free to leave comments and discuss with me.

For more product and technical insights, follow the Netease Yunxin blog.