**Article / Zhang Liang**

Organized by LiveVideoStack

Hello, everyone. I am Zhang Liang, R&D director at StarTimes. What I am sharing this time is based on the problems we ran into, and the optimization experience we gained, while running an online video service in Africa. As we all know, the Internet environment in Africa is very complex; there is hardly a worse Internet environment anywhere. So what I describe here is a fairly extreme situation, for your reference only.

The content is divided into three parts. The first is a brief introduction to the StarTimes On app, which leads into the indicators our product should care about; the core purpose of optimization must be clear, because optimizing every indicator at once is simply not feasible. The second part gives you some real data so you have a sense of the actual network situation in Africa. The third part focuses on the specific optimization methods we adopted in such an extreme network environment, and the results we finally achieved.

1. Introduction to StarTimes On APP

Maybe people did not know much about StarTimes before, because our main business has been television operations in Africa, where StarTimes is a household name. We have been working in Africa for 14 years, operate in 45 African countries, and now have more than 10 million pay-TV subscribers, so our overall revenue scale and influence are at a certain level. We are also the implementing party of the "10,000 Village Access" project, an important project under China's "Belt and Road" initiative.

1.1 StarTimes On APP

StarTimes On was launched in late 2017, just in time for the 2018 World Cup. Initially we were not mentally prepared for how different the network environment in Africa would be; we had only obtained some data from APM vendors, and the real-world numbers turned out to be much worse than that data suggested. As a result, some problems did appear during the World Cup broadcast, but we managed to resolve them in time.

Now, StarTimes On is fairly well known and consistently ranks near the top of the Google Play entertainment category.

1.2 Business model and operating indicators

We can derive our core indicators from the app's business model. First of all, our content is licensed, copyrighted content. Users are divided into two categories: free users and paying users.

Free users need to watch ads in order to watch videos, and those ads bring in revenue. Free users can only preview VIP content for three minutes. So our operating metric for them is how much video free users watch, because more watching means more ads, and more watching also means more potential paying users.

Our revenue from paying users is subscription fees; paying users do not see ads, and all content rights are unlocked for them. So for paying users we focus on how many videos they watch and for how long. The more they watch, the more satisfied they are, and the more satisfied they are, the more likely they are to keep paying. These are our operational indicators.

With the operational indicators defined, we can break them down further and analyze, from a technical point of view, what needs to be optimized. The operational metrics break down into the number of videos watched and the length of time watched.

The number of videos watched is easy to understand: if a video fails to start, the number of views naturally decreases, and if every video that is opened successfully shows its first frame, that is a good result. Therefore, on the QoE side we watch the user's active exit rate. Users mostly quit voluntarily because of long waiting times; for example, if the wait exceeds 8 seconds, most users may choose to quit. On the QoS side we watch the first screen time (time to first frame): the shorter the first screen time, the lower the active exit rate.

Viewing time is measured in two ways. For TV series and movies, we look at the proportion of the total video duration that users actually watch. For a live channel, we look at how long they watch. The core QoS factor that affects viewing time is lag. Users generally have a psychological expectation; for example, when watching a movie they can tolerate the video stalling perhaps three times, but if it happens too often they may give up watching.

From this analysis we can derive the two core QoS indicators for this business model: first screen time and lag ratio. When I talked with colleagues working on interactive video and RTC, many of them cared most about latency. Under our business model, however, latency does not need special attention; as we will see later, we even have optimization strategies that sacrifice latency for other benefits.
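As a rough illustration, the two indicators can be computed from player-side events along the following lines; the event fields and this particular definition of lag ratio are only illustrative, not production definitions:

```python
from dataclasses import dataclass

@dataclass
class PlaybackSession:
    click_ts: float        # when the user tapped the video (seconds)
    first_frame_ts: float  # when the first frame was rendered (seconds)
    stall_seconds: float   # total time spent rebuffering during playback
    watch_seconds: float   # total playback time

def first_screen_time(s: PlaybackSession) -> float:
    """Time from tap to first rendered frame."""
    return s.first_frame_ts - s.click_ts

def lag_ratio(sessions: list[PlaybackSession]) -> float:
    """Share of total viewing time spent stalled (one common definition)."""
    stalled = sum(s.stall_seconds for s in sessions)
    watched = sum(s.watch_seconds for s in sessions)
    return stalled / watched if watched else 0.0
```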

2. Status and challenges of networks in Africa

Let me tell you a little bit about the reality of the network in Africa.

2.1 Basic information of networks in Africa

First, look at South Africa in the data chart. South Africa is a moderately developed country and its Internet is acceptable: CDN latency from South Africa to Europe is about 100+ ms at idle times and around 200 ms at busy times, which is actually not bad. But look to the west: Nigeria, Ghana, Côte d'Ivoire and some other countries are much worse. In Nigeria, for example, RTT can exceed 600 milliseconds at busy hours. That means every network operation, even downloading a one-byte file, has to wait 600 milliseconds, because that round trip is a hard cost paid every single time. If our backend were in Europe, every operation would take 600 milliseconds, which is very damaging to the user experience.

To the east, places like Kenya, Uganda and Tanzania do not have very good Internet either. In China, a user in the far north reaching a machine room in the far south takes tens of milliseconds, which is already considered relatively slow; in Africa it can be hundreds of milliseconds. Their networks are ten or even tens of times worse than China's, which means we face a huge challenge.

Next is the packet loss rate, and packet loss has become worse. We recently collected data and, compared with before the pandemic, the packet loss rate has doubled. Because of the pandemic, people use more mobile data, and bandwidth resources in Africa are very limited, so packet loss is more serious. As shown in the figure, the packet loss rate reaches roughly 24–25% at peak times. With a packet loss rate like that, download speed certainly cannot go up.

Look at some other indicators: the connection establishment success rate is about 80%, and that figure is relatively stable, meaning 1 in 5 attempts fails. We mitigate the impact of connection failures with long-lived connections, but long-lived connections also get dropped, so reconnection is still frequently needed. DNS resolution is also very slow, around a second, which is quite poor.

We use the HLS protocol for video packaging, so the CDN holds a large number of M3U8 index files and video segment files. An index file is only a few hundred bytes, yet downloading it can take 1000–2000 milliseconds. Segment download speed is about 200–400 kbps. That is the current state of the Internet in Africa.
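For a sense of how such numbers can be gathered from the client side, here is a minimal timing probe, assuming a hypothetical playlist URL; it is only a rough sketch (a real measurement pipeline would separate redirects, TLS, and retries):

```python
import socket
import time
import urllib.request

def probe(url: str, host: str) -> dict:
    """Rough client-side timing of DNS, TCP connect and a full HTTP download."""
    t0 = time.time()
    ip = socket.gethostbyname(host)                         # DNS resolution
    t1 = time.time()
    with socket.create_connection((ip, 80), timeout=10):    # TCP handshake only
        pass
    t2 = time.time()
    body = urllib.request.urlopen(url, timeout=30).read()   # full HTTP download
    t3 = time.time()
    return {
        "dns_ms": (t1 - t0) * 1000,
        "connect_ms": (t2 - t1) * 1000,
        "download_ms": (t3 - t2) * 1000,
        "kbps": len(body) * 8 / 1000 / max(t3 - t2, 1e-6),
    }

# Placeholder URL; in practice this would target a real playlist or segment.
# print(probe("http://cdn.example.com/live/index.m3u8", "cdn.example.com"))
```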

2.2 Causes of the problem

Next, let's briefly sort out where the network problems in Africa actually occur. Only after locating the problems can we explore optimization ideas properly.

African networks have high packet loss rates and high latency. There are many causes of packet loss; here I single out two major ones. The first is packet loss in the wireless access network. Access network resources in Africa are very inadequate: almost everyone is on 3G, there are some 4G base stations too, but the number of base stations is far too small, so when everyone is online at the same time, base station capacity is clearly insufficient. At peak times the signal is weak and cell handover runs into many problems, which leads to packet loss. The second cause is congestion, which is severe in every African country and is most visible at the operators' egress links. If the network is always congested but people keep sending and requesting large numbers of packets, packets are very likely to be dropped.

Delay can be divided into several categories, such as transmission delay, processing delay and queuing delay; queuing delay is ultimately tied to congestion. If the network is badly congested, the switches and routers along the path have to queue packets, which takes extra time. After practical analysis we found that queuing delay is the biggest problem. There is also retransmission delay, the delay caused by resending packets after loss, which from the application layer's perspective is simply more delay.

Having learned about latency and packet loss in Africa, we wanted to identify where these problems occur and then look for solutions. The picture above shows the path from the mobile phone to the server in the IDC receiving and responding to a request. The phone first sends a wireless signal to the base station; after internal processing in the carrier's network, the request leaves through the operator's Internet egress; then comes the wider Internet, which in reality consists of many layers of network providers; and finally the request reaches the IDC. Problems can occur at every link in this chain.

At first we wanted to use MTR or ping tools to diagnose the problems, but in practice we found that collecting such data on the mobile side is basically impossible, and operators may be sensitive to this kind of probing. Running MTR in China shows plenty of data, but in Africa almost every hop shows "***", indicating that probing is not allowed.

2.3 Determine the problem

Finally, we designed several experiments, in three groups, to determine the source of the network problems.

First we needed to confirm that there really is serious congestion. Comparing metrics such as lag and startup time between off-peak and busy hours, we found significant differences: compared with busy hours, first screen time at idle hours is about 30% lower and lag about 40% lower. That is a significant gap and indicates congestion, but it does not tell us exactly where.

Next we tested whether differences in the user's access network explain the gap. We know 4G should be much better than 3G, and since 4G users are fewer in Africa we expected their congestion to be relatively mild. The experiment confirmed that 4G users do perform much better than 3G users, but the gap is not as significant as the idle-versus-busy difference, so the access network is not the main problem.

Next we verified whether the problem lies in the carrier's egress network. We set up some servers inside carrier networks for testing. The results show that, compared with the European CDN, placing servers inside the operator's network does bring gains, but the gains are not dramatic. This indicates that the operators' egress links do have problems, but they are not the main problem either.

Africa does have some IXPs, which are rarer in China. An IXP is simply a facility into which each operator pulls its lines; from there you can easily connect to each operator, and operators can exchange traffic with each other. In practice, however, the links between the IXPs and the operators in Africa are also congested, so placing a CDN at an IXP works out worse than placing it inside the operator's network.

From these tests we can make a qualitative judgment: congestion from the mobile phone to the base station is the most serious, there are certain problems right after the operator's Internet egress, and the rest of the path has no major issues. In this situation optimization is genuinely difficult, but at least we had identified the problem; the next step was to think about specific solutions.

2.4 Summary of network situation in Africa

In general, networks in Africa are severely short of bandwidth and congested at the link and network layers.

From the transport layer's perspective, the problem is not the transport layer itself but the link and network layers beneath it, and it manifests as a high packet loss rate and high RTT. At the application layer, domain name resolution is very slow, download speeds are very low, and downloads often fail. That is the basic state of the network in Africa.

3. Optimization of video experience with high delay and high packet loss

Now that we have a basic view of the network, we need to decide how to optimize it.

3.1 Determine optimization objectives

Going back to specific indicators: since we serve long-form licensed video, we care most about the first screen and lag. Latency is not particularly critical for us, because live TV channels involve no interaction and viewers are not very sensitive to delay. So our focus is on first screen time and lag.

We also have to balance user experience against cost. Our core users are paying users, so they have certain expectations of video quality, but since they are in Africa those expectations are certainly not as high as those of Chinese or American users. The key is how to define "certain expectations".

Ultimately, our goals were: 1) reduce the lag ratio, and 2) reduce first screen time. With a high lag ratio, users quit voluntarily, which we do not want to see. First screen time ranks second because users have some tolerance for it; it is relatively acceptable for a long video to start slowly, whereas a short video that starts slowly would be hard for users to accept. Based on these business characteristics, we aimed to keep first screen time under 5 seconds. As for latency, it is relatively expendable in this business model.

For picture quality, we did market research on some key content, such as football matches. The conclusion was that the user's minimum requirement for a football video is being able to see the ball. That is not always easy to do: at African download speeds a video must play at a low bitrate to play smoothly, and when the bitrate drops the ball becomes blurry; when it flies through the air it is only a pixel or two on screen and very easily gets lost by the encoder. So all too often the ball takes off, disappears in mid-air, and then reappears when it lands. We did a lot of optimization for this problem so that it meets the minimum requirement users can accept. Similarly, news programs need to show the faces of the people in the report clearly, something you never worry about in China but which requires various optimizations in Africa. These were our final optimization goals.

3.2 Optimization Ideas

The specific optimization work starts at the CDN level. We just mentioned that the Internet in Africa is slow, poor and congested; what are the causes? From the IDC's perspective we can spot some of the problems. There are many ISPs in Africa, similar to Southeast Asia and India: they are small and poorly interconnected. For example, if two users on different operators in the same African country access each other, the traffic may have to detour through Europe, or through South Africa, because the operators are not interconnected locally.

So we thought we might find an IDC in Africa, or move to the cloud, to solve this problem, but it did not work out. An IDC typically has peering or transit with only some of the carriers and cannot connect well to all of them. If you put a server in an IDC, users of one operator will be happy, but users of the other operators will suffer: their traffic has to detour to Europe and back to Africa, which is even worse than simply using cloud services in Europe.

We did not know this at first and used European CDNs and cloud services to support the business. Later we tried moving into Africa and found the results were worse. In the end, our strategy was to build our own CDN inside the larger ISPs' networks and keep the European CDN as backup.

Another idea is to find a third-party CDN directly connected to the ISPs, but in practice such CDNs are hard to find. So third-party CDNs can only serve as backup and supplement; this arrangement is designed around the characteristics of African networks.

Our CDN deployment is now fairly large. In the diagram, each StarTimes logo marks a CDN server we have placed inside an African operator's network. These CDNs are lightweight: we can deploy with a single server per equipment room, and the server itself is highly available. It looks like an ordinary server, but every internal module is duplicated: power supply, fans, backplane, switching, compute and storage all come in pairs, ensuring high reliability. We use a large number of such servers as edge cache nodes so that users can access and play videos directly inside the carrier network.

3.3 Monitoring and scheduling system

With the self-built CDN server mentioned above, you may encounter scheduling problems.

First, an in-network CDN can only be accessed by that operator's own subscribers, because the CDN has no public IP addresses; its addresses are intranet addresses such as 10.x. If scheduling goes wrong and users are sent to a self-built CDN inside another operator's network, they cannot even establish a TCP connection. So scheduling must be handled with extra caution.

Second, the egress inside an operator's network is unstable, because the operations and maintenance capability of African operators is limited. For example, one of our CDN servers connects to the switch in the operator's equipment room and all traffic goes out through that switch; sometimes the switch starts dropping packets, say 90% of them, without any warning. The operators themselves do not monitor the problem; every time we discovered it we had to contact them and have the switch restarted, which of course hurt the user experience in the meantime.

In addition, events such as football matches and concerts are, for a video service, essentially the same as a flash sale: a huge number of users arrive at once, and the operator's network egress can be saturated instantly.

After these problems are found, targeted treatment is needed for CDN scheduling, mainly including the following three strategies:

**1. Scheduling based on user experience:** When the switch in an equipment room misbehaves, it reports no error and raises no alarm. We therefore instrumented the player heavily: it reports lag, startup success rate, download speed and other metrics in real time, the backend analyzes this information in real time, and the results feed into the scheduling policy. Suppose the egress inside an operator's network becomes unstable; the CDN itself is fine, but the user experience is very poor, so the user-experience metrics raise an alarm and the scheduling system moves users to the backup CDN.

**2. Scheduling based on CDN status:** This one is relatively basic. If a CDN server is faulty, the network in its equipment room is unavailable, or the CDN's bandwidth is full, no more traffic can be scheduled to it.

**3. Cost-based scheduling:** We preferentially send users to the in-network CDN and fall back to the third-party CDN when the in-network CDN is unavailable. A minimal sketch of how these three rules might combine is shown below.
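The node fields, thresholds and scoring in this sketch are illustrative assumptions, not the actual scheduling system:

```python
from dataclasses import dataclass

@dataclass
class CdnNode:
    name: str
    in_network: bool           # self-built node inside the user's carrier network
    healthy: bool              # server / equipment-room status is OK
    bandwidth_headroom: float  # 0..1, remaining capacity
    experience_score: float    # 0..1, derived from player reports (lag, startup, speed)

def pick_cdn(carrier_nodes: list[CdnNode], third_party: list[CdnNode]) -> CdnNode:
    """Rules 1 and 2 filter out bad nodes; rule 3 prefers the cheaper in-network CDN."""
    def usable(n: CdnNode) -> bool:
        return n.healthy and n.bandwidth_headroom > 0.1 and n.experience_score > 0.6

    candidates = [n for n in carrier_nodes if n.in_network and usable(n)]
    if not candidates:
        candidates = [n for n in third_party if usable(n)]
    if not candidates:  # last resort: anything healthy (assumes one exists)
        candidates = [n for n in carrier_nodes + third_party if n.healthy]
    return max(candidates, key=lambda n: n.experience_score)
```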

3.4 Audio and video technology

There is more to cover on the audio and video technology side. The physical network itself is not great; laying out the CDN brought some improvement, but only a modest one rather than a qualitative leap, so more optimization has to come from the technology side.

The specific aspects can be summarized as follows:

Asynchronous service interfaces: when a video plays, users assume that tapping the video link starts playback immediately, but in fact the backend has to do a lot of work first, such as authentication, ad policies and other logic. If these steps execute serially, they have a large impact on first screen time.

Network layer optimization means improving download speed and reducing the impact of connection setup on first screen time by optimizing the transport protocol and the congestion control algorithm.

Video packaging optimization reduces the number of round trips between the player and the CDN, which reduces both first screen time and lag.

Video coding optimization reduces the bitrate, which lowers both first screen time and the lag ratio.

Select a streaming media protocol

Before analyzing the specific issues, let's talk about the choice of streaming protocol. We finally chose HLS. At first we considered HTTP-FLV, which is widely used in China, because of its low latency, low packaging overhead, large user base and mature tooling. But measured against our actual requirements, HTTP-FLV has many problems. For example, we have multiple audio tracks and multiple subtitles: many movies have two audio tracks (English and French, say), some add local languages, and may end up with four or five tracks. If all the tracks are muxed into the same stream, packaging is inefficient and a user who only needs one track still has to download the whole stream. Multiple subtitles pose a similar problem, so we need to split these into separate streams.

In addition, FLV does not handle the separation of audio and video streams or smooth bitrate switching well; using FLV would have required a lot of secondary development on top of it. There is also the question of support by overseas third-party CDNs: most overseas CDN vendors do not support the FLV protocol.

DASH was another option at the time, but when we started development in 2016 there were very few open source tools for DASH, so we finally chose HLS, which has good support from all sides and high technical maturity.

3.5 First Screen Time Problem

Next, let's analyze the specific problems; the first one to solve is the first screen. There are several steps from the user clicking a video to the video actually playing, as shown in the flow chart above.

First, business authentication. In a paid service like ours, we must verify whether the user has the corresponding rights, and that verification is relatively complicated. For example, there is a lot of stream theft, so we need anti-abuse logic to determine whether the current user is legitimate and allowed to use the stream; we built a number of data models to judge whether the user is a bot, and only real users can obtain CDN tokens. Other business logic, such as the ad policy, whether to resume playback, and selecting the user's preferred bitrate, is also executed after the user taps the play button.

The next step is choosing a CDN. Since we have a large number of CDNs, including dozens or more third-party ones, we must make the most appropriate choice; after choosing the CDN, the domain name has to be resolved. After DNS resolution, the player starts downloading the video files; because we use HLS, it has to download the M3U8 files and a segment file before it finally has the data for the first frame.

The whole chain is rather long: without optimization, first screen time is basically more than a dozen RTTs. For example, the HLS specification allows the M3U8 and the segments to be placed on different CDNs, in which case they cannot share one TCP connection; connections must be established separately and the downloads happen one after the other. The number of connections also affects the first screen success rate: with a TCP handshake success rate of only 80%, the probability that both connections succeed is only 64%.

Let's look at some data across the whole chain. The first metric is first screen success rate, an overall indicator. The error rate covers failures at any link, such as CDN errors, 404 or 403 responses when downloading files, or connection failures; in short, an error anywhere is recorded in the error rate. There is also the active exit rate: if the user does not end up watching the video, it is either because of an error or because of an active exit. For active exits we also record at which step and after how long the user quit, which is very useful guidance for subsequent optimization.

The figure shows data collected after we defined the indicators. The bar at the top is the average startup time, with different colors representing the different steps: the dark green on the left is the business interface, the blue is CDN selection, matching the process just described. Looking at the averages over a period of time, users spend most of the time downloading the segment file, which may be hundreds of KB while the earlier steps transfer only dozens of bytes, so at first glance that seems reasonable.

But if we actually want to optimize first screen time, we need to look at the bottom bar, the breakdown at the 85th percentile. Downloading is still the single longest step, but the earlier steps together take up about two thirds of the total. If our goal is to reduce the active exit rate by reducing first screen time, optimizing only the segment download is not enough: even if its time were cut to zero, the earlier steps would still take about 5 s, which users still find unacceptable. With this data the root cause became clear: RTT is large and everything executes serially, so the first screen is very long. With that conclusion from the data, we could settle on an optimization approach.
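The same point, averages versus percentiles, can be shown with a small analysis sketch over per-stage startup logs; the stage names here are hypothetical:

```python
import statistics

def p85(values: list[float]) -> float:
    """85th percentile (nearest rank) -- the line we actually optimize against."""
    ordered = sorted(values)
    return ordered[max(int(0.85 * len(ordered)) - 1, 0)]

def breakdown(startup_logs: list[dict]) -> None:
    """startup_logs: one dict per playback, per-stage durations in milliseconds."""
    for stage in ["api", "cdn_select", "dns", "connect", "m3u8", "segment"]:
        xs = [log[stage] for log in startup_logs]
        print(f"{stage:10s} mean={statistics.mean(xs):6.0f} ms   p85={p85(xs):6.0f} ms")
```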

Optimizing the first screen time

First, business interface optimization; the methods vary by business. In our case, logic such as authentication and ad policy can be made asynchronous: for example, a policy can be delivered to the client ahead of time and executed asynchronously, and the client can keep the playback history locally and synchronize it with the server at each app launch, so when a video starts the client itself decides whether to resume. These optimizations shorten the serial chain by 1–2 RTTs, which in Africa translates into hundreds of milliseconds or even around a second saved.
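A minimal sketch of this idea, using Python's asyncio purely for illustration (the backend calls and their timings are placeholders): only the step that produces the CDN token blocks playback, while history sync and ad policy run in the background.

```python
import asyncio

async def authenticate(user_id: str) -> str:
    await asyncio.sleep(0.3)      # stands in for one backend round trip
    return "cdn-token"

async def sync_history(user_id: str) -> None:
    await asyncio.sleep(0.5)      # report watch history in the background

async def fetch_ad_policy(user_id: str) -> None:
    await asyncio.sleep(0.5)      # ad policy applied later, not before the first frame

async def start_playback(user_id: str, video_url: str) -> None:
    token = await authenticate(user_id)          # the only step that blocks playback
    background = asyncio.gather(sync_history(user_id), fetch_ad_policy(user_id))
    print(f"play {video_url}?token={token}")     # hand off to the player immediately
    await background                             # in a real app the loop keeps running

asyncio.run(start_playback("u1", "http://cdn.example.com/v/123/master.m3u8"))
```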

CDN selection in fact includes DNS resolution, and the optimization idea is the same. To save CDN selection time we select the CDN on the list page already: the list page checks the user's location and sends the data to the backend for a quick selection. The app also reselects the CDN asynchronously; for example, when the phone's network changes from 3G to 4G or switches to Wi-Fi, the app runs the selection analysis asynchronously, so playback stays smooth and 2 RTTs disappear from the startup path.

Then there is the M3U8 download. To download anything you must first establish a TCP connection, and the TCP handshake costs 1 RTT. We have two ways to save connection setup time. The first is for the client to establish the connection right after CDN selection finishes and then keep it alive with heartbeats. Keeping a connection alive in Africa is not easy: it breaks after a while and packets stop getting through; rebuilding it wastes the user's traffic, but not rebuilding it wastes even more time when the video data actually needs to be downloaded. This keep-alive strategy needs careful tuning.
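A rough sketch of such a pre-established, heartbeat-maintained connection follows; the heartbeat path, interval and reconnect policy are illustrative assumptions rather than the tuned strategy described above.

```python
import socket
import threading
import time

class WarmConnection:
    """Keep a TCP connection to the chosen CDN warm so playback can skip the handshake."""

    def __init__(self, host: str, port: int = 80, interval: float = 30.0):
        self.host, self.port, self.interval = host, port, interval
        self.sock = None
        threading.Thread(target=self._keepalive, daemon=True).start()

    def _connect(self) -> None:
        self.sock = socket.create_connection((self.host, self.port), timeout=5)

    def _keepalive(self) -> None:
        while True:
            try:
                if self.sock is None:
                    self._connect()
                # Lightweight request so middleboxes don't silently drop the connection.
                self.sock.sendall(b"HEAD /ping HTTP/1.1\r\nHost: %b\r\n"
                                  b"Connection: keep-alive\r\n\r\n" % (self.host.encode(),))
                self.sock.recv(4096)
            except OSError:
                self.sock = None              # broken; rebuild on the next tick
            time.sleep(self.interval)

    def take(self) -> socket.socket:
        """Hand the warm socket to the downloader, rebuilding it if it has died."""
        if self.sock is None:
            self._connect()
        s, self.sock = self.sock, None
        return s
```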

It is better to use QUIC, which offers 0-RTT connection establishment. QUIC still performs a handshake, but because the handshake packets are sent together with the data, from the user's point of view there is no handshake time. Of course there are problems: its success rate is not especially high. Under Google's default policy, 0-RTT becomes invalid whenever the client IP changes, which is a strong constraint because mobile network IPs change easily. In our tests 0-RTT works only about 50 percent of the time, versus roughly 60 percent according to Google's own data, which also varies by region. These optimizations save another 1–2 RTTs.

Next is the M3U8 download itself. HLS playlists consist of a master M3U8 and sub-M3U8s, and we use fragmented MP4 packaging instead of TS, which adds an init.mp4 file; it is small but has to be downloaded separately, meaning yet another RTT. So we fold the content of these files into the playback URL response: the contents are essentially text strings, so we only need to transfer the strings to the client, which constructs the M3U8 and related files locally and hands them to the player; the player then plays normally without several separate downloads. This saves another 1–2 RTTs.
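A minimal sketch of the client side of this idea, assuming a hypothetical response format in which the playlist text (and a base64-encoded init.mp4) arrives together with the playback URL; the field names are made up for illustration:

```python
import base64
import os
import tempfile

def materialize_playlist(play_response: dict) -> str:
    """Write the playlist text (and init segment) delivered with the play URL to disk,
    so the player never fetches media.m3u8 / init.mp4 over the network."""
    workdir = tempfile.mkdtemp(prefix="hls_")
    with open(os.path.join(workdir, "init.mp4"), "wb") as f:
        f.write(base64.b64decode(play_response["init_mp4_b64"]))
    media_path = os.path.join(workdir, "media.m3u8")
    with open(media_path, "w") as f:
        f.write(play_response["media_m3u8"])   # plain text, points at remote segments
    return media_path                          # hand this local path to the player

# Hypothetical shape of the response delivered together with the playback URL:
resp = {
    "media_m3u8": ("#EXTM3U\n#EXT-X-VERSION:7\n#EXT-X-TARGETDURATION:6\n"
                   '#EXT-X-MAP:URI="init.mp4"\n'
                   "#EXTINF:6.0,\nhttp://cdn.example.com/v/123/seg1.m4s\n"),
    "init_mp4_b64": base64.b64encode(b"...init bytes...").decode(),
}
print(materialize_playlist(resp))
```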

Finally, the TCP connection setup time for segment downloads. Some companies put the segments and the M3U8 on two different CDNs, which forces two separate connections; but if segments and M3U8 are on the same CDN we can reuse one connection. That works at least for VOD, which only needs to download the M3U8 once and then just segment files. Live is different: the live M3U8 refresh and the segment downloads are independent and run in parallel, so two connections are needed for parallel downloading. Our optimization for live is therefore to establish both connections at the same time during connection setup. With HTTP/2 or QUIC it is easier, since those protocols support connection multiplexing; their handshakes involve more packets, but because the connection can be reused, overall another 1–2 RTTs are saved.

So the overall idea is simple and the goal is clear: the high RTT itself is hard to reduce, so we directly reduce the number of RTTs. After all the optimizations, we calculated that about 10 RTTs were removed, counting from the worst case. What does removing 10 RTTs mean in practice? When the user taps a video on the list page there is no other work left, even the connection is already established; the player directly downloads the video data itself, and as soon as a little data arrives the first frame appears, so startup time improves dramatically.

That is what we optimized. Before, the 85th-percentile first screen time could exceed 7 s, which even African users find unbearable. After optimization it is now under 3 s. By Chinese standards that is still long, but it is acceptable to African users.

The drop in the active exit rate is also very noticeable. It used to be 14 percent: out of 100 people who opened a video, 14 quit because they did not want to wait, which is a very bad number. Our optimization roughly halved it, to about 7%. Of course, even at 3 s some users are unwilling to wait. We analyzed user behavior and found that part of this comes from usage habits: some users tap rapidly through the channel list, hitting several videos within a second and exiting immediately, and these operations are counted as active exits in the backend. That is why the overall active exit rate only fell by about half; for users who watch videos normally, this first screen time is acceptable.

3.6 Lag problem

Then there is lag. The metric system for lag is simpler. The player downloads the M3U8 first and then the segments; for live it alternates between the two, for VOD it downloads the M3U8 once and then keeps downloading segments. After that come buffering and decoding. This part is straightforward.

The main idea for reducing the lag ratio is to increase download speed. Download speed is only about 250 kbps and the player cannot keep downloading continuously, which is why the live lag ratio is much higher than the VOD one: a live player cannot just pull segments back to back, it also has to fetch the M3U8 frequently.
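A back-of-the-envelope model makes the relationship between bitrate and download speed concrete; it is deliberately crude (constant speeds, no buffer dynamics), but it shows why a 250 kbps link cannot smoothly play content encoded above that rate.

```python
def stall_seconds(bitrate_kbps: float, download_kbps: float, duration_s: float) -> float:
    """Crude model: if the link is slower than the bitrate, the session is download-bound,
    and everything beyond real time shows up as stalling."""
    download_time = bitrate_kbps * duration_s / download_kbps
    return max(download_time - duration_s, 0.0)

duration = 300.0                       # a five-minute clip
for bitrate in (200, 400, 600):        # kbps
    stall = stall_seconds(bitrate, 250, duration)
    print(f"{bitrate} kbps over a 250 kbps link -> lag ratio "
          f"{stall / (stall + duration):.0%}")
```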

Lag optimization scheme

The optimization here is to download the M3U8 and the segments in parallel. One method is to put the M3U8 content into the segment response. This is an interesting change: we put the M3U8 text directly into the HTTP response header of the segment. Since the M3U8 is just a string, the player no longer downloads the M3U8 separately; it just keeps downloading segments, and because each segment carries the next M3U8, the time for separate M3U8 downloads disappears and overall download throughput naturally improves.
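A minimal sketch of the player-side loop for this scheme is shown below; the header name and the two callbacks are hypothetical, since the talk only states that the M3U8 text rides in the segment's HTTP response header.

```python
import urllib.request

NEXT_PLAYLIST_HEADER = "X-Next-M3U8"   # hypothetical header name

def live_download_loop(first_segment_url, feed_player, resolve_next_segment):
    """Live loop when each segment response carries the next playlist inline:
    the player never issues a separate M3U8 request, it just keeps pulling segments."""
    url = first_segment_url
    while url:
        with urllib.request.urlopen(url, timeout=30) as resp:
            feed_player(resp.read())                        # media data for the buffer
            playlist_text = resp.headers.get(NEXT_PLAYLIST_HEADER, "")
        url = resolve_next_segment(playlist_text)           # parse playlist, pick next URI
```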

Then there is buffer optimization: since we do not care about latency, we increased the buffer to 75 seconds.

For packaging we use fragmented MP4, which has low overhead. The overhead of the TS container is very high, around 10% at low bitrates: for example, an audio/video stream with an original bitrate of 200 kbps becomes 220 kbps after packaging, which is not cost-effective. Fragmented MP4 has only about 1% overhead, but fMP4 has its own problem: audio samples end up grouped in front and video behind, which hurts startup time, so we do our own interleaved packaging.

In the coding part, as mentioned just now, we mainly optimize for the content, improving picture quality and playback smoothness at low bitrates through processing and encoding optimization.

On the CDN side, building our own CDN and optimizing the CDN selection strategy also significantly improve download speed.

Finally, it is worth mentioning BBR and QUIC. At the beginning we used the BBR congestion control algorithm, but the benefit was not obvious, far from expectations. Our analysis concluded that the congestion was simply too severe. BBR is a relatively well-behaved algorithm: unlike Cubic, which keeps sending packets until loss occurs, BBR backs off as soon as it detects congestion building. But since the network in Africa is congested anyway, BBR's benefit is not that significant.

The gains on lag are also very significant. At the 85th percentile, the live lag ratio used to be 15% and is now under 2%, a good result for Africa. For VOD, the 85th-percentile lag ratio is under 1%, also a good result.

After doing these things, our user experience has been greatly improved, which has promoted the development of the business.

3.7 Summary of optimization ideas

A quick summary. First, data is the basis for improvement, and finding problems accurately requires thorough instrumentation in advance. If you just imagine where the problem is and change the program, the result is largely a matter of luck. We even instrumented the network protocol stack and pulled back parts of users' protocol-stack logs, such as when each IP packet was sent, when it was lost, why it was retransmitted, and whether it was retransmitted on time, and analyzed every piece of that data. Player and business instrumentation is even more fundamental and must be complete.

Second, look at the data on the percentile lines, not just the average; averages can mask extreme situations and lead optimization in the wrong direction.

Finally, focus on the core metrics, which for us means sacrificing delay to get the other metrics right. You need to be able to identify and optimize your core bottlenecks to ensure high efficiency.