
This article is published in the Cloud + Community column by Tencent Cloud Video.


I'm a casual PUBG ("chicken dinner") player. Ever since being carried to a chicken-dinner win by three girls, I've been determined to become a true PUBG master, spending my mornings glued to the PUBG live streams on all the major sites. And just as the exciting final-circle moment arrives, the stream turns into a garbled mess of corrupted pixels? And then the game is over? Oh my God. Who am I? Where am I? What did I just miss?

How could I, an obsessive IT guy, let corrupted frames like that go unchallenged? On the one hand, I want to become a chicken-dinner master; on the other, I can't afford to miss any chance at a raise or promotion. And so a long march began…

For a live streaming business, the metrics we usually watch are "first-frame time, stuttering, latency, and room-entry success rate". These all measure whether a user can gracefully get into the live room. But what the user sees after entering the room is just as critical a link. Beyond content-safety metrics, the content itself can be abnormal, showing corrupted frames, green screens and the like. That is when we started thinking about how to measure and detect corrupted frames in live streams.

This article presents a CNN-based scheme for detecting corrupted frames.

01

Building the corruption detection capability

Whether it is an on-demand video or a live stream, it is made up of individual frames; we perceive it as motion because of the persistence of human vision.

When an object moves quickly, the human eye retains its image for roughly 0.1-0.4 seconds after the image it saw has disappeared. This phenomenon is called persistence of vision.

So detecting whether a live stream is showing corruption can be reduced to detecting whether an individual frame of the stream is corrupted, that is, an image recognition problem. How, then, do we recognize that an image is a corrupted frame?

Image recognition is usually feature-based: we first extract features relevant to the target and then build decision rules on top of them. Fortunately, a deep convolutional neural network (CNN) can handle both the feature extraction and the decision-making for us.

However, a deep-learning CNN cannot get around two things: a dataset and model training.

1.1 Dataset preparation

The difficulty

To use a deep learning network, the entry ticket is a sufficiently large labeled dataset; otherwise the trained network easily overfits and generalizes poorly. I call it an entry ticket because in real business such data is often missing. Take corrupted frames as an example: they occur very rarely in production live streams, and collecting enough real corrupted screenshots for a training set is extremely difficult. We therefore had to explore other ways to build the training set.

We can tell a corrupted frame apart because the human eye picks up its characteristic features, even if we cannot put those features into words. In fact, if we could describe the features precisely, we could easily translate the description into code that detects corrupted frames.

Making the training set

Machine learning also works on features. So we can generate some corrupted images ourselves and let the CNN discover the features that distinguish them from normal images, thereby learning to detect corruption.

While using the YUVviewer tool, I noticed that playing a video file at the wrong resolution produces exactly this kind of corruption. Inspired by that, we can extract frames from YUV files at a deliberately wrong resolution to obtain corrupted images. The overall process is as follows:

To do this we need to understand how a YUV file is stored, so that we can extract the corresponding frames according to the format:

YUV has three components: "Y" is luminance (luma), i.e., the gray value; "U" and "V" are chrominance (chroma), which describe the color and saturation and together determine a pixel's color.

YUV sampling patterns are illustrated as follows: a solid dot marks a pixel whose Y component is sampled, and a hollow circle marks a pixel whose UV components are sampled. We generally use the YUV420 format.

The storage mode of YUV420 is as follows:
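To make the layout concrete, here is a small worked example (my own illustration; the 1280x720 resolution is chosen only for the arithmetic): in planar YUV420, a frame stores the full Y plane first, then a quarter-size U plane, then a quarter-size V plane.

    width, height = 1280, 720                  # example resolution, not from the article
    y_size = width * height                    # 921,600 bytes of luma
    uv_size = (width // 2) * (height // 2)     # 230,400 bytes each for U and for V
    frame_size = y_size + 2 * uv_size          # 1,382,400 bytes = width * height * 3 / 2
    # Frame k therefore starts at byte offset k * frame_size in the .yuv file.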

The code below extracts frames according to this layout:

import os
from numpy import zeros, uint8

def get_frames_from_YUV(filename, dims, numfrm, startfrm, frmstep):
    """
    Extract frame data from the given YUV420 file.
    :param filename: path of the YUV file
    :param dims: resolution of the YUV file, (width, height)
    :param numfrm: number of frames to extract
    :param startfrm: index of the first frame to extract
    :param frmstep: interval between extracted frames, i.e. take one frame every frmstep frames
    :return: lists of the Y, U and V planes of the extracted frames
    """
    blk_size = dims[0] * dims[1] * 3 // 2            # bytes per YUV420 frame
    filesize = os.path.getsize(filename)
    fp = open(filename, 'rb')
    # If the file is too short, shrink numfrm so we do not read past the end
    if (startfrm + 1 + (numfrm - 1) * frmstep) * blk_size > filesize:
        numfrm = (filesize // blk_size - 1 - startfrm) // frmstep + 1
    Y, U, V = [], [], []
    d00 = dims[0] // 2
    d01 = dims[1] // 2
    fp.seek(blk_size * startfrm, 0)                  # skip to the first requested frame
    for i in range(numfrm):
        Yt = zeros((dims[1], dims[0]), uint8, 'C')
        Ut = zeros((d01, d00), uint8, 'C')
        Vt = zeros((d01, d00), uint8, 'C')
        for m in range(dims[1]):                     # Y plane: height x width bytes
            for n in range(dims[0]):
                Yt[m, n] = ord(fp.read(1))
        for m in range(d01):                         # U plane: (height/2) x (width/2) bytes
            for n in range(d00):
                Ut[m, n] = ord(fp.read(1))
        for m in range(d01):                         # V plane: (height/2) x (width/2) bytes
            for n in range(d00):
                Vt[m, n] = ord(fp.read(1))
        Y = Y + [Yt]
        U = U + [Ut]
        V = V + [Vt]
        fp.seek(blk_size * (frmstep - 1), 1)         # skip the frames in between
    fp.close()
    return (Y, U, V)
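For example (the file name and resolution here are placeholders), grabbing 10 frames from a 720p clip, one every 30 frames:

    Y, U, V = get_frames_from_YUV('sample_720p.yuv', (1280, 720),
                                  numfrm=10, startfrm=0, frmstep=30)
    print(len(Y), Y[0].shape)   # -> 10 (720, 1280)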

We also studied what happens with various kinds of wrong resolutions and found the following patterns:

1) Correct resolution

2) Resolution width + 1

3) Resolution width + N

4) Resolution width - 1

5) Resolution width - N

As you can see, when the width is reduced the corruption stripes slope down to the left, and when the width is increased they slope down to the right.

By setting different error offsets for the width and height we get different kinds of corruption, so this strategy lets us construct a large number of corrupted images.
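A rough sketch of that generation step, reusing get_frames_from_YUV above (the output naming and the choice of OpenCV for writing the images are my own assumptions, not details from the article):

    import cv2   # assumption: OpenCV is used only to write the images to disk

    def make_corrupted_samples(yuv_path, true_dims, width_offset, out_prefix,
                               numfrm=10, startfrm=0, frmstep=30):
        # Deliberately read the file with a wrong width: each row is then
        # mis-aligned by width_offset pixels, which produces the diagonal stripes.
        wrong_dims = (true_dims[0] + width_offset, true_dims[1])
        Y, _, _ = get_frames_from_YUV(yuv_path, wrong_dims, numfrm, startfrm, frmstep)
        for idx, y_plane in enumerate(Y):
            # The stripe pattern already shows up in the luma plane, so saving
            # the Y plane as a grayscale image is enough for the training set.
            cv2.imwrite('%s_off%+d_%02d.png' % (out_prefix, width_offset, idx), y_plane)

    # e.g. width reduced by 3 pixels -> stripes sloping down to the left
    # make_corrupted_samples('sample_720p.yuv', (1280, 720), -3, 'corrupted/sample')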

We used more than 800 videos, extracted 10 frames from each at a fixed interval, and obtained more than 8,000 corrupted images.

These images are labeled as corrupted and serve as our positive samples; for negative samples we can simply use normal screenshots taken from real live streams.

At this point, the data set is almost ready.

1.2 Model and training

For the model, any of the well-known public architectures will do; here we use the lightweight MobileNet, whose structure is shown in the figure below:

We fine-tuned a model pretrained on ImageNet, with the last layer randomly initialized (its pretrained weights cannot be loaded anyway, since the number of classes differs). Because the corruption features are so distinctive, training converges quickly and accuracy on the test set is high.
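As a rough sketch of this fine-tuning step (the Keras/TensorFlow framework, the data/ folder layout with corrupted/ and normal/ subdirectories, and the hyperparameters are my own assumptions, not details from the article):

    import tensorflow as tf

    # ImageNet-pretrained MobileNet without its 1000-class top layer
    base = tf.keras.applications.MobileNet(weights='imagenet', include_top=False,
                                           input_shape=(224, 224, 3), pooling='avg')
    # New, randomly initialized head: 2 classes (corrupted / normal)
    out = tf.keras.layers.Dense(2, activation='softmax')(base.output)
    model = tf.keras.Model(base.input, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss='sparse_categorical_crossentropy', metrics=['accuracy'])

    # data/corrupted and data/normal hold the positive and negative samples
    train_ds = tf.keras.preprocessing.image_dataset_from_directory(
        'data/', image_size=(224, 224), batch_size=32)
    train_ds = train_ds.map(
        lambda x, y: (tf.keras.applications.mobilenet.preprocess_input(x), y))
    model.fit(train_ds, epochs=5)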

In fact, training is an iterative process. A machine learning model starts as a blank sheet of paper; whatever ability it ends up with is entirely taught by you through the training set (data and labels). For it to handle as many situations as possible, your training set has to cover as many situations as possible.

But the training set is never sufficient; there will always be cases you did not think of. What happens then? Here is an example.

Suppose you train a model to recognize airplanes, and most airplane pictures contain sky. If you then show the model a picture of just the sky, it will probably call it an airplane, because what it may really have learned are features of the sky. To make it learn features of the airplane itself, you have to adjust the training set so that it contains not only airplanes against the sky, but also airplanes on the ground and so on.

Through continuous iterative tuning, our detection accuracy reached 94% in Qzone live scenes and 90% in NOW live scenes.

02

Live detection scheme

With the detection capability in place, hooking it up to the live streaming service requires capturing frames from each stream and running corruption detection on those frames.

The overall backend architecture:

With the frame-capture approach, about 20 million screenshots would need to be checked every day, i.e., roughly one screenshot per stream every 10 seconds. Considering that corruption, once it appears, usually lasts for quite a while, we prefer to sample over a longer interval to avoid wasting compute.
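For the per-screenshot check itself, a minimal sketch (the saved model path, input size and threshold are assumptions; the class index depends on how the training folders were ordered):

    import numpy as np
    import tensorflow as tf

    model = tf.keras.models.load_model('corruption_mobilenet.h5')   # hypothetical path

    def is_corrupted(image_path, threshold=0.5):
        # Load the sampled screenshot and resize it to the network's input size
        img = tf.keras.preprocessing.image.load_img(image_path, target_size=(224, 224))
        x = tf.keras.preprocessing.image.img_to_array(img)[np.newaxis, ...]
        x = tf.keras.applications.mobilenet.preprocess_input(x)
        # Index 0 is assumed to be the "corrupted" class in this sketch
        return model.predict(x)[0][0] > threshold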

Attached is a screenshot of the current business test result:

As an IT guy who loves his work, it took me a week to finally wrap up this CNN-based live-stream corruption detection.

Work makes me happy, games make me happy, and at last I can stroll through the PUBG live streams on every site without a hitch again ~

