Live streaming is a recently emerging form of interaction and one of the highlights of this year's Alibaba Double 11 (Singles' Day). Content risk monitoring for live streaming is a new subject and a major technical challenge. The main difficulties are the lack of mature solutions and standards, the unpredictability of anchor behavior and broadcast content, the need to process thousands of concurrent streams at peak, and the real-time response requirements placed on the algorithms.

What distinguishes the Alibaba Group Security Department's approach to live broadcast control this year is its extensive use of artificial intelligence and deep learning, combined with an optimized high-performance multimedia computing cluster. This greatly reduced the cost of manual review and improved the ability to prevent and control content risk. During the peak period, the system successfully processed 5,400 concurrent live video streams and a total of 250,000 fan Lianliankan game sessions, and issued warnings for or blocked the violating content. The two main technologies are real-time filtering of live broadcast content and optimization of the multimedia processing cluster.

1. Real-time filtering of live broadcast content

During live broadcasts, some anchors break the rules in order to attract attention or promote products. In addition, this Singles' Day introduced an interactive game between buyers: Lianliankan. The system randomly pairs two participants, turns on the front camera of each participant's phone, and sends the video to the other participant's phone for display. The two players then compete at actions such as staring at each other without laughing. Participants in the game are not identity-verified, so the content needs to be controlled in real time. The forecast peak for Double 11 was 5,400 live streams online at the same time. One reviewer can handle about 60 streams at most, so roughly 90 reviewers would have to be online simultaneously, which would consume a great deal of manpower. Because human attention lapses, some violating content would still slip through. We therefore have to rely on artificial intelligence for comprehensive risk prevention and control.

So what are the risks of live streaming?

We analyzed all penalty records from Taobao Live since its launch, along with data captured from external live-streaming platforms on the Internet, and found that serious violations are concentrated in two areas: pornographic and vulgar content, and portraits of sensitive figures. Therefore, when judging the risk of on-screen content, we call two algorithm services: video pornography detection and sensitive face detection. As a result, 99 percent of videos are reviewed automatically, and only about 1 percent are moderated manually.

1.1 Intelligent pornography detection technology

Intelligent pornography detection takes a picture or video as input, and the algorithm model returns a score between 0 and 100. The score is a non-linear indicator of how likely the image is to be pornographic: images scoring 99 or above are almost certainly pornographic and can be handled automatically by the machine; scores from 50 up to 99 require manual review; scores below 50 are considered normal, because the range of 50 and above covers more than 99% of pornographic images. Intelligent pornography detection has two notable properties: 1) more than 60% of pornographic pictures score 99 or above, so the machine can automatically handle most of the pornographic risk; 2) the proportion of pictures requiring manual review is very low, about 0.1% in Taobao Live scenarios.
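The score bands described above can be written as a small routing function. This is a minimal sketch: the function name and the returned labels are hypothetical, and only the two thresholds (99 and 50) come from the text.

```python
def route_by_score(score: float) -> str:
    """Map a 0-100 pornography score to a handling decision."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 99:
        return "auto_handle"    # almost certainly pornographic
    if score >= 50:
        return "manual_review"  # uncertain band, sent to a reviewer
    return "pass"               # considered normal


for s in (99.5, 72.0, 3.0):
    print(s, route_by_score(s))
```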

What is the principle behind intelligent pornography detection?

Intelligent pornography detection is a recognition engine for pornographic images. It provides customizable, multi-scale recognition for different scenarios and users, with recognition accuracy of up to 99.6%, which greatly reduces the cost of controlling image content. We build a multi-layer visual perceptron based on deep learning, and use an improved Inception network layer together with a cascade of multiple models to recognize pornographic content quickly at multiple scales. The following figure shows how the intelligent pornography detection model is produced.

Steps for generating the intelligent pornography detection model

1.1.1 Clear classification criteria

Of the steps shown above, defining the standard and labeling the data are harder than training the model. The real world is complex, and different people often perceive the same picture differently. To develop the standard, colleagues from the operations and algorithm teams discussed and revised the first version several times. During the subsequent labeling process, the standard was supplemented several more times in response to the problems encountered, after which it stabilized.

1.1.2 Collecting samples

The details of sample acquisition are omitted here. As for the scale of the data: we surveyed nearly 2,000 websites together with the pornography violation cases accumulated across the Alibaba ecosystem, collecting over 60 million suspected pornographic pictures in total, and completed over 13 million high-quality annotations. This data is the most important cornerstone of intelligent pornography detection.

1.1.3 Labeling samples

Content on the Internet is highly repetitive, so a considerable proportion of the 60 million or more images are bound to be identical or near-identical. To save labeling resources, we used image search technology to remove the duplicates, leaving about 23 million images. Image search is a content-based image retrieval technology that we developed ourselves, built on visual words derived from local features. It can find a target image even after the image has been scaled, cropped, rotated, partially occluded, color-transformed, blurred, and so on. The effect is shown in the figure below.

Examples of similar images found by the image search engine

Alibaba has developed an efficient labeling platform (MBox). It provides exercises and exams as quality control before labeling starts. During labeling, it mixes in verification questions, which lets it automatically calculate each annotator's accuracy and, according to configured conditions, disqualify annotators whose quality is too low. We observed that even for skilled and conscientious annotators, the error rate still hovered around 1%. We therefore used the trained model to judge the labeled samples and re-labeled any sample where the machine result disagreed with the human result. This process is repeated to ensure the high quality of the labeled samples.

Schematic diagram of the sample labeling process
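The two quality checks described above (accuracy on verification questions, and re-labeling where model and human disagree) can be sketched as follows. This is an illustration under stated assumptions, not MBox's actual implementation: the function names, the dictionary layout, and the 0.95 accuracy cutoff are all hypothetical.

```python
def annotator_accuracy(answers, gold):
    """Fraction of verification questions the annotator answered correctly.

    answers: {sample_id: label} submitted by one annotator
    gold:    {sample_id: label} for the verification questions only
    Returns None if the annotator has not answered any verification question.
    """
    checked = [sid for sid in gold if sid in answers]
    if not checked:
        return None
    correct = sum(answers[sid] == gold[sid] for sid in checked)
    return correct / len(checked)


def is_disqualified(answers, gold, min_accuracy=0.95):
    """True when measured accuracy falls below the (assumed) cutoff."""
    acc = annotator_accuracy(answers, gold)
    return acc is not None and acc < min_accuracy


def samples_to_relabel(human_labels, model_labels):
    """Sample ids where the trained model disagrees with the human label."""
    return [
        sid
        for sid, label in human_labels.items()
        if sid in model_labels and model_labels[sid] != label
    ]
```

Running `samples_to_relabel` again after each round of re-labeling and retraining corresponds to the repeated process the text describes.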

1.1.4 Model training

Annotation results are automatically written back to an ODPS table in the early morning of the following day, so the data can be read for training at any time. Training used open-source code based on the Caffe framework, with some modifications as needed. The first training run used about 1 million samples and took nearly a month on a single GPU machine. Later, we changed the network structure and used the training platform provided by the Pluto team to train on multiple machines with multiple GPUs, which brought the training time for tens of millions of samples down to less than a week.

Schematic diagram of the pornography detection model generation system

To meet the scale and timeliness requirements of live broadcast control, we designed a multi-stage classification model, which reduced response time by about 30% while slightly increasing recall.

Multistage classification model
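The article does not describe the individual stages, so the following is only a generic sketch of how a two-stage cascade can cut average response time: a cheap model settles the clear-cut images, and only the uncertain ones reach a slower, more accurate model. The function names and the `low`/`high` cutoffs are assumptions made for illustration.

```python
from typing import Callable

Scorer = Callable[[bytes], float]  # returns a 0-100 score for an image


def cascade_score(image: bytes, fast: Scorer, accurate: Scorer,
                  low: float = 5.0, high: float = 95.0) -> float:
    """Score an image with a fast model, escalating only uncertain cases."""
    first = fast(image)
    # Clearly normal or clearly violating: trust the fast model and stop.
    if first < low or first > high:
        return first
    # Uncertain band: pay for the slower, more accurate model.
    return accurate(image)
```

The saving depends on how many images the first stage can settle on its own; the cutoffs trade response time against the risk of trusting the cheaper model too often.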

As soon as the Lianliankan game went online, intelligent pornography detection caught a number of cases of indecent exposure; the pictures are not suitable for display. It also caught violations by merchants (exposure in medical advertisements, display of adult products, display of explicit pictures, improper dress, and so on); those pictures are omitted as well. The violation cases show that pornographic risk in live broadcasts takes many forms, such as re-shot screens, pictorials, real people, adult products, and models, along with a wide variety of poses and actions.

During the whole of Double 11, a total of 82 live broadcasts were penalized for pornographic or vulgar content or for improper dress, of which the algorithm hit 68. It covered 100% of the pornographic and vulgar risks and more than 80% of the improper-dress violations (Taobao Live is very strict about how revealing clothing may be, so even some clothing that is commonly seen elsewhere counts as a violation), while only about 0.1% of screenshots needed manual review. The approach succeeded both in risk coverage and in saving review manpower.

1.2 Sensitive face detection

Controlling sensitive figures in live broadcasts is a 1:N face recognition problem. The faces appear on many kinds of carriers, such as animation, print, Photoshopped images, and re-shot screens, and the expression, pose, illumination, distance, occlusion, and blur of the portrait cannot be controlled.

The detection system consists of two modules: sensitive person enrollment and user picture query. Enrollment involves extracting features and building an index. When a user's picture is queried, the system returns the picture, name, and similarity of the person most similar to the queried face, and business rules then decide whether it counts as a match for a sensitive person. The database consists of nearly 20,000 portraits of well-known figures from various fields at home and abroad, divided into levels according to sensitivity so as to provide a multi-level control list.

Sensitive person recognition has two main parts: face feature extraction and the construction of the retrieval system. The model is built with deep learning, using a basic network structure of five convolutional layers plus two fully connected layers. Attributes such as age and gender are incorporated, and several regression and classification loss functions are combined for training. This multi-data, multi-task training approach fully exploits the multi-dimensional information in the training data and yields a model that generalizes better.

Architecture diagram of sensitive person recognition technology

The indexing algorithm works, in brief, as follows:

1. Choose a set of hash functions that project the data onto discrete values. All data are stored in the hash buckets they map to;

2. During retrieval, apply the same hash functions to the query to compute its bucket number, take out all the data in that bucket, compute the distances, sort, and output the results.
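The two steps above can be sketched with a small hash index. The article does not say which hash family is used, so this sketch assumes random-hyperplane hashing with a single hash table and Euclidean distance; the class and method names are hypothetical, and a production system would typically use several tables to improve recall.

```python
import math
import random
from collections import defaultdict


class HashIndex:
    """Single-table locality-sensitive hash index using random hyperplanes."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = random.Random(seed)
        # Each random hyperplane contributes one bit of the bucket key.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(n_bits)]
        self.buckets = defaultdict(list)

    def _bucket(self, vec):
        # Step 1: project the vector onto discrete values (one sign bit per plane).
        return tuple(
            sum(p * v for p, v in zip(plane, vec)) >= 0.0
            for plane in self.planes
        )

    def add(self, key, vec):
        self.buckets[self._bucket(vec)].append((key, vec))

    def query(self, vec, top_k=10):
        # Step 2: hash the query, fetch its bucket, compute distances, sort.
        candidates = self.buckets.get(self._bucket(vec), [])
        scored = [(key, math.dist(stored, vec)) for key, stored in candidates]
        scored.sort(key=lambda item: item[1])
        return scored[:top_k]


index = HashIndex(dim=3)
index.add("person_a", [1.0, 0.0, 0.0])
index.add("person_b", [-1.0, 0.0, 0.0])
print(index.query([1.0, 0.0, 0.0]))
```

Because only one bucket is scanned, a query can miss true neighbors that hashed elsewhere, which is why accuracy for such an index is reported relative to exhaustive search.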

Search performance: on data sets of millions of entries, the response time of a single query is less than 10 ms, and the accuracy of the top-10 nearest neighbors is 90% (measured against exhaustive search).

The algorithm system is mainly used to control portraits of politically sensitive people and the unauthorized use of celebrity images. Over the whole of Singles' Day, the audit ratio generated by the algorithm system was about 0.01%. The algorithm hit 1,613 live broadcasts in total, 38 of which were correct hits. Of those 38 broadcasts, 17 contained the image of a controlled person in the background, 8 used the image of a controlled person as a mask, 7 involved RMB banknotes, 2 used a controlled person for advertising, 3 vilified a controlled person, and 1 was a live news broadcast. Of the 38 broadcasts, 14 were judged to be violations according to the business control standards.

During the whole of Double 11, there were 15 violating live broadcasts involving the 99 core controlled figures, and only one was not matched by the algorithm, giving an overall recall of 93.3%. For well-known reasons, violation cases involving portraits of politically sensitive people cannot be shown. Here are some cases of users using photos of celebrities to participate in Lianliankan:

Illustration of users impersonating celebrities to take part in the Lianliankan game
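The hit accuracy and recall implied by the figures reported above can be checked with a few lines of arithmetic. The variable names are ours; the numbers are taken directly from the text.

```python
hits_total = 1613      # live broadcasts hit by the algorithm
hits_correct = 38      # hits that were correct
violations_total = 15  # violating broadcasts involving core controlled figures
violations_missed = 1  # violating broadcasts the algorithm did not match

hit_accuracy = hits_correct / hits_total
recall = (violations_total - violations_missed) / violations_total

print(f"hit accuracy: {hit_accuracy:.2%}")  # hit accuracy: 2.36%
print(f"recall: {recall:.1%}")              # recall: 93.3%
```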

There are two reasons why one might think the algorithm’s hit accuracy is not high:

1) The overall audit ratio is very low. In order to guarantee recall, the threshold is set relatively low;

2) Some female celebrities are included in the list of controlled figures, and anchors can easily look like one of them. For example, the following two female anchors are easily identified as Yang Mi.

Female anchors who resemble celebrities

2. Optimization of multimedia processing cluster

To balance the timeliness of control against computing resources, in practice we capture a frame from each live stream every 5 seconds, save the picture to OSS, and push a message to the security interface. The interface layer passes the message to the rules layer. The rules configured there determine which algorithms to call for each screenshot, evaluate the results returned by the algorithms, and send messages to the audit system.

Block diagram of the overall live broadcast control system
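As an illustration of what the rules layer decides, the sketch below uses a small rule table that lists the algorithms to call for a scene and the result level at which a screenshot is forwarded to the audit system. The table layout, the names, and the face similarity cutoff of 80 are hypothetical; only the pornography score cutoff of 50 comes from the text.

```python
# Hypothetical rule table: algorithms to call per scene, each with the
# result value at or above which the screenshot goes to the audit system.
RULES = {
    "live_stream": {
        "porn_detection": 50.0,   # from the text: scores of 50+ need review
        "sensitive_face": 80.0,   # assumed similarity cutoff
    },
}


def algorithms_for(scene):
    """Names of the algorithms that must be called for this scene."""
    return list(RULES.get(scene, {}))


def audit_reasons(scene, results):
    """Algorithms whose result reaches the configured cutoff."""
    thresholds = RULES.get(scene, {})
    return [
        name
        for name, value in results.items()
        if name in thresholds and value >= thresholds[name]
    ]


print(algorithms_for("live_stream"))
print(audit_reasons("live_stream", {"porn_detection": 12.0, "sensitive_face": 91.0}))
```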

The problem we face is that 5,400 concurrent video streams each need feedback within 5 seconds, because too long a delay leaves risky content exposed. The image algorithm services themselves consume a lot of computing resources and are the bottleneck of the system. We therefore adopted the following countermeasures.

2.1 Decouple applications through message access

Calling the algorithm service synchronously is the simplest approach and the easiest to maintain, but it faces three main problems: 1) synchronous access consumes more resources on the caller's side; 2) if the algorithm service misbehaves, the caller's main flow is affected; 3) the number of pictures far exceeds what the audit staff can handle, so operations staff can only cover the videos with potentially high risk, and streams without such risk do not need to flow into the audit system. We therefore decided to use asynchronous access, despite the maintenance cost it brings.

2.2 Reduce access costs through asynchronous callback

After receiving an asynchronous message, a node calls the algorithm service. With synchronous calls, many threads sit blocked on I/O, so a large number of task threads, and therefore many nodes, are needed. With asynchronous callbacks, task threads are reclaimed immediately, which reduces the number of task threads and saves nodes. About 70% of nodes were saved in this project.
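The effect of not blocking on I/O can be illustrated with Python's asyncio. This sketch uses coroutines in place of the callback mechanism the article refers to, and `call_algorithm` is a hypothetical stand-in that sleeps instead of making a network request; the point is only that one thread can keep many calls in flight.

```python
import asyncio


async def call_algorithm(name, image_id):
    # Stand-in for a non-blocking network call to an algorithm service.
    await asyncio.sleep(0.01)
    return {"algorithm": name, "image": image_id}


async def handle_messages(image_ids):
    # All calls are in flight at once on a single thread; while one call
    # waits for its response, the event loop is free to start others.
    calls = [call_algorithm("porn_detection", i) for i in image_ids]
    return await asyncio.gather(*calls)


results = asyncio.run(handle_messages(range(100)))
print(len(results))
```

With blocking calls, the same 100 requests would each occupy a thread for the duration of the wait, which is the cost the asynchronous design avoids.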

2.3 Increase throughput through batch processing

In live broadcast control, two algorithms are called for each screenshot, and the previous approach was to send two messages per picture. Because the service can run multiple algorithms internally in parallel without blocking, calling several algorithms on one picture costs about the same as calling one. We therefore merge the messages that request different algorithms for the same picture into a single message. Throughput doubled, and machine cost, which is assessed by QPS, was halved.
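The merging step can be sketched as grouping messages by picture. The message fields (`image`, `algorithm`, `algorithms`) are hypothetical, chosen only to show the idea of turning two per-algorithm messages into one per-picture message.

```python
def merge_messages(messages):
    """Combine per-algorithm messages for the same picture into one message."""
    merged = {}
    for msg in messages:
        entry = merged.setdefault(
            msg["image"], {"image": msg["image"], "algorithms": []}
        )
        # Keep each requested algorithm once, in the order first seen.
        if msg["algorithm"] not in entry["algorithms"]:
            entry["algorithms"].append(msg["algorithm"])
    return list(merged.values())


incoming = [
    {"image": "a.jpg", "algorithm": "porn_detection"},
    {"image": "a.jpg", "algorithm": "sensitive_face"},
    {"image": "b.jpg", "algorithm": "porn_detection"},
    {"image": "b.jpg", "algorithm": "sensitive_face"},
]
print(len(incoming), "->", len(merge_messages(incoming)))  # 4 -> 2
```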

2.4 Peak shaving and exception protection

Although the live streams peak at 5,400 concurrent channels, frames are captured only every 5 seconds, so capacity does not need to be provisioned for the instantaneous peak. Smoothing the peak over 4 seconds reduces the number of machines by 75%. In addition to conventional rate limiting, and given that the review page refreshes every 5 seconds, discarding messages that have gone unprocessed for more than 4 seconds avoids an avalanche caused by a sudden backlog of messages. All failed messages are written back to SLS and synchronized to ODPS for later troubleshooting, analysis, and recovery. We also deployed the application in two data centers for disaster recovery.
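The sketch below illustrates the two ideas in this section: dropping messages older than 4 seconds, and the arithmetic behind the 75% figure. The message layout and function name are hypothetical, and the arithmetic reflects one reading of the text, namely that the worst case is all 5,400 streams delivering a screenshot in the same second.

```python
from collections import deque

MAX_AGE_SECONDS = 4.0  # from the text: older unprocessed messages are dropped


def split_fresh_and_stale(queue, now, max_age=MAX_AGE_SECONDS):
    """Drain the queue, separating messages still worth processing from stale ones."""
    fresh, stale = [], []
    while queue:
        msg = queue.popleft()
        if now - msg["created_at"] > max_age:
            stale.append(msg)   # too old: the review page has already refreshed
        else:
            fresh.append(msg)
    return fresh, stale


# Assumed worst case: every stream's screenshot arrives in the same second.
burst_per_second = 5400
# Spreading that burst evenly over 4 seconds divides the per-second load by 4.
smoothed_per_second = burst_per_second / 4
saving = 1 - smoothed_per_second / burst_per_second
print(smoothed_per_second, saving)  # 1350.0 0.75
```

In a real deployment the stale messages would be logged rather than silently lost, in line with the write-back to SLS and ODPS that the text describes.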

Architecture diagram of the algorithm service system

Based on earlier experience with e-commerce picture scenarios, 95% of algorithm requests returned within 3 s before this launch. After launch, 98% of measured requests returned within 600 ms, with an average of 200 ms and lower resource consumption. Although the two scenarios are not fully comparable, this at least shows that our algorithm service is fully capable of handling the real-time requirements of live broadcast control.

Now for the advertisement: how can you improve the effectiveness and accuracy of prevention and control while reducing cost? How can you use the Singles' Day live content risk control technology yourself?

Built on the technologies and algorithms above, Ali Green Network is a content risk prevention and control product, proven in extensive practice (with technical support from Aliju Security). Through a low-cost, one-time integration, it lets users detect risks in audio, video, pictures, text, and other forms of content, covering violence and terrorism, pornography, politically sensitive content, advertising, and more. It also integrates seamlessly with cloud products such as OSS and ECS. Ali Green Network saves more than 90% of labor cost, returns results within seconds, and achieves an accuracy above 99%. For more product information or a trial, see: https://www.aliyun.com/product/lvwang

For more security technology articles from Ali, please visit the Aliju Security blog.