Artificial intelligence (AI) is increasingly used to audit user-generated content (UGC) on the Internet, improving the efficiency of manual audit. AI auditing should consider both whether a single piece of content violates the rules and the user who posted it. This article shares iQiyi's practice of intelligent audit in its audit business: modeling and rating the security of users, then combining that rating with AI content security detection to build a more intelligent and efficient UGC audit strategy.
Internet content security audit refers to the review of video, audio, images, text, and other content produced by users. Only compliant content may be published, which keeps the network environment clean.
The audit strategy currently used in the industry is AI plus human review. The AI part mainly uses deep learning models to identify violating information related to politics, pornography, violence, and terrorism in content such as images, text, and audio; this is AI content security detection. Figure 1 shows the current mainstream intelligent audit solution.
Figure 1: Mainstream intelligent auditing solutions
However, building an AI content security detection service is costly. On the one hand, each type of violation, such as pornography or violence and terrorism, requires a large amount of high-quality labeled data, most of which is labeled by hand. On the other hand, reaching usable accuracy generally requires deep models, and both training and inference rely on expensive GPU resources.
The industry also recognizes that content security should not be judged from a single piece of content alone but should also draw on higher-level dimensions, such as the user who uploaded it. This mainly takes two forms: (1) maintaining user blacklists and whitelists; (2) defining rules and scoring users based on historical data. These approaches suffer from low accuracy and a high misjudgment rate.
To address these problems, this article borrows from user risk control and proposes a user safety rating: a machine learning algorithm models each user's credibility and security and predicts the user's future behavior. Combined with AI content security detection and manual audit, this rating supports a flexible audit strategy that realizes intelligent audit and has been deployed in iQiyi's audit business.
UGC intelligent audit
Scheme design
The UGC intelligent audit scheme proposed in this article consists of two parts:
User security rating. The rating evaluates how safe a user is. To be quantifiable and objective, it mainly measures the pass rate and high-risk rate of the content the user uploads. To be forward-looking, it is predicted by an AI algorithm.
Intelligent audit strategy. After a user uploads content, the user's security rating is looked up first. For content from high-level users, the AI content security detection service is not invoked. The rating also affects audit priority: a higher user level means a shorter audit time.
Therefore, the scheme has the following advantages:
1. The user safety rating is defined objectively and is predictive of future behavior, so it is highly accurate;
2. Cost savings. User safety ratings do not rely on manual labeling and are computationally lightweight, and they also reduce the AI computing resources needed for content detection;
3. Shorter audit time for compliant content, which improves user experience;
4. Assistance for auditors, which improves the efficiency of manual audit.
User safety rating
User safety rating is modeled using self-supervised machine learning. First, business data is collected, such as users' historical upload and audit records, user portraits, and registration and authentication information. Next, user features are computed statistically and safety levels are labeled automatically to form training data. Feature engineering (extraction, screening, and transformation) is then carried out according to business characteristics. Finally, a suitable AI algorithm is selected, the model is trained, and its effect is evaluated. The overall process is shown in Figure 2:
Figure 2: The construction process for user security ratings
-- Unsupervised annotation of user safety rating
Users are divided into security levels based on the audit results of the content they upload. The higher the level, the safer the user's UGC: the probability of approval is higher and the severity of violations is lower. The reverse holds for lower levels.
The annotation of user security levels should also be forward-looking, so it relies mainly on how users perform in a "future" period, with historical performance as a secondary reference. Manual annotation is a poor fit here. On the one hand, the evaluation standard is hard to define. On the other hand, user behavior changes over time, so a batch of users would have to be relabeled frequently to keep the model up to date, which is labor-intensive.
Therefore, this scheme adopts an objective and direct level evaluation standard that needs no manual supervision. Different weight coefficients are set for different time intervals (such as the next day, the past day, and the past seven days), and a weighted PassRate is calculated for each user.
Here P_i is the amount of approved content produced by a user within time interval i, T_i is the total content produced within that interval, and w_i is the weight coefficient of the interval.
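The exact formula is not given in this text. As a minimal sketch, assume the pass rate is a weight-normalized sum of per-interval pass ratios, PassRate = sum(w_i * P_i / T_i) / sum(w_i), taken over intervals in which the user uploaded something. This form, the function name, and the example weights are all assumptions made for illustration:

```python
def weighted_pass_rate(intervals):
    """Compute an assumed weighted pass rate.

    intervals: list of (passed, total, weight) tuples, one per time interval.
    Intervals with no uploads (total == 0) are skipped.
    Returns None when the user uploaded nothing in any interval.
    """
    numerator = 0.0
    weight_sum = 0.0
    for passed, total, weight in intervals:
        if total == 0:
            continue
        numerator += weight * passed / total
        weight_sum += weight
    return numerator / weight_sum if weight_sum > 0 else None


# Hypothetical weights: the "future" interval counts most, history less.
example = [
    (8, 10, 0.5),   # next day: 8 of 10 items approved
    (9, 10, 0.3),   # past day: 9 of 10 approved
    (45, 50, 0.2),  # past seven days: 45 of 50 approved
]
rate = weighted_pass_rate(example)
```

Giving the future interval the largest weight mirrors the statement that future performance is primary and historical performance secondary.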
Combined with confidence measures (such as the variance of the pass rate and the total upload volume), a standard is defined for each security level, and each user is assigned to the corresponding level. For example, there may be 10 levels.
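To illustrate what such a standard might look like, the sketch below maps a pass rate to one of 10 levels by equal-width bucketing and caps the level when the upload volume is too small to trust. The bucket boundaries, the `min_uploads` cutoff, and the cap value are assumptions for the example, not the production rules:

```python
def assign_level(pass_rate, total_uploads, min_uploads=5,
                 low_confidence_cap=5, n_levels=10):
    """Map a pass rate in [0, 1] to a level in 1..n_levels (hypothetical rule).

    Users with fewer than min_uploads uploads are capped at
    low_confidence_cap, because their pass rate is a weak signal.
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be in [0, 1]")
    # Equal-width buckets; a pass rate of exactly 1.0 falls in the top level.
    level = min(int(pass_rate * n_levels) + 1, n_levels)
    if total_uploads < min_uploads:
        level = min(level, low_confidence_cap)
    return level
```

A production rule could also use the variance of the pass rate as a confidence signal; it is left out here to keep the example short.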
-- Feature engineering and model training
The required features are mainly users' historical behaviors in the audit service, such as pass rate, upload volume, and high-risk volume over the past 1, 7, and 30 days. User data from other services, such as registration information, user portraits, and report data, can also be collected wherever possible to reflect user security from other angles.
The training data is constructed by joining the users' feature data with the security level annotations from the previous section. The sample format is shown in Table 1:
Table 1: Sample format for the user security level model
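A minimal sketch of how such samples could be assembled: per-user feature records are joined with level labels on user ID. The feature names are hypothetical, chosen to mirror the 1-, 7-, and 30-day statistics described above:

```python
def build_samples(features_by_user, level_by_user):
    """Join feature dicts with level labels on user ID.

    Users that have features but no label are skipped.
    Returns a list of (user_id, row) pairs, where row holds the
    features plus a "label" key.
    """
    samples = []
    for user_id, feats in sorted(features_by_user.items()):
        if user_id not in level_by_user:
            continue
        row = dict(feats)  # copy so the input is not mutated
        row["label"] = level_by_user[user_id]
        samples.append((user_id, row))
    return samples


# Hypothetical feature names and values.
features = {
    "u1": {"pass_rate_1d": 1.0, "pass_rate_7d": 0.9, "uploads_30d": 40},
    "u2": {"pass_rate_1d": 0.2, "pass_rate_7d": 0.4, "uploads_30d": 12},
    "u3": {"pass_rate_1d": 0.8, "pass_rate_7d": 0.8, "uploads_30d": 3},
}
levels = {"u1": 10, "u2": 3}
samples = build_samples(features, levels)
```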
Once the training data is constructed, a suitable algorithm can be chosen to carry out feature engineering and train the user security level model.
The user security level model is a multi-class classification problem. Because the features contain many continuous values, the GBDT (Gradient Boosting Decision Tree) algorithm is selected, using the open-source XGBoost framework. XGBoost is an efficient and robust implementation of GBDT.
To reduce feature engineering effort and improve prediction accuracy, the DeepFM deep learning algorithm is also adopted, with an improved network structure shown in Figure 3. For example, the pass rate feature can be bucketed and converted into a one-hot feature, and multi-hot features are supported. Finally, the predictions of the GBDT and DeepFM models are fused.
Figure 3: User security level DeepFM network structure
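The bucketing step and the final fusion can be sketched as follows. The number of buckets and the weighted-average fusion rule are assumptions; the article does not state how the two models' outputs are combined:

```python
def bucket_one_hot(value, n_buckets=10):
    """One-hot encode a value in [0, 1] using equal-width buckets."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must be in [0, 1]")
    idx = min(int(value * n_buckets), n_buckets - 1)
    return [1 if i == idx else 0 for i in range(n_buckets)]


def fuse_probabilities(p_gbdt, p_deepfm, w_gbdt=0.5):
    """Weighted average of two class-probability vectors (assumed rule).

    Returns (fused_probabilities, index_of_most_likely_class).
    """
    if len(p_gbdt) != len(p_deepfm):
        raise ValueError("probability vectors must have the same length")
    fused = [w_gbdt * a + (1.0 - w_gbdt) * b for a, b in zip(p_gbdt, p_deepfm)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return fused, best
```

For a 10-level model, each probability vector would have 10 entries, and the predicted level would be the returned index plus one.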
Because overall user behavior changes over time, the model is automatically updated at regular intervals to keep its effect stable. Since the annotation is unsupervised, updating is simple: each day, user features and the corresponding level labels are computed statistically to form that day's samples, which are combined with historical samples to obtain the training samples. These are randomly split into training and test sets, and the GBDT and DeepFM models are trained separately. The effect is evaluated on the test set, and if it reaches a preset threshold, the model can be published to the online service.
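The update cycle could be organized along the following lines. `train_fn` and `eval_fn` are placeholders for whatever training and evaluation routines are in use, and the 80/20 split and the threshold are assumptions:

```python
import random


def daily_update(history, today, train_fn, eval_fn, threshold,
                 test_ratio=0.2, seed=0):
    """Retrain on history plus today's samples and gate publication.

    Expects at least two samples in total so that both splits are non-empty.
    Returns (model, score); model is None when the score is below the
    threshold, meaning the new model should not be published.
    """
    samples = list(history) + list(today)
    rng = random.Random(seed)
    rng.shuffle(samples)  # random train/test split
    n_test = max(1, int(len(samples) * test_ratio))
    test_set, train_set = samples[:n_test], samples[n_test:]
    model = train_fn(train_set)
    score = eval_fn(model, test_set)
    return (model if score >= threshold else None), score
```

In the setup described above, this routine would be run once for the GBDT model and once for the DeepFM model.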
-- Model optimization
UGC can fail audit for many reasons, ranging from ordinary violations such as advertising and spam to high-risk violations such as content involving politics or pornography. High-risk content has a severe impact and causes great harm, so this article pays special attention to identifying it. Because the user security level model mainly uses the pass rate as its standard, it is weak at predicting high-risk violations. To address this, a dedicated high-risk user detection model is added to predict whether a user will post high-risk violating content in the next few days.
Labeling high-risk users is similar to labeling user security levels, but simpler: it depends mainly on whether the user posts high-risk content in the next few days. The user features are also similar to those of the user security level model. The format of the resulting training data is shown in Table 2:
Table 2: Sample format for the high-risk user detection model
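The labeling rule for high-risk users can be sketched as below. The record layout and the 3-day horizon are assumptions; the article only says "the next few days":

```python
from datetime import date, timedelta


def label_high_risk(audit_records, as_of, horizon_days=3):
    """Label each user 1 if they posted high-risk content after as_of.

    audit_records: iterable of (user_id, audit_date, is_high_risk).
    A record counts only if as_of < audit_date <= as_of + horizon_days.
    Every user that appears in the records gets a label (default 0).
    """
    end = as_of + timedelta(days=horizon_days)
    labels = {}
    for user_id, audit_date, is_high_risk in audit_records:
        labels.setdefault(user_id, 0)
        if is_high_risk and as_of < audit_date <= end:
            labels[user_id] = 1
    return labels


# Hypothetical audit records.
records = [
    ("u1", date(2021, 3, 2), False),
    ("u2", date(2021, 3, 3), True),    # inside the window
    ("u3", date(2021, 3, 10), True),   # after the window
    ("u4", date(2021, 2, 28), True),   # before as_of, history only
]
labels = label_high_risk(records, as_of=date(2021, 3, 1))
```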
The high-risk user detection model is a binary classification problem. Its main goal is to recall high-risk users and screen out high-risk content, reducing risk to the platform. It also uses the GBDT model and the improved DeepFM model. The overall process is shown in Figure 4.
After the high-risk user detection model was added, the ability to detect high-risk users and content improved greatly.
Figure 4: A comprehensive process for user security rating
-- Model inference
Thanks to the company's mature big data processing platform and machine learning platform, the collection and computation of training data, as well as the training, evaluation, and inference deployment of the user security level model and the high-risk user detection model, can be completed easily and efficiently.
When a user uploads a piece of content, the request may first pass through user risk control and other checks that determine whether the user is allowed to upload. After a successful upload, the content enters the audit process: the features required by the models are fetched by user ID, and the online inference services of the user level model and the high-risk model are then called in real time to obtain the user's security level and high-risk result.
The two predictions are fused to determine the final user security category. First, the 10 security levels can be flexibly grouped into high, medium, and low bands according to the needs of each service. For example, levels 9 to 10 are high, levels 5 to 8 are medium, and levels 1 to 4 are low. Then, depending on whether the user is high-risk, users are finally divided into three types: safe users, general users, and dangerous users, as shown in Table 3.
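A minimal sketch of this fusion, using the example level ranges given above. The mapping from band and high-risk flag to the three user types is an assumption made for illustration: any user flagged as high-risk is treated as dangerous, unflagged users in the high band are treated as safe, and everyone else is general.

```python
def level_band(level):
    """Group a 1..10 security level into a band, per the example ranges."""
    if not 1 <= level <= 10:
        raise ValueError("level must be in 1..10")
    if level >= 9:
        return "high"
    if level >= 5:
        return "medium"
    return "low"


def user_category(level, is_high_risk):
    """Fuse level and high-risk prediction into a category (assumed mapping)."""
    if is_high_risk:
        return "dangerous"
    return "safe" if level_band(level) == "high" else "general"
```

Each service could swap in its own band boundaries or mapping without touching the two models.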
Intelligent audit strategy
After the user's security category is obtained, different audit and publishing strategies can be flexibly configured to realize intelligent audit. For example, the strategy can be customized as in Table 3:
Table 3: User classification and intelligent audit strategy
Whether UGC goes through the speed channel or the ordinary channel, once it is published after review its online exposure count, exposure duration, reports, and other indicators are monitored. When an indicator reaches a certain threshold, the content is recalled and returned to the monitoring and review process. This strategy ensures that all content undergoes cross-review and manual review, possibly more than once, keeping the content safe and controllable.
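The recall check can be sketched as a simple threshold test. The indicator names and threshold values below are hypothetical, and recalling when any single indicator reaches its threshold is an assumption:

```python
def should_recall(metrics, thresholds):
    """Return True if any monitored indicator reaches its threshold.

    metrics and thresholds are dicts keyed by indicator name;
    indicators missing from metrics are treated as 0.
    """
    return any(metrics.get(name, 0) >= limit
               for name, limit in thresholds.items())


# Hypothetical indicators and limits.
thresholds = {"exposure_count": 100_000, "exposure_hours": 72, "report_count": 5}
quiet_video = {"exposure_count": 1_200, "exposure_hours": 10, "report_count": 0}
reported_video = {"exposure_count": 900, "exposure_hours": 4, "report_count": 7}
```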
As for the priority of each piece of UGC in manual audit, the uploading user's security level is an important factor in the priority calculation, alongside other factors such as the content's upload time and whether the uploader is a new user. The higher the security rating, the higher the review priority of the UGC, which shortens the review time for compliant UGC.
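One way to combine these factors is a weighted score fed into a priority queue. The linear form, the weights, and the choice to lower the priority of new users are all assumptions; the article names the factors but not how they are combined:

```python
import heapq


def priority_score(user_level, wait_minutes, is_new_user,
                   w_level=10.0, w_wait=1.0, new_user_penalty=20.0):
    """Higher score means earlier review (hypothetical weights)."""
    score = w_level * user_level + w_wait * wait_minutes
    if is_new_user:
        score -= new_user_penalty
    return score


def review_order(items):
    """Return content IDs in review order, highest priority first.

    items: list of (content_id, user_level, wait_minutes, is_new_user).
    """
    # heapq is a min-heap, so scores are negated.
    heap = [(-priority_score(level, wait, new), content_id)
            for content_id, level, wait, new in items]
    heapq.heapify(heap)
    order = []
    while heap:
        order.append(heapq.heappop(heap)[1])
    return order


queue = [
    ("a", 10, 5, False),   # high-level user, just uploaded
    ("b", 3, 5, False),    # low-level user, just uploaded
    ("c", 3, 120, False),  # low-level user, waiting two hours
]
order = review_order(queue)
```

Including waiting time in the score keeps content from low-level users from being starved indefinitely.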
To sum up, the intelligent audit strategy is shown in Figure 5:
Figure 5: Intelligent audit strategy
This scheme builds a user security rating model and defines user security standards with the pass rate and high-risk violations as the main metrics. Using unsupervised annotation, the user security level model and the high-risk user detection model are trained on historical behavior features and attributes across multiple dimensions. Combining the results of the two models makes it possible to predict the user's security type accurately. Then, together with the results of AI content security detection, safe content is screened out for review and release through the speed channel. This reduces the computing resources and time consumed by the AI content models, greatly shortens the time to review and release, and improves user experience, while manual audit ensures that the content remains safe and controllable.
After the scheme was deployed in the UGC video audit service, the computing resources used by the AI content models fell by 25%, and the average review time for premium videos fell by 80%.
In the future, we will continue to improve the accuracy of user security category prediction by optimizing the annotation scheme and improving the algorithms and models. We will also train separate models for image-and-text audit, live broadcast audit, and other businesses so that more services can benefit. The aim is to keep improving the effectiveness and efficiency of audit, optimize user experience, and ensure the content security of iQiyi.