While the Internet brings convenience to people's lives, it also carries a great deal of harmful information, and preventing the spread of harmful content on Internet platforms has drawn attention from many quarters. This talk shares the practical experience of the NetEase Yidun (E-Shield) algorithm team in the field of content security from a technical perspective, including schemes for optimizing the performance of deep learning image algorithms in complex scenes.

By Li Yuke
Organized by LiveVideoStack
Good afternoon, everyone. I'm Li Yuke from NetEase Yidun. My topic this time is the application of deep learning image algorithms in the field of content security. NetEase Yidun (E-Shield) is a leading content security and business security service provider in China, and I am honored to have the opportunity to share Yidun's practical work in artificial intelligence here.


In the early stage, I mainly worked on membership-marketing algorithms in the e-commerce field, where the main task was issuing different types of coupons to different users; I was responsible for several hundred million yuan and was on edge every day. More recently, I have mainly driven the optimization of Yidun's content security algorithms, including the identification of pornographic and other prohibited content.


This talk is divided into two parts: showcasing landed AI business cases, and sharing the algorithm optimization process NetEase Yidun has distilled from long-term work. The concept of artificial intelligence has been hot for at least two years, but compared with the hype, commercially landed cases are still relatively limited; much of the time traditional technology is simply packaged as artificial intelligence. Commercialization of deep learning in the true sense is concentrated in biometric identification, interactive entertainment, and similar areas. In addition, although artificial intelligence, and deep learning methods in particular, have made great breakthroughs in academia, they still face great challenges when applied to practical industrial scenes.




1. Background


1.1 Internet content security



The Internet has penetrated every aspect of life. As the volume of information grows, harmful information spreads rapidly online, especially on social platforms; on some of our clients' platforms, pornography accounts for more than 10% of content. Worse, some apps aimed at teenagers also contain a lot of content unsuitable for children. Harmful content roughly includes pornography, advertising, spam, terror-related material, and other illegal information. Take pornography as an example: on today's Internet it is no longer only simple, explicit pornography, but increasingly suggestive, obscure, borderline vulgar information, which clearly makes identification much harder. Having done this work and seen many samples, I personally feel it is necessary to clean up such information on public platforms.


In my impression, over the past two years, and especially in recent months, there has been a lot of news about Internet apps being taken down. A large part of the reason is that users uploaded harmful information and the platform failed to detect and stop it in time, so it spread and was then discovered by the regulatory authorities, leading to removal.


In fact, not only Internet enterprises: any enterprise whose product lets users upload content faces this major security problem. It is genuinely difficult to find a small amount of illegal content in massive data, and harmful information covers a wide variety of types whose business definitions are also very complex. As a content security service provider, Yidun needs to provide enterprises with a series of illegal-content identification services, including pornography and terror-related content detection, to ensure that there are no problems in enterprise content.



In fact, this issue has attracted the attention of various government departments, and relevant laws and regulations have been issued one after another, playing an important guiding role in Internet content governance and providing a clearer direction for related work.
1.2 Manual audit and machine audit



For Internet content audit, the earliest implementations were mainly manual. Although the intuitive feeling is that manual audit is more reliable, purely manual audit actually brings many problems. First, as business volume grows and data surges, labor costs become too large: one reviewer can check about 40,000 pictures a day, so a platform with 1 million pictures per day needs 25 reviewers, with an annual investment of around 2 million yuan, a huge burden for small and medium-sized enterprises. Second, in practice, the speed of manual audit does not match the speed at which data is generated, so user-uploaded data lags behind; and because the data volume is so large, the accuracy of manual audit is not as high as one might imagine. Finally, the audit standards are very complex, which makes training reviewers very difficult.
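To make the arithmetic explicit, here is a minimal back-of-the-envelope sketch in Python; the throughput and cost figures are the ones quoted above, and nothing else is assumed:

```python
# Back-of-the-envelope cost of purely manual review, using the figures above.
DAILY_VOLUME = 1_000_000       # images uploaded to the platform per day
REVIEWER_THROUGHPUT = 40_000   # images one reviewer can check per day
ANNUAL_COST_YUAN = 2_000_000   # quoted annual investment for the team

reviewers_needed = DAILY_VOLUME // REVIEWER_THROUGHPUT    # -> 25 reviewers
cost_per_reviewer = ANNUAL_COST_YUAN / reviewers_needed   # -> 80,000 yuan/year
print(reviewers_needed, cost_per_reviewer)
```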



With the growth of content-audit demand in recent years, Yidun has made many attempts on the technical side. At first, machine audit was carried out with black/white list libraries and a rule system. Then traditional CV methods were gradually added, using image texture, skin color, and other cues to assist. Finally, a content-audit method based on deep learning was developed.
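As a taste of what the traditional-CV stage looked like, here is a minimal skin-tone-ratio heuristic using OpenCV; the YCrCb thresholds are common textbook values, not Yidun's actual rules, and the 0.4 cutoff is purely illustrative:

```python
import cv2
import numpy as np

def skin_ratio(bgr_image: np.ndarray) -> float:
    """Fraction of pixels that fall in a common YCrCb skin-tone range."""
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)
    # Widely cited skin thresholds on the Cr/Cb chroma channels.
    mask = cv2.inRange(ycrcb, (0, 133, 77), (255, 173, 127))
    return float(np.count_nonzero(mask)) / mask.size

image = cv2.imread("upload.jpg")
if skin_ratio(image) > 0.4:   # illustrative cutoff
    print("route image to further review")
```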


Personally, I mainly experienced the third stage, the deep learning stage. Before deep learning, it was actually very difficult for machines to audit content using pure image methods.
1.3 Challenges faced by machine audit



Before introducing specific solutions, we should first mention the difficulties of machine audit with deep learning technology, which fall into two stages: massive data-resource requirements in the early stage, and investment in operation and maintenance in the later stage. The early problems arise mainly because deep learning depends on data, while prohibited data is relatively scarce, hard to collect, and limited in the types it covers. After the model goes online, there is the problem of sample attack and defense, which is unavoidable for all services; this confrontation requires a lot of manpower and material resources.


1.4 Early-stage data barriers



The figure above lists some of the barriers to early data accumulation. For example, a client platform once had some wrist-cutting (self-harm) pictures; because such data is scarce, collecting it externally is also very difficult. The team then turned to data service providers, but none of them could supply similar data. This lack of samples caused great difficulty for machine audit, because deep learning methods still need data to learn from.
1.5 Late-stage sample attack and defense



Attack-and-defense confrontation is a common problem in the security field. As is well known, deep learning methods themselves suffer from adversarial examples: if minor changes are added to an image that can be recognized, the model may misrecognize it or classify it into another category. When users find that illegal pictures with obvious features are automatically identified, they perform a series of tampering operations on the pictures so that the system cannot recognize them. Take advertising as an example: on the left is an advertising picture with obvious features, and on the right is the same image after users have tampered with it to attack the defense system.
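The textbook form of this weakness is the adversarial example. A minimal FGSM (fast gradient sign method) sketch in PyTorch shows how a barely visible perturbation can flip a classifier's output; FGSM is the standard academic illustration, not necessarily the kind of tampering real users perform:

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model: torch.nn.Module, image: torch.Tensor,
                 label: torch.Tensor, epsilon: float = 0.01) -> torch.Tensor:
    """One-step FGSM. image: [1, C, H, W] in [0, 1]; label: [1] class index."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Nudge every pixel in the direction that raises the loss: a tiny,
    # near-invisible change that often changes the prediction.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()
```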
1.6 Technologies used in deep learning



With the difficulties mentioned above in mind, the team made a preliminary exploration of deep learning in this field. In terms of specific methods, the basic techniques adopted are the two most common in deep learning: image classification networks and object detection networks. An image classification network extracts features and then classifies the image as a whole, while object detection extracts features and then labels object positions and categories within the image. In practice, different models and services are often combined to handle different kinds of prohibited content.
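To make the two building blocks concrete, here is a minimal sketch using off-the-shelf torchvision models; the production models certainly differ, and this only illustrates the two kinds of output:

```python
import torch
from torchvision import models

# Classification: one label distribution for the whole image.
classifier = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()

# Detection: boxes + labels + scores for regions inside the image.
detector = models.detection.fasterrcnn_resnet50_fpn(
    weights=models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT).eval()

image = torch.rand(3, 480, 640)  # stand-in for a preprocessed upload
with torch.no_grad():
    class_logits = classifier(image.unsqueeze(0))  # shape [1, 1000]
    detections = detector([image])[0]              # dict of boxes/labels/scores
```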


In publicity, people tend to link deep learning with human intelligence. I myself come from a neuroscience background, and the two do share some similarities: for example, the visual system has a V1 - V2 - V4 - MT hierarchy, and the receptive field (RF) bears some resemblance to convolution kernels. But the connection is relatively weak, and neuroscientists tend to disagree with this correspondence. Object detection, semantic segmentation, and similar methods in particular have a very strong flavor of engineering design. So although deep learning is effective, strictly speaking it is still far from brain intelligence in the real sense. Intuitively, deep learning depends heavily on data and its reasoning ability is limited: to learn to distinguish cats from dogs, it needs to see a large amount of data, whereas the human brain does not need nearly as many samples to make the distinction well. Deep learning, then, is not that similar to how the human brain works, nor a direct imitation of it, but it is very useful in engineering applications.
2. Initial exploration


Image source for this slide: https://github.com/hoya012/deep_learning_object_detection


In early algorithm work, engineers often fall into a trap: because deep learning has developed so fast in recent years, with all kinds of networks emerging endlessly and many achieving stunning results on public datasets, algorithm engineers pay too much attention to the methods themselves and to tuning parameters on limited business data. In practice, however, we find that the advantage of any particular method is actually very limited, and methods that excel on public datasets may not apply to actual business datasets. In my opinion, there are two reasons for this:


1. Open-source approaches are usually validated on open datasets, and open training data is only a subset of the real world.


2. The algorithms themselves do not have strong reasoning ability and depend more on data.



Working in this mode, a large number of inexplicable misjudgments and unrecalled borderline samples appear online. A typical misjudgment is shown on the right of the image above, and there are many more like it online.
3. The optimization process


3.1 Sorting out the algorithm optimization process



After the problems above were exposed, the team began to sort out the algorithm optimization process and did the following series of work. The first thing is to define the business standards: each subcategory needs a clear description, giving labeling and judgment a clear basis. It is also necessary to clarify the importance of each problem, consider things from an overall perspective, give up some sporadic samples, and concentrate on solving one type of problem at a time. Taking advertising as an example: the image above left contains pornographic pictures of women together with contact-information promotion, which we call "beauty advertising"; the images above right contain invoices, follower-boosting services, counterfeit goods, loans, and part-time-job offers, with obvious traffic-diverting information promoting third parties, which we call spam advertising.


3.2 Test Standards



Once the standards are defined at the source, the next step is to establish test sets and test standards that better reflect online performance. Testing is not just a matter of selecting a batch of data for training and testing during the algorithm demo phase and then shipping once the test-set results look good; that is far from enough. Five test types are listed in the figure above. The first is a relatively basic test set, with on the order of 100,000 to 1,000,000 samples, which gives a baseline sense of algorithm performance. The second is online data: tens of millions of online samples are pulled down for local testing; since the violation rate of online data is relatively low, misjudgment on online traffic can be assessed by checking the recalled data. The third is specific-type sets, which fix high-frequency types into standing tests to evaluate missed judgments. The fourth is collecting historical feedback between two releases to evaluate whether performance has improved. The fifth is pre-launch testing, which simulates online conditions to reduce unexpected problems.
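For reference, the quantities such test sets report can be computed from a simple confusion count; a minimal sketch, where the terminology mirrors the text (missed judgments are false negatives, misjudgments are false positives):

```python
def audit_metrics(tp: int, fp: int, fn: int) -> dict:
    """tp: violations caught, fp: clean content flagged, fn: violations missed."""
    return {
        "precision": tp / (tp + fp),   # how trustworthy a machine flag is
        "recall":    tp / (tp + fn),   # share of violations actually caught
        "miss_rate": fn / (tp + fn),   # the "missed judgment" rate, 1 - recall
    }
```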


3.3 Data Level


Image source for this slide: http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex8/ex8.html


Regarding data, in the course of algorithm maintenance Yidun has optimized at the data level by establishing a closed-loop process for online data. After passing through the model, data flows back to the labeling system together with its scores. However, large-scale labeling is expensive, so not all of it is labeled; Yidun pays particular attention to two kinds of data. The first kind is data the model judges with confidence, i.e. the data in the upper right corner of the figure. This data has obvious features, and feeding it back into model training actually raises the model's overall score. In the service, results are divided into two levels, confirmed and suspected. For the confirmed level, precision is above roughly 99% (on the latest porn model), and the machine can process this data directly (interception, deletion). For the suspected level, precision is above roughly 80% (on the latest porn model), and the team still prefers a second, manual audit; if there really is a problem, it is processed further. In online applications, the ratio of confirmed to suspected is about 2:1 to 3:1, with confirmed results in the majority, which also benefits from continuously flowing confirmed data back to strengthen the model.
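A minimal sketch of that two-tier routing; the threshold values are assumptions, standing in for cutoffs calibrated so that "confirmed" holds roughly 99% precision and "suspected" roughly 80%:

```python
def route_by_score(score: float,
                   confirmed_cutoff: float = 0.98,   # assumed calibration
                   suspected_cutoff: float = 0.80) -> str:
    """Map a model score to the handling tier described above."""
    if score >= confirmed_cutoff:
        return "confirmed"   # machine acts directly: intercept or delete
    if score >= suspected_cutoff:
        return "suspected"   # routed to a second, manual audit
    return "pass"            # published; may still flow back as training data
```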


The second kind is the data points sitting on the decision boundary, akin to the questions one is unsure of in an exam. These are collected into problem sets so that training can focus on them. In this way the model trains better near the boundary, and its explainability and certainty improve.
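Assembling that problem set can be as simple as keeping the samples whose scores sit near the decision boundary; a hedged sketch, where the (0.4, 0.6) band is an assumption:

```python
def boundary_samples(scored_items, low: float = 0.4, high: float = 0.6):
    """Keep the items the model is least certain about for priority labeling."""
    return [(item, score) for item, score in scored_items
            if low <= score <= high]
```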


3.4 Optimization of missed judgment




In terms of optimizing missed detections, NetEase Yidun has also made many attempts, mainly along the points shown above: data backflow to expand positive samples, targeted data collection, object-detection assistance, and technical aids such as FPN + attention, all of which are packaged together into Yidun's overall identification service.



For model-layer optimization, NetEase Yidun tried feature pyramids and attention mechanisms. The feature pyramid is used more in detection networks: by fusing low-level detail information with high-level semantic information, it alleviates scale problems and the difficulty of detecting small targets, and it is relatively effective in practice. The attention mechanism is mainly used at the classification stage; it reweights specific regions, increasing the focus on them, which helps pin down ambiguous information. These two methods are commonly used in the model-layer optimization stage.
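As one concrete instance of the attention mechanism mentioned here, a squeeze-and-excitation style channel-attention block in PyTorch; this is a common choice for classification backbones, and the text does not say which variant Yidun actually uses:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation: reweight feature channels by learned importance."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.fc(x.mean(dim=(2, 3)))  # squeeze: global average pool
        return x * weights.view(b, c, 1, 1)    # excite: rescale each channel
```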


3.5 Assisted by multiple technical means



Externally, NetEase Yidun's deep learning image algorithm appears as a single advertising service: the outside world only sees an image uploaded and a verdict on whether it is an advertisement. Internally, however, it combines many technical means, and the image actually passes through many steps.
3.6 OCR assistance




Although there are many OCR products on the market, NetEase Yidun still invested a lot of energy in OCR technology matched to content security scenarios, mainly solving problems of tilt, upside-down text, affine transformation, horizontal and vertical layouts, special fonts, complex typesetting, and handwriting. These cases are relatively uncommon in general scenarios but very common in content security, and general-purpose OCR often does not handle such images particularly well.
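One of the simpler corrections, straightening a tilted text image before recognition, can be sketched with OpenCV; this is a generic deskew recipe, not Yidun's pipeline, and note that the angle convention of minAreaRect differs across OpenCV versions:

```python
import cv2
import numpy as np

def deskew(gray: np.ndarray) -> np.ndarray:
    """Estimate the dominant text tilt and rotate the image upright."""
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:                     # normalize (convention varies by version)
        angle -= 90
    h, w = gray.shape
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(gray, matrix, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
```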
3.7 Image library assistance



At the same time, NetEase Yidun does some processing for pictures that occasionally slip through but cannot otherwise be handled. For example, when some pictures are missed and the model cannot be updated in time to fix the problem, they are added to an image library. The image library here is not the traditional MD5 or hash library, but one based on deep learning: it extracts whole-image features and combines them with traditional local features, a technique known as homologous (near-duplicate) image retrieval. After an image is added, similar images will also be recalled, which greatly improves robustness to cropping and alteration.
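A minimal sketch of the deep-feature half of such a library: embed every image with a CNN and recall near-duplicates by cosine similarity. The traditional local features the text mentions are omitted, and the 0.85 threshold is an assumption:

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Strip the classification head to use the CNN as a global feature extractor.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def embed(image: torch.Tensor) -> torch.Tensor:
    """image: [3, H, W] preprocessed tensor -> L2-normalized 2048-d embedding."""
    return F.normalize(backbone(image.unsqueeze(0)), dim=1).squeeze(0)

def recall_similar(query_emb: torch.Tensor, library_embs: torch.Tensor,
                   threshold: float = 0.85) -> torch.Tensor:
    """Indices of banned-library images whose cosine similarity clears the bar."""
    scores = library_embs @ query_emb          # [N] cosine similarities
    return (scores >= threshold).nonzero().flatten()
```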


3.8 Summary of optimization process



In my opinion, the two most important points in Yidun's algorithm optimization are problem definition and data acquisition. The selection and tuning of models cannot be ignored, of course, but the former determines the lower bound of the algorithm service, and the latter raises the service's upper bound on the premise that the lower bound is guaranteed. Yidun's closed-loop algorithm optimization is shown on the right of the figure above, and the same process applies in deep learning.
3.9 Service Effect



Based on the series of optimization logic above, and through continuous effort, Yidun has achieved good business results. First, overall business performance can satisfy operations and product teams; of course there is still some difficult data, which is being solved step by step through the optimization process. In addition, Yidun has expanded its identification range, and its overall identification capability is now quite complete. It is precisely because of this optimization experience that Yidun can respond quickly to new needs by choosing the best-fitting template. In terms of statistics, the overall miss rate of the core modules (including pornography, politically sensitive content, advertising, etc.) is controlled within 3 per 10,000, and precision is maintained above 97%. Compared with competitors' products, achieving such indicators is relatively advantageous in the industry at present.
3.10 Further work



Beyond the work mentioned above, much can still be advanced, such as the refinement of business output, model-level refinement, and the model performance optimization mentioned above.




The chart above lists the work Yidun has done in horizontally expanding its image business, such as logo recognition and flag recognition. Yidun also works independently along two lines: platform support and intensive cultivation. For the former, because Yidun has accumulated good practices and well-honed templates, it can respond to demand and solve problems quickly; for scenarios with demanding performance requirements, Yidun tends to optimize existing business with vertical, targeted model optimization. The two lines do not conflict and apply to different application scenarios.
3.11 Algorithm architecture diagram


The red boxes mark the parts of the algorithm mainly concerned with model-layer optimization.


As the algorithm architecture diagram above shows, the algorithm itself is only a small part of the work; it takes cooperation from operations, engineering, and annotation teams to bring the algorithm to a satisfactory state.
3.12 Algorithm extension



All of the above covers image-side optimization and hands-on experience, but Yidun does parallel work across text, images, video, and audio.


3.13 Audio technology



The audio services Yidun can provide are sound detection and language detection. Sound detection covers types such as snoring, moaning, ASMR, and gunshots or explosions; language detection provides identification of minority languages. In language recognition, Yidun compared machine recognition with human recognition and found the machine's accuracy to be higher. The processing pipeline extracts spectrogram features from audio segments and feeds these traditional spectral features into a CNN for classification; in fact, this simple approach achieves good results.
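The described pipeline, spectrogram features fed to a CNN, can be sketched as follows; librosa supplies the log-mel features, and the toy network merely stands in for whatever architecture is actually deployed (the four classes follow the examples above):

```python
import librosa
import torch
import torch.nn as nn

def audio_to_logmel(path: str, sr: int = 16000, n_mels: int = 64) -> torch.Tensor:
    """Load an audio clip and turn it into a log-mel spectrogram 'image'."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)
    return torch.from_numpy(logmel).float().unsqueeze(0)  # [1, n_mels, frames]

# Toy CNN over the spectrogram; 4 classes, e.g. snoring/moaning/ASMR/gunshot.
classifier = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 4),
)
logits = classifier(audio_to_logmel("clip.wav").unsqueeze(0))  # shape [1, 4]
```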


3.14 Honors


Honors won by NetEase Yidun



A-level certification in the China Artificial Intelligence Competition
4. Summary



Finally, to summarize this talk: first, pay more attention to the definition of the problem and keep a global perspective. Second, data: pay attention to effective data collection and concentrate on processing effective information. On cost, labeling is expensive, so be selective about what to label; cost also involves machines, and only breakthroughs in performance optimization can yield better savings in machine cost. On customization, the degree of refinement should be determined by the scenario: general scenarios do not require high refinement, and I personally suggest customization for scenarios that do.