Background

Meituan generates millions of images every day. Operations staff are responsible for reviewing this content and deleting images that carry legal risk or violate platform rules. Because the volume is so large, manual review is costly in time and effort, and reviewer throughput is limited. Moreover, review standards are hard to unify across reviewers and change frequently. Machine-assisted intelligent review is therefore necessary.

Intelligent image audit generally refers to using image processing and machine learning techniques to recognize image content and then screen images for violations. The goal is an automatic review service in which the machine automatically rejects image types that violate the rules (negative examples), automatically approves image types that comply with the rules (positive examples), and hands the images it is uncertain about over to manual review. The performance of such a system is therefore measured by two metrics: accuracy and automation rate (the fraction of images handled without human review).

The usual approach to automatic review is to exhaustively enumerate the types of non-compliant images (watermarked images, pornography, violent or terrorist imagery, celebrity faces, advertising images, and so on) and automatically approve everything else as positive examples. The problem is that this does not scale as new categories of violations appear, and nothing can be auto-filtered until all the negative-example models are in place. If we instead proactively mine images that clearly meet the requirements (such as normal portraits and images consistent with their scene) for automatic approval, and combine positive- and negative-example filtering, we can reduce manual review much sooner. Our intelligent image audit system is therefore split into a negative-example filtering module and a positive-example filtering module. Images to be reviewed first pass through the negative-example module, which decides whether they should be rejected, then through the positive-example module for automatic approval; the remaining images the machine is uncertain about go to manual review. The overall technical scheme is shown in Figure 1.
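To make the routing concrete, here is a minimal sketch of the two-stage decision flow, assuming each filter module exposes a confidence score in [0, 1]; the threshold values and interfaces are illustrative assumptions, not the production implementation.

```python
from enum import Enum

class Verdict(Enum):
    REJECTED = "rejected"      # negative-example filter: prohibited content
    APPROVED = "approved"      # positive-example filter: compliant content
    MANUAL = "manual_review"   # machine is uncertain

def audit_image(image, negative_filters, positive_filters) -> Verdict:
    """Route an image through negative filters first, then positive filters.

    `negative_filters` / `positive_filters` are callables returning a
    confidence in [0, 1]; the 0.95 thresholds are illustrative placeholders.
    """
    # Stage 1: reject if any negative-example model is confident enough.
    if any(f(image) > 0.95 for f in negative_filters):
        return Verdict.REJECTED
    # Stage 2: approve if any positive-example model is confident enough.
    if any(f(image) > 0.95 for f in positive_filters):
        return Verdict.APPROVED
    # Everything the machine is unsure about goes to human reviewers.
    return Verdict.MANUAL
```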

Both the negative-example and positive-example filtering modules involve detection, classification, and recognition, areas where deep learning is the technology of choice. The following sections introduce the application of deep learning to intelligent image audit through four examples: watermark filtering, celebrity face recognition, pornographic image detection, and scene classification.

Watermark detection based on deep learning

To protect copyright and support original content, we need to automatically detect whether images uploaded by merchants or users contain prohibited watermarks (competitors' watermarks, logos of other products). Unlike typical rigid-body targets, watermarks have the following characteristics.

  • Many styles. The offline collection covers more than 20 kinds of mainstream prohibited watermarks, each with many variants, and there are also many unknown watermarks in production.

  • Variable subject. The position of a watermark within an image is not fixed, watermarks are small, and the subject may be cropped or overlapped (multiple watermarks), as shown in Figure 2.

  • Complex background. Since most mainstream watermarks are transparent or translucent, the text in the watermark is easily interfered with by complex backgrounds, as shown in Figure 3.

Traditional watermark detection uses a sliding window: fixed-size image blocks are extracted and fed into a pretrained recognition model to obtain a category for each block. Iterating over all candidate positions in the image yields a dense category score map; blocks scoring above a threshold are treated as candidate watermark regions, and the final result is obtained by non-maximum suppression. The recognition model's features can be the edge-direction statistics commonly used in character recognition, or a CNN can learn the features to improve robustness to cropping, deformation, and complex backgrounds. To further improve confidence, class-prototype information can be added, using the similarity between the input block's features and the cluster-center features (cosine of the angle between them) as the recognition confidence. However, this method is extremely inefficient: because watermark position and size are not fixed, every position must be examined at multiple scales, producing a large number of redundant windows.
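For concreteness, below is a minimal single-scale sketch of the sliding-window pipeline just described; `classify_patch` stands in for the pretrained block recognizer, and the window size, stride, and thresholds are assumptions.

```python
def sliding_window_detect(image, classify_patch, win=64, stride=16, thresh=0.8):
    """Naive single-scale sliding-window detector over an H x W (x C) array.

    `classify_patch` is a stand-in for the pretrained block classifier and
    returns a watermark score in [0, 1]. Returns boxes as (x, y, w, h, score).
    """
    h, w = image.shape[:2]
    boxes = []
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            score = classify_patch(image[y:y + win, x:x + win])
            if score > thresh:
                boxes.append((x, y, win, win, score))
    return nms(boxes, iou_thresh=0.3)

def nms(boxes, iou_thresh=0.3):
    """Greedy non-maximum suppression over (x, y, w, h, score) boxes."""
    boxes = sorted(boxes, key=lambda b: b[4], reverse=True)
    kept = []
    for b in boxes:
        if all(iou(b, k) < iou_thresh for k in kept):
            kept.append(b)
    return kept

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h, score) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0
```

Running this over an image pyramid gives multi-scale detection, which is precisely where the redundant-window cost explodes.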

One idea is to reduce the number of candidate windows. A series of candidate regions is first generated through unsupervised or supervised learning, and a CNN classifier then judges whether each region contains a target and which target it is. The R-CNN family is representative of this approach. Because the candidate boxes can be mapped back to the resolution of the original image, the localization accuracy is high.

Another solution is to regress directly on the feature maps. For the convolutional layers of a CNN, the input image size need not be fixed; only from the first fully connected layer onward must input sizes agree. So an image of arbitrary size can be fed through the network up to the first fully connected layer, and a single forward pass yields feature maps at every level. The regression targets are then the location and category of the objects to be detected, regressed on feature maps at different levels according to the target size. YOLO and SSD represent this family of methods, whose distinguishing property is good real-time performance while maintaining high detection accuracy.

Figure 4 compares the performance of these two families of frameworks with DPM (Deformable Parts Model), the best traditional method:

Considering that watermark detection does not require highly accurate bounding boxes but must handle a throughput of millions of images per day, we built on the SSD framework with a ResNet backbone. For training data, we manually collected 15,000 watermarked images across 25 categories and augmented them by randomly cropping the watermark subject and synthesizing foregrounds onto new backgrounds.
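The foreground-background synthesis could be approximated as below; the scale range, opacity range, and Pillow-based compositing are illustrative assumptions rather than the exact augmentation we used.

```python
import random
from PIL import Image

def synthesize_watermarked_image(background: Image.Image,
                                 watermark: Image.Image):
    """Composite a semi-transparent watermark at a random position and scale,
    mimicking foreground-background synthesis for data augmentation.
    Returns the synthetic image and its ground-truth box (x, y, w, h)."""
    bg = background.convert("RGBA")
    # Random scale and opacity emulate the variability of real watermarks.
    scale = random.uniform(0.1, 0.4)
    w = max(1, int(bg.width * scale))
    h = max(1, int(watermark.height * w / watermark.width))
    wm = watermark.convert("RGBA").resize((w, h))
    alpha = wm.getchannel("A").point(lambda a: int(a * random.uniform(0.4, 0.9)))
    wm.putalpha(alpha)
    # Random placement; border cropping could also be simulated here.
    x = random.randint(0, max(0, bg.width - w))
    y = random.randint(0, max(0, bg.height - h))
    bg.alpha_composite(wm, (x, y))
    return bg.convert("RGB"), (x, y, w, h)
```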

We then tested the trained model on production data. 3,197 online images were randomly selected as the test set: 2,795 contained no watermark, 302 contained watermarks that appeared in the training set, and the remaining 100 contained rare watermarks not seen in training. On this test set we evaluated both the traditional approach (hand-designed features plus sliding-window recognition) and the SSD-based approach.

As Figure 5 shows, the SSD framework has significant advantages over the traditional approach in both recall and precision. Further analysis shows that the deep learning method recalled 38 of the rare watermark images, indicating that CNN features generalize better.

Celebrity face recognition

To avoid infringing celebrities' portrait rights, the audit scenario needs to identify whether an image uploaded by a user or merchant contains a celebrity's face. This is a typical face recognition application, specifically a 1:(N+1) face comparison. The full pipeline includes face detection, facial landmark detection, face alignment and normalization, face feature extraction, and feature comparison, as shown in Figure 6. The deep convolutional model is the recognition model trained for feature extraction. Below we introduce the face detection and face recognition solutions in turn.
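Assuming the feature extractor is already trained, the final 1:(N+1) comparison step might look like this sketch, where the gallery holds one template feature per celebrity and the similarity threshold is an illustrative placeholder.

```python
import numpy as np

def identify_face(query_feat: np.ndarray,
                  gallery: dict,          # celebrity_id -> feature vector
                  threshold: float = 0.6):
    """1:(N+1) comparison: match against N celebrity templates, with the
    (+1) outcome being "no known celebrity" when nothing clears the threshold."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    best_id, best_sim = None, -1.0
    for cid, feat in gallery.items():
        sim = cosine(query_feat, feat)
        if sim > best_sim:
            best_id, best_sim = cid, sim
    # Below threshold means the face is not any known celebrity (the extra class).
    return (best_id, best_sim) if best_sim >= threshold else (None, best_sim)
```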

Face detection

Face detection methods can be divided into traditional detectors and deep learning detectors. Traditional detectors are mainly based on the Viola-Jones (V-J) framework, achieving detection with a boosted cascade structure and hand-crafted features such as Haar features, HOG features, and features based on pixel comparison (Pico, NPD). These detectors offer good accuracy and speed in constrained environments, but in complex scenes (illumination, expression, occlusion) the hand-designed features cause detection quality to degrade sharply. To improve performance, related work has jointly optimized face detection and facial landmark localization (JDA), using landmark detection as an important evaluation signal for face detection, but its accuracy still needs improvement.

Deep learning detectors follow three ideas. The first keeps the V-J framework but replaces hand-crafted features with cascaded CNNs. The second is frameworks based on candidate regions and bounding-box regression (e.g. Faster R-CNN). The third is frameworks based on direct regression with fully convolutional networks (such as DenseBox).

We adopted the Faster R-CNN framework and improved it to better resist interference from complex backgrounds, face-like regions, and occlusion, which effectively raised the detection rate for small faces and profile faces.

Face recognition

There are two main approaches to face recognition. One converts it directly into an image classification task, with each class corresponding to the photos of one person; representative methods include DeepFace and DeepID. The other casts recognition as a metric learning problem: feature learning pulls different photos of the same person closer together and pushes photos of different people further apart; representative methods include DeepID2 and FaceNet.

Since the set of IDs to be recognized in this task is semi-closed, we can combine image classification and metric learning for model training. Considering that Triplet Loss demands a strong hard-negative mining algorithm and converges slowly in practice, we adopted Center Loss to minimize intra-class variance, combined with Softmax Loss to maximize inter-class variance. Balancing the two loss functions requires selecting a hyperparameter experimentally. We use the Inception-V3 network structure and train in two stages. In the first stage, we use Softmax Loss + C × Center Loss on the public CASIA-WebFace dataset (10,575 IDs, 490,000 face images) to initialize the network parameters and tune the hyperparameter C; testing yielded C = 0.01. In the second stage, we use Softmax Loss + 0.01 × Center Loss and fine-tune the network parameters on business data (5,200 celebrity IDs and 1 million face images).
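A PyTorch-style sketch of the joint objective follows; the feature dimension (512) and the choice to optimize the class centers directly as parameters are simplifying assumptions (the original Center Loss paper updates centers with a separate rule per mini-batch).

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Center Loss: pulls each feature toward its class center to shrink
    intra-class variance (Wen et al., ECCV 2016)."""
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Mean squared distance between each feature and its class center.
        return ((feats - self.centers[labels]) ** 2).sum(dim=1).mean() / 2

# Joint objective: Softmax Loss maximizes inter-class variance, Center Loss
# minimizes intra-class variance; C balances the two (C = 0.01 in the text).
softmax_loss = nn.CrossEntropyLoss()
center_loss = CenterLoss(num_classes=10575, feat_dim=512)  # CASIA-WebFace IDs

def total_loss(logits, feats, labels, C=0.01):
    return softmax_loss(logits, labels) + C * center_loss(feats, labels)
```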

To further improve performance, we borrowed the multi-model ensemble strategy used by Baidu, as shown in Figure 7. Specifically, the face is divided into multiple regions according to facial landmark locations, and a feature model is trained for each region. We currently divide the face into 9 regions, which together with the whole-face region means 10 models must be trained.

In the test phase, features are extracted from the face to be verified and from the candidate face over the 10 regions shown in Figure 7. For each region, the similarity (cosine distance) between the two feature vectors is computed; finally, similarity-weighted fusion decides whether the two faces belong to the same person. Table 1 shows the evaluation results of mainstream methods on the LFW dataset; the Meituan model achieves high accuracy with relatively limited data.
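The region-level fusion at test time might look like the following sketch; the uniform weights and decision threshold are illustrative assumptions (in practice the weights would be tuned on a validation set).

```python
import numpy as np

def fused_similarity(feats_a: list, feats_b: list, weights=None) -> float:
    """Weighted fusion of per-region cosine similarities between two faces.

    `feats_a` / `feats_b` hold one feature vector per region (10 regions:
    9 local patches plus the whole face). Uniform weights are a placeholder."""
    if weights is None:
        weights = [1.0 / len(feats_a)] * len(feats_a)
    sims = []
    for a, b in zip(feats_a, feats_b):
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.dot(weights, sims))

def same_person(feats_a, feats_b, threshold=0.5) -> bool:
    # Two faces are judged the same person when fused similarity clears a
    # threshold tuned on a validation set (the 0.5 here is assumed).
    return fused_similarity(feats_a, feats_b) >= threshold
```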

Pornographic image detection

Pornographic image detection is an important part of intelligent image audit. Traditional methods judge image compliance along dimensions such as skin color and pose. With the development of deep learning, existing work [e.g. Yahoo's NSFW (Not Suitable for Work) model] formulates the task directly as binary classification (pornographic vs. normal) and trains a convolutional neural network end to end on massive data.

In the trained model, different layers learn different features: some learn skin-color features, others learn body-part contours, and still others learn pose features. However, the definition of pornography is very broad: explicit content, sexual innuendo, and art may all be labeled pornographic, and the standard cannot be unified across scenarios or across people. The generalization ability of an initially trained model is therefore limited. To improve prediction accuracy, misclassified samples must be added continuously so the machine learns more features incrementally and corrects its errors. In addition, we made optimizations in the following areas.

  • Model refinement. Our classification model refines the degree of explicitness into four categories: pornographic, sexy, normal person, and other. Pornographic, sexy, and normal-person images are easily confused with one another, while the "other" category covers normal images without people. Separating the sexy and normal-person categories from the pornographic category helps the model discriminate pornography. As Table 2 shows, our model has a clear advantage in recall over Yahoo's NSFW model.

  • Machine review combined with manual review. In real business, pornography detection works as an early-warning mechanism: the machine stage should recall as many suspect images as possible, and a modest amount of manual review then raises precision. The upstream business logic therefore splits images into three buckets according to the model's predicted category and confidence: "certain porn", "certain normal", and "suspected" (see the sketch after this list). The "suspected" bucket is sorted by confidence and handed to human reviewers. In online business, the precision of the "certain porn" and "certain normal" buckets exceeds 99%, while the "suspected" bucket accounts for only about 3% of all images, greatly saving manpower while preserving high-precision filtering.

  • Support for video content review. For short videos, we extract key frames to reduce the problem to single-image review, then combine the recognition results of multiple frames to reach a conclusion.
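As referenced above, here is a sketch of the three-way routing and its extension to video; the class names, thresholds, and frame-aggregation rule are illustrative assumptions, tuned in practice so that the "suspected" bucket stays around 3% of traffic.

```python
def route_decision(probs: dict, high: float = 0.99) -> str:
    """Three-way routing: 'certain_porn', 'certain_normal', or 'suspected'.

    `probs` maps the refined classes (porn, sexy, normal, other) to softmax
    scores; the 0.99 threshold is a placeholder tuned for >99% precision."""
    if probs["porn"] >= high:
        return "certain_porn"
    if probs["normal"] + probs["other"] >= high:
        return "certain_normal"
    return "suspected"  # sorted by confidence and sent to human reviewers

def review_video(frame_probs: list) -> str:
    """Aggregate per-keyframe results: flag the clip if any key frame is
    certain porn, approve only if every frame is certain normal."""
    decisions = [route_decision(p) for p in frame_probs]
    if "certain_porn" in decisions:
        return "certain_porn"
    if all(d == "certain_normal" for d in decisions):
        return "certain_normal"
    return "suspected"
```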

Scene classification

As an Internet platform for food and lifestyle services, Meituan's business spans many vertical domains, as shown in Table 3. Images uploaded by operators or users must be categorized to keep them consistent with the scope of the business. In addition, to further improve the display, images in merchant albums need to be classified and organized, as shown in Figure 8.

Deep convolutional neural networks have surpassed human recognition rates on image classification benchmarks (such as ILSVRC), but as a typical supervised learning method they demand large quantities of high-quality labeled samples in the specific domain. For our scene classification task, relying entirely on reviewers to screen and clean images would be very expensive, so the model needs to be fine-tuned with transfer learning.

Transfer learning aims to rapidly and efficiently improve performance on a target task by preserving and exploiting knowledge learned from one or more similar tasks, domains, or probability distributions. Model transfer is a common technique in this field, implemented by learning parameters shared between the source domain and the target domain. Deep neural networks are particularly suited to model transfer because of their hierarchical structure and because their hidden layers represent abstract, invariant features.

For a deep convolutional network trained on the source domain, the question is which layers' parameters can be transferred, and how. Different layers transfer differently: layers where the target and source domains are similar are easier to transfer. Concretely, features learned by shallow convolutional layers are more general (image color, edges, basic textures) and therefore better suited to transfer, while features learned by deep convolutional layers are more task-specific (image details) and less suitable, as shown in Figure 9.

Model transfer fixes the parameters of certain layers of the network and trains the remaining layers on target-domain data. For our scene classification task, we first modify the network's output layer to match the number of target categories, then fix the shallow convolutional layers and train the later layers' parameters on business annotation data. If more training data is available, the parameters of the entire network can be further fine-tuned for additional gains, as shown in Figure 10. Compared with directly extracting high-level semantic features for supervised learning, this staged parameter transfer is more robust to differences between the source and target domains.
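A PyTorch sketch of this staged transfer, assuming a torchvision ResNet-50 backbone and a particular layer split for illustration (the networks and split used in the text may differ):

```python
import torch.nn as nn
from torchvision import models

def build_transfer_model(num_classes: int, freeze_shallow: bool = True):
    """Stage 1: replace the output layer and train only the later layers."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    if freeze_shallow:
        # Shallow layers learn general features (color, edges, texture)
        # and transfer well, so keep their pretrained parameters fixed.
        for name, param in model.named_parameters():
            if not name.startswith(("layer4", "fc")):
                param.requires_grad = False
    # New output layer sized to the business categories.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

def unfreeze_all(model):
    """Stage 2: given more data, unfreeze the whole network and fine-tune
    end to end at a small learning rate for additional gains."""
    for param in model.parameters():
        param.requires_grad = True
```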

Based on this transfer learning strategy, we ran experiments on food scene classification and hotel room-type classification, achieving high recognition accuracy with a limited number of labeled samples (on the order of 10,000 images). Performance on the test set is shown in Table 4.

As described above, deep learning-based image classification and detection have replaced traditional machine learning methods in intelligent image audit. Building on public models and transfer learning, continuous learning from massive data has brought these business scenarios to production.


About the author

Xiaoming, visual technology director of the Meituan Platform Intelligent Technology Center, previously worked at Canon Research and Samsung Research. He joined Meituan in 2015 and focuses on building image and video technologies and applying them to the business. As technical lead, he drove the launch of intelligent image audit, cover image optimization, face-recognition authentication, and photo-based dish recording, significantly improving the intelligent experience for users and merchants.

Recruitment information

The Intelligent Technology Center of the Meituan platform makes full use of artificial intelligence to support multiple business lines of Meituan-Dianping, with good results in intelligent recommendation, intelligent marketing, intelligent operations, intelligent audit, and other areas. We are looking for candidates with backgrounds in natural language processing, computer vision, large-scale machine learning, or data mining, on either the algorithm or the engineering side. Interested candidates are welcome to send resumes to: [email protected].