Apple has allowed its AI developers to publish their research papers and actively participate in discussions in the AI academic community. This is a beginning.


Source: Apple

Translated by NetEase Technology

Reproduced with authorization by Zhidxcom

A few days ago, Apple published its first academic paper on artificial intelligence (AI), "Learning from Simulated and Unsupervised Images through Adversarial Training." The paper describes a method for improving image recognition in computer vision systems and may mark a new direction for Apple's research.


Here is the full text of the report:

Abstract

As graphics technology continues to advance, it is becoming easier to train machine learning models on synthetic images, which helps avoid the expensive process of annotating real images. However, training machine learning models on synthetic images may not achieve satisfactory results, because synthetic images differ from real images. To reduce this gap, we propose a "simulation + unsupervised" learning method, which improves the realism of the computer-generated (synthetic) images used to train the algorithm's image recognition ability.

In fact, this "simulation + unsupervised" learning combines unlabeled real image data with annotated synthetic images. To a large extent, it relies on a newer machine learning technique called generative adversarial networks (GANs), which pit two neural networks against each other to produce more realistic images. We make several key modifications to the standard GAN algorithm to preserve annotations, avoid artifacts, and stabilize training: a self-regularization term, a local adversarial loss, and updating the discriminator using a history of refined images.

We find that this process produces highly realistic images, which we verify both qualitatively and through user studies. We quantitatively evaluate the generated images by training models to estimate gaze direction and hand pose. Our image recognition algorithms improve markedly when trained on these refined synthetic images: we achieve state-of-the-art results on the MPIIGaze dataset without using any labeled real data.

1. Introduction

With the recent rise of high-capacity deep neural networks, large-scale annotated training datasets have become increasingly important. However, annotating large datasets is expensive and time-consuming. For this reason, the idea of training algorithms on synthetic images rather than real images has emerged, because annotations can be generated automatically. Estimating body pose with the Xbox 360 peripheral Kinect, among other tasks, is already done using synthetic data.

(Figure 1: “Simulation + Unsupervised” learning: training the algorithm’s image recognition ability through computer-generated or synthetic images)

However, due to the gap between synthetic and real images, training algorithms on synthetic images can cause problems. Because synthetic images are usually not realistic enough, a neural network may learn only the details present in the synthetic images and fail to generalize to real images, so the algorithm does not learn accurately. One solution is to improve the simulator, but increasing realism is computationally expensive and renderer design becomes more difficult; even the best renderer may not be able to mimic all the characteristics of a real image. A lack of realism may therefore cause the algorithm to overfit unrealistic details of the synthetic images.

In this paper, we propose a "simulation + unsupervised" learning method, which aims to improve the realism of synthetic images from a simulator using unlabeled real data. Improving realism helps train better machine learning models without collecting new data or requiring humans to keep labeling images. In addition to increasing realism, "simulation + unsupervised" learning should also retain the annotation information needed to train machine learning models, such as the gaze direction shown in Figure 1. Moreover, since machine learning models are very sensitive to artifacts in synthetic data, "simulation + unsupervised" learning should also produce images without artifacts.

We have developed a new method for "simulation + unsupervised" learning, which we call SimGAN, that uses a neural network we call the "refiner network" to refine synthetic images from a simulator. An overview of this method is shown in Figure 2: first, a synthetic image is generated by a black-box simulator and refined using the refiner network. To increase realism, the primary requirement of the "simulation + unsupervised" learning algorithm, we train the refiner network with an adversarial loss similar to a generative adversarial network (GAN), so that it produces refined images that a discriminator network cannot distinguish from real ones.

Second, to retain the annotation information of the synthetic image, we complement the adversarial loss with a "self-regularization loss" that penalizes the difference between the synthetic image and the refined image. In addition, we use a fully convolutional neural network that operates at the pixel level and preserves the global structure, rather than modifying the image content as a whole.

Third, the GAN framework trains two neural networks against each other, and their objectives are often unstable and tend to produce artifacts. To avoid drifting and strong artifacts that make screening harder, we limit the receptive field of the discriminator to local regions rather than the whole image, which yields multiple local adversarial losses per image. In addition, we introduce a method to improve training stability by updating the discriminator using a history of refined images rather than only the ones produced by the current refiner network.

1.1 Related work

The GAN framework trains two neural networks with competing losses: a generator and a discriminator. The goal of the generator network is to map random vectors to realistic images, while the goal of the discriminator network is to distinguish generated images from real images. GANs, first introduced by I. Goodfellow and others, can help generate realistic visual images; since then, GANs have improved considerably and been put to many interesting applications.
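For readers unfamiliar with the formulation, the standard GAN objective can be written as the following minimax game. This block is our addition for reference; the symbols G, D, z, and y are generic GAN notation, not notation used elsewhere in this article.

```latex
% The standard two-player GAN objective (generic notation, our addition):
% G maps a noise vector z to an image; D outputs the probability that an image is real.
\min_G \max_D \;
\mathbb{E}_{y \sim p_{\text{data}}}\bigl[\log D(y)\bigr]
+ \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```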

Figure 2: Overview of SimGAN: we use a "refiner network" to refine the output image of the simulator while minimizing a local adversarial loss and a self-regularization term. The adversarial loss fools the discriminator network into mistaking refined synthetic images for real ones. The self-regularization term minimizes the difference between the synthetic image and the refined image, which preserves the annotation information and allows the refined images to be used to train machine learning models. The refiner network and discriminator network are updated alternately.

X. Wang and A. Gupta used a structured GAN to learn surface normals and then combined it with a style GAN to generate natural indoor scenes. Recurrent generative models have also been trained using adversarial training. In addition, the recently released iGAN allows users to edit images interactively. CoGAN, developed by M.-Y. Liu et al., couples GANs to learn a joint distribution of images across multiple modalities without requiring tuples of corresponding images, which facilitates joint-distribution learning. InfoGAN, developed by X. Chen et al., is an information-theoretic extension of GAN that allows meaningful representation learning.

Oncel Tuzel and others used GANs for super-resolution of face images. C. Li and M. Wand proposed Markovian GAN for efficient texture synthesis. W. Lotter et al. used an adversarial loss in an LSTM network for visual sequence prediction. L. Yu et al. proposed the SeqGAN framework, which combines GAN with reinforcement learning. Several recent works explore related problems in the generative-model domain; for example, PixelRNN predicts pixels sequentially using an RNN trained with a softmax loss. Generative networks focus on generating images from random noise vectors; in contrast to our model, the generated images carry no annotation information and therefore cannot be used to train machine learning models.

Many efforts have explored the use of synthetic data for a variety of prediction tasks, including gaze estimation, text detection and classification in RGB images, font recognition, object detection, hand pose estimation in depth images, RGB-D scene recognition, semantic segmentation of urban scenes, and human pose estimation. A. Gaidon et al. showed that training deep neural networks with synthetic data can improve their performance. Our work complements these methods by using unlabeled real data to improve the realism of the simulator's output.

In the domain adaptation setting, Y. Ganin and V. Lempitsky used synthetic data to learn features that remain invariant to the shift between the synthetic and real image domains. Z. Wang et al. trained a stacked convolutional autoencoder on synthetic and real data to learn low-level representations for their ConvNet font detector. X. Zhang et al. learned a multichannel encoding to reduce the shift between the real and synthetic data domains. In contrast to the classical domain adaptation approach, which adapts specific features to a specific prediction task, we bridge the gap between image distributions through adversarial training. This allows us to generate very realistic images that can be used to train any machine learning model for potentially many tasks.

2. “Simulation + Unsupervised” learning

The goal of "simulation + unsupervised" learning is to use a set of unlabeled real images yi ∈ Y to learn a refiner Rθ(x) that refines a synthetic image x, where θ are the function parameters. Let x̃ denote the refined image, so that x̃ := Rθ(x). The key requirement of "simulation + unsupervised" learning is that the refined image x̃ should look like a real image while preserving the annotation information from the simulator. To this end, we propose to learn θ by minimizing a combination of two losses:

Here, xi is the i-th synthetic training image and x̃i is the corresponding refined image. The first term is the realism cost, i.e. the cost of adding realism to the synthetic image. The second term represents the cost of preserving annotation information by minimizing the difference between the synthetic image and the refined image. In the following sections, we expand this formula and provide an algorithm for optimizing θ.
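The combined objective itself appeared as a figure in the original article. The following is a reconstruction of Eq. (1) as it appears, to the best of our recollection, in the original SimGAN paper; treat the exact notation (ℓ_real, ℓ_reg, and the weight λ) as an assumption.

```latex
% Reconstruction of the combined refiner objective, Eq. (1).
\mathcal{L}_R(\theta)
  = \sum_i \ell_{\text{real}}\bigl(\theta;\, x_i, \mathcal{Y}\bigr)
  + \lambda\, \ell_{\text{reg}}\bigl(\theta;\, x_i\bigr) \tag{1}
```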

2.1 Adversarial loss

To add realism to the synthetic images, we need to bridge the gap between the distributions of synthetic and real images. Ideally, it should be impossible to classify a given image as real or refined with high confidence. This motivates the use of an adversarial discriminator network Dφ, trained to distinguish real images from refined images, where φ are the discriminator network parameters. The adversarial loss used to train the refiner network Rθ is responsible for fooling the D network into mistaking refined images for real ones. Following the GAN approach, we model this as a two-player minimax game and update the refiner network Rθ and the discriminator network Dφ alternately. Next, we describe this model more precisely. The discriminator network updates its parameters by minimizing the following loss:

This is equivalent to the cross-entropy error of a two-class classification problem, where Dφ(.) is the probability that the input is a refined synthetic image and 1 − Dφ(.) is the probability that it is a real image. We implement Dφ as a ConvNet whose final output layer gives the probability that the sample is a refined image. To train this network, each mini-batch consists of randomly sampled refined synthetic images and real images. The target label for the cross-entropy loss layer is 0 for every yj and 1 for every x̃i. Then φ is updated by taking a stochastic gradient descent (SGD) step on the mini-batch gradient of the loss. In our implementation, the realism loss function uses the trained discriminator network D as follows:
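The two loss expressions referenced above were figures in the original article. The following is a reconstruction of Eqs. (2) and (3) as we recall them from the original paper (so treat them as an assumption), with x̃i = Rθ(xi) denoting the refined image and yj a real image.

```latex
% Reconstruction of the discriminator loss (2) and realism loss (3).
\begin{align}
\mathcal{L}_D(\phi) &= -\sum_i \log\bigl(D_\phi(\tilde{x}_i)\bigr)
                       -\sum_j \log\bigl(1 - D_\phi(y_j)\bigr), \tag{2}\\
\ell_{\text{real}}(\theta;\, x_i, \mathcal{Y}) &= -\log\bigl(1 - D_\phi(R_\theta(x_i))\bigr). \tag{3}
\end{align}
```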

By minimizing this loss function, the refiner network forces the discriminator to be unable to distinguish refined images from real ones. In addition to producing realistic images, the refiner network should preserve the simulator's annotation information. For example, the refinement learned for gaze estimation should not change the gaze direction, and the refinement for hand pose estimation should not change the position of the elbow. This is necessary so that the refined images, together with the annotations from the simulator, can be used to train machine learning models. To achieve this, we propose a self-regularization loss that minimizes the difference between the synthetic image and the refined image.

(Algorithm 1)

Figure 3: Illustration of the local adversarial loss. The discriminator network outputs a W×H probability map. The adversarial loss is the sum of the cross-entropy losses over the local patches.

Therefore, in our implementation, the overall refiner loss function, expanding (1), becomes:

In (4), ||·||1 denotes the L1 norm. We implement Rθ as a fully convolutional neural network without striding or pooling, which modifies the synthetic image at the pixel level rather than holistically changing the image content, as would happen, for example, in a fully connected encoder network; this preserves the global structure and the annotations. We learn the refiner and discriminator parameters by alternately minimizing LR(θ) and LD(φ). When we update the parameters of Rθ, we keep φ fixed, and when we update Dφ, we keep θ fixed. We describe the whole training procedure in Algorithm 1.
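Below is a minimal PyTorch-style sketch of the alternating optimization described in Algorithm 1, under the assumption that the overall refiner loss (4) is the adversarial term −log(1 − Dφ(Rθ(xi))) plus λ times the L1 self-regularization term. The names refiner, discriminator, synthetic_loader, real_loader, opt_r, opt_d, k_g, k_d and lambda_reg are our own placeholders, not identifiers from the paper.

```python
# Minimal sketch of the alternating SimGAN-style training loop (our illustration,
# not Apple's released code). D outputs the probability that its input is refined.
import torch

def train_simgan(refiner, discriminator, synthetic_loader, real_loader,
                 opt_r, opt_d, steps=10000, k_g=2, k_d=1, lambda_reg=0.1):
    synth_iter, real_iter = iter(synthetic_loader), iter(real_loader)
    eps = 1e-8
    for _ in range(steps):
        # --- update the refiner k_g times (discriminator held fixed) ---
        for _ in range(k_g):
            x = next(synth_iter)                     # synthetic mini-batch
            x_refined = refiner(x)
            d_out = discriminator(x_refined)         # prob. that input is refined
            adv = -torch.log(1.0 - d_out + eps).mean()       # fool D: look "real"
            reg = torch.abs(x_refined - x).mean()            # L1 self-regularization
            loss_r = adv + lambda_reg * reg
            opt_r.zero_grad(); loss_r.backward(); opt_r.step()

        # --- update the discriminator k_d times (refiner held fixed) ---
        for _ in range(k_d):
            x, y = next(synth_iter), next(real_iter)
            with torch.no_grad():
                x_refined = refiner(x)
            d_fake = discriminator(x_refined)        # target label 1 (refined)
            d_real = discriminator(y)                # target label 0 (real)
            loss_d = (-torch.log(d_fake + eps).mean()
                      - torch.log(1.0 - d_real + eps).mean())
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    return refiner
```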

Figure 4: Illustration of using a history of refined images. See the text for details.

2.2 Local adversarial loss

Another key requirement of the refiner network is that it should learn to model real image characteristics without introducing artifacts. When we train a strong discriminator network, the refiner network tends to over-emphasize certain image features to fool the current discriminator, producing drift and artifacts. The key observation is that any local patch sampled from a refined image should have statistics similar to those of real images. Therefore, rather than defining a global discriminator network, we can define a discriminator network that classifies local image patches.

This not only limits the receptive field, and therefore the capacity, of the discriminator network, but also provides more samples per image for training it. It also improves the training of the refiner network because there are multiple adversarial loss values per image.

In our implementation, the discriminator D is designed as a fully convolutional network that outputs a W × H probability map over pseudo-classes, where W × H is the number of local patches in the image. When training the refiner network, we sum the cross-entropy losses over the W × H local patches, as shown in Figure 3.
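The following is a minimal sketch (our illustration, with hypothetical names) of how the local adversarial loss for the refiner could be computed from such a W × H probability map, summing the per-patch cross-entropy terms as in Figure 3.

```python
# Local adversarial loss for the refiner: the discriminator returns a W x H map of
# probabilities that each patch is "refined"; the refiner wants every patch judged real.
import torch

def local_adversarial_loss(prob_map_refined: torch.Tensor) -> torch.Tensor:
    """prob_map_refined: (batch, 1, W, H) probabilities that each patch is refined."""
    eps = 1e-8
    per_patch = -torch.log(1.0 - prob_map_refined + eps)   # cross-entropy vs. label "real"
    return per_patch.sum(dim=(1, 2, 3)).mean()             # sum over patches, mean over batch
```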

2.3 Update discriminator with refined image history

Another problem with adversarial training is that the discriminator network only focuses on the most recent refined images. This can cause (i) divergence of the adversarial training, and (ii) the refiner network re-introducing artifacts that the discriminator has forgotten about. Any refined image generated by the refiner network at any point during training is a "fake" image for the discriminator, so the discriminator should be able to identify all of these images as fake. Based on this observation, we introduce a method that improves the stability of adversarial training by using a history of refined images rather than only those in the current mini-batch. We slightly modify Algorithm 1 to add a buffer of refined images generated by previous versions of the refiner network. Let B be the size of this buffer and b the mini-batch size used in Algorithm 1.

Figure 5: Sample output of SimGAN. On the left are real images collected in MPIIGaze; on the right are refined UnityEyes synthetic images. The skin texture and iris region in the refined synthetic images are noticeably more realistic than in the original synthetic images.

(Figure 6: A ResNet block with two N×N convolutional layers, each with f feature maps.)

In each iteration of discriminator training, we update the parameters φ by sampling b/2 images from the current refiner network and b/2 images from the buffer. After each iteration, we randomly replace b/2 samples in the buffer with newly generated refined images, keeping the buffer size B constant. This process is illustrated in Figure 4.
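Below is a minimal sketch (our illustration; the class and method names are our own) of such a refined-image history buffer: half of each discriminator mini-batch comes from the current refiner, half from the buffer, and half of the buffer is then refreshed with the new images.

```python
# History buffer of refined images for stabilizing discriminator updates (sketch).
import random
import torch

class ImageHistoryBuffer:
    def __init__(self, capacity: int):
        self.capacity = capacity          # B in the text
        self.images = []                  # stored refined images (C, H, W tensors)

    def sample_and_update(self, new_refined: torch.Tensor) -> torch.Tensor:
        """new_refined: (b, C, H, W) mini-batch from the current refiner.
        Returns a mixed mini-batch of b/2 new plus b/2 historical images."""
        half = new_refined.shape[0] // 2
        if len(self.images) < half:                           # buffer not warm yet
            mixed = new_refined
        else:
            old = torch.stack(random.sample(self.images, half))
            mixed = torch.cat([new_refined[:half], old], dim=0)
        # insert new images, randomly evicting old ones once the buffer is full
        for img in new_refined[:half]:
            if len(self.images) >= self.capacity:
                self.images[random.randrange(len(self.images))] = img.detach()
            else:
                self.images.append(img.detach())
        return mixed
```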

3. Experiments

We evaluated our method on the MPIIGaze appearance-based gaze estimation dataset [40, 43] and the hand pose dataset from New York University [35]. We use a fully convolutional refiner network with ResNet blocks (Figure 6) in all experiments.

3.1 Gaze estimation based on appearance

Gaze estimation is a key component of many human-computer interaction (HCI) tasks. However, estimating gaze directly from eye images is challenging, especially when image quality is poor, for example eye images captured by the front-facing camera of a smartphone or laptop. Therefore, in order to obtain large amounts of annotated data, recent methods [40, 43] have trained their models with large amounts of synthetic data. Here, we show that training with the refined synthetic images generated by SimGAN significantly improves performance on this task.

The gaze estimation dataset consists of 1.2 million synthetic samples generated with the UnityEyes eye-gaze synthesizer and about 214,000 real samples from the MPIIGaze dataset. The MPIIGaze samples are images captured under various less-than-ideal lighting conditions, whereas the UnityEyes images are all generated in the same rendering environment.

Qualitative results: Figure 5 shows real eye images alongside refined synthetic ones. As shown in the figure, we observe a significant quality improvement in the refined synthetic images: SimGAN successfully captures the skin texture, sensor noise, and appearance of the iris region. Note that our method improves realism while preserving the annotation information (gaze direction).

The "visual Turing test": To quantitatively assess the visual quality of the refined images, we designed a simple user study in which subjects were asked to classify images as real or refined. Each subject was shown 50 real images and 50 refined images. Subjects were repeatedly shown 20 examples of each kind and still found it hard to tell real and refined images apart. Across 10 subjects, choices were correct only 517 times out of 1,000 trials (p = 0.148), close to random guessing; Table 1 shows the confusion matrix. In contrast, when using the original synthetic images and real images, we showed each subject 10 real and 10 synthetic images, and subjects chose correctly 162 times out of 200 trials (p ≈ 10⁻⁸), significantly better than chance.
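As a quick sanity check of the reported p-value (our addition, not part of the paper), the figure of 517 correct answers in 1,000 trials is consistent with a one-sided binomial test against chance-level guessing:

```python
# Reproduces p ~ 0.148 under the assumption of a one-sided binomial test vs. p = 0.5.
from scipy.stats import binomtest

result = binomtest(517, 1000, p=0.5, alternative="greater")
print(result.pvalue)   # ~0.148: not significantly better than random guessing
```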

(Table 1: The "visual Turing test" using real and refined images. The average human classification accuracy is 51.7%, indicating that the automatically generated refined images are visually almost indistinguishable from real ones.)

(Figure 7: Quantitative results of gaze estimation on real MPIIGaze samples. The curves show the cumulative error of the system's estimates. Training on refined rather than raw synthetic images significantly improves performance.)

Quantitative results: We trained a simple convolutional neural network (CNN) similar to [43] to predict gaze direction; we train on UnityEyes and test on MPIIGaze. Figure 7 and Table 2 compare the performance of the CNN trained on raw synthetic data versus the refined data generated by SimGAN. We observe a large improvement when training on SimGAN output, with an absolute improvement of 22.3%. We also found that performance improves with more training data, where 4x refers to 100% of the training dataset. The quantitative evaluation confirms the qualitative improvement observed in Figure 5 and shows that machine learning models perform better with SimGAN. Table 3 compares with the state of the art: the CNN trained on refined images outperforms prior work on MPIIGaze, with a relative improvement of 21%. This large improvement shows the practical value of our method for many HCI tasks.

Implementation details: The refiner network Rθ is a residual network (ResNet). Each ResNet block consists of two convolutional layers with 64 feature maps each, as shown in Figure 6. An input image of size 55×35 is convolved with 3×3 filters to produce 64 feature maps. The output is passed through four ResNet blocks, and the output of the last ResNet block is fed to a 1×1 convolutional layer to produce one feature map corresponding to the refined synthetic image.
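A minimal PyTorch sketch of a refiner with this structure is shown below (our illustration): a 3×3 convolution to 64 feature maps, four ResNet blocks, then a 1×1 convolution back to a single-channel refined image. The padding, activation, and output scaling choices are assumptions, not taken from the paper text.

```python
# Sketch of the refiner architecture described above (assumptions noted in comments).
import torch
import torch.nn as nn

class ResnetBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size, padding=pad),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size, padding=pad),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.block(x))    # identity shortcut

class Refiner(nn.Module):
    def __init__(self, in_channels: int = 1, features: int = 64, num_blocks: int = 4):
        super().__init__()
        self.head = nn.Conv2d(in_channels, features, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(*[ResnetBlock(features) for _ in range(num_blocks)])
        self.tail = nn.Conv2d(features, in_channels, kernel_size=1)

    def forward(self, x):                       # x: (batch, 1, 35, 55) grayscale eye image
        # tanh output assumes images scaled to [-1, 1]; this is our choice, not the paper's.
        return torch.tanh(self.tail(self.blocks(self.head(x))))
```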

(Table 2: Comparison of training on raw synthetic data versus SimGAN output. Training with images produced by SimGAN yields a 22.3% improvement without any labeled real data.)

(Table 3: Comparison of SimGAN with the state of the art on MPIIGaze. R = real images, S = synthetic images. The error is the mean gaze estimation error in degrees. Training on refined images reduces the error by 2.1 degrees, a 21% relative improvement over the state of the art.)

The discriminator network Dφ consists of convolutional, max-pooling, and softmax layers, as follows: (1) Conv3x3, stride = 2, feature maps = 96, (2) Conv3x3, stride = 2, feature maps = 64, (3) MaxPool3x3, stride = 1, (4) Conv3x3, stride = 1, feature maps = 32, (5) Conv1x1, stride = 1, feature maps = 32, (6) Conv1x1, stride = 1, feature maps = 2, (7) Softmax.

Our adversarial networks are fully convolutional and are designed so that the receptive fields of the last-layer neurons in Rθ and Dφ are similar. We first train the Rθ network with only the self-regularization loss for 1,000 steps, and the Dφ network for 200 steps. Then, for every update of Dφ, Rθ is updated twice; that is, Kd is set to 1 and Kg to 50 in Algorithm 1.

The gaze estimation network is similar to [43], with slight modifications to make better use of our large synthetic dataset. The input is a 35×55 grayscale image that passes through five convolutional layers followed by three fully connected layers, the last of which encodes the 3-dimensional gaze vector: (1) Conv3x3, feature maps = 32, (2) Conv3x3, feature maps = 32, (3) Conv3x3, feature maps = 64, (4) MaxPool3x3, stride = 2, (5) Conv3x3, feature maps = 80, (6) Conv3x3, feature maps = 192, (7) MaxPool2x2, stride = 2, (8) FC9600, (9) FC1000, (10) FC3, (11) Euclidean loss. All networks are trained with a constant learning rate of 0.001 and a batch size of 512 until the validation error converges.
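Below is a minimal PyTorch sketch of the gaze estimation CNN layer list above (our illustration). It assumes valid (unpadded) convolutions, which with a 35×55 input yields a 5×10×192 = 9600-dimensional flattened activation; here FC9600 is read as that flattened vector feeding FC1000, which is an assumption about the layer list.

```python
# Sketch of the gaze-estimation CNN described above (assumptions noted in comments).
import torch.nn as nn

gaze_net = nn.Sequential(
    nn.Conv2d(1, 32, 3), nn.ReLU(inplace=True),     # (1) Conv3x3, 32 maps
    nn.Conv2d(32, 32, 3), nn.ReLU(inplace=True),    # (2) Conv3x3, 32 maps
    nn.Conv2d(32, 64, 3), nn.ReLU(inplace=True),    # (3) Conv3x3, 64 maps
    nn.MaxPool2d(3, stride=2),                      # (4) MaxPool3x3, stride 2
    nn.Conv2d(64, 80, 3), nn.ReLU(inplace=True),    # (5) Conv3x3, 80 maps
    nn.Conv2d(80, 192, 3), nn.ReLU(inplace=True),   # (6) Conv3x3, 192 maps
    nn.MaxPool2d(2, stride=2),                      # (7) MaxPool2x2, stride 2
    nn.Flatten(),                                   # (8) 5 x 10 x 192 = 9600 features
    nn.Linear(9600, 1000), nn.ReLU(inplace=True),   # (9) FC1000
    nn.Linear(1000, 3),                             # (10) FC3: 3D gaze vector
)
# (11) Euclidean loss corresponds to an MSE loss on the gaze vector, e.g. nn.MSELoss().
```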

3.2 Hand gesture simulation with depth images


Next, we apply this method to simulating depth images of hand gestures. This study mainly uses the NYU hand gesture database from New York University, which contains 72,757 training samples and 8,251 test samples captured by three Kinect cameras; each test sample includes one frontal gesture image and two side views. Each depth image sample is labeled with hand pose information, from which a synthetic image is generated. Figure 10 shows a sample from the gesture database. We preprocess the database samples by using the synthetic image to extract the corresponding pixels from the real image. Before processing with the deep learning network ConvNet, each image sample is resized to 224×224, the background values are set to zero, and the foreground values are set to the original depth value minus 2000 (this assumes a background depth of 2000).
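A minimal sketch of this preprocessing step is shown below (our illustration): resize to 224×224, zero out the background, and shift foreground depth values by the assumed background depth of 2000. The resize interpolation and the foreground mask are assumptions not specified in the text.

```python
# Depth-image preprocessing sketch: 224x224 resize, zero background, shifted foreground.
import numpy as np
import cv2  # OpenCV, assumed available

BACKGROUND_DEPTH = 2000  # depth value treated as the background plane (per the text)

def preprocess_depth(depth_image: np.ndarray) -> np.ndarray:
    resized = cv2.resize(depth_image.astype(np.float32), (224, 224),
                         interpolation=cv2.INTER_NEAREST)
    out = np.zeros_like(resized)
    foreground = resized < BACKGROUND_DEPTH          # assumed foreground criterion
    out[foreground] = resized[foreground] - BACKGROUND_DEPTH
    return out
```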

Figure 10: NYU hand gesture database. On the left is a sample depth image; on the right is the processed synthetic image.

Qualitative results: Figure 11 shows the output of the generative adversarial network (SimGAN) on the gesture database. As the figure shows, the real depth images exhibit discontinuous, non-smooth edge noise. SimGAN effectively learns to model the noise of the real images and produces more realistic refined synthetic images, without requiring any labels or annotations on the real images.

Figure 11: Example refined test images for the NYU gesture database. A real image is shown on the left, followed by a synthetic image and the corresponding refined output image from Apple's generative adversarial network.

The main noise source in the real images is non-smooth edge noise. The learning network learns to model the noise present in real images and, importantly, does not require any labels or annotations.

Quantitative analysis:

A CNN similar to the Stacked Hourglass human pose algorithm was applied to real images, synthetic images, and refined synthetic images, and compared against the test samples of the NYU gesture database. The algorithm was trained to predict the positions of 14 hand joints. To avoid bias, we use the same network to analyze the improvement obtained from the refined synthetic images. Figure 12 and Table 4 show the quantitative results on the gesture database. A model trained on the refined synthetic images output by SimGAN significantly outperforms one trained on real images and is 8.8% better than one trained on standard synthetic images, while the annotations of the simulator output cost nothing. Note that 3x indicates that images from all camera angles are used for training.

Figure 12: Quantitative results of hand pose estimation on the real depth images of the NYU gesture test set.

The graph shows the cumulative error curves. The refined synthetic images output by SimGAN are significantly better for training than real images, and 8.8% better than standard synthetic images. Importantly, our learning network does not need labeled real images.

Table 4: Comparison of hand pose estimation trained on different kinds of images.

Synthetic Data refers to synthetic images produced by the simulator, Real Data refers to real images, and Refined Synthetic Data refers to the refined synthetic images output by the adversarial network SimGAN. 3x indicates that real images from all three viewpoints are used.

Implementation details: The discrimination framework for gesture images is the same as for eye images, except that the input image resolution is 224×224, the filter size is 7×7, and 10 residual blocks are used. The discriminator network D is as follows:

(1) Conv7x7, stride = 4, feature maps = 96, (2) Conv5x5, stride = 2, feature maps = 64, (3) MaxPool3x3, stride = 2, (4) Conv3x3, stride = 2, feature maps = 32, (5) Conv1x1, stride = 1, feature maps = 32, (6) Conv1x1, stride = 1, feature maps = 2, (7) Softmax.

We first train the R network with the self-regularization loss for 500 steps, followed by 200 steps of training for the D network; subsequently, for every update of the D network, the R network is updated twice. For hand pose estimation we use the Stacked Hourglass Net human pose algorithm, which outputs heat maps of size 64×64. During network training we apply random perturbations in the range [-20, 20] so as to train on images from different angles. Training ends when the validation error converges.

3.3 Analysis of the modifications to adversarial training

First, we compare the image artifacts produced by local and global adversarial training. In global adversarial training, the discriminator network uses a fully connected layer, so the whole image is refined at once. Local adversarial training produces more realistic images, as shown in Figure 8.

Figure 8: Left: result of global adversarial training; right: result of local adversarial training.

This shows the difference between global and local adversarial training: the global approach produces a more detailed but less realistic image, while the local approach produces a more realistic one.

Next, Figure 9 compares the result of updating the discriminator network with a history of refined images against standard adversarial training. As shown, using the history of refined images produces more realistic shadows; in standard adversarial training, for example, there are no shadows at the eye corners.

Figure 9: Results of updating the discriminator network using a history of refined images.

Left: standard synthetic image; middle: result after updating the discriminator network with a history of refined images; right: result when updating the discriminator network with only the most recent refined images. As shown, using the history of refined images produces more realistic shadows.

4. Conclusions and future work

In this paper, we proposed a "simulation + unsupervised" machine learning method that can effectively improve the realism of simulated images. We described a new generative adversarial network, SimGAN, and applied it with unlabeled real images to obtain state-of-the-art results. Next, we will continue to explore producing more realistic and detailed refined images, and how to extend the method to video.

Zhidx's take:

Are you still confused after reading the full text of Apple's paper? That is fine: unless you are a specialist, you will not follow the complex equations in the paper, but on the surface it is fairly straightforward. Apple trained machines for image recognition using synthetic images, and the results are said to be good.

Another important aspect of the paper is that Russ Salakhutdinov, Apple's head of AI research, announced earlier this month at the NIPS artificial intelligence conference in Spain that Apple will allow its AI developers to publish their research and actively participate in the AI academic community. This paper is a start, but Apple's openness is also self-interested: by increasing communication, it hopes to attract more AI talent to Apple.