Unsupervised learning is the holy grail of deep learning: its goal is to train general-purpose systems that need very little data, and data that does not have to be annotated. This article starts from the basic concepts of unsupervised learning, then describes its main algorithms along with their advantages and disadvantages. The author, Eugenio Culurciello, is a researcher at e-Lab, which focuses on robotics and vision research.

Deep learning models today are trained on large supervised datasets, meaning that each piece of data comes with a corresponding label. The popular ImageNet dataset contains a million human-annotated images: 1,000 images for each of 1,000 categories. Creating such a dataset takes a great deal of work, potentially many person-months. Now suppose you wanted a dataset with a million classes: you would have to annotate every frame of a dataset of some 100 million videos, which is basically impossible.


Now, think back to how you were taught as a young child. Yes, you received some supervision, but when your parents told you something was a “cat”, they did not repeat “cat” every time you saw one for the rest of your life! Today’s supervised learning works like that: I tell you over and over, perhaps a million times, what a cat looks like, and then your deep learning model learns about cats.


Ideally, we’d like a model that works more like our brains: one that needs very few labels to understand the many kinds of things in the real world. By kinds of things, I mean categories of objects, categories of actions, categories of environments, categories of object parts, and so on.


As you’ll see in this review, the most successful models are those that can predict what is about to appear in a video. One problem many of these techniques face, and are trying to solve, is that to achieve good general performance, training must take place on video rather than on static images. That is the only way the learned representations will carry over to practical tasks.


The basic concept


The main goal of unsupervised learning research is to pre-train models (i.e., discriminators or encoders) that can be used for other tasks. Encoder features should be as generic as possible, so that they can be used in classification tasks (such as training on ImageNet) and deliver results as close as possible to those of supervised models.


The latest supervised models consistently outperform unsupervised pre-trained models. That is because supervision allows a model to encode the features of a particular dataset better. But when the model is applied to other datasets, that advantage fades. In this respect, unsupervised training promises to provide more general features for performing any task.


If the target is real-life applications such as autonomous driving, action recognition, or object detection and recognition in real time, then the algorithm needs to be trained on video.


Autoencoders


In 1996, Bruno Olshausen of UC Davis and David Field of Cornell University published Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1? (article link: redwood.psych.cornell.edu/papers/olsh…). It shows that sparse coding applies to the receptive fields of the visual cortex: they demonstrated that the primary visual cortex (V1) in our brain uses the principle of sparsity to create a minimal set of basis functions that can be used to reconstruct an input image.


The link below is a great overview of autoencoders by Piotr Mirowski of Microsoft’s Bing team in London in 2014.


Link: piotrmirowski.files.wordpress.com/2014/03/pio…


Yann LeCun’s team also works in this area. In the demo on the linked web page, you can see how V1-like filters are learned. (Link: “CBLL, Research Projects, Computational and Biological Learning Lab, Courant Institute, NYU”)


Stacked autoencoders have also been used, trained by repeating this process greedily, layer by layer.


The autoencoder method is also known as the direct mapping method.
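To make the idea concrete, here is a minimal autoencoder sketch in PyTorch (my framework choice for the examples in this article; the layer sizes and training loop are illustrative assumptions, not code from the papers above):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Minimal autoencoder: compress the input, then reconstruct it."""
    def __init__(self, n_input=784, n_hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_input, n_hidden), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_input), nn.Sigmoid())

    def forward(self, x):
        code = self.encoder(x)        # compact representation (the learned features)
        return self.decoder(code)     # reconstruction of the input

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(32, 784)               # stand-in batch of flattened images
for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), x)       # the loss is reconstruction error: no labels
    loss.backward()
    optimizer.step()
```

A stacked autoencoder repeats this recipe: freeze a trained layer, then train the next layer on its codes.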


Advantages and disadvantages of autoencoders/sparse coding/stacked autoencoders


Advantages:


  • Simple technique: reconstruct the input

  • Multilayer stackable

  • Intuitive and neuroscience-based research


Disadvantages:


  • Each layer is greedily trained

  • No global optimization

  • Not as good as supervised learning

  • Fails with multiple layers

  • Reconstructing the input may not be the ideal metric for learning a general-purpose representation


Clustering learning


Clustering learning is a technique for learning the filters of multiple layers using k-means clustering.


Our group has used this technique in clustering learning (see the paper Clustering Learning for Robotic Vision), clustering connections (see the paper An Analysis of the Connections Between Layers of Deep Neural Networks), and convolutional clustering (see the paper Convolutional Clustering for Unsupervised Learning). Just recently, this technique achieved very good results on STL-10, a popular unsupervised learning dataset.


Our work in this field developed independently of Adam Coates and Andrew Ng’s study Learning Feature Representations with K-means.
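The core recipe can be sketched in a few lines. This is an illustration of the idea, not the code from the papers above: it assumes scikit-learn’s KMeans, uses random stand-in patches, and omits the whitening and pooling details that matter in practice:

```python
import numpy as np
from sklearn.cluster import KMeans

# Sample random patches from unlabeled images (random data stands in here).
n_patches, patch_size, k = 10000, 9, 64
patches = np.random.randn(n_patches, patch_size * patch_size)
patches -= patches.mean(axis=1, keepdims=True)       # simple per-patch normalization
patches /= patches.std(axis=1, keepdims=True) + 1e-8

# Cluster the patches: each centroid becomes one convolutional filter.
kmeans = KMeans(n_clusters=k, n_init=10).fit(patches)
filters = kmeans.cluster_centers_.reshape(k, patch_size, patch_size)

# These k filters form the first layer; repeating the procedure on that
# layer's outputs yields the filters of the next layer, and so on.
```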


It is well known that restricted Boltzmann machines (RBMs), deep Boltzmann machines (DBMs), and deep belief networks (DBNs; see Geoffrey E. Hinton et al.’s A Fast Learning Algorithm for Deep Belief Nets) are difficult to train, owing to the numerical problems of computing their partition functions. They have therefore not been widely used for solving practical problems.


Advantages and disadvantages of cluster learning:


Advantages:


  • Simple technique: cluster similar outputs together

  • Multilayer stackable

  • Intuitive and neuroscience-based research


Disadvantages:


  • Each layer is greedily trained

  • No global optimization

  • Comparable to supervised learning only in some cases

  • Adding more layers brings diminishing performance returns


Generative adversarial networks


Generative adversarial networks try to build a good generative model by pitting a discriminator against a generator, where the generator aims to produce realistic images that fool the discriminator. One of the best unsupervised models of recent years, the generative adversarial network was proposed by Ian Goodfellow and Yoshua Bengio in the paper Generative Adversarial Nets. See also OpenAI researcher Ian Goodfellow’s late-2016 summary of this line of work, Generative Adversarial Networks (GANs).


DCGAN, an instantiation of the generative adversarial model by Alec Radford, Luke Metz, and Soumith Chintala, has yielded excellent results. Their study is published in the paper Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks.


Vincent Dumoulin, Ishmael Belghazi, et al. have provided a better interpretation of this model (link: Adversarially Learned Inference).


The DCGAN discriminator is designed to determine whether an input image is real (drawn from the dataset) or fake (produced by the generator). The generator takes a random noise vector (say, 1,024 numbers) as input and generates an image.


In DCGAN, the generator network upsamples the noise vector into an image through a stack of strided transposed convolutions.




The discriminator, by contrast, is a standard convolutional neural network. See the code sketch below for more details.
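The reference implementation is in Torch7; here is a condensed PyTorch sketch of the two networks. The 64×64 RGB image size and 100-dimensional noise vector are assumptions for the sketch (the vector could just as well be 1,024 numbers, as above):

```python
import torch.nn as nn

nz, ngf, ndf, nc = 100, 64, 64, 3   # noise dim, feature widths, color channels

# Generator: upsample a noise vector to a 64x64 image with strided deconvolutions.
generator = nn.Sequential(
    nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias=False),        # 1x1 -> 4x4
    nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
    nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),   # 4x4 -> 8x8
    nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
    nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),   # 8x8 -> 16x16
    nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
    nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),       # 16x16 -> 32x32
    nn.BatchNorm2d(ngf), nn.ReLU(True),
    nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False), nn.Tanh(), # 32x32 -> 64x64
)

# Discriminator: a standard convolutional classifier, real vs. fake.
discriminator = nn.Sequential(
    nn.Conv2d(nc, ndf, 4, 2, 1, bias=False), nn.LeakyReLU(0.2, True),   # 64 -> 32
    nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
    nn.BatchNorm2d(ndf * 2), nn.LeakyReLU(0.2, True),                   # 32 -> 16
    nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
    nn.BatchNorm2d(ndf * 4), nn.LeakyReLU(0.2, True),                   # 16 -> 8
    nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
    nn.BatchNorm2d(ndf * 8), nn.LeakyReLU(0.2, True),                   # 8 -> 4
    nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False), nn.Sigmoid(),           # 4 -> 1
)
```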


The key is to train the two networks in parallel without letting either overfit, which would amount to merely replicating the dataset. The learned features need to generalize to unseen samples, so simply memorizing the training data would not be useful.


Code for training DCGAN in Torch7 (soumith/dcgan.torch) is also provided. Getting it to work takes a lot of experimentation, something Yann LeCun has also shared on Facebook: www.facebook.com/yann.lecun/…
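In PyTorch rather than Torch7, one alternating training step might look like the following sketch, reusing the `generator` and `discriminator` defined above (batch size and optimizer settings are assumptions):

```python
import torch
import torch.nn as nn

opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCELoss()

real = torch.rand(16, 3, 64, 64) * 2 - 1        # stand-in batch of real images
noise = torch.randn(16, 100, 1, 1)

# 1) Train the discriminator to tell real images from generated ones.
opt_d.zero_grad()
loss_d = bce(discriminator(real).view(-1), torch.ones(16)) + \
         bce(discriminator(generator(noise).detach()).view(-1), torch.zeros(16))
loss_d.backward()
opt_d.step()

# 2) Train the generator to fool the discriminator.
opt_g.zero_grad()
loss_g = bce(discriminator(generator(noise)).view(-1), torch.ones(16))
loss_g.backward()
opt_g.step()
```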


Once both the generator and the discriminator are trained, you can use them both. The main goal was to train a discriminator network that can be reused for other tasks, such as classification on other datasets. The generator can be used to generate images from random vectors, and these images have very interesting properties. First, they provide smooth transitions through the input space. The following example shows the images generated by moving between nine random input vectors:
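In code, such an interpolation might look like this short sketch (reusing the generator above; a hypothetical example, with linear interpolation standing in for whatever path is walked through the latent space):

```python
import torch

z0, z1 = torch.randn(100, 1, 1), torch.randn(100, 1, 1)  # two random latent points
steps = torch.linspace(0, 1, 9)
# Linear interpolation in latent space yields a smooth sequence of images.
frames = [generator(((1 - t) * z0 + t * z1).unsqueeze(0)) for t in steps]
```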




The input vector space also exhibits mathematical properties (such as vector arithmetic), showing that the learned features are organized according to similarity, as shown in the figure below:




The smooth space learned by the generator suggests that the discriminator has similar properties, which makes it a great general-purpose feature extractor for encoding images. This could help address the failure of CNNs trained on discontinuous image inputs, which leaves them vulnerable to adversarial noise (see Christian Szegedy et al.’s [1312.6199] Intriguing Properties of Neural Networks).


GANs’ latest advances achieved a 21% error rate on CIFAR-10 using just 1,000 labeled samples; see Improved Techniques for Training GANs by Tim Salimans et al. of OpenAI, arxiv.org/pdf/1606.03… .


A recent paper on InfoGAN (InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets) generates images with clear, disentangled features, and these images have more interesting meanings. However, the authors did not publish a performance comparison of the learned features on any task or dataset.


For summaries of generative adversarial models, see the OpenAI technical blog post Generative Models and the Facebook page code.facebook.com/posts/15872… .


Another very interesting example is the following, in which the authors use generative adversarial training to learn to generate images from text descriptions. See [1605.05396] Generative Adversarial Text to Image Synthesis.




What I like most about this work is that the network uses a text description as input to the generator, rather than a random vector, so that the output of the generator can be controlled precisely. The network model structure is shown in the figure below:



Advantages and disadvantages of generative adversarial models


Advantages:


  • Global training for the entire network

  • Easy to program and implement


Disadvantages:


  • Difficult to train, with convergence problems

  • Comparable to supervised learning only in some cases

  • Need to improve usability (this is the problem for all unsupervised learning algorithms)


Models that learn from unlabeled data


These models learn directly from unlabeled data by designing unsupervised learning tasks that do not require labels, together with learning algorithms to solve those tasks.


Unsupervised learning of visual representations by solving jigsaw puzzles is indeed a clever technique. The authors cut images into jigsaw pieces and trained a deep network to solve the puzzle. The resulting network performs well enough to rival the best pre-trained networks. See [1603.09246] Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles.
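A minimal sketch of setting up such a jigsaw pretext task (an illustration of the idea, not the authors’ code; the tile size and the tiny permutation set are assumptions, where the paper uses a fixed set of many maximally distinct permutations):

```python
import torch

def make_jigsaw_example(image, permutations):
    """Cut a (C, 225, 225) image into a 3x3 grid of tiles and shuffle them.

    The label is the index of the permutation used, so the network can be
    trained with an ordinary classification loss, with no human annotation.
    """
    tiles = [image[:, r * 75:(r + 1) * 75, c * 75:(c + 1) * 75]
             for r in range(3) for c in range(3)]
    label = torch.randint(len(permutations), (1,)).item()
    shuffled = torch.stack([tiles[i] for i in permutations[label]])
    return shuffled, label     # (9, C, 75, 75) tiles plus the permutation class

perms = [(0, 1, 2, 3, 4, 5, 6, 7, 8), (8, 7, 6, 5, 4, 3, 2, 1, 0)]
x, y = make_jigsaw_example(torch.rand(3, 225, 225), perms)
```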


Unsupervised learning from image patches and their layout is also a clever technique. The authors take two closely spaced patches from the same image; statistically, these two patches belong to the same object. A third patch is taken from a random image at a random location and is, statistically, not from the same object as the first two. A deep network is then trained to distinguish patches belonging to the same category from patches of a different category. The resulting network has the same performance as one of the highest-performing fine-tuned networks. For details, see [1511.06811] Learning Visual Groups from Co-occurrences in Space and Time.
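A sketch of how such training triplets could be sampled (illustrative only; the patch sizes and offsets are made-up values):

```python
import torch

def sample_patch(image, top, left, size=64):
    return image[:, top:top + size, left:left + size]

def co_occurrence_triplet(image_a, image_b):
    """Two nearby patches from one image (same group) plus a patch from a
    different random image (different group); no labels are needed."""
    p1 = sample_patch(image_a, 10, 10)
    p2 = sample_patch(image_a, 10, 40)      # spatially close to p1
    p3 = sample_patch(image_b, torch.randint(0, 60, (1,)).item(),
                      torch.randint(0, 60, (1,)).item())
    return p1, p2, p3   # train a net to embed p1 and p2 closer than p1 and p3

a, b = torch.rand(3, 128, 128), torch.rand(3, 128, 128)
p1, p2, p3 = co_occurrence_triplet(a, b)
```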


Unsupervised learning models built on stereoscopic image reconstruction take one half of a stereo pair as input, such as the left frame, and reconstruct the right frame. Although this work was not aimed at unsupervised learning, it can be used that way. The method can also be used to generate 3D movies from still images. See the paper Deep3D: Fully Automatic 2D-to-3D Video Conversion with Deep Convolutional Neural Networks (link: arxiv.org/abs/1604.03…); Python source is on GitHub (piiswrong/deep3d).


Unsupervised learning of visual representations using surrogate classes works by using image patches to create a very large number of surrogate classes. These image patches are then augmented and used to train a supervised network on the augmented surrogate classes. This gives some of the best results in unsupervised feature learning. For details, see [1406.6909] Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks.
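A sketch of the surrogate-class construction (illustrative; the specific augmentations, sizes, and counts are assumptions, and it assumes a torchvision version whose transforms accept tensors):

```python
import torch
import torchvision.transforms as T

# Each unlabeled seed patch becomes its own surrogate class; heavy augmentation
# generates the "samples" of that class for ordinary supervised training.
augment = T.Compose([
    T.RandomResizedCrop(32, scale=(0.7, 1.0)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
])

seed_patches = [torch.rand(3, 32, 32) for _ in range(100)]   # stand-in patches
dataset = [(augment(p), cls) for cls, p in enumerate(seed_patches)
           for _ in range(8)]    # 8 augmented copies of each class
```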


Unsupervised learning of visual representations using video can use an LSTM encoder-decoder pair. The encoder LSTM runs over a sequence of video frames to generate an internal representation, which is then decoded by another LSTM into a target sequence. To keep this unsupervised, one way is to predict the same sequence as the input; another is to predict future frames. See [1505.00687] Unsupervised Learning of Visual Representations using Videos.
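A toy sketch of such an encoder-decoder pair (the published models work on convolutional features or image percepts; here frames are simply flattened vectors, and all sizes are assumptions):

```python
import torch
import torch.nn as nn

class FramePredictor(nn.Module):
    """Encode a clip with one LSTM; decode the target sequence (the same
    frames, or future frames) with another LSTM."""
    def __init__(self, frame_dim=1024, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(frame_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(frame_dim, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, frame_dim)

    def forward(self, frames, target_len):
        _, state = self.encoder(frames)        # compress the clip into `state`
        step = torch.zeros(frames.size(0), 1, frames.size(2))   # start token
        outputs = []
        for _ in range(target_len):
            h, state = self.decoder(step, state)
            step = self.readout(h)             # predicted next frame
            outputs.append(step)
        return torch.cat(outputs, dim=1)

model = FramePredictor()
clip = torch.rand(4, 10, 1024)                 # 4 clips of 10 flattened frames
future = model(clip, target_len=5)             # predict the next 5 frames
loss = nn.MSELoss()(future, torch.rand(4, 5, 1024))   # stand-in future frames
```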


Another paper using video is Anticipating Visual Representations from Unlabeled Video by Vondrick, Torralba, et al. ([1504.08023]), with very impressive results. The idea behind this work is to predict the representation of future frames from video input. It’s an elegant approach. The model used is as follows:




One problem with this technique is that a neural network trained on still image frames is used to interpret the video input. Such a network does not learn the temporal dynamics of video or the smooth transformations of objects moving through space, so we do not think it is suitable for predicting frames in future videos.


To overcome this problem, our group created a large video dataset, e-VDS, which can be used to train new (recursive and feedback) network models directly from video data.


PredNet


PredNet is a network designed to predict future frames in a video. You can see some examples in this blog, linked to: PredNet by CoxLab.


PredNet is a very clever neural network, and we think it will play an important role in the neural networks of the future. PredNet learns neural representations that go beyond the single frames seen by supervised CNNs.


PredNet combines a biologically inspired bidirectional model of the human brain (see the paper Unsupervised Pixel-Prediction, papers.nips.cc/paper/1083-…) with predictive coding and feedback connections in neural models (for details, see [1608.03425] Neural Encoding and Decoding with Deep Learning for Dynamic Natural Vision). Here is the PredNet model, with an example of two stacked layers:


PredNet incorporates biologically inspired bidirectional human brain models


This model has the following advantages:


  • It can be trained on unlabeled data

  • A loss function is embedded in every layer to compute prediction errors

  • It can learn online by monitoring error signals: when the model cannot predict its input, it knows there is something left to learn (a simplified sketch of such an error-driven layer follows this list)
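For intuition, here is a highly simplified sketch of one such error-driven layer (the real PredNet uses convolutional LSTMs and a specific stacking scheme; every name and size here is an illustrative assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictiveLayer(nn.Module):
    """One layer in the spirit of PredNet: predict the incoming signal,
    and pass the prediction error upward as both feature and loss."""
    def __init__(self, channels, hidden):
        super().__init__()
        self.recurrent = nn.Conv2d(2 * channels + hidden, hidden, 3, padding=1)
        self.predict = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, actual, state, error):
        state = torch.tanh(self.recurrent(torch.cat([error, state], dim=1)))
        prediction = self.predict(state)
        # Positive and negative error populations, as in predictive coding.
        error = torch.cat([F.relu(actual - prediction),
                           F.relu(prediction - actual)], dim=1)
        return prediction, state, error

layer = PredictiveLayer(channels=3, hidden=16)
frame = torch.rand(1, 3, 64, 64)
state = torch.zeros(1, 16, 64, 64)
error = torch.zeros(1, 6, 64, 64)
prediction, state, error = layer(frame, state, error)
loss = error.mean()    # each layer contributes its own error to the loss
```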


One problem with PredNet is that predicting future input frames is relatively easy for the first layer using simple motion-based filters. In our experiments with PredNet, it learned to reconstruct input frames with good results, but the higher layers did not learn better representations; in fact, the higher layers could not solve simple classification tasks in our experiments.


In fact, predicting future frames is unnecessary. What we would rather do is predict the representation of the next frame, as Carl Vondrick did in [1504.08023] Anticipating Visual Representations from Unlabeled Video.


Learning features by watching objects move


A recent paper trains unsupervised models by watching objects move in videos (Learning Features by Watching Objects Move). Motion is extracted in the form of optical flow and used as a segmentation template for moving objects. Although the optical flow signal provides nothing close to a perfect segmentation template, averaging over a large dataset makes the resulting network perform well. Here is an example:


This work is very exciting because it follows neuroscientific theories about how the human visual cortex learns to segment moving objects. See the paper Development of Human Visual Function.


In the future


The future is up to you to shape.


Unsupervised training is still a growing topic. You can make a big contribution in the following ways:


  • Create a new unsupervised task to train networks, such as solving jigsaw puzzles, comparing image patches, generating images, and so on

  • Think of tasks that yield great unsupervised features, for example understanding what is an object and what is background in stereo images and video, in the same way our human visual system works

Compiled from Medium by Heart of the Machine.