Author | Faizan Shaikh
Translator | Ma Zhuoqi
Editor | Vincent
AI Front Line introduction: In this article, we use an intuitive case study to outline the concepts of unsupervised deep learning, then walk through unsupervised learning code on the MNIST dataset in detail, covering K-means, autoencoders, and the DEC algorithm.






Follow the WeChat public account “AI Front” (ID: AI-front)
Introduction

As data scientists, we use various machine learning algorithms to extract actionable insights from data. Most of these are supervised learning problems: you already know what the target is, and the data comes with labels and details that help you reach it.

While unsupervised learning is a complex challenge, it has many advantages: it can tackle problems that were previously unsolvable, and it is attracting a lot of attention in machine learning and deep learning.

The purpose of this article is to give an intuitive introduction to unsupervised learning and show how it can be used in real life.

Note — Reading this article requires some deep learning background and an understanding of machine learning concepts. If you haven’t mastered the basics, check out the following references:

  • Experimental data: https://trainings.analyticsvidhya.com/courses/course-v1:AnalyticsVidhya+EWD01+2018_EWD_T1/about

  • Deep learning basis, artificial neural network as a starting point: https://www.analyticsvidhya.com/blog/2016/03/introduction-deep-learning-fundamentals-neural-networks/

Now let’s get down to business.

Why use unsupervised learning?

Machine learning projects are typically designed in a supervised manner: we tell the algorithm what to do and what not to do. While this is a general structure for solving problems, it limits the algorithm’s potential in two ways:

  • The algorithm is constrained by the bias of the supervision signal. Yes, it learns to do the task on its own, but it cannot consider other possibilities when solving the problem.

  • Because learning is supervised, creating labels for the algorithm is labor-intensive. The fewer labels you create manually, the less data the algorithm has to train on.

To solve this problem in an intelligent way, we can use unsupervised learning algorithms. Unsupervised learning gets the properties of the data directly from the data itself, and then summarizes the data or groups the data so that we can use those properties to make data-driven decisions.

Let’s use an example to better understand this concept. For example, banks want to group their customers so that they can recommend suitable products to them. They can do this in a data-driven way — first breaking down customers by age, and then deriving customer characteristics from those groupings. This will help banks provide better product recommendations to customers, thereby improving customer satisfaction.

Case study of unsupervised deep learning

In this article, we will present a case study of unsupervised learning based on unstructured data. Deep learning techniques are usually most powerful when working with unstructured data. Therefore, we take the application of deep learning in the field of image processing as an example to understand this concept.

Problem definition: How do I organize my photo library?

Right now, I have 2,000 photos on my phone. If I were a selfie-aholic, there would probably be ten times as many. Sorting through them is a nightmare, because roughly one out of every three is useless to me. I’m sure most people have the same problem.

Ideally, I want an app that organizes my photos and lets me browse most of them at any time. That way I would also know how many categories of photos I currently have.

To get a clearer picture of the problem, I tried categorizing the photos myself. Here’s my summary of the situation:

  • First of all, I discovered that a third of my photo library is made up of Internet anecdotes (thanks to our lovely friends at WhatsApp).

  • I also personally collect some interesting answers or shares I see on Reddit.

  • There are at least 200 photos that I took at the famous DataHack Summit and on a subsequent trip to Kerala, and some that colleagues shared with me.

  • There are also photos of whiteboard discussions during the meeting.

  • There were also screenshots documenting code errors that needed to be discussed by the internal team, which had to be deleted once they were dealt with.

  • I also found some “personal” images, such as selfies, group photos and several special scenes. There are not many of them, but they are precious to me.

  • Finally, there were countless “Good morning”, “Happy Birthday” and “Happy Diwali” posters that I keep deleting from my photo library. But no matter how many I delete, they keep showing up!

In the following sections, we’ll discuss some of the solutions I’ve come up with to solve this problem.

Method 1: Classification based on time

The easiest way is to organize photos by time. You can have a different folder for each day. Most photo-browsing applications use this approach (such as the Google Photos app).

The advantage is that all of one day’s events are stored together. The downside is that it’s too generic: on any given day I might take pictures of an outing, take screenshots of interesting answers, and so on. They end up mixed together, which is not what I want at all.

Method 2: Classification based on location

A relatively good method is to organize photos by location. For example, every time we take a picture, we can record where the picture was taken. We can then make folders based on these locations — whether country, city, or region — to the regional granularity we want. This approach is also used by many photo applications.

The downside of this approach is that it is simplistic. How do we define the location of a funny image, or a cartoon? And they make up a pretty big part of my photo library. So it’s not very clever.

Method 3: Extract semantic information from photos and use it to organize my photo library

Most of the approaches we’ve seen so far rely on metadata taken at the same time as the photo. A better way to organize photos is to extract semantic information from the image itself and use that information intelligently.

Let’s break this idea down into parts. Suppose our photo library is as diverse as the one described above. What patterns should our algorithm capture?

1. Is the image a natural scene or an artificially generated one?

2. Is there any text in the photo? And if so, can we identify what it is?

3. What are the different objects in the photo? Can their combination determine the beauty of the photo?

4. Is there anyone in the photo? Can we identify them?

5. Are there similar images on the network that can help us identify the content of the image?

Therefore, our algorithm should ideally capture this information without explicit labels, and use it to organize and categorize our photos. Ideally, the final application would surface these groupings directly in its interface.

This is what it means to solve a problem in an “unsupervised” manner: we do not directly define the outcome we want; instead, we train an algorithm to find it for us. The algorithm summarizes the data in an intelligent way and then tries to solve the problem based on those inferences. Cool, right?

Now you might be wondering, how can we use deep learning to deal with unsupervised learning?

As we saw in the case study above, we can better judge the similarity of images by extracting semantic information from them. Our problem can therefore be expressed as: how do we reduce the dimensionality of an image so that the image can be reconstructed from its encoded representation?

We can use a deep learning architecture called an autoencoder.

The idea of an autoencoder is to train it to reconstruct its input from the features it learns. The appealing part is that it manages this reconstruction from a very small feature representation.

For example, suppose an autoencoder with a code dimension of 10 is trained on cat images, each 100×100 pixels in size. The input dimension is then 10,000, and the autoencoder must compress all of that information into a vector of size 10.

An autoencoder can be logically divided into two parts: the encoder and the decoder. The encoder’s job is to convert the input to a low-dimensional representation, while the decoder’s job is to reconstruct the input from a low-dimensional representation.
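As a concrete sketch of that encoder/decoder split (the layer sizes follow the cat-image example above; the exact model is an illustration, assuming Keras with the TensorFlow backend used later in this article):

```python
from keras.layers import Input, Dense
from keras.models import Model

# 100x100 cat images flattened into 10000-dimensional vectors
inputs = Input(shape=(10000,))
# encoder: compress the input down to a 10-dimensional code
code = Dense(10, activation='relu')(inputs)
# decoder: reconstruct the 10000-dimensional input from that code
reconstruction = Dense(10000, activation='sigmoid')(code)

autoencoder = Model(inputs, reconstruction)  # full model: input -> reconstruction
encoder = Model(inputs, code)                # encoder half: input -> 10-d code
autoencoder.compile(optimizer='adam', loss='mse')
```

Training `autoencoder` on images forces the 10-dimensional code to retain as much of the input as possible; `encoder` then gives us that compact representation directly.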

This is a high level overview of autoencoders, and in the next article we will look at the algorithms for autoencoders in detail.

Although research in this area is booming, the most advanced methods are not easily able to solve industrial problems, and it will be several years before our algorithms are ready for industrial use.

Unsupervised deep learning code walkthrough on the MNIST dataset

Now that we have a basic idea of how deep learning can be used for unsupervised learning, let’s apply it to a real problem. We’ll use the MNIST dataset, a standard benchmark for deep learning experiments. Before diving into the code, let’s define the problem.

The original problem is to identify the digit in each image; the dataset labels every image with its digit. In our case study, we will instead try to find similar images and group them together, then use the labels to evaluate the purity of each cluster. You can download the data from the “Identify the Digits” practice problem on AV’s DataHack platform.

We will test three unsupervised learning techniques and evaluate their performance:

  1. K-means clustering applied directly to the images

  2. K-means + autoencoder

  3. Deep Embedded Clustering (DEC)

Before you start, make sure you have Keras installed on your system (refer to the official installation guide). We will use TensorFlow as the backend, so make sure it is set in your Keras configuration file. If not, follow the steps given here.

We will use Xifeng Guo’s open-source implementation of the DEC algorithm. Enter the following command on the CLI:

You can open a Jupyter Notebook and follow the code below.

First we need to import all the necessary modules.
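The original import cell did not survive translation; based on what the rest of the walkthrough needs, it was presumably along these lines:

```python
import os
import random

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score
from keras.layers import Input, Dense
from keras.models import Model
```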

Let’s fix the random seed so that our results are reproducible.
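A minimal version of that cell (the particular seed value is arbitrary; any fixed integer works):

```python
import random
import numpy as np

# fix the seeds so every run produces the same "random" numbers
seed = 128
random.seed(seed)
np.random.seed(seed)
```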

Now set the working path for the data for subsequent access.
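The directory names below are illustrative; adjust them to wherever you unpacked the dataset:

```python
import os

# root of the project and the folder holding the downloaded data
root_dir = os.path.abspath('.')
data_dir = os.path.join(root_dir, 'data')
```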

Read in the training and test files.
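A sketch of that cell. The file names `train.csv` and `test.csv` are assumptions based on how the practice-problem data is usually packaged; the existence check keeps the cell runnable even before the data is in place:

```python
import os
import pandas as pd

def load_split(name, folder='data'):
    """Read one of the CSV index files (image filename plus label) from the data folder."""
    return pd.read_csv(os.path.join(folder, name))

# only attempt the read once the dataset has actually been downloaded
if os.path.exists(os.path.join('data', 'train.csv')):
    train = load_split('train.csv')
    test = load_split('test.csv')
```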

In this dataset, each image has a class label, which is unusual in unsupervised learning; here we use these labels only to evaluate the performance of the unsupervised models.

Now let’s display the data as a picture:
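A sketch of the display cell; since the real loading code depends on the downloaded files, a random array stands in for a loaded 28×28 MNIST image:

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# stand-in for one 28x28 grayscale digit image
img = np.random.rand(28, 28)

plt.imshow(img, cmap='gray')
plt.axis('off')
plt.savefig('digit.png')  # in a notebook you would call plt.show() instead
```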

We then read in all the images, store them in a NumPy matrix, and create the training and test arrays.
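A helper along these lines would do it (the original likely used SciPy’s image reading; Pillow is shown here, and the 28×28 MNIST size gives 784-length vectors):

```python
import os

import numpy as np
from PIL import Image  # the original code may have used scipy's imread instead

def load_images(filenames, folder):
    """Read each image as grayscale and flatten it to a 784-length vector in [0, 1]."""
    rows = []
    for name in filenames:
        img = Image.open(os.path.join(folder, name)).convert('L')
        rows.append(np.asarray(img, dtype=np.float32).ravel() / 255.0)
    return np.stack(rows)
```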

We divide the training data into a training set and a validation set.
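A sketch of the split, using a 70/30 cut (random stand-in arrays are used so the cell runs on its own; with the real dataset `train_x` and `train_y` would hold the flattened images and their labels):

```python
import numpy as np

# stand-ins for the flattened images and their labels
train_x = np.random.rand(100, 784)
train_y = np.random.randint(0, 10, size=100)

# hold out the last 30% of the training images for validation
split = int(train_x.shape[0] * 0.7)
X_train, X_val = train_x[:split], train_x[split:]
y_train, y_val = train_y[:split], train_y[split:]
```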

K-Means

We first apply K-means clustering directly to the images, grouping them into 10 clusters.
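The clustering call might look as follows (a random matrix stands in for the real pixel data):

```python
import numpy as np
from sklearn.cluster import KMeans

X_train = np.random.rand(200, 784)  # stand-in for the flattened images

# group the raw pixel vectors into 10 clusters, one per hoped-for digit
km = KMeans(n_clusters=10, n_init=10, random_state=42)
train_clusters = km.fit_predict(X_train)
```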

Now that we’ve trained the model, let’s see how it performs on the validation set.

We will use normalized mutual information (NMI) scores to evaluate our model.
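scikit-learn ships this metric as `normalized_mutual_info_score`; a small self-contained example (toy labels, not real MNIST results):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

y_val = np.array([0, 0, 1, 1, 2, 2])   # ground-truth digit labels (toy example)
pred  = np.array([1, 1, 0, 0, 2, 2])   # cluster ids assigned by K-means

# NMI ignores how clusters are numbered, so this permuted but otherwise
# perfect assignment still scores 1.0
nmi = normalized_mutual_info_score(y_val, pred)
```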

Mutual information is a symmetric measure of the dependence between the clustering result and the ground-truth classification. It builds on the notion of cluster purity, which measures the quality of a single cluster Ci by comparing it against every ground-truth class Mj and taking the maximum number of shared items. Because NMI is normalized, we can use it to compare clusterings with different numbers of clusters.

The NMI formula is as follows:
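The formula image did not survive extraction; the standard definition (the one scikit-learn’s arithmetic normalization computes) is

```latex
\mathrm{NMI}(\Omega, C) = \frac{2\, I(\Omega; C)}{H(\Omega) + H(C)}
```

where Omega is the set of clusters, C the set of ground-truth classes, I the mutual information, and H the entropy. NMI is 0 for independent assignments and 1 for a perfect match up to relabeling.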

K-Means + AutoEncoder

Now, instead of applying K-means directly, we first use an autoencoder to reduce the dimensionality of the data and extract useful features, and then feed those features to the K-means algorithm.

Now train the autoencoder model:
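A compact version of that pipeline (layer sizes, epoch count, and the stand-in data are illustrative; with the real dataset `X_train` would hold the flattened images):

```python
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model
from sklearn.cluster import KMeans

# stand-in data; replace with the real flattened image matrix
X_train = np.random.rand(256, 784).astype('float32')

# a small 784 -> 32 -> 784 autoencoder
inp = Input(shape=(784,))
code = Dense(32, activation='relu')(inp)
out = Dense(784, activation='sigmoid')(code)

autoencoder = Model(inp, out)
encoder = Model(inp, code)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(X_train, X_train, epochs=1, batch_size=64, verbose=0)

# cluster in the learned 32-dimensional code space instead of pixel space
codes = encoder.predict(X_train, verbose=0)
km = KMeans(n_clusters=10, n_init=10, random_state=42)
clusters = km.fit_predict(codes)
```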

As the results show, combining the autoencoder with K-means works better than using K-means alone.

DEC

Finally, let’s look at the DEC algorithm. DEC trains the clustering assignments and the autoencoder jointly for better results. (Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised Deep Embedding for Clustering Analysis. ICML 2016.)
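For reference, a usage sketch of that implementation. The class and method names below follow Xifeng Guo’s DEC-keras repository as best I recall its README; treat every name and argument as an assumption and check the repository before running:

```python
# Assumes the DEC-keras repository (github.com/XifengGuo/DEC-keras) has been
# cloned and is on the Python path; the API below is recalled from its README
# and may differ across versions.
from keras.optimizers import SGD
from DEC import DEC  # module name within the cloned repo (assumed)

n_clusters = 10
# dims: input size followed by encoder layer sizes, ending in the code size
dec = DEC(dims=[784, 500, 500, 2000, 10], n_clusters=n_clusters)

# 1) pretrain the underlying autoencoder, 2) fine-tune with the KL-divergence
#    clustering loss, 3) read off the final cluster assignments
dec.pretrain(x=train_x, optimizer='adam', epochs=300, batch_size=256)
dec.compile(optimizer=SGD(0.01, 0.9), loss='kld')
y_pred = dec.fit(train_x, tol=0.001, maxiter=20000, update_interval=140)
```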

DEC performs best of the three methods. The researchers found that training the DEC model further can push performance even higher (NMI as high as 87).

Original English article:

Essentials of Deep Learning: Introduction to Unsupervised Deep Learning (with Python codes)