In this article, we will show how to build a deep neural network that can classify images with 90% accuracy, a problem that was very difficult before the emergence of deep neural networks, especially convolutional neural networks.

Deep learning is one of the most exciting topics in artificial intelligence right now. Loosely inspired by concepts from biology, it has grown into a broad collection of algorithms.

Deep learning has been proven to work well in many fields, including computer vision, natural language processing, and speech recognition.

In the last six years, deep learning has been applied to a wide range of fields, and many of the recent technological breakthroughs are related to deep learning.

Deep learning is responsible for technological breakthroughs like Tesla’s self-driving cars, Facebook’s photo-tagging system, virtual assistants like Siri or Cortana, chatbots, and cameras that recognize objects, just to name a few.

On cognitive tasks such as language comprehension and image analysis, deep learning has reached human-level performance in many fields.

So how do we construct a deep neural network that achieves 90% accuracy on an image classification task?

The problem seems simple enough, but it was a thorny one that computer scientists had worked on for years before the rise of deep neural networks, particularly convolutional neural networks (CNNs).

This article is divided into the following three parts:

  • Presenting the dataset and use case, and explaining the complexity of the image classification task.

  • Setting up a dedicated deep learning environment on a GPU-based AWS EC2 instance.

  • Training two deep learning models: the first built end to end from scratch with Keras and TensorFlow, the second using a network pre-trained on a large dataset.

An interesting example: classifying images of cats and dogs

There are many image datasets designed specifically for benchmarking deep learning models. The dataset I use in this article comes from the Dogs vs. Cats Kaggle competition, and it contains a large number of labeled images of dogs and cats.

As with every Kaggle contest, this dataset contains two folders:

  • Training folder: It contains 25,000 pictures of cats and dogs, each with a label that is part of the file name. We will use this folder to train and evaluate our models.

  • Test folder: It contains 12,500 images, each named with a number. For each image in this dataset, our model must predict whether it shows a dog or a cat (1 = dog, 0 = cat). Kaggle uses this data to score models and rank them on the leaderboard.

Let’s take a look at the characteristics of these images, which come in a variety of resolutions. The cats and dogs in the pictures differ in shape, position, and color.

Their posture varies: some are sitting and some are not; they may look happy or sad; a cat may be sleeping, a dog barking. Photographs may be taken at any focal length and from any angle.

The possibilities are endless, and while it would be effortless for us humans to identify a pet in a scene from a range of different types of photos, it’s no small feat for a machine.

In fact, for machines to categorize these images on their own, we would need to know how to robustly characterize a cat or a dog, that is, what makes us decide that a cat is in this picture and a dog in that one. This requires describing the intrinsic features of each animal.

Deep neural networks work well for image classification because they automatically learn multiple layers of abstraction, which yield progressively simpler feature representations for each category in a given classification task.

Deep neural networks can recognize patterns despite extreme variation, and are robust to distortions and simple geometric transformations. Let’s take a look at how deep neural networks handle this problem.

Configure the deep learning environment

Deep learning is very computationally intensive, as you will quickly notice if you try to train a deep learning model on your own computer.

But with a GPU, training speeds up dramatically: GPUs are extremely efficient at parallel computations such as matrix multiplication, and neural network training consists almost entirely of matrix multiplications, so the performance gain is enormous.

I don’t have a powerful GPU in my own computer, so I chose to use a virtual machine on Amazon Web Services (AWS) called p2.xlarge, which is part of Amazon EC2.

The virtual machine’s configuration includes an NVIDIA GPU with 12 GB of video memory, 61 GB of RAM, four vCPUs, and 2,496 CUDA cores.

It’s a performance monster, and the good news is that it costs $0.90 an hour. Of course, you could choose a virtual machine with a better configuration, but a p2.xlarge is more than adequate for the task we’re about to tackle.

My virtual machine is running on Deep Learning AMI CUDA 8 Ubuntu Version, so let’s take a closer look at this system.

The system is based on an Ubuntu 16.04 server with all the deep learning frameworks we need pre-installed (TensorFlow, Theano, Caffe, Keras), along with the GPU drivers (I have heard that installing the drivers yourself can be a nightmare).

If you are not familiar with AWS, you can refer to the following two articles:

  • https://blog.keras.io/running-jupyter-notebooks-on-gpu-on-aws-a-starter-guide.html

  • https://hackernoon.com/keras-with-gpu-on-amazon-ec2-a-step-by-step-instruction-4f90364e49ac

There are two things you can learn from these two articles:

  • Establish and connect to an EC2 virtual machine.

  • Configure the network for remote access to the Jupyter Notebook.

Build a cat/dog image classifier with TensorFlow and Keras

With the environment configured, we can start building a convolutional neural network that classifies cat and dog images, using the deep learning frameworks TensorFlow and Keras.

Keras is a high-level neural network API. Written in pure Python, it runs on top of the TensorFlow, Theano, or CNTK backends and is designed to support fast experimentation, turning your ideas into results quickly.

Build a convolutional neural network from scratch

First, we set up an end-to-end pipeline to train a CNN, covering the following steps: data preparation and augmentation, architecture design, training, and evaluation.

We will plot the loss and accuracy metrics on the training and validation sets, which will let us assess the model’s evolution during training more intuitively.

Data preparation

The first thing to do before you start is to download and unzip the training dataset from Kaggle.

We need to reorganize the data to make it easier for Keras to process. We create a data folder and two subfolders inside it:

  • train

  • validation

Under the two folders above, each folder still contains two subfolders:

  • cats

  • dogs

We end up with the following file structure:

data/
    train/
        dogs/
            dog001.jpg
            dog002.jpg
            ...
        cats/
            cat001.jpg
            cat002.jpg
            ...
    validation/
        dogs/
            dog001.jpg
            dog002.jpg
            ...
        cats/
            cat001.jpg
            cat002.jpg
            ...

This file structure lets our model know which folder to pull images and their labels from during training or validation. Below is a sketch of a function that rebuilds this file tree; it takes two parameters: the total number of images n, and the ratio of images held out for validation.
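The original article’s listing for this function did not survive here, so the following is a minimal sketch of what organize_datasets might look like, matching the call shown next; the implementation details (shuffling, copying into the data/ tree) are assumptions.

import random
import shutil
from pathlib import Path

def organize_datasets(path_to_data, n=25000, ratio=0.2):
    # Gather the labeled training images and shuffle them
    files = list(Path(path_to_data).glob('*.jpg'))
    random.shuffle(files)
    files = files[:n]
    # Hold out a fraction `ratio` of the images for validation
    n_val = int(len(files) * ratio)
    val_files, train_files = files[:n_val], files[n_val:]
    # Rebuild the data/ tree from scratch
    shutil.rmtree('./data/', ignore_errors=True)
    for split, split_files in [('train', train_files), ('validation', val_files)]:
        for f in split_files:
            # The label is encoded in the file name, e.g. 'dog.123.jpg'
            label = 'dogs' if 'dog' in f.name else 'cats'
            dest = Path('./data') / split / label
            dest.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(str(f), str(dest / f.name))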

I used:

  • n = 25000 (the size of the entire dataset)

  • ratio = 0.2

ratio = 0.2
n = 25000
organize_datasets(path_to_data='./train/', n=n, ratio=ratio)

Now let’s load Keras and its dependencies:
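The original import listing is not reproduced here; a typical set of imports covering everything used below would be (assuming Keras 2.x on the TensorFlow backend):

import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.preprocessing.image import ImageDataGenerator
from keras.optimizers import SGD
from keras.callbacks import Callback, EarlyStopping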

Image generators and data augmentation

When training the model, we won’t load the entire dataset into memory, as that would be inefficient, especially if you’re using your own local machine.

We’ll use the ImageDataGenerator class, which can stream batches of images from the training and validation folders indefinitely. Within the ImageDataGenerator, we introduce random transformations in each batch.

This process is called data augmentation. It generates more training images, so our model almost never sees two identical images. This helps prevent overfitting and gives the model better generalization.

We’re going to create two ImageDataGenerator objects: train_datagen for the training set and val_datagen for the validation set. Both will rescale the images; train_datagen will also apply some other transformations.
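A sketch of the two objects follows; the specific augmentation parameters (shear, zoom, horizontal flip) are assumptions modeled on the standard Keras image tutorial rather than the article’s exact values.

# Training generator: rescaling plus random augmentation
train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

# Validation generator: rescaling only
val_datagen = ImageDataGenerator(rescale=1. / 255)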

Based on the previous two objects, we then create two file generators:

  • train_generator

  • validation_generator

Each generator yields batches of image data from its directory, with real-time data augmentation; the data is generated in an endless loop.
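Here is a sketch of the two generators, assuming the data/ tree built earlier; the 150 × 150 target size and the batch size of 32 are assumptions.

batch_size = 32

train_generator = train_datagen.flow_from_directory(
    './data/train/',
    target_size=(150, 150),  # resize every image to 150x150
    batch_size=batch_size,
    class_mode='binary')     # two classes: cats and dogs

validation_generator = val_datagen.flow_from_directory(
    './data/validation/',
    target_size=(150, 150),
    batch_size=batch_size,
    class_mode='binary')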

Model structure

I will use a CNN with three convolution/pooling blocks and two fully connected layers. The three convolutional layers use 32, 32, and 64 3×3 filters respectively. I apply dropout to both fully connected layers to avoid overfitting.

I used stochastic gradient descent for optimization, with a learning rate of 0.01 and momentum of 0.9.
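A sketch of this architecture follows. The filter counts (32, 32, 64), the SGD settings, and the use of dropout come from the text above; the width of the hidden dense layer and the dropout rate are assumptions.

model = Sequential()
# Three convolution/pooling blocks with 32, 32, and 64 3x3 filters
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
# Two fully connected layers, each preceded by dropout
model.add(Flatten())
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))  # binary output: dog vs cat
# Stochastic gradient descent, learning rate 0.01, momentum 0.9
model.compile(loss='binary_crossentropy',
              optimizer=SGD(lr=0.01, momentum=0.9),
              metrics=['accuracy'])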

Keras provides a very convenient way to display a full picture of the model. For each layer, it shows the output shape and the number of trainable parameters. It is wise to check this before starting to fit the model.

model.summary()

Let’s take a look at the structure of the network.

[Figure: visualization of the network structure]

Before training the model, I define two callback functions that will be called during training:

  • One stops training early if the loss on the validation data stops improving.

  • One stores the loss and accuracy metrics at each epoch, which we can use to chart the training error. A sketch of both callbacks follows this list.
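Below is a minimal sketch of these two callbacks. The patience of 2 epochs matches the early stopping behavior described later; the exact fields recorded by the history callback are assumptions.

# Stop training when the validation loss has not improved for 2 epochs
early_stopping = EarlyStopping(monitor='val_loss', patience=2)

# Record loss and accuracy on both sets at the end of every epoch
class LossHistory(Callback):
    def on_train_begin(self, logs=None):
        self.losses, self.val_losses = [], []
        self.accs, self.val_accs = [], []

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.losses.append(logs.get('loss'))
        self.val_losses.append(logs.get('val_loss'))
        self.accs.append(logs.get('acc'))
        self.val_accs.append(logs.get('val_acc'))

history = LossHistory()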

I also used Keras-TQDM, which is a great progress bar that integrates perfectly with Keras. It makes it very easy to monitor the training process of the model.

To use it, you simply import the TQDMNotebookCallback class from keras_tqdm and pass it in as the third callback.

The figure below shows the effect of Keras-TQDM on a simple example.

A few more things to say about the training process:

  • We use the fit_generator method, a variant of the standard fit method that takes a generator as input; the full call is sketched after this list.

  • We trained the model over 50 epochs.
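Putting this together, the training call might look like the following sketch; deriving steps_per_epoch from the generator size is an assumption, and verbose=0 silences the default logging so the keras-tqdm progress bars are shown instead.

from keras_tqdm import TQDMNotebookCallback

model.fit_generator(
    train_generator,
    steps_per_epoch=train_generator.n // batch_size,
    epochs=50,
    verbose=0,
    callbacks=[early_stopping, history, TQDMNotebookCallback()],
    validation_data=validation_generator,
    validation_steps=validation_generator.n // batch_size)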

This model is very computationally intensive at runtime:

  • If you ran it on your own computer, each epoch would take about 15 minutes.

  • If, like me, you run it on a p2.xlarge virtual machine on EC2, each epoch takes 2 minutes.

Classification results

We achieved 89.4% accuracy after the model ran for 34 epochs (the training/validation errors and accuracy are shown below), which is a good result considering I didn’t spend much time designing the network architecture. Now we can save the model for later use.

model.save('./models/model4.h5')

Below, we plot the training and validation loss on the same graph:
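Using the history callback defined earlier, a minimal sketch of this plot:

import matplotlib.pyplot as plt

plt.plot(history.losses, label='training loss')
plt.plot(history.val_losses, label='validation loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()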

When the validation loss did not improve over two successive epochs, we stopped the training process early.

Below, we plot the accuracy on the training and validation sets.

Both metrics keep increasing until the model reaches a plateau and begins to overfit.

Load the pre-trained model

We got good results with our own CNN, but there is another way to get an even better score: loading the weights of a convolutional neural network pre-trained on a very large dataset whose 1,000 categories include cats and dogs.

Such a network will have learned features that are relevant to our classification task.

I’m going to load the weights of the VGG16 network; more precisely, I will load the weights of all its convolutional layers. This part of the network acts as a feature detector, and on top of the features it extracts we will add a fully connected layer.
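With Keras this takes a single call; the 150 × 150 input shape is an assumption, chosen to match the generators above.

from keras.applications import VGG16

# Convolutional part of VGG16, pre-trained on ImageNet;
# include_top=False drops the original fully connected classifier
base_model = VGG16(weights='imagenet', include_top=False,
                   input_shape=(150, 150, 3))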

Compared to LeNet5, VGG16 is a very large network, with 16 layers of trainable weights and about 140 million parameters. To learn more about VGG16, refer to the paper: https://arxiv.org/pdf/1409.1556.pdf

Now we feed the images into the network to get their feature representations, which will serve as input to a neural network classifier.

The images pass through the network in a fixed order, so we can easily associate each feature vector with its label.
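A sketch of this feature-extraction pass, reusing the generators from before (in practice you would use non-shuffling, non-augmenting generators so that features stay aligned with labels); the helper function extract_features is hypothetical.

import numpy as np

def extract_features(generator, steps):
    # Run each batch through the VGG16 convolutional base and keep
    # the resulting feature maps together with their labels
    features, labels = [], []
    for i, (x_batch, y_batch) in enumerate(generator):
        features.append(base_model.predict(x_batch))
        labels.append(y_batch)
        if i + 1 >= steps:
            break  # the generator loops forever, so stop after one pass
    return np.concatenate(features), np.concatenate(labels)

train_features, train_labels = extract_features(
    train_generator, train_generator.n // batch_size)
val_features, val_labels = extract_features(
    validation_generator, validation_generator.n // batch_size)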

Now we design a small fully connected neural network that takes the features extracted from VGG16 as input; it serves as the classification part of the CNN.
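A sketch of this classifier; the 512-unit hidden layer, the dropout rate, and the optimizer are assumptions, while the 15 epochs match the result quoted below.

classifier = Sequential()
classifier.add(Flatten(input_shape=train_features.shape[1:]))
classifier.add(Dense(512, activation='relu'))
classifier.add(Dropout(0.5))
classifier.add(Dense(1, activation='sigmoid'))

classifier.compile(loss='binary_crossentropy',
                   optimizer='rmsprop',
                   metrics=['accuracy'])

classifier.fit(train_features, train_labels,
               epochs=15,
               batch_size=batch_size,
               validation_data=(val_features, val_labels))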

After 15 epochs, the model achieved 90.7% accuracy. This is good enough, and note that each epoch now runs in just 1 minute, even on my own computer.

Many of the big names in deep learning encourage the use of pre-trained networks for classification tasks; indeed, pre-trained networks are usually very large networks trained on very large datasets.

Keras makes it easy to download pre-trained networks such as VGG16, GoogLeNet, and ResNet. To learn more, refer to: https://keras.io/applications/

There’s a great motto here: Don’t be a hero! Don’t reinvent the wheel! Use pre-trained networks!

What else can be done?

If you are interested in improving a traditional CNN, you can:

  • At the dataset level, introduce more data augmentation.

  • Tune the network hyperparameters: the number of convolutional layers, the number and size of filters, and test the effect of each combination.

  • Change the optimization method.

  • Try different loss functions.

  • Use more fully connected layers.

  • Bring in more aggressive dropout.

If you are interested in using the pre-training network to get better classification results, you can try:

  • Use different network structures.

  • Use more fully connected layers with more hidden units.

If you want to know exactly what the CNN model has learned, you can:

  • Visualize feature maps.

  • See https://arxiv.org/pdf/1311.2901.pdf

If you want to use a trained model:

  • You can deploy the model in a web app and test it with new cat and dog images. This is also a good way to test how well your model generalizes.

Conclusion

This is a step-by-step tutorial on how to build a deep learning environment on AWS and how to build an end-to-end model from scratch. It also shows you how to build a CNN model based on a pre-trained network.

 

Deep learning in Python is fun to do, and Keras makes it easy to preprocess data and set up network layers.

If one day you need to build a neural network of your own making, you may need to use other deep learning frameworks.

Nowadays, many people in the field of natural language processing are starting to use convolutional neural networks as well. Here is some work based on this:

  • Text classification with CNNs:

    https://chara.cs.illinois.edu/sites/sp16-cs591txt/files/0226-presentation.pdf

  • Automatically generating captions for images:

    https://cs.stanford.edu/people/karpathy/sfmltalk.pdf

  • Character-level text classification:

    https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf

By Ahmed Besbes, translated by Zhang Shengqiang

Editors: Tao Jialong, Sun Shujuan
