This is a translation of a post from The Keras Blog. The main purpose of translating it is to slow down and deepen my own learning, and perhaps to help others as well. The translation follows.


In this tutorial, we’ll show you some simple but effective ways to build powerful image classification models from very few training examples: only a few hundred to a few thousand images for each category you want to recognize.

We will go over the following approaches:

  • Training a small network from scratch as a baseline
  • Using the bottleneck features of a pre-trained network
  • Fine-tuning the top layers of a pre-trained network

This brings us to the following Keras features:

  • fit_generator, for training a Keras model using Python data generators
  • ImageDataGenerator, for real-time data augmentation
  • Layer freezing and model fine-tuning
  • ...and more.

Note: All examples were updated to the Keras 2.0 API on March 14, 2017. Running this code requires a Keras version number of 2.0.0 or greater.

Our setup: only 2000 training examples (1000 per class)

We will start from the following setup:

  • A machine with Keras, SciPy, and PIL installed. If you have an NVIDIA GPU you can use (with cuDNN installed), great, but it isn’t necessary since we are working with very few images.
  • A training data directory and a validation data directory, each containing one subdirectory per image class, filled with .png or .jpg images:
data/
    train/
        dogs/
            dog001.jpg
            dog002.jpg
            ...
        cats/
            cat001.jpg
            cat002.jpg
            ...
    validation/
        dogs/
            dog001.jpg
            dog002.jpg
            ...
        cats/
            cat001.jpg
            cat002.jpg
            ...

To get a few hundred or a few thousand training images for the classes you’re interested in, one option is to use the Flickr API to download pictures matching a given tag, under a friendly license.

In our example, we use two sets of pictures from Kaggle: 1000 cats and 1000 dogs (the original dataset has 12,500 cats and 12,500 dogs; we just took the first 1000 images of each class). We also use 400 additional images from each class as validation data, to evaluate our models.
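As a rough illustration of the split described above, the sketch below plans which files would go where: the first 1000 images of each class to training and the next 400 to validation, using the directory layout shown earlier. It assumes the Kaggle files are named like `cat.0.jpg` and `dog.0.jpg` (the pattern that appears later in this post); the helper name `plan_split` is made up for this example, and nothing is copied on disk.

```python
import os

def plan_split(classes=("cat", "dog"), n_train=1000, n_val=400):
    """Map each destination path to its source file name.

    Assumes Kaggle-style names such as 'cat.0.jpg' (an assumption; adjust
    the pattern to match your own files).
    """
    plan = {}
    for cls in classes:
        for i in range(n_train + n_val):
            subset = "train" if i < n_train else "validation"
            src = "{}.{}.jpg".format(cls, i)
            # Class folders are plural ('cats', 'dogs') in the layout above.
            dst = os.path.join("data", subset, cls + "s", src)
            plan[dst] = src
    return plan

plan = plan_split()
train_prefix = os.path.join("data", "train") + os.sep
n_train_files = sum(1 for dst in plan if dst.startswith(train_prefix))
n_val_files = len(plan) - n_train_files
print(n_train_files, n_val_files)
```

To actually build the folders, one could loop over `plan` and copy each source file to its destination, for instance with `shutil.copyfile` after creating the directories.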

That is very few examples to learn from for an image classification problem. So this is a challenging machine learning problem, but it is also a realistic one: in many real-world use cases, even small-scale data collection can be extremely expensive or sometimes near-impossible (e.g. in medical imaging). Being able to make the most out of very little data is a key skill of a competent data scientist.

How difficult is this problem? When Kaggle started the cats vs. dogs competition (with 25,000 training images in total), a bit over two years ago, it came with the following statement:

“In an informal poll conducted many years ago, computer vision experts posited that a classifier with better than 60% accuracy would be difficult without a major advance in the state of the art. For reference, a 60% classifier improves the guessing probability of a 12-image HIP from 1/4096 to 1/459. The current literature suggests machine classifiers can score above 80% accuracy on this task.”
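The two odds in the statement above can be checked with a couple of lines. The sketch assumes that each of the 12 images is classified independently, which is the assumption under which the quoted numbers work out; the function name is made up for this example.

```python
def hip_guess_probability(accuracy, n_images=12):
    """Chance of labelling all n_images correctly with independent guesses."""
    return accuracy ** n_images

for acc in (0.5, 0.6):
    p = hip_guess_probability(acc)
    print("accuracy {:.0%}: about 1 in {:.0f}".format(acc, 1 / p))
```

Random guessing (50%) gives 1 in 4096, and a 60% classifier gives roughly 1 in 459, matching the quote.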

In the resulting competition, top entrants were able to score over 98% accuracy by using modern deep learning techniques. In our case, because we restrict ourselves to only 8% of the dataset, the problem is much harder.

The role of deep learning in small data problems

A message I often hear is that “deep learning is only relevant when you have a huge amount of data”. While not entirely incorrect, this is somewhat misleading. Certainly, deep learning requires the ability to learn features automatically from the data, which is generally only possible when lots of training data is available, especially for problems where the input samples are very high-dimensional, like images. However, convolutional neural networks, a pillar algorithm of deep learning, are by design one of the best models available for most “perceptual” problems (such as image classification), even with very little data to learn from. Training a convnet from scratch on a small image dataset will still yield reasonable results, without the need for any custom feature engineering. Convnets are just plain good. They are the right tool for the job.

But what’s more, deep learning models are by nature highly repurposable: you can take, say, an image classification or speech-to-text model trained on a large-scale dataset and reuse it on a significantly different problem with only minor changes, as we will see in this post. Specifically in the case of computer vision, many pre-trained models (usually trained on the ImageNet dataset) are now publicly available for download and can be used to bootstrap powerful vision models out of very little data.

Data pre-processing and data augmentation

In order to make the most of our few training examples, we will “augment” them via a number of random transformations, so that our model never sees the exact same picture twice. This helps prevent overfitting and helps the model generalize better.

In Keras this can be done via the keras.preprocessing.image.ImageDataGenerator class. This class allows you to:

  • configure random transformations and normalization operations to be applied to your image data during training
  • instantiate generators of augmented image batches (and their labels) via .flow(data, labels) or .flow_from_directory(directory). These generators can then be used with the Keras model methods that accept data generators as inputs: fit_generator, evaluate_generator and predict_generator.

The following is an example:

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
        rotation_range=40,
        width_shift_range=0.2,
        height_shift_range=0.2,
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True,
        fill_mode='nearest')

These are just a few of the options available (for more, see the documentation). Let’s quickly go over what they mean:

  • rotation_range is a value in degrees (0-180), a range within which to randomly rotate pictures
  • width_shift_range and height_shift_range are ranges (as a fraction of total width or height) within which to randomly translate pictures horizontally or vertically
  • rescale is a value by which the data is multiplied before any other processing. Our original images consist of RGB coefficients in the 0-255 range, but such values would be too high for our models to process (given a typical learning rate), so we target values between 0 and 1 instead by scaling with a 1/255 factor
  • shear_range is for randomly applying shearing transformations
  • zoom_range is for randomly zooming inside pictures
  • horizontal_flip is for randomly flipping half of the images horizontally, which is relevant when there are no assumptions of horizontal asymmetry (e.g. real-world pictures)
  • fill_mode is the strategy used for filling in newly created pixels, which can appear after a rotation or a width/height shift
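To make a few of these options more concrete, here is a minimal pure-Python sketch of rescaling, a horizontal flip, and a horizontal shift with 'nearest' filling, applied to a tiny grayscale "image" stored as nested lists. It only illustrates the ideas; it is not how Keras implements them, and all function names are made up for this example.

```python
import random

def rescale(image, factor=1.0 / 255):
    """Multiply every pixel by `factor` (the idea behind rescale=1./255)."""
    return [[px * factor for px in row] for row in image]

def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def shift_right_nearest(image, pixels):
    """Shift columns right by `pixels`; fill the gap by repeating the edge pixel."""
    pixels = max(0, min(pixels, len(image[0])))
    return [[row[0]] * pixels + row[:len(row) - pixels] for row in image]

def random_augment(image, rng, max_shift=1):
    """Rescale, then apply a random flip and a random shift."""
    image = rescale(image)
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    return shift_right_nearest(image, rng.randint(0, max_shift))

tiny = [[0, 51, 255],
        [102, 153, 204]]
augmented = random_augment(tiny, random.Random(0))
print(augmented)
```

Calling `random_augment` repeatedly with the same input yields different variants of the same picture, which is the point of augmentation: the model rarely sees exactly the same pixels twice.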

Now let’s start generating some pictures using this tool and save them to a temporary directory, so we can get a feel for what our augmentation strategy is doing. To keep the resulting images displayable, we don’t apply value rescaling here.

from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img

datagen = ImageDataGenerator(
        rotation_range=40,
        width_shift_range=0.2,
        height_shift_range=0.2,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True,
        fill_mode='nearest')

img = load_img('data/train/cats/cat.0.jpg')  # This is a PIL picture
x = img_to_array(img)  # this is a Numpy array of the shape (3, 150, 150).
x = x.reshape((1,) + x.shape)  # this is a Numpy array of the shape (1, 3, 150, 150)

# The .flow() command below generates batches of randomly transformed images
# and saves the results to the `preview/` directory
i = 0
for batch in datagen.flow(x, batch_size=1,
                          save_to_dir='preview', save_prefix='cat', save_format='jpeg'):
    i += 1
    if i > 20:
        break  # otherwise the generator would loop indefinitely

Here is what we get; this is what our data augmentation strategy looks like.
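The explicit `break` in the preview loop above is needed because generators of this kind yield batches indefinitely. The following is a simplified pure-Python sketch of that behaviour, without shuffling or augmentation; `endless_batches` is a made-up name and not part of Keras.

```python
import itertools

def endless_batches(samples, labels, batch_size):
    """Yield (batch_samples, batch_labels) forever, wrapping around the data."""
    n = len(samples)
    start = 0
    while True:
        idx = [(start + k) % n for k in range(batch_size)]
        yield [samples[i] for i in idx], [labels[i] for i in idx]
        start = (start + batch_size) % n

gen = endless_batches(["a", "b", "c", "d"], [0, 0, 1, 1], batch_size=2)
# Take a fixed number of batches instead of looping forever.
first_three = list(itertools.islice(gen, 3))
print(first_three)
```

Because the generator never stops on its own, the caller decides how many batches make up an epoch. That is why the training calls later in this post pass an explicit number of steps.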

Training a small convolutional network from scratch: 80% accuracy in 40 lines of code

The right tool for an image classification job is a convnet, so let’s try to train one on our data as an initial baseline. Since we only have few examples, our number one concern should be overfitting. Overfitting happens when a model exposed to too few examples learns patterns that do not generalize to new data, i.e. when the model starts using irrelevant features for making predictions. For instance, if you, as a human, only see three images of people who are lumberjacks and three images of people who are sailors, and among them only one lumberjack wears a hat, you might start thinking that wearing a hat is a sign of being a lumberjack as opposed to a sailor. You would then make a pretty lousy lumberjack/sailor classifier.

Data augmentation is one way to fight overfitting, but it isn’t enough, since our augmented samples are still highly correlated. Your main focus for fighting overfitting should be the entropic capacity of your model: how much information your model is allowed to store. A model that can store a lot of information has the potential to be more accurate by leveraging more features, but it is also more at risk of storing irrelevant features. Meanwhile, a model that can only store a few features will have to focus on the most significant features found in the data, and these are more likely to be truly relevant and to generalize better.

In our case we will use a very small convnet with few layers and few filters per layer, alongside data augmentation and dropout. Dropout also helps reduce overfitting, by preventing a layer from seeing the exact same pattern twice, thus acting in a way analogous to data augmentation.
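To see why dropout keeps a layer from seeing the exact same pattern twice, here is a small pure-Python sketch of "inverted" dropout applied to one layer's activations. It is a simplified illustration rather than the Keras implementation, and the function name and sample values are made up.

```python
import random

def dropout(activations, rate, rng):
    """Zero each value with probability `rate`; scale survivors by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(42)
layer_output = [0.2, 1.5, 0.7, 3.0, 0.9, 1.1]

# The same activations pass through dropout twice, as they would if the
# same image were shown to the network in two different epochs.
pass_1 = dropout(layer_output, rate=0.5, rng=rng)
pass_2 = dropout(layer_output, rate=0.5, rng=rng)
print(pass_1)
print(pass_2)
```

Each pass draws a fresh random mask, so the next layer receives a differently "damaged" version of the same input every time. Scaling the surviving values keeps the expected sum of the activations unchanged.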

The code snippet below is our first model, a simple stack of 3 convolution layers with ReLU activations, each followed by a max-pooling layer. This is very similar to the architectures that Yann LeCun advocated in the 1990s for image classification (with the exception of ReLU).

The full code for this experiment can be found here.

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Activation, Dropout, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(3, 150, 150)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# the model so far outputs 3D feature maps (height, width, features)
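The size of those feature maps can be worked out by hand. The sketch below assumes the Keras defaults for the layers above: convolutions with 'valid' padding and stride 1, and non-overlapping 2x2 max pooling. The helper names are made up for this example.

```python
def conv_out(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling (any remainder is dropped)."""
    return size // pool

size = 150
shapes = []
for filters in (32, 32, 64):
    size = pool_out(conv_out(size))
    shapes.append((filters, size, size))  # (features, height, width)

# Number of values handed to the fully connected layers after flattening.
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes, flattened)
```

Starting from 150x150 inputs, the three blocks produce maps of 74x74, 36x36 and 17x17, so flattening the last output (64 maps of 17x17) gives vectors of 18,496 values.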

On top of it we stick two fully-connected layers. We end the model with a single unit and a sigmoid activation, which is well suited to binary classification. To go with it we will also use the binary_crossentropy loss to train our model.

model.add(Flatten())  # this converts our 3D feature maps to 1D feature vectors
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
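For readers who want to see what the sigmoid output and the binary cross-entropy loss used above compute, here is a minimal sketch for a single example. The clipping constant `eps` is an assumption added to avoid taking the logarithm of zero; the function names are illustrative.

```python
import math

def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    """Loss for one example: y_true is 0 or 1, p is the predicted probability of class 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

for score in (-3.0, 0.0, 3.0):
    p = sigmoid(score)
    print("score {:+.1f} -> p(dog) = {:.3f}, loss if it is a dog = {:.3f}".format(
        score, p, binary_crossentropy(1, p)))
```

The loss is small when the predicted probability agrees with the label and grows quickly when the model is confidently wrong, which is what makes it a good fit for a single sigmoid output.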

Let’s prepare our data. We will use .flow_from_directory() to generate batches of image data (and their labels) directly from our JPG files in their respective folders.

batch_size = 16

# this is the augmentation configuration we will use for training
train_datagen = ImageDataGenerator(
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True)

# this is the augmentation configuration we will use for testing:
# only rescaling
test_datagen = ImageDataGenerator(rescale=1./255)

# this generator reads pictures found in subfolders of 'data/train'
# and indefinitely generates batches of augmented image data
train_generator = train_datagen.flow_from_directory(
        'data/train',  # this is the target directory
        target_size=(150, 150),  # all images will be resized to 150x150
        batch_size=batch_size,
        class_mode='binary')  # binary_crossentropy loss needs binary labels

# this is a similar generator, for validation data
validation_generator = test_datagen.flow_from_directory(
        'data/validation',
        target_size=(150, 150),
        batch_size=batch_size,
        class_mode='binary')

We can now use these generators to train our model. Each epoch takes 20-30s on GPU and 300-400s on CPU, so it is definitely viable to run this model on CPU if you aren’t in a hurry.

model.fit_generator(
        train_generator,
        steps_per_epoch=2000 // batch_size,
        epochs=50,
        validation_data=validation_generator,
        validation_steps=800 // batch_size)
model.save_weights('first_try.h5')  # always save your weights after training or during training
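The step counts passed to fit_generator above are plain integer divisions of the number of images by the batch size. The short sketch below shows that arithmetic, along with a rounded-up variant for dataset sizes that do not divide evenly; the function names are illustrative.

```python
import math

def steps_floor(n_samples, batch_size):
    """Number of full batches; leftover samples are not counted."""
    return n_samples // batch_size

def steps_ceil(n_samples, batch_size):
    """Number of batches needed to cover every sample at least once."""
    return math.ceil(n_samples / batch_size)

print(steps_floor(2000, 16), steps_floor(800, 16))
print(steps_floor(2000, 32), steps_ceil(2000, 32))
```

With 2000 training images, 800 validation images and a batch size of 16, both divisions are exact (125 and 50 steps). With a batch size of 32, floor division would leave 16 training images uncounted in each epoch.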

This approach gets us to a validation accuracy of 0.79-0.81 after 50 epochs (a number that was picked arbitrarily: because the model is small and uses aggressive dropout, it does not seem to overfit much by that point). So at the time the Kaggle competition was launched, we would already have been “state of the art”, with 8% of the data and no effort to optimize our architecture or hyperparameters. In fact, in the Kaggle competition, this model would have scored in the top 100 (out of 215 entrants). I guess that at least 115 entrants weren’t using deep learning.

Using the bottleneck features of a pre-trained network: 90% accuracy in a minute

A more refined approach is to leverage a network pre-trained on a large dataset. Such a network has already learned features that are useful for most computer vision problems, and leveraging these features allows us to reach a better accuracy than any method that relies only on the available data.

We will use the VGG16 architecture, pre-trained on the ImageNet dataset, a model previously featured on this blog. Because the ImageNet dataset contains several “cat” classes (Persian cat, Siamese cat, etc.) and many “dog” classes among its 1000 classes, this model has already learned features that are relevant to our classification problem. In fact, merely recording the softmax predictions of the model over our data, rather than the bottleneck features, might be enough to solve our dogs vs. cats problem extremely well. However, the method we present here is more likely to generalize to a broader range of problems, including problems featuring classes absent from ImageNet.

Here is the structure of VGG16:

Our strategy is as follows: we only instantiate the convolutional part of the model, everything up to the fully-connected layers. We then run this model on our training and validation data once, recording the output (the “bottleneck features” from the VGG16 model: the last activation maps before the fully-connected layers) in two NumPy arrays. Then we train a small fully-connected model on top of the stored features.

The reason we store the features offline rather than adding our fully-connected model directly on top of a frozen convolutional base and running the whole thing is computational efficiency. Running VGG16 is expensive, especially if you are working on CPU, and we want to do it only once. Note that this prevents us from using data augmentation.
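The efficiency argument can be illustrated with a toy stand-in for the frozen network that simply counts how often it is run. Everything here (the class name, the fake "features", the three tiny "images") is made up for illustration and has nothing to do with VGG16 itself.

```python
class CountingExtractor:
    """Stand-in for an expensive frozen network; counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def __call__(self, image):
        self.calls += 1
        return [sum(image), max(image)]  # toy "bottleneck features"

images = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
epochs = 50

# Option A: run the frozen extractor on every image in every epoch.
online = CountingExtractor()
for _ in range(epochs):
    for img in images:
        online(img)

# Option B: extract the features once, then train on the stored values.
offline = CountingExtractor()
cached = [offline(img) for img in images]
for _ in range(epochs):
    for feats in cached:
        pass  # a small classifier would be updated here

print(online.calls, offline.calls)
```

The stored features are fixed, which is also why this approach rules out data augmentation: a randomly transformed image would need its own pass through the expensive network.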

You can find the full code for this experiment here, and you can get the pre-trained weights file from GitHub. We won’t review how the model is defined and loaded, since this is covered in several Keras examples already. But let’s take a look at how we record the bottleneck features using image data generators:

import numpy as np

batch_size = 16

generator = datagen.flow_from_directory(
        'data/train',
        target_size=(150, 150),
        batch_size=batch_size,
        class_mode=None,  # this means our generator will only yield batches of data, no labels
        shuffle=False)  # our data will be in order: the first 1000 images are cats, then 1000 dogs
# the predict_generator method returns the output of a model, given
# a generator that yields batches of NumPy data; in Keras 2 the second
# argument is a number of batches (steps), not a number of samples
bottleneck_features_train = model.predict_generator(generator, 2000 // batch_size)
# save the output as a NumPy array
np.save('bottleneck_features_train.npy', bottleneck_features_train)

generator = datagen.flow_from_directory(
        'data/validation',
        target_size=(150, 150),
        batch_size=batch_size,
        class_mode=None,
        shuffle=False)
bottleneck_features_validation = model.predict_generator(generator, 800 // batch_size)
np.save('bottleneck_features_validation.npy', bottleneck_features_validation)

Then use the saved data to train a small, fully connected network:

train_data = np.load('bottleneck_features_train.npy')
# the features were saved in order (cats first, then dogs), so recreating the labels is easy
train_labels = np.array([0] * 1000 + [1] * 1000)

validation_data = np.load('bottleneck_features_validation.npy')
validation_labels = np.array([0] * 400 + [1] * 400)

model = Sequential()
model.add(Flatten(input_shape=train_data.shape[1:]))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(train_data, train_labels,
          epochs=50,
          batch_size=batch_size,
          validation_data=(validation_data, validation_labels))
model.save_weights('bottleneck_fc_model.h5')
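The hand-built label arrays above rely on two things: the generator was created with shuffle=False, and flow_from_directory assigns class indices in alphanumeric order of the subdirectory names, so `cats` comes before `dogs`. The sketch below reproduces that ordering in plain Python; the helper name is made up, and it is worth checking the generator's own `class_indices` attribute before relying on this.

```python
def labels_in_directory_order(counts):
    """counts maps class folder name -> number of images.

    Returns (class_indices, labels), with classes ordered by sorted folder name.
    """
    names = sorted(counts)
    class_indices = {name: i for i, name in enumerate(names)}
    labels = []
    for name in names:
        labels.extend([class_indices[name]] * counts[name])
    return class_indices, labels

class_indices, train_labels = labels_in_directory_order({"dogs": 1000, "cats": 1000})
print(class_indices, len(train_labels))
```

If the data were shuffled, or the folders were named differently, the saved features and these labels would no longer line up.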

Thanks to its small size, this model trains very quickly even on CPU (about 1s per epoch):

Train on 2000 samples, validate on 800 samples
Epoch 1/50
2000/2000 [==============================] - 1s - loss: 0.8932 - acc: 0.7345 - val_loss: 0.2664 - val_acc: 0.8862
Epoch 2/50
2000/2000 [==============================] - 1s - loss: 0.3556 - acc: 0.8460 - val_loss: 0.4704 - val_acc: 0.7725
...
Epoch 47/50
2000/2000 [==============================] - 1s - loss: 0.0063 - acc: 0.9990 - val_loss: 0.8230 - val_acc: 0.9125
Epoch 48/50
2000/2000 [==============================] - 1s - loss: 0.0144 - acc: 0.9960 - val_loss: 0.8204 - val_acc: 0.9075
Epoch 49/50
2000/2000 [==============================] - 1s - loss: 0.0102 - acc: 0.9960 - val_loss: 0.8334 - val_acc: 0.9038
Epoch 50/50
2000/2000 [==============================] - 1s - loss: 0.0040 - acc: 0.9985 - val_loss: 0.8556 - val_acc: 0.9075

We reach a validation accuracy of 0.90-0.91: not bad at all. This is certainly partly due to the fact that the base model was trained on a dataset that already featured dogs and cats.

Fine-tuning the top layers of a pre-trained network

To further improve our previous result, we can try to “fine-tune” the last convolutional block of the VGG16 model alongside the top-level classifier. Fine-tuning consists in starting from a trained network, then re-training it on a new dataset using very small weight updates. In our case, this can be done in 3 steps:

  • Instantiate the convolutional base of VGG16 and load its weights
  • Add our previously defined fully-connected model on top, and load its weights
  • Freeze the layers of the VGG16 model up to the last convolutional block

Note that:

  • In order to perform fine-tuning, all layers should start with properly trained weights: for instance, you should not slap a randomly initialized fully-connected network on top of a pre-trained convolutional base. This is because the large gradient updates triggered by the randomly initialized weights would wreck the learned weights in the convolutional base. In our case, this is why we first train the top-level classifier, and only then start fine-tuning convolutional weights alongside it.
  • We choose to only fine-tune the last convolutional block rather than the entire network in order to prevent overfitting, since the entire network would have a very large entropic capacity and thus a strong tendency to overfit. The features learned by low-level convolutional blocks are more general and less abstract than those found higher up, so it is sensible to keep the first few blocks fixed (more general features) and only fine-tune the last one (more specialized features).
  • Fine-tuning should be done with a very slow learning rate, and typically with the SGD optimizer rather than an adaptive learning rate optimizer such as RMSProp. This makes sure that the magnitude of the updates stays very small, so as not to wreck the previously learned features.
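The combination of frozen layers and small SGD updates described above can be sketched in a few lines of plain Python. The layer names, weights and gradients below are invented, and the learning rate (1e-4) and momentum (0.9) are illustrative values rather than settings taken from the text. In Keras the equivalent steps would be setting `layer.trainable = False` on the frozen layers and compiling with an SGD optimizer that has a small learning rate.

```python
def sgd_momentum_step(layers, grads, velocities, lr=1e-4, momentum=0.9):
    """Update trainable layers in place with SGD + momentum; skip frozen ones."""
    for name, layer in layers.items():
        if not layer["trainable"]:
            continue  # frozen: weights are left untouched
        for i, g in enumerate(grads[name]):
            velocities[name][i] = momentum * velocities[name][i] - lr * g
            layer["weights"][i] += velocities[name][i]

layers = {
    "conv_block_1": {"weights": [0.5, -0.3], "trainable": False},
    "conv_block_5": {"weights": [0.8, 0.1], "trainable": True},
    "top_classifier": {"weights": [1.2, -0.7], "trainable": True},
}
grads = {name: [1.0, -2.0] for name in layers}       # made-up gradients
velocities = {name: [0.0, 0.0] for name in layers}   # momentum buffers

sgd_momentum_step(layers, grads, velocities)
for name, layer in layers.items():
    print(name, layer["weights"])
```

Even with gradients of magnitude 1 or 2, each trainable weight moves by only about 0.0001 to 0.0002 in this step, which is what keeps the pre-trained features from being wrecked.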

You can find the full code for this experiment here.

After instantiating the VGG16 convolutional base and loading its weights, we add our previously trained fully-connected classifier on top of it:

# build a classifier model to put on top of the convolutional model
top_model = Sequential()
top_model.add(Flatten(input_shape=model.output_shape[1:]))
top_model.add(Dense(256, activation='relu'))
top_model.add(Dropout(0.5))
top_model.add(Dense(1, activation='sigmoid'))

# note that it is necessary to start with a fully-trained
# classifier, including the top classifier,
# in order to successfully do fine-tuning
top_model.load_weights(top_model_weights_path)

# add the model on top of the convolutional base
model.add(top_model)

Finally, we fine-tune the model with a very slow learning rate:

batch_size = 16

# prepare data augmentation configuration
train_datagen = ImageDataGenerator(
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
        train_data_dir,
        target_size=(img_height, img_width),
        batch_size=batch_size,
        class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
        validation_data_dir,
        target_size=(img_height, img_width),
        batch_size=batch_size,
        class_mode='binary')

# fine-tune the model
model.fit_generator(
        train_generator,
        steps_per_epoch=nb_train_samples // batch_size,
        epochs=epochs,
        validation_data=validation_generator,
        validation_steps=nb_validation_samples // batch_size)

This approach gets us to a validation accuracy of 0.94 after 50 epochs. Great success!

Here are a few more approaches you can try to reach 0.95:

  • More aggressive data augmentation
  • More aggressive dropout
  • Use of L1 and L2 regularization (also known as “weight decay”)
  • Fine-tuning one more convolutional block (alongside greater regularization)
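As a rough sketch of what L1 and L2 regularization add to training, the snippet below computes both penalties for a handful of made-up weights and adds them to a made-up data loss. The strengths (0.01) are arbitrary illustrative values, and the function names are not part of any library.

```python
def l1_penalty(weights, strength):
    """Sum of absolute weights, scaled by `strength`."""
    return strength * sum(abs(w) for w in weights)

def l2_penalty(weights, strength):
    """Sum of squared weights, scaled by `strength`."""
    return strength * sum(w * w for w in weights)

def regularized_loss(data_loss, weights, l1=0.0, l2=0.0):
    """Total loss = loss on the data + penalties on the weights."""
    return data_loss + l1_penalty(weights, l1) + l2_penalty(weights, l2)

weights = [0.5, -2.0, 1.5]
print(l1_penalty(weights, 0.01))
print(l2_penalty(weights, 0.01))
print(regularized_loss(0.3, weights, l2=0.01))
```

Because large weights increase the total loss, the optimizer is pushed toward smaller weights, which limits how much information the model can store and therefore how easily it overfits.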

This post ends here! To recap, here is where you can find the code for our three experiments:

  • Convnet trained from scratch
  • Bottleneck features
  • Fine-tuning

If you have any comments about this post or suggestions about future topics to cover, you can reach out on Twitter.