Editor’s note: This article is a solution to Kaggle’s Plant Seedlings Classification challenge. The author, Kumar Shridhar, finished fifth. The method is quite general and can be applied to other image recognition tasks.

An overview of the task

Can you tell a weed from a crop?

If we can identify weeds efficiently, we can increase food production and better manage the environment. Aarhus University’s signal processing group, in collaboration with the University of Southern Denmark, published a dataset covering 12 plant species, with images of nearly 960 unique plants at different stages of growth.

One of the sample plants

The dataset is publicly available and contains annotated RGB images with a physical resolution of about 10 pixels per millimeter. Here is a sample of the 12 species in the dataset:

To classify each image into its own category, the task can be broken down into five steps:

The first step

The first and most important task in machine learning is to analyze the dataset: before you start designing algorithms, it is important to understand how complex the data is.

The distribution of various images in the dataset is as follows:

As mentioned earlier, there are 12 plant species and 4750 photos in total. However, as the chart above shows, the class distribution is uneven: some species have 654 images while others have only 221. This is a clear indication that the data is imbalanced, which we will address in step 3.

Image Type Distribution

Visualizing the images makes the data easier to understand. The pictures below show a plant from each of the 12 species so they can be compared side by side.

At this scale, all the plants look alike, and the raw images reveal little. So I decided to use the visualization technique t-SNE to examine the dataset.

t-SNE is a dimensionality-reduction technique for high-dimensional datasets that can scale to large real-world data via the Barnes-Hut approximation. Related coverage:

  • Google interns proposed real-time t-SNE visualization for large, high-dimensional datasets

  • Visualizing MNIST: exploring image dimensionality reduction

t-SNE visualization of the dataset
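The article does not include the code behind this plot. Below is a minimal sketch of how such a visualization can be produced with scikit-learn, assuming the images have already been loaded into NumPy arrays (the `.npy` file names here are placeholders, not from the original article):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Assumed placeholders: `images` is an (N, H, W, 3) array of resized
# training images and `labels` is an (N,) array of integer class ids.
images = np.load("train_images.npy")
labels = np.load("train_labels.npy")

# Flatten each image into a single feature vector for t-SNE.
features = images.reshape(len(images), -1).astype(np.float32)

# Barnes-Hut t-SNE scales to larger datasets than the exact method.
embedding = TSNE(n_components=2, method="barnes_hut", perplexity=30,
                 random_state=42).fit_transform(features)

plt.figure(figsize=(8, 8))
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab20", s=5)
plt.title("t-SNE embedding of the seedling images")
plt.show()
```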

Even up close, it’s hard to tell the classes apart. So we need to figure out whether the data is hard to discern only for people, or also for machine learning models. For that, we need a baseline.

Training and validation sets

Before we can establish a baseline for the model, we need to split the data into a training set and a validation set. In general, the model is first trained on the training set, then evaluated on the validation set, and improved over time based on its validation performance. When the validation results are satisfactory, we can apply the model to the real test set to see whether it is over-fitting or under-fitting, and adjust accordingly.

We used 80% of the 4750 images in the dataset as the training set and 20% as the validation set.
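For reference, here is a minimal sketch of such a split using scikit-learn. The `image_paths` and `class_labels` lists are assumed placeholders built by scanning the per-class folders; a stratified split is a reasonable choice given the class imbalance, though the article does not say whether stratification was used:

```python
from sklearn.model_selection import train_test_split

# Assumed placeholders: parallel lists of file paths and class ids.
train_paths, val_paths, train_y, val_y = train_test_split(
    image_paths, class_labels,
    test_size=0.20,          # the 80/20 split described above
    stratify=class_labels,   # keep class proportions in both sets
    random_state=42,
)
```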

The second step

With the training and validation sets ready, we can start benchmarking on the dataset. For this task, we will use a convolutional neural network (CNN) architecture. If you are a beginner, it is worth first reading an article on the basics of deep learning, such as Notes on Deep Learning.

There are many ways to create CNN models; we chose the Keras deep learning library. Training a CNN from scratch is very inefficient, so we instead took model weights pre-trained on ImageNet and fine-tuned them. The early layers of a network learn simple, generic features, so they can be reused without retraining. It is important to consider how similar our dataset is to ImageNet and how large it is; these two characteristics determine how we should fine-tune. See Andrej Karpathy’s notes for details: cs231n.github.io/transfer-learning/

For this task, our dataset is small but very similar to ImageNet. So we can start from the ImageNet weights directly and add an output layer for the 12 plant species at the end to get a first benchmark. Then we unfreeze the last few layers and continue training.

We used Keras to establish the initial benchmark because Keras provides several pre-trained models, from which we chose ResNet50 and InceptionResNetV2.

We selected the benchmark models based on their performance on the ImageNet dataset and on the number of parameters in each model.

For the first benchmark, I removed the last output layer and added an output layer with 12 categories. The details are as follows:
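The snippet referred to here is not reproduced in this translation. What follows is a minimal Keras sketch of the described setup, with ResNet50 as the backbone; the global-average-pooling head is my assumption, not necessarily what the author used:

```python
from keras.applications.resnet50 import ResNet50
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

# Load ResNet50 with ImageNet weights and without its 1000-class head.
base = ResNet50(weights="imagenet", include_top=False,
                input_shape=(224, 224, 3))

# Freeze the pre-trained layers for the first benchmark.
for layer in base.layers:
    layer.trainable = False

# New 12-class output layer for the seedling species.
x = GlobalAveragePooling2D()(base.output)
outputs = Dense(12, activation="softmax")(x)

model = Model(inputs=base.input, outputs=outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```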

The model was trained for 10 epochs and saturated after the sixth. Training accuracy was 88 percent and validation accuracy 87 percent.

To improve performance further, we unfroze some of the deeper layers and trained them with an exponentially decaying learning rate. This yielded a 2 percent improvement.

Here are the hyperparameters used in this process:
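As a hedged sketch of this fine-tuning stage, the code below uses illustrative values standing in for the author’s hyperparameters; the number of unfrozen layers and the decay constants are assumptions, and `train_x`, `train_y`, `val_x`, `val_y` are assumed one-hot-prepared arrays:

```python
import math
from keras.callbacks import LearningRateScheduler
from keras.optimizers import Adam

# Unfreeze the last few layers (the exact count is an assumption;
# the article only says some deeper layers were unfrozen).
for layer in model.layers[-10:]:
    layer.trainable = True

# Recompile so the trainability change takes effect.
model.compile(optimizer=Adam(lr=1e-4), loss="categorical_crossentropy",
              metrics=["accuracy"])

def exp_decay(epoch):
    """Exponentially decay the learning rate each epoch (illustrative values)."""
    initial_lr, k = 1e-4, 0.1
    return initial_lr * math.exp(-k * epoch)

model.fit(train_x, train_y, validation_data=(val_x, val_y), epochs=10,
          callbacks=[LearningRateScheduler(exp_decay)])
```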

The third step

Once we have a baseline, we need to improve on it. We can start by augmenting the data to increase the number of images in the dataset. Data, after all, is the essence of machine learning.

But as we mentioned earlier, the data is imbalanced, and we need to deal with that first.

Real datasets are rarely balanced, and models do not perform well on minority classes. Misclassifying a minority-class sample as a common one is therefore more costly than an ordinary misclassification.

To balance the data, two methods are commonly used:

1. The ADASYN algorithm. ADASYN synthesizes data for classes with few samples. Its core idea is to weight the minority classes according to how difficult they are to learn: more synthetic data is generated for classes that are harder to learn, and less for classes that are easier. ADASYN thus improves learning in two ways: (1) it reduces the bias caused by class imbalance, and (2) it adaptively shifts the classification decision boundary toward the difficult samples.

2. The SMOTE algorithm. SMOTE combines oversampling of the minority classes with undersampling of the majority classes; the best classifier performance is obtained by doing both at the same time.
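The article does not show how SMOTE was applied. One simple (and simplified) way is to run it on flattened pixel vectors with the imbalanced-learn library, as sketched below; the array names and the 64×64 image size are assumptions, and SMOTE interpolates in raw pixel space here, which may differ from the author’s setup:

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Assumed placeholders: `train_images` is an (N, 64, 64, 3) array and
# `train_labels` an (N,) array of class ids.
flat_x = train_images.reshape(len(train_images), -1)

# SMOTE synthesizes new minority-class samples by interpolating
# between a sample and its nearest neighbours in feature space.
smote = SMOTE(random_state=42)
balanced_x, balanced_y = smote.fit_resample(flat_x, train_labels)

# Restore the image shape for the CNN.
balanced_x = balanced_x.reshape(-1, 64, 64, 3)
```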

For this problem, SMOTE performed better than ADASYN. Once the dataset is balanced, the data can be augmented in the following ways (a Keras sketch follows the list):

  • scaling

  • cropping

  • flipping

  • rotation

  • translation

  • adding noise

  • changing lighting conditions

  • GAN-based synthesis
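Several of these transformations can be expressed with Keras’s built-in `ImageDataGenerator`; the parameter values below are illustrative, and noise injection, lighting changes, and GAN-based synthesis would need custom code:

```python
from keras.preprocessing.image import ImageDataGenerator
from keras.utils import to_categorical

augmenter = ImageDataGenerator(
    rotation_range=180,       # rotation
    zoom_range=0.2,           # scaling
    width_shift_range=0.1,    # translation
    height_shift_range=0.1,
    horizontal_flip=True,     # flipping
    vertical_flip=True,
)

# One-hot encode the labels and yield augmented batches on the fly.
train_flow = augmenter.flow(balanced_x, to_categorical(balanced_y, 12),
                            batch_size=32)
```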

The fourth step

To further improve performance, we need to tune the learning rate. To do that, we first have to find the model’s optimal learning rate, which requires plotting the loss against the learning rate to see where the loss starts to decrease.

In our case, 1e-1 looked like the ideal learning rate, but as we get closer and closer to the global minimum, we want to take smaller steps. One option is learning-rate annealing, but I used warm restarts instead. The optimizer was also changed from Adam to SGD, implementing SGDR (SGD with warm restarts).
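As a sketch, SGDR can be implemented as a Keras learning-rate schedule following Loshchilov and Hutter’s cosine annealing with warm restarts; the cycle length and rate bounds below are illustrative, not the author’s values:

```python
import math
from keras.callbacks import LearningRateScheduler
from keras.optimizers import SGD

model.compile(optimizer=SGD(lr=0.1, momentum=0.9),
              loss="categorical_crossentropy", metrics=["accuracy"])

def sgdr(epoch):
    """Cosine annealing with warm restarts: the learning rate decays
    from lr_max to lr_min within each cycle, then jumps back up."""
    lr_max, lr_min, cycle_len = 1e-1, 1e-5, 10   # illustrative values
    t = epoch % cycle_len
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / cycle_len))

model.fit(train_x, train_y, epochs=30,
          callbacks=[LearningRateScheduler(sgdr)])
```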

Next, we could train multiple architectures with the above techniques and fuse their results, which is called model ensembling. Although this method is common, it is computationally expensive. So I decided to use snapshot ensembling instead: train a single network and, as the cyclic learning rate drives it into different local minima during optimization, save the model parameters at each one.
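A sketch of snapshot ensembling on top of the cyclic schedule above: weights are saved just before each warm restart, when the model sits in a local minimum, and the snapshots’ predictions are averaged at test time. The file naming and cycle length are assumptions:

```python
import numpy as np
from keras.callbacks import Callback

class SnapshotSaver(Callback):
    """Save model weights at the end of every learning-rate cycle,
    i.e. just before each warm restart."""
    def __init__(self, cycle_len=10, prefix="snapshot"):
        super(SnapshotSaver, self).__init__()
        self.cycle_len = cycle_len
        self.prefix = prefix

    def on_epoch_end(self, epoch, logs=None):
        if (epoch + 1) % self.cycle_len == 0:
            self.model.save_weights(f"{self.prefix}_{epoch + 1}.h5")

def snapshot_predict(model, weight_files, x):
    """Average the softmax outputs of the saved snapshots."""
    preds = []
    for path in weight_files:
        model.load_weights(path)
        preds.append(model.predict(x))
    return np.mean(preds, axis=0)
```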

With the learning-rate schedule in place, I started resizing the images. I trained a model at 64×64 (fine-tuned from ImageNet), unfroze some layers, applied the cyclic learning rate and snapshot ensembling, then took that model’s weights, changed the input size to 299×299, fine-tuned from the 64×64 weights, and ran snapshot ensembling and warm restarts again.
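A sketch of this progressive-resizing idea: the convolutional weights of a fully convolutional backbone do not depend on the input size, so the 64×64 weights can be loaded into a 299×299 model. The `build_model` helper here is hypothetical and mirrors the benchmark model above; whether the author used this exact backbone at 64×64 is not stated:

```python
from keras.applications.resnet50 import ResNet50
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

def build_model(input_shape):
    """Hypothetical helper mirroring the benchmark model above."""
    base = ResNet50(weights="imagenet", include_top=False,
                    input_shape=input_shape)
    x = GlobalAveragePooling2D()(base.output)
    return Model(inputs=base.input,
                 outputs=Dense(12, activation="softmax")(x))

# Stage 1: train at 64x64 (compile and fit with the cyclic schedule), then:
small = build_model((64, 64, 3))
# ... training ...
small.save_weights("stage1.h5")

# Stage 2: the convolutional weights do not depend on input size,
# so they transfer directly into a 299x299 model for fine-tuning.
large = build_model((299, 299, 3))
large.load_weights("stage1.h5")
```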

The fifth step

The final step is to visualize the results to determine which categories perform best (or worst), and to make further adjustments to improve the results. A good way to understand the results is to build a confusion matrix.

From the confusion matrix, we can see where the model’s predicted labels differ from the true labels, and gradually reduce those errors. We can also apply more data augmentation to help the model learn the classes it confuses.
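A minimal sketch of building and plotting the confusion matrix with scikit-learn, using `val_x`, `val_y`, and `model` as before:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Predict classes on the validation set and compare with the true labels.
pred_y = np.argmax(model.predict(val_x), axis=1)
cm = confusion_matrix(val_y, pred_y)

plt.figure(figsize=(8, 8))
plt.imshow(cm, cmap="Blues")
plt.xlabel("Predicted class")
plt.ylabel("True class")
plt.colorbar()
plt.show()
```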

After submitting my solution, I was ranked first (though as the competition progressed, I am currently fifth).

Original article: medium.com/neuralspace/kaggle-1-winning-approach-for-image-classification-challenge-9c1188157a86