From: www.cnblogs.com/youyou0/p/1…

Introduction

Traditional supervised learning is mainly single-label learning, but real-world samples are often complex, carrying multiple semantics and therefore multiple labels.

     A photo of a Dutch city

(1) Traditional single-label classification:

city

(2) Multi-label classification:

city, river, person, European style

(3) Human cognition:

Two men are walking by the river.

The architecture is European-style; I guess they are traveling.

The sky is blue; it should be sunny, but not too bright.

In comparison, single-label classification conveys the least information, human cognition extracts the most, and multi-label classification sits between the two.

Problem Description:

X = R^d denotes the d-dimensional input space, and Y = {y1, y2, …, yq} denotes the label space with q possible labels.

The training set is D = {(x_i, Y_i) | 1 ≤ i ≤ m}, where m is the size of the training set and the subscript i indexes samples.

x_i ∈ X is a d-dimensional vector, and Y_i ⊆ Y is the set of labels associated with x_i.

The task is to learn a multi-label classifier h(·) that predicts h(x) ⊆ Y as the correct label set for x.

A common practice is to learn a real-valued function f(x, y_j) that measures the relevance of x and y_j, hoping that f(x, y_j1) > f(x, y_j2) whenever y_j1 ∈ Y_i and y_j2 ∉ Y_i.
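As a minimal sketch of this setup (with made-up labels and scores standing in for a trained model), the label set Y_i can be stored as a binary indicator vector, and the ranking condition on f can be checked directly:

```python
import numpy as np

# Toy multi-label setting: q = 4 labels; a sample's label set Y_i is
# stored as a binary indicator vector of length q.
labels = ["city", "river", "person", "european"]
y_i = np.array([1, 1, 0, 0])  # this sample's relevant labels: {city, river}

# Hypothetical relevance scores f(x_i, y_j) for each label (in practice,
# the output of a trained model).
f_scores = np.array([2.3, 1.7, 0.4, -0.8])

# The ranking condition: every relevant label should score higher than
# every irrelevant label.
relevant = f_scores[y_i == 1]
irrelevant = f_scores[y_i == 0]
print(relevant.min() > irrelevant.max())  # True for this toy example
```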

Datasets and evaluation metrics

1. Existing datasets

NUS-WIDE is a web-image dataset with user-provided tags, containing 269,648 images collected from websites and 5,018 distinct tags.

Six types of low-level features were extracted from these images: a 64-D color histogram, a 144-D color correlogram, a 73-D edge direction histogram, 128-D wavelet texture features, 225-D block-wise color moments, and a 500-D bag of visual words based on SIFT descriptors.

Address: lms.comp.nus.edu.sg/research/NU…

 

The MS-COCO dataset includes 91 object categories, 328,000 images, and 2,500,000 labels.

All object instances are annotated with detailed segmentation masks, totaling over 500,000 object instances.

Address: cocodataset.org/

  

PASCAL VOC dataset. The primary goal of this challenge is to recognize objects from a number of visual object classes in realistic scenes. It is fundamentally a supervised learning problem, since a labeled training set of images is provided. The 20 object categories are:

    • Person: person
    • Animal: bird, cat, cow, dog, horse, sheep
    • Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
    • Indoor: bottle, chair, dining table, potted plant, sofa, TV/monitor

The train/val data consists of 11,530 images containing 27,450 ROI-annotated objects and 6,929 segmentations.

Address: host.robots.ox.ac.uk/pascal/VOC/…

 

Tencent AI Lab’s open-source ML-Images dataset includes 18 million training images and more than 11,000 common object categories.

2. Evaluation metrics

These can be divided into three categories:

  • Example-based metrics (first evaluate each sample over all labels, then average over samples; less commonly used)
  • Metrics over all samples (evaluate all labels directly over all samples)
  • Label-based metrics (first evaluate each label over all samples, then average over labels)

Metrics over all samples

Precision, Recall, and F1 (the natural extension of precision, recall, and F1 from single-label learning):

    OP = Σ_i N_i^c / Σ_i N_i^p
    OR = Σ_i N_i^c / Σ_i N_i^g
    OF1 = 2 · OP · OR / (OP + OR)

N_i^c: the number of images correctly predicted to have the i-th label; N_i^p: the number of images predicted to have the i-th label; N_i^g: the number of images whose ground truth contains the i-th label.

 

 

Label-based metrics

Precision, Recall, and F1 (per-class versions, averaged over the q labels):

    CP = (1/q) Σ_i N_i^c / N_i^p
    CR = (1/q) Σ_i N_i^c / N_i^g
    CF1 = 2 · CP · CR / (CP + CR)
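As a sketch (with made-up prediction and ground-truth matrices), the overall and per-class precision/recall/F1 can be computed from per-label counts like so:

```python
import numpy as np

# y_true, y_pred: binary matrices of shape (num_images, num_labels);
# the values here are made up for illustration.
y_true = np.array([[1, 0, 1],
                   [0, 1, 1],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 0, 0]])

# Per-label counts: correct (N_i^c), predicted (N_i^p), ground truth (N_i^g).
Nc = (y_true * y_pred).sum(axis=0)
Np = y_pred.sum(axis=0)
Ng = y_true.sum(axis=0)

# Overall metrics: sum the counts over all labels first.
OP = Nc.sum() / Np.sum()
OR = Nc.sum() / Ng.sum()
OF1 = 2 * OP * OR / (OP + OR)

# Per-class metrics: average the per-label ratios over the q labels.
CP = (Nc / Np).mean()
CR = (Nc / Ng).mean()
CF1 = 2 * CP * CR / (CP + CR)

print(OP, OR, OF1)  # 1.0 0.666... 0.8
print(CP, CR, CF1)  # 1.0 0.666... 0.8
```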

 

mAP (Mean Average Precision)

P: precision, here defined from the relevance ranking of labels (note that this differs in meaning from the three precision metrics above):

    precision_f(x_i, y_j1) = |{y_j2 ∈ Y_i : rank_f(x_i, y_j2) ≤ rank_f(x_i, y_j1)}| / rank_f(x_i, y_j1)

AP: average precision, the mean P value within each category.

mAP: mean average precision, the mean AP over all categories.

Here rank_f(x_i, y_j) denotes the rank of label y_j when all labels are sorted in descending order of f(x_i, ·); a larger rank value means lower relevance.
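As a sketch of the per-category variant commonly used in multi-label image classification (scores and ground truth are made up): for one label, rank all images by score and average the precision at each rank where a true positive occurs; mAP is then the mean of these APs over all labels.

```python
import numpy as np

def average_precision(scores, positives):
    """AP for one label: rank samples by score (descending) and average
    the precision@k at every position k where a true positive occurs."""
    order = np.argsort(-scores)      # indices sorted by descending score
    hits = positives[order]          # 1 where a ranked sample is a true positive
    cum_hits = np.cumsum(hits)
    ranks = np.arange(1, len(scores) + 1)
    precisions = cum_hits / ranks    # precision@k for every k
    return precisions[hits == 1].mean()

# Toy example: 5 images scored for one label, 2 of them truly have it.
scores = np.array([0.9, 0.2, 0.75, 0.4, 0.1])
positives = np.array([1, 0, 1, 0, 0])
ap = average_precision(scores, positives)
print(ap)  # 1.0: both positives are ranked above all negatives
```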

 

Learning algorithms

1. Three strategies (based on how label relationships are modeled)

The main difficulty of multi-label learning lies in the explosive growth of the output space: with 20 labels, there are 2^20 possible label sets. Labels, however, are correlated. For example, an image tagged with rainforest and soccer is more likely to also carry the Brazil tag, and a document labeled entertainment is less likely to be political. Effectively mining the correlations between labels is therefore the key to successful multi-label learning. According to the strength of the correlations they mine, multi-label algorithms can be divided into three categories.

    • First-order strategies: ignore correlations between labels, e.g., decompose the problem into independent binary classification problems, one per label (simple and efficient).
    • Second-order strategies: consider pairwise correlations between labels, e.g., rank relevant labels above irrelevant ones.
    • Higher-order strategies: consider correlations among multiple labels, e.g., for each label, model the influence of all other labels (best performance).
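The rainforest/soccer/Brazil example above can be sketched numerically. In this toy (all scores and the co-occurrence matrix are made up), a second pass adjusts each label's score by co-occurrence contributions from the labels predicted in a first, independent pass:

```python
import numpy as np

labels = ["rainforest", "soccer", "brazil", "politics"]

# Base relevance scores from an image-only model (hypothetical values).
base = np.array([2.0, 1.5, -0.2, -1.0])

# Hypothetical label co-occurrence matrix: C[i, j] > 0 means label j
# becomes more likely when label i is present.
C = np.array([[ 0.0,  0.0,  0.8, -0.5],
              [ 0.0,  0.0,  0.6, -0.3],
              [ 0.8,  0.6,  0.0, -0.2],
              [-0.5, -0.3, -0.2,  0.0]])

# First pass (first-order): labels predicted from base scores alone.
first_pass = base > 0

# Second pass: add the co-occurrence contributions of the labels
# already predicted, then threshold again.
adjusted = base + C[first_pass].sum(axis=0)
second_pass = adjusted > 0

print([l for l, p in zip(labels, first_pass) if p])   # ['rainforest', 'soccer']
print([l for l, p in zip(labels, second_pass) if p])  # brazil now appears too
```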

2. Two approaches (based on how multi-label classification is combined with existing algorithms)

    • Adapt the data to the algorithm: e.g., treat each label combination as a single new class, which quickly produces too many classes.
    • Adapt the algorithm to the data: e.g., have the network output q-dimensional scores, replace the softmax with a per-label sigmoid, and finally output every label whose f(·) value exceeds a threshold.
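A minimal sketch of the second approach (the logits are made-up values standing in for a network's last layer): apply a sigmoid per label instead of a softmax, then threshold.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# q-dimensional logits from the network's last layer (hypothetical values).
logits = np.array([3.1, -0.4, 1.2, -2.5])

# Per-label sigmoid instead of softmax: each label gets an independent
# probability, so several labels can be "on" at once.
probs = sigmoid(logits)

# Output every label whose probability exceeds a threshold.
threshold = 0.5
predicted = probs > threshold
print(predicted)  # [ True False  True False]
```

Unlike a softmax, the per-label probabilities do not compete for a single slot, which is exactly what multi-label prediction needs.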

3. Multi-label CNN (VGG, ResNet-101)

This is the standard CNN model; it does not consider any label dependency and thus belongs to the first-order strategies. The methods that follow are all higher-order strategies.

4. Label embedding

Label embedding is not a complete network by itself, but a component used within a network to model the connections between labels.

            (a) one-hot encoding                                                                                          (b) embedding

Consider a neural network over a vocabulary of just four words: girl, woman, boy, and man, and think about the difference between the two representations. We know how these words relate to each other, but the computer does not. With one-hot encoding, each word occupies its own node in the input layer, and training the network means learning the weight of each connection. Looking only at the first layer (with three hidden nodes, as in the figure), we must determine 4 × 3 = 12 connection weights. Because the input dimensions are independent of each other, training data for girl does not help any other word at all, so the amount of data required for training stays essentially fixed.

Suppose we manually find the relationship f between these four words and represent them with just two nodes, one for gender and one for age. Girl can then be encoded as the vector [0, 1] and man as the vector [1, 1] (the first dimension is gender and the second is age).

Now the number of connection weights to learn in the first layer drops to 2 × 3 = 6. At the same time, when a training example for girl is fed in, it is encoded via the two shared nodes, so other inputs that share a connection with girl are trained as well (for example, woman, which shares the female node, and boy, which shares the child node).

Generally speaking, label embedding aims to achieve the effect of the second network and reduce the amount of training data required: it automatically learns the mapping f from the input space to a distributed representation space directly from the data.

 

5. CNN + RNN (CNN-LSTM)

The network framework is divided into a CNN part and an RNN part: the CNN extracts the semantic information from the image, while the RNN models the image/label relationship and the label dependencies.

                     A network model

In addition, the RNN shifts its attention to different places when recognizing different objects, as shown in the figure below:

When predicting zebra, the network focuses its attention on the zebra region.

This is a higher-order strategy that considers label dependencies at the global level.

6. RLSD

Building on CNN-RNN, RLSD adds latent semantic dependencies between regions, further optimizing the algorithm by considering the correlation between image location information and labels.

                  The RLSD network

7. HCP

The basic idea of HCP is to first extract candidate regions from the image (typically hundreds of them), then classify each candidate region, and finally aggregate all candidate-region results by cross-hypothesis max-pooling to obtain the multi-label prediction for the whole image. An attention-like mechanism is also at work, as shown below:

    

Attention mechanism: candidate regions containing car, person, and horse receive high weights and thus high attention. The advantage is that no position annotations are needed during training: the algorithm proposes many boxes and automatically adjusts the weights of the boxes relevant to each label, which reduces noise.
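Cross-hypothesis max-pooling itself is simple to sketch (the region scores below are made-up values): the image-level score for each label is the maximum over all candidate regions, so noisy background regions are ignored as long as one region responds strongly.

```python
import numpy as np

labels = ["car", "person", "horse"]

# Per-region label scores: 5 candidate regions (hypotheses) x 3 labels
# (hypothetical values; in HCP these come from a shared CNN classifier).
region_scores = np.array([[0.9, 0.1, 0.0],   # region mostly containing a car
                          [0.2, 0.8, 0.1],
                          [0.1, 0.7, 0.2],
                          [0.0, 0.1, 0.9],
                          [0.1, 0.1, 0.1]])  # background region (noise)

# Cross-hypothesis max-pooling: take the max over regions per label.
image_scores = region_scores.max(axis=0)
print(image_scores)  # [0.9 0.8 0.9]
```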

 

Conclusion

1. Open problems

Multi-label classification still faces the same difficulties as single-label classification and object detection, such as occlusion and small-object recognition.

In addition, because the number of labels is relatively large, the number of possible label sets grows exponentially with the number of categories, and the distribution of samples over labels is often severely imbalanced.

2. Application areas

Image search, and semantic annotation of images and videos.

3. Research directions

Overall, multi-label classification involves multiple labels and thus needs more information about images and labels, which means the number of possible outputs grows exponentially.

To reduce this space of possibilities, the relationships among labels, and between labels and images, must be exploited:

    • The first direction concerns relationships between labels, i.e., the relationships between words as in NLP, at the semantic level.
    • The second concerns relationships between labels and images, i.e., between labels and image features; attention mechanisms are commonly used here.