Recently, Mingfeng Ou from Zhejiang University shared a talk titled "Data-Centric Research and Exploration" in the TechBeat AI community. The content derives from his internship at Graviti (Gewu Titanium) and from the paper "Joining Datasets via Data Augmentation in the Label Space for Neural Networks", co-authored with Junbo Zhao (scientific advisor), Yunkai Cui (CEO), and Linji Xue (algorithm lead). Built on research with the Graviti Open Dataset, the paper was accepted at the ICML 2021 machine learning conference. The talk focuses on connecting and fusing homogeneous datasets from similar domains, and proposes solutions.

The content of the talk is shared below:

Data-Centric Research and Exploration

I. Research Background

By far the most popular paradigm in deep learning is end-to-end learning, shown in Figure 1. In real-world applications (step 1), there are usually multiple candidate datasets for the same task, yet we tend to select only one of them for model training. This not only wastes the other datasets but also limits the model, because the quality and quantity of data critically affect a deep model's generalization ability. This raises the question: why train a neural network with only one dataset? Why can't multiple datasets be used at the same time?

Figure 1. End-to-end learning paradigm

Given the above, dataset fusion is needed, and it comes in two flavors: 1. When label sets are consistent, datasets can be fused directly, i.e., samples under the same label are merged. 2. When labels are inconsistent, current schemes basically mix datasets in the latent vector space, as in transfer learning. This has many advantages, including effective transfer and fusion of relevant domain knowledge and a large reduction in the data volume and training cost of downstream tasks. But it still has defects: interpretability is weak because mixing happens in hidden layers, and the semantic information linking the datasets' labels goes unused.

Figure 2. Direct and indirect fusion of datasets

This raises the question: when labels are not unified, can we still perform direct fusion in the label space?

Starting from this question, this paper proposes a solution based on the semantic information of dataset labels: the labels of similar datasets are often semantically related in domain knowledge, so they can be connected by a graph. Specifically, as the example in Figure 3 shows, the three similar animal datasets on the left cannot be fused directly because their label granularities differ, but they can be linked by building a graph over their label sets.

Figure 3. Dataset connection through a label graph

Following this idea, and starting from the most basic single-label classification task, this paper proposes a framework that connects datasets in the label space via a label graph. As shown in Figure 4, the upper part shows three similar single-label datasets whose labels differ and cannot be fused directly, so each dataset trains its own single-label model. The lower part is our proposed scheme: the three datasets are fused to train a single model, and prediction changes from single-label prediction to predicting a path that ends at a label node. To aid understanding, Table 1 compares the ground-truth labels of the three datasets in Figure 3 before and after connection.

Figure 4. The traditional single-label prediction model vs. the graph-based path prediction model in this paper

Table 1. Label changes before and after dataset connection
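Table 1's transformation can be sketched as a small helper: given a child-to-parent map over the label graph, a single ground-truth label expands into its root-to-leaf path. The node names follow the Figure 7 example; the `parent` map itself is an illustrative assumption, not the paper's full graph.

```python
def label_to_path(label, parent):
    """Expand a single ground-truth label into its root-to-leaf path
    on the label graph, given a child -> parent map."""
    path = [label]
    while path[-1] in parent:          # walk up until the root
        path.append(parent[path[-1]])
    return list(reversed(path))        # root first, original label last

# Illustrative fragment of the graph from the pet example.
parent = {"Cat": "Animal", "Shorthair": "Cat",
          "British_Shorthair": "Shorthair"}

print(label_to_path("British_Shorthair", parent))
# -> ['Animal', 'Cat', 'Shorthair', 'British_Shorthair']
```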

The contributions of this work include:

· Proposes a new paradigm of directly connecting datasets in the label space;

· Develops a new training algorithm, with experimental results on image and text classification demonstrating the paradigm's effectiveness;

· Enhances interpretability and causal traceability compared with the traditional end-to-end paradigm.

II. Method Overview

Graph construction:

To better understand the construction process, take pet classification as an example. Suppose there are two datasets, A and B: a cat-dog binary classification dataset and a fine-grained cat-dog breed classification dataset. The graph is constructed in the following four steps (a partial subgraph is shown in Figure 5):

(1) Find the closest common ancestor of these labels in the taxonomy, Animal, and add it to the empty graph as the root node.

(2) From the taxonomy's perspective, choose the closest labels, Dog and Cat, as the two child nodes under the root.

(3) Repeat step 2, continuing to extend downward along the taxonomy. Consulting the relevant references, determine three main characteristics usable for dog classification, namely hair type, ear shape, and tail shape, and list their specific values: ① short hair, long hair; ② drop ears, erect ears, rose ears; ③ short tail, slender tail, curly tail, long wide tail. Do the same for cats, listing the values of hair type and color pattern: ① long hair, short hair; ② solid color, point color, tabby. All of these feature values are added to the graph as enhancement nodes.

(4) Connect each label node to its associated feature (enhancement) nodes, and connect the root node to the enhancement nodes.

Figure 5. Example of graph construction (partial subgraph)
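The four construction steps above can be sketched with a plain adjacency dict. The node names and the feature-to-breed wiring below are illustrative assumptions based on the examples in Figures 5 and 7, not the paper's exact graph:

```python
def build_label_graph():
    """Build a small label graph following the four-step recipe."""
    graph = {"Animal": []}                 # step 1: root node

    def add_edge(parent, child):
        graph.setdefault(parent, []).append(child)
        graph.setdefault(child, [])

    # Step 2: the nearest labels, Dog and Cat, under the root.
    for label in ("Dog", "Cat"):
        add_edge("Animal", label)

    # Step 3: taxonomy-derived feature values become enhancement nodes.
    features = {
        "Dog": ["short_hair", "long_hair",                      # hair type
                "drop_ears", "erect_ears", "rose_ears",         # ear shape
                "short_tail", "slender_tail",
                "curly_tail", "long_wide_tail"],                # tail shape
        "Cat": ["long_hair", "short_hair",                      # hair type
                "solid_color", "point_color", "tabby_color"],   # color pattern
    }
    for label, values in features.items():
        for v in values:
            add_edge(label, f"{label}:{v}")  # namespaced per branch

    # Step 4 (illustrative): hang a fine-grained breed label under its
    # enhancement node, matching the deterministic path of Figure 7.
    add_edge("Cat:short_hair", "British_Shorthair")
    return graph

graph = build_label_graph()
```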

Competitive nodes: To better capture the relationships between same-level nodes on the label graph, we define competitive nodes:

With ordinary Softmax, all categories compete with each other; in our architecture, competition exists only among competing nodes. Figure 6 compares Softmax and Block-Softmax.

Figure 6. Softmax vs. Block-Softmax
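A minimal NumPy sketch of the Block-Softmax idea from Figure 6: softmax is applied independently within each block of competing sibling nodes, rather than once over all nodes. The node-to-index layout is an illustrative assumption.

```python
import numpy as np

def block_softmax(logits, blocks):
    """Softmax applied independently inside each block of competing nodes.

    logits: 1-D array with one score per graph node.
    blocks: list of index lists; each list holds the sibling nodes that
            share a parent and therefore compete with one another.
    """
    probs = np.zeros_like(logits, dtype=float)
    for idx in blocks:
        z = logits[idx] - logits[idx].max()   # numerical stability
        e = np.exp(z)
        probs[idx] = e / e.sum()
    return probs

logits = np.array([2.0, 1.0, 0.5, 0.5, 3.0])
# e.g. nodes 0-1 are one sibling group, nodes 2-4 another
probs = block_softmax(logits, [[0, 1], [2, 3, 4]])
```

Unlike a single softmax over all five logits, each sibling group here yields its own probability distribution summing to 1.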

Deterministic paths: To handle cases where a category has deterministic characteristics, we define the deterministic path:

Figure 7. The only deterministic path in the figure is marked in red: Animal -> Cat -> Shorthair -> British_Shorthair

For training on deterministic paths, we adopt the teacher forcing strategy: each ground-truth path is treated as a sequence, the sequence is fed into the recurrent unit, and the decoder autoregressively predicts each token (node) in the sequence. The process is shown in the figure below. For a deterministic path P, all nodes on P go through the same steps, yielding the following loss function for backpropagation and optimization:

Figure 8. Training flow for deterministic paths
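Under the teacher-forcing view above, the per-path loss is a sum of cross-entropy terms, one per node along the ground-truth path, each normalized only over that node's competing siblings. A minimal NumPy sketch, with the decoder that produces `step_logits` abstracted away (names and shapes are assumptions):

```python
import numpy as np

def path_loss(step_logits, path_ids, blocks):
    """Teacher-forcing loss for one deterministic ground-truth path.

    step_logits: one 1-D logit array per decoding step (the decoder is fed
                 the ground-truth previous node, not its own prediction).
    path_ids:    ground-truth node index at each step.
    blocks:      competing sibling indices at each step; the softmax is
                 taken only over this block (Block-Softmax).
    """
    loss = 0.0
    for logits, target, block in zip(step_logits, path_ids, blocks):
        z = logits[block] - logits[block].max()
        log_prob = z - np.log(np.exp(z).sum())   # log-softmax over the block
        loss -= log_prob[block.index(target)]    # cross-entropy at this step
    return loss
```

For a one-step toy path with logits [2.0, 0.0] and target node 0, the loss reduces to the usual negative log-softmax, log(1 + e^-2).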

Nondeterministic paths: To handle cases where a category has nondeterministic characteristics, we define the nondeterministic path:

Figure 9. The three nondeterministic paths from Animal to British_Shorthair are marked in red
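The conclusion of this paper mentions a policy-gradient component in training; when a sample's path is nondeterministic, one natural option is to sample a candidate path and reward it for reaching the correct leaf. The toy REINFORCE sketch below scores whole paths with a single logit vector rather than the paper's step-by-step decoder, so it illustrates the principle only:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_step(theta, leaves, gold_leaf, lr=0.1):
    """One REINFORCE update over candidate nondeterministic paths.

    theta:  one logit per candidate path (toy stand-in for path scores).
    leaves: the leaf label each candidate path ends at.
    """
    probs = softmax(theta)
    i = rng.choice(len(leaves), p=probs)        # sample one path
    reward = 1.0 if leaves[i] == gold_leaf else 0.0
    grad = -probs                               # d log p_i / d theta ...
    grad[i] += 1.0                              # ... for a categorical sample
    return theta + lr * reward * grad           # ascend expected reward

# Three candidate paths; paths 0 and 2 reach the gold leaf, path 1 does not.
theta = np.zeros(3)
for _ in range(200):
    theta = reinforce_step(theta, ["BSH", "Persian", "BSH"], "BSH")
```

After a few hundred updates, the probability mass on the wrong path shrinks while the rewarded paths share the remainder.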

Overall model structure:

We design the overall structure according to the Encoder-Decoder framework. For the Encoder, EfficientNet-B4 is used in the image classification task, while BERT or LSTM serves as the feature extractor in the text classification task; the Decoder is uniformly a GRU.

Figure 10. Overall structure of the model for the image classification task
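To make the Encoder-Decoder wiring concrete, here is a dependency-free NumPy sketch of the GRU decoder stepping along a path, conditioned on pooled encoder features. The EfficientNet-B4/BERT encoder is replaced by a random feature vector, and all dimensions and the initialization scheme are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 8, 16   # node-embedding and hidden sizes (illustrative)

# Six weight matrices of a GRU cell: update gate, reset gate, candidate.
params = [rng.standard_normal(s) * 0.1
          for s in [(d_in, d_h), (d_h, d_h)] * 3]

def gru_step(x, h, params):
    """One GRU step: the decoder moves one node further along the path."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sig(x @ Wz + h @ Uz)                 # update gate
    r = sig(x @ Wr + h @ Ur)                 # reset gate
    h_cand = np.tanh(x @ Wh + (r * h) @ Uh)  # candidate state
    return (1 - z) * h + z * h_cand

# Pooled image features stand in for the EfficientNet-B4 encoder output.
h = rng.standard_normal(d_h) * 0.1
for _ in range(4):                        # a 4-node path, as in Figure 7
    x = rng.standard_normal(d_in) * 0.1   # previous node's embedding
    h = gru_step(x, h, params)            # h would feed the node classifier
```

In the real model, each step's hidden state `h` would be projected to logits over the current node's competing siblings before the next step.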

III. Experiments

Dataset setup:

Table 2. Dataset statistics (K = number of classes)

Figure 11. Data distribution of the two original datasets in Group 1 (left: Oxford-IIIT Pet; right: Dog vs. Cat), from Graviti's Open Dataset.

Figure 12. Data distribution of the two original datasets in Group 2 (left: 102 Category Flower; right: 17 Category Flower), from Graviti's Open Dataset.

This paper sets up three groups of datasets for the experiments; their statistics are shown in Table 2, Figure 11, and Figure 12. Groups 1 and 3 each fuse a fine-grained dataset with a coarse-grained one, with no intersection between their label sets. Group 2 fuses two datasets of the same granularity, with a label intersection of size 8. Note also that all tests were conducted on the harder, fine-grained datasets.

Note that the image datasets used in this paper all come from Graviti's Open Dataset community, and the Graviti data platform is used for efficient, simple connection and reading.

Baseline setup: For image classification tasks, we set three baselines:

  1. EfficientNet-B4 + FFN: the traditional single-label classification model.

  2. EfficientNet-B4 + Pseudo Labels: training-set fusion based on pseudo labels, i.e., fine-grained pseudo labels are generated for samples in the coarse-grained dataset, and these pseudo-labeled samples are merged into the fine-grained dataset.

  3. EfficientNet-B4 + Label Set: a multi-label classification model, i.e., the node labels on a sample's ground-truth path serve as its multi-label ground truth.
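Baseline 2 (pseudo labels) can be sketched in a few lines: a model trained on the fine-grained set relabels the coarse-grained samples, and the relabeled samples are merged back in. The `fine_model` callable below is a hypothetical stand-in for a trained EfficientNet-B4 classifier:

```python
def pseudo_label_fusion(fine_model, coarse_samples, fine_dataset):
    """Merge coarse-grained samples into the fine-grained training set,
    replacing their coarse labels with the model's fine-grained predictions."""
    merged = list(fine_dataset)
    for x, _coarse_label in coarse_samples:      # coarse label is discarded
        merged.append((x, fine_model(x)))        # pseudo fine-grained label
    return merged

# Toy usage: a 'model' that maps any image to one breed.
fine = [("img_a", "British_Shorthair")]
coarse = [("img_b", "Cat"), ("img_c", "Cat")]
merged = pseudo_label_fusion(lambda x: "Persian", coarse, fine)
```

A practical version would also filter pseudo labels by prediction confidence, but that refinement is not described in the talk.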

For text classification tasks, we set up a traditional single-label classification model as the baseline.

Experimental results:

Table 3. Image classification results; metric: Accuracy

These results were obtained on the test sets of the fine-grained datasets. ("X" indicates that the setting cannot be applied directly to the model in question; "-" indicates a low-priority experiment omitted from the paper.)

Table 4. Text classification results; metric: F1 score

The main experimental results are shown in Tables 3 and 4, from which we can see:

i) Even without additional datasets, performance can still be improved simply by expanding labels into a label graph and applying the training strategies in this paper.

ii) The proposed method outperforms the baselines, demonstrating the feasibility of dataset fusion in the label space.

Interpretability analysis. Compared with black-box end-to-end fusion approaches, we found in experiments that our framework is more interpretable, because during inference the label graph effectively provides a "decision process" for the classification model. The left part of Figure 13 shows three enhancement nodes (tabby color, point color, and solid color) with some corresponding images; the right part shows the path for each row of samples. Together, the two parts show that the model can learn the features of enhancement nodes from samples with deterministic paths and apply them when reasoning about samples with nondeterministic paths. Take the Persian cat as an example: the red dashed boxes mark Persian cats, which may be either point-colored or solid-colored, so their color is nondeterministic. The model learns point-color and solid-color features from cats with deterministic paths and uses them to distinguish Persian cats of different colors.

Figure 13. Example of interpretability results

For each row of test images, the model's inference passes through the enhancement nodes listed at the far left. Predictions for samples in the blue rectangles of the left figure (corresponding to the blue ellipses in the right figure) are completed on deterministic paths, while predictions for samples in the green rectangles (corresponding to the green ellipses) are completed on nondeterministic paths.

IV. Conclusion

This paper studies the problem of connecting datasets whose label systems differ, and proposes a new framework to solve it, comprising label-space expansion, a recurrent neural network, sequence training, and policy gradients. Experiments on image and text classification show that the proposed method performs well and is interpretable. The authors characterize this work as a preliminary attempt to integrate rich domain knowledge (label graphs) into connectionism (e.g., neural network classifiers), and hope it will promote research on multi-dataset fusion across different tasks.

Visit the Graviti official website to book a demo and experience the Graviti data platform, built to meet your AI development data needs.
