Data, model, and algorithm jointly determine the performance of a deep learning model
2020/4/20 FesianXu


e-mail: [email protected]

QQ: 973926198

github: github.com/FesianXu

Zhihu column: Computer Vision/Computer Graphics theory and Application



The survey in [1] gives a good summary of few-shot learning and raises an interesting point of view, which I would like to share here. First, let us set aside the concept of few-shot learning and start from several basic concepts of machine learning.

Expected risk minimization: given a loss function $\mathcal{L}(\cdot)$ and a hypothesis $h \in \mathcal{H}$, we want the learning algorithm to minimize the expected risk, which is defined as:


$$
R(h) = \int \mathcal{L}(h(\mathbf{x}), y) \, dp(\mathbf{x}, y) = \mathbb{E}[\mathcal{L}(h(\mathbf{x}), y)] \tag{1}
$$

If the parameter set of the model is $\theta$, then our goal is:


$$
\theta = \arg \min_{\theta} R(h) \tag{2}
$$

Empirical risk minimization: in practice, the data distribution $p(\mathbf{x}, y)$ is usually unknown, so the integral cannot be evaluated. Instead, we sample the distribution to obtain a set of labeled examples, whose number we denote by $I$, and use this sample to approximate the distribution. We therefore minimize the empirical risk, where "empirical" refers to the sampled dataset:


$$
R_{I}(h) = \frac{1}{I} \sum_{i=1}^{I} \mathcal{L}(h(\mathbf{x}_i), y_i) \tag{3}
$$

Minimizing the empirical risk (3) thus serves as a proxy for minimizing the expected risk (1) (in practice, regularization terms are usually added as well).
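As a concrete illustration, here is a minimal sketch (my own, not from [1]; the linear model and squared loss are assumptions for illustration) of minimizing the empirical risk (3) by gradient descent:

```python
import numpy as np

# Hypothetical setup: a linear model h(x) = w @ x with squared loss,
# trained on I samples drawn from an unknown distribution p(x, y).
rng = np.random.default_rng(0)
I, d = 100, 5                       # number of samples, input dimension
X = rng.normal(size=(I, d))         # sampled inputs x_1, ..., x_I
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=I)  # noisy labels

w = np.zeros(d)                     # model parameters theta
lr = 0.1
for _ in range(500):
    residual = X @ w - y            # h(x_i) - y_i for all i
    grad = 2.0 / I * X.T @ residual # gradient of the empirical risk (3)
    w -= lr * grad

empirical_risk = np.mean((X @ w - y) ** 2)  # R_I(h) in Eq. (3)
print(f"empirical risk R_I(h) = {empirical_risk:.4f}")
```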

We define the following three hypotheses:


$$
\hat{h} = \arg \min_{h} R(h) \tag{4}
$$

$$
h^{*} = \arg \min_{h \in \mathcal{H}} R(h) \tag{5}
$$

$$
h_{I} = \arg \min_{h \in \mathcal{H}} R_{I}(h) \tag{6}
$$

Here, (4) is the theoretically optimal hypothesis $\hat{h}$, taken over all possible functions; (5) is the optimal hypothesis $h^{*}$ under the constraint $h \in \mathcal{H}$; and (6) is the hypothesis $h_{I}$ obtained by minimizing the empirical risk within the specified hypothesis space $\mathcal{H}$, optimized on a specific dataset with a specific amount of data $I$.
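A toy example (my own, not from [1]) makes the distinction concrete. Suppose the true relationship is quadratic, but we restrict $\mathcal{H}$ to linear functions:

$$
\begin{aligned}
&\text{true relation: } y = x^2 + \varepsilon
  &&\Rightarrow \hat{h}(x) = x^2 \\
&\mathcal{H} = \{\, h(x) = wx + b \,\}
  &&\Rightarrow h^{*} = \text{the best linear fit under } p(x, y) \\
&D_{train} = \{(x_i, y_i)\}_{i=1}^{I}
  &&\Rightarrow h_{I} = \text{the least-squares fit on the } I \text{ samples}
\end{aligned}
$$

No matter how many samples we collect, $h_I$ can at best approach $h^*$, not $\hat{h}$; this residual gap is exactly what the decomposition below separates out.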

Since $p(\mathbf{x}, y)$ is unknown, $\hat{h}$ cannot be obtained directly. As approximations, $h^{*}$ approximates $\hat{h}$ within a given hypothesis space, and $h_{I}$ approximates it further on a particular dataset within that hypothesis space. By a simple algebraic transformation, we have (7):


$$
\begin{aligned}
\mathbb{E}[R(h_I) - R(\hat{h})] &= \mathbb{E}[R(h^*) - R(\hat{h}) + R(h_I) - R(h^*)] \\
&= \mathbb{E}[R(h^*) - R(\hat{h})] + \mathbb{E}[R(h_I) - R(h^*)]
\end{aligned} \tag{7}
$$

Here we write $\mathcal{E}_{app}(\mathcal{H}) = \mathbb{E}[R(h^*) - R(\hat{h})]$ and $\mathcal{E}_{est}(\mathcal{H}, I) = \mathbb{E}[R(h_I) - R(h^*)]$. The approximation error $\mathcal{E}_{app}(\mathcal{H})$ measures how closely the optimal hypothesis $h^*$ in a given hypothesis space $\mathcal{H}$ can approach the optimal hypothesis $\hat{h}$ under the expected loss, while the estimation error $\mathcal{E}_{est}(\mathcal{H}, I)$ measures the effect of minimizing the empirical risk instead of the expected risk within $\mathcal{H}$. Without loss of generality, we use $D_{train}$ to denote the entire training set, with $D_{train} = \{\mathbf{X}, \mathbf{Y}\}$, $\mathbf{X} = \{\mathbf{x}_1, \cdots, \mathbf{x}_n\}$, $\mathbf{Y} = \{y_1, \cdots, y_n\}$.

It is not hard to see that the performance of the whole deep learning pipeline ultimately depends on the hypothesis space $\mathcal{H}$ and the amount of training data $I$. In other words, to reduce the total error, we can work from the following perspectives:

  1. Data, that is, $D_{train}$.

  2. Model, which determines the hypothesis space $\mathcal{H}$.

  3. Algorithm, that is, how to search the specified hypothesis space $\mathcal{H}$ for the hypothesis that best fits $D_{train}$.

In general, if $D_{train}$ is large, we have ample supervisory information, and within the specified hypothesis space $h \in \mathcal{H}$, the $R(h_I)$ obtained by minimizing the empirical risk is a good approximation of $R(h^*)$. In few-shot learning (FSL), however, some categories have too few samples to support the approximation of a good hypothesis: there may be a large gap between the empirical risk $R_{I}(h)$ and the expected risk $R(h)$, which causes the hypothesis $h_I$ to overfit. This is in fact the core problem of FSL: the empirical risk minimizer $h_I$ becomes unreliable. The whole process is illustrated in Fig. 1. In the left panel, samples are sufficient, so the empirical risk minimizer $h_I$ is fairly close to $h^*$, and $\hat{h}$ can be approximated well if $\mathcal{H}$ is designed properly. In the right panel, however, $h_I$ is far from $h^*$, let alone $\hat{h}$.

Fig. 1. Schematic illustration of learning with sufficient samples (left) and insufficient samples (right).
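The gap can be made tangible with a small numerical sketch (my own illustration, not from [1]; the polynomial model and noisy sine target are assumptions): fit an over-parameterized model to few versus many samples and compare the empirical risk with a Monte Carlo estimate of the expected risk.

```python
import numpy as np

# Hypothetical demo: fitting a degree-9 polynomial to few vs. many
# samples of a noisy sine, showing how small I makes the empirical
# risk minimizer h_I unreliable (cf. Fig. 1, right panel).
rng = np.random.default_rng(0)

def fit_and_evaluate(I, degree=9, n_test=10_000):
    # Sample I training points from the "true" distribution p(x, y).
    x_train = rng.uniform(-1, 1, size=I)
    y_train = np.sin(np.pi * x_train) + 0.1 * rng.normal(size=I)
    coeffs = np.polyfit(x_train, y_train, degree)   # h_I: minimize R_I(h)
    # Empirical risk R_I(h_I) vs. a Monte Carlo estimate of R(h_I).
    r_emp = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    x_test = rng.uniform(-1, 1, size=n_test)
    y_test = np.sin(np.pi * x_test) + 0.1 * rng.normal(size=n_test)
    r_exp = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return r_emp, r_exp

for I in (10, 1000):  # few-shot vs. sample-rich
    r_emp, r_exp = fit_and_evaluate(I)
    print(f"I={I:5d}  R_I(h_I)={r_emp:.4f}  R(h_I)~{r_exp:.4f}")
```

With $I = 10$ the polynomial interpolates the training points almost exactly, driving $R_I(h_I)$ toward zero while $R(h_I)$ stays large; with $I = 1000$ the two risks roughly agree.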

To address the unreliability of the empirical risk minimizer when data are scarce, i.e., the FSL problem, prior knowledge must be introduced. Depending on whether this prior knowledge enters through the data, the model, or the algorithm, existing FSL work can be divided into the following types:

  1. Data. These methods use prior knowledge to augment $D_{train}$, increasing the amount of data from $I$ to $\widetilde{I}$, usually with $\widetilde{I} \gg I$. Standard machine learning algorithms can then be run on the augmented dataset, yielding a more accurate hypothesis $h_{\widetilde{I}}$, as shown in Fig. 2(a); see the sketch after this list.
  2. Model. These methods use prior knowledge to constrain the complexity of the hypothesis space $\mathcal{H}$, yielding a narrower hypothesis space $\widetilde{\mathcal{H}}$, as shown in Fig. 2(b). The grey regions are ruled out by prior knowledge, so the model does not consider updating in those directions; consequently, less data is often needed to obtain a reliable empirical risk minimizer.
  3. Algorithm. These methods use prior knowledge to guide the search for $\theta$. Prior knowledge can alter the parameter search strategy by providing a good parameter initialization or by guiding the parameter update steps; in the latter case, each update step is determined jointly by the prior knowledge and the empirical risk.
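As a concrete illustration of the first strategy, here is a minimal sketch (hand-coded transforms of my own; real FSL methods in [1] learn such transformations from prior knowledge) that expands a small labeled image set from $I$ to $\widetilde{I}$ samples before standard training:

```python
import numpy as np

# Hypothetical augmentation sketch: expand a tiny labeled image set
# from I to I~ samples with label-preserving transforms, i.e. the
# "data" route of Fig. 2(a).
def augment(images: np.ndarray, labels: np.ndarray, copies: int = 20):
    rng = np.random.default_rng(0)
    aug_x, aug_y = [images], [labels]
    for _ in range(copies):
        x = images.copy()
        if rng.random() < 0.5:
            x = x[:, :, ::-1]                      # random horizontal flip
        x = x + rng.normal(0, 0.05, size=x.shape)  # small pixel noise
        shift = int(rng.integers(-2, 3))
        x = np.roll(x, shift, axis=2)              # small translation
        aug_x.append(x)
        aug_y.append(labels)                       # labels are preserved
    return np.concatenate(aug_x), np.concatenate(aug_y)

# Few-shot setting: I = 5 labeled 32x32 images.
images = np.random.rand(5, 32, 32)
labels = np.array([0, 1, 0, 1, 1])
big_x, big_y = augment(images, labels)
print(images.shape[0], "->", big_x.shape[0])       # I -> I~ (5 -> 105)
```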

Fig. 2. Introducing prior knowledge from the perspectives of data, model, and algorithm.

Reference

[1]. Wang Y, Yao Q, Kwok J, et al. Generalizing from a few examples: A survey on few-shot learning. arXiv preprint arXiv:1904.05046, 2019.