
Since issues such as privacy, security, and labeling cost often result in insufficient labeled data in the target domain, transfer learning applies knowledge from one or more source domains to learn a model optimized for the target domain. However, this kind of transfer is not always effective: sometimes using the data and knowledge of the source domain may even reduce learning performance in the target domain, which is called negative transfer. Negative transfer has become a long-standing and challenging problem in transfer learning. This paper mainly summarizes and discusses existing methods for dealing with negative transfer, along with their advantages and disadvantages.

Introduction

A common assumption in traditional machine learning is that the training data and the test data have the same distribution. In practical applications, however, this assumption does not always hold. For example, two image datasets may contain images taken by cameras with different resolutions under different lighting conditions, and different people may show significant differences in brain-computer interfaces. As a result, models trained on such datasets may generalize poorly.

A common solution is to re-collect a large amount of labeled data from the test distribution and retrain the model; in reality, however, this may be difficult to achieve due to high labeling costs or personal privacy concerns.

A better solution is transfer learning, or domain adaptation. But transfer learning does not always work; it needs to meet the following conditions:

  1. The learning tasks in the source and target domains are the same or similar.
  2. The data distributions of the source and target domains must not differ too much.
  3. A model that fits both domains must exist.

Violating these preconditions may result in negative transfer: the introduction of source-domain knowledge degrades learning performance in the target domain, as shown in Figure 1.

This paper systematically introduces the research progress and related techniques of negative transfer.

Background knowledge

First, the relevant notation and definitions, the classification of transfer learning, and the factors leading to negative transfer are introduced.

Symbols and Definitions

Consider a $K$-class classification problem with input feature space $\mathcal{X}$ and output label space $\mathcal{Y}$. Assume we have access to a labeled source domain $\mathcal{S}=\{(x_S^i,y_S^i)\}^{n_s}_{i=1}$ drawn from $P_{\mathcal{S}}(X,Y)$, where $X\subseteq \mathcal{X}$ and $Y\subseteq \mathcal{Y}$. The target domain consists of two sub-datasets $\mathcal{T}=(\mathcal{T}_l,\mathcal{T}_u)$, where $\mathcal{T}_l=\{(x_j^l,y_j^l)\}^{n_l}_{j=1}$ consists of $n_l$ labeled samples drawn from $P_{\mathcal{T}}(X,Y)$ and $\mathcal{T}_u=\{x^k_u\}^{n_u}_{k=1}$ consists of $n_u$ unlabeled samples drawn from $P_{\mathcal{T}}(X)$. The main symbols are listed in Table 1:

In transfer learning, according to how the source and target domains differ (i.e., $\mathcal{S} \neq \mathcal{T}$), the following cases can be distinguished:

  • The feature spaces are different: $\mathcal{X}_\mathcal{S} \neq \mathcal{X}_\mathcal{T}$.
  • The label spaces are different: $\mathcal{Y}_\mathcal{S} \neq \mathcal{Y}_\mathcal{T}$.
  • The marginal probability distributions of the two domains are different: $P_{\mathcal{S}}(X) \neq P_{\mathcal{T}}(X)$.
  • The conditional probability distributions of the two domains are different: $P_{\mathcal{S}}(Y|X) \neq P_{\mathcal{T}}(Y|X)$.

This review focuses on the latter two differences, assuming that the source and target domains share the same feature and label spaces. Transfer learning focuses on designing a learning algorithm $\theta(\mathcal{S},\mathcal{T})$ that uses data/information from both the source and target domains to output a hypothesis $h$ for the target domain with a small expected loss $\epsilon_{\mathcal{T}}(h)=\mathbb{E}_{x,y\sim P_{\mathcal{T}}(X,Y)}[l(h(x),y)]$, where $l$ is the loss function of the target domain.

Transfer learning classification

For the classification of transfer learning methods, see the review of transfer learning written in an earlier post.

Negative transfer

Negative transfer was first observed experimentally by Rosenstein et al. Wang et al. gave a mathematical definition of negative transfer and proposed the negative transfer gap (NTG) to determine whether negative transfer occurs:

Definition 1 (NTG): Let $\epsilon_\mathcal{T}$ be the test error in the target domain, let $\theta(\mathcal{S},\mathcal{T})$ be the transfer learning algorithm between the source and target domains, and let $\theta(\emptyset,\mathcal{T})$ be the same algorithm without using any source domain information. Negative transfer occurs when $\epsilon_\mathcal{T}(\theta(\mathcal{S},\mathcal{T})) > \epsilon_\mathcal{T}(\theta(\emptyset,\mathcal{T}))$, and its severity is measured by the NTG:


$$NTG=\epsilon_\mathcal{T}(\theta(\mathcal{S},\mathcal{T}))-\epsilon_\mathcal{T}(\theta(\emptyset,\mathcal{T}))$$

Obviously, negative transfer occurs when NTG is positive. However, NTG is not always computable: in unsupervised settings, for example, $\epsilon_\mathcal{T}(\theta(\emptyset,\mathcal{T}))$ cannot be estimated because there is no labeled target domain data.
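
As a minimal illustration (not from the original paper), the gap can be estimated by running the same learner twice, once with and once without the source data, and comparing errors on held-out labeled target data. The callables `theta` and `error_fn` below are hypothetical placeholders:

```python
def negative_transfer_gap(theta, source_data, target_train, target_test, error_fn):
    """Estimate the Negative Transfer Gap (NTG).
    `theta(source, target)` is a hypothetical learning algorithm returning a fitted
    model; passing source=None stands for theta(emptyset, T).  `error_fn(model, data)`
    returns the target-domain test error.  NTG > 0 means negative transfer occurred."""
    model_transfer = theta(source_data, target_train)   # theta(S, T)
    model_baseline = theta(None, target_train)          # theta(emptyset, T)
    return error_fn(model_transfer, target_test) - error_fn(model_baseline, target_test)
```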

Factors of negative transfer

Ben-David et al. give a theoretical bound for transfer learning:


$$\epsilon_{\mathcal{T}}(h) \leq \epsilon_{\mathcal{S}}(h)+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(X_s,X_t)+\lambda$$

Here $\epsilon_{\mathcal{T}}(h)$ and $\epsilon_{\mathcal{S}}(h)$ denote the prediction error of hypothesis $h$ on the target and source domains, $d_{\mathcal{H}\Delta\mathcal{H}}(X_s,X_t)$ denotes the divergence between the two domains, and $\lambda$ is a problem-specific constant. Based on this bound, negative transfer can be attributed to four factors:

  1. Domain divergence: The difference between the source and target domains is the root cause of negative transfer. If a transfer learning method does not try to reduce this divergence, negative transfer may occur at the feature, classifier, or target-output level.
  2. Transfer algorithm: A safe transfer algorithm theoretically guarantees that using auxiliary information will not hurt learning performance in the target domain; otherwise, the algorithm must be carefully designed to improve the transferability of the auxiliary domain, or negative transfer may occur.
  3. Source data quality: The quality of the source data directly determines the quality of the transferred knowledge. If the source data is inseparable or too noisy, the classifier trained on it becomes unreliable. Sometimes the source data is only available as a pre-trained model for privacy reasons, and an over-fitted or under-fitted source model may also cause negative transfer.
  4. Target data quality: Noisy or unstable target domain data can also lead to negative transfer.

Reliable transfer learning

Figure 2 shows the authors' proposed scheme for reliable transfer learning, which organizes representative existing approaches that mitigate or avoid negative transfer.

The scheme first estimates domain similarity. If the similarity is too low, either no transfer or distant transfer is performed; if the similarity is neither high nor low, negative transfer mitigation strategies are applied; if the similarity is high, knowledge is transferred directly.

Safe transfer

Safe transfer avoids negative transfer through the design of the objective function; that is, the model learned with transfer must perform no worse than one learned without it, as shown in Table 2. There are only a few safe transfer methods, some for classification problems and others for regression problems.

Safe transfer for classification

Cao et al. proposed a Bayesian adaptive learning method that automatically adjusts the transfer mode according to the similarity of the two tasks. It assumes that the source and target data follow a Gaussian distribution with a semi-parametric transfer kernel $K$:


$$K_{nm} \sim k(x_n,x_m)(2e^{-\zeta(x_n,x_m)\rho}-1)$$

Here $k$ is an ordinary kernel function, and $\zeta(x_n,x_m)=0$ if $x_n$ and $x_m$ come from the same domain, $\zeta(x_n,x_m)=1$ otherwise. The parameter $\rho$ denotes the degree of difference between the source and target domains. Assuming $\rho$ follows a Gamma distribution $\Gamma(b,\mu)$, where $b,\mu$ are inferred from a few labeled samples of the two domains, we can define:


$$\lambda = 2\left(\frac{1}{1+\mu}\right)^b-1$$

$\lambda$ determines how similar the two domains are and what can be transferred. For example, when $\lambda$ approaches 0, the correlation between the two domains is so low that only the kernel parameters can be transferred.
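
A minimal numpy sketch of this idea follows, assuming an RBF base kernel and treating $\rho$, $b$, $\mu$ as already-inferred inputs (the original method infers them from labeled samples of both domains):

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    """Ordinary RBF base kernel k(x, y)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def transfer_kernel(X, domains, rho, gamma=1.0):
    """Semi-parametric transfer kernel: K_nm = k(x_n, x_m) * (2*exp(-zeta*rho) - 1),
    where zeta = 0 if x_n and x_m come from the same domain and 1 otherwise.
    X: (n, d) samples from both domains stacked; domains: (n,) domain ids;
    rho: cross-domain dissimilarity parameter (inferred in the original method)."""
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            zeta = 0.0 if domains[i] == domains[j] else 1.0
            K[i, j] = rbf(X[i], X[j], gamma) * (2 * np.exp(-zeta * rho) - 1)
    return K

def similarity_lambda(b, mu):
    """lambda = 2 * (1 / (1 + mu))**b - 1; values near 0 indicate low relatedness."""
    return 2 * (1.0 / (1.0 + mu)) ** b - 1
```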

Jamal et al. proposed an adaptive method for deep face detection that avoids negative transfer by minimizing the following loss function:


$$\min_{u,\tilde{\theta}}\left[\frac{\lambda}{2}||u||_2^2+\mathbb{E}_t\max_{y_t\in\{0,1\}}\left(RES_t(w+u,\tilde{\theta})\right)\right]$$

Here $w$ is the classifier weight of the source detector, $u$ is the offset of the target detector's weights constrained with respect to the source detector, $\tilde{\theta}$ is the parameter of the target feature extractor, and $RES_t$ is the relative performance loss of the target detector learned from the pre-trained source face detector, which is negative after optimization. This loss function therefore ensures that the target detector is no worse than the source detector.

Li et al. proposed a safe weakly supervised learning (SAFEW) approach for semi-supervised domain adaptation. It assumes there exists an $h^*$ constructed from multiple base learners in the source domain, i.e. $h^*=\sum^M_{i=1}\alpha_i h_i$, where $\{h_i\}^M_{i=1}$ are $M$ source models with weights $\alpha=[\alpha_1;\dots;\alpha_M]\geq 0$. Its goal is to learn a predictor $h$ that maximizes its gain over the baseline $h_0$, which is trained only on the labeled target data, by optimizing the following objective:


$$\max_{h}\ \min_{\alpha \in M}\ \ l(h_0,\sum^M_{i=1}\alpha_ih_i)-l(h,\sum^M_{i=1}\alpha_ih_i)$$

SAFEW avoids negative transfer by optimizing worst-case performance.

Safe transfer for regression

Kuzborskij and Orabona introduced a class of regularized least squares (RLS) algorithms that avoid negative transfer through regularization. The original RLS algorithm solves the following optimization problem:


$$\min_w \frac{1}{n}\sum^n_{i=1}(w^Tx_i-y_i)^2+\lambda||w||^2$$

Given an already-optimized source hypothesis $h'(\cdot)$, they build the training set $\{x_i, y_i-h'(x_i)\}^n_{i=1}$ and form the transferred hypothesis:


$$h_{\mathcal{T}}(x)=T_C(x^T\hat{w}_{\mathcal{T}})+h'(x)$$

Here $T_C(\hat{y})=\min(\max(\hat{y},-C),C)$ is a truncation function that limits the output to $[-C,C]$, and


$$\hat{w}_{\mathcal{T}}=\arg\min_w \frac{1}{n}\sum^n_{i=1}(w^Tx_i-y_i+h'(x_i))^2+\lambda||w||^2$$

Kuzborskij and Orabona proved that this method is equivalent to training RLS on the target domain alone when the source and target domains are unrelated.
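
A minimal numpy sketch of this residual-based transfer RLS, assuming `h_source` is a callable returning the source hypothesis' predictions:

```python
import numpy as np

def fit_transfer_rls(X, y, h_source, lam=1.0, C=1.0):
    """Residual-based transfer RLS sketch: fit ridge regression to the residuals
    y_i - h'(x_i), then predict with a truncated correction added back onto the
    source hypothesis h'.  `h_source` maps an (n, d) array to (n,) predictions."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    residual = y - h_source(X)                              # targets: y_i - h'(x_i)
    n, d = X.shape
    # Closed-form ridge solution on the residuals.
    w = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ residual / n)

    def h_target(X_new):
        X_new = np.asarray(X_new, float)
        correction = np.clip(X_new @ w, -C, C)              # truncation T_C
        return correction + h_source(X_new)
    return h_target
```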

Building on this method, Yoon and Li proposed an optimized transfer learning algorithm, which assumes that the source parameters follow a normal distribution and optimizes the following loss function:


$$\min_{w} l_{\mathcal{T}_l}(w;b)+\beta\mathcal{R}(w)+\lambda N(w;\mu_w,\Sigma_w)$$

Here $w$ is the model parameter, $\mathcal{R}(w)$ is a regularization term that controls model complexity, and $N(w;\mu_w,\Sigma_w)$ is a regularization term that constrains $w$ to a multivariate Gaussian distribution whose mean $\mu_w$ and covariance $\Sigma_w$ are computed from the source domain. They found that negative transfer occurs when $\lambda$ is too large, so they proposed an optimization rule for choosing $\lambda$ that eliminates the influence of harmful source domains.

Domain similarity estimation

Domain similarity (or transferability) estimation is a very important module in reliable transfer learning, as shown in Figure 2. Existing estimation methods can be divided into three categories: based on feature statistics, based on test performance, and based on fine-tuning, as shown in Table 3:

Based on feature statistics

The original feature representations and their first-order or higher-order statistics, such as the mean and variance, are important inputs for measuring differences between domain distributions. Three important measures of domain divergence are the maximum mean discrepancy (MMD), correlation coefficients, and KL divergence.

MMD is probably the most widely used measure in traditional transfer learning; see the linked post for a detailed introduction.
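
For reference, a minimal numpy sketch of the (biased) RBF-kernel MMD estimate between two sample sets, with the kernel bandwidth `gamma` as an assumed hyperparameter:

```python
import numpy as np

def mmd_rbf(Xs, Xt, gamma=1.0):
    """Biased estimate of the squared maximum mean discrepancy with an RBF kernel.
    Xs: (n_s, d) source samples; Xt: (n_t, d) target samples."""
    def k(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * d2)
    return k(Xs, Xs).mean() + k(Xt, Xt).mean() - 2 * k(Xs, Xt).mean()
```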

The correlation-coefficient approach measures distribution differences through the correlation between two high-dimensional random vectors drawn from different distributions. Domain transferability estimation (DTE) evaluates the transferability between the source and target domains through between-class and between-domain scatter matrices (a scatter matrix is the covariance matrix multiplied by a coefficient and plays the same role):


$$DTE(\mathcal{S},\mathcal{T})=\frac{||S^\mathcal{S}_b||_1}{||S^{\mathcal{S},\mathcal{T}}_b||_1}$$

Here $S^\mathcal{S}_b$ is the between-class scatter matrix of the source domain, $S^{\mathcal{S},\mathcal{T}}_b$ is the between-domain scatter matrix, and $||\cdot||_1$ denotes the 1-norm.
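
A rough numpy sketch of DTE under these definitions; the use of the matrix 1-norm and the exact form of the scatter matrices are assumptions made for illustration:

```python
import numpy as np

def between_scatter(X, labels):
    """Between-group scatter matrix: sum over groups of n_g*(mu_g - mu)(mu_g - mu)^T."""
    mu = X.mean(axis=0)
    S = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(labels):
        Xg = X[labels == g]
        diff = (Xg.mean(axis=0) - mu)[:, None]
        S += len(Xg) * diff @ diff.T
    return S

def dte(Xs, ys, Xt):
    """DTE = ||between-class scatter of the source||_1 / ||between-domain scatter||_1,
    using the matrix 1-norm (an assumed reading of the 1-norm in the formula)."""
    S_class = between_scatter(Xs, ys)                            # between-class, source only
    X_all = np.vstack([Xs, Xt])
    d_labels = np.concatenate([np.zeros(len(Xs)), np.ones(len(Xt))])
    S_domain = between_scatter(X_all, d_labels)                  # between-domain
    return np.linalg.norm(S_class, 1) / np.linalg.norm(S_domain, 1)
```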

KL divergence is an asymmetric measure between two probability distributions. Gong et al. proposed a rank-of-domain (ROD) method, which ranks the similarity between the source and target domains by computing a weighted average of the symmetrized KL divergences between the principal angles.

Based on test performance

Domain similarity can also be measured by test performance: if a source domain classifier performs well on the labeled target domain data, the two domains are very similar. Obviously, this requires a certain amount of labeled data in the target domain, which limits its applicability.

Based on fine-tuning

Domain similarity can also be measured through fine-tuning, which is often used in deep transfer learning to adapt a source-domain deep model to the target domain by freezing its lower-layer parameters and re-tuning the higher-layer parameters.

Generally, these methods feed the target domain data into a neural network trained on the source domain and then estimate domain similarity from its outputs. Nguyen et al. proposed the log expected empirical prediction (LEEP), which can be computed from a source model $\theta$ and $n_l$ labeled target samples:


$$T(\theta,\mathcal{D})=\frac{1}{n_l}\sum^{n_l}_{i=1}\log\left(\sum_{z\in \mathcal{Z}}\hat{P}(y_i|z)\,\theta(x_i)_{z}\right)$$

Here $\hat{P}(y_i|z)$ is the empirical conditional distribution of the true target label $y_i$ given the dummy label $z$ predicted by the model $\theta$ (estimated from frequency counts over the training set). $T(\theta,\mathcal{D})$ represents the transferability of the pre-trained model $\theta$ to the target domain $\mathcal{D}$ and is an upper bound of the negative conditional entropy (NCE) measure. (NCE measures the amount of information carried from the source domain to the target domain in order to evaluate the transferability of the source domain.)
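
The LEEP score can be computed directly from the formula above. A minimal numpy sketch, assuming `theta_probs` holds the source model's softmax outputs over its own label set for the labeled target samples:

```python
import numpy as np

def leep(theta_probs, y):
    """LEEP transferability score, computed as in the formula above.
    theta_probs: (n, Z) softmax outputs of the pre-trained source model theta
                 over its own (dummy) label set for n labeled target samples.
    y:           (n,) integer array of true target labels in {0, ..., Y-1}."""
    theta_probs = np.asarray(theta_probs)
    y = np.asarray(y)
    n, Z = theta_probs.shape
    Y = int(y.max()) + 1
    # Empirical joint distribution P_hat(y, z) of true labels and dummy labels.
    joint = np.zeros((Y, Z))
    for i in range(n):
        joint[y[i]] += theta_probs[i]
    joint /= n
    p_z = joint.sum(axis=0, keepdims=True)              # marginal P_hat(z)
    cond = joint / np.maximum(p_z, 1e-12)               # conditional P_hat(y | z)
    # T(theta, D) = mean_i log( sum_z P_hat(y_i | z) * theta(x_i)_z )
    return float(np.mean(np.log((cond[y] * theta_probs).sum(axis=1))))
```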

Distant transfer

When the similarity between the source and target domains is very low, transfer learning is hard to carry out directly. Distant transfer learning (also called transitive transfer learning), which connects the source domain to the target domain through one or more intermediate domains, can be used to solve such difficult problems.

Negative transfer mitigation

When the source and target domains have a certain degree of similarity, several kinds of methods can be adopted to mitigate negative transfer, as shown in Table 4, including enhancing the transferability of the data, enhancing the transferability of the model, and enhancing target prediction.

Enhancing data transferability

The transferability of source domain data can be enhanced by improving data quality from coarse to fine granularity, at the domain, instance, and feature levels.

Enhancing transferability at the domain level

When there are multiple source domains, selectively aggregating the source domains closest to the target domain, or weighting all source domains, may achieve better learning performance. Domain selection/weighting is therefore used to mitigate negative transfer.

The domain similarity measures mentioned above can also be used with multiple source domains. Wang and Carbonell used MMD to measure the similarity between domains: they first trained a classifier for each source domain and then weighted the classifiers' predictions on the target domain, where each classifier's weight is based on the MMD-based proximity between its source domain and the target domain, i.e., a measure of the transferability between them.
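
A loosely hedged sketch of this idea, reusing the `mmd_rbf` function from the earlier MMD snippet; the softmax(-MMD) weighting and the scikit-learn-style `predict_proba` interface are illustrative assumptions, not the exact scheme of Wang and Carbonell:

```python
import numpy as np

def weighted_multi_source_prediction(source_classifiers, source_datasets, Xt):
    """Weight each source classifier's prediction on the target data by how close
    its domain is to the target (smaller MMD -> larger weight).
    Uses the mmd_rbf sketch defined earlier in this post."""
    mmds = np.array([mmd_rbf(np.asarray(Xs), np.asarray(Xt)) for Xs in source_datasets])
    weights = np.exp(-mmds) / np.exp(-mmds).sum()
    preds = np.stack([clf.predict_proba(Xt) for clf in source_classifiers])  # (M, n, K)
    return np.tensordot(weights, preds, axes=1)  # weighted average over sources
```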

Zuo et al. proposed an attention-based domain recognition module to estimate domain relevance and reduce the impact of dissimilar domains. The core idea is to recombine instance labels when multiple source domains exist, so that each class and each domain can be distinguished at the same time. It redefines the source label as $\hat{Y}_{s,i}=Y_{s,i}+(i-1)\times K$ and trains a domain recognition model on the original features. For the $i$-th domain, its learned weight is:


$$w_i=\frac{\sum^{n_t}_{j=1}sign(\hat{d_j},i)}{n_t}$$

Here $sign(\cdot,\cdot)$ is an indicator function, and $\hat{d_j}$ is the domain label of target instance $x_j$ predicted by the domain recognition model.
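
Reading $sign(\cdot,\cdot)$ as an indicator of whether the predicted domain label equals $i$, the weights can be computed as in this small sketch:

```python
import numpy as np

def domain_weights(predicted_domains, num_domains):
    """Weight of the i-th source domain: the fraction of target instances whose
    predicted domain label d_hat_j equals i (indicator reading of sign(., .)).
    predicted_domains: (n_t,) array of domain labels predicted for the target data."""
    predicted_domains = np.asarray(predicted_domains)
    n_t = len(predicted_domains)
    return np.array([(predicted_domains == i).sum() / n_t for i in range(num_domains)])
```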

Enhancing transferability at the instance level

Instance selection/weighting is also often used in transfer learning and negative transfer mitigation.

Seah et al. proposed a predictive distribution matching (PDM) regularizer to remove irrelevant source domain data. It iteratively infers pseudo-labels for the unlabeled target domain data and retains those with high confidence; finally, an SVM or logistic regression classifier is trained on the remaining source domain data and the pseudo-labeled target domain data.

Active learning can be used either to select the most useful unlabeled data for labeling or to select the most suitable source samples. Peng et al. proposed active transfer learning, which selects source samples that are class-balanced and closest to the target domain, thereby minimizing MMD while alleviating negative transfer.

Instance weighting is also used to control negative transfer in deep transfer learning. Wang et al. proposed a discriminator gate that combines adversarial adaptation with class-level weighting of source samples; the output of the discriminator gate is used to estimate the density ratio of the two domains' distributions at each feature point.

Enhancing transferability at the feature level

A common approach is to learn a shared latent feature space in which features from different domains become more consistent.

Long et al. proposed dual transfer learning to automatically distinguish common and domain-specific latent factors. The core idea is to find a latent feature space that helps classification in the target domain as much as possible, formulated as a non-negative matrix tri-factorization problem:


$$\min_{U_0,U_S,H,V_S \geq 0}||X_S-[U_0,U_S]HV_S^T||$$

Here $X_S$ is the feature matrix of the source domain, $U_0$ and $U_S$ are the common and domain-specific feature clusters, $V_S$ is the sample-cluster assignment matrix, and $H$ is the association matrix. Knowledge transfer is optimized by minimizing the difference between the association relations of different domains.

Another important line of work enhances feature transferability in the representation learning of deep transfer learning. Yosinski et al. defined the transferability of features based on their specificity to the training domain and their generality.

Unfortunately, focusing solely on improving feature transferability can lead to poor discriminability; to mitigate negative transfer, both feature transferability and discriminability (i.e., noise reduction) must be considered. Xu et al. proposed modeling feature noise in unsupervised transfer learning using sparse matrices.

Enhancing model transferability

Methods for enhancing model transferability include TransNorm and adversarially robust training.

TransNorm reduces domain shift in batch normalization, which is typically applied after convolutional layers.

Adversarially robust training improves the model's robustness to adversarial examples; see the linked post.

Enhancing target prediction

Transfer learning is often used when the target domain has a small number of labeled samples and a large number of unlabeled samples. Similar to semi-supervised learning, pseudo-labels are assigned to these unlabeled samples in transfer learning. Soft, selective, and clustering-enhanced pseudo-labeling can all be used to alleviate negative transfer. (See the introduction to pseudo-labeling.)

In soft pseudo-labeling, each unlabeled sample is assigned to different categories with different probabilities instead of being given a single hard category, which reduces label noise from weak source classifiers.

Selective pseudo-labeling is another strategy for enhancing target prediction. Its core idea is to select only the unlabeled data with high-confidence predictions as training targets.
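
A minimal sketch of confidence-thresholded selective pseudo-labeling; the threshold value is an assumed hyperparameter:

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.9):
    """Keep only unlabeled target samples whose maximum predicted class probability
    exceeds a confidence threshold.
    probs: (n_u, K) predicted class probabilities for the unlabeled target data.
    Returns a boolean mask of selected samples and their pseudo-labels."""
    probs = np.asarray(probs)
    confidence = probs.max(axis=1)
    keep = confidence >= threshold
    return keep, probs.argmax(axis=1)[keep]
```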

Clustering-enhanced pseudo-labeling builds on soft pseudo-labeling but further exploits the clustering structure of the unlabeled data to enhance target domain prediction.