• By Han Xinzi @Showmeai
  • Tutorial address: www.showmeai.tech/article-det…
  • Statement: All rights reserved. Please contact the platform and the author before reprinting, and indicate the source.

Recommendation, search, and computational advertising are the directions in which Internet companies most readily monetize, and also the directions where algorithms play the biggest role. Breakthroughs in and applications of cutting-edge algorithms can drive business growth to a great extent. In this series, we talk about technologies and corporate practices in these business directions. The topic of this issue is putting multi-objective learning optimization into production (with implementation code and the WeChat dataset).

Read the full text in one picture

Data access

To obtain the full code for this article, go to the GitHub project github.com/ShowMeAI-Hu…

– For some of the papers referenced in this article, as well as the "WeChat dataset", reply with the keyword "multi-objective learning" to the official account (AI Algorithm Research Institute).

If you are interested in applications of "multi-objective learning", you are welcome to follow our official account and read the articles "Multi-objective Optimization Practice in iQiyi Short Video Recommendation Service" and "Parallel Two-tower CTR Structure in Information Flow Recommendation Ranking" to learn more about how major companies implement these schemes!

I. Introduction to multi-objective optimization

1.1 Multi-objective optimization scenario

Multi-objective ranking is a common technique in recommendation ranking systems. Many recommendation and ranking systems have multiple business objectives, and the goal is to find a comprehensive ranking method under which the objectives are jointly optimized and the overall return is maximized.

1.2 Why is multi-objective optimization needed

Why is multi-objective ranking needed? In real Internet recommendation products, most user feedback is not a direct rating but various kinds of implicit feedback, such as clicks, favorites, shares, viewing duration, orders and purchases.

When evaluating user satisfaction and setting optimization goals, several kinds of bias may arise:

  • Global bias: Different targets express different degrees of preference.

In e-commerce scenarios, purchasing expresses a stronger preference than clicking, browsing, or adding to favorites. In news scenarios, browsing for more than 20 seconds expresses a stronger preference than a click. In short video scenarios, finishing a video expresses a stronger preference than a click.

  • Item Bias: Incomplete measurement of a single target.

In information flow products, click-through rate may increase while satisfaction decreases. In short video scenarios, suspense design improves the completion rate, but users have to watch the next clip, which inflates the action count. We-media products that encourage forwarding may promote undesirable behavior such as "safe forwarding".

  • User bias: Users express satisfaction in different ways.

In information flow products, users express satisfaction in different ways, such as in-depth reading, liking, and adding to favorites. In short video scenarios, users express satisfaction by "liking", "adding to favorites", or "forwarding".

The following figure shows user behavior paths, including feedback signals, in several Internet products.

In the Internet businesses mentioned above, engineers usually have more than one target to optimize and improve: in short video recommendation, both click-through rate and completion rate; in e-commerce ranking, both click-through rate and conversion rate.

1.3 Difficulties of multi-objective optimization

Multi-objective optimization is mainly implemented in the "ranking" stage of the recommendation system. In the information flow recommendation shown below, the ranking stage determines the final displayed results and therefore has the most direct influence on the target metrics.

Ranking is mostly done with CTR (click-through rate) prediction technology, for which the industry has very mature machine learning and deep learning methods and schemes. However, multi-objective learning optimization faces five difficulties in practice:

  1. Data for some targets is sparse, so model accuracy is low.

For example, in e-commerce products, users place orders far less often than they click, so the number and proportion of positive samples for the order label are small.

  2. Online serving is computationally expensive.

Multi-objective models generally have more complex structures and higher computational cost at inference time. Real-time recommendation services must respond quickly and stably under high concurrency, which adds technical complexity.

  3. The relative importance of multiple goals is hard to quantify.

For short videos, how should click-through rate and completion rate be quantified and traded off against each other?

  4. Score-fusion hyperparameters are hard to learn.

In many modeling schemes, each objective produces its own score, but the hyperparameters used in the final score fusion are hard to set directly from the business, and there is no well-established way to learn them with a model.

  5. Labels are fuzzy.

In many businesses, even the label itself is ambiguous. For example, in information products, how long must the reading time be to count as a positive sample?

1.4 Multi-objective vs. multi-task

In practical technical solutions, there are several very similar concepts: multi-task, multi-objective, and multi-class. Their definitions and relationships are shown in the figure below:

In the recommendation multi-objective optimization discussed here, different objectives also correspond to different tasks. For example, in an e-commerce scenario, CTR modeling is done in the ranking stage, and click-through rate and conversion rate are estimated simultaneously for the same input sample. In this setting, we consider multi-objective and multi-task optimization to be addressable with the same set of methods.

For multi-objective optimization we usually adopt the joint-train mode, with the following formula:


$$L=\min_{\theta} \sum_{t=1}^{T} \alpha^{t} L^{t}\left(\theta^{sh}, \theta^{t}\right)$$

In the formula, $\theta^{sh}$ is the parameter shared across tasks, $\theta^{t}$ is the exclusive parameter of task $t$, and the total Loss is the weighted sum of the Losses of the sub-tasks.
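As a concrete illustration, here is a minimal joint-train Loss sketch (the task names, loss types and weights are assumptions made for the example, not prescribed by the formula): two binary tasks share the bottom network, and the total Loss is the α-weighted sum of the per-task Losses.

import tensorflow as tf

# Minimal joint-train loss sketch: the shared network plays the role of theta_sh,
# each task head plays the role of theta_t, and alpha weights the per-task losses.
def joint_train_loss(y_click, p_click, y_convert, p_convert, alpha=(1.0, 1.0)):
    loss_click = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_click, p_click))
    loss_convert = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_convert, p_convert))
    return alpha[0] * loss_click + alpha[1] * loss_convert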

II. Two development directions of multi-objective optimization

2.1 Parameter sharing in multi-objective learning

Multi-task and multi-objective learning is now mostly implemented through a "sharing" mechanism, which can be designed along two dimensions: sharing model parameters and sharing features across tasks.

  • Model architecture: in a deep learning network, the embedding layer can be shared, some hidden units in intermediate layers can be shared, or the output of a certain layer or the last layer can be shared, while everything outside the shared part stays task-specific. When designing the model, the relationships between layers can be combined and matched freely.

  • Feature combination: different tasks can use different feature combinations; some features feed only part of the model architecture, while others are used by the whole model.

Typical parameter sharing mechanisms are as follows:

  • Hard parameter sharing (parameter-based sharing). Parameter-based sharing is the most common approach in multi-objective learning. In a deep learning network, tasks share the features and hidden layers, and are distinguished only in the last layer by task-specific fully connected + softmax heads; finally, a linear fusion produces the multi-objective ranking score (see the Share Bottom sketch after this list).

  • Soft parameter sharing (regularization-based sharing). With soft sharing, each task has its own parameters and model structure and can choose what to share and what not to share. The distance between the tasks' model parameters is then narrowed via regularization, such as L2 regularization.
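Below is a minimal hard-parameter-sharing (Share Bottom) sketch in Keras, assuming two binary targets (click and conversion); the layer sizes and task names are illustrative only.

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

def build_share_bottom(input_dim=128, bottom_units=(256, 128), tower_units=64):
    x = Input(shape=(input_dim,))
    h = x
    # Shared bottom: these parameters receive gradients from every task
    for units in bottom_units:
        h = Dense(units, activation='relu')(h)
    # Task-specific towers and output heads
    outputs = []
    for name in ['ctr', 'cvr']:
        t = Dense(tower_units, activation='relu')(h)
        outputs.append(Dense(1, activation='sigmoid', name=name)(t))
    return Model(x, outputs)

model = build_share_bottom()
model.compile(optimizer='adam',
              loss={'ctr': 'binary_crossentropy', 'cvr': 'binary_crossentropy'},
              loss_weights={'ctr': 1.0, 'cvr': 1.0})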

In practical multi-objective optimization, models built with various sharing-based structures may end up in four different situations:

  1. "Well done": the best state, in which all shared tasks improve together.
  2. "Not bad": the second-best state, in which no task gets worse and at least one task improves. When the combination is a main task plus an auxiliary task, sacrificing the auxiliary task to improve the main task also counts as a good outcome.
  3. "Not ideal": a seesaw phenomenon in which some tasks improve while others degrade.
  4. "Unacceptable": negative transfer, in which all tasks end up worse than before.

2.2 Two optimization directions

To "share" parameters better and let multiple tasks coexist harmoniously and reinforce each other within one model, the research community works along two optimization directions: network structure design (designing better locations and forms of parameter sharing) and optimization strategy improvement (designing better optimization strategies to keep the multi-task Losses balanced during training).

  • Optimization direction 1: network structure design. This direction considers which parameters to share, where, and how.

  • Optimization direction 2: optimization methods and strategies. Multi-objective optimization strategies consider the relationships between tasks from the perspective of Loss and gradients: balancing Loss magnitudes, adjusting Loss learning velocities, and optimizing gradient update directions. At the micro level this alleviates gradient conflicts and parameter tearing; at the macro level it achieves balanced multi-task optimization.

III. Optimization direction 1: network structure optimization

3.1 General idea and evolution

Network architecture design is the main focus of multi-task research and application. It considers which parameters to share, where, and how. A well-designed, reasonable sharing structure contributes greatly to the final effect.

In recent years, network architecture design has undergone a typical evolution from Share Bottom to MMoE to PLE. Representative papers on multi-task network architecture design published by leading industry researchers include:

  • Share Bottom: an early approach, with hard or soft sharing, to multi-task modeling.
  • 2018 Google MMoE: turns hard parameter sharing into multiple Experts and uses gates to control how much each task's Loss influences each Expert.
  • 2019 Google SNR: with the help of a simple NAS (Neural Architecture Search), sub-networks are combined so that each target learns its own network structure.
  • 2020 Tencent PLE: adds task-specific Experts for each task on top of MMoE.

The following figure summarizes the three typical network structures, Share Bottom, MMoE, and PLE, together with their motivations and formulas.

  • Shared Bottom → MMoE: MMoE splits the shared bottom into multiple Experts and automatically controls the gradient contribution of each task to each Expert through gating networks.
  • MMoE → PLE: on top of MMoE, PLE adds task-specific Experts for each task, which receive gradient updates only from that task.

3.2 Core papers and typical network structures

Let’s take a concrete look at the typical network structure in the paper:

3.2.1 MMoE: Google KDD 2018, now the standard for multi-task CTR modeling [1]

MMoE, proposed in "Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts", has almost become the standard structure for multi-task and multi-objective learning-to-rank at Internet companies.

In the Google paper, the researchers manually controlled the similarity of two tasks and studied how different network structures performed. The multi-gate design in MMoE can alleviate conflicts caused by task differences to some extent; even when the correlation between tasks is low, it still works well.

Different Experts in MMoE learn different kinds of information, which are then combined by the gates. The differences between the softmax distributions of the gates of different tasks show that the Experts specialize in different targets, which improves the overall effect.

The experimental results on the large-scale recommendation system dataset in the paper are as follows. Compared with the Share Bottom approach, MMoE improves all metrics significantly:

MMoE core code reference:

import tensorflow as tf
from tensorflow.keras.layers import Dense, Concatenate, Flatten

class MMoE_Layer(tf.keras.layers.Layer):
    def __init__(self, expert_dim, n_expert, n_task):
        super(MMoE_Layer, self).__init__()
        self.n_task = n_task
        self.expert_layer = [Dense(expert_dim, activation='relu') for i in range(n_expert)]
        self.gate_layers = [Dense(n_expert, activation='softmax') for i in range(n_task)]

    def call(self, x):
        # Build the expert networks
        E_net = [expert(x) for expert in self.expert_layer]
        E_net = Concatenate(axis=1)([e[:, tf.newaxis, :] for e in E_net])  # shape (bs, n_expert, expert_dim)
        # Build one gate network per task
        gate_net = [gate(x) for gate in self.gate_layers]  # n_task tensors of shape (bs, n_expert)

        # Tower computation: weight all the experts by the corresponding gate
        towers = []
        for i in range(self.n_task):
            g = tf.expand_dims(gate_net[i], axis=-1)        # shape (bs, n_expert, 1)
            _tower = tf.matmul(E_net, g, transpose_a=True)
            towers.append(Flatten()(_tower))                # shape (bs, expert_dim)

        return towers

3.2.2 SNR: Google AAAI 2019, Improvements to MMoE [2]

Google's "SNR: Sub-Network Routing for Flexible Parameter Sharing in Multi-Task Learning" is close in spirit to automatic network architecture search (NAS): multi-task sub-networks are produced through dynamically learned routing, and the idea is that the more similar two tasks are, the more structure they end up sharing. A much-simplified sketch of the routing idea follows.
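The sketch below is only a rough illustration of SNR-style routing, not the paper's exact formulation (the paper learns binary coding variables with an L0-style relaxation): low-level sub-networks are connected to high-level sub-networks through learnable routing coefficients relaxed with a sigmoid. All names and sizes here are assumptions.

import tensorflow as tf
from tensorflow.keras.layers import Dense, Layer

class SimpleSNRTransLayer(Layer):
    """Simplified SNR-Trans-style routing: u_i = sum_j z_ij * W_ij(v_j)."""
    def __init__(self, n_low, n_high, units):
        super().__init__()
        self.n_low, self.n_high = n_low, n_high
        # One transformation per (high, low) connection
        self.trans = [[Dense(units, activation='relu') for _ in range(n_low)]
                      for _ in range(n_high)]
        # Learnable routing logits; the paper uses (relaxed) binary coding variables instead
        self.z_logits = self.add_weight(name='z_logits', shape=(n_high, n_low),
                                        initializer='zeros', trainable=True)

    def call(self, low_outputs):
        # low_outputs: list of n_low tensors, each of shape (bs, dims)
        z = tf.sigmoid(self.z_logits)  # relaxed routing coefficients in (0, 1)
        high_outputs = []
        for i in range(self.n_high):
            parts = [z[i, j] * self.trans[i][j](low_outputs[j]) for j in range(self.n_low)]
            high_outputs.append(tf.add_n(parts))
        return high_outputs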

3.2.3 PLE: Tencent RecSys 2020, improved MMoE, simple structure and good effect [3]

"Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations" proposes PLE, which builds mainly on MMoE: each task gets its own specific Experts that receive gradient updates only from that task.

As shown in the figure below, in the Share Bottom structure the whole shared parameter matrix is like an object with large mass. During gradient updates, the two gradient vectors back-propagated from the two Losses, g1 and g2, are two forces of different directions and sizes acting on the object, and together they move it. If the two forces point in roughly the same direction across many updates, the tasks coexist harmoniously and reinforce each other; otherwise the forces consume and cancel each other, and the effectiveness of the tasks drops considerably.

MMoE "divides the whole into parts" by turning one shared parameter matrix into multiple shared Experts combined with gates. When different Losses conflict, they can exert different relative strengths on different Experts, reducing mutual cancellation: some Experts are mainly influenced by one task while others are dominated by other tasks, forming a state in which "each task gains something". PLE then adds task-specific Experts, which further guarantees that each task gains something and keeps the optimization stable.

The experimental results reported in the paper are as follows:

On Ali's business data, multi-objective optimization was carried out on CTR, CVR, and R3; the relative improvements over single-task models are shown in the table below. Share Bottom exhibits the seesaw phenomenon, while PLE achieves a win-win across objectives.

PLE core code reference:

import tensorflow as tf
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense, Concatenate, Flatten

class PleLayer(tf.keras.layers.Layer):
    """
    @param n_task: int, number of tasks.
    @param n_experts: list, number of task-specific experts per task, e.g. [3, 4] means the
                      first task uses 3 experts and the second task uses 4 experts.
    @param expert_dim: int, output dimension of each expert network.
    @param n_expert_share: int, number of shared experts.
    """
    def __init__(self, n_task, n_experts, expert_dim, n_expert_share, dnn_reg_l2=1e-5):
        super(PleLayer, self).__init__()
        self.n_task = n_task

        # Define the task-specific expert networks and the shared expert networks
        self.E_layer = []
        for i in range(n_task):
            sub_exp = [Dense(expert_dim, activation='relu') for j in range(n_experts[i])]
            self.E_layer.append(sub_exp)

        self.share_layer = [Dense(expert_dim, activation='relu') for j in range(n_expert_share)]
        # Define one gate network per task
        self.gate_layers = [Dense(n_expert_share + n_experts[i],
                                  kernel_regularizer=regularizers.l2(dnn_reg_l2),
                                  activation='softmax') for i in range(n_task)]

    def call(self, x):
        # Task-specific experts and shared experts
        E_net = [[expert(x) for expert in sub_expert] for sub_expert in self.E_layer]
        share_net = [expert(x) for expert in self.share_layer]

        # Multiply the [gate weights] with the [task-specific and shared expert outputs]
        towers = []
        for i in range(self.n_task):
            g = self.gate_layers[i](x)
            g = tf.expand_dims(g, axis=-1)  # shape (bs, n_expert_share + n_experts[i], 1)
            _e = share_net + E_net[i]
            _e = Concatenate(axis=1)([expert[:, tf.newaxis, :] for expert in _e])  # shape (bs, n_expert_share + n_experts[i], expert_dim)
            _tower = tf.matmul(_e, g, transpose_a=True)
            towers.append(Flatten()(_tower))  # shape (bs, expert_dim)
        return towers


IV. Optimization direction 2: optimization methods and strategies

4.1 General idea and evolution

Optimization methods focus more on training and parameter optimization: under an existing structure, they combine the tasks better by considering the relationships between tasks from the dimensions of Loss and gradients, alleviating gradient conflicts and parameter tearing during optimization so that multi-task training stays as balanced as possible.

At present, various multi-task and multi-objective optimization methods and strategies mainly focus on three problems:

4.1.1 Magnitude (Loss Magnitude)

Loss values differ in scale, and a Loss with a large value may dominate, as shown in the figure; this must be handled. A typical example is multi-objective optimization that combines a binary classification task with a regression task: the magnitudes of the L2 Loss and the cross-entropy Loss, and of their gradients, may differ greatly, which strongly interferes with optimization if left untreated.

4.1.2 Velocity (Learning Speed of Loss)

Because of differences in sample sparsity and learning difficulty, the Losses of different tasks decrease at different speeds during training. Without adjustment, one task may be close to convergence, or even overfitting, while other tasks are still underfitting.

4.1.3 Direction (Loss Gradient Conflict)

The Losses of different tasks update the shared parameters with gradients of different magnitudes and directions. When the same parameter is updated by multiple gradients at the same time, conflicts may occur; the gradients consume and cancel each other out, producing a seesaw effect or even negative transfer. This is the core problem that needs to be handled.

4.2 Core papers and typical methods

For the three core problems above, typical research sub-directions and representative methods from recent years are shown in the chart below:

4.2.1 Uncertainty Weight [4]

Simple multi-task learning usually optimizes the sum of all the Losses jointly, which requires manually tuning their weights. The typical Loss function is as follows:


$$L_{\text{total}}=\sum_{i} w_{i} L_{i}$$

However, this approach has problems: the final result is very sensitive to the weights, and with poorly chosen weights it is hard to obtain a model that is good at several tasks at once; meanwhile, manually tuning these weights is time-consuming and laborious.

This paper proposes to model the uncertainty of each task directly and then use the uncertainty to guide the weight adjustment.


$$\mathcal{L}\left(W, \sigma_{1}, \sigma_{2}\right) \approx \frac{1}{2 \sigma_{1}^{2}} \mathcal{L}_{1}(W)+\frac{1}{2 \sigma_{2}^{2}} \mathcal{L}_{2}(W)+\log \sigma_{1}+\log \sigma_{2}$$

Here $\sigma$ is the directly modeled uncertainty and is a learnable parameter.

With the total Loss designed this way, the optimization process tends to penalize the combination of high Loss and low $\sigma$ (if a task's Loss is high while its $\sigma$ is small, that term becomes large and the optimizer focuses on reducing it). The implication is: a task with a large Loss should have larger uncertainty, and therefore a smaller weight.

The result is that tasks with small Loss ("relatively easy" tasks) tend to receive larger weights. For example, in a classification + regression multi-objective task, the regression Loss is large, so Uncertainty Weight assigns it a small weight, which may help the overall effect.

Uncertainty Weight core code reference:

from keras.layers import Input, Dense, Lambda, Layer
from keras.initializers import Constant
from keras.models import Model
from keras import backend as K

# Custom Loss layer
class CustomMultiLossLayer(Layer):
    def __init__(self, nb_outputs=2, **kwargs):
        self.nb_outputs = nb_outputs
        self.is_placeholder = True
        super(CustomMultiLossLayer, self).__init__(**kwargs)

    def build(self, input_shape=None):
        # Initialize log_vars: one learnable log-variance (uncertainty) per task
        self.log_vars = []
        for i in range(self.nb_outputs):
            self.log_vars += [self.add_weight(name='log_var' + str(i), shape=(1,),
                                              initializer=Constant(0.), trainable=True)]
        super(CustomMultiLossLayer, self).build(input_shape)

    def multi_loss(self, ys_true, ys_pred):
        assert len(ys_true) == self.nb_outputs and len(ys_pred) == self.nb_outputs
        loss = 0
        for y_true, y_pred, log_var in zip(ys_true, ys_pred, self.log_vars):
            precision = K.exp(-log_var[0])
            loss += K.sum(precision * (y_true - y_pred)**2. + log_var[0], -1)
        return K.mean(loss)

    def call(self, inputs):
        ys_true = inputs[:self.nb_outputs]
        ys_pred = inputs[self.nb_outputs:]
        loss = self.multi_loss(ys_true, ys_pred)
        self.add_loss(loss, inputs=inputs)
        return K.concatenate(inputs, -1)

4.2.2 GradNorm [5]

The main idea of Gradient Normalization is that different tasks should have Losses of similar magnitude and should learn at similar speeds. The paper "GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks" tries to control the training of a multi-task network by adjusting the gradients of different tasks to similar magnitudes, encouraging the network to learn all tasks at as equal a pace as possible.

The details of Gradient normalization are as follows:

  • Two types of Loss are defined: Label Loss and Gradient Loss. The two Losses are optimized independently and are not added together.
    • Label Loss is the ordinary Loss computed from each task's ground-truth labels and the network's predictions; its form depends on the task, e.g. a classification Loss or a regression Loss. The overall Label Loss is the weighted sum of the per-task Losses and is a function of the network parameters $W$.
    • Gradient Loss measures how good the per-task weights $w_i(t)$ are; it is a function of the weights $w_i(t)$.
  • The weight $w_i(t)$ of each task is a variable (note that $w$ is different from the network parameters $W$); $w$ is also updated by gradient descent, and $t$ denotes the current training step.

GradNorm's procedure within a single batch step is summarized as follows:

  1. Forward propagation to compute the total $\operatorname{Loss}=\sum_{i} w_{i} L_{i}$
  2. Compute $G_{W}^{i}(t)$, $r_{i}(t)$, and $\bar{G}_{W}(t)$
  3. Compute the Gradient Loss
  4. Compute the derivative of the Gradient Loss with respect to $w_{i}$
  5. Back-propagate the total Loss from step 1 to update the network parameters
  6. Update $w_{i}$ with the derivative from step 4 (taking effect at the next batch step)
  7. Renormalize $w_{i}$ (the next batch step uses the renormalized $w_{i}$)

GradNorm core code reference:

class GradNorm:
    def __init__(self,
                 device,
                 model,
                 model_manager,
                 task_ids,
                 losses,
                 metrics,
                 train_loaders,
                 test_loaders,
                 tensorboard_writer,
                 optimizers,
                 alpha=1.) :

        super().__init__(
                device, model, model_manager, task_ids, losses, metrics,
                train_loaders, test_loaders, tensorboard_writer)

        self.coeffs = torch.ones(
                len(task_ids), requires_grad=True, device=device)
        optimizer_def = getattr(optim, optimizers['method'])
        self.model_optimizer = optimizer_def(
                model.parameters(), **optimizers['kwargs'])
        self.grad_optimizer = optimizer_def(
                [self.coeffs], **optimizers['kwargs'])

        self.has_loss_zero = False
        self.loss_zero = torch.empty(len(task_ids), device=device)
        self.alpha = torch.tensor(alpha, device=device)

    def train_epoch(self) :
        """ Training 1 round """
        self.model.train()
        loader_iterators = dict([(k, iter(v))
                                 for k, v in self.train_loaders.items()])
        train_losses_ts = dict(
                [(k, torch.tensor(0.).to(self.device)) for k in self.task_ids])
        train_metrics_ts = dict(
                [(k, torch.tensor(0.).to(self.device)) for k in self.task_ids])
        total_batches = min([len(loader)
                             for _, loader in self.train_loaders.items()])
        num_tasks = torch.tensor(len(self.task_ids)).to(self.device)

        relative_inverse = torch.empty(
                len(self.task_ids), device=self.device)
        _, all_branching_ids = self.model.execution_plan(self.task_ids)
        grad_norm = dict([
                (k, torch.zeros(len(self.task_ids), device=self.device))
                for k in all_branching_ids])

        pbar = tqdm(desc=' train', total=total_batches, ascii=True)
        for batch_idx in range(total_batches):
            tmp_coeffs = self.coeffs.clone().detach()
            self.model.zero_grad()
            self.grad_optimizer.zero_grad()
            for k, v in self.model.rep_tensors.items():
                if v.grad is not None:
                    v.grad.zero_()
                if v is not None:
                    v.detach()

            # For each task: compute the loss, back-propagate, and accumulate gradient norms
            for task_idx, task_id in enumerate(self.task_ids):
                data, target = next(loader_iterators[task_id])
                data, target = data.to(self.device), target.to(self.device)

                # do inference and accumulate losses
                output = self.model(data, task_id, retain_tensors=True)
                for index in all_branching_ids:
                    self.model.rep_tensors[index].retain_grad()
                loss = self.losses[task_id](output, target)
                weighted_loss = tmp_coeffs[task_idx] * loss

                weighted_loss.backward(retain_graph=False, create_graph=True)
                output.detach()

                # GradNorm relative inverse training rate accumulation
                if not self.has_loss_zero:
                    self.loss_zero[task_idx] = loss.clone().detach()
                relative_inverse[task_idx] = loss.clone().detach()

                # GradNorm accumulate gradients
                for index in all_branching_ids:
                    grad = self.model.rep_tensors[index].grad
                    grad_norm[index][task_idx] = torch.sqrt(
                            torch.sum(torch.pow(grad, 2)))

                # calculate training metrics
                with torch.no_grad():
                    train_losses_ts[task_id] += loss.sum()
                    train_metrics_ts[task_id] += \
                        self.metrics[task_id](output, target)

            # GradNorm calculate relative inverse and avg gradients norm
            self.has_loss_zero = True
            relative_inverse = relative_inverse / self.loss_zero.clone().detach()
            relative_inverse = relative_inverse / torch.mean(relative_inverse).clone().detach()
            relative_inverse = torch.pow(relative_inverse, self.alpha.clone().detach())

            coeff_loss = torch.tensor(0., device=self.device)
            for k, rep_grads in grad_norm.items():
                mean_norm = torch.mean(rep_grads)
                # Target gradient norm per task: the average norm scaled by the relative
                # inverse training rate, treated as a constant when optimizing the weights
                target = (relative_inverse * mean_norm).detach()
                coeff_loss = coeff_loss + torch.sum(torch.abs(rep_grads - target))

            # GradNorm: optimize the task-weight coefficients
            coeff_loss.backward()
            self.grad_optimizer.step()

            # optimize the model
            self.model_optimizer.step()

            pbar.update()

        for task_id in self.task_ids:
            train_losses_ts[task_id] /= \
                len(self.train_loaders[task_id].dataset)
            train_metrics_ts[task_id] /= \
                len(self.train_loaders[task_id].dataset)

        train_losses = dict([(k, v.item())
                             for k, v in train_losses_ts.items()])
        train_metrics = dict([(k, v.item())
                             for k, v in train_metrics_ts.items()])
        pbar.close()
        return train_losses, train_metrics

4.2.3 DWA [6]

The paper "End-to-End Multi-Task Learning with Attention" defines an index that directly measures the learning speed of each task and uses it to guide the adjustment of the task weights.


$$\lambda_{k}(t):=\frac{K \exp \left(w_{k}(t-1) / T\right)}{\sum_{i} \exp \left(w_{i}(t-1) / T\right)}, \quad w_{k}(t-1)=\frac{\mathcal{L}_{k}(t-1)}{\mathcal{L}_{k}(t-2)}$$

Dividing the Loss of the previous round by the Loss of the round before it gives the rate at which each task's Loss is decreasing, which measures how fast the task is learning; these ratios are then normalized into the task weights. When a task's Loss decreases more slowly than the others, its weight increases; when it decreases quickly, its weight decreases. DWA can be regarded as a simplified GradNorm that only looks at learning speed and never touches the gradients, which makes it simple and direct. A minimal weight-computation sketch follows.
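The helper below is a minimal DWA sketch (a hypothetical function written for this article, not code from the paper): given each task's average Loss from the previous two epochs, it computes this epoch's weights according to the formula above.

import numpy as np

def dwa_weights(loss_prev1, loss_prev2, T=2.0):
    """loss_prev1: per-task losses at epoch t-1; loss_prev2: per-task losses at epoch t-2."""
    loss_prev1 = np.asarray(loss_prev1, dtype=np.float64)
    loss_prev2 = np.asarray(loss_prev2, dtype=np.float64)
    K = len(loss_prev1)
    w = loss_prev1 / loss_prev2               # w_k(t-1): learning-speed indicator
    exp_w = np.exp(w / T)
    return K * exp_w / exp_w.sum()            # lambda_k(t); the weights sum to K

# Example: task 0's Loss decreases more slowly than task 1's, so task 0 gets the larger weight
print(dwa_weights([0.9, 0.5], [1.0, 1.0]))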

4.2.4 PCGrad [7]

PCGrad is the method proposed by Google in the NeurIPS 2020 paper "Gradient Surgery for Multi-Task Learning". PCGrad points out three problems in MTL multi-objective optimization:

  • Inconsistent gradient directions, which cause tearing and need to be resolved
  • Inconsistent gradient magnitudes, which let the largest gradient dominate and need to be addressed
  • Large curvature, which makes overfitting easy and needs to be addressed

The solutions are as follows:

  • First, detect whether the gradients of two tasks conflict; the criterion is whether their cosine similarity is negative.
  • If they conflict, clip off the conflicting component (i.e., project the gradient of one task onto the direction orthogonal to the gradient of the other task).

The algorithm steps in the paper are as follows:

PCGrad core code reference:

import numpy as np
import tensorflow as tf
from tensorflow.python.training import optimizer

GATE_OP = optimizer.Optimizer.GATE_OP

class PCGrad(optimizer.Optimizer):
    def __init__(self, optimizer, use_locking=False, name="PCGrad"):
        """Wrap an existing optimizer and apply PCGrad gradient projection."""
        super(PCGrad, self).__init__(use_locking, name)
        self.optimizer = optimizer

    def compute_gradients(self, loss, var_list=None,
                          gate_gradients=GATE_OP,
                          aggregation_method=None,
                          colocate_gradients_with_ops=False,
                          grad_loss=None):
        assert type(loss) is list
        num_tasks = len(loss)
        loss = tf.stack(loss)
        tf.random.shuffle(loss)

        # Compute the flattened gradient of each task
        grads_task = tf.vectorized_map(
            lambda x: tf.concat([tf.reshape(grad, [-1,])
                                 for grad in tf.gradients(x, var_list)
                                 if grad is not None], axis=0), loss)

        # Project away the conflicting gradient components
        def proj_grad(grad_task):
            for k in range(num_tasks):
                inner_product = tf.reduce_sum(grad_task * grads_task[k])
                proj_direction = inner_product / tf.reduce_sum(grads_task[k] * grads_task[k])
                grad_task = grad_task - tf.minimum(proj_direction, 0.) * grads_task[k]
            return grad_task

        proj_grads_flatten = tf.vectorized_map(proj_grad, grads_task)

        # Restore the flattened, projected gradients to their original shapes
        proj_grads = []
        for j in range(num_tasks):
            start_idx = 0
            for idx, var in enumerate(var_list):
                grad_shape = var.get_shape()
                flatten_dim = np.prod([grad_shape.dims[i].value for i in range(len(grad_shape.dims))])
                proj_grad = proj_grads_flatten[j][start_idx:start_idx + flatten_dim]
                proj_grad = tf.reshape(proj_grad, grad_shape)
                if len(proj_grads) < len(var_list):
                    proj_grads.append(proj_grad)
                else:
                    proj_grads[idx] += proj_grad
                start_idx += flatten_dim
        grads_and_vars = list(zip(proj_grads, var_list))
        return grads_and_vars

4.2.5 GradVac [8]

GradVac is a method published by Google at ICLR 2021 in the paper "Gradient Vaccine: Investigating and Improving Multi-Task Optimization in Massively Multilingual Models". It is applied to multilingual machine translation tasks as an improvement over PCGrad.

Compare PCGrad to GradVac:

  • PCGrad only sets a lower bound: the cosine similarity between the gradients of two tasks should be at least 0. This lower bound is very easy to reach.
  • In practice, the gradient similarity of two tasks converges to a stable level during training, which can be regarded as the true similarity of the two tasks.
  • GradVac therefore pushes the gradient similarity of the two tasks toward this estimated true similarity, instead of merely satisfying PCGrad's lower bound. A simplified sketch of the update is shown below.
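The sketch below is a simplified, single-pair illustration of the GradVac idea on flattened NumPy gradients, written from a reading of the paper's update rule rather than taken from the authors' code; the exponential-moving-average target phi_hat is maintained by the caller.

import numpy as np

def gradvac_step(g1, g2, phi_hat, beta=0.01):
    """Adjust g1 so that its cosine similarity with g2 moves toward the EMA target phi_hat."""
    phi = np.dot(g1, g2) / (np.linalg.norm(g1) * np.linalg.norm(g2))  # current similarity
    if phi < phi_hat:
        # Rotate g1 toward g2 so that their similarity reaches the target phi_hat
        coef = (np.linalg.norm(g1) * (phi_hat * np.sqrt(1 - phi ** 2)
                - phi * np.sqrt(1 - phi_hat ** 2))) \
               / (np.linalg.norm(g2) * np.sqrt(1 - phi_hat ** 2))
        g1 = g1 + coef * g2
    # Update the EMA estimate of the task-pair similarity
    phi_hat = (1 - beta) * phi_hat + beta * phi
    return g1, phi_hat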

V. Summary

To sum up, this article covered optimization methods for multi-objective and multi-task scenarios, mainly network structure optimization and improvements to optimization methods and strategies. The ultimate goal is to alleviate the conflict and internal friction between tasks and to improve all business objectives as much as possible. Here are some tips for building a promising multi-task, multi-objective solution:

  1. First, study the business scenario and the key points of the business objectives, then determine how to combine the tasks:
  • Main task + main task: both address core business needs, and we hope both improve
  • Main task + auxiliary task: the auxiliary task provides extra knowledge or signal to help improve the main task
  2. Consider the importance and similarity of the different tasks, and be explicit about the relationship between the auxiliary tasks and the main task:
  • During actual training, one task can be trained and optimized alone while the Losses of the other tasks are observed
  • If the other tasks' Losses decrease in step, the correlation is strong; if their Losses fluctuate or trend upward, go back to the business itself and reconsider whether the tasks should be trained jointly
  3. For the network structure, MMoE or PLE can be used.

  4. Pay attention to the magnitude of each Loss during training; if they differ greatly across tasks, constrain and control them.

  5. For the optimization strategy of the training process, methods such as PCGrad can be tried to adjust the gradients, observing the effect.

VI. References

  • [1] Ma J, Zhao Z, Yi X, et al. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018: 1930-1939.
  • [2] Jiaqi Ma, Zhe Zhao, Jilin Chen, et al. SNR: Sub-Network Routing for Flexible Parameter Sharing in Multi-Task Learning[C]//The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19). 2019: 216-223.

  • [3] Tang H, Liu J, Zhao M, et al. Progressive layered extraction (ple): A novel multi-task learning (mtl) model for personalized recommendations[C]//Fourteenth ACM Conference on Recommender Systems. 2020: 269-278.

  • [4] Kendall A, Gal Y, Cipolla R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 7482-7491.
  • [5] Chen Z, Badrinarayanan V, Lee C Y, et al. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks[C]//International Conference on Machine Learning. PMLR, 2018: 794-803.

  • [6] Liu S, Johns E, Davison A J. End-to-end multi-task learning with attention[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 1871-1880.
  • [7] Yu T, Kumar S, Gupta A, et al. Gradient surgery for multi-task learning[J]. arXiv preprint arXiv:2001.06782, 2020.

  • [8] Wang Z, Tsvetkov Y, Firat O, et al. Gradient vaccine: Investigating and improving multi-task optimization in massively multilingual models[J]. arXiv preprint arXiv:2010.05874, 2020.

VII. Data acquisition

  • To obtain the full code for this article, go to the GitHub project github.com/ShowMeAI-Hu…
  • For some of the papers referenced in this article, as well as the "WeChat dataset", reply with the keyword "multi-objective learning" to the official account (AI Algorithm Research Institute).
  • If you are interested in applications of "multi-objective learning", you are welcome to follow our official account and read the articles "Multi-objective Optimization Practice in iQiyi Short Video Recommendation Service" and "Parallel Two-tower CTR Structure in Information Flow Recommendation Ranking" to learn more about how major companies implement these schemes!