Abstract: This article introduces the advantages of multi-task learning over single-task learning and some of its industrial applications. How do you move from single-task to multi-task learning? How do you improve AUC and prediction accuracy? How do you stay friendly to online systems with strict real-time requirements? Taking Taobao as an example, this article shares how multi-task learning powers personalized search and recommendation in e-commerce.


Brief introduction of the speaker:
Liu Shichen (alias: Xinai) is a senior algorithm expert at Alibaba's Search Division. He studied computer science in the Special Class for the Gifted Young at the University of Science and Technology of China. At Alibaba he works on Taobao search, ranking and personalization, focusing on the research and application of search ranking algorithms, covering real-time computing, deep learning, reinforcement learning and related fields, with work published at SIGKDD, WWW and other conferences.


The following content is compiled from the speaker's video presentation and slides.


This article covers the following aspects:
1. Background
2. Introduction to relevant knowledge
3. Multi-task model
4. Experiments and results
5. Practical tips and caveats


1. Background


Research purpose of multi-task learning: use machine learning and data mining techniques to better realize personalized search and recommendation in e-commerce applications.


Why use multi-task learning:
1) Search and recommendation used to rely mostly on single-task learning, but real industrial scenarios usually involve multiple tasks, so multi-task learning has more practical significance.
2) One multi-task model is smaller than several single-task models, uses less online CPU, and is therefore friendlier to online serving.
3) On Taobao, multi-task learning helps obtain a more general understanding and representation of users and products.


2. Introduction to relevant knowledge


1. Academic background


I'll start with some academic background and some recommendation work done with DNNs and RNNs. When it comes to recommendation, collaborative filtering probably comes to mind first. Since 2000, a large number of collaborative filtering algorithms have emerged, both model-based and memory-based. DNNs have been used for recommendation for a long time. Early on, Restricted Boltzmann Machines (RBM) were widely used and performed well for collaborative filtering at the time, outperforming both user-based and item-based collaborative filtering. In recent years, Denoising Auto-encoders (DAE) have been a mainstream choice.


In industry, recommendation algorithms have even more applications. For example, Microsoft proposed DSSM (Deep Structured Semantic Models), a pairwise ranking method, and Google proposed the Wide & Deep network for recommendation and ranking. More recently, RNN/CNN or attention-based models have become common: users leave a natural behavior sequence on a platform, which makes these sequence models a good fit.


2. Multi-task Representation Learning


In recent years, multi-task representation learning has become increasingly popular, because the success of machine learning and deep learning is largely attributed to models learning better representations of the data and mining the required information from it. Multi-task representation learning can extract more comprehensive and versatile information from data. Features extracted by a single-task model are only valid for that task, and a single feature set cannot describe a sample well. Multi-task learning is more appropriate when there are many tasks and the learned features must serve every task, that is, when the features need a certain universality. Multi-task learning comes in two flavors: one distinguishes main tasks from auxiliary tasks, where the auxiliary tasks help train the main objective; the other treats all tasks as equals, with no priority among them.


3. System background


Taobao mainly applies multi-task learning in its search system. The process is shown in the figure below:


First, the user enters a query, and the search engine returns a set of relevant candidates based on the inverted index. At the same time, prediction tasks are run to obtain information about the user, such as gender, age, purchasing power, shopping style and preferences. Based on the candidate set and the user information, the attributes and features of all goods in the candidate set can be obtained, as shown in the following table:


Item-related features include sales volume, after-sales satisfaction, and so on. Personalized features include the personalized prediction for an item, the user's price preference for it, etc. A given item may be described by dozens or even hundreds of feature dimensions; the model then combines these features into a single score, and the final ranking is by that total score. The yellow parts of the flow chart are the parts related to personalization, i.e. the parts that need model prediction; each yellow block can be viewed as one or more tasks. As you can see, the online ranking process usually involves multiple tasks, which is why multi-task representation learning is needed.
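To make the "single total score" concrete, here is a minimal Python sketch; the feature values and weights are hypothetical, and the real model learns the combination rather than using a fixed linear form:

```python
# Hypothetical example: combine per-item feature scores into one total
# score, then rank candidates by that score.
import numpy as np

items = {
    "item_a": np.array([0.80, 0.60, 0.30]),  # e.g. sales, satisfaction, pCTR
    "item_b": np.array([0.50, 0.90, 0.70]),
}
weights = np.array([0.5, 0.2, 0.3])          # produced by the ranking model

scores = {name: float(weights @ feats) for name, feats in items.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # items ordered by total score
```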


3. Multi-task model


The overall structure of Taobao multi-task model is shown in the figure below:


The input to the model is the user's behavior sequence on Taobao. Each behavior consists of two parts. The first part is the action itself, unrelated to the item: the behavior type may be a click, a search or a recommendation, and the behavior time may be 1 minute ago, 5 minutes ago or half an hour ago. The second part is related to the item. Each behavior x is thus expressed as a property description p (property) plus an item embedding e (embedding). A Long Short-Term Memory (LSTM) network then connects the user behavior sequence, and an attention net pools its outputs into a 128-dimensional vector. This vector is combined with other user information to obtain the final user representation, which is shared and learned by multiple tasks. In summary, the multi-task model has five layers: the behavior-sequence input layer, the embedding layer, the LSTM layer, the attention-based pooling layer, and the multi-task output layer. The technologies involved include:
  • Embedding
  • CNN/RNN/Memory Net
  • Attention
  • Multi-task Learning
  • Lifelong Learning/Transfer Learning


In short, a multi-task representation model is adopted to construct a user representation that can be shared across tasks and easily transferred. Each layer is described in detail next.


1. The Embedding layer


The embedding layer converts each user behavior into a vector. As mentioned above, a behavior consists of a behavior property and item features. Item features include the item ID, shop, brand, category (such as clothing or bags) and other information, plus some more generalized tags, such as whether the item is expensive, its color, and whether its style is Korean or European/American. In the figure above, the item description becomes more generalized from left to right. If an item is very popular, its ID alone can represent it; but when its sales volume is very low, the ID is not informative enough, and the shop, brand, category and the more generalized tags are needed. The behavior description covers three aspects: first, the behavior scenario, e.g. whether the behavior occurred in search, recommendation or purchase; second, the behavior time, e.g. within one minute, five minutes or half an hour, with Taobao dividing behaviors into different time slots accordingly; third, the event type: purchase, click, add-to-cart, or favorite. As shown in the figure above, the five item-feature attributes have dimensions 32, 24, 24, 16 and 28, and the three behavior-description attributes have dimensions 16, 16 and 16. Finally, all the vectors are concatenated into the final user behavior vector.
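As an illustration of the concatenation, here is a minimal NumPy sketch; the vocabulary size and field names are hypothetical, and only the embedding widths follow the dimensions quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Embedding widths follow the text: item id/shop/brand/category/tags ->
# 32, 24, 24, 16, 28 dims; scene/time-slot/event type -> 16, 16, 16 dims.
dims = {"item_id": 32, "shop": 24, "brand": 24, "category": 16, "tags": 28,
        "scene": 16, "time_slot": 16, "event": 16}
tables = {name: rng.normal(size=(1000, d)) for name, d in dims.items()}  # toy vocab

def embed_behavior(ids):
    """ids: dict field -> integer id; returns the concatenated behavior vector."""
    return np.concatenate([tables[field][i] for field, i in ids.items()])

x = embed_behavior({"item_id": 42, "shop": 7, "brand": 3, "category": 9,
                    "tags": 5, "scene": 0, "time_slot": 2, "event": 1})
print(x.shape)  # (172,) = 32+24+24+16+28 + 16+16+16
```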


2. Property Gated LSTM and Attention Net


Users usually produce a long sequence of behaviors on Taobao, such as browsing, clicking and purchasing items, and we want to extract information about the user from this sequence. Just as in natural language processing multiple word embeddings are composed into a sentence, here an LSTM can embed multiple behaviors into a behavior sequence. The difference from the original LSTM is that the input includes two parts: the item features and the behavior description. As you probably know, LSTM is a kind of RNN (recurrent neural network); its innovation is its gates, which on the one hand keep gradients from vanishing or exploding during training, and on the other hand can emphasize or de-emphasize individual elements of the sequence. In a plain RNN every element of the sequence is equal, but an LSTM can weight individual elements to indicate which to emphasize and which to ignore, and this matters greatly for learning user behavior. For example, a purchase a user made six months ago can be significantly more informative than a casual click made recently. How is this expressed in the model? Here, the user's behavior description is fed into the three gates, namely the forget gate, input gate and output gate, so that the behavior description determines what in a behavior should be attended to and what should be ignored. This yields the Property Gated LSTM, whose structure is shown in the following figure:


In the figure above, p represents the property, e the embedding, h(t-1) the output of the previous LSTM step, and h(t) the output of the current step. The specific Property Gated LSTM formulation is as follows:
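The formula figure is not reproduced here. Based on the description above (the behavior property $p_t$ feeds all three gates, while the item embedding $e_t$ drives the candidate cell state), a plausible reconstruction of the update is:

$$
\begin{aligned}
i_t &= \sigma(W_i e_t + V_i p_t + U_i h_{t-1} + b_i)\\
f_t &= \sigma(W_f e_t + V_f p_t + U_f h_{t-1} + b_f)\\
o_t &= \sigma(W_o e_t + V_o p_t + U_o h_{t-1} + b_o)\\
\tilde{c}_t &= \tanh(W_c e_t + U_c h_{t-1} + b_c)\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$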


After the LSTM, Taobao adds an attention net mechanism, again borrowing from natural language processing. Its function is similar to a gate in that it also determines the importance of a behavior. The difference is that a gate can judge importance only from the current behavior itself, while the attention net can incorporate extra information such as the query and the user profile. The user profile includes age, gender, purchasing power, shopping preferences and so on; the query includes its own ID, its word segmentation, and other information. The details are shown in the figure below:


Suppose 30 user behaviors are input and the LSTM outputs 30 vectors h. The attention net determines the importance of each output and then pools them. For example, a user clicks on a dress, then buys a phone, then browses robot vacuums, laptops and so on. If the user now enters the query "iPhone", the importance of the clothing records in the behavior sequence drops significantly, because they do not reflect the user's current interest, while the earlier records about phones express the current interest much better.
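A minimal NumPy sketch of such attention-based pooling; the bilinear scoring function and the shapes are assumptions, and the real attention net may well be a small MLP over the concatenated inputs:

```python
import numpy as np

def attention_pool(h, query_vec, user_vec, W):
    """h: (T, d) LSTM outputs; query_vec/user_vec: context vectors.
    Scores each step against the query/user context, softmaxes the scores,
    and returns the weighted sum over time as the pooled user vector."""
    ctx = np.concatenate([query_vec, user_vec])
    scores = np.array([h[t] @ W @ ctx for t in range(h.shape[0])])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                  # softmax attention weights
    return alpha @ h                      # weighted sum over the T steps

T, d = 30, 128                            # 30 behaviors, 128-d LSTM outputs
rng = np.random.default_rng(1)
pooled = attention_pool(rng.normal(size=(T, d)), rng.normal(size=32),
                        rng.normal(size=32), rng.normal(size=(d, 64)))
print(pooled.shape)  # (128,)
```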


3. The five tasks


After embedding, the user behavior sequence passes through the LSTM layer and is pooled by the attention net, finally yielding a 256-dimensional user representation. With this general user representation in hand, it is applied to the following five tasks.


The first task is the CTR estimation task, commonly used in advertising and recommendation ranking, e.g. estimating a user's click-through rate for a movie, video or piece of music. Taobao uses CTR estimation to predict a user's click probability for given items. The loss is the negative log-likelihood. The input consists of the 256-dimensional user representation and the item embedding, which is shared with the embeddings used in the user behavior sequence. These inputs then pass through a three-layer network to produce the estimate.
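A minimal sketch of such a head; the layer shapes and the ReLU choice are assumptions, and only the three-layer structure, the shared inputs and the log-likelihood loss follow the description above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ctr_head(user_repr, item_emb, layers):
    """layers: list of (W, b) pairs for three fully connected layers.
    user_repr: shared 256-d user vector; item_emb: item embedding shared
    with the behavior-sequence embedding table."""
    x = np.concatenate([user_repr, item_emb])
    for W, b in layers[:-1]:
        x = np.maximum(0.0, W @ x + b)    # ReLU hidden layers
    W, b = layers[-1]
    return sigmoid(W @ x + b)             # predicted click probability

def ctr_loss(p, y):
    """Negative log-likelihood of the observed click label y in {0, 1}."""
    eps = 1e-9
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```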


The second task is the L2R (Learning to Rank, also LTR) task, similar in form to CTR estimation, but its input must contain the actual ranking features. In CTR estimation the user representation is fully connected with the item embedding, while in the L2R task the user representation, after two network layers, is combined linearly with the ranking features. The advantage is that the top layer of the network is easy to interpret and easy to debug. Unlike CTR, the loss adds a weight term indicating which behaviors to value and which to ignore: clicks and purchases should receive larger weights, while browsing an item without any follow-up action can be nearly ignored.
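The weighting can be sketched as below; the weight values are purely illustrative:

```python
import numpy as np

def l2r_loss(p, y, w):
    """Same logistic form as the CTR loss, but each sample carries a weight w
    reflecting its behavior type."""
    eps = 1e-9
    return -w * (y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Illustrative weights: purchases emphasized, idle browsing nearly ignored.
BEHAVIOR_WEIGHT = {"purchase": 5.0, "click": 1.0, "browse_only": 0.1}
```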




Task 3 models the user's preference for content creators ("daren", influencers). The motivation is to make the learned user representation more general: if every task were item-related, the learned representation would be limited. Task 3 therefore learns which types of influencers a user likes. Besides the 256-dimensional user representation, its input includes features of the influencer in question, and the task is a binary classification of whether the user will follow that influencer.




Task 4 estimates user purchasing power (PPP, Purchasing Power Prediction). Purchasing power is divided into seven grades, with grade 1 the lowest and grade 7 the highest. This predicts whether a user pursues quality with higher purchasing power, or pursues value and prefers lower-priced goods. The purchasing-power estimate is independent of any item; it is simply a classification over the input 256-dimensional user representation.
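A minimal sketch, assuming a plain 7-way softmax classifier over the user vector:

```python
import numpy as np

def ppp_head(user_repr, W, b):
    """Purchasing-power prediction: a 7-way softmax over the shared 256-d
    user representation alone, no item input. Shapes: W (7, 256), b (7,)."""
    z = W @ user_repr + b
    z = z - z.max()                       # numerical stability
    p = np.exp(z)
    return p / p.sum()                    # probabilities of grades 1..7
```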


The four tasks above are the ones that learn the user representation in the network; they are trained simultaneously to obtain the final user representation. The next step is to verify whether this representation can be applied to other tasks, so a transfer task is set up. The transfer task predicts users' shop preferences. It is not trained together with the four tasks above; instead, it learns from the user representation produced after those four tasks have been trained, to verify whether the representation can be used directly in a new task. Accordingly, compared with the other four tasks, which are attached to the larger network, the transfer task's network is relatively shallow.




4. Experiments and results


After the model design is completed, experiments are needed to verify its effect. Consider the training process first. As described above there are 5 tasks, hence 5 independent training data sets: 4 are trained simultaneously and the last one is used for verification. The sample volume is about 6 billion per day after sampling, and would be about 20 billion without sampling. Ten days of data are used for training and one day for testing. Training uses mini-batches of 1024 samples. As for the online environment, CTR and LTR affect the online results: the CTR estimate serves online as a ranking feature, LTR affects how the ranking features are combined and thus has a larger impact, and PPP is used to estimate user purchasing power.


The figure below lists some experimental parameters. For example, the LSTM takes user sequences of length 100, the dropout rate is 0.8, L2 regularization is used, and the AdaGrad learning rate is 0.001. Training runs on a distributed TensorFlow environment with 2000 workers and 96 parameter servers, each with 15 CPU cores and no GPU; the full training took 4 days.


1. Comparison of DUPN with baseline methods


First, the results of the first set of experiments, which compare the proposed method, DUPN, with baseline methods: Wide, Wide & Deep, DSSM, and CNN-max. The Wide method is a single-layer network over many raw and crossed features followed by logistic regression (LR). The second and third baselines are Google's Wide & Deep and Microsoft's DSSM. Finally, CNN-max uses a CNN to extract features from the user behavior sequence and max-pooling to obtain the user representation. On the DUPN side there are five variants proposed in this paper: DUPN-nobp, DUPN-bplstm, DUPN-bpatt, DUPN-all and DUPN-w2v. The first three indicate whether the behavior property is used nowhere, only in the LSTM, or only in the attention net, respectively; DUPN-all is the complete algorithm. DUPN-w2v drops end-to-end learning of item embeddings in favor of pre-training: item vectors are trained word2vec-style and fed directly into the subsequent network, which greatly reduces the parameter space. These methods are applied to tasks 1-4, with the following results:


As the table shows, CNN-max is the best of the four baselines. Wide & Deep and DSSM do not consider the order of the user's behaviors, only a combination of user features, while CNN-max extracts features from the behavior sequence; hence, among the baselines, it has the highest AUC on the first three tasks and the highest accuracy on task 4. Among the DUPN variants, DUPN-all works best. DUPN-nobp, which drops the behavior property, performs close to CNN-max, confirming that with the LSTM layer alone the effect is similar to a CNN. Adding the Property Gated LSTM or the attention net (DUPN-bplstm and DUPN-bpatt) each brings a large improvement over DUPN-nobp, and the complete DUPN-all achieves the best results, gaining one to three percentage points of AUC per task and about five percentage points of purchasing-power accuracy. The last variant, DUPN-w2v, uses pre-training to shrink the parameter space and ease training, but does not outperform the others; the likely reason is that word2vec-style pre-training only captures which items have similar attributes in the training data, not information about the items themselves, such as how well they sell. The first experiment therefore shows that the proposed DUPN-all is better than the traditional methods on every task.


2. Comparison between multi-task learning and single-task learning


Next, verify the difference between multi-task and single-task learning. Tasks 1 to 4 above can be learned independently as single tasks or simultaneously. The figure below compares the two settings:


The top four plots compare the AUC of each task under the two settings, and the bottom four show the decline of the loss. Take the first plot, the L2R ranking AUC, and consider the trend: at first the AUC rises quickly to about 0.68, then climbs slowly to 0.75. The extracted user features include generalized features present in every sample; these dominate the early phase, so learning is fast. The sparse features, such as shop features or item IDs, are learned very slowly, but they keep nudging the AUC upward. The red curves show multi-task learning, the blue curves single-task learning. Clearly, in all plots the AUC and accuracy of multi-task learning are higher. How should we read this? One might expect multi-tasking to hurt some tasks, but it does not. When the four tasks are trained together, the other three tasks act as regularizers for each one, much as adding L2 regularization during learning raises the AUC. The difference from L2 regularization is that the other tasks not only prevent overfitting but also make the shared low-level features more general. Multi-task learning is therefore actually beneficial for each single task, yielding higher AUC.


Note that all the results above are on the test set. On the training set, the multi-task numbers would be lower in comparison, but the gap between the two settings remains. In terms of accuracy, multi-task learning beats single-task learning.


3. Model transfer ability


Next, verify the model's transfer ability. After the four tasks above have been learned, task 5 learns users' preference for shops. Four learning schemes are compared: End-to-end Re-training with a Single task (RS), End-to-end Re-training with All tasks (RA), Representation Transfer (RT), and Network Fine Tuning (FT). RS uses the DUPN network but treats task 5 as a brand-new task learned on its own. RA retrains everything, with task 5 and the original four tasks trained together. RT does not retrain the network at all: the latest user vector and the shop attributes are fed into a shallow model trained on its own. FT attaches task 5 to the back end of the big network and fine-tunes the existing network to get the final result. The training curves of the four methods are shown below:


In the figure above, the horizontal axis is training time and the vertical axis is AUC. The green curve, FT, works best: it converges quickly, and its final AUC is also the highest, about 0.675. This shows that the pre-trained network is already good; a little fine-tuning quickly reaches the final result. The black curve, RA, converges more slowly but eventually matches FT's AUC; FT, however, is clearly cheaper, since RA needs a complete retraining, which can take four days or more and a lot of compute and storage. The RT curve only takes the latest 256-dimensional user vector and the shop features as input; with few parameters to learn, it converges fastest, but its final effect is about 2% lower than FT. RT's advantage is that the initial network is untouched and a new task is simply grafted on; deployed online, its cost is small while its AUC loss is barely noticeable. For instance, when five tasks run online at large scale, it may be impossible to train them all simultaneously, and a real-time system must respond to user feedback promptly, so RT is the best choice there. Of the four methods, only RT shares one model; the other three each need at least two models, which nearly doubles online computation and storage. So with ample online resources, FT is recommended; with limited online capacity, when a fast, memory-light method is needed, RT is the fit.
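A conceptual sketch of RT versus FT; the backbone and head objects and their methods are hypothetical stand-ins, not a real API:

```python
# Hypothetical `backbone` (trained DUPN) and `shop_head` (new task 5 model).
def representation_transfer(backbone, shop_head, batches):
    """RT: freeze the backbone; train only a shallow head on the cached
    256-d user vectors plus shop features. Cheapest, slightly lower AUC."""
    for users, shops, labels in batches:
        u = backbone.user_vector(users)       # fixed, never updated
        shop_head.train_step(u, shops, labels)

def fine_tuning(backbone, shop_head, batches):
    """FT: attach the new head to the backbone and update both, the
    backbone with a smaller learning rate. Fastest to a high AUC."""
    for users, shops, labels in batches:
        loss = shop_head.loss(backbone.user_vector(users), shops, labels)
        backbone.update(loss, lr=1e-4)        # gentle fine-tuning
        shop_head.update(loss, lr=1e-3)
```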


4. Attention analysis


Next, analyze the learned attention from two angles. The first is the query the user enters, as shown in the figure below. The bottom row shows items the user has interacted with on Taobao, with behavior time growing more recent from left to right. When the user enters different queries, the attention over these historical behaviors differs; the shade of color represents the size of the attention. For example, when the user searches for "laptop", attention concentrates on earphones and phones, while clothing categories such as dresses and T-shirts receive little weight. The query is thus very effective in determining the importance of historical behaviors.


The second angle is the behavior property. In the figure below, the horizontal axis is behavior time and the vertical axis is behavior type; different behavior types get different attention weights. Overall, purchase behavior matters most, far more than clicks, add-to-cart and favorites; favoriting is probably the least informative for the user representation. Interestingly, the more recent a click, the more it reflects the user's interest, but this does not hold for purchases. That matches intuition: once a user has bought an item, they are unlikely to buy it again soon, so a very recent purchase gets a lighter color, while purchases made hours or days earlier reflect the user's interests better. This is why adding the behavior property to the learning network increases accuracy.


Finally, the algorithm was deployed in a Taobao online system, where it is now fully live. Taobao measured its online effect over 7 days: as the table below shows, CTR increased by about 2%, sales by about 3%, and the accuracy of purchasing-power prediction rose from 33% to 44%.


The two figures below show the improvement in more detail. Taobao users' purchasing power is divided into 7 grades, so the precision (top) and recall (bottom) of each grade are observed. Both improve, but unevenly: in precision, grades 1 and 7 improve most while grades 2 and 3 improve less; in recall, the improvement is fairly uniform across grades, roughly between 5% and 10%.




5. Practical tips and caveats


1. The model needs frequent updates


Item ID features change constantly: item popularity and styles change with the seasons, and user interests change all the time, so the embeddings must change too. In practice, if the model is not updated, its effect gradually deteriorates. The large number of ID features also makes training very slow; a full training can take up to 4 days. So you can run one full training on 10 days of data, then do incremental learning every day using the previous day's data. On the one hand this cuts training time sharply, to within a day; on the other hand it keeps the model closer to recent data. For example, on November 11 (Singles' Day), Taobao updated the model twice using data from different periods of that day, since that day's samples differ greatly from ordinary days. After the updates, the training metrics improved significantly.
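A minimal sketch of this schedule, with a hypothetical model API and checkpoint path:

```python
# One full pass over 10 days of data, then a daily incremental pass over
# the previous day's data, warm-starting from the latest checkpoint.
def full_train(model, ten_days_of_data):
    for day in ten_days_of_data:      # the full pass takes about 4 days
        model.fit(day)
    model.save("ckpt/latest")

def daily_update(model, yesterday_data):
    model.restore("ckpt/latest")      # warm start from the newest model
    model.fit(yesterday_data)         # finishes within a day
    model.save("ckpt/latest")
```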


2. Split the model


When the model goes live, it can be split to some extent. In the ranking task, a CTR or LTR estimate is needed for every item; if there are many items, inference becomes very time-consuming, so how do we cut the computation? The model can be split into a user part (red) and an item part (blue), as shown below:


There is little interaction between the user part and the item part, so for a given user and query the user part can be computed just once, yielding a user vector with which to score items. The CTR and LTR estimates are then computed on the item part, and only that part has to be repeated per item. Since most of the computation is concentrated in the red user part, this split is very friendly to online serving and can reduce online computation by a factor of thousands, making the model far more efficient online.
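A minimal sketch of the split, assuming the final score is an inner product between the two towers' outputs:

```python
def score_candidates(user_tower, item_tower, user_input, candidates):
    """Split inference (hypothetical tower callables): the heavy user tower
    (embedding + LSTM + attention) runs once per query; only the light item
    part repeats for each candidate."""
    u = user_tower(user_input)                        # computed once
    return [float(u @ item_tower(item)) for item in candidates]
```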


3. Consistency in BN


Batch normalization noticeably improves the model, bringing a significant AUC gain. Note, however, that the mean and variance BN memorizes from the offline training samples must be consistent with the online data. Training samples typically undergo filtering and sampling, e.g. upsampling click and purchase samples, so the mean of some dimensions can end up far above the true online mean; the test-set AUC may still improve, but this is very bad for the online effect. The experiments do show that the user vector representation transfers well, i.e. it also performs well on other tasks, but the literature is inconsistent and even contradictory on this point, so pay attention to your particular scenario.
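An illustrative sanity check one might run, comparing feature means between the sampled training data and raw online traffic:

```python
import numpy as np

def bn_drift_dims(train_features, online_features, tol=0.1):
    """Compare per-dimension means of the sampled training data with raw
    online traffic. Dimensions whose means diverge will make BN's
    remembered statistics wrong at serving time. `tol` is illustrative."""
    gap = np.abs(train_features.mean(axis=0) - online_features.mean(axis=0))
    return np.flatnonzero(gap > tol)      # indices of drifted dimensions
```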


4. A brief look at the Taobao-specific pipeline


The following covers the deployment pipeline. Since these processes are specific to Taobao, only a brief introduction is given, as in the figure below. First, users' click, purchase and page-view behaviors are joined through ODPS; 10 days of data are used for training and one day for evaluation, and the data are then placed on HDFS for training with TensorFlow. The resulting model is around 150 GB, so once the number of models grows beyond five, online memory can no longer hold them. This is another important reason to adopt multi-task learning: it reduces the model's storage and computation cost.


The next figure shows the incremental pipeline: new user behavior data flows from the online platform into ODPS, and the old model plus the new data are incrementally trained to produce the updated model.


The last diagram shows the serving process. As mentioned above, the model is split into two parts, so serving happens in two places: one computes the user representation during query processing, and the other combines the user representation with item attributes to produce the item scores.

