Brief introduction: At the Aliyun Developer Conference on May 29th, Qin Jiangjie and Liu Tongxuan shared the principles of real-time recommendation systems and what a real-time recommendation system is, the overall system architecture and how to implement it on Aliyun, as well as the details of real-time deep learning.

This article summarizes the talk on the architecture design and technology evolution of recommender systems based on real-time deep learning, given by Qin Jiangjie and Liu Tongxuan at the Big Data and AI Integrated Platform sub-forum of the AliCloud Developer Conference on May 29. The contents are as follows:

  1. The principles of real-time recommender systems and what a real-time recommender system is
  2. The architecture of the whole system and how to implement it on Aliyun
  3. A detailed introduction to real-time deep learning.

GitHub address: https://github.com/apache/flink. Everyone is welcome to give Flink a like and a star ~

First, the principles of real-time recommendation systems

Before introducing the principles of real-time recommendation systems, let’s first look at a traditional, classic static recommendation system.

User behavior logs are collected in a message queue and then loaded by ETL into feature generation and model training; this part of the data is processed offline. The offline model and feature updates are pushed to the online system, such as the feature store and the online inference service, which then serve the online search and promotion applications. The recommendation system itself is a service, and the promotion applications displayed in the front end may include search recommendation, advertising recommendation, and so on. So how does this static system actually work? Let’s look at the following example.

1. Static recommendation system

The behavior logs of current users are captured and fed into the offline system for feature generation and model training. In this log, user 1 and user 2 both browsed Page#200 and some other pages around the same time, and user 1 also browsed Page#100 and clicked ADS#2002. The log is extracted offline by ETL and sent to feature generation and model training. From the generated features and model, it can be seen that user 1 and user 2 are both Chinese male users, so “Chinese male” is one feature shared by these two users. The rule the model finally learns is: when a Chinese male user browses Page#100, ADS#2002 should be pushed to him. The logic behind this is to group the behaviors of similar users together, on the assumption that users of the same kind behave in the same way.

The user features are pushed to the feature store, and the trained model is pushed to the online service. When a user 4 appears, the online inference service looks up this user’s features in the feature store and may find that the user happens to be a Chinese male user. Since the model has previously learned that Chinese male users who visit Page#100 should be shown ADS#2002, it recommends ADS#2002 to user 4 according to the learned model. The above is the basic workflow of a static recommendation system.

However, this system also has some problems. For example, after the first day’s model training is completed, it may turn out that on the second day user 4’s behavior is actually closer to that of user 3, rather than user 1 and user 2. However, the previously trained model still says that Chinese male users who visit Page#100 should be shown ADS#2002, and this recommendation is made by default. Only after the next round of model training can the system discover that user 4 is similar to user 3, so there is a delay before the new recommendation takes effect. This is because both the model and the features are static.

For a static recommendation system, features and models are generated statically. Taking a classification model as an example, users are grouped by similarity, and users in the same group are assumed to have similar behavioral interests and characteristics. Once a user is classified into a certain group, he stays in that group until the model is retrained.

2. Problems with the static recommendation system

  • First, user behavior is highly diversified, and a single static profile cannot describe a user’s behavior.
  • Second, the behavior of a certain class of users may be similar, but the behavior itself changes over time. For example, “Chinese male users who visit Page#100 are shown ADS#2002” was yesterday’s behavioral rule; on the second day it may turn out that not all Chinese male users click ADS#2002 when they see Page#100.

3. Solutions

3.1 Flexible recommendation after adding real-time feature engineering

Add real-time feature engineering to the recommendation system: read the messages from the message queue and generate near-line features. For example, track in real time the 10 most-clicked ads among Chinese male users who visited Page#100 recently, say within the last 10 minutes or half an hour. This is not information derived from yesterday’s historical data, but from users’ real-time behavior today; that is what a real-time feature is.
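As a rough illustration of this kind of near-line feature, here is a minimal sketch of a sliding-window click count computed with Flink SQL, submitted through PyFlink. The table name, fields, the Kafka stand-in for the message queue, and the 10-minute/1-minute window sizes are illustrative assumptions, not the production setup described in the talk.

```python
# Minimal sketch: sliding-window click counts per (user segment, page, ad) as a
# real-time feature. Table name, fields, and window sizes are illustrative
# assumptions; Kafka stands in for the message queue.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(
    EnvironmentSettings.new_instance().in_streaming_mode().build())

# Ad-click events arriving from the message queue (requires the Kafka SQL connector jar).
t_env.execute_sql("""
    CREATE TABLE ad_clicks (
        user_segment STRING,       -- e.g. 'cn_male'
        page_id      STRING,       -- e.g. 'Page#100'
        ad_id        STRING,       -- e.g. 'ADS#2001'
        event_time   TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'ad_clicks',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")

# Click count per ad over the last 10 minutes, refreshed every minute. Selecting
# the top 10 per (segment, page) can be done downstream, e.g. with ROW_NUMBER()
# or inside the feature store, before writing the result to Hologres.
t_env.execute_sql("""
    CREATE TEMPORARY VIEW recent_ad_clicks AS
    SELECT
        user_segment,
        page_id,
        ad_id,
        COUNT(*) AS click_cnt,
        HOP_END(event_time, INTERVAL '1' MINUTE, INTERVAL '10' MINUTE) AS window_end
    FROM ad_clicks
    GROUP BY
        HOP(event_time, INTERVAL '1' MINUTE, INTERVAL '10' MINUTE),
        user_segment, page_id, ad_id
""")
```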

With this kind of real-time feature, the recommendation can follow what the crowd is doing at the moment. Similarly, if the features are collected from a single user’s behavior in the last 3 or 5 minutes, that user’s intent at the moment can be tracked more accurately and more precise recommendations can be made for him.

Therefore, after adding real-time features, the recommendation system can recommend accurately. For example, if user 4 visits Page#100 in this case, what the system has newly learned is: among Chinese male users who recently visited Page#100, the most clicked ad is ADS#2001. It will then directly recommend ADS#2001 instead of pushing ADS#2002 based on yesterday’s information.

3.2 Limitations of a recommendation system with only real-time features

Previously, the behavior of user 1 and user 2 was very similar, and with real-time features added we also know their current intent. But suppose that, while user 1 and user 2 share the same static features, their behavior becomes inconsistent; in other words, users that the model considers to be of the same type diverge into two types of users. If the model is static, this new class of users cannot be discovered even with real-time features; the model has to be retrained before the new classification can be produced.

After adding real-time feature engineering, the recommendation system can track the behavior of a certain class of users and follow the overall trend of changes; it can also track a single user’s behavior in real time and understand his intent at that moment. However, when the classification structure of the model itself changes, the system cannot find the most appropriate class, and the model has to be retrained to re-classify. This situation is encountered in many cases.

For example, when many new products are launched, the business is growing rapidly and many new users appear every day, or the distribution of user behavior changes rapidly. In such cases, even with a real-time feature system, the model itself degrades gradually: a model trained yesterday and put online again today may not work well.

3.3 Solutions

Two new parts are added to the recommendation system: near-line sample generation and near-line model training.

Suppose user 1 and user 2 are users in Shanghai and Beijing respectively. The previous model does not know the difference between Shanghai and Beijing users; it only knows they are both Chinese male users. After real-time model training is added, the model gradually learns that the behaviors of Beijing users and Shanghai users differ, and once this is captured, the recommendations will differ accordingly.

For example, a sudden rainstorm in Beijing or particularly hot weather in Shanghai today will lead to different behavior from users on the two sides. When another user 4 arrives, the model can tell whether this user behaves like a Shanghai user or a Beijing user. If he behaves like a Shanghai user, he gets the same recommendations as the Shanghai users; otherwise, he is recommended something else.

The main purpose of adding real-time model training is to make the model fit the distribution of user behavior at this moment as closely as possible on top of the dynamic features, and at the same time to mitigate model degradation.

Second, Alibaba’s real-time recommendation solution

First, let’s look at the benefits Alibaba has seen from implementing this solution internally:

  • The first is timeliness. Big promotions have become routine at Alibaba, and the timeliness of the whole model is greatly improved during promotion periods.
  • The second is flexibility. Features and models can be adjusted at any time as needed.
  • The third is reliability. When people first use a fully real-time recommendation system, they may feel uneasy: without large-scale offline verification overnight, putting the model directly online does not feel reliable enough. In fact, there is already a complete process in place to guarantee the stability and reliability of the system.

Viewed from the diagram, the pipeline from features to samples to model to online prediction is no different from the offline pipeline. The main difference is that the whole process is real-time, and this real-time pipeline serves the online search and promotion applications.

1. How to implement it

The solution evolved from the classic offline architecture.

First, user behavior is taken from the message queue into offline storage, which keeps all historical user behavior. On top of this offline storage, static features are computed and samples are generated; the samples are then written to the sample store for offline model training. After that, the offline model is published to the model center for model validation. Finally, the validated model is pushed to the inference service to serve the online business. This is the complete offline system.

We make this real-time by changing three things:

  • The first is feature computation;
  • The second is sample generation;
  • The third is model training.

Compared to the previous architecture, the message queue is no longer only written to offline storage; it is also split into two links:

  • The first link does real-time feature computation, for example which ads Chinese male users clicked when viewing Page#100 in the last few minutes. This is computed in real time and captures the behavioral characteristics users have shown in the recent period.
  • The other link uses the message queue for real-time sample stitching, which means there is no need to label manually, because the user will tell us the label: if we make a recommendation and the user clicks on it, it is a positive sample; if the user does not click within a certain amount of time, we consider it a negative sample. So users label for us, and the samples are easy to obtain and are written to the sample store. The difference is that this sample store no longer serves only offline model training, but also real-time model training. (A minimal sketch of this labeling logic follows this list.)
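A minimal sketch of this “the user labels for us” idea, expressed as a Flink SQL interval join: an impression that receives a click with the same request_id within 15 minutes becomes a positive sample, otherwise a negative one is emitted once the window has passed. The table names, fields, and the 15-minute bound are assumptions for illustration.

```python
# Minimal sketch of label generation as a Flink SQL interval join: an impression
# that receives a click with the same request_id within 15 minutes is a positive
# sample; once the window has passed without a click, a negative sample is emitted.
# Table names, fields, and the 15-minute bound are illustrative assumptions.
label_sql = """
    SELECT
        i.request_id,
        i.user_id,
        i.ad_id,
        i.event_time AS impression_time,
        CASE WHEN c.request_id IS NOT NULL THEN 1 ELSE 0 END AS label
    FROM impressions AS i
    LEFT JOIN clicks AS c
      ON i.request_id = c.request_id
     AND c.event_time BETWEEN i.event_time
                          AND i.event_time + INTERVAL '15' MINUTE
"""
# Registered against impression/click source tables with the same PyFlink
# TableEnvironment pattern shown earlier, e.g.:
# t_env.execute_sql("CREATE TEMPORARY VIEW labeled_samples AS " + label_sql)
```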

Offline model training is usually T+1, at the day level, and produces a base model, which is handed to real-time model training for incremental training. The incremental training may output a model every 10 or 15 minutes, which is then sent to model storage for model validation and finally put online.

The green parts of the architecture diagram are all real-time: some are newly added, and some were originally offline.

2. Aliyun enterprise-level real-time recommendation solution

How is the Aliyun enterprise-level real-time recommendation solution built with Aliyun products?

The message queue uses DataHub; real-time feature and sample computation uses Realtime Compute for Apache Flink; offline feature storage and static feature computation use MaxCompute; the feature store and sample center use Hologres (MaxCompute interactive analytics); all message-queue parts use DataHub; model training uses PAI, and model storage, model validation, and the online inference service are also part of PAI.

2.1 Real-time feature calculation and inference

For feature computation and inference, user logs are collected in real time and imported into Realtime Compute for Apache Flink for real-time feature computation. The results then go into Hologres, taking advantage of Hologres’s streaming write capability, and Hologres serves as the feature center. PAI can then directly query these user features in Hologres, that is, the point-lookup capability.

In Realtime Compute for Apache Flink, features such as a user’s browsing records in the last 5 minutes are computed, covering products, articles, videos, and so on; the real-time features differ according to the business. They may also include, for example, the 50 most-clicked products in each category in the last 10 minutes, the most-viewed articles, videos, and products in the last 30 minutes, and the 100 most-searched terms in the last 30 minutes. These apply to different scenarios, such as search and recommendation for ads, videos, texts, news, and so on. This data is used in the real-time feature computation and inference link, and on top of this link, static feature backfill is sometimes needed as well.
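As a sketch of the point-lookup path, the snippet below queries a user’s real-time features from Hologres at inference time. Hologres is PostgreSQL-protocol compatible, so a standard PostgreSQL client is used here; the endpoint, table, and column names are assumptions.

```python
# Sketch: point lookup of a user's real-time features from Hologres at inference
# time. Hologres speaks the PostgreSQL protocol, so a standard PostgreSQL client
# works; endpoint, table, and column names here are assumptions.
import psycopg2

conn = psycopg2.connect(
    host="your-hologres-endpoint", port=80,
    dbname="recommend", user="your_access_id", password="your_access_key")

def fetch_realtime_features(user_id: str):
    """Return the real-time features written by the Flink job, or None if absent."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT recent_views_5min, top_clicked_ads_10min "
            "FROM user_realtime_features WHERE user_id = %s",
            (user_id,))
        row = cur.fetchone()
    return None if row is None else {
        "recent_views_5min": row[0],
        "top_clicked_ads_10min": row[1],
    }

features = fetch_realtime_features("user_4")   # consumed by the inference service
```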

2.2 Static feature backfill

For example, when a new feature goes online, if the feature requires the last 30 days of user behavior, the real-time link cannot simply wait 30 days after going online to accumulate it. Instead, the offline data must be used to fill in the last 30 days of values for this feature. This is called feature backfill. The backfill is computed with MaxCompute and written to Hologres; this is the new-feature scenario.

Of course, there are other scenarios besides adding static features. For example, an online feature may have had a bug, and the flawed data has already landed offline; in that case, the backfill process is also used to correct the offline features.
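A hedged sketch of what such a backfill might look like: a MaxCompute SQL job aggregates the last 30 days of behavior into the new feature and writes it to a partition, which is then loaded into the Hologres feature table. The table names, partition convention, and feature definition are assumptions, not the actual schema.

```python
# Sketch: backfilling a new 30-day behavior feature from offline data in MaxCompute.
# Table names, the partition convention, and the feature definition are assumptions;
# the SQL can be submitted with any MaxCompute client (DataWorks, the console, or
# PyODPS), and the resulting partition is then loaded into the Hologres feature table.
backfill_sql = """
INSERT OVERWRITE TABLE user_static_features PARTITION (ds='${bizdate}')
SELECT
    user_id,
    COUNT(*)                AS view_cnt_30d,       -- the newly added feature
    COUNT(DISTINCT item_id) AS distinct_items_30d  -- another 30-day aggregate
FROM user_behavior_log
WHERE ds BETWEEN '${bizdate_minus_29}' AND '${bizdate}'
GROUP BY user_id;
"""
```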

2.3 Real-time sample stitching

Real-time sample stitching: in a recommendation scenario, the sample gets a positive or negative label from the impression and click streams. But a label alone is obviously not enough; training also needs features. The features can come from DataHub. Once real-time features are added, a sample’s features change from moment to moment.

For example, suppose a recommendation for a certain product is made at 10:00 am, and the user’s real-time feature at that moment is his browsing history between 9:55 and 10:00. But by the time we see the sample flow back, it is probably 10:15. If this sample is a positive one, we can no longer see the user’s real-time features as they were during the period when the product was recommended and the purchase was made.

By that point, the feature has become the user’s browsing history from 10:10 to 10:15. But the prediction was not made based on the browsing history of those 5 minutes, so the features actually used when making the recommendation must be saved and attached to the sample when it is generated. That is the role of DataHub here.

When ES is used to make real-time recommendations, the features used to make the recommendation at that time need to be saved for the generation of this sample. Once the samples are generated, they can be stored in Hologres and MaxCompute, and the real-time samples can be stored in DataHub.
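A minimal sketch of this feature-snapshot idea: the features actually used at serving time are written to the message queue keyed by request_id, and joined back onto the labeled sample so that training sees the 9:55 to 10:00 features rather than the 10:10 to 10:15 ones. The stream names and the join window are assumptions; this builds on the labeled_samples view from the earlier labeling sketch.

```python
# Minimal sketch of the feature-snapshot join: the features actually used at
# serving time are written to the message queue keyed by request_id and attached
# to the labeled sample, so training sees the 9:55-10:00 features rather than the
# 10:10-10:15 ones. Stream names and the join window are assumptions.
sample_assembly_sql = """
    SELECT
        s.request_id,
        s.user_id,
        s.ad_id,
        s.label,
        f.serving_features            -- snapshot taken when the recommendation was made
    FROM labeled_samples AS s
    JOIN feature_snapshots AS f
      ON s.request_id = f.request_id
     AND f.event_time BETWEEN s.impression_time - INTERVAL '1' MINUTE
                          AND s.impression_time + INTERVAL '1' MINUTE
"""
# The assembled samples can then be written to Hologres/MaxCompute for offline
# training and to the message queue for online training.
```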

2.4 Real-time Deep Learning and Flink AI Flow

In this part, there is offline training at the “day” level and real-time online training at the “minute” level; in the most extreme cases, training can run at the “second” level. Whichever side the model comes from, it is eventually sent to model validation and then put online.

This is actually a rather complex workflow. First, static feature computation is periodic and may also be triggered manually when a backfill is needed. From the diagram, the batch training job, once finished, requires a real-time model validation job online; that model validation may be a streaming job, so this is a batch job triggering a streaming job. Conversely, the streaming training job is a long-running job that produces a model every 5 minutes, and each of those models needs to be sent to model validation every 5 minutes, so this is a streaming job triggering another streaming process.

Another example is real-time sample stitching. We all know Flink has the concept of a watermark: at a certain point, when all earlier data has been collected, a batch training job can be triggered. This kind of “trigger batch training when a point in time is reached” cannot be expressed in traditional workflow scheduling, because traditional workflow scheduling is based on job status changes.

Suppose a job runs without errors; then the data it produces is ready, and the downstream jobs that depend on that data can run. Put simply, one job finishes and the next job runs. But as long as there is a streaming job anywhere in the workflow, the workflow can never proceed this way, because a streaming job never finishes.

Take the real-time computation in this example: the job runs continuously and the data keeps changing, yet the data can still become “ready” at some point, even though the job itself never finishes. Therefore, we need to introduce a workflow framework, which we call Flink AI Flow, to solve the problem of coordinating the jobs in the previous diagram.

In essence, in Flink AI Flow each node is a logical processing unit, a logical processing node, and the relationship between nodes is no longer “previous job finishes, next job starts”; it is an event-plus-condition relationship, triggered by events.

At the workflow execution level, the scheduler no longer schedules actions based on job state changes but on events. For example, when the watermark of a streaming job reaches a certain point, meaning all data before that time is complete, a batch job can be triggered to run, without the streaming job having to finish.

For each job, the scheduler is responsible for starting or stopping it. When the received events satisfy a condition, a scheduling action takes place. For example, when a model is generated, a condition is met that requires the scheduler to launch a validation job for that model; this is the process of an event producing a condition that asks the scheduler to do something.
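The following is a purely conceptual sketch of event-based scheduling, not the actual Flink AI Flow API: events are published to a notification channel, and when the events registered for a rule are all present, the scheduler triggers the corresponding action, such as starting a validation job for each newly generated model.

```python
# Purely conceptual sketch of event-based scheduling (not the actual Flink AI Flow
# API): events are published to a notification channel, and when the events
# registered for a rule are all present, the scheduler triggers the corresponding
# action, e.g. starting a validation job for each newly generated model.
from collections import defaultdict

class EventScheduler:
    def __init__(self):
        self.rules = []                  # list of (required event names, action)
        self.seen = defaultdict(int)     # event name -> how many times observed

    def register(self, required_events, action):
        self.rules.append((set(required_events), action))

    def publish(self, event_name, payload=None):
        self.seen[event_name] += 1
        for required, action in self.rules:
            if event_name in required and required.issubset(self.seen):
                action(payload)          # condition met -> scheduling action

scheduler = EventScheduler()
# "Every time online training produces a model, start a validation job for it."
scheduler.register({"model_generated"}, lambda model: print("start validation for", model))
scheduler.publish("model_generated", "model_v42")
```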

In addition to the scheduling service, Flink AI Flow provides three supporting services to cover the full AI workflow semantics: the metadata service, the notification service, and the model center.

  • The metadata service helps manage data sets and the state of the entire workflow;
  • The notification service supports the event-based scheduling semantics;
  • The model center manages the lifecycle of models.

3. Real-time deep learning training PAI-ODL

After Flink generates the real-time samples, there are two streams in the ODL system.

  • The first stream is the real-time stream. The generated real-time samples are sent to a streaming data source such as Kafka. The samples in Kafka feed two flows: one goes to online training, and the other goes to online evaluation.
  • The second stream is the offline training data stream; the offline data is landed in offline storage and used for offline T+1 training.

In online training, users can configure how frequently models are generated. For example, a model can be generated and the online model updated every 30 seconds or every minute. This suits real-time recommendation scenarios, especially those with high timeliness requirements.

ODL lets the user set metrics that automatically determine whether a generated model is deployed online. When these metrics are satisfied on the evaluation side, the model is automatically pushed online. Because models are generated so frequently, manual intervention is impractical; instead the user defines the metrics, and the system automatically pushes the model online once they are met.
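A minimal sketch of such an automatic gate, assuming hypothetical metric names and thresholds; in PAI-ODL the equivalent check is driven by the indicators the user configures on the online evaluation side.

```python
# Sketch of the automatic go/no-go gate for a freshly generated model, assuming
# hypothetical metric names and thresholds; in PAI-ODL the equivalent check is
# driven by the indicators the user configures on the online evaluation side.
THRESHOLDS = {"auc_min": 0.72, "loss_max": 0.35}

def should_deploy(metrics: dict) -> bool:
    return (metrics["auc"] >= THRESHOLDS["auc_min"]
            and metrics["loss"] <= THRESHOLDS["loss_max"])

def on_model_generated(model_path: str, metrics: dict, push_fn):
    if should_deploy(metrics):
        push_fn(model_path)        # push the new (incremental) model online
    # otherwise keep serving the current model and wait for the next candidate

# Example: a model produced at the 1-minute cadence
on_model_generated("oss://models/ctr/2021-05-29-10-31",
                   {"auc": 0.74, "loss": 0.33},
                   push_fn=lambda p: print("deploying", p))
```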

There is a line on the offline-stream side called model calibration, that is, correcting the model. Offline training generates the T+1 model, which corrects the model used for online training.

Analysis of PAI-ODL technical points

1. Training of super-large sparse model

Training super-large sparse models is a common requirement in sparse scenarios such as recommendation, search, and advertising. Traditional deep learning engines such as TensorFlow natively implement variables with a fixed size (fixed shape), which leads to some common problems in sparse scenarios.

Take the static shape as an example. In common scenarios such as a mobile app, new users join every day and new products, news, and videos are added every day. If the variable shape is a fixed size, it cannot express this kind of change in a sparse scenario. The static shape also limits the incremental training of the model over time: if a model is trained incrementally for one or two years, the size set at the beginning is likely to be far from meeting the business needs, which may cause serious feature conflicts and hurt the model’s effect.

If, on the other hand, the static shape is set very large but the utilization is low, memory is wasted and invalid IO is introduced, including wasted disk space when exporting the full model.

Based on the PAI-TF engine, PAI-ODL provides the embedding variable feature, which offers dynamic, elastic features: each new feature gets a new slot. It also supports feature elimination; for example, when a product is taken off the shelves, the corresponding feature is deleted.
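To make the contrast with a fixed-size variable concrete, here is a conceptual sketch of a dynamic, hash-keyed embedding in plain Python/NumPy. It is not the PAI-TF embedding variable implementation; it only illustrates lazy slot creation for new feature IDs and eviction for retired ones.

```python
# Conceptual sketch of a dynamic, hash-keyed embedding in plain Python/NumPy.
# It is NOT the PAI-TF embedding variable implementation; it only illustrates
# lazy slot creation for new feature IDs and eviction for retired ones, so that
# no fixed vocabulary size has to be chosen up front.
import numpy as np

class DynamicEmbedding:
    def __init__(self, dim: int, seed: int = 0):
        self.dim = dim
        self.table = {}                               # int64 feature id -> vector
        self.rng = np.random.default_rng(seed)

    def lookup(self, feature_id: int) -> np.ndarray:
        if feature_id not in self.table:              # admission: new id gets a fresh slot
            self.table[feature_id] = self.rng.normal(
                0.0, 0.05, self.dim).astype(np.float32)
        return self.table[feature_id]

    def evict(self, feature_id: int) -> None:         # e.g. the product left the shelves
        self.table.pop(feature_id, None)

emb = DynamicEmbedding(dim=8)
vec = emb.lookup(9_000_000_123)   # a brand-new item id works without any resizing
```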

An incremental model means that the sparse features that changed within, say, one minute are recorded and exported as the incremental model. The incremental model records the changed sparse features plus all of the dense parameters.

Exporting incremental models allows the model to be updated quickly in ODL scenarios: the incremental model is very small, which makes frequent model deployment feasible.
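A conceptual sketch of the incremental export, assuming in-memory dictionaries for the sparse table and dense parameters; the real PAI-ODL format is different, but the idea of “changed sparse slots plus the full dense part” is the same.

```python
# Conceptual sketch of an incremental export: only the sparse slots touched since
# the last export are written out, together with the (comparatively small) full set
# of dense parameters. In-memory dicts stand in for the real PAI-ODL model format.
def export_incremental(sparse_table: dict, dense_params: dict, updated_ids: set) -> dict:
    return {
        "sparse_delta": {fid: sparse_table[fid]
                         for fid in updated_ids if fid in sparse_table},
        "dense": dense_params,      # the dense part is exported in full every time
    }

def apply_incremental(serving_sparse: dict, serving_dense: dict, delta: dict) -> None:
    serving_sparse.update(delta["sparse_delta"])   # hot-update only the changed slots
    serving_dense.update(delta["dense"])
```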

2. Support for second-level hot update of the model

Usually, among the users we work with, there are three main concerns:

  • The first is the model’s effect: does it perform well after going online?
  • The second is cost: how much am I spending?
  • The third is performance: can it meet my RT (response time) requirements?

Multi-level hybrid storage lets users configure different storage modes, which reduces cost as much as possible while still meeting the user’s performance requirements.

Embedding scenarios have very distinctive characteristics of their own; for example, there is an obvious hot/cold split among the features. Some products or videos are extremely popular in themselves; in some cases a feature becomes hot because users click it a lot; unpopular goods or videos get few clicks. This hot/cold separation follows the 80/20 rule.

EmbeddingStore stores these hot features in DRAM and cold features on PMEM or SSD.
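A toy sketch of the hot/cold split, with an LRU-style DRAM tier in front of a stand-in cold store; this is only meant to illustrate the idea, not the EmbeddingStore implementation or its actual placement policy.

```python
# Toy sketch of the hot/cold split: an LRU-style DRAM tier in front of a stand-in
# cold store (representing PMEM/SSD). Only meant to illustrate the idea, not the
# EmbeddingStore implementation or its actual placement policy.
from collections import OrderedDict

class TieredEmbeddingStore:
    def __init__(self, dram_capacity: int, cold_store: dict):
        self.dram = OrderedDict()      # hot features kept in memory
        self.capacity = dram_capacity
        self.cold = cold_store         # stand-in for the PMEM/SSD tier

    def get(self, feature_id):
        if feature_id in self.dram:
            self.dram.move_to_end(feature_id)        # keep recently used items hot
            return self.dram[feature_id]
        value = self.cold.get(feature_id)            # slower read from the cold tier
        if value is not None:
            self.dram[feature_id] = value            # promote on access
            if len(self.dram) > self.capacity:
                self.dram.popitem(last=False)        # demote the least recently used
        return value
```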

3. Super-large sparse model prediction

In addition, EmbeddingStore supports a distributed storage service. At serving time, each serving node would otherwise need to load the full model. With the distributed EmbeddingStore service, loading the full model on every serving node can be avoided.

EmbeddingStore lets users deploy an independent distributed EmbeddingStore service; each serving node queries sparse features from the EmbeddingStore service.

The design of EmbeddingStore fully considers the data format and access patterns of sparse features. A simple example: the key of a sparse feature is an int64 and the value is a float array. Both serving and training access them in large batches. In addition, access to sparse features at the inference stage is read-only and requires no locking. These are the reasons we designed a sparse feature store specifically for embedding scenarios.

4. Real-time training model correction

Why does PAI-ODL use the offline-trained model to correct the online-trained model?

In real-time training, there are typically problems such as inaccurate labels and skewed sample distributions. Therefore, the day-level model is used to automatically correct online training, which increases the stability of the model. The model correction provided by PAI-ODL is transparent to users: after configuring it according to their own business characteristics, the online-training side is automatically calibrated every day against the newly generated full model as the base model. When offline training produces the base model, online training automatically discovers it and automatically jumps to the corresponding position in the sample data stream, then resumes training from the latest base model and the matching sample point.
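A sketch of the “jump back to the corresponding samples” step, assuming Kafka (via kafka-python) as a stand-in for the sample stream: when the T+1 base model with a known cutoff time is loaded, the consumer seeks each partition to the first offset at or after that time so incremental training resumes from the matching samples.

```python
# Sketch of the "jump back to the corresponding samples" step, with kafka-python as
# a stand-in for the sample stream: given the cutoff timestamp of the newly loaded
# T+1 base model, seek every partition to the first offset at or after that time so
# that incremental training resumes exactly where the base model's data ends.
from kafka import KafkaConsumer, TopicPartition

def resume_from_base_model(consumer: KafkaConsumer, topic: str, base_cutoff_ms: int):
    partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    consumer.assign(partitions)
    offsets = consumer.offsets_for_times({tp: base_cutoff_ms for tp in partitions})
    for tp, ot in offsets.items():
        if ot is not None:             # None means no message at/after the cutoff yet
            consumer.seek(tp, ot.offset)
    # ...online training then reloads the base model parameters and continues
    # consuming samples from these offsets.
```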

5. Model rollback and sample replay

Although anomalous samples are detected and handled, it is still unavoidable that an online model update occasionally degrades the online effect.

When the user receives an alert that an online indicator has dropped, we need to provide the user with the ability to roll the model back.

However, in an online-training scenario, several model iterations may have happened between when the problem occurred and when someone intervenes, and several models may have been produced. The rollback at this point therefore includes:

1) The online serving model rolls back to the last model before the problem occurred;

2) At the same time, online training needs to jump back to the model before the problematic one;

3) The samples also need to be rewound to that point in time so that training can start again from there.
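The three steps can be sketched as a single orchestration function; the helper callables here (deploy_model, reload_training_from, seek_samples_to) are hypothetical placeholders for whatever the serving, training, and sample-stream components actually expose.

```python
# Sketch of the three rollback steps as one orchestration function. The helpers
# (deploy_model, reload_training_from, seek_samples_to) are hypothetical
# placeholders for whatever the serving, training, and sample-stream components expose.
def rollback(problem_time_ms: int, last_good_model: str,
             deploy_model, reload_training_from, seek_samples_to) -> None:
    deploy_model(last_good_model)            # 1) serving rolls back to the last good model
    reload_training_from(last_good_model)    # 2) online training restarts from that model
    seek_samples_to(problem_time_ms)         # 3) the sample stream is replayed from that point
```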
