In this paper, we propose a set of algorithms and architectural changes that solve the elastic feature scaling and stability problems of online learning through an elastic transformation of the underlying TensorFlow, and we optimize model sparsity with self-developed algorithms such as Group Lasso and online feature frequency filtering. In Alipay's core recommendation business, UV-CTR has improved significantly and the efficiency of the serving pipeline has increased markedly.

0. Overview

Online learning has become an important tool for improving recommendation systems, because it captures users' dynamic behavior and lets the model adapt quickly. However, it places high demands on the stability of the pipeline and the model, and on the performance of the training system. When we designed online recommendation algorithms on native TensorFlow, we found three core problems:

Some information-feed recommendation scenarios rely on a large number of long-tail words as features, so a featureMap has to continuously truncate low-frequency features by count and re-encode the rest; this process is time-consuming and cumbersome.

With streaming data, the feature size is not known in advance but keeps growing during training. Feature space therefore has to be reserved up front and the training job restarted every few days, otherwise the indices go out of bounds.

The model is poorly sparse and its size reaches tens of GB, which makes uploading and online loading slow and unstable.

More importantly, online learning is in full swing: once streaming features and data are connected, a new generation of training platforms that can add and delete features on demand and scale parameters elastically becomes inevitable. To solve these problems, engineers from Ant Financial's Artificial Intelligence Department have, since the end of 2017, carried out an elastic transformation of TensorFlow that takes Ant's business scenarios and pipelines into full account, resolving the three pain points above and simplifying and accelerating both offline and online learning tasks. Its core capabilities are as follows:

An elastic feature scaling system that supports training with tens of billions of parameters.

A Group Lasso optimizer and frequency filtering that improve model sparsity and significantly improve online performance.

90% model size compression, complete feature management, and model stability monitoring.

With the joint efforts of the business teams, several recommendation scenarios on the Alipay home page have been launched at full traffic. The personalized online learning bucket of one recommendation slot improved by 4.23% over the best online multi-model fusion bucket in the last week, and by 34.67% over the randomized control. In the last week, a personalized information-feed recommendation service achieved UV-CTR +0.77% and PV-CTR +4.78% over the DNN baseline, with 90% model size compression and a 50% improvement in pipeline efficiency.

1. Elastic transformation and advantages

Background: in native TensorFlow, we declare parameters through Variable. If the variables exceed the capacity of a single machine, partitioned_variables can be used to spread the parameters across machines. However, the shape must be specified at declaration time, it is immutable afterwards, and lookups are performed by array index.
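As a minimal TF1-style sketch (with hypothetical sizes and names), the fixed-shape, partitioned declaration and the sparse-lookup pattern described next look roughly like this:

import tensorflow as tf

# Hypothetical sizes; the shape must be fixed here and cannot grow later.
NUM_FEATURES, EMB_DIM = 10_000_000, 8

emb_table = tf.compat.v1.get_variable(
    "emb_table",
    shape=[NUM_FEATURES, EMB_DIM],                       # immutable after declaration
    partitioner=tf.compat.v1.fixed_size_partitioner(4),  # spread over 4 PS shards
    initializer=tf.compat.v1.truncated_normal_initializer(stddev=0.01))

# Sparse lookup: feature IDs must already be integers in [0, NUM_FEATURES),
# which is exactly what the offline featureMap has to guarantee.
sp_ids = tf.sparse.SparseTensor(
    indices=[[0, 0], [0, 1], [1, 0]],
    values=tf.constant([12345, 678, 90], dtype=tf.int64),
    dense_shape=[2, 2])
pooled = tf.nn.embedding_lookup_sparse(emb_table, sp_ids, None, combiner="sum")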

Because sparse features are widely used in recommendation systems, in practice methods such as embedding_lookup_sparse are used to look up vectors in one large dense Variable and sum them, instead of doing a matrix multiplication. Open-source TensorFlow requires the size of a variable to be declared before it can be used, which causes two problems:

1) The mapping table from features to integer IDs within the dimension range has to be computed in advance, which is usually done on ODPS. Because every feature that appears must be scanned and numbered, this computation is very slow;

2) In the online learning scenario, part of the dimension space has to be reserved to accommodate newly appearing features, and the mapping table has to be modified online continuously. If the reserved space is exceeded, the online task has to be restarted.

To break through the limitation of fixed dimensions and support dynamic addition and deletion of features, the simplest optimization idea is to implement a Variable that behaves like a dictionary at the bottom layer of TensorFlow and to re-implement TensorFlow's upper-level APIs on top of it. We therefore added a hashmap-based HashVariable to the parameter server, whose memory structure is as follows:


When declaring this variable, only one line needs to change; the rest of the training code stays the same:


Each feature is mapped by a hash function into a space of size 2^64. When the feature needs to be evaluated, the PS lazily creates it on demand and returns it, while its upper-level behavior stays consistent with native TF. Internally, we figuratively call this "de-ID", because featureMap's ID-conversion step has been removed. On this basis we implemented a series of algorithms such as Group Lasso FTRL, frequency filtering and model compression.

Note: elastic features bring a significant advantage: as long as the L1 sparsity constraint is strong enough, training with features of any scale can be debugged on a single machine, which is very convenient. Our hashmap implementation is KV-based, where the key is the feature and the value is the start address of its vector.
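A conceptual illustration of this KV structure (plain Python, not the internal PS implementation): features are keyed by a 64-bit hash and their vectors are created lazily on first access, so no space has to be reserved in advance.

import numpy as np

class HashEmbedding:
    def __init__(self, dim):
        self.dim = dim
        self.table = {}                            # 64-bit feature hash -> vector

    def lookup(self, feature):
        key = hash(feature) & ((1 << 64) - 1)      # map the feature into a 2^64 key space
        if key not in self.table:                  # lazy, on-demand creation by the PS
            self.table[key] = np.zeros(self.dim, dtype=np.float32)
        return self.table[key]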

Offline training optimization

After this transformation, offline batch learning changes as follows:


Online training optimization

Online learning can bring the following changes:


Besides the obvious performance improvements, the biggest gain is that there is no need to reserve space in advance, so training runs seamlessly and stably.

2. Dynamic feature addition and deletion

The main purpose of the elastic architecture is feature optimization: letting the model adaptively select the best features, thereby achieving sparsity and reducing overfitting. This section introduces the two core techniques of feature optimization:

Streaming frequency filtering determines which features are admitted.

The Group Lasso optimizer filters and deletes features.

2.1 Group Lasso optimizer

Sparsity is an important model property pursued by algorithm engineers, from simple L1 regularization and Truncated Gradient [9], to Regularized Dual Averaging (RDA) [10], and now the commonly used FTRL [2]. However, these are all sparse optimization algorithms for generalized linear models and do not treat the feature embeddings of a sparse DNN specially. Treating the parameter vector of a sparse embedding as ordinary parameters cannot achieve the feature-selection effect attainable in linear models, and therefore cannot effectively compress the model.

For example, when a sample containing a new feature arrives, the set of parameters corresponding to that feature is activated (for example, with an embedding size of 7, that is 7 parameters). If FTRL determines that only some of the feature's parameters are invalid, the feature as a whole cannot be safely deleted. As shown in the figure:


Therefore, on top of the L1 and L2 regularizers, the L21 regularizer (Group Lasso) and the L12 regularizer (exclusive sparsity) are introduced, as follows:
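For reference, the standard forms of these two regularizers from the literature (written here for an embedding matrix $W$ whose row $w_g$ is the embedding vector of feature $g$; the exact formula in the original figure may differ slightly) are:

\[
\Omega_{2,1}(W) = \sum_{g} \lVert w_g \rVert_2 ,
\qquad
\Omega_{1,2}(W) = \frac{1}{2} \sum_{g} \lVert w_g \rVert_1^{2} .
\]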


L21 was introduced in 2011. Its original purpose was to ensure that a group of highly correlated features (such as male and female) is retained or deleted together. We extend this representation to embeddings in order to solve a similar problem.

In L21, because the inner L2 norm imposes a joint constraint on all parameters of a feature, the whole group of parameters is either cleared or retained, which decides whether the embedding vector of a feature in the embedding layer is deleted entirely, and thereby improves the model's generalization. Hence the name Group Lasso.

L12, on the contrary, forces each group to have a similar number of non-zero parameters while keeping their values as distinct as possible, and makes output neurons compete with each other for input neurons, so that the features become more discriminative with respect to the target.

For a DNN classification network, the lower layers need sufficient generalization and feature-abstraction ability, while the upper layers, being close to the softmax layer, need stronger discrimination. Therefore we usually apply Group Lasso to the lowest embedding layer, i.e. the following optimization objective:
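A representative way to write this objective (only the group-lasso term on the embedding layer is shown; the coefficients are schematic and the original figure may include further terms):

\[
\min_{W} \; \mathcal{L}(W) \;+\; \lambda_{21} \sum_{g \in \text{embedding}} \lVert w_g \rVert_2 ,
\]

where $\mathcal{L}(W)$ is the training loss and the sum runs over the embedding vectors of the lowest layer.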


If the L21 regularization penalty is simply added to the loss, the model eventually converges but sparsity cannot be guaranteed. The Group Lasso optimizer therefore follows FTRL and splits each gradient iteration into two half-steps: the first half-step is gradient descent, and the second half-step is a fine adjustment that enforces sparsity. The sparsity of the model can be controlled effectively by adjusting the L1 regularization term (λ in the formula).
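As a minimal sketch of the two half-steps (plain proximal gradient for group lasso, not the production Group Lasso FTRL optimizer):

import numpy as np

def group_lasso_step(w, grad, lr, group_l1):
    # First half-step: ordinary gradient descent on this feature's embedding vector.
    z = w - lr * grad
    # Second half-step: group soft-thresholding, which either zeroes the whole
    # vector (the feature can then be deleted as a unit) or shrinks it.
    norm = np.linalg.norm(z)
    if norm <= lr * group_l1:
        return np.zeros_like(z)
    return (1.0 - lr * group_l1 / norm) * z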

Group Lasso is the key to the performance improvement and compression of the model after the elastic transformation. It is worth pointing out:

In our optimizer implementation, the Variable and the accum and linear slots are also stored as KV.

Combining L12 and L21 regularization has also been discussed in the literature [8], but we have not yet evaluated its effect in our business.

Due to space constraints, this section does not cover the principle and derivation of Group Lasso in detail.

2.2 Streaming frequency filtering

Having discussed how features are dynamically deleted, we now analyze the admission strategy for features.

2.2.1 Necessity of frequency filtering

As mentioned in Google's paper discussing FTRL [1], most features in high-dimensional data are very sparse, appearing only a few times among hundreds of millions of samples. An interesting question is whether the FTRL or Group FTRL optimizer can remove (lasso) such extremely low-frequency features.

In the RDA optimization formula, a feature coordinate is set to 0 when it satisfies the following condition:

\[ |\bar{g}_{t,i}| \le \lambda, \qquad \bar{g}_{t,i} = \frac{1}{t} \sum_{s=1}^{t} g_{s,i}, \]

i.e. when the magnitude of the averaged historical gradient does not exceed the L1 coefficient.
If the feature appears only a few times before step t and its gradient is 0 in the steps where it does not appear, then as the number of steps grows this condition becomes easier and easier to satisfy, so RDA handles extremely sparse features quite naturally. For FTRL, however, a coordinate is set to 0 when

\[ |z_{t,i}| \le \lambda_1, \qquad z_{t,i} = \sum_{s=1}^{t} g_{s,i} - \sum_{s=1}^{t} \sigma_{s,i}\, w_{s,i}, \]

where \(\sigma_{s,i}\) is determined by the per-coordinate learning-rate schedule. Thus \(z_{t,i}\) depends not only on the historical gradients but also on the historical learning rates and weights w. FTRL can therefore also handle extremely sparse features, but not as aggressively as RDA (its exact lower bound needs a detailed analysis; Group FTRL behaves similarly).

Because extremely low-frequency features are not explicitly considered in the design and derivation of FTRL, a large λ can indeed remove many of them, but the strong constraint also lassoes away some effective features, which offline experiments showed to hurt performance severely. Second, the engineering cost of keeping historical state for this huge number of very-low-frequency features is high (it multiplies the parameter space and storage requirements), as shown below:


We therefore asked: can we simulate offline frequency filtering on a real-time data stream, providing an admission threshold for features, and remove extremely low-frequency features as much as possible without hurting model performance, so as to achieve further sparsity?

2.2.2 Several implementations of frequency filtering

Note: because the default embedding_lookup_sparse applies a unique operation to features (deduplication to simplify computation), the true feature and label frequencies cannot be obtained on the PS side. The statistics are therefore collected on the Python side and uploaded through a placeholder into a Variable designated by the server side; the optimizer reads this Variable through a slot to make the joint decision.
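A minimal sketch of this statistics upload, assuming hypothetical names and a fixed-size bucketed counter purely for illustration (the production system keys the counters by feature instead):

import tensorflow as tf
import numpy as np

NUM_BUCKETS = 1 << 20
freq_var = tf.compat.v1.get_variable(
    "feature_freq", shape=[NUM_BUCKETS], dtype=tf.int64,
    initializer=tf.compat.v1.zeros_initializer(), trainable=False)
freq_ph = tf.compat.v1.placeholder(tf.int64, shape=[NUM_BUCKETS])
update_freq = tf.compat.v1.assign_add(freq_var, freq_ph)

# Per mini-batch, before running the train op:
#   counts = np.bincount(batch_feature_buckets, minlength=NUM_BUCKETS)
#   sess.run(update_freq, feed_dict={freq_ph: counts})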

The most naive approach is to imitate offline frequency filtering: count each feature and only let it into training once a threshold is reached. But this breaks data integrity: if the total frequency is 6 and the threshold is 5, the first 5 occurrences of the feature are simply ignored. We therefore propose two optimization schemes:

Feature frequency estimation based on Poisson distribution

Features after an offline shuffle are roughly uniformly distributed, but for an online data stream the arrival of a feature in the training system can be regarded as a Poisson process, following the Poisson distribution:

\[ P(N(t) = n) = \frac{(\lambda t)^{n} e^{-\lambda t}}{n!} \]

where n is the number of occurrences so far, t is the current step, \(\lambda\) is the occurrence rate per unit time (the main parameter of the Poisson distribution), T is the total number of training steps, and \(\bar{N}\) is the minimum threshold of a feature (i.e. the least number of occurrences within time T).

Taking step t as the unit of time, the count n of occurrences over the first t steps gives the estimate \(\hat{\lambda} = n / t\). From the Poisson distribution we can then compute the probability that the feature occurs at least \(\bar{N} - n\) more times within the remaining time T - t. Each time the feature appears, this probability is used for Bernoulli sampling, which determines whether the feature is admitted into the system at step t.
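A small sketch of this admission rule (a hypothetical helper, not the production code):

import math, random

def admit_feature(n, t, total_steps, min_count):
    # n: occurrences of the feature in the first t steps;
    # min_count: required occurrences within total_steps (the threshold N-bar).
    if n >= min_count:
        return True
    lam = n / max(t, 1)                      # estimated rate, taking step t as unit time
    mu = lam * max(total_steps - t, 0)       # expected future occurrences
    need = min_count - n
    # P(X >= need) for X ~ Poisson(mu): 1 - CDF(need - 1)
    p_tail = 1.0 - sum(math.exp(-mu) * mu**k / math.factorial(k) for k in range(need))
    return random.random() < p_tail          # Bernoulli sampling with the tail probability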

This approach approximates offline frequency filtering using real online data, and its λ is computed dynamically each time the feature arrives. Its drawbacks are:

When t is small, the variance of the number of occurrences within T is large, so a feature may be admitted or discarded by mistake with some probability.

The future total number of training steps T is unknown in online learning.

Frequency filtering is decoupled from the optimizer, so the optimizer's statistics cannot be used.

Dynamically adjusted L1 regularization scheme

In a classical FTRL implementation, the L1 coefficient is the same for every feature. This leads to the problem described in 2.2.1: an overly large L1 filters out extremely low-frequency features but also hurts model performance. Inspired by how various optimizers (such as Adam) adapt the learning_rate, we propose letting features of different frequencies receive different lasso strength by adjusting their L1 regularization coefficients.

Under MLE, the frequency of a feature is correlated with the confidence of its parameter estimate: the lower the frequency, the lower the confidence. If a prior distribution (the regularization term) is added on top of pure frequency statistics, then the lower the confidence of the frequency statistics, the more the estimate should lean toward the prior, i.e. the larger the corresponding regularization coefficient should be. After several experiments we settled on an empirical formula in which the per-feature L1 coefficient is determined by a penalty multiple c and the minimum feature threshold (both hyperparameters) together with the current frequency of the feature.
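To make the intuition concrete, one possible shape of such a frequency-dependent coefficient is sketched below; this is an illustrative assumption only and not the empirical formula used in production (which is not reproduced here):

def l1_coef(base_l1, freq, min_count, c):
    # base_l1: global L1 coefficient; c: penalty multiple; min_count: minimum
    # feature threshold; freq: current frequency of the feature.
    # Rarer features get a larger penalty; at or above the threshold the base applies.
    if freq >= min_count:
        return base_l1
    return base_l1 * (1.0 + c * (min_count - freq) / min_count)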

We adopted the dynamically adjusted L1 regularization scheme for the online environment. With UV-CTR unchanged or even slightly higher, the number of model features was reduced by 75% compared with no frequency filtering, which experimentally confirms the positive effect of frequency filtering on sparsity. Its drawback is also obvious: the mapping between feature frequency and the regularization coefficient lacks a rigorous justification.

As a part of feature management, frequency filtering has rarely been studied in the literature, and we need to keep exploring it.

3. Model compression and stability

3.1 Model compression

In engineering, for performance reasons, a feature that has been lassoed away by the optimizer is only set to 0 and not deleted immediately; it is actually deleted after enough steps have passed. A memory pool is also introduced to avoid the unnecessary performance cost of repeatedly creating and deleting features. As a result, a large number of zero vectors remain in the model after training, so further compression has to be done at export time.

Because non-native operators such as HashPull and HashPush are introduced, they have to be cut out and converted into native TF ops. We call these steps GraphCut; it lets the online inference engine work with the elastic transformation without any changes. Thanks to the sharp reduction in effective features, scoring is more than 50% faster than on the original engine.

We treat graph cutting as a static optimization problem on the TF graph, carried out in three steps:

In the first traversal of the graph, optimizable substructures and incompatible ops are identified.

In the second traversal, each node's upstream and downstream connections and metadata are recorded, the key ops are cut out, and the non-zero values of the Variables are moved into TensorFlow's native MutableDenseHashTable. This step compresses the model size by 90%.

Finally, the graph is traversed backwards from the inference outputs and subgraphs unrelated to inference are removed; a minimal sketch of this backward pruning follows.
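The sketch below assumes a TF1-style GraphDef and only shows the reachability pruning (the real GraphCut tool also rewrites HashPull/HashPush and moves non-zero rows into a MutableDenseHashTable, which is not shown here):

import tensorflow as tf

def prune_for_inference(graph_def, output_node_names):
    name_to_node = {n.name: n for n in graph_def.node}
    keep, stack = set(), list(output_node_names)
    while stack:                                          # walk backwards from the outputs
        name = stack.pop()
        if name in keep:
            continue
        keep.add(name)
        for inp in name_to_node[name].input:
            stack.append(inp.split(':')[0].lstrip('^'))   # strip port / control-dep markers
    pruned = tf.compat.v1.GraphDef()
    pruned.node.extend(n for n in graph_def.node if n.name in keep)
    return pruned

TensorFlow's own tf.compat.v1.graph_util.extract_sub_graph performs a similar backward extraction.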

We implemented a complete and concise graph-cutting tool that is invoked when the model is hot-exported. It compresses the model from around 8 GB to a few hundred megabytes while keeping the model's scores unchanged.

3.2 Model stability and monitoring

Stability is critical for online learning. We closely monitor both the actual online performance and the performance of the models produced in real time; if the deviation on samples becomes too large, an alarm is triggered.

Because we need to capture changes in time-varying data, model quality cannot be evaluated on a fixed offline data set. We use the latest data flowing in from Alibaba's streaming log system SLS as evaluation samples and apply a sliding window that scores first and then trains. This keeps training continuous without wasting data, while providing the freshest possible measurements of model quality.
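Schematically, the score-then-train sliding window amounts to progressive validation (the model/stream interfaces below are hypothetical and only illustrate the order of operations):

def stream_evaluate_and_train(model, stream, metrics):
    for batch in stream:
        preds = model.predict(batch.features)           # score with the current model first
        metrics.update(batch.labels, preds)             # evaluation never sees a batch it trained on
        model.train_step(batch.features, batch.labels)  # then train on the same batch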

We monitored the following core indicators:

Sample-level monitoring: positive/negative ratio, online scores and online AUC (AUC computed from the online model's scores), output rate and consumption rate.

Training-level monitoring: AUC, User-AUC (see the remarks below), loss, mean model score (aligned with the positive/negative sample ratio), and abnormal information.

Feature-level management: total feature count, counts of effective / zeroed / deleted features, and the rates of new / inserted / deleted features.

Whole-model and scheduling: model export time and size, whether the score distribution is normal, and whether scheduling is normal.

Business metrics: UV-CTR, PV-CTR (hourly updates and T+1 reports).

The correspondence between online metrics and training metrics is shown in the following table:


Monitoring data is reported periodically through an HTTP interface, and DingTalk and email alarms are raised promptly when anomalies occur. The figure below shows the monitoring from September 20 to 27; the second chart shows that the model adapts well to the score distribution of the current data stream.


User-AUC: traditional AUC cannot fully describe UV-CTR, because the model may only learn the ordering between different users rather than the ordering of a single user's clicks across different items. For this reason we use User-AUC, which imitates the computation of online UV-CTR as closely as possible. In real experiments, the hourly UV-CTR reports from the monitoring system are highly consistent with the User-AUC produced by the real-time model.

4. Project implementation and effect

At present, the algorithm has been deployed online for several recommendation slots on the Alipay home page. Based on the user's historical clicks, the recommendation system combines the user's profile and interests with real-time features to predict CTR, thereby improving the overall click-through rate of the system.

For example, one recommendation-slot service adopts the classic Wide&Deep network structure, whose sparse part contains hundreds of feature groups (see Note 1 below). About 10 billion samples arrive per day, and the label join window has a fixed, relatively long duration. Because negative samples dominate, the upstream pipeline downsamples positive and negative samples at a ratio of 1:8 (see Note 2 below).

Training tasks are built on Ant's unified training platform and scheduled with a workflow; all other parameters of the offline and online tasks are kept identical. The batch size is 512. Every 200 steps (i.e. 200,000 samples), the model is exported to the online system through graph cutting. If a task fails, the system restarts it automatically and recovers from the checkpoint.

The online learning bucket of this recommendation business improved by 4.23% over the best online multi-model fusion bucket in the last week, and by 34.67% over the randomized control. Another news recommendation business showed UV-CTR +0.77% and PV-CTR +4.78% over the DNN baseline in the latest week. These are substantial improvements over previous experimental results.

Note 1: group embedding puts similar embedding features into the same group; each group does its own lookup and summation, and the group vectors are then concatenated so that feature crossing happens at a higher level. The design reflects the fact that features in different groups (such as user and item) differ greatly and should not be summed together directly.
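A minimal TF1-style sketch of this group-embedding idea (names are hypothetical; each group has its own table and lookup-and-sum, followed by concatenation):

import tensorflow as tf

def grouped_embedding(sp_ids_by_group, emb_tables_by_group):
    group_vectors = []
    for group, sp_ids in sp_ids_by_group.items():
        pooled = tf.nn.embedding_lookup_sparse(
            emb_tables_by_group[group], sp_ids, None, combiner="sum")  # per-group sum
        group_vectors.append(pooled)
    return tf.concat(group_vectors, axis=1)   # cross features at the group level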

Note 2: inference scores are only used for pointwise ranking. Downsampling changes the data distribution but not the partial order, so no compensation is applied during training.

5. Future work

Elastic features have become a core element of Ant's real-time reinforcement and deep learning. This is only the first step: once feature space can be created on demand, it opens up an imaginative underlying architecture on which many techniques can be explored in depth. In engineering, we can keep optimizing from the minute level to the second level, further improving the real-time performance of the pipeline and achieving incremental model updates. In algorithms, we are exploring techniques such as sample importance sampling, automatic feature learning, and online linear programming combined with DNN, to realize joint decision-making with the optimizer.

Because online learning is a complex system-engineering effort, we ran into many difficulties during development and tuning, involving sample backflow, the training platform, model scoring, online evaluation and, above all, stability, but we have largely overcome them. To ensure that the online results are stable and reliable, we published this article only after two to three months of observation and optimization.

This article was written by the basic algorithm team of the Cognitive Computing Group in Ant Financial's Artificial Intelligence Department. The team works on images, NLP, recommendation algorithms, knowledge graphs and other areas, and is led by Chu Wei, a well-known algorithm expert in China.
