Shared by: Dr. Song Kai

Organizer: Lin Yizhen

Takeaway:

This article shares experience and reflections from federated learning practice, from the advertiser's perspective.

First, the business background and technology selection are introduced: the team's mission is user growth and cost control, pursued through advertising channel delivery. Delivery targets fall into two categories: user acquisition (pulling in new users) and user re-activation.

  • For user acquisition, user features on the Weishi (micro-vision) side are sparse, while the advertising platform accumulates rich information but returns only limited, standardized oCPX data.
  • For re-activation, the Weishi side holds valuable profile data such as user behavior sequences, which complement the advertising platform's features, but the two sides cannot simply share raw data with each other.

Therefore, we hope the Weishi side and the advertising platform side can jointly exploit both parties' data for mutual benefit, while keeping each party's data secure within its own domain. In this context, our team chose federated learning, which provides a solution for secure multi-party cooperation.

The article focuses on the following five points:

  • Federated learning basics

  • PowerFL, Tencent's federated learning platform

  • The overall Weishi advertising business

  • The ad-delivery federated learning framework

  • Modeling practices and details

One, federated learning

First, some preliminary knowledge of Federated Learning (FL) is introduced.

1. Background of federated learning

Machine learning models are data-driven, but in reality data sits in silos: data cannot be shared between companies, or even between departments, and sharing it directly can violate users' privacy and hurt companies' interests. In 2016, Google proposed, in the context of its NLP-based input method, updating the model locally on Android devices; that paper is generally regarded as the beginning of federated learning. Soon afterwards, China's WeBank, Tencent, and other companies also did a lot of pioneering work.

The basic definition of federated learning: during machine learning, participants can jointly build a model with the help of other parties' data, without directly accessing those parties' data resources. That is, with the data never leaving its local domain, joint training can proceed safely to produce a shared machine learning model.

2. Two architectures for federated learning

  • Centralized federated architecture: early systems, including Google's and WeBank's, all used this architecture. A trusted third party (the central server) is responsible for the encryption policy, model distribution, gradient aggregation, and so on.
  • Decentralized federated architecture: sometimes two cooperating parties can find no trusted third party, so all parties must participate in peer-to-peer computation. This architecture requires more encryption/decryption and parameter-transmission operations: with N participants there are 2N(N−1) transmissions. One can argue that here the encryption and decryption algorithms effectively play the role of the third party.

3. Three categories of federated learning

  • Horizontal federated learning: sample federation, suited to scenarios with high feature overlap but low user overlap. For example, two companies with similar businesses, whose users are largely disjoint but whose user profiles are similar, can run horizontal federated learning, which resembles distributed machine learning over dispersed data.
  • Vertical federated learning: feature federation, suited to scenarios where users overlap heavily but features overlap little. For example, an advertiser and an advertising platform hope to train on the combined features of both parties.
  • Federated transfer learning: worth considering when both the feature and the sample overlap between participants are low, but it is the most difficult.

The three types of federated learning exchange different information and face different difficulties. For example, in horizontal federated learning the participants' data are heterogeneous, so the data are not independent and identically distributed (non-IID), which is itself a research hotspot in federated learning.
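The distinction between the horizontal and vertical settings is easiest to see in terms of data layout. Below is a minimal plain-Python sketch (party names and records are made up): horizontal federation stacks samples that share a schema, while vertical federation joins features on the shared user IDs.

```python
# Horizontal FL: the two parties share the same feature schema ("age", "clicks")
# but hold (mostly) disjoint users -- samples are federated.
party_a_h = {"u1": {"age": 25, "clicks": 3}, "u2": {"age": 31, "clicks": 7}}
party_b_h = {"u3": {"age": 19, "clicks": 1}, "u4": {"age": 42, "clicks": 5}}

# Vertical FL: the two parties share (mostly) the same users but hold
# disjoint feature sets -- features are federated.
party_a_v = {"u1": {"age": 25}, "u2": {"age": 31}}                  # e.g. platform side
party_b_v = {"u1": {"watch_time": 120}, "u2": {"watch_time": 40}}   # e.g. advertiser side

def horizontal_union(a, b):
    """Sample federation: stack rows; the schemas must match."""
    merged = dict(a)
    merged.update(b)
    return merged

def vertical_join(a, b):
    """Feature federation: join columns on the shared user IDs."""
    common = a.keys() & b.keys()
    return {uid: {**a[uid], **b[uid]} for uid in common}

print(len(horizontal_union(party_a_h, party_b_h)))  # 4 users, one schema
print(vertical_join(party_a_v, party_b_v)["u1"])    # {'age': 25, 'watch_time': 120}
```

In the real vertical setting, of course, the join itself must be done privately; that is exactly the secure sample alignment (PSI) step discussed later.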

At present, vertical federated learning has been deployed in our business, and we are also exploring federated transfer learning and combinations of horizontal and vertical learning.

4. Federated learning vs. distributed machine learning

Accuracy upper bound: federated learning is not about optimizing one specific ranking or recall model; rather, it drives the whole model under data-security constraints. Theoretically, the result of Distributed Machine Learning (DML) over pooled, shared data is taken as its upper bound.

Federated learning (FL) vs. distributed machine learning (DML)

Although some people regard federated learning as a special case of distributed machine learning, it still differs from general DML in the following ways:

  • Data may not be shared across parties;
  • The server nodes have only weak control over the worker nodes;
  • Communication frequency and cost are high.

Two, Angel PowerFL, Tencent's federated learning platform

Tencent has been deeply involved in federated learning from the beginning of its development. This includes: drafting and publishing the "Federated Learning White Paper 2.0" and the "Tencent Security Federated Learning Application Service White Paper"; on the infrastructure side, building PowerFL, now open-sourced internally, on top of Angel (github.com/Angel-ML/an…), Tencent's open-source machine learning platform; and in practice, many trials that have landed in finance, advertising, and recommendation scenarios.

1. Engineering features

In addition to the basic requirements of a machine learning platform, such as easy deployment and good compatibility, PowerFL has the following five engineering features:

  • Architecture: a decentralized federated architecture, independent of any third party;
  • Encryption algorithms: implementations and improvements of several common homomorphic, symmetric, and asymmetric encryption algorithms;
  • Distributed computing: a distributed machine learning framework based on Spark on Angel;
  • Cross-network communication: Pulsar is used to optimize cross-network communication, improve stability, and provide a multi-party cross-network transmission interface;
  • Trusted execution environment: exploration and support of TEEs (SGX, etc.).

2. Algorithm optimization

In addition, many optimizations have been made on the algorithm side:

  • Ciphertext-operation rewrite: the ciphertext operation library is rewritten on top of C++ GMP;
  • Protocol improvements: supports Paillier and Tencent's self-developed homomorphic symmetric encryption protocol RIAC, both 10-20x faster than the open-source FATE gmpy-Paillier;
  • Data-intersection optimization: separate optimizations for the two-party and multi-party cases, in particular a theoretical transformation for the multi-party case (an improved FNP protocol);
  • GPU support: ciphertext computation can be parallelized on GPUs;
  • Model extension: flexible model extension is supported; DNN model embeddings can be developed with TensorFlow and PyTorch.

It is worth mentioning that, besides homomorphic encryption schemes, PowerFL also supports privacy-protection schemes for federated neural networks such as secret sharing and differential privacy (noise perturbation).
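Additively homomorphic schemes such as Paillier are what make ciphertext aggregation possible: multiplying two ciphertexts yields an encryption of the sum of the plaintexts, so a party can accumulate encrypted quantities it cannot read. Below is a toy, textbook Paillier sketch in pure Python, with tiny primes for illustration only; this is not PowerFL's GMP/RIAC implementation, and real deployments need a modulus of 2048 bits or more.

```python
import math
import random

def lcm(a, b):
    return a * b // math.gcd(a, b)

def make_keys(p, q):
    """Textbook Paillier keypair from two primes (toy sizes!)."""
    n = p * q
    lam = lcm(p - 1, q - 1)
    g = n + 1                      # standard simple choice of generator
    mu = pow(lam, -1, n)           # valid because L(g^lam mod n^2) = lam mod n
    return (n, g), (lam, mu)

def encrypt(pub, m):
    n, g = pub
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:     # r must be invertible mod n
        r = random.randrange(1, n)
    return pow(g, m, n2) * pow(r, n, n2) % n2

def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    L = (pow(c, lam, n * n) - 1) // n
    return L * mu % n

pub, priv = make_keys(293, 433)
c1, c2 = encrypt(pub, 15), encrypt(pub, 27)
c_sum = c1 * c2 % (pub[0] ** 2)    # additive homomorphism: E(a)*E(b) = E(a+b)
print(decrypt(pub, priv, c_sum))   # 42
```

This additive property is exactly what lets one side accumulate encrypted gradients or embeddings contributed by the other without ever seeing the plaintext values.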

Three, the overall Weishi advertising business

One of our team's overall goals is to iteratively optimize the intelligent delivery system. We work on the following three aspects:

1. Expanding customer-acquisition channels

These include external channel buying, internal soft redirection, and organic growth. Channel buying can be further divided into the Marketing API for creating ads in batches, RTA audience targeting, the sDPA/mDPA product catalogs, RTB real-time bidding, and so on.

2. Growing the creative formats

To support the Marketing API and RTA, we continuously optimize ad creatives; to support RTB and sDPA/mDPA, we optimize native ad content; and to support sharing and conversion in organic growth, we optimize strategies and models such as subsidies, red envelopes, and coupons.

3. Growth technology

Whether for RTA or RTB, the core is to optimize the accurate matching of users and creatives. We continue to explore creatives, users, and the interaction between the two:

  • Creatives: production, mining, understanding, and quality control, such as filtering content prone to negative feedback, identifying and enhancing clarity, and automatically listing/delisting creatives and adjusting bids.
  • Users: on the profiling side we keep building user profiles, e.g. lookalike audiences and user labels; on the operations side we use uplift and LTV models; on the experience side we pursue a seamless hand-off from acquisition to retention.
  • Traffic: the core of ad decision-making is the management of traffic and cost, for which we have developed a series of strategies; we have also tried reinforcement learning to resolve the trade-off between traffic and cost.

Four, the ad-delivery federated learning framework

The following introduces the role of federated learning in the Weishi advertising framework: producing the RTA audience packages.

1. Overview of the advertising system

First, the diagram below shows a simple, generic advertising system: an ad request carrying the user's device ID arrives at the ad system, then passes through ad recall, RTA ad-targeting filters, coarse ranking, fine ranking, and delivery, finally resulting in an ad exposure.

2. RTA advertising architecture

Next, we zoom in on the RTA part of the framework. The purpose of RTA is to pre-judge user value, implement audience targeting, and assist bidding.

  • An RTA ad request is initiated, and the user device ID reaches the experimentation platform;
  • The channel-allocation policy and ID mapping are applied: historical users go to the re-activation policy, and non-historical users to the acquisition policy;
  • Federated learning results are produced on the RTA-DMP side and imported into the DMP as audience packages for audience targeting and stratification.

3. Coarse-grained federated learning framework

Here, we introduce the coarse-grained federated learning framework:

  • The Weishi side provides user IDs, profiles, and labels, while the advertising platform side provides user IDs and profiles;
  • Secure sample alignment (Private Set Intersection, PSI) obtains the user intersection, after which federated collaborative training begins;
  • After model evaluation, the two parties jointly extract and export the features of the full user set and score all users;
  • Finally, the results are returned to RTA-DMP.

Part five breaks this down in detail.

Five, modeling practices and details

1. Preliminary work

User acquisition needs federated learning more urgently than re-activation, because in-app features are sparser and many users have only a device ID; acquisition was therefore prioritized. The preliminary work included:

1.1 Fitting objectives: a four-task model

  • Main task: next-day retention rate of newly acquired users, i.e., the proportion of new users who actively open the Weishi app on day T+1.
  • Secondary tasks: cost per retained new user, cost per effective new user, and proportion of effective new users. The effectiveness of a new user has itself been modeled, with a probability score based on dwell time and other behaviors.

1.2 Weishi single-side data exploration and feature engineering

  • Samples and sampling: determine the sample size and the sampling strategy.
  • Features and models: ID-class features and behavior-sequence features; a DNN model is used.
  • Offline metrics consistent with online performance: after exploration, Group-AUC proved to be a good offline metric, where a group refers to a user stratum. Group-AUC is positively correlated with online performance and more sensitive than plain AUC.
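Group-AUC as described here, i.e. AUC computed per user group and then averaged with group-size weights, can be sketched as follows; the groups, labels, and scores below are made up, and single-class groups (whose AUC is undefined) are skipped.

```python
def auc(labels, scores):
    """Rank-based AUC for binary labels (no tie handling; toy version)."""
    pairs = sorted(zip(scores, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        return None                       # undefined for single-class groups
    rank_sum = sum(i + 1 for i, (_, y) in enumerate(pairs) if y == 1)
    return (rank_sum - pos * (pos + 1) / 2) / (pos * neg)

def group_auc(groups):
    """groups: {group_id: (labels, scores)}; size-weighted mean of per-group AUC."""
    num, den = 0.0, 0
    for labels, scores in groups.values():
        a = auc(labels, scores)
        if a is not None:
            num += a * len(labels)
            den += len(labels)
    return num / den

groups = {
    "g1": ([1, 0, 1, 0], [0.9, 0.2, 0.8, 0.4]),   # perfectly ranked -> AUC 1.0
    "g2": ([1, 0], [0.3, 0.7]),                   # inverted -> AUC 0.0
}
print(group_auc(groups))  # (1.0 * 4 + 0.0 * 2) / 6 = 0.666...
```

The grouping keeps easy global separations (e.g. between user strata) from inflating the metric, which is one plausible reason it tracks online performance better than plain AUC.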

2. Model training

With the preparatory work done, the Weishi side began federated learning modeling with the advertising platform side.

2.1 The federated model-training iteration process

(1) Data alignment: determine the common sample set {ID} for collaborative training, in one of two ways

  • Plaintext: fast; a billion-by-billion intersection takes only a few to ten minutes. However, it is not secure: the two parties want to learn only the common subset, not to disclose their complements to each other. A trusted execution environment (TEE) can make plaintext intersection secure.
  • Ciphertext: slower, taking roughly 10x longer than plaintext, because it involves a large number of encryption/decryption operations and comparisons. This is the strategy we currently choose, implemented on our own PowerFL platform.
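The ciphertext intersection can be illustrated with a toy Diffie-Hellman-style PSI: each side blinds its hashed IDs with a private exponent, the sides exchange and re-blind each other's values, and H(id)^(ab) matches on exactly the shared IDs. This is only an illustrative sketch (toy group, no shuffling, both sides learn the result), not PowerFL's improved-FNP protocol.

```python
import hashlib
import math
import random

P = 2**127 - 1                       # a Mersenne prime; toy group, not production-grade

def h(uid):
    """Hash an ID into the multiplicative group mod P."""
    return int.from_bytes(hashlib.sha256(uid.encode()).digest(), "big") % P

def keygen():
    while True:
        k = random.randrange(2, P - 1)
        if math.gcd(k, P - 1) == 1:  # invertible exponent -> blinding is injective
            return k

def blind(ids, key):
    return {uid: pow(h(uid), key, P) for uid in ids}

a_ids = {"u1", "u2", "u3"}
b_ids = {"u2", "u3", "u4"}
ka, kb = keygen(), keygen()

a_blind = blind(a_ids, ka)                                      # A -> B
b_blind = blind(b_ids, kb)                                      # B -> A
a_double = {pow(v, kb, P) for v in a_blind.values()}            # B re-blinds A's set
b_double = {uid: pow(v, ka, P) for uid, v in b_blind.items()}   # A re-blinds B's set

# H(id)^(ka*kb) is identical on both sides iff the ID is shared.
intersection = {uid for uid, v in b_double.items() if v in a_double}
print(sorted(intersection))  # ['u2', 'u3']
```

The cost structure of the real protocol is visible even in the toy: every ID requires modular exponentiations on both sides, which is why ciphertext alignment is an order of magnitude slower than a plaintext join.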

(2) Multi-party feature engineering

  • Vertical federated learning: the two sides' features are independent and can be handled separately, e.g. feature standardization and imputation.
  • Horizontal federated learning: some statistics require the full distribution of a feature across all parties, and federated-learning communication is again used to solve the data synchronization.
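One simple way to obtain such a global statistic (here, a feature's global mean) without revealing any party's local sum is pairwise additive masking: each pair of parties shares a random mask that one adds and the other subtracts, so the masks cancel in the aggregate. A toy sketch, with made-up party data; real secure aggregation adds dropout handling and key agreement.

```python
import random

# Each party holds a local (sum, count) of one feature; the aggregator should
# learn only the global sum/count, not any party's local value.
local_sums = {"A": 120.0, "B": 300.0, "C": 80.0}
local_counts = {"A": 10, "B": 20, "C": 8}

parties = sorted(local_sums)
masks = {}  # shared pairwise secrets: party i adds r_ij, party j subtracts it
for i, pi in enumerate(parties):
    for pj in parties[i + 1:]:
        masks[(pi, pj)] = random.randrange(-10**6, 10**6)

def masked(value, party):
    out = value
    for (pi, pj), r in masks.items():
        if party == pi:
            out += r
        elif party == pj:
            out -= r
    return out

masked_sums = [masked(local_sums[p], p) for p in parties]
global_sum = sum(masked_sums)              # pairwise masks cancel exactly
global_count = sum(local_counts.values())  # counts treated as non-sensitive here
global_mean = global_sum / global_count
print(global_sum, global_count)  # 500.0 38
```

With the global mean (and, analogously, the global second moment) in hand, each party can standardize its local shard of the feature consistently.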

(3) Collaborative training

  • Determine the computing environment and storage resources.
  • Determine the communicated information (which quantities are exchanged, such as gradients and embeddings).

(4) Offline evaluation

(5) Online evaluation

2.2 The DNN-based federated model (FL-DNN)

The multi-task DNN model is trained jointly by the Weishi side and the advertising platform's AMS side. The multi-task structure evolved from simple approaches, such as sampling strategies and modified loss functions, to MMoE. Engineering-wise, the project parallelizes training with Horovod.

2.3 Iteration of the FL-DNN model parameters

(1) Initialization: A (the host, AMS side) and B (the guest, Weishi side) each initialize their own bottom network, Net_A with parameters θ_A and Net_B with parameters θ_B, along with the interactive-layer parameters W_A and W_B; denote the learning rate by η, the noise by ε, and homomorphic encryption by [[·]].

(2) Forward propagation

  • Side A computes its embedding α_A = Net_A(x_A), encrypts it as [[α_A]] (i.e., side A's output), and sends it to B.
  • Side B likewise computes its embedding α_B = Net_B(x_B) and the term z_B = α_B·W_B; on receiving [[α_A]] it computes [[z_A]] = [[α_A]]·W_A, then [[z_A + ε_B]], and sends that to A.
  • Side A receives [[z_A + ε_B]] and decrypts it to obtain z_A + ε_B, which it sends to B.
  • Side B receives z_A + ε_B, subtracts ε_B, and forms z = z_A + z_B. Propagating z forward through the interactive network yields the prediction ŷ, from which B computes the loss L(ŷ, y).

(3) Backward propagation

  • Side B computes δ = ∂L/∂z, the gradient of the loss at the interactive-layer output. The gradient with respect to W_A, ∂L/∂W_A = α_Aᵀ·δ, involves A's encrypted embedding, so B computes [[∂L/∂W_A + ε_B]] = [[α_A]]ᵀ·δ + [[ε_B]] and sends it to A.
  • Side A receives [[∂L/∂W_A + ε_B]] and decrypts it, then sends the noise-masked plaintext ∂L/∂W_A + ε_B back to B, which removes ε_B and can update W_A.
  • Side B then computes the gradient the loss passes down to A's branch, ∂L/∂α_A = δ·W_Aᵀ, and sends [[∂L/∂α_A]] to A.
  • Side A receives [[∂L/∂α_A]] and decrypts it, then back-propagates through Net_A to obtain ∂L/∂θ_A; B likewise back-propagates through Net_B to obtain ∂L/∂θ_B.

(4) Gradient update: A, B, and the interactive layer each apply their gradients, e.g. θ_A ← θ_A − η·∂L/∂θ_A and θ_B ← θ_B − η·∂L/∂θ_B (and similarly for W_A, W_B), completing one iteration:

The structure looks similar to the two-tower model commonly used in recall and coarse ranking, but the design rationale is different. The two-tower structure is often criticized because the embeddings interact too late, so there are many improved versions, such as Tencent's MVKE model, that move the interaction earlier. In vertical federated learning, side A's α_A can be handed to side B right after the first layer, even unchanged (i.e., with only feature encryption), so in principle there is no late-interaction problem.
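A plaintext simulation (no encryption or noise) of this split forward/backward flow fits in a few lines of NumPy. The shapes, data, linear bottom networks, and logistic interactive layer below are illustrative assumptions, not the production FL-DNN.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: n samples, dA/dB raw features per side, k embedding dims.
n, dA, dB, k = 64, 5, 3, 4
xA, xB = rng.normal(size=(n, dA)), rng.normal(size=(n, dB))
y = (rng.uniform(size=n) < 0.5).astype(float)

thA = rng.normal(size=(dA, k)) * 0.1   # A's bottom network (one linear layer)
thB = rng.normal(size=(dB, k)) * 0.1   # B's bottom network
wA, wB = rng.normal(size=k) * 0.1, rng.normal(size=k) * 0.1  # interactive layer
lr = 0.5

def loss_and_grads():
    aA, aB = xA @ thA, xB @ thB              # bottom-layer embeddings
    z = aA @ wA + aB @ wB                    # interactive layer; this is the part
    p = 1 / (1 + np.exp(-z))                 #  done under encryption in FL-DNN
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    dz = (p - y) / n                         # d loss / d z
    grads = (xA.T @ np.outer(dz, wA),        # d loss / d thA (returned to A)
             xB.T @ np.outer(dz, wB),        # d loss / d thB (stays on B)
             aA.T @ dz, aB.T @ dz)           # interactive-layer gradients
    return loss, grads

loss0, g = loss_and_grads()
thA -= lr * g[0]; thB -= lr * g[1]; wA -= lr * g[2]; wB -= lr * g[3]
loss1, _ = loss_and_grads()
print(loss1 < loss0)  # one gradient step should reduce the training loss
```

In the encrypted protocol, the only extra machinery is that α_A, z_A, and the gradients crossing the boundary travel as ciphertexts with noise masks; the algebra per step is the same.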

2.4 A special case of FL-DNN parameter iteration: single-side features

If B (the guest side) has no features, or only weak ones, it can provide only the user device ID and the label. The parameter-iteration process above then degenerates to one without Net_B; the reader may try writing out the corresponding parameter-update process.

In practice, due to data volume, feature coverage, intersection loss, and other issues, the following two situations are combined to ensure sufficient data for DNN training:

  • B side has no features: B's <ID, label> + A's <ID, features>;

  • B side has features: B's <ID, label, features> + A's <ID, features>.

3. Online serving

Each participant holds only the model parameters related to itself, so all parties must cooperate to complete a prediction:

(1) Request: the user device ID reaches A and B respectively;

(2) Embedding computation

  • Side A computes its embedding α_A and encrypts it as [[α_A]];
  • Side B computes its embedding α_B.

(3) Label computation

  • Side A sends [[α_A]] to side B;
  • Side B combines it with its own branch and computes the label;
  • After decryption on side B, the final score y is obtained.

4. Results

In the cooperation with Tencent AMS, federated learning improved Group-AUC by +0.025 over Weishi's standalone training. The primary goal is positively correlated with the three secondary goals, and all of them improved; the primary goal, the next-day retention rate (after coverage conversion), rose by +4.7pp. After the first version went live, all metrics improved significantly, and it has been fully rolled out. The second iteration also achieved a significant GAUC improvement and is under small-traffic experiment.

The chart below shows the effective reduction of the cost per retained new user (orange):

5. Iteration

5.1 Acquisition model

We are promoting federated cooperation with other channels, but the team cannot afford to maintain a separate federated model for every delivery platform. A preliminary attempt applied the model jointly trained with the AMS platform to other platforms, but because of heterogeneous data (sample-distribution shift) it underperformed the base model (Weishi single-side). In addition, the platforms have conflicting interests: each wants advertisers to focus on its own traffic. We are therefore trying to combine horizontal and vertical federation: Weishi and each advertising platform form a vertical federation, while the advertising platforms form a horizontal one, starting from three-party federated cooperation; we are currently also iterating on federated transfer learning ideas.

5.2 Re-activation model

Once the federated pipeline with the AMS platform was established, we wanted to reuse it for the re-activation model. Since re-activation involves multiple objectives, multiple interests, and different behavior sequences, we focus on timeliness and model innovation, exploring a model based on MMoE, MIND, and Transformer components.

5.3 Iteration difficulties

(1) Efficiency and stability

  • Faster data alignment: to speed up ciphertext intersection, IDs are hashed into buckets for simple parallelism and acceleration.
  • Compressed training time: incremental training for fine-tuning; results similar to full training are obtained in half the time.
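The hash-bucketing idea above can be sketched as follows: both sides bucket IDs by a shared hash, so each bucket pair can be intersected independently and in parallel. The IDs are made up, and plain set intersection stands in for the encrypted comparison.

```python
import hashlib

def bucket(ids, n_buckets=4):
    """Partition IDs by a shared hash; the same ID lands in the same bucket on both sides."""
    out = [set() for _ in range(n_buckets)]
    for uid in ids:
        h = int.from_bytes(hashlib.sha256(uid.encode()).digest()[:4], "big")
        out[h % n_buckets].add(uid)
    return out

a = {f"user{i}" for i in range(1000)}
b = {f"user{i}" for i in range(500, 1500)}

a_buckets, b_buckets = bucket(a), bucket(b)
# Each pair (a_buckets[i], b_buckets[i]) is independent -> trivially parallel.
per_bucket = [ab & bb for ab, bb in zip(a_buckets, b_buckets)]
merged = set().union(*per_bucket)
print(len(merged) == len(a & b))  # bucketing preserves the intersection
```

Because the buckets are determined by the ID alone, correctness is preserved, and the quadratic comparison cost of each encrypted intersection shrinks to the per-bucket size.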

(2) Interpretability and debugging difficulties: neither federation side can see the other's raw data, and sometimes the sides even hide their network structures from each other. This does keep the data secure, but from an iteration perspective it makes problem localization harder.

(3) Difficulty of multi-party federated modeling

  • Joint modeling with multiple partner advertising platforms involves conflicts of interest, unlike Google's FedAvg scenario.

  • In joint modeling with other business units, such as WeChat or Search, the other side has strong features but little motivation.

  • There are also technical issues, network-stability issues, and communication costs.

Six, Q&A

Q1. Is a TEE (Trusted Execution Environment) required in federated learning tasks? In what scenarios are tasks run on a TEE? Are the projects introduced here computed on a TEE?

A1. Currently no TEE is used. With a TEE, plaintext computation can be performed directly, without large numbers of encryption operations; the TEE guarantees that even in plaintext, the data remain secure and invisible to the other party. At present, both data intersection and model training (gradients and embeddings) are ciphertext computations.

Q2. Does data alignment, the first step of federated learning, require maintaining a mapping table?

A2. No. With billions of users plus features, a mapping table would reach hundreds of gigabytes, which is a waste of resources. In practice, sample alignment relies on ordering: the advertising platform provides IDs in an agreed top-to-bottom order, so no key-value mapping needs to be maintained.

Q3. At serving time, the other party's (the ad platform's) features are needed. What is the latency like?

A3. The latency comes from communication. The ad platform runs its side of the model on its own machines, and the Weishi side runs its side on its own machines; what is exchanged at serving time is, again, the embedding.

Q4. Must side B (the guest side) provide the label in all cases?

A4. On side B (guest, the Weishi side), since data never leaves the domain, the label is never given to the other side. As the formulas in "Iteration of the FL-DNN model parameters" show, the gradient is computed on side B, so the other side cannot learn the label.

Q5. Federated learning improves Group-AUC by +0.025. What was the Group-AUC before federated learning?

A5. The absolute value has no direct guiding significance, since changes in sample definition and fitting target across scenarios change it; roughly, it went from 0.70 to 0.72-0.73.

Q6. What is the full title of the MVKE paper Tencent published some time ago?

A6. "Mixture of Virtual-Kernel Experts for Multi-Objective User Profile Modeling" (2021).

Q7. FL-DNN modeling requires a third party; how can the third party be trusted?

A7. In fact, the decentralized architecture requires no third party; its role can be carried by the series of encryption and decryption algorithms.

Q8. If both parties run in a TEE execution environment, is the data exchanged over the network in plaintext?

A8. Yes, plaintext is acceptable in that case.

Q9. Combining the federated framework with RTA, are the audience packages produced offline, or estimated online in real time?

A9. After exploration, real-time requirements on the acquisition side turned out to be low: the offline audience package is imported into the DMP and then connected to RTA. The re-activation side, however, wants to capture users' short-term interest changes and does have real-time requirements; this is currently under study.