Targeted ads vs. search ads

In targeted advertising the user has no explicit intent (no active query, unlike search ads)

Before users come to Taobao, they do not have a specific goal (past historical behavior => item recommendation)

p(y=1 | ad, context, user)

ad denotes the candidate ad (drawn from the advertising candidate set)

user denotes the user's features, such as age and gender

context denotes contextual features, such as device and time

Evolution of targeted advertising

LR model (linear model)

LR model + artificial features. The LR model cannot capture nonlinear feature interactions, so feature engineering is needed to add hand-crafted nonlinear (cross) features
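A minimal sketch of this idea, with hypothetical toy features and weights (not a real production model): the LR model is linear in its inputs, so any nonlinearity must be injected by hand-crafted cross features such as "young user × clothes category".

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_features(age_young, cat_clothes):
    # raw features plus one manual cross feature (the "artificial" nonlinearity)
    return np.array([1.0, age_young, cat_clothes, age_young * cat_clothes])

w = np.array([-2.0, 0.3, 0.5, 1.2])  # toy weights, bias first

# predicted CTR for a young user shown a clothes ad
pctr = sigmoid(w @ make_features(1.0, 1.0))
```

Without the cross term the model could only add the two effects; the hand-made feature lets it express their interaction.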

[1] MLR model (nonlinear model)

Mixed Logistic Regression

MLR adopts a divide-and-conquer strategy: a cascade of piecewise-linear models fits the nonlinear classification surface of the high-dimensional space, improving efficiency and accuracy compared with manual feature engineering
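The piecewise-linear idea can be sketched as a mixture of m logistic regressions combined by a softmax gate (all weights below are random toys, not trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 4, 8                      # number of pieces, feature dimension
U = rng.normal(size=(m, d))      # gating (region-assignment) parameters
W = rng.normal(size=(m, d))      # per-piece logistic-regression parameters

def mlr_pctr(x):
    gate = np.exp(U @ x)
    gate /= gate.sum()                       # softmax over the m regions
    piece = 1.0 / (1.0 + np.exp(-(W @ x)))   # sigmoid of each local LR
    return float(gate @ piece)               # mixture prediction in (0, 1)

p = mlr_pctr(rng.normal(size=d))
```

Each region gets its own linear model, and the gate blends them, which is how a piecewise-linear family can trace a nonlinear decision surface.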

DNN model (Deep learning)

Can handle complex models and large amounts of data using GPUs

[2] Deep Interest Network

Insights into historical user behavior:

A user can be interested in many kinds of things

Local Activation: only part of the historical behavior is helpful for the current click prediction

[3] Deep Interest Evolution Network

Motivation:

• Behavior sequences contain many random jumps; they are irregular and noisy

• Each individual interest tends to evolve over time

[4] Deep Session Interest Network

Motivation:

• View a user's action sequence as multiple sessions

• Behaviors are similar within a session but differ from one session to another

Deep Interest Network DIN

Deep Interest Network for Click-Through Rate Prediction, 2018

Problems solved:

CTR estimation predicts the probability that each ad is clicked, based on information such as the given ad, user, and context

Insights into historical user behavior:

A user can be interested in many kinds of things

Local Activation. Only part of the historical behavior is helpful for the current click prediction. For example, when the system recommends swimming goggles, the prediction is driven by the swimsuit the user clicked, not by the book the user bought

Deep Interest Evolution Network DIEN

Deep Interest Evolution Network for Click-Through Rate Prediction, 2018

Targeted advertising DNN Base Model

embedding+MLP

Step 1: transform each feature into its corresponding embedding representation

Step 2: concatenate the embeddings of all features

Step 3: feed the concatenated vector into the multi-layer perceptron (MLP/DNN) and compute the prediction

Feature representation:

User Profile, User Behavior, Ad, and Context

Each feature category includes multiple feature fields

If the feature field is a single-valued feature, => One-hot encoding

Feature field is multi-valued feature => multi-Hot encoding

Embedding Layer converts high dimensional sparse vector to low dimensional dense vector

Pooling Layer + Concat Layer: pooling handles the varying length of each user's behavior sequence; the pooled behavior embedding is then concatenated with the embeddings of the other three categories as the input of the MLP Layer

The MLP Layer takes the concatenated embedding and automatically learns higher-order feature combinations
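The steps above can be sketched end to end with toy sizes, random weights, and hypothetical item ids (context features omitted for brevity): embedding lookup, sum pooling over the behavior sequence, concatenation, then a small MLP.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, dim = 100, 8
item_emb = rng.normal(size=(n_items, dim))        # shared item embedding table
user_vec = rng.normal(size=dim)                   # pooled user-profile embedding
ad_id, behavior_ids = 7, [3, 42, 7, 19]           # hypothetical ids

# sum pooling turns the variable-length behavior list into a fixed-size vector
behavior_vec = item_emb[behavior_ids].sum(axis=0)
x = np.concatenate([user_vec, behavior_vec, item_emb[ad_id]])

W1 = rng.normal(size=(16, x.size)); b1 = np.zeros(16)
w2 = rng.normal(size=16)
h = np.maximum(0.0, W1 @ x + b1)                  # ReLU hidden layer
pctr = 1.0 / (1.0 + np.exp(-(w2 @ h)))            # sigmoid output in (0, 1)
```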

Loss: the loss function is the negative log-likelihood:

L = -(1/N) Σ_{(x,y)∈S} [ y·log p(x) + (1-y)·log(1-p(x)) ]

where S is the training set of size N, x is the input of the network, y ∈ {0, 1} is the sample label, and p(x) is the predicted probability of a click for input x
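A quick numeric check of this loss on a few hypothetical (p(x), y) pairs:

```python
import math

# negative log-likelihood averaged over the samples
def nll(preds, labels):
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(preds, labels)) / len(preds)

loss = nll([0.9, 0.2, 0.7], [1, 0, 1])  # confident correct predictions -> small loss
```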

What is Attention

Attention mechanism:

An Attention Network (also called an Activation Unit) is introduced on top of the user-behavior embeddings.

Each embedded historical behavior is treated as one representation of the user's interest, and the Attention Unit assigns a different weight to each such representation

Attention weights are computed by matching historical user behavior against the candidate ad, corresponding to the two insights (diversity of user interest and Local Activation).

Attention idea: during pooling, items related to the candidate ad receive larger weights, while unrelated items receive smaller weights

The attention score is computed by letting the candidate ad interact with each item of the historical behavior

The Activation Unit outputs the activation weight. Its input is the user-behavior embedding, the candidate-ad embedding, and their interaction (the out product of the two). Different candidate ads therefore yield different user-behavior representation vectors

{e1, e2, ..., eH} are the embedding vectors of user U's behaviors and vA is the embedding vector of candidate ad A. The user representation vU(A) differs for each ad:

vU(A) = Σ_{j=1}^{H} a(ej, vA) · ej

where a(·) is a feed-forward network that outputs the activation weight. The attention scores over the click sequence are not normalized (no softmax); the weighted sum of scores and item embeddings is used directly, which retains the intensity of the user's interest

For example, if the user's click sequence is 90% clothes and 10% electronics, and there are two candidate ads (a T-shirt and a mobile phone), the T-shirt candidate activates most of the clothes history and obtains a greater interest intensity than the phone

Normalizing the scores with softmax would not capture the intensity of the user's behavior (e.g., clothes 90% of the time vs. electronics 10% of the time)
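A minimal sketch of this DIN-style activation, with toy dimensions and random weights (the activation net a(·) here takes [e, ad, e*ad] as input, a simplification of the paper's unit): score each behavior against the candidate ad, then take the unnormalized weighted sum so interest intensity is preserved.

```python
import numpy as np

rng = np.random.default_rng(2)
dim, H = 4, 5
behaviors = rng.normal(size=(H, dim))   # e_1 .. e_H
ad = rng.normal(size=dim)               # v_A
W = rng.normal(size=(8, 3 * dim)); v = rng.normal(size=8)

def activation_weight(e):
    z = np.concatenate([e, ad, e * ad])          # embeddings + interaction term
    return float(v @ np.maximum(0.0, W @ z))     # scalar score, NOT softmaxed

weights = np.array([activation_weight(e) for e in behaviors])
v_user = weights @ behaviors                     # v_U(A): ad-specific interest vector
```

Because the weights are not normalized, a history dominated by one category produces a correspondingly stronger pooled vector for ads in that category.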

Ali Deep Interest Network DIN

Deep Interest Network for Click-Through Rate Prediction, 2018

Arxiv.org/abs/1706.06…


Attention Weight Visualization

The visualized attention weights match the insights: behaviors related to the candidate ad receive high weights (diversity of user interest and Local Activation)

Engineering DIN (evaluation metrics, Dice function, MBA-Reg regularization)

Evaluation indicators:

An improved AUC evaluation metric is used

AUC is widely used in CTR scenarios; it is the probability that a positive sample is scored higher than a negative sample. The previous practice computes AUC over all samples without distinguishing users

Instead of pooling the positive and negative samples of all users together, the improved metric computes an AUC for each user separately and weights it by that user's number of impressions (or clicks)

Computing AUC at the user level and weighting by impressions removes the influence of user bias on model evaluation and describes the model's performance for each user more accurately
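A pure-Python sketch of this per-user weighted AUC on hypothetical toy data (weights here are impression counts):

```python
# pairwise AUC for one user's (score, label) lists
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def weighted_user_auc(per_user):
    # per_user: list of (scores, labels); weight = #impressions of that user
    total_w = sum(len(s) for s, _ in per_user)
    return sum(len(s) * auc(s, y) for s, y in per_user) / total_w

gauc = weighted_user_auc([
    ([0.9, 0.2, 0.6], [1, 0, 0]),   # user A: perfectly ranked -> AUC 1.0
    ([0.3, 0.8], [1, 0]),           # user B: inverted ranking -> AUC 0.0
])
```

User A (3 impressions) counts more than user B (2 impressions), so the pooled score reflects per-user ranking quality rather than cross-user score offsets.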

Evaluation index RelaImpr

RelaImpr measures the relative improvement over the base model. A random guess has AUC 0.5, so RelaImpr = ((AUC(model) − 0.5) / (AUC(base) − 0.5) − 1) × 100%
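As a one-line function with toy AUC values:

```python
# relative improvement over the base model, with the random-guess
# baseline of 0.5 subtracted from both AUCs
def rela_impr(auc_model, auc_base):
    return ((auc_model - 0.5) / (auc_base - 0.5) - 1.0) * 100.0  # percent

r = rela_impr(0.72, 0.70)  # 0.22/0.20 - 1 = 10% relative improvement
```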

The activation function Dice:

Dice (Data Adaptive Activation Function) is an improvement on the adaptive activation function PReLU

Dice was developed for and applied at Alimama (Alibaba's advertising platform)

The Dice function introduces statistics of the input data: the mean E[s] and variance Var[s] of each mini-batch

Dice is a generalized form of PReLU that adapts to the distribution of the input data; when E[s] = 0 and Var[s] = 0, Dice reduces to PReLU
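A small sketch of Dice under these definitions (alpha is a learned parameter in practice, fixed here as a toy constant; eps keeps the standardization stable):

```python
import numpy as np

# Dice: a PReLU whose "switch" p(s) is a sigmoid of the input standardized
# by the mini-batch mean and variance
def dice(s, alpha=0.1, eps=1e-8):
    p = 1.0 / (1.0 + np.exp(-(s - s.mean()) / np.sqrt(s.var() + eps)))
    return p * s + (1.0 - p) * alpha * s

out = dice(np.array([-2.0, -1.0, 0.0, 1.0, 2.0]))
```

If p(s) were a hard 0/1 indicator of s > 0, this expression would be exactly PReLU; Dice smooths the switch and centers it at the batch mean.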

The MBA-Reg regularization:

Adaptive regularization: mini-batch aware regularization

MBA-Reg is not a new regularizer; it tailors L2 regularization to industrial scenarios

MBA-Reg applies the L2 norm at the sample level: in each training step, only the parameters of features that appear in the current mini-batch are regularized
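A simplified sketch of the idea (the paper additionally scales by feature frequency, omitted here; table and ids are toys): penalize only the embedding rows touched by the current mini-batch instead of the whole huge sparse table.

```python
import numpy as np

rng = np.random.default_rng(3)
emb = rng.normal(size=(1000, 8))        # toy embedding table (1000 feature ids)
batch_feature_ids = {3, 7, 42, 999}     # ids that appear in this mini-batch

def mba_reg(table, ids, lam=1e-4):
    rows = table[sorted(ids)]
    return lam * float((rows ** 2).sum())  # L2 only over the touched rows

penalty = mba_reg(emb, batch_feature_ids)
```

For a table with billions of parameters, regularizing only the few rows per batch is what makes L2 affordable at all.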

Ali Deep Interest Evolution Network DIEN

Deep Interest Evolution Network for Click-Through Rate Prediction, 2018

DIEN structure:

Input layer: learn low-dimensional embedding vectors for User Profile, Ad, and Context (handled the same way as in the base model)

The Behavior Layer, Interest Extractor Layer, and Interest Evolving Layer mine the user's interest, and its evolution with respect to the target item, from the user's historical behavior

Objective loss function: negative log-likelihood loss

P(x) corresponds to the output of the network

The sequence model AUGRU is introduced to simulate the evolution of user interest

An Interest Extractor Layer (which generates interest) and an Interest Evolving Layer (which models interest evolution) are added between the Embedding Layer and the Concatenate Layer

The Interest Extractor Layer uses a GRU to extract the user's interest in each time slice

The Interest Evolving Layer uses the sequence model AUGRU to connect the user's interests at different times into a chain of interest evolution

Finally, the "interest vector" of the current moment is fed into the upper fully connected network, and the final CTR estimate is computed together with the other features
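The AUGRU step can be sketched as a GRU whose update gate is scaled by the attention score a_t (toy dimensions, random weights, hypothetical attention scores): behaviors irrelevant to the candidate ad barely move the hidden interest state.

```python
import numpy as np

rng = np.random.default_rng(4)
d_in, d_h = 4, 6
Wz, Wr, Wh = (rng.normal(size=(d_h, d_in + d_h)) for _ in range(3))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def augru_step(h, x, a_t):
    xh = np.concatenate([x, h])
    u = a_t * sigmoid(Wz @ xh)              # attention-scaled update gate
    r = sigmoid(Wr @ xh)                    # reset gate (standard GRU)
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h]))
    return (1.0 - u) * h + u * h_tilde      # a_t == 0 leaves h unchanged

h = np.zeros(d_h)
for x, a_t in [(rng.normal(size=d_in), a) for a in (0.9, 0.1, 0.5)]:
    h = augru_step(h, x, a_t)
```

With a_t = 0 the state passes through untouched, which is exactly how irrelevant behaviors are kept out of the evolution chain.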

Engineering of DIEN (Auxiliary loss Function)

An auxiliary loss function is added to supervise the learning of the interest state at each step

GRU is faster than LSTM, avoids the RNN vanishing-gradient problem, and is suitable for an e-commerce system

The Interest Extractor Layer uses a GRU to model the time-ordered behavior sequence and capture the dependencies between behaviors; the generated interest states h(t) serve as the input of the Interest Evolving Layer

Auxiliary Loss functions in engineering:

The auxiliary loss provides extra supervision: the behavior at the next moment supervises the learning of the current interest state, enabling the GRU to extract interest representations more effectively

The interest at the present moment directly influences the behavior at the next moment (the next behavior serves as the positive example for the current step's prediction)

The loss uses the negative log-likelihood; negative examples are sampled randomly from items the user has not interacted with, or from items shown to the user but not clicked

α is used to balance interest representation and CTR prediction

Using the auxiliary loss to better predict the user's next behavior => each hidden state is fully trained and better expresses interest

Auxiliary loss for better interest extraction

Click labels exist only for ad items (not all products are ads), so supervision from the final click alone is limited; the auxiliary loss introduces supervision at every step of the behavior sequence, which is more effective for learning interests
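For one step, the auxiliary term can be sketched as asking the hidden interest state h(t) to score the real next behavior (positive) above a sampled non-clicked item (negative) via a sigmoid of their inner product; the vectors below are hypothetical toys.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def aux_loss(h_t, pos_next, neg_next):
    # negative log-likelihood of ranking the true next item above the sampled one
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return -(math.log(sigmoid(dot(h_t, pos_next)))
             + math.log(1.0 - sigmoid(dot(h_t, neg_next))))

l = aux_loss([0.5, -0.2], [1.0, 0.1], [-0.8, 0.4])
```

The total objective then combines this term with the CTR loss, weighted by the balance coefficient α from the text.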

Deep Session Interest Network DSIN
