1. The concept of CTR

CTR (click-through rate) is the number of times an ad is clicked divided by the number of times it is displayed.

CTR prediction model formula: $y = f(x)$, where $y \in [0, 1]$ denotes the probability that the advertisement is clicked.

The application of the LR, GBDT, FM, FFM, MLR, Wide & Deep, Deep & Cross, DeepFM, xDeepFM, PNN, NFM and AFM models to CTR will be introduced below.

2. Traditional CTR prediction models

2.1 LR

LR (logistic regression) is the most basic CTR prediction model; it is well suited to large numbers of high-dimensional discrete features.


$$f(x) = \frac{1}{1 + e^{-\theta^T X}}$$

where $\theta = (\theta_0, \theta_1, \ldots, \theta_n)$.

It can also be expressed as:


$$f(x) = \operatorname{logistic}(\operatorname{linear}(X))$$

Advantages of LR: it handles discrete features, the model is simple, and distributed training is easy to implement.

A variant of LR, Google's FTRL, can be seen as LR + regularization + a specific online optimization method.

Formula of FTRL algorithm:


$$\mathbf{w}_{t+1}=\underset{\mathbf{w}}{\arg \min }\left(\sum_{s=1}^{t} \mathbf{g}_{s} \cdot \mathbf{w}+\frac{1}{2} \sum_{s=1}^{t} \sigma_{s}\left\|\mathbf{w}-\mathbf{w}_{s}\right\|_{2}^{2}+\lambda_{1}\|\mathbf{w}\|_{1}\right)$$

Disadvantage of LR: features are treated as independent in the model, so for features that could usefully be crossed, the feature crosses have to be constructed manually.
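As a rough illustration of LR on high-dimensional discrete features (the feature names, the hashing trick, and the use of scikit-learn are my own choices for the sketch, not part of the original text):

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression

# Toy impressions with discrete features, hashed into a large sparse one-hot space.
# Feature names and values are illustrative only.
impressions = [
    {"site": "news", "ad_id": "ad_42", "device": "ios"},
    {"site": "sports", "ad_id": "ad_7", "device": "android"},
    {"site": "news", "ad_id": "ad_7", "device": "ios"},
    {"site": "video", "ad_id": "ad_42", "device": "android"},
]
clicks = np.array([1, 0, 1, 0])            # 1 = clicked, 0 = not clicked

hasher = FeatureHasher(n_features=2**20, input_type="dict")
X = hasher.transform(impressions)           # sparse design matrix

# L2-regularized logistic regression; an FTRL-style optimizer would replace this solver online.
model = LogisticRegression(C=1.0, solver="liblinear")
model.fit(X, clicks)

print(model.predict_proba(X)[:, 1])         # predicted click probabilities in [0, 1]
```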

2.2 GBDT

GBDT (gradient boosting decision tree) is a nonlinear model with strong expressive power.

GBDT's advantage is handling continuous-valued features, such as a user's historical click-through rate. Thanks to the tree-splitting algorithm, it also has a certain ability to combine features. GBDT is insensitive to linear numerical transformations of features and automatically selects the optimal splitting feature and the corresponding optimal split point according to the objective function. In addition, counting how many times each feature is used for splitting yields a feature-importance ranking.

Disadvantages: most recommendation scenarios involve large-scale discrete features. To use GBDT, many features must first be aggregated into continuous-valued statistics or embeddings, which takes a lot of time. The GBDT model also has strong memorization behavior, which is not conducive to mining long-tail features.

2.3 FM and FFM

FM improves on LR, which cannot automatically handle cross features.

FM model:

The input to the logistic function has two parts: a linear regression term and a second-order cross term.
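Written out explicitly (this is the usual second-order polynomial form, with a pairwise weight $w_{ij}$ for every feature pair), the function inside the logistic is:

$$y(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} w_{ij}\, x_i x_j$$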

$w_{ij}$ requires storing a two-dimensional weight matrix whose dimension can be very large. FM uses the idea of matrix decomposition to factorize this weight matrix, namely $w_{ij} = \langle v_i, v_j \rangle$, where each $v_i$ is a $k$-dimensional latent vector.

The FM model becomes:
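Using the standard FM notation, with each $v_i$ a $k$-dimensional latent vector:

$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j$$

Note that this quadratic term can be rearranged so that it is computable in $O(kn)$ time, which is why FM keeps linear time complexity (a point the FFM comparison below relies on).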

FFM model formula:
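In standard FFM notation, where $f_j$ denotes the field of feature $j$, the model is:

$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_{i, f_j}, v_{j, f_i} \rangle\, x_i x_j$$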

That is, $x_i$ uses a different latent vector $v_i$ when it is crossed with features from different fields.

Building on FM, FFM takes the field of each feature into account when crossing, but as a result it can no longer achieve linear time complexity (with $n$ features, each feature $x_i$ is crossed with the other $n-1$ feature vectors using field-specific latent vectors, so $\frac{n(n-1)}{2}$ vector inner products must be computed explicitly). Training is about an order of magnitude slower than FM, but the effect is better.

2.4 GBDT + LR/FM/FFM

GBDT is suitable for continuous-valued features, while LR, FM and FFM are better suited to discrete features. GBDT can combine features to a certain degree (multi-way combinations), while FM and FFM only form second-order combinations. GBDT also has a certain feature-selection ability. In other words, GBDT and LR/FM/FFM can complement each other.

Predicting Clicks on Ads at Facebook proposes a GBDT + LR solution: GBDT is first applied to the dense features, and the resulting leaf nodes are then concatenated with the discretized features and fed into LR for training.

GBDT + LR model:
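A minimal sketch of this pipeline with scikit-learn (toy data and hyperparameters are placeholders, not the paper's setup):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_dense = rng.normal(size=(1000, 10))            # dense/continuous features (toy data)
y = (X_dense[:, 0] + X_dense[:, 1] ** 2 > 1).astype(int)

# Step 1: train GBDT on the dense features.
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3)
gbdt.fit(X_dense, y)

# Step 2: map each sample to the leaf index it falls into in every tree.
leaves = gbdt.apply(X_dense)[:, :, 0]            # shape: (n_samples, n_trees)

# Step 3: one-hot encode the leaf indices; in practice these would be
# concatenated with the original discrete/sparse features.
enc = OneHotEncoder(handle_unknown="ignore")
X_leaves = enc.fit_transform(leaves)

# Step 4: train LR on the encoded leaves.
lr = LogisticRegression(max_iter=1000)
lr.fit(X_leaves, y)
print(lr.predict_proba(X_leaves)[:5, 1])
```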

2.5 MLR

MLR is a nonlinear model proposed by the Alibaba team in the paper Learning Piece-wise Linear Models from Large Scale Data for Ad Click Prediction. It is equivalent to a clustering + LR form. Definition:


$$\operatorname{softmax}(X, i, m)=\frac{e^{\mu_{i} X}}{\sum_{j=1}^{m} e^{\mu_{j} X}}$$

$$f(x)=\sum_{i=1}^{m} \operatorname{softmax}(X, i, m) \cdot \operatorname{logistic}(\operatorname{linear}(X))$$

It is equivalent to clustering $X$ into $m$ classes and then training a separate LR for each cluster. Like LR, MLR requires heavy feature engineering; moreover, the model itself is non-convex, pre-training is required, and training may fail to converge.
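A tiny NumPy sketch of the MLR forward pass defined above, where each of the $m$ pieces gets its own LR weights (shapes and values are illustrative):

```python
import numpy as np

def mlr_predict(x, mu, theta):
    """MLR forward pass: softmax gating over m pieces times a per-piece LR.
    x: (d,) feature vector; mu: (m, d) gating weights; theta: (m, d) LR weights."""
    gate_logits = mu @ x                              # (m,)
    gate = np.exp(gate_logits - gate_logits.max())
    gate /= gate.sum()                                # softmax(X, i, m)
    lr = 1.0 / (1.0 + np.exp(-(theta @ x)))           # logistic(linear(X)) per piece
    return float(np.dot(gate, lr))                    # sum_i softmax_i * logistic_i

x = np.array([1.0, 0.0, 1.0])
mu = np.random.default_rng(0).normal(size=(4, 3))     # m = 4 pieces
theta = np.random.default_rng(1).normal(size=(4, 3))
print(mlr_predict(x, mu, theta))                      # a probability in (0, 1)
```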

3. DNN models for CTR ranking

Features in CTR scenarios are mostly discrete, and cross features are meaningful. DNN models have strong expressive power and can learn higher-order feature interactions. They also make it easy to incorporate other kinds of features, such as image and text features.

The main architectures of DNN CTR models:

Parallel structure:

Most current DNN models follow this kind of architecture; they differ mainly in the stack-layer method, in whether a shallow part is kept, and in what that shallow part is (e.g. LR or FM).

Serial structure:


The following models use the parallel structure:

3.1 Wide & Deep model

Wide & Deep Learning for Recommender Systems;

The shallow part is LR, and the stack layer is concatenation (the embedding vectors are concatenated).

See also my blog post on the Wide & Deep network.
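A rough NumPy sketch of this parallel structure (layer sizes, a single hidden layer, and random weights are placeholder choices): the wide part is LR over the sparse features, the deep part is an MLP over the concatenated embeddings, and their logits are summed before the sigmoid.

```python
import numpy as np

rng = np.random.default_rng(0)
n_fields, vocab, emb_dim, hidden = 3, 100, 8, 16

emb = rng.normal(size=(n_fields, vocab, emb_dim))     # one embedding table per field
w_wide = rng.normal(size=(n_fields * vocab,))          # wide (LR) weights over one-hot features
W1 = rng.normal(size=(n_fields * emb_dim, hidden))     # deep MLP weights
w2 = rng.normal(size=(hidden,))

def wide_and_deep(field_ids):
    # field_ids: one discrete feature id per field, e.g. [site_id, ad_id, device_id]
    one_hot_idx = [f * vocab + i for f, i in enumerate(field_ids)]
    wide_logit = w_wide[one_hot_idx].sum()                                   # LR over sparse features
    deep_in = np.concatenate([emb[f, i] for f, i in enumerate(field_ids)])   # stack layer: concatenate
    deep_logit = np.maximum(deep_in @ W1, 0.0) @ w2                          # ReLU MLP
    return 1.0 / (1.0 + np.exp(-(wide_logit + deep_logit)))                  # joint sigmoid

print(wide_and_deep([3, 17, 1]))
```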

3.2 Deep & Cross model

The shallow part is a Cross Network (which forms higher-order combinations), and the stack layer is concatenation (the embedding vectors are concatenated).
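For reference, one cross layer of the Cross Network follows $x_{l+1} = x_0 (w_l^{\top} x_l) + b_l + x_l$; a minimal NumPy sketch (dimension and depth are arbitrary):

```python
import numpy as np

def cross_layer(x0, xl, w, b):
    """One cross layer: x_{l+1} = x0 * (w^T x_l) + b + x_l."""
    return x0 * (w @ xl) + b + xl

rng = np.random.default_rng(0)
d = 8                                   # dimension of the concatenated embeddings
x0 = rng.normal(size=d)                 # input to the cross network
x = x0
for layer in range(3):                  # stack a few cross layers
    w, b = rng.normal(size=d), rng.normal(size=d)
    x = cross_layer(x0, x, w, b)        # each layer adds one more degree of crossing
print(x.shape)                          # still (8,): output feeds the final logit
```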

3.3 deepFM model

The shallow part of DeepFM is FM, and the stack layer is again concatenation.

3.4 XDeepFM model

The CIN part is a bit hard to follow. In (a), the outer product of two matrices yields a three-dimensional tensor: the outer product of two vectors gives a matrix (one slice), and with an embedding dimension of $D$ there are $D$ such slices.

In (b), mapping the three-dimensional tensor back to a two-dimensional matrix amounts to taking a weighted sum over each slice to obtain a vector (one per feature map).

(c) shows the whole network: each layer $x^k$ is obtained from $x^0$ and the previous layer $x^{k-1}$ by this operation.

The shallow part of xDeepFM is CIN.
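A NumPy sketch of one CIN layer matching the description above (the numbers of fields, embedding dimension, and feature maps are placeholders):

```python
import numpy as np

def cin_layer(x0, x_prev, W):
    """x0: (m, D), x_prev: (H_prev, D), W: (H_next, H_prev, m) -> (H_next, D)."""
    # (a) outer products per embedding dimension d: z[h, i, d] = x_prev[h, d] * x0[i, d]
    z = np.einsum('hd,id->hid', x_prev, x0)
    # (b) weighted sum over (h, i) for each output feature map
    return np.einsum('phi,hid->pd', W, z)

rng = np.random.default_rng(0)
m, D, H1 = 5, 4, 6                        # fields, embedding dim, feature maps in layer 1
x0 = rng.normal(size=(m, D))
W1 = rng.normal(size=(H1, m, m))
x1 = cin_layer(x0, x0, W1)                # the first layer crosses x0 with itself
print(x1.shape)                           # (6, 4); sum-pooling over D gives the CIN output
```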


The following models use the serial structure:

3.5 PNN model

PNN (Product-based Neural Network) argues that feeding embeddings straight into an MLP does not learn cross features sufficiently, and proposes a Product Layer: a DNN structure that models feature crosses through multiplication.

$l_z$ stands for the linear part; $l_p$ represents the quadratic (product) part.

Definition:


$$p_{i,j} = g(f_i, f_j)$$

Different choices of $g$ give two variants of PNN: one is called Inner PNN (IPNN), the other Outer PNN (OPNN).

IPNN


$$g(f_i, f_j) = \langle f_i, f_j \rangle$$

Use the inner product.

OPNN

Uses the outer product.

Definition: $g(f_i, f_j) = f_i f_j^T$
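A small sketch of the two product choices for a single pair of field embeddings (the vectors are made-up examples):

```python
import numpy as np

f_i = np.array([0.2, -1.0, 0.5])
f_j = np.array([1.5, 0.3, -0.7])

g_inner = np.dot(f_i, f_j)        # IPNN: g(f_i, f_j) = <f_i, f_j>, a scalar per pair
g_outer = np.outer(f_i, f_j)      # OPNN: g(f_i, f_j) = f_i f_j^T, a k x k matrix per pair

print(g_inner, g_outer.shape)     # scalar and (3, 3)
```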

3.6 NFM model

Deep part of NFM model:

The shallow part of NFM is LR; in the stack layer, Bi-Interaction is used: element-wise products of pairs of embedding vectors are computed and then summed.

The Bi-Interaction layer has a nice name, but it is really just the computation of the quadratic term in FM, so the resulting vector has the same dimension as the embedding vectors.

That is:
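Following the NFM paper's notation (with $x_i \mathbf{v}_i$ the embedding of non-zero feature $i$), Bi-Interaction pooling is:

$$f_{BI}(\mathcal{V}_x)=\sum_{i=1}^{n} \sum_{j=i+1}^{n} x_{i} \mathbf{v}_{i} \odot x_{j} \mathbf{v}_{j}$$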

Here $x_i v_i$ is the embedding representation of feature $i$. Each pair of embedding vectors is multiplied element-wise and the results are summed, giving a $k$-dimensional vector, where $k$ is the length of the embedding vectors.
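The double sum can also be computed without the explicit pairwise loop via the identity $\sum_{i<j} a_i \odot a_j = \frac{1}{2}\big[(\sum_i a_i)^2 - \sum_i a_i^2\big]$ (squares taken element-wise); a quick NumPy check:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 4                                   # non-zero features, embedding dim
xv = rng.normal(size=(n, k))                  # rows are x_i * v_i

# Naive pairwise sum of element-wise products
naive = sum(xv[i] * xv[j] for i in range(n) for j in range(i + 1, n))

# Efficient form: 0.5 * ((sum_i x_i v_i)^2 - sum_i (x_i v_i)^2)
fast = 0.5 * (xv.sum(axis=0) ** 2 - (xv ** 2).sum(axis=0))

print(np.allclose(naive, fast))               # True: both are k-dimensional vectors
```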

3.7 AFM

Note that I leave out the linear part and only cover the feature-interaction part.

Attention-based Pooling Layer:

The key idea behind the attention mechanism is that different parts contribute differently when they are compressed together. AFM implements attention by adding a weighted sum over the interacted vectors. Formally:


$$f_{Att}\left(f_{PI}(\mathcal{E})\right)=\sum_{(i, j) \in \mathcal{R}_{x}} a_{i j}\left(\mathbf{v}_{i} \odot \mathbf{v}_{j}\right) x_{i} x_{j}$$

Attention Network is formalized as follows:


$$\begin{aligned} a_{i j}^{\prime} &=\mathbf{h}^{T} \operatorname{ReLU}\left(\mathbf{W}\left(\mathbf{v}_{i} \odot \mathbf{v}_{j}\right) x_{i} x_{j}+\mathbf{b}\right) \\ a_{i j} &=\frac{\exp \left(a_{i j}^{\prime}\right)}{\sum_{(i, j) \in \mathcal{R}_{x}} \exp \left(a_{i j}^{\prime}\right)} \end{aligned}$$

Each interacted vector is multiplied by a matrix $\mathbf{W}$, passed through ReLU and projected by $\mathbf{h}$, and the resulting scores are normalized with softmax to give the weights.

Overall formula:


$$\hat{y}_{AFM}(\mathbf{x})=w_{0}+\sum_{i=1}^{n} w_{i} x_{i}+\mathbf{p}^{T} \sum_{i=1}^{n} \sum_{j=i+1}^{n} a_{i j}\left(\mathbf{v}_{i} \odot \mathbf{v}_{j}\right) x_{i} x_{j}$$
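A NumPy sketch of the attention pooling just formalized (embedding size, attention size, and random parameters are placeholders; $x_i$ is taken as 1 for non-zero features):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, t = 4, 8, 6                         # non-zero features, embedding dim, attention size
v = rng.normal(size=(n, k))               # embeddings v_i (x_i assumed to be 1 for simplicity)
W, b, h = rng.normal(size=(t, k)), rng.normal(size=t), rng.normal(size=t)
p = rng.normal(size=k)

pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
inter = np.stack([v[i] * v[j] for i, j in pairs])                    # (v_i ⊙ v_j) x_i x_j

scores = np.array([h @ np.maximum(W @ e + b, 0.0) for e in inter])   # a'_ij
a = np.exp(scores - scores.max()); a /= a.sum()                      # softmax -> a_ij
pooled = (a[:, None] * inter).sum(axis=0)                            # attention-weighted sum
logit = p @ pooled                                                    # p^T f_Att(...)
print(float(logit))
```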

4. To summarize

There are three key issues when using DNNs for CTR:

  1. Whether to keep a shallow part;
  2. How to model feature crosses, i.e. the choice of stack layer: concatenate, Bi-Interaction, direct vector addition, etc.
  3. Embedding + FC (fully connected layers) is the standard setup.

References:

  1. The design of f(x) for CTR ranking in recommendation systems – DNN models – Jachin's article – Zhihu
  2. The design of f(x) for CTR ranking in recommendation systems – traditional models – Jachin's article – Zhihu
  3. Recommendation system meets deep learning (VI) – PNN model theory and practice
  4. Neural Factorization Machines for Sparse Predictive Analytics
  5. Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks