Today we take a look at the Bing search advertising algorithm announced in 2016. The original paper is here: www.kdd.org/kdd2016/pap… . This neural network is different from the YouTube recommendation algorithm we've seen before; its emphasis is completely different.

Background knowledge

When we try to make a machine learning model better, we often combine and cross features manually. In theory, the model could learn these feature combinations by itself; in practice, because of limits on the model's size or on the model's own structure, it cannot fit the distribution of the feature data, and manually combining features still makes the model better. In this paper, people from Microsoft proposed letting the deep learning model do the combination and crossing of features by itself, and they called this model the Deep Crossing model.
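To make "manually combining features" concrete, here is a tiny hypothetical example; the feature names (country, ad_category) are my own, not from the paper:

```python
# A hand-made cross feature: fuse two categorical features into one,
# so that after one-hot encoding the combination gets its own weight.
def cross(country: str, ad_category: str) -> str:
    return f"{country}|{ad_category}"

print(cross("US", "shoes"))  # -> "US|shoes", a single new category
```

Deep Crossing aims to make this kind of hand-crafting unnecessary.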

More strictly speaking, the Deep Crossing model in the paper can take many kinds of features as input: text, categorical, ID, or numeric. All of these are essential in the search advertising space where Bing Ads operates. Before we talk about the specific features and the model, let's introduce some concepts from the search advertising industry:

Query: A text string a user types into the search box


Keyword: A text string related to a product, specified by an advertiser to match a user query


Match type: An option given to the advertiser on how closely the keyword should be matched by a user query, usually one of four kinds: exact, phrase, broad and contextual

Each ad bids on one keyword or a group of keywords, so the ad's keywords, the user's specific search query, and the match type connecting the two are all very important to the model. These are text features and categorical features.

Title: The title of a sponsored advertisement (ad hereafter), specified by an advertiser to capture a user’s attention


Landing page: A product’s web site that a user reaches when clicking the corresponding ad

The ad's own copy (its title) and its specific landing page are put into the model as features that 'describe what's inside the ad'. Most of these features are text.

As each advertisement runs, it also accumulates impressions, clicks, and click-through rates. Such counting features enter the model keyed on ID features (ad ID, user ID) and as numeric features.
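As a rough illustration of counting features (my own sketch, not code from the paper), per-ad click-through rate can be aggregated from a click log like this:

```python
from collections import defaultdict

# Aggregate per-ad impressions and clicks from (ad_id, clicked) events.
impressions, clicks = defaultdict(int), defaultdict(int)
log = [("ad_1", True), ("ad_1", False), ("ad_2", True)]
for ad_id, clicked in log:
    impressions[ad_id] += 1
    clicks[ad_id] += int(clicked)

# Numeric counting features, keyed on the ad ID.
ctr = {ad: clicks[ad] / impressions[ad] for ad in impressions}
print(ctr)  # {'ad_1': 0.5, 'ad_2': 1.0}
```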

Features

All the features described above have to be put into the model, so how exactly is each feature fed in? In general, each feature becomes a vector.

  • Text features are decomposed into tri-letter-grams, i.e. every three letters map to one dimension, in a space of 49,292 dimensions.
  • Categorical features become one-hot vectors: if a categorical feature has five different categories, then category 1 is [1,0,0,0,0] and category 4 is [0,0,0,1,0]. One-hot means that exactly one entry is ever 1, i.e. "hot". A small sketch of both encodings follows this list.
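A minimal Python sketch of both encodings; the hash-based tri-letter-gram lookup is a simplification of my own (the paper builds an actual tri-letter-gram vocabulary):

```python
import numpy as np

DIM = 49292  # dimensionality of the tri-letter-gram space, from the paper

def tri_letter_gram(text: str) -> np.ndarray:
    """Count tri-letter-grams of a string into a DIM-dimensional vector.
    Hashing stands in for a real tri-letter-gram vocabulary."""
    vec = np.zeros(DIM)
    padded = f"#{text.lower()}#"  # boundary markers, a common convention
    for i in range(len(padded) - 2):
        vec[hash(padded[i:i + 3]) % DIM] += 1
    return vec

def one_hot(index: int, num_categories: int) -> np.ndarray:
    """One-hot: exactly one entry is 1."""
    vec = np.zeros(num_categories)
    vec[index] = 1.0
    return vec

print(one_hot(3, 5))  # category 4 of 5 -> [0. 0. 0. 1. 0.]
```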

Model

The Deep Crossing model is divided into four parts: the Embedding layer, the Stacking layer, the Residual Units, and the Scoring layer. The model's loss function is log loss.

Embedding

This layer is mainly responsible for the function below:

$$X_j^O = \max(0,\; W_j X_j^I + b_j)$$
j is a marker for the group of features; we can just look at one group of features, so we can ignore j from here on.

n, which does not appear explicitly here, is the dimension of this feature. For example, the text feature dimension is 49,292.

X is the feature's input; it's an n-dimensional vector.

W is an m-by-n matrix. You can choose m yourself, and when m is smaller than n, multiplying by W reduces the dimensionality of X. You can think of it here as a matrix decomposition.

b is an m-dimensional vector, and W·X is also an m-dimensional vector, so W·X + b is m-dimensional.

Finally, a ReLU is applied, so that each component of the resulting vector W·X + b is at least 0.

This layer can be loosely interpreted as dimensionality reduction of the input before it goes into the next layer, the Stacking layer.
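Here is a minimal NumPy sketch of this layer, assuming the 49,292-dimensional tri-letter-gram input from above; the embedding size m = 256 and the random initialization are my own placeholder choices:

```python
import numpy as np

n, m = 49292, 256  # input dim (tri-letter-grams) and chosen embedding dim

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(m, n))  # m-by-n projection matrix
b = np.zeros(m)                          # m-dimensional bias

def embed(x: np.ndarray) -> np.ndarray:
    """X^O = max(0, W X^I + b): project down to m dims, then ReLU."""
    return np.maximum(0.0, W @ x + b)
```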

Stacking Layer

In this layer, all the inputs are stacked, i.e. concatenated:

$$X^O = [X_0^O,\; X_1^O,\; \ldots,\; X_K^O]$$
Low-dimensional features are concatenated into the stacking layer directly, without dimensionality reduction.
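Continuing the sketch, reusing tri_letter_gram, one_hot, and embed from above (in the real model each feature group j has its own W_j and b_j; a single shared embed keeps this short):

```python
# High-dimensional features get embedded; low-dimensional ones pass through.
query_emb = embed(tri_letter_gram("running shoes"))  # 256 dims, embedded
keyword_emb = embed(tri_letter_gram("buy shoes"))    # 256 dims, embedded
match_type = one_hot(0, 4)                           # 4 dims, used as-is

# X^O = [X_0^O, X_1^O, ..., X_K^O]
stacked = np.concatenate([query_emb, keyword_emb, match_type])
print(stacked.shape)  # (516,)
```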

The Residual Unit

This is not a single layer, but a 'double' stack of the ReLU form from the Embedding layer, applied on top of the Stacking layer's output:

$$X^O = \mathcal{F}(X^I, \{W_0, W_1\}, \{b_0, b_1\}) + X^I$$
Here W_0, b_0 and W_1, b_1 are the weights and biases of the two layers inside the unit, in the same W·X + b form as the Embedding layer; a ReLU sits between the two layers, and another ReLU is applied on top of the output after X^I is added back. In this way, the function F is fitting X^O − X^I, i.e. the residual.
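A minimal sketch of one residual unit under this reading, reusing np and rng from above; the hidden size h is my own choice:

```python
d = 516  # stacked dimension from the sketch above
h = 512  # hidden size inside the unit, my own choice

W0, b0 = rng.normal(scale=0.01, size=(h, d)), np.zeros(h)
W1, b1 = rng.normal(scale=0.01, size=(d, h)), np.zeros(d)

def residual_unit(x_in: np.ndarray) -> np.ndarray:
    """Two layers with a ReLU in between; the input is added back
    (so F learns X^O - X^I), then a final ReLU is applied."""
    f = W1 @ np.maximum(0.0, W0 @ x_in + b0) + b1
    return np.maximum(0.0, f + x_in)
```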

Scoring layer

Finally, all the outputs are connected through a fully connected layer into a sigmoid, which produces the prediction: the predicted click-through rate of the advertisement.
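A sketch of the scoring layer together with the log loss, reusing the pieces above (the paper stacks several residual units; a single one is used here for brevity):

```python
w = rng.normal(scale=0.01, size=d)  # weights of the final fully connected layer

def predict_ctr(stacked_input: np.ndarray) -> float:
    """Residual unit, then a fully connected layer into a sigmoid."""
    x = residual_unit(stacked_input)
    return 1.0 / (1.0 + np.exp(-(w @ x)))

def log_loss(y_true: float, p: float) -> float:
    """The model's objective: binary cross-entropy on the predicted CTR."""
    return -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

p = predict_ctr(stacked)
print(p, log_loss(1.0, p))
```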

Results


[Figure: comparisons of sparse vs. dense features (left) and sparse vs. sparse + dense features (right).] It can be seen that the model works with both dense and sparse features.

My own summary

The model itself is not significantly different from the deep-learning-based recommendation algorithm YouTube released in 2016: the main building block is the ReLU, and the rest is matrix decomposition + concatenation. YouTube's model is split into two stages with various design ideas, which makes it slightly more complicated. The Deep Crossing model and the concepts needed to understand it are much simpler, so I suggest you read this article first and then the YouTube one. Posting it again, Deep Crossing is here: www.kdd.org/kdd2016/pap… .