2016 Bid-Aware Gradient Descent for Unbiased Learning with Censored Data in Display Advertising

This paper is closely related to 2014 Optimal Real-Time Bidding for Display Advertising; it is recommended to read the 2014 paper first and then this 2016 paper. Notes on the 2014 paper are available here: Paper Reading (011).

Background

CTR estimation, bid optimization and other predictions operate on the full set of bid requests at the pre-bid stage, but the training data is collected only after the auctions, so the distributions of training data and online test data are inconsistent. To address this problem, the paper makes the following contributions:

  1. A survival model is used to estimate the probability of winning an auction (used to reweight the biased training data back towards the full distribution)
  2. CTR estimation and bid optimization are modified on the basis of contribution 1

The training data is skewed

In each auction only one advertiser wins, so only that advertiser receives feedback for the request, including whether the user clicked, whether the user converted, and the market price. This makes the distribution of the data coming back from the ADX (training data) inconsistent with the distribution of bid requests (test data), as shown in the figure below.

The auction process acts as a dynamic data filter that depends on the bid value. Because losing bid requests produce no impressions and therefore never appear in the training data the model depends on, the training data as a whole is biased relative to the full data.

Probability of winning an auction

In the bidding process, the following concepts are defined

  • $x$: features of the bid request (user features and ad features)
  • $b_x$: the bid placed on the request
  • $y$: user feedback, i.e. whether the user clicks or converts after the auction is won
  • $z$: market price, the price charged after winning the auction; by definition it is lower than the bid
  • $P(\text{win} \mid x, b_x)$: the probability of winning when bidding $b_x$
  • $q_x(x)$: distribution of the impression (won) data
  • $p_x(x)$: distribution of the bid request data

The impression data $(x, y, z)$ therefore satisfies the following relationship, where $P(\text{win} \mid x, b_x)$ (the auction selection term) represents the shift between the training distribution $q_x(x)$ and the bid request distribution $p_x(x)$. Since we train on the impression data but test on the bid request data, we need to solve for this shift between the two distributions:


$$\underbrace{q_{x}(\boldsymbol{x})}_{\text{impression}}=\underbrace{P\left(\operatorname{win} \mid \boldsymbol{x}, b_{\boldsymbol{x}}\right)}_{\text{auction selection}} \cdot \underbrace{p_{x}(\boldsymbol{x})}_{\text{bid request}}$$

The paper uses a survival model to estimate $P(\text{win} \mid x, b_x)$. Let $p_z^x(z)$ denote the probability that the market price equals $z$; the probability of winning the auction is then the probability that the market price $z$ is lower than the bid $b_x$:


$$w\left(b_{\boldsymbol{x}}\right) \equiv P\left(\operatorname{win} \mid \boldsymbol{x}, b_{\boldsymbol{x}}\right)=\int_{0}^{b_{\boldsymbol{x}}} p_{z}^{\boldsymbol{x}}(z)\, d z$$

To simplify the problem, the market price distribution is assumed to be at the advertiser granularity rather than the impression granularity: a single $w(b_x)$ is fitted per advertiser, i.e. one $p_z(z)$ is estimated for each advertiser and from it the corresponding $w(b_x)$.

If there were no data censorship, i.e. the advertiser won every auction and observed every market price, the winning probability $w_o(b_x)$ could be computed directly by counting:


$$w_{o}\left(b_{\boldsymbol{x}}\right)=\frac{\sum_{\left(\boldsymbol{x}^{\prime}, y, z\right) \in D} \delta\left(z<b_{\boldsymbol{x}}\right)}{|D|}$$

where $\delta(z < b_x) = 1$ when $z < b_x$ and $0$ otherwise.
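As a quick illustration, here is a minimal Python sketch of this counting estimate under the no-censorship assumption; the function name and toy price list are mine, not from the paper.

```python
# Minimal sketch: counting estimate w_o(b_x) when every market price z in D
# is observed (no censorship). Toy values are illustrative only.
def w_o(bid, market_prices):
    """Fraction of observed market prices strictly below the bid."""
    if not market_prices:
        return 0.0
    return sum(1 for z in market_prices if z < bid) / len(market_prices)

prices = [23, 45, 51, 60, 75]        # market prices observed in D
print(w_o(50, prices))               # 0.4: two of five prices are below a bid of 50
```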

The real problem is that an advertiser cannot observe the market price when it loses the auction.

Suppose an advertiser participates in $N$ auctions, recorded as $N$ tuples $\langle b_i, w_i, z_i \rangle_{i=1 \ldots N}$, where $b_i$ is the bid, $w_i$ indicates whether the auction was won, and $z_i$ is the market price (observed only when the auction was won).

This log data is converted into the format $\langle b_j, d_j, n_j \rangle_{j=1 \ldots M}$, where the distinct bids are sorted so that $b_j < b_{j+1}$:

  • $d_j$ is the number of winning auctions whose market price is exactly $b_j - 1$

  • $n_j$ is the number of auctions that cannot be won with a bid of $b_j - 1$, which includes

    • the winning auctions whose market price is not lower than $b_j - 1$
    • the losing auctions whose bid is not lower than $b_j - 1$

So when the bid is $b_x$, the probability of losing the auction (given by the survival model) is


$$l\left(b_{\boldsymbol{x}}\right)=\prod_{b_{j}<b_{\boldsymbol{x}}} \frac{n_{j}-d_{j}}{n_{j}}$$

and the probability of winning the auction when the bid is $b_x$ is


$$w\left(b_{\boldsymbol{x}}\right)=1-\prod_{b_{j}<b_{\boldsymbol{x}}} \frac{n_{j}-d_{j}}{n_{j}}$$

Converting $\langle b_i, w_i, z_i \rangle_{i=1 \ldots N}$ into $\langle b_j, d_j, n_j \rangle_{j=1 \ldots M}$ and then computing $w(b_x)$ proceeds as illustrated in the paper's worked example.
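A minimal sketch of this conversion and the resulting estimator, assuming an integer bid/price space and a list of (bid, won, market price) tuples; the function and variable names are mine and only meant to illustrate the definitions above.

```python
# Minimal sketch of the survival-model (Kaplan-Meier style) win-rate estimate
# w(b_x) = 1 - prod_{b_j < b_x} (n_j - d_j) / n_j, following the definitions above.
def km_win_rate(log, bid):
    """log: list of (b_i, won_i, z_i); z_i is only observed when won_i is True."""
    b_js = sorted({b for b, _, _ in log})                 # distinct bids b_1 < ... < b_M
    lose_prob = 1.0
    for b_j in b_js:
        if b_j >= bid:
            break
        # d_j: winning cases whose market price is exactly b_j - 1
        d_j = sum(1 for b, won, z in log if won and z == b_j - 1)
        # n_j: cases that cannot be won with a bid of b_j - 1
        n_j = sum(1 for b, won, z in log
                  if (won and z >= b_j - 1) or (not won and b >= b_j - 1))
        if n_j > 0:
            lose_prob *= (n_j - d_j) / n_j
    return 1.0 - lose_prob

# toy log: (bid, won, market price if won else None)
log = [(3, True, 1), (5, True, 4), (4, False, None), (6, True, 2), (2, False, None)]
print(km_win_rate(log, bid=5))
```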

It can be seen that $w_o(b_x)$ is larger than $w(b_x)$, i.e. the naive estimate overstates the winning probability; equivalently, the training data is shifted to the right relative to the test data.

CTR Estimation

Given training data $D = \{(x, y, z)\}$, where $x$ follows the training distribution $q_x(x)$, an unbiased supervised learning model requires minimizing the loss under the test distribution $p_x(x)$.

The objective is to minimize the loss function under the bid request distribution:


$$\min_{\theta} \mathbb{E}_{\boldsymbol{x} \sim p_{x}(\boldsymbol{x})}\left[\mathcal{L}\left(y, f_{\boldsymbol{\theta}}(\boldsymbol{x})\right)\right]+\lambda \Phi(\boldsymbol{\theta})$$

where $f_\theta(x)$ is the prediction of the model with parameters $\theta$, $\mathcal{L}(y, f_\theta(x))$ is the loss between the prediction $f_\theta(x)$ and the true label $y$, and $\Phi(\theta)$ is a regularization term on the parameters.

The following relationship exists


$$\begin{aligned} \mathbb{E}_{\boldsymbol{x} \sim p_{x}(\boldsymbol{x})}\left[\mathcal{L}\left(y, f_{\boldsymbol{\theta}}(\boldsymbol{x})\right)\right] &=\int_{\boldsymbol{x}} p_{x}(\boldsymbol{x})\, \mathcal{L}\left(y, f_{\boldsymbol{\theta}}(\boldsymbol{x})\right) d \boldsymbol{x} \\ &=\int_{\boldsymbol{x}} q_{x}(\boldsymbol{x}) \frac{\mathcal{L}\left(y, f_{\boldsymbol{\theta}}(\boldsymbol{x})\right)}{w\left(b_{\boldsymbol{x}}\right)} d \boldsymbol{x}=\mathbb{E}_{\boldsymbol{x} \sim q_{x}(\boldsymbol{x})}\left[\frac{\mathcal{L}\left(y, f_{\boldsymbol{\theta}}(\boldsymbol{x})\right)}{w\left(b_{\boldsymbol{x}}\right)}\right] \\ &=\frac{1}{|D|} \sum_{(\boldsymbol{x}, y, z) \in D} \frac{\mathcal{L}\left(y, f_{\boldsymbol{\theta}}(\boldsymbol{x})\right)}{w\left(b_{\boldsymbol{x}}\right)}=\frac{1}{|D|} \sum_{(\boldsymbol{x}, y, z) \in D} \frac{\mathcal{L}\left(y, f_{\boldsymbol{\theta}}(\boldsymbol{x})\right)}{1-\prod_{b_{j}<b_{\boldsymbol{x}}} \frac{n_{j}-d_{j}}{n_{j}}} \end{aligned}$$

That is, the loss defined under the test-set distribution is rewritten as a reweighted loss under the training-set distribution. This correction has the following properties:

  • For an impression won with a low-probability bid $b_x$, $1-\prod_{b_j<b_x}\frac{n_j-d_j}{n_j}$ is small, so the corresponding gradient weight is large: such an instance is rare in the training data, but it stands in for many similar bid requests, so its gradient contribution should be higher
  • If the bid is so high that the auction is won with probability 1, such data is effectively observed from the true data distribution, so its gradient is not reweighted (the weight is 1)

The adjustment of the winning probability and the instance weight is shown in the figure below.
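To make the reweighting concrete, below is a minimal sketch of one importance-weighted SGD step for a logistic-regression CTR model, where each won impression is scaled by $1/w(b_x)$; the function name, constants, and toy data are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# Minimal sketch: one SGD step of logistic regression where a won impression
# is reweighted by 1 / w(b_x), its estimated probability of having been won.
def sgd_step(theta, x, y, w_bx, lr=0.01, l2=1e-4):
    """theta, x: feature vectors; y: 0/1 click label; w_bx: win probability of the historic bid."""
    pred = 1.0 / (1.0 + np.exp(-x @ theta))      # f_theta(x)
    grad = (pred - y) * x / max(w_bx, 1e-6)      # importance-weighted log-loss gradient
    return theta - lr * (grad + l2 * theta)

theta = np.zeros(4)
x = np.array([1.0, 0.5, -0.2, 0.3])
theta = sgd_step(theta, x, y=1, w_bx=0.25)       # a rarely-won impression gets 4x weight
print(theta)
```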

Bid Optimization

The CTR $f(x)$ is obtained from the model, and the bid is set as $b(f(x))$. Over $T$ auctions with a budget $B$, the bid optimization objective follows Optimal real-time bidding for display advertising, with the budget constraint relaxed from $\leq$ to $=$:


$$\underset{b()}{\arg \max }\; T \int_{\boldsymbol{x}} f(\boldsymbol{x})\, w(b(f(\boldsymbol{x})))\, p_{x}(\boldsymbol{x})\, d \boldsymbol{x}$$

subject to $T \int_{\boldsymbol{x}} b(f(\boldsymbol{x}))\, w(b(f(\boldsymbol{x})))\, p_{x}(\boldsymbol{x})\, d \boldsymbol{x} = B$

Here $w(b(f(x)))$ expresses the chain for a given bid request: $x \rightarrow$ estimated CTR $f \rightarrow$ bid $b \rightarrow$ winning probability $w$.

The actually observed data follows $q_x(x)$, so transform the above equations accordingly:


$$\underset{b()}{\arg \max }\; T \int_{\boldsymbol{x}} f(\boldsymbol{x})\, w(b(f(\boldsymbol{x}))) \frac{q_{x}(\boldsymbol{x})}{w\left(b_{\boldsymbol{x}}\right)} d \boldsymbol{x}$$

subject to $T \int_{\boldsymbol{x}} b(f(\boldsymbol{x}))\, w(b(f(\boldsymbol{x}))) \frac{q_{x}(\boldsymbol{x})}{w\left(b_{\boldsymbol{x}}\right)} d \boldsymbol{x} = B$

The Lagrange multiplier method is used to express the problem as unconstrained


$$\begin{aligned} \mathcal{L}(b(f), \lambda)=& \int_{\boldsymbol{x}} f(\boldsymbol{x})\, w(b(f(\boldsymbol{x}))) \frac{q_{x}(\boldsymbol{x})}{w\left(b_{\boldsymbol{x}}\right)} d \boldsymbol{x} \\ &-\lambda \int_{\boldsymbol{x}} b(f(\boldsymbol{x}))\, w(b(f(\boldsymbol{x}))) \frac{q_{x}(\boldsymbol{x})}{w\left(b_{\boldsymbol{x}}\right)} d \boldsymbol{x}+\frac{\lambda B}{T} \end{aligned}$$

The Euler-Lagrange condition gives:

$$f(\boldsymbol{x}) \frac{q_{x}(\boldsymbol{x})}{w\left(b_{\boldsymbol{x}}\right)} \frac{\partial w(b(f(\boldsymbol{x})))}{\partial b(f(\boldsymbol{x}))}-\lambda \frac{q_{x}(\boldsymbol{x})}{w\left(b_{\boldsymbol{x}}\right)}\left[w(b(f(\boldsymbol{x})))+b(f(\boldsymbol{x})) \frac{\partial w(b(f(\boldsymbol{x})))}{\partial b(f(\boldsymbol{x}))}\right]=0$$

$$\Rightarrow \lambda w(b(f(\boldsymbol{x})))=[f(\boldsymbol{x})-\lambda b(f(\boldsymbol{x}))] \frac{\partial w(b(f(\boldsymbol{x})))}{\partial b(f(\boldsymbol{x}))}$$

The optimal bidding function $b_{ORTB}(f(x))$ depends on the winning function $w(b)$, e.g.


$$w(b(f(\boldsymbol{x})))=\frac{b(f(\boldsymbol{x}))}{c+b(f(\boldsymbol{x}))}$$

where $c$ is a constant, so the optimal bid is


$$b_{\mathrm{ORTB}}(f(\boldsymbol{x}))=\sqrt{\frac{c}{\lambda} f(\boldsymbol{x})+c^{2}}-c$$
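A minimal sketch of evaluating this bidding function; the constants $c$ and $\lambda$ below are made-up illustrative values that would in practice be tuned on data.

```python
import math

# Minimal sketch: the ORTB bid under the assumed winning function w(b) = b / (c + b).
def bid_ortb(ctr, c, lam):
    return math.sqrt(c / lam * ctr + c ** 2) - c

# illustrative constants only; in practice c and lam are fitted/tuned offline
print(bid_ortb(ctr=0.001, c=40.0, lam=5e-6))     # bid in the same currency unit as c
```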

For the multiplier $\lambda$ in the unconstrained problem, the Euler-Lagrange condition with respect to $\lambda$ can be written as:


$$\begin{aligned} \partial \mathcal{L}(b(f(\boldsymbol{x})), \lambda) / \partial \lambda &=0 \\ \Rightarrow \int_{\boldsymbol{x}} b(f(\boldsymbol{x}), \lambda)\, w(b(f(\boldsymbol{x}), \lambda)) \frac{q_{x}(\boldsymbol{x})}{w\left(b_{\boldsymbol{x}}\right)} d \boldsymbol{x} &=\frac{B}{T} \\ \Rightarrow \frac{1}{|D|} \sum_{(\boldsymbol{x}, y, z) \in D} b(f(\boldsymbol{x}), \lambda) \frac{w(b(f(\boldsymbol{x}), \lambda))}{w\left(b_{\boldsymbol{x}}\right)} &=\frac{B}{|D|} \end{aligned}$$

The following problem is convex.


$$\min _{\lambda} \sum_{(\boldsymbol{x}, y, z) \in D} \frac{1}{2}\left(\frac{b(f(\boldsymbol{x}), \lambda)\, w(b(f(\boldsymbol{x}), \lambda))}{1-\prod_{b_{j}<b_{\boldsymbol{x}}} \frac{n_{j}-d_{j}}{n_{j}}}-\frac{B}{|D|}\right)^{2}$$

$\lambda$ is updated using BGD (Bid-Aware Gradient Descent):


$$\begin{aligned} \lambda \leftarrow & \lambda-\eta \overbrace{\frac{1}{1-\prod_{b_{j}<b_{\boldsymbol{x}}} \frac{n_{j}-d_{j}}{n_{j}}}}^{\text{instance reweighting}}(\overbrace{\frac{b(f(\boldsymbol{x}), \lambda)\, w(b(f(\boldsymbol{x}), \lambda))}{1-\prod_{b_{j}<b_{\boldsymbol{x}}} \frac{n_{j}-d_{j}}{n_{j}}}-\frac{B}{|D|}}^{\text{gradient direction}}) \\ &(\underbrace{\frac{\partial b(f(\boldsymbol{x}), \lambda)}{\partial \lambda} w(b(f(\boldsymbol{x}), \lambda))+b(f(\boldsymbol{x}), \lambda) \frac{\partial w(b(f(\boldsymbol{x}), \lambda))}{\partial \lambda}}_{\text{bidding function gradient}}) \end{aligned}$$

This update rule has the following properties (a minimal sketch of the update loop follows the list):

  • Each instance is reweighted, just as in the CTR estimation
  • A small historical bid $b_x$ produces a large weight, indicating the importance of that instance in the training data
  • The historical bid of the training instance also affects the gradient direction
  • The ratio $B/|D|$ ensures that the budget is spent evenly across the new bids
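Below is a minimal sketch of this update loop under the ORTB bid form and the winning function $w(b) = b/(c+b)$ used above; the data layout (estimated CTR plus the win probability of the historic bid), the numerical derivatives, and all constants are assumptions for illustration only.

```python
import math

def bid(ctr, lam, c):
    """ORTB bid: sqrt(c/lam * ctr + c^2) - c."""
    return math.sqrt(c / lam * ctr + c ** 2) - c

def win(b, c):
    """Assumed winning function w(b) = b / (c + b)."""
    return b / (c + b)

def bgd_update(lam, data, budget_per_bid, c=40.0, eta=1e-16, eps=1e-9):
    """One pass of bid-aware gradient descent over (ctr, w(b_x)) pairs."""
    for ctr, w_bx in data:
        b = bid(ctr, lam, c)
        direction = b * win(b, c) / w_bx - budget_per_bid        # gradient direction term
        db = (bid(ctr, lam + eps, c) - b) / eps                  # d b / d lambda (numerical)
        dw = (win(bid(ctr, lam + eps, c), c) - win(b, c)) / eps  # d w(b) / d lambda (numerical)
        lam -= eta * (1.0 / w_bx) * direction * (db * win(b, c) + b * dw)
    return lam

data = [(0.002, 0.3), (0.0005, 0.8)]     # (estimated CTR, win prob of the historic bid)
print(bgd_update(lam=5e-6, data=data, budget_per_bid=60.0))
```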

Observing the historical bids and the corresponding gradients (figure in the paper), the following conclusions can be drawn:

  • For the two groups of data with bids 50 and 100: when the historical bid is small, the gradient is positive → $\lambda$ increases → the new bids decrease
  • A data instance won with a low historical bid $b_x$ is observed with low probability, i.e. few similar instances appear in the training set
  • If $b(f(x), \lambda)$ is high, the optimal bid $b(f(x), \lambda)$ should be reduced so as to avoid overspending the budget on the full data

Experiments

The experimental setup

The offline experiments use the iPinYou and TukMob datasets, and the online experiment is conducted on Yahoo. The experimental pipeline is shown below: a truth-telling bidding strategy is executed to simulate the historical bidding process, producing won (labeled but biased) impression data and lost (unlabeled) bid request data.

The compared strategies for CTR estimation and bid optimization are:

  • BIAS: CTR estimation and bid optimization trained directly on the biased training data
  • UOMP: the data is corrected with $w_o(b_x)$, combined with the CTR estimation and bid optimization proposed in this paper
  • KMMP: the data is corrected with the survival-model estimate $w(b_x)$, combined with the CTR estimation and bid optimization proposed in this paper
  • FULL: bid very high so that labels are obtained for all bid request data, which is then used for CTR estimation and bid optimization

The data distribution of Full and Win in the iPinYou dataset is as follows

Winning Probability Estimation

Before evaluating the actual CTR estimation and bid optimization tasks, the models are compared on how well they estimate the winning probability. The results are shown in the figure below; the Truth curve is constructed from all market price observations in the full-volume data and is regarded as the ground truth.

From the figure, the following can be concluded:

  • When the bid is 0, the winning probability is 0, and as the bid increases, the winning probability increases, and the winning probability converges to 1
  • The FULL curve is the closest to the Truth curve because FULL uses all click data and is unbiased
  • UOMP overestimated the probability of winning, and KMMP was closer to Truth than UOMP

The experiment compares the KL divergence and Pearson correlation coefficient of UOMP, KMMP, and FULL against Truth. The results are shown in the table below; KMMP performs better than UOMP and is close to FULL.
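As an illustration of these two measures, the sketch below compares an estimated win-rate curve with a ground-truth curve over a grid of bids; the curves, and the step of differencing them into market-price distributions before taking the KL divergence, are my assumptions rather than the paper's exact evaluation code.

```python
import numpy as np
from scipy.stats import pearsonr, entropy

bids = np.arange(1, 301)
truth = bids / (bids + 50.0)      # pretend ground-truth w(b)
est = bids / (bids + 65.0)        # pretend estimated w(b)

r, _ = pearsonr(truth, est)       # Pearson correlation of the two curves

# KL divergence of the implied market-price distributions p_z(z),
# obtained by differencing the cumulative win-rate curves
p = np.diff(truth, prepend=0.0); p /= p.sum()
q = np.diff(est, prepend=0.0);  q /= q.sum()
print("Pearson r:", r, "KL(p || q):", entropy(p, q))
```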

CTR Estimation Results

The experiment compares the AUC and cross entropy of the CTR estimates; the results are shown in the figure below.
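For reference, a minimal sketch of computing these two metrics with scikit-learn; the labels and predictions are toy values, not the paper's data.

```python
from sklearn.metrics import roc_auc_score, log_loss

y_true = [0, 0, 1, 0, 1, 0, 0, 1]                     # toy click labels
y_pred = [0.1, 0.3, 0.7, 0.2, 0.6, 0.05, 0.4, 0.8]    # toy predicted CTRs
print("AUC:", roc_auc_score(y_true, y_pred))
print("Cross entropy:", log_loss(y_true, y_pred))
```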

From the figure, the following can be concluded:

  • UOMP and KMMP, which apply the data correction, achieve better results than BIAS, indicating that removing the training instance bias helps the prediction model generalize better to the bid request data
  • KMMP performs better than UOMP, i.e. the correction method proposed in this paper is more effective

The paper also compares the stability of AUC and cross entropy during learning, as shown in the figure below. The conclusions are: UOMP and KMMP are more stable than BIAS, and FULL is the best.

Bid Optimisation Results

For offline bid optimization, the budget is set as a ratio of the total cost of the training/test dataset: 1/64, 1/32, 1/16, 1/8, 1/4, and 1/2. The ratio cannot be set to 1, because in that case one could simply bid infinitely high to win every impression and click in the data while still exactly spending the budget.

The following tables show the results at budget ratios of 1/64 and 1/4 on the iPinYou dataset, and the click counts and eCPC values on the TukMob dataset.

The conclusions are:

  • UOMP and KMMP outperform BIAS: the click counts in Tables 6 and 7 are higher, and the eCPC values in Table 7 are lower.

The figure below shows the percentage increase in clicks and impressions and the percentage decrease in eCPC relative to BIAS. The click and impression increases relative to BIAS are all positive, and the eCPC changes are all negative (except that FULL is positive at 1/2).

Online A/B Testing

The experiment ran A/B tests on Yahoo, splitting traffic and budget 50%/50%. The control group was trained with the BIAS strategy using gradient boosted trees, and the treatment group was trained with the KMMP strategy using the same model. The results are shown in the figure below (the two tables in the paper are omitted, since the figure contains more information and overlaps with the tables).

From the figure, the following can be concluded:

  • KMMP gets more clicks and fewer exposures, increasing CTR
  • The boost comes from KMMP’s ability to reduce overbidding for low traffic
  • CTR goes up, eCPC goes down, which means you can get more clicks and lower bid costs

Discussion

The theoretical derivation is clear and the experiments are thorough; a very strong paper.

For related work on bid landscape modeling referenced by this paper, see: Paper Reading (011).

References

  1. Aside: DSP bidding algorithm
  2. Optimal real-time bidding for display advertising
  3. RTB advertising bid strategy