2017 Real-Time Bidding by Reinforcement Learning in Display Advertising

The figures and data are taken from the original paper, and the formula numbering is kept consistent with the paper for convenience.

1 Background

This paper proposes the Reinforcement Learning to Bid (RLB) algorithm, which applies reinforcement learning to real-time bidding in display advertising.

  • A Markov decision process (MDP) framework is established to learn the optimal bidding strategy
  • To handle the scale of real-world auction volumes and campaign budgets, a neural network model is used to approximate the value function
  • Beyond direct generalization of the neural network value function, a novel coarse-to-fine episode segmentation model and a state mapping model are proposed to overcome the problem of large-scale state generalization

2 Methods

2.1 Problem Definition

In display advertising, bidding can be viewed as an episodic process with T auctions per episode, each represented by a high-dimensional feature vector $\boldsymbol{x}$. At the start of an episode, the agent is initialized with a budget B, and the advertising goal is to obtain as many clicks as possible. The agent mainly considers three pieces of information; within an episode, it bids based on t, b and $\boldsymbol{x}$:

  • The number of remaining auctions $t \in \{0, \ldots, T\}$
  • The unspent budget $b \in \{0, \ldots, B\}$
  • The auction feature vector $\boldsymbol{x}$

When the number of remaining auctions is t, the market price in the auction follows the probability density function $m(\delta, \boldsymbol{x})$. If the auction is won at market price $\delta$ (i.e., the bid is no less than $\delta$), the remaining budget is updated to $b - \delta$; in this case the agent observes the user feedback (e.g., whether the ad is clicked) and the market price. If the bid loses, the agent obtains no information. The predicted CTR (pCTR) $\theta(\boldsymbol{x})$ serves as the optimization target. After each auction, the number of remaining auctions becomes t − 1. When the episode ends, the budget and auction count are reset. A minimal simulation of one auction step is sketched below.
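As a concrete illustration, here is a minimal Python sketch of one auction step of this episodic process. `sample_market_price` and `predict_ctr` are hypothetical stand-ins for the market price model $m(\delta, \boldsymbol{x})$ and the pCTR estimator $\theta(\boldsymbol{x})$; the payment rule (winner pays the market price) follows the description above.

```python
import random

def auction_step(t, b, x, bid, sample_market_price, predict_ctr):
    """One auction from state (t, b, x): return (t', b', reward).

    `bid` is assumed to be a feasible action, i.e. 0 <= bid <= b.
    Winning (bid >= market price) costs the market price and yields the
    pCTR theta(x) as the expected-click reward; losing costs nothing
    and yields no reward or feedback.
    """
    delta = sample_market_price(x)      # market price drawn from m(., x)
    if bid >= delta:                    # win: pay the market price
        return t - 1, b - delta, predict_ctr(x)
    return t - 1, b, 0.0                # lose: budget unchanged

# Toy usage with made-up stand-ins (purely illustrative):
next_state = auction_step(
    t=1000, b=5000, x=None, bid=80,
    sample_market_price=lambda x: random.randint(1, 300),
    predict_ctr=lambda x: 0.001,
)
```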

A Markov Decision Process (MDP) provides a widely used framework for modeling agent-environment interactions. The notation used in this paper is shown below.

| Notation | Description |
| --- | --- |
| $\boldsymbol{x}$ | Feature vector of a bid request |
| $\boldsymbol{X}$ | The entire feature vector space |
| $p_x(\boldsymbol{x})$ | Probability density function of $\boldsymbol{x}$ |
| $\theta(\boldsymbol{x})$ | Predicted CTR (pCTR) if the auction for $\boldsymbol{x}$ is won |
| $m(\delta, \boldsymbol{x})$ | Probability density function of the market price $\delta$ given $\boldsymbol{x}$ |
| $m(\delta)$ | Probability density function of the market price $\delta$ |
| $V(t, b, \boldsymbol{x})$ | Expected total reward in state $(t, b, \boldsymbol{x})$ under the optimal policy |
| $V(t, b)$ | Expected total reward in state $(t, b)$ under the optimal policy |
| $a(t, b, \boldsymbol{x})$ | Optimal action in state $(t, b, \boldsymbol{x})$ |

An MDP can be expressed as $(\boldsymbol{S}, \{\boldsymbol{A}_s\}, \{P_{ss'}^a\}, \{R_{ss'}^a\})$, where $\boldsymbol{S}$ and $\boldsymbol{A}_s$ denote the sets of possible states and actions, $P_{ss'}^a$ is the probability of transitioning from state $s$ to state $s'$ under action $a$, and the reward function is $r(a, s, s')$.

$\boldsymbol{x}_t$ can be regarded as drawn independently from $p_x(\boldsymbol{x})$. The entire state space is $\boldsymbol{S} = \{0, \ldots, T\} \times \{0, \ldots, B\} \times \boldsymbol{X}$. $t = 0$ means the episode has ended, and the set of feasible actions in state $(t, b, \boldsymbol{x}_t)$ is $A_{(t, b, \boldsymbol{x}_t)}$. A bid leads to two possible transitions:

  • Winning the bid: in state $(t, b, \boldsymbol{x}_t)$ with $t > 0$, if the bid $a$ wins the auction (i.e., $\delta \in \{0, \ldots, a\}$, the bid is no less than the market price), the agent moves to state $(t-1, b-\delta, \boldsymbol{x}_{t-1})$ with probability $p_x(\boldsymbol{x}_{t-1})\, m(\delta, \boldsymbol{x}_t)$ and receives reward $\theta(\boldsymbol{x}_t)$
  • Losing the bid: the agent moves to $(t-1, b, \boldsymbol{x}_{t-1})$ with probability $p_x(\boldsymbol{x}_{t-1}) \sum_{\delta=a+1}^{\infty} m(\delta, \boldsymbol{x}_t)$. All other transitions are impossible given the auction process

Therefore, the transition probabilities and reward function are as follows, denoted as Equation (1):


$$
\begin{aligned}
\mu\left(a,\left(t, b, \boldsymbol{x}_{t}\right),\left(t-1, b-\delta, \boldsymbol{x}_{t-1}\right)\right) &=p_{x}\left(\boldsymbol{x}_{t-1}\right) m\left(\delta, \boldsymbol{x}_{t}\right) \\
\mu\left(a,\left(t, b, \boldsymbol{x}_{t}\right),\left(t-1, b, \boldsymbol{x}_{t-1}\right)\right) &=p_{x}\left(\boldsymbol{x}_{t-1}\right) \sum_{\delta=a+1}^{\infty} m\left(\delta, \boldsymbol{x}_{t}\right) \\
r\left(a,\left(t, b, \boldsymbol{x}_{t}\right),\left(t-1, b-\delta, \boldsymbol{x}_{t-1}\right)\right) &=\theta\left(\boldsymbol{x}_{t}\right) \\
r\left(a,\left(t, b, \boldsymbol{x}_{t}\right),\left(t-1, b, \boldsymbol{x}_{t-1}\right)\right) &=0
\end{aligned} \quad (1)
$$

Therefore, a deterministic policy $\pi$ is a mapping from a state $s \in \boldsymbol{S}$ to an action $a \in \boldsymbol{A}_s$, i.e., $a = \pi(s)$, during the bidding process. Under policy $\pi$, the value function $V^{\pi}(s)$ satisfies the Bellman equation with discount factor $\gamma = 1$ (because the optimization objective is the total number of clicks in the episode, not the time at which clicks occur):


$$
V^{\pi}(s)=\sum_{s^{\prime} \in \boldsymbol{S}} \mu\left(\pi(s), s, s^{\prime}\right)\left(r\left(\pi(s), s, s^{\prime}\right)+V^{\pi}\left(s^{\prime}\right)\right) \quad (2)
$$

The optimal value function is $V^{*}(s)=\max_{\pi} V^{\pi}(s)$, and the optimal policy is


$$
\pi^{*}(s)=\underset{a \in \boldsymbol{A}_{s}}{\operatorname{argmax}}\left\{\sum_{s^{\prime} \in \boldsymbol{S}} \mu\left(a, s, s^{\prime}\right)\left(r\left(a, s, s^{\prime}\right)+V^{*}\left(s^{\prime}\right)\right)\right\} \quad (3)
$$

The optimal policy $\pi^{*}(s)$ is the target of the solution. For brevity, $V(s)$ is used to denote $V^{*}(s)$.

The above modeling is a model-based approach and relies on $p_x(\boldsymbol{x})$ and $m(\delta, \boldsymbol{x})$.

2.2 Problem Solving

Equation (3) can be solved by dynamic programming. By definition, the optimal value function $V(t, b, \boldsymbol{x})$ is defined on the state $(t, b, \boldsymbol{x})$. In addition, considering the case where the feature vector $\boldsymbol{x}$ has not yet been observed, another optimal value function $V(t, b)$ is defined, related by $V(t, b) = \int_{\boldsymbol{X}} p_{x}(\boldsymbol{x}) V(t, b, \boldsymbol{x})\, d\boldsymbol{x}$.

By definition, $V(0, b, \boldsymbol{x}) = V(0, b) = 0$, meaning there are no auctions left. Based on the transition probabilities and rewards in Equation (1), the following relationship holds:


$$
\begin{aligned} V(t, b, \boldsymbol{x})=& \max _{0 \leq a \leq b}\left\{\sum_{\delta=0}^{a} \int_{\boldsymbol{X}} m(\delta, \boldsymbol{x}) p_{x}\left(\boldsymbol{x}_{t-1}\right)\right.\\ &\left(\theta(\boldsymbol{x})+V\left(t-1, b-\delta, \boldsymbol{x}_{t-1}\right)\right) d \boldsymbol{x}_{t-1}+\\ &\left.\sum_{\delta=a+1}^{\infty} \int_{\boldsymbol{X}} m(\delta, \boldsymbol{x}) p_{x}\left(\boldsymbol{x}_{t-1}\right) V\left(t-1, b, \boldsymbol{x}_{t-1}\right) d \boldsymbol{x}_{t-1}\right\} \\=& \max _{0 \leq a \leq b}\left\{\sum_{\delta=0}^{a} m(\delta, \boldsymbol{x})(\theta(\boldsymbol{x})+V(t-1, b-\delta))+\right.\\ &\left.\sum_{\delta=a+1}^{\infty} m(\delta, \boldsymbol{x}) V(t-1, b)\right\} \end{aligned} \quad (4)
$$

The first sum on the right-hand side corresponds to winning the bid and the second to losing it. In state $(t, b, \boldsymbol{x})$, the optimal bid is


$$
a(t, b, \boldsymbol{x})=\underset{0 \leq a \leq b}{\operatorname{argmax}}\left\{\sum_{\delta=0}^{a} m(\delta, \boldsymbol{x})(\theta(\boldsymbol{x})+V(t-1, b-\delta))+\sum_{\delta=a+1}^{\infty} m(\delta, \boldsymbol{x}) V(t-1, b)\right\} \quad (5)
$$

The optimal bidding action $a(t, b, \boldsymbol{x})$ involves three factors: $m(\delta, \boldsymbol{x})$, $\theta(\boldsymbol{x})$ and $V(t-1, \cdot)$. $V(t, b)$ is obtained by marginalizing over $\boldsymbol{x}$:


$$
\begin{aligned}
V(t, b)=& \int_{\boldsymbol{X}} p_{x}(\boldsymbol{x}) \max _{0 \leq a \leq b}\left\{\sum_{\delta=0}^{a} m(\delta, \boldsymbol{x})(\theta(\boldsymbol{x})+V(t-1, b-\delta))\right. \\
&\left.+\sum_{\delta=a+1}^{\infty} m(\delta, \boldsymbol{x}) V(t-1, b)\right\} d \boldsymbol{x} \\
=& \max _{0 \leq a \leq b}\left\{\sum_{\delta=0}^{a} \int_{\boldsymbol{X}} p_{x}(\boldsymbol{x}) m(\delta, \boldsymbol{x}) \theta(\boldsymbol{x}) d \boldsymbol{x}+\sum_{\delta=0}^{a} V(t-1, b-\delta)\right. \\
&\left.\quad \int_{\boldsymbol{X}} p_{x}(\boldsymbol{x}) m(\delta, \boldsymbol{x}) d \boldsymbol{x}+V(t-1, b) \sum_{\delta=a+1}^{\infty} \int_{\boldsymbol{X}} p_{x}(\boldsymbol{x}) m(\delta, \boldsymbol{x}) d \boldsymbol{x}\right\} \\
=& \max _{0 \leq a \leq b}\left\{\sum_{\delta=0}^{a} \int_{\boldsymbol{X}} p_{x}(\boldsymbol{x}) m(\delta, \boldsymbol{x}) \theta(\boldsymbol{x}) d \boldsymbol{x}+\right. \\
&\left.\sum_{\delta=0}^{a} m(\delta) V(t-1, b-\delta)+V(t-1, b) \sum_{\delta=a+1}^{\infty} m(\delta)\right\} \quad (6)
\end{aligned}
$$

To deal with the integrals in Equation (6), the approximation $m(\delta, \boldsymbol{x}) \approx m(\delta)$ is used, which gives


$$
\int_{\boldsymbol{X}} p_{x}(\boldsymbol{x}) m(\delta, \boldsymbol{x}) \theta(\boldsymbol{x}) d \boldsymbol{x} \approx m(\delta) \int_{\boldsymbol{X}} p_{x}(\boldsymbol{x}) \theta(\boldsymbol{x}) d \boldsymbol{x} =m(\delta) \theta_{\mathrm{avg}} \quad (7)
$$

where $\theta_{\mathrm{avg}}$ is the expected value of the pCTR $\theta$, which can be estimated from historical data. Substituting Equation (7) into Equation (6) gives

$$
V(t, b) \approx \max _{0 \leq a \leq b}\left\{\sum_{\delta=0}^{a} m(\delta) \theta_{\mathrm{avg}}+\sum_{\delta=0}^{a} m(\delta) V(t-1, b-\delta)+\sum_{\delta=a+1}^{\infty} m(\delta) V(t-1, b)\right\} \quad (8)
$$

Since $\sum_{\delta=0}^{\infty} m(\delta, \boldsymbol{x}) = 1$, Equation (5) can be rewritten as

$$
\begin{aligned}
a(t, b, \boldsymbol{x}) &=\underset{0 \leq a \leq b}{\operatorname{argmax}}\left\{\sum_{\delta=0}^{a} m(\delta, \boldsymbol{x})(\theta(\boldsymbol{x})+V(t-1, b-\delta))-\sum_{\delta=0}^{a} m(\delta, \boldsymbol{x}) V(t-1, b)\right\} \\
&=\underset{0 \leq a \leq b}{\operatorname{argmax}}\left\{\sum_{\delta=0}^{a} m(\delta, \boldsymbol{x})(\theta(\boldsymbol{x})+V(t-1, b-\delta)-V(t-1, b))\right\} \\
&\equiv \underset{0 \leq a \leq b}{\operatorname{argmax}}\left\{\sum_{\delta=0}^{a} m(\delta, \boldsymbol{x}) g(\delta)\right\} \quad (9)
\end{aligned}
$$

where $g(\delta)=\theta(\boldsymbol{x})+V(t-1, b-\delta)-V(t-1, b)$. The following conclusions can be drawn:

  • $V(t-1, b)$ is non-decreasing in b, so $V(t-1, b-\delta)$ decreases as $\delta$ increases
  • $g(\delta)$ decreases as $\delta$ increases
  • $g(0) = \theta(\boldsymbol{x}) \geq 0$ and $m(\delta, \boldsymbol{x}) \geq 0$
  • If $g(b) \geq 0$: for $0 \leq b' \leq b$ we have $g(b') \geq g(b) \geq 0$, so $m(\delta, \boldsymbol{x})\, g(\delta) \geq 0$ for all $0 \leq \delta \leq b$, and therefore $a(t, b, \boldsymbol{x}) = b$
  • If $g(b) < 0$: there must exist an integer A with $0 \leq A < b$ such that $g(A) \geq 0$ and $g(A+1) < 0$. Then $m(\delta, \boldsymbol{x})\, g(\delta) \geq 0$ when $\delta \leq A$, and $m(\delta, \boldsymbol{x})\, g(\delta) < 0$ when $\delta > A$

The optimal bid therefore takes the following form:


$$
a(t, b, \boldsymbol{x})=\left\{\begin{array}{lll}b & & \text { if } g(b) \geq 0 \\ A & g(A) \geq 0 \text { and } g(A+1)<0 & \text { if } g(b)<0\end{array}\right. \quad (10)
$$
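Below is a minimal Python sketch of Equation (10): it evaluates $g(\delta)$ from a value table and returns either $b$ or the threshold $A$. The table `V`, indexed as `V[t][b]`, is an assumption of this sketch (e.g., the output of the dynamic program sketched after Algorithm 1 below).

```python
def optimal_bid(theta_x, V, t, b):
    """Optimal bid per Eq. (10), with g(d) = theta(x) + V(t-1, b-d) - V(t-1, b).

    `V` is assumed to be a precomputed table with V[t][b]; `theta_x` is the
    pCTR of the current bid request.
    """
    def g(delta):
        return theta_x + V[t - 1][b - delta] - V[t - 1][b]

    if g(b) >= 0:                 # every feasible delta still has non-negative gain
        return b
    # g is non-increasing in delta and g(0) = theta(x) >= 0, so scan for the
    # largest A < b with g(A) >= 0 and g(A + 1) < 0 (a binary search also works).
    a = 0
    while a + 1 <= b and g(a + 1) >= 0:
        a += 1
    return a
```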

The figure below shows the distribution of $g(\delta)$ on the dataset, where an upper bound $\delta_{\max}$ is set for the market price.

The Reinforcement Learning to Bid (RLB) algorithm (Algorithm 1 in the paper) follows from the derivation above; a rough sketch of its value update is given below.
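This is a rough Python sketch of the dynamic program behind Algorithm 1, based on the approximate recursion of Equation (8). It assumes the market price is discretized and truncated at `len(m) - 1`, with `m[delta]` the probability of price `delta`; it illustrates the recursion, not the paper's exact pseudocode.

```python
def compute_value_table(T, B, m, theta_avg):
    """Fill V[t][b] for t in 0..T and b in 0..B using Eq. (8)."""
    V = [[0.0] * (B + 1) for _ in range(T + 1)]   # V(0, b) = 0 by definition
    for t in range(1, T + 1):
        for b in range(B + 1):
            best = 0.0
            win_prob = 0.0      # sum_{delta <= a} m(delta)
            win_value = 0.0     # sum_{delta <= a} m(delta) * (theta_avg + V(t-1, b-delta))
            # evaluate the objective of Eq. (8) incrementally as the bid a grows
            for a in range(0, min(b, len(m) - 1) + 1):
                win_prob += m[a]
                win_value += m[a] * (theta_avg + V[t - 1][b - a])
                value = win_value + (1.0 - win_prob) * V[t - 1][b]
                best = max(best, value)
            V[t][b] = best
    return V
```

Combined with the `optimal_bid` sketch above, this gives a small-scale version of the derived policy.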

Discussion on Derived Policy

RLB adjusts the bid according to t and b, as shown in the figure below. When b is large, the bid is roughly linear in the pCTR; when b is small (b < 300), the bid becomes nonlinear.

Discussion on the Approximation of $V(t, b)$

This part is omitted here, as it does not feel essential.

2.3 Dealing with Large-Scale Problems

Algorithm 1 gives a way to solve for the optimal policy, but it runs into problems in practice. Updating $V(t, b)$ involves a two-layer loop, so the algorithm has time complexity $O(TB)$ and space complexity $O(TB)$. In practice, T and B are so large that computing $V(T, B)$ is too slow and the value table cannot be stored.

Because of limited computing resources, the full value function may not be updatable. Therefore, a parameterized model is fit on a small-scale range, namely $\{0, \ldots, T_0\} \times \{0, \ldots, B_0\}$, and then extended to the large-scale range $\{0, \ldots, T\} \times \{0, \ldots, B\}$.

For most $(t, b)$, $V(t, b)$ will be much greater than $\theta_{\mathrm{avg}}$; when the budget b is large enough, $V(t, b)$ is roughly $t \times \theta_{\mathrm{avg}}$. As a result, optimizing $V(t, b)$ directly is difficult.

To address this, the paper optimizes the difference function $D(t, b) = V(t, b+1) - V(t, b)$ directly; differences of $V(t, b)$ are then recovered as follows


$$
V(t-1, b-\delta)-V(t-1, b)=-\sum_{\delta^{\prime}=1}^{\delta} D\left(t-1, b-\delta^{\prime}\right)
$$
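The identity above is a telescoping sum, so the value gap needed by $g(\delta)$ can be recovered from D alone. A minimal sketch, where `D` is any callable approximation of $D(t, b)$ (e.g., a table lookup or the neural network introduced below):

```python
def value_gap(D, t_minus_1, b, delta):
    """Return V(t-1, b-delta) - V(t-1, b) = -sum_{d'=1..delta} D(t-1, b-d')."""
    return -sum(D(t_minus_1, b - dp) for dp in range(1, delta + 1))

# g(delta) from Eq. (9) then becomes: theta_x + value_gap(D, t - 1, b, delta)
```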

The properties of $D(t, b)$ are as follows. For fixed t, write $D(t, b)$ as $D_t(b)$; for fixed b, write $D(t, b)$ as $D_b(t)$. Then:

  • $D_t(b)$ fluctuates considerably when b is small
  • $D_t(b)$ decreases as b increases
  • $D_b(t)$ increases as t increases
  • $D_b(t)$ and $D_t(b)$ are nonlinear functions

A neural network is used to fit $D(t, b)$: the input is $(t, b)$ and the output is $D(t, b)$. The network is denoted $NN(t, b)$; a small sketch of this fit is given below.
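A minimal PyTorch sketch of fitting $NN(t, b) \approx D(t, b)$ on the small-scale table produced by the dynamic program. The architecture (two hidden ReLU layers), optimizer, training loop, and the lack of input normalization are assumptions of this sketch, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class DNet(nn.Module):
    """Small fully connected network mapping (t, b) -> D(t, b)."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, t, b):
        x = torch.stack([t, b], dim=-1).float()
        return self.net(x).squeeze(-1)

def fit_nn(D_table, T0, B0, epochs=200, lr=1e-3):
    """Fit NN(t, b) on D_table[t][b] for t in 1..T0, b in 0..B0-1."""
    model = DNet()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    t_grid, b_grid = torch.meshgrid(
        torch.arange(1, T0 + 1), torch.arange(0, B0), indexing="ij")
    t_flat, b_flat = t_grid.reshape(-1), b_grid.reshape(-1)
    target = torch.tensor(
        [[float(D_table[t][b]) for b in range(B0)] for t in range(1, T0 + 1)]
    ).reshape(-1)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(t_flat, b_flat), target)
        loss.backward()
        opt.step()
    return model
```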

Coarse-to-fine Episode Segmentation Model

To avoid modeling $D(t, b)$ or $V(t, b)$ over the full large-scale range, states with $t > T_0$ and $b > B_0$ are mapped to $t \leq T_0$ and $b \leq B_0$. Similar to budget pacing, a first, simple implicit mapping divides the large episode into several smaller episodes of length $T_0$, with the remaining budget allocated across the remaining small episodes. If the agent does not spend its allocation within one small episode, more budget becomes available for the other small episodes of the large episode. A sketch of this allocation is given below.
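A minimal sketch of the allocation, matching the $B_s = B_r / N_r$ rule used later by RLB-NN-Seg; rounding the number of remaining sub-episodes up is an assumption of this sketch.

```python
import math

def subepisode_budget(remaining_budget, remaining_auctions, T0):
    """Give the current sub-episode of length T0 an even share of the
    remaining budget: B_s = B_r / N_r. Budget not spent in one sub-episode
    automatically rolls over, since B_r is re-read at the next sub-episode."""
    n_remaining = max(1, math.ceil(remaining_auctions / T0))
    return remaining_budget / n_remaining
```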

State Mapping Models

Since $D_t(b)$ decreases as b increases and $D_b(t)$ increases as t increases, when t and b are both large there should exist states $\{(t', b')\}$ satisfying $D(t', b') = D(t, b)$, i.e., a state mapping exists for $D(t, b)$. Likewise, $a(t, b, \boldsymbol{x})$ decreases as t increases and increases as b increases, so $a(t, b, \boldsymbol{x})$ admits a similar mapping.

From the perspective of actual bidding, when the number of remaining auctions is large and the budget conditions are similar, the agent should make similar bids for the same bid request (see Figure 2). The paper considers the simple case where b/t represents the budget condition. This gives two forms of linear mapping (sketched in code below):

  • When $t > T_0$, map $a(t, b, \boldsymbol{x})$ to $a(T_0, \frac{b}{t} \times T_0, \boldsymbol{x})$
  • When $t > T_0$, map $D(t, b)$ to $D(T_0, \frac{b}{t} \times T_0)$

Let $\operatorname{Dev}(t, T_0, b) = \left|D(t, b) - D(T_0, \frac{b}{t} \times T_0)\right|$; the figure below indicates that the mapping is reasonable.
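A minimal sketch of the linear state mapping applied when $t > T_0$; rounding the mapped budget to an integer and clipping it at $B_0$ are assumptions of this sketch.

```python
def map_state(t, b, T0, B0=None):
    """Map (t, b) with t > T0 to (T0, b/t * T0), keeping the budget condition
    b/t unchanged. The mapped state can then be fed to NN(T0, .) (RLB-NN-MapD)
    or used to query the action a(T0, ., x) (RLB-NN-MapA)."""
    if t <= T0:
        return t, b
    mapped_b = round(b / t * T0)
    if B0 is not None:
        mapped_b = min(mapped_b, B0)   # stay inside the small-scale range
    return T0, mapped_b
```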

3 Experiments

3.1 Experimental Settings

The experimental datasets are described in the original paper. The main evaluation metric is the number of clicks, along with CPM, eCPC, etc.; other experimental settings can also be found in the original paper.

The compared methods are as follows (for some methods the specific implementation is only given in the original paper; this is just a brief record):

  • SS-MDP: considers the winning situation given the bid price, but ignores the feature vector of the bid request.
  • Mcpc: bids $a_{\mathrm{Mcpc}}(t, b, \boldsymbol{x}) = \mathrm{CPC} \times \theta(\boldsymbol{x})$ (see the sketch after this list)
  • Lin: linear bidding strategy based on the pCTR, $a_{\mathrm{Lin}}(t, b, \boldsymbol{x}) = b_0 \frac{\theta(\boldsymbol{x})}{\theta_{\mathrm{avg}}}$ (also sketched after this list)
  • RLB: the method proposed in Algorithm 1, suitable for small-scale bidding
  • RLB-NN: uses $NN(t, b)$ to approximate $D(t, b)$
  • RLB-NN-Seg: splits the budget as $B_s = B_r / N_r$, where $B_r$ is the remaining budget and $N_r$ the number of remaining sub-episodes, and bids with RLB-NN; this strategy corresponds to the coarse-to-fine episode segmentation model.
  • RLB-NN-MapD: uses $NN(t, b)$ to approximate $D(t, b)$ and, when $t > T_0$, maps $D(t, b) = NN(T_0, \frac{b}{t} \times T_0)$
  • RLB-NN-MapA: uses $NN(t, b)$ to approximate $D(t, b)$ and, when $t > T_0$, maps $a(t, b, \boldsymbol{x})$ to $a(T_0, \frac{b}{t} \times T_0, \boldsymbol{x})$
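For reference, minimal sketches of the two simple baselines named above; `cpc` is the target cost per click for Mcpc and `b0` is Lin's base bid, both tuned on training data (the tuning itself is omitted).

```python
def bid_mcpc(theta_x, cpc):
    """Mcpc: a_Mcpc(t, b, x) = CPC * theta(x); independent of t and b."""
    return cpc * theta_x

def bid_lin(theta_x, theta_avg, b0):
    """Lin: a_Lin(t, b, x) = b0 * theta(x) / theta_avg."""
    return b0 * theta_x / theta_avg
```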

3.2 Experimental Results

Small-Scale Evaluation

Under T=1000 and different budgets, the experimental results are shown in the figure below

The conclusions are:

  • The proposed RLB model performs best under all budget conditions, which verifies the effectiveness of the proposed algorithm in optimizing the number of clicks.
  • Lin, the bidding strategy most widely used in industry, performs second best
  • In contrast to RLB and Lin, Mcpc does not adjust its strategy when the budget condition changes. It therefore performs very well when $c_0 > 1/4$, but poorly under very limited budgets
  • SS-MDP performs worst because it ignores the feature information of each bid request, which illustrates the advantage of RTB display advertising

Comparing the win rates, the conclusions are as follows:

  • SS-MDP consistently has the highest win rate under all budget conditions, because it treats every bid request equally, so its optimization objective is equivalent to the number of impressions
  • Lin and RLB are very close in CPM and eCPC. RLB obtains more clicks at a similar CPM and eCPC than Lin, because RLB spends its budget effectively according to market conditions while Lin does not

Experimental results under different budgets are shown in the table

The conclusions are:

  • RLB is robust and significantly superior to Lin in the vast majority of cases

With the budget set to 1/5, the experimental results under different T are shown in the figure below.

The conclusion is as follows

  • Compared with Lin, RLB obtains more clicks at a similar eCPC

The table below compares the CTR estimation results with the bidding results. When the CTR estimator is weak (AUC < 70%), SS-MDP is quite competitive in clicks compared with Mcpc and Lin. In contrast, as the CTR estimator improves, the other methods that rely on it obtain more clicks than SS-MDP.

Large-Scale Evaluation

The small-scale setting is $T_0 = 10{,}000$ and $B_0 = \mathrm{CPM}_{\mathrm{train}} \times 10^{-3} \times T_0 \times 1/2$. The large-scale setting is $T = 100{,}000$ and $B = \mathrm{CPM}_{\mathrm{train}} \times 10^{-3} \times T_0 \times c_0$, and $NN(t, b)$ is obtained by training the neural network.

The following table shows the approximation results of the neural network on different datasets. The RMSE is much smaller than $\theta_{\mathrm{avg}}$, i.e., the neural network approximates well on $(t, b) \in \{0, \ldots, T_0\} \times \{0, \ldots, B_0\}$.

With $T = 100{,}000$, the experimental results are as follows

The conclusions are:

  • The performance of Mcpc is similar to that observed at small scale
  • In terms of total clicks, RLB-NN performs better than Lin at $c_0$ = 1/32, 1/16 and 1/8, and worse than Lin at $c_0$ = 1/2, indicating that the neural network generalizes satisfactorily under small budget conditions but poorly under large ones
  • Compared with RLB-NN, the three more sophisticated algorithms RLB-NN-Seg, RLB-NN-MapD and RLB-NN-MapA are more robust than Lin under various budget conditions. They do not depend on the generalization ability of the approximation model, so their performance is more stable. The results show that they are effective solutions for large-scale problems
  • In terms of eCPC, all models except Mcpc are very close, which makes the proposed RLB algorithms more practical

Online Deployment and A/B Test

The online experimental results are shown in the figure below

The conclusions are:

  • At the same cost, RLB's eCPC is lower than Lin's, resulting in more total clicks, so RLB is more efficient
  • RLB plans better than Lin: the clicks obtained and the budget spent grow steadily over time
  • Through better planning, RLB achieves a lower CPM than Lin, generating more bids and winning more impressions
  • At a lower cost and with a lower CPM, RLB achieves better results, and its CTR also outperforms Lin's

4 Summary

This paper contains many formulas, which are easier to follow with some background in reinforcement learning. The derivations throughout are very clear, and reading the original paper carefully is recommended (with so many formulas, these notes are a bit messy).
