Basic principles of MAB (multi-armed bandit) intelligent traffic tuning

Author: Songbao Write Code (daily interview question, lab, and more). A ByteDance engineer who loves to tinker, is committed to full-stack development, and is working hard to grow. The future is promising.

I. Origin of MAB

You are a gambler walking into a casino, facing a row of slot machines. They all look the same, but each one pays out gold with a different probability, and you do not know what those probabilities are.

So the question is: which slot machine should you choose on each play to maximize your total revenue?

Rules:

(1) There are n machines. Each time, one machine is selected and its lever is pulled. The machine then pays a random reward, such as gold coins, and each machine's reward follows its own probability distribution.

(2) The gambler has m chances to pull a lever, and the goal is to maximize the total reward. The task is to decide which machine to select on each of the m pulls. A smart gambler keeps a record of every reward, tries to find the machine with the highest reward, and pulls its lever as many times as possible.

(3) The payout obtained is called the reward. It is generally assumed to follow a Bernoulli distribution (zh.wikipedia.org/wiki/%E4%BC…), that is, each pull either pays out (1) or does not (0).

(4) The biggest difficulty: balancing the pulls spent finding the machine with the highest reward (exploration) against the pulls spent on the machine that currently looks best to maximize revenue (exploitation).

As shown in the figure above, the second machine, Machine 2, has the highest reward. To obtain the maximum benefit, the gambler should select Machine 2 as often as possible.
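The rules above can be simulated in a few lines. The following is a minimal sketch, not the method used by any particular product: the payout probabilities are made up, and the epsilon-greedy rule (explore at random with probability `epsilon`, otherwise play the machine with the best recorded average) is just one simple illustrative strategy for a record-keeping gambler.

```python
import random


def pull(prob, rng):
    """Pull one lever: return 1 (gold) with probability `prob`, else 0."""
    return 1 if rng.random() < prob else 0


def play(probs, m, epsilon=0.1, seed=0):
    """Spend m pulls using an epsilon-greedy rule (an illustrative assumption)."""
    rng = random.Random(seed)
    n = len(probs)
    pulls = [0] * n  # how many times each machine has been tried
    wins = [0] * n   # total reward recorded for each machine
    for _ in range(m):
        if rng.random() < epsilon or 0 in pulls:
            # Explore: try a random machine (always, until each was tried once).
            arm = rng.randrange(n)
        else:
            # Exploit: play the machine with the best observed average reward.
            arm = max(range(n), key=lambda i: wins[i] / pulls[i])
        reward = pull(probs[arm], rng)
        pulls[arm] += 1
        wins[arm] += reward
    return pulls, wins


# Hypothetical payout probabilities; the second machine is the best one.
pulls, wins = play([0.1, 0.8, 0.3], m=5000)
print("pulls per machine:", pulls)
print("total reward:", sum(wins))
```

The gambler never sees `probs` directly; only the recorded `pulls` and `wins` drive the choice, which is what makes the balance in rule (4) difficult.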

II. What is MAB

To summarize the origin of MAB: a gambler comes to a casino to play slot machines. Each machine has a different probability of winning, each play costs one coin, and the gambler has only T coins. How should the plays be divided among the machines to obtain the highest expected return?

What is a multi-armed slot machine? Where does "multi-armed" come from?

Casino slot machines are nicknamed one-armed bandits because they take your money even though they have only one arm. The multi-armed slot machine (or multi-armed bandit) was inspired by this nickname. Suppose you enter a casino and face a row of slot machines (and therefore multiple arms). Since the expected returns and losses vary from machine to machine, what selection strategy should you adopt to make your total return as high as possible?

Define MAB:

MAB (Multi-Armed Bandit) intelligent tuning, that is, intelligent traffic tuning, is based on multi-armed bandit algorithms rooted in Bayesian theory. It runs on a timed cycle and automatically tilts the allocated traffic toward the version that performs best on the core indicator. You set an initial traffic weight for each version, and intelligent tuning then adjusts the weights. The policy also lets users stop pushing one experimental version in the group and push the other versions instead.

Too much of a mouthful, or confusing? In one sentence:

Set the traffic weight of different experimental versions, and automatically tilt the allocated traffic to the experimental version with the best performance of core indicators, so as to produce the winning version.
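As a rough illustration of what "traffic weight" means in practice, the sketch below assigns each incoming user to a version with probability proportional to that version's weight. The version names and weights are hypothetical, and how the weights get updated is left out here; this only shows the allocation step.

```python
import random
from collections import Counter


def assign_version(weights, rng):
    """Pick a version name with probability proportional to its traffic weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=1)[0]


# Hypothetical initial weights: 20% of traffic to "control", 80% to "variant".
weights = {"control": 0.2, "variant": 0.8}
rng = random.Random(42)
counts = Counter(assign_version(weights, rng) for _ in range(10_000))
print(counts)
```

Tilting traffic toward the better version then amounts to replacing `weights` with new values at the start of each tuning round.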

III. Why MAB

The AB experiment relies on classical statistical tests of statistical significance.

When we propose a new product feature, we may want to test whether the new feature is really useful before we release it to the entire user base.

Implementation: we have a control group and an experimental group (which has access to the new feature). We measure a key indicator for both groups, such as average time on site (social networks), average payment time (e-commerce), or click-through rate (online advertising), and finally check whether the difference between the two groups is statistically significant.

A balanced AB experiment will allocate equal traffic to each group until a sufficient sample size is reached. However, we cannot adjust the traffic allocation according to the observed conditions during the AB experiment, which is also the disadvantage of the AB experiment: if the experimental group is significantly better than the control group, we still need to spend a large amount of traffic on the control group to obtain statistical significance.
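To make the "statistical significance" check concrete, here is a minimal sketch of one classical test, a two-sided two-proportion z-test on conversion counts. The choice of this particular test and the example numbers are assumptions for illustration; the article does not say which test a given platform uses.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: control converts 100/1000, experimental group 150/1000.
z, p = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Both groups must reach their planned sample size before this test is read, which is why the control group keeps receiving half of the traffic even when the experimental group is clearly ahead.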

So the biggest advantage of MAB is that it suits early-stage startups with low user traffic: it requires a smaller sample size, can stop early, and is more agile than a conventional AB experiment.

IV. Benefits and costs

In terms of an AB experiment, each slot machine represents an experimental group, each pull of an arm represents one exposure of an experimental version, and the cumulative return represents the cumulative conversions on the core indicator.

MAB uses probability distributions to find the experimental version most likely to be optimal and increases the traffic allocated to it. The experimental return is calculated in real time, and traffic is adjusted dynamically to maximize that return.

Considerations: find a balance between quickly discovering and converging on the high-value version (exploitation) and not giving up on trying the other versions (exploration).
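The benefit of tilting traffic can be shown with simple arithmetic. The sketch below computes the expected cumulative conversions for a given traffic split; the conversion rates and splits are hypothetical numbers chosen only to illustrate the mapping described above (version = arm, exposure = pull, conversions = cumulative return).

```python
import math


def expected_conversions(rates, shares, exposures):
    """Expected cumulative conversions when `exposures` are split by `shares`."""
    if not math.isclose(sum(shares), 1.0):
        raise ValueError("traffic shares must sum to 1")
    return sum(rate * share * exposures for rate, share in zip(rates, shares))


rates = [0.10, 0.20]  # hypothetical true conversion rates of two versions
even = expected_conversions(rates, [0.5, 0.5], 10_000)    # balanced AB split
tilted = expected_conversions(rates, [0.1, 0.9], 10_000)  # traffic tilted to the better version
print(even, tilted)
```

In a real experiment the true rates are unknown, so the tilted split has to be learned from data, and some traffic must keep flowing to the weaker-looking versions in case the early estimates are wrong.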

1. Comparison between conventional AB experiment and MAB experiment

Traffic allocation in a conventional A/B experiment and in an MAB intelligent experiment is shown in the following two figures:

2. When is MAB used?

Compared with traditional A/B experiments, MAB intelligent traffic tuning experiments are recommended for the following scenarios:

Promotional offers: This scenario focuses on improving conversion rates. During the promotion period, an MAB experiment sends more traffic to the better variant and less traffic to the weaker one, so the gains are realized as quickly as possible.
Push strategy: Push copy and titles are content with a short life cycle. After a fixed period of activity they lose relevance. An MAB experiment can maximize the effect of the strategy as early as possible.
Landing page optimization: Try several different versions of the target landing page to improve the registration rate for webinars, conferences, events, and so on.

3. Problems with MAB

In some cases MAB intelligent traffic tuning may extend the experiment time, although in most cases it terminates early. It is important to note that there is no such thing as a free lunch: the convenience of a smaller sample size comes at the expense of a higher false positive rate.

Description:

False positive rate: the probability that an experiment or test concludes there is an effect (for example, that one version is better) when in fact there is none.

V. Brief introduction of key technologies

1. Thompson sampling

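Thompson sampling is based on Bayesian inference. Because it works with posterior distributions rather than repeated significance tests, it reduces the risk of errors introduced by multiple comparisons and repeated observations. In short, in each round Thompson sampling draws one sample from each arm's posterior distribution and chooses the arm with the highest sampled value. It is therefore not purely greedy: arms that look worse are still tried occasionally, in proportion to the probability that they are actually the best.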
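A minimal sketch of Thompson sampling for Bernoulli rewards follows. It assumes a uniform Beta(1, 1) prior on each arm's conversion rate, so the posterior after s successes and f failures is Beta(s + 1, f + 1). The arm probabilities in the usage example are invented, and this is an illustration of the general technique rather than any platform's actual implementation.

```python
import random


def thompson_choose(successes, failures, rng):
    """Draw one sample per arm from its Beta posterior; return the best arm's index."""
    samples = [
        rng.betavariate(s + 1, f + 1)  # Beta(1, 1) prior is an assumption
        for s, f in zip(successes, failures)
    ]
    return max(range(len(samples)), key=samples.__getitem__)


def run(probs, rounds, seed=0):
    """Simulate `rounds` exposures against hypothetical true rates `probs`."""
    rng = random.Random(seed)
    k = len(probs)
    successes = [0] * k
    failures = [0] * k
    for _ in range(rounds):
        arm = thompson_choose(successes, failures, rng)
        if rng.random() < probs[arm]:  # simulated conversion
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures


successes, failures = run([0.1, 0.5, 0.3], rounds=2000)
print("exposures per arm:", [s + f for s, f in zip(successes, failures)])
```

Early on the posteriors are wide, so every arm gets sampled; as evidence accumulates, the posterior of the best arm concentrates and it receives most of the exposures.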

2. Allocate traffic intelligently

We provide a real-time data collection service that scans the data every 30 seconds to calculate the experimental effect, and we use an intelligent traffic-splitting algorithm to distribute the optimal traffic to each version in real time, which minimizes the cost of the experiment in terms of performance and monetization.

3. Monte Carlo simulation

Monte Carlo simulation is used to decide when the experiment has converged. It works by drawing random samples for each of the K arms many times and empirically calculating how often each arm wins. If the winner beats the others by a large enough margin, the experiment is terminated.
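The procedure can be sketched as follows, assuming Bernoulli rewards with Beta(1, 1) priors so that each arm's posterior is a Beta distribution. The 0.95 stopping threshold and the example counts are assumptions for illustration; the article does not state what margin is actually used.

```python
import random


def prob_best(successes, failures, draws=10_000, seed=0):
    """Estimate, for each arm, the probability that it is the best arm."""
    rng = random.Random(seed)
    k = len(successes)
    wins = [0] * k
    for _ in range(draws):
        # One joint draw: sample a plausible conversion rate for every arm.
        samples = [
            rng.betavariate(s + 1, f + 1)
            for s, f in zip(successes, failures)
        ]
        wins[max(range(k), key=samples.__getitem__)] += 1
    return [w / draws for w in wins]


def should_stop(probabilities, threshold=0.95):
    """Stop when one arm wins often enough (the threshold is an assumption)."""
    return max(probabilities) >= threshold


# Hypothetical counts: arm 0 converted 10/1000 times, arm 1 converted 120/1000.
p = prob_best([10, 120], [990, 880])
print(p, should_stop(p))
```

The same win frequencies can also serve as the next round's traffic weights, which is one common way to connect this step with Thompson sampling.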

4. Intelligent experiment report

The automated experimental conclusion works like this: the intelligent traffic experiment calculates the probability that each experimental version is optimal and produces a simple, easy-to-understand analysis report. Six kinds of analysis report are provided: revenue evaluation, tuning rounds, new user trend, indicator trend, box-and-whisker snapshot, and probability distribution.

Revenue evaluation: shows the revenue of each round of intelligent tuning, that is, the gain in percentage points (PP) compared with an ordinary experiment, so the convergence of revenue over several rounds of tuning can be seen clearly.

New user trend: displays the daily trend in the number of new users entering the experimental version and the control version.

Daily trend: displays indicator data by day, with the indicator values of the experimental version and the control version for each day.

Probability distribution: it shows the value of the indicator and its probability distribution. The comparison of the probability distribution between the experimental group and the control group can assist in judging the difference between the experimental group and the control group.

Box-and-whisker snapshot: also known as a box plot, it reflects the distribution of the original data through the maximum, minimum, median, and two quartiles. By comparing the box-and-whisker snapshots of the experimental group and the control group, the data distributions of the two groups can be compared.

Cumulative trend: indicator data from the beginning of the experiment up to the current day.
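The box-and-whisker snapshot described above boils down to five numbers per group. Below is a small sketch using Python's standard `statistics` module; the sample values are made up, and the "inclusive" quartile method is one of several conventions, chosen here as an assumption.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical daily indicator values for the two groups.
control = [1, 2, 3, 4, 5]
experiment = [2, 4, 6, 8, 10]
print("control:   ", five_number_summary(control))
print("experiment:", five_number_summary(experiment))
```

Comparing the two tuples side by side gives the same information a reader would take from the two boxes in the report.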

