AlphaGo main logic

AI = Policy Value Network + MCTS

Policy Value Network

Policy network: given the current state as input, the neural network outputs the probability of taking each action in this state

Value network: the value of the current situation, i.e. an estimate of the final game outcome from this state
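
The two heads are usually implemented as a single network with a shared body. A minimal PyTorch sketch of such a combined policy value network (layer sizes and names are illustrative assumptions, not AlphaGo's actual architecture):

import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyValueNet(nn.Module):
    # Minimal sketch of a combined policy-value network for a width x height board.
    def __init__(self, width, height, channels=4):
        super().__init__()
        self.conv = nn.Conv2d(channels, 32, kernel_size=3, padding=1)
        self.policy_fc = nn.Linear(32 * width * height, width * height)  # policy head
        self.value_fc = nn.Linear(32 * width * height, 1)                # value head

    def forward(self, state):
        x = F.relu(self.conv(state))
        x = x.view(x.size(0), -1)
        log_act_probs = F.log_softmax(self.policy_fc(x), dim=1)  # log P(action | state)
        value = torch.tanh(self.value_fc(x))                     # outcome estimate in [-1, 1]
        return log_act_probs, value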

MCTS, Monte Carlo tree search

The input is an ordinary policy; through MCTS we obtain an improved policy as output

Self-play games are completed through MCTS, and the resulting data is used to update the policy network

MCTS node definition

Node definition (the TreeNode class from which the tree is built)

Each tree node records its own Q value, its prior probability P, and u, the visit-count-adjusted second term of the UCT score (used for exploration)
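
A minimal sketch of such a node; the attribute names follow common AlphaZero-style open-source implementations and are assumptions, not the original source:

import numpy as np

class TreeNode(object):
    def __init__(self, parent, prior_p):
        self._parent = parent
        self._children = {}   # map from action to TreeNode
        self._n_visits = 0    # visit count N
        self._Q = 0.0         # mean action value
        self._u = 0.0         # exploration bonus (second term of UCT)
        self._P = prior_p     # prior probability from the policy network

    def get_value(self, c_puct):
        # UCT-style score: Q plus a prior- and visit-count-weighted bonus
        self._u = (c_puct * self._P *
                   np.sqrt(self._parent._n_visits) / (1 + self._n_visits))
        return self._Q + self._u

    def update(self, leaf_value):
        # Update Q as a running average of the evaluated leaf values
        self._n_visits += 1
        self._Q += (leaf_value - self._Q) / self._n_visits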

MCTS tree creation and use

Tree structure: the tree defines the space of feasible solutions; each path between the root node and a leaf node corresponds to one candidate solution

Monte Carlo method: MCTS does not need labeled samples prepared in advance; random statistical experiments (simulations) drive the search and supply its observations

Evaluation (loss) function: provides quantifiable, deterministic feedback for judging how good or bad a solution is => through random simulation, MCTS approximates the "real function" that the loss function represents

Backpropagation for incremental optimization: after the loss result of one path is obtained, backpropagation (Backup) updates all nodes along that whole path

Heuristic search strategy: the algorithm searches the whole solution space heuristically, following the principle of loss minimization, until it finds an optimal set of solutions or terminates early
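
Putting these together, one MCTS simulation (playout) selects down to a leaf, expands and evaluates it with the policy value function, and backs the value up along the path. A minimal sketch building on the TreeNode above; the board API (do_move, game_end, get_current_player) and the attribute names on the MCTS object are assumptions:

def _playout(self, state):
    # One MCTS simulation; `state` is a copy that can be mutated freely.
    node = self._root
    while node._children:                       # 1. Selection: follow max Q + u
        action, node = max(node._children.items(),
                           key=lambda item: item[1].get_value(self._c_puct))
        state.do_move(action)

    # 2. Expansion / evaluation with the policy value function
    action_probs, leaf_value = self._policy(state)
    end, winner = state.game_end()
    if not end:
        for action, prob in action_probs:
            node._children[action] = TreeNode(node, prob)
    else:
        # Terminal state: use the true result instead of the network estimate
        if winner == -1:   # tie
            leaf_value = 0.0
        else:
            leaf_value = 1.0 if winner == state.get_current_player() else -1.0

    # 3. Backup: propagate the value up the path, flipping sign between players
    while node is not None:
        node.update(leaf_value)
        leaf_value = -leaf_value
        node = node._parent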

MCTS-based AI Player

import numpy as np

class MCTSPlayer(object):
    def __init__(self, policy_value_function,
                 c_puct=5, n_playout=2000, is_selfplay=0):
        # Use MCTS for the search
        self.mcts = MCTS(policy_value_function, c_puct, n_playout)
        self._is_selfplay = is_selfplay

    # Set the player index
    def set_player_ind(self, p):
        self.player = p

    # Get the AI's move
    def get_action(self, board, temp=1e-3, return_prob=0):
        # Get all available moves
        sensible_moves = board.availables
        # MCTS returns the pi vector, as in the AlphaGo Zero paper
        move_probs = np.zeros(board.width * board.height)
        if len(sensible_moves) > 0:
            acts, probs = self.mcts.get_move_probs(board, temp)
            move_probs[list(acts)] = probs
            if self._is_selfplay:
                # Add Dirichlet noise for exploration (needed for self-play training)
                move = np.random.choice(
                    acts,
                    p=0.75*probs + 0.25*np.random.dirichlet(0.3*np.ones(len(probs))))
                # Update the root node and reuse the search tree
                self.mcts.update_with_move(move)
            else:
                # With the default temp=1e-3, this almost always selects the move
                # with the highest probability
                move = np.random.choice(acts, p=probs)
                # Reset the root node
                self.mcts.update_with_move(-1)
            if return_prob:
                return move, move_probs
            else:
                return move
        else:
            print("WARNING: the board is full")
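
The get_move_probs call above runs all playouts from the current board and turns the root's visit counts into the pi vector. A sketch of how that method of the MCTS class might look, assuming the _playout routine sketched earlier and a temperature-scaled softmax over log visit counts (attribute names are illustrative):

import copy
import numpy as np

def get_move_probs(self, state, temp=1e-3):
    # Run all playouts from the current state and return (actions, probabilities).
    for _ in range(self._n_playout):
        self._playout(copy.deepcopy(state))   # each playout works on its own copy

    # pi is proportional to N^(1/temp), computed as a softmax of log visit counts
    act_visits = [(act, node._n_visits)
                  for act, node in self._root._children.items()]
    acts, visits = zip(*act_visits)
    logits = (1.0 / temp) * np.log(np.array(visits) + 1e-10)
    probs = np.exp(logits - np.max(logits))
    probs /= np.sum(probs)
    return acts, probs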

Policy Value Network Implementation

Implementation details of Policy Value Network:

Definition of Neural Network Architecture (PyTorch)

In the training step, gradients must be cleared with zero_grad() before backpropagation

Definition of the loss (loss = value_loss + policy_loss)

Getting, saving, and loading the neural network parameters
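
A minimal PyTorch sketch of one training step, assuming the PolicyValueNet sketched earlier (which returns log action probabilities and a value); the function name and batch variables are illustrative:

import torch
import torch.nn.functional as F

def train_step(policy_value_net, optimizer, state_batch, mcts_probs, winner_batch):
    state_batch = torch.as_tensor(state_batch, dtype=torch.float32)
    mcts_probs = torch.as_tensor(mcts_probs, dtype=torch.float32)
    winner_batch = torch.as_tensor(winner_batch, dtype=torch.float32)

    optimizer.zero_grad()                       # clear gradients before backprop
    log_act_probs, value = policy_value_net(state_batch)

    # loss = value_loss + policy_loss
    value_loss = F.mse_loss(value.view(-1), winner_batch)
    policy_loss = -torch.mean(torch.sum(mcts_probs * log_act_probs, dim=1))
    loss = value_loss + policy_loss

    loss.backward()
    optimizer.step()
    return loss.item()

Parameters can then be saved with torch.save(net.state_dict(), path) and restored with net.load_state_dict(torch.load(path)).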

AI main process:

Collect self-play data through MCTS

Update the Policy Value Network with the self-play data

Evaluate the win rate of the current Policy Value Network

Judge the performance of the current model and save the best one
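
These steps fit into one training loop. A structural sketch in which the four callables are hypothetical helpers passed in as arguments, not functions from the original code:

def run_training(collect_selfplay_data, policy_update, policy_evaluate, save_model,
                 n_games=1500, check_freq=50):
    best_win_ratio = 0.0
    for i in range(n_games):
        play_data = collect_selfplay_data()          # 1. self-play data via MCTS
        policy_update(play_data)                     # 2. update the policy value net
        if (i + 1) % check_freq == 0:
            win_ratio = policy_evaluate()            # 3. evaluate current win rate
            if win_ratio > best_win_ratio:           # 4. keep the best model so far
                best_win_ratio = win_ratio
                save_model('best_policy.model')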

Reinforcement learning

Reinforcement learning does not need labels on its training data, but it does require feedback (reward or punishment) from the environment for each action => through this feedback, the behavior of the learning agent is continually adjusted

Strategies for Reinforcement Learning: Policy-based (Policy Gradients) and Value-based (Q-Learning)

Policy-based: directly predicts the action to take in the current environment state

Value-based: predicts the expected value (Q value) of every action in the current environment state, and executes the action with the highest Q value

Value-based methods suit a small number of discrete action values; policy-based methods suit environments with many action types or continuous action values
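
As a small illustration of the value-based family, a single tabular Q-learning update looks like this (a generic textbook sketch, not code from this project; Q is assumed to be indexed by state and action):

import numpy as np

def q_learning_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    # Target: immediate reward plus the discounted value of the best next action
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state][action] += alpha * (td_target - Q[state][action])
    return Q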

Human vs. machine play

Reinforcement learning and recommendation systems

Reinforcement learning:

It is a branch of machine learning (alongside supervised and unsupervised learning).

Unlike other learning methods, reinforcement learning learns a mapping from environment states to the agent's actions, with the goal of maximizing reward.

If one of the agent's decisions is positively rewarded, the agent's tendency to use that behavior in the future is strengthened

Reinforcement learning is the form of learning closest to learning in nature

Reinforcement learning combined with deep learning can solve the problem of generalizing over massive data (e.g. DeepMind's AlphaGo)

Build feedback mechanisms between agents and the environment

Earlier approaches were mostly based on supervised learning; lacking effective exploration, such systems tended to push items (such as goods, shops, and questions) that consumers had already encountered before

Reinforcement learning can effectively model the interaction process between the consumer and the system and maximize the accumulated benefit of that process, which makes it well suited to business scenarios

Search scenario:

In e-commerce, users' browsing and purchasing behavior can be regarded as a Markov process; modeling it as a Markov decision process yields a reinforcement-learning-based ranking decision model that makes search more intelligent

On Tmall Double 11, reinforcement learning raised the search ranking metric by 20%

Ctrip introduced reinforcement learning into hotel search ranking; predicting unknown situations requires a certain amount of "random exploration", since only then can actual user feedback be observed

The short-term cost of random exploration cannot be completely avoided, but the ultimate goal is to more than make up for that cost

Recommended scenarios:

Reinforcement learning and adaptive online learning are used to build a decision engine; through continuous learning and model optimization it analyzes massive user behavior and item features in real time, helping users quickly find the items they like

An item can be an article, a product, etc.

Taobao's "Guess You Like" introduced reinforcement learning to help each user quickly find products they like, improving the matching efficiency between people and products and lifting the effectiveness metric by 10%-20%

Intelligent customer service:

Taking the intelligent customer service bot as the agent, the agent's decisions are not determined by the immediate payoff at a single node but by a relatively long-term interaction process

Treat the interaction between the consumer and the platform as a Markov decision process, and use reinforcement learning to build a feedback system for the interaction between the consumer and the system

System decisions are based on maximizing the benefit of the whole process => creating a dynamic interplay between the system and its users

How to define the episode, reward, state, and action is the key (a small sketch follows this list):

Episode: for example, in a ticket-booking scenario, when the user talks with the system and the system first determines that the user intends to "buy a flight ticket", an episode begins; when the user buys the ticket or quits the session, the episode ends

Reward: collected from user feedback, such as placing an order or exiting

State: the user's question embedding, the extracted slot state, and historical slot information are fed into a fully connected neural network, with a final softmax layer over the actions

Action: in the flight-booking scenario the action space is discrete, mainly asking a follow-up question about each slot (e.g. time, origin, destination) or placing the order
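
One possible encoding of these elements in code, purely illustrative (the slot names, action list, and reward values are assumptions, not the production design):

from dataclasses import dataclass, field

SLOTS = ["time", "origin", "destination"]
ACTIONS = ["ask_" + slot for slot in SLOTS] + ["place_order"]   # discrete action space

@dataclass
class DialogueState:
    question_embedding: list                             # embedding of the user's question
    filled_slots: dict = field(default_factory=dict)     # slot -> value extracted so far

def reward(user_feedback: str) -> float:
    # Reward from user feedback: an order is positive, quitting is negative
    if user_feedback == "order":
        return 1.0
    if user_feedback == "exit":
        return -1.0
    return 0.0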

Advertising system:

If an advertiser can bid separately based on the value of each piece of traffic, it can bid higher on high-value traffic and lower on average traffic, and obtain a better ROI

At the same time, the platform can also improve the efficiency of matching between advertising and visitors

Through reinforcement learning, intelligent pricing becomes possible: for each visiting user, the system decides how to adjust the bid according to the user's current state, shows them specific ads, and guides their state in the desired direction

On Tmall Double 11, CTR, CPM and GMV have all increased significantly