What is reinforcement learning

Reinforcement Learning:

A branch of machine learning; the main branches are supervised learning, unsupervised learning, and reinforcement learning

The idea behind reinforcement learning is similar to how people learn: through practice (trial and error)

For example, when learning to walk, if we fall, the brain gives a negative reward value => that walking posture was bad; if the next step goes smoothly, the brain gives a positive reward value => that was a good step

Different from supervised learning: there are no prepared output labels for the training data. Reinforcement learning only has a reward value, which, unlike a supervised label, is not given in advance but arrives afterwards (e.g., only after falling while walking)

Different from unsupervised learning: unsupervised learning has neither output labels nor reward values, only data features, while reinforcement learning has a reward value (a negative value means punishment). In addition, in unsupervised and supervised learning the data samples are independent, whereas in reinforcement learning they are sequentially dependent

Reinforcement Learning:

It can be applied to different fields: neuroscience, psychology, computer science, engineering, mathematics, economics and so on

Characteristics of reinforcement learning:

No supervised labels, only a reward signal

The reward signal does not have to be immediate; it can be delayed, sometimes by a long time

Time (sequence) is an important factor

Current behavior affects subsequent received data

Reinforcement learning has a wide range of applications: game AI, recommendation systems, robot simulation, investment management, power plant control

Basic concepts (agent, environment, action, state, reward, policy, state transition probability)

Basic Concepts:

Agent: the learner and decision maker

Environment: everything outside the Agent, which the Agent interacts with

Action: the behavior performed by the Agent

State: the information the Agent obtains from the environment

Reward: the feedback from the environment for an action

Policy: the function by which the Agent chooses the next action based on the current state

State transition probability: the probability that the Agent enters the next state after taking an action

Four important elements: state, action, policy, reward

What is the goal of RL

RL considers the interaction between Agent and Environment

The goal is to find an optimal policy so that the Agent obtains as much reward from the environment as possible

For example, in a racing game, the scene is the environment, the car is the Agent, the car's position is the state, operating the car is the action, how to operate the car is the policy, and the race score is the reward

In many cases the Agent cannot obtain all of the environment's information; instead it represents the environment through an Observation, that is, the information it can perceive around itself
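A minimal sketch of this Agent-Environment interaction loop is shown below; ToyEnv, its reward scheme, and the Gym-style reset()/step() interface are illustrative assumptions, not something defined in these notes.

```python
import random


class ToyEnv:
    """A tiny illustrative environment: reach position 3 to get a reward of +1."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):                     # action: -1 or +1
        self.pos += action
        done = self.pos == 3
        reward = 1.0 if done else 0.0           # reward arrives only at the end
        return self.pos, reward, done


def run_episode(env, policy, max_steps=100):
    """The basic agent-environment loop: state -> action -> reward -> next state."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the policy chooses the action
        state, reward, done = env.step(action)  # the environment gives feedback
        total_reward += reward
        if done:
            break
    return total_reward


print(run_episode(ToyEnv(), lambda s: random.choice((-1, 1))))
```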

Markov decision process MDP

Markov Decision Process (MDP): an MDP is a kind of search problem under uncertainty. The goal is to find a feasible state transition path (a sequence of states and actions) that, together with the reward function, maximizes the total reward

An MDP (Markov Decision Process) is used because state transitions depend only on the current state and the current action:

A decision is needed whenever an action must be selected based on the state and the potential rewards

The uncertainty is reflected in the transition function: when action A is performed in state S, the resulting state S' is uncertain, and the reward R is also uncertain

The Markov state

Consider a sequence of random variables X1, X2, …, Xn, with a current state, past states, and future states

Given the current state, the future states and the past states are conditionally independent; that is, the probability distribution of the system state at time t+1 depends only on the state at time t, and not on the states before t

The state transition from time t to time t+1 is independent of the value of t

A Markov chain model can be expressed as a triple (S, P, Q)

S is the set of states (also known as the state space) of all possible states of the system

P is the state transition matrix

Q is the initial probability distribution of the system
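The triple above can be made concrete with a short sampling sketch; the two weather states, the transition matrix, and the initial distribution are made-up illustrative values.

```python
import numpy as np

S = ["sunny", "rainy"]                  # S: state space (illustrative states)
P = np.array([[0.8, 0.2],               # P: transition matrix, P[i][j] is the
              [0.4, 0.6]])              #    probability of going from state i to j
Q = np.array([0.5, 0.5])                # Q: initial probability distribution


def sample_chain(steps, seed=0):
    rng = np.random.default_rng(seed)
    s = rng.choice(len(S), p=Q)         # draw the initial state from Q
    path = [S[s]]
    for _ in range(steps):
        s = rng.choice(len(S), p=P[s])  # Markov property: only the current row matters
        path.append(S[s])
    return path


print(sample_chain(5))
```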

Agent classification (value-based, policy-based, actor-critic)

Reinforcement learning Agent:

Value-based reinforcement learning

Guides policy formation by learning a value function (e.g., using an ε-greedy action-selection rule; a small sketch follows at the end of this classification)

Policy-based reinforcement learning

No value function; the policy is learned directly

Actor-Critic reinforcement learning, which combines policy gradients with a value function

A method that learns both a value function and a policy

Actor-Critic: the actor acts while the critic evaluates the actions and gives feedback, so the actor keeps improving
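A small sketch of the ε-greedy rule mentioned for value-based agents; the toy Q-table and the action names are illustrative assumptions.

```python
import random


def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore randomly, otherwise pick the highest-valued action."""
    if random.random() < epsilon:
        return random.choice(actions)                              # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))      # exploit


# Usage with a toy table of action-value estimates
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
print(epsilon_greedy(Q, "s0", ["left", "right"]))                  # usually "right"
```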

What is a policy network

Policy network:

In any game, the player's input is regarded as an action A, and each input (action) leads to a different output, namely the game state S

This yields a list of different state-action pairs

The policy network covers the policy (the action to take) for every state S

For example, in a game, input A1 leads to state S1 (moving up), and input A2 leads to state S2 (moving down)

S: state set

A: Action set

R: Reward distribution, given (state, action)

P: state transition probability, the probability distribution of the next state for a given (state, action)

γ: discount factor, a safeguard that prevents the cumulative reward R from going to infinity => an infinite reward would erase the differences between the agent's different actions

π*: optimal policy
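As a rough illustration of the policy idea under the notation above, here is a minimal policy-network sketch: a single linear layer plus softmax mapping state features to action probabilities. The layer, the feature size, and the action count are assumptions made for the example, not AlphaGo's actual network.

```python
import numpy as np


def softmax(z):
    z = z - z.max()                           # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()


class PolicyNetwork:
    """Maps a state feature vector to a probability distribution over actions."""
    def __init__(self, n_features, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n_actions, n_features))

    def action_probs(self, state):
        return softmax(self.W @ state)        # one probability per action in A


net = PolicyNetwork(n_features=4, n_actions=2)
print(net.action_probs(np.array([1.0, 0.0, 0.5, -0.5])))   # probabilities sum to 1
```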

What is the value network

Value network (numerical network):

The value network assigns a score (a numerical value) to game states by computing the expected cumulative score for the current state S; every state is passed through the value network

States that lead to more reward receive a higher Value in the value network

The reward here is an expected reward value, and we pick the best state from the set of states

V: expected value (the expected cumulative reward)
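A short sketch of the discounted cumulative reward that V estimates, using the discount factor γ from the notation above; the reward sequence is made up.

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ... computed backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g                 # G_t = r_t + gamma * G_{t+1}
    return g


print(discounted_return([1.0, 0.0, 0.0, 5.0]))   # 1 + 0.9**3 * 5 = 4.645
```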

Principles of MCTS (selection, expansion, simulation, backup)

MCTS principle:

Each node represents a game position; a label A/B means the node has been visited B times and black has won A times

We will repeat the process over and over again:

Step 1, Selection: go down from the root node, choosing the "most valuable child node" at each level, until reaching a node that still has unexpanded children, i.e., a position with follow-up moves that have not been tried, such as the 3/3 node

Step 2, Expansion: add a 0/0 child node to that node, corresponding to one of the previously "unexpanded child nodes"

Step 3, Simulation: play out to the end of the game with a rollout policy to obtain a result (think: why not use AlphaGo's policy-value network to play these moves instead?)

Step 4, Backup: propagate the simulation result back to all of the node's ancestors; if the simulated result is 0/1, add 0/1 to every ancestor

MCTS (Monte Carlo Tree Search)

Monte Carlo tree search combines the generality of random simulation with the accuracy of tree search

MCTS is a search algorithm that uses various techniques to effectively reduce the search space. Each iteration of MCTS starts from a partially expanded search tree, and the result is the same tree with one more expanded node

The role of MCTS is to predict the outcome through simulation; in theory it can be applied to any domain defined by {state, action}

The main steps:

Selection: starting from the root node, follow some strategy to search down to a leaf node

Expansion: add one or more legal child nodes to the leaf node

Simulation: play out from the child nodes in a random manner (which is why it is called Monte Carlo) for a number of trials; simulating to a final state yields the score of that simulation

Backup: update the visit count and score of the current child node according to the scores of its simulations, and propagate these counts and scores back to all of its ancestor nodes, updating them as well
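Below is a compact sketch of the four steps against a toy game ("take 1 or 2 sticks, whoever takes the last stick wins"); the game, the UCB-1 constant, and the node layout are illustrative assumptions and not part of the original notes.

```python
import math
import random


def legal_actions(sticks):
    return [a for a in (1, 2) if a <= sticks]


class Node:
    def __init__(self, sticks, player, parent=None):
        self.sticks, self.player, self.parent = sticks, player, parent
        self.children = {}                   # action -> child Node
        self.visits, self.wins = 0, 0.0      # the "A/B" counts from the notes

    def untried(self):
        return [a for a in legal_actions(self.sticks) if a not in self.children]

    def ucb_child(self, c=1.4):              # Step 1 helper: most valuable child
        return max(self.children.values(),
                   key=lambda n: n.wins / n.visits
                   + c * math.sqrt(math.log(self.visits) / n.visits))


def rollout(sticks, player):                 # Step 3: random playout to the end
    while sticks > 0:
        sticks -= random.choice(legal_actions(sticks))
        player = 1 - player
    return 1 - player                        # whoever took the last stick wins


def mcts(root_sticks, iterations=2000):
    root = Node(root_sticks, player=0)
    for _ in range(iterations):
        node = root
        while not node.untried() and node.children:           # Step 1: Selection
            node = node.ucb_child()
        if node.untried():                                    # Step 2: Expansion
            a = random.choice(node.untried())
            node.children[a] = Node(node.sticks - a, 1 - node.player, node)
            node = node.children[a]
        winner = rollout(node.sticks, node.player)            # Step 3: Simulation
        while node is not None:                               # Step 4: Backup
            node.visits += 1
            # wins are stored from the viewpoint of the player who moved into the node
            node.wins += 1.0 if winner != node.player else 0.0
            node = node.parent
    best_action = max(root.children.items(), key=lambda kv: kv[1].visits)[0]
    return best_action                       # most-visited move from the root


print(mcts(5))                               # with 5 sticks, taking 2 should win
```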

AlphaGo main logic

AI = Policy Value Network + MCTS

Policy Value Network

Policy network: input the current state, and the neural network outputs the probability of taking each action in that state

Value network: the value of the current position = an estimate of the final game outcome

MCTS, Monte Carlo tree search

The input is an ordinary policy, and through MCTS we obtain an improved policy as output

Self-play games are carried out via MCTS, and the results are used to update the policy network
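As a rough sketch of how the policy network can guide MCTS selection, here is an AlphaGo-style PUCT-like scoring rule; the prior, value, visit counts, and the constant c_puct are illustrative, and the real AlphaGo formula and networks are more involved.

```python
import math


def puct_score(q_value, prior, parent_visits, child_visits, c_puct=1.0):
    """Exploitation (q_value) plus exploration weighted by the policy network's prior."""
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q_value + u


# Two candidate moves at a node that has been visited 100 times:
print(puct_score(q_value=0.50, prior=0.6, parent_visits=100, child_visits=10))
print(puct_score(q_value=0.55, prior=0.1, parent_visits=100, child_visits=40))
```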