I. Overview of reinforcement learning

1. Introduction to reinforcement learning

(1) Reinforcement learning is an area of machine learning that studies how an agent should act in its environment in order to maximize the expected cumulative reward. Its inspiration comes from behaviorism in psychology: an organism, stimulated by the rewards or punishments given by the environment, gradually forms expectations about those stimuli and develops the habitual behavior that yields the greatest benefit.

(2) Reinforcement learning can be traced back to Pavlov's conditioned-reflex experiments. It developed independently in the two fields of animal behavior research and optimal control, and was eventually abstracted by Bellman into the Markov Decision Process (MDP).

2. Development History:

In 1956, Bellman proposed the dynamic programming method.

In 1977, Werbos proposed the adaptive dynamic programming algorithm.

In 1988, Sutton proposed the temporal-difference (TD) learning algorithm.

In 1992, Watkins proposed the Q-learning algorithm.

In 1994, Rummery proposed the Sarsa algorithm.

In 1996, Bertsekas proposed neuro-dynamic programming to solve optimal control in stochastic processes.

In 2006, Kocsis proposed the Upper Confidence Bounds applied to Trees (UCT) algorithm.

In 2009, Lewis proposed adaptive dynamic programming algorithms for feedback control.

In 2014, Silver proposed the Deterministic Policy Gradient (DPG) algorithm.

In 2015, Google DeepMind proposed the Deep Q-Network (DQN) algorithm.

Reinforcement learning has been developed for decades and is not a new technology. After AlphaGo beat Lee Sedol in 2016, reinforcement learning combined with deep learning became one of the hottest technologies of the past two years. In short, reinforcement learning is at once an old and a modern technology.

3. MDP (Markov Decision Process)

S: the set of states (States), with s ∈ S; s_i denotes the state at step i.

A: the set of actions (Actions), with a ∈ A; a_i denotes the action at step i.

P_sa: the state transition probabilities. P_sa is the probability distribution over the states the process moves to after taking action a ∈ A in the current state s ∈ S. For example, the probability of moving to s' after performing action a in state s can be written as P(s' | s, a).

R: S×A ⟼ ℝ, the reward function. A reward function that depends only on the state can be simplified to R: S ⟼ ℝ. If taking action a in state s moves the process to the next state s', the reward can be written as R(s' | s, a); if the next state s' corresponding to (s, a) is unique, it can also be written as R(s, a).

γ: discount factor
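To make the notation concrete, here is a minimal sketch of how such an MDP could be held in Python. The two states, two actions, probabilities, and rewards are made up purely for illustration; only the (S, A, P_sa, R, γ) structure comes from the definition above.

# A toy MDP matching the (S, A, P, R, gamma) definition above; all numbers are illustrative.
S = ["s0", "s1"]                      # state set
A = ["left", "right"]                 # action set

# P[(s, a)] maps each reachable next state s2 to P(s2 | s, a)
P = {
    ("s0", "left"):  {"s0": 0.9, "s1": 0.1},
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
    ("s1", "left"):  {"s0": 0.7, "s1": 0.3},
    ("s1", "right"): {"s1": 1.0},
}

# R[(s, a)] is the expected reward of taking action a in state s (the R: S x A -> R form)
R = {
    ("s0", "left"): -0.2, ("s0", "right"): 1.0,
    ("s1", "left"): -0.2, ("s1", "right"): 0.0,
}

gamma = 0.9                           # discount factor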

4. When to use RL?

Characteristics of problems solved by reinforcement learning:

  • The agent and the environment are constantly interacting
  • Search and trial and error
  • Delayed reward (the current action may take many steps to produce the corresponding result)

Goal:

  • Earn more cumulative rewards
  • Get more reliable estimates

Reinforcement learning is a branch of the machine learning family. Thanks to recent technological breakthroughs and its integration with deep learning, reinforcement learning has found ever wider application. Getting a computer to learn to play games, or having AlphaGo take on the world's best Go players, are all things reinforcement learning does. Reinforcement learning is about turning your program from a complete stranger into a master of its environment.

5. Conclusion:

The reasoning ability brought by Deep Reinforcement Learning (DRL) is a key hallmark of intelligence, and it truly gives machines the ability to learn and think on their own.

In essence, Deep Reinforcement Learning (DRL) belongs to the class of methods that use a neural network as the value function estimator. Its main advantage is that a deep neural network extracts state features automatically, avoiding the inaccuracy caused by manually defined state features and enabling the agent to learn from more raw, primitive states.

II. Reinforcement learning solution methods

1. Dynamic programming

Cornerstone: the Bellman equation

(1) Bellman equation:

Bellman optimality equation:
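Written out in the notation of the MDP defined above, the two equations take their standard forms:

V^{\pi}(s) = \sum_{a \in A} \pi(a \mid s) \sum_{s' \in S} P(s' \mid s, a)\,\big[ R(s, a) + \gamma V^{\pi}(s') \big]

V^{*}(s) = \max_{a \in A} \sum_{s' \in S} P(s' \mid s, a)\,\big[ R(s, a) + \gamma V^{*}(s') \big]

The first evaluates a fixed policy π; the second (the optimality equation) replaces the expectation over the policy with a max over actions.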

How do you solve the Bellman equation?

For a fixed policy it is essentially a system of equations: several equations in several unknowns. For example, suppose the reward of each step is -0.2, there are three unknown state values V(0,1), V(1,1) and V(1,0), and the discount factor is 1. Then, under the policy shown, the Bellman equations are obtained as follows:

However, as the number of unknowns keeps growing, solving the system directly becomes difficult. At that point, dynamic programming is needed to iterate repeatedly until the state values converge.
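For a small state space, the system can indeed be solved directly. Below is a minimal numpy sketch with a made-up three-state chain and a per-step reward of -0.2; a discount factor of 0.9 is used instead of 1 so that (I - gamma*P) stays invertible.

import numpy as np

# Policy evaluation for a fixed policy, viewed as a linear system:
#   V = R + gamma * P V   =>   (I - gamma * P) V = R
# P and R already have the fixed policy "baked in"; all numbers are illustrative.
P = np.array([[0.0, 0.8, 0.2],    # P[i, j] = probability of moving from state i to state j
              [0.1, 0.0, 0.9],
              [0.0, 0.0, 1.0]])
R = np.array([-0.2, -0.2, 0.0])   # expected one-step reward in each state
gamma = 0.9                       # < 1 keeps (I - gamma * P) invertible

V = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V)                          # exact state values under this policy

As the number of states grows, solving (or inverting) this system is exactly what becomes impractical, which is why the iterative methods below are used instead.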

(2) Value iteration

In value iteration, you start with a random value function and then find a new (improved) value function in an iterative process, until you reach the optimal value function. Notice that you can easily derive the optimal policy from the optimal value function. This process is based on the Bellman optimality operator.
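A minimal value iteration sketch in Python, using the same dictionary-style (S, A, P, R, gamma) MDP assumed in section I.3; the stopping threshold is an arbitrary choice.

def value_iteration(S, A, P, R, gamma, theta=1e-6):
    """Repeat V(s) <- max_a sum_s' P(s'|s,a) [R(s,a) + gamma V(s')] until convergence."""
    V = {s: 0.0 for s in S}                      # start from an arbitrary (here: zero) value function
    while True:
        delta = 0.0
        for s in S:
            best = max(
                sum(p * (R[(s, a)] + gamma * V[s2]) for s2, p in P[(s, a)].items())
                for a in A
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                        # value function has (numerically) converged
            break
    # The optimal policy is then read off greedily from the optimal value function
    policy = {
        s: max(A, key=lambda a: sum(p * (R[(s, a)] + gamma * V[s2]) for s2, p in P[(s, a)].items()))
        for s in S
    }
    return V, policy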

(3) Policy iteration

In policy iteration algorithms, you start with a random policy, then find the value function of that policy (the policy evaluation step), then find a new (improved) policy based on that value function, and so on. In this process, each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Given a policy, its value function can be obtained using the Bellman operator.
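A corresponding policy iteration sketch under the same assumed (S, A, P, R, gamma) structure; here policy evaluation is done by iterative sweeps rather than by solving the linear system exactly.

def policy_iteration(S, A, P, R, gamma, theta=1e-6):
    """Alternate policy evaluation and greedy policy improvement until the policy stops changing."""
    policy = {s: A[0] for s in S}                # start from an arbitrary policy
    V = {s: 0.0 for s in S}
    while True:
        # Policy evaluation: make V consistent with the current policy
        while True:
            delta = 0.0
            for s in S:
                a = policy[s]
                v = sum(p * (R[(s, a)] + gamma * V[s2]) for s2, p in P[(s, a)].items())
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the evaluated V
        stable = True
        for s in S:
            best = max(A, key=lambda a: sum(p * (R[(s, a)] + gamma * V[s2]) for s2, p in P[(s, a)].items()))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:                               # no improvement possible: the policy is optimal
            return V, policy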

(4) Detailed comparison of algorithms:

 

  • In the policy iteration algorithm, policy evaluation and policy improvement are carried out in each iteration until both converge. However, since our goal is to find the optimal policy, is it possible that the optimal policy has already converged while the policy's evaluated values have not yet converged? The answer is yes.
  • Policy iteration converges faster, so it is the better choice when the state space is small. When the state space is large, value iteration requires less computation.

(5) GridWorld example (solved with policy iteration and with value iteration respectively)

Policy iteration process:

Value iteration process:

2. Monte Carlo method

The dynamic programming method above assumes an ideal setting in which all parameters are known in advance, such as the state transition probabilities and the rewards. In reality, however, these are unknown. In that case, Monte Carlo sampling can be used: based on the law of large numbers, the transition probabilities are estimated from statistics. For example, when you flip a coin enough times, the observed frequencies of heads and tails get closer and closer to the true probabilities.
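A minimal sketch of that idea: estimate a transition probability purely by sampling and counting, treating the environment as a black box. The sample_next_state simulator and the hidden_env example below are hypothetical stand-ins for a real, unknown environment.

import random
from collections import Counter

def estimate_transition_probs(sample_next_state, s, a, n=10_000):
    """Monte Carlo estimate of P(. | s, a): repeat the transition many times and count frequencies."""
    counts = Counter(sample_next_state(s, a) for _ in range(n))
    return {s2: c / n for s2, c in counts.items()}   # law of large numbers: frequencies -> probabilities

# Toy usage with a hidden environment the estimator is not allowed to look inside
def hidden_env(s, a):
    return "s1" if random.random() < 0.8 else "s0"

print(estimate_transition_probs(hidden_env, "s0", "right"))   # roughly {'s1': 0.8, 's0': 0.2}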

3. Temporal-difference method

It combines ideas from dynamic programming and the Monte Carlo method; a minimal update rule is sketched below.
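Like Monte Carlo, TD(0) learns from sampled experience; like dynamic programming, it bootstraps from the current estimate of the next state's value. The learning rate and discount factor below are arbitrary defaults.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference (TD(0)) update from a single observed transition (s, r, s_next)."""
    td_target = r + gamma * V.get(s_next, 0.0)   # bootstrap on the current estimate of V(s_next)
    td_error = td_target - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error      # move V(s) a small step toward the target
    return V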

III. Classification of reinforcement learning algorithms

1. Classified by whether the agent understands (models) the environment:

Model-free: the agent does not try to understand the environment; whatever the environment gives, it takes. This approach is called model-free, because no model is built to represent the environment.

Model-based: understanding the environment means learning a model to represent it, so this is the model-based approach.

2. Classified by the form of the output:

One kind outputs the probability of each action directly (policy-based); the other outputs the value of each action (value-based). The former can handle continuous actions; the latter cannot express the value of continuous actions.

3.

4. Classified as on-policy or off-policy:

The key to determining on-policy versus off-policy is whether the policy (or value function) being estimated is the same as the policy used to generate the samples. If it is the same, the method is on-policy; otherwise it is off-policy.

Summary of how the common algorithms are classified:

IV. Representative algorithms

1. Q-learning

(1) Four basic components:

  • Q table: Q(s, a), the cumulative value of performing action a in state s
  • Action selection: choose an action
  • Environment feedback: the feedback given by the environment after the action has been performed
  • Environment update: the environment moves to its next state

(2) Algorithm formula:
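The update rule referred to here is the standard Q-learning formula:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \big]

where α is the learning rate and γ the discount factor; the target uses the best action in the next state, regardless of which action is actually executed next.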

(3) Algorithm decision:

(4) Algorithm update:

(5) Code implementation framework:
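As a stand-in for the framework, here is a minimal tabular Q-learning loop. It assumes a gym-style environment object with env.reset() and env.step(a) returning (next_state, reward, done); that interface, the epsilon-greedy exploration, and the hyperparameters are all assumptions for illustration.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning training loop (a sketch, assuming a gym-like env interface)."""
    Q = defaultdict(float)                                  # Q table: Q[(s, a)], default 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # choose an action: epsilon-greedy over the Q table
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)                       # environment feedback
            # off-policy update: bootstrap on the best action in the next state
            target = r + (0.0 if done else gamma * max(Q[(s2, x)] for x in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2                                          # environment/state update
    return Q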

2. Sarsa

Similar to Q-learning; the only difference is the update rule.

(1) Algorithm formula:
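The update rule referred to here is the standard Sarsa formula:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big]

Unlike Q-learning, the target uses a_{t+1}, the action the current policy actually takes in the next state, instead of max_a Q(s_{t+1}, a).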

(2) Differences between Sarsa and Q-learning: Sarsa is on-policy while Q-learning is off-policy

(3) Update process:

The former is Sarsa and the latter is Q-learning

(4) How the difference shows up in the code:

Sarsa:

Q-learning:
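In code, the difference boils down to a single update line. A sketch, reusing the tabular Q and the (s, a, r, s2) transition variables from the loop above; choose_action is a hypothetical epsilon-greedy helper.

# Sarsa (on-policy): the next action a2 is chosen by the behaviour policy first, then used in the target
a2 = choose_action(Q, s2)                                        # hypothetical epsilon-greedy helper
Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

# Q-learning (off-policy): the target always uses the greedy (max) action, whatever is executed next
Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, x)] for x in actions) - Q[(s, a)])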

(5) Code implementation framework:

3. DQN

DeepMind was acquired by Google largely because of the DQN paper.

(1) Origin:

  • DQN (Deep Q-Network) is a method that fuses neural networks with Q-learning.
  • Some problems are too complex for a Q table to store, and even if they could be stored, searching the table would be a hassle. Therefore, the Q table is replaced by a neural network.

(2) Added two new features:

Experience replay: each time DQN is updated, it learns from a random selection of previous experiences. This randomization breaks the correlation between experiences and makes the neural network updates more efficient (a minimal sketch of such a buffer appears below).

Fixed Q-targets: two neural networks with the same structure but different parameters are used. The network that produces the Q estimate uses the latest parameters, while the network that produces the Q target uses parameters frozen from some time ago.
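A minimal sketch of the replay-buffer side of this; the capacity and batch size are arbitrary choices.

import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions and returns random minibatches, breaking the correlation between experiences."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)                # oldest experiences are discarded automatically

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)       # uniform random draw of past experiences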

(3) Algorithm formula:

Playing Atari with Deep Reinforcement Learning
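In the notation of the two tricks above, the loss that DQN minimizes can be written as

L(\theta) = \mathbb{E}_{(s, a, r, s') \sim D}\Big[ \big( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \big)^{2} \Big]

where D is the replay memory, θ are the parameters of the estimation network, and θ^- are the periodically frozen parameters of the target network.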

(4) DQN code implementation framework:

As can be seen, it is basically the same as Q-learning, with one difference: memory storage (the replay buffer) and batch learning are added.

clear; warning off all; format compact;
load shujv.mat;
train_y = (train_y - 1) * 2 + 1;
test_y  = (test_y - 1) * 2 + 1;
assert(isfloat(train_x), 'train_x must be a float');
assert(all(train_x(:) >= 0) && all(train_x(:) <= 1), 'all data in train_x must be in [0:1]');
assert(isfloat(test_x), 'test_x must be a float');
assert(all(test_x(:) >= 0) && all(test_x(:) <= 1), 'all data in test_x must be in [0:1]');

%% Broad learning system model with incremental m2+m3 enhancement nodes and m1 feature nodes
C = 2^-30;      % L2 regularization parameter
s = 0.7;        % shrinkage scale of the enhancement nodes
N11 = 20;       % number of feature-mapping nodes per window
N2 = 30;        % number of windows of feature nodes
N33 = 550;      % number of enhancement nodes
epochs = 1;     % number of epochs
m1 = 20;        % number of feature nodes added per increment step
m2 = 80;        % number of enhancement nodes related to the incremental feature nodes per increment step
m3 = 200;       % number of enhancement nodes added in each incremental learning step
l = 1;          % number of incremental learning steps

train_err_t = zeros(epochs, l);    test_err_t = zeros(epochs, l);
train_time_t = zeros(epochs, l);   test_time_t = zeros(epochs, l);
Testing_time_t = zeros(epochs, 1); Training_time_t = zeros(epochs, 1);

% rand('state', 67797325)   % random seed recommended by the HELM reference [10]

N1 = N11; N3 = N33;
for i = 1:epochs
    [train_err, test_err, train_time, test_time, Testing_time, Training_time] = ...
        bls_train_enhancefeature(train_x, train_y, test_x, test_y, s, C, N1, N2, N3, m1, m2, m3, l);
    train_err_t(i, :) = train_err;    test_err_t(i, :) = test_err;
    train_time_t(i, :) = train_time;  test_time_t(i, :) = test_time;
    Testing_time_t(i) = Testing_time; Training_time_t(i) = Training_time;
end
save('cwt_result_enhancefeature', 'train_err_t', 'test_err_t', 'train_time_t', 'test_time_t', 'Testing_time_t', 'Training_time_t');