
This article was published by Luo Hui in the Cloud + Community column.

1. Google’s DQN paper

In February 2015, Google published a paper in Nature (see attachment): Human-level Control through Deep Reinforcement Learning. The article describes how a computer learned to play Atari 2600 video games on its own.

The Atari 2600 was a popular home console in the US in the 1980s. Google's experiments covered 49 of its games, including familiar classics such as Breakout and Space Invaders. Using only the game screen images and the game score as input, and without any human intervention, the algorithm learned how to play and surpassed human players in 29 of the games.

The deep network architecture diagram given by Google is as follows:

The left side of the network is the input and the right side is the output. The game screen image first passes through two convolutional layers (the paper itself uses three), then through two fully connected layers, and finally maps to every possible gamepad action. ReLU activation functions are used between the layers.

2. Q-learning

According to Wikipedia, reinforcement learning is defined as follows:

Reinforcement learning is an area of machine learning concerned with how to act in an environment so as to maximize expected cumulative benefit. Its inspiration comes from behaviorist psychology: under the stimulus of rewards or punishments given by the environment, an organism gradually forms expectations about the stimuli and develops habitual behavior that yields the greatest benefit.

In the world of reinforcement learning, the learning algorithm is called an Agent. It interacts with the environment: the Agent receives a state from the environment and decides on the action it wants to take, and the environment rewards the Agent according to its own logic. Rewards can be positive or negative. In a game, for example, every enemy hit is a positive reward, while losing health or ending the game is a negative reward.

2.1. Markov decision process

Now the question is: how do we formalize a reinforcement learning problem so that we can reason about it? The most common approach is to model it as a Markov decision process.

Imagine you are an agent in an environment (for example, a brick-breaking game such as Breakout). The environment is in a particular state (for example, the position of the paddle, the position and direction of the ball, and the presence or absence of each brick). The agent can perform certain actions in this environment (for example, move the paddle left or right).

These actions sometimes lead to rewards (higher scores). Actions change the environment and lead to a new state, in which the agent can perform the next action. The rule by which these actions are chosen is called the policy. In general, the environment is stochastic, which means the next state is also somewhat random (for example, when you lose the ball and a new one is launched, it travels in a random direction).

The set of states and actions, together with the rules for transitioning between states, constitutes a Markov decision process. One episode of this process (e.g. one game) forms a finite sequence of states, actions, and rewards:
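$$
s_0, a_0, r_1,\; s_1, a_1, r_2,\; s_2, \ldots,\; s_{n-1}, a_{n-1}, r_n,\; s_n
$$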

Here $s_i$ represents the state, $a_i$ the action, and $r_{i+1}$ the reward received after performing that action. The episode ends with the terminal state $s_n$ (for example, the "Game Over" screen). A Markov decision process is based on the Markov assumption: the probability of the next state $s_{i+1}$ depends only on the current state $s_i$ and action $a_i$, not on earlier states and actions.

2.2. Discounted future rewards

To perform well in the long run, we need to take into account not only the immediate reward but also the future rewards we will receive. How do we do that?

For a given run of the Markov decision process, we can easily calculate the total reward of an episode:
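$$
R = r_1 + r_2 + r_3 + \cdots + r_n
$$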

Accordingly, the total future reward from time point $t$ onward can be expressed as:
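$$
R_t = r_t + r_{t+1} + r_{t+2} + \cdots + r_n
$$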

But because our environment is stochastic, we can never be sure that we will receive the same rewards the next time we perform the same actions, and the further into the future we look, the more the possibilities diverge. For this reason it is better to use the discounted future reward instead:
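$$
R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{\,n-t} r_n
$$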

Here $\gamma$ is the discount factor, between 0 and 1: the further a reward lies in the future, the less we take it into account. It is easy to see that the discounted future reward at time step $t$ can be expressed in terms of the same quantity at time step $t+1$:
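$$
R_t = r_t + \gamma \bigl( r_{t+1} + \gamma ( r_{t+2} + \cdots ) \bigr) = r_t + \gamma R_{t+1}
$$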

If we set the discount factor to $\gamma = 0$, our strategy becomes short-sighted and relies only on immediate rewards. If we want to balance immediate and future rewards, the discount factor should be set to something like $\gamma = 0.9$. If our environment is deterministic, so that the same actions always lead to the same rewards, we can set the discount factor to $\gamma = 1$.

A good strategy for an agent, then, is to always choose the action that maximizes the discounted future reward.

2.3. Q-learning algorithm description
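In Q-learning we define a function $Q(s, a)$ that represents the maximum discounted future reward obtainable by performing action $a$ in state $s$ and acting optimally from then on. In its standard tabular form, the algorithm initializes $Q$ arbitrarily and then, after every observed transition $(s_t, a_t, r_{t+1}, s_{t+1})$, applies the update

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \bigl[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \bigr]
$$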

The $\alpha$ in the algorithm is the learning rate; it controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account. In particular, when $\alpha = 1$, the two $Q(s_t, a_t)$ terms cancel each other out, and the update is exactly the Bellman equation.

The value we use to update $Q(s, a)$ is only an approximation, and in the early stages of learning it may well be wrong. But the approximation becomes more and more accurate with every iteration, and it turns out that if we perform this update long enough, the Q-function converges and represents the true Q-value.
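A minimal sketch of this tabular update loop in Python, assuming a hypothetical environment object with reset(), step(action) and an actions list (these names are illustrative and not from the article):

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning against an assumed env exposing reset() -> state,
    step(action) -> (next_state, reward, done) and an env.actions list."""
    Q = defaultdict(float)  # Q[(state, action)], defaults to 0

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # move Q(s, a) toward r + gamma * max_a' Q(s', a')
            best_next = max(Q[(next_state, a)] for a in env.actions)
            target = reward if done else reward + gamma * best_next
            Q[(state, action)] += alpha * (target - Q[(state, action)])

            state = next_state
    return Q
```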

3. Convolutional Neural Network (CNN)

In image processing, an image is often represented as a vector of pixels; a 1000×1000 image, for example, becomes a vector of 1,000,000 elements. In a fully connected network, if the hidden layer has as many neurons as the input layer (1,000,000), then the number of parameters between the input layer and the hidden layer is 1,000,000 × 1,000,000 = 10^12, which is far too many to train. So before images can be handled by a neural network, the number of parameters must be reduced.

3.1. Local perception

Convolutional neural networks have two tricks for reducing the number of parameters. The first is the local receptive field. It is generally believed that human perception of the outside world proceeds from local to global, and the spatial structure of images is similar: nearby pixels are closely related, while the correlation between distant pixels is weak.

Therefore, each neuron does not actually need to perceive the whole image; it only needs to perceive a local region, and the local information is then combined at higher layers to obtain the global picture. The idea of locally connected networks is also inspired by the architecture of the biological visual system: neurons in the visual cortex have local receptive fields (that is, they respond only to stimuli in specific regions). This is shown below: full connectivity on the left, local connectivity on the right.

In the right-hand diagram above, if each neuron is connected to only a 10×10 patch of pixels, then the weights amount to 1,000,000 × 100 parameters, one ten-thousandth of the original number. And the 10×10 weights applied to a 10×10 pixel patch are exactly a convolution operation.

3.2. Parameter sharing

In fact, this is still too many parameters, so the second trick comes in: weight sharing. In the local connections above, each neuron has 100 parameters and there are 1,000,000 neurons. If all 1,000,000 neurons share the same 100 parameters, the number of parameters drops to just 100.

How should weight sharing be understood? We can think of these 100 parameters (that is, the convolution kernel) as a way of extracting features that is independent of position. The underlying principle is that the statistical properties of one part of an image are the same as those of any other part. This also means that features learned in one part of the image can be used in another part, so the same learned features can be applied everywhere on the image.

More intuitively, when we randomly select a small patch of a large image, say 8×8, as a sample and learn some features from this small sample, we can then use the features learned from the 8×8 sample as detectors and apply them anywhere in the image. In particular, we can convolve the features learned from the 8×8 sample with the original large image, obtaining an activation value for each feature at every location of the large image.

The figure below shows the convolution of a 3×3 kernel over a 5×5 image. Each convolution kernel is a feature extractor, like a sieve that filters out the parts of the image that satisfy its condition (the larger the activation value, the better the match).
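As a small illustration, this sliding-window computation can be sketched in NumPy as follows (the image and kernel values are arbitrary examples):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, 'valid' mode) and return
    the map of activation values, as in the 5x5 / 3x3 illustration above."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25).reshape(5, 5)       # a toy 5x5 "image"
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])             # a 3x3 convolution kernel
print(conv2d_valid(image, kernel).shape)   # -> (3, 3)
```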

3.3. Multiple convolution kernels

The 100 parameters above correspond to a single 10×10 convolution kernel, and one kernel clearly does not extract enough features. We can add more convolution kernels, say 32, to learn 32 different features. The case of multiple kernels is shown in the figure below:

On the right, different colors indicate different convolution kernels. Each kernel turns the input image into another image; two kernels, for example, generate two images, which can be regarded as different channels of one image. (The figure has a small labeling error: w1 should read w0 and w2 should read w1; they are still referred to as w1 and w2 below.)

The figure below shows a convolution over four input channels using two convolution kernels, producing two output channels. Note that each of the four channels has its own kernel slice. Ignore w2 for the moment and look only at w1: the value at position (i, j) of w1's output is obtained by adding up the convolution results at (i, j) on the four channels and then applying the activation function.

Therefore, in the process above of convolving four channels into two channels, the number of parameters is 4×2×2×2, where 4 is the number of input channels, the first 2 is the number of output channels, and the final 2×2 is the size of each convolution kernel.
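In NumPy terms, this is just the size of a weight tensor with shape (kernel height, kernel width, input channels, output channels); the snippet below only checks the arithmetic:

```python
import numpy as np

# 2x2 kernels, 4 input channels, 2 output channels, as in the example above
w = np.zeros((2, 2, 4, 2))
print(w.size)  # -> 32 = 2 * 2 * 4 * 2 parameters
```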

3.4. Pooling (down-sampling)

After features have been obtained through convolution, the next step is to use them for classification. In theory one could feed all the extracted features into a classifier, such as a softmax classifier, but the computational cost of doing so is prohibitive.

For example, take a 96×96 pixel image and suppose we have learned 400 features, each defined over an 8×8 input. Convolving each feature with the image yields a (96 − 8 + 1) × (96 − 8 + 1) = 7,921-dimensional convolved feature map; with 400 features, each example produces a convolved feature vector of 7,921 × 400 = 3,168,400 dimensions. Training a classifier on more than 3 million features is unwieldy and prone to over-fitting.

To solve this problem, first recall that we decided to use convolution features because the image has a “static” property, which means that features that are useful in one area of the image are likely to be useful in another area.

Thus, a natural way to describe a large image is to aggregate statistics of the features at different locations; for example, one can compute the mean (or maximum) value of a particular feature over a region of the image. These summary statistics not only have a much lower dimension than the full set of extracted features, they also tend to improve results (they are less prone to over-fitting). This aggregation operation is called pooling, and depending on how it is computed it is called average pooling or max pooling.
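A short NumPy sketch of max pooling over non-overlapping 2×2 regions (the feature-map values are random placeholders):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Replace each non-overlapping size x size block of the feature map
    by its maximum value (max pooling)."""
    h, w = feature_map.shape
    h, w = h // size * size, w // size * size            # drop any ragged edge
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

fmap = np.random.rand(8, 8)      # a toy 8x8 convolved feature map
print(max_pool(fmap, 2).shape)   # -> (4, 4): 4x fewer values
```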

3.5. Multilevel convolution

In practice, several convolutional layers are usually stacked, followed by fully connected layers for training. The reason for using multiple convolutional layers is that the features learned by a single layer tend to be local; the deeper the layer, the more global the learned features become.

4. DQN algorithm description

A pure Q-learning algorithm stores Q-values in a table indexed by state, and the number of possible pixel states of a 1000×1000 image is effectively infinite. Combining a CNN with Q-learning gives the DQN algorithm, which is described as follows:
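In outline (following the DeepMind paper and the code in the next section): the Q table is replaced by a convolutional network $Q(s, a; \theta)$ that takes the last few preprocessed game frames as the state $s$ and outputs one Q-value per action. Actions are chosen ε-greedily, transitions $(s, a, r, s')$ are stored in an experience-replay memory, and random minibatches drawn from that memory are used to minimize, by gradient descent, the squared error between the network's prediction $Q(s, a; \theta)$ and the target

$$
y = r + \gamma \max_{a'} Q(s', a'; \theta)
$$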

5. Use DQN to train “catch bricks”

There are many open-source deep learning libraries, such as TensorFlow and Caffe. Here we use TensorFlow to train the brick-catching game.

Here’s a screenshot of the game:

The left and right mouse buttons move the slider left and right to catch the ball; if the ball touches the bottom, the game is over.

The main Python code is as follows (the game implementation itself is omitted; the focus is on the algorithmic code):

```python
import cv2
import numpy as np
import random
import tensorflow as tf
from collections import deque

# One-hot encodings of the three actions
MOVE_STAY = [1, 0, 0]
MOVE_LEFT = [0, 1, 0]
MOVE_RIGHT = [0, 0, 1]


class Game(object):
    def __init__(self):
        # Game initialization (the game implementation itself is omitted here)
        pass

    def step(self, action):
        # action is MOVE_STAY, MOVE_LEFT or MOVE_RIGHT.
        # Returns the reward and the pixels of the game screen.
        pass


LEARNING_RATE = 0.99    # discount factor used when building the Q target
INITIAL_EPSILON = 1.0   # initial exploration rate
FINAL_EPSILON = 0.05    # final exploration rate
EXPLORE = 500000        # steps over which epsilon is annealed (assumed; value missing in the source)
OBSERVE = 500           # steps of pure observation before training starts
REPLAY_MEMORY = 500000  # size of the experience replay memory
BATCH = 100             # number of records taken out per training step
output = 3              # output-layer neurons: MOVE_STAY:[1, 0, 0] MOVE_LEFT:[0, 1, 0] MOVE_RIGHT:[0, 0, 1]

input_image = tf.placeholder("float", [None, 80, 100, 4])  # stack of the last four game frames
action = tf.placeholder("float", [None, output])            # one-hot action placeholder


def convolutional_neural_network(input_image):
    # Zero initialization is kept from the original listing; random initialization
    # (e.g. tf.truncated_normal) generally trains better in practice.
    weights = {'w_conv1': tf.Variable(tf.zeros([8, 8, 4, 32])),
               'w_conv2': tf.Variable(tf.zeros([4, 4, 32, 64])),
               'w_conv3': tf.Variable(tf.zeros([3, 3, 64, 64])),
               'w_fc4': tf.Variable(tf.zeros([3456, 784])),
               'w_out': tf.Variable(tf.zeros([784, output]))}
    biases = {'b_conv1': tf.Variable(tf.zeros([32])),
              'b_conv2': tf.Variable(tf.zeros([64])),
              'b_conv3': tf.Variable(tf.zeros([64])),
              'b_fc4': tf.Variable(tf.zeros([784])),
              'b_out': tf.Variable(tf.zeros([output]))}

    conv1 = tf.nn.relu(tf.nn.conv2d(input_image, weights['w_conv1'], strides=[1, 4, 4, 1], padding="VALID") + biases['b_conv1'])
    conv2 = tf.nn.relu(tf.nn.conv2d(conv1, weights['w_conv2'], strides=[1, 2, 2, 1], padding="VALID") + biases['b_conv2'])
    conv3 = tf.nn.relu(tf.nn.conv2d(conv2, weights['w_conv3'], strides=[1, 1, 1, 1], padding="VALID") + biases['b_conv3'])
    conv3_flat = tf.reshape(conv3, [-1, 3456])
    fc4 = tf.nn.relu(tf.matmul(conv3_flat, weights['w_fc4']) + biases['b_fc4'])
    output_layer = tf.matmul(fc4, weights['w_out']) + biases['b_out']
    return output_layer


def train_neural_network(input_image):
    predict_action = convolutional_neural_network(input_image)

    argmax = tf.placeholder("float", [None, output])  # one-hot action actually taken
    gt = tf.placeholder("float", [None])              # Q target: r + gamma * max Q

    # Q-value of the action that was taken; squared TD error as the loss
    action = tf.reduce_sum(tf.mul(predict_action, argmax), reduction_indices=1)
    cost = tf.reduce_mean(tf.square(action - gt))
    optimizer = tf.train.AdamOptimizer(1e-6).minimize(cost)

    game = Game()
    D = deque()  # experience replay memory

    # Get the first frame, convert it to an 80x100 binary image and stack it four times
    _, image = game.step(MOVE_STAY)
    image = cv2.cvtColor(cv2.resize(image, (100, 80)), cv2.COLOR_BGR2GRAY)
    ret, image = cv2.threshold(image, 1, 255, cv2.THRESH_BINARY)
    input_image_data = np.stack((image, image, image, image), axis=2)

    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        saver = tf.train.Saver()

        n = 0
        epsilon = INITIAL_EPSILON
        while True:
            action_t = predict_action.eval(feed_dict={input_image: [input_image_data]})[0]
            argmax_t = np.zeros([output], dtype=np.int)

            # epsilon-greedy exploration with annealed epsilon
            if random.random() <= epsilon:
                maxIndex = random.randrange(output)
            else:
                maxIndex = np.argmax(action_t)
            argmax_t[maxIndex] = 1
            if epsilon > FINAL_EPSILON:
                epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE

            # Execute the action and preprocess the new frame
            reward, image = game.step(list(argmax_t))
            image = cv2.cvtColor(cv2.resize(image, (100, 80)), cv2.COLOR_BGR2GRAY)
            ret, image = cv2.threshold(image, 1, 255, cv2.THRESH_BINARY)
            image = np.reshape(image, (80, 100, 1))
            input_image_data1 = np.append(image, input_image_data[:, :, 0:3], axis=2)

            # Store the transition in the replay memory
            D.append((input_image_data, argmax_t, reward, input_image_data1))
            if len(D) > REPLAY_MEMORY:
                D.popleft()

            if n > OBSERVE:
                # Sample a minibatch of transitions and build the Q targets
                minibatch = random.sample(D, BATCH)
                input_image_data_batch = [d[0] for d in minibatch]
                argmax_batch = [d[1] for d in minibatch]
                reward_batch = [d[2] for d in minibatch]
                input_image_data1_batch = [d[3] for d in minibatch]

                gt_batch = []
                out_batch = predict_action.eval(feed_dict={input_image: input_image_data1_batch})
                for i in range(0, len(minibatch)):
                    gt_batch.append(reward_batch[i] + LEARNING_RATE * np.max(out_batch[i]))

                print("gt_batch:", gt_batch, "argmax:", argmax_batch)
                optimizer.run(feed_dict={gt: gt_batch, argmax: argmax_batch,
                                         input_image: input_image_data_batch})

            input_image_data = input_image_data1
            n = n + 1
            print(n, "epsilon:", epsilon, "action:", maxIndex, "reward:", reward)


train_neural_network(input_image)
```

6. Summary

At this point you should have a general understanding of reinforcement learning. The next step is to apply this technique to our own work and let it deliver the value it should.

