Click "Machine Learning Algorithm Engineer" to choose "Star Standard" public account, heavy dry goods, the first time deliveryCopy the code

I. The concept and main uses of reinforcement learning

1. What is reinforcement learning?

Reinforcement Learning is an important branch of machine learning. Its core idea is that an experimenter builds a complete experimental environment in which certain actions of a subject can be reinforced or encouraged by giving the subject observations and rewards, so that the outcome or goal the experimenter desires becomes more likely. From this description we can see that reinforcement learning must involve five key elements: the learner (the Agent), the Environment constructed by the experimenter (the system environment), the observation the agent receives (the environment State), the agent's Action, and the Reward (also called the return or feedback).

A classic psychological experiment helps explain these key elements: Pavlov's dog. Each time the experimenter rang a bell at the dog, he gave it a bit of food. Over time, the pairing of bell and food subtly shaped the dog's behavior: whenever the bell rang, the dog would involuntarily salivate and expect the experimenter to give it food. In this way the experimenter taught the dog the relationship between the bell and the food, which counts as a simple example of reinforcement learning.

From this case we can not only see the five key elements described above, but also derive a highly abstract reinforcement learning framework that contains them. In classic reinforcement learning, the agent interacts with the system environment built by the experimenter in a continuous cycle, which consists mainly of the following three steps:

1. At every moment, the environment is in some state, and the agent can obtain an observation of that current state;
2. Based on the observation of the current environment state and its own accumulated rules of behavior (commonly called its Policy), the agent takes an action;
3. The action in turn changes the state of the environment, and the agent receives both an observation of the new environment state and a reward for the action (which, of course, can be positive or negative). The agent then uses the new observation and the reward to choose its next action, and this cycle repeats until the experimenter's desired goal is reached.

The entire process in this highly abstract reinforcement learning framework is shown in Figure 1, and a minimal code sketch of the loop follows the figure.

Figure 1. The reinforcement learning interaction process
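As a concrete anchor for the framework above, here is a minimal Python sketch of the agent-environment loop. `env` and `agent` are hypothetical placeholder objects with assumed `reset`/`step`/`act`/`learn` methods, not any real library API:

```python
# A minimal sketch of the abstract agent-environment interaction loop.
# `env` and `agent` are hypothetical placeholders, not a real API.
def run_episode(env, agent):
    state = env.reset()                # step 1: observe the initial environment state
    done = False
    while not done:
        action = agent.act(state)      # step 2: act according to the current policy
        next_state, reward, done = env.step(action)     # step 3: the environment changes state
        agent.learn(state, action, reward, next_state)  # adjust the policy using the reward
        state = next_state             # the new observation drives the next action
```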

From the agent's point of view, the goal of reinforcement learning is to maximize reward. But this goal is abstract, so we need to make it easier to quantify, which brings up two notable characteristics of reinforcement learning. The first is continuous trial and error. When the agent acts on the current environment state, it sometimes receives a large reward, sometimes a small one, and sometimes even a negative one, so it must keep adjusting its policy according to these rewards in order to collect as much reward as possible. In this process the agent has to keep trying the various actions possible in each environment state and record the corresponding rewards; only with this feedback can it learn the task well. The second is the pursuit of long-term return rather than short-term gain (for example, in order to win a game in the end, a player may make moves that look bad in the moment, such as letting pieces be captured). This usually requires the agent to interact with the system environment for a long time, so pursuing long-term return demands a great deal of exploration and repeated attempts, and may involve many failures.

Because of these two characteristics, when evaluating the merits of a reinforcement learning algorithm we should pay attention not only to the conventional metrics (such as effectiveness, stability, and generalization) but also to another one: learning time. Since reinforcement learning involves constant trial and error and emphasizes long-term reward, learning time can generally be measured by the number of attempts and explorations the algorithm needs. Reinforcement learning can therefore be neatly summarized as: given environment states, actions, and rewards, learn by continuous trial and error the policy that yields the best actions for the agent, aiming at the final result rather than the immediate reward of a particular action, and weighing the potential future return the action can bring.
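To make "long-term return" concrete, it is usually quantified as the discounted sum of future rewards, with a discount factor gamma between 0 and 1 that makes near-term rewards count more than distant ones. A minimal sketch:

```python
# Discounted return: rewards further in the future are weighted by higher
# powers of gamma, so they count for less than immediate rewards.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):   # accumulate from the last reward backwards
        g = r + gamma * g
    return g

# Example: the single reward arrives two steps in the future.
print(discounted_return([0.0, 0.0, 1.0]))  # 0.99 ** 2 = 0.9801
```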

2. What can reinforcement learning be used for?

Reinforcement learning is mainly used to solve sequential decision problems, because it can learn how to achieve the goals we set in a complex, uncertain environment. Its application scenarios are very broad, covering nearly every problem that requires making decisions: motor control for a robot performing a specific task, autonomous driving choosing the best action for the current road conditions (such as decelerating or changing direction), product pricing and inventory management, and playing video games or board games. The most famous example, which set off the current wave of reinforcement learning research, is AlphaGo, developed by Google's DeepMind team by combining a Policy Network, a Value Network, and Monte Carlo Tree Search. It rose to fame by defeating Lee Se-dol, a human world champion of Go.

Two important families of reinforcement learning methods are policy-based methods (e.g., Policy Gradients) and value-based methods (e.g., Q-Learning). The main difference is that a policy-based method directly predicts the action that should be taken in a given environment state, while a value-based method predicts the expected value of every action in that state (the Q Value) and then executes the action with the highest Q value. Generally speaking, value-based methods suit problems with only a small number of discrete actions, while policy-based methods are more general and suit problems with many action types or continuous-valued actions.
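As an illustration of the value-based family, here is a minimal tabular Q-learning sketch. The state/action counts and hyperparameters are arbitrary assumptions, chosen only for illustration:

```python
import numpy as np

n_states, n_actions = 16, 4             # illustrative sizes for a small grid world
Q = np.zeros((n_states, n_actions))     # table of expected values: one Q per (state, action)
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # learning rate, discount factor, exploration rate

def choose_action(state):
    # Epsilon-greedy: usually take the highest-Q action, occasionally explore.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def update(state, action, reward, next_state):
    # Move Q(s, a) toward reward + gamma * max over a' of Q(s', a').
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])
```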

II. Distinguishing several common machine learning algorithms

Supervised learning is a classical machine learning method. Its core idea is to learn, from a certain number of training samples, a model that produces the corresponding output for a given input. Crucially, these training samples consist of paired inputs and known outputs, and supervised learning uses such input-output pairs to fit the model's parameters (such as connection weights) and thereby complete learning. From the perspective of the learning objective, then, supervised learning hopes the learned model can produce the correct output for a given input, while reinforcement learning hopes the agent can find the actions that maximize return in a given environment state.

As the description above suggests, the effectiveness of supervised learning depends not only on the training data but also on the features extracted from that data, because these algorithms must learn the correlation between each feature and the prediction target. It is no exaggeration to say that different representations of the same training data can greatly affect the results; once data representation and feature extraction are handled well, 90% of a supervised learning problem is solved. However, feature extraction is not a simple task. For some complex problems, effective feature sets must be designed by hand, which not only takes a great deal of time and effort but sometimes still fails to capture the essential features. So can we rely on the computer to extract features automatically? This is where deep learning comes in. Deep learning is essentially a synonym for deep artificial neural networks, which have found excellent applications in industry and academia in image recognition, speech recognition, natural language processing, and computer game playing; it is an important branch of supervised learning. Deep learning solves two core problems: first, like other supervised learning methods, it learns the correlation between features and predictions; second, it can automatically combine simple features into more complex ones. In other words, deep learning can learn more complex feature representations from the data itself, which makes training the connection weights of a neural network model simpler and more effective, as shown in Figure 2.

Figure 3 shows an example of deep learning applied to image recognition. Starting from raw image pixels, the network progressively composes features such as lines, edges, corners, simple shapes, and finally complex shapes. This ability to transform simple features into more complex ones layer by layer is what makes images of different categories easier to distinguish.

Figure 2. Comparison of the deep learning and traditional supervised learning pipelines

Figure 3. Example deep learning pipeline for image recognition

In addition, deep reinforcement learning, which combines deep learning with reinforcement learning, has become a research hotspot in recent years, with applications such as autonomous driving, robots executing tasks autonomously, and AI game playing. A deep reinforcement learning system is essentially a neural network: first a deep learning component, such as a convolutional neural network, processes and analyzes images captured by a camera, which amounts to letting the agent see its surroundings and correctly identify objects; then a reinforcement learning algorithm predicts the sequence of actions that maximizes return, so that the assigned task can be completed.
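Following this division of labor, such a network often stacks convolutional layers (the "eyes") under a head that scores actions. Here is a minimal DQN-style sketch, assuming TensorFlow 2.x; the frame shape and layer sizes are illustrative assumptions:

```python
import tensorflow as tf

def build_q_network(n_actions, frame_shape=(84, 84, 4)):
    return tf.keras.Sequential([
        # Convolutional layers process raw camera frames (perception).
        tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu",
                               input_shape=frame_shape),
        tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),
        # One output per action: the predicted Q value of taking it (decision).
        tf.keras.layers.Dense(n_actions),
    ])
```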

There is another very important class of machine learning algorithms in artificial intelligence: unsupervised learning. These algorithms analyze data that carries no labeled outputs and build suitable models to solve problems even without labeled training samples. Common unsupervised learning algorithms include data transformations that reduce the dimensionality of the sample features and cluster analysis that divides samples into distinct groups.
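Both unsupervised tasks can be sketched in a few lines with scikit-learn; note that no output labels are used anywhere:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = load_iris().data                          # 150 samples, 4 features; labels ignored
X_2d = PCA(n_components=2).fit_transform(X)   # data transformation: reduce to 2 dimensions
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X_2d)  # cluster analysis: 3 groups
print(clusters[:10])                          # group index assigned to each sample
```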

From the descriptions above, we can see that reinforcement learning differs from both supervised and unsupervised learning. Reinforcement learning neither has the perfectly explicit learning objective of supervised learning (each input corresponds to a definite output), nor lacks an objective entirely, as unsupervised learning does; its objective is simply less explicit, because in a given environment state there may be many actions that obtain the maximum return. So these kinds of machine learning algorithms differ fundamentally in how explicit their learning objectives are. In addition, viewed along the time dimension, the outputs of reinforcement learning and supervised learning mean different things. Supervised learning focuses on how well outputs match inputs: even for a sequence of input-to-output mappings, supervised learning wants the output at every moment to match the corresponding input. Take Tetris as an example: a supervised learning model would take each frame or state of the game as input, and the corresponding output would be fixed, either moving the block or rotating it. But this approach is rather rigid, because there is certainly more than one action sequence that leads to a high final score.

The main value of reinforcement learning, however, lies in maximizing return. As the agent interacts with the environment, not every action receives a reward; after the agent completes a full interaction with the environment, it has produced an action sequence, but which actions in that sequence contributed positively to the final reward and which contributed negatively is sometimes genuinely difficult to determine. In Go, for instance, the agent may deliberately make moves that look bad and let the opponent capture pieces, a sacrifice made to reach the final goal. The advantage of reinforcement learning is therefore that it imposes fewer constraints during learning: although the feedback guiding each action is less direct than in supervised learning, this lowers the difficulty of abstracting the problem, and reinforcement learning attends to the overall return of the whole action sequence rather than the immediate benefit of any single step.

In fact, there is a learning method whose goal is the same as reinforcement learning's, maximizing the long-term return, but whose learning process resembles supervised learning: the model learns the logic of single-step decisions from a large amount of collected sample data. This kind of algorithm is called imitation learning. As shown in Figure 4, imitation learning proceeds as follows: (1) find an "expert system" to stand in for the agent in the interaction with the environment, and record a series of interaction sequences; (2) treat these interaction sequences as the "standard answers" for the corresponding environment states, so that supervised learning can fit the model to this data and learn to match environment states with the actions the expert takes. A minimal sketch of this procedure follows the figure.

Figure 4. Execution flow chart of imitation learning
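A minimal behavior-cloning sketch of the two steps, assuming TensorFlow 2.x; the expert data here is random placeholder data, purely to show the shape of the procedure:

```python
import numpy as np
import tensorflow as tf

# Step 1 stand-in: state-action pairs an expert system would have produced.
expert_states = np.random.rand(1000, 4).astype("float32")
expert_actions = np.random.randint(0, 2, size=1000)   # the "standard answer" actions

# Step 2: fit a supervised classifier mapping environment states to expert actions.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(2),                         # one logit per possible action
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.fit(expert_states, expert_actions, epochs=5, verbose=0)
```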

Imitation learning can achieve good results on some problems, but it also has drawbacks: (1) there must be an expert in the problem domain, since all training samples are generated by the expert system interacting with the environment; (2) there must be enough training samples, otherwise it is hard to learn an effective action-policy model; (3) the learned model must generalize well enough, otherwise observations that never appeared in the training samples may be encountered in practice and lead an agent with insufficient generalization into serious decision errors. Solving these problems means working on both the training samples and the model, and in reality none of the three is easy to solve, so imitation learning is far from easy. This is why most research has shifted toward reinforcement learning, in the hope that it can solve what imitation learning cannot. To sum up, the relationships among several common algorithms in artificial intelligence are shown in Figure 5 below:

Figure 5. Relationships among several common algorithms in artificial intelligence

Currently, the machine learning framework attracting the widest attention in industry and academia is TensorFlow, officially open-sourced by Google on November 9, 2015. Compared with other open-source machine learning tools, TensorFlow supports implementing a wide range of algorithms, including deep learning and reinforcement learning. TensorFlow is both an interface for expressing machine learning algorithms and a framework for executing them, and it performs well on many fronts: the conciseness of the code developers write to define network structures, the execution efficiency of distributed training, and the convenience of deploying trained models.
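A tiny sketch of that dual role, assuming TensorFlow 2.x: the Python code describes a computation, and TensorFlow executes it, including automatic differentiation:

```python
import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x * x + 2.0 * x          # interface: describe the computation
grad = tape.gradient(y, x)       # framework: execute it, computing dy/dx = 2x + 2
print(grad.numpy())              # 8.0 at x = 3
```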

In addition, another important framework for reinforcement learning is Gym, together with Baselines, a set of algorithms built on top of it. Gym is a platform that integrates many reinforcement learning experiment environments; with it, researchers can easily set up the simulation environment their work needs and concentrate on the main task of learning action policies. Baselines implements a number of classic reinforcement learning algorithms on top of TensorFlow and Gym. In short, Gym provides the functionality for the system-environment element of reinforcement learning, while Baselines provides functionality for the agent element. The material above is a brief overview of the definition of reinforcement learning, its highly abstract framework, its characteristics and main uses, the essential differences between reinforcement learning and the other major machine learning algorithms, and the frameworks commonly used to build machine learning environments.
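As a closing example, here is a minimal Gym interaction loop with a random placeholder policy, assuming the classic pre-0.26 Gym API:

```python
import gym

env = gym.make("CartPole-v1")        # Gym supplies the system-environment element
obs = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()          # random stand-in for a learned agent
    obs, reward, done, info = env.step(action)  # new observation and reward come back
    total_reward += reward
env.close()
print("episode return:", total_reward)
```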

