OpenAI recently released a new reinforcement learning algorithm, Proximal Policy Optimization (PPO), which not only matches or exceeds the performance of current state-of-the-art methods but is also much easier to implement and tune. Because PPO is easy to use and performs well, OpenAI has made it its default reinforcement learning algorithm.



Video address: https://v.qq.com/x/page/r0527kvbyeb.html


PPO (Proximal Policy Optimization) lets us train AI policies in challenging environments. The Roboschool agent shown above (Roboschool is open-source robot simulation software integrated with OpenAI Gym) learns to walk, run, and turn as it tries to reach a target (the pink sphere). It also has to learn to use its own momentum to regain its balance after being hit by the white blocks, and to get up off the ground when knocked over.


Code address:

https://github.com/openai/baselines


Deep neural networks can be used to control video games, 3D locomotion, or Go. Recent breakthroughs rely on what are known as policy gradient methods, but getting good results with them is difficult because they are sensitive to the choice of step size. If the step size is too small, training progress is very slow; if it is too large, the signal gets overwhelmed by noise, and performance may even collapse catastrophically. These policy gradient methods also have very poor sample efficiency, taking millions (or billions) of timesteps to learn simple tasks.
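To make the step-size sensitivity concrete, here is a minimal, self-contained sketch (not taken from the original post) of a vanilla policy gradient update on a toy three-armed bandit, written in NumPy. The names and constants here are purely illustrative; step_size is exactly the hyperparameter whose choice the paragraph above describes as delicate.

import numpy as np

rng = np.random.default_rng(0)

# Toy three-armed bandit: expected reward of each action.
true_means = np.array([0.1, 0.5, 0.8])
theta = np.zeros(3)      # logits of a softmax policy
step_size = 0.1          # too small -> slow progress; too large -> noisy, unstable updates

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

for _ in range(500):
    probs = softmax(theta)
    action = rng.choice(3, p=probs)
    reward = rng.normal(true_means[action], 1.0)
    # Gradient of log pi(action) with respect to the logits of a softmax policy.
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    # Vanilla policy gradient (REINFORCE) ascent step, no baseline.
    theta += step_size * reward * grad_log_pi

print(softmax(theta))    # concentrates on the best arm only if step_size was chosen well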


Researchers have tried to remove these drawbacks with approaches such as TRPO and ACER, which constrain or otherwise optimize the size of a policy update. These methods have drawbacks of their own. ACER is far more complicated than PPO, requiring additional code for off-policy corrections and a replay buffer, yet it performs only marginally better than PPO on the Atari benchmark.


TRPO, while useful for continuous control tasks, is not easily compatible with algorithms that share parameters between the policy and value functions or use auxiliary losses, which are often needed to solve problems in Atari and other domains where visual input is important.



Proximal Policy Optimization (PPO)



With supervised learning, we can easily implement a cost function, run gradient descent on it, and be confident of getting very good results after only a little hyperparameter tuning.



Success in reinforcement learning is not so straightforward. Algorithms have many moving parts that are hard to debug, and they require substantial tuning effort to get good results. PPO strikes a balance between ease of implementation, sample complexity, and ease of tuning: at each step it tries to compute an update that minimizes the cost function while ensuring that the deviation from the previous policy stays relatively small.


We previously detailed a variant of PPO that uses an adaptive KL penalty to control the change of the policy at each iteration. The new variant uses a novel objective function not typically found in other algorithms:
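This is the clipped surrogate objective from the PPO paper:

L^{CLIP}(\theta) = \hat{E}_t \left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_t \right) \right]

where \theta is the policy parameter, \hat{E}_t denotes the empirical average over timesteps, r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t) is the probability ratio between the new and old policies, \hat{A}_t is the estimated advantage at time t, and \varepsilon is a hyperparameter, usually 0.1 or 0.2.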




This objective function implements a trust-region update that is compatible with stochastic gradient descent, and it simplifies the algorithm by removing the KL penalty and the need for adaptive updates. In tests, the algorithm showed the best performance on continuous control tasks and almost matches ACER's performance on Atari, despite being far simpler to implement.
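As an illustration (this is a sketch, not the actual baselines implementation), the clipped objective can be computed in a few lines of NumPy; ppo_clip_objective and its arguments are hypothetical names used only for this example.

import numpy as np

def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    # ratio     = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t), one entry per sample
    # advantage = estimated advantage A_hat_t, one entry per sample
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Taking the elementwise minimum gives a pessimistic bound on the unclipped
    # objective, which removes the incentive to move the policy far away from the
    # old one in a single update; in practice the negative of this value is
    # minimized with stochastic gradient descent.
    return np.minimum(unclipped, clipped).mean()

# Toy batch of probability ratios and advantage estimates.
ratios = np.array([0.9, 1.0, 1.3, 0.6])
advantages = np.array([1.0, -0.5, 2.0, -1.0])
print(ppo_clip_objective(ratios, advantages))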



Controllable, complex robots



Video address: https://v.qq.com/x/page/r0527kvbyeb.html


Agents trained with PPO develop flexible movement policies, improvising turns and tilts as they move toward a target position.



We created interactive agents based on policies trained with PPO: using the keyboard, we can set new target positions for the robot in a Roboschool environment. Although the input sequences differ from those the agent was trained on, it manages to generalize.


We have also used PPO to teach complicated, simulated robots to walk, such as the Atlas model from Boston Dynamics. That model has 30 distinct joints, versus 17 for the bipedal robot. Other researchers have used PPO to train simulated robots to perform impressive feats of parkour while hurdling over obstacles.




Video address: https://shimo.im/doc/8TXiF6q7K3kbOTv4/

Baselines: PPO and TRPO



This baselines release includes scalable, parallel implementations of PPO and TRPO, both of which use MPI for data passing and are built on Python 3 and TensorFlow. We are also adding pre-trained versions of the policies used to train the robots above to the Roboschool Agent Zoo.
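As a rough usage sketch (the module paths below are an assumption based on the baselines repository layout at the time of this release and may have changed; check the README in the repository linked above), the parallel PPO implementation could be launched under MPI roughly like this:

# assumed entry points; verify against the current baselines README
mpirun -np 8 python -m baselines.ppo1.run_atari
python -m baselines.ppo1.run_mujoco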


OpenAI is looking for people to help build and optimize its reinforcement learning algorithm codebase. If you are interested in reinforcement learning, benchmarking, thorough experimentation, and open source, please apply at the following address and mention that you read the PPO post: https://jobs.lever.co/openai/5c1b2c12-2d18-42f0-836e-96af2cfca5ef


Original address:

https://blog.openai.com/openai-baselines-ppo/