From OpenAI, compiled by Machine Heart.

In 2017, OpenAI's bot defeated the world's top Dota 2 player in a 1v1 solo match at The International (TI). After a year of development, OpenAI announced yesterday that its AI has begun beating teams of amateur human players in 5v5 matches, and it plans to take on top professional teams later this year. Machine Heart compiles and introduces the content of OpenAI's blog post.

The model our team built, OpenAI Five, has already begun to defeat amateur Dota 2 teams. Although we currently play with restrictions, we plan to defeat a team of top professionals at The International in August, using a limited pool of heroes. We may not succeed: Dota 2 is one of the most popular and complex esports games in the world, and a group of passionate and creative players train year-round to compete for a prize pool of over $40 million.

Through self-play, OpenAI Five plays the equivalent of 180 years of games against itself every day. It trains using a scaled-up version of Proximal Policy Optimization running on 256 GPUs and 128,000 CPU cores, a larger-scale version of the solo Dota 2 system we built last year. Using a separate LSTM for each hero and no human data, the model learns recognizable strategies. This indicates that reinforcement learning can produce long-term planning at large but achievable scale, without fundamental algorithmic advances. This is not what we expected when we started.

The problem

One milestone for artificial intelligence is to surpass human-level performance in a complex video game like StarCraft or Dota. Compared with previous milestones such as chess and Go, complex video games begin to capture the messy, continuous nature of the real world. Our hope, therefore, is that systems which solve complex video games will become general systems with broad applications outside of games.

Dota 2 is a real-time 5v5 strategy game in which each player controls one hero. An AI that plays Dota must handle the following:

  • Long time horizons. Dota games run at 30 frames per second for an average of 45 minutes, about 80,000 ticks per game. Most actions (such as ordering a hero to move) have small individual effects, but some individual actions, such as teleporting home, can affect the game strategically, and some strategies play out over an entire game. OpenAI Five observes every fourth frame, yielding about 20,000 decisions per game (a back-of-envelope sketch follows this list). By contrast, chess usually ends before 40 moves and Go before about 150, but almost every move in those games is strategic.
  • Partially observed state. Your units and buildings can only see the area around them. The rest of the map has no vision and may conceal enemies and enemy tactics. Strong play often requires reasoning from incomplete data and modeling the opponent's intent. Chess and Go, by contrast, are perfect-information games.
  • High-dimensional, continuous action space. In Dota, each hero can take dozens of actions, many of which target either an enemy unit or a position on the ground. We discretize this space into 170,000 possible actions per hero (not all are valid each tick, for example casting a spell that is on cooldown); ignoring the continuous parts, there are on average about 1,000 valid actions per tick. The average number of actions in chess is about 35, and in Go about 250.
  • High-dimensional, continuous observation space. Dota is played on a large map containing 10 heroes, more than 20 towers, dozens of NPC units, and features such as runes, trees, and wards. Using the Bot API from Valve (the company behind Dota 2), our model observes the state of a game as roughly 20,000 numbers representing all the information a human is allowed to access. A chess board is naturally represented as about 70 enumeration values (an 8×8 board, 6 piece types, plus some historical information); a Go board as about 400 enumeration values (a 19×19 board, 2 piece colors, plus ko).
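As a rough consistency check on the figures above, here is a small back-of-envelope sketch (ordinary Python, not OpenAI code); the constants are taken from the list above.

```python
# Rough arithmetic behind the numbers quoted above (a sketch, not OpenAI code).
FPS = 30                # Dota 2 runs at 30 frames (ticks) per second
GAME_MINUTES = 45       # average game length
OBSERVE_EVERY = 4       # OpenAI Five acts on every fourth frame

ticks_per_game = FPS * 60 * GAME_MINUTES               # about 81,000 ticks
decisions_per_game = ticks_per_game // OBSERVE_EVERY    # about 20,000 decisions

print(ticks_per_game, decisions_per_game)
```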

The rules of Dota are also very complex. The game has been under development for over a decade, with the game logic implemented in hundreds of thousands of lines of code. It is also updated roughly every two weeks, so the semantics of the environment keep changing.

Methods

Our system learns using a massively scaled-up version of the Proximal Policy Optimization (PPO) algorithm. Both OpenAI Five and the 1v1 bot before it learn entirely from self-play. They start with random parameters and do not use search or bootstrap from human players.
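The blog post does not include code, but the core of PPO is its clipped surrogate objective. The following is a minimal, generic sketch of that objective written with PyTorch; it is an illustration of the standard algorithm, not OpenAI's actual implementation.

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Minimal sketch of PPO's clipped surrogate objective (generic, not OpenAI's code)."""
    ratio = torch.exp(new_logp - old_logp)            # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # PPO maximizes the minimum of the two surrogates; we return the negative for a minimizer.
    return -torch.min(unclipped, clipped).mean()
```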

Reinforcement learning researchers (including ourselves) have generally believed that learning over long time horizons would require fundamentally new algorithmic breakthroughs, such as hierarchical reinforcement learning. Our results suggest that we have not been giving today's algorithms enough credit, at least when they are run at sufficient scale and with a reasonable way of exploring.

Our agents are trained to maximize the exponentially decayed sum of future rewards, weighted by an exponential decay factor called γ. During the latest training run of OpenAI Five, we annealed γ from 0.998 (a half-life of 46 seconds) to 0.9997 (a half-life of about five minutes). By comparison, the longest horizon in OpenAI's Proximal Policy Optimization (PPO) paper was a half-life of 0.5 seconds, the longest in DeepMind's Rainbow paper was 4.4 seconds, and Google Brain's Observe and Look Further paper used a half-life of 46 seconds.
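The half-lives quoted above follow directly from γ and the decision rate. A minimal sketch of the conversion, assuming one decision every fourth frame at 30 fps (7.5 decisions per second, as described earlier):

```python
import math

def half_life_seconds(gamma, decisions_per_second=30 / 4):
    """Seconds until future rewards are discounted by half, assuming 7.5 decisions/second."""
    steps = math.log(0.5) / math.log(gamma)   # number of decisions for gamma**steps == 0.5
    return steps / decisions_per_second

print(half_life_seconds(0.998))    # ~46 seconds
print(half_life_seconds(0.9997))   # ~308 seconds, about 5 minutes
```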

While the current OpenAI Five is weak at last-hitting (watching our test matches, professional Dota commentator Blitz estimated it at around the median for Dota players), it is quite good at prioritizing objectives. Gaining long-term rewards (such as strategic map control) often requires sacrificing short-term rewards (such as gold from farming), because grouping up to push takes time. This observation reinforces our belief that the system will keep improving over time.

The proposed framework

Each of OpenAI Five's networks contains a single-layer, 1024-unit LSTM that sees the current game state (extracted from Valve's Bot API) and emits actions through several possible action heads. Each head has semantic meaning, for example the number of ticks to delay the action, which action to select, and the X and Y coordinates of that action.
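To make the description above concrete, here is a rough PyTorch sketch of such a policy: a single-layer 1024-unit LSTM followed by several action heads. Everything except the LSTM width is an assumption for illustration (the encoder, the head names, and the head sizes are not OpenAI's published values).

```python
import torch
import torch.nn as nn

class HeroPolicy(nn.Module):
    """Rough sketch of the described architecture, not OpenAI Five's actual network."""

    def __init__(self, obs_dim=20_000, hidden=1024):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)          # compress the observation (illustrative)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=1, batch_first=True)
        # One head per component of the action; names and sizes are assumptions.
        self.delay_head = nn.Linear(hidden, 4)             # how many ticks to delay the action
        self.action_head = nn.Linear(hidden, 170)          # which action to select
        self.offset_x_head = nn.Linear(hidden, 9)          # X coordinate of the action
        self.offset_y_head = nn.Linear(hidden, 9)          # Y coordinate of the action

    def forward(self, obs, state=None):
        # obs: (batch, time, obs_dim)
        x = torch.relu(self.encoder(obs))
        x, state = self.lstm(x, state)
        heads = {
            "delay": self.delay_head(x),
            "action": self.action_head(x),
            "offset_x": self.offset_x_head(x),
            "offset_y": self.offset_y_head(x),
        }
        return heads, state
```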

OpenAI's website provides an interactive demonstration of the observation space and the action space. OpenAI Five represents the world as a list of about 20,000 numbers and takes an action by emitting a list of eight enumeration values. In the demo, you can select different actions and targets to see how OpenAI Five encodes each action and how it observes the world. The image below shows the scene as a human would see it:

(Figure: Necrophos)

OpenAI Five can react to missing pieces of state if they correlate with what it does see. For example, until recently OpenAI Five's observations did not include the areas where shrapnel falls, even though humans can easily see them on screen. Nevertheless, we observed OpenAI Five learning to walk out of active shrapnel zones, since the agents could see their health dropping.

Exploration

Although the learning algorithm is built to handle long time horizons, we still need to explore the environment. Even with our restrictions, the game still has hundreds of items, dozens of buildings, spells, unit types, and game mechanics that take time to learn, many of which combine into an extremely large number of situations. Effectively exploring this huge combinatorial space is therefore very difficult.

OpenAI Five learns from self-play, starting from random weights, which provides a natural curriculum for exploring the environment. To avoid "strategy collapse," the agent trains 80% of its games against itself and 20% against its past selves. In the first games, the heroes wander aimlessly around the map; after a few hours of training, concepts such as laning, farming, and fighting over the mid lane emerge. Over the course of a few days, the agents consistently adopt basic human strategies: trying to steal bounty runes from opponents, walking to their towers to farm, and rotating heroes around the map to gain a lane advantage. With further training, they become proficient at advanced strategies such as pushing towers with all five heroes.
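A minimal sketch of the 80/20 opponent-sampling scheme described above; the function and variable names are illustrative, and the details of how OpenAI snapshots past selves are assumptions.

```python
import random

def sample_opponent(current_params, past_params_pool, past_fraction=0.2):
    """Train 80% of games against the current agent, 20% against a pool of past selves."""
    if past_params_pool and random.random() < past_fraction:
        return random.choice(past_params_pool)   # a snapshot of an earlier version of the agent
    return current_params                        # mirror match against the current self
```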

In 2017, our first agent beat scripted bots but still could not beat humans. To force exploration in strategy space, we randomized the units' properties (health, speed, starting level, etc.) during training only, and the agent began beating humans. Later, when a test player repeatedly beat our 1v1 bot, we increased the randomization and the test player began to lose. In addition, our robotics team applies similar randomization techniques to physical robots in order to transfer policies from simulation to the real world.
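The exact properties and ranges used are not given in the post; the sketch below illustrates the idea of training-time randomization of unit properties, with hypothetical field names and ranges.

```python
import random

def randomize_unit_stats(unit, scale=0.2):
    """Illustrative training-time randomization of unit properties (ranges are assumptions)."""
    unit["health"] *= random.uniform(1 - scale, 1 + scale)   # perturb health by +/-20%
    unit["speed"] *= random.uniform(1 - scale, 1 + scale)    # perturb movement speed
    unit["start_level"] = random.randint(1, 3)               # randomize the starting level
    return unit
```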

OpenAI Five uses the randomizations we wrote for the 1v1 bot, plus a new "lane assignment" randomization. At the start of each training game, we randomly "assign" each hero to some subset of the lanes and penalize the agent whenever it strays from them, until a randomly chosen time in the game.
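A sketch of how such a lane-assignment randomization might look; the time range and penalty magnitude are assumptions, not published values.

```python
import random

LANES = ["top", "mid", "bottom"]

def assign_lanes(heroes, max_release_seconds=20 * 60):
    """Assign each hero a random subset of lanes and pick a random time when the constraint ends."""
    assignments = {hero: random.sample(LANES, k=random.randint(1, len(LANES))) for hero in heroes}
    release_time = random.uniform(0, max_release_seconds)   # range is an assumption
    return assignments, release_time

def lane_penalty(hero, current_lane, assignments, t, release_time, penalty=-0.02):
    """Penalize the agent whenever it strays from its assigned lanes before the release time."""
    if t < release_time and current_lane not in assignments[hero]:
        return penalty   # penalty magnitude is illustrative
    return 0.0
```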

Exploration is also helped by a good reward. Our reward consists mostly of the metrics humans track to judge how they are doing in the game: net worth, kills, deaths, assists, last hits, and the like. We post-process each agent's reward by subtracting the other team's average reward, which prevents the agents from finding positive-sum situations.
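A small sketch of the two ideas in the paragraph above: a shaped reward built from human-interpretable statistics, and the zero-sum post-processing. The weights and stat keys are assumptions for illustration, not OpenAI's published values.

```python
def shaped_reward(stats):
    """Illustrative shaped reward from game metrics; weights are assumptions."""
    return (0.001 * stats["net_worth_delta"]
            + 1.0 * stats["kills"] - 1.0 * stats["deaths"]
            + 0.5 * stats["assists"] + 0.1 * stats["last_hits"])

def zero_sum(team_rewards, enemy_rewards):
    """Subtract the opposing team's average reward, as described above."""
    enemy_mean = sum(enemy_rewards) / len(enemy_rewards)
    return [r - enemy_mean for r in team_rewards]
```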

Coordination

OpenAI Five has no explicit communication channel between the individual heroes' neural networks. Teamwork is controlled by a hyperparameter we call "team spirit." Team spirit ranges from 0 to 1 and weights how much each OpenAI Five hero cares about its own individual reward function versus the average of the team's reward functions. During training, we gradually anneal its value from 0 to 1.
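A minimal sketch of the team-spirit blend described above; the linear interpolation form is an assumption consistent with the description, not a confirmed formula.

```python
def blended_reward(own_reward, team_rewards, team_spirit):
    """0 = purely individual reward, 1 = team-average reward only (interpolation is assumed)."""
    team_mean = sum(team_rewards) / len(team_rewards)
    return (1 - team_spirit) * own_reward + team_spirit * team_mean
```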

Rapid

Our system is implemented on top of a general-purpose reinforcement learning training system called Rapid, which can be applied to any Gym environment. We also use Rapid to solve other problems at OpenAI, including competitive self-play.
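Rapid itself is not public in this post; as a toy illustration of the "rollout worker" side of such a system, here is a sketch against a standard Gym environment using the classic Gym step interface (the environment name and horizon are arbitrary, and a real worker would act with the current policy rather than random actions).

```python
import gym

def rollout_worker(env_name="CartPole-v1", horizon=256):
    """Toy rollout worker (not Rapid): collect a fixed-length trajectory to ship to optimizers."""
    env = gym.make(env_name)
    obs = env.reset()
    trajectory = []
    for _ in range(horizon):
        action = env.action_space.sample()               # a real policy would choose the action
        next_obs, reward, done, info = env.step(action)  # classic (pre-0.26) Gym API
        trajectory.append((obs, action, reward, done))
        obs = env.reset() if done else next_obs
    return trajectory
```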

Diagram of training system

We have implemented Rapid on Kubernetes, Azure, and GCP backends.

The game

So far, we’ve played against these teams:

  1. Strongest OpenAI staff team: matchmaking rating around 2,500.
  2. Audience players who watched the OpenAI staff match (including Blitz and others): ratings of 4,000-6,000, though they had never played together as a team.
  3. Valve employee team: ratings of 2,500-4,000.
  4. Amateur team: rated 4,200, trains together as a team.
  5. Semi-professional team: rated 5,500, trains together as a team.

OpenAI Five won its games against the first three teams and lost to the last two, though it won two of its first three games against each.

Here are a few things we’ve observed about OpenAI Five:

They repeatedly sacrificed their own safe lane (the Dire's top lane, the Radiant's bottom lane) in exchange for controlling the enemy's safe lane, forcing the fight onto the side that is harder for the opponent to defend. This strategy has emerged in the professional scene over the past few years and has become a popular tactic. Blitz said he only learned it after eight years of playing Dota, when the professional team Team Liquid told him about it.

They pushed the transition from early game to mid game faster than their opponents. They did this by: (1) setting up multiple successful ganks when human players wandered out of position, and (2) grouping up to take towers before the opponents could organize a defense.

They also deviated from the mainstream style in a few areas, such as giving support heroes (who usually do not get resources first) lots of gold and experience early on. OpenAI Five's prioritization lets its damage peak sooner, building a larger advantage, winning team fights, and capitalizing on the opponent's mistakes to secure a quick victory.

Differences from humans

OpenAI Five is given the same information as a human, but it instantly sees data such as positions, health, and item inventories that a human player has to check manually. Our method is not fundamentally tied to observing state, but rendering pixels from the game would require thousands of GPUs.

OpenAI Five averages 150-170 actions per minute (APM), with a theoretical maximum of 450 because it observes every fourth frame. Frame-perfect timing is possible for skilled players but trivial for the machine. OpenAI Five has an average reaction time of 80 milliseconds, faster than humans.
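The theoretical maximum of 450 APM follows from the observation rate quoted earlier; a quick arithmetic sketch:

```python
FPS = 30                                    # game frames per second
OBSERVE_EVERY = 4                           # OpenAI Five acts on every fourth frame

actions_per_second = FPS / OBSERVE_EVERY    # 7.5 decisions per second
theoretical_max_apm = actions_per_second * 60
print(theoretical_max_apm)                  # 450.0, matching the figure above
```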

These differences mattered most in 1v1 (where our bot had a reaction time of 67 milliseconds), but the playing field was relatively even, since we have seen humans learn from and adapt to the bot's play. After last year's TI, many professional players trained with our 1v1 bot for months. According to William "Blitz" Lee (a former Dota 2 professional and coach), the 1v1 bot has changed the way people think about the 1v1 game (the bot adopted a fast-paced playstyle, and everyone has since adapted to keep up).

Surprising findings

Binary rewards can give good performance. Our 1v1 model had a shaped reward, including rewards for last hits, kills, and the like. We ran an experiment that rewarded the agent only for winning or losing; it trained an order of magnitude slower and somewhat plateaued in the middle, in contrast to the smooth learning curves we usually see, but it still reached a semi-professional level (70 TrueSkill, versus 90 TrueSkill for our best 1v1 bot). The experiment ran on 4,500 cores and 16 K80 GPUs.

Creep blocking can be learned from scratch. For 1v1, we learned creep blocking using traditional reinforcement learning with a dedicated "creep block" reward. One of our colleagues left a 2v2 model training while he went on vacation (to propose to his fiancée!), intending to see how much longer training would improve performance. To his surprise, the model had learned to creep block without any special guidance or reward.

We are still fixing bugs. The image below compares a training run of the code that beat amateur players with a version in which we simply fixed a number of bugs, such as a rare crash during training, or a bug that gave a huge negative reward for reaching level 25. It turns out you can beat good human players while still hiding serious bugs!

A subset of the OpenAI Dota team, holding the laptop that defeated the world's top professional players in 1v1 Dota 2 at The International last year.

Next steps

The OpenAI team is focused on meeting the goal we set for August. We don't know whether it is achievable, but we believe that with hard work (and some luck) we have a real shot.

Original address: blog.openai.com/openai-five…