
If you want to watch related videos, you can find me on Watermelon Video (account: Zidea) or Bilibili (account: Zidea2015), where I post video commentary; my avatar there is the same as the one I use on Jianshu.

Since beating the world champion at Go, DeepMind has been using AI to play StarCraft, a more complex game. You are probably just as interested as I am in how a project that teaches an AI to play games is actually implemented. Building such a project is still challenging, but no matter how difficult it is, let's start with the basics.

Fundamentals of probability

Before we start, let's do a quick review of probability, since we will use a little of it in this share.

Random variables and distribution functions

First of all, a function's argument can be generalized beyond real numbers; for example, if the argument is a pair of points, the output can be the distance between them. A random variable is such a function defined on the outcomes of a random experiment. We use capital letters for random variables and lowercase letters for their observed values. An event is a way of describing, in words, a collection of samples that share a certain property. For example, let the random variable X be the outcome of a fair coin flip, with 1 for heads and 0 for tails:


\begin{aligned} P(X=0) &= 0.5 \\ P(X=1) &= 0.5 \end{aligned}

Probability Density Function (PDF)

PDF is the abbreviation of Probability Density Function; it describes how likely a continuous random variable is to take values near a given point. A classic example is the Gaussian (normal) density:


p(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(- \frac{(x - \mu)^2}{2 \sigma^2}\right)
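
As a quick sanity check, here is a minimal sketch (assuming NumPy is available) that evaluates this Gaussian density at a few points; mu and sigma default to 0 and 1 purely for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """p(x) = 1 / sqrt(2*pi*sigma^2) * exp(-(x - mu)^2 / (2*sigma^2))"""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

print(gaussian_pdf(0.0))  # density at the mean, about 0.3989 for mu=0, sigma=1
print(gaussian_pdf(1.0))  # about 0.2420, one standard deviation away
```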

Expectation

For a continuous distribution, the expectation is E[f(X)] = \int_x p(x) f(x) \, dx

For a discrete distribution, the expectation is E[f(X)] = \sum_x p(x) f(x)
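
As a small illustration of the discrete case, the sketch below computes E[f(X)] for the coin-flip variable defined earlier; the choice of f is arbitrary.

```python
# Discrete expectation: E[f(X)] = sum over x of p(x) * f(x)
p = {0: 0.5, 1: 0.5}  # the coin flip from above: tails -> 0, heads -> 1

def expectation(p, f=lambda x: x):
    return sum(prob * f(x) for x, prob in p.items())

print(expectation(p))                      # E[X]   = 0.5 * 0 + 0.5 * 1 = 0.5
print(expectation(p, f=lambda x: x ** 2))  # E[X^2] = 0.5
```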

Random sampling

Random sampling is sampling in which each member of the population has an equal chance of being selected, which is why it is also called equal-probability sampling. More generally, we can sample outcomes according to any given probability distribution, which is what the Policy will do later.
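
Here is a minimal sketch of random sampling with NumPy: the first draw uses equal probabilities, the second uses a given distribution (the outcomes and probabilities are made up for illustration).

```python
import numpy as np

outcomes = ["red", "green", "blue"]

# Equal-probability (uniform) sampling: each outcome has the same chance.
print(np.random.choice(outcomes))

# Sampling according to a given probability distribution.
print(np.random.choice(outcomes, p=[0.2, 0.3, 0.5]))
```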

Terminology

State

By state we mean what we can observe from the environment, for example every frame of the game screen; in other words, it is whatever the environment allows us to observe.

Action


a \in \{left, right, up, down\}

An action is the response an agent (we will see what that is next) gives based on its current state; in the game above, the available actions are left, right, up, and down.

Agent

The Agent is the initiator of Actions in a reinforcement learning task. In a tank-battle game the Agent is the tank; in autonomous driving the Agent is the car 🚗.

Policy

First of all, a Policy is a function, and a function needs an input and an output. The Policy function takes the State as input and outputs a probability distribution over the Actions that the Agent can execute; that is, it outputs probabilities for different actions rather than a single specific Action.

In mathematics the Policy function is expressed as \pi(s, a) \rightarrow [0, 1], where \pi(a|s) = P(A = a | S = s). For example:


  • \pi(left|s) = 0.1
  • \pi(right|s) = 0.2
  • \pi(up|s) = 0.6
  • \pi(down|s) = 0.1

It is not difficult to see from the formulas above that reinforcement learning is mainly about learning the Policy function. Once we have a Policy function, given an input state the Policy performs a random sampling over actions to decide what to do. The random sampling here prevents an opponent from predicting the machine's moves and spotting a fixed pattern, which is why the Policy is random.
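
For example, with the distribution listed above, the sampling step might be sketched like this (the state is left implicit; the probabilities are the ones from the list):

```python
import numpy as np

# pi(a | s) for the current state s, taken from the example above.
actions = ["left", "right", "up", "down"]
probs = [0.1, 0.2, 0.6, 0.1]

# The Policy samples an action instead of always picking the most likely one,
# so an opponent cannot predict its moves.
action = np.random.choice(actions, p=probs)
print(action)
```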

Reward

We use R to denote the reward, which is a score given according to the action and the state. How to design the reward is very important, because a well-designed reward gets twice the result with half the effort. In the tank-battle game, for example (a small sketch of such a reward function follows the list below):

  • There is a bonus (positive reward) for destroying an enemy tank
  • If our eagle's nest (the base) is captured, a large amount of reward is lost
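
A minimal sketch of such a reward function for the tank-battle example might look like the following; the event names and reward values are my own illustrative assumptions, not taken from the game.

```python
def reward(event):
    """Toy reward for the tank-battle example; values are illustrative assumptions."""
    if event == "enemy_tank_destroyed":
        return 10.0    # bonus for destroying an enemy tank
    if event == "eagle_nest_captured":
        return -100.0  # heavy penalty when our base (the eagle's nest) is captured
    return 0.0         # nothing notable happened this step

print(reward("enemy_tank_destroyed"))  # 10.0
```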

State Transition

Based on the current action and the previous state, we get a new state: p(s'|s, a) = \mathbb{P}(S' = s' \mid S = s, A = a).

State transitions are determined by the environment; in a game, the environment is the game system itself.
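
A minimal sketch of sampling the next state from p(s'|s, a), assuming the transition probabilities are stored in a simple table (the states, action, and probabilities are made up):

```python
import numpy as np

# p(s' | s, a) stored as a table: transition[(s, a)] = {s': probability}
transition = {
    ("s1", "up"): {"s1": 0.2, "s2": 0.8},
}

def next_state(s, a):
    dist = transition[(s, a)]
    states = list(dist.keys())
    probs = list(dist.values())
    return np.random.choice(states, p=probs)

print(next_state("s1", "up"))  # usually "s2", sometimes "s1"
```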

Interaction between Agent and Environment

Let's take a look at how the Agent interacts with the environment.

Randomness of reinforcement learning

  • Policy gives random actions based on state

  • The Environment gives the next state randomly based on the action and the state

  • The Policy gives a_1 based on s_1
  • The Environment gives s_2 and r_1 based on s_1 and a_1
  • The Policy then gives a_2 based on s_2

Repeating the above steps iteratively forms a trajectory s_1, a_1, r_1, s_2, a_2, r_2, \cdots, s_T, a_T, r_T.
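
Putting the pieces together, the interaction loop can be sketched as below. The `env` and `policy` objects are hypothetical stand-ins for the environment and the Policy function, not a real library API.

```python
def run_episode(env, policy, max_steps=1000):
    """Collect one trajectory s1, a1, r1, s2, a2, r2, ..., sT, aT, rT."""
    trajectory = []
    s = env.reset()                    # the environment gives the initial state s1
    for _ in range(max_steps):
        a = policy.sample_action(s)    # the Policy samples a_t from pi(. | s_t)
        s_next, r, done = env.step(a)  # the environment returns s_{t+1}, r_t, and a done flag
        trajectory.append((s, a, r))
        s = s_next
        if done:                       # the game has ended
            break
    return trajectory
```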

Return

The return is different from the reward: the return at the current moment is the sum of the rewards produced by the sequence of states and actions from now on, all the way up to the final reward at the end of the game.


U_t = R_t + R_{t+1} + R_{t+2} + \cdots

  • Regarding the difference between R_t and R_{t+1}: the value of a reward decreases with time. For example, a reward received a year from now is worth less to us than the same reward received today. Because future rewards are less valuable than current rewards, we introduce a discount factor \gamma to adjust for this, where \gamma \in [0, 1].

So after adding \gamma, the return becomes the discounted return, as follows:


U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots

It should also be noted that the discount rate \gamma is a hyperparameter, and its setting has a real effect on reinforcement learning. Once a moment has passed, the observed reward is written as a lowercase r_t. Because the rewards that U_t depends on are random variables, U_t is itself a random variable.
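
Once an episode has finished and the rewards r_t, r_{t+1}, ..., r_T have been observed, the discounted return u_t can be computed as in this sketch (gamma is the hyperparameter discussed above):

```python
def discounted_return(rewards, gamma=0.9):
    """u_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    u = 0.0
    for k, r in enumerate(rewards):
        u += (gamma ** k) * r
    return u

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9 * 0 + 0.81 * 2 = 2.62
```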

We know that the Policy generates \pi(a|s) and that the environment produces the next state s' from the current state and action, that is, p(s'|s, a). So the reward R_t depends on the current state S_t and the action A_t.

Therefore, at time t, U_t depends on the future actions A_t, A_{t+1}, A_{t+2}, \cdots and the future states S_t, S_{t+1}, S_{t+2}, \cdots.

The above introduced what the return is, so we now have some understanding of it. We can move on to the value function, which measures whether a state, or a state-action pair, is good or bad; in other words, whether it is worthwhile for the agent to be in a given state or to perform a given action in that state.

Action Value function

U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots is a random variable that depends on all future actions and states, so its value is not known at time t. To evaluate it anyway, we define the action value function:


Q_{\pi}(s_t, a_t) = E[U_t \mid S_t = s_t, A_t = a_t]

By taking the expectation of U_t, we integrate out all of its randomness. Even though we don't know what will happen at the next moment, just as with a coin flip we don't know whether it will come up heads or tails, we do know that heads and tails each have probability 0.5. If we let the random variable X be 1 for heads and 0 for tails, the expectation is 0.5 \times 1 + 0.5 \times 0 = 0.5. In the same way, taking the expectation of the random variable U_t yields a number, namely Q_{\pi}.
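
Since Q_{\pi} is just an expectation of U_t, one simple way to approximate it is Monte Carlo: run many episodes with the policy \pi, record the discounted return that followed each time (s_t, a_t) occurred, and average. The sketch below assumes such sampled returns have already been collected; it illustrates the expectation rather than a full algorithm.

```python
def estimate_q(sampled_returns):
    """Monte Carlo estimate of Q_pi(s, a) = E[U_t | S_t = s, A_t = a].

    sampled_returns holds the discounted returns observed after taking
    action a in state s while following the policy pi.
    """
    return sum(sampled_returns) / len(sampled_returns)

# Returns collected from several episodes for one (s, a) pair (made-up numbers).
print(estimate_q([2.62, 1.90, 3.10]))  # average = 2.54
```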

Q_{\pi}(s_t, a_t) depends on the policy \pi as well as on s_t and a_t. Maximizing it over all policies \pi gives the optimal action value function:


Q^{*}(s_t, a_t) = \max_{\pi} Q_{\pi}(s_t, a_t)


Q^{*}(s_t, a_t) no longer depends on any particular policy; it measures how good it is to take the action a_t in the state s_t.

State value function
The state value function V_{\pi} is the expectation of the future return, given the state s_t at time t (a short sketch of computing it follows the list below):


V_{\pi}(s_t) = E_{A}[Q_{\pi}(s_t, A)]

  • The action A is treated as a random variable and integrated out (the expectation is taken with respect to \pi(\cdot|s_t))
  • V_{\pi} depends only on s_t and the policy function \pi
  • V_{\pi} tells us whether our current state is good or not
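
When the action set is discrete, V_{\pi}(s_t) = E_A[Q_{\pi}(s_t, A)] is simply a probability-weighted average of Q over actions. The sketch below reuses the earlier policy example with made-up Q values:

```python
# pi(a | s_t) from the Policy example and made-up Q_pi(s_t, a) values.
pi = {"left": 0.1, "right": 0.2, "up": 0.6, "down": 0.1}
q = {"left": 1.0, "right": 2.0, "up": 5.0, "down": 0.5}

# V_pi(s_t) = sum over a of pi(a | s_t) * Q_pi(s_t, a)
v = sum(pi[a] * q[a] for a in pi)
print(v)  # 0.1*1.0 + 0.2*2.0 + 0.6*5.0 + 0.1*0.5 = 3.55
```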