This article appeared in: Walker AI

In our earlier posts on multi-agent reinforcement learning we discussed QMIX. In fact, VDN is a special case of QMIX: when the partial derivatives of the mixing network with respect to each agent's Q-value are all 1, QMIX degenerates into VDN. QTRAN is also a value-decomposition method, but in practical problems it does not perform as well as QMIX. The main reason is that the constraints used in QTRAN's practical implementation are too loose, so its empirical performance falls short of its theoretical guarantees. QTRAN comes in two versions, QTRAN_BASE and QTRAN_ALT; the second is better than the first and works about as well as QMIX on most practical problems.
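To make the relationship concrete, here is a minimal sketch of the two factorizations (the notation is my own shorthand for illustration, not taken from this article):

$$Q_{tot}^{VDN}(\boldsymbol{\tau},\mathbf{u}) = \sum_{a=1}^{n} Q_a(\tau^a,u^a)$$

$$Q_{tot}^{QMIX}(\boldsymbol{\tau},\mathbf{u},s) = f\big(Q_1(\tau^1,u^1),\dots,Q_n(\tau^n,u^n);\,s\big),\qquad \frac{\partial Q_{tot}}{\partial Q_a}\ge 0$$

When every partial derivative $\partial Q_{tot}/\partial Q_a$ equals 1, the mixing function $f$ is just a sum and QMIX reduces to VDN.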

All of the algorithms above are value-decomposition methods, in which every agent receives the same reward. Imagine a game of King of Glory in which our side holds a big advantage: one of our heroes charges 1-versus-5 and dies, and the remaining four of us then fight 4-versus-5. Because we are far ahead, we still wipe out the other side, and every agent receives a positive reward, including the agent that went 1-versus-5 at the start, even though that action clearly does not deserve a positive reward. This is an "everyone eats from the same big pot" situation: credit is assigned unevenly.

The COMA algorithm addresses this by using a counterfactual baseline to solve the credit-assignment problem. COMA is a policy-control method with "decentralized" actors (and a centralized critic, as described below).

1. Actor-Critic

COMA mainly borrows the core idea of actor-critic: it is a policy-search-based method with centralized evaluation (the critic) and decentralized decision making (the actors).

2. COMA

COMA mainly uses a counterfactual baseline to solve the credit-assignment problem. In a cooperative multi-agent system, to determine how much a single agent's action contributed, one could pick a default action for that agent (chosen in some special way), execute the default action and the current action respectively, and compare the two outcomes. This requires simulating the default action, which obviously adds complexity to the problem. COMA does not fix a default action, so no extra simulation is needed: it directly uses the agent's current policy to compute the marginal distribution over that agent's own actions and uses it as the baseline. This greatly reduces the amount of computation.

Baseline calculation:


$$\sum_{u'^{a}}\pi^{a}\big(u'^{a}\mid\tau^{a}\big)\,Q\big(s,(u^{-a},u'^{a})\big)$$
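As a rough illustration, the baseline is an expectation of the critic's Q-values under the agent's current policy. The sketch below is a minimal PyTorch version; the tensor names and shapes are my own assumptions, not code from a particular repository:

```python
import torch

def counterfactual_baseline(q_values: torch.Tensor, pi: torch.Tensor) -> torch.Tensor:
    """Counterfactual baseline for each agent.

    q_values: (n_agents, n_actions) -- Q(s, (u^{-a}, u'^a)) with the other agents'
              actions u^{-a} held fixed while agent a's own action u'^a is varied.
    pi:       (n_agents, n_actions) -- current policy pi^a(u'^a | tau^a).
    Returns:  (n_agents,)           -- sum over u'^a of pi^a(u'^a | tau^a) * Q(s, (u^{-a}, u'^a)).
    """
    return (pi * q_values).sum(dim=-1)
```

No default action ever has to be simulated: the expectation over the agent's own action distribution plays that role.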

COMA network structure

In the figure, (a) shows the overall centralized structure of COMA, (b) shows the network structure of the Actor, and (c) shows the network structure of the Critic.
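As a rough sketch of panels (b) and (c), the actor is a small recurrent network over local observations and the critic is a centralized feed-forward network over the global state. The layer sizes and the exact contents of the inputs below are assumptions for illustration, not the precise architecture from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Decentralized actor: local observation (typically concatenated with the
    last action and the agent id) plus a GRU hidden state -> action logits."""
    def __init__(self, input_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, hidden_state):
        x = F.relu(self.fc1(obs))
        h = self.rnn(x, hidden_state)   # keep h for the next time step
        return self.fc2(h), h

class CentralizedCritic(nn.Module):
    """Centralized critic: global state (typically concatenated with the other
    agents' actions and the agent id) -> one Q-value per action of the
    evaluated agent, which is exactly what the counterfactual baseline needs."""
    def __init__(self, input_dim: int, n_actions: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, critic_inputs):
        return self.net(critic_inputs)  # (batch, n_actions)
```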

3. Algorithm process

  • Initialize actor_network, EVAL_critic_network, and target_critic_network, and copy the network parameters of EVAL_critic_network into target_critic_network. Initialize the replay buffer $D$ with capacity $M$, the total number of iterations $T$, and the target_critic_network parameter update frequency $p$.


  • for $t = 1$ to $T$ do

1) Initialize the environment

2) Obtain the global state $S$ from the environment, the observation $O$ of each agent, the available actions avail_action of each agent, and the reward $R$.

3) for step = 1 to episode_limit:

A) Each agent obtains a probability for each action through actor_network and samples an action from that distribution. Because actor_network uses a GRU recurrent layer, the hidden state is recorded at every step.

B) Execute the actions and collect $S$, $S_{next}$, each agent's observation $O$, each agent's avail_action, each agent's next avail_action, the reward $R$, the chosen actions $u$, and the env terminated flag, then store them in the experience pool $D$.

C) if len(D) >= M:

D) Randomly sample a batch of data from $D$; the sampled data must be transitions at the same time step taken from different episodes. The network needs not only the current inputs for action selection but also the hidden_state, and the hidden_state depends on previous experience, so transitions cannot be sampled independently at random. Instead, we sample several complete episodes at once and feed the transitions at the same position of each episode into the neural network together.

E) Compute td_error = $G_t$ − Q_eval, use it to build the critic loss, and update the critic parameters. $G_t$ is the total reward from state $S$ to the end of the episode.

F) Calculate the baseline for each agent at each step based on the current policy. The baseline is computed as follows:


$$\sum_{u'^{a}}\pi^{a}\big(u'^{a}\mid\tau^{a}\big)\,Q\big(s,(u^{-a},u'^{a})\big)\quad\text{(marginal distribution)}$$

G) Calculate the advantage of performing the current action:


$$A^{a}(s,u) = Q(s,u) - \sum_{u'^{a}}\pi^{a}\big(u'^{a}\mid\tau^{a}\big)\,Q\big(s,(u^{-a},u'^{a})\big)$$

H) Calculate the loss and update the actor network parameters (see the code sketch after this list):


loss = ((advantage * select_action_pi_log) * mask).sum() / mask.sum()

I) if t % p == 0:

J) Copy the network parameters of EVAL_critic_network to target_critic_network.
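To tie steps D) through H) together, here is a minimal PyTorch-style sketch of the per-transition loss computation. The tensor names, shapes, and the helper function are my own illustration rather than code from a specific repository, and the minus sign on the actor loss assumes the loss is minimized by gradient descent:

```python
import torch

def coma_losses(q_eval, pi, actions, returns, mask):
    """Sketch of steps E)-H) for one sampled transition position.

    Hypothetical shapes:
      q_eval:  (batch, n_agents, n_actions)  critic output Q(s, (u^{-a}, .))
      pi:      (batch, n_agents, n_actions)  current policy probabilities
      actions: (batch, n_agents, 1)          actions u^a actually taken
      returns: (batch, n_agents, 1)          G_t, total reward from state S to the end
      mask:    (batch, n_agents, 1)          1 for valid steps, 0 for padded steps
    """
    # E) critic loss: td_error = G_t - Q_eval for the taken action
    q_taken = torch.gather(q_eval, dim=-1, index=actions)
    td_error = (returns - q_taken) * mask
    critic_loss = (td_error ** 2).sum() / mask.sum()

    # F) counterfactual baseline: marginalize agent a's own action under its policy
    baseline = (pi * q_eval).sum(dim=-1, keepdim=True)

    # G) advantage of the action actually taken (no gradient through the critic here)
    advantage = (q_taken - baseline).detach()

    # H) actor loss: masked policy gradient weighted by the counterfactual advantage
    pi_taken = torch.gather(pi, dim=-1, index=actions).clamp(min=1e-10)
    select_action_pi_log = torch.log(pi_taken)
    actor_loss = -((advantage * select_action_pi_log) * mask).sum() / mask.sum()

    return critic_loss, actor_loss
```

In practice the sampled batch is processed one time step at a time across all episodes, so that the actors' GRU hidden states stay aligned with the transitions; this is why step D) samples whole episodes rather than independent transitions.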

4. Result comparison

I ran the experiments myself to compare QMIX, VDN, and COMA in the same scenario.

5. Algorithm summary

The principle behind COMA in the paper is elegant, but in practical scenarios, as the two figures above show, COMA's performance is not ideal: in general scenarios it does not perform as well as QMIX. I suggest readers try VDN, QMIX, and similar algorithms first in their own environments; COMA is not a good first choice.

6. References

  1. COMA: arxiv.org/abs/1705.08…

PS: For more hands-on technical content, follow our WeChat public account [xingzhe_ai] and join the discussion with Walker AI!