This article was first published by Walker AI.

Introduction

In 2017, when AlphaGo defeated Ke Jie, the reigning world Go champion, artificial intelligence fully entered the public's view, and game-playing AI set off a wave of interest across the field. In fact, even before AlphaGo, people had been pitting AI against games of chess and cards, from simple checkers and backgammon to the more complex Chinese chess and international chess, and more recently the hugely popular Go and Texas Hold 'em, with fruitful results accumulated over decades. In contrast to perfect-information games such as checkers and chess, Texas Hold 'em not only requires complex decisions based on incomplete information, but also has to deal with opponents' bluffs and deliberate displays of weakness. Its game tree is huge in both breadth and depth, and it has long been a mountain for researchers to conquer. In the same year that AlphaGo beat Ke Jie, the poker AIs DeepStack and Libratus beat professional players at heads-up no-limit Texas Hold 'em, a landmark breakthrough in incomplete-information games. Their core algorithm is Counterfactual Regret Minimization (CFR).

1. Regret Matching

1.1 Algorithm Principle

The predecessor of the CFR algorithm is the regret matching algorithm. In regret matching, an agent's actions are selected at random, with a probability distribution proportional to positive regret, where the regret of an action indicates the relative loss incurred by not having selected that action in the past.

Here are the definitions of the symbols used in the regret matching algorithm:

  • $N = \{1, 2, \ldots, n\}$ denotes the finite set of players. The strategy used by player $i$ is $\sigma_i$.

  • For each information set $I_i \in \mathcal{I}_i$, $\sigma_i(I_i): A(I_i) \to [0, 1]$ is a probability distribution over the action set $A(I_i)$. Player $i$'s strategy space is denoted by $\Sigma_i$.

  • A strategy profile containing all players' strategies is defined as $\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_n)$.

  • In a game, different players take different strategies and actions at different times. The probability that action sequence $h$ occurs under strategy profile $\sigma$ is written $\pi^\sigma(h)$, with $\pi^\sigma(h) = \prod_{i \in N} \pi_i^\sigma(h)$.

    Here $\pi_i^\sigma(h)$ denotes the probability that player $i$, using strategy $\sigma_i$, contributes to action sequence $h$ occurring. The probability that all players other than $i$ bring about action sequence $h$ through their respective strategies is $\pi_{-i}^\sigma(h) = \prod_{j \in N \setminus \{i\}} \pi_j^\sigma(h)$.

  • For each player $i \in N$, $u_i: Z \to \mathbb{R}$ denotes the player's payoff function.

  • The player's expected payoff under a given strategy profile is $u_i(\sigma) = \sum_{h \in Z} u_i(h)\,\pi^\sigma(h)$.

  • Nash equilibrium: the strategy profile $\sigma^* = (\sigma_1^*, \sigma_2^*, \ldots, \sigma_n^*)$ is a Nash equilibrium if and only if, for every player $i \in N$: $u_i(\sigma^*) \geq \max_{\sigma_i' \in \Sigma_i} u_i(\sigma_1^*, \ldots, \sigma_{i-1}^*, \sigma_i', \sigma_{i+1}^*, \ldots, \sigma_n^*)$.

  • Regret value: the cumulative regret of player $i$ for action $a$ after $T$ rounds is:


    $$R_i^T(a) = \sum_{t=1}^{T}\left(u_i(a, \sigma_{-i}^t) - u_i(\sigma_i^t, \sigma_{-i}^t)\right)$$
  • Strategy update: update the strategy in proportion to positive regret


    $$\sigma_i^{T+1}(a) = \frac{R_i^{T,+}(a)}{\sum_{b \in A_i} R_i^{T,+}(b)}, \qquad R_i^{T,+}(a) = \max\left(R_i^T(a),\, 0\right)$$
  • Average regret value: assume the game is repeated $M$ times and let the strategy profile used in the $t$-th game be $\sigma^t$; the average regret of player $i$ over these games is defined as:


    $$\mathrm{Regret}_i^M = \frac{1}{M}\max_{\sigma_i^* \in \Sigma_i}\sum_{t=1}^{M}\left(u_i(\sigma_i^*, \sigma_{-i}^t) - u_i(\sigma^t)\right)$$

The algorithm flow of Regret matching is:

  • For each player, initialize all cumulative regrets to 0.

  • For t from 1 to T (T: the number of iterations):

    A) Use your current strategy to play against your opponent

    B) Compute each action's payoff from the game result, and compute regrets from these payoffs

    C) Accumulate the regrets into the historical (cumulative) regret values

    D) Update the strategy according to regret value results

  • Return the average strategy (accumulated strategy / number of iterations)

1.2 Example

Rock-Paper-Scissors is one of the simplest zero-sum games; its payoff table is shown below.

Figure 1. Rock-Paper-Scissors payoff table

In this example, the algorithm flow of Regret matching is:

A) In the first iteration, both Player 1 and Player 2 choose actions uniformly at random with probability $\frac{1}{3}$ each; assume Player 1 chooses Paper and Player 2 chooses Scissors.

B) From Player 1's perspective, the payoffs of this first game are: $u(R,S) = 1$, $u(P,S) = -1$, $u(S,S) = 0$.

C) Regret values are computed from these payoffs:


$$\mathrm{regret}_r = u(R,S) - u(P,S) = 1 - (-1) = 2, \quad \mathrm{regret}_p = u(P,S) - u(P,S) = -1 - (-1) = 0, \quad \mathrm{regret}_s = u(S,S) - u(P,S) = 0 - (-1) = 1$$

D) Normalizing the regrets gives Player 1's updated strategy: $(\frac{2}{3}R,\ 0P,\ \frac{1}{3}S)$.

E) Select actions according to the updated strategy for multiple games until Nash equilibrium is reached

F) Return the average strategy

The core code is as follows:

1) Strategy retrieval method: 1. Zero out any action whose cumulative regret is negative; 2. Normalize the strategy so that the action probabilities sum to 1; 3. If the sum of positive regrets is 0, fall back to the initial (uniform) strategy.

def get_strategy(self):
    """
    Get the current mixed strategy through regret matching:
    each action's probability is proportional to its positive cumulative regret.
    :return: the current mixed strategy (list of action probabilities)
    """
    normalizing_sum = 0

    # Clip negative cumulative regrets to zero
    for i in range(NUM_ACTIONS):
        self.strategy[i] = max(self.regret_sum[i], 0)
        normalizing_sum += self.strategy[i]

    # Normalize; if all positive regrets are zero, fall back to the uniform strategy
    for i in range(NUM_ACTIONS):
        if normalizing_sum > 0:
            self.strategy[i] /= normalizing_sum
        else:
            self.strategy[i] = 1 / NUM_ACTIONS
        # Accumulate the strategy used at this iteration for the final average strategy
        self.strategy_sum[i] += self.strategy[i]

    return self.strategy
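The final step of the algorithm returns the average strategy, i.e. the accumulated strategy_sum normalized by its total. The corresponding helper is not shown in the article; a minimal sketch of what it might look like is:

def get_average_strategy(self):
    """
    Average strategy over all iterations (a sketch; not part of the original post).
    strategy_sum accumulates the mixed strategy played at every iteration, so
    normalizing it yields the time-averaged strategy, whose regret is what
    regret matching drives to zero.
    """
    normalizing_sum = sum(self.strategy_sum)
    if normalizing_sum > 0:
        return [s / normalizing_sum for s in self.strategy_sum]
    # Nothing accumulated yet: fall back to the uniform strategy
    return [1 / NUM_ACTIONS] * NUM_ACTIONS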

2) Training method: 1. Play the game using the current strategy and compute each action's utility from the game result; 2. Compute and accumulate regrets from the action utilities.

def train(self, iterations: int):
    action_utility = [0] * NUM_ACTIONS
    opponent_stats = [0] * NUM_ACTIONS
    cum_utility = 0
    for i in range(iterations):
        # Get regret-matched mixed-strategy actions
        strategy = self.get_strategy()
        my_action = self.get_action(strategy)
        other_action = self.get_action(self.opp_strategy)
        opponent_stats[other_action] += 1

        # Compute action utilities relative to the opponent's action
        action_utility[other_action] = 0
        action_utility[(other_action + 1) % NUM_ACTIONS] = 1
        action_utility[other_action - 1] = -1

        # Accumulate action regrets
        for a in range(NUM_ACTIONS):
            self.regret_sum[a] += action_utility[a] - action_utility[my_action]
        cum_utility += action_utility[my_action]

    print(f'opponent_stats: {opponent_stats}, {[_ - min(opponent_stats) for _ in opponent_stats]}')
    print(f'cum utility: {cum_utility}')
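The two methods above are fragments of a trainer class; the constructor, the get_action sampler, NUM_ACTIONS and the import they rely on are not shown in the article. A minimal sketch of that scaffolding follows (the class name RPSTrainer and the attribute layout are assumptions chosen to match the fragments, not the author's original code):

import random

NUM_ACTIONS = 3  # 0 = rock, 1 = paper, 2 = scissors


class RPSTrainer:
    def __init__(self, opp_strategy=None):
        # Cumulative regrets, current strategy and cumulative strategy, one entry per action
        self.regret_sum = [0.0] * NUM_ACTIONS
        self.strategy = [0.0] * NUM_ACTIONS
        self.strategy_sum = [0.0] * NUM_ACTIONS
        # Fixed opponent strategy, e.g. {0.4, 0.3, 0.3} as in the experiment below
        self.opp_strategy = opp_strategy or [0.4, 0.3, 0.3]

    def get_action(self, strategy):
        # Sample an action index from a mixed strategy
        return random.choices(range(NUM_ACTIONS), weights=strategy, k=1)[0]

    # get_strategy and train are the methods listed above;
    # get_average_strategy is the helper sketched earlier.

With these pieces, running something like trainer = RPSTrainer(); trainer.train(100000); print(trainer.get_average_strategy()) against the fixed {0.4, 0.3, 0.3} opponent should drive the average strategy toward the pure best response, always playing Paper (its expected payoff of +0.1 is the highest of the three actions), which is the behaviour shown in Figure 2.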

Experimental results:

1) When the fixed opponent strategy is {0.4, 0.3, 0.3}

Figure 2. Fixed opponent strategy, player strategy

2) When both player and opponent use Regret Matching updates

Figure 3. Player and opponent strategies

2. Counterfactual Regret Minimization

2.1 Algorithm Principle

Rock-Paper-Scissors is a classic one-shot game: the player takes a single action and immediately gets the result. Many games in real life, however, are sequential games composed of a series of actions, where an earlier action can change the actions available later, and the final combination of actions determines the outcome. For such sequential games we no longer use a payoff table but a game tree. The game tree consists of states, with edges representing transitions from one state to another. A state can be a chance node or a decision node. A chance node assigns the outcome of a chance event, so each of its edges represents one possible outcome of that event together with the probability of it occurring. At a decision node, edges represent the player's actions and the successor states that result from taking those actions.
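To make the structure concrete, here is a minimal sketch of how such a game tree could be represented in code; the class and field names are purely illustrative and not taken from any particular library:

from dataclasses import dataclass, field
from typing import Dict, Union

@dataclass
class ChanceNode:
    # Each outgoing edge is one possible outcome of the chance event, with its probability
    probs: Dict[str, float] = field(default_factory=dict)          # outcome -> probability
    children: Dict[str, "GameNode"] = field(default_factory=dict)  # outcome -> next state

@dataclass
class DecisionNode:
    player: int                                                     # whose turn it is to act
    children: Dict[str, "GameNode"] = field(default_factory=dict)  # action -> next state

@dataclass
class TerminalNode:
    payoff: float                                                   # payoff for player 0 (zero-sum)

GameNode = Union[ChanceNode, DecisionNode, TerminalNode]

In Kuhn poker below, the deal of the cards is the single chance node, and every subsequent betting decision is a decision node.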

Similarly, symbols in the CFR algorithm are defined as follows:

  • The probability of reaching information set $I$ is $\pi^\sigma(I) = \sum_{h \in I} \pi^\sigma(h)$, i.e. the sum of the probabilities of all action sequences contained in the information set.

  • Under strategy profile $\sigma$, the probability of reaching the terminal state $z$ from action sequence $h$ is $\pi^\sigma(h, z)$.

  • When strategy $\sigma$ is adopted, the counterfactual (virtual) value of history $h$ is: $u_i(\sigma, h) = \sum_{z \in Z} \pi^\sigma(h, z)\, u_i(z)$.

  • The counterfactual regret of player $i$ for taking action $a$ at history $h$ is: $r(h, a) = v_i(\sigma_{I \to a}, h) - v_i(\sigma, h)$.

  • The regret of action $a$ at information set $I$ is the sum over its action sequences: $r(I, a) = \sum_{h \in I} r(h, a)$.

  • The cumulative regret of player $i$ for action $a$ after $T$ rounds is: $\mathrm{Regret}_i^T(I, a) = \sum_{t=1}^{T} r_i^t(I, a)$.

  • Similarly, taking only non-negative regret into account: $\mathrm{Regret}_i^{T,+}(I, a) = \max\left(\mathrm{Regret}_i^T(I, a),\, 0\right)$.

  • In round $T+1$, the probability that player $i$ chooses action $a$ is computed as:


    $$\sigma_i^{T+1}(I, a) = \begin{cases} \frac{\mathrm{Regret}_i^{T,+}(I, a)}{\sum_{a' \in A(I)} \mathrm{Regret}_i^{T,+}(I, a')} & \text{if } \sum_{a' \in A(I)} \mathrm{Regret}_i^{T,+}(I, a') > 0 \\ \frac{1}{|A(I)|} & \text{otherwise} \end{cases}$$

Algorithm flow:

  • For each information set, the regret value and policy are initialized
  • Use your current strategy to play against your opponent
  • Calculate the benefit and regret value of each information set accessed in this game
  • Updated strategy with Regret Matching algorithm
  • Multiple iterations until Nash equilibrium
  • Return the average strategy (accumulated strategy / number of iterations)

2.2 Example

Kuhn poker is one of the simplest limit poker games: two players play with a deck of only three cards, 1, 2 and 3. In each round, each player holds one card and decides at his discretion whether to pass (P) or bet (B). The specific rules of the game are as follows:

Figure 4. Kuhn poker rules

Constructing the Kuhn poker game tree from the perspective of the first player α:

Figure 5. First-hand player game tree

CFR algorithm flow in this example is as follows:

A) The initial strategy is a uniform random strategy; assume player α draws card 3.

B) In the first iteration, the counterfactual value of choosing action P at this node is: $u_A^\sigma(3, P) = \sum_{i=1}^{3} \pi_B^\sigma(3P)\,\pi^\sigma(z_i \mid 3P)\,u_A(z_i)$. Reading from the game tree: $\pi_B^\sigma(3P) = 1$, $\pi^\sigma(z_1 \mid 3P) = 0.5$, $\pi^\sigma(z_2 \mid 3P) = 0.25$, $\pi^\sigma(z_3 \mid 3P) = 0.25$; $u_A(z_1) = 1$, $u_A(z_2) = 1$, $u_A(z_3) = 2$. Therefore $u_A^\sigma(3, P) = 1 \times 0.5 + 1 \times 0.25 + 2 \times 0.25 = 1.25$. Similarly, $u_A^\sigma(3, B) = 0.5$.

C) Use the counterfactual value to update the regret: $R^1(3, P) = R^0(3, P) + u_A^\sigma(3, P) - \left(\sigma(3, B)\,u_A^\sigma(3, B) + \sigma(3, P)\,u_A^\sigma(3, P)\right) = 0 + 1.25 - (0.5 \times 0.5 + 0.5 \times 1.25) = 0.375$

D) Use the regret value to update the strategy: $\sigma_A^1(3, P) = 0.375 \div (0.375 + 0) = 1$, $\sigma_A^1(3, B) = 1 \div |A(3)| = 0.5$

E) Normalize the strategy: $\sigma_A^1(3, P) = \frac{2}{3}$, $\sigma_A^1(3, B) = \frac{1}{3}$ (these numbers are verified in the short snippet after this list)

F) Iterate several times until Nash equilibrium is reached
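The arithmetic in steps B) through E) can be checked with a few lines of Python; the reach probabilities and payoffs are read off the game tree in Figure 5 exactly as listed above (the variable names here are only for illustration):

# Step B: counterfactual value of P at information set "3"
reach_B = 1.0                                   # pi_B(3P)
p_z = [0.5, 0.25, 0.25]                         # pi(z_i | 3P)
u_z = [1, 1, 2]                                 # u_A(z_i)
u_3P = reach_B * sum(p * u for p, u in zip(p_z, u_z))   # = 1.25
u_3B = 0.5                                      # given above

# Step C: regret update for P, starting from zero cumulative regret
sigma_P = sigma_B = 0.5                         # initial random strategy
node_value = sigma_P * u_3P + sigma_B * u_3B    # = 0.875
R_P = 0 + u_3P - node_value                     # = 0.375

# Steps D and E: regret matching (B has no positive regret) and normalization
sigma_P_new = R_P / (R_P + 0)                   # = 1
sigma_B_new = 1 / 2                             # uniform fallback, |A(3)| = 2
total = sigma_P_new + sigma_B_new
print(sigma_P_new / total, sigma_B_new / total)  # 0.666..., 0.333...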

Core code implementation:

class KuhnTrainer:
    """
    P = pass, B = bet
    Payoff table:
        P P      higher card player +1
        P B P    player 2 +1
        P B B    higher card player +2
        B P      player 1 +1
        B B      higher card player +2
    """

    def __init__(self):
        self.node_map: Dict[str, Node] = {}

    def train(self, iterations):
        """
        Train Kuhn poker
        :param iterations: number of CFR iterations
        :return:
        """
        cards = [1, 2, 3]
        util = 0
        for i in range(iterations):
            # shuffle cards (the chance node: the deal)
            self.shuffle_cards(cards)
            cur_util = self.cfr(cards, "", 1, 1)
            util += cur_util
        print("Average game value: ", util / iterations)
        for k, v in self.node_map.items():
            print(v)

    def shuffle_cards(self, cards):
        random.shuffle(cards)

    def cfr(self, cards, history: str, p0, p1):
        """
        Counterfactual regret minimization iteration
        :param cards: the shuffled cards; cards[0] and cards[1] are the players' hands
        :param history: the sequence of actions so far, e.g. "pb"
        :param p0: reach probability contributed by player 0's strategy
        :param p1: reach probability contributed by player 1's strategy
        :return: expected utility of the node for the player to act
        """
        plays = len(history)
        player = plays % 2
        opponent = 1 - player

        # Return payoff for terminal states
        if plays > 1:
            terminal_pass = history[plays - 1] == 'p'
            double_bet = history[plays - 2: plays] == 'bb'
            is_player_card_higher = cards[player] > cards[opponent]
            if terminal_pass:
                if history == 'pp':
                    return 1 if is_player_card_higher else -1
                else:
                    return 1
            elif double_bet:
                return 2 if is_player_card_higher else -2

        info_set = str(cards[player]) + history

        # Get information set node or create it if nonexistent
        node = self.node_map.get(info_set, None)
        if node is None:
            node = Node()
            node.info_set = info_set
            self.node_map[info_set] = node

        # For each action, recursively call cfr with additional history and probability
        strategy = node.get_strategy(p0 if player == 0 else p1)
        util = [0] * NUM_ACTIONS
        node_util = 0
        for i in range(NUM_ACTIONS):
            next_history = history + ("p" if i == 0 else "b")
            util[i] = -self.cfr(cards, next_history, p0 * strategy[i], p1) if player == 0 else \
                -self.cfr(cards, next_history, p0, p1 * strategy[i])
            node_util += strategy[i] * util[i]

        # For each action, compute and accumulate counterfactual regret,
        # weighted by the opponent's reach probability
        for i in range(NUM_ACTIONS):
            regret = util[i] - node_util
            node.regret_sum[i] += regret * (p1 if player == 0 else p0)

        return node_util
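The trainer above relies on a Node class, a NUM_ACTIONS constant and the random and typing imports that are not included in the snippet. A minimal sketch of those missing pieces, written to be consistent with how KuhnTrainer uses them (the original implementation is not shown in the post, so treat this as an assumption), could look like:

import random
from typing import Dict

NUM_ACTIONS = 2  # 0 = pass (p), 1 = bet (b)


class Node:
    """Per-information-set storage for cumulative regrets and strategy sums."""

    def __init__(self):
        self.info_set = ""
        self.regret_sum = [0.0] * NUM_ACTIONS
        self.strategy = [0.0] * NUM_ACTIONS
        self.strategy_sum = [0.0] * NUM_ACTIONS

    def get_strategy(self, realization_weight):
        # Regret matching: probabilities proportional to positive cumulative regret
        normalizing_sum = 0.0
        for a in range(NUM_ACTIONS):
            self.strategy[a] = max(self.regret_sum[a], 0)
            normalizing_sum += self.strategy[a]
        for a in range(NUM_ACTIONS):
            if normalizing_sum > 0:
                self.strategy[a] /= normalizing_sum
            else:
                self.strategy[a] = 1.0 / NUM_ACTIONS
            # Weight the accumulated strategy by this player's reach probability
            self.strategy_sum[a] += realization_weight * self.strategy[a]
        return self.strategy

    def get_average_strategy(self):
        normalizing_sum = sum(self.strategy_sum)
        if normalizing_sum > 0:
            return [s / normalizing_sum for s in self.strategy_sum]
        return [1.0 / NUM_ACTIONS] * NUM_ACTIONS

    def __str__(self):
        return f"{self.info_set}: {self.get_average_strategy()}"

With these pieces, KuhnTrainer().train(1000000) should report an average game value close to the known value of Kuhn poker for the first player, about -1/18 ≈ -0.056, and print an approximate equilibrium strategy for each information set.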

Experimental results:

Figure 6. Kuhn poker, player and opponent strategies

3. Extensions

The CFR algorithm is, in principle, able to solve Texas Hold 'em, but the 52-card deck, the betting rounds and the community cards make the situation so complex and changeable that the game tree is enormous in both depth and breadth; the overhead in computation and storage is far too large, and beating Texas Hold 'em with plain CFR alone is difficult. Subsequent researchers have therefore kept working on optimizing the efficiency of the CFR algorithm, aiming to increase computation speed and compress storage. Below, the author briefly introduces several CFR variants, as a high-level overview only.

3.1 CFR+

Unlike CFR, the CFR+ algorithm weights the accumulated average strategy, giving higher weight to the strategies of more recent iterations when averaging. Intuitively, later strategies tend to perform better, so giving recent strategies more weight in the average is more conducive to convergence.

In CFR+, the counterfactual utility is defined as follows:


$$u_p^\sigma(I, a) = \sum_{z \in Z(I, a)} \pi_{-p}^\sigma(z)\,\pi^\sigma(h, z)\,u_p(z)$$

On this basis, CFR+ defines a regret-like value. In CFR+ this regret is a cumulative quantity, whereas CFR defines regret as an average; in CFR+ it is computed as:


$$Q^t(a) = \left(Q^{t-1}(a) + \Delta R^t(a)\right)^+, \qquad \Delta R^t(a) = R^t(a) - R^{t-1}(a)$$

In addition, the average strategy finally output by the CFR+ algorithm takes the following form:


$$\sigma_p^T = \frac{2\sum_{t=1}^{T} t\,\sigma_p^t}{T^2 + T}$$
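For concreteness, here is a short sketch of how these two CFR+ modifications would look in code for a single information set (the function and variable names are illustrative, not from the original; q_sum plays the role of $Q^t$ and the factor t implements the linear weighting of the average strategy):

def cfr_plus_update(q_sum, delta_regret, strategy_sum, strategy, t):
    """
    One CFR+ bookkeeping step for one information set (illustrative sketch).
    q_sum:        Q^{t-1}(a) values, the regret-like quantities floored at zero
    delta_regret: instantaneous regrets Delta R^t(a) of this iteration
    strategy_sum: weighted cumulative strategy used for the final average
    strategy:     the mixed strategy sigma^t played at iteration t
    t:            current iteration number (1-based)
    """
    for a in range(len(q_sum)):
        # Q^t(a) = (Q^{t-1}(a) + Delta R^t(a))^+  -- negative regret is discarded immediately
        q_sum[a] = max(q_sum[a] + delta_regret[a], 0)
        # Linear weighting: iteration t contributes with weight t; dividing the final
        # sum by 1 + 2 + ... + T = (T^2 + T) / 2 reproduces the averaging formula above
        strategy_sum[a] += t * strategy[a]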

3.2 MCCFR

MCCFR (Monte Carlo Counterfactual Regret Minimization) combines Monte Carlo sampling with the CFR algorithm. Its core idea is to avoid traversing the entire game tree on every iteration while guaranteeing that the expected value of the immediate counterfactual regret remains unchanged. The terminal (leaf) nodes are divided into blocks $Q = \{Q_1, Q_2, \ldots, Q_n\}$ such that every leaf node is covered.

Let $q_j$ denote the probability of choosing block $Q_j$ in the current iteration, with $\sum_{j=1}^{n} q_j = 1$.

Define $q(z)$ as the probability of sampling leaf node $z$ in the current iteration: $q(z) = \sum_{j:\, z \in Q_j} q_j$.

Then, when block $Q_j$ is selected in an iteration, the sampled counterfactual value is: $\tilde{v}_i(\sigma, I \mid j) = \sum_{z \in Q_j \cap Z_I} \frac{1}{q(z)}\, u_i(z)\, \pi_{-i}^\sigma(z[I], z)$.

A sampling based CFR algorithm is obtained by selecting different blocks with a certain probability.
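The role of the $\frac{1}{q(z)}$ correction can be illustrated with a toy example: sample one block per iteration and reweight each sampled leaf by its sampling probability, and the expected total stays equal to the full, un-sampled sum. This is only a toy illustration of the estimator, not a full MCCFR implementation; all names and values below are made up:

import random

# Four leaves with (counterfactual) values, partitioned into two blocks;
# each leaf belongs to exactly one block here, so q(z) = q_j for z in Q_j
leaf_values = {"z1": 1.0, "z2": -1.0, "z3": 2.0, "z4": 0.5}
blocks = {"Q1": ["z1", "z2"], "Q2": ["z3", "z4"]}
q = {"Q1": 0.5, "Q2": 0.5}                  # block sampling probabilities, summing to 1

full_sum = sum(leaf_values.values())         # what a full CFR traversal would use

estimates = []
for _ in range(100000):
    j = random.choices(list(blocks), weights=[q[b] for b in blocks], k=1)[0]
    # Sampled estimate: only the leaves of the chosen block, each weighted by 1/q(z)
    estimates.append(sum(leaf_values[z] / q[j] for z in blocks[j]))

print(full_sum, sum(estimates) / len(estimates))   # the two values should be close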

3.3 Conclusion

In addition to the two algorithms introduced above, there are many other optimizations of CFR, including Discounted CFR, warm starting, and Total RBP, which improve computation speed, as well as CFR-D, continual re-solving, and safe and nested subgame solving, which reduce storage requirements.

Machine game playing is an important research direction in artificial intelligence, and incomplete-information games are one of its subfields. Incomplete-information games are characterized by hidden information and information asymmetry, which makes them closer to real life than complete-information games: hidden information and asymmetry appear in practical problems such as bidding, auctions and stock trading, so studying incomplete-information games has real practical significance. Texas Hold 'em involves hidden information, information asymmetry and random events; it is a typical incomplete-information game, and research on it is of great value. Interested readers are encouraged to explore it in depth.

4. References

[1] Brown, N., Kroer, C. and Sandholm, T.: Dynamic Thresholding and Pruning for Regret Minimization, AAAI Conference on Artificial Intelligence (2017).

[2] Lanctot, M., Waugh, K., Zinkevich, M. and Bowling, M.: Monte Carlo Sampling for Regret Minimization in Extensive Games, Advances in Neural Information Processing Systems 22 (Bengio, Y., Schuurmans, D., Lafferty, J. D., Williams, C. K. I. and Culotta, A., eds.), Curran Associates, Inc., pp. 1078 — 1086 (2009).

[3] Gibson, R.: Regret Minimization in Games and the Development of Champion Multiplayer Computer Poker-Playing Agents, Ph.D. Dissertation, University of Alberta (2014).

[4] Gilpin, A. and Sandholm, T.: Lossless Abstraction of Imperfect Information Games, Journal of the ACM, 54(5) (2007).

We are Walker AI, and we are moving forward in "AI + games".
