This article was originally published in: Walker AI

Reinforcement learning is a class of problems in machine learning and artificial intelligence: it studies how to achieve a specific goal through a sequence of decisions. As an algorithmic approach, it lets the computer start out knowing nothing, with no plan in mind, and learn through constant trial and error until it finally discovers the rules and a method for reaching the goal. That is a complete reinforcement learning process. The figure below gives a more intuitive picture.

The Agent is the intelligent agent, i.e. our algorithm, which plays the role of the player in the game. Following some policy, the agent outputs an Action that acts on the Environment, and the Environment returns the resulting state, namely the Observation and Reward in the figure. When the environment returns a reward to the agent, it updates its own state, and the agent receives a new Observation.
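To make the loop concrete, here is a minimal, framework-agnostic sketch of the interaction described above. It is illustrative only and not the ML-Agents API; the IEnvironment interface and the policy delegate are invented for the example.

public interface IEnvironment
{
	// Reset the environment and return the first Observation.
	float[] Reset();
	// Apply an Action and return the new Observation, the Reward, and whether the episode is done.
	(float[] observation, float reward, bool done) Step(int action);
}

public static class RlLoop
{
	public static void RunEpisode(IEnvironment env, System.Func<float[], int> policy)
	{
		var observation = env.Reset();
		var done = false;
		while (!done)
		{
			// The agent chooses an Action from the current Observation.
			int action = policy(observation);
			// The environment executes the Action and returns a new Observation and a Reward.
			var step = env.Step(action);
			observation = step.observation;
			done = step.done;
			// The reward (step.reward) is what the learning algorithm uses to improve the policy.
		}
	}
}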

1. ML-Agents

1.1 Introduction

Unity games today are numerous, the engine is mature, and it is easy to build a training environment with it. Since Unity is cross-platform, a project can be trained on Windows or Linux and then exported to WebGL for publishing on the web. ML-Agents is an open-source Unity plug-in that lets developers train inside the Unity environment without writing Python code or having a deep understanding of algorithms such as PPO and SAC. Once the parameters are configured, developers can easily use reinforcement learning algorithms to train their models.

If you are interested in the algorithms, click here to learn more about PPO and SAC.

To learn more, click here

1.2 Installing Anaconda, TensorFlow, and TensorBoard

The ML-Agents setup described in this article needs to communicate with TensorFlow through Python. During training, Observation, Action, Reward, and Done information is collected on the Unity side of ML-Agents and passed to TensorFlow for training, and the model's decisions are then passed back into Unity. Therefore, before installing ML-Agents, you need to install TensorFlow by following the link below.

TensorBoard makes it easy to visualize the data and analyze whether the model meets expectations.

Click here for installation details

1.3 ML-agents Installation steps

(1) Go to GitHub and download ml-agents (Release 6 is used in this example).

Download it from GitHub here

(2) Unzip the package, put com.unity.ml-agents and com.unity.ml-agents.extensions into Unity's Packages directory (create it if it does not exist), and add these two packages to manifest.json.

(3) After the installation is complete, import the packages into the project, create a new script, and add the following references to verify that the installation succeeded:

using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using Unity.MLAgents.Policies;

public class MyAgent : Agent

{

}


2. ML-Agents training example

2.1 Overview and project

The Environment is usually described as a Markov decision process: the agent follows some policy to produce an Action, interacts with the Environment, and receives a Reward, then adjusts and optimizes its current policy according to that Reward.

The actual project in this example follows the usual match-three rule: three blocks of the same color score points. The extra bonuses for matching four or more blocks are removed here, to simplify the environment design.

Click here to download the example project

For exporting the Unity project, please refer to the official documentation (click here).

The following shares the project practice from four angles: extracting interfaces, choosing the algorithm, designing the environment, and tuning parameters.

2.2 Extracting the AI interfaces from the game framework

Extract the interfaces needed for Observation and Action from the game. They are used to read the current state of the game and to execute in-game actions.

// The board state used as the Observation.
static List<ML_Unit> states = new List<ML_Unit>();

public class ML_Unit
{
	public int color = (int)CodeColor.ColorType.MaxNum;
	public int widthIndex = -1;
	public int heightIndex = -1;
}

// Read the current state of every block (position and color) from the game.
public static List<ML_Unit> GetStates()
{
	states.Clear();
	var xx = GameMgr.Instance.GetGameStates();
	for (int i = 0; i < num_widthMax; i++)
	{
		for (int j = 0; j < num_heightMax; j++)
		{
			ML_Unit tempUnit = new ML_Unit();
			try
			{
				tempUnit.color = (int)xx[i, j].getColorComponent.getColor;
			}
			catch
			{
				Debug.LogError($"GetStates i:{i} j:{j}");
			}
			tempUnit.widthIndex = xx[i, j].X;
			tempUnit.heightIndex = xx[i, j].Y;
			states.Add(tempUnit);
		}
	}
	return states;
}

public enum MoveDir
{
	up,
	right,
	down,
	left,
}

// Check whether moving the block at (widthIndex, heigtIndex) in direction dir is allowed by the board edges.
public static bool CheckMoveValid(int widthIndex, int heigtIndex, int dir)
{
	var valid = true;
	if (widthIndex == 0 && dir == (int)MoveDir.left)
	{
		valid = false;
	}
	if (widthIndex == num_widthMax - 1 && dir == (int)MoveDir.right)
	{
		valid = false;
	}
	if (heigtIndex == 0 && dir == (int)MoveDir.up)
	{
		valid = false;
	}
	if (heigtIndex == num_heightMax - 1 && dir == (int)MoveDir.down)
	{
		valid = false;
	}
	return valid;
}

// The interface that performs the action: given the position and move direction,
// call the game logic to move the block.
public static void SetAction(int widthIndex, int heigtIndex, int dir, bool immediately)
{
	if (CheckMoveValid(widthIndex, heigtIndex, dir))
	{
		GameMgr.Instance.ExcuteAction(widthIndex, heigtIndex, dir, immediately);
	}
}
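For reference, a minimal usage sketch of these interfaces (hypothetical driver code, not part of the project):

// Hypothetical helper: read the board, validate a move for the first block, then execute it.
public static void MoveFirstBlockRight()
{
	var boardStates = GetStates();
	var unit = boardStates[0];
	if (CheckMoveValid(unit.widthIndex, unit.heightIndex, (int)MoveDir.right))
	{
		SetAction(unit.widthIndex, unit.heightIndex, (int)MoveDir.right, false);
	}
}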

2.3 Selection of game AI algorithm

The first topic in any reinforcement learning project is choosing among the many available algorithms; a suitable choice gets twice the result with half the effort. If you are not familiar with the characteristics of the algorithms, you can simply use the PPO and SAC implementations that come with ML-Agents.

In this case, the author started with the PPO algorithm and tried many adjustments, but it still took an average of about 9 moves to produce one correct move, which is a rather poor result.

Later, I analyzed the game environment more carefully. Because this project is a match-three game, the environment is completely different every episode, the result of one step has little influence on the next step, and the Markov-chain structure is weak. Since PPO is an on-policy, policy-based algorithm, it updates the policy very cautiously each time, which makes the results hard to converge (the author tried XX steps and it still did not converge).

By comparison, DQN is an off-policy, value-based algorithm: it can collect a large amount of environment data to build a Q table and gradually find the action with the maximum value for each state.

In a nutshell, PPO learns online: it runs a few hundred steps, goes back to learn which of those steps were good and which were not, updates the policy, runs a few hundred more steps, and so on. This makes learning slow, and it is also hard to find the global optimum.

DQN learns offline: it can run hundreds of millions of steps first, then go back, take out everything it has experienced, and more easily find the global optimum.
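As an illustration of the "Q table" idea, here is a sketch of plain tabular Q-learning. This is not the DQN used later (DQN replaces the table with a neural network), and the class is invented for the example, but the update target is the same.

using System;
using System.Collections.Generic;

public class QTable
{
	private readonly Dictionary<(int state, int action), float> q = new Dictionary<(int, int), float>();
	private const float Alpha = 0.1f;  // learning rate
	private const float Gamma = 0.99f; // discount factor

	public float Get(int state, int action) => q.TryGetValue((state, action), out var v) ? v : 0f;

	// One off-policy update from a stored transition (s, a, r, s').
	// It does not matter which policy produced the transition, which is why old experience can be reused.
	public void Update(int state, int action, float reward, int nextState, int numActions)
	{
		float best = float.MinValue;
		for (int a = 0; a < numActions; a++)
		{
			best = Math.Max(best, Get(nextState, a));
		}
		q[(state, action)] = Get(state, action) + Alpha * (reward + Gamma * best - Get(state, action));
	}
}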

(PPO is used for the demonstration in this example; a later article will cover using external algorithms with ML-Agents, training a DQN with the external tool stable_baselines3.)

2.4 Designing the game AI environment

Once the algorithm framework is decided, how to design the Observation, Action, and Reward becomes the decisive factor for the training result. In this game, the environment has two main variables: the position of each block and its color.

– Observation:

As shown in the figure above, the board in this example is 14 long and 7 wide, with 6 colors.

ML-Agents uses swish as the activation function and can work with small floating-point inputs (-10f to 10f), but to give the agent a cleaner environment and better training results, we still encode the environment.

In this example, the author uses one-hot encoding for the environment, with the origin at the upper-left corner. Going down from there, the green square in the upper-left corner is encoded as:

length: [0,0,0,0,0,0,0,0,0,0,0,0,0,1]
height: [0,0,0,0,0,0,1]
color (using the fixed enumeration yellow, green, purple, pink, blue, red): [0,0,0,0,1,0]

The total observation size is therefore (14 + 7 + 6) × 14 × 7 = 2646.

Code examples:

public class MyAgent : Agent
{
	static List<ML_Unit> states = new List<ML_Unit>();

	public class ML_Unit
	{
		public int color = (int)CodeColor.ColorType.MaxNum;
		public int widthIndex = -1;
		public int heightIndex = -1;
	}

	public static List<ML_Unit> GetStates()
	{
		states.Clear();
		var xx = GameMgr.Instance.GetGameStates();
		for (int i = 0; i < num_widthMax; i++)
		{
			for (int j = 0; j < num_heightMax; j++)
			{
				ML_Unit tempUnit = new ML_Unit();
				try
				{
					tempUnit.color = (int)xx[i, j].getColorComponent.getColor;
				}
				catch
				{
					Debug.LogError($"GetStates i:{i} j:{j}");
				}
				tempUnit.widthIndex = xx[i, j].X;
				tempUnit.heightIndex = xx[i, j].Y;
				states.Add(tempUnit);
			}
		}
		return states;
	}

	List<ML_Unit> curStates = new List<ML_Unit>();

	public override void CollectObservations(VectorSensor sensor)
	{
		// Collect observations only when the blocks have stopped moving and the settlement is done.
		var receiveReward = GameMgr.Instance.CanGetState();
		var codeMoveOver = GameMgr.Instance.IsCodeMoveOver();
		if (!codeMoveOver || !receiveReward)
		{
			return;
		}

		curStates = MlagentsMgr.GetStates();
		for (int i = 0; i < curStates.Count; i++)
		{
			sensor.AddOneHotObservation(curStates[i].widthIndex, MlagentsMgr.num_widthMax);
			sensor.AddOneHotObservation(curStates[i].heightIndex, MlagentsMgr.num_heightMax);
			sensor.AddOneHotObservation(curStates[i].color, (int)CodeColor.ColorType.MaxNum);
		}
	}
}

– Action:

Each square can move up, down, left, or right. The minimum information we need to record is 14 × 7 squares, each of which can move in 4 directions, enumerated in this example as (up, right, down, left).

With the origin at the upper left, the cyan square in the upper-left corner occupies the first four actions of the Action space: (move up, move right, move down, move left).

So the total number of actions is 14 × 7 × 4 = 392.
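As a worked sketch of how a flat action id maps to (square, direction), assuming the layout implied by the DecomposeAction method shown in the code below (EncodeAction is a hypothetical helper, not part of the project):

// Hypothetical inverse of DecomposeAction:
// actionId = widthIndex * (num_heightMax * num_dirMax) + heightIndex * num_dirMax + dir.
public static class ActionIdExample
{
	public static int EncodeAction(int widthIndex, int heightIndex, int dir, int num_heightMax, int num_dirMax)
	{
		return widthIndex * num_heightMax * num_dirMax + heightIndex * num_dirMax + dir;
	}
}
// The square at (0, 0) moving up (dir 0) is id 0 and moving right (dir 1) is id 1;
// across 14 * 7 squares and 4 directions the ids run from 0 to 391.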

Careful readers may notice that the cyan square in the upper-left corner cannot move up or left, so we need to set an action mask to block the actions that are forbidden by the rules.

Code examples:

public class MyAgent : Agent
{
	public enum MoveDir
	{
		up,
		right,
		down,
		left,
	}

	// Decompose a flat action id into the block position (width, height) and the move direction.
	public void DecomposeAction(int actionId, out int width, out int height, out int dir)
	{
		width = actionId / (num_heightMax * num_dirMax);
		height = actionId % (num_heightMax * num_dirMax) / num_dirMax;
		dir = actionId % (num_heightMax * num_dirMax) % num_dirMax;
	}

	public override void OnActionReceived(float[] vectorAction)
	{
		// Before executing a move, make sure the blocks have stopped moving and the settlement is done.
		var receiveReward = GameMgr.Instance.CanGetState();
		var codeMoveOver = GameMgr.Instance.IsCodeMoveOver();
		if (!codeMoveOver || !receiveReward)
		{
			Debug.LogError($"OnActionReceived CanGetState = {GameMgr.Instance.CanGetState()}");
			return;
		}

		if (invalidNums.Contains((int)vectorAction[0]))
		{
			// With the action mask enabled, training should not reach this branch; an invalid move is penalized.
			GameMgr.Instance.OnGirdChangeOver?.Invoke(true, -5, false, false);
		}

		DecomposeAction((int)vectorAction[0], out int widthIndex, out int heightIndex, out int dirIndex);
		// Pass the action back to the game and move the corresponding block in the corresponding direction.
		MlagentsMgr.SetAction(widthIndex, heightIndex, dirIndex, false);
	}

	public void RewardShape(int score)
	{
		// Scale the game score into a reward.
		var reward = (float)score * rewardScaler;
		AddReward(reward);
		// Add the statistics to TensorBoard for analysis.
		Mlstatistics.AddCumulativeReward(StatisticsType.Action, reward);
		// Apply a small per-step penalty.
		var punish = -1f / MaxStep * punishScaler;
		AddReward(punish);
		// Add the statistics to TensorBoard for analysis.
		Mlstatistics.AddCumulativeReward(StatisticsType.Punishment, punish);
	}

	// Set the action mask to block moves that are forbidden by the rules.
	public override void CollectDiscreteActionMasks(DiscreteActionMasker actionMasker)
	{
		// Mask the necessary actions if selected by the user.
		checkinfo.Clear();
		invalidNums.Clear();
		int invalidNumber = -1;
		for (int i = 0; i < MlagentsMgr.num_widthMax; i++)
		{
			for (int j = 0; j < MlagentsMgr.num_heightMax; j++)
			{
				if (i == 0)
				{
					invalidNumber = i * (num_widthMax + num_heightMax) + j * num_heightMax + (int)MoveDir.left;
					actionMasker.SetMask(0, new[] { invalidNumber });
				}
				if (i == num_widthMax - 1)
				{
					invalidNumber = i * (num_widthMax + num_heightMax) + j * num_heightMax + (int)MoveDir.right;
					actionMasker.SetMask(0, new[] { invalidNumber });
				}
				if (j == 0)
				{
					invalidNumber = i * (num_widthMax + num_heightMax) + j * num_heightMax + (int)MoveDir.up;
					actionMasker.SetMask(0, new[] { invalidNumber });
				}
				if (j == num_heightMax - 1)
				{
					invalidNumber = i * (num_widthMax + num_heightMax) + j * num_heightMax + (int)MoveDir.down;
					actionMasker.SetMask(0, new[] { invalidNumber });
				}
			}
		}
	}
}

The elimination flow in the original project uses a lot of coroutines, which introduces considerable delay. For training we need to squeeze that delay out.

To avoid affecting the main game logic, the general approach is to change the yield return new WaitForSeconds(fillTime) in the coroutines to 0.001f. The game logic barely changes, and the Reward is obtained as quickly as possible after the Action is chosen.

public class MyAgent : Agent
{
	private void FixedUpdate()
	{
		var codeMoveOver = GameMgr.Instance.IsCodeMoveOver();
		var receiveReward = GameMgr.Instance.CanGetState();
		if (!codeMoveOver || !receiveReward /*|| !MlagentsMgr.b_isTrain*/)
		{
			return;
		}
		// Because a coroutine has to finish and produce the Reward before the next decision,
		// the DecisionRequester component of ML-Agents cannot be used; request the decision manually here.
		RequestDecision();
	}
}
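As a sketch of the coroutine tweak described above (BoardFiller, FillBoard, HasEmptyCells, and DropOneRow are hypothetical names standing in for the project's actual fill logic):

using System.Collections;
using UnityEngine;

public class BoardFiller : MonoBehaviour
{
	// Hypothetical fill delay; for training it is shrunk to 0.001f so the Reward arrives almost immediately.
	private float fillTime = 0.001f;

	private IEnumerator FillBoard()
	{
		while (HasEmptyCells())
		{
			DropOneRow();
			// The shortened wait keeps the coroutine structure intact but removes almost all of the delay.
			yield return new WaitForSeconds(fillTime);
		}
	}

	private bool HasEmptyCells() { return false; } // placeholder
	private void DropOneRow() { }                  // placeholder
}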

2.5 Parameter Adjustment

After designing the model, we ran a preliminary version to see how far the results were from our design expectations.

First configure the YAML file to initialize the parameters of the network:

behaviors:
  SanXiaoAgent:
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 2048
      learning_rate: 0.0005
      beta: 0.005
      epsilon: 0.2
      lambd: 0.9
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 512
      num_layers: 2
      vis_encode_type: simple
      memory: null
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    init_path: null
    keep_checkpoints: 25
    checkpoint_interval: 100000
    max_steps: 1000000
    time_horizon: 128
    summary_freq: 1000
    threaded: true
    self_play: null
    behavioral_cloning: null
    framework: tensorflow

For the training command, refer to the official documentation; Release 6 is used in this example.

mlagents-learn config/ppo/sanxiao.yaml --env=G:\mylab\ml-agent-buildprojects\sanxiao\windows\display\121001display\fangkuaixiaoxiaole --run-id=121001xxl --train --width 800 --height 600 --num-envs 2 --force --initialize-from=121001

When training is complete, open the Anaconda prompt, run tensorboard --logdir=results --port=6006 from the ml-agents project root, then open http://PS20190711FUOV:6006/ in a browser to see the training results.

(mlagents) PS G:\mylab\ml-release_6> tensorboard --logdir=results --port=6006
TensorBoard 1.14.0 at http://PS20190711FUOV:6006/ (Press CTRL+C to quit)

The training effect is as follows:

Move count is the average number of moves needed to eliminate one block; it takes about 9 moves to make one correct move. With the action mask, a block can be eliminated in about 6 moves.

– Reward:

Check the mean Reward against the rewards in the table above. I prefer to keep it in a range between 0.5 and 2; if it is too large or too small, adjust rewardScaler.

public void RewardShape(int score)
{
	// Scale the game score into a reward; rewardScaler controls the overall magnitude.
	var reward = (float)score * rewardScaler;
	AddReward(reward);
	// Add the statistics to TensorBoard for analysis.
	Mlstatistics.AddCumulativeReward(StatisticsType.Action, reward);
	// Apply a small per-step penalty.
	var punish = -1f / MaxStep * punishScaler;
	AddReward(punish);
	// Add the statistics to TensorBoard for analysis.
	Mlstatistics.AddCumulativeReward(StatisticsType.Punishment, punish);
}

3. Summary and miscellaneous

The current official recommendation for ML-Agents is to use imitation learning, bringing expert data into the training of the network.

The author tried PPO in this case and it had some effect. However, PPO is difficult to train on this kind of match-three game: it is hard to converge and hard to find the global optimum.

The design of the environment and the Reward needs rigorous testing; otherwise the results can contain large errors that are hard to track down.

Reinforcement learning algorithms iterate quickly. If there are mistakes above, corrections are welcome, and let's make progress together.

Due to limited space, I cannot include all the code of the whole project here. If you are interested in studying it, leave a message below and I can send you the whole project by email.

A follow-up article will cover using external algorithms with ML-Agents, training a DQN with the external tool stable_baselines3.


PS: For more practical techniques, follow our WeChat official account [xingzhe_ai] and discuss with us!