Author: Jacob Gursky | Translator: Vitamin K | Source: Towards Data Science

Introduction

What if I told you that you don’t need to compute gradients to train a neural network, only forward passes? That’s the magic of neuroevolution! In this post, I want to show that all of this is easy to do using NumPy alone! You learn a lot about gradient-based methods when you study statistics, but I read a very interesting article the other day from Uber AI showing that simple genetic algorithms are competitive with the most sophisticated gradient-based RL methods on Atari games. I have linked the sources below, and I strongly encourage you to read them if you are interested in reinforcement learning.

What is neuroevolution

First, for those who don’t know, neuroevolution describes the application of evolutionary and genetic algorithms to training the structure and/or weights of neural networks as a gradient-free alternative! We will use a very simple case of neuroevolution here: a fixed-topology network where we optimize only the weights and biases. The neuroevolutionary process can be defined as four basic steps that repeat until convergence, starting from a pool of randomly generated networks.

  1. Assess the fitness of each network in the population
  2. Select the fittest individuals for replication
  3. Repopulate with copies of the fittest networks
  4. Introduce normally distributed mutations into the network weights
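In rough pseudocode, one generation of this loop might look like the sketch below (evaluate_fitness and copy_and_mutate are hypothetical placeholders for illustration, not functions we define later in this article):

def evolve(population, generations):
    for _ in range(generations):
        # 1. Assess the fitness of every network in the population
        fitness = [evaluate_fitness(net) for net in population]
        # 2. Select the fittest individual for replication
        best = population[fitness.index(max(fitness))]
        # 3-4. Repopulate with mutated copies of the best network
        population = [best] + [copy_and_mutate(best) for _ in range(len(population) - 1)]
    return best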

Wow, that looks easy! Let’s break down some terms:

  • Fitness: This simply describes how well the network performs on a particular task and lets us determine which networks to breed. Note that because evolutionary algorithms are a form of non-convex optimization, they can be used with any loss function, regardless of its differentiability (or lack thereof).

  • Mutation: This is probably the easiest! To improve our child networks, we must introduce random changes to the network weights, usually drawn from uniform or normal distributions. There are many different flavors of mutation: shift mutation (multiplying a parameter by a random number), swap mutation (replacing a parameter with a random number), sign mutation (flipping the sign of a parameter), and so on. We’ll only use simple additive mutations here, but there’s a lot of room for creativity, as the sketches below suggest!
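To make those flavors concrete, here are a few illustrative one-liners applied to a NumPy weight matrix (my own examples, not code from the project):

import numpy as np

w = np.random.normal(size=(4, 16))  # an example weight matrix

# Additive mutation (the kind used in this article): add small Gaussian noise
additive = w + np.random.normal(0, 0.02, size=w.shape)
# Shift mutation: multiply parameters by a random number near 1
shift = w * np.random.normal(1, 0.02, size=w.shape)
# Swap mutation: replace a small fraction of parameters with fresh random values
swap = np.where(np.random.rand(*w.shape) < 0.05, np.random.normal(size=w.shape), w)
# Sign mutation: flip the sign of a small fraction of parameters
sign = w * np.where(np.random.rand(*w.shape) < 0.05, -1, 1)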

The advantages of neuroevolution

We should also consider the theoretical advantages of neuroevolutionary models. First, we only need the forward pass of each network, because we only need to compute losses to decide which networks to replicate. This matters because the backward pass is usually the most expensive part of training! Second, given enough iterations, evolutionary algorithms are guaranteed to find the global minimum of the loss surface, whereas gradient-based methods can get stuck in local minima. Finally, more complex forms of neuroevolution let us optimize not only the weights of a network, but also its structure!

So why not just use neuroevolution all the time?

This is a complicated question, but it boils down to the fact that exact gradient descent is more effective when there is enough gradient information. This means that the more convex the loss surface is, the more you want to use an analytic method like SGD rather than a genetic algorithm. Genetic algorithms are therefore very rarely used in supervised settings, where there is usually enough gradient information for traditional gradient descent methods to work well. However, if you are working in an RL setting, or with irregular or highly non-convex loss surfaces (as with GANs), then neuroevolution offers a viable alternative! In fact, many recent studies have found that neuroevolutionary models are competitive in these settings.

Implementation

Loading the libraries

As mentioned in the introduction, we will try to use only NumPy in this project (plus gym for the environment), defining the helper functions we need as we go.

import numpy as np
import gym

About the data

We will test our network on the classic CartPole environment from gym. Our goal is to see how long the network can keep the pole upright by moving the cart from side to side. As an RL task, neuroevolution should be a good fit! Our network will receive four observations as input and will output left or right as its action.
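If you want to check those spaces yourself, a quick snippet like this works in classic versions of gym:

import gym

env = gym.make('CartPole-v0')
print(env.observation_space)  # Box(4,): cart position, cart velocity, pole angle, pole velocity at tip
print(env.action_space)       # Discrete(2): 0 pushes the cart left, 1 pushes it right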

Helper functions

First, we will define a few helper functions to set up our networks. The first is the ReLU activation function, which we will use for the hidden layers, and the second is the softmax function, applied to the network’s output to get a probability estimate for each action! Finally, we need a function that generates one-hot encodings of response vectors for whenever we need to compute categorical cross-entropy.

def relu(x):
    # ReLU activation for the hidden layers
    return np.where(x > 0, x, 0)

def softmax(x):
    # Numerically stable softmax for the output layer
    x = np.exp(x - np.max(x))
    x[x == 0] = 1e-15
    return np.array(x / x.sum())
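A quick sanity check of these helpers (the values here are just illustrative):

print(relu(np.array([-1.0, 2.0])))    # [0. 2.]
print(softmax(np.array([1.0, 1.0])))  # [0.5 0.5]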

Define our network

First, we’ll define a class to represent each network in the population. We need an initialization method that randomly assigns weights and biases and takes the network architecture as input, an act method that gives us the network’s action for a given input, and finally an evaluate method that returns the network’s average reward over a number of episodes in the environment! Again, we only use the functions we defined above and NumPy. Note that the initialization method can also take another network as input; this is how we will perform mutations between generations!

# Let's define a new neural network class that can interact with gym
class NeuralNet():

    def __init__(self, n_units=None, copy_network=None, var=0.02, episodes=50, max_episode_length=200):
        # If this isn't a copy, initialize a random network
        if copy_network is None:
            # Saving attributes
            self.n_units = n_units
            # Initializing empty lists to hold the weight matrices and biases
            weights = []
            biases = []
            # Populating the lists
            for i in range(len(n_units)-1):
                weights.append(np.random.normal(loc=0, scale=1, size=(n_units[i], n_units[i+1])))
                biases.append(np.zeros(n_units[i+1]))
            # Creating the parameter dictionary
            self.params = {'weights': weights, 'biases': biases}
        else:
            # Copying the parent network's parameters
            self.n_units = copy_network.n_units
            self.params = {'weights': [np.copy(w) for w in copy_network.params['weights']],
                           'biases': [np.copy(b) for b in copy_network.params['biases']]}
            # Applying additive mutations to the weights and biases
            self.params['weights'] = [x + np.random.normal(loc=0, scale=var, size=x.shape) for x in self.params['weights']]
            self.params['biases'] = [x + np.random.normal(loc=0, scale=var, size=x.shape) for x in self.params['biases']]

    def act(self, X):
        # Grabbing the weights and biases
        weights = self.params['weights']
        biases = self.params['biases']
        # Propagating the input through the first layer
        a = relu((X @ weights[0]) + biases[0])
        # Propagating through every other layer
        for i in range(1, len(weights)):
            a = relu((a @ weights[i]) + biases[i])
        # Turning the output into action probabilities and picking the best action
        probs = softmax(a)
        return np.argmax(probs)

    def evaluate(self, episodes, max_episode_length, render_env, record):
        # Creating an empty list for the episode rewards
        rewards = []
        # Setting up our gym environment
        env = gym.make('CartPole-v0')
        # Recording the run if necessary
        if record is True:
            env = gym.wrappers.Monitor(env, "recording")
        # Lifting the environment's built-in step limit
        env._max_episode_steps = 1e20
        for i_episode in range(episodes):
            observation = env.reset()
            for t in range(max_episode_length):
                if render_env is True:
                    env.render()
                observation, _, done, _ = env.step(self.act(np.array(observation)))
                if done:
                    break
            # Using the episode length as the reward
            rewards.append(t)
        # Closing our environment
        env.close()
        # Returning the mean reward
        if len(rewards) == 0:
            return 0
        else:
            return np.array(rewards).mean()
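Here is a small illustrative example of the class in action (my own example, not part of the original script): creating a parent network, spawning a mutated child, and acting on a made-up observation.

parent = NeuralNet(n_units=(4,16,2))
child = NeuralNet(copy_network=parent, var=0.02)
fake_observation = np.random.normal(size=4)
print(parent.act(fake_observation))  # prints 0 or 1 (left or right)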

Define our genetic algorithm class

Finally, we need to define a class to manage our population and perform the four key steps of neuroevolution! We need three main pieces. First, an initialization method that creates a pool of random networks and sets the attributes. Next, we need a fit method that repeats the steps listed above: first evaluate each network, then select the fittest, create child networks, and finally mutate the children! Finally, after training we keep a reference to the best network so we can use it. Let’s test it out!

# Defining our class that handles a population of networks
class GeneticNetworks():

    def __init__(self, architecture=(4,16,2), population_size=50, generations=500,
                 render_env=True, record=False, mutation_variance=0.02, verbose=False,
                 print_every=1, episodes=10, max_episode_length=200):
        # Creating our initial pool of random networks
        self.networks = [NeuralNet(architecture) for _ in range(population_size)]
        self.population_size = population_size
        self.generations = generations
        self.mutation_variance = mutation_variance
        self.verbose = verbose
        self.print_every = print_every
        self.fitness = []
        self.episodes = episodes
        self.max_episode_length = max_episode_length
        self.render_env = render_env
        self.record = record

    # Defining our fitting method
    def fit(self):
        for i in range(self.generations):
            # Evaluating every network in the population
            rewards = np.array([x.evaluate(self.episodes, self.max_episode_length, self.render_env, self.record)
                                for x in self.networks])
            # Tracking the best score in each generation
            self.fitness.append(np.max(rewards))
            # Selecting the best network
            best_network = np.argmax(rewards)
            # Creating our child networks as mutated copies of the best one
            new_networks = [NeuralNet(copy_network=self.networks[best_network],
                                      var=self.mutation_variance,
                                      max_episode_length=self.max_episode_length)
                            for _ in range(self.population_size - 1)]
            # Setting our new population
            self.networks = [self.networks[best_network]] + new_networks
            # Printing progress if necessary
            if self.verbose is True and i % self.print_every == 0:
                print('Generation:', i+1, '| Highest Reward:', rewards.max().round(1),
                      '| Average Reward:', rewards.mean().round(1))
        # Keeping the best network (it sits at index 0 after repopulation)
        self.best_network = self.networks[0]

Testing the algorithm

As mentioned above, we will test our networks on the CartPole problem, using a single hidden layer of 16 nodes and two output nodes representing moving left or right. We also average rewards over many episodes so that we don’t accidentally select a bad network for the next generation! I chose many of these parameters after some trial and error, so your mileage may vary! In addition, we will introduce mutations with a variance of only 0.1 so as not to break the networks’ existing behavior.

from time import time

start_time = time()
genetic_pop = GeneticNetworks(architecture=(4,16,2),
                              population_size=64,
                              generations=5,
                              episodes=15,
                              mutation_variance=0.1,
                              max_episode_length=10000,
                              render_env=False,
                              verbose=True)
genetic_pop.fit()
print('Finished in', round(time()-start_time, 3), 'seconds')

Generation: 1 | Highest Reward: 309.5 | Average Reward: 29.2
Generation: 2 | Highest Reward: 360.9 | Average Reward: 133.6
Generation: 3 | Highest Reward: 648.2 | Average Reward: 148.0
Generation: 4 | Highest Reward: 616.6 | Average Reward: 149.9
Generation: 5 | Highest Reward: 2060.1 | Average Reward: 368.3
Finished in 35.569 seconds

The original random network

First, let’s look at how a randomly initialized network performs this task. There is clearly no strategy, and the pole falls almost immediately. (Ignore the cursor in the GIF below.)

random_network = NeuralNet(n_units=(4,16,2))
random_network.evaluate(episodes=1, max_episode_length=int(1e10), render_env=True, record=False)

5 generations later…

After just 5 generations, we can see that our network has almost completely mastered CartPole! And it only took about 30 seconds of training! With further training, the network learns to keep the pole completely upright almost every time, but for now we’re only interested in speed, and 5 generations is pretty short! We should see this as a good example of the power of neuroevolution.

genetic_pop.best_network.evaluate(episodes=3, max_episode_length=int(1e10), render_env=True, record=False)

Closing thoughts

Clearly, there are many things we could add in the future to further test the validity of neuroevolution. For one, it would be interesting to study the effects of different genetic operators, such as crossover.

Moving to a modern deep learning framework like TensorFlow or PyTorch would also be a smart idea. Note that genetic algorithms are highly parallel, because all we have to do is run each network’s forward passes, one network per processing unit. No need to mirror weights or devise complicated distribution strategies! With every additional processing unit, our running time therefore drops almost linearly.
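As a rough sketch of what that parallelism could look like with just the standard library (this is my own illustration, not code from the original project, and a real distributed setup would be more involved):

from multiprocessing import Pool

def evaluate_network(net):
    # Each worker process creates and runs its own gym environment
    return net.evaluate(episodes=10, max_episode_length=200, render_env=False, record=False)

if __name__ == '__main__':
    population = [NeuralNet(n_units=(4,16,2)) for _ in range(8)]
    # Evaluating the whole population in parallel, one network per worker
    with Pool(processes=4) as pool:
        rewards = pool.map(evaluate_network, population)
    print(rewards)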

Finally, we should explore neuroevolution on different reinforcement learning tasks, and even in other situations where gradients are difficult to evaluate, such as generative adversarial networks or long-sequence LSTM networks.

Further reading

If you’re interested in neuroevolution and its applications, Uber has a wonderful page with several papers showing the modern advantages of neuroevolution in reinforcement learning:

Eng.uber.com/tag/deep-ne…

The source code for this project can be found in this GitHub repository:

Github.com/gursky1/Num…

The original link: towardsdatascience.com/gradient-fr…
